For some people debugging comes naturally, and they will find this entire article obvious. For many others, I hope this guide helps them become better at debugging systems. Expert debuggers (expertise that comes from experience) often find it hard to put into words how they go about debugging, so here is an attempt at it.
While this should give you tools for approaching problem solving and the debugging of complex systems, it is not a replacement for experience. The only way to get better at debugging is to debug. Jeri Ellsworth is known to say that she hopes your first project fails, and that you should fail often. She does not mean that you should fail and give up, but rather that things should not always work for you, which will give you the opportunity to troubleshoot and improve your debugging skills. Often while debugging a system you will really learn how things work, and that will help you grow.
I started to write this post with many ideas on how to debug a system written down on paper. I then found a book called Debugging: The 9 Indispensable Rules for Finding Even the Most Elusive Software and Hardware Problems by David J. Agans, which lays out his 9 rules of debugging systems. I am happy to say that I had all nine of his rules on my paper (plus more) in some form; however, where it took me a full sentence to capture some of the rules, he captures each one in a few words. I have ordered a copy of his book and should be getting it soon.
While there are many different types of systems that might need to be debugged, following some basic rules will help you successfully debug any problem. These rules are simple in theory, but the practice is often much more complex. Mike from Mike’s Electric Stuff recently sent a tweet about how long it takes to debug a problem: “Debug time is proportional to t^n, where n = the number of bugs that affect the situation & t is the difficulty finding the most obscure one”. Just remember not to rush into a “fix” that might end up being wrong. Managers, customers, etc. will often complain about how long it took to debug a problem, but debugging takes time and can be difficult. If it were simple, they would have fixed it themselves (you can direct them to this post).
Since David Agans was able to make the beautiful poster above (see the link at the bottom of the post to get the poster), and come up with concise wording for these rules, I will use his wording and then expand upon them based on my experience.
1. Understand the system
How can you fix a system if you do not understand it? While there are simple things to check (is it plugged in?), many problems require some amount of specific knowledge about the system. For example, if you are troubleshooting a PCB, knowing how the input waveform looks, how the output waveform looks, and how both are supposed to look is key to isolating, diagnosing, and fixing the problem. Make sure you understand what is happening and avoid making hasty decisions (that could make the problem even worse). If you are not familiar with a system that you work with, make sure to get familiar with it BEFORE a problem occurs. If something is operating differently than the datasheet/manual says it should, figure out why, and also check the errata sheets to see if the datasheet/manual was changed at some point.
2. Make it fail
I am not sure if I am interpreting this one the same as David Agans, but to me this means that you want to make sure that you can make the system fail and that you know what triggers the failure. Having the ability to repeatably make a system fail will not only help you find the problem, but also make sure that the problem is corrected.
When you are making it fail you might try swapping in a new or known-working component. This can be dangerous: if the component was not the cause, you might just destroy the new one. If this happens there are two takeaways:
– Do not try a third component assuming that both of the other components were broken.
– Make sure to check what interacts with that component to see if the problem is external to the component.
There are transient or “random” events that are very difficult to reproduce accurately in order to make it fail. However, with a little imagination and work it is often possible to cause these “random” events to occur. You might also discover that these “random” events are actually not that random. There are many stories about devices randomly exhibiting strange behavior, only for someone to discover that it was caused by the same source on every occasion, such as a 3G phone signal (while there was no effect from voice, EDGE data, or 4G data).
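One way to chase a “random” failure in software is to hammer the suspect operation with seeded inputs and record exactly which inputs fail. The following is only a minimal sketch under that assumption; `reproduce` and `flaky_parse` are hypothetical names made up for illustration, and the failure condition is invented.

```python
import random

def reproduce(operation, make_input, attempts=1000, seed=0):
    """Run `operation` on seeded pseudo-random inputs and collect every failure."""
    rng = random.Random(seed)  # fixed seed, so a failing run can be replayed exactly
    failures = []
    for attempt in range(attempts):
        value = make_input(rng)
        try:
            operation(value)
        except Exception as exc:
            # Record when it failed, what the input was, and what went wrong.
            failures.append((attempt, value, repr(exc)))
    return failures

def flaky_parse(value):
    """Hypothetical stand-in for the system under test; fails on a hidden condition."""
    if value % 97 == 0:
        raise ValueError("bad frame")
    return value

failures = reproduce(flaky_parse, lambda rng: rng.randrange(10_000))
for attempt, value, error in failures:
    print(attempt, value, error)
```

Looking at the recorded inputs side by side is often what reveals that the failure is not random at all. Because the seed is fixed, the same run can be repeated after a fix to check that those inputs no longer fail.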
3. Quit thinking and look
People often have a tendency to start thinking/talking about what could have caused the failure before having the appropriate information. It is important to know when to stop talking and when to start looking. You might spend a while talking about “what if the CPU was spiking, which caused xyz to happen”; however, by running a simple test and monitoring the system you might determine that a completely different problem was the culprit, such as a memory leak. Try to keep your imagination from going wild at first. After some time you might need to use your imagination to determine the problem, but make sure your ideas are based on observation and fact. As Sherlock Holmes (I know he is a fictional character) used to say, “when you have eliminated the impossible, whatever remains, however improbable, must be the truth.”
4. Divide and conquer
When troubleshooting a problem one of the first steps should be isolating the sub-system where the problem exists. For example if you have a signal going into an analog-to-digital converter section of a PCB but no signal coming out, then that is where you should focus on finding the problem. If you start looking at some LCD display you will probably not find the problem. It is easy to waste time following wild goose chases that are completely unrelated to the problem.
In software it is often useful to add print statements all over your code to help narrow down where the problem is. I regularly add lines like print(got here n,
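The same narrowing-down idea can be made systematic. Below is a rough sketch, assuming the system can be modelled as an ordered list of stages and that once a signal goes bad it stays bad downstream. The stage functions and `first_bad_stage` are hypothetical, written only to illustrate the idea.

```python
from typing import Any, Callable, Optional, Sequence

def first_bad_stage(stages: Sequence[Callable[[Any], Any]],
                    is_good: Callable[[Any], bool],
                    data: Any) -> Optional[int]:
    """Return the index of the first stage whose output fails `is_good`, or None."""
    # Capture the output of every stage (the "print statements").
    outputs = []
    for stage in stages:
        data = stage(data)
        outputs.append(data)
    # Binary search: everything before `lo` is known good,
    # everything from `hi` onward is known bad (or past the end).
    lo, hi = 0, len(outputs)
    while lo < hi:
        mid = (lo + hi) // 2
        if is_good(outputs[mid]):
            lo = mid + 1
        else:
            hi = mid
    return lo if lo < len(outputs) else None

# Hypothetical signal chain: amplifier -> offset -> broken ADC -> display.
stages = [
    lambda v: v * 2,   # amplifier
    lambda v: v + 1,   # offset
    lambda v: None,    # ADC section that loses the signal
    lambda v: v,       # display just passes along what it receives
]
bad_index = first_bad_stage(stages, lambda v: v is not None, 3)
print(bad_index)  # 2
```

With only four stages a linear scan is just as good; halving the search space pays off when there are dozens of stages or checkpoints to inspect.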
5. Change one thing at a time
If you change several things at a time and then test whether the problem is fixed, you cannot be sure which change fixed it. This is especially common in software, where it is easy to change many things at once.
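In software this rule can be enforced mechanically. Here is a small sketch, assuming the system is driven by a settings dictionary and that there is some repeatable check that tells you whether it works. All names and values are hypothetical.

```python
def try_changes_one_at_a_time(baseline, candidate_changes, works):
    """Apply each candidate change on its own to a copy of the baseline and test it."""
    results = {}
    for key, new_value in candidate_changes.items():
        trial = dict(baseline)   # fresh copy, so changes never accumulate
        trial[key] = new_value   # exactly one thing differs from the baseline
        results[key] = works(trial)
    return results

# Hypothetical serial-link settings that currently do not work.
baseline = {"baud": 9600, "parity": "none", "timeout_s": 1}
candidates = {"baud": 115200, "parity": "even", "timeout_s": 5}

def link_works(settings):
    # Stand-in for a real test; here only the baud rate matters.
    return settings["baud"] == 115200

results = try_changes_one_at_a_time(baseline, candidates, link_works)
print(results)  # {'baud': True, 'parity': False, 'timeout_s': False}
```

Each trial starts from the same baseline, so a passing result points at exactly one change.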
6. Keep an audit trail
This one is important: you should record what tests you have tried and what the results were. When debugging any moderately complex system it is easy to forget exactly what you have tried and what you have not. If you are debugging something over an extended time this becomes even more important. I know you might be telling yourself that you will remember what you have tried. You won’t!
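A paper notebook works fine for this, but if you are already at a keyboard, a tiny helper can keep the trail for you. This is just one possible sketch, assuming a plain text file with one JSON entry per line. The file name, function names, and example entries are all made up.

```python
import json
import tempfile
import time
from pathlib import Path

def record(log_path, test, result, notes=""):
    """Append one timestamped entry describing a test and what happened."""
    entry = {
        "time": time.strftime("%Y-%m-%d %H:%M:%S"),
        "test": test,
        "result": result,
        "notes": notes,
    }
    with Path(log_path).open("a", encoding="utf-8") as log_file:
        log_file.write(json.dumps(entry) + "\n")
    return entry

def history(log_path):
    """Read back every entry in the order it was recorded."""
    path = Path(log_path)
    if not path.exists():
        return []
    with path.open(encoding="utf-8") as log_file:
        return [json.loads(line) for line in log_file if line.strip()]

# Demo in a temporary directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    log = Path(tmp) / "debug_log.jsonl"
    record(log, "swap motor driver", "still fails")
    record(log, "measure 5V rail", "reads 4.1V under load", notes="suspect regulator")
    entries = history(log)

for entry in entries:
    print(entry["time"], entry["test"], "->", entry["result"])
```

Appending, rather than rewriting, means an interrupted session never loses earlier entries.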
7. Check the plug
Check the obvious things. Something can be so simple that you forget to check it or assume that it is correct. Don’t do that. Often the simplest things are the most overlooked. Make sure to put aside your assumptions and truly test everything. If your robot is not starting: Are the batteries charged? Are they connected? Is the switch on? Are the fuses blown? Did a wire come loose? It is easy to jump to the conclusion that there is a serious hardware problem when it is more likely that there is a simple solution.
8. Get a fresh view
When you stare at something for a long time or become overly familiar with a system it is easy to miss an obvious problem. Having somebody take a second look at the problem can be very useful. Especially when you are tired or getting frustrated the second set of eyes is very valuable. Often if you walk away from the problem and come back in a few hours the problem will become clearer. If you are working on a “common” component make sure to check the internet to see if others have experienced a similar problem that can help you solve your problem.
9. If you didn’t fix it, it ain’t fixed
This one burns people all the time. Often if a problem disappears on its own people will consider it fixed, while in reality it still exists and can reappear at the worst time. If you want to make sure the problem does not reappear make sure that it is actually fixed. Rebooting a computer does not count as fixing the problem (no matter what every IT help desk will tell you)!
10. Do no harm
I am adding a 10th rule to this list. Avoid making things worse or damaging things further when you are debugging something. This also applies to yourself and others: make sure nobody gets hurt. If you are not a surgeon, do not try to debug somebody else’s heart by doing surgery. That is a VERY BAD idea (and will probably land you in jail).
All of the rules above apply when debugging just about all (if not all) systems. There are a few more rules that are specific to robots:
Check the Voltages
One of the most valuable pieces of information when testing electronics and robots is checking that all of the voltages are correct. This includes batteries, bus voltages, and voltages at the component. David L. Jones from the EEVBlog always says that the first rule of debugging is “thou shall test thy voltages.”
Check for lens caps
If your camera (or other perception system) is not working and it has a lens cap, make sure the cap has been removed. I cannot begin to tell you how many times I have seen people forget to remove lens caps.
Can you ping the device you are trying to connect to? This is useful for Ethernet-based devices. You should be aware that some devices have ping responses disabled.
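For devices that do not answer pings, one alternative is to try opening a TCP connection to a port the device is expected to listen on. Note that this is a different check than a ping (TCP rather than ICMP), and the sketch below assumes you know such a port. `can_connect` is a hypothetical helper name.

```python
import socket

def can_connect(host, port, timeout_s=1.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

For example, `can_connect("192.168.1.50", 80)` would check a web interface on a made-up address. A False result only tells you that this port did not accept a connection from where you are; the device could still be powered and on the network.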
If you are hiring someone, and you really want to know what they worked on, Elon Musk has some good advice for you. Ask about the problems they have encountered and how they debugged and fixed them. Only the person who actually did the work will be able to give a multi-level detailed response.
I just received David Agans’s Debugging book and am very impressed with it; I think all roboticists (and other people) should purchase a copy. Perhaps it is because my name is also David, but we seem to think very similarly. Everything I said above, plus much more, is in his book. He also has an affection for Sherlock Holmes: every chapter starts with a Sherlock Holmes quote and a war story about the rule he is introducing. There is a chapter going into each rule in detail, with many war stories scattered throughout the chapters. Towards the end of the book are some scenarios that you can use to test your knowledge of the debugging rules.
There were a few significant points that he mentions that I did not mention (and should have). Here are some of them:
– Read the manual cover-to-cover!
– Start at step one. Before checking if a screen works make sure to start from the beginning and check the power, the switch, etc.
– Don’t blindly trust your tools. There are many cases when a debugging tool can be the culprit and have an error, or you are misinterpreting the results from the debugging tool.
– Ask an expert (or the vendor) and don’t be proud. Sometimes an expert can provide a solution or help direct your debugging. If you are on a tight deadline this can be very valuable; however, try to spend some time debugging before you ask, in order to learn more and improve your debugging skills. Don’t let your pride/ego get in the way of solving the problem. Remember that even an expert can make a mistake.
– If you have a working version, try to use that as a baseline to identify what is different (and what has changed) in the device that has the problem.
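The last point, comparing against a working baseline, lends itself to a quick script when the two devices can report their settings. A minimal sketch, assuming each device's configuration can be dumped into a dictionary; the keys and values here are invented for illustration.

```python
def diff_against_baseline(working, failing):
    """Return {key: (working_value, failing_value)} for every setting that differs."""
    differences = {}
    for key in sorted(set(working) | set(failing)):
        good = working.get(key, "<missing>")
        bad = failing.get(key, "<missing>")
        if good != bad:
            differences[key] = (good, bad)
    return differences

# Hypothetical configuration dumps from a good unit and a misbehaving one.
working_unit = {"firmware": "1.4.2", "baud": 115200, "sensor_gain": 8}
failing_unit = {"firmware": "1.5.0", "baud": 115200, "watchdog": "off"}

for key, (good, bad) in diff_against_baseline(working_unit, failing_unit).items():
    print(f"{key}: working={good} failing={bad}")
```

Each difference it reports is a candidate to investigate, ideally one at a time.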
You can get the poster from the main image by clicking here.