Debugging
Debugging: The 9 Indispensable Rules for Finding Even the Most Elusive Software and Hardware Problems | |
author | David J. Agans |
pages | 192 |
publisher | Amacom |
rating | 9 |
reviewer | David A. Wheeler |
ISBN | 0814471684 |
summary | A classic book on debugging principles |
Debugging explains the fundamentals of finding and fixing bugs (once a bug has been detected), rather than any particular technology. It's best for developers who are novices or who are only moderately experienced, but even old pros will find helpful reminders of things they know they should do but forget in the rush of the moment. This book will help you fix those inevitable bugs, particularly if you're not a pro at debugging. It's hard to bottle experience; this book does a good job. This is a book I expect to find useful many, many years from now.
The entire book revolves around the "nine rules." After the typical introduction and list of the rules, there's one chapter for each rule. Each of these chapters describes the rule, explains why it's a rule, and includes several "sub-rules" that explain how to apply the rule. Most importantly, there are lots of "war stories" that are both fun to read and good illustrations of how to put the rule into practice.
Since the whole book revolves around the nine rules, it might help to understand the book by skimming the rules and their sub-rules:
- Understand the system: Read the manual, read everything in depth, know the fundamentals, know the road map, understand your tools, and look up the details.
- Make it fail: Do it again, start at the beginning, stimulate the failure, don't simulate the failure, find the uncontrolled condition that makes it intermittent, record everything and find the signature of intermittent bugs, don't trust statistics too much, know that "that" can happen, and never throw away a debugging tool.
- Quit thinking and look (get data first, don't just do complicated repairs based on guessing): See the failure, see the details, build instrumentation in, add instrumentation on, don't be afraid to dive in, watch out for Heisenberg, and guess only to focus the search.
- Divide and conquer: Narrow the search with successive approximation, get the range, determine which side of the bug you're on, use easy-to-spot test patterns, start with the bad, fix the bugs you know about, and fix the noise first.
- Change one thing at a time: Isolate the key factor, grab the brass bar with both hands (understand what's wrong before fixing), change one test at a time, compare it with a good one, and determine what you changed since the last time it worked.
- Keep an audit trail: Write down what you did in what order and what happened as a result, understand that any detail could be the important one, correlate events, understand that audit trails for design are also good for testing, and write it down!
- Check the plug: Question your assumptions, start at the beginning, and test the tool.
- Get a fresh view: Ask for fresh insights, tap expertise, listen to the voice of experience, know that help is all around you, don't be proud, report symptoms (not theories), and realize that you don't have to be sure.
- If you didn't fix it, it ain't fixed: Check that it's really fixed, check that it's really your fix that fixed it, know that it never just goes away by itself, fix the cause, and fix the process.
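The "successive approximation" idea under "Divide and conquer" can be sketched as a plain binary search. This is only an illustration, not code from the book: the names `first_bad` and `build_fails` are made up, and the sketch assumes an ordered sequence (builds, commits, pipeline stages) in which everything before some point is good and everything from that point on is bad.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def first_bad(items: Sequence[T], is_bad: Callable[[T], bool]) -> int:
    """Return the index of the first bad item, or len(items) if none is bad.

    Assumption (hypothetical): items are ordered so that every item before
    the first bad one is good and every item after it is also bad.
    """
    lo, hi = 0, len(items)          # the first bad index lies in [lo, hi]
    while lo < hi:
        mid = (lo + hi) // 2
        if is_bad(items[mid]):
            hi = mid                # the bug is at mid or earlier
        else:
            lo = mid + 1            # the bug is after mid
    return lo


# Hypothetical example: builds 0..9, with the bug introduced in build 6.
builds = list(range(10))
probes = []


def build_fails(build: int) -> bool:
    probes.append(build)            # record which builds were actually tested
    return build >= 6


culprit = first_bad(builds, build_fails)
print(culprit, probes)              # 6 [5, 8, 7, 6]
```

Each probe tells you which side of the bug you are on, so ten builds need four tests here rather than one test per build.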
This list by itself looks dry, but the detailed explanations and war stories make the entire book come alive. Many of the war stories jump deeply into technical details; some might find the details overwhelming, but I found that they were excellent in helping the principles come alive in a practical way. Many war stories were about obsolete technology, but since the principle is the point, that isn't a problem. Not all the war stories are about computing; there's a funny story involving house wiring, for example. But if you don't know anything about computer hardware and software, you won't be able to follow many of the examples.
After detailed explanations of the rules, the rest of the book has a single story showing all the rules in action, a set of "easy exercises for the reader," tips for help desks, and closing remarks.
There are lots of good points here. One that particularly stands out is "quit thinking and look." Too many people try to "fix" things based on a guess instead of gathering and observing data to prove or disprove a hypothesis. Another principle that stands out is "if you didn't fix it, it ain't fixed"; there are several vendors I'd like to give that advice to. The whole "stimulate the failure, don't simulate the failure" discussion is not as clearly explained as most of the book, but it's a valid point worth understanding.
I particularly appreciated Agans' discussions on intermittent problems (particularly in "Make it Fail"). Intermittent problems are usually the hardest to deal with, and the author gives straightforward advice on how to deal with them. One odd thing is that although he mentions Heisenberg, he never mentions the term "Heisenbug," a common jargon term in software development (a Heisenbug is a bug that disappears or alters its behavior when one attempts to probe or isolate it). At least a note would've been appropriate.
The back cover includes a number of endorsements, including one from somebody named Rob Malda. But don't worry, the book's good anyway :-).
It's important to note that this is a book on fundamentals, and different from most other books related to debugging. There are many other books on debugging, such as Richard Stallman et al.'s Debugging with GDB: The GNU Source-Level Debugger. But these other texts usually concentrate primarily on a specific technology and/or on explaining tool commands. A few (like Norman Matloff's guide to faster, less-frustrating debugging) have a few more general suggestions on debugging, but are nothing like Agans' book. There are many books on testing, like Boris Beizer's Software Testing Techniques, but they tend to emphasize how to create tests to detect bugs, and less on how to fix a bug once it's been detected. Agans' book concentrates on the big picture of debugging; these other books are complementary to it.
Debugging has an accompanying website at debuggingrules.com, where you can find various little extras and links to related information. In particular, the website has an amusing poster of the nine rules you can download and print.
No book's perfect, so here are my gripes and wishes:
- The sub-rules are really important for understanding the rules, but there's no "master list" in the book or website that shows all the rules and sub-rules on one page. The end of the chapter about a given rule summarizes the sub-rules for that one rule, but it'd sure be easier to have them all in one place. So, print out the list of sub-rules above after you've read the book.
- The book left me wishing for more detailed suggestions about specific common technology. This is probably unfair, since the author is trying to give timeless advice rather than a "how to use tool X" tutorial. But it'd be very useful to give good general advice, specific suggestions, and examples of what approaches to take for common types of tools (like symbolic debuggers, digital logic probes, etc.), specific widely-used tools (like ddd on gdb), and common problems. Even after the specific tools are gone, such advice can help you use later ones. A little of this is hinted at in the "know your tools" section, but I'd like to have seen much more of it. Vendors often crow about what their tools can do, but rarely explain their weaknesses or how to apply them in a broader context.
- There's probably a need for another book that takes the same rules, but broadens them to solving arbitrary problems. Frankly, the rules apply to many situations beyond computing, but the war stories are far too technical for the non-computer person to understand.
But as you can tell, I think this is a great book. In some sense, what it says is "obvious," but it's only obvious as all fundamentals are obvious. Many sports teams know the fundamentals, but fail to consistently apply them - and fail because of it. Novices need to learn the fundamentals, and pros need occasional reminders of them; this book is a good way to learn or be reminded of them. Get this book.
If you like this review, feel free to see Wheeler's home page, including his book on developing secure programs and his paper on quantitative analysis of open source software / Free Software. You can purchase Debugging: The 9 Indispensable Rules for Finding Even the Most Elusive Software and Hardware Problems from bn.com. Slashdot welcomes readers' book reviews -- to see your own review here, read the book review guidelines, then visit the submission page.
Change one thing at a time (Score:5, Insightful)
> key factor, grab the brass bar with both
> hands (understand what's wrong before fixing),
> change one test at a time, compare it with a
> good one, and determine what you changed
> since the last time it worked.
This is helpful with unit tests, too. If I find a bug, I want to figure out which unit test should have caught this and why it didn't. Then I can either fix the current tests, or add new ones to catch this.
Either way, if someone reintroduces that particular bug it'll get caught by the unit tests during the next hourly build [ultralog.net].
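A rough sketch of what such a test might look like, using Python's `unittest`. Everything here is invented for illustration: `parse_port` is a hypothetical function, and the "bug" is an imagined off-by-one that rejected the highest valid port.

```python
import unittest


def parse_port(text: str) -> int:
    """Parse a TCP port number (hypothetical function, for illustration only)."""
    value = int(text)
    # The imagined bug was an off-by-one here: `value < 65535` rejected
    # the highest valid port. The fix is the inclusive upper bound.
    if not 1 <= value <= 65535:
        raise ValueError(f"port out of range: {value}")
    return value


class ParsePortRegressionTest(unittest.TestCase):
    def test_highest_port_is_accepted(self):
        # Pins the fix: fails again if the off-by-one is reintroduced.
        self.assertEqual(parse_port("65535"), 65535)

    def test_out_of_range_is_rejected(self):
        with self.assertRaises(ValueError):
            parse_port("65536")


# Run the suite programmatically, as a build script might.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(ParsePortRegressionTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

The first test is the one that "should have caught this"; once it is in the suite, the hourly build does the remembering for you.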
He forgot regression tests (Score:5, Insightful)
Just my 2 cents.
Good read (Score:5, Insightful)
I can think of a WHOLE lot of techs and admins who really need to follow number 9 a lot closer.
Especially those Windows admins/techs who think 'restart' is the ultimate fix-all. Though, sadly, I suppose in many cases that's about all you can do with proprietary software. Well, that and beg vendors to fix the problem. (We all know how productive that is....)
but how do you know it's fixed? (Score:5, Insightful)
More programmers need to get Test Infected [sourceforge.net].
Re:Hardware *Debugging*? (Score:5, Insightful)
Negative. (Score:1, Insightful)
Time (Score:5, Insightful)
Unfortunately, speaking as an ex-programmer, time is one luxury that PHBs don't afford their minions. A project needs to be completed and knocked out of the door as soon as possible. The less time spent on unnecessary work, the better.
It is also unfortunate that PC users have been brought up expecting to have buggy software in front of them and expecting to have to reboot/reinstall. What motivation is there to produce bug free code when the users will accept buggy code?
Ho well, at least I run my own company now - master of my own wallet - and can concentrate on quality solutions.
Re:Hardware *Debugging*? (Score:5, Insightful)
Troubleshooting is what you do to fix your mom's ethernet card. "Oooh, it's on the bottom PCI slot, has no interrupt line. I'll just move it up one slot..."
Debugging is what you do with an oscilloscope to figure out why a particular circuit design isn't working as anticipated. You don't "troubleshoot" a circuit design. You debug it.
Or, to put it another way, "troubleshooting" is what a tech support monkey does. "Debugging" is what an engineer does.
Re:Hardware *Debugging*? (Score:3, Insightful)
Re:yuck (Score:2, Insightful)
We all (except Dijkstra [utexas.edu], perhaps) take trade-offs, for a reason. Perhaps that reason is only ignorance, but then we wouldn't get anything done.
Race Conditions? (Score:5, Insightful)
Make It Fail is pretty hard to do when it comes to race conditions. This has got to be the most frustrating kind of bug. Others are referring to the Heisenbug which comes in a variety of flavors.
Sometimes you don't KNOW when there's multiple threads or processes, or when there are other factors involved.
Have you noticed that a new thread is spawned on behalf of your process when you open a Win32 common file dialog? Have you noticed that MSVC++ likes to initialize your memory to values like 0xCDCDCDCD after operator new, but before the constructor is called? It also overwrites memory with 0xDDDDDDDD after the destructors are called. And that it ONLY does these things when using the DEBUG variant build process? Did you know that .obj and .lib can be incompatible if one expects DEBUG and the other expects non-DEBUG memory management?
Someone on perlmonks.org was just asking about a Heisenbug where just the timing of the debugger threw off his network queries. Add the debugger, it works. Take away the debugger, it fails. I've got a serial-port device which comes with proprietary drivers that seem to have the same sort of race condition.
The top 9 rules mentioned here look great. But you could write a whole book on just debugging common race conditions for the modern multi-threaded soup that passes for operating systems, these days.
Re:Heisenbugs... (Score:5, Insightful)
Heisenbugs are almost always caused by buffer overflows.
They are also almost always caused by race conditions, the most insidious of which is thread-safe code that turns out only to be safe on a uniprocessor system.
And don't forget the phase of the moon, or for the truly unlucky, intermittently glitchy hardware.
I really liked the book, but I would have... (Score:5, Insightful)
On the surface, this flies in the face of "divide and conquer" - but what I'm really saying here is make sure you have the problem bounded before you attack it.
Also, with Step 9, I would have liked to see more emphasis on ensuring that nothing else is affected by the "fix". Making changes to code to fix a problem is often a one step forward and two steps backwards when you don't completely understand the function of the code that was being changed.
All in all, an excellent book in a little understood area.
myke
Missed one: explain it to someone (Score:5, Insightful)
If you start explaining the bug to someone, there's a good chance in mid-explanation you'll realize a solution to the problem.
Some school (can't remember which) had a Teddy Bear in their programming consulting office... There was a sign. "Explain it to the bear first, before you talk to a human". Silly as it sounds, people would do it, and a large portion of the time they'd never actually have to consult the staff... by explaining it to the bear, they solved the problem.
Weird, but true.
Missing rule (Score:3, Insightful)
The second pair of eyes often finds the problem even if they don't have a clue what you are talking about.
Re:Heisenbugs... (Score:5, Insightful)
In my experience, Heisenbugs are almost always caused by stack problems. That's why they go away when you put print statements in the code - because you're causing the usage of the stack to change.
Buffer overflows (to arrays on the stack) are one good way to munge the stack. Returning the address of an input parameter or automatic variable is another way, because these are declared on the stack and cease to exist when the enclosing block exits. Anybody else using such an address is writing into the stack in an undefined manner, and chaos can result!
A missing rule (Score:5, Insightful)
Re:Good read (Score:5, Insightful)
This is especially important when changing a second variable can actually mask the fix of the change of the first variable or cause a second failure that appears to be the same as the initial failure.
I guess they should have added a rule 10: be patient and systematic. Obvious problems usually have non-obvious solutions, and a thorough examination of the situation is time consuming. Don't take short cuts or you might miss the problem.
An extra rule (Score:5, Insightful)
This is so effective that it doesn't require the person to whom you're explaining it to pay attention, or even understand. A manager will do.
The process of describing the behaviour of the program as it ought to be versus the behaviour it is exhibiting forces you to step back and consider only the facts. This in turn is often enough to give you an insight into the disconnect between what's really happening and what you know should be happening.
If you catch yourself saying "that's impossible" when debugging some particularly freaky bit of behaviour, it's definitely time to try this.
The input of the other party is so irrelevant in this process that we used to joke about keeping a cardboard cut-out programmer to save wear and tear on the real ones...
Fresh view: visit next lower level of abstraction (Score:4, Insightful)
A good list. As part of rule 8, it's often extremely helpful to look at the problem from a different level of abstraction than one normally would (e.g., different than you coded, or that you best understand it). This often exposes false assumptions that may be blocking a proper analysis.
Successful debugging is a lot like any hard science, particularly if you are not, and cannot become, familiar with the entire system first. Your "universe" is the failing system. You develop hypotheses (failure modes and potential fixes) and run experiments (test them). You have solved the problem only if you completely close the loop (your fix worked, it worked in the way you expected, your hypothesis completely explains the circumstances, and peer review concurs).
A big part of the "art" is cultivating an attitude of how systems are stressed, and how they may fail under those stresses.
Re:Now That It's Written Down (Score:3, Insightful)
--Parity None
Re:The first law of debugging (Score:3, Insightful)
Actually that's a corollary to the first law, which is:
"Every bug fix will cause two more."
"Thats a feature" (Score:3, Insightful)
In fact, popular bug-tracking databases like Scopus usually merge bugs and enhancement requests together, due to this ambiguity.
Re:Race Conditions? (Score:3, Insightful)
Two of my favourite rules... (Score:3, Insightful)
Examine the input data. Often it isn't a bug. Often the program is doing an entirely reasonable thing given the input data. Or perhaps the program mishandled bad input data (in which case it is a bug, but now you know what to look for).
Re:Heisenbugs... (Score:3, Insightful)
Yeah, but thread problems are so slippery, I don't even think of them as Heisenbugs. I think of them as Neutrinobugs.
A stack-related Heisenbug (or really any kind of Heisenbug, for that matter) will always occur in the same place, given the same conditions. Always the same location, always the same stack trace. But when you stick in a print statement, the bug moves, or - worse - it goes away altogether. That'll make you pull your hair out the first couple of times it happens to you, but after a while you learn to spot them pretty quickly.
Race conditions between threads, however, are maddening in their irregularity. They rarely happen in the same place at the same time. (If they do, you're lucky.) They can be random in when they choose to pop up. One time you might run five minutes before you see a crash. Next time, you might run hours before the program falls over and dies. And when you do get a crash, it's never in the same place, and often it's not even "near" the bad code. Two threads write to the same data structure at the same time because it wasn't locked correctly - and the program can continue running WAAAAY past the bad access. I've seen grown men brought to their knees, sobbing like little children, over threading problems. Race conditions keep suicide hotlines in business.
And people wonder why I'm compulsive about putting locks around my data.
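For anyone who hasn't been bitten yet, here is a minimal sketch of what "putting locks around my data" means. It is in Python purely for illustration (the poster doesn't say what language is involved), and the `Counter` class and `hammer` function are made-up names.

```python
import threading


class Counter:
    """A shared counter whose read-modify-write is guarded by a lock."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self.value = 0

    def increment(self) -> None:
        # Without the lock, two threads can read the same old value and
        # one of the updates can be lost, with no error at the point of damage.
        with self._lock:
            self.value += 1


def hammer(counter: Counter, times: int) -> None:
    """Increment the shared counter `times` times from the calling thread."""
    for _ in range(times):
        counter.increment()


counter = Counter()
threads = [threading.Thread(target=hammer, args=(counter, 10_000)) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter.value)  # 80000 with the lock in place
```

The lock lives inside the class next to the data it protects, so no caller can forget to take it.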
Re:Change one thing at a time (Score:3, Insightful)
Sometimes reading the code is enough. If you're good at reading code, then sometimes all you have to do is briefly look over what you wrote to spot the bug. YMMV, of course. If you've looked at the code for a few minutes and nothing looks obviously wrong, then it's probably time to use the debugger/add print statements. I've found that this is the most efficient way to go bug-hunting, because a quick re-read can find a lot of the "easy" bugs. This is similar to having a code review, but in this case you, the author, are the only reviewer. If there is another coder nearby, go ahead and ask him/her to give it a quick look as well, because he/she will probably have an easier time spotting the error.
I've met people who skip this step, and it drives me up the wall to see them waste their time (sometimes hours!) poking around in a debugger/writing print statements when the code they are debugging is simple. If it's a small, straightforward bit of code, then a quick look should uncover the bug. I suppose this falls under rule #1 (understand the system), but my point is more specific: understand the code.
None of the above is particularly groundbreaking, of course, and probably doesn't deserve to be mentioned in the book. These are more like "things you do before debugging".
Re: Heisenbugs... (Score:4, Insightful)
Missed one (Score:3, Insightful)
There is one that appears to be left out (from the summary, perhaps not from the book - I haven't read it): fix it everywhere.
Once you have found a bug, search the rest of your tree for similar bugs. Chances are that you will find and fix several. This is especially true of bugs caused by bad assumptions.
FYI: This is one of the central audit methodologies of the OpenBSD project. It works much better for the BSDs as they keep the entire system in one CVS tree, rather than scattering it around FTP servers in the forms of tarballs. The whole system is readily available to search for entire classes of bugs.
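A bare-bones sketch of that kind of tree-wide search, assuming the whole source tree is on local disk. The function name `find_similar`, the `*.c` file filter, and the `strcpy` pattern are all just illustrative choices, not anything the OpenBSD project is known to use.

```python
import re
import tempfile
from pathlib import Path
from typing import Iterator, Tuple


def find_similar(root: Path, pattern: str, glob: str = "*.c") -> Iterator[Tuple[Path, int, str]]:
    """Yield (file, line number, stripped line) for each line under root matching pattern."""
    regex = re.compile(pattern)
    for path in sorted(root.rglob(glob)):
        if not path.is_file():
            continue
        text = path.read_text(errors="replace")
        for lineno, line in enumerate(text.splitlines(), start=1):
            if regex.search(line):
                yield path, lineno, line.strip()


# Demo on a throwaway tree; in practice root would be your checkout.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "a.c").write_text("strcpy(buf, name);\nputs(buf);\n")
    (root / "sub").mkdir()
    (root / "sub" / "b.c").write_text("int x;\nstrcpy(dst, src);\n")
    hits = [(p.name, n, line) for p, n, line in find_similar(root, r"\bstrcpy\s*\(")]

print(hits)
```

A regex only finds textual look-alikes; each hit still has to be read to decide whether it shares the bad assumption.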
The most important thing to prevent bugs (Score:3, Insightful)
I think writing solid code is all in the attitude of the programmers - I had one guy who had a memory overwrite bug that was corrupting some characters in his string table when he called a certain function. Do you know how he fixed it? He wrote some code that put the right characters back over the corrupted ones after the call to this function!!! If you have that attitude, things WILL blow up in your face...