Programming Businesses IT

Lessons From Your Toughest Software Bugs (285 comments)

Nerval's Lobster writes: Most programmers experience some tough bugs in their careers, but only occasionally do they encounter something truly memorable. In developer David Bolton's new posting, he discusses the bugs that he still remembers years later. One messed up the figures for a day's worth of oil trading by $800 million. ('The code was correct, but the exception happened because a new financial instrument being traded had a zero value for "number of days," and nobody had told us,' he writes.) Another program kept shutting down because a professor working on the project decided to sneak in and do a little DIY coding. While care and testing can sometimes allow you to snuff out serious bugs before they occur, some truly spectacular ones occasionally end up in the release... despite your best efforts.
This discussion has been archived. No new comments can be posted.

  • by Dan East ( 318230 ) on Monday August 03, 2015 @07:23PM (#50244963) Journal

    Some of the bugs I've beat my head against the wall over the most are compiler bugs. It's easy to have the mindset that the compiler is infallible, and so programmers don't usually debug in a way that tests whether fundamentals like operators are really working right. This was particularly bad developing for Windows CE back around 2000 when you had to build for 3 different processors (ARM, MIPS and SH3). I ran into a number of optimizer bugs, usually related to binary operators. The usual solution was preprocessor directives (pragmas) to disable the optimizer around a specific block of code.
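As a rough sketch of what such a workaround can look like with current toolchains: the pragmas below are the ones MSVC and GCC document for per-function optimization control, and they may differ from whatever the Windows CE-era compilers accepted. The function `rotate_left32` is a hypothetical stand-in for code that tripped an optimizer.

```c
#include <stdint.h>

/* Assumption: we only want the optimizer off for this one function. */
#if defined(_MSC_VER)
#  pragma optimize("", off)
#elif defined(__GNUC__) && !defined(__clang__)
#  pragma GCC push_options
#  pragma GCC optimize ("O0")
#endif

/* Rotate a 32-bit value left by n bits (n is reduced modulo 32). */
uint32_t rotate_left32(uint32_t value, unsigned n)
{
    n &= 31u;
    if (n == 0)
        return value;   /* avoid shifting by 32, which is undefined */
    return (value << n) | (value >> (32u - n));
}

/* Restore the optimization settings for the rest of the file. */
#if defined(_MSC_VER)
#  pragma optimize("", on)
#elif defined(__GNUC__) && !defined(__clang__)
#  pragma GCC pop_options
#endif
```

Keeping the unoptimized region as small as possible limits the performance cost and also documents exactly which code the compiler was suspected of miscompiling.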

    • Re: (Score:2, Interesting)

      by Anonymous Coward
      Just after I graduated and I was working at my first job writing my first program ever that was not a homework assignment, I decided to write it as a multi-threaded program. I had a race condition that was causing a data structure to give bad data. Took me almost 30 minutes to track it down. Now that I've gotten better at programming, race conditions take me much less time and rarely involve any debugging.
      • by Jeremi ( 14640 ) on Monday August 03, 2015 @10:33PM (#50246021) Homepage

        I was working at my first job writing my first program ever that was not a homework assignment, I decided to write it as a multi-threaded program

        ^^^ 2015 nominee for most terrifying sentence on Slashdot :)

        • ^^^ 2015 nominee for most terrifying sentence on Slashdot :)

          I don't get scared when I read that stuff. I just say, "Oh, that explains Adobe" or whatever. The truth is that the world is a fractally more fucked up place than you think it is. Most people are doing it wrong and proud, regardless of their job. Or, they're phoning it in, and they know it. But since our world is not even close to being a meritocracy, we're going to have more of that.

    • by eulernet ( 1132389 ) on Monday August 03, 2015 @08:10PM (#50245269)

      I had a worse experience: hardware bugs.

      Back in the 90s, I was working on a trucks game.
      Strangely, when playing via network, the trucks on some computers sometimes desynchronized.
      I spent one week locating the problem by digging into verbose logs: it was due to the FDIV bug, which was subtly changing the positions of some trucks.

      More recently, I spent a lot of time figuring out why some programs crashed on my computer.
      After a few weeks, I realized that some bits in the RAM were dead: reading them back after a write returned random values.

      • by epyT-R ( 613989 )

        Was that 'trucks' game called "Over the Road Racing" by any chance?

        https://www.youtube.com/watch?... [youtube.com]

      • This is tough in college when it happens. No one believes the student who says things aren't working because of a hardware problem; even the other students don't believe it. There are a lot of software people who are trained to assume hardware never has problems; some even think operating systems don't have problems.

        So in my class at school we always got the new minicomputers, the ones that had never been tested on a full classroom yet. One of them had a bug in a divide instruction, and when used incorre

        • Re: (Score:3, Insightful)

          The number of times I've had fellow developers complain that their bug *must* be caused by the compiler, or the OS, or the framework, or the hardware, only for it to turn out to be their fault all along, is the reason why I always suspect my code before I blame anything else.
      • Yeah, I spent two weeks trying to track down an instant blue-screen bug in a 3D simulation. Running it in a debugger didn't help - it would still blue-screen, though it did allow me to narrow it down to an innocuous-looking piece of code. I went over it with a fine-toothed comb and couldn't find anything wrong with it.

        After two weeks, a co-worker was assigned a task similar to mine. She asked for my code so she wouldn't have to start from scratch. I gave it to her with the warning that it was blue-sc
      • I had a fun one with a nameless company's "gateway" onto a wireless network. You could write a repeater path onto the gateway, but if any of the addresses contained a 0xFF byte (most did) the gateway would write 0x0F. It took a while to track down as I had to use an external tool to read everything back, and when I reported it to the nameless company, they informed me they were "too busy" to fix it. This is about when I learned to swear in French again.

      • by Xest ( 935314 )

        "I spent one week locating the problem by digging into verbose logs: it was due to the FDIV bug, which was subtly changing the positions of some trucks."

        Similar issues are actually a fairly common occurrence in network code for video games during development when the developer is fairly new to the task. A lot of people writing network code for games run into it before learning their lesson.

        See this SE question and the associated links for example for some interesting points:

        http://gamedev.stackexchange.c... [stackexchange.com]
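A sketch of one common mitigation for this class of desync, offered as an assumption rather than a summary of the linked question: keep the simulation state in fixed-point integers so every peer computes bit-identical positions regardless of its FPU. The type and function names below are hypothetical.

```c
#include <stdint.h>

/* 16.16 fixed-point number stored in a 32-bit integer. */
typedef int32_t fixed16;

#define FIXED_ONE 65536  /* 1.0 in 16.16 */

/* Convert a small integer (assumed |v| < 32768) to fixed point. */
fixed16 fixed_from_int(int v)
{
    return (fixed16)(v * FIXED_ONE);
}

/* Multiply two fixed-point values. The 64-bit intermediate keeps the
   product from overflowing, and integer division truncates toward zero
   on every conforming C99 compiler, so all peers agree on the result. */
fixed16 fixed_mul(fixed16 a, fixed16 b)
{
    return (fixed16)(((int64_t)a * (int64_t)b) / FIXED_ONE);
}

/* Advance a position by velocity * dt using integer arithmetic only. */
fixed16 advance_position(fixed16 pos, fixed16 velocity, fixed16 dt)
{
    return pos + fixed_mul(velocity, dt);
}
```

The trade-off is limited range and precision, so this tends to suit lockstep simulations where identical results matter more than numerical accuracy.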

    • by arglebargle_xiv ( 2212710 ) on Monday August 03, 2015 @09:13PM (#50245639)

      Some of the bugs I've beat my head against the wall over the most are compiler bugs.

      Ah yes, the gift that keeps on giving. Every new version of gcc that gets deployed has new optimizer bugs, to the point that, several years ago, we stopped using O3 and above since the small loss in performance (if there even was any) was easier than handling a long tail of compiler bugs across dozens of different CPU types with every new release ("dozens" may be an under-estimate depending on how you want to count families of ARM, MIPS, Power, and other embedded CPUs).

      • by arglebargle_xiv ( 2212710 ) on Tuesday August 04, 2015 @05:41AM (#50247277)
        Having said that, there was one gcc compiler bug that got me a trip to Europe. A client had spent about three months trying to track down an impossible data corruption bug on their NIOS II embedded device, and eventually flew me over to try and sort it out. Our code is paranoid enough to run checksums on internal memory blocks, and that was reporting a memory-corruption problem. After about a week of work (with half-hour turnaround times on the prototype hardware whenever we made a change) we found that gcc was adjusting some memory offset by 32 bits. Everything looked fine at a high level, e.g. in a debugger, but if you took a cycle-by-cycle memory snapshot then at some stage writes started being out by four bytes. It was only the memory-checksumming code that caught it initially, it knew there was a fault but you couldn't see it using any normal debugging tools. We fixed it by detecting when the memory block had "moved" due to the alignment bug and memcpy'ing it 32 bits over so it was where gcc thought it was.
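The kind of internal memory-block checksumming described above might look roughly like this. It is a minimal sketch, not the poster's code: the `guarded_block` layout and function names are hypothetical, and Adler-32 is simply one cheap checksum that fits in a few lines.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical guarded block: payload plus a checksum of that payload. */
typedef struct {
    uint8_t  data[64];
    uint32_t checksum;
} guarded_block;

/* Adler-32 over len bytes. */
uint32_t block_checksum(const uint8_t *p, size_t len)
{
    uint32_t a = 1, b = 0;
    for (size_t i = 0; i < len; i++) {
        a = (a + p[i]) % 65521u;
        b = (b + a) % 65521u;
    }
    return (b << 16) | a;
}

/* Record the checksum after every legitimate write to the payload. */
void block_seal(guarded_block *blk)
{
    blk->checksum = block_checksum(blk->data, sizeof blk->data);
}

/* Returns 1 if the payload still matches its recorded checksum, else 0. */
int block_is_intact(const guarded_block *blk)
{
    return block_checksum(blk->data, sizeof blk->data) == blk->checksum;
}
```

Calling `block_is_intact` at module boundaries turns silent corruption into an immediate, locatable failure, which is how a fault invisible to a debugger can still be noticed.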
    • by Anonymous Coward on Monday August 03, 2015 @09:17PM (#50245653)

      A compiler guy here, who used to work for one of the RISC companies. Most compiler bugs are not that difficult to debug. But I worked on instruction scheduling and register allocation, hence always got assigned all the weird bugs. The most memorable one for me was actually a hardware bug - most people don't realize it, but most commercial microprocessors have a lot of bugs in them. See the published errata and you will find many bugs. A few years after the particular generation of this processor was on the market, I got assigned a weird crash bug from a commercial DBMS vendor (i.e. a very important customer). It took me forever to figure out, but it turned out to be a bug in the processor that corrupts a particular register (due to the register renaming logic screwing up in a rare combination of instructions) and that is dependent on the timing and the instruction combination. It became another erratum, and I ended up implementing a workaround - if you notice some benign but odd code sequence a compiler generates, there might be a good reason behind it :)

      • by TheRaven64 ( 641858 ) on Tuesday August 04, 2015 @02:59AM (#50246893) Journal

        Most compiler bugs are not that difficult to debug

        Another compiler guy here: Some compiler bugs are not that difficult to debug if you have a reduced test case that triggers the issue. Most are caused by subtle interactions of assumptions in different optimisations and so go away with very small changes to the code and are horrible to debug (and involve staring at the before and after parts for each step in the pipeline to find out exactly where the incorrect code was introduced, which is often not where the bug is, so then backtracking to find what produced the code that contained the invalid assumption that was then exploited later).

    • by Anonymous Coward

      Because it stymied me for weeks years back when I first started in C++. I'd written some code that made assumptions about where variables were initialised and what happened when said variables were returned, using some custom stuff in operator= and the constructor. (irrelevant detail: I wanted to be able to return sub-matrices of a matrix that could be assigned to to overwrite the relevant parts of the full matrix. Think matlab A([1 2 3], [3 4 5]) = B overwrites part (but not all) of matrix A style. And

    • by Z00L00K ( 682162 )

      One memorable one is when someone used Pascal coding in C;

              int c='a' - 'z';

      • What's wrong with that? c now contains the delta from 'z' to 'a', which is a well-defined -25 because character constants in C have type int.
        • by mwvdlee ( 775178 )

          It's well defined in ASCII. This code will produce different results depending on the native character set.
          On an EBCDIC machine (IBM z/OS mainframes), it will not return -25.
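A minimal sketch of a letter-index helper that does not depend on the native character set. The name `lowercase_index` is made up for illustration; the only assumption is that the 26 lowercase letters exist in the execution character set, which the C standard guarantees.

```c
#include <string.h>

/* Return 0..25 for 'a'..'z', or -1 for anything else.
   Does not assume the letters have consecutive codes
   (they do not in EBCDIC). */
int lowercase_index(char ch)
{
    static const char letters[] = "abcdefghijklmnopqrstuvwxyz";
    const char *hit;

    if (ch == '\0')               /* strchr would match the terminator */
        return -1;
    hit = strchr(letters, ch);
    return hit ? (int)(hit - letters) : -1;
}
```

With this, `lowercase_index('a') - lowercase_index('z')` is -25 on any conforming implementation, whereas `'a' - 'z'` is only -25 where the letters happen to be contiguous.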

  • OMG....a Meme...
  • by Rei ( 128717 ) on Monday August 03, 2015 @07:28PM (#50245001) Homepage

    Program crashing at startup? Okay, let's add debugging statements.

    Can't get the debugging statements to execute? Okay, let's try removing code.

    Doesn't fix the problem? Okay, let's keep removing more... and more...

    A couple hours later, so much code was removed that the entire program had become nothing more than an empty main function that still crashed. This led to the following rule which I try to follow to this day: Make sure that you're actually compiling and executing the same copy of the code that you're modifying. ;)
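One cheap guard against debugging a stale binary is to have the program announce when it was compiled. This is only a sketch: `build_stamp` and `print_build_stamp` are hypothetical helpers, while `__DATE__`, `__TIME__` and `__FILE__` are standard C predefined macros.

```c
#include <stdio.h>

/* Timestamp of when this translation unit was compiled,
   in the form "Mmm dd yyyy hh:mm:ss". */
const char *build_stamp(void)
{
    return __DATE__ " " __TIME__;
}

/* Hypothetical helper: call this first thing in main() so a stale
   binary is obvious from its very first line of output. */
void print_build_stamp(FILE *out)
{
    fprintf(out, "built %s from %s\n", build_stamp(), __FILE__);
}
```

If the timestamp printed at startup does not match the build you just ran, you are executing a different copy of the code than the one you are editing.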

    • by JazzXP ( 770338 )
      Been there done that. Lol yep, lost half a day on it...
    • by maugle ( 1369813 )
      Oh man, that's happened to me twice, with several hours lost in each instance. I've sworn to never allow it to happen a third time.
    • by Dutch Gun ( 899105 ) on Monday August 03, 2015 @10:18PM (#50245943)

      Oh, damn... yeah, done that as well. Frustrating as hell, because it just doesn't make sense until you finally figure out you're not even debugging the code you're working with.

      Other variations of "the impossible is happening" include:

      * Syncing to new code, recompiling, and crashing. Crashes only go away once you force a full rebuild to update stale precompiled headers.
      * Program crashes mysteriously, and only is fixed after the machine is rebooted (likely some process in RAM has been corrupted).
      * When you get automated crash debug reports from hundreds of thousands of customers, you eventually realize that a staggering number of people simply have bad hardware, due to the impossible crashes that occur (e.g. a = b + c; // --- crashes here. all variables are integers).
      * Compiler or hardware bugs - thankfully much more rare than they used to be.

    • by bl968 ( 190792 )

      Been there done that!

    • by Z00L00K ( 682162 )

      That's on par with rebooting the wrong machine.

  • by Etherwalk ( 681268 ) on Monday August 03, 2015 @07:32PM (#50245021)

    I once had a bug where red and blue values were swapping places across thousands of pixels, and it took quite a while to hunt down. It turns out there was a function doSomething called with parameters (pixel[i++],pixel[i++],pixel[i++]) while doing transformations. The compiled code pushed the third parameter onto the stack first, so it was using the red value from the array in the blue spot and vice versa across the entire image.

    • by Anonymous Coward on Tuesday August 04, 2015 @01:18AM (#50246629)

      Actually, what you're describing is formally defined as undefined behavior in the C and C++ standards.

      Undefined behavior:

      doSomething(pixel[i++],pixel[i++],pixel[i++]); /* function call commas are NOT sequence points, so the result is undefined */

      Refer to the Sequence point [wikipedia.org] article. The [3] citation says:

      Clause 6.5#2 of the C99 specification: "Between the previous and next sequence point an object shall have its stored value modified at most once by the evaluation of an expression. Furthermore, the prior value shall be accessed only to determine the value to be stored."

      Pay special attention to point #4 under "Sequence points in C and C++", because that talks about your exact problem. But beware that you'd still have a bug even if you hid the increment inside of a function, because order of argument evaluation is not specified (as opposed to undefined behavior, which can cause nasal demons or format your hard drive).

      Fixed with least diff:

      int r=pixel[i++], g=pixel[i++], b=pixel[i++]; /* commas between declarators ARE sequence points */
      doSomething(r,g,b);

      See also: S.O. questions related to undefined behavior and sequence points in C [stackoverflow.com] and C++ [stackoverflow.com].
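An alternative to the least-diff fix, sketched under assumptions: keep all side effects out of the argument list and advance the index in its own statement. `pack_rgb` is a hypothetical stand-in for doSomething, chosen so the argument order is observable in the result.

```c
#include <stddef.h>

/* Hypothetical stand-in for doSomething(): packs r, g, b into 0xRRGGBB. */
unsigned long pack_rgb(unsigned char r, unsigned char g, unsigned char b)
{
    return ((unsigned long)r << 16) | ((unsigned long)g << 8) | b;
}

/* Convert count pixels of interleaved r,g,b bytes.
   The arguments are plain reads, so their evaluation order is irrelevant. */
void pack_image(const unsigned char *pixel, size_t count, unsigned long *out)
{
    size_t i = 0;
    for (size_t p = 0; p < count; p++) {
        out[p] = pack_rgb(pixel[i], pixel[i + 1], pixel[i + 2]);
        i += 3;   /* advance once, in its own statement */
    }
}
```

Because no argument expression modifies `i`, the call behaves the same whether the compiler evaluates arguments left to right or right to left.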

      • by TheRaven64 ( 641858 ) on Tuesday August 04, 2015 @03:28AM (#50246975) Journal

        The order of parameter evaluation is one that bites a lot of people because most compilers do it the expected way. When you're walking an AST to emit some intermediate representation, you're going to traverse the parameter nodes either left-to-right or right-to-left and most compiler IRs don't make it easy to express the idea that these can happen in any order depending on what later optimisations want. If they have side effects that generate dependencies between them (as these do) then they're likely to remain in the order of the AST walker. Most compilers will walk left-to-right (because a surprising amount of code breaks if they don't), but a few will do it the other way.

        To understand why this is in the spec, you have to understand the calling conventions. Pascal used a stack-based IR (p-code) and had a left-to-right order for parameter evaluation, which meant that the first parameter was evaluated and then pushed onto the stack, so the last parameter would be at the top of the stack. The natural thing when compiling Pascal (as opposed to interpreting the p-code) was to use the same calling convention, with parameters pushed onto the call stack left to right. Unfortunately, C can't do this and support variadic functions (note: some implementations wanted to do this, which is why the C spec says that variadic and non-variadic functions are allowed to use completely different calling conventions), because if the last variadic argument is the top of the stack then there's no way to find the non-variadic arguments unless you also do something like push the number / size of variadic arguments onto the stack.

        This meant that C implementations tended to push parameters onto the stack right to left. This is less of an issue now that modern architectures have enough registers for most function arguments, but is still an issue on i386. Because of the order of the calling convention, it's more efficient on some architectures to evaluate arguments right to left. Some compilers that are heavily performance-focussed (GPU and DSP ones in particular, where they don't have a large body of legacy code that they need to support) will do this, because it reduces register pressure (evaluate the rightmost argument using some temporaries, push it to the stack, move onto the next, reusing all of those temporary registers).
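At the source level, the constraint described above shows up in how a variadic function is written: the callee can always find its fixed parameters, and they are what tell it how many variadic arguments follow. A minimal sketch using `<stdarg.h>`, with `sum_ints` as a hypothetical example function:

```c
#include <stdarg.h>

/* Sum `count` int arguments. The fixed parameter `count` is the only way
   the callee knows how many variadic arguments were passed. */
long sum_ints(int count, ...)
{
    va_list args;
    long total = 0;

    va_start(args, count);          /* start just after the last fixed parameter */
    for (int i = 0; i < count; i++)
        total += va_arg(args, int); /* fetch the next int argument */
    va_end(args);
    return total;
}
```

For example, `sum_ints(3, 1, 2, 3)` walks three arguments because the first, fixed argument says so; nothing in the language lets the callee discover the count on its own.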

  • by sectokia ( 3999401 ) on Monday August 03, 2015 @07:33PM (#50245027)
    When ARM first came out on some Philips CPUs it had bugs in the C compiler. The IT department called us hardware engineers in after being stuck on a bug for months. The problem with programmers is to many of them work at a high level, and they hit a wall at some abstraction layer, usually at assembly code. The other problem with these compiler bugs was that as you removed unrelated code, they went away, as the compiler had pointer corruption issues. So to get the vendor to fix it, you often had to submit an entire copy of your code project. Sometimes we had to submit images of entire machines because the compiler would interact with an IDE and with Windows. These days we use only open source compilers to ensure we aren't held up and can identify and fix problems quickly.
    • by sectokia ( 3999401 ) on Monday August 03, 2015 @07:48PM (#50245123)
      The absolute worst I've had was a soft CPU in an Altera FPGA. It shipped with a C compiler. A programmer came to me to explain how his program would crash if he changed the order in which subroutines were defined. After carefully checking the logic, there was nothing wrong with his code. So I then trawled through the assembly. Again I could find nothing wrong and thought I was losing my mind. I had to painstakingly check the CPU state after each instruction until I eventually found one instruction that did not set a flag as per the manual, and the assembler matched the manual. It was a fault that would only trigger if you did a certain conditional jump after a certain fetch, increment, then store sequence. It was a bug in the CPU pipeline logic. I learnt a valuable lesson: never trust anything. We wasted a lot of time because we were convinced we must have been the source of the fault.
    • *too
  • Back in the 80's, I was working on a project with three other programmers. Nobody had heard of version control back then; we were using VAX/VMS and it would keep a few versions of a file around after you changed it, which seemed good enough (after all, we all trusted each other, right?)

    Well, I don't remember the exact bug(s), but one day I fixed something, and tested it. Fine. A few days later the bug came back. So I went back, fixed it again (wait, didn't I already make this change?). A few days later it came back again.

    It turned out that one of the other guys had fixed a different bug, which I had introduced with my fix. So, his fix was to change the code back the way it was. We went back and forth a few times un-doing each other's changes before we realized what was going on. Seeing a revision log with comments on the changes might have helped...

    • by Z00L00K ( 682162 )

      That's why you comment in the code why something is done there.

    • Nobody had heard of version control back then [80's]

      I don't think that's correct. Wikipedia says that SCCS was first released in 1972.

  • by myowntrueself ( 607117 ) on Monday August 03, 2015 @07:42PM (#50245077)

    I recall a proverb, something like

    "It takes twice as much intelligence to debug code as it took to write it.
    So if you code to the best of your ability you are, by definition,
    not qualified to debug it."

    • by bloodhawk ( 813939 ) on Monday August 03, 2015 @08:35PM (#50245437)
      The full quote is

      “Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it.”
      - Brian Kernighan

      I used to use this as my signature a few years back to try and make devs think about what they are writing. It is nearly always better to make the code simple and readable than to try and produce the best possible code. No it isn't as fun, but it is a damn sight better for those that have to try and decipher your clever coding tricks later.
      • by Z00L00K ( 682162 )

        I see one reason to write clever code - and that's when you try to optimize for performance. But the "hot spots" in code are rare, so in those cases you can get away with it if you put in a decent comment in the code describing why it's done in a particular way.

        Multithreading is also a special beast - looks simple, is simple to implement but you really have to watch out if you have shared variables/memory. It can lead to some not so obvious errors!

        • An issue is that, even though performance only matters in those "hot spots", optimizing them frequently involves large chunks of the code base. If you don't understand this then you probably don't understand real optimization efforts and have just been toying with the kind of optimizations that the compiler should already be doing for you.

          The professional optimizers, the guys called in because nobody on the team can get even close to acceptable per
  • Debugging Gone Wrong (Score:4, Interesting)

    by mlookaba ( 2802163 ) on Monday August 03, 2015 @07:47PM (#50245119)

    Bug 1 (my fault) : Took over working on a financial application that took identifiers and enriched them with all sorts of useful data. The original programmer had left, and nobody at the company knew anything about how it worked. Soon after, we were troubleshooting an issue reported by a client that the output data wasn't consistent between runs. I grabbed a list of all the unique security IDs I could find (about 100k) and pushed them through a couple of times just to try and replicate the issue. HOWEVER... it turns out the application was actually using the Bloomberg "By Security" interface under the hood. That was a service where you drop a list of IDs onto Bloomberg's FTP server, and they would respond with data... for a fee of $1 per security. The client got an unexpected bill of nearly $200k that month, and I had the most awkward talk ever with my boss. Fortunately, Bloomberg forgave the charges, and it turns out they were actually responsible for the inconsistent data - which was fixed on their end shortly thereafter.

    Bug 2 (not my fault) : A client/server application is returning odd responses to a particular query. Developer (we'll call him "Jason") inserts a switch into the code that dumps this query out to a hardcoded folder on the server. The code then gets checked into production WITH THE SWITCH TURNED ON. It went undetected for nearly a year because the query wasn't terribly high volume. But slowly and steadily, the query files built up over time. Our IT had lots of money to play with, so server space was not an issue. Unfortunately, the number of files was. Server performance went steadily downward every so often, until finally this query would make it crash every time. When we eventually tracked down the cause, there were millions of files sitting in the same folder of every single server in the group. It took nearly three days just to get the OSs to delete the files without falling over.

    • by Z00L00K ( 682162 )

      Bug #1 - not a bug really. Just an awkward mistake, but good that Bloomberg dropped it. But that also shows the need for documentation of how stuff works when someone quits.

      I once developed an SMS gateway and did a test run on it but forgot to change the list of phone numbers so my manager at the time got 50 text messages with the same content. Ooops! :)

  • by DamonHD ( 794830 ) <d@hd.org> on Monday August 03, 2015 @07:50PM (#50245133) Homepage

    A stray ; 30 years ago in some C took me a week to find, replacing the intended body of a loop with an empty block IIRC. I have ever since tried always to { } statement blocks so that it is easy to tell what was intended...
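A small illustration of the stray-semicolon trap and the always-use-braces habit. The function `count_byte` is a hypothetical example; the buggy form appears only inside the comment.

```c
#include <stddef.h>

/* Count the bytes equal to `wanted` in buf[0..len). */
size_t count_byte(const unsigned char *buf, size_t len, unsigned char wanted)
{
    size_t hits = 0;
    size_t i;

    /* The buggy form would read:
     *
     *     for (i = 0; i < len; i++);      <- stray ';' is the entire loop body
     *         if (buf[i] == wanted) hits++;
     *
     * which runs the test once, after the loop, on buf[len].
     * With braces on every body, a ';' right after the ')' stands out. */
    for (i = 0; i < len; i++) {
        if (buf[i] == wanted) {
            hits++;
        }
    }
    return hits;
}
```

Many current compilers can also warn about an empty loop body of this shape, but that depends on the compiler and the warning flags in use.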

    Also I strongly echo the "make sure that you're editing what you're running/debugging" comment elsewhere. Still horribly easy to get that one wrong in lots of different ways...

    Rgds

    Damon

    • "Also I strongly echo the "make sure that you're editing what you're running/debugging" comment elsewhere. Still horribly easy to get that one wrong in lots of different ways..."

      Agreed, although a modern VCS really really helps avoid this. Wish I'd had GIT back in the '80s.

      • If you're paying attention that is. You can edit code, save it, pop up the window and type "make", see stuff actually build, then hit your debugger and it's loading the wrong code. Happens if you're forced to use some lame IDE or debugger for the chip while using better tools to develop with (because every damn chip maker thinks they should make some proprietary half assed IDE rather than make open debugging tools).

        • by Z00L00K ( 682162 )

          Or just have the path variable set wrong - current directory last in the path can yield some interesting effects when coding.

    • I stick with the ";" for loop bodies but it's always on a line by itself so that it's obvious.
      Another problem along these lines: don't trust the indentation in other people's code. You miss the ";" at the end of the line with the "while" because the indentation is fooling you. Some people just insist on their own indentation style even if the code above and below it uses different styles. I even had a boss once who cut and pasted code without re-indenting afterwords.

      • I even had a boss once who cut and pasted code without re-indenting afterwords.

        These go into a separate chapter usually, just like forewords, so they might be indented differently on purpose.

  • A trigger on a busy table was using a Rule Based Optimizer
    We had done a 'rough' system test for upgrading from 9i to 10g, but the system did not have a realistic production load put on it
    The DBA group placed the upgrade into production and suddenly the system drags to a crawl
    It took us a very short amount of time to figure out the problem, but a few hours to deal with the existing change control process and satisfying a DBA manager, who failed to let us know that there was a major change with the database r

    • by Z00L00K ( 682162 )

      On the level of someone changing order of columns in an indexing for no particular reason, possibly because it looked better to have the index column in alphabetical order.

  • I have a bug in javascript that I can't fix.
    I can't remember what it is now but it's documented in the code that if you remove the Are you sure? prompt (or remove the now-hidden debug statement), the code doesn't work. When you display the variable, or just wait and ask, then the code does work.

    Every couple years when someone scans thru the code, they'll spend a day or two trying to figure out what's really happening.
    • This has probably something to do with global/local scope.
      If your variable is declared with a "var", it's local, otherwise, it's global.
      You probably missed some var i, and your i variable is global, leading to random crashes if the loop is used at several locations.

  • by coop247 ( 974899 ) on Monday August 03, 2015 @07:56PM (#50245171)
    First job out of college doing tech support for a big corp. One day thousands of Win2000 computers start taking multiple hours to boot up. Nobody can figure out what the problem is, got like 20 people working on it for almost two weeks.

    After digging through logs and error messages I discover that some idiot who had denied doing anything had sent out an update via our client management software to add a new local user for support purposes. He didn't do this via a script, rather "recorded" him adding it to a machine and then sent out a copy of the files and registry entries that had changed. Unbeknownst to this genius, the local security database is a binary (pretty sure encrypted) file that you can't just go copying between machines.

    I put together a script that repaired the local database and fixed the problem in a couple minutes. But literally had thousands of workers sitting around doing nothing waiting for computers to boot for like 2 weeks.
  • One of my toughest bugs didn't exist.

    My code was actually working correctly, but the debugger under certain conditions would display wrong values. I wasted a lot of time trying to find the bug in my code.

    • Re:debugger (Score:5, Interesting)

      by Jeremi ( 14640 ) on Monday August 03, 2015 @10:10PM (#50245915) Homepage

      Some people, when trying to analyze a buggy program, think "I know, I'll use a debugger". Now they have two buggy programs to analyze.

      -- a grumpy old programmer

      • by Z00L00K ( 682162 )

        Programs that crash when running under a debugger are always fun; sometimes it's better and easier to run the program normally and then do a post mortem on the core file generated. Hence "generating core dumps" is a standing joke in some development.

        Fortunately the number of cases where a debugger doesn't work has diminished greatly over the years compared to how it was under MS-DOS.
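For the post-mortem approach mentioned above, a process on a POSIX system can make sure core files are not suppressed by its own resource limit. This is a sketch assuming a POSIX environment; `enable_core_dumps` is a hypothetical helper, while `getrlimit`, `setrlimit` and `RLIMIT_CORE` are standard POSIX.

```c
#include <sys/resource.h>

/* Raise the soft core-file size limit to the hard limit.
   Returns 0 on success, -1 on failure. */
int enable_core_dumps(void)
{
    struct rlimit lim;

    if (getrlimit(RLIMIT_CORE, &lim) != 0)
        return -1;
    lim.rlim_cur = lim.rlim_max;   /* soft limit may be raised up to the hard limit */
    return setrlimit(RLIMIT_CORE, &lim);
}
```

Whether and where a core file is actually written still depends on system configuration (the hard limit itself, and on Linux the kernel's core pattern setting), so treat this as necessary rather than sufficient.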

  • by GoodNewsJimDotCom ( 2244874 ) on Monday August 03, 2015 @08:03PM (#50245217)
    I once had a heisenbug, which was a simple dereferenced pointer. The problem is that I had a couple thousand lines of code, and the bug wasn't where I was recently coding. Every coder knows to check for bugs in their most recent code, but a dereferenced pointer can be anywhere in the code. Anyway, I decided to break down and pray for help. Then within moments I read through a random line of code in some random file and debugged the problem. Since then, I often pray I do well in general, that I don't get stuck on a brick wall of tech, that God helps me while I code, and a host of other cool stuff. I find things flow more smoothly since then and I don't fight with code. I know God is real, and I've come to discover prayer does help too. In addition to that, I've been more careful with pointer math, biasing array memory structures more.
    • by frank_adrian314159 ( 469671 ) on Monday August 03, 2015 @08:26PM (#50245381) Homepage

      I'm glad you found the truth - that being more careful with pointer math and biasing array memory structures more is truly a blessing. May you also discover the higher truth that coding in languages that need no such nonsense (as their automated memory allocation and deallocation routines have been far better debugged than yours) is even more blessed and may lead you more quickly to the communion with defect-free code you desire.

    • Use valgrind. It helps. A lot.

    • by Jeremi ( 14640 )

      I know God is real, and I've come to discover prayer does help too.

      Interesting; I found just the opposite. When I was a programming n00b working on my C assignments in college, and it was the night before it was due and I couldn't figure out why it was crashing, I tried praying, hoping, wishing, random changes to the code, furrowing my brow at the screen, loud cursing, exhaustive special-case-logic, and a dozen other increasingly desperate non-methods to "make the code work" without actually understanding it.

      Just before the 4 AM deadline for submissions, the code would st

  • Write code to test your code. Hit every edge case hard, every boundary condition.

    All too often we tend to test our code by just running the overall program, but this is not good enough. Running the overall program does not introduce a wide enough range of input parameters to every function.

    Write test code. Write code to log your inputs and outputs to files early in the development cycle. Don't get swamped down in the land of trying to debug code that was never written to be debugged.

    I had many many tough bu
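
    The advice above can be sketched as a tiny table-driven test. This is only an illustration: `clamp` is a hypothetical function under test, not something from the comment, and the cases are chosen to sit on and around each boundary rather than in the comfortable middle.

```python
def clamp(value, low, high):
    """Hypothetical function under test: restrict value to [low, high]."""
    if low > high:
        raise ValueError("low must not exceed high")
    return max(low, min(value, high))


def run_edge_case_tests():
    """Call the function directly with boundary inputs, not via the whole program."""
    cases = [
        # (value, low, high, expected)
        (5, 0, 10, 5),     # ordinary value inside the range
        (0, 0, 10, 0),     # exactly on the lower bound
        (10, 0, 10, 10),   # exactly on the upper bound
        (-1, 0, 10, 0),    # just below the lower bound
        (11, 0, 10, 10),   # just above the upper bound
        (3, 3, 3, 3),      # degenerate range where low == high
    ]
    for value, low, high, expected in cases:
        result = clamp(value, low, high)
        assert result == expected, (value, low, high, result)

    # An inverted range must be rejected rather than silently accepted.
    try:
        clamp(1, 10, 0)
    except ValueError:
        pass
    else:
        raise AssertionError("inverted range was accepted")

    # Number of checks performed: the table plus the inverted-range check.
    return len(cases) + 1
```

    Running the whole program would rarely feed `clamp` a degenerate or inverted range; calling it directly makes those inputs cheap to cover.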

  • by Anonymous Coward

    My favourite head scratcher - back using Motorola's version of Unix, we had a voice response (IVR) application that would poll for activity, and otherwise sit idle using the sleep() command. The code had interrupt handlers for SIGUSR (iirc) that would perform "real-time" activities as necessary (handling call hang ups, touch tone digit receipt, etc). When running under a load test scenario during a quality cycle, we kept running into scenarios where 1 in 1000 or so instances of our event handlers were NOT h

  • about unpublished errata. According to the lead engineer (at a major cpu vendor) there are more hardware bugs than software bugs.
  • by FrozenGeek ( 1219968 ) on Monday August 03, 2015 @08:14PM (#50245301)
    Seriously, great topic.

    Two bugs come to mind, one that I wrote and fixed, one that I fixed but did not create. The one that I created was an assembler bug, code written in UYK-502 assembler (military computer). I screwed up one op code, specifying LK (load constant) instead of L (load from memory address). The difference in the code was one bit, but I had to single-step through the code to find the bug - took me hours for one stinking bit.

    The other bug, also on the UYK-502 computer, was a bug in the micro-code. The guy who wrote the micro-code for one particular instruction had ignored the user guide for the bit-slice processor and had implemented a read-modify-write operation in a single micro-code instruction. It worked for him because the timing hardware was slow enough. Unfortunately, a couple of years later, the manufacturer of one of the chips in the timing hardware improved the internal workings of the chip so that one of the lines dropped sooner than it did on older versions of the chip (NB: the chip still met the same specs - it was just faster). Debugging was a pain. The computer used a back-plane, and the timing hardware and the bit-slice processor were on different cards. When we put either card on an extender so we could connect a logic analyser, the delay added by the traces on the extender caused the problem to go away. It took two of us a week to find the problem. The fix was to update the microcode ROMs for every computer that received the new timer card.

  • Stop writing so many of them?

  • by Snotnose ( 212196 ) on Monday August 03, 2015 @08:17PM (#50245329)
    For about 10 years I was a troubleshooter, they'd assign me something to work on and then interrupt me for a big ass bug.

    First big bug? Linux system would crash after about a week. Diagnosis? When it crashed it was out of FDs. Turns out a kernel resource was opening a file, exiting, and never closing the fd. Time to find? About a week. Time to diagnose? About a minute. Time to fix? About 10 minutes.

    How did I find it? Waited until it died, ran some built-in command to see WTF happened, looked at the source code, fixed.

    Second big bug. System would reboot randomly within an hour to a week due to a watchdog timer firing. Even had a "magic" laptop that made it crash more often. Diagnosis? When you read from a register the chip would sometimes hang. Time to diagnose? About a month, most of that waiting for the damned system to crash. Didn't help I only had 1 JTAG, I couldn't do anything else while waiting for the system to crash. I spent a lot of time looking for interesting websites during that month. Time to fix? For me, about 30 seconds. It was a system status register, nobody cared except the hardware folks, I quit reading it. For the hardware folks? Don't know, don't care.

    How did I find it? It was a cellphone. When it restarted JTAG was initialized at the reboot point. I found the point in software that initialized the memory controller. As the system never lost power memory was intact. Found the process crashing. Then I created an in-memory array. As the code progressed I updated this in-memory array, stuff like "code does something, I put 0x10 into my array. Code does something else, 0x20 into my array". After a couple days of "it's just reading a register, I messed up somewhere" I finally concluded "reading this register causes it to crash about 1 time in 10,000"

    Third big bug? Cellphone base station. Card handled 3 T1 lines, did the analog/digital and digital/analog muxing for each call. Cells would randomly drop out after a day or so, they didn't come back until you rebooted the system. It's a base station, you never reboot the system. After about 3 months of this I got asked to look into it. I'm like, dafuq? It's a DSP issue, I don't know jack about DSP, I'm screwed. Honestly, I had no idea how to even approach this problem.

    The fix? I was telling myself how screwed I was, and I'd never get a raise, and generally killing time reading the docs. Found a library call that said "do not call this during an ISR". It was being called from an ISR. Sent email to the DSP folks asking them to comment out that line, they did and sent me the binary blob to load onto the card. I did, problem went away.
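
    The "in-memory array" trick from the second bug can be sketched roughly as follows. This is a loose illustration, not the original firmware: `BreadcrumbTrail`, `read_status_register`, and the specific checkpoint codes are hypothetical, and a raised exception stands in for the chip hang and watchdog reset. On the real device the array lived in RAM that survived the reboot and was read back over JTAG.

```python
from collections import deque


class BreadcrumbTrail:
    """Fixed-size record of the most recent checkpoint codes."""

    def __init__(self, size=8):
        # Old entries fall off the front once the trail is full.
        self._codes = deque(maxlen=size)

    def mark(self, code):
        """Record that execution reached the checkpoint identified by code."""
        self._codes.append(code)

    def last(self):
        """Return the most recent checkpoint, or None if nothing was recorded."""
        return self._codes[-1] if self._codes else None

    def history(self):
        return list(self._codes)


def read_status_register():
    # Hypothetical stand-in for the hardware read that occasionally hung.
    raise RuntimeError("simulated hang")


def run_once(trail):
    trail.mark(0x10)           # code does something
    trail.mark(0x20)           # about to read the status register
    read_status_register()     # never returns normally in this sketch
    trail.mark(0x30)           # would only be recorded if the read came back
```

    After the simulated failure the last code in the trail is 0x20, which points at the register read as the last thing the code attempted, the same conclusion the comment describes reaching.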
    • Diagnosis? When it crashed it was out of FDs.

      I usually just switch to using FEs and FFs when the FDs run out.

  • Incrementing (Score:4, Interesting)

    by darkain ( 749283 ) on Monday August 03, 2015 @08:18PM (#50245341) Homepage

    One night while coding half asleep, I wrote the following to increment a variable in C++

    x = x++;

    The problem with this code is that it is undefined behavior. It looks okay at first glance, but when you consider the machine code that would be built from it, an ambiguity arises. The problem lies in the = sign vs. the ++ operator: both assign to the x variable, but it is not well defined which assignment should happen first. The code was in active use in both MSVC and GCC environments, each producing the opposite assignment ordering. This was awesome to debug, since the code "worked" on one platform but not the other!

  • I once contracted with a shop that had a process that generated garbled output data rows. It appeared to be extra stuff that didn't affect (over-write) the intended rows. The shop had added an extra processing step to filter out the garbage rows and eventually just worked around the glitch.

    They had asked me to try to track it down, among other projects, because they were newbie programmers. I couldn't figure it out either because it never appeared in my intermediate trace statements. I put a trace (print) s

    • by Z00L00K ( 682162 )

      It's way too common that people re-use variables for different purposes. Especially in Visual Basic.

  • by Cali Thalen ( 627449 ) on Monday August 03, 2015 @08:22PM (#50245353) Homepage

    http://blogs.msdn.com/b/rick_s... [msdn.com]

    Read this years ago, and thought it was interesting at the time...I've saved the link for years. Really detailed story about finding a really complicated bug in MS Word way back in the day.

  • Self-Checking Code (Score:4, Insightful)

    by Cassini2 ( 956052 ) on Monday August 03, 2015 @08:37PM (#50245447)

    I gave up on the concept that I would be able to write and debug programs correctly the first time. Now all the central data structures in any long-lived control system get error-checking code added to them. For example, the sorted-list code is built with a checker to ensure it stays in order. The communications code gets error-checking. The PID controllers get min/max testing, etc.

    Every once in a while I come across bugs that are not in the source code. Often they are compiler errors. Sometimes the bugs involve a rare C/C++ or operating system eccentricity. Sometimes the errors are caused by obscure library changes. Sometimes they are hardware errors.

    Especially with the embedded micro-controllers, I leave the consistency checking code in, because you just can't assume that everything always works. The nature of software bugs changes with time, and it is not always in the way a programmer would expect. I am frequently surprised by how obscure some of the bugs are.
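
    A minimal sketch of the sorted-list checker mentioned above, assuming a plain in-memory list. The names `CheckedSortedList` and `InvariantError` are made up for illustration; the comment does not say how the original checker was written.

```python
import bisect


class InvariantError(Exception):
    """Raised when a data structure no longer satisfies its own rules."""


class CheckedSortedList:
    """Sorted list that re-verifies its ordering after every insert."""

    def __init__(self):
        self._items = []

    def insert(self, value):
        bisect.insort(self._items, value)
        self.check()  # the check stays in, even in production builds

    def check(self):
        """Verify that every element is <= its successor."""
        for earlier, later in zip(self._items, self._items[1:]):
            if earlier > later:
                raise InvariantError(
                    f"order violated: {earlier!r} before {later!r}"
                )

    def items(self):
        return list(self._items)
```

    The check costs O(n) per insert, which is the trade-off of this approach: it catches corruption from any source (a stray write, a bad library, flaky hardware) close to where it happens, at the price of extra work on every update.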

  • Back in my student days I had a runaway pointer. On one of the mid-1980s Motorola 68000 Macs, it would trigger the power-off function if it wasn't running under a debugger. Talk about frustrating.

    At least it was consistent.

    Remember, this was back in the days before protected memory. Also, if memory serves, the MacOS and applications always ran in "supervisor mode" (analogous to "ring 0" on Intel chips), so your program 0wned the machine while it was running.

  • by shoor ( 33382 ) on Monday August 03, 2015 @09:40PM (#50245755)

    The novel is The Bug by Ellen Ullman.

    Here's a quote from one of the reviews, https://www.kirkusreviews.com/book-reviews/ellen-ullman/the-bug/ [kirkusreviews.com]:

    Her first fiction - which descends back into this realm of basement cafes and windowless break rooms, of buzzing fluorescents, whining computers, and cussing hackers - sustains a haunting tone of revulsion mingled with nostalgia. This artful tension distinguishes heroine Roberta Walton, who tells about the dramatic undoing in 1984 of Ethan Levin, a slightly odious but efficient programmer plagued by a highly odious but efficient computer bug.

  • by Darinbob ( 1142669 ) on Monday August 03, 2015 @10:19PM (#50245951)

    I had a job with a group managing shared minicomputers. One program I was writing was to log someone off after being inactive for some time, to free up a port for other users. So my loop to check every 5 minutes involved incrementing the time to wake up by 5 minutes on each iteration. Ie, it woke up at a specific time. So it would theoretically wake up at 12:00, 12:05, 12:10, etc.

    The problem was that this operating system for some reason blocked when sending the alert message to someone's terminal. There was possibly some non-blocking way to do this with some extra effort, but it didn't seem like any additional effort was needed. However, some user typed Control-S on his terminal, probably by accident, and then went off to lunch. So a warning message went to his terminal, but blocked because of the Control-S. So the program was stuck until he came back from lunch and typed Control-Q. At which point this unblocked my program which then printed out one after the other on everyone's terminal in two buildings:
    "your terminal has been idle and you will be logged off in 15 minutes",
    "your terminal has been idle and you will be logged off in 10 minutes",
    "your terminal has been idle and you will be logged off in 5 minutes",
    "logging off due to inactivity."
    This was shortly followed by a line of people coming into the office to complain, including my boss.
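
    One plausible reading of the failure is that the absolute wake-up schedule kept every missed deadline pending, so all of them fired back to back once the blocked write returned. The sketch below models only that scheduling arithmetic, with times as minutes past noon; both function names are hypothetical and no real terminal I/O is involved.

```python
def due_wakeups(next_wake, now, interval):
    """Absolute schedule as in the story: every missed deadline still fires.

    Returns (deadlines that fire immediately, next future deadline).
    """
    fired = []
    while next_wake <= now:
        fired.append(next_wake)
        next_wake += interval
    return fired, next_wake


def due_wakeups_skipping(next_wake, now, interval):
    """Variant: fire once for the latest missed deadline, then skip ahead."""
    if next_wake > now:
        return [], next_wake
    missed = (now - next_wake) // interval + 1
    latest = next_wake + (missed - 1) * interval
    return [latest], next_wake + missed * interval
```

    With a wake-up due at 12:00, a 5 minute interval, and the program unblocked at 12:17, the first version fires four times in a row (12:00, 12:05, 12:10, 12:15), while the second fires once and resumes at 12:20. A non-blocking or timed write to the terminal would address the root cause more directly; this only limits the burst.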

  • by wolf12886 ( 1206182 ) on Monday August 03, 2015 @10:23PM (#50245971)

    I was working on an embedded system recently that had a 5 minute timer to shut off the machine. We had received customer complaints that the machine occasionally shut off early. The code was a simple while loop that ran some PID controls and on every iteration checked "If (run_time > 5 minutes): exit;". I ran the machine in the lab for a while and sure enough, it shut off early once in a while. I looked through, and eventually SCOURED, the code, assuming there was a subtle bug such as clock corruption due to interrupts or some kind of type conversion mistake, but I couldn't find anything. I eventually set up a serial printout from the machine so I could see what was happening. It would run and then print out "5 minutes elapsed, shutting down". No glitches or resets (which is what I was expecting).

    So now I'm staring at this one line, "If (run_time > 5 minutes): exit;", pulling my hair out. Finally, in a moment of insane desperation, I added another line to the while loop: "if (4000 > 5000): print("Something is very wrong!");". I carry the machine to the lab and set it up, and IT PRINTS. Every few minutes or so it pops up on the display. So now I'm just like "fuck everything", how can I possibly run code if I can't even trust the basic principle that the computer will do what I tell it to? So the first thing I do is add triple checks to all critical comparisons. That eliminates the symptoms for now, but I know it's going to cause weird problems forever if I leave it like that.

    OK, so the execution is buggy. I get out the scope and check the power line and various other things and it looks OK, but I notice at this point that the problem never occurs when the machine is running empty, only when it's loaded. So I clip ferrites everywhere you can possibly fit one and spend half a day putting metal covers on everything. As I run the machine this time I'm practically holding my breath: 1 run good, 2, 3. I'm getting super excited at this point, then bam, "Something is very wrong!" prints and I die a little inside.

    After walking out to my car and screaming at the sky for a while, I get back to it. At least I know it has something to do with noise. Since the machine can't possibly be more shielded, I take a look at the schematic. It looks normal, but there's a bunch of funky stuff on the reset line. I ask around and nobody knows why it's there. It's got a regular pull-up resistor, but somebody added a diode in series, and a ferrite bead right before the pin. Due to the voltage drop, the MCLR is only being pulled up to 3.9v instead of 5v, so that's not good. Then I take a look at the ferrite on the board, and it's sticking off the board with a coil of wire through it, not 2 inches from a brushed motor the size of my fist. It must be acting like a transformer secondary. I shorted the diode and the ferrite and the problem never happened again!

  • by Cassini2 ( 956052 ) on Monday August 03, 2015 @10:39PM (#50246043)

    while (something) {
    // do_stuff
    } while (something_else);

    It compiles, is legal C, and loops endlessly if something_else is true.

    It can be done in a careless moment when switching a complex piece of code from a while () loop to a do-while () loop.

  • by tdelaney ( 458893 ) on Monday August 03, 2015 @11:30PM (#50246239)

    We had a program that was doing session matching of RTP streams (via RTCP). We had to be able to handle a potentially very high load.

    Things had been going OK - development progressing, QA testing going well. And then one day our scaling tests took a nosedive. Whereas we had been handling tens of thousands of RTP sessions with decent CPU load, suddenly we were running at 100% CPU with an order of magnitude fewer sessions.

    I spent over a week inspecting recent commits, profiling, etc. I could see where it was happening in a general sense, but couldn't pin down the precise cause. And then a comment by one of the other developers connected up with everything I'd been looking at.

    Turns out that we had been using a single instance of an object to handle all sessions going through a particular server, but that resulted in incorrect matching - it was missing a vital identifier. So an additional field had been added to hold the conversation ID, and an instance was created for each conversation.

    Now, that in itself wasn't an issue - but the objects were stored in a hash table. Objects for the same server but different conversations compared non-equal ... but the conversation ID hadn't been included as part of the hashcode calculation. So all conversation objects for a particular server would hash the same (but compare different).

    We had 3 servers and tens of thousands of conversations between endpoints. Instead of the respective server objects being approximately evenly spread across the hash map, they were all stuck into a single bucket per server ... so instead of a nice amortised O(1) lookup, we instead effectively had an O(N) lookup for these objects - and they were being looked up a lot.

    The effect was completely invisible under low load and in unit tests. The unit tests did not verify that the hash codes differed, because two hash codes asserted to be different could, in theory, end up equal with a new version of the compiler/library/etc.
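
    The mismatch between equality and hashing described above can be reproduced in a few lines. This is a sketch with made-up names (`BadKey`, `GoodKey`, `distinct_hashes`), not the original code, which the comment does not show and which was evidently not Python.

```python
class BadKey:
    """Equality uses both fields, but the hash ignores conversation_id."""

    def __init__(self, server, conversation_id):
        self.server = server
        self.conversation_id = conversation_id

    def __eq__(self, other):
        return (
            isinstance(other, BadKey)
            and self.server == other.server
            and self.conversation_id == other.conversation_id
        )

    def __hash__(self):
        # The bug: every conversation on one server lands in the same bucket.
        return hash(self.server)


class GoodKey(BadKey):
    """Same equality, but the hash covers every field that equality uses."""

    def __hash__(self):
        return hash((self.server, self.conversation_id))


def distinct_hashes(keys):
    """Count how many different hash values a collection of keys produces."""
    return len({hash(key) for key in keys})
```

    A unit test does not have to assert that any two particular hashes differ. Asserting that a large batch of keys yields many distinct hash values (say, well over half of them) would catch this bug while tolerating the occasional legitimate collision.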

  • One I'll always remember was some Actionscript or Javascript which would never happen with the debugging console open, but would always halt the program if the debugging console was closed.

    It turned out to be a call to console.log, which is a fatal error in IE if the debug console isn't visible at the moment.

  • When viewing this I got the footer quote "%DCL-MEM-BAD, bad memory VMS-F-PDGERS, pudding between the ears".

    I find it very suitable to this article.

  • Worst bug we ever ran across was a program that absolutely would not work as soon as anyone looked at it to see if it was working or just to observe the GUI. If you did that, it broke. So we spent a LOT of time trying to run it, debug it, rerun it, and no matter what we did it never worked right as long as someone was looking.

    But the moment you stopped looking, locked that PC and walked away, the program would run fine on files dropped into the appropriate input hot folder. It would happily do its thing

  • Most time consuming bug - The AMD cpu stack corruption bug. Errata 721. It took me a year to track it down. Half that period I thought it was a software bug in the kernel, for a month I thought it was memory corruption in gcc. And most of the rest of the time was spent trying to reproduce it reliably and examine the cores from gcc to characterize the bug. Somewhere in there I realized it was a cpu bug. It took a while to reduce the cases enough to be able to reproduce the bug within 60 seconds. And t

  • by jandersen ( 462034 ) on Tuesday August 04, 2015 @03:39AM (#50246995)

    Looking around, it seems that most people take 'tough' to mean 'spectacular'; I disagree with that. I think some of the most difficult bugs are the subtle ones that don't give many symptoms, or which masquerade as something else.

    Probably the hardest one to solve - or the one that required the most insight - was in an application I worked with on Windows NT. The architecture was messy, to say the least, with anonymous pipes everywhere, but the real trouble came from the toolset, which tempted developers into doing stupid things. I think it was written using an IDE for C++ from Borland (I forget the name), and they had got this 'brilliant' idea of making a number of objects that you could drag onto your design surface to create a Windowed application with automatically generated code behind. One class of objects was for things like FTP, etc, which was used in a central place. The problem, as it turned out, after I had thought deeply about it, was that network communication is asynchronous by its very nature, whereas the graphical toolset in Windows is non-reentrant, meaning that it is not a good idea to call functions that update the desktop before they have returned from a previous call. See what I mean: When a network packet arrives, you update your progress bar or whatever, which looks cool - but if the next packet arrives too soon, it tends to kill not just the application, but the whole desktop. The solution was to not use the network objects at all and instead rely on POSIX network calls running in a separate thread and communicating to the main loop via a pipe. Not quite synchronous, but much more robust.
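
    The shape of that fix, a worker thread that never touches the UI and only hands messages to the main loop, might look roughly like this. It is a sketch under stated assumptions: a thread-safe `queue.Queue` is swapped in for the pipe, a list of byte strings stands in for arriving network packets, and appending to a list stands in for redrawing a progress bar. All function names are hypothetical.

```python
import queue
import threading


def network_worker(packets, events):
    """Runs in its own thread; reports progress but never draws anything."""
    received = 0
    for packet in packets:
        received += len(packet)
        events.put(("progress", received))
    events.put(("done", received))


def main_loop(events):
    """The only place where the (simulated) display is updated."""
    updates = []
    while True:
        kind, value = events.get()       # blocks until the worker posts
        updates.append((kind, value))    # stand-in for redrawing a progress bar
        if kind == "done":
            return updates


def transfer(packets):
    """Start the worker, run the main loop until it finishes, return the updates."""
    events = queue.Queue()
    worker = threading.Thread(target=network_worker, args=(packets, events))
    worker.start()
    updates = main_loop(events)
    worker.join()
    return updates
```

    Because every display update is funnelled through one consumer, a packet that arrives "too soon" just waits in the queue instead of re-entering the drawing code.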

  • by speedplane ( 552872 ) on Tuesday August 04, 2015 @04:42AM (#50247129) Homepage
    Many of the "hard" bugs discussed in the article do not seem very hard. Divide by zero errors and a +Inf in an input file are straightforward issues that should be caught using standard practice techniques (bounds checking and exception handling). Two of these three hard bugs would have been easy to catch with version control and continuous integration. It seems like the article is more about dealing with other people's crappy code and poor software development practice rather than debugging nasty bugs.

    The nastiest bugs are almost always race conditions, which are by their nature non-deterministic and may not be reproducible across time or certain hardware.
