Pet Bugs II - Debugger War Stories 121

AlphaHelix queries: "A few weeks back there was an article on Pet Bugs, where people were asked about their favorite bugs. I have a different sort of question: what was your greatest debugging challenge? I've been debugging for a long time, from analog circuits all the way up to multi-kLOC multithreaded servers, and I have some pretty grisly war stories, like the time I debugged a problem in a third-party DLL in machine code because the client didn't have the source for it (yay open source). What was your greatest debugging triumph?" The first time Slashdot did this it was more about bugs that you had encountered (and may not have solved); this one is about bugs in your own projects' code and the trials and tribulations you had to go through to get them fixed.
This discussion has been archived. No new comments can be posted.

  • by Pauly ( 382 ) on Thursday August 01, 2002 @04:43PM (#3994801)

    Once I had to debug a program written in MFC... Wait. Sorry. The memory is too painful to recall.

    • Pussy. MFC debugging is easy. All the source is there on the Visual Studio CDs and the VS Debugger is great. Hell, it's one of the EASIEST things to debug. MFC's problems lie within its architecture, not within the execution of said architecture.
    • ... and I'm already beginning to feel that tingling sensation I get in my fingers and on my scalp when I'm about to lose my temper in a big way. It's funny how my hearing fades away when I'm really angry and I can only hear the blood rushing through my ears... My conscious mind has likewise blotted out most of that excruciatingly painful experience of trying to fix some extremely incapable butthole's code on Microsoft NT. I guess this is probably related to the post-war stress symptoms some soldiers had after returning from Vietnam. Anyway... my psychiatrist says that by forgetting all about it the conscious mind tries to protect itself, because if it didn't blot out all that literally mind-blowing pain my mind just couldn't go on functioning. Anyway... just forgetting about it works for me, but sometimes I get sudden flashbacks of stepping through the MFC code and the fantasies I have about torturing the Microborgs who wrote this piece of crap to death with red hot iron pokers and electrical stunners (they'd put me away if I really told you what I would love to do to them). That was in my rookie years; now if you'll excuse me, I think I have a major flashback coming on and need to be alone...
    • Once upon a time (1994), in a far away place called "BBN" (now 'Genuity', but my division is not part of Brooks Automation), there lived some statistical software called "RS/1". This software spent its days happily compiling and linking and running self tests on VMS (VAX only; the Alpha port is another story), on Unix (HP-UX, Ultrix, AIX, SunOS, and Solaris), and reminiscing about the "old" days when it ran on PDP-11s and IBM /370s.

      And then the good witch of marketing (Hi Peggy!) said, "run on Windows, too. All our customers want it". And the bad old troll called "billg" said, "win32s works just like Windows NT or Windows 95 -- it's a single, stable base of 32 bit code".

      So we get the build elves to build the code, and it all compiles and links, and we run the self checks, and everything works on NT. So we try under win32s. And a moldy, dusty, forgotten old self-test, which had spent a decade waking up, saying "pass", and quietly going to sleep, roused itself, and with a rusty, creaky voice, said, "fail!".

      But that test never fails! It's an old "can we sort a table" test; it uses table (think 'spreadsheet') routines that had, over the years, been built to Keep On Working And Never Stop. And the tables weren't sorting.

      In the end, we put two (wow!) PCs into one office and ran NT on one, and Win3.1 with win32s on the other, and stepped and checked, stepped and checked, until the answer came back.

      When traced down, we learned that the table-sort routines worked by sorting into a "temporary" table -- a table that hopefully is kept in memory, but might be saved on disk. Once the table was sorted, the original on-disk table was deleted, and the new table dropped into place using a 'rename' call. If there was no on-disk version of the original table, then a bunch of buffer magic was made to happen; otherwise different buffer magic happened.

      unlink is defined to return a fail code if you try to delete a file that doesn't exist. On all versions of Windows it was correct except win32s. Our test case, of course, made a temporary table to sort; it never existed on disk, so when the code tried to unlink it, the wrong result code was set and then the wrong buffer magic happened -- and then the table wouldn't appear sorted.

      It never occurred to us that that's where the problem might lie -- who would think that a bad 'unlink' call would result in a table sort failure?

      Peter Smith
      then: BBN Software Products.
      now: WildTangent.com

      Small print: I don't remember if it was the 'unlink' call or the 'rename' call. I seem to remember it was doing an 'unlink'.

  • by Anonymous Coward
    This one time, at computer camp, I found a bug and stuffed it into my pussy!
  • Without a doubt, the most difficult problems I've ever had to debug are the multi-threading/synchronisation issues. You can get good tools to deal with things like buffer overruns or off-by-one errors, but I've never seen anything that helps debug a serious multi-threaded app.

    • Re:Multi-threading (Score:4, Interesting)

      by PD ( 9577 ) <slashdotlinux@pdrap.org> on Thursday August 01, 2002 @04:56PM (#3994879) Homepage Journal
      Not the most difficult bug I ever encountered, but one that didn't pop right out.

      Project was porting code from Solaris to AIX, multithreaded app. At one point in the code, two threads were started, and they needed to synchronize with each other.

      Anyway, on Solaris, the threads would start and interact properly. On AIX, the system would crash. Turned out that right after a thread was started on Solaris, the scheduler would stop one thread and allow the other one to start up, and from then on both threads existed at the same time as they should.

      Under AIX, the scheduler would start a thread, and that thread would run through to completion before the other one even got started. To fix this, we had to add in a rendezvous point at the top of each thread, so that the first thread would stop and wait for the second one to be created.
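
      A minimal sketch of that rendezvous idea, written in Java for brevity (the original fix was in native Solaris/AIX thread code; this Rendezvous class and its await() method are purely illustrative):

      // Each thread waits at the rendezvous until the other has arrived,
      // so neither can run through to completion on its own.
      public class Rendezvous {
          private final int parties = 2;
          private int arrived = 0;

          public synchronized void await() throws InterruptedException {
              arrived++;
              if (arrived == parties) {
                  notifyAll();                  // last arrival releases everyone
              } else {
                  while (arrived < parties) {
                      wait();                   // block until the other thread shows up
                  }
              }
          }

          public static void main(String[] args) {
              final Rendezvous r = new Rendezvous();
              Runnable worker = new Runnable() {
                  public void run() {
                      try {
                          r.await();            // rendezvous point at the top of each thread
                          System.out.println(Thread.currentThread().getName() + " past the rendezvous");
                      } catch (InterruptedException e) {
                          Thread.currentThread().interrupt();
                      }
                  }
              };
              new Thread(worker, "thread-1").start();
              new Thread(worker, "thread-2").start();
          }
      }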
      • I met that pet too but with two different Solaris thread libraries.

        Strangely, my code worked fine with the first one and froze with the second one... until I trussed the process, which made it go on :-o
        It took me some time to figure out that, with the second thread library, a thread that owned a mutex, released it, and immediately claimed it back would always remain running, never giving another thread a chance to be scheduled and thus to acquire the mutex.
        It wasn't a bug in the library, as POSIX doesn't guarantee a descheduling on mutex release, but a lack of an explicit yield in my code.

        Just another illustration of experimentation being definitely not a good strategy when dealing with multi-threading.

    • Agreed. Working right now on porting KERNEL CODE from Solaris to Linux. Not exactly a straightforward port. Anyway, there's like 50,000 lines of code, and it's all reentrant. I have a deadlock to find that happens with interrupts disabled!

      If only I had an ICE.

      I have a feeling that when I'm done I'll wish I had hair.
    • Re:Multi-threading (Score:2, Informative)

      by ssun ( 87579 )
      Take a look at VeriSoft, "a tool for software developers and testers of concurrent/reactive/real-time systems."
      http://www.bell-labs.com/project/verisoft/ [bell-labs.com]
    • Re:Multi-threading (Score:2, Informative)

      by hotpotato ( 569630 )
      For those working in the comfy Java world, there's JProbe Threadalyzer [sitraka.com]. It can detect deadlocks, race-conditions, and other such niceties, and display them visually.
  • Linus Torvalds, widely respected throughout the industry (yay open source!) as a programmer par excellence, has stated in public that debuggers are for wimps.

    Thanks to Linus, we of the Free Software community can rest assured in the knowledge that we have the most stable, most secure operating system in the world. And what is it that makes Linux so great? Why, the fact that it was debugged entirely with printf's.
    • Yeah, well you can't find the real bugs with a kernel debugger anyway. You need some kind of hardware assist. For everything else that a kernel debugger does help with, printk debugging works just fine. It just takes longer.

      I'm impatient, so I use kGDB. Or maybe I am a wimp. Whatever.
    • Er, isn't that printk()?
  • Back in the day I would have nothing but trouble with this:
    String x = "anything";

    if(x == "anything") doSomething();
    I could never understand what was going on - until it hit me that I was being an idiot and needed to use .equals().
    • Yeah, I banged my head against the wall with that one for a long time before figuring it out.
      One language indivisible by architecture, where no two objects are created equal
    • but be careful,
      x.equals("anything")
      can throw a NullPointerException if x is null. Never count on a param never being null, no matter how well you document that fact.
      (x != null) && (x.equals("anything"))
      is a common code pattern, but did you know that equals() already checks for null? So you are checking twice?
      "anything".equals(x)
      is a quick way to code it, both because it takes less code space and because fewer instructions are executed.

      Now if you used constants instead of free strings you would have been ok....

      static final String ANYTHING = "anything; /*...*/
      String x = ANYTHING;
      if (x == ANYGHING) { /*...*/ }

      • static final String ANYTHING = "anything; /*...*/
        String x = ANYTHING;
        if (x == ANYGHING) { /*...*/ }


        Ahh, the irony.
      • The really sad part here is that certain types of people rail on C and C++ for having pointers, and consequently being susceptible to null-pointer bugs, comparing at the wrong level of indirection, etc, and yet here is an equally subtle and unfortunate situation in Java, a language much hyped for its improved safety.

        I can see a good reason for low-level languages to provide this level of control, and the consequent risks associated with it. Surely, though, it would be better for most applications if higher level languages prevented such things happening at compile-time, rather than leaving you to clean up the mess in debugging (assuming, of course, that you actually hit the code in question during your testing). Until then, threads like this will forever feature bugs that should never be able to happen...

        • certain types of people rail on C and C++ for having pointers, and consequently being susceptible to null-pointer bugs, comparing at the wrong level of indirection, etc, and yet here is an equally subtle and unfortunate situation in Java, a language much hyped for its improved safety.

          The stuff that happens when you dereference a null pointer in Java is nothing like what happens in C. If you think it's "equally subtle and unfortunate" you're crazy. The JVM instantiates a NullPointerException and propagates it up the call stack. You can catch it at any level and even use the exception for flow control purposes (although this only seems clever and really isn't a good idea for anything besides last minute emergency bug fixes). In C, dereferencing a bad pointer is like pissing on an electric fence. It's nondeterministic. You're not running bytecode- that's real machine code.

          The issue with == and equals() just comes from unfamiliarity with the language. The == operator in Java compares pointer values. Once you learn how that works, the problem goes away.

          Surely, though, it would be better for most applications if higher level languages prevented such things happening at compile-time, rather than leaving you to clean up the mess in debugging (assuming, of course, that you actually hit the code in question during your testing). Until then, threads like this will forever feature bugs that should never be able to happen...

          It sounds like what you want is more checked exceptions. Java could easily have been made to be like this. They could have defined NullPointerException, ArrayIndexOutOfBoundsException, ClassCastException, etc. as being checked exceptions (which don't extend RuntimeException). So every single method call, field access, or array element expression would have to be wrapped in a try{...}and have a "catch (NullPointerException e) {...}" or "catch (ArrayIndexOutOfBoundsException e) {...}" block underneath, or else the compiler would complain. And you'd go nuts! You would swallow the exceptions the way people always swallow checked exceptions, just to make the compiler shut up. Then your program would silently fail and continue... then fall flat on its face some time later and you wouldn't know why.
          Checked exceptions are good in theory but they have problems that have a lot to do with psychology. Impatient people don't like to spend time writing error handling code. And it only takes one guy in a development team who swallows checked exceptions to make the whole idea useless.
          Microsoft decided to completely do away with checked exceptions in C#. You can compile anything without writing a single try/catch. That's actually probably going too far. Some checked exceptions are useful (SQLException, IOException, etc.) and force you to write error handling code you should really be writing. But Java really has too many of them. CloneNotSupportedException? Why am I forced to catch that one? That's ridiculous.
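          To make that concrete, a small hedged sketch of the swallow-to-satisfy-the-compiler pattern (the class and method names are invented; note that in real Java NullPointerException is unchecked, so the catch below is optional rather than forced):

          public class Swallowed {
              static String describe(Object o) {
                  try {
                      return o.toString().toUpperCase();
                  } catch (NullPointerException e) {
                      // "handled" just to quiet the (hypothetical) compiler check --
                      // the failure is now silent
                      return "";
                  }
              }

              public static void main(String[] args) {
                  System.out.println(describe(null));   // prints an empty line, with no clue why
              }
          }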
          • The JVM instantiates a NullPointerException and propagates it up the call stack.... In C, dereferencing a bad pointer is like pissing on an electric fence. It's nondeterministic. You're not running bytecode- that's real machine code.
            Don't be ridiculous. Machine code vs. bytecode is irrelevant. On POSIX systems, dereferencing a null pointer causes a SIGSEGV signal to be sent to the process. SIGSEGV is catchable, so it would be fairly trivial to simply throw your own nullpointer exception in C++, or to integrate it with whatever exception mechanism you may have built in to your C program. I suppose you could even set SIGSEGV to be ignored, but POSIX says that the result is undefined (could be a fun source of bizarre program failure :-)
            • Well, when you compare Java to C, you have to be careful to avoid apples-to-oranges comparisons because there's Java the language and Java the platform (i.e. the JVM). In Java, the JVM has a standard way to catch null pointer dereferences that has its hooks embedded in the language syntax. (C# is the same way- although they toot their horn about "language independence" it's all marketing- it's really the same type of setup as Java with the same kind of blurred distinction between the language and the CLR.)
              C and C++ are platform-agnostic languages. They don't come with baggage like a pseudo-OS, the way Java and C# do, and so while it's possible to catch a null pointer dereference in the form of a SIGSEGV signal on a POSIX system, that's really a feature of the OS, not the language. The language doesn't get in the way of this, but that's because it has left the behavior undefined in the first place so that POSIX is free to define it using a process signal mechanism.
          • I'm sorry, perhaps I wasn't clear. I wasn't having a go at Java for the way it handles null pointer situations; I completely agree that things like checked exceptions rarely help in practice. I was having a go at Java because the situation can arise at all in the first place. This is sensible in a low-level language like C. It makes little sense in a high-level language to force everything to be a pointer that could be null and thus needs checking in some way. My argument is that the way to resolve this issue is not to fix the problem, but to prevent it from ever happening in the first place.

            • Well, yeah it is annoying to have to write if (x!=null) all the time. I guess there are ways you could fix it in an HLL. You could make the language not use references and make everything pass-by-value. A lot of languages do this, but it involves inefficiency and duplication. Unless you want to work at a really high level and don't care too much about efficiency, you have to be able to handle things by pointing to them. Even if you can't do pointer arithmetic, pointers are way more efficient than handling everything by value.
              Or, you could keep the references and avoid the exception throwing part, by taking the approach used with other primitives like int, float, etc. If you divide by zero in Java you don't get a "DivideByZeroException", you just get NaN. So null would become a sort of magic default object that takes the form of whatever reference type it's being handled as. (A null String would return null from substring() instead of throwing an exception, etc.) But that doesn't make much sense. Calling a void function on a null would have no symptoms at all. And it would duplicate the headaches with NaNs. If you divide x+y by x-y and add it to z, and x and y happen to be 0, z is set to NaN. If you add the NaN to something else, you get NaN. Soon the NaNs are spreading all over the place and you can't figure out where they came from! So that's a bad idea. The exception is better.
              Although it is running virtually, and is always hyped as being high level, Java isn't really a high level language. It's much higher up than C or C++, and if you listen to people who use those languages, you would think Java is like Tcl or Lisp. But it's designed for midlevel processing, where you aren't working on bare metal but speed and algorithmic efficiency are still concerns. So there will always be issues with pointers.
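              A quick illustration of the NaN-spreading behaviour described above, using doubles (integer division by zero in Java throws ArithmeticException instead, so this applies only to floating point):

              public class NanDemo {
                  public static void main(String[] args) {
                      double x = 0.0, y = 0.0, z = 1.0;
                      z += (x + y) / (x - y);            // 0.0 / 0.0 is NaN, so z becomes NaN
                      double w = z * 100 + 42;           // the NaN propagates through every later use
                      System.out.println(z + " " + w);   // prints: NaN NaN
                      System.out.println(w == w);        // false -- NaN is not even equal to itself
                  }
              }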

          • I hate having to wade through so many varieties of exceptions when trying to use Java Classes (not classes, Classes -- instances of the java.lang.Class class).

            Sidenote: It bothers me to no end that there's no way to get a Class object for a class through a static method. I realize something like that would go against the lack of static method inheritance in Java, but Object.getClass() must contain some pretty nasty voodoo anyway.

            As it is, I have to either:

            (new Foo(...)).getClass()

            which gets really annoying if Foo has no default constructor or uses a lot of resources on creation

            or

            Class.forName("pack.Foo")

            which is somewhat more elegant, except that I have to handle an exception where the string constant I give doesn't match a class. That would be moot -- as it should be -- if the class name could be checked at compile time.

            Add on another load of exceptions if you want to use Methods or Constructors (again, distinct from methods and constructors), and for the same reason... methods can't have their corresponding Method referenced statically at compile time.

            Grumble.
            • reflection (Score:3, Insightful)

              Reflection is a nice feature in Java except they made it a pain in the ass to use. I work on a product that is a Java application running on Windows, Linux, Solaris, and Mac (both OSX and Mac OS 9). Because we are still supporting Mac OS 9, we cannot use a Java 2 compiler at all- so we are squeezing the entire tree through Sun's 1.1.8 compiler every night. (So we're still writing Java 1.1 code! In this day and age! If you call any Java 2 method, like add() on a Vector, it breaks the nightly build.)
              Now there are some things that our customers want that absolutely require Java 2, like drag and drop. If you are running Mac OS 9, drag and drop won't work in our program. Sorry. But we have it working for everybody else on all other platforms- by using reflection to access the DnD classes! And the code looks horrible. One line of ordinary code balloons to five lines of incomprehensible gibberish when you use reflection.
              The way I see it there are two primary uses for reflection. One is the use that Sun originally intended- for people writing IDEs, bean containers, debuggers, profilers, etc. The other is for people like us, who are compiling against a fossilized version of the JDK but need to introduce some forward-compatibility and access classes we know are usually there but we can't compile against statically. Sun's attitude is always to tell all customers to upgrade to their latest and greatest version of Java. (Sun's inability to take on the backward-compatibility issue from either a design or a policy perspective is really annoying. It's what killed the whole applet idea. And now their JDK 1.4 compiler is spitting out classes with version numbers that make old software freak out. I still have to find the compiler switch that turns that off.)
              I think it would be cool if Java had a "reflection" keyword with which you could declare a block of code as being dynamically and not statically compiled- so you could write ordinary code in there and the compiler would break it down during a preprocessing step into the required Class/Method/Field gibberish and let you catch something like an "UnsupportedApiException" in a catch block underneath. Of course, the chance of that happening is zero, and even if it did happen, the 1.1.8 compiler wouldn't understand it anyway. Does anyone know if Sun has any plans for introducing a standard for compiler extensions? It strikes me as a move that would involve relinquishing too much control.
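              For anyone who hasn't had to write that kind of code, here is a rough sketch of how one direct call balloons under reflection (the Widget class and setDropTarget method are invented names, not the poster's actual DnD code):

              import java.lang.reflect.Method;

              public class ReflectiveCall {
                  // The direct call -- only compiles if the class is on the compile-time classpath:
                  //     widget.setDropTarget(target);

                  // The reflective equivalent, for when the class may not exist at compile time:
                  static void setDropTarget(Object widget, Object target) throws Exception {
                      Class cls = Class.forName("com.example.Widget");           // invented class name
                      Method m = cls.getMethod("setDropTarget",
                                               new Class[] { Object.class });    // look the method up
                      m.invoke(widget, new Object[] { target });                 // finally, the call itself
                  }
              }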

          • In C++, you could get much the same benefit: a pointer that checks itself for null each time it is dereferenced and throws an exception if it is. Just use a smart pointer template class which is smart in that particular way.

            Save the unchecked pointers for the most speed critical part of the program.
    • String x = "anything"; if(x == "anything") doSomething();

      Actually, the 'doSomething' method will be executed. Because Java guarantees (I think) that String literals of equal values will be assigned the same object, you can sometimes use '==' to speed things up. But yeah...if you read 'x' in from a file and compared it to a literal, it would be false.

      • by Anonymous Coward
        Java does _not_ guarantee that. It does guarantee that for the special case of a = "hello", b = a, then b == a.

        Once you realise that Strings are Objects, all Objects are by reference (essentially any variable that is a subtype of Object is a pointer^H^Hreference), it makes perfect sense. == is a reference comparison, = is a reference assignment - so a and b point to the same string object.

        This also has implications for function calls. Java is NOT a pass-by-reference language, but a pass-reference-by-value one. For the most part, these are effectively the same thing, with one major difference. If I write

        void baz() {
            String bar = "hello";
            foo(bar);
            // [more code]...
        }

        in Java, the call to the foo method CANNOT change bar to point to a different object in the enclosing scope [more code] of baz(), but it can modify the object bar points to. In a true pass-by-reference language, foo could change bar to point to a different object, and the change would affect [more code], i.e. it can modify bar itself...

        This is a subtle distinction, and one that can catch people out in Java and Lisp.
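
        A compact, runnable sketch of that distinction (Holder and the two helper methods are made-up names):

        public class PassReferenceByValue {
            static class Holder { String value = "hello"; }

            static void reassign(Holder h) {
                h = new Holder();                // only the local copy of the reference changes
                h.value = "reassigned";
            }

            static void mutate(Holder h) {
                h.value = "mutated";             // the object the caller sees is modified
            }

            public static void main(String[] args) {
                Holder bar = new Holder();
                reassign(bar);
                System.out.println(bar.value);   // still "hello" -- the caller's reference is untouched
                mutate(bar);
                System.out.println(bar.value);   // "mutated" -- the shared object was changed
            }
        }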

        • No...he's right. Java only creates one object for all String literals of the same value in the same class. This works out, because Strings are immutable in Java.

          This can be found in the Java language specification [sun.com].

          In general, the parent of this post is correct. In that specific instance, the grandparent is correct.
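
          A short runnable sketch pulling this sub-thread together -- literal interning, ==, equals(), and the null-safe constant-first idiom:

          public class StringEquality {
              public static void main(String[] args) {
                  String a = "anything";                    // literal -- interned by the compiler
                  String b = "anything";                    // same literal, same interned object
                  String c = new String("anything");        // forced to be a distinct object
                  String d = null;

                  System.out.println(a == b);               // true: both refer to the interned literal
                  System.out.println(a == c);               // false: different objects, same characters
                  System.out.println(a.equals(c));          // true: compares contents
                  System.out.println("anything".equals(d)); // false, and null-safe: no NPE is thrown
              }
          }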

  • by ghamerly ( 309371 ) on Thursday August 01, 2002 @05:01PM (#3994896)
    My favorite bug was allocating memory inside of an assert() using VisualC++ (I hate MS tools; I had to use it for work).

    So the gist of the code went something like this:

    ...
    0. int *array;
    1. assert(array = new int[SIZE]);
    2. for (int i = 0; i < SIZE; i++) {
    3. array[i] = i;
    ...
    and the code would segfault on line 3. So I brought it into debug mode, and stepped through. But it worked fine. Back to release mode, and it segfaults.

    To restate, here we have the classic example of something you don't want: it works fine in debug mode, but it bombs in release mode.

    Of course, since I have simplified the code the answer should be obvious -- in release mode, VisualC++'s compiler was stripping out the assert(), and the allocation inside. In debug mode, it left the assert() in, so the allocation worked fine. I had never changed a flag that said I wanted it to strip them, so I assumed it wouldn't. Never trust M$...
    • by codexus ( 538087 ) on Thursday August 01, 2002 @05:09PM (#3994945)
      Never trust M$? I'm sorry but it's clearly documented that the asserts are stripped from the release code. The macro to use for code you want to check in debug mode but still execute in release mode is VERIFY()

      I'm no fan of Microsoft, but it's a bit easy to blame them for your own mistakes.
      • Yes, yes, it was my own fault for using an assert like this. I should not blame MS for this, you're right, but that was not the point of this post.

        THE POINT OF MY POST WAS: this bug presented an interesting problem (difference in debug builds versus release builds), and the fact that sometimes you cannot reproduce bugs in debug mode that occur in release mode.
        • I understand the point of your first post, but the guy flamed you because you added your own little comment about not trusting Microsoft tools.
          Had you changed your phrasing, everybody would have been happy and would have understood the true essence of your post...
    • by ComputerSlicer23 ( 516509 ) on Thursday August 01, 2002 @05:17PM (#3994989)
      Hmmm, you broke the rules from the MS Press books. Both "Writing Solid Code" and "Code Complete" specifically say never to have code with side effects inside an assert statement (more generally, no side effects in debugging code). Both are outstanding books that have led me down the path so that I don't have war stories about debugging anymore. That, and I don't do embedded programming anymore.

      Good books. MS tools are weird to me mainly because I like my command line a bit too much, but they publish some damn fine books about programming.

      This is also the kind of bug that's easily found by reading the output of gcc -E, your best friend when debugging code that has macros anywhere near it.

      I had never changed a flag that said I wanted it to strip them, so I assumed it wouldn't. Never trust M$

      I hate to post a flame, but RTFM. On every compiler or tool you ever use, spend several days reading the manual and all associated docs you can find. Knowing how the compiler works, and how all the tools work, is a hallmark of all the finest programmers I know. I used VC++ a handful of times 5 years ago, and I could have told you the asserts were stripped in release mode. All you have to do is look at the full list of options it puts on the command line. That's relatively easy to find in the menuing system on VC 4.0 (the only version I used). The -DNDEBUG=1 flag turns off asserts.

      Kirby

      PS: Other than the keyboard and mouse I use, I haven't used a Microsoft product on a daily basis in years. It's about craftsmanship, and knowing your tools.

      • Aren't you an astroturf shill?

        If not, you're working for free, my friend.

        • Hmmm, I'm not sure exactly what you're implying, but MS isn't paying me money for any kind of promotion. Goodness knows I'd never use one of their software products for anything important.

          I could supply good reference material on a number of programming topics; it's just that the MS Press ones are some of the best I know of on that topic. Some of the Extreme Programming stuff has similar ideas, but I haven't read those too closely. I know that Large Scale C++ Programming has similar advice, but it is more advanced. It also isn't nearly as much "do this, don't do that" advice; it's bigger picture, not nuts and bolts. The Mythical Man-Month is one of the finest books about engineering project management I've ever read, but again not much in the way of practical everyday usable advice. Those come from a variety of publishers, and aren't as focused -- is that less astroturf'ish?

          I merely post my experience and point out honest-to-goodness good reference material, and it's astroturfing. Hmm, possibly it could be "grassroots", "friendly" advice, you know, the thing that astroturfing simulates. Some people still point out information for free, with no reason other than to be helpful.... Granted, mine came with a flame for an extra bonus.

          However, after reading a number of your posts, you seem too cynical to believe that is possible. Either way, my day job pays the bills, so I don't need to pick up change from MS for saying nice things about books they published 10-15 years ago. I'm also curious about your sign-on: does Cricket or Liu give you a kickback for promoting their very fine book (I hear the 4th edition is great, but I haven't picked it up yet; owning the 1st and 3rd seems enough for me, until I need to set up BIND 9)?

          Kirby

          • Don't worry about it my friend - anyone with a 23 in their handle is OK with me. I picked up my own handle due to it being on a book within my field of vision when I finally decided to bite the bullet and get a stupid slashdot account instead of posting anonymously like I should be able to.
      • by Martin S. ( 98249 )
        you broke the rules from the MS Press books

        Um, Microsoft rules eh; if that's not ms-troll I don't know what is...

        However, you are missing the point; setting aside that it should have been implemented as assert and not Assert, it was the non-standard behaviour in what was supposed to be an ANSI-compliant C++ that is the real issue. So yes, that behaviour is non-standard, and practically every C++ programmer I know who used VC++, including some world-class acts and apparently plenty of Slashdotters, fell for it. It is the Microsoft convention that was/is wrong; it is counter-intuitive, so yes it was/is YAMB (Yet Another Microsoft Bug). The MASM segment alignment issue is another example of the same attitude.
        We are always right and that is the way it works,
        that is the way it works, so you are wrong,
        if you are wrong, we are right;
        we are always right. Ad nauseam.

        So that is why, when Microsoft breaks a convention or standard, the fault is everybody else's.

        Well, frankly, you need to grow up and downscale your ego, think freely instead of engaging in groupthink, and start listening to others. Don't you know the customer is always right? So if we raise something, you should apologise for wasting my time, thank me for the contribution, and not under any circumstances imply I'm stupid simply because my opinion differs. That really pisses me off about Microsoft consultants, oops, I mean 'evangelists'. I wonder what prat thought it a good idea to send out 'evangelists' to preach 'belief' to 'Engineers'?

        The best laugh I ever got at an evangelist's expense was telling him that nature provided him with one mouth and two ears, and perhaps he should use them in that ratio, and then he might understand our requirements. Did he shut up? Well, for about 3 seconds, before launching into his spiel.

        So yes, non-standard behaviour is YAMB (Yet Another Microsoft Bug); it is not an RTFM issue.

        • I'd highly recommend you go *READ* the two books in question, "Writing Solid Code" and "Code Complete". Both of them are generic books about good old-fashioned C code, other than referencing MS projects they worked on (which I'm not sure if they do in those books; it's been 5 years since I read them). They clearly talk about using the assert/Assert/ASSERT macros. The assert macro is standard, and it is standardized so that if you define NDEBUG it isn't in the code. I own a copy of the C++ standard, and I know it is in the C standard. This is merely a case of knowing your tools. Go read assert.h in /usr/include on a RH7.2 box. It acts precisely the same as the poster described.

          I wouldn't use an MS program if my life depended on it. I game on one at a friend's house once a week; other than that, I never use the stupid things any more. If you're coding in C/C++ and don't know the command-line arguments you're passing to the compiler, you're not a good craftsman and don't know your tools. There's a reason I've read the entire gcc/g++ man pages, why I own the FSF's Using and Porting GNU CC manual, and why I've read it several times. I haven't used MASM ever, let alone enough to know what the hell the segment alignment issue is. It's probably idiotic, and it's the same reason I use the gcc compilers on all platforms, including Cygwin and the OpenStep compilers 3 years ago when I worked on multiplatform code that had to run under Windows.

          Realize that not all things published by MS Press are marketing materials, and that they have some good advice in them about how to code. It's a flat out fact. They have zero to do with Microsoft products. Merely how to write good solid code that is easy to maintain and debug. They publish lots of good lessons that too many people learned the hard way.

          If you insist on learning them the hard way, that's fine. I really couldn't care less. No evangelising here. Merely know your tools. If he'd done the same thing in KDevelop, I'd still tell him to RTFM. KNOW YOUR TOOLS. If you don't know how your tools work, even if they're MS tools, you should spend more time learning about them. It's time well invested. If you don't know how assert works, and you didn't read the implementation, and didn't know how to get the preprocessor output from your compiler, maybe you should spend some time learning how to do that. It's very useful.

          Trust me, go read the books. They are very good, and you won't have to wash your hands to get the MS feel off of them.

          For the record, the three machines in my home run Linux. I own official copies of RedHat 7.[0123], 6.[012], 5.2, and one from the 4 series. I have the boxed set of FreeBSD 4.4. Somewhere packed away in a box I have an MS Win95 CD that came with my Pentium 100. Of the last 6 machines I've had running in my apartment, not a one of them has ever had Windows on it, ever. No MS troll here. My entire company's servers run on Linux (they insist on Win98/2000 on the desktop; I run Linux on my machine), and I'm posting from Mozilla. Free Software guy all the way. Sorry, no MS troll. Go read the books. Go code. Come back when you get the fact that knowing your tools, and how they work, is important. If you don't get that, you're wasting your time.

          Kirby

      • Writing Solid Code is an excellent book. Too bad it's from Microsoft and tends to be ignored.
    • As others have noticed, this one is somewhat your own fault for failing to RTFM; the standard assert() and MS-specific ASSERT() and VERIFY() features all quite deliberately have this property.

      This does raise an interesting question, though: should you have a separate "debug" build? I'm coming around to the view that, for most applications today, the answer is probably no. A separate debug build has traditionally been used to provide extra diagnostic information and checks during development, which are then stripped out before the program is released by building in "release" mode. On the other hand, this has the following downsides.

      • The release code may be subtly different in behaviour from the debug code. You either have to run all your final tests on release code (and lack diagnostic info if things go wrong) or test against the debug build and hope no-one put a side-effect inside an assert() or #ifdef block.
      • For much the same reason, it is often helpful to include diagnostics, albeit normally hidden ones, in a production version of code. Even Microsoft have started doing this overtly recently, with their "report this bug" features. And why not? After all, when your client says "it crashed", would you rather work from their slightly misunderstood idea of what they saw, or your own automatically-generated detailed diagnostic file that they e-mailed through to you?
      • Debug-only code is a pain to manage as well; as you read down your source, it may or may not be obvious when different code is being included, or not. In many modern programming languages, exceptions and a suitable catch-all at the top of your code provide a far more graceful way to close down after a fatal logic error (or to recover after a non-fatal one) than the traditional assert(), which typically dumps you out unceremoniously and immediately if anything goes wrong -- not acceptable behaviour in most modern applications, as far as I'm concerned.

      So, I hereby begin the campaign to abolish separate debug and release builds in non-performance-critical applications. :-)

      • Yes, it was my fault -- bugs are the fault of the programmer!

        But thank you for bringing more light on the topic I was raising -- that bugs in release mode may not be replicable in debug mode, which leaves you with a very strange situation.

        I disagree that debug builds should be disavowed altogether; using a debugger can still be very useful.
        • I disagree that debug builds should be disavowed altogether; using a debugger can still be very useful.

          Sorry, that wasn't what I meant to suggest at all. My argument is that you would often be better off developing a hybrid build where diagnostics were routinely built in anyway. Your final build would be closer to what is currently a "debug" build than a "release" one, albeit perhaps taking a different approach to handling/logging failures.

      • compiler optimizations such as code motion, stack compression, local elimination (register usage), frame pointer omission and inlining often make debugging a release build impossible. some debuggers handle this stuff better than others, but none that I've seen do it acceptably.

        in my view you can't have enough assert()s, and it's unlikely that you'll want to kill the performance of a release build by including all this diagnostic code. if your assert() macro is a pain to use (such as the standard unix printf/exit behaviour), then use something different. I like the MS abort/retry/ignore assert dialog that gives you a choice to quit, attach a debugger, or ignore the assert.

        for debug builds you often want to make use of other diagnostic utilities such as heap consistency checkers which alter the behaviour of malloc, that would definitely hurt performance of your app were you to include them in the final version.

        • You make what is, I think, the only reasonable counter to my argument: debugging might actually be hurt if you always work with optimised code. I agree that, at present, it can be difficult to debug "release" code if you need to get down to the assembly level. However, if you're debugging at that level, it is questionable whether the results of your debug build and release build will actually be usefully comparable anyway, since surely they are quite different at assembly level.

          If you don't go that low, there is no reason why debuggers shouldn't present your code in whatever language you're using the way they do any other code, whether it's been rearranged due to optimisation or not. Granted they can't at present, but they don't handle things like reentrant/multithreaded code, inline functions and generic functions (particularly in languages like C++) very well at present, either. As time goes on, development tools will get more powerful and these things will hopefully improve. And of course, you can always just switch off optimisations from a compiler flag if they're really getting in the way during a debugging session; they don't change your source code, which is really what this is all about.

          I see the performance hit of adding diagnostics as far less of a concern. Many applications are held up waiting for the user, system resources, etc. The cost of adding assertions by the legion to such code is negligible (see Eiffel, where it's standard practice to assert() left, right and centre). Similarly, compared to the cost of a memory allocation or release in the first place, the cost of framing your memory and checking for errors is normally small. Obviously, there are some processor-intensive applications where this reasoning doesn't hold, and that is why I left myself a loophole in my original post. For most things on most hardware used today, though, I suspect the pros now massively outweigh the cons, and it's just that old habits die hard.

      • hope no-one put a side-effect inside an assert()

        If assert() were made part of the language, maybe the compiler could make sure that there are no side-effects in the expression. Java's assert is part of the language but unlike C++, it doesn't have the 'const' modifier to mark whether functions have side-effects or not. Could a compiler-checked assert be added to C++ (kind of like the format of printf is checked by the compiler, even though it's not really part of the language)?

        • Could a compiler-checked assert be added to C++

          I don't think so, at least, not yet.

          The problem here is that while const is a laudable idea from some points of view, it's currently a mess. Applied to a member function, it doesn't actually mean "no side effects", it means "this is a member function of a class that will not change the state of the object on which it is called". That, in turn, can be interpreted either in a bitwise sense (absolutely no member data is changed) or, usually more helpfully, in a logical sense (no externally visible state is changed). const can't be applied to any other function. Applied to data, it means either "this data is genuinely constant", or "this variable will not be modified once initialised".

          I have never quite bought into the "const-correctness is sacrosanct" approach because of these limitations. I'm all for a more declarative style of programming and making applicative behaviour the exception rather than the norm; this would have well-documented advantages for languages like C++. OTOH, the concepts of lvalue/rvalue and of side-effect or non-side-effect implementations are at least as useful as const in practice, but currently not supported by the language. One of these days, I'll get around to finishing an article I started on this subject a while back and put it up on the C++ newsgroups to see what the "experts" think... :-)

        • Though it's not obvious how to do it, it is possible to make some assertions work at compile time in C++.

          For a good example of this check out the boost library [boost.org] and look for static_assert [boost.org].

      • I would like to second that notion. Having separate builds for release and debug can hurt the stability of your product. It is well known that the development and production environments should be as similar as possible. This includes issues such as hardware, compiler and many more, and is done to ensure that what works in dev also works "in the field". But using debug and release builds ruins those efforts, because one of the most basic things, the code that is being executed, is quite different. Of course, some would say that release code is a lot faster than debug code. But is that gain in performance (which definitely exists) worth it? Most desktop applications won't be slower in debug mode, IMHO. That's why I don't like the assert macro (or the assert keyword in Java). If you want code in the debug phase, you want it in production as well!
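        The same side-effect trap exists with the Java assert keyword mentioned above, since assertions are disabled at run time unless you pass -ea; a tiny sketch:

        import java.util.ArrayList;
        import java.util.List;

        public class AssertSideEffect {
            public static void main(String[] args) {
                List items = new ArrayList();
                // The side effect (the add) lives inside the assert, so with assertions
                // disabled -- the default -- the list is never populated at all.
                assert items.add("first");           // List.add returns true, so the assertion "passes"
                System.out.println(items.size());    // prints 0 without -ea, 1 with -ea
            }
        }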
    • Aside from the flamewars below (flaming because someone mentioned an MS book? geez), the problem is still yours.

      assert() isn't an MSism, it's an ANSI C-ism. It specifically tells you it is only used to test code and is stripped out by all compilers if you define NDEBUG, which a VC++ release build does. You just didn't know what assert()s did, nor what a VC++ release build meant. You can look at the -D flags VC++ sets in release vs. debug builds (I forget how; I haven't used it since version 4 something).

      Since assert() worked properly, it means you coded it wrong. You should have coded this as

      1. array = new int[SIZE];
      2. assert(array != NULL);
      3. for (int i = 0; i < SIZE; i++) {


      The assert() is only a test, and can be safely removed in a VC++ release build.

      As a side point, this should never have been an assert() anyway. assert()s are meant for algorithm and data structure consistency checks, where a failed assert() means a programming error. Say, you do a consistency check when adding to a linked list. Look up preconditions and postconditions in any good programming text, there's where you put assert()s.

      In this case, it's a run time thing. The assert() fails if we don't have enough memory. Has nothing to do with the correctness of the algorithm, nothing about a bug, just quit other programs or add memory.

      As a side note, this is all moot anyway. If this is any recent spec compliant compiler, new will never fail and return NULL, but it will instead throw an exception, which if uncaught, will terminate your program.
  • by codexus ( 538087 ) on Thursday August 01, 2002 @05:03PM (#3994910)
    That one is an oldie :) Back in the Amiga days I had made this game that worked fine on my A500 but stopped working after a while on most other A500s. That was strange, as the machines were supposed to be identical, and I couldn't do more testing at home.

    So I used the Action Replay cartridge. For those who don't know about Action Replay, those were "hardware debuggers" that plugged onto the bus and could stop and restore the execution of the running program. They were very powerful debugging tools.

    After inspecting the contents of the hardware registers thanks to the Action Replay, the result was that on some revisions of the A500 motherboard the audio interrupts had a slightly different timing that caused an improbable case where the audio samples always stopped playing at offset 0, retriggering an audio interrupt as soon as one was handled.

    The Amiga was so much fun...
    • This kind of thing's more common than a lot of people realize. A while back, I was in the OS group for one of the earlier SMP vendors. The chip we were using at the time was the Motorola 68040, and I discovered that there were serious differences between different chip revisions. For example, one revision would push a six-byte frame onto the stack for a particular floating-point exception, and another would push a ten-byte frame. Run code written for one revision on the other and you'd trash your stack. What we had to do was extract the chip revision at startup (somewhat of an interesting hack in itself, IIRC) and use it in our exception handlers to figure out where stuff was. Ick.

  • Video games (Score:5, Funny)

    by inkfox ( 580440 ) on Thursday August 01, 2002 @05:05PM (#3994918) Homepage
    I don't think I caught the original article; this was a fun one though:

    When we were working on a game title on the Nintendo 64, we were maintaining a parallel version for Windows, as it was substantially easier to develop for. Basically, we created some rendering, sound, controller and I/O code for Windows that duplicated what we'd created on the N64.

    At one point, we were trying to find some problems with the camera behavior, so we created a flying camera object that coincided with the real camera. It looked like an old Hollywood camera, though the lens cap, reels, everything was just flat black. Then, we'd set up a fixed camera and watch what the game camera would be trying to do by observing where the flying camera went.

    Time passed, and we'd forgotten about the added camera altogether. Then, as we were approaching a critical milestone, we went to bring the N64 build up to date... and the screen was black.

    The game seemed to be playing, the menus were there, and the framerate counter was up, but - black. The Z-buffer had a constant value all the way across it, meaning there was some mysterious polygon that was exactly covering the screen all the time.

    We were there until something like 2am, trying to figure out what the hell was going on. We were risking blowing this milestone, and with that, taking on a pretty hefty late delivery penalty.

    We finally figured it out. Stepping back... well, the N64 engine didn't support back-face culling at that point, whereas the Windows engine did. So what's the upshot of the whole thing?

    We'd left the lens cap on.

  • No contest. Bad DMA. (Score:5, Interesting)

    by inkfox ( 580440 ) on Thursday August 01, 2002 @05:12PM (#3994959) Homepage
    As a game programmer, and as someone who's banged on device drivers, I have to say there's no contest: The most fun bugs result from errant DMA.

    It's really easy to set up a bad DMA chain on most architectures, and when that happens, it can do wonderful damage that's tough to reproduce.

    One of the more fun things about it is that DMA generally ignores the MMU completely, so you can consistently trash whatever's at some physical address time and again, ignoring all protection.

    Even better, DMA doesn't cause hardware breakpoints, so even if your debugger/system are capable of watching for all writes to a given address or page, it'll still merrily corrupt it.

    Even more fun if data has been corrupted, but the correct data is still in the data cache from a previous access, making failures even more unpredictable, often relying on an interrupt or other random bit of interference clearing the cache.

    On top of all this, a bad DMA chunk may not manifest itself in an obvious way. The program may crash in random sections well before you realize that the DMA you intended didn't happen, or the program may just keep on running with a single-frame graphic glitch or a brief bit of static in the case of DMA meant to go to video or audio hardware. That's easy to miss when you're focussing on the debugger and not the running program.

    I've seen products ship weeks late just because of a single hard-to-find DMA glitch.

    • by inkfox ( 580440 )
      I was talking to a coworker while I wrote the above. I should have stopped to read or at least add a few points before submitting. I like sharing info, so I'll just continue in another post:

      Some of you may not know the difference between a physical and logical address. The big fun in the case of the DMA ignoring the MMU above, is that each time you run a program on a system with virtual memory, your logical (application-specific) address is likely to map to a different physical (hardware) address. This is also why bad RAM can cause failures in seemingly random ways.

      It's also worth clarifying that DMA ignores not only the MMU, but cache memory, going straight to physical memory. I've yet to see an architecture where that wasn't true. This is why the bit with the data cache above is so nasty, especially on machines with large caches. It can put a large chunk of time between your corruption and your crash, which makes it hideously difficult to pinpoint the cause.

      Also, as DMA is generally tied to very time-sensitive code, it's also worth mentioning that most operating systems/architectures do little or no checking on DMA, instead kicking it off as fast as is possible. This is the reason it's so easy to cause DMA problems, and why they're typically such a hairy area.

      Lastly, DMA runs asynchronously. DMA, the CPU and other subsystems can all be banging on memory at the same time, with the DMA or the CPU taking priority, depending on the architecture or the DMA options specified. This makes it as bad as thread synchronization issues in many ways as well.

  • by renehollan ( 138013 ) <rhollan@@@clearwire...net> on Thursday August 01, 2002 @05:15PM (#3994973) Homepage Journal
    PUSH SP does NOT do the same thing on 8086 and 80286 architectures: in one case it pushes the stack pointer value before decrementing it, and in the other case it pushes it after decrementing it.

    I got stung by that on a Friday before a long weekend in 1984 or 1985. A dirty INT21 hook I was applying to DOS worked on ATs but not on XTs (or was it the other way around?). I had set up a structure on the stack and needed to pass its address to a higher-level language (prolly K&R C) routine, so PUSH SP seemed like the right thing to do.

    Hardly a complex bug, but one where it is non-obvious that a 286 is not a superset of an 8086.

    Then there was the time I had to download a patch to over a thousand embedded controllers spread over a whole country whose problem was that downloading didn't work.... a truck roll to each one was not an option. But, that's another story (bootstrapping the fix was horrendously more complex than finding the bug).

  • Satellite systems (Score:4, Interesting)

    by itwerx ( 165526 ) on Thursday August 01, 2002 @05:15PM (#3994975) Homepage
    I worked on code for a satellite ground-control system for a few years. It was all Fortran-5/77 and handled a couple-dozen satellites and ground stations in real-time. The problem was that it was written back in the 60's and the programmers who implemented it had really old slow MV-8000's. There weren't enough spare clock cycles to have decent synchronization between modules so they just depended on different subroutines taking an exact number of cycles to execute so they'd match up with whatever they were talking to. Change a single line in anything and you had to recompile it and time every possible way it could execute. Horrible stuff...
  • Bug in the CPU (Score:5, Interesting)

    by dant ( 25668 ) on Thursday August 01, 2002 @05:16PM (#3994980) Journal
    Once I was writing some C code to run on an old Motorola DSP in an embedded system.

    One particular function kept crashing. My debugging tools were very limited in this environment--basically, I had a total of 4 LEDs that I could blink on and off by inserting function calls into my code. That and a logic analyzer for when things got really nasty.

    Well, things did get really nasty. After reviewing and rewriting that function dozens of times, I finally decided the bug couldn't be in my C code. So I had the compiler spit out the assembly it was generating, brushed up on my DSP assembly, and read through its code on the hunch that there was a bug in the compiler (the compiler was very new and still pretty crappy).

    But after spending a couple of days staring at the assembly, I concluded that it was perfectly fine. What else could be going wrong? I started thinking maybe something was going wrong in the link step or in the process of getting the file transferred down onto the embedded controller.

    I went and learned more than I wanted to know about XCOFF format and used a little binary file editor to see what the linker output was. Again, everything just as it should be.

    I just knew that somehow, what was getting executed was different from what was in the file. So we fired up the logic analyzer, and attached it to the DSP, and set it up to watch the contents of the address bus and data bus at each clock cycle.

    This is incredibly painstaking--you have to look at 32 lines of step-functions to read off the address, and 48 lines of step-functions to read off the data (yes, it was a 48-bit data register; go figure) for EACH OPCODE. This will make your eyes bug out in a hurry.

    But even then--nothing was wrong! The opcodes being loaded into the processor were exactly what they should be. But on this one particular test-and-branch instruction, the processor would just start to go crazy (address and data lines full of random noise; had to be powered down).

    I dug out the processor manual and triple-checked the opcode name and number, addressing mode, and operands. Every bit was correct.

    In utter frustration, we decided to call Motorola to see if we could get some assistance from them. After going through a small maze of transfers, we finally ended up talking to the right person, who knew (and quickly told us) that:

    That particular addressing mode, when used with that particular
    opcode, was known to throw the DSP into a hosed state.

    It was a bug in the processor itself. The solution was simply to change my code to use a different addressing mode, and all was well.
    • This reminds me of programming the DSP and GPU in the Atari Jaguar.

      Many instructions didn't implement scoreboarding correctly, and their strange little multiply unit would take more cycles to multiply some numbers than others.

      This meant that it was possible to have code that worked correctly for some values but, for certain larger values, suddenly ended up using one of the two arguments of the multiply as the result. The multiply simply didn't happen in time, and since the scoreboarding was broken, there wasn't a stall if you tried to use the result of the incomplete operation. Even worse than getting the wrong result, you could end up having the result stuffed into a register at a point where you were now using it for another purpose!

  • by guerby ( 49204 ) on Thursday August 01, 2002 @05:36PM (#3995064) Homepage
    When GNAT (the Ada front-end for GCC) was committed into the GCC CVS repository, there was a bootstrap object file comparison failure.
    2001-10-27 Laurent Guerby <guerby@acm.org>


    * trans.c (gigi): Fix non determinism leading to bootstrap comparison failures for debugging information.
    The culprit was the following line:
    init_gigi_decls (gnat_to_gnu_entity (Base_Type (standard_long_long_float), NULL_TREE, 0),
                     gnat_to_gnu_entity (Base_Type (standard_exception_type), NULL_TREE, 0));
    The compilers built in the two bootstrap stages evaluated the two gnat_to_gnu_entity calls in different orders (as the C language allows), so the two created types were assigned different debugging ids, hence the comparison failure in the objects' debugging information. Luckily it struck me while reading the compiler's entry point and thinking about non-determinism.
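
    For anyone who hasn't hit this before, here is a small self-contained C++ illustration of the same trap and the usual shape of the fix: evaluate the arguments into named variables so the order is pinned down. (Just a sketch with made-up names, not the actual GNAT patch.)

    #include <cstdio>

    static int counter = 0;
    static int f() { return ++counter; }
    static int g() { return ++counter; }
    static void use(int a, int b) { std::printf("%d %d\n", a, b); }

    int main() {
        use(f(), g());   // unspecified order: may print "1 2" or "2 1"

        // The fix: name the evaluation order explicitly.
        int a = f();
        int b = g();
        use(a, b);       // always "3 4"
        return 0;
    }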

    Cute :).

    Laurent

  • Purchased library (Score:3, Interesting)

    by topham ( 32406 ) on Thursday August 01, 2002 @05:40PM (#3995078) Homepage
    I purchased a library to support multiple serial ports under DOS, way back in 1989/90. I purchased the library because I didn't know assembler well enough, and I specifically wanted flexible support for multi-port serial devices (Digiboards come to mind).

    I wrote up my program, testing it with a single serial port and had success quickly. I expanded it to support multiple ports and had it working, up until it was supposed to actually communicate with both ports at the same time.

    The company which released the library failed to reference anything except the first port in their interrupt handler, leaving me to trace through unfamiliar assembler code trying to figure out what was wrong with code that worked, just not correctly.

    I sent them a letter complaining about the problem; they sent back disks with that bug patched (exactly the same as my patch) and a couple of other bugs fixed (which didn't affect me).
    I vowed at that point never to trust third-party code again. (I did have full source code, though, which was nice.)

  • by kxr ( 176150 )
    Come back and ask me in about twenty-four hours. I've been wrestling with an absolute doozy for the last six, and somehow I get the impression this is going to be the one ;)
  • We use a third-party software package. It has no documentation. It takes in legacy BASIC code and spits out horribly ugly C code. I have two bugs that I have yet to fix because of their software. I know the fault is in their stuff but can't figure out where, and their solution is to upgrade their software, which I don't have time to test and debug. It is NOT open source, so it is black-box debugging. I can say for sure it is their stuff because the same code is transpiled through their tool and works on our other two platforms.
  • by martyb ( 196687 ) on Thursday August 01, 2002 @05:54PM (#3995157)

    Two situations come to mind.

    The first is not so much a specific bug that was difficult to find as the general means I was forced to use to locate bugs. (Yes, I found quite a few.) Back in the early 80's I was working at IBM in the QA group responsible for testing their VM operating system. We were tasked with taking the existing VM OS and improving not only its performance on multi-processor systems but also its reliability, through extensive reviews and testing. I was responsible for testing the free storage allocator.

    Some background: For those who may not be aware, that whole operating system was written in IBM BAL (Basic Assembler Language) using the 370 instruction set. The VM operating system created a virtual machine environment for each user - thus producing the appearance of a machine identical in [almost] all respects to running on the bare hardware. Those few differences pertained to some optimizations in the virtual memory management's use of PTLBs (page table look-aside buffers), among others.

    So, I needed to test memory allocation on the bare hardware to make sure that it worked okay. Once that was nailed down, I had to test memory allocation when VM was running on VM. But, there was yet another set of optimizations that I needed to test when a VM was running on VM running on VM (i.e. a "3rd level" VM).

    It was not possible to just issue VM commands to test the various code paths. So, each test consisted of setting hardware breakpoints at the appropriate hex offset, and single stepping through these allocations. By the time I got to testing the 3rd level VM code, I was tracing and debugging these PTLB calculations and allocations, single stepping through instructions in hexadecimal and verifying multiple levels of indirection to memory pages where those calculations were also in hex. Those were the days! Just a year or so out of college, and I had all to myself a multi-million dollar mainframe computer that could support several hundred people!

    The other bug was actually a specific bug that caused much early hair loss. I was working at a place where a lot of new employees were coming on board. Along with that, new departments were being formed, people were being promoted and moving to different groups, and there were a great many office moves as a result. So it soon became a problem finding somebody's office. "Gee, wasn't Mary here just last week?"

    Sensing a need, I wrote up a quick REXX program (yes, this was back in the early 80's, too) which did data acquisition through forms and supported the generation of reports sorted by various categories: Name, Department, Room Number, etc. This was pretty straightforward, and in a couple of days I'd gotten it coded, tested, and all the data populated. As there were only a few hundred people I used a flat file (the other alternative was creating a DB2 database, and disk space was very dear back then!)

    Rolled it out and received much positive feedback. Except, there was one person who noticed there was an error in the ordering of room numbers. See, the format was: the floor (1, 2, or 3), then the building wing as a compass direction (N, S, E, or W), and lastly a 2-digit room number. As this was a rapidly growing organization, managers would be allocated several empty offices in advance for the people they'd hire during the next quarter. Also, the building had just been constructed and some areas, sometimes whole wings, were still not yet ready for use, so there were many gaps in the data. Here is a selection of the kinds of results I saw for the room numbers, in ascending order:

    • 1E17
    • 1E18
    • 2E18
    • 1E19
    • 1E23
    • ...
    • 1S17
    • 1S18
    • 1S19
    • 2S03
    • 2S05
    • 2S15

    Why are the rooms in the South wing nicely ordered, but things are really messed up for the East wing? I spent HOURS and HOURS trying to figure this one out. See if you can tell what the problem is before reading ahead.

    The problem? REXX has a single set of comparison operators (<, =, and >), and they just do the right kind of comparison depending on the data type of the operands. I had been thinking of the data as text, but strings like "1E17" look like numbers with an exponent part, so the program was comparing the East-wing rooms numerically.
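
    A quick C++ illustration of the same pitfall (the original was REXX, so this is only an analogy): anything of the form digits-E-digits parses cleanly as a floating-point number, while the other wings' room numbers do not.

    #include <cstdio>
    #include <cstdlib>

    int main() {
        const char *rooms[] = { "1E18", "2E18", "1E19", "1S18", "2S03" };
        for (const char *r : rooms) {
            char *end;
            double v = std::strtod(r, &end);
            if (*end == '\0')
                std::printf("%s parses as the number %g\n", r, v);
            else
                std::printf("%s is just a string\n", r);
        }
        return 0;
    }

    Numerically 1E18 < 2E18 < 1E19, which is exactly the "wrong" order seen in the list above, while the South-wing entries fall back to plain string comparison and stay nicely sorted.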

  • My trickiest bug fix was very confusing.

    TI 6701 DSP, custom embedded hardware, eprom for first and second boot stage, third boot stage is on flash.

    The second boot stage had to be compressed to fit in the eprom. The first boot stage was limited in what it could do. The second boot stage in eprom was a subset of the flash code.

    We would be fixing bugs and adding features in the firmware, but once in a while we would add a line of code and the firmware would crash in an entirely different spot, sometimes even on boot up.

    Add one line of code elsewhere and the problem goes away. After a while we realized it did not matter what line of code we added! It mattered where, though. We quadruple-checked our cinit linker section and almost everything conceivable, including lost pointers and trashed code segments. The in-circuit emulator proved almost useless with major crashes and special cases like this, so we had no debugger access.

    What happened?

    There are write-only latches that the boot loader needs to write to in order to control the front panel. At some point, the flash code was updated so that the last pattern written to the latches was stored in a buffer so you could query the value. When an eprom was made based on this code, the problem occurred.

    After the flash code was read from flash, and relocated to the appropriate spot by code safely residing in internal SRAM of the DSP, the internal code blinked a front panel LED. The latch buffer was still residing in DRAM, in the same area that the new code was loaded into!

    So if one specific bit in the flash code was supposed to be a 1, the bootloader would change it to a zero and we would get a spurious crash.

    If the bit was supposed to be zero, there would be no problems!!!!

    --jeff++
  • A while ago, I was trying to spawn another thread. The thread started off like:

    void mythread(void) {
        char buffer[0x10000];   /* 64 KB of locals on the thread's stack */
    ...

    and it died right away. Finally, I realized I was blowing away the stack by allocating too large a buffer (I only wanted 0x1000). PITA to figure out.
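
    If a buffer that big had actually been needed, the usual fix is to keep it off the thread's stack entirely; a minimal sketch (in C++, details invented):

    #include <vector>

    void mythread() {
        std::vector<char> buffer(0x10000);   // 64 KB backed by the heap, not the stack
        // ... use buffer.data() as before ...
    }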

  • Beginner's bug... (Score:3, Interesting)

    by jmv ( 93421 ) on Friday August 02, 2002 @12:53AM (#3996983) Homepage
    It's a simple one, but the first time it bites you (it happened to me a couple of years ago), you really have to search for it:

    #define square(x) x*x

    ...

    value = 1/square(x);

    Ever since, any C macro I write has a dozen ('s in it... That's why I like inline functions...
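
    For the record, here is roughly what the preprocessor did and the usual defensive rewrite (just a sketch; square2 and square_fn are made-up names):

    /* The unparenthesized macro turns 1/square(x) into (1/x)*x, not 1/(x*x): */
    #define square(x) x*x
    /*   value = 1/square(x);   expands to   value = 1/x*x;   */

    /* Parenthesize everything, or better, use an inline function: */
    #define square2(x) ((x)*(x))
    static inline double square_fn(double x) { return x * x; }
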
    • I love the arguments against using parentheses. I think it comes from people having a reaction to LISP. I don't blame them, but I don't let them bitch at me for an extra set or 2 in a complicated formula either.

      • It's not that I don't like the parentheses. I don't like C macros because even with all the parentheses you add they can still be unsafe. For example:

        #define max(x,y) ((x)>(y)?(x):(y))

        It will work with max(x*x,y/z), but it won't work for max(x++,--y), and there isn't much you can do about it... C/C++ inline functions don't have this problem, and that's why they're much safer.
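
        A sketch of the inline alternative (max_fn is a made-up name so it doesn't collide with std::max). Each argument is evaluated exactly once, so max_fn(x++, --y) does what the caller expects, whereas with the macro the winning argument's side effect happens twice:

        template <typename T>
        inline T max_fn(T a, T b) {
            return (a > b) ? a : b;
        }
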
  • ...debugging a beta smartcard driver for Windows CE and "hacking" it so it worked with particular smartcards it couldn't handle before =).

    Matz
  • Well, it isn't quite a bug, but it comes pretty damn close. ASP (VBScript) doesn't handle SQL queries longer than 255 characters. link [tek-tips.com] This is the only place I could find it referenced on the web. Lots of hair-pulling on this one. Still haven't fixed it, and any workarounds I can think of are extremely ugly, hairy and otherwise full of cruftiness.
    • Stored procedures. Learn 'em, love 'em, live 'em. BTW - your link points to an article regarding the Docmd.RunSQL method in Access. ???
  • by BranMan ( 29917 ) on Friday August 02, 2002 @09:55AM (#3998296)
    My greatest debugging was on the first production run of the upgraded PATRIOT radar transmitter I was working on. The particular unit in question would start fine, run up, warm up the humongous amplifier tubes for the radar, switch into high power (for long-range operation and tracking) and BAM! reset itself.

    Looked like a S/W bug to everyone, so I (as the last 'surviving' member of the S/W team at that point) was called in to find it and fix it.

    Well, gathering data was the hard part - I needed to figure out what was happening with scope probes (tracing didn't work, and I couldn't rewrite all the firmware to do any logging or checkpointing). Small catch - the cycle running the system up to high power (where the bug was seen) took 8+ minutes. Each time.

    So I basically had 7 tries per hour (max) to decide where to hang a scope probe off a backplane of about 4000 wires to figure out what the heck was going on, all while leafing through 40K of assembler code trying to eyeball the problem.

    Three solid days of doing that (about 10-12 hours per day) with my boss constantly pestering me for an accurate estimate of how long it would take to fix it (Gee, thanks for that). Did I mention that I was three years into this project at that point - and that it was the first project I was on right out of school - and that I'd 'inherited' 2/3rds of the firmware from other developers who'd moved on? Way to be supportive, boss.

    Anyhow, I finally figured out it was a H/W fault, not the S/W at all. Turns out a 24-volt PS was "weak". When the 208 3-phase power that runs the transmitter dipped from the load of switching to high power, the 24-volt PS would drop its voltage - just enough that the 5-volt PS running the logic detected the drop in the 24-volt supply and, due to the fail-safes protecting the circuitry, shut itself off!

    Which resets the control logic, brings down the power, steadies the 208 3-phase, brings the 24-volt PS back in line, starts up the 5-volt PS, and away we go again.

    All found with a couple of scopes. Boy that was fun.
    • LOL! I also had a mysterious software bug that I eventually proved to be hardware (with great satisfaction ;).

      Anyway, I was working on a robotic lid for a Thermocycler (used in wet labs). Thing is, every few hundred iterations through the open/close test cycle would show a dropped command. Communication was I2C written by some third-party hacks, and the boss figured it was the problem. In all honesty, the implementation was pretty crappy, so I pored over every minute detail (I2C is a small protocol, there's not much to go wrong). I even gouged out a good portion of the code to put in retries and good failure behaviours. After about a week of reimplementing the server and client ends of the bus I decided to go downstairs and get an O-Scope from production. I had to beg for the thing. Finally, when I traced the lines it was all too obvious what was causing the drops. Some engineering bozo had placed the I2C lines adjacent to the power lines, and failed to shield the power lines.

      The result: 60-cycle induction across the bus (slow compared to I2C, so collisions happened rarely, and my failure adjustments were pretty darn good at rectifying them). Anyway, knowing this I was able to tune the code and save the release date, though the designer of the hardware should've known better in the first place and saved me some headaches!
  • Just yesterday I spent a great deal of time trying to figure out why our application, which had never leaked memory before, was now leaking after a conversion to VC++.NET with /CLR.

    The problem ended up being that while DllMain does get a call with DLL_PROCESS_ATTACH, it no longer gets a call with DLL_PROCESS_DETACH.

    So if you are, like me, allocating global tables when the DLL loads and deallocating them when the DLL unloads, you've got problems!

    To see my (unanswered) post along with code to reproduce it, go to microsoft.public.dotnet.vc [microsoft.com].
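
    For reference, a bare-bones sketch of the pattern that breaks (the table and its size are invented here): allocate on DLL_PROCESS_ATTACH, free on DLL_PROCESS_DETACH. If the DETACH notification never arrives, as described above under /CLR, the cleanup never runs and the tables leak.

    #include <windows.h>
    #include <stdlib.h>

    static void *g_tables = NULL;   /* stand-in for the global tables */

    BOOL WINAPI DllMain(HINSTANCE hinstDLL, DWORD fdwReason, LPVOID lpvReserved)
    {
        switch (fdwReason) {
        case DLL_PROCESS_ATTACH:
            g_tables = malloc(64 * 1024);   /* set up tables when the DLL loads */
            break;
        case DLL_PROCESS_DETACH:
            free(g_tables);                 /* never reached in the buggy case */
            g_tables = NULL;
            break;
        }
        return TRUE;
    }
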
  • The weirdest bug I have encountered was... Last month.

    The app I'm developing right now must wait for data to appear in a table, then read it, process it, and send it.
    I did not want to constantly poll the DB, so I used the Oracle package DBMS_ALERT. Basically, you do a dbms_alert.wait on a named alert and then, in my trigger, the alert is signaled, so my thread in the Java application unblocks and can go read the data.

    Well, everything was running fine; we ran tests and wrote new code all week long. But then, one day, it started to behave bizarrely...
    We were getting notified "randomly", even when we were not inserting any data into the table! (It was an ON INSERT trigger.) I was pretty amazed, I must admit.

    So after a whole day trying to figure out what the hell was going on, our no-longer-appreciated DBA came to us and told us he had created a new schema for our tests on the database, with all the triggers and indexes etc. etc... I immediately took a look at the table with the alert trigger, and there were something like 200 rows in there...

    It turned out that the alerts were propagating across all the different schemas, and this was not mentioned in the documentation the DBA gave me... So we just added dynamic alert names and the problem was solved!
  • The research project I'm working on uses a graph library called LEDA which I don't fully understand and which won't compile when our source files have a .cpp extension. *.c works fine (using g++ as the compiler).
    A couple of days ago, I started adding a polymorphic class hierarchy to the project, and the first time I tried to compile it, I got a bunch of strange linker errors telling me "undefined reference to [classname] type_info function". I, of course, blamed LEDA and made another test program to see if I could isolate "LEDA's bug." Of course, I couldn't get the same thing to break in another program.
    "When in doubt, turn to Google": A search turned up that this error meant I hadn't put the "= 0" in a pure virtual function's declaration. I didn't want any pure virtual functions, but it turned out that I had forgotten the [classname]:: in front of one of my functions.
    Talk about dumb.
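
    A minimal reconstruction of that kind of mistake (class and function names are invented here); with g++ the missing qualifier typically surfaces as an undefined vtable/typeinfo reference at link time rather than as a compile error:

    struct Shape {
        virtual double area();          // declared, but (it turns out) never defined
        virtual ~Shape() {}
    };

    // Intended to be Shape::area(), but the missing "Shape::" makes this an
    // unrelated free function, so Shape's only out-of-line virtual function
    // has no definition and the compiler never emits its vtable or typeinfo.
    double area() { return 0.0; }

    int main() {
        Shape s;                        // references Shape's vtable -> link error
        return 0;
    }
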
  • I once had to debug a program in an embedded controller - the company I
    worked for had this as a product that we'd written 10 years earlier - the
    computer that had the assembler and debugger for that controller had been
    hauled off for scrap years earlier and the only known copies of the source
    code for the program were on that machine's hard drive and on a tape backup
    for which we had no functioning tape drive.

    Then a bug was found in a mode of operation that we'd hardly ever used
    before...just one day before we were planning to demo it to a customer. I
    had 24 hours to write a disassembler, an assembler and a simulator for the
    CPU - then disassemble the program, find the bug, fix it, reassemble it, test
    and burn new EPROMs.

    The assembler was essentially just the disassembler run backwards and the
    disassembler was little more than a lookup table for the opcodes.

    Fortunately, the bug was easy to find and I made it with just one all-nighter.
  • This was a problem with an Algol program I wrote on an ancient Singer mainframe when I was in college back in the mid 1970's. I don't remember enough Algol to write it out - but in C++ it would be something like:

    void foo ( double &bar ) { bar = bar + 1.0 ; }

    ...and then later...

    foo ( 6.0 ) ;

    ...and *much* later...

    printf ( "%f", 6.0 ) ;

    Well, in C++, this is illegal because you can't pass a literal constant like 6.0 by non-const reference - but on this old mainframe, the Algol compiler optimised the use of storage by keeping one copy of each large constant that you used (like 6.0) in a table somewhere in memory. Being a crappy compiler, it didn't error-detect my little faux pas.

    Hence the value of the constant 'six' was invisibly changed to seven - everywhere in the program from the point where foo() was called onwards.

    Of course this happened in a 3,000 line program that was much more complicated than the example above. Finding that problem was a *BITCH*. You just don't suspect things like simple numeric constants!
  • I was writing a serial port driver for a 68000 CPU (it might have been
    a 68010 or an '020 - I forget) - it worked byte-by-byte - no special
    line-by-line reading. It echoed everything that came into the serial
    port directly to the output.

    Trouble was, the first character on every line of keyboard
    input would get 'lost'...despite the fact that the application
    was reading one character at a time and the return key wasn't
    treated as anything special.

    The dumb terminals of those days couldn't scroll an entire screenful
    of text at full baud rate - so you were advised to send a null byte
    after every linefeed to give the terminal an extra millisecond or so
    to do scrolling in case the cursor was on the last line of the screen.

    My serial port output code (which seemed to work perfectly)
    said something like:

    char *UART_data_port = SOME_HARDWARE_ADDRESS ;

    *UART_data_port = character ;

    if ( character == '\n' )
    *UART_data_port = 0 ; /* Send a null byte */

    The problem was that the compiler silently optimised the
    '*UART_data_port = 0' to:

    CLR UART_data_port

    ...but the 68000 has a bizarre quirk in its instruction set
    that when you clear a memory location, it *reads* that address
    first. Normally, it doesn't matter a damn that you read something
    when you needn't have ... unless the location you are reading from
    is memory mapped hardware...like my UART.

    Since the data port was memory-mapped and the same address was
    used for READING the serial port as for WRITING to it, whenever
    my code wrote a newline, the ensuing null byte would cause the
    CPU to read a character from the input stream and throw it away!

    Argh! That was hard to find...the output code was in a separate
    module from the (suspected) input code - even when I looked at the
    compiler output, it looked OK - in the end I had to hang a logic
    analyser on the UART and the CPU address bus to find out where it
    was executing when the character was 'eaten'. Then reading the
    Motorola data book revealed the awful horror.
  • I once had to write some Delphi code to call into a VC++ dll that worked as a data only smart card driver.

    Basically this smart card had something like 8Kbit addressable space (it was the early days of smart cards as storage devices).

    I had to write code to detect card insertion, then detect the card version, talk to the smart card reader (on the serial port), and read/write data to the smart card.

    Got it all to work fine, except for one thing. The first letter of any string written to the smart card always got hosed.

    The cause? The manufacturer of the smart card used the first 2 bits of the addressable space on the card as some form of control bits (this was buried in a footnote somewhere in the documentation). Anyway, after finally finding this I decided to always write a space character in front of any string going onto the card, and to always discard the first character coming out of the card.

    Problem solved.
  • by ebbe11 ( 121118 ) on Monday August 05, 2002 @04:16AM (#4010789)
    I spent a couple of days with this bug in an ECG monitor:

    When the users pressed a specific (valid) key sequence rapidly enough, the keys stopped working.

    The monitor used a message-passing RTOS, and it turned out that a small utility function was the culprit. This function acquired a message buffer, filled it in, and sent the message to a specific task. That task was then responsible for releasing the buffer. This worked fine - until someone used the utility function from within that very same task. When the key sequence was entered, one task would send a message whose processing would in turn require a call to the utility function inside the receiving task. But before that message was processed, another task would call the utility and thus acquire the message buffer. So when the receiving task finally got to run, it would have two messages in its queue: processing the first message required acquiring the message buffer already held by the second message.
    Result: deadlock.

    I should add: this was in 8086 assembly...

  • A developer I know was writing documentation ('programming in word') and found that she had typed something like 'once this product has been vaselined'!
    Turns out that the v and the b are next door to each other on the keyboard.
    Similarly, I used to use int prefixes to identify integers and once came up with a funny when mistyping intCount by swapping the t and the c.
  • if ( bSomething == TRUE )   /* the classic gotcha: in C, any nonzero value is "true", but this test matches only the single value TRUE */
    ...
  • by Animats ( 122034 ) on Tuesday August 06, 2002 @12:38AM (#4016319) Homepage
    My worst experiences involved making a multiprocessor UNIVAC 1108 system work reliably in the 1970s.

    This was a mainframe. A physically huge mainframe. The two CPUs and memories alone took three rows of cabinets, each row about 25 feet long, connected by cross-cabinets. Then there were about forty cabinets of peripheral gear, including many drums, tape drives, printers, and two desk-sized consoles, one per CPU. All of this gear, though, delivered only 1.25 MIPS, and there was only about 1MB of memory (256K of 36 bit words.)

    The system kept crashing. For each crash, a dump was produced - a stack of paper about two inches thick, with some of the major data structures decoded at the front, followed by the entire contents of memory, in octal. When I arrived for the job, there were two stacks of these six feet high waiting for me.

    So I started in on this, figuring out what had caused each crash, tracking pointers with multicolored pens, and fixing the bugs in the operating system, which was all in assembly. After a while, the most common crashes had been fixed, and I was then spending time on the more difficult problems.

    Some problems required software workarounds for hardware problems. The system clock would make errors when its fan filters were clogged. (Yes, electronics was so big back then that the system clock had multiple muffin fans.) Code was written to deal with this.

    Occasionally, code overlays would be misread from the drums. A checksum crashed the system when this happened, but reread support was added to make that error recoverable.

    The most intractable problem involved data that seemed to be corrupted when written by one processor and read by the other. We looked and looked for race conditions, but even additional locking didn't help. Finally, a hardware consultant was brought in, and he built a custom hardware device that checked that certain bits matched between the processor and one of the memories. This was used during operation, and finally, after several days, the device triggered, the whole system froze with its clock stopped, and we could verify that the neon lamps at the processor end of the data path didn't match those at the memory end.

    Eventually, we had that beast running reliably, with a month or so between crashes. Gradually, the operation was expanded, until there were five mainframes crunching away.
