Catch up on stories from the past week (and beyond) at the Slashdot story archive

 



Forgot your password?
typodupeerror
×
Programming Unix

The Most Expensive One-Byte Mistake 594

An anonymous reader writes "Poul-Henning Kamp looks back at some of the bad decisions made in language design, specifically the C/Unix/Posix use of NUL-terminated text strings. 'The choice was really simple: Should the C language represent strings as an address + length tuple or just as the address with a magic character (NUL) marking the end? ... Using an address + length format would cost one more byte of overhead than an address + magic_marker format, and their PDP computer had limited core memory. In other words, this could have been a perfectly typical and rational IT or CS decision, like the many similar decisions we all make every day; but this one had quite atypical economic consequences.'"
This discussion has been archived. No new comments can be posted.

The Most Expensive One-Byte Mistake

Comments Filter:
  • The Road Not Taken (Score:5, Insightful)

    by symbolset ( 646467 ) * on Tuesday August 02, 2011 @11:18PM (#36968460) Journal

    Two roads diverged in a yellow wood,
    And sorry I could not travel both
    And be one traveler, long I stood
    And looked down one as far as I could
    To where it bent in the undergrowth;

    Then took the other, as just as fair,
    And having perhaps the better claim,
    Because it was grassy and wanted wear;
    Though as for that the passing there
    Had worn them really about the same,

    And both that morning equally lay
    In leaves no step had trodden black.
    Oh, I kept the first for another day!
    Yet knowing how way leads on to way,
    I doubted if I should ever come back.

    I shall be telling this with a sigh
    Somewhere ages and ages hence:
    Two roads diverged in a wood, and I—
    I took the one less traveled by,
    And that has made all the difference.

    - Robert Frost, 1920

    • Slashdot is lost.
    • Re: (Score:3, Interesting)

      by IICV ( 652597 )

      Everyone misunderstands that poem.

      Robert Frost had a fairly depressing outlook on life, and the point of the poem is that it doesn't matter what road you take.

      I mean, just pay attention to the narrative tense in the last stanza, the one people take to be so life-affirming and "do something different!". The narrator isn't saying "I did this, and I know it was important"; he's saying "I did this, and I think that in the future I'm going to tell people it was important".

      The narrator is a vain, shallow individu

      • by jhoegl ( 638955 )
        Perhaps it means that regardless of which path you take, the one you do take(the decision you make), should be analyzed and reflected upon to verify it is inline with what you wish to accomplish.
        Of course, I could continue and counter your you vain and shallow remarks and how they reflect upon a person taking literal interpretation of a poem and scrutinizing it with the inability of the author to respond.
        But I digress.
      • by looie ( 9995 ) <michael@trollope.org> on Wednesday August 03, 2011 @07:14AM (#36970650) Homepage

        Not sure where you took your "poetical exegesis" class, but you should ask for a refund.

        The narrator as "vain, shallow individual" is entirely a character pulled out of your hindquarters, as there is nothing in the text of the poem to lead to that conclusion.

        The poem is simply a reflection on how we, as individuals, make choices in life. Some of us choose to take the direction taken by most of those around us. That might be university, family, job, retirement in FL. Some of us choose to turn aside from that direction and try another path. Programming a PDP to play "Space Travel," for example. Or writing an operating system "just for fun."

        Frost's suggestion is that these choices of path may seem insignificant at the time -- both paths being nearly the same; but that, as "way leads on to way," there's no going back and thus we may find ourselves down a path that leads to unexpected places. When Linus Torvalds wrote linux, he could not know that "the path less traveled" would lead to fame and fortune, literally. The college kids who created Slashdot could not know it would make them rich.

        In fact, the point of the poem is exactly that it does matter which path you take. But that you don't always know how your choice is going to turn out. Frost himself might have continued his career as a teacher, a stable and certain means of supporting his family. Instead, he chose to focus on his poetry. He took a chance. And it worked well for him.

        mp

        • The narrator as "vain, shallow individual" is entirely a character pulled out of your hindquarters, as there is nothing in the text of the poem to lead to that conclusion.

          Ahem.

          The ironic interpretation, widely held by critics,[2][3] is that the poem is instead about making personal choices and rationalizing our decisions, whether with pride or with regret.

          Source: http://en.wikipedia.org/wiki/The_Road_Not_Taken_(poem) [wikipedia.org]

          I'm tempted to bookmark this response as a great example of why engineers should not fear breadth requirements. (I'm assuming anyone with such a low Slashdot ID works in engineering...)

          The ironic interpretation is widely held because it's supported not only by the text, but also Frost's own statements, and the broader context of his work--in which seemingly simple descriptive verse hides darker, more complex themes. (A major reason why he is

        • by IICV ( 652597 )

          Frost's suggestion is that these choices of path may seem insignificant at the time -- both paths being nearly the same; but that, as "way leads on to way," there's no going back and thus we may find ourselves down a path that leads to unexpected places. When Linus Torvalds wrote linux, he could not know that "the path less traveled" would lead to fame and fortune, literally. The college kids who created Slashdot could not know it would make them rich.

          In fact, the point of the poem is exactly that it does m

      • http://poetrypages.lemon8.nl/life/roadnottaken/roadnottaken.htm [lemon8.nl] Robert Frost on his own poetry: "One stanza of 'The Road Not Taken' was written while I was sitting on a sofa in the middle of England: Was found three or four years later, and I couldn't bear not to finish it. I wasn't thinking about myself there, but about a friend who had gone off to war, a person who, whichever road he went, would be sorry he didn't go the other. He was hard on himself that way."
    • Re: (Score:3, Informative)

      As a nitpick, this poem is not from 1920. I have an original copy that was inscribed by the owner in 1919.

      According to Wikipedia, the original poetry was published in 1916. The 1920 version was a second edition.

    • by o'reor ( 581921 ) on Wednesday August 03, 2011 @02:18AM (#36969316) Journal
      I for one welcome that refreshing new way of writing "Frost's pissed."
    • by brusk ( 135896 )
      The meaning of the poem lies in the NUL character at the end.
    • by Waffle Iron ( 339739 ) on Wednesday August 03, 2011 @10:33AM (#36973306)

      Whose code is this I think I know
      'Tis filled with buffer overflows
      His pointer is not stopping here
      As the megs of garbage data grow

      My CPU must think it queer
      To scan for null bytes not found here
      Between the stack and blocks of code
      Canary values, segfault near

      It gives the PC bell a quake
      To ask if there is some mistake
      The only other sound's the sweep
      Of swapping pages disk head shake

      The stack is swelling very fast
      But allocated buffer's past
      And megs to fill before a crash
      And megs to fill before a crash

  • Missed the point (Score:5, Informative)

    by mgiuca ( 1040724 ) on Tuesday August 02, 2011 @11:24PM (#36968492)

    Interesting, but I think this article largely misses the point.

    Firstly, it makes it seem like the address+length format is a no-brainer, but there are quite a lot of problems with that. It would have had the undesirable consequence of making a string larger than a pointer. Alternatively, it could be a pointer to a length+data block, but then it wouldn't be possible to take a suffix of a string by moving the pointer forward. Furthermore, if they chose a one-byte length, as the article so casually suggests as the correct solution (like Pascal), it would have had the insane limit of 255-byte strings, with no compatible way to have a string any longer. (Though a size_t length would make more sense.) Furthermore, it would be more complex for interoperating between languages -- right now, a char* is a char*. If we used a length field, how many bytes would it be? What endianness? Would the length be first or last? How many implementations would trip up on strings > 128 bytes (treating it as a signed quantity)? In some ways, it is nice that getaddrinfo takes a NUL-terminated char* and not a more complicated monster. I'm not saying this makes NUL-termination the right decision, but it certainly has a number of advantages over addr+length.

    Secondly, this article puts the blame on the C language. It misses the historical step of B, which had the same design decision (by the same people), except it used ASCII 4 (EOT) to terminate strings. I think switching to NUL was a good decision ;)

    Hardware development, performance, and compiler development costs are all valid. But on the security costs section, it focuses on the buffer overflow issue, which is irrelevant. gets is a very bad idea, and it would be whether C had used NUL-terminated strings or addr+len strings. The decision which led to all these buffer overflow problems is that the C library tends to use a "you allocate, I fill" model, rather than an "I allocate and fill" model (strdup being one of the few exceptions). That's got nothing to do with the NUL terminator.

    What the article missed was the real security problems caused by the NUL terminator. The obvious fact that if you forget to NUL-terminate a string, anything which traverses it will read on past the end of the buffer for who knows how long. The author blames gets, but this isn't why gets is bad -- gets correctly NUL-terminates the string. There are other, sneaky subtle NUL-termination problems that aren't buffer overflows. A couple of years back, a vulnerability was found in Microsoft's crypto libraries (I don't have a link unfortunately) affecting all web browsers except Firefox (which has its own). The problem was that it allowed NUL bytes in domain names, and used strcmp to compare domain names when checking certificates. This meant that "google.com" and "google.com\0.malicioushacker.com" compared equal, so if I got a certificate for "*.com\0.malicioushacker.com" I could use it to impersonate any legitimate .com domain. That would have been an interesting case to mention rather than merely equating "NUL pointer problem" with "buffer overflow".

    • Re:Missed the point (Score:5, Informative)

      by Anonymous Coward on Tuesday August 02, 2011 @11:37PM (#36968578)
    • "...it would have had the insane limit of 255-byte strings, with no compatible way to have a string any longer."

      Compatible with what? Seems to me they could have just used continuation bit for the size field, much the way UTF-8 works to store non-ASCII characters.

      • "...it would have had the insane limit of 255-byte strings, with no compatible way to have a string any longer."

        Compatible with what? Seems to me they could have just used continuation bit for the size field, much the way UTF-8 works to store non-ASCII characters.

        This would still make the strings incompatible, because you would only have a 127-byte string length before the "continuation bit" comes into play and you need to switch to a 15-bit string length. All the previous code written with longer-than-127-byte strings would be incompatible.

        • If we were to switch now, is that the compatibility you're referring to? Well sure.

          But nobody's talking about switching now, the point of the topic is that C should have been designed differently. In those days there was very little backwards compatibility to worry about.

          • by snowgirl ( 978879 ) on Wednesday August 03, 2011 @03:52AM (#36969742) Journal

            If we were to switch now, is that the compatibility you're referring to? Well sure.

            But nobody's talking about switching now, the point of the topic is that C should have been designed differently. In those days there was very little backwards compatibility to worry about.

            And if it had been decided to be 1-byte length + data, and everyone used it like that, and assumed that the full 8-bits are available for the length, then when we switch to variable-byte length encoding, it would create an incompatibility. The incompatibility I speak of is the hypothetical one switching from 1-byte fixed-length length encoding to variable-byte length encoding.

            "They could have just used variable length encoding from the beginning." Sure, and they could have programmed everything in Java from the start... the idea of a variable length encoding would have been over-engineering the problem that they were facing.

      • by mgiuca ( 1040724 )

        They could have but they didn't (e.g., in Pascal, where strings actually are limited to 255 bytes). So, history has made some worse string representations than C.

      • Re:Missed the point (Score:5, Informative)

        by dbc ( 135354 ) on Wednesday August 03, 2011 @12:02AM (#36968692)

        Oh, Lordy, if you had ever programmed in a language with a 255 character limit for strings you would praise $DIETY every time you use a C string. Dealing with length limited strings is the largest PITA of any senseless and time-wasting programming task.

        Suppose C had originally had a length for strings? The only thing that makes sense is for the string length count to be the same size as a pointer, so that it could effectively be all of memory. A long is, by C language definition, large enough to hold a pointer that has been cast into it. So string length computations all become longs. Not such a big deal for most of life... until.... 64 bit addressing. Then all sorts of string breakage occurs.

        The bottom line is that in an application programming language strings need to be atomic, as they are in Python. You just should not care how strings are implemented, and you should never worry about a length limit. The trouble is, C is a systems programming language, so it is imperative that the language allow direct access to bit-level implementation. If you chose to use a systems programming language for application programming, well, then it sucks to be you. So why did we do that for so long? Because all the other alternatives were worse.

        Hell, I've used languages where the statement separator was a 12-11-0-7-8-9 punch. (Bonus points if you can tell me what that is and how to make one.) So a NUL terminated string looks positively modern compared to that.

        • On any sane platform, "long" has that property, and can also hold the machine word or more. Win64 is not sane.

          Because of it, the standard had to add intptr_t which is the only type portably known to be of same width as void* and char* (but _not_ necessarily the same as a pointer to any other type!). Of course, MSVC doesn't follow standards and doesn't have this type nor stdint.h/inttypes.h at all.

          Thus, your code will not work there, and will cut pointers. To make things worse, it will actually work if yo

    • Re:Missed the point (Score:4, Interesting)

      by snowgirl ( 978879 ) on Tuesday August 02, 2011 @11:41PM (#36968596) Journal

      Not to mention the argument for "because space was at a premium" is specious, because either you had a 8-bit length prepended to the string, or you had an 8-bit special value appended to the end of the string. Both ways result in the same space usages.

      From what I read in the summary, (didn't read TFA) this whole thing sounds like a propaganda piece supporting the idea that we should use length+string, by presenting it as "this should have been a no-brainer but the idiots making C screwed up."

      As a nitpicky pedantic note though, if C had gone with length+string format, then other languages would have been written around the C standard, since most of them were written around the C standards to begin with to increase interoperability in the first place.

      • Re:Missed the point (Score:5, Informative)

        by snowgirl ( 978879 ) on Tuesday August 02, 2011 @11:48PM (#36968630) Journal

        I'm correcting myself here... apparently they weren't considering going with a 255-byte limit, but a 65535-byte limit, which would have increased the size overhead by one.

        • Re:Missed the point (Score:4, Informative)

          by arth1 ( 260657 ) on Wednesday August 03, 2011 @12:25AM (#36968804) Homepage Journal

          That's still an arbitrary limit.

          The advantages that I see for counted length are:
          - it makes copying easier - you know beforehand how much space to allocate, and how much to copy.
          - it makes certain cases of strcmp() faster - if the length doesn't match, you can assume the strings are different.
          - It makes reverse searches faster.
          - You can put binary in a string.
          But that must be weighed against the disadvantages, like not being able to take advantage of CPUs zero test conditions, but instead having to maintain a counter which eats up a valuable register. Or having to convert text blocks to print them. Or not being well suited for piped text or multiple threads; you can't just spew the text into an already nulled area, and it will be valid as it comes in; you have to update a text length counter for every byte you make available. And... and...

          Getting a free strlen() is NOT an advantage, by the way. In fact, that became a liability when UTF-8 arrived. With a library strlen() function, all you had to do was update the library, but when the compiler was hardcoded to just return the byte count, that wasn't an option. Sure, one could go to UTF-16 instead, but then there's a lot of wasted space.

          All in all, having worked with both systems, I find more advantages with null-termination.

          There's also a third system for text - linked lists. It doesn't have the disadvantage of an artificial string length limit, and allows for easy cuts and pastes, and even COW speedups, but requires far more advanced (and thus slower) routines and housekeeping, and has many of the same disadvantages as byte-counted text.. Some text processors have used this as a native string format, due to the specific advantages.

          I'd still take NULL-terminated for most purposes.

          • Re:Missed the point (Score:4, Interesting)

            by jeremyp ( 130771 ) on Wednesday August 03, 2011 @05:02AM (#36970024) Homepage Journal

            That's still an arbitrary limit.

            An arbitrary limit equal to the virtual machine size of the computer that was originally targeted.

            The advantages that I see for counted length are:
            - it makes copying easier - you know beforehand how much space to allocate, and how much to copy.
            - it makes certain cases of strcmp() faster - if the length doesn't match, you can assume the strings are different.
            - It makes reverse searches faster.
            - You can put binary in a string.

            - It all but eliminates the possibility of buffer overruns for strings.

            But that must be weighed against the disadvantages, like not being able to take advantage of CPUs zero test conditions, but instead having to maintain a counter which eats up a valuable register.

            But lots of CPUs have an instruction a bit like "decrement register and jump if not zero" which can be used for length+data strings.

            Or not being well suited for piped text or multiple threads; you can't just spew the text into an already nulled area, and it will be valid as it comes in;

            With modern character encodings, you can't guarantee that whatever string format you use. Couple that with the fact that streamed data tends to be read and written in blocks with a length parameter anyway, and the whole advantage is gone. This is why almost all modern languages have some variation on length + data for their strings and utilities for manipulating raw byte buffers.

            Getting a free strlen() is NOT an advantage, by the way. In fact, that became a liability when UTF-8 arrived. With a library strlen() function, all you had to do was update the library, but when the compiler was hardcoded to just return the byte count, that wasn't an option.

            Except that strlen() has always and still does count the number of C chars before the null byte. This is enshrined in the C99 standard. UTF-8 has not changed the implementation of strlen(). Also, gcc and probably many other compilers will normally optimise things like strlen() to a few lines of assembler rather than a call to libc, so you'd have to recompile anyway if it does change.

            Sure, one could go to UTF-16 instead, but then there's a lot of wasted space.

            All in all, having worked with both systems, I find more advantages with null-termination.

            There's also a third system for text - linked lists. It doesn't have the disadvantage of an artificial string length limit, and allows for easy cuts and pastes, and even COW speedups, but requires far more advanced (and thus slower) routines and housekeeping, and has many of the same disadvantages as byte-counted text.. Some text processors have used this as a native string format, due to the specific advantages.

            I'd still take NULL-terminated for most purposes.

            Most modern languages have a proper string type and I would always take that over null terminated char sequences. You can bet that Java's internal implementation of String uses length+data.

            • by arth1 ( 260657 )

              But that must be weighed against the disadvantages, like not being able to take advantage of CPUs zero test conditions, but instead having to maintain a counter which eats up a valuable register.

              But lots of CPUs have an instruction a bit like "decrement register and jump if not zero" which can be used for length+data strings.

              Um, that's pretty much what I said, isn't it? The "instead having to maintain a counter which eats up a valuable register" part.

          • UTF-16 is not fixed width. It combines all disadvantages of UTF-8 and UCS4 while having no advantages of either.

          • Are you suggesting strlen() should return the number of UTF-8 characters, not the number of bytes? That's insane... the entire point of UTF-8 is that stuff like strlen() can treat it as a narrow string. If you want to have a function for returning the number of printable characters in a UTF-8 string, that's going to be a separate function, and isn't any easier or harder with sized strings v.s null-terminated strings.

      • by mgiuca ( 1040724 )

        Good point.

        As a nitpicky pedantic note though, if C had gone with length+string format, then other languages would have been written around the C standard, since most of them were written around the C standards to begin with to increase interoperability in the first place.

        Yes, but perhaps the simplicity was partly why it caught on. The reason I raised all of the "what about..." questions was to illustrate just how many small variations in an address+length standard there could have been. Even if C had made

        • by mini me ( 132455 )

          The C string has its place, but what I never understood is why the C standard library hasn't also included a string type. Something like the following with all of the accompanying bound checking functions to go along with it.


          struct string
          {
              size_t length;
              char *buffer;
          };

          There are several third party libraries that do just that, but it seems like something worthy of being there out of the box.

          • by mgiuca ( 1040724 )

            Well C++ includes a class that is pretty much exactly what you ask for. It wouldn't make sense for C to include that, as the whole point is that C gives you the ability to manipulate data however you want. If C included that, it would be criticised for having two incompatible string types. If it only included that, it would be criticised for not being low-level enough (the programmer is forced to call all these inefficient string manipulation functions that do bounds checking).

            You might ask why C doesn't in

    • by e9th ( 652576 )
      My personal fave is strncpy(), which will silently not terminate the string if the buffer is too small, but if you give it a huge buffer it punishes you by NUL padding the string all the way to the end of the buffer.
      • by inflex ( 123318 )

        Yeah, I take a stab at strncpy and a lot of others in my little open-source/aka-incomplete book about C... "C of Peril" - http://www.pldaniels.com/c-of-peril/ [pldaniels.com]

      • by yuhong ( 1378501 )

        I think it was intended to convert null-terminated strings to fixed-length null padded strings, as used in many places in the Unix kernel at the time it was invented, like filenames.

    • What is so undesirable about making a string larger than a pointer?

      Also, have a look at how mysql deals with varchars. There is no 255 byte limit - when length exceeds that value, you just go to 2 bytes of length, etc. Your arguments about what type of integer to use conveniently ignore conventions like network order. In short, it is not too hard to solve. Do you really think the state of programming was so bad back then that people wouldn't test 129 byte strings?

      And no, the article didn't miss the "real se

      • by Rakishi ( 759894 )

        Fail, just fail.

        Also, have a look at how mysql deals with varchars. There is no 255 byte limit

        Before Mysql 5.0.3 the limit was 255 and 65535 afterward.

        when length exceeds that value, you just go to 2 bytes of length, etc.

        It does this because each column defines the maximum length for the varchar and the number of bytes used for length is fixed for each column. This however is also overhead, this information for the size of the length field needs to be stored for each variable. In C this means that each variable now has even more overhead (the actual amount depending on how you encode such information).

      • by mgiuca ( 1040724 )

        Note that my post was not necessarily saying that NUL was the right decision. Just that it isn't a no-brainer -- going the other route has a lot of complications.

        What is so undesirable about making a string larger than a pointer?

        It would mean that the C library would need to declare a "string" struct instead of using char*. Now rather than passing a char* as an argument, you would have to decide whether it's worth passing the two word "string" struct, or a string* pointer (allowing it to fit into a register

        • Dealing with NUL is significantly simpler than dealing with length fields, and there are significantly fewer sources for confusion.

          Is it, or are you just used to dealing with NUL-terminated strings?

          If you wanted the standard to use a variable-length length as you suggest, you would need to make sure that all the programmers correctly store and parse variable-length strings. Of course they could get it right, but there are lots of ways they could get it wrong.

          That's what libraries are for :-)

          Here's a quest

          • by mgiuca ( 1040724 )

            Is it, or are you just used to dealing with NUL-terminated strings?

            Nope, they are simpler. Re-read all of the questions I asked regarding design decisions that could be made around address+length formatted strings and tell me that they are just as simple. Now I think higher-level languages should be using lengths, because their libraries abstract the details (e.g., C++ or Java). But in a language where programmers fabricate their own strings, simplicity is best.

            That's what libraries are for :-)

            Well, let's a

      • And no, the article didn't miss the "real security problems" caused by null termination. Where did you stop reading?

        The point was: the security problems the article mentions (buffer overflows/underflows) aren't actually caused by NULL terminated strings, they are caused by buffers that are allocated too small. If the buffer is too small, it won't matter if the string is measured at the beginning or terminated at the end. (It can be fixed by measuring the size of the buffer, but that is a different topic).

        However there is a real security problem, as the GP described, although it really was a problem of mixing two standa

      • by 0123456 ( 636235 )

        There is no 255 byte limit - when length exceeds that value, you just go to 2 bytes of length, etc.

        So you have a 255 byte string. You append one byte to it. What do you do now?

        Are you really suggesting that people should have to move all the bytes of the string one further along so they can increase the length field to two bytes, and then append the new character, and that programmers who can't remember to put a 0 at the end of a string can do that without screwing up?

        Sure, you can force everyone to use library calls for all their string operations, but C was intended to be cheap, dirty and fast, which i

        • by Graff ( 532189 )

          So you have a 255 byte string. You append one byte to it. What do you do now?

          Are you really suggesting that people should have to move all the bytes of the string one further along so they can increase the length field to two bytes, and then append the new character, and that programmers who can't remember to put a 0 at the end of a string can do that without screwing up?

          I would think that the more logical way to handle this is to handle the string in chunks. The first byte (8 bits) of the string is the length, if all of the length bits are set then at offset 256 from the length byte you have another length byte and another section of the string. Rinse, repeat.

          Yes, for very large strings this isn't as efficient as simply using 16 bits at the front of the string (65536 positions in 2 bytes vs 512 positions in 2 bytes) but for small strings - probably the most common case - i

    • by alta ( 1263 )

      It was all good to the end there, then you started sending me these coded ssl certs, and I think you just hacked my computer. damn you smart people and your buffer overflows.

    • by mcvos ( 645701 )

      Allow me to summarize that for the tl;dr crowd:

      C's "everything is a pointer" approach gives you the power to easily do lots of cool stuff, and adding length to a string would break that elegance. But using NUL-terminated strings creates a lot of security problems, not merely limited to buffer overflows, which are really caused by C's backward memory allocation.

  • That's the way it happens in Soviet Russia, too.

    Seriously, though, it's hard to know what language you as a system administrator should use for something like a data logger that has to run continuously (or cron every minute or so) other than C, but then there's the security problem that some user will come up with some weird filename hack to subvert the system.

    • Isn't the filename hack a problem for any language? My simple method for avoiding it is putting all filenames in single-quotes, and filtering out any single-quotes from user input.
  • by phantomfive ( 622387 ) on Tuesday August 02, 2011 @11:33PM (#36968546) Journal
    C. A. R. Hoare, the inventor of Quicksort, also invented the NULL pointer. Something he apologized for [wikipedia.org]:

    I call it my billion-dollar mistake. It was the invention of the null reference in 1965. At that time, I was designing the first comprehensive type system for references in an object oriented language (ALGOL W). My goal was to ensure that all use of references should be absolutely safe, with checking performed automatically by the compiler. But I couldn't resist the temptation to put in a null reference, simply because it was so easy to implement. This has led to innumerable errors, vulnerabilities, and system crashes, which have probably caused a billion dollars of pain and damage in the last forty years.

  • Whatever (Score:5, Funny)

    by Old Wolf ( 56093 ) on Tuesday August 02, 2011 @11:37PM (#36968572)

    Come on , this is complete rubbish___8^)_#;3,2,.3root>^$)(^(943hellomax0984)_))1..l2l2_}[[}{

  • this could have been a perfectly typical and rational IT or CS decision, like the many similar decisions we all make every day

    Actually the tradeoff may not have been rational. The storage bytes saved may have been offset by the extra code bytes necessary for handling unknown length strings. Perhaps this is actually an example of premature optimization, optimizing things before proper profiling and analysis has shown the problem exists and the proposed solution is beneficial.

    • by c0lo ( 1497653 )

      this could have been a perfectly typical and rational IT or CS decision, like the many similar decisions we all make every day

      Actually the tradeoff may not have been rational.

      Actually, the choice was rational [bell-labs.com] (at least, on purpose) - you see, it's not about a single byte, it's about a new data type.

      C treats strings as arrays of characters conventionally terminated by a marker. Aside from one special rule about initialization by string literals, the semantics of strings are fully subsumed by more general rules governing all arrays, and as a result the language is simpler to describe and to translate than one incorporating the string as a unique data type. Some costs accrue from its approach: certain string operations are more expensive than in other designs because application code or a library routine must occasionally search for the end of a string, because few built-in operations are available, and because the burden of storage management for strings falls more heavily on the user.

    • by osu-neko ( 2604 )

      Actually the tradeoff may not have been rational. The storage bytes saved may have been offset by the extra code bytes necessary for handling unknown length strings.

      Not really, no. Having written basic library code for both, it usually requires more code to handle Pascal-style (length+data) strings than C-style (data+null) strings. You save quite a bit of code ("quite a bit" being relative, but I've had to squeeze code into 208 bytes of RAM before) by using the C-style strings most of the time.

  • hmm. marker character, or a length.

    Marker: same type as string, so no need to worry about bit size, start/stop bits or other extraneous. String can be any size and only restricted by available memory. (given the ability to swap darn near unlimited pages in current hardware.... and the ability to virtualize across computers... this means strings have a potentially <i>infinite</i> limit)

    Length: What's the size? What byte order? What bit size? How will this affect communications between pla
    • by Sloppy ( 14984 )

      Length: What's the size? What byte order? What bit size? How will this affect communications between platforms?

      These aren't hard questions, IMHO. Just say the length is an int (or an unsigned int), and then assuming you didn't freak out when someone asked all those very same questions about ints, then you should be fairly happy with the result.

    • by c0lo ( 1497653 )

      Length: What's the size? What byte order? What bit size? How will this affect communications between platforms?

      Adding: how do you pass to you the tail substring? (like: I parsed to here, take over from now on. Oh, yes, on top of the length, deal with another offset).

  • The real problem with the addr+len approach is that now every string becomes a struct, or a structptr.

    This means that when passing a string to a function, either the string takes up two register/stack slots, or you're passing around a const-ptr (but the contents of the struct are not const), which means one more memory access due to pointer indirection.

    x86 and the PDP-11 are register-starved. the x86 has 8 registers, with 4 or 5 available as general-purpose registers.
    The PDP-11 was similar with 8 registers

    • My assembly class in college was on a PDP-11. I've done quite a bit of x86 assembly over the years. I'm confused as to why you think a pascal style string structure pointer requires any more registers or stack than a C character pointer. In assembly if I want the the length I would reference a size_t at the pointer address, and if I want text I would reference a char at pointer+offset where offset is sizeof(size_t).
  • by gmhowell ( 26755 ) <gmhowell@gmail.com> on Tuesday August 02, 2011 @11:50PM (#36968644) Homepage Journal

    FTA:

    We learn from our mistakes, so let me say for the record, before somebody comes up with a catchy but totally misleading Internet headline for this article, that there is absolutely no way Ken, Dennis, and Brian could have foreseen the full consequences of their choice some 30 years ago, and they disclaimed all warranties back then. For all I know, it took at least 15 years before anybody realized why this subtle decision was a bad idea, and few, if any, of my own IT decisions have stood up that long.

    In other words, Ken, Dennis, and Brian did the right thing.

  • Wow, this place has come a long way from a simple news for nerds site. Now, the authors are placing disclaimers specifically addressed to us :)

  • Got it wrong (Score:4, Insightful)

    by Spazmania ( 174582 ) on Wednesday August 03, 2011 @12:05AM (#36968716) Homepage

    It probably wasn't about the bytes. The factors are:

    1. Complexity. Without exception, every variable in C is an integer, a pointer or a struct. A null terminated string is a pointer to a series of integers -- barely one step more complex than a single integer. To keep the string length, you'd have to employ a struct. That or you'd have to create a magic type for strings that's on the same level as integers, pointers and structs. And you don't want to use a magic type because then you can't edit it as an array. Simplicity was important in C -- keep it close to the metal.

    2. Computational efficiency. Many if not most operations on strings don't need to know how long they are. So why suffer the overhead of keeping track? That makes string operations on null terminated strings on average faster than string operations on a string bounded by an integer.

    3. Bytes. It's only one extra byte with a magic type or an advanced topic struct. In both cases with an assumption that the maximum possible length on which the standard string functions will work is 64kb. If you're talking about a more mundane struct then you're talking about an int and a pointer to a block of memory which has an extra set of malloc overhead. That's a lot of extra bytes, not just one.

    For the kind of language C aimed to be -- a replacement for assembly language -- the choice of null terminated strings was both obvious and correct.

    • by PCM2 ( 4486 )

      Beyond those points:

      It is interesting to compare C's approach with that of two nearly contemporaneous languages, Algol 68 and Pascal [Jensen 74]. Arrays in Algol 68 either have fixed bounds, or are `flexible:' considerable mechanism is required both in the language definition, and in compilers, to accommodate flexible arrays (and not all compilers fully implement them.) Original Pascal had only fixed-sized arrays and strings, and this proved confining [Kernighan 81]. Later, this was partially fixed, though the resulting language is not yet universally available.

      C treats strings as arrays of characters conventionally terminated by a marker. Aside from one special rule about initialization by string literals, the semantics of strings are fully subsumed by more general rules governing all arrays, and as a result the language is simpler to describe and to translate than one incorporating the string as a unique data type. Some costs accrue from its approach: certain string operations are more expensive than in other designs because application code or a library routine must occasionally search for the end of a string, because few built-in operations are available, and because the burden of storage management for strings falls more heavily on the user. Nevertheless, C's approach to strings works well.

      And that's coming from Dennis Ritchie, [bell-labs.com] who was there.

    • by Homburg ( 213427 )

      To keep the string length, you'd have to employ a struct.

      No, strings with a listed length would also be pointers to a series of integers - it's just that, instead of giving a value special semantics (0 as end of string), you give a position in the series special semantics (store the length in the first two bytes). In both cases, you need your string-handling functions to be aware of whatever the convention is.

      Computational efficiency. Many if not most operations on strings don't need to know how long they are. So why suffer the overhead of keeping track? That makes string operations on null terminated strings on average faster than string operations on a string bounded by an integer.

      I don't know that that's true. Operations that do need to know the length of the string could be quicker, and I'm not sure that these cases are less frequen

      • by Arlet ( 29997 )

        (store the length in the first two bytes)

        So 65536 byte strings should be enough for anybody ?

        Operations that do need to know the length of the string could be quicker, and I'm not sure that these cases are less frequent. What are the common cases you are thinking of where C-style strings are faster?

        C-style strings are simpler. That's the biggest advantage. For the few cases where performance matters, you can always define your own string type.

      • by osu-neko ( 2604 )

        What are the common cases you are thinking of where C-style strings are faster?

        strcpy(char *d, char *s)
        {
        while ( *d++ = *s++ );
        }

        Challenge: come up with the equivalent for pascal-style strings in a way that doesn't compile into at least twice as much code.

        In fact, aside from strlen, are there any string functions that aren't made at least twice complex by using P-style instead of C-style strings? Most of strlib.c can be implemented as one-liners, assuming C-style strings.

  • by epine ( 68316 ) on Wednesday August 03, 2011 @12:18AM (#36968774)

    Normally I tend to agree with what I've read from PHK, but this one seems wide of the mark. If you involve a *real* C guru in the discussion, I don't think there would be much sentiment toward nixing the sentinel.

    C makes a big deal about the equivalence of pointers and arrays. Plus in C a string also represents every suffix string.

    char string [] = { 't', 'e', 's', 't', '\0' };
    char* cdr_string = string + 1;

    Perfectly valid, as God intended. A string with a length prefix is a hybrid data structure. What is the size of the length structure up front? It can be interesting in C to sort all suffixes of a string, having only one copy of the string itself. Try that with length prefix strings. (The trivial algorithm is far from ideal for large or degenerate character sequences, but it does provide insight into position trees and the Burrows-Wheeler transform.)

    Nor would I blame all the stupid coding errors on the '\0' terminator convention. In C, a determined idiot can mess up just about anything, unless the compiler takes over and does things for you, a la Pascal by another name. If that had been the bias, would be all be using C now, or some other language? Repeat after me: Generativity Rocks. Nanny languages usually manage to bork generativity over. Correct Programming Made Easy never strays far from the subtitle Composition Made Difficult.

    No one who ever read Dijkstra and took him serious ever made a tiny fraction of the stupid mistakes blamed on hapless zero.

    If you want to point to a real steaming pile, strcpy() was designed by a moron with a bad hang-over and no copy of Dijkstra within a 100 mile radius. It was tantamount to declaring "you don't really need to test your preconditions ... what kind of sissy would do that?"

    C is a nice design, as evidenced by how seamlessly the STL was grafted onto C++ at the abstraction layer (at the syntax layer, not so much). The problem with C was always a communication problem. To use C well one must test preconditions on operation validity. To use algebra well one must test preconditions on operation validity.

    Where does PHK lay the blame for the algebraist who made it possible to divide both side of an equation by zero, or multiply an inequality by -1? Preferably with the complete moron who doesn't check preconditions on the validity of the operation. Two thousand years later, now we have a better solution?

    PHK is right about cache hierarchies. By the time cache hierarchies arrived, we had C++ with entirely different string representations.

    For some reason I've never been keen on having a programmer who can't manage to correctly test the precondition for buffer overflow making deep design decisions about little blocks of lead in the radiation path.

    And it's not even much of a burden. As Dijkstra observed, for many algorithms, once you have all your preconditions right and you've got a provable variant, there's often very little left to decide. It actually makes the design of many algorithms simpler in the mode of divide and conquer: first get your preconditions and variant right (you're now half done and you've barely begun to think hard), *then* worry about additional logic constraints (or performance felicitous sequencing of legal alternatives).

    The coders who first try to get their logical requirements correct and then puzzle out the preconditions do indeed make the original task more difficult than not bothering with preconditions at all, supposing there's some kind of accurate measure over crap solutions, which I refuse to concede.

    • Re: (Score:3, Insightful)

      by EvanED ( 569694 )

      If you want to point to a real steaming pile, strcpy() was designed by a moron with a bad hang-over and no copy of Dijkstra within a 100 mile radius. It was tantamount to declaring "you don't really need to test your preconditions ... what kind of sissy would do that?"

      To play Devil's advocate, strcpy cannot check it's precondition. You can't tell whether a pointer you're given is valid, or how much space is left in the buffer.

      (Well, I guess you could go make malloc record far more information than it otherw

  • by Animats ( 122034 ) on Wednesday August 03, 2011 @12:31AM (#36968814) Homepage

    The problem with C isn't strings. It's arrays. Strings are just a special case of arrays.

    Understand that when C came out, it barely had types. "structs" were not typed; field names were just offsets. All fields in all structs, program-wide, had to have unique names. There was no "typedef". There was no parameter type checking on function calls. There were no function pointers. All parameters were passed as "int" or "float", including pointers and chars. Strong typing and function prototypes came years later, with ANSI C.

    This was rather lame, even for the late 1970s. Pascal was much more advanced at the time. Pascal powered much of the personal computer revolution, including the Macintosh. But you couldn't write an OS in Pascal at the time; it made too many assumptions about object formats. In particular, arrays had descriptors which contained length information, and this was incompatible with assembly-language code with other conventions. By design, C has no data layout conventions built into the language.

    Why was C so lame? Because it had to run on PDP-11 machines, which were weaker than PCs. On a PC, at least you had 640Kb. On a PDP-11, you had 64Kb of data space and (on the later PDP-11 models) 64Kb of code space, for each program. The C compiler had to be crammed into that. That's why the original C is so dumb.

    The price of this was a language with a built in lie - arrays are described as pointers. The language has no idea how big an array is, and there's not even a way to usefully talk about array size in C. This is the fundamental cause of buffer overflows. Millions of programs crash every day because of that problem.

    That's how we got into this mess.

    As I point out occasionally, the right answer would have been array syntax like

    int read(int fd, char[n]& buf, size_t n);

    That says buf is an array of length n, passed by reference. There's no array descriptor and no extra overhead, but the language now says what's actually going on. The classic syntax,

    int read(int fd, char* buf, size_t n);

    is a lie - you're not passing a pointer by value, you're passing an array by reference.

    C++ tries to wallpaper over the problem by hiding it under a layer of templates, but the mold always seeps through the wallpaper when a C pointer is needed to call some API.

    • Why was C so lame? Because it had to run on PDP-11 machines, which were weaker than PCs. On a PC, at least you had 640Kb. On a PDP-11, you had 64Kb of data space and (on the later PDP-11 models) 64Kb of code space, for each program.

      Your relative comparisons are a bit off. The Altair from 1975 (the first versions of C were finished around 1973) had a whopping 1KB of memory. The mini computers of the day ran rings around what PCs there were, both in raw power and in memory.

  • by hamster_nz ( 656572 ) on Wednesday August 03, 2011 @12:41AM (#36968860)

    After 25 years of using C, I don't mind the strings being terminated by nulls. If you want to do something else, just don't include string.h.

    Terminating with a null is only a convention - the C language itself has no concept of strings. As others point out, it is either an array of bytes or a pointer to bytes.

    it isn't forced on to you - you don't have to follow it.

    • it isn't forced on to you - you don't have to follow it.

      It's forced in practice by the fact that the entire standard library, and all third-party libraries, all produce and consume null-terminated strings.

      What's far worse is that, since C FFI is the lowest common denominator that we have across various languages, null-terminated strings become the standard way to marshal strings between libraries written in different languages. This means many things: for one, no embedded nulls, which is bad for many scenarios where handling them is desired.

      For another, it means

  • It would have been more urgent to find out where an allocated part of RAM ends.

    Or just like Integers and floats, strings could have been their very own basic type. Essentially leave the implementation of it to the compiler, so it can do range checks. Most C-programmers seem to believe that this is done already.

    BTW range check on integers don't cost anything anymore. I've benchmarked some real-life code using large arrays (doing statistics on it) and range checks didn't cause any slow down. Essentially the c

  • Faster loops (Score:5, Insightful)

    by Sloppy ( 14984 ) on Wednesday August 03, 2011 @12:55AM (#36968932) Homepage Journal

    TFA suggests the decision was to save a byte, but I don't believe that's the main reason it happened.

    If you're traversing a string anyway, what happens is that when you load the data into your register (which you'll be doing anyway, for whatever reason you're traversing the string), you get a status flag set "for free" if it's zero, so that's your loop test right there. Branch if zero. If you have to compare an offset to a length on every iteration, then now you're having to store this offset in another register (great, like I have lots of registers to spare on 1970s CPUs?) and compare (i.e. subtract) to the length which is stored in memory (great, a memory access) or another register (oh great, I need to use another register in the 1970s?!) and the code is bigger and slower.

    It's easy to laugh these days about anyone caring about how many clock cycles a loop takes and whether it uses 2 registers or 4 registers, but this stuff used to be pretty important (and more recently than the 1970s). Kids these days: if you weren't there, you just don't know what it was like.

    BTW, I have a hunch K & R didn't know they were building such an eternal legacy. It's reasonable to speculate that this is still going to be part of systems a hundred years from now, but in 1970 you would have been a mad man to suggest such a thing. (Not that this invalidates TFA's point at all; I'm just making excuses for K&R I guess.)

  • by dutchd00d ( 823703 ) on Wednesday August 03, 2011 @01:52AM (#36969192) Homepage

    If they had gone with the embedded length option we'd be sitting around bitching about how short-sighted it was to use just two bytes for the length. Including how Dennis Ritchie supposedly said "64K strings should be enough for anybody".

"It's like deja vu all over again." -- Yogi Berra

Working...