
The Most Expensive One-Byte Mistake

An anonymous reader writes "Poul-Henning Kamp looks back at some of the bad decisions made in language design, specifically the C/Unix/Posix use of NUL-terminated text strings. 'The choice was really simple: Should the C language represent strings as an address + length tuple or just as the address with a magic character (NUL) marking the end? ... Using an address + length format would cost one more byte of overhead than an address + magic_marker format, and their PDP computer had limited core memory. In other words, this could have been a perfectly typical and rational IT or CS decision, like the many similar decisions we all make every day; but this one had quite atypical economic consequences.'"


  • Missed the point (Score:5, Informative)

    by mgiuca ( 1040724 ) on Wednesday August 03, 2011 @12:24AM (#36968492)

    Interesting, but I think this article largely misses the point.

    Firstly, it makes it seem like the address+length format is a no-brainer, but there are quite a lot of problems with that. It would have had the undesirable consequence of making a string larger than a pointer. Alternatively, it could be a pointer to a length+data block, but then it wouldn't be possible to take a suffix of a string by moving the pointer forward. Furthermore, if they had chosen a one-byte length, as the article so casually suggests is the correct solution (like Pascal), it would have had the insane limit of 255-byte strings, with no compatible way to make a string any longer. (A size_t length would make more sense.)

    Interoperating between languages would also be more complex -- right now, a char* is a char*. If we used a length field, how many bytes would it be? What endianness? Would the length come first or last? How many implementations would trip up on strings longer than 127 bytes by treating the length as a signed quantity? In some ways, it is nice that getaddrinfo takes a NUL-terminated char* and not some more complicated monster. I'm not saying this makes NUL-termination the right decision, but it certainly has a number of advantages over addr+length.
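
    For concreteness, here is a minimal sketch of the two representations being weighed (the struct and its field names are mine, purely illustrative):

        #include <stddef.h>

        struct lstr {               /* address + length: a two-field tuple */
            const char *data;       /* need not be NUL-terminated */
            size_t      len;        /* how wide? which endianness on the wire? */
        };

        void demo(void) {
            /* NUL-terminated: a string is just an address. */
            const char *s1 = "hello";        /* 'h','e','l','l','o','\0' in memory */
            const char *tail1 = s1 + 2;      /* suffix "llo" is one pointer bump */

            /* The tuple form also takes suffixes cheaply, but only because the
               length travels beside the data instead of being prefixed to it --
               the length-prefixed-block variant cannot do this at all. */
            struct lstr s2    = { "hello", 5 };
            struct lstr tail2 = { s2.data + 2, s2.len - 2 };
            (void)tail1; (void)tail2;
        }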

    Secondly, this article puts the blame on the C language. It misses the historical step of B, which had the same design decision (by the same people), except it used ASCII 4 (EOT) to terminate strings. I think switching to NUL was a good decision ;)

    The hardware, performance, and compiler-development costs the article cites are all valid. But the security-costs section focuses on the buffer-overflow issue, which is beside the point. gets is a very bad idea, and it would be regardless of whether C used NUL-terminated strings or addr+len strings. The decision that led to all these buffer-overflow problems is that the C library tends to use a "you allocate, I fill" model rather than an "I allocate and fill" model (strdup being one of the few exceptions). That has nothing to do with the NUL terminator.
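
    To make the two allocation models concrete (a hedged sketch; the overflow is deliberate, and the function name is made up):

        #include <stdlib.h>
        #include <string.h>

        void demo(const char *user_input) {
            /* "You allocate, I fill": the caller picks a size and the callee
               trusts it -- the pattern behind classic overflows, with or
               without NUL termination. */
            char dst[8];
            strcpy(dst, user_input);          /* overflows for inputs > 7 bytes */

            /* "I allocate and fill": the callee sizes the buffer itself, so
               there is nothing to overrun.  strdup is one of the few libc
               routines built this way. */
            char *copy = strdup(user_input);
            free(copy);                       /* free(NULL) is a safe no-op */
        }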

    What the article missed were the real security problems caused by the NUL terminator. The obvious one: if you forget to NUL-terminate a string, anything that traverses it will read past the end of the buffer for who knows how long. The author blames gets, but this isn't why gets is bad -- gets correctly NUL-terminates the string. And there are sneakier, subtler NUL-termination problems that aren't buffer overflows at all. A couple of years back, a vulnerability was found in Microsoft's crypto libraries (I don't have a link unfortunately) affecting every web browser except Firefox (which has its own). They allowed NUL bytes in domain names and used strcmp to compare domain names when checking certificates. This meant that "google.com" and "google.com\0.malicioushacker.com" compared equal, so if I got a certificate for "*.com\0.malicioushacker.com" I could use it to impersonate any legitimate .com domain. That would have been a more interesting case to mention than merely equating "NUL-termination problem" with "buffer overflow".
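
    The failure mode is easy to reproduce with strcmp alone (the domains below are made up; the real bug was in certificate name checks):

        #include <stdio.h>
        #include <string.h>

        int main(void) {
            /* 31 bytes of data: an embedded NUL hides the attacker's real
               domain from anything that walks to the first '\0'. */
            const char presented[] = "google.com\0.malicioushacker.com";
            const char expected[]  = "google.com";

            /* strcmp stops at the first NUL, so these compare equal even
               though the certificate actually names a different host. */
            if (strcmp(presented, expected) == 0)
                printf("match -- certificate accepted\n");
            return 0;
        }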

  • Re:Missed the point (Score:5, Informative)

    by snowgirl ( 978879 ) on Wednesday August 03, 2011 @12:48AM (#36968630) Journal

    I'm correcting myself here... apparently they weren't considering a 255-byte limit but a 65535-byte limit, which would have increased the size overhead by one byte.
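
    A sketch of that two-byte variant, for comparison (the struct is hypothetical, not anything that was actually proposed verbatim):

        #include <stdint.h>
        #include <string.h>

        /* A 16-bit length prefix: one byte more overhead than a NUL
           terminator, a hard cap of 65535 bytes -- but embedded NULs are
           fine and strlen is free. */
        struct pstr {
            uint16_t      len;
            unsigned char data[65535];
        };

        void pstr_set(struct pstr *p, const void *bytes, uint16_t n) {
            memcpy(p->data, bytes, n);    /* no terminator to append */
            p->len = n;
        }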

  • by gmhowell ( 26755 ) <gmhowell@gmail.com> on Wednesday August 03, 2011 @12:50AM (#36968644) Homepage Journal

    FTA:

    We learn from our mistakes, so let me say for the record, before somebody comes up with a catchy but totally misleading Internet headline for this article, that there is absolutely no way Ken, Dennis, and Brian could have foreseen the full consequences of their choice some 30 years ago, and they disclaimed all warranties back then. For all I know, it took at least 15 years before anybody realized why this subtle decision was a bad idea, and few, if any, of my own IT decisions have stood up that long.

    In other words, Ken, Dennis, and Brian did the right thing.

  • Re:Missed the point (Score:5, Informative)

    by dbc ( 135354 ) on Wednesday August 03, 2011 @01:02AM (#36968692)

    Oh, Lordy, if you had ever programmed in a language with a 255-character limit for strings, you would praise $DEITY every time you used a C string. Dealing with length-limited strings is the biggest PITA of any senseless, time-wasting programming task.

    Suppose C had originally given strings a length. The only thing that makes sense is for the length count to be the same size as a pointer, so that a string could span all of memory. By long-standing C convention (though never an actual guarantee of the standard), a long was large enough to hold a pointer cast into it, so string length computations would all have become longs. Not such a big deal for most of life... until... 64-bit addressing. Then all sorts of string breakage occurs.
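
    The trap in sketch form (LP64 vs. LLP64 is the modern example; traditional C code did assume a pointer fit in a long, but Win64 broke that assumption):

        #include <stdio.h>

        /* A hypothetical counted string as imagined above: the length is a
           long so it can describe "all of memory". */
        struct cstr {
            long  len;     /* 64 bits on LP64 (Linux/x86-64), 32 on LLP64 (Win64) */
            char *data;    /* 64 bits on both */
        };

        int main(void) {
            /* On an LLP64 system, len can no longer describe every string
               the pointer can reach -- the breakage predicted above. */
            printf("sizeof(long)=%zu  sizeof(char*)=%zu\n",
                   sizeof(long), sizeof(char *));
            return 0;
        }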

    The bottom line is that in an application programming language, strings need to be atomic, as they are in Python. You just should not care how strings are implemented, and you should never worry about a length limit. The trouble is, C is a systems programming language, so it is imperative that the language allow direct access to the bit-level implementation. If you choose to use a systems programming language for application programming, well, then it sucks to be you. So why did we do that for so long? Because all the other alternatives were worse.

    Hell, I've used languages where the statement separator was a 12-11-0-7-8-9 punch. (Bonus points if you can tell me what that is and how to make one.) So a NUL terminated string looks positively modern compared to that.

  • Re:Missed the point (Score:4, Informative)

    by arth1 ( 260657 ) on Wednesday August 03, 2011 @01:25AM (#36968804) Homepage Journal

    That's still an arbitrary limit.

    The advantages that I see for counted length are:
    - It makes copying easier: you know beforehand how much space to allocate and how much to copy.
    - It makes certain cases of strcmp() faster: if the lengths don't match, you know the strings are different.
    - It makes reverse searches faster.
    - You can put binary data in a string.

    But that must be weighed against the disadvantages: not being able to take advantage of the CPU's zero-test conditions, instead having to maintain a counter that eats up a valuable register; having to convert text blocks in order to print them; and being poorly suited to piped text or multiple threads -- you can't just spew the text into an already-nulled area and have it be valid as it comes in; you have to update a length counter for every byte you make available. And... and...
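
    The "spew into an already-nulled area" point in code (a single-writer sketch; a real multi-threaded version would also need memory barriers):

        #include <string.h>

        static char buf[4096];            /* static storage is pre-zeroed */

        /* Append bytes to a NUL-terminated buffer.  Because everything past
           the data is already '\0', a reader walking buf at any moment sees
           a well-formed (if short) string -- there is no separate length
           field to update in lockstep with every byte. */
        void append(size_t *pos, const char *src, size_t n) {
            memcpy(buf + *pos, src, n);   /* caller keeps n < sizeof buf - *pos */
            *pos += n;                    /* buf[*pos] is still '\0' */
        }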

    Getting a free strlen() is NOT an advantage, by the way. In fact, it became a liability when UTF-8 arrived. With a library strlen() function, all you had to do was update the library; with a stored count hardwired to mean bytes, that wasn't an option. Sure, one could go to UTF-16 instead, but then there's a lot of wasted space.
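
    The byte-count versus character-count split in miniature (utf8_strlen is my name, not a libc function):

        #include <stddef.h>

        /* strlen("h\xc3\xa9llo") is 6: the e-acute is two bytes in UTF-8.
           Counting code points instead means skipping continuation bytes,
           which all have the bit pattern 10xxxxxx. */
        size_t utf8_strlen(const char *s) {
            size_t n = 0;
            for (; *s != '\0'; s++)
                if (((unsigned char)*s & 0xC0) != 0x80)
                    n++;
            return n;
        }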

    All in all, having worked with both systems, I find more advantages with null-termination.

    There's also a third system for text: linked lists. It doesn't have the disadvantage of an artificial string-length limit, and it allows easy cuts and pastes and even COW speedups, but it requires far more advanced (and thus slower) routines and housekeeping, and it has many of the same disadvantages as byte-counted text. Some text processors have used it as their native string format because of those specific advantages.
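
    That third system, reduced to its bones (field names and chunk size are mine):

        #include <stddef.h>

        /* Text as a linked list of chunks: no length limit, and cut-and-paste
           is pointer surgery, but every operation chases pointers and pays a
           per-chunk header -- the housekeeping cost mentioned above. */
        struct chunk {
            struct chunk *next;
            size_t        len;          /* bytes used in data[] */
            char          data[64];
        };

        size_t text_len(const struct chunk *c) {
            size_t total = 0;
            for (; c != NULL; c = c->next)
                total += c->len;
            return total;
        }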

    I'd still take NUL-terminated for most purposes.

  • by hamster_nz ( 656572 ) on Wednesday August 03, 2011 @01:41AM (#36968860)

    After 25 years of using C, I don't mind the strings being terminated by nulls. If you want to do something else, just don't include string.h.

    Terminating with a NUL is only a convention -- the C language itself has no concept of strings. As others point out, a string is just an array of bytes or a pointer to bytes.

    It isn't forced onto you; you don't have to follow it.
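
    In practice that means carrying lengths yourself and skipping the convention entirely (a sketch, not a recommendation):

        #include <stddef.h>

        /* A byte buffer with an explicit length: no terminator convention,
           no string.h, embedded zero bytes welcome. */
        struct buf {
            const unsigned char *data;
            size_t               len;
        };

        int buf_eq(const struct buf *a, const struct buf *b) {
            if (a->len != b->len)
                return 0;
            for (size_t i = 0; i < a->len; i++)
                if (a->data[i] != b->data[i])
                    return 0;
            return 1;
        }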

  • by j. andrew rogers ( 774820 ) on Wednesday August 03, 2011 @01:53AM (#36968914)

    As a nitpick, this poem is not from 1920. I have an original copy that was inscribed by the owner in 1919.

    According to Wikipedia, the poem was originally published in 1916. The 1920 printing was a second edition.

  • Re:Missed the point (Score:2, Informative)

    by spazdor ( 902907 ) on Wednesday August 03, 2011 @02:55PM (#36975638)

    would have, could have.
