
The Most Expensive One-Byte Mistake

An anonymous reader writes "Poul-Henning Kamp looks back at some of the bad decisions made in language design, specifically the C/Unix/Posix use of NUL-terminated text strings. 'The choice was really simple: Should the C language represent strings as an address + length tuple or just as the address with a magic character (NUL) marking the end? ... Using an address + length format would cost one more byte of overhead than an address + magic_marker format, and their PDP computer had limited core memory. In other words, this could have been a perfectly typical and rational IT or CS decision, like the many similar decisions we all make every day; but this one had quite atypical economic consequences.'"
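For reference, here is a minimal C sketch of the two representations the article contrasts (the struct name pstr and its layout are invented for the example; it is one plausible address + length form, not the one PHK proposes):

    #include <stdio.h>
    #include <string.h>

    /* Address + magic marker: the choice C/Unix/POSIX actually made. */
    static const char *cstr = "hello";       /* stored as 'h' 'e' 'l' 'l' 'o' '\0' */

    /* Address + length: the road not taken. */
    struct pstr {
        size_t      len;                     /* explicit length, no terminator needed */
        const char *data;                    /* may even contain embedded NUL bytes   */
    };

    int main(void) {
        struct pstr p = { 5, "hello" };
        printf("%zu %zu\n", strlen(cstr), p.len);   /* both print 5 */
        return 0;
    }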
Comments:
  • by phantomfive ( 622387 ) on Wednesday August 03, 2011 @12:33AM (#36968546) Journal
    C. A. R. Hoare, the inventor of Quicksort, also invented the NULL pointer. Something he apologized for [wikipedia.org]:

    I call it my billion-dollar mistake. It was the invention of the null reference in 1965. At that time, I was designing the first comprehensive type system for references in an object oriented language (ALGOL W). My goal was to ensure that all use of references should be absolutely safe, with checking performed automatically by the compiler. But I couldn't resist the temptation to put in a null reference, simply because it was so easy to implement. This has led to innumerable errors, vulnerabilities, and system crashes, which have probably caused a billion dollars of pain and damage in the last forty years.

  • Re:Missed the point (Score:4, Interesting)

    by snowgirl ( 978879 ) on Wednesday August 03, 2011 @12:41AM (#36968596) Journal

    Not to mention the argument for "because space was at a premium" is specious, because either you had an 8-bit length prepended to the string, or you had an 8-bit special value appended to the end of the string. Both ways result in the same space usage.
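    A quick worked example of that space argument (purely illustrative):

        #include <stdio.h>

        int main(void) {
            /* "hello" as a NUL-terminated string: 6 bytes of storage */
            char nul_form[] = { 'h', 'e', 'l', 'l', 'o', '\0' };

            /* "hello" as a length-prefixed (Pascal-style) string: also 6 bytes */
            unsigned char len_form[] = { 5, 'h', 'e', 'l', 'l', 'o' };

            printf("%zu %zu\n", sizeof nul_form, sizeof len_form);   /* prints: 6 6 */
            return 0;
        }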

    From what I read in the summary (I didn't read TFA), this whole thing sounds like a propaganda piece supporting the idea that we should use length+string, presenting it as "this should have been a no-brainer, but the idiots making C screwed up."

    As a nitpicky, pedantic note though: if C had gone with a length+string format, then other languages would have been written around that convention instead, since most of them were designed around the C conventions to increase interoperability in the first place.

  • by Animats ( 122034 ) on Wednesday August 03, 2011 @01:31AM (#36968814) Homepage

    The problem with C isn't strings. It's arrays. Strings are just a special case of arrays.

    Understand that when C came out, it barely had types. "structs" were not typed; field names were just offsets. All fields in all structs, program-wide, had to have unique names. There was no "typedef". There was no parameter type checking on function calls. There were no function pointers. All parameters were passed as "int" or "float", including pointers and chars. Strong typing and function prototypes came years later, with ANSI C.

    This was rather lame, even for the late 1970s. Pascal was much more advanced at the time. Pascal powered much of the personal computer revolution, including the Macintosh. But you couldn't write an OS in Pascal at the time; it made too many assumptions about object formats. In particular, arrays had descriptors which contained length information, and this was incompatible with assembly-language code with other conventions. By design, C has no data layout conventions built into the language.

    Why was C so lame? Because it had to run on PDP-11 machines, which were weaker than PCs. On a PC, at least you had 640Kb. On a PDP-11, you had 64Kb of data space and (on the later PDP-11 models) 64Kb of code space, for each program. The C compiler had to be crammed into that. That's why the original C is so dumb.

    The price of this was a language with a built-in lie - arrays are described as pointers. The language has no idea how big an array is, and there's not even a way to usefully talk about array size in C. This is the fundamental cause of buffer overflows. Millions of programs crash every day because of that problem.

    That's how we got into this mess.

    As I point out occasionally, the right answer would have been array syntax like

    int read(int fd, char[n]& buf, size_t n);

    That says buf is an array of length n, passed by reference. There's no array descriptor and no extra overhead, but the language now says what's actually going on. The classic syntax,

    int read(int fd, char* buf, size_t n);

    is a lie - you're not passing a pointer by value, you're passing an array by reference.

    C++ tries to wallpaper over the problem by hiding it under a layer of templates, but the mold always seeps through the wallpaper when a C pointer is needed to call some API.
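
    A small sketch of that "lie" in action (the function name is made up for the example): inside a function, an array parameter has already decayed to a plain pointer, so its size is gone.

        #include <stdio.h>

        /* 'char buf[64]' in a parameter list is silently rewritten to 'char *buf'. */
        static void takes_array(char buf[64]) {
            printf("inside:  %zu\n", sizeof buf);   /* size of a pointer, e.g. 8 */
        }

        int main(void) {
            char buf[64];
            printf("outside: %zu\n", sizeof buf);   /* 64 */
            takes_array(buf);                       /* the length never travels with the array */
            return 0;
        }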

  • by IICV ( 652597 ) on Wednesday August 03, 2011 @01:46AM (#36968880)

    Everyone misunderstands that poem.

    Robert Frost had a fairly depressing outlook on life, and the point of the poem is that it doesn't matter what road you take.

    I mean, just pay attention to the narrative tense in the last stanza, the one people take to be so life-affirming and "do something different!". The narrator isn't saying "I did this, and I know it was important"; he's saying "I did this, and I think that in the future I'm going to tell people it was important".

    The narrator is a vain, shallow individual who frets about insignificant decisions like this, thinking they will have some gigantic impact on his life, and then later blows those choices up to earth-shattering proportions. This is all despite the fact that half the poem is about how the roads are effectively identical; and in the end, he doesn't even tell us what was important about the path he took, just that it was the "one less traveled by" (which makes no sense! They were "just as fair", they had been "worn ... really about the same", they "both that morning equally lay").

    Basically, if we apply this poem to the current situation, what it's saying is that in alternate 2011 we'd have an article about how null-terminated strings would have been better than Pascal strings. It doesn't matter what path you take, if you're the right kind of person you'll always blow up the significance of it in your mind later.

  • Re:Missed the point (Score:4, Interesting)

    by jeremyp ( 130771 ) on Wednesday August 03, 2011 @06:02AM (#36970024) Homepage Journal

    That's still an arbitrary limit.

    An arbitrary limit equal to the virtual machine size of the computer that was originally targeted.

    The advantages that I see for counted length are:
    - it makes copying easier - you know beforehand how much space to allocate, and how much to copy.
    - it makes certain cases of strcmp() faster - if the lengths don't match, you can assume the strings are different (see the sketch after this list).
    - It makes reverse searches faster.
    - You can put binary in a string.

    - It all but eliminates the possibility of buffer overruns for strings.
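
    A minimal sketch of that length-first comparison shortcut (the struct and function names are invented for the example):

        #include <stdio.h>
        #include <string.h>
        #include <stdbool.h>

        struct cstr { size_t len; const char *data; };   /* hypothetical counted string */

        /* Unequal lengths settle it without ever touching the bytes. */
        static bool cstr_equal(struct cstr a, struct cstr b) {
            return a.len == b.len && memcmp(a.data, b.data, a.len) == 0;
        }

        int main(void) {
            struct cstr a = { 5, "hello" }, b = { 6, "hello!" };
            printf("%d\n", cstr_equal(a, b));   /* prints 0: lengths differ, memcmp never runs */
            return 0;
        }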

    But that must be weighed against the disadvantages, like not being able to take advantage of the CPU's zero-test conditions, and instead having to maintain a counter that eats up a valuable register.

    But lots of CPUs have an instruction a bit like "decrement register and jump if not zero" which can be used for length+data strings.

    Or not being well suited to piped text or multiple threads: you can't just spew the text into an already-nulled area and have it be valid as it comes in.

    With modern character encodings, you can't guarantee that anyway, whatever string format you use. Couple that with the fact that streamed data tends to be read and written in blocks with a length parameter, and the whole advantage is gone. This is why almost all modern languages have some variation on length + data for their strings, plus utilities for manipulating raw byte buffers.

    Getting a free strlen() is NOT an advantage, by the way. In fact, that became a liability when UTF-8 arrived. With a library strlen() function, all you had to do was update the library, but when the compiler was hardcoded to just return the byte count, that wasn't an option.

    Except that strlen() always has counted, and still does count, the number of C chars before the null byte. This is enshrined in the C99 standard. UTF-8 has not changed the implementation of strlen(). Also, gcc and probably many other compilers will normally optimise a call like strlen() down to a few lines of assembler rather than a call into libc, so you'd have to recompile anyway if it did change.
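
    For instance (assuming the source file and the string literal are UTF-8 encoded):

        #include <stdio.h>
        #include <string.h>

        int main(void) {
            const char *s = "naïve";        /* 'ï' occupies two bytes in UTF-8 */
            printf("%zu\n", strlen(s));     /* prints 6: bytes before the NUL, not characters */
            return 0;
        }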

    Sure, one could go to UTF-16 instead, but then there's a lot of wasted space.

    All in all, having worked with both systems, I find more advantages with null-termination.

    There's also a third system for text - linked lists. It doesn't have the disadvantage of an artificial string-length limit, it allows for easy cuts and pastes and even COW speedups, but it requires far more advanced (and thus slower) routines and housekeeping, and it has many of the same disadvantages as byte-counted text. Some text processors have used this as their native string format because of those specific advantages.
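
    A bare-bones sketch of that third representation (a rope-like linked list of counted text pieces; the names are invented here):

        #include <stddef.h>

        /* Each node owns a counted chunk of text; cut and paste is pointer surgery, */
        /* and several documents can share (copy-on-write) the same chunks.          */
        struct piece {
            const char   *data;     /* chunk of bytes, not necessarily NUL-terminated */
            size_t        len;      /* length of this chunk                           */
            struct piece *next;     /* next chunk in the text                         */
        };

        /* Total length is the sum over the chain - no terminator to scan for. */
        size_t piece_len(const struct piece *p) {
            size_t n = 0;
            for (; p != NULL; p = p->next)
                n += p->len;
            return n;
        }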

    I'd still take NUL-terminated for most purposes.

    Most modern languages have a proper string type, and I would always take that over null-terminated char sequences. You can bet that Java's internal implementation of String uses length + data.
