Topics: Programming, Unix

The Most Expensive One-Byte Mistake (594 comments)

An anonymous reader writes "Poul-Henning Kamp looks back at some of the bad decisions made in language design, specifically the C/Unix/Posix use of NUL-terminated text strings. 'The choice was really simple: Should the C language represent strings as an address + length tuple or just as the address with a magic character (NUL) marking the end? ... Using an address + length format would cost one more byte of overhead than an address + magic_marker format, and their PDP computer had limited core memory. In other words, this could have been a perfectly typical and rational IT or CS decision, like the many similar decisions we all make every day; but this one had quite atypical economic consequences.'"
This discussion has been archived. No new comments can be posted.

  • The Road Not Taken (Score:5, Insightful)

    by symbolset ( 646467 ) * on Wednesday August 03, 2011 @12:18AM (#36968460) Journal

    Two roads diverged in a yellow wood,
    And sorry I could not travel both
    And be one traveler, long I stood
    And looked down one as far as I could
    To where it bent in the undergrowth;

    Then took the other, as just as fair,
    And having perhaps the better claim,
    Because it was grassy and wanted wear;
    Though as for that the passing there
    Had worn them really about the same,

    And both that morning equally lay
    In leaves no step had trodden black.
    Oh, I kept the first for another day!
    Yet knowing how way leads on to way,
    I doubted if I should ever come back.

    I shall be telling this with a sigh
    Somewhere ages and ages hence:
    Two roads diverged in a wood, and I—
    I took the one less traveled by,
    And that has made all the difference.

    - Robert Frost, 1916

  • by symbolset ( 646467 ) * on Wednesday August 03, 2011 @12:36AM (#36968564) Journal
    Slashdot is lost.
  • Got it wrong (Score:4, Insightful)

    by Spazmania ( 174582 ) on Wednesday August 03, 2011 @01:05AM (#36968716) Homepage

    It probably wasn't about the bytes. The factors are:

    1. Complexity. Without exception, every variable in C is an integer, a pointer or a struct. A null terminated string is a pointer to a series of integers -- barely one step more complex than a single integer. To keep the string length, you'd have to employ a struct. That or you'd have to create a magic type for strings that's on the same level as integers, pointers and structs. And you don't want to use a magic type because then you can't edit it as an array. Simplicity was important in C -- keep it close to the metal.

    2. Computational efficiency. Many, if not most, string operations don't need to know the string's length, so why pay the overhead of keeping track of it? On average, that makes operations on null-terminated strings faster than operations on strings bounded by an explicit length.

    3. Bytes. It's only one extra byte with a magic type or an advanced-topic struct, and in both cases only on the assumption that the maximum length the standard string functions will ever handle is 64 KB (i.e., a two-byte length field). If you're talking about a more mundane struct, then you're talking about an int plus a pointer to a separately malloc'd block of memory, with all the allocator overhead that implies. That's a lot of extra bytes, not just one.

    For the kind of language C aimed to be -- a replacement for assembly language -- the choice of null terminated strings was both obvious and correct.
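    A rough sketch of the trade-off described above -- the struct type and field names here are invented for illustration, not anything from the standard library:

    #include <stddef.h>

    /* NUL-terminated: a bare pointer; the end is marked in the data itself. */
    const char *cstr = "hello";

    /* Address + length: needs a second field, i.e. a struct (or a built-in
       "magic" string type that C never had). */
    struct lenstr {
        size_t      len;  /* explicit length: extra storage per string */
        const char *ptr;  /* the data no longer needs a terminator     */
    };

    /* Walking a NUL-terminated string needs no length bookkeeping at all. */
    size_t cstr_len(const char *s)
    {
        const char *p = s;
        while (*p)        /* stop at the magic marker */
            ++p;
        return (size_t)(p - s);
    }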

  • by perpenso ( 1613749 ) on Wednesday August 03, 2011 @01:09AM (#36968736)

    They don't look the same to me, these days the "IT" decisions are taken by the MBA type guys, with the sole purpose of maximizing their chances to get more visibility, "exceed objectives" and get a larger bonus/promotion/whatever. Sure they're rational too but what do they have in common with CS?

    Programmer for 20+ years here, BS and MS in CS. I used to share such opinions. Then I went to business school. I really enjoyed business school in part because I was constantly amused by how ignorant and wrong I had been regarding such opinions. May I be bold enough to suggest that the portrayal of MBAs in popular and nerd cultures is about as accurate as the portrayal of programmers in popular and non-nerd cultures.

    None of the above should be interpreted to mean that business school makes one appreciate Dilbert any less. Dilbert is actually pretty popular with MBA types and their professors as well.

  • by epine ( 68316 ) on Wednesday August 03, 2011 @01:18AM (#36968774)

    Normally I tend to agree with what I've read from PHK, but this one seems wide of the mark. If you involve a *real* C guru in the discussion, I don't think there would be much sentiment toward nixing the sentinel.

    C makes a big deal about the equivalence of pointers and arrays. Plus in C a string also represents every suffix string.

    char string[] = { 't', 'e', 's', 't', '\0' };  /* "test" */
    char *cdr_string = string + 1;                 /* "est" -- a suffix, sharing the same storage */

    Perfectly valid, as God intended. A string with a length prefix is a hybrid data structure. What is the size of the length field up front? It can be interesting in C to sort all suffixes of a string, having only one copy of the string itself. Try that with length-prefix strings. (The trivial algorithm is far from ideal for large or degenerate character sequences, but it does provide insight into position trees and the Burrows-Wheeler transform.)
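    For instance, a minimal sketch of that suffix trick -- the naive O(n^2 log n) version, with a made-up sample string, purely for illustration:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* Each element is just a char* into the one shared buffer; strcmp walks
       every suffix out to the single '\0' at the end of that buffer. */
    static int cmp_suffix(const void *a, const void *b)
    {
        return strcmp(*(const char *const *)a, *(const char *const *)b);
    }

    int main(void)
    {
        char text[] = "banana";            /* the only copy of the data    */
        size_t n = strlen(text);
        const char *suffix[sizeof text];

        for (size_t i = 0; i < n; i++)
            suffix[i] = text + i;          /* "banana", "anana", "nana"... */

        qsort(suffix, n, sizeof suffix[0], cmp_suffix);

        for (size_t i = 0; i < n; i++)
            puts(suffix[i]);               /* sorted, zero string copies   */
        return 0;
    }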

    Nor would I blame all the stupid coding errors on the '\0' terminator convention. In C, a determined idiot can mess up just about anything, unless the compiler takes over and does things for you, a la Pascal by another name. If that had been the bias, would we all be using C now, or some other language? Repeat after me: Generativity Rocks. Nanny languages usually manage to bork generativity over. Correct Programming Made Easy never strays far from the subtitle Composition Made Difficult.

    No one who ever read Dijkstra and took him seriously ever made a tiny fraction of the stupid mistakes blamed on the hapless zero.

    If you want to point to a real steaming pile, strcpy() was designed by a moron with a bad hang-over and no copy of Dijkstra within a 100 mile radius. It was tantamount to declaring "you don't really need to test your preconditions ... what kind of sissy would do that?"

    C is a nice design, as evidenced by how seamlessly the STL was grafted onto C++ at the abstraction layer (at the syntax layer, not so much). The problem with C was always a communication problem. To use C well one must test preconditions on operation validity. To use algebra well one must test preconditions on operation validity.

    Where does PHK lay the blame for the algebraist who made it possible to divide both sides of an equation by zero, or multiply an inequality by -1? Preferably with the complete moron who doesn't check preconditions on the validity of the operation. Two thousand years later, now we have a better solution?

    PHK is right about cache hierarchies. By the time cache hierarchies arrived, we had C++ with entirely different string representations.

    For some reason I've never been keen on having a programmer who can't manage to correctly test the precondition for buffer overflow making deep design decisions about little blocks of lead in the radiation path.

    And it's not even much of a burden. As Dijkstra observed, for many algorithms, once you have all your preconditions right and you've got a provable variant, there's often very little left to decide. It actually makes the design of many algorithms simpler in the mode of divide and conquer: first get your preconditions and variant right (you're now half done and you've barely begun to think hard), *then* worry about additional logic constraints (or performance felicitous sequencing of legal alternatives).

    The coders who first try to get their logical requirements correct and then puzzle out the preconditions do indeed make the original task more difficult than not bothering with preconditions at all, supposing there's some kind of accurate measure over crap solutions, which I refuse to concede.

  • Faster loops (Score:5, Insightful)

    by Sloppy ( 14984 ) on Wednesday August 03, 2011 @01:55AM (#36968932) Homepage Journal

    TFA suggests the decision was to save a byte, but I don't believe that's the main reason it happened.

    If you're traversing a string anyway, what happens is that when you load the data into your register (which you'll be doing anyway, for whatever reason you're traversing the string), you get a status flag set "for free" if it's zero, so that's your loop test right there. Branch if zero. If you have to compare an offset to a length on every iteration, you now have to keep that offset in another register (great, like I have lots of registers to spare on 1970s CPUs?) and compare (i.e., subtract) it against the length, which is stored in memory (great, a memory access) or another register (oh great, I need to use another register in the 1970s?!), and the code is bigger and slower.
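    A sketch of the two loops being contrasted -- the function names are made up for illustration:

    #include <stddef.h>

    /* The terminator doubles as the loop test: load the byte you were going
       to load anyway, the zero flag falls out for free, branch if zero. */
    char *copy_nul(char *dst, const char *src)
    {
        char *d = dst;
        while ((*d++ = *src++) != '\0')
            ;                              /* nothing else to track */
        return dst;
    }

    /* With an explicit length, every iteration also maintains and compares a
       counter -- one more register (or memory access) on a 1970s machine. */
    char *copy_counted(char *dst, const char *src, size_t len)
    {
        for (size_t i = 0; i < len; i++)
            dst[i] = src[i];
        return dst;
    }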

    It's easy to laugh these days about anyone caring about how many clock cycles a loop takes and whether it uses 2 registers or 4 registers, but this stuff used to be pretty important (and more recently than the 1970s). Kids these days: if you weren't there, you just don't know what it was like.

    BTW, I have a hunch K&R didn't know they were building such an eternal legacy. It's reasonable to speculate that this is still going to be part of systems a hundred years from now, but in 1970 you would have been a madman to suggest such a thing. (Not that this invalidates TFA's point at all; I'm just making excuses for K&R I guess.)

  • by EvanED ( 569694 ) <{evaned} {at} {gmail.com}> on Wednesday August 03, 2011 @02:30AM (#36969074)

    If you want to point to a real steaming pile, strcpy() was designed by a moron with a bad hang-over and no copy of Dijkstra within a 100 mile radius. It was tantamount to declaring "you don't really need to test your preconditions ... what kind of sissy would do that?"

    To play Devil's advocate, strcpy cannot check its precondition. You can't tell whether a pointer you're given is valid, or how much space is left in the buffer.

    (Well, I guess you could go make malloc record far more information than it otherwise has to, and make strcpy grovel through that and some other data, but even I don't think that'd have been worth it. And I'm pretty far on the side of "why the heck are we using languages that are as unsafe as C".)
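    For what it's worth, one way to make the precondition checkable is to change the interface so the caller has to pass the destination's capacity -- roughly the direction the later strlcpy (BSD) and strcpy_s (C11 Annex K) designs took. A hedged sketch, not the actual library functions:

    #include <stddef.h>

    /* Copies at most cap-1 bytes and always NUL-terminates, so the only
       precondition left to the caller is "cap really is the size of dst".
       Name and exact behavior are invented here; see strlcpy/strcpy_s for
       the real designs. */
    size_t bounded_copy(char *dst, size_t cap, const char *src)
    {
        size_t i = 0;
        if (cap == 0)
            return 0;                      /* nothing we can safely do */
        while (i + 1 < cap && src[i] != '\0') {
            dst[i] = src[i];
            i++;
        }
        dst[i] = '\0';                     /* always terminate */
        return i;                          /* bytes copied, excluding the NUL */
    }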

  • by stderr_dk ( 902007 ) on Wednesday August 03, 2011 @02:32AM (#36969080) Homepage Journal

    Poor guy. I guess sooner or later he's going to have to learn how to manage his memory and understand how the underlying physical hardware works. That must be a real toughie for anyone who learned to "program" in the Java/C# world.

    Yeah, clearly PHK [wikipedia.org] doesn't know anything about memory allocation. (Except for the malloc library he wrote for FreeBSD...)

    Maybe he should RTFM.

    I don't have a FreeBSD system at hand, but I wouldn't be surprised if the malloc page was written by PHK.

  • by snowgirl ( 978879 ) on Wednesday August 03, 2011 @04:52AM (#36969742) Journal

    If we were to switch now, is that the compatibility you're referring to? Well sure.

    But nobody's talking about switching now; the point of the topic is that C should have been designed differently. In those days there was very little backwards compatibility to worry about.

    And if it had been decided to be a 1-byte length + data, and everyone had used it like that and assumed the full 8 bits were available for the length, then switching to a variable-byte length encoding would have created an incompatibility. The incompatibility I speak of is the hypothetical one of switching from a fixed 1-byte length encoding to a variable-byte length encoding.

    "They could have just used variable length encoding from the beginning." Sure, and they could have programmed everything in Java from the start... the idea of a variable length encoding would have been over-engineering the problem that they were facing.

  • by looie ( 9995 ) <michael@trollope.org> on Wednesday August 03, 2011 @08:14AM (#36970650) Homepage

    Not sure where you took your "poetical exegesis" class, but you should ask for a refund.

    The narrator as "vain, shallow individual" is entirely a character pulled out of your hindquarters, as there is nothing in the text of the poem to lead to that conclusion.

    The poem is simply a reflection on how we, as individuals, make choices in life. Some of us choose to take the direction taken by most of those around us. That might be university, family, job, retirement in FL. Some of us choose to turn aside from that direction and try another path. Programming a PDP to play "Space Travel," for example. Or writing an operating system "just for fun."

    Frost's suggestion is that these choices of path may seem insignificant at the time -- both paths being nearly the same; but that, as "way leads on to way," there's no going back and thus we may find ourselves down a path that leads to unexpected places. When Linus Torvalds wrote Linux, he could not know that "the path less traveled" would lead to fame and fortune, literally. The college kids who created Slashdot could not know it would make them rich.

    In fact, the point of the poem is exactly that it does matter which path you take. But that you don't always know how your choice is going to turn out. Frost himself might have continued his career as a teacher, a stable and certain means of supporting his family. Instead, he chose to focus on his poetry. He took a chance. And it worked well for him.

    mp

  • by cecille ( 583022 ) on Wednesday August 03, 2011 @10:26AM (#36972332)

    Would anyone care to join me
    in flicking a few pebbles in the direction
    of teachers who are fond of asking the question:
    "what is the poet trying to say?"

    as if Thomas Hardy and Emily Dickinson
    had struggled but ultimately failed in their efforts -
    inarticulate wretches that they were,
    biting their pens and staring out of the windows for a clue.

    Yes, it seems that Whitman, Amy Lowell
    and the rest could only try and fail,
    but we in Mrs. Parker's third-period English class
    here at Springfield High will succeed

    with the help of these study questions
    in saying what the poor poet could not,
    and we will get all this done before
    that orgy of egg salad and tuna fish known as lunch.

    -- from Billy Collins, "The Effort"
