Databases Programming

Falsehoods Programmers Believe About Names 773

Posted by timothy
from the can't-we-stick-to-slashdot-user-ids? dept.
Jamie points out this interesting article about how hard it is for programmers to get names right. Since software ultimately is used by and for humans, and we humans are pretty tightly linked to our names (whatever the language, spelling, or orthography), this is a big deal. This piece notes some of the ways that names get mishandled, and suggests rules of thumb (in the form of anti-suggestions) to encourage programmers to handle names more gracefully.
This discussion has been archived. No new comments can be posted.

  • Dumbfuck summary (Score:5, Insightful)

    by oldhack (1037484) on Thursday June 17, 2010 @09:28PM (#32609220)
    Names of what?!
  • by Vellmont (569020) on Thursday June 17, 2010 @09:40PM (#32609282) Homepage

    Software is NOT designed to be perfect and cover every case. Have a numeral in your name? Too bad. Need some names to be case sensitive, and others case insensitive? Sucks to be you. Have a 200 character name that doesn't fit in the 100 characters the designers thought no crazy person would ever have? Tough.

    I started reading through the list, and it's just ridiculous. There are a few good points, like not assuming that names don't change or that names are unique. But they're so obvious that the vast majority of the time it's not a big problem. More often it's just a matter of training the data edit/entry folks how to change someone's name, or how to not assume a name is a sole identifier.

    But assuming the worst and trying to design a system that'll allow people's names to be Chinese characters when you don't do business in China, have presence in China, or ever ever plan to? That's ridiculous. Software doesn't have to be perfect out of the chute. It should be adaptable though if some unforeseen shortcoming becomes a larger problem. Gee, I guess if you ever chose to do business in China and need Chinese character names you might have to re-write part of the damn software. Oh well, that's what software developers are FOR!

    If you don't even HAVE a name, then I submit you're crazier than the artist formerly known as the artist formerly known as Prince. At least HE had a name, though it was an unpronounceable symbol. The world can't accommodate every possibility, and software is no exception.

  • by Trepidity (597) <delirium-slashdotNO@SPAMhackish.org> on Thursday June 17, 2010 @09:45PM (#32609312)

    He's essentially arguing that, because names vary a lot and are complex, your software should never do anything useful with them. Sorry, but that's a stupid answer. In a lot of systems, being able to sort by surname may well be more important than being able to handle people who claim they have no surname.

    Of course, you shouldn't gratuitously do stupid things, and interfaces should aim to be relatively clear. But most people can figure out how to enter their names into relatively standardized forms, and those that don't should probably figure out how.

  • by lennier (44736) on Thursday June 17, 2010 @09:57PM (#32609372) Homepage

    But assuming the worst and trying to design a system that'll allow people's names to be Chinese characters when you don't do business in China, have presence in China, or ever ever plan to? That's ridiculous.

    Or sell in New Zealand, or Australia, or anywhere else in the Pacific, or deal with immigrants, or be used by anyone who has a Chinese name?

    This is the Internet now. Welcome to it.

  • by Vellmont (569020) on Thursday June 17, 2010 @10:02PM (#32609388) Homepage


    Then it's designed to fail

    Anything ever designed is designed to fail. This applies to bridges, the pyramids, and all software. This belief you have that software doesn't have to be maintained is as ridiculous as the idea that a bridge or any physical structure doesn't have to be maintained. Software lives and dies like anything else. Nothing lives forever.

  • Most Chinese emigrants to countries that use a Roman alphabet are perfectly capable of writing their name in Roman characters if they need to. If they weren't, they wouldn't have been able to get visas and get into the country in the first place.

  • by thepainguy (1436453) <thepainguy@gmail.com> on Thursday June 17, 2010 @10:06PM (#32609420) Homepage
    My last name is O'Leary and over the past 5 years web sites have not gotten any better, and arguably have gotten worse, at handling the apostrophe in my last name.

    Help me Slashdot, you're my only hope.
  • by PrecambrianRabbit (1834412) on Thursday June 17, 2010 @10:13PM (#32609444)

    You're overreacting (I know, I know, "welcome to the Internet"). Software should behave in some sane, safe manner given any input. Sometimes, the sane thing to do is to throw an error, or say "Sorry, Dave, I can't do that."

    In particular, systems don't necessarily have to shoehorn insane data into their processing. To use a relevant example, simply because Prince wants to upload a PNG in the "Name" field doesn't mean that the software has to let him. Rejecting this case does not doom said software system to "become a botnet" or "leave a trail of broken data."

  • Well Duh (Score:5, Insightful)

    by Saint Stephen (19450) on Thursday June 17, 2010 @10:22PM (#32609492) Homepage Journal

    First thing I learned back in 1993 when I got started.

    1) George Foreman has five boys named George Foreman. Your database better be able to handle that.
    2) Your database better be able to handle Cher (no last name).
    3) People are not required to have Social Security numbers. (it's an optional program - you don't have to participate).
    4) Not everyone's last name starts with a capital letter.
    5) Mexican people's names break ASCII (the tilde n).
    6) People named O'Grady have a hard time getting their name in a database sometimes and have a hard time getting their name passed via a URL sometimes and generally mess stuff up.
    7) People from Sri Lanka will break your name length limits.
    8) Some people's name is only a single letter.
    9) Some people go by their middle name god damn it! :-)
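
    Points 1 and 6 can be illustrated with a minimal sketch, assuming Python and its bundled sqlite3 and urllib.parse modules. The table name `people` and the helper functions are made up for illustration; a real system would have its own schema.

```python
import sqlite3
from urllib.parse import quote, unquote

def demo_names_table():
    """Return an in-memory database with a hypothetical `people` table."""
    conn = sqlite3.connect(":memory:")
    # The surrogate integer key, not the name, identifies a row, so
    # several people called "George Foreman" can coexist.
    conn.execute(
        "CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT NOT NULL)"
    )
    return conn

def add_person(conn, name):
    # The "?" placeholder lets the driver handle quoting, so an
    # apostrophe in the name is stored as ordinary data.
    cur = conn.execute("INSERT INTO people (name) VALUES (?)", (name,))
    return cur.lastrowid

def count_named(conn, name):
    row = conn.execute(
        "SELECT COUNT(*) FROM people WHERE name = ?", (name,)
    ).fetchone()
    return row[0]

def name_to_url_component(name):
    # Percent-encode everything outside the unreserved set, including "'".
    return quote(name, safe="")
```

    Passing the name as a bound parameter and percent-encoding it for URLs is one common way to keep apostrophes from "messing stuff up"; unquote() reverses the URL encoding on the receiving side.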

  • by jrumney (197329) on Thursday June 17, 2010 @10:23PM (#32609502) Homepage

    Generally when building a form that asks for a name, I create a first name field and a surname field.

    And you fail right there. For some people, their first name is their surname. Others don't have a surname. Some of those without a surname may use a patronym or matronym as part of their full name, but you never use it to address them without their personal name. Some people have a first name and second name that always go together, so parsing a first name out of the full name, or disallowing whitespace in the first name field is another common fail.

    Names are complex. Don't assume it doesn't matter because your database is only intended for local use, because unless you live somewhere as closed as North Korea, there are immigrants in your town that break your assumptions.

  • by justfred (63412) on Thursday June 17, 2010 @10:23PM (#32609504) Homepage

    I code to spec. The product and marketing departments write the spec (what little there is); the QA department amends the spec with overly specific test cases. I suggest that the spec is incomplete and won't handle...but I'm told, just code it to spec. I recommend changes, but we don't have time for edge cases. I point out potential problems, but we're unlikely to get any of those. I warn of potential compatibility problems but we don't care. Are you just trying to be difficult? If there's a problem QA will catch it. The project is overdue already, and by the way here are some new requirements that need to make it in, and we can't change the release date because we already promised the stockholders. Why is your code so complicated, my twelve-year-old kid could write this.

    It's not my fault. I code to spec.

  • by yyxx (1812612) on Thursday June 17, 2010 @10:35PM (#32609554)

    Software shouldn't have to satisfy every whim and eccentricity. If you don't have a well-defined first name and last name that consists of extended alphanumeric characters in Unicode and starts with a letter, well, then get one, OK? And while you're at it, come up with decent Romanized and ASCII (traditional Latin) versions of your name, conformant with one of the common Romanization systems of your language; you will need that too if you want to travel internationally. Single letter names are also a potential problem because they are confusable with abbreviations, so consider using a variant spelling ("O" -> "Oh").

    This isn't because programmers have some sort of hangups about names, it's because people themselves need to be able to refer to individuals in some reasonable and standardized way, they need to be able to write your name, alphabetize it, and correct errors.

  • ...so what? (Score:5, Insightful)

    by SanityInAnarchy (655584) <ninja@slaphack.com> on Thursday June 17, 2010 @10:42PM (#32609580) Journal

    It seems to me that most misconceptions about names can be fixed by the following:

    Allow a single, Unicode-enabled field of "unlimited" length (let's say 4 kilobytes) which represents "name". Several would be defined by different roles -- "Real name", "Nickname", "login", where only login (sometimes simply an email address) is required to be globally unique.
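
    A rough sketch of that layout, assuming Python with the bundled sqlite3 module. The `users` table, its column names, and the 4096-byte cap are illustrative choices, not anything the original article prescribes.

```python
import sqlite3

MAX_NAME_BYTES = 4096  # the "unlimited" cap suggested above, as an assumption

SCHEMA = """
CREATE TABLE users (
    id        INTEGER PRIMARY KEY,
    login     TEXT NOT NULL UNIQUE,  -- the only globally unique name
    real_name TEXT,                  -- opaque Unicode text, may repeat
    nickname  TEXT                   -- opaque Unicode text, may repeat
)
"""

def make_db():
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    return conn

def add_user(conn, login, real_name=None, nickname=None):
    """Insert a user; return False if the login is already taken."""
    for value in (real_name, nickname):
        if value is not None and len(value.encode("utf-8")) > MAX_NAME_BYTES:
            raise ValueError("name exceeds the size cap")
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO users (login, real_name, nickname) "
                "VALUES (?, ?, ?)",
                (login, real_name, nickname),
            )
        return True
    except sqlite3.IntegrityError:
        return False
```

    Only `login` carries a uniqueness constraint here, so two users can share a real name while still being distinguishable.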

    Now let's look at what that breaks:

    First, #1, 2, 4, and 5. How am I supposed to avoid assuming these? People should be allowed to enter an arbitrary number of names for themselves? I suppose that's possible, but it immediately kills most of the potential uses of this data. If I want to set a nickname that goes with my forum posts, say, what good is it for me to have five nicknames? Seems like the only potential use would be making people easy to find by real name -- so, a social network.

    #6 -- surely 4k is enough, but this is also not a terribly difficult assumption to change later. Annoying, but not devastating, not even as hard as changing from the first name / last name combination into one "real name" field.

    #7, 8 -- most systems would make it trivial for people to change their names.

    #9, 10 -- UTF8 is easy.

    #11 -- very, very curious to see an example. And wouldn't that be a bug in Unicode? And this is again one where I have to ask -- how do you change this? Allow arbitrary images?

    #12, 13 -- obvious solution is to make the name system case-preserving, thus allowing both case-sensitive and case-insensitive searches.

    #14 -- again, avoid by simply allowing the name to be a single opaque field.

    #15, 16, 17 -- if your name supports random unicode, no idea why these would be a problem.

    #18 -- not sure why it matters.

    #19, 20 -- again, if it's just arbitrary text, it just works.

    #21, 22, 23 -- not sure how I'd make that assumption.

    #24, 25, 26, 27 -- again, the name is just an opaque bunch of characters.

    #28 -- what?

    #29 -- opaque characters.

    #30 -- keep the original text as-is. If you want to try to split people out by naming scheme, do it later, but keep the original. This should be a "duh" concept -- always preserve the original user input. Cache transformations for speed, if you like, but they're a cache -- keep the original. Your algorithm might change.

    #31 -- bad idea to assume bad words won't cause problems in general. I currently play an MMO in which I physically can't talk about Emily Dickinson, and have occasion to more frequently than you might suspect.

    #32-36 -- why would it matter? Unless...

    #37 -- Fine, but how would I otherwise connect the same person?

    #38 -- How about unicode-equivalent? And of course, they might not -- one might make a mistake, or the name might be represented differently. But you'd have to deal with typos anyway, so this isn't exactly shocking.

    #39 -- I'm going to have to agree with the assumption, though. If I develop a system which works well for people who only follow the US standard, and I suddenly have a ton of people from China wanting to use my service -- enough that this is actually a problem for me -- that's a nice problem to have.

    #40 -- People can make up names. I guess this explains #32-36, though.
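
    The replies to #12/13, #30 and #38 above (case-preserving storage, keeping the original input, and Unicode-equivalent matching) could be sketched together roughly like this, assuming Python's standard unicodedata module. `NameRecord`, `search_key` and `find` are hypothetical names, and NFC plus casefold is just one reasonable choice of normalization.

```python
import unicodedata
from dataclasses import dataclass, field

def search_key(name: str) -> str:
    """Derived, disposable key: NFC-normalized and case-folded."""
    folded = unicodedata.normalize("NFC", name).casefold()
    # Case folding can denormalize some strings, so normalize once more.
    return unicodedata.normalize("NFC", folded)

@dataclass
class NameRecord:
    original: str                   # exactly what the user typed, never altered
    key: str = field(init=False)    # cached transformation, can be rebuilt

    def __post_init__(self):
        self.key = search_key(self.original)

def find(records, query, case_sensitive=False):
    if case_sensitive:
        wanted = unicodedata.normalize("NFC", query)
        return [
            r for r in records
            if unicodedata.normalize("NFC", r.original) == wanted
        ]
    wanted = search_key(query)
    return [r for r in records if r.key == wanted]
```

    Because `key` is only a cache, the matching rules can change later and the keys can be recomputed from `original` without losing anything the user entered.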

    The sense I get is that half the list is stuff you'd almost have to be stupid to run into (seriously, who doesn't use Unicode?), and the other half involves some seriously weird names and cultures that are going to have to meet me halfway, if they expect me to do anything interesting with their name. As I understand it, the only way to get this right would be to allow people to have zero or more names, each of which is either an unlimited amount of text in any encoding, or an image (raster or vector) of unlimited size. To query such a system requires insane amounts of logic just to deal with the text, and throw in some OCR for good measure.

    I think this is a case where I would much rather see people evolve to match the technology, rather than the other way

  • by RomulusNR (29439) on Thursday June 17, 2010 @11:16PM (#32609754) Homepage

    Yes. It's the programmers' fault that they write applications that make poor assumptions about names -- not the fault of the people who design software requirements, who are neither programmers nor usually very worldly.

    Perhaps we should have a list of "assumptions people make about developers"!
    * Developers get to design their own software.
    * Developers get to have some say in how their software is designed.
    * Developers at least can prevent really stupid things from being put in the software they write.
    * Developers aren't smart enough to know that outliers are inevitable.
    * Developers aren't smart enough to know that of course there are people with punctuation, extra words and spaces, even letters that no one has seen before.
    * Developers wouldn't rather code just one column to hold an identifier rather than two.

  • by shutdown -p now (807394) on Thursday June 17, 2010 @11:32PM (#32609850) Journal

    You can't "just use Unicode" and do no validation, though, unless you're perfectly fine with all sorts of bidi control characters showing up random places

    That is an output validation problem, not input validation. So go ahead and strip it on output, when (and if) you need to do it.

    or nonprintable characters causing two different names to look identical

    And why would that be a problem (any more so than people with identical names in general)?

    And yes, I would say that if someone can't invent something to put in a "family name or surname" field, then too bad. They would also find themselves unable to travel to most countries, since most countries' immigration forms have such a box

    Having a box is not a problem. By all means, keep one. The problem is making input there mandatory. Do immigration forms make it so?
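
    Stripping bidi control characters on output, as suggested earlier in this comment, might look something like the sketch below. It assumes Python; the set lists the explicit directional formatting characters I am aware of and may not be exhaustive, and `strip_bidi_controls` is a made-up helper name.

```python
# Explicit bidirectional formatting characters (assumed list, possibly incomplete).
BIDI_CONTROLS = frozenset({
    "\u061c",  # ARABIC LETTER MARK
    "\u200e",  # LEFT-TO-RIGHT MARK
    "\u200f",  # RIGHT-TO-LEFT MARK
    "\u202a",  # LEFT-TO-RIGHT EMBEDDING
    "\u202b",  # RIGHT-TO-LEFT EMBEDDING
    "\u202c",  # POP DIRECTIONAL FORMATTING
    "\u202d",  # LEFT-TO-RIGHT OVERRIDE
    "\u202e",  # RIGHT-TO-LEFT OVERRIDE
    "\u2066",  # LEFT-TO-RIGHT ISOLATE
    "\u2067",  # RIGHT-TO-LEFT ISOLATE
    "\u2068",  # FIRST STRONG ISOLATE
    "\u2069",  # POP DIRECTIONAL ISOLATE
})

def strip_bidi_controls(text: str) -> str:
    """Return a display copy of `text` without bidi control characters."""
    return "".join(ch for ch in text if ch not in BIDI_CONTROLS)
```

    The stored name stays untouched; only the copy sent to the page is filtered, and ordinary right-to-left letters pass through unchanged.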

  • by Anonymous Coward on Friday June 18, 2010 @01:06AM (#32610240)

    The regular expression, if one must be used, doesn't need to be any more complex than:

    ^[^@]+@[^@]+$

    Sending out response emails to an improperly validated address just turned you into an open relay. Spammers can use your server to send spam by embedding their entire message as the email address, trailed by '\x004@.'

    Validate your inputs. Always.
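
    As a hedged illustration, assuming Python's re module: the simple pattern above happily accepts embedded line breaks, so a sketch like the following pairs it with an explicit control-character check. `looks_like_email` is a hypothetical helper, not a full address validator.

```python
import re

# The permissive pattern from the comment above.
SIMPLE_EMAIL = re.compile(r"^[^@]+@[^@]+$")

def looks_like_email(address: str) -> bool:
    """Cheap sanity check: one '@', and no control characters at all."""
    # Reject CR, LF, NUL and other control characters first, since
    # "[^@]+" would otherwise match them and allow header injection.
    if any(ord(ch) < 32 or ord(ch) == 127 for ch in address):
        return False
    return SIMPLE_EMAIL.fullmatch(address) is not None
```

    This only decides whether to attempt delivery; whether the mailbox really exists can only be confirmed by sending a message to it.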

  • by vikstar (615372) on Friday June 18, 2010 @01:19AM (#32610270) Journal

    I would find it more interesting if it contained approximate statistics for each point. I will not spend time designing a system which caters for the 2 individuals having some weird exception to the detriment of millions of others which adhere to a much more useful schema. IE, sure you can just have Name and accept a 2048-length UTF-16 string to accommodate everyone, or skip a few outliers and have given and last names with certain restrictions to catch user error in the input.

  • by TheLink (130905) on Friday June 18, 2010 @02:46AM (#32610534) Journal
    Fake, duplicate or not, numeric IDs are still easier to key in ;).

    As most slashdotters will know, if your data records are in a digital computer, it's pretty hard to avoid being linked to at least one number.

    Even if you don't have national ID numbers, someone could go around claiming to be you, or the "System" could still confuse you with someone else.

    At least accidental/erroneous duplicate IDs are easier to detect.

    Of course if some Big Bad Ruler/Government starts issuing citizens with Citizen Certificates that have to be renewed every year then that's a problem :).
  • by CarpetShark (865376) on Friday June 18, 2010 @02:48AM (#32610540)

    Read between the lines a bit. Treat them the same means: treat them as all potentially valid, not that all the names would match in a string comparison.

    I don't think that's what it meant at all. I think the author is trying to be too smart by suggesting that someone looking for MacDonald might have heard it wrong, and so might type in McDonald instead. It's probably a valid point for fuzzy searches, but to say that they should all be treated the same is wrong.
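
    For the fuzzy-search reading, a small sketch assuming Python's standard difflib module; the 0.8 cutoff and the `suggest_names` helper are arbitrary illustrative choices. The names are kept distinct in storage and only suggested as near matches.

```python
import difflib

def suggest_names(query, known_names, cutoff=0.8):
    """Return stored names that are spelled similarly to `query`."""
    return difflib.get_close_matches(query, known_names, n=5, cutoff=cutoff)
```

    With these settings, a search for "McDonald" would surface a stored "MacDonald" as a candidate without treating the two as the same name.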

    That said, his other points, especially about the fact that not all names are properly mapped in unicode, is a good one. I just wish he'd posted citations and solutions, rather than simply pointing out the issues. But the first step in fixing a problem is acknowledging the problem.

  • by Anonymous Coward on Friday June 18, 2010 @04:31AM (#32610888)

    THX1138

  • by delinear (991444) on Friday June 18, 2010 @04:56AM (#32611004)

    Sometimes I despair when I read or hear somebody referring to eg. Djengis Khan as "Mr Khan" ("Khan" is a title, not a name) or even call Hu Jintao, "Mr Jintao"; you would have thought people would, by now, have caught on to the idea that something like half the world's population has the family name first.

    Oh, come now - are you seriously saying you expect every single person to understand every subtle nuance of every other culture's use of titles and names? Here are some non-English [wikipedia.org] equivalents to Mr., are you seriously telling us you know all of these? Here are the various forms of address [wikipedia.org] in the UK alone, do you know all of these and every other culture's equivalent? How many of these should I learn before I go from being someone you despair of to someone you feel is welcome in your titular elite?

    If half the world's population has the family name first, which half do I choose to offend when I don't know the exact rule for the home country of the person I'm speaking to? That's even assuming I know which country they're from. There's no reason to assume in this shrinking planet that someone who looks like they're from country A wasn't in fact born in country B to parents from countries A and C - a person born in Japan but with lineage in China might take great offence if I use Chinese honorifics to address him, surely it's better to be polite within the confines of my own known culture than to make such crass assumptions about his? The key thing I take from someone saying "Mr Khan" or "Mr Jintao" is that they're at least making the effort to communicate in a civil manner, which certainly causes me no despair.

  • by FuckingNickName (1362625) on Friday June 18, 2010 @05:10AM (#32611052) Journal

    I was born with a complicated Spanish name.
    One first name.
    Two second names.
    One hyphenated, accented surname from my father.
    One simpler surname from my mother.

    One of the first things I've done since reaching majority is to give a precise, simple, standard name to everyone who asks for it:
      Xxxxx Xxxxxxx
    where X is in A-Z and x is in a-z. Xxxxx is my first name, and Xxxxxxx is a shortened, accent-and-hyphen-free version of my father's surname.

    You know why?

    Because, in life, there are lots of things one must be "unreasonable" about in order to effect progress, but accommodation of one's name is not one of them. It's a tedious, selfish expression of nothing more than ego which ultimately will land you in more trouble than others: some day you will be denied access to something thanks to some computer system not being designed to handle your name, and "computer says no" gets priority over the angry demands to the immigration officer of "Joe\0\rBlogg$ 3'); DROP TABLE citizens; -- [insert spinning cube here] Jr."

    If you and your friends/colleagues have some other name by which they call you, sure why not? But, as any cat will tell you, the world is best when you have three names:

    (i) one for communicating formally;
    (ii) one for more intimate discourse (there's no reason why this can't be the same as (i), though many people end up with peculiar nicknames); and
    (iii) one personal identification which you can keep to yourself and you can't express in words.

    If you want the sum of all your history, culture and personality as expressed in (iii) to be embodied in (i), you're both expecting others to be burdened with your ego and bad at understanding human communication. All I asked for was a couple of words I can use in a reasonably uniform way to easily pick/call you out from a small crowd - that's what (i) is for, after all.

    tl;dr The naming of cats is both a delightful poem and an insightful account of the multiple namespaces for kitty/human names and their different purposes. Don't confuse them.

  • by delinear (991444) on Friday June 18, 2010 @05:31AM (#32611108)
    I'm the same, faces for me just won't stick - the first few weeks going to a new employer is torture for me as I'll be bombarded with names and I just can't remember anyone, and of course the problem is multiplied because you're the new guy so everyone knows your name. I don't know if it's specifically a developer issue. I did read that people with borderline Asperger's find it difficult to recognise faces and a lot of developers I know seem to fit the patterns for that (awkward in social situations or around new people, like to collect things, etc) so maybe there's some correlation that people who fit those patterns are drawn to careers where they can focus on impenetrable logic problems and not have to deal with people too much (I know I'm making some massive generalisations here, and this is purely anecdotal, but it fits my observations).
  • by digitig (1056110) on Friday June 18, 2010 @06:41AM (#32611374)

    But assuming the worst and trying to design a system that'll allow people's names to be Chinese characters when you don't do business in China, have presence in China, or ever ever plan to? That's ridiculous.

    No, but making a conscious design decision not to accommodate names in non-Roman character sets, and documenting that in the specification, is sensible.

    If you don't even HAVE a name, then I submit you're crazier than the artist formerly known as the artist formerly known as Prince.

    The discussion gives examples of people who don't have names, such as somebody born into slavery in the Sudan. In that case, it's not the person who is crazy. Do you need to account for that in your data entry? Well, it depends. If it's online sales then the chances are that that person will never be a customer. If you're doing a missing persons database for a relief agency, though, you probably need to find a way to account for them. So no, you don't have to address all of the cases that the author mentions, but if you're smart you'll at least consider whether you should in your particular context.

  • by Moraelin (679338) on Friday June 18, 2010 @07:22AM (#32611572) Journal

    You know, attitudes like yours are IMHO the root of all that's wrong with computers today. And I'm saying that as a programmer, not as Jane Grandma. The whole idiotic OCD idea that you _must_ make up rules about everything, and that your rules are more important than what people are actually trying to do. The idea that if even someone's name doesn't fit "your" database, then you can just brush them off and have a beer.

    Here's some free clue: yes, you can't handle every edge case in the universe, but you'll find it's easier if you don't create such edge cases in the first place. If your database (actually more likely the program in front of it) can't handle last names with more than one capital letter, or with a dash in the middle, or which are more than 32 bytes long (which with UTF-8 might mean less than you'd think), then guess what? _You_ created an artificial edge case that had no reason to be there in the first place. Instead of handling every edge case in the universe, how about not creating them in the first place?

    I find that about 90% of the problems boil down to the above: some idiot put some artificial limits or rules, that really aren't needed anywhere else. Just because he has the delusion that he's some kind of Moses on the mountain and just _has_ to come down with some rules.

    E.g., he just had to define a byte limit, because he's prematurely optimizing a non-problem he doesn't understand. God forbid wasting space in the database by allowing 256 or 2000 byte strings... never mind that if he actually understood that underlying database, he'd know that a VARCHAR is not padded to that max length. If someone just entered "Alex", the same 4 bytes will be actually used in the database, regardless of whether the field is defined with a maximum of 4, 32, 256 or 2000 characters. But nah, he has to put some restrictive number there, 'cause it looks more like he's doing some smart job.
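
    The remark about 32 bytes meaning less than you'd think under UTF-8 can be checked with a few lines, assuming Python; `utf8_size` and `fits` are made-up helper names.

```python
def utf8_size(name: str) -> int:
    """Number of bytes the name occupies once encoded as UTF-8."""
    return len(name.encode("utf-8"))

def fits(name: str, max_bytes: int) -> bool:
    """Would the name fit in a column limited to `max_bytes` bytes?"""
    return utf8_size(name) <= max_bytes
```

    ASCII letters take one byte each, Cyrillic letters and characters such as "ñ" take two, and most CJK characters take three, so a 32-byte limit runs out after 16 Cyrillic letters or 10 Chinese characters.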

    There is hardly any reason to even use a user name for anything other than display purposes. (You do have a primary key for that record for everything else, right?) As such there is no reason to make any assumptions about it, or enforce any particular format, or anything. There's no reason to even disallow SQL keywords (just effing quote it before using it in SQL) or angle brackets (just quote it before using it in HTML.)
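
    On the HTML side, "just quote it" could be as small as the sketch below, assuming Python's standard html module. `render_greeting` is a hypothetical function, and the SQL side would use bound parameters in the same spirit.

```python
import html

def render_greeting(name: str) -> str:
    """Build a small HTML fragment with the name escaped at output time."""
    # html.escape replaces &, <, > and both quote characters with entities,
    # so the stored name can contain any of them.
    return "<p>Hello, " + html.escape(name) + "</p>"
```

    The name is stored exactly as entered and escaped only where it meets HTML, so angle brackets and apostrophes need no input-side rule.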

    There is no reason to create any edge cases in the first place.

    And really it's not even just about names. Names are just one case where people make up BS rules just to feel more like they did the great design job. One could make the same case for the gazillion other pointless rules imposed upon the user or his work-flow or data, not because they're actually needed anywhere, but just because some OCD idiot feels like he _must_ impose some rigid structure upon things that really have none and don't need any. But he'd just feel naked without defining that kind of rigid structure, or without imposing upon humans some data structures theory that was intended only for use by programs.

  • by russotto (537200) on Friday June 18, 2010 @09:23AM (#32612736) Journal

    The idea that if even someone's name doesn't fit "your" database, then you can just brush them off and have a beer.

    We can. Fact is, trying to write a system which can deal with all those 40 assumptions and still do anything useful with names is impossible. Even covering most of them is impractical, if you want programmers to do anything else. It has nothing to do with OCD. The programmers aren't making the rules because of some inner desire for order, but because the requirements of the system require they be made.

    Suppose your system is some sort of order-taking system. And one of the things it must do is print your name on a mailing label. How do you handle that if the name doesn't _fit_ on the mailing label? Or if there is no name at all? Or if the mailing label printer doesn't handle the name's character set? Or if the postal service for the countries in question have standards for names which are not met?

  • by Frater 219 (1455) on Friday June 18, 2010 @11:24AM (#32614102) Journal

    Check out the huge regex at the bottom of the RFC 5322 compliant validator from CPAN:

    Honestly, this sort of thing is an example of overusing regex when it's the only parsing tool they know. Regex becomes unwieldy when you put too much of it in one place -- but this is because regex is unwieldy, not because the problem of parsing email addresses is fundamentally hard. Parsing email addresses is a case for a modular parser such as Parsec (or any of its ports and imitators) ... which will give you the added advantage of useful error messages on invalid input, instead of just a match failure.

    Moreover, isn't it kind of silly to point at an example of someone already having written the code to do something as a way of saying that doing it is difficult? In code, once it's already been done once, correctly, it doesn't need to be done again. If you think CPAN's huge regex (or any other implementation) is correct, and you've tested it to your satisfaction, you don't need to reimplement it; just use it.
