
Taco Bell Programming

Posted by timothy
from the how-dare-you-insult-the-code-monkeys dept.
theodp writes "Think outside the box? Nah, think outside the bun. Ted Dziuba argues there's a programming lesson to be learned from observing how Taco Bell manages to pull down $1.9 billion by mixing-and-matching roughly eight ingredients: 'The more I write code and design systems, the more I understand that many times, you can achieve the desired functionality simply with clever reconfigurations of the basic Unix tool set. After all, functionality is an asset, but code is a liability. This is the opposite of a trend of nonsense called DevOps, where system administrators start writing unit tests and other things to help the developers warm up to them — Taco Bell Programming is about developers knowing enough about Ops (and Unix in general) so that they don't overthink things, and arrive at simple, scalable solutions.'"
  • by Animats (122034) on Sunday October 24, 2010 @07:54PM (#34007398) Homepage

    A big problem with shell programming is that the error information coming back is so limited. You get back a numeric status code, if you're lucky, or maybe a "broken pipe" signal. It's difficult to handle errors gracefully. This is a killer in production applications.

    Here's an example. The original article talks about reading a million pages with "wget". I doubt the author of the article has actually done that. Our sitetruth.com system does in fact read a million web pages or so a month. Blindly getting them with "wget" won't work. All of the following situations come up routinely:

    • There's a network error. A retry in an hour or so needs to be scheduled.
    • There's an HTTP error. That has to be analyzed. Some errors mean "give up", and some mean "try again later".
    • The site's HTML contains a redirect, which needs to be followed. "wget" won't notice a redirect at the HTML level.
    • The site's "robots.txt" file says we shouldn't read the file from a bulk process. "wget" does not obey "robots.txt".
    • The site is really, really slow. Some sites will take half an hour to feed out a page. Maybe they're overloaded. Maybe their denial of service detection software has tripped and is metering out bytes very slowly in defense. You don't want this to hold up the entire operation. Last week, for some reason, "orbitz.com" did that.
    • The site doesn't return data at all. Some British university sites have a network implementation which, if asked for an HTTPS connection, does part of the SSL connection handshake and then just stops, leaving the TCP connection open but sending nothing. This requires a special timeout.
    • The site doesn't like too many simultaneous connections from the same IP address. We limit our system to three simultaneous connections to a given site, so as not to overload it.

    That's just reading the page text. More things can go wrong in parsing.

    Even routine reading of some known data page requires some effort to get it right. We read PhishTank's entire XML list of phishing sites every three hours. Doing this reliably is non-trivial. PhishTank just overwrites their file when they update, rather than replacing it with a new one. (This is one of the design errors of UNIX, as Stallman once pointed out. Yes, there are workarounds they could do.) So we have to read the file twice, a minute apart, and wait until we get two identical copies. Then we have to check for 1) an empty file, 2) a file with proper XML structure but no data records, and 3) an improperly terminated XML file, all of which we've encountered. Then we pump the data into a MySQL database, prepared to roll back the changes if some error is detected.

    The clowns who try to do stuff like this with shell scripts and cron jobs spend a big fraction of their time dealing manually with the failures. If you do it right, it just keeps working. One of my other sites, "downside.com", has been updating itself daily from SEC filings for over a decade now. About once a month, something goes wrong with the nightly update, and it's corrected automatically the next night.
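    The retry/give-up triage described above can be sketched in a few lines of sh. This is a hedged illustration, not the actual sitetruth.com code: the exit statuses are the ones documented in the GNU Wget manual (0 = success, 4 = network failure, 8 = server-issued error response), and `page.html`/`retry-queue.txt` are placeholder names.

```shell
#!/bin/sh
# Classify wget's exit status into ok / retry / give up, and re-queue
# transient failures for a later cron pass instead of looping in place.

classify() {
    case "$1" in
        0)   echo ok ;;      # page fetched successfully
        4|8) echo retry ;;   # network failure or server error response:
                             # often transient, worth another attempt later
        *)   echo giveup ;;  # parse, SSL, or auth failure: retrying won't help
    esac
}

fetch_one() {
    url=$1
    wget -q --tries=1 --timeout=60 -O page.html "$url"
    status=$(classify $?)
    if [ "$status" = retry ]; then
        echo "$url" >> retry-queue.txt   # a later cron pass drains this queue
    fi
    echo "$status"
}
```

    A real crawler would further split the "retry" case by HTTP status (a 503 and a 404 deserve different treatment), which is exactly the analysis step the comment is talking about.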

  • by Anonymous Coward on Sunday October 24, 2010 @08:38PM (#34007678)

    There's a network error. A retry in an hour or so needs to be scheduled.

    Echo, cron.

    There's an HTTP error. That has to be analyzed. Some errors mean "give up", and some mean "try again later".

    If, else if, else.

    The site's HTML contains a redirect, which needs to be followed. "wget" won't notice a redirect at the HTML level.

    '--max-redirect=number' Specifies the maximum number of redirections to follow for a resource. The default is 20, which is usually far more than necessary. However, on those occasions where you want to allow more (or fewer), this is the option to use. ...lol, wat?

    The site's "robots.txt" file says we shouldn't read the file from a bulk process. "wget" does not obey "robots.txt".

    'To ignore robots.txt and no-follow, use something like: wget -e robots=off...' ...lol, wat?

    The site is really, really slow. Some sites will take half an hour to feed out a page. Maybe they're overloaded. Maybe their denial of service detection software has tripped and is metering out bytes very slowly in defense. You don't want this to hold up the entire operation. Last week, for some reason, "orbitz.com" did that.

    You can't write a simple timer? It's not like you're being asked to write memory management in C.

    The site doesn't return data at all. Some British university sites have a network implementation which, if asked for a HTTPS connection, does part of the SSL connection handshake and then just stops, leaving the TCP connection open but sending nothing. This requires a special timeout.

    ...And requires special handling with any tool, for that matter.

    The site doesn't like too many simultaneous connections from the same IP address. We limit our system to three simultaneous connections to a given site, so as not to overload it.

    ...Okay, I'm done.

    There must be something in the water, cuz all I see is a code monkey who can't handle the command line.

  • by _LORAX_ (4790) on Sunday October 24, 2010 @08:42PM (#34007706) Homepage

    Psst,

    " | sort | uniq -c "

    Will sort and then count repetitive lines and output count, line. You can pipe the result back through sort -n if you want a frequency sort or sort -k 2 for item sorting.
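    A minimal illustration of that pipeline (the input lines are made up):

```shell
# Count duplicate lines, then sort by frequency, highest first.
printf 'apple\nbanana\napple\napple\nbanana\ncherry\n' \
    | sort | uniq -c | sort -rn
```

    The first sort groups identical lines together so uniq -c can count them; the trailing sort -rn orders the result by count.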

  • by Giant Electronic Bra (1229876) on Sunday October 24, 2010 @08:55PM (#34007828)

    Sure, awk is a programming language. It is also a command line tool. A bit more flexible than most, but you can't really draw a line between something that is a programming language and something that is a 'built in tool'.

    I really have no idea WHY their code was so large. It was all written in FORTRAN and VMS is honestly a nightmarishly complicated operating environment. A lot of it is probably related to the fact that Unix has a much simpler and more orthogonal environment. Of course this is also WHY Unix killed VMS dead long ago. Simplicity is a virtue. This is why Windows still hasn't entrenched itself forever in the server room. It lacks the simple elegance of 'everything is a byte stream' and 'small flexible programs that simply process a stream'. Those are powerful concepts upon which can be built a lot of really complex stuff in a small amount of code.

  • by Eskarel (565631) on Sunday October 24, 2010 @09:33PM (#34008020)

    That was HTML redirects (well likely more specifically javascript redirects), not HTTP redirects.

  • by arth1 (260657) on Sunday October 24, 2010 @10:21PM (#34008230) Homepage Journal

    The site's HTML contains a redirect, which needs to be followed. "wget" won't notice a redirect at the HTML level.

    Actually, it does. But in any case, this is why you parse the HTML after fetching it with wget -- how else can you get things like javascript generated URLs to work?

    The site's "robots.txt" file says we shouldn't read the file from a bulk process. "wget" does not obey "robots.txt".

    From the wget man page:

    Wget can follow links in HTML, XHTML, and CSS pages, to create local
    versions of remote web sites, fully recreating the directory structure
    of the original site. This is sometimes referred to as "recursive
    downloading." While doing that, Wget respects the Robot Exclusion
    Standard (/robots.txt).

    The site is really, really slow. Some sites will take half an hour to feed out a page.

    And you still haven't looked at the wget(1) man page, or you'd know about the --read-timeout parameter.

    Maybe they're overloaded. Maybe their denial of service detection software has tripped and is metering out bytes very slowly in defense. You don't want this to hold up the entire operation. Last week, for some reason, "orbitz.com" did that.

    Not holding up your operation is why you use multiple tools that can run concurrently. A wget of orbitz.com taking forever won't prevent the wget of soggy.com that you scheduled for half an hour later, and neither will stop the parser.
    Of course, if you design an all-eggs-in-one-basket solution that depends on sequential operations, you deserve what you get.

    The site doesn't return data at all. Some British university sites have a network implementation which, if asked for a HTTPS connection, does part of the SSL connection handshake and then just stops, leaving the TCP connection open but sending nothing.
    This requires a special timeout.

    Yes, the --connect-timeout.

    The site doesn't like too many simultaneous connections from the same IP address. We limit our system to three simultaneous connections to a given site, so as not to overload it.

    wget limits to a single connection with keep-alive per instance. (If you want more, spawn more wget -nc commands)
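    One way to get the "spawn more wget commands" behavior with a hard cap on concurrency is xargs -P. A hedged sketch: urls.txt is a hypothetical one-URL-per-line file, and this caps total parallelism, not per-site parallelism (limiting per site would mean splitting the list by host first).

```shell
#!/bin/sh
# Run at most 3 fetcher processes at a time over a URL list.
# FETCH defaults to wget; override it (e.g. FETCH=echo) for a dry run.
FETCH=${FETCH:-"wget -q -nc"}
if [ -f urls.txt ]; then
    xargs -n 1 -P 3 $FETCH < urls.txt
fi
```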

    Even routine reading of some known data page requires some effort to get it right. We read PhishTank's entire XML list of phishing sites every three hours. Doing this reliably is non-trivial. PhishTank just overwrites their file when they update, rather than replacing it with a new one.

    That's no problem as long as you pay attention to the HTTP timestamp.

    (This is one of the design errors of UNIX, as Stallman once pointed out. Yes, there are workarounds they could do.) So we have to read the file twice, a minute apart, and wait until we get two identical copies. Then we have to check for 1) an empty file, 2) a file with proper XML structure but no data records, and 3) an improperly terminated XML file, all of which we've encountered.

    Oh. My.
    I'd do a HEAD as the second request, and check the Last-Modified time stamp.
    If the Date in the fetch was later than this, and you got a 2xx return code, all is well, and there's no need to download two copies, blatantly disregarding the "X-Request-Limit-Interval: 259200 Seconds" as you do.

    It'd be much faster too. But what do I know...
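    The HEAD-and-compare idea could look roughly like this. A hedged sketch: the feed URL is a placeholder (not PhishTank's real one), the timestamp parsing assumes GNU date's -d option, and nothing network-facing runs until check_feed is actually called.

```shell
#!/bin/sh
# is_fresh: true when the Last-Modified header value ($1) is not newer
# than the epoch time at which we started our fetch ($2).
is_fresh() {
    modified=$(date -d "$1" +%s) || return 1
    [ "$modified" -le "$2" ]
}

check_feed() {
    url="https://example.org/feed.xml"   # placeholder URL
    fetched_at=$(date +%s)
    wget -q -O feed.xml "$url" || return 1
    # HEAD request; strip CRs and pull out the Last-Modified header.
    last_mod=$(curl -sI "$url" | tr -d '\r' | sed -n 's/^[Ll]ast-[Mm]odified: //p')
    if [ -n "$last_mod" ] && is_fresh "$last_mod" "$fetched_at"; then
        echo "feed.xml is complete and stable"
    else
        echo "feed changed mid-download; refetch" >&2
        return 1
    fi
}
```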

    The clowns who try to do stuff like this with shell scripts and cron jobs spend a big fraction of their time dealing manually with the failures.

    The clowns who do stuff like this with the simplest tools that do the job (

  • Re:8 keywords? (Score:4, Informative)

    by iamnobody2 (859379) on Sunday October 24, 2010 @11:23PM (#34008516)
    8 ingredients? No. I've worked at a Taco Bell; there are a few more than that. Here's most of the hot line: beef, chicken, steak, beans, rice, potatoes, red sauce, nacho cheese sauce, green sauce (only used by request). Cold line: lettuce, tomatoes, cheddar cheese, 3-cheese blend, onions, fiesta salsa (pico de gallo, the same tomatoes and onions mixed with a sauce), sour cream, guacamole, baja sauce, creamy jalapeno sauce. Plus 5 kinds/sizes of tortillas (3 sizes of regular, 2 sizes of die-cut), nacho chips, etc. Here's an interesting fact: those Cinnamon Twists you may or may not love? They're made of deep-fried rotini (a type of pasta, usually boiled).
  • by cratermoon (765155) on Monday October 25, 2010 @01:33AM (#34009016) Homepage

    Well, Linux IS Unix, just without the trademark, but I didn't really come here to correct your misconception on that.

    What I wanted to highlight was the reality behind your statements "we have fifty times as many Windows servers as the other two combined" and "The building where I work has a ratio of about 1 production Windows server for every four employees. If you count non-production servers, we have more Windows servers than people."

    This is most certainly not because Windows is so much better or more popular than the other platforms at your place of work. Any experienced sysadmin who is not a Microsoft apologist will confirm that for any typical datacenter server function, it takes more instances of Windows to get the same capacity, reliability, and uptime as a smaller number of instances of other server operating systems. It's just the nature of the Microsoft stack that effective load-sharing and failover are a necessity in capacity planning. Anyone who argues that a single instance of Windows is equal to a single instance of AIX or Linux has simply never been part of real-world datacenter administration.

    In short, your employer may have a lot more Windows servers than anything else, but that certainly doesn't mean Windows is better or more popular -- it just demonstrates how the TCO of Windows is terrible.

  • by dkf (304284) <donal.k.fellows@manchester.ac.uk> on Monday October 25, 2010 @04:20AM (#34009664) Homepage

    How, exactly, are they brittle?

    The principal brittleness of shell scripts is their assumption that filenames do not contain odd characters like spaces. Most other languages don't do auto-splitting of every argument and so won't break when some user insists on creating a directory called "Documents and Settings"...

    (You can write armored shell scripts that cope just fine with this - I've done that quite a bit over the years - but a lot of people don't.)
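    A small illustration of the difference (the directory name echoes the example above): quote every expansion, and paths containing spaces survive intact.

```shell
#!/bin/sh
# Armored file handling: quoted expansions keep "Documents and Settings"
# as one pathname instead of three bogus arguments.
tmp=$(mktemp -d)
mkdir -p "$tmp/Documents and Settings"
printf 'hello\n' > "$tmp/Documents and Settings/file one.txt"

# Fragile version (don't do this): $(find ...) word-splits on spaces.
#   for f in $(find "$tmp" -type f); do wc -l $f; done

# Armored version: a quoted glob expands to whole pathnames.
count=0
for f in "$tmp"/*/*; do
    [ -f "$f" ] && count=$((count + 1))
done
echo "$count"    # 1: the path containing spaces is still one file
rm -rf "$tmp"
```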

  • Re:8 keywords? (Score:3, Informative)

    by drjzzz (150299) on Monday October 25, 2010 @09:43AM (#34011328) Homepage Journal

    So if I limit myself to 8 keywords my code has less defects and is more maintainable?

    ... fewer defects. Never mind.

  • by Andrew Cady (115471) on Monday October 25, 2010 @12:01PM (#34013414)

    Wget for crawling tens of millions of web pages using a 10 line script? He doesn't understand crawling at scale.

    Wget is made for crawling at scale.

    There's a lot more to it than just following links. For example, lots of servers will block you if you start ripping them in full, so you need to have a system in place to crawl sites over many days/weeks a few pages at a time.

    wget --random-wait

    You also want to distribute the load over several IP addresses

    The way I do this with wget is to use wget to generate a list of URLs, then launch a separate wget process with varying source IPs specified with --bind-address. It would, however, be trivial to add a --randomize-bind-address option to wget source.

    and you need logic to handle things like auto generated/tar pits/temporarily down sites, etc.

    What makes you think you can't handle these things with wget?

    And of course you want to coordinate all that while simultaneously extracting the list of URLs that you'll hand over to the crawlers next.

    Again, why do you think wget is inadequate to this? It's not.

    Any custom-coded wget alternative will be implementing a great deal of wget. Most limitations of wget can be avoided by launching multiple wget processes, putting a bit of intelligence into the glue that does so. If that isn't enough, it probably makes sense to make minor alterations to wget source instead of coding something new.

    My point here is just that wget is way more awesome than you give credit.

  • by eap (91469) on Monday October 25, 2010 @03:09PM (#34016110) Journal

    Psst,

    " | sort | uniq -c "

    Will sort and then count repetitive lines and output count, line. You can pipe the result back through sort -n if you want a frequency sort or sort -k 2 for item sorting.

    The problem was not figuring out how to count the unique items. It's the part before the pipe that was difficult. The poster needed to combine the results of two different commands and then compute the unique items. The solution would have to be, logically, "command1 + command2 | sort | uniq -c".

    Unless you can find a way to pass the output from command1 through command2, you will lose command1's data. The solution he/she found was elegant: (command1):(command2) | someKindOfSort. My syntax is probably wrong. If you were simply pointing out a better way to sort, then please disregard.
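    For the record, plain sh handles the "part before the pipe" with command grouping: { } concatenates the output of both commands into a single stream for the rest of the pipeline. The printf calls here are stand-ins for command1 and command2.

```shell
# Combine two commands' output, then count and rank the unique lines.
{
    printf 'foo\nbar\n'    # stand-in for command1
    printf 'bar\nbaz\n'    # stand-in for command2
} | sort | uniq -c | sort -rn
```

    A subshell, (command1; command2), works the same way at the cost of an extra fork.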
