Taco Bell Programming 394
theodp writes "Think outside the box? Nah, think outside the bun. Ted Dziuba argues there's a programming lesson to be learned from observing how Taco Bell manages to pull down $1.9 billion by mixing-and-matching roughly eight ingredients: 'The more I write code and design systems, the more I understand that many times, you can achieve the desired functionality simply with clever reconfigurations of the basic Unix tool set. After all, functionality is an asset, but code is a liability. This is the opposite of a trend of nonsense called DevOps, where system administrators start writing unit tests and other things to help the developers warm up to them — Taco Bell Programming is about developers knowing enough about Ops (and Unix in general) so that they don't overthink things, and arrive at simple, scalable solutions.'"
Weak error handling (Score:5, Informative)
A big problem with shell programming is that the error information coming back is so limited. You get back a numeric status code, if you're lucky, or maybe a "broken pipe" signal. It's difficult to handle errors gracefully. This is a killer in production applications.
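The point stands that a pipeline hands back nothing richer than small integers, though bash at least exposes one per stage. A minimal sketch (bash-specific: `PIPESTATUS` and `pipefail`):

```shell
#!/bin/bash
# Minimal sketch of per-stage pipeline status in bash.
set -o pipefail          # fail the whole pipeline if any stage fails

false | true             # first stage fails, second succeeds
statuses=("${PIPESTATUS[@]}")   # copy immediately; the next command clobbers it

echo "stage 1 exited ${statuses[0]}, stage 2 exited ${statuses[1]}"
```

Even then, it's still just numbers: which URL failed and why has to be reconstructed from stderr, which is exactly the complaint above.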
Here's an example. The original article talks about reading a million pages with "wget". I doubt the author of the article has actually done that. Our sitetruth.com system does in fact read a million web pages or so a month. Blindly getting them with "wget" won't work. All of the following situations come up routinely:

- There's a network error. A retry in an hour or so needs to be scheduled.
- There's an HTTP error. That has to be analyzed. Some errors mean "give up", and some mean "try again later".
- The site's HTML contains a redirect, which needs to be followed. "wget" won't notice a redirect at the HTML level.
- The site's "robots.txt" file says we shouldn't read the file from a bulk process. "wget" does not obey "robots.txt".
- The site is really, really slow. Some sites will take half an hour to feed out a page. Maybe they're overloaded. Maybe their denial of service detection software has tripped and is metering out bytes very slowly in defense. You don't want this to hold up the entire operation. Last week, for some reason, "orbitz.com" did that.
- The site doesn't return data at all. Some British university sites have a network implementation which, if asked for a HTTPS connection, does part of the SSL connection handshake and then just stops, leaving the TCP connection open but sending nothing. This requires a special timeout.
- The site doesn't like too many simultaneous connections from the same IP address. We limit our system to three simultaneous connections to a given site, so as not to overload it.

That's just reading the page text. More things can go wrong in parsing.
Even routine reading of some known data page requires some effort to get it right. We read PhishTank's entire XML list of phishing sites every three hours. Doing this reliably is non-trivial. PhishTank just overwrites their file when they update, rather than replacing it with a new one. (This is one of the design errors of UNIX, as Stallman once pointed out. Yes, there are workarounds they could do.) So we have to read the file twice, a minute apart, and wait until we get two identical copies. Then we have to check for 1) an empty file, 2) a file with proper XML structure but no data records, and 3) an improperly terminated XML file, all of which we've encountered. Then we pump the data into a MySQL database, prepared to roll back the changes if some error is detected.
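The read-twice dance described above can be sketched as a small bash function. The URL, file names, and interval are placeholders, not PhishTank's actual endpoint; the interval is a parameter so it can be shortened for testing:

```shell
#!/bin/bash
# Sketch of "fetch twice, a minute apart, until two identical copies".
fetch_until_stable() {
    url=$1; out=$2; interval=${3:-60}
    while :; do
        wget -q -O "$out.a" "$url" || return 1
        sleep "$interval"
        wget -q -O "$out.b" "$url" || return 1
        if cmp -s "$out.a" "$out.b"; then
            mv "$out.a" "$out"       # two identical reads: accept the copy
            rm -f "$out.b"
            return 0
        fi
        # copies differ: the file was overwritten mid-read; go around again
    done
}

# Usage (hypothetical URL):
# fetch_until_stable http://example.com/feed.xml feed.xml 60
```

The empty-file, empty-XML, and truncated-XML checks the poster describes would still have to follow, before anything touches the database.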
The clowns who try to do stuff like this with shell scripts and cron jobs spend a big fraction of their time dealing manually with the failures. If you do it right, it just keeps working. One of my other sites, "downside.com", has been updating itself daily from SEC filings for over a decade now. About once a month, something goes wrong with the nightly update, and it's corrected automatically the next night.
Re:Weak error handling (Score:1, Informative)
There's a network error. A retry in an hour or so needs to be scheduled.
Echo, cron.
There's an HTTP error. That has to be analyzed. Some errors mean "give up", and some mean "try again later".
If, else if, else.
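That "if, else if, else" over wget's exit status might look like the case statement below. The code-to-action mapping (4 = network failure, 8 = server issued an error response, per the GNU wget manual for 1.12+) is one plausible policy, not the only one:

```shell
#!/bin/sh
# Sketch: sort a wget exit status into "retry later" vs. "give up".
classify_fetch() {
    case $1 in
        0) echo ok ;;
        4) echo retry ;;     # transient network failure: try again later
        8) echo inspect ;;   # HTTP error: needs analysis (404 vs. 503, etc.)
        *) echo giveup ;;    # parse, I/O, SSL, auth: retrying won't help
    esac
}

# Usage:
# wget -q "$url"; action=$(classify_fetch $?)
```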
The site's HTML contains a redirect, which needs to be followed. "wget" won't notice a redirect at the HTML level.
'--max-redirect=number' Specifies the maximum number of redirections to follow for a resource. The default is 20, which is usually far more than necessary. However, on those occasions where you want to allow more (or fewer), this is the option to use. ...lol, wat?
The site's "robots.txt" file says we shouldn't read the file from a bulk process. "wget" does not obey "robots.txt".
'To ignore robots.txt and no-follow, use something like: wget -e robots=off...' ...lol, wat?
The site is really, really slow. Some sites will take half an hour to feed out a page. Maybe they're overloaded. Maybe their denial of service detection software has tripped and is metering out bytes very slowly in defense. You don't want this to hold up the entire operation. Last week, for some reason, "orbitz.com" did that.
You can't write a simple timer? It's not like you're being asked to write memory management in C.
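For the "simple timer", coreutils timeout(1) puts a hard deadline on any fetch (wget also has its own --timeout/--read-timeout knobs); a sketch:

```shell
#!/bin/sh
# Sketch: wrap any command with a hard deadline via coreutils timeout(1).
fetch_with_deadline() {
    secs=$1; shift
    timeout "$secs" "$@"   # kills the command if it outlives the deadline
}

# Usage (hypothetical URL): give a slow site five minutes, then move on.
# fetch_with_deadline 300 wget -q -O page.html http://example.com/slow-page
```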
The site doesn't return data at all. Some British university sites have a network implementation which, if asked for a HTTPS connection, does part of the SSL connection handshake and then just stops, leaving the TCP connection open but sending nothing. This requires a special timeout.
...And requires special handling with any tool, for that matter.
The site doesn't like too many simultaneous connections from the same IP address. We limit our system to three simultaneous connections to a given site, so as not to overload it.
...Okay, I'm done.
There must be something in the water, cuz all I see is a code monkey who can't handle the command line.
Re:It's easy to overthink even in the simplest cas (Score:4, Informative)
Psst,
" | sort | uniq -c "
Will sort and then count repetitive lines and output count, line. You can pipe the result back through sort -n if you want a frequency sort or sort -k 2 for item sorting.
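On sample data (the printf stands in for the real command's output), the pipeline behaves like this, with the most frequent line on top:

```shell
#!/bin/sh
# sort groups duplicates, uniq -c prefixes each line with its count,
# and sort -rn puts the highest counts first.
counts=$(printf 'apple\nbanana\napple\napple\nbanana\ncherry\n' \
    | sort | uniq -c | sort -rn)
echo "$counts"
```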
Re:Let me tell you a story (Score:4, Informative)
Sure, awk is a programming language. It is also a command line tool. A bit more flexible than most, but you can't really draw a line between something that is a programming language and something that is a 'built in tool'.
I really have no idea WHY their code was so large. It was all written in FORTRAN, and VMS is honestly a nightmarishly complicated operating environment; a lot of the difference probably comes down to Unix being a much simpler and more orthogonal environment. Of course, that is also WHY Unix killed VMS dead long ago. Simplicity is a virtue. It's why Windows still hasn't entrenched itself forever in the server room: it lacks the simple elegance of 'everything is a byte stream' and 'small flexible programs that simply process a stream'. Those are powerful concepts on top of which a lot of really complex stuff can be built in a small amount of code.
Re:Weak error handling (Score:3, Informative)
Those were HTML-level redirects (or, more likely, JavaScript redirects specifically), not HTTP redirects.
Re:Weak error handling (Score:5, Informative)
The site's HTML contains a redirect, which needs to be followed. "wget" won't notice a redirect at the HTML level.
Actually, it does. But in any case, this is why you parse the HTML after fetching it with wget -- how else can you get things like javascript generated URLs to work?
The site's "robots.txt" file says we shouldn't read the file from a bulk process. "wget" does not obey "robots.txt".
From the wget man page:
Wget can follow links in HTML, XHTML, and CSS pages, to create local versions of remote web sites, fully recreating the directory structure of the original site. This is sometimes referred to as "recursive downloading." While doing that, Wget respects the Robot Exclusion Standard (/robots.txt).
The site is really, really slow. Some sites will take half an hour to feed out a page.
And you still haven't looked at the wget(1) man page, or you'd know about the --read-timeout parameter.
Maybe they're overloaded. Maybe their denial of service detection software has tripped and is metering out bytes very slowly in defense. You don't want this to hold up the entire operation. Last week, for some reason, "orbitz.com" did that.
Not holding up your operation is why you use multiple tools that can run concurrently. A wget of orbitz.com taking forever won't prevent the wget of soggy.com that you scheduled for half an hour later, and neither will stop the parser.
Of course, if you design an all-eggs-in-one-basket solution that depends on sequential operations, you deserve what you get.
The site doesn't return data at all. Some British university sites have a network implementation which, if asked for a HTTPS connection, does part of the SSL connection handshake and then just stops, leaving the TCP connection open but sending nothing.
This requires a special timeout.
Yes, the --connect-timeout.
The site doesn't like too many simultaneous connections from the same IP address. We limit our system to three simultaneous connections to a given site, so as not to overload it.
wget limits to a single connection with keep-alive per instance. (If you want more, spawn more wget -nc commands)
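Spawning those instances with a concurrency cap can be left to GNU xargs; urls.txt and the flag choices below are illustrative (-P is the parallelism limit, -r skips the run on empty input, both GNU extensions):

```shell
#!/bin/sh
# Sketch: each line of urls.txt becomes one wget invocation; -P 3 keeps
# at most three running at once, matching the three-connection budget.
cat urls.txt 2>/dev/null | xargs -r -P 3 -n 1 wget -nc -q
```

Per-site limiting (three connections to one host, rather than three total) would still need a bit of glue to bucket the URLs by hostname first.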
Even routine reading of some known data page requires some effort to get it right. We read PhishTank's entire XML list of phishing sites every three hours. Doing this reliably is non-trivial. PhishTank just overwrites their file when they update, rather than replacing it with a new one.
That's no problem as long as you pay attention to the HTTP timestamp.
(This is one of the design errors of UNIX, as Stallman once pointed out. Yes, there are workarounds they could do.) So we have to read the file twice, a minute apart, and wait until we get two identical copies. Then we have to check for 1) an empty file, 2) a file with proper XML structure but no data records, and 3) an improperly terminated XML file, all of which we've encountered.
Oh. My.
I'd do a HEAD as the second request, and check the Last-Modified time stamp.
If the Date in the fetch was later than this, and you got a 2xx return code, all is well, and there's no need to download two copies, blatantly disregarding the "X-Request-Limit-Interval: 259200 Seconds" as you do.
It'd be much faster too. But what do I know...
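The HEAD-plus-Last-Modified check might be sketched with curl (-I sends a HEAD request); the URL is a placeholder and the header parsing is one way among several:

```shell
#!/bin/sh
# Sketch: fetch only the headers and extract the Last-Modified value.
last_modified() {
    curl -sI "$1" | tr -d '\r' \
        | awk 'tolower($1) == "last-modified:" { sub(/^[^ ]+ /, ""); print }'
}

# Usage (hypothetical URL): re-download only if the stamp changed.
# last_modified http://example.com/feed.xml
```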
The clowns who try to do stuff like this with shell scripts and cron jobs spend a big fraction of their time dealing manually with the failures.
The clowns who do stuff like this with the simplest tools that do the job (
Re:Let me tell you a story (Score:4, Informative)
Well, Linux IS Unix, just without the trademark, but I didn't really come here to correct your misconception on that.
What I wanted to highlight was the reality behind your statements "we have fifty times as many Windows servers as the other two combined" and "The building where I work has a ratio of about 1 production Windows server for every four employees. If you count non-production servers, we have more Windows servers than people."
This is most certainly not because Windows is so much better or more popular than the other platforms at your place of work. Any experienced sysadmin who is not a Microsoft apologist will confirm that for any typical datacenter server function, you need more instances of Windows to get the same capacity, reliability, and uptime as fewer instances of other server operating systems. It's just the nature of the Microsoft stack that effective load-sharing and failover are necessities in capacity planning. Anyone who argues that a single instance of Windows is equal to a single instance of AIX or Linux has simply never been part of real-world datacenter administration.
In short, your employer may have a lot more Windows servers than anything else, but that certainly doesn't mean Windows is better or more popular -- it just demonstrates how the TCO of Windows is terrible.
Re:Let me tell you a story (Score:3, Informative)
How, exactly, are they brittle?
The principal brittleness of shell scripts is their assumption that filenames don't contain odd characters like spaces. The shell word-splits every unquoted expansion, so an unguarded script breaks the moment some user insists on creating a directory called "Documents and Settings"; most other languages don't do that kind of automatic splitting.
(You can write armored shell scripts that cope just fine with this - I've done that quite a bit over the years - but a lot of people don't.)
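One common armoring pattern is NUL-delimited find output plus quoted expansions (bash-specific: read -d '' and process substitution), sketched here against a throwaway directory:

```shell
#!/bin/bash
# Sketch: count files safely even when names contain spaces.
dir=$(mktemp -d)
mkdir -p "$dir/Documents and Settings"
touch "$dir/Documents and Settings/ntuser.dat" "$dir/plain.txt"

count=0
# -print0 / read -d '' keep each name as one token, whatever it contains
while IFS= read -r -d '' f; do
    count=$((count + 1))
done < <(find "$dir" -type f -print0)

echo "found $count files"
rm -rf "$dir"
```

The unarmored equivalent, `for f in $(find "$dir" -type f)`, would split "Documents and Settings/ntuser.dat" into three bogus names.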
Re:8 keywords? (Score:3, Informative)
So if I limit myself to 8 keywords my code has less defects and is more maintainable?
... fewer defects. Never mind.
Re:which language is best? (Score:3, Informative)
Wget is made for crawling at scale.
wget --random-wait
The way I do this with wget is to use wget to generate a list of URLs, then launch a separate wget process with varying source IPs specified with --bind-address. It would, however, be trivial to add a --randomize-bind-address option to wget source.
What makes you think you can't handle these things with wget?
Again, why do you think wget is inadequate to this? It's not.
Any custom-coded wget alternative will be implementing a great deal of wget. Most limitations of wget can be avoided by launching multiple wget processes, putting a bit of intelligence into the glue that does so. If that isn't enough, it probably makes sense to make minor alterations to wget source instead of coding something new.
My point here is just that wget is way more awesome than you give credit.
Re:It's easy to overthink even in the simplest cas (Score:3, Informative)
Psst,
" | sort | uniq -c "
Will sort and then count repetitive lines and output count, line. You can pipe the result back through sort -n if you want a frequency sort or sort -k 2 for item sorting.
The problem was not figuring out how to count the unique items. It's the part before the pipe that was difficult. The poster needed to combine the results of two different commands and then compute the unique items. The solution would have to be, logically, "command1 + command2 | sort | uniq -c".
Unless you can find a way to pass the output from command1 through command2, you will lose command1's data. The solution he/she found was elegant: (command1):(command2) | someKindOfSort. My syntax is probably wrong. If you were simply pointing out a better way to sort, then please disregard.
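The grouping being reached for here is probably a `{ ...; }` compound command, which feeds both commands' output into a single pipeline; the two printf calls stand in for command1 and command2:

```shell
#!/bin/sh
# Sketch: concatenate the output of two commands, then count uniques.
merged=$({ printf 'a\nb\n'; printf 'b\nc\n'; } | sort | uniq -c)
echo "$merged"
```

A subshell group `( command1; command2 )` works the same way, at the cost of a fork.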