
Taco Bell Programming 394

theodp writes "Think outside the box? Nah, think outside the bun. Ted Dziuba argues there's a programming lesson to be learned from observing how Taco Bell manages to pull down $1.9 billion by mixing-and-matching roughly eight ingredients: 'The more I write code and design systems, the more I understand that many times, you can achieve the desired functionality simply with clever reconfigurations of the basic Unix tool set. After all, functionality is an asset, but code is a liability. This is the opposite of a trend of nonsense called DevOps, where system administrators start writing unit tests and other things to help the developers warm up to them — Taco Bell Programming is about developers knowing enough about Ops (and Unix in general) so that they don't overthink things, and arrive at simple, scalable solutions.'"
This discussion has been archived. No new comments can be posted.

Taco Bell Programming

  • 8 keywords? (Score:2, Funny)

    by lalena ( 1221394 )
    So if I limit myself to 8 keywords my code has less defects and is more maintainable?
    • Re:8 keywords? (Score:5, Insightful)

      by Anonymous Coward on Sunday October 24, 2010 @04:52PM (#34006696)

      exactly.

      Those 8 keywords are + - < > [ ] . ,

    • Re:8 keywords? (Score:5, Insightful)

      by hardburn ( 141468 ) <<hardburn> <at> <wumpus-cave.net>> on Sunday October 24, 2010 @05:07PM (#34006788)

      Ook! Ook?

    • by EdIII ( 1114411 )

      I don't get that either, but the summary said 8 ingredients. Which made me wonder if all of Taco Bell's food is made from 8 basic ingredients. That seems to be what it is saying... right?

      Either way, now I am confused and hungry.

      • I thought there were three basic ingredients:
        • Protons
        • Neutrons
        • Electrons
      • Re:8 keywords? (Score:4, Informative)

        by iamnobody2 ( 859379 ) on Sunday October 24, 2010 @10:23PM (#34008516)
        8 ingredients, no. I've worked at a Taco Bell; there are a few more than that. This is most of the hot line: beef, chicken, steak, beans, rice, potatoes, red sauce, nacho cheese sauce, and green sauce (only used by request). The cold line: lettuce, tomatoes, cheddar cheese, 3-cheese blend, onions, fiesta salsa (pico de gallo: the same tomatoes and onions mixed with a sauce), sour cream, guacamole, baja sauce, and creamy jalapeno sauce. Plus 5 kinds/sizes of tortillas (3 sizes of regular, 2 sizes of die cut), nacho chips, etc. Here's an interesting fact: those Cinnamon Twists you may or may not love? They're made of deep-fried rotini (a type of pasta, usually boiled).
    • Re: (Score:3, Informative)

      by drjzzz ( 150299 )

      So if I limit myself to 8 keywords my code has less defects and is more maintainable?

      ... fewer defects. Never mind.

  • My order (Score:4, Funny)

    by Wingman 5 ( 551897 ) on Sunday October 24, 2010 @04:49PM (#34006688)

    Can I get a server logging system, hold the email notifications? Can I get extra rotating log files with that?

  • by phantomfive ( 622387 ) on Sunday October 24, 2010 @04:49PM (#34006690) Journal
    Reminds me of a job interview I did once with an old guy who had around 30 different programming languages on his resume. I asked him which programming language was his favorite, expecting it to be something like Lisp or Forth, but he said, "shell script." I was a bit surprised, but he said, "it lets me tie pieces in from everywhere and do it all with the least amount of code."

    I wasn't entirely convinced, but he did have the resume. Seems Mr Dziuba is from the same school of thought. I read the full introduction to the DevOps page and I'm still not entirely sure what it's about. We should work together and deliver on time, or something like that.
    • by visualight ( 468005 ) on Sunday October 24, 2010 @05:02PM (#34006750) Homepage

      The DevOps thing is yet another crock of shit on par with 'managing programmers is like herding cats' and web2.0

    • by martin-boundary ( 547041 ) on Sunday October 24, 2010 @05:59PM (#34007068)
      Sadly, Mr Dziuba has the right idea but uses terrible examples in his blogpost.

      Wget for crawling tens of millions of web pages using a 10 line script? He doesn't understand crawling at scale.

      There's a lot more to it than just following links. For example, lots of servers will block you if you start ripping them in full, so you need to have a system in place to crawl sites over many days/weeks a few pages at a time. You also want to distribute the load over several IP addresses, and you need logic to handle things like auto-generated pages, tar pits, temporarily down sites, etc. And of course you want to coordinate all that while simultaneously extracting the list of URLs that you'll hand over to the crawlers next.

      His other example is also bullshit. Tens of millions of webpages are not that much for a single PC; it hardly justifies using MapReduce, especially if you're only going to process pages independently with zero communication between processes.

      MapReduce is all about cutting the dataset into chunks, then alternating between 1) an (independent) processing phase on each chunk, and 2) a communication phase where the partial results are combined. And where this really pays off is when you have so much data that you need a distributed filesystem.
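
      To make the shape of that concrete, here is a toy version of the same two-phase pattern done with plain Unix tools; the file names and the word-count task are invented for illustration:

          # "map": split the input into chunks and process each chunk independently, in parallel
          split -l 100000 pages.txt chunk_
          for c in chunk_*; do
              awk '{ count[$1]++ } END { for (k in count) print k, count[k] }' "$c" > "$c.partial" &
          done
          wait
          # "reduce": one communication phase that merges the partial results
          cat chunk_*.partial | awk '{ total[$1] += $2 } END { for (k in total) print k, total[k] }' > merged.txt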

      • Re: (Score:3, Informative)

        by Andrew Cady ( 115471 )

        Wget for crawling tens of millions of web pages using a 10 line script? He doesn't understand crawling at scale.

        Wget is made for crawling at scale.

        There's a lot more to it than just following links. For example, lots of servers will block you if you start ripping them in full, so you need to have a system in place to crawl sites over many days/weeks a few pages at a time.

        wget --random-wait
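
        That plus wget's other built-in throttling options already cover a lot of the "crawl slowly over days" requirement. A hedged sketch, with the URL and the numbers invented:

            wget --recursive --level=3 --wait=10 --random-wait --limit-rate=100k \
                 --tries=3 --waitretry=300 http://example.com/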

        You also want to distribute the load over several IP addresses

        The way I do this with wget is to use wget to genera

    • by ShakaUVM ( 157947 ) on Sunday October 24, 2010 @06:02PM (#34007088) Homepage Journal

      >>I asked him which programming language was his favorite, expecting it to be something like Lisp or Forth, but he said, "shell script."

      Shell script is awesome for a large number of tasks. It can't do everything (otherwise we'd just teach scripting and be done with a CS degree in a quarter), but there are a lot of times when someone thinks they're going to have to write a long program involving a lot of text parsing, and you just go, "Well, just cut out everything except the field you want, pipe it through sort|uniq, and then run an xargs on the result." You get done in an hour (including writing, args checking, and debugging) what another person might spend a week doing in C (which is spectacularly unsuited for such tasks anyway).
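
      As a concrete sketch of that kind of one-liner (the field number and log name are made up for the example):

          # keep only the field we care about, de-duplicate it, then run a command per value
          cut -d' ' -f7 access.log | sort | uniq | xargs -n1 -I{} echo "would process {}"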

      • Re: (Score:3, Insightful)

        by gangien ( 151940 )

        (otherwise we'd just teach scripting and be done with a CS degree in a quarter)

        because the programming language has so much to do with CS?

        • by itlurksbeneath ( 952654 ) on Sunday October 24, 2010 @09:13PM (#34008186) Journal
          Bingo. CS has nothing to do with programming languages. It's about PROGRAMMING. Lots of CS grads still don't get this. They are typically the mediocre programmers that move on to project management (or something else that doesn't involve programming) fairly quickly. Or they end up doing horrible ASP web apps and Microsoft Access front ends.
      • Re: (Score:3, Interesting)

        by flnca ( 1022891 )

        what another person might spend a week doing in C (which is spectacularly unsuited for such tasks anyway).

        A skilled C programmer also needs less than 1 hour for something like that. The standard C library has a lot of text processing functions (like sscanf()), plus it has a qsort(). Ever wonder why the C I/O library is suitable for managing database files? All the field functions in fscanf()/fprintf() etc. are suitable for database management.

        Also, C is still one of the prime choice languages for writing compilers, which do a lot of text processing.

        • by ShakaUVM ( 157947 ) on Monday October 25, 2010 @02:40AM (#34009516) Homepage Journal

          >>A skilled C programmer also needs less than 1 hour for something like that.

          Hmm, well if you want to time yourself, here's a common enough task that I automate with shell scripts. I just timed myself. Including logging in, doing a detour into man and a 'locate access_log' to find the file, it took a bit less than 4 minutes.

          tail -n 100 /var/log/apache2/access_log | cut -f1 -d" " | sort | uniq

          Grabs the end of the access_log and shows you the last few ip addresses that have connected to your site. I do something like this occasionally. Optionally pipe it into xargs host to do DNS lookups on them, if that's how you prefer to roll.
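
          Concretely, the DNS-lookup variant would be something like this (same caveat about where your access_log actually lives):

              tail -n 100 /var/log/apache2/access_log | cut -f1 -d" " | sort | uniq | xargs -n1 host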

          I'm honestly curious how long it will take you to do it in C, with/without the DNS lookup. Post source if you don't mind.

  • by Anonymous Coward on Sunday October 24, 2010 @04:52PM (#34006700)

    Good grief, I think this is yet another useless article from the Ted Dziuba/Jeff Atwood/Joel Spolsky crowd. They spew out article after article after article with, in my opinion, bullshit "insights" that don't hold any water in the real world. Yet they've developed such a large online following, mainly of "web designers", "JavaScript programmers" and "NoSQL DBAs", that it tricks a lot of people in the industry into thinking what they say actually has some usefulness, when it usually doesn't.

    Yeah, it's great when we can write a few shell or Perl scripts to perform simple tasks, but sometimes that's just not sufficient. Sometimes we do have to write our own code. While UNIX offers a very practical and powerful environment, we shouldn't waste our time trying to contort its utilities to fit all sorts of problems, especially when it'll be quicker, easier and significantly more maintainable to roll some tools by hand.

    • by Giant Electronic Bra ( 1229876 ) on Sunday October 24, 2010 @06:07PM (#34007124)

      Once, about 20 years ago, I worked for a company whose line of business generated a VERY large amount of data which for legal reasons had to be carefully reduced, archived, etc. There were various clusters of VMS machines which captured data from different processes to disk, from where it was processed and shipped around. There were also some of the 'new fangled' Unix machines that needed to integrate into this process. The main trick was always constantly managing disk space. Any single disk in the place would probably have 2-10x its capacity worth of data moving on and off it on any given day. It was thus VITAL to constantly monitor disk usage in pretty much real time.

      On VMS the sysops had developed a system to manage all this data which weighed in at 20-30k lines of code. This stuff generated reports, went through different drives and figured out what was going in where, compared it to data from earlier runs, created deltas, etc. It was a fairly slick system, but really all it did was iterate through directories, total up file sizes, and write stuff to a couple report files, and send an email if a disk was filling up too fast.

      So one day my boss asks me to write basically the same program for the Unix cluster. I had a reputation as the guy who could figure out weird stuff. I'd even played a small amount with Unix systems before. So I whipped out the printed man pages and started reading. Now I figured I'd have to write a whole bunch of code; after all, I was duplicating an application that has like 30k lines of code in it, not gigantic but substantial. Pretty soon though I learned that every command line app in Unix could feed into the other ones with a pipe or a temp file. Pretty soon I learned that those apps produced ALL the data that I wanted and produced it in pretty much the format that I needed. All that I really had to do was glue it together properly. Pretty soon (thank God it starts with A) I found awk, and then sed. 3 days after that I had 2 awk scripts, a shell script that ran a few things through sed, a cron job, and a few other bits. It was maybe 100 lines of code, total. It did MORE than the old app. It was easy to maintain and customize. It saved a LOT of time and money.
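
      For flavor, a rough sketch of that kind of glue; the thresholds, paths, and address here are invented, and a real version would be tuned to the site:

          # flag filesystems over 90% full and mail a short report, then log the biggest consumers
          df -P | awk 'NR > 1 && $5+0 > 90 { print $6, $5 }' > /tmp/full_disks.txt
          if [ -s /tmp/full_disks.txt ]; then
              mail -s "disks filling up" ops@example.com < /tmp/full_disks.txt
          fi
          du -sk /data/* | sort -rn | head -20 >> /var/log/disk_report.log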

      There's PLENTY to recommend the KISS principle in software design. Not every problem can be solved with a bit of shell coding of course, but it is always worth remembering that those tools are tried and true and can be deployed quickly and cheaply. Often they beat the pants off fancier approaches.

      One other thing to remember from that project. My boss was the one that wrote the 30k LoC monstrosity. The week after I showed her the new Unix version, I got downsized out the door. People HATE it when you show them up...

      • Re: (Score:3, Insightful)

        by Jaime2 ( 824950 )
        BTW, awk is a programming language. Really, all you did was write their process in a different language, not convert it from a custom program to some built-in tools.

        As a side note, I have a hard time with the concept that it took the VMS guys 30000 lines of code to do what could be done with a handful of regular expressions. They were either really bad at it, or it had grown for years and nobody had the guts to purge the dead code.
        • by Giant Electronic Bra ( 1229876 ) on Sunday October 24, 2010 @07:55PM (#34007828)

          Sure, awk is a programming language. It is also a command line tool. A bit more flexible than most, but you can't really draw a line between something that is a programming language and something that is a 'built in tool'.

          I really have no idea WHY their code was so large. It was all written in FORTRAN and VMS is honestly a nightmarishly complicated operating environment. A lot of it is probably related to the fact that Unix has a much simpler and more orthogonal environment. Of course this is also WHY Unix killed VMS dead long ago. Simplicity is a virtue. This is why Windows still hasn't entrenched itself forever in the server room. It lacks the simple elegance of 'everything is a byte stream' and 'small flexible programs that simply process a stream'. Those are powerful concepts upon which can be built a lot of really complex stuff in a small amount of code.

      • Re: (Score:3, Insightful)

        by symbolic ( 11752 )

        This kind of story makes me laugh when I see/hear anecdotes that have management talking about metrics like LoC.

        • by Kjella ( 173770 ) on Monday October 25, 2010 @12:19AM (#34008980) Homepage

          LOC is roughly as meaningless as valuing a document by its word count. You could spend tons of research on something summed up in a few pages, or get an endless word diarrhea of mindless babble spewed out at 300 WPM. But people need to measure progress. Yes, I've seen how it gets when nobody measures progress and everyone pretends the last 10% of the code will suddenly turn a turd into a gem; if so, expect the people with survival skills to disappear some 80% into the project. Another disastrous variation is to leave it entirely up to the subjective opinion of the manager, which in any sizable company means your career depends on your favor with the PHB and his lying skills compared to the other PHBs.

          Saying it's bad is like shooting fish in a barrel. Coming up with a good system of objectively measuring code design and quality that works in a large organization is ridiculously hard, particularly since everybody tries to wiggle out of the definitions and game whatever you measure: if you made avoiding LoC a metric, then the lines would be compacted to the point of obfuscation, with hideous cross-calling to save lines. You want people to hit a sane level of structuring and code reuse, neither LoC bloat nor 4k compos.

  • You can easily have a little more or less salt, sugar or flour in your food. However, software is not so forgiving. Change one character and you screw up badly. Let's face it, software is hard to write, and it is even harder to write good software.

    Re-use is a good thing, and scripting many common problems instead of coding them in [insert low-level language] is also good. But this should be common sense for any /good/ programmer. Good tools make bad programmers look slightly less bad, but they fuck up anyway. G

  • ...you insensitive clod!

    8 commands. period. no more, no less. Super maintainable, cross platform and...

    bah, who am I kidding?

  • When I saw the title I thought it was a book review of a new O'Reilly release of that name.

  • by topham ( 32406 ) on Sunday October 24, 2010 @05:06PM (#34006776) Homepage

    I limit myself to two bits. A 0 and a 1.

    Why would I need 8?

  • Seriously, what's going on with the articles here? "My code is like a Taco"? Is that flying because of CmdrTaco's username?

    Nothing new here:
    1) Code reuse. Woopdeedoo. The whole industry has invested heavily in many paradigms for reusing code: The reusable library, module reuse, object reuse etc.
    2) Stringing Unix commands together is news? Did I just take a DeLorean back to 1955? (Well, that's a slight exaggeration. Unix has only been around since the 70s)

    Finally, who wants to compare their code reuse to a c

    • by Tablizer ( 95088 ) on Sunday October 24, 2010 @05:13PM (#34006820) Journal

      I've found the best reuse comes from simple modules, not from complex ones that try to do everything. The one that tries to do everything will still be missing the one feature you need. It's easier to add the features you need to the simple one because it's, well, simpler. With the fancier one you have to work around all the features you don't need to add those that you do need, creating more reading time and more mistakes.

      • by syousef ( 465911 ) on Sunday October 24, 2010 @05:33PM (#34006934) Journal

        I've found the best reuse comes from simple modules, not from complex ones that try to do everything. The one that tries to do everything will still be missing the one feature you need. It's easier to add the features you need to the simple one because it's, well, simpler. With the fancier one you have to work around all the features you don't need to add those that you do need, creating more reading time and more mistakes.

        Agreed. With most complex frameworks there is also the additional overhead of having to do things in a particular way. If you try to do it differently or need to add a feature that wasn't designed for in the original framework, you often find yourself fighting it rather than working with it. At that point you should ditch the framework, but often it's not your decision to make, and then cost of redoing things once the framework is removed makes it impractical.

        • Re: (Score:3, Insightful)

          by aztracker1 ( 702135 )
          I think this is one of the reasons why jQuery has become so popular... it does "just enough" in a consistent (relatively) way, with a decent plug-in model... so stringing things together works pretty well, and there is usually a plugin for just about anything you'd want/need. Though it's maybe a bit heavier than hand crafted code, stringing jQuery and plugins is less debt, with more reuse. I do have a few things in my current js toolbox... namely some JS extensions, json2 (modified), date.js, jquery, jque
    • by seebs ( 15766 )

      About twenty years ago, I was dating someone who was working on what she called the "Taco Bell theory of fashion", which was that you have a smallish number of items of clothing which all go together.

      I think it's just that they're a particularly impressive example, familiar to a lot of people, of an extremely broad variety of foods made from a very small number of ingredients. ... And yes, stringing commands together is, empirically, news to many people, because I keep finding people who can't do it.

  • by sootman ( 158191 ) on Sunday October 24, 2010 @05:08PM (#34006794) Homepage Journal

    From over a decade ago: Taco Bell's Five Ingredients Combined In Totally New Way [theonion.com]

    I think of that every time Taco Bell adds a "new" item to their menu.

  • From TFA (Score:4, Interesting)

    by Jaime2 ( 824950 ) on Sunday October 24, 2010 @05:10PM (#34006806)
    From the article:

    I made most of a SOAP server using static files and Apache's mod_rewrite. I could have done the whole thing Taco Bell style if I had only manned up and broken out sed, but I pussied out and wrote some Python.

    It seems that only software he knows counts as "Taco Bell ingredients". I'd trust Axis (or any other SOAP library) much more than sed to parse a web service request. Heck, if you discount code that you don't directly maintain, SOAP requires very little code other than the functionality of the service itself. I had a boss like this once. He would let you do anything as long as you used tools he was familiar with, but if you brought in a tool that he didn't know, you had to jump through a thousand extra testing hoops. He stopped doing actual work and got into management in the early 90's, so he pretty much didn't know any modern tool. He once made me do a full regression test on a 50KLOC application to get approval to add an index to a Microsoft SQL Server table.

    • Re: (Score:3, Interesting)

      by metamatic ( 202216 )

      Heck, if you discount code that you don't directly maintain, SOAP requires very little code other than the functionality of the service itself.

      However, any time you change the API--even to make a change that no client should notice--you have to regenerate the glue code from the WSDL and recompile all your client programs. Which is why these days, I build REST-based web services.

  • Simplicity (Score:5, Insightful)

    by SimonInOz ( 579741 ) on Sunday October 24, 2010 @05:11PM (#34006808)

    The complexity people seem to delight in putting into things always amazes me. I was recently working at a major bank (they didn't like me eventually as I'm bad at authority structures). Anyway the area I was working on involved opening bank accounts from the web site. Complicated, right? The new account holder has to choose the type of account they want (of about 7), enter their details (name, address, etc), and press go. Data gets passed about, the mainframe makes the account, and we return the new account number.

    Gosh.

    So why, oh tell me why, did they use the following list of technologies (usually all on the same jsp page) [I may have missed some]
    HTML
    CSS
    JSP (with pure java on the page)
    Javascript (modifying the page)
    JQuery
    XML
    XSLT
    JDBC with Hibernate
    JDBC without Hibernate
    Custom Tag library
    Spring (including AOP)
    J2EE EJBs
    JMS

    Awesome. All this on each of the countless pages, each custom designed and built. Staggering. In fact, the site needed about 30 pages, many of them minor variations of each other. The whole thing could have been built using simple metadata. It would have run faster, been easier to debug and test (the existing system was a nightmare), and been easily changeable to suit the new business requirements that poured in.

    So instead of using one efficient, smart programmer for a while, then limited support after that, they had a team of (cheap) very nervous programmers, furiously coding away, terrified lest they break something. And yes, there were layers and layers of software, each overriding the other as the new programmer didn't understand the original system, so added their own. Palimpsest, anyone?

    And yet, despite my offers to rebuild the whole thing this way (including demos), management loved it. Staggering.

    But I still like to keep things simple. And yes, my name is Simon. And yes, I do want a new job.

    • Re:Simplicity (Score:5, Insightful)

      by MichaelSmith ( 789609 ) on Sunday October 24, 2010 @05:14PM (#34006830) Homepage Journal

      Complexity creates bugs

      Bugs create employment

    • Re:Simplicity (Score:4, Interesting)

      by swamp boy ( 151038 ) on Sunday October 24, 2010 @05:24PM (#34006886)

      Sounds like your coworkers are busily filling out their resumes with all the latest fad software tools. Like you, I despise such thinking, and it's why I pass on any job opportunity where 'web apps' and 'java' are used in the same description.

      • Out of curiosity, what would you write a web app in?

        • Really, I don't see anything wrong with using Java to write web apps. The problem is when all the 30 different libraries, frameworks, extensions, etc. get thrown in. I steer clear of anything that even mentions Hibernate, Spring (esp. AOP), and any mix of more than about 4 different technologies.

    • Re:Simplicity (Score:4, Insightful)

      by NorbrookC ( 674063 ) on Sunday October 24, 2010 @05:53PM (#34007032) Journal
      It seems to me that the point is that programmers have their own variant of the "if all you have is a hammer, every problem looks like a nail" saying. In this case, they have a huge toolbox, so every time they need to drive a nail, they must design and use a methodology that will, eventually, cause the nail to be pushed into place, instead of just reaching for the hammer and getting the job done.
    • Hello, devil's advocate here... I totally agree with the sentiment that keeping things simpler is preferable, and that there are problems created by programmers who either don't care, are trying to preserve their job security (or pad their resumes with buzzwords), don't know better, or don't take the time to think out the design/maintainability of what they are doing.

      On a recent project to provide real-time, asynchronously updating, data-driven, interactive graphs and gauges on a modern web application, I h

  • Reference ? (Score:2, Funny)

    by dargaud ( 518470 )
    So now Taco Bell is a reference for both cooking and programming? I ate there exactly once and it tasted like sucking ass off a dead donkey. I pity the people who've been forced to eat there since a young age and now think this is 'food'. Yeah, flamebait, etc...
    • Re: (Score:3, Insightful)

      I'm more disturbed by the fact that you know what dead donkey ass tastes like...
    • If you had paid attention in shell class and Taco Bell -- you would know that the Taco Bell ingredients are great for quickly passing through your pipeline.

      Just - try to pipe it through tail instead of head.

      • Re: (Score:3, Funny)

        by msaavedra ( 29918 )

        Taco Bell ingredients are great for quickly passing through your pipeline

        That's why one of my friends calls the place Taco Bowel. It's much more descriptive than the commonly-heard Taco Hell.

    • We are one step closer to idiocracy [imdb.com]! I for one welcome my new "AOL Time Warner Taco Bell US Government Long Distance" overlords.

      Mmmm.. foamy lattes.

      -6d

    • Re: (Score:3, Funny)

      by couchslug ( 175151 )

      That post is worthless without pics!

  • Unexpected (Score:4, Interesting)

    by DWMorse ( 1816016 ) on Sunday October 24, 2010 @05:18PM (#34006858) Homepage

    Unexpected comparison of trained coders / developers, many with certifications and degrees, to untrained sub-GED Taco Bell employee... well... frankly, knuckle-draggers.

    Also, I don't care if your code is minimal and profitable, if it gives me a sore stomach as Taco Bell does, I'm opting for something more complex and just... better. Better for me, better for everyone.

    I get the appeal of promoting minimalistic coding styles with food concepts, and it's a refreshing change from the raggedy car analogies... but come on. Taco Bell? Really??

    • Unexpected comparison of trained coders / developers, many with certifications and degrees, to untrained sub-GED Taco Bell employee... well... frankly, knuckle-draggers.


      Oh my, aren't we thin-skinned today?

      I think you missed the point. The equivalent of the "blank-slate" Taco Bell employee is the blank-slate computer that only executes instructions given to it. The people who get compared to good developers are the Taco Bell recipe writers, who managed to deliver instructions that yield quick, cheap, consistent and idiot-proof solutions. Many coders with degrees can't say as much.

  • by Meriahven ( 1154311 ) on Sunday October 24, 2010 @05:39PM (#34006966)

    I once had a pair of command line tools that both printed lists of words (usernames, actually, one per row), and I wanted to find out how many unique ones there were. Obviously, the right hand side part of the pipeline was going to be something along the lines of " | sort -u | wc -l", but then I got utterly stuck by the left hand side. How can I combine the STDOUTs of two processes? Do I really need to resort to using temporary files? Is there really no tool to do the logical opposite of the "tee" command?

    You are probably thinking: "Oh, you silly person, that's so trivial, you must be very incompetent", but in case you aren't, you might want to spend a minute trying to figure it out before reading on. I even asked a colleague for help before realizing that the reason I could not find a tool for the task was quite an obvious one: such a tool does not exist. Or actually it kinda does, but only in an implied sense: what I was hoping to achieve could be done by the humble semicolon and a pair of parens. I only had to put the two commands in parens to run them in a subshell, put a semicolon in between, so one will run after the other is finished, and I was done. I guess it was just that the logical leap from "This task is so simple, there must be a tool for this" to "just run the commands one after another" was too big for my feeble mind to accomplish.

    So I guess the moral of the story is, even if you want to use just one simple tool, you may be overthinking it :-)
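
      For anyone else who got stuck at the same spot, the shape of the answer is simply (command names are placeholders):

          ( list_users_a ; list_users_b ) | sort -u | wc -l

      A brace group, { list_users_a ; list_users_b ; } | sort -u | wc -l, does the same thing without spawning a subshell.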

    • by _LORAX_ ( 4790 ) on Sunday October 24, 2010 @07:42PM (#34007706) Homepage

      Psst,

      " | sort | uniq -c "

      Will sort and then count repetitive lines and output count, line. You can pipe the result back through sort -n if you want a frequency sort or sort -k 2 for item sorting.
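
        For example, a quick most-frequent-first report (the command feeding the pipe is a placeholder):

            print_usernames | sort | uniq -c | sort -rn | head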

      • Re: (Score:3, Informative)

        by eap ( 91469 )

        Psst,

        " | sort | uniq -c "

        Will sort and then count repetitive lines and output count, line. You can pipe the result back through sort -n if you want a frequency sort or sort -k 2 for item sorting.

        The problem was not figuring out how to count the unique items. It's the part before the pipe that was difficult. The poster needed to combine the results of two different commands and then compute the unique items. The solution would have to be, logically, "command1 + command2 | sort | uniq -c".

        Unless you can find a way to pass the output from command1 through command2, you will lose command1's data. The solution he/she found was elegant: (command1):(command2) | someKindOfSort. My syntax is probably

  • What is Devops? (Score:4, Insightful)

    by hawguy ( 1600213 ) on Sunday October 24, 2010 @05:41PM (#34006968)

    I read the linked Devops article and know even less about it than before I read the article. It's full of management buzzwords and I'm sure a CIO would love it, but what does it mean?

    How does Devops help?

    The Devops movement is built around a group of people who believe that the application of a combination of appropriate technology and attitude can revolutionize the world of software development and delivery.

    ...

    Beyond this multi-disciplinary approach, the Devops movement is attempting to encourage the development of communication skills, understanding of the domain in which the software is being written, and, crucially, a sensitivity and passion for the underlying business, and for ensuring it succeeds.

    oh yeah, that clears it up. All it takes is a passion for the underlying business and it's sure to succeed!

  • If your software project is comparable to crappy fast food then you're doing something wrong. Code obesity is killing our kids! On a more serious note, if you're reusing code you may be bringing along a lot of unnecessary fat that you really didn't need. If you really want a lean, mean program you will not be bringing in feature-laden libraries; you'll have to rewrite some stuff yourself.

    The very top chefs and cooks will use 5-8 ingredients at the most to make dishes, they understand the importance of si
    • > The very top chefs and cooks will use 5-8 ingredients at the most to make dishes

      Curry, rice, chicken, oil, salt.

      That's eight, and boring!

  • Shell scripting is fine for stuff that *only you* are going to use. It's just not robust enough for use in anything important that more than one person might actually use. For example, handling paths with spaces is pretty damn hard - loads of scripts can't handle them.

  • by gblackwo ( 1087063 ) on Sunday October 24, 2010 @06:00PM (#34007080) Homepage
    Mexican food's great, but it's essentially all the same ingredients, so there's a way you'd have to deal with all these stupid questions. "What is nachos?" "...Nachos? It's tortilla with cheese, meat, and vegetables." "Oh, well then what is a burrito?" "Tortilla with cheese, meat, and vegetables." "Well then what is a tostada?" "Tortilla with cheese, meat, and vegetables." "Well then what i-" "Look, it's all the same shit! Why don't you say a spanish word and I'll bring you something." - Jim Gaffigan
  • by Animats ( 122034 ) on Sunday October 24, 2010 @06:54PM (#34007398) Homepage

    A big problem with shell programming is that the error information coming back is so limited. You get back a numeric status code, if you're lucky, or maybe a "broken pipe" signal. It's difficult to handle errors gracefully. This is a killer in production applications.
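
    A minimal sketch of what "handling" an error amounts to in a shell pipeline, assuming bash and an invented URL:

        set -o pipefail          # otherwise a pipeline only reports the status of its last command
        wget -q -O page.html http://example.com/page.html
        status=$?
        if [ "$status" -ne 0 ]; then
            # a bare exit code is about all the information that comes back
            echo "fetch failed with exit status $status" >&2
        fi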

    Here's an example. The original article talks about reading a million pages with "wget". I doubt the author of the article has actually done that. Our sitetruth.com system does in fact read a million web pages or so a month. Blindly getting them with "wget" won't work. All of the following situations come up routinely:

    • There's a network error. A retry in an hour or so needs to be scheduled.
    • There's an HTTP error. That has to be analyzed. Some errors mean "give up", and some mean "try again later".
    • The site's HTML contains a redirect, which needs to be followed. "wget" won't notice a redirect at the HTML level.
    • The site's "robots.txt" file says we shouldn't read the file from a bulk process. "wget" does not obey "robots.txt".
    • The site is really, really slow. Some sites will take half an hour to feed out a page. Maybe they're overloaded. Maybe their denial of service detection software has tripped and is metering out bytes very slowly in defense. You don't want this to hold up the entire operation. Last week, for some reason, "orbitz.com" did that.
    • The site doesn't return data at all. Some British university sites have a network implementation which, if asked for a HTTPS connection, does part of the SSL connection handshake and then just stops, leaving the TCP connection open but sending nothing. This requires a special timeout.
    • The site doesn't like too many simultaneous connections from the same IP address. We limit our system to three simultaneous connections to a given site, so as not to overload it.

    That's just reading the page text. More things can go wrong in parsing.

    Even routine reading of some known data page requires some effort to get it right. We read PhishTank's entire XML list of phishing sites every three hours. Doing this reliably is non-trivial. PhishTank just overwrites their file when they update, rather than replacing it with a new one. (This is one of the design errors of UNIX, as Stallman once pointed out. Yes, there are workarounds they could do.) So we have to read the file twice, a minute apart, and wait until we get two identical copies. Then we have to check for 1) an empty file, 2) a file with proper XML structure but no data records, and 3) an improperly terminated XML file, all of which we've encountered. Then we pump the data into a MySQL database, prepared to roll back the changes if some error is detected.
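
    In outline, that dance is simple enough to sketch even with shell tools, though the real system presumably does it in proper code; curl and xmllint are assumed to be available, and the URL and file names are invented:

        # fetch twice, a minute apart, and only accept the feed once two reads agree and parse cleanly
        curl -s -o feed1.xml http://example.com/phish-feed.xml
        sleep 60
        curl -s -o feed2.xml http://example.com/phish-feed.xml
        if cmp -s feed1.xml feed2.xml && [ -s feed1.xml ] && xmllint --noout feed1.xml; then
            echo "feed is stable and well-formed; safe to load into the database"
        else
            echo "feed still changing, empty, or malformed; retry later" >&2
        fi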

    The clowns who try to do stuff like this with shell scripts and cron jobs spend a big fraction of their time dealing manually with the failures. If you do it right, it just keeps working. One of my other sites, "downside.com", has been updating itself daily from SEC filings for over a decade now. About once a month, something goes wrong with the nightly update, and it's corrected automatically the next night.

    • by arth1 ( 260657 ) on Sunday October 24, 2010 @09:21PM (#34008230) Homepage Journal

      The site's HTML contains a redirect, which needs to be followed. "wget" won't notice a redirect at the HTML level.

      Actually, it does. But in any case, this is why you parse the HTML after fetching it with wget -- how else can you get things like javascript generated URLs to work?

      The site's "robots.txt" file says we shouldn't read the file from a bulk process. "wget" does not obey "robots.txt".

      From the wget man page:

      Wget can follow links in HTML, XHTML, and CSS pages, to create local
      versions of remote web sites, fully recreating the directory structure
      of the original site. This is sometimes referred to as "recursive
      downloading." While doing that, Wget respects the Robot Exclusion
      Standard (/robots.txt).

      The site is really, really slow. Some sites will take half an hour to feed out a page.

      And you still haven't looked at the wget(1) man page, or you'd know about the --read-timeout parameter.

      Maybe they're overloaded. Maybe their denial of service detection software has tripped and is metering out bytes very slowly in defense. You don't want this to hold up the entire operation. Last week, for some reason, "orbitz.com" did that.

      Not holding up your operation is why you use multiple tools that can run concurrently. A wget of orbitz.com taking forever won't prevent the wget of soggy.com that you scheduled for half an hour later, and neither will stop the parser.
      Of course, if you design an all-eggs-in-one-basket solution that depends on sequential operations, you deserve what you get.
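
        In the simplest form that just means backgrounding each fetch; the site list and options here are invented:

            for site in example.com example.org example.net; do
                wget -q --tries=2 --read-timeout=30 -P "crawl/$site" "http://$site/" &
            done
            wait    # one stalled site delays only its own job, not the others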

      The site doesn't return data at all. Some British university sites have a network implementation which, if asked for a HTTPS connection, does part of the SSL connection handshake and then just stops, leaving the TCP connection open but sending nothing.
      This requires a special timeout.

      Yes, the --connect-timeout.

      The site doesn't like too many simultaneous connections from the same IP address. We limit our system to three simultaneous connections to a given site, so as not to overload it.

      wget limits to a single connection with keep-alive per instance. (If you want more, spawn more wget -nc commands)

      Even routine reading of some known data page requires some effort to get it right. We read PhishTank's entire XML list of phishing sites every three hours. Doing this reliably is non-trivial. PhishTank just overwrites their file when they update, rather than replacing it with a new one.

      That's no problem as long as you pay attention to the HTTP timestamp.

      (This is one of the design errors of UNIX, as Stallman once pointed out. Yes, there are workarounds they could do.) So we have to read the file twice, a minute apart, and wait until we get two identical copies. Then we have to check for 1) an empty file, 2) a file with proper XML structure but no data records, and 3) an improperly terminated XML file, all of which we've encountered.

      Oh. My.
      I'd do a HEAD as the second request, and check the Last-Modified time stamp.
      If the Date in the fetch was later than this, and you got a 2xx return code, all is well, and there's no need to download two copies, blatantly disregarding the "X-Request-Limit-Interval: 259200 Seconds" as you do.

      It'd be much faster too. But what do I know...
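
        One hedged way to get the same effect with curl is its conditional-fetch option, which sends If-Modified-Since based on a local file's timestamp (URL and file names invented):

            # only transfer the feed if the server reports it changed since our last copy
            curl -s -z phish-feed.xml -o phish-feed.new http://example.com/phish-feed.xml
            if [ -s phish-feed.new ]; then
                mv phish-feed.new phish-feed.xml
            fi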

      The clowns who try to do stuff like this with shell scripts and cron jobs spend a big fraction of their time dealing manually with the failures.

      The clowns who do stuff like this with the simplest tools that do the job (

      • by jklovanc ( 1603149 ) on Monday October 25, 2010 @03:11AM (#34009628)

        It is interesting that wget does not handle errors other than ignoring them and trying to continue. The original poster's first and second points are not addressed. Does that mean the operator has to manually monitor the crons and restart the ones that failed?

        The site is really, really slow. Some sites will take half an hour to feed out a page.

        And you still haven't looked at the wget(1) man page, or you'd know about the --read-timeout parameter.

        Maybe they're overloaded. Maybe their denial of service detection software has tripped and is metering out bytes very slowly in defense. You don't want this to hold up the entire operation. Last week, for some reason, "orbitz.com" did that.

        Not holding up your operation is why you use multiple tools that can run concurrently. A wget of orbitz.com taking forever won't prevent the wget of soggy.com that you scheduled for half an hour later, and neither will stop the parser.
        Of course, if you design an all-eggs-in-one-basket solution that depends on sequential operations, you deserve what you get.

        How do you schedule orbitz.com to go off and then soggy.com to go off later? What if you are handling hundreds of different web sites? Hundreds of crons? How do you retry later on sites that are very slow at the moment? How would you know that wget timed out due to a slow download?

        The site doesn't return data at all. Some British university sites have a network implementation which, if asked for a HTTPS connection, does part of the SSL connection handshake and then just stops, leaving the TCP connection open but sending nothing.
        This requires a special timeout.

        Yes, the --connect-timeout.

        The connection has already been made, so it is not --connect-timeout, it is --read-timeout. That is the problem: there is no separate timeout for slowly getting data vs. getting no data at all.

        The site doesn't like too many simultaneous connections from the same IP address. We limit our system to three simultaneous connections to a given site, so as not to overload it.

        wget limits to a single connection with keep-alive per instance. (If you want more, spawn more wget -nc commands)

        You missed the point; it is not about more connections, it is about limiting connections. Say I am crawling five different sites using host spanning and they all link to the same site. Since there is no coordination between the wgets, it is possible for all of them to connect to the same site at the same time. What if I have 100 crawlers at the same time?

        The original poster is right; using wget ignores errors (timeouts) and does not report them, so there is no way of programmatically figuring out what went wrong and reacting to it.
        Things wget does not do: avoid known non-responsive pages, requeue requests that have timed out or log them so they are not tried again, coordinate multiple crawls so they do not hit the same server simultaneously, handle errors itself. There are probably more.

        This is a perfect example of the 80/20 rule. The "solution" may cover 80% of the problem but that final 20% will require so much babysitting as to make it unusable. Wget is not an enterprise level web crawler.

  • by rwwyatt ( 963545 ) on Sunday October 24, 2010 @08:09PM (#34007904)
    It leaves results in your shorts.
