
The Simian Army and the Antifragile Organization

Posted by Soulskill
from the if-it-ain't-broke-get-a-bigger-hammer dept.
CowboyRobot writes "ACM has an article about how Netflix conducts its resilience testing. Instead of the GameDays used by sites such as Amazon and Google, Netflix uses what they call The Simian Army, based on the philosophy that 'Resilience can be improved by increasing the frequency and variety of failure and evolving the system to deal better with each new-found failure, thereby increasing anti-fragility.' While GameDay exercises are like a fire-drill, with scheduled exercises where failure is manually introduced or simulated, the Simian Army relies on failure in the live environment induced by autonomous agents known as 'monkeys.' Chaos Monkey randomly terminates virtual instances in a production environment that are serving live customer traffic. Chaos Gorilla causes an entire Amazon Availability Zone to fail. And Chaos Kong will take down an entire region of zones. 'What doesn't kill you makes you stronger' and Netflix hopes that by constantly protecting itself from internal onslaught, they will become increasingly 'anti-fragile — growing stronger from each successive stressor, disturbance, and failure.'"
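The escalating failure scopes in the summary (one instance, then a whole zone) can be illustrated with a toy model. This is only a minimal sketch under stated assumptions: `Instance`, `Fleet`, `chaos_monkey`, `chaos_gorilla`, and `can_serve` are hypothetical names invented for illustration, the "fleet" is an in-memory list, and nothing here reflects Netflix's actual tooling or any cloud API.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Instance:
    """Hypothetical stand-in for one virtual instance."""
    instance_id: str
    zone: str
    alive: bool = True


@dataclass
class Fleet:
    """Hypothetical in-memory fleet; a real one would live behind a cloud API."""
    instances: list = field(default_factory=list)

    def live(self):
        # Only instances that have not been terminated.
        return [i for i in self.instances if i.alive]


def chaos_monkey(fleet, rng):
    """Terminate one randomly chosen live instance.

    Returns the terminated instance, or None if nothing is left to terminate.
    """
    candidates = fleet.live()
    if not candidates:
        return None
    victim = rng.choice(candidates)
    victim.alive = False
    return victim


def chaos_gorilla(fleet, zone):
    """Fail every live instance in one zone and return the ones taken down."""
    victims = [i for i in fleet.live() if i.zone == zone]
    for instance in victims:
        instance.alive = False
    return victims


def can_serve(fleet):
    """Toy health check: the service is 'up' while any instance survives."""
    return len(fleet.live()) > 0
```

Running `chaos_monkey` on a schedule against a fleet spread over two zones, and checking `can_serve` after each call, mimics the idea of exercising recovery paths continuously rather than in a scheduled drill.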
This discussion has been archived. No new comments can be posted.

  • Antifragility (Score:2, Informative)

    by Anonymous Coward

    This is the example that I use when explaining antifragility to my colleagues. I highly recommend Nassim Nicholas Taleb's book, "Antifragile" - at least chapters 1, 2, and 7.

  • by ebno-10db (1459097) on Tuesday July 02, 2013 @10:15PM (#44172389)

    No wonder nobody takes Netflix seriously. What kind of tech company worries about things like reliability and robustness? That's soooo 20th century. Everyone knows that if you have more than 90% availability or too low of a bug rate it means you're not agile enough and you can't be one of those amazingly innovative social networking outfits.

    • by stox (131684)

      Assuming the DRM features remain in HTML5, there will be no need for a client, you'll just be able to use your browser.

    • by b4dc0d3r (1268512)

      That's why they wrote apps to lower availability. The anti-fragility thing is a cover invented since the last time the story hit slashdot.

      Their availability is in the higher range of reasonable, as a result of making the simians more powerful. Obviously they work hard at staying within the agile metrics, no matter how much time and money it takes.

    • Totally. I suggest Netflix be fixed via the liberal use of Javascript, both on the Server and the Client.

    • No wonder nobody takes Netflix seriously.

      They paid the programmers peanuts, and got monkeys.

    • No wonder nobody takes Netflix seriously.

      Its impact on the market says otherwise.

      • No wonder nobody takes Netflix seriously.

        Its impact on the market says otherwise.

        Many Slashdotters are impervious to irony.

  • But how do you write the big spec for that? The PMI would never approve.

    sPh

  • by Anonymous Coward

    So this is the ability to use whatever resources are available for graceful failover, allowing masses of cheap, consumer-grade equipment to be used instead of small amounts of expensive, reliable, enterprise gear.
    Sounds like a winning strategy.

    • I wonder how this behaves in the eye of the customer.

      From the cluster solutions I know, there are people maintaining them who mistake a redundant system for one with zero downtime.

      The problem is that if you take down a server, all connections to it go down. Some applications gracefully switch to another server. Some applications, however, first have to time out. Some applications crash.

      The question is, do those interruptions get reported correctly, or do people just blame the app and restart their PC?

      Very few of those
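The three client behaviors described above (switch gracefully, wait for a timeout, or crash) can be sketched as a small failover loop. This is a hedged illustration only: `ServerDown`, `fetch_with_failover`, and the `call` parameter are hypothetical names made up here, and the transport is whatever function the caller passes in, not a real network library.

```python
class ServerDown(Exception):
    """Hypothetical error raised when a server refuses the connection."""


def fetch_with_failover(servers, request, call):
    """Try each server in order and return the first successful response.

    `call(server, request)` is an assumed, caller-supplied transport function
    that returns a response, or raises ServerDown or TimeoutError.
    Returns (server, response, failures), so that interruptions along the way
    are recorded instead of being silently swallowed.
    """
    failures = []
    for server in servers:
        try:
            response = call(server, request)
        except (ServerDown, TimeoutError) as exc:
            # Record which server failed and how, then move on to the next one.
            failures.append((server, type(exc).__name__))
            continue
        return server, response, failures
    # Every server failed: surface the full list rather than crashing obscurely.
    raise RuntimeError(f"all servers failed: {failures}")
```

Returning the `failures` list alongside the response is one way to address the reporting question raised in the comment: the request still succeeds, but the interruption remains visible to whoever collects the result.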

  • It's hard to explain, but antifragility is best built on layers of fragility, meaning cells in a body are fragile but the body itself gets stronger when stressed (lifting weights, etc.). The Netflix example is good; it's a bit like randomly pulling parts off a plane in flight and then, after the crash, making the next planes stronger. It also leads to antifragility, but it's a strong stressor.

  • The Black Swan chaos, Government/Hollywood takeover, the 1+ billion dollar lawsuit, EMP bombs, mass/worldwide migration to internet 2, Yellowstone and of course, the Cthulhu Chaos. Probably the insider threat chaos gets around all these options.
  • I was looking forward to hearing about this army full of primates.

    • I was looking forward to hearing about this army full of primates.

      ... Or at least an army of twelve monkeys.

      Actually, my first thought was the final battle of Planet of the Apes (2001), when all the apes are running towards the grounded ship.

    • I was hoping for a story about an army of simians doing glorious battle with an army of code-monkeys.
  • by girlintraining (1395911) on Tuesday July 02, 2013 @11:39PM (#44172771)

    The problem with this, is that it's still programmed failure. In my experience, hardware or software faults, or combinations of both, are not nearly as effective as plain old human stupidity. Oh, and government action. There is no disaster recovery plan for "Here's a warrant. Give us all your shit." There is a similar lack of recovery options for human stupidity. And let's be honest: It's more abundant in the universe than hydrogen, and infinitely harder to defend against, precisely because stupidity is far more cunning and unpredictable than intelligence could ever hope to be.

    • by c0lo (1497653)

      There is a similar lack of recovery options for human stupidity. And let's be honest: It's more abundant in the universe than hydrogen, and infinitely harder to defend against, precisely because stupidity is far more cunning and unpredictable than intelligence could ever hope to be.

      The last I knew, the stuff that's more abundant than hydrogen was called "dark matter/energy". You mean they lately discovered those are actually "stupidity in action"?

    • by Dahamma (304068)

      The problem with this, is that it's still programmed failure. In my experience, hardware or software faults, or combinations of both, are not nearly as effective as plain old human stupidity.

      But that's largely irrelevant to their testing methodology. They don't just simulate hardware, software, or human faults, they simulate loss of services at various levels of granularity. Doesn't matter whether a server died, someone misconfigured a router, a construction backhoe plowed a fiber cable, a Starz Network-funded hit squad took out their data center, or an earthquake struck the West Coast - it simulates an outage in their network that they want to recover from.

      And let's be honest: It's more abundant in the universe than hydrogen, and infinitely harder to defend against

      Ok, this line is just plain ridiculo

    • by bertok (226922)

      I wrote about this before in an unrelated post, but the point is the same: most "enterprise" vendors will sell you kit that can tolerate nuclear war, but as far as I know, there are very few solutions to protect from administrator error or malice.

      Think about the harm someone could do to a typical business with nothing other than an Active Directory "Domain Admin" account! Given something like that, I can think of a whole bunch of ways to harm an environment in such a way that even the availability of backup

      • most "enterprise" vendors will sell you kit that can tolerate nuclear war, but as far as I know, there are very few solutions to protect from administrator error or malice

        Not true. If the nuclear war kills the administrator you're safe.

  • I think part of the reason for their heavy focus on reliability is that they are competing with the mature television industry and thus have a lot of concern for finicky customers that are considering cutting their cable/satellite plan.
    • The biggest issue isn't Netflix. It's crappy routers on the other end that can't recover from a mild outage. I came home the other day to find my router blinking like a madman on stimulants but no service. Unplugged, replugged and it all worked fine. Seems like it should have been able to diagnose itself and restart to achieve the same result.

    • by DrXym (126579)
      I think more likely it's because they're following the AOL model. They have a high percentage of non technical users (and morons) and therefore the service should be ultra simple and ultra reliable. They most likely fear the cost of support calls and customer churn caused by a service that "confuses" customers.

      On the flip side it makes their service maddeningly retarded at times especially in families where adult and kid viewing habits are munged into one unholy meaningless mess and there is no easy way t

      • They have a high percentage of non technical users ... therefore the service should be ... ultra reliable.

        Technically sophisticated users shouldn't have reliable service?

        • by DrXym (126579)

          Technically sophisticated users shouldn't have reliable service?

          The point I was making is that AOL achieved reliability by dumbing the UI down to what the lowest common denominator was capable of. Not because that represented the optimum user experience but because they dreaded customers choking up their call centres by "confusing" them with features.

          • The point I was making is that AOL achieved reliability by dumbing the UI down to what the lowest common denominator was capable of. Not because that represented the optimum user experience but because they dreaded customers choking up their call centres by "confusing" them with features.

            But that's not what Netflix is doing. They're trying to ensure reliable delivery of content. That's something that should just work, and not require endless tweaking by a technically sophisticated end user.

          • by Anonymous Coward

            You know what dozens of config options are caused by? Lazy fucking programmers.

            You call clean and simple UIs "dumbed down" - I call them programmers doing their fucking job. It's part of a programmer's job to reduce the complexity of a problem for the user - not just pass that complexity on in a different form.

  • "Chaos Monkey" sounds like it ought to be the name of the next iteration of Firefox's Javascript subsystem.

    Hang on.... "Chaos Monkey is a piece of software that deliberately takes out random parts of your live production system".... hmmmm.... maybe it *is* the Firefox Javascript subsystem?

  • I originally read that as the "Syrian Army".

  • netflix gets all this great PR for this approach - and at least in theory it's a good one - but as a customer of netflix's, the results i've experienced are actually pretty poor.

    think about it, they go around shooting nodes in the head during business hours. In the long run, that's great, they can be prepared for anything, but it's still madness.

    Oh and separation of services? Great. But who the hell wants to browse the netflix directory when the streaming service is down? Not me, for one.

  • Maybe they do this with their PC client? They surely don't seem to care about the robustness of their Android client. I think they must develop and test that monster on the latest, most powerful hardware that a corporation can buy. Then they fill it full of graphics and video until it almost breaks thus ensuring that it runs like crap on anything less. I would drop Netflix like a ton of bricks except they have licensed most of the content that I would actually want to watch while Hulu, the only competit

  • The article appears to be a slightly pretentious way of saying that Netflix does reliability testing on its live systems. They can get away with this only because it is not critically important for Netflix to be highly robust: the downside of failure is merely a degree of temporary irritation. Don't try this in the financial markets or life-support systems.
