The Simian Army and the Antifragile Organization
CowboyRobot writes "ACM has an article about how Netflix conducts its resilience testing. Instead of the GameDays used by sites such as Amazon and Google, Netflix uses what they call The Simian Army, based on the philosophy that 'Resilience can be improved by increasing the frequency and variety of failure and evolving the system to deal better with each new-found failure, thereby increasing anti-fragility.' While GameDay exercises are like a fire-drill, with scheduled exercises where failure is manually introduced or simulated, the Simian Army relies on failure in the live environment induced by autonomous agents known as 'monkeys.' Chaos Monkey randomly terminates virtual instances in a production environment that are serving live customer traffic. Chaos Gorilla causes an entire Amazon Availability Zone to fail. And Chaos Kong will take down an entire region of zones. 'What doesn't kill you makes you stronger' and Netflix hopes that by constantly protecting itself from internal onslaught, they will become increasingly 'anti-fragile — growing stronger from each successive stressor, disturbance, and failure.'"
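The three levels of induced failure described in the summary (single instance, whole availability zone, whole region) can be illustrated with a toy simulation. This is only a minimal sketch under stated assumptions, not Netflix's actual tooling: the `Instance` model, the `build_fleet` helper, the region names, and the "service is up while any instance is alive" rule are all hypothetical, and no real cloud API is called.

```python
import random
from dataclasses import dataclass


@dataclass
class Instance:
    """One simulated virtual instance (hypothetical model, not a cloud API)."""
    name: str
    region: str
    zone: str
    alive: bool = True


def build_fleet(regions, zones_per_region, instances_per_zone):
    """Create a toy fleet spread over regions and availability zones."""
    fleet = []
    for region in regions:
        for z in range(zones_per_region):
            zone = f"{region}-{chr(ord('a') + z)}"
            for i in range(instances_per_zone):
                fleet.append(Instance(f"{zone}-i{i}", region, zone))
    return fleet


def chaos_monkey(fleet, rng):
    """Terminate one randomly chosen live instance; return it (or None)."""
    live = [inst for inst in fleet if inst.alive]
    if not live:
        return None
    victim = rng.choice(live)
    victim.alive = False
    return victim


def chaos_gorilla(fleet, zone):
    """Fail every instance in one availability zone."""
    for inst in fleet:
        if inst.zone == zone:
            inst.alive = False


def chaos_kong(fleet, region):
    """Fail every instance in one region (all of its zones)."""
    for inst in fleet:
        if inst.region == region:
            inst.alive = False


def service_available(fleet):
    """Assumed rule: the toy service is up while any instance is alive."""
    return any(inst.alive for inst in fleet)


if __name__ == "__main__":
    rng = random.Random(42)
    # Region names are made up for the example.
    fleet = build_fleet(["us-east", "us-west"], 2, 3)
    print("terminated:", chaos_monkey(fleet, rng).name)
    chaos_gorilla(fleet, "us-east-a")
    chaos_kong(fleet, "us-west")
    print("still serving:", service_available(fleet))
```

In this sketch the fleet survives the loss of a random instance, one zone, and one region because a second zone in the remaining region still has live instances; a real system would also have to shift traffic and state, which the toy model ignores.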
Antifragility (Score:2, Informative)
This is the example that I use when explaining antifragility to my colleagues. I highly recommend Nassim Nicholas Taleb's book, "Antifragile" - at least chapters 1, 2, and 7.
no wonder nobody takes Netflix seriously (Score:4, Funny)
No wonder nobody takes Netflix seriously. What kind of tech company worries about things like reliability and robustness? That's soooo 20th century. Everyone knows that if you have more than 90% availability or too low of a bug rate it means you're not agile enough and you can't be one of those amazingly innovative social networking outfits.
Re: (Score:1)
Re: (Score:2)
What other choice did they have? Flash? Even HTML5 isn't really ready until more browsers implement the security features required.
Re: (Score:2)
What other choice did they have? Flash?
Yes, believe it or not, Flash is kick ass when it comes to streaming video.
Far superior to your fabled HTML5 especially in regards to streaming latency.
Want to stream live video from the iPhone? Flash is the ONLY way to do it.
IIRC, they chose Silverlight over Flash because of Microsoft's DRM stack and I'm sure MS gave them some sweetheart deals like they did with MLB.
Re: (Score:2)
I don't think it was only DRM. At the time they chose Silverlight (five years ago), Flash's video streaming support wasn't nearly as robust; it didn't support seamless bitrate changes at the time, for example.
I haven't kept up on things, so it's possible that Flash's video streaming support is as robust as Silverlight today, but it wasn't back then.
Re: (Score:2)
Not cross platform at all until some browser for Linux actually implements their extensions via EME. Chrome might, but probably just for Google's own DRM (Widevine). Netflix currently uses MS PlayReady, good luck getting that in a Linux browser...
Re: (Score:3)
Hint: You can play Netflix movies on Chromebooks, using HTML5. Think that uses MS PlayReady?
Re: (Score:1)
If you have access to root mode, couldn't you add a small patch called "I don't have root chrome, you did that yourself, no problems here"
Trusting the client for security is rearranging deck chairs on the Titanic.
Re: (Score:2)
It's definitely time for DVD Jon [wikipedia.org] to make a comeback.
-l
Re: (Score:2)
Re: (Score:2)
Re: (Score:1)
Netflix itself runs on Likely Linux and a number of open-source projects for ultimate reliability,
How do they get access to the Chaos Gorilla when they're not running Microsoft products? Do the chairs throw themselves?
Re: (Score:2)
As it is, you can watch Netflix on Linux by using Wine to run the Windows Firefox and Silverlight plugin.
Re: (Score:2)
Assuming the DRM features remain in HTML5, there will be no need for a client, you'll just be able to use your browser.
Re: (Score:2)
Re: (Score:2)
That's why they wrote apps to lower availability. The anti-fragility thing is a cover invented since the last time the story hit slashdot.
Their availability is in the higher range of reasonable, as a result of making the simians more powerful. Obviously they work hard at staying within the agile metrics, no matter how much time and money it takes.
Re: (Score:1)
Totally. I suggest Netflix be fixed via the liberal use of Javascript, both on the Server and the Client.
Re: (Score:2)
No wonder nobody takes Netflix seriously.
They paid the programmers peanuts, and got monkeys.
Re: (Score:2)
No wonder nobody takes Netflix seriously.
Its impact on the market says otherwise.
Re: (Score:2)
No wonder nobody takes Netflix seriously.
Its impact on the market says otherwise.
Many Slashdotters are impervious to irony.
The big spec? (Score:2)
But how do you write the big spec for that? The PMI would never approve.
sPh
Reliability vs Resilience (Score:1)
So this is the ability to use whatever resources are available for graceful failover, allowing masses of cheap, consumer-grade equipment to be used instead of small amounts of expensive, reliable, enterprise gear.
Sounds like a winning strategy.
Failover vs 0 downtime vs no broken connection. (Score:2)
I wonder how this behaves in the eye of the customer.
From cluster solutions I know that some of the people maintaining them mistake a redundant system for one with zero downtime.
The problem is that if you take down a server, all connections to it go down. Some applications gracefully switch to another server. Some applications, however, first have to time out. Some applications crash.
The question is, do those interruptions get reported correctly, or do people just blame the app and restart their PC?
Very few of those
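The client behaviours described in this comment (switching gracefully to another server versus waiting for a timeout first) and the question of whether the interruptions get reported can be sketched as follows. This is a rough illustration under assumptions, not how any particular client works: the backends are plain Python functions standing in for servers, and the exception classes and `call_with_failover` are hypothetical names invented for the example.

```python
class ServerDown(Exception):
    """Simulated backend that refuses the request immediately."""


class ServerTimeout(Exception):
    """Simulated backend that only fails after the client's deadline."""


class AllBackendsFailed(Exception):
    """Raised when no backend could serve the request."""

    def __init__(self, failures):
        super().__init__(f"all backends failed: {failures}")
        self.failures = failures


def call_with_failover(backends, request):
    """Try each (name, handler) pair in order.

    Returns (response, failures) so the caller can report every
    interruption it hit on the way, even when the request succeeds.
    """
    failures = []
    for name, handler in backends:
        try:
            response = handler(request)
        except (ServerDown, ServerTimeout) as exc:
            # Record the interruption instead of silently swallowing it.
            failures.append((name, type(exc).__name__))
            continue
        return response, failures
    raise AllBackendsFailed(failures)


def dead_backend(request):
    raise ServerDown("connection refused")


def hung_backend(request):
    raise ServerTimeout("no reply before deadline")


def healthy_backend(request):
    return f"ok: {request}"


if __name__ == "__main__":
    backends = [
        ("a", dead_backend),
        ("b", hung_backend),
        ("c", healthy_backend),
    ]
    response, failures = call_with_failover(backends, "play")
    print(response)
    print("interruptions seen:", failures)
```

Returning the failure list alongside the response is a deliberate choice in this sketch: the user sees a working service, while the operator can still count how many requests needed a failover.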
Antifragile (Score:2)
It's hard to explain, but antifragility is best built on layers of fragility: cells in a body are fragile, but the body itself gets stronger when stressed (lifting weights, etc.). The Netflix example is good; it's a bit like randomly pulling parts off a plane in flight and then, after the crash, making the next planes stronger. It also leads to antifragility, but it's a strong stressor.
Forgotten chaos (Score:2)
Misleading (Score:2)
I was looking forward to hearing about this army full of primates.
Re: (Score:2)
I was looking forward to hearing about this army full of primates.
... Or at least an army of twelve monkeys.
Actually, my first thought was the final battle of Planet of the Apes (2001), when all the apes are running towards the grounded ship.
Re: (Score:2)
That is only because you saw it on Netflix the other night...
Re: (Score:1)
Re: (Score:2)
Not even a new name, monkey testing [wikipedia.org] has been around for a long time...
Mongolian Horde (Score:4, Funny)
The problem with this, is that it's still programmed failure. In my experience, hardware or software faults, or combinations of both, are not nearly as effective as plain old human stupidity. Oh, and government action. There is no disaster recovery plan for "Here's a warrant. Give us all your shit." There is a similar lack of recovery options for human stupidity. And let's be honest: It's more abundant in the universe than hydrogen, and infinitely harder to defend against, precisely because stupidity is far more cunning and unpredictable than intelligence could ever hope to be.
Re: (Score:2)
There is a similar lack of recovery options for human stupidity. And let's be honest: It's more abundant in the universe than hydrogen, and infinitely harder to defend against, precisely because stupidity is far more cunning and unpredictable than intelligence could ever hope to be.
The last I knew, the stuff that's more abundant than hydrogen was called "dark matter/energy". You mean they lately discovered those are actually "stupidity in action"?
Re: (Score:2)
The problem with this, is that it's still programmed failure. In my experience, hardware or software faults, or combinations of both, are not nearly as effective as plain old human stupidity.
But that's largely irrelevant to their testing methodology. They don't just simulate hardware, software, or human faults, they simulate loss of services at various levels of granularity. Doesn't matter whether a server died, someone misconfigured a router, a construction backhoe plowed a fiber cable, a Starz Network-funded hit squad took out their data center, or an earthquake struck the West Coast - it simulates an outage in their network that they want to recover from.
And let's be honest: It's more abundant in the universe than hydrogen, and infinitely harder to defend against
Ok, this line is just plain ridiculo
Re: (Score:2)
I wrote about this before in an unrelated post, but the point is the same: most "enterprise" vendors will sell you kit that can tolerate nuclear war, but as far as I know, there are very few solutions to protect from administrator error or malice.
Think about the harm someone could do to a typical business with nothing other than an Active Directory "Domain Admin" account! Given something like that, I can think of a whole bunch of ways to harm an environment in such a way that even the availability of backup
Re: (Score:3)
most "enterprise" vendors will sell you kit that can tolerate nuclear war, but as far as I know, there are very few solutions to protect from administrator error or malice
Not true. If the nuclear war kills the administrator you're safe.
Reliability (Score:1)
Re: (Score:2)
The biggest issue isn't Netflix. It's crappy routers on the other end that can't recover from a mild outage. I came home the other day to find my router blinking like a madman on stimulants but no service. Unplugged, replugged and it all worked fine. Seems like it should have been able to diagnose itself and restart to achieve the same result.
Re: (Score:2)
Re: (Score:2)
On the flip side it makes their service maddeningly retarded at times especially in families where adult and kid viewing habits are munged into one unholy meaningless mess and there is no easy way t
Re: (Score:2)
They have a high percentage of non technical users ... therefore the service should be ... ultra reliable.
Technically sophisticated users shouldn't have reliable service?
Re: (Score:2)
Technically sophisticated users shouldn't have reliable service?
The point I was making is that AOL achieved reliability by dumbing the UI down to what the lowest common denominator was capable of. Not because that represented the optimum user experience but because they dreaded customers choking up their call centres by "confusing" them with features.
Re: (Score:2)
The point I was making is that AOL achieved reliability by dumbing the UI down to what the lowest common denominator was capable of. Not because that represented the optimum user experience but because they dreaded customers choking up their call centres by "confusing" them with features.
But that's not what Netflix is doing. They're trying to ensure reliable delivery of content. That's something that should just work, and not require endless tweaking by a technically sophisticated end user.
Re: (Score:1)
You know what dozens of config options are caused by? Lazy fucking programmers.
You call clean and simple UIs "dumbed down" - I call them programmers doing their fucking job. It's part of a programmer's job to reduce the complexity of a problem for the user - not just pass that complexity on in a different form.
"Chaos Monkey"? (Score:2)
"Chaos Monkey" sounds like it ought to be the name of the next iteration of Firefox's Javascript subsystem.
Hang on.... "Chaos Monkey is a piece of software that deliberately takes out random parts of your live production system".... hmmmm.... maybe it *is* the Firefox Javascript subsystem?
You and what army? (Score:2)
I originally read that as the "Syrian Army".
grr (Score:2)
netflix gets all this great PR for this approach - and at least in theory it's a good one - but as a customer of netflix's, the results i've experienced are actually pretty poor.
think about it, they go around shooting nodes in the head during business hours. In the long run, that's great, they can be prepared for anything, but it's still madness.
Oh and separation of services? Great. But who the hell wants to browse the netflix directory when the streaming service is down? Not me, for one.
uh huh (Score:2)
Maybe they do this with their PC client? They surely don't seem to care about the robustness of their Android client. I think they must develop and test that monster on the latest, most powerful hardware that a corporation can buy. Then they fill it full of graphics and video until it almost breaks thus ensuring that it runs like crap on anything less. I would drop Netflix like a ton of bricks except they have licensed most of the content that I would actually want to watch while Hulu, the only competit
Won't Work where it Matters Most (Score:2)
The article appears to be a slightly pretentious way of saying that Netflix does reliability testing on its live systems. They can get away with this only because it is not critically important for Netflix to be highly robust: the downside of failure is merely a degree of temporary irritation. Don't try this in the financial markets or life-support systems.