
Netflix Gives Data Center Tools To Fail

Nerval's Lobster writes "Netflix has released Hystrix, a library designed for managing interactions between distributed systems, complete with 'fallback' options for when those systems inevitably fail. The code for Hystrix, which Netflix tested on its own systems, can be downloaded at GitHub, with documentation available here, in addition to a getting-started guide and operations examples, among others. Hystrix evolved out of Netflix's need to manage an increasing rate of calls to its APIs and, according to the company, a 'dramatic improvement in uptime and resilience has been achieved through its use.' The Netflix API receives more than 1 billion incoming calls per day, which translates into several billion outgoing calls (averaging a ratio of 1:6) to dozens of underlying systems, with peaks of over 100,000 dependency requests per second. That's according to Netflix engineer Ben Christensen, who described the incredible loads on the company's infrastructure in a February blog posting. The vast majority of those calls serve the discovery user interfaces (UIs) of the more than 800 different devices supported by Netflix."
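
For readers unfamiliar with the pattern the summary describes, here is a minimal sketch of a circuit breaker with a fallback. It is written in Python for brevity and is not Hystrix's API: the `CircuitBreaker` class, its parameters, and the example function names are all made up for illustration. The idea is that after a few consecutive failures the caller stops hitting the dependency for a cool-down period and serves a degraded answer instead.

```python
import time


class CircuitBreaker:
    """Toy circuit breaker (illustrative only, not the Hystrix API).

    After `failure_threshold` consecutive failures the circuit opens and
    calls go straight to the fallback until `reset_after_seconds` pass.
    """

    def __init__(self, failure_threshold=3, reset_after_seconds=5.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_seconds = reset_after_seconds
        self.clock = clock  # injectable so the cool-down can be tested
        self.consecutive_failures = 0
        self.opened_at = None  # None means the circuit is closed

    def is_open(self):
        if self.opened_at is None:
            return False
        # Once the cool-down has elapsed, a trial call is allowed through.
        return self.clock() - self.opened_at < self.reset_after_seconds

    def call(self, primary, fallback):
        if self.is_open():
            return fallback()  # short-circuit: do not touch the dependency
        try:
            result = primary()
        except Exception:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.failure_threshold:
                self.opened_at = self.clock()
            return fallback()
        # A success closes the circuit and resets the failure count.
        self.consecutive_failures = 0
        self.opened_at = None
        return result


def flaky_recommendations():
    # Hypothetical stand-in for a remote dependency that is currently down.
    raise TimeoutError("dependency timed out")


def default_recommendations():
    # Hypothetical degraded response served while the dependency is down.
    return ["popular titles"]
```

Calling `CircuitBreaker(failure_threshold=2).call(flaky_recommendations, default_recommendations)` repeatedly returns the default list every time, and from the third call on the failing function is no longer invoked until the cool-down expires.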

  • Thank you Netflix! (Score:5, Interesting)

    by siDDis ( 961791 ) on Tuesday November 27, 2012 @12:16PM (#42106709)

    Not only have you created an amazing tool, it is open source and the best part...it's actually well documented! Christmas came early this year!

    • Yeah, dumb question, but why would they choose to share it? Are they building on GPL code?
      • by Anonymous Coward on Tuesday November 27, 2012 @01:31PM (#42107443)

        (Netflix employee here, so forgive the AC)

        We don't use GPL code (and, assuming we were using GPLv2 code, given that we don't ship out server code, we wouldn't need to share it anyway), but:

        1. Netflix uses a ton of open-source technology. It's nice to be polite and give back;
        2. It's good publicity, which helps when we recruit people (which is something we do all the time);
        3. If it's good, then we'll have other people contribute to the software engineering efforts, which lowers the cost we pay to maintain and improve the software;

    • It's great they're planning for the 3% of possible downtime. Shows how great Netflix really is. When you own 27% of internet traffic, I guess you need to be prepared.
  • I hate to say it, but the only thing I take away from this is that Netflix's software is such an unwieldy mess that they need a library just to enforce application separation and provide default fall-backs when a service call does fail.

    FWIW, my preferred "circuit breaker" is a load balancer... All possible requests are network calls that go through load balancers, where it goes to the most responsive server, and if your admins screw up and none of the servers are responding quickly enough to answer the hea

    • by curunir ( 98273 ) * on Tuesday November 27, 2012 @12:48PM (#42106983) Homepage Journal

      Your critique seems overly simplistic. An HTTP load balancer is great for HTTP calls, but not everything in a complex infrastructure is HTTP. There's queues, data stores, caches, RPC, FileSystem access (SAN, NAS or local) and more that shouldn't run behind an HTTP interface. This tool helps solve the problem and gives you health check monitoring and metrics in the process. On initial inspection, my only complaint is that it requires too much modification of application code, however it seems like it should be pretty simple to integrate with the various IoC frameworks to use AOP proxies to apply the tool declaratively based on annotations.
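
The declarative idea in the comment above (apply the fallback behaviour through an annotation instead of editing each call site) can be sketched with a Python decorator. This is only an analogy to the AOP/annotation approach, not something Hystrix provides; `with_fallback`, `fetch_profile`, and `cached_profile` are hypothetical names.

```python
import functools


def with_fallback(fallback):
    """Decorator: if the wrapped call raises, return fallback(*args, **kwargs)."""
    def decorate(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            try:
                return func(*args, **kwargs)
            except Exception:
                # Any failure of the primary call is answered by the fallback.
                return fallback(*args, **kwargs)
        return wrapper
    return decorate


def cached_profile(user_id):
    # Hypothetical degraded response used when the real lookup fails.
    return {"id": user_id, "name": "unknown"}


@with_fallback(cached_profile)
def fetch_profile(user_id):
    # Stand-in for a remote call; it always fails in this sketch.
    raise ConnectionError("profile service unavailable")
```

Here the application code in `fetch_profile` stays untouched apart from the one-line annotation, which is roughly the trade-off the commenter is asking for.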

      And you do realize that you followed up a weak critique of a backend scalability tool with a critique about a failing of their front-end application, right? What relevance does that have?

      • An HTTP load balancer is great for HTTP calls, but not everything in a complex infrastructure is HTTP

        A common load balancer isn't restricted to HTTP. Any TCP connection can be load balanced quite well.

        And you do realize that you followed up a weak critique of a backend scalability tool with a critique about a failing of their front-end application, right? What relevance does that have?

        No, I followed-up by saying the old API was superior, while less taxing on the back-end... A common issue with "Web-2.0" i

    • Load balancing & Fail over serve two entirely different purposes. What happens when your load balancer has nowhere to balance to (we're talking about streaming HD media after all), or something in the cluster fails? Not a whole lot of places have instant fail-over, so it's going to be interesting to see who takes advantage of this.
      • All I get from your comment is the fact that you've NEVER used a load balancer before, and don't really know what it is and does.

    • Load balancing is a mess. The vast majority of load balancing algorithms on VIPs are round robin-based. Fine for http requests from a single UI and that's about it.

      This application allows balancing of the services - not the servers.

      • The vast majority of load balancing algorithms on VIPs are round robin-based.

        A fair point... Stay away from the cheaper units, and make sure you use something that uses latency to make proper decisions.

        This application allows balancing of the services - not the servers.

        Services these days are damn near all TCP/IP based... Your front-end is making a network request to a Tomcat backend, which is running an app that makes JSON requests to some other service, which is pulling some data from some database, wh

    • by ahem ( 174666 )

      Not to pick too nittily, but your assertion that a load balancer should just "go[es] to the most responsive server" is kind of simplistic. When I was working there, we had a failure mode where the most responsive server was one that had tripped over a subtle bug causing all subsequent requests to that instance of the service to almost immediately respond with an http 200 and empty content (it was a bug, after all, and it was compounded by this failure mode of returning a 200). Because we used weighted round

      • we had a failure mode where the most responsive server was one that had tripped over a subtle bug causing all subsequent requests to that instance of the service to almost immediately respond with an http 200 and empty content

        Load balancers can be configured to verify the checksum of the content returned, and not just assume the return code is accurate.
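
A minimal sketch of that kind of content check, assuming the balancer (or a sidecar script) fetches a known probe resource and already has the status code and body in hand. The function name, the probe payload, and the use of SHA-256 are assumptions for illustration, not any particular load balancer's configuration.

```python
import hashlib


def is_healthy(status_code, body, expected_sha256):
    """Treat a backend as healthy only if the status is 200 AND the body of a
    known probe resource hashes to the expected value."""
    if status_code != 200:
        return False
    return hashlib.sha256(body).hexdigest() == expected_sha256


# Hypothetical probe payload a healthy instance is expected to return.
PROBE_BODY = b'{"status": "ok"}'
EXPECTED = hashlib.sha256(PROBE_BODY).hexdigest()
```

With this check, an instance that answers `200` with empty content (the failure mode described above) is marked unhealthy, because `is_healthy(200, b"", EXPECTED)` is false.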

        I'd rather have 98% of the developers focused on application logic and a specialized 2% of the developers providing productivity improvements to the core p

        • by ahem ( 174666 )

          I'd rather have 98% of the developers focused on application logic and a specialized 2% of the developers providing productivity improvements to the core platform.

          The whole idea of writing crap code, and "optimizing" it later, whether with automated tools or by handing it off to others, works very, very poorly in practice. Putting a little effort in, at the start, to architect services properly, and keeping an eye on the design through the coding process, pays off in spades later on.

          I never suggested that it was a good idea to write crap code. I suggested that it's a good idea to have some developers focused on things that all developers need to be taking care of (e.g. a platform that supports universal tasks). In that way, you raise everyone's efficiency with a single core effort and the vast majority of the team can focus on implementing features that move the business forward.

  • by interkin3tic ( 1469267 ) on Tuesday November 27, 2012 @12:30PM (#42106829)
    But they can't possibly manage to bring it to Linux.
    • But they can't possibly manage to bring it to Linux.

      Probably has something to do with the Silverlight deal with Microsoft.

      • Srsly, does this thing use silverlight?
      • by msk ( 6205 )

        I read that as a lack of will on the part of Netflix, who is likely Microsoft's single largest user of Silverlight.

        They have the leverage to push Microsoft on this.

        But don't.

        • by msk ( 6205 )

          They also haven't had a version of the Android player that's worked well on the LG Optimus V since v1.2 and that version won't work any more.

      • by bill_mcgonigle ( 4333 ) * on Tuesday November 27, 2012 @01:39PM (#42107525) Homepage Journal

        Probably has something to do with the Silverlight deal with Microsoft.

        Close, but 'confusing cause and effect'.

        Silverlight was a facet of the DRM deal that Netflix made with the Studios. So is not releasing a Linux client (because then, y'know, there would be Netflix rippers and movies on bittorrent...).

        Amazon plays movies on Flash on Linux, so Netflix made a bad deal (or perhaps Amazon benefited from not being 'first', same as when Apple pioneered online music with iTunes and got AES AAC while Amazon later had plain MP3). There's also a libnetflixplayer.so ELF-32 on Chromebook, so there's no technical obstacle.

        Presumably those contracts have a renewal period. Accept that there's no technical problem and focus on the legal (government) problems instead.

        • (or perhaps Amazon benefited from not being 'first', same as when Apple pioneered online music with iTunes and got AES AAC while Amazon later had plain MP3)

          Based on lots of reviews, AAC sounds better at the same bitrate. How is that being worse?

      • I'd say it actually has more to do with their CEO being on the board of MS until recently.
  • by concealment ( 2447304 ) on Tuesday November 27, 2012 @12:31PM (#42106839) Homepage Journal

    One of the best changes in "design philosophy" that has happened in the past 20 years is that instead of the idea of any product as a fortress that cannot fail, products are designed to expect their components to fail, and to recover gracefully from it.

    This leads to a more flexible and resilient product. It reminds me of the military approach, where every system has at least two backups or alternates.

    • by medcalf ( 68293 )
      Except for the Internet of course. The Internet was designed with resiliency in mind, but the practical implementation of it has ended up being highly prone to failure, mostly because effectively all routing is over a limited number of backbone networks, sharing a small number of interconnects. It doesn't help practical resiliency to be able to take any route from A to B, when only one possible route is provided.
    • by mcgrew ( 92797 ) *

      It reminds me of the military approach, where every system has at least two backups or alternates.

      Military gear is FAR more robust than its civilian counterparts. I don't know how many winter coats I've bought that have worn out in the last forty years. I finally quit buying coats and now just wear the USAF field jacket I was issued forty years ago, which is still in good shape. I was always lucky to get three years out of a civilian coat (I'm rough on clothing).

    • One of the best changes in "design philosophy" that has happened in the past 20 years is that instead of the idea of any product as a fortress that cannot fail, products are designed to expect their components to fail, and to recover gracefully from it.

      This leads to a more flexible and resilient product. It reminds me of the military approach, where every system has at least two backups or alternates.

      I think that comes more from the fact that whereas in ancient times, re-IPLing an IBM mainframe was horribly expensive and something to be prevented wherever possible, doing the 3-finger salute on a Windows computer was infinitely cheaper than paying someone to write reliable software. Especially since Windows wasn't.

      As to the "recover gracefully" part....

  • I read that Hysterix:

    One becomes Hysterixical when their data center components fail.

  • I can't be the only one having trouble parsing the title of this article "Netflix Gives Data Center Tools To Fail". What does it mean to "give something to fail?" I thought "fail" was a verb and doesn't make sense as the target of the verb "give". I've heard of the phrase "given to failure", but that doesn't seem what's being implied here.

    • "Your education gives you tools to win. An unwillingness to further self-educate yourself gives you unnecessary roadblocks to fail." I admit the second sentence sounds more natural if it used "failure" instead of "fail". Netflix was just trying to be too cutesy and took poetic license, but I agree, it is still confusing and clunky.
    • Nike gives Olympian shoes to run.

      It's not great, but I was smart enough to parse it fairly easily.

      • by bws111 ( 1216812 )

        Still doesn't make sense. In your example, running is what the Olympian wants to do, and Nike is giving him the shoes to do it. This stupid headline makes it sound like failing is the goal, and Netflix is helping them accomplish failure.

        • I thought the same as you on reading the headline. I thought perhaps Netflix's code had caused data centers to fail.

          Often in publications, this kind of word misuse is caught and corrected by someone called an editor.
      • OK, that gets me far enough to understand that the entity Netflix is performing the action of giving to the target of a data center, and the object it's giving is tools. I'm still stumped at the "To Fail". Like the sibling reply says, this makes it sound like Netflix wants the data center to use its tools to accomplish failure. Should the title say "To Handle Failure" instead of "To Fail"?

      • by Qzukk ( 229616 )

        Nike gives Olympian shoes to trip.

        Broke that for you. Now it's more like the title of this story and you may be able to see where everyone else's complaint is coming from.

        • Perhaps you don't understand the headline because you don't understand the topic being discussed. Netflix has given data centers tools to fail (and recover efficiently). If Nike had made some shoes that made it easier to recover from tripping, then "Nike gives Olympian shoes to trip." would be like the title of this story and you would understand how appropriate it really is. If it helps you, take some of the irony out of the titles by adding the word 'gracefully' or 'efficiently' to the end of each.

    • Fail is a verb. Give is another verb. Give is not being 'targeted'. Netflix is giving data centers tools to fail. "Fail" describes what the data centers are doing with the tools they are being given. Fail is being used in the infinitive here.

      From the wiki page on Infinitives, as an example: "The letter says I'm to wait outside". You understand that in that sentence, "says" isn't being a 'target' of "wait", right? Why would you think otherwise in the summary? I mean, the only difference
      • by bws111 ( 1216812 )

        So what you are saying is that in the absence of these tools the data centers would NOT fail? That is just stupid. With the tools, the data centers are RECOVERING from failure, or AVOIDING failure, or some such. Take out those important words, and you convey the exact opposite meaning from that which was intended. And that is pretty much the definition of a really crappy headline.

        • In response to your question, I don't think an if and only if relationship was proposed, but I'm not the writer.

          In response to the rest of your statement, well, I never said any part of it WAS written well, only that I understood what the writer was trying to say. :-P
      • At first I thought "Data Center" was being used as an adjective, which was part of my problem. I thought they weren't just regular tools, I thought they were Data Center tools being given to something else. That's why I found it so confusing (to answer the question, "why would you think otherwise?").

  • ... that goes down every time someone breaks wind in an AWS datacenter, right?
