New Framework For Programming Unreliable Chips

Posted by samzenpus
from the this-is-how-you-do-it dept.
rtoz writes "To handle future unreliable chips, a research group at MIT's Computer Science and Artificial Intelligence Laboratory has developed a new programming framework that enables software developers to specify when errors may be tolerable. The system then calculates the probability that the software will perform as it's intended. As transistors get smaller, they also become less reliable. This unreliability won't be a major issue in some cases. For example, if few pixels in each frame of a high-definition video are improperly decoded, viewers probably won't notice, but relaxing the requirement of perfect decoding could yield gains in speed or energy efficiency."
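
To make the "calculates the probability" step concrete, here is a minimal Python sketch. It assumes every operation marked as error-tolerant succeeds independently with the same fixed probability; the function name and the numbers are hypothetical and are not taken from the MIT framework itself.

```python
def program_reliability(per_op_reliability: float, unreliable_ops: int) -> float:
    """Probability that every unreliable operation in a run is correct.

    Assumes failures are independent, so the per-operation probabilities
    simply multiply.
    """
    if not 0.0 <= per_op_reliability <= 1.0:
        raise ValueError("per_op_reliability must be in [0, 1]")
    if unreliable_ops < 0:
        raise ValueError("unreliable_ops must be non-negative")
    return per_op_reliability ** unreliable_ops


# Hypothetical numbers: each relaxed operation is right 99.9999% of the time.
print(program_reliability(0.999999, 1_000))      # about 0.999
print(program_reliability(0.999999, 1_000_000))  # about 0.368
```

Under these assumptions, the same per-operation error rate that is negligible over a thousand operations dominates over a million, which is why a developer would want a tool to do this bookkeeping.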

  • godzilla (Score:5, Insightful)

    by Anonymous Coward on Monday November 04, 2013 @10:37AM (#45325135)

    Asking software to correct hardware errors is like asking godzilla to protect tokyo from mega godzilla

    this does not lead to rising property values

    • Re:godzilla (Score:5, Interesting)

      by n6mod (17734) on Monday November 04, 2013 @10:57AM (#45325375) Homepage

      I was hoping someone would mention James Mickens' epic rant. [usenix.org]

      • by kermidge (2221646)

        God, that's one beautiful little piece of writing. Thank you.

        From the posted summary "...if few pixels in each frame of a high-definition video are improperly decoded, viewers probably won't notice" - Now there's a slippery slope if ever I saw one.

    • Chicken and the Egg. (Score:4, Informative)

      by jellomizer (103300) on Monday November 04, 2013 @11:17AM (#45325607)

      We need software to design hardware to make software...

      In short it is about better adjusting your tolerance levels on individual features.
      I want my integer arithmetic to be perfect, and my floating point good to 8 decimal places. But there are components meant for interfacing with humans. With audio, so much is altered or lost due to differences in speaker quality, even top-notch ones with gold (or whatever crazy stuff) cables. So in your digital-to-audio conversion, you may be fine if a voltage is a bit off, or you skipped a random change, as the smoothing mechanism will often hide that little mistake.

      Now for displays... We need to be pixel perfect when we have screens with little movement. But if we are watching a movie, a pixel of color #8F6314 can be #A07310 for 1/60th of a second and we wouldn't notice it. And most displays are not even high enough quality to show these differences.

      We hear of these errors and think, how horrible that we are not getting perfect products... However, it is more about the trade-off of getting smaller and faster with a few more glitches,
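
To put numbers on the two colors mentioned above, here is a small Python sketch that computes the per-channel difference between #8F6314 and #A07310. The helper names are made up for illustration.

```python
from typing import Tuple


def hex_to_rgb(color: str) -> Tuple[int, int, int]:
    """Convert '#RRGGBB' into a tuple of three 0-255 integers."""
    digits = color.lstrip("#")
    return (int(digits[0:2], 16), int(digits[2:4], 16), int(digits[4:6], 16))


def channel_deltas(intended: str, shown: str) -> Tuple[int, int, int]:
    """Per-channel error (shown minus intended) for red, green and blue."""
    a, b = hex_to_rgb(intended), hex_to_rgb(shown)
    return (b[0] - a[0], b[1] - a[1], b[2] - a[2])


print(channel_deltas("#8F6314", "#A07310"))  # (17, 16, -4)
```

The largest error is 17 steps out of 255 on the red channel, shown for a single frame.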

      • by CastrTroy (595695)
        Yeah, but you could save just as much power (I'm guessing) with dedicated hardware decoders, as you could by letting the chips be inaccurate. As chips get smaller it's much more feasible to have hardware-specific chips for just about everything. The ARM chips in phones and tablets have all kinds of specialized hardware, some for decoding video and audio, others for doing encryption and other things that are usually costly for a general purpose processor. Plus it's a lot easier for the developer to not h
        • by Dahamma (304068)

          Yeah, but you could save just as much power (I'm guessing) with dedicated hardware decoders, as you could by letting the chips be inaccurate.

          Eh, a dedicated hardware decoder is still made out of silicon. That's the point, make chips that perform tasks like that (or other things pushing lots of data that is only relevant for a short period, like GPUs - GPUs used only for gfx and not computation, at least) tolerate some error, so that they can use even less power. No one is yet suggesting we make general purpose CPUs in today's architectures unreliable :)

          • by CastrTroy (595695)
            Yeah, but that's not something the application level software developer has to account for. They just use OpenGL, or DirectX, and the chip and video card driver decides how to execute it and render it. Actually, with some graphics cards, and driver implementations, they basically do this already, by rendering the image incorrectly, it speeds up the result, and they hope nobody notices. Basically, if any error is acceptable when programming against certain hardware, it should just be handled at the API lev
            • by Dahamma (304068)

              They just use OpenGL, or DirectX, and the chip and video card driver decides how to execute it and render it.

              *Real* use of OpenGL and DirectX these days is all about the shaders, which get compiled and run on the GPUs. And even basic ops that are "builtin" to the drivers usually are using shader code internal to the driver (or microcode internal to the hardware/firmware).

              The people programming against the hardware shouldn't have to decide how much, if any, error is acceptable.

              Absolutely they should, and have been doing so with existing 3D hardware for a long time. It's just been more about 3D rendering shortcuts/heuristics/etc than faulty hardware. It's all about tricking the viewer's eyes and brain to increasing de

    • Asking software to correct hardware errors is like asking godzilla to protect tokyo from mega godzilla

      OTOH, in measurement theory, it's been long known that random errors can be eliminated by post-processing multiple measurements.

      • by vux984 (928602)

        OTOH, in measurement theory, it's been long known that random errors can be eliminated by post-processing multiple measurements.

        Gaining speed and energy efficiency is not usually accomplished by doing something multiple times, and then post processing the results of THAT, when you used to just do it once and got it right.

        You'll have to do the measurements in parallel, and do it a lot faster to have time for the post processing and still come out ahead for performance. And I'm still not sure that buys you any

        • Gaining speed and energy efficiency is not usually accomplished by doing something multiple times, and then post processing the results of THAT, when you used to just do it once and got it right.

          For some kinds of computations, results can be verified in a time much shorter than the time in which they are computed. Often even asymptotically, but that's not even necessary. If you can perform a certain computation twice as fast and with half the energy on a faster but sometimes unreliable circuit/computational node, with the proviso that you need to invest five percent extra time and energy to check the result, you've still won big. (There are even kinds of computation when not even probabilistically
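
A minimal Python sketch of the "compute fast, verify cheaply" pattern described above, using sorting as the example: checking a sort is linear-time while producing one is not. The names `checked_sort` and `flaky_sort` are hypothetical, and the "unreliable" unit is only simulated in software.

```python
from collections import Counter


def is_valid_sort(original, candidate):
    """Linear-time check: candidate is ordered and is a permutation of original."""
    in_order = all(a <= b for a, b in zip(candidate, candidate[1:]))
    return in_order and Counter(original) == Counter(candidate)


def checked_sort(data, fast_sort, reliable_sort=sorted):
    """Try the fast (possibly faulty) sort first; fall back only if the check fails."""
    candidate = fast_sort(data)
    if is_valid_sort(data, candidate):
        return candidate
    return reliable_sort(data)


def flaky_sort(data):
    """Stand-in for an unreliable unit: sorts, then swaps the first two outputs."""
    result = sorted(data)
    if len(result) > 1:
        result[0], result[1] = result[1], result[0]
    return result


print(checked_sort([3, 1, 2], flaky_sort))  # [1, 2, 3], via the fallback path
print(checked_sort([3, 1, 2], sorted))      # [1, 2, 3], accepted on the first try
```

The check itself has to run on reliable hardware for this to mean anything, which is an assumption of the sketch.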

    • by rasmusbr (2186518)

      Nobody is suggesting allowing errors everywhere. Errors will only be allowed where they wouldn't cause massive unexpected effects.

      A simple (self-driving) car analogy here would be that you might allow the lights to flicker a little if that saves power. You might even allow the steering wheel to move very slightly at random in order to save power as long as it never causes the car to spin out of control, but you would never allow even a small chance that the car would select its destination at random.

    • Are you kidding? Properties with a beautiful view on the battlefield between Godzilla and Mega Godzilla would definitely be worth MILLIONS of yen
  • Hmmm ... (Score:5, Insightful)

    by gstoddart (321705) on Monday November 04, 2013 @10:38AM (#45325147) Homepage

    So, expect the quality of computers to go downhill over the next few years, but we'll do our best to fix it in software?

    That sounds like we're putting the quality control on the wrong side of the equation to me.

    • So, expect the quality of computers to go downhill over the next few years, but we'll do our best to fix it in software?

      If you use modern hard drives, you've already accepted high error rates corrected by software.

      • I haven't accepted bad data from the newer hard drives.
      • if you access any server remotely then you're already using this - it's called ECC RAM

    • by Desler (1608317)

      Next few years? More like a few decades or more. Drivers, firmware, microcode, etc. have always contained software workarounds to hardware bugs. This is nothing new.

    • by ZeroPly (881915)
      Relax, pal - frameworks that don't particularly care about accuracy have been around for years now. If you don't believe me, talk to anyone who uses .NET Framework.
      • by fizzer06 (1500649)
        frameworks that don't particularly care about accuracy . . . .NET Framework.

        Okay, I'll bite. Explain yourself.

        • by ZeroPly (881915)
          I'm an application deployment guy, not a programmer. Every time we push something that needs .NET Framework, the end users complain about it being hideously slow. Our MS developers of course want everyone to have a Core i7 machine with 64GB RAM and SSD hard drive - to which I reply "learn how to write some fucking code without seven layers of frameworks and abstraction layers".

          Then of course, I can never get a straight answer from the developers on which .NET to install. Do you want 4, 3.5 SP1, 2? The usual
          • Our MS developers of course want everyone to have a Core i7 machine with 64GB RAM and SSD hard drive

            Do what the company I'm working for has done then.
            Give everyone an i7 with 16GB RAM and an SSD.

            Except they run Windows 7 32bit, so we can only use 4GB of that (and PAE is disabled on Win7 32bit), and the SSD is the D: drive, not the system drive so when everything does page, it slows to a crawl.

    • You don't seem to have read the article. The software is not going to supply extra error correction when the hardware has errors. It's going to allow the programmer to specify code operations that can tolerate more errors, which the compiler can then move to the lower-quality hardware. Some software operations, like audio or video playback, can allow errors and still work OK, which allows you to use lower-energy, lower-quality hardware for those operations. If they did as you suggest, and tried to fix har
    • All in preparation for next big thing after that... MORE accurate hardware! 6.24% more!
  • by Desler (1608317)

    but relaxing the requirement of perfect decoding could yield gains in speed or energy efficiency."

    Which you could already get now simply by not doing error correction. No need for some other programming framework to get this.

    • It's not so much about skipping error correction as it is saying when you can skip error correction. If 5 pixels are decoded improperly, fuck it, just keep going. However, if 500 pixels are decoded improperly, then maybe it's time to fix that.
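
The threshold idea above can be sketched in a few lines of Python. This assumes some cheap way of counting bad pixels exists (here, comparison against known-good values); the function names and the default tolerance of 5 are hypothetical.

```python
def count_bad_pixels(decoded, expected):
    """Number of positions where the decoded value differs from the expected one."""
    return sum(1 for got, want in zip(decoded, expected) if got != want)


def frame_action(decoded, expected, tolerance=5):
    """Keep a slightly wrong frame; only pay for correction past the tolerance."""
    bad = count_bad_pixels(decoded, expected)
    return "redecode" if bad > tolerance else "keep"


expected = [128] * 1000
slightly_off = [0] * 5 + [128] * 995     # 5 bad pixels
badly_off = [0] * 500 + [128] * 500      # 500 bad pixels

print(frame_action(slightly_off, expected))  # keep
print(frame_action(badly_off, expected))     # redecode
```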

      • by Desler (1608317)

        And as I said you can do that already.

        • by MightyYar (622222)

          Really? You can tell your phone/PC/laptop/whatever to run the graphics chip at an unreliably low voltage on demand?

          • by HybridST (894157)

            For PC and laptop, yes I can.

            Overclocking utilities can also underclock, and near the lower stability threshold of the graphics clock I often do see a few pixels out of whack. Not enough to crash, but artifacts definitely appear. A MHz or 2 higher on the clock clears them up, though.

            I have a dumb phone so reclocking it isn't necessary.

            • by MightyYar (622222)

              So you've done this yourself and you still don't see the utility in doing it at the application level rather than the system level?

              • by HybridST (894157)

                Automating the process would be handy, but not revolutionary. Automating it at the system level makes more sense to me, but I'm just a power user.

                • by MightyYar (622222)

                  I don't think it is revolutionary, either... it's just a framework, after all. I was imagining a use where you have some super-low-power device out in the woods sampling temperatures, only firing itself up to "reliable" when it needs to send out data or something. Or a smartphone media app that lets the user choose between high video/audio quality and better battery life. Yeah, they could have already done this with some custom driver or something, but presumably having an existing framework would make it

            • by Xrikcus (207545)

              When you do it that way you have no control over which computations are inaccurate. There's a lot more you can do if you have some input information from higher levels of the system.

              You may be happy that your pixels come out wrong occasionally, but you certainly don't want the memory allocator that controls the data to do the same. The point of this kind of technology (which is becoming common in research at the moment, the MIT link here is a good use of the marketing department) is to be able to control th

  • H.264 relies heavily on the pixels in all previous frames. Incorrectly decoded pixels will be visible on many of the frames that follow. What's worse, they will start moving around and spreading.
    • by Desler (1608317)

      Not always true. There are cases where corrupted macroblocks will only cause artifacts in a single frame and won't necessarily cause further decoding corruption.

    • So what you're saying is that the pixels are alive, and growing! I smell a SyFy movie of the week in the works.

    • by gigaherz (2653757)

      You missed the point. This is a framework for writing code that KNOWS about unreliable bits. The whole idea is that it lets you write algorithms that can tell the compiler where it's acceptable to have a few erroneous bits, and where it isn't. No one said it would apply to EXISTING code...

  • How on earth (Score:5, Insightful)

    by dmatos (232892) on Monday November 04, 2013 @10:48AM (#45325257)

    are they going to make "unreliable transistors" that, upon failure, simply decode a pixel incorrectly, rather than, oh, I don't know, branching the program to an unspecified memory address in the middle of nowhere and borking everything.

    They'd have to completely re-architect whatever chip is doing the calculations. You'd need three classes of "data" - instructions, important data (branch addresses, etc), and unimportant data. Only one of these could be run on unreliable transistors.

    I can't imagine a way of doing that where the overhead takes less time than actually using decent transistors in the first place.

    Oh, wait. It's a software lab that's doing this. Never mind, they're not thinking about the hardware at all.

    • Where are my mod points when I need them?! This is exactly my sentiment as well. Even the simple processing required to check if the data output is correct or within bounds will be staggering compared to simply letting it pass.
    • by gigaherz (2653757)
      This was on Slashdot years ago. I can't find the Slashdot link, but I did find this one [extremetech.com]. The idea is that you design a CPU that concentrates reliability in the more significant bits, while you allow the least significant bits to be wrong more often. The errors will be centered around the right values (and tend to average into them), so if you write code that is aware of that fact, you can teach it to compensate for the wrong values. Of course this is not acceptable for certain kinds of software, but for things l
    • How on earth are they going to make "unreliable transistors" that, upon failure, simply decode a pixel incorrectly, rather than, oh, I don't know, branching the program to an unspecified memory address in the middle of nowhere and borking everything.

      Very easily: the developer specifies that pixel values can tolerate errors but that branch conditions/memory addresses can't. If you'd bothered to read the summary, you'd see it says exactly that:

      a new programming framework that enables software developers to specify when errors may be tolerable.

      They'd have to completely re-architect whatever chip is doing the calculations.

      Erm, that's the whole point. If we allowed high error rates with existing architectures, none of our results would be trustworthy. I imagine the most practical approach would be a fast, low-power but error-prone co-processor living alongside the main, low-error processor. This could be programmed just like GPUs ar

      • Re:How on earth (Score:4, Insightful)

        by bluefoxlucid (723572) on Monday November 04, 2013 @03:06PM (#45328509) Journal

        Erm, that's the whole point. If we allowed high error rates with existing architectures, none of our results would be trustworthy. I imagine the most practical approach would be a fast, low-power but error-prone co-processor living alongside the main, low-error processor.

        Or you know, the thing from 5000 years ago where we used 3 CPUs (we could on-package ALU this shit today) all running at high speeds and looking for 2 that get the same result and accepting that result. It's called MISD architecture.
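
The "run it three times and accept what two agree on" scheme above is commonly called triple modular redundancy. A minimal Python sketch of the voting step follows; the function name is hypothetical, and the three results would come from three separate units in real hardware.

```python
def majority_vote(a, b, c):
    """Return the value that at least two of the three results agree on."""
    if a == b or a == c:
        return a
    if b == c:
        return b
    raise RuntimeError("no two results agree; recompute")


# One of three units glitches; the other two outvote it.
print(majority_vote(42, 42, 17))  # 42
print(majority_vote(17, 42, 42))  # 42
```

The voter is the one piece that cannot be allowed to glitch, so the sketch assumes it runs on reliable logic.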

        • This is the better approach, but I wonder if there is a saving with 3 dodgy processors over 1 good processor. I guess if the yield falls below one third then it might. But power requirements may triple, so it's hard to see the saving.

          • Power requirements actually increase hyperlinearly. DDR RAM uses a serializer, for example, so that you run the RAM at 100MHz but fetch multiple bytes into a buffer and output that across your FSB. This is because running the RAM at 100MHz takes N power, while running at 200MHz takes N^2 power or something ridiculously bigger than 2N.
    • by tlhIngan (30335)

      are they going to make "unreliable transistors" that, upon failure, simply decode a pixel incorrectly, rather than, oh, I don't know, branching the program to an unspecified memory address in the middle of nowhere and borking everything.

      They'd have to completely re-architect whatever chip is doing the calculations. You'd need three classes of "data" - instructions, important data (branch addresses, etc), and unimportant data. Only one of these could be run on unreliable transistors.

      I can't imagine a way of

    • by Darinbob (1142669)

      It seems a bit strange to me also. Didn't read all of the article; but a few pixels wrong is extremely minor and very lucky. One wrong bit is far more likely to crash your computer than to make a pixel be incorrect. What about the CPU? Are we so media obsessed now that getting the pixels wrong is considered a major error but we completely ignore all the serious errors that could result? We'd need redundant transistors to monitor everything, making sure that the CPU registers have the correct values, that

    • by jouassou (1854178)

      I can imagine a couple of applications of these transistors though...

      Many numerical simulations [wikipedia.org] require repeated random sampling of some process, and then combine the results in the end. If you're averaging some billion simulations, the result should be quite robust to fluctuations in the results of each simulation. Thus it might well be worth it to use 10 billion unreliable transistors instead of 1 billion reliable transistors, if they cost the same.

      Another application could be to generate random numbers.
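
A minimal Python sketch of the Monte Carlo point above: estimating pi by random sampling while a small fraction of individual samples is corrupted. The fault model (a sample's hit/miss result is flipped with a fixed probability) and all names are assumptions for illustration; the faults are simulated in software.

```python
import random


def estimate_pi(samples: int, fault_rate: float, seed: int = 0) -> float:
    """Estimate pi from random points in the unit square, with simulated faults."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(samples):
        x, y = rng.random(), rng.random()
        hit = x * x + y * y <= 1.0
        if rng.random() < fault_rate:
            hit = not hit  # simulated hardware fault corrupts this one sample
        inside += hit
    return 4.0 * inside / samples


print(estimate_pi(200_000, fault_rate=0.0))    # close to 3.14
print(estimate_pi(200_000, fault_rate=0.001))  # still close to 3.14
```

Under this fault model a 0.1% corruption rate shifts the estimate by less than the sampling noise itself, which is the robustness the comment is counting on.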

  • by MetaDFF (944558) on Monday November 04, 2013 @11:30AM (#45325745)
    The idea of fault-tolerant computing is similar to the EnerJ programming language being developed at the University of Washington for power savings: The Language of Good Enough Computing [ieee.org]

    The gist of the idea is that the programmer can specify which variables need to be exact and which variables can be approximate. The approximate variables would then be stored in low-refresh RAM, which is more prone to errors, to save power, while the precise variables would be stored in higher-power memory that would be error-free.

    The example they gave was calculating the average shade of grey in a large image of 1000 by 1000 pixels. The running total could be held in an approximate variable since the error incurred by adding one pixel incorrectly out of a million would be small, while the control loop variable would be accurate since you wouldn't want your loop to overflow.
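
The grey-average example above can be simulated in plain Python. This is not EnerJ syntax; it is a sketch that keeps the loop index exact and models the approximate part as an occasional wrongly read pixel value. The fault model, the function name and the fault rate are assumptions.

```python
import random


def average_grey(pixels, fault_rate=1e-6, seed=0):
    """Average 0-255 grey values with a precise loop and an approximate sum."""
    rng = random.Random(seed)
    total = 0                       # "approximate" running total
    for i in range(len(pixels)):    # loop index stays precise
        value = pixels[i]
        if rng.random() < fault_rate:
            value = rng.randrange(256)  # simulated fault: this addend is wrong
        total += value
    return total / len(pixels)


image = [128] * 1_000_000           # a flat grey 1000 x 1000 image
print(average_grey(image))          # very close to 128
```

Each bad addend is off by at most 255, so one faulty pixel in a million moves the average by at most 0.000255 under this model.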
    • The example they gave was calculating the average shade of grey in a large image of 1000 by 1000 pixels. The running total could be held in an approximate variable since the error incurred by adding one pixel incorrectly out of a million would be small...

      What makes them think that the kinds of errors you'd get in a variable in low-refresh-rate RAM would be small? Flip the MSB from 1 to 0 and your total is suddenly divided in half. Or, if it's a floating-point variable, flip one bit in the exponent field and your total changes from 1.23232e4 to 1.23232e-124.
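
The objection above is easy to check in Python by flipping single bits directly. The sketch uses a 32-bit-sized integer total and an IEEE 754 double; the helper names are hypothetical, and the exact size of the damage depends on which bit flips.

```python
import struct


def flip_int_bit(value: int, bit: int) -> int:
    """Flip one bit of a non-negative integer."""
    return value ^ (1 << bit)


def flip_double_bit(value: float, bit: int) -> float:
    """Flip one bit (0 = lowest mantissa bit, 63 = sign) of a 64-bit double."""
    (raw,) = struct.unpack("<Q", struct.pack("<d", value))
    (flipped,) = struct.unpack("<d", struct.pack("<Q", raw ^ (1 << bit)))
    return flipped


total = 3_000_000_000              # fits in 32 unsigned bits, top bit set
print(flip_int_bit(total, 31))     # 852516352

x = 12323.2
print(flip_double_bit(x, 62))      # top exponent bit: shrinks by a factor of 2**1024
print(flip_double_bit(x, 0))       # lowest mantissa bit: changes by about 2e-12
```

So a single flip ranges from invisible to catastrophic depending on its position, which is why the position of the unreliable bits matters as much as the error rate.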

  • by Spaham (634471)

    Am I the only French person who thinks that the "Computer Science and Artificial Intelligence Laboratory" sounds like this in French:
    CS-AIL ?

  • I love this idea, because it reminds me of the most energy efficient signal processing tool in the known universe, the human brain. Give Ken Jennings a granola bar, and he'll seriously challenge Watson, who will be needing several kilowatt-hours to do the same job. Plus, Ken Jennings is a lot more flexible. He can carry on conversations, tie shoes, etc. This is because his central processing unit basically relies on some sort of fault-tolerant software. I think that there will be a lot more applications of
    • I love this idea, because it reminds me of the most energy efficient signal processing tool in the known universe, the human brain.

      Dumb analogy. Being inaccurate does not make you more intelligent and won't cause emergent behavior.

      Give Ken Jennings a granola bar, and he'll seriously challenge Watson, who will be needing several kilowatt-hours to do the same job.

      Wrong. Ken Jennings' brain runs on blood sugar, glycogen stored in the liver from previous food (converted into blood sugar by glucagon as blood sugar is consumed for work), fat stored in consolidated fat cells from previous food (converted into blood sugar by lipolysis), and a huge set of neurotransmitters (mainly acetylcholine) stored up by prior processes. Never mind that you get 10% of the energy at

  • May the best chi(m)p win.

    • by Iniamyen (2440798)
      The cihps rlealy olny hvae to get the frsit and lsat ltetres corcert. Yuor brian can flil in the rset.
  • Doesn't intel already make a chip that is unreliable?
  • Yeah, let's take away the only thing that computers had going for them - doing exactly what they're told. THAT sounds like a GREAT idea.
  • It can be done; we don't have to race for atomic-size transistors before we have the technology to make them more reliable.

  • now that would be world changing!
  • What do you think the artefacts shown on screen are when you overclock your video card too high? Acceptable (sometimes) hardware errors.

  • This is why everything is disposable and nothing works anymore. People are too willing to sacrifice quality and reliability for cost.

  • So: This assumes that something, somewhere knows which transistors are unreliable. This data needs to be stored somewhere - on the "good" transistors. How is this data obtained? Is there a trustworthy "map" of "unreliable transistors"? And the code that determines the probability has to run on the "good" transistors too. Will those transistors stay good?

    I cannot see any way of allowing *any* transistor being unreliable... And based on my (admittedly incomplete) understandin

  • I already thought we had a framework for making chips unreliable in the programming realm known as Windows API.

    Oh wait...

    -Hackus

  • if it's a choice between using a slower chip that is reliable and a chip that is blistering fast but makes mistakes, i'll take the slower chip every time.

  • From the article: "A third possibility, which some researchers have begun to float, is that we could simply let our computers make more mistakes."

    A fourth possibility is to forget this silliness before it turns into epic failure, go back to the drawing board, and design computers that make fewer mistakes, not more mistakes. Sheesh, what lunacy!
