New Framework For Programming Unreliable Chips
rtoz writes "To handle future unreliable chips, a research group at MIT's Computer Science and Artificial Intelligence Laboratory has developed a new programming framework that enables software developers to specify when errors may be tolerable. The system then calculates the probability that the software will perform as intended.
As transistors get smaller, they also become less reliable. This loss of reliability won't be a major issue in some cases. For example, if a few pixels in each frame of a high-definition video are improperly decoded, viewers probably won't notice — but relaxing the requirement of perfect decoding could yield gains in speed or energy efficiency."
godzilla (Score:5, Insightful)
Asking software to correct hardware errors is like asking Godzilla to protect Tokyo from Mega Godzilla
this does not lead to rising property values
Re:godzilla (Score:5, Interesting)
I was hoping someone would mention James Mickens' epic rant. [usenix.org]
Re: (Score:2)
God, that's one beautiful little piece of writing. Thank you.
From the posted summary: "...if a few pixels in each frame of a high-definition video are improperly decoded, viewers probably won't notice" - Now there's a slippery slope if ever I saw one.
Chicken and the Egg. (Score:4, Informative)
We need software to design hardware to make software...
In short, it is about better adjusting your tolerance levels on individual features.
I want my integer arithmetic to be perfect. My floating point, good to 8 decimal places. But there are components meant for interfacing with the human. Audio: so much is already altered or lost due to differences in speaker quality, even on top-notch setups with gold (or whatever crazy stuff) cables. So in your digital-to-audio conversion you may be fine if a voltage is a bit off or you skip a random sample, as the smoothing mechanism will often hide that little mistake.
Now for displays... We need to be pixel perfect when we have screens with little movement. But if we are watching a movie, a pixel color of #8F6314 can be #A07310 for 1/60th of a second and we wouldn't notice it. Most displays are not even high enough quality to show these differences.
We hear of these errors and think, how horrible that we don't get perfect products... However, it is more a trade-off: getting smaller and faster in exchange for a few more glitches.
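Roughly what I mean about the pixel colors, as a C sketch (the per-channel threshold of 20 is a made-up illustrative number, not anything from TFA):

/* Per-channel tolerance check: #8F6314 vs. #A07310 differs by at most
 * 17 per channel, which a viewer won't catch for a single frame. */
#include <stdio.h>
#include <stdlib.h>

static int within_tolerance(unsigned rgb_a, unsigned rgb_b, int max_delta)
{
    for (int shift = 0; shift < 24; shift += 8) {
        int a = (rgb_a >> shift) & 0xFF;
        int b = (rgb_b >> shift) & 0xFF;
        if (abs(a - b) > max_delta)
            return 0;   /* channel error too large to ignore */
    }
    return 1;           /* every channel is "close enough" */
}

int main(void)
{
    printf("%s\n", within_tolerance(0x8F6314, 0xA07310, 20)
                       ? "imperceptible" : "visible");
    return 0;
}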
Re: (Score:3)
Re: (Score:2)
Yeah, but you could save just as much power (I'm guessing) with dedicated hardware decoders, as you could by letting the chips be inaccurate.
Eh, a dedicated hardware decoder is still made out of silicon. That's the point, make chips that perform tasks like that (or other things pushing lots of data that is only relevant for a short period, like GPUs - GPUs used only for gfx and not computation, at least) tolerate some error, so that they can use even less power. No one is yet suggesting we make general purpose CPUs in today's architectures unreliable :)
Re: (Score:2)
Re: (Score:2)
They just use OpenGL, or DirectX, and the chip and video card driver decides how to execute it and render it.
*Real* use of OpenGL and DirectX these days is all about the shaders, which get compiled and run on the GPUs. And even basic ops that are "builtin" to the drivers usually are using shader code internal to the driver (or microcode internal to the hardware/firmware).
The people programming against the hardware shouldn't have to decide how much, if any, error is acceptable.
Absolutely they should, and have been doing so with existing 3D hardware for a long time. It's just been more about 3D rendering shortcuts/heuristics/etc than faulty hardware. It's all about tricking the viewer's eyes and brain to increasing de
Re: (Score:2)
Asking software to correct hardware errors is like asking Godzilla to protect Tokyo from Mega Godzilla
OTOH, in measurement theory, it's been long known that random errors can be eliminated by post-processing multiple measurements.
Re: (Score:3)
OTOH, in measurement theory, it's been long known that random errors can be eliminated by post-processing multiple measurements.
Gaining speed and energy efficiency is not usually accomplished by doing something multiple times, and then post-processing the results of THAT, when you used to just do it once and got it right.
You'll have to do the measurements in parallel, and do it a lot faster to have time for the post processing and still come out ahead for performance. And I'm still not sure that buys you any
Re: (Score:3)
Gaining speed and energy efficiency is not usually accomplished by doing something multiple times, and then post-processing the results of THAT, when you used to just do it once and got it right.
For some kinds of computations, results can be verified in a time much shorter than the time in which they are computed. Often even asymptotically, but that's not even necessary. If you can perform a certain computation twice as fast and with half the energy on a faster but sometimes unreliable circuit/computational node, with the proviso that you need to invest five percent extra time and energy to check the result, you've still won big. (There are even kinds of computation when not even probabilistically
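A minimal check-then-retry sketch in C (fast_sqrt() is just a crude stand-in for a result coming off a fast but occasionally wrong unit; nothing here comes from the MIT work):

/* Compute on the cheap path, verify cheaply, fall back only when the
 * check fails.  Verification costs one multiply, far less than sqrt. */
#include <math.h>
#include <stdio.h>

static double fast_sqrt(double y)            /* stand-in "unreliable" result */
{
    double x = y > 1.0 ? y / 2.0 : 1.0;
    for (int i = 0; i < 3; i++)              /* deliberately too few steps  */
        x = 0.5 * (x + y / x);
    return x;
}

static double checked_sqrt(double y)
{
    double x = fast_sqrt(y);
    if (fabs(x * x - y) <= 1e-9 * (y + 1.0)) /* cheap verification          */
        return x;                            /* unreliable result accepted  */
    return sqrt(y);                          /* rare fallback, exact path   */
}

int main(void)
{
    printf("%.12f\n", checked_sqrt(2.0));
    return 0;
}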
Re: (Score:2)
Nobody is suggesting allowing errors everywhere. Errors will only be allowed where they wouldn't cause massive unexpected effects.
A simple (self-driving) car analogy here would be that you might allow the lights to flicker a little if that saves power. You might even allow the steering wheel to move very slightly at random in order to save power as long as it never causes the car to spin out of control, but you would never allow even a small chance that the car would select its destination at random.
Re: (Score:2)
I'd rather end up at the wrong street number than sideways into a power pole...
Re: (Score:1)
Re: (Score:2)
(This is where the analogy falls apart. How useful is a partly sorted array? Not very. An almost correct floating point calculation on the other hand might even be just as good as the correct result, depending on the application.)
Actually, it seems to me that the analogy is still quite valid. Having a large array where items are guaranteed to be off by no more than one spot -- in other words, where some adjacent items may be swapped from their correct positions -- could be quite useful. I'm thinking of things like "sort by most recent" for news articles, or "search by price ascending" in an online store. In fact, I'm seeing such "approximate ordering" a lot more frequently on large-scale Web apps; it's better to have an approximatel
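One way to state that kind of guarantee, as a small C sketch of my own (nothing from the article): no element may be smaller than an element sitting more than k positions before it.

/* Check that an array is sorted "to within k positions", e.g. a price
 * list where only adjacent items may have been swapped (k = 2). */
#include <stdio.h>

static int nearly_sorted(const int *a, int n, int k)
{
    for (int i = 0; i + k < n; i++)
        for (int j = i + k; j < n; j++)
            if (a[j] < a[i])
                return 0;        /* out of order by more than k slots */
    return 1;
}

int main(void)
{
    int prices[] = { 10, 12, 11, 15, 14, 20 };   /* adjacent swaps only */
    printf("%d\n", nearly_sorted(prices, 6, 2)); /* prints 1 */
    return 0;
}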
Hmmm ... (Score:5, Insightful)
So, expect the quality of computers to go downhill over the next few years, but we'll do our best to fix it in software?
That sounds like we're putting the quality control on the wrong side of the equation to me.
Re: (Score:3)
So, expect the quality of computers to go downhill over the next few years, but we'll do our best to fix it in software?
If you use modern hard drives, you've already accepted high error rates corrected by software.
Re: Hmmm ... (Score:2)
Re: (Score:1)
if you access any server remotely then you're already using this - it's called ECC RAM
Re: (Score:2)
Re: (Score:1)
Next few years? More like a few decades or more. Drivers, firmware, microcode, etc. have always contained software workarounds to hardware bugs. This is nothing new.
Re: (Score:3)
Re: (Score:2)
Okay, I'll bite. Explain yourself.
Re: (Score:2)
Then of course, I can never get a straight answer from the developers on which
Re: (Score:3)
Our MS developers of course want everyone to have a Core i7 machine with 64GB RAM and SSD hard drive
Do what the company I'm working for has done then.
Give everyone an i7 with 16GB RAM and an SSD.
Except they run Windows 7 32bit, so we can only use 4GB of that (and PAE is disabled on Win7 32bit), and the SSD is the D: drive, not the system drive so when everything does page, it slows to a crawl.
Re: (Score:1)
Re: (Score:2)
Re: (Score:2)
Not 6.24%, 6.26%... or was it 8.24%?
I forget which bit got flipped.
Re: (Score:2)
Seriously, why do we want to do this? Is power usage going to be cut in half?
Yes. Well, roughly in half. Think about signal processors and cell phones. Would you accept a 5% reduction in voice quality for a doubling of your talk time?
Re: (Score:2)
Except the battery drain in talk-time is mostly the radio, not the CPU.
The battery drain while using it is mostly the screen backlight.
So cutting in half the power consumption of something that contributes an almost insignificant amount of the total drain isn't going to do much.
Huh? (Score:1)
but relaxing the requirement of perfect decoding could yield gains in speed or energy efficiency."
Which you could already get now simply by not doing error correction. No need for some other programming framework to get this.
Re: (Score:2)
It's not so much about skipping error correction as it is saying when you can skip error correction. If 5 pixels are decoded improperly, fuck it, just keep going. However, if 500 pixels are decoded improperly, then maybe it's time to fix that.
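Something like this C sketch, where the frame layout, the names and the 50-pixel threshold are all made up for illustration:

/* Tolerate a handful of bad pixels per frame; re-run the expensive
 * correction only once the damage crosses a threshold. */
#include <stdio.h>

#define W 1920
#define H 1080
#define BAD_PIXEL_LIMIT 50
#define BAD_PIXEL (-1)                 /* marker for a corrupted sample   */

static int count_bad(const int *frame)
{
    int bad = 0;
    for (int i = 0; i < W * H; i++)
        if (frame[i] == BAD_PIXEL)
            bad++;
    return bad;
}

static void correct_frame(int *frame)  /* stand-in for real error masking */
{
    for (int i = 0; i < W * H; i++)
        if (frame[i] == BAD_PIXEL)
            frame[i] = i > 0 ? frame[i - 1] : 0;   /* copy a neighbour    */
}

static void present_frame(int *frame)
{
    if (count_bad(frame) > BAD_PIXEL_LIMIT)
        correct_frame(frame);          /* hundreds of bad pixels: fix it  */
    /* otherwise: a few bad pixels, just keep going */
}

int main(void)
{
    static int frame[W * H];           /* all-zero "decoded" frame        */
    frame[123] = BAD_PIXEL;
    present_frame(frame);
    printf("bad pixels left: %d\n", count_bad(frame));
    return 0;
}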
Re: (Score:1)
And as I said you can do that already.
Re: (Score:2)
Really? You can tell your phone/PC/laptop/whatever to run the graphics chip at an unreliably low voltage on demand?
Re: (Score:1)
For PC and laptop, yes I can.
Overclocking utilities can also underclock, and near the lower stability threshold of the graphics clock I often do see a few pixels out of whack. Not enough to crash, but artifacts definitely appear. A MHz or two higher clock clears them up, though.
I have a dumb phone so reclocking it isn't necessary.
Re: (Score:2)
So you've done this yourself and you still don't see the utility in doing it at the application level rather than the system level?
Re: (Score:1)
Automating the process would be handy, but not revolutionary. Automating it at the system level makes more sense to me, but I'm just a power user.
Re: (Score:2)
I don't think it is revolutionary, either... it's just a framework, after all. I was imagining a use where you have some super-low-power device out in the woods sampling temperatures, only firing itself up to "reliable" when it needs to send out data or something. Or a smartphone media app that lets the user choose between high video/audio quality and better battery life. Yeah, they could have already done this with some custom driver or something, but presumably having an existing framework would make it
Re: (Score:2)
When you do it that way you have no control over which computations are inaccurate. There's a lot more you can do if you have some input information from higher levels of the system.
You may be happy that your pixels come out wrong occasionally, but you certainly don't want the memory allocator that controls the data to do the same. The point of this kind of technology (which is becoming common in research at the moment, the MIT link here is a good use of the marketing department) is to be able to control th
Re: (Score:2)
Or we could just use Java, with its "almost" IEEE-complete libraries
That's a design feature and what strictfp is for. It's not Sun's fault that all the different CPUs Java code can run on implement floating point hardware differently. The only other option is to emulate it in software.
It's a pity nothing you mentioned has anything to do with Java not guaranteeing floating point operations.
Re: (Score:1)
You're confusing what that sentence is talking about. They aren't talking about stuck pixels on an LCD; they're talking about not spending time on extensive error correction/masking when a few pixels in the video are corrupted and thus will be decoded with some level of artifacting.
Re: (Score:2)
A stuck pixel is still just an unreliable transistor...
Re: (Score:2)
You must have gone through a lot of monitors before realizing this has nothing to do with dead pixels on a display.
"A few pixels incorrectly decoded"... (Score:2)
Re: (Score:1)
Not always true. There are cases where corrupted macroblocks will only cause artifacts in a single frame and won't necessarily cause further decoding corruption.
Re: (Score:2)
24fps? Depends on content. It's too high for landscape establishing shots, talking heads and presentations, yet too low for high-action scenes and sports. It's a happy medium.
If you don't like it, try to get variable frame rate support more established. Then everyone is happy.
Re: (Score:2)
So then for you, the compromise in this particular example would be that you would crank up the power a bit and make the pixels all perfect. Other people without such good eyes could crank down the power and get more battery life.
Re: (Score:2)
So what you're saying is that the pixels are alive, and growing! I smell a SyFy movie of the week in the works.
Re: (Score:2)
You missed the point. This is a framework for writing code that KNOWS about unreliable bits. The whole idea is that it lets you write algorithms that can tell the compiler where it's acceptable to have a few erroneous bits, and where it isn't. No one said it would apply to EXISTING code...
Re: (Score:2)
So why not just add more instructions, for doing faster but less accurate calculations? 24bit operations for RGB values, for example.
How on earth (Score:5, Insightful)
are they going to make "unreliable transistors" that, upon failure, simply decode a pixel incorrectly, rather than, oh, I don't know, branching the program to an unspecified memory address in the middle of nowhere and borking everything.
They'd have to completely re-architect whatever chip is doing the calculations. You'd need three classes of "data" - instructions, important data (branch addresses, etc), and unimportant data. Only one of these could be run on unreliable transistors.
I can't imagine a way of doing that where the overhead takes less time than actually using decent transistors in the first place.
Oh, wait. It's a software lab that's doing this. Never mind, they're not thinking about the hardware at all.
Re: (Score:2)
2+2=5 for large values of 2.
When you're performing calculations, you need to know where and how rounding takes place if everything isn't an integer.
Re: (Score:1)
Re: (Score:2)
Doesn't that depend on the application? What if I'm simply updating a position based upon an already noisy sensor? I already have a bunch of code to throw out crappy results. I'm taking lots of samples, so as long as most of my measurements are accurate, it's all good. Obviously I can't tolerate a random error in every single cycle, but maybe 1 in a million is OK and lets me run at a lower voltage.
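A median filter over a small window is the usual way to throw out those crappy results; here's a minimal C sketch (the window size and the readings are made up):

/* A rare wild value from a low-voltage glitch gets out-voted by its
 * neighbours, which is why 1-in-a-million errors can be acceptable here. */
#include <stdio.h>
#include <stdlib.h>

static int cmp_double(const void *a, const void *b)
{
    double d = *(const double *)a - *(const double *)b;
    return (d > 0) - (d < 0);
}

static double median(const double *samples, int n)
{
    double buf[64];                     /* window assumed to be small */
    for (int i = 0; i < n; i++)
        buf[i] = samples[i];
    qsort(buf, n, sizeof buf[0], cmp_double);
    return buf[n / 2];
}

int main(void)
{
    /* One wildly wrong reading among otherwise sane sensor samples. */
    double readings[] = { 10.1, 10.2, 999.0, 10.0, 10.3 };
    printf("position estimate: %.1f\n", median(readings, 5));  /* 10.2 */
    return 0;
}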
Re: (Score:2)
A big class of CPU bugs consists of so-called speed-paths, where a part of the CPU expects a calculation in a different part of the CPU to be complete before it has actually completed
Care to expand on that? This is not a typical race condition. What you're describing is a CPU not ordering instructions as expected - not doing its primary purpose.
Re: (Score:2)
Re: (Score:2)
How on earth are they going to make "unreliable transistors" that, upon failure, simply decode a pixel incorrectly, rather than, oh, I don't know, branching the program to an unspecified memory address in the middle of nowhere and borking everything.
Very easily: the developer specifies that pixel values can tolerate errors but that branch conditions/memory addresses can't. If you'd bothered to read the summary, you'd have seen it says exactly that:
a new programming framework that enables software developers to specify when errors may be tolerable.
They'd have to completely re-architect whatever chip is doing the calculations.
Erm, that's the whole point. If we allowed high error rates with existing architectures, none of our results would be trustworthy. I imagine the most practical approach would be a fast, low-power but error-prone co-processor living alongside the main, low-error processor. This could be programmed just like GPUs ar
Re:How on earth (Score:4, Insightful)
Erm, that's the whole point. If we allowed high error rates with existing architectures, none of our results would be trustworthy. I imagine the most practical approach would be a fast, low-power but error-prone co-processor living alongside the main, low-error processor.
Or you know, the thing from 5000 years ago where we used 3 CPUs (we could on-package ALU this shit today) all running at high speeds and looking for 2 that get the same result and accepting that result. It's called MISD architecture.
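A minimal C sketch of that majority-vote idea (unreliable_add() is just a stand-in for a unit that occasionally flips a bit, not any real hardware interface):

/* Run the same computation three times and accept any value that two
 * of the runs agree on; retry in the rare case of no majority. */
#include <stdio.h>
#include <stdlib.h>

static int unreliable_add(int a, int b)
{
    int r = a + b;
    if (rand() % 1000 == 0)            /* rare fault: flip a random bit */
        r ^= 1 << (rand() % 31);
    return r;
}

static int voted_add(int a, int b)
{
    int x = unreliable_add(a, b);
    int y = unreliable_add(a, b);
    int z = unreliable_add(a, b);
    if (x == y || x == z) return x;    /* two results agree             */
    if (y == z)           return y;
    return voted_add(a, b);            /* no majority: try again        */
}

int main(void)
{
    printf("%d\n", voted_add(40, 2));  /* prints 42 */
    return 0;
}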
Re: (Score:2)
This is the better approach, but I wonder if there is a saving with 3 dodgy processors over 1 good processor. I guess if the yield falls below one third then it might. But power requirements may triple, so it's hard to see the saving.
Re: (Score:2)
Re: (Score:2)
Re: (Score:2)
It seems a bit strange to me also. Didn't read all the article; but a few pixels wrong is extremely minor and very lucky. One wrong bit is far more likely to crash your computer than to make a pixel be incorrect. What about the CPU? Are we so media obsessed now that getting the pixels wrong is considered a major error but we completely ignore all the serious errors that could result? We'd need redundant transistors to monitor everything, making sure that the CPU registers have the correct values, that
Re: (Score:2)
Re: (Score:2)
What exactly led you to believe that anyone is wanting to use this concept in situations where 100% reliability is required?
Similar Idea to EnerJ Language (Score:3, Interesting)
The gist of the idea is that the programmer can specify which variables need to be exact and which variables can be approximate. The approximate variables would then be stored in low-refresh RAM, which is more prone to errors, to save power, while the precise variables would be stored in higher-power memory which would be error-free.
The example they gave was calculating the average shade of grey in a large image of 1000 by 1000 pixels. The running total could be held in an approximate variable since the error incurred by adding one pixel incorrectly out of a million would be small, while the control loop variable would be accurate since you wouldn't want your loop to overflow.
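A C-flavoured sketch of that example (EnerJ itself is a Java extension; the APPROX marker below is purely illustrative and does nothing by itself):

/* The running total could live in error-prone, low-refresh storage,
 * while the loop counter must stay in reliable memory. */
#include <stdio.h>

#define APPROX   /* hypothetical annotation: may be stored approximately */

static unsigned char image[1000 * 1000];       /* grey levels 0..255 */

int main(void)
{
    APPROX unsigned long total = 0;            /* small error is tolerable */
    for (unsigned long i = 0; i < 1000UL * 1000UL; i++)  /* i must be exact */
        total += image[i];
    printf("average grey: %lu\n", total / (1000UL * 1000UL));
    return 0;
}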
Re: (Score:2)
The example they gave was calculating the average shade of grey in a large image of 1000 by 1000 pixels. The running total could be held in an approximate variable since the error incurred by adding one pixel incorrectly out of a million would be small...
What makes them think that the kinds of errors you'd get in a variable in low-refresh-rate RAM would be small? Flip the MSB from 1 to 0 and your total is suddenly divided in half. Or, if it's a floating-point variable, flip one bit in the exponent field and your total changes from 1.23232e4 to 1.23232e-124.
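A quick C demonstration of the point (the flipped bit positions are arbitrary, and the exact outputs depend on the platform's integer width and IEEE-754 layout):

/* A single bit flip in "approximate" storage is not necessarily small. */
#include <stdio.h>
#include <stdint.h>
#include <string.h>

int main(void)
{
    uint32_t total = 3000000u;
    total ^= 1u << 21;                 /* flip a high bit of the total     */
    printf("corrupted total: %u\n", (unsigned)total);   /* 902848          */

    double x = 1.23232e4;
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);
    bits ^= 1ULL << 60;                /* flip a bit in the exponent field */
    memcpy(&x, &bits, sizeof x);
    printf("corrupted double: %g\n", x);                /* astronomically off */
    return 0;
}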
french (Score:2)
Am I the only French person who thinks that "Computer Science and Artificial Intelligence Laboratory" sounds like this in French:
CS-AIL?
This could make computers more brain-like (Score:2)
Re: (Score:2)
I love this idea, because it reminds me of the most energy efficient signal processing tool in the known universe, the human brain.
Dumb analogy. Being inaccurate does not make you more intelligent and won't cause emergent behavior.
Give Ken Jennings a granola bar, and he'll seriously challenge Watson, who will be needing several kilowatt-hours to do the same job.
Wrong. Ken Jennings' brain runs on blood sugar, glycogen stored in the liver from previous food (converted into blood sugar by glucagon as blood sugar is consumed for work), fat stored in consolidated fat cells from previous food (converted into blood sugar by lipolysis), and a huge set of neurotransmitters (mainly acetylcholine) stored up by prior processes. Never mind that you get 10% of the energy at
No! Unreliability is a feature (Score:1)
May the best chi(m)p win.
Re: (Score:1)
Already done (Score:2)
Re: (Score:2)
Oh, GREAT (Score:1)
How about stop making crap hardware? (Score:2)
It can be done; we don't have to race for atomic-size transistors before we have the technology to make them more reliable.
Chips for unreliable programming... (Score:2)
Decades old news (Score:2)
What do you think the artefacts shown on screen are when you overclock your video card too high? Acceptable (sometimes) hardware errors.
And the inexorable decline of humanity continues (Score:1)
This is why everything is disposable and nothing works anymore. People are too willing to sacrifice quality and reliability for cost.
Infinite recursion here? (Score:2)
So: This assumes that something, somewhere knows which transistors are unreliable. This data needs to be stored somewhere - on the "good" transistors. How is this data obtained? is there a trustworthy "map" of "unreliable transistors" ? And the code that determines the probability has to run on the "good" transistors too. Will those transistors stay good?
I cannot see any way of allowing *any* transistor to be unreliable... And based on my (admittedly incomplete) understandin
Funny.... (Score:2)
I already thought we had a framework for making chips unreliable in the programming realm known as Windows API.
Oh wait...
-Hackus
faster and broken != upgrade (Score:2)
If it's a choice between using a slower chip that is reliable and a chip that is blistering fast but makes mistakes, I'll take the slower chip every time.
a fourth possibility (Score:1)
A fourth possibility is to forget this silliness before it turns into epic failure, go back to the drawing board, and design computers that make fewer mistakes, not more mistakes. Sheesh, what lunacy!
Re: (Score:2)
In other words, it assumes that we won't be using general-purpose computers in the future.
Too true. Any transistor in the path of calculating anything that ends up as a memory location, or an offset to one, has the possibility of crashing the process if you're lucky, or compromising the entire system.