A Diagnosis of Self-Healing Systems
gManZboy writes "We've been hearing about self-healing systems for a while, but, as usual, so far it's more hype than reality. Well, it looks like Mike Shapiro (from Sun's Solaris Kernel group) has been doing some actual work in this direction. His prognosis is that there's a long way to go before we get fully self-healing systems. In this article he talks a bit about what he's done, points out some alternative approaches to his own, and covers what's left to do."
Re:The challenge of a truly self-healing system (Score:5, Informative)
If something goes wrong with one, the system should detect either too little or too much DC voltage or current coming from it, and switch to its backup.
Your suggestion doesn't make much sense. Should Mozilla know what to do if a USB mouse fails or is removed unexpectedly? Of course not; the Mozilla developers expect that this will be taken care of.
Likewise, when a correctable memory or disk error occurs, the memory controller or disk firmware should deal with it, and the application should be none the wiser.
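The power-supply check mentioned above (detect an out-of-range DC voltage, switch to the backup) is the kind of logic that lives below the application. Here is a minimal sketch of it in Python. The `Supply` class, the 12 V nominal value, and the 5% tolerance are all assumptions made up for illustration, not anything from a real monitoring API.

```python
from dataclasses import dataclass


@dataclass
class Supply:
    name: str
    volts: float  # latest measured DC output (hypothetical sensor reading)


def in_range(volts, nominal=12.0, tolerance=0.05):
    """True if the reading is within +/- tolerance (a fraction) of nominal."""
    return abs(volts - nominal) <= nominal * tolerance


def pick_supply(primary, backup, nominal=12.0, tolerance=0.05):
    """Return the supply that should carry the load.

    Prefer the primary; fall back to the backup when the primary reads
    too low or too high. The caller (think firmware or a service
    processor, not an application) never has to care which one is in use.
    """
    if in_range(primary.volts, nominal, tolerance):
        return primary
    if in_range(backup.volts, nominal, tolerance):
        return backup
    raise RuntimeError("no healthy supply available")
```

For example, `pick_supply(Supply("A", 13.5), Supply("B", 12.0))` returns supply B, because 13.5 V is outside the assumed 11.4 to 12.6 V window.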
Similar to IBM's Autonomic Computing (Score:2, Informative)
Check out: Autonomic Computing [ibm.com]
Re:I'm confused (Score:3, Informative)
"Self healing", in this context, is the system's ability to detect a fault (hardware or software), deal with it (restart a process, isolate hardware, etc.) and then get on with life (possibly in a degraded mode). In a way, the venerable Veritas Cluster Server is an example of a "self healing" system: it detects a failure of a service group and restarts it, on another node if needed.
Note that with "self healing" systems, the process may die, and end users may notice a failure. But the system is back online sooner than if it required manual intervention. Compare this to a fault-tolerant system, which never goes down in the first place.
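The detect / deal with it / carry on loop described above can be sketched as a toy supervisor. This is only an illustration under assumed names (`Node`, `Service`, `heal`); it is not how Veritas or Solaris actually implements it, and a real cluster would also need heartbeats, fencing, and so on.

```python
class Node:
    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy


class Service:
    def __init__(self, name, node):
        self.name = name
        self.node = node
        self.running = True
        self.restarts = 0


def heal(service, nodes):
    """One monitoring pass: restart in place, or fail over to another node.

    Returns "ok", "restarted", "failed over", or "down". Note that the
    service really was down for a moment in the last three cases; that
    outage is what separates self-healing from fault tolerance.
    """
    if service.running and service.node.healthy:
        return "ok"
    if not service.node.healthy:
        candidates = [n for n in nodes if n.healthy]
        if not candidates:
            service.running = False
            return "down"
        service.node = candidates[0]  # move to the first healthy node
        action = "failed over"
    else:
        action = "restarted"  # node is fine, only the process died
    service.running = True
    service.restarts += 1
    return action
```

A monitor would call `heal()` periodically. The interval between the fault and the next pass is the window in which end users may notice the failure.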
IBM's been there, done that (Score:3, Informative)
The current zSeries machines come with 16 CPUs, with L1 and L2 cache packaged together on a board.
But only 12 CPUs are used.
Each "CPU" is actually two CPUs and a comparator. When the two CPUs come up with different answers, that CPU is shut down and processing is taken over by one of the four free CPUs on the board.
You will never know it happened until you run one of the maintenance utilities.
As is IBM's way, this technology will probably appear on top-end pSeries (AIX/Linux) and iSeries boxes in a couple of years.
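The paired-CPU-plus-comparator scheme described above can be modeled in a few lines. This is a software toy, not IBM's design: the real thing happens in hardware, and every name here (`LockstepUnit`, `Board`, `good_core`, `faulty_core`) is invented for the example.

```python
class Mismatch(Exception):
    """Raised when the two cores of a unit disagree."""


def good_core(op, *args):
    return op(*args)


def faulty_core(op, *args):
    return op(*args) + 1  # simulated hardware fault: wrong answer


class LockstepUnit:
    """One logical CPU: two cores plus a comparator."""

    def __init__(self, core_a, core_b):
        self.core_a = core_a
        self.core_b = core_b

    def run(self, op, *args):
        a = self.core_a(op, *args)
        b = self.core_b(op, *args)
        if a != b:  # the comparator
            raise Mismatch()
        return a


class Board:
    def __init__(self, active, spares):
        self.active = list(active)
        self.spares = list(spares)
        self.retired = []  # what a maintenance utility would report

    def execute(self, slot, op, *args):
        """Run op on a slot, swapping in a spare if the unit disagrees."""
        while True:
            try:
                return self.active[slot].run(op, *args)
            except Mismatch:
                if not self.spares:
                    raise
                self.retired.append(self.active[slot])
                self.active[slot] = self.spares.pop()
```

The caller of `execute()` just gets the right answer back. The only trace of the failure is the entry in `retired`, which matches the "you will never know until you run a maintenance utility" behavior.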