
 



Operating Systems Software Programming IT Technology

A Diagnosis of Self-Healing Systems 149

gManZboy writes "We've been hearing about self-healing systems for a while, but (as usual) so far it's more hype than reality. Well, it looks like Mike Shapiro (from Sun's Solaris Kernel group) has been doing a little actual work in this direction. His prognosis is that there's a long way to go before we get fully self-healing systems. In this article he talks a little bit about what he's done, points out some alternative approaches to his own, and covers what's left to do."

  • by grahamsz ( 150076 ) on Tuesday December 21, 2004 @07:49PM (#11154052) Homepage Journal
    Plenty of Sun's boxes have redundant power supplies.

    If something goes wrong with one, the system should detect either too little or too much DC voltage or current coming from it, and switch to its backup.

    Your suggestion doesn't make much sense. Should Mozilla know what to do if a USB mouse fails or is removed unexpectedly? Of course not; the Mozilla developers expect that this will be taken care of.

    Likewise, when a correctable memory or disk error occurs, the memory controller or disk firmware should deal with it and the application should be none the wiser.

  • by bhadreshl ( 841411 ) on Tuesday December 21, 2004 @08:12PM (#11154266)
    Well, this seems to be where computing services are heading: IBM is doing extensive research on self-configuring, self-healing, self-optimizing, and self-protecting computing systems, which it calls 'autonomic'.

    Check out: Autonomic Computing [ibm.com]

  • Re:I'm confused (Score:3, Informative)

    by segfaultcoredump ( 226031 ) on Tuesday December 21, 2004 @10:46PM (#11155341)
    Fault tolerance implies the ability not just to detect the fault (e.g. a failed CPU), but to keep the processes running as if nothing happened. This is possible with Stratus and Tandem boxes. It is generally not possible with common x86/Power/SPARC boxes (unless you put a lot of software on top of two boxes to make them look like one big virtual system).

    "Self-healing", in this context, is the system's ability to detect a fault (hardware or software), deal with it (restart a process, isolate hardware, etc.) and then get on with life (in a possibly degraded mode). In a way, the venerable Veritas Cluster Server is an example of a "self-healing" system: it detects a failure of a service group and restarts it, on another node if needed.

    Note that with "self-healing" systems, the process may die, and end users may notice a failure. But the system is 'back online' sooner than if it required manual intervention. Compare this to a fault-tolerant system, which never goes down in the first place.
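
The detect / deal with it / get on with life loop described above can be sketched roughly as follows. This is only an illustration, not how any real cluster product works: `heal`, `start_service`, and the node names are all made up for the example, and a "fault" is simply a `RuntimeError` raised when the service cannot start on a node.

```python
from typing import Callable, List, Tuple


def heal(start: Callable[[str], None], nodes: List[str]) -> Tuple[str, List[str]]:
    """Try to start the service on each node in turn; return the node that worked.

    `start` is a hypothetical callable that raises RuntimeError when the
    service cannot run on the given node.
    """
    events: List[str] = []
    for node in nodes:
        try:
            start(node)
        except RuntimeError as exc:
            # Fault detected: record it and fall through to the next node.
            events.append(f"fault on {node}: {exc}")
            continue
        events.append(f"running on {node}")
        return node, events
    # Nothing left to fail over to: this is where a human gets paged.
    raise RuntimeError("no healthy node left")


def start_service(node: str) -> None:
    # Simulated service: node-a has a fault, every other node is healthy.
    if node == "node-a":
        raise RuntimeError("service group failed")


node, events = heal(start_service, ["node-a", "node-b"])
print(node)    # node-b
print(events)
```

Note that the failure on node-a still shows up in `events`: the service did go down, it just came back without manual intervention, which is the difference from a fault-tolerant box.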
  • by supersnail ( 106701 ) on Wednesday December 22, 2004 @06:19AM (#11157091)
    .... given away the tshirts.

    The current zSeries machines come with 16 CPUs plus L1 and L2 cache packaged together on a board.
    But only 12 CPUs are used.

    Each "CPU" is actually two CPUs and a comparator. When the two CPUs come up with different answers, that CPU is shut down and processing is taken over by one of the four free CPUs on the board.

    You will never know it happened until you run one of the maintenance utilities.

    In the way of IBM, this technology will probably appear on top-end pSeries (AIX/Linux) and iSeries boxes in a couple of years.
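
For anyone who wants to see the comparator-plus-spares idea in miniature, here is a toy sketch. It assumes nothing about the real hardware: `LockstepCpu`, `Board`, and `Mismatch` are invented names, a "core" is just a Python function, and the failed operation is simply retried on a spare.

```python
from typing import Callable, List

Core = Callable[[int], int]  # a "core" maps an input to a result


class Mismatch(Exception):
    """Raised when the two cores of a lockstep pair disagree."""


class LockstepCpu:
    def __init__(self, name: str, core_a: Core, core_b: Core) -> None:
        self.name, self.core_a, self.core_b = name, core_a, core_b

    def execute(self, x: int) -> int:
        a, b = self.core_a(x), self.core_b(x)
        if a != b:  # the comparator
            raise Mismatch(self.name)
        return a


class Board:
    def __init__(self, active: List[LockstepCpu], spares: List[LockstepCpu]) -> None:
        self.active, self.spares = active, spares
        self.retired: List[str] = []  # only visible to "maintenance utilities"

    def execute(self, slot: int, x: int) -> int:
        while True:
            try:
                return self.active[slot].execute(x)
            except Mismatch as fault:
                if not self.spares:
                    raise RuntimeError("no spare CPUs left") from fault
                # Shut the faulty pair down and retry on a spare.
                self.retired.append(str(fault))
                self.active[slot] = self.spares.pop()


double = lambda x: 2 * x
stuck = lambda x: 0  # simulated broken core

board = Board(
    active=[LockstepCpu("cpu0", double, stuck)],
    spares=[LockstepCpu("spare0", double, double)],
)
print(board.execute(0, 21))  # 42
print(board.retired)         # ['cpu0']
```

The caller only ever sees the correct answer; the swap is recorded in `retired`, much like the failover that stays invisible until you look at the maintenance logs.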
