A Diagnosis of Self-Healing Systems
gManZboy writes "We've been hearing about self-healing systems for a while, but (as usual) so far it's more hype than reality. Well, it looks like Mike Shapiro (from Sun's Solaris Kernel group) has been doing some actual work in this direction. His prognosis is that there's a long way to go before we get fully self-healing systems. In this article he talks a bit about what he's done, points out some alternative approaches to his own, and covers what's left to do."
Had this 3 years ago (Score:5, Interesting)
Which turned out not to be faulty... hmmm...
Some IBM mainframes are already at this level of self-diagnosis. Where I work, when a drive in the RAID array fails, the array phones IBM to report the fault and IBM repairmen show up with spare drives. We don't even know a drive failed until the field-service tech shows up!
Re:Had this 3 years ago (Score:4, Interesting)
Interesting. Where I work this happens too, except instead of IBM techs we get techs who work for the city, and 90% of the time it turns out they were sent for no good reason. The techs usually don't even know that a machine called in a service request, and they waste a lot of time asking me why they were called.
If the future holds more of this I hope I die soon.
Reset button (Score:2, Interesting)
UNIX is the problem. Tandem was the solution. (Score:5, Interesting)
The most successful example is Tandem. For decades, systems that have to keep running have run on Tandem's operating system. For an overview of how they did it, see the 1985 paper Why Computers Stop and What Can Be Done About It. [hp.com]
The basic concepts are:
Every time you use an ATM or trade a stock, somewhere a Tandem cluster was involved.
Tandem's problem was that they had rather expensive proprietary hardware. You also needed extra hardware to allow for fail-operational systems. But it all really does work. HP still sells Tandem, but since Carly it has been neglected, like most other high technology at HP.
Re:UNIX is the problem. Tandem was the solution. (Score:5, Interesting)
The machines used MIPS processors (supporting SMP) and ran a Tandem variant of System V UNIX. Combine this with a decent transactional database, and application software capable of check-pointing itself, and you have a very robust system. Albeit a very expensive one.
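For what it's worth, application-level check-pointing of that sort is easy to sketch. Here's a toy version (the `checkpoint`/`restore` names and the pickle-to-disk approach are mine for illustration, not Tandem's; NonStop process pairs checkpointed state to a backup CPU in memory, not to a file):

```python
import os
import pickle
import tempfile

def checkpoint(state, path="app.ckpt"):
    """Atomically commit the application state to disk."""
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, path)  # atomic rename: a crash never leaves a torn file

def restore(path="app.ckpt", default=None):
    """Reload the last committed state, or start fresh."""
    try:
        with open(path, "rb") as f:
            return pickle.load(f)
    except FileNotFoundError:
        return default

state = restore(default={"processed": 0})
for _ in range(3):
    state["processed"] += 1
    checkpoint(state)  # commit after every unit of work

print(restore()["processed"])  # 3 on a fresh run
```

The atomic-rename step is the important part: if the process dies mid-write, the previous good checkpoint survives, which is the same guarantee a transactional database gives you.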
Tandem was bought out by Compaq, and then by HP. When I left, Tandem had quite a few interesting ideas they were working on, but as near as I can tell, they never saw the light of day.
One of my self-healing systems (Score:5, Interesting)
I'd call that a self-healing system. I'm a network admin, though, so my perception of these things tends to be on a larger scale.
The need for a "self" symbol (Score:3, Interesting)
This is really something that, IMHO, calls for more interaction among the best of the futurists, science-fiction writers, coders, and other complexity thinkers.
In order for any system to understand and properly diagnose its own operation, it needs to be able to conceptualize its relationship to other systems around it. Am I important? What functions do I provide? What level of error is proper to report to my administrator? Do I have a history of hardware problems? Has chip 2341 on motherboard 12 been acting up intermittently? If so, is it getting worse or better? How have I been doing over the last few days? Is there a new virus going around that is similar to something I've had before?
What good is a self-diagnosing system without a memory of its prior actions?
All of these questions imply some sort of context that will require the system to use symbols to represent "things" in the "world" around it. Clearly, the largest (though perhaps not qualitatively different) symbol will be a "self" symbol.
From there, all you have to do is follow Hofstadter [dmoz.org]'s path and you'll arrive at a system with emergent self-awareness or consciousness.
The end result of this will be something a) very complex and b) designed/grown by itself. You'll have either the computer from the U.S.S. Enterprise [epguides.info] or H.A.L. [underview.com]
Side question: What is CYC doing these days? [cyc.com]
It's a long way (Score:4, Interesting)
The former could be considered self-repair, but it is limited: it doesn't take much of an error to totally swamp most error-correction codes.
The second form isn't really self-repair as much as it is damage control. This is just as important as self-repair, as you can't do much repair work if your software can't run.
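To see how little it takes to swamp an error-correction code, here's a toy Hamming(7,4) codec: it repairs any single flipped bit perfectly, but a second flip sends the "repair" confidently to the wrong bit (the encode/decode helpers are my own minimal sketch):

```python
def encode(d):
    """4 data bits -> 7-bit codeword [p1, p2, d1, p3, d2, d3, d4]."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4   # parity over positions 1,3,5,7
    p2 = d1 ^ d3 ^ d4   # parity over positions 2,3,6,7
    p3 = d2 ^ d3 ^ d4   # parity over positions 4,5,6,7
    return [p1, p2, d1, p3, d2, d3, d4]

def decode(c):
    c = list(c)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s3  # 1-based position of the flip; 0 = clean
    if syndrome:
        c[syndrome - 1] ^= 1  # "repair" whichever bit the syndrome points at
    return [c[2], c[4], c[5], c[6]]

data = [1, 0, 1, 1]
cw = encode(data)

one_err = list(cw); one_err[3] ^= 1
assert decode(one_err) == data        # single flip: repaired correctly

two_err = list(cw); two_err[1] ^= 1; two_err[5] ^= 1
print(decode(two_err) == data)        # False: the decoder "fixes" the wrong bit
```

With two flips, the syndrome still points somewhere, so the decoder happily corrupts a third position instead of flagging trouble. That's exactly the "swamped" failure mode.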
On the whole, "normal" systems don't need any kind of self-repair beyond the basic error-correction codes. Instead, you're likely better off with a "hot fail-over" setup: two systems running in parallel with the same data, with one of them kept "silent". Both take input from the same source(s), and so should have identical states at all times, with no synchronization required.
If the "active" one fails, just "unsilence" the other one and restore the first one's state. If the "silent" one fails, all you do is copy the state over.
However, computers are deterministic. Two identical machines, performing identical operations, will always produce identical results. Therefore, in order to have a meaningful hot fail-over of the kind described, the two can't be identical. They have to be different enough to not fail under identical conditions, but be similar enough that you can trivially switch the output from one to the other without anybody noticing.
e.g., a Linux box on AMD running Roxen and an OpenBSD box on Intel running Apache would be pretty much guaranteed not to have common points of failure. If you used a keepalive daemon on each box to monitor the other's health, you could easily ensure that only one box was "talking" at a time, even if both were receiving.
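The keepalive logic itself amounts to very little code. A toy sketch of the election step (the `Node` class and names are made up for illustration; a real setup would use heartbeat packets and an IP takeover, not in-process flags):

```python
class Node:
    """Stand-in for one box in the pair. In reality `healthy` would be
    set by a heartbeat check over the network, not assigned directly."""
    def __init__(self, name):
        self.name = name
        self.healthy = True
        self.silent = True   # both boxes receive input; only one talks

def elect(active, standby):
    """Keepalive decision: unsilence the standby if the active box dies."""
    if not active.healthy and standby.healthy:
        active.silent, standby.silent = True, False
        return standby, active   # roles swap
    return active, standby

a, b = Node("linux-roxen"), Node("openbsd-apache")
a.silent = False                 # a starts as the talker

a.healthy = False                # simulate the active box failing
active, standby = elect(a, b)
print(active.name, active.silent)  # openbsd-apache False
```

Because both boxes already hold identical state, the "fail-over" is nothing more than flipping who is allowed to answer, which is why the added complexity stays minimal.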
The added complexity is minimal, which is always good for reliability, and the result is as good or better than any existing software self-repair method out there.
Now, you can't always use such solutions. Anything designed to work in space, these days, uses a combination of the above techniques to extend the lifetime of the computer. By dynamically monitoring the health of the components, re-routing data flow as needed, and repairing data/code stored in transistors that have become damaged, you ensure the system will keep functioning.
Transistors get destroyed by radiation quite easily. If you didn't have some kind of self-repair/damage-control, you'd either be using chips with transistors which may or may not work, or you'd have to scrub the entire chip after a single transistor went.
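The classic trick for riding out a flipped transistor is majority voting across redundant units (triple modular redundancy). A sketch of the voter (the `vote` helper is illustrative, not any particular flight system's code):

```python
from collections import Counter

def vote(*replies):
    """TMR-style voter: take the majority answer across redundant units,
    masking a single faulty result. Raises if no strict majority exists."""
    winner, count = Counter(replies).most_common(1)[0]
    if count * 2 <= len(replies):
        raise RuntimeError("no majority: too many units disagree")
    return winner

# one radiation-flipped result out of three is simply outvoted
print(vote(42, 42, 17))  # 42
```

Note the voter only masks faults; the damaged unit still has to be detected and scrubbed or re-routed around, or a second failure will eventually break the majority.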
Re:Had this 3 years ago (Score:2, Interesting)
For example, a broken door sensor could make the door fail to slow down when closing, and the only symptom would be the louder sound of the door slamming. However, in a few days other parts would be damaged, increasing the cost of the repair and putting the elevator out of service.
The tech could get into the building before the elevator stopped working. According to the marketing guys, it gave us an image of excellence in hardware and service.
All this was written in 80C51 assembly, in less than 16 KB. The PC code for the field-service central was written in C, and featured a nice EGA graphic (640x350, in 4 pages) of the electric circuit. In real-time mode (when the central called the elevator) the graphic showed the relays, switches, buttons, etc., all animated. We could even tell how many people entered the elevator by the number of times the door sensor was activated, or which buttons were pushed. Cool!
Re:One of my self-healing systems (Score:3, Interesting)
Your systems shouldn't have gotten infected with spyware in the first place, and the fact that they did shows you have bigger problems. What if they get infected with something more malicious than Gator? Or with something that's not detected by the spyware-removal tools?
Re:One of my self-healing systems (Score:3, Interesting)
Tracking the symptoms like this alerts me to these problems. Running SpyBot on a machine never hurts, and I'll do other things too, like having a script email me the list of administrators on the machine, and perhaps change the password.
As for more malicious threats, I have used the same technique with Snort sensors around the network logging to a database. Another script queries the database and takes the appropriate action du jour; for example, during Nimda I had scripts that would scan the database and clean infected machines.
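The database-scanning side of that is simple enough to sketch. This assumes a made-up one-table schema (the real Snort database layout is different), with the cleanup action reduced to a print:

```python
import sqlite3

# Hypothetical schema: one row per alert, with source IP and signature name.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE alerts (src_ip TEXT, signature TEXT)")
db.executemany("INSERT INTO alerts VALUES (?, ?)", [
    ("10.0.0.5", "NIMDA worm probe"),
    ("10.0.0.5", "NIMDA worm probe"),
    ("10.0.0.9", "ICMP ping"),
])

def infected_hosts(db, pattern="NIMDA%", threshold=2):
    """Hosts that tripped the worm signature often enough to act on."""
    rows = db.execute(
        "SELECT src_ip FROM alerts WHERE signature LIKE ? "
        "GROUP BY src_ip HAVING COUNT(*) >= ?", (pattern, threshold))
    return [ip for (ip,) in rows]

for ip in infected_hosts(db):
    # "action du jour": here you'd kick off the cleaning script for the host
    print("would clean", ip)
```

The threshold keeps a single stray alert from triggering the cleanup, which matters when the action is something disruptive like a password change or a reimage.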
It's always worth putting in the extra time to automate these things: you end up with a solution for the future, and you can sit back and admire your work.
As for curing the symptoms and not the cause: this frees up my time to tackle the cause. If I ran around manually cleaning up systems, I'd never have time for anything else.