A Diagnosis of Self-Healing Systems
gManZboy writes "We've been hearing about self-healing systems for a while, but (as is usual), so far it's more hype than reality. Well it looks like Mike Shapiro (from Sun's Solaris Kernel group) has been doing a little actual work in this direction. His prognosis is that there's a long way to go before we get fully self-healing systems. In this article he talks a little bit about what he's done, points out some alternative approaches to his own, as well as what's left to do."
The challenge of a truly self-healing system (Score:4, Funny)
Neither the applications nor the OS should depend on the other providing any failover or self-healing services; they should always be prepared to go it alone if necessary (as it might be the failover system). Services that crash should restart themselves, etc. This part is pretty well done by most enterprise-grade server software. It's the operating systems we're waiting to play catch-up.
And I'm still waiting to see any box that can replace its own power supply after someone flips the 115/230 switch. Once we get that, then we'll have truly self-healing systems. And all you BOFHs out there might be looking for a new career...
Re:The challenge of a truly self-healing system (Score:5, Informative)
If something goes wrong with one, the system should detect either too little or too much DC voltage or current coming from it, and switch to its backup.
Your suggestion doesn't make much sense. Should mozilla know what to do if a usb mouse fails or is removed unexpectedly? Of course not, the mozilla developers expect that this will be taken care of.
Likewise, when a correctable memory or disk error occurs, the memory controller or disk firmware should deal with it and the application should be none the wiser.
Joke Spoiler (Score:3)
Click here to ruin the joke. [ntk.net]
Re:Joke Spoiler (Score:1)
Re:The challenge of a truly self-healing system (Score:2)
Of course not... the point is not that each layer (peripheral, BIOS, kernel, application) can handle errors in all other layers. The point is that Mozilla should be designed to be able to recover from crashes without help from the kernel, BIOS, or anything else. Likewise, if a USB mouse somehow gets "confused" (protocol-wise) it should take the initi
Re:The challenge of a truly self-healing system (Score:2)
No, but Mozilla could be written to survive a memory access failure. It could be written so that it does not assume that drives and RAM are infallible.
Re:The challenge of a truly self-healing system (Score:1)
Re:As a Tech... (Score:3, Insightful)
Re:As a Tech... (Score:2)
Re:As a Tech... (Score:2)
Do
Re:As a Tech... (Score:2)
Re:As a Tech... (Score:2)
I definitely do. A job like that is more for a computer engineer/systems administrator who takes care of things like that in between his other duties, i.e. maintaining the network, analyzing security, and other things that are generally outside the scope of a computer repair dude. I can't really see companies hiring full-time guys who only do computer repair. Also keep in mind the speed of computers and their prices;
Re:As a Tech... (Score:2)
I got the impression from your original post that you want to get into repair as a career. Sorry for the misinterpretation.
Had this 3 years ago (Score:5, Interesting)
Which turned out not to be faulty... hmmm...
Some IBM mainframes are already at this level of self-diagnosis. Where I work, IBM repairmen show up with spare drives for the RAID array when they fail and the array phones IBM to report the fault. We don't know that a drive failed until the field service tech shows up!
Remote Monitoring (Score:2)
I believe Sun are working on systems that will attempt to spot failure trends, so they can proactively identify other customers who may run into similar problems and then either have the system fix itself or send someone out to deal with it.
The other mindset I've seen with RAID disks is: why bother replacing them? Disks are getting to the point that it's probably cheaper just to leave the dead one in there and power up a spare than
Re:Remote Monitoring (Score:1)
You can't just "not replace dead drives" unless you have like 400 SCSI controllers on your machine, therefore providing an insane amount of hot spares for future failures.
Re:Remote Monitoring (Score:2)
Re:Had this 3 years ago (Score:4, Interesting)
Interesting. Where I work this happens too, except that instead of IBM techs we get techs who work for the city, and instead of being sent for some good reason, 90% of the time they turn out to have been sent for no reason. The techs usually don't even know that a machine called in a service request, and they waste a lot of time asking me why they were called.
If the future holds more of this I hope I die soon.
Re:Had this 3 years ago (Score:3, Funny)
Your support request has been logged and a field technician has been sent to solve your problem.
Thank you for using IBM.
Similar to IBM's Autonomic Computing (Score:2, Informative)
Check out: Autonomic Computing [ibm.com]
Re:Had this 3 years ago (Score:2)
HAL 9000 sent an astronaut out to help repair the antenna azimuth control board.
Unfortunately, the astronaut (one 'Dave') wasn't able to comply, because HAL refused to open the pod-bay door.
Re:Had this 3 years ago (Score:1)
Re:Had this 3 years ago (Score:2, Interesting)
For example, a broken door sensor could make the door fail to slow down when closing, and the only symptom would be the louder sound of the door sla
Re:One system this will never work on (Score:1, Funny)
Re:One system this will never work on (Score:1)
http://www.microsoft.com/downloads/details.aspx
Re:One system this will never work on (Score:1)
I've had plenty of luck with self healing (Score:1)
Self-Healing Systems (Score:2)
TiVo (Score:3, Insightful)
Re:TiVo (Score:1)
Not really (Score:3, Insightful)
Indeed my TiVo very rarely crashes and always recovers, but the same is also true of every embedded system I've used, be it a cellphone, weather station or alarm system.
Now if I screw around modding my TiVo then it's entirely possible to crash it, and it doesn't recover very well from that...
Re:Not really (Score:2)
Re:Not really (Score:2)
Re:TiVo (Score:2)
These people are in a different domain; they don't know what apps their system will run, or what mistakes the sysadmin will make, or what worms someone will write next month -- they're preparing a re
Re:TiVo (Score:2)
Re:TiVo (Score:2)
My fav self-healing method (Score:1)
Re:My fav self-healing method (Score:2)
Re:My fav self-healing method (Score:2)
Re:My fav self-healing method (Score:2)
The blue button (Score:2)
After repeated viewing of those Thinkpad commercials where the techs tell the hysterical PHB to press the blue button on startup and thereby enable IBM to magically resurrect his hard drive, I summoned up the courage to try it. (The curly haired guy in those ads is also in one of my favorite commercials ("Please stay
Re:The blue button (Score:2)
Re:The blue button (Score:2)
if (Score:3, Insightful)
How many times do I have to move their icons to a submenu before they realise I don't want my root menu cluttered up with crap?
Re:if (Score:2)
It's much better than that. Self-healing means that disks in a RAID array can detect corrupted blocks of data using checksums and correct them from good mirrors on-the-fly. With multiple mirrors with checksums proving whether there is a problem or not, corrupt data files should be a thing of the past (on systems with RAID). It seems failing drives would be detected sooner, also.
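A minimal Python sketch of the checksum-and-mirror idea described above. It is not taken from any real RAID or filesystem implementation: the names `checksum` and `read_and_repair` are made up for illustration, and it assumes the expected checksum of each block is stored somewhere trustworthy, separate from the block itself.

```python
import hashlib
from typing import List


def checksum(block: bytes) -> str:
    """SHA-256 digest of a block, used as the stored checksum (an assumption)."""
    return hashlib.sha256(block).hexdigest()


def read_and_repair(mirrors: List[bytearray], expected: str) -> bytes:
    """Return a verified copy of the block and overwrite corrupt mirror copies.

    `mirrors` holds the same logical block as read from each mirror.
    """
    # Find the first copy whose checksum matches the stored one.
    good = next((bytes(m) for m in mirrors if checksum(bytes(m)) == expected), None)
    if good is None:
        raise IOError("every mirror copy fails its checksum; nothing to heal from")
    # Rewrite any copy that does not match, so the next read is clean.
    for m in mirrors:
        if checksum(bytes(m)) != expected:
            m[:] = good
    return good
```

With three mirrors and one corrupted copy, the read returns the good data and the bad copy is rewritten on the fly. If every copy is bad, the sketch can only report the failure, which is the point where detection stops being healing.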
Re:if (Score:1)
However many it takes you to realize the problem is user error and upgrade to an OS that works well.
If you compare computers to biological systems (something people should spend more time doing, the biological systems usually are more robust) then self-healing is something like the concept of "radiant health." Before you worry about that, first you have to reach a state of healt
Re:if (Score:1, Troll)
How about systems that I can manually heal first? (Score:4, Insightful)
Uninstalling applications is often not handled by the OS and has to be done by the application itself, resulting in incomplete uninstalls, with config files and registry entries that haven't been properly cleaned up.
Files aren't versioned, so every change made to a file simply erases the former content forever, which is not so good if the former content was important.
Undelete? Nope, we don't have that either. We have this hack of a Trashcan, but that won't help you much if some program deleted the file.
Checking the integrity of an installed piece of software isn't possible either. Sure, there are third-party solutions, but again, that should be something the OS provides by default.
Well, there are millions more reasons why today's systems suck and why it is often easier to simply reinstall from scratch than to try to actually fix the mess. And yep, that is true for Linux, Windows and MacOS alike; more for some than for others, but that's it.
Re:How about systems that I can manually heal firs (Score:1)
Mostly, that's because Windows is a piece of shit.
Re:How about systems that I can manually heal firs (Score:1)
Sorry about that, I just said the first thing that came to my mind.
Re:How about systems that I can manually heal firs (Score:3, Insightful)
Undelete?
Check of integrity of an installed piece of software
During the desktop's formative years, the raw drive space needed to actually implement these kinds of things just wasn't available. This is why things like file versioning (popular on large systems like VMS, where the universities/companies running it had the money for the storage requirements) and permanent storage of "unwanted" files just didn't appear.
The third problem is a bit tougher without some extra metadata
Re:How about systems that I can manually heal firs (Score:2)
Well, the system ultimately knows when something changes, since it is the one that changes it. You are right that one needs some metadata; in most cases, however, that already comes with the packages (deb/rpm) one installs. There just isn't a standard way to automatically check for these changes. The problem can be solved completely in userspace with a cronjob, though; it would just be nice to have a standard way to do it.
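A rough Python sketch of the userspace check mentioned above. The manifest format (a plain mapping from file path to SHA-256 digest) and the names `file_digest` and `verify` are assumptions made for illustration; real deb/rpm metadata is laid out differently.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def file_digest(path: Path) -> str:
    """SHA-256 of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def verify(manifest: Dict[str, str]) -> List[str]:
    """Return the paths that are missing or no longer match the manifest."""
    damaged = []
    for name, expected in manifest.items():
        path = Path(name)
        if not path.is_file() or file_digest(path) != expected:
            damaged.append(name)
    return damaged
```

Run from cron, a script like this would only report what changed. Deciding whether a change is damage or a legitimate edit (a config file, say) still needs extra metadata or a human.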
###
Re:How about systems that I can manually heal firs (Score:2)
That would be "Survive, damn you! Survive!"
There is this internal conflict we must have, where on the one hand we want our technology to have a survival instinct; so that it is motivated to look after itself while we are not.
A bit like a human baby figuring out that sometimes mummy is not looking this way and it has to get out of the way of the reversing SUV by itself.
On the other hand, the prospect of computers that have a survival instinct is (or bloody
Re:Get a *real* OS (Score:2)
Feel free to suggest one; neither Windows, MacOSX, Linux (any distro) nor FreeBSD gets the job done properly.
### Why in hell does the OS have to be involved in an application install?!?!?!
If not the OS, then who else should take care of it? If applications are free to dump themselves anywhere, without anybody making sure they do it properly, it's just a matter of time until one app does bad things.
### VMS. Old news. Also lots
Re:Get a *real* OS (Score:2)
Re:Get a *real* OS (Score:2)
If now apple could just return to
Re:Get a *real* OS (Score:2)
Re:How about systems that I can manually heal firs (Score:2)
Basically both. If one wants a self-healing system, the system first needs a way to find out that something is wrong. If there isn't a way to detect that some files got broken, then a self-healing system can do nothing. Besides that, it might of course also help to detect some cracker attacks or corrupt hard disks more easily.
Sod the stupid machines (Score:2)
Reset button (Score:2, Interesting)
I'm confused (Score:2)
Explain to me how any of the failure responses I see discussed in the article or in these discussions qualifies as "Healing"? Almost all fault-tolerant systems isolate failing components or programs from the rest of the system (killing rogue processes counts as isolation). Quarantine is not an attempt to heal, it is an attempt to tolerate. Are there actually any non-quarantine "self healin
Re:I'm confused (Score:2)
However, if you want to understand what self-healing really means (and does not mean), consider that our DNA is self-healing.
Now, I do not claim to understand the mechanisms whereby the DNA is self healing. I am aware that there is a recent article that points out how the DNA breaks get
Re:I'm confused (Score:3, Informative)
"Self Healing", in this context, is the system's ability to detect a fault (hardware or software), deal with it (restart a process, isolate hardware, etc) and
Re:I'm confused (Score:2)
UNIX is the problem. Tandem was the solution. (Score:5, Interesting)
The most successful example is Tandem. For decades, systems that have to keep running have run on Tandem's operating system. For an overview of how they did it, see the 1985 paper Why Computers Stop and What Can Be Done About It. [hp.com]
The basic concepts are:
Every time you use an ATM or trade a stock, somewhere a Tandem cluster is involved.
Tandem's problem was that they had rather expensive proprietary hardware. You also needed extra hardware to allow for fail-operational systems. But it all really does work. HP still sells Tandem, but since Carly, it's being neglected, like most other high technology at HP.
Re:UNIX is the problem. Tandem was the solution. (Score:5, Interesting)
The machines used MIPS processors (supporting SMP) and ran a Tandem variant of System V UNIX. Combine this with a decent transactional database, and application software capable of check-pointing itself, and you have a very robust system. Albeit a very expensive one.
Tandem was bought out by Compaq, and then by HP. When I left, Tandem had quite a few interesting ideas they were working on, but near as I can tell, they never saw the light of day.
Re:UNIX is the problem. Tandem was the solution. (Score:2)
Re:UNIX is the problem. Tandem was the solution. (Score:3, Insightful)
Knowing HP, your systems are probably being replaced by Tandem-branded PCs with ECC RAM and software RAID. A rescue DVD will provide instant system rebuilds so downtime is never more than two days.
Re:UNIX is the problem. Tandem was the solution. (Score:2)
They did the same to DEC
So how's the health of the system that monitors (Score:2)
First step to self-awareness and AI? (Score:2)
It is then a small step to go from simple feedback self-healing mechanisms to feed-forward control mechanisms. F
full redundancy (almost) always works (Score:2)
One of my self-healing systems (Score:5, Interesting)
I'd call that a self healing system. I'm a network admin though so my perception of these things tends to be on a larger scale.
Re: (Score:1)
Re:One of my self-healing systems (Score:3, Interesting)
Your systems shouldn't have gotten infected with spyware in the first place, and the fact that they did shows you have bigger problems. What if they get infected with something more malicious than gator? Or how about something that's not detected by the spyware removal tools?
Re:One of my self-healing systems (Score:3, Interesting)
Tracking the symptoms like this alerts me to these problems - running SpyBot on a machine never hurts, and I'll do other things too, like have a script email me the list of administrators on the machine and perhaps change the password.
As for more malici
The need for a "self" symbol (Score:3, Interesting)
This is really something that, IMHO, calls for more interaction between the best of the futurists, science-fiction writers, and coders, and other complexity thinkers.
In order for any system to have an understanding of and proper diagnosis of its own operation, it needs to be able to conceptualize its relationship to other systems around it. Am I important? What functions do I provide? What level of error is proper to report to my administrator? Do I have a history of hardware problems? Has chip 2341 on motherboard 12 been acting up intermittently? If so, is it getting worse or better? How have I been doing over the last few days? Is there a new virus going around that is similar to something I've had before?
What good is a self-diagnosing system without a memory of its prior actions?
All of these questions imply some sort of context that will require the system to use symbols to represent "things" in the "world" around it. Clearly, the largest (though perhaps not qualitatively different) symbol will be a "self" symbol.
From there, all you have to do is follow Hofstadter [dmoz.org]'s path and you'll arrive at a system with emergent self-awareness or consciousness.
The end result of this will be something a) very complex and b) designed/grown by itself. You'll have either the computer from the U.S.S. Enterprise [epguides.info] or H.A.L. [underview.com]
Side question: What is CYC doing these days? [cyc.com]
Self-Healing Data Transfer (Score:2)
Where does it hurt? (Score:3, Insightful)
Re:Where does it hurt? (Score:1)
but I doubt many developers think from a user's point of view...
Re:Where does it hurt? (Score:1)
That is more or less how things have evolved here at The Internet Archive [archive.org].
Unixes and the services that run on them can be configured to be very verbose in their errors and warnings, and error messages can be used as triggers to check various logs and system states for additional information, but in a nontrivial cluster there are major problems with humans trying to digest this flow of information and make sense of it all.
Better tools help, but beyond a point it just makes sense to try and make the tools
Re:Where does it hurt? (Score:2)
Re:TMI, anyone? (Score:2)
It's a long way (Score:4, Interesting)
The former could be considered self-repair, but it is limited as you don't have to have much in the way of an error to totally swamp most error-correction codes.
The second form isn't really self-repair as much as it is damage control. This is just as important as self-repair, as you can't do much repair work if your software can't run.
On the whole, "normal" systems don't need any kind of self-repair, beyond the basic error-correction codes. Instead, you are likely better off to have a "hot fail-over" system - two systems running in parallel with the same data, only one of them is kept "silent". Both take input from the same source(s), and so should have identical states at all times, with no synchronization required.
If the "active" one fails, just "unsilence" the other one and restore the first one's state. If the "silent" one fails, all you do is copy the state over.
However, computers are deterministic. Two identical machines, performing identical operations, will always produce identical results. Therefore, in order to have a meaningful hot fail-over of the kind described, the two can't be identical. They have to be different enough to not fail under identical conditions, but be similar enough that you can trivially switch the output from one to the other without anybody noticing.
E.g., a Linux box on an AMD running Roxen and an OpenBSD box on an Intel running Apache would be pretty much guaranteed not to have common points of failure. If you used a keepalive daemon on each box to monitor the other's health, you could easily ensure that only one box was "talking" at a time, even if both were receiving.
The added complexity is minimal, which is always good for reliability, and the result is as good as or better than any existing software self-repair method out there.
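The keepalive decision itself can be sketched in a few lines of Python. This is only an illustration under assumptions: the `Node` class, the `choose_speaker` function and the 3-second timeout are invented here, and heartbeat times are passed in as plain numbers rather than read off a network.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    last_heartbeat: float  # time (seconds) at which the peer last heard from this node


def choose_speaker(active: Node, standby: Node, now: float, timeout: float = 3.0) -> Node:
    """Decide which box should be "talking".

    The active box keeps talking unless its heartbeat is stale while the
    standby's heartbeat is still fresh.
    """
    active_alive = now - active.last_heartbeat <= timeout
    standby_alive = now - standby.last_heartbeat <= timeout
    if not active_alive and standby_alive:
        return standby
    return active
```

The sketch fails over only when the standby is known to be healthy. If both heartbeats are stale, it leaves the active box in place, since a dead link between the two monitors is indistinguishable from two dead boxes, and switching blindly could leave both talking or neither.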
Now, you can't always use such solutions. Anything designed to work in space, these days, uses a combination of the above techniques to extend the lifetime of the computer. By dynamically monitoring the health of the components, re-routing data flow as needed, and repairing data/code stored in transistors that have become damaged, you ensure the system will keep functioning.
Transistors get destroyed by radiation quite easily. If you didn't have some kind of self-repair/damage-control, you'd either be using chips with transistors which may or may not work, or you'd have to scrub the entire chip after a single transistor went.
Re:It's a long way (Score:1)
Fuck no! Computers have souls, man. That is because they are so complicated that the deterministic model no longer holds; there's a non-deterministic layer that gives the machines their personal features.
I'd rather have self-healing coffee. (Score:2)
Now, as for self-heating systems...
The article is a pretty good roadmap... (Score:2)
Next, it talks about verbose and useful errors, so that a techy can make intelligent decisions about terminating a process, restarting it, altering a file, or some other fix. Presumably, once a tech marks a problem "successfully fixed" by a certain set of actions enough times, the system will try that series of actions before throwing an error message.
What will be nice is when the system recognizes what it is it's doing, so it'll have
worst case? (Score:1, Insightful)
Why the term "self-healing?" (Score:2)
Wouldn't a "self-healing" system just be good at a) reporting what hardware is actually broken on the machine b) automating well defined responses to well defined programs and c) building parallel, fault tolerant hardware at all levels of the system?
As far as I know, even the best AI research hasn't come up with software that can diagnose and fix unknown, first time, bizarre problems. Ultimately, it a
We already have this... (Score:3, Insightful)
The Shuttle has many thousands of sensors and backup sensors. Each sensor feeds into one of many computer systems. These computer systems talk to each other more like a committee than like machines just passing data amongst themselves. If a computer discovers a fault, another computer will see that fault as well. Each one combines data gathered from other computer systems throughout the shuttle, and each computer system literally casts a vote on what the best solution should be for the particular fault discovered.
If one computer system suffers a partial or complete failure, the remaining systems will work around the failed system.
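The committee idea can be illustrated with a small majority-vote function in Python. This is a generic sketch, not the Shuttle's actual voting logic: the `vote` function, the strict-majority rule and the string-valued "proposals" are all assumptions made for the example.

```python
from collections import Counter
from typing import Hashable, List, Optional, Tuple


def vote(proposals: List[Hashable]) -> Tuple[Optional[Hashable], List[int]]:
    """Return (winning proposal, indexes of computers that disagreed).

    A proposal wins only with a strict majority; otherwise (None, []) is
    returned and no computer is singled out as faulty.
    """
    if not proposals:
        return None, []
    value, count = Counter(proposals).most_common(1)[0]
    if count * 2 <= len(proposals):
        return None, []
    dissenters = [i for i, p in enumerate(proposals) if p != value]
    return value, dissenters
```

Given four computers where one proposes something different, the majority answer is used and the odd one out is flagged, which is the "work around the failed system" step in miniature.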
This computer system has managed to keep our astronauts alive on every mission except the two that suffered catastrophic mechanical failures. In the second of these (Columbia), the computers kept the craft flying until it broke apart completely.
I say not bad for a system designed over 20 years ago!
nice systems (Score:2)
too late (Score:3, Funny)
All about dependability (Score:2)
I guess if you work this out down to a low enough level, including the hardware, you can actually make the system heal itself.
You could probably start at the root of the whole system, power, and build your way up from there in a sort of tree. However, other environmental issues for your system could exist that make a power failure seem like Christmas.
It could
IBM's been there, done that (Score:3, Informative)
The current zSeries machines come with 16 CPUs and L2 & L1 packaged together on a board.
But only 12 CPUs are used.
Each "CPU" is actually two CPUs and a comparator. When the two CPUs come up with different answers, the CPU is shut down and processing is taken over by one of the four free CPUs on the board.
You will never know it happened until you run one of the maintenance utilities.
In the way of IBM, this technology will probably appear on top-end pSeries (AIX/Linux) and iSeries boxes in a couple of years.
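The two-CPUs-and-a-comparator scheme can be mimicked in software to show the control flow. This Python sketch is purely illustrative and says nothing about how the real hardware is built: `LockstepUnit`, `ComparatorMismatch` and `execute` are invented names, and the "cores" are ordinary functions.

```python
from typing import Callable, List


class ComparatorMismatch(Exception):
    """Raised when the two cores of a unit disagree."""


class LockstepUnit:
    """One logical CPU: two cores running the same work, plus a comparator."""

    def __init__(self, core_a: Callable[[int], int], core_b: Callable[[int], int]):
        self.cores = (core_a, core_b)
        self.healthy = True

    def run(self, x: int) -> int:
        a, b = (core(x) for core in self.cores)
        if a != b:
            self.healthy = False  # shut this unit down
            raise ComparatorMismatch("cores disagree")
        return a


def execute(units: List[LockstepUnit], spares: List[LockstepUnit], slot: int, x: int) -> int:
    """Run x on units[slot]; on a mismatch, swap in a spare and retry."""
    while True:
        try:
            return units[slot].run(x)
        except ComparatorMismatch:
            if not spares:
                raise  # no free units left, so the fault becomes visible
            units[slot] = spares.pop()
```

The caller of `execute` gets a correct answer and never sees the swap, which matches the "you will never know it happened" behaviour described above. Only the shrinking spare list records that anything went wrong.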
IBM Autonomic Computing (Score:1)
It's not all pie in the sky either - they've already released preliminary Autonomic Computing Toolkits as part of their Emerging Technologies Toolkit [ibm.com]. Start by looking at the Logging and Trace components, and then maybe look at the Solution Install pieces - they underpin the whole framework.
It will take a generation, or two (10-15 yea