Fault Tolerant Shell

Posted by michael
from the does-it-correct-typos? dept.
Paul Howe writes "Roaming around my school's computer science webserver I ran across what struck me as a great (and very prescient) idea: a fault tolerant scripting language. This makes a lot of sense when programming in environments that are almost fundamentally unstable, i.e. distributed systems etc. I'm not sure how active this project is, but it's clear that this is an idea whose time has come. Fault Tolerant Shell."

  • by phaze3000 (204500) on Monday March 15, 2004 @06:08AM (#8566856) Homepage
    IMO (as someone who works on clustered systems for a living) you're looking at this from the wrong point of view. A clustered shell is useful only if the system it is running on top of is inherently unstable.

    The real benefit is in having a system which is sufficiently distributed that any program running on top of it can continue to do so despite any sort of underlying failure.

  • by DavidNWelton (142216) on Monday March 15, 2004 @06:16AM (#8566879) Homepage
    ... or probably Perl or Python, either.

    It doesn't actually seem to grok the commands that are being run, so something like

    proc try {times script} {
        if { [catch {uplevel 1 $script} err] } { cleanup ; retry }
    }

    is all that's needed (of course to do it right you'd need a bit more, but still...).

    try {5 times} {

    Although Tcl is a bit lower level, and would require you to do exec ls, you could of course wrap that too so that all commands in the $script block would just be 'exec'ed by default.

    In any case, better to use a flexible tool that can be tweaked to do what you need than to write highly specialized tools.
  • by heldlikesound (132717) on Monday March 15, 2004 @06:21AM (#8566894) Homepage
    on a loosely configured network... Not saying this tool doesn't seem interesting, but it seems ripe for use in DoS attacks...
  • by MrIrwin (761231) on Monday March 15, 2004 @06:24AM (#8566906) Journal
    The idea of being able to time out and handle exceptions in scripts is a great one... assuming you want to use scripts. I think most people end up resorting to Perl, Python or whatever for anything more complex. But perhaps with this facility scripts would be more useful?

    That brings me to a related topic. I build factory-wide systems, with e.g. automatic warehouses and whatever in the middle. I do a lot of stuff with VB6, not because it is fault tolerant but because it is 'fix tolerant'. During the commissioning phase I can leave a program running in the debugger and, if it freaks out, I can debug, fix and test by stepping forwards and **backwards** in the function that caused the hitch, and then continue to run where I left off. Many minor problems get fixed on the fly without users even realizing anything was amiss.

    In every other respect (syntax, structure, error trapping etc.) VB6 is a disaster and not really suited to these types of projects, so the fact that I use it is a measure of how important this feature is. Like the fault tolerant shell, it is a 'non-pure' extension insofar as purists say it should not be necessary, but in practice it is a godsend. Does anybody know an alternative to VB6 in this respect?
  • Re:Bad Idea (Score:2, Interesting)

    by Anonymous Coward on Monday March 15, 2004 @06:28AM (#8566923)
    Well, your idea is well-meaning, but it is a bad idea to cling too closely to the principle of fixing all bugs first. It is a fact of life that some bugs will remain however hard you try (even formal proofs won't notice specification errors), and in a critical production system you need to have some robustness against failures. It's no good if your system up and dies the moment it hits a real bug. IMHO a fault-tolerant shell will be a useful tool in some situations, even if some people end up misusing it.
  • Re:Bad Idea (Score:3, Interesting)

    by Yakman (22964) on Monday March 15, 2004 @07:00AM (#8566991) Homepage Journal
    Think of it along the lines of IE and HTML; if you don't want to close your tags, say your table td and tr tags, it's fine, the IE browser will do it for you.

    According to the HTML 4.01 spec, </td> and </tr> tags are optional. So you can code a standards compliant page without them, as long as you declare your doctype properly.
  • by xlurker (253257) on Monday March 15, 2004 @07:01AM (#8566996) Homepage
    (the concept of fault-tolerant coding encourages sloppy coding. and it makes it harder to see what's actually happening in the script. but that's not what they actually mean.)

    what they seem to essentially want is

    • a try statement and error catching and
    • a fortran like syntax for testing and arithmetic
    I think the authors were a bit misguided. Instead of creating a whole new shell, how about just extending a good existing shell with a new try statement as described?

    it can even be done without extending the shell:

    ( cd /tmp/blabla &&
      rm -rf tmpdir &&
      wget http://some.thing/wome/where
    ) || echo something went wrong

    as for the new syntax of .eq. .ne. .lt. .gt. .to.
    certainly looks like fortran-hugging to me , yuck

    as for integer arithmetic, that can be done either with backticks or with the $[ ] expansion

    % echo $[ 12 * 12 + 10 ]
    154
  • by andersen (10283) on Monday March 15, 2004 @07:02AM (#8566997) Homepage
    "On two occasions I have been asked [by members of Parliament!], 'Pray, Mr.
    Babbage, if you put into the machine wrong figures, will the right answers
    come out?' I am not able rightly to apprehend the kind of confusion of ideas
    that could provoke such a question."
    -- Charles Babbage
  • by quakeslut (107512) on Monday March 15, 2004 @07:08AM (#8567006)
    What do you lose by using something like this?

    Well.. besides pipes of course ;)
    Variable redirection looks just like file redirection, except a dash is put in front of the redirector. For example, suppose that we want to capture the output of grep and then run it through sort:

    grep needle /tmp/haystack -> needles
    sort -< needles

    This sort of operation takes the place of a pipeline, which ftsh does not have (yet).
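
    For comparison, here is a rough plain-sh analogue of the two ftsh lines quoted above: capture grep's output in a variable, check its exit status, and only then hand the text to sort. This is a sketch rather than ftsh syntax, and the function name sorted_matches is made up for illustration.

```shell
#!/bin/sh
# sorted_matches PATTERN FILE   (hypothetical helper, not part of ftsh)
# Capture grep's output in a variable first, so that its exit status can be
# checked before anything is handed to sort.
sorted_matches() {
    pattern=$1
    file=$2
    # The assignment's status is grep's status: 0 = match found,
    # 1 = no match, greater than 1 = error (for example an unreadable file).
    needles=$(grep -- "$pattern" "$file") || return $?
    printf '%s\n' "$needles" | sort
}
```

    Called as sorted_matches needle /tmp/haystack, it prints the sorted matches, or returns grep's non-zero status without running sort at all.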
  • by Anonymous Coward on Monday March 15, 2004 @07:51AM (#8567095)
    ... was mentioned a few months back in one of the magazines I pick up almost monthly (forget which one out of the several it was).

    I think the shell was called dsh. I believe this is the project site:

    Are the aims of this fault tolerant shell and dsh the same? I'm not a programmer, but I'm trying to teach myself *nix system administration.

    Eventually I'm hoping to cluster some older x86 systems I'm going to get at auction together for a Beowulf cluster. It sounds to me like one if not both of these two shells might come in handy!
  • OK, wise guys... (Score:5, Interesting)

    by JAPrufrock (760889) on Monday March 15, 2004 @08:01AM (#8567120)
    I'm working with Grid and ftsh as we speak. I'm a physicist, not a professional coder. I write reasonable code, but I'm no purist. With that said...

    ftsh has great utility in the realm it's written for. Obviously, it's not a basis for installing kernels or doing password authentication. In a Grid (not just distributed) environment, things break for all sorts of reasons all the time. You're dealing with a Friendly Admin on another system, one who may well be unaffiliated with your institution, project or field of study. He doesn't have any particular reason to consult with you about system changes.

    Now you find yourself writing a grid diagnostic or submitter or job manager. One does not need strongly typed compiled languages for this. Shell scripts are almost always more efficient to write, and the speed difference is unimportant. Right now, most Grid submitters are being written in bash or Python or some such. Bash sucks for exception handling of the sort we're talking about. Python does better with its try: statements, but there's room for improvement. ftsh is a good choice for a sublayer to these scripts. One writes some of the machinery that actually interacts with the Grid nodes and supervisors in this easy, clear and flexible form.

    Now there are a lot of specific points to answer:

    One needs a Windows port to be able to make the Grid software we write in Linux available to the poor drones who are stuck with Win boxes.

    This is not a code spellchecker or coding environment. At all.

    This is not a crutch for inadequate programmers. This is a collection of methods to deal with a specific set of recalcitrant problems.

    As I was pointing out before, this is, after all, an unstable system. One is using diverse resources on diverse platforms in many countries at many institutions. I appreciate the comment made by unixbob about operating in heterogeneous environments.

    This isn't a substitute for wget. One uses wget as an example because it's clear.

    The "pull" model breaks down immediately when there is no unified environment, as is described on When you're not the admin, and your software has to be wiped out the minute your job is done, "push" is the only way to do it. This is the case with most Grid computing right now (that I know about)

    All the woe and doom about the sloppy coding and letting the environment correct your deficiencies is... ill-thought-out. That's what a compiler is, folks. Should we all be coding in machine language? :) Use the right tool for the job and save time.

    I do agree, however, that one should indeed hone one's craft. Sloppy coding in projects of importance is inexcusable (M$). There is no reason to stick to strict exception handling, however, in the applications being discussed by ftsh's developers (the same folks who brought you Condor). When code becomes 3/4 exception handling, even when the specific exceptions don't matter, there's a problem, IMHO. :)

  • by divec (48748) on Monday March 15, 2004 @08:04AM (#8567130) Homepage
    The article says:


    cd /work/foo
    rm -rf bar
    cp -r /fresh/data .

    Suppose that the /work filesystem is temporarily unavailable, perhaps due to an NFS failure. The cd command will fail and print a message on the console. The shell will ignore this error result -- it is primarily designed as a user interface tool -- and proceed to execute the rm and cp in the directory it happened to be before.

    That shell script can be improved a lot by using " set -e " to exit on failure, as follows:

    set -e # exit on failure

    cd /work/foo
    rm -rf bar
    cp -r /fresh/data .

    This means that, if any command in the script fails, the script will exit immediately, instead of carrying on blindly.

    The script's exit status will be non-zero, indicating failure. If it was called by another script, and that had "set -e", then that too will exit immediately. This is a little bit like exceptions in some other languages.
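
    A minimal sketch of that behaviour, wrapped in a function so it can be tried safely. The name refresh_workdir and its two parameters are hypothetical stand-ins for the /work/foo and /fresh/data paths in the article's example.

```shell
#!/bin/sh
# refresh_workdir WORKDIR FRESH   (hypothetical helper for illustration)
# The body is a subshell, so "set -e" and "cd" do not leak into the caller.
refresh_workdir() (
    set -e          # leave the subshell at the first failing command
    cd "$1"         # if this fails, the rm and cp below never run
    rm -rf bar
    cp -r "$2" .
)
```

    One caveat: the shell ignores "set -e" inside any command whose status is being tested by if, && or ||. So call refresh_workdir as a plain command and look at $? afterwards, rather than writing "refresh_workdir ... || echo failed".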

  • Re:Bad Idea (Score:5, Interesting)

    by cgenman (325138) on Monday March 15, 2004 @08:18AM (#8567167) Homepage
    It's a well meaning idea, but it would cause more problems than it would solve. It would just encourage sloppy code; people would rationalize "I don't need to fix errors because it doesn't matter", which is a very bad habit to get into when programming, ignoring errors, or even warnings.

    The same logic could be applied to any security system, from the automatic door lock on the front of your house to Airbags in your car. Spell checkers discourage people from learning to spell. Antibiotics prevent the growth of the immune system. Why have a lock on your trigger, if it will encourage you to leave it in a place where your kids can find it.

    The fact of the matter is, if the code works, it's good code. This is a shell scripting language we're talking about here... Not exactly assembly. Programmers would be better off spending more time thinking about the higher structure of their applications and less time hunting down trivial mistakes.

    Of course, I know that this isn't quite what the article is talking about, but it's the principle of the thing. Augmentation would be an improvement.

  • by Air-conditioned cowh (552882) on Monday March 15, 2004 @08:36AM (#8567216)
    Personally, I think my spelling has improved due to spell checkers. I always try to learn from whatever corrections it makes. Maybe other folks do too.

    Also, this isn't about covering up mistakes. I am sure a good script programmer will _always_ assume a command can fail. Using the example of the "cd" command in the article, should I really just assume it worked before removing files? Of course not. How ftsh helps is that the necessary error checking code is made more readable and brief. I still have to trap errors whether I use ftsh or bash; the difference is that ftsh is easier to understand.

    Simply making code less convoluted and more readable is not the same as sloppy programming.

  • by Tei (520358) on Monday March 15, 2004 @09:03AM (#8567331) Journal
    Mostly because Windows has historically lacked good command-line admin tools. It actually has a few, but cmd is not bash, so you have to supplement them.

    Some people (myself included) use bash on Windows on a daily basis. This new stuff could be interesting and maybe useful for the unsafe Windows environment.

  • by Rakshasa Taisab (244699) on Monday March 15, 2004 @09:12AM (#8567371) Homepage
    If you think C++ belongs on that side of the line, then you've either never programmed in C++, or you've written some pretty buggy programs (and are ignorant of it). C++ is kinda like a powered chainsaw: effective and powerful, but if you don't know how to use it you'll end up losing a leg or two.
  • ACID Filesystems (Score:3, Interesting)

    by NZheretic (23872) on Monday March 15, 2004 @09:31AM (#8567458) Homepage Journal
    For a system like this to be truly effective you would need an operating system which supported a truly transactional filesystem.

    After remounting a filesystem with ACID semantics turned on, a process would set a rollback point and then execute a series of commands, with the operating system keeping a record of the changes made to the filesystem by the process and its children. The process would then tell the OS to either commit or roll back the changes.

    This still raises questions about how to deal with two or more competing "transactional" processes that rely on information they have read, which another process then chooses to roll back to an earlier state.

  • by vidarh (309115) <> on Monday March 15, 2004 @10:23AM (#8567767) Homepage Journal
    I guess you didn't READ the article, considering that the example given on the page was specifically an example of pulling a data file, trying multiple hosts in turn.

    The thing is, if you run a distributed system with a thousand servers, and your patch distribution points drop off the network, you don't suddenly want a thousand servers hammering the networks endlessly. You want things like exponential backoff, timeouts after which the system will change behaviour (stop requesting updates until explicitly requested to start again, start triggering alarms etc.), and that is exactly the kind of scenario ftsh makes easy to do in scripts without having to write all the logic yourself for every bloody script.
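
    To give a feel for the logic that would otherwise have to be repeated in every script, here is a minimal sketch of retrying with exponential backoff in plain sh. It is not ftsh syntax; the function name retry_backoff and its argument order are assumptions made for this example.

```shell
#!/bin/sh
# retry_backoff MAX_TRIES FIRST_DELAY COMMAND [ARGS...]
# Hypothetical helper, not ftsh syntax: run COMMAND until it succeeds,
# doubling the sleep between attempts, and give up after MAX_TRIES.
# It uses plain global variables, so it is not safe to nest.
retry_backoff() {
    max_tries=$1
    delay=$2
    shift 2
    attempt=1
    while :; do
        "$@" && return 0              # success: stop retrying
        status=$?                     # exit status of the failed COMMAND
        if [ "$attempt" -ge "$max_tries" ]; then
            return "$status"          # give up; the caller can raise an alarm
        fi
        sleep "$delay"
        delay=$((delay * 2))          # exponential backoff
        attempt=$((attempt + 1))
    done
}
```

    For example, "retry_backoff 5 1 wget http://some.thing/wome/where" would try up to five times, sleeping 1, 2, 4 and 8 seconds in between. Where GNU coreutils is available, each attempt can also be given a time limit by prefixing the command, as in "retry_backoff 5 1 timeout 30 wget ...".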

  • by AxelTorvalds (544851) on Monday March 15, 2004 @10:27AM (#8567786)
    I know that if you need a ton of fault tolerance in your shell scripts you should probably be using a different language, but every time I look at any complex system (not just a single app but a system) there is always shell script glue. More importantly, I've never seen a shell script that checked the return codes of everything; at best they look at a few key components and report on their success or failure. Exceptions would be nice.

    I think perl is where it is because so many people use it as "super script." To me that says, a) we recode all the Bourne and csh and bash in perl or b) we look at why people do shell scripting in perl or other languages and add that to the shell. I couldn't tell you which is right. It's a neat idea though and I'm glad they made it.

    A real example I can think of: I had a test machine that had some kind of ext3 corruption, so it mounted up in read-only mode when it booted. I spent time diagnosing an application error in our application because nothing caught that; these are redhat type startup scripts. I noticed that our app couldn't write logs and began to debug the system. More interestingly, a dozen or so start-up scripts failed to start up critical components and their failure wasn't noticed. If you can't write to the filesystem, you can't create a socket(AF_UNIX) and all sorts of things go tits up then. If that's how you debug, it's only going to get more difficult as you add more and more complexity; you have to detect the lower level failures and report them. Perversely, this wouldn't have been noticed had a different partition been read-only. Turns out that a drive was going bad. Had it been a different partition, it would have been noticed at catastrophic system failure time when the drive died.

    I've done a fair amount of embedded work and there is always a test for new guys: you can tell the new guy (new college grad, whatever) because he skips half or more of the error checking in his code. You know printf returns a value? Funnier still, if you develop something like a consumer app in embedded space, you'll eventually see things like printf fail. We know it never should, but with 20,000+ users in different environments and what not, things like that can and do fail and usually point to a greater problem, like a dead drive or something. Instead of logging or alerting on the critical and unusual printf failure, the app fails in a different way because this printf failed. Heaven forbid that it was sprintf that failed and then you shove bad data into a database or configuration file and not just fail the system but corrupt the data too. In spite of all of that, even veterans will forget error checking at times. It's a common bug, and so having higher level tools to assist, like exceptions in the shell, can only be a good thing.
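
    The same habit carries over to shell scripts. As a small sketch, here is a logging helper that checks whether its own write succeeded, which is the kind of check that would have flagged the read-only filesystem above. The name log_line is hypothetical.

```shell
#!/bin/sh
# log_line FILE MESSAGE   (hypothetical helper for illustration)
# Append MESSAGE to FILE and report, rather than ignore, a failed write,
# for example when the filesystem has been remounted read-only.
log_line() {
    if ! printf '%s\n' "$2" >> "$1"; then
        echo "log_line: cannot write to $1" >&2
        return 1
    fi
}
```

    A start-up script could then refuse to continue with something like "log_line /var/log/myapp.log starting || exit 1", where the log path is only an example.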

  • by vidarh (309115) <> on Monday March 15, 2004 @10:31AM (#8567823) Homepage Journal
    Bzzt. Try again. Where is the exponential backoff? Where is the ability to restrict each contained statement for a specific amount of time? Where is the ability to execute each command at specific intervals?

    Your example only does a fraction of what ftsh does.

  • by The Pim (140414) on Monday March 15, 2004 @10:42AM (#8567934)
    What do you lose by using something like this?

    Well.. besides pipes of course ;)

    Funny you should mention this, because I was going to write something about pipes. Getting pipes right with good error semantics is hard. For all the "just use set -e in bash" weenies out there, try running

    #!/bin/sh -e
    cat nosuchfile | echo hello
    Where's your error?

    If you think about the unix process and pipe primitives, you will see the difficulty. To create a pipeline, you normally fork, create the pipe, fork again, and run the two ends of the pipeline in the two sub-processes. This is scalable to deeply nested pipelines, but has a cost: Only one of the sub-processes is a child of the shell, so only one exit status can be monitored. To work around this, you really need to build a mini-OS environment on top of unix.

    This demonstrates that unix was fundamentally not designed with concern for error semantics (consider Erlang as a diametric example). And this, I'm sure, is why ftsh doesn't have pipes (yet).
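
    As an aside, bash (not plain POSIX sh) does some of this bookkeeping itself. The sketch below assumes a bash that provides the pipefail option and the PIPESTATUS array; the three function names are made up for illustration, and each takes the file to cat as its argument.

```shell
#!/usr/bin/env bash
# Three views of the same pipeline, "cat FILE | echo hello".
# All function names are hypothetical; bash is assumed throughout.

# Default behaviour: the pipeline's status is that of its last stage (echo).
pipe_default() {
    cat "$1" 2>/dev/null | echo hello >/dev/null
}

# With pipefail, the status is non-zero if any stage fails.
# The body is a subshell so the option does not leak into the caller.
pipe_strict() (
    set -o pipefail
    cat "$1" 2>/dev/null | echo hello >/dev/null
)

# PIPESTATUS holds one exit status per stage of the last pipeline.
pipe_statuses() {
    cat "$1" 2>/dev/null | echo hello >/dev/null
    echo "${PIPESTATUS[*]}"
}
```

    With a missing file, pipe_default still reports success while pipe_strict does not, and pipe_statuses prints a non-zero status for cat followed by 0 for echo.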

  • Re:Bad Idea (Score:3, Interesting)

    by JohnFluxx (413620) on Monday March 15, 2004 @11:38AM (#8568457)
    "Nevermind that it will break most any W3C compliant browser on the planet."

    Same problem with English. Most people have a high degree of fault tolerance when parsing natural language, which means any old crap will still just about be readable.

    (sorry - I had to after reading that statement heh)
  • by Moderation abuser (184013) on Monday March 15, 2004 @12:18PM (#8568868)
    I did. Endlessly is good. The network overhead is negligible.

    Check once every 1,2,4,8,16,32,64,whatever,mins *all the time anyway* whether it fails or succeeds and you *absolutely don't* want to have to explicitly tell 1000 machines to start again.

    You simply generalise the update process, get rid of the special cases. In the case of patches, you know you're going to have to distribute them out to clients at some point anyway so have all the clients check once a day, every day. If the distribution server is down for a couple of days it's pretty much irrelevant.

    My error detection code is trivial, the network traffic is negligible unless the job's actually being done, and I still haven't been given a good case for ftsh. I have a good case for a better randomising algorithm within a shell and a decent distributed cron (which is simple BTW), but not for a specifically fault tolerant shell.

    You've got to stop thinking of these things as individual systems. The network is the machine.

  • by Pyrrus (97830) on Monday March 15, 2004 @02:39PM (#8570402) Homepage
    Something perhaps like this?
  • Re:Missing the point (Score:2, Interesting)

    by vague (107055) on Monday March 15, 2004 @02:42PM (#8570434) Homepage
    Read the last post at

    It's not about ignoring errors, it's about the central idea that you'll never, _ever_, be able to write 100% perfect code, and even if you could, your code would be so full of error checking that it would be both unreadable and, as a result, unmaintainable, masking logic bugs and the like. It's better economy to come up with better ways to deal with failure than to try to prevent it altogether. And the final solution will be more stable.

    This is an important realisation: Failure is inevitable, how you deal with it is what matters.
