Fault Tolerant Shell (234 comments)

Posted by michael
from the does-it-correct-typos? dept.
Paul Howe writes "Roaming around my school's computer science webserver I ran across what struck me as a great (and very prescient) idea: a fault tolerant scripting language. This makes a lot of sense when programming in environments that are almost fundamentally unstable, i.e. distributed systems etc. I'm not sure how active this project is, but it's clear that this is an idea whose time has come. Fault Tolerant Shell."
This discussion has been archived. No new comments can be posted.

  • Python (Score:2, Offtopic)

    by derphilipp (745164)
    Also, I would appreciate (not quite the same) an auto-completing Python interpreter and editor (one that can complete methods and objects from modules)... That kind of stuff really increases productivity!
  • by phaze3000 (204500) on Monday March 15, 2004 @06:08AM (#8566856) Homepage
    IMO (as someone who works on clustered systems for a living) you're looking at this from the wrong point of view. A clustered shell is useful only if the system it is running on top of is inherently unstable.

    The real benefit is in having a system which is sufficiently distributed that any program running on top of it can continue to do so despite any sort of underlying failure.

    • by unixbob (523657) on Monday March 15, 2004 @06:28AM (#8566919)
      Doesn't that depend on the definition of clustered, though? Clustered systems can be things like Beowulf clusters. But a collection of standalone web servers behind an HTTP load balancer is also commonly referred to as a web cluster or array.

      IMHO as someone who works in a complex web server / database server environment, there are many interdependencies brought by different software, different platforms and different applications. Whilst 100% uptime on all servers is a nice-to-have, it's a complex goal to achieve and requires not just expertise in the operating systems & web / database server software but an in-depth understanding of the applications.

      A system such as this fault tolerant shell is actually quite a neat idea. It allows for flexibility in system performance and availability, without requiring complex (and therefore possibly error prone or difficult to maintain) management jobs. An example would be a server which replicates images using rsync. If one of the targets is busy serving web pages or running another application, ftsh would allow for that kind of unforeseen error to be catered for relatively easily.
      • "An example would be a server which replicates images using rsync. If one of the targets is busy serving web pages or running another application, ftsh would allow for that kind of unforeseen error to be catered for relatively easily."

        It depends how you organise your systems. If you push to them then yes you need something like ftsh. If you organise them so that they pull updates, pull scripts to execute and arrange those scripts so that they fail safe (as they all should anyway) then you'll have something w
        • Well, that was just an example. If on the other hand the system the images were pulled from was very busy, then the same is true. The problem is that you can't architect for a moving target, and the flexibility that rapidly changing environments require is something which ftsh would be quite useful for.
          • Push doesn't scale. (Score:4, Informative)

            by Moderation abuser (184013) on Monday March 15, 2004 @09:45AM (#8567532)
            The system you pull from is a distribution server, all it does is distribute files. If it's slow, it's slow for all the machines sucking data and you need a bigger infrastructure. If it's down, the client scripts fail safe and do nothing.

            Even here, pull scales better than push. Look at a web server as an example: thousands of machines sucking web pages from a server is not uncommon. Try pushing those pages out to the same number of machines.

            Push methodologies simply don't scale. I've been there, done that, and it's a bad architecture for more than trivial numbers of machines. I'm not the only one to notice:

            http://www.infrastructures.org/bootstrap/pushpull.shtml


          • --
            The Romans didn't find algebra very challenging, because X was always 10


            Does that mean they were (wait for it...) existentialists?

    • by vidarh (309115) <vidar@hokstad.com> on Monday March 15, 2004 @10:50AM (#8568008) Homepage Journal
      If you can set up a distributed system at a reasonable cost where any program can continue to run without caring about an underlying failure, you would be richer than Bill Gates.

      Resources DO become unavailable in most systems. It simply doesn't pay to ensure everything is duplicated, and set up infrastructures that make it transparent to the end user - there are almost always cheaper ways of meeting your business goals by looking at what level of fault tolerance you actually need.

      For most people hours, sometimes even days, of outages can be tolerable for many of their systems, and minutes mostly not noticeable if the tools can handle it. The cost difference in providing a system where unavailabilities are treated as a normal, acceptable condition within some parameters, and one where failures are made transparent to the user can be astronomical.

      To this date, I have NEVER seen a computer system that would come close to the transparency you are suggesting, simply because for most "normal" uses it doesn't make economic sense.

  • Bad Idea (Score:5, Insightful)

    by teklob (650327) on Monday March 15, 2004 @06:10AM (#8566863)
    It's a well meaning idea, but it would cause more problems than it would solve. It would just encourage sloppy code; people would rationalize "I don't need to fix errors because it doesn't matter", which is a very bad habit to get into when programming, ignoring errors, or even warnings
    • Re:Bad Idea (Score:2, Interesting)

      by Anonymous Coward
      Well, your idea is well-meaning, but it is a bad idea to cling too closely to the principle of fixing all bugs first. It is a fact of life that some bugs will remain however hard you try (even formal proofs won't notice specification errors), and in a critical production system you need to have some robustness against failures. It's no good if your system up and dies the moment it hits a real bug. IMHO a fault-tolerant shell will be a useful tool in some situations, even if some people end up misusing it
    • Re:Bad Idea (Score:5, Insightful)

      by Tx (96709) on Monday March 15, 2004 @06:41AM (#8566948) Journal
      I agree. Web browsers were designed to be fault tolerant, and just look at all the horrendously broken crap that passes for HTML out there. Dangerous stuff.
      • You couldn't have said it in better words. I mean, people using IE are used to each and every one of its quirks, even if it happens to be "fault tolerant" by allowing badly formed HTML/JS/CSS/XML/etc to render the pages. Consequently they may think that Mozilla/Opera/Firefox/whatever aren't as good because they may not tolerate badly designed HTML/JS/CSS/etc as quietly.
    • Re:Bad Idea (Score:2, Insightful)

      by Androclese (627848)
      Exactly, you nailed it on the head!

      The only thing I can add at this point is an analogy:

      Think of it along the lines of IE and HTML; if you don't want to close your tags, say your table td and tr tags, it's fine, the IE browser will do it for you.

      Nevermind that it will break most any W3C compliant browser on the planet.

      (insert deity here) help the person that gets used to this style of programming and then joins the real world.
      • Re:Bad Idea (Score:3, Interesting)

        by Yakman (22964)
        Think of it along the lines of IE and HTML; if you don't want to close your tags, say your table td and tr tags, it's fine, the IE browser will do it for you.

        According to the HTML 4.01 spec [w3.org] </td> and </tr> tags are optional. So you can code a standards compliant page without them, as long as you declare your doctype properly.
      • Re:Bad Idea (Score:3, Interesting)

        by JohnFluxx (413620)
        "Nevermind that it will break most any W3C compliant browser on the planet."

        Same problem with english. Most people have a high degree of fault tolerance when parsing natural language, which means any old crap will still just about be readable.

        (sorry - I had to after reading that statement heh)
    • OTOH, if the tool makes scripts simpler then bugs are less likely to be introduced. Why do we all need to keep on reinventing our own libraries and methods for checking an NFS server hasn't timed out (to use the example on the ftsh site)? Why not use a tool that does that stuff for us, so we can concentrate on writing more elegant and functional scripts that are easier to read and debug and quicker to write?
    • Missing the point (Score:5, Insightful)

      by SmallFurryCreature (593017) on Monday March 15, 2004 @07:21AM (#8567033) Journal
      This is not about catching scripting errors. It does not fix your code. It is about catching errors in the environment that scripts are running in.

      Shell scripts should be short and easy to write. I have seen plenty of them fail due to some resource or another being temporarily down. At first people are neat and send an email to notify the admin. When this then results in a ton of emails every time some dodo knocks out the DNS, they turn it off and forget about it.

      Every scripting language has its own special little niche: BASH for simple things, Perl for heavy text manipulation, PHP for creating HTML output. This scripting language is pretty much like BASH but takes failure as a given. The example shows clearly how it works. Instead of ending up with Perl-like scripts to catch all the possible errors, you add two lines and you've got a wonderful small script, which is what shell scripts should be, that is nonetheless capable of recovering from an error. This script will simply retry when someone knocks out the DNS again.

      This new language will not catch your errors. It will catch other people's errors. Sure, a really good programmer can do this himself. A really good programmer can also create his own libraries. Most of us in admin jobs find it easier to use somebody else's code rather than constantly reinvent the wheel.

      • by f0rt0r (636600) on Monday March 15, 2004 @09:55AM (#8567600)
        I think it still will promote bad programming/scripting practices. Many people ( including myself ) started with scripting before moving on to full-fledged programming. What they learned in scripting they carry forward with them into programming, and trust me, I learned to be very meticulous when it comes to interacting with things outside of my scripts control ( such as files ). Every I/O operation should be tested for success. Trying to open a file? Did it work? Ok, try writing to the file...did it work? Open a database connection...did it work? Let the user enter a number...did they enter a valid number? Error handling and input validation is something you just have to learn, like it or not. Something that holds your hand and lets you code while remaining oblivious to the realities of the scripting/programming environment is a bad thing IMHO.

        On a side note for Perl, one thing I always hated were the examples that had something like "open( FH, "file/path" ) || die "Could not open file!" . $!; I mean, come one, you don't want your script to just quit if it encounters an error...how about putting in an example of error handling other than the script throwing up its hands and quitting! LOL.

        Please excuse any grammatical/other typos above, I was on 4 hrs sleep when I wrote this. Thank You.

        • by vidarh (309115) <vidar@hokstad.com> on Monday March 15, 2004 @10:43AM (#8567944) Homepage Journal
          So what you are saying is that programming should be hard, and people should be expected to do it right, or it promotes bad practices.

          Yet we are expected to excuse your grammatical errors and typos. Doesn't that just promote bad practices? Shouldn't we whack you over the head with a baseball bat just to make sure you won't post when you're not prepared to write flawless posts?

          The more work you have to do to check errors, the more likely it is that however vigilant you might be, errors slip past. If you have to check the return values of a 100 commands, that is a 100 chances for forgetting to do the check or for doing the check the wrong way, or for handling the error incorrectly.

          In this case, the shell offers a function that provides a more sensible default handling of errors: if you don't handle them, the shell won't continue executing by "accident" because you didn't catch an error, but will terminate. It also provides an optional feature that lets you easily retry commands that are likely to fail sometimes, where the likely error handling would be to stop processing and retry, without having to write the logic yourself.

          Each time you have to write logic to handle exponential backoff and to retry according to specific patterns is one more chance of introducing errors.

          No offense, but I would rather trust a SINGLE implementation that I can hammer the hell out of until I trust it and reuse again and again than trust you (or anyone else) to check the return code of every command and every function they call.

          This shell does not remove the responsibility for handling errors. It a) chooses a default behaviour that reduces the chance of catastrophic errors when an unhandled error occurs, and b) provides a mechanism for automatic recovery from a class of errors that occur frequently in a particular type of system (distributed systems where network problems DO happen on a regular basis), and thereby leaves developers free to spend their time on more sensible things (I'd rather have my team doing testing than writing more code than they need to).
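As a rough illustration of the default described in point a) above (stop at the first unhandled failure instead of carrying on by accident), here is a minimal Python sketch. It is not ftsh code: `run_all` is a hypothetical helper name, and it assumes each command is supplied as an argument list.

```python
import subprocess


def run_all(commands):
    """Run commands in order and stop at the first one that fails.

    Returns the index of the failing command, or None if all succeeded.
    """
    for index, argv in enumerate(commands):
        result = subprocess.run(argv)
        if result.returncode != 0:
            # Do not continue "by accident" after an unhandled failure.
            return index
    return None
```

A caller would treat any non-None result as "the rest of the script did not run" and decide whether to retry, alert, or give up.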

      • This simply isn't required in a properly organised distributed architecture.

        Write your scripts to fail safe, then don't perform ad-hoc updates, schedule them regularly.

    • Re:Bad Idea (Score:5, Interesting)

      by cgenman (325138) on Monday March 15, 2004 @08:18AM (#8567167) Homepage
      It's a well meaning idea, but it would cause more problems than it would solve. It would just encourage sloppy code; people would rationalize "I don't need to fix errors because it doesn't matter", which is a very bad habit to get into when programming, ignoring errors, or even warnings

      The same logic could be applied to any security system, from the automatic door lock on the front of your house to Airbags in your car. Spell checkers discourage people from learning to spell. Antibiotics prevent the growth of the immune system. Why have a lock on your trigger, if it will encourage you to leave it in a place where your kids can find it.

      The fact of the matter is, if the code works, it's good code. This is a shell scripting language we're talking about here... Not exactly assembly. Programmers would be better off spending more time thinking about the higher structure of their applications and less time hunting down trivial mistakes.

      Of course, I know that this isn't quite what the article is talking about, but it's the principle of the thing. Augmentation would be an improvement.

      • Re:Bad Idea (Score:5, Insightful)

        by Jerf (17166) on Monday March 15, 2004 @10:18AM (#8567729) Journal
        Spell checkers discourage people from learning to spell.

        Done correctly, spellcheckers can be the best spelling-learning tool there is.

        "Correctly" here means the spell-checkers that give you red underlines when you've finished typing the word and it's wrong. Right-clicking lets you see suggestions, add it to your personal dict, etc.

        "Incorrectly" is when you have to run the spell-checker manually at the "end" of typing. That's when people lean on it.

        The reason, of course, is feedback; feedback is absolutely vital to learning and spell-checkers that highlight are the only thing I know of that cuts the feedback loop down to zero seconds. Compared to this, spelling tests in school where the teacher hands back the test three days from now are a complete waste of time. (This is one of many places where out of the box thinking with computers would greatly improve the education process but nobody has the guts to say, "We need to stop 'testing' spelling and start using proper spell-checkers, and come up with some way to encourage kids to use words they don't necessarily know how to spell instead of punishing them." The primary use of computers in education is to cut the feedback loop down to no time at all. But I digress...)

        'gaim' is pretty close but it really ticks me off how it always spellchecks a word immediately, so if you're typing along and you're going to send the word "unfortunately", but you've only typed as far as "unfortun", it highlights it as a misspelled word. Bad program! Wait until I've left the word!
        • The biggest problem with spell checkers is that they don't handle homonyms at all. Further, their handling of misspelled words can be weak as well: how many times have you seen allot where the person meant "a lot." The problem is that the person used the common misspelling alot, which the spell checker identifies as allot. T.f. the person changes to allot instead of correcting to a lot, totally changing the meaning of the sentence (it would actually have been easier to read the misspelled alot).

          To, too
          • Spell checkers are *not* a substitute for knowing how words are spelled.

            Of course not. But using a spell checker means having time to learn about the homonyms, instead of endlessly playing catch up.

            You still predicated your post on "relying" on spell checkers; I'm saying that people learn from good spell checkers. That people can't learn everything from a spell checker is hardly a reason to throw the baby out with the bath water and insist that people use inferior learning techniques anyhow!

            A kid that
    • Re:Bad Idea (Score:3, Insightful)

      by ChaosDiscord (4913)

      It's a well meaning idea, but it would cause more problems than it would solve. It would just encourage sloppy code; people would rationalize "I don't need to fix errors because it doesn't matter", which is a very bad habit to get into when programming, ignoring errors, or even warnings

      You've got it backwards.

      Most shell scripting is quick and dirty; no one checks error codes (mostly because it's a nuisance). FTSH makes it easier to check error codes, in part because the default behavior is to bail on

  • More ideas whose time has come [google.com], including:
    • DRM Helmets
    • Jack Kemp
    • Yankee Go Home
    • Collaborative Dispute Resolution
    • Microchips for Your Pet Parrot! (see page 2 of Google results)
  • by DavidNWelton (142216) on Monday March 15, 2004 @06:16AM (#8566879) Homepage
    ... or probably Perl or Python, either.

    It doesn't actually seem to grok the commands that are being run, so something like

    proc try {times script} {
        if { [catch {uplevel 1 $script} err] } { cleanup ; retry }
    }

    is all that's needed (of course to do it right you'd need a bit more, but still...).

    try {5 times} {
    commands...
    }

    Although Tcl is a bit lower level, and would require you to do exec ls, you could of course wrap that too so that all commands in the $script block would just be 'exec'ed by default.

    In any case, better to use a flexible tool that can be tweaked to do what you need than write highly specialized tools.
    • Here's a Ruby one:

      def college_try(limit, seq = 0, &block)
        begin
          yield
        rescue => e
          # pass the same block along to the next attempt
          college_try(limit, seq + 1, &block) if seq < limit
        end
      end

      college_try(50) {
        begin
          # do some work
        rescue => e
          # do error clean up here
          raise e
        ensure
          # do cleanup that should always run here
        end
      }

      Anyways, I agree with the notion that most popular scripting languages have advanced error handling that is up to the task
    • Bzzt. Try again. Where is the exponential backoff? Where is the ability to restrict each contained statement for a specific amount of time? Where is the ability to execute each command at specific intervals?

      Your example only does a fraction of what ftsh does.
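One of the missing pieces named above, restricting a command to a specific amount of time, can be sketched in a few lines of Python. This is an illustration only, not how ftsh implements it; `run_with_limit` is a hypothetical name, and the sketch relies on the `timeout` parameter of the standard `subprocess.run`.

```python
import subprocess


def run_with_limit(argv, seconds):
    """Run argv, giving up if it takes longer than `seconds`.

    Returns True only if the command finished in time with exit status 0.
    """
    try:
        result = subprocess.run(argv, timeout=seconds)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising TimeoutExpired.
        return False
    except OSError:
        # The command could not be started at all (for example, not found).
        return False
    return result.returncode == 0
```

Collapsing "too slow", "could not start" and "non-zero exit" into a single False keeps the caller's error handling down to one exit condition.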

      • > Your example only does a fraction
        > of what ftsh does.

        yawn, so we didn't post a 100-500 line library in our slashdot comment.

        the point is, this stuff would be trivial to implement in a language like ruby. plus, using a full scripting language you get lots of other useful features like regular expressions, classes, etc, etc

        It's a good idea, but it's a library implemented as a language.
      • No kidding... that's why it's an example and not a full implementation. You can do all those things in Tcl (or Ruby, Perl, etc...). The idea is that instead of creating some one-off shell, you add a neat feature as an extension to an existing tool.
  • by 91degrees (207121) on Monday March 15, 2004 @06:18AM (#8566884) Journal
    This will not improve people's skills. In fact, it willl make them more prone to mistakes, and more likely to get the result that they didn't expect. It's similat to computer spell checkers. Ever since people started relying on these, their spelling has gone way downhill simly because they don't bother thinking. Computer do all the spelling for them. They don;t need a spell checker. They need spelling lessons.

    This si even worse. Computers will try to second guess what the user means, get get it wrong half tyhe time.

    A qualified shell scripter will be not make these mistakes in the first place. Anyone who thinks they need this shell actually just need to learn to spell and to ytype accuratly.
    • Dude - you could have spell-checked your post!
    • by FrostedWheat (172733) on Monday March 15, 2004 @06:27AM (#8566918)
      just need to learn to spell and to ytype accuratly. -- QED - Quite Easily Done

      <Teal'c> Indeed </Teal'c>
    • RTFA (Score:2, Insightful)

      by Anonymous Coward
      You obviously didn't read the article.

      "It [ftsh] is especially useful in building distributed systems, where failures are common, making timeouts, retry, and alternation necessary techniques."

      It doesn't protect you from typos in the script, it handles failures in the commands that are executed.
    • I think you mean our spelling's gone way downhill, and we need spelling lessons.
    • Personally, I think my spelling has improved due to spell checkers. I always try to learn from whatever corrections it makes. Maybe other folks do too.

      Also, this isn't about covering up mistakes. I am sure a good script programmer will _always_ assume a command can fail. Using the example of the "cd" command in the article, should I really just assume it worked before removing files? Of course not. How ftsh helps is that the necessary error checking code is made more readable and brief. I still have to trap
    • Argh. You are the second or third to bring up this silly spell checker analogy.

      The software is not called the "mistake-tolerant" shell. It is the fault tolerant shell. It handles faults like hard drive crashes, network outages, cosmic rays, and yes, probably software bugs as a side effect. Look at the feature set: they are much more geared towards hardware failure than software failure. How does retry or exponential backoff help if a software bug prevents a computation from correctly completing? Actually,

  • by Moderation abuser (184013) on Monday March 15, 2004 @06:21AM (#8566893)
    While, yes, you manage distributed systems from the center, you don't *push* updates, changes, modifications, because it doesn't scale. You end up having to write stuff like this fault tolerant shell, which is frankly backwards thinking.

    Instead, you automate everything and *pull* updates, changes, scripts etc. That way if a system is up, it just works, if it's down, it'll get updated next time it's up.

    I won't go into details but I'll point you at http://infrastructures.org/

    • I guess you didn't READ the article, considering that the example given on the page was specifically an example of pulling a data file, trying multiple hosts in turn.

      The thing is, if you run a distributed system with a thousand servers, and your patch distribution points drop off the network, you don't suddenly want a thousand servers hammering the networks endlessly. You want things like exponential backoff, timeouts after which the system will change behaviour (stop requesting updates until explicitly re
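The "trying multiple hosts in turn" pattern mentioned above can be sketched in Python as follows. This is an illustration under assumptions, not the ftsh example itself: `first_success` and the host names in the usage note are hypothetical, and `attempt` stands for whatever fetch function the caller supplies.

```python
def first_success(candidates, attempt):
    """Call attempt(candidate) for each candidate in turn.

    Returns (candidate, result) for the first call that does not raise.
    If every candidate fails, the last error is re-raised.
    """
    last_error = None
    for candidate in candidates:
        try:
            return candidate, attempt(candidate)
        except Exception as error:
            # Remember the failure and move on to the next candidate.
            last_error = error
    if last_error is None:
        raise ValueError("no candidates given")
    raise last_error
```

For instance, `first_success(["mirror-a.example", "mirror-b.example"], fetch)` would return as soon as one mirror answers, and the caller sees a single exception only when all of them are down.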

      • by Moderation abuser (184013) on Monday March 15, 2004 @12:18PM (#8568868)
        I did. Endlessly is good. The network overhead is negligible.

        Check once every 1,2,4,8,16,32,64,whatever,mins *all the time anyway* whether it fails or succeeds and you *absolutely don't* want to have to explicitly tell 1000 machines to start again.

        You simply generalise the update process, get rid of the special cases. In the case of patches, you know you're going to have to distribute them out to clients at some point anyway so have all the clients check once a day, every day. If the distribution server is down for a couple of days it's pretty much irrelevant.

        My error detection code is trivial, the network traffic is negligible unless the job's actually being done, and I still haven't been given a good case for ftsh. I have a good case for a better randomising algorithm within a shell and a decent distributed cron (which is simple BTW), but not for a specifically fault tolerant shell.

        You've got to stop thinking of these things as individual systems. The network is the machine.
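The "check on a fixed schedule, but randomise it" idea in this comment could look roughly like the Python sketch below. It is one possible reading, not the poster's actual setup: `splay_offset` and `next_run` are hypothetical names, times are plain integer seconds, and hashing the hostname is an assumed way of spreading clients across the interval.

```python
import hashlib


def splay_offset(hostname, interval_seconds):
    """Return a stable per-host offset in [0, interval_seconds).

    Hashing the hostname spreads clients across the interval, so they
    do not all contact the distribution server at the same moment.
    """
    digest = hashlib.sha256(hostname.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % interval_seconds


def next_run(now, hostname, interval_seconds):
    """Return the next scheduled time strictly after `now` for this host."""
    offset = splay_offset(hostname, interval_seconds)
    periods_done = (now - offset) // interval_seconds + 1
    return periods_done * interval_seconds + offset
```

Each client keeps the same slot every period whether the previous check failed or succeeded, so nothing has to be told to start again after an outage.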

  • by heldlikesound (132717) on Monday March 15, 2004 @06:21AM (#8566894) Homepage
    on a loosely configured network, not saying this tool doesn't seem interesting, but it seems prone for use in DOS attacks...
  • by Ritontor (244585) on Monday March 15, 2004 @06:23AM (#8566901)
    how many times have you hacked something together in perl that ended up being relied on for some pretty important stuff, only to find 6 months down the track that there's some condition (db connects fine, but fails halfway through script execution as an example) you didn't consider and the whole thing just collapses in a heap - a nasty to recover heap cause you didn't write much logging code either.

    This would REALLY be useful when you're connecting to services external to yourself - network glitches cause more problems with my code than ANYTHING else, and it's a pain in the arse to write code to deal with it gracefully. i'd really really like to see a universal "try this for 5 minutes" wrapper, which, if it still failed, you'd only have one exit condition to worry about. hey, what the hell, maybe i'll spend a few days and write one myself.
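A universal "try this for 5 minutes" wrapper of the kind wished for above might be sketched in Python like this. It is an assumption-laden illustration rather than anything from ftsh: `try_for` is a hypothetical name, the doubling delay is one common backoff choice, and the `clock` and `sleep` parameters exist only so the function can be exercised without real waiting.

```python
import time


def try_for(seconds, action, first_delay=1.0, max_delay=60.0,
            clock=time.monotonic, sleep=time.sleep):
    """Call action() until it succeeds or `seconds` have elapsed.

    The wait starts at first_delay and doubles after each failure, capped
    at max_delay. Once time is up, TimeoutError is raised, chained to the
    last failure, so the caller has one exit condition to handle.
    """
    deadline = clock() + seconds
    delay = first_delay
    while True:
        try:
            return action()
        except Exception as error:
            remaining = deadline - clock()
            if remaining <= 0:
                raise TimeoutError("gave up after %s seconds" % seconds) from error
            # Never sleep past the deadline.
            sleep(min(delay, remaining))
            delay = min(delay * 2, max_delay)
```

Something like `try_for(300, connect_to_db)` (with a caller-supplied `connect_to_db`) would then either return the connection or raise exactly one kind of error after five minutes.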
    • by SmallFurryCreature (593017) on Monday March 15, 2004 @07:31AM (#8567050) Journal
      This is indeed little more than the wrapper that you describe. Yet most seem to comment on its non-claimed property of fixing the programmer's errors, which it really, really doesn't. In fact it is worse, since this one would happily keep trying to execute a command like "rm -Rf / home/me/tmp".

      I have often had to write such wrappers myself. Sure, even easier/better would have been if somebody added this to, say, BASH as an extension, but perhaps that is not possible.

      How often have you needed to write horrible bash code just to pull data from an unreliable source and ended up either with a script that worked totally blind ("command && command && command &&"), which never reported if it failed for days on end, or with several pages of code just to catch all the damn network errors that could occur?

      I will definitely be giving this little language a try in the near future. Just another tool for the smart sys-admin. (Smart people write as little code as possible. Let others work for you.)

      • Hm... what happens when the first command in your catch says

        rm datafile

        With no -f? This is a failure condition if the condition is false, thus it would throw the code into an infinite failure loop until the timeout.

        Like you're saying though, if it's a programming error, it won't get fixed, and in some ways it's even worse: it adds new dimensions to think about.
  • by humankind (704050) on Monday March 15, 2004 @06:24AM (#8566903) Journal

    All the programmers who need the environment to compensate for their inadequacies, step on one side. All the programmers who want to learn from their mistakes and become better at their craft, get on the other side.

    Most of us know where this line is located.
  • by MrIrwin (761231) on Monday March 15, 2004 @06:24AM (#8566906) Journal
    The idea of being able to time out and handle exceptions in scripts is a great one... assuming you want to use scripts. I think most people end up resorting to Perl, Python or whatever for anything more complex. But perhaps with this facility scripts would be more useful?

    But now I come to a related topic. I build factory-wide systems, systems which have e.g. automatic warehouses and whatever in the middle. I do a lot of stuff with VB6, not because it is fault tolerant but because it is 'fix tolerant'. During the commissioning phases I can leave a program running in the debugger and, if it freaks out, I can debug, fix, and test by iterating forwards and **backwards** in the function that caused the hitch, and then continue to run where I left off. Many minor problems get fixed on the fly without users even realizing anything was amiss.

    In every other respect (syntax, structure, error trapping etc.) VB6 is a disaster and not really suited at all to these types of projects, so the fact that I use it is a measure of how important this feature is. Like the fault tolerant shell, it is a 'non-pure' extension insofar as purists say it should not be necessary, but in practice it is a godsend. Anybody know an alternative to VB6 in this respect?
  • by gazbo (517111) on Monday March 15, 2004 @06:27AM (#8566916)
    They are deleting a number of files on a number of different machines, then downloading an updated version. The implication is that the fault tolerance means a failure is not fatal.

    So what happens if the files are crucial (let's use the toy example of kernel modules being updated): The modules get deleted, then the update fails because the remote host is down. Presumably the shell can't rollback the changes a la DBMS, as that would involve either hooks into the FS or every file util ever written.

    Now I think it's a nice idea, but it could easily lead to such sloppy coding; if your shell automatically tries, backs off and cleans up, why would people bother doing it the 'correct' way and downloading the new files before removing the old ones?
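The 'correct' ordering described above (get the new file first, and only then replace the old one) can be sketched in Python. This is an illustration under assumptions, not part of ftsh: `replace_file` is a hypothetical helper, and `fetch` stands for any caller-supplied function that returns the new file contents as bytes or raises on failure.

```python
import os
import tempfile


def replace_file(target_path, fetch):
    """Write fetch()'s bytes to a temp file, then swap it over target_path.

    If fetch() raises, the existing file at target_path is left untouched.
    """
    directory = os.path.dirname(os.path.abspath(target_path))
    # Create the temp file next to the target so both share a filesystem.
    fd, temp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(fetch())
        # On POSIX, os.replace swaps the file in as one atomic rename.
        os.replace(temp_path, target_path)
    except BaseException:
        # Clean up the partial download and report the original failure.
        if os.path.exists(temp_path):
            os.remove(temp_path)
        raise
```

With this ordering a dead remote host costs nothing worse than a stale file, so retrying later is safe.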

  • by simon_clarkstone (750637) on Monday March 15, 2004 @06:27AM (#8566917)
    ...people start pronouncing "ftsh" as "fetish". Actually, I've started already, just ask the girl sitting next to me. ;-)
  • login (Score:5, Funny)

    by Rutje (606635) on Monday March 15, 2004 @06:56AM (#8566983)
    "Password fairly correct. Root login granted."

  • It seems like a bad example to me, since wget already has a lot of retrying built in.
  • by xlurker (253257) on Monday March 15, 2004 @07:01AM (#8566996) Homepage
    (the concept of fault-tolerant coding encourages sloppy coding. and it makes it harder to see what's actually happening in the script. but that's not what they actually mean.)

    what they seem to essentially want is

    • a try statement and error catching and
    • a fortran like syntax for testing and arithmetic
    I think the authors were a bit misguided. Instead of creating a whole new shell, how about just extending a good existing shell with a new try statement as described?

    it can even be done without extending the shell:

    ( cd /tmp/blabla &&
      rm -rf tmpdir &&
      wget http://some.thing/wome/where
    ) || echo something went wrong

    as for the new syntax of .eq. .ne. .lt. .gt. .to.
    certainly looks like fortran-hugging to me , yuck

    as for integer arithmetic, that can be done either with backticks and expr or with the $[ ] expansion

    % echo $[ 12 * 12 + 10 ]
    154
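
    The $[ ] form is an older bash-specific spelling. As a small sketch, the same sum in the POSIX $(( )) form and with expr in backticks (the variable names are made up for the example):

```shell
#!/bin/sh
# POSIX arithmetic expansion: no external process needed.
total=$(( 12 * 12 + 10 ))
echo "$total"

# The same sum via expr inside backticks; the * must be escaped from the shell.
total_expr=`expr 12 \* 12 + 10`
echo "$total_expr"
```

    Both lines print 154.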
    • Golden hammer (Score:2, Insightful)

      by sangdrax (132295)
      Of course bash can do it as well using the proper constructions. That is not the point. Care should be taken not to view bash as a golden hammer when it comes to shell scripting. The same goes for 'ftsh', of course. It won't try to replace bash for every script out there.

      The author merely thinks it would be nice to have a shell in which such fault-tolerant constructions are natural by design. Just to save people headaches when writing simple scripts which are there to get some job done, not to waste time dea
    • First of all, your comment about fault tolerant coding encouraging sloppy coding is either misguided or just plain stupid. Not catching and handling error conditions correctly IS sloppy coding. Catching and handling error conditions correctly IS fault tolerant coding.

      ftsh provides a mechanism aimed at making it easier to catch and handle error conditions correctly. How does that encourage sloppy coding?

      As for your bash example, it only demonstrates the most basic capability of ftsh - exiting on error. I

  • by andersen (10283) on Monday March 15, 2004 @07:02AM (#8566997) Homepage
    "On two occasions I have been asked [by members of Parliament!], 'Pray, Mr.
    Babbage, if you put into the machine wrong figures, will the right answers
    come out?' I am not able rightly to apprehend the kind of confusion of ideas
    that could provoke such a question."
    -- Charles Babbage
  • I'm sorry, but I can't understand why a Windows port (even if not native) is even attempted. Seems kind of useless in a totally GUI environment. Of course, maybe it's just me?
    • by Anonymous Coward
      Seems kind of useless in a totally GUI environment. Of course, maybe it's just me?

      Uh.. it's just you. You should, y'know, maybe try using Windows 2000 or XP sometime... Windows has a perfectly good command line. Point at the "Start" menu, click "Run" -> type "cmd", and away you go.

      You can turn on command line completion (search for "TweakUI" or "Windows Powertoys", I can't be bothered to link to them). And even pipes work just fine (as they have since the DOS days). For example:

      dir *.txt /s /b > t

    • Mostly because Windows has historically lacked good command line admin tools. It actually has a few, but cmd is not bash, so you have to supplement these.

      Some people (myself included) use bash on Windows on a daily basis; this new stuff could be interesting and maybe useful for the unsafe Windows environment.

    • Perhaps because if you have access to run software remotely on a few hundred desktop computers that "have to" run Windows, you would like to have the same script environment on Windows as on the other platforms you use? Perhaps because automated scripts aren't particularly good at using GUIs?
  • by quakeslut (107512) on Monday March 15, 2004 @07:08AM (#8567006)
    What do you lose by using something like this?

    Well.. besides pipes of course ;)
    Variable redirection looks just like file redirection, except a dash is put in front of the redirector. For example, suppose that we want to capture the output of grep and then run it through sort:

    grep needle /tmp/haystack -> needles
    sort -< needles

    This sort of operation takes the place of a pipeline, which ftsh does not have (yet).
    • by The Pim (140414) on Monday March 15, 2004 @10:42AM (#8567934)
      What do you lose by using something like this?

      Well.. besides pipes of course ;)

      Funny you should mention this, because I was going to write something about pipes. Getting pipes right with good error semantics is hard. For all the "just use set -e in bash" weenies out there, try running

      #!/bin/sh -e
      cat nosuchfile | echo hello
      Where's your error?

      If you think about the unix process and pipe primitives, you will see the difficulty. To create a pipeline, you normally fork, create the pipe, fork again, and run the two ends of the pipeline in the two sub-processes. This is scalable to deeply nested pipelines, but has a cost: Only one of the sub-processes is a child of the shell, so only one exit status can be monitored. To work around this, you really need to build a mini-OS environment on top of unix.

      This demonstrates that unix was fundamentally not designed with concern for error semantics (consider Erlang as a diametric example). And this, I'm sure, is why ftsh doesn't have pipes (yet).
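
      One way to get a checkable status for each stage without pipes, in the spirit of the ftsh variable redirection quoted above, is to buffer the first command's output and run the second command from the buffer. A rough sketch in plain sh follows; checked_pipe is a hypothetical helper name, and each stage is passed as a single command string. (bash also offers `set -o pipefail` and the PIPESTATUS array, which this sketch deliberately does not rely on.)

```shell
#!/bin/sh
# Sketch: run "producer | consumer" as two separately checked steps by
# buffering the producer's output in a temporary file.

# checked_pipe PRODUCER_CMD CONSUMER_CMD
# Each argument is a command line given as a single string.
checked_pipe() {
    buf=$(mktemp) || return 1

    # Stage 1: the producer's exit status is visible here.
    if ! sh -c "$1" > "$buf"; then
        rm -f "$buf"
        return 1
    fi

    # Stage 2: the consumer reads the buffered output; its status is returned.
    status=0
    sh -c "$2" < "$buf" || status=$?
    rm -f "$buf"
    return "$status"
}
```

      With this, `checked_pipe 'cat nosuchfile' 'echo hello'` fails at the first stage and never runs echo, at the cost of losing the streaming behaviour of a real pipe.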

  • by fruity1983 (561851) on Monday March 15, 2004 @07:18AM (#8567028)
    In monopolistic America, you tolerate faulty shell.
  • Not good (Score:4, Funny)

    by Molina the Bofh (99621) on Monday March 15, 2004 @07:39AM (#8567070) Homepage
    joshua:~#rm -Rf //tmp
    Probable typing error detected. Parsed as rm -Rf / /tmp

    • *poof*
      I see you're trying to vaporize your UNIX system. Would you like to:
      • Pay SCO $600 for rights to your Linux system
      • Download and Install a W4r3Z version of Windows XP (c'mon, everyone's doin it!)
  • by Anonymous Coward
    ... was mentioned a few months back in one of the magazines I pick up almost monthly (forget which one out of the several it was).

    I think the shell was called dsh. I believe this is the project site: http://dsh.sourceforge.net/

    Are the aims of this fault tolerant shell and dsh the same? I'm not a programmer, but I'm trying to teach myself *nix system administration.

    Eventually I'm hoping to cluster some older x86 systems I'm going to get at auction together for a Beowulf cluster. It sounds to me like one i
  • OK, wise guys... (Score:5, Interesting)

    by JAPrufrock (760889) on Monday March 15, 2004 @08:01AM (#8567120)
    I'm working with Grid and ftsh as we speak. I'm a physicist, not a professional coder. I write reasonable code, but I'm no purist. With that said...

    ftsh has great utility in the realm it's written for. Obviously, it's not a basis for installing kernels or doing password authentication. In a Grid (not just distributed) environment, things break for all sorts of reasons all the time. You're dealing with a Friendly Admin on another system, one who may well be unaffiliated with your institution, project or field of study. He doesn't have any particular reason to consult with you about system changes.

    Now you find yourself writing a grid diagnostic or submitter or job manager. One does not need strongly typed compiled languages for this. Shell scripts are almost always more efficient to write, and the speed difference is unimportant. Right now, most Grid submitters are being written in bash or Python or some such. Bash sucks for exception handling of the sort we're talking about. Python does better with its try: statements, but there's room for improvement. ftsh is a good choice for a sublayer to these scripts. One writes some of the machinery that actually interacts with the Grid nodes and supervisors in this easy, clear and flexible form.

    Now there are a lot of specific points to answer:

    One needs a Windows port to be able to make the Grid software we write in Linux available to the poor drones who are stuck with Win boxes.

    This is not a code spellchecker or coding environment. At all.

    This is not a crutch for inadequate programmers. This is a collection of methods to deal with a specific set of recalcitrant problems.

    As I was pointing out before, this is, after all, an unstable system. One is using diverse resources on diverse platforms in many countries at many institutions. I appreciate the comment made by unixbob about operating in heterogeneous environments.

    This isn't a substitute for wget. One uses wget as an example because it's clear.

    The "pull" model breaks down immediately when there is no unified environment, as is described on infrastructures.org. When you're not the admin, and your software has to be wiped out the minute your job is done, "push" is the only way to do it. This is the case with most Grid computing right now (that I know about)

    All the woe and doom about the sloppy coding and letting the environment correct your deficiencies is... ill-thought-out. That's what a compiler is, folks. Should we all be coding in machine language? :) Use the right tool for the job and save time.

    I do agree, however, that one should indeed hone one's craft. Sloppy coding in projects of importance is inexcusable (M$). There is no reason to stick to strict exception handling, however, in the applications being discussed by ftsh's developers (the same folks who brought you Condor). When code becomes 3/4 exception handling, even when the specific exceptions don't matter, there's a problem, IMHO. :)

    • Nope, pull doesn't break down in heterogeneously managed/owned or even platformed environments. It works best in these environments. The www is an ideal example of such, apt-get is another, seti@home is another, distributed.net, I could go on.

      In Grid based computing environments, jobs queue until they can be started, and as it happens, they tend to be architected as pull based systems whether you see that as a user or not.

      Again, I can't see a reason for ftsh in this case. I'm sure there's a niche for it
  • This was an obscure typo bug I found this morning (after 3 months)

    Argh.

    Wish the shell would have added the (obvious) ' > ' :P
  • by divec (48748) on Monday March 15, 2004 @08:04AM (#8567130) Homepage
    The article says:

    #!/bin/sh

    cd /work/foo
    rm -rf bar
    cp -r /fresh/data .

    Suppose that the /work filesystem is temporarily unavailable, perhaps due to an NFS failure. The cd command will fail and print a message on the console. The shell will ignore this error result -- it is primarily designed as a user interface tool -- and proceed to execute the rm and cp in the directory it happened to be before.

    That shell script can be improved a lot by using " set -e " to exit on failure, as follows:
    #!/bin/sh

    set -e # exit on failure

    cd /work/foo
    rm -rf bar
    cp -r /fresh/data .


    This means that, if any command in the script fails, the script will exit immediately, instead of carrying on blindly.

    The script's exit status will be non-zero, indicating failure. If it was called by another script, and that had "set -e", then that too will exit immediately. This is a little bit like exceptions in some other languages.
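
    The difference can be checked with a small experiment. The sketch below runs the same three-step body in a child sh, with or without -e, from a scratch directory that contains its own bar. The function name run_update and the path /no/such/work/foo are made up for the demonstration.

```shell
#!/bin/sh
# Sketch: compare a script body run with and without "set -e" semantics.

# run_update DIR [SH_FLAG]
# Starts a child shell in DIR whose first step (cd) is guaranteed to fail.
# Pass -e as SH_FLAG to get exit-on-failure behaviour.
run_update() {
    (
        cd "$1" &&
        sh ${2:+"$2"} -c '
            cd /no/such/work/foo   # fails: the directory does not exist
            rm -rf bar             # without -e this runs in the wrong place
            true
        '
    ) 2>/dev/null
}
```

    Calling `run_update "$dir" -e` returns a non-zero status and leaves "$dir/bar" alone, while `run_update "$dir"` returns zero after deleting it.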


  • I love this... (Score:3, Insightful)

    by deego (587575) on Monday March 15, 2004 @08:17AM (#8567160)
    .. the shell just got the cool error-handling lisp has always had (condition-case in elisp, for example). From a lisper's perspective, things will be so much easier now... and I can really try some more scripting..

  • I finished building the shell after I changed the code that uses a non-standard way of printing the usage message, show_help() in src/ftsh.c. In emacs, I replaced ^\(.*\)\\$ with "\1", and then went back and changed the lines that did not end in a backslash, removed the beginning and ending quotes.

    Then it compiled (on Fedora Core 1).

    Then it failed the functions test, because my computer does not have the file /etc/networks. For a fault tolerant shell, it does not seem very tolerant of my machine! Aft

  • by WetCat (558132) on Monday March 15, 2004 @09:25AM (#8567421)
    Erlang (http://www.erlang.org) has it.
    You can have multiple linked interpreters and
    even a fault-tolerant database!
    It is a scripting language.
    From the FAQ:
    1.1. In a nutshell, what is Erlang?
    Erlang is a general-purpose programming language and runtime environment. Erlang has built-in support for concurrency, distribution and fault tolerance. Erlang is used in several large telecommunication systems from Ericsson. The most popular implementation of Erlang is available as open source from the open source erlang site.

  • ACID Filesystems (Score:3, Interesting)

    by NZheretic (23872) on Monday March 15, 2004 @09:31AM (#8567458) Homepage Journal
    For a system like this to be truly effective you would need an operating system which supported a truly transactional filesystem.

    After remounting a filesystem with ACID on, a process would set a rollback point and execute a series of commands, with the operating system keeping a record of the changes to the filesystem made by the process and its children. The process would then tell the OS to either commit or roll back the changes.

    This still raises questions on how to deal with two or more competing "transactional" processes which rely on information they have read that another process then chooses to roll back to an earlier state.

  • by AxelTorvalds (544851) on Monday March 15, 2004 @10:27AM (#8567786)
    I know that if you need a ton of fault tolerance in your shell scripts you should probably be using a different language, but every time I look at any complex system, not just a single app but a system, there is always shell script glue. More importantly, I've never seen a shell script that checked the return codes of everything; at best they look at a few key components and report on their success or failure. Exceptions would be nice.

    I think perl is where it is because so many people use it as "super script." To me that says, a) we recode all the Bourne and csh and bash in perl or b) we look at why people do shell scripting in perl or other languages and add that to the shell. I couldn't tell you which is right. It's a neat idea though and I'm glad they made it.

    A real example I can think of: I had a test machine that had some kind of ext3 corruption, and so it mounted up in read-only mode when it booted. I spent time diagnosing an application error in our application because nothing caught that; these are redhat type startup scripts. I noticed that our app couldn't write logs and began to debug the system. More interestingly, a dozen or so start-up scripts failed to start up critical components and their failure wasn't noticed. If you can't write to the filesystem, you can't create a socket(AF_UNIX) and all sorts of things go tits up then. If that's how you debug, it's only going to get more difficult as you add more and more complexity; you have to detect the lower level failures and report them. Perversely, this wouldn't have been noticed had a different partition been read-only. Turns out that a drive was going bad. Had it been a different partition, it would have been noticed at catastrophic system failure time when the drive died.

    I've done a fair amount of embedded work and there is always a test for new guys: you can tell the new guy (new college grad, whatever) because he skips half or more of the error checking in his code. You know printf returns a value? Funnier still, if you develop something like a consumer app in embedded space, you'll eventually see things like printf fail. We know it never should, but with 20,000+ users in different environments and what not, things like that can and do fail and usually point to a greater problem, like a dead drive or something. Instead of logging or alerting on the critical and unusual printf failure, the app fails in a different way because this printf failed. Heaven forbid that it was sprintf that failed and then you shove bad data into a database or configuration file and not just fail the system but corrupt the data too. In spite of all that, even veterans will forget error checking at times. It's a common bug, and so having higher level tools to assist, like exceptions in the shell, can only be a good thing.

  • given his example:
    cd /work/foo
    rm -rf bar
    cp -r /fresh/data .

    would this not suffice:
    cd /work/foo && rm -rf bar && cp -r /fresh/data .

    my understanding of && was that it only executes if the previous command didn't throw some sort of error. i understand it's not as powerful as what he's talking about, but there is some degree of fault tolerance there.

    secondly, i don't know about you, but i would be very uncomfortable with something that tries a few thousand times or for a particular amount
  • by ChaosDiscord (4913) on Monday March 15, 2004 @01:19PM (#8569496) Homepage Journal

    What's with all of the people claiming that FTSH will ruin the world because it makes it easier to be a sloppy programmer. Did you freaking read the documentation?

    To massively oversimplify, FTSH adds exceptions to shell scripting. Is that really so horrible? Is line after line of "if [ $? -eq 0 ]; then" really an improvement? Welcome to the 1980s, we've discovered that programming languages should try to minimize the amount of time you spend typing the same thing over and over again. Human beings are bad at repetitive behavior; avoid repetition if you can.

    Similarly, FTSH provides looping constructs to simplify the common case of "Try until it works, or until some timer or counter runs out." Less programmer time wasted coding Yet Another Loop, fewer opportunities for a stupid slip-up while coding that loop.
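
    For comparison, this is roughly the Yet Another Loop that has to be hand-written in plain sh. It is a sketch only: retry is a hypothetical helper, not ftsh syntax, and it covers just the counter case with a doubling delay, not an overall time limit.

```shell
#!/bin/sh
# Sketch of "try until it works, or until a counter runs out" in plain sh.

# retry MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
retry() {
    max=$1
    delay=$2
    shift 2

    attempt=1
    while :; do
        if "$@"; then
            return 0                   # success: stop retrying
        fi
        if [ "$attempt" -ge "$max" ]; then
            echo "giving up after $attempt attempts: $*" >&2
            return 1
        fi
        sleep "$delay"
        delay=$(( delay * 2 ))         # back off: double the wait each time
        attempt=$(( attempt + 1 ))
    done
}
```

    A call such as `retry 5 1 wget -q http://some.thing/wome/where` would make up to five attempts, sleeping 1, 2, 4 and 8 seconds between them.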

    If you're so bothered by the possibility of people ignoring return codes it should please you to know that FTSH forces you to appreciate that return codes are very uncertain things. Did diff return 1 because the files are different, or because the linker failed to find a required library? Ultimately all you can say is that diff failed.

    Christ, did C++ and Java get this sort of reaming early on? "How horrible, exceptions mean that you don't have to check return codes at every single level."
