Perl Programming

Next Generation Regexp 248

prostoalex writes "Jeffrey E. F. Friedl, author of the newly published 2nd edition of Mastering Regular Expressions, wrote a feature article for O'Reilly Network on the recent innovations in the regular expression world. You'd think that such an area as regular expressions would be fairly stable, but according to the author, 'when I started to work on the second edition of Mastering Regular Expressions and started refocusing on the field, I was rather shocked to find out how much had really changed'. The article's behind-the-scenes purpose is apparently to push a new book that O'Reilly published this month, but it has great educational value for anyone involved with practical extracting and reporting."
This discussion has been archived. No new comments can be posted.

  • by Jobe_br ( 27348 ) <bdruth.gmail@com> on Wednesday July 17, 2002 @05:16PM (#3904749)

    I particularly like this bit:

    A full chapter on .NET-specific regex issues helps to clarify things, and helps to make up for the exceedingly poor documentation that Microsoft provides with the package.
    Nice to see that things haven't changed much ;)
    • by Rui del-Negro ( 531098 ) on Wednesday July 17, 2002 @05:31PM (#3904855) Homepage
      Microsoft's documentation reads like a novel compared to IBM's. The typical IBM manual has the following format:

      PAGE 1:

      [COMMAND1] is executed by typing the word [command1] followed by the argument string, followed by enter. The argument string consists of a sequence of non-whitespace characters separated by whitespace characters.

      [COMMAND2] is executed by typing the word [command2] followed by the argument string, followed by enter. The argument string consists of a sequence of non-whitespace characters separated by whitespace characters.

      [COMMAND3] is executed by typing the word [command3] followed by the argument string, followed by enter. The argument string consists of a sequence of non-whitespace characters separated by whitespace characters.

      PAGE 2:

      THIS PAGE IS INTENTIONALLY LEFT BLANK

      ...and so on and so on.

      Regarding this last IBM tradition (that others have tried to copy but few have truly mastered), the Spruce DVD Maestro manual has a page with the following text:

      Blank page.
      (mostly)

      RMN
      ~~~
      • The only computer books i've ever read which actually read well were "Upgrading and Repairing PC's" (So much so i wrote the author) and "The practice of system and network administration".

        If only all books could be written as well.. *sigh*...

        In-depth... summary. In-depth... Summary.

        • Who are you and where did you get my brain?

          The first made me a tech; the second is making me an admin. All the books I've read in between have been MS GUI crap or warmed over help files and TechNet articles.

          (Yeah, yeah, I know it ain't Linux...but it pays the bills)
          • That book made me a tech in '98, along with some other studying it got me my A+. I'm now a UNIX Admin, and currently re-reading that book, just for fun. It's interesting to compare the Intel processors with a Sparc or RS/6000 on an internal level..

            Oh, and Linux has about a 10% chance of paying your bills. Have a major enterprise skill set, say, Windows or AIX or Solaris.. and have linux as a secondary skill. Things may be different where you live, but in midsouth.us, Linux is currently considered "dot-commish" and corps are steering away from it. Things are bound to change when the Itanium comes out and PC Unix means something again.

      • What really amazes me is how IBM manages to mangle man pages.

        Apparently the traditional man pages weren't down to IBM standards, so IBM actually paid someone to rewrite them.

        In order to get man pages that actually have useful information I now have to surf the web. The ones included with AIX 4.3 are so damn useless and content-free that they're actually misleading at times.
    • by TheViffer ( 128272 ) on Wednesday July 17, 2002 @05:42PM (#3904918)
      "I see that you are writing a regular expression"
    • From the article --

      Whether you love Microsoft or hate it, there's no denying the popularity of Visual Basic. With the regular-expression package in the .NET Framework, Microsoft provides a package that can be used by VB.NET, C#, Visual C++, and any other language that wants to link to it -- even Python and Perl! The consistency is appealing, but even more important is the package itself: it's powerful and fast, and it can hold its head up high next to Perl or any other regex package out there.

      VB's regex syntax is exactly like Perl's. In fact, when I started working with regexes in VB and I couldn't find something in the documentation I would look it up in one of the O'Reilly Perl books. Much to my "shock", I could do everything Perl regexes could do, even the things that weren't in the documentation.

      I strongly suspect Microsoft took full advantage of Perl's "artistic license" when they came up with their regex engine.

  • by N8F8 ( 4562 ) on Wednesday July 17, 2002 @05:17PM (#3904761)
    Amazon has slipped the shipping date twice. I don't know about you, but this book [amazon.com] is definitely a "Must Have".
  • by Anonymous Coward on Wednesday July 17, 2002 @05:20PM (#3904777)
    Perl6 is going to radically change regular expressions as well. I guess the term "regular expression" is pretty vague/useless these days. You have to identify the language _and_ its revision to get an accurate idea of the regexp feature set you're dealing with. Just throw some variables and control structures into regexp and we'll have a full-blown extremely cryptic language. Maybe we need a RegExp Institute of Excellence with yearly meetings in Sweden or something.
  • by Anonymous Coward on Wednesday July 17, 2002 @05:20PM (#3904780)
    Other than to tell us what is different between the two books. After reading the article I walked away with no general knowledge that was useful in using regular expressions, or what might be coming, or where we came from.

    It is a slightly wordy advertisement for why you should upgrade. The fact that it was foisted on us as something else annoys me, as I spent time reading it.

    I know, a slashdot reader that actually reads linked stories is such a minority, but come on, quit stuffing articles with advertising. Aren't the ads in the middle of a page enough?
  • what about perl 6? (Score:5, Interesting)

    by jbennetto ( 41159 ) on Wednesday July 17, 2002 @05:21PM (#3904787)
    He doesn't even mention the radical changes to regexps in Perl 6, as described in the recent Apocalypse 5 [perl.com] and Synopsis 5 [perl.com].
    • by tswinzig ( 210999 )
      He doesn't even mention the radical changes to regexps in Perl 6, as described in the recent Apocalypse 5 [perl.com] and Synopsis 5 [perl.com].

      If you could write and use a Perl 6 program right now, maybe he'd include a chapter on it in his book.

      This article is basically an overview of his book. His book doesn't cover Perl 6 regex's. Why should it? Perl 6 isn't even done yet, and so everything new for Perl 6 could change by the time it comes out.
      • If you could write and use a Perl 6 program right now, maybe he'd include a chapter on it in his book.

        heh [develooper.com].
        • From the email:

          it's now Turing-complete, if you have a Parrot engine and a bit of spare time. Call it a primitive "demo version" of some of Perl 6's features.

          So I reiterate... "if you could write and USE a Perl 6 program right now, maybe he'd include a chapter on it in his book."

          heh.
          • And you can. Your definition of use involves production deployment, does it? Authors of software-related books are well used to using pre-alpha versions of software for research material. I'm sure he would not have as hard a time as you think.
  • by Shevek ( 6397 )
    That is one of the most contentless articles I have seen in a long time.

    A regex is a type 3 grammar. Type 3 grammars haven't really changed since Chomsky's time.

    The smartarses will now proceed to point out that
    a) Perl is actually limited type 2
    b) Some change no one knows or cares about was made to some definition of the Chomsky hierarchy in nineteen dumdy-dum.

    Foo.
    • A regex is a type 3 grammar. Type 3 grammars haven't really changed since Chomsky's time.
      The smartarses will now proceed to point out that
      a) Perl is actually ...
      ... using the phrase "regular expression" to describe something quite different from "the stuff that's computationally equivalent to a finite state machine" or "the kind of thing Kleene worked on"; imprecise, but most people know what you mean when you say it.
    • Maybe to a certain small class of people, "regular expression" means what you want it to mean. To 99.99% of the people who use the phrase, it means what the book describes, and those things have changed considerably.

      Many precise mathematical or scientific terms have different meanings to laymen. What is a positive number? I'm sure I learned whether 0 is a positive number way back when, but right now it simply doesn't matter. Context is usually good enough, and when not, > and >= work wonders. Quantum leap as used by mere mortals has the meaning of incredible revolutionary exciting change, but scientifically, it means the smallest possible change.

      So foo to you.
    • by Get Behind the Mule ( 61986 ) on Wednesday July 17, 2002 @06:09PM (#3905072)
      That is one of the most contentless articles I have seen in a long time.

      A regex is a type 3 grammar. Type 3 grammars haven't really changed since Chomsky's time.
      You get a B-, Bunky. And here's your cookie.

      After you've finished your undergrad CS theory class, you might go on to discover that implementations of regexes under various paradigms and in the various languages have extremely rich variety regarding syntax, semantics and efficiency. This isn't about the pristine theory of Prof. Chomsky, but about the actual use of regexes as programming constructs, and that's a tremendously complex subject. Friedl's book in the first edition is one of the best I've ever seen that has tackled such complexity and made it accessible and useful for the everyday business of programming.

      The article indicates that the practical use of regexes, far from stagnating since Chomsky's time, continues to evolve and grow. That's only "contentless" if you're stuck in the ivory tower and don't intend to leave.
    • Regular expressions aren't theoretically interesting anymore. Regexps, in the sense of a way of specifying regular (and some non-regular) expressions, show significant change over time. In much the same way, English isn't theoretically different from Indo-European, but you won't get very far using only Indo-European these days.
  • at some point... (Score:4, Interesting)

    by g4dget ( 579145 ) on Wednesday July 17, 2002 @05:26PM (#3904819)
    Beyond a certain degree of complexity, it really doesn't make much sense anymore to use regular expressions--a simple built-in parser generator with executable annotations is both clearer and more powerful. Parser generator syntax allows comments, whitespace, with a simple, fairly standard syntax.

    Perl and other languages should leave "good enough" alone when it comes to regular expressions and instead just make it easy to put chunks of grammars into programs.

    • by joshv ( 13017 ) on Wednesday July 17, 2002 @05:40PM (#3904911)
      Beyond a certain degree of complexity, it really doesn't make much sense anymore to use regular expressions--a simple built-in parser generator with executable annotations is both clearer and more powerful. Parser generator syntax allows comments, whitespace, with a simple, fairly standard syntax.

      Yes, regular expressions should be used to find particular patterns in text and perform basic manipulations on them. Beyond a certain point of complexity it really doesn't make sense to perform more complex manipulations. Get the information you want out of the string using a regular expression, then manipulate it in code.

      One has a feeling that regexp engines are just becoming programming languages in and of themselves - the only difference being that the 'program' consists of a string of cryptic single character commands, and the input is limited to a single string.

      -josh
      • One has a feeling that regexp engines are just becoming programming languages in and of themselves [...]

        Not true. Yet.

        Perl 5 regexes can solve NP-hard problems, but they're not quite Turing complete. However, they require only four additional stack operators to do that.

        Personally, I'm waiting for the first Perl regex to become sentient.

        • Feb 20, 2042 - The day that the first true sentient artificial intelligence is created.
          Feb 21, 2042 - The day it gets converted into a Perl one-liner.
    • Beyond a certain degree of complexity, it really doesn't make much sense anymore to use regular expressions... Parser generator syntax allows comments, whitespace, with a simple, fairly standard syntax ...
      ... and (as you'd certainly know if you'd read either edition of Friedl's book) that's also true of Perl 5 "regular expressions"; and Friedl strongly encourages you (e.g., by example) to write complicated regular expressions that way.
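
For anyone who hasn't seen the commented style mentioned above, here is a minimal sketch. It uses Python's re.VERBOSE flag rather than Perl's /x modifier, and the "key = value" format, the KEY_VALUE name, and the parse_line helper are all made up for illustration.

```python
import re

# A pattern written in "extended" style: whitespace in the pattern is
# ignored and '#' starts a comment, so each piece is documented in place.
KEY_VALUE = re.compile(
    r"""
    ^\s*
    (?P<key>[A-Za-z_]\w*)   # an identifier-like key
    \s*=\s*                 # equals sign, optional surrounding spaces
    (?P<value>.*?)          # the value, as short as possible
    \s*$                    # ignore trailing whitespace
    """,
    re.VERBOSE,
)

def parse_line(line):
    """Return (key, value) for a 'key = value' line, or None if it does not match."""
    m = KEY_VALUE.match(line)
    return (m.group("key"), m.group("value")) if m else None
```

Calling parse_line("  name = Friedl  ") should give back the key and the trimmed value, and a line without an equals sign gives None.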
  • by revscat ( 35618 ) on Wednesday July 17, 2002 @05:28PM (#3904833) Journal

    Over the course of my career I have come to the rather firm opinion that you are not worth much as a coder if you do not know regular expressions. I don't care what language(s) you're proficient in, or if you've memorized every single design pattern the GoF has ever conceived, or do 4 foot by 6 foot UML diagrams in your head. If you can't do regexps then you're missing a basic skill. I bought Friedl's book a couple of years ago, and although I wound up not using many of the Perl related stuff the rest of the book helped me out immensely.

    A programmer without knowledge of regular expressions is like a carpenter without a hammer.

    • by Anonymous Coward on Wednesday July 17, 2002 @05:47PM (#3904952)
      A programmer without knowledge of regular expressions is like a carpenter without a hammer.

      If ever there was an apt analogy of regular expressions - that's it! They make everything seem like a nail ;).
    • Where are moderator points when you need 'em?

      To those who can't read (or write) them, regular expressions look like line noise. But once you learn to read them you can condense whole paragraphs of spaghetti conditionals into a single, clear (to the initiated), terse line.

      For manipulating strings of characters, they are probably the single most important innovation of the last 20 years.
      • Re:Mod parent up! (Score:3, Informative)

        by mikec ( 7785 )
        Regular expressions were certainly an important innovation, but they're a lot more than 20 years old. They were first studied by Kleene in the mid-1950's. The first algorithm to translate them into DFA's was invented in about 1960. Lex was written in the mid 70's.
    • Perhaps if you are looking for perl programmers who will need to be doing a lot of textual processing, but that's definitely not the case in other areas.

      I prefer to work with people who don't do a lot of regex, because they're less likely to use them for everything. I haven't worked on a large project that used regular expressions in years. I feel pretty good about that.

      Sure, I've used them in a couple small scripts for parsing text, but if you see the majority of programming requiring regex, you definitely need to put your hammer down and pick up a Makita.

      • Sure, I've used them in a couple small scripts for parsing text, but if you see the majority of programming requiring regex, you definitely need to put your hammer down and pick up a Makita.

        Well, I am certainly not advocating the broad use of regexps in application programming, even though it has been demonstrated to be possible. For me, regexps are an important tool in solving side issues/behind the scenes work, such as formatting a series of configuration files in a given manner, or making broad changes to a set of HTML files, and so forth. I don't do Perl, and don't really like to if I can avoid it, but I still use regular expressions on a daily basis, and have found them to be immensely helpful.

    • I know regular expressions, but funnily enough I almost never need them. Occasionally I will do a regexp search if an exact search is not good enough. I don't have Perl installed, and I can't say I have ever needed it.
      I guess they are OK if you do a shitload of text processing, but my philosophy is that data should be processed in native (i.e. binary) form and text should only be used for interchange purposes. Even in that case, you can use text "protocols" such as XML, for which regexps are useless. So... If you have a buttload of (fairly) unstructured data to import... Knock yourself out. It doesn't happen to me. Text processing just isn't an issue for me. I don't think that makes me any less of a coder. My domain is simply different to yours.
    • HEAR YE, HEAR YE!

      You speak WISDOM...

      I remember a while back, one of my clients needed to move a bunch of dns records from one server to another. Took me ~ 45 minutes to write a php shell script using REGEX to create new bind zone records for over 300 domains, and convert them - records intact, complete, ready to restart named.

      This poor client had paid somebody else to do it, they spent several DAYS at it and there were still lots of (human) mistakes.

      And, this wasn't complicated stuff!

      Any programmer who doesn't know regex is crippled!
      • I remember a while back, one of my clients needed to move a bunch of dns records from one server to another. Took me ~ 45 minutes to write a php shell script using REGEX to create new bind zone records for over 300 domains, and convert them - records intact, complete, ready to restart named.

        Forty five minutes? Wow. Had you been using djbdns [cr.yp.to], you could have been done in thirty seconds. The BIND zone file format is needlessly complex.
    • (* Over the course of my career I have come to the rather firm opinion that you are not worth much as a coder if you do not know regular expressions. *)

      That can be said about anything. IMO, many OOP fans were simply crappy at procedural/relational programming and design (either due to lack of training, or a non p/r mind). The faults they often find with p/r are their own bad thinking about p/r, and not OO's strengths.

      I think reg.ex's would be easier to learn and read and remember if they were broken down into user-definable chunks of some kind. It could be more like defining a generational grammar (substitution): you define the symbols rather than live with what Larry Wall or whoever picks. A special set of functions or operators would simplify the defining of the symbol sets.

      Further, I would like to see the pieces parsed into a table (or some easy-to-navigate structure) so that second passes can be done. In other words, divide up per-character parsing and per-token parsing.

      I admit that it may not be as compact as regexp's, but easier to read for those who don't need it every day.

      Regular expressions come across as a stringy diarrheic glob of an irreducible mess of symbols if you don't keep up. It is like forgetting to ride a bicycle if you do not do it every 3 months or so to refresh.

      I realize that everybody is different, and what bothers me may not bother others. I just don't personally like the approach regexp's took. I would like to see it broken down into clearer chunks. IOW, the syntax would (clearly) dictate the chunks instead of running the rules in one's head to find the boundaries and context.

      I know I will get called a bunch of names for saying this all, but that is my opinion, take it or leave it.
      • Here is kind of an example of what I am envisioning.

        The Perl version comes from:

        http://txt2regex.sourceforge.net/

        ### date LEVEL 3: mm/dd/yyyy: matches from 00/00/1000 to 12/31/2999

        RegEx perl: (0[0-9]|1[012])/(0[0-9]|[12][0-9]|3[01])/[12][0-9]{3}

        My re-work of it:

        symb(h, "A", Symb_numRange(1,12));
        symb(h, "B", Symb_numRange(1,31));
        symb(h, "C", Symb_numRange(1000,2999));
        isGoodDate = symb_Match(h, checkMe, "A/B/C");

        Here "h" is the symbol set storage handle.
        OOP langs would probably have it on the left side as an object.
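
A rough sketch of how the "named chunks" idea above might look in Python. Nothing here is an existing library: num_range, symb_match, and the symbols table are hypothetical names that loosely mirror the pseudocode, and fixed field widths are an added assumption. Note that with ranges starting at 1 it rejects 00 for month and day, which the txt2regex pattern accepts.

```python
import re

def num_range(lo, hi, width):
    """Return a checker for zero-padded integers of a fixed width in [lo, hi]."""
    def check(text):
        return len(text) == width and text.isdigit() and lo <= int(text) <= hi
    return check

def symb_match(symbols, text, template):
    """Match text against a template such as "A/B/C".

    Each character of the template that is a key in symbols names a chunk;
    every other character is treated as a literal separator.
    """
    # First pass: a coarse regex that only splits the text into digit chunks.
    pattern = "".join(
        r"(\d+)" if ch in symbols else re.escape(ch) for ch in template
    )
    m = re.fullmatch(pattern, text)
    if m is None:
        return False
    # Second pass: validate each chunk with its own named rule.
    names = [ch for ch in template if ch in symbols]
    return all(symbols[n](chunk) for n, chunk in zip(names, m.groups()))

symbols = {
    "A": num_range(1, 12, 2),       # month
    "B": num_range(1, 31, 2),       # day
    "C": num_range(1000, 2999, 4),  # year
}
```

With this table, symb_match(symbols, "07/17/2002", "A/B/C") accepts the date, while a month of 13 or a one-digit month is rejected. The split into a character-level pass and a per-chunk pass is the "second pass" idea from the parent comment.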
      • Regular expressions come across as a stringy diarrheic glob of an irreducible mess of symbols if you don't keep up. It is like forgetting to ride a bicycle if you do not do it every 3 months or so to refresh.

        Not to pick nits, but the expression "it's like riding a bicycle" implies that once you learn how to ride a bicycle, you never forget, no matter how long you go without actually riding one.

        • (* Not to pick nits, but the expression "it's like riding a bicycle" implies that once you learn how to ride a bicycle, you never forget *)

          Well, I am suggesting that it is *not* like a bicycle. The rules and symbols don't "stick" very long if you don't use regex's very often. At least not in my head.

          Actually a few years ago I tried riding a bicycle after about a 10-year absence. I almost fell over because my weight distribution was "different"[1] later. My brain did not know how to balance the new weight.

          [1] Euphemism for "fatter"
    • Regular expressions is one of those tools that I end up teaching to anyone that doesn't know them whenever I start a new job. I don't use them in much of my applications, but I use them to write my applications and build tools. I follow the philosophy of building tools to solve problems knowing I'll need to solve the same problem again and again.

      Another tool is shell scripting. At a past company Symantec Cafe was used for developing a Java application. When I joined, I immediately created shell scripts for myself to do automated builds for a couple reasons:

      • Cafe's editor, while nice, was not up to par for me -- it slowed me down too much.
      • I multitask a lot when I'm working, and having multiple shells open at once doing builds et al is handy.
      • The editor I use on Windows, CodeWright [starbase.com], lets you call batch files (and thus shell scripts through Cygwin) for CVS and compilation.
      • Cafe didn't (and still doesn't?) do automated builds, nor does it run on Linux.

      I showed others how to use them, but only one other developer took the time to get used to it, never having used a shell before. The others complained that they shouldn't have to learn a new tool (shells and scripts) when Cafe sufficed. I explained the advantages, but to no avail.

      Well, a few months later we finally hired a real QA and release engineer. Since we were building a J2EE application to run on Linux in testing and Solaris in deployment, we needed automated builds on Unix. There was a huge rush to get everyone up to speed on the new build system using shell scripts.

      Hmm, that was a bit long-winded just to make the point that there are many useful tools to developers that don't involve the actual code they write. I've used regexps to create SQL data files and config files as mentioned. You'll learn many things, so keep open and don't stop learning. :)

    • Whenever I interview someone for a position I always ask about any "obscure" programming languages or concepts. Perl, RegExps, Python, Scheme, Lisp, etc... It's not if they know/use the language it's how they answer the question. If they say that they don't know anything about it, that tells me that their toolbox is kinda light. These people are usually MCSEs.

      Once, I mentioned regular expressions in a room full of expensive contractors and full time employees and everyone looked at me like I had suddenly grown an extra head. I was shocked and dismayed. I'm surrounded by amateurs.
    • by The Cookie Monster ( 129545 ) on Thursday July 18, 2002 @01:17AM (#3906783)
      I know and use regular expressions, but use of regular expressions is often symptomatic of poor design, this makes me somewhat suspicious of those who live and breathe regexp's and preach them to the world.
      • Text processing - why isn't your text marked up? Converting data into text, passing it along, and then trying to pluck the data back out of the text is brittle and leaves you with a system that can't be upgraded - your components can't be improved to produce a more informative text stream as it will break all the regexpr's of all the components that use that stream etc.

        Text straight from the keyboard of a user won't be marked up and seems a good place to be using regular expressions. Due to the popularity of brittle and unupgradable (is that a word?) text processing, the input from other programs might not be marked up either, here regexprs are necessary (ie symptomatic of poor design, but it wasn't your decision).

      • Parsing - how many times have you encountered a HTML or XML parser written with a regexpr? Unless your job requires you code by the seat of your pants, this is just plain lazy. Parsers written with regular expressions are always incomplete (ie they work on the subset of HTML/XML they were tested on, and if the requirements or layout ever changes they break), and they are very slow compared to a proper parser. Proper robust and well tested parsers are available under most licenses and for most languages.

        This applies to much more than just HTML or XML, eg if you're going to write a javadoc clone for your pet language, do it properly, don't do it with regular expressions.

      • Development - Regular expressions appear to be developed with a 'try it and see' methodology - people write the regexpr and test it, thinking if it works then they must have done it right. This is very brittle, I've encountered many regexpr's for email addresses, all of them work on your bog standard address, none of them work when deployed - there's always some guy with a % in their email address or some other oddity the author of the regexpr forgot or didn't know about (and let's not even think about trying to make an RFC compliant email address regexpr, it would have to handle "blarg@wibble"@slashdot.org)

        That HTML tag stripper you hacked up, did you remember to handle comments? Just because there weren't any comments in the HTML it was tested on doesn't mean it'll never encounter them in the real world (wouldn't be an issue if an off the shelf parser is used).
      I don't know, there are other issues with regexpressions but I've spent too long on this post already. I'm curious as to other's views on this - I've just come to associate use of regular expressions with flakey or hastily written software.
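
To make the email point above concrete, here is a small sketch in Python. The NAIVE_EMAIL pattern is an invented example of the "bog standard" kind, not taken from any particular program, and looks_like_email is a hypothetical helper. It accepts the ordinary address but rejects both a local part containing % and the quoted local part mentioned in the post.

```python
import re

# An illustrative "works on my test data" email pattern (an assumption,
# not a pattern quoted from the thread).
NAIVE_EMAIL = re.compile(r"^[A-Za-z0-9._-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$")

def looks_like_email(addr):
    """Return True if addr matches the naive pattern."""
    return NAIVE_EMAIL.match(addr) is not None

samples = [
    "user@example.com",             # the bog standard case
    "user%relay@example.com",       # '%' in the local part
    '"blarg@wibble"@slashdot.org',  # quoted local part
]
results = {s: looks_like_email(s) for s in samples}
```

Only the first sample passes. Widening the character class patches the second case, but the quoted form needs real grammar handling, which is roughly the poster's argument.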
      • by Anthony Boyd ( 242971 ) on Thursday July 18, 2002 @04:52AM (#3907275) Homepage
        Text processing - why isn't your text marked up?

        While you later concede that form input and input from other programs might be good reasons to use a regex, that you would even pose this question is strange. For 90% of the regex fans, form input and screen scraping is exactly what they do. For almost any Web developer, this is the day-in, day-out norm. So your point seems to downplay the very uses that have made regex's so popular.

        I've encountered many regexpr's for email addresses, all of them work on your bog standard address, none of them work when deployed

        You realize this does not bolster your claim that regex's are "overrated" -- it merely points out that some developers are overrated. A bad developer does not make a language bad.

        That HTML tag stripper you hacked up, did you remember to handle comments?

        Same as above. You're complaining about human error and then blaming the regex system itself.

        I've just come to associate use of regular expressions with flakey or hastily written software.

        Of course. But the hastily written software is the other software we interact with, not our own. And that's a broad generalization for many developers, so of course you can find exceptions. But you asked for other people's views, and in my view, regex's are sorely needed -- not so bad developers can stay bad, but so that the good developers can clean up the messes left behind after the bad developers go. It's a nice bonus that good regex developers can pull in hostile data, screen scrape, and cleanse form input. That helped one of my employees get a raise last quarter.

      • I'm curious as to other's views on this - I've just come to associate use of regular expressions with flakey or hastily written software.

        Hehe, ok, I'll be objective but some personal opinions reign. Most of this is from my personal experience, not text book stuff

        Text processing - why isn't your text marked up? Text processing forms the heart and soul of regexps. As you say, any brainful system should never pass text requiring regexps between systems (use markup, structs, whatever). However, at some point, there is usually raw input beyond your control, be it CGI input, keyboard input, non-markup input from a system beyond your control. That's where regexps are used the most (all of ?) the time for me.

        Parsing - how many times have you encountered a HTML or XML parser written with a regexpr?

        Parsing is the next level beyond regexps. You start with the specification and let the implementation arrive from it, like much good development. Indeed, any "parsing" of large well specified documents (XML, HTML etc) is probably best done by proper parsers. But sometimes, you don't have well specified input at all, or you are just searching for bits out of a document. Now we are back to adhoc text processing where regexps rule. Also, parsers are overkill when we are doing small processing such as reading numeric input.

        (My IMHO) Conclusion: There is some grey (and for me, not a thin) line between text processing and parsers, where you should use regexps or not.

        Development

        A good regexp programmer knows what he is regexping for before he starts. I invariably get things right first time. That they try to parse something that has a specification (email address) without reading the RFC is stupid.

        Now here is the distinction. If something is well specified, there is invariably a perl module to handle it using whatever optimum (hopefully) method is available (XML::Parser, Email::Valid). Regexps are where we are not dealing with standard specifications, perhaps non-formatted data and thus where parsers may not work. And in those cases, without regexps, you'd be in a very lost world and that's perhaps why they are preached so much.

      • I know and use regular expressions, but use of regular expressions is often symptomatic of poor design, this makes me somewhat suspicious of those who live and breathe regexp's and preach them to the world.

        I find regexes to be very useful for checking user input in HTML forms. You can do a JavaScript regex check for the user's convenience (so that s/he doesn't need to submit the form to find out that s/he made a mistake or invalid input), then a second check on the server side with whatever server language you are using.

        Skip the JavaScript if you're lazy or in a hurry.
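
        The server-side half of that check might look something like this in Python. It is only a sketch: the field names (zip_code, quantity) and their patterns are made up for illustration, and a real form will need its own rules.

```python
import re

# Hypothetical form fields and patterns, chosen only for illustration.
FIELD_PATTERNS = {
    "zip_code": re.compile(r"\d{5}(?:-\d{4})?"),  # US ZIP or ZIP+4
    "quantity": re.compile(r"[1-9]\d{0,2}"),      # 1 to 999, no leading zero
}


def invalid_fields(form):
    """Return the sorted names of fields that are missing or fail their pattern."""
    bad = []
    for name, pattern in FIELD_PATTERNS.items():
        value = form.get(name, "")
        # fullmatch anchors the pattern at both ends, so trailing junk is rejected.
        if pattern.fullmatch(value) is None:
            bad.append(name)
    return sorted(bad)
```

        An empty list means every field passed; anything else is the list of fields to send back to the user.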

      • I think that your criticisms are criticisms of *string processing*. Indeed, if you are spending most of your time munging strings, you might consider whether a better interface is needed. For machine languages like HTML or C code, you should normally use a parser rather than ad hoc string processing.

        But a lot of stuff does inherently require messing with strings, and for that, the regular expression is a great general-purpose tool. It certainly beats the raw C library :-P.
  • by jhunsake ( 81920 ) on Wednesday July 17, 2002 @05:35PM (#3904876) Journal
    Regular expressions haven't changed since the seventies, at the latest. Now if you want to say that implementations of regular expressions are advancing, fine. Let's be precise in our use of language, or not.
    • Ummm.... (Score:4, Funny)

      by MemeRot ( 80975 ) on Wednesday July 17, 2002 @05:40PM (#3904908) Homepage Journal
      "Let's be precise in our use of language, or not."

      Very compressed contentlessness.

    • Well, that's true, because a regular expression is nothing but a compact way to describe a deterministic finite state machine. Regexps, on the other hand, are not. Regexps have nothing at all to do with deterministic finite state machines, except for the fact that the syntax is inspired by them.

      PS: Note the difference between "regular expression" which is what they teach you about in CS classes, and "regexps", which is what programmers actually use in Perl and many other languages.
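
      One way to see the difference, sketched here in Python rather than Perl: a backreference lets a regexp match "some word, a space, then the same word again", and that is a language no finite state machine can recognize. The function name is just for illustration.

```python
import re

# (\w+) captures a word; \1 must then match exactly the same text again.
# "Any word followed by itself" is not a regular language, so this pattern
# goes beyond what a textbook regular expression can describe.
DOUBLED_WORD = re.compile(r"\b(\w+) \1\b")


def first_doubled_word(text):
    """Return the first immediately repeated word in text, or None."""
    match = DOUBLED_WORD.search(text)
    return match.group(1) if match else None
```

      Calling first_doubled_word("this is is a test") picks out the repeated "is", while "the theory" is left alone because the second "the" is only part of a longer word.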

  • by paj1234 ( 234750 ) on Wednesday July 17, 2002 @05:50PM (#3904964)
    I have the first edition of "Mastering Regular Expressions" and it is indeed a very fine useful book.

    For a nice way to get started with regular expressions I recommend the wonderful "txt2regex" console program. It provides a simple text based wizard-like interface. You answer questions and the program builds your regular expression for you. See:

    http://txt2regex.sourceforge.net/
  • by jfriedl ( 593824 ) on Wednesday July 17, 2002 @06:07PM (#3905058)
    The original poster says that the "behind-the-scene purpose is apparently to push a new book that O'Reilly published this month". Actually, that's pretty much the main point of the article -- to justify the need for a second edition, and to let people know what they'd get (or, if not interested, what they're passing on).

    I wrote the article so that people would have a feel for what's new in the book. Of course, my hope is that people are interested in the new content, but my general feeling is that the worst that can happen is that someone buys the book and finds out that it's not what they expected. Unmet expectations pretty much suck, and I hope the article helps avoid some of that suckage.... and piques some interest, as well.

    Jeffrey

    • thanks for your book.
      Everybody here and there is going to say how informative it is. But, what stroke me the most, is that it is well written.
      It was very pleasant to read it, apart from the knowledge I got from it. If only all manuals ...
      • what [struck] me the most, is that it is well written.

        Which is why O'Reilly is the first place I look for a book. The ratio of well-written to badly written books is better there than anywhere else. Theirs are the only books I will order online. All others, I want to page through in a bookstore first.

    • I wrote the article so that people would have a feel for what's new in the book.

      As with almost every other programmer out there, I agree that "Mastering Regular Expressions" is one of the best-written and most useful programming books there is. I know a lot of people would probably buy the second edition regardless. But the article/book review cemented my decision, since it covers Java and PHP (and even that wacky MS stuff, huh?).

  • by Lumpish Scholar ( 17107 ) on Wednesday July 17, 2002 @06:09PM (#3905071) Homepage Journal
    It's not just a Perl book, but the language independent and Perl dependent parts are a godsend.

    I was a full time Perl programmer (with a two hour commute by rail) when Friedl's book came out. I read it cover to cover, and then recommended it strongly to my co-workers.

    Friedl shows how to write powerful, readable, efficient regular expressions that can do a lot of the work your program needs to do. It changed how my group wrote Perl (very much for the better). This is more than highly recommended; after the Blue Camel, and even before the Cookbook, this is a definitive book for all those who call themselves "Perl programmers."

    (In the first edition of the book, Friedl discovered some problems with regular expressions in early versions of Perl 5. The very next release of Perl -- 5.003, I think -- immediately fixed these problems. When Larry & Co. pay attention to a Perl book, maybe you should, too?)
  • What?! (Score:2, Funny)

    by Myuu ( 529245 )
    Mark me as a troll or whatever but, "What are regular expressions?"

    Are they those /:+[^:]/ statements? What's the big deal then?

    I'm really, really new to Perl, studying it out of an O'Reilly book. What does this mean to me?
    • If you don't know, I guess O'Reilly has another book sale...

      Regexps are a very powerful search/replace tool. One of the reasons Perl is so popular is that it has a powerful, easy-to-use (and by this, I also mean easy to invoke; ever try this in C? yeeesh) regular expression engine. It makes text processing very easy.

      If you're learning Perl out of the Camel book, you'll be fine. It has a good explanation of it. Once you see the power of it, you'll likely wonder how you got along without it.
    • Re:What?! (Score:3, Informative)

      by jbolden ( 176878 )
      The major reason to learn Perl is powerful string manipulation. Those "/:+[^:]/ statements" are the power string manipulators. Try to do anything hard with strings in any language without regexes, and then you'll understand what the big deal is.

  • by millette ( 56354 ) <robin@@@millette...info> on Wednesday July 17, 2002 @06:54PM (#3905327) Homepage Journal
    Anyone here who has read the latest Perl Apocalypse (#5, it was) knows full well that regexes as we know and love them are out the window. The Apocalypse is a large document, so I picked this page [perl.com] to give you a little idea of what's going to change. The pages before that mention all the warts that Larry wants to bury.

    I understand that Perl 6 isn't near being done, and that the "r" in "Perl" doesn't necessarily stand for "regex", depending on who you ask, but Perl will always have the greatest influence over what is called a regex. Or is that going to change with Perl 6?

  • Regular expressions are old hat. I'm much more interested in the advances in irregular expressions, as used in the old Firth and Pasquale languages.

    But, of course, everyone knows that a real coder uses irregexps in disassembly language.
  • I'd like to take all my existing regular expressions and run them through regular expressions to turn them into new age regular expressions. Can I do this, or will the universe implode?

  • You know what I'd like? A regex syntax I could use in shell scripts that would take less time to debug than the equivalent C++ program.

    -a
    • Perl 6 is your language then. You won't be able to directly use it in shell scripts, but once you have Perl figured out, you won't mind that anyway. ;-)

      Having whitespace be insignificant by default should help a great deal with readability, as will the efforts to make regex syntax more consistent. The ability to embed Perl 6 objects into regular expressions should also lead to some interesting developments.
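
      Perl 6 syntax is still in flux, so I won't try to show it, but as a rough analogy you can already get insignificant whitespace today with Perl 5's /x modifier or Python's re.VERBOSE flag. A small Python sketch, where the date format and the function name are just examples:

```python
import re

# With re.VERBOSE, whitespace and #-comments inside the pattern are ignored,
# so the regex can be laid out one piece per line.
ISO_DATE = re.compile(r"""
    (?P<year>  \d{4} )  -   # four-digit year
    (?P<month> \d{2} )  -   # two-digit month
    (?P<day>   \d{2} )      # two-digit day
""", re.VERBOSE)


def split_date(text):
    """Return (year, month, day) as ints if text is exactly YYYY-MM-DD, else None."""
    match = ISO_DATE.fullmatch(text)
    if match is None:
        return None
    return tuple(int(match.group(name)) for name in ("year", "month", "day"))
```

      Compare that with the same pattern crammed onto one line and the readability argument more or less makes itself.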
      • I write a few shell scripts in Perl, mostly when I need a hash table, but because I don't use it often it still takes a long time to research and debug. The problem with scripting languages is that they are meant to be compact so the syntax is so crazy.

        I always want to do a simple search and replace in a shell script, à la echo "$TEXT" | sed "s/$FILENAME/xyz/". But $FILENAME is bound to contain some special characters, such as '/' or '.'. I end up using "s,$FILENAME,xyz,", but every once in a while I still get strange results. Can Perl do any better?

        -a
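
        The usual fix is to escape the variable before it goes into the pattern, so that nothing in it is read as regex syntax. Perl has \Q...\E (quotemeta) for that. A minimal Python sketch of the same idea uses re.escape; the helper name and the default replacement are hypothetical:

```python
import re


def replace_filename(text, filename, replacement="xyz"):
    """Replace every literal occurrence of filename in text."""
    # re.escape turns characters such as '/', '.' and ',' into literals,
    # so the filename cannot be misread as regex syntax.
    pattern = re.escape(filename)
    # Passing a function as the replacement avoids backslash processing in it.
    return re.sub(pattern, lambda _match: replacement, text)
```

        If all you ever need is a literal replacement, a plain text.replace(filename, replacement) is simpler still; escaping starts to pay off when the filename is only one piece of a larger pattern.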
    • This isn't regexes, but you might find it useful. In sh, at least on FreeBSD and I believe on Linux (bash), you can use ## and %% to strip parts out of a variable's contents (## removes a matching prefix, %% a matching suffix). For example:

      $ foo=bar
      $ echo ${foo##ba}
      r

      Very useful stuff..
  • Regex Accelerator! (Score:3, Informative)

    by Anonymous Coward on Wednesday July 17, 2002 @08:50PM (#3905809)
    For the ultimate in regex'ing ... hardware regex accelerators!!! [216.239.51.100]
