Next Generation Regexp
prostoalex writes "Jeffrey E. F. Friedl, author of the newly published 2nd edition of Mastering Regular Expressions, wrote a feature article for O'Reilly Network on the recent innovations in the regular expression world. You'd think that an area such as regular expressions would be fairly stable, but according to the author, 'when I started to work on the second edition of Mastering Regular Expressions and started refocusing on the field, I was rather shocked to find out how much had really changed'. The article's behind-the-scene purpose is apparently to push a new book that O'Reilly published this month, but it has great educational value for anyone involved with practical extracting and reporting."
.NET regexps and Microsoft's documentation (Score:4, Insightful)
I particularly like this bit:
Nice to see that things haven't changed much
Re:.NET regexps and Microsoft's documentation (Score:4, Funny)
PAGE 1:
[COMMAND1] is executed by typing the word [command1] followed by the argument string, followed by enter. The argument string consists of a sequence of non-whitespace characters separated by whitespace characters.
[COMMAND2] is executed by typing the word [command2] followed by the argument string, followed by enter. The argument string consists of a sequence of non-whitespace characters separated by whitespace characters.
[COMMAND3] is executed by typing the word [command3] followed by the argument string, followed by enter. The argument string consists of a sequence of non-whitespace characters separated by whitespace characters.
PAGE 2:
THIS PAGE IS INTENTIONALLY LEFT BLANK
Regarding this last IBM tradition (that others have tried to copy but few have truly mastered), the Spruce DVD Maestro manual has a page with the following text:
Blank page.
(mostly)
RMN
~~~
Re:.NET regexps and Microsoft's documentation (Score:2, Troll)
If only all books could be written as well.. *sigh*...
In-depth... summary. In-depth... Summary.
Re:.NET regexps and Microsoft's documentation (Score:2)
The first made me a tech; the second is making me an admin. All the books I've read in between have been MS GUI crap or warmed over help files and TechNet articles.
(Yeah, yeah, I know it ain't Linux...but it pays the bills)
Re:.NET regexps and Microsoft's documentation (Score:2)
Oh, and Linux has about a 10% chance of paying your bills. Have a major enterprise skill set, say, Windows or AIX or Solaris.. and have Linux as a secondary skill. Things may be different where you live, but in midsouth.us, Linux is currently considered "dot-commish" and corps are steering away from it. Things are bound to change when the Itanium comes out and PC Unix means something again.
Re:.NET regexps and Microsoft's documentation (Score:2)
Apparently the traditional man pages weren't down to IBM standards, so IBM actually paid someone to rewrite them.
In order to get man pages that actually have useful information I now have to surf the web. The ones included with AIX 4.3 are so damn useless and content-free that they're actually misleading at times.
Re:.NET regexps and Microsoft's documentation (Score:2)
Another description was "Microsoft's moral guidelines". Which is also not entirely fair. Microsoft does have one guideline concerning morals: "if you want to make it in this company, get rid of them".
RMN
~~~
Re:.NET regexps and Microsoft's documentation (Score:2)
The best way to think about the complaint is that the ratio of ideas to company size was allegedly lower than "normal". For example, IBM might have been at the time 90 percent of the computer market, but made only about 50 percent of all innovative ideas.
But DEC and Intel and a little bit of Gov-scare eventually changed all that.
Where's Clippy when ya need him .. (Score:5, Funny)
Re:Where's Clippy when ya need him .. (Score:3, Funny)
"I see that you are swearing, would you like to use a thesaurus"
VB and Regexes (Score:2)
Whether you love Microsoft or hate it, there's no denying the popularity of Visual Basic. With the regular-expression package in the .NET Framework, Microsoft provides a package that can be used by VB.NET, C#, Visual C++, and any other language that wants to link to it -- even Python and Perl! The consistency is appealing, but even more important is the package itself: it's powerful and fast, and it can hold its head up high next to Perl or any other regex package out there.
VB's regex syntax is exactly like Perl's. In fact, when I started working with regexes in VB and I couldn't find something in the documentation I would look it up in one of the O'Reilly Perl books. Much to my "shock", I could do everything Perl regexes could do, even the things that weren't in the documentation.
I strongly suspect Microsoft took full advantage of Perl's "artistic license" when they came up with their regex engine.
Re:VB and Regexes (Score:2, Insightful)
Re: ACs and imposters (Score:2)
I mean, it's perfectly legal, but the point is: How would this look in the Evil Human Resources Dept? Are they going to think about promoting this guy in the near future?
Re:.NET regexps and Microsoft's documentation (Score:2)
Yes, you can find tutorials, examples and the like, but no FORMAL references and specifications.
I hate programming by example without knowing exactly what I'm doing and knowing what is inside and what is outside the spec. And lots of documentation only costs time.
In that respect, the Java documentation is excellent. A concise specification, yet very readable and useful during day-to-day programming.
Is that so? (Score:2, Insightful)
Well, I don't find it fair that you were modded as a troll. You may be just misinformed.
I can tell you that _any_ decent *nix gives you complete knowledge of what is going on in your machine. Without having to look at source code, without having to go to some central repository of information.
Now, press Ctrl-Alt-Del in your favorite Windows and take a look at the name of the services. Try to enter any of them in the MSDN search. What do you see? Do they tell you what that service does? How is it started? How can you stop it?
Do you still praise MSDN so high when you see that they don't even tell you the basics?
When is RegExp2 Going To Be Shipped (Score:3, Informative)
Re:When is RegExp2 Going To Be Shipped (Score:4, Funny)
Yes, and that makes me want to use a decidedly irregular expression:
#@*$^&@#$&#!!!
Perl6 regular expressions - forget everything (Score:3, Interesting)
Re:Perl6 regular expressions - forget everything (Score:2)
You just described AWK.
I thought that was the whole purpose of Perl!
Ok, so maybe I am exaggerating on purpose (it's called humor, folks... Don't shoot!)
But it always seemed to me that Perl's cryptic quality mainly came from having "too many" variables and control structures
(<joke> There's too many ways to do it? </joke>)
Thank the Lord for blessing us with Guido
This has no educational purpose (Score:3, Insightful)
It is a slightly wordy advertisement for why you should upgrade. The fact that it was foisted on us as something else annoys me, as I spent time reading it.
I know, a slashdot reader that actually reads linked stories is such a minority, but come on, quit stuffing articles with advertising. Aren't the ads in the middle of a page enough?
what about perl 6? (Score:5, Interesting)
Re:what about perl 6? (Score:3, Interesting)
If you could write and use a Perl 6 program right now, maybe he'd include a chapter on it in his book.
This article is basically an overview of his book. His book doesn't cover Perl 6 regex's. Why should it? Perl 6 isn't even done yet, and so everything new for Perl 6 could change by the time it comes out.
Re:what about perl 6? (Score:2)
heh [develooper.com].
Re:what about perl 6? (Score:2)
it's now Turing-complete, if you have a Parrot engine and a bit of spare time. Call it a primitive "demo version" of some of Perl 6's features.
So I reiterate... "if you could write and USE a Perl 6 program right now, maybe he'd include a chapter on it in his book."
heh.
Re:what about perl 6? (Score:2)
Contentless article (Score:2, Insightful)
A regex is a type 3 grammar. Type 3 grammars haven't really changed since Chomsky's time.
The smartarses will now proceed to point out that
a) Perl is actually limited type 2
b) Some change no one knows or cares about was made to some definition of the Chomsky hierarchy in nineteen dumdy-dum.
Foo.
"regular expression" (was: Contentless article) (Score:2)
Contentless posting (Score:2)
Many precise mathematical or scientific terms have different meanings to laymen. What is a positive number? I'm sure I learned whether 0 is a positive number way back when, but right now it simply doesn't matter. Context is usually good enough, and when not, > and >= work wonders. Quantum leap as used by mere mortals has the meaning of incredible revolutionary exciting change, but scientifically, it means the smallest possible change.
So foo to you.
Re:Contentless article (Score:5, Insightful)
After you've finished your undergrad CS theory class, you might go on to discover that implementations of regexes under various paradigms and in the various languages have extremely rich variety regarding syntax, semantics and efficiency. This isn't about the pristine theory of Prof. Chomsky, but about the actual use of regexes as programming constructs, and that's a tremendously complex subject. Friedl's book in the first edition is one of the best I've ever seen that has tackled such complexity and made it accessible and useful for the everyday business of programming.
The article indicates that the practical use of regexes, far from stagnating since Chomsky's time, continues to evolve and grow. That's only "contentless" if you're stuck in the ivory tower and don't intend to leave.
Re:Contentless article (Score:2)
at some point... (Score:4, Interesting)
Perl and other languages should leave "good enough" alone when it comes to regular expressions and instead just make it easy to put chunks of grammars into programs.
Re:at some point... (Score:4, Insightful)
Yes, regular expressions should be used to find particular patterns in text and perform basic manipulations on them. Beyond a certain point of complexity it really doesn't make sense to perform more complex manipulations. Get the information you want out of the string using a regular expression, then manipulate it in code.
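The "extract with a regex, then manipulate in code" split described above can be sketched roughly as follows. This is only an illustration: the log-line format, the `LINE` pattern, and the `upload_bytes` helper are all invented for the example and do not come from the thread.

```python
import re

# Hypothetical input format, made up for this sketch:
#   "<user> uploaded <number> <KB|MB>"
LINE = re.compile(r"(?P<user>\w+) uploaded (?P<size>\d+) (?P<unit>KB|MB)")

def upload_bytes(line):
    """Use the regex only to pull fields out; do the arithmetic in code."""
    m = LINE.search(line)
    if m is None:
        return None
    # The unit conversion lives in ordinary code, not in the pattern.
    factor = {"KB": 1024, "MB": 1024 * 1024}[m.group("unit")]
    return m.group("user"), int(m.group("size")) * factor
```

The pattern stays short and readable because it never tries to do the conversion itself.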
One has a feeling that regexp engines are just becoming programming languages in and of themselves - the only difference being that the 'program' consists of a string of cryptic single character commands, and the input is limited to a single string.
-josh
Re:at some point... (Score:3, Funny)
Not true. Yet.
Perl 5 regexes can solve NP-hard problems, but they're not quite Turing complete. However, they require only four additional stack operators to do that.
Personally, I'm waiting for the first Perl regex to become sentient.
Re:at some point... (Score:2, Funny)
Feb 21, 2042 - The day it gets converted into a Perl one-liner.
Re:at some point... (Score:2)
regexp and programmers (Score:4, Insightful)
Over the course of my career I have come to the rather firm opinion that you are not worth much as a coder if you do not know regular expressions. I don't care what language(s) you're proficient in, or if you've memorized every single design pattern the GoF has ever conceived, or do 4 foot by 6 foot UML diagrams in your head. If you can't do regexps then you're missing a basic skill. I bought Friedl's book a couple of years ago, and although I wound up not using many of the Perl related stuff the rest of the book helped me out immensely.
A programmer without knowledge of regular expressions is like a carpenter without a hammer.
Re:regexp and programmers (Score:4, Insightful)
If ever there was an apt analogy of regular expressions - that's it! They make everything seem like a nail
Re:regexp and programmers (Score:2)
Mod parent up! (Score:2)
To those who can't read (or write) them, regular expressions look like line noise. But once you learn to read them you can condense whole paragraphs of spaghetti conditionals into a single, clear (to the initiated), terse line.
For manipulating strings of characters, they are probably the single most important innovation of the last 20 years.
Re:Mod parent up! (Score:3, Informative)
Re:regexp and programmers (Score:2, Troll)
I prefer to work with people who don't do a lot of regex, because they're less likely to use them for everything. I haven't worked on a large project that used regular expressions in years. I feel pretty good about that.
Sure, I've used them in a couple small scripts for parsing text, but if you see the majority of programming requiring regex, you definitely need to put your hammer down and pick up a Makita.
Re:regexp and programmers (Score:3, Interesting)
Sure, I've used them in a couple small scripts for parsing text, but if you see the majority of programming requiring regex, you definitely need to put your hammer down and pick up a Makita.
Well, I am certainly not advocating the broad use of regexps in application programming, even though it has been demonstrated to be possible. For me, regexps are an important tool in solving side issues/behind the scenes work, such as formatting a series of configuration files in a given manner, or making broad changes to a set of HTML files, and so forth. I don't do Perl, and don't really like to if I can avoid it, but I still use regular expressions on a daily basis, and have found them to be immensely helpful.
Re:regexp and programmers (Score:2)
I guess they are OK if you do a shitload of text processing, but my philosophy is that data should be processed in native (i.e. binary) form and text should only be used for interchange purposes. Even in that case, you can use text "protocols" such as XML, for which regexps are useless. So... If you have a buttload of (fairly) unstructured data to import... Knock yourself out. It doesn't happen to me. Text processing just isn't an issue for me. I don't think that makes me any less of a coder. My domain is simply different to yours.
Re:regexp and programmers (Score:2)
The rest of your argument is just hand-waving, "almost always", "in general", "generally"... Very weak.
Re:regexp and programmers (Score:2)
You speak WISDOM...
I remember a while back, one of my clients needed to move a bunch of dns records from one server to another. Took me ~ 45 minutes to write a php shell script using REGEX to create new bind zone records for over 300 domains, and convert them - records intact, complete, ready to restart named.
This poor client had paid somebody else to do it, they spent several DAYS at it and there were still lots of (human) mistakes.
And, this wasn't complicated stuff!
Any programmer who doesn't know regex is crippled!
Re:regexp and programmers (Score:2)
I remember a while back, one of my clients needed to move a bunch of dns records from one server to another. Took me ~ 45 minutes to write a php shell script using REGEX to create new bind zone records for over 300 domains, and convert them - records intact, complete, ready to restart named.
Forty five minutes? Wow. Had you been using djbdns [cr.yp.to], you could have been done in thirty seconds. The BIND zone file format is needlessly complex.
regexp criticism (Score:2)
That can be said about anything. IMO, many OOP fans were simply crappy at procedural/relational programming and design (either due to lack of training, or a non p/r mind). The faults they often find with p/r are their own bad thinking about p/r, and not OO's strengths.
I think regexes would be easier to learn and read and remember if they were broken down into user-definable chunks of some kind. It could be more like defining a generational grammar (substitution): you define the symbols rather than live with what Larry Wall or whoever picks. A special set of functions or operators would simplify the defining of the symbol sets.
Further, I would like to see the pieces parsed into a table (or some easy-to-navigate structure) so that second passes can be done. In other words, divide up per-character parsing and per-token parsing.
I admit that it may not be as compact as regexps, but it would be easier to read for those who don't need it every day.
Regular expressions come across as a stringy diarrheic glob of an irreducible mess of symbols if you don't keep up. It is like forgetting to ride a bicycle if you do not do it every 3 months or so to refresh.
I realize that everybody is different, and what bothers me may not bother others. I just don't personally like the approach regexps took. I would like to see it broken down into clearer chunks. IOW, the syntax would (clearly) dictate the chunks instead of running the rules in one's head to find the boundaries and context.
I know I will get called a bunch of names for saying this all, but that is my opinion, take it or leave it.
Re:regexp criticism (Score:2)
envisioning.
The Perl version comes from:
http://txt2regex.sourceforge.net/
### date LEVEL 3: mm/dd/yyyy: matches from 00/00/1000 to 12/31/2999
RegEx perl: (0[0-9]|1[012])/(0[0-9]|[12][0-9]|3[01])/[12][0-9
My re-work of it:
symb(h, "A", Symb_numRange(1,12));
symb(h, "B", Symb_numRange(1,31));
symb(h, "C", Symb_numRange(1000,2999));
isGoodDate = symb_Match(h, checkMe, "A/B/C");
Here "h" is the symbol set storage handle.
OOP langs would probably have it on the left side as an object.
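For what it's worth, the symbol-set idea sketched above can be approximated in a few lines of Python. This is a rough sketch under stated assumptions, not the poster's design: `num_range` and `symb_match` are hypothetical names modeled on the pseudocode, each symbol is assumed to stand for one run of ASCII digits, and a plain dict plays the role of the "h" handle.

```python
import re

def num_range(lo, hi):
    """Return a checker that accepts digit strings whose value is in [lo, hi]."""
    return lambda text: lo <= int(text) <= hi

def symb_match(symbols, text, template):
    """Match text against a template such as "A/B/C".

    Every template character that is a key in `symbols` stands for one run of
    digits; every other character must appear literally in the text.
    """
    pattern = "".join(
        "([0-9]+)" if ch in symbols else re.escape(ch) for ch in template
    )
    m = re.fullmatch(pattern, text)
    if m is None:
        return False
    names = [ch for ch in template if ch in symbols]
    # Second pass: check each captured chunk against its symbol's rule.
    return all(symbols[name](value) for name, value in zip(names, m.groups()))

# Same ranges as the rework above: month, day, year.
symbols = {
    "A": num_range(1, 12),
    "B": num_range(1, 31),
    "C": num_range(1000, 2999),
}
is_good_date = symb_match(symbols, "12/31/2999", "A/B/C")
```

Note that this follows the numeric ranges in the rework (so month 00 is rejected), and like the original it does not check that the day is valid for the given month.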
Re:regexp criticism (Score:3, Insightful)
Sounds kind of like what the Regexp::English [cpan.org] perl module does.
You may also want to look at the YAPE::Regex [cpan.org] series of modules that allow parsing/extracting/explaining of regex.
Re:regexp criticism (Score:2)
Regular expressions come across as a stringy diarrheic glob of an irreducible mess of symbols if you don't keep up. It is like forgetting to ride a bicycle if you do not do it every 3 months or so to refresh.
Not to pick nits, but the expression "it's like riding a bicycle" implies that once you learn how to ride a bicycle, you never forget, no matter how long you go without actually riding one.
Re:regexp criticism (Score:2)
Well, I am suggesting that it is *not* like a bicycle. The rules and symbols don't "stick" very long if you don't use regex's very often. At least not in my head.
Actually a few years ago I tried riding a bicycle after about a 10-year absence. I almost fell over because my weight distribution was "different"[1] later. My brain did not know how to balance the new weight.
[1] Euphemism for "fatter"
Yes Indeed (Score:2)
Another tool is shell scripting. At a past company Symantec Cafe was used for developing a Java application. When I joined, I immediately created shell scripts for myself to do automated builds for a couple reasons:
I showed others how to use them, but only one other developer took the time to get used to it, never having used a shell before. The others complained that they shouldn't have to learn a new tool (shells and scripts) when Cafe sufficed. I explained the advantages, but to no avail.
Well, a few months later we finally hired a real QA and release engineer. Since we were building a J2EE application to run on Linux in testing and Solaris in deployment, we needed automated builds on Unix. There was a huge rush to get everyone up to speed on the new build system using shell scripts.
Hmm, that was a bit long-winded just to make the point that there are many useful tools to developers that don't involve the actual code they write. I've used regexps to create SQL data files and config files as mentioned. You'll learn many things, so keep open and don't stop learning. :)
Re:regexp and programmers (Score:2)
Once, I mentioned regular expressions in a room full of expensive contractors and full time employees and everyone looked at me like I had suddenly grown an extra head. I was shocked and dismayed. I'm surrounded by amateurs.
regexp are way overrated (Score:5, Informative)
Text straight from the keyboard of a user won't be marked up and seems a good place to be using regular expressions. Due to the popularity of brittle and unupgradable (is that a word?) text processing, the input from other programs might not be marked up either, here regexprs are necessary (ie symptomatic of poor design, but it wasn't your decision).
This applies to much more than just HTML or XML, eg if you're going to write a javadoc clone for your pet language, do it properly, don't do it with regular expressions.
That HTML tag stripper you hacked up, did you remember to handle comments? Just because there weren't any comments in the HTML it was tested on doesn't mean it'll never encounter them in the real world (wouldn't be an issue if an off the shelf parser is used).
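To make the comment problem concrete, here is a minimal sketch comparing a hacked-up regex stripper with one built on an off-the-shelf parser (Python's standard `html.parser`). The function names and the sample string are made up for illustration, and the parser-based version is still simplistic: it would, for instance, keep the text inside script and style elements.

```python
import re
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect text nodes; tags and comments are dropped by the parser."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def strip_tags(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)

def naive_strip_tags(html):
    # The kind of one-liner the comment warns about.
    return re.sub(r"<[^>]*>", "", html)

sample = "<p>Hello <!-- <b>hidden</b> --> <b>world</b></p>"
parsed = strip_tags(sample)        # comment contents are gone
naive = naive_strip_tags(sample)   # comment contents leak into the output
```

On this sample the naive version leaves the word "hidden" and a stray "-->" in its output, while the parser-based version returns only the visible text.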
Re:regexp are way overrated (Score:4, Interesting)
While you later concede that form input and input from other programs might be good reasons to use a regex, that you would even pose this question is strange. For 90% of the regex fans, form input and screen scraping is exactly what they do. For almost any Web developer, this is the day-in, day-out norm. So your point seems to downplay the very uses that have made regex's so popular.
You realize this does not bolster your claim that regex's are "overrated" -- it merely points out that some developers are overrated. A bad developer does not make a language bad.
Same as above. You're complaining about human error and then blaming the regex system itself.
Of course. But the hastily written software is the other software we interact with, not our own. And that's a broad generalization for many developers, so of course you can find exceptions. But you asked for other people's views, and in my view, regex's are sorely needed -- not so bad developers can stay bad, but so that the good developers can clean up the messes left behind after the bad developers go. It's a nice bonus that good regex developers can pull in hostile data, screen scrape, and cleanse form input. That helped one of my employees get a raise last quarter.
Re:regexp are way overrated (Score:2)
Hehe, ok, I'll be objective but some personal opinions reign. Most of this is from my personal experience, not textbook stuff
Text processing - why isn't your text marked up? Text processing forms the heart and soul of regexps. As you say, any brainful system should never pass text requiring regexps between systems (use markup, structs, whatever). However, at some point, there is usually raw input beyond your control, be it CGI input, keyboard input, non-markup input from a system beyond your control. That's where regexps are used the most (all of ?) the time for me.
Parsing - how many times have you encountered a HTML or XML parser written with a regexpr?
Parsing is the next level beyond regexps. You start with the specification and let the implementation arrive from it, like much good development. Indeed, any "parsing" of large well specified documents (XML, HTML etc) is probably best done by proper parsers. But sometimes, you don't have well specified input at all, or you are just searching for bits out of a document. Now we are back to ad hoc text processing where regexps rule. Also, parsers are overkill when we are doing small processing such as reading numeric input.
(My IMHO) Conclusion: There is some grey (and for me, not a thin) line between text processing and parsers, where you should use regexps or not.
Development
A good regexp programmer knows what he is regexping for before he starts. I invariably get things right first time. That they try to parse something that has a specification (email address) without reading the RFC is stupid.
Now here is the distinction. If something is well specified, there is invariably a perl module to handle it using whatever optimum (hopefully) method is available (XML::Parser, Email::Valid). Regexps are where we are not dealing with standard specifications, perhaps non-formatted data and thus where parsers may not work. And in those cases, without regexps, you'd be in a very lost world and that's perhaps why they are preached so much.
Re:regexp are way overrated (Score:2)
I know and use regular expressions, but use of regular expressions is often symptomatic of poor design, this makes me somewhat suspicious of those who live and breathe regexps and preach them to the world.
I find regexes to be very useful for checking user input in HTML forms. You can do a JavaScript regex check for the user's convenience (so that s/he doesn't need to submit the form to find out that s/he made a mistake or invalid input), then a second check on the server side with whatever server language you are using.
Skip the JavaScript if you're lazy or in a hurry.
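The server-side half of that double check might look something like the sketch below. The field names and the rules themselves (a 3 to 16 character username, a US-style ZIP code) are assumptions invented for the example; a real form would have its own.

```python
import re

# Hypothetical per-field rules, chosen only for illustration.
RULES = {
    "username": re.compile(r"[A-Za-z0-9_]{3,16}"),
    "zip": re.compile(r"[0-9]{5}(?:-[0-9]{4})?"),
}

def validate_form(fields):
    """Return the names of fields that are missing or fail their pattern."""
    return [
        name
        for name, rule in RULES.items()
        # fullmatch checks the entire value, so trailing junk is rejected.
        if rule.fullmatch(fields.get(name, "")) is None
    ]
```

An empty list means every field passed; the same patterns could be mirrored in the JavaScript check, but the server-side pass is the one that cannot be skipped.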
Re:regexp are way overrated (Score:2)
But a lot of stuff does inherently require messing with strings, and for that, the regular expression is a great general-purpose tool. It certainly beats the raw C library
Re:I disagree completely. (Score:2)
Who the hell can't implement regular expressions? It's pretty damned easy to do -- Kleene's Theorem isn't exactly rocket science.
Regular Expressions Haven't Changed (Score:3, Interesting)
Ummm.... (Score:4, Funny)
Very compressed contentlessness.
Re:Regular Expressions Haven't Changed (Score:3, Interesting)
PS: Note the difference between "regular expression" which is what they teach you about in CS classes, and "regexps", which is what programmers actually use in Perl and many other languages.
Re:Regular Expressions Haven't Changed (Score:2)
Well, I guess your professor also told you that regular expressions could be used for pattern matching in computers (not just generating strings that were members of a language L). And in this case, that there were two alternatives for implementation, either as a deterministic or a non-deterministic automaton. And since a non-deterministic automaton can be converted to a deterministic one by some simple rules, that leaves only one reasonable alternative for implementation of regular expression pattern matchers on computers: the deterministic one.
There are two problems with this: First of all, the conversion from nondeterministic to deterministic automata can lead to a state explosion, and second, you might want to add new features to a regexp engine, making it recognize more than just what can be described by a regular expression, and this can be easier to do if your implementation does not use the classical deterministic finite state machine. Some implementations (Perl) choose to say they use non-deterministic regexp engines, and while that might be formally meaningless, it gives a pretty good idea of how it works informally.
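The state explosion can be seen with a textbook example: the language "the n-th symbol from the end is an a", i.e. (a|b)*a(a|b)^(n-1). A non-deterministic automaton for it needs n+1 states, but the deterministic one obtained by the subset construction has 2^n. The sketch below is a minimal illustration of that, not how any real regex engine is built; the function names are made up.

```python
from itertools import chain

def nth_from_end_nfa(n):
    """Transition table of an NFA with states 0..n for (a|b)*a(a|b)^(n-1)."""
    delta = {(0, "a"): {0, 1}, (0, "b"): {0}}  # state 0 guesses where the 'a' is
    for i in range(1, n):
        delta[(i, "a")] = {i + 1}
        delta[(i, "b")] = {i + 1}
    return delta  # state n is accepting and has no outgoing transitions

def subset_construction(delta, start, alphabet="ab"):
    """Return the set of DFA states (sets of NFA states) reachable from start."""
    start_set = frozenset([start])
    seen = {start_set}
    todo = [start_set]
    while todo:
        current = todo.pop()
        for sym in alphabet:
            nxt = frozenset(
                chain.from_iterable(delta.get((q, sym), ()) for q in current)
            )
            if nxt not in seen:
                seen.add(nxt)
                todo.append(nxt)
    return seen

def dfa_state_count(n):
    return len(subset_construction(nth_from_end_nfa(n), 0))
```

Here `dfa_state_count(3)` is 8 and `dfa_state_count(10)` is 1024, against 4 and 11 NFA states respectively, which is the blow-up the comment refers to.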
Re:Regular Expressions Haven't Changed (Score:2)
Without doing a formal proof, I'm still fairly certain that positive and negative look-behind are still equivalent to classical regular expressions. Backreferences, however, make perl regular expressions into another beast entirely, as do independent subexpressions (i think), code refs, and postponed sub expressions.
Re:what? (Score:2, Interesting)
The Perl "RE" "(a+)b\1", matched against the whole string, will match aba and aaabaaa, but not abaa or aaba.
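A quick way to try the backreference is sketched below in Python, whose `re` module supports the same `\1` syntax. The sketch assumes the pattern must cover the whole string (hence `fullmatch`); with an unanchored search the pattern would also find "aba" inside "abaa". The helper name is made up.

```python
import re

# \1 must repeat exactly what group 1 captured, so the number of a's on
# each side of the b has to be equal; no finite automaton can count that.
PATTERN = re.compile(r"(a+)b\1")

def whole_match(s):
    return PATTERN.fullmatch(s) is not None

results = {s: whole_match(s) for s in ["aba", "aaabaaa", "abaa", "aaba"]}
```

The language being matched here, a^n b a^n, is not regular, which is why backreferences take "regexps" beyond classical regular expressions.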
Getting started with regular expressions (Score:5, Informative)
For a nice way to get started with regular expressions I recommend the wonderful "txt2regex" console program. It provides a simple text based wizard-like interface. You answer questions and the program builds your regular expression for you. See:
http://txt2regex.sourceforge.net/
Re:Getting started with regular expressions (Score:2)
behind-the-scene purpose (Score:5, Informative)
I wrote the article so that people would have a feel for what's new in the book. Of course, my hope is that people are interested in the new content, but my general feeling is that the worst that can happen is that someone buys the book and finds out that it's not what they expected. Unmet expectations pretty much suck, and I hope the article helps avoid some of that suckage.... and piques some interest, as well.
Jeffrey
Re:behind-the-scene purpose (Score:3, Interesting)
Everybody here and there is going to say how informative it is. But what struck me the most is that it is well written.
It was very pleasant to read it, apart from the knowledge I got from it. If only all manuals
Well written (Score:2)
Which is why O'Reilly is the first place I look for a book. The ratio of well/badly written books is better there than anywhere else. The only books I will order online. All others, I want to page through them in a bookstore first.
Re:behind-the-scene purpose (Score:2)
I wrote the article so that people would have a feel for what's new in the book.
As with almost every other programmer out there, I agree that "Mastering Regular Expressions" is one of the best-written and most useful programming books there is. I know a lot of people would probably buy the second edition regardless. But the article/book review cemented my decision, since it covers Java and PHP (and even that wacky MS stuff, huh?).
Friedl's book is a must read for Perl folks (Score:5, Insightful)
I was a full time Perl programmer (with a two hour commute by rail) when Friedl's book came out. I read it cover to cover, and then recommended it strongly to my co-workers.
Friedl shows how to write powerful, readable, efficient regular expressions that can do a lot of the work your program needs to do. It changed how my group wrote Perl (very much for the better). This is more than highly recommended; after the Blue Camel, and even before the Cookbook, this is a definitive book for all those who call themselves "Perl programmers."
(In the first edition of the book, Friedl discovered some problems with regular expressions in early versions of Perl 5. The very next release of Perl -- 5.003, I think -- immediately fixed these problems. When Larry & Co. pay attention to a Perl book, maybe you should, too?)
What?! (Score:2, Funny)
are they those
I'm really, really new to perl, studying it out of an O'Reilly book. What does this mean to me?
Re:What?! (Score:2)
regexps are a very powerful search/replace tool. One of the reasons Perl is so popular is it has a powerful, easy to use (and by this, I also mean easy to invoke; ever try this in C? yeeesh) regular expression parser. Makes text processing very easy.
If you're learning Perl out of the Camel book, you'll be fine. It has a good explanation of it. Once you see the power of it, you'll likely wonder how you got along without it.
Re:What?! (Score:3, Informative)
perl 6 is gonna change all this (Score:4, Insightful)
I understand that Perl 6 isn't near being done, and that the "r" in "Perl" doesn't necessarily stand for "regex", depending on who you ask, but Perl will always have the greatest influence over what is called a regex. Or is that going to change with Perl 6?
Irregular Expressions... (Score:2)
But, of course, everyone knows that a real coder uses irregexps in disassembly language.
may I please? (Score:2)
My request (Score:2)
-a
Re:My request (Score:2)
Having whitespace be insignificant by default should help a great deal with readability, as will the efforts to make regex syntax more consistent. The ability to embed Perl 6 objects into regular expressions should also lead to some interesting developments.
Re:My request (Score:2)
I always want to do a simple search and replace in a shell script a la echo "$TEXT" | sed "s/$FILENAME/xyz/". But $FILENAME is bound to contain some special characters, such as '/' or '.'. I end up using "s,$FILENAME,xyz,", but every once in a while I still get strange results. Can Perl do any better?
-a
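The usual cure for this is to quote the variable before it is interpolated into the pattern, so that characters like '.' are taken literally. A minimal sketch in Python using `re.escape` follows; the `replace_filename` helper is a made-up name, and for a purely literal replacement `str.replace` would do the same job without any regex at all.

```python
import re

def replace_filename(text, filename, replacement="xyz"):
    """Replace literal occurrences of filename, whatever characters it holds."""
    # re.escape neutralizes regex metacharacters such as '.', '/' or '*'.
    # The lambda keeps backslashes in the replacement from being interpreted.
    return re.sub(re.escape(filename), lambda _match: replacement, text)
```

With this, "/tmp/a.b.txt" matches only itself and not, say, "/tmp/aXb.txt", which is the kind of strange result an unescaped '.' produces.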
Re:My request (Score:2)
Re:My request (Score:2)
$ foo=bar
$ echo ${foo##ba}
r
Very useful stuff..
Regex Accelerator! (Score:3, Informative)
Re:indeed (Score:2, Funny)
A heart for porn?
Disagree, Personal Experience (Score:5, Informative)
1) Identify heirarchical relationships that were only denoted by standard oldered list types (1,1a,2,2a,3, I, II, etc).
2) Insert html markup to reproduce proper highlighting for names and indented lists.
3) Generate internal HTML links between individuals and their unique GEDCOM (LDS Genealogy) numbers within the document.
4) Build an index for chapters and an appendix to link from name, sorted by surname, back into the main document.
5) Add special markup for converting the end HTML into indexed and linked PDF using HTMLDoc.
Time to complete the job -2 Weeks. Without the use of regular expressions this task would have been almost impossible, and all the work my uncle did to put the information together over the last two years of his life would have been lost.
Negative numbers (Score:2)
That's pretty cool...regexps let you finish jobs two weeks before you start them.
Re:Disagree, Personal Experience (Score:2)
Re:indeed (Score:3, Informative)
That is why research into regexps is doomed to failure. It is a dead end. From a theoretical standpoint, regexps are cute and interesting, but for serious data prowling, you need something with a brain and a heart.
While I agree that for large amounts of data you need something other than a regex, that certainly doesn't mean that regexes are dead or that we shouldn't try to make them better! I don't need Google's search algorithm to make sure my users' input matches certain parameters, and I would really hate to have to write
if $input contains really_evil_characters() die;
Regex is here to stay
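For the input-checking case, a minimal Python sketch of the regex version. The allowed character set, the 32-character limit, and the name is_safe are assumptions chosen for illustration, not anyone's actual policy:

```python
import re

# Assumed whitelist: letters, digits, underscore, dot and hyphen, 1 to 32 characters.
SAFE_INPUT = re.compile(r"[A-Za-z0-9_.-]{1,32}")

def is_safe(user_input: str) -> bool:
    """Return True only if the whole input consists of whitelisted characters."""
    # fullmatch requires the pattern to cover the entire string,
    # so a stray newline or shell metacharacter causes a rejection.
    return SAFE_INPUT.fullmatch(user_input) is not None

print(is_safe("report_2002.txt"))   # True
print(is_safe("rm -rf /; echo hi")) # False
```

Listing what is allowed is usually easier to get right than trying to enumerate every evil character.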
Re:Now, if only Google would support regexp search (Score:2)
Regexp are horrible from a complexity point of view.
According to this [plover.com] link, a regexp's complexity is O(M*N), where M unfortunately is on the order of Google's DB, if my short calculation is correct. Note, this may be wrong, but the point stays that regexp searching is quite expensive and kills most of the optimizations you could do if you didn't want to provide them.
Re:Now, if only Google would support regexp search (Score:2, Insightful)
The problem with regular expressions is that there are so many constraints. for example:
As you can see, even with a very simple regular expression like this, the text has to be processed a lot to get the results needed. A simple "John AND Doe" would match all of the results, while the regular expression puts more constraints on the search, which takes longer to process. For complex regular expressions, the searching of text becomes too slow for large amounts of data, such as the internet.
Re:Now, if only Google would support regexp search (Score:2)
I'm sorry, but
<John.+Doe>
should not match "JohnDoe", and should match "John Doe". You need one or more characters between John and Doe in that regexp.
John AND Doe doesn't do shit for you in search engines either. I like the NEAR clause when I am searching for information because I often have to find things like "scanPORT specification" and I end up getting pages talking about a module with scanPORT and the specifications for the module, instead of for scanPORT. Having a NEAR clause or even a <scanPORT.{,30}specification> would help.
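Both claims are easy to check. A small Python sketch follows; the sample sentences and the helper name near are invented for illustration, and the bound is written as {0,30}, which means the same thing as the {,30} above:

```python
import re

john_doe = re.compile(r"John.+Doe")
print(john_doe.search("JohnDoe") is not None)   # False: .+ needs at least one character
print(john_doe.search("John Doe") is not None)  # True

# A poor man's NEAR: at most 30 characters between the two words.
# Note that '.' does not cross line breaks unless re.DOTALL is given.
NEAR = re.compile(r"scanPORT.{0,30}specification")

def near(text: str) -> bool:
    """True if 'specification' follows 'scanPORT' within 30 characters."""
    return NEAR.search(text) is not None

print(near("the scanPORT interface specification"))  # True
print(near("scanPORT is used by this module. See the vendor site for the module specification"))  # False
```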
Re:indeed (Score:2, Insightful)
Re:indeed (Score:5, Informative)
I understand your thinking.
But your thinking is wrong.
Think about it (no pun intended).
How much better would google be if one could use regexps in one's search request.
regexp and datamining are orthogonal.
Re:indeed (Score:3, Funny)
Regexps are interesting, sure.
Not really. I use them all the time and the only time they are interesting is when you're done and they look completely silly.
Every CS student enjoys (or suffers through!) the regexp section of their Intro to Computability (or equivalent) course.
Not really. I got a degree in Computer Engineering from the #2 private engineering school in the country and I was never taught regex. If you know how to program and not just crank out syntax, you can pick up regex on your own pretty fast.
And it is pretty fun thinking about the expressive power of, say (a|b)*a*b*
That is actually a really boring regex. Lots of a's or b's followed by lots of a's followed by lots of b's. Wow. My brain is fried.
However, we have to face the facts, that regexps, as good as they are from a mathematical standpoint at matching things, just aren't that helpful in sorting through the sea of data that is the Internet.
Wow. You're probably right. I'll bet nothing that searches for things on the internet, such as google.com, uses any regex internally in their code. Now that I'm facing the facts, you're right, regex is worthless when it comes to searching through any amount of data.
The input data just aren't orderly enough for regexps to be of any use.
Yeah, regex is best used for very very simple patterns. Anything more complex than your above example is best suited for some serious hand-parsing in visual basic.
Think about it: when you are looking for wares or porn, where do you go? Perl? Nope.
I don't know WTF you're talking about. I find ALL my porn at www.perlmonks.org
That is why research into regexps is doomed to failure.
Yeah, I should probably throw away all that perl regex code I've written that's made my company lots (and I mean lots) of money in the market. It is doomed. I should be writing my pattern matching code in the google.com language.
Thank you for posting about something you apparently know very little about. Good for an afternoon giggle.
Re:indeed (Score:2)
To be a little pedantic the original poster probably meant being taught regular expressions in a formal language theory framework, where one talks about properties of computability. The same course would teach things like finite state machines (which in terms of computability are equivalent to classical regular expressions though I think not to perl regexps), context free grammars (pushdown automata) and turing machines, and just general computability (and maybe complexity) theory. All of these things have a great deal to do with how programs work, and the lack of such theory is probably actually one of the drawbacks of doing something like computer engineering over computer science. (at least from my perspective)
Re:indeed (Score:2)
Actually that regexp matches any text at all. * is 0 or more matches, not one or more. Personally I think the really interesting regexps use lookaheads but that's just me.
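A quick Python check of that, with one caveat: an unanchored search succeeds on any text because the empty match is allowed, while a whole-string match accepts exactly the strings made of a's and b's, which is the same language as the shorter (a|b)*. The helper same_language is invented for this sketch and only compares the two patterns on short strings:

```python
import re
from itertools import product

PATTERN = re.compile(r"(a|b)*a*b*")
SIMPLER = re.compile(r"(a|b)*")

print(PATTERN.fullmatch("abba") is not None)  # True
print(PATTERN.fullmatch("abc") is not None)   # False: 'c' is not covered
print(repr(PATTERN.search("xyz").group()))    # '' (an empty match at position 0)

def same_language(max_len: int = 4, alphabet: str = "abc") -> bool:
    """Compare both patterns on every string up to max_len over the alphabet."""
    for length in range(max_len + 1):
        for chars in product(alphabet, repeat=length):
            s = "".join(chars)
            if (PATTERN.fullmatch(s) is None) != (SIMPLER.fullmatch(s) is None):
                return False
    return True

print(same_language())  # True
```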
Re:Validate XML? (Score:5, Informative)
One of the important aspects of using regexes is to know their limits and not try to use them outside of those limits.
Re:Validate XML? (Score:2)
I might not get every single bit of the technical vocabulary right on this one, but do try to follow along anyhow (and please only refute the REALLY glaring errors).
Basically, with regular expressions, you get what Chomsky (famed linguist and political extremist...er, nut =) ) referred to as a type-3 grammar, or roughly something that can be recognized by a deterministic finite-state automaton (DFA. Ok, you might argue that you get an NFA (nondeterministic finite-state automaton), but using the subset construction, it's so easily converted to a DFA, we'll just pretend we're working with a DFA.)
Basically, a DFA works like this: Think of a table. You start out at one row (the starting state) and based on the input you get, you move to another row/state. One or more of these states is specially marked as an accepting state, so if you run out of input characters on one of those states, the string is accepted and everyone is happy. If you run out of input on any state not marked as such, the string is rejected. (DFA's are often expressed as graphs as well, but from a programmatic perspective, it's really easy to just use a table, or for the pedantic, a list of nodes (containing the current state, and where to go on any possible input...a list of lists)).
Maybe we can go more simple than that: You're sitting on a nerdly board game. You draw a card that says "B. Go to the square labeled as 'R'", so you go to R and draw another card. You keep picking cards and following the directions on them until you get "$: The game is over. If you're on a square that is labeled with an underlined letter, you win. If the letter is not underlined, you lose."
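The table idea fits in a few lines of Python. This is only a sketch: the example machine (it accepts strings of a's and b's that end in "ab") and the names TRANSITIONS, ACCEPTING and accepts are made up for illustration.

```python
# One row per state; each row says where to go on each input symbol.
TRANSITIONS = {
    0: {"a": 1, "b": 0},  # start state: nothing useful seen yet
    1: {"a": 1, "b": 2},  # the last symbol was 'a'
    2: {"a": 1, "b": 0},  # the last two symbols were 'ab'
}
ACCEPTING = {2}  # the "underlined squares" of the board game

def accepts(text: str) -> bool:
    """Run the DFA over text and report whether it stops on an accepting state."""
    state = 0
    for symbol in text:
        row = TRANSITIONS[state]
        if symbol not in row:
            # No card for this symbol: treat it as falling into an error state.
            return False
        state = row[symbol]
    # Out of input: accept only if we are standing on an accepting state.
    return state in ACCEPTING

print(accepts("aab"))  # True
print(accepts("aba"))  # False
```

The machine never looks back at earlier input; everything it knows is the number of the row it is currently on.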
So what does this mean? In terms of compiler/language theory, we can use a regex to recognize tokens (or individual words), but they aren't very powerful when it comes to syntax (Our lexer would be happy with "sentence This a is.", but by our grammar, it doesn't make a lot of sense. We, as people, could guess the meaning, but computers are still really bad at guessing anything (especially your weight). A parser would be necessary to figure out if things make any sense by the rules of the grammar, which would refuse "sentence This a is." but accept "This is a sentence.") If you're setting up the rules of our example nerdly board game, you could set up a number of states that could find any word in your language. (If the first state is "E" and the next state is "a" and the next state is "t", followed by "$" (commonly used for end of input, but in your language you might specify the end of the word being a space or some bit of punctuation instead), you'll have successfully parsed "Eat", which by your rules is considered to be a valid accepting state. In the same sense, if you pick "R" followed by "Z", you might move off to some error state you've specified, where no matter what input follows, you'll always loop back around to that same state, because you know for certain there's no word you want to accept that starts with "RZ".)
So to answer the question "can regexp validate XML?", the answer is yes, in the sense that it can be used to scan for valid XML components (words), and no, in that it can't tell well-formed XML from poorly-formed XML tags (sentences). A regex alone isn't quite powerful enough to understand that ">>>>XML" and "<XML>" aren't both perfectly acceptable.
Sort of.
Could you write enough rules that some really large set of regexes could do it? Maybe. The mathematical proof is way out of my league, but I'll warn you now: you'll be writing so many cases for every possible permutation that you'll probably go batty trying. Part of what all this language theory got us was an understanding that some tools are good at one task, but lousy at others.
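The split between "words" and "sentences" can be sketched in Python: a regex picks out the tags, and a stack (which a plain regex doesn't have) checks the nesting. This is a toy under loud assumptions, not an XML parser: the TAG pattern ignores comments, CDATA, processing instructions and attribute values containing '>', and the name tags_nest_properly is invented here.

```python
import re

# Token level: something a regex handles well. Groups: closing slash, tag name,
# self-closing slash. Deliberately simplistic; see the assumptions above.
TAG = re.compile(r"<(/?)([A-Za-z_][\w.-]*)[^<>]*?(/?)>")

def tags_nest_properly(document: str) -> bool:
    """Check that every opening tag is closed in the right order."""
    open_tags = []  # the memory a finite automaton lacks
    for match in TAG.finditer(document):
        closing, name, self_closing = match.groups()
        if self_closing:
            continue  # <br/> opens and closes itself
        if closing:
            # A closing tag must match the most recently opened one.
            if not open_tags or open_tags.pop() != name:
                return False
        else:
            open_tags.append(name)
    return not open_tags  # anything left over was never closed

print(tags_nest_properly("<a><b>text</b><br/></a>"))  # True
print(tags_nest_properly("<a><b></a></b>"))           # False
```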
If you're interested in this further, the Dragon book (search for it on google, you'll find it as "dragon book" faster than its real title, which I've forgotten) is considered the canonical source for this sort of thing, although it can be horribly dry and hard to read. There are some other compiler theory books out there, and some aren't quite as dull (though arguably less informative. I wasn't able to prove my nerdliness by reading more than a handful of pages of the dragon book, though I found it to be a great reference for filling in the gaps of the other books (which were more prone to shameless hand-holding))
comp.compilers can be a good source as well, though sometimes a bit intimidating. Read through it, see if you can find references to the stuff you don't really understand, and just try to absorb what's there.
-transiit
Re:Validate XML? (Score:2, Informative)
You're correct in saying that regexps alone can't validate XML (or any hierarchical structure, come to that). This is an instance of the bracket-matching problem: given a string composed of opening and closing brackets that can nest, determine whether the string is properly balanced or not. For instance, ()() and (()()) are balanced, while (() and (())) are not.
The reason that a regexp can't do this is that it can't keep track of which opening brackets haven't been closed. A regexp has no memory of what it's already seen. All it knows is what state it's in now, and what token is coming next. OK, some programming languages implement regexps in such a way as to provide some sort of memory of what's been seen, but these usually feel like kludges.
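With even a little memory the problem is trivial. A minimal Python sketch using a single unbounded counter, which is exactly what a finite automaton doesn't have (is_balanced is a name made up for this example):

```python
def is_balanced(brackets: str) -> bool:
    """Return True if every '(' is closed by a later ')' and nothing is left open."""
    depth = 0  # number of opening brackets not yet closed
    for ch in brackets:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                # A closing bracket arrived with nothing open.
                return False
    return depth == 0

for sample in ["()()", "(()())", "(()", "(()))"]:
    print(sample, is_balanced(sample))
```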
If you're prepared to put up with an arbitrary limit on how deeply you can nest brackets, then you can solve the bracket-matching problem with an automaton that has N states, numbered 1 to N. If the automaton is in the state numbered x, that means that it's seen x-1 opening brackets that haven't been closed yet. The instructions for each state would be "if you see an opening bracket, go to state x+1, if you see a closing bracket, go to state x-1, and if you see the end of the string, it isn't balanced." Exceptions would be that in state 1, if you see the end of the string, it's balanced, and if you see a closing bracket, it isn't balanced. In state N, if you see an opening bracket, the brackets are nested too deeply.
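That bounded automaton can be written out as an explicit table. A Python sketch, assuming state x stands for x-1 unclosed brackets so that state 1 is the balanced start state, as in the exceptions above; build_bracket_dfa, check and the default of four states are made up for illustration:

```python
def build_bracket_dfa(n_states: int) -> dict:
    """Build the transition table for states 1..n_states."""
    table = {}
    for state in range(1, n_states + 1):
        row = {}
        if state < n_states:
            row["("] = state + 1  # one more bracket left open
        if state > 1:
            row[")"] = state - 1  # one open bracket closed
        table[state] = row
    return table

def check(brackets: str, n_states: int = 4) -> str:
    """Classify a string of brackets as 'balanced', 'unbalanced' or 'too deep'."""
    table = build_bracket_dfa(n_states)
    state = 1
    for ch in brackets:
        if ch not in "()":
            raise ValueError("only '(' and ')' are expected")
        next_state = table[state].get(ch)
        if next_state is None:
            # Missing entries are the exceptions: '(' in state N, ')' in state 1.
            return "too deep" if ch == "(" else "unbalanced"
        state = next_state
    return "balanced" if state == 1 else "unbalanced"

print(check("((()))"))    # balanced
print(check("(((())))"))  # too deep
print(check("(()"))       # unbalanced
```

With four states the deepest nesting it can follow is three, which is the arbitrary limit in action.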
Of course, no theoretical computer scientist would ever accept arbitrary limits on how deeply a structure could be nested, which is why you would use a context-free (aka type 2) grammar to solve problems like this one.