Want to read Slashdot from your mobile device? Point it at m.slashdot.org and keep reading!

 



Forgot your password?
typodupeerror
Programming It's funny.  Laugh.

Regex Golf, xkcd, and Peter Norvig 172

Posted by samzenpus
from the problem-to-solve dept.
mikejuk writes "A recent xkcd strip has started some deep academic thinking. When AI expert Peter Norvig gets involved you know the algorithms are going to fly. Code Golf is a reasonably well known sport of trying to write an algorithm in the shortest possible code. Regex Golf is similar, but in general the aim is to create a regular expression that accepts the strings in one list and rejects the strings in a second list. This started Peter Norvig, the well-known computer scientist and director of research at Google, thinking about the problem. Is it possible to write a program that would create a regular expression to solve the xkcd problem? The result is an NP hard problem that needs AI-like techniques to get an approximate answer. To find out more, read the complete description, including Python code, on Peter Norvig's blog. It ends with this challenge: 'I hope you found this interesting, and perhaps you can find ways to improve my algorithm, or more interesting lists to apply it to. I found it was fun to play with, and I hope this page gives you an idea of how to address problems like this.'"
This discussion has been archived. No new comments can be posted.

Regex Golf, xkcd, and Peter Norvig

Comments Filter:
  • by Amorymeltzer (1213818) on Sunday January 12, 2014 @03:24PM (#45933553)

    http://regex.alf.nu/ [regex.alf.nu]

    Some favor trickiness, some favor just listing possibilities, but it's fun. I'm at 3651.

    • Re: (Score:3, Informative)

      by Garridan (597129)
      Or, if you want to try your hand at meta regex golf [stackexchange.com] there's a place for that, too.
    • Be sure to allow cookies, else the site won't work.

      Cookies are an essential part of websites nowadays, dontcha' know...

    • by schitso (2541028)
      Why would you do this? I was hoping to actually have a productive day.
    • by hugetoon (766694)

      Yep, lots of fun, thank You.

      Got 3912 but I'm really confused by the "Alphabetical" level , I could craft a regex to discriminate both set that scores at 262, however I suppose it is considered cheating since I do not implement any underlying "principle" here because I cant recognize it. Any clues of what was the intent of this level?

    • Aren't there any solutions posted anywhere? This is a good exercise if you want to spend time puzzling out the more arcane features of regular expression evaluators. Some of these aren't fully working solutions.

      • Plain strings: f.o
      • Anchors: k$
      • Ranges: [a-f]{4}
      • Backrefs: (...).*\1
      • Abba: .u|il[^l]|(.)[lm]\1|^[pv]|z|[mr]it
      • A man, a plan: (.)(.)[^r]?\2\1
      • Prime: ^xxx?$|^x{5}(xx(xxxx(xx(xxxx(xx(xxxx(x{6}(xx)?)?)?)?)?)?)?)?$|x{37}
      • Four: (.)(.\1){3}
      • Order: ^[^o].{1,5}$
      • Triples: 0.$
      • Glob: ^[blp]|^.[foprv]|^re|^mi
  • by Hognoxious (631665) on Sunday January 12, 2014 @03:24PM (#45933555) Homepage Journal

    ... that can find frost psits?

  • Now build a robot that can design a program to write a regex expression to solve the xkcd problem.
  • RegExps (Score:5, Interesting)

    by ledow (319597) on Sunday January 12, 2014 @03:41PM (#45933639) Homepage

    Regexp's are a programming language unto themselves.

    I'm currently doing some temp IT work for schools while my promised job becomes available and it's eye-opening. The web-filtering is all reg-exp based but nobody understands how it works.

    They just copy/paste an example and change the parts of the URL that they can see to match the one they want. They barely bother to test the impact, past the site they need becoming "unfiltered" or "filtered" as necessary (i.e. no implication of knock-on effects on other sites with similar names). Let's not even mention the use of "." without the escape character for them to mean a literal period (but, obviously, it means "any character" in a regexp).

    I talked to them about changing their template regexp because, from the start, I could see that it wasn't really up to the job and just met if not opposition then at least apathy about the problem.

    Until someone brought an iPad into the helpdesk where a site that was supposed to be unfiltered was filtered - because nobody had considered what happens if you use "http://example.com" instead of "http://www.example.com". I was the one to spot it, and tell them that it's because their regexp was very basic.

    The good thing was, the other tech on the team was young and keen to learn and I was able to give them a quick rundown of regexps and we crafted an alternative template for them to use that would take account of the situation without, for instance, the blocking of "microsoft.com" affecting "antimicrosoft.com".

    But it is amazing how many people I know that work in IT have no idea how to program, no idea how to handle regexps, and just work on a "copy a working example" basis.

    • Re: (Score:3, Funny)

      by Hognoxious (631665)

      Regexp's are a programming language unto themselves.

      It would appear that apostrophes are too.

    • Re:RegExps (Score:4, Insightful)

      by Anonymous Coward on Sunday January 12, 2014 @03:51PM (#45933703)

      ...and then there's the people who think they understand regexes, but try to use them to decide context-free grammars.

    • Just wait till you play with Privoxy. It uses different sorts of regex for different parts of the URL pattern. You then have two additional problems, rather than just one ;-)

    • Re:RegExps (Score:5, Funny)

      by Anonymous Coward on Sunday January 12, 2014 @04:00PM (#45933753)

      But it is amazing how many people I know that work in IT have no idea how to program, no idea how to handle regexps, and just work on a "copy a working example" basis.

      You will be truly amazed by the number of people who copy a not-working example...

    • Re: (Score:2, Informative)

      by Anonymous Coward

      If regular expressions are a programming language, then they are not a very powerful one. The languages they recognize are the so-called regular languages, which are the least expressive category in the Chomsky hierarchy of languages (note the difference between the language of regular expressions and the languages recognized by regular expressions). See the Wikipedia articles on regular languages [wikipedia.org] and the Chomsky hierarchy [wikipedia.org] for details.

    • by Anonymous Coward

      But it is amazing how many people I know that work in IT have no idea how to program

      Is it? Programmers routinely make 30% more. Why should an IT person, with their entire forte of knowledge and skills, ever need to take on someone else's and even if they did, why would they keep their low salary.

      It is akin to an auto mechanic that knows how to use a CNC machine to make replacement parts. Sure it would be convenient to hire a $10/hr mechanic that can do $40/hr G-code, but what is in it for the mechanic?

    • by Anonymous Coward

      This is pretty common. It is not some new, strange thing only you have seen:

      http://en.wikipedia.org/wiki/Cargo_cult_programming

      People go through life with blinders on, because life is complicated and our minds try to ignore things that it figures aren't important. A lot of people think details of one sort or another aren't complicated or relevant, then complain when things don't work for some reason. That's why empirical reasoning and the scientific method is cool; it works against this sort of logical f

    • by raymorris (2726007) on Sunday January 12, 2014 @10:37PM (#45936121)

      For matching URLs from a domain, here's a regex we came up with that covers some special cases. Hopefully Slashdot doesn't mangle it too badly.

      https?://([a-z0-9\-]+\.)*foo\.com(/|$)

      That covers:
      https as well as http
      "subdomains" like maps.google.com as well as www.google.com and google.com
      It's not fooled by google.com.hacker.ru

      • That may be troublesome... Note that while DNS comparisons are supposed to be performed case-insensitively, uppercase letters are valid characters.

        If your regex-based web filter doesn't canonicalize the domain name first and/or use a case-insensitive regex match, users may be able to circumvent the filter with http://www.foo.com/ [foo.com] (or even http://foo.com/ [foo.com]).

        Of course, browsers these days do so much "helpful" manipulation on anything typed in the URL bar, I'm not sure you can slip a mixed-case domain through t

        • That's true, the regex is used with the non-case-sensitive flag. We use it with mod_rewrite, so it would be[NC] something like:
          RewriteCond %{HTTP_REFERER} https?://([a-z0-9\-]+\.)*foo\.com(/|$) [NC]

          In for example Perl the regex would be used within //i (or another delimiter for clarity).

      • by ledow (319597)

        Yep, pretty much what I crafted. Their problem was that their ()'s lacked the power to express that it had to have something-dot, or nothing at all (specifically the "nothing at all" part, so foo.com wasn't caught by a regexp for *.foo.com

  • by hankwang (413283) on Sunday January 12, 2014 @03:47PM (#45933677) Homepage

    The International Obfuscated C Code contest had a winning entry that could flag the names of US presidents as republican or democrat. [ioccc.org]

    main(int riguing,char**acters){puts(1[acters-~!(*(int*)1[acters]%4796%275%riguing)]);}

    Quoting: "This one-line C program accepts as a first command-line argument the last name of any of the last 31 US Presidents (from Franklin Pierce onwards), in lower case, and prints out their political affiliation. Use "republican" as the 2nd command-line argument, and "democrat" as the 3rd (or equivalent strings of your choice)."

    De-obfuscated, it is a boolean expression acting on a string s,

    (*(int*)s)%4796%275%4

    I wonder whether you can make a regexp that is shorter than this and accomplishes the same thing.

    • by hankwang (413283)
      Pardon, I forgot the "not" operator. De-obfuscated, it is a boolean expression acting on a string s,

      !((*(int*)s)%4796%275%4)

  • "Find it interesting"? I find it fucking terrifying. There's no way I'll be following that link. That's in bat country.

    • by ysth (1368415)

      No, actually it is far from terrifying. Just some brute force generation and application of simple, short, regex fragments. No "AI-like", no "algorithms going to fly".

      • by bunratty (545641)
        The algorithm is an AI algorithm. It's using hueristics to attempt to minimize some function, as many AI algorithms do.
      • Well now you've just ruined my terror. I'll have to go read up on the up-coming UK legal hilarity to restore it.

  • I'd have been extremely surprised if solving for the shortest regex golf pattern for a pair of lists weren't NP hard. And greedy approaches are fairly obvious.

    The point is, that's what makes it analogous to golf. The optimal solution is your hole in one. Some greedy algorithm solution is your par. Those aren't the interesting areas, they're just the end points.

    For those who couldn't work it out for the,selves, here are the rules of regex golf:

    How many characters you use in your regex is your score
    Lower is b

  • by v1 (525388) on Sunday January 12, 2014 @04:26PM (#45933921) Homepage Journal

    There's really no better way to scan a log file for odd log entries than to write a big regexp that filters out whitelisted entries. Lets you find log entries you're NOT expecting. (and occasionally, log entries that not even the developers are expecting)

    Editing them of course is a royal pain, (not to mention debugging!) so I usually write a script to compose the regexp. I just checked on one of mine, and it composes a 17,000+ character single-liner that scans my wired.log file.

    I've got a smaller one that keeps an eye on secure.log for anomalies.

  • by DogPhilosopher (1149275) on Sunday January 12, 2014 @04:34PM (#45933955)

    There's a field called Grammar Induction, and the problem of learning regular languages, aka regular inference, can be considered a subfield. People have been working on this since the '50s. Applications include learning DTDs for XML/wrapper induction, and all kinds of problems in bioinformatics and natural language processing.

    There's a strong link with the graph coloring problem, see
    http://www.cs.ru.nl/~sicco/papers/alt12.pdf [cs.ru.nl]

    In this field, the focus is generally on learning FSAs, but these can easily be transformed into regexps. There's work on learning regexps directly, see
    http://www.informatik.uni-trier.de/~fernau/papers/Fer05c.pdf [uni-trier.de]

    Enjoy.

    • by thogard (43403)

      The 70's research brought us Lex and Yacc. Bison and Flex just carry on the traditions.

  • by Anonymous Coward

    Sure, it's NP-hard, but it's also nonrecursive via Rice's theorem.

    Calling this NP-hard is like calling the Andromeda Galaxy "bigger than the average rhino".

    • by Anonymous Coward

      For clarification, this is referring to the problem of writing a program that identifies regex finders. The problem of finding the shortest regex is likely NP-hard, but it's not as obvious as the article attempts to imply.Giving a shitty algorithm to solve a problem, and then saying that it would also solve SET COVER, does not imply that any algorithm for solving the first problem can be used to solve SET COVER via a Karp reduction.

  • Python code? I would expect Perl or similar if parsing was your emphasis, or Lisp or some AI language if AI was your emphasis, but not a standard block-structured language.

    • by bunratty (545641) on Sunday January 12, 2014 @05:37PM (#45934329)
      Python is very similar to Perl, and also has most of the characteristics that make Lisp a good language to program AI algorithms.
    • by retchdog (1319261)

      Then you're a moron.

      He's trying to communicate his neat ideas with as many people as he can, not optimize the shit out of something (in execution time, or length of code, or whatever). Python has become the standard language for communicating with a broad base of people who are not idiots, but not necessarily "real programmers" either. This is a good thing: it has basically replaced pseudocode.

    • Usually easier to use the language you know, rather than learn a whole new language for a project.
  • by LifesABeach (234436) on Sunday January 12, 2014 @05:51PM (#45934379)
    sowing only web pages for delta airlines; and not delta faucets? that's "googlely."
  • by Anonymous Coward

    Obligatory XKCD.

    Oh, never mind.

  • //solution is simple

    if a_1=nopaper

    if a_2=paper

    //problem with this solution though is that it tends to be gender specific,

    //and that is the problem with two compared variables in a statement it can easily lead to a 3turd condition.

  • by Shaterri (253660) on Monday January 13, 2014 @03:58AM (#45937413)

    My reading of Norvig's blog post is that he suggests his specific approach (stapling together short regexps with ORs) requires solving the NP-Complete Set Cover problem, but he doesn't actually say anything about whether the core problem (match everything in list A and nothing in list B) does; it's conceivable that e.g. some sort of dynamical programming approach could do the job more efficiently than Norvig's algorithm does. Does anyone know whether the root problem (to avoid having to do the optimization, just phrase it as 'is there a separating regexp for the sets A and B of length less than k?') is specifically known to be NP-complete?

  • This started Peter Norvig, the well-known computer scientist, director of research at Google and wearer of brightly colored shirts, thinking about the problem.

    Now I got to get me some brightly colored shirts.

If you have a procedure with 10 parameters, you probably missed some.

Working...