News for nerds, stuff that matters

Google Releases Tesseract as Open Source

Posted by ScuttleMonkey on Mon Sep 04, 2006 10:27 PM
from the bit-rot dept.
An anonymous reader writes "Google recently released Tesseract as open source. Originally developed at the HP Labs from 1985-1995, it has been touted as one of the most accurate Optical Character Recognition (OCR) programs available. Having sat on the shelf gathering dust for so many years, Google cleaned up some of the more outdated portions of the code and released it for general consumption. You can download Tesseract over at Sourceforge."
This discussion has been archived. No new comments can be posted.
The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
  • by OrangeTide (124937) on Monday September 04 2006, @10:30PM (#16041704)
    HOORAY! Good free OCR software is in short supply. I wonder if this will have a positive impact on Project Gutenberg?
    • Sonny Bono pwned Gutenberg by tepples (Score:2) Monday September 04 2006, @10:55PM
      • Re:Sonny Bono pwned Gutenberg by Anonymous Coward (Score:1) Monday September 04 2006, @11:01PM
      • Un-Finishable (Score:5, Interesting)

        by Kadin2048 (468275) <slashdot@kadin.xoxy@net> on Monday September 04 2006, @11:09PM (#16041908)
        (http://kadin.sdf-us.org/ | Last Journal: Tuesday October 16, @01:46PM)
        In all honesty, I doubt Project Gutenberg will have run out of pre-1923 books by the time that new stuff starts coming out of Copyright under the new rules. They have everything written by humanity before that date to digitize: not just English language books and "classics," but government documents, records, foreign language texts, ancient manuscripts ... everything. That's as close to an un-finishable task as you can set yourself, I think.

        Just assuming that somehow they did manage to digitize everything that was out of copyright, then I think what they should do is start archiving everything that they can. Even if they can't disseminate the information, they could still scan documents in and store them for later OCR-ing, thus preserving them against deterioration. I think this would be covered by fair use law even if the work was still protected. Perhaps this sort of archival work is not exactly the aim of PG, but it's still critically important.

        With that said, I don't mean to in any way excuse the disgusting abuse of our political and legal system that was and is the "Sonny Bono Copyright Term Extension Act." That thing is a disgusting example of pretty much everything that's wrong with our government today.
        [ Parent ]
        • Re:Un-Finishable (Score:5, Insightful)

          by mrchaotica (681592) * <<mrchaotica> <at> <yahoo.com>> on Tuesday September 05 2006, @12:58AM (#16042365)
          In all honesty, I doubt Project Gutenberg will have run out of pre-1923 books by the time that new stuff starts coming out of Copyright under the new rules.

          Your argument makes the fundamentally flawed assumption that the "new rules" will remain constant. The reality is that Copyright will continue getting extended so that new content never comes into Public Domain. (I hope the copyright fuckers are the first against the wall when the revolution comes!)

          Even if they can't disseminate the information, they could still scan documents in and store them for later OCR-ing, thus preserving them against deterioration.

          I'm sure they could even OCR them... they just couldn't make them available to the public. Of course, given the community-driven mechanism by which Project Gutenberg works, they couldn't legally distribute them to the volunteers either...

          [ Parent ]
        • Chastity Bono's next step is life+100 by tepples (Score:3) Tuesday September 05 2006, @08:47AM
        • Re:Un-Finishable by AnyoneEB (Score:2) Tuesday September 05 2006, @02:56PM
        • 1 reply beneath your current threshold.
      • Re:Sonny Bono pwned Gutenberg by technos (Score:2) Monday September 04 2006, @11:31PM
      • Re:Sonny Bono pwned Gutenberg by bersl2 (Score:2) Monday September 04 2006, @11:32PM
      • What else for Gutenberg? by DragonWriter (Score:2) Tuesday September 05 2006, @01:59PM
    • Isn't fully free / open source by oblique303 (Score:1) Monday September 04 2006, @11:49PM
    • Re:I take back every bad thing I said about Google by Commie1 (Score:2) Tuesday September 05 2006, @04:42AM
  • Anti-spam (Score:3, Interesting)

    by Bacon Bits (926911) on Monday September 04 2006, @10:30PM (#16041706)
    This should be useful for adding anti-image spam capabilities to FOSS anti-spam programs.
    • Re:Anti-spam by ZSpade (Score:1) Monday September 04 2006, @10:47PM
      • Re:Anti-spam by jrockway (Score:2) Monday September 04 2006, @11:50PM
        • I call bullshit (Score:5, Interesting)

          by quigonn (80360) on Tuesday September 05 2006, @12:16AM (#16042214)
          (http://synflood.at/blog/)
          The very first CAPTCHA implementation was broken, but the funny thing about CAPTCHAs is that it takes absolutely no effort to make an image completely unreadable for current OCR software. And even if one particular implementation is broken, just add another layer of distortion. The human brain is capable of coping with it; OCR software usually is not.

          And after all, it's not about authentication, it's about making a service accessible only for humans.

          BTW, it's funny that you praise your own cryptography solution in your blog, but it's obvious that you have the problem of replay attacks, you even mention it in the "caveat" section below the text box.
          [ Parent ]
          • Re:I call bullshit by quintesse (Score:1) Tuesday September 05 2006, @02:53AM
          • Re:I call bullshit by AaronLawrence (Score:2) Tuesday September 05 2006, @03:54AM
          • Re:I call bullshit (Score:4, Informative)

            by johansalk (818687) on Tuesday September 05 2006, @05:25AM (#16043248)
            If CAPTCHAs rely on humans, wasn't there an anti-CAPTCHA trick where spammers had people solve a CAPTCHA to get into some free porn, and then used those answers to get their bots through the legitimate sites the spammers wanted to get into?
            [ Parent ]
          • Re:I call bullshit by Ciarang (Score:1) Tuesday September 05 2006, @07:02AM
          • 1 reply beneath your current threshold.
        • Re:Anti-spam by tehcyder (Score:1) Tuesday September 05 2006, @06:54AM
        • 1 reply beneath your current threshold.
      • Re:Anti-spam by Phroggy (Score:3) Tuesday September 05 2006, @02:08AM
  • improvements (Score:5, Funny)

    by Anonymous Coward on Monday September 04 2006, @10:33PM (#16041726)
    Google cleaned up some of the more outdated portions of the code
    i.e., added AdSense to the OCR output.
    • Re:improvements by puddpunk (Score:1) Monday September 04 2006, @10:41PM
    • Re:improvements by Anonymous Coward (Score:1) Tuesday September 05 2006, @04:13PM
  • Hoping OCR will improve? (Score:3, Insightful)

    by smileytshirt (988345) on Monday September 04 2006, @10:34PM (#16041733)
    (http://www.taxcalc.com.au/)
    My guess is that they are doing this in the hope the open source community will build on and improve OCR technology. This would be in Google's interest, as it can then index text from images (such as their own Books project) more accurately and efficiently.
  • Finally! (Score:3, Funny)

    by nihilatron (32440) on Monday September 04 2006, @10:40PM (#16041753)
    Now I can finally see how to tell the difference between the 'A'-ness of 'A' and the 'P'-ness of 'P'!

    (Credit to S.G.)
    • 1 reply beneath your current threshold.
  • From the Project (Score:5, Insightful)

    by Gopal.V (532678) on Monday September 04 2006, @10:43PM (#16041772)
    (http://t3.dotgnu.info/ | Last Journal: Monday September 26 2005, @06:32AM)

    > It was open-sourced by HP and UNLV in 2005.

    So Google basically did what? Fixed bit-rot? Google has re-released some open source code, essentially forking off the original?

    > License: (None Listed)

    I'm a fan of the FOSS idea. Basically, it makes sure that the whole work to which I contributed always remains available to me (and others). It might not always work for a company, but as a developer it makes sense to me. And the second thing I need to see, after I see some code, is a license.

    So explain to me how exactly this is open source (other than the "compile, but don't touch" version of it) and *then* I might think of downloading it and probably fix a few bugs or write docs.

    • Re:From the Project by Sir_Lewk (Score:1) Monday September 04 2006, @11:07PM
    • Re:From the Project by kevlarman (Score:3) Monday September 04 2006, @11:10PM
    • License by mapinguari (Score:3) Monday September 04 2006, @11:11PM
      • Re:License by arose (Score:2) Tuesday September 05 2006, @12:22AM
        • Re:License by Strolls (Score:2) Tuesday September 05 2006, @08:05AM
          • Re:License by arose (Score:2) Tuesday September 05 2006, @08:20AM
      • Re:License by mrchaotica (Score:3) Tuesday September 05 2006, @01:03AM
        • Re:License by lisaparratt (Score:3) Tuesday September 05 2006, @02:34AM
          • Re:License by Arancaytar (Score:1) Tuesday September 05 2006, @07:21AM
    • Re:From the Project by 1 a bee (Score:1) Tuesday September 05 2006, @02:28AM
    • 2 replies beneath your current threshold.
  • I'm sorry Dave... (Score:5, Funny)

    by macadamia_harold (947445) on Monday September 04 2006, @10:44PM (#16041773)
    (http://www.google.com/)
    Originally developed at the HP Labs from 1985-1995, it has been touted as one of the most accurate Optical Character Recognition (OCR) programs available.

    Yeah, but how is it on lip-reading? That's when we really need to worry.
  • Hosting (Score:5, Interesting)

    by truthsearch (249536) on Monday September 04 2006, @10:44PM (#16041775)
    (http://seenonslash.com/ | Last Journal: Friday May 11 2007, @04:02PM)
    Is there any particular reason google isn't hosting [google.com] the project themselves?
  • Sourceforge? (Score:1)

    by JackieBrown (987087) <dbroome@gmail.com> on Monday September 04 2006, @10:44PM (#16041776)
    (http://debcentral.org/)
    I thought Google was opening up their own open source repository http://www.newsforge.com/article.pl?sid=06/07/27/1833251 [newsforge.com]
  • by sed@netcom.com (6179) on Monday September 04 2006, @11:02PM (#16041870)
    (http://durak.org/sean/)
    it would be great if tesseract [blogspot.com] could augment the gocr [sourceforge.net]-based FuzzyOCR [apache.org] and OCR [apache.org] plugins for SpamAssassin [apache.org].
    • MOD PARENT UP by Phroggy (Score:1) Tuesday September 05 2006, @02:11AM
  • Yay! (Score:2)

    No binaries! Only source code! Good luck getting it to compile on Windows, I gave up after I got several dozen obscure errors I had never seen before from the compiler.*

    * If anyone can get VC++2K5 to compile it, please post.

    • No luck for OS X either by lullabud (Score:2) Monday September 04 2006, @11:48PM
    • Re:Yay! by cduffy (Score:2) Tuesday September 05 2006, @06:13AM
      • Re:Yay! by kalidasa (Score:2) Tuesday September 05 2006, @10:28AM
      • Re:Yay! by slashkitty (Score:2) Tuesday September 05 2006, @03:03PM
    • Re:Yay! by Dishwasha (Score:1) Tuesday September 05 2006, @02:52PM
    • 1 reply beneath your current threshold.
  • my thoughts (Score:4, Interesting)

    by br00tus (528477) on Monday September 04 2006, @11:43PM (#16042078)
    I would love to use a free (speech and beer) OCR engine that works as well as a commercial one, or even nearly as well as a commercial one.

    I just checked out tesseract. One thing I have to look at more is the license. It appears to be the Apache license, which seems like a decent free license. But it also includes MITRE's aspirin. I'm not sure how dependent it is on aspirin and what the license restrictions of aspirin are.

    The two best free OCR engines out right now are clara and gocr. While they are the best, they are not that great yet. I just ran the same tiff I had run with those two (I also have the document in pbm and other formats). Tesseract did not read it, it bailed with "IMAGE::check_legal_access:Error:Can't seek backwards in a buffered image!"

    Clara and GOCR are written in C, Tesseract is written in C++, a language I don't know. Tesseract did well in the UNLV challenge so it probably has some good features. It does say it has no page layout analysis though.

    Hopefully this can be improved, or good parts of it can be borrowed and incorporated into gocr or clara. It couldn't handle my test that both clara and gocr could, but it probably has strengths the other two don't. One day hopefully we'll have a free OCR that handles things as automagically as the commercial ones do. I will see what I can contribute to that as well, although this is C++ and I don't know that language.

    • Re:my thoughts by Phroggy (Score:2) Tuesday September 05 2006, @02:13AM
    • Re: Aspirin by Ayanami Rei (Score:2) Tuesday September 05 2006, @06:00PM
    • Mod parent up by makomk (Score:2) Tuesday September 05 2006, @03:24PM
    • 1 reply beneath your current threshold.
  • Vividata works quite well (Score:2, Interesting)

    by GnuPooh (696143) on Monday September 04 2006, @11:47PM (#16042093)
    (http://www.gnupooh.org/)
    I've tried all the previous open source stuff and it was pretty much unusable. The accuracy was so bad that it was just easier to start typing. I got a few of the Windows programs kind of working under WINE, but then I discovered Vividata and it worked really well and could be called from the command line. This meant I could write my own scripts that used it. I used it quite a bit for Project Gutenberg and was very impressed. It's not cheap, but if you want to do OCR under Linux and can afford it, I recommend it.

    I would definitely prefer a Free Software solution so I'm excited about this development. Until this solution is really work-able (see the Google limitations, they're pretty serious), give Vividata a try.
  • by Frosty Piss (770223) on Tuesday September 05 2006, @12:18AM (#16042218)
    (http://www.nojailforpot.com/)
    In 1995 it was one of the top 3 performers at the OCR accuracy contest organized by University of Nevada in Las Vegas. However, shortly thereafter, HP decided to get out of the OCR business...

    Actually, shortly thereafter, HP decided to get out of the technology innovation business, and into the printer ink business.

  • W0W1 (Score:3, Funny)

    by Anonymous Coward on Tuesday September 05 2006, @12:21AM (#16042230)
    TH18 IS GRLAT NEWf4 FOR TH0Sj OF US U$1NZ BA) O(R RLCOGN1+ION!

    THAHKS, G00GLL!1!!!
  • by Bitsy Boffin (110334) on Tuesday September 05 2006, @12:51AM (#16042337)
    (http://www.gogo.co.nz/)
    This story is somewhat timely for me. I am secretary of a club, we have a large quantity of documents collected over the last 20 years or so, some hand written, some typed, forms, invoices, minutes of meetings, letters sent to and from etc etc. There are a LOT of documents.

    Lately I've been thinking about computerizing these documents into a web-based system, so that any of the club executive can search and pull out a document they need. We could also flag documents as "general release" so that people could read interesting stuff from our past. And of course it would also serve as a secure backup of our documents, in case of fire, theft, alien invasion...

    I think what is needed is a rough OCR system, that is, an OCR system that's not trying to be perfect, but can at least make about 50% accuracy on both typed and handwritten (without training!) documents, and preferably where it wasn't pretty certain it was correct, it would just skip words. The idea being that I'd run each document (big job, but doesn't matter if it takes a year) through a scanner, OCR it to get some searchable content, then store it as a PDF, or jpeg or something.

    Anybody know of such an (open source, or at least free as in beer) OCR system?
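The "rough OCR" idea in this comment (skip any word the engine is not fairly sure about, then index what is left) can be sketched in a few lines. This is only an illustration: the `(word, confidence)` pairs, the file names, and the 0.8 cutoff are all hypothetical, and nothing here assumes that any particular OCR engine reports confidences in this form.

```python
def index_terms(words, min_confidence=0.8):
    """Keep only words the (hypothetical) OCR engine was fairly sure about."""
    terms = set()
    for text, confidence in words:
        if confidence >= min_confidence:
            cleaned = text.strip(".,;:!?()\"'").lower()
            if cleaned:
                terms.add(cleaned)
    return terms


def search(index, query):
    """Return the documents whose kept terms contain every word of the query."""
    wanted = query.lower().split()
    return sorted(doc for doc, terms in index.items()
                  if all(word in terms for word in wanted))


# Hypothetical per-document OCR results: (recognized word, confidence in 0..1).
scanned = {
    "minutes-1998.pdf": [("Minutes", 0.95), ("of", 0.99), ("the", 0.97),
                         ("annua1", 0.41), ("meeting", 0.88)],
    "invoice-2001.pdf": [("Invoice", 0.93), ("for", 0.90),
                         ("hall", 0.86), ("hire", 0.35)],
}
index = {name: index_terms(words) for name, words in scanned.items()}
print(search(index, "minutes meeting"))
```

The scanned image stays the document of record; the text is only there to make it findable, which is why dropping doubtful words costs little.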
  • Non-English Charsets? (Score:4, Interesting)

    As there seems to be no documentation on the Sourceforge page about what this can actually do, does it learn or follow rules? If it learns, can it learn to recognize, say, Japanese characters?
  • Music OCR (Score:1)

    by Crabbyass (867531) on Tuesday September 05 2006, @01:16AM (#16042445)
    I tell ya, it'd be friggin' sweet if someone would work on making a functional Music OCR [wikipedia.org] program. Scanning a score using the piece-of-crap Photoscore [sibelius.com] into (the not-so-piece-of-crap) Sibelius [sibelius.com] always ends up taking longer than actually inputting the music manually. I don't know about others who dabble in this software, but I'm sick and tired of a piece of dust being interpreted as a meter change.
  • License issue: not free software (Score:2, Interesting)

    by hellgate (85557) on Tuesday September 05 2006, @01:18AM (#16042452)
    Parts of the Tesseract tar ball are under a "for non-commercial use" only license:

    This software is the copyright of Russell Leighton and the MITRE Corporation. It may be freely used and modified for research and development purposes. We require a brief acknowledgement in any research paper or other publication where this software has made a significant contribution. If you wish to use it for commercial gain you must contact The MITRE Corporation for conditions of use.

    The piece in question is a neural network simulator named Aspirin/MIGRAINES, presumably used for training. Pun away.

    • 1 reply beneath your current threshold.
  • by 5plicer (886415) on Tuesday September 05 2006, @01:55AM (#16042586)
    "Currently it builds under Linux with gcc2.95 and under Windows with VC++6". In other words, it won't compile under Mac OS X... yet ;)
    • 1 reply beneath your current threshold.
  • by niceone (992278) on Tuesday September 05 2006, @02:34AM (#16042721)
    (Last Journal: Tuesday June 19, @07:48AM)

    and I've never heard of this thing.

    Guess I should have got out of my cube more.

  • Test example of tesseract. (Score:2, Interesting)

    by dannycim (442761) on Tuesday September 05 2006, @02:42AM (#16042747)
    Screen captured some text from the article, used XV to transform into tif, changing image to monochrome.

    Input image: it has been touted as one of the most accurate Optical Character Recognition (OCR) programs available. Having sat on the shelf gathering dust for so many years, Google cleaned up some of the more outdated portions of the code

    Output text: ii has been lamed as one of lhe mos! accurale Oplical Characler Recognilion (OCR) programs available. Having sat on lhe shelf galhering dusk for so many years, Google cleaned up some of lhe more ouldaled porlions of lhe code

    I have no idea what kind of input is optimal, but for a first shot in the dark, that's not too shabby. I'll go play with it some more. (^,^)
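To put a rough number on a test like the one above, one common yardstick is character accuracy based on edit distance. The sketch below is a plain Levenshtein implementation applied to a fragment of the input and output quoted in this comment; it is not how the UNLV accuracy tests were scored, just one reasonable way to compare the two strings.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: fewest single-character insertions,
    deletions, and substitutions needed to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def char_accuracy(truth: str, ocr: str) -> float:
    """Fraction of the ground truth that survived OCR, clamped to 0..1."""
    if not truth:
        return 1.0 if not ocr else 0.0
    return max(0.0, 1.0 - edit_distance(truth, ocr) / len(truth))


# Fragment of the input and output text quoted in the comment above.
truth = "one of the most accurate Optical Character Recognition"
ocr = "one of lhe mos! accurale Oplical Characler Recognilion"
print(f"character accuracy: {char_accuracy(truth, ocr):.1%}")
```

Nearly all of the errors in that sample are a "t" read as "l" or "!", which is the kind of systematic confusion a per-character measure makes visible.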
  • License? (Score:2)

    by omeg (907329) on Tuesday September 05 2006, @02:54AM (#16042788)
    I don't get it. Isn't everything released on SourceForge supposed to be under a free license? Then how come this is released under no license? Perhaps I'm not looking on the right pages, but I can't seem to find anything besides the "none listed" on the main page of the project.
    • Re:License? by The Cisco Kid (Score:2) Tuesday September 05 2006, @08:23AM
    • Re:License? by robbak (Score:2) Tuesday September 05 2006, @08:10PM
    • 2 replies beneath your current threshold.
  • Build environment (Score:1)

    by maxwell demon (590494) on Tuesday September 05 2006, @03:26AM (#16042912)
    (Last Journal: Wednesday August 14 2002, @12:33PM)
    It may be nice to have the source of a tesseract [wikipedia.org]; however, those can only be built in a 4-dimensional space. So where do I get the build environment?
  • I always knew... (Score:1)

    by sam991 (995040) on Tuesday September 05 2006, @04:11AM (#16043048)
    (http://sam991.blogspot.com/)
    I always knew Google were powerful. I did not, however, know they had the power to open source the 4-dimensional analog of the (3-dimensional) cube, where motion along the fourth dimension is often a representation for bounded transformations of the cube through time.
  • by hey! (33014) on Tuesday September 05 2006, @05:40AM (#16043292)
    (http://kamthaka.blogspot.com/ | Last Journal: Wednesday March 30 2005, @03:18PM)
    that F/OSS isn't anti-business. It just works with different business models.

    Google's business interest in releasing this as open source is obvious: the greater the value of the materials available to the Internet, the greater the value of its service.
  • by autophile (640621) on Tuesday September 05 2006, @08:22AM (#16043932)
    ...it has been touted as one of the most accurate open source Optical Character Recognition (OCR) programs available.

    As the linked article states, there are commercial OCR programs that are far more accurate.

    --Rob

  • by rs232 (849320) <emacsuser@NoSPam.linuxmail.org> on Tuesday September 05 2006, @08:37AM (#16044031)
    Does anyone here know how to get it to install and run on SuSE 10.0? The instructions are a little confusing. If you can't use make install, what do you use?

    From INSTALL ..

    "4. Type `make install' to install the programs and any data files and documentation."

    Running ./configure returns "error in line 1329" and "make install has not been implemented yet avoid using."

    README has this to say "The executable must reside in the same directory as the tessdata directory The command line is: tesseract image.tif batch"

    When I try to run it, a window pops up briefly and then disappears.
  • by derniers (792431) on Tuesday September 05 2006, @08:55AM (#16044167)
    well, that is quite a stretch but just maybe send the info from Mindstorms to host so that the robots can read
  • Apparently the OP thinks the entire world lives and breathes *NIX, so much so that he couldn't be bothered to mention the OS platform requirement? Thanks for wasting the time of those readers who may not yet have a Linux system with which to use it.
  • by aweinert (969529) on Monday September 04 2006, @10:32PM (#16041720)
    CAPTCHAs are specifically meant to break OCR... and if you RTFA, it says it does poorly with grayscale and color documents. Basically it's meant for reading typed text... like in a book.
    [ Parent ]
  • You're right! Let us never delve into research that could conceivably overturn weak software security! Some things man was never meant to discover! Turn back, before we fly too close to the sun and our wings melt!! O, Prometheus, why hast thou given us this OCR technology??
    [ Parent ]
  • by Carthag (643047) on Monday September 04 2006, @10:35PM (#16041739)
    (http://www.tuekistan.com/)
    OCR is most effective when the letter boundaries are clear and well-defined, such as fixed-width text, or text that is at least on a straight line. Most CAPTCHAs put the letters on a curved path, as well as distorting the letters so they are no longer within a clearly defined rectangular shape. This makes it very hard to identify which parts of the images are letters and which parts are not, making OCRing CAPTCHAs a non-trivial problem.
    [ Parent ]
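The point about letter boundaries can be made concrete with the simplest segmentation trick, a vertical projection: any pixel column with no ink separates two glyphs. The sketch below uses hand-made toy bitmaps (hypothetical data, not real scans) and is not a description of how Tesseract or any other engine segments text.

```python
def segment_columns(bitmap):
    """Split a binary bitmap into glyph column ranges.

    bitmap is a list of equal-length strings where '#' means ink.
    Returns half-open (start, end) column ranges, one per run of inked columns.
    """
    if not bitmap:
        return []
    width = len(bitmap[0])
    inked = [any(row[x] == "#" for row in bitmap) for x in range(width)]
    spans, start = [], None
    for x, has_ink in enumerate(inked):
        if has_ink and start is None:
            start = x                  # a glyph begins
        elif not has_ink and start is not None:
            spans.append((start, x))   # a blank column ends it
            start = None
    if start is not None:
        spans.append((start, width))
    return spans


# Cleanly printed text: blank columns separate the three glyphs.
clean = [
    "##..#..###",
    "#...#..#.#",
    "##..#..###",
]
# Distorted text: stray strokes fill every gap, as in a typical CAPTCHA.
touching = [
    "##..#..###",
    "#.#.#.##.#",
    "##.#.#.###",
]
print(segment_columns(clean))     # three separate ranges
print(segment_columns(touching))  # one merged blob
```

Once the gaps are gone, a recognizer has to guess where one letter ends and the next begins before it can classify anything, and that is the step distortion is meant to break.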
  • NFB owns you (Score:5, Interesting)

    by tepples (727027) <slash2006@pineight.com> on Monday September 04 2006, @10:48PM (#16041808)
    (http://myatomic.com/ | Last Journal: Sunday November 19 2006, @12:31AM)
    CAPTCHAs have been very effective in stopping spammers in the past, but if they can now just read them and answer correctly, then they are effectively rendered useless ...

    They're already useless if installing one will subject your business to boycotts and/or lawsuits from National Federation of the Blind [nfb.org] and other advocates for people with disabilities.

    [ Parent ]
    • Re:NFB owns you (Score:5, Informative)

      by MrNonchalant (767683) on Tuesday September 05 2006, @12:08AM (#16042188)
      You can build accessible CAPTCHAs, using images with a sound backup for blind users. My girlfriend is visually impaired and non-accessible CAPTCHAs are a real problem for her, she can't register at some sites without assistance.
      [ Parent ]
      • Re:NFB owns you by mrchaotica (Score:2) Tuesday September 05 2006, @12:48AM
      • Re:NFB owns you by pipatron (Score:1) Tuesday September 05 2006, @02:19AM
        • Re:NFB owns you by maxwell demon (Score:2) Tuesday September 05 2006, @03:34AM
      • Re:NFB owns you by indifferent children (Score:2) Tuesday September 05 2006, @07:08AM
      • 1 reply beneath your current threshold.
    • Re:NFB owns you by Isotopian (Score:1) Tuesday September 05 2006, @12:36AM
    • Audible captchas by sita (Score:2) Tuesday September 05 2006, @03:21AM
    • Re:NFB owns you by stiggle (Score:2) Tuesday September 05 2006, @06:55AM
  • Can't spammers use this thing to break CAPTCHAs on sites like Slashdot and many other internet forums? CAPTCHAs have been very effective in stopping spammers in the past

    While Slashdot has always been a target for trolls and miscreants, I don't ever remember it being a spammers' destination (note 4-digit UID). Even back in those crazy, hazy days when we didn't have to try to interpret some bizarro text -- AKA the vast bulk of Slashdot's existence -- somehow spammers were thwarted in their evil quest. Was Slashdot just feeling a bit left out, and just had to stick a CAPTCHA in there to be just like everyone else ("See!? Spammers like us too!")?

    CAPTCHAs should be replaced by forcing answers to submitted homework questions - kids get their homework done for them on a distributed network, and it somewhat proves that there's a human on the other end (no machine could interpret most homework questions).
    [ Parent ]
  • by Millenniumman (924859) on Monday September 04 2006, @11:23PM (#16041983)
    Why can't captchas just say "the letter at the beginning of the word that is spelled the reverse of the 3 letter name for an underwater vehicle or sandwich"? If someone can make a program that interprets that and gets the answer right after getting it off a captcha with OCR, then Google probably wants to know so they can hire them.
    [ Parent ]
    • Two reasons (Score:5, Insightful)

      by patio11 (857072) on Monday September 04 2006, @11:49PM (#16042108)
      You've got two constraints. One is that you have to be able to compose an arbitrarily large number of CAPTCHAs algorithmically. For example, the one you just used is human-composed. If it's the only CAPTCHA you have, the following program gets me a job at Google: gawk 'BEGIN{print "b"}' . If you have 100 CAPTCHAs, I only need to add a switch statement and some elbow grease and then I get to break your CAPTCHA a trillion times.

      The other constraint is that you have to have your problem be trivially solvable by humans. I know plenty of people who cannot solve the CAPTCHA you have given: one obvious example would be, umm, all of my coworkers, because I live in Japan and "sub sandwich" is not generally on the Japanese English curriculum. Similarly, you could pose any number of parsing problems which are very difficult for machines ("Here are 10 pictures chosen from HotOrNot. Click the three hot chicks.") but which may also be difficult for some users, such as Slashdotters who have never met a girl before.

      By the way, you can find an implementation of that CAPTCHA at http://www.hotcaptcha.com/ [hotcaptcha.com]
      [ Parent ]
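The "switch statement and some elbow grease" attack on a fixed question pool looks roughly like this. Only the first question and its answer "b" come from the thread; the other two questions, the function names, and the bot class are hypothetical and stand in for whatever a site and an attacker would actually use.

```python
import random

# A site's fixed, human-written question pool.
POOL = {
    "First letter of the reverse of the 3-letter name for an underwater "
    "vehicle or sandwich?": "b",
    "How many legs does a healthy dog have?": "4",        # hypothetical
    "What colour is a ripe tomato, usually?": "red",      # hypothetical
}


def issue_challenge(rng):
    """The site picks one question from its fixed pool."""
    return rng.choice(sorted(POOL))


def check(question, answer):
    """The site accepts the answer if it matches the stored one."""
    return POOL.get(question) == answer.strip().lower()


class LookupBot:
    """The attacker's 'switch statement': one stored answer per question."""

    def __init__(self):
        self.known = {}

    def learn(self, question, answer):
        self.known[question] = answer

    def solve(self, question):
        return self.known.get(question, "")


bot = LookupBot()
for question, answer in POOL.items():
    bot.learn(question, answer)  # one-time human effort per question

rng = random.Random(0)
solved = sum(check(q, bot.solve(q))
             for q in (issue_challenge(rng) for _ in range(1000)))
print(solved)  # prints 1000: every repeat challenge is free for the bot
```

The human cost to the attacker grows with the size of the pool, while the number of accounts it unlocks is unbounded, which is why the pool has to be generated rather than written by hand.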
      • Re:Two reasons by Spazmogazm (Score:1) Tuesday September 05 2006, @02:50AM
      • Re:Two reasons by Linux987 (Score:1) Tuesday September 05 2006, @03:59AM
        • Since you ask, here's why: (Score:4, Insightful)

          by patio11 (857072) on Tuesday September 05 2006, @06:56AM (#16043541)
          The name of the system you propose is called challenge/response (CR). CR is not a good idea for the following reasons:

          1) It says "My time is more important than yours" to all your correspondents, because you're not willing to look at a few spams getting past your Bayesian filter every day so instead you offload that time burden to people who want to talk to you.
          2) Dueling CR systems ("Hey, bob@example.com, I don't recognize you. Please prove you are a human" "Re: Hey, bob -- steve@stupid.com, I don't recognize you. Please prove you are a human"). Even more fun in a potentially infinite loop. Any system you can make to shortcircuit this loop can be abused by spam to avoid the CR altogether.
          3) Doesn't survive the Chinese Sweatshop Spam Attack, which will be ubiquitous if CR becomes popular. (Take poor Chinese person, teach them 10 words of English, pay them 2 cents an hour to answer CAPTCHAs so you get guaranteed delivery of your Maximize Your Mr. Wiggly offers.)
          4) Breaks legitimate bulk mail senders, such as Amazon, Paypal, eBay, mailing lists, etc etc. Mailing lists in particular are going to be very fun, since a lot of CR systems would spam the entire list -- perhaps provoking 100 challenges! Which then leads to combinatorial hilarity!
          [ Parent ]
      • Re:Two reasons by Sgt. CoDFish (Score:1) Tuesday September 05 2006, @06:38AM
      • Re:Two reasons by GrumpySimon (Score:2) Tuesday September 05 2006, @07:08AM
      • Re:Two reasons by Catharsis (Score:2) Tuesday September 05 2006, @05:02PM
      • 1 reply beneath your current threshold.
    • by Jerf (17166) on Monday September 04 2006, @11:54PM (#16042128)
      (Last Journal: Saturday August 18 2001, @11:04AM)
      In order to pose the question, you have to generate it randomly. If it's not random, you already lost.

      In order to generate it, you're going to end up using a grammar.

      Running grammars in reverse is merely a matter of patience (to explore the space of problems the test program will pose) and the right tools; it's a fundamental bit of computer science.

      Granted, expecting spammers to be conversant with the fundamental elements of computer science is a pretty high bar, but it only takes one to leap it and the rest to buy the program from him.

      The image tests have the advantage that done properly, it takes more than just patience and computer science fundamentals to crack, it would require fundamental advances in the art.

      (Note that nowhere in this message do I claim that image tests are perfect; in fact everything I know is vulnerable to the "feed it to a human in another context (viz, 'porn') and let them do the work" attack, and there are also points to be made about how widespread any given grammar/image test becomes; I know a website where the image test actually is a constant and so far it doesn't seem to be a problem because of scale issues. My point is that text tests have an additional disadvantage. It's not an intrinsically bad idea, though.)

      Google wouldn't be interested in hiring people who could crack this, merely because they can crack this. Might make a decent interview question, though.

      (You might also be tempted to think that you could just use a really complicated grammar, but you are constrained by two things, the human supposedly reading and taking the test, and the complexity of the human language itself. By the time you write some problem generator that could reliably throw off a parser, you'll be reliably confusing the hell out of your human users, too.)
      [ Parent ]
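The "running grammars in reverse" argument can be illustrated with a toy question generator and a matching solver. Everything here is hypothetical: the clue vocabulary, the three templates, and the regular expression are made up for the example, and the solver reuses the generator's word list where a real attacker would have to collect it by sampling questions.

```python
import random
import re

# The site's generator: a tiny grammar of templates over a clue vocabulary.
CLUES = {
    "an underwater vehicle": "sub",
    "a domestic feline": "cat",
    "a frozen form of water": "ice",
}
TEMPLATES = [
    ("the first letter of the 3-letter word for {clue}",
     lambda w: w[0]),
    ("the last letter of the 3-letter word for {clue}",
     lambda w: w[-1]),
    ("the first letter of the reverse of the 3-letter word for {clue}",
     lambda w: w[::-1][0]),
]


def generate(rng):
    """Produce a random (question, expected answer) pair from the grammar."""
    clue, word = rng.choice(sorted(CLUES.items()))
    template, answer_of = rng.choice(TEMPLATES)
    return template.format(clue=clue), answer_of(word)


# The attacker's side: the same grammar written as a parser.
QUESTION = re.compile(
    r"^the (first|last) letter of (the reverse of )?the 3-letter word for (.+)$"
)


def solve(question):
    """Parse the question back into its parts and compute the answer."""
    match = QUESTION.match(question)
    if match is None:
        return None
    position, reverse, clue = match.groups()
    word = CLUES.get(clue)  # assumption: vocabulary already harvested
    if word is None:
        return None
    if reverse:
        word = word[::-1]
    return word[0] if position == "first" else word[-1]


rng = random.Random(42)
question, expected = generate(rng)
print(question, "->", solve(question), "(expected", expected + ")")
```

Adding templates makes the parser longer but not harder in kind, and every template also has to stay readable for the human taking the test.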
    • Re:As much as I like open source software ... by Anonymous Coward (Score:2) Tuesday September 05 2006, @01:49AM
    • Re:As much as I like open source software ... by charlesr (Score:1) Tuesday September 05 2006, @10:10AM
  • No Wrinkle in Time comments? (Score:2, Interesting)

    by reaktor (949798) on Monday September 04 2006, @11:35PM (#16042029)
    Come on, 34 comments and no mention of A Wrinkle in Time [google.com]?
    [ Parent ]
  • In the future (Score:1)

    by Vexorian (959249) on Monday September 04 2006, @11:49PM (#16042106)
    The condition would be to solve a puzzle given as text, instead of reading an image meant to be as confusing as possible. Some forums have very bad systems for this, and sometimes I have to register multiple times before actually getting a CAPTCHA image that I can read.
    [ Parent ]
  • by rm69990 (885744) on Tuesday September 05 2006, @12:23AM (#16042244)
    Naw, more like trollish babbling. OCR doesn't handle curving lines and distorted letters well. If you want to make yourself seem intelligent, at least research your shit first and try to stay on topic. :)
    [ Parent ]
  • by benplaut (993145) on Tuesday September 05 2006, @01:08AM (#16042405)
    Only if the CAPTCHA makers don't test it through tesseract beforehand...
    [ Parent ]
  • by 1u3hr (530656) on Tuesday September 05 2006, @01:27AM (#16042494)
    Can't spammers use this thing to break CAPTCHAs

    Captchas are designed to be difficult to OCR. Besides there are plenty of OCR apps around already, if you hadn't noticed. I don't think spammers have been holding out for a GPL one.

    [ Parent ]
  • by Locutus (9039) on Tuesday September 05 2006, @03:01AM (#16042822)
    I found that I needed to use grayscale tif files for one and "output" is the output-filename where you'll get:
    outputFilename.raw #???
    outputFilename.map # seems to be a location map of 0/1's where 1's are valid text and 0's aren't
    outputFilename.txt # the text from the OCR event

    I also found that the tessdata directory did not get installed into the /usr/local/bin directory on "make install" and copied that directory from the build directory to get it to work.

    Without "batch", it tries to bring up an X window but that just quickly goes away with no debug output.

    Usage: tesseract inputfile.tif [path/]outputfilename batch

    LoB
    [ Parent ]
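For anyone scripting the command line described above, a thin wrapper might look like this. It assumes only what the posts in this thread report about the 2006 release (the `tesseract inputfile.tif outputfilename batch` form and the `.raw`, `.map`, and `.txt` outputs); the function names are made up here, and other releases may behave differently.

```python
import subprocess
from pathlib import Path


def build_command(image, output_base, exe="tesseract"):
    """Argument list matching the usage line quoted in the comment above."""
    return [exe, str(image), str(output_base), "batch"]


def expected_outputs(output_base):
    """Files the comment above reports seeing after a run."""
    base = str(output_base)
    return [Path(base + ext) for ext in (".raw", ".map", ".txt")]


def run_ocr(image, output_base, exe="tesseract"):
    """Run the OCR and return the recognized text.

    Assumption from the README quoted earlier in the thread: the executable
    must sit in the same directory as tessdata, so exe may need a full path.
    """
    subprocess.run(build_command(image, output_base, exe), check=True)
    return Path(str(output_base) + ".txt").read_text(errors="replace")


print(" ".join(build_command("page.tif", "out/page")))
```

Keeping the command construction separate from the call that runs it makes it easy to print and check the exact command line before pointing it at a directory of scans.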
  • by Sgt. CoDFish (943288) on Tuesday September 05 2006, @05:47AM (#16043319)

    Your title and post make you sound like you think this shouldn't be released open source, just in case spammers use it.

    Well, then OOo will have to stop releasing their office suite: just think, Base could be used to store e-mail addresses to spam! Or, maybe no open source e-mail clients should be released, because the spammers might use it to send spam!

    Don't blame the software for the way it is used; it's the user's fault if (s)he decides to use it malevolently. Most software has the potential for misuse, some more than others, but fear of spam shouldn't stop the release of tools that have a chance of being misused. Just think of the positive uses of programs like this.

    Besides, it's more than easy enough for spammers to just make a program to do stuff like break CAPTCHAs (yes, I know they're designed to defeat spammers, but nothing's perfect).

    [ Parent ]
  • by mdew (651926) on Tuesday September 05 2006, @07:56AM (#16043782)
    (http://www.fanboy.co.nz/adblock/)
    Those were my thoughts exactly. Why release it on Sourceforge? Unless they don't have any faith in their own code repository.
    [ Parent ]
  • by Stoenhenge (767437) on Tuesday September 05 2006, @02:59PM (#16047059)
    Software written with the intent of doing good, being used to do evil?

    That's never happened before!

    [ Parent ]
  • 7 replies beneath your current threshold.