Google Releases Tesseract as Open Source

Posted by ScuttleMonkey on Monday September 04, 2006 @11:27PM from the bit-rot dept.

An anonymous reader writes "Google recently released Tesseract as open source. Originally developed at the HP Labs from 1985-1995, it has been touted as one of the most accurate Optical Character Recognition (OCR) programs available. Having sat on the shelf gathering dust for so many years, Google cleaned up some of the more outdated portions of the code and released it for general consumption. You can download Tesseract over at Sourceforge."

This discussion has been archived. No new comments can be posted.


I take back every bad thing I said about Google (Score:5, Interesting)

by OrangeTide ( 124937 ) writes: on Monday September 04, 2006 @11:30PM (#16041704) Homepage Journal

HOORAY! Good free OCR software is in short supply. I wonder if this will have a positive impact on Project Gutenberg?

- Sonny Bono pwned Gutenberg (Score:2)
  
  by tepples ( 727027 ) writes:
  
  I wonder if this will have a positive impact on Project Gutenberg?
  
  Should we praise technology that helps Project Gutenberg run out of pre-1923 books faster? Once all notable pre-1923 books are scanned, OCR'd, and cleaned up, then what does PG do?
  - Un-Finishable (Score:5, Interesting)
    
    by Kadin2048 ( 468275 ) writes: <slashdot.kadinNO@SPAMxoxy.net> on Tuesday September 05, 2006 @12:09AM (#16041908) Homepage Journal
    
    In all honesty, I doubt Project Gutenberg will have run out of pre-1923 books by the time that new stuff starts coming out of Copyright under the new rules. They have everything written by humanity before that date to digitize: not just English language books and "classics," but government documents, records, foreign language texts, ancient manuscripts ... everything. That's as close to an un-finishable task as you can set yourself, I think.
    
    Just assuming that somehow they did manage to digitize everything that was out of copyright, then I think what they should do is start archiving everything that they can. Even if they can't disseminate the information, they could still scan documents in and store them for later OCR-ing, thus preserving them against deterioration. I think this would be covered by fair use law even if the work was still protected. Perhaps this sort of archival work is not exactly the aim of PG, but it's still critically important.
    
    With that said, I don't mean to in any way excuse the disgusting abuse of our political and legal system that was and is the "Sonny Bono Copyright Term Extension Act." That thing is a disgusting example of pretty much everything that's wrong with our government today.
    
    - Re:Un-Finishable (Score:5, Insightful)
      
      by mrchaotica ( 681592 ) * writes: on Tuesday September 05, 2006 @01:58AM (#16042365)
      
      In all honesty, I doubt Project Gutenberg will have run out of pre-1923 books by the time that new stuff starts coming out of Copyright under the new rules.
      
      Your argument makes the fundamentally flawed assumption that the "new rules" will remain constant. The reality is that Copyright will continue getting extended so that new content never comes into Public Domain. (I hope the copyright fuckers are the first against the wall when the revolution comes!)
      
      Even if they can't disseminate the information, they could still scan documents in and store them for later OCR-ing, thus preserving them against deterioration.
      
      I'm sure they could even OCR them... they just couldn't make them available to the public. Of course, given the community-driven mechanism by which Project Gutenberg works, they couldn't legally distribute them to the volunteers either...
      
      - Re: (Score:2, Interesting)
        
        by HuguesT ( 84078 ) writes:
        
        This is patently false. New stuff comes out of copyright every day. However, coming out of copyright is not the same thing as becoming available to the public. Clearly this is where Project Gutenberg comes in.
        
        One enormous area I'm personally interested in is sheet music. Some of the music I'm interested in playing has come out of copyright decades or even centuries ago. No one is going to reclaim copyright on Mozart's Requiem for instance. Yet it is by and large not available to the public because translati
        
        Re: (Score:3, Informative)
        
        by gweeks ( 91403 ) writes:
        
        > This is patently false. New stuff comes out of copyright every day.
        
        This is just so untrue. In the United States (the only place that Project Gutenberg worries about) nothing is entering the Public Domain except unpublished manuscripts where the author died 70 years ago. Nothing else will enter the public domain until 2019. Congress has effectively frozen the public domain.
        
        Re: (Score:3, Informative)
        
        by fotbr ( 855184 ) writes:
        
        Unless estate holders release it early. Or the author and holder of the copyright declares in his/her will that his/her work be released into the public domain upon his death, etc.
        
        Just because it's not common (or likely) doesn't mean it can't happen.
    - Chastity Bono's next step is life+100 (Score:3, Insightful)
      
      by tepples ( 727027 ) writes:
      
      I doubt Project Gutenberg will have run out of pre-1923 books by the time that new stuff starts coming out of Copyright under the new rules.
      Are you insinuating that the 115th Congress won't try to enact a Chastity Bono Copyright Term Extension Act? Given Mexico's life plus 100 copyright term, the next step of "harmonization" for the United States and its trading partners is life plus 100 or, in the case of works made for hire, 125 years after publication.
      Just assuming that somehow they did manage to dig
  - Re: (Score:2)
    
    by technos ( 73414 ) writes:
    
    Gutenberg already uses OCR. Has for a decade at least.
    - Re: (Score:2)
      
      by ma++i+ude ( 580592 ) writes:
      
      Gutenberg already uses OCR. Has for a decade at least.
      Indeed it has. And as their scanning FAQ [gutenberg.org] explains, they recommend you buy an OCR software package. I'm all for having the right tools for the job, even if it means going non-OSS, but if these packages are available for free, it encourages more people to participate. Surely that's a good thing?
  - Re: (Score:2)
    
    by bersl2 ( 689221 ) writes:
    
    Well, pending another retroactive extension of copyright (I don't even want to start on that...), works will begin to enter the public domain.
- Re: (Score:2, Interesting)
  
  by Commie1 ( 526208 ) writes:
  
  I've been using Tesseract for a PG project for a few weeks now and, as TFA says, it's not as good
  as some commercial ones out there. ABBYY FineReader seems to be the OCR software of choice for
  Distributed Proofreaders, at least.
  Tesseract just has ASCII support (for now, as they like to add), so it ignores italics, accents etc.
  In the case of the book I'm working on, it had a very hard time with the ff ligature and had some
  trouble with b and c: "but" became "hut", "he" became "be", and c was often an o or e.
  The words diffi
- - Re: (Score:2, Informative)
    
    by Ed Avis ( 5917 ) writes:
    
    If you think the software isn't entirely free, contact Sourceforge. Their conditions require that all hosted projects be free software.
Anti-spam (Score:3, Interesting)

by Bacon Bits ( 926911 ) writes: on Monday September 04, 2006 @11:30PM (#16041706)

This should be useful for adding anti-image spam capabilities to FOSS anti-spam programs.

- - Re: (Score:2)
    
    by jrockway ( 229604 ) * writes:
    
    > Following that logic, wouldn't this then also be just as useful a tool to spammers looking to crack those crazy registration verification images?
    
    Let me let you in on a little secret. CAPTCHAs were broken a long time ago. They're the equivalent of writing your password on a sticky note and putting it under your keyboard.
    
    I recommend authenticating people with strong cryptography, which is how people can post to my blog [jrock.us].
    - I call bullshit (Score:5, Interesting)
      
      by quigonn ( 80360 ) writes: on Tuesday September 05, 2006 @01:16AM (#16042214) Homepage
      
      The very first CAPTCHA implementation was broken, but the funny thing about CAPTCHAs is that it's absolutely no effort to make an image completely unreadable for current OCR software. And even if one certain implementation is broken, just add another layer of distortion. Human brain is capable of coping with it, OCR software usually is not.
      
      And after all, it's not about authentication, it's about making a service accessible only for humans.
      
      BTW, it's funny that you praise your own cryptography solution in your blog, but it's obvious that you have the problem of replay attacks, you even mention it in the "caveat" section below the text box.
      
      - Re: (Score:2)
        
        by AaronLawrence ( 600990 ) * writes:
        
        Human brain is capable of coping with it, OCR software usually is not.
        
        The human brain is NOT capable of coping with an arbitrary level of distortion. Many people have remarked that recent captchas are sometimes difficult to read due to the very heavy distortion.
        
        This is true at least for letters and numbers. "Pictures of things" might do better, but they require an enormous amount of work compared to a little program spitting out JPGs of text.
        
        Re: (Score:2)
        
        by quigonn ( 80360 ) writes:
        
        Where did I write "arbitrary level of distortion"?
        
        To lay this out clearly: human capability of recognition is still much better than that of computer programs, and that's what CAPTCHAs are exploiting: generally, every AI-hard problem can be used for distinguishing between humans and computers, which also means that every time a CAPTCHA building upon an AI-hard problem has been broken, an AI-hard problem has been solved (provided no implementation errors have been used to bypass the need of solving the actua
      - Re:I call bullshit (Score:4, Informative)
        
        by johansalk ( 818687 ) writes: on Tuesday September 05, 2006 @06:25AM (#16043248)
        
        If CAPTCHAs rely on humans, wasn't there an anti-CAPTCHA trick where spammers had people solve a CAPTCHA to get into some free porn, and then used those answers to get their bots through the legitimate sites they wanted to get into?
        
      - Re: (Score:2)
        
        by quigonn ( 80360 ) writes:
        
        As I wrote, the first CAPTCHA implementation (which you linked to) was indeed broken, but not the concept per se. Please read my posting before answering.
  - Re: (Score:3, Interesting)
    
    by Phroggy ( 441 ) * writes:
    
    Following that logic, wouldn't this then also be just as useful a tool to spammers looking to crack those crazy registration verification images?
    
    Yes, absolutely, and spammers are already using image obfuscation techniques: using italic difficult-to-read fonts spaced very close together (difficult to separate the image into individual characters and difficult to identify each character once you do), using colored backgrounds to make the text very low-contrast when converted into a monochrome image the OCR
improvements (Score:5, Funny)

by Anonymous Coward writes: on Monday September 04, 2006 @11:33PM (#16041726)

Google cleaned up some of the more outdated portions of the code
i.e., added AdSense to the OCR output.

Hoping OCR will improve? (Score:3, Insightful)

by smileytshirt ( 988345 ) writes: on Monday September 04, 2006 @11:34PM (#16041733) Homepage

My guess is that they are doing this in the hope the open source community will build on and improve OCR technology. This would be in Google's interest, as it can then index text from images (such as their own Books project) more accurately and efficiently.

- Re: (Score:2)
  
  by grammar fascist ( 239789 ) writes:
  
  My guess is that they are doing this in the hope the open source community will build on and improve OCR technology.
  
  More likely the computer vision research community, actually. "Many eyes" help a lot with bugs and bugfixes, but, ironically, not so well on nontrivial vision tasks.
Finally! (Score:3, Funny)

by nihilatron ( 32440 ) writes: on Monday September 04, 2006 @11:40PM (#16041753)

Now I can finally see how to tell the difference between the 'A'-ness of 'A' and the 'P'-ness of 'P'!

(Credit to S.G.)

From the Project (Score:5, Insightful)

by Gopal.V ( 532678 ) writes: on Monday September 04, 2006 @11:43PM (#16041772) Homepage Journal

> It was open-sourced by HP and UNLV in 2005.

So Google basically did what? Fix bit-rot? Google has re-released some open source code, essentially forking off the original?

> License: (None Listed)

I'm a fan of the FOSS idea. Basically that makes sure that the whole work to which I contributed always remains available to me (and others). It might not always work for a company, but as a developer it makes sense to me. And the second thing I need to see is a License after I see some code.

So explain to me how exactly this is open source (other than the "compile, but don't touch" version of it) and *then* I might think of downloading it and probably fix a few bugs or write docs.

- Re: (Score:3, Informative)
  
  by kevlarman ( 983297 ) writes:
  
  If you had bothered to browse CVS you would find that it has been released under the Apache License: http://tesseract-ocr.cvs.sourceforge.net/tesseract-ocr/tesseract/COPYING?view=markup [sourceforge.net]
- License (Score:3, Informative)
  
  by mapinguari ( 110030 ) writes:
  
  Here's what's in the COPYING file distributed with the source, with some punctuation stripped to placate the lameness filter:
  This package contains the Tesseract Open Source OCR Engine. Orignally developed at Hewlett Packard Laboratories Bristol and at Hewlett Packard Co, Greeley Colorado, the majority of the code in this distribution is now licensed under the Apache License: ** Licensed under the Apache License, Version 2.0 (the "License"); ** you may not use this file except in compliance with the Licen
  - Re: (Score:2)
    
    by arose ( 644256 ) writes:
    
    So it isn't open source after all.
  - Re: (Score:3, Interesting)
    
    by mrchaotica ( 681592 ) * writes:
    
    The Aspirin/MIGRAINES system in the aspirin directory is separately licensed thus: [proprietary junk license]
    
    Anybody know how important this headache library is to the software, and how easily replaced it is?
    - Re: (Score:3, Informative)
      
      by account_deleted ( 4530225 ) writes:
      
      Comment removed based on user account deletion
I'm sorry Dave... (Score:5, Funny)

by macadamia_harold ( 947445 ) writes: on Monday September 04, 2006 @11:44PM (#16041773) Homepage

Originally developed at the HP Labs from 1985-1995, it has been touted as one of the most accurate Optical Character Recognition (OCR) programs available.

Yeah, but how is it on lip-reading? That's when we really need to worry.

- Re: (Score:3, Interesting)
  
  by MichaelSmith ( 789609 ) writes:
  
  Yeah, but how is it on lip-reading? That's when we really need to worry.
  Given that my laptop has a microphone I was a bit worried about the recent article on Google sampling sound on people's computers. But my wife's laptop also has a webcam. Should I tell my wife not to google in bed? If the mic is off will they still catch what she is talking about?
  Dave, why don't you take a stress pill and lie down. If you are looking for something to read there is always Google News.
Hosting (Score:5, Interesting)

by truthsearch ( 249536 ) writes: on Monday September 04, 2006 @11:44PM (#16041775) Homepage Journal

Is there any particular reason google isn't hosting [google.com] the project themselves?

- Re:Hosting (Score:5, Funny)
  
  by larry bagina ( 561269 ) writes: on Monday September 04, 2006 @11:46PM (#16041785) Journal
  
  Yes. They need the 99.9999% uptime (6 9s) that only sourceforge can provide.
  
  - Re: (Score:2)
    
    by jZnat ( 793348 ) * writes:
    
    I'm pretty sure Sourceforge goes through more than .02 seconds of downtime per month...
    - Re:Hosting (Score:5, Funny)
      
      by Leto-II ( 1509 ) writes: on Tuesday September 05, 2006 @02:24AM (#16042477)
      
      I think you need to recalibrate your sarcasm detector.
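For reference, the "six nines" figure joked about in this thread can be checked with quick arithmetic. The following is a minimal sketch, assuming a 30-day month; the function name `downtime_seconds` is purely illustrative.

```python
def downtime_seconds(uptime_percent: float, period_seconds: float) -> float:
    """Return the downtime allowed in a period at the given uptime percentage."""
    return (1.0 - uptime_percent / 100.0) * period_seconds


# Assumption: a "month" is taken to be exactly 30 days.
SECONDS_PER_30_DAY_MONTH = 30 * 24 * 60 * 60  # 2,592,000 seconds

six_nines = downtime_seconds(99.9999, SECONDS_PER_30_DAY_MONTH)
print(f"{six_nines:.2f} seconds of downtime per 30-day month")
```

Under that assumption the allowance comes to roughly 2.6 seconds per month, somewhat more than the .02 seconds quoted above, though the joke survives either way.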
      
i hope it can augment the SpamAssassin OCR plugin (Score:2, Informative)

by sednet ( 6179 ) writes:

it would be great if tesseract [blogspot.com] could augment the gocr [sourceforge.net]-based FuzzyOCR [apache.org] and OCR [apache.org] plugins for SpamAssassin [apache.org].
Yay! (Score:2)

by The MAZZTer ( 911996 ) writes:

No binaries! Only source code! Good luck getting it to compile on Windows, I gave up after I got several dozen obscure errors I had never seen before from the compiler.*

* If anyone can get VC++2K5 to compile it, please post.
- No luck for OS X either (Score:2)
  
  by lullabud ( 679893 ) writes:
  
  I downloaded and tried compiling it in OS X and got some Linux-specific build problems. I'm no code guru so I gave up as well. But then, even Linux doesn't support the `make install` process, as claimed by the `./configure` script's output.
- Re: (Score:2)
  
  by cduffy ( 652 ) writes:
  
  Yes, the source is crap. Look at the debugging console -- they're *spawning an xterm* for output that would traditionally go to stderr. Don't have a DISPLAY set? Program crashes. Building on MacOS? Lucky you -- they have a bunch of commented-out code for running a separate window to display (what-should-be) stderr on the Mac; consequently, instead of getting output to stderr (which would actually be *useful* for redirection to a file, or direct output to the console, or whatever) it goes off into nowhere be
my thoughts (Score:4, Interesting)

by br00tus ( 528477 ) writes: on Tuesday September 05, 2006 @12:43AM (#16042078)

I would love to use a free (speech and beer) OCR engine that works as well as a commercial one, or even nearly as good as a commercial one.
I just checked out tesseract. One thing I have to look at more is the license. It appears to be the Apache license, which seems like a decent free license. But it also includes MITRE's aspirin. I'm not sure how dependent it is on aspirin and what the license restrictions of aspirin are.
The two best free OCR engines out right now are clara and gocr. While they are the best, they are not that great yet. I just ran the same tiff I had run with those two (I also have the document in pbm and other formats). Tesseract did not read it, it bailed with "IMAGE::check_legal_access:Error:Can't seek backwards in a buffered image!"
Clara and GOCR are written in C, Tesseract is written in C++, a language I don't know. Tesseract did well in the UNLV challenge so it probably has some good features. It does say it has no page layout analysis though.
Hopefully this can be improved, or good parts of it can be borrowed and incorporated into gocr or clara. It couldn't handle my test that both clara and gocr could, but it probably has strengths the other two don't. One day hopefully we'll have a free OCR that handles things as automagically as the commercial ones do. I will see what I can contribute to that as well. Although this is C++ and I don't know that language.

- Re: (Score:2)
  
  by Phroggy ( 441 ) * writes:
  
  If you've only used the latest released version of gocr, definitely try the development version; it's far superior (i.e. not completely useless).
Vividata works quite well (Score:2, Interesting)

by GnuPooh ( 696143 ) writes:

I've tried all the previous open source stuff and it was pretty much unusable. The accuracy was so bad that it was just easier to start typing. I got a few of the Windows programs kind of working under WINE, but then I discovered Vividata and it worked really well and could be called from the command line. This meant I could write my own scripts that used it. I used it quite a bit for Project Gutenberg and was very impressed. It's not cheap, but if you want to do OCR under Linux and can afford it, I r
HP decided to get out of the OCR business? (Score:5, Funny)

by Frosty Piss ( 770223 ) writes: on Tuesday September 05, 2006 @01:18AM (#16042218)

In 1995 it was one of the top 3 performers at the OCR accuracy contest organized by University of Nevada in Las Vegas. However, shortly thereafter, HP decided to get out of the OCR business...

Actually, shortly thereafter, HP decided to get out of the technology innovation business, and into the printer ink business.

W0W1 (Score:3, Funny)

by Anonymous Coward writes: on Tuesday September 05, 2006 @01:21AM (#16042230)

TH18 IS GRLAT NEWf4 FOR TH0Sj OF US U$1NZ BA) O(R RLCOGN1+ION!

THAHKS, G00GLL!1!!!

What about "rough ocr" (Score:2)

by Bitsy Boffin ( 110334 ) writes:

This story is somewhat timely for me. I am secretary of a club, we have a large quantity of documents collected over the last 20 years or so, some hand written, some typed, forms, invoices, minutes of meetings, letters sent to and from etc etc. There are a LOT of documents.

Lately I've been thinking about computerizing these documents into a web based system, so that any of the club executive can search and pull out a document they need etc, we could also flag documents as "general release" so that people
- Re: (Score:3, Insightful)
  
  by Anonymous Coward writes:
  
  You're a secretary? Do you do anal? If so, I can double your pay.
  - Re: (Score:2)
    
    by Bitsy Boffin ( 110334 ) writes:
    
    Double of zero isn't that enticing.
Non-English Charsets? (Score:4, Interesting)

by TheoMurpse ( 729043 ) writes: on Tuesday September 05, 2006 @02:13AM (#16042436) Homepage

As there seems to be no documentation on the Sourceforge page about what this can actually do, does it learn or follow rules? If it learns, can it learn to recognize, say, Japanese characters?

- Re: (Score:3, Informative)
  
  by Yvanhoe ( 564877 ) writes:
  
  Google specifically said in the article it doesn't work for non-English texts. I suppose it means it incorporates an English dictionary too, so other Roman languages wouldn't work either.
License issue: not free software (Score:2, Interesting)

by hellgate ( 85557 ) writes:

Parts of the Tesseract tar ball are under a "for non-commercial use only" license:
This software is the copyright of Russell Leighton and the MITRE Corporation. It may be freely used and modified for research and development purposes. We require a brief acknowledgement in any research paper or other publication where this software has made a significant contribution. If you wish to use it for commercial gain you must contact The MITRE Corporation for conditions of use.

The piece in question is a neural
Test example of tesseract. (Score:2, Interesting)

by dannycim ( 442761 ) writes:

Screen captured some text from the article, used XV to transform into tif, changing image to monochrome.

Input image: it has been touted as one of the most accurate Optical Character Recognition (OCR) programs available. Having sat on the shelf gathering dust for so many years, Google cleaned up some of the more outdated portions of the code

Output text: ii has been lamed as one of lhe mos! accurale Oplical Characler Recognilion (OCR) programs available. Having sat on lhe shelf galhering dusk for so many ye
- Re: (Score:2)
  
  by Random832 ( 694525 ) writes:
  
  I have no idea what kind of input is optimal, but for a first shot in the dark, that's not too shabby. I'll go play with it some more.
  
  meh. a _screenshot_ contains perfectly regular characters - if it can't ace _that_ then I don't _want_ to see what it does with a scanned page.
  - Re: (Score:3, Interesting)
    
    by CXI ( 46706 ) writes:
    
    A screen shot is typically much lower resolution than what you'd normally scan documents at for OCR. It's not a good test.
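One way to put a number on a test like the one above is a character error rate: the edit distance between the expected text and the OCR output, divided by the length of the expected text. The following is a minimal sketch using plain Levenshtein distance; the helper names are hypothetical and none of this is part of Tesseract. The sample strings are a short fragment of the input and output quoted in the test.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character edits turning a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep if equal)
            ))
        previous = current
    return previous[-1]


def char_error_rate(expected: str, actual: str) -> float:
    """Edit distance normalised by the length of the expected text."""
    if not expected:
        raise ValueError("expected text must not be empty")
    return levenshtein(expected, actual) / len(expected)


# Fragment of the screenshot test quoted in this thread.
expected = "one of the most accurate"
actual = "one of lhe mos! accurale"
print(levenshtein(expected, actual), char_error_rate(expected, actual))
```

On this fragment the distance is three substitutions over 24 characters, which gives a rough feel for how "not too shabby" the first attempt was.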
License? (Score:2)

by omeg ( 907329 ) writes:

I don't get it. Isn't everything released on SourceForge supposed to be under a free license? Then how come this is released under no license? Perhaps I'm not looking on the right pages, but I can't seem to find anything besides the "none listed" on the main page of the project.
An interesting demonstration (Score:2)

by hey! ( 33014 ) writes:

that F/OSS isn't anti-business. It just works with different business models.

Google's business interest in releasing this as open source is obvious: the greater the value of the materials available to the Internet, the greater the value of its service.
- Re:As much as I like open source software ... (Score:5, Informative)
  
  by aweinert ( 969529 ) writes: on Monday September 04, 2006 @11:32PM (#16041720)
  
  CAPTCHAs are specifically meant to break OCR... and if you RTFA, it says it does poorly with grayscale and color documents. Basically it's meant for reading typed text... like in a book.
  
  - Re: (Score:2)
    
    by somethinghollow ( 530478 ) writes:
    
    Specifically like Google Books [google.com], I bet. Unless the book is multi-column, then fuck it and we'll wait for the single column edition.
    - Re:As much as I like open source software ... (Score:4, Insightful)
      
      by Otto ( 17870 ) writes: on Tuesday September 05, 2006 @01:01AM (#16042165) Homepage Journal
      
      Or write up a quick script to cut the images in half down the middle and save them as a series of other images.
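The splitting step suggested above can be sketched as follows. This is a minimal illustration, assuming the page is already loaded as a list of pixel rows and that the gutter between the two columns sits exactly at the midpoint; `split_columns` is a hypothetical helper, and reading or writing real image files is left to whatever imaging library is at hand.

```python
from typing import List, Tuple

Image = List[List[int]]  # rows of pixel values


def split_columns(page: Image) -> Tuple[Image, Image]:
    """Cut a two-column page down the middle into left and right halves."""
    if not page:
        return [], []
    width = len(page[0])
    if any(len(row) != width for row in page):
        raise ValueError("all rows must have the same width")
    middle = width // 2  # assumption: the gutter is at the exact midpoint
    left = [row[:middle] for row in page]
    right = [row[middle:] for row in page]
    return left, right


# Tiny 2x6 "page": 1 = ink, 0 = background.
page = [
    [1, 1, 0, 0, 1, 1],
    [1, 0, 0, 0, 0, 1],
]
left, right = split_columns(page)
print(left)
print(right)
```

Real scans are rarely centred this neatly, so a practical script would probably need to locate the gutter rather than assume it.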
      
  - Re: (Score:3, Funny)
    
    by ajs ( 35943 ) writes:
    
    That's no problem! All I really need it to do is allow all of those geeks out there to share those great Playboy articles with me over p2p networks! I'm tired of just getting the filler photography! ;-)
  - Comment removed (Score:4, Insightful)
    
    by account_deleted ( 4530225 ) writes: on Tuesday September 05, 2006 @03:40AM (#16042740)
    
    Comment removed based on user account deletion
    
  - Re: (Score:2)
    
    by Shaper_pmp ( 825142 ) writes:
    
    Plus, IIRC CAPTCHAs don't really work [slashdot.org] anyway.
  - - Re: (Score:2)
      
      by cduffy ( 652 ) writes:
      
      ...and part of a good CAPTCHA is causing these transformations to come up with useless output.
      - Re: (Score:3, Insightful)
        
        by Arancaytar ( 966377 ) writes:
        
        Yes, by using contrasting colors that convert to the same tone in grayscale. A side effect being that most such technologies also shut out colorblind people...
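The colour trick described above can be illustrated with a small sketch. It assumes the grayscale conversion uses the common ITU-R BT.601 luma weights (0.299, 0.587, 0.114); an actual OCR pipeline may convert differently, and the two colours below are just one hand-picked pair.

```python
from typing import Tuple


def luma(rgb: Tuple[int, int, int]) -> int:
    """Approximate grayscale level using BT.601 weights (an assumption here)."""
    r, g, b = rgb
    return round(0.299 * r + 0.587 * g + 0.114 * b)


red = (255, 0, 0)    # text colour
green = (0, 130, 0)  # background colour

# Clearly different on screen, yet the same tone once converted to grayscale.
print(luma(red), luma(green))
```

Both colours land on the same gray level, so a naive grayscale conversion leaves nothing for a threshold to separate.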
    - Re: (Score:3, Informative)
      
      by Dan Ost ( 415913 ) writes:
      
      As someone who has been involved in applying OCR to real world problems, there's nothing
      trivial about generating good binary images from images taken in the field (in my case,
      images of boxes moving down a conveyor belt or hand imaged by workers).
      
      Even if you disregard such problems as uneven lighting, glare, and distortion due to the
      unavoidable vibration inherent to plant settings, most forms that are interesting to
      OCR are handwritten and not designed to be OCR friendly. Hopefully this will change as
      the peop
- Re:As much as I like open source software ... (Score:5, Funny)
  
  by illuminatedwax ( 537131 ) writes: <stdrange@nOspaM.alumni.uchicago.edu> on Monday September 04, 2006 @11:33PM (#16041727) Journal
  
  You're right! Let us never delve into research that could conceivably overturn weak software security! Some things man was never meant to discover! Turn back, before we fly too close to the sun and our wings melt!! O, Prometheus, why hast thou given us this OCR technology??
  
  - Re:As much as I like open source software ... (Score:5, Insightful)
    
    by djtack ( 545324 ) writes: on Monday September 04, 2006 @11:51PM (#16041822)
    
    Plus, good OCR could help recognize image spam (where they send the text in an image attachment, to avoid filtering, and fill the message body with "bayes poison").
    
    - Re: (Score:3, Informative)
      
      by Phroggy ( 441 ) * writes:
      
      I am currently using the FuzzyOcr plugin to SpamAssassin, and it uses gocr to do the character recognition. To be sure, gocr is improving (the stable released version is practically useless, but the CVS version actually works, mostly), but if Tesseract is better, great!
    - Image spam (Score:3, Interesting)
      
      by Lonewolf666 ( 259450 ) writes:
      
      A good idea, and if significant amounts of text are in an image, I'd view the mail as dubious anyway.
      If not because of spam, then because of the idiotic format. Images are for illustrations, but using them to transfer major amounts of text is just stupid and inefficient.
      - Re: (Score:3, Insightful)
        
        by maxwell demon ( 590494 ) writes:
        
        Unless it's a scanned page, where you might be interested in more than just the raw text, or simply don't want to risk errors in converting it to text (think official documents).
  - Re:As much as I like open source software ... (Score:4, Funny)
    
    by binarybum ( 468664 ) writes: on Tuesday September 05, 2006 @12:07AM (#16041899) Homepage
    
    careful, statements like that are likely to get you voted governor in some states.
    
  - - Re: (Score:2, Insightful)
      
      by illuminatedwax ( 537131 ) writes:
      
      Also: *AA includes the MAA (Mathematical Association of America), the ADAA (Anxiety Disorders Association of America), the MSAA (Multiple Sclerosis Association of America), and the SCAA (Specialty Coffee Association of America).
      
      The SCAA must be the ones responsible for not letting Java be open sourced.
      - Re: (Score:2, Flamebait)
        
        by 4D6963 ( 933028 ) writes:
        
        Also: *AA includes the MAA (Mathematical Association of America), the ADAA (Anxiety Disorders Association of America), the MSAA (Multiple Sclerosis Association of America), and the SCAA (Specialty Coffee Association of America).
        and also the GNAA (Gay Nigger Association of America)
        Don't ask me what's my point in mentioning this because I have no fucking idea :-) have a good day!
- Re: (Score:3, Insightful)
  
  by Carthag ( 643047 ) writes:
  
  OCR is most effective when the letter boundaries are clear and well-defined, such as fixed-width text, or text that is at least on a straight line. Most CAPTCHAs put the letters on a curved path, as well as distorting the letters so they are no longer within a clearly defined rectangular shape. This makes it very hard to identify which parts of the images are letters and which parts are not, making OCRing CAPTCHAs a non-trivial problem.
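Why clean boundaries matter can be shown with a toy segmenter that splits a line of text wherever a pixel column contains no ink. This is a naive vertical-projection sketch under the assumption of a binary image (1 = ink, 0 = background); it is not a description of how Tesseract or any particular engine works, and `segment_columns` is a hypothetical name.

```python
from typing import List, Tuple


def segment_columns(image: List[List[int]]) -> List[Tuple[int, int]]:
    """Return (start, end) column ranges, end exclusive, that contain ink."""
    if not image:
        return []
    width = len(image[0])
    ink_per_column = [sum(row[x] for row in image) for x in range(width)]

    segments = []
    start = None
    for x, ink in enumerate(ink_per_column):
        if ink and start is None:
            start = x                    # a glyph begins
        elif not ink and start is not None:
            segments.append((start, x))  # an empty column ends the glyph
            start = None
    if start is not None:
        segments.append((start, width))  # glyph runs to the right edge
    return segments


# Three well-separated "letters" on a straight baseline.
image = [
    [1, 1, 0, 1, 0, 0, 1, 1],
    [1, 0, 0, 1, 0, 0, 1, 0],
    [1, 1, 0, 1, 0, 0, 1, 1],
]
print(segment_columns(image))
```

Letters that touch, overlap, or follow a curved path leave no empty columns between them, which is where an approach this simple stops working.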
- NFB owns you (Score:5, Interesting)
  
  by tepples ( 727027 ) writes: <tepples@gm a i l.com> on Monday September 04, 2006 @11:48PM (#16041808) Homepage Journal
  
  CAPTCHAs have been very effective in stopping spammers in the past, but if they can now just read them and answer correctly, then they are effectively rendered useless ...
  
  They're already useless if installing one will subject your business to boycotts and/or lawsuits from the National Federation of the Blind [nfb.org] and other advocates for people with disabilities.
  
  - Re:NFB owns you (Score:5, Informative)
    
    by MrNonchalant ( 767683 ) writes: on Tuesday September 05, 2006 @01:08AM (#16042188)
    
    You can build accessible CAPTCHAs, using images with a sound backup for blind users. My girlfriend is visually impaired and non-accessible CAPTCHAs are a real problem for her; she can't register at some sites without assistance.
    
    - Re: (Score:2)
      
      by mrchaotica ( 681592 ) * writes:
      
      I'm blind and deaf, you insensitive clod!
      
      (Not really, but someone could be...)
      - Re: (Score:2)
        
        by Phroggy ( 441 ) * writes:
        
        How do people who are blind and deaf use the World Wide Web? I'm not saying it couldn't be done, but unless it actually is done, we shouldn't need to worry about it.
        
        Re: (Score:2)
        
        by Punboy ( 737239 ) writes:
        
        It's called a braille display.
        
        Re: (Score:2)
        
        by Phroggy ( 441 ) * writes:
        
        So, can we make braille CAPTCHAs?
    - - Re: (Score:2, Funny)
        
        by indifferent children ( 842621 ) writes:
        
        I'm a big fan of asking the user a simple random question, such as "what is 2 + 5".
        I'm tired of all of the anti-Americanism on /. If you want to exclude Americans from your site, go ahead; but don't rub our noses in it.
      - Re: (Score:2, Funny)
        
        by maxwell demon ( 590494 ) writes:
        
        Of course you can resort to other, harder-to-calculate questions like: "What is the answer to life, the universe and everything?" Oops, computers seem to have become much faster since Deep Thought! [google.de] :-)
  - Audible captchas (Score:2)
    
    by sita ( 71217 ) writes:
    
    I suppose "audible captchas" should be feasible. That is, if you can't see the picture, the captcha server also has an audio file with the same information. I'd be surprised if this doesn't exist already in some form.
  - Re: (Score:2)
    
    by stiggle ( 649614 ) writes:
    
    But the NFB website itself is not standards compliant. http://validator.w3.org/check?uri=http%3A%2F%2Fwww.nfb.org%2Fnfb%2FDefault.asp [w3.org]
- Re: (Score:3, Interesting)
  
  by Millenniumman ( 924859 ) writes:
  
  Why can't captchas just say "the letter at the beginning of the word that is spelled the reverse of the 3 letter name for an underwater vehicle or sandwich"? If someone can make a program that interprets that and gets the answer right after getting it off a captcha with OCR, then Google probably wants to know so they can hire them.
  - Two reasons (Score:5, Insightful)
    
    by patio11 ( 857072 ) writes: on Tuesday September 05, 2006 @12:49AM (#16042108)
    
    You've got two constraints. One is that you have to be able to compose an arbitrarily large number of CAPTCHAs algorithmically. For example, the one you just used is human-composed. If it's the only CAPTCHA you have, the following program gets me a job at Google: gawk 'BEGIN{print "b"}' . If you have 100 CAPTCHAs, I only need to add a switch statement and some elbow grease, and then I get to break your CAPTCHA a trillion times.
    
    The other constraint is that your problem has to be trivially solvable by humans. I know plenty of people who cannot solve the CAPTCHA you have given: one obvious example would be, umm, all of my coworkers, because I live in Japan and "sub sandwich" is not generally on the Japanese English curriculum. Similarly, you could pose any number of parsing problems which are very difficult for machines ("Here are 10 pictures chosen from HotOrNot. Click the three hot chicks.") but which may also be difficult for some users, such as Slashdotters who have never met a girl before.
    
    By the way, you can find an implementation of that CAPTCHA at http://www.hotcaptcha.com/ [hotcaptcha.com]
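The "switch statement and some elbow grease" attack on a fixed pool of human-written questions might look something like the sketch below. The table contents and the names `KNOWN_ANSWERS`, `normalize` and `solve` are hypothetical; the two questions are just the ones quoted in this thread.

```python
# Hypothetical pool of hand-written CAPTCHA questions, answered once by a human.
KNOWN_ANSWERS = {
    "the letter at the beginning of the word that is spelled the reverse "
    "of the 3 letter name for an underwater vehicle or sandwich": "b",
    "what is 2 + 5": "7",
}


def normalize(question):
    """Lower-case the question and collapse runs of whitespace."""
    return " ".join(question.lower().split())


def solve(question):
    """Return the stored answer, or None if the question is not in the pool."""
    return KNOWN_ANSWERS.get(normalize(question))


print(solve("What is  2 + 5"))
print(solve("Name a colour"))
```

The attacker pays a human once per question; after that every repeat is a dictionary lookup, which is why the pool has to be generated rather than written by hand.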
    
    - Re: (Score:2)
      
      by GrumpySimon ( 707671 ) writes:
      
      the first version of this AFAIK was kitten auth [thepcspy.com]
    - - Since you ask, here's why: (Score:4, Insightful)
        
        by patio11 ( 857072 ) writes: on Tuesday September 05, 2006 @07:56AM (#16043541)
        
        The system you propose is called challenge/response (CR). CR is not a good idea for the following reasons:
        
        1) It says "My time is more important than yours" to all your correspondents, because you're not willing to look at a few spams getting past your Bayesian filter every day so instead you offload that time burden to people who want to talk to you.
        2) Dueling CR systems ("Hey, bob@example.com, I don't recognize you. Please prove you are a human." "Re: Hey, bob -- steve@stupid.com, I don't recognize you. Please prove you are a human."). Even more fun in a potentially infinite loop. Any system you can make to short-circuit this loop can be abused by spammers to avoid the CR altogether.
        3) Doesn't survive the Chinese Sweatshop Spam Attack, which will be ubiquitous if CR becomes popular. (Take poor Chinese person, teach them 10 words of English, pay them 2 cents an hour to answer CAPTCHAs so you get guaranteed delivery of your Maximize Your Mr. Wiggly offers.)
        4) Breaks legitimate bulk mail senders, such as Amazon, Paypal, eBay, mailing lists, etc etc. Mailing lists in particular are going to be very fun, since a lot of CR systems would spam the entire list -- perhaps provoking 100 challenges! Which then leads to combinatorial hilarity!
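The dueling-CR loop in point 2 can be sketched with a toy model. Everything here is an assumption made for illustration: two parties named "a" and "b", a whitelist per party, and the rule that a challenge is itself an ordinary message that the other side may challenge in turn. No real mail system is being modelled.

```python
def exchange(whitelist_a, whitelist_b, max_messages=10):
    """Count messages sent when "a" mails "b" and both run challenge/response.

    Each side auto-challenges any message from a sender that is not in its
    whitelist, and a challenge is just another message from that side.
    The count is capped at max_messages so the loop always terminates.
    """
    whitelists = {"a": whitelist_a, "b": whitelist_b}
    sender, receiver = "a", "b"
    sent = 0
    while sent < max_messages:
        sent += 1  # the sender's message (or challenge) arrives
        if sender in whitelists[receiver]:
            break  # accepted, so no challenge goes back
        sender, receiver = receiver, sender  # receiver challenges the sender
    return sent


print(exchange(set(), set()))   # neither side trusts the other
print(exchange({"b"}, set()))   # "a" whitelists the address it wrote to
```

With two empty whitelists the exchange only stops at the cap. Letting "a" trust the address it just wrote to ends it after one challenge, but as the comment notes, any such shortcut is also something a spammer can try to exploit.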
        
  - Re:As much as I like open source software ... (Score:4, Insightful)
    
    by Jerf ( 17166 ) writes: on Tuesday September 05, 2006 @12:54AM (#16042128) Journal
    
    In order to pose the question, you have to generate it randomly. If it's not random, you already lost.
    
    In order to generate it, you're going to end up using a grammar.
    
    Running grammars in reverse is merely a matter of patience (to explore the space of problems the test program will pose) and the right tools; it's a fundamental bit of computer science.
    
    Granted, expecting spammers to be conversant with the fundamental elements of computer science is a pretty high bar, but it only takes one to leap it and the rest to buy the program from him.
    
    The image tests have the advantage that done properly, it takes more than just patience and computer science fundamentals to crack, it would require fundamental advances in the art.
    
    (Note that nowhere in this message do I claim that image tests are perfect; in fact everything I know is vulnerable to the "feed it to a human in another context (viz, 'porn') and let them do the work" attack, and there are also points to be made about how widespread any given grammar/image test becomes; I know a website where the image test actually is a constant and so far it doesn't seem to be a problem because of scale issues. My point is that text tests have an additional disadvantage. It's not an intrinsically bad idea, though.)
    
    Google wouldn't be interested in hiring people who could crack this, merely because they can crack this. Might make a decent interview question, though.
    
    (You might also be tempted to think that you could just use a really complicated grammar, but you are constrained by two things, the human supposedly reading and taking the test, and the complexity of the human language itself. By the time you write some problem generator that could reliably throw off a parser, you'll be reliably confusing the hell out of your human users, too.)
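To make the "running grammars in reverse" argument concrete, here is a minimal sketch. The one-rule grammar ("What is NUMBER OP NUMBER?"), the word lists and the function names are all invented for this example; a real question generator would be larger, but the same idea applies.

```python
import operator
import random
import re

NUMBER_WORDS = ["zero", "one", "two", "three", "four",
                "five", "six", "seven", "eight", "nine"]
OPS = {"plus": operator.add, "minus": operator.sub, "times": operator.mul}


def generate(rng):
    """Produce a random question from the toy grammar, with its answer."""
    a, b = rng.randrange(10), rng.randrange(10)
    op = rng.choice(sorted(OPS))
    question = f"What is {NUMBER_WORDS[a]} {op} {NUMBER_WORDS[b]}?"
    return question, OPS[op](a, b)


# The attacker's side: the same grammar, written as a pattern and run backwards.
PATTERN = re.compile(r"What is (\w+) (\w+) (\w+)\?")


def solve(question):
    """Parse a generated question and compute its answer."""
    a, op, b = PATTERN.fullmatch(question).groups()
    return OPS[op](NUMBER_WORDS.index(a), NUMBER_WORDS.index(b))


rng = random.Random(0)
question, answer = generate(rng)
print(question, solve(question) == answer)
```

The solver is about as long as the generator, which is the point: once someone has observed enough questions to write down the grammar, the randomness no longer helps.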
    
  - Re: (Score:2, Interesting)
    
    by Anonymous Coward writes:
    
    "the letter at the beginning of the word that is spelled the reverse of the 3 letter name for an underwater vehicle or sandwich"
    
    My wife would fail this test. My father will fail this test. My step-mother will fail this test. My children will fail this test.
    
    A computer will very easily get this test right one time in 26.
    
    In one word: Useless.
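The one-in-26 figure compounds quickly when a bot can retry. A small worked sketch, assuming each attempt is an independent uniform guess over the 26 letters and that nothing limits retries (both are assumptions, not something the comment states):

```python
def chance_of_passing(attempts, alphabet_size=26):
    """Probability that at least one of several uniform random guesses is right."""
    miss = 1 - 1 / alphabet_size
    return 1 - miss ** attempts


for n in (1, 10, 100):
    print(n, round(chance_of_passing(n), 3))
```

Under those assumptions a single guess succeeds about 3.8% of the time, and a hundred guesses succeed at least once about 98% of the time.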
- No Wrinkle in Time comments? (Score:2, Interesting)
  
  by reaktor ( 949798 ) writes:
  
  Come on, 34 comments and no mention of A Wrinkle in Time [google.com]?
- Re: (Score:2)
  
  by 1u3hr ( 530656 ) writes:
  
  Can't spammers use this thing to break CAPTCHAs
  Captchas are designed to be difficult to OCR. Besides there are plenty of OCR apps around already, if you hadn't noticed. I don't think spammers have been holding out for a GPL one.
- - - Re: (Score:2)
      
      by rm69990 ( 885744 ) writes:
      
      Naw, more like trollish babbling. OCR doesn't handle curving lines and distorted letters well. If you want to make yourself seem intelligent, at least research your shit first and try to stay on topic. :)
- Re: (Score:3, Funny)
  
  by Scaba ( 183684 ) writes:
  
  I'm sick and tired of a piece of dust being interpreted as a meter change.
  
  You're just not avant-garde enough.
- Re:Music OCR (Score:4, Interesting)
  
  by lowieken ( 522530 ) writes: on Tuesday September 05, 2006 @05:02AM (#16043021) Homepage
  
  There is a piece of non-free software that runs quite well under Wine and exports nice MusicXML. You will find it linked to from http://www.recordare.com/software.html [recordare.com] .
  
  I really should ask Google to help buy this technology and set it free.
  
  - Re: (Score:2)
    
    by advocate_one ( 662832 ) writes:
    
    Which program? There are quite a few OCR programs linked from there.
- Re: (Score:2)
  
  by Locutus ( 9039 ) writes:
  
  I found that I needed to use grayscale TIFF files, for one, and that "output" is the output filename, so you'll get:
  outputFilename.raw #???
  outputFilename.map # seems to be a location map of 0/1's where 1's are valid text and 0's aren't
  outputFilename.txt # the text from the OCR event
  
  I also found that "make install" did not install the tessdata directory into /usr/local/bin, so I copied that directory over from the build directory to get it to work.
  
  Without "batch", it tries to bring up and X wind
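The file naming described above can be written down as a small helper. This is a sketch only: the argument order `tesseract <image> <output> batch` is an assumption inferred from the comment, and the helper names are hypothetical. Nothing here runs Tesseract; it just builds the command and lists the files the comment says to expect.

```python
from pathlib import Path


def tesseract_command(image, output_base, batch=True):
    """Build the argument list; the argument order is assumed, not verified."""
    cmd = ["tesseract", str(image), str(output_base)]
    if batch:
        cmd.append("batch")  # per the comment, this avoids the interactive mode
    return cmd


def expected_outputs(output_base):
    """Files the comment reports for a given output name: .raw, .map and .txt."""
    return [Path(f"{output_base}{ext}") for ext in (".raw", ".map", ".txt")]


print(tesseract_command("page.tif", "outputFilename"))
print(expected_outputs("outputFilename"))
```

If the command shape matches your build, the list could be handed to `subprocess.run` and the `.txt` file read back afterwards.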
- Re: (Score:2)
  
  by Mr. Hankey ( 95668 ) writes:
  
  It might have been dangerous. If you actually found yourself in a tesseract, you might have ended up right back in your cube (possibly falling from the ceiling) when walking out the wrong side.
- Re: (Score:2)
  
  by Mr. Hankey ( 95668 ) writes:
  
  Just build it in 3 dimensions, and let an earthquake fold it for you.
