Google Releases Tesseract as Open Source
Posted by
ScuttleMonkey
on Mon Sep 04, 2006 10:27 PM
from the bit-rot dept.
An anonymous reader writes "Google recently released Tesseract as open source. Originally developed at HP Labs from 1985-1995, it has been touted as one of the most accurate Optical Character Recognition (OCR) programs available. Having sat on the shelf gathering dust for so many years, Google cleaned up some of the more outdated portions of the code and released it for general consumption. You can download Tesseract over at Sourceforge."
This discussion has been archived.
No new comments can be posted.
251 comments
The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
I take back every bad thing I said about Google (Score:5, Interesting)
Un-Finishable (Score:5, Interesting)
(http://kadin.sdf-us.org/ | Last Journal: Tuesday October 16, @01:46PM)
Just assuming that somehow they did manage to digitize everything that was out of copyright, then I think what they should do is start archiving everything that they can. Even if they can't disseminate the information, they could still scan documents in and store them for later OCR-ing, thus preserving them against deterioration. I think this would be covered by fair use law even if the work was still protected. Perhaps this sort of archival work is not exactly the aim of PG, but it's still critically important.
With that said, I don't mean to in any way excuse the disgusting abuse of our political and legal system that was and is the "Sonny Bono Copyright Term Extension Act." That thing is a disgusting example of pretty much everything that's wrong with our government today.
Re:Un-Finishable (Score:5, Insightful)
Your argument makes the fundamentally flawed assumption that the "new rules" will remain constant. The reality is that Copyright will continue getting extended so that new content never comes into Public Domain. (I hope the copyright fuckers are the first against the wall when the revolution comes!)
I'm sure they could even OCR them... they just couldn't make them available to the public. Of course, given the community-driven mechanism by which Project Gutenberg works, they couldn't legally distribute them to the volunteers either...
Anti-spam (Score:3, Interesting)
I call bullshit (Score:5, Interesting)
(http://synflood.at/blog/)
And after all, it's not about authentication, it's about making a service accessible only for humans.
BTW, it's funny that you praise your own cryptography solution in your blog, but it's obvious that you have the problem of replay attacks, you even mention it in the "caveat" section below the text box.
Re:I call bullshit (Score:4, Informative)
improvements (Score:5, Funny)
i.e., added AdSense to the OCR output.
Hoping OCR will improve? (Score:3, Insightful)
(http://www.taxcalc.com.au/)
Finally! (Score:3, Funny)
(Credit to S.G.)
From the Project (Score:5, Insightful)
(http://t3.dotgnu.info/ | Last Journal: Monday September 26 2005, @06:32AM)
> It was open-sourced by HP and UNLV in 2005.
So Google basically did what? Fix bit-rot? Google has re-released some open source code, essentially forking off the original?
> License: (None Listed)
I'm a fan of the FOSS idea. Basically, it makes sure that the whole work to which I contributed always remains available to me (and others). It might not always work for a company, but as a developer it makes sense to me. And the second thing I need to see after I see some code is a License.
So explain to me how exactly this is open source (other than the "compile, but don't touch" version of it) and *then* I might think of downloading it and probably fix a few bugs or write docs.
I'm sorry Dave... (Score:5, Funny)
(http://www.google.com/)
Yeah, but how is it on lip-reading? That's when we really need to worry.
Hosting (Score:5, Interesting)
(http://seenonslash.com/ | Last Journal: Friday May 11 2007, @04:02PM)
Re:Hosting (Score:5, Funny)
(Last Journal: Friday October 19, @09:21PM)
Re:Hosting (Score:5, Funny)
Sourceforge? (Score:1)
(http://debcentral.org/)
i hope it can augment the SpamAssassin OCR plugin (Score:2, Informative)
(http://durak.org/sean/)
Yay! (Score:2)
(http://blog.mzzt.net/)
No binaries! Only source code! Good luck getting it to compile on Windows; I gave up after I got several dozen obscure errors I had never seen before from the compiler.*
* If anyone can get VC++2K5 to compile it, please post.
my thoughts (Score:4, Interesting)
I just checked out tesseract. One thing I have to look at more is the license. It appears to be the Apache license, which seems like a decent free license. But it also includes MITRE's aspirin. I'm not sure how dependent it is on aspirin and what the license restrictions of aspirin are.
The two best free OCR engines out right now are clara and gocr. While they are the best, they are not that great yet. I just ran the same tiff I had run with those two (I also have the document in pbm and other formats). Tesseract did not read it, it bailed with "IMAGE::check_legal_access:Error:Can't seek backwards in a buffered image!"
Clara and GOCR are written in C, Tesseract is written in C++, a language I don't know. Tesseract did well in the UNLV challenge so it probably has some good features. It does say it has no page layout analysis though.
Hopefully this can be improved, or good parts of it can be borrowed and incorporated into gocr or clara. It couldn't handle my test that both clara and gocr could, but it probably has strengths the other two don't. One day hopefully we'll have a free OCR that handles things as automagically as the commercial ones do. I will see what I can contribute to that as well. Although this is C++ and I don't know that language.
Vividata works quite well (Score:2, Interesting)
(http://www.gnupooh.org/)
I would definitely prefer a Free Software solution so I'm excited about this development. Until this solution is really work-able (see the Google limitations, they're pretty serious), give Vividata a try.
HP decided to get out of the OCR business? (Score:5, Funny)
(http://www.nojailforpot.com/)
Actually, shortly thereafter, HP decided to get out of the technology innovation business, and into the printer ink business.
W0W1 (Score:3, Funny)
THAHKS, G00GLL!1!!!
What about "rough ocr" (Score:2)
(http://www.gogo.co.nz/)
Lately I've been thinking about computerizing these documents into a web-based system, so that any of the club executive can search and pull out a document they need etc. We could also flag documents as "general release" so that people could read interesting stuff from our past. And of course it would also serve as a secure backup of our documents, in case of fire, theft, alien invasion...
I think what is needed is a rough OCR system, that is, an OCR system that's not trying to be perfect, but can at least manage about 50% accuracy on both typed and handwritten (without training!) documents, and that preferably just skips any word it isn't fairly certain it got right. The idea being that I'd run each document (big job, but doesn't matter if it takes a year) through a scanner, OCR it to get some searchable content, then store it as a PDF, or jpeg or something.
Anybody know of such an (open source, or at least free as in beer) OCR system?
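The "skip words it isn't sure about" idea can be sketched in a few lines. This is only an illustration: it assumes a recognizer that reports a confidence between 0 and 1 for each word, and the function name, the 0.5 threshold, and the sample data are all made up for the example rather than taken from any particular OCR package.

```python
from typing import List, Tuple


def searchable_words(ocr_words: List[Tuple[str, float]],
                     min_confidence: float = 0.5) -> List[str]:
    """Keep only the words the recognizer is reasonably sure about.

    ocr_words is a list of (word, confidence) pairs; anything below
    min_confidence is skipped rather than indexed as garbage.
    """
    return [word.lower() for word, conf in ocr_words if conf >= min_confidence]


# Hypothetical recognizer output for one scanned page.
page = [("Minutes", 0.93), ("of", 0.88), ("tbe", 0.31),
        ("annual", 0.76), ("meeling", 0.42)]

# These terms would go into the search index; the scan itself is stored as-is.
print(searchable_words(page))
```

The low-confidence words are simply dropped from the index, so a search misses them, but the stored scan still shows the full page.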
Non-English Charsets? (Score:4, Interesting)
(http://studyinjapan.blogspot.com/)
Music OCR (Score:1)
Re:Music OCR (Score:4, Interesting)
(http://markvdb.be/)
I really should ask google to help buy this technology and set it free.
License issue: not free software (Score:2, Interesting)
The piece in question is a neural network simulator named Aspirin/MIGRAINES, presumably used for training. Pun away.
If you're wondering about OS compatibility... (Score:1)
I worked at HP labs for some of this period (Score:1)
(Last Journal: Tuesday June 19, @07:48AM)
and I've never heard of this thing.
Guess I should have got out of my cube more.
Test example of tesseract. (Score:2, Interesting)
Input image: it has been touted as one of the most accurate Optical Character Recognition (OCR) programs available. Having sat on the shelf gathering dust for so many years, Google cleaned up some of the more outdated portions of the code
Output text: ii has been lamed as one of lhe mos! accurale Oplical Characler Recognilion (OCR) programs available. Having sat on lhe shelf galhering dusk for so many years, Google cleaned up some of lhe more ouldaled porlions of lhe code
I have no idea what kind of input is optimal, but for a first shot in the dark, that's not too shabby. I'll go play with it some more. (^,^)
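For anyone wanting to put a number on "not too shabby", one common yardstick is the character error rate: the edit distance between what was on the page and what came out, divided by the length of the original. A minimal sketch, using a fragment of the sample above (the function names are just for illustration):

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: fewest single-character inserts,
    deletes, or substitutions needed to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete from a
                               current[j - 1] + 1,       # insert into a
                               previous[j - 1] + cost))  # substitute
        previous = current
    return previous[-1]


def char_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance normalized by the length of the reference text."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


reference = "one of the most accurate"
hypothesis = "one of lhe mos! accurale"
print(char_error_rate(reference, hypothesis))
```

Running the same function over the full input and output strings gives a figure that can be compared between engines or between scan settings.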
License? (Score:2)
Build environment (Score:1)
(Last Journal: Wednesday August 14 2002, @12:33PM)
I always knew... (Score:1)
(http://sam991.blogspot.com/)
An interesting demonstration (Score:2)
(http://kamthaka.blogspot.com/ | Last Journal: Wednesday March 30 2005, @03:18PM)
Google's business interest in releasing this as open source is obvious: the greater the value of the materials available to the Internet, the greater the value of its service.
Error of omission in summary (Score:2)
As the linked article states, there are commercial OCR programs that are far more accurate.
--Rob
how to get it to run .. (Score:2)
From INSTALL
"4. Type `make install' to install the programs and any data files and documentation."
Running
README has this to say: "The executable must reside in the same directory as the tessdata directory. The command line is: tesseract image.tif batch"
Trying to run it, a window pops up briefly and then disappears.
port to Mindstorms? (Score:1)
THIS IS ONLY FOR *NIX and not mentioned? (Score:1)
(http://macraig.homedns.org/blog/)
Re:As much as I like open source software ... (Score:5, Informative)
Re:As much as I like open source software ... (Score:4, Insightful)
(http://ottodestruct.com/)
Re:As much as I like open source software ... (Score:4, Insightful)
Don't know how widespread this is, but it is certainly possible.
Re:As much as I like open source software ... (Score:5, Funny)
(Last Journal: Thursday September 21 2006, @07:20AM)
Re:As much as I like open source software ... (Score:5, Insightful)
Re:As much as I like open source software ... (Score:4, Funny)
(http://slashdot.org/)
Re:As much as I like open source software ... (Score:3, Insightful)
(http://www.tuekistan.com/)
NFB owns you (Score:5, Interesting)
(http://myatomic.com/ | Last Journal: Sunday November 19 2006, @12:31AM)
They're already useless if installing one will subject your business to boycotts and/or lawsuits from National Federation of the Blind [nfb.org] and other advocates for people with disabilities.
Re:NFB owns you (Score:5, Informative)
Re:As much as I like open source software ... (Score:1)
(http://www.yafla.com/dforbes/ | Last Journal: Tuesday September 27 2005, @10:43AM)
While Slashdot has always been a target for trolls and miscreants, I don't ever remember it being a spammers' destination (note 4-digit UID). Even back in those crazy, hazy days when we didn't have to try to interpret some bizarro text -- AKA the vast bulk of Slashdot's existence -- somehow spammers were thwarted in their evil quest. Was Slashdot just feeling a bit left out, and just had to stick a CAPTCHA in there to be just like everyone else ("See!? Spammers like us too!")?
CAPTCHAs should be replaced by forcing answers to submitted homework questions - kids get their homework done for them on a distributed network, and it somewhat proves that there's a human on the other end (no machine could interpret most homework questions).
Re:As much as I like open source software ... (Score:3, Interesting)
Two reasons (Score:5, Insightful)
The other constraint is that you have to have your problem be trivially solvable by humans. I know plenty of people who cannot solve the CAPTCHA you have given: one obvious example would be, umm, all of my coworkers, because I live in Japan and "sub sandwitch" is not generally on the Japanese English curriculum. Similarly, you could pose any number of parsing problems which are very difficult for machines ("Here are 10 pictures chosen from HotOrNot. Click the three hot chicks.") but which may also be difficult for some users, such as Slashdotters who have never met a girl before.
By the way, you can find an implementation of that CAPTCHA at http://www.hotcaptcha.com/ [hotcaptcha.com]
Since you ask, here's why: (Score:4, Insightful)
1) It says "My time is more important than yours" to all your correspondents, because you're not willing to look at a few spams getting past your Bayesian filter every day so instead you offload that time burden to people who want to talk to you.
2) Dueling CR systems ("Hey, bob@example.com, I don't recognize you. Please prove you are a human" "Re: Hey, bob -- steve@stupid.com, I don't recognize you. Please prove you are a human"). Even more fun in a potentially infinite loop. Any system you can make to short-circuit this loop can be abused by spam to avoid the CR altogether.
3) Doesn't survive the Chinese Sweatshop Spam Attack, which will be ubiquitous if CR becomes popular. (Take poor Chinese person, teach them 10 words of English, pay them 2 cents an hour to answer CAPTCHAs so you get guaranteed delivery of your Maximize Your Mr. Wiggly offers.)
4) Breaks legitimate bulk mail senders, such as Amazon, Paypal, eBay, mailing lists, etc etc. Mailing lists in particular are going to be very fun, since a lot of CR systems would spam the entire list -- perhaps provoking 100 challenges! Which then leads to combinatorial hilarity!
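Point 2 is easy to see in a toy model. The sketch below is a deliberately simplified simulation, not any real CR product: two mailboxes that both challenge unknown senders, with a flag (a hypothetical name) for whether incoming challenges are themselves challenged, and a cap so the loop terminates.

```python
def simulate_duel(challenge_the_challenges: bool, max_messages: int = 10) -> int:
    """Count the messages exchanged after Bob mails Steve for the first time.

    Both sides run challenge-response and neither knows the other.
    max_messages is an artificial cap standing in for "forever".
    """
    sent = 1  # Bob's original mail, arriving from an unknown sender
    incoming_is_challenge = False
    while sent < max_messages:
        if incoming_is_challenge and not challenge_the_challenges:
            break  # challenges are let through, so the ping-pong stops
        sent += 1  # the receiving side answers with its own challenge
        incoming_is_challenge = True
    return sent


print(simulate_duel(challenge_the_challenges=True))   # runs until the cap
print(simulate_duel(challenge_the_challenges=False))  # stops after one challenge
```

The second setting stops the loop, but only by accepting anything that looks like a challenge, which is exactly the hole a spammer would dress their mail up to fit.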
Re:As much as I like open source software ... (Score:4, Insightful)
(Last Journal: Saturday August 18 2001, @11:04AM)
In order to generate it, you're going to end up using a grammar.
Running grammars in reverse is merely a matter of patience (to explore the space of problems the test program will pose) and the right tools; it's a fundamental bit of computer science.
Granted, expecting spammers to be conversant with the fundamental elements of computer science is a pretty high bar, but it only takes one to leap it and the rest to buy the program from him.
The image tests have the advantage that done properly, it takes more than just patience and computer science fundamentals to crack, it would require fundamental advances in the art.
(Note that nowhere in this message do I claim that image tests are perfect; in fact everything I know is vulnerable to the "feed it to a human in another context (viz, 'porn') and let them do the work" attack, and there are also points to be made about how widespread any given grammar/image test becomes; I know a website where the image test actually is a constant and so far it doesn't seem to be a problem because of scale issues. My point is that text tests have an additional disadvantage. It's not an intrinsically bad idea, though.)
Google wouldn't be interested in hiring people who could crack this, merely because they can crack this. Might make a decent interview question, though.
(You might also be tempted to think that you could just use a really complicated grammar, but you are constrained by two things, the human supposedly reading and taking the test, and the complexity of the human language itself. By the time you write some problem generator that could reliably throw off a parser, you'll be reliably confusing the hell out of your human users, too.)
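To make the "running grammars in reverse" point concrete, here is a toy text-CAPTCHA generator and the matching solver. The grammar, word lists, and function names are invented for illustration and are far simpler than anything a real site would use; the point is only that whatever rules produce the question can be read back by a parser built on the same rules.

```python
import random
import re

# A tiny question grammar: "What is <number> <operation> <number>?"
NUMBER_WORDS = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5}
OPERATIONS = {"plus": lambda a, b: a + b, "times": lambda a, b: a * b}


def generate_question(rng: random.Random):
    """Expand the grammar forwards into a question and its answer."""
    a = rng.choice(list(NUMBER_WORDS))
    b = rng.choice(list(NUMBER_WORDS))
    op = rng.choice(list(OPERATIONS))
    return f"What is {a} {op} {b}?", OPERATIONS[op](NUMBER_WORDS[a], NUMBER_WORDS[b])


QUESTION_PATTERN = re.compile(r"What is (\w+) (\w+) (\w+)\?")


def solve_question(question: str) -> int:
    """Run the same grammar backwards: parse the text, then compute."""
    match = QUESTION_PATTERN.fullmatch(question)
    if match is None:
        raise ValueError("question does not fit the known grammar")
    a, op, b = match.groups()
    return OPERATIONS[op](NUMBER_WORDS[a], NUMBER_WORDS[b])


rng = random.Random(0)
for _ in range(3):
    question, expected = generate_question(rng)
    print(question, solve_question(question) == expected)
```

Making the grammar bigger only makes the solver longer, not different in kind, and the extra variety costs the human reader too.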
No Wrinkle in Time comments? (Score:2, Interesting)
In the future (Score:1)
Re:As much as I like open source software ... (Score:2)
Re:As much as I like open source software ... (Score:1)
Re:As much as I like open source software ... (Score:2)
Captchas are designed to be difficult to OCR. Besides there are plenty of OCR apps around already, if you hadn't noticed. I don't think spammers have been holding out for a GPL one.
Re:Am I really stupid or... (Score:2)
outputFilename.raw #???
outputFilename.map # seems to be a location map of 0/1's where 1's are valid text and 0's aren't
outputFilename.txt # the text from the OCR event
I also found that the tessdata directory did not get installed into the
Without "batch", it tries to bring up an X window but that just quickly goes away with no debug output.
Usage: tesseract inputfile.tif [path/]outputfilename batch
LoB
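For scripting this over a pile of scans, the usage line above can be wrapped in a small helper. This is a sketch that only builds the command and lists the files described above; it assumes the argument order and the .raw/.map/.txt outputs are as reported in this thread, and the function names and paths are made up for the example.

```python
import shlex
from pathlib import Path
from typing import List


def build_tesseract_command(image: str, output_base: str,
                            executable: str = "tesseract") -> List[str]:
    """Argument list for: tesseract inputfile.tif [path/]outputfilename batch"""
    return [executable, str(image), str(output_base), "batch"]


def expected_outputs(output_base: str) -> List[Path]:
    """Files reported in this thread for a given output name."""
    return [Path(f"{output_base}{suffix}") for suffix in (".raw", ".map", ".txt")]


command = build_tesseract_command("page.tif", "out/page")
print(shlex.join(command))
print(expected_outputs("out/page"))

# To actually run it (needs the binary next to the tessdata directory,
# as the README says):
#   import subprocess
#   subprocess.run(command, check=True)
```

Passing the arguments as a list rather than one string avoids shell quoting problems with file names that contain spaces.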
Re:As much as I like open source software ... (Score:1)
Your title and post make you sound like you think this shouldn't be released open source, just in case spammers use it.
Well, then OOo will have to stop releasing their office suite: just think, Base could be used to store e-mail addresses to spam! Or, maybe no open source e-mail clients should be released, because the spammers might use it to send spam!
Don't blame the software for the way it is used; it's the user's fault if (s)he decides to use it malevolently. Most software has the potential for misuse, some more than others, but fear of spam shouldn't stop the release of tools just because they might be misused. Just think of the positive uses of programs like this.
Besides, it's more than easy enough for spammers to just make a program to do stuff like break CAPTCHAs (yes, I know they're designed to defeat spammers, but nothing's perfect).
Re:Why release it on Sourceforge (Score:1)
(http://www.fanboy.co.nz/adblock/)
Re:As much as I like open source software ... (Score:1)
That's never happened before!