September 30th, 2008 | nubae
OCROpus - next gen. OCR for Linux
I was recently asked to scan in a small book with relatively small text, so that it could be added to a database and looked up by typing keywords. The first step is to scan and OCR the entire book. Fortunately I've got all the necessary software for scanning (sane and xsane), and my scanner was automatically recognised by Ubuntu.
The hard part was finding suitable OCR software for Linux. In the past, OCR was a sad state of affairs in the Linux world. Interest has revived though, and last year saw a few Google Summer of Code projects released, including Tesseract and OCROpus. Since they are very recent additions, the software is not exactly mature, but it's the best there is, boasting 95% accuracy.
I installed OCROpus and Tesseract from Subversion, since it makes sense to check out the latest version. Though there are some recent releases in the Ubuntu repositories, I couldn't get OCROpus to recognise the needed extras with them. So the instructions I used are as follows:
svn checkout http://tesseract-ocr.googlecode.com/svn/trunk/ tesseract-ocr-read-only
svn checkout http://iulib.googlecode.com/svn/trunk/ iulib-read-only
svn checkout http://ocropus.googlecode.com/svn/trunk/ ocropus-read-only
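The three checkouts follow the same URL pattern, so they can be wrapped in a small loop. This is only a sketch: the function name `checkout_all` is mine, and it assumes `svn` is on your PATH and that you want all three trees in the current directory.

```shell
#!/bin/sh
# Check out tesseract-ocr, iulib and ocropus side by side in the current
# directory, using the same URLs and target names as the commands above.
# checkout_all is a hypothetical helper name, not part of any of the projects.
checkout_all() {
    for project in tesseract-ocr iulib ocropus; do
        # Stop at the first checkout that fails.
        svn checkout "http://$project.googlecode.com/svn/trunk/" "$project-read-only" || return 1
    done
}
```

Save it to a file, source it, and call `checkout_all` from the directory where you want the sources to live.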
The Tesseract source has a bug that prevents it from compiling with gcc 4.3 (the default in Intrepid Ibex), so you need to apply this patch: download it to the directory where tesseract-ocr-read-only is located and run:
patch -p1 < tesseract+gcc-4.3.diff
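If you are not sure the patch still matches your checkout, it may be worth testing it before touching any files. The sketch below assumes GNU patch (for the `--dry-run` option). The helper name `apply_patch_checked` and its two arguments are made up for illustration.

```shell
#!/bin/sh
# Apply a patch only if a dry run succeeds first.
# apply_patch_checked is a hypothetical helper: $1 is the source directory,
# $2 is the patch file. Use an absolute path for $2, because the function
# changes into $1 before reading it.
apply_patch_checked() {
    src_dir=$1
    patch_file=$2
    (
        cd "$src_dir" || exit 1
        # --dry-run reports what would happen without changing any files.
        patch -p1 --dry-run < "$patch_file" > /dev/null || exit 1
        patch -p1 < "$patch_file"
    )
}
```

For example: `apply_patch_checked tesseract-ocr-read-only "$PWD/tesseract+gcc-4.3.diff"`. If the dry run fails, nothing in the source tree is modified.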
Then you can build and install Tesseract.
To install iulib, go into its directory and do:
./configure && make && sudo make install
The documentation section also uses OpenFST, which is a separate download. It has a problem building on the latest Ubuntu, however, so I chose not to use it.
To build ocropus, go into the ocropus-read-only folder and do:
./configure --without-fst --without-leptonica
make
sudo make install
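Since iulib has to be installed before OCROpus is configured, the two builds can be chained so that a failure in one stops the rest. This is a rough sketch using the same commands and flags as above. The names `build_tree` and `build_all` are my own, and it assumes you run it from the directory that holds both checkouts.

```shell
#!/bin/sh
# Configure, build and install one source tree.
# $1 is the directory; any further arguments are passed to ./configure.
# build_tree and build_all are hypothetical helper names.
build_tree() {
    dir=$1
    shift
    # The subshell keeps the cd from leaking into the caller;
    # && makes each step depend on the one before it.
    ( cd "$dir" && ./configure "$@" && make && sudo make install )
}

# Build iulib first, then ocropus without OpenFST and Leptonica.
build_all() {
    build_tree iulib-read-only &&
    build_tree ocropus-read-only --without-fst --without-leptonica
}
```

Calling `build_all` then runs the whole sequence and returns a non-zero status at the first step that fails.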
Right now, OCROpus works only as a command-line tool, though one hopes some frontends are coming soon. In any case, using it from the command line is not too hard if all you want is to convert a scan to an HTML-formatted page:
ocroscript recognize page.png > page.html
You can also recognize a sequence of pages by listing them one by one in a file (say, file-list) and running:
ocroscript recognize @file-list > text.html
WHAT DO THE ROLLS ROYCE PHANTOM AND THE HYUNDAI GENESIS HAVE IN COMMON? HINT: IT’S NOT THE PRICE.
For starters——they both share a 17-speaker Lexic0n® 7.1 surround sound system} Now, we don’t suppose you’ll confuse a Genesis with a Rolls—Royce anytime soon, but these two luxury cars do share more vital appointments than you might expect. For instance—a quiet cabin—assm·ed (like the Phantoms) by whisper valves that select an alternate exhaust at low speeds to reduce noise, and by acoustic laminated windows. The car’s trailblazing ergonomics are exceptional too. A widely acclaimed DIS knob gives you intuitive access to GPS, the sound system, and any_ Bluetooth" phone. Outside, the Genesis glows with a finish so nearly perfect that (just like Rol1s—Royce) we have to use robots to achieve it. No wonder it looks so good. In fact, if you’d rather have money than a hood ornament, it may tend to look even better than a Rolls-Royce.
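For a whole book, typing the page list by hand gets tedious, so the file-list can be generated from a directory of scans. This is a sketch under a few assumptions: the scans are PNG files whose names sort in page order, the list takes one path per line, and `make_file_list` is a helper name I made up.

```shell
#!/bin/sh
# Write the paths of all PNG scans in a directory to a list file,
# one per line. $1 is the scan directory, $2 is the list file to create.
# make_file_list is a hypothetical helper name.
make_file_list() {
    scan_dir=$1
    list_file=$2
    # The shell expands globs in sorted order, so page-001.png comes
    # before page-002.png as long as the numbers are zero-padded.
    for page in "$scan_dir"/*.png; do
        # Skip the unexpanded pattern when the directory has no PNG files.
        if [ -e "$page" ]; then
            printf '%s\n' "$page"
        fi
    done > "$list_file"
}
```

After `make_file_list scans file-list`, the `ocroscript recognize @file-list > text.html` command above can be run unchanged.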


February 3rd, 2009 | OCR Revisited
...installation on Ubuntu is for 0.5, and looks very complicated, so I’m going to bookmark nubae’s Habari | Linux and Education piece on ocrupus as not only does it look simpler, but it details a bug with regard to Intrepid Ibex
Tesseract sourc...
February 26th, 2009 | Tesseract OCR and Ocropus at Docunext Technology
...ause it has page layout capabilities. And so I had to download and install a few items, thankfully, this page supplied all the information I needed. ...