Wednesday, April 9, 2008

Optical Character Recognition (OCR) using Tesseract

Recently I had a very interesting project from a client who wanted to extract the text (in my case, numbers) embedded in thousands of images. My natural choice was Tesseract, as it can be scripted and applied to many images in sequence.

Tesseract is an OCR engine, originally developed by Hewlett-Packard and currently developed by Google. It is open source software released under the Apache license. Since it is open source, you can get your hands on it and install it on pretty much any operating system. I installed it on a Windows machine under Cygwin and the installation was a breeze.

Tesseract has no segmentation methods and no document layout analysis, and it can only output the recognised text to a file. But its accuracy is good enough for many applications. It was one of the top three engines in the 1995 UNLV accuracy test. Making a call to tesseract is also easy:

tesseract data.bmp text.dat [-l langid]

The values within [] are optional. The langid is the language being recognized. The default language is English, but it also supports French, Italian, German and Spanish.
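Since the whole point was to run this over thousands of images, here is a minimal Python sketch of how the call above could be scripted. The function names, the folder layout and the "*.bmp" pattern are my own assumptions for illustration, not part of Tesseract.

```python
import subprocess
from pathlib import Path


def build_command(image, output_base, lang=None):
    """Return the argument list for one tesseract call."""
    command = ["tesseract", str(image), str(output_base)]
    if lang:
        # -l is optional; without it tesseract uses its default language.
        command += ["-l", lang]
    return command


def ocr_folder(folder, lang=None, pattern="*.bmp"):
    """Run tesseract on every matching image in `folder`, one at a time."""
    for image in sorted(Path(folder).glob(pattern)):
        # Use the image name without its extension as the output base.
        # Note that tesseract appends ".txt" to whatever base it is given.
        output_base = image.with_suffix("")
        subprocess.run(build_command(image, output_base, lang), check=True)
```

Calling ocr_folder("scans") would then process each image in a (hypothetical) "scans" folder in sorted order and stop at the first image on which tesseract reports an error.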

Since Tesseract does not have any segmentation methods to separate the text from the background, the user has to apply these methods using other software such as ImageMagick, ITK etc. The most common segmentation technique for scanned documents is local adaptive thresholding. It takes into account the variation in background intensity across the scanned image and thresholds accordingly. But the right technique has to be chosen depending on the type of image being recognized.
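To make the idea of local adaptive thresholding concrete, here is a small pure-Python sketch. It is only an illustration of the general technique, not the implementation used by ImageMagick or ITK; the function name, the `radius` and `offset` parameters and the list-of-rows image format are assumptions of mine. Each pixel is compared with the mean of its own neighbourhood instead of with one global value.

```python
def adaptive_threshold(image, radius=1, offset=0):
    """Binarize a grayscale image given as a list of rows of ints (0-255).

    Each pixel is compared with the mean of its (2*radius+1)-square
    neighbourhood, clipped at the image borders. Pixels darker than
    mean - offset become 0 (text); all others become 255 (background).
    """
    height, width = len(image), len(image[0])

    # Integral image: integral[y][x] is the sum of image[0..y-1][0..x-1],
    # so the sum over any window costs four lookups.
    integral = [[0] * (width + 1) for _ in range(height + 1)]
    for y in range(height):
        row_sum = 0
        for x in range(width):
            row_sum += image[y][x]
            integral[y + 1][x + 1] = integral[y][x + 1] + row_sum

    result = []
    for y in range(height):
        top, bottom = max(0, y - radius), min(height, y + radius + 1)
        out_row = []
        for x in range(width):
            left, right = max(0, x - radius), min(width, x + radius + 1)
            total = (integral[bottom][right] - integral[top][right]
                     - integral[bottom][left] + integral[top][left])
            mean = total / ((bottom - top) * (right - left))
            out_row.append(0 if image[y][x] < mean - offset else 255)
        result.append(out_row)
    return result
```

A larger radius follows slow changes in the background, while the offset keeps small noise in a uniform background from being marked as text. Both would need tuning for the images at hand.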

Possible additions to Tesseract
All the text that I needed to decipher in my images was numbers, but Tesseract does not have a langid for numbers. Since there is no langid for numbers, Tesseract deciphered some of these numbers as letters. If I have time in the future, I will work on creating a langid for numbers, as it will be helpful for many people. If you find that it might be helpful for you, I encourage you to create one or contact me and we can work together.
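Until such a langid exists, one possible stopgap is to clean up the recognised text afterwards. The sketch below is only an illustration and is not something Tesseract provides: the function name and the table of look-alike characters are my own guesses and would need adjusting to the errors actually seen in your output.

```python
# Hypothetical table of letters that are easily confused with digits.
LOOKALIKES = str.maketrans({
    "O": "0", "o": "0",
    "I": "1", "l": "1", "|": "1",
    "Z": "2", "z": "2",
    "S": "5", "s": "5",
    "B": "8",
})


def letters_to_digits(text):
    """Replace look-alike letters with the digits they resemble."""
    return text.translate(LOOKALIKES)


def is_all_digits(text):
    """Report whether the cleaned text contains only digits and whitespace."""
    return all(ch.isdigit() or ch.isspace() for ch in text)
```

This only makes sense when the image is known to contain nothing but numbers, and anything that still fails is_all_digits after the replacement should be checked by hand.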

6 comments:

JK said...

What are the 1st and 2nd best OCR engines?

Ravi said...

I tried other software like SimpleOCR. Even though it was quite good, I could not use it as I needed a command-line version, and that too under a Linux-like environment. Hence I naturally gravitated to Tesseract.

ammouna said...

You have said that Tesseract does not use any technique for segmentation, so could you explain how it works to extract blocks and characters? I will be thankful.

Ravi said...

Hi Ammouna,
To segment the text from the background, you could use any of the image processing tools, including Matlab, ImageMagick etc. On one of the projects I used Matlab, and on the other I used ImageMagick's "local adaptive threshold". Both techniques worked but the latter is definitely easier to use.

rc said...

Thanks, well explained! :):):):):):)

buyi wen said...

If you like Tesseract OCR, you may like this free online OCR tool that uses Tesseract OCR 3.02.