Images & Visualization: April 2008

Recently I had a very interesting project from client who wanted to extract the text (in my case numbers) embedded on thousands of images. My natural choice was to use Tesseract, as it can be scripted and applied to many images in sequenence.

Tesseract is an OCR software, originally developed by Hewlett Packard and currently developed by Google. It is a open source software released under Apache license. Since it is open source, you can get your hands on it and install it on pretty much any operating system. I installed it on a Windows machine under Cygwin and the installation was a breeze.

Tesseract does not have any segmentation methods, no document layout and can only output the recognised text to a file. But its accuracy is good enough for many applications. It was ranked among top 3rd OCR software for the year 1995. Making a call to tesseract is also easy

tesseract data.bmp text.dat [-l langid]

The values within [] are optional. The langid is the the language being recognized. The default language is English. But it also supports French, Italian, German, Spanish.

Since Tesseract does not have any segmentation methods to separate the text from background, the user have to apply these methods using other softwares like ImageMagick, ITK etc. The most common segmentation technique for scanned documents is the Local Adaptive Thresholding. It takes in to account the variation in background intensity across the scanned image and thresholds accordingly. But the right technique has to be chosen depending on the type of image being recognized.

Possible additions to Tesseract
All the text that I needed to decipher in my images were numbers but Tesseract does not have a langid for numbers. Since there are no langid for numbers, Tesseract deciphered some of these numbers as alphabets. If I have time in the future, I will work on creating the langid for numbers as it will be helpful for many people. If you find that it might be helpful for you, I encourage you to create one or contact me and we can work togther.

Images & Visualization

Wednesday, April 9, 2008

Optical Character Recoginition (OCR) using Tesseract