Saturday, January 3, 2009

Spell checker in Unicode using Python

I have a project where I had to perform spell check on characters recognized using an optical character recognition program (OCR). My first choice was to search for an existing program written preferably in python, my favorite choice for such work. You can download the complete file here.

Amazingly I found this work by Peter Norvig . It was very well documented and well written piece of code.

But I had few issues that I needed to fix and so I could not use it directly.

1. In my program, unicode characters need to defined as the default character for all input and output unlike peter's program which works on ascii.

This is performed in the following code

#!/usr/bin/python -Wall
# -*- coding: utf-8 -*-

import re, collections, pprint,os
import sys
import codecs

if __name__ == '__main__':
reload(sys) sys.setdefaultencoding('iso8859-1')

2. The list of alphabets will also include the unicode characters applicable in my situation like
alphabet = u'abcdefghijklmnopqrstuvwxyzàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿß'

3. The unicode feature of python is smart enough to recognise the right characters for conversion from upper case to lower case. All that needs to be done is to call the .lower() function on any unicode characters in the following function.

def words(text): return re.findall(u'[abcdefghijklmnopqrstuvwxyzàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿß]+',text.lower())

4. Peter's program trains different words by determining the probability of its occurence. In simple terms, it counts the number of time a word appears in a standard piece of text. The larger the piece of text, the more representative it is to the real world. This scenario was not true in my case, as I do not have a piece of text where a word gets repeated multiple times.

In my case, I have a list of words in a text file. Almost all the word gets repeated only once and not any more. So the rank of a word was not in frequency but its ordinality.

The ord function in python returns the unicode position of a character input. In the function below, I first determine the ordinality of each word in the possible candidates (i.e., the original set of words). Then the ordinality of the word to be spell checked is also found. The difference between the two ordinalities is determined and the location of the lowest value gives the location of the correct word in the candidates.

def best_candidate(candidates,word):
     clist = list(candidates)
     #Find ordinality for the complete list
     so = []
     for cl in clist:
         sum_ord = 0
         for c in cl:
             sum_ord = sum_ord+ord(c)

     #Find ordinality of the given word
     sum_ord = 0
     for c in word:
         sum_ord = sum_ord+ord(c)

#Find difference in ordinality and also lowest value location
     so_item_l = []
     for so_item in so:
     min_loc = so_item_l.index(min(so_item_l))

     return clist[min_loc]

No comments: