IIIT Hyderabad, UMASS AMHERST
Robust Recognition of Documents by Fusing Results of Word Clusters
Venkat Rasagna (1), Anand Kumar (1), C. V. Jawahar (1), R. Manmatha (2)
(1) Center for Visual Information Technology, IIIT-Hyderabad
(2) Center for Intelligent Information Retrieval, UMASS-Amherst
Introduction
Recognition of books and collections.
Recognition of words is crucial to information retrieval.
Dictionaries and post-processors are not feasible in many languages.
Motivation
Most (Indian language) OCRs recognize glyphs (components) and generate text from the class labels.
Word accuracies are far lower than component accuracies.
Word accuracy decreases as the number of components in the word increases.
Using a language model for post-processing is challenging.
–High entropy, large vocabulary (e.g. Telugu).
–Language processing modules are still emerging.
[Plots: component accuracy vs. word accuracy; word accuracy vs. number of components]
Is it possible to use multiple occurrences of the same word to improve OCR performance?
[Figure: Recognize and Parse example. Average word length = (value missing); component accuracy = 9/12 = 75%; word accuracy = 25%]
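Why word accuracy falls so quickly with word length can be illustrated with a small calculation. This is a minimal sketch that assumes component errors are independent, which the slides do not state; the function name `expected_word_accuracy` is hypothetical.

```python
def expected_word_accuracy(component_accuracy: float, num_components: int) -> float:
    """Probability that every component of a word is recognized correctly.

    Assumption (not from the slides): component errors are independent,
    so the per-component probabilities simply multiply.
    """
    return component_accuracy ** num_components


# With 75% component accuracy, longer words are rarely fully correct.
for n in (1, 2, 4, 8):
    print(n, expected_word_accuracy(0.75, n))
```

Under this assumption a word is correct only if all of its components are, so even a modest gain in component accuracy compounds into a larger gain for long words.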
Overview
Multiple occurrences of a word appear in the text.
Words are degraded independently.
OCR output differs for the same word at different instances.
[Figure labels: Text, Cluster, OCR, OCR output, Goal]
Related Work
[Figure: script samples in Malayalam, Bangla, Tamil, Hindi]
Character recognition in Indian languages is still an unsolved problem.
Telugu is one of the most complex scripts.
Recognition of books has received some attention recently.
Word images are efficiently matched for retrieval.
Use of word image clusters to improve OCR accuracy.
References (in slide order):
U. Pal, B. Chaudhuri, Pattern Recognition
A. Negi et al., ICDAR, 2001
C. V. Jawahar et al., ICDAR, 2003
K. S. Sesh Kumar et al., ICDAR
P. Xiu and H. S. Baird, DRR XV, 2008
N. V. Neeba, C. V. Jawahar, ICPR
T. M. Rath et al., IJDAR, 2007
T. M. Rath et al., CVPR, 2003
Anand Kumar et al., ACCV, 2007
H. Tao, J. Hull, Document Analysis and Information Retrieval, 1995
Conventional Recognition Process
Scanned images -> Preprocessing -> Segmentation and word detection -> Recognizer (Feature extraction, Classification) -> Text (Unicode)
Proposed Recognition Process
Adds a grouping path: Word-level feature extraction -> Word grouping (clustering) -> Word groups -> Combining OCR results
Locality Sensitive Hashing (LSH)
Goal: "r-Near Neighbour"
–For any query q, return a point p ∈ P such that ||p - q|| ≤ r (if it exists).
LSH has been used for
–Data mining: Taher H. Haveliwala, Aristides Gionis, Piotr Indyk, WebDB, 2000
–Information retrieval: A. Andoni, M. Datar, N. Immorlica, V. Mirrokni, Piotr Indyk, 2006
–Document image search: Anand Kumar, C. V. Jawahar, R. Manmatha, ACCV, 2007
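The r-near-neighbour goal can be sketched with a small hash-table lookup. This is a minimal illustration, assuming the common p-stable hash family h(v) = floor((a.v + b) / w); the slides do not say which family was used, and the names `make_g`, `build_tables`, `r_near_neighbour` and the parameters `k`, `w` are hypothetical.

```python
import math
import random


def make_g(dim, k, w, rng):
    """Build g(v) = (h_1(v), ..., h_k(v)) with h(v) = floor((a.v + b) / w).

    Assumed hash family (p-stable, Gaussian projections); not taken from the slides.
    """
    params = [([rng.gauss(0.0, 1.0) for _ in range(dim)], rng.uniform(0.0, w))
              for _ in range(k)]

    def g(v):
        return tuple(math.floor((sum(x * y for x, y in zip(a, v)) + b) / w)
                     for a, b in params)
    return g


def build_tables(points, gs):
    """One hash table per function g: bucket key -> list of point indices."""
    tables = []
    for g in gs:
        table = {}
        for idx, p in enumerate(points):
            table.setdefault(g(p), []).append(idx)
        tables.append(table)
    return tables


def r_near_neighbour(q, points, gs, tables, r):
    """Return the index of some point p with ||p - q|| <= r found in q's buckets, else None."""
    for g, table in zip(gs, tables):
        for idx in table.get(g(q), []):
            if math.dist(q, points[idx]) <= r:
                return idx
    return None


rng = random.Random(0)
points = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]]
gs = [make_g(dim=2, k=2, w=1.0, rng=rng) for _ in range(4)]
tables = build_tables(points, gs)
print(r_near_neighbour([5.0, 5.0], points, gs, tables, r=0.5))
```

Only points that share a bucket with the query are compared, so a true near neighbour can be missed; using several tables reduces that risk at the cost of memory.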
LSH clustering on word images [TODO]
Character Majority Voting
[Figure labels: Word Cluster, OCR output, Components, Final Output]
Algorithm [TODO]
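Character majority voting over a word cluster might look like the sketch below. The slide leaves the algorithm as a TODO, so this is an assumption-laden illustration: OCR outputs are treated as sequences of component labels (Latin strings stand in for glyph labels), votes are taken position by position, and outputs whose length differs from the most common length are left out of the vote. The function name `character_majority_vote` is hypothetical.

```python
from collections import Counter


def character_majority_vote(cluster):
    """Position-wise majority vote over the OCR outputs of one word cluster.

    Assumption: only outputs with the most common length vote, because
    position-wise voting needs the components to line up.
    """
    common_length = Counter(len(word) for word in cluster).most_common(1)[0][0]
    voters = [word for word in cluster if len(word) == common_length]
    # zip(*voters) yields one tuple of labels per component position.
    return [Counter(column).most_common(1)[0][0] for column in zip(*voters)]


cluster = ["cluster", "clusler", "c1uster", "cluster", "clustr"]
print("".join(character_majority_vote(cluster)))
```

Dropping outputs of a different length wastes evidence from words with cut or merged components, which is the gap an alignment step is meant to close.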
Alignment: Dynamic Programming [1,2]
[Figure: table with columns "Word Img" and "OCR o/p"; voting for word 1 after aligning]
DTW o/p for word 1 =
CMV o/p for word 1 =
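Aligning OCR outputs before voting can be sketched with a standard dynamic-programming table. This is a minimal sketch, assuming unit costs for substitution, insertion and deletion (an edit-distance alignment); the slides do not give the cost function, and the names `align` and `vote_for_word` are hypothetical. Latin strings stand in for component labels.

```python
from collections import Counter


def align(a, b):
    """Edit-distance alignment; returns pairs (x, y) where None marks a gap."""
    n, m = len(a), len(b)
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i][j] = min(cost[i - 1][j - 1] + (a[i - 1] != b[j - 1]),
                             cost[i - 1][j] + 1,
                             cost[i][j - 1] + 1)
    # Trace back from the bottom-right corner to recover the aligned pairs.
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + (a[i - 1] != b[j - 1]):
            pairs.append((a[i - 1], b[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            pairs.append((a[i - 1], None))
            i -= 1
        else:
            pairs.append((None, b[j - 1]))
            j -= 1
    pairs.reverse()
    return pairs


def vote_for_word(index, cluster):
    """Vote on each component of cluster[index] after aligning it with every other word."""
    target = cluster[index]
    votes = [Counter([label]) for label in target]  # each component votes for itself
    for k, other in enumerate(cluster):
        if k == index:
            continue
        position = 0
        for x, y in align(target, other):
            if x is None:  # component present only in the other word: ignored here
                continue
            if y is not None:
                votes[position][y] += 1
            position += 1
    return [v.most_common(1)[0][0] for v in votes]


cluster = ["clusler", "cluster", "clustr", "cluster"]
print("".join(vote_for_word(0, cluster)))
```

In this sketch the vote can relabel components of the target word but never inserts or deletes them, so cuts and merges in the target itself remain uncorrected.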
Results
The word generation process makes correct annotations available for evaluating performance.
[Table: Component Accuracy and Word Accuracy, each for OCR, CMV, DTW; one row per dataset SF1 to SF4; values missing]
[Labels: clusters, 20 variations, degraded dataset]
Results
Word accuracy vs. number of words
–Adding more words makes the dataset more ambiguous.
–Algorithm performance increases with the number of words, then saturates.
Word accuracy vs. word length
–Word accuracy decreases as word length increases.
–Using the cluster information helps achieve good word accuracies.
Analysis
[Table columns: Image, OCR, CMV, DTW]
Results
[Table: columns Size, No. of Clusters, Length Range, Word WL, No. of words, Symbol accuracy (OCR, CMV, DTW), Word Accuracy (OCR, CMV, DTW); rows B1 Short, B1 Medium, B1 Long, B1 ALL, B2 ALL, B3 ALL, B4 ALL; values missing]
For a small increase in component accuracy, there is a large improvement in word accuracy.
The improvement is high for long words.
Relative improvement of 12% for words that occur at least twice.
Analysis
[Four example tables, each with columns Image, OCR, CMV, DTW]
–Cuts and merges
–CMV vs. DTW
–Wrong word in the cluster
–Cases that can't be handled
Conclusion & Future Work
A new framework has been proposed for OCR of books.
A word recognition technique that uses document constraints is shown.
An efficient clustering algorithm is used to speed up the process.
Word-level accuracy improves from 70.37% to 79.12%.
The technique can also be used for other languages.
Future work: extend the approach to handle unique words by creating clusters over parts of words.
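The reported word-level figures can be read as either an absolute or a relative gain. The short calculation below uses only the two accuracies quoted in the conclusion; the variable names are illustrative.

```python
before, after = 70.37, 79.12  # word-level accuracy (%) quoted in the conclusion

absolute_gain = after - before                 # percentage points
relative_gain = absolute_gain / before * 100   # percent of the baseline

print(f"{absolute_gain:.2f} points absolute, {relative_gain:.1f}% relative")
```

The relative figure comes out at roughly 12%, while the absolute gain is under 9 percentage points, so it matters which of the two a quoted improvement refers to.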
END
Additional slides
LSH Algorithm
Algorithm: Word Image Clustering
Require: Word images Wj and features Fj, j = 1,...,n
Ensure: Word image clusters O
for each i = 1,...,l do
  for each j = 1,...,n do
    Compute hash bucket I = gi(Fj)
    Store word image Wj in bucket I of hash table Ti
  end for
end for
k = 1
for each i = 1,...,n and Wi unmarked do
  Query hash table for word Wi to get cluster Ok
  Mark word Wi with k
  k = k + 1
end for
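The clustering pseudocode can be turned into a short runnable sketch. Two things here are assumptions rather than details from the slide: the hash functions g_i are replaced by a hypothetical shifted-grid quantizer (`make_grid_hash`), and a query marks every still-unmarked word found in the buckets of W_i, not just W_i itself.

```python
import math
from collections import defaultdict


def make_grid_hash(w, shift):
    """Hypothetical stand-in for g_i: quantize features onto a shifted grid of cell width w."""
    return lambda f: tuple(math.floor((x + shift) / w) for x in f)


def cluster_word_images(features, gs):
    """Fill one hash table per g_i, then query each unmarked word to form a cluster."""
    tables = []
    for g in gs:                              # for each i = 1..l
        table = defaultdict(list)
        for j, f in enumerate(features):      # for each j = 1..n
            table[g(f)].append(j)             # store W_j in bucket g_i(F_j) of T_i
        tables.append(table)

    labels = [None] * len(features)
    k = 1
    for i, f in enumerate(features):
        if labels[i] is not None:             # W_i is already marked
            continue
        candidates = {i}
        for g, table in zip(gs, tables):      # query: union of W_i's buckets
            candidates.update(table[g(f)])
        for j in candidates:
            if labels[j] is None:
                labels[j] = k                 # mark cluster members with k
        k += 1
    return labels


features = [[0.2, 0.2], [0.25, 0.2], [9.0, 9.0]]
gs = [make_grid_hash(w=1.0, shift=0.0), make_grid_hash(w=1.0, shift=0.5)]
print(cluster_word_images(features, gs))
```

Each word is hashed once per table and queried at most once, which avoids comparing every pair of word images.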
Word Error Correction
Algorithm: Word Error Correction
Require: Cluster C of words Wi, i = 1,...,n
Ensure: Cluster O of correct words
for each i = 1,...,n do
  for each j = 1,...,n do
    if j != i then
      Align words Wi and Wj
      Record errors Ek, k = 1,...,m in Wi
      Record possible corrections Gk for Ek
    end if
  end for
  Correct Ek if probability pk of correction Gk is maximum
  O <- O U Wi
end for
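The error-correction loop can be sketched end to end for a whole cluster. This sketch swaps in Python's `difflib.SequenceMatcher` for the alignment step, which is not the slides' method, and it estimates p_k as the fraction of the other words proposing the same correction, applying a correction only when that fraction exceeds one half. Both choices, and the name `correct_cluster`, are assumptions.

```python
from collections import Counter, defaultdict
from difflib import SequenceMatcher


def correct_cluster(cluster):
    """Return a corrected copy of every word in the cluster."""
    corrected = []
    others = len(cluster) - 1
    for i, wi in enumerate(cluster):
        proposals = defaultdict(Counter)  # position k in W_i -> candidate corrections G_k
        for j, wj in enumerate(cluster):
            if j == i:
                continue
            matcher = SequenceMatcher(None, wi, wj, autojunk=False)
            for tag, i1, i2, j1, j2 in matcher.get_opcodes():
                # Only same-length substitutions are recorded in this sketch.
                if tag == "replace" and i2 - i1 == j2 - j1:
                    for offset in range(i2 - i1):
                        proposals[i1 + offset][wj[j1 + offset]] += 1
        out = list(wi)
        for k, counter in proposals.items():
            best, count = counter.most_common(1)[0]
            if others > 0 and count / others > 0.5:  # assumed acceptance rule
                out[k] = best
        corrected.append("".join(out))
    return corrected


cluster = ["cluster", "clusler", "cluster", "c1uster"]
print(correct_cluster(cluster))
```

Because insertions and deletions are ignored, this version handles substitution errors only; cut or merged components would need the gaps from the alignment to be voted on as well.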
Dataset
–5000 clusters, each with 20 images of the same word at different font sizes and resolutions.
–Words were generated using ImageMagick.
–Words were degraded with the Kanungo degradation model to approximate real data.
–The SF1, SF2, SF3, and SF4 datasets were degraded with 0%, 10%, 20%, and 30% noise, respectively.
[Pipeline diagram labels: Hashing; Hashed words; Pre-processing; Segmentation and word detection; Feature extraction; Hashing; Feature extraction; OCR; Text; Fusion (Method 1 / Method 2); OCR output; Cluster of words; Word image]