1
Robust Recognition of Documents by Fusing Results of Word Clusters
Venkat Rasagna 1, Anand Kumar 1, C. V. Jawahar 1, R. Manmatha 2
1 Center for Visual Information Technology, IIIT Hyderabad
2 Center for Intelligent Information Retrieval, UMASS Amherst
2
Introduction
- Recognition of books and collections.
- Recognition of words is crucial to information retrieval.
- Dictionaries and post-processors are not feasible in many languages.
3
Motivation
- Most (Indian language) OCRs recognize glyphs (components) and generate text from the class labels.
- Word accuracies are far lower than component accuracies.
- Word accuracy drops as the number of components in the word increases.
- Using a language model for post-processing is challenging:
  – High entropy, large vocabulary (e.g., Telugu).
  – Language processing modules are still emerging.
[Plot: word accuracy vs. number of components]
[Example: recognize and parse a word image; component accuracy = 9/12 = 75%, word accuracy = 25%]
Is it possible to make use of multiple occurrences of the same word to improve OCR performance?
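As a rough illustration of why word accuracy falls so quickly (not from the slides): if each component is recognized with probability p and component errors are assumed independent, a word with n components is fully correct with probability about p^n. A minimal sketch under that assumption:

```python
# Back-of-the-envelope sketch (assumption: component errors are independent).
def expected_word_accuracy(component_accuracy: float, n_components: int) -> float:
    # A word is correct only if every component is correct.
    return component_accuracy ** n_components

if __name__ == "__main__":
    p = 0.75  # component accuracy from the slide's example (9/12)
    for n in (1, 2, 4, 8):
        print(f"{n:2d} components -> ~{expected_word_accuracy(p, n):.2%} word accuracy")
```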
4
Overview
- A word occurs multiple times in the text.
- Each occurrence is degraded independently, so the OCR output differs across instances of the same word.
- Goal: cluster the word images and combine the OCR results within each cluster into a single, more reliable output.
[Diagram: text with multiple occurrences of a word → independent degradations → differing OCR outputs → cluster → combined output]
5
Related Work
- Character recognition in Indian languages (Malayalam, Bangla, Tamil, Hindi) is still an unsolved problem. [U. Pal, B. Chaudhuri, Pattern Recognition, 2004]
- Telugu is one of the most complex scripts. [A. Negi et al., ICDAR, 2001; C. V. Jawahar et al., ICDAR, 2003; K. S. Sesh Kumar et al., ICDAR, 2007]
- Recognition of a book has received some attention recently. [P. Xiu and H. S. Baird, DRR XV, 2008; N. V. Neeba, C. V. Jawahar, ICPR, 2008]
- Word images are efficiently matched for retrieval. [T. M. Rath et al., IJDAR, 2007; T. M. Rath et al., CVPR, 2003; Anand Kumar et al., ACCV, 2007]
- Use of word image clusters to improve OCR accuracy. [H. Tao, J. Hull, Document Analysis and Information Retrieval, 1995]
6
Conventional vs. Proposed Recognition Process
- Conventional: scanned images → preprocessing → segmentation and word detection → feature extraction → classification (recognizer) → text (UNICODE).
- Proposed: additionally, word-level feature extraction → word grouping (clustering) → word groups → combining OCR results.
7
Locality Sensitive Hashing (LSH)
Goal ("r-near neighbour"): for any query q, return a point p ∈ P such that ||p − q|| ≤ r (if it exists).
LSH has been used for:
  – Data mining: Taher H. Haveliwala, Aristides Gionis, Piotr Indyk, WebDB, 2000
  – Information retrieval: A. Andoni, M. Datar, N. Immorlica, V. Mirrokni, Piotr Indyk, 2006
  – Document image search: Anand Kumar, C. V. Jawahar, R. Manmatha, ACCV, 2007
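For illustration, a minimal sketch of one LSH hash function for Euclidean distance in the standard p-stable (random projection) style; the hash family, bucket width w, and class name are assumptions, not details taken from the slides:

```python
# One LSH hash function for L2 distance: h(x) = floor((a·x + b) / w).
import numpy as np

class L2Hash:
    def __init__(self, dim: int, w: float = 4.0, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.a = rng.normal(size=dim)   # random Gaussian projection direction
        self.b = rng.uniform(0.0, w)    # random offset in [0, w)
        self.w = w                      # bucket width

    def __call__(self, x: np.ndarray) -> int:
        return int(np.floor((self.a @ x + self.b) / self.w))

# Nearby feature vectors tend to land in the same bucket, so candidate
# r-near neighbours of a query are found by probing only its own bucket.
```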
8
LSH clustering on word images [TODO]
9
Character Majority Voting (CMV)
[Diagram: word cluster → OCR output per word → components → character-level majority vote → final output]
Algorithm [TODO]
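A minimal sketch of the character majority voting step, assuming the OCR label sequences in a cluster have already been aligned to equal length (the deck uses dynamic-programming alignment for that, shown on the next slide); the function name and example words are illustrative:

```python
# Character majority voting over a cluster of aligned OCR outputs.
from collections import Counter

def character_majority_vote(aligned_outputs: list[str]) -> str:
    # All sequences are assumed to be aligned and therefore of equal length.
    assert aligned_outputs and len({len(s) for s in aligned_outputs}) == 1
    voted = []
    for position in zip(*aligned_outputs):            # one column per character slot
        voted.append(Counter(position).most_common(1)[0][0])
    return "".join(voted)

# Example: three noisy readings of the same word.
print(character_majority_vote(["world", "worid", "worlb"]))  # -> "world"
```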
10
Alignment with Dynamic Programming [1, 2]
[Diagram: word images 1-4 with their OCR outputs, aligned by dynamic programming; voting for word 1 after aligning; DTW output for word 1 vs. CMV output for word 1]
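A minimal sketch of aligning two OCR label sequences with dynamic programming in edit-distance style, so that votes can be cast per aligned position; the cost function and the `align` helper are illustrative assumptions, not the deck's exact DTW formulation:

```python
# Align two label sequences with a standard edit-distance DP and backtrack.
def align(a: str, b: str, gap: str = "-") -> tuple[str, str]:
    n, m = len(a), len(b)
    # dp[i][j] = minimum cost to align a[:i] with b[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i
    for j in range(1, m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = dp[i - 1][j - 1] + (a[i - 1] != b[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    # Backtrack to recover the two aligned strings (gaps mark insertions/deletions).
    out_a, out_b, i, j = [], [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + (a[i - 1] != b[j - 1]):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            out_a.append(a[i - 1]); out_b.append(gap); i -= 1
        else:
            out_a.append(gap); out_b.append(b[j - 1]); j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))

print(align("recognition", "recogmtion"))  # -> ('recognition', 'recog-mtion')
```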
11
Results
The word generation process makes correct annotations available for evaluating the performance.

                 Component Accuracy            Word Accuracy
Dataset        OCR      CMV      DTW        OCR      CMV      DTW
SF1            98.3      -        -         95.5      -        -
SF2            94.82    98.04    98.19      85.24    94.97    95.28
SF3            85.83    95.78    97.9       67.51    88.31    94.15
SF4            79.38    87.82    92.19      51.9     78.81    85.2

Dataset: 5000 clusters, 20 variations per word, degraded dataset (more details in the additional slides).
12
Results
Word accuracy vs. number of words:
  – Adding more words makes the dataset more ambiguous.
  – Algorithm performance increases with the number of words, then saturates.
Word accuracy vs. word length:
  – Word accuracy decreases as word length increases.
  – Using the cluster information helps retain good word accuracy for longer words.
13
Analysis
[Table of example word images with their OCR, CMV, and DTW outputs]
14
Results

Book  Size    No. of    Length  Avg. Word    No. of   Symbol Accuracy            Word Accuracy
              Clusters  Range   Length (WL)  Words    OCR      CMV      DTW      OCR      CMV      DTW
B1    Short   676       2-3     2.45         3778     90.64    91.61    91.66    80.56    82.39    82.45
B1    Medium  994       4-5     4.43         5161     90.78    92.35    92.42    73.34    79.14    80.53
B1    Long    690       6-16    7.31         4587     89.98    92.15    92.31    58.64    72.34    74.82
B1    ALL
B2    ALL
B3    ALL
B4    ALL

- For a small increase in component accuracy, there is a large improvement in word accuracy.
- The improvement is high for long words.
- Relative improvement of 12% for words which occur at least twice.
15
Analysis
- Cuts and merges
- CMV vs. DTW
- Wrong word in the cluster
- Cases that can't be handled
[Four example tables: word image with its OCR, CMV, and DTW outputs for each case]
16
Conclusion & Future Work
- A new framework has been proposed for OCRing an entire book.
- A word recognition technique that uses document-level constraints is shown.
- An efficient clustering algorithm is used to speed up the process.
- Word-level accuracy is improved from 70.37% to 79.12%.
- The technique can also be used for other languages.
- Future work: handle unique words by creating clusters over parts of words.
17
END
18
Additional slides
19
LSH Algorithm

Algorithm: Word Image Clustering
Require: Word images Wj and features Fj, j = 1,...,n
Ensure: Word image clusters O
for each i = 1,...,l do
    for each j = 1,...,n do
        Compute hash bucket I = gi(Fj)
        Store word image Wj in bucket I of hash table Ti
    end for
end for
k = 1
for each i = 1,...,n and Wi unmarked do
    Query hash tables for word Wi to get cluster Ok
    Mark word Wi with k
    k = k + 1
end for
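A minimal Python sketch of the clustering loop above, assuming each word image is summarized by a fixed-length feature vector; the random-projection hash family and the parameters l, k_bits, and w are illustrative choices, not values from the slides:

```python
# LSH-based word-image clustering: build l hash tables, then query each
# unmarked word's buckets to form its cluster.
import numpy as np
from collections import defaultdict

def lsh_cluster(features: np.ndarray, l: int = 10, k_bits: int = 8,
                w: float = 4.0, seed: int = 0) -> list[list[int]]:
    rng = np.random.default_rng(seed)
    n, dim = features.shape
    tables = []
    for _ in range(l):                                 # build l hash tables T_i
        A = rng.normal(size=(k_bits, dim))             # k random projections per table
        b = rng.uniform(0.0, w, size=k_bits)
        keys = np.floor((features @ A.T + b) / w).astype(int)   # g_i(F_j) for all j
        table = defaultdict(list)
        for j in range(n):
            table[tuple(keys[j])].append(j)            # store W_j in bucket g_i(F_j)
        tables.append((A, b, table))

    clusters, mark = [], [-1] * n
    for i in range(n):                                 # query each unmarked word
        if mark[i] != -1:
            continue
        candidates = set()
        for A, b, table in tables:
            key = tuple(np.floor((A @ features[i] + b) / w).astype(int))
            candidates.update(table.get(key, []))
        cluster = sorted(j for j in candidates if mark[j] == -1)
        for j in cluster:
            mark[j] = len(clusters)
        clusters.append(cluster)
    return clusters

# Example: cluster 100 random 64-dimensional feature vectors.
print(len(lsh_cluster(np.random.default_rng(1).normal(size=(100, 64)))))
```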
20
Word Error Correction

Algorithm: Word Error Correction
Require: Cluster C of words Wi, i = 1,...,n
Ensure: Cluster O of corrected words
for each i = 1,...,n do
    for each j = 1,...,n do
        if j != i then
            Align words Wi and Wj
            Record errors Ek, k = 1,...,m in Wi
            Record possible corrections Gk for Ek
        end if
    end for
    Correct each Ek with the correction Gk of maximum probability pk
    O <- O ∪ Wi
end for
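A minimal Python sketch of this correction step, using difflib from the standard library as a stand-in for the deck's dynamic-programming aligner; the majority-vote threshold and function names are illustrative assumptions:

```python
# Word error correction within a cluster: align each word with the others,
# record where they disagree, and apply the most frequent correction when a
# majority of the cluster agrees on it.
from collections import Counter
from difflib import SequenceMatcher

def correct_cluster(words: list[str]) -> list[str]:
    corrected = []
    for i, wi in enumerate(words):
        # candidate corrections per error span of wi, collected from all other words
        votes: dict[tuple[int, int], Counter] = {}
        for j, wj in enumerate(words):
            if j == i:
                continue
            for tag, i1, i2, j1, j2 in SequenceMatcher(None, wi, wj).get_opcodes():
                if tag != "equal":                     # an error E_k in wi
                    votes.setdefault((i1, i2), Counter())[wj[j1:j2]] += 1
        # apply corrections right-to-left so earlier spans keep their indices
        out = wi
        for (i1, i2), counter in sorted(votes.items(), reverse=True):
            correction, count = counter.most_common(1)[0]
            if count > (len(words) - 1) / 2:           # majority of the cluster agrees
                out = out[:i1] + correction + out[i2:]
        corrected.append(out)
    return corrected

print(correct_cluster(["recognition", "recogmtion", "recognition", "recoqnition"]))
```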
21
Dataset
– 5000 clusters, each with 20 images of the same word at different font sizes and resolutions.
– Word images were generated using ImageMagick.
– Words were degraded with the Kanungo degradation model to approximate real data.
– The SF1, SF2, SF3, and SF4 datasets were degraded with 0, 10, 20, and 30% noise respectively.
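A small sketch of how such word images could be rendered by shelling out to ImageMagick's convert, as the slide mentions; the point sizes, density, and file names are assumptions for illustration, and the Kanungo degradation step is not shown:

```python
# Render a word as an image via ImageMagick's `convert` command.
import subprocess

def render_word(text: str, out_path: str, pointsize: int = 36, density: int = 300) -> None:
    subprocess.run(
        ["convert",
         "-density", str(density),       # vary to simulate different resolutions
         "-pointsize", str(pointsize),   # vary to simulate different font sizes
         f"label:{text}",                # render the word as an image
         out_path],
        check=True)

# Example: one word at three font sizes.
for size in (24, 36, 48):
    render_word("example", f"word_{size}.png", pointsize=size)
```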
22
Hashing
[Pipeline: pre-processing → segmentation and word detection → word image → feature extraction → hashing → hashed words → cluster of words; in parallel, feature extraction → OCR → OCR output (text); cluster and OCR outputs are combined by fusion (Method 1 / Method 2)]