Content Level Access to Digital Library of India Pages
Praveen Krishnan, Ravi Shekhar, C.V. Jawahar
CVIT, IIIT Hyderabad
10/29/11
Digital Library of India (DLI), http://www.dli.iiit.ac.in/. Vision: to enhance access to information and knowledge for the masses. Partner in the Million Book Universal Digital Library Programme. Provides information for people and a dataset for researchers. Vamshi Ambati, N. Balakrishnan, Raj Reddy, Lakshmi Pratha, C. V. Jawahar. The Digital Library of India Project: Process, Policies and Architecture. In ICDL, 2007.
Digital Library of India (DLI): content and statistics. Languages: 41 different languages, including Hindi, Telugu, Marathi, ... and English, French, Greek, ... Statistics: 4 lakh (400,000) books, 134 million pages, 26 billion words. Source: http://www.new1.dli.ernet.in/
Digital Library of India (DLI): metadata search. DLI currently supports metadata-based search only; there is no content-level access. Example query: "Indian freedom struggle and independence".
Digital Library of India (DLI): need for content-level access. Search should cover content as well as metadata. Open question: is a reliable text representation available?
Goal: Digital Library of India Search. Build a search engine with support for Indian languages. Word spotting.
Goal: Indian Language Document Search Engine
- Text query support (search button: खोज, "Search").
- Multi-keyword support, e.g. the query शिवाजी और मराठा साम्राज्य ("Shivaji and the Maratha Empire").
- Ranking based on the number of occurrences.
- Semantically related words.
- Seamless scaling to billions of word images, with sub-second retrieval.
Text from OCR: sample pages
- Hindi page: Title - Praachin Bhaartiy Vichaar Aur Vibhutiyaan, published in 1624
- Telugu page: Title - Andhra Vagmayaramba Dasha, published in 1960
Degradations highlighted on these pages: cuts, merges, and variations in script, font and typesetting.
Text from OCR [Chart: Char %, Hindi and Telugu] [1] D. Arya, T. Patnaik, S. Chaudhury, C. V. Jawahar, B. B. Chaudhury, A. G. Ramakrishnan, G. S. Lehal, and C. Bhagavati, “Experiences of Integration and Performance Testing of Multilingual OCR for Printed Indian Scripts,” in ICDAR MOCR Workshop, 2011.
Text from OCR [Chart: Word %, Hindi and Telugu]
Text from OCR [Chart: Search %, Hindi and Telugu]
BoVW for Image Retrieval: bag of visual words (BoVW) brings text retrieval to image recognition; a query image returns a ranked list of retrieved results. The representation has a fixed length and is invariant to common deformations. Josef Sivic, Andrew Zisserman: Video Google: A Text Retrieval Approach to Object Matching in Videos. ICCV 2003
BoVW for Document Image Retrieval. R. Shekhar and C. V. Jawahar. Word Image Retrieval Using Bag of Visual Words. In DAS, 2012.
BoVW for Document Image Retrieval: Histogram of Visual Words [Figure: word image and its histogram of visual words]
BoVW for Document Image Retrieval: Cuts [Figure: word image with cuts and its histogram of visual words]
BoVW for Document Image Retrieval: Merges [Figure: word image with merges and its histogram of visual words]
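The histogram of visual words can be sketched as follows. This is a minimal illustration, not the authors' implementation: the toy 2-D descriptors stand in for SIFT descriptors, the vocabulary is assumed to have been learned beforehand (for example by clustering), and the names `quantize` and `bovw_histogram` as well as the L1 normalization are assumptions.

```python
from math import dist  # Euclidean distance, Python 3.8+


def quantize(descriptor, vocabulary):
    """Return the index of the visual word (centroid) nearest to a descriptor."""
    return min(range(len(vocabulary)), key=lambda i: dist(descriptor, vocabulary[i]))


def bovw_histogram(descriptors, vocabulary):
    """Count how often each visual word occurs, then L1-normalize the counts."""
    hist = [0.0] * len(vocabulary)
    for d in descriptors:
        hist[quantize(d, vocabulary)] += 1.0
    total = sum(hist)
    return [h / total for h in hist] if total else hist


# Hypothetical toy data: a two-word vocabulary and four local descriptors.
vocabulary = [(0.0, 0.0), (10.0, 10.0)]
descriptors = [(1.0, 0.0), (0.0, 1.0), (9.0, 9.0), (1.0, 1.0)]
print(bovw_histogram(descriptors, vocabulary))
```

Because the histogram only counts local features, a word image with a few cut or merged characters still shares most of its bins with the clean version.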
BoVW for Document Image Retrieval: robust against degradation, but geometry is lost. Use spatial verification: SIFT based; longest subsequence alignment. [Figure: visual words V1, V2, V6, V4, V8, V9 plotted on x and y axes for merged, clean and cut word images] R. Shekhar and C. V. Jawahar. Word Image Retrieval Using Bag of Visual Words. In DAS, 2012. I. Z. Yalniz and R. Manmatha. An Efficient Framework for Searching Text in Noisy Document Images. In DAS, 2012.
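One way to picture subsequence-based spatial verification is the sketch below. It assumes the alignment is a longest common subsequence (LCS) over visual word ids read left to right; the slide does not spell out the exact procedure, so the scoring and the names `ordered_words`, `lcs_length` and `spatial_score` are hypothetical.

```python
def ordered_words(keypoints):
    """keypoints: list of (x, visual_word_id). Return ids sorted left to right."""
    return [word for _, word in sorted(keypoints)]


def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences."""
    prev = [0] * (len(b) + 1)
    for item_a in a:
        cur = [0]
        for j, item_b in enumerate(b, 1):
            if item_a == item_b:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def spatial_score(query_kps, candidate_kps):
    """Fraction of visual words that appear in the same left-to-right order."""
    q, c = ordered_words(query_kps), ordered_words(candidate_kps)
    if not q or not c:
        return 0.0
    return lcs_length(q, c) / max(len(q), len(c))


# Illustrative data: word ids 1, 2, 6, 4, 8, 9 at x positions 0.5 to 3.
query = [(0.5, 1), (1.0, 2), (1.5, 6), (2.0, 4), (2.5, 8), (3.0, 9)]
cut = [(0.4, 1), (1.1, 2), (2.1, 4), (2.4, 8), (3.1, 9)]  # one word lost to a cut
print(spatial_score(query, cut))
```

A candidate with the same bag of words in a scrambled order gets a low score, which is the geometric information the plain histogram discards.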
Query Expansion: the query image is converted to a query histogram, which is used to query the database. The top-ranked results (Rank 1 to Rank 6) are used to build a refined histogram, and querying again with the refined histogram gives better results.
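A minimal sketch of this loop is given below. It assumes cosine similarity for ranking and a plain average of the query histogram with the top-ranked histograms as the refined query; neither choice is stated on the slide, and `search` and `expand_query` are hypothetical names.

```python
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two histograms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(query_hist, database, top_k):
    """Return the ids of the top_k database histograms most similar to the query."""
    ranked = sorted(database, key=lambda k: cosine(query_hist, database[k]), reverse=True)
    return ranked[:top_k]


def expand_query(query_hist, database, top_k=3):
    """Average the query with its top_k results to form a refined histogram."""
    top = search(query_hist, database, top_k)
    pool = [query_hist] + [database[k] for k in top]
    return [sum(column) / len(pool) for column in zip(*pool)]


# Hypothetical database of word-image histograms.
database = {
    "w1": [1.0, 0.0, 0.0],
    "w2": [0.9, 0.1, 0.0],
    "w3": [0.0, 1.0, 0.0],
    "w4": [0.5, 0.5, 0.0],
}
refined = expand_query([1.0, 0.0, 0.0], database, top_k=2)
print(search(refined, database, top_k=3))
```

The refined histogram picks up visual words that the original query lacked but its best matches contain, which is what lets the second pass reach degraded instances.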
Text Query Support: BoVW retrieval was originally formulated in a “query by example” setting, where the input is a query image that is converted to a histogram. Text queries are needed instead: the input is a text query, which is converted to a text query histogram.
Observations
- Are the results of OCR and BoVW complementary? [Figure: results retrieved by BoVW and by OCR]
- [Chart: mAP vs. word length (number of characters)]
- “OCR system has a high precision while BoVW approach has a high recall.” Example with #GT = 5: OCR output list, precision = 1, recall = 0.4; BoVW output list, precision = 0.8, recall = 1.
Fusion Techniques: Naïve Fusion. Concatenate the OCR results with the BoVW results. [Chart: mAP for OCR and BoVW]
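As a rough sketch, naïve fusion could look like the following. It assumes the OCR list comes first (OCR being the higher-precision source in the observations above) and that word images already returned by OCR are skipped when they reappear in the BoVW list; both details are assumptions, and `naive_fusion` is a hypothetical name.

```python
def naive_fusion(ocr_results, bovw_results):
    """Concatenate two ranked lists, keeping the first occurrence of each item."""
    seen = set()
    fused = []
    for item in list(ocr_results) + list(bovw_results):
        if item not in seen:
            seen.add(item)
            fused.append(item)
    return fused


# Hypothetical word-image ids: OCR finds few items, BoVW finds more.
print(naive_fusion(["a", "b"], ["c", "a", "d", "b", "e"]))
```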
Fusion Techniques: Edit Distance Based Fusion. Reorder the BoVW results using the BoVW score and a modified edit distance cost. [Chart: mAP for OCR and BoVW]
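The reordering idea can be illustrated with the sketch below. The slide does not define the modified edit distance cost or how it is combined with the BoVW score, so this example substitutes the standard Levenshtein distance and an assumed linear combination with a hypothetical weight `alpha`; `rerank` is likewise an illustrative name.

```python
def edit_distance(a, b):
    """Standard Levenshtein distance (insert, delete, substitute all cost 1)."""
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        cur = [i]
        for j, char_b in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                       # deletion
                           cur[j - 1] + 1,                    # insertion
                           prev[j - 1] + (char_a != char_b))) # substitution
        prev = cur
    return prev[-1]


def rerank(query_text, bovw_results, alpha=0.5):
    """bovw_results: list of (word_id, bovw_score, ocr_text).

    Penalize each BoVW score by the length-normalized edit distance between
    the query text and the OCR output of that word image (assumed formula).
    """
    def combined(item):
        _, score, ocr_text = item
        longest = max(len(query_text), len(ocr_text), 1)
        return score - alpha * edit_distance(query_text, ocr_text) / longest

    return [word_id for word_id, _, _ in sorted(bovw_results, key=combined, reverse=True)]


# Hypothetical results: "a" has the best visual score but a worse OCR match.
results = [("a", 0.90, "indigo"), ("b", 0.85, "india"), ("c", 0.40, "indra")]
print(rerank("india", results))
```

The OCR text acts as a second opinion: a visually similar but textually different word is pushed down, while an exact OCR match keeps its BoVW score.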
Fusion Techniques: Hybrid Fusion. Re-query BoVW using the OCR-retrieved results, then combine the results using rank aggregation techniques. [Chart: mAP for OCR and BoVW]
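The rank aggregation step might be sketched as below. The slide does not name a specific aggregation technique, so a Borda count is used here purely as an example; each input list is assumed to be the BoVW ranking obtained by re-querying with one OCR-retrieved word image, and `borda_aggregate` is a hypothetical name.

```python
def borda_aggregate(rankings):
    """Merge several ranked lists with a Borda count.

    An item at position p (0-based) in a list of length n earns n - p points;
    items are returned by decreasing total, ties broken by item id.
    """
    scores = {}
    for ranking in rankings:
        n = len(ranking)
        for position, item in enumerate(ranking):
            scores[item] = scores.get(item, 0) + (n - position)
    return sorted(scores, key=lambda item: (-scores[item], item))


# Hypothetical BoVW rankings from three OCR-retrieved query images.
rankings = [["a", "b", "c"], ["b", "a", "d"], ["b", "c", "a"]]
print(borda_aggregate(rankings))
```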
Experimental Results
Experimental Details
- OCR: [1]
- Feature detector: Harris interest point detection [2]
- Feature descriptor: SIFT [2]
- Indexing: Lucene [3]
[1] D. Arya, T. Patnaik, S. Chaudhury, C. V. Jawahar, B. B. Chaudhury, A. G. Ramakrishnan, G. S. Lehal, and C. Bhagavati, “Experiences of Integration and Performance Testing of Multilingual OCR for Printed Indian Scripts,” in ICDAR MOCR Workshop, 2011.
[2] http://www.vlfeat.org
[3] http://lucene.apache.org/
Test Bed: DLI Corpus [Figure: sample word images]
- Hindi (HS1): 11 books, 1000 pages, 362,593 words, annotated
- Hindi (HS2): 52 books, 10,196 pages, 4,290,864 words, not annotated
- Telugu (TS1): 161,276
- Telugu (TS2): 69 books, 13,871 pages, 2,531,069 words
In addition, we used the fully annotated HP1 and TP1 datasets.
Evaluation Measures
- Precision = TP / (TP + FP)
- Recall = TP / (TP + FN)
- mAP (Mean Average Precision): mean of the area under the precision-recall curve over all queries.
- Precision @ 10: shows how accurate the top 10 retrieved results are.
TP = true positives, FP = false positives, FN = false negatives. [Figure: precision-recall curve]
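These measures can be computed as in the sketch below. Average precision is implemented in its common non-interpolated form (the mean of the precision values at each relevant hit), which approximates the area under the precision-recall curve; the function names and the toy ids are illustrative assumptions.

```python
def precision_recall(retrieved, relevant):
    """Precision = TP / (TP + FP), recall = TP / (TP + FN)."""
    tp = len(set(retrieved) & set(relevant))
    precision = tp / len(retrieved) if retrieved else 0.0
    recall = tp / len(relevant) if relevant else 0.0
    return precision, recall


def precision_at_k(ranked, relevant, k=10):
    """Fraction of the top k results that are relevant (divides by k)."""
    return sum(1 for item in ranked[:k] if item in relevant) / k


def average_precision(ranked, relevant):
    """Mean of precision values measured at the rank of each relevant hit."""
    hits, total = 0, 0.0
    for rank, item in enumerate(ranked, 1):
        if item in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(runs):
    """runs: list of (ranked_list, relevant_set), one pair per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


# Five ground-truth instances; a list with two results, both correct.
ground_truth = {"g1", "g2", "g3", "g4", "g5"}
print(precision_recall(["g1", "g2"], ground_truth))
```

A short, fully correct result list gives precision 1 and recall 0.4 for five ground-truth instances, the same pattern as the OCR example in the observations.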
Comparison of naïve BoVW with BoVW + Query Expansion
Language      #Query  BoVW Search       BoVW + Query Expansion
                      mAP     Prec@10   mAP     Prec@10
Hindi (HP1)   100     62.54   81.30     66.09   83.86
Telugu (TP1)          71.13   78        73.08   79.89
Comparison of naïve BoVW with BoVW + Text Query Support
Language      #Query  BoVW Search       BoVW using Text Queries
                      mAP     Prec@10   mAP     Prec@10
Hindi (HP1)   100     62.54   81.30     56.32   73.89
Telugu (TP1)          71.13   78        69.06   78.83
Comparative performance of different fusion techniques on HP1 & TP1
Language      #Query  Naïve             Edit Distance     Hybrid
                      mAP     Prec@10   mAP     Prec@10   mAP     Prec@10
Hindi (HP1)   100     75.66   90.7      79.58   90.8      80.37   91.4
Telugu (TP1)          76.02   81.2      78.01   81.4      80.23   83.7
Performance statistics on DLI Annotated Corpus
Language      #Query  OCR               BoVW              Fusion
                      mAP     Prec@10   mAP     Prec@10   mAP     Prec@10
Hindi (HS1)   100     14.95   62.60     60.55   95.5      68.81   95.6
Telugu (TS1)          27.03   62.10     74.38   90.6      78.41   91.9
Performance statistics on DLI Un-Annotated Corpus
Language      #Query  Precision @ N  OCR     BoVW    Fusion
Hindi (HS2)   50      Prec@10        82.03   96.94   97.11
                      Prec@20        75.16   94.83   95.42
                      Prec@30        71.12   92.82   93.16
Telugu (TS2)  90.85 99.14 85.42 98.00 98.85 80.76 96.38 96.57
Retrieved Results
Failure Cases: the word images shown in the figure fail in both OCR and BoVW. Reasons: (a) the word image is short and contains a character that is no longer in use; (b) the word image is highly degraded.
Implementation Details. Search engine development: a web-based search and retrieval interface. [Chart: Lucene scalability; axes: time in milliseconds, number of images, number of visual words] [Figure: sample retrieved page]
Search Architecture (Ongoing) [Diagram components: Search Query, Web Service, Delegator, Index, OCR, BoVW, Partial Scores, Fusion, Query Expansion, Ranking, Ranked Results]
Ongoing Work
- Learning to improve from annotated datasets: use of a visual confusion matrix to improve BoVW results.
- Necessity of costly features for re-ranking: the images shown in the failure cases would require costly features to be retrieved.
- Use of machine learning algorithms.
- Exploration of features better than SIFT.
Thank You