Published by Brent Stewart. Modified over 9 years ago.
1
Content Level Access to Digital Library of India Pages
Praveen Krishnan, Ravi Shekhar, C. V. Jawahar
CVIT, IIIT Hyderabad
10/29/11
2
Digital Library of India (DLI)
Vision: to enhance access to information and knowledge for the masses.
Partner to the Million Book Universal Digital Library Programme.
- Information for people
- Dataset for researchers
Vamshi Ambati, N. Balakrishnan, Raj Reddy, Lakshmi Pratha, C. V. Jawahar: The Digital Library of India Project: Process, Policies and Architecture. ICDL, 2007.
3
Digital Library of India (DLI)
Vision: to enhance access to information and knowledge for the masses.
Content languages: 41 different languages, including
- Hindi, Telugu, Marathi, ...
- English, French, Greek, ...
Statistics:
- #Books: 4 lakh
- #Pages: 134 million
- #Words: 26 billion
Source:
4
Digital Library of India (DLI)
Metadata search
- Supports metadata-based search.
- No content-level access.
Example query: "Indian freedom struggle and independence" [Search]
5
Digital Library of India (DLI)
Need: content-level access (content + metadata).
Example query: "Indian freedom struggle and independence" [Search]
6
Digital Library of India (DLI)
Need: content-level access (content + metadata).
Reliable text representation?
Example query: "Indian freedom struggle and independence" [Search]
7
Digital Library of India Search
Goal: build a search engine with support for Indian languages.
Word spotting
8
Indian Language Document Search Engine
Goal: text query support.
[Search box with button खोज ("Search"); results: Page 1]
9
Indian Language Document Search Engine
Goal: multi-keyword support.
Query: शिवाजी और मराठा साम्राज्य ("Shivaji and the Maratha Empire") [खोज / Search]; results: Page 1
10
Indian Language Document Search Engine
Goal: rank results by number of occurrences.
Query: शिवाजी और मराठा साम्राज्य ("Shivaji and the Maratha Empire") [खोज / Search]; results: Page 1
11
Indian Language Document Search Engine
Goal: semantically related words.
Query: शिवाजी और मराठा साम्राज्य ("Shivaji and the Maratha Empire") [खोज / Search]; results: Page 1
12
Seamless scaling to billions of word images.
Goal: seamless scaling to billions of word images, with sub-second retrieval.
Query: शिवाजी और मराठा साम्राज्य ("Shivaji and the Maratha Empire") [खोज / Search]; results: Page 1
13
Text from OCR Hindi Page Telugu Page
- Hindi: Title - Praachin Bhaartiy Vichaar Aur Vibhutiyaan, Published in 1624 - Telugu: Title - Andhra Vagmayaramba Dasha, Published in 1960
14
Text from OCR
[Figure: Hindi page and Telugu page; annotated degradations: cuts, cuts]
15
Text from OCR
[Figure: Hindi page and Telugu page; annotated degradations: merges, cuts]
16
Variations in Script, Font and Typesetting.
Text from OCR
Variations in script, font and typesetting.
[Figure: Hindi page and Telugu page; annotated degradation: cuts]
17
Text from OCR
[Chart: Char % for Hindi and Telugu]
[1] D. Arya, T. Patnaik, S. Chaudhury, C. V. Jawahar, B. B. Chaudhury, A. G. Ramakrishnan, G. S. Lehal, and C. Bhagavati, “Experiences of Integration and Performance Testing of Multilingual OCR for Printed Indian Scripts,” in ICDAR MOCR Workshop, 2011.
18
Text from OCR
[Chart: Word % for Hindi and Telugu]
[1] D. Arya, T. Patnaik, S. Chaudhury, C. V. Jawahar, B. B. Chaudhury, A. G. Ramakrishnan, G. S. Lehal, and C. Bhagavati, “Experiences of Integration and Performance Testing of Multilingual OCR for Printed Indian Scripts,” in ICDAR MOCR Workshop, 2011.
19
Text from OCR
[Chart: Search % for Hindi and Telugu]
20
BoVW for Image Retrieval
- Text retrieval
- Image recognition
[Figure: query image and ranked retrieved results]
Josef Sivic, Andrew Zisserman: Video Google: A Text Retrieval Approach to Object Matching in Videos. ICCV 2003.
21
BoVW for Image Retrieval
- Fixed-length representation
- Invariant to common deformations
[Figure: query image and ranked retrieved results]
Josef Sivic, Andrew Zisserman: Video Google: A Text Retrieval Approach to Object Matching in Videos. ICCV 2003.
22
BoVW for Document Image Retrieval
R. Shekhar and C. V. Jawahar. Word Image Retrieval Using Bag of Visual Words. In DAS, 2012.
23
BoVW for Document Image Retrieval
Histogram of Visual Words R. Shekhar and C. V. Jawahar. Word Image Retrieval Using Bag of Visual Words. In DAS, 2012.
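A minimal sketch of how a histogram of visual words could be built for one word image. It assumes a vocabulary of cluster centres is already available and uses toy 2-D descriptors in place of 128-D SIFT descriptors; the function names are illustrative and not taken from the cited paper.

```python
import math
from collections import Counter

def nearest_word(descriptor, vocabulary):
    """Index of the visual word (cluster centre) closest to a descriptor."""
    return min(range(len(vocabulary)),
               key=lambda i: math.dist(descriptor, vocabulary[i]))

def bovw_histogram(descriptors, vocabulary):
    """L1-normalised histogram of visual-word counts for one word image."""
    counts = Counter(nearest_word(d, vocabulary) for d in descriptors)
    total = sum(counts.values())
    return [counts[i] / total if total else 0.0 for i in range(len(vocabulary))]

# Toy data (assumption): three visual words and four local descriptors.
vocabulary = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
descriptors = [(0.1, 0.0), (0.9, 1.2), (1.1, 0.8), (4.8, 5.1)]
hist = bovw_histogram(descriptors, vocabulary)
print(hist)  # [0.25, 0.5, 0.25]
```

Every word image maps to a vector of the same length (the vocabulary size), which is what makes the representation indexable.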
24
BoVW for Document Image Retrieval
Cuts R. Shekhar and C. V. Jawahar. Word Image Retrieval Using Bag of Visual Words. In DAS, 2012.
25
BoVW for Document Image Retrieval
Cuts Histogram of Visual Words R. Shekhar and C. V. Jawahar. Word Image Retrieval Using Bag of Visual Words. In DAS, 2012.
26
BoVW for Document Image Retrieval
Merges R. Shekhar and C. V. Jawahar. Word Image Retrieval Using Bag of Visual Words. In DAS, 2012.
27
BoVW for Document Image Retrieval
Merges Histogram of Visual Words R. Shekhar and C. V. Jawahar. Word Image Retrieval Using Bag of Visual Words. In DAS, 2012.
28
BoVW for Document Image Retrieval
- Robust against degradation.
- Lost geometry: use spatial verification.
  - SIFT based.
  - Longest subsequence alignment.
[Figure: visual words V1, V2, V6, V4, V8, V9 plotted on x/y axes; panels: Merge, Clean, Cuts]
R. Shekhar and C. V. Jawahar. Word Image Retrieval Using Bag of Visual Words. In DAS, 2012.
I. Z. Yalniz and R. Manmatha. An Efficient Framework for Searching Text in Noisy Document Images. In DAS, 2012.
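One way to read "longest subsequence alignment" is as a longest common subsequence (LCS) between the visual words of two word images, each ordered left to right. The sketch below follows that reading; the keypoint format `(x, word_id)` and the normalisation by query length are assumptions, not details from the slides.

```python
def word_sequence(keypoints):
    """Order visual-word ids left to right by the x coordinate of their keypoints."""
    return [word for _, word in sorted(keypoints)]

def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, 1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def spatial_score(query_kps, cand_kps):
    """Fraction of the query's visual words matched in the same left-to-right order."""
    q, c = word_sequence(query_kps), word_sequence(cand_kps)
    return lcs_length(q, c) / len(q) if q else 0.0

# Toy keypoints (assumption): (x position, visual-word id).
clean = [(0.5, 1), (1.0, 2), (1.5, 6), (2.0, 4), (2.5, 8), (3.0, 9)]
cut = [(0.5, 1), (1.0, 2), (2.0, 4), (2.5, 8), (3.0, 9)]            # one word lost to a cut
shuffled = [(0.5, 9), (1.0, 8), (1.5, 4), (2.0, 6), (2.5, 2), (3.0, 1)]
print(spatial_score(clean, cut) > spatial_score(clean, shuffled))  # True
```

A degraded copy of the same word keeps most of its order and scores high, while an image that merely shares the same visual words in a different order scores low.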
29
Query Expansion
[Diagram labels: Query Image, Query Image Histogram, Querying Database, Rank 1 to Rank 6, Refined Histogram]
30
Query Expansion
[Diagram labels: Query Image, Query Histogram, Querying Database, Rank 1 to Rank 6, Better Results]
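A rough sketch of query expansion as the diagram suggests it: the query histogram is refined with the histograms of the top-ranked results and the database is queried again. Averaging, cosine similarity and the `top_k` parameter are assumptions chosen for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two histograms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def search(query_hist, database):
    """Database keys ranked by decreasing similarity to the query histogram."""
    return sorted(database, key=lambda k: cosine(query_hist, database[k]), reverse=True)

def expand_query(query_hist, database, top_k=2):
    """Refined histogram: mean of the query and its top_k retrieved histograms."""
    top = search(query_hist, database)[:top_k]
    hists = [query_hist] + [database[k] for k in top]
    return [sum(col) / len(hists) for col in zip(*hists)]

# Toy database (assumption): image id -> BoVW histogram.
database = {"a": [0.5, 0.5, 0.0], "b": [0.4, 0.4, 0.2], "c": [0.0, 0.1, 0.9]}
query = [1.0, 0.0, 0.0]               # e.g. a degraded query that lost some visual words
refined = expand_query(query, database)
print(search(refined, database))      # second pass, using the refined histogram
```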
31
Text Query Support
Originally formulated in a “query by example” setting.
[Diagram labels: Input Query Image, Histogram]
32
Text Query Support
Originally formulated in a “query by example” setting.
Need text queries.
[Diagram labels: Input Text Query, Text Query Histogram]
33
Observations
Are the results of OCR and BoVW complementary?
[Figure labels: BoVW, OCR]
34
Observations
mAP vs. word length
[Plot: mAP against number of characters]
35
Observations
“OCR system has a high precision while BoVW approach has a high recall.”
Example (#GT = 5):
- OCR output list: precision = 1, recall = 0.4
- BoVW output list: precision = 0.8, recall = 1
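The example can be checked with a few lines of code. The output lists below are hypothetical, since the slide does not give them: two correct OCR hits reproduce precision 1 and recall 0.4 exactly, while a six-item BoVW list containing all five ground-truth items gives recall 1 and precision 5/6, close to the 0.8 quoted.

```python
def precision_recall(retrieved, relevant):
    """Precision and recall of a retrieved list against the ground-truth set."""
    relevant = set(relevant)
    tp = sum(1 for item in retrieved if item in relevant)
    precision = tp / len(retrieved) if retrieved else 0.0
    recall = tp / len(relevant) if relevant else 0.0
    return precision, recall

ground_truth = ["w1", "w2", "w3", "w4", "w5"]        # #GT = 5
ocr_out = ["w1", "w2"]                               # hypothetical OCR output list
bovw_out = ["w1", "w2", "w3", "x1", "w4", "w5"]      # hypothetical BoVW output list

print(precision_recall(ocr_out, ground_truth))   # (1.0, 0.4)
print(precision_recall(bovw_out, ground_truth))  # precision 5/6, recall 1.0
```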
36
Fusion
Fusion techniques: naïve fusion
[mAP chart: OCR]
37
Fusion
Fusion techniques: naïve fusion
[mAP chart: BoVW]
38
Fusion
Fusion techniques: naïve fusion
- Concatenating OCR results with BoVW results.
[mAP chart: OCR, BoVW]
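A minimal sketch of naïve fusion as concatenation. Placing the OCR results first and dropping repeats are assumptions; the ordering is motivated by the earlier observation that OCR has higher precision.

```python
def naive_fusion(ocr_results, bovw_results):
    """Concatenate OCR results with BoVW results, keeping the first occurrence of each item."""
    seen = set()
    fused = []
    for item in list(ocr_results) + list(bovw_results):
        if item not in seen:
            seen.add(item)
            fused.append(item)
    return fused

# Hypothetical ranked lists of word-image ids.
ocr_results = ["img3", "img7"]
bovw_results = ["img7", "img1", "img3", "img9"]
print(naive_fusion(ocr_results, bovw_results))  # ['img3', 'img7', 'img1', 'img9']
```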
39
Fusion
Fusion techniques: edit distance based fusion
[mAP chart: OCR, BoVW]
40
Fusion
Fusion techniques: edit distance based fusion
- Reordering BoVW results using the BoVW score and a modified edit distance cost.
[mAP chart: BoVW]
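A hedged sketch of reordering BoVW results with an edit distance term. The slides do not say how the edit distance cost is modified or how it is combined with the BoVW score, so this uses plain Levenshtein distance and a hypothetical weight `alpha`; the tuple layout of `bovw_results` is also an assumption.

```python
def edit_distance(a, b):
    """Plain Levenshtein distance (insert, delete, substitute all cost 1)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def rerank(query_text, bovw_results, alpha=0.5):
    """Reorder (image_id, bovw_score, ocr_text) triples by a weighted combined score."""
    def combined(item):
        _, score, ocr_text = item
        longest = max(len(query_text), len(ocr_text), 1)
        text_sim = 1.0 - edit_distance(query_text, ocr_text) / longest
        return alpha * score + (1 - alpha) * text_sim
    return [item[0] for item in sorted(bovw_results, key=combined, reverse=True)]

# Hypothetical BoVW output: image id, BoVW score in [0, 1], OCR text of that word image.
results = [("img1", 0.90, "marble"), ("img2", 0.85, "maratba"), ("img3", 0.60, "maratha")]
print(rerank("maratha", results))  # ['img2', 'img3', 'img1']
```

Here a visually similar but textually different word (`img1`) drops below candidates whose OCR text is close to the query, even when that text contains a recognition error.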
43
Fusion
Fusion techniques: hybrid fusion
[mAP chart: OCR, BoVW]
44
Fusion
Fusion techniques: hybrid fusion
- Re-querying BoVW using OCR retrieved results.
- Using rank aggregation techniques.
[mAP chart: BoVW]
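The slides do not name the rank aggregation technique, so the sketch below substitutes a Borda count as one common choice. It assumes each OCR-retrieved word image has been used as a BoVW query, producing one ranked list per re-query.

```python
from collections import defaultdict

def borda_aggregate(rankings):
    """Merge several ranked lists with a Borda count; absent items score 0 for that list."""
    universe = {item for ranking in rankings for item in ranking}
    n = len(universe)
    scores = defaultdict(float)
    for ranking in rankings:
        for pos, item in enumerate(ranking):
            scores[item] += n - pos
    # Ties are broken alphabetically so the output is deterministic.
    return sorted(scores, key=lambda item: (-scores[item], item))

# Hypothetical BoVW rankings, one per OCR-retrieved image used as a query.
rankings = [["a", "b", "c"], ["b", "a", "d"], ["b", "c", "a"]]
print(borda_aggregate(rankings))  # ['b', 'a', 'c', 'd']
```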
47
Experimental Results
48
Experimental Details
- OCR: [1]
- Feature detector: Harris interest point detection [2]
- Feature descriptor: SIFT [2]
- Indexing: Lucene [3]
[1] D. Arya, T. Patnaik, S. Chaudhury, C. V. Jawahar, B. B. Chaudhury, A. G. Ramakrishnan, G. S. Lehal, and C. Bhagavati, “Experiences of Integration and Performance Testing of Multilingual OCR for Printed Indian Scripts,” in ICDAR MOCR Workshop, 2011.
[2]
[3]
49
Test Bed Sample Word Images DLI Corpus
DLI Corpus
Language      #Books  #Pages  #Words     Annotation
Hindi (HS1)   11      1000    362,593    Yes
Hindi (HS2)   52      10,196  4,290,864  No
Telugu (TS1)  -       -       161,276    -
Telugu (TS2)  69      13,871  2,531,069  -
In addition, we used the fully annotated HP1 and TP1 datasets.
50
Evaluation Measures
- Precision = TP / (TP + FP)
- Recall = TP / (TP + FN)
- mAP (Mean Average Precision): mean of the area under the precision-recall curve over all queries.
- Precision@10: shows how accurate the top 10 retrieved results are.
TP = true positive, FP = false positive, FN = false negative
[Figure: precision-recall curve]
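A small sketch of these measures. Average precision is computed here in its usual non-interpolated form, which approximates the area under the precision-recall curve; the query runs are toy data.

```python
def average_precision(ranked, relevant):
    """Mean of the precision values at each rank where a relevant item appears."""
    relevant = set(relevant)
    hits, total = 0, 0.0
    for rank, item in enumerate(ranked, 1):
        if item in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """runs: list of (ranked_list, relevant_items) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs) if runs else 0.0

def precision_at_k(ranked, relevant, k=10):
    """Fraction of the top k results that are relevant."""
    relevant = set(relevant)
    return sum(1 for item in ranked[:k] if item in relevant) / k

# Toy runs (assumption): two queries with their ranked results and ground truth.
runs = [(["a", "x", "b"], ["a", "b"]), (["x", "c"], ["c"])]
print(round(mean_average_precision(runs), 4))  # 0.6667
```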
51
Comparison of naïve BoVW with BoVW + Query Expansion
Language      #Query  BoVW Search      BoVW + Query Expansion
                      mAP    Prec@10   mAP    Prec@10
Hindi (HP1)   100     62.54  81.30     66.09  83.86
Telugu (TP1)  -       71.13  78        73.08  79.89
52
Comparison of naïve BoVW with BoVW + Text Query Support
Language      #Query  BoVW Search      BoVW using Text Queries
                      mAP    Prec@10   mAP    Prec@10
Hindi (HP1)   100     62.54  81.30     56.32  73.89
Telugu (TP1)  -       71.13  78        69.06  78.83
53
Comparative performance of different fusion
Language      #Query  Naïve            Edit Distance    Hybrid
                      mAP    Prec@10   mAP    Prec@10   mAP    Prec@10
Hindi (HP1)   100     75.66  90.7      79.58  90.8      80.37  91.4
Telugu (TP1)  -       76.02  81.2      78.01  81.4      80.23  83.7
Comparative performance of different fusion techniques on HP1 & TP1
54
Performance statistics on DLI Annotated Corpus
Language      #Query  OCR              BoVW             Fusion
                      mAP    Prec@10   mAP    Prec@10   mAP    Prec@10
Hindi (HS1)   100     14.95  62.60     60.55  95.5      68.81  95.6
Telugu (TS1)  -       27.03  62.10     74.38  90.6      78.41  91.9
55
Performance statistics on DLI Un-Annotated Corpus
Language      #Query  N  OCR    BoVW   Fusion
Hindi (HS2)   50      -  82.03  96.94  97.11
                      -  75.16  94.83  95.42
                      -  71.12  92.82  93.16
Telugu (TS2)  -       -  90.85  99.14  -
                      -  85.42  98.00  98.85
                      -  80.76  96.38  96.57
56
Retrieved Results
57
Retrieved Results
58
Failure Cases
The word images shown in the figure fail in both OCR and BoVW.
Reasons:
(a) A short word image containing a character that is no longer in use.
(b) A highly degraded word image.
59
Implementation Details
Search Engine Development
- An elegant web-based search and retrieval interface.
- Lucene
Scalability
[Plots: time in milliseconds against number of images and against number of visual words]
[Figure: sample retrieved page]
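To illustrate why a text index suits BoVW retrieval, here is a toy inverted index over visual-word ids. It is a pure-Python stand-in for illustration only: it does not use or imitate the Lucene API, and ranking by the number of shared visual words is an assumption.

```python
from collections import Counter, defaultdict

def build_index(images):
    """images: dict of image_id -> iterable of visual-word ids. Returns word -> set of image ids."""
    index = defaultdict(set)
    for image_id, words in images.items():
        for word in words:
            index[word].add(image_id)
    return index

def query_index(index, query_words):
    """Rank images by how many distinct query visual words they contain."""
    votes = Counter()
    for word in set(query_words):
        for image_id in index.get(word, ()):
            votes[image_id] += 1
    return [image_id for image_id, _ in sorted(votes.items(), key=lambda kv: (-kv[1], kv[0]))]

# Toy collection (assumption): word images described by their visual-word ids.
images = {"img1": [1, 2, 6, 4], "img2": [1, 2, 9], "img3": [7, 8]}
index = build_index(images)
print(query_index(index, [1, 2, 6]))  # ['img1', 'img2']
```

Only images that share at least one visual word with the query are touched, which keeps lookup cost tied to posting-list sizes rather than to collection size.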
60
Search Architecture (Ongoing)
[Diagram components: Search Query, Web Service, Delegator, Index, OCR, BoVW, Partial Scores, Fusion, Query Expansion, Ranking, Ranked Results]
61
Ongoing Work
- Learn to improve from annotated datasets: use of a visual confusion matrix to improve BoVW results.
- Necessity of costly features for re-ranking: the images shown in the failure cases would require costly features to be retrieved.
- Use of machine learning algorithms.
- Exploration of features better than SIFT.
62
Thank You