Content Level Access to Digital Library of India Pages

Content Level Access to Digital Library of India Pages
Praveen Krishnan, Ravi Shekhar, C.V. Jawahar
CVIT, IIIT Hyderabad
10/29/11

Digital Library of India (DLI)
http://www.dli.iiit.ac.in/
Vision: to enhance access to information and knowledge for the masses.
Partner in the Million Book Universal Digital Library Programme.
Information for people; dataset for researchers.
Vamshi Ambati, N. Balakrishnan, Raj Reddy, Lakshmi Pratha, C. V. Jawahar: The Digital Library of India Project: Process, Policies and Architecture. ICDL, 2007.

Digital Library of India (DLI): Content Statistics
Languages: 41 different languages, including Hindi, Telugu, Marathi, English, French and Greek.
Books: 4 lakh (400,000)
Pages: 134 million
Words: 26 billion
Source: http://www.new1.dli.ernet.in/

Digital Library of India (DLI): Metadata Search
DLI supports metadata-based search only. There is no content-level access.
Example query: Indian freedom struggle and independence
What is needed: content-level access, so that a query searches content together with metadata.
Open question: is a reliable text representation available?

Goal: Digital Library of India Search
Build a search engine with support for Indian languages.
Word spotting.

Goal: Indian Language Document Search Engine
Example query: शिवाजी और मराठा साम्राज्य ("Shivaji and the Maratha Empire"). The खोज button means "Search".
Text query support.
Multi-keyword support.
Ranking based on number of occurrences.
Semantically related words.
Seamless scaling to billions of word images.
Sub-second retrieval.

Text from OCR
[Figures: a sample Hindi page and a sample Telugu page.]
Hindi: Title - Praachin Bhaartiy Vichaar Aur Vibhutiyaan, published in 1624.
Telugu: Title - Andhra Vagmayaramba Dasha, published in 1960.
Problems visible in the pages: cuts (broken characters), merges (touching characters), and variations in script, font and typesetting.

Text from OCR: Accuracy
[Charts: character accuracy (%), word accuracy (%) and search accuracy (%) of OCR text for Hindi and Telugu.]
[1] D. Arya, T. Patnaik, S. Chaudhury, C. V. Jawahar, B. B. Chaudhury, A. G. Ramakrishnan, G. S. Lehal, and C. Bhagavati, "Experiences of Integration and Performance Testing of Multilingual OCR for Printed Indian Scripts," in ICDAR MOCR Workshop, 2011.

BoVW for Image Retrieval
Bag of visual words (BoVW) applies ideas from text retrieval to image recognition: a query image is matched against the database and a ranked list of retrieved results is returned.
Fixed-length representation.
Invariant to common deformations.
Josef Sivic, Andrew Zisserman: Video Google: A Text Retrieval Approach to Object Matching in Videos. ICCV 2003.

BoVW for Document Image Retrieval
Each word image is represented by a histogram of visual words.
[Figures: word images with cuts and with merges, each shown with its histogram of visual words.]
R. Shekhar and C. V. Jawahar. Word Image Retrieval Using Bag of Visual Words. In DAS, 2012.
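The histogram step can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes local descriptors (for example SIFT) have already been extracted and a codebook of cluster centres has already been learned, and the names `bovw_histogram`, `descriptors` and `codebook` are hypothetical.

```python
import numpy as np

def bovw_histogram(descriptors, codebook):
    """Quantise local descriptors against a codebook and return an
    L1-normalised histogram of visual words.

    descriptors: array of shape (n_descriptors, dim)
    codebook:    array of shape (n_words, dim), assumed to be learned beforehand
    """
    descriptors = np.asarray(descriptors, dtype=float)
    codebook = np.asarray(codebook, dtype=float)
    # Squared Euclidean distance from every descriptor to every visual word.
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    # Each descriptor votes for its nearest visual word.
    words = d2.argmin(axis=1)
    hist = np.bincount(words, minlength=len(codebook)).astype(float)
    total = hist.sum()
    return hist / total if total > 0 else hist

# Toy example: a 3-word codebook in 2-D and four descriptors.
toy_codebook = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
toy_descriptors = np.array([[0.1, 0.2], [0.3, -0.1], [9.5, 0.2], [0.4, 9.8]])
toy_hist = bovw_histogram(toy_descriptors, toy_codebook)
```

Because the histogram only counts which visual words occur, a cut or merge that changes a few local regions changes only a few bins.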

BoVW for Document Image Retrieval: Spatial Verification
The BoVW representation is robust against degradation (clean, cut and merged word images).
Its weakness is lost geometry, so spatial verification is used.
Spatial verification is SIFT based and uses longest subsequence alignment of visual word sequences (for example V1 V2 V6 V4 V8 V9).
R. Shekhar and C. V. Jawahar. Word Image Retrieval Using Bag of Visual Words. In DAS, 2012.
I. Z. Yalniz and R. Manmatha. An Efficient Framework for Searching Text in Noisy Document Images. In DAS, 2012.
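A rough sketch of subsequence-based verification is given below. It assumes that each keypoint carries an x position and a visual word id, that the words are read left to right, and that the score is the longest common subsequence length divided by the shorter sequence length. The ordering and the normalisation are assumptions made for illustration, and all function names are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def ordered_words(keypoints):
    """keypoints: list of (x_position, visual_word_id); returns ids sorted by x."""
    return [word for _, word in sorted(keypoints)]

def spatial_score(query_keypoints, candidate_keypoints):
    """Fraction of the shorter word sequence that aligns in the same order."""
    q = ordered_words(query_keypoints)
    c = ordered_words(candidate_keypoints)
    if not q or not c:
        return 0.0
    return lcs_length(q, c) / min(len(q), len(c))

# Toy example using the word ids from the slide (V1 V2 V6 V4 V8 V9).
query = [(0.5, 1), (1.0, 2), (1.5, 6), (2.0, 4), (2.5, 8), (3.0, 9)]
same_order = [(0.4, 1), (1.1, 2), (2.1, 4), (2.4, 8), (2.9, 9)]
reversed_order = [(0.5, 9), (1.0, 8), (1.5, 4), (2.0, 2), (2.5, 1)]
```

A candidate that contains the same visual words in a different order gets a low score, which is the geometric check that a plain histogram comparison cannot provide.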

Query Expansion
The query image is converted to a query histogram, which is used to query the database and produce a ranked list (Rank 1 to Rank 6 in the example).
The ranked results are used to build a refined histogram.
Querying the database with the refined histogram gives better results.
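One common way to build a refined histogram is to average the query with its top-ranked results and query again. The sketch below uses that averaging scheme with cosine similarity. Both choices are assumptions, since the slides do not say how the refined histogram is computed, and `rank`, `expand_query` and `top_k` are hypothetical names.

```python
import numpy as np

def rank(query_hist, db_hists):
    """Return database indices sorted by decreasing cosine similarity."""
    q = query_hist / (np.linalg.norm(query_hist) or 1.0)
    norms = np.linalg.norm(db_hists, axis=1)
    norms[norms == 0] = 1.0  # avoid division by zero for empty histograms
    scores = (db_hists / norms[:, None]) @ q
    return np.argsort(-scores, kind="stable")

def expand_query(query_hist, db_hists, top_k=3):
    """Average the query with its top_k first-pass results, then re-rank."""
    first_pass = rank(query_hist, db_hists)[:top_k]
    refined = (query_hist + db_hists[first_pass].sum(axis=0)) / (1 + len(first_pass))
    return refined, rank(refined, db_hists)

# Toy database of four histograms over four visual words.
db = np.array([[4.0, 1.0, 0.0, 0.0],
               [3.0, 1.0, 1.0, 0.0],
               [0.0, 0.0, 3.0, 2.0],
               [2.0, 2.0, 1.0, 0.0]])
query = np.array([1.0, 0.0, 0.0, 0.0])
refined, order = expand_query(query, db, top_k=2)
```

The refined histogram picks up visual words that the original query lacked but its best matches share, which is how the second pass can reach results the first pass ranked low.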

Text Query Support
BoVW retrieval was originally formulated in a "query by example" setting: the input is a query image, which is converted to a histogram.
Text queries are needed: the input text query is converted to a text query histogram.

Observations
Are the results of OCR and BoVW complementary?
[Chart: mAP versus word length (number of characters) for BoVW and OCR.]
"OCR system has a high precision while BoVW approach has a high recall."
Example with #GT = 5:
OCR output list: precision = 1, recall = 0.4
BoVW output list: precision = 0.8, recall = 1

Fusion Techniques: Naïve Fusion
Concatenate the OCR results with the BoVW results.
[Chart: mAP for OCR and BoVW.]
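Concatenation can be sketched in a few lines. The version below places the OCR list first and drops repeated entries from the BoVW list. The ordering and the de-duplication are assumptions, and `naive_fusion` is a hypothetical name.

```python
def naive_fusion(ocr_results, bovw_results):
    """Concatenate two ranked lists of word-image ids, OCR first,
    keeping only the first occurrence of each id."""
    seen = set()
    fused = []
    for doc_id in list(ocr_results) + list(bovw_results):
        if doc_id not in seen:
            seen.add(doc_id)
            fused.append(doc_id)
    return fused

fused_example = naive_fusion(["w3", "w7"], ["w7", "w1", "w3", "w9"])
```

Putting the OCR list first reflects the observation that OCR tends to have high precision, while the appended BoVW list supplies recall.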

Fusion Techniques: Edit Distance Based Fusion
Reorder the BoVW results by combining the BoVW score with a modified edit distance cost.
[Chart: mAP for OCR and BoVW.]
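The slides do not define the modified edit distance cost or how it is combined with the BoVW score. As an illustration only, the sketch below uses the standard Levenshtein distance between the query text and the OCR text of each BoVW result, normalised by length and subtracted from the BoVW score with a weight `alpha`. The combination rule, `alpha`, and the function names are all assumptions.

```python
def edit_distance(a, b):
    """Standard Levenshtein distance (unit cost insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution or match
        prev = curr
    return prev[-1]

def rerank_by_edit_distance(query_text, bovw_results, alpha=0.5):
    """bovw_results: list of (doc_id, bovw_score, ocr_text) tuples.
    Returns doc ids sorted by BoVW score minus a weighted, length-normalised
    edit distance between the query text and the OCR text."""
    def combined(item):
        _, score, ocr_text = item
        denom = max(len(query_text), len(ocr_text), 1)
        return score - alpha * edit_distance(query_text, ocr_text) / denom
    ranked = sorted(bovw_results, key=combined, reverse=True)
    return [doc_id for doc_id, _, _ in ranked]

example_order = rerank_by_edit_distance(
    "bharat",
    [("a", 0.9, "bhavan"), ("b", 0.8, "bharat"), ("c", 0.7, "bharut")],
)
```

In the toy example, result "b" overtakes "a" because its OCR text matches the query exactly, even though its BoVW score is lower.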

Fusion Techniques: Hybrid Fusion
Re-query BoVW using the OCR-retrieved results.
Combine the resulting lists using rank aggregation techniques.
[Chart: mAP for OCR and BoVW.]
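The slides do not name a specific rank aggregation technique. The sketch below uses Borda count as a stand-in: each ranked list, for example one BoVW list per OCR-retrieved word image used as a query, awards points by position, and items are sorted by total points. `borda_aggregate` is a hypothetical name.

```python
from collections import defaultdict

def borda_aggregate(ranked_lists):
    """Aggregate several ranked lists with Borda count.
    An item at position p (0-based) in a list earns n - p points, where n is
    the number of distinct items overall; items missing from a list earn 0."""
    universe = {item for lst in ranked_lists for item in lst}
    n = len(universe)
    points = defaultdict(int)
    for lst in ranked_lists:
        for pos, item in enumerate(lst):
            points[item] += n - pos
    # Ties are broken by the string form of the id to keep the output stable.
    return sorted(universe, key=lambda item: (-points[item], str(item)))

aggregated = borda_aggregate([
    ["w1", "w2", "w3"],
    ["w2", "w1", "w4"],
    ["w2", "w3", "w1"],
])
```

Items that appear near the top of several lists rise in the aggregate, while an item that only one re-query returns stays low.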

Experimental Results

Experimental Details
OCR: [1]
Feature detector: Harris interest point detection [2]
Feature descriptor: SIFT [2]
Indexing: Lucene [3]
[1] D. Arya, T. Patnaik, S. Chaudhury, C. V. Jawahar, B. B. Chaudhury, A. G. Ramakrishnan, G. S. Lehal, and C. Bhagavati, "Experiences of Integration and Performance Testing of Multilingual OCR for Printed Indian Scripts," in ICDAR MOCR Workshop, 2011.
[2] http://www.vlfeat.org
[3] http://lucene.apache.org/

Test Bed
[Figure: sample word images.]
DLI Corpus
Language       #Books  #Pages  #Words     Annotation
Hindi (HS1)    11      1000    362,593    Yes
Hindi (HS2)    52      10,196  4,290,864  No
Telugu (TS1)                   161,276
Telugu (TS2)   69      13,871  2,531,069
In addition, we used the fully annotated HP1 and TP1 datasets.

Evaluation Measures
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
mAP (Mean Average Precision): the mean, over all queries, of the area under the precision-recall curve.
Precision @ 10: shows how accurate the top 10 retrieved results are.
TP = true positives, FP = false positives, FN = false negatives.
[Figure: precision-recall curve.]
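These measures can be computed as in the sketch below. Average precision is computed in its usual non-interpolated form, which approximates the area under the precision-recall curve. The function names are hypothetical, and each ranked list is assumed to contain no duplicate ids.

```python
def precision_recall(retrieved, relevant):
    """Precision and recall of a retrieved list against a ground-truth set."""
    relevant = set(relevant)
    tp = sum(1 for r in retrieved if r in relevant)
    precision = tp / len(retrieved) if retrieved else 0.0
    recall = tp / len(relevant) if relevant else 0.0
    return precision, recall

def precision_at_k(retrieved, relevant, k=10):
    """Fraction of the top k retrieved items that are relevant."""
    relevant = set(relevant)
    return sum(1 for r in retrieved[:k] if r in relevant) / k

def average_precision(retrieved, relevant):
    """Mean of the precision values at each rank where a relevant item occurs."""
    relevant = set(relevant)
    hits = 0
    total = 0.0
    for rank, r in enumerate(retrieved, start=1):
        if r in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """runs: list of (retrieved, relevant) pairs, one per query."""
    if not runs:
        return 0.0
    return sum(average_precision(ret, rel) for ret, rel in runs) / len(runs)

# A list of two correct results against five ground-truth matches.
ground_truth = {"g1", "g2", "g3", "g4", "g5"}
ocr_precision, ocr_recall = precision_recall(["g1", "g2"], ground_truth)
```

The toy call returns precision 1.0 and recall 0.4: every returned item is correct, but only two of the five ground-truth matches are found.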

Comparison of naïve BoVW with BoVW + Query Expansion
Language      #Query  BoVW Search       BoVW + Query Expansion
                      mAP     Prec@10   mAP     Prec@10
Hindi (HP1)   100     62.54   81.30     66.09   83.86
Telugu (TP1)          71.13   78        73.08   79.89

Comparison of naïve BoVW with BoVW + Text Query Support
Language      #Query  BoVW Search       BoVW using Text Queries
                      mAP     Prec@10   mAP     Prec@10
Hindi (HP1)   100     62.54   81.30     56.32   73.89
Telugu (TP1)          71.13   78        69.06   78.83

Comparative performance of different fusion techniques on HP1 & TP1
Language      #Query  Naïve             Edit Distance     Hybrid
                      mAP     Prec@10   mAP     Prec@10   mAP     Prec@10
Hindi (HP1)   100     75.66   90.7      79.58   90.8      80.37   91.4
Telugu (TP1)          76.02   81.2      78.01   81.4      80.23   83.7

Performance statistics on DLI Annotated Corpus
Language      #Query  OCR               BoVW              Fusion
                      mAP     Prec@10   mAP     Prec@10   mAP     Prec@10
Hindi (HS1)   100     14.95   62.60     60.55   95.5      68.81   95.6
Telugu (TS1)          27.03   62.10     74.38   90.6      78.41   91.9

Performance statistics on DLI Un-Annotated Corpus
Language      #Query  Precision @ N  OCR     BoVW    Fusion
Hindi (HS2)   50      Prec@10        82.03   96.94   97.11
                      Prec@20        75.16   94.83   95.42
                      Prec@30        71.12   92.82   93.16
Telugu (TS2)                         90.85   99.14
                                     85.42   98.00   98.85
                                     80.76   96.38   96.57

Retrieved Results

Failure Cases
The word images shown in the figure fail in both OCR and BoVW.
Reasons:
(a) A short word image containing a character that is no longer in use.
(b) A highly degraded word image.

Implementation Details
Search engine development: an elegant web-based search and retrieval interface. Uses Lucene.
[Charts: scalability, time in milliseconds against number of images and number of visual words.]
[Figure: sample retrieved page.]

Search Architecture (Ongoing)
[Diagram components: search query, web service, delegator, index, OCR, BoVW, partial scores, fusion, query expansion, ranking, ranked results.]

Ongoing Work
Learn to improve from annotated datasets: use a visual confusion matrix to improve BoVW results.
Necessity of costly features for re-ranking: the images shown in the failure cases would require costly features to be retrieved.
Use of machine learning algorithms.
Exploration of features better than SIFT.

Thank You