Published by Egbert Walker. Modified over 9 years ago.
1
Overview of RISOT: Retrieval of Indic Script OCR’d Text
Utpal Garain, Indian Statistical Institute, Kolkata
Tamaltaru Pal, Indian Statistical Institute, Kolkata
Jiaul Paik, Indian Statistical Institute, Kolkata
Kripa Ghosh, Indian Statistical Institute, Kolkata
David Doermann, University of Maryland, College Park, USA
Douglas W. Oard, University of Maryland, College Park, USA
2
Task
o Evaluate retrieval of automatically recognized text from machine-printed text
o Goals
  o Support experimentation with retrieval from printed documents
  o Evaluate IR effectiveness for retrieval based on Indic script OCR
  o Provide a venue where IR and OCR researchers can work together
3
RISOT 2011
o Bengali newspaper articles
  o About half the FIRE 2008/2010 collection
  o 62,875 documents
    o Text
    o Rendered image
    o OCR’d text
  o 66 topics
4
RISOT 2011
o Two teams participated
o Techniques
  o OCR error modeling
  o Query-time stemming
o Best absolute OCR results came from stemming + error modeling
  o 83% of the TEXT MAP for TD queries
o Best same-team relative MAP
  o 90% of TEXT
  o 88% for P@10
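The measures quoted on these slides (MAP, P@5, P@10, Rprec) follow the standard IR definitions. The sketch below is a minimal illustration of those definitions, not the evaluation tool the track used; the function names and the toy ranking are assumptions made for the example.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / k


def r_precision(ranked, relevant):
    """Precision at rank R, where R is the number of relevant documents."""
    r = len(relevant)
    return precision_at_k(ranked, relevant, r) if r else 0.0


def average_precision(ranked, relevant):
    """Mean of the precision values at each rank where a relevant document appears."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(runs):
    """MAP over a list of (ranked, relevant) pairs, one pair per topic."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


# Toy topic (hypothetical document ids): two of three relevant documents retrieved.
ranked = ["d1", "d2", "d3", "d4", "d5"]
relevant = {"d1", "d3", "d6"}
ap = average_precision(ranked, relevant)
```

A relative figure such as "83% of the TEXT MAP" is then the ratio of the OCR run's MAP to the MAP of the corresponding clean-text run.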
5
Further experiments on RISOT 2011 Data
o N-gram statistics were used
o Stemming beats words or n-grams
o Statistically significant improvement over words for T and TD queries, on clean and OCR text, with and without the error model
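To make the n-gram condition concrete, here is a minimal sketch of character n-gram term generation. It assumes word-internal, overlapping n-grams over whitespace-separated tokens; the slides do not say how the n-grams were actually produced, and the function names are illustrative.

```python
def char_ngrams(token, n):
    """Return the overlapping character n-grams of a token.

    Tokens no longer than n are kept whole (an assumption of this sketch).
    """
    if len(token) <= n:
        return [token]
    return [token[i:i + n] for i in range(len(token) - n + 1)]


def index_terms(text, n):
    """Split text on whitespace and emit the n-grams of every token."""
    terms = []
    for token in text.split():
        terms.extend(char_ngrams(token, n))
    return terms


clean = char_ngrams("tobacco", 3)   # correctly recognized word
noisy = char_ngrams("1obacco", 3)   # same word with one OCR character error
shared = set(clean) & set(noisy)
```

A single misrecognized character changes only the n-grams that cover it, so the clean and noisy forms still share most of their terms. That partial overlap is the usual motivation for n-gram indexing of OCR text.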
6
Run       Q   Doc    Term    Model  MAP     MAP%  P@5     P@10    Rprec
TD-C-S    TD  Clean  Stem    -      0.4229  -     0.4413  0.3554  0.3940
TD-O-S-M  TD  OCR    Stem    Multi  0.3619  86%   0.3973  0.3207  0.3379
TD-O-S-E  TD  OCR    Stem    One    0.3521  83%   0.3858  0.3008  0.3294
TD-O-S    TD  OCR    Stem    -      0.2915  69%   0.3109  0.2489  0.2832

Run       Q   Doc    Term    Model  MAP     MAP%  P@5     P@10    Rprec
TD-C-W    TD  Clean  Word    -      0.3449  82%   0.3826  0.3152  0.3250
TD-O-W-M  TD  OCR    Word    Multi  0.3434  81%   0.3577  0.2962  0.3131
TD-O-W-E  TD  OCR    Word    One    0.3251  77%   0.3388  0.2694  0.3068
TD-O-W    TD  OCR    Word    -      0.2293  54%   0.2717  0.2217  0.2336

Run       Q   Doc    Term    Model  MAP     MAP%  P@5     P@10    Rprec
TD-O-3-E  TD  OCR    3-gram  One    0.3285  78%   0.3419  0.2709  0.2925
TD-O-3    TD  OCR    3-gram  -      0.3072  73%   0.3239  0.2707  0.2903
TD-O-4-E  TD  OCR    4-gram  One    0.2972  70%   0.3140  0.2651  0.2717
TD-O-2-E  TD  OCR    2-gram  One    0.2795  66%   0.2870  0.2000  0.2635
TD-O-4    TD  OCR    4-gram  -      0.2708  64%   0.3000  0.2489  0.2631
TD-O-5-E  TD  OCR    5-gram  One    0.2686  64%   0.2930  0.2465  0.2460
TD-O-5    TD  OCR    5-gram  -      0.2451  58%   0.2739  0.2283  0.2339
TD-O-2    TD  OCR    2-gram  -      0.1984  47%   0.2478  0.1924  0.2085
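The MAP% column appears to express each run's MAP relative to the clean-text stemmed run TD-C-S (MAP 0.4229). The slide does not state the baseline, so treating TD-C-S as the reference is an assumption of this sketch, which checks it against a few rows using the MAP values copied from the table.

```python
# Assumed baseline: the clean-text stemmed run TD-C-S.
BASELINE_MAP = 0.4229

# (MAP, reported MAP%) for a selection of runs from the table.
runs = {
    "TD-O-S-M": (0.3619, 86),
    "TD-O-S-E": (0.3521, 83),
    "TD-O-S": (0.2915, 69),
    "TD-C-W": (0.3449, 82),
    "TD-O-W": (0.2293, 54),
    "TD-O-3-E": (0.3285, 78),
}


def relative_map(run_map, baseline_map=BASELINE_MAP):
    """MAP of a run as a whole-number percentage of the baseline MAP."""
    return round(100 * run_map / baseline_map)


# Run names whose recomputed percentage differs from the reported one.
mismatches = [name for name, (m, pct) in runs.items() if relative_map(m) != pct]
```

Under that assumption every recomputed percentage matches the reported one, so `mismatches` stays empty for the rows listed here.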
7
CLIR
o English queries, Bengali collection (OCR’d)
o Dictionary-based translation
o Transliteration of OOVs
o Additional resources
o Stemming
o OCR error modeling
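The first two steps listed above, dictionary-based translation with transliteration as the fallback for out-of-vocabulary (OOV) words, can be sketched as follows. The dictionary entries, the `BN:` and `TL:` placeholder strings, and the `transliterate` stub are all hypothetical; the slides do not name the dictionary or the transliteration method that was used.

```python
def translate_query(query, dictionary, transliterate):
    """Translate a query word by word.

    Words found in the bilingual dictionary are replaced by all of their
    listed translations; OOV words are passed to the transliteration function.
    """
    translated = []
    for word in query.lower().split():
        if word in dictionary:
            translated.extend(dictionary[word])
        else:
            translated.append(transliterate(word))
    return translated


# Hypothetical toy resources: placeholder strings stand in for Bengali terms.
toy_dictionary = {
    "tobacco": ["BN:tobacco"],
    "ban": ["BN:ban-1", "BN:ban-2"],
}


def toy_transliterate(word):
    """Stand-in for a real English-to-Bengali transliterator."""
    return "TL:" + word


result = translate_query("Tobacco ban Kolkata", toy_dictionary, toy_transliterate)
```

Keeping every dictionary translation, as this sketch does, trades precision for recall; choosing among candidate translations is a separate design decision.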
8
CLIR Results
Run  Q   Retrieval Condition  Processing              MAP     MAP%  P@5     P@10    Rprec
T1   TD  Mono+Text            --                      0.3205  100%  0.3762  0.3182  0.3083
O1   TD  Mono                 --                      0.2689  84%   0.2420          0.4166
O2   TD  CLIR                 DQT                     0.0813  25%   0.1025  0.0854  0.0679
O3   TD  CLIR                 DQT (Manual Selection)  0.0848  26%   0.1150  0.0938  0.0864
O4   TD  CLIR                 DQT + OOV               0.1866  58%   0.2529  0.2063  0.1901
O5   TD  CLIR                 DQT + OOV + OEM         0.2650  83%   0.3338  0.2723  0.2509
O6   TD  CLIR                 DQT + OOV + OEM + Stem  0.2915  91%   0.3672  0.2996  0.2760
9
Addition in 2012
o Devanagari (Hindi) dataset
  o 94,432 articles from two newspapers
  o Subset of FIRE data
    o Text
    o Rendered image
    o OCR’d text
  o 28 topics
o Tasks
  o OCR post-processing
  o Retrieval from Bengali OCR’d text
  o Retrieval from Devanagari (Hindi) OCR’d text
10
RISOT Runs
o One team participated
  o ISI team
  o Kripabandhu Ghosh and Anirban Chakraborty
o Method
  o Did not use the previous OCR error modeling technique
  o Assumed that clean text is not available
  o Co-occurrence-based synonym searching
  o tobacc, 1obacco, etc. are synonyms of tobacco
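One plausible reading of co-occurrence-based synonym searching is sketched below: a collection term counts as an OCR variant of a query term when the two are similar as strings and tend to appear alongside the same other terms. This is an illustrative assumption, not the team's published method; the string-similarity filter, both thresholds, and the toy documents are invented for the example.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def cooccurrence_sets(docs):
    """Map each term to the set of other terms sharing a document with it."""
    co = defaultdict(set)
    for doc in docs:
        terms = set(doc.split())
        for term in terms:
            co[term] |= terms - {term}
    return co


def find_variants(term, docs, min_sim=0.8, min_overlap=0.3):
    """Return terms that resemble `term` in spelling and in context.

    min_sim bounds the difflib similarity ratio; min_overlap bounds the
    Jaccard overlap of the two co-occurrence sets. Both are assumed values.
    """
    co = cooccurrence_sets(docs)
    term_context = co.get(term, set())
    variants = []
    for candidate, context in co.items():
        if candidate == term:
            continue
        similarity = SequenceMatcher(None, term, candidate).ratio()
        union = term_context | context
        overlap = len(term_context & context) / len(union) if union else 0.0
        if similarity >= min_sim and overlap >= min_overlap:
            variants.append(candidate)
    return sorted(variants)


# Toy OCR'd collection (hypothetical), containing two corrupted forms of "tobacco".
docs = [
    "tobacco smoking health risk",
    "1obacco smoking health",
    "tobacc health risk smoking",
    "tobago island travel",
]
variants = find_variants("tobacco", docs)
```

Because it works only from the OCR'd collection, a scheme of this kind needs no clean text, which fits the stated assumption that clean text is not available.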
11
RISOT Results
                          MAP     P@5
Clean text                0.2567  0.3485
OCR’d text                0.1791  0.2738
OCR’d text + processing   0.1974  0.2831
o OCR error modeling (the earlier technique) gave a larger improvement
12
RISOT Future
o Next RISOT will introduce image degradation
  o Module of OCRopus
  o LAMP, UMD tool
o How to attract more teams
  o Involvement of OCR consortium
  o Better OCR
  o Better error modeling
  o Summer code projects
  o Once in two years