Published by Theresa Dalton. Modified over 9 years ago.
1
Omni Font OCR Error Correction with Effect on Retrieval
Walid Magdy 1, Kareem Darwish 2
1 Faculty of Engineering, Cairo University, Egypt
1 School of Computing, Dublin City University, Ireland
2 Faculty of Computers and Information, Cairo University, Egypt
2 Cairo Microsoft Innovation Center, Microsoft Research, Egypt
ISDA, 30 November 2010
2
What do I mean by
Printed text is converted into digital text through optical character recognition (OCR) process. Some errors can exist, which affect search
Printecl text i8 convenled into diyital tex throuyh optical chavacter recognition (0CK) process. Some ettors can exist, which attect search
3
State-of-the-art
Error Model: Printed ↔ Printecl, d ↔ cl
Needs manual effort
Needs accurate algorithm for alignment
Dependent on font
4
Question of Research
Can we create a correction model for OCR that is:
  Font independent (Omni font)
  Totally unsupervised
  Comparable with state-of-the-art in:
    Correction ability
    Retrieval effectiveness
5
Approach
[Diagram: OCR Text (e.g., cokkection) → Generate Candidates → List of poss. corr. → Select Correction → Corr. Text. Other labels in the diagram: Error Model, Language Model, Context, Calculate Edit Distance (ED).]
6
Initial Long List of Candidates
cokkection: collection, correction, …, pyramids
Index the dictionary of words:
  collection (index): {c, o, l, l, e, c, t, i, o, n, #c, co, ol, ll, le, ec, ct, ti, io, on, n#, #co, col, oll, lle, lec, ect, cti, tio, ion, on#, 10}
  cokkection (search): {c, o, k, k, e, c, t, i, o, n, #c, co, ok, kk, ke, ec, ct, ti, io, on, n#, #co, cok, okk, kke, kec, ect, cti, tio, ion, on#, 10}
1000 initial candidates to calculate ED
ED + unigram probability = prior probability
LM probability of word trigrams = posterior probability
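The indexing step above can be sketched in a few lines of Python. This is only an illustration, not the authors' implementation: the function names (`char_ngram_features`, `overlap`, `edit_distance`, `shortlist`) are hypothetical, the plain shared-term count stands in for whatever retrieval engine ranked the indexed dictionary, and the edit distance is assumed to be standard Levenshtein distance.

```python
from collections import Counter


def char_ngram_features(word, max_n=3):
    """Character 1..max_n-grams plus the word length, as on the slide."""
    features = list(word)      # unigrams carry no boundary marker
    padded = f"#{word}#"       # '#' marks word start and end for n >= 2
    for n in range(2, max_n + 1):
        features += [padded[i:i + n] for i in range(len(padded) - n + 1)]
    features.append(str(len(word)))  # word length is indexed as a term too
    return features


def overlap(query, entry):
    """Number of shared terms (assumed ranking score, not from the slides)."""
    q = Counter(char_ngram_features(query))
    e = Counter(char_ngram_features(entry))
    return sum((q & e).values())


def edit_distance(a, b):
    """Levenshtein distance with unit-cost insert, delete, substitute."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # substitute or match
        prev = curr
    return prev[-1]


def shortlist(query, dictionary, k=1000):
    """Keep the k best-overlapping words, then attach their edit distance."""
    ranked = sorted(dictionary, key=lambda w: overlap(query, w), reverse=True)
    return [(w, edit_distance(query, w)) for w in ranked[:k]]


if __name__ == "__main__":
    words = ["pyramids", "collection", "correction"]
    print(shortlist("cokkection", words, k=2))
```

Restricting edit distance to a shortlist avoids comparing each OCR token against the whole dictionary; the slide's figure of 1000 candidates is the default `k` here.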
7
Experimental Setup
Two Arabic OCR document collections:
  ZAD: religious book, WER = 39%
  TREC AFP: newspapers, WER = 31%
Correction using Error Model (EM):
  ZAD: 2000 training words
  AFP: 4000 training words
Two domain-specific language models
Test EM vs ED correction:
  Error reduction
  Retrieval effectiveness
8
Error Reduction
Collection        Correction   WER    Error Reduction
ZAD (WER = 39%)   ED           17%    56%
ZAD (WER = 39%)   EM (ref)     12%    70%
AFP (WER = 31%)   ED           7.3%   76%
AFP (WER = 31%)   EM (ref)     5.9%   81%
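The error reduction column appears to be the relative drop in WER. The short check below assumes that formula; the slides do not state it, and `relative_error_reduction` is a hypothetical helper name. Using the rounded WER values from the slide, it reproduces three of the four figures. The ZAD EM row comes out near 69% rather than the reported 70%, presumably because the slide's percentages were computed from unrounded WERs.

```python
def relative_error_reduction(wer_before, wer_after):
    """Percentage of the original word errors removed by correction."""
    return 100.0 * (wer_before - wer_after) / wer_before


if __name__ == "__main__":
    # (collection, method, WER before, WER after), values taken from the slide
    rows = [
        ("ZAD", "ED", 39.0, 17.0),
        ("ZAD", "EM", 39.0, 12.0),
        ("AFP", "ED", 31.0, 7.3),
        ("AFP", "EM", 31.0, 5.9),
    ]
    for collection, method, before, after in rows:
        reduction = relative_error_reduction(before, after)
        print(f"{collection} {method}: {reduction:.1f}%")
```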
9
Retrieval results for ZAD
10
Conclusion
Omni font correction:
  Reduces errors by up to 75%
  Slightly lower than correction based on error model (EM)
  Statistically indistinguishable from EM correction for search
  No training required
  Independent of font or language
11
Advice
Enjoy your stay in Egypt:
  Cairo: Pyramids, Nile
  Luxor, Aswan: Temples, Nile
  Sharm El-Sheikh: Red Sea, Safari
Do not drive unless you are Egyptian
Do not cross the road alone
Do not ask questions
Thank you
12
Equations
S_ED(w_i) =