Published by Aldous Griffin. Modified over 9 years ago.
1
Effect of Word-Based Correction on Retrieval of Arabic OCR Degraded Documents Walid Magdy & Kareem Darwish IBM Technology Development Center PO Box 166 El-Ahram, Giza, Egypt {wmagdy,darwishk}@eg.ibm.com
2
Outline: 1. Motivation 2. Background 3. Approach 4. Experimental Setup 5. Results 6. Conclusion 7. Future Work
3
Motivation:
Timeline (1400 to 2000): First printing press; Read to search; E-text becomes commonplace; Automated full text search; 1998: Arabic e-text comes online
Problem: 500+ years of legacy documents
Goal: To search printed documents efficiently and effectively
Does OCR solve the problem?
4
Arabic Language Challenges
Orthography
– Character shape depends on position
– 15 of the 28 letters contain dots
– Optional diacritics may be present
– Printed text may include ligatures and kashida
Morphology
– Prefix, infix, and suffix
– 6×10^10 possible surface forms
– Example: وسـيــكـتبونـهـا = wasaya+ktub+uunahaa = and will + write + they it = and they will write it
Other factors
– Eighth most widely spoken language in the world
– Web growth started only recently
5
Arabic Pre-processing & Retrieval
Pre-processing:
– Remove diacritics
– Normalize different forms of alef & ya (أ ، آ ، إ ، ا, ؤ, ئ, ء ا, and ى ، ي ي) to account for
∙ Common spelling errors
∙ Grammatical, morphological, and orthographic properties
Text Retrieval: Best Index Terms
– Regular text: light stemming and character 3- & 4-grams are best
– OCR text: character 3- & 4-grams are best
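The pre-processing and n-gram indexing steps can be sketched as follows. This is a minimal illustration, not the authors' implementation: the normalization table (alef variants to bare alef, alef maqsura to ya) is an assumed, commonly used choice, and the function names `preprocess` and `char_ngrams` are hypothetical.

```python
import re

# Arabic diacritic marks, fathatan (U+064B) through sukun (U+0652).
DIACRITICS = re.compile("[\u064B-\u0652]")

# Assumed normalization table: alef variants -> bare alef, alef maqsura -> ya.
NORMALIZE = str.maketrans({
    "\u0623": "\u0627",  # alef with hamza above -> alef
    "\u0622": "\u0627",  # alef with madda -> alef
    "\u0625": "\u0627",  # alef with hamza below -> alef
    "\u0649": "\u064A",  # alef maqsura -> ya
})

def preprocess(text):
    """Strip diacritics, then normalize alef and ya forms."""
    return DIACRITICS.sub("", text).translate(NORMALIZE)

def char_ngrams(word, n):
    """Return the overlapping character n-grams of a word.

    Words no longer than n are kept whole so they still yield an index term.
    """
    if len(word) <= n:
        return [word]
    return [word[i:i + n] for i in range(len(word) - n + 1)]
```

Because an OCR error usually damages only part of a word, some of its n-grams still match the query, which is one way to read the slide's claim that n-grams suit OCR text.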
6
Main Idea:
Image: Word-Based Correction for Retrieval of Arabic OCR Degraded Documents
OCR
Degraded Text: VVorcl-Easod Comectlon l0r Belrieval of Arahie OCR Dcgraclod Doeurnerits
Correction
Corrected Text: Word-Based Correction for Retrieval of Arabic OCR Degraded Documents
We want to examine the effect of correction on retrieval
7
Approach (pipeline diagram): OCR system, OCR Degraded Text, OCR Correction, OCR Corrected Text, Indexing, Ranked List of Documents
8
Experimental Setup: Test collections; Error Correction; Building Error Model; Training & Decoding; Experiments
9
Document Collections:
TREC 2002 CLIR
– Arabic newswire articles from Agence France-Presse (AFP)
– 383,872 articles
– 50 topics
– Synthetic degraded text using degradation model
– WER = 30.8%
ZAD
– Printed 14th century religious book, scanned at 300x300 dpi and OCR'ed
– 2,730 documents
– 25 topics
– Real degraded text by OCR process
– WER = 39%
10
The ZAD Collection: حكم التيمم ومتى شرع Sample Document: Sample Query:
11
The TREC 2002 CLIR Collection:
Sample Query: سجناء حرب ايرانيين وعراقيين (Iranian and Iraqi prisoners of war)
Sample Document: 19940513_AFP_ARB0001 ارا0800 4 ع 7710 قبرص /افب-تصج86 الشرق الاوسط/سلام/حكم ذاتي &HT; العلم الفلسطيني لم يُرفع فوق كنيس اريحا اريحا (الضفة الغربية) 31-5 (اف ب)- يقوم احد عناصر الشرطة الفلسطينية بحراسة مدخل الكنيس اليهودي في وسط اريحا احد آخر مواقع المدينة التي تم تسليمها الى الشرطة الفلسطينية الا انه لم يتم رفع العلم الفلسطيني فوق الكنيس وقال ضابط فلسطيني لفلسطينية كانت تحاول رفع العلم الفلسطيني فوق الكنيس "هذا مكان مقدس" وقبيل ذلك اقترب ثلاثة مستوطنين يهود من مدخل الكنيس الذي كان الجنود الاسرائيليون ما زالوا يوءمنون حراسته وعندما منعهم الجنود من الدخول قاموا بتمزيق ثيابهم
12
OCR-Correction Model:
Training: Manually Corrected OCR Text; Aligning Characters Mapping; Build Error Model
Decoding: OCR Degraded Text; Generate Corrections; Pick the most likely correction using Bayes' Rule; OCR Corrected Text
13
Aligning Characters Mapping:
(S = substitution, I = insertion, D = deletion, √ = match)
m:n Mapping. Ex: walid / vvaicl
w → vv (S)
a → a (√)
l → Null (D)
i → i (√)
d → cl (S)
1:1 Mapping. Ex: walid / vvaicl
w → v (S)
Null → v (I)
a → a (√)
l → Null (D)
i → i (√)
d → c (S)
Null → l (I)
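A 1:1 alignment like the one above can be obtained from a standard minimum-edit-distance table with a traceback. The sketch below assumes unit costs for substitution, insertion, and deletion, which the slides do not specify, and `align` is a hypothetical name. When several alignments share the minimum cost, the traceback returns just one of them, so the pairs it prints for walid / vvaicl may differ from the slide while having the same cost.

```python
def align(correct, ocr):
    """Return (cost, pairs) for one minimum-edit-distance 1:1 alignment.

    Each pair is (correct_char_or_None, ocr_char_or_None); None plays the
    role of "Null" on the slide.
    """
    n, m = len(correct), len(ocr)
    # dist[i][j] = edit distance between correct[:i] and ocr[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i
    for j in range(1, m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = dist[i - 1][j - 1] + (correct[i - 1] != ocr[j - 1])
            dist[i][j] = min(sub, dist[i - 1][j] + 1, dist[i][j - 1] + 1)

    # Trace back from the bottom-right corner to recover one optimal path.
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        diagonal_ok = (
            i > 0 and j > 0
            and dist[i][j] == dist[i - 1][j - 1] + (correct[i - 1] != ocr[j - 1])
        )
        if diagonal_ok:
            pairs.append((correct[i - 1], ocr[j - 1]))  # match or substitution
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            pairs.append((correct[i - 1], None))        # deletion
            i -= 1
        else:
            pairs.append((None, ocr[j - 1]))            # insertion
            j -= 1
    pairs.reverse()
    return dist[n][m], pairs

cost, pairs = align("walid", "vvaicl")
```

Merging adjacent 1:1 pairs (for example w → v followed by Null → v into w → vv) is one plausible way to arrive at the m:n mapping.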
14
Building Error Model: the model is defined over pairs of character segments $C_k \ldots C_l$ and $D_x \ldots D_y$, where each segment is one or more characters
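The slide's formula is not available in this transcript. One common way to build such a model, assumed here rather than taken from the slides, is a relative-frequency estimate over aligned segment pairs: the number of times a correct segment was rendered as a given OCR segment, divided by the number of times the correct segment occurred. The function name and input format below are hypothetical.

```python
from collections import Counter, defaultdict

def build_error_model(aligned_pairs):
    """Estimate P(ocr_segment | correct_segment) by relative frequency.

    aligned_pairs: iterable of (correct_segment, ocr_segment) string tuples,
    where an empty string stands for Null (an insertion or a deletion).
    """
    pair_counts = Counter(aligned_pairs)
    source_counts = Counter()
    for (correct_seg, _), count in pair_counts.items():
        source_counts[correct_seg] += count

    model = defaultdict(dict)
    for (correct_seg, ocr_seg), count in pair_counts.items():
        model[correct_seg][ocr_seg] = count / source_counts[correct_seg]
    return dict(model)
```

With this estimate, the probabilities for each correct segment sum to 1 over the OCR segments observed for it in training.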
15
Decoding:
Bayes' Rule: $\widehat{\mathrm{Word}}_{correct} = \arg\max_{\mathrm{Word}_{correct}} P(\mathrm{Word}_{OCR} \mid \mathrm{Word}_{correct}) \, P(\mathrm{Word}_{correct})$
– $P(\mathrm{Word}_{OCR} \mid \mathrm{Word}_{correct})$: character-level model
– $P(\mathrm{Word}_{correct})$: word-level model, the LM probability (a simple unigram probability was used)
16
Example: Character Level Model:
1. Segmentation. Ex: dairn
d a i r n, da i r n, d ai r n, dai r n, d a ir n, da ir n, d air n, dair n, d a i rn, da i rn, d ai rn, dai rn, d a irn, da irn, d airn, dairn
2. Mapping (ε = empty string), for the segmentation d a i rn:
d: d 0.8, h 0.1, cl 0.08, ε 0.02
a: a 0.9, o 0.05, r 0.02, oi 0.015, ε 0.005, n 0.005, e 0.005
i: i 0.84, l 0.12, ε 0.02, t 0.015, ll 0.005
rn: rn 0.7, m 0.15, im 0.02, ln 0.015, ε 0.005
Additional entries whose source segment is not recoverable: l 0.09, i 0.05, li 0.02, s 0.015, f 0.005, t 0.005, a 0.005, and a single unlabeled 0.005
3. Generate Candidates: dairn 0.425, daim 0.091, claim 0.0091, aim 0.00227, horn 0.00007
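The segmentation and candidate-generation steps can be sketched as below, using the mapping probabilities from the example for the segmentation d | a | i | rn. Scoring a candidate as the product of its per-segment probabilities is an assumption inferred from the numbers; it approximately reproduces the candidate scores shown. Keeping the best score when two combinations yield the same string is also an assumption, and the names `segmentations`, `MAPPING`, and `candidates` are hypothetical.

```python
from itertools import product

def segmentations(word):
    """Yield every way to split word into contiguous, non-empty segments."""
    if not word:
        yield []
        return
    for cut in range(1, len(word) + 1):
        for rest in segmentations(word[cut:]):
            yield [word[:cut]] + rest

# Mapping probabilities as read from the example; "" is the empty string (ε).
MAPPING = {
    "d": {"d": 0.8, "h": 0.1, "cl": 0.08, "": 0.02},
    "a": {"a": 0.9, "o": 0.05, "r": 0.02, "oi": 0.015, "": 0.005,
          "n": 0.005, "e": 0.005},
    "i": {"i": 0.84, "l": 0.12, "": 0.02, "t": 0.015, "ll": 0.005},
    "rn": {"rn": 0.7, "m": 0.15, "im": 0.02, "ln": 0.015, "": 0.005},
}

def candidates(segments, mapping):
    """Score every combination of per-segment replacements.

    A combination's score is the product of its replacement probabilities;
    if several combinations spell the same string, the best score is kept.
    """
    scores = {}
    options = [mapping[seg].items() for seg in segments]
    for combo in product(*options):
        text = "".join(replacement for replacement, _ in combo)
        prob = 1.0
        for _, p in combo:
            prob *= p
        scores[text] = max(scores.get(text, 0.0), prob)
    return scores

all_splits = list(segmentations("dairn"))        # 16 segmentations
scores = candidates(["d", "a", "i", "rn"], MAPPING)
```

A word of length n has 2^(n-1) segmentations, so a practical system would need to bound segment length or prune low-probability paths; the slides do not say how this was handled.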
17
Example (cont): Word Level Model:
Find the frequency of occurrence of each generated word in the dictionary
P(dairn | dairn) = 0.425, Freq(dairn) = 0
P(daim | dairn) = 0.091, Freq(daim) = 0
P(claim | dairn) = 0.0091, Freq(claim) = 1500
P(aim | dairn) = 0.00227, Freq(aim) = 4000
P(horn | dairn) = 0.00007, Freq(horn) = 150
Result: dairn → claim
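Combining the two models for this example can be sketched as follows: each candidate's character-level score is multiplied by its unigram probability, and the highest product wins. The unigram probability is approximated here as frequency divided by the sum of the listed frequencies, which is an assumption for illustration only; a real model would divide by the full corpus size. `pick_correction` is a hypothetical name.

```python
def pick_correction(char_scores, word_freq):
    """Rank candidates by character-level score times unigram probability."""
    total = sum(word_freq.values())
    ranked = {
        cand: score * word_freq.get(cand, 0) / total
        for cand, score in char_scores.items()
    }
    best = max(ranked, key=ranked.get)
    return best, ranked

# Values taken from the example.
char_scores = {"dairn": 0.425, "daim": 0.091, "claim": 0.0091,
               "aim": 0.00227, "horn": 0.00007}
word_freq = {"claim": 1500, "aim": 4000, "horn": 150}

best, ranked = pick_correction(char_scores, word_freq)
print(best)  # claim
```

The candidates dairn and daim score highest at the character level but have zero dictionary frequency, so they drop out; claim (0.0091 × 1500) then outranks aim (0.00227 × 4000).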
18
IR Experiments
– The degraded collections were corrected, and the best one, two, three, or five corrections for each word were selected for indexing
– The collections were indexed and searched using words, character 3-grams, character 4-grams, and lightly stemmed words
– Retrieval performance was tested for every combination of index type and number of corrections
– The measure of merit is Mean Average Precision
– Significance testing was done using a t-test with a p-value threshold of 0.05
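For reference, Mean Average Precision can be computed as in the sketch below, which uses the standard (uninterpolated) definition. The function names and input format are hypothetical, and the evaluation tooling actually used in the experiments is not stated in the slides.

```python
def average_precision(ranked_ids, relevant_ids):
    """Average of the precision values at the rank of each relevant document.

    Relevant documents that are never retrieved contribute zero.
    """
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits, precision_sum = 0, 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)

def mean_average_precision(runs):
    """runs: list of (ranked_ids, relevant_ids) pairs, one per topic."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)
```

For a ranking a, b, c, d with relevant documents a and c, the average precision is (1/1 + 2/3) / 2; MAP averages this value over all topics (50 for TREC, 25 for ZAD).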
19
Correction Results (charts): ZAD Collection; TREC Collection
20
IR Results: “ZAD Collection” (chart labels: Clean, Bad)
21
IR Results: “TREC Collection” (chart labels: Clean, Bad)
22
Conclusion & future work:
Conclusion
– Although WER was halved, IR effectiveness did not improve by a statistically significant margin
– Using more than one correction does not help
– Indexing using n-grams (shorter index terms) is better than “moderate” error correction
Future work
– Effect of using an n-gram word LM on error correction: Magdy, W. and K. Darwish. Arabic OCR Error Correction Using Character Segment Correction, Language Modeling, and Shallow Morphology. In EMNLP 2006.
– Effect of “good” error correction on improving retrieval effectiveness
23
Lnanh gon → Correction → Thank you