Statistical Methods for Text Error Correction

By Walid Magdy
Supervision: Prof. Dr. Mohsen Rashwan, Prof. Dr. Kareem Darwish

Outline
- Motivation
- Prior Work
- Contribution
- Text Fusion
- Omni-Font Error Correction
- Integrated System
- Conclusion
- Possible Future Directions
16 November 2018

Motivation
- Massive efforts are moving toward digitization
- Digitization serves availability and information retrieval (IR)
- OCR is the main enabling technology
- OCR systems are far from perfect
- Poor-quality OCR means low readability and low IR effectiveness
- Arabic OCR accuracy is much lower than that of other languages
- Higher-quality text for Arabic documents has become a must for improving readability and IR effectiveness

Prior Work
Previous work on OCR focused on:
- Building better OCR systems
- Improving text quality through text correction
- Improving IR effectiveness regardless of improving text quality
Examples:
- Sakhr & RDI OCR systems
- OCR correction based on a character error model and a language model
- Index term selection for indexing degraded text
- Query garbling based on a character error model

Contribution
- Fusion of multiple OCR output texts
- Omni-font OCR error correction
- Information retrieval improvement for degraded text
- A system design that can reduce errors in degraded text by more than 80%

Text Fusion: Outline
- Definition
- Approach
- Implementation
- Experimental Setup
- Results

Text Fusion: Definition
Degradation model: clean version of text + noisy edit operations = degraded version of text
  Simage -> OCR -> Sx = S0 + εx
- Previous approaches depend on the presence of only one source of degraded text.
- Our approach assumes the presence of more than one version of the degraded text.
Correction: given Sx = S0 + εx, produce S0’ = S0 + ε0’ with ε0’ < εx
Fusion: given S1 = S0 + ε1, S2 = S0 + ε2, ..., Sn = S0 + εn, produce S0’ = S0 + ε0’ with ε0’ < min(ε1 … εn)

Text Fusion: Approach
Image -> OCR System1 and OCR System2 -> Text Fusion
OCR System1 output: ولا ثدي إلا في ألمستدلاله بنور4 ولا حيا4 إلا في رضا4
OCR System2 output: ولم هدء! إلم في الاستدلال بنوره ولم حياة ملأ في رضا5
Fused text: ولا ثدي إلا في الاستدلال بنوره ولا حياة إلا في رضا5

Text Fusion: Implementation
Pipeline: OCR Text 1 + OCR Text 2 -> Word Alignment -> Best Fitting Word Selection -> Fused Text
Resource: Language Model
OCR1: W1 W2 W3 W4 ….. Wn-2 Wn-1 Wn
OCR2: W1 W2 W3 W4 ….. Wm-2 Wm-1 Wm
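The two stages named on this slide, word alignment followed by best-fitting word selection with a language model, can be illustrated with a minimal sketch. It makes several assumptions that are not in the slides: alignment is done with Python's `difflib.SequenceMatcher`, the language model is a toy add-one-smoothed unigram model rather than the trigram LM used in the experiments, and the English sample data and the names `make_unigram_scorer` and `fuse` are purely illustrative.

```python
import math
from collections import Counter
from difflib import SequenceMatcher


def make_unigram_scorer(training_words):
    """Return a scorer giving the mean add-one-smoothed log-probability of a segment."""
    counts = Counter(training_words)
    denom = sum(counts.values()) + len(counts) + 1  # +1 reserves mass for unseen words

    def logp(word):
        return math.log((counts.get(word, 0) + 1) / denom)

    def score(segment):
        if not segment:  # an empty alternative is scored like one unseen word
            return math.log(1 / denom)
        return sum(logp(w) for w in segment) / len(segment)

    return score


def fuse(words1, words2, score):
    """Align two OCR word sequences and keep the better-scoring side of each mismatch."""
    matcher = SequenceMatcher(a=words1, b=words2, autojunk=False)
    fused = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            fused.extend(words1[i1:i2])
        else:
            # Disagreement between the OCR outputs: let the language model decide.
            fused.extend(max(words1[i1:i2], words2[j1:j2], key=score))
    return fused


# Toy data (hypothetical): each OCR output has one error in a different word.
score = make_unigram_scorer("the quick brown fox jumps over the lazy dog".split())
ocr1 = "the quick brovvn fox jumps".split()
ocr2 = "tbe quick brown fox jumps".split()
fused = fuse(ocr1, ocr2, score)
print(" ".join(fused))
```

Because each mismatched slot is resolved independently, the fused text can contain fewer errors than either input whenever the two OCR systems err on different words, which is the property ε0’ < min(ε1 … εn) that fusion aims for.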

Text Fusion: Experimental Setup
Fusion was tested in two ways:
1. Effect on error reduction
2. Effect on information retrieval
The ZAD Alma’ad book was used for testing. The collection contains:
- OCR output using Sakhr
- A clean version of the book
- Queries with relevance judgments
A tri-gram LM was built from Ibn-Taymia books.
Two OCR systems were available:
- Sakhr Automatic Reader
- RDI OCR system

Text Fusion: Effect on Error Reduction (1/2)
- Ten clean pages were selected at random and printed in three different fonts (Kufi, Mudir, and Simplified Arabic)
- The set contains 4,200 words, 0.9% of which are out-of-vocabulary (OOV)
- Each version of the text was scanned at two different resolutions (200x200 dpi and 300x300 dpi)
- Each scanned version was OCRed using both OCR systems (RDI and Sakhr)
- Different versions were fused, and CERs and WERs were checked for all versions (original and fused)
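The slides do not spell out how CER and WER were computed. A minimal sketch, assuming the usual definition (Levenshtein edit distance between reference and OCR output, divided by the reference length, over characters for CER and whitespace-separated words for WER), could look like this; the function names and the English sample strings are illustrative only.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(ref_text, hyp_text):
    """Word error rate: word-level edit distance over the reference word count."""
    ref_words = ref_text.split()
    return edit_distance(ref_words, hyp_text.split()) / len(ref_words)


def cer(ref_text, hyp_text):
    """Character error rate: character-level edit distance over the reference length."""
    return edit_distance(ref_text, hyp_text) / len(ref_text)


clean = "the quick brown fox"
ocr = "the quick brovvn fox"
print(f"WER = {wer(clean, ocr):.2%}, CER = {cer(clean, ocr):.2%}")
```

In this toy pair one word out of four is wrong, so the WER is 25%, while the CER is much lower because only two character edits separate the strings.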

Text Fusion: Effect on Error Reduction (2/2)
WER for different versions of the ZAD test set (original and fused versions)

Font         OCR system            200 dpi    300 dpi    Fused (200 + 300 dpi)
Kufi         RDI                   14.6%      5.6%       4.03%
Kufi         Sakhr                 41.8%      44.7%      34.70%
Kufi         Fused (RDI + Sakhr)   8.73%      6.10%      4.97%
Mudir        RDI                   25.6%      3.0%       2.45%
Mudir        Sakhr                 8.8%       10.7%      5.84%
Mudir        Fused (RDI + Sakhr)   3.63%      1.91%      2.00%
Simplified   RDI                   56.2%      9.4%       3.98%
Simplified   Sakhr                 16.5%      9.1%       3.37%
Simplified   Fused (RDI + Sakhr)   11.90%     2.50%      2.10%

Text Fusion: Effect on Retrieval Effectiveness
- Relevance judgments for ZAD were built on the whole book (2,730 documents)
- To produce multiple versions of degraded text of ZAD:
  - A sample of the original OCR version was manually corrected
  - The degraded and clean text were used to create a character error model based on 1:1 character mapping
  - The generated model was then used to garble the clean version at different CERs
- Different versions were fused, then MAP was used as the figure of merit for IR results
- Results showed some improvement in IR effectiveness
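The garbling step can be sketched as follows. This is an illustration under stated assumptions, not the procedure used in the experiments: the model is trained only from equal-length (clean, degraded) string pairs so that the mapping is strictly 1:1, and the error level is controlled by a per-character corruption probability `target_cer`, which is a hypothetical parameter. The realized CER is at most that value, because characters with no observed confusion are never altered.

```python
import random
from collections import Counter, defaultdict


def train_char_model(pairs):
    """Count 1:1 character confusions from (clean, degraded) strings of equal length."""
    confusions = defaultdict(Counter)
    for clean, degraded in pairs:
        if len(clean) != len(degraded):
            continue  # keep the model strictly 1:1; skip pairs with insertions/deletions
        for c, d in zip(clean, degraded):
            if c != d:
                confusions[c][d] += 1
    return confusions


def garble(text, confusions, target_cer, rng):
    """Replace confusable characters with probability target_cer, sampling by confusion counts."""
    out = []
    for ch in text:
        subs = confusions.get(ch)
        if subs and rng.random() < target_cer:
            chars, weights = zip(*subs.items())
            out.append(rng.choices(chars, weights=weights, k=1)[0])
        else:
            out.append(ch)
    return "".join(out)


# Toy training pair (hypothetical): the OCR confused one 'a' with 'o'.
model = train_char_model([("banana", "bonana")])
rng = random.Random(0)  # fixed seed so repeated runs garble identically
for rate in (0.1, 0.3, 0.5):
    print(rate, garble("a banana a day", model, rate, rng))
```

Running the same clean text through `garble` with different rates or seeds yields several differently degraded versions of one document, which is what fusion needs as input.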

Omni-Font Correction: Outline
- Idea
- Implementation
- Experimental Setup
- Results

Omni-Font Correction: Idea
- Character Error Model: ب -> ب, ب -> ي, ب -> د, ب -> ق, ب -> ذ
- Domain: Politics, Religious, Science, Sports (selected: Religious)
- Candidates (Dictionary): الاستقلال, الاستبدال, الاستدلال, الاستبلاد, الاستغلال, الاستبسال, الاستذلال
- OCR word: الاستبلال
- Context: على وجوده لله تعـالـى

Omni-Font Correction: Implementation
Pipeline: OCR Text -> Generate Candidates -> Best Fitting Word Selection -> Corrected Text
Resources: Edit Distance, Language Model, Dictionary
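A minimal sketch of the candidate-generation and selection steps is shown below. It rests on assumptions that the slides do not confirm: candidates are dictionary words within a fixed edit distance of the OCR word, every edit costs the same (a stand-in for a uniform character model), and a unigram count table stands in for the domain language model. The counts, the `penalty` weight and the function names are hypothetical; the Arabic words are taken from the example on the "Idea" slide.

```python
import math


def edit_distance(a, b):
    """Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def correct_word(word, counts, max_dist=2, penalty=4.0):
    """Pick the dictionary word maximizing log P(candidate) - penalty * edit distance."""
    total = sum(counts.values())
    best, best_score = word, -math.inf  # fall back to the input if no candidate is close
    for cand, count in counts.items():
        dist = edit_distance(word, cand)
        if dist > max_dist:
            continue
        score = math.log(count / total) - penalty * dist
        if score > best_score:
            best, best_score = cand, score
    return best


# Hypothetical unigram counts for a religious-domain corpus.
domain_counts = {"الاستدلال": 9, "الاستقلال": 2, "الاستبدال": 1}
noisy = "الاستبلال"  # OCR word from the slide's example
corrected = correct_word(noisy, domain_counts)
print(corrected)
```

All three candidates are one edit away from the OCR word, so the domain counts break the tie. Swapping in counts from a politics corpus could favor a different candidate, which is the role the slide gives to the domain.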

Omni-Font Correction: Experimental Setup
Correction was tested in two ways:
1. Effect on error reduction
2. Effect on information retrieval
ZAD was used for the experiments, and another collection, AFP (TREC), was used to test the system in a different domain (news).
Two LMs were used to test the AFP collection:
1. An LM built from the same time period as AFP
2. An LM built from a different time period than AFP
All results were compared to previous work on correction using the character error model.

Omni-Font Correction: Effect on Error Reduction

ZAD
Version                    WER      Error Reduction
Original OCR version       39%      NA
Uniform Character Model    17%      56%
Trained Character Model    12%      70%

TREC
Version                    LM         WER      Error Reduction
Original OCR version       NA         31%      NA
Uniform Character Model    AFP LM     7.3%     76%
Uniform Character Model    News LM    11.7%    62%
Trained Character Model    AFP LM     5.9%     81%
Trained Character Model    News LM    10.7%    65%

Omni-Font Correction: Effect on Retrieval
[Figure: Results in MAP for searching different versions of the ZAD collection]
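MAP (mean average precision) is the figure of merit used for the retrieval results. As a reminder of what it measures, here is a minimal sketch of the standard definition; the document IDs and the two sample queries are made up for illustration.

```python
def average_precision(ranked_ids, relevant_ids):
    """Average of precision@k taken at every rank k that holds a relevant document."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)  # unretrieved relevant documents count as zero


def mean_average_precision(runs):
    """Mean of the per-query average precision over (ranking, relevant set) pairs."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


runs = [
    (["d1", "d2", "d3"], {"d1", "d3"}),  # AP = (1/1 + 2/3) / 2
    (["d4", "d5"], {"d5"}),              # AP = 1/2
]
print(round(mean_average_precision(runs), 3))
```

OCR errors in the indexed text push relevant documents down the ranking or out of it entirely, which lowers average precision; comparing MAP across original, fused and corrected versions of a collection measures how much of that loss is recovered.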

Integrated System: Outline
- Possible Implementations
- Detailed Implementation
- Experimental Setup
- Results

Integrated System: Possible Implementations
1. Degraded Text -> Correction -> Fusion -> Less degraded Text
2. Degraded Text -> Fusion -> Correction -> Less degraded Text

Integrated System: Implementation
Degraded Text Version 1, Degraded Text Version 2, ..., Degraded Text Version n -> Text Fusion -> Fused Text Version -> Correction -> Much lower errors version

Integrated System: Experimental Setup
- Correction was applied to all fused versions of the text shown in the fusion section (versions of ZAD)
- Correction was applied in two different manners:
  1. Whole-text correction
  2. OOV-only correction
- For fused versions, 55% of the word errors are OOV words
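The difference between the two correction modes can be sketched in a few lines. The `correct_word` callable, the vocabulary and the toy corrector below are hypothetical placeholders for whatever word-level corrector and dictionary are actually used.

```python
def correct_text(words, vocabulary, correct_word, oov_only=True):
    """Apply a word-level corrector to OOV words only, or to every word."""
    corrected = []
    for word in words:
        if oov_only and word in vocabulary:
            corrected.append(word)  # in-vocabulary words are trusted and left untouched
        else:
            corrected.append(correct_word(word))
    return corrected


# Toy corrector (hypothetical): fixes one real error but also rewrites a valid word.
vocabulary = {"the", "cat", "sat"}
toy_fixes = {"cst": "cat", "sat": "set"}


def toy_corrector(word):
    return toy_fixes.get(word, word)


ocr_words = ["the", "cst", "sat"]
print(correct_text(ocr_words, vocabulary, toy_corrector, oov_only=True))
print(correct_text(ocr_words, vocabulary, toy_corrector, oov_only=False))
```

In this toy case whole-text correction changes a word that was already right. That is one plausible way a corrector can raise the error rate of an already clean fused text, whereas OOV-only correction cannot touch valid words but also cannot fix errors that happen to form valid words.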

Integrated System: Results
(- = value missing in the source)

Correction applied to fused versions
Version    Fused WER   WER after OOV Cor.   WER after Full Cor.   Error Reduction (OOV Cor.)   Error Reduction (Full Cor.)
K.200      8.7%        7.1%                 7.9%                  18.9%                        9.1%
K.300      6.1%        5.1%                 7.2%                  15.6%                        -18.7%
K          5.0%        4.1%                 6.2%                  16.8%                        -25.7%
M.200      3.6%        2.4%                 4.8%                  32.6%                        -32.8%
M.300      1.9%        0.9%                 3.7%                  52.7%                        -96.5%
M          2.0%        1.0%                 3.8%                  47.9%                        -88.4%
S.200      11.9%       9.4%                 9.5%                  20.6%                        19.8%
S.300      2.5%        1.3%                 -                     48.7%                        -54.0%
S          2.1%        1.8%                 4.3%                  15.1%                        -103.8%
RDI.K      4.0%        3.3%                 -                     19.0%                        -53.7%
RDI.M      -           1.5%                 -                     39.9%                        -68.6%
RDI.S      -           -                    -                     39.8%                        -9.1%
Sakhr.K    34.7%       29.7%                26.3%                 14.6%                        24.1%
Sakhr.M    5.8%        3.8%                 -                     34.2%                        -6.4%
Sakhr.S    3.4%        2.0%                 4.4%                  41.6%                        -29.6%

Integrated system compared with the least original WER
Version    Least Original WER   New WER   Error Reduction
M.300      3.0%                 0.9%      70.0%
M          3.0%                 1.0%      66.7%
RDI.M      3.0%                 1.5%      50.0%
K.300      5.6%                 5.1%      8.9%
K          5.6%                 4.1%      26.8%
RDI.K      5.6%                 3.3%      41.1%
M.200      8.8%                 2.4%      72.7%
Sakhr.M    8.8%                 3.8%      56.8%
S.300      9.1%                 1.3%      85.7%
S          9.1%                 1.8%      80.2%
Sakhr.S    9.1%                 2.0%      78.0%
RDI.S      9.4%                 -         74.5%
K.200      14.6%                7.1%      51.4%
S.200      16.5%                9.4%      43.0%
Sakhr.K    41.8%                29.7%     28.9%
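The error reduction figures in the comparison against the least original WER are consistent with a relative reduction, (original WER - new WER) / original WER. The slides do not state the formula, so treat it as an assumption. The short check below reproduces three rows whose values are all present above.

```python
def error_reduction(original_wer, new_wer):
    """Relative error reduction; negative when the new WER is higher than the original."""
    return (original_wer - new_wer) / original_wer


# (least original WER, new WER) in percent, copied from the results above.
rows = {
    "M.300": (3.0, 0.9),
    "S.300": (9.1, 1.3),
    "Sakhr.K": (41.8, 29.7),
}
for version, (original, new) in rows.items():
    print(f"{version}: {100 * error_reduction(original, new):.1f}%")
```

The same function yields a negative value whenever correction raises the WER, which is how the negative entries in the full-correction column should be read.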

Conclusion (1/2)
- Text fusion proved to be effective in most cases
- Fusing two OCRed texts of the same image at different resolutions improves text quality
- Omni-font correction proved effective at error reduction
- A trained character error model reduces errors more than a uniform one; however, both yield indistinguishable IR effectiveness

Conclusion (2/2)
- The key to better correction is a well-trained language model
- An integrated system that combines text fusion and error correction achieved a significant reduction in errors:
  - Average Error Reduction = 56%
  - Max. Error Reduction = 86%
  - Max. Theoretical Error Reduction: WER = OOV

Possible Future Directions
- Applying text fusion at the character level instead of the word level
- Trying different implementations of the integrated system
- Using a factored language model
- Repeating all experiments on other types of degraded text, such as ASR output
- Testing the use of a huge amount of data to create a general LM instead of a domain-specific LM

بزاكم الته خيرا جزاكم الله خيرا جزاكم الله خبرا
(Closing slide: "May God reward you with good", shown in one correct and two garbled versions.)
