Fusion of Multiple Corrupted Transmissions and its Effect on Information Retrieval. Walid Magdy, Kareem Darwish, Mohsen Rashwan.


Outline: Motivation; Prior Work; Fusion Definition; Approach; Experimental Setup; Results; Conclusion & Future Work.

Motivation: Many Arabic documents are available only in print form. The need to convert these documents into electronic form has grown since the end of the last century, because searching electronic text is much easier. Arabic OCR accuracy is still much lower than the state of the art for other languages, such as English. Degraded text produced by OCR systems reduces the effectiveness of Information Retrieval (IR). Higher-quality text for Arabic documents is therefore needed to improve IR effectiveness.

Prior Art: Previous work on OCRed text has focused on two main aspects. (1) Improving Information Retrieval effectiveness without improving text quality. Example: query garbling based on a character error model. (2) Improving text quality, which in turn improves IR effectiveness. Example: OCR correction based on a character error model and a language model.

Fusion Definition: OCR turns the clean version of the text S0 (through its image, Simage) into a degraded version Sx = S0 + εx, where εx stands for noisy edit operations. Previous approaches depend on the presence of only one source of degraded text. Our approach assumes the presence of more than one version of the degraded text. Correction: given Sx = S0 + εx, produce S0’ = S0 + ε0’ with ε0’ < εx. Fusion: given S1 = S0 + ε1, S2 = S0 + ε2, …, Sn = S0 + εn, produce S0’ = S0 + ε0’ with ε0’ < min(ε1 … εn).

Approach: The page image is passed to OCR System1 and OCR System2, and a Language Model is used to fuse the two outputs.
Output of OCR System1: ولا ثدي إلا في ألمستدلاله بنور4 ولا حيا4 إلا في رضا4
Output of OCR System2: ولم هدء! إلم في الاستدلال بنوره ولم حياة ملأ في رضا5
Fused output: ولا ثدي إلا في الاستدلال بنوره ولا حياة إلا في رضا5
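The selection step on this slide can be illustrated with a small sketch. It assumes the versions are already word-aligned and uses a hypothetical word-frequency table (`TOY_COUNTS`) in place of the language model; the slides do not describe the actual model or scoring, and the toy sentences are invented for illustration.

```python
def fuse(versions, lm_score):
    """Pick, at each aligned position, the candidate word the scorer prefers.

    versions: list of word lists of equal length (assumed already aligned).
    lm_score: callable (previous_word, word) -> number; higher is better.
    """
    fused = []
    for candidates in zip(*versions):
        previous = fused[-1] if fused else None
        unique = list(dict.fromkeys(candidates))  # drop duplicates, keep order
        fused.append(max(unique, key=lambda word: lm_score(previous, word)))
    return fused


# Hypothetical stand-in for a language model: plain word counts, context ignored.
TOY_COUNTS = {"the": 100, "quick": 10, "brown": 8, "fox": 5}


def toy_lm_score(previous, word):
    return TOY_COUNTS.get(word, 0)


version_1 = ["the", "qu1ck", "brown", "f0x"]
version_2 = ["tbe", "quick", "brovvn", "f0x"]
fused = fuse([version_1, version_2], toy_lm_score)
```

In this toy run the last word stays wrong because both versions share the same error, which is the situation a fusion step cannot repair by selection alone.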

Experimental Setup: Only one OCR system was available, Sakhr Automatic Reader v4. To obtain multiple sources for a given data set, a few pages were selected at random from a book and OCRed, and the output text was manually corrected. The degraded and clean text were used to create a character error model based on 1:1 character mapping. The generated model was then used to garble clean text at different character error rates (CER). The OCRed book used for testing was Zad Alma’ad: eight pages scanned at 300x300 dpi, containing 4,236 words, with a CER of 13.9% and a word error rate (WER) of 36.8%. A clean version of the book was available in electronic form, consisting of 2,730 separate documents, with an associated set of 25 topics and relevance judgments. The language model (LM) was built using a web-mined collection of religious text by Ibn Taymiya, the teacher of the author of Zad Alma’ad. Mean average precision (MAP) was used as the figure of merit for IR results.
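For reference, CER and WER are commonly computed as the edit distance between the degraded and clean text divided by the length of the clean text, over characters and words respectively. The sketch below follows that common definition; the slides do not state the exact procedure used, and the function names are illustrative.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (insert, delete, substitute)."""
    previous_row = list(range(len(hyp) + 1))
    for i, ref_item in enumerate(ref, start=1):
        current_row = [i]
        for j, hyp_item in enumerate(hyp, start=1):
            current_row.append(min(
                previous_row[j] + 1,                           # deletion
                current_row[j - 1] + 1,                        # insertion
                previous_row[j - 1] + (ref_item != hyp_item),  # substitution or match
            ))
        previous_row = current_row
    return previous_row[-1]


def word_error_rate(clean_text, degraded_text):
    ref = clean_text.split()
    return edit_distance(ref, degraded_text.split()) / len(ref)


def char_error_rate(clean_text, degraded_text):
    return edit_distance(clean_text, degraded_text) / len(clean_text)
```

For example, `word_error_rate("the quick brown fox", "the qu1ck brown f0x")` counts two wrong words out of four.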

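Experimental Setup: Generating Synthetic Garbled Data. Take the clean word “قنبلة”. For the character ق, the Character Error Model gives the outputs ق with probability 0.8, ف with 0.1, ت with 0.05, and ن with 0.05. These are laid out as consecutive intervals on [0, 1] with boundaries 0.0, 0.8, 0.9, 0.95, 1 (ق, ف, ت, ن in that order). The Garbler generates a random number for the character; in this example the number is 0.921, which falls in the interval for ت, so قنبلة becomes تنبلة.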
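A minimal sketch of this garbling step follows. Only the probabilities for ق come from the slide; the dictionary layout, the function names, and the pass-through behavior for characters missing from the model are assumptions made for illustration.

```python
import bisect
import itertools
import random

# Clean character -> list of (output character, probability).
# Only this one entry is taken from the slide; a real model would cover every character.
ERROR_MODEL = {
    "ق": [("ق", 0.8), ("ف", 0.1), ("ت", 0.05), ("ن", 0.05)],
}


def garble_char(ch, draw):
    """Return the output character whose cumulative interval contains draw (0 <= draw < 1)."""
    outcomes = ERROR_MODEL.get(ch)
    if outcomes is None:
        return ch  # assumption: characters without a model entry pass through unchanged
    chars = [c for c, _ in outcomes]
    upper_bounds = list(itertools.accumulate(p for _, p in outcomes))
    index = bisect.bisect_right(upper_bounds, draw)
    return chars[min(index, len(chars) - 1)]  # guard against float rounding at the top end


def garble_word(word, rng=random):
    """Garble each character independently using one random draw per character."""
    return "".join(garble_char(ch, rng.random()) for ch in word)
```

Calling `garble_char("ق", 0.921)` reproduces the slide's example, where the draw lands in the interval for ت.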

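Experimental Setup: Generating Synthetic Garbled Data. To produce versions with different error rates, the error probabilities are scaled by k = CERnew / CERorg. Interval boundaries for ق, ف, ت, ن: with k = 1: 0.0, 0.8, 0.9, 0.95, 1; with k = 2: 0.0, 0.6, 0.8, 0.9, 1; with k = 0.5: 0.0, 0.9, 0.95, 0.975, 1.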
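The interval boundaries on this slide are consistent with multiplying every error probability by k and giving the remaining probability mass to the correct character. The sketch below assumes exactly that rule, with the correct character listed first; the function names are illustrative.

```python
import itertools


def scale_error_model(outcomes, k):
    """Scale the error probabilities of one character by k.

    outcomes: list of (character, probability); the first entry is assumed
    to be the correct (unchanged) character, the rest are error outputs.
    """
    correct_char = outcomes[0][0]
    scaled_errors = [(c, p * k) for c, p in outcomes[1:]]
    total_error = sum(p for _, p in scaled_errors)
    if total_error > 1.0:
        raise ValueError("k is too large: error probabilities exceed 1")
    return [(correct_char, 1.0 - total_error)] + scaled_errors


def interval_boundaries(outcomes):
    """Cumulative upper bounds of the intervals, rounded for display."""
    return [round(b, 6) for b in itertools.accumulate(p for _, p in outcomes)]


original = [("ق", 0.8), ("ف", 0.1), ("ت", 0.05), ("ن", 0.05)]
doubled = interval_boundaries(scale_error_model(original, 2))
halved = interval_boundaries(scale_error_model(original, 0.5))
```

With the slide's probabilities for ق, `doubled` and `halved` match the boundaries listed for k = 2 and k = 0.5.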

Experimental Setup: Generated Versions. Error rates for generated versions:

Data set    k      CER      WER      OOV
Original    NA     13.9%    36.8%    20.9%
Model-1     1               36.3%    21.1%
Model-2     0.5    7.0%     20.3%    11.9%
Model-3     0.67   9.3%     26.1%    15.2%
Model-4     1.25   17.4%    43.2%    25.0%
Model-5     2      27.9%    59.2%    33.8%

The CER for Model-1 is missing from the slide text. The slide text also carries a second, near-identical value next to some cells: Model-1 36.4%, Model-2 20.4%, Model-3 25.9%, Model-4 43.3% and 24.9%, Model-5 33.7%.
[Figure: Retrieval results for generated versions, Model-1 to Model-5.]

Results: Fusion Results. [Figure: WER of the text produced by fusing pairs of versions, showing the WER after fusion of both versions and the common errors between versions.]

Results: Retrieval Results. [Figure: MAP for searching the different fused models. Hashed bars mark retrieval results that are statistically significantly better than those of the original degraded versions.]
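MAP, the figure of merit used for these results, is the mean over topics of the average precision of each ranked result list. The sketch below follows that standard definition; the function names and the dictionary-based input layout are assumptions, and the example document ids are invented.

```python
def average_precision(ranked_doc_ids, relevant_doc_ids):
    """Average of the precision values at each rank where a relevant document appears."""
    if not relevant_doc_ids:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant_doc_ids:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant_doc_ids)


def mean_average_precision(runs, judgments):
    """runs: topic -> ranked list of doc ids; judgments: topic -> set of relevant doc ids."""
    scores = [average_precision(runs[topic], judgments[topic]) for topic in judgments]
    return sum(scores) / len(scores)


# Hypothetical two-topic example.
example_runs = {"t1": ["d1", "d2", "d3", "d4"], "t2": ["d5", "d6"]}
example_judgments = {"t1": {"d1", "d3"}, "t2": {"d6"}}
example_map = mean_average_precision(example_runs, example_judgments)
```

Relevant documents that never appear in the ranked list still count in the denominator, so missing them lowers the score.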

Conclusion & Future Work: Text fusion proved to be an effective method for selecting the proper word among candidate words coming from different sources. The effectiveness of text fusion in reducing WER depends on the percentage of error overlap among the different versions. The improvement in information retrieval resulting from text fusion was found to be promising, especially for the few fused versions that are statistically indistinguishable from the clean version. As future work, the fusion technique needs to be tested on real degraded data coming from different sources, which introduces a new challenge: word alignment among the different sources.

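Closing slide: the phrase جزاكم الله خيرا ("May God reward you with good", used as a thank-you), shown together with two corrupted versions of it: بزاكم الته خيرا and جزاكم الله خبرا.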