Extracting Information from Diverse and Noisy Scanned Document Images

Slides:



Advertisements
Similar presentations
Overview of the TAC2013 Knowledge Base Population Evaluation: English Slot Filling Mihai Surdeanu with a lot help from: Hoa Dang, Joe Ellis, Heng Ji, and.
Advertisements

Extracting Names Using Layout Clues in Genealogical Books Aaron Stewart David W. Embley March 20, 2010.
Ancestry OCR Project: Extractors Thomas L. Packer
1 Lecture 35 Brief Introduction to Main AI Areas (cont’d) Overview  Lecture Objective: Present the General Ideas on the AI Branches Below  Introduction.
Information Retrieval in Practice
Semi-Supervised, Knowledge-Based Information Extraction for the Semantic Web Thomas L. Packer Funded in part by the National Science Foundation. 1.
Chapter 11 Beyond Bag of Words. Question Answering n Providing answers instead of ranked lists of documents n Older QA systems generated answers n Current.
Named Entity Recognition for Digitised Historical Texts by Claire Grover, Sharon Givon, Richard Tobin and Julian Ball (UK) presented by Thomas Packer 1.
1 CSC 594 Topics in AI – Applied Natural Language Processing Fall 2009/ Shallow Parsing.
Overview of Search Engines
Statistical Natural Language Processing. What is NLP?  Natural Language Processing (NLP), or Computational Linguistics, is concerned with theoretical.
Information Extraction with Unlabeled Data Rayid Ghani Joint work with: Rosie Jones (CMU) Tom Mitchell (CMU & WhizBang! Labs) Ellen Riloff (University.
CS Machine Learning. What is Machine Learning? Adapt to / learn from data  To optimize a performance function Can be used to:  Extract knowledge.
Search is not only about the Web An Overview on Printed Documents Search and Patent Search Walid Magdy Centre for Next Generation Localisation School of.
Bloom’s Critical Thinking Level 1 Knowledge Exhibits previously learned material by recalling facts, terms, basic concepts, and answers.
CS598CXZ Course Summary ChengXiang Zhai Department of Computer Science University of Illinois, Urbana-Champaign.
Information Paragraphs. What are they? Information paragraphs do exactly that: they provide information to readers about one aspect of a person, place,
Copyright R. Weber Machine Learning, Data Mining ISYS370 Dr. R. Weber.
Accurately and Reliably Extracting Data from the Web: A Machine Learning Approach by: Craig A. Knoblock, Kristina Lerman Steven Minton, Ion Muslea Presented.
Reyyan Yeniterzi Weakly-Supervised Discovery of Named Entities Using Web Search Queries Marius Pasca Google CIKM 2007.
Researcher affiliation extraction from homepages I. Nagy, R. Farkas, M. Jelasity University of Szeged, Hungary.
Machine Learning.
1 A Hierarchical Approach to Wrapper Induction Presentation by Tim Chartrand of A paper bypaper Ion Muslea, Steve Minton and Craig Knoblock.
Presenter: Shanshan Lu 03/04/2010
Content analysis and CERN Roman Chyla. Artificial intelligence Natural language processing Web of data Content analysis.
Informative Speech Scoring Guide Category4321 Body language and rate of speech Uses positive body language including movement and gestures to aid in understanding.
The Clil4U project has been funded with support from the European Commission. This publication reflects the views only of the author, and the Commission.
October 2005CSA3180 NLP1 CSA3180 Natural Language Processing Introduction and Course Overview.
Cost-Effective Information Extraction from Lists in OCRed Historical Documents Thomas Packer and David W. Embley Brigham Young University FamilySearch.
A Scalable Machine Learning Approach for Semi-Structured Named Entity Recognition Utku Irmak(Yahoo! Labs) Reiner Kraft(Yahoo! Inc.) WWW 2010(Information.
Intelligent Database Systems Lab Presenter : Chang,Chun-Chih Authors : Youngjoong Ko, Jungyun Seo 2009, IPM Text classification from unlabeled documents.
Using Semantic Relations to Improve Passage Retrieval for Question Answering Tom Morton.
Domain Adaptation for Biomedical Information Extraction Jing Jiang BeeSpace Seminar Oct 17, 2007.
Data, Information and knowledge Unit 1_1. What have computers ever done for me? By the end of this lesson you will be able to correctly answer this question:
Programming Errors. Errors of different types Syntax errors – easiest to fix, found by compiler or interpreter Semantic errors – logic errors, found by.
The Unreasonable Effectiveness of Data
TWC Illuminate Knowledge Elements in Geoscience Literature Xiaogang (Marshall) Ma, Jin Guang Zheng, Han Wang, Peter Fox Tetherless World Constellation.
Extracting and Ranking Product Features in Opinion Documents Lei Zhang #, Bing Liu #, Suk Hwan Lim *, Eamonn O’Brien-Strain * # University of Illinois.
Information Extraction Entity Extraction: Statistical Methods Sunita Sarawagi.
1 Question Answering and Logistics. 2 Class Logistics  Comments on proposals will be returned next week and may be available as early as Monday  Look.
Overview of Statistical NLP IR Group Meeting March 7, 2006.
Persuasive Letter Scoring Guide Category4321 Audience Demonstrates a clear understanding of the potential reader and uses appropriate vocabulary and arguments.
Information Retrieval in Practice
Queensland University of Technology
Fusion of Multiple Corrupted Transmissions and its effect on Information Retrieval Walid Magdy Kareem Darwish Mohsen Rashwan.
Recent Trends in Text Mining
CSC 594 Topics in AI – Natural Language Processing
Machine Learning overview Chapter 18, 21
Machine Learning overview Chapter 18, 21
Introduction Machine Learning 14/02/2017.
Understanding the Assignment
Introduction to Data Mining
Applications of Text Mining
Introduction to Data Science Lecture 7 Machine Learning Overview
Lecture 24: NER & Entity Linking
CSC 594 Topics in AI – Natural Language Processing
Automatic Wrapper Induction: “Look Mom, no hands!”
Thomas L. Packer BYU CS DEG
Overview of Machine Learning
Adaptive2 Language Model
Extracting Full Names from Diverse and Noisy Scanned Document Images
Family History Technology Workshop
presented by Thomas L. Packer
ListReader: Wrapper Induction for Lists in OCRed Documents
Extracting Information from Diverse and Noisy Scanned Document Images
CSCI 5832 Natural Language Processing
Ping LUO*, Fen LIN^, Yuhong XIONG*, Yong ZHAO*, Zhongzhi SHI^
Entity Linking Survey
Machine Learning overview Chapter 18, 21
Extracting Information from Diverse and Noisy Scanned Document Images
Presentation transcript:

Extracting Information from Diverse and Noisy Scanned Document Images Thomas L. Packer and DEG & NLP Labs 5/24/2019

Goal Extract topic-specific (e.g. genealogy-related) entities and relationships from scanned document images. For document search and knowledge engineering. 5/24/2019

Sub-goals Text Document Reconstruction (incl. OCR) Information Extraction 5/24/2019

Initial Approach Use existing commercial OCR engine output (provided by Ancestry.com) Use existing information extraction techniques Hand-written regex and CFG rules and dictionaries MEMM and CRF trained from CoNLL NER data 5/24/2019

Results of Initial Approach BOPCRIS was getting 70-80% F-measure on their data. 5/24/2019

Challenge 1 Existing OCR lost punctuation and formatting. 5/24/2019

Proposed Solution to Challenge 1 Re-OCR, preserve punctuation and formatting. 5/24/2019

Challenge 2 Diverse document structure Line breaks in one document or paragraph mean something different than what they do in another. 5/24/2019

Proposed Solution to Challenge 2 Adaptive format interpretation (Grammar Induction) Categorize lines by inferred tab stops (RANSAC). Propose multiple candidate groupings. Score candidates using language model, extraction quality or pattern similarity/compressibility. What I’d most like to consider is scoring indirectly by an extraction grammar applicability score to fuel semi-supervised grammar induction as part of bootstrapping. 5/24/2019

Challenge 3 OCR and other character-level noise 5/24/2019

Proposed Solution 1 to Challenge 3 GivenName  “John” | “Margaret” Surname  “Steele” | “Anderson” PersonName  GivenName | GivenName Surname ParentChildRelationship  PersonName ChildKeyWord PersonName “Post-Terminal” CFG Rules “Steele”  “S t e e l e” MergedTokens  (SingleCharToken “ ”) {2-20} “Anderson”  “Ander- son” MergedTokens  AnyToken “-” “ ” AnyToken “Anderson”  “Andersor” NoiseModelMappedToken  AnyToken 5/24/2019

Proposed Solution 2 to Challenge 3 Develop and compose weighted finite state transducers for: Language modeling / OCR error correction Information extraction Plug into the OCRopus OCR engine. 5/24/2019

Challenge 4 More (labeled) data needed: Training data (supervised ML) Evaluation data (statistical significance) Knowledge sources (dictionaries, etc.) 5/24/2019

Proposed Solutions to Challenge 4 Get more data and label it. Use semi-supervised and indirectly-supervised ML: Bootstrapping from a few labeled examples and/or a few rules + lots of unlabeled data. 5/24/2019

Questions? Feedback? 5/24/2019