Extracting Full Names from Diverse and Noisy Scanned Document Images

Slides:



Advertisements
Similar presentations
Extracting Names Using Layout Clues in Genealogical Books Aaron Stewart David W. Embley March 20, 2010.
Advertisements

1/1/ A Knowledge-based Approach to Citation Extraction Min-Yuh Day 1,2, Tzong-Han Tsai 1,3, Cheng-Lung Sung 1, Cheng-Wei Lee 1, Shih-Hung Wu 4, Chorng-Shyong.
Ancestry OCR Project: Extractors Thomas L. Packer
Overview of PubWEST Patent and Trademark Depository Library Training Seminar April 2006.
Computer Science Research for Family History and Genealogy David W. Embley Heath Nielson, Mike Rimer, Luke Hutchison, Ken Tubbs, Doug Kennard, Tom Finnigan.
IN350 Document Management & Information Steering Introduction to Document Management. Class 1 August 25, 2003 Judith A. Molka-Danielsen
Domain-Independent Data Extraction: Person Names Carl Christensen and Deryle Lonsdale Brigham Young University
Sunita Sarawagi.  Enables richer forms of queries  Facilitates source integration and queries spanning sources “Information Extraction refers to the.
Semi-Supervised, Knowledge-Based Information Extraction for the Semantic Web Thomas L. Packer Funded in part by the National Science Foundation. 1.
Chapter 11 Beyond Bag of Words. Question Answering n Providing answers instead of ranked lists of documents n Older QA systems generated answers n Current.
Named Entity Recognition for Digitised Historical Texts by Claire Grover, Sharon Givon, Richard Tobin and Julian Ball (UK) presented by Thomas Packer 1.
IR & Metadata. Metadata Didn’t we already talk about this? We discussed what metadata is and its types –Data about data –Descriptive metadata is external.
Information Extraction and Ontology Learning Guided by Web Directory Authors:Martin Kavalec Vojtěch Svátek Presenter: Mark Vickers.
Toward Automatic Processing and Indexing of Microfilm.
Properties of Text CS336 Lecture 3:. 2 Information Retrieval Searching unstructured documents Typically text –Newspaper articles –Web pages Other documents.
1 Information Retrieval and Extraction 資訊檢索與擷取 Chia-Hui Chang, Assistant Professor Dept. of Computer Science & Information Engineering National Central.
Recognizing Records from the Extracted Cells of Genealogical Microfilm Tables Kenneth Martin Tubbs Jr. A Thesis Submitted to the Faculty of Brigham Young.
Database Design IST 7-10 Presented by Miss Egan and Miss Richards.
Statistical Natural Language Processing. What is NLP?  Natural Language Processing (NLP), or Computational Linguistics, is concerned with theoretical.
Learning Table Extraction from Examples Ashwin Tengli, Yiming Yang and Nian Li Ma School of Computer Science Carnegie Mellon University Coling 04.
Information Retrieval in Practice
Processing of large document collections Part 10 (Information extraction: multilingual IE, IE from web, IE from semi-structured data) Helena Ahonen-Myka.
Comparative study of various Machine Learning methods For Telugu Part of Speech tagging -By Avinesh.PVS, Sudheer, Karthik IIIT - Hyderabad.
Information Extraction: Distilling Structured Data from Unstructured Text. -Andrew McCallum Presented by Lalit Bist.
INTRODUCTION TO RESEARCH. Learning to become a researcher By the time you get to college, you will be expected to advance from: Information retrieval–
Mining Topic-Specific Concepts and Definitions on the Web Bing Liu, etc KDD03 CS591CXZ CS591CXZ Web mining: Lexical relationship mining.
BAA - Big Mechanism using SIRA Technology Chuck Rehberg CTO at Trigent Software and Chief Scientist at Semantic Insights™
EXPLOITING DYNAMIC VALIDATION FOR DOCUMENT LAYOUT CLASSIFICATION DURING METADATA EXTRACTION Kurt Maly Steven Zeil Mohammad Zubair WWW/Internet 2007 Vila.
Cost-Effective Information Extraction from Lists in OCRed Historical Documents Thomas Packer and David W. Embley Brigham Young University FamilySearch.
A Scalable Machine Learning Approach for Semi-Structured Named Entity Recognition Utku Irmak(Yahoo! Labs) Reiner Kraft(Yahoo! Inc.) WWW 2010(Information.
Oct. 12, 2007 STEV 2007, Portland OR A Scriptable, Statistical Oracle for a Metadata Extraction System Kurt J. Maly, Steven J. Zeil, Mohammad Zubair, Ashraf.
“Automating Reasoning on Conceptual Schemas” in FamilySearch — A Large-Scale Reasoning Application David W. Embley Brigham Young University More questions.
Information Retrieval
4. Relationship Extraction Part 4 of Information Extraction Sunita Sarawagi 9/7/2012CS 652, Peter Lindes1.
Shallow Parsing for South Asian Languages -Himanshu Agrawal.
ONLINE SEARCH AND REDACTION SYSTEM Many concepts of digitalization which aim is to present datas on internet are faced with two main subjects and problems:
CS 4705 Lecture 17 Semantic Analysis: Robust Semantics.
The Semantic Web. What is the Semantic Web? The Semantic Web is an extension of the current Web in which information is given well-defined meaning, enabling.
AUTONOMOUS REQUIREMENTS SPECIFICATION PROCESSING USING NATURAL LANGUAGE PROCESSING - Vivek Punjabi.
Cost-effective Ontology Population with Data from Lists in OCRed Historical Documents Thomas L. Packer David W. Embley HIP ’13 BYU CS 1.
Chapter Three Presentation: User interface How to Build a Digital Library Ian H. Witten and David Bainbridge.
Study Skills Taking control of your reading. Useful Tips Try to recognize the key features of a text, this helps you read quicker. Try to indentify appropriate.
Using Human Language Technology for Automatic Annotation and Indexing of Digital Library Content Kalina Bontcheva, Diana Maynard, Hamish Cunningham, Horacio.
WEB STRUCTURE MINING SUBMITTED BY: BLESSY JOHN R7A ROLL NO:18.
Indian Community Languages Schools Parents and Teachers Conference July 2017.
Sentiment analysis algorithms and applications: A survey
Machine Learning overview Chapter 18, 21
Machine Learning overview Chapter 18, 21
Types of Search Questions
Data and Applications Security Developments and Directions
Discovery Learning by Investigation
Vision for an Automatically Constructed FH-WoK
Automatic Wrapper Induction: “Look Mom, no hands!”
Thomas L. Packer BYU CS DEG
CSCI 5832 Natural Language Processing
Family History Technology Workshop
Introduction to Information Retrieval
presented by Thomas L. Packer
ListReader: Wrapper Induction for Lists in OCRed Documents
Extracting Information from Diverse and Noisy Scanned Document Images
A Green Form-Based Information Extraction System for Historical Documents Tae Woo Kim No HD. I’m glad to present GreenFIE today. A Green Form-…
Family History - Getting Started
PolyAnalyst Web Report Training
Extracting Information from Diverse and Noisy Scanned Document Images
Ping LUO*, Fen LIN^, Yuhong XIONG*, Yong ZHAO*, Zhongzhi SHI^
Machine Learning overview Chapter 18, 21
Dan Roth Department of Computer Science
presented by Thomas L. Packer
Extracting Information from Diverse and Noisy Scanned Document Images
Module 2 - Xtrata Pro Product Overview Module 2 – Product Overview
Presentation transcript:

Extracting Full Names from Diverse and Noisy Scanned Document Images Thomas L. Packer, BYU Noisy OCR Group, Lee Jensen, Ancestry.com, Inc. 2/2/2019

Lots of Old, Inaccessible Printed Information How can they reach the information age / digital age? For document search and knowledge engineering. 2/2/2019

Scanning: Good for Web Distribution Brings them to the web: multiple, remote viewers. Don’t want to read all of it to find information. 2/2/2019

OCR: Good for Keyword Search Key-word search. OCR errors Want: OCR error correction I’m looking for a human, not a word. Search for entities like humans – people names – not words Example: Packers as an occupation. (notes) 2/2/2019

Process Take scanned document images. Extract topic-specific information, e.g. person names, event and relationships. Goal: Fielded search or record linkage. For document discoverability and knowledge engineering. Discoverability: search and browsing with metadata. 2/2/2019

Sub-Processes Document Image Analysis (incl. OCR) Information Extraction Evaluate with respect to OCR and then image. (notes) 2/2/2019

Challenges with Diversity Diverse document structure, little training data. City directories, church year books, family histories, local histories, navy cruise books, newspapers. Line breaks in one document or paragraph mean something different than what they do in another. Name order. 2/2/2019

Challenges with Images and OCR evening when your aching bones are crying for a respite from the Charles son of hn Ban was married twice had oneaught heat and rush of existence when the cooling zephers of the night are by first and no issue y second wife spent a brow like large part of his playing on your fevered the cosmic r Balm of Gilead life in quo Quebec and died a calm clear beneficent sky e Ontario but in effulgent in the lights of heaven Mixed columns, no line boundaries, missing punctuation, 2/2/2019

Our Approach Use existing commercial OCR engine output (provided by Ancestry.com) Use existing IE techniques: Hand-written grammar rules and dictionaries Statistical sequence models (machine learning) Ensembles 2/2/2019

Rule-Based Extractors Dictionary: Contiguous Dictionary Matches Regex: Regular Expressions CFG: Context Free Grammar and Chart Parser Rules tested on dev test set. 2/2/2019

Machine Learning Based Extractors MEMM: Maximum Entropy Markov Model CRF: Conditional Random Field Other … lived in Virginia. … Stone Phillips of NBC … Conoco Phillips hired Joe. Stone walls stand well. lived in of walls … Given Name Surname Joe Virginia Stone Phillips 2/2/2019

Characteristic Outputs and Ensemble OCR Output John Lilbby Image-Based Standard John Libby Dictionary John Regex MEMM Ensemble (Majority) 2/2/2019

Results F-Measure of Full-Names (green) and Full-Name Fragments (blue) on Blind Test, 10 Documents, 490 Names Dictionary is so simple and good enough compared. Rule-based does well. ML does help when adding to an ensemble. 2/2/2019

Conclusions & Future Work New research group, learned a lot. More to come… Public data set. Page layout analysis. Ambiguous cases: Generate and test hypotheses. Combine strengths of hand-coding and machine learning: Apply specialized extractors to tabular and unstructured text Semi-supervised machine learning of HMMs with semantic constraints. 2/2/2019

Questions? 2/2/2019