Extracting Information from Diverse and Noisy Scanned Document Images

Presentation transcript:

Extracting Information from Diverse and Noisy Scanned Document Images Thomas L. Packer and DEG & NLP Labs 6/22/2019

Lots of Analog Text and Cameras

Lots of Old Printed Information: How can it reach the information age (the digital age)? Needed for document search and knowledge engineering.

Scanning: Are we there yet? Scanning brings the documents to the web, where multiple remote viewers can see them. http://seadragon.com/showcase/chris-jordan/#office-paper

OCR: Are we there yet? So far: keyword search, with OCR errors. Want: OCR error correction; search for entities, not words.

Process: Extract topic-specific (e.g., genealogy-related) entities and relationships from scanned document images, for document discoverability and knowledge engineering. Discoverability: search and browsing with metadata.

Sub-Processes: document image analysis (incl. OCR); information extraction.

Initial Approach: Use existing commercial OCR engine output (provided by Ancestry.com). Use existing IE techniques: hand-written grammar rules and dictionaries; MEMM and CRF trained from CoNLL NER data.

Initial IE Approaches: Full Names. (1) Dictionary: contiguous dictionary matches. (2) Regex: regular expressions. (3) CFG: context-free grammar and chart parser (plus other entities and relations). (4) MEMM: maximum entropy Markov model. (5) CRF: conditional random field. (6) Ensemble: majority vote (input from 1, 2, 4).
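The dictionary, regex, and majority-vote ideas on this slide can be sketched at token level as follows. This is a minimal illustration, not the presenter's implementation: the tiny name lists, the capitalization pattern, the `label_runs` helper, and the hard-coded stand-in for MEMM output are all hypothetical.

```python
import re

# Hypothetical toy dictionaries; a real system would load large name lists.
GIVEN_NAMES = {"john", "mary", "thomas"}
SURNAMES = {"smith", "packer", "embley"}
NAME_WORDS = GIVEN_NAMES | SURNAMES


def label_runs(flags, min_len=2):
    """Label tokens 1 when they lie in a run of at least min_len True flags."""
    labels = [0] * len(flags)
    start = None
    # A trailing False sentinel closes a run that reaches the last token.
    for i, flag in enumerate(list(flags) + [False]):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            if i - start >= min_len:
                labels[start:i] = [1] * (i - start)
            start = None
    return labels


def dictionary_labels(tokens):
    """Contiguous dictionary matches: runs of two or more dictionary words."""
    return label_runs([t.lower().strip(".,;") in NAME_WORDS for t in tokens])


def regex_labels(tokens):
    """Assumed regex heuristic: runs of two or more capitalized tokens."""
    pattern = r"[A-Z][a-z]+[.,;]?"
    return label_runs([bool(re.fullmatch(pattern, t)) for t in tokens])


def majority_vote(*label_sequences):
    """A token is part of a name if more than half of the extractors say so."""
    return [1 if 2 * sum(votes) > len(votes) else 0
            for votes in zip(*label_sequences)]


tokens = "Mr. John Smith married Mary Jones in Boston".split()
# Stand-in for the output of a trained MEMM, which is out of scope here.
memm_labels = [0, 1, 1, 0, 1, 1, 0, 0]
ensemble = majority_vote(dictionary_labels(tokens), regex_labels(tokens), memm_labels)
```

Here the dictionary misses "Mary Jones" because "Jones" is not listed, and the regex wrongly includes "Mr.", but the vote keeps only the tokens that at least two of the three extractors agree on.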

Results of Initial Approach: All-or-none (green) and partial-credit (blue) F-measure for full names, on a blind test of a dozen different documents. BOPCRIS was getting 70-80% F-measure on their data. The dictionary approach is simple, yet good enough by comparison.
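The two scoring schemes can be illustrated with a short sketch. The slide does not define how partial credit is computed, so the token-overlap definition below is an assumption, as are the function names and the `(start, end)` token-span representation.

```python
def f_measure(true_positives, n_predicted, n_gold):
    """Harmonic mean of precision and recall; 0.0 when nothing is correct."""
    if true_positives == 0:
        return 0.0
    precision = true_positives / n_predicted
    recall = true_positives / n_gold
    return 2 * precision * recall / (precision + recall)


def all_or_none_f(predicted, gold):
    """A predicted span counts only if it matches a gold span exactly."""
    predicted, gold = set(predicted), set(gold)
    return f_measure(len(predicted & gold), len(predicted), len(gold))


def partial_credit_f(predicted, gold):
    """Assumed partial-credit scheme: score overlap token by token."""
    pred_tokens = {i for start, end in predicted for i in range(start, end)}
    gold_tokens = {i for start, end in gold for i in range(start, end)}
    return f_measure(len(pred_tokens & gold_tokens),
                     len(pred_tokens), len(gold_tokens))


# Spans are (start, end) token offsets with an exclusive end.
gold = [(1, 3), (4, 6)]
predicted = [(1, 3), (4, 5)]  # second name is truncated by one token
strict = all_or_none_f(predicted, gold)
lenient = partial_credit_f(predicted, gold)
```

With one exact match out of two names, the all-or-none score is 0.5, while the token-level score rewards the truncated name for the token it did recover.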

Challenges: Diverse document structure, little training data. Line breaks in one document or paragraph can mean something different from line breaks in another.

Proposed Solution, Step One: Adaptive page structure interpretation. Categorize lines by inferred tab stops (RANSAC). Propose multiple candidate line groupings. Induce an extractor (e.g., HMM) for each candidate. Score candidate grammars against prior knowledge constraints (ontology) and simplicity (MDL) or data likelihood.
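One way to read "categorize lines by inferred tab stops (RANSAC)" is sketched below. The slide does not specify the model, so this assumes each line is reduced to the x-coordinate of its left edge, that a tab stop is a single x position, and that stops are found by sequential RANSAC (fit one stop, remove its inliers, repeat). The function names, tolerance, iteration count, and sample coordinates are hypothetical.

```python
import random


def ransac_tab_stops(xs, tol=3.0, iters=50, min_inliers=2, seed=0):
    """Find tab stops by repeatedly fitting the position with most inliers."""
    rng = random.Random(seed)
    remaining = list(xs)
    stops = []
    while len(remaining) >= min_inliers:
        best = []
        for _ in range(iters):
            # Minimal sample: one point defines a candidate tab stop.
            candidate = rng.choice(remaining)
            inliers = [x for x in remaining if abs(x - candidate) <= tol]
            if len(inliers) > len(best):
                best = inliers
        if len(best) < min_inliers:
            break  # only isolated outliers are left
        # Refit the stop as the mean of its consensus set.
        stops.append(sum(best) / len(best))
        remaining = [x for x in remaining if x not in best]
    return sorted(stops)


def categorize_lines(xs, stops):
    """Assign each line the index of its nearest tab stop."""
    return [min(range(len(stops)), key=lambda k: abs(x - stops[k]))
            for x in xs]


# Hypothetical left-edge x-coordinates (in pixels) of ten OCR text lines.
line_starts = [10, 40, 11, 41, 70, 10, 12, 39, 71, 11]
stops = ransac_tab_stops(line_starts)
categories = categorize_lines(line_starts, stops)
```

The resulting category per line (for example, flush-left versus indented) could then serve as a feature when proposing candidate line groupings.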

Proposed Solution, Step Two: Semi-supervised training of multiple multi-level HMMs. Input semantic constraints and domain knowledge from an ontology. Optionally label a few page images. Build HMM structure from ontology constraints. Estimate HMM parameters (probabilities) using modified Viterbi and EM. Bootstrapping: feed acquired knowledge back among the multiple HMMs.
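For reference, standard Viterbi decoding for a small HMM looks like the sketch below. It is the textbook algorithm, not the modified variant the slide mentions, and the two-state model, its probabilities, and the `cap`/`low` observation features are made-up values for illustration only.

```python
import math


def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most probable state sequence for obs (log-space Viterbi)."""
    # scores[t][s]: best log-probability of any path ending in state s at time t.
    scores = [{s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]])
               for s in states}]
    back = []  # back[t - 1][s]: best predecessor of state s at time t
    for o in obs[1:]:
        prev = scores[-1]
        row, ptr = {}, {}
        for s in states:
            best_prev = max(states,
                            key=lambda p: prev[p] + math.log(trans_p[p][s]))
            row[s] = (prev[best_prev] + math.log(trans_p[best_prev][s])
                      + math.log(emit_p[s][o]))
            ptr[s] = best_prev
        scores.append(row)
        back.append(ptr)
    # Follow back-pointers from the best final state.
    path = [max(states, key=lambda s: scores[-1][s])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]


# Hypothetical model: is a token part of a NAME or other text (O)?
states = ["O", "NAME"]
start_p = {"O": 0.8, "NAME": 0.2}
trans_p = {"O": {"O": 0.8, "NAME": 0.2}, "NAME": {"O": 0.4, "NAME": 0.6}}
emit_p = {"O": {"cap": 0.1, "low": 0.9}, "NAME": {"cap": 0.9, "low": 0.1}}

# Observations: capitalized ("cap") or lowercase ("low") tokens.
best_path = viterbi(["low", "cap", "cap", "low"], states, start_p, trans_p, emit_p)
```

In a semi-supervised setting, decoded paths like this one can supply labels for re-estimating the transition and emission probabilities, which is the role EM plays on the slide.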

Questions?

System Name: SOCRATES (Synergistic OCR-based Acquisition of Textualized Entities and other Stuff)