Populating Ontologies with Data from Lists in Family History Books
Thomas L. Packer, David W. Embley
2013.03, RT.FHTW, BYU CS


Populating Ontologies with Data from Lists in Family History Books
Thomas L. Packer, David W. Embley
RT.FHTW, BYU CS

What’s the challenge?

What “rich data” is found in lists?
1. Lexical vs. non-lexical: Name(“Elias”) vs. Person(p1)
2. Arbitrary relationship arity: Husband-married-Wife-in-Year(p1, p2, “1702”)
3. Arbitrary ontology path lengths
4. Functional and optional constraints: Person-Birth() vs. Person-Marriage()
5. Generalization-specialization class hierarchies (with inheritance): Child(p3) ⇒ Person(p3), Parent(p2) ⇒ Person(p2)

What’s the value?

What’s been done already?
[Coverage matrix: prior wrapper-induction work (Blanco, Dalvi, Gupta, Carlson, Heidorn, Chang, Crescenzi, Lerman, Chidlovskii, Kushmerick, Thomas, Adelberg) rated on four dimensions: general lists, noise tolerance, rich data, and effort scalability; ● = well covered, 0.0 = not covered]

What’s our contribution? ListReader Mappings
A formal correspondence among:
- Populated ontologies (predicates)
- Inline annotated text (labels)
- List wrappers (grammars)
- Data entry (forms)

What’s our contribution? ListReader Wrapper Induction
- Low-cost wrapper induction: semi-supervised + active learning
- Decreasing-cost wrapper induction: self-supervised + active learning

Cheap Training Data

Automatic Mapping
Child(p1)
Person(p1)
Child-ChildNumber(p1, “1”)
Child-Name(p1, n1)
…

Semi-supervised Induction
Working records (OCR): 1. Andy b. 1816; Becky Beth h, …; i Charles Conrad …
1. Initialize from the first user-labeled record (C = child number, FN = first name, BD = birth date):
   \n(1)\. (Andy) b\. (1816)\n
2. Generalize:
   \n([\dlio])[.,] (\w{4}) [bh][.,] ([\dlio]{4})\n
3. Alignment-search over wrapper edits:
   - Deletion (✗ no match)
   - Insertion of an unknown field (C FN Unknown BD): \n([\dlio])[.,] (\w{4,5}) (\S{1,10}) [bh][.,] ([\dlio]{4})\n
   - Expansion: \w{4} → \w{4,5}
4. Evaluate each candidate by edit similarity × match probability (one match vs. no match)
5. Active learning: the user labels the unknown field as MN (middle name):
   \n([\dlio])[.,] (\w{4,5}) (\w{4}) [bh][.,] ([\dlio]{4})\n
6. Extract from the remaining records. Many more …
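The initialize-and-generalize idea above can be sketched in Python. This is a simplified illustration, not ListReader's implementation: the record format and the `generalize` helper are assumptions; only the example record and the OCR-confusion class `[\dlio]` (digits misread as l, i, or o) follow the slide.

```python
import re

# OCR often misreads digits as l, i, or o, so a noise-tolerant
# digit class accepts those characters too (as on the slide).
DIGIT = r"[\dlio]"

def generalize(record):
    """Turn one labeled record like '1. Andy b. 1816' into a
    generalized wrapper regex (a simplified sketch)."""
    # Fields: C = child number, FN = first name, BD = birth date.
    m = re.match(r"(\d+)\. (\w+) b\. (\d{4})$", record)
    num, name, year = m.groups()
    # Replace each literal field with a tolerant character class.
    return (r"\n(%s)[.,] (\w{%d}) [bh][.,] (%s{4})\n"
            % (DIGIT, len(name), DIGIT))

wrapper = generalize("1. Andy b. 1816")
# The generalized wrapper matches other records despite OCR noise,
# e.g. 'i8i6' for 1816 and ',' misread for '.'.
match = re.search(wrapper, "\ni, Andy b, i8i6\n")
```

The deletion/insertion/expansion edits of the alignment-search step would then perturb this regex one component at a time.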

Alignment-Search
Start state: wrapper token sequence A B C E F G; goal state: A B C′ D E F (via intermediate states such as A B C′ E F G, A B C E F G H, A B C′ E F).
Branching factor = 6 × 4 = 24; tree depth = 3.
This search space size ≈ 24^3 = 13,824.
Other search space sizes ≈ (12 × 4)^7 = 48^7 = 587,068,342,272. And many more …

A* Alignment-Search
Same alignment, guided by f(s) = g(s) + h(s) (e.g. f = 3 = 1 + 2); A* never traverses pruned branches.
Branching factor = 2 × 4 = 8; tree depth = 3.
This search space size ≈ 10 (hard and soft constraints), instead of ≈ 8^3 = 512 (hard constraints only) or 13,824 (no constraints).
Other search space sizes ≈ 1,000 instead of 587,068,342,272.
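The f(s) = g(s) + h(s) search above can be sketched as A* over edit operations. A minimal sketch under assumptions: single tokens stand in for wrapper components, and the heuristic is the remaining-length difference (admissible); ListReader's actual constraints and costs are richer than this.

```python
import heapq

def astar_align(wrapper, text):
    """A* search for the cheapest edit script turning the wrapper
    token sequence into the observed token sequence.
    g = edits so far; h = |remaining length difference|, an
    admissible estimate, giving f(s) = g(s) + h(s) as on the slide."""
    h = lambda i, j: abs((len(wrapper) - i) - (len(text) - j))
    # Heap entries: (f, g, (wrapper position, text position), edit script)
    frontier = [(h(0, 0), 0, (0, 0), [])]
    seen = {}
    while frontier:
        f, g, (i, j), script = heapq.heappop(frontier)
        if (i, j) in seen and seen[(i, j)] <= g:
            continue  # already reached this state at least as cheaply
        seen[(i, j)] = g
        if i == len(wrapper) and j == len(text):
            return g, script
        moves = []
        if i < len(wrapper) and j < len(text):
            cost = 0 if wrapper[i] == text[j] else 1
            moves.append((cost, (i + 1, j + 1),
                          "keep" if cost == 0 else "substitute"))
        if i < len(wrapper):
            moves.append((1, (i + 1, j), "delete"))  # drop a wrapper token
        if j < len(text):
            moves.append((1, (i, j + 1), "insert"))  # new field in the text
        for cost, nxt, op in moves:
            g2 = g + cost
            heapq.heappush(frontier, (g2 + h(*nxt), g2, nxt, script + [op]))
    return None

# The slide's example: align A B C E F G against A B C' D E F.
cost, ops = astar_align(list("ABCEFG"), ["A", "B", "C'", "D", "E", "F"])
```

The optimal script substitutes C with C', inserts D, and deletes G, at total cost 3.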

Self-supervised Induction
- No additional labeling required
- Limited additional labeling via active learning

Why is this approach promising?
[Charts: Semi-supervised Regex Induction vs. CRF; Self-supervised Regex Induction vs. CRF]
30 lists | 137 records | ~10 fields per list
Statistically significant at p < 0.01 using both a paired t-test and McNemar’s test.

What next?
- Improve time, space, and accuracy with HMM wrappers
- Expand the class of input lists

Conclusions
- Reduces ontology population to sequence labeling
- Induces a wrapper with a single click per field
- Noise tolerant and accurate


Typical Ontology Population

Why not Apply Web Wrapper Induction to OCR Text?
- Noise tolerance: allowing character variations increases recall but decreases precision
- Populates only the simplest ontologies
- Problems with the wrapper language:
  - Left-right context (WIEN, Kushmerick 2000)
  - Disjunction rules and an FSA model to traverse landmarks along a tree structure (Stalker, SoftMealy)
  - XPath (Dalvi 2009, etc.)
  - CRF (Gupta 2009)

Why not use left-right context?
- Field boundaries
- Field position and character content
- Record boundaries
[Figure: an OCRed list]

Why not use XPaths?
- OCR text has no explicit XML DOM tree structure
- XPaths require HTML tags that perfectly mark the field text

Why not use (Gupta’s) CRFs?
- HTML lists and records are explicitly marked
- Different application: augmenting tables using tuples from any lists on the web
- At web scale, they can throw away harder-to-process lists
- They rely on more training data than we will
- We will compare our approach to CRFs

Page Grammars
- Conway [1993]: a 2-D CFG and chart parser for page-layout recognition from document images
- Can assign logical labels to blocks of text
- Manually constructed grammars
- Rely on spatial features

Semi-supervised Regex Induction


List Reading
- Specialized for one kind of list:
  - Printed tables of contents: Marinai 2010, Dejean 2009, Lin 2006
  - Printed bibliographies: Besagni 2004, Besagni 2003, Belaid 2001
  - HTML lists: Elmeleegy 2009, Gupta 2009, Tao 2009, Embley 2000, Embley 1999
- Use specialized hand-crafted knowledge
- Rely on clean input text containing useful HTML structure or tags
- NER or flat attribute extraction: limited ontology population
- Omit one or more reading steps

Research Project
Child(child1)
Child-ChildNumber(child1, “1”)
Child-Name(child1, name1)
Name-GivenName(name1, “Sarah”)
Child-BirthDate(child1, date1)
BirthDate-Year(date1, “1797”)

Wrapper Induction for Printed Text
- Adelberg 1998:
  - Grammar induction for any structured text
  - Not robust to OCR errors
  - No empirical evaluation
- Heidorn 2008:
  - Wrapper induction for museum specimen labels
  - Not typical lists
- Supervised: will not scale well
- Entity-attribute extraction: limited ontology population

Semi-supervised Wrapper Induction
Child(child1)
Child-ChildNumber(child1, “1”)
Child-Name(child1, name1)
Name-GivenName(name1, “Sarah”)
Child-BirthDate(child1, date1)
BirthDate-Year(date1, “1797”)

Construct Form, Label First Record
1. Sarah, b. 1797

Wrapper Generalization
[Diagram: wrapper fragments Child.BirthDate.Year (with delimiter patterns such as “, . b/h” and “, . . b \n”) and an unknown field “??.??” aligned against the OCRed list: 1. Sarah, b … / Amy, h. 1799, d. … / i John Erastus, b. 1836, d …]

Wrapper Generalization (continued)
[Diagram: the unknown “??.??” field is resolved to Child.DeathDate.Year (delimiter pattern “, . . d \n”) alongside Child.BirthDate.Year, over the same OCRed list]

Wrapper Generalization as Beam Search
1. Initialize the wrapper from the first record
2. Apply a predefined set of wrapper adjustments
3. Score the alternate wrappers with:
   - “Prior” (how much the wrapper resembles known list structure)
   - “Likelihood” (how well it matches the next text)
4. Add the best to the wrapper set
5. Repeat until the end of the list
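The five steps above can be sketched as a small beam search. This is an illustrative sketch, not ListReader's implementation: the `adjustments`, `prior`, and `likelihood` hooks below are toy stand-ins for the real adjustment set and scoring model.

```python
import re

def beam_search_generalize(wrapper, lines, adjustments, prior,
                           likelihood, beam_width=3):
    """Generalize a regex wrapper over a list one line at a time,
    keeping only the best few candidate wrappers."""
    beam = [wrapper]                       # 1. initialize from first record
    for line in lines:
        candidates = set(beam)
        for w in beam:                     # 2. apply wrapper adjustments
            candidates.update(adjustments(w))
        scored = sorted(                   # 3. score by prior * likelihood
            candidates,
            key=lambda w: prior(w) * likelihood(w, line),
            reverse=True)
        beam = scored[:beam_width]         # 4. keep the best wrappers
    return beam[0]                         # 5. done at the end of the list

# Toy adjustment: widen one \w{n} field to \w{n,n+1}.
def adjustments(w):
    m = re.search(r"\\w\{(\d+)\}", w)
    if not m:
        return []
    n = int(m.group(1))
    return [w.replace(m.group(0), r"\w{%d,%d}" % (n, n + 1))]

# Toy scores: shorter wrappers are "more typical"; matching lines score high.
prior = lambda w: 1.0 / (1 + len(w))
likelihood = lambda w, line: 1.0 if re.search(w, line) else 0.1

best = beam_search_generalize(r"(\d)\. (\w{4}) b\. (\d{4})",
                              ["2. Becky b. 1818"], adjustments,
                              prior, likelihood)
```

On the sample line, the widened field `\w{4,5}` wins because the original wrapper cannot match the five-letter name "Becky".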

Mapping Sequential Labels to Predicates
Labeled record: 1. Sarah, b. 1797
Labels: Child.ChildNumber | Child.Name.GivenName | Child.BirthDate.Year
Predicates:
Child(child1)
Child-ChildNumber(child1, “1”)
Child-Name(child1, name1)
Name-GivenName(name1, “Sarah”)
Child-BirthDate(child1, date1)
BirthDate-Year(date1, “1797”)
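The label-to-predicate mapping can be sketched as follows. A simplified sketch of one reasonable mapping, not ListReader's exact output: the per-class object counters and the emitted unary predicates (e.g. Name(name1)) are my assumptions.

```python
def labels_to_predicates(fields):
    """Map (dotted label, value) pairs produced by a wrapper into
    ontology predicates: each non-lexical path prefix becomes an
    object, each edge becomes a binary predicate, and the final
    step attaches the lexical value."""
    predicates, objects, counts = [], {}, {}
    for label, value in fields:
        parts = label.split(".")
        # Ensure an object exists for every non-lexical prefix.
        for k in range(1, len(parts)):
            prefix = ".".join(parts[:k])
            if prefix not in objects:
                cls = parts[k - 1]
                counts[cls] = counts.get(cls, 0) + 1
                obj = "%s%d" % (cls.lower(), counts[cls])
                objects[prefix] = obj
                predicates.append("%s(%s)" % (cls, obj))
                if k > 1:  # link the new object to its parent object
                    parent = objects[".".join(parts[:k - 1])]
                    predicates.append("%s-%s(%s, %s)"
                                      % (parts[k - 2], cls, parent, obj))
        # Attach the lexical value to its immediate parent object.
        parent = objects[".".join(parts[:-1])]
        predicates.append('%s-%s(%s, "%s")'
                          % (parts[-2], parts[-1], parent, value))
    return predicates

preds = labels_to_predicates([
    ("Child.ChildNumber", "1"),
    ("Child.Name.GivenName", "Sarah"),
    ("Child.BirthDate.Year", "1797"),
])
```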

Weakly Supervised Wrapper Induction
1. Apply wrappers and ontologies
2. Spot lists by repeated patterns
3. Find the best ontology fragments for the best-labeled record
4. Generalize the wrapper:
   - Both above and below
   - Active learning without human input

Knowledge from Previously Wrapped Lists
Child.ChildNumber
Child.Name.GivenName
Child.BirthDate.Year (delimiters “,;.b\n”)
Child.DeathDate.Year (delimiters “;.dm”)
Child.Spouse.Name.GivenName (delimiters “..\n”)
Child.Spouse.Name.Surname

List Spotting
OCRed list sample: 1. Sarah, b … / Amy, h. 1799, d. … / i John Erastus, b …
Recognized fragments: Child.ChildNumber, Child.Name.GivenName
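Spotting a list by repeated patterns can be sketched by abstracting each line to a token-shape signature and looking for runs of repeating signatures. A toy sketch: the shape classes and the run threshold are my assumptions, not ListReader's actual spotter.

```python
import re

def line_pattern(line):
    """Abstract a line into a coarse token-shape pattern:
    digit runs become D, word runs become W, punctuation is kept."""
    pat = re.sub(r"[0-9]+", "D", line)
    pat = re.sub(r"[A-Za-z]+", "W", pat)
    return pat

def spot_lists(lines, min_run=2):
    """Return (start, end) index ranges of consecutive lines whose
    shape patterns repeat: a cheap signal that they form a list."""
    runs, start = [], 0
    for i in range(1, len(lines) + 1):
        if i == len(lines) or line_pattern(lines[i]) != line_pattern(lines[start]):
            if i - start >= min_run:
                runs.append((start, i))
            start = i
    return runs

page = [
    "The children of Erastus:",
    "1. Sarah, b. 1797",
    "2. Amy, h. 1799",
    "3. John, b. 1836",
    "They had a good year.",
]
runs = spot_lists(page)
```

Here the three record lines all share the shape "D. W, W. D", so the middle of the page is spotted as one list while the surrounding prose is not.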

Select Ontology Fragments and Label the Starting Record
Fragments applied: Child.ChildNumber, Child.BirthDate.Year
OCRed list sample: 1. Sarah, b … / Amy, h. 1799, d. … / i John Erastus, b …

Merge Ontology and Wrapper Fragments

Generalize Wrapper and Learn New Fields without the User
New fragment learned: Child.DeathDate.Year
OCRed list sample: 1. Sarah, b … / Amy, h. 1799, d. … / i John Erastus, b …

Thesis Statement
It is possible to populate an ontology semi-automatically, with better than state-of-the-art accuracy and cost, by inducing information-extraction wrappers to extract the stated facts in the lists of an OCRed document, first relying only on a single user-provided field label for each field in each list, and second relying on less ongoing user involvement by leveraging the wrappers induced and the facts extracted previously from other lists.

Four Hypotheses
1. Is a single labeling of each field sufficient?
2. Is fully automatic induction possible?
3. Does ListReader perform increasingly better?
4. Are the induced wrappers better than the best?

Hypothesis 1
- A single user labeling of each field per list
- Evaluate detecting new optional fields
- Evaluate semi-supervised wrapper induction

Hypothesis 2
- No user input required, even with imperfect recognizers
- Find the required level of noisy-recognizer precision and recall

Hypothesis 3
- Increasing repository knowledge decreases cost
- Show the repository can produce recognizers at the required precision and recall levels
- Evaluate the number of user-provided labels over time

Hypothesis 4
- ListReader performs better than a representative state-of-the-art information-extraction system
- Compare ListReader with the supervised CRF in Mallet

Evaluation Metrics
- Precision
- Recall
- F-measure
- Accuracy
- Number of user-provided labels
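The first three metrics can be computed from the overlap between extracted and expected field values. A minimal sketch; representing extractions as (field, value) pairs is my assumption, not the thesis's exact evaluation protocol.

```python
def extraction_metrics(extracted, expected):
    """Precision, recall, and F-measure over extracted field values,
    computed from set overlap."""
    extracted, expected = set(extracted), set(expected)
    tp = len(extracted & expected)                # true positives
    precision = tp / len(extracted) if extracted else 0.0
    recall = tp / len(expected) if expected else 0.0
    f = (2 * precision * recall / (precision + recall)
         if precision + recall else 0.0)          # harmonic mean
    return precision, recall, f

p, r, f = extraction_metrics(
    {("GivenName", "Sarah"), ("Year", "1797"), ("Year", "1799")},
    {("GivenName", "Sarah"), ("GivenName", "Amy"),
     ("Year", "1797"), ("Year", "1799")})
```

The remaining two metrics (accuracy and number of user-provided labels) are properties of the labeling session rather than of a single output set.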

Work and Results Thus Far
- Large, diverse corpus of OCRed documents
- Semi-supervised regex and HMM induction
- Both beat a CRF trained on three times the data
- Designed the label-to-predicate mapping
- Implemented a preliminary mapping
- 85% accuracy of word-level list spotting

Questions & Answers

What Does that Mean?
- Populating ontologies: a machine-readable and mathematically specified conceptualization of a collection of facts
- Semi-automatically inducing: pushing more work to the machine
- Information-extraction wrappers: specialized processes exposing the data in documents
- Lists in OCRed documents: data-rich, with variable format and noisy content

Who Cares?
- Populating ontologies: versatile, expressive, structured, digital information is queryable, linkable, and editable
- Semi-automatically inducing: lowers the cost of data
- Information-extraction wrappers: accurate by specializing for each document format
- Lists in OCRed documents: lots of data useful for family history, marketing, personal finance, etc., but challenging to extract

Reading Steps
1. List spotting
2. Record segmentation
3. Field segmentation
4. Field labeling
5. Nested list recognition
Example OCRed list (OCR noise preserved):
Members of the football team:
Captain: Donald Bakken Right Half Back
LeRoy "sonny' Johnson , Lcft Half Back
Orley "Dude" Bakken......, ,......Quarter Back
Roger Jay Myhrum Full Back
Bill "Snoz" Krohg, Center
They had a good year.

Special Labels Resolve Ambiguity
Child(child1)
Child-ChildNumber(child1, “1”)
Child-Name(child1, name1)
Name-GivenName(name1, “Sarah”)
Child-BirthDate(child1, date1)
BirthDate-Year(date1, “1797”)
Labeled list sample: 1. Sarah, b … / Amy, h. 1799, d. … / i John Erastus, b. 1836, d …
Labels: Child.ChildNumber | Child.Name.GivenName | Child.BirthDate.Year