Extracting and Organizing Facts of Interest from OCRed Historical Documents Joseph Park, Computer Science Brigham Young University.

Presentation transcript:

Extracting and Organizing Facts of Interest from OCRed Historical Documents Joseph Park, Computer Science Brigham Young University

Motivation
Large collection of scanned, OCRed books
Stated facts
Implied facts
▫ Inferred
▫ Same-as (entities)

Stated Facts of Interest
William Gerard Lathrop
▫ married Charlotte Brackett Jennings in 1837
▫ is the son of Mary Ely
▫ was born in 1812

Inferred Facts of Interest
William Gerard Lathrop has gender Male
Maria Jennings has gender Female
Maria Jennings has surname Lathrop

Same-as (Entities)
Diagram labels: William Gerard Lathrop, Gerard Lathrop, Mary Ely

FROntIER
Pipeline diagram labels: OntoES, convert to RDF, Jena reasoner, comparators, parameters, owl:sameAs, RDF output, Duke, convert to csv
Example output: Abigail Huntington Lathrop, Female
Example rules:
[(?x rdf:type source:Person) -> (?x rdf:type target:Person)]
[(?x source:Person-Name ?n),(?n source:NameValue ?nv), isMale(?nv),makeTemp(?gender) -> (?x target:Person-Gender ?gender),(?gender rdf:type target:Gender), (?gender target:GenderValue 'Male'^^xsd:string)]


Extraction Ontologies
Conceptual model
Instance recognizers

Lexical Object-Set Recognizers
BirthDate
  external representation: \b[1][6-9]\d\d\b
  left context: b\.\s
  right context: [.,]
  …
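The BirthDate recognizer above can be tried out with an ordinary regex engine. The following is a minimal sketch in Python, not OntoES itself: it assumes the left and right contexts act as a lookbehind and a lookahead around the external representation, and the sample line (including its death year) is made up for illustration.

```python
import re

# Patterns copied from the slide. How OntoES actually combines them
# is an assumption; here the contexts become lookaround assertions.
EXTERNAL_REPRESENTATION = r"\b[1][6-9]\d\d\b"
LEFT_CONTEXT = r"b\.\s"
RIGHT_CONTEXT = r"[.,]"

# Lookbehind/lookahead keep the context out of the matched value.
BIRTH_DATE = re.compile(
    f"(?<={LEFT_CONTEXT}){EXTERNAL_REPRESENTATION}(?={RIGHT_CONTEXT})"
)


def find_birth_dates(text):
    """Return every year found between 'b. ' and a '.' or ','."""
    return BIRTH_DATE.findall(text)


# Hypothetical sample line; the death year is invented.
sample = "1. William Gerard Lathrop, b. 1812, d. 1882."
print(find_birth_dates(sample))
```

Only the year preceded by "b. " is returned; the year after "d. " is skipped because the left context does not match.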

Non-lexical Object-Set Recognizers
Person
  object existence rule: {Name}
  …
Name
  external representation: \b{FirstName}\s{LastName}\b
  …

Relationship-set Recognizers
Person-BirthDate
  external representation: ^\d{1,3}\.\s{Person},\sb\.\s{BirthDate}[.,]
  …

Ontology-snippet Recognizers
ChildRecord
  external representation: ^(\d{1,3})\.\s+([A-Z]\w+\s[A-Z]\w+)(,\sb\.\s([1][6-9]\d\d))?(,\sd\.\s([1][6-9]\d\d))?\.
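To see what the ChildRecord snippet recognizer captures, here is a small Python sketch. The pattern is written as one continuous expression, on the assumption that no literal space is intended between the name group and the optional birth group. The field names in the returned dictionary and the sample lines are hypothetical.

```python
import re

# ChildRecord pattern from the slide, written as one expression.
CHILD_RECORD = re.compile(
    r"^(\d{1,3})\.\s+([A-Z]\w+\s[A-Z]\w+)"
    r"(,\sb\.\s([1][6-9]\d\d))?(,\sd\.\s([1][6-9]\d\d))?\."
)


def parse_child_record(line):
    """Return the captured fields of one child record, or None if no match."""
    match = CHILD_RECORD.match(line)
    if match is None:
        return None
    # Field names are illustrative, not taken from the ontology.
    return {
        "child_number": match.group(1),
        "name": match.group(2),
        "birth_year": match.group(4),
        "death_year": match.group(6),
    }


# Hypothetical lines; the years are invented.
print(parse_child_record("1. Mary Ely, b. 1836, d. 1859."))
print(parse_child_record("2. Gerard Lathrop."))
print(parse_child_record("3. William Gerard Lathrop, b. 1812."))
```

The third line returns `None`: the name group accepts exactly two capitalized words, so a three-part name falls outside this particular pattern.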


Canonicalization
“Sam’l” and “Geo.” -> “Samuel” and “George”
“New York City” -> “New York, NY”
“Boonton, N.J.” -> “Boonton, NJ”
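One simple way to picture canonicalization is a table lookup. The sketch below is an assumption about the general idea, not FROntIER's implementation: the two lookup tables contain only the examples from the slide, and the function names are hypothetical.

```python
# Lookup tables built only from the slide's examples (illustrative).
NAME_ABBREVIATIONS = {"Sam'l": "Samuel", "Geo.": "George"}
PLACE_FORMS = {
    "New York City": "New York, NY",
    "Boonton, N.J.": "Boonton, NJ",
}


def canonicalize_name(value):
    """Expand known abbreviations token by token; leave other tokens alone."""
    # Normalize the typographic apostrophe so "Sam’l" and "Sam'l" agree.
    tokens = value.replace("\u2019", "'").split()
    return " ".join(NAME_ABBREVIATIONS.get(tok, tok) for tok in tokens)


def canonicalize_place(value):
    """Map a known place string to its canonical form; otherwise keep it."""
    return PLACE_FORMS.get(value, value)


print(canonicalize_name("Sam\u2019l Lathrop"))
print(canonicalize_place("Boonton, N.J."))
```

Values with no table entry pass through unchanged, so the step is safe to apply to every extracted name and place.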

Schema Mapping
Diagram panels: author’s view, our view

Direct Schema Mapping Rule
[(?x rdf:type source:Person) -> (?x rdf:type target:Person)]
Diagram label: Person 7

Name Decomposition Rule
[(?x rdf:type source:Person),(?x source:Person-Name ?y),(?y rdf:type source:Name),… -> (?x rdf:type target:Person),(?x target:Person-Name ?y),(?y rdf:type target:Name),…]
Diagram labels: Person 7 William Gerard Lathrop Gerard Lathrop William Name 7 Person 7

Person has gender Male Rule
[(?x rdf:type source:Son),makeSkolem(?gender, ?x) -> (?x target:Person-Gender ?gender),(?gender rdf:type target:Gender), (?gender target:GenderValue 'Male'^^xsd:string)]
Diagram labels: Name 7 William Gerard Lathrop Gerard Lathrop William Male Person 7 Name 7
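The effect of the direct schema mapping rule and the "has gender Male" rule can be imitated in plain Python over a set of triples. This is only a sketch of what the rules derive, not the Jena reasoner: the blank-node naming that stands in for `makeSkolem`, the function names, and the two source facts about Person7 are assumptions made for illustration.

```python
def apply_direct_mapping(triples):
    """[(?x rdf:type source:Person) -> (?x rdf:type target:Person)]"""
    return {
        (s, "rdf:type", "target:Person")
        for s, p, o in triples
        if p == "rdf:type" and o == "source:Person"
    }


def apply_son_is_male(triples):
    """Every source:Son gets a target gender object with value 'Male'."""
    inferred = set()
    for s, p, o in triples:
        if p == "rdf:type" and o == "source:Son":
            # Stand-in for makeSkolem(?gender, ?x): one deterministic
            # node per person, so re-running adds no duplicates.
            gender = f"_:gender_{s}"
            inferred.add((s, "target:Person-Gender", gender))
            inferred.add((gender, "rdf:type", "target:Gender"))
            inferred.add((gender, "target:GenderValue", "Male"))
    return inferred


# Illustrative source facts for one person.
source_facts = {
    ("Person7", "rdf:type", "source:Person"),
    ("Person7", "rdf:type", "source:Son"),
}
target_facts = apply_direct_mapping(source_facts) | apply_son_is_male(source_facts)
for triple in sorted(target_facts):
    print(triple)
```

Deriving the gender node from the person identifier mirrors the reason for using a Skolem term: the same person always maps to the same gender object.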


Resolving Mary Elys Example
Person 2 (1st Mary Ely) owl:sameAs Person 8 (3rd Mary Ely)

Resolving Gerard Lathrop Example
Person 3 (1st Gerard Lathrop) owl:sameAs Person 9 (2nd Gerard Lathrop)
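The idea behind producing owl:sameAs links can be sketched as a pairwise record comparison. The code below is a generic illustration and does not reproduce Duke's algorithm, comparators, or configuration: the similarity measure (Python's `difflib`), the threshold, the record identifiers, and the birth years are all assumptions.

```python
from difflib import SequenceMatcher
from itertools import combinations


def name_similarity(a, b):
    """Similarity in [0, 1] between two names, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def same_person(r1, r2, threshold=0.85):
    """Treat two records as one entity if names agree and years do not clash."""
    y1, y2 = r1["birth_year"], r2["birth_year"]
    if y1 is not None and y2 is not None and y1 != y2:
        return False
    return name_similarity(r1["name"], r2["name"]) >= threshold


def same_as_links(records):
    """Compare every pair of records and emit owl:sameAs triples."""
    return [
        (a["id"], "owl:sameAs", b["id"])
        for a, b in combinations(records, 2)
        if same_person(a, b)
    ]


# Hypothetical records: three people named Mary Ely, invented years.
records = [
    {"id": "PersonA", "name": "Mary Ely", "birth_year": 1836},
    {"id": "PersonB", "name": "Mary Ely", "birth_year": 1787},
    {"id": "PersonC", "name": "Mary Ely", "birth_year": 1836},
]
print(same_as_links(records))
```

Identical names alone are not enough here: the conflicting birth year keeps the second record apart, which is the kind of distinction the Mary Ely example calls for.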

Experiment
The Ely Ancestry, page 419
▫ Extracted facts: full page
▫ Inferred facts: excerpt
▫ Same-as (entities) facts: excerpt
Annotated gold standard files
▫ Extracted facts: automated evaluation tool
▫ Inferred facts: hand checked with partial matches
▫ Same-as (entities) facts: hand checked
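Scoring extracted facts against a gold standard usually comes down to precision and recall. The sketch below shows the standard set-based computation under the assumption of exact matching (it does not model the partial matches mentioned above, and it is not the project's evaluation tool). The gold facts restate the stated facts from the earlier slide; the extracted set contains one deliberately wrong year.

```python
def precision_recall(extracted, gold):
    """Exact-match precision and recall of extracted facts vs. a gold set."""
    extracted, gold = set(extracted), set(gold)
    true_positives = len(extracted & gold)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


gold = {
    ("William Gerard Lathrop", "born", "1812"),
    ("William Gerard Lathrop", "son of", "Mary Ely"),
    ("William Gerard Lathrop", "married", "Charlotte Brackett Jennings"),
}
# Illustrative system output: the birth year is intentionally wrong.
extracted = {
    ("William Gerard Lathrop", "born", "1821"),
    ("William Gerard Lathrop", "son of", "Mary Ely"),
    ("William Gerard Lathrop", "married", "Charlotte Brackett Jennings"),
}
precision, recall = precision_recall(extracted, gold)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Here two of three extracted facts are correct and two of three gold facts are found, so both measures equal 2/3.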

Extracted Facts Results

Implied Facts Results
*Name is composed of GivenName and Surname concepts

Conclusions
Proof of concept
Current and Future work:
▫ Automated evaluation tool for inferred & same-as facts
▫ Corpus of 50,000+ books provided by LDS Church
▫ 200 randomly selected pages
▫ Estimate the following:
  - Time required
  - Expertise required
  - Accuracy (precision & recall)