Automating the Extraction of Genealogical Information from Historical Documents Aaron P. Stewart David W. Embley March 20, 2011.

Slides:

Advertisements

Similar presentations

Dr. Leo Obrst MITRE Information Semantics Information Discovery & Understanding Command & Control Center February 6, 2014February 6, 2014February 6, 2014.

Advertisements

Personalized Navigation in the Semantic Web: An Enhanced Faceted Browser Michal Tvarožek FIIT STU BA.

Haystack: Per-User Information Environment 1999 Conference on Information and Knowledge Management Eytan Adar et al Presented by Xiao Hu CS491CXZ.

Extracting Names Using Layout Clues in Genealogical Books Aaron Stewart David W. Embley March 20, 2010.

Copyright © 2006 Educational Testing Service Listening. Learning. Leading. The Test Collection at ETS Karen McQuillen, Manager, Library Services ETS

Training on Read&Write 6 Gold for Mac. See the key features of Read&Write 6 Gold for Mac in order to familiarise yourself with the functionality of the.

® Copyright 2008 Adobe Systems Incorporated. All rights reserved. ADOBE® ACCESSIBILITY Achieving Accessibility with PDF Greg Pisocky Accessibility Specialist.

Commercial Data Processing Lesson 2: The Data Processing Cycle.

Computer Science Research for Family History and Genealogy David W. Embley Heath Nielson, Mike Rimer, Luke Hutchison, Ken Tubbs, Doug Kennard, Tom Finnigan.

David W. Embley, Stephen W. Liddle, Deryle W. Lonsdale, Aaron Stewart, and Cui Tao* Brigham Young University, Provo, Utah, USA *Mayo Clinic, Rochester,

Ancestry OCR Project: Data Thomas L. Packer

Enabling Search for Facts and Implied Facts in Historical Documents David W. Embley, Stephen W. Liddle, Deryle W. Lonsdale, Spencer Machado, Thomas Packer,

Principled Pragmatism: A Guide to the Adaptation of Philosophical Disciplines to Conceptual Modeling David W. Embley, Stephen W. Liddle, & Deryle W. Lonsdale.

HyKSS: A Multiple Ontology Approach to Hybrid Search Andrew Zitzelberger Brigham Young University MS Thesis Proposal.

Named Entity Recognition for Digitised Historical Texts by Claire Grover, Sharon Givon, Richard Tobin and Julian Ball (UK) presented by Thomas Packer 1.

BYU Craigslist Alerter Oliver Nina, Meher Shaikh Andrew Zitzelberger.

By ANDREW ZITZELBERGER A Framework for Extraction Ontology Based Information Management.

Ontology-Based Free-Form Query Processing for the Semantic Web Mark Vickers Brigham Young University MS Thesis Defense Supported by:

Bieber et al., NJIT © Slide 1 Digital Library Integration Masters Project and Masters Thesis Summer and Fall 2005 CIS 786 / CIS Fall.

The Walden's Paths Virtual Directories Unmil P. Karadkar, Luis Francisco-Revilla, Richard Furuta, Frank M. Shipman III Texas A&M University Structuring.

Use Case Modelling Visual Annotator for studying ICU Notes Bacchus Beale.

HyKSS: Hybrid Keyword and Semantic Search Andrew Zitzelberger 1.

Enabling Efficient Chinese Jiapu Information Extraction

SaariStory: A framework to represent the medieval history of Saarland Michael Barz, Jonas Hempel, Cornelius Leidinger, Mainack Mondal Course supervisor:

Software All parts of the computer people can NOT touch, such as programs, files, documents and any other data.

Solutions Summit 2014 Discrepancy Processing & Resolution Terri Sullivan.

CIG Conference Norwich September 2006 AUTINDEX 1 AUTINDEX: Automatic Indexing and Classification of Texts Catherine Pease & Paul Schmidt IAI, Saarbrücken.

FROntIER: A Framework for Extracting and Organizing Biographical Facts in Historical Documents Joseph Park.

A Web of Knowledge for Historical Documents David W. Embley.

Joseph Park Brigham Young University.  Motivation.

Scanned Books: Annotator Training. Project Overview Untapped sources – 100,000+ scanned/OCRed books – Problem: how to cost-effectively extract Extraction.

Soar and Construction Grammar Peter Lindes, Deryle Lonsdale, David Embley Brigham Young University 2014 Soar Workshop © 2014 Peter Lindes 6/19/2014PL 2014.

Ontology-based Information Extraction with a Cognitive Agent Peter Lindes 1, Deryle Lonsdale, David Embley Brigham Young University AAAI Now at.

Bootstrapping Regular-Expression Recognizer to Help Human Annotators Tae Woo Kim.

E-Books Presentation. Hard Copy (Book) Scanning OCR Text Document HTML Conversion Text Formatting Linking Image Insertion Final QC Soft Copy (JPG/TIFF)

FROntIER: Fact Recognizer for Ontologies with Inference and Entity Resolution Joseph Park, Computer Science Brigham Young University.

Cost-Effective Information Extraction from Lists in OCRed Historical Documents Thomas Packer and David W. Embley Brigham Young University FamilySearch.

“Automating Reasoning on Conceptual Schemas” in FamilySearch — A Large-Scale Reasoning Application David W. Embley Brigham Young University More questions.

FAMILYSEARCH INDEXING IS WORLDWIDE. INDEXING 1.WHAT IS INDEXING? - A PROCESS WHERE A PERSON CAN TRANSCRIBE DATA FROM A DIGITAL IMAGE WHICH IS THEN POSTED.

Strategies for subject navigation of linked Web sites using RDF topic maps Carol Jean Godby Devon Smith OCLC Online Computer Library Center Knowledge Technologies.

(Semi)automatic Extraction of Genealogical Information from Scanned & OCRed Historical Documents Elder David W. Embley.

PageManager /16 What ’ s the strength in PM6 ? Open Architecture Tree View to Browse Any Folders In Your System Open Architecture Tree View to Browse.

OntoSoar: Soar Finds Facts in Text Peter Lindes, Deryle Lonsdale, David Embley Brigham Young University 33 rd Soar Workshop, June 2013 pl 6/6/201333rd.

Knowledge Modeling and Discovery. About Thetus Thetus develops knowledge modeling and discovery infrastructure software for customers who: Have high-value.

VERI is an interface that provides a Web based front end to the access the datasets generated by the MVED. The goal is to Provide open access to the Don.

Ontology-Based Free-Form Query Processing for the Semantic Web Mark Vickers Brigham Young University MS Thesis Defense Supported by:

Challenges and Opportunities. Common Pedigree Research Issues Record Linkage Data Standardization Efficient Data Access Expert Finding.

GreenFIE-HD: A “Green” Form-based Information Extraction Tool for Historical Documents Tae Woo Kim.

Cost-effective Ontology Population with Data from Lists in OCRed Historical Documents Thomas L. Packer David W. Embley HIP ’13 BYU CS 1.

Extracting and Organizing Facts of Interest from OCRed Historical Documents Joseph Park, Computer Science Brigham Young University.

Scanned Books: Annotator Training. Project Overview Untapped sources – 100,000+ scanned/OCRed books – Problem: cost-effective extraction Extraction tools.

Our Goal Provide our medical clients the ability to:  Organize  Navigate  Manage their digital documents All through a simplified, innovative and affordable.

Extracting Data Automatically from Scanned Books with OntoSoar

External Services & Frameworks

A Web of Knowledge for Family History (Research Directions)

Teaching Strategies for Reading Electronic Texts

Vision for an Automatically Constructed FH-WoK

Joseph S. Park and David W. Embley Brigham Young University

Thomas L. Packer BYU CS DEG

Family Search and the scanning of OCPL’s historical book collection.

Extracting Full Names from Diverse and Noisy Scanned Document Images

Combining Keyword and Semantic Search for Best Effort Information Retrieval Andrew Zitzelberger 1.

Temple Ready within an Hour of Collection Capture

Extracting Information from Diverse and Noisy Scanned Document Images

A Green Form-Based Information Extraction System for Historical Documents Tae Woo Kim No HD. I’m glad to present GreenFIE today. A Green Form-…

Joseph Park Brigham Young University

Extracting Information from Diverse and Noisy Scanned Document Images

Mock-ups for Discussing the CMS Administrator Interface

Swangling S S Inference U C B M        

Extracting Information from Diverse and Noisy Scanned Document Images

Presentation transcript:

Automating the Extraction of Genealogical Information from Historical Documents Aaron P. Stewart David W. Embley March 20, 2011

Part I: Vision Current projects at the BYU Data Extraction Group

Problem: Enable users to search for facts and implied facts contained historical documents Motivation: Genealogy Solution: – Scan & OCR historical docs (e.g., The Ely Ancestry) – Extract facts (e.g., OntoES) along with keywords and index them for fast search (e.g., HyKSS) – Define inference rules (Nathan’s work), process them and index them (HyKSS) – Free-form and form-based queries Results of queries displayed Results “clickable”: highlights and displays extracted facts on page where results were obtained and for any inference rules used, displays the reasoning chain Implementation Status & Hard Problems to Solve

Goal: Search books for names 4 History of the Jones Family scanner George Jones

Original Document

Extracted Facts Names William Gerard Lathrop Mary Ely Gerard Lathrop Charlotte Brackett Jennings Nathan Tilestone Jennings Maria Miller Maria Jennings [Lathrop] Donald McKenzie [Lathrop] Anna Margaretta [Lathrop] Anna Catherine [Lathrop] Relationships William Gerard Lathrop : son of : Mary Ely William Gerard Lathrop : son of : Gerard Lathrop William Gerard Lathrop : m. : Charlotte Brackett Jennings Charlotte Brackett Jennings : dau. of : Nathan Tilestone Jennings Charlotte Brackett Jennings : dau. of : Maria Miller Relationships (continued) Maria Jennings : child of : William Gerard Lathrop Maria Jennings : child of : Charlotte Brackett William Gerard : child of : William Gerard Lathrop William Gerard : child of : Charlotte Brackett Donald McKenzie : child of : William Gerard Lathrop Donald McKenzie : child of : Charlotte Brackett Anna Margaretta : child of : William Gerard Lathrop Anna Margaretta : child of : Charlotte Brackett Anna Catherine : child of : William Gerard Lathrop Anna Catherine : child of : Charlotte Brackett

Inferred Facts Names William Gerard Lathrop Mary Ely Gerard Lathrop Charlotte Brackett Jennings Nathan Tilestone Jennings Maria Miller Maria Jennings [Lathrop] Donald McKenzie [Lathrop] Anna Margaretta [Lathrop] Anna Catherine [Lathrop] Relationships William Gerard Lathrop : son of : Mary Ely William Gerard Lathrop : son of : Gerard Lathrop William Gerard Lathrop : m. : Charlotte Brackett Jennings Charlotte Brackett Jennings : dau. of : Nathan Tilestone Jennings Charlotte Brackett Jennings : dau. of : Maria Miller Relationships (continued) Maria Jennings : child of : William Gerard Lathrop Maria Jennings : child of : Charlotte Brackett William Gerard : child of : William Gerard Lathrop William Gerard : child of : Charlotte Brackett Donald McKenzie : child of : William Gerard Lathrop Donald McKenzie : child of : Charlotte Brackett Anna Margaretta : child of : William Gerard Lathrop Anna Margaretta : child of : Charlotte Brackett Anna Catherine : child of : William Gerard Lathrop Anna Catherine : child of : Charlotte Brackett Inferred Relationships Maria Jennings : grandchild of : Mary Ely Maria Jennings : grandchild of : Gerard Lathrop Maria Jennings : grandchild of : Nathan Tilestone Jennings Maria Jennings : grandchild of : Maria Miller William Gerard : grandchild of : Mary Ely William Gerard : grandchild of : Gerard Lathrop William Gerard : grandchild of : Nathan Tilestone Jennings William Gerard : grandchild of : Maria Miller …

Keywords Chief Justice

Queries Is there a chief justice related to Mary Ely? Who are the sons of Gerard Lathrop? Who are the grandchildren of Mary Ely?

Part II: Implementation

Ontology Editor

Data Frame Editor

Rule Editor

Name Query

HyKSS Indexing

Keyword Search

Relationship Search

Inferred Relationship Search

Maria Jennings is a grandchild of Mary Ely GrandchildOf(Maria Jennings, Mary Ely) :- Child-Parent(Maria Jennings, William Gerard Lathrop), Child-Parent(William Gerard Lathrop, Mary Ely)

Inferred Relationship Search Who are the grandchildren of Mary Ely? PersonName  Maria Jennings  William Gerard  Donald McKenzie  Anna Margaretta  Anna Catherine

Part III: Improvements

Challenges in Name Extraction

Extraction Tools

Need Better Extraction Results From Packer et al., Lists

Example of a Better Extractor (Margin Finder) B\ liee (OCR error) Buekman (OCR error) Jobsph (OCR error) Baseline errors Uuckkman (OCR error) Charles. (OCR error)

Example of a Better Extractor (Margin Finder) LEVEL 1 LEVEL 2

Need Annotation Tools

Simplified Evaluation System PDF+ PDF markup applet OntoES Multi- page XML Name Extraction Ontology Reference XML Output XML Eval- uation Tool Test Results Custom Reader / Writer Converter Adobe Acrobat Professional BYU Digital Collections Other PDF sources SegmentationByLayout: AnnotatableImage3 otationPrototype.php?formName= PersonName (FireFox + Java 6) otationPrototype.php?formName= PersonName Actual data/code sources are shown in yellow boxes SegmentationByLayout: DiscreteAreaEvaluator OntosEvaluation: PdfProcessor Needs multi-page support (feed one page at a time) OntosEvaluation2: resource ontology/deg-name-extractor.xml To run this code: SegmentationByLayout: edu.byu.deg.segmentation.layout.annotation.PdfAnnotationEditor Arguments: census-floyd-county-kentucky-1860.pdf unused-parameter familysearch census-floyd-county-kentucky-1860-gold.xml Annotation Evaluation

Summary Vision Implementation Status Current and Future Projects

Credits Ontology Editor – Numerous past students Data Frame Editor – Numerous past students Rule Editor – Nathan Tate Hybrid Keyword and Semantic Search (HyKSS) – Andrew Zitzelberger This presentation contains both actual screenshots and mock-ups of projected results