A Web of Knowledge for Historical Documents David W. Embley.

Slides:



Advertisements
Similar presentations
Dr. Leo Obrst MITRE Information Semantics Information Discovery & Understanding Command & Control Center February 6, 2014February 6, 2014February 6, 2014.
Advertisements

CH-4 Ontologies, Querying and Data Integration. Introduction to RDF(S) RDF stands for Resource Description Framework. RDF is a standard for describing.
Haystack: Per-User Information Environment 1999 Conference on Information and Knowledge Management Eytan Adar et al Presented by Xiao Hu CS491CXZ.
Lukas Blunschi Claudio Jossen Donald Kossmann Magdalini Mori Kurt Stockinger.
David W. Embley Brigham Young University Provo, Utah, USA WoK: A Web of Knowledge.
Ontologies for multilingual extraction Deryle W. Lonsdale David W. Embley Stephen W. Liddle Supported by the.
Automating the Extraction of Genealogical Information from Historical Documents Aaron P. Stewart David W. Embley March 20, 2011.
1 Automating the Extraction of Genealogical Information from the Web GeneTIQS Troy Walker & David W. Embley Family History Technology Conference March.
David W. Embley, Stephen W. Liddle, Deryle W. Lonsdale, Aaron Stewart, and Cui Tao* Brigham Young University, Provo, Utah, USA *Mayo Clinic, Rochester,
FOCIH: Form-based Ontology Creation and Information Harvesting Cui Tao, David W. Embley, Stephen W. Liddle Brigham Young University Nov. 11, 2009 Supported.
Information Retrieval in Practice
Sunita Sarawagi.  Enables richer forms of queries  Facilitates source integration and queries spanning sources “Information Extraction refers to the.
Enabling Search for Facts and Implied Facts in Historical Documents David W. Embley, Stephen W. Liddle, Deryle W. Lonsdale, Spencer Machado, Thomas Packer,
Principled Pragmatism: A Guide to the Adaptation of Philosophical Disciplines to Conceptual Modeling David W. Embley, Stephen W. Liddle, & Deryle W. Lonsdale.
CS652 Spring 2004 Summary. Course Objectives  Learn how to extract, structure, and integrate Web information  Learn what the Semantic Web is  Learn.
Xyleme A Dynamic Warehouse for XML Data of the Web.
NaLIX: A Generic Natural Language Search Environment for XML Data Presented by: Erik Mathisen 02/12/2008.
OWL-AA: Enriching OWL with Instance Recognition Semantics for Automated Semantic Annotation 2006 Spring Research Conference Yihong Ding.
Visual Web Information Extraction With Lixto Robert Baumgartner Sergio Flesca Georg Gottlob.
Recognizing Ontology-Applicable Multiple-Record Web Documents David W. Embley Dennis Ng Li Xu Brigham Young University.
Gimme’ The Context: Context- driven Automatic Semantic Annotation with CPANKOW Philipp Cimiano et al.
1 Semi-Automatic Semantic Annotation for Hidden-Web Tables Cui Tao & David W. Embley Data Extraction Research Group Department of Computer Science Brigham.
ER 2002BYU Data Extraction Group Automatically Extracting Ontologically Specified Data from HTML Tables with Unknown Structure David W. Embley, Cui Tao,
FACT: A Learning Based Web Query Processing System Hongjun Lu, Yanlei Diao Hong Kong U. of Science & Technology Songting Chen, Zengping Tian Fudan University.
Query Rewriting for Extracting Data Behind HTML Forms Xueqi Chen, 1 David W. Embley 1 Stephen W. Liddle 2 1 Department of Computer Science 2 Rollins Center.
Annotating Documents for the Semantic Web Using Data-Extraction Ontologies Dissertation Proposal Yihong Ding.
By ANDREW ZITZELBERGER A Framework for Extraction Ontology Based Information Management.
1 Matching DOM Trees to Search Logs for Accurate Webpage Clustering Deepayan Chakrabarti Rupesh Mehta.
Table Interpretation by Sibling Page Comparison Cui Tao & David W. Embley Data Extraction Group Department of Computer Science Brigham Young University.
1 Cui Tao PhD Dissertation Defense Ontology Generation, Information Harvesting and Semantic Annotation For Machine-Generated Web Pages.
Automatic Creation and Simplified Querying of Semantic Web Content An Approach Based on Information-Extraction Ontologies Yihong Ding, David W. Embley,
BYU A Synergistic Semantic Annotation Model December 2007 Yihong Ding,
Overview of Search Engines
Learning Table Extraction from Examples Ashwin Tengli, Yiming Yang and Nian Li Ma School of Computer Science Carnegie Mellon University Coling 04.
Amarnath Gupta Univ. of California San Diego. An Abstract Question There is no concrete answer …but …
Managing Large RDF Graphs (Infinite Graph) Vaibhav Khadilkar Department of Computer Science, The University of Texas at Dallas FEARLESS engineering.
Managing & Integrating Enterprise Data with Semantic Technologies Susie Stephens Principal Product Manager, Oracle
Cross-Language Hybrid Keyword and Semantic Search David W. Embley, Stephen W. Liddle, Deryle W. Lonsdale, Joseph S. Park, Andrew Zitzelberger Brigham Young.
Deryle W. Lonsdale, David W. Embley, Stephen W. Liddle, and Joseph Park BYU Data Extraction Research Group.
FROntIER: A Framework for Extracting and Organizing Biographical Facts in Historical Documents Joseph Park.
Joseph Park Brigham Young University.  Motivation.
NLP And The Semantic Web Dainis Kiusals COMS E6125 Spring 2010.
Soar and Construction Grammar Peter Lindes, Deryle Lonsdale, David Embley Brigham Young University 2014 Soar Workshop © 2014 Peter Lindes 6/19/2014PL 2014.
An Aspect of the NSF CDI InitiativeNSF CDI: Cyber-Enabled Discovery and Innovation.
Ontology-based Information Extraction with a Cognitive Agent Peter Lindes 1, Deryle Lonsdale, David Embley Brigham Young University AAAI Now at.
ListReader: Inducing Wrappers for OCRed Lists to Efficiently Populate Ontologies Thomas L. Packer October 6,
Presenter: Shanshan Lu 03/04/2010
David W. Embley Brigham Young University Provo, Utah, USA WoK: A Web of Knowledge.
FROntIER: Fact Recognizer for Ontologies with Inference and Entity Resolution Joseph Park, Computer Science Brigham Young University.
Benchmarking ontology-based annotation tools for the Semantic Web Diana Maynard University of Sheffield, UK.
Cost-Effective Information Extraction from Lists in OCRed Historical Documents Thomas Packer and David W. Embley Brigham Young University FamilySearch.
“Automating Reasoning on Conceptual Schemas” in FamilySearch — A Large-Scale Reasoning Application David W. Embley Brigham Young University More questions.
Intro: 1 What is a Database? Collection of Dynamic Data –Large Large of yesteryear now fits on a PC (small DBs) Many applications require even more (terabytes,
Trustworthy Semantic Webs Dr. Bhavani Thuraisingham The University of Texas at Dallas Lecture #4 Vision for Semantic Web.
Scalable Hybrid Keyword Search on Distributed Database Jungkee Kim Florida State University Community Grids Laboratory, Indiana University Workshop on.
Semantic web Bootstrapping & Annotation Hassan Sayyadi Semantic web research laboratory Computer department Sharif university of.
ODE: Ontology-Assisted Data Extraction Weifeng Su, Jiying Wang, Frederick H. Lochovsky Summarized by Joseph Park.
Cost-effective Ontology Population with Data from Lists in OCRed Historical Documents Thomas L. Packer David W. Embley HIP ’13 BYU CS 1.
Instance Discovery and Schema Matching With Applications to Biological Deep Web Data Integration Tantan Liu, Fan Wang, Gagan Agrawal {liut, wangfa,
David W. Embley Brigham Young University Provo, Utah, USA.
Extracting and Organizing Facts of Interest from OCRed Historical Documents Joseph Park, Computer Science Brigham Young University.
David W. Embley Brigham Young University Provo, Utah, USA
Stephen W. Liddle, Deryle W. Lonsdale, and Scott N. Woodfield
Vision for an Automatically Constructed FH-WoK
Joseph S. Park and David W. Embley Brigham Young University
Temple Ready within an Hour of Collection Capture
Grant Number: IIS Institution of PI: Brigham Young University PI’s: David W. Embley, Stephen W. Liddle, Deryle W. Lonsdale Title:
Joseph Park Brigham Young University
Extraction Rule Creation by Text Snippet Examples
Joseph Park Brigham Young University
Presentation transcript:

A Web of Knowledge for Historical Documents David W. Embley

Enabling Search for Facts and Implied Facts by Automating the Construction of a Web of Knowledge for Historical Documents David W. Embley

BYU Data Extraction Research Group Stephen W. Liddle, Deryle W. Lonsdale, Spencer Machado, Thomas Packer, Joseph Park, Nathan Tate, Andrew Zitzelberger

WoK-HD (A Web of Knowledge Superimposed over Historical Documents) …… ……

…… grandchildren of Mary Ely ……

WoK-HD (A Web of Knowledge Superimposed over Historical Documents) …… …… grandchildren of Mary Ely

WoK-HD (A Web of Knowledge Superimposed over Historical Documents) …… grandchildren of Mary Ely ……

WoK-HD (A Web of Knowledge Superimposed over Historical Documents) …… ……

WoK-HD Input

Querying for Facts & Implied Facts

Animation of 1.Extraction query, results, highlighting 2.Reasoned Query, results, reasoning chain, highlighting

Extraction Ontologies

Fact Extraction

Reasoning for Implied Facts

Query Interpretation “Mary Ely” grandchild

Query Interpretation “Mary Ely” grandchild

Query Interpretation “Mary Ely” grandchild

Generated SPARQL Query

Query Results

Results of Processing the Ely Ancestry (all 830 Pages) Number of facts extracted: 22,251 – 8,740 Person-Birthdate facts – 3,803 Person-Deathdate facts – 9,708 children facts, including 5,020 Child-has-parent-Person facts 2,394 Son-of-Person facts 2,294 Daughter-of-Person facts Number of implied grandchild facts inferred: 5,277 Processing time: – ~18 seconds per page – CPU time: ~4 hours – Processing 10 in parallel: ~24 minutes

Results of Processing the Ely Ancestry (all 830 Pages) Precision:.52 (by randomly selecting & checking 100 of the 22,251 facts) Recall:.33 & Precision:.40 (by randomly selecting and checking 2 fact-filled family pages) Errors: – Name recognizer – Text pattern expectations – OCR Varying accuracy (for pages checked) – Recall:.11, Precision:.11 (bad combination of all problems) – Recall:.50, Precision:.68 (some problems, but closer to expectations) – Recall:.59, Precision:.71 (10 pages, mostly as expected) – Recall:.91, Precision:.94 (tuned, no problems except a few OCR errors)

Results after adding Recognizers for Relationship Sets Name: 0.91 Birthdate: 0.92 Deathdate: 1.00 Person born on Birthdate: 0.89 Person died on Deathdate: 0.75 Son of Person: 0.83 Daughter of Person: 0.33 Child: has parent Person: 0.79 Average F-measure: 0.80 Correct extracted facts: 143 Total extracted facts: 165 Actual number of facts: 172

Results after adding Recognizers for Relationship Sets Name: 0.70 Birthdate: 0.88 Deathdate: 1.00 Person born on Birthdate: 0.56 Person died on Deathdate: 1.00 Son of Person: 0.89 Daughter: of Person: 0.45 Child has parent Person: 0.30 Average F-measure: 0.72 Correct extracted facts: 79 Total extracted facts: 119 Actual number of facts: 119

Current and Future Work (Making the WoK-HD Vision Practical) …… …… grandchildren of Mary Ely 1. Cost-effective and Accurate Extraction 2. Automated Conceptual Organization 3. Accurate and Efficient Query Processing 4. Internationalization: Français 한국인 …

1. Cost-effective and Accurate Extraction Focus on semi-structured elements first Bootstrap synergistically – Extract from semi-structured elements – Learn extraction ontologies – Extract from plain text

ListReader: Wrapper Induction for Lists

Part I: Semi-supervised

OCR newline First row, left to right: C. Paulson, G. Whaley, E Eastlund, B. Krohg, D. Bakken, R. Norgaard, 0. Bakken, A. Vig, newline H. Megorden, D Wynne newline Second row- Mr. See bach, D. Colligan, J. Wogsland, F Knudson, A. Hagen, R. Myhrum, R. Nienaber, J. Mittun, newline Mr. Bohnsack. newline Third row: G. Carlm, R. Reterson, K Larson, J Skatvold, A. Enckson, R Roysland, L.Johnson, L. Nystrom. newLine Fourth row: R. Kvare, H. Haugen, R. Lubken, R Larson, A. Carlson, A. Nienaber, W Ram bo I, V Hanson, K. Ny- newline newline QootLaM "leam newline newline Captain Donald "Dude" Bakken Right Half Back newline LeRoy "Sonny' Johnson ,.... Lcft Half Back newline Orley Bakken , , Quarter Back newline Roger Myhrum Full Back newline Bill "Schnozz" Krohg Center newline Howard "Little Huby" Megorden Right Guard newline Royce "Shorty" Norgaard Left Guard newline Eugene "Mad Russian" Easthind Right Tackle newline Alvin "Stuben" Hagen Left Tackle newline Richard "Dick" Nienabcr Right End newline James "Oakie" Wogsland Lcft End newline newline Other lettermen were- newline Glenn "Doc" Whaley newline Allen "Swede" Enckson newline James "Snooky" Mittun newline Curtis "Curt" Paulson newline Arthur "Art" Vig newline Forrest "Forry" Knudson newline Robert "Bobby" Roysland newline Page 26 newline

Hand Form Creation & Labeling

Hand Form Creation & Labeling √

Hand Form Creation & Labeling Donald√

Hand Form Creation & Labeling DonaldBakken√

Hand Form Creation & Labeling DonaldBakkenDude√

Hand Form Creation & Labeling DonaldBakkenDude Right Half Back √

Generate Wrapper for First Record Captain Donald "Dude" Bakken Right Half Back newline LeRoy "Sonny' Johnson ,.... Lcft Half Back newline Orley Bakken , , Quarter Back newline Roger Myhrum Full Back newline Bill "Schnozz" Krohg Center newline Howard "Little Huby" Megorden Right Guard newline Royce "Shorty" Norgaard Left Guard newline Eugene "Mad Russian" Easthind Right Tackle newline Alvin "Stuben" Hagen Left Tackle newline Richard "Dick" Nienabcr Right End newline James "Oakie" Wogsland Lcft End newline newline Other lettermen were- newline Glenn "Doc" Whaley newline Allen "Swede" Enckson newline James "Snooky" Mittun newline Curtis "Curt" Paulson newline Arthur "Art" Vig newline Forrest "Forry" Knudson newline Robert "Bobby" Roysland newline Page 26 newline 1. Captain, 2. Given Name, 3. Nickname, 4. Surname, 5. Position (Captain) (\w{6,6}) "(\w{4,4})" (\w{6,6}) \.{14,14} ((\w{4,5}){3,3})\n

Update Wrapper & Annotate Records Captain Donald "Dude" Bakken Right Half Back newline LeRoy "Sonny' Johnson ,.... Lcft Half Back newline Orley Bakken , , Quarter Back newline Roger Myhrum Full Back newline Bill "Schnozz" Krohg Center newline Howard "Little Huby" Megorden Right Guard newline Royce "Shorty" Norgaard Left Guard newline Eugene "Mad Russian" Easthind Right Tackle newline Alvin "Stuben" Hagen Left Tackle newline Richard "Dick" Nienabcr Right End newline James "Oakie" Wogsland Lcft End newline newline Other lettermen were- newline Glenn "Doc" Whaley newline Allen "Swede" Enckson newline James "Snooky" Mittun newline Curtis "Curt" Paulson newline Arthur "Art" Vig newline Forrest "Forry" Knudson newline Robert "Bobby" Roysland newline Page 26 newline 2. Captain, 3. Given Name, 5. Nickname, 6. Surname, 7. Position ((Captain) )?(\w{5,6})( "(\w{4,5}) ['"] )? (\w{6,7}) [\.,]{14,34} ((\w{4,7} ){2,3})\n

Final Wrapper and Annotation Captain Donald "Dude" Bakken Right Half Back newline LeRoy "Sonny' Johnson ,.... Lcft Half Back newline Orley Bakken , , Quarter Back newline Roger Myhrum Full Back newline Bill "Schnozz" Krohg Center newline Howard "Little Huby" Megorden Right Guard newline Royce "Shorty" Norgaard Left Guard newline Eugene "Mad Russian" Easthind Right Tackle newline Alvin "Stuben" Hagen Left Tackle newline Richard "Dick" Nienabcr Right End newline James "Oakie" Wogsland Lcft End newline 2. Captain, 3. Given Name, 5. Nickname, 7. Surname, 8. Position ((Captain) )?(\w{4,7})( “((\w{4,7}){1,2})['"] )? (\w{5,8} ) [\.,]{14,34} ((\w{4,7} ){1,3})\n

Part II: Weakly-supervised

Apply Extraction Ontologies

Find List and Generate Wrapper Base list finding on whether a wrapper can be generated. Base wrapper generation on best-labeled record.

Extract Synergistically from Text

RDF OWL Sibling tables— automatically converted to OWL and RDF Reason Query TISP & TISP++ TISP interprets sibling tables. TISP++ annotates interpreted tables for generated ontologies. TISP & TISP++ superimpose a queriable knowledge structure over web pages.

Making Relational Databases Sematic-Web Queriable RDB WordNet “Birthdate of Mary Ely, grandmother of Maria Jennings Lathrop” SQL 2 Nov 1843

2. Automated Conceptual Organization Resolve object identity Integrate – Conceptualizations – Data Allow for discrepancies – Alternate views – Arbitration – Ground truth authentication

…… Clean & Organize Facts ……

Resolve Identity Mary_Ely_1Mary_Ely_2Mary_Ely_3Mary_Ely_4Mary_Ely_5 Mary_Ely_ Mary_Ely_ Mary_Ely_ Mary_Ely_ Mary_Ely_

Information Integration Schema mapping Logic Lexical parsing Lexical inference

Integrating, Storing, and Querying Uncertain Data

3. Accurate and Efficient Query Processing Issues – Context discovery – Query generation – Query execution (and result ranking) – Authentication processing and display – Advanced search Resolutions

59

Recognition Accuracy Depends on extraction ontology recognizers – Instance recognition – Keyword recognition – Operator & parameter recognition Disambiguation options – No cycles in context ontologies – Cycle disambiguation Keywords select paths Multiple hits Execute and return all

Advanced Search Negations and disjunctions Recursion –Unfold for bounded depth –Closure for unbounded depth

Enormous Data Sets Number of Facts – Est B people – 10 facts per person, as much as 1T – However, only ?? people for which we can locate records Number of Implied Facts – Potentially unbounded – But realistically, 20? pre-computed with hundreds more computable Uncertainty

Ideas for Resolutions Context discovery: Semantic indexes over recognizers Query generation: Fast given –context ontology set & –semantic-index matches Query execution : –Semantic indexes over extracted facts –Materialized views (partitioned data sets) –Current research on SPARQL query optimization Authentication: real-time highlighting of cached pages

4. Internationalization Extract facts in any language Support multilingual queries Allow for cultural differences

Arabic Nutrition Project

Multilingual Query Processing “Find a BBQ restaurant near the Umeda station, with typical prices under $40” Language-Agnostic Ontology

Multilingual Query Processing Wie alt war Mary Ely als ihr Son William geboren wurde? (die Mary Ely die Maria Jennings Lathrops Oma ist) 이름생년월일사망날짜 사람성별 자식 의 nom individu enfant de date de décès date de naissance date de baptême sexe …

Obituaries: Some Examples and Interesting Issues

Summary and Conclusion WoK-HD – Web of Knowledge – Superimposed over historical documents To build and deploy successfully: – Better, more cost-effective extraction – Integration, organization, & record linkage – Efficient implementation – Internationalization