1 Automating the Extraction of Genealogical Information from the Web GeneTIQS Troy Walker & David W. Embley Family History Technology Conference March.

Slides:

Advertisements

Similar presentations

Ziv Bar-YossefMaxim Gurevich Google and Technion Technion TexPoint fonts used in EMF. Read the TexPoint manual before you delete this box.: AA A A AA.

Advertisements

Growing the Semantic Web By Charla Woodbury June 11, 2004.

Contextual Advertising by Combining Relevance with Click Feedback D. Chakrabarti D. Agarwal V. Josifovski.

Computer Science Research for Family History and Genealogy David W. Embley Heath Nielson, Mike Rimer, Luke Hutchison, Ken Tubbs, Doug Kennard, Tom Finnigan.

Enabling Search for Facts and Implied Facts in Historical Documents David W. Embley, Stephen W. Liddle, Deryle W. Lonsdale, Spencer Machado, Thomas Packer,

Semi-Supervised, Knowledge-Based Information Extraction for the Semantic Web Thomas L. Packer Funded in part by the National Science Foundation. 1.

Text Classification With Support Vector Machines

Traditional Information Extraction -- Summary CS652 Spring 2004.

Modern Information Retrieval

A Fully Automated Object Extraction System for the World Wide Web a paper by David Buttler, Ling Liu and Calton Pu, Georgia Tech.

Recognizing Ontology-Applicable Multiple-Record Web Documents David W. Embley Dennis Ng Li Xu Brigham Young University.

6/17/20151 Table Structure Understanding by Sibling Page Comparison Cui Tao Data Extraction Group Department of Computer Science Brigham Young University.

Gimme’ The Context: Context- driven Automatic Semantic Annotation with CPANKOW Philipp Cimiano et al.

Automatically Identifying Record Patterns from the Extracted Data Fields of Genealogical Microfilm Kenneth Tubbs David W. Embley.

Retrieval Evaluation. Brief Review Evaluation of implementations in computer science often is in terms of time and space complexity. With large document.

ER 2002BYU Data Extraction Group Automatically Extracting Ontologically Specified Data from HTML Tables with Unknown Structure David W. Embley, Cui Tao,

Query Rewriting for Extracting Data Behind HTML Forms Xueqi Chen, 1 David W. Embley 1 Stephen W. Liddle 2 1 Department of Computer Science 2 Rollins Center.

1 Automating the Extraction of Domain-Specific Information from the Web A Case Study for the Genealogical Domain Troy Walker Thesis Defense November 19,

Business Sales Expert System. Intro As the Internet becomes more accessible, it is important to build more sophisticated systems on the web. These sophisticated.

1 Querying the Web for Genealogical Information Troy Walker Spring Research Conference 2003 Research funded by NSF.

Scheme Matching and Data Extraction over HTML Tables from Heterogeneous Sources Cui Tao March, 2002 Founded by NSF.

Ontology-Based Free-Form Query Processing for the Semantic Web Mark Vickers Brigham Young University MS Thesis Defense Supported by:

Results of Using an Efficient Algorithm to Query Disjunctive Genealogical Data Lars E. Olson and David W. Embley DEG Lab BYU Computer Science Dept. Supported.

Xiaomeng Su & Jon Atle Gulla Dept. of Computer and Information Science Norwegian University of Science and Technology Trondheim Norway June 2004 Semantic.

Information Retrieval

Table Interpretation by Sibling Page Comparison Cui Tao & David W. Embley Data Extraction Group Department of Computer Science Brigham Young University.

7/15/20151 A Binary-Categorization Approach for Classifying Multiple-Record Web Documents Using a Probabilistic Retrieval Model Department of Computer.

BYU Data Extraction Group Funded by NSF1 Brigham Young University Li Xu Source Discovery and Schema Mapping for Data Integration.

1 Automating the Extraction of Domain Specific Information from the Web A Case Study for the Genealogical Domain Troy Walker Thesis Proposal January 2004.

1 Automating the Extraction of Domain-Specific Information from the Web A Case Study for the Genealogical Domain Troy Walker Spring Research Conference.

Query Rewriting for Extracting Data Behind HTML Forms Xueqi Chen Department of Computer Science Brigham Young University March 31, 2004 Funded by National.

18th International Conference on Database and Expert Systems Applications Journey to the Centre of the Star: Various Ways of Finding Star Centers in Star.

1/16 Final project: Web Page Classification By: Xiaodong Wang Yanhua Wang Haitang Wang University of Cincinnati.

Online Research Nothing has revolutionized genealogy.

How the Computer and the Internet Have Changed Genealogical Research Larry D. Crummer Lib15, Spring 2004 Joy Chase, Instructor.

November 4, How do I start when I have no information?  Create a Family Group Sheet with the following:  name  birth date, place  marriage date,

Imagine Everything is Before You: Past, Present, and Future Paper and Demonstration for the 2014 Family History Technology BYU Dr. Brand Niemann.

Collaborative Research Assistant 2007 Family History Technology Conference John Finlay Christopher Stolworthy Daniel Parker.

April 14, 2003Hang Cui, Ji-Rong Wen and Tat- Seng Chua 1 Hierarchical Indexing and Flexible Element Retrieval for Structured Document Hang Cui School of.

Giorgos Giannopoulos (IMIS/”Athena” R.C and NTU Athens, Greece) Theodore Dalamagas (IMIS/”Athena” R.C., Greece) Timos Sellis (IMIS/”Athena” R.C and NTU.

Ontology-based Information Extraction with a Cognitive Agent Peter Lindes 1, Deryle Lonsdale, David Embley Brigham Young University AAAI Now at.

4 FamilyHistory.com is a member of Ancestry.com View as a Free Front End to Ancestry.com.

McLean HIGHER COMPUTER NETWORKING Lesson 7 Search engines Description of search engine methods.

ELIJAH: Extracting Genealogy from the Web By David Barney and Rachel Lee WhizBang! Labs.

CONCLUSION & FUTURE WORK Normally, users perform search tasks using multiple applications in concert: a search engine interface presents lists of potentially.

Searching the web Enormous amount of information –In 1994, 100 thousand pages indexed –In 1997, 100 million pages indexed –In June, 2000, 500 million pages.

Cost-Effective Information Extraction from Lists in OCRed Historical Documents Thomas Packer and David W. Embley Brigham Young University FamilySearch.

Scientific Workflow systems: Summary and Opportunities for SEEK and e-Science.

What is Google? Google is a popular web search engine— And learning techniques saves time and results in rewarding research.

Search Engines: A History  First search engine was Veronica for the Gopher network  1991 Gopher  After Gopher disappeared, the first one for modern.

 There is a Family History section on the BYU- I Library home page.  This site includes:  vital records for eastern and western states  Death indexes.

 What Qualifies Individuals for Temple Work :  Deceased for 1 year and 1 day  A date and a place for a vital event (birth, marriage, death, burial,

People and Families of the Bible Nathan Friedly. Overview Introduction Key Ideas Description and use Deliverables Demonstration Conclusion.

Xiaoying Gao Computer Science Victoria University of Wellington COMP307 NLP 4 Information Retrieval.

The anatomy of a Large-Scale Hypertextual Web Search Engine.

GreenFIE-HD: A “Green” Form-based Information Extraction Tool for Historical Documents Tae Woo Kim.

1 CS 430: Information Discovery Lecture 8 Collection-Level Metadata Vector Methods.

Toward Entity Retrieval over Structured and Text Data Mayssam Sayyadian, Azadeh Shakery, AnHai Doan, ChengXiang Zhai Department of Computer Science University.

Instance Discovery and Schema Matching With Applications to Biological Deep Web Data Integration Tantan Liu, Fan Wang, Gagan Agrawal {liut, wangfa,

UIC at TREC 2006: Blog Track Wei Zhang Clement Yu Department of Computer Science University of Illinois at Chicago.

CS791 - Technologies of Google Spring A Webbased Kernel Function for Measuring the Similarity of Short Text Snippets By Mehran Sahami, Timothy.

On the Relation Between Simulation-based and SAT-based Diagnosis CMPE 58Q Giray Kömürcü Boğaziçi University.

19 th Nov 2008 U3A Family History Group Developing your Family History Research -Part 2. Parish Records & the IGI.

Automated Information Retrieval

Pennsylvania Research

Genealogy And The Internet

Personal Assistants for the Web: An MIT Perspective

Family History Technology Workshop

Combining Keyword and Semantic Search for Best Effort Information Retrieval Andrew Zitzelberger 1.

Genealogy with the Internet

Presentation transcript:

1 Automating the Extraction of Genealogical Information from the Web GeneTIQS Troy Walker & David W. Embley Family History Technology Conference March 25, 2004 Research funded by NSF grant #IIS

2 Genealogical Information on the Web Hundreds of thousands of sites Hundreds of thousands of sites Some professional (Ancestry.com, Familysearch.org) Some professional (Ancestry.com, Familysearch.org) Mostly hobbyist (203,200 indexed by Cyndislist.com) Mostly hobbyist (203,200 indexed by Cyndislist.com) Search engines Search engines “Walker genealogy” on Google: 199,000 results “Walker genealogy” on Google: 199,000 results 1 page/minute = 5 months to go through 1 page/minute = 5 months to go through Why not enlist the help of a computer? Why not enlist the help of a computer?

3 Problems No standard way of presenting data No standard way of presenting data Sites have differing schemas Sites have differing schemas Web pages change Web pages change New pages continuously come on line New pages continuously come on line

4 GeneTIQS Based on work done by BYU DEG Based on work done by BYU DEG Able to extract from: Able to extract from: Single-record documents Single-record documents Simple multiple-record documents Simple multiple-record documents Complex multiple-record documents Complex multiple-record documents Robust to changes in pages Robust to changes in pages Immediately works for new pages Immediately works for new pages

5 Person Ontology

6 Value Matchers

7 Record Separation Separating data related to each person Separating data related to each person Previous technique Previous technique Combines many heuristics Combines many heuristics Has problems Has problems Assumes multiple records Assumes multiple records Must be simple separation Must be simple separation

8 Single-Record Document

9 Simple Multiple-Record Document

10 Complex Multiple-Record Document

11 Vector Space Modeling Ontology Vector Ontology Vector Compare to candidate records Compare to candidate records Cosine measure Cosine measure Magnitude measure Magnitude measure

12 Ontology Vector { 0.8, 0.99, 0.95, 0.9, 0.6, 0.5, 0.6, 3.0}

13 Vector Space Modeling <!DOCTYPE…><html> … …header… …header… … {0, 0, 0, 0, 0, 0, 0, 0} {0, 141, 89, 76, 0, 0, 48, 23} {0, 1, 0, 0, 0, 0, 0, 0} {0, 1, 0, 0, 0, 0, 0, 0} {0, 140, 89, 76, 0, 0, 48, 23} {0, 140, 89, 76, 0, 0, 48, 23} {0, 0, 0, 0, 0, 0, 0, 0} {0, 0, 0, 0, 0, 0, 0, 0} {0, 138, 88, 76, 0, 0, 48, 23} {0, 138, 88, 76, 0, 0, 48, 23}… Gender Christening Burial Marriage Relation Birth Name Death

14 Improvements Differing schemas Differing schemas Low cosine measures Low cosine measures Discarded data Discarded data Prune dimensions Prune dimensions {0.8, 0.99, 0.95, 0.9, 0.6, 0.5, 0.6, 3.0} {0.0, 141.0, 89.0, 76.0, 0.0, 0.0, 48.0, 23.0} Richness of data in single-record documents Richness of data in single-record documents High magnitude measure High magnitude measure Higher magnitude to split documents Higher magnitude to split documents

15 Demonstration

16 Presenting Results

17 Preliminary Results Semi-structured Text Semi-structured Text 10 single-record documents 10 single-record documents 3 simple documents containing 268 records 3 simple documents containing 268 records 3 complex documents containing 266 records 3 complex documents containing 266 records Precision and recall for record separation Precision and recall for record separation

18 Record Separation RecallPrecision Single100%94.1% Simple94.7%97.3% Complex88.3%93.6%

19 Conclusion Integrate, build on previous DEG work Integrate, build on previous DEG work Accurate record separation Accurate record separation Average recall: 94.3% Average recall: 94.3% Average precision: 95.0% Average precision: 95.0% Ontology based Ontology based Robust to changes in pages Robust to changes in pages Immediately works with new pages Immediately works with new pages