1 Automating the Extraction of Domain Specific Information from the Web
A Case Study for the Genealogical Domain
Troy Walker
Thesis Proposal, January 2004
Research funded by NSF
2 Genealogical Information on the Web
- Hundreds of thousands of sites
  - Some professional (Ancestry.com, Familysearch.org)
  - Mostly hobbyist (203,200 indexed by Cyndislist.com)
- Search engines
  - “Walker genealogy” on Google: 199,000 results
  - At 1 page/minute: about 5 months to go through them all
- Why not enlist the help of a computer?
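The browsing estimate on this slide can be checked with a few lines of arithmetic. The sketch below assumes nonstop reading, 24 hours a day, and 30-day months; the slide states neither assumption, only the result count and the reading rate.

```python
# Rough check of the "5 months" estimate for reading every search result.
# Assumptions (not from the slide): nonstop reading and 30-day months.
RESULTS = 199_000        # Google results for "Walker genealogy" (from the slide)
MINUTES_PER_PAGE = 1     # reading rate (from the slide)

total_minutes = RESULTS * MINUTES_PER_PAGE
total_days = total_minutes / (60 * 24)
total_months = total_days / 30

print(f"{total_days:.0f} days, about {total_months:.1f} months")
```

Under these assumptions the total comes out a little over four and a half months, which matches the slide's rounded figure.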
3 Problems
- No standard way of presenting data
  - Text formatted with HTML tags
  - Tables
  - Forms to access information
- Sites have differing schemas
4 Proposed Solution
- Based on Ontos and other work done by the BYU Data Extraction Group (DEG)
- Able to extract from:
  - Single-record or multiple-record documents
  - Tables
  - Forms
- Scalable and robust to changes in pages
- Easily adaptable to other domains
5 Text
6 Tables
7 Forms
8 Forms
9 System Overview
[Diagram of system components: User Query, URL List, URL Selector, Document Retriever and Structure Recognizer, Form Engine, Table Engine, Single- or Multiple-Record Engine, Data Constrainer, Result Filter, Result Presenter, Ontology]
10 User Query
- Generated from ontology
- Generated once per application domain
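One way to picture a query form generated from an ontology is sketched below. The dictionary layout, the function name `build_query_form`, and the choice of one text input per attribute are all hypothetical; the slides do not describe how the real ontology is stored or how the form is rendered. The attribute names are those that appear in the URL filter examples elsewhere in this deck.

```python
from html import escape

# Hypothetical, simplified ontology: an entity name plus its attribute names.
GENEALOGY_ONTOLOGY = {
    "entity": "Person",
    "attributes": ["Name", "Death Date", "Burial Location"],
}

def build_query_form(ontology: dict) -> str:
    """Return an HTML form with one text input per ontology attribute."""
    rows = []
    for attr in ontology["attributes"]:
        field = attr.lower().replace(" ", "_")  # e.g. "Death Date" -> "death_date"
        rows.append(
            f'  <label>{escape(attr)} <input type="text" name="{escape(field)}"></label>'
        )
    lines = ['<form method="get">', *rows,
             '  <input type="submit" value="Search">', "</form>"]
    return "\n".join(lines)

form_html = build_query_form(GENEALOGY_ONTOLOGY)
print(form_html)
```

Because the form depends only on the ontology, it would need to be built once per application domain, as the slide notes; swapping in a different ontology dictionary yields a different form with no code changes.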
11 User Query
12 URL List and URL Selector
- Contains genealogy URLs
- Searching every URL would take too much time
- Select likely URLs
- Distribute document processing using DOGMA
13 URL List and Document Retriever
URL                             Filter
main.htm?lfl=adv
hs/cgi-bin/deaths.cgi           Death Date >
walker/johngene/johngenes.htm   Name: Bates, Boyle, Damon, Eliot, … Walker, Woodsworth
on/cedarcem.htm                 Burial Location: Thomaston, GA
enealogy/LISTS/Adams.html       Name: Adams
enealogy/LISTS/Walker.html      Name: Walker
enealogy/LISTS/Warley.html      Name: Warley
~gemmell/walkdesc.htm           Name: Walker
place/Kemp/f html               Name: Anderson, Burt, Summers, Walker
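The URL and filter pairs on this slide suggest how likely URLs could be selected for a query. The sketch below is an assumption about that logic, not the proposal's actual selector: the dictionary layout and `select_urls` are hypothetical, the URL fragments are copied as they appear on the slide (where they are truncated), and entries whose filter is incomplete on the slide are left out.

```python
# URL fragments and filter values are copied from the slide as shown there.
# The data structure and matching rule are assumptions made for illustration.
URL_FILTERS = {
    "main.htm?lfl=adv": {},  # no filter shown on the slide: always a candidate
    "on/cedarcem.htm": {"Burial Location": {"Thomaston, GA"}},
    "enealogy/LISTS/Adams.html": {"Name": {"Adams"}},
    "enealogy/LISTS/Walker.html": {"Name": {"Walker"}},
    "enealogy/LISTS/Warley.html": {"Name": {"Warley"}},
    "~gemmell/walkdesc.htm": {"Name": {"Walker"}},
}

def select_urls(url_filters: dict, query: dict) -> list:
    """Keep URLs whose recorded filters do not rule out the query."""
    selected = []
    for url, filters in url_filters.items():
        # A URL is ruled out only if it records values for a queried
        # attribute and the queried value is not among them.
        ruled_out = any(
            attr in filters and value not in filters[attr]
            for attr, value in query.items()
        )
        if not ruled_out:
            selected.append(url)
    return selected

likely = select_urls(URL_FILTERS, {"Name": "Walker"})
print(likely)
```

In this sketch a URL with no filter on the queried attribute stays in the candidate set, since nothing is known that excludes it. A stricter selector could drop such URLs to save time at the cost of missing some results.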
14 Document Structure Recognizer
- Requests analysis from each Data Extraction Engine
- Selects appropriate method
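The recognizer's two steps (ask each engine for an analysis, then pick a method) can be sketched as a scoring contest. Everything concrete here is hypothetical: the slides do not say what an engine's analysis returns, so this sketch assumes each engine reports a numeric score, and the tag-counting heuristics are placeholders rather than the engines' real tests.

```python
import re

# Placeholder heuristics; the real engines' analyses are not described.
def form_score(html: str) -> float:
    """High score if the page contains an HTML form."""
    return 1.0 if re.search(r"<form\b", html, re.IGNORECASE) else 0.0

def table_score(html: str) -> float:
    """Score grows with the number of table rows, capped below the form score."""
    rows = len(re.findall(r"<tr\b", html, re.IGNORECASE))
    return min(rows / 10.0, 0.9)

def text_score(html: str) -> float:
    """Constant fallback so plain-text pages always have a candidate engine."""
    return 0.3

ENGINES = {"form": form_score, "table": table_score, "text": text_score}

def choose_engine(html: str):
    """Ask every engine for a score and return the best engine with all scores."""
    scores = {name: analyze(html) for name, analyze in ENGINES.items()}
    best = max(scores, key=scores.get)
    return best, scores

engine, scores = choose_engine("<table>" + "<tr><td>x</td></tr>" * 5 + "</table>")
print(engine, scores)
```

Keeping the engines behind a common scoring interface means a new extraction engine could be added by registering one more function, without changing the selection step.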
15 Data Extraction Engines
- Text
  - Improved record separation
  - Ability to handle single-record pages
- Table
- Forms
16 Data Constrainer
- Selects attribute/value pairs
- Fits data to ontology
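A minimal sketch of what fitting candidate attribute/value pairs to an ontology might involve is shown below. It assumes, without support from the slides, that each candidate carries a confidence score and that the ontology limits how many values an attribute may have per record; `ONTOLOGY_LIMITS` and `constrain` are hypothetical names.

```python
# Hypothetical constraint table: maximum number of values per attribute in a record.
ONTOLOGY_LIMITS = {"Name": 1, "Death Date": 1, "Burial Location": 1}

def constrain(candidates, limits):
    """Build a record from (attribute, value, score) candidates.

    Candidates are considered from highest to lowest score. A candidate is
    dropped if its attribute is not in the ontology or is already full.
    """
    record = {}
    for attr, value, _score in sorted(candidates, key=lambda c: c[2], reverse=True):
        limit = limits.get(attr)
        if limit is None:
            continue  # attribute is not part of the ontology
        values = record.setdefault(attr, [])
        if len(values) < limit:
            values.append(value)
    return record

candidates = [
    ("Name", "John Walker", 0.9),
    ("Name", "Walker Cemetery", 0.4),  # weaker competing match for the same attribute
    ("Death Date", "1902", 0.8),
    ("Occupation", "farmer", 0.7),     # not in the ontology, so it is discarded
]
record = constrain(candidates, ONTOLOGY_LIMITS)
print(record)
```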
17 Result Filter
- Fits data to query
- Returns to central Result Presenter
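Fitting extracted data to the query can be illustrated as a simple record filter. The matching rule used here (case-insensitive substring match on every queried attribute) is an assumption chosen for brevity; the slides do not specify how records are compared with the query.

```python
def matches(record: dict, query: dict) -> bool:
    """True if every queried attribute is present and contains the wanted text."""
    for attr, wanted in query.items():
        value = record.get(attr)
        if value is None or wanted.lower() not in value.lower():
            return False
    return True

def filter_results(records: list, query: dict) -> list:
    """Keep only the extracted records that satisfy the user's query."""
    return [r for r in records if matches(r, query)]

# Illustrative records, not data from the proposal.
records = [
    {"Name": "John Walker", "Death Date": "1902"},
    {"Name": "Mary Adams", "Death Date": "1902"},
    {"Name": "Ann Walker"},  # no death date was extracted for this record
]
hits = filter_results(records, {"Name": "walker", "Death Date": "1902"})
print(hits)
```

Records missing a queried attribute are rejected in this sketch. A more lenient filter could keep them and flag the gap, trading precision for recall.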
18 Result Presenter
- Creates an XML Schema from the ontology
- Presents results to user
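To make the ontology-to-schema step concrete, the sketch below builds a small XML Schema with Python's standard `xml.etree.ElementTree` module. The ontology dictionary, the function `ontology_to_xsd`, and the mapping (one optional string element per attribute) are hypothetical simplifications; the actual mapping used in the proposal is not described on the slide.

```python
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"
ET.register_namespace("xs", XS)  # serialize schema elements with the "xs" prefix

# Hypothetical, simplified ontology with XML-safe attribute names.
ONTOLOGY = {"entity": "Person", "attributes": ["Name", "DeathDate", "BurialLocation"]}

def ontology_to_xsd(ontology: dict) -> str:
    """Map the entity to a complex type with one optional string element per attribute."""
    schema = ET.Element(f"{{{XS}}}schema")
    entity = ET.SubElement(schema, f"{{{XS}}}element", name=ontology["entity"])
    complex_type = ET.SubElement(entity, f"{{{XS}}}complexType")
    sequence = ET.SubElement(complex_type, f"{{{XS}}}sequence")
    for attr in ontology["attributes"]:
        ET.SubElement(sequence, f"{{{XS}}}element",
                      name=attr, type="xs:string", minOccurs="0")
    return ET.tostring(schema, encoding="unicode")

xsd = ontology_to_xsd(ONTOLOGY)
print(xsd)
```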
19 Result Presenter
20 Evaluation
- Scalability
  - Query on large URL list
  - Experiment on number of PCs
- Precision and recall
  - Recall difficult to determine
  - Query on small URL list
- Adaptability
  - Car ontology
  - Small URL list
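For reference, precision and recall can be computed as in the sketch below, using their standard definitions. The fact sets are illustrative, not results from the proposal. Recall needs the full set of relevant facts, which has to be identified by hand; that is presumably why the slide pairs it with a small URL list.

```python
def precision_recall(extracted: set, relevant: set):
    """Standard definitions: precision = TP / |extracted|, recall = TP / |relevant|."""
    true_positives = len(extracted & relevant)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall

# Illustrative facts only: four were extracted, three are actually relevant.
extracted = {"walker-1902", "walker-1850", "adams-1870", "burt-1911"}
relevant = {"walker-1902", "walker-1850", "walker-1788"}

precision, recall = precision_recall(extracted, relevant)
print(f"precision={precision:.2f} recall={recall:.2f}")
```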
21 Conclusion
- Integrates and builds on previous DEG work
- Extracts from:
  - Single- or multiple-record documents
  - Tables
  - Forms
- Scalable
  - Only searches probable pages
  - Distributed with DOGMA
- Robust to changes in pages
- Ontology based: easily adapted to other domains