Automating the Extraction of Domain-Specific Information from the Web: A Case Study for the Genealogical Domain. Troy Walker, Spring Research Conference.


1 Automating the Extraction of Domain-Specific Information from the Web: A Case Study for the Genealogical Domain
Troy Walker
Spring Research Conference, March 20, 2004
Research funded by NSF grant #IIS-0083127

2 Genealogical Information on the Web
- Hundreds of thousands of sites
  - Some professional (Ancestry.com, Familysearch.org)
  - Mostly hobbyist (203,200 indexed by Cyndislist.com)
- Search engines
  - “Walker genealogy” on Google: 199,000 results
  - 1 page/minute = 5 months to go through
- Why not enlist the help of a computer?

3 Problems
- No standard way of presenting data
- Sites have differing schemas

4 Proposed Solution
- Based on Ontos and other work done by the BYU Data Extraction Group (DEG)
- Able to extract from:
  - Single-record documents
  - Simple multiple-record documents
  - Complex multiple-record documents
- Robust to changes in pages
- Easily adaptable to other domains

5 Person Ontology

6 Record Separation
- Separating data related to each person
- Previous technique
  - Combines many heuristics
  - Has problems
    - Assumes multiple records
    - Must be simple separation

7 Single-Record Document

8 Simple Multiple-Record Document

9 Complex Multiple-Record Document

10 Vector Space Modeling
- Ontology vector: {0.8, 0.99, 0.95, 0.9, 0.6, 0.5, 0.6, 3.0}
- Compare to candidate records
  - Cosine measure
  - Magnitude measure
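The comparison on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in the talk. The cosine measure is the standard cosine similarity. The slides do not define the magnitude measure, so the version here (length of the candidate vector divided by length of the ontology vector) is an assumption, as are the function and variable names. The candidate vector is one of those shown on the following slide.

```python
import math

def cosine_measure(ontology, candidate):
    """Standard cosine similarity between two equal-length vectors."""
    dot = sum(o * c for o, c in zip(ontology, candidate))
    norm_o = math.sqrt(sum(o * o for o in ontology))
    norm_c = math.sqrt(sum(c * c for c in candidate))
    if norm_o == 0 or norm_c == 0:
        return 0.0  # an empty vector matches nothing
    return dot / (norm_o * norm_c)

def magnitude_measure(ontology, candidate):
    """ASSUMPTION: ratio of candidate length to ontology length.

    The slides name a 'magnitude measure' without defining it; this
    ratio is only one plausible reading.
    """
    norm_o = math.sqrt(sum(o * o for o in ontology))
    norm_c = math.sqrt(sum(c * c for c in candidate))
    return norm_c / norm_o if norm_o else 0.0

# Ontology vector from slide 10 and a candidate vector from slide 11.
ontology_vector = [0.8, 0.99, 0.95, 0.9, 0.6, 0.5, 0.6, 3.0]
candidate_vector = [0, 141, 89, 76, 0, 0, 48, 23]

cos = cosine_measure(ontology_vector, candidate_vector)
mag = magnitude_measure(ontology_vector, candidate_vector)
print(f"cosine = {cos:.3f}, magnitude ratio = {mag:.1f}")
```

Under this reading, a cosine near 1 suggests the candidate has the same mix of fields as a single expected record, while a large magnitude ratio suggests the candidate contains many records' worth of data.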

11 Vector Space Modeling
<!DOCTYPE…><html> … …header… …
{0, 0, 0, 0, 0, 0, 0, 0}
{0, 141, 89, 76, 0, 0, 48, 23}
{0, 1, 0, 0, 0, 0, 0, 0}
{0, 140, 89, 76, 0, 0, 48, 23}
{0, 0, 0, 0, 0, 0, 0, 0}
{0, 138, 88, 76, 0, 0, 48, 23}…

12 Improvements
- Differing schemas
  - Low cosine measures
  - Discarded data
  - Prune dimensions
    {0.8, 0.99, 0.95, 0.9, 0.6, 0.5, 0.6, 3.0}
    {0.0, 141.0, 89.0, 76.0, 0.0, 0.0, 48.0, 23.0}
- Richness of data in single-record documents
  - High magnitude measure
  - Higher magnitude to split documents
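One way to read "prune dimensions" is sketched below: dimensions in which the candidate has no values at all are dropped from both vectors before the cosine is taken, so a site whose schema omits some fields is not penalized for them. The slides do not say how the dimensions to prune are chosen, so this rule, along with the names `prune_dimensions` and `cosine_measure`, is an assumption made for illustration.

```python
import math

def cosine_measure(a, b):
    """Standard cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def prune_dimensions(ontology, candidate):
    """ASSUMPTION: keep only dimensions where the candidate is non-zero."""
    kept = [i for i, value in enumerate(candidate) if value != 0]
    return [ontology[i] for i in kept], [candidate[i] for i in kept]

# The two vectors shown on the slide.
ontology_vector = [0.8, 0.99, 0.95, 0.9, 0.6, 0.5, 0.6, 3.0]
candidate_vector = [0.0, 141.0, 89.0, 76.0, 0.0, 0.0, 48.0, 23.0]

full_cos = cosine_measure(ontology_vector, candidate_vector)
pruned_ontology, pruned_candidate = prune_dimensions(ontology_vector, candidate_vector)
pruned_cos = cosine_measure(pruned_ontology, pruned_candidate)

print(f"cosine before pruning: {full_cos:.3f}")
print(f"cosine after pruning:  {pruned_cos:.3f}")
```

Pruning can only shorten the ontology vector while leaving the dot product unchanged, so under this rule the cosine never decreases.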

13 Presenting Results

14 Preliminary Results
- Semi-structured text
  - 10 single-record documents
  - 3 simple documents containing 268 records
  - 3 complex documents containing 266 records
- Precision and recall calculated on record separation
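For reference, precision and recall on record separation can be computed as sketched below, using the usual information-retrieval definitions. The slides do not report raw counts, so the numbers in the example call are hypothetical and chosen only to show the arithmetic; the function name and parameters are likewise illustrative.

```python
def precision_recall(correct, found, actual):
    """Usual IR definitions applied to record separation.

    correct: records the system separated correctly
    found:   records the system reported in total
    actual:  records truly present in the documents
    """
    precision = correct / found if found else 0.0
    recall = correct / actual if actual else 0.0
    return precision, recall

# HYPOTHETICAL counts, not taken from the talk.
precision, recall = precision_recall(correct=90, found=95, actual=100)
print(f"precision = {precision:.1%}, recall = {recall:.1%}")
```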

15 Record Separation

          Recall   Precision
Single    100%     94.1%
Simple    94.7%    97.3%
Complex   88.3%    93.6%

16 Conclusion
- Integrate, build on previous DEG work
- Accurate record separation
  - Average recall: 94.3%
  - Average precision: 95.0%
- Ontology-based: easily adapted to other domains
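The averages quoted here are consistent with an unweighted mean over the three document types in the record separation table. The short check below assumes that is how they were obtained; the slides do not state the averaging method.

```python
from statistics import mean

# Per-category results from the record separation table (percent).
recall = {"Single": 100.0, "Simple": 94.7, "Complex": 88.3}
precision = {"Single": 94.1, "Simple": 97.3, "Complex": 93.6}

# ASSUMPTION: unweighted mean across the three document types.
avg_recall = mean(recall.values())
avg_precision = mean(precision.values())

print(f"average recall:    {avg_recall:.1f}%")
print(f"average precision: {avg_precision:.1f}%")
```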

