Semiautomatic Generation of Resilient Data-Extraction Ontologies Yihong Ding Data Extraction Group Brigham Young University Sponsored by NSF.

Presentation transcript:

Semiautomatic Generation of Resilient Data-Extraction Ontologies Yihong Ding Data Extraction Group Brigham Young University Sponsored by NSF

2 Wrapper-Driven Data Extraction
- Web data extraction
  - Obtain user-specified information from Web documents
- Wrapper
  - Convert implicit HTML data into explicit formatted data
  - Data-source-specific, high performance
- Examples
  - SoftMealy, STALKER, WIEN, Omini, ROADRUNNER, …

3 Common Problem of Wrappers
[Figure: SoftMealy finite-state transducer for the sample text "Mani Chandy, Professor of Computer Science and Executive Officer for Computer Science", with transitions labeled ? / ε, ? / next_token, s / ε, s / “U=” + next_token, and s / “N=” + next_token]
- Resiliency: fixed domain, changeable layout
- Scalability: unchanged existing wrapper, extendable domain and functions

4 Data-Extraction Ontology
- Structure
  - Object sets
  - Relationship sets
  - Participation constraints
  - Data frames
- Pros: resilient and scalable
- Cons: hard to create
  - Knowledge requirements
  - Tedious and error-prone work
- Example
    Car [-> object];
    Car [0:1] has Make [1:*];
    Make matches [10] constant
      { extract "\baudi\b"; };
    end;
    Car [0:1] has Model [1:*];
    Model matches [25] constant
      { extract "80"; context "\baudi\S*\s*80\b"; };
    end;
    Car [0:1] has Mileage [1:*];
    Mileage matches [8] constant
      { extract "\b[1-9]\d{0,2}k"; substitute "[kK]" -> "000"; };
    end;
    Car [0:1] has Price [1:*];
    Price matches [8] constant
      { extract "[1-9]\d{3,6}"; context "\$[1-9]\d{3,6}"; };
    end;
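The data frames on this slide can be read as small regular-expression recognizers. The following Python sketch is one possible reading, not the extraction engine used in the thesis: it assumes that `context` limits where `extract` may match and that `substitute` rewrites the extracted value. The `DATA_FRAMES` table, the `extract` function, and the sample advertisement are hypothetical.

```python
import re

# Hypothetical Python rendering of the slide's data-frame rules.
# Each entry: (extract pattern, optional context pattern, optional substitution).
DATA_FRAMES = {
    "Make": (r"\baudi\b", None, None),
    "Model": (r"80", r"\baudi\S*\s*80\b", None),
    "Mileage": (r"\b[1-9]\d{0,2}k", None, (r"[kK]", "000")),
    "Price": (r"[1-9]\d{3,6}", r"\$[1-9]\d{3,6}", None),
}

def extract(text):
    """Return {object set: [values]} recognized in text (case-insensitive)."""
    found = {}
    for name, (pattern, context, substitute) in DATA_FRAMES.items():
        # Assumption: when a context is given, extraction happens only inside it.
        if context:
            regions = [m.group(0) for m in re.finditer(context, text, re.I)]
        else:
            regions = [text]
        values = []
        for region in regions:
            for m in re.finditer(pattern, region, re.I):
                value = m.group(0)
                if substitute:
                    value = re.sub(substitute[0], substitute[1], value)
                values.append(value)
        if values:
            found[name] = values
    return found

# Made-up advertisement text, used only to exercise the rules.
ad = "1995 Audi 80, 85k miles, $4500"
result = extract(ad)
print(result)
```

In this reading, the context pattern is what keeps the bare `80` of the Model rule from matching arbitrary numbers elsewhere in the advertisement.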

5 Motif of Ontology Generation
[Diagram labels: Human Brain, Concepts of Interest, Concepts with Relations, Data-Extraction Ontology, Knowledge Base, Sample Documents]

6 Thesis Statement
- Given: knowledge base
- Input: sample Web pages of interest
- Output: a data-extraction ontology for the domain of interest
- Between input and output: this is the work of this thesis

7 Ontology-Generation Procedure
[Flow diagram labels]
- Knowledge side: Knowledge Sources, pre-processing, Integrated Knowledge Base
- Document side: training documents, pre-processing, clean records
- Generation steps: Concept Selection, Relation Retrieval, Constraint Discovery, Data Extraction Ontology (interact if necessary)
- Evaluation: test documents, Extraction Processing, Results Storage, Result Evaluation

8 Primary Knowledge Source
- Requirements
  - Available
  - General in coverage
  - Rich in meaningful relationships
  - Encoded in or easily converted to XML
- Mikrokosmos (μK) Ontology
  - Developed by NMSU jointly with the U.S. DoD
  - Contains over 5000 concepts
  - Averages 14 links per concept
  - Represented in XML format

9 Integrated Knowledge Base
[Diagram: the KNOWLEDGE BASE combines the Data-Frame Library, the μK Ontology, the Synonym Dictionary (WordNet), and Lexicons]

11 Domain Specification
- Training documents
  - Data-rich
  - Narrow in topic breadth
- Preprocessing

12 Example – Car Advertisement
Record 1: 00 GrandAM SE, Sunfire Red, CD, AC, PW, PL Great Condition, $10,800, Call
Record 2: 02 Buick Century Custom, Pwr Seat, Nada Retail 13,
Record 3: 02 Buick Century, lo mi, mint cond, $11, dlr# 2755
Record 4: 00 Buick Century Stk# HU7159 Green $9,319, To Apply By Phone, , OREM Utah

14 Concept Selection
- Selection strategies
  - Compare a string with the name of a concept
  - Compare a string with the values belonging to a concept
  - Apply data-frame recognizers to recognize a string
- Example record (matched against the KB): 00 Buick Century Stk# HU7159 Green $9,319, To Apply By Phone, , OREM Utah
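The three selection strategies can be sketched in Python as follows. This is a toy illustration under stated assumptions: `NAME_INDEX`, `VALUE_INDEX`, `RECOGNIZERS`, and `select_concepts` are hypothetical stand-ins for the knowledge base, and their contents are invented for the example rather than taken from the μK ontology.

```python
import re

# Toy knowledge base; every entry below is an illustrative assumption.
NAME_INDEX = {"phone": "PhoneNr", "price": "Price"}          # concept names
VALUE_INDEX = {"buick": "Make", "audi": "Make",               # concept values
               "green": "Color", "red": "Color"}
RECOGNIZERS = {                                               # data-frame recognizers
    "Price": re.compile(r"\$[1-9]\d{0,2}(?:,\d{3})*"),
    "Year": re.compile(r"^\d{2}\b"),
}

def select_concepts(record):
    """Return the set of concepts that any of the three strategies selects."""
    selected = set()
    for token in re.findall(r"[a-z]+", record.lower()):
        # Strategy 1: the string matches the name of a concept.
        if token in NAME_INDEX:
            selected.add(NAME_INDEX[token])
        # Strategy 2: the string matches a value belonging to a concept.
        if token in VALUE_INDEX:
            selected.add(VALUE_INDEX[token])
    # Strategy 3: a data-frame recognizer fires somewhere in the record.
    for concept, recognizer in RECOGNIZERS.items():
        if recognizer.search(record):
            selected.add(concept)
    return selected

record = "00 Buick Century Stk# HU7159 Green $9,319, To Apply By Phone, , OREM Utah"
concepts = select_concepts(record)
print(sorted(concepts))
```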

15 Concept Selection
- Reasons for conflict
  - Synonymy
  - Polysemy
- Conflict resolution
  - Same string, only one meaning
  - Favor longer over shorter
  - Context decides meaning
- Example record (matched against the KB): 02 Buick Century Custom, Pwr Seat, Nada Retail 13,
  - Price identified by keyword
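The "favor longer over shorter" rule can be illustrated with a short Python sketch. It assumes candidate matches are character spans and that the rule means dropping a shorter span that overlaps a longer one; `resolve_overlaps` and the concept labels are hypothetical, and the other two rules are not modeled here.

```python
def resolve_overlaps(matches):
    """matches: list of (start, end, concept). Keep the longer span on overlap."""
    kept = []
    # Visit the longest spans first; ties go to the earlier start position.
    for m in sorted(matches, key=lambda m: (-(m[1] - m[0]), m[0])):
        # Accept a span only if it does not overlap any span already kept.
        if all(m[1] <= k[0] or m[0] >= k[1] for k in kept):
            kept.append(m)
    return sorted(kept)

text = "02 Buick Century Custom, Pwr Seat"
long_start = text.find("Pwr Seat")
short_start = text.find("Seat")
matches = [
    (short_start, short_start + len("Seat"), "Furniture"),   # hypothetical label
    (long_start, long_start + len("Pwr Seat"), "Feature"),   # hypothetical label
]
resolved = resolve_overlaps(matches)
print(resolved)
```

Here the shorter match on "Seat" is discarded because it lies inside the longer match on "Pwr Seat".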

17 Relationship Retrieval KB
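Elsewhere in the deck, relationship sets carry dotted names such as IsA.ARTIFACT.CostOfProduction, and paths in the knowledge base are said to become relationship sets. One plausible reading is that relationship retrieval searches for a path between two selected concepts. The Python sketch below uses breadth-first search over a two-edge toy graph built from that one example; `KB_EDGES` and `find_path` are hypothetical, and the actual retrieval algorithm is not described on the slide.

```python
from collections import deque

# Toy fragment of a knowledge base: node -> [(link name, target node)].
# The two edges are inferred from the path name on the slides; this is an assumption.
KB_EDGES = {
    "AUTOMOBILE": [("IsA", "ARTIFACT")],
    "ARTIFACT": [("CostOfProduction", "PRICE")],
}

def find_path(source, target):
    """Breadth-first search; returns a dotted link.node.link name, or None."""
    queue = deque([(source, [])])
    seen = {source}
    while queue:
        node, path = queue.popleft()
        if node == target:
            # Drop the final node so the name ends with the last link.
            return ".".join(path[:-1])
        for link, nxt in KB_EDGES.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [link, nxt]))
    return None

relation = find_path("AUTOMOBILE", "PRICE")
print(relation)
```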

19 Constraint Discovery
- Training records
  - 02 Buick Century, lo mi, mint cond, green, pwr seat, $11, dlr#
  - Buick Century Stk# HU7159 Green $9,319, To Apply By Phone, , OREM Utah
- Discovered constraints: AUTOMOBILE [0:1] IsA.ARTIFACT.CostofProduction PRICE [1:1]
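A minimal Python sketch of how a participation constraint might be derived from training records, assuming the constraint is read off the number of times a concept occurs in each record. The rule used here (minimum 0 if any record lacks the concept, maximum * if any record has more than one) is an assumption for illustration, `discover_constraint` is a hypothetical name, and the counts are invented.

```python
def discover_constraint(counts):
    """counts: occurrences of one concept in each training record.

    Returns a 'min:max' participation constraint for the record's main object.
    Assumes counts is non-empty.
    """
    minimum = 0 if min(counts) == 0 else 1      # optional if any record lacks it
    maximum = "1" if max(counts) <= 1 else "*"  # many if any record repeats it
    return f"{minimum}:{maximum}"

# Hypothetical per-record counts for two concepts over four training records.
price_counts = [1, 0, 1, 1]
feature_counts = [4, 1, 2, 0]
print(discover_constraint(price_counts))    # 0:1
print(discover_constraint(feature_counts))  # 0:*
```

With only a handful of records, such a rule is sensitive to missing examples, which is consistent with the later remark that constraint errors stem from a lack of training examples.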

21 Ontology Generation
- concept nodes -> object sets
- paths -> relationship sets
- discovered constraints -> participation constraints
- concept recognizers -> data frames

22 Automatically Generated Ontology -- Car Advertisement
(01) {Automobile [-> object];}
(02) {Automobile [0:1] has Mileage [1:1];}
(03) {Automobile [0:1] IsA.ARTIFACT.CostOfProduction Price [1:1];}
(12) {Price [1:1] IsA.SCALARATTRIBUTE.MeasuredIn.MEASURINGUNIT.Subclasses Year [0:*];}
(20) {Automobile [0:1] relatesTo PhoneNr [1:*] relatesTo ArtifactPart [1:*] relatesTo Mileage [1:*] relatesTo Truck [1:*] relatesTo AudioMediaArtifact [1:*] relatesTo CommunicationDevice [1:*] relatesTo ControlEvent [1:*] relatesTo TravelEvent [1:*];}

24 Updating Strategies
- Remove all bad relationship sets
- Modify remaining incorrect relationship sets
  - Substitute incorrect object sets
  - Reduce long n-ary relationship sets
  - Fix participation constraints
- Adjust names or re-arrange sequences
- Add new relationship sets

25 Final Ontology
Car [-> object]
Car [0:1] has Year [1:*]
Car [0:1] has Mileage [1:*]
Car [0:1] has Price [1:*]
PhoneNr [1:*] is for Car [0:1]
PhoneNr [0:1] has Extension [1:*]
Car [0:*] has Feature [1:*]
Car [0:1] has Make [1:*]
Car [0:1] has Model [1:*]

26 Evaluation Criteria
- Basic measures
  - POG (Precision of Ontology Generation)
  - ROG (Recall of Ontology Generation)
- Human constraints
  - PROG (Pseudo-ROG)
  - Comparing with an expert-created ontology
- Knowledge base constraints
  - EPROG (Effective-PROG)
- Correctness dependency
  - DEPROG (Dependent-EPROG)
  - For example: relationship sets depend on object sets
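The slide names the measures without defining them. Assuming POG and ROG follow the standard definitions of precision and recall, applied to generated ontology components compared against an expert-created ontology, they could be computed as in this Python sketch. The function name and both sets are illustrative and are not the thesis's evaluation data; PROG, EPROG, and DEPROG are not modeled.

```python
def precision_recall(generated, expected):
    """Standard precision and recall over two sets of ontology components."""
    correct = generated & expected
    precision = len(correct) / len(generated) if generated else 0.0
    recall = len(correct) / len(expected) if expected else 0.0
    return precision, recall

# Illustrative object sets only; not taken from the reported results.
generated = {"Year", "Mileage", "Price", "PhoneNr", "Truck"}
expected = {"Year", "Mileage", "Price", "PhoneNr",
            "Extension", "Feature", "Make", "Model"}

pog, rog = precision_recall(generated, expected)
print(pog, rog)  # 4 of 5 generated are correct; 4 of 8 expected are found
```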

27 Evaluation Results

28 Discussion of Results
- Bottleneck: the system cannot generate what is not in the knowledge base
- Object sets
  - Concept-selection procedure works well
  - Desired concept not shown in training records
    - Rarely occurring concept -> not severe even if we don’t fix the error
    - Example: extension
  - Aggregation and union
    - USAddressCity, USAddressState, USAddressZipCode -> Location
    - CropPlant, AnimalProduct, FruitFoodStuff -> AgriculturalProduct
  - Close-meaning concepts: FurniturePart -> Furnished

29 Discussion of Results
- Relationship sets
  - Binary relationship sets over 95%
  - Most errors due to incorrectly generated object sets
  - Semantically incorrect relationship sets
    - Price IsA.SCALARATTRIBUTE.MeasuredIn.MEASURINGUNIT.Subclasses Year
  - n-ary relationship sets (usually huge)
- Participation constraints
  - Error due to lack of training examples
  - How much is enough?

30 Knowledge Base Extensibility
- Add SALT -- a new knowledge source
- Successfully integrated into existing KB
- Sample new relationship set (DOE abstract domain)
  - CrudeOil IsA.PHYSICALOBJECT.Location.PLACE.Subclasses Nation

31 Conclusion
- Experimented with knowledge-base construction and extension
- Standardized application domain specification
- Generated data-extraction ontologies from a specified domain and an integrated knowledge base
- Showed DEPROG results of more than 70% on average and over 90% for well-defined domains

32 Future Work
- Build a general-purpose knowledge source for data-extraction usage
- Study more about data frames
  - Can a system correctly identify concepts with data frames?
  - Can a system update a data frame to fit a special situation?
  - Can a system generate a data frame from a collection of information of interest?