Semiautomatic Generation of Resilient Data-Extraction Ontologies
Yihong Ding
Data Extraction Group
Brigham Young University
Sponsored by NSF

2 Introduction
Wrapper-driven data extraction
– Pros: data-source-specified, high performance
– Cons: lack of resiliency and scalability
Ontology-driven data extraction
– Pros: application-domain-specified, resilient and scalable
– Cons: hard to create
Objective
– Generating data-extraction ontologies

3 Generation Architecture
[Figure: architecture diagram. Stage labels: Knowledge Preparation, Application Specification, Domain Allocation, Ontology Generation. Component labels: Knowledge Sources, Integrated Knowledge Base, training documents, pre-processing, clean records, Concept Selection, Relation Retrieval, Constraint Discovery, Data Extraction Ontology, test documents, pre-processing, Extraction Processing, Results Storage, Result Evaluation. Edge label: "interact if necessary".]

4 Knowledge Base Construction
Knowledge Sources
– Mikrokosmos (μK) Ontology
– Data-Frame Library
– Additional Lexicons
– WordNet
Integration of Knowledge Base
[Figure: Data-Frame Library, μK Ontology, Synonym Dictionary (WordNet), and Lexicons combined into the KNOWLEDGE BASE.]

5 Application Specification
Record 1: 00 GrandAM SE, Sunfire Red, CD, AC, PW, PL Great Condition, $10,800, Call
Record 2: 02 Buick Century Custom, Pwr Seat, Nada Retail 13,695 Only $12,
Record 3: 02 Buick Century, lo mi, mint cond, $11, dlr# 2755
Record 4: 00 Buick Century Stk# HU7159 Green $9,319, To Apply By Phone, , OREM Utah

6 Domain Allocation: concept selection
Select concepts by string-matching against object values
Resolve conflicts by context or semantic meaning
Example: 02 Buick Century Pwr Seat, Nada Retail 13,695.
– The Data-Frame Library resolves "13,695" by identifying the keyword "retail"

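The matching and conflict-resolution steps on this slide can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `DATA_FRAMES` table, its regular expressions, the `RetailValue` concept name, and the 12-character context window are all assumptions made up for the example.

```python
import re

# Hypothetical mini data-frame library: each concept has a value pattern
# and optional context keywords. Illustrative only, not the real library.
DATA_FRAMES = {
    "Make": {"value": r"\b(?:Buick|Pontiac|Ford)\b", "keywords": []},
    "Price": {"value": r"\b\d{1,3},\d{3}\b", "keywords": ["$", "only"]},
    "RetailValue": {"value": r"\b\d{1,3},\d{3}\b", "keywords": ["retail"]},
}

def select_concepts(record, window=12):
    """Return {concept: [matched values]} for one record."""
    # Step 1: string-match every value pattern against the record.
    candidates = {}  # (start, end) span -> concepts claiming that span
    for concept, frame in DATA_FRAMES.items():
        for match in re.finditer(frame["value"], record):
            candidates.setdefault(match.span(), []).append(concept)

    # Step 2: when several concepts claim one value, prefer the concept
    # whose keyword appears in the text just before the value.
    selected = {}
    for (start, end), concepts in sorted(candidates.items()):
        chosen = concepts[0]  # fallback: first concept in library order
        if len(concepts) > 1:
            context = record[max(0, start - window):start].lower()
            for concept in concepts:
                keywords = DATA_FRAMES[concept]["keywords"]
                if any(keyword in context for keyword in keywords):
                    chosen = concept
                    break
        selected.setdefault(chosen, []).append(record[start:end])
    return selected

record = "02 Buick Century Pwr Seat, Nada Retail 13,695."
print(select_concepts(record))
```

In this sketch the value "13,695" matches both number-like concepts, and the preceding word "Retail" decides between them, which mirrors the keyword-identification idea on the slide.
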
7 Domain Allocation: relationship retrieval
Find paths among selected concept nodes
Retrieve cluster representing application domain
Record 1: 00 GrandAM SE, Sunfire Red, CD, AC, PW, PL Great Condition, $10,800, Call
Record 2: 02 Buick Century Custom, Pwr Seat, Nada Retail 13,695 Only $12,
Record 3: 02 Buick Century, lo mi, mint cond, $11, dlr# 2755
Record 4: 00 Buick Century Stk# HU7159 Green $9,319, To Apply By Phone, , OREM Utah

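One plausible way to read "find paths among selected concept nodes" is a shortest-path search over the knowledge-base graph, keeping every node that lies on a path between two selected concepts. The sketch below assumes that reading. The `KB_EDGES` fragment is invented for illustration, and breadth-first search is a stand-in technique, since the slides do not say which path-finding method is used.

```python
from collections import deque
from itertools import combinations

# Hypothetical fragment of the knowledge base as an undirected graph.
KB_EDGES = [
    ("Automobile", "Make"),
    ("Automobile", "Model"),
    ("Automobile", "Price"),
    ("Automobile", "Feature"),
    ("Feature", "Color"),
    ("Automobile", "Vehicle"),
    ("Vehicle", "Artifact"),
]

def build_graph(edges):
    """Build an adjacency map from a list of undirected edges."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph

def shortest_path(graph, start, goal):
    """Breadth-first search; returns the node list or None if unreachable."""
    parents = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for neighbor in sorted(graph.get(node, ())):
            if neighbor not in parents:
                parents[neighbor] = node
                queue.append(neighbor)
    return None

def retrieve_cluster(graph, selected):
    """Union of the selected concepts and all nodes on paths between them."""
    cluster = set(selected)
    for a, b in combinations(sorted(selected), 2):
        path = shortest_path(graph, a, b)
        if path:
            cluster.update(path)
    return cluster

graph = build_graph(KB_EDGES)
print(sorted(retrieve_cluster(graph, {"Make", "Price", "Color"})))
```

Here the connecting nodes Automobile and Feature are pulled into the cluster, while Vehicle and Artifact, which lie on no path between selected concepts, are left out.
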
8 Domain Allocation: constraint discovery
Discover how many times each object value participates
Specify the discovered values as participation constraints
Example records:
02 Buick Century, lo mi, mint cond, green, pwr seat, $11, dlr#
Buick Century Stk# HU7159 Green $9,319, To Apply By Phone, , OREM Utah
Discovered constraints:
AUTOMOBILE [0:1] has MAKE [1:*]
AUTOMOBILE [0:*] has FEATURE [1:*]
AUTOMOBILE [0:1] has PRICE [1:1]

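As a rough sketch of the counting idea, participation constraints on the AUTOMOBILE side can be derived from the minimum and maximum number of values each concept contributes per record. The function name, the input shape, and the rule "more than one occurrence becomes *" are assumptions for illustration. The sample records are hand-labeled and hypothetical, and the constraint on the far side of each relationship (for example `MAKE [1:*]`) is not derived here.

```python
def discover_constraints(records, entity="AUTOMOBILE"):
    """Derive [min:max] participation constraints from per-record counts.

    `records` is a list of {concept: [values]} dicts, one per record.
    """
    concepts = sorted({concept for record in records for concept in record})
    constraints = {}
    for concept in concepts:
        # Count how many values of this concept each record contains.
        counts = [len(record.get(concept, [])) for record in records]
        lower = min(counts)
        # Assumption: any count above one is generalized to "many".
        upper = "1" if max(counts) <= 1 else "*"
        constraints[concept] = f"{entity} [{lower}:{upper}] has {concept}"
    return constraints

# Hypothetical, hand-labeled extraction results for three records.
records = [
    {"MAKE": ["Buick"], "FEATURE": ["green", "pwr seat"]},
    {"MAKE": ["Buick"], "PRICE": ["$9,319"]},
    {"FEATURE": ["CD"], "PRICE": ["$10,800"]},
]

for line in discover_constraints(records).values():
    print(line)
```

With these sample records, each concept is missing from at least one record, so every lower bound is 0, and only FEATURE occurs more than once in a record, so only its upper bound is `*`.
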
9 Ontology Generation
Initial ontology: automatically generated
Updated ontology: tuned by the user
Expectation
– Rejecting existing components is much easier than adding new ones
– As little modification as possible

10 Evaluation and Results
Evaluation
– Compare: Generated vs. Expert-created
– POG (Precision of Ontology Generation)
– PROG (Pseudo-Recall of Ontology Generation)
– EPROG (Effective-PROG)
Results
– Three testing domains: Apt-Rental, Used-Auto-Ads, Nation-Essence
– Average POG less than 0.23
– Lowest EPROG is around 0.70, highest is almost 1.0

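The slides do not give formulas for POG, PROG, or EPROG. For orientation only, the sketch below shows generic set-based precision and recall when a generated ontology is compared with an expert-created one. It should not be read as the definition of the measures above, which may differ, and the component names in the example are hypothetical.

```python
def precision_recall(generated, expert):
    """Generic set-based precision and recall over ontology components.

    Not the POG/PROG/EPROG definitions, which the slides do not state.
    """
    generated, expert = set(generated), set(expert)
    correct = generated & expert
    precision = len(correct) / len(generated) if generated else 0.0
    recall = len(correct) / len(expert) if expert else 0.0
    return precision, recall

# Hypothetical component sets for illustration.
generated = {"Make", "Model", "Price", "Engine"}
expert = {"Make", "Model", "Price", "Year", "Feature"}
print(precision_recall(generated, expert))
```
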
11 Conclusion
Exploits existing knowledge
Specifies application domain
Allocates domain inside the knowledge base
Generates a data-extraction ontology
Shows effective recall of more than 70% on average