Semiautomatic Generation of Resilient Data-Extraction Ontologies
Yihong Ding
Data Extraction Group, Brigham Young University
Sponsored by NSF
2 Wrapper-Driven Data Extraction
Web data extraction
–Obtain user-specified information from Web documents
Wrapper
–Convert implicit HTML data into explicit formatted data
–Data-source-specific, high performance
Examples
–SoftMealy, STALKER, WIEN, Omini, ROADRUNNER, …
3 Common Problem of Wrappers
[Figure: SoftMealy wrapper (a finite-state transducer) applied to the text "Mani Chandy, Professor of Computer Science and Executive Officer for Computer Science"; transition labels include ? / ε, ? / next_token, s / ε, s / “U=” + next_token, s / “N=” + next_token]
Resiliency
–Fixed domain, changeable layout
Scalability
–Existing wrapper unchanged, extendable domain and functions
4 Data-Extraction Ontology
Structure
–Object sets
–Relationship sets
–Participation constraints
–Data frames
Pros: resilient and scalable
Cons: hard to create
–Knowledge requirements
–Tedious and error-prone work
Car [-> object];
Car [0:1] has Make [1:*];
Make matches [10] constant { extract "\baudi\b"; }; end;
Car [0:1] has Model [1:*];
Model matches [25] constant { extract "80"; context "\baudi\S*\s*80\b"; }; end;
Car [0:1] has Mileage [1:*];
Mileage matches [8] constant { extract "\b[1-9]\d{0,2}k"; substitute "[kK]" -> "000"; }; end;
Car [0:1] has Price [1:*];
Price matches [8] constant { extract "[1-9]\d{3,6}"; context "\$[1-9]\d{3,6}"; }; end;
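To make the data-frame rules above concrete, here is a minimal Python sketch of how extract, context, and substitute might interact. The patterns are copied from the slide; everything else is an assumption made for illustration: the helper names apply_data_frame and extract_record, case-insensitive matching, the reading of context as "the extract must fall inside a context match", and the sample ad. The bracketed numbers after "matches" are ignored here.

```python
import re

# Patterns transcribed from the slide. Each data frame has an extract
# pattern, an optional context pattern, and an optional substitution.
DATA_FRAMES = {
    "Make": {"extract": r"\baudi\b"},
    "Model": {"extract": r"80", "context": r"\baudi\S*\s*80\b"},
    "Mileage": {"extract": r"\b[1-9]\d{0,2}k", "substitute": (r"[kK]", "000")},
    "Price": {"extract": r"[1-9]\d{3,6}", "context": r"\$[1-9]\d{3,6}"},
}


def apply_data_frame(frame, text):
    """Return the values one data frame extracts from text (hypothetical helper)."""
    flags = re.IGNORECASE  # assumption: ads are matched case-insensitively
    if "context" in frame:
        # Assumed semantics: only text inside a context match is searched.
        regions = [m.group(0) for m in re.finditer(frame["context"], text, flags)]
    else:
        regions = [text]
    values = []
    for region in regions:
        for match in re.finditer(frame["extract"], region, flags):
            value = match.group(0)
            if "substitute" in frame:
                pattern, replacement = frame["substitute"]
                value = re.sub(pattern, replacement, value)
            values.append(value)
    return values


def extract_record(text):
    """Apply every data frame to one record and collect the results."""
    return {name: apply_data_frame(frame, text) for name, frame in DATA_FRAMES.items()}


# Made-up ad, not taken from the presentation.
print(extract_record("Audi 80, 95k miles, runs great, $4500"))
```

The context rule is what keeps the bare "80" from being read as a Model in an ad for some other car, and what keeps a phone number from being read as a Price.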
5 Motif of Ontology Generation
[Diagram labels: Human Brain, Concepts of Interest, Concepts with Relations, Data-Extraction Ontology, Knowledge Base, Sample Documents]
6 Thesis Statement
Given: knowledge base
Input: sample Web pages of interest
Output: a data-extraction ontology for the domain of interest
Between input and output: this is the work of this thesis
7 Ontology-Generation Procedure
[Diagram labels: Concept Selection, Relation Retrieval, Constraint Discovery, Data Extraction Ontology, interact if necessary, Integrated Knowledge Base, Knowledge Sources, pre-processing, Results Storage, Extraction Processing, Result Evaluation, training documents, pre-processing, clean records, test documents]
8 Primary Knowledge Source
Requirements
–Available
–General in coverage
–Rich in meaningful relationships
–Encoded in or easily converted to XML
Mikrokosmos (μK) Ontology
–Developed by NMSU jointly with U.S. DoD
–Contains over 5000 concepts
–Averages 14 links per concept
–Represented in XML format
9 Integrated Knowledge Base
[Diagram: Data-Frame Library, μK Ontology, Synonym Dictionary (WordNet), and Lexicons feeding the KNOWLEDGE BASE]
11 Domain Specification
Training documents
–Data-rich
–Narrow in topic breadth
Preprocessing
12 Example – Car Advertisement
Record 1: 00 GrandAM SE, Sunfire Red, CD, AC, PW, PL Great Condition, $10,800, Call 798-3446
Record 2: 02 Buick Century Custom, Pwr Seat, Nada Retail 13,695 221-1250
Record 3: 02 Buick Century, lo mi, mint cond, $11,999. 373-4445 dlr# 2755
Record 4: 00 Buick Century Stk# HU7159 Green $9,319, 714-2200 To Apply By Phone, 1-877-228-9486, OREM Utah
14 Concept Selection
Selection strategies
–Compare a string with the name of a concept
–Compare a string with the values belonging to a concept
–Apply data-frame recognizers to recognize a string
Example (matched against the KB): 00 Buick Century Stk# HU7159 Green $9,319, 714-2200 To Apply By Phone, 1-877-228-9486, OREM Utah
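The three selection strategies can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the tiny dictionaries below stand in for the integrated knowledge base, the regular expressions are simplified recognizers written for this example, and the function name select_concepts is hypothetical. None of it is taken from the actual system.

```python
import re
from collections import defaultdict

# Toy stand-ins for the knowledge base; all entries are illustrative.
CONCEPT_NAMES = {"phone": "PhoneNr", "price": "Price", "mileage": "Mileage"}
CONCEPT_VALUES = {"buick": "Make", "century": "Model", "green": "Feature"}
RECOGNIZERS = {
    "Price": re.compile(r"\$\d{1,3}(?:,\d{3})*"),
    "PhoneNr": re.compile(r"\b(?:1-\d{3}-)?\d{3}-\d{4}\b"),
}


def select_concepts(record):
    """Map each selected concept to the evidence that selected it."""
    evidence = defaultdict(list)
    for token in re.findall(r"[a-z]+", record.lower()):
        # Strategy 1: the string equals the name of a concept.
        if token in CONCEPT_NAMES:
            evidence[CONCEPT_NAMES[token]].append(("name", token))
        # Strategy 2: the string is a known value of a concept.
        if token in CONCEPT_VALUES:
            evidence[CONCEPT_VALUES[token]].append(("value", token))
    # Strategy 3: a data-frame recognizer matches part of the record.
    for concept, pattern in RECOGNIZERS.items():
        for match in pattern.finditer(record):
            evidence[concept].append(("recognizer", match.group(0)))
    return dict(evidence)


record = ("00 Buick Century Stk# HU7159 Green $9,319, 714-2200 "
          "To Apply By Phone, 1-877-228-9486, OREM Utah")
for concept, found in sorted(select_concepts(record).items()):
    print(concept, found)
```

Keeping the evidence alongside each concept, rather than a bare set of concept names, leaves room for a later conflict-resolution step to see which strategy produced each candidate.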
15 Concept Selection
Reasons for conflict
–Synonymy
–Polysemy
Conflict resolution
–Same string, only one meaning
–Favor longer over shorter
–Context decides meaning
Example (matched against the KB): 02 Buick Century Custom, Pwr Seat, Nada Retail 13,695 221-1250.
–13,695 is recognized as a price by keyword identification
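A small Python sketch may help show how "favor longer over shorter" and "context decides meaning" could work together on the sample record. The patterns, the keyword list (retail, price), and the function names find_candidates and resolve_overlaps are assumptions invented for this example; the slide does not specify how the rules are implemented.

```python
import re

# Deliberately naive recognizers so that a conflict arises (illustrative only).
PATTERNS = {
    "Year": re.compile(r"\b\d{2}\b"),
    "PhoneNr": re.compile(r"\b\d{3}-\d{4}\b"),
}
# Context rule: a number that follows a price keyword is a Price.
PRICE_BY_KEYWORD = re.compile(r"\b(?:retail|price)\s+(\d{1,3}(?:,\d{3})*)", re.IGNORECASE)


def find_candidates(text):
    """Collect (start, end, concept) spans from all recognizers."""
    spans = [(m.start(), m.end(), concept)
             for concept, pattern in PATTERNS.items()
             for m in pattern.finditer(text)]
    spans += [(m.start(1), m.end(1), "Price") for m in PRICE_BY_KEYWORD.finditer(text)]
    return spans


def resolve_overlaps(candidates):
    """Among overlapping spans keep the longest, so each string gets one meaning."""
    chosen = []
    # Visit the longest spans first; ties go to the earlier span.
    for start, end, concept in sorted(candidates, key=lambda c: (-(c[1] - c[0]), c[0])):
        if all(end <= s or start >= e for s, e, _ in chosen):
            chosen.append((start, end, concept))
    return sorted(chosen)


text = "02 Buick Century Custom, Pwr Seat, Nada Retail 13,695 221-1250"
for start, end, concept in resolve_overlaps(find_candidates(text)):
    print(concept, text[start:end])
```

Here the naive Year recognizer also fires on the "13" in "13,695", but the longer keyword-supported Price span overlaps it and wins, while the "02" at the start of the record survives as a Year.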
17 Relationship Retrieval KB
19 Constraint Discovery
Sample records
–02 Buick Century, lo mi, mint cond, green, pwr seat, $11,999. 373-4445 dlr# 2755
–00 Buick Century Stk# HU7159 Green $9,319, 714-2200 To Apply By Phone, 1-877-228-9486, OREM Utah
Discovered constraint
AUTOMOBILE [0:1] IsA.ARTIFACT.CostOfProduction PRICE [1:1]
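One simple way to picture constraint discovery is to count how many values of a concept appear in each training record and summarize the counts as a [min:max] pair. The Python sketch below does only that. It is an assumption about the general idea, not the procedure used in the thesis: the function name discover_constraint is hypothetical, and the slide's own result keeps a minimum of 0 for AUTOMOBILE, so the real procedure is evidently more conservative than this.

```python
def discover_constraint(counts):
    """Summarize per-record value counts as a [min:max] participation constraint."""
    if not counts:
        raise ValueError("need at least one training record")
    low = "0" if min(counts) == 0 else "1"
    high = "1" if max(counts) <= 1 else "*"
    return f"[{low}:{high}]"


# Counts read by hand from the four car-ad records shown earlier in the deck.
price_counts = [1, 1, 1, 1]   # one price per record
phone_counts = [1, 1, 1, 2]   # the fourth record lists two phone numbers
color_counts = [1, 0, 0, 1]   # "Sunfire Red" and "Green" only

print("price:", discover_constraint(price_counts))
print("phone:", discover_constraint(phone_counts))
print("color:", discover_constraint(color_counts))
```

With only four records the result is fragile: a single extra ad with two prices would flip the first constraint, which is one way to read the later remark that participation-constraint errors come from a lack of training examples.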
21 Ontology Generation
concept nodes -> object sets
paths -> relationship sets
discovered constraints -> participation constraints
concept recognizers -> data frames
22 Automatically Generated Ontology -- Car Advertisement
(01) {Automobile [-> object];}
(02) {Automobile [0:1] has Mileage [1:1];}
(03) {Automobile [0:1] IsA.ARTIFACT.CostOfProduction Price [1:1];}
(12) {Price [1:1] IsA.SCALARATTRIBUTE.MeasuredIn.MEASURINGUNIT.Subclasses Year [0:*];}
(20) {Automobile [0:1] relatesTo PhoneNr [1:*] relatesTo ArtifactPart [1:*] relatesTo Mileage [1:*] relatesTo Truck [1:*] relatesTo AudioMediaArtifact [1:*] relatesTo CommunicationDevice [1:*] relatesTo ControlEvent [1:*] relatesTo TravelEvent [1:*];}
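Relationship-set names such as IsA.ARTIFACT.CostOfProduction read like labeled paths through the knowledge source. The Python sketch below shows one plausible way such a name could arise: a breadth-first search between two selected concepts, with the link labels and intermediate concepts joined by dots. The search strategy, the toy link table (including the ARTIFACT to PHYSICALOBJECT link), the max_links cutoff, and the function names are all assumptions for illustration; the deck does not state which algorithm is used.

```python
from collections import deque

# Toy fragment of a knowledge source: concept -> [(link label, target concept)].
# Illustrative only; not the real Mikrokosmos content.
KB_LINKS = {
    "AUTOMOBILE": [("IsA", "ARTIFACT")],
    "ARTIFACT": [("CostOfProduction", "PRICE"), ("IsA", "PHYSICALOBJECT")],
    "PHYSICALOBJECT": [("Location", "PLACE")],
}


def find_path(source, target, max_links=4):
    """Breadth-first search for the shortest labeled path from source to target."""
    queue = deque([(source, [])])
    seen = {source}
    while queue:
        concept, path = queue.popleft()
        if concept == target:
            return path
        # Each link adds two entries to the path: its label and its target.
        if len(path) // 2 >= max_links:
            continue
        for label, neighbor in KB_LINKS.get(concept, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, path + [label, neighbor]))
    return None


def relationship_name(path):
    """Join a path into a dotted name, dropping the final target concept."""
    return ".".join(path[:-1])


path = find_path("AUTOMOBILE", "PRICE")
print("Automobile", relationship_name(path), "Price")
```

A length cutoff like max_links matters in a source that averages many links per concept, since long paths tend to connect concepts that have little to do with each other.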
24 Updating Strategies
Remove all bad relationship sets
Modify remaining incorrect relationship sets
–Substitute incorrect object sets
–Reduce long n-ary relationship sets
–Fix participation constraints
Adjust names or re-arrange sequences
Add new relationship sets
25 Final Ontology
Car [-> object]
Car [0:1] has Year [1:*]
Car [0:1] has Mileage [1:*]
Car [0:1] has Price [1:*]
PhoneNr [1:*] is for Car [0:1]
PhoneNr [0:1] has Extension [1:*]
Car [0:*] has Feature [1:*]
Car [0:1] has Make [1:*]
Car [0:1] has Model [1:*]
26 Evaluation Criteria
Basic measures
–POG (Precision of Ontology Generation)
–ROG (Recall of Ontology Generation)
Human constraints
–PROG (Pseudo-ROG)
–Comparing with an expert-created ontology
Knowledge base constraints
–EPROG (Effective-PROG)
Correctness dependency
–DEPROG (Dependent-EPROG)
–For example: relationship sets depend on object sets
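The slide names the measures but not their formulas. The Python sketch below shows one plausible reading, assuming the usual set-based definitions of precision and recall: POG as precision against the expert-created ontology, PROG as recall against it, and EPROG as the same recall with the reference restricted to what the knowledge base can supply. These readings, the function names, and the object sets used as data are assumptions made for illustration; they are not the thesis definitions or its results, and DEPROG is left out because the slide gives too little to go on.

```python
def precision(generated, reference):
    """Share of generated items that also appear in the reference."""
    return len(generated & reference) / len(generated) if generated else 0.0


def recall(generated, reference):
    """Share of reference items that were generated."""
    return len(generated & reference) / len(reference) if reference else 0.0


# Made-up object sets for illustration only.
generated = {"Car", "Year", "Mileage", "Price", "PhoneNr", "Truck"}
expert = {"Car", "Year", "Mileage", "Price", "PhoneNr",
          "Extension", "Feature", "Make", "Model"}
# Assumption: every expert concept except Extension exists in the knowledge base.
available_in_kb = expert - {"Extension"}

pog = precision(generated, expert)
prog = recall(generated, expert)
eprog = recall(generated, expert & available_in_kb)

print(f"POG={pog:.3f} PROG={prog:.3f} EPROG={eprog:.3f}")
```

Restricting the reference set this way separates two kinds of miss: concepts the generator failed to pick up, and concepts it could never have produced because the knowledge base does not contain them.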
27 Evaluation Results
28 Discussion of Results
Bottleneck: cannot generate what is not in the knowledge base
Object sets
–Concept-selection procedure works well
–Desired concept not shown in training records
  Rarely occurring concepts: not severe even if we don’t fix the error
  Example: extension
–Aggregation and union
  USAddressCity, USAddressState, USAddressZipCode -> Location
  CropPlant, AnimalProduct, FruitFoodStuff -> AgriculturalProduct
–Close-meaning concepts: FurniturePart -> Furnished
29 Discussion of Results
Relationship sets
–Binary relationship sets over 95%
–Most errors due to incorrectly generated object sets
–Semantically incorrect relationship sets
  Price IsA.SCALARATTRIBUTE.MeasuredIn.MEASURINGUNIT.Subclasses Year
–n-ary relationship sets (usually huge)
Participation constraints
–Errors due to lack of training examples
–How much is enough?
30 Knowledge Base Extensibility
Add SALT -- a new knowledge source
Successfully integrated into existing KB
Sample new relationship set (DOE abstract domain)
–CrudeOil IsA.PHYSICALOBJECT.Location.PLACE.Subclasses Nation
31 Conclusion
Experimented with knowledge-base construction and extension
Standardized application domain specification
Generated data-extraction ontologies from a specified domain and an integrated knowledge base
Showed DEPROG results of more than 70% on average and over 90% for well-defined domains
32 Future Work
Build a general-purpose knowledge source for data-extraction usage
Study more about data frames
–Can a system correctly identify concepts with data frames?
–Can a system update a data frame to fit a special situation?
–Can a system generate a data frame from a collection of information of interest?