Semiautomatic Generation of Resilient Data-Extraction Ontologies Yihong Ding Data Extraction Group Brigham Young University Sponsored by NSF.

1 Semiautomatic Generation of Resilient Data-Extraction Ontologies Yihong Ding Data Extraction Group Brigham Young University Sponsored by NSF

2 Wrapper-Driven Data Extraction
Web data extraction
–Obtain user-specified information from Web documents
Wrapper
–Convert implicit HTML data into explicit formatted data
–Data-source-specified, high performance
Examples:
–SoftMealy, STALKER, WIEN, Omini, ROADRUNNER, …

3 Common Problem of Wrappers
[Figure: SoftMealy finite-state transducer over the sample text "Mani Chandy, Professor of Computer Science and Executive Officer for Computer Science", with states b, U, _U, N, _N, etc. and transitions ? / ε, ? / next_token, s / ε, s / “U=” + next_token, s / “N=” + next_token]
Resiliency: fixed domain, changeable layout
Scalability: unchanged existing wrapper, extendable domain and functions

4 Data-Extraction Ontology
Structure
–Object sets
–Relationship sets
–Participation constraints
–Data frames
Pros: resilient and scalable
Cons: hard to create
–Knowledge requirements
–Tedious and error-prone work
Example:
Car [-> object];
Car [0:1] has Make [1:*];
Make matches [10] constant { extract "\baudi\b"; }; end;
Car [0:1] has Model [1:*];
Model matches [25] constant { extract "80"; context "\baudi\S*\s*80\b"; }; end;
Car [0:1] has Mileage [1:*];
Mileage matches [8] constant { extract "\b[1-9]\d{0,2}k"; substitute "[kK]" -> "000"; }; end;
Car [0:1] has Price [1:*];
Price matches [8] constant { extract "[1-9]\d{3,6}"; context "\$[1-9]\d{3,6}"; }; end;
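The data frames on this slide pair an extract pattern with an optional context pattern and an optional substitution. As a rough sketch of how such rules might be applied to one advertisement, the Python below reuses the slide's regular expressions. The dictionary layout, the function name `apply_data_frames`, the case-insensitive matching, and the sample ad text are all assumptions made for illustration; this is not the extraction engine used in the thesis.

```python
import re

# Hypothetical Python mirror of the slide's data frames. Each concept has an
# "extract" pattern, an optional "context" pattern that must surround the
# value, and an optional (pattern, replacement) substitution.
DATA_FRAMES = {
    "Make": {"extract": r"\baudi\b"},
    "Model": {"extract": r"80", "context": r"\baudi\S*\s*80\b"},
    "Mileage": {"extract": r"\b[1-9]\d{0,2}k", "substitute": (r"[kK]", "000")},
    "Price": {"extract": r"[1-9]\d{3,6}", "context": r"\$[1-9]\d{3,6}"},
}


def apply_data_frames(text, frames=DATA_FRAMES):
    """Return {concept: value} for every data frame that fires on text."""
    found = {}
    for concept, frame in frames.items():
        scope = text
        if "context" in frame:
            # Assumption: the value is searched for only inside the context match.
            ctx = re.search(frame["context"], text, re.IGNORECASE)
            if ctx is None:
                continue
            scope = ctx.group(0)
        match = re.search(frame["extract"], scope, re.IGNORECASE)
        if match is None:
            continue
        value = match.group(0)
        if "substitute" in frame:
            pattern, replacement = frame["substitute"]
            value = re.sub(pattern, replacement, value)
        found[concept] = value
    return found


# Made-up advertisement, not one of the records from the presentation.
ad = "97 Audi 80, 105k miles, runs great, $4500 obo"
result = apply_data_frames(ad)
```

The context pattern is what keeps the bare "80" from being taken as a model unless it follows "audi", and the substitution turns "105k" into "105000".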

5 Motif of Ontology Generation
[Diagram labels: Human Brain; Concepts of Interest; Concepts with Relations; Data-Extraction Ontology; Knowledge Base; Sample Documents]

6 Thesis Statement
Given: knowledge base
Input: sample Web pages of interest
Output: a data-extraction ontology for the domain of interest
Between input and output: this is the work of this thesis

7 Ontology-Generation Procedure
[Diagram labels: Knowledge Sources; pre-processing; Integrated Knowledge Base; training documents; pre-processing; clean records; Concept Selection; Relation Retrieval; Constraint Discovery; interact if necessary; Data Extraction Ontology; test documents; Extraction Processing; Results Storage; Result Evaluation]

8 Primary Knowledge Source
Requirements
–Available
–General in coverage
–Rich in meaningful relationships
–Encoded in or easily converted to XML
Mikrokosmos (μK) Ontology
–Developed by NMSU jointly with U.S. DoD
–Contains over 5000 concepts
–Averages 14 links per concept
–Represented in XML format

9 Integrated Knowledge Base
[Diagram labels: Data-Frame Library; μK Ontology; Synonym Dictionary (WordNet); Lexicons; KNOWLEDGE BASE]

10 Ontology-Generation Procedure
[Diagram repeated from slide 7]

11 Domain Specification
Training documents
–Data-rich
–Narrow in topic breadth
Preprocessing

12 Example – Car Advertisement
Record 1: 00 GrandAM SE, Sunfire Red, CD, AC, PW, PL Great Condition, $10,800, Call 798-3446
Record 2: 02 Buick Century Custom, Pwr Seat, Nada Retail 13,695 221-1250
Record 3: 02 Buick Century, lo mi, mint cond, $11,999. 373-4445 dlr# 2755
Record 4: 00 Buick Century Stk# HU7159 Green $9,319, 714-2200 To Apply By Phone, 1-877-228-9486, OREM Utah

13 Ontology-Generation Procedure
[Diagram repeated from slide 7]

14 Concept Selection
Selection strategies
–Compare a string with the name of a concept
–Compare a string with the values belonging to a concept
–Apply data-frame recognizers to recognize a string
[Figure: sample record "00 Buick Century Stk# HU7159 Green $9,319, 714-2200 To Apply By Phone, 1-877-228-9486, OREM Utah" and the KB]
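The three selection strategies can be sketched against a toy knowledge base. Everything in the block below (the concept names, value lexicons, recognizer patterns, and the function `select_concepts`) is a hypothetical stand-in chosen for illustration; the actual knowledge base integrates the Mikrokosmos ontology, WordNet, lexicons, and a data-frame library, none of which is reproduced here.

```python
import re

# Toy knowledge base; every entry is illustrative, not taken from the real KB.
CONCEPT_NAMES = {"phone": "PhoneNr", "price": "Price"}              # strategy 1
CONCEPT_VALUES = {"Make": {"buick", "pontiac"},                     # strategy 2
                  "Color": {"green", "red"}}
RECOGNIZERS = {                                                     # strategy 3
    "Price": re.compile(r"\$\d{1,3}(?:,\d{3})*"),
    "PhoneNr": re.compile(r"\b(?:1-\d{3}-)?\d{3}-\d{4}\b"),
}


def select_concepts(record):
    """Return the set of concepts that any of the three strategies selects."""
    selected = set()
    for token in re.findall(r"[a-z]+", record.lower()):
        # Strategy 1: the string matches the name of a concept.
        if token in CONCEPT_NAMES:
            selected.add(CONCEPT_NAMES[token])
        # Strategy 2: the string is a known value of a concept.
        for concept, values in CONCEPT_VALUES.items():
            if token in values:
                selected.add(concept)
    # Strategy 3: a data-frame recognizer fires somewhere in the record.
    for concept, pattern in RECOGNIZERS.items():
        if pattern.search(record):
            selected.add(concept)
    return selected


record = ("00 Buick Century Stk# HU7159 Green $9,319, 714-2200 "
          "To Apply By Phone, 1-877-228-9486, OREM Utah")
concepts = select_concepts(record)
```

On this record the word "Phone" selects PhoneNr by name, "Buick" and "Green" select Make and Color by value, and the two recognizers select Price and PhoneNr by pattern.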

15 Concept Selection
Reasons for conflict
–Synonymy
–Polysemy
Conflict resolution
–Same string, only one meaning
–Favor longer over shorter
–Context decides meaning
[Figure: sample record "02 Buick Century Custom, Pwr Seat, Nada Retail 13,695 221-1250." and the KB; price by keyword identification]
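A minimal sketch of the "favor longer over shorter" rule, combined with "same string, only one meaning", is shown below. It assumes candidate matches arrive as (start, end, concept) character spans; the function name and the competing Mileage candidate are invented for illustration. The context-based rule (for instance, recognizing a price from the keyword "Retail") is not modeled.

```python
def resolve_conflicts(candidates):
    """Keep non-overlapping matches, preferring longer spans.

    candidates: list of (start, end, concept) tuples, end exclusive.
    """
    # Longest span first; ties go to the earliest start for determinism.
    ordered = sorted(candidates, key=lambda c: (-(c[1] - c[0]), c[0]))
    kept = []
    for start, end, concept in ordered:
        overlaps = any(start < k_end and k_start < end
                       for k_start, k_end, _ in kept)
        if not overlaps:
            # No longer match already claims any of these characters.
            kept.append((start, end, concept))
    return sorted(kept)


text = "Nada Retail 13,695 221-1250"
candidates = [
    (12, 18, "Price"),    # "13,695"
    (15, 18, "Mileage"),  # "695", a hypothetical shorter competing match
    (19, 27, "PhoneNr"),  # "221-1250"
]
resolved = resolve_conflicts(candidates)
```

Because the shorter "695" span lies inside the longer "13,695" span, only the Price and PhoneNr readings survive.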

16 Ontology-Generation Procedure
[Diagram repeated from slide 7]

17 Relationship Retrieval
[Figure: KB]

18 Ontology-Generation Procedure
[Diagram repeated from slide 7]

19 Constraint Discovery
Sample records:
02 Buick Century, lo mi, mint cond, green, pwr seat, $11,999. 373-4445 dlr# 2755
00 Buick Century Stk# HU7159 Green $9,319, 714-2200 To Apply By Phone, 1-877-228-9486, OREM Utah
Discovered constraint:
AUTOMOBILE [0:1] IsA.ARTIFACT.CostOfProduction PRICE [1:1]
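The slides do not spell out how participation constraints are derived from the training records. One simple possibility, offered here only as an assumption, is to count how many values of each concept appear in each record and turn the observed minimum and maximum into a min:max pair. The function name, the generalization of any maximum above 1 to "*", and the hand-tallied counts below are all illustrative, and the sketch says nothing about which side of a relationship set a bound should attach to.

```python
def discover_participation(counts_per_record):
    """Turn per-record occurrence counts into a '[min:max]' constraint string.

    Assumption: a maximum above 1 is generalized to '*', since a handful of
    samples cannot bound the true maximum.
    """
    lo = min(counts_per_record)
    hi = max(counts_per_record)
    return f"[{lo}:{'*' if hi > 1 else hi}]"


# Counts tallied by hand from the four sample records on slide 12
# (an illustration, not output of the actual system).
observed = {
    "Price": [1, 1, 1, 1],
    "PhoneNr": [1, 1, 1, 2],   # Record 4 lists two phone numbers
    "Color": [1, 0, 0, 1],     # only Records 1 and 4 mention a color
}
constraints = {concept: discover_participation(counts)
               for concept, counts in observed.items()}
```

With so few records the bounds are fragile: a single additional advertisement without a price would change the Price constraint to [0:1].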

20 Ontology-Generation Procedure
[Diagram repeated from slide 7]

21 Ontology Generation
concept nodes → object sets
paths → relationship sets
discovered constraints → participation constraints
concept recognizers → data frames

22 Automatically Generated Ontology -- Car Advertisement
(01) {Automobile [-> object];}
(02) {Automobile [0:1] has Mileage [1:1];}
(03) {Automobile [0:1] IsA.ARTIFACT.CostOfProduction Price [1:1];}
(12) {Price [1:1] IsA.SCALARATTRIBUTE.MeasuredIn.MEASURINGUNIT.Subclasses Year [0:*];}
(20) {Automobile [0:1] relatesTo PhoneNr [1:*] relatesTo ArtifactPart [1:*] relatesTo Mileage [1:*] relatesTo Truck [1:*] relatesTo AudioMediaArtifact [1:*] relatesTo CommunicationDevice [1:*] relatesTo ControlEvent [1:*] relatesTo TravelEvent [1:*];}

23 Ontology-Generation Procedure
[Diagram repeated from slide 7]

24 Updating Strategies
Remove all bad relationship sets
Modify remaining incorrect relationship sets
–Substitute incorrect object sets
–Reduce long n-ary relationship sets
–Fix participation constraints
Adjust names or re-arrange sequences
Add new relationship sets

25 Final Ontology
Car [-> object]
Car [0:1] has Year [1:*]
Car [0:1] has Mileage [1:*]
Car [0:1] has Price [1:*]
PhoneNr [1:*] is for Car [0:1]
PhoneNr [0:1] has Extension [1:*]
Car [0:*] has Feature [1:*]
Car [0:1] has Make [1:*]
Car [0:1] has Model [1:*]
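Each binary relationship set in the final ontology names two object sets, a participation constraint for each, and a relationship name. As a sketch of one way to hold that structure in memory, the block below parses a line of this textual form into a named tuple. The class `RelationshipSet`, the regular expression, and the function `parse_relationship` are assumptions for illustration and not part of the original tool chain; the sketch handles only binary relationship sets.

```python
import re
from typing import NamedTuple


class RelationshipSet(NamedTuple):
    source: str
    source_constraint: str
    name: str
    target: str
    target_constraint: str


# Shape assumed here: "<ObjectSet> [min:max] <name> <ObjectSet> [min:max]"
LINE = re.compile(r"^(\w+) \[([^\]]+)\] (.+) (\w+) \[([^\]]+)\]$")


def parse_relationship(line):
    """Parse one binary relationship-set line; raise ValueError otherwise."""
    match = LINE.match(line.strip())
    if match is None:
        raise ValueError(f"not a binary relationship set: {line!r}")
    return RelationshipSet(*match.groups())


rel = parse_relationship("PhoneNr [1:*] is for Car [0:1]")
```

The object-set declaration `Car [-> object]` does not fit this shape and is rejected, so a fuller parser would need a separate rule for it.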

26 Evaluation Criteria
Basic measures
–POG (Precision of Ontology Generation)
–ROG (Recall of Ontology Generation)
Human constraints
–PROG (Pseudo-ROG)
–Comparing with an expert-created ontology
Knowledge base constraints
–EPROG (Effective-PROG)
Correctness dependency
–DEPROG (Dependent-EPROG)
–For example: relationship sets depend on object sets
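The slides name the measures without giving formulas. The sketch below assumes that POG and ROG follow the usual precision and recall definitions, computed over generated components compared against an expert-created reference. The function name and both sets are made up for illustration, and the adjustments that distinguish PROG, EPROG, and DEPROG are not modeled because their exact definitions are not given here.

```python
def precision_recall(generated, reference):
    """Standard precision and recall of a generated set against a reference."""
    correct = len(generated & reference)
    precision = correct / len(generated) if generated else 0.0
    recall = correct / len(reference) if reference else 0.0
    return precision, recall


# Hypothetical object sets; not the evaluation data from the thesis.
generated = {"Automobile", "Mileage", "Price", "Year", "PhoneNr", "Truck"}
expert = {"Automobile", "Mileage", "Price", "Year", "PhoneNr",
          "Make", "Model", "Feature"}
pog, rog = precision_recall(generated, expert)
```

In this made-up case five of the six generated object sets are correct, and five of the eight expert object sets are recovered.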

27 Evaluation Results

28 Discussion of Results
Bottleneck: cannot generate what is not in the knowledge base
Object sets
–Concept-selection procedure works well
–Desired concept not shown in training records
  Rarely occurring concept → not severe even if we don’t fix the error
  Example: extension
–Aggregation and union
  USAddressCity, USAddressState, USAddressZipCode → Location
  CropPlant, AnimalProduct, FruitFoodStuff → AgriculturalProduct
–Close-meaning concepts: FurniturePart → Furnished

29 Discussion of Results
Relationship sets
–Binary relationship sets over 95%
–Most errors due to incorrectly generated object sets
–Semantically incorrect relationship sets
  Price IsA.SCALARATTRIBUTE.MeasuredIn.MEASURINGUNIT.Subclasses Year
–n-ary relationship sets (usually huge)
Participation constraints
–Error due to lack of training examples
–How much is enough?

30 Knowledge Base Extensibility
Add SALT -- a new knowledge source
Successfully integrated into existing KB
Sample new relationship set (DOE abstract domain)
–CrudeOil IsA.PHYSICALOBJECT.Location.PLACE.Subclasses Nation

31 Conclusion
Experimented with knowledge-base construction and extension
Standardized application domain specification
Generated data-extraction ontologies from a specified domain and an integrated knowledge base
Showed DEPROG results of more than 70% on average and over 90% for well-defined domains

32 Future Work
Build a general-purpose knowledge source for data-extraction usage
Study more about data frames
–Can a system correctly identify concepts with data frames?
–Can a system update a data frame to fit a special situation?
–Can a system generate a data frame from a collection of information of interest?

