Learning the Common Structure of Data Kristina Lerman and Steven Minton Presentation by Jeff Roth.


1 Learning the Common Structure of Data
Kristina Lerman and Steven Minton
Presentation by Jeff Roth

2 Introduction
Data Extraction
- Data Expectation
Goals
- Wrapper Verification
- Wrapper Maintenance

3 Representation
Break the web page into tokens that are more general than characters but more specific than word symbols.
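A minimal sketch of what such a token-level representation could look like. The slide does not list the actual token types, so the type names below are assumptions made for illustration (only CAPS appears elsewhere in the deck), and `tokenize` and `token_type` are hypothetical helper names.

```python
import re

# Split text into runs of letters/digits and single punctuation marks.
TOKEN_RE = re.compile(r"[A-Za-z0-9]+|[^\sA-Za-z0-9]")


def tokenize(text):
    """Return the list of tokens in text, skipping whitespace."""
    return TOKEN_RE.findall(text)


def token_type(token):
    """Map a token to a coarse syntactic type (assumed type names)."""
    if token.isdigit():
        return "NUMBER"
    if token.isalpha():
        # CAPS here means "alphabetic token starting with an uppercase letter".
        return "CAPS" if token[0].isupper() else "LOWER"
    if token.isalnum():
        return "ALPHANUM"
    return "PUNCT"


tokens = tokenize("New Haven, CT 06511")
types = [token_type(t) for t in tokens]
```

Each token can then be described either by its literal text ("New") or by its type (CAPS), which is the sense in which a token is more general than a character string.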

4 DataPro
For complex fields, it is sufficient to learn only the starting and ending sequences of a data field.
DataPro
- Only Positive Examples
- Statistical Algorithm
- Polynomial Time
- Greedy

5 Prefix Tree
For a given data field, the tokens are encoded in a prefix tree (a suffix tree would be similar).
Each node is a specialization of its parent.
Example: data field is City
node: “New”
children: “Haven”, “York”, CAPS
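The City example can be sketched as a small prefix tree over token sequences. This is an illustrative simplification, not the presenters' implementation: it stores literal tokens only and leaves out type-level children such as CAPS. `Node` and `build_prefix_tree` are hypothetical names.

```python
class Node:
    """One node of the prefix tree: a token plus how many examples reach it."""

    def __init__(self):
        self.count = 0
        self.children = {}  # token -> Node


def build_prefix_tree(examples):
    """Build a prefix tree from a list of token sequences."""
    root = Node()
    for tokens in examples:
        root.count += 1
        node = root
        for tok in tokens:
            # Reuse the child for this token if it exists, else create it.
            node = node.children.setdefault(tok, Node())
            node.count += 1
    return root


cities = [["New", "Haven"], ["New", "York"], ["Boston"]]
tree = build_prefix_tree(cities)
new_node = tree.children["New"]
```

Here the "New" node is reached by two examples and has the children "Haven" and "York", matching the slide's example apart from the omitted CAPS child.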

6 Significant(count1, count2, P, α)
Significance is the main measure used in the DataPro algorithm.
Parameters:
count1, count2 - number of times a pattern of tokens appears in the data field examples
P - probability of count1 given count2
α - null hypothesis limit (significance level)
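The slide does not say which statistical test Significant uses. One plausible reading, shown here purely as an assumption, is a one-sided binomial test: count2 is the number of times a shorter pattern matched, count1 is the number of times its extension matched, and P is the chance probability of the extending token. The helper `binomial_tail` is hypothetical.

```python
from math import comb


def binomial_tail(k, n, p):
    """Return P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def significant(count1, count2, p, alpha):
    """Assumed reading: True if seeing count1 or more matches out of count2
    trials is unlikely (probability below alpha) under chance probability p."""
    return binomial_tail(count1, count2, p) < alpha


# 9 of 10 matches with a 10% chance probability is very unlikely by chance.
frequent = significant(9, 10, 0.1, 0.05)
# 1 of 10 matches is about what chance alone would produce.
rare = significant(1, 10, 0.1, 0.05)
```

Under this reading, a pattern is kept only when it occurs more often than chance would explain at level α.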

7 DataPro Algorithm
Create root node of tree
For next node Q of tree
    Create children of Q
    Prune Generalizations
    Determinize children
Extract patterns from tree

8 Wrapper Verification
Wrapper fragility is a common problem, and wrapper verification is rare.
- Take the patterns created by DataPro for the current wrapper and create a distribution t from the number of matches of each pattern on the original web pages.
- Take a similar distribution k from the new web pages that are being verified.
- If t and k have approximately the same distribution, the wrapper is still valid; otherwise it needs to be updated.
Recall: 95%
Precision: 47%
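The slide does not name the test used to decide whether t and k are "approximately the same". A minimal sketch, assuming a Pearson chi-square goodness-of-fit comparison of the per-pattern match counts; the function names and the caller-supplied critical value are illustrative assumptions.

```python
def chi_square_statistic(t_counts, k_counts):
    """Pearson chi-square statistic of new counts k against expectations
    obtained by rescaling the original counts t to the same total."""
    total_t = sum(t_counts)
    total_k = sum(k_counts)
    stat = 0.0
    for t, k in zip(t_counts, k_counts):
        expected = total_k * t / total_t
        if expected > 0:  # skip patterns that never matched originally
            stat += (k - expected) ** 2 / expected
    return stat


def wrapper_still_valid(t_counts, k_counts, critical_value):
    """Assumed decision rule: valid if the statistic stays below the
    chi-square critical value chosen by the caller."""
    return chi_square_statistic(t_counts, k_counts) < critical_value


# Three patterns, so 2 degrees of freedom; 5.991 is the 0.05 critical value.
same_shape = wrapper_still_valid([50, 30, 20], [25, 15, 10], 5.991)
changed = wrapper_still_valid([50, 30, 20], [0, 0, 50], 5.991)
```

In the first call the new counts are the original ones halved, so the statistic is 0 and the wrapper is judged valid; in the second the matches have shifted to a single pattern and the check fails.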

9 Wrapper Maintenance
- Take original patterns
- Find matching start and end patterns
- Remove sequences with unusually high or low length
- Score remaining sequences based on location, adjacent tokens, and visibility to the user
- Cluster choices by score; the highest-scoring cluster should contain only correct examples of the data field
- 62 of 77 tests contained correct, complete data field examples
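The length-filtering step above can be sketched as follows. The slide does not define "unusually high or low", so this example assumes a cutoff of two standard deviations from the mean length of the original examples; `filter_by_length` and `max_z` are hypothetical names.

```python
from statistics import mean, pstdev


def filter_by_length(candidates, example_lengths, max_z=2.0):
    """Keep candidate token sequences whose length is within max_z standard
    deviations of the mean length of the original examples (assumed rule)."""
    mu = mean(example_lengths)
    sigma = pstdev(example_lengths)
    kept = []
    for cand in candidates:
        if sigma == 0:
            ok = len(cand) == mu  # all examples had the same length
        else:
            ok = abs(len(cand) - mu) / sigma <= max_z
        if ok:
            kept.append(cand)
    return kept


candidates = [["New", "York"], ["Los", "Angeles", "CA"], ["x"] * 10]
kept = filter_by_length(candidates, example_lengths=[2, 3, 3, 4])
```

With example lengths 2, 3, 3, 4 the two short candidates are kept and the ten-token sequence is dropped; the surviving candidates would then go on to the scoring and clustering steps.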

