Learning the Common Structure of Data
Kristina Lerman and Steven Minton
Presentation by Jeff Roth

Introduction
Data Extraction
- Data Expectation
Goals
- Wrapper Verification
- Wrapper Maintenance

Representation
Break the web page into tokens that are more general than characters but more specific than word symbols.
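
As a rough illustration of this kind of token representation, the sketch below splits text into tokens and assigns each one a list of increasingly general type labels. The type names (NUMBER, LOWER, ALPHA, ALPHANUM, PUNCT) and the functions `tokenize` and `token_types` are assumptions made for this example; only CAPS appears elsewhere in these slides, and the actual type hierarchy used by the authors may differ.

```python
import re

def tokenize(text):
    """Split text into runs of letters/digits and single punctuation marks."""
    return re.findall(r"[A-Za-z0-9]+|[^\sA-Za-z0-9]", text)

def token_types(token):
    """Return the token itself followed by increasingly general type labels.

    The label names are hypothetical, chosen only for illustration.
    """
    types = [token]
    if token.isdigit():
        types += ["NUMBER", "ALPHANUM"]
    elif token.isalpha():
        if token[0].isupper():
            types.append("CAPS")    # capitalized word
        elif token.islower():
            types.append("LOWER")   # all-lowercase word
        types += ["ALPHA", "ALPHANUM"]
    elif token.isalnum():
        types.append("ALPHANUM")    # mixed letters and digits
    else:
        types.append("PUNCT")
    return types

for tok in tokenize("New Haven, CT 06511"):
    print(tok, token_types(tok))
```

Each token can then be matched either literally ("Haven") or through one of its more general types (CAPS), which is what lets a pattern cover examples it has not seen verbatim.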

DataPro
For complex fields, it is sufficient to learn only the starting and ending sequences of a data field.
DataPro:
- Only positive examples
- Statistical algorithm
- Polynomial time
- Greedy

Prefix Tree
For a given data field, the tokens are encoded in a prefix tree (a suffix tree would be similar). Each node is a specialization of its parent.
Example: the data field is City.
- node: “New”
- children: “Haven”, “York”, CAPS
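
A minimal sketch of such a prefix tree, assuming each example is already a list of tokens. The `Node` class and `build_prefix_tree` function are hypothetical names for this illustration, and the sketch stores only literal tokens: generalized children such as CAPS are left out to keep it short.

```python
class Node:
    """One token in the prefix tree, with the number of examples passing through it."""
    def __init__(self, token):
        self.token = token
        self.count = 0
        self.children = {}

def build_prefix_tree(examples):
    """Insert each tokenized example so that shared prefixes share nodes."""
    root = Node(None)
    for tokens in examples:
        root.count += 1
        node = root
        for tok in tokens:
            if tok not in node.children:
                node.children[tok] = Node(tok)
            node = node.children[tok]
            node.count += 1
    return root

cities = [["New", "Haven"], ["New", "York"], ["New", "Orleans"], ["Boston"]]
root = build_prefix_tree(cities)
new = root.children["New"]
print(new.count, sorted(new.children))
```

The counts stored on each node are what a statistical test would later consume: how often a child pattern occurs relative to its parent.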

Significant(count1, count2, P, α)
Significance is the main measure used in the DataPro algorithm.
Parameters:
- count1, count2: number of times a pattern of tokens appears in the data field examples
- P: probability of count1 given count2
- α: limit (significance level) for rejecting the null hypothesis
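
The slide does not spell out the test itself, so the following is only one plausible reading, not the authors' implementation. It assumes count1 is the number of examples matching an extended pattern, count2 is the number matching its parent pattern, and P is the chance probability of the added token. Under the null hypothesis that the token appears by chance, count1 follows a binomial distribution, and the pattern is called significant when such a high count would be unlikely.

```python
from math import comb

def significant(count1, count2, p, alpha=0.05):
    """Exact one-sided binomial test (an assumed interpretation of the slide).

    Returns True when observing count1 or more matches out of count2 trials,
    each succeeding with chance probability p, has probability below alpha.
    """
    tail = sum(
        comb(count2, k) * p**k * (1 - p) ** (count2 - k)
        for k in range(count1, count2 + 1)
    )
    return tail < alpha

print(significant(9, 10, 0.5))  # 9 of 10 at chance 0.5: tail is 11/1024
print(significant(6, 10, 0.5))  # 6 of 10 at chance 0.5: tail is 386/1024
```

With α = 0.05 the first call rejects the null hypothesis and the second does not.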

DataPro Algorithm
Create root node of tree
For next node Q of tree
    Create children of Q
    Prune generalizations
    Determinize children
Extract patterns from tree

Wrapper Verification
Wrapper fragility is a common problem, and wrapper verification is rare.
- Take the patterns created by DataPro for the current wrapper and create a distribution t from the number of matches of each pattern on the original web pages.
- Take a similar distribution k from the new web pages that are being verified.
- If t and k have approximately the same distribution, the wrapper is still valid; otherwise it needs to be updated.
Recall: 95%
Precision: 47%
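
The slide does not say how "approximately the same distribution" is decided. One standard way to compare two vectors of counts is a Pearson chi-square test of homogeneity, sketched below as an assumption rather than as the method actually used. The function names and the example counts are made up for illustration, and the critical value must be supplied by the caller for the chosen significance level and degrees of freedom.

```python
def chi_square_statistic(t_counts, k_counts):
    """Pearson chi-square statistic comparing two vectors of pattern-match counts."""
    total_t, total_k = sum(t_counts), sum(k_counts)
    if total_t == 0 or total_k == 0:
        raise ValueError("each distribution needs at least one pattern match")
    grand = total_t + total_k
    stat = 0.0
    for t, k in zip(t_counts, k_counts):
        col = t + k
        if col == 0:
            continue  # pattern matched nowhere; it carries no information
        exp_t = total_t * col / grand  # expected count if both share one distribution
        exp_k = total_k * col / grand
        stat += (t - exp_t) ** 2 / exp_t + (k - exp_k) ** 2 / exp_k
    return stat

def wrapper_still_valid(t_counts, k_counts, critical_value):
    """Treat the wrapper as valid when the statistic stays below the critical value."""
    return chi_square_statistic(t_counts, k_counts) < critical_value

# Three patterns give 2 degrees of freedom; 5.991 is the 0.05 critical value for df = 2.
t = [50, 30, 20]        # hypothetical match counts on the original pages
k_same = [48, 33, 19]   # hypothetical counts on new pages with unchanged layout
k_changed = [5, 10, 85] # hypothetical counts after a layout change
print(wrapper_still_valid(t, k_same, 5.991))
print(wrapper_still_valid(t, k_changed, 5.991))
```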

Wrapper Maintenance
- Take the original patterns
- Find matching start and end patterns
- Remove sequences with unusually high or low length
- Score the remaining sequences based on location, adjacent tokens, and visibility to the user
- Cluster the choices by score; the highest-scoring cluster should contain only correct examples of the data field
- 62 of 77 tests contained correct, complete data field examples
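
The length-filtering step can be sketched as follows. How "unusually high or low" is defined is not stated on the slide, so this example assumes a cutoff of a fixed number of standard deviations around the mean token length of the old, known-good examples. The function name `filter_by_length` and the `max_dev` parameter are hypothetical.

```python
from statistics import mean, pstdev

def filter_by_length(candidates, training_lengths, max_dev=2.0):
    """Keep candidate token sequences whose length is close to the training lengths.

    The mean +/- max_dev standard deviations cutoff is an assumption for this sketch.
    """
    mu = mean(training_lengths)
    sigma = pstdev(training_lengths)
    low, high = mu - max_dev * sigma, mu + max_dev * sigma
    return [c for c in candidates if low <= len(c) <= high]

training_lengths = [2, 2, 3, 2, 3]  # token counts of old, known-good examples
candidates = [
    ["New", "York"],
    ["Click", "here", "to", "see", "all", "of", "our", "special", "offers"],
    ["Los", "Angeles", "City"],
]
print(filter_by_length(candidates, training_lengths))
```

The surviving candidates would then go on to the scoring and clustering steps, which are not sketched here.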
