Learning the Common Structure of Data Kristina Lerman and Steven Minton Presentation by Jeff Roth
Introduction
Data Extraction
- Data Expectation
Goals
- Wrapper Verification
- Wrapper Maintenance
Representation
Break the web page into tokens, a representation more general than individual characters but more specific than word symbols
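A minimal sketch of such a token-level representation, assuming a small type hierarchy. The type names (ALLCAPS, CAPS, LOWER, ALPHA, NUMBER, ALNUM, PUNCT, TOKEN) and both function names are illustrative assumptions, not the authors' exact scheme.

```python
import re

def tokenize(text: str) -> list[str]:
    # Runs of ASCII letters/digits form one token; any other
    # non-space character becomes a token of its own.
    return re.findall(r"[A-Za-z0-9]+|[^\sA-Za-z0-9]", text)

def token_types(token: str) -> list[str]:
    """Return the (assumed) syntactic types of a token, most specific first."""
    types = []
    if token.isalpha():
        if token.isupper() and len(token) > 1:
            types.append("ALLCAPS")
        elif token[0].isupper():
            types.append("CAPS")
        elif token.islower():
            types.append("LOWER")
        types += ["ALPHA", "ALNUM"]
    elif token.isdigit():
        types += ["NUMBER", "ALNUM"]
    elif token.isalnum():
        types.append("ALNUM")
    else:
        types.append("PUNCT")
    types.append("TOKEN")  # every token belongs to the most general type
    return types
```

For example, `tokenize("New Haven, CT 06511")` splits the address into five tokens, and `token_types("New")` places the first one under CAPS before the more general types.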
DataPro
For complex fields, it is sufficient to learn only the starting and ending sequences of a data field.
DataPro:
- Only positive examples
- Statistical algorithm
- Polynomial time
- Greedy
Prefix Tree
For a given data field, the tokens are encoded in a prefix tree (a suffix tree would be similar).
Each node is a specialization of its parent.
Example: data field is City
- node: “New”
- children: “Haven”, “York”, CAPS
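The City example can be reproduced with a small prefix tree. This is a sketch under assumptions: `Node`, `labels`, and `insert` are hypothetical names, and the only generalization modeled is a single CAPS type for capitalized words.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    label: str                     # literal token ("New") or token type ("CAPS")
    count: int = 0                 # number of examples whose prefix reaches this node
    children: dict = field(default_factory=dict)

def labels(token: str) -> list[str]:
    # Assumed generalization: a capitalized token also matches the CAPS type.
    out = [token]
    if token[:1].isupper():
        out.append("CAPS")
    return out

def insert(node: Node, tokens: list[str]) -> None:
    """Add one tokenized example, following both literal and type branches."""
    node.count += 1
    if not tokens:
        return
    for label in labels(tokens[0]):
        child = node.children.setdefault(label, Node(label))
        insert(child, tokens[1:])

root = Node("<root>")
for city in (["New", "Haven"], ["New", "York"], ["Boston"]):
    insert(root, city)
```

After these three insertions, the node “New” has been reached by two examples and has the children “Haven”, “York”, and CAPS, matching the slide. Following every branch doubles the paths per token, so this form is only practical for short fields.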
Significant(count1, count2, P, α)
Significance is the main measure used in the DataPro algorithm.
Parameters:
- count1, count2: number of times a pattern of tokens appears in the data field examples
- P: probability of count1 given count2
- α: significance level at which the null hypothesis is rejected
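One plausible reading of this test, written as a sketch: treat count2 as the number of examples matching a parent pattern, count1 as the number that also match the extended pattern, and P as the chance probability of the extension under the null hypothesis. The exact binomial tail used here is an assumption; the slides do not say which statistic the authors compute.

```python
from math import comb

def significant(count1: int, count2: int, p: float, alpha: float = 0.05) -> bool:
    """One-sided exact binomial test (assumed interpretation).

    Returns True when observing at least count1 matches out of count2
    trials would be unlikely (probability below alpha) if each trial
    matched by chance with probability p.
    """
    # P(X >= count1) for X ~ Binomial(count2, p)
    tail = sum(
        comb(count2, k) * p**k * (1 - p) ** (count2 - k)
        for k in range(count1, count2 + 1)
    )
    return tail < alpha
```

With a chance probability of 0.2, seeing 9 matches out of 10 is flagged as significant, while 2 out of 10 is not.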
DataPro Algorithm
Create root node of tree
For next node Q of tree
    Create children of Q
    Prune generalizations
    Determinize children
Extract patterns from tree
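The loop above can be illustrated with a heavily simplified sketch. Everything here is an assumption made for illustration: `grow_patterns` and `token_type` are hypothetical names, a plain frequency threshold (`min_count`) stands in for the Significant test, only one generalization (CAPS) exists, pruning drops a type that a single literal child already covers, and determinizing sends each example to its most specific child.

```python
from collections import Counter

def token_type(tok):
    # Hypothetical one-level generalization: capitalized words become CAPS.
    return "CAPS" if tok[:1].isupper() else "TOKEN"

def grow_patterns(examples, min_count=2):
    """Greedy, breadth-first pattern growth over tokenized examples."""
    patterns = []
    queue = [((), list(examples))]        # (pattern so far, examples matching it)
    while queue:
        pattern, exs = queue.pop(0)       # next node Q of the tree
        depth = len(pattern)
        nxt = [e for e in exs if len(e) > depth]

        # Create children of Q: literal tokens and token types at the next position.
        literals = {t: c for t, c in Counter(e[depth] for e in nxt).items()
                    if c >= min_count}
        types = {t: c for t, c in Counter(token_type(e[depth]) for e in nxt).items()
                 if c >= min_count}

        # Prune generalizations: drop a type that one literal child fully covers.
        for lit, c in literals.items():
            if types.get(token_type(lit)) == c:
                del types[token_type(lit)]

        # Determinize: an example follows its literal child if kept, else its type child.
        children = {}
        for e in nxt:
            tok = e[depth]
            label = tok if tok in literals else token_type(tok)
            if label in literals or label in types:
                children.setdefault(label, []).append(e)

        for label, child_exs in children.items():
            patterns.append(pattern + (label,))   # extract the pattern at this node
            queue.append((pattern + (label,), child_exs))
    return patterns

cities = [("New", "Haven"), ("New", "York"), ("New", "Orleans"),
          ("Los", "Angeles"), ("Boston",)]
print(grow_patterns(cities))
```

On these five city names the sketch yields the starting patterns `("New",)`, `("CAPS",)`, and `("New", "CAPS")`. The greedy choice is that a node is expanded once and never revisited.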
Wrapper Verification
Wrapper fragility is a common problem, and wrapper verification is rare.
- Take the patterns created by DataPro for the current wrapper and build a distribution t from the number of matches of each pattern on the original web pages
- Build a similar distribution k from the new web pages being verified
- If t and k have approximately the same distribution, the wrapper is still valid; otherwise it needs to be updated
Recall: 95%
Precision: 47%
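A minimal sketch of the comparison between t and k, assuming a Pearson chi-square goodness-of-fit statistic. The slides do not name the test, so the statistic, the function names, and the caller-supplied critical value are all assumptions.

```python
def chi_square_statistic(t, k):
    """Pearson statistic for new counts k against the proportions implied by t.

    t and k are per-pattern match counts in the same pattern order;
    t is assumed to contain at least one non-zero count.
    """
    total_t, total_k = sum(t), sum(k)
    stat = 0.0
    for t_i, k_i in zip(t, k):
        expected = total_k * t_i / total_t
        if expected > 0:          # patterns never seen in training are skipped
            stat += (k_i - expected) ** 2 / expected
    return stat

def wrapper_still_valid(t, k, critical_value):
    # The wrapper is treated as valid when the new pages do not deviate
    # from the training distribution by more than the critical value.
    return chi_square_statistic(t, k) <= critical_value
```

With three patterns there are two degrees of freedom, so a 5% test would use a critical value of 5.991. Counts of `[48, 33, 19]` against training counts of `[50, 30, 20]` pass, while `[5, 10, 85]` fail.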
Wrapper Maintenance
- Take the original patterns
- Find matching start and end patterns
- Remove sequences with unusually high or low length
- Score remaining sequences based on location, adjacent tokens, and visibility to the user
- Cluster choices by score; the highest-scoring cluster should contain only correct examples of the data field
- 62 of 77 tests contained correct, complete data field examples
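The length-filtering step in the list above could look like the following sketch. The slides do not define "unusually" high or low, so the rule used here (keep candidates within `max_dev` standard deviations of the mean training length, with a floor of one token on the deviation) is purely an assumption, as is the function name.

```python
from statistics import mean, pstdev

def filter_by_length(candidates, training_lengths, max_dev=2.0):
    """Keep candidate token sequences whose length is typical of the field."""
    mu = mean(training_lengths)
    # Floor the spread at one token so a zero deviation does not reject everything.
    sigma = max(pstdev(training_lengths), 1.0)
    return [c for c in candidates if abs(len(c) - mu) <= max_dev * sigma]
```

Given training lengths of `[2, 2, 3, 2, 3]`, a two-token candidate is kept, while an empty sequence or a ten-token sequence is removed before scoring and clustering.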