1
Learning the Common Structure of Data
Kristina Lerman and Steven Minton
Presentation by Jeff Roth
2
Introduction
Data Extraction
- Data Expectation
Goals
- Wrapper Verification
- Wrapper Maintenance
3
Representation
Break the web page into tokens, a representation more general than characters but more specific than word symbols.
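A minimal sketch of such a tokenizer, assuming each token carries its literal text plus a few syntactic types. Only CAPS appears in these slides (in the prefix tree example); NUMBER, LOWER, ALPHA, and PUNCT, as well as the function names, are hypothetical labels chosen for illustration.

```python
import re

# Runs of ASCII letters, runs of digits, or any single other non-space character.
TOKEN_RE = re.compile(r"[A-Za-z]+|\d+|[^\sA-Za-z\d]")

def tokenize(text):
    """Split raw page text into tokens."""
    return TOKEN_RE.findall(text)

def token_types(token):
    """Return the literal token followed by progressively more general types.

    Type names other than CAPS are assumptions, not taken from the slides.
    """
    if token.isdigit():
        return [token, "NUMBER"]
    if token.isalpha():
        case = "CAPS" if token[0].isupper() else "LOWER"
        return [token, case, "ALPHA"]
    return [token, "PUNCT"]

tokens = tokenize("New Haven, CT 06511")
```

Each token can then be matched either by its exact text or by one of its types, which is what lets a learned pattern generalize from "New Haven" to any capitalized word pair.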
4
DataPro
For complex fields, it is sufficient to learn only the starting and ending sequences of a data field.
DataPro
- Only Positive Examples
- Statistical Algorithm
- Polynomial Time
- Greedy
5
Prefix Tree
For a given data field, the tokens are encoded in a prefix tree (a suffix tree would be similar).
Each node is a specialization of its parent.
Example: the data field is City
- node: “New”
- children: “Haven”, “York”, CAPS
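The City example can be sketched as a small counting prefix tree. This is an illustration under assumptions: the `Node` class, the `symbols` helper, and the depth limit are hypothetical, and only the CAPS type is modeled.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    symbol: str
    count: int = 0
    children: dict = field(default_factory=dict)

def symbols(token):
    # Literal token, plus the CAPS type when the token starts with a capital.
    return [token, "CAPS"] if token[:1].isupper() else [token]

def insert(node, tokens, max_depth=3):
    """Add one example's starting tokens below `node`, counting every path."""
    if not tokens or max_depth == 0:
        return
    for sym in symbols(tokens[0]):
        child = node.children.setdefault(sym, Node(sym))
        child.count += 1
        insert(child, tokens[1:], max_depth - 1)

root = Node("ROOT")
for city in (["New", "Haven"], ["New", "York"], ["Los", "Angeles"]):
    root.count += 1
    insert(root, city)
```

After these three insertions, the "New" node has the children "Haven", "York", and CAPS, mirroring the example on the slide. Each path from the root spells a pattern, and a child's pattern is more specific than its parent's because it constrains one more token.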
6
Significant(count1, count2, P, α)
Significance is the main measure used in the DataPro algorithm.
Parameters:
- count1, count2: number of times a pattern of tokens appears in the data field examples
- P: probability of count1 given count2
- α: significance level at which the null hypothesis is rejected
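One possible reading of this test, sketched as an exact binomial tail. The interpretation is an assumption: `count1` is taken as the number of times the extended pattern was seen, `count2` as the number of times its prefix was seen, and `p` as the probability of the extension occurring by chance. The slides do not say which statistical test the presentation's source actually uses, so this is a stand-in, not the original method.

```python
from math import comb

def significant(count1, count2, p, alpha=0.05):
    """Return True if seeing at least count1 hits in count2 trials is unlikely by chance.

    Null hypothesis (assumed): each trial is a hit independently with probability p.
    """
    tail = sum(
        comb(count2, k) * p**k * (1 - p) ** (count2 - k)
        for k in range(count1, count2 + 1)
    )
    return tail < alpha
```

For instance, 9 hits out of 10 with a chance probability of 0.1 is judged significant, while 1 hit out of 10 is not, since one or more hits occurs by chance about 65% of the time.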
7
DataPro Algorithm
Create root node of tree
For next node Q of tree
    Create children of Q
    Prune Generalizations
    Determinize children
Extract patterns from tree
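The loop above can be sketched in a simplified form. This is not the authors' implementation: the "Prune Generalizations" and "Determinize children" steps are omitted, the significance check is an exact binomial tail chosen for illustration, and the `chance` table of background probabilities, `default_p`, and `max_len` are hypothetical inputs.

```python
from math import comb

def symbols(token):
    # Literal token, plus the CAPS type when the token starts with a capital.
    return [token, "CAPS"] if token[:1].isupper() else [token]

def matches(pattern, tokens):
    """True if the first len(pattern) tokens fit the pattern."""
    return len(tokens) >= len(pattern) and all(
        sym in symbols(tok) for sym, tok in zip(pattern, tokens)
    )

def binomial_tail(k, n, p):
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

def learn_start_patterns(examples, chance, default_p=0.05, alpha=0.05, max_len=3):
    patterns = []
    frontier = [()]  # the root is the empty pattern
    while frontier:
        q = frontier.pop(0)  # next node Q of the tree
        if len(q) >= max_len:
            continue
        support = [e for e in examples if matches(q, e)]
        counts = {}  # candidate children of Q and how often each occurs
        for e in support:
            if len(e) > len(q):
                for sym in symbols(e[len(q)]):
                    counts[sym] = counts.get(sym, 0) + 1
        for sym, c in counts.items():
            # Greedy step: keep a child only if it is significant.
            if binomial_tail(c, len(support), chance.get(sym, default_p)) < alpha:
                child = q + (sym,)
                patterns.append(child)
                frontier.append(child)
    return patterns

cities = [["New", "Haven"], ["New", "York"], ["New", "Orleans"],
          ["Los", "Angeles"], ["San", "Diego"]]
patterns = learn_start_patterns(cities, chance={"CAPS": 0.5})
```

The sketch uses only positive examples and grows patterns greedily, one token at a time, which matches the properties listed for DataPro earlier in the deck. Ending patterns could be learned the same way by reversing each token list.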
8
Wrapper Verification
Wrapper fragility is a common problem, and wrapper verification is rare.
- Take the patterns created by DataPro for the current wrapper and create a distribution t from the number of matches of each pattern on the original web pages.
- Take a similar distribution k from the new web pages that are being verified.
- If t and k have approximately the same distribution, the wrapper is still valid; otherwise it needs to be updated.
Recall: 95%
Precision: 47%
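A minimal sketch of the comparison between t and k, assuming a Pearson chi-square statistic is an acceptable measure of "approximately the same distribution". The slides do not name the test, so the statistic, the function names, and the caller-supplied critical value are all assumptions.

```python
def chi_square(t, k):
    """Pearson chi-square of observed counts k against expected counts scaled from t."""
    scale = sum(k) / sum(t)
    return sum(
        (obs - exp * scale) ** 2 / (exp * scale)
        for exp, obs in zip(t, k)
        if exp > 0
    )

def wrapper_still_valid(t, k, critical_value):
    """t[i], k[i]: matches of pattern i on the original and on the new pages."""
    if sum(k) == 0:
        return False  # no pattern matches anything on the new pages
    return chi_square(t, k) <= critical_value

# 5.991 is the chi-square critical value for 2 degrees of freedom at the 0.05 level.
ok = wrapper_still_valid([30, 50, 20], [31, 48, 21], critical_value=5.991)
broken = wrapper_still_valid([30, 50, 20], [2, 3, 95], critical_value=5.991)
```

Here `ok` is true because the match counts barely move, while `broken` is false because nearly all matches shift to the third pattern, the kind of change a page redesign might cause.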
9
Wrapper Maintenance
- Take the original patterns
- Find matching start and end patterns
- Remove sequences with unusually high or low length
- Score remaining sequences based on location, adjacent tokens, and visibility to the user
- Cluster choices by score; the highest scoring cluster should contain only correct examples of the data field
- 62 of 77 tests contained correct, complete data field examples
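The length-filtering step can be sketched as follows, assuming "unusually high or low" means more than `z` standard deviations from the mean length of the original examples. That threshold rule and the function name are assumptions. The scoring and clustering steps are not sketched, since the slide gives no formulas for them.

```python
from statistics import mean, stdev

def filter_by_length(candidates, training_lengths, z=2.0):
    """Keep candidate token sequences whose length is typical of the training data.

    training_lengths needs at least two values so a standard deviation exists.
    """
    mu = mean(training_lengths)
    sigma = stdev(training_lengths)
    low, high = mu - z * sigma, mu + z * sigma
    return [c for c in candidates if low <= len(c) <= high]

candidates = [
    ["New", "Haven"],
    ["Click", "here", "to", "see", "more", "results"],
    [],
]
kept = filter_by_length(candidates, training_lengths=[2, 3, 2, 3, 2])
```

With training lengths of 2 and 3 tokens, only the two-token candidate survives; the six-token navigation text and the empty sequence are discarded before any scoring takes place.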