
RoadRunner: Towards Automatic Data Extraction from Large Web Sites Valter Crescenzi Giansalvatore Mecca Paolo Merialdo Presented by Lei Lei.


1 RoadRunner: Towards Automatic Data Extraction from Large Web Sites Valter Crescenzi Giansalvatore Mecca Paolo Merialdo Presented by Lei Lei

2 Outline
• Problem
• Theoretical Background
• Matching Technique
• Examples and Experimental Results
• Comparison with other works
• Assessment

3 Introduction and Problems
• Fast-growing volume of information
• Little of it is machine-understandable
• Wrappers and their key problem

4 Previous Works
• Data-intensive Web sites
• Grammar inference
• Gold’s work
  - The problem of learning from positive examples alone
  - Complexity of learning

5 Common Features of Wrappers
• Extra information from users’ interactions
• A priori knowledge
• One HTML page at a time

6 Background
• Nested types
• Generate a union-free regular expression (UFRE)
• Locate the least upper bound in the lattice of regular expressions to generate a wrapper
• This reduces to finding the least upper bound of two UFREs

7 Matching/Mismatching
• Start with the first page and create a regular expression that defines the wrapper
• Match each successive sample against the wrapper
• Mismatches result in generalizations of the regular expression
• Types of mismatches
  - String mismatches
  - Tag mismatches
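The matching step above can be sketched roughly as follows. This is a minimal illustration, not the RoadRunner implementation: the helper names `tokenize`, `is_tag`, and `first_mismatch` are hypothetical, the tokenizer is a naive regular expression rather than a real HTML parser, and the extra "length" category (one page ends before the other) is an assumption added for completeness.

```python
import re

# Naive tokenizer: a token is either a tag or a run of text between tags.
TOKEN_RE = re.compile(r"<[^>]+>|[^<]+")


def tokenize(html):
    """Split HTML into tag tokens and non-empty text tokens."""
    tokens = []
    for raw in TOKEN_RE.findall(html):
        tok = raw.strip()
        if tok:
            tokens.append(tok)
    return tokens


def is_tag(token):
    """True if the token looks like an HTML tag."""
    return token.startswith("<") and token.endswith(">")


def first_mismatch(wrapper, sample):
    """Return (index, kind) for the first disagreement, or None.

    kind is "string" when both tokens are text, "tag" when at least
    one of them is a tag, and "length" (an assumption of this sketch)
    when one token list is a proper prefix of the other.
    """
    for i, (w, s) in enumerate(zip(wrapper, sample)):
        if w == s:
            continue
        if not is_tag(w) and not is_tag(s):
            return i, "string"
        return i, "tag"
    if len(wrapper) != len(sample):
        return min(len(wrapper), len(sample)), "length"
    return None


page1 = tokenize("<html><b>John Smith</b></html>")
page2 = tokenize("<html><b>Paul Jones</b></html>")
page3 = tokenize("<html><i>John Smith</i></html>")
print(first_mismatch(page1, page2))
print(first_mismatch(page1, page3))
```

Here the first page plays the role of the initial wrapper, and each later page is a sample compared against it token by token.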

8 Example Pages

9 Simple Matching Example
• String mismatch: discover fields
• Replace the mismatching string with #PCDATA
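As a rough sketch of the string-mismatch rule on this slide, the function below generalizes a wrapper against one sample whose tags already line up. The name `generalize_strings` is hypothetical, and the requirement that both token lists have the same length and identical tags is a simplifying assumption of this sketch; tag mismatches are handled by separate rules.

```python
PCDATA = "#PCDATA"


def is_tag(token):
    """True if the token looks like an HTML tag."""
    return token.startswith("<") and token.endswith(">")


def generalize_strings(wrapper, sample):
    """Replace every string mismatch in the wrapper with #PCDATA.

    Assumes the two token lists have identical tag structure; a tag
    mismatch raises ValueError because this sketch does not handle it.
    """
    if len(wrapper) != len(sample):
        raise ValueError("token lists differ in length")
    result = []
    for w, s in zip(wrapper, sample):
        if w == s:
            result.append(w)           # tokens agree: keep as-is
        elif not is_tag(w) and not is_tag(s):
            result.append(PCDATA)      # string mismatch: a field
        else:
            raise ValueError("tag mismatch between %r and %r" % (w, s))
    return result


wrapper = ["<b>", "Name:", "</b>", "John Smith"]
sample = ["<b>", "Name:", "</b>", "Paul Jones"]
print(generalize_strings(wrapper, sample))
```

Strings that are equal on both pages, such as the label `Name:` in this made-up input, stay in the wrapper as fixed template text; only the differing strings become fields.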

10 Example (Cont.)
• Tag mismatch: discover optionals
  - Find repeated and optional patterns
  - Cross-search
  - Wrapper generalization
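One possible reading of the cross-search step is sketched below: at a tag mismatch, look for the wrapper's token further along in the sample, and for the sample's token further along in the wrapper; the tokens skipped on the side where the search succeeds are treated as optional. The function names, the tie-breaking rule (prefer the shorter skipped span), and the use of separate `(` and `)?` tokens to mark an optional are all assumptions of this sketch.

```python
def cross_search(wrapper, i, sample, j):
    """Locate an optional pattern at a tag mismatch wrapper[i] vs sample[j].

    Returns (side, tokens), where side is "sample" or "wrapper" and
    tokens is the skipped span, or None if neither search succeeds.
    """
    def find(tokens, target, start):
        try:
            return tokens.index(target, start)
        except ValueError:
            return None

    k_sample = find(sample, wrapper[i], j + 1)
    k_wrapper = find(wrapper, sample[j], i + 1)

    candidates = []
    if k_sample is not None:
        candidates.append((k_sample - j, "sample", sample[j:k_sample]))
    if k_wrapper is not None:
        candidates.append((k_wrapper - i, "wrapper", wrapper[i:k_wrapper]))
    if not candidates:
        return None
    # Heuristic assumed here: prefer the shorter optional span.
    _, side, tokens = min(candidates, key=lambda c: c[0])
    return side, tokens


def generalize_optional(wrapper, i, side, tokens):
    """Return a new wrapper with the span marked as ( ... )? at position i."""
    optional = ["("] + tokens + [")?"]
    if side == "sample":
        # The span exists only on the sample: insert it as optional.
        return wrapper[:i] + optional + wrapper[i:]
    # The span already exists on the wrapper: wrap it in place.
    return wrapper[:i] + optional + wrapper[i + len(tokens):]


wrapper = ["<b>", "#PCDATA", "</b>", "<ul>"]
sample = ["<b>", "#PCDATA", "</b>", "<img>", "<ul>"]
side, tokens = cross_search(wrapper, 3, sample, 3)
print(generalize_optional(wrapper, 3, side, tokens))
```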

11 Example (Cont.)
#PCDATA ( )? #PCDATA
• Tag mismatches: discovering iterators
  - Assume the mismatch is caused by repeated elements in a list
  - Match possible squares against earlier squares
  - Generalize the wrapper by finding all contiguous repeated occurrences, i.e. ( Title: #PCDATA )+
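The iterator steps can be illustrated with the simplified sketch below. It only handles the case where the extra repetition is on the sample side, it compares squares after replacing every text token with #PCDATA (so a fixed label such as `Title:` would also be generalized, unlike in the slide's example), and it requires an exact match against the immediately preceding square. The names `find_square` and `collapse_repeats` and the `(` / `)+` marker tokens are hypothetical.

```python
PCDATA = "#PCDATA"


def is_tag(token):
    """True if the token looks like an HTML tag."""
    return token.startswith("<") and token.endswith(">")


def normalize(tokens):
    """Replace text tokens with #PCDATA so squares compare structurally."""
    return [t if is_tag(t) else PCDATA for t in tokens]


def find_square(sample, j):
    """Return the candidate square starting at mismatch position j, or None.

    The token just before the mismatch is taken as the terminal tag of
    the repeated pattern; the square runs up to its next occurrence and
    must equal the tokens immediately before position j.
    """
    if j == 0:
        return None
    terminal = sample[j - 1]
    try:
        k = sample.index(terminal, j)
    except ValueError:
        return None
    square = normalize(sample[j:k + 1])
    earlier = normalize(sample[max(0, j - len(square)):j])
    return square if earlier == square else None


def collapse_repeats(wrapper, square):
    """Replace each contiguous run of the square in the wrapper by ( square )+."""
    n = len(square)
    out, i = [], 0
    while i < len(wrapper):
        if wrapper[i:i + n] == square:
            while wrapper[i:i + n] == square:
                i += n                  # skip every contiguous occurrence
            out += ["("] + square + [")+"]
        else:
            out.append(wrapper[i])
            i += 1
    return out


# Wrapper learned from a page with two list items; the sample has three.
wrapper = ["<ul>", "<li>", PCDATA, "</li>", "<li>", PCDATA, "</li>", "</ul>"]
sample = ["<ul>", "<li>", "A", "</li>", "<li>", "B", "</li>",
          "<li>", "C", "</li>", "</ul>"]
square = find_square(sample, 7)   # mismatch: "</ul>" vs "<li>" at index 7
print(collapse_repeats(wrapper, square))
```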

12 A More Complex Example

13 Extraction Output

14 Experiment Results

15 Comparison with other works

16 Assessment
• Quality of extracted datasets
• Assumptions made for simplicity’s sake
  - Regularly structured pages
  - No disjunctions
• Search space for explaining mismatches
  - Uses a number of heuristics to prune the space
• Limited backtracking due to the large number of alternatives
• Patterns cannot be delimited by optionals
  - This results in pruning some possible wrappers

17 Questions
• Can RoadRunner be improved to work with more than two pages at a time?
• Can the process of manually naming fields be improved?
• Could disjunctions be introduced?

