1
RoadRunner: Towards Automatic Data Extraction from Large Web Sites
Valter Crescenzi, Giansalvatore Mecca, Paolo Merialdo
Presented by Lei Lei
2
Outline
- Problem
- Theoretical Background
- Matching Technique
- Examples and Experimental Results
- Comparison with Other Works
- Assessment
3
Introduction and Problems
- Fast-growing information on the Web
- Little of it is machine-understandable
- Wrappers and their key problem
4
Previous Work
- Data-intensive Web sites
- Grammar inference
- Gold's work
- Positive examples alone
- Problem: complexity of learning
5
Common Features of Wrappers
- Extra information from users' interactions
- A priori knowledge
- One HTML page at a time
6
Background
- Nested types
- Generate a union-free regular expression (UFRE)
- Locate the least upper bound in the RE lattice to generate a wrapper
- This reduces to finding the least upper bound of two UFREs
7
Matching/Mismatching
- Start with the first page and create an RE that defines the wrapper
- Match each successive sample against the wrapper
- Mismatches result in generalizations of the regular expression
- Types of mismatches
  - String mismatches
  - Tag mismatches
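The matching step above can be sketched as a token-by-token comparison that reports the first mismatch and its type. This is a minimal illustration, not the authors' implementation: it assumes pages are already tokenized into lists of tags and strings, and the names `first_mismatch`, `is_tag`, and `PCDATA` are hypothetical.

```python
PCDATA = "#PCDATA"  # wrapper token standing for "any string" (a field)

def is_tag(token):
    """Treat any token starting with '<' as a tag (assumption of this sketch)."""
    return token.startswith("<")

def first_mismatch(wrapper, sample):
    """Return (position, kind) of the first mismatch, or None if the sample matches.

    kind is "string" when both tokens are text, "tag" otherwise.
    """
    for pos, (w, s) in enumerate(zip(wrapper, sample)):
        # Identical tokens match; a field in the wrapper matches any string.
        if w == s or (w == PCDATA and not is_tag(s)):
            continue
        if not is_tag(w) and not is_tag(s):
            return pos, "string"
        return pos, "tag"
    if len(wrapper) != len(sample):
        # One sequence ended early: treated here as a tag mismatch.
        return min(len(wrapper), len(sample)), "tag"
    return None

wrapper = ["<html>", "Books of:", "<b>", "John Smith", "</b>", "<ul>"]
sample = ["<html>", "Books of:", "<b>", "Paul Jones", "</b>", "<img>", "<ul>"]
print(first_mismatch(wrapper, sample))
```

The page contents in the usage lines are made-up placeholders. Each reported mismatch would then be handed to the matching generalization step (field, optional, or iterator discovery).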
8
Example Pages
9
Simple Matching Example
- String mismatch: discover fields
- Replace the mismatching string with #PCDATA
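A minimal sketch of field discovery, assuming the two pages have identical tag structure and differ only in their text. The tokenizer, the function names, and the sample pages are illustrative assumptions, not taken from the slides.

```python
import re

# A token is either a whole tag or a run of text between tags.
TOKEN = re.compile(r"<[^>]+>|[^<]+")

def tokenize(html):
    """Split HTML into tag tokens and non-empty, stripped text tokens."""
    return [t.strip() for t in TOKEN.findall(html) if t.strip()]

def is_tag(token):
    return token.startswith("<")

def generalize_strings(wrapper, sample):
    """Replace text tokens that differ between wrapper and sample with #PCDATA.

    Handles string mismatches only; any tag mismatch raises ValueError.
    """
    if len(wrapper) != len(sample):
        raise ValueError("tag mismatch: not handled by this sketch")
    out = []
    for w, s in zip(wrapper, sample):
        if w == s or (w == "#PCDATA" and not is_tag(s)):
            out.append(w)              # already matches
        elif not is_tag(w) and not is_tag(s):
            out.append("#PCDATA")      # string mismatch: a field is discovered
        else:
            raise ValueError("tag mismatch: not handled by this sketch")
    return out

page1 = tokenize("<html>Books of: <b>John Smith</b></html>")
page2 = tokenize("<html>Books of: <b>Paul Jones</b></html>")
print(generalize_strings(page1, page2))
# ['<html>', 'Books of:', '<b>', '#PCDATA', '</b>', '</html>']
```

Text that is identical on both pages ("Books of:") stays in the wrapper as a constant, while text that varies becomes a field.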
10
Example (Cont.)
Tag mismatch: discovering optionals
- Find repeated and optional patterns
- Cross-search
- Wrapper generalization
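Cross-search and wrapper generalization for optionals can be sketched as follows. The idea is to look for the wrapper's mismatching tag later in the sample, and for the sample's mismatching tag later in the wrapper; the tokens skipped to reach it form the candidate optional. The function names, the `("?", tokens)` representation, and the rule of preferring the shorter skip are assumptions of this sketch.

```python
def cross_search(wrapper, sample, i, j):
    """At a tag mismatch wrapper[i] != sample[j], find where matching could resume.

    Returns ("sample", k) if sample[j:k] is the candidate optional,
    ("wrapper", k) if wrapper[i:k] is, or None if neither search succeeds.
    """
    def find(tokens, target, start):
        try:
            return tokens.index(target, start)
        except ValueError:
            return None

    k_sample = find(sample, wrapper[i], j + 1)    # optional lies in the sample
    k_wrapper = find(wrapper, sample[j], i + 1)   # optional lies in the wrapper
    candidates = []
    if k_sample is not None:
        candidates.append((k_sample - j, "sample", k_sample))
    if k_wrapper is not None:
        candidates.append((k_wrapper - i, "wrapper", k_wrapper))
    if not candidates:
        return None
    _, side, k = min(candidates)  # assumption: prefer the shorter skipped span
    return side, k

def add_optional(wrapper, sample, i, j):
    """Generalize the wrapper by marking the skipped tokens as optional."""
    found = cross_search(wrapper, sample, i, j)
    if found is None:
        return wrapper
    side, k = found
    if side == "sample":
        return wrapper[:i] + [("?", tuple(sample[j:k]))] + wrapper[i:]
    return wrapper[:i] + [("?", tuple(wrapper[i:k]))] + wrapper[k:]

wrapper = ["</b>", "<ul>", "<li>"]
sample = ["</b>", "<img>", "<ul>", "<li>"]   # hypothetical extra tag in the sample
print(add_optional(wrapper, sample, 1, 1))
# ['</b>', ('?', ('<img>',)), '<ul>', '<li>']
```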
11
Example (Cont.)
[Figure: wrapper fragment with an optional, #PCDATA ( )? #PCDATA; tags lost in extraction]
Tag mismatches: discovering iterators
- Assume the mismatch is caused by repeated elements in a list
- Match possible squares against earlier squares
- Generalize the wrapper by finding all contiguous repeated occurrences, i.e. ( Title: #PCDATA )+
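Iterator discovery can be sketched in two steps: delimit a candidate square in the sample, then collapse every contiguous run of that square in the wrapper into a `( ... )+` node. This is a simplified illustration that compares squares by exact token equality, so it assumes fields were already generalized to #PCDATA. The function names and the `("+", tokens)` representation are hypothetical.

```python
def candidate_square(sample, j, terminal):
    """Tokens from mismatch position j up to and including the next terminal tag."""
    end = sample.index(terminal, j)
    return sample[j:end + 1]

def collapse_repeats(wrapper, square):
    """Replace each maximal run of contiguous copies of square with ('+', square)."""
    n = len(square)
    out, i = [], 0
    while i < len(wrapper):
        if wrapper[i:i + n] == square:
            while wrapper[i:i + n] == square:   # consume the whole run
                i += n
            out.append(("+", tuple(square)))
        else:
            out.append(wrapper[i])
            i += 1
    return out

item = ["<li>", "Title:", "#PCDATA", "</li>"]
wrapper = ["<ul>"] + item * 2 + ["</ul>"]   # wrapper built from a page with two items
sample = ["<ul>"] + item * 3 + ["</ul>"]    # sample page with three items

# Mismatch at position 9: wrapper has "</ul>", sample has "<li>".
# The token before the mismatch ("</li>") is taken as the square's terminal tag.
square = candidate_square(sample, 9, wrapper[8])
print(collapse_repeats(wrapper, square))
# ['<ul>', ('+', ('<li>', 'Title:', '#PCDATA', '</li>')), '</ul>']
```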
12
A More Complex Example
13
Extraction Output
14
Experiment Results
15
Comparison with other works
16
Assessment
- Quality of extracted datasets
- Assumptions made for simplicity's sake
  - regularly structured pages
  - no disjunctions
- Search space for explaining mismatches
  - Uses a number of heuristics to prune the space
- Limited backtracking due to the large number of alternatives
- Patterns cannot be delimited by optionals
  - Will result in pruning possible wrappers
17
Questions
- Can RoadRunner be improved to work with more than 2 pages at a time?
- Anything to improve the manually named field process?
- Introduce disjunction?