Accurately and Reliably Extracting Data from the Web: A Machine Learning Approach
Kriti Chauhan
CSE6339 Spring 2009
CSE6339 Spring 2009 University of Texas at Arlington
Introduction
Main issue: extracting information from the massive amount of data on the web.
We can search and rank web pages, but fielded searches, range-based or join-based structured queries, data mining, and decision support typically require detailed, fine-grained processing.
Solution: extract data from web sites and transform it into a structured format such as XML.
How to extract the data: wrappers.
1/14/2019 CSE6339 Spring 2009 University of Texas at Arlington
Wrappers
Definition: a wrapper is a piece of software that enables a semi-structured web source to be queried as if it were a database.
A wrapper exploits the implicit underlying structure of the source.
Each website has a different layout and structure, so each website needs a wrapper customized for it.
Logical components of a virtual data integration system.
STALKER: Hierarchical Wrapper Induction Algorithm
Salient features:
  Learns highly accurate extraction rules
  Verifies the wrapper to ensure that correct data continues to be extracted
  Automatically adapts to changes in the sites from which data is being extracted
Building A Wrapper
Example
Consider E1, E2, E3:
Start rule for Address: R = SkipTo(</i><p>Address:<i>)
Identifying Extraction Rules: Key Idea
Start rule for finding the start of “Address” (considering E1, E2, E3):
  R1 = SkipTo(Address) SkipTo(<i>)
Other possible rules:
  R2 = SkipTo(Address: <i>)   (one 3-token landmark)
  R3 = SkipTo(Cuisine: <i>) SkipTo(Address: <i>)   (two 3-token landmarks)
  R4 = SkipTo(Cuisine: <i> _Capitalized_ </i><p> Address: <i>)   (one 9-token landmark; uses a wildcard)
Wildcards: _Capitalized_, _Number_, _AllCaps_, _HtmlTag_
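As a rough illustration of how a start rule is applied, the sketch below tokenizes one page and runs R1, R2, and R4 against it. The tokenizer regex, the wildcard tests, the helper names (tokenize, skip_to, apply_rule), and the sample page E1 (its cuisine and phone values in particular) are assumptions made for this example, not details taken from STALKER.

```python
import re

# Assumed tokenization: HTML tags, runs of word characters, single punctuation marks.
TOKEN_RE = re.compile(r"</?\w+[^>]*>|\w+|[^\w\s]")

# Assumed definitions of the wildcards named on the slide.
WILDCARDS = {
    "_Number_": lambda t: t.isdigit(),
    "_Capitalized_": lambda t: t[:1].isupper() and t[1:] == t[1:].lower(),
    "_AllCaps_": lambda t: t.isalpha() and t.isupper(),
    "_HtmlTag_": lambda t: t.startswith("<") and t.endswith(">"),
}

def tokenize(text):
    return TOKEN_RE.findall(text)

def matches(pattern, token):
    """A pattern element is either a wildcard name or a literal token."""
    test = WILDCARDS.get(pattern)
    return test(token) if test else pattern == token

def skip_to(tokens, landmark, start=0):
    """Index just past the first occurrence of landmark at or after start, else None."""
    n = len(landmark)
    for i in range(start, len(tokens) - n + 1):
        if all(matches(p, t) for p, t in zip(landmark, tokens[i:i + n])):
            return i + n
    return None

def apply_rule(tokens, rule):
    """A rule is a list of landmarks; each SkipTo resumes where the previous one stopped."""
    pos = 0
    for landmark in rule:
        pos = skip_to(tokens, landmark, pos)
        if pos is None:
            return None
    return pos

# Hypothetical page in the style of E1.
E1 = "Cuisine: <i>Seafood</i><p>Address: <i>12 Pico St.</i><p>Phone: <i>(800) 555-1515</i>"
tokens = tokenize(E1)

R1 = [["Address"], ["<i>"]]
R2 = [["Address", ":", "<i>"]]
R4 = [["Cuisine", ":", "<i>", "_Capitalized_", "</i>", "<p>", "Address", ":", "<i>"]]
starts = [apply_rule(tokens, rule) for rule in (R1, R2, R4)]
```

On this page all three rules stop at the same token, the first token of the address, which is the sense in which many different rules can reach the same goal.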
Disjunctive Extraction Rules
Disjunctions are allowed in extraction rules to deal with variations in the format of the documents.
Example: addresses within one mile of the location are in bold (E4), while the others are in italics (E1, E2, E3).
  S1 = either SkipTo(Address: <b>) or SkipTo(Address) SkipTo(<i>)
Applying a disjunctive rule: the wrapper tries each disjunct in the list in order and uses the first one that matches.
Wildcards: _Capitalized_, _Number_, _AllCaps_, _HtmlTag_
On these examples S1 is equivalent to a single rule with a wildcard: S1 ≡ S2 = SkipTo(Address: _HtmlTag_)
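The ordered, first-match behaviour of a disjunctive rule can be sketched as follows. This is a minimal illustration that assumes pre-tokenized input; the function names and the shortened token lists standing in for E1 and E4 are hypothetical.

```python
def token_matches(pattern, token):
    # Only the _HtmlTag_ wildcard is needed for this example.
    if pattern == "_HtmlTag_":
        return token.startswith("<") and token.endswith(">")
    return pattern == token

def skip_to(tokens, landmark, start=0):
    n = len(landmark)
    for i in range(start, len(tokens) - n + 1):
        if all(token_matches(p, t) for p, t in zip(landmark, tokens[i:i + n])):
            return i + n
    return None

def apply_rule(tokens, rule):
    pos = 0
    for landmark in rule:
        pos = skip_to(tokens, landmark, pos)
        if pos is None:
            return None
    return pos

def apply_disjunctive(tokens, disjuncts):
    """Try each disjunct in order and return the result of the first that matches."""
    for rule in disjuncts:
        pos = apply_rule(tokens, rule)
        if pos is not None:
            return pos
    return None

# Shortened, hypothetical token lists: italic address (E1) and bold address (E4).
E1 = ["Address", ":", "<i>", "12", "Pico", "St", ".", "</i>"]
E4 = ["Address", ":", "<b>", "97", "Adams", "Blvd", ".", "</b>"]

S1 = [[["Address", ":", "<b>"]], [["Address"], ["<i>"]]]
S2 = [[["Address", ":", "_HtmlTag_"]]]

results_s1 = [apply_disjunctive(page, S1) for page in (E1, E4)]
results_s2 = [apply_disjunctive(page, S2) for page in (E1, E4)]
```

Both S1 and S2 locate the same start position on the italic and the bold example, which is the equivalence the slide states for this set of pages.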
Wrapper Creation Basics
Main issue: defining a set of extraction rules that specify precisely how to locate the information on the page.
Each item to be extracted from a page needs extraction rules that locate both its beginning and its end.
Each web page is treated as a sequence of tokens (words, numbers, HTML tags, etc.).
Extraction rule: finds the first and last tokens of an item.
Extraction rules are based on “landmarks” (groups of consecutive tokens) that enable the wrapper to locate the start and end of each item within the page.
The set of extraction rules has to work for ALL the pages in the source.
STALKER: Generating Extraction Rules
Step 1: Select an example to guide the search (say, E4).
Step 2: Generate a set of initial candidates (rules consisting of a single 1-token landmark).
  R5 = SkipTo( <b> )
  R6 = SkipTo( _HtmlTag_ )
Step 3: Select R6 for further refinement (R5 does not match the other examples, while R6 has better generalization potential).
Step 4: Create new candidates by refining R6.
  R7 = SkipTo( : _HtmlTag_ )
  R8 = SkipTo( _Punctuation_ _HtmlTag_ )
  R9 = SkipTo( : ) SkipTo( _HtmlTag_ )
  R10 = SkipTo( Address ) SkipTo( _HtmlTag_ )
  …
R7, R8: landmark refinement (a token is added to the landmark in R6)
R9, R10: topology refinement (a new landmark is added to R6)
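The two kinds of refinement in Step 4 can be sketched as below. This is a simplified illustration, not STALKER's actual search: it assumes the rule's last landmark matches the tokens immediately before the item, it only extends landmarks to the left, and the wildcard tests, function names, and the token prefix standing in for E4 are all hypothetical.

```python
# Assumed wildcard tests; only the names come from the slides.
WILDCARDS = {
    "_Punctuation_": lambda t: len(t) == 1 and not t.isalnum(),
    "_HtmlTag_": lambda t: t.startswith("<") and t.endswith(">"),
    "_Capitalized_": lambda t: t[:1].isupper(),
}

def generalizations(token):
    """The literal token plus every wildcard that matches it."""
    return [token] + [name for name, test in WILDCARDS.items() if test(token)]

def landmark_refinements(rule, prefix):
    """Lengthen the last landmark by one terminal on its left."""
    *head, last = rule
    k = len(prefix) - len(last) - 1  # token just before the landmark in the seed example
    if k < 0:
        return []
    return [head + [[g] + last] for g in generalizations(prefix[k])]

def topology_refinements(rule, prefix):
    """Add a new 1-terminal landmark in front of the last landmark."""
    *head, last = rule
    seen, refined = set(), []
    for token in prefix[:len(prefix) - len(last)]:
        for g in generalizations(token):
            if g not in seen:
                seen.add(g)
                refined.append(head + [[g], last])
    return refined

# Hypothetical tokens preceding the address in E4 (bold address).
prefix_e4 = ["Cuisine", ":", "<i>", "Pizza", "</i>", "<p>", "Address", ":", "<b>"]
R6 = [["_HtmlTag_"]]

landmark_candidates = landmark_refinements(R6, prefix_e4)
topology_candidates = topology_refinements(R6, prefix_e4)
```

Here landmark_candidates corresponds to R7 and R8, and topology_candidates contains R9 and R10 among other candidates; a real learner would then score each candidate against the labeled examples before refining further.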
STALKER: Efficiency
Rarely requires more than 10 examples; in most cases 2 are sufficient to generate extraction rules.
  Most pages in a source are based on a fixed template with few variations.
  Since STALKER tries to learn landmarks that are part of this template, a few examples suffice to identify reliable landmarks.
Exploits the hierarchical structure of the source to constrain the learning problem. For example:
  First, apply a rule to extract the whole list of restaurants.
  Then, use another rule to break the list into tuples corresponding to individual restaurants.
  Finally, extract the name, address, and phone number from each tuple.
Can extract data from pages containing complicated formatting layouts (e.g., lists embedded in other lists) that other approaches are unable to handle.
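The three-level restaurant example can be sketched with plain string landmarks. The page markup, the restaurant names and phone numbers, and the helper between are invented for illustration; a learned wrapper would use token-level rules rather than fixed strings.

```python
def between(text, start_landmark, end_landmark):
    """Text after the first start_landmark and before the next end_landmark, else None."""
    i = text.find(start_landmark)
    if i < 0:
        return None
    i += len(start_landmark)
    j = text.find(end_landmark, i)
    return text[i:j].strip() if j >= 0 else None

# Hypothetical source page.
PAGE = (
    "<html><h1>Restaurants</h1><ul>"
    "<li>Name: <b>Casa Uno</b> Address: <i>12 Pico St.</i> Phone: <i>555-0101</i></li>"
    "<li>Name: <b>Casa Dos</b> Address: <i>512 Oak Blvd.</i> Phone: <i>555-0102</i></li>"
    "</ul></html>"
)

def extract_restaurants(page):
    # Level 1: isolate the whole list.
    listing = between(page, "<ul>", "</ul>")
    # Level 2: break the list into one chunk per restaurant.
    chunks = [chunk for chunk in listing.split("<li>") if chunk]
    # Level 3: extract the fields, searching only inside each chunk.
    return [
        {
            "name": between(chunk, "Name: <b>", "</b>"),
            "address": between(chunk, "Address: <i>", "</i>"),
            "phone": between(chunk, "Phone: <i>", "</i>"),
        }
        for chunk in chunks
    ]

restaurants = extract_restaurants(PAGE)
```

Because each field rule only ever sees a single tuple, it cannot accidentally match a landmark belonging to a neighbouring restaurant, which is how the hierarchy constrains the learning problem.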
STALKER: Performance
In an empirical evaluation on 28 sources, STALKER had to learn 206 extraction rules.
It learned 182 perfect rules (100% accurate) and another 18 rules that had an accuracy of at least 90%.
In other words, only 6 of the 206 learned rules (about 3%) were less than 90% accurate.
Backward Rules
Forward rule: starts at the beginning of the document and moves towards the end.
Backward rule: starts at the end of the page and moves towards the beginning.
Backward rules for finding the beginning of an address:
  R11 = BackTo( Phone ) BackTo( _Number_ )
  R12 = BackTo( Phone: <i> ) BackTo( _Number_ )
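A backward rule can be sketched as a right-to-left scan. The sketch assumes that BackTo leaves the position at the left edge of the matched landmark, so that the final BackTo( _Number_ ) lands on the first token of the address; the helper names and the sample tokens (cuisine and phone values) are hypothetical.

```python
def matches(pattern, token):
    # Only the _Number_ wildcard is needed for R11 and R12.
    if pattern == "_Number_":
        return token.isdigit()
    return pattern == token

def back_to(tokens, landmark, end):
    """Scan tokens[:end] from right to left; return the index where landmark begins."""
    n = len(landmark)
    for i in range(end - n, -1, -1):
        if all(matches(p, t) for p, t in zip(landmark, tokens[i:i + n])):
            return i
    return None

def apply_backward(tokens, rule):
    end = len(tokens)
    for landmark in rule:
        end = back_to(tokens, landmark, end)
        if end is None:
            return None
    return end

# Hypothetical page in the style of E1, already split into tokens.
tokens = ("Cuisine : <i> Seafood </i> <p> Address : <i> 12 Pico St . </i> "
          "<p> Phone : <i> ( 800 ) 555 - 1515 </i>").split()

R11 = [["Phone"], ["_Number_"]]
R12 = [["Phone", ":", "<i>"], ["_Number_"]]
start_r11 = apply_backward(tokens, R11)
start_r12 = apply_backward(tokens, R12)
```

Both rules skip back past the phone number first, so the digits inside it are never mistaken for the street number.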
STALKER: Active Learning Approach / Co-testing
The system learns both a forward rule and a backward rule after the user labels one or two examples.
It then runs BOTH rules on a given set of unlabeled pages.
Whenever the rules disagree on an example, the system asks the user to label that example.
Because at least one of the two rules must be wrong on such an example, its label is a highly informative training example.
Thus, co-testing makes it possible to generate accurate extraction rules with a very small number of labeled examples.
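The query-selection step of co-testing can be sketched as follows. The two rule functions hard-code one forward rule, SkipTo(Address) SkipTo(<i>), and one backward rule, BackTo(Phone) BackTo(_Number_); the relearning step after each query is omitted, and the pages and function names are hypothetical.

```python
def forward_rule(tokens):
    """SkipTo(Address) SkipTo(<i>): index just past the matched <i>, else None."""
    try:
        i = tokens.index("Address")
        return tokens.index("<i>", i) + 1
    except ValueError:
        return None

def backward_rule(tokens):
    """BackTo(Phone) BackTo(_Number_): index of the nearest number before the last Phone."""
    if "Phone" not in tokens:
        return None
    end = len(tokens) - 1 - tokens[::-1].index("Phone")
    for i in range(end - 1, -1, -1):
        if tokens[i].isdigit():
            return i
    return None

def contention_points(pages, view_a, view_b):
    """Indices of the unlabeled pages on which the two rules disagree."""
    return [i for i, page in enumerate(pages) if view_a(page) != view_b(page)]

# Hypothetical unlabeled pages: one italic address, one bold address.
unlabeled = [
    ("Cuisine : <i> Seafood </i> <p> Address : <i> 12 Pico St . </i> "
     "<p> Phone : <i> 555 - 1515 </i>").split(),
    ("Cuisine : <i> Pizza </i> <p> Address : <b> 97 Adams Blvd . </b> "
     "<p> Phone : <i> 555 - 0197 </i>").split(),
]

queries = contention_points(unlabeled, forward_rule, backward_rule)
```

Only the bold-address page is selected: the forward rule overshoots to the <i> after Phone, while the backward rule still lands on the street number, so that page is the one worth asking the user to label.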
Co-testing Performance
Co-testing was applied to the 24 tasks on which STALKER failed to learn perfect rules from 10 random examples.
To keep the comparison fair, co-testing started with one random example and made up to 9 queries.
The results were excellent: the average accuracy over all tasks improved from 85.7% to 94.2% (error rate reduced by 59.5%).
Furthermore, 10 of the learned rules were 100% accurate, while another 11 rules were at least 90% accurate.
In these experiments, as well as in other related tests, applying co-testing led to a significant improvement in accuracy without having to label more training data.
Wrapper Verification
DataPro Algorithm
Data prototype: the starting and ending patterns of a field, taken together.
For example, a set of street addresses (12 Pico St., 512 Oak Blvd., 416 Main St., and 97 Adams Blvd.) all start with the pattern (_Number_ _Capitalized_) and end with (Blvd.) or (St.).
The DataPro algorithm learns the significant patterns for each field.
The learning algorithm finds the patterns that describe the common beginnings and endings of each field of the training examples.
In the verification phase, the wrapper generates a test set of examples from pages retrieved using the same or a similar set of queries.
If the patterns describe statistically the same proportion of the test examples as of the training examples (at a given significance level), the wrapper is judged to be extracting correctly; otherwise, it is judged to have failed.
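The verification check can be sketched as a comparison of pattern coverage on training and test examples. Everything here is a stand-in: the regular expressions approximate the start and end patterns, the two-proportion z-test and the 1.96 threshold are assumed choices rather than the test DataPro actually uses, and the test sets are invented. With so few examples the statistics are only illustrative.

```python
import math
import re

# Assumed regex stand-ins for the slide's patterns.
START = re.compile(r"^\d+ [A-Z][a-z]+")   # (_Number_ _Capitalized_)
END = re.compile(r"(Blvd\.|St\.)$")       # (Blvd.) or (St.)

def covered(examples):
    """Number of examples described by both the start and the end pattern."""
    return sum(1 for e in examples if START.search(e) and END.search(e))

def wrapper_ok(train, test, z_crit=1.96):
    """True if pattern coverage on test is statistically the same as on train."""
    n1, n2 = len(train), len(test)
    k1, k2 = covered(train), covered(test)
    pooled = (k1 + k2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if se == 0:
        # Coverage is 0% or 100% in both sets, so the proportions are identical.
        return True
    z = (k1 / n1 - k2 / n2) / se
    return abs(z) <= z_crit

train = ["12 Pico St.", "512 Oak Blvd.", "416 Main St.", "97 Adams Blvd."]
still_working = ["7 Elm St.", "88 Pine Blvd."]                     # hypothetical extractions
after_change = ["Open daily", "Map", "Directions", "Phone: 555"]   # hypothetical garbage

ok_before = wrapper_ok(train, still_working)
ok_after = wrapper_ok(train, after_change)
```

When the site changes and the wrapper starts returning unrelated text, coverage drops from all training examples to none of the test examples, and the check reports a failure.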
DataPro: Performance
The algorithm has a high rate of false positives.
27 wrappers (representing 23 distinct Web sources) were monitored over a period of several months.
For each wrapper, the results of 15-30 queries were stored periodically.
All new results were compared with the last correct wrapper output (the training examples).
A manual check of the results revealed 37 wrapper changes out of the total of 443 comparisons.
The verification algorithm correctly discovered 35 of these changes.
The algorithm incorrectly decided that the wrapper had changed in 40 cases.
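The counts above can be turned into the usual detection metrics. The arithmetic below uses only the numbers on the slide; the metric names are standard, and treating every comparison without a real change as a negative is an assumption.

```python
comparisons = 443    # total comparisons
real_changes = 37    # wrapper changes found by the manual check
detected = 35        # changes the verifier discovered
false_alarms = 40    # comparisons wrongly flagged as changes

# Share of real changes that were caught.
recall = detected / real_changes
# Share of raised alarms that were real changes.
precision = detected / (detected + false_alarms)
# Share of unchanged comparisons that were wrongly flagged.
false_alarm_rate = false_alarms / (comparisons - real_changes)

summary = {
    "recall": round(recall, 3),
    "precision": round(precision, 3),
    "false_alarm_rate": round(false_alarm_rate, 3),
}
```

Recall comes out at roughly 95%, but fewer than half of the alarms correspond to a real change, which is what the slide means by a high rate of false positives.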
Automatically Repairing Wrappers
Reinduction Example
Wrapper reinduction algorithm: updates the extraction rules on the premise that the formatting, rather than the content, has changed.
The algorithm learns starting and ending patterns for the address:
  Start pattern: p1 = (_Number_ _Capitalized_)
  End pattern: p2 = (Blvd.) OR (St.)
Suppose the web site changes the word “Address” to “Location”.
New start & end patterns: p3 = NIL
The algorithm finds the text segments that have start pattern p1 and those that have end pattern p2.
All segments of approximately the same length as the addresses identified in the training set are retained, while the others are eliminated.
Segments with a similar pattern have a lot in common (size, location, etc.), so they end up in the same cluster.
Each cluster is scored based on its similarity to the training examples.
The highest-ranked cluster is identified as the “Address” field, and the extraction rules are updated.
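The candidate-finding and length-filtering steps can be sketched as below. Clustering and scoring are omitted, and reading the three tokens before the recovered address as the new landmark is an illustrative shortcut, not the actual rule-update procedure. The tokenizer, function names, changed page, and the slack parameter are assumptions.

```python
import re

# Assumed tokenization: HTML tags, runs of word characters, single punctuation marks.
TOKEN_RE = re.compile(r"</?\w+[^>]*>|\w+|[^\w\s]")

def starts_here(tokens, i):
    """Start pattern p1 = (_Number_ _Capitalized_) begins at index i."""
    return i + 1 < len(tokens) and tokens[i].isdigit() and tokens[i + 1][:1].isupper()

def ends_here(tokens, j):
    """End pattern p2 = (Blvd.) or (St.) ends at index j, the period token."""
    return j >= 1 and tokens[j] == "." and tokens[j - 1] in ("Blvd", "St")

def candidate_spans(tokens, train_lengths, slack=1):
    """Spans (start, end) matching p1 and p2 whose length is close to the training lengths."""
    lo, hi = min(train_lengths) - slack, max(train_lengths) + slack
    spans = []
    for i in range(len(tokens)):
        if not starts_here(tokens, i):
            continue
        # j ranges over end positions that keep the span at most hi tokens long.
        for j in range(i + 1, min(i + hi, len(tokens))):
            if ends_here(tokens, j) and j - i + 1 >= lo:
                spans.append((i, j + 1))
    return spans

# Hypothetical page after the site renamed "Address" to "Location".
CHANGED = "Cuisine: <i>Thai</i><p>Location: <i>512 Oak Blvd.</i><p>Phone: <i>555-0102</i>"
tokens = TOKEN_RE.findall(CHANGED)

# Each training address (e.g. "12 Pico St.") is 4 tokens long under this tokenizer.
spans = candidate_spans(tokens, train_lengths=[4, 4, 4, 4])
start, end = spans[0]
address = " ".join(tokens[start:end])
new_landmark = tokens[start - 3:start]
```

The phone number is rejected because it is not followed by a capitalized word, so the only surviving candidate is the address, and the tokens in front of it show the new landmark that a relearned start rule would need.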
Actual Example of Change to Amazon’s Site
Reinduction Performance
The extraction algorithm was applied to 21 distinct Web sources, attempting to extract 77 data fields from all the sources.
In 62 cases the top-ranked cluster contained correct, complete instances of the data field.
In eight cases the correct cluster was ranked lower, while in six cases no candidates were identified on the pages.
Wrapper: Lifecycle
Summary
STALKER algorithm:
  Types of extraction rules: forward and backward
  An extraction rule comprises a start rule and an end rule
  Rules contain landmarks (groups of consecutive tokens)
  Many different rules can reach the same goal
  Extraction rules allow disjunction (think of it as the Boolean OR operator)
  Types of rule refinement: landmark and topology
Co-testing uses both forward and backward rules to avoid mistakes and asks for user input only on essential examples.
DataPro learns patterns; if the patterns do not describe statistically the same proportion of the pages retrieved from the web as of the training examples, the wrapper is judged to have failed.
Data prototype: the starting and ending patterns of a field, taken together.
Wrapper reinduction algorithm: updates the extraction rules on the premise that the formatting, rather than the content, has changed.
Discussion
Limitations?
Differences from the approach outlined in the previous paper (Information Extraction: Distilling Structured Data from Unstructured Text by Andrew McCallum)?
Discussion
Limitations: does not work for complex pages containing tables and complex lists.
Differences from the approach in the previous paper: follows a hierarchical approach. Hence, instead of “Segmentation -> Classification -> Association -> Normalization -> Deduplication”, it performs “Association -> Classification -> Segmentation”.
Thank you!