Kriti Chauhan CSE6339 Spring 2009


Accurately and Reliably Extracting Data from the Web: A Machine Learning Approach
Kriti Chauhan, CSE6339 Spring 2009

Introduction
Main issue: extracting information from the massive amount of data on the web. We can search and rank web pages, but fielded searches, range-based or join-based structured queries, data mining, and decision support typically require detailed, fine-grained processing.
Solution: extract data from web sites and transform it into a structured format such as XML.
How to extract the data: wrappers.

Wrappers
Definition: a wrapper is a piece of software that enables a semi-structured web source to be queried as if it were a database.
Wrappers exploit the implicit underlying structure of the source. Since each website has a different layout and structure, each website needs its own wrapper, customized for it.
Wrappers form the logical components of a virtual data integration system.
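As a rough illustration of the interface a wrapper exposes (this is not code from the paper; the class and field names are hypothetical), a wrapper turns pages from one specific site into structured records that a mediator can query like a database table:

```python
# Minimal sketch of the wrapper interface; all names are illustrative.

class RestaurantWrapper:
    """Wrapper for one specific (hypothetical) restaurant-listing site."""

    fields = ("name", "address", "phone")

    def extract(self, html: str) -> list[dict]:
        """Apply the site's learned extraction rules to one page and
        return structured records. The rule application is elided here;
        the later sketches show how such rules work."""
        records: list[dict] = []
        # ... locate each field with its start/end rules ...
        return records

# A data integration system can then run structured queries over the
# site, e.g. select the addresses of restaurants whose cuisine is Thai.
```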

STALKER: Hierarchical Wrapper Induction Algorithm
Salient features:
Learns highly accurate extraction rules.
Verifies the wrapper to ensure that correct data continues to be extracted.
Automatically adapts to changes in the sites from which data is being extracted.

Building A Wrapper

Example
Consider example documents E1, E2 and E3.
Start rule for Address: R = SkipTo(</i><p>Address:<i>)

Identifying Extraction Rules: Key Idea
Start rule for finding the start of "Address" (considering E1, E2, E3):
R1 = SkipTo(Address) SkipTo(<i>)
Other possible rules:
R2 = SkipTo(Address: <i>)
R3 = SkipTo(Cuisine: <i>) SkipTo(Address: <i>)
R4 = SkipTo(Cuisine: <i> _Capitalized_ </i><p> Address: <i>)
R2 uses a 3-token landmark; R3 uses two 3-token landmarks; R4 uses a 9-token landmark that includes a wildcard.
Wildcards stand for token classes: _Capitalized_, _Number_, _AllCaps_, _HtmlTag_.
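To make the SkipTo semantics concrete, here is a minimal sketch of how such rules might be evaluated over a tokenized page. The tokenization and the rule encoding (a rule is a list of landmarks, each a list of literal tokens or wildcards) are my assumptions, not the paper's actual representation:

```python
# Sketch of SkipTo-rule evaluation over a token list; the encoding
# (rule = list of landmarks, landmark = list of tokens or wildcards)
# is an illustrative assumption.

def matches(token: str, pattern: str) -> bool:
    """A landmark element is a literal token or a wildcard class."""
    if pattern == "_Capitalized_":
        return token[:1].isupper()
    if pattern == "_Number_":
        return token.isdigit()
    if pattern == "_AllCaps_":
        return token.isalpha() and token.isupper()
    if pattern == "_HtmlTag_":
        return token.startswith("<") and token.endswith(">")
    return token == pattern

def skip_to(tokens, start, landmark):
    """Index just past the first occurrence of landmark at or after
    position start, or None if the landmark never occurs."""
    n = len(landmark)
    for i in range(start, len(tokens) - n + 1):
        if all(matches(tokens[i + j], landmark[j]) for j in range(n)):
            return i + n
    return None

def apply_rule(tokens, rule):
    """Consume the rule's landmarks left to right, as in
    R1 = SkipTo(Address) SkipTo(<i>)."""
    pos = 0
    for landmark in rule:
        pos = skip_to(tokens, pos, landmark)
        if pos is None:
            return None
    return pos  # position where the field starts

# R1 applied to a page fragment tokenized into words, punctuation, tags:
page = ["<p>", "Cuisine", ":", "<i>", "Thai", "</i>", "<p>",
        "Address", ":", "<i>", "512", "Pico", "<b>", "Venice", "</b>", "</i>"]
R1 = [["Address"], ["<i>"]]
print(apply_rule(page, R1))  # -> 10, where the address content begins
```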

Disjunctive Extraction Rules
Disjunctions are allowed in extraction rules to deal with variations in the format of the documents. Example: addresses within one mile of the user's location are shown in bold (E4), while the others are in italics (E1, E2, E3).
S1 = either SkipTo(Address: <b>) or SkipTo(Address) SkipTo(<i>)
Applying a disjunctive rule: the wrapper applies each disjunct in the list in turn until it finds the first one that matches.
Using the _HtmlTag_ wildcard, the same behavior can also be expressed as a single rule: S1 ≡ S2 = SkipTo(Address: _HtmlTag_)
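Continuing the sketch above (it reuses apply_rule and the tokenized page), a disjunctive rule is simply an ordered list of candidate rules tried until one matches:

```python
# Continues the earlier sketch: reuses apply_rule() and page.
# A disjunctive rule is an ordered list of alternative rules.

def apply_disjunctive_rule(tokens, disjuncts):
    """Try each disjunct in order; the first match wins."""
    for rule in disjuncts:
        pos = apply_rule(tokens, rule)
        if pos is not None:
            return pos
    return None

# S1 = either SkipTo(Address : <b>) or SkipTo(Address) SkipTo(<i>)
S1 = [
    [["Address", ":", "<b>"]],   # bold variant (E4)
    [["Address"], ["<i>"]],      # italic variant (E1, E2, E3)
]
print(apply_disjunctive_rule(page, S1))  # -> 10 on the italic page above
```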

Wrapper Creation Basics
Main issue: defining a set of extraction rules that precisely specify how to locate the information on the page.
For each item to be extracted, we need rules that locate both its beginning and its end. A web document is a sequence of tokens (words, numbers, HTML tags, etc.), so extraction amounts to finding the first and last tokens of an item.
Extraction rules are based on "landmarks" (groups of consecutive tokens) that enable the wrapper to locate the start and end of each item within the page. The set of extraction rules has to work for ALL the pages in the source.
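Here is a sketch of how a start rule and an end rule pair up to delimit one item, again reusing apply_rule and page from the earlier sketch. The convention that the end rule scans forward from the field start is a simplification of mine:

```python
# Continues the earlier sketch. An item is delimited by a start rule
# and an end rule; the forward-scanning end rule is a simplification.

def extract_field(tokens, start_rule, end_rule):
    start = apply_rule(tokens, start_rule)
    if start is None:
        return None
    end = apply_rule(tokens[start:], end_rule)
    if end is None:
        return None
    # apply_rule returns the index just past the end landmark;
    # back up by the landmark length to keep only the content.
    end = start + end - len(end_rule[-1])
    return " ".join(tokens[start:end])

start_rule = [["Address"], ["<i>"]]  # start: after the <i> following "Address"
end_rule = [["</i>"]]                # end: at the closing </i>
print(extract_field(page, start_rule, end_rule))
# -> "512 Pico <b> Venice </b>"
```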

STALKER: Generating Extraction Rules
Step 1: Select an example to guide the search (say, E4).
Step 2: Generate the set of initial candidates, i.e. rules consisting of a single 1-token landmark:
R5 = SkipTo( <b> )
R6 = SkipTo( _HtmlTag_ )
Step 3: Select R6 for further refinement (R5 does not match the other examples, while R6 has better generalization potential).
Step 4: Create new candidates by refining R6:
R7 = SkipTo( : _HtmlTag_ )
R8 = SkipTo( _Punctuation_ _HtmlTag_ )
R9 = SkipTo( : ) SkipTo( _HtmlTag_ )
R10 = SkipTo( Address ) SkipTo( _HtmlTag_ )
…
R7, R8: landmark refinement (a token is added to the landmark in R6).
R9, R10: topology refinement (a new landmark is added to R6).
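A compact sketch of the two refinement moves, using the rule encoding from the earlier sketches; the exhaustive enumeration below is a deliberate simplification of STALKER's actual candidate search:

```python
# Sketch of STALKER's refinement moves on a candidate rule. A rule is
# a list of landmarks (lists of tokens/wildcards) as in the sketches
# above; the brute-force enumeration is a simplification.

def landmark_refinements(rule, vocab):
    """Grow the last landmark by one token on the left,
    e.g. SkipTo(_HtmlTag_) -> SkipTo(: _HtmlTag_)."""
    for tok in vocab:
        refined = [list(lm) for lm in rule]
        refined[-1] = [tok] + refined[-1]
        yield refined

def topology_refinements(rule, vocab):
    """Prepend a new 1-token landmark,
    e.g. SkipTo(_HtmlTag_) -> SkipTo(:) SkipTo(_HtmlTag_)."""
    for tok in vocab:
        yield [[tok]] + [list(lm) for lm in rule]

R6 = [["_HtmlTag_"]]
vocab = [":", "Address", "_Punctuation_"]
for r in landmark_refinements(R6, vocab):
    print("landmark refinement:", r)   # R7/R8-style candidates
for r in topology_refinements(R6, vocab):
    print("topology refinement:", r)   # R9/R10-style candidates
```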

STALKER: Efficiency
Rarely requires more than 10 examples; in most cases 2 are sufficient to generate the extraction rules. Most pages in a source are based on a fixed template with few variations, and since STALKER tries to learn landmarks that are part of this template, a few examples suffice to find reliable landmarks.
Exploits the hierarchical structure of the source to constrain the learning problem (see the sketch below). For example:
First apply a rule to extract the whole list of restaurants.
Then use another rule to break the list into tuples corresponding to individual restaurants.
Finally, extract the name, address and phone number from each tuple.
This lets STALKER extract data from pages with complicated formatting layouts (e.g., lists embedded in other lists) that other approaches are unable to handle.
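A sketch of the hierarchical idea: one rule isolates the list region, a second splits it into tuples, and field rules run inside each tuple. The regular expressions below stand in for STALKER's learned landmark rules, and the page content is invented for illustration:

```python
# Hierarchical extraction sketch; regexes stand in for learned rules
# and the page content is invented.
import re

page = """<h1>Restaurants</h1><ul>
<li><b>Killer Shrimp</b> 512 Pico Blvd. (213) 555-0134</li>
<li><b>Thai House</b> 97 Adams Blvd. (213) 555-0172</li>
</ul><p>footer</p>"""

# Level 1: isolate the list region.
region = re.search(r"<ul>(.*?)</ul>", page, re.S).group(1)

# Level 2: split the region into one tuple per restaurant.
tuples = re.findall(r"<li>(.*?)</li>", region, re.S)

# Level 3: extract the fields from each tuple independently.
for t in tuples:
    name = re.search(r"<b>(.*?)</b>", t).group(1)
    addr, phone = re.search(r"</b>\s*(.*?)\s*(\(\d{3}\) \d{3}-\d{4})", t).groups()
    print(name, "|", addr, "|", phone)
# Killer Shrimp | 512 Pico Blvd. | (213) 555-0134
# Thai House | 97 Adams Blvd. | (213) 555-0172
```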

STALKER: Performance
In an empirical evaluation on 28 sources, STALKER had to learn 206 extraction rules. It learned 182 perfect rules (100% accurate) and another 18 rules with an accuracy of at least 90%. In other words, only 3% of the learned rules were less than 90% accurate.

Backward Rules
Forward rule: starts at the beginning of the document and moves towards the end.
Backward rule: starts at the end of the page and moves towards its beginning.
Backward rules to find the beginning of addresses:
R11 = BackTo( Phone ) BackTo( _Number_ )
R12 = BackTo( Phone: <i> ) BackTo( _Number_ )
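A self-contained sketch of a BackTo landmark, scanning right to left; the encoding mirrors the earlier SkipTo sketch and is an assumption for illustration:

```python
# Sketch of backward (BackTo) rules; scans the token list right to left.

def tok_match(token: str, pattern: str) -> bool:
    if pattern == "_Number_":
        return token.isdigit()
    return token == pattern

def back_to(tokens, start, landmark):
    """Index of the first token of the last occurrence of landmark
    strictly before position start, or None."""
    n = len(landmark)
    for i in range(start - n, -1, -1):
        if all(tok_match(tokens[i + j], landmark[j]) for j in range(n)):
            return i
    return None

def apply_backward_rule(tokens, rule):
    pos = len(tokens)
    for landmark in rule:
        pos = back_to(tokens, pos, landmark)
        if pos is None:
            return None
    return pos

# R11 = BackTo( Phone ) BackTo( _Number_ ) on a fragment ending in a phone:
tail = ["Address", ":", "<i>", "512", "Pico", "</i>",
        "Phone", ":", "<i>", "5550134", "</i>"]
R11 = [["Phone"], ["_Number_"]]
print(apply_backward_rule(tail, R11))  # -> 3, start of the street number
```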

STALKER: Active Learning Approach / Co-testing
The system learns both a forward and a backward rule after the user labels one or two examples. Then it runs BOTH rules on the given set of unlabeled pages. Whenever the rules disagree on an example, the system asks the user to label that example.
Because the two rules disagree on it, that particular example is a highly informative training example. Co-testing thus makes it possible to generate accurate extraction rules with a very small number of labeled examples.
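A sketch of the co-testing loop; the learner and user-interaction callables (learn_fwd, learn_bwd, ask_user) are placeholders of mine, not an API from the paper:

```python
# Co-testing sketch: the forward and backward rules vote on each
# unlabeled page, and only disagreements are sent to the user.

def cotest(labeled, unlabeled, learn_fwd, learn_bwd, ask_user, budget=9):
    """labeled: list of (page, label); unlabeled: list of pages.
    learn_* return a rule, i.e. a callable page -> extracted position."""
    for _ in range(budget):
        fwd = learn_fwd(labeled)               # forward rule
        bwd = learn_bwd(labeled)               # backward rule
        disputed = [p for p in unlabeled if fwd(p) != bwd(p)]
        if not disputed:
            break                              # rules agree everywhere
        page = disputed[0]                     # highly informative query
        labeled.append((page, ask_user(page))) # user labels just this one
        unlabeled.remove(page)
    return learn_fwd(labeled), learn_bwd(labeled)
```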

Co-testing Performance
Co-testing was applied to the 24 tasks on which STALKER fails to learn perfect rules from 10 random examples. To keep the comparison fair, co-testing started with one random example and made up to 9 queries. The results were excellent: the average accuracy over all tasks improved from 85.7% to 94.2% (the error rate was reduced by 59.5%). Furthermore, 10 of the learned rules were 100% accurate, while another 11 rules were at least 90% accurate. In these experiments, as well as in other related tests, applying co-testing led to a significant improvement in accuracy without having to label more training data.

Wrapper Verification

DataPro Algorithm
Data prototype: the starting and ending patterns of a field, taken together. For example, the street addresses 12 Pico St., 512 Oak Blvd., 416 Main St. and 97 Adams Blvd. all start with the pattern (_Number_ _Capitalized_) and end with (Blvd.) or (St.).
The DataPro algorithm learns the significant patterns for each field: it finds the patterns that describe the common beginnings and endings of the field across the training examples.
In the verification phase, the wrapper generates a test set of examples from pages retrieved using the same or a similar set of queries. If the patterns describe statistically the same proportion of the test examples as of the training examples (at a given significance level), the wrapper is judged to be extracting correctly; otherwise, it is judged to have failed.
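A sketch of the verification idea: measure how many training vs. test extractions fit a learned pattern and compare the two proportions. The pattern matcher and the two-proportion z-test below are simplified stand-ins for DataPro's actual statistics:

```python
# Verification sketch: compare the fraction of training vs. test
# extractions that fit the learned starting pattern; the z-test is a
# stand-in for DataPro's statistics, purely for illustration.
import math

def fits_start_pattern(value: str) -> bool:
    """Learned prototype for addresses: (_Number_ _Capitalized_)."""
    toks = value.split()
    return len(toks) >= 2 and toks[0].isdigit() and toks[1][:1].isupper()

def proportions_differ(train, test, z_crit=1.96):
    p1 = sum(map(fits_start_pattern, train)) / len(train)
    p2 = sum(map(fits_start_pattern, test)) / len(test)
    p = (p1 * len(train) + p2 * len(test)) / (len(train) + len(test))
    se = math.sqrt(p * (1 - p) * (1 / len(train) + 1 / len(test)))
    return se > 0 and abs(p1 - p2) / se > z_crit

train = ["12 Pico St.", "512 Oak Blvd.", "416 Main St.", "97 Adams Blvd."]
test = ["Click here", "Sign in", "Home", "About us"]  # page layout changed
print(proportions_differ(train, test))  # -> True: wrapper judged broken
```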

DataPro: Performance
The algorithm has a high rate of false positives. 27 wrappers (representing 23 distinct web sources) were monitored over a period of several months. For each wrapper, the results of 15-30 queries were stored periodically, and all new results were compared with the last correct wrapper output (the training examples). A manual check of the results revealed 37 wrapper changes out of the 443 comparisons. The verification algorithm correctly discovered 35 of these changes, but it also incorrectly decided that the wrapper had changed in 40 cases.

Automatically Repairing Wrappers

Reinduction Example
Wrapper reinduction algorithm: updates the extraction rules based on the premise that the formatting, rather than the content, has changed.
The algorithm learns starting and ending patterns for the address field:
Start pattern: p1 = (_Number_ _Capitalized_)
End pattern: p2 = (Blvd.) OR (St.)
Let's say the web site changes the word "Address" to "Location", so the old rule now extracts nothing: p3 = NIL.
The algorithm then finds text segments that begin with start pattern p1 and end with end pattern p2. Segments of approximately the same length as the addresses identified in the training set are retained, while the others are eliminated. Segments with similar patterns have a lot in common (size, location, etc.), so they end up in the same cluster. Each cluster is scored by its similarity to the training examples; the highest-ranked cluster is identified as the "Address" field, and the extraction rules are updated accordingly.
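A sketch of the candidate-finding step: scan the changed page for segments that fit the learned prototype and keep only those of plausible length. The clustering and scoring that follow are elided, and the tokenizer and page content are invented:

```python
# Reinduction sketch: find segments matching the learned prototype
# (start p1 = _Number_ _Capitalized_, end p2 = Blvd. or St.) on the
# changed page; clustering/scoring of the survivors is elided.

def candidate_segments(tokens, max_len=5):
    """Segments that start with (_Number_ _Capitalized_), end with
    'Blvd.' or 'St.', and stay within a plausible length."""
    found = []
    for i, tok in enumerate(tokens):
        if tok.isdigit() and i + 1 < len(tokens) and tokens[i + 1][:1].isupper():
            for j in range(i + 1, min(i + max_len, len(tokens))):
                if tokens[j] in ("Blvd.", "St."):
                    found.append(" ".join(tokens[i:j + 1]))
    return found

page = ("Location : 512 Oak Blvd. Phone : 5550134 "
        "Directions : 2 Blocks from Pico St.").split()
print(candidate_segments(page))
# -> ['512 Oak Blvd.', '2 Blocks from Pico St.']
# Clustering groups similar candidates; scoring against the training
# examples would rank the true address ('512 Oak Blvd.') highest.
```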

Actual Example of Change to Amazon's Site

Reinduction Performance
The extraction algorithm was applied to 21 distinct web sources, attempting to extract 77 data fields from all the sources. In 62 cases the top-ranked cluster contained correct, complete instances of the data field. In eight cases the correct cluster was ranked lower, while in six cases no candidates were identified on the pages.

Wrapper: Lifecycle

Summary
STALKER algorithm:
Types of extraction rules: forward and backward.
An extraction rule comprises a start rule and an end rule.
Rules contain landmarks (groups of consecutive tokens).
Many different rules can serve the same goal.
Extraction rules allow disjunction (think of it as the Boolean union operator).
Types of rule refinement: landmark and topology.
Co-testing uses both forward and backward rules to avoid mistakes, and asks the user to label only the essential examples.
DataPro learns data prototypes (the starting and ending patterns of a field, taken together); if the patterns from a set of pages retrieved from the web are not statistically similar to the training examples, the wrapper is judged to have failed.
Wrapper reinduction updates the extraction rules based on the premise that the formatting, rather than the content, has changed.

Discussion
Limitations?
Differences from the approach outlined in the previous paper (Information Extraction: Distilling Structured Data from Unstructured Text, by Andrew McCallum)?

Discussion
Limitations: does not work for complex pages containing tables and complex lists.
Differences from the approach in the previous paper: this work follows a hierarchical approach. Hence, instead of "Segmentation -> Classification -> Association -> Normalization -> Deduplication", it performs "Association -> Classification -> Segmentation".

Thank you!