Robust Web Extraction / Domain-Centric Extraction

Presentation transcript:

Robust Web Extraction / Domain-Centric Extraction. Philip Bohannon, 5.19.2010 [SIGMOD 2009]. Automatic Knowledge Base Construction Workshop, Grenoble, France, May 17-20, 2010.


Robust Web Extraction: A Principled Approach. Philip Bohannon, 5.19.2010. Joint work with Nilesh Dalvi and Fei Sha. Published at SIGMOD 2009.

We can use the following XPath wrapper to extract directors from a page like the simplified DOM tree on the slide (html > body > div class='head', div class='content'; the content div holds a table whose cells render "Title : Godfather", "Director : Coppola", "Runtime 118min"):
w1 = /html/body/div[2]/table/td[2]/text()
Finding a wrapper based on a few labeled examples is the well-studied wrapper induction problem.
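A minimal sketch of how such a wrapper evaluates, assuming lxml and a hypothetical reconstruction of the slide's simplified page (the exact markup is not given in the transcript):

```python
# Hypothetical reconstruction of the slide's simplified page; real movie pages
# are much larger, and this markup is only an illustration.
from lxml import etree

page = etree.fromstring(
    "<html><body>"
    "<div class='head'>"
    "<table width='60%'><td>Title :</td><td>Godfather</td></table>"
    "</div>"
    "<div class='content'>"
    "<table width='80%'>"
    "<td>Director :</td><td>Coppola</td><td>Runtime</td><td>118min</td>"
    "</table>"
    "</div>"
    "</body></html>")

# The wrapper from the slide: second div under body, its table, second cell.
w1 = "/html/body/div[2]/table/td[2]/text()"
print(page.xpath(w1))   # ['Coppola']
```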

Background: Wrapper Induction. [Figure: a wrapper generator takes labeled data and produces a wrapper (e.g. /html/div[2]/, /html//table) that is then applied to test data.] Goal: based on a few labeled examples, maximize precision and recall on the test data. Challenge: the test data is structurally similar to the labeled data, but not the same.

Problem: Wrappers break! [Figure: the same DOM tree after a site change; a new div and an ad-content table have been inserted, and the page now renders "Title : Godfather 1972", "Director : Coppola", "Runtime 118min".] To monitor a site over time, wrappers are run repeatedly: each time a page changes, or whenever a new page appears. Problem: wrappers break!

Background: Wrapper Repair. [Figure: a wrapper-repair module takes the old labeled data, the old wrapper (e.g. /html/div[2]/, /html//table) and new test data (which overlaps the old) and produces a new wrapper (e.g. /html/div[3]/, /html//table).] Goal: based on the old labeled data, the old wrapper and the old results, produce a new wrapper [Lerman 2003] [Meng 2003], etc. Problems with repair: some breakage is hard to detect, especially for values that change frequently (e.g. price); and wrapper repair is offline, so breakage impacts production until the repair is effected.

Which one should we choose? [Figure: the original DOM tree again.] Several alternative wrappers appear less likely than w1 to fail when the page changes:
w2 = //div[class='content']/table/td[2]/text()
w3 = //table[width=80%]/td[2]/text()
w4 = //td[preceding-sibling/text() = "Director"]/text()
All these wrappers are "indistinguishable" by state-of-the-art induction methods, yet they differ in how likely they are to survive page changes. Which one should we choose? (A comparison sketch follows below.)
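A hedged comparison sketch on the same hypothetical markup as before. The slide's XPaths are shorthand, so the attribute tests are written with '@' and the sibling test with an explicit axis to make them valid; the simulated page change (a banner div inserted before the content div) is one plausible change, not the slide's exact one:

```python
from lxml import etree

page = etree.fromstring(
    "<html><body>"
    "<div class='head'><table width='60%'><td>Title :</td><td>Godfather</td></table></div>"
    "<div class='content'><table width='80%'>"
    "<td>Director :</td><td>Coppola</td><td>Runtime</td><td>118min</td>"
    "</table></div>"
    "</body></html>")

wrappers = {
    "w1": "/html/body/div[2]/table/td[2]/text()",
    "w2": "//div[@class='content']/table/td[2]/text()",
    "w3": "//table[@width='80%']/td[2]/text()",
    # valid-syntax rendering of the slide's w4: the cell whose nearest
    # preceding sibling cell reads 'Director :'
    "w4": "//td[preceding-sibling::td[1]='Director :']/text()",
}
print({name: page.xpath(xp) for name, xp in wrappers.items()})
# today, all four return ['Coppola']

# Simulate a page change: a banner div is inserted before the content div,
# shifting positional paths.
page.find("body").insert(0, etree.fromstring("<div class='banner'>ad content</div>"))
print({name: page.xpath(xp) for name, xp in wrappers.items()})
# w1 now silently returns ['Godfather'] (the wrong field); w2-w4 still return ['Coppola']
```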

Preview: Robust Wrappers for IMDB. Robustness ranking of candidate wrappers to extract director names from IMDB pages. This is the first technique to define the problem and provide such scores; the remainder of the talk shows how we produce these rankings.
  0.400  //*[preceding-sibling = 'Director']/text()
  0.370  //*[h5/text() = 'Director']/a/text()
  …
  0.190  //*[@id='tn15main']/*/div[3]/a/text()
  0.080  //div[4]/*/div[3]/a/text()

Overview of Approach. [Diagram of the four steps:]
1. Generate candidate wrappers, e.g. w2 = //div[class='content']/table/td[2]/text(), w3 = //table[width=80%]/td[2]/text(), w4 = //td[preceding-sibling/text() = "Director"]/text().
2. Define a model of page evolution (how today's training data may look in the future).
3. Learn the model parameters from archival snapshot pairs (S1,T1), (S2,T2), (S3,T3), e.g. Psub(tr,div)=.02, Psub(br,p)=.1, Pdel(div)=0.01, Pins(a)=0.12.
4. Use the learned model to rank the candidates, e.g. Robustness(w2) = 0.13, Robustness(w3) = 0.07, Robustness(w4) = 0.21.

Step 1: Generating Candidate Wrappers. Past work on XML wrapper induction generates one maximal wrapper [Tobias 2005]; we need a variety of wrappers, not a single one. Past work on robust XML wrappers has focused on evaluating hand-generated wrappers [Myllymaki 02] [Abe 03]. Contributions: we define a class of minimal XPath wrappers and give an algorithm for efficiently enumerating them (a toy enumeration sketch follows below). Example candidates: w2 = //div[class='content']/table/td[2]/text(), w3 = //table[width=80%]/td[2]/text(), w4 = //td[preceding-sibling/text() = "Director"]/text().
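A toy enumeration sketch, not the paper's algorithm, to make the idea concrete: starting from the target node, walk up the tree and emit the absolute positional path plus attribute-anchored relative paths (the actual contribution is a precise class of minimal wrappers and an efficient enumeration over it):

```python
from lxml import etree

def candidate_wrappers(target):
    """Enumerate a few candidate XPath wrappers for an lxml element `target`."""
    cands = {target.getroottree().getpath(target) + "/text()"}  # absolute positional path
    steps, node = "", target
    while node.getparent() is not None:
        parent = node.getparent()
        same_tag = [s for s in parent if s.tag == node.tag]
        steps = f"/{node.tag}[{same_tag.index(node) + 1}]" + steps
        # anchor on any attribute of the current ancestor,
        # e.g. //div[@class='content']/table[1]/td[2]/text()
        for name, value in parent.attrib.items():
            cands.add(f"//{parent.tag}[@{name}='{value}']{steps}/text()")
        node = parent
    return sorted(cands)

# Usage on the hypothetical page built earlier:
#   td = page.xpath("//td")[3]              # the 'Coppola' cell
#   for w in candidate_wrappers(td): print(w)
```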

Step 2: Define a Model of Page Evolution. Let S be an HTML document tree. We would like to predict the likely future changes to S; in other words, we want to model a distribution P[T | S], the probability that S evolves into T. This is similar to, but not the same as, shortest tree edit distance [Zhang 89]. Contribution: a probabilistic tree-edit model based on label insertion, deletion and substitution probabilities.
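A toy sampler, not the paper's edit transducer, to illustrate what "sample from P[T | S]" means. The probabilities below are made-up placeholders, insertions are omitted, and a real deletion would promote the node's children rather than drop the subtree:

```python
import copy
import random

P_DEL = {"td": 0.04, "br": 0.03}      # hypothetical Pdel(label)
P_SUB = {("tr", "div"): 0.02}         # hypothetical Psub(l1, l2)
DEFAULT_DEL = 0.01

def sample_future(root):
    """Sample one possible future tree T from the current lxml tree S."""
    tree = copy.deepcopy(root)
    for node in list(tree.iter()):
        parent = node.getparent()
        if parent is None:
            continue
        if random.random() < P_DEL.get(node.tag, DEFAULT_DEL):
            parent.remove(node)                  # delete (whole subtree dropped in this toy)
            continue
        for (src, dst), p in P_SUB.items():
            if node.tag == src and random.random() < p:
                node.tag = dst                   # substitute the label
    return tree
```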

Step 3: Learn Model Parameters from Archival Data. Let θ be the collection of model parameters Pins(l), Pdel(l), Psub(l1,l2). Let (Si,Ti) be a pair of web-page snapshots, where Si evolved into Ti. How do we estimate θ given a collection {(Si,Ti)} of such pairs? Contributions: an efficient dynamic-programming algorithm to compute Pθ[Ti | Si], a search strategy to maximize the likelihood over θ, and a heuristic to improve performance in practice. Example learned parameters: Psub(tr,div)=.02, Psub(br,p)=.1, Pdel(div)=0.01, Pins(a)=0.12.
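The exact estimator is a dynamic program over tree alignments; as a crude, hedged approximation of the idea, one can compare the label multisets of consecutive snapshots to get rough per-label insert and delete rates:

```python
from collections import Counter

def rough_rates(snapshot_pairs):
    """snapshot_pairs: iterable of (old_root, new_root) lxml element pairs."""
    deleted, inserted = Counter(), Counter()
    old_total, new_total = Counter(), Counter()
    for old, new in snapshot_pairs:
        old_labels = Counter(n.tag for n in old.iter())
        new_labels = Counter(n.tag for n in new.iter())
        old_total += old_labels
        new_total += new_labels
        deleted += old_labels - new_labels    # labels that disappeared
        inserted += new_labels - old_labels   # labels that appeared
    p_del = {l: deleted[l] / old_total[l] for l in deleted}
    p_ins = {l: inserted[l] / new_total[l] for l in inserted}
    return p_ins, p_del
```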

Step 4: Use the Learned Model to Rank Candidates. Let robustθ(w,S) be the probability that XPath wrapper w will still work on a tree T that evolves from S according to θ. We need to compute robustθ(w,S) in order to rank the candidate XPaths. Contributions: a proof that exact computation of robustness is #P-complete, and an approximation via Monte-Carlo simulation (sketched below). Example: Robustness(w2) = 0.13, Robustness(w3) = 0.07, Robustness(w4) = 0.21.
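The Monte-Carlo approximation itself is easy to sketch: sample many possible futures under the learned model and measure how often the wrapper still returns the value it extracts today. Here `sample_future` is assumed to be a sampler like the toy one above:

```python
def robustness(wrapper_xpath, page_root, sample_future, n_samples=1000):
    """Estimate robustness(w, S): fraction of sampled futures on which w still works."""
    expected = page_root.xpath(wrapper_xpath)
    survived = sum(
        1 for _ in range(n_samples)
        if sample_future(page_root).xpath(wrapper_xpath) == expected
    )
    return survived / n_samples
```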

Datasets. Faculty: a set of faculty homepages monitored over the last 5 years. NOAA: webpages from www.noaa.gov, a public website on environmental information, over the last 5 years. IMDB: webpages from www.imdb.com, monitored over the last 5 years.

Top Model Probabilities. [Table: per-tag insert and delete probabilities learned from the NOAA and Faculty datasets; the most frequently inserted and deleted tags include a, br, span, tr, td, b, li, p, ul, i and img, with probabilities roughly between 0.0006 and 0.097.]

Robust Wrappers for IMDB. Robustness ranking of candidate wrappers to extract director names from IMDB pages:
  0.400  //*[preceding-sibling = 'Director']/text()
  0.370  //*[h5/text() = 'Director']/a/text()
  …
  0.190  //*[@id='tn15main']/*/div[3]/a/text()
  0.080  //div[4]/*/div[3]/a/text()

Evaluation of robustness

Domain-Centric Extraction (work in progress). Philip Bohannon, 5.19.2010. Joint work with Ashwin Machanavajjhala, Keerthi Selvaraj, Nilesh Dalvi, Anish Das Sarma, Srujana Merugu, Raghu Ramakrishnan, Srinivasan Segmundu, Cong Yu.

Information Extraction Scenarios. One use of extracted information is to power information portals around particular domains:
- Movie information, e.g. movie name, release date, director, local theatres, showtimes
- School information, e.g. name, contact info, principal, test dates, sport events
- Restaurant information, e.g. restaurant name, cuisine, hours of operation, credit cards, parking, rating, atmosphere
- Academic information, e.g. title, conference or journal, author list, citation list (Dblife.com, rexa.com, Google Scholar)
- Sports information, e.g. team name, player name, player stats, recent game stats
- Product information, e.g. name, brand, features, vendor, price, description, reviews (Pricegrabber.com, techbargains.com, ..)

Content Portals' Wish Lists. Content portals analyze user interests to set competitive goals around content. Let's call our portal E, and summarize these goals as a query QE over a "true table", plus a quality metric:
QE = select name, phone, principal-name, api-score, start-date, .. (many more) from true-schools-in-world where type = "high school" and state = "CA"
The portal holds an estimate VE of QE (rows such as Cupertino High School, (408) 366-7300; Fremont High School, (408) 522-2400), scored by a quality metric, e.g. missing public school: -0.03, wrong api-score: -0.0001, missing picture: -0.0002, missing start date: -0.007, … (a toy scoring sketch follows below).
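A toy scoring sketch using the penalties shown on the slide; the field names and row representation are hypothetical:

```python
PENALTY = {
    "missing_school": -0.03,
    "wrong_api_score": -0.0001,
    "missing_picture": -0.0002,
    "missing_start_date": -0.007,
}

def quality(estimate, truth):
    """estimate, truth: dicts mapping school name -> dict of attribute values."""
    score = 0.0
    for name, true_row in truth.items():
        row = estimate.get(name)
        if row is None:
            score += PENALTY["missing_school"]
            continue
        if row.get("api-score") != true_row.get("api-score"):
            score += PENALTY["wrong_api_score"]
        if not row.get("picture"):
            score += PENALTY["missing_picture"]
        if not row.get("start-date"):
            score += PENALTY["missing_start_date"]
    return score
```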

True Databases. QE is posed over an imaginary but conceptually useful true-schools-in-world table: every possible "attribute", and every real "entity", from Oxford to the daycare down the street.

End-to-End Extraction is the Goal. The goal of extraction is to use a corpus (such as the web) to improve the content portal's estimate. [Diagram: the Web plus an initial estimate V0 of QE feed an extraction system (at some cost $), producing a new estimate VE of QE. Does it improve the quality metric?]

What's new with this perspective? Observations: quality metrics change, so what matters is the quality of the final database, not of each individual extraction. We must account for the full cost of supervision: even if extraction is unsupervised, we still pay to supervise integration. And as competition continues, the number of sites needed to further improve coverage grows.

Competitive Extraction: an Illustrative Experiment. Observation: as competition continues, the number of sites needed to further improve coverage grows. Experiment: study the distribution of the "homepage" attribute of schools. Start with a list of 75,000 schools and their official home page (OHP) URLs from Yahoo! Local. Find the in-links for each URL, and group them by host (i.e., for this attribute, the webmap tells us where it can be extracted). Order the hosts (other than Yahoo! Local) by the number of OHPs they know, decreasing. Now, to reach a particular coverage (x-axis), how many sites must we extract from (y-axis)? (A small sketch of this computation follows.)
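A small sketch of that computation, assuming we already have (hypothetically) the set of school ids each host covers:

```python
def hosts_needed(schools_by_host, total_schools, target_coverage):
    """schools_by_host: dict host -> set of school ids whose OHP the host links to."""
    covered = set()
    # greedily take hosts in decreasing order of how many OHPs they know
    ranked = sorted(schools_by_host.items(), key=lambda kv: -len(kv[1]))
    for n_hosts, (host, schools) in enumerate(ranked, start=1):
        covered |= schools
        if len(covered) / total_schools >= target_coverage:
            return n_hosts
    return None   # the target coverage is not reachable from these hosts
```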

[Figure: content availability for the official home pages of 75,000 schools. The head of the coverage curve is labeled "Perl" and "Supervised Techniques"; the long tail, labeled "… this talk …", requires extracting from almost one site per entity.]

Surfacing. Clearly, constructing an interesting corpus is not trivial, but neither is it usually considered an academic contribution. Pleasant exceptions include a variety of Deep Web work, Webtables, and [Blanco, JUCS Journal, 2008] and its citations. Automatically and rapidly finding a good corpus for extracting tail information in a domain (the last region of the previous figure) is far from easy. How do we scale to domain after domain, attribute after attribute? This needs definition as a formal problem, and attention from the community!

Story so far… Goals: end-to-end extraction, improve quality metrics. Challenge: surfacing gets harder over time. Challenge: so many sites to extract from.

Domains with Strong Cross-Site Formatting Conventions. Bibliographic records (DBLife, REXA, Google Scholar); NLP-pattern extraction (Knowitall); address/phone/date extractors (regex/CRF). The success of techniques for such domains comes from identifying and leveraging cross-site redundancies in the way information is embedded in web sites.

Bibliography …

Address/Phone

NLP / Dates

Cross-site signal does not always work. Attributes that can be extracted across sites with domain-specific techniques are exceptionally valuable, but limited. There are many attributes without such strong formatting conventions (e.g. principal's name, school sports team name, etc.). In addition to different formatting, there is different nesting, different choices of attributes, etc.

Content Diversity in the Tail

Content Diversity continued. For this particular attribute, nicely structured tables do exist, but they are in the minority: out of one small sample of 100 pages, we found 3.

Story so far… Goal: end-to-end extraction, improve quality metrics. Challenge: surfacing gets harder over time. Challenge: content diversity. These challenges will outlast particular techniques. Time to switch gears: now a brief introduction to the principles of our approach, "domain-centric extraction".

A Model for Generating the Web. [Diagram: the true world database → site queries → information loss → noise addition → site/page layout → surround generation → the WWW document corpus.] Motivated by data integration/exchange, we seek a formal model; it extends the EXALG/Roadrunner/MDR generative models of a single page. Surround generation happens after pages hit the web and is the novel part of the problem, only recently considered for data integration. If you think this is well structured, you have not recognized how many true databases there are, or how useless they can be (true ads, true spam, true opportunities from Nigeria). Attributes of interest may be drowned out by ones of non-interest, and nice hierarchical links may be drowned in surround links.

Model and Problem Definitions. [Diagram: the same generative pipeline, annotated with the problems it induces: traditional integration, traditional extraction, not-yet-traditional surfacing, and, over the WWW document corpus, end-to-end extraction.]

A Model for Generating the Web (revisited). [Diagram repeated: the true world database → site queries → information loss → noise addition → site/page layout → surround generation → the WWW document corpus.]

Each Step Leads to One or More Priors Useful for Surfacing or Extraction.
- Site queries -> site schema
- Information loss -> site database content
- Site/page layout -> URL structure, link structure, attribute rendering, xpath->attribute mapping, static text, separators, attribute order, page topology (attribute nesting)
- Surround generation -> clicks, inlinks, anchor text
- Content redundancy, already mentioned
Each is a potential form of redundancy to support low-supervision extraction, used across both surfacing and extraction.

Domain-Centric Information Extraction. In progress: formalizing this model, the forms of redundancy, and the problem definitions. Key point: every form of signal will be stronger within a particular domain. By utilizing our model and the variety of forms of redundancy, we seek to exploit these signals as they appear in each domain, while scaling to increasingly tail entities and tail attributes, and to new domains.

Domain-Centric Extraction: Some Key Points.
- Knowing an estimate of the data -> noisy labeling on new pages (a small sketch follows below). See [Gulharne, Rastogi, Segmundu, Tengli, WWW 2010] (Yahoo! Labs Bangalore) for a start on using content redundancy, also WWT.
- Partial data embedded in many web pages -> refine the estimate of local site/page features based on the generative model.
- The same attributes and similar data appear on a variety of new pages across sites -> joint extraction-integration.
[Example table: Name / Phone; Cupertino High School / (408) 366-7300; …]
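A hedged sketch of the first point (noisy labeling from a known estimate, a form of distant supervision); the known values below are hypothetical:

```python
from lxml import etree

KNOWN = {  # hypothetical current estimate of the data
    "Cupertino High School": "name",
    "(408) 366-7300": "phone",
}

def noisy_labels(page_root):
    """Return (xpath, attribute) pairs for text nodes matching known values."""
    tree = page_root.getroottree()
    labels = []
    for node in page_root.iter():
        text = (node.text or "").strip()
        if text in KNOWN:
            labels.append((tree.getpath(node), KNOWN[text]))
    return labels
```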

Example Application in Production

In summary:
- End-to-end extraction problem definition
- Importance of competitive extraction (it pushes toward the tail)
- Challenge: surfacing
- Challenge: diversity
- A generative model of site creation leads to a catalog of forms of surfacing and extraction signal
- Domain-centric extraction: proceed domain by domain to maximize signal against diversity

The End