
1 Robust Web Extraction Domain Centric Extraction
Philip Bohannon [SIGMOD 2009]. Automatic Knowledge Base Construction Workshop, Grenoble, France, May 17-20, 2010.

4 Robust Web Extraction, A Principled Approach
Philip Bohannon. Joint work with Nilesh Dalvi and Fei Sha. Published in SIGMOD 2009.

5 We can use the following XPath wrapper to extract directors
[Figure: example HTML DOM tree for a movie page: html -> body, with divs of class ‘head’ and ‘content’; the content div holds the title ‘Godfather’ and tables (width 60% and 80%) whose td cells contain ‘Title: Godfather’, ‘Director: Coppola’, ‘Runtime 118min’.]

We can use the following XPath wrapper to extract directors:

w1 = /html/body/div[2]/table/td[2]/text()

Finding a wrapper based on a few labeled examples is the well-studied wrapper induction problem.
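As a sketch of how such a wrapper is applied (assuming Python with lxml; the markup below is a hypothetical stand-in that mirrors the slide's simplified tree, with td cells directly under the table):

    # Sketch: applying the absolute XPath wrapper w1 with lxml (assumed
    # library). The document mirrors the slide's simplified tree.
    from lxml import etree

    page = """<html><body>
      <div class="head">Godfather</div>
      <div class="content">
        <table width="80%">
          <td>Director:</td><td>Coppola</td><td>Runtime 118min</td>
        </table>
      </div>
    </body></html>"""

    tree = etree.fromstring(page)
    # div[2] is the second div child of body, i.e. the content div.
    print(tree.xpath("/html/body/div[2]/table/td[2]/text()"))  # ['Coppola']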

6 Background: Wrapper Induction
[Figure: a wrapper generator takes labeled data and produces a wrapper (e.g. /html/div[2]/..., /html//table) that is then applied to test data.]

Goal: based on a few labeled examples, maximize precision and recall on test data.
Challenge: the test data is structurally similar to the labeled data, but not the same.

7 Problem: Wrappers break!
[Figure: the movie page after a site change: an ‘ad content’ block and an extra cell have been inserted, and ‘1972’ now follows the title, so an absolute XPath like w1 no longer points at the director.]

To monitor a site over time, wrappers are run repeatedly: each time a page changes, or when a new page appears. Problem: wrappers break!

8 Background: Wrapper Repair
[Figure: a wrapper-repair system takes the old labeled data, the old wrapper (e.g. /html/div[2]/..., /html//table) and new test data (which overlaps the old) and produces a new wrapper (e.g. /html/div[3]/..., /html//table).]

Goal: based on the old labeled data, the old wrapper and the old results, produce a new wrapper [Lerman 2003] [Meng 2003], etc.

Problems with repair:
Some breakage is hard to detect, especially for values that change frequently (e.g. price).
Wrapper repair is offline, so breakage impacts production until the repair takes effect.

9 Which one should we choose?
[Figure: the original movie page tree again.]

Apparently, several alternative wrappers are less likely than w1 to fail due to page change:

w2 = //div[class=‘content’]/table/td[2]/text()
w3 = //table[width=80%]/td[2]/text()
w4 = //td[preceding-sibling/text() = “Director”]/text()

Yet all these wrappers are "indistinguishable" by state-of-the-art methods. Which one should we choose?
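A quick illustration of "indistinguishable" (my own, with the slide's XPaths rewritten in standard syntax, e.g. @class for the attribute tests): on the current snapshot every candidate selects the same text, so labeled examples alone cannot prefer one.

    # All four candidates agree on today's page; only their behavior on
    # *future* pages differs.
    from lxml import etree

    page = """<html><body>
      <div class="head">Godfather</div>
      <div class="content">
        <table width="80%">
          <td>Director:</td><td>Coppola</td><td>Runtime 118min</td>
        </table>
      </div>
    </body></html>"""
    tree = etree.fromstring(page)

    candidates = {
        "w1": "/html/body/div[2]/table/td[2]/text()",
        "w2": "//div[@class='content']/table/td[2]/text()",
        "w3": "//table[@width='80%']/td[2]/text()",
        "w4": "//td[preceding-sibling::td[1]/text()='Director:']/text()",
    }
    for name, xp in candidates.items():
        print(name, tree.xpath(xp))  # every wrapper prints ['Coppola']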

10 Preview: Robust Wrappers for IMDB
Robustness ranking of candidate wrappers to extract director names from IMDB pages; this is the first technique to define the problem and provide such scores. The remainder of the talk: how we produce these rankings.

Robustness   XPath
0.400        //*[preceding-sibling = ‘Director’]/text()
0.370        //*[h5/text() = ‘Director’]/a/text()
0.190
0.080        //div[4]/*/div[3]/a/text()

11 Overview of Approach
1. Generate candidate wrappers, e.g.:
   w2 = //div[class=‘content’]/table/td[2]/text()
   w3 = //table[width=80%]/td[2]/text()
   w4 = //td[preceding-sibling/text() = “Director”]/text()
2. Define a model of page evolution: how might today's training data turn into future training data?
3. Learn the parameters of the model from archival snapshot pairs (S1,T1), (S2,T2), (S3,T3), e.g. Psub(tr,div)=.02, Psub(br,p)=.1, Pdel(div)=0.01, Pins(a)=0.12.
4. Use the learned model to rank the candidates, e.g. Robustness(w2)=0.13, Robustness(w3)=0.07, Robustness(w4)=0.21.

12 Generating Candidate Wrappers
Past work on XML wrapper induction generates one maximal wrapper [Tobias 2005]; we need a variety of wrappers, not a single one. Past work on robust XML wrappers has focused on evaluating hand-generated wrappers [Myllymaki 02] [Abe 03].

Contributions:
Define a class of minimal XPath wrappers.
An algorithm for efficiently enumerating minimal wrappers (toy sketch below).
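A toy sketch of the enumeration idea, under my own simplification (not the paper's minimal-wrapper algorithm): at each ancestor of the target node, choose either a positional step or an attribute-predicate step, and emit every combination.

    from itertools import product
    from lxml import etree

    def candidate_wrappers(node):
        """Enumerate simple XPath candidates for `node` by choosing, at
        each ancestor, either a positional step or an attribute-predicate
        step. A toy stand-in for minimal-wrapper enumeration."""
        steps = []
        while node.getparent() is not None:
            parent = node.getparent()
            same_tag = [c for c in parent if c.tag == node.tag]
            options = ["%s[%d]" % (node.tag, same_tag.index(node) + 1)]
            for attr, value in node.attrib.items():
                options.append("%s[@%s='%s']" % (node.tag, attr, value))
            steps.append(options)
            node = parent
        steps.append([node.tag])  # the root element
        for choice in product(*reversed(steps)):
            yield "/" + "/".join(choice)

    tree = etree.fromstring(
        "<html><body><div class='head'/>"
        "<div class='content'><table width='80%'>"
        "<td>Director:</td><td>Coppola</td></table></div></body></html>")
    target = tree.xpath("//td[2]")[0]
    for w in candidate_wrappers(target):
        print(w)  # four variants mixing positional and attribute steps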

13 Define a Model of Page Evolution
Let S be an HTML document tree. We would like to predict the likely future changes to S; in other words, we want to sample from a distribution P[T | S], the probability that S evolves into T. This is similar to, but not the same as, the shortest edit distance [Zhang 89].

Contribution: define a probabilistic tree edit model based on label insertion, deletion and substitution probabilities (sampling sketch below).
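A rough sketch of sampling from such a model (my simplification of the paper's edit process; deletion here drops the whole subtree rather than promoting children):

    import random

    # Hypothetical per-label parameters; real values are learned from
    # archival data (see slide 17).
    P_DEL = {"div": 0.01, "td": 0.03}
    P_SUB = {("tr", "div"): 0.02, ("br", "p"): 0.10}
    P_INS = {"a": 0.12}

    def evolve(node):
        """Sample one possible future of `node`, a (label, children) pair."""
        label, children = node
        if random.random() < P_DEL.get(label, 0.0):
            return None  # node (and, simplifying, its subtree) deleted
        for (src, dst), p in P_SUB.items():
            if label == src and random.random() < p:
                label = dst  # label substituted
                break
        new_children = [t for t in (evolve(c) for c in children) if t]
        for ins_label, p in P_INS.items():
            if random.random() < p:
                new_children.append((ins_label, ()))  # new node inserted
        return (label, tuple(new_children))

    snapshot = ("html", (("body", (("table", (("td", ()),)),)),))
    print(evolve(snapshot))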

14 Learn Parameters of Model from Archival Data
Let θ be the collection of model parameters Pins(l), Pdel(l), Psub(l1,l2), and let (Si,Ti) be a pair of web page snapshots where Si evolved into Ti. How do we estimate θ given a collection {(Si,Ti)} of such pairs?

Contributions:
An efficient dynamic-programming algorithm to compute Pθ[Ti | Si].
A search strategy over θ to maximize the likelihood (counting sketch below).
A heuristic to improve performance in practice.
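To see the shape of the estimation problem, here is a naive sketch under a strong assumption of mine: if each pair (Si,Ti) came with an explicit edit script, maximum likelihood would reduce to normalized counts. The point of the paper's dynamic program is exactly that no script is given, so it must sum over all of them.

    from collections import Counter

    # Hypothetical explicit edit scripts for three snapshot pairs; the
    # real algorithm never sees scripts and instead marginalizes over
    # all of them with dynamic programming.
    scripts = [
        [("del", "div"), ("ins", "a"), ("sub", "tr", "div")],
        [("ins", "a"), ("ins", "a")],
        [("del", "td"), ("sub", "br", "p")],
    ]
    # Total number of source nodes of each label across the Si (made up).
    exposure = {"div": 200, "td": 40, "tr": 50, "br": 10}

    counts = Counter(op for script in scripts for op in script)
    p_del = {l: counts[("del", l)] / n for l, n in exposure.items()}
    p_sub = {(a, b): c / exposure[a]
             for (kind, *args), c in counts.items() if kind == "sub"
             for (a, b) in [tuple(args)]}
    print(p_del)  # {'div': 0.005, 'td': 0.025, 'tr': 0.0, 'br': 0.0}
    print(p_sub)  # {('tr', 'div'): 0.02, ('br', 'p'): 0.1}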

15 Use Learned Model to Rank Candidates
Let robustθ(w,S) be the probability that XPath wrapper w will still work on a tree T that evolves from S according to θ. We need to compute robustθ(w,S) in order to rank candidate XPaths.

Contributions:
Proof that exact computation of robustness is #P-complete.
Approximation by Monte-Carlo simulation (sketch below), yielding e.g. Robustness(w2) = 0.13, Robustness(w3) = 0.07, Robustness(w4) = 0.21.
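A Monte-Carlo sketch, reusing the toy evolve() from the slide-13 sketch above: sample many futures T of S and report the fraction on which the wrapper still succeeds (the success test here is a stand-in for re-running the XPath):

    def robustness(wrapper_works, snapshot, trials=10000):
        """Monte-Carlo estimate of robust(w, S): fraction of sampled
        futures T ~ P[T | S] on which the wrapper still succeeds."""
        hits = 0
        for _ in range(trials):
            future = evolve(snapshot)  # evolve() from the earlier sketch
            if future is not None and wrapper_works(future):
                hits += 1
        return hits / trials

    # Stand-in "wrapper": succeeds if some td survives in the future tree.
    def has_td(node):
        label, children = node
        return label == "td" or any(has_td(c) for c in children)

    snapshot = ("html", (("body", (("table", (("td", ()),)),)),))
    print(robustness(has_td, snapshot))  # roughly 0.97 under P_DEL above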

16 Datasets
Faculty: a set of faculty homepages, monitored over the last 5 years.
NOAA: webpages from a public website on environmental information, over the last 5 years.
IMDB: webpages from IMDB, monitored over the last 5 years.

17 Top Model Probabilities
Top insertion and deletion probabilities learned for the two datasets (some tag labels were not captured):

NOAA
Insert             Delete
a       0.0030     td      0.0366
br      0.0019             0.0355
        0.0008             0.0271
span               tr      0.0096
                   b       0.0064
        0.0007     li      0.0006

Faculty
Insert             Delete
a       0.0141             0.0971
br      0.0073     p       0.0828
li      0.0055             0.0581
        0.0037             0.0172
b       0.0024             0.0074
i       0.0015     ul      0.0046
img     0.0012             0.0038

18 Robust Wrappers for IMDB
Robustness ranking of candidate wrappers to extract director names from IMDB pages:

Robustness   XPath
0.400        //*[preceding-sibling = ‘Director’]/text()
0.370        //*[h5/text() = ‘Director’]/a/text()
0.190
0.080        //div[4]/*/div[3]/a/text()

19 Evaluation of robustness

21 Domain-Centric Extraction (work in progress)
Philip Bohannon. Joint work with Ashwin Machanavajjhala, Keerthi Selvaraj, Nilesh Dalvi, Anish Das Sarma, Srujana Merugu, Raghu Ramakrishnan, Srinivasan Sengamedu, and Cong Yu.

22 Information Extraction Scenarios
One use of extracted information is to power information portals around particular domains.
Movie information: e.g. movie name, release date, director, local theatres, showtimes.
School information: e.g. name, contact info, principal, test dates, sport events.
Restaurant information: e.g. restaurant name, cuisine, hours of operation, credit cards, parking, rating, atmosphere.
Academic information: e.g. title, conference or journal, author list, citation list (dblife.com, rexa.com, Google Scholar).
Sports information: e.g. team name, player name, player stats, recent game stats.
Product information: e.g. name, brand, features, vendor, price, description, reviews (pricegrabber.com, techbargains.com, ...).

23 Content Portal’s Wish Lists
Content portals analyze user interests to set competitive goals around content. Let's call our portal E, and summarize its goals as a query QE over a "true table", together with a quality metric (a sketch of evaluating such a metric follows).

QE:
select name, phone, principal-name, api-score, start-date, ... (many more)
from true-schools-in-world
where type = "high school" and state = "CA"

An estimate VE of QE, e.g. rows for Cupertino High School (408) ... and Fremont High School (408) ...

Quality metric:
Missing public school: -0.03
Wrong api-score:
Missing picture:
Missing start date:
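An illustrative sketch of evaluating such a metric; only the -0.03 penalty comes from the slide, the other weights and record values are hypothetical:

    # Hypothetical penalty weights; only missing_school's value is from
    # the slide, the rest are made up for illustration.
    PENALTIES = {"missing_school": -0.03, "wrong_api_score": -0.02,
                 "missing_picture": -0.005, "missing_start_date": -0.01}

    def quality(estimate, truth):
        """Score an estimated table against the (notional) true table."""
        score = 0.0
        by_name = {row["name"]: row for row in estimate}
        for true_row in truth:
            row = by_name.get(true_row["name"])
            if row is None:
                score += PENALTIES["missing_school"]
                continue
            if row.get("api-score") not in (None, true_row["api-score"]):
                score += PENALTIES["wrong_api_score"]
            for field, penalty in [("picture", "missing_picture"),
                                   ("start-date", "missing_start_date")]:
                if row.get(field) is None:
                    score += PENALTIES[penalty]
        return score

    truth = [{"name": "Cupertino High School", "api-score": 906,
              "picture": "url", "start-date": "2010-08-23"}]
    estimate = [{"name": "Cupertino High School", "api-score": 906,
                 "picture": None, "start-date": "2010-08-23"}]
    print(quality(estimate, truth))  # -0.005: only the picture is missing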

24 True Databases
QE ranges over an imaginary but conceptually useful true-schools-in-world table, with:
Every possible "attribute".
Every real "entity", from Oxford to the daycare down the street.

25 End-to-End Extraction is the Goal
The goal of extraction is to use a corpus (such as the web) to improve the content portal's estimate:

[Figure: an extraction system takes an initial estimate V0 of QE and, at some dollar cost, the Web, and produces a new estimate VE of QE. Does it improve the quality metric?]

26 What’s new with this perspective? Observations:
Quality metrics change: we need the quality of the final database, not of each individual extraction.
We must account for the full cost of supervision: even with unsupervised extraction, we still pay to supervise integration.
As competition continues, the number of sites needed to further improve coverage grows.

27 Competitive Extraction: an Illustrative Experiment
Observation: as competition continues, the number of sites needed to further improve coverage grows.

Experiment: study the distribution of the "homepage" attribute of schools (a sketch of the computation follows the list):
Start with a list of 75,000 schools and their official home page (OHP) URLs from Yahoo! Local.
Find in-links for each URL, and group them by host (i.e., for this attribute, the webmap tells us where it can be extracted).
Order the hosts (other than Yahoo! Local) by the number of OHPs they know, decreasing.
Now, to reach a particular coverage (x-axis), how many sites must we extract from (y-axis)?
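A sketch of computing that curve, assuming the in-link data has already been reduced to (host, school_url) pairs:

    from collections import defaultdict

    def coverage_curve(inlinks):
        """inlinks: iterable of (host, school_url) pairs. Returns, for
        each k, the fraction of schools covered by the k largest hosts
        (hosts ordered by how many school URLs they link)."""
        by_host = defaultdict(set)
        all_schools = set()
        for host, url in inlinks:
            by_host[host].add(url)
            all_schools.add(url)
        covered, curve = set(), []
        for host in sorted(by_host, key=lambda h: -len(by_host[h])):
            covered |= by_host[host]
            curve.append(len(covered) / len(all_schools))
        return curve

    # Toy data: host A links 3 school pages, B and C one each.
    demo = [("A", "s1"), ("A", "s2"), ("A", "s3"), ("B", "s3"), ("C", "s4")]
    print(coverage_curve(demo))  # [0.75, 0.75, 1.0]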

28 Content availability for official home page of 75,000 schools
[Figure: the number of sites that must be extracted (y-axis) to reach a given coverage of official home pages (x-axis); the head of the curve is served by Perl scripts and supervised techniques, while the tail, the subject of this talk, approaches almost one site per entity.]

29 Surfacing
Clearly, constructing an interesting corpus is not trivial, but neither is it usually considered an academic contribution. Pleasant exceptions include a variety of Deep Web work, WebTables, and [Blanco, JUCS Journal, 2008] and its citations. Automatically and rapidly finding a good corpus for extracting tail information in a domain (the last rectangle in the preceding chart) is far from easy. How do we scale to domain after domain, attribute after attribute? This needs definition as a formal problem, and attention from the community!

30 Story so far… Goals: end-to-end extraction, improve quality metrics
Challenge: surfacing gets harder over time.
Challenge: so many sites to extract from.
[Figure: the coverage chart again, with its regions labeled Perl, Supervised Techniques, and "…this talk…".]

31 Domains with Strong Cross-Site Formatting Conventions
Bibliographic records (DBLife, REXA, Google Scholar).
NLP-pattern extraction (KnowItAll).
Address/phone/date extractors (regex/CRF).
The success of techniques for such domains comes from identifying and leveraging cross-site redundancies in the way information is embedded in web sites.

32 Bibliography …

33 Address/Phone

34 NLP / Dates

35 Cross-site signal does not always work
Attributes that can be extracted across sites with domain-specific techniques are exceptionally valuable, but limited: many attributes lack such strong formatting conventions (e.g. the principal's name, the school sports team name). Beyond different formatting, sites also differ in nesting, choice of attributes, etc.

36 Content Diversity in the Tail

37 Content Diversity continued
For this particular attribute, nicely structured tables do exist, but are in the minority – out of one small sample of 100, we found 3.

38 Story so far… Goal: end-to-end extraction, improve quality metrics
Challenge: surfacing gets harder over time.
Challenge: content diversity.
These challenges will outlast particular techniques.
Time to switch gears: a brief introduction to the principles of our approach, "domain-centric extraction".

39 A Model for Generating the Web
[Figure: generative pipeline from the true world database to the WWW document corpus: site queries -> information loss -> noise addition -> site/page layout -> surround generation.]

Motivated by data integration/exchange, we seek a formal model; it extends the EXALG/RoadRunner/MDR generative models for a page. Surround generation happens after pages hit the web, and the surround is the novel part of the problem, only recently considered for data integration. If you think this is well structured, you have not recognized how many true databases there are, or how useless they can be (true ads, true spam, true opportunities from Nigeria). Attributes of interest may be drowned out by attributes of non-interest, and nice hierarchical links may be drowned in surround links.

40 Model and Problem Definitions
[Figure: the same generative pipeline, annotated with the subproblems it induces: traditional integration, traditional extraction, and not-yet-traditional surfacing, which together make up end-to-end extraction over the WWW.]

42 Each Step Leads to One or More Priors Useful for Surfacing or Extraction
Site queries -> site schema.
Information loss -> site database content.
Site/page layout -> URL structure, link structure, attribute rendering, XPath->attribute mapping, static text, separators, attribute order, page topology (attribute nesting).
Surround generation -> clicks, in-links, anchor text.
Content redundancy, as already mentioned.
Each is a potential form of redundancy to support low-supervision extraction, used across surfacing and extraction.

43 Domain-Centric Information Extraction
In progress: formalizing this model, the forms of redundancy, and the problem definitions.
Key point: every form of signal will be stronger within a particular domain.
By using the variety of forms of redundancy in our model, we seek to exploit these signals as they appear in each domain while scaling
…to increasingly tail entities and tail attributes
…to new domains.

44 Domain Centric Extraction – Some key points
Knowing an estimate of the data -> noisy labeling on new pages (sketched below); see [Gulhane, Rastogi, Sengamedu, Tengli, WWW 2010] (Yahoo! Labs Bangalore) for a start on using content redundancy, and also WWT.
Partial data embedded in many web pages -> refine the estimate of local site/page features based on the generative model.
The same attributes and similar data appearing on a variety of new pages across sites -> joint extraction-integration.
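A sketch of the noisy-labeling step under my own simplified matching rule (exact string match; the full phone number is hypothetical, the slide shows only its area code):

    import re

    # Current (possibly incomplete) estimate of the database.
    estimate = [{"name": "Cupertino High School", "phone": "(408) 366-7300"}]

    def noisy_labels(page_text, records):
        """Mark spans of page_text that match known attribute values.
        Exact string match is a stand-in for fuzzier matching."""
        labels = []
        for rec in records:
            for attr, value in rec.items():
                for m in re.finditer(re.escape(value), page_text):
                    labels.append((m.start(), m.end(), attr))
        return sorted(labels)

    page = ("Welcome to Cupertino High School. "
            "Call us at (408) 366-7300 for enrollment.")
    print(noisy_labels(page, estimate))
    # [(11, 32, 'name'), (45, 59, 'phone')]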

45 Example Application in Production

46 In summary
End-to-end extraction problem definition.
Importance of competitive extraction (it pushes toward the tail).
Challenge: surfacing. Challenge: diversity.
A generative model of site creation leads to a catalog of forms of surfacing and extraction signal.
Domain-centric extraction: proceed domain by domain to maximize signal against diversity.

47 The End

