Optimal Schemes for Robust Web Extraction
Aditya Parameswaran, Stanford University
(Joint work with: Nilesh Dalvi, Hector Garcia-Molina, Rajeev Rastogi)


3 Problem: Wrappers break!
[Figure: DOM tree of a movie page. Node labels: html, head, title, body, div (class='head'), div (class='content'), table (width=80%), td. Cell text: Title: Godfather, Director: Coppola, Runtime: 118min. Other labels in the figure: 1972, ad, content.]
◦ We can use the following XPath wrapper to extract directors: W1 = /html/body/div[2]/table/td[2]/text()

4 But how do we find the most robust wrapper?
Several alternative wrappers are “more robust”:
◦ W2 = //div[@class='content']/table/td[2]/text()
◦ W3 = //table[@width='80%']/td[2]/text()
◦ W4 = //td[preceding-sibling::td/text() = 'Director']/text()
[Figure: the same movie-page DOM tree as on slide 3.]
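The breakage can be reproduced on a toy page. The sketch below is illustrative only: the two page versions are made-up stand-ins for the movie page, and Python's standard-library ElementTree supports only a subset of XPath (no text(), no preceding-sibling axis), so only W1 and W2 are exercised and the paths are written relative to the root element.

```python
import xml.etree.ElementTree as ET

# Hypothetical old version of the page (well-formed XML for simplicity).
OLD_PAGE = """
<html>
  <head><title>Godfather</title></head>
  <body>
    <div class="head">Godfather</div>
    <div class="content">
      <table width="80%"><td>Director</td><td>Coppola</td></table>
    </div>
  </body>
</html>
""".strip()

# Hypothetical new version: an ad div is inserted before the header div.
NEW_PAGE = OLD_PAGE.replace(
    '<div class="head">', '<div class="ad">ad</div><div class="head">'
)

# W1 and W2 from the slides, adapted to ElementTree's XPath subset.
W1 = "./body/div[2]/table/td[2]"
W2 = ".//div[@class='content']/table/td[2]"

def extract(page, xpath):
    """Return the text of the first node matched by xpath, or None."""
    node = ET.fromstring(page).find(xpath)
    return node.text if node is not None else None

for name, xpath in (("W1", W1), ("W2", W2)):
    print(name, extract(OLD_PAGE, xpath), extract(NEW_PAGE, xpath))
```

On the new version the positional wrapper W1 lands on the wrong div and finds nothing, while the attribute-based W2 still reaches the director cell.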

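5 Focus on Robustness
[Figure: at t = 0, labeled pages w_1, …, w_k and unlabeled pages w_{k+1}, …, w_n; at t = t_1, the new versions w_1’, …, w_n’, all unlabeled. The arrow from labeled to unlabeled pages is marked “Generalize ???”; the arrow across versions is marked “Focus on Robustness”.]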

6 Page Level Wrapper Approach
Compute a wrapper given:
◦ Old version w (an ordered labeled tree)
◦ Distinguished node d(w) in w (there may be many)
On being given a new version w’ (an ordered labeled tree), our wrapper returns:
◦ Distinguished node d(w’) in w’
◦ An estimate of the confidence

7 Two Core Problems
◦ Problem 1: Given w, find the most “robust” wrapper on w
◦ Problem 2: Given w and w’, estimate the “confidence” of extraction

8 Change Model
Adversarial:
◦ Each edit (insert, delete, substitute) has a known cost
◦ The cost of an edit script is the sum of the costs of its edits
Probabilistic: [Dalvi et al., SIGMOD09]
◦ Each edit has a known probability
◦ A transducer transforms the tree
◦ Probabilities of the edits are multiplied
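The two ways of scoring an edit script can be sketched as follows. The cost and probability tables are hypothetical values chosen for illustration, and the probabilistic part is a simplification: the transducer model in the cited work also accounts for nodes that are left unchanged, which this sketch ignores.

```python
import math

# Hypothetical per-operation costs (adversarial model).
COST = {"ins": 1.0, "del": 1.0, "sub": 2.0}
# Hypothetical per-operation probabilities (probabilistic model).
PROB = {"ins": 0.05, "del": 0.05, "sub": 0.02}

def script_cost(script):
    """Adversarial model: sum the costs of the edits in the script."""
    return sum(COST[op] for op, *_ in script)

def script_probability(script):
    """Probabilistic model (simplified): multiply the edit probabilities."""
    return math.prod(PROB[op] for op, *_ in script)

# Each edit is a tuple: (operation, node labels involved).
script = [("del", "X"), ("ins", "Y"), ("sub", "Z", "W")]
print(script_cost(script), script_probability(script))
```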

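9 Summary of Theoretical Results
[Table: rows and columns lost in extraction. Surviving cell labels: PART 1, PART 3, PART 4; the experiments are PART 2 and 5.]
◦ Callouts on the table: “Focus on these problems” and “Will touch upon this if there is time”
◦ Adversarial has better complexity
◦ Finding the wrapper is EASIER than estimating its robustness!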

10 Part 1: Adversarial Wrapper: Robustness
◦ Recall: the adversarial model has a cost for each edit operation
◦ Given a webpage w, fix a wrapper
◦ Robustness of a wrapper on a webpage w: the largest c such that, for any edit script s with cost < c, the wrapper can find the distinguished node in s(w)
[Figure: edit scripts (Script 1: del(X), ins(Y), subs(Z, W); Script 2: …) ordered by cost, with the robustness threshold marked.]
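Written as a formula, the definition above might read as follows. The symbols are assumptions introduced here, not notation from the slides: W is the wrapper (a function from a page to a node), rho is its robustness, and d(s(w)) is the true location of the distinguished node after applying s.

```latex
\[
  \rho(W, w) \;=\; \sup \bigl\{\, c \;:\; \forall s,\ \mathrm{cost}(s) < c
  \ \Rightarrow\ W\bigl(s(w)\bigr) = d\bigl(s(w)\bigr) \,\bigr\}
\]
```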

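11 How do we show optimality?
◦ Proof 1: an upper bound c on the robustness of any wrapper (w_1, w_2, w_3, w_4, …)
◦ Proof 2: a lower bound c on the robustness of our wrapper w_0
◦ Thus, w_0 is optimal!
[Figure: wrappers w_0 to w_4 placed on a robustness axis, with w_0 at the bound c.]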

12 Adversarial Wrapper: Upper Bound
◦ BAD CASE: two edit scripts S_1 and S_2 give the same structure (i.e., S_1(w) = S_2(w) = w’) but different locations of the distinguished node
◦ Let c be the smallest cost such that the bad case happens for some S_1, S_2 with cost(S_1) <= c and cost(S_2) <= c
◦ Then c is an upper bound on the robustness of any wrapper on w!
[Figure: w mapped to w’ by both S_1 and S_2.]

13 Adversarial Optimal Wrapper
Given w, d(w), w’:
◦ Find the smallest-cost edit script S such that S(w) = w’
◦ Return the location of d(w) after applying S to w
[Figure: S maps w to w’.]
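The two steps above can be sketched on a deliberately simplified setting. This is an assumption-laden illustration: pages are flat sequences of labels rather than ordered labeled trees, all edits have unit cost, and ties between equally cheap scripts are broken arbitrarily. The function name locate and the sample pages are hypothetical.

```python
def locate(old, new, d):
    """Follow one minimum-cost edit script from old to new and return
    the index in new where old[d] ends up, or None if it is deleted."""
    n, m = len(old), len(new)
    # dist[i][j] = minimum cost of turning old[:i] into new[:j].
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i
    for j in range(1, m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if old[i - 1] == new[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,        # delete old[i-1]
                             dist[i][j - 1] + 1,        # insert new[j-1]
                             dist[i - 1][j - 1] + sub)  # keep or substitute
    # Walk back along one optimal script until old[d] is reached.
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = 0 if old[i - 1] == new[j - 1] else 1
            if dist[i][j] == dist[i - 1][j - 1] + sub:
                if i - 1 == d:
                    return j - 1
                i, j = i - 1, j - 1
                continue
        if i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            if i - 1 == d:
                return None  # the distinguished item was deleted
            i -= 1
        else:
            j -= 1
    return None

OLD = ["Title", "Godfather", "Director", "Coppola", "Runtime", "118min"]
NEW = ["ad"] + OLD  # new version with an ad inserted at the front
print(locate(OLD, NEW, 3))  # index of "Coppola" in the new version
```

For trees, the same idea needs a tree edit script (the Zhang-Shasha detour on the next slide) instead of the sequence alignment used here.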

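14 Robustness Lower Bound Proof
◦ Assume the contrary: the robustness of our wrapper is < c
◦ Then there is an actual edit script S_1, with cost(S_1) < c, on which it fails
◦ Let the min-cost script be S_2. Then cost(S_2) <= cost(S_1) < c
◦ S_1 and S_2 both map w to w’ but disagree on the distinguished node, which is the bad case at a cost below c. By the choice of c, this situation cannot happen!
[Figure: w mapped to w’ by both S_1 and S_2.]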

15 Detour: Minimum Cost Edit Script
◦ Classical paper by Zhang and Shasha
◦ Dynamic programming over subtrees
◦ Complexity: O(n_1 n_2 d_1 d_2)
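As a rough illustration of tree edit distance on ordered labeled trees, the sketch below uses the plain memoized forest recurrence. It is not the Zhang-Shasha dynamic program and does not achieve the complexity quoted above; it only computes the same minimum cost on small inputs. The tree encoding and the unit default costs are assumptions.

```python
from functools import lru_cache

def tree(label, *children):
    """A tree is a (label, children) pair; children is a tuple of trees."""
    return (label, tuple(children))

def tree_edit_distance(t1, t2, ins_cost=1, del_cost=1, sub_cost=1):
    """Minimum cost of turning ordered tree t1 into ordered tree t2."""

    @lru_cache(maxsize=None)
    def dist(f1, f2):
        # f1 and f2 are forests (tuples of trees); work on the rightmost roots.
        if not f1 and not f2:
            return 0
        if not f2:
            _, kids = f1[-1]
            return dist(f1[:-1] + kids, ()) + del_cost
        if not f1:
            _, kids = f2[-1]
            return dist((), f2[:-1] + kids) + ins_cost
        (l1, k1), (l2, k2) = f1[-1], f2[-1]
        return min(
            # Delete the rightmost root of f1; its children move up.
            dist(f1[:-1] + k1, f2) + del_cost,
            # Insert the rightmost root of f2.
            dist(f1, f2[:-1] + k2) + ins_cost,
            # Match the two roots, then solve children and the rest separately.
            dist(k1, k2) + dist(f1[:-1], f2[:-1])
            + (0 if l1 == l2 else sub_cost),
        )

    return dist((t1,), (t2,))

old = tree("table", tree("td", tree("Director")), tree("td", tree("Coppola")))
new = tree("table", tree("td", tree("Coppola")))
print(tree_edit_distance(old, new))
```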

16 Part 2: Evaluation
Crawls from internet-archive.org:
◦ Domains: IMDB, CNN, Wikipedia
◦ Roughly webpages per domain
◦ Roughly 100s of versions per webpage
Finding distinguished nodes:
◦ We looked for unique patterns that appear in all webpages, like votes
◦ This allows us to do automatic evaluation
How do we set the costs?
◦ Learn from prior data…

17 Evaluation (Continued)
Baseline comparisons:
◦ XPATH: Robust XPath Wrapper [SIGMOD09]
◦ FULL: Entire XPath
Two kinds of experiments:
◦ Variation with difference in archive.org version number
  ▪ A proxy for time
  ▪ How do wrappers perform as the time gap is increased?
◦ Precision/recall of the confidence estimates provided
  ▪ Can I use the confidence values to decide whether to refer the webpage to an editor?


20 Part 2: Computation of Robustness
◦ NP-hard via a reduction from the partition problem on {x_1, x_2, …, x_n}
◦ Costs: d(a_0) = 0 and d(a_n) = 0
◦ Costs: s(a_i, b_i) = 0; s(a_i, b_{i-1}) = x_i; s(a_i, b_{i+1}) = x_i; everything else is infinite
◦ c = sum(x_i)/2 iff there is a partition
[Figure: node sequences a_0 a_1 … a_n, a_1 a_2 … a_n, a_0 a_1 … a_{n-1}, and b_{0/1} b_{1/2} … b_{n/n+1}.]

21 Part 3: Confidence in Extraction
◦ Let s_1 be the min-cost edit script
◦ Let s_2 be the min-cost edit script that has a different location of the distinguished node
◦ Confidence = cost(s_2) - cost(s_1)
◦ Also computed in O(n_1 n_2 d_1 d_2)
[Figure: w mapped to w’ by both s_1 and s_2.]
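The confidence gap can be illustrated on the same simplified setting of flat label sequences with unit costs (an assumption; the slides work on trees). For every candidate position j in the new page, the sketch computes the cheapest script that sends the distinguished item to j, then reports the best position and the gap to the runner-up. It assumes the new page is non-empty and ignores scripts that delete the distinguished item; all names are hypothetical.

```python
def edit_distance(a, b):
    """Unit-cost edit distance between two sequences (standard DP)."""
    prev = list(range(len(b) + 1))
    for i in range(1, len(a) + 1):
        cur = [i]
        for j in range(1, len(b) + 1):
            sub = 0 if a[i - 1] == b[j - 1] else 1
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + sub))
        prev = cur
    return prev[-1]

def location_costs(old, new, d):
    """For each j, cost of the cheapest script mapping old[d] onto new[j]."""
    costs = {}
    for j in range(len(new)):
        sub = 0 if old[d] == new[j] else 1
        costs[j] = (edit_distance(old[:d], new[:j]) + sub
                    + edit_distance(old[d + 1:], new[j + 1:]))
    return costs

def extract_with_confidence(old, new, d):
    """Return (best location, cost(s_2) - cost(s_1))."""
    costs = location_costs(old, new, d)
    ranked = sorted(costs, key=costs.get)
    best = ranked[0]
    if len(ranked) == 1:
        return best, float("inf")  # no alternative location exists
    return best, costs[ranked[1]] - costs[best]

OLD = ["Title", "Godfather", "Director", "Coppola", "Runtime", "118min"]
NEW = ["ad"] + OLD
print(extract_with_confidence(OLD, NEW, 3))  # → (4, 3)
```

A large gap means every alternative location needs a much costlier script, so the extraction is more trustworthy; a gap of zero means two locations are equally plausible.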

22 Probabilistic Wrapper
◦ No single “edit script”: all “edit scripts” have some non-zero probability
◦ Location of the node is argmax_s Pr(w, w’, d(w), s)
◦ Simple algorithm: for each s, compute the above
◦ Problem: too slow!
◦ Solution: share computation…

23 Evaluation (Continued)
Baseline comparisons:
◦ XPATH: Most robust XPath Wrapper [SIGMOD09]
◦ FULL: Entire XPath
Two kinds of experiments:
◦ Variation with difference in archive.org version number
  ▪ A proxy for time
  ▪ How do wrappers perform as the time gap is increased?
◦ Precision/recall of the confidence estimates provided
  ▪ Can I use the confidence values to decide whether to refer the webpage to an editor?


26 Conclusions
Our wrappers provide provable guarantees of optimal robustness under:
◦ Adversarial change model
◦ Probabilistic change model
Experimentally, too:
◦ They perform much better in terms of correctness
◦ Plus, they provide reliable confidence estimates

27 Thanks for coming!