WebTables & Octopus Michael J. Cafarella University of Washington CSE454 April 30, 2009.

Slides:

Advertisements

Similar presentations

Pseudo-Relevance Feedback For Multimedia Retrieval By Rong Yan, Alexander G. and Rong Jin Mwangi S. Kariuki

Advertisements

Center for E-Business Technology Seoul National University Seoul, Korea WebTables: Exploring the Power of Tables on the Web Michael J. Cafarella, Alon.

Lukas Blunschi Claudio Jossen Donald Kossmann Magdalini Mori Kurt Stockinger.

Supporting top-k join queries in relational databases Ihab F. Ilyas, Walid G. Aref, Ahmed K. Elmagarmid Presented by Rebecca M. Atchley Thursday, April.

Date : 2013/05/27 Author : Anish Das Sarma, Lujun Fang, Nitin Gupta, Alon Halevy, Hongrae Lee, Fei Wu, Reynold Xin, Gong Yu Source : SIGMOD’12 Speaker.

Improved TF-IDF Ranker

TEXTRUNNER Turing Center Computer Science and Engineering

Introduction Age of big data: – Availability of massive amounts of data is driving many technical advances on the web and off – The web is traditionally.

Data Integration for the Relational Web Katsarakis Michalis.

Outline Introduction to the work Motivation WebTables Relation Recovery – Relational Filtering – Metadata recovery Attribute Correlation Statistics Database.

WebTables: Exploring the Power of Tables on the Web

Search Engines and Information Retrieval

Creating Concept Hierarchies in a Customer Self-Help System Bob Wall CS /29/05.

Reduced Support Vector Machine

1 Ranked Queries over sources with Boolean Query Interfaces without Ranking Support Vagelis Hristidis, Florida International University Yuheng Hu, Arizona.

Methods for Domain-Independent Information Extraction from the Web An Experimental Comparison Oren Etzioni et al. Prepared by Ang Sun

Language-Independent Set Expansion of Named Entities using the Web Richard C. Wang & William W. Cohen Language Technologies Institute Carnegie Mellon University.

1 An Empirical Study on Large-Scale Content-Based Image Retrieval Group Meeting Presented by Wyman

Towards Semantic Web: An Attribute- Driven Algorithm to Identifying an Ontology Associated with a Given Web Page Dan Su Department of Computer Science.

Quality-driven Integration of Heterogeneous Information System by Felix Naumann, et al. (VLDB1999) 17 Feb 2006 Presented by Heasoo Hwang.

Finding Advertising Keywords on Web Pages Scott Wen-tau YihJoshua Goodman Microsoft Research Vitor R. Carvalho Carnegie Mellon University.

Managing The Structured Web Michael J. Cafarella University of Michigan Michigan CSE April 23, 2010.

Building Data Integration Systems for the Web Alon Halevy Google NSF Information Integration Workshop April 22, 2010.

Learning Table Extraction from Examples Ashwin Tengli, Yiming Yang and Nian Li Ma School of Computer Science Carnegie Mellon University Coling 04.

Deep-Web Crawling “Enlightening the dark side of the web”

Presented by Mat Kelly CS895 – Web-based Information Retrieval Old Dominion University Septmber 27, 2011 The Deep Web: Surfacing Hidden Value Michael K.

Outline Introduction to the work Motivation WebTables Relation Recovery – Relational Filtering – Metadata recovery Attribute Correlation Statistics Database.

Michael J. Cafarella, Alon Halevy, Daisy Zhe Wang, Eugene Wu, Yang Zhang Presented by : Kevin Quadros CSE Department University at Buffalo WebTables: Exploring.

DETECTING NEAR-DUPLICATES FOR WEB CRAWLING Authors: Gurmeet Singh Manku, Arvind Jain, and Anish Das Sarma Presentation By: Fernando Arreola.

Challenges in Information Retrieval and Language Modeling Michael Shepherd Dalhousie University Halifax, NS Canada.

Search Engines and Information Retrieval Chapter 1.

An Integrated Approach to Extracting Ontological Structures from Folksonomies Huairen Lin, Joseph Davis, Ying Zhou ESWC 2009 Hyewon Lim October 9 th, 2009.

BACKGROUND KNOWLEDGE IN ONTOLOGY MATCHING Pavel Shvaiko joint work with Fausto Giunchiglia and Mikalai Yatskevich INFINT 2007 Bertinoro Workshop on Information.

Tables to Linked Data Zareen Syed, Tim Finin, Varish Mulwad and Anupam Joshi University of Maryland, Baltimore County

The CompleteSearch Engine: Interactive, Efficient, and Towards IR&DB Integration Holger Bast, Ingmar Weber Max-Planck-Institut für Informatik CIDR 2007)

©2008 Srikanth Kallurkar, Quantum Leap Innovations, Inc. All rights reserved. Apollo – Automated Content Management System Srikanth Kallurkar Quantum Leap.

“Here is my data. Where do I start?” Examples of Ad Hoc Databases Automatic Example Queries for Ad Hoc Databases Bill Howe 1, Garret Cole 2, Nodira Khoussainova.

Redeeming Relevance for Subject Search in Citation Indexes Shannon Bradshaw The University of Iowa

The Web’s Many Models Michael J. Cafarella University of Michigan AKBC May 19, 2010 ?

Improving Web Search Ranking by Incorporating User Behavior Information Eugene Agichtein Eric Brill Susan Dumais Microsoft Research.

Michael Cafarella Alon HalevyNodira Khoussainova University of Washington Google, incUniversity of Washington Data Integration for Relational Web.

CSE 6331 © Leonidas Fegaras Information Retrieval 1 Information Retrieval and Web Search Engines Leonidas Fegaras.

When Experts Agree: Using Non-Affiliated Experts To Rank Popular Topics Meital Aizen.

Topical Crawlers for Building Digital Library Collections Presenter: Qiaozhu Mei.

Mining the Web to Create Minority Language Corpora Rayid Ghani Accenture Technology Labs - Research Rosie Jones Carnegie Mellon University Dunja Mladenic.

Searching for Statistical Diagrams Michael Cafarella University of Michigan Joint work with Shirley Zhe Chen and Eytan Adar Brigham Young University November.

Structured Querying of Web Text: A Technical Challenge Michael J. Cafarella, Christopher Re, Dan Suciu, Oren Etzioni, Michele Banko University of Washington.

Introduction to Digital Libraries hussein suleman uct cs honours 2003.

2007. Software Engineering Laboratory, School of Computer Science S E Web-Harvest Web-Harvest: Open Source Web Data Extraction tool 이재정 Software Engineering.

Enhancing Cluster Labeling Using Wikipedia David Carmel, Haggai Roitman, Naama Zwerdling IBM Research Lab (SIGIR’09) Date: 11/09/2009 Speaker: Cho, Chin.

Automatic Set Instance Extraction using the Web Richard C. Wang and William W. Cohen Language Technologies Institute Carnegie Mellon University Pittsburgh,

WIRED Week 3 Syllabus Update (next week) Readings Overview - Quick Review of Last Week’s IR Models (if time) - Evaluating IR Systems - Understanding Queries.

Google’s Deep-Web Crawl By Jayant Madhavan, David Ko, Lucja Kot, Vignesh Ganapathy, Alex Rasmussen, and Alon Halevy August 30, 2008 Speaker : Sahana Chiwane.

A Scalable Machine Learning Approach for Semi-Structured Named Entity Recognition Utku Irmak(Yahoo! Labs) Reiner Kraft(Yahoo! Inc.) WWW 2010(Information.

Searching for Statistical Diagrams Michael Cafarella University of Michigan Joint work with Shirley Zhe Chen and Eytan Adar U.S. Frontiers of Engineering.

Advanced Analytics on Hadoop Spring 2014 WPI, Mohamed Eltabakh 1.

Ranking objects based on relationships Computing Top-K over Aggregation Sigmod 2006 Kaushik Chakrabarti et al.

Using linked data to interpret tables Varish Mulwad September 14,

Effective Automatic Image Annotation Via A Coherent Language Model and Active Learning Rong Jin, Joyce Y. Chai Michigan State University Luo Si Carnegie.

Date: 2012/08/21 Source: Zhong Zeng, Zhifeng Bao, Tok Wang Ling, Mong Li Lee (KEYS’12) Speaker: Er-Gang Liu Advisor: Dr. Jia-ling Koh 1.

KAIST TS & IS Lab. CS710 Know your Neighbors: Web Spam Detection using the Web Topology SIGIR 2007, Carlos Castillo et al., Yahoo! 이 승 민.

Discovering Complex Matchings across Web Query Interfaces: A Correlation Mining Approach Bin He Joint work with: Kevin Chen-Chuan Chang, Jiawei Han Univ.

Instance Discovery and Schema Matching With Applications to Biological Deep Web Data Integration Tantan Liu, Fan Wang, Gagan Agrawal {liut, wangfa,

Harnessing the Deep Web : Present and Future -Tushar Mhaskar Jayant Madhavan, Loredana Afanasiev, Lyublena Antova, Alon Halevy January 7,

Neighborhood - based Tag Prediction

Introduction to Data Science Lecture 5 Data Integration

Data Integration for the Relational Web

Data Integration for Relational Web

Tantan Liu, Fan Wang, Gagan Agrawal The Ohio State University

Topic: Semantic Text Mining

Presentation transcript:

WebTables & Octopus Michael J. Cafarella University of Washington CSE454 April 30, 2009

2 Outline WebTables Octopus

3

This page contains 16 distinct HTML tables, but only one relational database Each relational database has its own schema, usually with labeled columns.

6 WebTables WebTables system automatically extracts dbs from web crawl [WebDB08, “Uncovering…”, Cafarella et al] [VLDB08, “WebTables: Exploring…”, Cafarella et al] An extracted relation is one table plus labeled columns Estimate that our crawl of 14.1B raw HTML tables contains ~154M good relational dbs Raw crawled pagesRaw HTML TablesRecovered Relations Applications Schema Statistics

7 The Table Corpus Table type% totalcount Small tables B HTML forms M Calendars M Obvious non-rel B Other non-rel (est.) B Rel (est.) M

8 The Table Corpus

9 Relation Recovery Output 271M databases, about 125M are good Five orders of magnitude larger than previous largest corpus [WWW02, “A Machine Learning…”, Wang & Hu] 2.6M unique relational schemas What can we do with those schemas? [VLDB08, “WebTables: Exploring…”, Cafarella et al] Step 1. Relational Filtering Step 2. Metadata Detection {} {} Recall 81%, Precision 41% Recall 85%, Precision 89%

10 Schema Statistics SchemaFreq Recovered Relations nameaddrcitystatezip Dan S16 ParkSeattleWA98195 Alon H129 ElmBelmontCA94011 makemodelyear ToyotaCamry1984 namesizelast-modified Readme.txt182Apr 26, 2005 cac.xml813Jul 23, 2008 makemodelyearcolor ChryslerVolare1974yellow NissanSentra1994red makemodelyear MazdaProtégé2003 ChevroletImpala1979 {make, model, year}1 {name, size, last-modified} 1 {name, addr, city, state, zip}1 {make, model, year, color}1 2 Schema stats useful for computing attribute probabilities p(“make”), p(“model”), p(“zipcode”) p(“make” | “model”), p(“make” | “zipcode”)

11 App #1: Relation Search Problem: keyword search for high- quality extracted databases Output depends on both quality of extracted tables and the ranking function

12 App #1: Relation Search

13 App #1: Relation Search Schema statistics can help improve both: Relation Recovery (Metadata detection) Ranking By computing a schema coherency score S(R) for relation R, and adding it to feature vector Measures how well a schema “hangs together” High: {make, model} Low: {make, zipcode} Average pairwise Pointwise Mutual Information score for all attributes in schema

14 App #1: Experiments Metadata detection, when adding schema stats scoring Precision 0.79  0.89 Recall 0.84  0.85 Ranking: compared 4 rankers on test set Naïve: Top-10 pages from google.com Filter: Top-10 good tables from google.com Rank: Trained ranker Rank-Stats: Trained ranker with coherency score What fraction of top-k are relevant? kNaïveFilterRankRank-Stats (34%) 0.43 (65%) 0.47 (80%) (42%) 0.56 (70%) 0.59 (79%) (74%) 0.66 (94%) 0.68 (100%)

15 App #2: Schema Autocomplete Input: topic attribute (e.g., make) Output: relevant schema {make, model, year, price} “tab-complete” for your database For input set I, output S, threshold t while p(S-I | I) > t newAttr = max p(newAttr, S-I | I) S = S  newAttr emit newAttr

16 App #2: Schema Autocomplete namename, size, last-modified, type instructorinstructor, time, title, days, room, course electedelected, party, district, incumbent, status, … abab, h, r, bb, so, rbi, avg, lob, hr, pos, batters sqftsqft, price, baths, beds, year, type, lot-sqft, …

17 App #2: Experiments Asked experts for schemas in 10 areas What was autocompleter’s recall?

18 App #3: Synonym Discovery Input: topic attribute (e.g., address ) Output: relevant synonym pairs ( telephone = tel-# ) Used for schema matching [VLDB01, “Generic Schema Matching…”, Madhavan et al] Linguistic thesauri are incomplete; hand-made thesauri are burdensome For attributes a, b and input domain C, when p(a,b)=0

19 App #3: Synonym Discovery name | , phone|telephone, _address| _address, date|last_modified instructorcourse-title|title, day|days, course|course-#, course-name|course-title electedcandidate|name, presiding-officer|speaker abk|so, h|hits, avg|ba, name|player sqftbath|baths, list|list-price, bed|beds, price|rent

20 App #3: Experiments For each input attr, repeatedly emit best synonym pair (until min threshold reached)

21 WebTables Contributions Largest collection of databases and schemas, by far Large-scale extracted schema data for first time; enables novel applications

22 Outline WebTables Octopus

23 Multiple Tables Can we combine tables to create new data sources? Data integration for the Structured Web Many existing “mashup” tools, which ignore realities of Web data A lot of useful data is not in XML User cannot know all sources in advance Transient integrations

24 Integration Challenge Try to create a database of all “VLDB program committee members”

25 Provides “workbench” of data integration operators to build target database Most operators are not correct/incorrect, but high/low quality (like search) Also, prosaic traditional operators Octopus

26 Walkthrough - Operator #1 SEARCH(“ VLDB program committee members ”) serge abiteboulinria anastassia ail…carnegie… gustavo alonsoetz zurich …… serge abiteboulinria michael adiba…grenoble antonio albano…pisa ……

27 Walkthrough - Operator #2 Recover relevant data serge abiteboulinria michael adiba…grenoble antonio albano…pisa …… serge abiteboulinria anastassia ail…carnegie… gustavo alonsoetz zurich …… CONTEXT()

28 Walkthrough - Operator #2 Recover relevant data serge abiteboulinria1996 michael adiba…grenoble1996 antonio albano…pisa1996 ……… serge abiteboulinria2005 anastassia ail…carnegie…2005 gustavo alonsoetz zurich2005 ……… CONTEXT()

29 Walkthrough - Union Combine datasets serge abiteboulinria1996 michael adiba…grenoble1996 antonio albano…pisa1996 ……… serge abiteboulinria2005 anastassia ail…carnegie…2005 gustavo alonsoetz zurich2005 ……… Union() serge abiteboulinria1996 michael adiba…grenoble1996 antonio albano…pisa1996 serge abiteboulinria2005 anastassia ail…carnegie…2005 gustavo alonsoetz zurich2005 ………

30 Walkthrough - Operator #3 Add column to data Similar to “join” but join target is a topic EXTEND( “publications”, col=0) serge abiteboulinria1996 michael adiba…grenoble1996 antonio albano…pisa1996 serge abiteboulinria2005 anastassia ail…carnegie…2005 gustavo alonsoetz zurich2005 ……… serge abiteboulinria1996“Large Scale P2P Dist…” michael adiba…grenoble1996“Exploiting bitemporal…” antonio albano…pisa1996“Another Example of a…” serge abiteboulinria2005“Large Scale P2P Dist…” anastassia ail…carnegie…2005“Efficient Use of the…” gustavo alonsoetz zurich2005“A Dynamic and Flexible…” ……… User has integrated data sources with little effort No wrappers; data was never intended for reuse “publications”

31 CONTEXT Algorithms Input: table and source page Output: data values to add to table SignificantTerms sorts terms in source page by “importance” (tf-idf)

32 Related View Partners Looks for different “views” of same data

33 CONTEXT Experiments

34 EXTEND Algorithms Recall: EXTEND( “publications”, col=0) JoinTest looks for single “joinable” table E.g., extend a column of US cities with “mayor” data Algorithm: 1. SEARCH for a table that is relevant to topic (e.g., “mayors”); rank by relevance 2. Retain results with a joinable column (to col=0). Use Jaccardian distance fn between columns

35 EXTEND Algorithms MultiJoin finds compatible “joinable tuples”; tuples can come from many different tables E.g., extend PC members with “publications” Algorithm: 1. SEARCH for each source tuple (using cell text and “publications”) 2. Cluster results, rank by weighted mix of topic- relevance and source-table coverage 3. Choose best cluster, then apply join-equality test as in JoinTest Algorithms reflect different data ideas: found in one table or scattered over many?

36 EXTEND Experiments Many test queries not EXTENDable We chose column and query to EXTEND TestJoin: 60% of src tuples for three topics; avg 1 correct extension per src tuple MultiJoin: 33% of src tuples for all topics; avg 45.5 correct extensions per src tuple Join column desc.Topic query CountriesUniversities Us statesGovernors Us citiesMayors Film titlesCharacters UK political partiesMember of parliament Baseball teamsPlayers Musical bandsalbums

37 CollocSplit Algorithm Mike Cafarella Univ of Washington Alon Halevy Google, Inc. Oren Etzioni Univ of Washington H.V. Jagadish Univ of Michigan Mike CafarellaUniv of Washington Alon HalevyGoogle, Inc. Oren EtzioniUniv of Washington H.V. JagadishUniv of Michigan Mike Cafarella Univ of Washington Alon Halevy Google, Inc. Oren Etzioni Univ of Washington H.V. Jagadish Univ of Michigan 1.For i = 1..MAX, find breakpoints, ranked by co-location score

38 CollocSplit Algorithm Mike Cafarella Univ of Washington Alon Halevy Google, Inc. Oren Etzioni Univ of Washington H.V. Jagadish Univ of Michigan Mike CafarellaUniv of Washington Alon HalevyGoogle, Inc. Oren EtzioniUniv of Washington H.V. JagadishUniv of Michigan Mike Cafarella || Univ of Washington Alon Halevy || Google, Inc. Oren Etzioni || Univ of Washington H.V. Jagadish || Univ of Michigan 1.For i = 1..MAX, find breakpoints, ranked by co-location score i=1

39 CollocSplit Algorithm Mike Cafarella Univ of Washington Alon Halevy Google, Inc. Oren Etzioni Univ of Washington H.V. Jagadish Univ of Michigan Mike CafarellaUniv of Washington Alon HalevyGoogle, Inc. Oren EtzioniUniv of Washington H.V. JagadishUniv of Michigan Mike Cafarella || Univ of || Washington Alon || Halevy || Google, Inc. Oren || Etzioni || Univ of Washington H.V. || Jagadish || Univ of Michigan i=2 1.For i = 1..MAX, find breakpoints, ranked by co-location score

40 CollocSplit Algorithm Mike Cafarella Univ of Washington Alon Halevy Google, Inc. Oren Etzioni Univ of Washington H.V. Jagadish Univ of Michigan Mike CafarellaUniv of Washington Alon HalevyGoogle, Inc. Oren EtzioniUniv of Washington H.V. JagadishUniv of Michigan Mike || Cafarella || Univ of || Washington Alon || Halevy || Google, || Inc. Oren || Etzioni || Univ of || Washington H.V. || Jagadish || Univ of || Michigan i=3 1.For i = 1..MAX, find breakpoints, ranked by co-location score

41 CollocSplit Algorithm Mike Cafarella Univ of Washington Alon Halevy Google, Inc. Oren Etzioni Univ of Washington H.V. Jagadish Univ of Michigan Mike CafarellaUniv of Washington Alon HalevyGoogle, Inc. Oren EtzioniUniv of Washington H.V. JagadishUniv of Michigan Mike Cafarella || Univ of Washington Alon Halevy || Google, Inc. Oren Etzioni || Univ of Washington H.V. Jagadish || Univ of Michigan 2.Choose i that yields most-consistent columns Our current consistency measure is avg std-dev of cell strlens

42 Octopus Contributions Basic operators that enable Web data integration with very small user burden Realistic and useful implementations for all three operators

43 Future Work WebTables Schema autocomplete & synonyms just few of many possible semantic services Input: schema; Output: tuples database autopopulate Input: tuples; Output: schema schema autogenerate Octopus Index support for interactive speeds

44 Future Work (2) “The Database of Everything” [CIDR09, “Extracting and Querying…”, Cafarella] Is domain-independence enough? Multi-model, multi-extractor approach probable Vast deduplication challenge New tools needed for user/extractor interaction be/is ask/call arrive/come/go join/lead born-in Text-embedded Table-embedded