Data Integration for the Relational Web

Presentation transcript:

Data Integration for the Relational Web. Michael J. Cafarella, Alon Halevy, Nodira Khoussainova (work done while at Google, Inc.). Presenter: Michael J. Cafarella, University of Michigan. VLDB, August 27, 2009.

Web Challenge: Try to create a database of all “VLDB program committee members”. This dataset should be easy to obtain, since all the information exists on the Web. Unfortunately, the data is scattered across a dozen sites, and the user cannot know all of them in advance (good luck finding the VLDB Cairo website!). It is not in XML and was never intended for reuse. Any integration built on it is transient.

Data Integration for Web: Can we combine tables to create new data sources? Existing mashup and data integration tools ignore the realities of Web data: a lot of useful data is not in XML; the user cannot know all sources in advance; integrations are transient; and data semantics are semi-tied to the source page. Given the scope of data online, this should be hugely promising and seductive. With traditional database integration tools, however, we would have to map sources to a single designed “mediated schema” for every query!

Octopus: Our test system has over 200M source tables. Pipeline: Crawl Web, Extract Tables, Integrate Tables, Obtain Database. Example target table (Name, Inst., Year): Serge Inria 1996 Michel … Gren Anton.. … Pisa 2005. Our system uses data from WebTables [WebDB08, “Uncovering…”, Cafarella et al] [VLDB08, “WebTables: Exploring…”, Cafarella et al] and Harvesting Relational Tables from Lists [VLDB09, “Harvesting Relational Tables from Lists…”, Elmeleegy et al]. There is lots of other table/list-extraction work, e.g., [VLDB09, “Answering Table Augmentation…”, Gupta & Sarawagi] [JAIR08, “Creating relational data…”, Michelson & Knoblock] [WWW07, “Towards domain-independent…”, Gatterbauer et al] [WWW02, “A machine learning based…”, Wang & Hu] …

Outline Introduction Data Sources Octopus Operators SEARCH CONTEXT EXTEND Algorithms & Experiments Conclusions


List Extraction

List Extraction What’s Opera Doc Warner 1957 Duck Amuck 1953 The Band Concert Disney 1935 Duck Dodgers… One Froggy Evening 1956 …

Outline Introduction Data Sources Octopus Operators SEARCH CONTEXT EXTEND Algorithms & Experiments Conclusions

Octopus: Provides a “workbench” of data integration operators for building the target database. Most operators are not correct/incorrect, but high/low quality. Some prosaic operators: project, select, … Three original operators: SEARCH, CONTEXT, EXTEND. Under the covers, each operator can be thought of as recovering a different aspect of an implicit set of GLAV source descriptions. These source descriptions are never explicitly shown to the user, but are revealed by the interaction of the user and the data.

Operator #1 - SEARCH: SEARCH(“VLDB program committee members”) returns a ranked list of clusters. Here, a cluster of two tables. Table 1: serge abiteboul inria; michael adiba …grenoble; antonio albano …pisa; … Table 2: serge abiteboul inria; anastassia ail… carnegie…; gustavo alonso etz zurich; … Each cluster returned by SEARCH corresponds to a mediated schema relation in the GLAV representation. Each member table of a cluster is a concrete table that contributes to the cluster’s relation (the member tables are unioned together). The user can perform SELECT, PROJECT, and UNION integrations, plus some limited JOINs.

Operator #2 - CONTEXT: Recover relevant data. CONTEXT() is applied to each table separately. Table 1: serge abiteboul inria; michael adiba …grenoble; antonio albano …pisa; … Table 2: serge abiteboul inria; anastassia ail… carnegie…; gustavo alonso etz zurich; …

Operator #2 - CONTEXT: Recover relevant data. After CONTEXT(), Table 1 carries the year 1996: serge abiteboul inria 1996; michael adiba …grenoble; antonio albano …pisa; … Table 2 carries the year 2005: serge abiteboul inria 2005; anastassia ail… carnegie…; gustavo alonso etz zurich; … CONTEXT operates on a single table. In the GLAV description, CONTEXT is equivalent to figuring out the selection predicates that apply to the mapping between the source table and a mediated table. Here, for example, we figure out that one table effectively has a year=1996 predicate and another has year=2005. This information is available only from the web page that embeds the source table; CONTEXT makes it explicit.

Prosaic Operator - Union: Combine datasets. Input table 1: serge abiteboul inria 1996; michael adiba …grenoble; antonio albano …pisa; … Input table 2: serge abiteboul inria 2005; anastassia ail… carnegie…; gustavo alonso etz zurich; … Result of Union(): serge abiteboul inria 1996 michael adiba …grenoble antonio albano …pisa 2005 anastassia ail… carnegie… gustavo alonso etz zurich …

Operator #3 - EXTEND: Add a column to the data. Similar to “join”, but the join target is a topic. EXTEND(“publications”, col=0) takes the input table (serge abiteboul inria 1996 michael adiba …grenoble antonio albano …pisa 2005 anastassia ail… carnegie… gustavo alonso etz zurich …) and adds a “publications” column: serge abiteboul inria 1996 “Large Scale P2P Dist…” michael adiba …grenoble “Exploiting bitemporal…” antonio albano …pisa “Another Example of a…” 2005 anastassia ail… carnegie… “Efficient Use of the…” gustavo alonso etz zurich “A Dynamic and Flexible…” … EXTEND modifies a GLAV description to contain another table and join key. Note: the union of the remaining tables in the cluster yields a single relation that contains 243 tuples (223 completely correct), drawn from five sources across three SIGMOD years (and three websites).

Straightforward Sequence SEARCH(“VLDB program committee members”) CONTEXT serge abiteboul inria michael adiba …grenoble antonio albano …pisa … CONTEXT serge abiteboul inria anastassia ail… carnegie… gustavo alonso etz zurich …

Straightforward Sequence CONTEXT union serge abiteboul inria 1996 michael adiba …grenoble antonio albano …pisa … CONTEXT serge abiteboul inria 2005 anastassia ail… carnegie… gustavo alonso etz zurich …

Straightforward Sequence: EXTEND is applied to the union. Input: serge abiteboul inria 1996 michael adiba …grenoble antonio albano …pisa 2005 anastassia ail… carnegie… gustavo alonso etz zurich … Output: serge abiteboul inria 1996 “Large Scale P2P Dist…” michael adiba …grenoble “Exploiting bitemporal…” antonio albano …pisa “Another Example of a…” 2005 anastassia ail… carnegie… “Efficient Use of the…” gustavo alonso etz zurich “A Dynamic and Flexible…” … The user integrated the data sources with 4 operations. No wrappers were needed, even though the data was never intended for reuse. The user never visited the source web pages.
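To make the effect of the sequence concrete, here is a minimal sketch of CONTEXT, union, and EXTEND over small in-memory tables. It only illustrates what each step does to the rows. The function names, the tuple-based table representation, and the lookup dictionary standing in for a topic search are all assumptions for illustration and are not the Octopus implementation. Truncated values are copied from the slides as-is.

```python
from typing import Dict, List, Optional, Tuple

Row = Tuple[Optional[str], ...]
Table = List[Row]


def context(table: Table, value: str) -> Table:
    """Append one recovered context value (e.g. a year) to every row."""
    return [row + (value,) for row in table]


def union(*tables: Table) -> Table:
    """Concatenate tables, dropping exact duplicate rows but keeping order."""
    seen = set()
    result: Table = []
    for table in tables:
        for row in table:
            if row not in seen:
                seen.add(row)
                result.append(row)
    return result


def extend(table: Table, col: int, topic_lookup: Dict[str, str]) -> Table:
    """Add a column by looking up each row's join value; None when no match."""
    return [row + (topic_lookup.get(row[col]),) for row in table]


# Two tables as they might come back from SEARCH (values taken from the slides).
pc_table_a: Table = [("serge abiteboul", "inria"), ("michael adiba", "…grenoble")]
pc_table_b: Table = [("serge abiteboul", "inria"), ("gustavo alonso", "etz zurich")]

# Hypothetical stand-in for the result of a "publications" topic search.
publications = {"serge abiteboul": "Large Scale P2P Dist…"}

combined = union(context(pc_table_a, "1996"), context(pc_table_b, "2005"))
extended = extend(combined, col=0, topic_lookup=publications)
```

In this sketch the year added by `context` keeps the two `serge abiteboul` rows distinct after `union`, and rows without a match in the lookup receive `None` in the new column.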

Outline Introduction Data Sources Octopus Operators SEARCH CONTEXT EXTEND Algorithms & Experiments Conclusions

Experiments: Query load of ~50 queries, suggested and evaluated by Amazon Mechanical Turk workers.

SEARCH Algorithms - Ranking: SimpleRank uses the search engine ranking. SCPRank uses the symmetric conditional probability between the query and the table data. It is similar to Pointwise Mutual Information and comes from work on multiword units [Lopes, DaSilva, 1999] (cite: Sixth Meeting on Mathematics of Language). Unfortunately, you can have very relevant tables that do not have a text hit on the query. Informally, SCPRank measures the correlation between the query and each table term, then takes the max of the per-column sums. SCPRank is very computationally burdensome, so we very roughly approximate it.


SEARCH Algorithms - Ranking

             Top-2   Top-5   Top-10
SimpleRank   27%     51%     73%
SCPRank      47%     64%     81%

See paper for clustering results. Substantial gains are possible beyond default web search relevance (as shown in our paper last VLDB).
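The "max of per-column sums" idea behind SCPRank can be sketched as follows. This assumes the usual definition of symmetric conditional probability, SCP(x, y) = p(x, y)^2 / (p(x) p(y)), and takes the probabilities from small hand-made dictionaries. The function names and the toy probabilities are hypothetical, and the rough approximation the slides mention is not shown.

```python
from typing import Callable, List, Sequence


def scp(p_joint: float, p_x: float, p_y: float) -> float:
    """Symmetric conditional probability: p(x, y)^2 / (p(x) * p(y))."""
    if p_x == 0.0 or p_y == 0.0:
        return 0.0
    return (p_joint * p_joint) / (p_x * p_y)


def scp_rank_score(
    query: str,
    columns: Sequence[Sequence[str]],
    prob: Callable[[str], float],
    joint_prob: Callable[[str, str], float],
) -> float:
    """Score a table: sum SCP(query, cell) per column, then take the max column."""
    p_query = prob(query)
    column_sums: List[float] = [
        sum(scp(joint_prob(query, cell), p_query, prob(cell)) for cell in column)
        for column in columns
    ]
    return max(column_sums, default=0.0)


# Toy occurrence probabilities (assumed values, not measured Web statistics).
PROB = {"q": 0.25, "a": 0.5, "b": 0.25, "c": 0.5}
JOINT = {("q", "a"): 0.25, ("q", "b"): 0.125, ("q", "c"): 0.0}

table_columns = [["a", "b"], ["c"]]
score = scp_rank_score(
    "q",
    table_columns,
    prob=lambda term: PROB.get(term, 0.0),
    joint_prob=lambda x, y: JOINT.get((x, y), 0.0),
)
```

Here the first column sums to 0.5 + 0.25 and the second to 0, so the table's score comes from the first column. A table whose cells never co-occur with the query scores 0, while a table can still score well without containing the query text itself.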

CONTEXT Algorithms: Input: a table and its source page. Output: data values to add to the table. SignificantTerms sorts the terms in the source page by “importance” (tf-idf). On the VLDB site, the conference name, year, and location are in the surrounding text. The data itself contains a person name and that person’s institution.
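A tf-idf ranking of page terms, in the spirit of SignificantTerms, might look like the sketch below. The tokenizer, the `doc_freq` table, and the corpus size are assumptions made for illustration; the slides only state that terms are sorted by tf-idf.

```python
import math
import re
from collections import Counter
from typing import Dict, List


def significant_terms(
    page_text: str, doc_freq: Dict[str, int], num_docs: int, k: int = 5
) -> List[str]:
    """Return the k terms of the page with the highest tf-idf score."""
    terms = re.findall(r"\w+", page_text.lower())
    term_freq = Counter(terms)

    def score(term: str) -> float:
        # Terms missing from doc_freq are treated as appearing in one document.
        df = max(1, doc_freq.get(term, 1))
        return term_freq[term] * math.log(num_docs / df)

    # Sort by descending score; break ties alphabetically for stable output.
    return sorted(term_freq, key=lambda term: (-score(term), term))[:k]


# Hypothetical document frequencies over a 1000-page corpus.
DOC_FREQ = {"the": 1000, "vldb": 10, "2005": 50, "program": 200, "committee": 100}
page = "VLDB 2005 program committee the the the VLDB"
top_terms = significant_terms(page, DOC_FREQ, num_docs=1000, k=2)
```

With these toy counts, the frequent but uninformative word `the` scores zero, while the conference name and year rise to the top, which is the kind of value CONTEXT wants to add to the table.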

Related View Partners: Looks for different “views” of the same data. Consider the VLDB conference page and a PC member’s home page. On a researcher’s home page, the data table probably contains a set of PC memberships, listing the conference name, year, and place. The researcher’s name and institution are probably in the surrounding text. RVP finds tables elsewhere on the Web that contain values from SignificantTerms. So RVP works by looking for source terms that are matched a lot by other tables on the Web.
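The last point, preferring source-page terms that many other tables match, can be sketched as a simple count. This is a simplified illustration only: how the other tables are retrieved is not shown, and the function name and toy tables are hypothetical.

```python
from typing import List, Sequence, Tuple


def rank_by_table_matches(
    candidate_terms: Sequence[str],
    other_tables: Sequence[Sequence[Sequence[str]]],
) -> List[Tuple[str, int]]:
    """Rank candidate terms by how many other tables contain them as a cell."""
    cell_sets = [
        {cell.lower() for row in table for cell in row} for table in other_tables
    ]
    counts = [
        (term, sum(1 for cells in cell_sets if term.lower() in cells))
        for term in candidate_terms
    ]
    # Most-matched terms first; ties broken alphabetically.
    return sorted(counts, key=lambda pair: (-pair[1], pair[0]))


# Candidate terms from a conference page and toy "PC membership" tables
# such as those that might appear on researchers' home pages.
candidates = ["vldb", "2005", "welcome"]
tables = [
    [("vldb", "2005", "trondheim"), ("sigmod", "2004", "paris")],
    [("vldb", "1996", "mumbai")],
    [("icde", "2005", "tokyo")],
]
ranked = rank_by_table_matches(candidates, tables)
```

In this toy run, `vldb` and `2005` each match two tables while `welcome` matches none, so page boilerplate drops to the bottom of the ranking.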

CONTEXT Experiments: On a query load of ~50 queries, we measure the percentage of queries that yield a good CONTEXT value in the top-1 returned by the system, the top-2, the top-3, etc. [Figure: the y-axis is the percentage of tables that yielded a good context term within the top-k. Blue is SignificantTerms, red is RelatedViewPartners, and green is a hybrid algorithm.]
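The top-k measure described here can be computed as below. This is a small sketch of the metric only; the input format (the rank of the first good value per query, or `None` when no good value was returned) and the sample ranks are assumptions, not data from the experiments.

```python
from typing import List, Optional


def top_k_success_rate(first_good_ranks: List[Optional[int]], k: int) -> float:
    """Fraction of queries whose first good value appears at rank <= k."""
    if not first_good_ranks:
        return 0.0
    hits = sum(1 for rank in first_good_ranks if rank is not None and rank <= k)
    return hits / len(first_good_ranks)


# Hypothetical ranks for four queries; None means no good value was returned.
sample_ranks = [1, 3, None, 2]
rates = [top_k_success_rate(sample_ranks, k) for k in (1, 2, 3)]
```

The resulting curve is non-decreasing in k, which matches the shape one would expect from the plot described on this slide.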

EXTEND Algorithms: Input: source table, source column, destination topic. Example: EXTEND(t, col=0, “publications”). JoinTest tests a single table for join-compatibility (“City mayors”: yes; “VLDB publications”: no). It ranks all tables by relevance to the query topic, then selects tables that are joinable to the query column. JoinTest uses search engine ranking for relevance. The joinability test computes the Jaccard score between the set of items in the query column and the set of items in the tested join column; if the Jaccard score passes a threshold (0.8, I believe), the table is treated as joinable. Strict item equality is not required; it is a string-edit-distance threshold test. MultiJoin finds a join-target tuple for each source tuple (“City mayors”: maybe; “VLDB publications”: yes). For each cell in the source column, it performs a topic search, then clusters the resulting tables and ranks them by column coverage.
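The joinability test can be sketched with a plain Jaccard score. This is a simplification under stated assumptions: items are compared after lowercasing and trimming rather than with the string-edit-distance matching the slide mentions, and the 0.8 default is the value the slide itself gives only tentatively. Function names are hypothetical.

```python
from typing import Iterable, Set


def normalize(items: Iterable[str]) -> Set[str]:
    """Lowercase and trim items; a crude stand-in for fuzzy string matching."""
    return {item.strip().lower() for item in items if item.strip()}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Size of the intersection divided by size of the union."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def is_joinable(
    query_column: Iterable[str],
    candidate_column: Iterable[str],
    threshold: float = 0.8,  # assumed default; the slide is unsure of the value
) -> bool:
    """Treat the candidate column as joinable when Jaccard meets the threshold."""
    score = jaccard(normalize(query_column), normalize(candidate_column))
    return score >= threshold


source_cities = ["Chicago", "Boston", "Seattle", "Denver"]
mayors_table_cities = ["chicago", "boston ", "seattle", "denver", "Austin"]
unrelated_cities = ["Chicago", "Paris"]

joinable = is_joinable(source_cities, mayors_table_cities)
not_joinable = is_joinable(source_cities, unrelated_cities)
```

A high threshold suits the "City mayors" case, where one candidate table covers nearly the whole source column. It would reject the "VLDB publications" case, where each source tuple finds its match in a different table, which is the situation MultiJoin targets.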

EXTEND Early Experiments

Join Column            Topic Query
countries              universities
US states              governors
US cities              mayors
Film titles            characters
UK political parties   MP
Baseball teams         players
Musical bands          albums

JoinTest: 3 of 7 source tables; 60% of source tuples; a single extension for each extended tuple.
MultiJoin: all 7 source tables; 33% of source tuples; avg 45.5 extensions for each extended tuple.
Examples: 113 NYC mayors; 12 albums by Led Zeppelin.

Not many of our source queries are actually EXTENDable, just 7. MultiJoin and JoinTest should probably be separate operators, or perhaps we could automatically choose one based on the source data. They are not really competitive, though: they apply in different situations, depending on the nature of the data.

Related Work: Octopus relies on information extraction work. There is substantial work in data integration. Mashup tools: Yahoo! Pipes; Marmite [Wong and Hong, 2007]; Karma [Tuchinda, et al., 2007]; CIMPLE [DeRose, et al., 2007]; Potter’s Wheel [Raman and Hellerstein, 2001]. Yahoo! Pipes allows the user to easily pipe together XML flows, but assumes structured data inputs (and that the user can find them). Karma populates a user’s database, but requires sources with formal declarations. CIMPLE attempts Web integration, but still requires a lot of manual work by an administrator; it is intended not for transient integrations but for long-lasting ones that are easy to maintain (though still relatively burdensome to create). Potter’s Wheel emphasizes live interaction for data cleaning. Its workbench-style interface is the closest to the Octopus model.

Octopus Contributions: Basic operators that enable Web data integration with very small user burden. Realistic and useful implementations for all three operators. Future work: efficient large-scale implementation. There are some serious performance challenges, especially for components that issue a huge number of Web queries (the MultiJoin algorithm) or that use a lot of non-adjacent word statistics from the Web (as in the SCP ranking function). We are not sure yet what the efficiency/accuracy tradeoff is.