Data Integration for the Relational Web

Michael J. Cafarella, Alon Halevy, Nodira Khoussainova
Work done while at Google, Inc.
Presenter: Michael J. Cafarella, University of Michigan
VLDB, August 27, 2009

Web Challenge
Try to create a database of all “VLDB program committee members”.
It should be easy to obtain this dataset, as all the information exists on the Web.
But unfortunately, the data is:
  Scattered across a dozen sites
    User cannot know all of them in advance (good luck finding the VLDB Cairo website!)
  Not in XML, never intended for reuse
  Transient integrations

Data Integration for Web
Can we combine tables to create new data sources?
Existing mashup and data integration tools ignore the realities of Web data:
  A lot of useful data is not in XML
  User cannot know all sources in advance
  Transient integrations
  Data semantics semi-tied to source page
Given the scope of data online, this should be hugely promising and seductive. With traditional database integration tools, however, we would have to map sources to a single designed “mediated schema” for every query!

Octopus
Crawl Web -> Extract Tables -> Integrate Tables -> Obtain Database
  Name      Inst.     Year
  Serge     Inria     1996
  Michel    … Gren
  Anton..   … Pisa    2005
Our test system has over 200M source tables
Our system uses data from:
  WebTables
    [WebDB08, “Uncovering…”, Cafarella et al]
    [VLDB08, “WebTables: Exploring…”, Cafarella et al]
  Harvesting Relational Tables from Lists
    [VLDB09, “Harvesting Relational Tables from Lists…”, Elmeleegy et al]
Lots of table/list-extraction work, e.g.,
  [VLDB09, “Answering Table Augmentation…”, Gupta & Sarawagi]
  [JAIR08, “Creating relational data…”, Michelson & Knoblock]
  [WWW07, “Towards domain-independent…”, Gatterbauer et al]
  [WWW02, “A machine learning based…”, Wang & Hu]
  …

Outline
Introduction
Data Sources
Octopus Operators
  SEARCH
  CONTEXT
  EXTEND
Algorithms & Experiments
Conclusions

List Extraction
  What’s Opera Doc      Warner   1957
  Duck Amuck                     1953
  The Band Concert      Disney   1935
  Duck Dodgers…
  One Froggy Evening             1956
  …

Octopus
Provides a “workbench” of data integration operators to build the target database
Most operators are not correct/incorrect, but high/low quality
Some prosaic operators: project, select, …
Three original operators:
  SEARCH
  CONTEXT
  EXTEND
Under the covers, each operator can be thought of as recovering a different aspect of an implicit set of GLAV source descriptions. These source descriptions are never explicitly shown to the user, but are revealed by the interaction of the user and the data.

Operator #1 - SEARCH
SEARCH(“VLDB program committee members”)
Returns a ranked list of clusters. Here, a cluster of two tables.
Table 1:
  serge abiteboul   inria
  michael adiba     …grenoble
  antonio albano    …pisa
  …
Table 2:
  serge abiteboul   inria
  anastassia ail…   carnegie…
  gustavo alonso    etz zurich
  …
Each cluster returned by SEARCH corresponds to a mediated schema relation in the GLAV representation. Each member table of a cluster is a concrete table that contributes to the cluster’s relation (the member tables are unioned together). The user can perform SELECT, PROJECT, and UNION integrations, plus some limited JOINs.

Operator #2 - CONTEXT
Recover relevant data
Before CONTEXT(), table 1:
  serge abiteboul   inria
  michael adiba     …grenoble
  antonio albano    …pisa
  …
Before CONTEXT(), table 2:
  serge abiteboul   inria
  anastassia ail…   carnegie…
  gustavo alonso    etz zurich
  …
After CONTEXT(), table 1:
  serge abiteboul   inria        1996
  michael adiba     …grenoble    1996
  antonio albano    …pisa        1996
  …
After CONTEXT(), table 2:
  serge abiteboul   inria        2005
  anastassia ail…   carnegie…    2005
  gustavo alonso    etz zurich   2005
  …
CONTEXT operates on a single table. In the GLAV description, CONTEXT is equivalent to figuring out the selection predicates that apply to the mapping between the source table and a mediated table. Here, for example, we figure out that one table effectively has a year=1996 predicate and another has year=2005. This information is only available via the source table’s embedding web page. CONTEXT makes it explicit.
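
The effect of CONTEXT on a table can be sketched in a few lines of Python. This is only an illustration of the operator's output, assuming the context value (here the year) has already been recovered from the embedding page; the `add_context` helper and the list-of-rows table representation are hypothetical, not part of Octopus.

```python
from typing import List

Table = List[List[str]]  # hypothetical representation: a list of rows of cell strings


def add_context(table: Table, value: str) -> Table:
    """Return a copy of `table` with the recovered context value appended to every row."""
    return [row + [value] for row in table]


# Rows taken from the second example table; the year comes from the source page.
pc_table = [["serge abiteboul", "inria"], ["gustavo alonso", "etz zurich"]]
pc_table_with_year = add_context(pc_table, "2005")
```

Appending the same value to every row mirrors the selection predicate (year=2005) that the whole source table implicitly satisfies.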

Prosaic Operator - Union
Combine datasets
Input table 1:
  serge abiteboul   inria        1996
  michael adiba     …grenoble    1996
  antonio albano    …pisa        1996
  …
Input table 2:
  serge abiteboul   inria        2005
  anastassia ail…   carnegie…    2005
  gustavo alonso    etz zurich   2005
  …
Union() result:
  serge abiteboul   inria        1996
  michael adiba     …grenoble    1996
  antonio albano    …pisa        1996
  anastassia ail…   carnegie…    2005
  gustavo alonso    etz zurich   2005
  …

Operator #3 - EXTEND
Add column to data
Similar to “join”, but the join target is a topic
EXTEND(“publications”, col=0)
Input:
  serge abiteboul   inria        1996
  michael adiba     …grenoble    1996
  antonio albano    …pisa        1996
  anastassia ail…   carnegie…    2005
  gustavo alonso    etz zurich   2005
  …
Output, with a new “publications” column:
  serge abiteboul   inria        1996   “Large Scale P2P Dist…”
  michael adiba     …grenoble    1996   “Exploiting bitemporal…”
  antonio albano    …pisa        1996   “Another Example of a…”
  anastassia ail…   carnegie…    2005   “Efficient Use of the…”
  gustavo alonso    etz zurich   2005   “A Dynamic and Flexible…”
  …
EXTEND modifies a GLAV description to contain another table and join key.
(Union of remaining tables in cluster yields a single relation. Contains 243 tuples (223 completely correct). Drawn from five sources across three SIGMOD years (and three websites).)
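
As a rough illustration of what EXTEND produces, the sketch below appends one value per row by joining a chosen column against a lookup table. It assumes the topic data has already been found and reduced to a key-to-value mapping; `extend` and `publications` are hypothetical names, and the real operator has to discover the join target on the Web rather than receive it.

```python
from typing import Dict, List, Optional


def extend(table: List[List[str]], col: int,
           topic_lookup: Dict[str, str]) -> List[List[Optional[str]]]:
    """Append to each row the topic value joined on column `col` (None if no match)."""
    return [row + [topic_lookup.get(row[col])] for row in table]


# Hypothetical lookup standing in for Web tables about the "publications" topic.
publications = {
    "serge abiteboul": "Large Scale P2P Dist…",
    "gustavo alonso": "A Dynamic and Flexible…",
}
members = [["serge abiteboul", "inria", "1996"],
           ["gustavo alonso", "etz zurich", "2005"]]
extended = extend(members, 0, publications)
```

Rows without a join partner keep a `None` in the new column, which matches the observation that only some source tuples can be extended.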

Straightforward Sequence
1. SEARCH(“VLDB program committee members”) returns a cluster of two tables
2. CONTEXT on each table adds the year (1996 for one table, 2005 for the other)
3. union combines the two tables
4. EXTEND adds a publications column
Result:
  serge abiteboul   inria        1996   “Large Scale P2P Dist…”
  michael adiba     …grenoble    1996   “Exploiting bitemporal…”
  antonio albano    …pisa        1996   “Another Example of a…”
  anastassia ail…   carnegie…    2005   “Efficient Use of the…”
  gustavo alonso    etz zurich   2005   “A Dynamic and Flexible…”
  …
User integrated data sources with 4 operations
No wrappers; data was never intended for reuse
User never visited source web pages

Experiments
Query load of ~50 queries, suggested and evaluated by Amazon Mechanical Turk

SEARCH Algorithms - Ranking
SimpleRank - search engine ranking
Unfortunately, you can have very relevant tables that do not have a text hit on the query
SCPRank - symmetric conditional probability between query and table data
  Similar to Pointwise Mutual Information; comes from work on multiword units [Lopes, DaSilva, 1999] (Cite: Sixth Meeting on Mathematics of Language)
  Informally, measures correlation between the query and each table term; the table score is the max of the per-column sums
  SCPRank is very computationally burdensome, so we approximate it very roughly
              Top-2   Top-5   Top-10
  SimpleRank  27%     51%     73%
  SCPRank     47%     64%     81%
See paper for clustering results
Substantial gains possible beyond default web search relevance (as shown in our paper last VLDB).
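
The slides do not spell out the SCP formula. A minimal sketch, assuming the usual definition of symmetric conditional probability, scp(q, c) = p(q, c)^2 / (p(q) p(c)), and the "max of per-column sums" aggregation mentioned above, could look like this. The probabilities in `TOY_STATS` are made-up numbers for illustration; a real system would estimate them from Web co-occurrence statistics.

```python
from typing import Callable, List


def scp(p_q: float, p_c: float, p_qc: float) -> float:
    """Symmetric conditional probability: p(q, c)^2 / (p(q) * p(c))."""
    if p_q == 0 or p_c == 0:
        return 0.0
    return (p_qc * p_qc) / (p_q * p_c)


def scp_rank_score(columns: List[List[str]],
                   cell_score: Callable[[str], float]) -> float:
    """Score a table as the maximum over its columns of the summed per-cell scores."""
    return max((sum(cell_score(cell) for cell in column) for column in columns),
               default=0.0)


# Made-up statistics: cell -> (p(cell), p(query and cell)); p(query) is fixed.
P_QUERY = 0.01
TOY_STATS = {
    "serge abiteboul": (0.002, 0.001),
    "gustavo alonso": (0.002, 0.001),
    "inria": (0.05, 0.001),
    "etz zurich": (0.05, 0.001),
}


def toy_cell_score(cell: str) -> float:
    p_c, p_qc = TOY_STATS.get(cell, (0.0, 0.0))
    return scp(P_QUERY, p_c, p_qc)


table_columns = [["serge abiteboul", "gustavo alonso"], ["inria", "etz zurich"]]
table_score = scp_rank_score(table_columns, toy_cell_score)
```

In this toy setting the name column dominates the score, because the names co-occur with the query far more often than their overall frequency would suggest.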

CONTEXT Algorithms
Input: table and source page
Output: data values to add to table
SignificantTerms sorts terms in source page by “importance” (tf-idf)
On the VLDB site, the conference name, year, and location are in the surrounding text. The data itself contains a person name and that person’s institution.
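
A tf-idf ranking of source-page terms, in the spirit of SignificantTerms, can be sketched as follows. The tokenizer, the `doc_freq` table, and the default document frequency of 1 for unseen terms are all assumptions made for this example, not details from the talk.

```python
import math
import re
from collections import Counter
from typing import Dict, List


def significant_terms(page_text: str, doc_freq: Dict[str, int],
                      n_docs: int, k: int = 3) -> List[str]:
    """Return the k terms of the page with the highest tf-idf score."""
    tf = Counter(re.findall(r"[a-z0-9]+", page_text.lower()))

    def tfidf(term: str) -> float:
        # Unseen terms are assumed to be rare (document frequency 1).
        return tf[term] * math.log(n_docs / doc_freq.get(term, 1))

    # Sort by descending score; break ties alphabetically for determinism.
    return sorted(tf, key=lambda t: (-tfidf(t), t))[:k]


# Made-up document frequencies over a hypothetical corpus of one million pages.
toy_doc_freq = {"the": 900_000, "of": 800_000, "program": 200_000,
                "members": 100_000, "committee": 50_000, "1996": 5_000, "vldb": 100}
page = "VLDB 1996 program committee: the members of the VLDB 1996 committee"
top_terms = significant_terms(page, toy_doc_freq, n_docs=1_000_000)
```

Common words such as "the" are frequent on the page but receive a low score because they appear in almost every document.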

Related View Partners
Looks for different “views” of same data
Consider the VLDB conference page and a PC member’s home page
Find tables elsewhere on Web that contain values from SignificantTerms
On a researcher’s home page, by contrast, the data table probably contains a set of PC memberships, listing the conference name, year, and place. The researcher’s name and institution are probably in the surrounding text. So RVP works by looking for source terms that are matched a lot by other tables on the Web.
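
One possible reading of this idea, sketched under strong simplifying assumptions: take candidate terms from the source page and rank them by how many other tables contain the term as a cell value. The `rvp_rank` function and the tiny in-memory `corpus` are hypothetical; the actual algorithm works against a Web-scale table collection and is described in the paper.

```python
from typing import List


def rvp_rank(candidates: List[str], corpus: List[List[List[str]]]) -> List[str]:
    """Rank candidate source-page terms by the number of corpus tables with a matching cell."""
    counts = {}
    for term in candidates:
        counts[term] = sum(
            1 for table in corpus
            if any(cell.lower() == term for row in table for cell in row)
        )
    # sorted() is stable, so ties keep the original candidate order.
    return sorted(candidates, key=lambda t: -counts[t])


# Hypothetical PC-membership tables from two researchers' home pages.
corpus = [
    [["VLDB", "1996"], ["SIGMOD", "2005"]],
    [["VLDB", "2005"], ["ICDE", "1996"]],
]
ranked_terms = rvp_rank(["committee", "vldb", "1996"], corpus)
```

A term like "committee" scores well on tf-idf but rarely appears as a table cell, so it drops below terms that other tables actually use as data values.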

CONTEXT Experiments
On a query load of ~50 queries, we measure the percentage of queries that yield a good CONTEXT value in the top-1 returned by the system, the top-2, the top-3, etc.
[Figure: y-axis is the percentage of tables that yielded a good context term within the top-k; blue is SignificantTerms, red is RelatedViewPartners, green is a hybrid algorithm]
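
The top-k measure described here can be computed as below. This is a sketch of the metric only; the rankings and the sets of "good" terms are invented for the example, whereas in the experiments the judgments came from human evaluators.

```python
from typing import List, Set


def hit_rate_at_k(rankings: List[List[str]], good: List[Set[str]], k: int) -> float:
    """Fraction of queries with at least one good term among the top-k returned terms."""
    hits = sum(
        1 for ranked, ok in zip(rankings, good)
        if any(term in ok for term in ranked[:k])
    )
    return hits / len(rankings)


# Invented example: three queries, each with a ranked list and a set of good terms.
rankings = [["1996", "vldb"], ["home", "2005"], ["a", "b"]]
good_terms = [{"1996"}, {"2005"}, {"z"}]
rate_top1 = hit_rate_at_k(rankings, good_terms, k=1)
rate_top2 = hit_rate_at_k(rankings, good_terms, k=2)
```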

EXTEND Algorithms
Input: src table, src column, dst topic
Example: EXTEND(t, col=0, “publications”)
JoinTest: tests a single table for join-compatibility
  “City mayors”: yes
  “VLDB publications”: no
  Rank all tables by relevance to query topic (JoinTest uses search engine ranking for relevance)
  Select tables that are joinable to query column
  The joinability test computes the Jaccard score between the set of items in the query column and the set of items in the tested join column. If the Jaccard score passes a threshold (0.8, I believe), the table is considered joinable. Strict item equality is not required; items are compared with a string-edit-distance threshold test.
MultiJoin: finds a join-target tuple for each src tuple
  “City mayors”: maybe
  “VLDB publications”: yes
  For each cell in src column, perform topic search
  Cluster resulting tables, rank by column coverage
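
The joinability test can be sketched as a Jaccard score with approximate item matching. Note the substitutions: `difflib.SequenceMatcher` stands in for the string-edit-distance test, the 0.85 similarity cutoff is an arbitrary choice for this example, and the 0.8 Jaccard threshold is the value recalled (with some uncertainty) in the notes. Function names are hypothetical.

```python
from difflib import SequenceMatcher
from typing import Iterable


def similar(a: str, b: str, min_ratio: float = 0.85) -> bool:
    """Approximate string match; a stand-in for an edit-distance threshold test."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= min_ratio


def fuzzy_jaccard(query_col: Iterable[str], join_col: Iterable[str]) -> float:
    """Jaccard score where items count as shared if they match approximately."""
    left, right = sorted(set(query_col)), sorted(set(join_col))
    unmatched = list(right)
    matches = 0
    for item in left:
        # Greedy one-to-one matching: each join-column item is used at most once.
        hit = next((cand for cand in unmatched if similar(item, cand)), None)
        if hit is not None:
            unmatched.remove(hit)
            matches += 1
    union = len(left) + len(right) - matches
    return matches / union if union else 0.0


def join_test(query_col: Iterable[str], join_col: Iterable[str],
              threshold: float = 0.8) -> bool:
    """True if the candidate column is joinable with the query column."""
    return fuzzy_jaccard(query_col, join_col) >= threshold


cities = ["New York", "Chicago", "Boston", "Seattle", "Denver"]
mayors_table_cities = ["new york", "chicago", "Bostn", "seattle", "denver"]
joinable = join_test(cities, mayors_table_cities)
not_joinable = join_test(cities, ["Paris", "Chicago"])
```

The misspelled "Bostn" still matches "Boston", which is the point of not requiring strict item equality.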

EXTEND Early Experiments
JoinTest
  3 of 7 source tables
  60% of source tuples
  Single extension for each extended tuple
MultiJoin
  All 7 source tables
  33% of source tuples
  Avg 45.5 extensions for each extended tuple
    113 NYC mayors
    12 albums by Led Zeppelin
  Join Column            Topic Query
  countries              universities
  US states              governors
  US cities              mayors
  Film titles            characters
  UK political parties   MP
  Baseball teams         players
  Musical bands          albums
Not many of our source queries are actually EXTENDable: just 7. MultiJoin and JoinTest should probably be separate operators, or perhaps we could automatically choose one based on the source data. They are not really competitive; they apply in different situations, depending on the nature of the data.

Related Work
Octopus relies on info extraction work
Substantial work in data integration
Mashup Tools
Yahoo! Pipes
Marmite - [Wong and Hong, 2007]
Karma - [Tuchinda, et al., 2007]
CIMPLE - [DeRose, et al., 2007]
Potter’s Wheel - [Raman and Hellerstein, 2001]
Yahoo! Pipes allows the user to easily pipe together XML flows, but assumes structured data inputs (and that the user can find them). Karma populates a user’s database, but requires sources with formal declarations. CIMPLE tackles the Web integration problem, but still requires a lot of manual work by an administrator. It is intended not for transient integrations, but for long-lasting ones that are easy to maintain (though still relatively burdensome to create). Potter’s Wheel emphasizes live interaction for data cleaning. Its workbench-style interface is the closest to the Octopus model.

Octopus Contributions
Basic operators that enable Web data integration with very small user burden
Realistic and useful implementations for all three operators
Future work: efficient large-scale implementation
There are some serious performance challenges, especially for components that issue a huge number of Web queries (the MultiJoin algorithm) or that use a lot of non-adjacent word statistics from the Web (as in the SCPRank function). We are not sure yet what the efficiency/accuracy tradeoff is.
