Bootstrapping Pay-As-You-Go Data Integration Systems
Anish Das Sarma, Xin Dong, Alon Halevy. Proceedings of SIGMOD'08, Vancouver, British Columbia, Canada, June 2008.
Presented by Andrew Zitzelberger
Data Integration
- Offers a single-point interface to a set of data sources
- Mediated schema and semantic mappings
- Queries are posed through the mediated schema
Pay-as-you-go
- Can be useful in many contexts without full integration
- System starts with few (or inaccurate) semantic mappings
- Mappings are improved over time
Problem
- Requires significant upfront and ongoing effort
Contributions
- Self-configuring data integration system
- Provides an advanced starting point for pay-as-you-go systems
- Initial configuration provides good precision and recall
Algorithms
- Mediated schema generation
- Semantic mapping generation
Concept
- Probabilistic mediated schema
Probabilistic Mediated Schema
Mediated Schema Generation
1) Remove infrequent attributes
- Ensures the mediated schema contains the most relevant attributes
2) Construct a weighted graph
- Nodes are the remaining attributes
- Edge weights are values of a similarity measure s(a_i, a_j)
- Cull edges with weight below a threshold τ
3) Cluster nodes
- Each cluster is a connected component of the graph
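The clustering in steps 2 and 3 can be sketched as follows. This is a minimal illustration, not the paper's implementation: `sim` is a hypothetical pairwise similarity function (the paper uses Jaro-Winkler via SecondString), and clusters are the connected components of the graph whose edges are pairs with similarity at least τ.

```python
from itertools import combinations

def cluster_attributes(attrs, sim, tau):
    # Union-find over attributes; an edge joins a pair with sim >= tau.
    parent = {a: a for a in attrs}

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]  # path halving
            a = parent[a]
        return a

    for a, b in combinations(attrs, 2):
        if sim(a, b) >= tau:
            parent[find(a)] = find(b)

    # Group attributes by their component root.
    clusters = {}
    for a in attrs:
        clusters.setdefault(find(a), set()).add(a)
    return list(clusters.values())
```

For example, with a toy similarity that matches on a shared prefix, `phone` and `phone-no` land in one cluster while `name` and `address` stay singletons.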
Probabilistic Mediated Schema Generation
- Allow for an error ε in the weighted graph
  - Certain edges: weight ≥ τ + ε
  - Uncertain edges: τ - ε ≤ weight < τ + ε
  - Cull edges with weight < τ - ε
- Remove unnecessary uncertain edges
- Create a schema from every subset of the uncertain edges
Probabilistic Mediated Schema Generation
- Assign a probability to each resulting schema
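The enumeration behind the probabilistic mediated schema can be sketched as below. This is a simplified illustration, assuming the certain and uncertain edge lists have already been split using τ and ε: every subset of the uncertain edges, combined with all certain edges, yields one candidate clustering, and duplicates are merged. The paper derives each schema's probability from consistency with the data sources; that step is omitted here.

```python
from itertools import chain, combinations

def components(attrs, edges):
    # Connected components via union-find.
    parent = {a: a for a in attrs}

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    for a, b in edges:
        parent[find(a)] = find(b)
    comps = {}
    for a in attrs:
        comps.setdefault(find(a), set()).add(a)
    return [frozenset(c) for c in comps.values()]

def candidate_schemas(attrs, certain_edges, uncertain_edges):
    # One candidate clustering per subset of uncertain edges.
    subsets = chain.from_iterable(
        combinations(uncertain_edges, r)
        for r in range(len(uncertain_edges) + 1))
    distinct = {}
    for subset in subsets:
        clustering = components(attrs, certain_edges + list(subset))
        distinct[frozenset(clustering)] = clustering  # dedupe
    return list(distinct.values())
```

With one certain edge (a, b) and one uncertain edge (b, c), this yields two candidate schemas: {a, b}, {c} and {a, b, c}.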
Probabilistic Mediated Schema
Probabilistic Semantic Mappings
Probabilistic Mapping Generation
- Weighted correspondences
- Choose the consistent p-mapping with the maximum entropy
Probabilistic Mapping Generation
1) Enumerate one-to-one mappings
- Each mapping must contain a subset of the correspondences
2) Assign probabilities that maximize entropy
- Solve a constrained maximization problem
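The constrained maximization in step 2 has the following shape, reconstructed from the surrounding description: p_i is the probability assigned to candidate one-to-one mapping m_i, and w_{j,k} is the weight of the correspondence c_{j,k} between mediated attribute j and source attribute k.

```latex
\max_{p_1,\dots,p_l} \; -\sum_{i=1}^{l} p_i \log p_i
\quad \text{subject to} \quad
p_i \ge 0, \qquad
\sum_{i=1}^{l} p_i = 1, \qquad
\sum_{i \,:\, c_{j,k} \in m_i} p_i = w_{j,k} \;\; \text{for each } c_{j,k}.
```

Intuitively, the entropy objective picks the least-biased distribution over candidate mappings that is still consistent with every correspondence weight; the experiments solve this with Knitro.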
Probabilistic Mediated Schema Consolidation
Why?
- The user expects a single deterministic schema
- More efficient query answering
How?
Schema Consolidation Example
- M = {M1, M2}
- M1 contains clusters {a1, a2, a3}, {a4}, and {a5, a6}
- M2 contains clusters {a2, a3, a4} and {a1, a5, a6}
- The consolidated schema T contains {a1}, {a2, a3}, {a4}, and {a5, a6}
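In this example the consolidated schema T is the common refinement of the clusterings in M: the non-empty pairwise intersections of their clusters. A minimal sketch of that operation, assuming each schema is given as a list of attribute sets:

```python
from functools import reduce

def consolidate(schemas):
    # Common refinement: keep every non-empty pairwise intersection.
    def refine(s1, s2):
        return [c1 & c2 for c1 in s1 for c2 in s2 if c1 & c2]
    return reduce(refine, schemas)
```

Running this on M1 and M2 from the slide reproduces T: {a2, a3} and {a1} come from splitting {a1, a2, a3} across M2's two clusters, while {a4} and {a5, a6} survive intact.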
Probabilistic Mapping Consolidation
- Modify each p-mapping
  - Update the mappings to match the new mediated schema
- Modify probabilities
  - Scale each mapping's probability by its schema's probability Pr(M_i)
- Consolidate
  - Add all new mappings to the new set
  - If a mapping is already in the new set, add the probabilities
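The probability bookkeeping of the consolidation step can be sketched as follows. This is an illustration, not the paper's code: each mapping is represented as a frozenset of correspondence pairs, its probability is scaled by its schema's probability Pr(M_i), and probabilities of identical mappings are summed.

```python
def consolidate_pmappings(pmappings_per_schema, schema_probs):
    # pmappings_per_schema: one list of (mapping, probability) per schema
    # schema_probs: Pr(M_i) for each mediated schema
    combined = {}
    for pmaps, pr_schema in zip(pmappings_per_schema, schema_probs):
        for mapping, p in pmaps:
            key = frozenset(mapping)
            # Scale by Pr(M_i); sum if the mapping already appeared.
            combined[key] = combined.get(key, 0.0) + p * pr_schema
    return combined
```

Because each per-schema p-mapping sums to 1 and the Pr(M_i) sum to 1, the consolidated probabilities again sum to 1.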
Experimental Setup
- UDI: the data integration system
- Accepts select-project queries (over a single table)
- Source data: MySQL
- Query processor: Java
- Jaro-Winkler similarity computation: SecondString
- Entropy maximization problem: Knitro
- Operating system: Windows Vista
- CPU: Intel Core 2 GHz
- Memory: 2 GB
Experimental Setup
- τ = 0.85
- ε = 0.02
- θ = 10%
Experiments
- Domains: Movie, Car, People, Course, Bibliography
- Gold standards
  - Manually created for People and Bibliography
  - Partially created for the others
- 10 test queries
  - One to four attributes in the SELECT clause
  - Zero to three predicates in the WHERE clause
Results
- Estimated actual recall between 0.8 and 0.85
Experiments
- Compared against other methods:
  - Keyword search over MySQL: KEYWORDNAIVE, KEYWORDSTRUCT, KEYWORDSTRICT
  - SOURCE: unions the results from each data source
  - TOPMAPPING: considers only the p-mapping with the highest probability
Results
Experiments
- Compared against other query-answering methods:
  - SINGLEMED: a single deterministic mediated schema
  - UNIONALL: a single deterministic mediated schema containing a singleton cluster for each frequent source attribute
Results
Experiment and Results
- Quality of the mediated schema
- Tested against a manually created schema
Experiment and Results
- Setup efficiency: 3.5 minutes for 817 data sources
- Setup time increases roughly linearly with the number of data sources
- The maximum-entropy problem is the most time-consuming step
Future Work
- Different schema matchers
- Dealing with multiple-table sources
- Including multi-table schemas
- Normalizing mediated schemas
Analysis
- Positives
  - Strong support (proofs and experiments)
- Negatives
  - Level of detail
  - Pictures