Efficient Management of Inconsistent and Uncertain Data Renée J. Miller University of Toronto.

Efficient Management of Inconsistent and Uncertain Data Renée J. Miller University of Toronto

Contributors
- Ariel Fuxman, PhD thesis (Microsoft Search Labs; Jim Gray SIGMOD 2008 Dissertation Award)
- Periklis Andritsos, PhD
- Jiang Du, MS
- Elham Fazli, MS
- Diego Fuxman, Undergrad

Dirty Databases
- The presence of dirty data is a major problem in enterprises
- Traditional solution: data cleaning
- [Cartoon caption: “No. I don’t see any problem with the data”]

Limitations of Data Cleaning
- Semi-automatic process
- Requires highly qualified domain experts
- Time consuming
- May not be possible to wait until the database is clean
- Operational systems answer queries assuming clean data

Our Work
- Identify classes of queries for which we can obtain meaningful answers from potentially dirty databases
- Show how to do this efficiently while reusing existing database technology

Why is this Business Intelligence?
- Business intelligence (BI) refers to technologies, applications, and practices for the collection, integration, analysis, and presentation of information.
- The goal of BI is to support better decision making, based on information.
- A DBMS should provide meaningful query answers even over dirty data.

Outline
- Introduction
- Semantics for dirty databases
- Contributions
- Conclusions

A Data Integration Example
- Integrating customer data…
- [Figure: Sales, Shipping, Customer Support, Web Forms, and Demographic Data feed into an Integrated Customer Database]

Matching and Merging
- Matching and merging are two fundamental tasks in data integration
- [Figure: Web and Sales sources]

True Disagreement Between Sources
- What’s Peter’s salary?
- [Figure: Web and Sales sources]

Inconsistent Integrated Databases
- In the absence of complete resolution rules…
- [Figure: the Web and Sales sources each SATISFY the custid KEY; the inconsistent integrated database VIOLATES the custid KEY]

Querying Inconsistent Databases
- Example: offering a Platinum credit card…
- Query: “Get customers who make more than 100K”
- Answer over the integrated database: Peter, Paul, Mary
- Are we sure that we want to offer a card to Peter?
- [Figure: integrated table with tuples labeled by source (sales, web, sales/web)]

Querying Inconsistent Databases
- Aggressive: get customers who possibly make more than 100K. Answer: Peter, Paul, Mary
- Conservative: get customers who certainly make more than 100K. Answer: Paul, Mary
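
The two semantics above can be sketched in a few lines of Python. This is only an illustration, assuming the toy income values from the Uncertain Data slide later in this deck (in thousands) and a single key constraint on custid; the function name `answers` is hypothetical.

```python
from collections import defaultdict

# Toy integrated table (custid, income in thousands); custid is meant to be a key.
integrated = [
    ("Peter", 40), ("Peter", 200),
    ("Paul", 400),
    ("Mary", 110), ("Mary", 130),
]

def answers(rows, threshold):
    """Return (possible, certain) custids whose income exceeds threshold."""
    incomes = defaultdict(list)
    for custid, income in rows:
        incomes[custid].append(income)
    # Aggressive: at least one conflicting tuple satisfies the condition.
    possible = {c for c, vals in incomes.items() if any(v > threshold for v in vals)}
    # Conservative: every conflicting tuple satisfies the condition.
    certain = {c for c, vals in incomes.items() if all(v > threshold for v in vals)}
    return possible, certain

possible, certain = answers(integrated, 100)
print(sorted(possible))  # ['Mary', 'Paul', 'Peter']
print(sorted(certain))   # ['Mary', 'Paul']
```

Peter drops out of the conservative answer because one of his two conflicting tuples reports an income of 40K.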

Formal Semantics
- Related to semantics for querying incomplete data [Imielinski Lipski 84, Abiteboul Duschka 98]
  - Possible world: “complete” databases
- Consistent answers
  - Proposed by Arenas, Bertossi, and Chomicki in 1999
  - Corresponds to conservative semantics
  - Possible world: “consistent” databases

Consistent Answers
- Key: custid
- [Figure: the inconsistent database, with tuples labeled by source (sales, web, sales/web), and its repairs]

Consistent Answers
- Consistent answers: answers obtained no matter which repair we choose
- Query: “Get customers who make more than 100K”
- [Figure: the query q evaluated over each repair]
- Consistent answer = {Paul, Mary}
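
The definition above can be checked by brute force on a toy instance. The sketch below assumes the income values from the Uncertain Data slide later in this deck, a single key constraint on custid, and repairs that keep exactly one tuple per key value; the helper names are hypothetical, and enumerating repairs like this is only feasible for tiny examples.

```python
from collections import defaultdict
from itertools import product

# Toy integrated table (custid, income in thousands); custid is meant to be a key.
integrated = [
    ("Peter", 40), ("Peter", 200),
    ("Paul", 400),
    ("Mary", 110), ("Mary", 130),
]

def repairs(rows):
    """Yield every repair: a subset that keeps exactly one tuple per custid."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[0]].append(row)
    for choice in product(*groups.values()):
        yield list(choice)

def query(rows):
    """Get customers who make more than 100K."""
    return {custid for custid, income in rows if income > 100}

def consistent_answers(rows):
    """Intersect the query result over all repairs."""
    return set.intersection(*(query(r) for r in repairs(rows)))

all_repairs = list(repairs(integrated))
print(len(all_repairs))                        # 4
print(sorted(consistent_answers(integrated)))  # ['Mary', 'Paul']
```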

Outline
- Introduction
- Semantics for dirty databases
- Contributions
- Conclusions

When We Started…
- Semantics well understood
- Problem: potentially HUGE number of repairs!
- Negative results [Chomicki et al. 02, Arenas et al. 01, Cali et al. 04]
- Few tractability results [Arenas et al. 99, Arenas et al. 01]
- Logic programming approaches [Bravo and Bertossi 03, Eiter et al. 03]
  - Expressive queries and constraints
  - Computationally expensive
  - Applicable only to small databases with a small number of inconsistencies
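
To see why the number of repairs can be huge, here is a short count under simplifying assumptions that are not stated on the slide: a single relation, only a key constraint, and repairs that keep exactly one tuple per key value. The symbols m (number of distinct key values) and c_i (number of tuples sharing the i-th key value) are introduced here for illustration.

```latex
\#\text{repairs} \;=\; \prod_{i=1}^{m} c_i ,
\qquad
c_i = 2 \ \text{for all } i \;\Longrightarrow\; \#\text{repairs} = 2^{m}.
```

For the three-customer example, with 2, 1, and 2 tuples per custid, this gives 2 · 1 · 2 = 4 repairs; with two conflicting tuples for each of m inconsistent key values, the count grows as 2^m, which is why enumerating repairs does not scale.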

Our Proposal: ConQuer
- [Figure: ConQuer’s rewriting algorithm takes a SQL query q and the keys, and produces a rewritten SQL query Q*; a commercial database engine evaluates Q* over the inconsistent database and returns the consistent answer to q]
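
To give a feel for what a rewriting looks like, here is a minimal sketch for the simplest possible case: a single-table selection query under a key on custid. It is not ConQuer's algorithm or its actual output; the table name `customer`, the hand-written rewritten query, and the use of SQLite are assumptions made for illustration, with data taken from the Uncertain Data slide later in this deck.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# No key is declared on custid, so conflicting tuples can be stored.
conn.execute("CREATE TABLE customer (custid TEXT, income INTEGER)")
conn.executemany(
    "INSERT INTO customer VALUES (?, ?)",
    [("Peter", 40), ("Peter", 200), ("Paul", 400), ("Mary", 110), ("Mary", 130)],
)

# Original query q: customers who make more than 100K.
original = "SELECT DISTINCT custid FROM customer WHERE income > 100"

# Hand-written rewriting: keep a customer only if no tuple with the same
# custid fails the condition, i.e. the customer qualifies in every repair.
rewritten = """
SELECT DISTINCT c.custid
FROM customer c
WHERE c.income > 100
  AND NOT EXISTS (
      SELECT 1 FROM customer c2
      WHERE c2.custid = c.custid
        AND NOT (c2.income > 100)
  )
"""

naive = sorted(row[0] for row in conn.execute(original))
consistent = sorted(row[0] for row in conn.execute(rewritten))
print(naive)       # ['Mary', 'Paul', 'Peter']
print(consistent)  # ['Mary', 'Paul']
```

The rewritten query runs on the inconsistent table as-is, so the database engine does the work and no repair is ever materialized.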

Class of Rewritable Queries
- ConQuer handles a broad class of SPJ queries with
  - Set semantics
  - Bag semantics, grouping, and aggregation
- No restrictions on
  - Number of relations
  - Number of joins
  - Conditions or built-in predicates
  - Key-to-key joins
- The class is “maximal”

Why not all SPJ queries?
- Some SPJ queries cannot be rewritten into SQL
  - Consistent query answering is coNP-complete even for some SPJ queries and key constraints
- Maximality of ConQuer’s class
  - Minimal relaxations lead to intractability
- Restrictions only on
  - Nonkey-to-nonkey joins
  - Self joins
  - Nonkey-to-key joins that form a cycle
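
The three restrictions listed above can be turned into a rough syntactic check. The sketch below is a simplification based only on this slide, not the formal class definition, which may include further conditions; the `Join` type and the function names are hypothetical, and a join on only part of a composite key is treated here as a nonkey join (an assumption).

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Join:
    left: str           # relation on the left side of the equality
    left_is_key: bool   # True if the left attribute(s) form the full key
    right: str
    right_is_key: bool

def has_cycle(edges):
    """Detect a directed cycle with a depth-first search."""
    visiting, done = set(), set()
    def visit(node):
        visiting.add(node)
        for nxt in edges.get(node, ()):
            if nxt in visiting:
                return True
            if nxt not in done and visit(nxt):
                return True
        visiting.discard(node)
        done.add(node)
        return False
    return any(n not in done and visit(n) for n in list(edges))

def violations(relations, joins):
    """List which of the three restrictions a query runs into."""
    problems = []
    if len(set(relations)) != len(relations):
        problems.append("self join")
    edges = {}
    for j in joins:
        if not j.left_is_key and not j.right_is_key:
            problems.append("nonkey-to-nonkey join")
        elif j.left_is_key != j.right_is_key:
            # Direct the edge from the nonkey side to the key side.
            src, dst = (j.right, j.left) if j.left_is_key else (j.left, j.right)
            edges.setdefault(src, set()).add(dst)
    if has_cycle(edges):
        problems.append("cycle of nonkey-to-key joins")
    return problems

# Joins of TPC-H Query 10 (the key flags are assumptions about the schema).
q10 = violations(
    ["customer", "orders", "lineitem", "nation"],
    [Join("customer", True, "orders", False),
     Join("lineitem", False, "orders", True),
     Join("customer", False, "nation", True)],
)
cyclic = violations(
    ["r", "s"],
    [Join("r", False, "s", True), Join("s", False, "r", True)],
)
print(q10)     # []
print(cyclic)  # ['cycle of nonkey-to-key joins']
```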

Example: A Rewritable Query (TPC-H Query 10)

    SELECT c_custkey, c_name,
           sum(l_extendedprice * (1 - l_discount)) as revenue,
           c_acctbal, n_name, c_address, c_phone, c_comment
    FROM customer, orders, lineitem, nation
    WHERE c_custkey = o_custkey
      and l_orderkey = o_orderkey
      and o_orderdate >= ' '
      and o_orderdate < date(' ') + 3 MONTHS
      and l_returnflag = 'R'
      and c_nationkey = n_nationkey
    GROUP BY c_custkey, c_name, c_acctbal, c_phone, n_name, c_address, c_comment
    ORDER BY revenue desc

Rewritings Can Get Quite Complex
- [Figure: rewriting of TPC-H Query 10]
- Can this rewriting be executed efficiently?
- 1.7 overhead (20 GB database, 5% inconsistency)

Experimental Evaluation
- Goals
  - Quantify the overhead of the rewritings
  - Assess the scalability of the approach
  - Determine the sensitivity of the rewritten queries to the level of inconsistency of the instance
- Queries and databases
  - Representative decision support queries (TPC-H benchmark)
  - TPC-H databases, altered to introduce inconsistencies
- Database parameters
  - Database size
  - Percentage of the database that is inconsistent
  - Conflicts per key value (in the inconsistent portion)

Scalability
- [Figure: overhead plotted against database size (GB), with 5% inconsistent tuples and 2 conflicts per inconsistent key value; worst case 5.8 overhead, best case 1.2 overhead; selectivity values not recoverable]

Contributions – Theory
- Formal characterization of a broad class of queries
  - For which computing consistent answers is tractable under key constraints
  - That can be rewritten into first-order/SQL
- Query rewriting algorithms for a class of Select-Project-Join queries
  - With set semantics
  - With bag semantics, grouping, and aggregation
- Maximality of the class of queries

Contributions – Practice
- Implementation of ConQuer
  - Designed to compute consistent answers efficiently
  - Multiple rewriting strategies
- Experimental validation of efficiency and scalability
  - Representative queries from TPC-H
  - Large databases

Uncertain Data

Source tables (Web and Sales):

    custid  …  income
    Peter   …  40K
    Paul    …  400K
    Mary    …  110K

    custid  …  income
    Peter   …  200K
    Paul    …  400K
    Mary    …  130K

Integrated Database:

    custid  …  income
    Peter   …  40K
    Peter   …  200K
    Paul    …  400K
    Mary    …  110K
    Mary    …  130K

- PROVENANCE INFORMATION (e.g., source reputation)

Publications and Demo
- These and other contributions appear in ICDT05/JCSS06, SIGMOD05, ICDE06, PODS06/TODS06, VLDB06
- Demo given at VLDB

Outline
- Introduction
- Semantics for dirty databases
- Contributions
- Conclusions

A Virtuous Cycle
- [Figure: cycle between Query Answering and Data Integration]
- Recognize and characterize inconsistent data
- Use knowledge about inconsistencies to:
  - give better answers
  - suggest ways to clean the database

Beyond the Enterprise
- Can we apply principled models of inconsistency or uncertainty to the Web?
- Different assumptions
  - Uncertainty in queries
  - There’s never a “true” answer
- Challenge
  - Build models based on user preferences
  - Leverage massive repositories of user behavior data

THANK YOU
- Plug: Discovering Data Quality Rules, Fei Chiang, Thursday 11:15am, Research Session 33