Efficient Management of Inconsistent and Uncertain Data
  Renée J. Miller
  University of Toronto
Contributors
  Ariel Fuxman, PhD Thesis (Microsoft Search Labs; Jim Gray SIGMOD 2008 Dissertation Award)
  Periklis Andritsos, PhD
  Jiang Du, MS
  Elham Fazli, MS
  Diego Fuxman, Undergrad
Dirty Databases
  The presence of dirty data is a major problem in enterprises
  Traditional solution: data cleaning
  [Cartoon caption: "No. I don't see any problem with the data"]
Limitations of Data Cleaning
  Semi-automatic process
  Requires highly qualified domain experts
  Time-consuming
  May not be possible to wait until the database is clean
  Operational systems answer queries assuming clean data
Our Work
  Identify classes of queries for which we can obtain meaningful answers from potentially dirty databases
  Show how to do it efficiently and reusing existing database technology
Why is this Business Intelligence?
  Business intelligence (BI) refers to technologies, applications and practices for the collection, integration, analysis, and presentation of information.
  The goal of BI is to support better decision making, based on information.
  DBMS should provide meaningful query answers even over data that is dirty
Outline
  Introduction
  Semantics for dirty databases
  Contributions
  Conclusions
A Data Integration Example
  Integrating customer data…
  [Diagram: Sales, Shipping, Customer Support, Web Forms, and Demographic Data feed into an Integrated Customer Database]
Matching and Merging
  Matching and merging are two fundamental tasks in data integration
  [Figure: the Web and Sales tables]
True Disagreement Between Sources
  What's Peter's salary?
  [Figure: the Web and Sales tables]
Inconsistent Integrated Databases
  In the absence of complete resolution rules…
  [Figure: the Web and Sales tables each SATISFY the custid KEY; the Inconsistent Integrated Database VIOLATES the custid KEY]
Querying Inconsistent Databases
  Example: Offering a Platinum credit card…
  Query: “Get customers who make more than 100K”
  [Figure: integrated customer table, tuples labeled by source: sales, web, sales/web, sales, web]
  Answer: Peter, Paul, Mary
  Are we sure that we want to offer a card to Peter?
Querying Inconsistent Databases
  Aggressive: Get customers who possibly make more than 100K
    Peter, Paul, Mary
  Conservative: Get customers who certainly make more than 100K
    Paul, Mary
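The two readings can be sketched in a few lines of Python. This is only an illustration under assumptions: the table layout and the function name `possible_and_certain` are hypothetical, and the income values are borrowed from the Web/Sales example on the Uncertain Data slide. For this single-table selection query, a customer is a possible answer if some tuple with that custid qualifies, and a certain answer if every tuple with that custid qualifies.

```python
from collections import defaultdict

# Integrated customer table (custid, income); custid is meant to be a key.
# Values are illustrative, taken from the Web/Sales example in this talk.
customers = [
    ("Peter", 40_000), ("Peter", 200_000),
    ("Paul", 400_000),
    ("Mary", 110_000), ("Mary", 130_000),
]

def possible_and_certain(rows, threshold):
    """Return (possible, certain) answers to 'income > threshold'."""
    incomes = defaultdict(list)
    for custid, income in rows:
        incomes[custid].append(income)
    # Aggressive reading: at least one conflicting tuple qualifies.
    possible = {c for c, vals in incomes.items() if any(v > threshold for v in vals)}
    # Conservative reading: every conflicting tuple qualifies.
    certain = {c for c, vals in incomes.items() if all(v > threshold for v in vals)}
    return possible, certain

possible, certain = possible_and_certain(customers, 100_000)
print(sorted(possible))
print(sorted(certain))
```

Peter appears only under the aggressive reading because one of his two conflicting tuples falls below the threshold.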
Formal Semantics
  Related to semantics for querying incomplete data [Imielinski Lipski 84, Abiteboul Duschka 98]
    Possible world: “complete” databases
  Consistent answers
    Proposed by Arenas, Bertossi, and Chomicki in 1999
    Corresponds to conservative semantics
    Possible world: “consistent” databases
Consistent Answers
  Key: custid
  [Figure: the inconsistent database (tuples labeled sales, web, sales/web, sales, web) and its repairs]
Consistent Answers
  CONSISTENT ANSWERS: answers obtained no matter which repair we choose
  Query = “Get customers who make more than 100K”
  [Figure: the query q evaluated on each of the four repairs]
  CONSISTENT ANSWER = {Paul, Mary}
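The definition can be checked directly by brute force: enumerate every repair, run the query on each, and intersect the results. The sketch below assumes that a repair under a key constraint keeps exactly one tuple per key value; the helper names (`repairs`, `query`, `consistent_answers`) and the income values are illustrative, not part of the talk's system.

```python
from collections import defaultdict
from itertools import product

customers = [
    ("Peter", 40_000), ("Peter", 200_000),
    ("Paul", 400_000),
    ("Mary", 110_000), ("Mary", 130_000),
]

def repairs(rows):
    """Yield every repair: one tuple chosen for each key value (custid)."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[0]].append(row)
    for choice in product(*groups.values()):
        yield list(choice)

def query(db, threshold=100_000):
    """'Get customers who make more than 100K' over a consistent database."""
    return {custid for custid, income in db if income > threshold}

def consistent_answers(rows):
    """Answers returned by the query on every repair."""
    return set.intersection(*(query(r) for r in repairs(rows)))

all_repairs = list(repairs(customers))
print(len(all_repairs))
print(sorted(consistent_answers(customers)))
```

This enumeration is only viable for toy inputs, since the number of repairs is the product of the sizes of the conflicting groups.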
Outline
  Introduction
  Semantics for dirty databases
  Contributions
  Conclusions
When We Started…
  Semantics well understood
  Problem: potentially HUGE number of repairs!
  Negative results [Chomicki et al. 02, Arenas et al. 01, Cali et al. 04]
  Few tractability results [Arenas et al. 99, Arenas et al. 01]
  Logic programming approaches [Bravo and Bertossi 03, Eiter et al. 03]
    Expressive queries and constraints
    Computationally expensive
    Applicable only to small databases with small number of inconsistencies
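To see why the number of repairs can be huge, assume again that a repair keeps one tuple per key value. The number of repairs is then the product of the sizes of the groups of tuples that share a key value, so it grows exponentially with the number of conflicting keys. A small sketch (the function name `count_repairs` and the synthetic key names are hypothetical):

```python
from collections import Counter
from math import prod

def count_repairs(keys):
    """Number of repairs, given the key value of every tuple in the table."""
    # Each key value contributes one independent choice among its tuples.
    return prod(Counter(keys).values())

# The running example: Peter and Mary each have two conflicting tuples.
small = count_repairs(["Peter", "Peter", "Paul", "Mary", "Mary"])

# 50 key values with 2 conflicting tuples each already give 2**50 repairs.
large = count_repairs([f"cust{i}" for i in range(50) for _ in range(2)])

print(small, large)
```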
Our Proposal: ConQuer
  [Diagram: an SQL query q and the keys are given to ConQuer's rewriting algorithm, which produces a rewritten SQL query Q*; a commercial database engine runs Q* over the inconsistent database and returns the consistent answer to q]
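The idea of answering through a rewritten query can be illustrated on the running example. The rewriting below is hand-written for this one selection query and is not the output of ConQuer's algorithm; the table name, column names, and data are assumptions, and SQLite stands in for the commercial engine. The rewritten query keeps a customer only if no tuple with the same custid fails the income condition, so the engine never has to materialize repairs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# custid is the intended key, but it is deliberately not enforced here.
conn.execute("CREATE TABLE customer (custid TEXT, income INTEGER)")
conn.executemany(
    "INSERT INTO customer VALUES (?, ?)",
    [("Peter", 40000), ("Peter", 200000), ("Paul", 400000),
     ("Mary", 110000), ("Mary", 130000)],
)

# Original query q, evaluated as if the data were clean.
q = "SELECT DISTINCT custid FROM customer WHERE income > 100000"

# Hand-written rewriting for this query (illustrative only): a customer
# qualifies if no tuple with the same key violates the selection condition.
q_rewritten = """
SELECT DISTINCT c.custid
FROM customer c
WHERE c.income > 100000
  AND NOT EXISTS (
      SELECT 1 FROM customer c2
      WHERE c2.custid = c.custid AND c2.income <= 100000)
"""

original = {row[0] for row in conn.execute(q)}
consistent = {row[0] for row in conn.execute(q_rewritten)}
print(sorted(original))
print(sorted(consistent))
```

Because the rewriting is ordinary SQL, it benefits from the engine's existing optimizer and indexes, which is the point of reusing existing database technology.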
Class of Rewritable Queries
  ConQuer handles a broad class of SPJ queries with
    Set semantics
    Bag semantics, grouping, and aggregation
  No restrictions on
    Number of relations
    Number of joins
    Conditions or built-in predicates
    Key-to-key joins
  The class is “maximal”
Why not all SPJ queries?
  Some SPJ queries cannot be rewritten into SQL
    Consistent query answering is coNP-complete even for some SPJ queries and key constraints
  Maximality of ConQuer's class
    Minimal relaxations lead to intractability
  Restrictions only on
    Nonkey-to-nonkey joins
    Self joins
    Nonkey-to-key joins that form a cycle
Example: A Rewritable Query
  TPC-H Query 10
    SELECT c_custkey, c_name,
           sum(l_extendedprice * (1 - l_discount)) as revenue,
           c_acctbal, n_name, c_address, c_phone, c_comment
    FROM customer, orders, lineitem, nation
    WHERE c_custkey = o_custkey
      and l_orderkey = o_orderkey
      and o_orderdate >= ' '
      and o_orderdate < date(' ') + 3 MONTHS
      and l_returnflag = 'R'
      and c_nationkey = n_nationkey
    GROUP BY c_custkey, c_name, c_acctbal, c_phone, n_name, c_address, c_comment
    ORDER BY revenue desc
Rewritings Can Get Quite Complex
  [Figure: the rewriting of TPC-H Query 10]
  Can this rewriting be executed efficiently?
  1.7x overhead (20 GB database, 5% inconsistency)
Experimental Evaluation
  Goals
    Quantify the overhead of the rewritings
    Assess the scalability of the approach
    Determine sensitivity of the rewritten queries to level of inconsistency of the instance
  Queries and databases
    Representative decision support queries (TPC-H benchmark)
    TPC-H databases, altered to introduce inconsistencies
  Database parameters
    Database size
    Percentage of the database that is inconsistent
    Conflicts per key value (in inconsistent portion)
Scalability
  5% inconsistent tuples, 2 conflicts per inconsistent key value
  [Figure: two panels plotted against Size (GB), each annotated with a selectivity percentage whose value is not recoverable]
  Worst case: 5.8x overhead
  Best case: 1.2x overhead
Contributions – Theory
  Formal characterization of a broad class of queries
    For which computing consistent answers is tractable under key constraints
    That can be rewritten into first-order/SQL
  Query rewriting algorithms for a class of Select-Project-Join queries
    With set semantics
    With bag semantics, grouping, and aggregation
  Maximality of the class of queries
Contributions – Practice
  Implementation of ConQuer
    Designed to compute consistent answers efficiently
    Multiple rewriting strategies
  Experimental validation of efficiency and scalability
    Representative queries from TPC-H
    Large databases
Uncertain Data
  Web
    custid  …  income
    Peter   …  40K
    Paul    …  400K
    Mary    …  110K
  Sales
    custid  …  income
    Peter   …  200K
    Paul    …  400K
    Mary    …  130K
  Integrated Database
    custid  …  income
    Peter   …  40K
    Peter   …  200K
    Paul    …  400K
    Mary    …  110K
    Mary    …  130K
  PROVENANCE INFORMATION (e.g., source reputation)
Publications and Demo
  These and other contributions appear in
    ICDT05/JCSS06
    SIGMOD05
    ICDE06
    PODS06/TODS06
    VLDB06
  Demo given at VLDB
Outline
  Introduction
  Semantics for dirty databases
  Contributions
  Conclusions
A Virtuous Cycle
  [Diagram: a cycle between Data Integration and Query Answering]
  Recognize and characterize inconsistent data
  Use knowledge about inconsistencies to:
    give better answers
    suggest ways to clean the database
Beyond the Enterprise
  Can we apply principled models of inconsistency or uncertainty to the Web?
  Different assumptions
    Uncertainty in queries
    There's never a “true” answer
  Challenge
    Build models based on user preferences
    Leverage massive repositories of user behavior data
THANK YOU
  Plug: Discovering Data Quality Rules, Fei Chiang
  Thursday 11:15am, Research Session 33