Uncertainty Lineage Data Bases Very Large Data Bases 19752006.

Slides:



Advertisements
Similar presentations
Uncertainty in Data Integration Ai Jing
Advertisements

The 20th International Conference on Software Engineering and Knowledge Engineering (SEKE2008) Department of Electrical and Computer Engineering
LIVE A lineage-supported, versioned DBMS  Anish Das Sarma  Martin Theobald  Jennifer Widom.
Transactions - Concurrent access & System failures - Properties of Transactions - Isolation Levels 4/13/2015Databases21.
Efficient Processing of Top- k Queries in Uncertain Databases Ke Yi, AT&T Labs Feifei Li, Boston University Divesh Srivastava, AT&T Labs George Kollios,
Modeling and Querying Possible Repairs in Duplicate Detection George Beskales Mohamed A. Soliman Ihab F. Ilyas Shai Ben-David.
CS4432: Database Systems II
A Paper on RANDOM SAMPLING OVER JOINS by SURAJIT CHAUDHARI RAJEEV MOTWANI VIVEK NARASAYYA PRESENTED BY, JEEVAN KUMAR GOGINENI SARANYA GOTTIPATI.
Cleaning Uncertain Data with Quality Guarantees Reynold Cheng, Jinchuan Chen, Xike Xie 2008 VLDB Presented by SHAO Yufeng.
Query Optimization of Frequent Itemset Mining on Multiple Databases Mining on Multiple Databases David Fuhry Department of Computer Science Kent State.
Queries with Difference on Probabilistic Databases Sanjeev Khanna Sudeepa Roy Val Tannen University of Pennsylvania 1.
The Volcano/Cascades Query Optimization Framework
Rulebase Expert System and Uncertainty. Rule-based ES Rules as a knowledge representation technique Type of rules :- relation, recommendation, directive,
Probabilistic Histograms for Probabilistic Data Graham Cormode AT&T Labs-Research Antonios Deligiannakis Technical University of Crete Minos Garofalakis.
Dynamic Bayesian Networks (DBNs)
Intelligent systems Lecture 6 Rules, Semantic nets.
Top-K Query Evaluation on Probabilistic Data Christopher Ré, Nilesh Dalvi and Dan Suciu University of Washington.
Efficient Query Evaluation on Probabilistic Databases
A Fairy Tale of Greedy Algorithms Yuli Ye Joint work with Allan Borodin, University of Toronto.
ANHAI DOAN ALON HALEVY ZACHARY IVES Chapter 13: Incorporating Uncertainty into Data Integration PRINCIPLES OF DATA INTEGRATION.
Managing Uncertain Data Anish Das Sarma Stanford University May 19, Anish Das Sarma.
Indexing the imprecise positions of moving objects Xiaofeng Ding and Yansheng Lu Department of Computer Science Huazhong University of Science & Technology.
LUDWIG- MAXIMILIANS- UNIVERSITY MUNICH DATABASE SYSTEMS GROUP DEPARTMENT INSTITUTE FOR INFORMATICS Probabilistic Similarity Queries in Uncertain Databases.
Ming Hua, Jian Pei Simon Fraser UniversityPresented By: Mahashweta Das Wenjie Zhang, Xuemin LinUniversity of Texas at Arlington The University of New South.
Query Evaluation. An SQL query and its RA equiv. Employees (sin INT, ename VARCHAR(20), rating INT, age REAL) Maintenances (sin INT, planeId INT, day.
Sensitivity Analysis & Explanations for Robust Query Evaluation in Probabilistic Databases Bhargav Kanagal, Jian Li & Amol Deshpande.
Trio: A System for Data, Uncertainty, and Lineage Search “stanford trio”
Trio A System for Integrated Management of Data, Accuracy, and Lineage Jennifer Widom Stanford University.
Trio: A System for Data, Uncertainty, and Lineage Search “stanford trio”
Trio: A System for Data, Uncertainty, and Lineage Jennifer Widom et al Stanford University.
1 Provenance Semirings T.J. Green, G. Karvounarakis, V. Tannen University of Pennsylvania Principles of Provenance (PrOPr) Philadelphia, PA June 26, 2007.
Representation Formalisms for Uncertain Data Jennifer Widom with Anish Das Sarma Omar Benjelloun Alon Halevy Trio and other participants in the Trio Project.
Trio: A System for Data, Uncertainty, and Lineage Jennifer Widom Stanford University.
CS 286, UC Berkeley, Spring 2007, R. Ramakrishnan 1 What is a Query Language? Universality of Data Retrieval Languages, Aho and Ullman, POPL 1979 Raghu.
Chapter 4: Organizing and Manipulating the Data in Databases
Sensor Data Management: Challenges and (some) Solutions Amol Deshpande, University of Maryland.
ULDBs: Databases with Uncertainty and Lineage O. Benjelloun, A. Das Sarma, A. Halevy, J. Widom.
VLDB 2012 Mining Frequent Itemsets over Uncertain Databases Yongxin Tong 1, Lei Chen 1, Yurong Cheng 2, Philip S. Yu 3 1 The Hong Kong University of Science.
A Model and Algorithms for Pricing Queries Tang Ruiming, Wu Huayu, Bao Zhifeng, Stephane Bressan, Patrick Valduriez.
A Survey Based Seminar: Data Cleaning & Uncertain Data Management Speaker: Shawn Yang Supervisor: Dr. Reynold Cheng Prof. David Cheung
Towards Robust Indexing for Ranked Queries Dong Xin, Chen Chen, Jiawei Han Department of Computer Science University of Illinois at Urbana-Champaign VLDB.
A COURSE ON PROBABILISTIC DATABASES Dan Suciu University of Washington June, 2014Probabilistic Databases - Dan Suciu 1.
K-Hit Query: Top-k Query Processing with Probabilistic Utility Function SIGMOD2015 Peng Peng, Raymond C.-W. Wong CSE, HKUST 1.
Webdamlog and Contradictions Daniel Deutch Tel Aviv University Joint work with Serge Abiteboul, Meghyn Bienvenu, Victor Vianu.
Chapter 31 INTRODUCTION TO ALGEBRAIC CODING THEORY.
Joseph M. Hellerstein Peter J. Haas Helen J. Wang Presented by: Calvin R Noronha ( ) Deepak Anand ( ) By:
Trio-One: Layering Uncertainty and Lineage on a Conventional DBMS Martin Theobald Jennifer Widom Stanford University.
272: Software Engineering Fall 2012 Instructor: Tevfik Bultan Lecture 9: Test Generation from Models.
Fall CSE330/CIS550: Introduction to Database Management Systems Prof. Susan Davidson Office: 278 Moore Office hours: TTh
Chapter 18 Object Database Management Systems. Outline Motivation for object database management Object-oriented principles Architectures for object database.
The Object-Oriented Database System Manifesto Malcolm Atkinson, François Bancilhon, David deWitt, Klaus Dittrich, David Maier, Stanley Zdonik DOOD'89,
1 Omar Benjelloun - New Bases for New Data New Bases for New Data Omar Benjelloun Stanford University January 27th, 2006.
Jennifer Widom Relational Databases The Relational Model.
Scheduling with Uncertain Resources Eugene Fink, Jaime G. Carbonell, Ulas Bardak, Alex Carpentier, Steven Gardiner, Andrew Faulring, Blaze Iliev, P. Matthew.
A PID Neural Network Controller
Probabilistic Skylines on Uncertain Data (VLDB2007) Jian Pei et al Supervisor: Dr Benjamin Kao Presenter: For Date: 22 Feb 2008 ??: the possible world.
1 Working Models for Uncertain Data Anish Das Sarma, Omar Benjelloun, Alon Halevy, Jennifer Widom Stanford InfoLab.
Approximate Lineage for Probabilistic Databases
TRIO Data Uncertainty Lineage Data Model Query Language System
Trio A System for Data, Uncertainty, and Lineage
Chapter 15 QUERY EXECUTION.
Probabilistic Data Management
Data Integration with Dependent Sources
Lecture 16: Probabilistic Databases
Steven Lindell Scott Weinstein
Trio A System for Integrated Management of Data, Accuracy, and Lineage
The Trio System for Data, Uncertainty, and Lineage: Overview and Demo
Probabilistic Databases
Non-Standard-Datenbanken
Query Optimization.
Presentation transcript:

Uncertainty Lineage Data Bases Very Large Data Bases

ULDBs: Databases with Uncertainty and Lineage Omar Benjelloun, Anish Das Sarma, Alon Halevy, Jennifer Widom Stanford InfoLab

3 Motivation Many applications involve data that is uncertain (approximate, probabilistic, inexact, incomplete, imprecise, fuzzy, inaccurate,...) Many of the same applications need to track the lineage of their data Neither uncertainty nor lineage are supported by conventional DBMSs Coincidence or Fate?

4 Sample Applications Needing Uncertainty and Lineage Scientific databases Sensor databases Data cleaning Data integration Information extraction

5 Trio Project Building a new kind of DBMS in which: 1.Data 2.Uncertainty 3.Lineage are all first-class interrelated concepts

6 Lineage and Uncertainty Lots of independent work in lineage and uncertainty (related work at end of talk) Turns out: The connection between uncertainty and lineage goes deeper than just a shared need by several applications Coincidence or Fate?

7 Lineage and Uncertainty Lineage... 1)Enables simple and consistent representation of uncertain data 2)Correlates uncertainty in query results with uncertainty in the input data 3)Can make computation over uncertain data more efficient

8 Outline of the Talk The ULDB data model Querying ULDBs ULDB properties Membership and extraction operations Confidences Current, related, and future work

9 Running Example: Crime Solver Saw(witness,car) Drives(person,car) Suspects(person) = π person (Saw ⋈ Drives)

10 Uncertainty An uncertain database represents a set of possible instances. Examples: –Amy saw either a Honda or a Toyota –Jimmy drives a Toyota, a Mazda, or both –Betty saw an Acura with confidence 0.5 or a Toyota with confidence 0.3 –Hank is a suspect with confidence 0.7

11 Uncertainty in a ULDB 1. Alternatives 2. ‘?’ (Maybe) Annotations 3. Confidences

12 Uncertainty in a ULDB 1. Alternatives: uncertainty about value 2. ‘?’ (Maybe) Annotations 3. Confidences Saw (witness,car) (Amy, Honda) ∥ (Amy, Toyota) ∥ (Amy, Mazda) Three possible instances witnesscar Amy{ Honda, Toyota, Mazda } =

13 Six possible instances Uncertainty in a ULDB 1. Alternatives 2. ‘?’ (Maybe): uncertainty about presence 3. Confidences Saw (witness,car) (Amy, Honda) ∥ (Amy, Toyota) ∥ (Amy, Mazda) (Betty, Acura) ?

14 Uncertainty in a ULDB 1. Alternatives 2. ‘?’ (Maybe) Annotations 3. Confidences: weighted uncertainty Saw (witness,car) (Amy, Honda): 0.5 ∥ (Amy,Toyota): 0.3 ∥ (Amy, Mazda): 0.2 (Betty, Acura): 0.6 ? Six possible instances, each with a probability

15 Data Models for Uncertainty Our model (so far) is not especially new We spent some time exploring the space of models for uncertainty [ICDE 2006] Tension between understandability and expressiveness – Our model is understandable – But it is not complete, or even closed under common operations

16 Closure and Completeness Completeness Can represent all sets of possible instances Closure Can represent results of operations Note: Completeness  Closure

17 Model (so far) Not Closed Saw (witness,car) (Cathy, Honda) ∥ (Cathy, Mazda) Drives (person,car) (Jimmy, Toyota) ∥ (Jimmy, Mazda) (Billy, Honda) ∥ (Frank, Honda) (Hank, Honda) Suspects Jimmy Billy ∥ Frank Hank Suspects = π person (Saw ⋈ Drives) ? ? ? Does not correctly capture possible instances in the result CANNOT

18 Lineage to the Rescue Lineage: “where data came from” –Internal lineage –External lineage (not covered in this talk) In ULDBs: A function λ from alternatives to sets of alternatives (or external sources)

19 Example with Lineage IDSaw (witness,car) 11 (Cathy, Honda) ∥ (Cathy, Mazda) IDDrives (person,car) 21 (Jimmy, Toyota) ∥ (Jimmy, Mazda) 22 (Billy, Honda) ∥ (Frank, Honda) 23(Hank, Honda) IDSuspects 31Jimmy 32 Billy ∥ Frank 33Hank ? ? ? Suspects = π person (Saw ⋈ Drives) λ (31) = (11,2),(21,2) λ (32,1) = (11,1),(22,1); λ (32,2) = (11,1),(22,2) λ (33) = (11,1), 23 Correctly captures possible instances in the result

20 ULDBs 1.Alternatives 2.‘?’ (Maybe) Annotations 3.Confidences 4.Lineage ULDBs are Closed and Complete

21 Outline of the Talk The ULDB data model Querying ULDBs ULDB properties Membership and extraction operations Confidences Current, related, and future work

22 Querying ULDBs Query Q on ULDB D D D D 1, D 2, …, D n possible instances Q on each instance representation of instances Q(D 1 ), Q(D 2 ), …, Q(D n ) D’ implementation of Q D + Result

23 Well-Behaved ULDBs If we start with a well-behaved ULDB and perform standard queries, it remains well- behaved Intuitively (details in paper): –Acyclic: No cycles in the lineage –Deterministic: Non-empty lineages of distinct alternatives are distinct –Uniform: Alternatives of same tuple are derived from the same set of tuples

24 ULDB Minimality Data-minimality Does every alternative appear in some possible instance? (no extraneous alternatives) Does every maybe-tuple in R not appear in some possible instance? (no extraneous ‘?’s) Lineage-minimality

25 Data-Minimality Examples Extraneous ‘?’ (Billy, Honda) ∥ (Frank, Honda) Billy ∥ Frank... ? λ (20,1)=(10,1); λ (20,2)=(10,2) extraneous

26 Data-Minimality Examples Extraneous alternative (Diane, Mazda) ∥ (Diane, Acura) Diane extraneous (Diane, Mazda) (Diane, Acura) ? ? ?

27 Data-Minimization Extraneous alternative theorem: –An alternative is extraneous iff it is (possibly transitively) derived from multiple alternatives of the same tuple. Extraneous “?” theorem –A “?” on tuple t is extraneous iff 1.it is derived from base tuples without “?” 2.t has as many alternatives as the product of the number in its base tuples Minimization algorithm based on the theorems (see paper)

28 Data-minimal Lineage-minimal Data-minimal Data-minimize Queries Lineage-minimize ULDB Properties and Operations Membership Extraction

29 Membership Questions Does a given tuple t appear in some (all) possible instance(s) of R ? –Polynomial algorithms based on Data-minimization Is a given table T one of (all of) the possible instances of R ? –NP-Hard t?, T? R R I 1, I 2, …, I n possible instances

30 Extraction Extraction algorithm in paper Suspects Eats Saw Drives

31 Outline of the Talk The ULDB data model Querying ULDBs ULDB properties Membership and extraction operations Confidences Current, related, and future work

32 Confidences Confidences supplied with base data Trio computes confidences on query results –Default probabilistic interpretation –Can choose to plug in different arithmetic Saw (witness,car) (Cathy, Honda): 0.6 ∥ (Cathy, Mazda): 0.4 Drives (person,car) (Jimmy, Mazda): 0.3 ∥ (Bill, Mazda): 0.6 (Hank, Honda): 1.0 Suspects Jimmy: 0.12 ∥ Bill: 0.24 Hank: 0.6 ? ? ? Probabilistic Min

33 Query Processing with Confidences Previous approach (probabilistic databases) –Each operator computes confidences during query execution –Only certain query plans allowed In ULDBs –Confidence of alternative A is function of confidences in its transitive lineage Our approach: Decouple data and confidence computation –Use any query plan for data computation –Compute confidences on-demand using lineage –Can give arbitrarily large improvements

34 Current Work: Algorithms Algorithms: confidence computation, extraneous data, membership questions –Minimize lineage traversal –Memoization –Batch computations

35 The Trio Trio Data Model –ULDBs (Coming: incomplete relations; continuous uncertainty; correlation uncertainty) Query Language –Simple extension to SQL –Query uncertainty, confidences, and lineage System –Did you see our demo? –Version 1: Entirely on top of conventional DBMS –Surprisingly easy and complete, reasonably efficient TriQL

36 Brief Related Work Uncertainty –Modeling C-tables [IL84], Probabilistic Databases [CP87], using Nested Relations [F90] –Systems ProbView [LLRS97], MYSTIQ [BDM+05], ORION [CSP05], Trio [BDHW05] Lineage –DBNotes [CTV05], Data Warehouses [CW03]

but don’t forget the lineage… Thank You Search “stanford trio” (or,