Presentation is loading. Please wait.

Presentation is loading. Please wait.

Uncertainty Lineage Data Bases Very Large Data Bases 19752006.

Similar presentations


Presentation on theme: "Uncertainty Lineage Data Bases Very Large Data Bases 19752006."— Presentation transcript:

1 Uncertainty Lineage Data Bases Very Large Data Bases 19752006

2 ULDBs: Databases with Uncertainty and Lineage Omar Benjelloun, Anish Das Sarma, Alon Halevy, Jennifer Widom Stanford InfoLab

3 3 Motivation Many applications involve data that is uncertain (approximate, probabilistic, inexact, incomplete, imprecise, fuzzy, inaccurate,...) Many of the same applications need to track the lineage of their data Neither uncertainty nor lineage are supported by conventional DBMSs Coincidence or Fate?

4 4 Sample Applications Needing Uncertainty and Lineage Scientific databases Sensor databases Data cleaning Data integration Information extraction

5 5 Trio Project Building a new kind of DBMS in which: 1.Data 2.Uncertainty 3.Lineage are all first-class interrelated concepts

6 6 Lineage and Uncertainty Lots of independent work in lineage and uncertainty (related work at end of talk) Turns out: The connection between uncertainty and lineage goes deeper than just a shared need by several applications Coincidence or Fate?

7 7 Lineage and Uncertainty Lineage... 1)Enables simple and consistent representation of uncertain data 2)Correlates uncertainty in query results with uncertainty in the input data 3)Can make computation over uncertain data more efficient

8 8 Outline of the Talk The ULDB data model Querying ULDBs ULDB properties Membership and extraction operations Confidences Current, related, and future work

9 9 Running Example: Crime Solver Saw(witness,car) Drives(person,car) Suspects(person) = π person (Saw ⋈ Drives)

10 10 Uncertainty An uncertain database represents a set of possible instances. Examples: –Amy saw either a Honda or a Toyota –Jimmy drives a Toyota, a Mazda, or both –Betty saw an Acura with confidence 0.5 or a Toyota with confidence 0.3 –Hank is a suspect with confidence 0.7

11 11 Uncertainty in a ULDB 1. Alternatives 2. ‘?’ (Maybe) Annotations 3. Confidences

12 12 Uncertainty in a ULDB 1. Alternatives: uncertainty about value 2. ‘?’ (Maybe) Annotations 3. Confidences Saw (witness,car) (Amy, Honda) ∥ (Amy, Toyota) ∥ (Amy, Mazda) Three possible instances witnesscar Amy{ Honda, Toyota, Mazda } =

13 13 Six possible instances Uncertainty in a ULDB 1. Alternatives 2. ‘?’ (Maybe): uncertainty about presence 3. Confidences Saw (witness,car) (Amy, Honda) ∥ (Amy, Toyota) ∥ (Amy, Mazda) (Betty, Acura) ?

14 14 Uncertainty in a ULDB 1. Alternatives 2. ‘?’ (Maybe) Annotations 3. Confidences: weighted uncertainty Saw (witness,car) (Amy, Honda): 0.5 ∥ (Amy,Toyota): 0.3 ∥ (Amy, Mazda): 0.2 (Betty, Acura): 0.6 ? Six possible instances, each with a probability

15 15 Data Models for Uncertainty Our model (so far) is not especially new We spent some time exploring the space of models for uncertainty [ICDE 2006] Tension between understandability and expressiveness – Our model is understandable – But it is not complete, or even closed under common operations

16 16 Closure and Completeness Completeness Can represent all sets of possible instances Closure Can represent results of operations Note: Completeness  Closure

17 17 Model (so far) Not Closed Saw (witness,car) (Cathy, Honda) ∥ (Cathy, Mazda) Drives (person,car) (Jimmy, Toyota) ∥ (Jimmy, Mazda) (Billy, Honda) ∥ (Frank, Honda) (Hank, Honda) Suspects Jimmy Billy ∥ Frank Hank Suspects = π person (Saw ⋈ Drives) ? ? ? Does not correctly capture possible instances in the result CANNOT

18 18 Lineage to the Rescue Lineage: “where data came from” –Internal lineage –External lineage (not covered in this talk) In ULDBs: A function λ from alternatives to sets of alternatives (or external sources)

19 19 Example with Lineage IDSaw (witness,car) 11 (Cathy, Honda) ∥ (Cathy, Mazda) IDDrives (person,car) 21 (Jimmy, Toyota) ∥ (Jimmy, Mazda) 22 (Billy, Honda) ∥ (Frank, Honda) 23(Hank, Honda) IDSuspects 31Jimmy 32 Billy ∥ Frank 33Hank ? ? ? Suspects = π person (Saw ⋈ Drives) λ (31) = (11,2),(21,2) λ (32,1) = (11,1),(22,1); λ (32,2) = (11,1),(22,2) λ (33) = (11,1), 23 Correctly captures possible instances in the result

20 20 ULDBs 1.Alternatives 2.‘?’ (Maybe) Annotations 3.Confidences 4.Lineage ULDBs are Closed and Complete

21 21 Outline of the Talk The ULDB data model Querying ULDBs ULDB properties Membership and extraction operations Confidences Current, related, and future work

22 22 Querying ULDBs Query Q on ULDB D D D D 1, D 2, …, D n possible instances Q on each instance representation of instances Q(D 1 ), Q(D 2 ), …, Q(D n ) D’ implementation of Q D + Result

23 23 Well-Behaved ULDBs If we start with a well-behaved ULDB and perform standard queries, it remains well- behaved Intuitively (details in paper): –Acyclic: No cycles in the lineage –Deterministic: Non-empty lineages of distinct alternatives are distinct –Uniform: Alternatives of same tuple are derived from the same set of tuples

24 24 ULDB Minimality Data-minimality Does every alternative appear in some possible instance? (no extraneous alternatives) Does every maybe-tuple in R not appear in some possible instance? (no extraneous ‘?’s) Lineage-minimality

25 25 Data-Minimality Examples Extraneous ‘?’... 10 (Billy, Honda) ∥ (Frank, Honda)... 20 Billy ∥ Frank... ? λ (20,1)=(10,1); λ (20,2)=(10,2) extraneous

26 26 Data-Minimality Examples Extraneous alternative (Diane, Mazda) ∥ (Diane, Acura) Diane extraneous (Diane, Mazda) (Diane, Acura) ? ? ?

27 27 Data-Minimization Extraneous alternative theorem: –An alternative is extraneous iff it is (possibly transitively) derived from multiple alternatives of the same tuple. Extraneous “?” theorem –A “?” on tuple t is extraneous iff 1.it is derived from base tuples without “?” 2.t has as many alternatives as the product of the number in its base tuples Minimization algorithm based on the theorems (see paper)

28 28 Data-minimal Lineage-minimal Data-minimal Data-minimize Queries Lineage-minimize ULDB Properties and Operations Membership Extraction

29 29 Membership Questions Does a given tuple t appear in some (all) possible instance(s) of R ? –Polynomial algorithms based on Data-minimization Is a given table T one of (all of) the possible instances of R ? –NP-Hard t?, T? R R I 1, I 2, …, I n possible instances

30 30 Extraction Extraction algorithm in paper Suspects Eats Saw Drives

31 31 Outline of the Talk The ULDB data model Querying ULDBs ULDB properties Membership and extraction operations Confidences Current, related, and future work

32 32 Confidences Confidences supplied with base data Trio computes confidences on query results –Default probabilistic interpretation –Can choose to plug in different arithmetic Saw (witness,car) (Cathy, Honda): 0.6 ∥ (Cathy, Mazda): 0.4 Drives (person,car) (Jimmy, Mazda): 0.3 ∥ (Bill, Mazda): 0.6 (Hank, Honda): 1.0 Suspects Jimmy: 0.12 ∥ Bill: 0.24 Hank: 0.6 ? ? ? 0.30.4 0.6 Probabilistic Min

33 33 Query Processing with Confidences Previous approach (probabilistic databases) –Each operator computes confidences during query execution –Only certain query plans allowed In ULDBs –Confidence of alternative A is function of confidences in its transitive lineage Our approach: Decouple data and confidence computation –Use any query plan for data computation –Compute confidences on-demand using lineage –Can give arbitrarily large improvements

34 34 Current Work: Algorithms Algorithms: confidence computation, extraneous data, membership questions –Minimize lineage traversal –Memoization –Batch computations

35 35 The Trio Trio Data Model –ULDBs (Coming: incomplete relations; continuous uncertainty; correlation uncertainty) Query Language –Simple extension to SQL –Query uncertainty, confidences, and lineage System –Did you see our demo? –Version 1: Entirely on top of conventional DBMS –Surprisingly easy and complete, reasonably efficient TriQL

36 36 Brief Related Work Uncertainty –Modeling C-tables [IL84], Probabilistic Databases [CP87], using Nested Relations [F90] –Systems ProbView [LLRS97], MYSTIQ [BDM+05], ORION [CSP05], Trio [BDHW05] Lineage –DBNotes [CTV05], Data Warehouses [CW03]

37 but don’t forget the lineage… Thank You Search “stanford trio” (or, http://i.stanford.edu/trio)


Download ppt "Uncertainty Lineage Data Bases Very Large Data Bases 19752006."

Similar presentations


Ads by Google