1
1 Representation Formalisms for Uncertain Data
Jennifer Widom
with Anish Das Sarma, Omar Benjelloun, Alon Halevy, and other participants in the Trio Project
2
2 !! Warning !!
This work is preliminary and in flux
Ditto for these slides
Luckily I’m among friends…
3
3 Some Context
Trio Project: We’re building a new kind of DBMS in which:
1. Data
2. Accuracy
3. Lineage
are all first-class interrelated concepts
Potential applications
– Scientific and sensor databases
– Data cleaning and integration
– Approximate query processing
– And others…
4
4 Context (cont’d)
We began by investigating the accuracy component = uncertainty (more on terminology coming)
Recently we’ve made progress tying together uncertainty + lineage
5
5 Approaching a New Problem
1) Work in a void for a while
2) Then see what others have done
3) Adjust and proceed
6
6 Void Part 1
Defined initial Trio Data Model (TDM) [CIDR ’05]
Based primarily on applications and intuition
Accuracy component of initial TDM
A sub-trio:
1. Attribute-level approximation
2. Tuple-level (or relation-level) confidence
3. Relation-level coverage
7
7 Terminology Wars
8
8 TDM: Approximation (Attributes)
Broadly, an approximate value is a set of possible values along with a probability distribution over them
9
9 TDM: Approximation (Attributes)
Specifically, each Trio attribute value is either:
1) Exact value (default)
2) Set of values, each with probability in [0,1] (∑ = 1)
3) Min + Max for a range (uniform distribution)
4) Mean + Deviation for Gaussian distribution
Type 2 sets may include “unknown” (⊥)
Independence of approximate values within a tuple
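The four kinds of attribute value listed on this slide can be sketched as plain Python data classes. This is only an illustration: the class names, the `UNKNOWN` marker, and the validation tolerance are assumptions made here, not part of the TDM definition.

```python
from dataclasses import dataclass
from typing import Any, Dict

UNKNOWN = "⊥"  # hypothetical marker for the "unknown" alternative


@dataclass
class Exact:
    value: Any  # type 1: an exact value (the default)


@dataclass
class ValueSet:
    probs: Dict[Any, float]  # type 2: value -> probability

    def __post_init__(self):
        # Each probability must lie in [0, 1] and together they must sum to 1.
        if any(not 0.0 <= p <= 1.0 for p in self.probs.values()):
            raise ValueError("each probability must lie in [0, 1]")
        if abs(sum(self.probs.values()) - 1.0) > 1e-9:
            raise ValueError("probabilities must sum to 1")


@dataclass
class Range:
    low: float   # type 3: uniform distribution over [low, high]
    high: float


@dataclass
class Gaussian:
    mean: float       # type 4: Gaussian distribution
    deviation: float


# A type 2 value that includes the "unknown" alternative.
day = ValueSet({"Monday": 0.5, "Tuesday": 0.3, UNKNOWN: 0.2})
```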
10
10 TDM: Confidence (Tuples)
Each tuple t has confidence [0,1]
– Informally: chance of t correctly belonging in relation
– Default: confidence=1
– Can also define at relation level
11
11 TDM: Coverage (Relations)
Each relation R has coverage [0,1]
– Informally: percentage of correct R that is present
– Default: coverage=1
12
12 Void Part 2
Started fiddling around with TDM accuracy
– Suitability for applications
– Expressiveness in general
– Operations on data
Immediately encountered interesting issues
– Modeling is nontrivial
– Operation behavior is nonobvious
– Completeness and closure
13
13 End Void
Time to…
– Read up on other work
– Study a simplified accuracy model
– Get formal ☻
– Change terminology ☻
14
14 Uncertain Database: Semantics
Definition: An uncertain database represents a set of possible (certain) databases, a.k.a. “possible worlds”, “possible instances”
Example: Jennifer attends workshop on Monday; Mike attends on Monday, Tuesday, or not at all
Instance1:
  person    day
  Jennifer  Monday
  Mike      Monday
Instance2:
  person    day
  Jennifer  Monday
  Mike      Tuesday
Instance3:
  person    day
  Jennifer  Monday
15
15 Restricted TDM Accuracy
1. Approximation: or-sets
2. Confidence: maybe-tuples (denoted “?”)
3. Coverage: omit
Straightforward mapping to possible-instances
  person    day
  Jennifer  Monday
  Mike      {Monday,Tuesday}  ?
maps to the three possible-instances on previous slide
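The mapping from a restricted-TDM relation to its possible instances can be sketched by brute-force enumeration. The encoding below (a list of `(values, maybe)` pairs, with Python sets standing for or-sets) is an assumption chosen for illustration, not Trio's actual representation.

```python
from itertools import product


def possible_instances(relation):
    """Enumerate the possible instances of a restricted-TDM relation.

    relation: list of (values, maybe) pairs. Each entry of `values` is either
    a plain value or a set of alternatives (an or-set); `maybe` is True for
    tuples marked "?".
    """
    per_tuple = []
    for values, maybe in relation:
        # Expand or-sets: one concrete tuple per combination of alternatives.
        choices = [sorted(v) if isinstance(v, (set, frozenset)) else [v]
                   for v in values]
        options = [tuple(c) for c in product(*choices)]
        if maybe:
            options.append(None)  # a maybe-tuple may also be absent
        per_tuple.append(options)
    # Tuples are chosen independently of one another.
    return {frozenset(t for t in combo if t is not None)
            for combo in product(*per_tuple)}


workshop = [
    (("Jennifer", "Monday"), False),
    (("Mike", {"Monday", "Tuesday"}), True),
]
instances = possible_instances(workshop)
```

For the workshop relation this yields the three instances of the previous slide: Mike on Monday, Mike on Tuesday, or Mike absent.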
16
16 Properties of Representations
Restricted-TDM is one possible representation for uncertain databases
A representation is well-defined if we know how to map any database in the representation to its set of possible instances
A representation is complete if every set of possible instances can be represented
Unfortunately, TDM (restricted or not) is incomplete
17
17 Incompleteness
Instance1:
  person    day
  Jennifer  Monday
  Mike      Tuesday
Instance2:
  person    day
  Jennifer  Monday
Instance3:
  person    day
  Mike      Tuesday
Candidate representation:
  person    day
  Jennifer  Monday   ?
  Mike      Tuesday  ?
generates 4th instance: empty relation
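A small check of this example, assuming maybe-tuples are chosen independently (the helper name `maybe_instances` is hypothetical). It only tests two candidate representations; it is not a proof of incompleteness.

```python
from itertools import combinations


def maybe_instances(certain, maybe):
    """Instances of a relation with certain tuples plus independent "?" tuples."""
    result = set()
    for k in range(len(maybe) + 1):
        for chosen in combinations(maybe, k):
            result.add(frozenset(certain) | frozenset(chosen))
    return result


J = ("Jennifer", "Monday")
M = ("Mike", "Tuesday")

# The three instances we would like to represent.
target = {frozenset({J, M}), frozenset({J}), frozenset({M})}

# Candidate 1: both tuples marked "?". Covers the target but adds one instance.
both_maybe = maybe_instances([], [J, M])
extra = both_maybe - target      # the unwanted empty relation

# Candidate 2: Jennifer certain, Mike "?". Adds nothing but misses one instance.
one_maybe = maybe_instances([J], [M])
missing = target - one_maybe     # the instance containing only Mike
```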
18
18 Completeness vs. Closure
[Figure: region of all sets-of-instances containing the region of representable sets-of-instances, with arrows for operations Op1 and Op2]
Proposition: An incomplete representation is still interesting if it’s expressive enough and closed under all required operations
Completeness: blue=pink
Closure: arrow stays in blue
19
19 Operations: Semantics
Easy and natural (re)definition for any standard database operation (call it Op)
[Diagram:]
  D --(possible instances)--> I1, I2, …, In
  I1, I2, …, In --(Op on each instance)--> J1, J2, …, Jm
  J1, J2, …, Jm --(rep. of instances)--> D’
  D --(Op’ direct implementation)--> D’
Closure: up-arrow always exists
Note: Completeness ⇒ Closure
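The "Op on each instance" step can be sketched as follows, using a selection on Monday as the example operation. The function names `lift` and `select_monday` are illustrative assumptions.

```python
def lift(op, instances):
    """Apply `op` to each possible instance and collect the distinct results."""
    return {frozenset(op(instance)) for instance in instances}


def select_monday(instance):
    # Standard selection on an ordinary (certain) relation of (person, day).
    return {t for t in instance if t[1] == "Monday"}


# I1, I2, I3: the three possible instances of the workshop example.
instances = {
    frozenset({("Jennifer", "Monday"), ("Mike", "Monday")}),
    frozenset({("Jennifer", "Monday"), ("Mike", "Tuesday")}),
    frozenset({("Jennifer", "Monday")}),
}

# J1, J2: two distinct results, since I2 and I3 collapse to the same relation.
result = lift(select_monday, instances)
```

Here the up-arrow exists: the result is what a relation with a certain (Jennifer, Monday) and a maybe-tuple (Mike, Monday) would represent. A direct implementation Op’ is correct when it produces a representation of exactly this set without enumerating the instances.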
20
20 Closure in TDM
Unfortunately, TDM (restricted or not) is not closed under many standard operations
Next:
1. Examples of non-closure in TDM
2. Suggest possible extensions to the representation (hereafter “model”)
3. Hierarchy of models based on expressiveness
21
21 Non-Closure Under Join (Ex. 1)
  person  day
  Mike    {Monday,Tuesday}
⋈
  day      food
  Monday   chicken
  Tuesday  fish
Result has two possible instances:
Instance1:
  person  day     food
  Mike    Monday  chicken
Instance2:
  person  day      food
  Mike    Tuesday  fish
Not representable with or-sets and ?
22
22 Need for Xor
Result has two possible instances:
Instance1:
  person  day     food
  Mike    Monday  chicken
Instance2:
  person  day      food
  Mike    Tuesday  fish
Representable with Xor constraint
      person  day      food
  t1  Mike    Monday   chicken
  t2  Mike    Tuesday  fish
Constraint: t1 XOR t2
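The join example can be checked by computing the join on each possible instance and comparing with the instances admitted by the Xor constraint. This is a brute-force sketch; `join` and `instances_under` are names invented here.

```python
from itertools import combinations


def join(left, right):
    # Natural join on day: left holds (person, day), right holds (day, food).
    return frozenset((p, d, f) for (p, d) in left for (d2, f) in right if d == d2)


def instances_under(tuples, constraint):
    """All subsets of `tuples` whose presence flags satisfy `constraint`."""
    out = set()
    for k in range(len(tuples) + 1):
        for chosen in combinations(tuples, k):
            flags = [t in chosen for t in tuples]
            if constraint(*flags):
                out.add(frozenset(chosen))
    return out


# Possible instances of the left relation: Mike {Monday,Tuesday}.
left_instances = [frozenset({("Mike", "Monday")}), frozenset({("Mike", "Tuesday")})]
food = frozenset({("Monday", "chicken"), ("Tuesday", "fish")})

# Semantics of the join: apply it to each possible instance.
joined = {join(left, food) for left in left_instances}

t1 = ("Mike", "Monday", "chicken")
t2 = ("Mike", "Tuesday", "fish")
with_xor = instances_under([t1, t2], lambda a, b: a != b)   # t1 XOR t2
with_maybe = instances_under([t1, t2], lambda a, b: True)   # two independent "?"
```

`with_xor` equals `joined`, while two independent maybe-tuples also admit the empty relation and the relation containing both tuples.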
23
23 Non-Closure Under Join (Ex. 2)
  person  day
  Mike    {Monday,Tuesday}
⋈
  day     food
  Monday  chicken
  Monday  pie
Result has two possible instances:
Instance1:
  person  day     food
  Mike    Monday  chicken
  Mike    Monday  pie
Instance2:
  person  day     food
  (empty)
Not representable with or-sets and ?
24
24 Need for Iff
Result has two possible instances:
Instance1:
  person  day     food
  Mike    Monday  chicken
  Mike    Monday  pie
Instance2:
  person  day     food
  (empty)
Representable with ≡ (Iff) constraint
      person  day     food
  t1  Mike    Monday  chicken
  t2  Mike    Monday  pie
Constraint: t1 ≡ t2
25
25 Constraints
Completeness?
– Full propositional logic: YES
– Xor and Iff: NO
– General 2-clauses: NO
– How about “vertical or” (tuple-sets)? NOPE
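One way to see why full propositional logic suffices: for any finite set of instances, a formula in disjunctive normal form with one conjunct per instance admits exactly that set. The sketch below illustrates this with hypothetical helpers (`dnf_for`, `holds`, `instances_of`); it is an illustration of the idea, not the construction used in the talk.

```python
from itertools import combinations


def dnf_for(tuples, target):
    """One conjunct per target instance, fixing each tuple's presence or absence."""
    return [{t: (t in inst) for t in tuples} for inst in target]


def holds(dnf, present):
    # The formula holds if some conjunct matches the presence of every tuple.
    return any(all((t in present) == wanted for t, wanted in conj.items())
               for conj in dnf)


def instances_of(tuples, dnf):
    """All subsets of `tuples` that satisfy the DNF constraint."""
    return {frozenset(chosen)
            for k in range(len(tuples) + 1)
            for chosen in combinations(tuples, k)
            if holds(dnf, chosen)}


J = ("Jennifer", "Monday")
M = ("Mike", "Tuesday")

# The three instances from the incompleteness example (no empty relation).
target = {frozenset({J, M}), frozenset({J}), frozenset({M})}
represented = instances_of([J, M], dnf_for([J, M], target))
```

The formula has one conjunct per instance, so it can be as large as the instance set itself.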
26
26 Hierarchy of Incomplete Models
[Figure: hierarchy diagram of models]
Legend: R = relations; A = or-sets; ? = maybe-tuples; 2 = 2-clauses; sets = tuple-sets
27
27 Closure
But remember:
– Completeness may not be necessary
– Closure may be good enough
Are any of these models closed under standard relational operations?
28
28 Closure Table
29
29 Closure Diagram Omitted: –Self-loops –Subsumed arrows to root
30
30 Some Final Theory (for today)
Instance membership: Given instance I and uncertain relation R, is I an instance of R?
Instance certainty: Given instance I and uncertain relation R, is I R’s only instance?
Tuple membership: Given tuple t and uncertain relation R, is t in any of R’s instances?
Tuple certainty: Given tuple t and uncertain relation R, is t in all of R’s instances?
Many of these problems are NP-Hard in complete models but polynomial in our incomplete models
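The four problems are easy to state over an explicitly enumerated, non-empty set of instances, as in the sketch below (function names are assumptions). This says nothing about their complexity on a compact representation, where the instance set can be exponentially larger than the representation itself.

```python
def instance_membership(instance, instances):
    """Is `instance` one of the possible instances?"""
    return frozenset(instance) in instances


def instance_certainty(instance, instances):
    """Is `instance` the only possible instance?"""
    return instances == {frozenset(instance)}


def tuple_membership(t, instances):
    """Is `t` in any possible instance?"""
    return any(t in inst for inst in instances)


def tuple_certainty(t, instances):
    """Is `t` in all possible instances?"""
    return all(t in inst for inst in instances)


# The workshop example: Jennifer is certain, Mike is not.
workshop = {
    frozenset({("Jennifer", "Monday"), ("Mike", "Monday")}),
    frozenset({("Jennifer", "Monday"), ("Mike", "Tuesday")}),
    frozenset({("Jennifer", "Monday")}),
}
```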
31
31 Back to Reality
What does all of this mean for the Trio project?
Fundamental dilemma:
– Restricted-TDM: intuitive, understandable, incomplete
– Unrestricted-TDM: more complex, still incomplete
– Complete models: even more complex, nonintuitive
32
32 Uses of Incomplete Models
Sufficient for some applications
– Incomplete model can represent data
– Closed under required operations
Two-layer approach
– Underlying complete model
– Incomplete “working” model for users (recall Mike’s chicken and fish)
– Challenge: approximate approximation
33
33 Lineage and Uncertainty
Trio: Data + Accuracy + Lineage
Surprise: Restricted-TDM (v2) + Lineage is complete and (therefore) closed
  person  day
  Mike    {Monday,Tuesday}   (alternatives labeled A1, A2)
⋈
  day      food
  Monday   chicken
  Tuesday  fish

  person  day      food
  Mike    Monday   chicken  ?  Lineage: A1, …
  Mike    Tuesday  fish     ?  Lineage: A2, …
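A minimal sketch of how lineage can rule out the unwanted instances in this example. It assumes that A1 and A2 label the Monday and Tuesday alternatives of Mike's or-set, and that a derived maybe-tuple is present exactly when the alternative in its lineage is the one chosen. Both are assumptions made for illustration; the slide only outlines the idea.

```python
# Hypothetical labels: A1 and A2 name the two alternatives of Mike's or-set.
alternatives = {"A1": ("Mike", "Monday"), "A2": ("Mike", "Tuesday")}

# Join result: each maybe-tuple records the alternative it was derived from.
derived = [
    (("Mike", "Monday", "chicken"), "A1"),
    (("Mike", "Tuesday", "fish"), "A2"),
]


def instances_with_lineage(alternatives, derived):
    """One result instance per choice of base alternative (assumed semantics)."""
    return {frozenset(t for t, source in derived if source == chosen)
            for chosen in alternatives}


result = instances_with_lineage(alternatives, derived)
```

Under these assumptions the result has exactly the two instances from the join example: neither the empty relation nor the relation containing both tuples can arise, because exactly one alternative is chosen.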
34
34 Current Challenges
Pursue uncertainty + lineage
Remainder of accuracy model
– Probability distributions
– Intervals, Gaussians
– Confidence values
– Coverage
Querying uncertainty
Ex: Find all people with ≥ 3 alternate days
Can we generalize the possible-instances semantics?
35
35 Search term: stanford trio