Representation Formalisms for Uncertain Data Jennifer Widom with Anish Das Sarma Omar Benjelloun Alon Halevy Trio and other participants in the Trio Project.


1 Representation Formalisms for Uncertain Data
Jennifer Widom, with Anish Das Sarma, Omar Benjelloun, Alon Halevy, and other participants in the Trio Project

2 !! Warning !!
This work is preliminary and in flux
Ditto for these slides
Luckily I’m among friends…

3 Some Context
Trio Project
We’re building a new kind of DBMS in which:
1. Data
2. Accuracy
3. Lineage
are all first-class interrelated concepts
Potential applications
– Scientific and sensor databases
– Data cleaning and integration
– Approximate query processing
– And others…

4 Context (cont’d)
We began by investigating the accuracy component = uncertainty (more on terminology coming)
Recently we’ve made progress tying together uncertainty + lineage

5 Approaching a New Problem
1) Work in a void for a while
2) Then see what others have done
3) Adjust and proceed

6 Void Part 1
Defined initial Trio Data Model (TDM) [CIDR ’05]
Based primarily on applications and intuition
Accuracy component of initial TDM: a sub-trio
1. Attribute-level approximation
2. Tuple-level (or relation-level) confidence
3. Relation-level coverage

7 Terminology Wars

8 TDM: Approximation (Attributes)
Broadly, an approximate value is a set of possible values along with a probability distribution over them

9 TDM: Approximation (Attributes)
Specifically, each Trio attribute value is either:
1) Exact value (default)
2) Set of values, each with prob ∈ [0,1] (∑ = 1)
3) Min + Max for a range (uniform distribution)
4) Mean + Deviation for Gaussian distribution
Type 2 sets may include “unknown” (⊥)
Independence of approximate values within a tuple

10 TDM: Confidence (Tuples)
Each tuple t has confidence ∈ [0,1]
– Informally: chance of t correctly belonging in relation
– Default: confidence = 1
– Can also define at relation level

11 TDM: Coverage (Relations)
Each relation R has coverage ∈ [0,1]
– Informally: percentage of correct R that is present
– Default: coverage = 1

12 Void Part 2
Started fiddling around with TDM accuracy
– Suitability for applications
– Expressiveness in general
– Operations on data
Immediately encountered interesting issues
– Modeling is nontrivial
– Operation behavior is nonobvious
– Completeness and closure

13 End Void
Time to…
– Read up on other work
– Study a simplified accuracy model
– Get formal ☻
– Change terminology ☻

14 Uncertain Database: Semantics
Definition: An uncertain database represents a set of possible (certain) databases, a.k.a. “possible worlds” or “possible instances”
Example: Jennifer attends workshop on Monday; Mike attends on Monday, Tuesday, or not at all
Instance1:
  person    day
  Jennifer  Monday
  Mike      Monday
Instance2:
  person    day
  Jennifer  Monday
  Mike      Tuesday
Instance3:
  person    day
  Jennifer  Monday

15 Restricted TDM Accuracy
1. Approximation: or-sets
2. Confidence: maybe-tuples (denoted “?”)
3. Coverage: omit
Straightforward mapping to possible-instances
  person    day
  Jennifer  Monday
  Mike      {Monday,Tuesday}  ?
maps to the three possible-instances on previous slide
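The mapping from or-sets and maybe-tuples to possible instances can be sketched in a few lines of Python. This is only an illustration, not Trio code: the `(values, maybe)` encoding and the function name `possible_instances` are assumptions made for the example.

```python
from itertools import product

def possible_instances(relation):
    """Enumerate possible instances of a relation with or-sets and maybe-tuples.

    `relation` is a list of (values, maybe) pairs (a hypothetical encoding):
    each attribute value is a plain value or a set of alternatives (an or-set),
    and `maybe` is True for a tuple marked "?".
    """
    per_tuple = []
    for values, maybe in relation:
        # Expand every or-set in the tuple into its alternatives.
        choices = [sorted(v) if isinstance(v, (set, frozenset)) else [v]
                   for v in values]
        options = [tuple(c) for c in product(*choices)]
        if maybe:
            options.append(None)  # None stands for "tuple absent"
        per_tuple.append(options)
    # Tuples are independent, so instances are all combinations of choices.
    return {frozenset(t for t in combo if t is not None)
            for combo in product(*per_tuple)}

workshop = [
    (("Jennifer", "Monday"), False),
    (("Mike", {"Monday", "Tuesday"}), True),
]
instances = possible_instances(workshop)
for inst in sorted(instances, key=lambda i: sorted(i)):
    print(sorted(inst))
```

Run on the workshop relation, the enumeration yields the three instances of the previous slide: Mike on Monday, Mike on Tuesday, or Mike absent.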

16 Properties of Representations
Restricted-TDM is one possible representation for uncertain databases
A representation is well-defined if we know how to map any database in the representation to its set of possible instances
A representation is complete if every set of possible instances can be represented
Unfortunately, TDM (restricted or not) is incomplete

17 Incompleteness
Instance1:
  person    day
  Jennifer  Monday
  Mike      Tuesday
Instance2:
  person    day
  Jennifer  Monday
Instance3:
  person    day
  Mike      Tuesday
Candidate representation:
  person    day
  Jennifer  Monday   ?
  Mike      Tuesday  ?
generates 4th instance: empty relation
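A small check makes the incompleteness example concrete. The sketch below tries every way of marking the two tuples as certain or "?" and compares the generated instances with the three target instances. It only covers these four candidates, so it illustrates the slide and is not a proof; the names `instances_of` and `target` are assumptions for the example.

```python
from itertools import product

JM = ("Jennifer", "Monday")
MT = ("Mike", "Tuesday")

# The three target instances from the slide.
target = {frozenset({JM, MT}), frozenset({JM}), frozenset({MT})}

def instances_of(tuples_with_flags):
    """Possible instances of a list of (tuple, is_maybe) pairs (no or-sets)."""
    options = [[t, None] if maybe else [t] for t, maybe in tuples_with_flags]
    return {frozenset(t for t in combo if t is not None)
            for combo in product(*options)}

# Try every way of marking the two tuples as certain or "?".
matches = []
for jm_maybe, mt_maybe in product([False, True], repeat=2):
    got = instances_of([(JM, jm_maybe), (MT, mt_maybe)])
    matches.append(got == target)
    print(jm_maybe, mt_maybe, "instances:", len(got), "equals target:", got == target)

# Marking both tuples "?" produces one instance too many: the empty relation.
both_maybe = instances_of([(JM, True), (MT, True)])
extra = both_maybe - target
print(extra)
```

None of the four candidates matches the target. With both tuples marked "?", the only surplus instance is the empty relation, as the slide states.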

18 Completeness vs. Closure
[Diagram: the representable sets-of-instances drawn within all sets-of-instances, with operation arrows Op1 and Op2]
Proposition: An incomplete representation is still interesting if it’s expressive enough and closed under all required operations
Completeness: blue = pink
Closure: arrow stays in blue

19 Operations: Semantics
Easy and natural (re)definition for any standard database operation (call it Op)
[Diagram: D has possible instances I1, I2, …, In; Op on each instance gives J1, J2, …, Jm; D’ is a rep. of instances J1, J2, …, Jm; Op’ is the direct implementation taking D to D’]
Closure: up-arrow always exists
Note: Completeness ⇒ Closure
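The "Op on each instance" semantics can be sketched as follows, using the workshop instances and an ordinary selection as Op. The helper names (`apply_to_instances`, `select_monday`) are hypothetical, and the sketch enumerates instances explicitly, which a direct implementation Op’ would avoid.

```python
JM = ("Jennifer", "Monday")
MM = ("Mike", "Monday")
MT = ("Mike", "Tuesday")

# Possible instances I1, I2, I3 of the workshop example.
workshop_instances = {frozenset({JM, MM}), frozenset({JM, MT}), frozenset({JM})}

def apply_to_instances(op, instances):
    """Possible-instances semantics: run a certain-database Op on every instance."""
    return {frozenset(op(inst)) for inst in instances}

def select_monday(instance):
    """An ordinary selection: keep tuples whose day is Monday."""
    return {t for t in instance if t[1] == "Monday"}

result_instances = apply_to_instances(select_monday, workshop_instances)
print(len(result_instances))  # 2
```

Here the result has two instances, with and without Mike on Monday. That set is representable in restricted TDM by a certain Jennifer tuple plus a Mike-Monday maybe-tuple, so for this particular operation and input a D’ exists.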

20 Closure in TDM
Unfortunately, TDM (restricted or not) is not closed under many standard operations
Next:
1. Examples of non-closure in TDM
2. Suggest possible extensions to the representation (hereafter “model”)
3. Hierarchy of models based on expressiveness

21 Non-Closure Under Join (Ex. 1)
  person  day
  Mike    {Monday,Tuesday}
⋈
  day      food
  Monday   chicken
  Tuesday  fish
Result has two possible instances:
Instance1:
  person  day      food
  Mike    Monday   chicken
Instance2:
  person  day      food
  Mike    Tuesday  fish
Not representable with or-sets and ?
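The join example can be replayed instance by instance. The sketch below computes the two result instances, then expands one tempting or-set encoding of the result (Mike, {Monday,Tuesday}, {chicken,fish}) to show that it generates unwanted combinations. The encoding and function names are assumptions for illustration, and only this one candidate is checked.

```python
from itertools import product

def natural_join(left, right):
    """Join (person, day) tuples with (day, food) tuples on day."""
    return {(p, d, f) for (p, d) in left for (d2, f) in right if d == d2}

food = {("Monday", "chicken"), ("Tuesday", "fish")}

# The or-set {Monday, Tuesday} gives two possible instances of the left relation.
left_instances = [{("Mike", "Monday")}, {("Mike", "Tuesday")}]

# Possible-instances semantics: join each instance separately.
join_instances = {frozenset(natural_join(inst, food)) for inst in left_instances}

# A tempting or-set encoding of the result: Mike, {Monday,Tuesday}, {chicken,fish}.
# Or-sets within a tuple are independent, so it expands to four combinations.
orset_attempt = {
    frozenset({("Mike", d, f)})
    for d, f in product(["Monday", "Tuesday"], ["chicken", "fish"])
}

print(join_instances == orset_attempt)  # False
print(sorted(t for inst in orset_attempt - join_instances for t in inst))
```

The surplus instances pair Monday with fish and Tuesday with chicken. Independent or-sets cannot express the correlation between day and food.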

22 Need for Xor
Result has two possible instances:
Instance1:
  person  day      food
  Mike    Monday   chicken
Instance2:
  person  day      food
  Mike    Tuesday  fish
Representable with Xor constraint
      person  day      food
  t1  Mike    Monday   chicken
  t2  Mike    Tuesday  fish
Constraint: t1 XOR t2

23 Non-Closure Under Join (Ex. 2)
  person  day
  Mike    {Monday,Tuesday}
⋈
  day     food
  Monday  chicken
  Monday  pie
Result has two possible instances:
Instance1:
  person  day     food
  Mike    Monday  chicken
  Mike    Monday  pie
Instance2:
  person  day     food
  (empty)
Not representable with or-sets and ?

24 Need for Iff
Result has two possible instances:
Instance1:
  person  day     food
  Mike    Monday  chicken
  Mike    Monday  pie
Instance2:
  person  day     food
  (empty)
Representable with ≡ (Iff) constraint
      person  day     food
  t1  Mike    Monday  chicken
  t2  Mike    Monday  pie
Constraint: t1 ≡ t2
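One way to read the Xor and Iff constraints is as predicates over tuple presence: an instance is any subset of the labeled tuples whose presence pattern satisfies the constraint. The sketch below follows that reading. The function `instances_under` and the use of Python lambdas as constraints are assumptions for illustration only.

```python
from itertools import product

def instances_under(tuples, constraint):
    """Enumerate instances of labeled tuples that satisfy a presence constraint.

    `tuples` maps a label (e.g. "t1") to a tuple; `constraint` takes a dict
    label -> bool (is the tuple present?) and returns True if allowed.
    """
    labels = sorted(tuples)
    result = set()
    for bits in product([False, True], repeat=len(labels)):
        present = dict(zip(labels, bits))
        if constraint(present):
            result.add(frozenset(tuples[l] for l in labels if present[l]))
    return result

# Join example 1: exactly one of t1, t2 is present (t1 XOR t2).
xor_result = instances_under(
    {"t1": ("Mike", "Monday", "chicken"), "t2": ("Mike", "Tuesday", "fish")},
    lambda p: p["t1"] != p["t2"],
)

# Join example 2: t1 and t2 are present together or not at all (t1 ≡ t2).
iff_result = instances_under(
    {"t1": ("Mike", "Monday", "chicken"), "t2": ("Mike", "Monday", "pie")},
    lambda p: p["t1"] == p["t2"],
)

print(len(xor_result), len(iff_result))  # 2 2
```

Each constraint yields exactly the two instances shown on the corresponding slide. For Iff, one of them is the empty relation.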

25 Constraints ⇒ Completeness?
Full propositional logic: YES
Xor and Iff: NO
General 2-clauses: NO
How about “vertical or” (tuple-sets)? NOPE

26 Hierarchy of Incomplete Models
[Diagram: hierarchy of models, labeled with the components below]
R = relations
A = or-sets
? = maybe-tuples
2 = 2-clauses
sets = tuple-sets

27 Closure
But remember:
– Completeness may not be necessary
– Closure may be good enough
Are any of these models closed under standard relational operations?

28 Closure Table

29 Closure Diagram
Omitted:
– Self-loops
– Subsumed arrows to root

30 Some Final Theory (for today)
Instance membership: Given instance I and uncertain relation R, is I an instance of R?
Instance certainty: Given instance I and uncertain relation R, is I R’s only instance?
Tuple membership: Given tuple t and uncertain relation R, is t in any of R’s instances?
Tuple certainty: Given tuple t and uncertain relation R, is t in all of R’s instances?
Many of these problems are NP-Hard in complete models but polynomial in our incomplete models
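The four problems have a direct brute-force reading once the possible instances are enumerated explicitly. The sketch below shows only that reading: explicit enumeration can be exponential in the size of the representation, so it does not reflect the polynomial algorithms mentioned for the incomplete models. All function names are assumptions.

```python
JM = ("Jennifer", "Monday")
MM = ("Mike", "Monday")
MT = ("Mike", "Tuesday")

# Explicit possible instances of the workshop relation R.
R_instances = {frozenset({JM, MM}), frozenset({JM, MT}), frozenset({JM})}

def instance_membership(instance, instances):
    """Is `instance` one of R's possible instances?"""
    return frozenset(instance) in instances

def instance_certainty(instance, instances):
    """Is `instance` R's only possible instance?"""
    return instances == {frozenset(instance)}

def tuple_membership(t, instances):
    """Is t in at least one of R's instances?"""
    return any(t in inst for inst in instances)

def tuple_certainty(t, instances):
    """Is t in every one of R's instances?"""
    return all(t in inst for inst in instances)

print(instance_membership({JM}, R_instances))  # True
print(instance_certainty({JM}, R_instances))   # False
print(tuple_membership(MT, R_instances))       # True
print(tuple_certainty(JM, R_instances))        # True
```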

31 Back to Reality
What does all of this mean for the Trio project?
Fundamental dilemma:
– Restricted-TDM: intuitive, understandable, incomplete
– Unrestricted-TDM: more complex, still incomplete
– Complete models: even more complex, nonintuitive

32 Uses of Incomplete Models
Sufficient for some applications
– Incomplete model can represent data
– Closed under required operations
Two-layer approach
– Underlying complete model
– Incomplete “working” model for users (recall Mike’s chicken and fish)
– Challenge: approximate approximation

33 Lineage and Uncertainty
Trio: Data + Accuracy + Lineage
Surprise: Restricted-TDM (v2) + Lineage is complete and (therefore) closed
  person  day
  Mike    {Monday,Tuesday}   (alternatives A1, A2)
⋈
  day      food
  Monday   chicken
  Tuesday  fish
Result:
  person  day      food
  Mike    Monday   chicken   ?   Lineage: A1, …
  Mike    Tuesday  fish      ?   Lineage: A2, …
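A rough sketch of how lineage can recover the correlation lost in the plain or-set model: each result tuple records the alternative it came from, and a result tuple is present exactly when that alternative is the one chosen. The sketch assumes that A1 labels Monday and A2 labels Tuesday, tracks only lineage to the or-set alternatives (the slide's "…" is left out), and uses a hypothetical dictionary encoding.

```python
# Assumed labeling: A1 and A2 name the two alternatives of Mike's or-set.
alternatives = {"A1": ("Mike", "Monday"), "A2": ("Mike", "Tuesday")}
food = {"Monday": "chicken", "Tuesday": "fish"}

# Join result: each tuple remembers which alternative it was derived from.
result = [
    {"tuple": (person, day, food[day]), "lineage": alt_id}
    for alt_id, (person, day) in sorted(alternatives.items())
    if day in food
]

def instances_with_lineage(result, alternative_ids):
    """Exactly one alternative holds; keep the tuples derived from it."""
    return {
        frozenset(r["tuple"] for r in result if r["lineage"] == chosen)
        for chosen in alternative_ids
    }

lineage_instances = instances_with_lineage(result, sorted(alternatives))
print(len(lineage_instances))  # 2
```

Under these assumptions the two maybe-tuples plus their lineage yield exactly the two join instances, with no Xor constraint needed.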

34 Current Challenges
Pursue uncertainty + lineage
Remainder of accuracy model
– Probability distributions
– Intervals, Gaussians
– Confidence values
– Coverage
Querying uncertainty
– Ex: Find all people with ≥ 3 alternate days
– Can we generalize the possible-instances semantics?

35 Search term: stanford trio

