General Database Statistics Using Maximum Entropy. Raghav Kaushik (Microsoft Research), Christopher Ré (University of Wisconsin-Madison), and Dan Suciu (University of Washington)
Study Cardinality Estimation: 1. Model: the information that the optimizer knows. 2. Prediction: use the model to estimate the cardinality of future queries. Contribution: a principled, declarative approach to cardinality estimation based on entropy maximization. We propose a declarative language with statistical assertions, e.g. “We estimate that the distinct # of Employees is 10”.
Motivating Applications: 1. Incorporate query feedback records (underutilized: no general-purpose mechanism). 2. Optimizers for new domains (DB Kit 2.0): cloud computing, information extraction. 3. Data generation and description.
Outline: Statistical programs and desiderata; Semantics of statistical programs; Two examples; Conclusions
Statistical Assertions: An assertion is a conjunctive query (CQ) view plus a sharp (#) statement. V1(x) :- R(x,-) with #V1 = 20: “The number of values in the output of V1 is 20”. V2(y) :- R(-,y), S(y) with #V2 = 50: “The number of values in the output of V2 is 50”. A program is a set of assertions. General form: V(x) :- R(x,y), …. #V=
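To make the definition concrete, here is a minimal sketch of how a statistical program could be represented and how each view is evaluated on a single instance. The `Assertion` class, the dictionary encoding of an instance, and the sample data are assumptions made for illustration; they do not come from the slides or the paper.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Set, Tuple

# Hypothetical encoding: an instance maps each relation name to a set of tuples.
World = Dict[str, Set[Tuple]]

@dataclass(frozen=True)
class Assertion:
    name: str
    view: Callable[[World], Set[Tuple]]  # evaluates the CQ view on an instance
    count: float                          # right-hand side of the # statement

def v1(world: World) -> Set[Tuple]:
    # V1(x) :- R(x,-): project R onto its first column.
    return {(x,) for (x, _) in world["R"]}

def v2(world: World) -> Set[Tuple]:
    # V2(y) :- R(-,y), S(y): second column of R, restricted to values in S.
    return {(y,) for (_, y) in world["R"] if (y,) in world["S"]}

# The program from the slide: #V1 = 20 and #V2 = 50.
program = [Assertion("V1", v1, 20), Assertion("V2", v2, 50)]

# A small made-up instance, used only to show how view sizes are computed.
world = {"R": {(1, "a"), (1, "b"), (2, "b")}, "S": {("b",), ("c",)}}
sizes = {a.name: len(a.view(world)) for a in program}
```

On this toy instance the view sizes are far from 20 and 50, which is expected: an assertion constrains a model over many instances, not any single instance.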
Model as a Probabilistic Database: Intuitively, # means “expected value”. For V1(x) :- R(x,-) with #V1 = 20 (“The number of values in the output of V1 is 20”), a model is a probabilistic database such that the expected number of tuples in V1 is 20. OK, but which probabilistic database?
Desiderata for our solution: Two desiderata for the distribution. (D1): It should agree with the provided statistics. (D2): It should assume nothing else. Approach: maximize entropy subject to (D1). Challenge: compute the parameters of the MaxEnt distribution. Technical desideratum: we want the parameters analytically.
Notation for Probabilistic Databases: Consider a domain D of size n. Fix a schema R = R1, R2, …. Let Inst(n) be the set of all instances over R on D. An element I of Inst(n) is called a world. A probabilistic database is a pair (Inst(n), p): essentially, any discrete probability distribution on relations.
The semantics of #: # means “expected value”. For V1(x) :- R(x,-), the assertion #V1 = 20 (“The number of values in the output of V1 is 20”) requires $\mathbb{E}_p[|V_1|] = \sum_{I \in Inst(n)} p(I)\,|V_1(I)| = 20$. Achieving (D1): the statistics must agree. NB: In truth, we let n tend to infinity and settle for asymptotic equality.
Multiple Views: Given views V1, V2, …, Vt with #Vi = di for i = 1, …, t, we require $\mathbb{E}_p[|V_i|] = d_i$ for each i. If p satisfies these equations, we have achieved (D1): it agrees with the provided statistics. Many such distributions exist. How do we pick one?
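The expected-value reading of # can be checked by brute force on a tiny domain. The sketch below assumes a schema with a single binary relation R, a domain of size 2, and a uniform distribution over worlds. All of these choices are illustrative assumptions and not the MaxEnt solution.

```python
from itertools import chain, combinations

def all_worlds(domain):
    """Enumerate Inst(n) for a single binary relation R over the domain."""
    pairs = [(a, b) for a in domain for b in domain]
    subsets = chain.from_iterable(
        combinations(pairs, k) for k in range(len(pairs) + 1)
    )
    return [frozenset(s) for s in subsets]

def v1_size(world):
    # |V1(I)| for V1(x) :- R(x,-): number of distinct first components.
    return len({x for (x, _) in world})

def expected_size(p, view_size):
    # E_p[|V|] = sum over worlds I of p(I) * |V(I)|
    return sum(prob * view_size(world) for world, prob in p.items())

domain = [0, 1]
worlds = all_worlds(domain)                     # 2^(n^2) = 16 worlds for n = 2
uniform = {w: 1 / len(worlds) for w in worlds}  # one arbitrary choice of p
e_v1 = expected_size(uniform, v1_size)
```

Under the uniform distribution every tuple is present with probability 1/2, so each domain value appears in V1 with probability 3/4 and `e_v1` is 1.5. A different p over the same worlds gives a different expectation, which is why (D1) alone does not determine p.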
Selecting the best one: Maximize the entropy $H(p) = -\sum_{I \in Inst(n)} p(I) \log p(I)$ subject to the constraints $\mathbb{E}_p[|V_i|] = d_i$ for i = 1, …, t. Achieving (D2): no ad-hoc assumptions. One can show that p has the following form: $p(I) = \frac{1}{Z} \prod_{i=1}^{t} \alpha_i^{|V_i(I)|}$, where Z is a normalizing constant and $\alpha_i$ is a positive parameter for i = 1, …, t. NB: p is a function only of the statistics, and so we have achieved (D2).
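The product form of p follows from a standard Lagrange-multiplier argument. The sketch below is not taken from the slides: the multipliers $\lambda_i$ and $\mu$ are notation introduced here, and the sums range over all worlds I in Inst(n).

```latex
\begin{align*}
\mathcal{L}(p) &= -\sum_{I} p(I)\log p(I)
  + \sum_{i=1}^{t} \lambda_i \Big( \sum_{I} p(I)\,|V_i(I)| - d_i \Big)
  + \mu \Big( \sum_{I} p(I) - 1 \Big), \\
0 = \frac{\partial \mathcal{L}}{\partial p(I)}
  &= -\log p(I) - 1 + \sum_{i=1}^{t} \lambda_i\,|V_i(I)| + \mu, \\
p(I) &= e^{\mu - 1} \prod_{i=1}^{t} \big(e^{\lambda_i}\big)^{|V_i(I)|}
      = \frac{1}{Z} \prod_{i=1}^{t} \alpha_i^{|V_i(I)|},
\qquad \alpha_i = e^{\lambda_i} > 0, \quad Z = e^{1-\mu}.
\end{align*}
```

The parameters are positive because each is an exponential, and Z is fixed by requiring the probabilities to sum to 1.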
Benefits of MaxEnt: Every (consistent) statistical program induces a well-defined distribution, so every query has a well-defined cardinality estimate. Statistics are treated as a whole, not as individual statistics. We can add new statistics to our heart’s content. Technical challenge: compute the parameters $\alpha_i$ of the MaxEnt distribution analytically.
Two quick Examples: I: A material random graph (even simple entropy-maximization (EM) solutions have interesting theory). II: Intersection models (a generating function, and a different, analytic technique).
Example I: Random Graphs are EM: V(x,y) :- R(x,y) with #V = d. Random graph: add edges independently at random, each with probability x. By linearity of expectation, $\mathbb{E}[|V|] = x n^2 = d$, so $x = d / n^2$. This independent-edge distribution is the MaxEnt distribution for this program.
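One way to see why the random graph is MaxEnt: the independent-edge probability of an instance with k edges, $x^k (1-x)^{n^2-k}$, can be rewritten as $\alpha^k / Z$ with $\alpha = x/(1-x)$ and $Z = (1+\alpha)^{n^2}$. This rewriting is derived here rather than copied from the slides. The sketch below checks it numerically for n = 2 and d = 1, which are arbitrary illustrative values.

```python
from itertools import product
import math

n = 2
d = 1.0
cells = n * n                 # number of possible edges R(x,y)
x = d / cells                 # edge probability, from x * n^2 = d
alpha = x / (1 - x)           # MaxEnt parameter (derived here, not from the slides)
Z = (1 + alpha) ** cells      # normalizing constant

max_gap = 0.0
expected_edges = 0.0
for bits in product([0, 1], repeat=cells):     # every world over n^2 cells
    k = sum(bits)                               # |V(I)| = number of edges
    p_independent = x ** k * (1 - x) ** (cells - k)
    p_maxent = alpha ** k / Z
    max_gap = max(max_gap, abs(p_independent - p_maxent))
    expected_edges += p_maxent * k

# Z should also equal (1 - x) ** (-n^2).
z_matches = math.isclose(Z, (1 - x) ** (-cells))
```

The two probabilities agree on every world, and the expected number of edges under the product form equals d.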
Example II: an intersection model: V(x) :- R1(x), R2(x) with #R1 = d1, #R2 = d2, #V = d3. Read: each element is in neither relation, in R1 only, in R2 only, or in all three of R1, R2, and V. E.g., in the generating function, a term with $x_1^k$ corresponds to an instance with k distinct values in R1.
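The generating function itself did not survive in this transcript. Assuming the variables x1, x2, x3 play the role of the MaxEnt parameters for R1, R2, and V, the per-element reading suggests the closed form $Z = (1 + x_1 + x_2 + x_1 x_2 x_3)^n$. The sketch below checks that assumed closed form against a brute-force sum over all worlds; the function names and the numeric values are hypothetical.

```python
from itertools import product

def brute_force(n, x1, x2, x3):
    """Sum x1^|R1| * x2^|R2| * x3^|V| over all worlds on a domain of size n.

    Returns the sum Z and the expected size of R1 under p(I) = weight / Z.
    """
    z = 0.0
    r1_weighted = 0.0
    # A world assigns each element a pair (in R1?, in R2?); V = R1 intersect R2.
    for world in product([(0, 0), (1, 0), (0, 1), (1, 1)], repeat=n):
        r1 = sum(a for a, _ in world)
        r2 = sum(b for _, b in world)
        v = sum(a * b for a, b in world)
        weight = x1 ** r1 * x2 ** r2 * x3 ** v
        z += weight
        r1_weighted += weight * r1
    return z, r1_weighted / z

def z_closed_form(n, x1, x2, x3):
    # One factor per element: neither, R1 only, R2 only, or all three.
    return (1 + x1 + x2 + x1 * x2 * x3) ** n

def expected_r1_closed_form(n, x1, x2, x3):
    # E[|R1|] = x1 * d(log Z)/d(x1)
    return n * (x1 + x1 * x2 * x3) / (1 + x1 + x2 + x1 * x2 * x3)

n, x1, x2, x3 = 3, 0.5, 0.25, 2.0
z_bf, e_r1_bf = brute_force(n, x1, x2, x3)
z_cf = z_closed_form(n, x1, x2, x3)
e_r1_cf = expected_r1_closed_form(n, x1, x2, x3)
```

Expectations such as E[|R1|] come from differentiating log Z, so solving for the parameters amounts to inverting these expressions against d1, d2, and d3.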
Results in the paper: A normal form for statistical programs. Syntactic classes that we can solve analytically: “Project-Semijoin” queries (previous slide). A general technique, conditioning: start with a tuple-independent prior and condition; this introduces inclusion constraints. Extensions to handle histograms.
Conclusion: We showed a principled, general model for database statistics based on MaxEnt. We analytically solved syntactic classes of statistics. Applications: query feedback and the cloud.