General Database Statistics Using Maximum Entropy
Raghav Kaushik (Microsoft Research), Christopher Ré (University of Wisconsin-Madison), and Dan Suciu (University of Washington)

Study: Cardinality Estimation
1. Model: the information that the optimizer knows.
2. Prediction: use the model to estimate the cardinality of future queries ("We estimate that the number of distinct Employees is 10").
Contribution: a principled, declarative approach to cardinality estimation based on entropy maximization. We propose a declarative language with statistical assertions.

Motivating Applications
1. Incorporating query feedback records
2. Optimizers for new domains (DB Kit 2.0): cloud computing, information extraction
3. Data generation and description
Underutilized: there is no general-purpose mechanism.

Outline
- Statistical programs and desiderata
- Semantics of statistical programs
- Two examples
- Conclusions

Statistical Assertions
An assertion is a conjunctive-query (CQ) view plus a sharp (#) statement:
V1(x) :- R(x,-), with #V1 = 20 ("the number of values in the output of V1 is 20")
V2(y) :- R(-,y), S(y), with #V2 = 50 ("the number of values in the output of V2 is 50")
A program is a set of such assertions: V(x) :- R(x,y), …, #V = …
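To make the shape of such programs concrete, here is a minimal sketch in Python (a hypothetical representation of my own; the paper defines the language declaratively, not through this API):

```python
from dataclasses import dataclass

@dataclass
class Assertion:
    """One statistical assertion: a named CQ view plus its asserted size."""
    view_name: str        # e.g. "V1"
    definition: str       # the CQ body, kept as text for illustration
    expected_size: float  # the value d in "#V = d"

# The two-assertion program from this slide.
program = [
    Assertion("V1", "V1(x) :- R(x,-)", 20),
    Assertion("V2", "V2(y) :- R(-,y), S(y)", 50),
]

for a in program:
    print(f"#{a.view_name} = {a.expected_size}  where  {a.definition}")
```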

Model as a Probabilistic Database
Intuitively, # is "expected value": for V1(x) :- R(x,-) with #V1 = 20 ("the number of values in the output of V1 is 20"), a model is a probabilistic database such that the expected number of tuples in V1 is 20. OK, but which probabilistic database?

Desiderata for Our Solution
Two desiderata for the distribution:
(D1) It should agree with the provided statistics.
(D2) It should assume nothing else.
Approach: maximize entropy subject to (D1).
Challenge: compute the parameters of the MaxEnt distribution. Technical desideratum: we want the parameters analytically.

Notation for Probabilistic Databases
Consider a domain D of size n, and fix a schema R = R1, R2, …
Let Inst(n) = all instances over R on D. An element I of Inst(n) is called a world.
A probabilistic database is a pair (Inst(n), p): essentially, any discrete probability distribution on relations.
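To ground the notation, here is a minimal runnable sketch (a toy of my own, not from the paper) that enumerates Inst(n) for one unary relation over a two-element domain and computes an expected view cardinality under a uniform p:

```python
from itertools import chain, combinations

# Toy setting: one unary relation R over the domain D = {1, 2}.
D = [1, 2]

def powerset(xs):
    return chain.from_iterable(combinations(xs, k) for k in range(len(xs) + 1))

# Inst(n): every subset of D is a possible world for R.
worlds = [frozenset(w) for w in powerset(D)]

# A probabilistic database (Inst(n), p): here, the uniform distribution.
p = {w: 1.0 / len(worlds) for w in worlds}

# Expected cardinality of the identity view V(x) :- R(x).
expected_size = sum(p[w] * len(w) for w in worlds)
print(expected_size)  # 1.0 under the uniform distribution over subsets of {1, 2}
```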

The Semantics of #
# means "expected value". Achieving (D1), the stats must agree: for V1(x) :- R(x,-) with #V1 = 20 ("the number of values in the output of V1 is 20"), we require $E_p[|V_1(I)|] = \sum_{I \in Inst(n)} p(I)\,|V_1(I)| = 20$.
NB: in truth, we let n tend to infinity, and settle for asymptotically equal.

Multiple Views
Achieving (D1), the stats must agree: given V_1, V_2, … with #V_i = d_i for i = 1, …, t, we require $E_p[|V_i(I)|] = d_i$ for every i.
If p satisfies these equations, we have achieved (D1): p agrees with the provided statistics. Many such distributions exist. How do we pick one?

Selecting the Best One
Achieving (D2), no ad hoc assumptions: maximize entropy subject to the constraints,
maximize $H(p) = -\sum_{I \in Inst(n)} p(I) \log p(I)$ subject to $\sum_I p(I) = 1$ and $E_p[|V_i(I)|] = d_i$ for $i = 1, \ldots, t$.

One can show that p has the following form:
$p(I) = \frac{1}{Z} \prod_{i=1}^{t} \alpha_i^{|V_i(I)|}$,
where Z is a normalizing constant and $\alpha_i$ is a positive parameter for i = 1, …, t.
NB: p is only a function of the stats, and so we have achieved (D2).
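To see the parameters at work, here is a minimal sketch (my own illustration, not code from the paper) for the simplest case: a single unary relation R over a domain of size n with one statistic #R = d. Under $p(I) \propto \alpha^{|I|}$, each of the n potential tuples is present independently with probability $\alpha/(1+\alpha)$, so the constraint $E[|R|] = d$ can be solved for $\alpha$ analytically:

```python
n, d = 100, 20  # domain size and the asserted statistic #R = d

# Under p(I) proportional to alpha^{|I|}, Z = (1 + alpha)^n, and each of the
# n potential tuples is present independently with probability
# alpha / (1 + alpha). Setting E[|R|] = n * alpha / (1 + alpha) = d gives:
alpha = d / (n - d)

tuple_prob = alpha / (1 + alpha)
print(alpha, tuple_prob, n * tuple_prob)  # 0.25 0.2 20.0
```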

Benefits of MaxEnt
Every (consistent) statistical program induces a well-defined distribution, so every query has a well-defined cardinality estimate. The statistics are handled as a whole, not as individual stats, so we can add new statistics to our heart's content.
Technical challenge: compute the $\alpha_i$ analytically.

Two Quick Examples
Example I: a material random graph. Even simple EM solutions have interesting theory.
Example II: intersection models. A generating function, and a different, analytic technique.

Example I: Random Graphs are EM
V(x,y) :- R(x,y), with #V = d.
Random graph: add each of the $n^2$ possible edges independently at random, with probability x. By linearity of expectation, $E[|V|] = x n^2 = d$.
This is MaxEnt: write $p(I) = x^{|I|}(1-x)^{n^2-|I|}$, which is exactly the MaxEnt product form with $\alpha = x/(1-x)$.
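As a quick sanity check (my own sketch, not from the slides), we can sample random graphs with edge probability $x = d/n^2$ and confirm that the average number of edges approaches d:

```python
import random

n, d = 30, 90       # domain size and the target statistic #V = d
x = d / (n * n)     # edge probability in the random-graph / MaxEnt model

def sample_edge_count() -> int:
    # Each of the n^2 potential edges is included independently with prob. x.
    return sum(1 for _ in range(n * n) if random.random() < x)

trials = 5_000
avg = sum(sample_edge_count() for _ in range(trials)) / trials
print(avg)  # close to d = 90, by linearity of expectation
```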

Example II: An Intersection Model
V(x) :- R1(x), R2(x), with #R1 = d1, #R2 = d2, #V = d3.
Read: each element is either in R1, in R2, or in all three.
(e.g., a term with $x_1^k$ corresponds to an instance with k distinct values in R1)
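The note about terms with $x_1^k$ suggests expanding a multivariate partition function as a polynomial. As a hedged reconstruction (my reading of the slide's "either in R1, in R2, or all three" remark; the paper's exact expression may differ), each of the n domain elements contributes one independent factor:

```latex
% Hedged reconstruction: each element is in neither relation, in R_1 only,
% in R_2 only, or in all three (R_1, R_2, and hence V = R_1 \cap R_2).
Z(x_1, x_2, x_3) \;=\; \bigl(1 + x_1 + x_2 + x_1 x_2 x_3\bigr)^{\,n}
% In the expansion, the coefficient of x_1^{k_1} x_2^{k_2} x_3^{k_3} counts
% the instances with k_1 distinct values in R_1, k_2 in R_2, and k_3 in V.
```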

Results in the Paper
- A normal form for statistical programs
- Syntactic classes that we can solve analytically: "project-semijoin" queries (previous slide)
- A general technique, conditioning: start with a tuple-independent prior and condition; this introduces inclusion constraints
- Extensions to handle histograms
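For intuition on the conditioning step, here is the standard construction in symbols (a minimal sketch; the paper's exact formulation may differ): start from a tuple-independent prior $p_0$ and condition it on a constraint $\Gamma$:

```latex
p(I) \;=\; p_0(I \mid \Gamma)
     \;=\; \frac{p_0(I)\,\mathbf{1}[\,I \models \Gamma\,]}
                {\sum_{J \models \Gamma} p_0(J)}
```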

Conclusion
We showed a principled, general model for database statistics based on MaxEnt, and analytically solved syntactic classes of statistics. Applications: query feedback and the cloud.