Selectivity Estimation using Probabilistic Models Authors: Lise Getoor, Ben Taskar, Daphne Koller Presenter: Qidan Cheng

Outline • Introduction • Estimation for a single table • SRMs: Statistical Relational Models • Selectivity Estimation using SRMs • Learning SRMs

Introduction • Accurate estimates of the result size of queries are crucial to several query-processing components of a DBMS: • Cost-based query optimizers: choosing the optimal query execution plan • Query profilers: predicting resource consumption and the distribution of query results • Answering counting queries

Introduction • How do we estimate the size of a selection query over multiple attributes of a single table? Query: select * from R where R.A1 = a1 and … and R.Ak = ak • The result size is determined by the joint frequency distribution P_D of the values of these attributes: Size(Q) = |R| · P_D(a1, …, ak)

Introduction  But … exponential in # of attributes v n, representing all combination of attribute values is infeasible.  Attribute value independence assumption: joint distribution is product of single attribute distributions  Problem: overestimate or underestimate the query size. → Bayesian Network

Estimation for single table • Example: a simple relation R with three attributes: Education (high-school, college, degree): 3 values; Income (low, medium, high): 3 values; Home-Owner (false, true): 2 values. The full joint distribution needs 3 × 3 × 2 = 18 numbers. • Observation: some of the correlations between attributes may be indirect, mediated by other attributes. → Conditional Independence

Estimation for single table • P(H=h | E=e, I=i) = P(H=h | I=i): Home-Owner is conditionally independent of Education given Income. → Compact form of the joint distribution: P(H, E, I) = P(E) P(I|E) P(H|I,E) = P(E) P(I|E) P(H|I) [Figure: BN with edges Education → Income → Home-Owner]
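A minimal sketch of this factored form, with hypothetical CPD numbers; any entry of the 18-number joint is recovered as a product of three local CPD entries:

```python
# Factored joint P(E, I, H) = P(E) * P(I|E) * P(H|I) from the slide,
# with invented CPD values for illustration.
P_E = {"high-school": 0.4, "college": 0.35, "degree": 0.25}
P_I_given_E = {  # P(Income | Education)
    "high-school": {"low": 0.7, "medium": 0.25, "high": 0.05},
    "college":     {"low": 0.3, "medium": 0.5,  "high": 0.2},
    "degree":      {"low": 0.1, "medium": 0.4,  "high": 0.5},
}
P_H_given_I = {  # P(Home-Owner | Income); Home-Owner independent of E given I
    "low":    {True: 0.2, False: 0.8},
    "medium": {True: 0.5, False: 0.5},
    "high":   {True: 0.8, False: 0.2},
}

def joint(e, i, h):
    """P(E=e, I=i, H=h) via the chain rule plus the CI assumption."""
    return P_E[e] * P_I_given_E[e][i] * P_H_given_I[i][h]

print(joint("degree", "high", True))  # 0.25 * 0.5 * 0.8 = 0.1
```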

Estimation for single table • Figure (b) can encode precisely the same joint distribution as Figure (a). [Figures not shown.]

Bayesian Networks [Figure: network with nodes Tuberculosis, Pneumonia, Lung Infiltrates, X-Ray, Sputum Smear] • Nodes = random variables • Edges = direct probabilistic influence • Network structure encodes independence assumptions: X-Ray is conditionally independent of Pneumonia given Lung Infiltrates

Bayesian Networks • Associated with each node X_i there is a conditional probability distribution P(X_i | Pa_i; θ): a distribution over X_i for each assignment to its parents. [Table: CPT for P(I | P, T), giving one distribution over Lung Infiltrates per assignment to Pneumonia and Tuberculosis, e.g. entries 0.8 / 0.2]

BN Semantics • Compact & natural representation: if nodes have ≤ k parents, the network needs ≤ 2^k · n parameters instead of 2^n. • Conditional independencies in the BN structure + local probability models = full joint distribution over the domain. [Figure: the five-node network above]

BNs for Query Estimation • Query: select * from R where R.A1 = a1 and … and R.Ak = ak • P(a1, a2, …, ak) = ∏_i P(ai | parents(ai)) • Use a Bayesian inference algorithm to compute P_D(a1, …, ak); its complexity depends on the BN's connectivity, and it is efficient in practice. • Size(Q) = |R| · P_D(a1, …, ak)
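A minimal sketch of this estimate for the easy case where the WHERE clause binds every variable in the network, so the chain-rule product suffices and no marginalization is needed (with unbound variables one would run an inference algorithm such as variable elimination instead); all CPD numbers are hypothetical:

```python
def selection_size(n_rows, cpds, parents, bindings):
    """Size(Q) = |R| * prod_i P(a_i | parents(a_i)) for a full binding.

    cpds[X][(value, parent_values)] = P(X = value | Pa(X) = parent_values)
    """
    prob = 1.0
    for var, value in bindings.items():
        pa_vals = tuple(bindings[p] for p in parents[var])
        prob *= cpds[var][(value, pa_vals)]
    return n_rows * prob

# Education -> Income -> Home-Owner chain; only the entries this query needs.
parents = {"E": (), "I": ("E",), "H": ("I",)}
cpds = {
    "E": {("degree", ()): 0.25},
    "I": {("high", ("degree",)): 0.5},
    "H": {(True, ("high",)): 0.8},
}
# select * from R where E = 'degree' and I = 'high' and H = true, |R| = 100000
print(selection_size(100_000, cpds, parents,
                     {"E": "degree", "I": "high", "H": True}))  # 10000.0
```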

Join Selectivity Estimation: Naïve Approach • Uniform join assumption, with referential integrity (each Purchase tuple joins with exactly one Person tuple): Size(Purchase ⋈ Person) = |Purchase| [Figure: Person and Purchase tables]

Join Selectivity Estimation: Joining Two Tables • Example query Q: “person.income = high and purchase.type = luxury” • p = P(person.income = high), q = P(purchase.type = luxury) • Naïve estimate: Size(Q) = |Purchase| · p · q • Problems (next two slides): correlated attributes and skewed joins

Correlated Attributes • The attributes of the two different tables are often correlated. [Figure: Person tuples (Income = high / low) linked to Purchase tuples (Type = luxury / necessity)]

Skewed Join • The probability that two tuples join with each other can also be correlated with various attributes. [Figure: joins concentrated on particular Income/Type combinations]

Join Indicator • Query: select * from R, S where R.F = S.K and R.A = a and S.B = b • P(J_F) = probability that a randomly chosen tuple from R joins with a randomly chosen tuple from S • size(Q) = |R| · |S| · P(J_F, a, b)
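To see what the quantity P(J_F, a, b) means, here is a toy sketch that estimates it by sampling random tuple pairs from hypothetical R and S tables and then scales by |R| · |S| (an SRM models this probability with CPDs rather than sampling; Monte Carlo is used here only to illustrate the definition):

```python
import random

random.seed(0)
R = [{"F": random.randint(0, 9), "A": random.choice(["a1", "a2"])} for _ in range(1000)]
S = [{"K": k, "B": "b1" if k < 5 else "b2"} for k in range(10)]

def estimate_join_size(R, S, a, b, n_samples=200_000):
    """size(Q) = |R| * |S| * P(J_F, A=a, B=b), with P estimated by sampling."""
    hits = sum(
        1
        for r, s in ((random.choice(R), random.choice(S)) for _ in range(n_samples))
        if r["F"] == s["K"] and r["A"] == a and s["B"] == b
    )
    return len(R) * len(S) * hits / n_samples

# Exact answer by brute force, for comparison with the sampled estimate.
exact = sum(1 for r in R for s in S
            if r["F"] == s["K"] and r["A"] == "a1" and s["B"] == "b1")
print("sampled:", estimate_join_size(R, S, "a1", "b1"), "exact:", exact)
```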

Statistical Relational Models • Model the distribution of attributes across multiple tables • Allow attribute values to depend on attributes in the same table (like a BN) • Allow attribute values to depend on attributes in other tables along a foreign-key join • Can model the join probability of two tuples using the join indicator variable

Statistical Relational Model • An SRM for a relational database is a pair (S, Θ), which specifies a local probabilistic model for each of the following variables: • a variable R.A for each table R and each attribute A ∈ R.* • a boolean join indicator variable R.J_F for each foreign key F • For each variable R.X: • S specifies a set of parents Pa(R.X) • Θ specifies a CPD P(R.X | Pa(R.X))

Example SRM [Figure: SRM over Person (Income, Age), School (Prestige), and Purchase (Type), with join indicators J_school (Attended) and J_person (Bought-by) and CPT fragments, e.g. entries conditioned on Type = necessity and Income = high]

Universal Foreign Key Closure • Schema: R, S, T; R.F1 refers to S, S.F2 refers to T; stratification: T < S < R. [Figure: tuple variables r, s, t with r.F1 = s.K and s.F2 = t.K] • Schema: R, S; R.F1 refers to S and R.F2 refers to S; stratification: S < R. [Figure: tuple variable r with r.F1 = s1.K and r.F2 = s2.K]

Universal Foreign Key Closure • Minimal extension Q+ of a query Q: let Q be a key-join query over tuple variables r1, r2, …, rk. For each r, if there is an attribute R.A with parent R.F.B, where R.F points to S, then Q+ contains a unique tuple variable s representing the join r.F = s.K. • Proposition: let Q be a query and let Q+ be its minimal extension. Then size_Q[D] = size_Q+[D].

Answering Queries Using SRMs • Construct a query-evaluation BN for the query: select * from Person, Purchase where Person.id = Purchase.buyer-id and Person.Income = high and Purchase.Type = luxury [Figure: query-evaluation BN over J_person, Income, Type, Age, Prestige, J_school, School] • Compute the upward closure of the query attributes by including all their parents as well (see the sketch below).
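A minimal sketch of the upward-closure step; the parent map below is a hypothetical stand-in loosely mirroring the example SRM:

```python
def upward_closure(query_vars, parents):
    """Smallest parent-closed set of variables containing query_vars."""
    closed, frontier = set(), list(query_vars)
    while frontier:
        v = frontier.pop()
        if v not in closed:
            closed.add(v)
            frontier.extend(parents.get(v, ()))
    return closed

# Hypothetical parent map for the query-evaluation BN above.
parents = {
    "Purchase.J_person": ("Purchase.Type", "Person.Income"),
    "Purchase.Type": ("Person.Income",),
    "Person.Income": ("Person.Age", "School.Prestige"),
    "School.Prestige": ("School.J_school",),
}
print(sorted(upward_closure(
    {"Person.Income", "Purchase.Type", "Purchase.J_person"}, parents)))
# pulls in Person.Age, School.Prestige, and School.J_school via parent edges
```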

SRM Learning • Learn the parameters & the qualitative dependency structure • Extend known techniques for learning Bayesian networks from data [Figure: database with Patient, Strain, and Contact tables]

Structure selection • Define a scoring function: the log-likelihood l(θ, S | D) = log P(D | S, θ). Find the model with maximum log-likelihood given the data. • Perform a greedy local structure search.
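A toy sketch of the scoring step on invented data: the log-likelihood decomposes over CPDs, so two candidate structures for a two-variable (Income, Home-Owner) table can be scored and compared directly; a greedy search would apply such comparisons to single-edge changes (this simplified version scores raw likelihood only, without the refinements a real learner would add):

```python
import math
from collections import Counter

rows = [("high", True), ("high", True), ("low", False), ("low", True)]

def score(rows, home_depends_on_income):
    """log P(D | S, theta_MLE) for toy (Income, Home) rows under two structures."""
    n = len(rows)
    p_inc = Counter(i for i, _ in rows)
    ll = sum(math.log(p_inc[i] / n) for i, _ in rows)  # P(Income) term
    if home_depends_on_income:                         # Income -> Home edge
        joint = Counter(rows)
        ll += sum(math.log(joint[(i, h)] / p_inc[i]) for i, h in rows)
    else:                                              # Home marginally independent
        p_home = Counter(h for _, h in rows)
        ll += sum(math.log(p_home[h] / n) for _, h in rows)
    return ll

print(score(rows, True), score(rows, False))
# the structure with the Income -> Home edge scores higher on this data
```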

Parameter Estimation • The model contains a parameter θ_a|x for each value a of A and each assignment of values x to the parents X: θ_a|x = P(R.A = a | X = x) • Estimate from data as a ratio of frequency counts: θ_a|x = F_D(R.A = a, X = x) / F_D(X = x)
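A minimal sketch of these count-ratio estimates over a few in-memory rows (in the actual system the counts F_D would come from the database):

```python
from collections import Counter

rows = [
    {"Income": "high", "Home": True}, {"Income": "high", "Home": True},
    {"Income": "high", "Home": False}, {"Income": "low", "Home": False},
    {"Income": "low", "Home": False}, {"Income": "low", "Home": True},
]

def estimate_cpd(rows, child, parent):
    """theta_{a|x} = F_D(child = a, parent = x) / F_D(parent = x)."""
    joint = Counter((r[parent], r[child]) for r in rows)
    marginal = Counter(r[parent] for r in rows)
    return {x: {a: c / marginal[x] for (px, a), c in joint.items() if px == x}
            for x in marginal}

print(estimate_cpd(rows, child="Home", parent="Income"))
# {'high': {True: 0.67, False: 0.33}, 'low': {False: 0.67, True: 0.33}}
```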

System Architecture [Figure: offline, a Model Constructor builds an SRM from the Database; at execution time, the Selectivity Estimator takes a query Q and returns Size(Q)]

Conclusions • SRMs are unique in their ability to handle both select and join operators • They estimate the high-dimensional joint distribution using a set of lower-dimensional conditional distributions • To do: • Incremental maintenance of the SRM as the database changes • Joins over non-key attributes

Selected Publications
o “Learning Probabilistic Models of Link Structure”, L. Getoor, N. Friedman, D. Koller and B. Taskar, JMLR.
o “Probabilistic Models of Text and Link Structure for Hypertext Classification”, L. Getoor, E. Segal, B. Taskar and D. Koller, IJCAI Workshop ‘Text Learning: Beyond Classification’.
o “Selectivity Estimation using Probabilistic Models”, L. Getoor, B. Taskar and D. Koller, SIGMOD-01.
o “Learning Probabilistic Relational Models”, L. Getoor, N. Friedman, D. Koller, and A. Pfeffer, chapter in Relational Data Mining, eds. S. Dzeroski and N. Lavrac; see also N. Friedman, L. Getoor, D. Koller, and A. Pfeffer, IJCAI-99.
o “Learning Probabilistic Models of Relational Structure”, L. Getoor, N. Friedman, D. Koller, and B. Taskar, ICML-01.
o “From Instances to Classes in Probabilistic Relational Models”, L. Getoor, D. Koller and N. Friedman, ICML Workshop on Attribute-Value and Relational Learning: Crossing the Boundaries.
o Notes from the AAAI Workshop on Learning Statistical Models from Relational Data, eds. L. Getoor and D. Jensen.
o Notes from the IJCAI Workshop on Learning Statistical Models from Relational Data, eds. L. Getoor and D. Jensen.
See