Required Sample size for Bayesian network Structure learning

Presentation transcript:

Required Sample size for Bayesian network Structure learning Samee Ullah Khan and Kwan Wai Bong Peter

Outline: Motivation; Introduction; Sample Complexity (Sanjoy Dasgupta, Russell Greiner, Nir Friedman, David Haussler); Summary; Conclusion

Motivation: John works at a pharmaceutical company. What is the optimal sample size of a clinical trial? It is a function of both the statistical significance of the difference and the magnitude of the apparent difference between performances. Purpose: a tool (measure) for public and commercial vendors to plan clinical trials. Looking for: acceptance from potential users; statistically significant evidence.

Motivation: Solution. Optimize the difference between the performances of both treatments. Let C = (expected cost of new treatment) − (expected cost of old treatment).

Motivation C=0, m= users,  is the difference in performance

Motivation C>0

Motivation C<0

Motivation: Conclusion. The actual improvement in performance is known. The analysis may be extended to uncertainty about the amount of improvement. It is also possible to shift the functions 1σ′ or 2σ′ to the right, where σ′ is the standard deviation of the posterior distribution of the unknown parameter δ.

Motivation: Model. Paired observations (X1, Y1), (X2, Y2), …, where Xi is the new clinical outcome and Yi is the old clinical outcome. Let Z be the objective function, Zi = Xi − Yi (i = 1, 2, 3, …). Assume that Zi has normal density N(δ, σ²). To formulate our previous knowledge about δ, assume a prior density N(μ, τ²). Under these assumptions (with σ² known), the sample mean of the Zi is a sufficient statistic for the parameter δ.

Introduction: Efficient learning: more accurate models with less data. Compare estimating P(A) and P(B) with estimating the joint P(A,B); the former requires less data. Discover structural properties of the domain. Identifying independencies in the domain helps to order events that occur sequentially and supports sensitivity analysis and inference. Predict the effect of actions; this involves learning causal relationships among variables.
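
The "less data" point can be made concrete by counting free parameters. The sketch below is an illustration only; the function names are hypothetical and the variables are assumed to be discrete with known cardinalities.

```python
def free_params_independent(cardinalities):
    """Free parameters when each variable gets its own marginal table."""
    return sum(r - 1 for r in cardinalities)


def free_params_joint(cardinalities):
    """Free parameters of one full joint table over all variables."""
    size = 1
    for r in cardinalities:
        size *= r
    return size - 1


# Binary variables: the marginal model grows linearly, the joint exponentially.
for n in (2, 5, 10, 20):
    cards = [2] * n
    print(n, free_params_independent(cards), free_params_joint(cards))
```

With fewer parameters to estimate, each one is supported by more of the available samples, which is the intuition behind exploiting independencies.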

Introduction Why Struggle for Accurate Structure

Introduction: Adding an arc. Increases the number of parameters to be fitted. Introduces wrong assumptions about causality and domain structure.

Introduction: Deleting an arc. Cannot be compensated for by accurate fitting of parameters. Also misses causality and domain structure.
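
A small sketch of how one extra arc inflates the parameter count, assuming discrete variables with table-based conditional distributions. The network, the cardinalities, and the helper name `num_parameters` are made up for illustration.

```python
def num_parameters(parents, cardinality):
    """Free parameters of a discrete Bayesian network with full CPTs.

    parents: dict mapping each node to the list of its parent nodes.
    cardinality: dict mapping each node to its number of values.
    """
    total = 0
    for node, pa in parents.items():
        rows = 1  # number of joint parent configurations
        for p in pa:
            rows *= cardinality[p]
        total += (cardinality[node] - 1) * rows
    return total


cardinality = {"A": 2, "B": 3, "C": 2}
chain = {"A": [], "B": ["A"], "C": ["B"]}          # A -> B -> C
with_arc = {"A": [], "B": ["A"], "C": ["A", "B"]}  # extra arc A -> C

print(num_parameters(chain, cardinality))
print(num_parameters(with_arc, cardinality))
```

The table for C multiplies in size by the cardinality of the new parent, so the cost of a spurious arc grows with the number of values the parent can take.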

Introduction: Approaches to learning structure. Constraint based: perform tests of conditional independence, then search for a network that is consistent with the observed dependencies and independencies. Score based: define a score that evaluates how well the (in)dependencies in a structure match the observations, then search for a structure that maximizes the score.

Introduction: Constraints versus scores. Constraint based: intuitive, follows closely the definition of BNs; separates structure construction from the form of the independence tests; sensitive to errors in individual tests. Score based: statistically motivated; can make compromises. Both: consistent, i.e. with sufficient amounts of data and computation they learn the correct structure.
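
As an illustration of the kind of test a constraint-based learner relies on, here is a minimal sketch of a marginal independence statistic for two discrete variables. It uses the empirical mutual information and the related G statistic (2·N·MI in nats); this is one common choice, not necessarily the test used in the presentation, and the function names are hypothetical.

```python
import math
from collections import Counter


def mutual_information(pairs):
    """Empirical mutual information (in nats) of a list of (x, y) pairs."""
    n = len(pairs)
    joint = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x,y) / (p(x) p(y)) simplifies to c * n / (count_x * count_y)
        mi += (c / n) * math.log(c * n / (px[x] * py[y]))
    return mi


def g_statistic(pairs):
    """G statistic; under independence it is roughly chi-square distributed."""
    return 2.0 * len(pairs) * mutual_information(pairs)


independent = [(0, 0), (0, 1), (1, 0), (1, 1)] * 25
dependent = [(0, 0), (1, 1)] * 50
print(g_statistic(independent), g_statistic(dependent))
```

For two binary variables the statistic has one degree of freedom, so a value above about 3.84 would reject independence at the 5% level. A single wrong decision of this kind is exactly the "error in an individual test" that constraint-based methods are sensitive to.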

Dasgupta’s model: Based on Haussler’s extension of the PAC framework. Situation: fixed network structure. Goal: to learn the conditional probability functions accurately.

Dasgupta’s model: A learning algorithm A is given: an approximation parameter ε > 0; a confidence parameter 0 < δ < 1; variables x1, x2, …, xn drawn from an instance space X; an oracle which randomly generates instances of X according to some unknown distribution P that we are going to learn; and some hypothesis class H.

Dasgupta’s model: Output: a hypothesis h ∈ H such that, with probability > 1 − δ, $d(P, h) \le d(P, h_{opt}) + \varepsilon$, where d(·,·) is a distance measure and hopt is the concept h′ ∈ H that minimizes d(P, h′).

Dasgupta’s model: Distance measure. Most intuitive: the L1 norm, $d_1(P, h) = \sum_x |P(x) - h(x)|$. Most popular: the Kullback-Leibler divergence (relative entropy), $d_{KL}(P, h) = \sum_x P(x) \log \frac{P(x)}{h(x)}$. Minimizing dKL with respect to the empirically observed distribution is equivalent to solving the maximum likelihood problem.

Dasgupta’s model: Distance measure. Disadvantage of dKL: it is unbounded. The measure adopted in this model is relative entropy with log replaced by ln (the natural logarithm).

Dasgupta’s model: The algorithm, given m samples drawn from some distribution P, finds the best-fitting hypothesis by evaluating each h(,) ∈ H(,): it computes the empirical log loss E(−ln h(,)) and returns the hypothesis with the smallest value, where H(,) ⊆ H is called an (,)-bounded approximation of H.
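
The equivalence between minimizing dKL to the empirical distribution and maximizing likelihood can be checked numerically. The sketch below assumes distributions over a small finite alphabet stored as dictionaries; the sample and the two candidate hypotheses `h1` and `h2` are made up for illustration.

```python
import math
from collections import Counter


def l1_distance(p, q):
    """Sum of absolute differences between two distributions."""
    keys = set(p) | set(q)
    return sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


def kl_divergence(p, q):
    """sum_x p(x) ln(p(x)/q(x)); infinite when q(x) = 0 but p(x) > 0."""
    total = 0.0
    for k, pk in p.items():
        if pk == 0.0:
            continue
        qk = q.get(k, 0.0)
        if qk == 0.0:
            return math.inf
        total += pk * math.log(pk / qk)
    return total


def avg_log_likelihood(sample, q):
    """Average log-likelihood of the sample under hypothesis q."""
    return sum(math.log(q[x]) for x in sample) / len(sample)


sample = ["a"] * 6 + ["b"] * 3 + ["c"] * 1
empirical = {k: c / len(sample) for k, c in Counter(sample).items()}
h1 = {"a": 0.5, "b": 0.3, "c": 0.2}
h2 = {"a": 0.6, "b": 0.3, "c": 0.1}

for name, h in (("h1", h1), ("h2", h2)):
    print(name, kl_divergence(empirical, h), avg_log_likelihood(sample, h))
```

The KL divergence from the empirical distribution equals the negative empirical entropy minus the average log-likelihood, and the entropy term does not depend on the hypothesis. Ranking hypotheses by smallest KL therefore gives the same order as ranking them by largest likelihood.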

Dasgupta’s model By using Hoeffding and Chernoff bounds, the number of samples needed is bounded by Lower bound:
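
To give a feel for how Hoeffding-type arguments turn ε and δ into a sample size, here is a sketch of the textbook two-sided Hoeffding bound for estimating the mean of a single bounded quantity. This is the generic inequality, not Dasgupta's bound for Bayesian networks; the function name and the `value_range` parameter are assumptions made for the example.

```python
import math


def hoeffding_sample_size(epsilon, delta, value_range=1.0):
    """Smallest m with 2 * exp(-2 m eps^2 / range^2) <= delta.

    Guarantees |empirical mean - true mean| <= epsilon with
    probability at least 1 - delta for i.i.d. values in an
    interval of width `value_range`.
    """
    m = value_range ** 2 * math.log(2.0 / delta) / (2.0 * epsilon ** 2)
    return math.ceil(m)


for eps in (0.1, 0.05, 0.01):
    print(eps, hoeffding_sample_size(eps, delta=0.05))
```

The dependence is quadratic in 1/ε but only logarithmic in 1/δ, so tightening the accuracy is far more expensive than tightening the confidence.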

Russell Greiner’s claim: Many learning algorithms that determine which Bayesian network is optimal are based on measures such as log-likelihood, MDL, or BIC. These typical measures are independent of the queries that will be posed. Learning algorithms should consider the distribution of queries as well as the underlying distribution of events, and seek the BN with the best performance over the query distribution rather than the one that appears closest to the underlying event distribution.

Russell Greiner’s model: Let V be the set of the N variables, SQ the set of all possible legal statistical queries, and sq(x; y) a distribution over SQ. Suppose we fix a network B over V, and let B(x|y) be the real-valued probability that B returns for this assignment. Given the distribution sq(·;·) over SQ, the “score” of B is errsq,p(B), written err(B) when sq and p are clear from context.

Russell Greiner’s model: Observation: any Bayesian network B* that encodes the underlying distribution p(·) will in fact produce the optimal performance; i.e. err(B*) will be optimal. This means that if we have a learning algorithm that produces better approximations to p(·) as it sees more training examples, then in the limit the sq(·) distribution becomes irrelevant.

Russell Greiner’s model: Given a set of labeled statistical queries Q = {<xi; yi; pi>}i, the empirical score of the Bayesian net is computed over Q.

Russell Greiner’s model: Computing err(B). It is #P-hard to compute the estimate of err(B) from general statistical queries. If we know that all queries sq(x; y) encountered satisfy p(y) ≥ γ for some γ > 0, then we only need a bounded number of complete event examples, together with example queries, to obtain an ε-close estimate with probability at least 1 − δ.
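
A minimal sketch of an empirical score over labeled queries. It assumes the score is the mean squared difference between the network's answer B(x|y) and the labeled probability p; the transcript does not preserve the exact formula, so that choice is an assumption. The lookup table standing in for a Bayesian network and all names are hypothetical.

```python
def empirical_score(bn_answer, labeled_queries):
    """Mean squared gap between the network's answers and the labels.

    bn_answer: callable (x, y) -> probability returned by the network.
    labeled_queries: iterable of (x, y, p) triples.
    """
    queries = list(labeled_queries)
    total = 0.0
    for x, y, p in queries:
        total += (bn_answer(x, y) - p) ** 2
    return total / len(queries)


# Stand-in for a real network: a table of conditional answers.
answers = {("rain", "cloudy"): 0.8, ("rain", "clear"): 0.1}


def toy_network(x, y):
    return answers[(x, y)]


queries = [("rain", "cloudy", 0.7), ("rain", "clear", 0.1)]
print(empirical_score(toy_network, queries))
```

Because the average runs over the queries actually posed, a network can score well here while being a poor fit to parts of the event distribution that are never queried.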

Nir Friedman’s model: Review: a BN is composed of two parts, a DAG and a parameter encoding. Setup: let B* be a BN that describes the target distribution from which the training samples are drawn. The error is measured by entropy distance (Kullback-Leibler). The network is learned from random samples, and the error decreases with the sample size N.

Nir Friedman’s model: Learning criteria. Error threshold ε; confidence threshold δ; sample size N(ε, δ). If the sample size is larger than N(ε, δ), then $\Pr\big(D(P_{\mathrm{Lrn}(\cdot)} \,\|\, P) > \varepsilon\big) < \delta$, where Lrn(·) represents the learning routine. If N(ε, δ) is minimal, then it is called the sample complexity.
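
The criterion can be explored by simulation on the simplest possible "network", a single binary variable. The sketch below assumes a Laplace-smoothed frequency estimate as the learning routine and estimates the probability that the entropy distance exceeds ε for several sample sizes. The target probability, ε, and all function names are illustrative choices.

```python
import math
import random


def kl_bernoulli(p, q):
    """KL divergence D(Bernoulli(p) || Bernoulli(q)) in nats, for 0 < q < 1."""
    total = 0.0
    for a, b in ((p, q), (1.0 - p, 1.0 - q)):
        if a > 0.0:
            total += a * math.log(a / b)
    return total


def learn(sample):
    """Laplace-smoothed estimate, so the result is never exactly 0 or 1."""
    return (sum(sample) + 1.0) / (len(sample) + 2.0)


def failure_rate(p, n, epsilon, trials, rng):
    """Fraction of trials in which D(P_learned || P) exceeds epsilon."""
    failures = 0
    for _ in range(trials):
        sample = [1 if rng.random() < p else 0 for _ in range(n)]
        if kl_bernoulli(learn(sample), p) > epsilon:
            failures += 1
    return failures / trials


rng = random.Random(0)
for n in (10, 100, 1000):
    print(n, failure_rate(0.3, n, 0.01, 500, rng))
```

The smallest n at which the printed rate drops below a chosen δ is an empirical stand-in for N(ε, δ) for this particular learner and target.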

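Nir Friedman’s model: Notation. Vector-valued U = {X1, X2, …, Xn}; X, Y, Z denote variables and x, y, z their values. B = <G, θ>, where G is a DAG and θ is the set of parameters, $\theta_{x_i \mid \Pi_{x_i}} = P(x_i \mid \Pi_{x_i})$, with $\Pi_{x_i}$ the parents of $X_i$ in G. The BN is minimal.

Nir Friedman’s model: Learning. Given a training set wN = {u1, …, uN} of instances of U, find the B that best matches the data. The log-likelihood of B: $\mathrm{LL}(B \mid w_N) = \sum_{j=1}^{N} \log P_B(u_j)$. Decomposing the log-likelihood according to the structure: $\mathrm{LL}(B \mid w_N) = N \sum_{i} \sum_{x_i, \Pi_{x_i}} \hat{P}(x_i, \Pi_{x_i}) \log \theta_{x_i \mid \Pi_{x_i}}$, where $\hat{P}$ is the empirical distribution of the training set.

Nir Friedman’s model: Learning. Assume G has a fixed structure and optimize θ; the maximizing parameters are the empirical conditional frequencies. Argument: likelihood alone favors large networks, which are not desirable.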
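
A minimal sketch of fitting parameters for a fixed structure and evaluating the log-likelihood node by node. It assumes fully observed discrete data stored as dictionaries and table-based conditional distributions; the data set and function names are made up for illustration.

```python
import math
from collections import Counter


def fit_cpts(data, parents):
    """Maximum-likelihood tables: theta[node][(parent_values, value)]."""
    theta = {}
    for node, pa in parents.items():
        joint = Counter((tuple(row[p] for p in pa), row[node]) for row in data)
        marg = Counter(tuple(row[p] for p in pa) for row in data)
        theta[node] = {key: c / marg[key[0]] for key, c in joint.items()}
    return theta


def log_likelihood(data, parents, theta):
    """Sum over samples and nodes of log theta[node | parents]."""
    ll = 0.0
    for row in data:
        for node, pa in parents.items():
            key = (tuple(row[p] for p in pa), row[node])
            ll += math.log(theta[node][key])
    return ll


data = [{"A": 0, "B": 0}] * 3 + [{"A": 0, "B": 1}] + [{"A": 1, "B": 1}] * 4
empty = {"A": [], "B": []}   # no arcs
arc = {"A": [], "B": ["A"]}  # A -> B

for name, g in (("empty", empty), ("arc", arc)):
    print(name, log_likelihood(data, g, fit_cpts(data, g)))
```

On the training data the richer structure can never score worse than the sparser one, which is why likelihood on its own keeps adding arcs and a penalty on network size is needed.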

Nir Friedman’s model: PSM. Penalized weighting function. MDL principle: total description length of the data. AIC. BIC.

Nir Friedman’s model: Sample Complexity. Log-likelihood and penalty term; random noise; entropy distance.
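
The two named penalties can be sketched as follows, using the common convention that a higher score is better: AIC subtracts the number of free parameters from the log-likelihood, while BIC subtracts half the log of the sample size per parameter. The description length used by MDL is usually presented as the negative of the BIC-style score. The numbers in the example are illustrative, not results from the presentation.

```python
import math


def aic_score(log_lik, num_params):
    """Log-likelihood penalized by one unit per free parameter."""
    return log_lik - num_params


def bic_score(log_lik, num_params, n_samples):
    """Log-likelihood penalized by (ln N) / 2 per free parameter."""
    return log_lik - 0.5 * math.log(n_samples) * num_params


# Two hypothetical structures fitted to the same 200 samples.
sparse = {"log_lik": -100.0, "num_params": 5}
dense = {"log_lik": -99.5, "num_params": 9}

for name, g in (("sparse", sparse), ("dense", dense)):
    print(name,
          aic_score(g["log_lik"], g["num_params"]),
          bic_score(g["log_lik"], g["num_params"], 200))
```

Here the denser structure fits slightly better but pays for four extra parameters, so both scores prefer the sparser one. The BIC penalty grows with N, while the log-likelihood gap between a correct and an incorrect structure typically grows faster.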

Nir Friedman’s model: Sample Complexity Idealized case

Nir Friedman’s model: Sample Complexity Sub-sampling strategies in learning

Nir Friedman’s model: Summary. A bound on the sample complexity of learning BNs using MDL can be shown. The bound is loose. Searching for an optimal structure is NP-hard.

David Haussler’s model: The model is based on prediction. The learner attempts to infer an unknown target concept f chosen from a concept class F of {0, 1}-valued functions. For any given instance xi, the learner predicts the value of f(xi). After the prediction, the learner is told the correct answer and uses it to improve.

David Haussler’s model: Criteria for sample bounds: the probability of a mistake in predicting f(xm+1) given (x1, f(x1)), …, (xm, f(xm)); the cumulative number of mistakes made over m trials. The model uses the VC dimension.

VC: General condition for uniform convergence. Definition (shattered set): let X be the instance space and C the concept class. A set S ⊆ X is shattered by C if for every S′ ⊆ S there exists c ∈ C which contains all of S′ and none of S − S′. For S ⊆ X, Π_C(S) ⊆ 2^S denotes the subsets of S picked out by concepts in C.

David Haussler’s model: Information gain. At instance m, the learner has observed the labels f(x1), …, f(xm) and must predict f(xm+1).
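
The definition of shattering can be checked by brute force on a small finite domain. The sketch below represents each concept as a set of points and uses intervals over five points as a made-up concept class; the function names are hypothetical.

```python
from itertools import combinations


def is_shattered(points, concepts):
    """True if every subset of `points` is cut out by some concept."""
    points = frozenset(points)
    picked = {frozenset(c) & points for c in concepts}
    return len(picked) == 2 ** len(points)


def vc_dimension(domain, concepts):
    """Largest k such that some k-subset of the domain is shattered."""
    best = 0
    for k in range(1, len(domain) + 1):
        if any(is_shattered(s, concepts) for s in combinations(domain, k)):
            best = k
        else:
            break  # subsets of shattered sets are shattered, so stop here
    return best


domain = list(range(5))
# All intervals {i, ..., j-1} over the domain, including the empty set.
intervals = [set(range(i, j)) for i in range(6) for j in range(i, 6)]

print(is_shattered([1, 3], intervals))
print(is_shattered([0, 2, 4], intervals))
print(vc_dimension(domain, intervals))
```

No interval can contain the two outer points of a triple while excluding the middle one, which is why three points are never shattered by this class.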

David Haussler’s model