Non-Informative Dirichlet Score for learning Bayesian networks
Maomi Ueno and Masaki Uto, University of Electro-Communications, Japan

Presentation transcript:

1. Introduction
Learning Bayesian networks is known to be highly sensitive to the chosen equivalent sample size (ESS) in the Bayesian Dirichlet equivalence uniform (BDeu) score. This sensitivity often engenders unstable or undesired results because the prior of BDeu does not represent ignorance of prior knowledge, but rather a user's prior belief in the uniformity of the conditional distribution. This paper proposes a non-informative Dirichlet score obtained by marginalizing over the possible hypothetical structures that represent the user's prior belief. Numerical experiments demonstrate that the proposed score improves learning accuracy; the results also suggest that it is effective especially for small samples.

2. Learning Bayesian networks
Given a Bayesian network structure g and complete data X, the marginal likelihood of g used to find the MAP structure is obtained as the ML score

p(X \mid g, \alpha) = \prod_{i=1}^{N} \prod_{j=1}^{q_i} \frac{\Gamma(\alpha_{ij})}{\Gamma(\alpha_{ij} + n_{ij})} \prod_{k=1}^{r_i} \frac{\Gamma(\alpha_{ijk} + n_{ijk})}{\Gamma(\alpha_{ijk})},

where n_{ijk} is the number of samples in which variable x_i takes its k-th state and its parent set in g takes its j-th configuration, \alpha_{ijk} denotes the hyperparameters of the Dirichlet prior distribution (the pseudo-sample counts corresponding to n_{ijk}), n_{ij} = \sum_k n_{ijk}, and \alpha_{ij} = \sum_k \alpha_{ijk}. In particular, Heckerman et al. (1995) presented a sufficient condition for satisfying the likelihood equivalence assumption in the form of the following constraint on the hyperparameters:

\alpha_{ijk} = \alpha \, p(x_i = k, \Pi_i = j \mid g^h).

Therein, \alpha is the user-determined equivalent sample size (ESS) and g^h is the hypothetical Bayesian network structure that reflects the user's prior knowledge. This metric was designated the Bayesian Dirichlet equivalence (BDe) score. As Buntine (1991) described, the choice \alpha_{ijk} = \alpha / (r_i q_i) is a special case of the BDe metric; Heckerman et al. (1995) called this special case "BDeu" (a short numerical sketch of this score is given after Section 3 below). Recent reports have described that learning Bayesian networks is highly sensitive to the chosen ESS in BDeu, and that this often engenders unstable or undesirable results (Steck and Jaakkola, 2002; Silander, Kontkanen and Myllymaki, 2007).

3. Problems of the previous works
Theorem 1 (Ueno, 2010): when \alpha is sufficiently large, the log-ML is asymptotically approximated by three terms: a log-posterior term, a penalty term given by the number of parameters, and a term reflecting the difference between the prior and the data. As the two structures (the candidate g and the hypothetical g^h) become equivalent, the penalty term is minimized for a fixed ESS; conversely, as the two structures become different, the term increases. Moreover, \alpha determines the magnitude of the user's prior belief in the hypothetical structure. Thus, the mechanism of BDe makes sense when we have approximate prior knowledge. Ueno (2011) further showed that the prior of BDeu does not represent ignorance of prior knowledge, but rather a user's prior belief in the uniformity of the conditional distribution. This fact is particularly surprising because BDeu had been believed to have a non-informative prior. The results also imply that the optimal ESS becomes large when the empirical conditional distribution is uniform and small when it is skewed; this is the main factor underlying the sensitivity of BDeu to the ESS. To address this sensitivity, Silander, Kontkanen and Myllymaki (2007) proposed marginalizing out the ESS of BDeu by averaging BDeu values over ESS values increasing from 1 to 100 in steps of 1.0. To reduce the computational cost of this marginalization, Cano et al. (2011) proposed averaging over a small set of qualitatively different ESS choices (ESS > 1.0, ESS >> 1.0, ESS < 1.0, ESS << 1.0) and reported that their method performed robustly and efficiently.
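As a concrete illustration of the ML score above in its BDeu form, the following minimal Python sketch computes the local log-BDeu term for a single variable from its observed counts. The function name and interface are illustrative rather than taken from the paper, and SciPy is assumed to be available for the log-Gamma function.

```python
import numpy as np
from scipy.special import gammaln  # log Gamma; avoids overflow in the Gamma-function products

def local_log_bdeu(counts, ess):
    """Local log-BDeu score of one variable (illustrative sketch).

    counts: array of shape (q_i, r_i) holding n_ijk, with one row per parent
            configuration j and one column per state k of the variable.
    ess:    equivalent sample size alpha; BDeu sets alpha_ijk = ess / (r_i * q_i).
    """
    counts = np.asarray(counts, dtype=float)
    q_i, r_i = counts.shape
    a_ijk = ess / (r_i * q_i)      # uniform Dirichlet hyperparameters of BDeu
    a_ij = ess / q_i               # alpha_ij = sum_k alpha_ijk
    n_ij = counts.sum(axis=1)
    score = np.sum(gammaln(a_ij) - gammaln(a_ij + n_ij))
    score += np.sum(gammaln(a_ijk + counts) - gammaln(a_ijk))
    return score

# Example: a binary variable with one binary parent and counts n_ijk
print(local_log_bdeu([[30, 5], [4, 25]], ess=1.0))
```

Summing such local terms over all variables gives the log of the ML score; the single scalar ess enters every local term, which is the source of the sensitivity discussed in Section 3.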
However, such approximate marginalization does not always guarantee robust results when the optimal ESS is extremely small or large. In addition, exact marginalization of the ESS is difficult because the ESS is a continuous variable on the domain (0, \infty).

5. Learning algorithm using dynamic programming
Learning Bayesian networks is known to be an NP-complete problem (Chickering, 1996). Recently, however, exact solution methods have produced results in reasonable computation time when the variables are not prohibitively numerous (e.g. Silander and Myllymaki, 2006; Malone et al., 2011). We employ the dynamic-programming algorithm of Silander and Myllymaki (2006). Our algorithm comprises four steps (a sketch of the dynamic-programming search is given after Section 6 below):
1. Compute the local Expected log-BDe scores for all possible (variable, parent set) pairs.
2. For each variable, find the best parent set within every parent candidate set.
3. Find the best sink for all 2^N variable subsets, and thereby the best ordering.
4. Find the optimal Bayesian network.
Only Step 1 differs from the procedure described by Silander and Myllymaki; Algorithm 1 gives pseudo-code for computing the local Expected log-BDe scores. The time complexity of Algorithm 1 is O(N^2 2^{2(N-1)} \exp(w)) given an elimination order of width w, whereas the traditional scores (BDeu, AIC, BIC, and so on) run in O(N^2 2^{N-1}). The memory required by the proposed computation, however, equals that of the traditional scores.

6. Numerical examples
We conducted simulation experiments to compare Expected log-BDe with BDeu, following the procedure described by Ueno (2011). The reported measures are:
"◦": the number of correct-structure estimates in 100 trials;
"SHD": the Structural Hamming Distance;
"ME": the total number of missing arcs;
"EE": the total number of extra arcs;
"WO": the total number of arcs with wrong orientation;
"MO": the total number of arcs with missing orientation;
"EO": the total number of arcs with extra orientation in the case of likelihood equivalence.
We used small network structures with binary variables (Figs. 1, 2, 3, and 4), in which the distributions change from skewed to uniform. Figure 1 presents a structure in which the conditional probabilities differ greatly across the parent variable states (g1: strongly skewed distribution). By gradually reducing the differences among the conditional probabilities of Fig. 1, we generated Fig. 2 (g2: skewed distribution), Fig. 3 (g3: uniform distribution), and Fig. 4 (g4: strongly uniform distribution). The simulation procedure was as follows.
1. We generated 200, 500, and 1,000 samples from each of the four structures.
2. Using BDeu and Expected log-BDe with \alpha set to 0.1, 1.0, and 10, Bayesian network structures were estimated by exact search from the 200, 500, and 1,000 samples.
3. The number of times the estimated structure was the true structure (i.e., the Structural Hamming Distance is zero; Tsamardinos, Brown, and Aliferis, 2006) was counted by repeating step 2 for 100 iterations.
We employ the variable elimination algorithm (Darwiche, 2009) for the marginalization required by the proposed score. Especially for small samples, a large \alpha setting works effectively for the proposed method: the learning performance with \alpha = 10 for n = 200 is better than with \alpha = 0.1, which suggests that the ESS of Expected log-BDe might be affected only by the sample size. The results also suggest that the optimal ESS increases as the sample size becomes small. Therefore, the ESS of the proposed method works effectively as actual pseudo-samples that augment the data, especially when the sample size is small.
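The dynamic-programming search of Silander and Myllymaki is independent of which decomposable local score is plugged in. The sketch below, an illustrative reconstruction rather than the authors' implementation, follows the four steps listed in Section 5; local_score is assumed to be a user-supplied function such as the local_log_bdeu sketch above or a local Expected log-BDe.

```python
from itertools import combinations

def learn_optimal_network(n_vars, local_score):
    """Exact structure learning in the style of Silander and Myllymaki (2006).

    local_score(i, parents) must return the local score of variable i with the
    given frozenset of parents; any decomposable score (BDeu, Expected log-BDe,
    AIC, BIC, ...) can be plugged in.
    """
    variables = list(range(n_vars))

    # Steps 1-2: best parent set for each variable within every candidate set.
    best_parents = {}  # (variable, candidate set) -> (score, chosen parent set)
    for i in variables:
        others = [v for v in variables if v != i]
        for size in range(len(others) + 1):
            for cand in combinations(others, size):
                cand = frozenset(cand)
                best = (local_score(i, cand), cand)   # use the whole candidate set ...
                for v in cand:                        # ... or the best smaller subset
                    sub = best_parents[(i, cand - {v})]
                    if sub[0] > best[0]:
                        best = sub
                best_parents[(i, cand)] = best

    # Step 3: best sink (last variable of the ordering) for every variable subset.
    best_sink = {frozenset(): (0.0, None)}
    for size in range(1, n_vars + 1):
        for subset in combinations(variables, size):
            subset = frozenset(subset)
            best_sink[subset] = max(
                (best_sink[subset - {s}][0] + best_parents[(s, subset - {s})][0], s)
                for s in subset)

    # Step 4: recover the optimal ordering and the optimal parent sets.
    parents, remaining = {}, frozenset(variables)
    while remaining:
        _, sink = best_sink[remaining]
        remaining = remaining - {sink}
        parents[sink] = best_parents[(sink, remaining)][1]
    return best_sink[frozenset(variables)][0], parents
```

With local_log_bdeu plugged in as local_score this would behave as a standard exact BDeu learner; the method proposed in Section 4 changes only the local score computed in Step 1.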
4. Proposal: NIP-BDe and Expected log-BDe
This paper proposes to marginalize the ML score over the possible hypothetical structures.

Definition 1. NIP-BDe (Non-Informative Prior Bayesian Dirichlet equivalence) is defined as

p(X \mid g, \alpha) = \sum_{g^h \in G} p(g^h) \, p(X \mid g, g^h, \alpha),

where G is the set of possible hypothetical structures and p(X \mid g, g^h, \alpha) is the BDe score of g with hyperparameters \alpha_{ijk} = \alpha \, \hat{p}(x_i = k, \Pi_i = j \mid g^h).

Our method has the following steps:
1. Estimate the conditional probability parameter set \Theta_{g^h} of each hypothetical structure g^h from the data.
2. Estimate the joint probabilities \hat{p}(x_i = k, \Pi_i = j \mid g^h, \hat{\Theta}_{g^h}), which determine the hyperparameters.

First, we estimate the conditional probability parameter set \hat{\Theta}_{g^h} given g^h from the data, using the counts n^{g^h}_{ijk}: the number of samples with x_i = k when the parent variables of x_i in g^h take their j-th configuration (the notation parallels that of Section 2, with parent sets taken from g^h instead of g). Next, we calculate the estimated joint probabilities \hat{p}(x_i = k, \Pi_i = j \mid g^h, \hat{\Theta}_{g^h}) from g^h and \hat{\Theta}_{g^h}. In practice, however, NIP-BDe is difficult to calculate because the product of many probabilities causes serious numerical problems. To avoid this, we propose an alternative score, Expected log-BDe.

Definition 2. Expected log-BDe is defined as

\mathrm{E}_{g^h}\left[\log p(X \mid g, g^h, \alpha)\right] = \sum_{g^h \in G} p(g^h) \log p(X \mid g, g^h, \alpha).

The optimal ESS of BDeu becomes large when the empirical conditional distribution is uniform and small when it is skewed, because the hypothetical structure of BDeu assumes a uniform conditional probability distribution and the ESS adjusts the magnitude of the user's belief in that hypothetical structure. In contrast, the ESS of a fully non-informative prior is expected to work effectively as actual pseudo-samples that augment the data, especially when the sample size is small, regardless of the uniformity of the empirical distribution. This is a unique feature of the proposed method, because previous non-informative methods exclude the ESS from the score.
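Reading Definitions 1 and 2 as weighted averages over a set of hypothetical structures, a local Expected log-BDe term might be assembled as in the sketch below. The helper names, the array shapes, and the assumption that the joint probabilities \hat{p}(x_i = k, \Pi_i = j \mid g^h) have already been estimated (for example by variable elimination, as in Section 6) are illustrative choices, not details taken from the paper.

```python
import numpy as np
from scipy.special import gammaln

def local_log_bde(counts, alpha_ijk):
    """Local log-BDe score of one variable with arbitrary Dirichlet hyperparameters.

    counts, alpha_ijk: arrays of shape (q_i, r_i); under the BDe constraint the
    hyperparameters are alpha_ijk = ess * p_hat(x_i = k, Pi_i = j | g^h), where
    Pi_i is the parent set of x_i in the candidate structure g. All
    hyperparameters are assumed strictly positive.
    """
    counts = np.asarray(counts, dtype=float)
    alpha_ijk = np.asarray(alpha_ijk, dtype=float)
    a_ij, n_ij = alpha_ijk.sum(axis=1), counts.sum(axis=1)
    return (np.sum(gammaln(a_ij) - gammaln(a_ij + n_ij))
            + np.sum(gammaln(alpha_ijk + counts) - gammaln(alpha_ijk)))

def local_expected_log_bde(counts, ess, joint_probs_per_gh, prior_weights):
    """Average the local log-BDe over hypothetical structures g^h.

    joint_probs_per_gh: one (q_i, r_i) array per hypothetical structure, holding
    the estimated joint probabilities p_hat(x_i = k, Pi_i = j | g^h, Theta_hat).
    prior_weights: the prior p(g^h), e.g. uniform over the hypothetical structures.
    """
    return sum(w * local_log_bde(counts, ess * p)
               for w, p in zip(prior_weights, joint_probs_per_gh))
```

Because the log-BDe score decomposes over variables, averaging the local terms in this way and summing over variables corresponds to the Expected log-BDe of the whole structure; plugging local_expected_log_bde into the dynamic-programming learner sketched after Section 6 would then give the overall procedure of Section 5.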