Learning with Bayesian Networks. Author: David Heckerman. Presented by Yan Zhang, April 24, 2006.


1 Learning with Bayesian Networks. Author: David Heckerman. Presented by Yan Zhang, April 24, 2006.

2 Outline
- Bayesian Approach: Bayes Theorem; Bayesian vs. classical probability methods; coin toss example
- Bayesian Network: structure, inference, learning probabilities, learning the network structure; two-coin toss example
- Conclusions
- Exam Questions

3 Bayes Theorem
p(θ|D) = p(D|θ) p(θ) / p(D)
or, for a hypothesized network structure S^h,
p(S^h|D) = p(D|S^h) p(S^h) / p(D)
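For completeness, the normalizing denominator p(D) can be written as an average of the likelihood over the prior; this integral form is standard and was not shown explicitly on the slide:

```latex
p(\theta \mid D) = \frac{p(D \mid \theta)\, p(\theta)}{p(D)},
\qquad
p(D) = \int p(D \mid \theta)\, p(\theta)\, d\theta .
```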

4 Bayesian vs. the Classical Approach
- The Bayesian probability of an event x represents a person's degree of belief or confidence in that event's occurrence, based on prior knowledge and observed facts.
- Classical probability refers to the true or actual probability of the event and does not depend on the observer's beliefs.

5 Bayesian vs. the Classical Approach
- The Bayesian approach restricts its prediction to the next (N+1th) occurrence of an event, given the N previously observed events.
- The classical approach predicts the likelihood of any given event regardless of the number of occurrences observed.

6 Example
- Toss a coin 100 times; let the r.v. X be the outcome of one flip, with p(X=heads) = θ and p(X=tails) = 1-θ.
- Before the experiment we have some belief in mind, the prior: p(θ|ξ) = Beta(θ|a=5, b=5), with E[θ] = a/(a+b) = 0.5 and Var(θ) = ab/[(a+b)^2 (a+b+1)].
- The experiment yields h = 65 heads and t = 35 tails. What is p(θ|D,ξ)?
  p(θ|D,ξ) = p(D|θ,ξ) p(θ|ξ) / p(D|ξ)
           = [k1 θ^h (1-θ)^t] [k2 θ^(a-1) (1-θ)^(b-1)] / k3
           = Beta(θ|a=5+h, b=5+t)
  E[θ|D] = (a+h)/(a+b+h+t) = (5+65)/(5+5+65+35) = 70/110 ≈ 0.64
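A minimal Python sketch of this beta-binomial update, assuming SciPy is available; the prior Beta(5, 5), the counts h = 65 and t = 35, and the resulting posterior mean of about 0.64 come from the slide:

```python
from scipy import stats

# Prior belief about theta = p(heads): Beta(a=5, b=5), mean 0.5
a, b = 5, 5
prior = stats.beta(a, b)
print("prior mean:", prior.mean())          # 0.5
print("prior var:", prior.var())            # ab / [(a+b)^2 (a+b+1)]

# Observed data: 65 heads, 35 tails out of 100 flips
h, t = 65, 35

# Beta is conjugate to the binomial likelihood, so the posterior is Beta(a+h, b+t)
posterior = stats.beta(a + h, b + t)
print("posterior mean:", posterior.mean())  # (5+65)/(5+5+65+35) = 70/110 ≈ 0.636
```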

7 Example

8 Integration
To find the probability that X_{N+1} = heads, we integrate over all possible values of θ to find the average value of θ, which yields the expression reconstructed below:
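The expression following the colon was a figure in the original slides; it is the standard beta-binomial predictive probability, reconstructed from the setup of the coin example:

```latex
p(X_{N+1} = \text{heads} \mid D, \xi)
  = \int_0^1 \theta \, p(\theta \mid D, \xi)\, d\theta
  = E[\theta \mid D]
  = \frac{a + h}{a + b + h + t}
  = \frac{70}{110} \approx 0.64 .
```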

9 Bayesian Probabilities
- Posterior probability, p(θ|D,ξ): the probability of a particular value of θ after the data D have been observed (our final belief about θ).
- Prior probability, p(θ|ξ): the probability of a particular value of θ given no observed data (our previous belief).
- Likelihood, p(D|θ,ξ): the probability of observing the sequence of coin tosses D given that θ takes a particular value.
- p(D|ξ): the marginal probability of the data D.

10 Priors
- In the previous example we used a Beta prior because the observed variable X has only two states/outcomes.
- In general, if the observed variable X is discrete with r possible states {1,…,r}, the likelihood function is p(X=x^k|θ,ξ) = θ_k, where k=1,…,r, θ = {θ_1,…,θ_r}, and ∑_k θ_k = 1.
- We use the Dirichlet distribution as the prior, from which the posterior distribution can be derived (see the forms below).
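The prior and posterior referred to above appeared as figures in the original slides; their standard forms are:

```latex
p(\theta \mid \xi) = \mathrm{Dir}(\theta \mid \alpha_1, \ldots, \alpha_r)
  \propto \prod_{k=1}^{r} \theta_k^{\alpha_k - 1},
\qquad
p(\theta \mid D, \xi) = \mathrm{Dir}(\theta \mid \alpha_1 + N_1, \ldots, \alpha_r + N_r),
```

where N_k is the number of observations of state x^k in D.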

11 Outline
- Bayesian Approach: Bayes Theorem; Bayesian vs. classical probability methods; coin toss example
- Bayesian Network: structure, inference, learning probabilities, learning the network structure; two-coin toss example
- Conclusions
- Exam Questions

12 Introduction to Bayesian Networks
- Bayesian networks represent an advanced form of general Bayesian probability.
- A Bayesian network is a graphical model that encodes probabilistic relationships among variables of interest.
- The model has several advantages for data analysis over rule-based decision trees.

13 Advantages of Bayesian Techniques (1)
How do Bayesian techniques compare to other learning models?
- Bayesian networks can readily handle incomplete data sets.

14 Advantages of Bayesian Techniques (2)
- Bayesian networks allow one to learn about causal relationships.
- We can use observed data to assess the validity of the acyclic graph that represents the Bayesian network; new observations may strengthen or weaken that causal argument.

15 Advantages of Bayesian Techniques (3)
- Bayesian networks readily facilitate the use of prior knowledge.
- Encoding prior knowledge is relatively straightforward: one constructs "causal" edges between any two factors that are believed to be correlated.
- Causal networks represent prior knowledge, whereas the weights of the directed edges can be updated in a posterior manner as new data arrive.

16 Advantages of Bayesian Techniques (4)
- Bayesian methods provide an efficient way to prevent the overfitting of data (there is no need for pre-processing).
- Contradictions do not need to be removed from the data.
- Data can be "smoothed" so that all available data can be used.

17 Example Network
- Consider a credit-fraud network designed to determine the probability of credit fraud based on certain events.
- Variables include:
  Fraud (f): whether fraud occurred or not
  Gas (g): whether gas was purchased within the last 24 hours
  Jewelry (j): whether jewelry was purchased in the last 24 hours
  Age (a): age of the card holder
  Sex (s): sex of the card holder
- The task of determining which variables to include is not trivial and involves decision analysis.

18 Example Network
A Bayesian network for the fraud domain consists of:
- A set of variables X = {X1, …, Xn} (here X1 = Fraud, X2 = Age, X3 = Sex, X4 = Gas, X5 = Jewelry)
- A network structure (the directed acyclic graph over Fraud, Age, Sex, Gas, Jewelry)
- A conditional probability table (CPT) for each variable given its parents; for example, for X5 = Jewelry with parents X1 = Fraud, X2 = Age, X3 = Sex:

  X5 = yes:
    X1=yes, X2=<30,   X3=m : θ511
    X1=yes, X2=<30,   X3=f : θ521
    X1=yes, X2=30-50, X3=m : θ531
    X1=yes, X2=30-50, X3=f : θ541
    X1=yes, X2=>50,   X3=m : θ551
    X1=yes, X2=>50,   X3=f : θ561
    X1=no,  …
  X5 = no:
    X1=yes, X2=<30,   X3=m : θ512
    …

19 Example Network
Network structure: Fraud is a parent of Gas and Jewelry; Age and Sex are parents of Jewelry (nodes X1…X5 as on the previous slide).
Using this graph of expected causes, we can check the following conditional independencies against initial sample data:
p(a|f) = p(a)
p(s|f,a) = p(s)
p(g|f,a,s) = p(g|f)
p(j|f,a,s,g) = p(j|f,a,s)
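To make the factorization concrete, here is a minimal Python sketch that evaluates the joint probability p(f, a, s, g, j) = p(f) p(a) p(s) p(g|f) p(j|f,a,s). The numeric CPT values are illustrative placeholders, not values from the talk:

```python
# Hypothetical CPTs for the fraud network; the numbers are illustrative only.
p_f = {"yes": 0.001, "no": 0.999}                      # p(Fraud)
p_a = {"<30": 0.25, "30-50": 0.40, ">50": 0.35}        # p(Age)
p_s = {"m": 0.5, "f": 0.5}                             # p(Sex)
p_g_given_f = {"yes": {"yes": 0.2, "no": 0.8},         # p(Gas | Fraud)
               "no":  {"yes": 0.01, "no": 0.99}}
p_j_given_fas = {                                      # p(Jewelry | Fraud, Age, Sex)
    ("yes", "<30", "m"): {"yes": 0.05,   "no": 0.95},
    ("no",  "<30", "m"): {"yes": 0.0001, "no": 0.9999},
    # ... remaining parent configurations would be filled in analogously
}

def joint(f, a, s, g, j):
    """p(f, a, s, g, j) via the conditional-independence factorization of slide 19."""
    return (p_f[f] * p_a[a] * p_s[s]
            * p_g_given_f[f][g]
            * p_j_given_fas[(f, a, s)][j])

print(joint("yes", "<30", "m", "yes", "no"))
```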

20 Inference in a Bayesian Network
- The goal is to determine various probabilities of interest from the model.
- Probabilistic inference: the computation of a probability of interest given a model.

21 Learning Probabilities in a Bayesian Network
The physical joint probability distribution for X = (X1, …, X5) can be encoded as a product of local conditional distributions:
p(x | θs, S^h) = ∏_{i=1}^{n} p(x_i | pa_i, θ_i, S^h)
where θs = (θ_1, …, θ_n) and pa_i denotes the configuration of X_i's parents.

22 Learning Probabilities in a Bayesian Network
- As new data arrive, the probabilities in the CPTs need to be updated.
- We can then update each vector of parameters θ_ij independently, just as in the one-variable case.
- Assume each vector θ_ij has the prior distribution Dir(θ_ij | α_ij1, …, α_ijr_i).
- The posterior distribution is p(θ_ij | D, S^h) = Dir(θ_ij | α_ij1 + N_ij1, …, α_ijr_i + N_ijr_i),
  where N_ijk is the number of cases in D in which X_i = x_i^k and Pa_i = pa_i^j.
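A minimal Python sketch of this counting update for a single node, assuming complete data and a fixed structure; the node, its parent, the hyperparameters, and the example records are all hypothetical:

```python
from collections import Counter

# Hypothetical prior Dirichlet hyperparameters alpha_ijk for one node X_i,
# keyed by (parent configuration j, state k). Here X_i = Gas with parent Fraud.
alpha = {("yes", "yes"): 1, ("yes", "no"): 1,
         ("no",  "yes"): 1, ("no",  "no"): 1}

# Hypothetical complete data: (fraud, gas) pairs.
data = [("no", "no"), ("no", "yes"), ("yes", "yes"), ("no", "no")]

# N_ijk: number of cases where X_i is in state k and its parents are in configuration j.
counts = Counter(data)

# Posterior Dirichlet parameters: alpha_ijk + N_ijk, updated independently per parent config.
posterior = {key: alpha[key] + counts.get(key, 0) for key in alpha}
print(posterior)

# Expected probability p(Gas=yes | Fraud=no) under the posterior.
denom = posterior[("no", "yes")] + posterior[("no", "no")]
print(posterior[("no", "yes")] / denom)
```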

23 Learning the Network Structure
- Sometimes the causal relations are not obvious, so we are uncertain about the network structure.
- In theory, we can use the Bayesian approach to obtain a posterior distribution over network structures.
- Unfortunately, the number of possible network structures increases exponentially with n, the number of nodes.

24 Learning the Network Structure
- Model selection: select a "good" model (i.e., a network structure) from all possible models and use it as if it were the correct model.
- Selective model averaging: select a manageable number of good models from among all possible models and pretend that these models are exhaustive.
- Questions: How do we search for good models? How do we decide whether or not a model is "good"?

25 Two Coin Toss Example
- Experiment: flip two coins, X1 and X2, and observe the outcomes.
- We have two network structures in mind: S^h_1 (X1 and X2 independent) and S^h_2 (an arc from X1 to X2).
- Under S^h_1 both coins are fair: p(H) = p(T) = 0.5. Under S^h_2, X1 is fair and X2 depends on X1: p(H|H) = 0.1, p(T|H) = 0.9, p(H|T) = 0.9, p(T|T) = 0.1.
- Suppose p(S^h_1) = p(S^h_2) = 0.5 a priori.
- After observing some data, which model is more probable for this collection of data?

26 Two Coin Toss Example
Observed data D (ten tosses of each coin):
Case  X1  X2
 1    T   T
 2    T   H
 3    H   T
 4    H   T
 5    H   H
 6    H   T
 7    T   H
 8    T   H
 9    H   T
10    H   T
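As a sketch of the comparison the slide asks for, the following Python snippet treats the parameters as fixed at the values given on the previous slide and computes p(D|S^h_1), p(D|S^h_2), and the resulting posterior over structures with equal priors. (A full Bayesian treatment would instead average the likelihood over Dirichlet priors on the parameters.)

```python
# The ten observed (X1, X2) pairs from the slide.
data = [("T", "T"), ("T", "H"), ("H", "T"), ("H", "T"), ("H", "H"),
        ("H", "T"), ("T", "H"), ("T", "H"), ("H", "T"), ("H", "T")]

def lik_s1(d):
    """S1: X1 and X2 independent, both fair coins."""
    p = 1.0
    for x1, x2 in d:
        p *= 0.5 * 0.5
    return p

def lik_s2(d):
    """S2: X1 fair; X2 depends on X1 with p(H|H)=0.1, p(T|H)=0.9, p(H|T)=0.9, p(T|T)=0.1."""
    cpt = {("H", "H"): 0.1, ("H", "T"): 0.9, ("T", "H"): 0.9, ("T", "T"): 0.1}
    p = 1.0
    for x1, x2 in d:
        p *= 0.5 * cpt[(x1, x2)]
    return p

l1, l2 = lik_s1(data), lik_s2(data)
# Posterior over structures with equal priors p(S1) = p(S2) = 0.5.
# Most pairs in the data have opposite outcomes, which S2's CPT favors.
print("p(D|S1) =", l1, " p(D|S2) =", l2)
print("p(S1|D) =", l1 / (l1 + l2), " p(S2|D) =", l2 / (l1 + l2))
```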

27 Outline
- Bayesian Approach: Bayes Theorem; Bayesian vs. classical probability methods; coin toss example
- Bayesian Network: structure, inference, learning probabilities, learning the network structure; two-coin toss example
- Conclusions
- Exam Questions

28 Conclusions
- Bayesian method
- Bayesian network: structure, inference, learning parameters and structure
- Advantages

29 Question 1: What is Bayesian probability?
- A person's degree of belief in a certain event.
- For example, your own degree of certainty that a tossed coin will land heads.

30 Question 2: What are the advantages and disadvantages of the Bayesian and classical approaches to probability?
- Bayesian approach:
  + Reflects an expert's knowledge.
  + The belief keeps updating as new data items arrive.
  - Arbitrary (more subjective).
- Classical probability:
  + Objective and unbiased.
  - Generally not available: it can take a long time to measure an object's physical characteristics.

31 Question 3: Mention at least three advantages of Bayesian analysis
- Handles incomplete data sets
- Supports learning about causal relationships
- Combines domain knowledge and data
- Avoids overfitting

32 The End  Any Questions?