Bayesian Networks
Martin Bachler
MLA - VO, 06.12.2005

Slide 3: Overview
"Microsoft's competitive advantage lies in its expertise in Bayesian networks" (Bill Gates, quoted in the LA Times, 1996)

Slide 4: Overview
- (Recap of) Definitions
- Naive Bayes
  - Performance/optimality?
  - How important is independence?
  - Linearity?
- Bayesian networks

Slide 5: Definitions
Conditional probability: $P(A \mid B) = \frac{P(A \cap B)}{P(B)}$
Bayes theorem: $P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$

Slide 6: Definitions
Bayes theorem, term by term:
$$P(h \mid D) = \frac{\overbrace{P(D \mid h)}^{\text{likelihood}} \cdot \overbrace{P(h)}^{\text{prior probability}}}{\underbrace{P(D)}_{\text{normalization term}}}$$
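
To make the terms concrete, here is a minimal numeric sketch in Python; the probabilities are invented for illustration (a rare hypothesis h, a fairly reliable observation D):

```python
# Toy numbers, assumed for illustration only.
p_h = 0.01              # prior P(h)
p_D_given_h = 0.95      # likelihood P(D|h)
p_D_given_not_h = 0.05  # likelihood P(D|not h)

# Normalization term P(D) via the law of total probability.
p_D = p_D_given_h * p_h + p_D_given_not_h * (1 - p_h)

# Bayes theorem: posterior P(h|D).
p_h_given_D = p_D_given_h * p_h / p_D
print(f"P(h|D) = {p_h_given_D:.3f}")  # ~0.161: strong likelihood, small prior
```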

Slide 7: Definitions
Classification problem:
- Input space $X = X_1 \times X_2 \times \dots \times X_n$
- Output space $Y = \{0, 1\}$
- Target concept $C: X \to Y$
- Hypothesis space $H$
Bayesian way of classifying an instance $x = (x_1, \dots, x_n)$:
$$h(x) = \operatorname*{argmax}_{y \in Y} P(y \mid x_1, \dots, x_n)$$

Slide 8: Definitions
This rule is theoretically OPTIMAL! But for large $n$ the estimation of $P(x_1, \dots, x_n \mid C)$ is very hard!
=> Assumption: pairwise conditional independence between input variables given $C$:
$$P(x_1, \dots, x_n \mid C) = \prod_{i=1}^{n} P(x_i \mid C)$$
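
A quick parameter count shows why the assumption helps: for $n$ binary inputs, estimating the full class-conditional joint requires exponentially many parameters, while the factored form needs only $2n$:

$$\underbrace{2\,(2^n - 1)}_{\text{full joint } P(x_1,\dots,x_n \mid C)} \quad\text{vs.}\quad \underbrace{2n}_{\text{factored } \prod_i P(x_i \mid C)}, \qquad\text{e.g. } n = 20:\ \approx 2.1 \cdot 10^6 \text{ vs. } 40.$$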

Slide 9: Overview
- (Recap of) Definitions
- Naive Bayes
  - Performance/optimality?
  - How important is independence?
  - Linearity?
- Bayesian networks

Slide 10: Naive Bayes
The resulting classifier:
$$h_{NB}(x) = \operatorname*{argmax}_{y \in Y} P(y) \prod_{i=1}^{n} P(x_i \mid y)$$
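
A minimal sketch of this classifier for binary attributes (Python; the Laplace smoothing is an implementation choice added here so unseen value/class combinations do not yield zero probabilities, not part of the slides):

```python
import numpy as np

class NaiveBayesBinary:
    """Minimal naive Bayes for binary features and a binary class.

    A sketch, not a tuned implementation; assumes both classes occur in y.
    """

    def fit(self, X, y):
        X, y = np.asarray(X), np.asarray(y)
        self.log_prior = np.log(np.array([np.mean(y == c) for c in (0, 1)]))
        # theta[c, i] = P(x_i = 1 | C = c), with Laplace smoothing
        self.theta = np.array([
            (X[y == c].sum(axis=0) + 1) / ((y == c).sum() + 2) for c in (0, 1)
        ])
        return self

    def predict(self, X):
        X = np.asarray(X)
        # log P(c) + sum_i log P(x_i | c); theta for x_i = 1, 1 - theta for x_i = 0
        log_lik = X @ np.log(self.theta).T + (1 - X) @ np.log(1 - self.theta).T
        return np.argmax(self.log_prior + log_lik, axis=1)

X = np.array([[1, 0], [1, 1], [0, 1], [0, 0]])
y = np.array([1, 1, 0, 0])
print(NaiveBayesBinary().fit(X, y).predict(X))  # [1 1 0 0]
```

Working in log space avoids numerical underflow for large $n$; the argmax is unaffected.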

Slide 11: Example
[Worked example table: binary attributes $x_1, x_2$ and class $C$, with the class prior $P(C)$ and the conditionals $P(x_1 \mid C)$, $P(x_2 \mid C)$; the table is garbled in the transcript.]

Slide 12: Naive Bayes - Independence
The independence assumption is very strict! For most practical problems it is blatantly wrong! (It is not even fulfilled in the previous example... see later.)
=> Is naive Bayes a rather "academic" algorithm?

Slide 13: Naive Bayes - Independence
For which problems is naive Bayes optimal? (Let's assume for the moment that we can perfectly estimate all necessary probabilities.)
Guess: for problems for which the independence assumption holds.
Let's check... (empirically + theoretically)

Slide 14: Independence - Example
[Table comparing, for each instance $(x_1, x_2, C)$, the true joint $P(x_1, x_2 \mid C)$ with the product $P(x_1 \mid C)\,P(x_2 \mid C)$; the table is garbled in the transcript.]

Slide 15: Independence - Example
[Figure.]

Slide 16: Independence - Example
[A second table of the same form: true joint $P(x_1, x_2 \mid C)$ vs. the product $P(x_1 \mid C)\,P(x_2 \mid C)$; the table is garbled in the transcript.]

Slide 17: Independence - Example
[Figure.]

Slide 18: Naive Bayes - Independence
[Empirical results from [1].]
[1] Domingos, Pazzani: Beyond independence: Conditions for the optimality of the simple Bayesian classifier, ICML 1996

Slide 19: Naive Bayes - Independence
[Figure.]

Slide 20: Naive Bayes - Independence
For which problems is naive Bayes optimal?
Guess: for problems for which the independence assumption holds.
Empirical answer: not really...
Theoretical answer?

Slide 21: Naive Bayes - Optimality
Example: 3 features $x_1, x_2, x_3$; $P(c=0) = P(c=1)$; $x_1, x_3$ independent; $x_2 = x_1$ (totally dependent).
=> optimal classification: predict $c = 1$ iff $P(x_1 \mid 1)\,P(x_3 \mid 1) > P(x_1 \mid 0)\,P(x_3 \mid 0)$ (the duplicate $x_2$ carries no extra information)
naive Bayes: predicts $c = 1$ iff $P(x_1 \mid 1)^2\,P(x_3 \mid 1) > P(x_1 \mid 0)^2\,P(x_3 \mid 0)$ (it counts the evidence from $x_1$ twice)
[1] Domingos, Pazzani: Beyond independence: Conditions for the optimality of the simple Bayesian classifier, ICML 1996
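
The effect is easy to reproduce. The sketch below uses invented class-conditional probabilities, chosen so that doubling the evidence from $x_1$ flips exactly one decision; with these numbers the two classifiers disagree only on the instance $x_1 = 1, x_3 = 1$:

```python
# Assumed class-conditional probabilities, chosen to provoke a disagreement.
p1 = {0: 0.3, 1: 0.6}   # P(x1 = 1 | c)
p3 = {0: 0.5, 1: 0.2}   # P(x3 = 1 | c)

def classify(x1, x3, duplicate_x1):
    """Return argmax_c of the (unnormalized) posterior. With duplicate_x1=True
    the factor for x1 enters twice, as naive Bayes does via the copy x2 = x1."""
    def score(c):
        f1 = p1[c] if x1 else 1 - p1[c]
        f3 = p3[c] if x3 else 1 - p3[c]
        return 0.5 * f1 ** (2 if duplicate_x1 else 1) * f3  # P(c) = 0.5
    return max((0, 1), key=score)

for x1 in (0, 1):
    for x3 in (0, 1):
        opt = classify(x1, x3, duplicate_x1=False)
        nb = classify(x1, x3, duplicate_x1=True)
        flag = "agree" if opt == nb else "DISAGREE"
        print(f"x1={x1} x3={x3}  optimal={opt}  naive={nb}  {flag}")
```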

Slide 22: Naive Bayes - Optimality
Let $p = P(1 \mid x_1)$, $q = P(1 \mid x_3)$.
[Figure: the $(p, q)$ unit square with the decision boundaries of the optimal classifier and of naive Bayes; the region where the independence assumption holds is marked, and the optimal and naive classifiers disagree only in a small part of the square.]

Slide 23: Naive Bayes - Optimality
In general: instance $x = (x_1, \dots, x_n)$. Let $p = P(C = 1 \mid x)$ be the true posterior and $r$ the naive Bayes estimate of it.
Theorem 1: A naive Bayesian classifier is optimal for $x$ iff
$$(p \geq 1/2 \wedge r \geq 1/2) \vee (p \leq 1/2 \wedge r \leq 1/2)$$
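
The condition is trivial to check once $p$ and $r$ are known; a one-line sketch:

```python
def nb_locally_optimal(p, r):
    """Theorem 1: naive Bayes is optimal for an instance iff its estimate r
    lies on the same side of 1/2 as the true posterior p = P(C=1|x)."""
    return (p >= 0.5 and r >= 0.5) or (p <= 0.5 and r <= 0.5)

print(nb_locally_optimal(0.9, 0.6))  # True: badly calibrated, same decision
print(nb_locally_optimal(0.6, 0.4))  # False: the decisions differ
```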

Slide 24: Naive Bayes - Optimality
[Figure: the $(p, r)$ unit square. The region of optimality covers the two quadrants where $p$ and $r$ lie on the same side of $1/2$; the independence assumption holds only on the diagonal $p = r$.]

Slide 25: Naive Bayes - Optimality
This is a criterion for local optimality (per instance). What about global optimality?
Theorem 2: The naive Bayesian classifier is globally optimal for a dataset S iff it is locally optimal (in the sense of Theorem 1) for every instance in S.

Slide 26: Naive Bayes - Optimality
What is the reason for this?
- The difference between classification and probability (distribution) estimation: for classification, perfect estimation of the probabilities is not important, as long as for each instance the maximum estimate corresponds to the maximum true probability.
Problem with this result: how does one verify global optimality (optimality for all instances)?

Slide 27: Naive Bayes - Optimality
For which problems is naive Bayes optimal?
Guess: for problems for which the independence assumption holds.
Empirical answer: not really...
Theoretical answer no. 1: for all problems for which Theorem 2 holds.

Slide 28: Naive Bayes - Linearity
Another question: how does the naive Bayes hypothesis depend on the input variables?
Consider the simple case of binary variables only...
It can be shown (e.g. [2]) that in binary domains naive Bayes is LINEAR in the input variables!
[2] Duda, Hart: Pattern Classification and Scene Analysis, Wiley, 1973

Slide 29: Naive Bayes - Linearity
Proof sketch: write the classifier as a log-odds test. Naive Bayes predicts 1 iff
$$\log \frac{P(1)}{P(0)} + \sum_{i=1}^{n} \log \frac{P(x_i \mid 1)}{P(x_i \mid 0)} > 0.$$
For binary $x_i$ with $\theta_{ic} = P(x_i = 1 \mid c)$, each term equals $x_i \log\frac{\theta_{i1}}{\theta_{i0}} + (1 - x_i)\log\frac{1 - \theta_{i1}}{1 - \theta_{i0}}$, so the whole test has the linear form $w_0 + \sum_i w_i x_i > 0$ with
$$w_i = \log\frac{\theta_{i1}(1 - \theta_{i0})}{\theta_{i0}(1 - \theta_{i1})}.$$
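
The identity can also be checked numerically. This sketch builds the weights $w_0, w_i$ from assumed naive Bayes parameters and verifies that the linear discriminant reproduces the naive Bayes log-odds on every binary input:

```python
import itertools
import numpy as np

# Assumed naive Bayes parameters for a 3-variable binary problem.
prior = np.array([0.5, 0.5])           # P(C=0), P(C=1)
theta = np.array([[0.2, 0.7],          # theta[i, c] = P(x_i = 1 | C = c)
                  [0.6, 0.3],
                  [0.5, 0.9]])

# Weights of the equivalent linear discriminant (log-odds form).
w0 = np.log(prior[1] / prior[0]) + np.log((1 - theta[:, 1]) / (1 - theta[:, 0])).sum()
w = np.log(theta[:, 1] / theta[:, 0]) - np.log((1 - theta[:, 1]) / (1 - theta[:, 0]))

for x in itertools.product((0, 1), repeat=3):
    x = np.array(x)
    # direct naive Bayes log-odds for this input
    ll = np.log(prior[1] / prior[0]) + sum(
        np.log(theta[i, 1] if x[i] else 1 - theta[i, 1])
        - np.log(theta[i, 0] if x[i] else 1 - theta[i, 0]) for i in range(3))
    assert np.isclose(ll, w0 + w @ x)   # identical: naive Bayes is linear in x
```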

Slide 30: Naive Bayes - Linearity - Examples
[Figure: decision boundaries of naive Bayes and of the perceptron on example data.]

Slide 31: Naive Bayes - Linearity - Examples
[Figure, continued.]

Slide 32: Naive Bayes - Linearity
For boolean domains the naive Bayes hypothesis is a linear hyperplane!
=> It can only be globally optimal for linearly separable problems!
BUT: it is not optimal for all linearly separable problems! (e.g. not for certain m-out-of-n concepts; see the sketch below)
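
The m-out-of-n claim can be tested by brute force: fit naive Bayes with exact probabilities on the uniform distribution over $\{0,1\}^n$ labeled by the m-of-n concept and count the inputs it misclassifies (a sketch; the $(m, n)$ pairs tried here are arbitrary):

```python
import itertools
import numpy as np

def nb_errors_on_m_of_n(m, n):
    """Fit naive Bayes with exact (unsmoothed) probabilities on the full
    truth table of the m-of-n concept; return how many of the 2^n inputs
    it misclassifies (0 means naive Bayes represents the concept)."""
    X = np.array(list(itertools.product((0, 1), repeat=n)))
    y = (X.sum(axis=1) >= m).astype(int)
    prior = np.array([(y == c).mean() for c in (0, 1)])
    theta = np.array([X[y == c].mean(axis=0) for c in (0, 1)])  # P(x_i=1|c)
    # probs[k, c, i] = P(x_i = X[k, i] | c); products, not logs, so zeros are safe
    probs = np.where(X[:, None, :] == 1, theta[None], 1 - theta[None])
    pred = (prior * probs.prod(axis=2)).argmax(axis=1)
    return int((pred != y).sum())

for m, n in [(1, 3), (2, 3), (3, 7), (5, 8)]:
    print(f"{m}-of-{n}: {nb_errors_on_m_of_n(m, n)} of {2**n} inputs misclassified")
```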

Slide 33: Naive Bayes - Optimality
For which problems is naive Bayes optimal?
Guess: for problems for which the independence assumption holds.
Empirical answer: not really...
Theoretical answer no. 1: for all problems for which Theorem 2 holds.
Theoretical answer no. 2: for a (large) subset of the set of linearly separable problems.

Slide 34: Naive Bayes - Optimality
[Figure: the class of concepts for which naive Bayes is optimal, drawn as a subset of the class of concepts for which the perceptron is optimal (the linearly separable concepts).]

Slide 35: Overview
- (Recap of) Definitions
- Naive Bayes
  - Performance/optimality?
  - How important is independence?
  - Linearity?
- Bayesian networks

Slide 36: Bayesian Networks
The problem class for which naive Bayes is optimal is quite small...
Idea: relax the independence assumption to obtain a more general classifier, i.e. model conditional dependencies between variables.
There are different techniques for this (e.g. hidden variables, ...); the most established one: Bayesian networks.

Slide 37: Bayesian Networks
A Bayesian network is a tool for representing statistical dependencies between a set of random variables:
- an acyclic directed graph
- one vertex for each variable
- for each pair of statistically dependent variables there is an edge in the graph between the corresponding vertices
- variables whose vertices are not connected are independent!
- each vertex has a table of local probability distributions

Slide 38: Bayesian Networks
Each variable depends only on its parents in the network!
[Figure: an example network with class $y$ and variables $x_1, \dots, x_5$; the parents of $x_4$ are denoted $Pa_4$.]
The joint distribution therefore factorizes as
$$P(x_1, \dots, x_n) = \prod_{i=1}^{n} P(x_i \mid Pa_i)$$

Slide 39: Bayesian Networks
A Bayesian-network-based classifier predicts
$$h(x) = \operatorname*{argmax}_{y \in Y} P(y) \prod_{i=1}^{n} P(x_i \mid Pa_i)$$
where the class $y$ is a parent of the attributes, i.e. $y \in Pa_i$.
[Figure: the example network with class $y$ and attributes $x_1, \dots, x_5$.]
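
A minimal sketch of such a classifier with a fixed, assumed structure: the class $y$ is a parent of both attributes, and $x_2$ additionally depends on $x_1$; all CPT numbers are invented for illustration:

```python
# Assumed conditional probability tables (CPTs).
p_y = {0: 0.6, 1: 0.4}                 # P(y)
p_x1 = {0: 0.2, 1: 0.7}                # P(x1 = 1 | y)
p_x2 = {(0, 0): 0.1, (0, 1): 0.5,      # P(x2 = 1 | y, x1)
        (1, 0): 0.4, (1, 1): 0.9}

def bn_classify(x1, x2):
    """argmax_y P(y) * P(x1|y) * P(x2|y,x1): each factor conditions on the
    variable's parents in the network, not on the class alone."""
    def score(y):
        f1 = p_x1[y] if x1 else 1 - p_x1[y]
        f2 = p_x2[(y, x1)] if x2 else 1 - p_x2[(y, x1)]
        return p_y[y] * f1 * f2
    return max((0, 1), key=score)

print(bn_classify(x1=1, x2=1))  # the dependence of x2 on x1 is modeled explicitly
```

Naive Bayes is the special case of this scheme in which every $Pa_i = \{y\}$.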

Slide 40: Bayesian Networks
In the case of boolean attributes this is again linear, but not in the input variables: the log-odds is linear in product features, i.e. in products of the variables within each family $\{x_i\} \cup Pa_i$.

Slide 41: Bayesian Networks
The difficulty here is to estimate the correct network structure (and the probability parameters) from training data!
For general Bayesian networks this problem is NP-hard!
There exist numerous heuristics for learning Bayesian networks from data.

Slide 42: References
[1] P. Domingos, M. Pazzani: Beyond independence: Conditions for the optimality of the simple Bayesian classifier. ICML 1996.
[2] R. O. Duda, P. E. Hart: Pattern Classification and Scene Analysis. Wiley, 1973.