Aprendizagem Computacional, Gladys Castillo, UA. Bayesian Networks Classifiers. Gladys Castillo, University of Aveiro.


Bayesian Networks Classifiers. Gladys Castillo, University of Aveiro.

Bayesian Networks Classifiers. Part I – Naive Bayes. Part II – Introduction to Bayesian Networks.

The Supervised Classification Problem. A classifier is a function f: ΩX → ΩC that assigns a class label c ∈ ΩC = {c1, …, cm} to objects described by a set of attributes X = {X1, X2, …, Xn} ∈ ΩX. Given: a dataset D with N labeled examples. Learning phase: build a classifier, i.e. a hypothesis hC: ΩX → ΩC that can correctly predict the class labels of new objects. Classification phase: the input is the attribute values of x(N+1), the output is its class; the class attached to x(N+1) is c(N+1) = hC(x(N+1)) ∈ ΩC. [Figure: training examples plotted in the attribute space (X1, X2), two classes: "give credit" vs "don't give credit".]

Statistical Classifiers. Treat the attributes X = {X1, X2, …, Xn} and the class C as random variables; a random variable is characterized by its probability density (or mass) function. Instead of a hard classification, a statistical classifier gives the probability P(cj | x) that x belongs to each class: instead of the map X → C, we have X → P(C | X). The class c* attached to an example x is the class with the highest P(cj | x). [Figure: probability density function of a random variable and a few observations.]

Bayesian Classifiers. "Bayesian" because the class c* attached to an example x is determined by Bayes' theorem. Since P(X, C) = P(X | C) P(C) and P(X, C) = P(C | X) P(X), it follows that P(C | X) = P(X | C) P(C) / P(X). Bayes' theorem is the main tool in Bayesian inference: we combine the prior distribution and the likelihood of the observed data to derive the posterior distribution.

Bayes Theorem Example. Given: a doctor knows that meningitis causes stiff neck 50% of the time; the prior probability of any patient having meningitis is 1/50,000; the prior probability of any patient having a stiff neck is 1/20. If a patient has a stiff neck, what is the probability that he/she has meningitis? P(M | S) = P(S | M) P(M) / P(S), i.e. posterior ∝ prior × likelihood. Before observing the data, our prior beliefs are expressed in a prior probability distribution that represents the knowledge we have about the unknown features; after observing the data, our revised beliefs are captured by a posterior distribution over the unknown features.
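A quick numeric check of this example as a minimal Python sketch (the variable names are illustrative, not from the slides):

```python
# Bayes' theorem for the meningitis / stiff-neck example:
# P(M | S) = P(S | M) * P(M) / P(S)
p_s_given_m = 0.5        # likelihood: meningitis causes stiff neck 50% of the time
p_m = 1 / 50_000         # prior probability of meningitis
p_s = 1 / 20             # prior probability of stiff neck

p_m_given_s = p_s_given_m * p_m / p_s
print(p_m_given_s)       # 0.0002 -> a stiff neck alone is very weak evidence of meningitis
```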

Bayesian Classifier. How to determine P(cj | x) for each class cj? By Bayes' theorem, P(cj | x) = P(cj) P(x | cj) / P(x). Bayesian classification rule (maximum a posteriori classification): classify x into the class with the highest posterior probability. P(x) can be ignored for classification because it is just a normalizing constant that ensures the posterior adds up to 1; if needed, it can be computed by summing the numerator over all possible values of C, i.e., P(x) = Σj P(cj) P(x | cj).
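The rule on this slide written out in LaTeX, using the same symbols as above (standard notation):

```latex
% Bayes' theorem for a class c_j and an example x, with the normalizing constant P(x)
P(c_j \mid \mathbf{x}) = \frac{P(c_j)\,P(\mathbf{x} \mid c_j)}{P(\mathbf{x})},
\qquad
P(\mathbf{x}) = \sum_{j} P(c_j)\,P(\mathbf{x} \mid c_j)

% Maximum a posteriori (MAP) classification rule: P(x) is constant over classes,
% so it can be dropped from the argmax.
c^{*} = \arg\max_{c_j \in \Omega_C} P(c_j \mid \mathbf{x})
      = \arg\max_{c_j \in \Omega_C} P(c_j)\,P(\mathbf{x} \mid c_j)
```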

Naïve Bayes (NB) Classifier (Duda and Hart, 1973; Langley, 1992). "Bayes" because the class c* attached to an example x is determined by Bayes' theorem; "naïve" because of its very naïve independence assumption: all the attributes are conditionally independent given the class. When the attribute space is high dimensional, direct estimation of P(x | cj) is hard unless we introduce some assumptions; under the independence assumption, P(x | cj) decomposes into a product of n terms, one term for each attribute. NB classification rule: c* = argmax_cj P(cj) Πi P(xi | cj).
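A minimal sketch of the NB decision rule in Python, assuming the class priors and per-attribute conditional probabilities have already been estimated (all names and numbers below are illustrative, not from the slides):

```python
import math

# Hypothetical estimates: P(c) and P(X_i = x | c) for two discrete attributes.
priors = {"+": 0.6, "-": 0.4}
cond = {
    # cond[class][attribute_index][value] = P(X_i = value | class)
    "+": [{"a": 0.7, "b": 0.3}, {"low": 0.8, "high": 0.2}],
    "-": [{"a": 0.2, "b": 0.8}, {"low": 0.3, "high": 0.7}],
}

def nb_classify(x):
    """Return argmax_c P(c) * prod_i P(x_i | c), using log-probabilities for stability."""
    best_class, best_score = None, -math.inf
    for c, prior in priors.items():
        score = math.log(prior)
        for i, value in enumerate(x):
            score += math.log(cond[c][i][value])
        if score > best_score:
            best_class, best_score = c, score
    return best_class

print(nb_classify(["a", "low"]))   # -> "+"
```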

Naïve Bayes (NB) Learning Phase (Statistical Parameter Estimation). Given a training dataset D of N labeled examples (assuming complete data):
1. Estimate P(cj) for each class cj: P(cj) = Nj / N, where Nj is the number of examples of class cj.
2. Estimate P(Xi = xk | cj) for each value xk of the attribute Xi and for each class cj. Xi discrete: P(Xi = xk | cj) = Nijk / Nj, where Nijk is the number of examples of class cj having the value xk for the attribute Xi. Xi continuous, two options: the attribute is discretized and then treated as a discrete attribute, or a Normal distribution is usually assumed, with the mean μij and the standard deviation σij estimated from D.
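A minimal sketch of this estimation step in Python for discrete attributes; the training set and names are hypothetical placeholders:

```python
from collections import Counter, defaultdict

# Hypothetical training set: each row is (attribute values, class label).
D = [(("a", "low"), "+"), (("a", "high"), "+"), (("b", "low"), "-"),
     (("b", "high"), "-"), (("a", "low"), "+")]
N = len(D)

# 1. P(c_j) = N_j / N
N_j = Counter(c for _, c in D)
priors = {c: N_j[c] / N for c in N_j}

# 2. P(X_i = x_k | c_j) = N_ijk / N_j   (discrete attributes)
counts = defaultdict(Counter)            # counts[(i, c)][x_k] = N_ijk
for x, c in D:
    for i, value in enumerate(x):
        counts[(i, c)][value] += 1
cond = {(i, c): {v: n / N_j[c] for v, n in cnt.items()}
        for (i, c), cnt in counts.items()}

print(priors)            # {'+': 0.6, '-': 0.4}
print(cond[(0, "+")])    # {'a': 1.0}
```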

Continuous Attributes: Normal or Gaussian Distribution. 2. Estimate P(Xi = xk | cj) for a value of the attribute Xi and for each class cj. For real attributes a Normal distribution is usually assumed: Xi | cj ~ N(μij, σ²ij), where the mean μij and the standard deviation σij are estimated from D. The probability density function f(x) is symmetric around its mean. Example: for a variable X ~ N(74, 36), the density at the value 66 is f(66) = g(66; 74, 6) ≈ 0.027.
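The Gaussian density used above, evaluated explicitly in Python as a sanity check (only the density formula is needed; the numbers are the ones given on the slide):

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x: exp(-(x-mu)^2 / (2*sigma^2)) / (sigma*sqrt(2*pi))."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

# X ~ N(74, 36), i.e. mu = 74, sigma = 6; density at x = 66:
print(gaussian_pdf(66, 74, 6))   # ~= 0.0273
```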

Naïve Bayes Probability Estimates. Binary classification problem (example from John & Langley, 1995). Two classes: + (positive) and - (negative). Two attributes: X1 – discrete, taking the values a and b; X2 – continuous. [Table: training dataset D with columns Class, X1, X2; rows with X1 ∈ {a, b} and X2 values such as 4.5.] 1. Estimate P(cj) for each class cj. 2. Estimate P(X1 = xk | cj) for each value of X1 and each class cj (X1 discrete). For X2 (continuous), estimate the Gaussian parameters per class: μ2+ = 1.73, σ2+ = 1.10; μ2- = 4.45, σ2- = 0.07.
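A sketch of how these estimates are combined to classify a new example. The Gaussian parameters are the ones quoted above; the class priors and the discrete estimates P(X1 | c) are made-up placeholders, since the original table values are not available here:

```python
import math

def gaussian_pdf(x, mu, sigma):
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

# Gaussian parameters (mu, sigma) for X2 per class, as quoted on the slide.
gauss = {"+": (1.73, 1.10), "-": (4.45, 0.07)}

# Placeholder priors and discrete estimates for X1 (NOT from the slide).
priors = {"+": 0.5, "-": 0.5}
p_x1 = {"+": {"a": 0.5, "b": 0.5}, "-": {"a": 0.3, "b": 0.7}}

def nb_score(c, x1, x2):
    mu, sigma = gauss[c]
    return priors[c] * p_x1[c][x1] * gaussian_pdf(x2, mu, sigma)

x1, x2 = "b", 4.5
scores = {c: nb_score(c, x1, x2) for c in priors}
print(max(scores, key=scores.get), scores)   # class with the highest P(c) * P(x1|c) * f(x2|c)
```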

Naïve Bayes Performance. NB is one of the simplest and most effective classifiers, yet it rests on a very strong, unrealistic independence assumption: all the attributes are conditionally independent given the value of the class. HIGH BIAS: in practice the independence assumption is violated, which can lead to poor classification. LOW VARIANCE: however, NB is efficient thanks to its variance management; fewer parameters mean lower variance.

Improving Naïve Bayes. Two directions: (1) reducing the bias resulting from the modeling error, by relaxing the attribute independence assumption; one natural extension: Bayesian Network Classifiers; (2) reducing the bias of the parameter estimates, by improving the probability estimates computed from data.
Relevant works:
Webb and Pazzani (1998) – "Adjusted probability naive Bayesian induction", in LNCS v. 1502
Gama, J. (2001, 2003) – "Iterative Bayes", in Theoretical Computer Science, v. 292
Friedman, Geiger and Goldszmidt (1997) – "Bayesian Network Classifiers", in Machine Learning, 29
Pazzani (1995) – "Searching for attribute dependencies in Bayesian Network Classifiers", in Proc. of the 5th Workshop on Artificial Intelligence and Statistics
Keogh and Pazzani (1999) – "Learning augmented Bayesian classifiers…", in Proc. of the 7th Workshop on Artificial Intelligence and Statistics

Bayesian Networks Classifiers. Part II – Introduction to Bayesian Networks.

Reasoning under Uncertainty: Probabilistic Approach. A problem domain is modeled by a set of random variables X1, X2, ..., Xn; knowledge about the problem domain is represented by a joint probability distribution P(X1, X2, ..., Xn). Example: Alarm network (Pearl, 1988). Story: in LA, burglaries and earthquakes are not uncommon, and both can trigger the alarm; in case of alarm, two neighbours, John and Mary, may call. Problem: estimate the probability of a burglary based on who has or has not called. Variables: Burglary (B), Earthquake (E), Alarm (A), JohnCalls (J), MaryCalls (M). Knowledge required by the probabilistic approach in order to solve this problem: P(B, E, A, J, M).

Inference with the Joint Probability Distribution. To specify P(X1, X2, ..., Xn) we need at least 2^n − 1 probability numbers: exponential model size, difficult knowledge acquisition (complex, unnatural), exponential storage and inference. Solution: by exploiting conditional independence we can obtain an appropriate factorization of the joint probability distribution, so the number of parameters can be substantially reduced. A good factorization is provided by Bayesian Networks.

Bayesian Networks (Pearl 1988, 2000; Jensen 1996; Lauritzen 1996). Bayesian Networks (BNs) graphically represent the joint probability distribution of a set X of random variables in a problem domain, combining probability theory and graph theory. A BN = (S, θS) consists of two components. Qualitative part: the structure S, a Directed Acyclic Graph (DAG) whose nodes are random variables and whose arcs represent direct dependence between variables. Quantitative part: the set of parameters θS, a set of conditional probability distributions (CPDs); for discrete variables θS is multinomial (the CPDs are CPTs), for continuous variables θS is Normal (Gaussian). Markov condition: each node is independent of its non-descendants given its parents in S; the joint distribution is then factorized as shown below. [Figure: "Wet grass" BN from Hugin.]
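The factorization implied by the Markov condition, written out (standard BN factorization; pa(Xi) denotes the parents of Xi in S):

```latex
P(X_1, X_2, \dots, X_n) = \prod_{i=1}^{n} P\bigl(X_i \mid \mathrm{pa}(X_i)\bigr)
```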

Example: Alarm Network (Pearl) – Bayesian Network. X = {B – Burglary, E – Earthquake, A – Alarm, J – JohnCalls, M – MaryCalls}. The Bayesian network has two components: the structure, a Directed Acyclic Graph with pa(B) = {}, pa(E) = {}, pa(A) = {B, E}, pa(J) = {A}, pa(M) = {A}; and the parameters, a set of conditional probability tables (CPTs): P(B), P(E), P(A | B, E), P(J | A), P(M | A). Model size reduced from 31 to 1 + 1 + 4 + 2 + 2 = 10.

Example: Alarm Network (Pearl) – Factored Joint Probability Distribution. X = {B (Burglary), E (Earthquake), A (Alarm), J (JohnCalls), M (MaryCalls)}. P(B, E, A, J, M) = P(B) P(E) P(A | B, E) P(J | A) P(M | A).
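A sketch of inference by enumeration using this factorization. The CPT values below are the ones commonly used for this textbook example (e.g. in Russell & Norvig); they are not given on the slides, so treat them as illustrative:

```python
from itertools import product

# Illustrative CPTs for the alarm network (common textbook values, not from the slides).
P_B = {True: 0.001, False: 0.999}
P_E = {True: 0.002, False: 0.998}
P_A = {(True, True): 0.95, (True, False): 0.94, (False, True): 0.29, (False, False): 0.001}
P_J = {True: 0.90, False: 0.05}   # P(J=true | A)
P_M = {True: 0.70, False: 0.01}   # P(M=true | A)

def joint(b, e, a, j, m):
    """P(B,E,A,J,M) = P(B) P(E) P(A|B,E) P(J|A) P(M|A)."""
    p_a = P_A[(b, e)] if a else 1 - P_A[(b, e)]
    p_j = P_J[a] if j else 1 - P_J[a]
    p_m = P_M[a] if m else 1 - P_M[a]
    return P_B[b] * P_E[e] * p_a * p_j * p_m

# P(Burglary | JohnCalls=true, MaryCalls=true), obtained by summing out E and A.
num = sum(joint(True, e, a, True, True) for e, a in product([True, False], repeat=2))
den = sum(joint(b, e, a, True, True) for b, e, a in product([True, False], repeat=3))
print(num / den)   # ~= 0.284 with these illustrative numbers
```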

Bayesian Networks: Summary. Bayesian Networks are an efficient and effective representation of the joint probability distribution of a set of random variables. Efficient: local models, independence (d-separation). Effective: algorithms take advantage of the structure to compute posterior probabilities, compute the most probable instantiation, and support decision making. But there is more: statistical induction → LEARNING. (Adapted from © 1998 Nir Friedman, U.C. Berkeley, and Moises Goldszmidt, SRI International. All rights reserved.)