Bayesian Network Classifiers
Gladys Castillo, University of Aveiro
Aprendizagem Computacional (Machine Learning) course
Bayesian Network Classifiers
Part I – Naive Bayes
Part II – Introduction to Bayesian Networks
The Supervised Classification Problem

A classifier is a function f: X → C that assigns a class label c ∈ C = {c1, …, cm} to objects described by a set of attributes X = {X1, X2, …, Xn}.

Learning phase: given a dataset D with N labeled examples, build a classifier, i.e. a hypothesis hC: X → C that can correctly predict the class labels of new objects (D is fed to a supervised learning algorithm, which outputs hC).

Classification phase: the inputs are the attribute values of a new object x(N+1) and the output is its class; the class attached to x(N+1) is c(N+1) = hC(x(N+1)).

(Figure: scatter plot over attributes X1 and X2 with two groups of points, “give credit” vs. “don’t give credit”.)
Statistical Classifiers

Treat the attributes X = {X1, X2, …, Xn} and the class C as random variables; a random variable is characterized by its probability density function. Statistical classifiers give the probability P(cj | x) that x belongs to a particular class, rather than a simple (hard) classification: instead of having the map X → C, we have X → P(C | X). The class c* attached to an example x is the class with the largest P(cj | x).

(Figure: the probability density function of a random variable and a few observations.)
Bayesian Classifiers

“Bayesian” because the class c* attached to an example x is determined by Bayes’ theorem, which follows from the two factorizations of the joint distribution: P(X, C) = P(X | C) P(C) and P(X, C) = P(C | X) P(X).

Bayes’ theorem is the main tool in Bayesian inference: we can combine the prior distribution and the likelihood of the observed data in order to derive the posterior distribution.
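Equating the two factorizations and dividing by P(X) gives Bayes’ theorem in the form used throughout these slides:

$$P(C \mid X) = \frac{P(X \mid C)\,P(C)}{P(X)}$$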
Bayes’ Theorem Example

Given:
A doctor knows that meningitis causes stiff neck 50% of the time.
The prior probability of any patient having meningitis is 1/50,000.
The prior probability of any patient having stiff neck is 1/20.

If a patient has a stiff neck, what is the probability that he/she has meningitis? (posterior ∝ prior × likelihood)

Before observing the data, our prior beliefs can be expressed in a prior probability distribution that represents the knowledge we have about the unknown features. After observing the data, our revised beliefs are captured by a posterior distribution over the unknown features.
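Applying Bayes’ theorem to the quantities given above (a worked step the slide leaves implicit):

$$P(\text{meningitis} \mid \text{stiff neck}) = \frac{P(\text{stiff neck} \mid \text{meningitis})\,P(\text{meningitis})}{P(\text{stiff neck})} = \frac{0.5 \times 1/50{,}000}{1/20} = 0.0002$$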
Bayesian Classifier

How to determine P(cj | x) for each class cj? Use Bayes’ theorem: the posterior P(cj | x) is proportional to P(cj) P(x | cj). P(x) can be ignored because it is just a normalizing constant that ensures the posterior adds up to 1; it can be computed by summing the numerator over all possible values of C, i.e., P(x) = Σj P(cj) P(x | cj).

Bayesian classification rule (maximum a posteriori): classify x into the class with the largest posterior probability, as written out below.
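Written out, the maximum a posteriori (MAP) classification rule is:

$$c^{*} = \arg\max_{c_j \in C} P(c_j \mid \mathbf{x}) = \arg\max_{c_j \in C} P(c_j)\,P(\mathbf{x} \mid c_j)$$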
Naïve Bayes (NB) Classifier
Duda and Hart (1973); Langley (1992)

“Bayes” because the class c* attached to an example x is determined by Bayes’ theorem. When the attribute space is high dimensional, direct estimation of P(x | cj) is hard unless we introduce some assumptions.

“Naïve” because of its very naïve independence assumption: all the attributes are conditionally independent given the class. Under this assumption P(x | cj) can be decomposed into a product of n terms, one term for each attribute, which gives the NB classification rule spelled out below.
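With the conditional-independence factorization made explicit, the NB classification rule becomes:

$$c^{*} = \arg\max_{c_j \in C} P(c_j) \prod_{i=1}^{n} P(x_i \mid c_j)$$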
Naïve Bayes (NB) Learning Phase (Statistical Parameter Estimation)

Given a training dataset D of N labeled examples (assuming complete data):

1. Estimate P(cj) for each class cj as Nj / N, where Nj is the number of examples of the class cj.

2. Estimate P(Xi = xk | cj) for each value xk of the attribute Xi and for each class cj:
Xi discrete: use Nijk / Nj, where Nijk is the number of examples of the class cj having the value xk for the attribute Xi.
Xi continuous, two options: either the attribute is discretized and then treated as a discrete attribute, or a Normal distribution is assumed, with the mean μij and the standard deviation σij estimated from D.

A code sketch of this estimation step follows.
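A minimal sketch of this learning phase in Python, assuming a small list of (attribute-dict, class) pairs; the names `estimate_nb_parameters`, `examples`, etc. are illustrative, not from the slides:

```python
from collections import Counter, defaultdict
import math

def estimate_nb_parameters(examples, discrete_attrs, continuous_attrs):
    """Estimate NB parameters from labeled examples.

    examples: list of (attribute_dict, class_label) pairs.
    Returns class priors P(c_j), conditional probabilities P(X_i = x_k | c_j)
    for discrete attributes, and per-class (mean, std) for continuous ones.
    """
    N = len(examples)
    class_counts = Counter(label for _, label in examples)          # N_j
    priors = {c: n / N for c, n in class_counts.items()}            # P(c_j) = N_j / N

    value_counts = defaultdict(Counter)   # (attr, class) -> Counter of values (N_ijk)
    numeric_values = defaultdict(list)    # (attr, class) -> observed numeric values
    for x, c in examples:
        for a in discrete_attrs:
            value_counts[(a, c)][x[a]] += 1
        for a in continuous_attrs:
            numeric_values[(a, c)].append(x[a])

    cond_probs = {                         # P(X_i = x_k | c_j) = N_ijk / N_j
        key: {v: n / class_counts[key[1]] for v, n in counter.items()}
        for key, counter in value_counts.items()
    }
    gaussians = {}                         # (attr, class) -> (mu_ij, sigma_ij)
    for key, vals in numeric_values.items():
        mu = sum(vals) / len(vals)
        var = sum((v - mu) ** 2 for v in vals) / (len(vals) - 1)    # sample variance
        gaussians[key] = (mu, math.sqrt(var))
    return priors, cond_probs, gaussians
```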
Continuous Attributes: Normal or Gaussian Distribution

2. Estimate P(Xi = xk | cj) for a value of the attribute Xi and for each class cj. For real-valued attributes a Normal distribution is usually assumed, Xi | cj ~ N(μij, σ²ij), where the mean μij and the standard deviation σij are estimated from D.

The probability density function f(x) is symmetric around its mean. For example, for a variable X ~ N(74, 36), the density at the value 66 is f(66) = g(66; 74, 6) ≈ 0.0273.
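A quick check of that number in Python (the function name `gaussian_pdf` is just for illustration):

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) evaluated at x."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

print(round(gaussian_pdf(66, 74, 6), 4))   # 0.0273
```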
Naïve Bayes Probability Estimates
Example from John & Langley (1995)

Binary classification problem with two classes, + (positive) and − (negative), and two attributes: X1, discrete, taking the values a and b; X2, continuous.

Training dataset D:

Class  X1   X2
  +    a    1.0
  +    b    1.2
  +    a    3.0
  −    b    4.4
  −    b    4.5

1. Estimate P(cj) for each class cj.
2. Estimate P(X1 = xk | cj) for each value of X1 and each class cj (X1 is discrete), and estimate a Gaussian for X2 in each class (X2 is continuous): μ2+ = 1.73, σ2+ = 1.10; μ2− = 4.45, σ2− = 0.07.
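Reproducing these estimates on the five-example dataset (a sketch that reuses the illustrative `estimate_nb_parameters` from the earlier code block):

```python
examples = [
    ({"X1": "a", "X2": 1.0}, "+"),
    ({"X1": "b", "X2": 1.2}, "+"),
    ({"X1": "a", "X2": 3.0}, "+"),
    ({"X1": "b", "X2": 4.4}, "-"),
    ({"X1": "b", "X2": 4.5}, "-"),
]
priors, cond_probs, gaussians = estimate_nb_parameters(
    examples, discrete_attrs=["X1"], continuous_attrs=["X2"]
)
print(priors)                    # {'+': 0.6, '-': 0.4}
print(cond_probs[("X1", "+")])   # {'a': 0.666..., 'b': 0.333...}
print(gaussians[("X2", "+")])    # approx. (1.73, 1.10)
print(gaussians[("X2", "-")])    # approx. (4.45, 0.07)
```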
Naïve Bayes Performance

NB is one of the simplest and most effective classifiers, but it has a very strong, unrealistic independence assumption: all the attributes are conditionally independent given the value of the class.

HIGH BIAS: in practice the independence assumption is violated, which can lead to poor classification.
LOW VARIANCE: nevertheless, NB is efficient because it has few parameters to estimate, which keeps its variance low.
Improving Naïve Bayes

Reducing the bias resulting from the modeling error, by relaxing the attribute independence assumption; one natural extension: Bayesian network classifiers. Relevant works:
Friedman, Geiger and Goldszmidt (1998), “Bayesian Network Classifiers”, Machine Learning, 29
Pazzani (1995), “Searching for attribute dependencies in Bayesian Network Classifiers”, Proc. of the 5th Workshop on Artificial Intelligence and Statistics
Keogh and Pazzani (1999), “Learning augmented Bayesian classifiers…”

Reducing the bias of the parameter estimates, by improving the probability estimates computed from data. Relevant works:
Webb and Pazzani (1998), “Adjusted probability naive Bayesian induction”, LNCS v. 1502
J. Gama (2001, 2003), “Iterative Bayes”, Theoretical Computer Science, v. 292
Bayesian Network Classifiers
Part II – Introduction to Bayesian Networks
Reasoning under Uncertainty: the Probabilistic Approach

A problem domain is modeled by a set of random variables X1, X2, ..., Xn. Knowledge about the problem domain is represented by a joint probability distribution P(X1, X2, ..., Xn).

Example: Alarm network (Pearl 1988)
Story: in LA, burglaries and earthquakes are not uncommon, and both can set off the alarm. In case of an alarm, two neighbors, John and Mary, may call.
Problem: estimate the probability of a burglary based on who has or has not called.
Variables: Burglary (B), Earthquake (E), Alarm (A), JohnCalls (J), MaryCalls (M).
Knowledge required by the probabilistic approach in order to solve this problem: P(B, E, A, J, M).
Inference with the Joint Probability Distribution

To specify P(X1, X2, ..., Xn) for n binary variables we need at least 2^n − 1 numbers: an exponential model size. Knowledge acquisition is difficult (complex, unnatural), and storage and inference are exponential.

Solution: by exploiting conditional independence we can obtain an appropriate factorization of the joint probability distribution, and the number of parameters can be substantially reduced. A good factorization is provided by Bayesian networks.
Bayesian Networks
Pearl (1988, 2000); Jensen (1996); Lauritzen (1996)

Bayesian networks (BNs) combine probability theory and graph theory to graphically represent the joint probability distribution of a set X of random variables in a problem domain. A BN = (S, Θ_S) consists of two components:

Qualitative part: the structure S, a Directed Acyclic Graph (DAG) whose nodes are random variables and whose arcs represent direct dependences between variables.
Quantitative part: the set of parameters Θ_S, a set of conditional probability distributions (CPDs); for discrete variables Θ_S is multinomial (the CPDs are CPTs), for continuous variables Θ_S is Normal or Gaussian.

Markov condition: each node is independent of its non-descendants given its parents in S; under this condition the joint distribution is factorized, as shown below.

(Figure: the “Wet grass” BN from Hugin.)
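The factorization implied by the Markov condition, where pa(Xi) denotes the parents of Xi in S:

$$P(X_1, \dots, X_n) = \prod_{i=1}^{n} P(X_i \mid \mathrm{pa}(X_i))$$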
Example: Alarm Network (Pearl)

X = {B – Burglary, E – Earthquake, A – Alarm, J – JohnCalls, M – MaryCalls}

The Bayesian network has two components:
the structure – a Directed Acyclic Graph with pa(B) = { }, pa(E) = { }, pa(A) = {B, E}, pa(J) = {A}, pa(M) = {A};
the parameters – a set of conditional probability tables (CPTs): P(B), P(E), P(A | B, E), P(J | A), P(M | A).

Model size reduced from 31 to 1 + 1 + 4 + 2 + 2 = 10.
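Where the count of 10 comes from, taking one free number per parent configuration of each binary node (a reasoning step the slide leaves implicit):

$$\underbrace{1}_{P(B)} + \underbrace{1}_{P(E)} + \underbrace{2^2}_{P(A \mid B,E)} + \underbrace{2}_{P(J \mid A)} + \underbrace{2}_{P(M \mid A)} = 10 \quad\text{vs.}\quad 2^5 - 1 = 31$$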
Example: Alarm Network (Pearl) – Factored Joint Probability Distribution

X = {B (Burglary), E (Earthquake), A (Alarm), J (JohnCalls), M (MaryCalls)}

P(B, E, A, J, M) = P(B) P(E) P(A | B, E) P(J | A) P(M | A)
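A small sketch of how this factorization is evaluated in code. The CPT numbers below are illustrative placeholders, since the actual tables from the slide are not reproduced in this text; only the structure P(B) P(E) P(A|B,E) P(J|A) P(M|A) is taken from the slide:

```python
# Illustrative CPTs (placeholder values, not the slide's actual numbers).
P_B = {True: 0.001, False: 0.999}
P_E = {True: 0.002, False: 0.998}
P_A_given_BE = {  # (B, E) -> P(A=True | B, E)
    (True, True): 0.95, (True, False): 0.94,
    (False, True): 0.29, (False, False): 0.001,
}
P_J_given_A = {True: 0.90, False: 0.05}   # P(J=True | A)
P_M_given_A = {True: 0.70, False: 0.01}   # P(M=True | A)

def bernoulli(p_true, value):
    """Return p_true if value is True, else its complement."""
    return p_true if value else 1.0 - p_true

def joint(b, e, a, j, m):
    """P(B=b, E=e, A=a, J=j, M=m) via the network factorization."""
    return (bernoulli(P_B[True], b)
            * bernoulli(P_E[True], e)
            * bernoulli(P_A_given_BE[(b, e)], a)
            * bernoulli(P_J_given_A[a], j)
            * bernoulli(P_M_given_A[a], m))

# e.g. the probability that both neighbors call, the alarm sounds,
# and there is neither a burglary nor an earthquake:
print(joint(False, False, True, True, True))
```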
Bayesian Networks: Summary

Bayesian networks are an efficient and effective representation of the joint probability distribution of a set of random variables.

Efficient: local models; independence (d-separation).
Effective: algorithms take advantage of the structure to compute posterior probabilities, compute the most probable instantiation, and support decision making.
But there is more: statistical induction – LEARNING.

(Adapted from material © 1998, Nir Friedman, U.C. Berkeley, and Moises Goldszmidt, SRI International.)