Bayesian Decision Theory: Making Decisions Under Uncertainty



Overview
- Basics of Probability and the Bayes Rule
- Bayesian Classification
- Losses and Risks
- Discriminant Functions
- Utility Theory
- Association Rule Learning

Basics of Probability
- X is a random variable. P(X = x), or simply P(x), is the probability of X being x.
- Conditional probability: P(X | Y) is the probability of the occurrence of event X given that Y occurred: P(X | Y) = P(X, Y) / P(Y)
- Joint probability of X and Y: P(Y, X) = P(X, Y) = P(X | Y) P(Y) = P(Y | X) P(X)
- Bayes' rule: P(X | Y) = P(Y | X) P(X) / P(Y)
- Bernoulli distribution: a trial is performed whose outcome is either a "success" or a "failure" (1 or 0), e.g. tossing a coin: P(X = 1) = p and P(X = 0) = 1 − p
- Unobservable variables: the extra pieces of knowledge that we do not have access to. In coin tossing, the only observable variable is the outcome of the toss; unobservable variables include the composition of the coin, its initial position, the force and direction of the toss, where and how it is caught, and so on.
- Estimating P(X) from a sample: in the coin-tossing example, given the outcomes of the past N tosses, p_heads = #{tosses with outcome heads} / N
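The sample estimate and Bayes' rule above can be sketched numerically; all values here are made-up illustration numbers, not from the slides:

```python
# Estimating P(X = heads) from a sample of past tosses: #{heads} / N.
tosses = [1, 0, 1, 1, 0, 1, 0, 1]  # 1 = heads, 0 = tails
p_heads = sum(tosses) / len(tosses)

# Bayes' rule: P(X | Y) = P(Y | X) P(X) / P(Y), with hypothetical values.
p_x = 0.3          # P(X)
p_y_given_x = 0.8  # P(Y | X)
p_y = 0.5          # P(Y)
p_x_given_y = p_y_given_x * p_x / p_y
```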

Bayesian Classification
- If X is the vector of observable variables, X = [X1, X2, X3, ...]^T, and C is the random variable denoting the class label, then the probability of belonging to class C = c is P(C = c | X).
- E.g. bank loan eligibility: high-risk customers C = 1, low-risk customers C = 0. Observable variables: X1 = customer's yearly income, X2 = customer's savings, X = [X1, X2]^T.
- Decision rule: choose C = 1 if P(C = 1 | x1, x2) > P(C = 0 | x1, x2); choose C = 0 otherwise.

Bayesian Classification
For each specific observation, x = [x1, x2, x3, ...]^T is the vector of observable variable values, and we need to calculate P(C | x). The terms of Bayes' rule:
- Prior P(C): the knowledge about the classes before observing the data; P(C = 0) + P(C = 1) = 1.
- Likelihood p(x | C): the conditional probability that an event belonging to class C has the associated observation value x. E.g. P(x1, x2 | C = 1) is the probability that a high-risk customer has X1 = x1 and X2 = x2.
- Evidence P(x): the probability that the observation x is seen, i.e. the marginal probability of x regardless of whether it is a positive or negative example: P(x) = Σ_C P(x, C) = P(x | C = 1) P(C = 1) + P(x | C = 0) P(C = 0)
- Posterior P(C | x): the combination of the prior belief with the likelihood provided by the observation, normalized by the evidence; P(C = 0 | x) + P(C = 1 | x) = 1.

Bayesian Classification: Generalization
- K mutually exclusive and exhaustive classes C_i, i = 1, ..., K
- Prior probabilities: P(C_i) ≥ 0 and Σ_{i=1}^{K} P(C_i) = 1
- p(x | C_i) is the probability of seeing x as the input when it is known to belong to class C_i
- Posterior probability: P(C_i | x) = p(x | C_i) P(C_i) / p(x)
- Bayes' classifier chooses the class with the highest posterior probability: choose C_i if P(C_i | x) = max_k P(C_k | x)
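The posterior computation and the max-posterior choice can be sketched as follows; the priors and likelihoods are hypothetical values, not from the slides:

```python
# Posterior P(C_i | x) = p(x | C_i) P(C_i) / p(x) for K = 3 classes.
priors = [0.5, 0.3, 0.2]       # P(C_i), sums to 1
likelihoods = [0.1, 0.4, 0.2]  # p(x | C_i) for one observed x

# Evidence p(x) is the prior-weighted sum of the likelihoods.
evidence = sum(l * p for l, p in zip(likelihoods, priors))
posteriors = [l * p / evidence for l, p in zip(likelihoods, priors)]

# Bayes' classifier: pick the class with the highest posterior.
chosen = max(range(len(posteriors)), key=lambda i: posteriors[i])
```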

Losses and Risks
In many cases, decisions are not equally good or costly. E.g. the loss for a high-risk applicant erroneously accepted may differ from the potential gain lost on an erroneously rejected low-risk applicant; a wrongly negative diagnosis of a serious medical condition is very costly (high loss).
Definitions:
- Action α_i: the decision to assign the input to class C_i
- Loss λ_ik: the loss incurred for taking action α_i when the input actually belongs to C_k
- Expected risk: the sum of posterior probabilities weighted by the losses: R(α_i | x) = Σ_{k=1}^{K} λ_ik P(C_k | x)
- Classification rule: choose the action with minimum risk, i.e. choose α_i where i = argmin_k R(α_k | x)
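The expected-risk rule can be sketched with a hypothetical loss matrix; note how a large enough loss can override the most probable class:

```python
# Expected risk R(a_i | x) = sum_k lambda_ik P(C_k | x); choose the argmin.
posteriors = [0.7, 0.3]  # P(C_k | x), illustrative values
loss = [[0.0, 10.0],     # lambda_ik: row = action i, column = true class k
        [1.0, 0.0]]      # misclassifying class 1 is 10x worse here

risks = [sum(lam * p for lam, p in zip(row, posteriors)) for row in loss]
best_action = min(range(len(risks)), key=lambda i: risks[i])
# Class 0 is more probable, but action 1 carries less expected risk.
```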

Calculating R for the 0/1 Loss Case
Special case of 0/1 loss: all correct decisions have no loss and all errors are equally costly:
λ_ik = 0 if i = k, 1 if i ≠ k
R(α_i | x) = Σ_{k=1}^{K} λ_ik P(C_k | x) = Σ_{k ≠ i} P(C_k | x) = 1 − P(C_i | x)
To minimize risk, choose the most probable case, i.e. the class with the highest posterior probability.
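A one-line check of the reduction above: under 0/1 loss, minimizing risk and maximizing the posterior pick the same class (posterior values are illustrative):

```python
# Under 0/1 loss, R(a_i | x) = 1 - P(C_i | x), so argmin risk = argmax posterior.
posteriors = [0.2, 0.5, 0.3]
risks = [1 - p for p in posteriors]

by_min_risk = min(range(3), key=lambda i: risks[i])
by_max_posterior = max(range(3), key=lambda i: posteriors[i])
```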

Reject
In some applications, wrong decisions (misclassifications) may have very high cost. If the automatic system has low certainty in its decision, a more complex (e.g. manual) decision is required. E.g. if an optical digit recognizer reads postal codes on envelopes, wrongly recognizing a code sends the envelope to a wrong destination. To handle this, we define an additional action α_{K+1}, called reject or doubt.

Reject (continued)
With 0/1 loss extended by a reject loss λ, 0 < λ < 1 (λ_ik = 0 if i = k, λ if i = K + 1, and 1 otherwise):
- Risk of reject: R(α_{K+1} | x) = Σ_{k=1}^{K} λ P(C_k | x) = λ
- Risk of action i: R(α_i | x) = Σ_{k ≠ i} P(C_k | x) = 1 − P(C_i | x)
Optimal rule: choose C_i if P(C_i | x) > P(C_k | x) for all k ≠ i and P(C_i | x) > 1 − λ; reject otherwise.
(Lecture Notes for E. Alpaydın, Introduction to Machine Learning, © The MIT Press, 2004, V1.1)
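The standard reject rule under 0/1 loss with reject loss λ in (0, 1) is to pick the best class only when its posterior exceeds 1 − λ; a sketch with hypothetical values:

```python
# Reject option: accept the argmax class only if its posterior > 1 - lambda.
lam = 0.2                        # hypothetical reject loss, 0 < lam < 1
posteriors = [0.55, 0.30, 0.15]  # illustrative P(C_i | x)

best = max(range(len(posteriors)), key=lambda i: posteriors[i])
decision = best if posteriors[best] > 1 - lam else "reject"
# 0.55 <= 0.8, so the classifier defers to a more complex decision.
```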

Discriminant Functions
Classification can also be expressed through a set of discriminant functions g_i(x), i = 1, ..., K, which define the borders between classes: choose C_i if g_i(x) = max_k g_k(x).
- Relation to the risk function: g_i(x) = −R(α_i | x)
- When using the 0/1 loss function: g_i(x) = P(C_i | x)
- This divides the feature space into K decision regions R_1, ..., R_K, where R_i = {x | g_i(x) = max_k g_k(x)}. The regions are separated by decision boundaries.

Dichotomizer & Polychotomizer
- Dichotomizer: two classes (K = 2), so one discriminant function suffices: g(x) = g_1(x) − g_2(x); choose C_1 if g(x) > 0, C_2 otherwise.
- Polychotomizer: K ≥ 3 classes.
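A minimal sketch of the dichotomizer: the sign of the single discriminant decides the class (discriminant values are illustrative):

```python
# Dichotomizer: with K = 2, g(x) = g1(x) - g2(x) and the sign picks the class.
g1, g2 = 0.7, 0.3          # e.g. the two posteriors under 0/1 loss
g = g1 - g2
label = 1 if g > 0 else 2  # choose C_1 when g(x) > 0
```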

Utility Theory
Concerned with making rational decisions; in the context of classification, decisions correspond to choosing one of the classes. If P(S_k | x) is the probability of state S_k given evidence x, the utility function U_ik measures how good it is to take action α_i when the state is S_k. Expected utility:
EU(α_i | x) = Σ_k U_ik P(S_k | x), and we choose α_i if EU(α_i | x) = max_j EU(α_j | x)
Maximizing the expected utility is equivalent to minimizing expected risk.
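The expected-utility computation mirrors the risk computation with the sign flipped; the utility values below are hypothetical:

```python
# Expected utility EU(a_i | x) = sum_k U_ik P(S_k | x); choose the argmax.
# Setting U_ik = -lambda_ik reproduces minimum-risk classification.
posteriors = [0.6, 0.4]    # P(S_k | x), illustrative
utility = [[5.0, -10.0],   # U_ik: row = action i, column = state k
           [0.0, 1.0]]

eus = [sum(u * p for u, p in zip(row, posteriors)) for row in utility]
best = max(range(len(eus)), key=lambda i: eus[i])
```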

Association Rule Learning
Discovering relations between items in databases. E.g. basket analysis: understand the purchase behavior of customers; find the dependency between two items X and Y.
- Association rule: X → Y (X: antecedent, Y: consequent)
- Support(X → Y): P(X, Y), the fraction of transactions containing both X and Y
- Confidence(X → Y): P(Y | X) = P(X, Y) / P(X)
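Support and confidence can be estimated by counting over a toy transaction database (the transactions below are made up for illustration):

```python
# Support(X -> Y) = P(X, Y); Confidence(X -> Y) = P(X, Y) / P(X).
transactions = [
    {"milk", "bread"},
    {"milk", "bread", "butter"},
    {"bread"},
    {"milk", "butter"},
]
n = len(transactions)

def support(itemset):
    # Fraction of transactions containing every item in the set.
    return sum(1 for t in transactions if itemset <= t) / n

sup_xy = support({"milk", "bread"})  # P(X, Y)
conf = sup_xy / support({"milk"})    # P(Y | X) for the rule milk -> bread
```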

Understanding Support & Confidence
1. Maximizing confidence: to be able to say that the rule holds with enough confidence, this value should be close to 1 and significantly larger than P(Y), the overall probability of people buying Y.
2. Maximizing support: even if there is a dependency with a strong confidence value, the rule is worthless if the number of such customers is small.
Support shows the statistical significance of the rule, whereas confidence shows the strength of the rule. The minimum support and confidence values are set by the company, and all rules with higher support and confidence are searched for in the database.

Other Concepts
Lift(X → Y) = P(X, Y) / (P(X) P(Y)) = P(Y | X) / P(Y)
- lift > 1: X makes Y more likely
- lift = 1: X and Y are independent
- lift < 1: X makes Y less likely
Generalization to n variables: e.g. for the three-item set {X, Y, Z}, the rule X, Z → Y has confidence P(Y | X, Z). Goal: find all such rules having high enough support and confidence by doing a small number of passes over the database.
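A quick numeric check of the lift definition, with hypothetical probabilities:

```python
# Lift(X -> Y) = P(X, Y) / (P(X) P(Y)): > 1 positive association,
# = 1 independence, < 1 negative association.
p_x, p_y, p_xy = 0.4, 0.5, 0.3
lift = p_xy / (p_x * p_y)  # here lift > 1, so X makes Y more likely
```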

Apriori Algorithm
1. Find the frequent itemsets (those which have enough support).
2. Convert them to rules with enough confidence, by splitting the items into two parts (antecedent items and consequent items).

Apriori Algorithm, Step 1: Finding frequent itemsets
Adding another item can never increase support, so if a two-item set is known not to be frequent, all its supersets can be pruned and need not be checked. Start by finding the frequent one-item sets; at each step, inductively, generate candidate (k+1)-item sets from the frequent k-item sets, then do a pass over the data to check whether they have enough support. The algorithm stores the frequent itemsets in a hash table for easy access. Note that the number of candidate itemsets decreases very rapidly as k increases. If the largest itemset has n items, a total of n + 1 passes over the data are needed.

Apriori Algorithm, Step 2: Converting frequent itemsets to rules
Once the frequent k-item sets are found, convert them to rules by splitting the k items into an antecedent and a consequent. As when generating the itemsets, start by putting a single item in the consequent and k − 1 items in the antecedent; for every possible single consequent, check whether the rule has enough confidence and remove it if it does not. The same itemset may yield multiple rules with different subsets as antecedent and consequent. Then, inductively, check whether another item can be moved from the antecedent to the consequent.
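The two steps above can be sketched minimally over a toy transaction list; the data, thresholds, and helper names are made up for illustration, and this omits the hash-table bookkeeping of a production implementation:

```python
from itertools import combinations

# Toy transaction database and thresholds (hypothetical values).
transactions = [
    {"milk", "bread"},
    {"milk", "bread", "butter"},
    {"bread", "butter"},
    {"milk", "bread", "butter"},
]
n = len(transactions)
min_support, min_confidence = 0.5, 0.7

def support(itemset):
    return sum(1 for t in transactions if itemset <= t) / n

# Step 1: frequent itemsets, level by level, pruning by support.
items = sorted({i for t in transactions for i in t})
frequent = {}  # frozenset -> support
level = [frozenset([i]) for i in items if support(frozenset([i])) >= min_support]
while level:
    for s in level:
        frequent[s] = support(s)
    # Candidate (k+1)-item sets: unions of frequent k-item sets that grow by one.
    candidates = {a | b for a in level for b in level if len(a | b) == len(a) + 1}
    level = [c for c in candidates if support(c) >= min_support]

# Step 2: split each frequent itemset into antecedent -> consequent rules
# and keep only those with enough confidence.
rules = []
for s in frequent:
    if len(s) < 2:
        continue
    for r in range(1, len(s)):
        for antecedent in map(frozenset, combinations(s, r)):
            conf = frequent[s] / frequent[antecedent]
            if conf >= min_confidence:
                rules.append((set(antecedent), set(s - antecedent), conf))
```

On this toy data the three-item set {milk, bread, butter} is frequent, and rules such as milk → bread survive the confidence filter.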