Foundations 2
Modeling data as a random variable

Data usually come from a process that is not completely known. Example: a coin toss. The result (heads or tails) is deterministic: given sufficient knowledge, we could use Newton's laws to calculate the result of each toss.
Alternative: accept uncertainty about the result of the toss. Treat the result as a random variable X governed by P(X = x), and use P(X = x) to make a rational decision about the result of the next toss.
Statistical Analysis of Coin-Toss Data

Let heads = 1 and tails = 0. The result obeys Bernoulli statistics:
$$P(x) = p_o^x (1 - p_o)^{1 - x}, \qquad x \in \{0, 1\}$$
where $p_o$ is the probability of heads. Given a sample of N tosses, an unbiased estimator of $p_o$ is (number of heads) / (number of tosses).
Prediction of the next toss: heads if $\hat{p}_o > 1/2$, tails otherwise.
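A minimal runnable sketch of this estimator and prediction rule (the simulation and the names in it are my own, not from the notes):

```python
import random

def estimate_p_heads(tosses):
    """Unbiased estimate of p_o: the fraction of heads in the sample."""
    return sum(tosses) / len(tosses)

# Simulate N tosses of a coin whose true p_o is 0.7 (heads = 1, tails = 0).
true_p = 0.7
sample = [1 if random.random() < true_p else 0 for _ in range(1000)]

p_hat = estimate_p_heads(sample)
prediction = "heads" if p_hat > 0.5 else "tails"
print(f"p_hat = {p_hat:.3f}; predict {prediction} for the next toss")
```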
Discriminant functions

Consider a 2D dichotomizer for bank loans. Attributes: income ($x_1$) and savings ($x_2$). Class: high risk (C = 1).
$P(C = 1 | x_1, x_2)$, the probability of high risk given the attribute values, is a "discriminant" function for classification.
If the probability is normalized, we can use the rule: choose C = 1 if $P(C = 1 | x_1, x_2) > 0.5$; otherwise choose C = 0.
Normalization is unnecessary; we can use the rule: choose C = 1 if $P(C = 1 | x_1, x_2) > P(C = 0 | x_1, x_2)$.
Bayes' rule allows for prior knowledge about class

$$\underbrace{P(C|x)}_{\text{posterior}} = \frac{\overbrace{p(x|C)}^{\text{class likelihood}}\;\overbrace{P(C)}^{\text{prior}}}{\underbrace{p(x)}_{\text{evidence}}}$$

Example of prior knowledge: "most of our clients are high risk."
Prior P(C): the probability of class C independent of the attributes; what we know about credit risk before we observe a client's attributes. It might come from per-capita bankruptcies.
Class likelihood p(x|C): the probability of observing attributes x conditioned on the event being in class C. Given that a client is high risk (C = 1), how likely is x = {x1, x2}? It is deduced from data on a set of known high-risk clients.
Evidence p(x): essentially a normalization factor independent of class; also called the "marginal probability" that x is seen regardless of class.
Posterior P(C|x): the probability that the client belongs to class C conditioned on the attributes being x. When normalized by the evidence, the posteriors add up to 1.
Assign the client to the class with the higher posterior: the maximum a posteriori (MAP) approach.
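A minimal numeric sketch of this MAP rule (the prior and likelihood values are invented for illustration; the names are my own):

```python
def map_classify(priors, likelihoods):
    """Return the MAP class and normalized posteriors via Bayes' rule.

    priors[c] = P(C=c); likelihoods[c] = p(x | C=c) for the observed x.
    """
    evidence = sum(priors[c] * likelihoods[c] for c in priors)  # p(x)
    posteriors = {c: priors[c] * likelihoods[c] / evidence for c in priors}
    return max(posteriors, key=posteriors.get), posteriors

# Illustrative numbers: 30% of clients are high risk (C = 1), and the
# observed (income, savings) is four times likelier under C = 1 than C = 0.
label, post = map_classify({0: 0.7, 1: 0.3}, {0: 0.05, 1: 0.20})
print(label, post)  # the posteriors sum to 1
```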
A discriminant function for a 2-class problem can be defined as the ratio of class likelihoods: $g(x) = p(x|C_1)/p(x|C_2)$.
What is the rule for using this discriminant function in classification?
If g(x) > 1, choose C1; otherwise choose C2.
A discriminant function for a 2-class problem can also be defined as the log of the Bayesian odds ratio: $g(x) = \log\left(P(C_1|x)/P(C_2|x)\right)$.
What is the rule for using this discriminant in classification? When does classification using this discriminant function become independent of the attributes?
By Bayes' rule, $g(x) = \log\left(p(x|C_1)/p(x|C_2)\right) + \log\left(P(C_1)/P(C_2)\right)$.
Choose C1 if g(x) > 0, else choose C2. Classification becomes independent of the attributes when the prior term dominates the likelihood term.
Normalized Bayesian dichotomizer

$$P(C_i|x) = \frac{p(x|C_i)\,P(C_i)}{p(x)}, \qquad P(C_0) + P(C_1) = 1 \;\text{(normalized priors)}, \qquad P(C_0|x) + P(C_1|x) = 1 \;\text{(normalized posteriors)}$$

The prior, class likelihood, evidence, and posterior are as defined on the previous slide. Assign the client to the class with the higher posterior.
Normalized Bayesian classifier, K > 2 classes (MAP approach)

$$P(C_i|x) = \frac{p(x|C_i)\,P(C_i)}{p(x)} = \frac{p(x|C_i)\,P(C_i)}{\sum_{k=1}^{K} p(x|C_k)\,P(C_k)}, \qquad \text{choose } C_i \text{ with the maximum posterior}$$

Priors, likelihoods, posteriors, and margins (the joint terms $p(x|C_k)\,P(C_k)$) are class specific; the evidence is the sum of the margins over the classes.
Minimizing risk of decisions

Action $\alpha_i$ is assigning x to class $C_i$ of K classes. $\lambda_{ik}$ is the loss that occurs if we take $\alpha_i$ when x belongs to $C_k$.
Expected risk (Duda and Hart, 1973):
$$R(\alpha_i|x) = \sum_{k=1}^{K} \lambda_{ik}\,P(C_k|x)$$
For minimum risk, choose the action with the lowest expected risk. Special case: correct decisions incur no loss and all errors have equal cost, the "0/1 loss function":
$$\lambda_{ik} = \begin{cases} 0 & \text{if } i = k \\ 1 & \text{if } i \neq k \end{cases} \qquad\Rightarrow\qquad R(\alpha_i|x) = \sum_{k \neq i} P(C_k|x) = 1 - P(C_i|x) \;\text{(normalized posteriors)}$$
So for minimum risk under 0/1 loss, choose the most probable class.
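A minimal sketch of the expected-risk rule, checking that 0/1 loss reduces it to picking the most probable class (function and variable names are my own):

```python
def min_risk_action(posteriors, loss):
    """Pick the action minimizing R(alpha_i|x) = sum_k loss[i][k] * P(C_k|x)."""
    risks = [sum(l_ik * p_k for l_ik, p_k in zip(row, posteriors))
             for row in loss]
    return min(range(len(risks)), key=risks.__getitem__)

posteriors = [0.2, 0.5, 0.3]                                   # P(C_k|x)
zero_one = [[0 if i == k else 1 for k in range(3)] for i in range(3)]
best = min_risk_action(posteriors, zero_one)
assert best == max(range(3), key=posteriors.__getitem__)       # the MAP class
print(best)  # 1: the most probable class
```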
Duda and Hart model applied to a 2-class problem

$$R(\alpha_1|x) = \lambda_{11}\,P(C_1|x) + \lambda_{12}\,P(C_2|x)$$
$$R(\alpha_2|x) = \lambda_{21}\,P(C_1|x) + \lambda_{22}\,P(C_2|x)$$

$\lambda_{11} = \lambda_{22} = 0$: no cost for correct decisions.
$\lambda_{12} = 10$ and $\lambda_{21} = 1$: the cost of an incorrect assignment to C1 is ten times greater than the cost of an incorrect assignment to C2.
The posteriors are normalized. State the classification rule in terms of $P(C_1|x)$. First step?
Substitute the cost matrix into the model:
$$R(\alpha_1|x) = 10\,P(C_2|x) \qquad R(\alpha_2|x) = P(C_1|x)$$
What is the rule for choosing C1?
Rule for assigning a client to C1: choose C1 if $R(\alpha_1|x) < R(\alpha_2|x)$, which becomes $10\,P(C_2|x) < P(C_1|x)$ in this risk model.
How do we get a rule that involves only $P(C_1|x)$?
Use normalization to eliminate $P(C_2|x)$: $10\,(1 - P(C_1|x)) < P(C_1|x)$.
Solving for $P(C_1|x)$: choose C1 if $P(C_1|x) > 10/11$.
What does this mean?
With $\lambda_{12} = 10$ and $\lambda_{21} = 1$, the consequence of erroneously assigning a client to C1 is so great that we choose C1 only when we are virtually certain it is correct. The sketch below checks the threshold numerically.
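A minimal numeric check of the 10/11 threshold (the posterior values are chosen just below and just above it):

```python
# lambda_12 = 10, lambda_21 = 1; the threshold 10/11 is about 0.909.
for p1 in (0.88, 0.92):                  # P(C1|x) below / above 10/11
    r1 = 10 * (1 - p1)                   # R(alpha_1|x) = 10 P(C2|x)
    r2 = p1                              # R(alpha_2|x) = P(C1|x)
    choice = "C1" if r1 < r2 else "C2"
    print(f"P(C1|x) = {p1}: R(a1) = {r1:.2f}, R(a2) = {r2:.2f} -> choose {choice}")
```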
Rejection option: the risk of not assigning a class

Assume normalized posteriors, and add a rejection action with loss $\lambda$, $0 < \lambda < 1$:
$$R(\alpha_i|x) = 1 - P(C_i|x) \;\text{(risk of choosing } C_i\text{)}, \qquad R(\text{reject}|x) = \lambda \;\text{(risk of not assigning, i.e., risk of rejection)}$$
so $1 - \lambda$ is the risk of assigning, the threshold the maximum posterior must exceed.
If the risk of making an assignment is low (large $\lambda$), the threshold on the maximum posterior for choosing a class is lower. We can then make class assignments even when the posteriors are not very discriminating (i.e., even when the classifier is weak); we use such a classifier when the risk associated with making no decision is high.
If the risk of making an assignment is high (small $\lambda$), we want a classifier that is very discriminating (i.e., one that generates posteriors near unity for the correct class).
Relationship between the accuracy of classification and the risk of not assigning

With this risk model: choose $C_i$ if $P(C_i|x)$ is the highest posterior and $P(C_i|x) > 1 - \lambda$, the risk of making an assignment.
If the classifier is accurate, the risk of making an assignment can be low, which is achieved by making $\lambda$ closer to unity. A sketch of this rule follows below.
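A minimal sketch of the 0/1 rule with rejection (function and variable names are my own):

```python
def classify_with_reject(posteriors, lam):
    """0/1 loss with rejection: return the index of the MAP class,
    or None (reject) if its posterior does not exceed 1 - lam."""
    i = max(range(len(posteriors)), key=posteriors.__getitem__)
    return i if posteriors[i] > 1 - lam else None

print(classify_with_reject([0.55, 0.45], lam=0.3))  # None: 0.55 <= 0.7, reject
print(classify_with_reject([0.80, 0.20], lam=0.3))  # 0: 0.80 > 0.7, assign
```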
Question: if more accurate classifiers are more expensive, propose a 3-level classification cascade to minimize cost. If the classifiers in the cascade adopt the 0/1 loss function with rejection, how should $\lambda$ change across the components of the cascade?
Answer: the more accurate (more expensive) classifiers are placed downstream so that they are used only when necessary. Downstream classifiers can have a lower risk of assigning, which is $1 - \lambda$ in the 0/1 risk model with rejection; therefore $\lambda_1 < \lambda_2 < \lambda_3$. A sketch of such a cascade follows below.
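A minimal sketch of a 3-level cascade under these assumptions (the three posterior-producing stages are hypothetical stand-ins for classifiers of increasing cost and accuracy):

```python
def classify_with_reject(posteriors, lam):
    """0/1 loss with rejection, as in the sketch above."""
    i = max(range(len(posteriors)), key=posteriors.__getitem__)
    return i if posteriors[i] > 1 - lam else None

def cascade(x, stages, lams):
    """Run classifiers in order of increasing cost; each stage either
    assigns a class or rejects, passing x on to the next (better) stage."""
    for predict_posteriors, lam in zip(stages, lams):
        label = classify_with_reject(predict_posteriors(x), lam)
        if label is not None:
            return label
    return None  # even the final stage rejected

# Hypothetical stages, from cheap/weak to expensive/discriminating.
stages = [lambda x: [0.55, 0.45],
          lambda x: [0.70, 0.30],
          lambda x: [0.95, 0.05]]
lams = [0.2, 0.3, 0.9]              # lambda_1 < lambda_2 < lambda_3
print(cascade(None, stages, lams))  # 0: decided only at the third stage
```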
Examples of discriminant functions

Decision boundaries are defined by $g_k(x) = \text{constant}$: hyper-surfaces in attribute space that separate the classes.
Example: $P(C_1|x) = 0.5$ is a boundary for a Bayesian dichotomizer with normalized posteriors.
Bayes' classifier based on neighbors

$$P(C|x) = \frac{p(x|C)\,P(C)}{p(x)}$$

The prior, class likelihood, evidence, and posterior are as defined earlier; assign the client to the class with the highest posterior.
How can we use neighbors in attribute space to estimate the posteriors?
Bayes' classifier based on neighbors

Consider a data set with N examples, $n_i$ of them belonging to class i.
Given a new example x, draw a hyper-sphere in attribute space, centered on x, containing precisely K training examples, irrespective of their class. Suppose this sphere has volume V and contains $k_i$ examples from class i.
Then the class likelihood is $p(x|C_i) = k_i / (n_i V)$.
The evidence $p(x) = K / (N V)$ is the unconditional probability of drawing x irrespective of its class.
Bayes' classifier based on neighbors

The class priors are $P(C_i) = n_i / N$ (i.e., how well class i is represented in the training data).
When we use Bayes' rule to calculate the posteriors, the explicit dependence on V cancels out:
$$P(C_i|x) = \frac{p(x|C_i)\,P(C_i)}{p(x)} = \frac{\big(k_i / (n_i V)\big)\,\big(n_i / N\big)}{K / (N V)} = \frac{k_i}{K}$$
Assign x to the class with the highest posterior, i.e., the class with the highest representation among the K training examples in the hyper-sphere centered on x.
K = 1 (nearest-neighbor rule): assign x to the class of the nearest neighbor in the training data.
Pseudo-code for KNN

Given K, how do we calculate $P(C_i|x) = k_i / K$?
Find the "distance" between x and all other members of the data set.
If the attributes are real numbers, use Euclidean distance. If the attributes are binary, try Hamming distance.
Hamming: how many bits are different?
x = 0 1 0 0 1 0 1
y = 1 1 0 1 1 0 0
Hamming distance between x and y = 3.
How do I use the distance measurements?
Sort the distances in increasing order. Among the K smallest, count the number $k_i$ in class i; then $P(C_i|x) = k_i / K$. A runnable sketch follows below.
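A minimal runnable sketch of this procedure (function names are my own; Euclidean distance for real-valued attributes, Hamming for binary):

```python
from collections import Counter

def euclidean(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5

def hamming(x, y):
    return sum(a != b for a, b in zip(x, y))  # count of differing bits

def knn_posteriors(x, data, labels, K, dist=euclidean):
    """Estimate P(C_i|x) = k_i / K from the K nearest training examples."""
    ranked = sorted(range(len(data)), key=lambda j: dist(x, data[j]))
    counts = Counter(labels[j] for j in ranked[:K])
    return {c: k / K for c, k in counts.items()}

print(hamming([0, 1, 0, 0, 1, 0, 1], [1, 1, 0, 1, 1, 0, 0]))  # 3, as above
data = [(1.0, 2.0), (1.5, 1.8), (5.0, 8.0), (6.0, 9.0)]
labels = [0, 0, 1, 1]
print(knn_posteriors((1.2, 1.9), data, labels, K=3))  # {0: 2/3, 1: 1/3}
```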
Bayes' classifier based on neighbors

In a 2-attribute data set, we can visualize the classification by applying KNN to each point in the plane. As K increases, expect fewer islands and smoother boundaries. A sketch of this grid evaluation follows below.
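A minimal sketch of the grid evaluation behind such a visualization, assuming numpy and a toy two-class data set (plotting the resulting grid, e.g. with matplotlib, is left out):

```python
import numpy as np

def knn_label(x, X, y, K):
    """Majority class among the K nearest training examples (Euclidean)."""
    nearest = np.argsort(((X - x) ** 2).sum(axis=1))[:K]
    return np.bincount(y[nearest]).argmax()

# Toy 2-attribute training set: two Gaussian blobs, one per class.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (20, 2)), rng.normal(2.5, 1.0, (20, 2))])
y = np.array([0] * 20 + [1] * 20)

# Classify every point of a grid over the plane; larger K smooths the regions.
xs = np.linspace(-3, 6, 60)
grid = np.array([[knn_label(np.array([a, b]), X, y, K=5) for a in xs]
                 for b in xs])
print(grid.mean())  # fraction of the plane assigned to class 1
```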
Dichotomizer confusion matrix

Error rate = # of errors / # of instances = (FN + FP) / N
Recall = # of positives found / # of positives = TP / (TP + FN) = sensitivity = hit rate
Precision = # of true positives found / # of positives found = TP / (TP + FP)
Specificity = TN / (TN + FP)
False alarm rate = FP / (FP + TN) = 1 - specificity
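A minimal sketch computing these metrics from 0/1 true and predicted labels (names are my own):

```python
def confusion_metrics(y_true, y_pred):
    """Dichotomizer metrics from the counts of the confusion matrix."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    n = len(y_true)
    return {
        "error_rate": (fn + fp) / n,
        "recall": tp / (tp + fn),            # sensitivity, hit rate
        "precision": tp / (tp + fp),
        "specificity": tn / (tn + fp),
        "false_alarm_rate": fp / (fp + tn),  # 1 - specificity
    }

print(confusion_metrics([1, 1, 0, 0, 1, 0], [1, 0, 0, 1, 1, 0]))
```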
Definition of ROC curve

Let q be the threshold on the normalized posterior P(C|x) for assigning example x to the positive class C.
If q is almost 1, we rarely make an assignment to C, but we expect few false positives. As q decreases, we make more assignments to C and expect more false positives.
Sweeping 0 < q < 1 parametrically defines a true-positive versus false-positive curve called the "receiver operating characteristic" (ROC).
Pseudo-code for ROC construction

Rank all examples by the value of P(C|x) (or some other diagnostic score for the positive class).
Flag each example as positive or negative (positive = the example truly belongs to C, so assignment to C would be correct; negative = it does not).
For each example, calculate (TP rate, FP rate): TP rate = fraction of positives with an equal or greater score; FP rate = fraction of negatives with an equal or greater score.
Plot the points and calculate the area under the curve. A runnable sketch follows below.
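A minimal sketch of this construction, with a trapezoidal area-under-curve estimate (names and the toy scores are my own):

```python
def roc_points(scores, labels):
    """ROC points for a positive-class score; labels are 1 (positive) / 0.
    Thresholding at each example's score yields one (FP rate, TP rate) point."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]
    for s, _ in sorted(zip(scores, labels), reverse=True):
        tp = sum(l for sc, l in zip(scores, labels) if sc >= s)
        fp = sum(1 - l for sc, l in zip(scores, labels) if sc >= s)
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Trapezoidal area under the ROC curve."""
    pts = sorted(points)
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(pts, pts[1:]))

scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4]
labels = [1, 1, 0, 1, 0, 0]
pts = roc_points(scores, labels)
print(pts, auc(pts))  # AUC = 8/9 for this toy ranking
```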
[Figure: example data (filled circles = positive, open circles = negative) and the resulting ROC curves for a good classifier versus a poor classifier.]
ROC-related curves

Other combinations of confusion-matrix variables can be used in q-parameterized curve definitions (e.g., precision versus recall).