Lecture Slides for Introduction to Machine Learning 2e, Ethem Alpaydın © The MIT Press (V1.0)
Probability and Inference
The result of tossing a coin is in {Heads, Tails}. Random variable X ∈ {1, 0}.
Bernoulli distribution: P(X = 1) = p_o and P(X = 0) = 1 − p_o
Sample: X = {x^t}_{t=1}^N
Estimation: p_o = #{Heads}/#{Tosses} = Σ_t x^t / N
Prediction of next toss: Heads if p_o > 1/2, Tails otherwise
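As a minimal numeric sketch of this estimation and prediction rule (the sample of tosses below is made up for illustration):

```python
import numpy as np

# Hypothetical sample of N coin tosses: 1 = Heads, 0 = Tails
x = np.array([1, 0, 1, 1, 0, 1, 0, 1, 1, 1])

# Estimate p_o as the fraction of Heads in the sample: sum_t x^t / N
p_hat = x.sum() / len(x)

# Predict the next toss: Heads if p_hat > 1/2, Tails otherwise
prediction = "Heads" if p_hat > 0.5 else "Tails"
print(p_hat, prediction)   # 0.7 Heads
```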
Classification
Credit scoring: inputs are income (x_1) and savings (x_2); output is low-risk vs. high-risk.
Input: x = [x_1, x_2]^T, Output: C ∈ {0, 1}
Prediction: choose C = 1 if P(C = 1 | x_1, x_2) > 0.5 and C = 0 otherwise; equivalently, choose C = 1 if P(C = 1 | x) > P(C = 0 | x)
Bayes’ Rule
P(C | x) = P(C) p(x | C) / p(x), that is, posterior = prior × likelihood / evidence
Likelihood p(x | C): conditional probability of seeing observation x when the class is C
Prior P(C): probability of class C before seeing any observation
Evidence p(x): marginal probability of observing x, p(x) = p(x | C = 1) P(C = 1) + p(x | C = 0) P(C = 0)
Bayes’ Rule: K > 2 Classes
P(C_i | x) = p(x | C_i) P(C_i) / p(x) = p(x | C_i) P(C_i) / Σ_{k=1}^{K} p(x | C_k) P(C_k)
with priors P(C_i) ≥ 0 and Σ_{i=1}^{K} P(C_i) = 1
Choose C_i if P(C_i | x) = max_k P(C_k | x)
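A small sketch of the K-class rule in Python; the priors and likelihood values are invented for illustration and are not from the slides:

```python
import numpy as np

# Hypothetical priors P(C_k) and likelihoods p(x | C_k) for K = 3 classes
priors = np.array([0.5, 0.3, 0.2])
likelihoods = np.array([0.1, 0.4, 0.3])   # p(x | C_k) evaluated at one observation x

evidence = np.sum(likelihoods * priors)        # p(x)
posteriors = likelihoods * priors / evidence   # P(C_k | x), sums to 1
chosen = np.argmax(posteriors)                 # choose the class with maximum posterior
print(posteriors, chosen)
```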
Losses and Risks
Action α_i: the decision to assign the input to class C_i
Loss of α_i when the input actually belongs to C_k: λ_ik
Expected risk (Duda and Hart, 1973) for taking action α_i: R(α_i | x) = Σ_{k=1}^{K} λ_ik P(C_k | x)
Choose α_i with minimum risk: R(α_i | x) = min_k R(α_k | x)
Losses and Risks: 0/1 Loss
In the special case of the zero-one loss, λ_ik = 0 if i = k and λ_ik = 1 if i ≠ k.
Then R(α_i | x) = Σ_{k ≠ i} P(C_k | x) = 1 − P(C_i | x)
For minimum risk, choose the most probable class.
Losses and Risks: Reject
Define an additional action of reject, α_{K+1}, with loss λ_{(K+1)k} = λ, 0 < λ < 1:
R(α_{K+1} | x) = Σ_{k=1}^{K} λ P(C_k | x) = λ, while R(α_i | x) = 1 − P(C_i | x)
Choose C_i if P(C_i | x) > P(C_k | x) for all k ≠ i and P(C_i | x) > 1 − λ; reject otherwise.
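The following sketch computes the 0/1-loss risks and applies the reject rule; the posterior values and the reject cost λ = 0.4 are hypothetical:

```python
import numpy as np

# Hypothetical posteriors P(C_k | x) for K = 3 classes
posteriors = np.array([0.5, 0.3, 0.2])
reject_loss = 0.4                      # lambda, fixed cost of rejecting (0 < lambda < 1)

# 0/1 loss: risk of choosing class i is 1 - P(C_i | x)
risks = 1.0 - posteriors
best = np.argmin(risks)

# Reject if even the best class is riskier than the reject action
decision = "reject" if risks[best] > reject_loss else f"class {best}"
print(risks, decision)   # risks = [0.5 0.7 0.8] -> reject, since 0.5 > 0.4
```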
Discriminant Functions
Set g_i(x) = −R(α_i | x) and choose C_i if g_i(x) = max_k g_k(x)
Bayes’ classifier with the 0/1 loss function: g_i(x) = P(C_i | x), or, ignoring the common term p(x), g_i(x) = p(x | C_i) P(C_i)
The discriminants divide the feature space into K decision regions R_1, ..., R_K, where R_i = {x | g_i(x) = max_k g_k(x)}
K = 2 Classes
Define a single discriminant g(x) = g_1(x) − g_2(x): choose C_1 if g(x) > 0 and C_2 otherwise
Log odds: log [P(C_1 | x) / P(C_2 | x)]; choose C_1 if the log odds is positive
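A short sketch of the two-class discriminant and log odds, assuming made-up priors and likelihood values:

```python
import math

# Hypothetical two-class setting: priors and class likelihoods at one observation x
prior = {1: 0.6, 2: 0.4}
likelihood = {1: 0.2, 2: 0.5}          # p(x | C_1), p(x | C_2)

# Discriminant g(x) = g_1(x) - g_2(x) with g_i(x) = p(x | C_i) P(C_i); p(x) cancels
g = likelihood[1] * prior[1] - likelihood[2] * prior[2]

# Log odds: log P(C_1 | x)/P(C_2 | x) = log[p(x|C_1)P(C_1)] - log[p(x|C_2)P(C_2)]
log_odds = math.log(likelihood[1] * prior[1]) - math.log(likelihood[2] * prior[2])

choice = "C_1" if g > 0 else "C_2"     # equivalently, choose C_1 if log_odds > 0
print(g, log_odds, choice)             # -0.08, about -0.51, C_2
```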
Utility Theory
Certain features may be costly to observe; utility theory lets us assess the value of the information that additional features may provide.
Probability of state S_k given evidence x: P(S_k | x)
Utility of action α_i when the state is S_k: U_ik, measuring how good it is to take action α_i when the state is S_k
Expected utility: EU(α_i | x) = Σ_k U_ik P(S_k | x); choose α_i if EU(α_i | x) = max_j EU(α_j | x)
For example, in classification, actions correspond to choosing one of the classes, and maximizing expected utility is equivalent to minimizing expected risk.
Association Rules
Association rule: X → Y
People who buy/click/visit/enjoy X are also likely to buy/click/visit/enjoy Y.
A rule implies association, not necessarily causation.
Association Measures
Association rule: X → Y
Support(X → Y) = P(X, Y): the statistical significance of the rule
Confidence(X → Y) = P(Y | X) = P(X, Y) / P(X): the conditional probability that a customer who has X also has Y
Lift(X → Y) = P(X, Y) / [P(X) P(Y)] = P(Y | X) / P(Y): the interest of the rule
If X and Y are independent, we expect the lift to be close to 1. A lift greater than 1 means that having X makes Y more likely, and a lift less than 1 means that having X makes Y less likely.
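A small sketch computing support, confidence, and lift from a toy transaction list (the transactions and item names are invented for illustration):

```python
# Hypothetical market-basket transactions (item sets)
transactions = [
    {"milk", "bread"},
    {"milk", "bread", "butter"},
    {"bread"},
    {"milk", "butter"},
    {"milk", "bread"},
]
N = len(transactions)

def prob(items):
    """Fraction of transactions containing all the given items."""
    items = set(items)
    return sum(items <= t for t in transactions) / N

X, Y = {"milk"}, {"bread"}
support = prob(X | Y)                   # P(X, Y)
confidence = support / prob(X)          # P(Y | X)
lift = support / (prob(X) * prob(Y))    # P(X, Y) / (P(X) P(Y))
print(support, confidence, lift)        # 0.6, 0.75, 0.9375
```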
Apriori Algorithm (Agrawal et al., 1996)
For example, {X, Y, Z} is a 3-item set, and we may look for a rule such as X, Y → Z. All such rules must have high enough support and confidence.
Since a sales database is generally very large, we need an efficient algorithm (the Apriori algorithm) that finds these rules with a small number of passes over the database.
The Apriori algorithm works in two steps:
Finding frequent itemsets, that is, itemsets with enough support. If {X, Y, Z} is frequent, then {X, Y}, {X, Z}, and {Y, Z} must also be frequent; conversely, if {X, Y} is not frequent, none of its supersets can be frequent. A sketch of this step is given after this slide.
Converting frequent itemsets to rules with enough confidence. Once we find the frequent k-item sets, we convert them to rules such as X, Y → Z and X → Y, Z. For all possible single consequents, we check whether the rule has enough confidence and remove it if it does not.
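Below is a minimal sketch of the frequent-itemset step using the Apriori pruning principle described above; the function name, the minimum-support threshold, and the toy transactions (repeated from the previous sketch so this one is self-contained) are illustrative assumptions, not part of the slides:

```python
from itertools import combinations

def apriori_frequent_itemsets(transactions, min_support):
    """Return frequent itemsets and their supports, using the Apriori pruning idea:
    a k-item set can only be frequent if all of its (k-1)-item subsets are frequent."""
    N = len(transactions)

    def support(itemset):
        return sum(itemset <= t for t in transactions) / N

    # Frequent 1-item sets
    items = sorted({item for t in transactions for item in t})
    current = [frozenset([i]) for i in items if support(frozenset([i])) >= min_support]
    frequent = {s: support(s) for s in current}

    k = 2
    while current:
        # Candidate k-item sets: unions of frequent (k-1)-item sets
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Apriori pruning: keep candidates whose (k-1)-subsets are all frequent
        candidates = {c for c in candidates
                      if all(frozenset(sub) in frequent for sub in combinations(c, k - 1))}
        current = [c for c in candidates if support(c) >= min_support]
        frequent.update({c: support(c) for c in current})
        k += 1
    return frequent

transactions = [
    {"milk", "bread"}, {"milk", "bread", "butter"},
    {"bread"}, {"milk", "butter"}, {"milk", "bread"},
]
for itemset, sup in apriori_frequent_itemsets(transactions, min_support=0.4).items():
    print(set(itemset), sup)
```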
Exercise
In a two-class, two-action problem, if the loss function is λ_11 = λ_22 = 0, λ_12 = 10, and λ_21 = 1, write the optimal decision rule.
Show that as we move an item from the antecedent to the consequent, confidence can never increase: confidence(A, B, C → D) ≥ confidence(A, B → C, D).
Bayesian Networks
Bayesian networks are also known as graphical models or probabilistic networks.
Nodes are hypotheses (random variables), and the probability of a node corresponds to our belief in the truth of the hypothesis.
Arcs are direct influences between hypotheses.
The structure is represented as a directed acyclic graph (DAG).
The parameters are the conditional probabilities on the arcs.
(Pearl, 1988, 2000; Jensen, 1996; Lauritzen, 1996)
Causes and Bayes’ Rule
Diagnostic inference: knowing that the grass is wet, what is the probability that rain is the cause?
P(R | W) = P(W | R) P(R) / P(W) = P(W | R) P(R) / [P(W | R) P(R) + P(W | ~R) P(~R)]
The causal direction goes from rain to wet grass, P(W | R); the diagnostic direction inverts the arc with Bayes’ rule to get P(R | W).
Conditional Independence
X and Y are independent if P(X, Y) = P(X) P(Y).
X and Y are conditionally independent given Z if P(X, Y | Z) = P(X | Z) P(Y | Z), or equivalently P(X | Y, Z) = P(X | Z).
Three canonical cases for conditional independence:
Head-to-tail connection
Tail-to-tail connection
Head-to-head connection
Case 1: Head-to-Tail Connection
P(X, Y, Z) = P(X) P(Y | X) P(Z | Y)
X and Z are independent given Y.
Example (Cloudy → Rain → Wet grass):
P(R) = P(R | C) P(C) + P(R | ~C) P(~C) = 0.8 × 0.4 + 0.1 × 0.6 = 0.38
P(W) = P(W | R) P(R) + P(W | ~R) P(~R) = 0.9 × 0.38 + 0.2 × 0.62 = 0.47
P(W | C) = P(W | R) P(R | C) + P(W | ~R) P(~R | C) = 0.9 × 0.8 + 0.2 × 0.2 = 0.76
P(C | W) = P(W | C) P(C) / P(W) = 0.76 × 0.4 / 0.47 = 0.65
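A quick numeric check of the head-to-tail example; the conditional probabilities used here are the ones reconstructed from the slide's arithmetic:

```python
# Head-to-tail chain: Cloudy -> Rain -> Wet grass
P_C = 0.4
P_R_given = {True: 0.8, False: 0.1}    # P(R | C), P(R | ~C)
P_W_given = {True: 0.9, False: 0.2}    # P(W | R), P(W | ~R)

P_R = P_R_given[True] * P_C + P_R_given[False] * (1 - P_C)          # 0.38
P_W = P_W_given[True] * P_R + P_W_given[False] * (1 - P_R)          # ~0.47
P_W_given_C = (P_W_given[True] * P_R_given[True]
               + P_W_given[False] * (1 - P_R_given[True]))          # 0.76
P_C_given_W = P_W_given_C * P_C / P_W                               # ~0.65
print(round(P_R, 2), round(P_W, 2), round(P_W_given_C, 2), round(P_C_given_W, 2))
```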
Case 2: Tail-to-Tail Connection
P(X, Y, Z) = P(X) P(Y | X) P(Z | X)
Y and Z are independent given X.
Example (Cloudy → Sprinkler, Cloudy → Rain):
P(C | R) = P(R | C) P(C) / P(R) = P(R | C) P(C) / [P(R | C) P(C) + P(R | ~C) P(~C)] = 0.8 × 0.5 / (0.8 × 0.5 + 0.1 × 0.5) = 0.89
P(R | S) = P(R | C) P(C | S) + P(R | ~C) P(~C | S) = 0.22
P(R | S) = 0.22 < P(R) = 0.45: knowing that the sprinkler is on makes a cloudy sky, and hence rain, less likely.
Case 3: Head-to-Head Connection
P(X, Y, Z) = P(X) P(Y) P(Z | X, Y)
X and Y are independent (but not, in general, independent given Z).
Causal Inference
Causal inference: if the sprinkler is on, what is the probability that the grass is wet?
P(W | S) = P(W | R, S) P(R | S) + P(W | ~R, S) P(~R | S)
         = P(W | R, S) P(R) + P(W | ~R, S) P(~R)
         = 0.95 × 0.4 + 0.9 × 0.6 = 0.92
The second step uses the fact that R and S are independent (head-to-head connection), so P(R | S) = P(R).
Diagnostic Inference
Diagnostic inference: if the grass is wet, what is the probability that the sprinkler is on?
P(S | W) = P(W | S) P(S) / P(W) = 0.92 × 0.2 / 0.52 = 0.35 > 0.2 = P(S), where P(W) = 0.52
P(S | R, W) = 0.21 < 0.35 = P(S | W)
Explaining away: knowing that it has rained decreases the probability that the sprinkler is on.
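The sketch below reproduces the causal and diagnostic queries for the head-to-head network by enumeration; the two entries P(W | R, ~S) = 0.9 and P(W | ~R, ~S) = 0.1 are reconstructed to match the slide's results (P(W) = 0.52, P(S | R, W) = 0.21) and should be read as assumptions:

```python
from itertools import product

# Head-to-head network: Rain -> Wet grass <- Sprinkler
P_R, P_S = 0.4, 0.2
P_W = {(True, True): 0.95, (True, False): 0.90,    # P(W | R, S),  P(W | R, ~S)
       (False, True): 0.90, (False, False): 0.10}  # P(W | ~R, S), P(W | ~R, ~S)

def p(r, s):
    """Joint P(R=r, S=s); R and S are marginally independent."""
    return (P_R if r else 1 - P_R) * (P_S if s else 1 - P_S)

# Causal: P(W | S) = sum_r P(W | r, S) P(r)
p_w_given_s = sum(P_W[(r, True)] * (P_R if r else 1 - P_R) for r in (True, False))

# Evidence: P(W) = sum_{r,s} P(W | r, s) P(r, s)
p_w = sum(P_W[(r, s)] * p(r, s) for r, s in product((True, False), repeat=2))

# Diagnostic: P(S | W) and P(S | R, W), via Bayes' rule
p_s_given_w = p_w_given_s * P_S / p_w
p_w_given_r = sum(P_W[(True, s)] * (P_S if s else 1 - P_S) for s in (True, False))
p_s_given_rw = P_W[(True, True)] * P_S / p_w_given_r

print(round(p_w_given_s, 2), round(p_w, 2), round(p_s_given_w, 2), round(p_s_given_rw, 2))
# 0.92 0.52 0.35 0.21  -- P(S | R, W) < P(S | W): explaining away
```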
Exercise (p. 417, exercise 3)
Calculate P(R | W), P(R | W, S), and P(R | W, ~S).
Bayesian Networks: Causes
Causal inference in the full network (Cloudy → Sprinkler, Rain; Sprinkler, Rain → Wet grass):
P(W | C) = P(W | R, S, C) P(R, S | C) + P(W | ~R, S, C) P(~R, S | C) + P(W | R, ~S, C) P(R, ~S | C) + P(W | ~R, ~S, C) P(~R, ~S | C) = 0.76
using the facts that P(R, S | C) = P(R | C) P(S | C), since R and S are conditionally independent given C, and that P(W | R, S, C) = P(W | R, S), since W is conditionally independent of C given R and S.
Diagnostic: P(C | W) = ? (exercise)
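A brute-force enumeration sketch for the full four-node network; the CPT values are reconstructed from the arithmetic on the earlier slides, and the second query answers the diagnostic exercise P(C | W):

```python
from itertools import product

# Full network: Cloudy -> {Sprinkler, Rain}, {Sprinkler, Rain} -> Wet grass
P_C = 0.5
P_S_given_C = {True: 0.1, False: 0.5}
P_R_given_C = {True: 0.8, False: 0.1}
P_W_given_RS = {(True, True): 0.95, (True, False): 0.90,
                (False, True): 0.90, (False, False): 0.10}

def joint(c, s, r, w):
    """P(C=c, S=s, R=r, W=w) = P(C) P(S|C) P(R|C) P(W|R,S)."""
    pc = P_C if c else 1 - P_C
    ps = P_S_given_C[c] if s else 1 - P_S_given_C[c]
    pr = P_R_given_C[c] if r else 1 - P_R_given_C[c]
    pw = P_W_given_RS[(r, s)] if w else 1 - P_W_given_RS[(r, s)]
    return pc * ps * pr * pw

def prob(query, given):
    """P(query | given) by enumeration over all variable assignments."""
    vars_ = ("C", "S", "R", "W")
    num = den = 0.0
    for values in product((True, False), repeat=4):
        assign = dict(zip(vars_, values))
        if all(assign[v] == val for v, val in given.items()):
            p = joint(*values)
            den += p
            if all(assign[v] == val for v, val in query.items()):
                num += p
    return num / den

print(round(prob({"W": True}, {"C": True}), 2))   # causal:     P(W | C) = 0.76
print(round(prob({"C": True}, {"W": True}), 2))   # diagnostic: P(C | W)
```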
Bayesian Networks: Inference Algorithms
Belief propagation (Pearl, 1988) is used for inference when the network is a tree.
Junction trees (Lauritzen and Spiegelhalter, 1988) convert a given directed acyclic graph to a tree so that belief propagation can be applied.
Bayesian Networks: Classification
The class C is the parent of the input x; classification is diagnostic inference, P(C | x).
Bayes’ rule inverts the arc: P(C | x) = p(x | C) P(C) / p(x)