CS 2750: Machine Learning Probability Review Prof. Adriana Kovashka University of Pittsburgh February 29, 2016
Plan for today and next two classes
– Probability review
– Density estimation
– Naïve Bayes and Bayesian Belief Networks
Procedural View
Training Stage:
– Raw Data → x (Feature Extraction)
– Training Data {(x, y)} → f (Learning)
Testing Stage:
– Raw Data → x (Feature Extraction)
– Test Data x → f(x) (Apply function, Evaluate error)
(C) Dhruv Batra
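A minimal sketch of this two-stage pipeline in Python; the feature extractor, the least-squares learner, and the synthetic raw data below are hypothetical stand-ins, not the course's code:

```python
import numpy as np

def extract_features(raw):
    # Hypothetical feature extraction: summarize a raw signal by its mean and std.
    return np.array([np.mean(raw), np.std(raw)])

def learn(X_train, y_train):
    # Hypothetical learner: least-squares linear predictor f(x) = w . [x, 1].
    A = np.hstack([X_train, np.ones((len(X_train), 1))])
    w, *_ = np.linalg.lstsq(A, y_train, rcond=None)
    return lambda x: float(np.append(x, 1.0) @ w)

# Training stage: raw data -> features -> {(x, y)} -> learned f
raw_train = [np.random.randn(100) + i for i in range(10)]   # synthetic signals
X_train = np.array([extract_features(r) for r in raw_train])
y_train = np.arange(10, dtype=float)
f = learn(X_train, y_train)

# Testing stage: raw data -> features -> apply f, evaluate error
raw_test = np.random.randn(100) + 4.0
x_test = extract_features(raw_test)
print("prediction:", f(x_test), "abs error:", abs(f(x_test) - 4.0))
```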
Statistical Estimation View
Probabilities to the rescue:
– x and y are random variables
– D = {(x1, y1), (x2, y2), …, (xN, yN)} ~ P(X, Y)
IID: Independent and Identically Distributed
– Both training and testing data are sampled IID from P(X, Y)
– Learn on the training set
– Have some hope of generalizing to the test set
(C) Dhruv Batra
Probability
A is a non-deterministic event
– Can think of A as a boolean-valued variable
Examples:
– A = your next patient has cancer
– A = Rafael Nadal wins US Open 2016
(C) Dhruv Batra
Interpreting Probabilities
What does P(A) mean?
Frequentist View
– the limit, as N → ∞, of #(A is true) / N
– the limiting frequency of a repeating non-deterministic event
Bayesian View
– P(A) is your "belief" about A
Market Design View
– P(A) tells you how much you would bet
(C) Dhruv Batra
Axioms of Probability Theory
– All probabilities are between 0 and 1: 0 <= P(A) <= 1
– A true proposition has probability 1, a false proposition has probability 0: P(true) = 1, P(false) = 0
– The probability of a disjunction is: P(A v B) = P(A) + P(B) – P(A ^ B)
Slide credit: Ray Mooney
Interpreting the Axioms
– 0 <= P(A) <= 1
– P(false) = 0, P(true) = 1
– P(A v B) = P(A) + P(B) – P(A ^ B)
(C) Dhruv Batra, Image Credit: Andrew Moore
Joint Distribution
The joint probability distribution for a set of random variables X1, …, Xn gives the probability of every combination of values: P(X1, …, Xn). If all variables are discrete with v values each, this is an n-dimensional array with v^n entries, and all v^n entries must sum to 1.
The probability of any conjunction (an assignment of values to some subset of the variables) can be calculated by summing the appropriate subset of entries from the joint distribution. Therefore, all conditional probabilities can also be calculated.

                 circle   square
positive  red     0.20     0.02
          blue    0.02     0.01
negative  red     0.05     0.30
          blue    0.20     0.20

Slide credit: Ray Mooney
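A sketch of how this table can be queried programmatically; the 2x2x2 array below encodes the table above (the class x color x shape axis ordering is a choice made here, and the blue/square entry of the negative table is 0.20 so that all entries sum to 1):

```python
import numpy as np

# Joint P(class, color, shape): axis 0 = class (positive, negative),
# axis 1 = color (red, blue), axis 2 = shape (circle, square).
P = np.array([[[0.20, 0.02],    # positive: red-circle, red-square
               [0.02, 0.01]],   # positive: blue-circle, blue-square
              [[0.05, 0.30],    # negative: red-circle, red-square
               [0.20, 0.20]]])  # negative: blue-circle, blue-square

assert np.isclose(P.sum(), 1.0)       # all v^n entries sum to 1

# A conjunction over a subset of variables = sum over the remaining variables.
p_red_circle = P[:, 0, 0].sum()       # P(red, circle), summing over class
p_positive = P[0].sum()               # P(positive)
print(p_red_circle, p_positive)       # 0.25, 0.25
```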
Marginal Distributions
Sum rule: P(X=x) = Σ_y P(X=x, Y=y)
(C) Dhruv Batra, Slide Credit: Erik Sudderth
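A small numeric illustration of the sum rule; the 2x3 joint below is made up for illustration and is not the figure's example:

```python
import numpy as np

# An illustrative joint P(X, Y) over a binary X (rows) and a three-valued Y (columns).
P_xy = np.array([[0.10, 0.20, 0.10],
                 [0.25, 0.05, 0.30]])
assert np.isclose(P_xy.sum(), 1.0)

# Sum rule: marginalize out the other variable.
P_x = P_xy.sum(axis=1)   # P(X=x) = sum_y P(X=x, Y=y)
P_y = P_xy.sum(axis=0)   # P(Y=y) = sum_x P(X=x, Y=y)
print(P_x, P_y)          # [0.4 0.6]  [0.35 0.25 0.4]
```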
Conditional Probabilities
P(A | B) = P(A ^ B) / P(B)
– In worlds where B is true, the fraction where A is also true
Example:
– H: "Have a headache"
– F: "Coming down with Flu"
(C) Dhruv Batra
Conditional Probabilities
P(Y=y | X=x): What do you believe about Y=y, if I tell you X=x?
P(Rafael Nadal wins US Open 2016)? What if I tell you:
– He has won the US Open twice
– Novak Djokovic is ranked 1 and just won the Australian Open
(C) Dhruv Batra
Conditional Distributions
Product rule: P(X=x, Y=y) = P(Y=y | X=x) P(X=x)
(C) Dhruv Batra, Slide Credit: Erik Sudderth
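A matching sketch of the product rule on the same kind of made-up 2x3 joint: dividing the joint by a marginal gives a conditional, and multiplying back recovers the joint.

```python
import numpy as np

P_xy = np.array([[0.10, 0.20, 0.10],
                 [0.25, 0.05, 0.30]])
P_x = P_xy.sum(axis=1, keepdims=True)    # P(X=x), kept as a column for broadcasting

# Conditional distribution: P(Y=y | X=x) = P(X=x, Y=y) / P(X=x)
P_y_given_x = P_xy / P_x
print(P_y_given_x.sum(axis=1))           # each row sums to 1

# Product rule recovers the joint: P(X=x, Y=y) = P(Y=y | X=x) P(X=x)
assert np.allclose(P_y_given_x * P_x, P_xy)
```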
Conditional Probabilities
[Figures from Bishop]
Chain rule
Generalized product rule:
P(X1, …, Xn) = P(X1) P(X2 | X1) P(X3 | X1, X2) … P(Xn | X1, …, Xn-1)
Example:
P(A, B, C) = P(A) P(B | A) P(C | A, B)
Equations from Wikipedia
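A quick numerical check of the chain rule on a randomly generated joint over three binary variables (illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)
P = rng.random((2, 2, 2))          # a random joint P(A, B, C) over three binary variables
P /= P.sum()

P_a = P.sum(axis=(1, 2))                           # P(A)
P_b_given_a = P.sum(axis=2) / P_a[:, None]         # P(B | A)
P_c_given_ab = P / P.sum(axis=2, keepdims=True)    # P(C | A, B)

# Chain rule: P(A, B, C) = P(A) P(B | A) P(C | A, B)
reconstructed = P_a[:, None, None] * P_b_given_a[:, :, None] * P_c_given_ab
assert np.allclose(reconstructed, P)
print("chain rule verified")
```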
Independence
A and B are independent iff:
– P(A | B) = P(A)
– P(B | A) = P(B)
(These two constraints are logically equivalent.)
Therefore, if A and B are independent:
– P(A | B) = P(A ^ B) / P(B) = P(A), so P(A ^ B) = P(A) P(B)
Slide credit: Ray Mooney
Independence
Marginal: P satisfies (X ⊥ Y) if and only if
– P(X=x, Y=y) = P(X=x) P(Y=y), for all x ∈ Val(X), y ∈ Val(Y)
Conditional: P satisfies (X ⊥ Y | Z) if and only if
– P(X=x, Y=y | Z=z) = P(X=x | Z=z) P(Y=y | Z=z), for all x ∈ Val(X), y ∈ Val(Y), z ∈ Val(Z)
(C) Dhruv Batra
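A sketch of checking marginal independence numerically from this definition; the two example joints below are made up:

```python
import numpy as np

def is_independent(P_xy, tol=1e-9):
    # X and Y are independent iff P(X=x, Y=y) = P(X=x) P(Y=y) for all x, y.
    P_x = P_xy.sum(axis=1, keepdims=True)
    P_y = P_xy.sum(axis=0, keepdims=True)
    return np.allclose(P_xy, P_x * P_y, atol=tol)

independent = np.outer([0.3, 0.7], [0.4, 0.6])   # built as a product of marginals
dependent = np.array([[0.4, 0.1],
                      [0.1, 0.4]])
print(is_independent(independent), is_independent(dependent))   # True False
```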
Independent Random Variables
(C) Dhruv Batra, Slide Credit: Erik Sudderth
Other Concepts
Expectation: E[f] = Σ_x p(x) f(x)
Variance: var[f] = E[(f(x) − E[f(x)])^2] = E[f(x)^2] − E[f(x)]^2
Covariance: cov[x, y] = E_{x,y}[(x − E[x])(y − E[y])] = E_{x,y}[x y] − E[x] E[y]
Equations from Bishop
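A small sketch computing these quantities for a discrete distribution; the values are illustrative, and y is taken to be a deterministic function of x so that the same probabilities define the joint over (x, y):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])        # values of a discrete random variable
p = np.array([0.1, 0.2, 0.3, 0.4])        # their probabilities

E_x = np.sum(p * x)                       # E[x] = sum_x p(x) x
var_x = np.sum(p * (x - E_x) ** 2)        # var[x] = E[(x - E[x])^2]

y = np.array([2.0, 1.0, 4.0, 3.0])        # y value paired with each x value
E_y = np.sum(p * y)
cov_xy = np.sum(p * (x - E_x) * (y - E_y))   # cov[x, y] = E[(x - E[x])(y - E[y])]
print(E_x, var_x, cov_xy)                    # 3.0, 1.0, 0.6 (up to float rounding)
```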
Central Limit Theorem
Let {X1, ..., Xn} be a random sample of size n, that is, a sequence of independent and identically distributed random variables drawn from a distribution with expected value µ and finite variance σ^2. Suppose we are interested in the sample average S_n = (X1 + … + Xn) / n of these random variables. By the law of large numbers, the sample averages converge in probability and almost surely to the expected value µ as n → ∞ […] For large enough n, the distribution of S_n is close to the normal distribution with mean µ and variance σ^2 / n. The usefulness of the theorem is that the distribution of √n (S_n − µ) approaches normality regardless of the shape of the distribution of the individual Xi's.
Text from Wikipedia
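A simulation sketch of the theorem: averaging n draws from a clearly non-normal distribution (Exponential with rate 1, so µ = 1 and σ^2 = 1) gives sample averages centered at µ with spread close to σ/√n.

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 500, 5000

# X_i ~ Exponential(1): skewed, clearly non-normal, with mu = 1 and sigma^2 = 1.
samples = rng.exponential(scale=1.0, size=(trials, n))
S_n = samples.mean(axis=1)                 # one sample average per trial

# CLT: S_n is approximately Normal(mu, sigma^2 / n) for large n.
print("mean of S_n:", S_n.mean())          # close to mu = 1
print("std of S_n:", S_n.std())            # close to 1 / sqrt(n), about 0.045
```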
Entropy
(C) Dhruv Batra, Slide Credit: Sam Roweis
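The slide's formula is not preserved in this text; as a reminder, the standard definition for a discrete distribution is H(X) = −Σ_x p(x) log p(x). A small sketch:

```python
import numpy as np

def entropy(p, base=2):
    # H(X) = -sum_x p(x) log p(x); terms with p(x) = 0 contribute 0.
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]
    return -np.sum(nz * np.log(nz)) / np.log(base)

print(entropy([0.5, 0.5]))    # 1.0 bit: a fair coin is maximally uncertain
print(entropy([0.9, 0.1]))    # ~0.47 bits: a biased coin is less uncertain
print(entropy([1.0, 0.0]))    # 0.0: a certain outcome carries no uncertainty
```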
KL-Divergence / Relative Entropy
(C) Dhruv Batra, Slide Credit: Sam Roweis
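Likewise, the standard discrete form is KL(p || q) = Σ_x p(x) log(p(x) / q(x)); a small sketch, assuming q(x) > 0 wherever p(x) > 0:

```python
import numpy as np

def kl_divergence(p, q):
    # KL(p || q) = sum_x p(x) log(p(x) / q(x)), with 0 log 0 treated as 0.
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

p, q = [0.5, 0.5], [0.9, 0.1]
print(kl_divergence(p, q), kl_divergence(q, p))   # asymmetric, both >= 0
print(kl_divergence(p, p))                        # 0.0: KL(p || p) = 0
```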
Bayes Theorem
P(H | E) = P(E | H) P(H) / P(E)
Simple proof from the definition of conditional probability:
– P(H | E) = P(H ^ E) / P(E)   (Def. cond. prob.)
– P(E | H) = P(H ^ E) / P(H)   (Def. cond. prob.)
– So P(H ^ E) = P(E | H) P(H), and therefore P(H | E) = P(E | H) P(H) / P(E)   QED
Adapted from Ray Mooney
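A numeric illustration of the theorem with hypothetical numbers (a disease-testing example that is not from the slides): H is "patient has the disease", E is "the test is positive".

```python
# Hypothetical numbers, for illustration only.
p_h = 0.01                # P(H): prior probability of the disease
p_e_given_h = 0.95        # P(E | H): test sensitivity
p_e_given_not_h = 0.05    # P(E | not H): false positive rate

# P(E) by total probability, then Bayes theorem for the posterior P(H | E).
p_e = p_e_given_h * p_h + p_e_given_not_h * (1 - p_h)
p_h_given_e = p_e_given_h * p_h / p_e
print(p_h_given_e)        # ~0.16: a positive test still leaves the disease unlikely
```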
Probabilistic Classification
Let Y be the random variable for the class, which takes values {y1, y2, …, ym}.
Let X be the random variable describing an instance consisting of a vector of values for n features; let xk be a possible value for X and xij a possible value for Xi.
For classification, we need to compute P(Y=yi | X=xk) for i = 1…m.
However, given no other assumptions, this requires a table giving the probability of each category for each possible instance in the instance space, which is impossible to accurately estimate from a reasonably-sized training set.
– Assuming Y and all Xi are binary, we need 2^n entries to specify P(Y=pos | X=xk) for each of the 2^n possible xk's, since P(Y=neg | X=xk) = 1 – P(Y=pos | X=xk)
– Compared to 2^(n+1) – 1 entries for the joint distribution P(Y, X1, X2, …, Xn)
Slide credit: Ray Mooney
Bayesian Categorization
Determine the category of xk by computing, for each yi:
P(Y=yi | X=xk) = P(Y=yi) P(X=xk | Y=yi) / P(X=xk)
Here P(Y=yi | X=xk) is the posterior, P(Y=yi) is the prior, and P(X=xk | Y=yi) is the likelihood.
P(X=xk) can be determined since the categories are complete and disjoint:
P(X=xk) = Σ_i P(Y=yi) P(X=xk | Y=yi)
Adapted from Ray Mooney
Bayesian Categorization (cont.)
Need to know:
– Priors: P(Y=yi)
– Conditionals (likelihood): P(X=xk | Y=yi)
P(Y=yi) are easily estimated from data.
– If ni of the examples in D are in class yi, then P(Y=yi) = ni / |D|
There are too many possible instances (e.g. 2^n for binary features) to estimate all P(X=xk | Y=yi).
Need to make some sort of independence assumptions about the features to make learning tractable (more details later).
Adapted from Ray Mooney
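A sketch of these estimates on a tiny made-up dataset with a single binary feature (with one feature, no independence assumption is needed yet): counts give the prior and the likelihood, and Bayes rule gives the posterior.

```python
import numpy as np

# Made-up dataset: X is a single binary feature, Y is a binary class label.
X = np.array([0, 0, 0, 1, 1, 1, 1, 0])
Y = np.array([0, 0, 0, 0, 1, 1, 1, 1])
classes, feature_values = [0, 1], [0, 1]

# Priors: P(Y=y_i) = n_i / |D|
priors = {y: np.mean(Y == y) for y in classes}

# Conditionals (likelihood): P(X=x_k | Y=y_i), estimated by counting.
likelihood = {(x, y): np.mean(X[Y == y] == x) for x in feature_values for y in classes}

# Posterior via Bayes rule; P(X=x_k) normalizes over the complete, disjoint classes.
x_k = 1
unnormalized = {y: likelihood[(x_k, y)] * priors[y] for y in classes}
p_x = sum(unnormalized.values())
posterior = {y: v / p_x for y, v in unnormalized.items()}
print(posterior)    # {0: 0.25, 1: 0.75}
```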
Likelihood / Prior / Posterior
– A hypothesis is denoted as h; it is one member of the hypothesis space H.
– A set of training examples is denoted as D, a collection of (x, y) pairs for training.
– Pr(h): the prior probability of the hypothesis. Without observing any training data, what is the probability that h is the target function we want?
Slide content from Rebecca Hwa
Likelihood / Prior / Posterior
– Pr(D): the prior probability of the observed data, i.e. the chance of getting the particular set of training examples D.
– Pr(h|D): the posterior probability of h. What is the probability that h is the target given that we've observed D?
– Pr(D|h): the probability of getting D if h were true (a.k.a. the likelihood of the data).
These are related by Bayes theorem: Pr(h|D) = Pr(D|h) Pr(h) / Pr(D)
Slide content from Rebecca Hwa
MAP vs MLE Estimation
Maximum a posteriori (MAP) estimation:
– h_MAP = argmax_h Pr(h|D) = argmax_h Pr(D|h) Pr(h) / Pr(D) = argmax_h Pr(D|h) Pr(h)
Maximum likelihood estimation (MLE):
– h_ML = argmax_h Pr(D|h)
Slide content from Rebecca Hwa
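A sketch contrasting the two estimates for a coin's heads probability θ, with a made-up dataset of 7 heads and 3 tails and a Beta(2, 2) prior; a grid search over θ stands in for the closed-form answers:

```python
import numpy as np

heads, tails = 7, 3        # hypothetical data D
a, b = 2.0, 2.0            # Beta(a, b) prior on theta (a hypothetical choice)

thetas = np.linspace(1e-6, 1 - 1e-6, 10001)
log_lik = heads * np.log(thetas) + tails * np.log(1 - thetas)         # log Pr(D | h)
log_prior = (a - 1) * np.log(thetas) + (b - 1) * np.log(1 - thetas)   # log Pr(h)

theta_mle = thetas[np.argmax(log_lik)]               # argmax_h Pr(D | h)
theta_map = thetas[np.argmax(log_lik + log_prior)]   # argmax_h Pr(D | h) Pr(h)
print(theta_mle, theta_map)   # ~0.70 (MLE) vs ~0.67 (MAP, pulled toward the prior mean 0.5)
```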