1 Statistical Learning
Chapter 20 of AIMA, KAIST CS570 lecture note. Based on AIMA slides, Jahwan Kim's slides, and Duda, Hart & Stork's slides.

2 Statistical Learning
We view learning as a form of uncertain reasoning from observations.

3 Outline
Bayesian Learning: Bayesian inference, MAP and ML, Naïve Bayes method
Parameter Learning: examples, regression and LMS
Learning probability distributions: parametric methods, non-parametric methods

4 Bayesian Learning 1
View learning as Bayesian updating of a probability distribution over the hypothesis space. H is the hypothesis variable, with possible values h1, …, hn. Let d = (d1, …, dN) be the observed data; often (in fact, almost always) an i.i.d. assumption is made. Let X denote the prediction. In Bayesian learning we compute the probability of each hypothesis given the data and predict on that basis: predictions are made using all hypotheses, not just the single best one. Learning in the Bayesian setting is thus reduced to probabilistic inference.

5 Bayesian Learning 2
The probability that the prediction is X, given observed data d, is
P(X|d) = Σi P(X|d, hi) P(hi|d) = Σi P(X|hi) P(hi|d).
The prediction is a weighted average over the predictions of the individual hypotheses; the hypotheses act as intermediaries between the data and the predictions. This requires computing P(hi|d) for all i, which is usually intractable.

6 Bayesian Learning Basics: Terms
P(hi) is called the (hypothesis) prior. We can embed knowledge by means of the prior; it also controls the complexity of the model.
P(hi|d) is called the posterior (or a posteriori) probability. By Bayes' rule, P(hi|d) ∝ P(d|hi)P(hi).
P(d|hi) is called the likelihood of the data. Under the i.i.d. assumption, P(d|hi) = Πj P(dj|hi).
The hypothesis hMAP for which the posterior P(hi|d) is maximal is called the maximum a posteriori (or MAP) hypothesis.

7 Candy Example
Two flavors of candy, cherry and lime, are wrapped in the same opaque wrapper (we cannot see inside). Candy is sold in very large bags, of which there are known to be five kinds:
h1: 100% cherry
h2: 75% cherry + 25% lime
h3: 50% cherry + 50% lime
h4: 25% cherry + 75% lime
h5: 100% lime
The priors P(h1), …, P(h5) are known: 0.1, 0.2, 0.4, 0.2, 0.1.
Suppose that from a bag we take N pieces of candy and all of them are lime (data dN). What kind of bag is it? What flavor will the next candy be?

8 Candy Example: Posterior Probability of Hypotheses
P(h1|dN) ∝ P(dN|h1)P(h1) = 0,
P(h2|dN) ∝ P(dN|h2)P(h2) = 0.2 (0.25)^N,
P(h3|dN) ∝ P(dN|h3)P(h3) = 0.4 (0.5)^N,
P(h4|dN) ∝ P(dN|h4)P(h4) = 0.2 (0.75)^N,
P(h5|dN) ∝ P(dN|h5)P(h5) = P(h5) = 0.1.
Normalize these so that they sum to 1.
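A minimal Python sketch of slides 7-9: it computes the normalized posteriors P(hi|dN) and the Bayesian prediction P(next candy is lime|dN) = Σi P(lime|hi) P(hi|dN). The variable and function names are illustrative, not from the slides.

    priors    = [0.1, 0.2, 0.4, 0.2, 0.1]     # P(h1) ... P(h5)
    lime_prob = [0.0, 0.25, 0.5, 0.75, 1.0]   # P(lime | hi) for each bag type

    def posteriors(n):
        """P(hi | dN) after observing N = n lime candies in a row."""
        unnorm = [p * q**n for p, q in zip(priors, lime_prob)]
        total = sum(unnorm)
        return [u / total for u in unnorm]

    def p_next_lime(n):
        """P(next candy is lime | dN) = sum_i P(lime | hi) * P(hi | dN)."""
        return sum(q * p for q, p in zip(lime_prob, posteriors(n)))

    for n in range(11):
        print(n, [round(p, 3) for p in posteriors(n)], round(p_next_lime(n), 3))

For N = 0 this reproduces the priors and a prediction of 0.5; as N grows, the posterior mass shifts toward h5 and the prediction approaches 1.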

9 Candy Example: Prediction Probability
(figure slide)

10 Maximum a Posteriori (MAP) Learning
Since calculating the exact Bayesian prediction is often impractical, we approximate it with the MAP hypothesis: P(X|d) ≈ P(X|hMAP). That is, we make predictions with the single most probable hypothesis. Summing over the hypothesis space is often intractable; instead of a large summation (or integration), an optimization problem is solved. For deterministic hypotheses, P(d|hi) is 1 if hi is consistent with the data and 0 otherwise, so MAP selects the simplest consistent hypothesis (cf. Ockham's razor). As more data arrive, the true hypothesis eventually dominates the Bayesian prediction.

11 MAP Approximation: the MDL Principle
Since P(hi|d) ∝ P(d|hi)P(hi), instead of maximizing P(hi|d) we may maximize P(d|hi)P(hi), or equivalently minimize −log P(d|hi)P(hi) = −log P(d|hi) − log P(hi). We can interpret this as choosing hi to minimize the number of bits required to encode the hypothesis hi plus the data d under that hypothesis. The principle of minimizing code length (under some pre-determined coding scheme) is called the minimum description length (or MDL) principle. MDL is used in a wide range of practical machine learning applications.
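In symbols (a short restatement in LaTeX, not copied from the slide), the MAP choice and the code-length view coincide:

    h_{\mathrm{MAP}}
      = \arg\max_{h_i} P(d \mid h_i)\,P(h_i)
      = \arg\min_{h_i} \bigl[\, -\log_2 P(d \mid h_i) - \log_2 P(h_i) \,\bigr],

where −log2 P(hi) is the number of bits needed to encode the hypothesis and −log2 P(d|hi) the additional bits needed to encode the data given that hypothesis.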

12 Maximum Likelihood Approximation
Assume furthermore that the P(hi) are all equal, i.e., assume a uniform prior. This is reasonable when there is no reason to prefer one hypothesis over another a priori; for large data sets the prior becomes irrelevant anyway. To obtain the MAP hypothesis it then suffices to maximize P(d|hi), the likelihood; the maximizer is the maximum likelihood hypothesis hML. In other words, MAP with a uniform prior is equivalent to ML. ML is the standard statistical learning method: simply find the best fit to the data.

13 Naïve Bayes Method
In the Naïve Bayes method, the attributes (the components of the observed data) are assumed to be conditionally independent given the class. Despite the naivety of this assumption, it works well for roughly two thirds of real-world problems. Goal: predict the class C, given the observed attribute values Xi = xi. By the independence assumption,
P(C|x1, …, xn) ∝ P(C) Πi P(xi|C),
and we choose the most likely class. Merits of Naïve Bayes: it scales well (no search is required), it is robust against noisy data, and it gives probabilistic predictions.
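A minimal Python sketch of the method for discrete attributes, with add-one smoothing (the +2 in the denominator assumes binary attributes); the data set and names are illustrative, not from the slides.

    import math
    from collections import Counter, defaultdict

    def train_nb(examples, labels):
        """Estimate class counts and per-attribute value counts."""
        class_counts = Counter(labels)
        attr_counts = defaultdict(Counter)          # key: (attribute index, class)
        for x, c in zip(examples, labels):
            for i, v in enumerate(x):
                attr_counts[(i, c)][v] += 1
        return class_counts, attr_counts

    def predict_nb(x, class_counts, attr_counts):
        """Pick argmax_C  log P(C) + sum_i log P(x_i | C), with add-one smoothing."""
        n = sum(class_counts.values())
        best_class, best_score = None, float("-inf")
        for c, n_c in class_counts.items():
            score = math.log(n_c / n)
            for i, v in enumerate(x):
                score += math.log((attr_counts[(i, c)][v] + 1) / (n_c + 2))
            if score > best_score:
                best_class, best_score = c, score
        return best_class

    # Illustrative usage with two boolean attributes.
    X = [(1, 0), (1, 1), (0, 1), (0, 0)]
    y = ["spam", "spam", "ham", "ham"]
    print(predict_nb((1, 0), *train_nb(X, y)))      # -> "spam"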

14 Learning Curve on the Restaurant Problem
(figure slide)

15 Learning with Data: Parameter Learning
Introduce a parametric probability model with parameter θ; the hypotheses hθ are then parameterized. In the simplest case θ is a single scalar; in more complex cases θ consists of many components. Using the data d, we estimate the parameter θ.

16 ML Parameter Learning Examples: Discrete Case
A bag of candy whose lime-cherry proportions are completely unknown. Here the hypotheses are parameterized by the probability θ of cherry:
P(d|hθ) = Πj P(dj|hθ) = θ^c (1−θ)^ℓ,
where c and ℓ are the numbers of cherry and lime candies observed. To find hθ, maximize P(d|hθ).
Now suppose two wrappers, green and red, are selected according to some unknown conditional distribution depending on the flavor. The model has three parameters: θ = P(F=cherry), θ1 = P(W=red|F=cherry), θ2 = P(W=red|F=lime), and
P(d|hΘ) = θ^c (1−θ)^ℓ · θ1^rc (1−θ1)^gc · θ2^rl (1−θ2)^gl,
where rc, gc, rl, gl count the (red, cherry), (green, cherry), (red, lime), (green, lime) combinations observed. To find hΘ, maximize P(d|hΘ).
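As a sanity check on the first model (a standard derivation, not copied from the slide), with c cherry and ℓ lime candies observed:

    \log P(d \mid h_\theta) = c \log\theta + \ell \log(1-\theta),
    \qquad
    \frac{d}{d\theta}\log P(d \mid h_\theta)
      = \frac{c}{\theta} - \frac{\ell}{1-\theta} = 0
    \;\Longrightarrow\;
    \hat\theta = \frac{c}{c+\ell}.

The same argument applied to the wrapper counts gives θ̂1 = rc/(rc+gc) and θ̂2 = rl/(rl+gl).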

17 ML Parameter Learning Example: Continuous Case, Single-Variable Gaussian
Gaussian pdf on a single variable:
P(x) = (1/(σ√(2π))) exp(−(x−μ)²/(2σ²)).
Suppose x1, …, xN are observed. The log likelihood is
L(μ, σ) = Σj log P(xj) = −N log(σ√(2π)) − Σj (xj−μ)²/(2σ²).
We want to find the μ and σ that maximize this, so we look for the point where the gradient is zero.

18 ML Parameter Learning Example: Continuous Case, Single-Variable Gaussian (cont.)
Setting the gradient to zero and solving, we find
μ = (1/N) Σj xj,   σ² = (1/N) Σj (xj − μ)²,
i.e., the sample mean and the sample variance. This verifies that ML agrees with our common sense.
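A minimal Python sketch of this closed-form solution, assuming the sample-mean/sample-variance result above; the randomly generated data are only for illustration.

    import math
    import random

    def gaussian_ml(xs):
        """Return the ML estimates (mu, sigma) for data drawn from N(mu, sigma^2)."""
        n = len(xs)
        mu = sum(xs) / n
        sigma = math.sqrt(sum((x - mu) ** 2 for x in xs) / n)   # divide by n, not n-1
        return mu, sigma

    data = [random.gauss(3.0, 2.0) for _ in range(10_000)]
    print(gaussian_ml(data))          # should be close to (3.0, 2.0)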

19 ML Parameter Learning Example: Continuous Case, Linear Regression
Assume Y has a Gaussian distribution whose mean depends linearly on X and whose standard deviation is fixed. Maximizing the likelihood is then equivalent to minimizing
Σj (yj − (θ1 xj + θ2))²,
the sum of squared errors. Thus in this case ML coincides with Least Mean-Square (LMS) fitting.
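A minimal Python sketch of the ML = least-squares correspondence for a line y ≈ a·x + b, using the closed-form normal equations rather than an iterative LMS update; the data and names are illustrative.

    def fit_line(xs, ys):
        """Minimize sum_j (y_j - (a*x_j + b))^2 via the normal equations."""
        n = len(xs)
        mean_x = sum(xs) / n
        mean_y = sum(ys) / n
        sxx = sum((x - mean_x) ** 2 for x in xs)
        sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        a = sxy / sxx
        b = mean_y - a * mean_x
        return a, b

    xs = [0.0, 1.0, 2.0, 3.0, 4.0]
    ys = [1.1, 2.9, 5.2, 6.8, 9.1]          # roughly y = 2x + 1
    print(fit_line(xs, ys))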

20 Bayesian Parameter Learning
The ML approximation is deficient when data are scarce: e.g., after observing a single cherry candy, the ML estimate is 100% cherry. In Bayesian parameter learning we instead place a hypothesis prior over the possible values of the parameters and update this distribution as data arrive.

21 Bayesian Learning of Parameter θ
The posterior density becomes more peaked as the number of samples increases. Despite different prior distributions, the posterior densities become virtually identical given a large set of data.

22 Bayesian Parameter Learning Example, Beta Distribution: Candy Example Revisited
In the Bayesian view, θ is the value of a random variable Θ, and P(Θ) is a continuous distribution. A uniform density is one candidate; another possibility is to use beta distributions. A beta distribution has two hyperparameters a and b and is given by
b_{a,b}(θ) = α θ^(a−1) (1−θ)^(b−1),
where α is a normalizing constant. Its mean is a/(a+b), so larger a suggests θ is closer to 1 than to 0. The density is more peaked when a+b is large, suggesting greater certainty about the value of Θ.
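For reference (a standard identity, not shown on the slide), the normalizing constant α can be written with Gamma functions:

    b_{a,b}(\theta)
      = \frac{\Gamma(a+b)}{\Gamma(a)\,\Gamma(b)}\,\theta^{a-1}(1-\theta)^{b-1},
    \qquad
    \mathbb{E}[\Theta] = \frac{a}{a+b}.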

23 Beta Distribution
b_{a,b}(θ) = α θ^(a−1) (1−θ)^(b−1)
(figure slide)

24 Bayesian Parameter Learning Example: Property of the Beta Distribution
If Θ has a prior b_{a,b}, then the posterior distribution for Θ is also a beta distribution:
P(θ|d=cherry) = α P(d=cherry|θ) P(θ) = α′ θ · b_{a,b}(θ) = α′ θ · θ^(a−1) (1−θ)^(b−1) = α′ θ^a (1−θ)^(b−1) = b_{a+1,b}(θ).
The beta family is therefore called the conjugate prior for the family of distributions over a Boolean variable. The hyperparameters a and b act as virtual counts: a prior b_{a,b} behaves as if we started from the uniform prior b_{1,1} and had already seen a−1 cherry and b−1 lime candies.
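A minimal Python sketch of this conjugate update, assuming the virtual-count interpretation above: each cherry observation increments a, each lime observation increments b. The observation strings are illustrative.

    def update_beta(a, b, observations):
        """Return the posterior hyperparameters after a sequence of candies."""
        for obs in observations:
            if obs == "cherry":
                a += 1
            else:            # "lime"
                b += 1
        return a, b

    a, b = 1, 1                                  # uniform prior b_{1,1}
    a, b = update_beta(a, b, ["cherry", "lime", "lime", "cherry", "cherry"])
    print(a, b, a / (a + b))                     # posterior mean estimate of theta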

25 Density Estimation
Most classical parametric densities are unimodal (have a single local maximum), whereas many practical problems involve multi-modal densities. Nonparametric procedures can be used with arbitrary distributions and without assuming that the forms of the underlying densities are known. There are two types of nonparametric methods: estimating the class-conditional densities P(x | ωj), or bypassing the densities and estimating the a posteriori probabilities directly.

26 Density Estimation: Basic Idea
The probability that a vector x falls in a region R is
P = ∫_R p(x′) dx′.   (1)
P is a smoothed (or averaged) version of the density function p(x). If we have a sample of size n, the probability that exactly k of the n points fall in R is given by the binomial law
P_k = C(n, k) P^k (1−P)^(n−k),   (2)
and the expected value of k is E(k) = nP.   (3)

27 Density Estimation: Basic Idea (cont.)
The ML estimate of P, obtained by maximizing P_k with respect to P, is P̂ = k/n. Therefore the ratio k/n is a good estimate of the probability P and hence of the density function p. If p(x) is continuous and the region R is so small that p does not vary significantly within it, we can write
∫_R p(x′) dx′ ≅ p(x) V,   (4)
where x′ is a point within R and V is the volume enclosed by R. Combining equations (1), (3) and (4) yields
p(x) ≅ k / (nV).

28 Parzen Windows
The Parzen-window approach to density estimation assumes that the region R_n is a d-dimensional hypercube (with edge length h_n, so its volume is V_n = h_n^d). Define a window function φ such that φ((x − x_i)/h_n) equals 1 if x_i falls within the hypercube of volume V_n centered at x, and 0 otherwise.

29 Parzen Windows (cont.)
The number of samples falling in this hypercube is
k_n = Σ_{i=1..n} φ((x − x_i)/h_n).
Substituting k_n into the basic estimate p_n(x) = (k_n/n)/V_n, we obtain
p_n(x) = (1/n) Σ_{i=1..n} (1/V_n) φ((x − x_i)/h_n).
p_n(x) estimates p(x) as an average of functions of x and the samples x_i (i = 1, …, n). These window functions φ can be quite general.

30 Illustration of the Parzen Window
The behavior of the Parzen-window method in the case where p(x) ~ N(0,1). Let φ(u) = (1/√(2π)) exp(−u²/2) and h_n = h_1/√n (n ≥ 1), where h_1 is a known parameter. Then
p_n(x) = (1/n) Σ_i (1/h_n) φ((x − x_i)/h_n)
is an average of normal densities centered at the samples x_i.
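A minimal one-dimensional Python sketch of this estimator with the Gaussian window and h_n = h_1/√n; the sample data and names are illustrative.

    import math
    import random

    def phi(u):
        """Standard normal window function."""
        return math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)

    def parzen_estimate(x, samples, h1=1.0):
        """Estimate p(x) from samples, with h_n = h1 / sqrt(n) as on slide 30."""
        n = len(samples)
        h_n = h1 / math.sqrt(n)
        return sum(phi((x - xi) / h_n) for xi in samples) / (n * h_n)

    samples = [random.gauss(0.0, 1.0) for _ in range(1000)]
    print(parzen_estimate(0.0, samples))   # should approach 1/sqrt(2*pi) ≈ 0.399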

31 Numerical Results
For n = 1, the estimate is a single window function centered at the one sample. For n = 10 and h = 0.1, the contributions of the individual samples are clearly observable.

32 (figure slide)

33 (figure slide)

34 Analogous results are obtained in two dimensions, as illustrated.
(figure slide)

35 Case where p(x) = λ1 U(a,b) + λ2 T(c,d)
The unknown density is a mixture of a uniform and a triangle density.
(figure slide)

36 (figure slide)

37 Summary
Full Bayesian learning gives the best possible predictions but is intractable. MAP learning balances model complexity against accuracy on the training data. The ML approximation assumes a uniform prior and works well for large data sets. Parameter estimation is what is most often used in practice.

