CS 416 Artificial Intelligence Lecture 24 Statistical Learning Chapter 20
AI: Creating rational agents The pursuit of autonomous, rational agents It’s all about search Varying amounts of model information tree searching (informed/uninformed) simulated annealing value/policy iteration Searching for an explanation of observations Used to develop a model
Searching for explanation of observations If I can explain observations… can I predict the future? Can I explain why ten coin tosses are 6 H and 4 T? Can I predict the 11th coin toss?
Running example: Candy Surprise Candy Comes in two flavors cherry (yum) lime (yuk) All candy is wrapped in same opaque wrapper Candy is packaged in large bags containing five different allocations of cherry and lime
Statistics Given a bag of candy, what distribution of flavors will it have? Let H be the random variable corresponding to your hypothesis H1 = all cherry, H2 = all lime, H3 = 50/50 cherry/lime As you open pieces of candy, let each observation of data: D1, D2, D3, … be either cherry or lime D1 = cherry, D2 = cherry, D3 = lime, … Predict the flavor of the next piece of candy If the data caused you to believe H1 was correct, you’d pick cherry
Bayesian Learning Use available data to calculate the probability of each hypothesis and make a prediction Because each hypothesis has its own likelihood, we use all their relative likelihoods when making a prediction Probabilistic inference using Bayes’ rule: $P(h_i \mid d) = \alpha P(d \mid h_i) P(h_i)$, where $\alpha$ is a normalizing constant, $P(d \mid h_i)$ is the likelihood, and $P(h_i)$ is the hypothesis prior The probability of hypothesis $h_i$ being active given you observed sequence d is proportional to the probability of seeing data sequence d generated by hypothesis $h_i$ multiplied by the prior probability of hypothesis $h_i$ being active
Prediction of an unknown quantity X $P(X \mid d) = \sum_i P(X \mid h_i) P(h_i \mid d)$ The likelihood of X happening given d has already happened is a function of how much each hypothesis predicts X can happen given d has happened Even though a hypothesis has a high prediction that X will happen, this prediction will be discounted if the hypothesis itself is unlikely to be true given the observation of d
Details of Bayes’ rule All observations within d are independent and identically distributed (i.i.d.) The probability of a hypothesis explaining a series of observations d is the product of the probabilities of explaining each component: $P(d \mid h_i) = \prod_j P(d_j \mid h_i)$
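The i.i.d. product can be sketched in a few lines. This is a minimal illustration, not part of the original slides: the function name `sequence_likelihood` and the encoding of a hypothesis as a single cherry probability `p_cherry` are assumptions made for the example.

```python
from math import prod

def sequence_likelihood(observations, p_cherry):
    """P(d | h) for i.i.d. observations.

    Assumption for this sketch: a hypothesis h is fully described by
    p_cherry, the probability that any single candy is cherry.
    """
    # Each observation contributes one factor; independence lets us multiply.
    return prod(p_cherry if obs == "cherry" else 1.0 - p_cherry
                for obs in observations)

d = ["cherry", "cherry", "lime"]
print(sequence_likelihood(d, 0.5))   # 0.5 * 0.5 * 0.5
print(sequence_likelihood(d, 0.75))  # 0.75 * 0.75 * 0.25
```

A hypothesis that assigns probability 0 to any observed flavor gets likelihood 0 for the whole sequence, which is why a single lime rules out "all cherry".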
Example Prior distribution across hypotheses h1 = 100% cherry, prior 0.1 h2 = 75/25 cherry/lime, prior 0.2 h3 = 50/50 cherry/lime, prior 0.4 h4 = 25/75 cherry/lime, prior 0.2 h5 = 100% lime, prior 0.1 Prediction: for a sequence d of 10 candies, $P(d \mid h_3) = (0.5)^{10}$
Example Probabilities for each hypothesis start at the prior values <.1, .2, .4, .2, .1> Probability of hypothesis h3 as 10 lime candies are observed (before normalization): $P(d \mid h_3) P(h_3) = (0.5)^{10} (0.4)$
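A rough sketch of how the five posteriors evolve as limes are unwrapped one at a time. It applies Bayes' rule sequentially, which gives the same result as the batch product because the observations are i.i.d. The names `update`, `belief`, and `history` are illustrative, and the prior is taken as <.1, .2, .4, .2, .1>.

```python
# Priors P(h_i) and per-candy lime probabilities P(lime | h_i) for h1..h5.
PRIOR = [0.1, 0.2, 0.4, 0.2, 0.1]
P_LIME = [0.0, 0.25, 0.5, 0.75, 1.0]

def update(belief, p_obs):
    """One step of Bayes' rule: multiply by the likelihood, then normalize."""
    unnormalized = [b * p for b, p in zip(belief, p_obs)]
    alpha = 1.0 / sum(unnormalized)  # normalizing constant
    return [alpha * u for u in unnormalized]

belief = PRIOR
history = [belief]            # history[n] = posterior after n limes
for _ in range(10):
    belief = update(belief, P_LIME)
    history.append(belief)

for n, b in enumerate(history):
    print(n, [round(x, 4) for x in b])
```

The printed rows show h1 dropping to 0 after the first lime and h5 taking over as more limes are seen.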
Prediction of 11th candy If we’ve observed 10 lime candies, is the 11th lime? Build a weighted sum of each hypothesis’s prediction The weighted sum can become expensive to compute Instead, use the most probable hypothesis and ignore the others MAP: the maximum a posteriori hypothesis, i.e. the hypothesis that is most probable given the observations
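To compare the two prediction strategies, here is a small sketch that computes both the full weighted sum and the MAP shortcut for the probability that the next candy is lime. The function names are hypothetical, and the prior is assumed to be <.1, .2, .4, .2, .1> over h1..h5.

```python
# P(lime | h_i) and prior P(h_i) for h1..h5 of the running example.
P_LIME = [0.0, 0.25, 0.5, 0.75, 1.0]
PRIOR = [0.1, 0.2, 0.4, 0.2, 0.1]

def posterior(n_limes):
    """P(h_i | d) after observing n_limes limes and no cherries."""
    weights = [p ** n_limes * prior for p, prior in zip(P_LIME, PRIOR)]
    total = sum(weights)
    return [w / total for w in weights]

def bayes_prediction(n_limes):
    """Weighted sum over all hypotheses: sum_i P(lime | h_i) P(h_i | d)."""
    return sum(p * w for p, w in zip(P_LIME, posterior(n_limes)))

def map_prediction(n_limes):
    """Use only the single most probable hypothesis."""
    post = posterior(n_limes)
    best = max(range(len(post)), key=lambda i: post[i])
    return P_LIME[best]

for n in (1, 2, 3, 10):
    print(n, round(bayes_prediction(n), 4), map_prediction(n))
```

With little data the two answers differ noticeably; as the limes accumulate, the posterior concentrates on h5 and the MAP answer approaches the weighted sum.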
Overfitting Remember overfitting from NN discussion? The number of hypotheses influences predictions Too many hypotheses can lead to overfitting
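Overfitting Example Say we’ve observed 3 cherry and 7 lime Consider our 5 hypotheses from before: the prediction is a weighted average of the 5 Consider having 11 hypotheses, one for each possible cherry/lime split of 10 candies (0/10, 1/9, …, 10/0) The 3/7 hypothesis matches the data exactly, so a most-probable-hypothesis prediction gives it weight 1 and all others weight 0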
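A hedged sketch of this comparison. Two things here are assumptions, since the slide states neither: the 11 hypotheses get a uniform prior (so the most probable one is simply the one with the highest likelihood), and the 11-hypothesis prediction uses only that single most probable hypothesis. Variable names are illustrative.

```python
def likelihood(p_cherry, n_cherry, n_lime):
    """P(d | h) for i.i.d. draws when h says P(cherry) = p_cherry."""
    return p_cherry ** n_cherry * (1.0 - p_cherry) ** n_lime

n_cherry, n_lime = 3, 7

# Five hypotheses from the running example, prior <.1, .2, .4, .2, .1>.
five_p = [1.0, 0.75, 0.5, 0.25, 0.0]
five_prior = [0.1, 0.2, 0.4, 0.2, 0.1]
weights = [likelihood(p, n_cherry, n_lime) * pr
           for p, pr in zip(five_p, five_prior)]
five_post = [w / sum(weights) for w in weights]
five_prediction = sum(p * w for p, w in zip(five_p, five_post))

# Eleven hypotheses, P(cherry) = k/10 for k = 0..10, uniform prior (assumed).
eleven_p = [k / 10 for k in range(11)]
eleven_like = [likelihood(p, n_cherry, n_lime) for p in eleven_p]
best = max(range(11), key=lambda k: eleven_like[k])
map_prediction = eleven_p[best]

print("5 hypotheses, weighted P(cherry):", round(five_prediction, 4))
print("11 hypotheses, MAP P(cherry):   ", map_prediction)
```

The 11-hypothesis MAP prediction reproduces the observed 3-in-10 frequency exactly, while the 5-hypothesis weighted average is pulled toward the prior.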
Learning with Data First talk about parameter learning Let’s create a hypothesis $h_q$ for candies that says the probability a cherry is drawn is q If we unwrap N candies and c are cherry, what is q? The (log) likelihood is: $L(q) = \log P(d \mid h_q) = c \log q + (N - c) \log(1 - q)$
Learning with Data We want to find the q that maximizes the log-likelihood Differentiate L with respect to q and set to 0: $\frac{dL}{dq} = \frac{c}{q} - \frac{N - c}{1 - q} = 0$, which gives $q = c/N$ In general this solution process may not be easy to carry out in closed form, and iterative numerical methods may be used
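As a sanity check, the maximization can also be done numerically. The sketch below assumes the standard Bernoulli log-likelihood L(q) = c log q + (N - c) log(1 - q) and finds the zero of its derivative by bisection; the function names and the tolerance are illustrative choices, not from the lecture.

```python
from math import log

def log_likelihood(q, n, c):
    """L(q) = c log q + (n - c) log(1 - q), valid for 0 < q < 1."""
    return c * log(q) + (n - c) * log(1.0 - q)

def d_log_likelihood(q, n, c):
    """dL/dq = c/q - (n - c)/(1 - q); strictly decreasing on (0, 1)."""
    return c / q - (n - c) / (1.0 - q)

def mle_bisection(n, c, tol=1e-10):
    """Find q where dL/dq = 0. Assumes 0 < c < n so the root is interior."""
    lo, hi = 1e-9, 1.0 - 1e-9
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if d_log_likelihood(mid, n, c) > 0:
            lo = mid   # still climbing: the maximum is to the right
        else:
            hi = mid   # past the peak: the maximum is to the left
    return (lo + hi) / 2.0

n, c = 10, 3
print("numerical q:", round(mle_bisection(n, c), 6))
print("closed form c/N:", c / n)
```

For this one-parameter model the numerical answer agrees with c/N; iterative methods matter for models where no closed form exists.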