ITCS 3153 Artificial Intelligence
Lecture 24: Statistical Learning (Chapter 20)
AI: Creating rational agents
The pursuit of autonomous, rational agents. It's all about search:
–Varying amounts of model information: tree searching (informed/uninformed), simulated annealing, value/policy iteration
–Searching for an explanation of observations: used to develop a model
Searching for an explanation of observations
If I can explain observations, can I predict the future?
Can I explain why ten coin tosses came up 6 heads and 4 tails?
–Can I predict the 11th coin toss?
Running example: Surprise Candy
Comes in two flavors:
–cherry (yum)
–lime (yuck)
All candy is wrapped in the same opaque wrapper.
Candy is packaged in large bags, each containing one of five different allocations of cherry and lime.
Statistics
Given a bag of candy, what distribution of flavors will it have?
Let H be the random variable corresponding to your hypothesis:
–H1 = all cherry, H2 = all lime, H3 = 50/50 cherry/lime
As you open pieces of candy, let each observation of data D1, D2, D3, ... be either cherry or lime:
–D1 = cherry, D2 = cherry, D3 = lime, ...
Predict the flavor of the next piece of candy:
–If the data led you to believe H1 was correct, you'd predict cherry.
Bayesian Learning
Use the available data to calculate the probability of each hypothesis and make a prediction.
Because each hypothesis assigns its own likelihood to the data, we use all of their relative likelihoods when making a prediction.
Probabilistic inference using Bayes' rule:
–P(h_i | d) = α P(d | h_i) P(h_i), i.e. posterior = normalization × likelihood × hypothesis prior
–The probability that hypothesis h_i holds given the observed sequence d is proportional to the probability of seeing data sequence d under hypothesis h_i, multiplied by the prior probability of hypothesis h_i.
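A minimal sketch of this update in Python, assuming a discrete hypothesis space; the priors and likelihood values below are invented for illustration:

def posterior(priors, likelihoods):
    # Return P(h_i | d) given priors P(h_i) and likelihoods P(d | h_i).
    unnormalized = [p * l for p, l in zip(priors, likelihoods)]
    alpha = 1.0 / sum(unnormalized)   # normalization constant
    return [alpha * u for u in unnormalized]

# Two hypotheses with priors 0.7 and 0.3, where the observed data d is
# twice as likely under the second one.
print(posterior([0.7, 0.3], [0.1, 0.2]))   # -> [0.538..., 0.461...]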
Prediction of an unknown quantity X
The probability of X happening, given that d has already been observed, is a weighted combination of how strongly each hypothesis predicts X:
–P(X | d) = Σ_i P(X | h_i) P(h_i | d)
–Even if a hypothesis assigns X a high probability, that prediction is discounted if the hypothesis itself is unlikely to be true given the observation of d.
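A tiny numerical illustration of that discounting; the two hypotheses and all numbers here are invented:

# Hypothesis A strongly predicts X but is itself unlikely given d,
# so it contributes little to the overall prediction.
p_x_given_h = {"A": 0.9, "B": 0.2}   # P(X | h_i)
p_h_given_d = {"A": 0.1, "B": 0.9}   # P(h_i | d)

p_x_given_d = sum(p_x_given_h[h] * p_h_given_d[h] for h in p_x_given_h)
print(p_x_given_d)                   # 0.9*0.1 + 0.2*0.9 = 0.27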
Details of Bayes' rule
All observations within d are assumed to be:
–independent
–identically distributed
So the probability of a hypothesis generating a series of observations d:
–is the product of the probabilities of generating each individual observation: P(d | h_i) = Π_j P(d_j | h_i)
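A short sketch of that product, assuming a simple i.i.d. flavor hypothesis (the 75% cherry figure is just an illustrative choice):

import math

def sequence_likelihood(observations, p_cherry):
    # P(d | h) for a flavor sequence under an i.i.d. hypothesis.
    per_obs = [p_cherry if o == "cherry" else 1.0 - p_cherry for o in observations]
    return math.prod(per_obs)

d = ["cherry", "cherry", "lime"]
print(sequence_likelihood(d, 0.75))   # 0.75 * 0.75 * 0.25 ≈ 0.14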
Example
Prior distribution across hypotheses:
–h1 = 100% cherry, prior 0.1
–h2 = 75/25 cherry/lime, prior 0.2
–h3 = 50/50 cherry/lime, prior 0.4
–h4 = 25/75 cherry/lime, prior 0.2
–h5 = 100% lime, prior 0.1
Prediction: if the first 10 candies unwrapped are all lime, the likelihood of the data under h3 is
–P(d | h3) = (0.5)^10
Example
The probability of each hypothesis starts at its prior value.
Posterior probability of hypothesis h3 as 10 lime candies are observed:
–P(d | h3) P(h3) = (0.5)^10 × 0.4, then normalize across all five hypotheses
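A sketch of how all five posteriors evolve as the limes are unwrapped, using the priors and flavor mixes above (the loop and printout are mine, not from the slides):

priors = [0.1, 0.2, 0.4, 0.2, 0.1]     # P(h1)..P(h5)
p_lime = [0.0, 0.25, 0.5, 0.75, 1.0]   # P(lime | h_i)

posteriors = priors[:]
for n in range(1, 11):                 # unwrap 10 lime candies, one at a time
    unnormalized = [post * p for post, p in zip(posteriors, p_lime)]
    alpha = 1.0 / sum(unnormalized)
    posteriors = [alpha * u for u in unnormalized]
    print(n, [round(p, 3) for p in posteriors])
# After a few limes, h5 (all lime) dominates and h1 (all cherry) is ruled out.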
Prediction of the 11th candy
If we've observed 10 lime candies, is the 11th lime?
Build a weighted sum of each hypothesis's prediction:
–P(D11 = lime | d) = Σ_i P(D11 = lime | h_i) P(h_i | d)
–the first factor comes from the hypothesis, the second from the observations
The weighted sum can become expensive to compute:
–Instead, use only the most probable hypothesis and ignore the others
–MAP: maximum a posteriori
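A sketch comparing the full weighted-sum prediction with the MAP shortcut for this example (same priors and mixes as above; the code itself is mine):

priors = [0.1, 0.2, 0.4, 0.2, 0.1]
p_lime = [0.0, 0.25, 0.5, 0.75, 1.0]

# Posterior after 10 limes: P(h_i | d) is proportional to P(lime | h_i)^10 * P(h_i)
unnorm = [prior * p ** 10 for prior, p in zip(priors, p_lime)]
post = [u / sum(unnorm) for u in unnorm]

# Full Bayesian prediction: weighted sum over all hypotheses
p_bayes = sum(pl * po for pl, po in zip(p_lime, post))

# MAP prediction: keep only the single most probable hypothesis (h5 here)
map_i = max(range(len(post)), key=lambda i: post[i])
p_map = p_lime[map_i]

print(round(p_bayes, 3), p_map)   # ≈ 0.973 vs exactly 1.0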
Overfitting
Remember overfitting from the neural network discussion?
The number of hypotheses influences predictions.
Too many hypotheses can lead to overfitting.
Overfitting Example
Say we've observed 3 cherry and 7 lime.
Consider our 5 hypotheses from before:
–the prediction is a weighted average over the 5
Now consider having 11 hypotheses, one for each possible cherry/lime proportion (0/10 through 10/10):
–the 3/7 hypothesis dominates (its posterior approaches 1) while the others fall toward 0, so the prediction hugs the observed data (see the sketch below)
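A sketch of the comparison, assuming uniform priors over the 11 finer-grained hypotheses (that choice, and the helper below, are mine):

def predict_cherry(p_cherry_per_h, priors, n_cherry, n_lime):
    # Posterior-weighted probability that the next candy is cherry.
    unnorm = [prior * p ** n_cherry * (1 - p) ** n_lime
              for p, prior in zip(p_cherry_per_h, priors)]
    post = [u / sum(unnorm) for u in unnorm]
    return sum(p * po for p, po in zip(p_cherry_per_h, post))

five = [1.0, 0.75, 0.5, 0.25, 0.0]     # cherry proportions of h1..h5
eleven = [i / 10 for i in range(11)]   # 0.0, 0.1, ..., 1.0

print(predict_cherry(five, [0.1, 0.2, 0.4, 0.2, 0.1], 3, 7))   # ≈ 0.37
print(predict_cherry(eleven, [1 / 11] * 11, 3, 7))             # ≈ 0.33, hugging the 3/10 data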
Learning with Data
First, let's talk about parameter learning.
Create a hypothesis h_θ for candies that says the probability a cherry is drawn is θ:
–If we unwrap N candies and c are cherry, what is the likelihood of the data? P(d | h_θ) = θ^c (1 - θ)^(N - c)
–The log likelihood is: L = log P(d | h_θ) = c log θ + (N - c) log(1 - θ)
Learning with Data
We want to find the θ that maximizes the log likelihood:
–differentiate L with respect to θ and set the result to 0
–dL/dθ = c/θ - (N - c)/(1 - θ) = 0, which gives θ = c/N
In general this maximization may not be easy to solve analytically, and iterative, numerical methods may be needed.
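A quick numerical sanity check of that result (the counts and grid search below are made up for illustration):

import math

def log_likelihood(theta, c, n):
    # L(θ) = c log θ + (N - c) log(1 - θ)
    return c * math.log(theta) + (n - c) * math.log(1 - theta)

c, n = 3, 10                                  # say 3 cherry out of 10 candies
thetas = [i / 100 for i in range(1, 100)]     # grid over (0, 1)
best = max(thetas, key=lambda t: log_likelihood(t, c, n))
print(best)                                   # -> 0.3, i.e. c/N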