ITCS 3153 Artificial Intelligence, Lecture 24: Statistical Learning (Chapter 20)


AI: Creating rational agents
The pursuit of autonomous, rational agents. It's all about search:
– Varying amounts of model information: tree searching (informed/uninformed), simulated annealing, value/policy iteration
– Searching for an explanation of observations: used to develop a model

Searching for an explanation of observations
If I can explain observations, can I predict the future?
– Can I explain why ten coin tosses are 6 H and 4 T?
– Can I predict the 11th coin toss?

Running example: Candy
Surprise Candy:
– Comes in two flavors: cherry (yum) and lime (yuk)
– All candy is wrapped in the same opaque wrapper
– Candy is packaged in large bags containing five different allocations of cherry and lime

Statistics
Given a bag of candy, what distribution of flavors will it have?
– Let H be the random variable corresponding to your hypothesis: H1 = all cherry, H2 = all lime, H3 = 50/50 cherry/lime
– As you open pieces of candy, let each observation of data D1, D2, D3, ... be either cherry or lime: D1 = cherry, D2 = cherry, D3 = lime, ...
– Predict the flavor of the next piece of candy: if the data caused you to believe H1 was correct, you'd pick cherry

Bayesian Learning
Use the available data to calculate the probability of each hypothesis and make a prediction.
– Because each hypothesis has its own likelihood, we use all their relative likelihoods when making a prediction.
– Probabilistic inference using Bayes' rule: P(h_i | d) = α P(d | h_i) P(h_i), where P(h_i) is the hypothesis prior and P(d | h_i) is the likelihood.
– The probability that hypothesis h_i is correct, given that you observed the data sequence d, is proportional to the probability of seeing data sequence d generated by hypothesis h_i, multiplied by the prior probability of hypothesis h_i.
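For reference, the normalization constant α in the rule above is just whatever makes the posteriors sum to one over all hypotheses: α = 1 / Σ_i P(d | h_i) P(h_i).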

Prediction of an unknown quantity X
– The probability that X happens, given that d has already been observed, is a function of how strongly each hypothesis predicts X given d.
– Even if a hypothesis assigns a high probability to X, that prediction is discounted if the hypothesis itself is unlikely to be true given the observation of d.
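The equation this slide paraphrases is the standard Bayesian prediction rule from AIMA Chapter 20: the prediction is a posterior-weighted average over the hypotheses,

P(X | d) = Σ_i P(X | h_i) P(h_i | d).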

Details of Bayes' rule
– All observations within d are independent and identically distributed.
– The probability of a hypothesis explaining a series of observations d is therefore the product of the probabilities of explaining each individual observation.
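Concretely, under the i.i.d. assumption the likelihood factors into a product over the individual candies:

P(d | h_i) = Π_j P(d_j | h_i).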

Example
Prior distribution across the hypotheses:
– h1 = 100% cherry, P(h1) = 0.1
– h2 = 75/25 cherry/lime, P(h2) = 0.2
– h3 = 50/50 cherry/lime, P(h3) = 0.4
– h4 = 25/75 cherry/lime, P(h4) = 0.2
– h5 = 100% lime, P(h5) = 0.1
Prediction: for a sequence d of ten candies, P(d | h3) = (0.5)^10.
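For concreteness, (0.5)^10 = 1/1024 ≈ 0.00098: under h3 every specific ten-candy sequence is equally likely.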

Example
– The probability of each hypothesis starts at its prior value.
– Posterior weight of hypothesis h3 as 10 lime candies are observed: P(d | h3) P(h3) = (0.5)^10 × 0.4 (before normalization); a sketch of the full update appears below.
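A minimal Python sketch of this posterior update, assuming the priors listed above and an all-lime observation sequence; the per-candy lime probabilities for h1..h5 follow from the stated flavor allocations (0, 0.25, 0.5, 0.75, 1.0), and the variable names are mine, not from the slides:

    # Posterior over the five candy hypotheses after each lime observation.
    priors = [0.1, 0.2, 0.4, 0.2, 0.1]      # P(h1) .. P(h5)
    p_lime = [0.0, 0.25, 0.5, 0.75, 1.0]    # P(lime | h_i)

    posterior = priors[:]
    for n in range(1, 11):                  # observe 10 lime candies in a row
        unnormalized = [post * p for post, p in zip(posterior, p_lime)]
        total = sum(unnormalized)
        posterior = [u / total for u in unnormalized]
        print(n, [round(p, 4) for p in posterior])

After the first lime the posterior is already (0, 0.1, 0.4, 0.3, 0.2), and by the tenth lime roughly 0.9 of the weight has shifted to h5 (all lime).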

Prediction of the 11th candy
If we've observed 10 lime candies, is the 11th lime?
– Build a weighted sum of each hypothesis's prediction: the P(X | h_i) terms come from the hypotheses, and the P(h_i | d) weights come from the observations (see the sketch below).
– The weighted sum can become expensive to compute.
– Instead, use only the most probable hypothesis and ignore the others: MAP, maximum a posteriori.
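Continuing the sketch above (same assumed priors and per-hypothesis lime probabilities), the full Bayesian prediction is a posterior-weighted sum, while the MAP shortcut keeps only the single most probable hypothesis:

    # Bayesian vs. MAP prediction of the 11th candy after 10 lime observations.
    priors = [0.1, 0.2, 0.4, 0.2, 0.1]       # P(h1) .. P(h5)
    p_lime = [0.0, 0.25, 0.5, 0.75, 1.0]     # P(lime | h_i)

    # Posterior after 10 limes: P(h_i | d) is proportional to P(d | h_i) P(h_i).
    unnormalized = [prior * p ** 10 for prior, p in zip(priors, p_lime)]
    total = sum(unnormalized)
    posterior = [u / total for u in unnormalized]

    # Full Bayesian prediction: weighted sum over all hypotheses.
    bayes_pred = sum(post * p for post, p in zip(posterior, p_lime))

    # MAP prediction: use only the single most probable hypothesis.
    map_index = max(range(len(posterior)), key=posterior.__getitem__)
    map_pred = p_lime[map_index]

    print(f"Bayes-optimal P(11th is lime | d) = {bayes_pred:.3f}")   # about 0.97
    print(f"MAP P(11th is lime | d) = {map_pred:.3f}")               # 1.000 (h5)

Here the two answers are close because the posterior is already concentrated on h5; with fewer observations the MAP shortcut would differ more from the full weighted sum.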

Overfitting
Remember overfitting from the neural-network discussion?
– The number of hypotheses influences predictions.
– Too many hypotheses can lead to overfitting.

Overfitting Example
Say we've observed 3 cherry and 7 lime.
– Consider our 5 hypotheses from before: the prediction is a weighted average over the 5.
– Now consider having 11 hypotheses, one for each possible cherry/lime split of the 10 candies: the 3/7 hypothesis gets posterior 1 and all the others get 0, so the prediction simply mirrors the observed data.

Learning with Data
First we talk about parameter learning. Create a hypothesis h_θ for candies that says the probability a cherry is drawn is θ.
– If we unwrap N candies and c of them are cherry, what is θ?
– The (log) likelihood is given below.
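The likelihood expression on the slide did not survive extraction; for this Bernoulli model it takes the standard form

P(d | h_θ) = θ^c (1 − θ)^(N−c),

so the log-likelihood is

L(θ) = log P(d | h_θ) = Σ_j log P(d_j | h_θ) = c log θ + (N − c) log(1 − θ).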

Learning with Data
We want to find the θ that maximizes the log-likelihood:
– Differentiate L with respect to θ and set the derivative to 0 (worked out below).
– In general this maximization cannot always be done in closed form, and iterative numerical methods may be needed.
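Carrying out that differentiation for the candy model gives the familiar maximum-likelihood estimate:

dL/dθ = c/θ − (N − c)/(1 − θ) = 0  ⟹  θ = c/N,

i.e. the ML estimate of the cherry probability is simply the observed fraction of cherry candies.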