CS 416 Artificial Intelligence, Lecture 24: Statistical Learning (Chapter 20)

AI: Creating rational agents
- The pursuit of autonomous, rational agents
- It's all about search, with varying amounts of model information: tree searching (informed/uninformed), simulated annealing, value/policy iteration
- Searching for an explanation of observations, used to develop a model

Searching for an explanation of observations
- If I can explain observations, can I predict the future?
- Can I explain why ten coin tosses came up 6 heads and 4 tails?
- Can I predict the 11th coin toss?

Running example: Candy
- Surprise Candy comes in two flavors: cherry (yum) and lime (yuck)
- All candy is wrapped in the same opaque wrapper
- Candy is packaged in large bags, each containing one of five different allocations of cherry and lime

Statistics
- Given a bag of candy, what distribution of flavors will it have?
- Let H be the random variable corresponding to your hypothesis: H1 = all cherry, H2 = all lime, H3 = 50/50 cherry/lime
- As you open pieces of candy, let each observation of data D1, D2, D3, … be either cherry or lime: D1 = cherry, D2 = cherry, D3 = lime, …
- Predict the flavor of the next piece of candy: if the data caused you to believe H1 was correct, you'd pick cherry

Bayesian Learning
- Use available data to calculate the probability of each hypothesis and make a prediction
- Because each hypothesis has an independent likelihood, we use all of their relative likelihoods when making a prediction
- Probabilistic inference using Bayes' rule: P(hi | d) = α P(d | hi) P(hi)
- In words: the probability of hypothesis hi given the observed sequence d equals the likelihood of generating data sequence d under hypothesis hi, P(d | hi), multiplied by the prior probability of hypothesis hi, P(hi), scaled by a normalization constant α
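A minimal Python sketch of this update (illustrative, not part of the original slides; the function name and argument layout are my own):

    def posterior(priors, likelihoods):
        """Return P(h_i | d) for each hypothesis, given P(h_i) and P(d | h_i)."""
        unnormalized = [p * l for p, l in zip(priors, likelihoods)]
        alpha = 1.0 / sum(unnormalized)   # the normalization constant in Bayes' rule
        return [alpha * u for u in unnormalized]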

Prediction of an unknown quantity X
- The probability of X happening, given that d has already been observed, is a function of how strongly each hypothesis predicts X given d
- Even if a hypothesis strongly predicts that X will happen, that prediction is discounted if the hypothesis itself is unlikely to be true given the observation of d
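In symbols, the prediction rule this slide describes (the standard Bayesian prediction formula of AIMA Ch. 20) is

    P(X \mid d) = \sum_i P(X \mid d, h_i)\, P(h_i \mid d) = \sum_i P(X \mid h_i)\, P(h_i \mid d)

where the second equality assumes each hypothesis fully determines the distribution of X.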

Details of Bayes' rule
- Assume all observations within d are independent and identically distributed (i.i.d.)
- The probability of a hypothesis generating a series of observations d is then the product of the probabilities of generating each individual observation
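Written out for d = d1, d2, …, this is the i.i.d. factorization

    P(d \mid h_i) = \prod_j P(d_j \mid h_i)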

Example
- Prior distribution across hypotheses:
  h1 = 100% cherry, prior 0.1
  h2 = 75/25 cherry/lime, prior 0.2
  h3 = 50/50 cherry/lime, prior 0.4
  h4 = 25/75 cherry/lime, prior 0.2
  h5 = 100% lime, prior 0.1
- Prediction: for a sequence d of 10 lime candies, P(d | h3) = (0.5)^10

Example
- The probability of each hypothesis starts at its prior value: <0.1, 0.2, 0.4, 0.2, 0.1>
- Unnormalized posterior of hypothesis h3 as 10 lime candies are observed: P(d | h3) * P(h3) = (0.5)^10 * (0.4)
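A short Python sketch of this calculation (illustrative, not from the slides), using the priors above and the per-hypothesis lime probabilities 0, 0.25, 0.5, 0.75, and 1:

    # Priors P(h1)..P(h5) and the probability of drawing lime under each hypothesis.
    priors = [0.1, 0.2, 0.4, 0.2, 0.1]
    p_lime = [0.0, 0.25, 0.5, 0.75, 1.0]

    posteriors = priors[:]
    for n in range(1, 11):                    # observe 10 lime candies, one at a time
        unnorm = [post * p for post, p in zip(posteriors, p_lime)]
        alpha = 1.0 / sum(unnorm)
        posteriors = [alpha * u for u in unnorm]
        print(n, [round(p, 4) for p in posteriors])

The printed rows show the posterior mass shifting toward h5 (all lime) as lime after lime is observed.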

Prediction of the 11th candy
- If we've observed 10 lime candies, is the 11th lime?
- Build a weighted sum of each hypothesis's prediction
- The weighted sum can become expensive to compute
- Instead, use only the most probable hypothesis and ignore the others
- MAP: maximum a posteriori, the hypothesis that is most probable given the observations
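Continuing the sketch above (the helper names are my own), the two prediction strategies look like this:

    def predict_lime(posteriors, p_lime):
        """Full Bayesian prediction: weighted sum of every hypothesis's prediction."""
        return sum(post * p for post, p in zip(posteriors, p_lime))

    def predict_lime_map(posteriors, p_lime):
        """MAP prediction: use only the single most probable hypothesis."""
        best = max(range(len(posteriors)), key=lambda i: posteriors[i])
        return p_lime[best]

After 10 observed limes both functions return a probability of lime close to 1, which is why the cheaper MAP prediction is often an acceptable approximation to the full weighted sum.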

Overfitting
- Remember overfitting from the NN discussion?
- The number of hypotheses influences predictions
- Too many hypotheses can lead to overfitting

Overfitting Example
- Say we've observed 3 cherry and 7 lime
- With our 5 hypotheses from before, the prediction is a weighted average over the 5
- Now consider having 11 hypotheses, one for each possible cherry proportion (0/10, 1/10, …, 10/10)
- If we pick only the most probable hypothesis, the one matching the observed 3/7 split gets weight 1 and all the others get weight 0: the prediction simply reproduces the sample
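A hypothetical sketch of this effect (not from the slides), assuming a uniform prior over the 11 proportion hypotheses:

    # 11 hypotheses, one per cherry proportion 0.0, 0.1, ..., 1.0,
    # after observing 3 cherry and 7 lime candies.
    thetas = [i / 10 for i in range(11)]                # P(cherry | h_i)
    likelihoods = [t**3 * (1 - t)**7 for t in thetas]   # 3 cherry, 7 lime
    alpha = 1.0 / sum(likelihoods)                      # uniform prior cancels out
    posteriors = [alpha * l for l in likelihoods]

    map_theta = max(zip(posteriors, thetas))[1]
    print(map_theta)                                    # 0.3: MAP mirrors the sample exactly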

Learning with Data
- First, parameter learning
- Let's create a hypothesis hθ for candies that says the probability a cherry is drawn is θ
- If we unwrap N candies and c are cherry, what is θ?
- The likelihood is P(d | hθ) = θ^c (1 − θ)^(N − c), so the log-likelihood is L(d | hθ) = log P(d | hθ) = c log θ + (N − c) log(1 − θ)

Learning with Data
- We want to find the θ that maximizes the log-likelihood: differentiate L with respect to θ and set the derivative to 0
- In general this maximization may not be easy to compute in closed form, and iterative numerical methods may be used
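For this single-parameter case the maximization does have a well-known closed form; the derivation below is not on the slide but follows directly from the log-likelihood above:

    \frac{\partial L}{\partial \theta} = \frac{c}{\theta} - \frac{N - c}{1 - \theta} = 0
    \quad\Longrightarrow\quad c\,(1 - \theta) = (N - c)\,\theta
    \quad\Longrightarrow\quad \theta = \frac{c}{N}

That is, the maximum-likelihood estimate is simply the observed fraction of cherry candies; the iterative and numerical methods mentioned on the slide are needed for models where no such closed form exists.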