
Announcements CS Ice Cream Social, 9/5, 3:30-4:30, ECCR 265; includes poster session and student group presentations.

Concept Learning Examples: word meanings, edible foods, abstract structures (e.g., irony). (Figure: novel objects labeled glorch / not glorch.)

Supervised Approach To Concept Learning: both positive and negative examples are provided. Typical models circa 2000 (both in ML and Cog Sci) required both positive and negative examples.

Contrast With Human Learning Abilities: learning from positive examples only; learning from a small number of examples (e.g., word meanings, learning appropriate social behavior, instruction on some skill). What would it mean to learn from a small number of positive examples?

Tenenbaum (1999): a two-dimensional continuous feature space; concepts defined by axis-parallel rectangles. E.g., feature dimensions: cholesterol level and insulin level; concept: healthy.

Learning Problem: given a set of n examples, X = {x₁, x₂, x₃, …, xₙ}, which are instances of the concept, will some unknown example y also be an instance of the concept? This is the problem of generalization.

Hypothesis (Model) Space. H: all rectangles on the plane, parameterized by (l₁, l₂, s₁, s₂), the location and side length along each dimension. h: one particular hypothesis. Note: |H| = ∞. Consider all hypotheses in parallel, in contrast to the non-Bayesian approach of maintaining only the best hypothesis at any point in time.

Prediction Via Model Averaging. Will some unknown input y be in the concept, given examples X = {x₁, x₂, x₃, …, xₙ}? Let Q be the proposition that y is a positive example of the concept (true/false).
P(Q | X) = ∫ P(Q & h | X) dh  (marginalization over hypotheses)
P(Q & h | X) = P(Q | h, X) P(h | X)  (chain rule)
P(Q | h, X) = P(Q | h) = 1 if y is in h, 0 otherwise  (conditional independence; concepts are deterministic)
P(h | X) ∝ P(X | h) P(h)  (Bayes rule: likelihood × prior)

Priors and Likelihood Functions. Priors, p(h): location invariant; uninformative prior (prior depends only on the area of the rectangle); expected-size prior. Likelihood function, p(X | h): X = set of n examples; size principle.
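The size principle, written out (a minimal statement assuming strong sampling, i.e., examples drawn uniformly from inside the concept; the notation is ours rather than copied from the slides):

\[
p(X \mid h) =
\begin{cases}
\dfrac{1}{|h|^{\,n}} & \text{if } x_1, \dots, x_n \in h,\\[4pt]
0 & \text{otherwise,}
\end{cases}
\]

where |h| = s₁ s₂ is the area of rectangle h. Smaller hypotheses that still contain all of the examples receive higher likelihood, and the preference sharpens as n grows.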

Expected size prior
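Putting the last few slides together, here is a minimal numerical sketch of prediction via model averaging on the rectangle task. It approximates the integral over hypotheses with a coarse grid, uses the size-principle likelihood, and uses a simple exponential expected-size prior; the grid resolution, the prior's exact form and scale, and all function names are illustrative assumptions rather than details of Tenenbaum (1999).

```python
import itertools
import numpy as np

def generalization_prob(examples, query, lo=0.0, hi=1.0, steps=20, sigma=0.2):
    """Grid-approximate P(query in concept | examples) by averaging over
    axis-parallel rectangle hypotheses h = (l1, l2, s1, s2).

    Likelihood: size principle, p(X|h) = 1/area**n if every example falls in h.
    Prior: expected-size prior, p(h) proportional to exp(-area/sigma)
    (one simple choice, assumed here for illustration).
    """
    X = np.asarray(examples, dtype=float)
    n = len(X)
    locs = np.linspace(lo, hi, steps)             # candidate lower-left corners
    sizes = np.linspace(0.0, hi - lo, steps)[1:]  # candidate side lengths (> 0)
    num = 0.0   # accumulates P(query in h) * p(X|h) * p(h)
    den = 0.0   # accumulates p(X|h) * p(h)
    for l1, l2, s1, s2 in itertools.product(locs, locs, sizes, sizes):
        inside = ((X[:, 0] >= l1) & (X[:, 0] <= l1 + s1) &
                  (X[:, 1] >= l2) & (X[:, 1] <= l2 + s2))
        if not inside.all():
            continue                              # likelihood is zero for this h
        area = s1 * s2
        weight = (1.0 / area) ** n * np.exp(-area / sigma)
        den += weight
        if l1 <= query[0] <= l1 + s1 and l2 <= query[1] <= l2 + s2:
            num += weight
    return num / den

X = [(0.45, 0.50), (0.50, 0.55), (0.55, 0.48)]      # three positive examples
print(generalization_prob(X, query=(0.52, 0.52)))   # near the examples: high
print(generalization_prob(X, query=(0.90, 0.10)))   # far from the examples: low
```

Queries near the observed examples get high generalization probability, and the probability falls off with distance, mirroring the generalization gradients discussed on the next slide.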

Generalization Gradients. MIN: the smallest hypothesis consistent with the data. Weak Bayes: instead of using the size principle, assumes examples are produced by a process independent of the true class. (Figure: dark line = 50% probability contour.)

Experimental Design. Subjects were shown n dots on screen that are “randomly chosen examples from some rectangle of healthy levels,” with n drawn from {2, 3, 4, 6, 10, 50}. The dots varied in horizontal and vertical range r, drawn from {.25, .5, 1, 2, 4, 8} units in a 24-unit window. Task: draw the ‘true’ rectangle around the dots.

Experimental Results

Number Game. The experimenter picks an integer arithmetic concept C (e.g., prime numbers; numbers between 10 and 20; multiples of 5). The experimenter presents positive examples drawn at random from C, say in the range [1, 100]. The participant is asked whether some new test case belongs in C.

Empirical Predictive Distributions

Hypothesis Space: even numbers; odd numbers; squares; multiples of n; ends in n; powers of n; all numbers; intervals [n, m] for n > 0, m < 101; powers of 2, plus 37; powers of 2, except for 32.

Observation = 16. Likelihood function: size principle. Prior: intuition.

Observation = … . Likelihood function: size principle. Prior: intuition.

Posterior Distribution After Observing 16
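A minimal sketch of how such a posterior could be computed for a toy version of the number game. The hypothesis space, the prior weights, and the helper names below are illustrative assumptions, not values from the experiment; only the size-principle likelihood p(X | h) = 1/|h|ⁿ follows the slides.

```python
# Toy number game: posterior over hypotheses after observing X = [16],
# using the size principle for the likelihood and a hand-picked prior.
hypotheses = {
    "even numbers":         [n for n in range(1, 101) if n % 2 == 0],
    "odd numbers":          [n for n in range(1, 101) if n % 2 == 1],
    "squares":              [n * n for n in range(1, 11)],
    "powers of 2":          [2 ** k for k in range(1, 7)],          # 2..64
    "multiples of 4":       [n for n in range(4, 101, 4)],
    "powers of 2, plus 37": [2 ** k for k in range(1, 7)] + [37],
}

# Illustrative prior: mathematically "natural" rules get more weight,
# oddball rules like "powers of 2, plus 37" get very little.
prior = {
    "even numbers": 0.25, "odd numbers": 0.25, "squares": 0.15,
    "powers of 2": 0.15, "multiples of 4": 0.15, "powers of 2, plus 37": 0.05,
}

def posterior(X):
    """Return P(h | X) for each hypothesis via Bayes rule + size principle."""
    scores = {}
    for name, members in hypotheses.items():
        if all(x in members for x in X):
            scores[name] = prior[name] * (1.0 / len(members)) ** len(X)
        else:
            scores[name] = 0.0                   # hypothesis ruled out
    z = sum(scores.values())
    return {name: s / z for name, s in scores.items()}

def predictive(y, X):
    """P(y in concept | X): average over hypotheses, weighted by posterior."""
    post = posterior(X)
    return sum(p for name, p in post.items() if y in hypotheses[name])

print(posterior([16]))       # "powers of 2" and "squares" get much of the mass
print(predictive(32, [16]))  # plausibly in the concept
print(predictive(87, [16]))  # ruled out by every surviving hypothesis here
```

Model averaging over this posterior then gives the predictive probabilities for new numbers.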

Model vs. Human Data (figure panels: model predictions, human data).

Summary of Tenenbaum (1999). Method: pick a prior distribution (which includes choosing the hypothesis space) and a likelihood function (size principle); this leads to predictions for generalization as a function of r (range) and n (number of examples). Claims: people generalize optimally given the assumptions about priors and likelihood; the Bayesian approach provides the best description of how people generalize on the rectangle task; it explains how people can learn from a small number of examples, and from positive examples only.

Important Ideas in Bayesian Models. Generative models: the likelihood function. Consideration of multiple models in parallel: a potentially infinite model space. Inference: prediction via model averaging; the role of priors diminishes with the amount of evidence. Learning: trade-off between model simplicity and fit to data; Bayesian Occam's Razor.

Ockham's Razor. If two hypotheses are equally consistent with the data, prefer the simpler one. Simplicity: can accommodate fewer observations; smoother; fewer parameters; restricts predictions more (“sharper” predictions). Examples: 1st vs. 4th order polynomial; small rectangle vs. large rectangle in the Tenenbaum model. (Images: Ockham, the medieval philosopher and monk; a razor, a tool for cutting, used metaphorically.)

Motivating Ockham's Razor. Aesthetic considerations: a theory with mathematical beauty is more likely to be right (or believed) than an ugly one, given that both fit the same data. Past empirical success of the principle. Coherent inference, as embodied by Bayesian reasoning, automatically incorporates Ockham's razor: with two theories H₁ and H₂, the razor can enter through the priors or through the likelihoods.

Ockham's Razor with Priors. Jeffreys (1939) probability text: more complex hypotheses should have lower priors. This requires a numerical rule for assessing complexity, e.g., the number of free parameters or the Vapnik-Chervonenkis (VC) dimension.

Subjective vs. Objective Priors. A subjective or informative prior encodes specific, definite information about a random variable; an objective or uninformative prior encodes only vague, general information. There are philosophical arguments for certain priors as uninformative. Maximum entropy / least commitment: e.g., interval [a, b]: uniform; interval [0, ∞) with mean 1/λ: exponential distribution; known mean μ and standard deviation σ: Gaussian. Independence of measurement scale: e.g., the prior 1/(θ(1−θ)) for θ in [0, 1] expresses the same belief whether we parameterize by θ or by its log-odds, log(θ/(1−θ)) (strictly, the Jeffreys prior for a Bernoulli parameter is 1/√(θ(1−θ))).

Ockham’s Razor Via Likelihoods. Coin flipping example: H₁: the coin has two heads; H₂: the coin has a head and a tail (a fair coin). Consider 5 flips producing HHHHH. H₁ could produce only this sequence; H₂ could produce HHHHH, but also HHHHT, HHHTH, …, TTTTT. P(HHHHH | H₁) = 1, P(HHHHH | H₂) = 1/32. H₂ pays the price of a lower likelihood because it can accommodate a greater range of observations; H₁ is more readily rejected by observations.
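A tiny numerical check of the coin example (the hypotheses and the HHHHH sequence come from the slide; the helper function is just an illustrative sketch):

```python
from itertools import product

flips = "HHHHH"

# H1: two-headed coin -> every flip is H with probability 1.
# H2: ordinary coin with a head and a tail -> each flip is H with probability 1/2.
def likelihood(seq, p_heads):
    prob = 1.0
    for flip in seq:
        prob *= p_heads if flip == "H" else (1.0 - p_heads)
    return prob

p_h1 = likelihood(flips, 1.0)   # = 1
p_h2 = likelihood(flips, 0.5)   # = 1/32
print(p_h1, p_h2)               # 1.0 0.03125

# H2 spreads its probability over all 2**5 = 32 sequences of length 5:
print(sum(likelihood("".join(s), 0.5) for s in product("HT", repeat=5)))  # 1.0

# Likelihood ratio (Bayes factor) in favor of H1 for this data:
print(p_h1 / p_h2)              # 32.0
```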

Simple and Complex Hypotheses (figure comparing the likelihoods of hypotheses H₁ and H₂).

Bayes Factor. The Bayes factor is the ratio of the (marginal) likelihoods of two hypotheses, a.k.a. the likelihood ratio; BIC is an approximation to the Bayes factor.
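Written out (a standard formulation, assumed here rather than taken verbatim from the slide):

\[
\mathrm{BF}_{12} = \frac{P(D \mid H_1)}{P(D \mid H_2)},
\qquad
\ln P(D \mid H_i) \;\approx\; \ln P(D \mid \hat{\theta}_i, H_i) - \frac{k_i}{2}\ln n,
\]

where \(\hat{\theta}_i\) is the maximum-likelihood estimate under \(H_i\), \(k_i\) is its number of free parameters, and \(n\) is the number of data points; the approximation on the right is the BIC (Schwarz) approximation to the marginal likelihood.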

Hypothesis Classes Varying In Complexity. E.g., 1st, 2nd, and 3rd order polynomials; each hypothesis class is parameterized by a weight vector w.
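A small sketch of how the BIC approximation from the previous slide plays out for polynomial hypothesis classes. The data-generating function, noise level, and the Gaussian-noise BIC formula used here are illustrative assumptions, not part of the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 30
x = np.linspace(-1, 1, n)
y = 0.5 + 1.5 * x + rng.normal(scale=0.2, size=n)   # data truly linear + noise

def bic(order):
    """BIC for a least-squares polynomial fit with Gaussian noise:
    n*ln(RSS/n) + k*ln(n), where k counts the polynomial coefficients
    (the shared noise-variance parameter is omitted; it cancels in comparisons)."""
    coeffs = np.polyfit(x, y, order)
    rss = np.sum((np.polyval(coeffs, x) - y) ** 2)
    k = order + 1
    return n * np.log(rss / n) + k * np.log(n)

for order in (1, 2, 3, 5):
    print(order, round(bic(order), 1))
# Higher-order polynomials fit the training data slightly better (lower RSS)
# but pay a complexity penalty; the 1st-order model typically wins here.
```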

Rissanen (1976) Minimum Description Length. Prefer models that can communicate the data in the smallest number of bits. The preferred hypothesis H for explaining data D minimizes the sum of (1) the length of the description of the hypothesis and (2) the length of the description of the data with the help of the chosen hypothesis, where L denotes description length: L(H) + L(D | H).

MDL & Bayes. L: some measure of length (complexity). MDL: prefer the hypothesis that minimizes L(H) + L(D | H). Bayes rule implies the MDL principle:
P(H | D) = P(D | H) P(H) / P(D)
–log P(H | D) = –log P(D | H) – log P(H) + log P(D) = L(D | H) + L(H) + const
so maximizing the posterior is the same as minimizing the total description length.

Relativity Example. Explain the deviation in Mercury's orbit at perihelion with respect to the prevailing theory. E: Einstein's theory; F: fudged Newtonian theory. α = true deviation; a = observed deviation.

Relativity Example (Continued). Subjective Ockham's razor: the result depends on one's belief about P(α | F). Objective Ockham's razor: for the Mercury example, the RHS is …; this applies to the generic situation.