David Kauchak CS159 Spring 2019

Priors David Kauchak CS159 Spring 2019

Admin Assignment 7 due Friday at 5pm. Project proposal presentations at the beginning of class on Wednesday.

Maximum likelihood estimation Intuitive: sets the probabilities so as to maximize the probability of the training data. Problems? Overfitting! The amount of data is particularly problematic for rare events. Is our training data representative?
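To make the estimate concrete (an addition, not on the original slide): for a coin with θ ≡ P(head), observed H heads and T tails, maximizing the likelihood has a closed-form solution:

\hat{\theta}_{\text{MLE}} = \arg\max_{\theta}\; \theta^{H}(1-\theta)^{T} = \frac{H}{H+T}

so a coin that happened to come up tails twice gets an estimate of 0 for heads, which is exactly the rare-event problem above.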

Basic steps for probabilistic modeling Step 1: pick a model. Step 2: figure out how to estimate the probabilities for the model. Step 3 (optional): deal with overfitting. Which model do we use, i.e. how do we calculate p(feature, label)? How do we train the model, i.e. how do we estimate the probabilities for the model? How do we deal with overfitting?

Priors Coin1 data: 3 heads and 1 tail. Coin2 data: 30 heads and 10 tails. Coin3 data: 2 tails. Coin4 data: 497 heads and 503 tails. If someone asked you what the probability of heads was for each of these coins, what would you say?
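As an illustration (my sketch, not code from the course), the MLE answer is just the observed fraction of heads for each coin, which looks reasonable for Coin2 and Coin4 but gives 0.0 for Coin3 after only two flips:

# Maximum likelihood estimates for the four coins on the slide.
coins = {
    "Coin1": (3, 1),      # (heads, tails)
    "Coin2": (30, 10),
    "Coin3": (0, 2),
    "Coin4": (497, 503),
}

for name, (heads, tails) in coins.items():
    mle = heads / (heads + tails)
    print(f"{name}: P(heads) = {mle:.3f} from {heads + tails} flips")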

Training revisited From a probability standpoint, MLE training is selecting the Θ that maximizes the probability of the training data: Θ_MLE = argmax_Θ p(data | Θ). That is, we pick the parameters under which the training data is most probable.

Estimating revisited We can incorporate a prior belief about what the probabilities might be! To do this, we need to break down our probability (hint: Bayes' rule).
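The decomposition the slide is pointing at (reconstructed here; the formula appeared as an image on the original slide) is Bayes' rule applied to the parameters:

p(\theta \mid \text{data}) = \frac{p(\text{data} \mid \theta)\, p(\theta)}{p(\text{data})}

Training now means choosing the Θ that maximizes this posterior rather than the likelihood alone.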

Estimating revisited What are each of these probabilities?

Priors p(data | Θ): the likelihood of the data under the model. p(Θ): the probability of different parameters, called the prior. p(data): the probability of seeing the data (regardless of model).

Priors Does p(data) matter for the argmax? No: it doesn't vary as Θ changes, it's a constant.

Priors What does MLE assume for a prior on the model parameters?

Priors MLE assumes a uniform prior, i.e. all Θ are equally likely! It relies solely on the likelihood.

A better approach We can use any distribution we'd like for the prior p(Θ). This allows us to impart additional bias into the model.
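A minimal sketch of what this buys us (my example, not from the course; it assumes a Beta prior on θ, a standard choice for a coin's bias):

# MAP estimate of P(heads) under a Beta(alpha, beta) prior.
# The prior acts like alpha-1 imaginary heads and beta-1 imaginary tails.
def map_estimate(heads, tails, alpha=2.0, beta=2.0):
    return (heads + alpha - 1) / (heads + tails + alpha + beta - 2)

# Coin3 saw 0 heads and 2 tails: the MLE is 0.0, but a Beta(2,2) prior
# pulls the estimate back toward 0.5.
print(map_estimate(0, 2))   # 0.25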

Another view on the prior Remember, the max is the same if we take the log. We can use any distribution we'd like; this allows us to impart additional bias into the model.
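Written out (a reconstruction; the formula was an image on the slide), taking the log of the objective gives

\arg\max_{\theta}\; p(\text{data} \mid \theta)\, p(\theta) \;=\; \arg\max_{\theta}\; \big[\log p(\text{data} \mid \theta) + \log p(\theta)\big]

so the prior simply contributes an extra additive term to the log-likelihood we were already maximizing.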

What about smoothing? Sometimes this is also called smoothing because it is seen as smoothing or interpolating between the MLE and some other distribution: for each label, pretend like we've seen each feature/word occur in λ additional examples.
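A minimal sketch of that idea in code (my illustration, not course code; counts[label][word] and vocab are assumed data structures):

# Add-lambda smoothed estimate of P(word | label) from raw counts.
def smoothed_prob(word, label, counts, vocab, lam=1.0):
    word_count = counts[label].get(word, 0)     # times `word` appeared with `label`
    total = sum(counts[label].values())         # total word occurrences for `label`
    # Pretend every vocabulary word was seen lambda extra times for this label.
    return (word_count + lam) / (total + lam * len(vocab))

counts = {"positive": {"great": 3, "movie": 2}, "negative": {"awful": 1, "movie": 2}}
vocab = {"great", "movie", "awful"}
print(smoothed_prob("great", "negative", counts, vocab, lam=1.0))  # 1/6 instead of 0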

Prior for NB λ = 0 corresponds to a uniform prior on the parameters (plain MLE); λ > 0 corresponds to a Dirichlet prior, and as λ gets larger we bias more and more towards the prior, pulling the estimates towards the uniform distribution.