The Estimation Problem
How would we select parameters in the limiting case where we had ALL the data? Intuitively, the actual frequencies of all the transitions would best describe the parameters we seek. The probability $a_{k \to l}$ of transitioning from state $k$ to state $l$ is then
$a_{k \to l} = \dfrac{c_{k \to l}}{\sum_{l'} c_{k \to l'}}$
where $c_{k \to l}$ is the count of $k \to l$ transitions, and the denominator is the counts of $k \to l'$ transitions summed over all possible states $l'$.
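To make the counting concrete, here is a minimal sketch (not from the slides) of how these count-based estimates could be computed from one observed state path; the function name is an illustrative choice, and the example path reuses the "S--+++" sequence that appears later in the lecture.

```python
from collections import defaultdict

def estimate_transition_probs(state_sequence):
    """Estimate a_{k->l} = c_{k->l} / sum_{l'} c_{k->l'} from one observed state path."""
    counts = defaultdict(lambda: defaultdict(int))
    for k, l in zip(state_sequence, state_sequence[1:]):
        counts[k][l] += 1                      # c_{k->l}
    probs = {}
    for k, row in counts.items():
        total = sum(row.values())              # counts of k -> l' summed over all l'
        probs[k] = {l: c / total for l, c in row.items()}
    return probs

# Example: a path through states 'S', '-', '+'
print(estimate_transition_probs("S--+++"))
# {'S': {'-': 1.0}, '-': {'-': 0.5, '+': 0.5}, '+': {'+': 1.0}}
```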

The Estimation Problem
What about when we only have a sample? Consider the observed sequence $X = \text{"S--+++"}$. Before we collected the data, the probability of this sequence was a function of $\theta$, our set of unknown parameters:
$P(X \mid \theta) = P(\text{"S--+++"} \mid \theta) = a_{S \to -}\, a_{- \to -}\, a_{- \to +}\, a_{+ \to +}\, a_{+ \to +}$
However, our data is fixed; we have already collected it. The parameters are also fixed, but unknown. We can therefore imagine values for the parameters and treat the probability of the observed data as a function of $\theta$.

The Estimation Problem: The Likelihood Function
When we treat the probability of the observed data as a function of the parameters, we call it the likelihood function:
$L(\theta \mid X) = P(\text{"S--+++"} \mid \theta) = a_{S \to -}\, a_{- \to -}\, a_{- \to +}\, a_{+ \to +}\, a_{+ \to +}$
Caution! The likelihood function does not define a probability distribution or density: it does not integrate (or sum) to 1 over $\theta$.
A few things to notice: the probability of any particular sample we get is generally going to be pretty low regardless of the true values of $\theta$, yet the likelihood still tells us some valuable information. We know, for instance, that $a_{- \to +}$ is not zero, and so on.
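As a quick sketch of what "the likelihood as a function of $\theta$" means in code: fix the observed path and plug in candidate parameter values. The specific numbers in theta below are invented purely for illustration and are not taken from the slides.

```python
# A hypothetical parameter setting theta: transition probabilities a_{k -> l}.
# These particular values are made up for illustration only.
theta = {
    ('S', '-'): 0.6, ('S', '+'): 0.4,
    ('-', '-'): 0.5, ('-', '+'): 0.5,
    ('+', '-'): 0.1, ('+', '+'): 0.9,
}

def likelihood(theta, path):
    """L(theta | X): probability of the observed path under the candidate parameters."""
    L = 1.0
    for k, l in zip(path, path[1:]):
        L *= theta[(k, l)]
    return L

print(likelihood(theta, "S--+++"))   # 0.6 * 0.5 * 0.5 * 0.9 * 0.9, roughly 0.1215
```

Trying different theta dictionaries and comparing the resulting values is exactly the comparison that maximum likelihood estimation formalizes.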

Maximum Likelihood Estimation
Maximum likelihood estimation seeks the solution that "best" explains the observed dataset:
$\theta_{ML} = \underset{\theta}{\operatorname{argmax}}\; P(X \mid \theta)$, or equivalently $\theta_{ML} = \underset{\theta}{\operatorname{argmax}}\; \log P(X \mid \theta)$
Translation: "select as our maximum likelihood parameters those parameters that maximize the probability of the observation given those parameters", i.e. we seek to maximize $P(X \mid \theta)$ over all possible $\theta$. This is sometimes called the maximum likelihood criterion.
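One way to see the criterion in action is a brute-force search over candidate parameter values. The coin-flip numbers below (7 heads in 10 flips) are an assumed illustration, not an example from the slides.

```python
import math

heads, flips = 7, 10   # observed data: 7 heads in 10 flips (illustrative numbers)

def log_lik(theta):
    """log P(X | theta) for independent Bernoulli flips with heads-probability theta."""
    return heads * math.log(theta) + (flips - heads) * math.log(1 - theta)

# Evaluate the log likelihood over a grid of candidate values and take the argmax,
# exactly as the maximum likelihood criterion prescribes.
grid = [i / 1000 for i in range(1, 1000)]
theta_ml = max(grid, key=log_lik)
print(theta_ml)   # 0.7: the observed frequency of heads
```

The argmax lands on the observed frequency, which foreshadows the count-based estimates from the first slide.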

Maximum Likelihood Estimation
The log likelihood is often very handy, as we would otherwise need to deal with a long product of terms. This often comes about because there are multiple observations that need to be considered:
$\theta_{ML} = \underset{\theta}{\operatorname{argmax}} \prod_{i=1}^{k} P(x_i \mid \theta) = \underset{\theta}{\operatorname{argmax}} \sum_{i=1}^{k} \log P(x_i \mid \theta)$
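A small numerical aside (not from the slides) on why the log form also matters in practice: a long product of small probabilities can underflow double precision, while the sum of logs stays well behaved.

```python
import math

p = 1e-3    # per-observation probability (an assumed value, purely for illustration)
n = 200     # number of independent observations

product = 1.0
log_sum = 0.0
for _ in range(n):
    product *= p              # long product of small terms
    log_sum += math.log(p)    # equivalent sum of log terms

print(product)    # 0.0: the product underflows to zero in double precision
print(log_sum)    # about -1381.6: still perfectly usable for comparing parameter choices
```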

The Estimation Problem
Writing the log likelihood function with frequencies: if outcome $x_i$ is observed $n_i$ times, the log likelihood becomes $\sum_i n_i \log P(x_i \mid \theta)$.
(Speaker's note on the original slide: unsure whether this slide is needed; it stops short of introducing the positivity of relative entropy.)
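For completeness, here is the standard Lagrange-multiplier argument for maximizing this frequency form of the log likelihood (a textbook derivation, not spelled out on the slides); it shows why the count-based frequencies are exactly the maximum likelihood estimates.

```latex
\max_{p}\ \sum_i n_i \log p_i \quad \text{subject to} \quad \sum_i p_i = 1
\qquad\Longrightarrow\qquad
\frac{\partial}{\partial p_i}\!\left(\sum_j n_j \log p_j + \lambda\Big(1-\sum_j p_j\Big)\right)
  = \frac{n_i}{p_i} - \lambda = 0
  \;\Rightarrow\; p_i = \frac{n_i}{\lambda}
  \;\Rightarrow\; \hat{p}_i = \frac{n_i}{\sum_j n_j}
```

The constraint $\sum_i p_i = 1$ forces $\lambda = \sum_j n_j$, so the maximizer is simply the observed frequency of each outcome.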

Maximum Likelihood Estimation
Sometimes proving that a particular parameter choice maximizes the likelihood function is the "tricky bit". In the general case this is often done by finding the zeros of the derivative of the likelihood function, or by some other trick, such as forcing the function into a particular form and relying on an inequality to prove it must be a maximum. Let's skip the gory details and try to motivate this intuitively…

The Estimation Problem
Maybe it's enough to convince ourselves that the count-based estimate
$\dfrac{c_{k \to l}}{\sum_{l'} c_{k \to l'}}$
will approach $P(k \to l \mid \text{all the data})$ as the amount of sample data increases, in the limit where we finally have all the data. Let's see how this plays out with a simple simulation…
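Below is a minimal sketch of the kind of simulation the following plots summarize: draw samples of 10 nucleotides from a uniform distribution and look at the maximum likelihood (count-frequency) estimates as the number of samples grows. The helper name, random seed, and printing format are assumptions; the sample sizes mirror the plots that follow.

```python
import random
from collections import Counter

random.seed(0)
BASES = "ACGT"   # underlying distribution is uniform: p_A = p_C = p_G = p_T = 0.25

def mle_frequencies(n_samples, sample_size=10):
    """Pool n_samples samples of sample_size nucleotides and return the ML estimates."""
    counts = Counter()
    for _ in range(n_samples):
        counts.update(random.choice(BASES) for _ in range(sample_size))
    total = sum(counts.values())
    return {b: counts[b] / total for b in BASES}

for n in (1, 10, 100, 1000):
    print(n, mle_frequencies(n))
# With a single sample of 10, the estimates swing far from 0.25 (overfitting);
# by 1000 samples of 10 they settle close to the true uniform values.
```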

Maximum Likelihood Estimation
Typical plot for a single sample of 10 nucleotides. MLE is prone to overfitting when the sample is small. The underlying distribution this was sampled from was uniform ($p_A = p_C = p_G = p_T = 0.25$).

Maximum Likelihood Estimation
Typical plot for 10 samples of 10 nucleotides. The underlying distribution these were sampled from was uniform ($p_A = p_C = p_G = p_T = 0.25$).

Maximum Likelihood Estimation
Typical plot for 100 samples of 10 nucleotides. The underlying distribution these were sampled from was uniform ($p_A = p_C = p_G = p_T = 0.25$).

Maximum Likelihood Estimation
Typical plot for 1000 samples of 10 nucleotides. The underlying distribution these were sampled from was uniform ($p_A = p_C = p_G = p_T = 0.25$).