Review of statistical modeling and probability theory Alan Moses ML4bio.


What is modeling? Describe some observations in a simple, more compact way. X = (X1, X2)

What is modeling? Describe some observations in a simple, more compact way. Model: a = -Gm/r². Instead of all the observations, we only need to remember a constant 'G' and measure some parameters 'm' and 'r'.

What is statistical modeling? It also deals with the 'uncertainty' in observations: expectation and deviation (or variance). We also use the term 'probabilistic' modeling. The mathematics is more complicated.

What kind of questions will we answer in this course? What’s the best linear model to explain some data?

What kind of questions will we answer in this course? Are there multiple groups? What are they?

What kind of questions will we answer in this course? Given new data, which group do we assign it to?

3 major areas of machine learning (that have proven useful in biology): Regression, Clustering, Classification

Molecular Biology example. [Figure: distribution of expression levels, with its expectation and variance; disease status indicated.] X = (L, D)

Molecular Biology example: "clustering". [Figure: expression levels fall into two classes with expectations E1, E2 and variances V1, V2; Class 2 is "enriched" for disease.]

Molecular Biology example: "clustering" and "regression". [Figure: as before, plus expression level plotted against genotype (AA, Aa, aa); Class 2 is "enriched" for disease.]

Molecular Biology example: "clustering", "regression", and "classification". [Figure: as before, plus classification of a new individual (disease?) from genotype (AA, Aa, aa) and expression level.]

Probability theory. Probability theory quantifies uncertainty using 'distributions'. Distributions are the 'models' and they depend on constants and parameters. E.g., in one dimension, the Gaussian or Normal distribution depends on two constants, e and π, and two parameters that have to be measured, μ and σ: P(X|μ,σ) = 1/√(2πσ²) · exp(−(X−μ)²/(2σ²)). 'X' are the possible datapoints that could come from the distribution; in statistics jargon, 'X' is called a random variable.
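As a quick check of this formula, here is a minimal sketch (assuming Python with numpy and scipy, which are not part of the slides; the parameter values and datapoints are made up) that evaluates the Normal density both directly and with scipy.stats.norm:

```python
import numpy as np
from scipy.stats import norm

mu, sigma = 6.5, 1.5            # hypothetical parameter values
x = np.array([4.0, 6.5, 9.0])   # hypothetical datapoints (values of the random variable X)

# Direct implementation of P(X | mu, sigma)
p_direct = np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)

# Same thing using scipy's built-in Normal distribution
p_scipy = norm.pdf(x, loc=mu, scale=sigma)

print(p_direct)
print(p_scipy)   # the two agree
```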

Probability theory. Probability theory quantifies uncertainty using 'distributions'. Choosing the distribution or 'model' is the first step in a statistical model. E.g., data: mRNA expression levels, counts of sequencing reads, presence or absence of protein domains, or 'A', 'C', 'G', and 'T's. We will use different distributions to describe these different types of data.

Typical data and distributions:
- Data is categorical (yes or no; A, C, G, T)
- Data is a fraction (e.g., 13 out of 5212)
- Data is a continuous number (e.g., -6.73)
- Data is a 'natural' number (0, 1, 2, 3, 4, …)
It's also possible to do regression, clustering, and classification without specifying a distribution. A small sketch pairing these data types with common distributions follows below.
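The snippet below pairs each data type with one common distributional choice (Bernoulli, binomial, Gaussian, Poisson). The slide itself does not name the distributions at this point, so this mapping, and all numbers used, are illustrative assumptions (scipy assumed):

```python
from scipy import stats

# One common mapping from data type to distribution (an assumption;
# the slide itself does not name the distributions here):
#   categorical (yes/no)            -> Bernoulli
#   fraction (13 out of 5212)       -> binomial
#   continuous number (-6.73)       -> Gaussian / Normal
#   'natural' number (0, 1, 2, ...) -> Poisson

print(stats.bernoulli.pmf(1, p=0.3))          # P(yes) with success probability 0.3
print(stats.binom.pmf(13, n=5212, p=0.002))   # P(13 successes out of 5212 trials)
print(stats.norm.pdf(-6.73, loc=0, scale=3))  # density of a continuous value
print(stats.poisson.pmf(4, mu=2.5))           # P(count = 4) with rate 2.5
```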

Molecular Biology example: "classification". [Figure: genotype (AA, Aa, aa) and expression level for a new individual; disease?] In this example, we might try to combine a Bernoulli distribution for the disease data, a Poisson for the genotype, and a Gaussian for the expression level. We also might try to classify without specifying distributions.
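One way to read that combination is as a simple generative classifier. The sketch below is only an illustration under assumptions not stated on the slide: the genotype is encoded as a minor-allele count, the features are treated as independent within each disease class (a naive-Bayes-style assumption), and all parameter values are made up (numpy/scipy assumed):

```python
from scipy.stats import bernoulli, poisson, norm

# Hypothetical class-conditional parameters for disease = 0 (healthy) and disease = 1
params = {
    0: {"mu": 2.0, "sigma": 1.0, "rate": 0.5},
    1: {"mu": 5.0, "sigma": 1.0, "rate": 1.5},
}
p_disease = 0.3  # Bernoulli parameter for the disease data (hypothetical)

def joint(expr, geno_count, d):
    """P(expression, genotype, disease=d): Gaussian x Poisson x Bernoulli,
    assuming the features are independent given disease status."""
    th = params[d]
    return (norm.pdf(expr, loc=th["mu"], scale=th["sigma"])
            * poisson.pmf(geno_count, mu=th["rate"])
            * bernoulli.pmf(d, p=p_disease))

# A new person with expression level 4.2 and genotype Aa (minor-allele count 1)
expr, geno = 4.2, 1
p1 = joint(expr, geno, 1) / (joint(expr, geno, 0) + joint(expr, geno, 1))
print("P(disease | data) =", p1)
```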

Molecular Biology example. The genomics era means we will almost never have the expression level for just one gene or the genotype at just one locus. Each gene's expression level can be considered another 'dimension': for two genes, if each point is data for one person, we can make a graph of Gene 1 expression level versus Gene 2 expression level; for 1000s of genes there are also Gene 3, Gene 4, Gene 5, … as further dimensions.

Molecular Biology example. The genomics era means we will almost never have the expression level for just one gene or the genotype at just one locus. We'll usually make 2-D plots, but anything we say about 2-D can usually be generalized to n dimensions. Each "observation", X, contains the expression level for Gene 1 and Gene 2, e.g., X = (1.3, 4.6); or generally, represent this as a vector X = (X1, X2).

Molecular Biology example. The genomics era means we will almost never have the expression level for just one gene or the genotype at just one locus. We'll usually make 2-D plots, but anything we say about 2-D can usually be generalized to n dimensions. Each "observation", X, contains the expression level for Gene 1 and Gene 2, e.g., X = (1.3, 4.6), or generally a vector X = (X1, X2). This gives a geometric interpretation to multivariate statistics.

Probability theory. Probability theory quantifies uncertainty using 'distributions'. Distributions are the 'models' and they depend on constants and parameters. E.g., in two dimensions, the Gaussian or Normal distribution depends on two constants, e and π, and 5 parameters that have to be measured, μ and Σ: P(X|μ,Σ) = 1/(2π√|Σ|) · exp(−(1/2)(X−μ)ᵀ Σ⁻¹ (X−μ)). 'X' are the possible datapoints that could come from the distribution; in statistics jargon, 'X' is called a random variable. What does the mean mean in 2 dimensions? What does the standard deviation mean?
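A minimal numerical check of this two-dimensional density, assuming numpy/scipy; the mean vector, covariance matrix, and test point are made up:

```python
import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([1.0, 2.0])         # hypothetical 2-D mean
Sigma = np.array([[1.0, 0.5],
                  [0.5, 2.0]])    # hypothetical 2x2 covariance (with mu, 5 free parameters)

x = np.array([1.3, 4.6])          # a possible datapoint

# Direct implementation of the density above
diff = x - mu
p_direct = (np.exp(-0.5 * diff @ np.linalg.inv(Sigma) @ diff)
            / (2 * np.pi * np.sqrt(np.linalg.det(Sigma))))

# scipy's built-in multivariate Normal
p_scipy = multivariate_normal.pdf(x, mean=mu, cov=Sigma)

print(p_direct, p_scipy)   # the two agree
```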

Bivariate Gaussian

Molecular Biology example. The genomics era means we will almost never have the expression level for just one gene or the genotype at just one locus. We'll usually make 2-D plots, but anything we say about 2-D can usually be generalized to n dimensions. Each "observation", X, contains the expression level for Gene 1 and Gene 2; represent this as a vector X = (X1, X2). The mean is also a vector: μ = (μ1, μ2). The variance is a matrix: Σ = [σ11 σ12; σ21 σ22].

[Figure: four example covariance matrices Σ and the shapes of the corresponding Gaussians around the mean µ: "spherical covariance" (Σ = σ²I), "axis-aligned, diagonal covariance", "full covariance", and "correlated data".]
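To make those patterns concrete, here is a small sketch (numpy assumed; all numbers are hypothetical) that builds spherical, diagonal, and full covariance matrices and checks each against the empirical covariance of samples drawn from it:

```python
import numpy as np

sigma2 = 1.5
spherical = sigma2 * np.eye(2)           # Sigma = sigma^2 I
diagonal  = np.diag([0.5, 3.0])          # axis-aligned, diagonal covariance
full      = np.array([[1.0, 0.8],
                      [0.8, 2.0]])       # full covariance ("correlated data")

rng = np.random.default_rng(0)
mu = np.zeros(2)
for name, Sigma in [("spherical", spherical), ("diagonal", diagonal), ("full", full)]:
    X = rng.multivariate_normal(mu, Sigma, size=1000)
    # the empirical covariance of the samples should be close to Sigma
    print(name, np.cov(X, rowvar=False).round(2))
```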

Probability theory. Probability theory quantifies uncertainty using 'distributions'. Distributions are the 'models' and they depend on constants and parameters. Once we choose a distribution, the next step is to choose the parameters; this is called "estimation" or "inference". P(X|μ,σ) = 1/√(2πσ²) · exp(−(X−μ)²/(2σ²))

Estimation. We want to make a statistical model: 1. Choose a model (or probability distribution), e.g., P(X|μ,σ) = 1/√(2πσ²) · exp(−(X−μ)²/(2σ²)). 2. Estimate its parameters. [Figure: expression-level distribution with expectation and variance.] How do we know which parameters fit the data? Choose the parameters so the model 'fits the data'. There are many ways to measure how well a model fits the data, and different "objective functions" will produce different "estimators" (e.g., MSE, ML, MAP).

Laws of probability. If X1 … XN are a series of random variables (think datapoints): P(X1, X2) is the "joint probability" and is equal to P(X1) P(X2) if X1 and X2 are independent. P(X1 | X2) is the "conditional probability" of event X1 given X2. Conditional probabilities are related by Bayes' theorem: P(X1 | X2) = P(X2 | X1) P(X1) / P(X2). (True for all distributions.)
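A small numerical sketch of these rules on a made-up joint distribution of two binary variables (numpy assumed; all probabilities are hypothetical):

```python
import numpy as np

# Hypothetical joint distribution P(X1, X2) for two binary variables
#                  X2=0   X2=1
joint = np.array([[0.30, 0.10],    # X1 = 0
                  [0.20, 0.40]])   # X1 = 1

p_x1 = joint.sum(axis=1)   # marginal P(X1)
p_x2 = joint.sum(axis=0)   # marginal P(X2)

# Conditional probabilities
p_x1_given_x2_1 = joint[:, 1] / p_x2[1]   # P(X1 | X2 = 1)
p_x2_1_given_x1 = joint[:, 1] / p_x1      # P(X2 = 1 | X1)

# Bayes' theorem: P(X1 | X2) = P(X2 | X1) P(X1) / P(X2)
via_bayes = p_x2_1_given_x1 * p_x1 / p_x2[1]
print(p_x1_given_x2_1, via_bayes)         # identical

# Here X1 and X2 are NOT independent: P(X1, X2) != P(X1) P(X2)
print(np.allclose(joint, np.outer(p_x1, p_x2)))   # False
```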

Likelihood and MLEs. Likelihood is the probability of the data (say X) given certain parameters (say θ): L = P(X|θ). Maximum likelihood estimation says: choose θ so that the data is most probable, i.e., ∂L/∂θ = 0 at the maximum. In practice there are many ways to maximize the likelihood.

Example of ML estimation. L = P(X|θ) = P(X1 … XN | μ, σ) = Π_{i=1…5} P(Xi | μ=6.5, σ=1.5) = 6.39 × … [Figure: table of the datapoints Xi and their densities P(Xi | μ=6.5, σ=1.5); plot of L against the mean μ.]
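The slide's actual five datapoints did not survive in this transcript, so the sketch below uses made-up stand-ins (numpy/scipy assumed); it shows the computation of the likelihood as a product of densities, but the numerical value will not match the slide's:

```python
import numpy as np
from scipy.stats import norm

data = np.array([5.1, 6.0, 6.8, 7.2, 8.0])   # hypothetical stand-ins for the slide's 5 datapoints
mu, sigma = 6.5, 1.5

# Likelihood: product over the datapoints of P(X_i | mu, sigma)
L = np.prod(norm.pdf(data, loc=mu, scale=sigma))
print(L)
```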

Example of ML estimation. [Figure: log(L) plotted against the mean μ.] In practice, we almost always use the log-likelihood, which becomes a very large negative number when there is a lot of data.
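A short sketch of why the log-likelihood is preferred in practice (numpy/scipy assumed; the data are simulated): with many datapoints the product of densities underflows to zero, while the sum of log-densities is just a very large negative number:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
data = rng.normal(loc=6.5, scale=1.5, size=10_000)   # simulated data

mu, sigma = 6.5, 1.5
L = np.prod(norm.pdf(data, loc=mu, scale=sigma))       # underflows to 0.0
logL = np.sum(norm.logpdf(data, loc=mu, scale=sigma))  # a very large negative number

print(L, logL)
```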

Example of ML estimation. [Figure: surface of log(L) as a function of the mean μ and the standard deviation σ.]

ML Estimation. In general, the likelihood is a function of multiple variables, so the derivatives with respect to all of these should be zero at a maximum. In the example of the Gaussian, we have two parameters, so that ∂L/∂μ = 0 and ∂L/∂σ = 0. In general, finding MLEs means solving a set of coupled equations, which usually have to be solved numerically for complex models.
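For the Gaussian these equations can be solved by hand, but to illustrate the "solved numerically" point, here is a sketch that minimizes the negative log-likelihood with scipy.optimize (an assumed tool, not something the slides prescribe; the data are simulated):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(2)
data = rng.normal(loc=6.5, scale=1.5, size=200)   # simulated data

def neg_log_likelihood(theta):
    mu, log_sigma = theta   # optimize log(sigma) so sigma stays positive
    return -np.sum(norm.logpdf(data, loc=mu, scale=np.exp(log_sigma)))

result = minimize(neg_log_likelihood, x0=np.array([0.0, 0.0]))
mu_ml, sigma_ml = result.x[0], np.exp(result.x[1])
print(mu_ml, sigma_ml)   # close to the sample mean and the (biased) ML standard deviation
```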

MLEs for the Gaussian. The Gaussian is the symmetric continuous distribution that has as its "centre" a parameter given by what we consider the "average" (the expectation): μ_ML = (1/N) Σ_X X. The MLE for the variance of the Gaussian is the average squared error from the mean, V_ML = (1/N) Σ_X (X − μ_ML)², but this is actually a biased (though still consistent) estimator.
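A quick simulation sketch of those two formulas and of the bias remark (numpy assumed; the "true" parameters are made up): across repeated small samples the ML mean is unbiased, while the ML variance is too small by a factor of (N−1)/N on average:

```python
import numpy as np

rng = np.random.default_rng(3)
true_mu, true_var = 6.5, 1.5 ** 2

# Repeatedly draw small samples and compute the ML estimates
n, reps = 5, 100_000
X = rng.normal(true_mu, np.sqrt(true_var), size=(reps, n))

mu_ml = X.mean(axis=1)                             # mu_ML = (1/N) sum X
var_ml = ((X - mu_ml[:, None]) ** 2).mean(axis=1)  # V_ML = (1/N) sum (X - mu_ML)^2

print(mu_ml.mean())   # ~ 6.5: the ML mean is unbiased
print(var_ml.mean())  # ~ (N-1)/N * true_var = 0.8 * 2.25 = 1.8: biased downwards
print(true_var)       # 2.25
```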

Other estimators. Instead of the likelihood, L = P(X|θ), we can choose parameters to maximize the posterior probability, P(θ|X); or minimize the sum of squared errors, Σ_X (X − μ_MSE)²; or maximize a penalized likelihood, L* = P(X|θ) × e^(−θ²). In each case, estimation involves a mathematical optimization problem that usually has to be solved on a computer. How do we choose?
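A sketch contrasting these objectives for estimating a single mean μ (numpy/scipy assumed; σ is fixed at 1 and the data are simulated; the e^(−θ²) penalty is the one written on the slide): ML and least squares give the same answer for a Gaussian, while the penalized estimate is shrunk toward zero (it corresponds to a MAP estimate under a Gaussian prior on μ):

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import norm

rng = np.random.default_rng(4)
data = rng.normal(loc=2.0, scale=1.0, size=10)   # simulated data, sigma assumed known = 1

def neg_log_likelihood(mu):
    return -np.sum(norm.logpdf(data, loc=mu, scale=1.0))

# Maximum likelihood and least squares: for a Gaussian with known sigma these coincide
mu_ml  = minimize_scalar(neg_log_likelihood).x
mu_mse = minimize_scalar(lambda mu: np.sum((data - mu) ** 2)).x

# Penalized likelihood from the slide, L* = P(X|theta) * exp(-theta^2),
# maximized on the log scale (minimize the negative penalized log-likelihood)
mu_pen = minimize_scalar(lambda mu: neg_log_likelihood(mu) + mu ** 2).x

print(mu_ml, mu_mse, mu_pen)   # ML and MSE agree; the penalized estimate is shrunk toward 0
```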