CSE 5290: Algorithms for Bioinformatics Fall 2011

CSE 5290: Algorithms for Bioinformatics, Fall 2011
Suprakash Datta (datta@cse.yorku.ca)
Office: CSEB 3043, Phone: 416-736-2100 ext 77875
Course page: http://www.cse.yorku.ca/course/5290

Next
Clustering revisited: Expectation Maximization and Gaussian mixture model fitting.
Some of the following slides are based on slides by Christopher M. Bishop, Microsoft Research, http://research.microsoft.com/~cmbishop

Old Faithful

Old Faithful Data Set (scatter plot: duration of eruption in minutes vs. time between eruptions in minutes)

K-means Algorithm
Goal: represent a data set in terms of K clusters, each of which is summarized by a prototype.
Initialize the prototypes, then iterate between two phases:
E-step: assign each data point to the nearest prototype.
M-step: update the prototypes to be the cluster means.
The simplest version is based on Euclidean distance (re-scale the Old Faithful data first).
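A minimal NumPy sketch of this two-phase iteration (the function name, initialization strategy, and stopping rule are illustrative choices, not the course's code):

```python
import numpy as np

def kmeans(X, K, n_iters=100, seed=None):
    """Plain K-means with Euclidean distance (Lloyd's algorithm)."""
    rng = np.random.default_rng(seed)
    # Initialize prototypes as K distinct data points chosen at random.
    mu = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(n_iters):
        # E-step: assign each point to its nearest prototype.
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)   # (N, K) squared distances
        z = d2.argmin(axis=1)
        # M-step: move each prototype to the mean of its assigned points.
        new_mu = np.array([X[z == k].mean(axis=0) if np.any(z == k) else mu[k]
                           for k in range(K)])
        if np.allclose(new_mu, mu):   # converged: assignments can no longer change
            break
        mu = new_mu
    return mu, z

# Usage on standardized data, as the slide suggests for Old Faithful:
# Xs = (X - X.mean(axis=0)) / X.std(axis=0); prototypes, labels = kmeans(Xs, K=2)
```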

Responsibilities
Binary responsibilities $r_{nk} \in \{0, 1\}$ assign data points to clusters such that each data point belongs to exactly one cluster, i.e. $\sum_k r_{nk} = 1$ for every $n$.
Example: 5 data points and 3 clusters.

K-means Cost Function (in terms of the data $x_n$, the prototypes $\mu_k$, and the responsibilities $r_{nk}$)
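Written out, the standard K-means distortion measure that these labels refer to (following Bishop's slides) is

$$ J = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk}\, \lVert x_n - \mu_k \rVert^2 . $$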

Minimizing the Cost Function
E-step: minimize $J$ w.r.t. the responsibilities $r_{nk}$; this assigns each data point to its nearest prototype.
M-step: minimize $J$ w.r.t. the prototypes $\mu_k$; this sets each prototype to the mean of the points in its cluster.
Convergence is guaranteed, since there is only a finite number of possible settings for the responsibilities.

Evolution of J

Limitations of K-means
Hard assignments of data points to clusters: a small shift of a data point can flip it to a different cluster.
Not clear how to choose the value of K.
Solution: replace the 'hard' clustering of K-means with 'soft' probabilistic assignments, representing the probability distribution of the data as a Gaussian mixture model.

The Gaussian Distribution
Multivariate Gaussian, with mean $\mu$ and covariance $\Sigma$; define the precision to be the inverse of the covariance, $\Lambda = \Sigma^{-1}$. In one dimension the parameters are the mean $\mu$ and variance $\sigma^2$.
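For reference, the standard forms of the densities this slide shows:

$$ \mathcal{N}(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{D/2}\,|\Sigma|^{1/2}} \exp\Big\{ -\tfrac{1}{2} (x-\mu)^{\mathsf T} \Sigma^{-1} (x-\mu) \Big\}, \qquad \Lambda \equiv \Sigma^{-1}, $$

$$ \mathcal{N}(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\Big\{ -\frac{(x-\mu)^2}{2\sigma^2} \Big\} \quad \text{(one-dimensional case)}. $$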

Likelihood Function
Data set $\mathbf{X} = \{x_1, \dots, x_N\}$. Assume the observed data points are generated independently. Viewed as a function of the parameters, the resulting joint density is known as the likelihood function.
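With independent, identically distributed Gaussian data, the likelihood is

$$ p(\mathbf{X} \mid \mu, \Sigma) = \prod_{n=1}^{N} \mathcal{N}(x_n \mid \mu, \Sigma). $$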

Maximum Likelihood
Set the parameters by maximizing the likelihood function, or equivalently by maximizing the log likelihood.
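For the Gaussian, the log likelihood takes the standard form

$$ \ln p(\mathbf{X} \mid \mu, \Sigma) = -\frac{ND}{2}\ln(2\pi) - \frac{N}{2}\ln|\Sigma| - \frac{1}{2}\sum_{n=1}^{N}(x_n-\mu)^{\mathsf T}\Sigma^{-1}(x_n-\mu). $$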

Maximum Likelihood Solution
Maximizing w.r.t. the mean gives the sample mean; maximizing w.r.t. the covariance gives the sample covariance.
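Explicitly, the standard solutions are

$$ \mu_{\mathrm{ML}} = \frac{1}{N}\sum_{n=1}^{N} x_n, \qquad \Sigma_{\mathrm{ML}} = \frac{1}{N}\sum_{n=1}^{N} (x_n - \mu_{\mathrm{ML}})(x_n - \mu_{\mathrm{ML}})^{\mathsf T}. $$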

Bias of Maximum Likelihood
Consider the expectations of the maximum likelihood estimates under the Gaussian distribution: the mean is unbiased, $\mathbb{E}[\mu_{\mathrm{ML}}] = \mu$, but $\mathbb{E}[\Sigma_{\mathrm{ML}}] = \frac{N-1}{N}\Sigma$, so the maximum likelihood solution systematically under-estimates the covariance. This is an example of over-fitting.

Intuitive Explanation of Over-fitting

Unbiased Variance Estimate
Clearly we can remove the bias by re-scaling $\Sigma_{\mathrm{ML}}$ by $N/(N-1)$, since the resulting estimator has expectation equal to the true covariance. The same correction arises naturally in a Bayesian treatment, and for an infinite data set the two expressions are equal.
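The unbiased estimator referred to here is

$$ \tilde{\Sigma} = \frac{N}{N-1}\,\Sigma_{\mathrm{ML}} = \frac{1}{N-1}\sum_{n=1}^{N}(x_n - \mu_{\mathrm{ML}})(x_n - \mu_{\mathrm{ML}})^{\mathsf T}, \qquad \mathbb{E}[\tilde{\Sigma}] = \Sigma. $$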

Gaussian Mixtures
A linear superposition of Gaussians. Normalization and positivity constrain the mixing coefficients, which can be interpreted as prior probabilities.
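In symbols, the mixture density and its constraints are

$$ p(x) = \sum_{k=1}^{K} \pi_k\, \mathcal{N}(x \mid \mu_k, \Sigma_k), \qquad \sum_{k=1}^{K} \pi_k = 1, \qquad 0 \le \pi_k \le 1. $$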

Example: Mixture of 3 Gaussians

Contours of Probability Distribution

Sampling from the Gaussian Mixture
To generate a data point: first pick one of the components with probability given by its mixing coefficient $\pi_k$, then draw a sample from that component. Repeat these two steps for each new data point.
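A minimal sketch of this two-step (ancestral) sampling procedure; the function and variable names are illustrative, not from the slides:

```python
import numpy as np

def sample_gmm(n, pis, mus, covs, seed=None):
    """Draw n points from a Gaussian mixture: pick a component, then sample from it."""
    rng = np.random.default_rng(seed)
    ks = rng.choice(len(pis), size=n, p=pis)          # component label for each point
    X = np.array([rng.multivariate_normal(mus[k], covs[k]) for k in ks])
    return X, ks                                      # ks are the labels we later pretend not to know

# e.g. a 3-component mixture in 2-D:
# X, labels = sample_gmm(500, pis=[0.5, 0.3, 0.2], mus=mus, covs=covs)
```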

Synthetic Data Set

Fitting the Gaussian Mixture
We wish to invert this process: given the data set, find the corresponding parameters (mixing coefficients, means, covariances).
If we knew which component generated each data point, the maximum likelihood solution would involve fitting each component to the corresponding cluster.
Problem: the data set is unlabelled.
We shall refer to the labels as latent (i.e. hidden) variables.

Synthetic Data Set Without Labels

Posterior Probabilities
We can think of the mixing coefficients as prior probabilities for the components. For a given value of $x$ we can evaluate the corresponding posterior probabilities, called responsibilities. These are given by Bayes' theorem:
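The standard expression, with $\gamma_k(x)$ denoting the responsibility of component $k$ for $x$:

$$ \gamma_k(x) \equiv p(k \mid x) = \frac{\pi_k\, \mathcal{N}(x \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j\, \mathcal{N}(x \mid \mu_j, \Sigma_j)}. $$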

Posterior Probabilities (colour coded)

Posterior Probability Map

Maximum Likelihood for the GMM
The log likelihood function takes the form shown below. Note that the sum over components appears inside the logarithm, so there is no closed-form solution for the maximum likelihood parameters.
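In symbols (the standard GMM log likelihood):

$$ \ln p(\mathbf{X} \mid \pi, \mu, \Sigma) = \sum_{n=1}^{N} \ln \Big\{ \sum_{k=1}^{K} \pi_k\, \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \Big\}. $$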

Over-fitting in GMM
Singularities arise in the likelihood function when a component 'collapses' onto a data point: setting $\mu_k = x_n$, the term $\pi_k\,\mathcal{N}(x_n \mid x_n, \sigma_k^2 I) \propto \sigma_k^{-D}$ diverges as $\sigma_k \to 0$.
The likelihood function also gets larger as we add more components (and hence parameters) to the model, so it is not clear how to choose the number K of components.

Problems and Solutions
How to maximize the log likelihood: solved by the expectation-maximization (EM) algorithm.
How to avoid singularities in the likelihood function: solved by a Bayesian treatment (not covered here).
How to choose the number K of components: also solved by a Bayesian treatment (not covered here).

EM Algorithm – Informal Derivation
Let us proceed by simply differentiating the log likelihood. Setting the derivative with respect to $\mu_k$ equal to zero gives the update below, which is simply the weighted mean of the data.
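The standard result, with $\gamma(z_{nk})$ the responsibility of component $k$ for point $x_n$ and $N_k = \sum_n \gamma(z_{nk})$ the effective number of points assigned to it:

$$ \gamma(z_{nk}) = \frac{\pi_k\, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_{j} \pi_j\, \mathcal{N}(x_n \mid \mu_j, \Sigma_j)}, \qquad \mu_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk})\, x_n. $$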

EM Algorithm – Informal Derivation II
Similarly for the covariances. For the mixing coefficients, use a Lagrange multiplier to enforce $\sum_k \pi_k = 1$; this gives the updates below.
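These are the standard M-step formulas:

$$ \Sigma_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk})\, (x_n - \mu_k)(x_n - \mu_k)^{\mathsf T}, \qquad \pi_k = \frac{N_k}{N}. $$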

EM Algorithm – Informal Derivation III
The solutions are not in closed form, since they are coupled (the responsibilities depend on the parameters and vice versa). This suggests an iterative scheme for solving them:
Make initial guesses for the parameters.
Alternate between the following two stages:
E-step: evaluate the responsibilities.
M-step: update the parameters using the ML results.
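A compact NumPy/SciPy sketch of this loop. It is a sketch under the usual assumptions (random initialization from data points, a small ridge added to the covariances for numerical stability, a fixed number of iterations), not the course's reference implementation:

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iters=100, seed=None):
    """Fit a K-component Gaussian mixture to X (N x D) by expectation-maximization."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    pis = np.full(K, 1.0 / K)
    mus = X[rng.choice(N, size=K, replace=False)]               # initial means: random data points
    covs = np.array([np.cov(X.T) + 1e-6 * np.eye(D) for _ in range(K)])
    for _ in range(n_iters):
        # E-step: responsibilities gamma[n, k] = p(component k | x_n)
        dens = np.column_stack([pis[k] * multivariate_normal.pdf(X, mus[k], covs[k])
                                for k in range(K)])
        gamma = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate parameters from the responsibility-weighted data
        Nk = gamma.sum(axis=0)
        mus = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mus[k]
            covs[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(D)
        pis = Nk / N
    return pis, mus, covs, gamma
```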

EM – Latent Variable Viewpoint
Introduce binary latent variables $z_{nk}$ describing which component generated each data point.
If we knew the values of the latent variables, we would maximize the complete-data log likelihood, which has a trivial closed-form solution (fit each component to the corresponding set of data points).
We do not know the values of the latent variables, but for given parameter values we can compute their expected values.
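The complete-data log likelihood referred to here is the standard expression

$$ \ln p(\mathbf{X}, \mathbf{Z} \mid \pi, \mu, \Sigma) = \sum_{n=1}^{N} \sum_{k=1}^{K} z_{nk}\, \big\{ \ln \pi_k + \ln \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \big\}. $$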

Expected Value of Latent Variable
From Bayes' theorem, the expected value of $z_{nk}$ given $x_n$ is exactly the responsibility $\gamma(z_{nk})$ defined earlier.

Complete and Incomplete Data

Next
A non-bioinformatics topic: Epidemiology.
Some of the following slides are based on slides by Dr Bill Hackborn: http://www.augustana.ualberta.ca/~hackw/mat332/

The problem
Build a quantitative model for epidemics.
Useful for computer viruses as well.
Tradeoff between accuracy and tractability.

What is a Mathematical Model?
A mathematical description of a scenario or situation from the real world.
Focuses on specific quantitative features of the scenario, ignores others.
A simplification, abstraction, 'cartoon'.
Involves hypotheses that can be tested against real data and refined if desired.
One purpose is improved understanding of the real-world scenario, e.g. celestial motion, chemical kinetics.

Susceptible, Infected, Recovered: the SIR Model of an Epidemic

The SIR Epidemic Model
First studied by Kermack & McKendrick, 1927.
Consider a disease spread by contact with infected individuals; individuals recover from the disease and gain further immunity from it.
S = fraction of susceptibles in a population
I = fraction of infecteds in a population
R = fraction of recovereds in a population
S + I + R = 1

The SIR Epidemic Model II
The model consists of differential equations involving the variables S, I, and R and their rates of change with respect to time t, shown below. An equivalent compartment diagram is S → I → R.
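In the notation of the next slide (infection rate r, removal rate a), the standard SIR equations for the fractions S, I, R are

$$ \frac{dS}{dt} = -\,r\,S\,I, \qquad \frac{dI}{dt} = r\,S\,I - a\,I, \qquad \frac{dR}{dt} = a\,I, $$

with S + I + R = 1 conserved.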

Parameters of the Model
r = the infection rate
a = the removal rate
The basic reproduction number is obtained from these parameters: $N_R = r/a$. This number represents the average number of infections caused by one infective in a totally susceptible population. As such, an epidemic can occur only if $N_R > 1$.

What does it imply? Typical behaviour

Epidemic size Final epidemic size representing the number of nodes that became infectious during the whole epidemic, plotted as a function of rSE in the absence of tracing. Continuous line corresponds to random networks and dashed line to SF networks. Kiss I Z et al. J. R. Soc. Interface 2006;3:55-62

Threshold phenomena
Prediction of the threshold is critical.
Control mechanisms?

What does it not model?
Heterogeneity
Carriers
Effect of immunizations
Deaths

Vaccination and Herd Immunity
If only a fraction $S_0$ of the population is susceptible, the reproduction number is $N_R S_0$, and an epidemic can occur only if this number exceeds 1.
Suppose a fraction V of the population is vaccinated against the disease. In this case $S_0 = 1 - V$, and no epidemic can occur if $V > 1 - 1/N_R$.
The basic reproduction number $N_R$ can vary from 3 to 5 for smallpox, 16 to 18 for measles, and over 100 for malaria [Keeling, 2001].
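As a quick check using the figures quoted above: taking $N_R \approx 17$ for measles gives $V > 1 - 1/17 \approx 0.94$, i.e. roughly 94% vaccination coverage is needed, whereas $N_R \approx 4$ for smallpox gives $V > 0.75$.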

Case Study: Boarding School Flu

Boarding School Flu (Cont'd)
Time is measured in days; r = 1.66, a = 0.44, and $N_R$ = 3.8.
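A short SciPy sketch that integrates the SIR equations with these parameters; the initial conditions are illustrative assumptions, not the actual boarding-school data:

```python
import numpy as np
from scipy.integrate import solve_ivp

def sir_rhs(t, y, r, a):
    """Right-hand side of the SIR equations for the fractions S, I, R."""
    S, I, R = y
    return [-r * S * I, r * S * I - a * I, a * I]

r, a = 1.66, 0.44                      # per day, as on the slide (N_R = r/a ~ 3.8)
y0 = [0.999, 0.001, 0.0]               # assumed: 1 infective per 1000, rest susceptible
sol = solve_ivp(sir_rhs, (0, 30), y0, args=(r, a), dense_output=True)

t = np.linspace(0, 30, 301)
S, I, R = sol.sol(t)
print(f"peak infected fraction: {I.max():.2f} on day {t[I.argmax()]:.1f}")
```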

Flu at Hypothetical Hospital
In this case, new susceptibles are arriving and those of all classes are leaving.

Flu at Hypothetical Hospital II
Parameters r and a are as before. New parameters b = l = 1/14, representing an average turnover time of 14 days. The disease becomes endemic.

Case Study: Bombay Plague, 1905-6
The R in SIR often means removed (due to death, quarantine, etc.), not recovered.

Enhancing the SIR Model
Can consider additional populations of disease vectors (e.g. fleas, rats).
Can consider an exposed (but not yet infected) class: the SEIR model.
SIRS, SIS, and double (gendered) models are sometimes used for STDs.
Can consider biased mixing, age differences, multiple types of transmission, geographic spread, etc.

Why Study Epidemic Models?
To supplement statistical extrapolation.
To learn more about the qualitative dynamics of a disease.
To test hypotheses about, for example, prevention strategies, disease transmission, significant characteristics, etc.

References
J. D. Murray, Mathematical Biology, Springer-Verlag, 1989.
O. Diekmann & A. P. Heesterbeek, Mathematical Epidemiology of Infectious Diseases, Wiley, 2000.
Matt Keeling, The Mathematics of Diseases, http://plus.maths.org, 2004.
Allyn Jackson, Modeling the AIDS Epidemic, Notices of the American Mathematical Society, 36:981-983, 1989.