3rd Place Winning Project, 2009 USPROC. Authors: Kinjal Basu, Sujayam Saha. Sponsoring Professors: S. Ghosh, A.K. Ghosh. Indian Statistical Institute, Kolkata, India



Statistics: An Integral Part of Genetic Research

The problem is the localization of a bi-allelic gene controlling a quantitative trait. The (unknown) distribution of the trait data depends on genotype, i.e. we have a mixture of three distributions, each corresponding to a genotype. Our quest is to estimate p, the frequency of allele A, from a mixture distribution with mixing proportions p², 2pq and q² (q = 1 − p), due to genotypes AA, Aa and aa.
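To make the setup concrete, here is a minimal simulation sketch of data arising from this model. The Gaussian components, parameter values and function name are illustrative assumptions (anticipating the mixture used in the simulations below), not the original project code:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_trait(n, p, means=(3.0, 0.0, -3.0), sds=(1.0, 1.0, 1.0)):
    """Draw n trait values from the genotype mixture.

    Genotypes AA, Aa, aa occur with proportions p^2, 2pq, q^2
    (q = 1 - p); each genotype contributes one component,
    taken to be Gaussian here purely for illustration.
    """
    q = 1.0 - p
    weights = np.array([p**2, 2 * p * q, q**2])
    genotype = rng.choice(3, size=n, p=weights)  # 0 = AA, 1 = Aa, 2 = aa
    return rng.normal(np.take(means, genotype), np.take(sds, genotype))

x = simulate_trait(1000, p=0.45)  # the N(3,1)/N(0,1)/N(-3,1), p = 0.45 case below
```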

Cluster analysis gives us estimates that are used both on their own and as initial guesses for other methods. For the sake of algebraic simplicity, we begin by assuming the data follow a Gaussian mixture model. We test two methods, based on EM and CEM respectively. We next investigate two categories of departure from normality: (a) asymmetric distributions and (b) heavy-tailed distributions.

Using the 3-means algorithm we find the three clusters. We then need to decide which cluster corresponds to which genotype: we assign the bigger of the two extreme clusters to AA and the smaller one to aa. If n₁, n₂ and n₃ are the cluster sizes corresponding to the AA, Aa and aa genotypes respectively, then the MLE of p is p̂ = (2n₁ + n₂) / (2(n₁ + n₂ + n₃)).
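A minimal sketch of this step, using scikit-learn's KMeans (the function name and the number of restarts are illustrative choices):

```python
import numpy as np
from sklearn.cluster import KMeans

def allele_freq_3means(x, seed=0):
    """Estimate p via 3-means: bigger extreme cluster -> AA, smaller -> aa."""
    x = np.asarray(x, float)
    labels = KMeans(n_clusters=3, n_init=10, random_state=seed) \
        .fit_predict(x.reshape(-1, 1))
    # Order the clusters by their means so the extremes come first and last.
    order = np.argsort([x[labels == k].mean() for k in range(3)])
    lo, mid, hi = (np.sum(labels == k) for k in order)
    n_AA, n_aa = max(lo, hi), min(lo, hi)  # larger extreme cluster is AA
    return (2 * n_AA + mid) / (2 * (n_AA + mid + n_aa))

p_hat = allele_freq_3means(x)
```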

A mixture of N(3,1), N(0,1) and N(−3,1) with p = 0.45. To analyze the data assuming an underlying Gaussian mixture distribution, we use the EM and CEM algorithms: the E-step uses the posterior expectations of the indicator variables given the data, and the M-step uses the standard results for the Gaussian model (here the mean and variance are interpreted as the weighted mean and variance, with the indicator variables as weights).
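A minimal sketch of these E- and M-steps under the stated assumptions. In the project the initial guesses come from the cluster analysis; here a quantile-based default, the function name and the stopping rule are illustrative stand-ins:

```python
import numpy as np
from scipy.stats import norm

def em_allele_freq(x, p0=0.5, mu0=None, sd0=None, iters=200, tol=1e-8):
    """EM for a 3-component Gaussian mixture with HWE weights (p^2, 2pq, q^2)."""
    x = np.asarray(x, float)
    n = len(x)
    mu = np.quantile(x, [0.85, 0.50, 0.15]) if mu0 is None else np.asarray(mu0, float)
    sd = np.full(3, x.std()) if sd0 is None else np.asarray(sd0, float)
    p = p0
    for _ in range(iters):
        q = 1.0 - p
        w = np.array([p**2, 2 * p * q, q**2])        # weights for AA, Aa, aa
        dens = w * norm.pdf(x[:, None], mu, sd)       # (n, 3) weighted densities
        r = dens / dens.sum(axis=1, keepdims=True)    # E-step: posterior indicators
        # M-step: indicator-weighted means/SDs; p from expected allele counts.
        nk = r.sum(axis=0)
        mu = (r * x[:, None]).sum(axis=0) / nk
        sd = np.sqrt((r * (x[:, None] - mu)**2).sum(axis=0) / nk)
        p_new = (2 * nk[0] + nk[1]) / (2 * n)
        if abs(p_new - p) < tol:
            p = p_new
            break
        p = p_new
    return p, mu, sd

p_hat, mu_hat, sd_hat = em_allele_freq(x)
```

CEM differs only in hardening the posteriors r to 0/1 (assigning each point to its most probable genotype) before the M-step.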

 As the separation between the means increases, the MSE decreases.
 EM gives better results than 3-means; CEM is unsatisfactory.
 As p approaches 1, the performance of all the methods deteriorates. This is probably because the cluster corresponding to q² vanishes at a quadratic rate.

In multi-dimensional data, treating each variable separately means that information on the interdependencies between the variables is not used at all. Thus, a vector-valued estimation algorithm is called for. We choose the multivariate normal to model the data and use a multivariate analogue of the theory in Slide 5 to estimate p. Overall, EM was better than the other two methods. EM and CEM mostly gave comparable MSEs, but their superiority over 3-means was not evident in some cases, especially for p = 0.6.
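A sketch of how the E-step generalizes, assuming multivariate normal components (the function name and interface are illustrative):

```python
import numpy as np
from scipy.stats import multivariate_normal

def responsibilities_mv(X, p, means, covs):
    """E-step for d-dimensional traits: posterior genotype probabilities.

    means: three mean vectors; covs: three covariance matrices (one per
    genotype). The mixing weights follow HWE exactly as in the 1-D case.
    """
    q = 1.0 - p
    w = np.array([p**2, 2 * p * q, q**2])
    dens = np.column_stack([w[k] * multivariate_normal.pdf(X, means[k], covs[k])
                            for k in range(3)])
    return dens / dens.sum(axis=1, keepdims=True)
```

The M-step then replaces the weighted means and variances with responsibility-weighted mean vectors and covariance matrices, and updates p from the expected genotype counts exactly as in the univariate case.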

Here we transform the original asymmetric data into symmetric data using an appropriate value of λ (the Box-Cox family): y → (yλ − 1)/λ if λ ≠ 0, and y → ln(y) if λ = 0. Criterion for the choice of λ: maximize the ratio of between-group to within-group variance. The log-normal distribution is brought to normality by λ = 0, and the chi-squared distribution by λ = 0.5.
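A minimal sketch of the grid search, assuming cluster labels from a preliminary 3-means run; the grid bounds, step size and function names are illustrative assumptions:

```python
import numpy as np

def boxcox(y, lam):
    """Box-Cox transform; y must be positive."""
    return np.log(y) if lam == 0 else (y**lam - 1.0) / lam

def between_within_ratio(z, labels):
    """Ratio of between-group to within-group variance for given clusters."""
    groups = [z[labels == k] for k in np.unique(labels)]
    grand = z.mean()
    between = sum(len(g) * (g.mean() - grand)**2 for g in groups)
    within = sum(((g - g.mean())**2).sum() for g in groups)
    return between / within

def choose_lambda(y, labels, grid=np.linspace(-2, 2, 41)):
    """Pick lambda on a regular grid maximizing the variance ratio."""
    return max(grid, key=lambda lam: between_within_ratio(boxcox(y, lam), labels))
```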

Using a regular grid of points for λ, we see that almost always (more than 95% of the time) the correct λ, or a nearby value, is chosen by the algorithm. Performance remains similar across the different values; however, there is a drop in performance due to the added variability from having to choose λ.

Many heavy-tailed distributions, such as the Cauchy and the t-distribution with 2 degrees of freedom, do not have finite first two moments. In these cases we cannot use the sample mean and variance to estimate the location and scale parameters of the population. Instead we use the sample median and the quartile deviation to estimate the location and scale parameters. Using quantiles instead of moments also helps increase the robustness of the algorithms to outliers in the data, so this algorithm can also be used when robustness is required even if the distribution is not suspected to be heavy-tailed.
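A small sketch of the quantile-based estimates. The 0.6745 calibration constant, which makes the quartile deviation comparable to a Gaussian standard deviation, is our assumption; the slides do not specify it:

```python
import numpy as np

def robust_location_scale(x):
    """Median and quartile deviation as robust location/scale estimates.

    For a Gaussian, QD = 0.6745 * sigma, so dividing by 0.6745 puts the
    scale on the same footing as a standard deviation (an assumed
    calibration, not stated in the original slides).
    """
    x = np.asarray(x, float)
    q1, med, q3 = np.percentile(x, [25, 50, 75])
    qd = (q3 - q1) / 2.0  # quartile deviation (semi-interquartile range)
    return med, qd / 0.6745
```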

With p = 0.5, the classification should have been 250, 500 and 250. In the presence of an outlier, 3-means produces clusters of size 984, 15 and 1, with the outlier forming a singleton cluster; 3-medoids produces clusters of size 299, 421 and 280, so the three clusters have comparable numbers of elements and the actual classification is essentially recovered. Thus, 3-medoids gives much better results in the presence of outliers.
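The 3-medoids step could look like the following 1-D sketch, a simple alternating scheme with absolute distance; the project may have used a different k-medoids variant (e.g. PAM), so treat this purely as an illustration:

```python
import numpy as np

def three_medoids_1d(x, iters=100, seed=0):
    """Alternating 3-medoids for 1-D data with |.| as the distance.

    Medoids must be data points; with absolute distance the within-cluster
    medoid is the cluster median snapped to the nearest observation, which
    is what makes the method resistant to outliers.
    """
    x = np.asarray(x, float)
    rng = np.random.default_rng(seed)
    medoids = rng.choice(x, size=3, replace=False)
    for _ in range(iters):
        labels = np.argmin(np.abs(x[:, None] - medoids), axis=1)
        new = medoids.copy()
        for k in range(3):
            pts = x[labels == k]
            if len(pts):
                med = np.median(pts)
                new[k] = pts[np.argmin(np.abs(pts - med))]  # snap to a data point
        if np.allclose(new, medoids):
            break
        medoids = new
    labels = np.argmin(np.abs(x[:, None] - medoids), axis=1)
    return medoids, labels
```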

The robust algorithms protect us from outliers unduly distorting the estimates, but at the cost of some loss of efficiency relative to the standard EM algorithm.

 Data were collected from an ongoing clinical survey on Type 2 diabetes at the Madras Diabetes Research Foundation, Chennai, India, covering roughly 500 patients and 9 different fields.
 Preliminary analysis revealed some perfect linear dependencies, which helped us reduce the dimensionality of the multivariate estimates.
 We ran the data through the univariate algorithms (each variable separately) and also through the multivariate routine using 6 fields.

(i) Results from the multivariate analysis: 3-medoids, EM and CEM estimates (values shown in a table in the original slide). The consistency of the results shows that the multivariate normal is a good fit for the data.
(ii) Results from the univariate analysis (table in the original slide).

 We see that for the phenotypes FBS-INS, IR, CHO, TRI and HDL, the estimate of p is almost consistent except under the EM and CEM algorithms. The reason may be that the distribution does not follow a Gaussian model, or that the data contain extreme outliers.
 For LDL, robust EM and CEM give consistent values but the initial cluster analysis does not, implying that though 3-medoids was not entirely accurate, the initial estimate still yielded a consistent solution.
 For BMI and FBS, we have consistent solutions for the EM and CEM algorithms, but sensitivity decreases under robustification. This implies that the underlying model is most likely Gaussian.
 If some phenotypes return the same p, and we have prior biological knowledge that their controlling genes may be the same, it is probably true that the same gene controls those phenotypes. This work will immensely help in identifying such phenotypes.

Using the simulation results, we propose the following as the optimal procedure for estimating the allele frequency. First, execute the 3-medoids algorithm to estimate the location and scale parameters of the three clusters and to obtain a crude estimate of p. Then, starting from these crude estimates, run the EM algorithm over a grid of λ values and choose the λ with the maximum between-group to within-group variance ratio. Finally, check graphically whether the data contain outliers; if so, use the robust EM, otherwise the usual EM, to obtain the final estimate of p, the allele frequency.

Madras Diabetes Research Foundation, Chennai, India