Elizabeth Garrett Giovanni Parmigiani

Presentation transcript:

Normalization in the Presence of Differential Expression in a Large Subset of Genes
Elizabeth Garrett, Giovanni Parmigiani

Motivation (again)
Class discovery: find breast cancer subtypes within 81 previously unclassified breast cancer tumor samples.
Gene selection: find a small subset of genes that allows us to cluster the tumor samples.
Gene clustering: look for genes that are differentially expressed and genes that behave similarly.

Raw data: log gene expression median versus log gene expression in sample i

Problems with the raw data:
– “V” pattern in many of the slides
– Curvature
– Non-constant variance

“V” Patterns Debate
We thought… oops, something went wrong in the lab. We should either correct the V’s so that we see only one line, or remove the genes that are causing the V.
They (i.e., the “experts”) thought… it’s REAL differential expression!
Assuming it is real, how do we normalize to straighten the plot and stabilize the variance?

Crude Initial Approach
Fit a regression to each plot and identify points with large negative (or positive) residuals. Remove the genes with large negative (or positive) residuals (and high abundance?) and normalize using the remaining points.
Problem: points near the origin get truncated in an odd way, and there is no obvious way to decide which points near the origin to include or exclude.
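A minimal sketch of this crude approach on simulated data (the data, the 2-standard-deviation cutoff, and all variable names are illustrative assumptions; the slide specifies no particular threshold):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: log expression in sample i vs. log median expression.
x = rng.uniform(0, 4, 500)                    # log median expression
y = 0.1 + 1.0 * x + rng.normal(0, 0.2, 500)   # log expression in sample i
y[:50] += 1.5                                 # a "V" arm of differentially expressed genes

# Fit a least-squares line and compute residuals.
slope, intercept = np.polyfit(x, y, 1)
resid = y - (intercept + slope * x)

# Flag genes with large residuals (the cutoff here is an arbitrary choice).
cutoff = 2 * resid.std()
keep = np.abs(resid) < cutoff

# Re-fit the normalization line using only the remaining points.
slope2, intercept2 = np.polyfit(x[keep], y[keep], 1)
```

This reproduces the truncation problem the slide describes: the cutoff is a band around the line, so near the origin, where the V arms have not yet separated, the inclusion rule is arbitrary.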

High abundance = 3 or greater

A “better” (and not hard to implement) approach
1. Assume 2 classes of genes: class 0 and class 1.
2. Take a subset of samples where the V is obvious (we picked four samples).
3. Fit a latent variable model using MCMC to predict which genes are in class 1 and which are in class 0.

Latent Variable Model
Allow different slopes and intercepts for the two classes of genes. (The slide showed the model equations.)
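The model equations are not preserved in the transcript; a plausible sketch of a two-class regression model of this kind (the symbols $y_{sg}$, $x_g$, $a_{js}$, $b_{js}$, $c_g$, $\pi$ are assumptions, not taken from the slide) is:

$$
y_{sg} \mid c_g = j \;\sim\; N\!\left(a_{js} + b_{js}\,x_g,\; \sigma_{js}^2\right), \quad j \in \{0,1\},
\qquad c_g \sim \mathrm{Bernoulli}(\pi),
$$

where $y_{sg}$ is the log expression of gene $g$ in sample $s$, $x_g$ is its log median expression, and $c_g$ is the latent class indicator.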

Results
The goal is to estimate the gene classes c_g; the regression parameters are nuisance parameters.
Based on the chain, we estimate π_g = P(c_g = 1): at each iteration, each gene is assigned to class 0 or class 1; by averaging the class assignments over iterations, we get the posterior probability of class membership.
To do the normalization, we restrict attention to genes with π_g < 0.95.
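A minimal sketch of the averaging step, assuming a saved matrix of per-iteration class indicators (the names and the random stand-in for MCMC output are illustrative; no sampler is shown):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical MCMC output: rows = iterations, columns = genes,
# entries = the sampled class indicator c_g (0 or 1) at that iteration.
n_iter, n_genes = 2000, 10
draws = rng.integers(0, 2, size=(n_iter, n_genes))

# Posterior probability of class 1 membership: average the 0/1
# class assignments over iterations.
pi_g = draws.mean(axis=0)

# Genes retained as the normalization reference: pi_g < 0.95.
reference = pi_g < 0.95
```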

Posterior Probabilities of Class Membership

Normalization
Use loess normalization where the class 0 genes are the reference:
r_sg = residuals = y_sg − loess fit
(Figure: Sample 43)
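A sketch of this step in Python using statsmodels' `lowess` (the slides use R's `loess`; the data here are simulated, and because `lowess` takes no weights argument, fitting on the reference genes only stands in for R's `weights = 1 - c_g`):

```python
import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

rng = np.random.default_rng(2)

# Simulated data: x = log median expression, y = log expression in one sample.
x = rng.uniform(0, 4, 300)
y = 0.2 + x + 0.1 * np.sin(x) + rng.normal(0, 0.15, 300)
is_class0 = rng.random(300) < 0.9   # hypothetical reference-gene indicator

# Fit lowess on the reference (class 0) genes only.
fit = lowess(y[is_class0], x[is_class0], frac=0.5)  # sorted (x, yhat) pairs

# Evaluate the fitted curve at every gene and take residuals r_sg.
yhat = np.interp(x, fit[:, 0], fit[:, 1])
r = y - yhat
```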

Before and after loess normalization (R function “loess” with weights = 1 − c_g). (Figure panels: Before, After)

Variance Stabilization
1. Take the residuals from the previous loess fit.
2. Fit a loess to the squared residuals versus the median.
3. The square root of the fitted value approximates the standard deviation.
4. Rescale so that overall slide variability is not lost, by dividing by the average slide variance.
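The steps above can be sketched as follows (simulated heteroskedastic residuals; the smoothing span and the clip guarding against a negative fitted variance are implementation choices, not from the slides):

```python
import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

rng = np.random.default_rng(3)

# Simulated loess residuals whose spread grows with the median (x).
x = rng.uniform(0.5, 4, 300)          # log median expression
r = rng.normal(0, 0.1 + 0.1 * x)      # heteroskedastic residuals

# Fit loess to the squared residuals versus the median ...
fit = lowess(r**2, x, frac=0.5)
# ... and take the square root of the fitted value as the estimated
# standard deviation (clipped, since a lowess fit can dip below zero).
sd_hat = np.sqrt(np.clip(np.interp(x, fit[:, 0], fit[:, 1]), 1e-8, None))

# Stabilize the variance by dividing each residual by its estimated sd.
r_stab = r / sd_hat
```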

Final Step
Calculate the normalized data from: the slide median, the residual from the first loess, the gene median, and the variance stabilizer from the second loess. (The slide showed these as labeled terms of an equation.)
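The equation itself is not preserved in the transcript; one consistent reading of the labeled terms (with $m_s$ the slide median, $r_{sg}$ the residual from the first loess, $x_g$ the gene median, and $\hat{s}(x_g)$ the variance stabilizer from the second loess — all symbols assumed, not from the slide) would be:

$$
\tilde{y}_{sg} \;=\; m_s \;+\; \frac{r_{sg}}{\hat{s}(x_g)}.
$$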