Elizabeth Garrett Giovanni Parmigiani

Presentation transcript:

Normalization in the Presence of Differential Expression in a Large Subset of Genes
Elizabeth Garrett, Giovanni Parmigiani

Motivation (again)
Class discovery: find breast cancer subtypes within 81 previously unclassified breast cancer tumor samples.
Gene selection: find a small subset of genes that allows us to cluster the tumor samples.
Gene clustering: look for genes that are differentially expressed and genes that behave similarly.

Raw data: log gene expression median versus log gene expression in sample i

Problems with the raw data:
– “V” pattern in many of the slides
– Curvature
– Non-constant variance

“V” Patterns Debate
We thought… oops, something went wrong in the lab. We should either correct the V’s so that we see only one line, or remove the genes that are causing the V.
They (i.e., the “experts”) thought… it’s REAL differential expression!
Assuming it is real, how do we normalize to straighten the plot and stabilize the variance?

Crude Initial Approach
Fit a regression to each plot and identify points with large negative (or positive) residuals. Remove the genes with large negative (or positive) residuals (and high abundance?) and normalize using the remaining points.
Problem: points near the origin get truncated in an odd way, and there is no obvious way to decide which points near the origin to include or exclude.
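A minimal sketch of this crude approach on simulated data (the data, the 2-standard-deviation cutoff, and all variable names are illustrative assumptions; the slide specifies no particular threshold):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: log expression in sample i vs. log median expression.
x = rng.uniform(0, 4, 500)                    # log median expression
y = 0.1 + 1.0 * x + rng.normal(0, 0.2, 500)   # log expression in sample i
y[:50] += 1.5                                 # a "V" arm of differentially expressed genes

# Fit a least-squares line and compute residuals.
slope, intercept = np.polyfit(x, y, 1)
resid = y - (intercept + slope * x)

# Flag genes with large residuals (the cutoff here is an arbitrary choice).
cutoff = 2 * resid.std()
keep = np.abs(resid) < cutoff

# Re-fit the normalization line using only the remaining points.
slope2, intercept2 = np.polyfit(x[keep], y[keep], 1)
```

This reproduces the truncation problem the slide describes: the cutoff is a band around the line, so near the origin, where the V arms have not yet separated, the inclusion rule is arbitrary.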

High abundance = 3 or greater

A “better” (and not hard to implement) approach
1. Assume 2 classes of genes: class 0 and class 1.
2. Take a subset of samples where the V is obvious (we picked four samples).
3. Fit a latent variable model using MCMC to predict which genes are in class 1 and which are in class 0.

Latent Variable Model
Allow different slopes and intercepts for the two classes of genes. (The slide showed the model equations.)
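The model equations are not preserved in the transcript; a plausible sketch of a two-class regression model of this kind (the symbols $y_{sg}$, $x_g$, $a_{js}$, $b_{js}$, $c_g$, $\pi$ are assumptions, not taken from the slide) is:

$$
y_{sg} \mid c_g = j \;\sim\; N\!\left(a_{js} + b_{js}\,x_g,\; \sigma_{js}^2\right), \quad j \in \{0,1\},
\qquad c_g \sim \mathrm{Bernoulli}(\pi),
$$

where $y_{sg}$ is the log expression of gene $g$ in sample $s$, $x_g$ is its log median expression, and $c_g$ is the latent class indicator.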

Results
The goal is to estimate the gene classes c_g; the regression parameters are nuisance parameters.
Based on the chain, we estimate π_g = P(c_g = 1): at each iteration, each gene is assigned to class 0 or class 1; by averaging the class assignments over iterations, we get the posterior probability of class membership.
To do the normalization, we restrict attention to genes with π_g < 0.95.
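A minimal sketch of the averaging step, assuming a saved matrix of per-iteration class indicators (the names and the random stand-in for MCMC output are illustrative; no sampler is shown):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical MCMC output: rows = iterations, columns = genes,
# entries = the sampled class indicator c_g (0 or 1) at that iteration.
n_iter, n_genes = 2000, 10
draws = rng.integers(0, 2, size=(n_iter, n_genes))

# Posterior probability of class 1 membership: average the 0/1
# class assignments over iterations.
pi_g = draws.mean(axis=0)

# Genes retained as the normalization reference: pi_g < 0.95.
reference = pi_g < 0.95
```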

Posterior Probabilities of Class Membership

Normalization
Use loess normalization where the class 0 genes are the reference:
r_sg = residuals = y_sg − loess fit
(Figure: Sample 43)
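A sketch of this step in Python using statsmodels' `lowess` (the slides use R's `loess`; the data here are simulated, and because `lowess` takes no weights argument, fitting on the reference genes only stands in for R's `weights = 1 - c_g`):

```python
import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

rng = np.random.default_rng(2)

# Simulated data: x = log median expression, y = log expression in one sample.
x = rng.uniform(0, 4, 300)
y = 0.2 + x + 0.1 * np.sin(x) + rng.normal(0, 0.15, 300)
is_class0 = rng.random(300) < 0.9   # hypothetical reference-gene indicator

# Fit lowess on the reference (class 0) genes only.
fit = lowess(y[is_class0], x[is_class0], frac=0.5)  # sorted (x, yhat) pairs

# Evaluate the fitted curve at every gene and take residuals r_sg.
yhat = np.interp(x, fit[:, 0], fit[:, 1])
r = y - yhat
```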

Before and after loess normalization (R function “loess” with weights = 1 − c_g). (Figure panels: Before, After)

Variance Stabilization
1. Take the residuals from the previous loess fit.
2. Fit a loess to the squared residuals versus the median.
3. The square root of the fitted value approximates the standard deviation.
4. Rescale so that overall slide variability is not lost, by dividing by the average slide variance.
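The steps above can be sketched as follows (simulated heteroskedastic residuals; the smoothing span and the clip guarding against a negative fitted variance are implementation choices, not from the slides):

```python
import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

rng = np.random.default_rng(3)

# Simulated loess residuals whose spread grows with the median (x).
x = rng.uniform(0.5, 4, 300)          # log median expression
r = rng.normal(0, 0.1 + 0.1 * x)      # heteroskedastic residuals

# Fit loess to the squared residuals versus the median ...
fit = lowess(r**2, x, frac=0.5)
# ... and take the square root of the fitted value as the estimated
# standard deviation (clipped, since a lowess fit can dip below zero).
sd_hat = np.sqrt(np.clip(np.interp(x, fit[:, 0], fit[:, 1]), 1e-8, None))

# Stabilize the variance by dividing each residual by its estimated sd.
r_stab = r / sd_hat
```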

Final Step
Calculate the normalized data from: the slide median, the residual from the first loess, the gene median, and the variance stabilizer from the second loess. (The slide showed these as labeled terms of an equation.)
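The equation itself is not preserved in the transcript; one consistent reading of the labeled terms (with $m_s$ the slide median, $r_{sg}$ the residual from the first loess, $x_g$ the gene median, and $\hat{s}(x_g)$ the variance stabilizer from the second loess — all symbols assumed, not from the slide) would be:

$$
\tilde{y}_{sg} \;=\; m_s \;+\; \frac{r_{sg}}{\hat{s}(x_g)}.
$$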