Lecture Topic 5 Pre-processing AFFY data. Probe Level Analysis The Purpose –Calculate an expression value for each probe set (gene) from the 11-25 PM.

Lecture Topic 5 Pre-processing AFFY data

Probe Level Analysis
The Purpose
– Calculate an expression value for each probe set (gene) from the PM and MM intensities
– Critical for later analysis: avoiding GIGO (garbage in, garbage out)
– VERY recent, but has made significant progress

Difficulties
– Large variability
– Few measurements per probe set (11-25 at most)
– MM is very complex: it is signal plus background
– Signal has to be SCALED
– Probe-level effects

Different Methods
– MAS 4: Affymetrix, 1996
– MAS 5: Affymetrix, 2002
– Model Based Expression Index (MBEI): Li and Wong, 2001
– Robust Multichip Analysis (RMA): 2002
– GC-RMA: 2004

MAS 4 A- probe pairs selected

Avg Diff
– Calculated by taking the difference PM-MM for every probe pair and averaging over the probe pairs in the set
– Excluded OUTLIER pairs if PM-MM > 3 SD
– Was NOT a robust average
– NOT log-transformed
– COULD be negative (about 1/3 of the time)
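A minimal sketch of an Avg Diff style calculation follows. It is not the Affymetrix implementation: the function name `avg_diff` is hypothetical, and the outlier rule is an assumed reading of the slide (a pair is dropped when its PM-MM difference lies more than 3 SD from the mean difference of the probe set).

```python
from statistics import mean, stdev


def avg_diff(pm, mm, k=3.0):
    """Plain (non-robust) average of PM - MM after dropping outlier pairs.

    pm, mm: equal-length lists of probe intensities for one probe set.
    k: assumed outlier cutoff in SD units (3 on the slide).
    """
    diffs = [p - m for p, m in zip(pm, mm)]
    if len(diffs) < 3:
        # Too few pairs to estimate a spread; just average.
        return mean(diffs)
    center, spread = mean(diffs), stdev(diffs)
    # Keep pairs whose difference is within k SD of the mean difference.
    kept = [d for d in diffs if abs(d - center) <= k * spread]
    return mean(kept)


# Nothing is log-transformed and MM can exceed PM, so the result can be negative.
print(avg_diff([120, 95, 80], [100, 110, 130]))
```

Because the mean and SD are themselves pulled by outliers, this screen is weak, which matches the slide's remark that Avg Diff was not a robust average.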

MAS 5
– $\text{Signal} = \mathrm{TukeyBiweight}\{\log_2(PM_j - IM_j)\}$
– Discussed this earlier
– Requires calculating IM (the ideal mismatch)
– The adjusted PM-MM values (PM-IM) are log transformed and then summarized with the Tukey biweight, which is robust to outlying observations
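The one-step Tukey biweight can be sketched as below. The tuning constants `c=5` and `eps=1e-4` are the values commonly quoted for MAS 5 and should be treated as assumptions here. The helper `mas5_like_log_signal` is hypothetical and expects IM values that are already computed and smaller than the matching PM.

```python
from math import log2
from statistics import median


def tukey_biweight(values, c=5.0, eps=1e-4):
    """One-step Tukey biweight: a weighted mean that downweights outliers."""
    m = median(values)
    mad = median([abs(v - m) for v in values])  # median absolute deviation
    weights = []
    for v in values:
        u = (v - m) / (c * mad + eps)  # scaled distance from the median
        # Bisquare weight: smoothly falls to 0 once |u| reaches 1.
        weights.append((1.0 - u * u) ** 2 if abs(u) < 1.0 else 0.0)
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


def mas5_like_log_signal(pm, im):
    """Log2-scale signal for one probe set; assumes pm[j] > im[j] for all j."""
    return tukey_biweight([log2(p - i) for p, i in zip(pm, im)])


# A single wild probe pair barely moves the summary.
print(tukey_biweight([10.0, 10.2, 9.9, 10.1, 25.0]))
```

Values far from the median get weight zero, so one bad probe pair cannot dominate the probe-set summary the way it can in a plain average.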

Robust Multichip Analysis
– ONLY uses PM and ignores MM
– SACRIFICES accuracy but makes major gains in PRECISION
Basic Steps:
– 1. Calculate chip background (*BG) and subtract from PM
– 2. Carry out intensity-dependent normalization for PM-*BG: Lowess, Quantile Normalization (discussed before)
– 3. Normalized PM-*BG are log transformed
– 4. Robust multichip analysis of all probes in the set, using the Tukey median polish procedure. Signal is the antilog of the result.

RMA- Step 1: Background Correction
– Irizarry et al. (2003)
– Looks at finding the conditional expectation of the TRUE signal given the observed signal (which is assumed to be the true signal plus noise): $E(s_i \mid s_i + b_i)$
– Here, $s_i$ is assumed to follow an Exponential distribution with rate parameter $\alpha$
– $b_i$ is assumed to follow $N(\mu_e, \sigma_e^2)$
– Estimate $\mu_e$ and $\sigma_e$ as the mean and standard deviation of empty spots

RMA- BG Corrected Value
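Under the exponential-signal plus normal-noise model, the conditional expectation has a closed form given in Irizarry et al. (2003). With observed intensity $o$, set $a = o - \mu - \sigma^2\alpha$ and $b = \sigma$; then $E(s \mid o) = a + b\,[\phi(a/b) - \phi((o-a)/b)] / [\Phi(a/b) + \Phi((o-a)/b) - 1]$, where $\phi$ and $\Phi$ are the standard normal density and distribution function. The sketch below evaluates that expression. The function name and the parameter names `mu`, `sigma` and `alpha` (noise mean, noise SD, exponential rate) are labels chosen for illustration, and the parameter values in the example are made up.

```python
from math import erf, exp, pi, sqrt


def normal_pdf(x):
    """Standard normal density."""
    return exp(-0.5 * x * x) / sqrt(2.0 * pi)


def normal_cdf(x):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))


def rma_background_correct(observed, mu, sigma, alpha):
    """E(signal | observed) for signal ~ Exp(alpha), noise ~ N(mu, sigma^2)."""
    a = observed - mu - sigma ** 2 * alpha
    b = sigma
    numerator = normal_pdf(a / b) - normal_pdf((observed - a) / b)
    denominator = normal_cdf(a / b) + normal_cdf((observed - a) / b) - 1.0
    return a + b * numerator / denominator


# Illustrative (made-up) parameters: noise mean 50, noise SD 10, rate 0.01.
for o in (40.0, 100.0, 1000.0):
    print(o, rma_background_correct(o, mu=50.0, sigma=10.0, alpha=0.01))
```

The corrected value stays positive even when the observed intensity is below the estimated background mean, which is what makes the later log transform safe.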

RMA-Normalization
– Use the background corrected intensities B(PM) to carry out normalization
– Lowess (for spatial effects)
– Quantile Normalization (to allow comparability amongst replicate slides)
– Normalized B(PM) are log transformed
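Quantile normalization forces every array to share the same empirical distribution. A minimal sketch, assuming each array is a list of background-corrected intensities of equal length and ignoring special handling of ties (the function name is hypothetical):

```python
def quantile_normalize(arrays):
    """Give every array the same distribution: the mean of the sorted arrays.

    arrays: list of equal-length lists, one list of intensities per chip.
    Ties are broken by position, which is a simplification.
    """
    n = len(arrays[0])
    sorted_arrays = [sorted(a) for a in arrays]
    # Reference distribution: average of the r-th smallest value across chips.
    reference = [sum(s[r] for s in sorted_arrays) / len(arrays) for r in range(n)]
    normalized = []
    for a in arrays:
        order = sorted(range(n), key=lambda i: a[i])  # indices from low to high
        out = [0.0] * n
        for rank, idx in enumerate(order):
            out[idx] = reference[rank]  # value replaced by the reference quantile
        normalized.append(out)
    return normalized


print(quantile_normalize([[5, 2, 3], [4, 1, 6]]))
# → [[5.5, 1.5, 3.5], [3.5, 1.5, 5.5]]
```

After this step each chip has identical quantiles, so differences between chips reflect rank changes of individual probes rather than overall intensity shifts. The log transform would be applied to the output.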

RMA summarization
– Use MEDIAN POLISH to fit a linear model
– Given a MATRIX of data: Data = overall effect + row effects + column effects + residual
– Find row and column effects by subtracting the row medians and the column medians successively until all the medians are less than some epsilon
– Gives the estimated row, column and overall effects when done
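The iteration described above can be sketched as follows. This is a generic Tukey median polish, not code from any RMA implementation; the function name, the `eps` tolerance and the `max_iter` cap are assumptions.

```python
from statistics import median


def median_polish(matrix, eps=1e-6, max_iter=50):
    """Fit data = overall + row effect + column effect + residual."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    resid = [list(map(float, row)) for row in matrix]
    overall = 0.0
    row_eff = [0.0] * n_rows
    col_eff = [0.0] * n_cols
    for _ in range(max_iter):
        # Sweep row medians out of the residuals into the row effects.
        row_med = [median(r) for r in resid]
        for i in range(n_rows):
            row_eff[i] += row_med[i]
            resid[i] = [x - row_med[i] for x in resid[i]]
        shift = median(col_eff)  # keep column effects centred at median 0
        col_eff = [c - shift for c in col_eff]
        overall += shift
        # Sweep column medians out of the residuals into the column effects.
        col_med = [median([resid[i][j] for i in range(n_rows)]) for j in range(n_cols)]
        for j in range(n_cols):
            col_eff[j] += col_med[j]
            for i in range(n_rows):
                resid[i][j] -= col_med[j]
        shift = median(row_eff)  # keep row effects centred at median 0
        row_eff = [r - shift for r in row_eff]
        overall += shift
        if max(abs(m) for m in row_med + col_med) < eps:
            break
    return overall, row_eff, col_eff, resid


# Toy probe set: 3 probes (rows) x 3 arrays (columns) on the log scale.
data = [[7, 9, 11], [8, 10, 12], [9, 11, 13]]
overall, rows, cols, resid = median_polish(data)
print(overall, rows, cols)
```

With probes in rows and arrays in columns, the log-scale expression estimate for array j would be `overall + cols[j]`.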

Median Polish of RMA
– For each probe set we have a matrix (probes in rows and arrays in columns)
– We assume: Signal = probe affinity effect + log-scale expression level + error
– Also assume the sum of probe affinities is 0
– Use MEDIAN polish to estimate the expression level in each array
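Written out, the assumed model for one probe set is shown below. The symbols are chosen here for illustration and are not from the slides: $Y_{ij}$ is the log2 of the normalized, background-corrected PM for probe $i$ on array $j$, $a_i$ is the probe affinity effect, and $e_j$ is the log-scale expression level on array $j$.

```latex
\begin{align*}
  Y_{ij} &= a_i + e_j + \varepsilon_{ij},
    \qquad i = 1,\dots,I \ \text{(probes)}, \quad j = 1,\dots,J \ \text{(arrays)} \\
  \sum_{i=1}^{I} a_i &= 0 \\
  \hat{e}_j &= \widehat{\text{overall}} + \widehat{\text{column effect}}_j
\end{align*}
```

One caveat worth keeping in mind: median polish centres the fitted row effects so that their median, rather than their sum, is zero, so the sum-to-zero constraint is only approximately what the fit enforces.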

GC-RMA: the Basic Idea of Background
– Uses MM and PM in a more statistical framework
– $PM = O_{PM} + N_{PM} + S$
– $MM = O_{MM} + N_{MM} + \phi S$
– O represents optical noise, N represents non-specific binding (NSB) noise and S is a quantity proportional to RNA expression (the quantity of interest)
– The parameter $0 < \phi < 1$ accounts for the fact that for some probe-pairs the MM detects signal

Distributional Assumptions
– Assume O follows a log-normal distribution
– $\log(N_{PM})$ and $\log(N_{MM})$ follow a bivariate-normal distribution with means $\mu_{PM}$ and $\mu_{MM}$, common variance $\mathrm{var}[\log(N_{PM})] = \mathrm{var}[\log(N_{MM})] \equiv \sigma^2$, and correlation $\rho$ constant across probes
– $\mu_{PM} \equiv h(\alpha_{PM})$ and $\mu_{MM} \equiv h(\alpha_{MM})$, with $h$ a smooth (almost linear) function and the $\alpha$ defined next
– Because we do not expect NSB to be affected by optics, we assume O and N are independent
– The parameters $\mu_{PM}$, $\mu_{MM}$, $\rho$ and $\sigma^2$ can be estimated from the large amount of data
– A background adjustment procedure can then be formalized as the statistical problem of predicting S given that we observed PM and MM, assuming we know $h$, $\rho$, $\sigma^2$ and $\phi$

GC-RMA
– Naef and Magnasco (2003) defined the probe affinity as $\alpha = \sum_{k=1}^{25} \sum_{j \in \{A,T,G,C\}} \mu_{j,k} \, 1_{b_k = j}$
– where k = 1,…,25 indicates the position along the probe, j indicates the base letter, $b_k$ represents the base at position k, $1_{b_k = j}$ is an indicator function that is 1 when the k-th base is of type j and 0 otherwise, and $\mu_{j,k}$ represents the contribution to affinity of base j in position k
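Because the indicator picks exactly one base per position, the affinity is simply the sum of the contribution of whichever base sits at each of the 25 positions. A minimal sketch, in which the function name and the table `mu` are hypothetical and the contribution values are made up for illustration (real values would be estimated from data):

```python
def probe_affinity(sequence, mu):
    """Sum over positions k of the contribution of the base found at k.

    sequence: probe sequence, e.g. a 25-mer over the letters A, C, G, T.
    mu: dict mapping base letter -> list of per-position contributions.
    """
    return sum(mu[base][k] for k, base in enumerate(sequence))


# Made-up contributions: constant along the probe, different per base.
PROBE_LENGTH = 25
mu = {
    "A": [1.0] * PROBE_LENGTH,
    "C": [2.0] * PROBE_LENGTH,
    "G": [3.0] * PROBE_LENGTH,
    "T": [0.0] * PROBE_LENGTH,
}

print(probe_affinity("ACGT" + "A" * 21, mu))
# → 27.0
```

In practice the contributions vary with position, so two probes with the same base composition can have different affinities.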

Assumptions and Notations needed for applying GC-RMA
– 1. $\phi$ is 0 (although we know $\phi > 0$)
– 2. O is an array-dependent constant
Notation:
– Let m be the minimum value allowed for S (generally 0)
– $\hat{h}$ and the other parameter estimates are plug-in estimators

MLE estimates Under the above described assumptions, the maximum likelihood estimate (MLE) of S=