
1 DATA TRANSFORMATION and NORMALIZATION
Lecture Topic 4

2 DATA PRE-PROCESSING
Transformation
Normalization
Scaling

3 DATA TRANSFORMATION
The difference between raw fluorescence intensities is a meaningless number, so the data are transformed. The ratio of the two channels allows immediate visualization of relative expression, and the log of the ratio is then taken.

4 Why Log 2?
Differences in expression intensity exist on a multiplicative scale; the log transformation brings them onto an additive scale, where a linear model may apply.
Ex. 4-fold repression = 0.25 (log2 = -2)
Ex. 4-fold induction = 4 (log2 = 2)
Ex. 16-fold induction = 16 (log2 = 4)
Ex. 16-fold repression = 1/16 = 0.0625 (log2 = -4)
Evens out highly skewed distributions. Makes the variation of intensities independent of their absolute magnitude.

5 Log Transformation: Makes the distribution less skewed
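A minimal R sketch of slides 4-5, using simulated (hypothetical) intensities rather than data from the lecture: fold changes that are symmetric on the multiplicative scale become symmetric around 0 after log2, and a right-skewed intensity distribution becomes far less skewed.
fold <- c(1/16, 1/4, 1, 4, 16)    # 16x repression ... 16x induction
log2(fold)                        # -4 -2  0  2  4: symmetric on the additive scale
set.seed(1)
intensity <- rlnorm(10000, meanlog = 7, sdlog = 1)   # skewed raw intensities
par(mfrow = c(1, 2))
hist(intensity, main = "Raw: highly skewed")
hist(log2(intensity), main = "Log2: roughly symmetric")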

6 Example 2

7 Non-parametric Regression: the Loess Method
LOWESS (= LOESS) is an acronym for LOcally WEighted Scatterplot Smoothing (Cleveland). For i = 1 to n, the ith measurement y_i of the response y and the corresponding measurement x_i of the vector x of p predictors are related by
y_i = g(x_i) + e_i
where g is the regression function and e_i is a random error. Idea: g(x) can be locally approximated by a parametric function, obtained by fitting a regression surface to the data points within a chosen neighborhood of the point x.

8 LOESS contd…
In the LOESS (LOWESS) method, weighted least squares is used to fit linear or quadratic functions of the predictors at the centers of neighborhoods. The radius of each neighborhood is chosen so that the neighborhood contains a specified percentage of the data points. The fraction of the data in each local neighborhood, called the smoothing parameter, controls the smoothness of the estimated surface. Data points in a given local neighborhood are weighted by a smooth decreasing function of their distance from the center of the neighborhood.

9 Distance metrics used
Finding the distance between the ith and hth points. With 2 predictors, the distance between (X_i1, X_i2) and (X_h1, X_h2) is generally the Euclidean distance:
d_ih = sqrt((X_i1 - X_h1)^2 + (X_i2 - X_h2)^2)
Weights are defined by a tri-cube function of the distance scaled by the neighborhood radius, u = d_ih / d_q:
w(u) = (1 - u^3)^3 for 0 <= u < 1, and w(u) = 0 otherwise.
The choice of the smoothing parameter q is between 0 and 1, often between .4 and .6. Large q: smoother, but maybe too smooth. Small q: too rough.
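A small R sketch of these quantities (the scaling of distances by the neighborhood radius is our reading of the tri-cube weighting; the data and variable names are hypothetical):
tricube <- function(u) ifelse(u >= 0 & u < 1, (1 - u^3)^3, 0)   # tri-cube weights
set.seed(2)
X <- cbind(rnorm(100), rnorm(100))                       # toy data, 2 predictors
d <- sqrt((X[, 1] - X[1, 1])^2 + (X[, 2] - X[1, 2])^2)   # Euclidean distances to point 1
q <- 0.5                                                 # smoothing parameter
dq <- sort(d)[ceiling(q * nrow(X))]                      # radius holding fraction q of the data
w <- tricube(d / dq)                                     # weights for the local fit at point 1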

10

11 Comments on LOESS
· Fitting is done at each point at which the regression surface is to be estimated.
· A faster computational procedure is to perform such local fitting at a selected sample of points and then to blend the local polynomials to obtain a regression surface.
· The LOESS procedure can be used to perform statistical inference, provided the errors are i.i.d. normal random variables with mean 0.
· Using iterative reweighting, LOESS can also provide statistical inference when the error distribution is symmetric but not necessarily normal.
· Iterative reweighting also lets the LOESS procedure perform robust fitting in the presence of outliers in the data.

12 Data “Normalization”
To biologists, data normalization means “eliminating systematic noise” from the data. Noise here is systematic variation: experimental variation, human error, variation in scanner technology, etc. It is variation in which we are NOT interested; we are interested in measuring the true biological variation of genes across experiments, over time, etc. Normalization plays an important role in the early stages of microarray data analysis, and subsequent analyses depend heavily on it.
NORMALIZATION: adjusts for any bias which arises from the microarray technology rather than from biology.

13 Normalization: An Age-old Statistical Idea
Normalization stands for removing bias, resulting from experimental artifacts, from the data. The idea stems back to Fisher's (1923) setting up of ANOVA. There is a thrust to use ANOVA for normalization, but for the most part it is still a stage-wise approach rather than a single model taking out all sources of variation at once. We will need to look at:
Spatial correction
Background correction
Dye-effect correction
Within-replicate rescaling
Across-replicate rescaling
Within-slide normalization
Paired-slide normalization for dye swap
Multiple-slide normalization

14 M vs A plots
Used to look at the agreement of variables intended to measure the same response. Suppose y1 and y2 are two replicate measurements of the same variable:
M (Minus) = y1 - y2
A (Average) = (y1 + y2)/2
This is often done on the log scale: M = log(y1/y2) = log(y1) - log(y2) and A = (log(y1) + log(y2))/2 = (1/2) log(y1*y2).
If we plot M on the y-axis against A on the x-axis, we expect to see a flat line at M = 0 if the two variables do indeed measure the same thing.

15 Code
ma.data = read.csv("MA.csv", header = TRUE)
head(ma.data)
y1 = ma.data$slide1
y2 = ma.data$slide2
M = y1 - y2
A = (y1 + y2) / 2
plot(y1, y2)
plot(A, M)
lw1 <- loess(M ~ A, data = ma.data, span = 0.10)
plot(M ~ A, data = ma.data)
j <- order(A)
lines(A[j], lw1$fitted[j], col = "red", lwd = 3)    # fitted loess trend
# idea of normalization: subtract the fitted trend
fit = lw1$fitted
newM = M - fit
lw2 <- loess(newM ~ A, data = ma.data, span = 0.10)
plot(newM ~ A, data = ma.data)
lines(A[j], lw2$fitted[j], col = "green", lwd = 3)  # trend after normalization is flat

16 Array 1: pre and post norm

17 Comments: Print-tip normalization is generally a good proxy for spatial effects. Instead of LOESS, one can use a spline to estimate the trend to subtract from the raw data.
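A minimal sketch of the spline alternative in R, reusing M and A from the slide-15 code (smooth.spline is one possible choice; the slide does not name a specific spline method):
sp <- smooth.spline(A, M)                   # smoothing spline through the MA cloud
plot(A, M)
o <- order(A)
lines(A[o], predict(sp, A[o])$y, col = "blue", lwd = 3)   # estimated trend
newM.spline <- M - predict(sp, A)$y         # subtract the trend, as with loess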

18 BACKGROUND CORRECTION
Idea: Signal = True Signal + Background. So an attractive idea seems to be to subtract the BACKGROUND from the signal to get at the “TRUE” signal. The problem is that the actual BACKGROUND in a spot cannot be measured; what is measured is really an “estimate” of the background at places NEAR the spot.
Criticism: the assumption in these models is that the background is additive.
OFTEN WE SEE HIGH CORRELATION BETWEEN FOREGROUND AND BACKGROUND.
GENERAL CONSENSUS THESE DAYS: NOT TO SUBTRACT LOCAL BACKGROUND, BUT POSSIBLY SUBTRACT A GLOBAL BACKGROUND (FROM EMPTY SPOTS OR BUFFERS).

19 Background Correction: more thoughts
McClure and Wit (2004) suggest calculating the mean or median of the empty spots and estimating the signal as:
Signal = max(observed signal - center(empty spots), 0)
This avoids ever having the problem of negative “corrected” signals.
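In R this is one line with pmax (the data and names here are hypothetical):
empty <- c(40, 55, 38, 61, 47)                   # hypothetical empty-spot intensities
observed <- c(120, 35, 300, 52)                  # hypothetical spot signals
corrected <- pmax(observed - median(empty), 0)   # never negative
corrected                                        # 73  0 253  5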

20 Background Correction: Probabilistic Idea
Irizarry et al. (2003) look at finding the conditional expectation of the TRUE signal given the observed signal (which is assumed to be the true signal plus noise):
E(s_i | s_i + b_i)
Here s_i is assumed to follow an Exponential distribution with parameter theta, and b_i is assumed to follow N(mu, sigma^2). Estimate mu and sigma as the mean and standard deviation of the empty spots.

21 Irizarry Approach contd…
Writing o = s + b for the observed value and a = o - mu - sigma^2*theta, this allows the conditional expectation to be approximated by the following, where Phi and phi are the CDF and pdf of the standard normal distribution:
E(s | o) = a + sigma * [phi(a/sigma) - phi((o - a)/sigma)] / [Phi(a/sigma) + Phi((o - a)/sigma) - 1]
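A sketch of this correction in R, coded directly from the formula above (the function name and the crude parameter estimates are ours, not from the lecture):
bg.correct.rma <- function(o, mu, sigma, theta) {
  # E(s | o) with s ~ Exp(theta) and background ~ N(mu, sigma^2)
  a <- o - mu - sigma^2 * theta
  a + sigma * (dnorm(a / sigma) - dnorm((o - a) / sigma)) /
      (pnorm(a / sigma) + pnorm((o - a) / sigma) - 1)
}
empty <- c(40, 55, 38, 61, 47)                  # hypothetical empty spots
mu <- mean(empty); sigma <- sd(empty)
theta <- 1 / (mean(c(120, 35, 300, 52)) - mu)   # crude rate estimate (assumption)
bg.correct.rma(c(120, 35, 300, 52), mu, sigma, theta)   # all corrected values > 0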

22 Normalization Approaches
GLOBAL normalization (G): global (ARRAY) mean or median. NOT USED VERY OFTEN ANYMORE.
Intensity-dependent linear normalization (L): by least-squares estimation. AGAIN NOT USED AS MUCH.
Intensity-dependent non-linear normalization (N): LOWESS curve (robust scatterplot smoother). Under ideal experimental conditions, M = 0 for the selected genes used for normalization. THE MOST COMMONLY USED IDEA THESE DAYS.

23 Normalization: Historical Approaches
Global normalization
Sum method: normalization coefficient
k = sum_i (I_1i - B_1) / sum_i (I_2i - B_2)
where I_mi = intensity of gene i on Array m (m = 1, 2), B_m = background intensity on Array m (m = 1, 2), n = number of genes on the array, and the sums run over i = 1, …, n.
Problem: validity of the assumption; stronger signals dominate the summation.
Median method (robust with respect to outliers): normalization coefficient
k = median_i (I_1i - B_1) / median_i (I_2i - B_2)
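The two coefficients in R (the slide's original formulas are images, so the form above is our reconstruction; the toy data are hypothetical):
I1 <- c(160, 20, 90, 110, 70); B1 <- 10     # hypothetical array 1 and its background
I2 <- c(130, 30, 50, 140, 80); B2 <- 12     # hypothetical array 2 and its background
k.sum <- sum(I1 - B1) / sum(I2 - B2)               # sum method
k.median <- median(I1 - B1) / median(I2 - B2)      # median method (robust)
I2.norm <- (I2 - B2) * k.sum                       # rescale array 2 toward array 1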

24 Normalization continued
Housekeeping gene normalization
Housekeeping genes are a set of genes whose expression levels are not affected by the treatment. The normalization coefficient is the ratio mC/mT, where mC and mT are the means of the selected housekeeping genes for control and treatment respectively.
Problem: housekeeping genes sometimes change their expression level, so the assumption doesn't always hold.
Trimmed mean normalization (adjusted global method)
Trim off the 5% highest and lowest extreme values, then globally normalize the data. The normalization coefficient is the ratio of the trimmed mean for the control to the trimmed mean for the treatment.
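Both coefficients in R; mean(x, trim = 0.05) drops 5% from each tail (the data and the housekeeping indices are hypothetical):
set.seed(3)
control <- rlnorm(1000, 6, 1)                 # hypothetical control intensities
treatment <- rlnorm(1000, 6.3, 1)             # hypothetical treatment intensities
hk <- 1:20                                    # assumed indices of housekeeping genes
k.hk <- mean(control[hk]) / mean(treatment[hk])                       # housekeeping
k.trim <- mean(control, trim = 0.05) / mean(treatment, trim = 0.05)   # trimmed mean
treatment.norm <- treatment * k.trim          # bring treatment onto the control scale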

25 Ideal Control Spots that should be on an array
As we saw in the previous slide, there can be many special probes spotted onto an array during its manufacture, collectively called control probes. These include:
Blanks: places where water or nothing is spotted.
Buffer: where the buffer solution without DNA is spotted.
Negative: here there are DNA probes, but they shouldn't be complementary to any target cDNA.
Calibration: probes corresponding to DNA put in the hyb mix which should have equal signals in the two channels.
Ratio: probes corresponding to DNA put in the hyb mix which should have known ratios between the two channels (e.g. 3:1, 1:3, 10:1, 1:10).

26 Normalization Within and Across Conditions
Normalization WITHIN conditions is more common. The idea: we want all the arrays that represent the SAME condition to be comparable; in other words, take out the array effect. Many models for this:
Factorial model (Kerr et al., Wolfinger et al.)
Location-scale model (Yang et al.)
Scaling (Affymetrix)
Consider the data to be x_ijk: ith spot, jth color, kth array.
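A minimal sketch of the simplest of these ideas, scaling each replicate array to a common median (our illustration of "scaling", not Affymetrix's exact algorithm):
set.seed(4)
x <- matrix(rlnorm(20, 6, 1), ncol = 4)     # hypothetical: 5 spots x 4 replicate arrays
med <- apply(x, 2, median)                  # per-array medians (the array effect)
target <- median(med)                       # compromise value
x.scaled <- sweep(x, 2, med, "/") * target  # every array now has median = target
apply(x.scaled, 2, median)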

27 Quantile Normalization
Idea: Ideally, “replicate” microarrays should be similar. In real life they are often NOT identically distributed. Quantile normalization FORCES the same distribution onto all the arrays for the same condition.

28

29 Mathematical details: Quantile Normalization
Let {x} represent the matrix of the p spot intensities on the n replicate arrays; x_ik is the spot intensity of the ith spot on the kth array (i = 1,…,p; k = 1,…,n). Let x_(j) = the vector of the jth smallest spot intensities, one taken from each array, and let xbar_(j) be the mean (or median) of x_(j). The vector (xbar_(1), …, xbar_(p)) represents the compromise distribution. Let {r} be the matrix of within-array ranks associated with the matrix {x}. Then the quantile-normalized values are x*_ik = xbar_(r_ik).

30 Numerical Example
Let us consider a situation where we have 5 spots on an array and two replicate arrays (numbers in brackets represent the ranks):
Spot:    1      2     3     4      5
Array 1: 16(5)  0(1)  9(3)  11(4)  7(2)
Array 2: 13(4)  3(1)  5(2)  14(5)  8(3)
Order each array: Array 1: 0, 7, 9, 11, 16; Array 2: 3, 5, 8, 13, 14.
Average these: 1.5, 6, 8.5, 12, 15.
Replace the ranks by these averages (rank 1 -> 1.5, rank 2 -> 6, rank 3 -> 8.5, rank 4 -> 12, rank 5 -> 15).
Normalized arrays are:
Array 1: 15, 1.5, 8.5, 12, 6
Array 2: 12, 1.5, 6, 15, 8.5

31 R code
# need to install files from Bioconductor
# For R version 3.6 onwards we need to do the following:
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install()
BiocManager::install(c("GenomicFeatures", "AnnotationDbi"))
BiocManager::install("affy")
BiocManager::install("affyPLM")
library(affyPLM)          # load package
library(preprocessCore)   # normalize.quantiles lives here
# create a matrix using the same example
mat1 = matrix(c(16, 0, 9, 11, 7, 13, 3, 5, 14, 8), ncol = 2)
normalize.quantiles(mat1)
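On the slide-30 example, normalize.quantiles(mat1) reproduces the hand computation:
#      [,1] [,2]
# [1,] 15.0 12.0
# [2,]  1.5  1.5
# [3,]  8.5  6.0
# [4,] 12.0 15.0
# [5,]  6.0  8.5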

32 Conclusion
There is no unique normalization method for the same data; it depends on what kind of experiment you have and what the data look like. There are no absolute criteria for normalization; basically, the normalized log-ratios should be centered around 0. Nowadays the focus IS on using nonparametric regression methods to remove trend or spatial artifacts from the data. Quantile normalization (though not liked by BIOLOGISTS) is catching on as well.

