The microarray data analysis Ana Deckmann Carla Judice Jorge Lepikson Jorge Mondego Leandra Scarpari Marcelo Falsarella Carazzolle Michelle Servais Tais Herig
Summary - Statistics background - Introduction to microarray - Pre-processing microarray data - Statistics analysis - Applications on the LGE - Gene Chip
Statistics background
- measurement = truth + error
- error = bias + variance
- Error model: normalization; experimental replicates (technical and biological) and statistics
- Bias describes a systematic tendency of the measurement. Ex: the dyes Cy3 and Cy5 do not have the same efficiency.
- Variance is often normally distributed. Ex: instrumentation imperfection and biological variation.
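- Mean and standard deviation
Mean: mean(x) = (x_1 + x_2 + ... + x_N) / N
Standard deviation (sample form, dividing by N - 1): sd(x) = sqrt( sum_i (x_i - mean(x))^2 / (N - 1) )
[Figure: Gaussian function centered on mean(x)]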
- Mean vs median:
Assume data with one outlier: x = (8, 85, 7, 9, 5, 4, 13, 6, 8)
- The mean of all x's, i.e. (x_1 + x_2 + ... + x_K)/K, is affected by the outlier: mean(x) = 16.1 (7.5 without the outlier).
- The median of all x's, i.e. the middle value of the ordered (x_1, x_2, ..., x_K), is not (if < 50% of the values are outliers): x ordered = (4, 5, 6, 7, 8, 8, 9, 13, 85), median(x) = 8.0
Use the median instead of the mean if you expect artifacts. (If there are a lot of measurements and the errors are symmetrically distributed, the median will give the same result as the mean without outliers.)
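The comparison above can be checked with a minimal sketch using Python's standard `statistics` module. The variable names are illustrative only; the data are the nine values from the slide.

```python
from statistics import mean, median

x = [8, 85, 7, 9, 5, 4, 13, 6, 8]        # data from the slide; 85 is the outlier
x_clean = [v for v in x if v != 85]      # same data with the outlier removed

mean_all = mean(x)            # pulled upward by the single outlier
mean_clean = mean(x_clean)    # mean of the eight remaining values
median_all = median(x)        # middle of the ordered values, unaffected by the outlier

print(round(mean_all, 1), mean_clean, median_all)
```

The median stays close to the outlier-free mean, which is why it is the safer summary when artifacts are expected.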
- Quantiles
A quantile is the value below which a given fraction (or percent) of the points fall. That is, the 0.3 (or 30%) quantile is the point at which 30% of the data fall below and 70% fall above that value.
p = 30%
x = (0, 10, 40, 25, 15, 50, 70, 60)
x = (0, 10, 15, 25, 40, 50, 60, 70) ordered values
Quantile(x; 30%) = (0, 10, 15)
1st quartile = 10
3rd quartile = 60
Median = (25 + 40)/2 = 32.5
Introduction to microarray
- Three different microarray technologies:
  - Spotted cDNA microarrays (500 to 2500 bp)
  - Spotted oligonucleotide microarrays (30 to 70 bp)
  - Affymetrix chips (25 bp)
- Can be used for: differential gene expression studies, gene co-regulation studies, gene function identification studies, time-course studies, dose-response studies, clinical diagnosis, …
Two color architecture
Codelink architecture (one color)
Probes: 30-mers, 90% located up to 550 bases from the 3' end
Targets: 10 µg biotinylated cRNA
Scanning
[Figure: excitation with the red laser and the green laser, emission, and overlay of the two images. Labels: higher frequency, more energy; lower frequency, less energy.]
Ludwig scanner
[Figure: panels A to H and a to k. Source: Scarpari, Leandra, 2006, doctoral thesis]
Ludwig flags: (0) Int <= Back (1) Irregular spots (3) Spot ok (4) Saturated
Codelink scanner
Codelink flags: (L) near background (C) contaminated (S) saturated (M) masked (G) good
LGE scanner
[Figure: panels A to H, 1 to 4]
LGE defined flags:
(0) Spot ok
(1) Saturated spot
(2) Int/Back <= 1.05
(3) Area <= 110 or 50 (9x9 or 11x11)
Defined intensity:
- Int Cy3 = Area Cy3 * (median(Int Cy3) - median(Bkgd Cy3))
- Int Cy5 = Area Cy5 * (median(Int Cy5) - median(Bkgd Cy5))
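A minimal sketch of the defined intensity and the LGE flags, for illustration only. Several details are assumptions not stated on the slide: the spot area is taken as the number of spot pixels, saturation is detected with a hypothetical 16-bit ceiling (`SATURATION`), the flags are tested in the order 1, 2, 3, and the function names are invented here.

```python
from statistics import median

SATURATION = 65535   # hypothetical saturation level (16-bit scanner), not from the slide
MIN_AREA = 50        # the slide gives 110 or 50 depending on the spot grid


def integrated_intensity(spot_pixels, bkgd_pixels):
    """Area * (median spot intensity - median background); area = pixel count (assumption)."""
    area = len(spot_pixels)
    return area * (median(spot_pixels) - median(bkgd_pixels))


def lge_flag(spot_pixels, bkgd_pixels, min_area=MIN_AREA):
    """Return 0 (ok), 1 (saturated), 2 (Int/Back <= 1.05) or 3 (small area)."""
    spot_med = median(spot_pixels)
    bkgd_med = median(bkgd_pixels)
    if spot_med >= SATURATION:            # assumed saturation criterion
        return 1
    if spot_med <= 1.05 * bkgd_med:       # same test as Int/Back <= 1.05, safe when Back = 0
        return 2
    if len(spot_pixels) <= min_area:
        return 3
    return 0


good_spot = [1000] * 60
background = [100] * 40
print(integrated_intensity(good_spot, background), lge_flag(good_spot, background))
```

Comparing `spot_med` with `1.05 * bkgd_med` rather than dividing avoids a division by zero when the background median is zero.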
(Target median - Bkgd median) * Area = integrated intensity
[Figure: pixels in and pixels out of the spot]
Cy3= ; Cy5= ; r = 0.67 (fold = -1.49)
- Cy3= ; Cy5= 15488; r = 0.069; fold = -14.5; flag = 0
- Cy3= ; Cy5= ; r = fold = 1.40; flag = 0
- Cy3= ; Cy5= ; r = 1.65; flag = 0
- Cy3= 6400; Cy5= NA (signal:noise <= 1); flag = 2
- Cy3= ; Cy5= ; r = 0.15; fold = -6.7; flag = 1
Pre-processing microarray data
- Bioconductor repository
- Log intensities
[Figure: R vs G with the line R = G, and log2 R vs log2 G with the line log2 R = log2 G]
Most genes have low gene expression levels. What happens here?
M vs A plot
Transformed data {(M, A)_i}:
M = log2(R) - log2(G) (minus)
A = ½·[log2(R) + log2(G)] (add)
Non-differentially expressed genes are now along the horizontal line M = 0: log2 R - log2 G = 0, i.e. R = G.
[Figure labels: up-regulated genes, down-regulated genes]
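The M and A definitions above can be sketched as follows. The function name `ma_transform` and the intensity values are illustrative assumptions, not data from the slides.

```python
from math import log2


def ma_transform(red, green):
    """Return a list of (M, A) pairs for paired red/green intensities (all > 0)."""
    pairs = []
    for r, g in zip(red, green):
        m = log2(r) - log2(g)             # log ratio ("minus")
        a = 0.5 * (log2(r) + log2(g))     # average log intensity ("add")
        pairs.append((m, a))
    return pairs


# Made-up intensities: up-regulated, unchanged and down-regulated spots.
red = [1024, 256, 512]
green = [256, 256, 2048]
print(ma_transform(red, green))
```

A spot with R = G lands on M = 0, whatever its overall intensity A.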
Density plot
log2 R = red channel signal
log2 G = green channel signal
Print-tip box plot
[Figure: box plots for print-tip groups 1 to 16]
Normalization within slides Expectation: Most genes are non-differentially expressed, i.e. most of the data points should be around M=0.
Median normalization: sets the median of the log intensity ratios to zero (median value = 0)
Lowess normalization: global lowess normalization
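Median normalization can be sketched in a few lines, assuming the log ratios M of one slide are held in a plain list. The function name and the values are illustrative; the lowess fit is not shown here.

```python
from statistics import median


def median_normalize(m_values):
    """Shift the log ratios so that their median becomes zero."""
    center = median(m_values)
    return [m - center for m in m_values]


m = [0.5, 1.5, 1.0, 2.0, 0.0]      # made-up log ratios with median 1.0
print(median_normalize(m))
```

Unlike lowess, this applies one constant shift to the whole slide, so it cannot remove a bias that depends on the intensity A.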
Print-tip normalization: print-tip group lowess normalization
Scaled print-tip: scaled print-tip group lowess normalization
X*_ij = (X_ij - median(GRID_j)) / sd(GRID_j)
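A minimal sketch of the scaling formula only, X*_ij = (X_ij - median(GRID_j)) / sd(GRID_j), without the lowess fit. It assumes each value carries the identifier of its print-tip grid and uses the sample standard deviation (N - 1). The function name and the data are hypothetical.

```python
from collections import defaultdict
from statistics import median, stdev


def scale_by_print_tip(values, grid_ids):
    """Center each value on its grid median and divide by its grid standard deviation."""
    groups = defaultdict(list)
    for v, g in zip(values, grid_ids):
        groups[g].append(v)
    # One (median, sd) pair per print-tip grid; each grid needs at least two values.
    stats = {g: (median(vs), stdev(vs)) for g, vs in groups.items()}
    return [(v - stats[g][0]) / stats[g][1] for v, g in zip(values, grid_ids)]


values = [1.0, 2.0, 3.0, 10.0, 20.0, 30.0]    # made-up values for two grids
grids = [1, 1, 1, 2, 2, 2]
print(scale_by_print_tip(values, grids))
```

After scaling, both grids have the same center and spread even though their raw values differ by a factor of ten.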
Normalization across slides
- QUANTILE
[Figure: QQ plot; mean between 8 slides]
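Quantile normalization across slides can be sketched as follows: sort each slide, average the sorted values rank by rank across slides, then give every spot the average that corresponds to its rank. This is an illustrative pure-Python version that assumes equal-length slides with no missing values and breaks ties by position; it is not the Bioconductor implementation.

```python
from statistics import mean


def quantile_normalize(slides):
    """slides: list of equal-length lists of intensities, one list per slide."""
    n = len(slides[0])
    sorted_slides = [sorted(s) for s in slides]
    # Reference distribution: mean across slides at each rank.
    reference = [mean([s[k] for s in sorted_slides]) for k in range(n)]
    normalized = []
    for s in slides:
        order = sorted(range(n), key=lambda i: s[i])   # spot indices, smallest to largest
        out = [0.0] * n
        for rank, i in enumerate(order):
            out[i] = reference[rank]                   # spot i receives the mean at its rank
        normalized.append(out)
    return normalized


slides = [[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]]    # made-up values for two slides
print(quantile_normalize(slides))
```

After normalization every slide has exactly the same set of values, so a QQ plot between any two slides falls on the diagonal.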
- LOWESS (applied in one color microarray)
Transformed data {(M, A)_i}:
M = log2(Int_1) - log2(Int_2)
A = ½·[log2(Int_1) + log2(Int_2)]
Statistics analysis
- T statistics test
The T statistic down-weights the importance of the average if the deviation is large, and vice versa:
T = mean(x) / SE(x), where SE(x) = std.dev(x) / sqrt(N) (standard error of the mean)
The blue gene has a lower T-value than the red gene.
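A minimal sketch of the T statistic, taking the standard error of the mean as std.dev(x)/sqrt(N) with the sample standard deviation. The two replicate sets are made up: both have mean 1.0, but one is much noisier.

```python
from math import sqrt
from statistics import mean, stdev


def t_statistic(x):
    """One-sample T statistic: mean divided by the standard error of the mean."""
    n = len(x)
    se = stdev(x) / sqrt(n)
    return mean(x) / se


tight = [1.0, 1.1, 0.9, 1.0]     # small deviation around the mean
noisy = [1.0, 3.0, -1.0, 1.0]    # same mean, large deviation
print(t_statistic(tight), t_statistic(noisy))
```

The noisy gene gets the lower T-value even though its average log ratio is the same.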
Top table and volcano plot
Fold change = ratio, if ratio >= 1; or -1/ratio, if ratio < 1
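The fold change rule can be written as a small helper. The function name is an assumption; the check values are the ratios quoted on the scanner slides (0.67, 0.069 and 0.15).

```python
def fold_change(ratio):
    """Signed fold change: ratio if ratio >= 1, otherwise -1/ratio."""
    if ratio <= 0:
        raise ValueError("ratio must be positive")
    return ratio if ratio >= 1 else -1.0 / ratio


for r in (0.67, 0.069, 0.15, 1.40):
    print(r, round(fold_change(r), 2))
```

With this convention a two-fold decrease (ratio 0.5) reads as -2, symmetric with a two-fold increase.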
Cluster data analysis
Missing values
Bioinformatics (2001) vol. 17, no. 6
Gene expression microarray experiments can generate data sets with multiple missing expression values. Accurate estimation of missing values is important for efficient data analysis.
Applications on the LGE
- Codelink (Ana Deckmann)
- Bioconductor has one package for Codelink
- Pipeline used:
  - Read Codelink file
  - Background corrected
  - Bad spot excluded. Flags: C, S, M, X and I
  - Normalize between slides: method LOWESS (BMC Bioinformatics 2005, 6:309)
  - Replicate validation. At least the flags: GG x GG, GG x LL, LL x GG
  - Statistical analyses: fold change >= 2, p-value <= 0.05
  - Clustering and data analyses
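The last statistical step of the pipeline (fold change >= 2, p-value <= 0.05) can be sketched as a simple filter. The gene names and values below are hypothetical. Applying the fold change cutoff to the absolute value of the signed fold change is an assumption, made so that down-regulated genes are kept as well.

```python
def select_genes(results, min_fold=2.0, max_p=0.05):
    """results: iterable of (gene, signed fold change, p-value) tuples."""
    return [gene for gene, fold, p in results
            if abs(fold) >= min_fold and p <= max_p]


# Hypothetical top-table rows, for illustration only.
results = [
    ("geneA", 5.66, 6.93e-07),   # strong up-regulation, significant
    ("geneB", -1.49, 0.01),      # significant but below the fold change cutoff
    ("geneC", -6.7, 0.2),        # large change but not significant
    ("geneD", -14.5, 0.001),     # strong down-regulation, significant
]
print(select_genes(results))
```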
LOWESS
- Ludwig (Leandra Scarpari)
- Reformat file from ScanArray (Ludwig) to ScanAlyze to be compatible with the Bioconductor packages (aroma and limma)
- Pipeline used:
  - Reformat file
  - Read ScanAlyze file
  - Background corrected
  - Bad spot excluded. Flags: 0, 1, 2 and 4
  - Normalize within arrays: method lowess (Nucleic Acids Research, 2002, Vol 30, No 4)
  - Normalize across slides: method quantile
  - Replicate validation. At least flag = 3 in 2 internal replicates for each array
  - Statistical analyses: fold change >= 2, p-value <= 0.05
  - Clustering and data analyses
- Results were compatible with the Ludwig analyses
LOWESS
QUANTILE
- LGE (two color)
- Reformat file from the LGE scanner to ScanAlyze to be compatible with the Bioconductor packages (aroma and limma)
- Pipeline used:
  - Reformat file
  - Read ScanAlyze file
  - Background corrected
  - Bad spot excluded. Flag: 2 (Ratio Int/Back < XX)
  - Normalize within arrays: method lowess
  - Normalize across slides: method quantile
  - Replicate validation. At least flag 3 in 2 internal replicates for each array
  - Statistical analyses: fold change >= 2, p-value <= 0.05
  - Data analyses
LOWESS + QUANTILE
- LGE (one color)
- Reformat file from the LGE scanner to ScanAlyze to be compatible with the Bioconductor packages (aroma and limma)
- Pipeline used:
  - Reformat file
  - Read ScanAlyze file
  - Background corrected
  - Bad spot excluded. Flag: 2 (Ratio Int/Back < XX)
  - Normalize within arrays: method median
  - Normalize across slides: method quantile
  - Replicate validation. At least flag 3 in 2 internal replicates for each array
  - Statistical analyses: fold change >= 2, p-value <= 0.05
  - Clustering and data analyses
MEDIAN + QUANTILE
More expressed in Op0d
Cutoff/background  Sample  p-value   Fold change  Identity                        Organism
0,05               G1.i10  6,93E-07  5,66         gnl|Amel_1.1|Contig6992 2e-13   Apis mellifera
                   F1.j10  2,59E-06  4,05         unknown                         Apis mellifera
                   D1.i10  7,70E-05  3,08         no hits (low quality)
0,01               B1.a2   0,        ,21          Dunce 2e-39                     Drosophila melanogaster

More expressed in Op5d
Cutoff/background  Sample  p-value   Fold change  Identity                        Organism
0,05               H4.b2   0,        ,00          gnl|Amel_1.1|Contig4902 2e-55   Apis mellifera
                   B3.i3   0,        ,35          gnl|Amel_1.1|Contig896 1e-09    Apis mellifera
                   H2.d2   0,        ,16          gnl|Amel_1.1|Contig e-16        Apis mellifera
0,01               H4.h3   0,        ,80          Groucho 1.6e-14                 Anopheles gambiae
Gene Chip
End
Comparison of normalization methods for Codelink Bioarray data
Differences between pairs of arrays in the technical replicates:
(1) Array 1 vs array 4
(2) Array 4 vs array 5
BMC Bioinformatics 2005, 6:309
- Within-slide normalization
[Figure: before and after print-tip normalization. Panels: no normalization, print-tip, scaled print-tip]
Nucleic Acids Research, 2002, Vol 30, No 4