The microarray data analysis
Ana Deckmann, Carla Judice, Jorge Lepikson, Jorge Mondego, Leandra Scarpari, Marcelo Falsarella Carazzolle, Michelle Servais, Tais Herig

Summary
- Statistics background
- Introduction to microarray
- Pre-processing microarray data
- Statistics analysis
- Applications on the LGE
- Gene Chip

Statistics background
- measurement = truth + error
- error = bias + variance
- Bias describes a systematic tendency of the measurement. Ex: the dyes Cy3 and Cy5 do not have the same efficiency.
- Variance is often normally distributed. Ex: instrumentation imperfection and biological variation.
- Error model: normalization, experimental replicates (technical and biological) and statistics

- Standard deviation
Mean: mean(x)
Standard deviation: σ
[Figure: Gaussian function, with mean(x) and σ marked]

- Mean vs median
Assume data with one outlier: x = (8, 85, 7, 9, 5, 4, 13, 6, 8)
- The mean of all x's, i.e. (x_1 + x_2 + ... + x_K)/K, is affected by the outlier: mean(x) = 16.1 (7.5 without the outlier)
- The median of all x's, i.e. the middle value of the ordered (x_1, x_2, ..., x_K), is not (if < 50% of the values are outliers): x ordered = (4, 5, 6, 7, 8, 8, 9, 13, 85), median(x) = 8.0
Use the median instead of the mean if you expect artifacts. (If there are a lot of measurements and the errors are symmetrically distributed, the median will give the same result as the mean without outliers.)
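The comparison on this slide can be checked with a few lines of Python. This is only a sketch using the standard `statistics` module and the slide's own numbers; the variable names are illustrative.

```python
from statistics import mean, median

# Example data from the slide: one outlier (85) among small values.
x = [8, 85, 7, 9, 5, 4, 13, 6, 8]

mean_all = mean(x)                            # pulled upwards by the outlier
mean_clean = mean([v for v in x if v != 85])  # mean with the outlier removed
median_all = median(x)                        # middle value of the sorted data

print(f"mean with outlier:    {mean_all:.1f}")
print(f"mean without outlier: {mean_clean:.1f}")
print(f"median:               {median_all}")
```

The median stays close to the mean of the clean data, which is why it is preferred when artifacts are expected.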

- Quantiles
A quantile is defined by the fraction (or percent) of points below the given value. That is, the 0.3 (or 30%) quantile is the point at which 30% of the data fall below and 70% fall above that value.
x = (0, 10, 40, 25, 15, 50, 70, 60)
x = (0, 10, 15, 25, 40, 50, 60, 70) ordered values
Quantile(x; 30%) = (0, 10, 15)
1st quartile = 10
3rd quartile = 60
Median = (25 + 40)/2 = 32.5
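A small sketch of the idea behind quantiles, using the slide's data. The helper `fraction_below` is a hypothetical name introduced here; it computes the fraction of points below a value, which is the quantity a quantile inverts. Quartile values depend on the interpolation convention, so only the median is reproduced.

```python
from statistics import median

def fraction_below(values, threshold):
    """Fraction of points strictly below `threshold`."""
    return sum(v < threshold for v in values) / len(values)

x = [0, 10, 40, 25, 15, 50, 70, 60]
ordered = sorted(x)

print(ordered)                 # the ordered values
print(median(x))               # mean of the two middle values, 25 and 40
print(fraction_below(x, 25))   # three of the eight points lie below 25
```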

Introduction to microarray
- Three different microarray technologies:
  - Spotted cDNA microarrays (500 to 2500 bp)
  - Spotted oligonucleotide microarrays (30 to 70 bp)
  - Affymetrix chips (25 bp)
- Can be used for: differential gene expression studies, gene co-regulation studies, gene function identification studies, time-course studies, dose-response studies, clinical diagnosis, …

Two color architecture

Codelink architecture (one color)
- Probes: 30-mers, 90% within 550 bases of the 3' end
- Targets: 10 µg biotinylated cRNA

Scanning
[Figure: excitation with the red laser and the green laser, emission, and overlay images. Higher frequency, more energy; lower frequency, less energy.]

Ludwig scanner
Ludwig flags:
- (0) Int <= Back
- (1) Irregular spots
- (3) Spot ok
- (4) Saturated
Source: Scarpari, Leandra, 2006, doctoral thesis

Codelink scanner
Codelink flags:
- (L) near background
- (C) contaminated
- (S) saturated
- (M) masked
- (G) good

LGE scanner
LGE defined flags:
- (0) Spot ok
- (1) Saturated spot
- (2) Int/Back <= 1.05
- (3) Area <= 110 or 50 (9x9 or 11x11)
Defined intensity:
- Int Cy3 = Area Cy3 * (median(Int Cy3) - median(Bkgd Cy3))
- Int Cy5 = Area Cy5 * (median(Int Cy5) - median(Bkgd Cy5))
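A minimal sketch of how the defined intensity and the LGE flags could be computed for a single spot in one channel. Several points are assumptions not stated on the slide: the saturation level (`SATURATION_LEVEL`, taken as a 16-bit ceiling), the order in which the flag conditions are checked, the use of the median spot and background values for the Int/Back ratio, and the area being the pixel count. The function names are hypothetical.

```python
from statistics import median

# Assumption: 16-bit scanner ceiling; the slide does not give the saturation level.
SATURATION_LEVEL = 65535

def integrated_intensity(spot_pixels, background_pixels):
    """Area * (median(spot) - median(background)), following the slide's definition."""
    area = len(spot_pixels)  # assumption: area is the number of pixels in the spot
    return area * (median(spot_pixels) - median(background_pixels))

def lge_flag(spot_pixels, background_pixels, min_area):
    """Return an LGE-style flag. The order of the checks is an assumption."""
    spot = median(spot_pixels)
    back = median(background_pixels)
    if spot >= SATURATION_LEVEL:
        return 1  # saturated spot
    # The ratio test is skipped when the background median is zero.
    if back > 0 and spot / back <= 1.05:
        return 2  # signal barely above background
    if len(spot_pixels) <= min_area:
        return 3  # spot area too small
    return 0      # spot ok

# Hypothetical pixel values for one spot and its local background.
print(integrated_intensity([1000] * 60, [100] * 40))
print(lge_flag([1000] * 60, [100] * 40, min_area=50))
```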

Cy3 = ; Cy5 = ; r = 0.67 (fold = -1.49)
(Target median - Bkgd median) * Area = integrated intensity
[Figure: pixels inside and outside the spot]

- Cy3 = ; Cy5 = 15488; r = 0.069; fold = -14.5; flag = 0
- Cy3 = ; Cy5 = ; r = fold = 1.40; flag = 0
- Cy3 = ; Cy5 = ; r = 1.65; flag = 0
- Cy3 = 6400; Cy5 = NA (signal:noise <= 1); flag = 2
- Cy3 = ; Cy5 = ; r = 0.15; fold = -6.7; flag = 1

Pre-processing microarray data
- Bioconductor repository
- Log intensities: R = G becomes log2 R = log2 G
Most genes have low gene expression levels. What happens here?

M vs A plot
Transformed data {(M, A)_i}:
M = log2(R) - log2(G) (minus)
A = ½·[log2(R) + log2(G)] (add)
Non-differentially expressed genes are now along the horizontal line: M = 0 ⇔ log2 R - log2 G = 0 ⇔ R = G
[Figure: M vs A plot with up-regulated and down-regulated genes marked]
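The (M, A) transformation of one spot can be sketched as follows. The function name `ma_transform` and the example intensities are illustrative, not from the slides.

```python
import math

def ma_transform(red, green):
    """Return (M, A) for one spot from its red and green intensities."""
    m = math.log2(red) - math.log2(green)           # log ratio (minus)
    a = 0.5 * (math.log2(red) + math.log2(green))   # average log intensity (add)
    return m, a

# A spot with R = G lands on the line M = 0.
print(ma_transform(500, 500))
# A spot four times brighter in red has M = 2.
print(ma_transform(1024, 256))
```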

Density plot
log2 R = red channel signal
log2 G = green channel signal

Print-tip box plot
[Figure: box plots, horizontal axis from 1 to 16]

Normalization within slides Expectation: Most genes are non-differentially expressed, i.e. most of the data points should be around M=0.

- Median normalization: sets the median of the log intensity ratios to zero (median value = 0)
- Lowess normalization: global lowess normalization
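Median normalization is simple enough to sketch directly: subtract the median of the M values of a slide from every M value. The function name and the example values are hypothetical. Lowess normalization needs a local regression fit and is not shown here.

```python
from statistics import median

def median_normalize(m_values):
    """Shift log ratios so that their median becomes zero."""
    center = median(m_values)
    return [m - center for m in m_values]

# Hypothetical log ratios from one slide with a systematic shift.
m_values = [0.5, 1.0, 1.5, 3.0, 0.0]
normalized = median_normalize(m_values)
print(normalized)
print(median(normalized))
```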

- Print-tip normalization: print-tip group lowess normalization
X*_ij = (X_ij - median(GRID_j)) / sd(GRID_j)
- Scaled print-tip: scaled print-tip group lowess normalization
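The scaling formula X*_ij = (X_ij - median(GRID_j)) / sd(GRID_j) can be sketched per print-tip group as below. This shows only the centering and scaling step, not the lowess fit. The function name, the dictionary layout and the use of the sample standard deviation are assumptions.

```python
from statistics import median, stdev

def scale_print_tip_groups(groups):
    """groups: dict mapping a print-tip group id j to its list of values X_ij."""
    scaled = {}
    for grid_id, values in groups.items():
        center = median(values)
        spread = stdev(values)  # assumption: sample standard deviation
        scaled[grid_id] = [(v - center) / spread for v in values]
    return scaled

# Hypothetical values for two print-tip groups with different spreads.
groups = {1: [1.0, 2.0, 3.0], 2: [10.0, 20.0, 30.0]}
scaled = scale_print_tip_groups(groups)
print(scaled)
```

After scaling, both groups are centered on zero and have the same spread, which makes them comparable within the slide. Each group needs at least two values and a non-zero spread.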

Normalization across slides
- QUANTILE
[Figure: QQ plot; mean between 8 slides]
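A minimal sketch of quantile normalization across slides, assuming the usual procedure: sort each slide, average the values at each rank across slides, then give every spot the average for its rank. The function name is hypothetical and ties are broken by position, which is a simplification.

```python
def quantile_normalize(arrays):
    """arrays: list of equal-length lists, one list of intensities per slide."""
    n = len(arrays[0])
    if any(len(a) != n for a in arrays):
        raise ValueError("all slides must have the same number of spots")
    # Mean across slides of the values at each rank.
    sorted_arrays = [sorted(a) for a in arrays]
    rank_means = [sum(s[i] for s in sorted_arrays) / len(arrays) for i in range(n)]
    result = []
    for a in arrays:
        order = sorted(range(n), key=lambda i: a[i])  # spot indices by increasing value
        normalized = [0.0] * n
        for rank, idx in enumerate(order):
            normalized[idx] = rank_means[rank]
        result.append(normalized)
    return result

# Hypothetical intensities for two slides with three spots each.
normalized = quantile_normalize([[5, 2, 3], [4, 1, 6]])
print(normalized)
```

After this step every slide has exactly the same distribution of values, so a QQ plot between any two slides falls on the diagonal.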

- LOWESS (applied in one color microarray)
Transformed data {(M, A)_i}:
M = log2(Int1) - log2(Int2)
A = ½·[log2(Int1) + log2(Int2)]

Statistics analysis
- T statistics test
The T statistic down-weights the importance of the average if the deviation is large, and vice versa:
T = mean(x) / SE(x), where SE(x) = std.dev(x) / sqrt(N) is the standard error of the mean.
The blue gene has a lower T-value than the red gene.
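A short sketch of the T statistic for the replicated log ratios of one gene, using the usual standard error of the mean, std.dev(x) / sqrt(N). The function name and the two example genes are hypothetical.

```python
import math
from statistics import mean, stdev

def t_statistic(values):
    """One-sample T statistic: mean divided by the standard error of the mean."""
    n = len(values)
    standard_error = stdev(values) / math.sqrt(n)
    return mean(values) / standard_error

# Two hypothetical genes with the same average but different deviations.
tight_gene = [1.9, 2.0, 2.1]
spread_gene = [0.0, 2.0, 4.0]
print(t_statistic(tight_gene))   # large T: small deviation
print(t_statistic(spread_gene))  # small T: large deviation
```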

Top table and volcano plot
Fold change = ratio, if ratio >= 1; or -1/ratio, if ratio < 1
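The signed fold change defined on this slide can be sketched as follows. The function name is illustrative; the example ratios are the ones shown for the example spots earlier in the presentation.

```python
def fold_change(ratio):
    """Signed fold change: the ratio itself if >= 1, otherwise -1/ratio."""
    if ratio <= 0:
        raise ValueError("ratio must be positive")
    return ratio if ratio >= 1 else -1.0 / ratio

for r in (1.40, 0.67, 0.15, 0.069):
    print(r, round(fold_change(r), 2))
```

With this convention a two-fold decrease (ratio 0.5) is reported as -2, symmetric with a two-fold increase.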

Cluster data analysis

Missing values
Bioinformatics (2001) vol. 17, no. 6
Gene expression microarray experiments can generate data sets with multiple missing expression values. Accurate estimation of missing values is important for efficient data analysis.

Applications on the LGE
- Codelink (Ana Deckmann)
- There is one package in Bioconductor for Codelink
- Pipeline used:
  - Read Codelink file
  - Normalize between slides: method LOWESS (BMC Bioinformatics 2005, 6:309)
  - Background corrected
  - Bad spots excluded. Flags: C, S, M, X and I
  - Clustering and data analyses
  - Replicate validation. At least the flags: GG x GG, GG x LL, LL x GG
  - Statistical analyses: fold change >= 2, p-value <= 0.05

LOWESS

- Ludwig (Leandra Scarpari)
- Reformat the file from ScanArray (Ludwig) to ScanAlyze to be compatible with the Bioconductor packages (aroma and limma)
- Pipeline used:
  - Background corrected
  - Reformat file
  - Read ScanAlyze file
  - Normalize across slides: method quantile
  - Clustering and data analyses. Results were compatible with the Ludwig analyses
  - Bad spots excluded. Flags: 0, 1, 2 and 4
  - Normalize within arrays: method lowess (Nucleic Acids Research, 2002, Vol 30, No 4)
  - Replicate validation. At least flag = 3 in 2 internal replicates for each array
  - Statistical analyses: fold change >= 2, p-value <= 0.05

LOWESS

QUANTILE

- LGE (two color)
- Reformat the file from the LGE scanner to ScanAlyze to be compatible with the Bioconductor packages (aroma and limma)
- Pipeline used:
  - Background corrected
  - Reformat file
  - Read ScanAlyze file
  - Normalize within arrays: method lowess
  - Normalize across slides: method quantile
  - Data analyses
  - Bad spots excluded. Flag: 2 (ratio Int/Back < XX)
  - Replicate validation. At least flag 3 in 2 internal replicates for each array
  - Statistical analyses: fold change >= 2, p-value <= 0.05

LOWESS + QUANTILE

- LGE (one color)
- Reformat the file from the LGE scanner to ScanAlyze to be compatible with the Bioconductor packages (aroma and limma)
- Pipeline used:
  - Background corrected
  - Reformat file
  - Read ScanAlyze file
  - Normalize within arrays: method median
  - Normalize across slides: method quantile
  - Clustering and data analyses
  - Bad spots excluded. Flag: 2 (ratio Int/Back < XX)
  - Replicate validation. At least flag 3 in 2 internal replicates for each array
  - Statistical analyses: fold change >= 2, p-value <= 0.05
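The final selection step shared by these pipelines (fold change >= 2 and p-value <= 0.05) can be sketched as a simple filter. Applying the threshold to the absolute value of the signed fold change is an assumption, and the gene names and numbers below are hypothetical, not results from the slides.

```python
def select_genes(results, min_fold=2.0, max_p=0.05):
    """Keep genes with |fold change| >= min_fold and p-value <= max_p."""
    return [
        gene for gene in results
        if abs(gene["fold"]) >= min_fold and gene["p"] <= max_p
    ]

# Hypothetical results for four genes.
results = [
    {"gene": "gene_a", "fold": 5.7, "p": 1e-6},
    {"gene": "gene_b", "fold": -3.1, "p": 0.2},
    {"gene": "gene_c", "fold": 1.4, "p": 0.01},
    {"gene": "gene_d", "fold": -2.0, "p": 0.05},
]
selected = [g["gene"] for g in select_genes(results)]
print(selected)
```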

MEDIAN + QUANTILE

More highly expressed in Op0d
Cutoff/background   Sample   p-value    Fold change   Identity                         Organism
0.05                G1.i10   6.93E-07   5.66          gnl|Amel_1.1|Contig6992 2e-13    Apis mellifera
                    F1.j10   2.59E-06   4.05          unknown                          Apis mellifera
                    D1.i10   7.70E-05   3.08          no hits (low quality)
0.01                B1.a2    0.?        ?.21          Dunce 2e-39                      Drosophila melanogaster

More highly expressed in Op5d
Cutoff/background   Sample   p-value    Fold change   Identity                         Organism
0.05                H4.b2    0.?        ?.00          gnl|Amel_1.1|Contig4902 2e-55    Apis mellifera
                    B3.i3    0.?        ?.35          gnl|Amel_1.1|Contig896 1e-09     Apis mellifera
                    H2.d2    0.?        ?.16          gnl|Amel_1.1|Contig e-16         Apis mellifera
0.01                H4.h3    0.?        ?.80          Groucho 1.6e-14                  Anopheles gambiae

Gene Chip

End

Comparison of normalization methods for Codelink Bioarray data
Differences between pairs of arrays in the technical replicates:
(1) Array 1 vs array 4
(2) Array 4 vs array 5
BMC Bioinformatics 2005, 6:309

- Within slide normalization
[Figure: before and after print-tip normalization. Panels: no normalization, print-tip, scaled print-tip]
Nucleic Acids Research, 2002, vol 30, No 4