Introduction to microarray technology and analysis


Introduction to microarray technology and analysis. Carol Bult, Associate Professor, The Jackson Laboratory, carol.bult@jax.org

Measuring Gene Expression Idea: measure the amount of mRNA to see which genes are being expressed in (used by) the cell. Measuring protein might be more direct, but is currently harder.

Central Assumption of Gene Expression Microarrays: The level of a given mRNA is positively correlated with the expression of the associated protein. Higher mRNA levels mean higher protein expression; lower mRNA levels mean lower protein expression. Other factors that affect this relationship: protein degradation, mRNA degradation, polyadenylation, codon preference, translation rates, alternative splicing, translation lag…

Principal Uses of Microarrays: Genome-scale gene expression analysis. Differential gene expression between two (or more) sample types. Responses to environmental factors. Disease processes (e.g. cancer). Effects of drugs. Identification of genes associated with clinical outcomes (e.g. survival).

Microarray example: Biomarker identification - lung cancer. [Figure: gene expression matrix; axes: Samples, Genes.] Gene expression patterns segregate the four major morphological lung tumor subtypes. Patterns of gene expression promise to refine traditional morphologic classification of lung cancer. Can we distinguish subsets of genes that would allow molecular-level classification of tumor subtypes? Garber, Troyanskaya et al. Diversity of gene expression in adenocarcinoma of the lung. PNAS 2001, 98(24):13784-9.

Data partitioning clinically important: Patient survival for lung cancer subgroups. [Figure: cumulative survival (0 to 1) versus time (months, 0 to 60) for Groups 1, 2 and 3; p = 0.002 for Gr. 1 vs. Gr. 3.] Can we identify individual genes that can predict patient survival for adenocarcinoma lung cancer? Garber, Troyanskaya et al. Diversity of gene expression in adenocarcinoma of the lung. PNAS 2001, 98(24):13784-9.

Workflow: Biological question (differentially expressed genes, sample class prediction, etc.) -> Experimental design -> Microarray experiment -> Image analysis -> Normalization -> Estimation, Testing, Clustering, Discrimination -> Biological verification and interpretation. http://statwww.epfl.ch/davison/teaching/Microarrays/ETHZ/

Technology basics: Microarrays are composed of short, specific DNA sequences attached to a glass or silicon slide at high density. A microarray works by exploiting the ability of an mRNA molecule to bind specifically to, or hybridize with, the DNA template from which it originated. RNA or DNA from the sample of interest is fluorescently labeled so that relative or absolute abundances can be quantitatively measured. (Chris Workman)

Two color vs single color Bakel and Holstege. 2007. http://www.cell-press.com/misc/page?page=ETBR

Other applications of microarray technology (besides measuring gene expression): DNA copy number analysis, SNP analysis, ChIP-chip (interaction data), competitive growth assays, …

Major technologies: cDNA probes (> 200 nt), usually produced by PCR, attached to either nylon or glass supports. Oligonucleotides (25-80 nt) attached to glass support. Oligonucleotides (25-30 nt) synthesized in situ on silica wafers (Affymetrix). Probes attached to tagged beads.

cDNA Microarray Design. Probe selection: non-redundant set of probes; includes genes of interest to project; corresponds to physically available clones. Chip layout: grouping of probes by function; correspondence between wells in microtiter plates and spots on the chip.

Building the chip. Ngai Lab arrayer, UC Berkeley. Print-tip head.

http://transcriptome.ens.fr/sgdb/presentation/principle.php

Example dual channel cDNA array results (Chris Workman)

Affymetrix GeneChips: Probes are oligos synthesized in situ using a photolithographic approach. There are at least 5 oligos per cDNA, plus an equal number of negative controls. The apparatus requires a fluidics station for hybridization and a special scanner. Only a single fluorochrome is used per hybridization.

http://genome.ucsc.edu/cgi-bin/hgTracks

Affy: There may be 5,000-100,000 probe sets per chip. A probe set = 11-20 PM/MM pairs.

http://www.weizmann.ac.il/home/ligivol/pictures/system.jpg

Interpreting Affymetrix Output: Perfect Match/Mismatch Strategy. Each probe is designed to be perfectly complementary to a target sequence, and a partner probe is generated that is identical except for a single base mismatch in its center. These probe pairs, called the Perfect Match probe (PM) and the Mismatch probe (MM), allow the quantitation and subtraction of signals caused by non-specific cross-hybridization. The difference in hybridization signals between the partners serves as an indicator of specific target abundance. (Moustafa Ghanem)

Experimental Design Bakel and Holstege. 2007. http://www.cell-press.com/misc/page?page=ETBR

Microarray Analysis: Controlling for the Known Knowns and Unknown Unknowns - Donald Rumsfeld, former Secretary of Defense

http://www.bioconductor.org/workshops/2003/NGFN03/experimental-design.pdf

Selected references http://discover.nci.nih.gov/microarrayAnalysis/Experimental.Design.jsp Best advice? Consult a statistician before you start!

Statistical Power: The probability that a test will reject a null hypothesis if it is false. Type I and Type II errors. Type I: rejecting the null hypothesis when it is actually true (false positive). We say there is a difference in gene expression between gene A and gene B when there really isn’t. Type II: failing to reject the null hypothesis when it is actually false (false negative). We say there is no difference in gene expression between gene A and gene B when there actually is! Power = 1 - probability of a Type II error.

Power in Perspective: What are the 4 main components that determine what conclusions are drawn from a study? Sample size: number of units. Effect size: signal to noise. Alpha level: significance level. Power: likelihood of detecting a treatment effect if it is there.
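
The link between the four components can be sketched numerically. The Python below is a rough illustration only, assuming a two-sided, two-sample z-test with equal group sizes and a standardized effect size (mean difference divided by the standard deviation); the function name `two_sample_power` and its parameters are hypothetical and do not come from the slides.

```python
from statistics import NormalDist


def two_sample_power(effect_size, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-sample z-test (normal approximation).

    effect_size: standardized difference between the group means (assumption).
    n_per_group: number of units in each of the two groups.
    alpha: significance level.
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    # Expected value of the test statistic when the effect is real.
    shift = effect_size * (n_per_group / 2) ** 0.5
    # Probability of landing in either rejection region.
    return (1 - z.cdf(z_crit - shift)) + z.cdf(-z_crit - shift)


if __name__ == "__main__":
    # Power grows with the number of replicates for a fixed effect size and alpha.
    for n in (3, 5, 10, 20):
        print(n, round(two_sample_power(effect_size=1.0, n_per_group=n), 3))
```

Fixing any three of sample size, effect size, alpha and power determines the fourth, which is why the number of replicates should be chosen before the experiment starts.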

Check out this pithy description of Statistical Power and Hypothesis Testing http://www.socialresearchmethods.net/kb/power.php

MicroArray Image Analysis Based on slides from Robin Liechti (robin.liechti@ie-bpv.unil.ch)

Microarray analysis: Array construction, hybridisation, scanning. Quantitation of fluorescence signals. Data visualisation. Meta-analysis (clustering). More visualisation.

Technical pseudo-colour image sample (labelled) probe (on chip) [image from Jeremy Buhler]

Experimental design. Track what’s on the chip: which spot corresponds to which gene. Duplicate experimental spots: reproducibility. Controls: DNAs spotted on glass: positive probe (induced or repressed), negative probe (bacterial genes on human chip); oligos on glass or synthesised on chip (Affymetrix): point mutants (hybridisation plus/minus).

Images from scanner. Resolution: standard 10 µm [currently, max 5 µm]; a 100 µm spot on the chip = 10 pixels in diameter. Image format: TIFF (tagged image file format), 16 bit (65,536 levels of grey); a 1 cm x 1 cm image at 16 bit = 2 MB (uncompressed); other formats exist, e.g. SCN (used at Stanford University). Separate image for each fluorescent sample: channel 1, channel 2, etc.
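
The file-size figure follows from simple arithmetic. The sketch below assumes the resolution values on this slide are pixel sizes in micrometres; the function name `image_size_bytes` and its parameters are hypothetical.

```python
def image_size_bytes(width_cm, height_cm, pixel_um=10, bits_per_pixel=16):
    """Uncompressed size of a single-channel scan, in bytes.

    pixel_um: edge length of one pixel in micrometres (assumed meaning of
    the scanner "resolution" quoted on the slide).
    """
    pixels_per_cm = 10_000 // pixel_um  # 1 cm = 10,000 micrometres
    n_pixels = (width_cm * pixels_per_cm) * (height_cm * pixels_per_cm)
    return n_pixels * bits_per_pixel // 8


if __name__ == "__main__":
    # 1 cm x 1 cm at 10 micrometres: 1000 x 1000 pixels, 2 bytes each.
    print(image_size_bytes(1, 1))
    # Halving the pixel size quadruples the number of pixels.
    print(image_size_bytes(1, 1, pixel_um=5))
```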

Images in analysis software: The two 16-bit images (cy3, cy5) are compressed into 8-bit images. Goal: display fluorescence intensities for both wavelengths using a 24-bit RGB overlay image. RGB image: Blue values (B) are set to 0; Red values (R) are used for cy5 intensities; Green values (G) are used for cy3 intensities. Qualitative representation of results.
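
A minimal sketch of the overlay in Python, using plain nested lists as images. The slide does not say how the 16-bit values are compressed to 8 bits; dropping the low byte (a linear rescale) is an assumption here, and real software may use a square-root or logarithmic scale instead. The function names are hypothetical.

```python
def to_8bit(value16):
    """Compress a 16-bit intensity (0..65535) to 8 bits (0..255).

    Assumption: linear compression by discarding the low byte.
    """
    return value16 >> 8


def overlay_pixel(cy3, cy5):
    """Return an (R, G, B) tuple: red from cy5, green from cy3, blue set to 0."""
    return (to_8bit(cy5), to_8bit(cy3), 0)


def overlay_image(cy3_image, cy5_image):
    """Combine two equally sized 16-bit images into one 24-bit RGB overlay."""
    return [
        [overlay_pixel(g, r) for g, r in zip(row3, row5)]
        for row3, row5 in zip(cy3_image, cy5_image)
    ]


if __name__ == "__main__":
    cy3 = [[65535, 0, 65535]]
    cy5 = [[65535, 65535, 0]]
    # Equal signal in both channels shows as yellow, cy5 only as red, cy3 only as green.
    print(overlay_image(cy3, cy5))
```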

Images: examples. Pseudo-color overlay of cy3 and cy5.
Spot color | Signal strength | Gene expression
yellow | Control = perturbed | unchanged
red | Control < perturbed | induced
green | Control > perturbed | repressed

Processing of images. Addressing or gridding: assigning coordinates to each of the spots. Segmentation: classification of pixels either as foreground or as background. Intensity extraction (for each spot): foreground fluorescence intensity pairs (R, G), background intensities, quality measures.

Addressing (I) (ScanAlyze). Parameters to address the spot positions: separation between rows and columns of grids; individual translation of grids; separation between rows and columns of spots within each grid; small individual translation of spots; overall position of the array in the image. The basic structure of the images is known (determined by the arrayer).

Addressing (II) The measurement process depends on the addressing procedure Addressing efficiency can be enhanced by allowing user intervention (slow!) Most software systems now provide for both manual and automatic gridding procedures

Segmentation (I): Classification of pixels as foreground or background -> fluorescence intensities are calculated for each spot as a measure of transcript abundance. Production of a spot mask: the set of foreground pixels for each spot.

Segmentation (II). Segmentation methods: fixed circle segmentation, adaptive circle segmentation, adaptive shape segmentation, histogram segmentation. Fixed circle: ScanAlyze, GenePix, QuantArray. Adaptive circle: GenePix, Dapple. Adaptive shape: Spot, region growing and watershed. Histogram method: ImaGene, QuantArray, DeArray and adaptive thresholding.

Fixed circle segmentation: Fits a circle with a constant diameter to all spots in the image. Easy to implement. The spots need to be of the same shape and size. [Figure: bad example!]
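
To illustrate why fixed circle segmentation is easy to implement, here is a small Python sketch that builds a circular spot mask and averages the pixels inside it. It assumes the spot centre is already known from the addressing step; the function names and the toy image are hypothetical.

```python
def fixed_circle_mask(height, width, center_row, center_col, radius):
    """Return a 2D list of booleans, True for pixels inside the circle (foreground)."""
    r2 = radius * radius
    return [
        [(r - center_row) ** 2 + (c - center_col) ** 2 <= r2 for c in range(width)]
        for r in range(height)
    ]


def mean_foreground(image, mask):
    """Average pixel value over the spot mask."""
    values = [
        px
        for img_row, mask_row in zip(image, mask)
        for px, inside in zip(img_row, mask_row)
        if inside
    ]
    return sum(values) / len(values)


if __name__ == "__main__":
    mask = fixed_circle_mask(5, 5, center_row=2, center_col=2, radius=1)
    # Toy image: bright plus-shaped spot on a dark background.
    image = [[100 if mask[r][c] else 0 for c in range(5)] for r in range(5)]
    print(mean_foreground(image, mask))
```

The same radius is reused for every spot, so a spot that is smaller, larger, or off-centre relative to the circle contaminates the foreground estimate with background pixels or loses signal.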

Adaptive circle segmentation: Dapple finds spots by detecting edges of spots (second derivative). The circle diameter is estimated separately for each spot. Problematic if spots are oval.

Adaptive shape segmentation: Specification of starting points or seeds. Regions grow outwards from the seed points preferentially according to the difference between a pixel’s value and the running mean of values in an adjoining region.

Histogram segmentation: Uses a target mask chosen to be larger than any other spot. Foreground and background intensity are determined from the histogram of pixel values for pixels within the masked area. Example: QuantArray. Background: mean between 5th and 20th percentile. Foreground: mean between 80th and 95th percentile. Unstable when a large target mask is set to compensate for variation in spot size. [Figure: histogram with background and foreground regions marked.]
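
The percentile rule quoted for QuantArray can be sketched in a few lines of Python. This is an illustration of the idea only, not QuantArray's actual code: percentiles are taken by rank on the sorted pixel values, and the function names are hypothetical.

```python
def percentile_band_mean(pixels, low_pct, high_pct):
    """Mean of the sorted pixel values whose rank lies between two percentiles."""
    ordered = sorted(pixels)
    n = len(ordered)
    lo = n * low_pct // 100
    hi = max(lo + 1, n * high_pct // 100)  # keep at least one pixel in the band
    band = ordered[lo:hi]
    return sum(band) / len(band)


def histogram_segment(pixels):
    """Return (foreground, background) estimates for the pixels in a target mask."""
    background = percentile_band_mean(pixels, 5, 20)
    foreground = percentile_band_mean(pixels, 80, 95)
    return foreground, background


if __name__ == "__main__":
    # Toy target mask: 100 pixels with values 0..99.
    print(histogram_segment(list(range(100))))
```

If the target mask is much larger than the spot, most pixels are background and the 80th to 95th percentile band may no longer fall inside the spot, which is the instability noted on the slide.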

Information extraction

Spot intensity: The total amount of hybridization for a spot is proportional to the total fluorescence at the spot. Spot intensity = sum of pixel intensities within the spot mask. Since later calculations are based on ratios between cy5 and cy3, we compute the average* pixel value over the spot mask. *Alternative: use ratios of medians instead of means.

Background intensity. Motivation: a spot’s measured intensity includes a contribution of non-specific hybridization and other chemicals on the glass. Fluorescence from regions not occupied by DNA should be different from regions occupied by DNA -> it could be interesting to use local negative controls (spotted DNA that should not hybridize). Different background methods: local background, morphological opening, constant background, no adjustment.

Local background: Focuses on small regions surrounding the spot mask and takes the median of pixel values in this region. Most software packages implement such an approach: ScanAlyze, ImaGene, Spot, GenePix. By not considering the pixels immediately surrounding the spots, the background estimate is less sensitive to the performance of the segmentation procedure.

Morphological opening (Spot): Applied to the original images R and G. Use a square structuring element with side length at least twice as large as the spot separation distance. This removes all the spots and generates an image that is an estimate of the background for the entire slide. For individual spots, the background is estimated by sampling this background image at the nominal center of the spot. Gives a lower and less variable background estimate.
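
A morphological opening is an erosion (local minimum) followed by a dilation (local maximum). The Python sketch below applies it to a toy image with a square window, clipping the window at the image edges; it is an illustration of the operation under these assumptions, not the Spot package's implementation, and the names are hypothetical.

```python
def _window_filter(image, half, reduce_fn):
    """Apply reduce_fn (min or max) over a (2*half+1)-sided square window."""
    rows, cols = len(image), len(image[0])
    out = []
    for r in range(rows):
        out_row = []
        for c in range(cols):
            window = [
                image[rr][cc]
                for rr in range(max(0, r - half), min(rows, r + half + 1))
                for cc in range(max(0, c - half), min(cols, c + half + 1))
            ]
            out_row.append(reduce_fn(window))
        out.append(out_row)
    return out


def morphological_opening(image, half):
    """Erosion followed by dilation with the same square structuring element."""
    eroded = _window_filter(image, half, min)
    return _window_filter(eroded, half, max)


if __name__ == "__main__":
    # Toy slide: background of 10 with one 3x3 spot of 200 in the middle.
    image = [[10] * 7 for _ in range(7)]
    for r in range(2, 5):
        for c in range(2, 5):
            image[r][c] = 200
    background = morphological_opening(image, half=2)  # 5x5 window, larger than the spot
    # Background estimate sampled at the nominal spot centre.
    print(background[3][3])
```

Because the window is larger than the spot, every window contains some background pixels, so the erosion removes the spot and the dilation restores the background level.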

Constant background: Global method which subtracts a constant background for all spots. Some findings suggest that the binding of fluorescent dyes to ‘negative control spots’ is lower than the binding to the glass slide -> it is more meaningful to estimate background based on a set of negative control spots. If there are no negative control spots: approximation of the average background = third percentile of all the spot foreground values.

No adjustment Do not consider the background

Quality measures (-> Flag): How good are foreground and background measurements? Variability measures in pixel values within each spot mask. Spot size. Circularity measure. Relative signal to background intensity. b-value: fraction of background intensities less than the median foreground intensity. p-score: extent to which the position of a spot deviates from a rigid rectangular grid. Based on these measurements, one can flag a spot.

Summary. [Figure labels: Spot, GenePix, ScanAlyze.] M = log2(R/G), A = log2 √(R·G). The choice of background correction method has a larger impact on the log-intensity ratios than the segmentation method used. The morphological opening method provides a better estimate of background than other methods: low within- and between-slide variability of the log2 R/G. Background adjustment has a larger impact on low intensity spots.
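
The two quantities M and A defined on this slide can be computed per spot as in the Python sketch below. It assumes background-corrected, strictly positive red (R) and green (G) intensities; the function name `ma_values` is hypothetical.

```python
import math


def ma_values(red, green):
    """Return (M, A) for one spot: M = log2(R/G), A = log2(sqrt(R*G))."""
    m = math.log2(red / green)
    a = 0.5 * math.log2(red * green)
    return m, a


if __name__ == "__main__":
    # A four-fold increase and a four-fold decrease give M of equal size and
    # opposite sign, with the same average intensity A.
    print(ma_values(8, 2))
    print(ma_values(2, 8))
```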

Selected references. Yang, Y. H., Buckley, M. J., Dudoit, S. and Speed, T. P. (2001), ‘Comparisons of methods for image analysis on cDNA microarray data’. Technical report #584, Department of Statistics, University of California, Berkeley. http://www.stat.berkeley.edu/users/terry/zarray/Html/papersindex.html. Yang, Y. H., Buckley, M. J. and Speed, T. P. (2001), ‘Analysis of cDNA microarray images’. Briefings in Bioinformatics, 2 (4), 341-349. Excellent review in concise format!

Download the limma package and work through the Swirl zebrafish example. http://pbil.univ-lyon1.fr/library/limma/doc/usersguide.html

Normalization - two problems How do we detect biases? Which genes should we use for estimating biases among chips/channels? How do we remove the biases? http://www.tau.ac.il/lifesci/bioinfo/teaching/2005-2006/Normalization-Diff-Jan06.ppt

Why normalize? Microarray data have significant systematic variation both within arrays and between arrays that is not true biological variation. Accurate comparison of genes’ relative expression within and across conditions requires normalization of effects. Sources of variation: spatial location on the array; dye biases which vary with spot intensity; plate origin; printing/spotting quality; experimenter. http://statwww.epfl.ch/davison/teaching/Microarrays/ETHZ/

Why is normalization important? Experiment: comparison of the gene expression response to a drug in mouse heart and kidney. Most biological effects are swamped by systematic effects! Source: http://www.partek.com

Other Sources of Systematic Bias. Individual factors: Print (20% - 30%), Experimenter (20% - 30%), Organism (3% - 10%), Date (5%), Software (2%), Number of tips (3%). Interactions: Print - Experimenter (40%), Print - Date (40%), Experimenter - Date (40%). (Based on ~4,600 experiments in Stanford Microarray Database analyzed by ANOVA; slide from Catherine Ball.)

Clearly visible plate effects KO #8 Probes: ~6,000 cDNAs, including 200 related to lipid metabolism. Arranged in a 4x4 array of 19x21 sub-arrays.

Spatial Biases. Solution: spatial background estimation/subtraction. (Gavin Sherlock)

Spatial plots: background from two slides

Highlighting extreme log ratios Top (black) and bottom (green) 5% of log ratios

Pin group (sub-array) effects Lowess lines through points from pin groups Boxplots of log ratios by pin group

Boxplots and highlighting pin group effects Log-ratios Print-tip groups Clear example of spatial bias

Time of printing effects spot number Green channel intensities (log2G). Printing over 4.5 days. The previous slide depicts a slide from this print run.

Normalization in a nutshell: The goal is to measure the ratios of gene expression levels, (ratio)i = Ri/Gi, where Ri and Gi are, respectively, the measured red and green intensities for the ith spot. In a self-hybridization, we would expect all ratios to be equal to one: Ri/Gi = 1 for all i. But they probably won’t be… Why? Noise (systematic bias) and signal (true differences). Normalization brings appropriate ratios closer to 1.

The Starting Point: The Ratio (2-color arrays) (Gavin Sherlock)

Log ratios treat up- and down-regulated genes equally (two-color arrays): log2(1) = 0, log2(2) = 1, log2(1/2) = -1. (Gavin Sherlock)

A note about Affymetrix (1-color) pre-processing. [Figure: typical Affymetrix probe intensity distribution, before and after log transform.]

Normalization methods

Which Genes to use for bias detection? All genes on the chip. Assumption: Most of the genes are equally expressed in the compared samples; the proportion of the differential genes is low (<20%). Limits: Not appropriate when comparing highly heterogeneous samples (different tissues). Not appropriate for analysis of ‘dedicated chips’ (apoptosis chips, inflammation chips etc). http://www.tau.ac.il/lifesci/bioinfo/teaching/2005-2006/Normalization-Diff-Jan06.ppt

Which Genes to use for bias detection? Housekeeping genes. Assumption: based on prior knowledge, a set of genes can be regarded as equally expressed in the compared samples. Affy novel chips: ‘normalization set’ of 100 genes. NHGRI’s cDNA microarrays: 70 "house-keeping" genes set. Limits: The validity of the assumption is questionable. Housekeeping genes are usually expressed at high levels, so they are not informative for the low intensities range. http://www.tau.ac.il/lifesci/bioinfo/teaching/2005-2006/Normalization-Diff-Jan06.ppt

Which Genes to use for bias detection? Spiked-in controls from other organism, over a range of concentrations. Limits: low number of controls, so less robust; can’t detect biases due to differences in RNA extraction protocols. “Invariant set”: trying to identify genes that are expressed at similar levels in the compared samples without relying on any prior knowledge. Rank the genes in each chip according to their expression level. Find genes with small change in ranks. http://www.tau.ac.il/lifesci/bioinfo/teaching/2005-2006/Normalization-Diff-Jan06.ppt
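
The two-step "invariant set" recipe (rank the genes, keep those whose rank barely changes) can be sketched as follows. This is a simplified, single-pass illustration in Python; published invariant-set methods such as the one in dChip are iterative, and the threshold `max_rank_shift` and the function names are hypothetical.

```python
def ranks(values):
    """Rank of each value (0 = lowest); ties are broken by position."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order):
        result[idx] = rank
    return result


def invariant_set(chip_a, chip_b, max_rank_shift):
    """Indices of genes whose rank differs by at most max_rank_shift between chips."""
    rank_a, rank_b = ranks(chip_a), ranks(chip_b)
    return [
        i
        for i, (x, y) in enumerate(zip(rank_a, rank_b))
        if abs(x - y) <= max_rank_shift
    ]


if __name__ == "__main__":
    chip_a = [10, 20, 30, 40, 50]
    chip_b = [12, 22, 500, 41, 35]  # genes 2 and 4 change rank between chips
    print(invariant_set(chip_a, chip_b, max_rank_shift=1))
```

The selected genes can then serve as the reference set for estimating the bias between the two chips.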

1. Global normalization (Scaling): A single normalization factor (k) is computed for balancing chips/channels: Xinorm = k*Xi, or log2 R/G -> log2 R/G - c (2-color). Multiplying intensities by this factor equalizes the mean (median) intensity among compared chips. Assumption: Total RNA (mass) used is the same for both samples. So, averaged across thousands of genes, total hybridization should be the same for both samples.

Global Normalization (1-color, e.g. Affymetrix): Xinorm = k*Xi. [Figure: before and after.] http://www.tau.ac.il/lifesci/bioinfo/teaching/2005-2006/Normalization-Diff-Jan06.ppt

Global Normalization (2-color): log2 R/G -> log2 R/G - c, where c = log2 (∑Ri / ∑Gi). [Figure: histograms (frequency) of log-ratios, un-normalized and normalized.] (Gavin Sherlock)
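
The 2-color formula translates directly into code. The Python sketch below assumes two equal-length lists of positive red and green spot intensities from one array; the function name `global_normalize` is hypothetical.

```python
import math


def global_normalize(red, green):
    """Return normalized log-ratios: log2(R/G) - c, with c = log2(sum(R) / sum(G))."""
    c = math.log2(sum(red) / sum(green))
    return [math.log2(r / g) - c for r, g in zip(red, green)]


if __name__ == "__main__":
    # Toy array in which the red channel is uniformly twice as bright as green.
    red = [200, 400, 800]
    green = [100, 200, 400]
    print(global_normalize(red, green))
```

Because a single constant c is subtracted from every spot, the whole distribution of log-ratios is shifted without changing its shape, which is why this method cannot remove biases that vary with intensity.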

2. Intensity-dependent normalization (Yang, Speed) (Lowess – local linear fit) Compensate for intensity-dependent biases http://www.tau.ac.il/lifesci/bioinfo/teaching/2005-2006/Normalization-Diff-Jan06.ppt

Detect Intensity-dependent Biases: M vs A plots (also called R-I plots). X axis: A, average intensity, A = 0.5*log(Cy3*Cy5). Y axis: M, log ratio, M = log(Cy3/Cy5). http://www.tau.ac.il/lifesci/bioinfo/teaching/2005-2006/Normalization-Diff-Jan06.ppt

Intensity-dependent bias. High intensities: M>0, Cy3>Cy5. Low intensities: M<0, Cy3<Cy5. * Global normalization cannot remove intensity-dependent biases. [Figure: M = log(Cy3/Cy5) versus A.] http://www.tau.ac.il/lifesci/bioinfo/teaching/2005-2006/Normalization-Diff-Jan06.ppt

We expect the M vs A plot to look like: [Figure: M = log(Cy3/Cy5) versus A.] http://www.tau.ac.il/lifesci/bioinfo/teaching/2005-2006/Normalization-Diff-Jan06.ppt

LOWESS (Locally Weighted Scatterplot Smoothing): Local linear regression model. Tri-cube weight function. Least squares. [Figure: estimated values of log2(Cy5/Cy3) as a function of log10(Cy3*Cy5).]
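
The three ingredients named on this slide (local linear regression, tri-cube weights, least squares) fit together as in the Python sketch below. It is a bare-bones illustration under stated assumptions: a single pass with no robustness iterations, a neighbourhood defined by a hypothetical `frac` parameter, and plain lists for the A and M values. Production analyses would normally use an established implementation rather than this code.

```python
def tricube(u):
    """Tri-cube weight: (1 - |u|^3)^3 for |u| < 1, otherwise 0."""
    u = abs(u)
    return (1 - u ** 3) ** 3 if u < 1 else 0.0


def lowess_fit(x, y, frac=0.5):
    """Fitted value at each x[i] from a tri-cube weighted local linear regression."""
    n = len(x)
    k = max(2, int(round(frac * n)))  # number of neighbours used in each local fit
    fitted = []
    for x0 in x:
        dists = sorted(abs(xi - x0) for xi in x)
        h = dists[k - 1] or 1e-12  # bandwidth: distance to the k-th nearest point
        w = [tricube((xi - x0) / h) for xi in x]
        # Weighted least squares for y = b0 + b1 * (x - x0).
        sw = sum(w)
        swx = sum(wi * (xi - x0) for wi, xi in zip(w, x))
        swy = sum(wi * yi for wi, yi in zip(w, y))
        swxx = sum(wi * (xi - x0) ** 2 for wi, xi in zip(w, x))
        swxy = sum(wi * (xi - x0) * yi for wi, xi, yi in zip(w, x, y))
        denom = sw * swxx - swx * swx
        if abs(denom) < 1e-12:
            fitted.append(swy / sw)  # degenerate window: fall back to weighted mean
        else:
            b1 = (sw * swxy - swx * swy) / denom
            b0 = (swy - b1 * swx) / sw
            fitted.append(b0)  # value of the local line at x0
    return fitted


def lowess_normalize(a_values, m_values, frac=0.5):
    """Subtract the fitted intensity-dependent trend from each log-ratio M."""
    trend = lowess_fit(a_values, m_values, frac)
    return [m - t for m, t in zip(m_values, trend)]


if __name__ == "__main__":
    a = [float(i) for i in range(10)]
    m = [0.3 * ai - 1.0 for ai in a]  # toy data: purely intensity-dependent bias
    print([round(v, 6) for v in lowess_normalize(a, m)])
```

After subtraction, the log-ratios are centred on zero across the whole intensity range, which is the intensity-dependent correction that a single global constant cannot provide.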

A note about Affymetrix (1-color) pre-processing: within-chip, cross-chip, sequence specific background correction, within-probe set aggregation of intensity values (Johannes Freudenberg). Two “standard” methods: MAS 5.0 (now GCOS/GDAS) by Affymetrix (compares PM and MM probes); RMA by Speed group (UC Berkeley) (ignores MM probes).

Normalization: Thoughts. There are many different ways to normalize data: global median, LOWESS, LOESS, etc.; by print tip, spatial, etc. Choose one wisely. BUT: don’t expect it to fix bad data! It won’t make up for lack of replicates. It won’t make up for horrible slides.

For next time: Read the Quackenbush paper on normalization. Look up the paper on Robust Multichip Averaging (RMA) out of Terry Speed’s lab. What is meant by least squares? Visit the Gene Expression Omnibus (GEO) resource at NCBI and explore what is there. If you aren’t familiar with the statistical computing environment, R, look it up on the web. Look up MeV (multi-experiment viewer) on the web.

Analysis

Microarray Data Flow. [Diagram components: Microarray experiment; Image analysis; Database; Data selection & missing value estimation; Normalization & centering; Data matrix; Unsupervised analysis (clustering); Supervised analysis; Networks & data integration; Decomposition techniques.]

Interpretation

Microarray data on the Web: Several initiatives to create “unified” databases. EBI: ArrayExpress. NCBI: Gene Expression Omnibus.

Normalization - tools: Normalization is typically provided in microarray vendors’ software or by core facilities, but you should always understand the data you’re working with. How has your data been processed? Are there any lingering effects? Bioconductor (both Affymetrix and cDNA): packages in the R language. dChip (Affymetrix): Quantile, Invariant set. MAANOVA: Microarray ANOVA analysis.