:: Microarray analysis ::


1 :: Microarray analysis ::
Data pre-processing, normalization, molecular diagnosis, statistical classification. Florian Markowetz

2 From experiment to data

3 Raw data are not mRNA concentrations
tissue contamination; RNA degradation; amplification efficiency; reverse transcription efficiency; hybridization efficiency and specificity; clone identification and mapping; PCR yield, contamination; spotting efficiency; DNA support binding; other array manufacturing related issues; image segmentation; signal quantification; “background” correction

4 Quality control: Noise and reliable signal
Probe level: quality of the expression measurement of one spot on one particular array.
Array level: quality of the expression measurement on one particular glass slide.
Gene level: quality of the expression measurement of one probe across all arrays.

5 Probe-level quality control
Individual spots printed on the slide.
Sources of problems: faulty printing, uneven distribution, contamination with debris, magnitude of signal relative to noise, poorly measured spots.
Visual inspection: hairs, dust, scratches, air bubbles, dark regions, regions with haze.
Spot quality: brightness (foreground/background ratio); uniformity (variation in pixel intensities and ratios of intensities within a spot); morphology (area, perimeter, circularity); spot size (number of foreground pixels).
Action: set measurements to NA (missing values); apply local normalization procedures that account for regional idiosyncrasies; use weights for measurements to indicate reliability in later analysis.
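The "set measurements to NA" action can be sketched as follows. This is only an illustration: the function name `flag_spots` and the threshold `min_ratio=1.5` are assumptions made for the example, not values from the slides.

```python
def flag_spots(foreground, background, min_ratio=1.5):
    """Keep a spot's foreground signal only if it is clearly above background.

    Spots whose foreground/background ratio is below min_ratio (a hypothetical
    threshold) are replaced by None, which plays the role of NA here.
    """
    cleaned = []
    for f, b in zip(foreground, background):
        # b > 0 guards against division by zero; such spots are treated as missing
        if b > 0 and f / b >= min_ratio:
            cleaned.append(f)
        else:
            cleaned.append(None)
    return cleaned


# Made-up intensities for four spots
fg = [500.0, 120.0, 90.0, 800.0]
bg = [100.0, 100.0, 100.0, 200.0]
flagged = flag_spots(fg, bg)
```

Instead of discarding spots, the same ratio could be turned into a weight for later analysis, as the slide suggests.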

6 Spot identification
Individual spots are recognized; size and shape may be adjusted per spot (automatically, with fine adjustments by hand). Bad (X) or non-present (NA) spots are additionally flagged by hand. Figure labels: poor spot quality, good spot quality. Different spot identification methods: fixed circles, circles with variable size, arbitrary spot shape (morphological opening).

7 Spot identification
The signal of the spots is quantified. Histogram of pixel intensities of a single spot (“donuts”). Summary statistics: mean / median / mode / 75% quantile
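To make the choice of summary statistic concrete, here is a small sketch that computes all four candidates for one spot. The pixel values are invented, and the "inclusive" quantile method is just one of several reasonable definitions.

```python
import statistics


def summarize_spot(pixels):
    """Return candidate summary statistics for one spot's pixel intensities."""
    ordered = sorted(pixels)
    # quantiles(n=4) returns the three quartile cut points; index 2 is the 75% quantile
    q75 = statistics.quantiles(ordered, n=4, method="inclusive")[2]
    return {
        "mean": statistics.mean(ordered),
        "median": statistics.median(ordered),
        "mode": statistics.mode(ordered),
        "q75": q75,
    }


# Made-up spot with one very bright outlier pixel
spot_pixels = [100, 120, 120, 130, 150, 400]
summary = summarize_spot(spot_pixels)
```

In this toy spot the single bright pixel pulls the mean (170) well above the median (125), which is why robust summaries are often preferred.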

8 Local background
GenePix, QuantArray, ScanAlyze

9 Array level quality control
Problems: array fabrication defect, problem with RNA extraction, failed labeling reaction, poor hybridization conditions, faulty scanner.
Quality measures: percentage of spots with no signal (~30% excluded spots); range of intensities; (av. foreground)/(av. background) > 3 in both channels; distribution of spot signal area; amount of adjustment needed (how substantially the signals have to be changed to make slides comparable).
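Two of these quality measures are easy to compute per channel. The sketch below assumes that "no signal" means foreground not above background, which is a simplification chosen for the example; the function name and data are hypothetical.

```python
def array_quality(foreground, background, min_ratio=3.0):
    """Compute simple array-level quality measures for one channel."""
    n = len(foreground)
    # Assumption: a spot has "no signal" if its foreground does not exceed its background
    no_signal = sum(1 for f, b in zip(foreground, background) if f <= b)
    avg_ratio = (sum(foreground) / n) / (sum(background) / n)
    return {
        "pct_no_signal": 100.0 * no_signal / n,
        "avg_fg_bg_ratio": avg_ratio,
        "passes_ratio": avg_ratio > min_ratio,
    }


# Made-up intensities for ten spots in one channel
fg = [900, 1200, 80, 1500, 60, 700, 1000, 90, 1100, 1300]
bg = [100] * 10
quality = array_quality(fg, bg)
```

For a two-colour array the check would be run once per channel, and the slide asks for the ratio criterion to hold in both.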

10 Gene-level quality control
Poor hybridization in the reference channel may introduce bias in the fold-change. Some probes will not hybridize well to the target RNA. Printing problems: all spots of a given inventory well may have poor quality, for example because the well is contaminated. Genes with a consistently low signal in the reference channel are suspicious.

11 Gene expression data
Gene expression matrix: rows are genes, columns are mRNA samples (sample1, sample2, sample3, sample4, sample5, …). Each entry is the gene-expression level or ratio for gene i in mRNA sample j.
M = log2(red intensity / green intensity), or a function (PM, MM) of MAS, dChip or RMA.
A = average of log2(red intensity) and log2(green intensity), or a function (PM, MM) of MAS, dChip or RMA.

12 Scatterplot
Panels: Data; Data (log scale).
Message: look at your data on log scale!

13 MA Plot
A = 1/2 log2(R·G), M = log2(R/G)
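The two MA-plot coordinates can be computed directly from the red and green intensities. A minimal sketch with invented intensities (the function name `ma_values` is an assumption for this example):

```python
import math


def ma_values(red, green):
    """Return (M, A) lists: M = log2(R/G), A = 1/2 * log2(R*G)."""
    m = [math.log2(r / g) for r, g in zip(red, green)]
    a = [0.5 * math.log2(r * g) for r, g in zip(red, green)]
    return m, a


# Made-up intensities: 4-fold up, unchanged, 4-fold down
red = [1024.0, 256.0, 64.0]
green = [256.0, 256.0, 256.0]
m, a = ma_values(red, green)
```

M is the log fold-change between the channels and A is the average log intensity, so an unchanged gene sits at M = 0 regardless of how bright it is.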

14 Median centering
One of the simplest strategies is to bring the “centers” of all arrays to the same level. Assumption: the majority of genes are unchanged between conditions. The median is more robust to outliers than the mean. Divide all expression measurements of each array by the array median; on the log scale this corresponds to subtracting the median, which gives a log signal centered at 0.
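A minimal sketch of median centering for one array, using made-up intensities. It also checks that subtracting the median on the log scale gives the same result as dividing the raw values by the raw median (exactly so here because the number of values is odd).

```python
import math
import statistics


def median_center(log_values):
    """Subtract the array median so the log signal is centered at 0."""
    med = statistics.median(log_values)
    return [v - med for v in log_values]


# Made-up raw intensities for one array, including one outlier
raw = [8.0, 16.0, 32.0, 64.0, 1024.0]
log_signal = [math.log2(x) for x in raw]
centered = median_center(log_signal)

# Equivalent route: divide by the raw median first, then take logs
raw_median = statistics.median(raw)
centered_alt = [math.log2(x / raw_median) for x in raw]
```

The outlier at 1024 does not shift the center, which would not be true for mean centering.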

15 Problem of median centering
Median centering is a global method. It does not adjust for local effects, intensity-dependent effects, print-tip effects, etc.
Figures: scatterplot of log signals after median centering (Log Green vs. Log Red); M-A plot of the same data, with A = (Log Green + Log Red) / 2 and M = Log Red - Log Green.

16 Lowess normalization
A = (Log Green + Log Red) / 2, M = Log Red - Log Green. Fit a local estimate of M as a function of A, then use the estimate to bend the banana straight (subtract it from M).
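The idea can be sketched with a simplified local regression: for every spot, fit a straight line to its neighbours in A (weighted by a tricube kernel) and subtract the fitted value from M. This is a stripped-down illustration, not the full lowess procedure (it omits the robustness iterations), and the function names, the `frac` default and the data are assumptions made for the example.

```python
def local_linear_fit(a, m, frac=0.5):
    """Estimate the trend of M as a function of A with tricube-weighted local lines."""
    n = len(a)
    k = max(2, int(round(frac * n)))  # number of neighbours used per local fit
    fitted = []
    for x0 in a:
        # Bandwidth: distance to the k-th nearest point (x0 itself is included)
        h = sorted(abs(x - x0) for x in a)[k - 1] or 1e-12
        w = [max(0.0, 1.0 - (abs(x - x0) / h) ** 3) ** 3 for x in a]
        sw = sum(w)
        xbar = sum(wi * xi for wi, xi in zip(w, a)) / sw
        ybar = sum(wi * yi for wi, yi in zip(w, m)) / sw
        sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, a))
        sxy = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, a, m))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(ybar + slope * (x0 - xbar))
    return fitted


def lowess_normalize(a, m, frac=0.5):
    """Subtract the local trend from M: the 'bend the banana straight' step."""
    trend = local_linear_fit(a, m, frac)
    return [mi - ti for mi, ti in zip(m, trend)]


# Made-up banana-shaped data: M depends on A although no gene is truly changed
a_vals = [6.0 + 0.5 * i for i in range(17)]
m_vals = [0.1 * (x - 10.0) ** 2 for x in a_vals]
m_norm = lowess_normalize(a_vals, m_vals)
```

In practice one would use an established implementation (for example in R/Bioconductor) rather than this sketch; the point is only that the normalized M values are the residuals around the locally estimated curve.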

17 Summary I
Raw data are not mRNA concentrations. We need to check data quality on different levels: probe level, array level (all probes on one array), gene level (one gene on many arrays). Always log your data. Normalize your data to avoid systematic (non-biological) effects. Lowess normalization straightens the banana.

18 From data to knowledge
OK, now we have made sure that our data are of high quality and that systematic, non-biological effects are removed. The result is a gene expression matrix (genes by mRNA samples: sample1, sample2, sample3, sample4, sample5, …). Is that already a result? No! It’s just data, not knowledge. We need to use these data to answer a scientific question.

19 Supervised analysis = learning from examples, classification
We have already seen groups of healthy and sick people. Now let’s diagnose the next person walking into the hospital. We know that these genes have function X (and these others don’t). Let’s find more genes with function X. We know many gene-pairs that are functionally related (and many more that are not). Let’s extend the number of known related gene pairs. Known structure in the data needs to be generalized to new data.

20 Un-supervised analysis = clustering
Are there groups of genes that behave similarly in all conditions? Disease X is very heterogeneous. Can we identify more specific sub-classes for more targeted treatment? No structure is known. We first need to find it. Exploratory analysis.

21 Supervised analysis
Calvin, I still don’t know the difference between cats and dogs … Don’t worry! I’ll show you once more: Class 1: cats, Class 2: dogs. Oh, now I get it!!

22 Un-supervised analysis
Calvin, I still don’t know the difference between cats and dogs … I don’t know it either. Let’s try to figure it out together …

23 Supervised analysis: setup
Training set. Data: microarrays. Labels: for each one we know whether it falls into our class of interest or not (binary classification).
New data (test data): data for which we don’t have labels, e.g. genes without known function.
Goal: generalization ability. Build a classifier from the training data that is good at predicting the right class for the new data.

24 One microarray, one dot
Think of a space with #genes dimensions (yes, it’s hard for more than 3). Each microarray corresponds to a point in this space. If gene expression is similar under some conditions, the points will be close to each other. If gene expression overall is very different, the points will be far away. Axes: expression of gene 1, expression of gene 2.
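"Close" and "far away" need a distance. A common choice, assumed here for illustration, is the Euclidean distance between expression profiles; the three arrays and their values are made up.

```python
import math

# Each array is a point with one coordinate per gene (three genes here)
array_a = [1.0, 2.0, 3.0]
array_b = [1.0, 2.0, 3.5]   # similar expression to array_a
array_c = [4.0, 6.0, 3.0]   # very different expression

# math.dist computes the Euclidean distance between two points
dist_ab = math.dist(array_a, array_b)
dist_ac = math.dist(array_a, array_c)
```

The same computation works unchanged for thousands of genes; only the length of the lists grows.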

25 Which line separates best?

26 No sharp knife, but a … FAT PLANE

27 Support Vector Machines
Maximal margin separating hyperplane. Data points closest to the separating hyperplane = support vectors.
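To illustrate what "margin" means, the sketch below does not train a Support Vector Machine; it only computes the margin of two hand-picked candidate lines on a toy data set and shows which points lie closest to the better one. All names and numbers are assumptions made for the example.

```python
import math


def signed_distances(w, b, points, labels):
    """Distance of each point to the line w.x + b = 0, positive if correctly classified."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return [y * (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm
            for x, y in zip(points, labels)]


def margin(w, b, points, labels):
    """The margin of a separating line is the smallest of these distances."""
    return min(signed_distances(w, b, points, labels))


# Toy data: two genes, two classes (+1 and -1)
points = [(3, 3), (4, 4), (4, 2), (0, 0), (1, 0), (0, 1)]
labels = [1, 1, 1, -1, -1, -1]

margin_a = margin((1, 1), -3.5, points, labels)   # candidate line x + y = 3.5
margin_b = margin((1, 0), -2.0, points, labels)   # candidate line x = 2

# Points that attain the margin of line A play the role of support vectors
dists_a = signed_distances((1, 1), -3.5, points, labels)
closest_a = [p for p, d in zip(points, dists_a) if abs(d - margin_a) < 1e-9]
```

Both lines separate the classes perfectly, but line A keeps every point further away, so it is the "fatter" of the two. An actual SVM finds the line with the largest margin by optimization instead of comparing a few candidates.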

28 How well did we do?
Training error: how well do we do on the data we trained the classifier on? But how well will we do in the future, on new data? Test error: how well does the classifier generalize? Apply the same classifier (= line) to new data from the same classes. The classifier will usually perform worse than before: test error > training error.

29 Cross-validation
Training error: train the classifier and test it on the same data. Test error: train on one part of the data (Train) and test on the rest (Test).
K-fold cross-validation, here for K=3:
Step 1. Train Train Test
Step 2. Train Test Train
Step 3. Test Train Train
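The K-fold scheme can be sketched as follows. A nearest-centroid classifier stands in for whatever classifier is actually used, and the data are simulated; both are assumptions made so the example stays self-contained.

```python
import random


def fit_centroids(X, y):
    """Nearest-centroid 'training': store the mean profile of each class."""
    centroids = {}
    for label in set(y):
        rows = [x for x, lab in zip(X, y) if lab == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids


def predict(centroids, x):
    """Assign x to the class with the closest centroid (squared Euclidean distance)."""
    return min(centroids,
               key=lambda lab: sum((a - c) ** 2 for a, c in zip(x, centroids[lab])))


def error_rate(centroids, X, y):
    wrong = sum(1 for x, lab in zip(X, y) if predict(centroids, x) != lab)
    return wrong / len(y)


def make_folds(n, k, seed=0):
    """Shuffle the sample indices and deal them into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def k_fold_cv(X, y, k=3, seed=0):
    """Each fold is the test set once; the other k-1 folds form the training set."""
    errors = []
    for test_idx in make_folds(len(y), k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(y)) if i not in held_out]
        model = fit_centroids([X[i] for i in train_idx], [y[i] for i in train_idx])
        errors.append(error_rate(model, [X[i] for i in test_idx],
                                 [y[i] for i in test_idx]))
    return errors


# Simulated data: 15 'healthy' and 15 'sick' arrays, two genes each
rng = random.Random(1)
X = ([[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(15)]
     + [[rng.gauss(3, 1), rng.gauss(3, 1)] for _ in range(15)])
y = ["healthy"] * 15 + ["sick"] * 15

cv_errors = k_fold_cv(X, y, k=3)                          # one test error per fold
training_error = error_rate(fit_centroids(X, y), X, y)    # error on the training data itself
```

The average of `cv_errors` is the cross-validated estimate of the test error; every sample is used for testing exactly once and never for training the model that classifies it.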

30 Summary II
Supervised and un-supervised learning are needed everywhere in biology and medicine. Microarrays = points in high-dimensional spaces. Classifiers = lines (hyperplanes) in these spaces. Support Vector Machines use maximal margin hyperplanes as classifiers. Classifier performance: test error > training error. Cross-validation is the right way to evaluate classifier performance.

31 Experimental Cycle
Stages: biological question (hypothesis-driven or explorative); experimental design; microarray experiment; image analysis; quality measurement (failed / pass); pre-processing and normalization; analysis (estimation, testing, clustering, discrimination); biological verification and interpretation.
“To call in the statistician after the experiment is done may be no more than asking him to perform a post-mortem examination: He may be able to say what the experiment died of.” Ronald Fisher

32 Books
Terry Speed, “Statistical Analysis of Gene Expression Microarray Data”, Chapman & Hall/CRC
David W. Mount, “Bioinformatics”, Cold Spring Harbor
Giovanni Parmigiani et al., “The Analysis of Gene Expression Data”, Springer
Gentleman, Carey, Huber, “Bioinformatics and Computational Biology Solutions Using R and Bioconductor”, Springer
Pierre Baldi & G. Wesley Hatfield, “DNA Microarrays and Gene Expression”, Cambridge

33 And how do I analyze my own data?
Open source. Free. Easy installation. Helpful community. High quality standards. Regularly maintained and updated. Tons of documentation. Every package comes with example vignettes to walk you through standard tasks.

34 Acknowledgements
I ‘borrowed’ slides from: Tim Beissbarth, Achim Tresch, Wolfgang Huber, Ulrich Mansmann, Terry Speed, Jean Yang, Benedikt Brors, Anja von Heydebreck, Rainer König. More info on microarray analysis, lectures, tutorials: http://compdiag.molgen.mpg.de/ngfn/

