Microarray data analysis


1 Microarray data analysis
September 24, 2003

2 Copyright notice
Many of the images in this PowerPoint presentation are from Bioinformatics and Functional Genomics by Jonathan Pevsner (ISBN ). Copyright © 2003 by John Wiley & Sons, Inc. These images and materials may not be used without permission from the publisher. We welcome instructors to use these PowerPoints for educational purposes, but please acknowledge the source. The book has a homepage at Including hyperlinks to the book chapters.

3 Announcements: schedule
Today: microarray data analysis (Chapter 7)
At 3:00 Tom Downey (Partek, Inc.)
Friday: computer lab (gene expression)
2:00 exam is DUE!
Monday: Protein analysis (Chapter 8)
Wednesday: Protein structure (Chapter 9)

4 Microarray data analysis
• begin with a data matrix (gene expression values versus samples) Page 190

5 Microarray data analysis
• begin with a data matrix (gene expression values versus samples)
Typically, there are many genes (>> 10,000) and few samples (~ 10)
Page 190

6 Microarray data analysis
• begin with a data matrix (gene expression values versus samples)
• Preprocessing
• Inferential statistics
• Descriptive statistics
Page 190

7 Microarray data analysis: preprocessing
Observed differences in gene expression could be due to transcriptional changes, or they could be caused by artifacts such as:
• different labeling efficiencies of Cy3, Cy5
• uneven spotting of DNA onto an array surface
• variations in RNA purity or quantity
• variations in washing efficiency
• variations in scanning efficiency
Page 191

8 Microarray data analysis: preprocessing
The main goal of data preprocessing is to remove the systematic bias in the data as completely as possible, while preserving the variation in gene expression that occurs because of biologically relevant changes in transcription. A basic assumption of most normalization procedures is that the average gene expression level does not change in an experiment. Page 191

9 Data analysis: global normalization
Global normalization is used to correct two or more data sets. In one common scenario, samples are labeled with Cy3 (green dye) or Cy5 (red dye) and hybridized to DNA elements on a microarray. After washing, probes are excited with a laser and detected with a scanning confocal microscope. Page 192

10 Data analysis: global normalization
Global normalization is used to correct two or more data sets.
Example: total fluorescence in Cy3 channel = 4 million units, Cy5 channel = 2 million units.
Then the uncorrected ratio for a gene could show 2,000 units versus 1,000 units. This would artifactually appear to show 2-fold regulation.
Page 192
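
The arithmetic in this example can be checked with a few lines of Python. This is only a sketch: scaling by the ratio of total channel fluorescence is assumed here as the simplest form of global correction, and the variable names are illustrative.

```python
# Numbers taken from the slide's example.
cy3_total = 4_000_000  # total fluorescence in the Cy3 channel
cy5_total = 2_000_000  # total fluorescence in the Cy5 channel

gene_cy3 = 2_000  # raw signal for one gene in the Cy3 channel
gene_cy5 = 1_000  # raw signal for the same gene in the Cy5 channel

raw_ratio = gene_cy3 / gene_cy5       # looks like 2-fold regulation
scale = cy3_total / cy5_total         # Cy3 is 2x brighter overall
normalized_ratio = raw_ratio / scale  # ratio after the global correction

print(raw_ratio, normalized_ratio)
```

After the correction the gene's ratio is 1, so the apparent 2-fold change is explained entirely by the brighter Cy3 channel.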

11 Data analysis: global normalization
Global normalization procedure
Step 1: subtract background intensity values (use a blank region of the array)
Step 2: globally normalize so that the average ratio = 1 (apply this to 1-channel or 2-channel data sets)
Page 192
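
The two steps can be sketched for a 2-channel data set as follows. The function name, the sample intensities, and the single background value per channel are all hypothetical, and the sketch assumes every signal exceeds its background.

```python
from statistics import mean

def global_normalize(cy5, cy3, background_cy5, background_cy3):
    """Background-subtract both channels, then scale ratios so their mean is 1."""
    # Step 1: subtract the background intensity from every spot.
    signal_cy5 = [v - background_cy5 for v in cy5]
    signal_cy3 = [v - background_cy3 for v in cy3]
    # Step 2: compute per-gene ratios and divide by their average.
    ratios = [a / b for a, b in zip(signal_cy5, signal_cy3)]
    factor = mean(ratios)
    return [r / factor for r in ratios]

# Hypothetical raw intensities for three genes.
cy5 = [1100.0, 2100.0, 4100.0]
cy3 = [600.0, 1100.0, 1100.0]
normalized = global_normalize(cy5, cy3, background_cy5=100.0, background_cy3=100.0)
print(normalized)
```

Many analyses apply the same idea on a log scale (centering log ratios on 0 rather than ratios on 1); the linear version above follows the wording of the slide.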

12 Microarray data preprocessing
Some researchers use housekeeping genes for global normalization Visit the Human Gene Expression (HuGE) Index: Page 192

13 Scatter plots
• Useful to represent gene expression values from two microarray experiments (e.g. control, experimental)
• Each dot corresponds to a gene expression value
• Most dots fall along a line
• Outliers represent up-regulated or down-regulated genes
Page 193

14 Scatter plot analysis of microarray data
Page 193

15 Differential Gene Expression in Different Tissue and Cell Types
Fibroblast Brain Astrocyte Astrocyte

16 Expression level (sample 2) versus expression level (sample 1)
[Figure: scatter plot; both axes run from low to high expression level; regions labeled up and down] Page 193

17 Log-log transformation Page 195

18 Scatter plots
Typically, data are plotted on log-log coordinates. Visually, this spreads out the data and offers symmetry.

time    behavior      raw ratio value   log2 ratio value
t=0     basal         1.0               0.0
t=1h    no change     1.0               0.0
t=2h    2-fold up     2.0               1.0
t=3h    2-fold down   0.5               -1.0
Page 194, 197
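
The symmetry offered by the log transformation can be seen in a short sketch. It assumes raw ratios of 1, 1, 2 and 0.5 for the four time points (basal, no change, 2-fold up, 2-fold down).

```python
import math

# Assumed raw expression ratios for the four time points.
raw_ratios = {"t=0": 1.0, "t=1h": 1.0, "t=2h": 2.0, "t=3h": 0.5}

# log2 maps 2-fold up and 2-fold down to +1 and -1, symmetric around 0.
log2_ratios = {time: math.log2(ratio) for time, ratio in raw_ratios.items()}

for time, value in log2_ratios.items():
    print(time, raw_ratios[time], value)
```

On the raw scale, up-regulation spans 1 to infinity while down-regulation is squeezed between 0 and 1; on the log2 scale the two directions are treated equally.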

19 Log ratio versus mean log intensity
[Figure: log ratio (up, down) plotted against mean log intensity (expression level, low to high)]
Page 196

20 SNOMAD converts array data to scatter plots http://snomad.org
[Figure: linear-linear plot and log-log plot of EXP versus CON with 2-fold lines; plot of Log10 (Ratio) versus Mean (Log10 (Intensity)) with 2-fold lines and regions EXP > CON and EXP < CON] Page

21 SNOMAD corrects local variance artifacts
[Figure: Log10 (Ratio) versus Mean (Log10 (Intensity)) with a robust local regression fit and a residual marked; Corrected Log10 (Ratio) [residuals] versus Mean (Log10 (Intensity)); 2-fold lines and regions EXP > CON and EXP < CON] Page

22 SNOMAD describes regulated genes in Z-scores
[Figure: Corrected Log10 (Ratio) versus Mean (Log10 (Intensity)) with 2-fold lines and the locally estimated standard deviations of positive and of negative ratios; Local Log10 (Ratio) Z-Score versus Mean (Log10 (Intensity)) with lines at Z = 1, 2, 5 and Z = -1, -2, -5]

23 Inferential statistics
Inferential statistics are used to make inferences about a population from a sample. Hypothesis testing is a common form of inferential statistics. A null hypothesis is stated, such as: “There is no difference in signal intensity for the gene expression measurements in normal and diseased samples.” The alternative hypothesis is that there is a difference. We use a test statistic to decide whether to accept or reject the null hypothesis. For many applications, we set the significance level α to 0.05 (p < 0.05). Page 199

24 Inferential statistics
A t-test is a commonly used test statistic to assess the difference in mean values between two groups.

$$ t = \frac{\bar{x}_1 - \bar{x}_2}{s} = \frac{\text{difference between mean values}}{\text{variability (noise)}} $$

Questions
• Is the sample size (n) adequate?
• Are the data normally distributed?
• Is the variance of the data known?
• Is the variance the same in the two groups?
• Is it appropriate to set the significance level to p < 0.05?
Page 199
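
A minimal sketch of the t statistic as a difference between group means divided by a noise term. The slide does not say which form of the denominator is meant; the unequal-variance (Welch) form is assumed here, and the expression values are invented for illustration.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(group1, group2):
    """t = (difference between mean values) / (variability of that difference)."""
    n1, n2 = len(group1), len(group2)
    difference = mean(group1) - mean(group2)
    # Standard error of the difference, allowing unequal group variances.
    noise = sqrt(variance(group1) / n1 + variance(group2) / n2)
    return difference / noise

# Hypothetical expression values for one gene.
normal = [10.0, 12.0, 11.0, 13.0]
diseased = [14.0, 16.0, 15.0, 17.0]

t = welch_t(diseased, normal)
print(round(t, 3))
```

Turning t into a p value additionally requires the degrees of freedom and the t distribution, which this sketch leaves out.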

25 Inferential statistics
Paradigm                      Parametric test    Nonparametric
Compare two unpaired groups   Unpaired t-test    Mann-Whitney test
Compare two paired groups     Paired t-test      Wilcoxon test
Compare 3 or more groups      ANOVA
Page

26 Inferential statistics
Is it appropriate to set the significance level to p < 0.05? If you hypothesize that a specific gene is up-regulated, you can set the probability value to 0.05.
You might measure the expression of 10,000 genes and hope that any of them are up- or down-regulated. But you can expect to see 5% (500 genes) regulated at the p < 0.05 level by chance alone.
To account for the thousands of repeated measurements you are making, some researchers apply a Bonferroni correction. The level for statistical significance is divided by the number of measurements, e.g. the criterion becomes: p < (0.05)/10,000 or p < 5 × 10^-6
Page 199
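
The numbers on this slide follow from two one-line calculations, sketched below. The function name, gene names and p values are hypothetical.

```python
def bonferroni_threshold(alpha, n_tests):
    """Significance level divided by the number of measurements."""
    return alpha / n_tests

alpha = 0.05
n_genes = 10_000

# Genes expected to pass p < 0.05 by chance alone if nothing is regulated.
expected_by_chance = alpha * n_genes

# Corrected per-gene criterion.
threshold = bonferroni_threshold(alpha, n_genes)

# Hypothetical p values for two genes.
p_values = {"geneA": 1e-7, "geneB": 0.01}
significant = [gene for gene, p in p_values.items() if p < threshold]

print(expected_by_chance, threshold, significant)
```

Here geneB would pass the uncorrected 0.05 cutoff but not the corrected one, which is the conservative behavior the Bonferroni correction is known for.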

27 Significance analysis of microarrays (SAM)
SAM
-- an Excel plug-in (URL: page 202)
-- modified t-test
-- adjustable false discovery rate
Page 200

28 Page 202

29 [Figure: observed versus expected, with up-regulated and down-regulated genes marked] Page 202

30 Descriptive statistics
Microarray data are highly dimensional: there are many thousands of measurements made from a small number of samples. Descriptive (exploratory) statistics help you to find meaningful patterns in the data.
A first step is to arrange the data in a matrix. Next, use a distance metric to define the relatedness of the different data points. Two commonly used distance metrics are:
-- Euclidean distance
-- Pearson coefficient of correlation
Page 203
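
The two metrics can be sketched in plain Python. The gene profiles below are invented to show how the metrics can disagree.

```python
from math import sqrt
from statistics import mean

def euclidean_distance(x, y):
    """Straight-line distance between two expression profiles."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def pearson_correlation(x, y):
    """Correlation of two profiles: +1 for identical shape, -1 for opposite."""
    mx, my = mean(x), mean(y)
    covariance = sum((a - mx) * (b - my) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mx) ** 2 for a in x))
    spread_y = sqrt(sum((b - my) ** 2 for b in y))
    return covariance / (spread_x * spread_y)

# Hypothetical profiles over three time points: same shape, different level.
gene_a = [1.0, 2.0, 3.0]
gene_b = [10.0, 20.0, 30.0]

print(euclidean_distance(gene_a, gene_b))   # large: absolute levels differ
print(pearson_correlation(gene_a, gene_b))  # close to 1: shapes match
```

Because the Pearson coefficient measures similarity rather than distance, 1 minus the correlation is often used when a distance is required.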

31 Data matrix (20 genes and 3 time points from Chu et al.) Page 205

32 3D plot (using S-PLUS software)
Page 205

33 Descriptive statistics: clustering
Clustering algorithms offer useful visual descriptions of microarray data. Genes may be clustered, or samples, or both. We will next describe hierarchical clustering. This may be agglomerative (building up the branches of a tree, beginning with the two most closely related objects) or divisive (building the tree by finding the most dissimilar objects first). In each case, we end up with a tree having branches and nodes. Page 204
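
A minimal sketch of agglomerative clustering for five objects a to e. The one-dimensional coordinates are hypothetical, and single linkage (distance between the closest members of two clusters) is assumed as the merging rule, which the slide does not specify.

```python
from itertools import combinations

def single_linkage(points):
    """Agglomerative clustering of named 1-D points; returns the merge order."""
    def linkage(c1, c2):
        # Single linkage: distance between the two closest members.
        return min(abs(points[a] - points[b]) for a in c1 for b in c2)

    clusters = [(name,) for name in points]  # every object starts alone
    merges = []
    while len(clusters) > 1:
        # Join the two most closely related clusters.
        c1, c2 = min(combinations(clusters, 2), key=lambda pair: linkage(*pair))
        clusters.remove(c1)
        clusters.remove(c2)
        merged = tuple(sorted(c1 + c2))
        clusters.append(merged)
        merges.append(merged)
    return merges

# Hypothetical positions for the five objects.
points = {"a": 0.0, "b": 1.0, "c": 5.0, "d": 8.0, "e": 9.5}
merges = single_linkage(points)
for step, cluster in enumerate(merges, start=1):
    print(step, ",".join(cluster))
```

With these coordinates the tree is built in four steps: a,b then d,e then c,d,e then a,b,c,d,e. A divisive method would work in the opposite direction, starting from the full set and splitting it.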

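34 Agglomerative clustering
[Figure: objects a, b, c, d, e; steps 1 2 3 4; cluster a,b formed] Page 206

35 Agglomerative clustering
[Figure: objects a, b, c, d, e; steps 1 2 3 4; clusters a,b and d,e formed] Page 206

36 Agglomerative clustering
[Figure: objects a, b, c, d, e; steps 1 2 3 4; clusters a,b; d,e; and c,d,e formed] Page 206

37 Agglomerative clustering
[Figure: objects a, b, c, d, e; steps 1 2 3 4; clusters a,b; d,e; c,d,e; and a,b,c,d,e formed] …tree is constructed Page 206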

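38 Divisive clustering
[Figure: steps 4 3 2 1; cluster a,b,c,d,e] Page 206

39 Divisive clustering
[Figure: steps 4 3 2 1; clusters a,b,c,d,e and c,d,e] Page 206

40 Divisive clustering
[Figure: steps 4 3 2 1; clusters a,b,c,d,e; c,d,e; and d,e] Page 206

41 Divisive clustering
[Figure: steps 4 3 2 1; clusters a,b,c,d,e; a,b; c,d,e; and d,e] Page 206

42 Divisive clustering
[Figure: steps 4 3 2 1; clusters a,b,c,d,e; a,b; c,d,e; d,e; and single objects a, b, c, d, e] …tree is constructed Page 206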

43 Agglomerative and divisive clustering
[Figure: tree over objects a, b, c, d, e with clusters a,b; d,e; c,d,e; and a,b,c,d,e; agglomerative steps labeled 1 2 3 4 and divisive steps labeled 4 3 2 1] Page 206

44 Page 205

45 Page 207

46 Agglomerative and divisive clustering sometimes give conflicting results, as shown here
[Figure: two panels, each labeled 1 12] Page 207

47 Cluster and TreeView Page 208

48 Cluster and TreeView
[Figure labels: clustering, K means, SOM, PCA] Page 208

49 Cluster and TreeView Page 208

50 Cluster and TreeView Page 208

51 Page 208

52 Page 208

53 Page 208

54 Two-way clustering of genes (y-axis) and cell lines (x-axis)
(Alizadeh et al., 2000) Page 209

55 Self-organizing maps (SOM)
To download GeneCluster:

56 Self-organizing maps (SOM)
One chooses a geometry of 'nodes', for example a 3x2 grid. Page 210

57 Self-organizing maps (SOM)
The nodes are mapped into k-dimensional space, initially at random and then successively adjusted. Page 210

58 Self-organizing maps (SOM)
Page 211

59 Unlike k-means clustering, which is unstructured, SOMs allow one to impose partial structure on the clusters.
The principle of SOMs is as follows. One chooses an initial geometry of “nodes” such as a 3 x 2 rectangular grid (indicated by solid lines in the figure connecting the nodes). The figure shows hypothetical trajectories of the nodes as they migrate to fit the data during successive iterations of the SOM algorithm. Data points are represented by black dots, the six nodes of the SOM by large circles, and trajectories by arrows.
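
The node migration described above can be sketched as follows. This is an illustration of the general SOM idea, not GeneCluster's implementation: the learning-rate schedule, the shrinking neighborhood radius, the grid-distance rule, the fixed random seed and the sample data are all assumptions.

```python
import random

def train_som(data, rows=3, cols=2, iterations=300, seed=0):
    """Fit a rows x cols grid of nodes to data (a list of k-dimensional points)."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    dim = len(data[0])
    grid = [(r, c) for r in range(rows) for c in range(cols)]
    lo = [min(p[d] for p in data) for d in range(dim)]
    hi = [max(p[d] for p in data) for d in range(dim)]
    # Nodes start at random positions inside the range of the data.
    nodes = [[rng.uniform(lo[d], hi[d]) for d in range(dim)] for _ in grid]

    for step in range(iterations):
        shrink = 1.0 - step / iterations
        rate = 0.5 * shrink    # learning rate decays toward 0
        radius = 1.5 * shrink  # grid neighborhood shrinks over time
        point = rng.choice(data)
        # The winner is the node closest to the chosen data point.
        winner = min(
            range(len(nodes)),
            key=lambda i: sum((nodes[i][d] - point[d]) ** 2 for d in range(dim)),
        )
        wr, wc = grid[winner]
        for i, (r, c) in enumerate(grid):
            # Move the winner and its grid neighbors toward the data point.
            if abs(r - wr) + abs(c - wc) <= radius:
                for d in range(dim):
                    nodes[i][d] += rate * (point[d] - nodes[i][d])
    return grid, nodes

# Hypothetical 2-D data forming three loose groups.
data = [(0.0, 0.0), (0.2, 0.1), (5.0, 5.0), (5.1, 4.8), (9.0, 0.5), (9.2, 0.3)]
grid, nodes = train_som(data)
for position, node in zip(grid, nodes):
    print(position, [round(v, 2) for v in node])
```

Because neighbors on the grid are pulled along with the winning node early in training, adjacent nodes tend to end up near related groups of data points, which is the partial structure that k-means lacks.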

60 Self-organizing maps (SOM)
Neighboring nodes tend to define 'related' clusters. An SOM based on a rectangular grid thus is analogous to an entomologist's specimen drawer in which adjacent compartments hold similar insects.

61 Two pre-processing steps essential to apply SOMs
1. Variation Filtering: Data were passed through a variation filter to eliminate those genes showing no significant change in expression across the k samples. This step is needed to prevent nodes from being attracted to large sets of invariant genes.
2. Normalization: The expression level of each gene was normalized across experiments. This focuses attention on the 'shape' of expression patterns rather than absolute levels of expression.
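
Both steps can be sketched as below. The slide gives no thresholds or formulas, so the 2-fold max/min filter and the scaling of each gene to mean 0 and standard deviation 1 are assumptions, as are the function names and the example values (which must be positive for the fold-change filter).

```python
from statistics import mean, pstdev

def variation_filter(expression, min_fold=2.0):
    """Keep genes whose max/min expression across samples reaches min_fold."""
    return {
        gene: values
        for gene, values in expression.items()
        if max(values) / min(values) >= min_fold
    }

def normalize_gene(values):
    """Rescale one gene's profile to mean 0 and standard deviation 1."""
    center, spread = mean(values), pstdev(values)
    return [(v - center) / spread for v in values]

# Hypothetical expression levels for three genes across three samples.
expression = {
    "g1": [100.0, 100.0, 105.0],  # nearly invariant, removed by the filter
    "g2": [50.0, 100.0, 200.0],   # rising
    "g3": [400.0, 200.0, 100.0],  # falling, at a higher absolute level
}

kept = variation_filter(expression)
normalized = {gene: normalize_gene(values) for gene, values in kept.items()}
for gene, values in normalized.items():
    print(gene, [round(v, 3) for v in values])
```

After normalization g2 and g3 are exact mirror images of each other even though their absolute levels differ, which is the sense in which normalization focuses on the shape of the pattern.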

62 Principal components analysis (PCA), an exploratory technique that reduces data dimensionality, distinguishes lead-exposed from control cell lines
[Figure: samples P1-P4 (Lead, P), N2-N4 (Sodium, N) and C1-C4 (Control, C) plotted on principal component axis #1 (87%) versus principal component axis #2 (10%); PC#3: 1%]

63 Principal components analysis (PCA)
An exploratory technique used to reduce the dimensionality of the data set to 2D or 3D.
For a matrix of m genes x n samples, create a new covariance matrix of size n x n.
This transforms some large number of variables into a smaller number of uncorrelated variables called principal components (PCs).
Page 211
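
A minimal sketch of these steps for a small matrix: build the n x n covariance matrix of the samples, then find the first principal component. Power iteration is used here only to keep the example free of external libraries (it is one of several ways to obtain the leading eigenvector, and it assumes the starting vector is not orthogonal to that eigenvector); the data values are hypothetical.

```python
from math import sqrt

def covariance_matrix(data):
    """data: m genes (rows) x n samples (columns) -> n x n covariance matrix."""
    m, n = len(data), len(data[0])
    means = [sum(row[j] for row in data) / m for j in range(n)]
    return [
        [
            sum((row[i] - means[i]) * (row[j] - means[j]) for row in data) / (m - 1)
            for j in range(n)
        ]
        for i in range(n)
    ]

def first_principal_component(cov, iterations=100):
    """Leading eigenvector and eigenvalue of cov, found by power iteration."""
    n = len(cov)
    v = [1.0 / sqrt(n)] * n  # starting guess
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(
        v[i] * sum(cov[i][j] * v[j] for j in range(n)) for i in range(n)
    )
    return v, eigenvalue

# Hypothetical matrix: 4 genes (rows) x 2 samples (columns).
data = [[1.0, 1.1], [2.0, 1.9], [3.0, 3.2], [4.0, 3.8]]

cov = covariance_matrix(data)
pc1, eigenvalue = first_principal_component(cov)
total_variance = sum(cov[i][i] for i in range(len(cov)))
explained = eigenvalue / total_variance

print([round(x, 3) for x in pc1], round(explained, 4))
```

The share of variance carried by each component (the eigenvalue divided by the total variance) is the kind of percentage shown on PCA axis labels.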

64 Principal components analysis (PCA): objectives
• to reduce dimensionality
• to determine the linear combination of variables
• to choose the most useful variables (features)
• to visualize multidimensional data
• to identify groups of objects (e.g. genes/samples)
• to identify outliers
Page 211

65 Page 212

66 Page 212

67 Page 212

68 Page 212

69 Page 212

70 Page 212

71 Use of PCA to demonstrate increased levels of gene expression from Down syndrome (trisomy 21) brain
[Figure label: Chr 21]

