1
Introduction to Microarray Data Analysis - II BMI 730 Kun Huang Department of Biomedical Informatics Ohio State University
2
Review of Microarray Elements of Statistics and Gene Discovery in Expression Data Elements of Machine Learning and Clustering of Gene Expression Profiles
3
How does a two-channel microarray work? The printing process introduces errors and larger variance. Comparative hybridization experiment.
4
How does a microarray work? Fabrication expense and the frequency of errors increase with the length of the probe, so 25-mer oligonucleotide probes are employed. Problem: cross-hybridization. Solution: introduce a mismatch probe that differs from the matched probe at one (central) position. The difference between the two gives a more accurate reading.
5
How do we use microarray? Inference Clustering
6
Normalization Which normalization algorithm to use Inter-slide normalization Not just for Affymetrix arrays
7
Review of Microarray Elements of Statistics and Gene Discovery in Expression Data Elements of Machine Learning and Clustering of Gene Expression Profiles
8
Hypothesis Testing Two sets of samples drawn from two distributions (N=2)
9
Hypothesis Testing Two sets of samples drawn from two distributions (N=2). Hypothesis: $\mu_1$ and $\mu_2$ are the means of the two distributions. Null hypothesis $H_0$: $\mu_1 = \mu_2$. Alternative hypothesis $H_1$: $\mu_1 \neq \mu_2$.
10
Student’s t-test
11
The p-value can be computed from the t-value and the number of degrees of freedom (related to the number of samples). It gives a bound on the probability of a type-I error (claiming that an insignificant difference is significant), assuming normal distributions.
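As a minimal sketch of the quantities this slide refers to, the Python snippet below computes the pooled-variance (Student) t-value and its degrees of freedom for one gene measured in two groups. The function name `student_t` and the expression values are made up for illustration, and the p-value lookup from the t distribution is left out.

```python
import math
import statistics

def student_t(group1, group2):
    """Pooled-variance (Student) t statistic and its degrees of freedom."""
    n1, n2 = len(group1), len(group2)
    m1, m2 = statistics.mean(group1), statistics.mean(group2)
    # statistics.variance is the sample variance (divides by n - 1).
    v1, v2 = statistics.variance(group1), statistics.variance(group2)
    pooled = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
    t = (m1 - m2) / math.sqrt(pooled * (1 / n1 + 1 / n2))
    return t, n1 + n2 - 2

# Hypothetical expression values for one gene in two conditions.
control = [5.1, 4.9, 5.3, 5.0]
treated = [6.2, 6.0, 6.5, 6.1]
t_value, dof = student_t(control, treated)
print(f"t = {t_value:.2f}, degrees of freedom = {dof}")
```

A large |t| relative to the t distribution with `dof` degrees of freedom corresponds to a small p-value.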
12
Student’s t-test Dependent (paired) t-test
13
Permutation (t-)test The t-test relies on a parametric distribution assumption (normal distribution). Permutation tests do not depend on such an assumption. Examples include the permutation t-test and the Wilcoxon rank-sum test. Perform a regular t-test to obtain the t-value $t_0$. Then randomly permute the $N_1 + N_2$ samples and designate the first $N_1$ as group 1, with the rest being group 2. Perform the t-test again and record the t-value $t$. Over all possible permutations, count how many t-values are larger than $t_0$ and write down this number, $K_0$. Dividing $K_0$ by the number of permutations gives the permutation p-value.
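The procedure above can be sketched in Python as follows. This is an illustration under stated assumptions, not the course's implementation: it enumerates every split of the pooled samples into groups of the original sizes (equivalent to all permutations, since the t-value depends only on group membership), uses the two-sided comparison |t| >= |t_0|, and counts the observed labelling itself. The function names and data are hypothetical.

```python
import itertools
import math
import statistics

def t_stat(a, b):
    """Pooled-variance t statistic for two lists of values."""
    n1, n2 = len(a), len(b)
    pooled = ((n1 - 1) * statistics.variance(a)
              + (n2 - 1) * statistics.variance(b)) / (n1 + n2 - 2)
    return (statistics.mean(a) - statistics.mean(b)) / math.sqrt(pooled * (1 / n1 + 1 / n2))

def permutation_t_test(group1, group2):
    """Exact permutation p-value: fraction of relabellings with |t| >= |t_0|."""
    pooled = list(group1) + list(group2)
    n1 = len(group1)
    t0 = abs(t_stat(group1, group2))
    count = total = 0
    for idx in itertools.combinations(range(len(pooled)), n1):
        chosen = set(idx)
        a = [pooled[i] for i in idx]
        b = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        total += 1
        if abs(t_stat(a, b)) >= t0 - 1e-12:   # small tolerance for rounding
            count += 1
    return count / total

# Hypothetical expression values for one gene in two conditions.
control = [5.1, 4.9, 5.3, 5.0]
treated = [6.2, 6.0, 6.5, 6.1]
p_value = permutation_t_test(control, treated)
print(f"permutation p-value = {p_value:.4f}")
```

Full enumeration is only practical for small groups; with more samples a random subset of permutations is typically drawn instead.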
14
Multiple Classes (N>2) F-test The null hypothesis is that the distribution of gene expression is the same for all classes. The alternative hypothesis is that at least one of the classes has a distribution that is different from the other classes. Which class is different cannot be determined by the F-test (ANOVA) itself. It can only be identified post hoc.
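For illustration, a one-way ANOVA F statistic can be computed as the ratio of between-class to within-class mean squares. The sketch below assumes a single gene measured in several classes; the function name `one_way_f` and the numbers are invented, and converting F to a p-value is omitted.

```python
import statistics

def one_way_f(groups):
    """One-way ANOVA F statistic and (between, within) degrees of freedom."""
    k = len(groups)
    all_values = [x for g in groups for x in g]
    n = len(all_values)
    grand_mean = statistics.mean(all_values)
    # Between-class mean square: spread of class means around the grand mean.
    between = sum(len(g) * (statistics.mean(g) - grand_mean) ** 2 for g in groups) / (k - 1)
    # Within-class mean square: spread of values around their own class mean.
    within = sum((x - statistics.mean(g)) ** 2 for g in groups for x in g) / (n - k)
    return between / within, (k - 1, n - k)

# Hypothetical expression values for one gene in three classes.
classes = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
f_value, dof = one_way_f(classes)
print(f"F = {f_value:.2f}, degrees of freedom = {dof}")
```

A large F says only that some class differs; it does not say which one, which is why a post hoc comparison is needed.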
15
Example GEO Dataset Subgroup Effect
16
Gene Discovery and Multiple T-tests Controlling False Positives p-value cutoff = 0.05 (probability of a false positive, i.e., type-I error). 22,000 probesets. False discoveries: 22,000 × 0.05 = 1,100. Focus on the 1,100 genes in the second specimen. False discoveries: 1,100 × 0.05 = 55.
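The arithmetic on this slide can be checked with a small simulation. The sketch below assumes that no gene is truly differentially expressed, in which case p-values are uniformly distributed between 0 and 1. The seed and variable names are arbitrary choices for illustration.

```python
import random

rng = random.Random(730)   # fixed seed so the run is repeatable
n_genes = 22000
cutoff = 0.05

# Under the null hypothesis every p-value is uniform on [0, 1).
null_p_values = [rng.random() for _ in range(n_genes)]
false_positives = sum(p < cutoff for p in null_p_values)
expected = n_genes * cutoff

print(f"expected about {expected:.0f} false positives, simulated {false_positives}")
```

Even with no real signal, roughly a thousand probesets pass the 0.05 cutoff, which is the motivation for the corrections discussed next.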
17
Gene Discovery and Multiple T-tests Controlling False Positives State the set of genes explicitly before the experiments. Problem: not always feasible, defeats the purpose of large-scale screening, could miss important discoveries. Statistical tests to control the false positives.
18
Gene Discovery and Multiple T-tests Controlling False Positives Statistical tests to control the false positives. Controlling for no false positives (very stringent, e.g., Bonferroni methods). Controlling the number of false positives. Controlling the proportion of false positives. Note that in the screening stage, a false positive is better than a false negative, as the latter means missing a possibly important discovery.
19
Gene Discovery and Multiple T-tests Controlling False Positives Statistical tests to control the false positives. Controlling for no false positives (very stringent): Bonferroni methods and multivariate permutation methods. Bonferroni inequality: $P\left(\bigcup_i E_i\right) \le \sum_i P(E_i)$ (area of union ≤ sum of areas).
20
Gene Discovery and Multiple T-tests Bonferroni methods Bonferroni adjustment If $E_i$ is the event of a false positive discovery for gene $i$, then, conservatively speaking, a false positive is almost guaranteed for $K > 19$ tests at $p_0 = 0.05$, because the Bonferroni bound $K p_0$ on the probability of at least one false positive reaches 1. So change the p-value cutoff from $p_0$ to $p_0/K$. This is called the Bonferroni adjustment. If $K = 20$ and $p_0 = 0.05$, we call gene $i$ significantly differentially expressed if $p_i < 0.0025$.
21
Gene Discovery and Multiple T-tests Bonferroni methods Bonferroni adjustment Too conservative. Excessive stringency leads to more false negatives (type II errors). Has problems with meta-analysis. Variation: sequential Bonferroni test (Holm-Bonferroni test). Sort the $K$ p-values from small to large to get $p_1 \le p_2 \le \dots \le p_K$. Change the p-value cutoff for the $i$th p-value to $p_0/(K-i+1)$ (i.e., $p_1 \le p_0/K$, $p_2 \le p_0/(K-1)$, …, $p_K \le p_0$). If $p_j \le p_0/(K-j+1)$ for all $j \le i$ but $p_{i+1} > p_0/(K-i)$, reject the alternative hypotheses for genes $i+1$ to $K$, but keep them for genes 1 to $i$.
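A minimal Python sketch of the two adjustments follows, assuming one p-value per gene. Both functions use "less than or equal to" at the cutoff (which only matters for exact ties); the function names and p-values are hypothetical.

```python
def bonferroni(p_values, p0=0.05):
    """Flag p-values that pass the single Bonferroni cutoff p0 / K."""
    k = len(p_values)
    return [p <= p0 / k for p in p_values]

def holm(p_values, p0=0.05):
    """Holm-Bonferroni: step through sorted p-values with cutoffs p0/K, p0/(K-1), ..."""
    k = len(p_values)
    order = sorted(range(k), key=lambda i: p_values[i])
    significant = [False] * k
    for rank, idx in enumerate(order):        # rank 0 is the smallest p-value
        if p_values[idx] <= p0 / (k - rank):
            significant[idx] = True
        else:
            break                             # all larger p-values fail as well
    return significant

# Hypothetical p-values for five genes, in gene order (not sorted).
p_values = [0.04, 0.001, 0.30, 0.012, 0.02]
print("Bonferroni:", bonferroni(p_values))
print("Holm:      ", holm(p_values))
```

In this example Holm flags one more gene (p = 0.012) than plain Bonferroni, which illustrates why the sequential version is less conservative.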
22
Gene Discovery and Multiple T-tests Controlling False Positives Statistical tests to control the false positives. Controlling the number of false positives. Simple approach: choose a cutoff for p-values that is lower than the usual 0.05 but higher than that from the Bonferroni adjustment. More sophisticated way: a version of multivariate permutation.
23
Gene Discovery and Multiple T-tests Controlling False Positives Statistical tests to control the false positives. Controlling the proportion of false positives. Consider the proportion (percentage) of false positives among all discovered genes: proportion = (false positives) / (total positives). The p-value cutoff $p_D$ is the quantity to choose. There are other ways of estimating false positives. Details can be found in Tusher et al. PNAS 98:5116-5121.
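One simple way to make the ratio above concrete is to estimate the false positives at a cutoff as (number of genes) × (cutoff), as in the earlier 22,000 × 0.05 arithmetic, and divide by the number of genes that pass. This is an assumption made for illustration, not necessarily the estimator on the slide. It leads to the same comparison used by the Benjamini-Hochberg procedure, which the slides do not name. Function names and p-values are hypothetical.

```python
def estimated_false_proportion(p_values, cutoff):
    """Estimate (false positives) / (total positives) at a given p-value cutoff."""
    k = len(p_values)
    discovered = sum(p <= cutoff for p in p_values)
    if discovered == 0:
        return 0.0
    # Expected false positives if all K genes were null: K * cutoff.
    return min(1.0, k * cutoff / discovered)

def largest_cutoff(p_values, target):
    """Largest observed p-value whose estimated false proportion is within target."""
    best = None
    for p in sorted(p_values):
        if estimated_false_proportion(p_values, p) <= target:
            best = p
    return best

# Hypothetical p-values for eight genes.
p_values = [0.001, 0.008, 0.012, 0.03, 0.2, 0.5, 0.7, 0.9]
cutoff = largest_cutoff(p_values, target=0.1)
print("chosen cutoff:", cutoff)
print("estimated false proportion:", estimated_false_proportion(p_values, cutoff))
```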
24
Review of Microarray Elements of Statistics and Gene Discovery in Expression Data Elements of Machine Learning and Clustering of Gene Expression Profiles
25
Review of Microarray and Gene Discovery Clustering and Classification Preprocessing Distance measures Popular algorithms (not necessarily the best ones) More sophisticated ones Evaluation Data mining
27
-Clustering or classification? -Is training data available? -What domain specific knowledge can be applied? -What preprocessing of data is needed? -Log / data scale and numerical stability -Filtering / denoising -Nonlinear kernel -Feature selection (do I need to use all the data?) -Is the dimensionality of the data too high?
28
How do we process microarray data (clustering)? - Feature selection – genes, transformations of expression levels. - Genes discovered in the class comparison (t-test). Risk: missing genes. - Iterative approach: select genes under different p-value cutoffs, then select the one with good performance using cross-validation. - Principal components (pro and con). - Discriminant analysis (e.g., LDA).
29
Distance Measure (Metric?) -What do you mean by “similar”? -Euclidean -Uncentered correlation -Pearson correlation
30
Distance Metric -Euclidean
Probe      Gene   Expression values
102123_at  Lip1   1596.000 2040.900 1277.000 4090.500 1357.600 1039.200 1387.300 3189.000 1321.300 2164.400 868.600 185.300 266.400 2527.800
160552_at  Ap1s1  4144.400 3986.900 3083.100 6105.900 3245.800 4468.400 7295.000 5410.900 3162.100 4100.900 4603.200 6066.200 5505.800 5702.700
d_E(Lip1, Ap1s1) = 12883
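The Euclidean distance on this slide can be reproduced with a few lines of Python. The sketch assumes the flattened table splits into 14 values per probe with three decimals each, which is how the numbers are read here.

```python
import math

# Expression profiles as read from the slide (14 samples per probe).
lip1 = [1596.0, 2040.9, 1277.0, 4090.5, 1357.6, 1039.2, 1387.3,
        3189.0, 1321.3, 2164.4, 868.6, 185.3, 266.4, 2527.8]
ap1s1 = [4144.4, 3986.9, 3083.1, 6105.9, 3245.8, 4468.4, 7295.0,
         5410.9, 3162.1, 4100.9, 4603.2, 6066.2, 5505.8, 5702.7]

def euclidean(x, y):
    """Square root of the summed squared differences between two profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

distance = euclidean(lip1, ap1s1)
print(round(distance))  # 12883, matching the slide
```

Because Ap1s1 is expressed at a much higher overall level, the distance is dominated by the difference in magnitude rather than by the shape of the two profiles.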
31
Distance Metric -Pearson Correlation [Figure: example scatter plots for r = 1 and r = -1] Ranges from -1 to 1.
32
Distance Metric -Pearson Correlation
Probe      Gene   Expression values
102123_at  Lip1   1596.000 2040.900 1277.000 4090.500 1357.600 1039.200 1387.300 3189.000 1321.300 2164.400 868.600 185.300 266.400 2527.800
160552_at  Ap1s1  4144.400 3986.900 3083.100 6105.900 3245.800 4468.400 7295.000 5410.900 3162.100 4100.900 4603.200 6066.200 5505.800 5702.700
d_P(Lip1, Ap1s1) = 0.904
33
Distance Metric -Uncentered Correlation
Probe      Gene   Expression values
102123_at  Lip1   1596.000 2040.900 1277.000 4090.500 1357.600 1039.200 1387.300 3189.000 1321.300 2164.400 868.600 185.300 266.400 2527.800
160552_at  Ap1s1  4144.400 3986.900 3083.100 6105.900 3245.800 4468.400 7295.000 5410.900 3162.100 4100.900 4603.200 6066.200 5505.800 5702.700
d_u(Lip1, Ap1s1) = 0.835 (an angle of about 33.4° between the two profiles)
34
Distance Metric -Difference between Pearson correlation and uncentered correlation
Probe      Gene   Expression values
102123_at  Lip1   1596.000 2040.900 1277.000 4090.500 1357.600 1039.200 1387.300 3189.000 1321.300 2164.400 868.600 185.300 266.400 2527.800
160552_at  Ap1s1  4144.400 3986.900 3083.100 6105.900 3245.800 4468.400 7295.000 5410.900 3162.100 4100.900 4603.200 6066.200 5505.800 5702.700
Pearson correlation: allows for a possible baseline expression level (each profile is centered on its mean). Uncentered correlation: all values are considered signal.
35
Distance Metric -Difference between Euclidean and correlation
36
Distance Metric -Missing: negative correlation may also mean "close" in a signaling pathway (use 1-|PCC| or 1-PCC^2 as the distance)
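The differences between these correlation measures can be illustrated with toy profiles. The sketch below is an illustration only: the function names and the three short profiles are invented, and it deliberately does not use the Lip1/Ap1s1 data, since any preprocessing behind the slide values is not stated.

```python
import math

def uncentered(x, y):
    """Uncentered correlation: cosine of the angle between the two profiles."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))

def pearson(x, y):
    """Pearson correlation: uncentered correlation after subtracting each mean."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    return uncentered([a - mx for a in x], [b - my for b in y])

profile = [1.0, 2.0, 3.0, 4.0]
shifted = [11.0, 12.0, 13.0, 14.0]   # same shape on a higher baseline
mirrored = [4.0, 3.0, 2.0, 1.0]      # perfectly anti-correlated

print("Pearson, shifted:    ", round(pearson(profile, shifted), 3))
print("Uncentered, shifted: ", round(uncentered(profile, shifted), 3))
print("1 - PCC, mirrored:   ", round(1 - pearson(profile, mirrored), 3))
print("1 - |PCC|, mirrored: ", round(1 - abs(pearson(profile, mirrored)), 3))
```

Pearson correlation ignores the baseline shift while the uncentered version does not, and the 1 - |PCC| distance treats the anti-correlated profile as close whereas 1 - PCC puts it as far away as possible.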
37
Review of Microarray and Gene Discovery Clustering and Classification Preprocessing Distance measures Popular algorithms (not necessarily the best ones) More sophisticated ones Evaluation Data mining
38
How do we process microarray data (clustering)? -Unsupervised Learning – Hierarchical Clustering
39
How do we process microarray data (clustering)? -Unsupervised Learning – Hierarchical Clustering Single linkage: The linking distance is the minimum distance between two clusters.
40
How do we process microarray data (clustering)? -Unsupervised Learning – Hierarchical Clustering Complete linkage: The linking distance is the maximum distance between two clusters.
41
How do we process microarray data (clustering)? -Unsupervised Learning – Hierarchical Clustering Average linkage/UPGMA: The linking distance is the average of all pair-wise distances between members of the two clusters. Since all genes and samples carry equal weight, the linkage is an Unweighted Pair Group Method with Arithmetic Means (UPGMA).
42
How do we process microarray data (clustering)? -Unsupervised Learning – Hierarchical Clustering Single linkage – Prone to chaining and sensitive to noise Complete linkage – Tends to produce compact clusters Average linkage – Sensitive to distance metric
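To make the linkage rules concrete, here is a naive agglomerative clustering sketch in Python in which the linkage is just a function applied to all pair-wise distances between two clusters (`min` for single, `max` for complete, a mean for average). It assumes Euclidean distance and made-up one-dimensional points, and it is written for readability, not for speed.

```python
import math
from itertools import combinations

def average(distances):
    return sum(distances) / len(distances)

def agglomerate(points, linkage, n_clusters):
    """Merge the two closest clusters until n_clusters remain."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        best = None
        for a, b in combinations(range(len(clusters)), 2):
            pairwise = [math.dist(points[i], points[j])
                        for i in clusters[a] for j in clusters[b]]
            d = linkage(pairwise)             # linking distance for this pair
            if best is None or d < best[0]:
                best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]                       # b > a, so index a stays valid
    return sorted(sorted(c) for c in clusters)

# Hypothetical points on a line with slowly growing gaps.
points = [(0.0,), (1.0,), (2.2,), (3.6,), (5.2,)]
print("single:  ", agglomerate(points, min, 2))
print("complete:", agglomerate(points, max, 2))
print("average: ", agglomerate(points, average, 2))
```

On this data single linkage chains the first four points together and leaves the last one alone, while complete and average linkage produce two more balanced groups.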
43
-Unsupervised Learning – Hierarchical Clustering
44
Dendrograms Distance – the height of each horizontal line represents the distance between the two groups it merges. Order – the open-source R software uses the convention that the tighter clusters are on the left. Others have proposed using expression values, loci on chromosomes, and other ranking criteria.
45
-Unsupervised Learning - K-means -Vector quantization -K-D trees -Need to try different K, sensitive to initialization
46
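-Unsupervised Learning - K-means
[cidx, ctrs] = kmeans(yeastvalueshighexp, 4, 'dist', 'corr', 'rep', 20);
(4 is K, the number of clusters; 'corr' selects the correlation distance metric; 'rep', 20 repeats the clustering 20 times from different starting points.)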
47
-Unsupervised Learning - K-means -Number of classes K needs to be specified -Does not always converge to the best solution (may stop at a local optimum) -Sensitive to initialization
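The sensitivity to K and to initialization is easier to see in code. The sketch below is a plain-Python version of the standard K-means iteration with several random restarts, keeping the run with the lowest within-cluster cost. It assumes Euclidean distance (not the correlation distance used in the MATLAB call on the previous slide); the function signature and data are hypothetical.

```python
import math
import random

def kmeans(points, k, replicates=20, max_iter=100, seed=0):
    """Lloyd-style K-means with random restarts; returns (labels, centers)."""
    rng = random.Random(seed)
    best = None
    for _ in range(replicates):
        centers = rng.sample(points, k)       # random initial centers
        for _ in range(max_iter):
            # Assignment step: each point goes to its nearest center.
            labels = [min(range(k), key=lambda c: math.dist(p, centers[c]))
                      for p in points]
            # Update step: each center moves to the mean of its members.
            new_centers = []
            for c in range(k):
                members = [p for p, l in zip(points, labels) if l == c]
                if members:
                    new_centers.append(tuple(sum(v) / len(members) for v in zip(*members)))
                else:
                    new_centers.append(centers[c])   # keep an empty cluster's center
            if new_centers == centers:
                break
            centers = new_centers
        cost = sum(math.dist(p, centers[l]) ** 2 for p, l in zip(points, labels))
        if best is None or cost < best[0]:
            best = (cost, labels, centers)
    return best[1], best[2]

# Hypothetical two-dimensional data with two obvious groups.
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
labels, centers = kmeans(points, k=2)
print(labels)
print(centers)
```

Running it with different values of `k` or `seed` on less cleanly separated data shows how much the result can depend on those choices.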
48
-Issues -Lack of consistency or representative features (5.3 TP53 + 0.8 PTEN doesn’t make sense) -Data structure is missing -Not robust to outliers and noise D’Haeseleer 2005 Nat. Biotechnol 23(12):1499-501
49
-Model-based clustering methods (Han) http://www.cs.umd.edu/~bhhan/research2.html Pan et al. Genome Biology 2002 3:research0009.1 doi:10.1186/gb-2002-3-2-research0009
50
-Structure-based clustering methods
51
-Supervised Learning -Support vector machines (SVM) and Kernels -Only (binary) classifier, no data model
52
-Accuracy vs. generality -Overfitting -Model selection [Figure: prediction error versus model complexity for the training sample and the testing sample; reproduced from Hastie et al.]