1
Unsupervised Analysis: Clustering Methods
Goal A: Find groups of genes that have correlated expression profiles. These genes are believed to belong to the same biological process and/or to be co-regulated.
Goal B: Divide conditions into groups with similar gene expression profiles. Example: divide drugs according to their effect on gene expression.
2
K-means: The Algorithm
Given a set of numeric points in d-dimensional space and an integer k, the algorithm generates k (or fewer) clusters as follows:
1. Assign all points to a cluster at random.
2. Compute the centroid of each cluster.
3. Reassign each point to the nearest centroid.
4. If the centroids changed, go back to step 2.
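The four steps can be sketched in plain Python as follows. This is a minimal illustration and not the presenter's code: the function names `kmeans` and `sq_dist`, the `seed` and `max_iter` parameters, and the choice of squared Euclidean distance are all assumptions made for the example. Clusters that become empty are simply dropped, which is one way to end up with fewer than k clusters.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, seed=0, max_iter=100):
    """Return (labels, centroids) for k or fewer clusters."""
    rng = random.Random(seed)
    # Step 1: assign every point to a cluster at random.
    labels = [rng.randrange(k) for _ in points]
    centroids = {}
    for _ in range(max_iter):
        # Step 2: centroid = coordinate-wise mean of each non-empty cluster.
        groups = {}
        for p, c in zip(points, labels):
            groups.setdefault(c, []).append(p)
        new_centroids = {
            c: tuple(sum(col) / len(members) for col in zip(*members))
            for c, members in groups.items()
        }
        # Step 4: stop once the centroids no longer change.
        if new_centroids == centroids:
            break
        centroids = new_centroids
        # Step 3: reassign each point to its nearest centroid.
        labels = [
            min(centroids, key=lambda c: sq_dist(p, centroids[c]))
            for p in points
        ]
    return labels, centroids
```

Calling `kmeans(points, 3)` on a list of coordinate tuples returns one cluster label per point and a dictionary mapping each surviving label to its centroid. The result depends on the random initial assignment, so different seeds can give different clusterings.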
3
K-means: Example, k = 3
Step 1: Make random assignments and compute centroids (big dots).
Step 2: Assign points to the nearest centroids.
Step 3: Re-compute centroids (in this example, the solution is now stable).
4
Fuzzy K means
The clusters produced by the k-means procedure are sometimes called "hard" or "crisp" clusters, since any feature vector x either is or is not a member of a particular cluster. This is in contrast to "soft" or "fuzzy" clusters, in which a feature vector x can have a degree of membership in each cluster. The fuzzy-k-means procedure allows each feature vector x to have a degree of membership in Cluster i.
5
Fuzzy K means Algorithm
Make initial guesses for the means $m_1, m_2, \ldots, m_k$
Until there are no changes in any mean:
    Use the estimated means to find the degree of membership $u(j,i)$ of $x_j$ in Cluster i; for example, if $\mathrm{dist}(j,i) = \exp(-\|x_j - m_i\|^2)$, one might use $u(j,i) = \mathrm{dist}(j,i) / \sum_j \mathrm{dist}(j,i)$
    For i from 1 to k
        Replace $m_i$ with the fuzzy mean of all of the examples for Cluster i
    end_for
end_until
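A rough Python sketch of this loop is given below, under several assumptions that the slide does not settle. Memberships are normalized across clusters so that each point's memberships sum to 1 (a common convention, chosen here for the example). The "fuzzy mean" is taken to be the average of the points weighted by squared membership. The loop stops when the means move by less than a small tolerance rather than on exact equality. The names `fuzzy_kmeans`, `memberships`, `tol` and `max_iter` are hypothetical.

```python
import math


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def memberships(points, means):
    """u[j][i]: degree of membership of points[j] in cluster i (rows sum to 1)."""
    u = []
    for x in points:
        d = [sq_dist(x, m) for m in means]
        d_min = min(d)
        # Subtracting d_min avoids underflow and cancels in the ratio.
        a = [math.exp(-(v - d_min)) for v in d]
        total = sum(a)
        u.append([v / total for v in a])
    return u


def fuzzy_kmeans(points, initial_means, max_iter=100, tol=1e-12):
    """Return (means, memberships) after iterating the fuzzy update."""
    means = [tuple(m) for m in initial_means]
    dims = range(len(points[0]))
    for _ in range(max_iter):
        u = memberships(points, means)
        new_means = []
        for i, old in enumerate(means):
            # Assumed fuzzy mean: weight each point by its squared membership.
            w = [row[i] ** 2 for row in u]
            w_sum = sum(w)
            if w_sum == 0.0:
                new_means.append(old)  # no point supports this cluster
                continue
            new_means.append(
                tuple(sum(wj * x[d] for wj, x in zip(w, points)) / w_sum
                      for d in dims)
            )
        shift = max(sq_dist(a, b) for a, b in zip(means, new_means))
        means = new_means
        if shift < tol:  # means are (numerically) unchanged
            break
    return means, memberships(points, means)
```

Unlike the hard assignment in k-means, every point contributes to every mean here, with points far from a mean contributing almost nothing.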
6
Time course experiment
7
K-means: Sample Application
Gene clustering.
Given a series of microarray experiments measuring the expression of a set of genes at regular time intervals in a common cell line. Normalization allows comparisons across microarrays.
Produce clusters of genes which vary in similar ways over time.
Hypothesis: genes which vary in the same way may be co-regulated and/or participate in the same pathway.
Sample Array. Rows are genes and columns are time points.
A cluster of co-regulated genes.
8
Centroid Methods - K-means
Start with random positions of K centroids.
Iterate until centroids are stable:
Assign points to centroids
Move centroids to the center of the assigned points
Iteration = 3
9
Application of K-means to time course experiments
10
Agglomerative Hierarchical Clustering
Results depend on the distance update method
Single linkage: elongated clusters
Complete linkage: sphere-like clusters
Greedy iterative process
Not robust against noise
No inherent measure to choose the clusters
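To make the role of the distance update method concrete, here is a small, naive sketch of agglomerative clustering with a switch between single and complete linkage. The function name `agglomerative`, its parameters, and the use of squared Euclidean distance are assumptions for illustration. Because there is no inherent measure for choosing the clusters, the caller has to supply the number of clusters at which merging stops.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def agglomerative(points, n_clusters, linkage="single"):
    """Greedily merge the two closest clusters until n_clusters remain.

    Returns a list of clusters, each a list of indices into `points`.
    """
    # Single linkage uses the closest pair of members, complete the farthest.
    pick = min if linkage == "single" else max
    clusters = [[i] for i in range(len(points))]

    def cluster_dist(a, b):
        return pick(sq_dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > max(n_clusters, 1):
        # Greedy step: find the pair of clusters with the smallest distance.
        _, a, b = min(
            (cluster_dist(clusters[a], clusters[b]), a, b)
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        )
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters
```

This version recomputes all pairwise cluster distances at every merge, which is fine for a handful of points but far too slow for thousands of genes. It is meant only to show where the linkage choice enters.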
11
Gene Expression Data
Cluster genes and conditions
Two independent clusterings:
Genes are represented as vectors of expression in all conditions
Conditions are represented as vectors of expression of all genes
12
1. Identify tissue classes (tumor/normal) First clustering - Experiments
13
2. Find Differentiating And Correlated Genes Second Clustering - Genes Ribosomal proteins Cytochrome C HLA2 metabolism
14
Two-way Clustering
15
Coupled Two-way Clustering (CTWC)
Motivation: Only a small subset of genes play a role in a particular biological process; the other genes introduce noise, which may mask the signal of the important players. Only a subset of the samples exhibit the expression patterns of interest.
New Goal: Use subsets of genes to study subsets of samples (and vice versa).
A non-trivial task: the number of subsets is exponential. CTWC is a heuristic to solve this problem.
16
CTWC of Colon Cancer Data (A) (B)
17
CTWC of Colon Cancer - Genes
Using only the tumor tissues to cluster genes reveals correlation between two gene clusters: cell growth and epithelial.
COLON CANCER - ASSOCIATED WITH EPITHELIAL CELLS
18
Multiple Testing Problem
Simultaneously test m null hypotheses, one for each gene j
$H_j$: no association between the expression measure of gene j and the response
Because microarray experiments simultaneously monitor expression levels of thousands of genes, there is a large multiplicity issue
Increased chance of false positives
19
Hypothesis Truth Vs. Decision
Truth \ Decision    # not rejected         # rejected           totals
# true H            U                      V (Type I error)     m0
# non-true H        T (Type II error)      S                    m1
totals              m - R                  R                    m
20
Strong Vs. Weak Control
All probabilities are conditional on which hypotheses are true
Strong control refers to control of the Type I error rate under any combination of true and false nulls
Weak control refers to control of the Type I error rate only under the complete null hypothesis (i.e. all nulls true)
In general, weak control without other safeguards is unsatisfactory
21
Adjusted p-values (p*)
Test level (e.g. 0.05) does not need to be determined in advance
Some procedures are most easily described in terms of their adjusted p-values
Usually easily estimated using resampling
Procedures can be readily compared based on the corresponding adjusted p-values
22
A Little Notation
For hypothesis $H_j$, $j = 1, \ldots, m$:
observed test statistic: $t_j$
observed unadjusted p-value: $p_j$
Ordering of observed (absolute) $t_j$: $\{r_j\}$ such that $|t_{r_1}| \ge |t_{r_2}| \ge \ldots \ge |t_{r_m}|$
Ordering of observed $p_j$: $\{r_j\}$ such that $p_{r_1} \le p_{r_2} \le \ldots \le p_{r_m}$
Denote the corresponding random variables by upper case letters (T, P)
23
Control of the type I errors
Bonferroni single-step adjusted p-values: $p_j^* = \min(m\,p_j, 1)$
Sidak single-step (SS) adjusted p-values: $p_j^* = 1 - (1 - p_j)^m$
Sidak free step-down (SD) adjusted p-values: $p_j^* = 1 - (1 - p_{(j)})^{m - j + 1}$
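The three adjustments on this slide translate almost directly into code. The sketch below uses hypothetical function names and takes a plain list of unadjusted p-values. For the step-down variant it additionally applies a running maximum over the ordered p-values so that the adjusted values stay monotone. That step is an assumption added here; it mirrors the max used in the Holm formula and is not written on this slide.

```python
def bonferroni(pvals):
    """Single-step Bonferroni: p* = min(m * p, 1)."""
    m = len(pvals)
    return [min(m * p, 1.0) for p in pvals]


def sidak_single_step(pvals):
    """Single-step Sidak: p* = 1 - (1 - p)^m."""
    m = len(pvals)
    return [1.0 - (1.0 - p) ** m for p in pvals]


def sidak_step_down(pvals):
    """Step-down Sidak: exponent m - j + 1 for the j-th smallest p-value."""
    m = len(pvals)
    # order[0] is the index of the smallest p-value, order[1] the next, ...
    order = sorted(range(m), key=lambda j: pvals[j])
    adjusted = [0.0] * m
    running = 0.0
    for rank, j in enumerate(order, start=1):
        step = 1.0 - (1.0 - pvals[j]) ** (m - rank + 1)
        running = max(running, step)  # assumed monotonicity enforcement
        adjusted[j] = running
    return adjusted
```

All three functions return the adjusted values in the same order as the input, so they can be compared gene by gene.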
24
Control of the type I errors
Holm (1979) step-down adjusted p-values: $p_{r_j}^* = \max_{k = 1, \ldots, j} \{\min((m - k + 1)\,p_{r_k}, 1)\}$
Intuitive explanation: once $H_{(1)}$ is rejected by Bonferroni, there are only m - 1 remaining hypotheses that might still be true (then another Bonferroni, etc.)
Hochberg (1988) step-up adjusted p-values (Simes inequality): $p_{r_j}^* = \min_{k = j, \ldots, m} \{\min((m - k + 1)\,p_{r_k}, 1)\}$
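The only difference between the two formulas is the direction of the scan over the ordered p-values: Holm takes a running maximum from the smallest p-value upward, Hochberg a running minimum from the largest downward. A short sketch, with hypothetical function names and a plain list of unadjusted p-values as input:

```python
def holm(pvals):
    """Holm step-down adjusted p-values, returned in input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda j: pvals[j])  # ascending p-values
    adjusted = [0.0] * m
    running = 0.0
    for k, j in enumerate(order, start=1):
        # Running max over k = 1..j of min((m - k + 1) * p, 1).
        running = max(running, min((m - k + 1) * pvals[j], 1.0))
        adjusted[j] = running
    return adjusted


def hochberg(pvals):
    """Hochberg step-up adjusted p-values, returned in input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda j: pvals[j])  # ascending p-values
    adjusted = [0.0] * m
    running = 1.0
    for k in range(m, 0, -1):
        j = order[k - 1]
        # Running min over k = j..m of min((m - k + 1) * p, 1).
        running = min(running, min((m - k + 1) * pvals[j], 1.0))
        adjusted[j] = running
    return adjusted
```

For the p-values 0.01, 0.04, 0.03 the per-step values are 0.03, 0.06 and 0.04 in sorted order, so Holm gives 0.03, 0.06, 0.06 and Hochberg gives 0.03, 0.04, 0.04 (both listed in input order). Hochberg is never larger than Holm.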
25
Control of the type I errors
Westfall & Young (1993) step-down minP adjusted p-values: $p_{r_j}^* = \max_{k = 1, \ldots, j} \{\Pr(\min_{l \in \{r_k, \ldots, r_m\}} P_l \le p_{r_k} \mid H_0^C)\}$
Westfall & Young (1993) step-down maxT adjusted p-values: $p_{r_j}^* = \max_{k = 1, \ldots, j} \{\Pr(\max_{l \in \{r_k, \ldots, r_m\}} |T_l| \ge |t_{r_k}| \mid H_0^C)\}$
26
Westfall & Young (1993) Adjusted p-values
Step-down procedures: successively smaller adjustments at each step
Take into account the joint distribution of the test statistics
Less conservative than Bonferroni, Sidak, Holm, or Hochberg adjusted p-values
Can be estimated by resampling but computer-intensive (especially for minP)
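As an illustration of the resampling estimate, the sketch below computes step-down maxT adjusted p-values from an observed vector of test statistics and a matrix of statistics recomputed under the complete null. The function name `maxt_step_down` and the input layout (one row per resample, one column per gene) are assumptions. How the null rows are produced, for example by permuting the response labels, is left to the caller.

```python
def maxt_step_down(t_obs, t_null):
    """Step-down maxT adjusted p-values estimated from resampled statistics.

    t_obs:  list of m observed test statistics.
    t_null: list of B rows, each holding m statistics under the complete null.
    """
    m = len(t_obs)
    n_resamples = len(t_null)
    # order[0] is the gene with the largest |t|, order[m-1] the smallest.
    order = sorted(range(m), key=lambda j: -abs(t_obs[j]))
    counts = [0] * m
    for row in t_null:
        running = 0.0
        # Successive maxima over the genes ranked at or below each position.
        for pos in range(m - 1, -1, -1):
            j = order[pos]
            running = max(running, abs(row[j]))
            if running >= abs(t_obs[j]):
                counts[pos] += 1
    adjusted = [0.0] * m
    prev = 0.0
    for pos, j in enumerate(order):
        # Outer max over k = 1..j keeps the adjusted p-values monotone.
        prev = max(prev, counts[pos] / n_resamples)
        adjusted[j] = prev
    return adjusted
```

Each resampled row is scanned once, so the cost is proportional to the number of resamples times the number of genes. The minP variant needs a p-value for every resampled statistic as well, which is where the extra computation comes from.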