 Goal A: Find groups of genes that have correlated expression profiles. These genes are believed to belong to the same biological process and/or are co-regulated.


1 Unsupervised Analysis: Clustering Methods. Goal A: Find groups of genes that have correlated expression profiles. These genes are believed to belong to the same biological process and/or to be co-regulated. Goal B: Divide conditions into groups with similar gene expression profiles. Example: divide drugs according to their effect on gene expression.

2 K-means: The Algorithm. Given a set of numeric points in d-dimensional space and an integer k, the algorithm generates k (or fewer) clusters as follows:
1. Assign all points to a cluster at random.
2. Compute the centroid of each cluster.
3. Reassign each point to the nearest centroid.
4. If the centroids changed, go back to step 2.
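The four steps above can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the names `kmeans` and `sq_dist`, the `seed` and `max_iter` parameters, and the use of squared Euclidean distance are assumptions that do not come from the slides.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, seed=0, max_iter=100):
    """Cluster `points` (a list of equal-length tuples) into k or fewer groups."""
    rng = random.Random(seed)
    # Step 1: assign every point to a cluster at random.
    labels = [rng.randrange(k) for _ in points]
    centroids = {}
    for _ in range(max_iter):
        # Step 2: compute the centroid of each non-empty cluster.
        groups = {}
        for p, c in zip(points, labels):
            groups.setdefault(c, []).append(p)
        new_centroids = {
            c: tuple(sum(col) / len(members) for col in zip(*members))
            for c, members in groups.items()
        }
        # Step 4: stop once the centroids no longer change.
        if new_centroids == centroids:
            break
        centroids = new_centroids
        # Step 3: reassign each point to its nearest centroid.
        labels = [
            min(centroids, key=lambda c: sq_dist(p, centroids[c]))
            for p in points
        ]
    return labels, centroids
```

A cluster that ends up with no points simply disappears from `centroids`, which is why the result can contain fewer than k clusters.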

3 K-means: Example, k = 3 Step 1: Make random assignments and compute centroids (big dots) Step 2: Assign points to nearest centroids Step 3: Re-compute centroids (in this example, solution is now stable)

4 Fuzzy K means  The clusters produced by the k-means procedure are sometimes called "hard" or "crisp" clusters, since any feature vector x either is or is not a member of a particular cluster. This is in contrast to "soft" or "fuzzy" clusters, in which a feature vector x can have a degree of membership in each cluster.  The fuzzy-k-means procedure allows each feature vector x to have a degree of membership in Cluster i:

5 Fuzzy K means Algorithm. Make initial guesses for the means $m_1, m_2, \ldots, m_k$. Until there are no changes in any mean:
- Use the estimated means to find the degree of membership $u(j,i)$ of $x_j$ in Cluster $i$; for example, if $\mathrm{dist}(j,i) = \exp(-\lVert x_j - m_i \rVert^2)$, one might use $u(j,i) = \mathrm{dist}(j,i) / \sum_j \mathrm{dist}(j,i)$.
- For $i$ from 1 to $k$, replace $m_i$ with the fuzzy mean of all of the examples for Cluster $i$.
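A rough sketch of this loop in plain Python follows. It uses the example membership from the slide, exp(-||x_j - m_i||^2) normalised over the points j, and takes the fuzzy mean to be the membership-weighted average of the points. Several things here are assumptions made for illustration: the name `fuzzy_kmeans`, the `max_iter` and `tol` parameters, and stopping when the means move by less than `tol` rather than waiting for exactly no change.

```python
import math


def fuzzy_kmeans(points, means, max_iter=100, tol=1e-9):
    """points: list of tuples; means: initial guesses m_1..m_k (list of tuples)."""
    means = [tuple(m) for m in means]
    for _ in range(max_iter):
        new_means = []
        for m in means:
            d2 = [sum((a - b) ** 2 for a, b in zip(x, m)) for x in points]
            # Subtracting the smallest distance avoids underflow in exp();
            # the normalised memberships are unchanged by this shift.
            shift = min(d2)
            w = [math.exp(-(d - shift)) for d in d2]
            total = sum(w)
            u = [wj / total for wj in w]  # u(j, i), normalised over j
            # Fuzzy mean: membership-weighted average of all points.
            new_means.append(
                tuple(
                    sum(uj * x[c] for uj, x in zip(u, points))
                    for c in range(len(m))
                )
            )
        moved = max(
            abs(a - b) for m, n in zip(means, new_means) for a, b in zip(m, n)
        )
        means = new_means
        if moved < tol:  # assumed stopping rule: means have stopped moving
            break
    return means
```

Other membership functions are common (for example, normalising over clusters instead of over points); this sketch follows the formula as given on the slide.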

6 Time course experiment

7 K-means: Sample Application. Gene clustering: given a series of microarray experiments measuring the expression of a set of genes at regular time intervals in a common cell line, produce clusters of genes that vary in similar ways over time. Normalization allows comparisons across microarrays. Hypothesis: genes that vary in the same way may be co-regulated and/or participate in the same pathway. Figure labels: Sample Array (rows are genes and columns are time points); A cluster of co-regulated genes.

8 Centroid Methods - K-means. Start with random positions of K centroids. Iterate until the centroids are stable: assign points to centroids, then move each centroid to the center of its assigned points. (Figure: iteration = 3.)

9 Application of K-means to time course experiments

10 Agglomerative Hierarchical Clustering. Results depend on the distance update method (single linkage: elongated clusters; complete linkage: sphere-like clusters). Greedy iterative process. Not robust against noise. No inherent measure to choose the clusters.
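The greedy merging process and the two distance update rules can be sketched as follows. This is an illustrative sketch only: the function name `agglomerative`, the `linkage` values, Euclidean distance, and stopping at a caller-supplied `n_clusters` are all assumptions (the last one reflects the point that the method itself offers no inherent way to choose the clusters).

```python
import math


def agglomerative(points, n_clusters, linkage="single"):
    """Greedily merge the two closest clusters until n_clusters remain.

    Returns a list of clusters, each a list of indices into `points`.
    """
    if linkage not in ("single", "complete"):
        raise ValueError("linkage must be 'single' or 'complete'")
    clusters = [[i] for i in range(len(points))]

    def cluster_dist(a, b):
        pairwise = [math.dist(points[i], points[j]) for i in a for j in b]
        # Single linkage: closest pair; complete linkage: farthest pair.
        return min(pairwise) if linkage == "single" else max(pairwise)

    while len(clusters) > n_clusters:
        # Greedy step: pick the pair of clusters with the smallest distance.
        a, b = min(
            (
                (i, j)
                for i in range(len(clusters))
                for j in range(i + 1, len(clusters))
            ),
            key=lambda ij: cluster_dist(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # b > a, so index a is unaffected
    return clusters
```

Recomputing every pairwise distance at each merge keeps the sketch short but is slow for large inputs; practical implementations update a distance matrix instead.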

11 Gene Expression Data. Cluster genes and conditions with two independent clusterings: genes are represented as vectors of expression in all conditions; conditions are represented as vectors of expression of all genes.

12 1. Identify tissue classes (tumor/normal) First clustering - Experiments

13 2. Find Differentiating And Correlated Genes Second Clustering - Genes Ribosomal proteins Cytochrome C HLA2 metabolism

14 Two-way Clustering

15 Coupled Two-way Clustering (CTWC)  Motivation: Only a small subset of genes play a role in a particular biological process; the other genes introduce noise, which may mask the signal of the important players. Only a subset of the samples exhibit the expression patterns of interest.  New Goal: Use subsets of genes to study subsets of samples (and vice versa)  A non-trivial task – exponential number of subsets.  CTWC is a heuristic to solve this problem.

16 CTWC of Colon Cancer Data (A) (B)

17 CTWC of Colon Cancer - Genes. Using only the tumor tissues to cluster genes reveals a correlation between two gene clusters: cell growth and epithelial. Figure label: COLON CANCER - ASSOCIATED WITH EPITHELIAL CELLS.

18 Multiple Testing Problem. Simultaneously test m null hypotheses, one for each gene j. $H_j$: no association between the expression measure of gene j and the response. Because microarray experiments simultaneously monitor expression levels of thousands of genes, there is a large multiplicity issue: an increased chance of false positives.
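To make the multiplicity issue concrete, here is a short worked calculation. It rests on simplifying assumptions that are not stated on the slide: the m tests are independent, each is run at level α, and every null hypothesis is true.

```latex
\Pr(\text{at least one false positive}) = 1 - (1 - \alpha)^m,
\qquad
\alpha = 0.05,\; m = 100:\quad 1 - 0.95^{100} \approx 0.994 .
```

So with only 100 genes tested at the usual 0.05 level, at least one false positive is almost certain under these assumptions.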

19 Hypothesis Truth Vs. Decision

Truth \ Decision | # not rejected | # rejected | totals
# true H | U | V (Type I error) | m0
# non-true H | T (Type II error) | S | m1
totals | m - R | R | m

20 Strong Vs. Weak Control  All probabilities are conditional on which hypotheses are true  Strong control refers to control of the Type I error rate under any combination of true and false nulls  Weak control refers to control of the Type I error rate only under the complete null hypothesis (i.e. all nulls true)  In general, weak control without other safeguards is unsatisfactory

21 Adjusted p-values (p*). The test level (e.g. 0.05) does not need to be determined in advance. Some procedures are most easily described in terms of their adjusted p-values. Adjusted p-values are usually easily estimated using resampling. Procedures can be readily compared based on the corresponding adjusted p-values.

22 A Little Notation. For hypothesis $H_j$, $j = 1, \ldots, m$: observed test statistic $t_j$; observed unadjusted p-value $p_j$. Ordering of observed (absolute) $t_j$: $\{r_j\}$ such that $|t_{r_1}| \ge |t_{r_2}| \ge \ldots \ge |t_{r_m}|$. Ordering of observed $p_j$: $\{r_j\}$ such that $p_{r_1} \le p_{r_2} \le \ldots \le p_{r_m}$. Denote the corresponding random variables by upper case letters (T, P).

23 Control of the type I errors.
Bonferroni single-step adjusted p-values: $p_j^* = \min(m\,p_j, 1)$.
Sidak single-step (SS) adjusted p-values: $p_j^* = 1 - (1 - p_j)^m$.
Sidak free step-down (SD) adjusted p-values: $p_{(j)}^* = 1 - (1 - p_{(j)})^{m-j+1}$, where $p_{(j)}$ is the j-th smallest p-value.
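The three adjustments can be sketched in a few lines of Python. The function names are made up for illustration. One assumption goes beyond the formula as written: the step-down variant below also applies a running maximum over the ordered p-values, so that the adjusted values never decrease as the raw p-values increase.

```python
def bonferroni(p):
    """Single-step Bonferroni: min(m * p_j, 1)."""
    m = len(p)
    return [min(m * pj, 1.0) for pj in p]


def sidak_single_step(p):
    """Single-step Sidak: 1 - (1 - p_j)^m."""
    m = len(p)
    return [1.0 - (1.0 - pj) ** m for pj in p]


def sidak_step_down(p):
    """Step-down Sidak: 1 - (1 - p_(j))^(m - j + 1), then a running maximum."""
    m = len(p)
    order = sorted(range(m), key=lambda j: p[j])
    adjusted = [0.0] * m
    running = 0.0
    for rank, j in enumerate(order):  # rank 0 is the smallest p-value
        value = 1.0 - (1.0 - p[j]) ** (m - rank)
        running = max(running, value)  # assumed monotonicity step
        adjusted[j] = running
    return adjusted
```

For raw p-values 0.01, 0.03 and 0.04, the step-down formula alone gives about 0.0297, 0.0591 and 0.04, so the largest raw p-value would receive a smaller adjusted value than the one before it; the running maximum lifts it to 0.0591.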

24 Control of the type I errors.
Holm (1979) step-down adjusted p-values: $p_{r_j}^* = \max_{k=1,\ldots,j} \{\min((m-k+1)\,p_{r_k}, 1)\}$. Intuitive explanation: once $H_{(1)}$ is rejected by Bonferroni, there are only m-1 remaining hypotheses that might still be true (then another Bonferroni, etc.).
Hochberg (1988) step-up adjusted p-values (Simes inequality): $p_{r_j}^* = \min_{k=j,\ldots,m} \{\min((m-k+1)\,p_{r_k}, 1)\}$.
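Both formulas use the same per-rank term, min((m-k+1) p, 1), and differ only in the direction of the cumulative maximum or minimum. A minimal sketch, with function names chosen for illustration:

```python
def holm(p):
    """Holm step-down: running maximum from the smallest p-value upward."""
    m = len(p)
    order = sorted(range(m), key=lambda j: p[j])
    adjusted = [0.0] * m
    running = 0.0
    for rank, j in enumerate(order):  # rank 0 corresponds to k = 1
        running = max(running, min((m - rank) * p[j], 1.0))
        adjusted[j] = running
    return adjusted


def hochberg(p):
    """Hochberg step-up: running minimum from the largest p-value downward."""
    m = len(p)
    order = sorted(range(m), key=lambda j: p[j])
    adjusted = [0.0] * m
    running = 1.0
    for rank in range(m - 1, -1, -1):  # start at k = m
        j = order[rank]
        running = min(running, min((m - rank) * p[j], 1.0))
        adjusted[j] = running
    return adjusted
```

Because a minimum over k = j…m can never exceed the k = j term that Holm's maximum always includes, the Hochberg values are never larger than the Holm values for the same input.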

25 Control of the type I errors.
Westfall & Young (1993) step-down minP adjusted p-values: $p_{r_j}^* = \max_{k=1,\ldots,j} \{\Pr(\min_{l \in \{r_k,\ldots,r_m\}} P_l \le p_{r_k} \mid H_0^C)\}$.
Westfall & Young (1993) step-down maxT adjusted p-values: $p_{r_j}^* = \max_{k=1,\ldots,j} \{\Pr(\max_{l \in \{r_k,\ldots,r_m\}} |T_l| \ge |t_{r_k}| \mid H_0^C)\}$.
Here $H_0^C$ denotes the complete null hypothesis (all nulls true).
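One way to estimate the maxT adjusted p-values is by permuting the sample labels. The sketch below is a hedged illustration rather than the authors' procedure: the function names, the two-group design, the use of the absolute difference in group means as the statistic (instead of a t-statistic), and the `n_perm` and `seed` parameters are all assumptions.

```python
import random


def abs_mean_diff(values, labels):
    """Absolute difference between the means of the label-1 and label-0 groups."""
    g1 = [v for v, lab in zip(values, labels) if lab == 1]
    g0 = [v for v, lab in zip(values, labels) if lab == 0]
    return abs(sum(g1) / len(g1) - sum(g0) / len(g0))


def maxt_step_down(data, labels, n_perm=1000, seed=0):
    """data: one row of expression values per gene; labels: 0/1 per sample."""
    rng = random.Random(seed)
    m = len(data)
    observed = [abs_mean_diff(row, labels) for row in data]
    # r_1..r_m: genes ordered by decreasing observed statistic.
    order = sorted(range(m), key=lambda j: -observed[j])
    counts = [0] * m
    perm = list(labels)
    for _ in range(n_perm):
        rng.shuffle(perm)  # permuted labels approximate the complete null
        stats = [abs_mean_diff(row, perm) for row in data]
        running = 0.0
        # Successive maxima over {r_k, ..., r_m}, built from k = m down to 1.
        for k in range(m - 1, -1, -1):
            running = max(running, stats[order[k]])
            if running >= observed[order[k]]:
                counts[k] += 1
    adjusted = [0.0] * m
    running = 0.0
    for k, j in enumerate(order):
        # Outer max over k = 1..j enforces monotone adjusted p-values.
        running = max(running, counts[k] / n_perm)
        adjusted[j] = running
    return adjusted
```

The cost is `n_perm` passes over all genes, which is why the slides describe these procedures as computer-intensive. The minP version needs a p-value for every permuted statistic and is more expensive still.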

26 Westfall & Young (1993) Adjusted p-values. Step-down procedures: successively smaller adjustments at each step. Take into account the joint distribution of the test statistics. Less conservative than Bonferroni, Sidak, Holm, or Hochberg adjusted p-values. Can be estimated by resampling but computer-intensive (especially for minP).

