1
Microarray Data Analysis
(Lecture for CS397-CXZ Algorithms in Bioinformatics)
March 19, 2004
ChengXiang Zhai, Department of Computer Science, University of Illinois, Urbana-Champaign
2
Gene Expression Data (Microarray)
p genes on n samples: rows are genes, columns are mRNA samples. Entry (i, j) is the expression level of gene i in mRNA sample j, computed as log(treated expression value / control expression value).

          sample1  sample2  sample3  sample4  sample5  ...
gene 1      0.46     0.30     0.80     1.51     0.90   ...
gene 2     -0.10     0.49     0.24     0.06     0.46   ...
gene 3      0.15     0.74     0.04     0.10     0.20   ...
gene 4     -0.45    -1.03    -0.79    -0.56    -0.32   ...
gene 5     -0.06     1.06     1.35     1.09    -1.09   ...
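A minimal sketch of how such a matrix is derived. The base-2 log and the raw channel intensities below are assumptions (the intensities were chosen so the resulting log-ratios reproduce the table above; neither is given on the slide):

```python
import numpy as np

# Hypothetical raw intensities, 5 genes x 5 samples (treated channel);
# the control channel is taken to be a flat 100 for illustration.
treated = np.array([[138., 123., 174., 285., 187.],
                    [ 93., 140., 118., 104., 138.],
                    [111., 167., 103., 107., 115.],
                    [ 73.,  49.,  58.,  68.,  80.],
                    [ 96., 208., 255., 213.,  47.]])
control = np.full_like(treated, 100.)

# Each entry is log2(treated / control): positive means up-regulated,
# negative means down-regulated relative to the control.
expression = np.log2(treated / control)
print(expression.round(2))  # matches the table above
```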
3
Some Possible Applications
- Sample from a specific organ to show which genes are expressed there
- Compare samples from healthy and sick hosts to find gene-disease connections
- Discover co-regulated genes
- Discover promoters
4
Major Analysis Techniques
- Single gene analysis: compare the expression levels of the same gene under different conditions. Main techniques: significance tests (e.g., the t-test; see the sketch below).
- Gene group analysis: find genes that are expressed similarly across many different conditions. Main techniques: clustering (many possibilities).
- Gene network analysis: analyze gene regulation relationships at a large scale. Main techniques: Bayesian networks.
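A sketch of the single-gene case: a two-sample t-test on hypothetical log-ratios for one gene under two conditions (SciPy assumed):

```python
import numpy as np
from scipy import stats

# Hypothetical log-ratios of one gene across replicate samples
healthy = np.array([0.05, -0.10, 0.12, 0.02, -0.04])
sick    = np.array([0.90,  1.10, 0.75, 1.30,  0.95])

# Two-sample t-test: is the mean expression level of this gene
# significantly different between the two conditions?
t, p = stats.ttest_ind(healthy, sick, equal_var=False)  # Welch's variant
print(f"t = {t:.2f}, p = {p:.4f}")
```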
5
Clustering Methods
- Similarity-based (needs a similarity function)
  - Construct a partition, either agglomeratively (bottom-up) or by searching for an optimal partition
  - Typically "hard" clustering
- Model-based (latent models, probabilistic or algebraic)
  - First estimate the model; once the model is known, the clusters are obtained easily
  - Typically "soft" clustering
6
Similarity-based Clustering
- Define a similarity function to measure the similarity between two objects
- Common criterion: find a partition that maximizes intra-cluster similarity and minimizes inter-cluster similarity
- Two ways to construct the partition:
  - Hierarchical (e.g., agglomerative hierarchical clustering)
  - Search, starting from a random partition (e.g., K-means)
7
Method 1 (Similarity-based): Agglomerative Hierarchical Clustering
8
Agglomerative Hierarchical Clustering
- Given a similarity function to measure the similarity between two objects
- Gradually group similar objects together in a bottom-up fashion
- Stop when some stopping criterion is met
- Variations: different ways to compute group similarity based on individual object similarities (see the sketch below)
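A minimal sketch of the procedure using SciPy's hierarchical-clustering routines; the data, the Euclidean metric, and the 3-cluster cut are illustrative assumptions:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical expression profiles: 6 genes x 4 conditions
profiles = np.array([[ 0.5,  0.4,  0.6,  0.5],
                     [ 0.6,  0.5,  0.5,  0.4],
                     [-1.0, -0.9, -1.1, -1.0],
                     [-0.9, -1.0, -1.0, -1.1],
                     [ 2.0,  1.9,  2.1,  2.0],
                     [ 1.9,  2.1,  2.0,  1.9]])

# Bottom-up merging; method='average' is average-link group similarity
Z = linkage(profiles, method='average', metric='euclidean')

# Stopping criterion: cut the tree into 3 clusters
labels = fcluster(Z, t=3, criterion='maxclust')
print(labels)  # genes with similar profiles share a label, e.g. [1 1 2 2 3 3]
```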
9
Similarity Measure: Pearson Correlation Coefficient
The most popular correlation coefficient is the Pearson correlation coefficient (1892). The correlation between X = {X_1, X_2, …, X_n} and Y = {Y_1, Y_2, …, Y_n} is

s_{XY} = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2} \, \sqrt{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}}

where \bar{X} and \bar{Y} are the sample means. s_{XY} is the similarity between X and Y. Better measures focus on a subset of values…
(Adapted from a slide by Shin-Mu Tseng)
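A direct translation of the formula, hand-rolled and then checked against NumPy's built-in (the two example profiles are genes 1 and 3 from the table above):

```python
import numpy as np

def pearson_cc(x, y):
    """Pearson correlation coefficient between two expression profiles."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    xc, yc = x - x.mean(), y - y.mean()
    # numerator: sum of centered products; denominator: product of norms
    return (xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc))

x = [0.46, 0.30, 0.80, 1.51, 0.90]
y = [0.15, 0.74, 0.04, 0.10, 0.20]
print(pearson_cc(x, y))
print(np.corrcoef(x, y)[0, 1])  # same value
```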
10
Similarity-induced Structure
11
How to Compute Group Similarity?
Given two groups g1 and g2, three popular methods (a sketch of all three follows):
- Single-link: s(g1, g2) = similarity of the closest pair
- Complete-link: s(g1, g2) = similarity of the farthest pair
- Average-link: s(g1, g2) = average similarity over all pairs
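A sketch of the three group-similarity rules, given any pairwise similarity function; the toy 1-D points and the negative-distance similarity are assumptions for illustration:

```python
def group_similarity(g1, g2, sim, method="average"):
    """Similarity between groups g1, g2 from pairwise similarities sim(x, y)."""
    sims = [sim(x, y) for x in g1 for y in g2]
    if method == "single":    # closest (most similar) pair
        return max(sims)
    if method == "complete":  # farthest (least similar) pair
        return min(sims)
    return sum(sims) / len(sims)  # average-link: mean over all pairs

# Toy example: 1-D points, similarity = negative distance
sim = lambda a, b: -abs(a - b)
g1, g2 = [0.0, 1.0], [4.0, 5.0]
print(group_similarity(g1, g2, sim, "single"))    # -3.0 (closest pair: 1 and 4)
print(group_similarity(g1, g2, sim, "complete"))  # -5.0 (farthest pair: 0 and 5)
print(group_similarity(g1, g2, sim, "average"))   # -4.0
```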
12
Three Methods Illustrated
[Figure: two clusters g1 and g2, indicating which pairs of points determine the single-link, complete-link, and average-link group similarities]
13
Comparison of the Three Methods
- Single-link: "loose" clusters; individual decision, sensitive to outliers
- Complete-link: "tight" clusters; individual decision, sensitive to outliers
- Average-link: "in between"; group decision, insensitive to outliers
Which one is the best? It depends on what you need!
14
Method 2 (similarity-based): K-Means
15
K-Means Clustering
Given a similarity function:
1. Start with k randomly selected data points and assume they are the centroids of k clusters
2. Assign every data point to the cluster whose centroid is closest to it
3. Recompute the centroid of each cluster
4. Repeat steps 2-3 until the similarity-based objective function converges
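A minimal NumPy sketch of the loop above, using Euclidean distance as the dissimilarity and ignoring the empty-cluster corner case:

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # 1. Start with k randomly selected data points as centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # 2. Assign each point to the cluster with the closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. Recompute each centroid as the mean of its assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # 4. Stop once the centroids (and hence the objective) stop changing
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```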
16
Method 3 (model-based): Mixture Models
17
Mixture Model for Clustering
Each cluster i has its own component distribution P(X|Cluster_i), and the data are modeled as a weighted sum of the components:

P(X) = \lambda_1 P(X \mid \mathrm{Cluster}_1) + \lambda_2 P(X \mid \mathrm{Cluster}_2) + \lambda_3 P(X \mid \mathrm{Cluster}_3)

where the mixing weights \lambda_i sum to 1.
18
Mixture Model Estimation
- Likelihood function: L(\lambda, \mu, \Sigma) = \prod_{j=1}^{n} \sum_{i=1}^{k} \lambda_i P(X_j \mid \mu_i, \Sigma_i)
- Parameters: the mixing weights \lambda_i, means \mu_i, and covariances \Sigma_i (Gaussian components)
- Estimated using the EM algorithm
- Similar to "soft" K-means
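A sketch with scikit-learn's Gaussian mixture, which runs EM internally; the 1-D two-component data are hypothetical, generated just for illustration:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Hypothetical 1-D expression values drawn from two regimes
rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(-1.0, 0.3, 100),
                    rng.normal( 1.5, 0.4, 100)]).reshape(-1, 1)

# EM estimates the mixing weights, means, and covariances
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
print(gmm.weights_.round(2), gmm.means_.ravel().round(2))

# "Soft" assignment: posterior P(cluster | x) for each point
print(gmm.predict_proba(X[:3]).round(3))
```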
19
Method 4 (model-based) [if we have time]: Singular Value Decomposition (SVD)
Also called "Latent Semantic Indexing" (LSI)
20
Example of “Semantic Concepts” (Slide from C. Faloutsos’s talk)
21
Singular Value Decomposition (SVD)

A[n×m] = U[n×r] Σ[r×r] (V[m×r])^T

- A: n × m matrix (n documents, m terms)
- U: n × r matrix (n documents, r concepts)
- Σ: r × r diagonal matrix (strength of each "concept"), where r is the rank of the matrix
- V: m × r matrix (m terms, r concepts)
(Slide from C. Faloutsos's talk)
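A minimal NumPy sketch on a hypothetical "term-document" matrix that is rank 2 by construction:

```python
import numpy as np

# Hypothetical matrix: 4 documents x 3 terms, rank 2 by construction
A = np.array([[1., 1., 0.],
              [2., 2., 0.],
              [0., 0., 1.],
              [0., 0., 2.]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)
r = int(np.sum(s > 1e-10))  # rank = number of non-zero singular values
print(r, s.round(3))        # singular values = strength of each "concept"

# Keeping only the top r concepts reconstructs A exactly: A = U Sigma V^T
A_hat = U[:, :r] @ np.diag(s[:r]) @ Vt[:r]
print(np.allclose(A, A_hat))  # True
```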
22
Example of SVD
[Figure: a term-document matrix over the terms "data", "inf", "retrieval", "brain", "lung", with CS documents and MD documents, factored as A = U Σ V^T. U groups the documents into a CS-concept and an MD-concept, the diagonal of Σ gives the strength of each concept (the basis for dimensionality reduction), and V gives the term representation of each concept.]
(Slide adapted from C. Faloutsos's talk)
23
More Clustering Methods and Software
- Partitioning: K-Means, K-Medoids, PAM, CLARA, …
- Hierarchical: Cluster, HAC, BIRCH, CURE, ROCK
- Density-based: CAST, DBSCAN, OPTICS, CLIQUE, …
- Grid-based: STING, CLIQUE, WaveCluster, …
- Model-based: SOM (self-organizing map), COBWEB, CLASSIT, AutoClass, …
- Two-way clustering, block clustering