Microarray Data Analysis (Lecture for CS498-CXZ Algorithms in Bioinformatics) Oct 13, 2005 ChengXiang Zhai Department of Computer Science University of Illinois, Urbana-Champaign

Outline
- Microarrays
- Hierarchical Clustering
- K-Means Clustering
- Corrupted Cliques Problem
- CAST Clustering Algorithm

Motivation: Inferring Gene Functionality
- Researchers want to know the functions of newly sequenced genes
- Simply comparing a new gene's sequence to known DNA sequences often does not reveal its function
- For about 40% of sequenced genes, functionality cannot be ascertained by comparison to the sequences of other known genes alone
- Microarrays allow biologists to infer gene function even when sequence similarity alone is insufficient to infer it

Microarrays and Expression Analysis
- Microarrays measure the activity (expression level) of genes under varying conditions/time points
- Expression level is estimated by measuring the amount of mRNA for that particular gene
- A gene is active if it is being transcribed
- More mRNA usually indicates more gene activity

Microarray Experiments
- Produce cDNA from mRNA (DNA is more stable)
- Attach phosphors to the cDNA to see when a particular gene is expressed
- Different-color phosphors are available to compare many samples at once
- Hybridize the cDNA over the microarray
- Scan the microarray with a phosphor-illuminating laser
- Illumination reveals transcribed genes
- Scan the microarray multiple times, once for each phosphor color

Microarray Experiments (cont'd)
(Figure: phosphors can instead be added at this step; laser illumination is then used in place of staining.)

Interpretation of Microarray Data
- Green: expressed only in the control sample
- Red: expressed only in the experimental cell
- Yellow: equally expressed in both samples
- Black: not expressed in either the control or the experimental cells

Microarray Data
- Microarray data are usually transformed into an intensity matrix (below)
- The intensity matrix allows biologists to make correlations between different genes (even if they are dissimilar) and to understand how gene functions might be related
- This is where clustering comes into play
(Table: an intensity matrix with rows Gene 1 through Gene 5 and columns Time X, Time Y, Time Z; each entry is the intensity (expression level) of that gene at the measured time.)
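The intensity matrix above can be sketched as a small NumPy array. The values below are illustrative placeholders only (the slide's actual numbers are not reproduced here); rows are genes, columns are time points:

```python
import numpy as np

# Hypothetical intensity matrix: rows = genes, columns = time points.
# The numbers are made up for illustration, not taken from the slide.
genes = ["Gene 1", "Gene 2", "Gene 3", "Gene 4", "Gene 5"]
times = ["Time X", "Time Y", "Time Z"]
intensity = np.array([
    [10.0, 8.0, 10.0],
    [10.0, 0.0,  9.0],
    [ 4.0, 8.5,  3.0],
    [ 7.0, 8.0,  3.0],
    [ 1.0, 2.0,  3.0],
])

# Each row is one gene's expression profile across conditions;
# clustering groups together rows with similar profiles.
print(intensity.shape)  # (5, 3)
```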

Major Analysis Techniques
- Single-gene analysis
  - Compare the expression levels of the same gene under different conditions
  - Main techniques: significance tests (e.g., t-test)
- Gene-group analysis
  - Find genes that are expressed similarly across many different conditions (co-expressed genes -> co-regulated genes)
  - Main techniques: clustering (many possibilities)
- Gene-network analysis
  - Analyze gene regulation relationships at a large scale
  - Main techniques: Bayesian networks (not covered here)

Clustering of Microarray Data
(Figure: expression profiles grouped into clusters.)

Homogeneity and Separation Principles
- Homogeneity: elements within a cluster are close to each other
- Separation: elements in different clusters are far apart from each other
- ...clustering is not an easy task!
(Figure: given a set of points, a clustering algorithm might form two distinct clusters as follows.)

Bad Clustering
This clustering violates both the Homogeneity and Separation principles:
- small distances between points in separate clusters
- large distances between points in the same cluster

Good Clustering
This clustering satisfies both the Homogeneity and Separation principles.

Clustering Methods
- Similarity-based (need a similarity function)
  - Construct a partition
  - Agglomerative, bottom-up
  - Typically "hard" clustering
- Model-based (latent models, probabilistic or algebraic)
  - First compute the model
  - Clusters are obtained easily once the model is available
  - Typically "soft" clustering

Similarity-based Clustering
- Define a similarity function to measure the similarity between two objects
- Common clustering criterion: find a partition that
  - maximizes intra-cluster similarity
  - minimizes inter-cluster similarity
  - Challenge: how to combine the two? (many possibilities)
- How to construct clusters?
  - Agglomerative hierarchical clustering (greedy)
  - Needs a cutoff (either the number of clusters or a score threshold)

Method 1 (Similarity-based): Agglomerative Hierarchical Clustering

Similarity Measures
- Given two gene expression vectors, X = (x1, x2, ..., xn) and Y = (y1, y2, ..., yn), define a similarity function s(X,Y) to measure how likely the two genes are to be co-regulated/co-expressed
- However, "co-expressed" is hard to quantify; biology knowledge would help
- In practice, many choices:
  - Euclidean distance
  - Pearson correlation coefficient
- Better measures are possible (e.g., focusing on a subset of values...)
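The two standard measures named above can be written down directly. A minimal sketch in Python with NumPy (the example vectors g1 and g2 are made up for illustration):

```python
import numpy as np

def euclidean_distance(x, y):
    """Smaller distance = more similar expression profiles."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    return float(np.sqrt(np.sum((x - y) ** 2)))

def pearson_correlation(x, y):
    """+1 = perfectly correlated profiles, -1 = anti-correlated.

    Unlike Euclidean distance, Pearson correlation ignores the
    absolute scale of expression and compares only the shape
    of the two profiles.
    """
    x, y = np.asarray(x, float), np.asarray(y, float)
    xc, yc = x - x.mean(), y - y.mean()
    return float(np.dot(xc, yc) / (np.linalg.norm(xc) * np.linalg.norm(yc)))

# Two genes whose profiles have the same shape but different scale:
g1 = [1.0, 2.0, 3.0]
g2 = [2.0, 4.0, 6.0]
print(euclidean_distance(g1, g2))   # ~3.74: looks dissimilar by distance
print(pearson_correlation(g1, g2))  # 1.0: perfectly co-expressed by shape
```

The example illustrates why the choice of measure matters: scaled-up copies of the same profile look far apart under Euclidean distance but identical under Pearson correlation.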

Agglomerative Hierarchical Clustering
- Given a similarity function to measure the similarity between two objects
- Gradually group similar objects together in a bottom-up fashion
- Stop when some stopping criterion is met
- Variations: different ways to compute group similarity from the individual object similarities

Hierarchical Clustering

Hierarchical Clustering Algorithm

HierarchicalClustering(d, n)
1.  Form n clusters, each with one element
2.  Construct a graph T by assigning one vertex to each cluster
3.  while there is more than one cluster
4.      Find the two closest clusters C1 and C2
5.      Merge C1 and C2 into a new cluster C with |C1| + |C2| elements
6.      Compute the distance from C to all other clusters
7.      Add a new vertex C to T and connect it to vertices C1 and C2
8.      Remove the rows and columns of d corresponding to C1 and C2
9.      Add a row and column to d corresponding to the new cluster C
10. return T

The algorithm takes an n x n matrix d of pairwise distances between points as input.

Hierarchical Clustering Algorithm

HierarchicalClustering(d, n)
1.  Form n clusters, each with one element
2.  Construct a graph T by assigning one vertex to each cluster
3.  while there is more than one cluster
4.      Find the two closest clusters C1 and C2
5.      Merge C1 and C2 into a new cluster C with |C1| + |C2| elements
6.      Compute the distance from C to all other clusters
7.      Add a new vertex C to T and connect it to vertices C1 and C2
8.      Remove the rows and columns of d corresponding to C1 and C2
9.      Add a row and column to d corresponding to the new cluster C
10. return T

Different ways to define the distance between clusters lead to different clustering methods...
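The pseudocode above can be sketched in plain Python. This is a didactic O(n^3) version operating directly on a pairwise distance matrix; the function name, the merge-list return value, and the `linkage` parameter are choices made here, not part of the lecture:

```python
def hierarchical_clustering(d, linkage="single"):
    """Agglomerative clustering over an n x n symmetric distance matrix d.

    Returns the sequence of merges as (cluster_a, cluster_b) tuples,
    where each cluster is a frozenset of original point indices.
    """
    n = len(d)
    clusters = [frozenset([i]) for i in range(n)]
    merges = []

    def cluster_distance(a, b):
        pair_dists = [d[i][j] for i in a for j in b]
        if linkage == "single":        # closest pair
            return min(pair_dists)
        if linkage == "complete":      # farthest pair
            return max(pair_dists)
        return sum(pair_dists) / len(pair_dists)  # average link

    while len(clusters) > 1:
        # Find the two closest clusters C1 and C2
        c1, c2 = min(
            ((a, b) for i, a in enumerate(clusters) for b in clusters[i + 1:]),
            key=lambda ab: cluster_distance(*ab),
        )
        merges.append((c1, c2))
        # Merge C1 and C2 into the new cluster C
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 | c2]
    return merges

# Three 1-D points at positions 0, 1, and 10:
d = [[0.0, 1.0, 10.0],
     [1.0, 0.0,  9.0],
     [10.0, 9.0, 0.0]]
print(hierarchical_clustering(d))  # merges {0} with {1} first, then with {2}
```

Instead of maintaining the graph T explicitly, the merge list records the same tree: each tuple corresponds to one internal vertex of T.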

How to Compute Group Similarity?
Three popular methods. Given two groups g1 and g2:
- Single-link algorithm: s(g1, g2) = similarity of the closest pair
- Complete-link algorithm: s(g1, g2) = similarity of the farthest pair
- Average-link algorithm: s(g1, g2) = average similarity over all pairs

Three Methods Illustrated
(Figure: two groups g1 and g2, showing which pairs of points the single-link, complete-link, and average-link algorithms compare.)

Group Similarity Formulas
- Given C (the new cluster) and C* (any other cluster in the current results)
- Given a similarity function s(x,y) (or, equivalently, a distance function d(x,y), as in the book)
- Single link: s(C, C*) = max over x in C, y in C* of s(x, y)
- Complete link: s(C, C*) = min over x in C, y in C* of s(x, y)
- Average link: s(C, C*) = ( sum over x in C, y in C* of s(x, y) ) / ( |C| * |C*| )
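A tiny worked example of the three formulas, phrased as distances between two hypothetical 1-D clusters (the points and the distance d(x, y) = |x - y| are made up for illustration):

```python
# Two hypothetical clusters of 1-D points, with d(x, y) = |x - y|:
g1 = [1.0, 2.0]
g2 = [5.0, 9.0]
pair_dists = [abs(x - y) for x in g1 for y in g2]  # [4, 8, 3, 7]

d_single   = min(pair_dists)                    # closest pair:  |2 - 5| = 3
d_complete = max(pair_dists)                    # farthest pair: |1 - 9| = 8
d_average  = sum(pair_dists) / len(pair_dists)  # mean of all 4 pairs = 5.5
print(d_single, d_complete, d_average)          # 3.0 8.0 5.5
```

Note the direction flip when working with distances rather than similarities: single link takes the minimum distance (the maximum similarity), and complete link the maximum distance.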

Comparison of the Three Methods
- Single-link
  - "Loose" clusters
  - Individual decision, sensitive to outliers
- Complete-link
  - "Tight" clusters
  - Individual decision, sensitive to outliers
- Average-link
  - "In between"
  - Group decision, insensitive to outliers
- Which one is the best? It depends on what you need!

Method 2 (Model-based): K-Means

A Model-based View of K-Means
- Each cluster is modeled by a centroid point
- A partition of the data V = {v1, ..., vn} into k clusters is modeled by k centroid points X = {x1, ..., xk}; note that a model X uniquely determines the clustering results
- Define an objective function f(X,V) on X and V to measure how good/bad V is as the clustering result of X
- Clustering is reduced to the problem of finding a model X* that optimizes f(X,V)
- Once X* is found, the clustering results are simply computed according to the model X*
- Major tasks: (1) define d(x,v); (2) define f(X,V); (3) efficiently search for X*

Squared Error Distortion
- Given a set of n data points V = {v1, ..., vn} and a set of k points X = {x1, ..., xk}
- Define d(vi, X) = the Euclidean distance from vi to the closest point in X
- Define f(X,V) = d(V, X) = ( d(v1, X)^2 + ... + d(vn, X)^2 ) / n, the squared error distortion
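The distortion above is straightforward to compute. A minimal sketch in Python with NumPy, using the mean-of-squared-distances convention from the definition above (the function name and example points are chosen here for illustration):

```python
import numpy as np

def squared_error_distortion(X, V):
    """d(V, X): mean over data points v of the squared Euclidean
    distance from v to the *closest* center in X."""
    X, V = np.asarray(X, float), np.asarray(V, float)
    # sq[i, j] = squared Euclidean distance from V[i] to X[j]
    sq = ((V[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    # For each point, keep only the distance to its nearest center.
    return float(sq.min(axis=1).mean())

V = [[0.0, 0.0], [0.0, 2.0], [10.0, 0.0]]   # three data points
X = [[0.0, 1.0], [10.0, 0.0]]               # two candidate centers
print(squared_error_distortion(X, V))       # (1 + 1 + 0) / 3
```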

K-Means Clustering Problem: Formulation
- Input: a set V of n data points and a parameter k
- Output: a set X of k points (cluster centers) that minimizes the squared error distortion d(V,X) over all possible choices of X
- This problem is NP-complete.

K-Means Clustering
- Given a distance function
- Start with k randomly selected data points; assume they are the centroids of k clusters
- Assign every data point to the cluster whose centroid is closest to it
- Recompute the centroid of each cluster
- Repeat this process until the distance-based objective function converges

K-Means Clustering: Lloyd Algorithm

LloydAlgorithm(V, k)
1. Arbitrarily assign the k cluster centers
2. while the cluster centers keep changing
3.     Assign each data point to the cluster Ci corresponding to the closest cluster representative (i.e., the center xi), 1 <= i <= k
4.     After all data points are assigned, compute new cluster representatives: the new representative of cluster Ci is its center of gravity, xi = ( sum over v in Ci of v ) / |Ci|

Note: this may lead to a merely locally optimal clustering.
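The Lloyd algorithm above can be sketched compactly in Python with NumPy. The function name, the seeded random initialization, and the iteration cap are choices made here, not part of the lecture; because the result may be only locally optimal, in practice it is run with several random restarts:

```python
import numpy as np

def lloyd_kmeans(V, k, max_iter=100, seed=0):
    """One run of Lloyd's algorithm; may converge to a local optimum."""
    rng = np.random.default_rng(seed)
    V = np.asarray(V, float)
    # Arbitrarily choose k distinct data points as the initial centers.
    centers = V[rng.choice(len(V), size=k, replace=False)]
    for _ in range(max_iter):
        # Assign each data point to its closest cluster center.
        sq = ((V[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = sq.argmin(axis=1)
        # Recompute each center as the center of gravity of its cluster
        # (keeping the old center if a cluster happens to be empty).
        new_centers = np.array([
            V[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break  # the cluster centers stopped changing
        centers = new_centers
    return centers, labels

# Two well-separated groups of points:
V = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
centers, labels = lloyd_kmeans(V, k=2)
print(labels)  # the first two points share one label, the last two the other
```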

(Figures: four snapshots of the Lloyd algorithm iterating on a 2-D data set, showing the three cluster centers x1, x2, x3 moving as the point assignments are updated.)

Evaluation of Clustering Results
- How many clusters?
  - Score threshold in hierarchical clustering (needs some confidence in the scores)
  - Try several numbers of clusters and see which works best ("best" can mean fitting the data better, or recovering some known clusters)
- Quality of a single cluster
  - A tight cluster is better (but "tightness" needs to be quantified)
  - Prior knowledge about clusters
- Quality of a set of clusters
  - Quality of the individual clusters
  - How well the different clusters are separated
- In general, evaluating clusterings and choosing the right clustering algorithm are difficult
- The good news: significant clusters tend to show up regardless of which clustering algorithm is used

What You Should Know
- Why clustering microarray data is interesting
- How hierarchical agglomerative clustering works
- The three different ways of computing the group similarity/distance
- How k-means works