Centre for Integrative Bioinformatics VU. Lecture 4: Clustering Algorithms. Bioinformatics Data Analysis and Tools

The challenge: Biologists are estimated to produce an amount of data each year equivalent to roughly 35 billion CD-ROMs. How do we learn something from this data? First step: find patterns/structure in the data → use cluster analysis.

Definition: Clustering is the process of grouping several objects into a number of groups, or clusters. Goal: Objects in the same cluster are more similar to one another than they are to objects in other clusters. Cluster analysis

Simple example (table): fat (%) and protein (%) content of milk for 15 mammal species: horse, donkey, mule, camel, llama, zebra, sheep, buffalo, fox, pig, rabbit, rat, deer, reindeer, whale. Data taken from: W.S. Spector, Ed., Handbook of Biological Data, WB Saunders Company, London, 1956.

Simple example (figure): scatter plot of the mammalian milk data, fat (%) vs. proteins (%), with the resulting clustering indicated.

Clustering vs. Classification. Clustering (unsupervised learning): we have objects, but we do not know to which class (group/cluster) they belong. The task here is to find a ‘natural’ grouping of the objects. Classification (supervised learning): we have labeled objects (we know to which class they belong). The task here is to build a classifier based on the labeled objects; a model which is able to predict the class of a new object. This can be done with Support Vector Machines (SVM), neural networks, etc.

Back to cluster analysis. The six steps of cluster analysis:
1. Noise/outlier filtering
2. Choose distance measure
3. Choose clustering criterion
4. Choose clustering algorithm
5. Validation of the results
6. Interpretation of the results

Step 1 – Outlier filtering. Outliers are observations that are distant from the rest of the data, due to: human errors, measurement errors (e.g. a defective thermometer), or the fact that it is really a special (rare) case. Outliers can negatively influence your data analysis, so it is better to filter them out. A lot of filtering methods exist, but we won’t go into them here.
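Not part of the original slides, but a minimal Python sketch of one simple filtering approach: flag observations whose robust z-score (based on the median and the median absolute deviation) exceeds a threshold. The threshold of 3 and the helper name filter_outliers are illustrative choices.

```python
import numpy as np

def filter_outliers(X, z_threshold=3.0):
    """Drop rows whose robust z-score (based on the median and the MAD)
    exceeds z_threshold in any column."""
    med = np.median(X, axis=0)
    mad = np.median(np.abs(X - med), axis=0)
    z = np.abs(X - med) / (1.4826 * mad)  # 1.4826 makes the MAD comparable to a standard deviation
    keep = (z < z_threshold).all(axis=1)
    return X[keep]

# Toy example: the fourth observation is clearly deviant and gets removed.
data = np.array([[4.5, 3.2], [4.7, 3.1], [4.4, 3.3], [40.0, 3.2]])
print(filter_outliers(data))
```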

Step 2 – Choose a distance measure. The distance measure defines the dissimilarity between the objects that we are clustering. Examples: what might be a good distance measure for comparing:
– Body sizes of persons? d = | length_person_1 – length_person_2 |
– Protein sequences? d = number of positions with a different amino acid, or the difference in molecular weight, or …
– Coordinates in n-dimensional space? d = Euclidean distance: d(x, y) = √( Σᵢ (xᵢ − yᵢ)² )
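A small Python sketch (helper names are illustrative, not from the slides) that makes the three example distances concrete:

```python
import numpy as np

def body_size_distance(length_1, length_2):
    # Absolute difference in body length between two persons.
    return abs(length_1 - length_2)

def sequence_distance(seq_1, seq_2):
    # Number of positions with a different amino acid (simple Hamming distance;
    # assumes two aligned sequences of equal length).
    return sum(a != b for a, b in zip(seq_1, seq_2))

def euclidean_distance(x, y):
    # d(x, y) = sqrt( sum_i (x_i - y_i)^2 ) for points in n-dimensional space.
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return float(np.sqrt(np.sum((x - y) ** 2)))

print(body_size_distance(182, 169))              # 13 (cm)
print(sequence_distance("MKVLA", "MKILA"))       # 1 differing position
print(euclidean_distance([1, 2, 3], [4, 6, 3]))  # 5.0
```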

We have just defined the distances between the objects we want to cluster. But how do we decide whether two objects should be in the same cluster? The clustering criterion defines how the objects should be clustered. Possible clustering criteria: a cost function to be minimized (this is optimization-based clustering), or a (set of) rule(s) (this is hierarchical clustering). Step 3 – Choose clustering criterion

Step 4 – Choose clustering algorithm. Choose an algorithm which performs the actual clustering, based on the chosen distance measure and clustering criterion.

Step 5 – Validation of the results. Look at the resulting clusters! Are the objects indeed grouped together as expected? Experts mostly know what to expect. For example, if a membrane protein is assigned to a cluster containing mostly DNA-binding proteins, there might be something wrong. (or not!)
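The slides validate clusters by expert inspection; as a possible quantitative complement (not in the original slides), an internal measure such as the silhouette score can be computed, e.g. with scikit-learn:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Two well-separated synthetic groups in 2-D.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.3, (30, 2)), rng.normal(3.0, 0.3, (30, 2))])

# Higher silhouette scores (closer to 1) indicate better-separated clusters;
# here k = 2 should score best.
for k in (2, 3, 4):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, round(float(silhouette_score(X, labels)), 2))
```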

Important to know: the results depend entirely on the choices made. Different distance measures, clustering criteria and clustering algorithms lead to different results! For example, try hierarchical clustering with different cluster criteria and compare the outcomes.
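As a small illustration (toy data, not from the slides): cutting the same points into two clusters under single and complete linkage gives different partitions:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Six points on a line with steadily growing gaps between neighbours.
X = np.array([[0.0], [1.0], [2.1], [3.3], [4.6], [6.0]])

for method in ("single", "complete"):
    labels = fcluster(linkage(X, method=method), t=2, criterion="maxclust")
    print(method, labels)  # the two criteria split the data differently
```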

Some popular clustering algorithms are: K-means Hierarchical clustering Self organizing maps (SOM) Fuzzy clustering Popular clustering algorithms

K-means. Steps:
1. Place K points into the space represented by the objects that are being clustered. These points represent the initial group centroids (centroid = cluster center = average of all objects in that cluster).
2. Assign each object to the group that has the closest centroid.
3. When all objects have been assigned, recalculate the positions of the K centroids.
4. Repeat steps 2 and 3 until the centroids no longer move. This produces a separation of the objects into groups from which the metric to be minimized can be calculated.
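A minimal NumPy sketch of these four steps (illustrative only; the initial centroids are chosen as K random objects, and empty clusters are not handled):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain K-means following the four steps above (no handling of empty clusters)."""
    rng = np.random.default_rng(seed)
    # Step 1: K initial centroids, here simply K randomly chosen objects.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 2: assign each object to the closest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recalculate the position of each centroid.
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 4: stop when the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Toy usage: two obvious groups of two points each.
X = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.1], [4.8, 5.3]])
labels, centroids = kmeans(X, k=2)
print(labels)  # e.g. [0 0 1 1] (label numbering depends on the initialization)
```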

K-means.
Advantages: fast; easy to interpret.
But: K-means has to be told how many groups (K) to find; it is easily affected by outliers; no measure is provided of how well a data point fits in a given cluster; and there is no guarantee of finding the global optimum: the quality of the outcome depends on the initialization. You should therefore run the algorithm several times with different initializations and choose the best clustering (according to some quality measure).
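If scikit-learn is available, its KMeans implementation follows this advice via the n_init parameter, keeping the restart with the lowest within-cluster sum of squares (inertia):

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.1], [4.8, 5.3]])

# n_init=10: run K-means 10 times from different random initializations and
# keep the run with the lowest within-cluster sum of squares (inertia).
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_, km.inertia_)
```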

Hierarchical clustering (figures): dendrogram and Venn diagram views of the same clustered data.

Hierarchical clustering (HC). Agglomerative HC: starts with n clusters, one for each object; at every step, two clusters are merged into one. Divisive HC: starts with one cluster containing all objects; at every step, a cluster is split into two. Different linkage types: single linkage, complete linkage, average linkage, average group linkage.

Hierarchical clustering. Steps in agglomerative clustering:
1. Initialization: each object forms one cluster.
2. The two most similar clusters are grouped together to form a new cluster.
3. Calculate the similarity between the new cluster and all remaining clusters.
4. Repeat steps 2 and 3 until all objects are in a single cluster.
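These steps are essentially what SciPy's hierarchical-clustering routines carry out; a short sketch on toy 2-D data (illustrative values):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy 2-D data: two tight pairs and one isolated point.
X = np.array([[1.0, 1.0], [1.1, 0.9], [5.0, 5.0], [5.2, 4.8], [9.0, 1.0]])

# Agglomerative clustering: every object starts as its own cluster and the two
# most similar clusters are merged at each step (average linkage, Euclidean distance).
Z = linkage(X, method="average", metric="euclidean")

# Cut the resulting tree into 3 clusters.
print(fcluster(Z, t=3, criterion="maxclust"))
# scipy.cluster.hierarchy.dendrogram(Z) would draw the merge tree (requires matplotlib).
```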

Hierarchical clustering. The distance between two objects is clear: it is defined by the distance measure of your choice. But what about the distance between two clusters (C1, C2, C3: which pair of clusters should be merged)? (From: Introduction to Hierarchical Clustering Analysis, Pengyu Hong)

Hierarchical clustering: Single linkage (= nearest neighbor). Distance between two clusters (C1, C2) = minimum distance between the members of the two clusters. But: tends to generate “long chains”. (From: Introduction to Hierarchical Clustering Analysis, Pengyu Hong)

Hierarchical clustering: Complete linkage. Distance between two clusters (C1, C2) = maximum distance between the members of the two clusters. Generates compact clusters (“clumps”). (From: Introduction to Hierarchical Clustering Analysis, Pengyu Hong)

Hierarchical clustering: Average linkage (≈ UPGMA). Distance between two clusters (C1, C2) = mean distance between all pairs of objects of the two clusters. Advantage: robust to outliers. (From: Introduction to Hierarchical Clustering Analysis, Pengyu Hong)

Hierarchical clustering: Average group linkage. Distance between two clusters (C1, C2) = distance between the means of the two clusters. (From: Introduction to Hierarchical Clustering Analysis, Pengyu Hong)
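The four linkage criteria above can be written out directly; a small NumPy sketch on two toy clusters (values are illustrative):

```python
import numpy as np

def pairwise_distances(C1, C2):
    # All pairwise Euclidean distances between members of C1 and members of C2.
    return np.linalg.norm(C1[:, None, :] - C2[None, :, :], axis=2)

C1 = np.array([[0.0, 0.0], [1.0, 0.0]])  # toy cluster 1
C2 = np.array([[4.0, 0.0], [5.0, 1.0]])  # toy cluster 2
d = pairwise_distances(C1, C2)

print("single linkage        :", d.min())   # minimum member-to-member distance
print("complete linkage      :", d.max())   # maximum member-to-member distance
print("average linkage       :", d.mean())  # mean over all member pairs
print("average group linkage :", np.linalg.norm(C1.mean(axis=0) - C2.mean(axis=0)))  # distance between cluster means
```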

Hierarchical clustering. Example: the Nearest Neighbor algorithm. The Nearest Neighbor algorithm is an agglomerative (bottom-up) approach: it starts with n nodes (n is the size of our sample), merges the 2 most similar nodes at each step, and stops when the desired number of clusters is reached.

Nearest Neighbor example: at level 2 there are k = 7 clusters; each subsequent level merges one more pair, giving k = 6, 5, 4, 3 and 2 clusters at levels 3–7, and a single cluster (k = 1) at level 8 (figures omitted).

Clustering in Bioinformatics:
– Microarray data quality checking: do replicates cluster together? Do similar conditions, time points and tissue types cluster together?
– Cluster genes → prediction of functions of unknown genes from known ones.
– Cluster samples → discover clinical characteristics (e.g. survival, marker status) shared by samples.
– Promoter analysis of commonly regulated genes.

Functionally significant gene clusters (figure): two-way clustering of the expression matrix yields both gene clusters and sample clusters.
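Two-way clustering amounts to clustering both the rows (genes) and the columns (samples) of the expression matrix; a sketch on a hypothetical genes × samples array (random data, purely illustrative):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical expression matrix: rows = genes, columns = samples.
rng = np.random.default_rng(1)
expr = rng.normal(size=(100, 12))

# Cluster the genes (rows) and the samples (columns) separately.
gene_clusters = fcluster(linkage(expr, method="average"), t=5, criterion="maxclust")
sample_clusters = fcluster(linkage(expr.T, method="average"), t=3, criterion="maxclust")

print(np.bincount(gene_clusters))    # number of genes per gene cluster
print(np.bincount(sample_clusters))  # number of samples per sample cluster
```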

Bhattacharjee et al. (2001) Human lung carcinomas mRNA expression profiling reveals distinct adenocarcinoma subclasses. Proc. Natl. Acad. Sci. USA, Vol. 98.

Fuzzy clustering. Hard clustering: each object is assigned to exactly one cluster. Fuzzy clustering: each object is assigned to all clusters, but to some extent. The task of the clustering algorithm in this case is to assign a membership degree u_{i,j} of each object j to each cluster i, with the memberships of an object summing to one over the C clusters: Σ_{i=1..C} u_{i,j} = 1. Example for object 1 and C = 3 clusters: u_{1,1} = 0.6, u_{2,1} = 0.3, u_{3,1} = 0.1.
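A minimal sketch of how such membership degrees can be computed, using the standard fuzzy c-means membership formula (the fuzzifier m = 2 and the toy data are illustrative choices, not from the slides):

```python
import numpy as np

def memberships(X, centroids, m=2.0):
    """Fuzzy c-means membership degrees u[i, j] of object j in cluster i.
    Each column sums to 1, i.e. sum over clusters i of u[i, j] = 1."""
    # d[i, j] = distance from centroid i to object j.
    d = np.linalg.norm(centroids[:, None, :] - X[None, :, :], axis=2)
    d = np.fmax(d, 1e-12)  # avoid division by zero when an object sits on a centroid
    ratio = (d[:, None, :] / d[None, :, :]) ** (2.0 / (m - 1.0))
    return 1.0 / ratio.sum(axis=1)

# Toy data: three objects, three cluster centres.
X = np.array([[0.0, 0.0], [1.0, 0.0], [4.0, 4.0]])
C = np.array([[0.5, 0.0], [4.0, 4.0], [10.0, 10.0]])
u = memberships(X, C)
print(np.round(u, 2))  # membership degrees, one column per object
print(u.sum(axis=0))   # each column sums to 1
```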

Fuzzy clustering of gene expression data. Experimental observation: genes that are expressed in a similar way (co-expressed) are under common regulatory control, and their products (proteins) share a biological function¹. So it would be interesting to know which genes are expressed in a similar way → clustering! ¹ Eisen MB, Spellman PT, Brown PO and Botstein D, Cluster analysis and display of genome-wide expression patterns. Proc. Natl. Acad. Sci. USA, 1998, 95:14863–

Fuzzy clustering of gene expression data. Many proteins have multiple roles in the cell, and act with distinct sets of cooperating proteins to fulfill each role. Their genes are therefore co-expressed with more than one group of genes. This means that one gene might belong to different groups/clusters → we need ‘fuzzy’ clustering.