Centre for Integrative Bioinformatics VU. Lecture 9: Clustering Algorithms. Bioinformatics Data Analysis and Tools.


Part A: Cluster analysis in general

The challenge
Biologists are estimated to produce […] bytes of data each year (roughly 16 billion CD-ROMs). How do we learn something from this data? First step: find patterns/structure in the data. → Use cluster analysis.

Cluster analysis
Definition: Clustering is the process of grouping several objects into a number of groups, or clusters.
Goal: Objects in the same cluster are more similar to one another than they are to objects in other clusters.

Simple example
Table columns: Species | Fat (%) | Proteins (%)
Species listed: Horse, Donkey, Mule, Camel, Llama, Zebra, Sheep, Buffalo, Fox, Pig, Rabbit, Rat, Deer, Reindeer, Whale
Data taken from: W.S. Spector, Ed., Handbook of Biological Data, WB Saunders Company, London, 1956

Simple example
[Figure: composition of mammalian milk, plotted on the axes Fat (%) and Proteins (%), with the species grouped by clustering]

Clustering vs. Classification
Clustering (unsupervised learning): We have objects, but we do not know to which class (group/cluster) they belong. The task is to find a ‘natural’ grouping of the objects.
Classification (supervised learning): We have labeled objects (we know to which class they belong). The task is to build a classifier from the labeled objects: a model that is able to predict the class of a new object. This can be done with Support Vector Machines (SVM), neural networks, etc.

Example of classification
Let’s say we have gene expression data of 5000 genes, for the healthy case and for 5 types of cancer. So for each gene expression measurement, we know to which class (healthy, cancer type 1, cancer type 2, etc.) it belongs. Based on this data, we can build a classifier by training (learning). Given the gene expression data of a person who might have cancer, we can then use the classifier to predict to which class this person belongs (healthy, cancer type 1, cancer type 2, etc.).
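A minimal sketch of this train-then-predict idea. It uses a nearest-centroid classifier as a simple stand-in (the slides mention SVMs and neural networks; nearest-centroid is an assumption chosen for brevity), and the expression values are made-up toy numbers, not real data.

```python
import math

def centroid(rows):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(column) / n for column in zip(*rows)]

def train(labelled):
    """labelled maps a class label to a list of expression vectors.
    The 'model' is simply one centroid per class."""
    return {label: centroid(rows) for label, rows in labelled.items()}

def predict(model, x):
    """Return the label whose centroid is nearest to x (Euclidean distance)."""
    return min(model, key=lambda label: math.dist(model[label], x))

# Toy expression levels for 3 genes (illustrative values only).
training = {
    "healthy": [[1.0, 1.2, 0.9], [1.1, 1.0, 1.0]],
    "cancer type 1": [[3.0, 0.2, 1.1], [2.8, 0.3, 0.9]],
}
model = train(training)
prediction = predict(model, [2.9, 0.1, 1.0])  # a new, unlabeled person
```

The labels are needed only in `train`; `predict` sees an unlabeled vector, which is what distinguishes classification from clustering.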

Back to cluster analysis
The six steps of cluster analysis:
1. Noise/outlier filtering
2. Choose distance measure
3. Choose clustering criterion
4. Choose clustering algorithm
5. Validation of the results
6. Interpretation of the results

Step 1 – Noise/outlier filtering
Noise and outliers are ‘bad measurements’ in data, caused by:
– Human errors
– Measurement errors (e.g. a defective thermometer)
Noise and outliers can negatively influence your data analysis, so it is better to filter them out. But be careful! Outliers could be correct but rare data. Many filtering methods exist, but we won’t go into them here.

Step 2 – Choose a distance measure
The distance measure defines the dissimilarity between the objects that we are clustering. Examples: what might be a good distance measure for comparing:
– Body sizes of persons? d = | length_person_1 - length_person_2 |
– Protein sequences? d = number of positions with different amino acids, or difference in molecular weight, or…
– Coordinates in n-dimensional space? d = Euclidean distance: $d(x, y) = \sqrt{\sum_{k=1}^{n} (x_k - y_k)^2}$
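The three example measures can be sketched as small functions. The function names are illustrative, and the sequence distance assumes the two sequences are already aligned to the same length.

```python
import math

def body_length_distance(length_1, length_2):
    """Absolute difference between two body lengths (same unit)."""
    return abs(length_1 - length_2)

def sequence_distance(seq_a, seq_b):
    """Number of positions with a different amino acid.
    Assumes both sequences are aligned to the same length."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    return sum(1 for a, b in zip(seq_a, seq_b) if a != b)

def euclidean_distance(x, y):
    """Euclidean distance between two points in n-dimensional space."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
```

For instance, `sequence_distance("MKTAY", "MKSAY")` counts one differing position, and `euclidean_distance([0, 0], [3, 4])` is 5.0.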

Step 3 – Choose clustering criterion
We just defined the distances between the objects we want to cluster. But how do we decide whether two objects should be in the same cluster? The clustering criterion defines how the objects should be clustered. Possible clustering criteria:
– A cost function to be minimized. This is optimization-based clustering.
– A (set of) rule(s). This is hierarchical clustering.

Step 3 – Choose clustering criterion
Different types of hierarchical clustering, based on the different clustering criteria:
– Single linkage / Nearest neighbour
– Complete linkage / Furthest neighbour
– Group averaging / UPGMA
(see also the lecture on machine learning)
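The three criteria differ only in how the distance between two clusters is derived from the pairwise distances between their members. A small sketch, assuming Euclidean distance between points (the function names are illustrative):

```python
import math
from itertools import product

def pair_distances(cluster_a, cluster_b):
    """All distances between a member of cluster_a and a member of cluster_b."""
    return [math.dist(p, q) for p, q in product(cluster_a, cluster_b)]

def single_linkage(cluster_a, cluster_b):
    """Nearest neighbour: the smallest pairwise distance."""
    return min(pair_distances(cluster_a, cluster_b))

def complete_linkage(cluster_a, cluster_b):
    """Furthest neighbour: the largest pairwise distance."""
    return max(pair_distances(cluster_a, cluster_b))

def average_linkage(cluster_a, cluster_b):
    """Group averaging (UPGMA): the mean pairwise distance."""
    distances = pair_distances(cluster_a, cluster_b)
    return sum(distances) / len(distances)

a = [(0.0, 0.0), (1.0, 0.0)]
b = [(3.0, 0.0), (5.0, 0.0)]
```

For the clusters `a` and `b` above, the pairwise distances are 3, 5, 2 and 4, so the three criteria give 2.0, 5.0 and 3.5 respectively.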

Step 4 – Choose clustering algorithm
Choose an algorithm that performs the clustering, based on the chosen distance measure and clustering criterion.

Step 5 – Validation of the results
Look at the resulting clusters! Are the objects indeed grouped together as expected? Experts mostly know what to expect. For example, if a membrane protein is assigned to a cluster containing mostly DNA-binding proteins, there might be something wrong (or not!).

Important to know!
The results depend totally on the choices made. Other distance measures, clustering criteria and clustering algorithms lead to different results! For example, try hierarchical clustering with different clustering criteria.
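To illustrate the point, here is a naive agglomerative clustering sketch run twice on the same one-dimensional toy data (made-up values), once with single linkage and once with complete linkage. The implementation is a simplified assumption for illustration, not an optimized algorithm.

```python
from itertools import combinations

def agglomerate(points, linkage, n_clusters):
    """Repeatedly merge the two closest clusters until n_clusters remain.
    `linkage` reduces the pairwise distances between two clusters to one
    number: min gives single linkage, max gives complete linkage."""
    clusters = [[p] for p in points]
    while len(clusters) > n_clusters:
        def cluster_distance(pair):
            a, b = pair
            return linkage(abs(p - q) for p in clusters[a] for q in clusters[b])
        # combinations yields i < j, so deleting index j keeps index i valid
        i, j = min(combinations(range(len(clusters)), 2), key=cluster_distance)
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [sorted(c) for c in clusters]

# A 'chain' of points with slowly growing gaps (toy data).
points = [0.0, 2.0, 4.1, 6.3, 8.6, 11.6]
single = agglomerate(points, min, 2)
complete = agglomerate(points, max, 2)
```

Single linkage follows the chain and isolates only the last point, giving `[[0.0, 2.0, 4.1, 6.3, 8.6], [11.6]]`, whereas complete linkage prefers compact groups and gives `[[0.0, 2.0, 4.1, 6.3], [8.6, 11.6]]`. Same data, same distance measure, different criterion, different clusters.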

Part B: Cluster analysis of gene expression data with dynamic evolutionary fuzzy clustering

Gene expression data
Microarrays are commonly used for measuring gene expression.

Gene expression data
From the colors of the microarray, the expression levels of the different genes can be deduced. In the figure, each line represents the expression level of one gene during some time period.

The task
Experimental observation: Genes that are expressed in a similar way (co-expressed) are under common regulatory control, and their products (proteins) share a biological function [1]. So it would be interesting to know which genes are expressed in a similar way. → Clustering!
[1] Eisen MB, Spellman PT, Brown PO, Botstein D. Cluster analysis and display of genome-wide expression patterns. Proc. Natl. Acad. Sci. USA, 1998, 95:14863–

The challenge
Many proteins have multiple roles in the cell, and act with distinct sets of cooperating proteins to fulfill each role. Their genes are therefore co-expressed with different groups of genes. This means that one gene might belong to several groups/clusters. → We need ‘fuzzy’ clustering.

Fuzzy clustering
Hard clustering: Each object is assigned to exactly one cluster.
Fuzzy clustering: Each object is assigned to all clusters, but to some extent. The task of the clustering algorithm in this case is to assign a membership degree $u_{i,j}$ for each object j to each cluster i, such that the memberships of an object over the C clusters sum to one: $\sum_{i=1}^{C} u_{i,j} = 1$.
Example for object 1 with three clusters: $u_{1,1} = 0.6$, $u_{2,1} = 0.3$, $u_{3,1} = 0.1$.
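One common way to obtain such membership degrees is the standard fuzzy c-means rule, which turns an object's distances to the cluster centers into weights that sum to one. The sketch below assumes that rule and a fuzzifier `m = 2`; the lecture's own update rule may differ.

```python
def memberships(distances, m=2.0):
    """Membership degrees of one object in each cluster, computed from its
    distances to the cluster centers (standard fuzzy c-means rule, m > 1)."""
    # An object lying exactly on one or more centers belongs fully to them.
    on_center = [d == 0 for d in distances]
    if any(on_center):
        return [1.0 / sum(on_center) if hit else 0.0 for hit in on_center]
    exponent = 2.0 / (m - 1.0)
    inverse = [d ** -exponent for d in distances]
    total = sum(inverse)
    # Normalizing guarantees that the memberships sum to 1.
    return [w / total for w in inverse]

u = memberships([1.0, 2.0, 4.0])  # distances to 3 cluster centers
```

The closest center receives the largest membership, and the degrees always add up to one, which is exactly the constraint stated above.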

Choose distance function
Remember our data: Which genes are closer, based on their expression? Green and red, or green and blue?
[Figure: gene expression level against time for three genes, drawn in green, red and blue]
We want to cluster genes that show co-expression, so the shapes of the lines should be similar. In this case, it is not important whether the lines are near each other. So green and blue are closer in the example.

Choose distance function
(1 - Pearson correlation) is a good measure for the distance between the shapes of two lines $v_i$ and $x_j$:
$d(v_i, x_j) = 1 - \frac{\sum_t (v_{i,t} - \bar{v}_i)(x_{j,t} - \bar{x}_j)}{\sqrt{\sum_t (v_{i,t} - \bar{v}_i)^2} \, \sqrt{\sum_t (x_{j,t} - \bar{x}_j)^2}}$
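A short sketch of this distance. The three profiles are made-up values chosen to mimic the green, blue and red lines of the earlier example: blue has the same shape as green but sits at a higher level, while red is close in level but has the opposite shape.

```python
import math

def pearson_distance(v, x):
    """1 - Pearson correlation between two equally long profiles."""
    n = len(v)
    mean_v = sum(v) / n
    mean_x = sum(x) / n
    covariance = sum((a - mean_v) * (b - mean_x) for a, b in zip(v, x))
    norm_v = math.sqrt(sum((a - mean_v) ** 2 for a in v))
    norm_x = math.sqrt(sum((b - mean_x) ** 2 for b in x))
    if norm_v == 0 or norm_x == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return 1.0 - covariance / (norm_v * norm_x)

green = [1.0, 2.0, 3.0, 2.0, 1.0]
blue = [6.0, 7.0, 8.0, 7.0, 6.0]  # same shape as green, shifted upwards
red = [3.0, 2.0, 1.0, 2.0, 3.0]   # near green in level, opposite shape
```

Here `pearson_distance(green, blue)` is (numerically) 0 and `pearson_distance(green, red)` is 2, the largest possible value, so green and blue end up close even though their lines are far apart.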

Choose cluster criterion
In this case we choose a cost function J to be minimized, of a kind commonly used for fuzzy clustering. Each cluster i is represented by its center $v_i$.
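As a point of reference, a cost function commonly used for fuzzy clustering is the fuzzy c-means objective. The form below is that standard objective, written with the symbols used so far; the fuzzifier $m > 1$, the number of objects $N$ and the squared distance are assumptions, and the exact J used in the lecture may differ.

```latex
J = \sum_{i=1}^{C} \sum_{j=1}^{N} u_{i,j}^{m} \, d(v_i, x_j)^{2},
\qquad \text{subject to} \quad \sum_{i=1}^{C} u_{i,j} = 1 \ \text{for every object } j
```

Minimizing J rewards high membership $u_{i,j}$ only where object $x_j$ is close to center $v_i$.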

How to deal with outliers?
Outliers can disturb the clustering process and the resulting clustering, so we should deal with them. We adapt our cost function with a term $w_j$ for each object $x_j$: if $w_j$ is high, the contribution of object $x_j$ is low, so it will hardly influence the clustering process.

Minimization of the cost function
Use a genetic algorithm.
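A minimal sketch of what a genetic algorithm for minimization can look like. The encoding (a list of real numbers), the operators (truncation selection, one-point crossover, Gaussian mutation, elitism) and all parameter values are assumptions for illustration; the lecture's actual algorithm is not specified here. A simple quadratic function stands in for the clustering cost J.

```python
import random

def genetic_minimize(cost, n_genes, pop_size=30, generations=60,
                     mutation_scale=0.3, seed=0):
    """Evolve a population of real-valued candidate solutions towards low cost.
    Returns the best individual and the best cost seen per generation."""
    rng = random.Random(seed)
    population = [[rng.uniform(-5.0, 5.0) for _ in range(n_genes)]
                  for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        population.sort(key=cost)
        history.append(cost(population[0]))
        parents = population[: pop_size // 2]  # selection: keep the better half
        children = [population[0][:]]          # elitism: best survives unchanged
        while len(children) < pop_size:
            mother, father = rng.sample(parents, 2)
            cut = rng.randrange(1, n_genes) if n_genes > 1 else 0
            child = mother[:cut] + father[cut:]  # one-point crossover
            child = [g + rng.gauss(0.0, mutation_scale) for g in child]  # mutation
            children.append(child)
        population = children
    best = min(population, key=cost)
    return best, history + [cost(best)]

def toy_cost(genes):
    """Stand-in cost with its minimum at (1, -2)."""
    return (genes[0] - 1.0) ** 2 + (genes[1] + 2.0) ** 2

best, history = genetic_minimize(toy_cost, n_genes=2)
```

Because the best individual is always copied into the next generation, the best cost in `history` never increases. For fuzzy clustering, an individual would instead encode the cluster centers, and `cost` would evaluate J for that encoding.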

Test the algorithm
Generate data based on a set of template profiles.


Test the algorithm
Each scenario represents a different test case. Data is generated by producing new gene expression profiles: take a template profile, add noise to it, and add it to the database. Do this 50 times for each profile. The algorithm should find the correct number of clusters (= number of template profiles), and should assign each datum to the correct cluster.
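The data generation procedure can be sketched as follows. The Gaussian noise model, the noise level `noise_sd` and the three template profiles are assumptions made for illustration; only the "50 noisy copies per template" recipe comes from the description above.

```python
import random

def generate_dataset(templates, copies=50, noise_sd=0.1, seed=0):
    """Create `copies` noisy versions of every template profile.
    Returns the profiles and, for later validation, the index of the
    template each profile was derived from."""
    rng = random.Random(seed)
    profiles, labels = [], []
    for label, template in enumerate(templates):
        for _ in range(copies):
            noisy = [value + rng.gauss(0.0, noise_sd) for value in template]
            profiles.append(noisy)
            labels.append(label)
    return profiles, labels

# Hypothetical template expression profiles over 5 time points.
templates = [
    [0.0, 1.0, 2.0, 1.0, 0.0],
    [2.0, 1.0, 0.0, 1.0, 2.0],
    [0.0, 0.5, 1.0, 1.5, 2.0],
]
profiles, labels = generate_dataset(templates)
```

With three templates this yields 150 profiles. The stored `labels` are never shown to the clustering algorithm; they serve only to check afterwards whether it found three clusters and assigned each profile correctly.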

Results
The algorithm finds the true number of clusters in […] % of the cases. The algorithm puts the genes in the correct clusters in […] % of the cases. I am now testing the algorithm with real data.