Introduction to Bioinformatics


Introduction to Bioinformatics Biological Networks Department of Computing Imperial College London March 4, 2010 Lecture hours 14-15 Nataša Pržulj natasha@imperial.ac.uk

Data Clustering Clustering finds relationships and patterns in the data, giving insight into the underlying biology. Clustering algorithms can be applied to the data to find groups of similar genes/proteins, or groups of similar samples.

What is data clustering? Clustering is a method by which large sets of data are grouped into clusters (groups) of smaller sets of similar data. Example: there are 10 balls of three different colours, and we want to cluster them into three groups. An intuitive solution is to group balls of the same colour together. Identifying similarity by colour is easy; however, we want to extend this to numerical values so that we can deal with biological data, and to cases with more features than just colour.

Clustering Partition a set of elements into subsets, called clusters, such that: elements of the same cluster are similar to each other (homogeneity property, H); elements from different clusters are different (separation property, S).

Clustering Algorithms A clustering algorithm attempts to find natural groups of components (or data) based on some notion of similarity over the features describing them. The algorithm also finds the centroid of each group of data points. To determine cluster membership, many algorithms evaluate the distance between a point and the cluster centroids. The output from a clustering algorithm is essentially a statistical description of the cluster centroids, together with the number of components in each cluster.

Clustering Algorithms Cluster centroid: the centroid of a cluster is a point whose parameter values are the mean of the parameter values of all the points in the cluster. Distance: generally, the distance between two points is taken as a common metric to assess the similarity among the components of a population. The most commonly used distance measure is the Euclidean distance, which defines the distance between two points p = (p1, p2, ...) and q = (q1, q2, ...) as: d(p, q) = sqrt( (p1 − q1)² + (p2 − q2)² + ... )

Clustering Algorithms There are many possible distance metrics. Some theoretical (and intuitive) properties of distance metrics: The distance between two items (elements) must be greater than or equal to zero; distances cannot be negative. The distance between an item and itself must be zero; conversely, if the distance between two items is zero, then the items must be identical. The distance between item A and item B must be the same as the distance between item B and item A (symmetry). The distance between item A and item C must be less than or equal to the sum of the distances between items A and B and between items B and C (triangle inequality).

Clustering Algorithms Example distances: Euclidean (L2) distance; Manhattan (L1) distance; Lm: (|x1−x2|^m + |y1−y2|^m)^(1/m); L∞: max(|x1−x2|, |y1−y2|); inner product: x1·x2 + y1·y2; correlation coefficient. For simplicity we will concentrate on the Euclidean and Manhattan distances.
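The distances above can be sketched in a few lines of plain Python (the function names here are ours, chosen for illustration, not from the slides):

```python
import math

def euclidean(p, q):
    # L2 distance: square root of the summed squared coordinate differences
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    # L1 distance: sum of absolute coordinate differences
    return sum(abs(a - b) for a, b in zip(p, q))

def minkowski(p, q, m):
    # General Lm distance; m=1 gives Manhattan, m=2 gives Euclidean
    return sum(abs(a - b) ** m for a, b in zip(p, q)) ** (1 / m)

def chebyshev(p, q):
    # L-infinity distance: the largest single coordinate difference
    return max(abs(a - b) for a, b in zip(p, q))

p, q = (0, 0), (3, 4)
print(euclidean(p, q))   # 5.0
print(manhattan(p, q))   # 7
print(chebyshev(p, q))   # 4
```

Note how the same pair of points yields different distances under different metrics; which metric is appropriate depends on the data.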

Clustering Algorithms Distance Measures: Minkowski metric. Suppose two objects x and y both have p features: x = (x1, x2, ..., xp) and y = (y1, y2, ..., yp). The Minkowski metric of order r is defined as: d(x, y) = ( |x1 − y1|^r + |x2 − y2|^r + ... + |xp − yp|^r )^(1/r)

Clustering Algorithms Commonly used Minkowski metrics: r = 1 gives the Manhattan (city-block) distance, r = 2 gives the Euclidean distance, and r → ∞ gives the sup (Chebyshev) distance, the maximum over features of |xi − yi|.

Clustering Algorithms Examples of Minkowski metrics: for the points (0, 0) and (3, 4), the Manhattan (r = 1) distance is 3 + 4 = 7, the Euclidean (r = 2) distance is sqrt(9 + 16) = 5, and the sup (r → ∞) distance is max(3, 4) = 4.

Clustering Algorithms Distance/similarity matrices: clustering is based on distances, stored in a distance/similarity matrix that represents the distance between each pair of objects. Only half the matrix is needed, since it is symmetric.
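A minimal sketch of building such a matrix in plain Python, exploiting the symmetry by computing only the upper triangle (the helper names are ours):

```python
import math

def euclidean(p, q):
    # L2 distance between two points
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def distance_matrix(points):
    # Symmetric NxN matrix; entry d[i][j] is the distance between points i and j
    n = len(points)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):  # upper triangle only; mirror into d[j][i]
            d[i][j] = d[j][i] = euclidean(points[i], points[j])
    return d

pts = [(0, 0), (3, 4), (6, 8)]
m = distance_matrix(pts)
# diagonal entries are 0 and m[i][j] == m[j][i]
```

The diagonal is zero (distance from an item to itself), matching the metric properties listed earlier.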

Clustering Algorithms Hierarchical vs non-hierarchical: Hierarchical clustering is the most commonly used method for identifying groups of closely related genes or tissues; it successively links genes or samples with similar profiles to form a tree structure. K-means clustering is a non-hierarchical (flat) clustering method that requires the analyst to supply the number of clusters in advance and then allocates genes and samples to the clusters appropriately.

Clustering Algorithms Hierarchical Clustering: Given a set of N items to be clustered, and an NxN distance (or similarity) matrix, the basic process of hierarchical clustering is: 1. Start by assigning each item to its own cluster, so that if you have N items, you now have N clusters, each containing just one item. 2. Find the closest (most similar) pair of clusters and merge them into a single cluster, so that you now have one cluster fewer. 3. Compute the distances (similarities) between the new cluster and each of the old clusters. 4. Repeat steps 2 and 3 until all items are clustered into a single cluster of size N.

Clustering Algorithms Hierarchical Clustering: 1. Scan the matrix for the minimum. 2. Join the two items into one node. 3. Update the matrix and repeat from step 1.

Clustering Algorithms Hierarchical Clustering: The distance between two points is easy to compute; the distance between two clusters is harder, and can be defined in several ways: Single-Link Method / Nearest Neighbor; Complete-Link / Furthest Neighbor; average of all cross-cluster pairs.

Clustering Algorithms Hierarchical Clustering: Single-Link Method / Nearest Neighbor (also called the connectedness, or minimum, method): the distance between one cluster and another is equal to the shortest distance from any member of one cluster to any member of the other cluster. Complete-Link / Furthest Neighbor (also called the diameter, or maximum, method): the distance between one cluster and another is equal to the longest distance from any member of one cluster to any member of the other cluster. Average-link clustering: the distance between one cluster and another is equal to the average distance from any member of one cluster to any member of the other cluster.
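The agglomerative procedure with the three linkage variants can be sketched in plain Python. This is a naive implementation for illustration only (it stops at k clusters rather than building the full dendrogram, and all names are ours):

```python
import math

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def cluster_distance(c1, c2, linkage):
    # All pairwise point distances between members of the two clusters
    dists = [euclidean(p, q) for p in c1 for q in c2]
    if linkage == "single":    # nearest neighbour: shortest cross-cluster distance
        return min(dists)
    if linkage == "complete":  # furthest neighbour: longest cross-cluster distance
        return max(dists)
    return sum(dists) / len(dists)  # average-link: mean cross-cluster distance

def agglomerative(points, k, linkage="single"):
    # Start with one singleton cluster per point, then repeatedly
    # merge the closest pair of clusters until only k clusters remain.
    clusters = [[p] for p in points]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = cluster_distance(clusters[i], clusters[j], linkage)
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)  # merge cluster j into cluster i
    return clusters

pts = [(0, 0), (0, 1), (5, 5), (5, 6)]
clusters = agglomerative(pts, 2, "single")
# the two tight pairs end up in separate clusters
```

On this toy data the three linkage methods agree; on elongated or noisy data they can produce quite different trees, as the slides note.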

Clustering Algorithms Hierarchical Clustering: Example: Single-Link (Minimum) Method: Resulting Tree, or Dendrogram:

Clustering Algorithms Hierarchical Clustering: Example: Complete-Link (Maximum) Method: Resulting Tree, or Dendrogram:

Clustering Algorithms Hierarchical Clustering: In a dendrogram, the length of each tree branch represents the distance between the clusters it joins. Different dendrograms may arise when different linkage methods are used.

Clustering Algorithms K-Means Clustering: Basic ideas: use cluster centroids (means) to represent clusters; assign data elements to the closest cluster (centroid). Goal: minimize intra-cluster dissimilarity.

Clustering Algorithms K-Means Clustering: 1. Pick (usually randomly) k points as the centers of k clusters. 2. Compute the distances between a non-center point v and each of the k center points; find the minimum distance, say to center point Ci, and assign v to the cluster defined by Ci. Do this for all non-center points to obtain k non-overlapping clusters containing all the points. 3. For each cluster, compute its new center: the centroid (mean) of the points in the cluster. 4. Repeat until the algorithm converges, i.e., the same set of centers is chosen as in the previous iteration. This results in non-overlapping clusters of potentially different sizes.
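The steps above can be sketched in plain Python (a minimal illustration, not a production implementation; function names are ours, and empty clusters simply keep their old center):

```python
import math
import random

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def mean(points):
    # Centroid: coordinate-wise mean of the points in a cluster
    return tuple(sum(c) / len(points) for c in zip(*points))

def kmeans(points, k, seed=0):
    random.seed(seed)
    centers = random.sample(points, k)  # step 1: pick k initial centers
    while True:
        # step 2: assign each point to its nearest center
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: euclidean(p, centers[i]))
            clusters[nearest].append(p)
        # step 3: recompute each center as its cluster's centroid
        new_centers = [mean(c) if c else centers[i] for i, c in enumerate(clusters)]
        if new_centers == centers:  # step 4: converged, assignments are stable
            return clusters, centers
        centers = new_centers

pts = [(0, 0), (0, 1), (5, 5), (5, 6)]
clusters, centers = kmeans(pts, 2)
# the two well-separated pairs form the two clusters
```

Note the contrast with hierarchical clustering: k is fixed in advance, and each pass over the data costs O(k n) distance computations rather than a scan of an n-by-n matrix.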

Clustering Algorithms K-Means Clustering Example:  

Clustering Algorithms K-means vs. Hierarchical clustering: Computation time: hierarchical clustering O(m n² log n); k-means clustering O(k t m n), where t is the number of iterations, n the number of objects (m-dimensional vectors), and k the number of clusters. Memory requirements: hierarchical clustering O(mn + n²); k-means clustering O(mn + kn). Other considerations: hierarchical clustering requires selecting a linkage method, and to perform any analysis it is necessary to partition the dendrogram into k disjoint clusters by cutting it at some level; a limitation is that it is not clear how to choose this k. K-means requires selecting k in advance. In both cases, a distance/similarity measure must be selected.