Lecture 13, CS567

Clustering
– Relevance to Bioinformatics
  – Array(s) analysis
  – Examples
– Principal Component Analysis
– Clustering Algorithms
Bioinformatic arrays
Array = view of biological (sub)system(s) in terms of a vector of attributes
Arrays at different resolutions
– Ecosystem: spectrum of life in a pond
– Populations: characteristics of a population of species
– Organism: "FBI profile" (multiple attribute types)
  – Single Nucleotide Polymorphism (SNP) profile (single attribute type)
Bioinformatic arrays (continued)
Arrays at different resolutions
– Cell: mRNA expression array in a cancer cell
– Cell subcompartment: cytosolic protein array
Multiple arrays
– Trivially, replicates: increase the signal/noise ratio
– Under different conditions: useful for analysis of
  – relationships between conditions
  – relationships between attributes
Bioinformatic clusters
– Genes {g} that are turned on together in a condition x (e.g., a disease, a phase of life)
– Conditions {x} in which a gene g is turned on
– Sequence/subsequence families
– Structure/substructure families
– Hierarchical classification of molecules/systems
– Groups of people sharing a disease
– Groups of diseases common to a people/species
Principal Component Analysis
Not really clustering, but of some use in the analysis of multidimensional data
Grounds for PCA
– Humans are restricted to viewing only 3 dimensions at a time; PCA renders a lower-dimensional approximation of a high-dimensional space
– PCA highlights the attributes showing maximum variation, and helps in ignoring attributes showing minimal variation
Principal Component Analysis (continued)
Project data from an N-dimensional space to an M-dimensional space such that
– M << N
– The first principal component is aligned with the "direction" of maximal variability
– Data remaining after removal of the projection along a principal component is "residual" data, with "residual" variance
– Each subsequent component is derived in ranked order of capturing the residual variance
– All PCs are orthogonal
Limitations
– Different clusterings may look alike under PCA
– The combination of attributes that constitutes a principal component may not be biologically intuitive
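A minimal numpy sketch of the projection just described; the random data matrix and the choice M = 2 are illustrative assumptions, not part of the lecture:

```python
import numpy as np

def pca(X, M):
    """Project N-dimensional rows of X onto the top M principal components."""
    Xc = X - X.mean(axis=0)            # center each attribute (column)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:M]                # orthogonal directions of max variance
    explained = (S ** 2) / (S ** 2).sum()  # fraction of variance per component
    return Xc @ components.T, explained[:M]

X = np.random.default_rng(0).normal(size=(100, 10))  # 100 points, N = 10
Z, var_frac = pca(X, M=2)
print(Z.shape, var_frac)  # (100, 2) and the variance captured by the 2 PCs
```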
Clustering
Ultimate goal
– "Birds of a feather stick together; we be brothers thicker than blood"
– The minimum similarity within a cluster should exceed the maximum similarity between clusters
– Equivalently, the maximum distance within a cluster should be less than the minimum distance between clusters
Clustering: classifying data without class labels
– Supervised: classes are known ("classifying Olympic athletes who are missing their country tags")
– Unsupervised: classes are not known ("classifying forms of life on an alien planet")
– Hybrid: a combination of the above ("find classes based on a sample of the data, then classify the remaining data into the newly defined classes")
The notion of distance
Definition
– Positivity
– Symmetry (why relative entropy is not a distance: "one-way traffic is not really distance!")
– Triangle inequality ("the 'Beam me up, Scotty' route is not a distance"): getting from a to b via some c is never less than getting from a to b directly
– Self-distance = 0 ("If you are trying to find yourself, just open your eyes – you are right here!")
Examples of true distances
– Minkowski distances (déjà vu)
– Distances based on the Pearson correlation coefficient r (1 - r is commonly used, though strictly it is sqrt(2(1 - r)) that behaves as a metric for standardized data)
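A short Python sketch of the two example distances; the vectors a and b are invented for illustration:

```python
import numpy as np

def minkowski(a, b, p):
    """L_p distance: p=1 gives Manhattan, p=2 gives Euclidean."""
    return np.sum(np.abs(a - b) ** p) ** (1.0 / p)

def pearson_distance(a, b):
    """sqrt(2*(1 - r)); for z-normalized vectors this is proportional to
    their Euclidean distance, so it satisfies the metric axioms there."""
    r = np.corrcoef(a, b)[0, 1]
    return np.sqrt(2.0 * (1.0 - r))

a = np.array([1.0, 2.0, 3.0, 4.0])
b = np.array([2.0, 2.0, 5.0, 3.0])
print(minkowski(a, b, p=2), pearson_distance(a, b))
```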
Representing distances
Absolute measures
– Measures that can be defined without reference to other data points
– Examples: height, weight
Pairwise distances
– Defined with reference to a pair of data points
– May or may not be derivable from absolute measures
– Inversely related to the similarity between points
– Example: difference in height
Distances in bioinformatics
Absolute
– (Sub)sequence length, molecular weight
Pairwise
– Between primary structures
  – Based on probabilistic similarity scores obtained from pairwise sequence alignment
  – Based on expectations of similarity scores
  – Example: clustering based on BLAST searches
– Between secondary/tertiary structures
  – Root-mean-square (RMS) difference between pairs of 3D coordinate arrays (a sketch follows)
  – Example: structural classes in the CATH database
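A minimal sketch of the RMS structural distance just mentioned, assuming the two coordinate sets are already superposed and have corresponding atoms; real structural comparisons (as in CATH) first align the structures:

```python
import numpy as np

def rmsd(P, Q):
    """Root-mean-square deviation between two (n_atoms, 3) coordinate arrays."""
    return np.sqrt(np.mean(np.sum((P - Q) ** 2, axis=1)))

P = np.random.default_rng(1).normal(size=(50, 3))        # 50 "atoms" in 3D
Q = P + np.random.default_rng(2).normal(scale=0.5, size=(50, 3))  # perturbed copy
print(rmsd(P, Q))
```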
Data types
Real numbers
– Interval-scaled: linear scale
– Ratio-scaled: non-linear scale
– Fine-grained resolution
Nominal
– Named categories
– Distance typically Boolean ("You are either a Muggle or a Wizard!")
– Coarse distance
Ordinal
– Named categories with an implicit ranking ("distance from janitor to CEO > distance from president to CEO, not counting the .com boom years")
– Distance depends on relative ranks (intermediate granularity)
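A hedged sketch of how a distance might be computed for each data type, reusing the slide's own examples; the value range, categories, and rank scale are invented for illustration:

```python
def interval_d(x, y, value_range):
    """Interval-scaled: linear difference, normalized to [0, 1]."""
    return abs(x - y) / value_range

def nominal_d(x, y):
    """Nominal: Boolean distance, same category or not."""
    return 0.0 if x == y else 1.0

def ordinal_d(x, y, ranks):
    """Ordinal: distance from relative ranks, normalized by the rank span."""
    return abs(ranks[x] - ranks[y]) / (len(ranks) - 1)

ranks = {"janitor": 0, "manager": 1, "president": 2, "CEO": 3}
print(interval_d(180.0, 165.0, value_range=100.0))   # heights in cm
print(nominal_d("Muggle", "Wizard"))                 # 1.0
print(ordinal_d("janitor", "CEO", ranks),            # 1.0 > ...
      ordinal_d("president", "CEO", ranks))          # ... 0.33
```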
Clustering algorithms
K-means (centroid)
Algorithm
1. Pick k points as initial cluster centers
2. For each point i, find the center k minimizing D(i,k) and assign i to k
3. Replace the k centers by the centroids (means) of the clusters
4. Calculate the mean square distance of points from their cluster centers
5. Iterate steps 2-4 until the membership of points does not change OR the quantity in step 4 is minimized
Scales well
Limitations
– The number of clusters is predetermined
– Clusters are always spherical in shape
– Adversely affected by outliers ('coz centroids are virtual data points, à la "Simone")
– The mean needs to be definable
– Only a local minimum is found
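A compact Python sketch following steps 1-5 above; k, the synthetic data, and the stopping test are illustrative choices, not the lecture's code:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]   # step 1
    for _ in range(n_iter):
        # step 2: assign each point to its nearest center
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # step 3: replace centers by the centroids of the clusters
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):                # step 5: stable
            break
        centers = new_centers
    mse = ((X - centers[labels]) ** 2).sum(axis=1).mean()    # step 4's criterion
    return labels, centers, mse

X = np.vstack([np.random.default_rng(1).normal(loc=m, size=(50, 2))
               for m in (0.0, 5.0)])                          # two blobs
labels, centers, mse = kmeans(X, k=2)
print(centers, mse)
```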
Clustering algorithms
K-means (centroid): EM variation
Algorithm
1. Pick k points as initial cluster centers
2. For each point i, find all distances D(i,k) and assign i to each cluster k probabilistically
3. Replace the k centers by the (weighted) centroids of the clusters
4. Calculate the mean square distance of points from their cluster centers
5. Iterate steps 2-4 until the expectation of the error function is minimized (the likelihood is maximized)
A better minimum is found (why?)
Limitations
– The number of clusters is predetermined
– Clusters are always spherical in shape
– Adversely affected by outliers ('coz centroids are virtual data points)
– The mean needs to be definable
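A hedged sketch of the soft-assignment variant: step 2 gives each point a responsibility for every center instead of a hard label. The Gaussian-style weighting and the beta parameter are assumptions for illustration:

```python
import numpy as np

def soft_kmeans(X, k, beta=1.0, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        # E-step: probabilistic assignment of each point to every center
        r = np.exp(-beta * d2)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: centers become responsibility-weighted means
        new_centers = (r.T @ X) / r.sum(axis=0)[:, None]
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return r, centers

X = np.vstack([np.random.default_rng(1).normal(loc=m, size=(50, 2))
               for m in (0.0, 5.0)])
resp, centers = soft_kmeans(X, k=2)
print(centers, resp[:3].round(3))  # soft memberships of the first 3 points
```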
Clustering algorithms
K-medoids (use medoids, i.e., representative actual data points, rather than means)
Algorithm
1. Pick k points as initial medoids
2. For each point i, find the medoid k minimizing D(i,k) and assign i to k
3. Calculate the mean square distance C_old of points from the cluster medoids
4. Replace one of the k medoids by a random point from a cluster, calculating C_new every time the cluster structure is changed
5. If C_new < C_old, accept the new clustering
6. Iterate until the membership of points does not change OR C is minimized
Handles outliers well
Limitations
– The number of clusters is predetermined
– Clusters are always spherical in shape
– Doesn't scale as well as k-means
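A minimal sketch of the swap-based procedure in steps 3-5; the random swap proposal and the synthetic data are illustrative. Because medoids are actual data points, outliers cannot drag a center into empty space:

```python
import numpy as np

def kmedoids(X, k, n_iter=200, seed=0):
    rng = np.random.default_rng(seed)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # all pairwise
    medoids = rng.choice(len(X), size=k, replace=False)
    cost_old = D[:, medoids].min(axis=1).sum()
    for _ in range(n_iter):
        # propose swapping one medoid for a random non-medoid point
        non_medoids = np.setdiff1d(np.arange(len(X)), medoids)
        cand = medoids.copy()
        cand[rng.integers(k)] = rng.choice(non_medoids)
        cost_new = D[:, cand].min(axis=1).sum()
        if cost_new < cost_old:        # accept only if the cost improves
            medoids, cost_old = cand, cost_new
    labels = D[:, medoids].argmin(axis=1)
    return medoids, labels

X = np.vstack([np.random.default_rng(1).normal(loc=m, size=(50, 2))
               for m in (0.0, 5.0)])
medoids, labels = kmedoids(X, k=2)
print(X[medoids])  # medoids are real points from the data set
```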
Clustering algorithms
Hierarchical clustering (tree building; average-linkage clustering)
Algorithm
1. Find the two closest points i and j and unite them under a node
2. Treat i and j as a single cluster, represented by their average
3. Iterate steps 1-2 until all points are included in the tree
Biologically relevant results
Limitations
– The solution is frequently suboptimal
– Cluster boundaries are not clear
– The linear ordering of the leaves may not correspond to similarity
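A short sketch using scipy's agglomerative clustering with average linkage, which implements the merge loop above; the synthetic data and the three-cluster cut are illustrative:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.vstack([np.random.default_rng(1).normal(loc=m, size=(20, 2))
               for m in (0.0, 5.0, 10.0)])          # three blobs
Z = linkage(X, method="average")                    # repeatedly merge closest clusters
labels = fcluster(Z, t=3, criterion="maxclust")     # cut the tree into 3 clusters
print(labels)
# scipy.cluster.hierarchy.dendrogram(Z) would draw the tree itself
```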
Clustering algorithms
Single-linkage clustering
Algorithm
1. If D(i,j) < cut-off, then place i and j in the same cluster
2. If i in cluster x is within the cut-off distance of j in cluster y, then merge x and y into a single cluster
Handles non-spherical clusters well
Limitations
– Often not the best choice in practice
– Doesn't scale well
– Essentially assumes transitivity of similarity ("Saudi princely clan, Indian joint family, Godfather's clan")
– Clusters are not compact, but scraggly (the chaining effect)
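A sketch of single-linkage clustering with a distance cut-off via scipy; the cut-off value of 2.0 is an illustrative assumption:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.vstack([np.random.default_rng(1).normal(loc=m, size=(20, 2))
               for m in (0.0, 5.0)])
Z = linkage(X, method="single")   # merge clusters whose closest members touch
labels = fcluster(Z, t=2.0, criterion="distance")  # clusters below the cut-off
print(np.unique(labels))
```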