Sporulation in Bacillus. [Figure: the sporulation cycle, after Errington, 2004. Labels: vegetative cycle (medial division); Stage II, asymmetric cell division (polar division); Stage III, engulfment; Stage IV, cortex; Stage V, spore coat; Stages VI and VII, maturation and cell lysis; dormant spore; germination; growth.]
There is a hierarchy of gene expression during sporulation. Sporulation gene expression is temporally regulated by a transcription factor cascade. [Figure labels: σA, Spo0A, σF, σE, σG, σK]
Which genes are controlled by which transcription factor? What if we knock out a transcription factor gene? [Figure labels: σA, Spo0A, σF, σE, σG, σK]
B. subtilis spotted dsDNA microarray. Contains ~4100 B. subtilis genes as PCR products.
High speed spotting robot
Microarray hybridization
Raw microarray data is hard to interpret!
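Image Analysis & Data Visualization. [Figure: Cy5 and Cy3 channel images are combined into a log2 ratio of the two signals and displayed on a colour scale from underexpressed to overexpressed, marked at 2-, 4- and 8-fold.]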
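The log2 ratio behind such a colour scale can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and putting Cy5 in the numerator is an assumption, since which channel counts as "overexpressed" depends on how the samples were labelled.

```python
import math

def log2_ratio(cy5, cy3):
    """Return log2(cy5 / cy3) for two positive spot intensities.

    Assumption: Cy5 is the numerator. Swap the arguments if the
    experiment was labelled the other way round.
    """
    if cy5 <= 0 or cy3 <= 0:
        raise ValueError("intensities must be positive")
    return math.log2(cy5 / cy3)

def fold_change(log_ratio):
    """Convert a log2 ratio into an unsigned fold change (always >= 1)."""
    return 2 ** abs(log_ratio)

# Made-up intensities, purely to show the symmetry of the log scale.
for cy5, cy3 in [(800, 100), (100, 400), (250, 250)]:
    ratio = log2_ratio(cy5, cy3)
    print(cy5, cy3, ratio, fold_change(ratio))
```

On the log2 scale an 8-fold increase and an 8-fold decrease sit at +3 and -3, which is why the ratio is log-transformed before it is coloured.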
Experimental Design. [Figure labels: Spo0A, σA]
Introduction to Clustering. “An intelligent being cannot treat every object it sees as a unique entity unlike anything else in the universe. It has to put objects in categories so that it may apply its hard-won knowledge about similar objects encountered in the past, to the object at hand.” Steven Pinker, from How the Mind Works, 1997
Class prediction using supervised learning. Classification by gene expression required a training set, i.e. we had a priori knowledge of the system.
Clustering is an unsupervised method for data exploration. No training set or preconceived notions about the data labels are required. The data will reveal its natural structure to us. [Figure axes: microarrays, genes]
Agglomerative Hierarchical Clustering. We start with many nodes, and end up with only one!
Hierarchies are ubiquitous in biology. [Figure: N. Pace, Science, 1997]
Clustering Terminology. [Figure: clustering dendrogram over genes, labelled with gene names and “pseudogenes”.] Edge length is proportional to the “distance” between connected genes or nodes.
Clustering Reveals the "Molecular Logic" of Gene Expression. [Figure axes: genes, experiments]
Similarity Metrics. In order to implement a clustering algorithm, we require some quantitative measure that compares the behaviour of two genes across some set of conditions. Are they behaving similarly, or differently?
Euclidean Distance. What is the distance between two coordinates in 2D space? [Figure: points (1,4) and (3,1) on X and Y axes, separated by 2 units in X and 3 units in Y.] From Pythagoras, $d = \sqrt{2^2 + 3^2} = \sqrt{13} \approx 3.61$.
Euclidean Distance. How about objects in 3D space? [Figure: points (0,0,0) and (2,4,1) in 3D space.] $d = \sqrt{\Delta x^2 + \Delta y^2 + \Delta z^2}$
Euclidean Distance. It turns out that the Euclidean distance generalizes to N-dimensional space: $d = |X - Y| = \sqrt{\sum_{i=1}^{N} (x_i - y_i)^2}$, where $X = (x_1, x_2, \ldots, x_N)$ and $Y = (y_1, y_2, \ldots, y_N)$. These look an awful lot like a list in Perl, or a line of gene expression data, yes? One way to conceptualize an individual gene expression vector is therefore as a coordinate in some high-dimensional space. If we have two such vectors, then we can use the Euclidean distance to ask “How far apart are they?”
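As a minimal sketch of the N-dimensional formula, here is one way to write it in Python (the slides have Perl in mind; Python is used here only for illustration, and the function name is hypothetical). Each expression vector is treated as a plain sequence of numbers.

```python
import math

def euclidean_distance(x, y):
    """Euclidean distance between two equal-length numeric vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    # Sum the squared coordinate differences, then take the square root.
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

# The same function covers the 2D and 3D examples and longer vectors.
print(euclidean_distance((1, 4), (3, 1)))                # sqrt(2^2 + 3^2)
print(euclidean_distance((2, 4, 1), (0, 0, 0)))          # sqrt(4 + 16 + 1)
print(euclidean_distance((1, 4, 2, -1), (3, 2, -2, -3)))
```

Nothing in the function depends on the number of dimensions, so the 2D, 3D and gene expression cases all go through the same code.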
Pearson Correlation Coefficient. $r = \frac{\sum_{i=1}^{N} (x_i - \mu_x)(y_i - \mu_y)}{N \sigma_x \sigma_y}$. Kellie introduced the Pearson as a true correlation measure that varies in the range -1 to 1.
Pearson Correlation Coefficient, computational form. $r = \frac{N \sum_{i=1}^{N} x_i y_i - \left(\sum_{i=1}^{N} x_i\right)\left(\sum_{i=1}^{N} y_i\right)}{\sqrt{\left[N \sum_{i=1}^{N} x_i^2 - \left(\sum_{i=1}^{N} x_i\right)^2\right]\left[N \sum_{i=1}^{N} y_i^2 - \left(\sum_{i=1}^{N} y_i\right)^2\right]}}$. Incredibly, this form makes our lives easier if we want to implement a Pearson() subroutine in Perl!
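The computational form needs only five running sums, so it maps directly onto code. Below is a sketch in Python rather than Perl (an illustrative choice, not the course's own subroutine); the function name and the error handling for constant vectors are assumptions.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length vectors (computational form)."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two vectors of equal length >= 2")
    # The five sums that appear in the formula.
    sum_x = sum(x)
    sum_y = sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    if denominator == 0:
        # A constant vector has zero variance, so r is undefined.
        raise ValueError("correlation undefined for a constant vector")
    return numerator / denominator

print(pearson([1, 2, 3], [2, 4, 6]))   # perfectly correlated
print(pearson([1, 2, 3], [6, 4, 2]))   # perfectly anti-correlated
```

All five sums can be accumulated in a single pass over the data, whereas the definitional form needs the means before the deviations can be computed.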
Strategies for clustering: single linkage clustering. Similarity between the clusters is defined as the similarity of the closest pair of observations between the two groups.
Strategies for clustering: complete linkage clustering. Similarity between the clusters is defined as the similarity of the farthest pair of observations between the two groups.
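Single and complete linkage differ only in which pair of observations they pick, which a short Python sketch makes concrete. It assumes Euclidean distance as the metric (a correlation-based measure could be substituted) and represents each cluster as a list of vectors; all names are hypothetical.

```python
import math
from itertools import product

def euclidean_distance(x, y):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def single_linkage(cluster_a, cluster_b):
    """Distance between the closest pair, one member from each cluster."""
    return min(euclidean_distance(a, b) for a, b in product(cluster_a, cluster_b))

def complete_linkage(cluster_a, cluster_b):
    """Distance between the farthest pair, one member from each cluster."""
    return max(euclidean_distance(a, b) for a, b in product(cluster_a, cluster_b))

# Two made-up clusters on a line, so the answers are easy to check by eye.
cluster_a = [(0, 0), (1, 0)]
cluster_b = [(4, 0), (6, 0)]
print(single_linkage(cluster_a, cluster_b))
print(complete_linkage(cluster_a, cluster_b))
```

Here the closest pair is (1,0) and (4,0), and the farthest pair is (0,0) and (6,0), so the two strategies can rank the same pair of clusters quite differently.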
Strategies for clustering: average linkage clustering. Nodes are represented by the average of vectors from the two component nodes, and the average pairwise distance within the newly formed cluster is thus minimized.
Average Linkage Clustering. Once we have decided that two genes (or nodes) should join to make a new node, how do we define the contents of the new node? X = (1, 4, 2, -1), Y = (3, 2, -2, -3), Avg(X,Y) = (2, 3, 0, -2). This makes life easy: avg( avg(I,J), avg(K,L) ) = avg(I,J,K,L).
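Putting the pieces together, the sketch below merges the two closest nodes, replaces them with their averaged vector, and repeats until one node is left. It is a toy illustration in Python, not the code used by any particular clustering package: Euclidean distance is an assumed metric, the gene named Z is made up, and the function and node names are hypothetical.

```python
import math

def euclidean_distance(x, y):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def average_vectors(x, y):
    """Element-wise average of two vectors: the contents of a new node."""
    return tuple((a + b) / 2 for a, b in zip(x, y))

def agglomerate(vectors):
    """Merge the closest pair of nodes until one node remains.

    `vectors` maps a node name to its expression vector. Returns the
    merges in order, each as (name_a, name_b, distance).
    """
    nodes = {name: tuple(vec) for name, vec in vectors.items()}
    merges = []
    while len(nodes) > 1:
        names = list(nodes)
        best = None
        # Exhaustive search over all pairs for the smallest distance.
        for i in range(len(names)):
            for j in range(i + 1, len(names)):
                d = euclidean_distance(nodes[names[i]], nodes[names[j]])
                if best is None or d < best[2]:
                    best = (names[i], names[j], d)
        a, b, d = best
        # Replace the two joined nodes with a single averaged node.
        nodes["(" + a + "," + b + ")"] = average_vectors(nodes.pop(a), nodes.pop(b))
        merges.append((a, b, d))
    return merges

genes = {"X": (1, 4, 2, -1), "Y": (3, 2, -2, -3), "Z": (10, 10, 10, 10)}
for merge in agglomerate(genes):
    print(merge)
```

With N starting vectors the loop always performs N - 1 merges. Note that the plain two-way average used here matches the identity avg(avg(I,J), avg(K,L)) = avg(I,J,K,L) only when the two nodes being joined contain the same number of genes; weighting each node by its size would be one way to keep the mean exact otherwise.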
Cluster and TreeView by Mike Eisen. http://rana.lbl.gov/EisenSoftware.htm Cluster implements various flavours of clustering algorithms, while TreeView provides a graphical output of the files produced by Cluster.