1
Sporulation in Bacillus
[Figure: the Bacillus sporulation cycle. Labels: vegetative cycle (growth, medial division); polar division; Stage II, asymmetric cell division; Stage III, engulfment; Stage IV, cortex; Stage V, spore coat; Stages VI and VII, maturation and cell lysis; dormant spore; germination. After Errington, 2004.]
2
There is a hierarchy of gene expression during sporulation
Sporulation gene expression is temporally regulated by a transcription factor cascade. [Diagram labels: σF, σG, σK, σE, Spo0A, σA]
3
Which genes are controlled by which transcription factor?
[Diagram labels: σE, σG, σK, σF, Spo0A, σA] What if we knock out a transcription factor gene?
4
Which genes are controlled by which transcription factor?
[Diagram labels: σE, σF, σG, σK, Spo0A, σA] What if we knock out a transcription factor gene?
5
B. subtilis spotted dsDNA microarray
Contains ~4100 B. subtilis genes as PCR products
6
High speed spotting robot
7
Microarray hybridization
8
Raw microarray data is hard to interpret!
9
Image Analysis & Data Visualization
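[Figure: Cy5 and Cy3 channel images combined into a log2 ratio of the two intensities; colour scale running from underexpressed to overexpressed, marked at 2-, 4- and 8-fold.]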
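A minimal sketch of the log2 ratio behind the colour scale, written in Python rather than the Perl the slides mention. Putting Cy5 in the numerator is an assumption (the slide does not say which channel is the reference), and the spot names and intensities are invented for illustration.

```python
import math

def log2_ratio(cy5, cy3):
    """Return log2(cy5 / cy3) for one spot.

    Assumption: Cy5 is the experimental channel and Cy3 the reference.
    Both intensities must be positive for the logarithm to exist.
    """
    if cy5 <= 0 or cy3 <= 0:
        raise ValueError("intensities must be positive")
    return math.log2(cy5 / cy3)

# Hypothetical background-corrected intensities as (Cy5, Cy3) pairs.
spots = {
    "geneA": (800.0, 100.0),  # 8-fold higher in Cy5
    "geneB": (100.0, 400.0),  # 4-fold lower in Cy5
    "geneC": (250.0, 250.0),  # unchanged
}

ratios = {name: log2_ratio(cy5, cy3) for name, (cy5, cy3) in spots.items()}
```

On the log2 scale an 8-fold increase and an 8-fold decrease sit at +3 and -3, so over- and underexpression are symmetric around zero.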
10
Experimental Design
[Diagram labels: Spo0A, Spo0A, σA, σA]
11
Introduction to Clustering
“An intelligent being cannot treat every object it sees as a unique entity unlike anything else in the universe. It has to put objects in categories so that it may apply its hard-won knowledge about similar objects encountered in the past, to the object at hand.” Steven Pinker, from How the Mind Works, 1997
12
Class prediction using supervised learning
Classification by gene expression required a training set, i.e. we had a priori knowledge of the system.
13
Clustering is an unsupervised method for data exploration
[Figure axes: microarrays, genes] No training set or preconceived notions about the data labels are required. The data will reveal its natural structure to us.
14
Agglomerative Hierarchical Clustering
We start with many nodes, and end up with only one!
15
Hierarchies are ubiquitous in biology
N. Pace, SCIENCE, 1997
16
Clustering Terminology
[Figure labels: clustering dendrogram; genes; gene names; “pseudogenes”] Edge length is proportional to the “distance” between connected genes or nodes.
20
Clustering Reveals the "Molecular Logic" of Gene Expression
[Figure axes: genes, experiments]
21
Similarity Metrics
In order to implement a clustering algorithm, we require some quantitative way to compare the behaviour of two genes across a set of conditions: are they behaving similarly, or differently?
22
Euclidean Distance
What is the distance between two coordinates in 2D space? [Figure: points (1,4) and (3,1) on X and Y axes; the horizontal and vertical separations are 2 and 3.] From Pythagoras, distance = $\sqrt{2^2 + 3^2} = \sqrt{13} \approx 3.6$
23
Euclidean Distance
How about objects in 3D space? [Figure: points (0,0,0) and (2,4,1) on X and Z axes.] $d = \sqrt{\Delta x^2 + \Delta y^2 + \Delta z^2}$
25
Euclidean Distance
It turns out that the Euclidean distance generalizes to N-dimensional space: $d = |X - Y| = \sqrt{\sum_{i=1}^{N} (x_i - y_i)^2}$, where $X = (x_1, x_2, x_3, \ldots, x_N)$ and $Y = (y_1, y_2, y_3, \ldots, y_N)$. These look an awful lot like a list in Perl, or a line of gene expression data, yes? One way to conceptualize an individual gene expression vector is therefore as a coordinate in some high-dimensional space. If we have two such vectors, then we can use the Euclidean distance to ask “How far apart are they?”
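A minimal sketch of the N-dimensional distance, in Python rather than the Perl the slides mention. The function name `euclidean_distance` and the four-condition vectors are made up for illustration; the 2D and 3D points are the ones from the earlier slides.

```python
import math

def euclidean_distance(x, y):
    """Square root of the summed squared differences between two vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same number of conditions")
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

# The 2D and 3D cases are just N = 2 and N = 3.
d_2d = euclidean_distance((1, 4), (3, 1))        # sqrt(2^2 + 3^2)
d_3d = euclidean_distance((0, 0, 0), (2, 4, 1))  # sqrt(2^2 + 4^2 + 1^2)

# Two hypothetical gene expression vectors measured over four conditions.
gene_1 = (0.5, 1.5, -1.0, 2.0)
gene_2 = (0.5, -0.5, -1.0, 2.0)
d_genes = euclidean_distance(gene_1, gene_2)     # differ only in condition 2
```

The same function answers "how far apart are they?" for any pair of genes, as long as both were measured over the same set of conditions.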
26
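Pearson Correlation Coefficient
$r = \frac{\sum_{i=1}^{N} (x_i - \mu_x)(y_i - \mu_y)}{N \sigma_x \sigma_y}$, where $\mu$ and $\sigma$ are the mean and standard deviation of each vector. Kellie introduced the Pearson coefficient as a true correlation measure that varies in the range -1 to 1.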
27
Pearson Correlation Coefficient
Pearson Correlation Coefficient, computational form: $r = \frac{N \sum_{i=1}^{N} x_i y_i - \left(\sum_{i=1}^{N} x_i\right)\left(\sum_{i=1}^{N} y_i\right)}{\sqrt{\left[N \sum_{i=1}^{N} x_i^2 - \left(\sum_{i=1}^{N} x_i\right)^2\right]\left[N \sum_{i=1}^{N} y_i^2 - \left(\sum_{i=1}^{N} y_i\right)^2\right]}}$ Incredibly, this form makes our lives easier if we want to implement a Pearson() subroutine in Perl!
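A sketch of why the computational form is convenient: it needs only five running sums, so no means or standard deviations have to be computed first. The slide has a Perl Pearson() subroutine in mind; this version uses Python instead, and the function name `pearson` and the example vectors are illustrative assumptions.

```python
import math

def pearson(x, y):
    """Pearson correlation via the computational (sums-only) form."""
    n = len(x)
    if n == 0 or n != len(y):
        raise ValueError("vectors must be non-empty and of equal length")

    # The five sums that appear in the formula.
    sum_x = sum(x)
    sum_y = sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)

    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    if denominator == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return numerator / denominator

# Hypothetical expression profiles over four conditions.
r_same = pearson([1, 2, 3, 4], [2, 4, 6, 8])      # same shape, different scale
r_opposite = pearson([1, 2, 3, 4], [8, 6, 4, 2])  # mirror-image profile
```

Unlike Euclidean distance, the result depends on the shape of the two profiles rather than their magnitude, which is why the first pair correlates perfectly despite the two-fold difference in scale. With very large values the subtraction in this form can lose floating-point precision, so the mean-centred form may be the safer choice there.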
28
Strategies for clustering
Single linkage clustering: similarity between the clusters is defined as the similarity of the closest pair of observations between the two groups.
29
Strategies for clustering
Complete linkage clustering: similarity between the clusters is defined as the similarity of the farthest pair of observations between the two groups.
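Single and complete linkage differ only in which pair of observations they pick, which a short Python sketch can make concrete. It assumes Euclidean distance as the dissimilarity (so "closest" means smallest distance); the function names and the two toy clusters are invented for illustration.

```python
import math

def euclidean(x, y):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def single_linkage(cluster_a, cluster_b, dist=euclidean):
    """Distance between the closest pair, one member from each cluster."""
    return min(dist(a, b) for a in cluster_a for b in cluster_b)

def complete_linkage(cluster_a, cluster_b, dist=euclidean):
    """Distance between the farthest pair, one member from each cluster."""
    return max(dist(a, b) for a in cluster_a for b in cluster_b)

# Two hypothetical clusters of two observations each.
cluster_a = [(0.0, 0.0), (1.0, 0.0)]
cluster_b = [(4.0, 0.0), (6.0, 0.0)]

d_single = single_linkage(cluster_a, cluster_b)      # (1,0) to (4,0)
d_complete = complete_linkage(cluster_a, cluster_b)  # (0,0) to (6,0)
```

Single linkage tends to chain elongated clusters together because one close pair is enough to join two groups, while complete linkage favours compact clusters.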
30
Strategies for clustering
Average linkage clustering: nodes are represented by the average of the vectors from the two component nodes, and the average pairwise distance within the newly formed cluster is thus minimized.
31
Average Linkage Clustering
Once we have decided that two genes (or nodes) should join to make a new node, how do we define the contents of the new node? X = (1, 4, 2, -1), Y = (3, 2, -2, -3), Avg(X,Y) = (2, 3, 0, -2). This makes life easy: avg( avg(I,J), avg(K,L) ) = avg(I,J,K,L). The identity holds as written because both inner averages cover the same number of genes.
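The merge step and the "many nodes down to one" loop can be sketched together in Python. This is one possible reading of the slides, not the Cluster program's implementation: it assumes Euclidean distance between node vectors, and it weights each average by node size (an addition not on the slide) so that a node's vector stays equal to the mean of all its member genes even when the two parts differ in size. The genes P and Q are invented; X and Y are the vectors from the slide.

```python
import math
from itertools import combinations

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def merge(node_a, node_b):
    """Join two nodes; a node is (vector, number_of_genes, tree)."""
    (vec_a, size_a, tree_a), (vec_b, size_b, tree_b) = node_a, node_b
    size = size_a + size_b
    # Size-weighted average (assumption): equals the plain average
    # when both nodes contain the same number of genes.
    vec = tuple((a * size_a + b * size_b) / size for a, b in zip(vec_a, vec_b))
    return vec, size, (tree_a, tree_b)

def cluster(genes):
    """Agglomerate a dict of name -> vector into a single root node."""
    nodes = [(tuple(vec), 1, name) for name, vec in genes.items()]
    while len(nodes) > 1:
        # Find the pair of nodes whose vectors are closest.
        i, j = min(combinations(range(len(nodes)), 2),
                   key=lambda p: euclidean(nodes[p[0]][0], nodes[p[1]][0]))
        merged = merge(nodes[i], nodes[j])
        nodes = [n for k, n in enumerate(nodes) if k not in (i, j)] + [merged]
    return nodes[0]

X = (1, 4, 2, -1)
Y = (3, 2, -2, -3)
avg_xy = merge((X, 1, "X"), (Y, 1, "Y"))[0]  # the slide's Avg(X,Y)

genes = {"X": X, "Y": Y, "P": (10, 10, 10, 10), "Q": (11, 10, 10, 10)}
root_vector, root_size, root_tree = cluster(genes)
```

Each pass through the loop removes two nodes and adds one, so n genes always take n - 1 merges to reach a single root, and the nested tuple in `root_tree` records the order of joins in the way a dendrogram would.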
32
Cluster and TreeView by Mike Eisen
Cluster implements various flavours of clustering algorithms, while TreeView provides a graphical output of the files produced by Cluster.