1
A hierarchical unsupervised growing neural network for clustering gene expression patterns
Javier Herrero, Alfonso Valencia & Joaquin Dopazo
Seminar "Neural Networks in Bioinformatics" by Barbara Hammer
Presentation by Nicolas Neubauer, January 25th, 2003
2
Topics
Introduction
– Motivation
– Requirements
Parent Techniques and their Problems
SOTA
Conclusion
3
Motivation
DNA arrays create huge masses of data; clustering may provide a first orientation.
Clustering: group vectors so that similar vectors end up together.
Vectorizing DNA array data:
– Each gene is one point in input space.
– Each condition (i.e., each DNA array) contributes one component of an input vector.
In reality: several thousand genes and several dozen DNA arrays.
(Slide figure: example expression vectors such as (.3 .7 .2 .5), (.2 .6 .3 .5), (.1 .5 .4 .5).)
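As an illustration of this vectorization, a minimal Python/numpy sketch using the example values from the slide (the variable name and the tiny size are made up for the example):

import numpy as np

# Rows = genes, columns = conditions (each DNA array contributes one column);
# each row is one input vector for the clustering algorithm.
expression = np.array([
    [0.3, 0.7, 0.2, 0.5],   # gene 1
    [0.2, 0.6, 0.3, 0.5],   # gene 2
    [0.1, 0.5, 0.4, 0.5],   # gene 3
])
print(expression.shape)     # (number of genes, number of conditions)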
4
Requirements
A clustering algorithm should…
…be tolerant to noise
…capture high-level (inter-cluster) relations
…be able to scale its topology based on
– the topology of the input data
– the user's required level of detail
Clustering is based on a similarity measure
– The biological sense of the distance function must be validated
5
Topics
Introduction
Parent Techniques and their Problems
– Hierarchical Clustering
– SOM
SOTA
Conclusion
6
Hierarchical Clustering
Vectors are arranged in a binary tree
Similar vectors are close in the tree hierarchy
One node for each vector
Quadratic runtime
Result may depend on the order of the data
7
Hierarchical Clustering (II)
A clustering algorithm should…
…be tolerant to noise
– No: the data is used directly to define positions
…capture high-level (inter-cluster) relations
– Yes: the tree structure gives very clear relationships between clusters
…be able to scale topology based on
– the topology of the input data: yes, the tree is built to fit the distribution of the input data
– the user's required level of detail: no, the tree is fully built; it may be reduced by later analysis, but has to be fully built first
8
Self-Organising Maps
Vectors are assigned to clusters
Clusters are defined by neurons, which serve as "prototypes" for their cluster
Many vectors per cluster
Linear runtime
9
Self-Organising Maps (II)
A clustering algorithm should…
…be tolerant to noise
– Yes: data is not aligned directly, but in relation to prototypes, which are averages
…capture high-level (inter-cluster) relations
– ? The paper says no, but what about the neighbourhood of the neurons?
…be able to scale topology based on
– the topology of the input data: no, the number of clusters is set beforehand, and data may be stretched to fit the SOM's topology ("if some particular type of profile is abundant, … this type of data will populate the vast majority of clusters")
– the user's required level of detail: yes, the choice of the number of clusters influences the level of detail
10
Topics
Introduction
Parent Techniques and their Problems
SOTA
– Growing Cell Structure
– Learning Algorithm
– Distance Measures
– Stopping Criteria
Conclusion
11
SOTA Overview
SOTA stands for Self-Organising Tree Algorithm.
SOTA combines the best of hierarchical clustering and SOMs:
– clusters are aligned in a hierarchical structure
– cluster prototypes are created in a SOM-like way
New idea: Growing Cell Structures
– The topology is built up incrementally as the data requires it.
12
Growing Cell Structures
The topology consists of
– cells, the clusters, and
– nodes, the connections between the cells.
Cells can become nodes and get two daughter cells.
Result: a binary tree.
SOTA: each cell has a codebook serving as prototype for its cluster.
Good splitting criteria: the topology adapts to the data.
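As a rough illustration (not the authors' implementation), such a growing binary tree could be represented like this in Python; the class name Cell and its fields are assumptions made for this sketch:

import numpy as np

class Cell:
    """Illustrative node of a SOTA-style binary tree: a leaf is a cell
    (cluster prototype); an internal entry is a former cell that was split."""
    def __init__(self, codebook, parent=None):
        self.codebook = np.asarray(codebook, dtype=float)  # prototype vector
        self.parent = parent
        self.children = []   # empty for cells, two daughter cells for nodes
        self.patterns = []   # indices of the patterns currently assigned here

    def is_cell(self):
        return not self.children   # leaves are cells, inner entries are nodes

    def split(self):
        """Turn this cell into a node; both daughters inherit the codebook."""
        self.children = [Cell(self.codebook.copy(), parent=self),
                         Cell(self.codebook.copy(), parent=self)]
        self.patterns = []
        return self.children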
13
Learning Algorithm
Repeat cycle
  Repeat epoch
    For each pattern:
      Find winning cell
      Adapt cells
  Until updating-finished()
  Split the cell containing the most heterogeneity
Until all-finished()
Finding the winning cell: compare pattern P_j to each cell C_i; the cell for which d(P_j, C_i) is smallest wins.
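A minimal sketch of the winner search, reusing the illustrative Cell class from the previous sketch and assuming a Euclidean d as a placeholder for the distance function:

import numpy as np

def find_winner(pattern, cells):
    """Return the cell C_i for which d(P_j, C_i) is smallest."""
    return min(cells, key=lambda cell: np.linalg.norm(pattern - cell.codebook))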
14
Learning Algorithm (II): Adapting cells
C_i(t+1) = C_i(t) + n·(P_j − C_i(t))
Move the cell in the direction of the pattern, with a learning factor n depending on proximity.
Only three cells (at most) are updated:
– the winner cell,
– its ancestor cell, and
– its sister cell,
with n_winner > n_ancestor > n_sister.
If the sister is no longer a cell but a node, only the winner is adapted.
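A sketch of this adaptation step under the same assumptions; the concrete learning factors below are placeholders, not values from the paper:

def adapt_cells(pattern, winner, n_winner=0.1, n_ancestor=0.05, n_sister=0.01):
    """Move the winner towards the pattern; if the winner's sister is still a
    cell, also move the ancestor and the sister (with smaller factors)."""
    parent = winner.parent
    winner.codebook += n_winner * (pattern - winner.codebook)
    if parent is not None:
        sister = next(c for c in parent.children if c is not winner)
        if sister.is_cell():        # otherwise only the winner is adapted
            parent.codebook += n_ancestor * (pattern - parent.codebook)
            sister.codebook += n_sister * (pattern - sister.codebook)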
15
Learning Algorithm (III): When is updating finished?
Each pattern "belongs" to its winner cell from this epoch, so each cell has a set of patterns assigned to it.
The resource R_i of a cell is the average distance between the cell's codebook and its assigned patterns.
The sum of all R_i is the error ε_t of epoch t.
Stop repeating epochs when |(ε_t − ε_{t−1}) / ε_{t−1}| < E.
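A sketch of the resource, the epoch error and the convergence test, again assuming Euclidean distance and the illustrative structures from the sketches above:

import numpy as np

def resource(cell, patterns):
    """Average distance between a cell's codebook and its assigned patterns."""
    assigned = [patterns[i] for i in cell.patterns]
    if not assigned:
        return 0.0
    return float(np.mean([np.linalg.norm(p - cell.codebook) for p in assigned]))

def epoch_error(cells, patterns):
    """Error of an epoch: the sum of the resources R_i of all cells."""
    return sum(resource(cell, patterns) for cell in cells)

def updating_finished(err_t, err_prev, E=1e-6):
    """Stop repeating epochs once the relative error change is below E."""
    return abs((err_t - err_prev) / err_prev) < E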
16
Learning Algorithm (IV): Splitting
A new cluster is created. It is most efficient to split the cell whose patterns are most heterogeneous.
Measures of heterogeneity:
– Resource: mean distance between the patterns and the cell
– Variability: maximum distance between patterns
The cell turns into a node; its two daughter cells inherit the mother's codebook.
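A sketch of the splitting step; variability is interpreted here as the maximum pairwise distance among a cell's patterns, and resource is the helper defined in the previous sketch:

from itertools import combinations
import numpy as np

def variability(cell, patterns):
    """Maximum distance between any two patterns assigned to the cell."""
    assigned = [patterns[i] for i in cell.patterns]
    if len(assigned) < 2:
        return 0.0
    return max(np.linalg.norm(a - b) for a, b in combinations(assigned, 2))

def split_most_heterogeneous(cells, patterns, measure=resource):
    """Turn the most heterogeneous cell into a node with two daughter cells."""
    target = max(cells, key=lambda cell: measure(cell, patterns))
    return target.split()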
17
Learning Algorithm (V): When to stop iterating cycles?
– When each pattern has its own cell
– When a maximum number of nodes is reached
– When the maximum resource or variability value drops below a certain threshold
(see the stopping criteria below for a more sophisticated way to calculate such a threshold)
20
Distance Measures
What does d(x, y) really look like? The distance function has to capture biological similarity.
Euclidean distance: d(x, y) = √Σ_i (x_i − y_i)²
Pearson correlation: d(x, y) = 1 − r, with
r = Σ_i (e_xi − ê_x)(e_yi − ê_y) / (n · s_ex · s_ey), −1 ≤ r ≤ 1
Empirical evidence suggests that the Pearson correlation better captures similarity in DNA array data.
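Both distance measures as a short Python/numpy sketch; the function names are chosen for the example:

import numpy as np

def euclidean_distance(x, y):
    """d(x, y) = square root of the sum of squared component differences."""
    return float(np.linalg.norm(np.asarray(x, float) - np.asarray(y, float)))

def pearson_distance(x, y):
    """d(x, y) = 1 - r, where r is the Pearson correlation of the two profiles."""
    r = np.corrcoef(np.asarray(x, float), np.asarray(y, float))[0, 1]
    return float(1.0 - r)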
21
SOTA Evaluation
A clustering algorithm should…
…be tolerant to noise
– Yes: data is averaged via codebooks, just as in a SOM
…capture high-level (inter-cluster) relations
– Yes: hierarchical structure, as in hierarchical clustering
…be able to scale topology based on
– the topology of the input data: yes, the tree is extended to meet the distribution of variance in the input data
– the user's required level of detail: yes, the tree can be adjusted to the desired level of detail; the criteria may also be set to meet certain confidence levels
22
Stopping Criteria
What we are looking for: "an upper level of distance at which two genes can be considered to be similar at their profile expression levels"
The distribution of distances also reflects non-biological characteristics of the data:
– many points with few components cause a lot of high correlations
23
Stopping Criteria (II)
Idea:
– If we knew the distribution of distances in purely random data,
– a confidence level α could be given,
– meaning that observing a given distance between two unrelated genes is no more probable than α.
Problem:
– We cannot know the random distribution: we only know the real distribution, which is partly due to random properties and partly due to "real" correlations.
Solution:
– Approximation by shuffling
24
Stopping Criteria (III)
Shuffling:
– For each pattern, the components are randomly shuffled.
– The correlation structure is destroyed.
– The number of points, the ranges of values and the frequencies of values are conserved.
Claim: the distribution of random distances in this shuffled data approximates the distribution of random distances in the real data.
Conclusion: if p(corr > a) ≤ 5% in the shuffled data, then finding corr > a in the real data is meaningful with 95% confidence.
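A minimal sketch of this shuffling procedure: shuffle each profile's components independently, collect correlations of randomly sampled gene pairs, and take the 95th percentile as the threshold. Sampling pairs rather than enumerating all of them is a shortcut made here for the sketch, not the paper's exact procedure:

import numpy as np

rng = np.random.default_rng(0)

def correlation_threshold(data, confidence=0.95, n_pairs=10000):
    """Estimate the correlation value exceeded by only (1 - confidence) of
    the gene pairs in component-shuffled, i.e. random, data."""
    shuffled = np.array([rng.permutation(row) for row in data])  # destroys correlations
    n_genes = len(shuffled)
    corrs = []
    for _ in range(n_pairs):
        i, j = rng.choice(n_genes, size=2, replace=False)
        corrs.append(np.corrcoef(shuffled[i], shuffled[j])[0, 1])
    return float(np.quantile(corrs, confidence))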
25
Only 5% of the pairs in the shuffled data have a correlation > 0.178. Choosing 0.178 as the threshold, there is 95% confidence that genes ending up in the same cluster do so for biological rather than statistical reasons.
26
Topics
Introduction
Parent Techniques and their Problems
SOTA
Conclusion
– Additional nice properties
– Summary of differences compared to the parent techniques
27
Additional Properties
Since patterns do not have to be compared to each other, the runtime is approximately linear in the number of patterns
– like a SOM
– hierarchical clustering, in contrast, uses a distance matrix relating each pattern to every other pattern
The cells' vectors approach the average of the assigned data points very closely.
28
Summary of SOTA
Compared to SOMs:
– SOTA builds up a topology that reflects higher-order relations
– the level of detail can be defined very flexibly
– nicer topological properties
– adding new data into an existing tree would be problematic (?)
Compared to standard hierarchical clustering:
– SOTA is more robust to noise
– has better runtime properties
– has a more flexible concept of a cluster