Published by Georgiana Byrd. Modified over 9 years ago.
1
Robust methodologies for partition clustering
Paulo Lisboa, Terence Etchells, Ian Jarman and Simon Chambers
Computing and Mathematical Sciences, Liverpool John Moores University
2
Overview
Partition clustering: critique
Decomposition of the covariance matrix
Landscape mapping of cluster solutions
Validation for two synthetic data sets and metabolic sub-typing
3
Bioinformatics: Nottingham Tenovus Primary Breast Carcinoma Series
Consecutive series of 1,944 cases of primary operable invasive breast cancer (n=1,076 with all markers present)
Patients presenting during 1986-98
Protein expression comprising 25 immunohistochemical markers related to tumour malignancy, derived through high-throughput protein expression using TMA
Abd El-Rehim et al, Int J Cancer, 116, 340-350, 2005.
4
Partition clustering: relevance to bioinformatics
Markers shown: C-erbB-2, p53, PgR, ER, CK 5/6, BRCA1
5
Partition clustering: open issues
Identify a suitable algorithm: model-based or model-free? Hierarchical, K-means, PAM?
Return {S_a, ..., S_z} solutions
Validate and interpret each solution
K-means:
i. Assume #K
ii. Initialise #N?
iii. Sort by optimality?
iv. Select best for #K?
v. Select #K(s)?
vi. Single cluster or ensemble?
6
Decomposition of the scatter matrix
Scatter matrices
Separation index:
Figure labels: S_B, S_W1, S_W2
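The formulas on this slide did not survive in the transcript. As a rough illustration only, the sketch below assumes the textbook decomposition S_T = S_W + S_B (total scatter equals within-cluster plus between-cluster scatter) and checks it numerically on a toy data set. The function names, the toy data and the trace ratio used as a stand-in separation index are assumptions for illustration, not the authors' definitions.

```python
# Sketch: within/between decomposition of the total scatter matrix.
# Assumption: standard definitions, S_T = S_W + S_B. The trace ratio below is
# only an illustrative stand-in for the slide's separation index.

def mean_vector(rows):
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def scatter(rows, centre):
    """Sum of outer products of deviations from `centre`."""
    d = len(centre)
    s = [[0.0] * d for _ in range(d)]
    for row in rows:
        diff = [x - c for x, c in zip(row, centre)]
        for i in range(d):
            for j in range(d):
                s[i][j] += diff[i] * diff[j]
    return s

def decompose(data, labels):
    """Return (S_T, S_W, S_B) for the given cluster labels."""
    grand = mean_vector(data)
    d = len(grand)
    s_w = [[0.0] * d for _ in range(d)]
    s_b = [[0.0] * d for _ in range(d)]
    for k in sorted(set(labels)):
        members = [x for x, l in zip(data, labels) if l == k]
        centre = mean_vector(members)
        part = scatter(members, centre)
        shift = [c - g for c, g in zip(centre, grand)]
        for i in range(d):
            for j in range(d):
                s_w[i][j] += part[i][j]
                s_b[i][j] += len(members) * shift[i] * shift[j]
    return scatter(data, grand), s_w, s_b

def trace(m):
    return sum(m[i][i] for i in range(len(m)))

# Toy data: two tight groups of three points each (hypothetical values).
data = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0], [5.0, 6.0]]
labels = [0, 0, 0, 1, 1, 1]
s_t, s_w, s_b = decompose(data, labels)
ratio = trace(s_b) / trace(s_w)  # larger means better separated clusters
print(ratio)
```

A trace ratio is not invariant to rescaling of the variables, so it should not be read as the invariant index the following slides refer to.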
7
Decomposition of the scatter matrix
Invariant separation matrix and index
Separation index:
Figure labels: S_B, S_W1, S_W2
8
Figure labels: a_1, a_2, a_3
N.B. If |S_T| = 0, project onto the subspace of cohort means
9
Figure labels: ã_1, ã_2, ã_3
Theorem: [...] is invariant to dimensionality reduction under Mahalanobis rotations
10
K-means clustering
11
Adaptive Resonance Theory (ART) clustering
13
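Concordance measure
Cross-tabulation of cluster membership between two solutions: rows 1…N, columns 1…M, with cell counts O_11 … O_1M in the first row through O_N1 … O_NM in the last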
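The slide gives only the layout of the cross-tabulation, not the formula. The sketch below assumes that the concordance C_V used later in the deck is Cramér's V computed from the observed counts O_ij; that reading, the function name and the toy tables are assumptions for illustration.

```python
# Sketch: Cramér's V from a cross-tabulation of two cluster solutions.
# Assumption: C_V in the slides denotes Cramér's V; the transcript does not
# state the formula explicitly.
import math

def cramers_v(table):
    """table[i][j] = number of cases in cluster i of one solution and cluster j of the other."""
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_tot[i] * col_tot[j] / n
            if expected > 0:
                chi2 += (observed - expected) ** 2 / expected
    # Only non-empty rows and columns count towards the normalising dimension.
    k = min(sum(1 for t in row_tot if t > 0), sum(1 for t in col_tot if t > 0))
    if k < 2:
        return 0.0
    return math.sqrt(chi2 / (n * (k - 1)))

identical = cramers_v([[10, 0], [0, 10]])   # same partition up to relabelling
unrelated = cramers_v([[5, 5], [5, 5]])     # statistically independent partitions
print(identical, unrelated)
```

Because the statistic depends only on the counts, it is unaffected by how the clusters in either solution happen to be numbered.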
14
Optimality principle
Reproducibility under repeated initialisations, with:
Best separation: max(J)
Best concordance: max(C_V)
i. N initialisations
ii. Sort by J
iii. Select top p%
iv. Calculate pairwise C_V
v. Retain med(C_V)
vi. Plot (J, med_C_V)
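Steps i to v above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: J is replaced by a simple between/within trace ratio, C_V is taken to be Cramér's V, k-means is a plain Lloyd iteration, and every function name, parameter and data point is hypothetical.

```python
# Sketch of the J / median-C_V landscape for one value of K.
# Assumptions: J = trace(S_B)/trace(S_W) as a stand-in index, C_V = Cramér's V.
import math
import random
from collections import Counter
from statistics import median

def dist2(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def centroid(pts):
    return tuple(sum(c) / len(pts) for c in zip(*pts))

def kmeans(points, k, rng, iters=100):
    """Lloyd's algorithm from one random initialisation; returns the labels."""
    centres = rng.sample(points, k)
    labels = None
    for _ in range(iters):
        new = [min(range(k), key=lambda c: dist2(p, centres[c])) for p in points]
        if new == labels:
            break
        labels = new
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centres[c] = centroid(members)
    return labels

def separation(points, labels):
    grand = centroid(points)
    within = between = 0.0
    for c in set(labels):
        members = [p for p, l in zip(points, labels) if l == c]
        m = centroid(members)
        within += sum(dist2(p, m) for p in members)
        between += len(members) * dist2(m, grand)
    return between / within if within > 0 else math.inf

def cramers_v(a, b):
    n = len(a)
    joint, ra, cb = Counter(zip(a, b)), Counter(a), Counter(b)
    chi2 = sum((joint[(i, j)] - ra[i] * cb[j] / n) ** 2 / (ra[i] * cb[j] / n)
               for i in ra for j in cb)
    k = min(len(ra), len(cb))
    return math.sqrt(chi2 / (n * (k - 1))) if k > 1 else 0.0

def landscape_points(points, k, n_init=20, top_frac=0.5, seed=0):
    rng = random.Random(seed)
    runs = [kmeans(points, k, rng) for _ in range(n_init)]        # i. N initialisations
    runs.sort(key=lambda l: separation(points, l), reverse=True)  # ii. sort by J
    kept = runs[:max(2, int(n_init * top_frac))]                  # iii. top p%
    out = []
    for i, li in enumerate(kept):
        others = [cramers_v(li, lj) for j, lj in enumerate(kept) if j != i]  # iv.
        out.append((separation(points, li), median(others)))      # v. med(C_V)
    return out  # vi. each (J, med C_V) pair is one point on the plot

# Toy data: three well-separated blobs (hypothetical).
data_rng = random.Random(1)
points = [(cx + data_rng.gauss(0, 0.3), cy + data_rng.gauss(0, 0.3))
          for cx, cy in [(0, 0), (4, 0), (2, 4)] for _ in range(20)]
landscape = landscape_points(points, k=3)
for j_value, med_cv in landscape[:3]:
    print(round(j_value, 2), round(med_cv, 3))
```

Repeating the call for each candidate K and plotting all returned pairs gives one point per retained solution, which is how the landscape maps later in the deck appear to be built.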
15
Synthetic data (6 clusters) Fig 1(a) Fig 1(b)
16
Synthetic data (6 clusters)
18
Synthetic data (10 cohorts)
20
Cohort means, covariance matrix entries (i,j) and cohort sizes N

       Mean                    Covariance Matrix (i,j)
       x       y       z       11      12      13      21      22      23      31      32      33      N
C1    -0.799  -1.011  -3.336   0.336   0.044   0.074   0.044   0.371   0.210   0.074   0.210   0.582   64
C2    -0.441  -0.569  -2.331   0.428   0.060  -0.002   0.060   0.123   0.157  -0.002   0.157   0.648   42
C3     0.649  -0.344  -4.154   0.620   0.023  -0.035   0.023   0.137   0.070  -0.035   0.070   0.446   61
C4     1.077   0.072  -2.815   0.366  -0.002   0.076  -0.002   0.043   0.104   0.076   0.104   0.563   32
C5    -0.390  -0.242   0.256   0.536   0.013   0.031   0.013   0.348  -0.117   0.031  -0.117   0.689   197
C6    -1.358  -0.658   1.639   0.309  -0.060  -0.055  -0.060   0.245  -0.013  -0.055  -0.013   0.532   131
C7     1.261   0.125   0.862   0.323   0.017   0.027   0.017   0.386  -0.060   0.027  -0.060   0.403   163
C8    -0.593   3.024  -0.498   0.776   0.033   0.175   0.033   0.491   0.003   0.175   0.003   0.695   97
C9     0.251  -0.539  -0.530   0.711  -0.025   0.055  -0.025   0.352  -0.081   0.055  -0.081   0.576   106
C10    0.374  -0.267   1.973   0.390  -0.097   0.041  -0.097   0.343  -0.014   0.041  -0.014   0.322   183

Pairwise values between cohorts (lower triangle)

       C1      C2      C3      C4      C5      C6      C7      C8      C9
C2     0.7805  .
C3     1.2105  1.4828  .
C4     1.5054  1.1924  1.0687  .
C5     2.4975  1.7636  3.0649  2.3119  .
C6     3.3913  2.8294  4.476   3.8029  1.1757  .
C7     3.2516  2.5575  3.7002  2.7302  1.2151  2.2233  .
C8     2.9776  2.4341  3.0901  2.4774  2.025   2.6082  2.2314  .
C9     2.0388  1.2969  2.4543  1.6846  0.7109  1.8176  1.2393  2.2086  .
C10    3.7087  3.0487  4.4727  3.5977  1.2717  1.4141  1.233   2.5497  1.6952

Solution with 8 Clusters (columns) against original cohorts (rows)

Cohort   2     4     7     1     3     5     8     6     Total
1        58    2     .     4     .     .     .     .     64
2        28    1     .     13    .     .     .     .     42
3        11    50    .     .     .     .     .     .     61
4        1     26    .     5     .     .     .     .     32
5        .     .     109   43    13    16    15    1     197
9        2     .     23    64    .     14    3     .     106
6        .     .     25    .     103   .     3     .     131
7        .     .     4     4     .     134   21    .     163
10       .     .     10    .     16    9     148   .     183
8        .     .     1     .     .     .     .     96    97
Total    100   79    172   133   132   173   190   97    1076
21
Synthetic data – mixing structure (Sammon Map)
22
Synthetic data – Visualisation in data space
23
Synthetic data (10 cohorts) [figure]
24
Max J; SeCo; Max C_V
25
Bioinformatics: Nottingham Tenovus Primary Breast Carcinoma Series
Consecutive series of 1,944 cases of primary operable invasive breast cancer (n=1,076 with all markers present)
Patients presenting during 1986-98
Protein expression comprising 25 immunohistochemical markers related to tumour malignancy, derived through high-throughput protein expression using TMA
Abd El-Rehim et al, Int J Cancer, 116, 340-350, 2005.
26
Marginal distributions
27
Landscape map (SeCo)
28
Stability index (C_V)
29
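Cross-tabulation of solution A (columns) against solution B (rows)

B \ A    1     2     3     8     7     5     6     4     Total
1        118   4     0     1     6     1     0     12    142
5        21    125   0     33    0     0     0     0     179
7        37    0     122   4     0     2     0     2     167
6        0     0     29    145   0     0     0     0     174
8        0     2     6     0     98    0     0     0     106
2        0     0     6     0     0     94    1     5     106
3        1     0     0     0     1     0     64    42    108
4        0     0     0     0     1     0     61    32    94
Total    177   131   163   183   106   97    126   93    1076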
30
Landscape map (SeCo)
31
Cluster hierarchy (1)
32
Cluster hierarchy (2)
33
Solution A
35
Solution B
36
Solution A
37
Sub-type profiling Clusters A Clusters B Luminal N Luminal New 2
38
Sub-type profiling Clusters A Clusters B HER2 Luminal A
39
Sub-type profiling Basal p53 - Basal muc1 - Basal muc1 + Basal p53 + Clusters A Clusters B
40
Consistency with consensus clustering
CoRe 5 Clusters Solution (columns): 2 3 1 4 5
Clusters in Green et al 2007 (rows): C1 12940366; C2 1138077; C3 141137162; C4 0065170; C5 0056130; C6 1813730; NC 587254119110
41
Molecular sub-typing
43
Summary
Partition clustering: critique
Decomposition of the covariance matrix
Landscape mapping of cluster solutions
Validation for two synthetic data sets and metabolic sub-typing
44
Ferrara data (n=633)
Variables: er, pr, PROLIND, neu, P53
45
Ferrara data (n=633)
46
Ferrara data (n=633)
SeCo method (columns) against Ambrogi et al [7] (rows)

Ambrogi   1     2     3     4     5     Total
1         213   13    0     4     26    256
2         0     203   0     1     3     207
3         0     1     68    0     22    91
4         0     2     0     77    0     79
Total     213   219   68    82    51    633
47
JMU Cluster 3/5 JMU Cluster 4/5 JMU Cluster 5/5 JMU Cluster 1/5 JMU Cluster 2/5
48
Ferrara data (n=633)