Published by Suhendra Lesmono. Modified over 5 years ago.
1
Self-organizing map: numeric vectors and sequence motifs
Xuhua Xia
2
Co-expressed genes
Fig. A subset of six co-expressed genes from yeast gene expression data (Cho et al., 1998). These genes have similar expression profiles and form a tight cluster in a gene expression tree built from distances that measure differences in expression profiles. Source: Xia, Bioinformatics and the Cell, Springer.
Slide 2
3
Distances and scale effect
Euclidean distance: (equation not preserved in this transcript).
Mahalanobis distance = Euclidean distance after data standardization: (equation not preserved in this transcript).
Slide 3
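The scale effect can be illustrated with a minimal Python sketch. The three profiles below are hypothetical, and the helper names `euclidean` and `standardize` are illustrative only. The sketch standardizes each variable separately (z-scores with the sample standard deviation), which is the sense in which the slide equates the two distances; the general Mahalanobis distance also uses the covariances between variables, which this sketch ignores.

```python
import math

def euclidean(x, y):
    """Plain Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def standardize(rows):
    """Column-wise z-scores (sample standard deviation, n - 1)."""
    n = len(rows)
    cols = list(zip(*rows))
    means = [sum(c) / n for c in cols]
    sds = [math.sqrt(sum((v - m) ** 2 for v in c) / (n - 1))
           for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, sds)] for row in rows]

# Hypothetical profiles: the second variable is on a much larger scale.
profiles = [[1.0, 100.0], [2.0, 300.0], [3.0, 200.0]]

raw = euclidean(profiles[0], profiles[1])   # dominated by the second variable
z = standardize(profiles)
scaled = euclidean(z[0], z[1])              # both variables now contribute

print(round(raw, 3), round(scaled, 3))
```

Before standardization the distance is driven almost entirely by the large-scale variable; after standardization both variables carry comparable weight.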
4
Clustering approaches
Genes whose expression changes synchronously have short distances between them. We need to identify these genes by clustering them together. Two approaches: (1) the conventional approach (single-linkage, complete-linkage, average-linkage; UPGMA is an average-linkage algorithm); (2) the artificial neural network approach (SOM).
Slide 4
5
UPGMA
Distance matrix among Gene1, Gene2, Gene3, Gene4 and Gene5 (values not preserved in this transcript).
Gene1 and Gene2 are joined first: (1,2),(3,4,5)
D12,3 = (D1,3 + D2,3)/2 = (value not preserved)
D12,4 = (D1,4 + D2,4)/2 = (value not preserved)
D12,5 = (D1,5 + D2,5)/2 = 0.189
Reduced distance matrix among Gene12, Gene3, Gene4 and Gene5 (values not preserved in this transcript).
Gene3 is joined next: ((1,2),3),(4,5)
6
UPGMA
Distance matrix among Gene1, Gene2, Gene3, Gene4 and Gene5 (values not preserved in this transcript).
D123,4 = (D1,4 + D2,4 + D3,4)/3 = (value not preserved)
D123,5 = (D1,5 + D2,5 + D3,5)/3 = 0.185
Reduced distance matrix among Gene123, Gene4 and Gene5 (values not preserved in this transcript).
D1234,5 = (D1,5 + D2,5 + D3,5 + D4,5)/4 = 0.184
Final topology: ((((1,2),3),4),5)
Slide 6
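The averaging rule above can be sketched in Python. The distance matrix below is hypothetical: the original values did not survive in this transcript, so the numbers were chosen only to be consistent with the surviving ones (D12 = 0.015, D12,5 = 0.189, D123,5 = 0.185, D1234,5 = 0.184). The function name `upgma` and its data layout are likewise illustrative, not the author's implementation.

```python
def upgma(names, dist):
    """Average-linkage clustering; returns (nested-tuple tree, list of merges)."""
    size = {n: 1 for n in names}      # number of genes in each cluster
    tree = {n: n for n in names}      # nested-tuple topology of each cluster
    d = {frozenset(pair): v for pair, v in dist.items()}
    merges = []
    while len(size) > 1:
        pair = min(d, key=d.get)      # closest pair of clusters
        a, b = sorted(pair, key=lambda lab: (-len(lab), lab))
        merges.append((a, b, d[pair]))
        new = a + b
        for c in size:
            if c in pair:
                continue
            # Size-weighted mean equals the plain mean over all gene pairs,
            # e.g. D123,4 = (D1,4 + D2,4 + D3,4)/3.
            d[frozenset((new, c))] = (
                size[a] * d[frozenset((a, c))] + size[b] * d[frozenset((b, c))]
            ) / (size[a] + size[b])
        d = {k: v for k, v in d.items() if a not in k and b not in k}
        size[new] = size.pop(a) + size.pop(b)
        tree[new] = (tree.pop(a), tree.pop(b))
    (root,) = tree.values()
    return root, merges

# Hypothetical pairwise distances (NOT the slide's original matrix).
names = ["1", "2", "3", "4", "5"]
dist = {
    ("1", "2"): 0.015, ("1", "3"): 0.037, ("2", "3"): 0.039,
    ("1", "4"): 0.120, ("2", "4"): 0.122, ("3", "4"): 0.118,
    ("1", "5"): 0.190, ("2", "5"): 0.188, ("3", "5"): 0.177,
    ("4", "5"): 0.181,
}

tree, merges = upgma(names, dist)
print(tree)
for a, b, dab in merges:
    print(a, b, round(dab, 3))
```

With this assumed matrix the merge order reproduces the topology ((((1,2),3),4),5), and the last merge happens at distance 0.184.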
7
Phylogenetic Relationship from UPGMA
Successive distance matrices: first among Gene1 to Gene5; then among Gene12, Gene3, Gene4 and Gene5; then among Gene123, Gene4 and Gene5 (values not preserved in this transcript).
Slide 7
8
Branch Lengths
D12 = 0.015, giving ((1,2),(3,4,5))
D12,3 = (D1,3 + D2,3)/2 (previous slide)
D12,4 = (D1,4 + D2,4)/2 = (value not preserved)
D12,5 = (D1,5 + D2,5)/2 = 0.189
D123,4 = (D1,4 + D2,4 + D3,4)/3 (previous slide)
D123,5 = (D1,5 + D2,5 + D3,5)/3 = 0.185
D1234,5 = (D1,5 + D2,5 + D3,5 + D4,5)/4 = 0.184
Topologies in order of construction: ((1,2),(3,4,5)); (((1,2),3),(4,5)); ((((1,2),3),4),5)
Node heights shown on the tree of Gene1 to Gene5: 0.0075, 0.019, 0.06, 0.092
Trees with branch lengths:
((1:0.0075,2:0.0075),(3,4,5))
(((1:0.0075,2:0.0075):0.0115,3:0.019),(4,5))
((((1:0.0075,2:0.0075):0.0115,3:0.019):0.041,4:0.06):0.032,5:0.092)
The reference book (Xia 2007) gives a different example. Make sure to go through it to gain a better understanding.
Slide 8
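The branch lengths follow from the node heights: each node sits at half the distance at which its two clusters were joined, and a branch is the parent height minus the child height. A minimal sketch in Python, using the four node heights from the slide (0.0075, 0.019, 0.06, 0.092); the function `to_newick` and the `(left, right, height)` tuple layout are assumptions made for illustration.

```python
def to_newick(node):
    """Return (newick_text, height) for a leaf name or a (left, right, height) triple."""
    if isinstance(node, str):
        return node, 0.0              # leaves sit at height 0
    left, right, height = node
    parts = []
    for child in (left, right):
        text, child_height = to_newick(child)
        # Branch length = height of this node minus height of the child.
        parts.append(f"{text}:{round(height - child_height, 4):g}")
    return "(" + ",".join(parts) + ")", height

# Node heights taken from the slide (half of each joining distance).
tree = (((("1", "2", 0.0075), "3", 0.019), "4", 0.06), "5", 0.092)

newick, root_height = to_newick(tree)
print(newick)
```

For example, the internal branch above the (1,2) cluster is 0.019 − 0.0075 = 0.0115, which keeps Gene1, Gene2 and Gene3 equally far from their common node.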
9
UPGMA Result Slide 9
10
SOM
A grid of "artificial neurons". Training data: numeric vectors or sequence motifs. A distance or similarity index: between numeric vectors, or between sequence motifs. An algorithm to update the neurons in response to input (the learning process).
Slide 10
11
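Data
Rows that survived extraction intact:
Gene  T0  T10  T20  Sum
1     93  76   87   256
2     80  81   85   246
4     69  74   96   239
15    66  64   63   193
Rows with cells lost in extraction (surviving expression values in their original order, then Sum; the column of each surviving value is uncertain):
Gene 3: 89, 88; Sum 262
Gene 5: 95; Sum 277
Gene 6: 65; Sum 237
Gene 7: none; Sum 268
Gene 8: 78; Sum 255
Gene 9: 97; Sum 264
Gene 10: 67, 55; Sum 218
Gene 11: 91, 90; Sum 276
Gene 12: 72; Sum 215
Gene 13: 79, 94; Sum 251
Gene 14: none; Sum 250
Slide 11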
12
Data and SOM grid
Gene expression data (Gene, T0, T10, T20, Sum) as on Slide 11.
3 by 3 SOM grid with random initial vectors (each node holds T0, T10, T20):
Row  Column 1            Column 2            Column 3
1    2.2, 11.5, 33.4     41.9, 27.6, 0.8     6, 63.8, 51.2
2    28.5, 30.2, 47      28.7, 51.3, 9       61.6, 38.2, 17.9
3    40.5, 76.1, 71.2    79.8, 94.6, 23.2    76.2, 40.9, 23.9
Slide 12
13
Training
We now randomly choose one gene, and suppose we happen to have chosen Gene 4 with T0, T10 and T20 equal to 69, 74 and 96, respectively (Table 6-1). The Euclidean distances (designated hereafter as d) between this gene and each of the 9 nodes (Table 6-3) show that Gene 4 is closest to node (3,1), with d = 37.8. Node (3,1) is the winning node.
Gene expression data and initial SOM grid as on Slide 12.
All 9 distances:
Row  Column 1  Column 2  Column 3
1    111.0     109.0     78.0
2    77.2      98.5      86.2
3    37.8      76.4      79.6
Slide 13
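The search for the winning node can be sketched in Python as follows. The grid values and Gene 4 are taken from the slides; the dictionary layout keyed by (row, column) is an assumption made for illustration. Because the grid values are rounded to one decimal, recomputed distances may differ slightly from the tabulated ones.

```python
import math

# Initial 3 by 3 SOM grid: (row, column) -> (T0, T10, T20).
grid = {
    (1, 1): (2.2, 11.5, 33.4), (1, 2): (41.9, 27.6, 0.8), (1, 3): (6.0, 63.8, 51.2),
    (2, 1): (28.5, 30.2, 47.0), (2, 2): (28.7, 51.3, 9.0), (2, 3): (61.6, 38.2, 17.9),
    (3, 1): (40.5, 76.1, 71.2), (3, 2): (79.8, 94.6, 23.2), (3, 3): (76.2, 40.9, 23.9),
}
gene4 = (69, 74, 96)

# Euclidean distance from the input vector to every node (math.dist needs Python 3.8+).
distances = {node: math.dist(gene4, weights) for node, weights in grid.items()}

# The winning node is the one with the smallest distance.
winner = min(distances, key=distances.get)
print(winner, round(distances[winner], 1))
```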
14
Updating
Learning rate: use 0.5. Input: Gene 4 (69, 74 and 96).
SOM grid before updating (each node holds T0, T10, T20):
Row  Column 1            Column 2            Column 3
1    2.2, 11.5, 33.4     41.9, 27.6, 0.8     6, 63.8, 51.2
2    28.5, 30.2, 47      28.7, 51.3, 9       61.6, 38.2, 17.9
3    40.5, 76.1, 71.2    79.8, 94.6, 23.2    76.2, 40.9, 23.9
SOM grid after updating:
Row  Column 1            Column 2            Column 3
1    2.2, 11.5, 33.4     41.9, 27.6, 0.8     6.0, 63.8, 51.2
2    38.7, 41.1, 59.3    28.7, 51.3, 9.0     61.6, 38.2, 17.9
3    54.7, 75.0, 83.6    77.1, 89.5, 41.4    76.2, 40.9, 23.9
Continue with other vectors and repeat until no more updating is possible.
Slide 14
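A minimal Python sketch of one update step follows. The winner's learning rate of 0.5 is from the slide. The neighbour rate of 0.25, applied only to the nodes directly above, below, left or right of the winner, is an assumption inferred from the before and after grids; the slide does not state it. The function name `update` is illustrative.

```python
# Initial 3 by 3 SOM grid: (row, column) -> (T0, T10, T20).
grid = {
    (1, 1): (2.2, 11.5, 33.4), (1, 2): (41.9, 27.6, 0.8), (1, 3): (6.0, 63.8, 51.2),
    (2, 1): (28.5, 30.2, 47.0), (2, 2): (28.7, 51.3, 9.0), (2, 3): (61.6, 38.2, 17.9),
    (3, 1): (40.5, 76.1, 71.2), (3, 2): (79.8, 94.6, 23.2), (3, 3): (76.2, 40.9, 23.9),
}

def update(grid, x, winner, rate=0.5, neighbor_rate=0.25):
    """Move the winner (and, more weakly, its direct neighbours) toward input x."""
    new_grid = {}
    for (row, col), weights in grid.items():
        steps = abs(row - winner[0]) + abs(col - winner[1])  # grid (city-block) distance
        if steps == 0:
            alpha = rate
        elif steps == 1:
            alpha = neighbor_rate      # ASSUMPTION: inferred from the slide's numbers
        else:
            alpha = 0.0                # all other nodes stay unchanged
        new_grid[(row, col)] = tuple(w + alpha * (xi - w) for w, xi in zip(weights, x))
    return new_grid

updated = update(grid, x=(69, 74, 96), winner=(3, 1))
for node in sorted(updated):
    print(node, [round(v, 1) for v in updated[node]])
```

The winner (3,1) moves halfway toward Gene 4, for example T20: 71.2 + 0.5 × (96 − 71.2) = 83.6. Small differences from the slide in the last digit are expected because the displayed grid values are rounded.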
15
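Trained SOM
Trained grid (each node holds T0, T10, T20):
Row  Column 1            Column 2            Column 3
1    80.7, 77.6, 73.3    74, 77.9, 67.3      69.2, 79.5, 72.2
2    84.4, 81.9, 82.1    82.5, 81.2, 81.4    (see below)
3    (see below)         88.1, 82.9, 91.7    79.9, 79.1, 89.4
Nodes (2,3) and (3,1) together retain only five of their six values, in this order: 78.5, 78.2, 89.6, 85.4, 91.
Distance d from each gene to its winning node (expression values as on Slide 11; the Row and Col columns survive only for Gene 1, which maps to row 3, column 2):
Gene  d
1     9.69
2     4.41
3     6.54
4     13.77
5     6.80
6     17.43
7     4.90
8     10.17
9     6.12
10    23.00
11    6.30
12    6.19
13    4.84
14    13.63
15    16.54
Slide 15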
16
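Sequence as a matrix
Table 2. A matrix representation of sequence “ACCGTTA” (a). The resulting position weight matrix (b) is obtained after adding a pseudocount of 0.01 to each cell in (a), with background frequencies being 0.3, 0.2, 0.2, and 0.3 for A, C, G, and T, respectively.
(a)
    1       2       3       4       5       6       7
A   1       0       0       0       0       0       1
C   0       1       1       0       0       0       0
G   0       0       0       1       0       0       0
T   0       0       0       0       1       1       0
(b)
    1       2       3       4       5       6       7
A   1.695   −4.963  −4.963  −4.963  −4.963  −4.963  1.695
C   −4.379  2.280   2.280   −4.379  −4.379  −4.379  −4.379
G   −4.379  −4.379  −4.379  2.280   −4.379  −4.379  −4.379
T   −4.963  −4.963  −4.963  −4.963  1.695   1.695   −4.963
Slide 16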
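The four distinct values in Table 2b can be reproduced with a short Python sketch, assuming each PWM entry is the base-2 logarithm of the pseudocount-adjusted site frequency divided by the background frequency. That formula is not spelled out on the slide, but it matches the surviving numbers. The helper names `count_matrix` and `pwm` are illustrative.

```python
import math

BACKGROUND = {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}  # from the slide

def count_matrix(seq):
    """One row per nucleotide, one column per site: 1 where the nucleotide occurs."""
    return {nuc: [1 if s == nuc else 0 for s in seq] for nuc in "ACGT"}

def pwm(counts, pseudocount=0.01):
    """log2((count + pseudocount) / column total / background) for every cell."""
    n_sites = len(counts["A"])
    scores = {nuc: [] for nuc in counts}
    for i in range(n_sites):
        total = sum(counts[nuc][i] + pseudocount for nuc in counts)
        for nuc in counts:
            freq = (counts[nuc][i] + pseudocount) / total
            scores[nuc].append(math.log2(freq / BACKGROUND[nuc]))
    return scores

counts = count_matrix("ACCGTTA")
scores = pwm(counts)
for nuc in "ACGT":
    print(nuc, [round(v, 3) for v in scores[nuc]])
```

For instance, A at site 1 gives log2((1.01/1.04)/0.3) ≈ 1.695, while an unobserved A gives log2((0.01/1.04)/0.3) ≈ −4.963.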
17
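Learning
Table 3. Updating the node in Table 2a by the new input sequence “GCCATTA” and the resulting position weight matrix (PWM) obtained in (b) after adding a pseudocount of 0.01 to each cell in (a).
(a)
    1       2       3       4       5       6       7
A   1       0       0       1       0       0       2
C   0       2       2       0       0       0       0
G   1       0       0       1       0       0       0
T   0       0       0       0       2       2       0
(b)
    1       2       3       4       5       6       7
A   0.723   −5.935  −5.935  0.723   −5.935  −5.935  1.716
C   −5.350  2.301   2.301   −5.350  −5.350  −5.350  −5.350
G   1.308   −5.350  −5.350  1.308   −5.350  −5.350  −5.350
T   −5.935  −5.935  −5.935  −5.935  1.716   1.716   −5.935
Slide 17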
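The learning step for sequence motifs can be sketched in Python, assuming the node is updated by adding the new sequence's site-specific counts to the node's count matrix and then recomputing the PWM with the same pseudocount (0.01) and background frequencies as in Table 2. This reading is inferred from the surviving values in Table 3b; the function names are illustrative.

```python
import math

BACKGROUND = {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}  # as in Table 2

def count_matrix(seq):
    """One row per nucleotide, one column per site: 1 where the nucleotide occurs."""
    return {nuc: [1 if s == nuc else 0 for s in seq] for nuc in "ACGT"}

def add_sequence(counts, seq):
    """ASSUMED update rule: add the new sequence's counts to the node's counts."""
    new = count_matrix(seq)
    return {nuc: [a + b for a, b in zip(counts[nuc], new[nuc])] for nuc in counts}

def pwm(counts, pseudocount=0.01):
    """log2((count + pseudocount) / column total / background) for every cell."""
    n_sites = len(counts["A"])
    scores = {nuc: [] for nuc in counts}
    for i in range(n_sites):
        total = sum(counts[nuc][i] + pseudocount for nuc in counts)
        for nuc in counts:
            freq = (counts[nuc][i] + pseudocount) / total
            scores[nuc].append(math.log2(freq / BACKGROUND[nuc]))
    return scores

node = count_matrix("ACCGTTA")          # the node from Table 2a
node = add_sequence(node, "GCCATTA")    # learn from the new input sequence
scores = pwm(node)
for nuc in "ACGT":
    print(nuc, node[nuc], [round(v, 3) for v in scores[nuc]])
```

Site 1 now holds one A and one G, so A scores log2((1.01/2.04)/0.3) ≈ 0.723 and G scores log2((1.01/2.04)/0.2) ≈ 1.308, matching the values that survive in Table 3b.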