Self-organizing map
numeric vectors and sequence motifs
Xuhua Xia
xxia@uottawa.ca
http://dambe.bio.uottawa.ca
Co-expressed genes
Fig. 6-1. A subset of six co-expressed genes from yeast gene expression data (Cho et al., 1998). These genes have similar expression profiles and form a tight cluster in a gene expression tree built from distances that measure differences in expression profiles.
Xia 2007. Bioinformatics and the cell. Springer
Slide 2
Distances and scale effect
Euclidean distance:
Mahalanobis distance = Euclidean distance after data standardization:
Slide 3
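As a rough illustration of the scale effect, the sketch below computes the Euclidean distance before and after standardizing each variable to unit variance. The data are hypothetical and the function names are placeholders. It follows the slide in treating per-variable standardization as the rescaling step; the general Mahalanobis distance also accounts for covariance between variables, which this sketch does not attempt.

```python
import math

def euclidean(x, y):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def standardize(rows):
    """Z-score each column (variable) of a list of equal-length rows."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    # Sample standard deviation (n - 1 in the denominator).
    sds = [math.sqrt(sum((v - m) ** 2 for v in c) / (len(c) - 1))
           for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, sds)] for row in rows]

# Hypothetical data: the second variable is on a much larger scale.
data = [[1.0, 100.0], [2.0, 300.0], [3.0, 200.0]]

raw = euclidean(data[0], data[1])   # dominated by the second variable
z = standardize(data)
scaled = euclidean(z[0], z[1])      # both variables now contribute comparably

print(raw, scaled)
```

Before standardization the distance is driven almost entirely by the large-scale variable; afterwards each variable is measured in units of its own standard deviation.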
Clustering approaches
Genes whose expression changes synchronously have short distances between them. We need to identify these genes by clustering them together.
Two approaches:
1. Conventional approach (single linkage, complete linkage, average linkage; UPGMA is an average-linkage algorithm)
2. Artificial neural network approach (SOM)
Slide 4
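The three conventional linkage rules differ only in how they summarize the pairwise distances between two clusters. A minimal sketch, using the Gene1 to Gene3 distances from the UPGMA example later in this deck; the function names are illustrative and not taken from any library.

```python
# Distances between Gene1, Gene2 and Gene3 (values from the UPGMA slides).
D = {(1, 2): 0.015, (1, 3): 0.045, (2, 3): 0.030}

def pair_distances(a, b):
    """All leaf-to-leaf distances between clusters a and b (tuples of genes)."""
    return [D[(min(i, j), max(i, j))] for i in a for j in b]

def single_linkage(a, b):
    return min(pair_distances(a, b))   # closest pair of members

def complete_linkage(a, b):
    return max(pair_distances(a, b))   # farthest pair of members

def average_linkage(a, b):
    d = pair_distances(a, b)
    return sum(d) / len(d)             # mean over all pairs, as in UPGMA

a, b = (1, 2), (3,)
print(single_linkage(a, b), complete_linkage(a, b), average_linkage(a, b))
```

For the cluster (Gene1, Gene2) against Gene3, single linkage takes the smaller of the two distances, complete linkage the larger, and average linkage their mean.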
UPGMA

         Gene1   Gene2   Gene3   Gene4   Gene5
Gene1            0.015   0.045   0.143   0.198
Gene2                    0.030   0.126   0.179
Gene3                            0.092   0.179
Gene4                                    0.179
Gene5

D12,3 = (D1,3 + D2,3)/2 = 0.038
D12,4 = (D1,4 + D2,4)/2 = 0.135
D12,5 = (D1,5 + D2,5)/2 = 0.189

         Gene12  Gene3   Gene4   Gene5
Gene12           0.038   0.135   0.189
Gene3                    0.092   0.179
Gene4                            0.179
Gene5

Tree after joining Gene1 and Gene2: (1,2),(3,4,5)
Tree after joining Gene3: ((1,2),3),(4,5)
UPGMA

         Gene1   Gene2   Gene3   Gene4   Gene5
Gene1            0.015   0.045   0.143   0.198
Gene2                    0.030   0.126   0.179
Gene3                            0.092   0.179
Gene4                                    0.179
Gene5

D123,4 = (D1,4 + D2,4 + D3,4)/3 = 0.120
D123,5 = (D1,5 + D2,5 + D3,5)/3 = 0.185

         Gene123  Gene4   Gene5
Gene123           0.120   0.185
Gene4                     0.179
Gene5

D1234,5 = (D1,5 + D2,5 + D3,5 + D4,5)/4 = 0.184

Final tree: ((((1,2),3),4),5)
Slide 6
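The merging steps on these two slides can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the dictionary `D` holds the slide's distance matrix, and the names `leaf_dist`, `cluster_dist` and `upgma` are placeholders introduced here. The distance between two clusters is taken as the mean of all gene-to-gene distances between them, which is what the D12,3 and D123,4 formulas above compute.

```python
from itertools import combinations

# Pairwise distances from the slide, keyed by (smaller, larger) gene number.
D = {
    (1, 2): 0.015, (1, 3): 0.045, (1, 4): 0.143, (1, 5): 0.198,
    (2, 3): 0.030, (2, 4): 0.126, (2, 5): 0.179,
    (3, 4): 0.092, (3, 5): 0.179,
    (4, 5): 0.179,
}

def leaf_dist(i, j):
    return D[(min(i, j), max(i, j))]

def cluster_dist(a, b):
    """Mean of all leaf-to-leaf distances between clusters a and b."""
    return sum(leaf_dist(i, j) for i in a for j in b) / (len(a) * len(b))

def upgma(leaves):
    """Return the list of merges as (cluster_a, cluster_b, distance)."""
    clusters = [(leaf,) for leaf in leaves]   # each cluster is a tuple of genes
    merges = []
    while len(clusters) > 1:
        # Join the two clusters with the smallest average distance.
        a, b = min(combinations(clusters, 2), key=lambda p: cluster_dist(*p))
        merges.append((a, b, cluster_dist(a, b)))
        clusters = [c for c in clusters if c not in (a, b)]
        clusters.append(tuple(sorted(a + b)))
    return merges

for a, b, d in upgma([1, 2, 3, 4, 5]):
    print(a, "+", b, "at distance", d)
```

The merges come out in the same order as on the slides: Gene1 with Gene2, then Gene3, then Gene4, then Gene5. The distances are unrounded, so 0.0375 here corresponds to the slide's 0.038.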
Phylogenetic Relationship from UPGMA

         Gene1   Gene2   Gene3   Gene4   Gene5
Gene1            0.015   0.045   0.143   0.198
Gene2                    0.030   0.126   0.179
Gene3                            0.092   0.179
Gene4                                    0.179
Gene5

         Gene12  Gene3   Gene4   Gene5
Gene12           0.038   0.135   0.189
Gene3                    0.092   0.179
Gene4                            0.179
Gene5

         Gene123  Gene4   Gene5
Gene123           0.120   0.185
Gene4                     0.179
Gene5

Slide 7
Branch Lengths

Distances:
D12 = 0.015
D12,3 = (D1,3 + D2,3)/2 = 0.038 (previous slide)
D12,4 = (D1,4 + D2,4)/2 = 0.135
D12,5 = (D1,5 + D2,5)/2 = 0.189
D123,4 = (D1,4 + D2,4 + D3,4)/3 = 0.120 (previous slide)
D123,5 = (D1,5 + D2,5 + D3,5)/3 = 0.185
D1234,5 = (D1,5 + D2,5 + D3,5 + D4,5)/4 = 0.184

Topologies at successive steps:
((1,2),(3,4,5))
(((1,2),3),(4,5))
((((1,2),3),4),5)

Node heights (half the distance at which two clusters are joined): 0.0075, 0.019, 0.06, 0.092

Trees with branch lengths (each internal branch is the parent height minus the child height, e.g. 0.019 - 0.0075 = 0.0115):
((1:0.0075,2:0.0075),(3,4,5))
(((1:0.0075,2:0.0075):0.0115,3:0.019),(4,5))
((((1:0.0075,2:0.0075):0.0115,3:0.019):0.041,4:0.06):0.032,5:0.092)

The reference book (Xia 2007) gives a different example. Make sure to go through it to gain a better understanding.
Slide 8
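The branch-length arithmetic can be checked with a short sketch that builds the tree and writes it in Newick format. It assumes the usual UPGMA convention that a new node is placed at half the distance between the two clusters it joins; `DIST`, `average_distance` and `upgma_newick` are names introduced here for illustration. The code works with unrounded distances, so it reports 0.01875 and 0.01125 where the slide, which rounds D12,3 to 0.038 first, shows 0.019 and 0.0115. The order of tips within each pair may also differ from the slide, but the topology is the same.

```python
# Pairwise distances from the UPGMA slides, keyed by (smaller, larger) gene number.
DIST = {
    (1, 2): 0.015, (1, 3): 0.045, (1, 4): 0.143, (1, 5): 0.198,
    (2, 3): 0.030, (2, 4): 0.126, (2, 5): 0.179,
    (3, 4): 0.092, (3, 5): 0.179,
    (4, 5): 0.179,
}

def average_distance(a, b):
    """Mean of all leaf-to-leaf distances between leaf tuples a and b."""
    total = sum(DIST[(min(i, j), max(i, j))] for i in a for j in b)
    return total / (len(a) * len(b))

def upgma_newick(leaves):
    """Build a UPGMA tree and return it as a Newick string with branch lengths."""
    # Each cluster is (leaves, Newick text so far, height of its root node).
    clusters = [((leaf,), str(leaf), 0.0) for leaf in leaves]
    while len(clusters) > 1:
        n = len(clusters)
        # Indices of the closest pair of clusters.
        i, j = min(
            ((i, j) for i in range(n) for j in range(i + 1, n)),
            key=lambda ij: average_distance(clusters[ij[0]][0], clusters[ij[1]][0]),
        )
        x, y = clusters[i], clusters[j]
        # The new node sits at half the distance between the merged clusters.
        height = average_distance(x[0], y[0]) / 2
        # Branch length = height of the new node minus height of the child node.
        text = "({}:{:.5f},{}:{:.5f})".format(
            x[1], height - x[2], y[1], height - y[2])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append((x[0] + y[0], text, height))
    return clusters[0][1] + ";"

print(upgma_newick([1, 2, 3, 4, 5]))
```

Because every tip ends up at the same total distance from the root, the resulting tree is ultrametric, which is the assumption built into UPGMA.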
UPGMA Result Slide 9
SOM
A grid of "artificial neurons"
Training data
    numeric vectors
    sequence motifs
A distance or similarity index
    between numeric vectors
    between sequence motifs
An algorithm to update the neurons in response to input (the learning process)
Slide 10
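Data

Gene   Expression values (T0, T10, T20)   Sum
1      93, 76, 87                         256
2      80, 81, 85                         246
3      89, 88 (one value missing)         262
4      69, 74, 96                         239
5      95 (two values missing)            277
6      65 (two values missing)            237
7      (all three values missing)         268
8      78 (two values missing)            255
9      97 (two values missing)            264
10     67, 55 (one value missing)         218
11     91, 90 (one value missing)         276
12     72 (two values missing)            215
13     79, 94 (one value missing)         251
14     (all three values missing)         250
15     66, 64, 63                         193

Slide 11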
Data and SOM grid
Gene expression data: same table as on Slide 11.

3 by 3 SOM grid with random initial vectors:

       1                    2                    3
       T0    T10   T20      T0    T10   T20      T0    T10   T20
1      2.2   11.5  33.4     41.9  27.6  0.8      6     63.8  51.2
2      28.5  30.2  47       28.7  51.3  9        61.6  38.2  17.9
3      40.5  76.1  71.2     79.8  94.6  23.2     76.2  40.9  23.9

Slide 12
Training
We now randomly choose one gene, and suppose we happen to have chosen Gene 4, with T0, T10 and T20 equal to 69, 74 and 96, respectively (Table 6-1). The Euclidean distances (designated hereafter as d) between this gene and each of the 9 nodes (Table 6-3) show that Gene 4 is closest to node (3,1), with d = 37.8. Node (3,1) is the winning node.

All 9 distances (rows and columns index the SOM grid on Slide 12):

       1       2       3
1      111.0   109.0   78.0
2      77.2    98.5    86.2
3      37.8    76.4    79.6

Slide 13
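The search for the winning node can be sketched as follows. The sketch assumes the initial grid is read row by row with three values (T0, T10, T20) per node, a layout that reproduces the nine distances listed on this slide to within rounding. `winning_node` is a name introduced here, and `math.dist` requires Python 3.8 or later.

```python
import math

# Initial 3-by-3 grid from the slides: grid[row][col] = (T0, T10, T20).
grid = [
    [(2.2, 11.5, 33.4), (41.9, 27.6, 0.8), (6.0, 63.8, 51.2)],
    [(28.5, 30.2, 47.0), (28.7, 51.3, 9.0), (61.6, 38.2, 17.9)],
    [(40.5, 76.1, 71.2), (79.8, 94.6, 23.2), (76.2, 40.9, 23.9)],
]
gene4 = (69, 74, 96)

def winning_node(grid, x):
    """Return (row, col, d) of the node closest to x, using 1-based indices."""
    best = None
    for r, row in enumerate(grid, start=1):
        for c, node in enumerate(row, start=1):
            d = math.dist(node, x)   # Euclidean distance
            if best is None or d < best[2]:
                best = (r, c, d)
    return best

row, col, d = winning_node(grid, gene4)
print(row, col, round(d, 1))   # prints: 3 1 37.8
```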
Updating
Learning rate: use 0.5
Gene 4: 69, 74 and 96

Grid before updating:

       1                    2                    3
       T0    T10   T20      T0    T10   T20      T0    T10   T20
1      2.2   11.5  33.4     41.9  27.6  0.8      6     63.8  51.2
2      28.5  30.2  47       28.7  51.3  9        61.6  38.2  17.9
3      40.5  76.1  71.2     79.8  94.6  23.2     76.2  40.9  23.9

Grid after updating:

       1                    2                    3
       T0    T10   T20      T0    T10   T20      T0    T10   T20
1      2.2   11.5  33.4     41.9  27.6  0.8      6.0   63.8  51.2
2      38.7  41.1  59.3     28.7  51.3  9.0      61.6  38.2  17.9
3      54.7  75.0  83.6     77.1  89.5  41.4     76.2  40.9  23.9

Continue with other vectors and repeat until no more updating is possible.
Slide 14
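A sketch of one update step is given below. The slide states only the learning rate of 0.5. The numbers in the updated grid are consistent with the winning node (3,1) moving halfway toward Gene 4 and its two directly adjacent nodes, (2,1) and (3,2), moving a quarter of the way, so the neighbor rate of 0.25 and the choice of neighbors (up, down, left, right) are assumptions inferred from those numbers. The function name `update` is illustrative.

```python
# Initial 3-by-3 grid from the slides: grid[row][col] = (T0, T10, T20).
grid = [
    [(2.2, 11.5, 33.4), (41.9, 27.6, 0.8), (6.0, 63.8, 51.2)],
    [(28.5, 30.2, 47.0), (28.7, 51.3, 9.0), (61.6, 38.2, 17.9)],
    [(40.5, 76.1, 71.2), (79.8, 94.6, 23.2), (76.2, 40.9, 23.9)],
]
gene4 = (69, 74, 96)

def update(grid, x, winner, rate=0.5, neighbor_rate=0.25):
    """Move the winner and its adjacent nodes toward input x; return a new grid.

    winner is a 0-based (row, col) pair. Adjacent means directly above, below,
    left, or right of the winner (an assumption, not stated on the slide).
    """
    wr, wc = winner
    new_grid = []
    for r, row in enumerate(grid):
        new_row = []
        for c, node in enumerate(row):
            steps = abs(r - wr) + abs(c - wc)   # distance on the grid
            a = rate if steps == 0 else neighbor_rate if steps == 1 else 0.0
            # Each weight moves a fraction a of the way toward the input value.
            new_row.append(tuple(w + a * (xi - w) for w, xi in zip(node, x)))
        new_grid.append(new_row)
    return new_grid

updated = update(grid, gene4, winner=(2, 0))   # node (3,1) in 1-based indexing
for row in updated:
    print([tuple(round(v, 2) for v in node) for node in row])
```

The computed values agree with the slide's updated grid to within about 0.1; nodes farther from the winner are left unchanged.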
Trained SOM

Node vectors (T0, T10, T20) of the 3 by 3 grid, in slide order. Only 26 of the 27 values are legible, so they are not arranged into cells here:
80.7 77.6 73.3 74 77.9 67.3 69.2 79.5 72.2 84.4 81.9 82.1 82.5 81.2 81.4 78.5 78.2 89.6 85.4 91 88.1 82.9 91.7 79.9 79.1 89.4

Winning node (Row, Col) and distance d for each gene. Expression values are as on Slide 11; Row and Col are legible only for Gene 1:

Gene   Row   Col   d
1      3     2     9.69
2                  4.41
3                  6.54
4                  13.77
5                  6.80
6                  17.43
7                  4.90
8                  10.17
9                  6.12
10                 23.00
11                 6.30
12                 6.19
13                 4.84
14                 13.63
15                 16.54

Slide 15
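Sequence as a matrix
Table 2. A matrix representation of sequence “ACCGTTA” (a). The resulting position weight matrix (b) is obtained after adding a pseudocount of 0.01 to each cell in (a), with background frequencies being 0.3, 0.2, 0.2, and 0.3 for A, C, G, and T, respectively.

(a)
       1       2       3       4       5       6       7
A      1       0       0       0       0       0       1
C      0       1       1       0       0       0       0
G      0       0       0       1       0       0       0
T      0       0       0       0       1       1       0

(b)
       1       2       3       4       5       6       7
A      1.695   −4.963  −4.963  −4.963  −4.963  −4.963  1.695
C      −4.379  2.280   2.280   −4.379  −4.379  −4.379  −4.379
G      −4.379  −4.379  −4.379  2.280   −4.379  −4.379  −4.379
T      −4.963  −4.963  −4.963  −4.963  1.695   1.695   −4.963

Slide 16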
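The position weight matrix can be computed with a short sketch. The formula is inferred from the caption and from the values 1.695, −4.963, −4.379 and 2.280 shown in (b): each count plus the pseudocount is divided by the column total, then by the background frequency, and the base-2 logarithm is taken. The function names are placeholders.

```python
import math

# Background frequencies given in the table caption.
BACKGROUND = {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}

def count_matrix(seq):
    """One row per nucleotide, one column per site; 1 where seq has that base."""
    return {b: [1 if s == b else 0 for s in seq] for b in "ACGT"}

def pwm(counts, pseudocount=0.01):
    """log2 of (site frequency / background frequency) for every cell."""
    n_sites = len(counts["A"])
    # Column totals after adding the pseudocount to each of the four cells.
    totals = [sum(counts[b][i] + pseudocount for b in "ACGT")
              for i in range(n_sites)]
    return {
        b: [math.log2((counts[b][i] + pseudocount) / totals[i] / BACKGROUND[b])
            for i in range(n_sites)]
        for b in "ACGT"
    }

matrix = pwm(count_matrix("ACCGTTA"))
for base in "ACGT":
    print(base, [round(v, 3) for v in matrix[base]])
```

A base that is present at a site scores positive, and an absent base scores strongly negative; the exact values depend on the background frequency of that base.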
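Learning
Table 3. Updating the node in Table 2a by the new input sequence “GCCATTA” (a), and the resulting position weight matrix (PWM) in (b), obtained after adding a pseudocount of 0.01 to each cell in (a).

(a)
       1       2       3       4       5       6       7
A      1       0       0       1       0       0       2
C      0       2       2       0       0       0       0
G      1       0       0       1       0       0       0
T      0       0       0       0       2       2       0

(b)
       1       2       3       4       5       6       7
A      0.723   −5.935  −5.935  0.723   −5.935  −5.935  1.716
C      −5.350  2.301   2.301   −5.350  −5.350  −5.350  −5.350
G      1.308   −5.350  −5.350  1.308   −5.350  −5.350  −5.350
T      −5.935  −5.935  −5.935  −5.935  1.716   1.716   −5.935

Slide 17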
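The learning step for sequence motifs can be sketched as adding the new sequence to the node's count matrix and recomputing the PWM. This assumes the update is a plain addition of counts, which reproduces the values 0.723, −5.935, 1.716, −5.350, 2.301 and 1.308 shown in (b); the slide does not say whether a learning rate is also applied. The names `add_sequence` and `pwm` are introduced here for illustration.

```python
import math

# Background frequencies from the Table 2 caption.
BACKGROUND = {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}

def add_sequence(counts, seq):
    """Return a new count matrix with one aligned sequence added."""
    return {b: [n + int(s == b) for n, s in zip(row, seq)]
            for b, row in counts.items()}

def pwm(counts, pseudocount=0.01):
    """log2 of (site frequency / background frequency) for every cell."""
    bases = list(counts)
    n_sites = len(counts[bases[0]])
    totals = [sum(counts[b][i] + pseudocount for b in bases)
              for i in range(n_sites)]
    return {
        b: [math.log2((counts[b][i] + pseudocount) / totals[i] / BACKGROUND[b])
            for i in range(n_sites)]
        for b in bases
    }

node = {b: [0] * 7 for b in "ACGT"}       # empty node with 7 sites
node = add_sequence(node, "ACCGTTA")      # the node of Table 2a
node = add_sequence(node, "GCCATTA")      # updated by the new input

matrix = pwm(node)
for base in "ACGT":
    print(base, node[base], [round(v, 3) for v in matrix[base]])
```

Sites where the two sequences agree keep a single strong preference, while sites 1 and 4, where they differ, now give moderate positive scores to both observed bases.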