Self-organizing maps: numeric vectors and sequence motifs

Presentation transcript:

Self-organizing maps: numeric vectors and sequence motifs
Xuhua Xia
xxia@uottawa.ca
http://dambe.bio.uottawa.ca

Co-expressed genes

Fig. 6-1. A subset of six co-expressed genes from yeast gene expression data (Cho et al., 1998). These genes have similar expression profiles and form a tight cluster in a gene expression tree built from distances that measure differences in expression profiles. (Xia 2007. Bioinformatics and the Cell. Springer.)

Slide 2

Distances and scale effect

Euclidean distance between two profiles $x$ and $y$ measured over $n$ conditions:

$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$

Mahalanobis distance = Euclidean distance after data standardization (each variable rescaled to mean 0 and standard deviation 1, which removes the scale effect):

$d_M(x, y) = \sqrt{\sum_{i=1}^{n} \left(\frac{x_i - y_i}{s_i}\right)^2}$

Slide 3
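As a small illustration (our toy example, reusing three rows from the expression table later in this deck):

```python
# Minimal sketch: Euclidean distance before and after standardizing each
# variable (column) to mean 0 and standard deviation 1. With uncorrelated
# variables, the standardized distance equals the Mahalanobis distance.
import numpy as np

def euclidean(x, y):
    return np.sqrt(np.sum((x - y) ** 2))

# Rows = genes, columns = time points (T0, T10, T20).
X = np.array([[93.0, 76.0, 87.0],
              [80.0, 81.0, 85.0],
              [69.0, 74.0, 96.0]])

Z = (X - X.mean(axis=0)) / X.std(axis=0)  # per-column z-scores

print(euclidean(X[0], X[1]))  # raw distance, dominated by large-scale columns
print(euclidean(Z[0], Z[1]))  # scale-free distance
```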

Clustering approaches

Genes whose expression changes synchronously have short distances between their profiles, and we want to identify such genes by clustering them together. Two approaches:

- Conventional hierarchical clustering (single-linkage, complete-linkage, average-linkage; UPGMA is an average-linkage algorithm)
- Artificial neural network approach (SOM)

Slide 4

UPGMA

Starting distance matrix:

         Gene1   Gene2   Gene3   Gene4   Gene5
Gene1            0.015   0.045   0.143   0.198
Gene2                    0.030   0.126   0.179
Gene3                            0.092   0.179
Gene4                                    0.179
Gene5

Gene1 and Gene2 are the closest pair, so they are merged into cluster Gene12. Distances from the new cluster are averages over its members:

D12,3 = (D1,3 + D2,3)/2 = 0.038
D12,4 = (D1,4 + D2,4)/2 = 0.135
D12,5 = (D1,5 + D2,5)/2 = 0.189

         Gene12  Gene3   Gene4   Gene5
Gene12           0.038   0.135   0.189
Gene3                    0.092   0.179
Gene4                            0.179
Gene5

[Tree figures: the clustering after the first merge, (1,2),(3,4,5), and after the second merge, ((1,2),3),(4,5).]

Xuhua Xia

UPGMA

In the reduced matrix, D12,3 = 0.038 is the smallest entry, so Gene12 and Gene3 are merged next:

D123,4 = (D1,4 + D2,4 + D3,4)/3 = 0.120
D123,5 = (D1,5 + D2,5 + D3,5)/3 = 0.185

         Gene123  Gene4   Gene5
Gene123           0.120   0.185
Gene4                     0.179
Gene5

Gene123 and Gene4 are merged next (D123,4 = 0.120 is the smallest entry), and finally:

D1234,5 = (D1,5 + D2,5 + D3,5 + D4,5)/4 = 0.184

[Tree figure: the completed clustering ((((1,2),3),4),5).]

Slide 6 Xuhua Xia

Phylogenetic relationship from UPGMA

The successive distance matrices (5 genes, then Gene12, then Gene123, shown on the previous two slides) fix the order of merges and hence the topology: Gene1 joins Gene2 first, then Gene3, then Gene4, then Gene5, giving ((((1,2),3),4),5).

Slide 7 Xuhua Xia

Branch Lengths

Under UPGMA, each new node is placed at half the distance between the two clusters it joins:

D12 = 0.015, so node (1,2) is at height 0.015/2 = 0.0075
D12,3 = (D1,3 + D2,3)/2 = 0.038 (previous slide), so node ((1,2),3) is at height 0.019
D12,4 = (D1,4 + D2,4)/2 = 0.135
D12,5 = (D1,5 + D2,5)/2 = 0.189
D123,4 = (D1,4 + D2,4 + D3,4)/3 = 0.120 (previous slide), so node (((1,2),3),4) is at height 0.06
D123,5 = (D1,5 + D2,5 + D3,5)/3 = 0.185
D1234,5 = (D1,5 + D2,5 + D3,5 + D4,5)/4 = 0.184, so the root is at height 0.092

A branch length is the difference between the heights at its two ends. Building the Newick representation step by step:

((1:0.0075,2:0.0075),(3,4,5))
(((1:0.0075,2:0.0075):0.0115,3:0.019),(4,5))
((((1:0.0075,2:0.0075):0.0115,3:0.019):0.041,4:0.06):0.032,5:0.092)

[Tree figure: Gene1 through Gene5 with node heights 0.0075, 0.019, 0.06, 0.092.]

The reference book (Xia 2007) gives a different example. Make sure to go through it to gain a better understanding.

Slide 8 Xuhua Xia
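The whole procedure is mechanical; a compact UPGMA sketch in Python (our illustration, not code from the course) reproduces this example, differing from the slides only where the slides round intermediate distances (0.0375 shown as 0.038, and so on):

```python
# Minimal UPGMA sketch: repeatedly merge the closest pair of clusters,
# average distances (weighted by cluster size, so D(123,4) equals the
# plain mean (D14 + D24 + D34)/3), and place each new node at half the
# merge distance. Emits a Newick tree with branch lengths.
import itertools

def upgma(dist):
    """dist: {frozenset({a, b}): distance} over leaf labels."""
    # cluster id -> (newick, leaf count, height above the leaves)
    clusters = {leaf: (leaf, 1, 0.0) for pair in dist for leaf in pair}
    dist = dict(dist)
    while len(clusters) > 1:
        a, b = min(itertools.combinations(clusters, 2),
                   key=lambda p: dist[frozenset(p)])
        h = dist[frozenset((a, b))] / 2.0
        (na, sa, ha), (nb, sb, hb) = clusters.pop(a), clusters.pop(b)
        ab = a + b  # unique label for the merged cluster
        for c in clusters:  # size-weighted average linkage
            dist[frozenset((ab, c))] = (sa * dist[frozenset((a, c))] +
                                        sb * dist[frozenset((b, c))]) / (sa + sb)
        clusters[ab] = ("(%s:%.4f,%s:%.4f)" % (na, h - ha, nb, h - hb),
                        sa + sb, h)
    return next(iter(clusters.values()))[0] + ";"

pairs = {("1","2"): 0.015, ("1","3"): 0.045, ("1","4"): 0.143, ("1","5"): 0.198,
         ("2","3"): 0.030, ("2","4"): 0.126, ("2","5"): 0.179,
         ("3","4"): 0.092, ("3","5"): 0.179, ("4","5"): 0.179}
print(upgma({frozenset(p): v for p, v in pairs.items()}))
# Same topology and, up to the slides' rounding, the same branch lengths as
# ((((1:0.0075,2:0.0075):0.0115,3:0.019):0.041,4:0.06):0.032,5:0.092)
```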

UPGMA Result

[Tree figure: ((((1,2),3),4),5) drawn with the branch lengths computed on the previous slide.]

Slide 9

SOM

A self-organizing map needs:
- A grid of "artificial neurons" (nodes)
- Training data: numeric vectors or sequence motifs
- A distance or similarity index: between numeric vectors, or between sequence motifs
- An algorithm to update the neurons in response to input (the learning process)

Slide 10

Data

Gene expression at three time points. Some cells were lost in the transcript and are marked …; surviving values keep their transcribed order, so a marked gap may actually fall between them. Where a row lacks a single value, it can be recovered from the Sum column.

Gene   T0   T10   T20   Sum
1      93   76    87    256
2      80   81    85    246
3      89   88    …     262
4      69   74    96    239
5      95   …     …     277
6      65   …     …     237
7      …    …     …     268
8      78   …     …     255
9      97   …     …     264
10     67   55    …     218
11     91   90    …     276
12     72   …     …     215
13     79   94    …     251
14     …    …     …     250
15     66   64    63    193

Slide 11

Data and SOM grid

The 15 genes (previous slide) are mapped onto a 3-by-3 SOM grid. Each node holds a (T0, T10, T20) vector, initialized randomly:

        Col 1               Col 2               Col 3
Row 1   (2.2, 11.5, 33.4)   (41.9, 27.6, 0.8)   (6.0, 63.8, 51.2)
Row 2   (28.5, 30.2, 47.0)  (28.7, 51.3, 9.0)   (61.6, 38.2, 17.9)
Row 3   (40.5, 76.1, 71.2)  (79.8, 94.6, 23.2)  (76.2, 40.9, 23.9)

Slide 12

Training

We now randomly choose one gene, and suppose we happen to have chosen Gene 4, with T0, T10 and T20 equal to 69, 74 and 96, respectively (Table 6-1). The Euclidean distances (designated hereafter as d) between this gene and each of the 9 nodes (Table 6-3) show that Gene 4 is closest to node (3,1), with d = 37.8.

All 9 distances:

        Col 1   Col 2   Col 3
Row 1   111.0   109.0    78.0
Row 2    77.2    98.5    86.2
Row 3    37.8    76.4    79.6

Winning node: (3,1).

Slide 13
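A short sketch of the winner search (Python and the array layout are our choices; the grid values are those above):

```python
# Sketch: compute the distance from one input vector to every node on the
# 3-by-3 grid and pick the winner. grid[r, c] is node (r+1, c+1)'s
# (T0, T10, T20) vector from the previous slide.
import numpy as np

grid = np.array([[[ 2.2, 11.5, 33.4], [41.9, 27.6,  0.8], [ 6.0, 63.8, 51.2]],
                 [[28.5, 30.2, 47.0], [28.7, 51.3,  9.0], [61.6, 38.2, 17.9]],
                 [[40.5, 76.1, 71.2], [79.8, 94.6, 23.2], [76.2, 40.9, 23.9]]])

gene4 = np.array([69.0, 74.0, 96.0])
d = np.sqrt(((grid - gene4) ** 2).sum(axis=2))   # Euclidean distance per node
winner = np.unravel_index(d.argmin(), d.shape)
print(np.round(d, 1))  # reproduces the 9 distances on the slide
print(winner)          # (2, 0), i.e. node (3,1) in the slide's 1-based labels
```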

Updating

Learning rate: use 0.5. Gene 4: (69, 74, 96). The winning node (3,1) is moved halfway toward Gene 4 (w ← w + 0.5(x − w)); its immediate grid neighbors (2,1) and (3,2) are moved about a quarter of the way, as the updated values show. The other nodes are unchanged.

Grid before the update:

        Col 1               Col 2               Col 3
Row 1   (2.2, 11.5, 33.4)   (41.9, 27.6, 0.8)   (6.0, 63.8, 51.2)
Row 2   (28.5, 30.2, 47.0)  (28.7, 51.3, 9.0)   (61.6, 38.2, 17.9)
Row 3   (40.5, 76.1, 71.2)  (79.8, 94.6, 23.2)  (76.2, 40.9, 23.9)

Grid after the update:

        Col 1               Col 2               Col 3
Row 1   (2.2, 11.5, 33.4)   (41.9, 27.6, 0.8)   (6.0, 63.8, 51.2)
Row 2   (38.7, 41.1, 59.3)  (28.7, 51.3, 9.0)   (61.6, 38.2, 17.9)
Row 3   (54.7, 75.0, 83.6)  (77.1, 89.5, 41.4)  (76.2, 40.9, 23.9)

Continue with other vectors and repeat until no more updating is possible.

Slide 14
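A sketch of this update step, continuing the previous snippet. The 0.5 rate for the winner is from the slide; the 0.25 rate for the immediate neighbors is read off the updated values (they match up to display rounding), and the city-block neighborhood is our assumption:

```python
# Sketch of one SOM update: the winner moves toward the input with
# rate 0.5; nodes one grid step away (city-block distance 1) move with
# rate 0.25; everything else stays put.
def update(grid, x, winner, rate=0.5, neighbor_rate=0.25):
    rows, cols = grid.shape[:2]
    for r in range(rows):
        for c in range(cols):
            steps = abs(r - winner[0]) + abs(c - winner[1])
            if steps == 0:
                grid[r, c] += rate * (x - grid[r, c])
            elif steps == 1:
                grid[r, c] += neighbor_rate * (x - grid[r, c])

update(grid, gene4, winner)
print(np.round(grid, 1))  # node (3,1) moves to about (54.7, 75.0, 83.6)
```

A full training run loops over randomly chosen genes, typically shrinking the learning rate and the neighborhood radius over time, until the grid stops changing.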

Trained SOM

Node vectors after training (one value was lost in the transcript, marked …):

        Col 1               Col 2               Col 3
Row 1   (80.7, 77.6, 73.3)  (74.0, 77.9, 67.3)  (69.2, 79.5, 72.2)
Row 2   (84.4, 81.9, 82.1)  (82.5, 81.2, 81.4)  (78.5, 78.2, 89.6)
Row 3   (85.4, 91.0, …)     (88.1, 82.9, 91.7)  (79.9, 79.1, 89.4)

Each gene is then assigned to its nearest node (Row, Col), at distance d (cells lost in the transcript again marked …):

Gene   T0   T10   T20   Row   Col   d
1      93   76    87    3     2     9.69
2      80   81    85    …     …     4.41
3      89   88    …     …     …     6.54
4      69   74    96    …     …     13.77
5      95   …     …     …     …     6.80
6      65   …     …     …     …     17.43
7      …    …     …     …     …     4.90
8      78   …     …     …     …     10.17
9      97   …     …     …     …     6.12
10     67   55    …     …     …     23.00
11     91   90    …     …     …     6.30
12     72   …     …     …     …     6.19
13     79   94    …     …     …     4.84
14     …    …     …     …     …     13.63
15     66   64    63    …     …     16.54

Slide 15

Sequence as a matrix

Table 2. A matrix representation of sequence "ACCGTTA" (a). The resulting position weight matrix (b) is obtained after adding a pseudocount of 0.01 to each cell in (a), with background frequencies being 0.3, 0.2, 0.2, and 0.3 for A, C, G, and T, respectively.

(a)
      1   2   3   4   5   6   7
A     1   0   0   0   0   0   1
C     0   1   1   0   0   0   0
G     0   0   0   1   0   0   0
T     0   0   0   0   1   1   0

(b) $PWM_{ij} = \log_2 \frac{(n_{ij} + 0.01)/(1 + 4 \times 0.01)}{b_i}$:

         1        2        3        4        5        6        7
A     1.695   −4.963   −4.963   −4.963   −4.963   −4.963    1.695
C    −4.379    2.280    2.280   −4.379   −4.379   −4.379   −4.379
G    −4.379   −4.379   −4.379    2.280   −4.379   −4.379   −4.379
T    −4.963   −4.963   −4.963   −4.963    1.695    1.695   −4.963

Slide 16
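A minimal sketch reproducing both tables (the function names and code organization are ours; the pseudocount and background frequencies are from the table caption):

```python
# Sketch: count matrix for a set of aligned sequences, then the PWM
# log2(((n + pseudo) / (N + 4 * pseudo)) / background). For the single
# sequence "ACCGTTA" this gives 1.695 for the A at site 1, -4.963 for an
# absent A or T, 2.280 for a present C or G, and -4.379 for an absent C or G.
import math

BASES = "ACGT"
BACKGROUND = {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}

def count_matrix(seqs):
    counts = {b: [0] * len(seqs[0]) for b in BASES}
    for seq in seqs:
        for site, base in enumerate(seq):
            counts[base][site] += 1
    return counts

def pwm(counts, n_seqs, pseudo=0.01):
    total = n_seqs + 4 * pseudo  # column total after pseudocounts
    return {b: [math.log2(((n + pseudo) / total) / BACKGROUND[b])
                for n in counts[b]] for b in BASES}

m = pwm(count_matrix(["ACCGTTA"]), 1)
print([round(v, 3) for v in m["A"]])
# [1.695, -4.963, -4.963, -4.963, -4.963, -4.963, 1.695]
```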

Learning

Table 3. Updating the node in Table 2a with the new input sequence "GCCATTA" (a), and the resulting position weight matrix (PWM) obtained in (b) after adding a pseudocount of 0.01 to each cell in (a).

(a) counts after adding "GCCATTA" (two sequences in total):
      1   2   3   4   5   6   7
A     1   0   0   1   0   0   2
C     0   2   2   0   0   0   0
G     1   0   0   1   0   0   0
T     0   0   0   0   2   2   0

(b) $PWM_{ij} = \log_2 \frac{(n_{ij} + 0.01)/(2 + 4 \times 0.01)}{b_i}$:

         1        2        3        4        5        6        7
A     0.723   −5.935   −5.935    0.723   −5.935   −5.935    1.716
C    −5.350    2.301    2.301   −5.350   −5.350   −5.350   −5.350
G     1.308   −5.350   −5.350    1.308   −5.350   −5.350   −5.350
T    −5.935   −5.935   −5.935   −5.935    1.716    1.716   −5.935

Slide 17
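Continuing the sketch above, updating the node amounts to adding the new sequence's counts and recomputing the PWM:

```python
# Add "GCCATTA" to the node's counts and recompute the PWM: site 1 now
# scores 0.723 for A and 1.308 for G, and the doubly supported C at
# site 2 scores 2.301, as in Table 3.
m2 = pwm(count_matrix(["ACCGTTA", "GCCATTA"]), 2)
print(round(m2["A"][0], 3), round(m2["G"][0], 3), round(m2["C"][1], 3))
# 0.723 1.308 2.301
```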