Sparse Nonnegative Matrix Factorization for Protein Sequence Motif Information Discovery
Presented by Wooyoung Kim
Computer Science, Georgia State University
Spring 2009
Contents
Motivation
Discovering Sequence Motifs
Quality Measurements
Previous Methods
New Approach
Experiments and Results
Motivation
A sequence motif is a recurring pattern in protein sequences with biological significance.
Conventional motif-finding methods, including Gibbs sampling, Block Maker, and MEME, can only handle datasets of limited size.
We want to obtain recurring protein patterns that are universally conserved across protein family boundaries.
We therefore first cluster a very large dataset, then find a motif for each cluster.
Discovering Sequence Motifs
Problem formulation
Input: a set V of N protein profile segments (l-mers) and a parameter k.
Output: 1) k clusters, where the data in each cluster are similar both biologically and computationally; 2) a consensus (motif) for each cluster.
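A minimal sketch of the two-stage formulation, with plain k-means and a position-wise argmax standing in for the clustering and consensus methods developed later in the talk (the function names and the amino-acid column order are assumptions, not from the slides):

```python
import numpy as np
from sklearn.cluster import KMeans

AA = list("VLIMFWYGAPSTCHRKQEND")  # amino-acid column order assumed by the slides

def discover_motifs(V, k):
    """V: (N, l, 20) array of N profile segments (l-mers over 20 amino acids).
    Stage 1: cluster the flattened segments into k groups (k-means here as a
    stand-in for the methods discussed later).
    Stage 2: report one consensus motif per cluster: the most frequent amino
    acid at each of the l positions of the cluster's mean profile."""
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(V.reshape(len(V), -1))
    motifs = ["".join(AA[j] for j in V[labels == c].mean(axis=0).argmax(axis=1))
              for c in range(k)]
    return labels, motifs
```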
Dataset
2,710 protein sequences from PISCES (Protein Sequence Culling Server).
A non-homologous dataset: no two sequences share more than 25% sequence identity.
Each sequence represents a protein family, obtained by searching the PDB and performing multiple alignment, and is therefore represented as a frequency profile.
Sliding a window of size 9 over every sequence yields all possible protein segments (more than 560,000).
Each data point is thus represented as a 9 x 20 matrix.
Dataset
Figure: a family of aligned protein sequences over the 20 amino acids (V, L, I, M, F, W, Y, G, A, P, S, T, C, H, R, K, Q, E, N, D), represented as a frequency profile.
Dataset
Figure: an example data point, a 9 x 20 matrix of amino-acid frequencies (rows: the 9 window positions; columns: the 20 amino acids V, L, I, M, F, W, Y, G, A, P, S, T, C, H, R, K, Q, E, N, D).
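A minimal sketch of how the segments could be built, assuming each protein is already given as an (L, 20) frequency profile (the function name is illustrative):

```python
import numpy as np

def profile_segments(profile, ws=9):
    """Slide a window of size ws over an (L, 20) frequency profile and
    return all contiguous (ws, 20) segments, one per window position."""
    L = profile.shape[0]
    return np.stack([profile[i:i + ws] for i in range(L - ws + 1)])

# Example: a random 30-residue profile yields 30 - 9 + 1 = 22 segments.
rng = np.random.default_rng(0)
p = rng.random((30, 20))
p /= p.sum(axis=1, keepdims=True)   # rows sum to 1, like amino-acid frequencies
print(profile_segments(p).shape)    # (22, 9, 20)
```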
Quality Measurements
Secondary structural similarity measure: measures the quality of each cluster by the similarity of its members' secondary structures:

  similarity = ( sum_{i=1}^{ws} max(p_{i,H}, p_{i,E}, p_{i,C}) ) / ws

where ws is the window size (9), and p_{i,H}, p_{i,E}, p_{i,C} are the frequencies of helix, sheet, and coil at position i.
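A short sketch of the measure, assuming it averages the dominant secondary-structure frequency per window position as reconstructed above:

```python
import numpy as np

def structural_similarity(ss_freq):
    """ss_freq: (ws, 3) array; the columns hold the cluster-wide frequencies
    of helix, sheet, and coil at each of the ws window positions.
    Returns the average, over positions, of the dominant structure's frequency."""
    return ss_freq.max(axis=1).mean()

# A perfectly uniform cluster (all helix) scores 1.0; a maximally mixed one ~1/3.
print(structural_similarity(np.array([[1.0, 0.0, 0.0]] * 9)))  # 1.0
print(structural_similarity(np.full((9, 3), 1 / 3)))           # ~0.333
```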
Quality Measurements
Davies-Bouldin Index (DBI) measure: measures clustering quality, favoring large inter-cluster distances and small intra-cluster distances (lower is better):

  DBI = (1/k) sum_{i=1}^{k} max_{j != i} ( (d_i + d_j) / d(c_i, c_j) )

where k is the number of clusters, d_i is the average distance between the points in cluster i and its center c_i, and d(c_i, c_j) is the distance between the centers of clusters i and j.
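A direct implementation of the index as defined above:

```python
import numpy as np

def davies_bouldin(X, labels, centers):
    """Davies-Bouldin index: lower means tighter, better-separated clusters.
    X: (m, n) data; labels: (m,) cluster ids; centers: (k, n) cluster centers."""
    k = len(centers)
    # Average distance from each cluster's points to its own center.
    spread = np.array([np.linalg.norm(X[labels == i] - centers[i], axis=1).mean()
                       for i in range(k)])
    dbi = 0.0
    for i in range(k):
        ratios = [(spread[i] + spread[j]) / np.linalg.norm(centers[i] - centers[j])
                  for j in range(k) if j != i]
        dbi += max(ratios)
    return dbi / k
```

scikit-learn also ships `sklearn.metrics.davies_bouldin_score`, which computes the same index directly from the data and labels.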
Previous Methods
Fuzzy Greedy K-means (FGK), by Chen, Tai, Harrison, and Pan:
- Separate the whole dataset into several smaller informational granules using Fuzzy C-means, which allows one data point to belong to two or more clusters; the more than 560,000 segments are clustered into 10 separate files.
- Run the greedy K-means clustering algorithm on each granule: apply k-means several times to obtain "good" initial centroids that produce clusters with relatively high structural similarity, then run k-means on each set (a sketch of one reading of this greedy step follows).
- Find the consensus sequence for each cluster.
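One way to read the greedy initialization: run k-means several times, keep the best-scoring run's centroids, and seed the final run with them. A sketch under that reading, using inertia as a stand-in for the paper's structural-similarity criterion (not the authors' exact procedure):

```python
import numpy as np
from sklearn.cluster import KMeans

def greedy_kmeans(X, k, trials=5, seed=0):
    """Run k-means `trials` times, keep the run with the best score
    (lowest inertia here), then refine starting from its centroids."""
    best = min((KMeans(n_clusters=k, n_init=1, random_state=seed + t).fit(X)
                for t in range(trials)),
               key=lambda km: km.inertia_)
    return KMeans(n_clusters=k, init=best.cluster_centers_, n_init=1).fit(X)
```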
New Approach
Matrix factorization: a dimension-reduction technique.
Example: 2-dimensional data = {(1.09, 2), (7, 14), (10, 10.1)}.
New Approach
Matrix factorization
- Let A be an n x m data matrix (n: number of dimensions, m: number of data points).
- Factorize A into the product of two matrices, W and H: A ≈ WH.
- W is an n x k basis matrix.
- H is a k x m coefficient matrix, where k << min(n, m).
- This reduces dimensionality by constructing a smaller number of basis vectors, and also reduces data noise.
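The 2-D example above can be written as a 2 x 3 matrix and compressed with k = 1: the first two points lie almost exactly on the line y = 2x, so a single basis vector reconstructs them well. A small numpy illustration, using SVD as the factorization for simplicity:

```python
import numpy as np

A = np.array([[1.09, 7.0, 10.0],   # n = 2 dimensions, m = 3 data points
              [2.0, 14.0, 10.1]])
U, s, Vt = np.linalg.svd(A, full_matrices=False)
W = U[:, :1] * s[:1]               # n x k basis matrix (k = 1)
H = Vt[:1, :]                      # k x m coefficient matrix
print(np.round(W @ H, 2))          # rank-1 approximation of A
```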
New Approach
Principal Component Analysis (PCA)
- Maximizes the data variance (equivalently, minimizes the projection error).
- The most accurate method; produces the optimal basis vectors automatically.
- Suited to linear systems.
Vector Quantization (VQ)
- Uses K-means clustering; winner-take-all; suited to nonlinear systems.
Non-negative Matrix Factorization (NMF)
- Matrix factorization with non-negativity constraints.
- Sparse NMF: NMF with additional sparseness constraints.
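The key contrast is easy to see on nonnegative data: PCA bases may contain negative entries (cancellation is allowed), while NMF bases stay nonnegative (purely additive parts). A small scikit-learn comparison on an arbitrary nonnegative matrix:

```python
import numpy as np
from sklearn.decomposition import PCA, NMF

X = np.random.default_rng(0).random((100, 20))   # nonnegative data
pca_bases = PCA(n_components=5).fit(X).components_
nmf_bases = NMF(n_components=5, init='nndsvd', max_iter=500).fit(X).components_
print((pca_bases < 0).any())   # True: PCA allows cancelling negative weights
print((nmf_bases < 0).any())   # False: NMF parts are additive only
```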
New Approach
Comparing PCA, VQ, and NMF on image processing (faces):
- NMF: sparse bases give part-based representations.
- VQ: each basis represents a prototype face.
- PCA: bases do not admit any contextual interpretation.
(Image from "Learning the parts of objects by non-negative matrix factorization," by Lee and Seung.)
New Approach
Non-negative matrix factorization with sparseness constraints (SNMF/R), by Kim and Park:
- A: n x m data matrix.
- Control the sparseness of the coefficient matrix H with the parameter beta:

  min_{W >= 0, H >= 0} (1/2) ( ||A - WH||_F^2 + eta ||W||_F^2 + beta sum_{j=1}^{m} ||H(:,j)||_1^2 )
New Approach
Applying sparse NMF to the clustering problem
- NMF with a sparseness constraint on H can be used for clustering: each data point is assigned to the basis with the largest coefficient.
- Example: clustering the leukemia gene expression dataset into 3 clusters with high accuracy. (Image from "Sparse Non-negative Matrix Factorizations via Alternating Non-negativity-constrained Least Squares for Microarray Data Analysis," by Kim and Park.)
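A sketch of the clustering step using scikit-learn's NMF as a stand-in for SNMF/R (sklearn applies a plain L1 penalty to H via alpha_H and l1_ratio rather than the squared-L1 column penalty of Kim and Park, so this is an approximation; assumes scikit-learn >= 1.0):

```python
import numpy as np
from sklearn.decomposition import NMF

def nmf_cluster(A, k):
    """A: nonnegative (n, m) matrix with data points as columns.
    Factor A ~ W @ H (W: n x k, H: k x m) with an L1 penalty on H only,
    then assign column j to the cluster argmax_i H[i, j]."""
    model = NMF(n_components=k, init='nndsvda', max_iter=1000,
                alpha_W=0.0, alpha_H=0.1, l1_ratio=1.0)  # sparsity on H only
    W = model.fit_transform(A)      # n x k basis matrix
    H = model.components_           # k x m sparse coefficient matrix
    return H.argmax(axis=0)         # one cluster label per data point
```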
New Approach
Applying sparse NMF to the motif discovery problem
- Partition the whole dataset into clusters using NMF with sparseness on H.
  Input: the dataset, an integer k (the number of clusters), and sparseness parameters chosen using the quality measurements.
  Output: k clusters.
- Discover a representative motif for each cluster.
  Input: for each cluster, a set of sequences with window size 9.
  Output: the sequence with the smallest cost.
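The slides do not spell out the cost function, so as a minimal sketch the consensus below takes the cluster's mean profile and reports the most frequent amino acid at each of the 9 positions, standing in for "the sequence with the smallest cost":

```python
import numpy as np

AA = list("VLIMFWYGAPSTCHRKQEND")   # amino-acid column order used in the slides

def consensus_motif(segments):
    """segments: (N, 9, 20) profile segments belonging to one cluster.
    Returns the 9-letter consensus built from the mean profile."""
    mean_profile = segments.mean(axis=0)                 # (9, 20)
    return "".join(AA[j] for j in mean_profile.argmax(axis=1))
```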
Problems
- Incorporate secondary-structure information into the data: use the Chou-Fasman parameters.
- When the number of clusters is too large, the chance that NMF assigns a data point to the correct cluster becomes low: further divide a big file into smaller files.
Experiments and Results
Figure: coefficient matrices for K=3 (top) and K=45 (bottom). In the highlighted region, when K=45 many coefficients remain nonzero, each carrying about 10% of the weight, which makes it hard to assign the data point to one cluster out of 45.
Experiments and Results
Table: Chou-Fasman parameters. For each of the 20 amino acids (Alanine, Arginine, Aspartic Acid, Asparagine, Cysteine, Glutamic Acid, Glutamine, Glycine, Histidine, Isoleucine, Leucine, Lysine, Methionine, Phenylalanine, Proline, Serine, Threonine, Tryptophan, Tyrosine, Valine), the table lists the helix propensity P(a), sheet propensity P(b), turn propensity P(turn), and the bend frequencies f(i), f(i+1), f(i+2), f(i+3).
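One way the table can be incorporated: map each profile row to its expected Chou-Fasman propensities (a frequency-weighted average of P(a), P(b), P(turn)) and append them as extra columns, turning each 9 x 20 segment into a 9 x 23 one. A sketch with placeholder propensities (the CF matrix below is illustrative only; substitute the published Chou-Fasman values):

```python
import numpy as np

AA = list("VLIMFWYGAPSTCHRKQEND")
# Placeholder table: rows follow the AA order above; columns are P(a), P(b), P(turn).
# These ones are NOT the published values -- fill in the real Chou-Fasman table.
CF = np.ones((20, 3))

def add_chou_fasman(segment):
    """segment: (9, 20) frequency profile. Appends the expected helix/sheet/turn
    propensity at each position, giving a (9, 23) augmented segment."""
    return np.hstack([segment, segment @ CF])
```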
Experiments and Results
Figure: the FGK model (image from the paper by Chen et al.), with the additional steps for Double FCM+CF+SNMF.
Experiments and Results
Figure: the FGK model (image from the paper by Chen et al.), with the additional steps for FCM+CF+Kmeans.
Experiments and Results
Results

Method                  >60%     >70%     DBI
Traditional             25.82%   10.44%   6.09
FCM                     37.14%   15.57%   4.36
FGK                     42.93%   14.39%   4.63
Single FCM+SNMF/R       24.41%    5.76%   4.09
Double FCM+SNMF/R       44.07%   12.73%   5.42
Double FCM+Kmeans       38.45%   13.73%   6.33
Single FCM+CF+Kmeans    41.30%   16.89%   4.28
Double FCM+CF+Kmeans    42.94%   13.23%   5.67
Double FCM+CF+SNMF/R    48.44%   16.23%   4.81

(>60% and >70% refer to the secondary structural similarity thresholds; DBI is the Davies-Bouldin index, lower is better.)
Experiments and Results
Example motif images: only amino acids occurring with more than 8% frequency at a position are shown.
Conclusion
Sparse NMF (SNMF/R) often produces more consistent clustering results than K-means with random initialization; unlike K-means, its performance does not depend on the initial centroids.
We include the Chou-Fasman parameters in the data representation in order to incorporate secondary-structure information.
Files with too many clusters are divided further, so that no file exceeds 14 clusters.
Single FCM+SNMF/R: best DBI.
Double FCM+CF+sparse NMF: best secondary structural similarity.
Future work: FGK+CF might increase the secondary structural similarity further.