Sparse nonnegative matrix factorization for protein sequence motifs information discovery Presented by Wooyoung Kim Computer Science, Georgia State University.


1 Sparse nonnegative matrix factorization for protein sequence motifs information discovery Presented by Wooyoung Kim Computer Science, Georgia State University Spring, 2009

2 Contents: Motivation; Discovering Sequence Motifs; Quality Measurements; Previous Methods; New Approach; Experiments and Results.

3 Motivation A sequence motif is a recurring pattern in protein sequences with biological significance. Conventional motif-finding methods, including Gibbs Sampling, Block Maker, and MEME, handle only datasets of limited size. We want to obtain recurring protein patterns that are universally conserved across protein family boundaries. We therefore first cluster a huge dataset, then find a motif for each cluster.

4 Discovering Sequence Motifs Problem formulation. Input: a set V of N protein profile segments (l-mers) and a parameter k. Output: 1) k clusters, where the data in each cluster are similar both biologically and computationally; 2) a consensus (motif) for each cluster.

5 Dataset 2,710 protein sequences from PISCES (Protein Sequence Culling Server). The dataset is non-homologous; that is, no two sequences share more than 25% sequence identity. Each protein sequence represents a protein family, obtained by searching the PDB and multiple alignment, and is therefore represented as a profile. Sliding a window of size 9 over the sequences yields all possible protein segments (>560,000). Each data point is thus a 9 x 20 matrix.
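The sliding-window step above can be sketched as follows. This is an illustrative reconstruction, not the original pipeline's code: `sliding_segments` and the toy profile are hypothetical names, and a real profile would come from a multiple alignment rather than random numbers.

```python
import numpy as np

def sliding_segments(profile, window=9):
    """Return all (window x 20) sub-matrices of a per-residue frequency profile."""
    n_rows, _ = profile.shape
    return [profile[i:i + window] for i in range(n_rows - window + 1)]

# A toy 12-residue profile (12 rows, 20 amino-acid columns) gives
# 12 - 9 + 1 = 4 overlapping 9-mer segments.
toy_profile = np.random.default_rng(0).random((12, 20))
segments = sliding_segments(toy_profile)
print(len(segments), segments[0].shape)  # 4 (9, 20)
```

Each segment is one 9 x 20 data point, matching the representation on the Dataset slides.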

6 Dataset A family of protein sequences over the 20 amino acids (V L I M F W Y G A P S T C H R K Q E N D) is represented as a frequency profile. [Figure: an example family of aligned sequences and the resulting frequency-profile matrix; the numeric matrix is not legible in the transcript.]

7 Dataset For example, each data point is a 9 x 20 matrix: 9 window positions by the frequencies of the 20 amino acids (V L I M F W Y G A P S T C H R K Q E N D). [Figure: an example 9 x 20 frequency matrix.]

8 Quality measurements Secondary Structural Similarity measure ◦ Measures the quality of each cluster by the similarity of its members' secondary structures. ◦ ws: the window size (9). ◦ p_i^H, p_i^E, p_i^C: the frequencies of helix, sheet, and coil at position i. ◦ The similarity averages the dominant secondary-structure frequency over the window: Similarity = (1/ws) * sum_{i=1..ws} max(p_i^H, p_i^E, p_i^C).
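A minimal sketch of this measure, assuming the averaged-maximum form given above (the function name and inputs are illustrative):

```python
def structural_similarity(freq_H, freq_E, freq_C):
    """Average, over the window, of the dominant secondary-structure frequency."""
    ws = len(freq_H)  # window size, 9 in this work
    return sum(max(h, e, c) for h, e, c in zip(freq_H, freq_E, freq_C)) / ws

# Toy cluster where helix dominates every position at frequency 0.6:
h = [0.6] * 9
e = [0.3] * 9
c = [0.1] * 9
print(structural_similarity(h, e, c))  # 0.6
```

A perfectly pure cluster (one structure at frequency 1.0 everywhere) would score 1.0.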

9 Quality measurements Davies-Bouldin Index (DBI) measure ◦ Evaluates a clustering by favoring large inter-cluster and small intra-cluster distances. ◦ k: the number of clusters. ◦ S_P: the average distance between the points and the center in cluster P. ◦ d(c_i, c_j): the distance between two clusters' centers. ◦ DBI = (1/k) * sum_{i=1..k} max_{j != i} (S_i + S_j) / d(c_i, c_j); lower is better.
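A hedged sketch of the Davies-Bouldin index under the standard formulation above (`davies_bouldin` and the toy data are illustrative, not from the slides):

```python
import numpy as np

def davies_bouldin(clusters):
    """DBI over a list of (n_i x d) point arrays; lower means tighter, better-separated clusters."""
    centers = [c.mean(axis=0) for c in clusters]
    spread = [np.linalg.norm(c - m, axis=1).mean() for c, m in zip(clusters, centers)]
    k = len(clusters)
    total = 0.0
    for i in range(k):
        ratios = [(spread[i] + spread[j]) / np.linalg.norm(centers[i] - centers[j])
                  for j in range(k) if j != i]
        total += max(ratios)
    return total / k

# Two tight, well-separated clusters yield a small index.
a = np.array([[0.0, 0.0], [0.0, 0.2]])
b = np.array([[5.0, 5.0], [5.0, 5.2]])
print(davies_bouldin([a, b]))  # about 0.028
```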

10 Previous Methods Fuzzy Greedy K-means (FGK), by Chen, Tai, Harrison, and Pan. ◦ Separate the whole dataset into several smaller informational granules: use Fuzzy C-means (which allows one data point to belong to two or more clusters); the more than 560,000 segments are clustered into 10 separate files. ◦ Run the greedy K-means clustering algorithm on each granule: apply K-means several times to obtain "good" initial centroids that produce clusters with relatively high structural similarity, then run K-means on each set. ◦ Find the consensus sequence for each cluster.

11 New Approach Matrix Factorization ◦ A dimension-reduction technique. Example: 2-dimensional data = {(1.09, 2), (7, 14), (10, 10.1)}.

12 New Approach Matrix Factorization ◦ Let A be an n x m data matrix (n: number of dimensions, m: number of data points). ◦ Factorize A into the product of two matrices W and H. ◦ W is an n x k basis matrix. ◦ H is a k x m coefficient matrix, where k << min(n, m). ◦ Reduces the dimension by constructing a smaller number of basis vectors. ◦ Reduces data noise.
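The factorization A ≈ WH can be sketched with the classic Lee-Seung multiplicative updates for the Frobenius-norm objective. This is a generic NMF sketch, not the specific solver used later in the slides:

```python
import numpy as np

def nmf(A, k, iters=500, seed=0):
    """Plain NMF via multiplicative updates: A (n x m) ~ W (n x k) @ H (k x m)."""
    rng = np.random.default_rng(seed)
    n, m = A.shape
    W = rng.random((n, k)) + 1e-3
    H = rng.random((k, m)) + 1e-3
    for _ in range(iters):
        H *= (W.T @ A) / (W.T @ W @ H + 1e-9)   # update coefficients
        W *= (A @ H.T) / (W @ H @ H.T + 1e-9)   # update bases
    return W, H

# A rank-2 nonnegative matrix is reconstructed almost exactly with k = 2.
A = np.array([[1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [1.0, 1.0, 1.0]])
W, H = nmf(A, k=2)
print(np.abs(A - W @ H).max())  # small reconstruction error
```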

13 New Approach Principal Component Analysis (PCA) ◦ Maximizes the data variance, or equivalently minimizes the projection error. ◦ Most accurate method. ◦ Produces the optimal number of basis vectors automatically. ◦ Suited to linear systems. Vector Quantization (VQ) ◦ Uses K-means clustering; winner-take-all; nonlinear. Non-negative Matrix Factorization (NMF) ◦ Matrix factorization with non-negativity constraints. ◦ Sparse NMF: NMF with sparseness constraints.

14 New Approach Comparing PCA, VQ, and NMF on image processing. NMF: sparse bases give part-based representations. VQ: bases represent prototype faces. PCA: bases give no contextual interpretation. (Image from "Learning the parts of objects by non-negative matrix factorization," by Lee and Seung.)

15 New Approach Non-negative Matrix Factorization with sparseness constraints (sparse NMF), by Haesun Park's group. ◦ A: n x m data matrix. ◦ Control the sparseness of the coefficient matrix H by penalizing the L1 norms of its columns in the objective (the SNMF/R formulation): min_{W,H >= 0} ||A - WH||_F^2 + eta ||W||_F^2 + beta * sum_j ||H(:,j)||_1^2.

16 New Approach Applying sparse NMF to the clustering problem ◦ NMF with a sparseness constraint on H can be used for clustering. Example: clustering the leukemia gene-expression dataset into 3 clusters gives 97.5% accuracy. (Image from "Sparse Non-negative Matrix Factorizations via Alternating Non-negativity-constrained Least Squares for Microarray Data Analysis," by Kim and Park.)
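The clustering step reduces to reading labels off the coefficient matrix: with a sparse H, each data point (a column of A) is assigned to the basis carrying its largest coefficient. A minimal sketch with an illustrative H:

```python
import numpy as np

def nmf_cluster_labels(H):
    """Assign each column of A to the row of H with the largest coefficient."""
    return np.argmax(H, axis=0)

# Toy coefficient matrix for 4 data points and k = 2 clusters:
H = np.array([[0.9, 0.1, 0.8, 0.2],
              [0.1, 0.9, 0.2, 0.8]])
print(nmf_cluster_labels(H))  # [0 1 0 1]
```

Sparseness on H sharpens exactly this step: the fewer nonzero coefficients a column has, the less ambiguous the argmax assignment is.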

17 New Approach Applying sparse NMF to the motif-discovery problem ◦ Partition the whole dataset into a number of clusters using NMF with sparseness on H. Input: the dataset, an integer k (the number of clusters), and sparseness parameters chosen using the quality measurements. Output: k clusters. ◦ Discover a representative motif for each cluster. Input: for each cluster, a number of sequences with window size 9. Output: the sequence with the smallest cost.
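The motif-extraction step can be sketched as follows. The slides select the sequence with the smallest cost; as a simple illustrative stand-in, this sketch averages a cluster's 9 x 20 profile segments and takes the most frequent amino acid at each position (`consensus_motif` and the one-hot toy data are hypothetical):

```python
import numpy as np

AMINO = "VLIMFWYGAPSTCHRKQEND"  # column order used on the Dataset slides

def consensus_motif(segments):
    """Per-position most frequent amino acid of a cluster's mean 9 x 20 profile."""
    mean_profile = np.mean(segments, axis=0)  # 9 x 20
    return "".join(AMINO[j] for j in mean_profile.argmax(axis=1))

# Toy cluster of two identical one-hot profiles: position i picks amino acid i.
seg = np.zeros((9, 20))
for pos in range(9):
    seg[pos, pos] = 1.0
print(consensus_motif([seg, seg]))  # VLIMFWYGA
```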

18 Experiments and Results Problems ◦ Incorporating secondary-structure information into the data: use the Chou-Fasman parameters. ◦ If the number of clusters is too large, the chance that NMF assigns a data point to a single cluster correctly becomes low: further divide a big file into smaller files. Top: K=3. Bottom: K=45. In the red box, when K=45 many of the coefficients remain nonzero, with about 10% of the weight each, which makes it hard to assign the data point to one cluster out of 45.

19 Experiments and Results Chou-Fasman parameters:

Name            P(a)  P(b)  P(turn)  f(i)   f(i+1)  f(i+2)  f(i+3)
Alanine         142    83    66      0.06   0.076   0.035   0.058
Arginine         98    93    95      0.07   0.106   0.099   0.085
Aspartic Acid   101    54   146      0.147  0.11    0.179   0.081
Asparagine       67    89   156      0.161  0.083   0.191   0.091
Cysteine         70   119   119      0.149  0.05    0.117   0.128
Glutamic Acid   151    37    74      0.056  0.06    0.077   0.064
Glutamine       111   110    98      0.074  0.098   0.037   0.098
Glycine          57    75   156      0.102  0.085   0.19    0.152
Histidine       100    87    95      0.14   0.047   0.093   0.054
Isoleucine      108   160    47      0.043  0.034   0.013   0.056
Leucine         121   130    59      0.061  0.025   0.036   0.07
Lysine          114    74   101      0.055  0.115   0.072   0.095
Methionine      145   105    60      0.068  0.082   0.014   0.055
Phenylalanine   113   138    60      0.059  0.041   0.065   0.065
Proline          57    55   152      0.102  0.301   0.034   0.068
Serine           77    75   143      0.12   0.139   0.125   0.106
Threonine        83   119    96      0.086  0.108   0.065   0.079
Tryptophan      108   137    96      0.077  0.013   0.064   0.167
Tyrosine         69   147   114      0.082  0.065   0.114   0.125
Valine          106   170    50      0.062  0.048   0.028   0.053
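One way the Chou-Fasman parameters can augment the profile data is to give each window position the frequency-weighted average of P(a), P(b), and P(turn) over the amino acids. This is a hedged sketch of that idea, not the slides' exact encoding; only two residues are included for brevity, with values from the Chou-Fasman table:

```python
CHOU_FASMAN = {             # residue: (P(a), P(b), P(turn))
    "A": (142, 83, 66),     # Alanine
    "G": (57, 75, 156),     # Glycine
}

def cf_features(freqs):
    """Frequency-weighted Chou-Fasman propensities for one window position.

    freqs: {residue: frequency} at that position.
    """
    return tuple(sum(f * CHOU_FASMAN[r][i] for r, f in freqs.items())
                 for i in range(3))

# A position that is half Ala, half Gly:
print(cf_features({"A": 0.5, "G": 0.5}))  # (99.5, 79.0, 111.0)
```

These three extra numbers per position would sit alongside the 20 amino-acid frequencies in the data matrix fed to sparse NMF.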

20 Experiments and Results [Figure: the FGK model (image from the paper by Chen et al.) alongside Double FCM+CF+SNMF, showing the additional steps for Double FCM+CF+SNMF.]

21 Experiments and Results [Figure: the FGK model (image from the paper by Chen et al.) alongside FCM+CF+Kmeans, showing the additional steps for FCM+CF+Kmeans.]

22 Experiments and Results Results:

Method                  >60%     >70%     DBI
Traditional             25.82%   10.44%   6.09
FCM                     37.14%   15.57%   4.36
FGK                     42.93%   14.39%   4.63
Single FCM+SNMF/R       24.41%    5.76%   4.09
Double FCM+SNMF/R       44.07%   12.73%   5.42
Double FCM+Kmeans       38.45%   13.73%   6.33
Single FCM+CF+Kmeans    41.30%   16.89%   4.28
Double FCM+CF+Kmeans    42.94%   13.23%   5.67
Double FCM+CF+SNMF/R    48.44%   16.23%   4.81

23 Experiments and Results Example motif images. Only amino acids occurring at more than 8% at a position are shown.

24 Conclusion Sparse NMF/R often produces more consistent clustering results than K-means with random initialization; unlike K-means, its performance does not depend on the initial centroids. We add the Chou-Fasman parameters to the data format in order to incorporate secondary-structure information. Files with too many clusters are divided further so that no file exceeds 14 clusters. Single FCM+SNMF/R gives the best DBI; Double FCM+CF+sparse NMF gives the best secondary-structure similarity. Future work: FGK+CF might increase the secondary-structure similarity further.
