Sparse Nonnegative Matrix Factorization for Protein Sequence Motif Information Discovery
Presented by Wooyoung Kim
Computer Science, Georgia State University
Spring 2009
Contents
Motivation
Discovering Sequence Motifs
Quality Measurements
Previous Methods
New Approach
Experiments and Results
Motivation
A sequence motif is a recurring pattern in protein sequences with biological significance.
Conventional motif-finding methods, including Gibbs sampling, Block Maker, and MEME, can only handle datasets of limited size.
We want to obtain recurring protein patterns that are universally conserved across protein family boundaries.
We therefore first cluster a very large dataset, then find a motif for each cluster.
Discovering Sequence Motifs
Problem formulation
Input: a set V of N protein profile segments (l-mers) and a parameter k.
Output: 1) k clusters, where the data in each cluster are similar both biologically and computationally; 2) a consensus (motif) for each cluster.
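A minimal sketch of the two-stage formulation, with plain k-means and a position-wise argmax standing in for the clustering and consensus methods developed later in the talk (the function names and the amino-acid column order are assumptions, not from the slides):

```python
import numpy as np
from sklearn.cluster import KMeans

AA = list("VLIMFWYGAPSTCHRKQEND")  # amino-acid column order assumed by the slides

def discover_motifs(V, k):
    """V: (N, l, 20) array of N profile segments (l-mers over 20 amino acids).
    Stage 1: cluster the flattened segments into k groups (k-means here as a
    stand-in for the methods discussed later).
    Stage 2: report one consensus motif per cluster: the most frequent amino
    acid at each of the l positions of the cluster's mean profile."""
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(V.reshape(len(V), -1))
    motifs = ["".join(AA[j] for j in V[labels == c].mean(axis=0).argmax(axis=1))
              for c in range(k)]
    return labels, motifs
```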
Dataset
2,710 protein sequences from PISCES (Protein Sequence Culling Server).
A non-homologous dataset: no two sequences share more than 25% sequence identity.
Each sequence represents a protein family, obtained by searching the PDB and performing multiple alignment, and is therefore represented as a frequency profile.
Sliding a window of size 9 over every sequence yields all possible protein segments (more than 560,000).
Each data point is thus represented as a 9 x 20 matrix.
Dataset
Figure: a family of aligned protein sequences over the 20 amino acids (V, L, I, M, F, W, Y, G, A, P, S, T, C, H, R, K, Q, E, N, D), represented as a frequency profile.
Dataset
Figure: an example data point, a 9 x 20 matrix of amino-acid frequencies (rows: the 9 window positions; columns: the 20 amino acids V, L, I, M, F, W, Y, G, A, P, S, T, C, H, R, K, Q, E, N, D).
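A minimal sketch of how the segments could be built, assuming each protein is already given as an (L, 20) frequency profile (the function name is illustrative):

```python
import numpy as np

def profile_segments(profile, ws=9):
    """Slide a window of size ws over an (L, 20) frequency profile and
    return all contiguous (ws, 20) segments, one per window position."""
    L = profile.shape[0]
    return np.stack([profile[i:i + ws] for i in range(L - ws + 1)])

# Example: a random 30-residue profile yields 30 - 9 + 1 = 22 segments.
rng = np.random.default_rng(0)
p = rng.random((30, 20))
p /= p.sum(axis=1, keepdims=True)   # rows sum to 1, like amino-acid frequencies
print(profile_segments(p).shape)    # (22, 9, 20)
```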
Quality Measurements
Secondary structural similarity measure: measures the quality of each cluster by the similarity of its members' secondary structures:

  similarity = ( sum_{i=1}^{ws} max(p_{i,H}, p_{i,E}, p_{i,C}) ) / ws

where ws is the window size (9), and p_{i,H}, p_{i,E}, p_{i,C} are the frequencies of helix, sheet, and coil at position i.
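A short sketch of the measure, assuming it averages the dominant secondary-structure frequency per window position as reconstructed above:

```python
import numpy as np

def structural_similarity(ss_freq):
    """ss_freq: (ws, 3) array; the columns hold the cluster-wide frequencies
    of helix, sheet, and coil at each of the ws window positions.
    Returns the average, over positions, of the dominant structure's frequency."""
    return ss_freq.max(axis=1).mean()

# A perfectly uniform cluster (all helix) scores 1.0; a maximally mixed one ~1/3.
print(structural_similarity(np.array([[1.0, 0.0, 0.0]] * 9)))  # 1.0
print(structural_similarity(np.full((9, 3), 1 / 3)))           # ~0.333
```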
Quality Measurements
Davies-Bouldin Index (DBI) measure: measures clustering quality, favoring large inter-cluster distances and small intra-cluster distances (lower is better):

  DBI = (1/k) sum_{i=1}^{k} max_{j != i} ( (d_i + d_j) / d(c_i, c_j) )

where k is the number of clusters, d_i is the average distance between the points in cluster i and its center c_i, and d(c_i, c_j) is the distance between the centers of clusters i and j.
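A direct implementation of the index as defined above:

```python
import numpy as np

def davies_bouldin(X, labels, centers):
    """Davies-Bouldin index: lower means tighter, better-separated clusters.
    X: (m, n) data; labels: (m,) cluster ids; centers: (k, n) cluster centers."""
    k = len(centers)
    # Average distance from each cluster's points to its own center.
    spread = np.array([np.linalg.norm(X[labels == i] - centers[i], axis=1).mean()
                       for i in range(k)])
    dbi = 0.0
    for i in range(k):
        ratios = [(spread[i] + spread[j]) / np.linalg.norm(centers[i] - centers[j])
                  for j in range(k) if j != i]
        dbi += max(ratios)
    return dbi / k
```

scikit-learn also ships `sklearn.metrics.davies_bouldin_score`, which computes the same index directly from the data and labels.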
Previous Methods
Fuzzy Greedy K-means (FGK), by Chen, Tai, Harrison, and Pan:
- Separate the whole dataset into several smaller informational granules using Fuzzy C-means, which allows one data point to belong to two or more clusters; the more than 560,000 segments are clustered into 10 separate files.
- Run the greedy K-means clustering algorithm on each granule: apply k-means several times to obtain "good" initial centroids that produce clusters with relatively high structural similarity, then run k-means on each set (a sketch of one reading of this greedy step follows).
- Find the consensus sequence for each cluster.
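One way to read the greedy initialization: run k-means several times, keep the best-scoring run's centroids, and seed the final run with them. A sketch under that reading, using inertia as a stand-in for the paper's structural-similarity criterion (not the authors' exact procedure):

```python
import numpy as np
from sklearn.cluster import KMeans

def greedy_kmeans(X, k, trials=5, seed=0):
    """Run k-means `trials` times, keep the run with the best score
    (lowest inertia here), then refine starting from its centroids."""
    best = min((KMeans(n_clusters=k, n_init=1, random_state=seed + t).fit(X)
                for t in range(trials)),
               key=lambda km: km.inertia_)
    return KMeans(n_clusters=k, init=best.cluster_centers_, n_init=1).fit(X)
```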
New Approach
Matrix factorization: a dimension-reduction technique.
Example: 2-dimensional data = {(1.09, 2), (7, 14), (10, 10.1)}.
New Approach
Matrix factorization
- Let A be an n x m data matrix (n: number of dimensions, m: number of data points).
- Factorize A into the product of two matrices, W and H: A ≈ WH.
- W is an n x k basis matrix.
- H is a k x m coefficient matrix, where k << min(n, m).
- This reduces dimensionality by constructing a smaller number of basis vectors, and also reduces data noise.
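The 2-D example above can be written as a 2 x 3 matrix and compressed with k = 1: the first two points lie almost exactly on the line y = 2x, so a single basis vector reconstructs them well. A small numpy illustration, using SVD as the factorization for simplicity:

```python
import numpy as np

A = np.array([[1.09, 7.0, 10.0],   # n = 2 dimensions, m = 3 data points
              [2.0, 14.0, 10.1]])
U, s, Vt = np.linalg.svd(A, full_matrices=False)
W = U[:, :1] * s[:1]               # n x k basis matrix (k = 1)
H = Vt[:1, :]                      # k x m coefficient matrix
print(np.round(W @ H, 2))          # rank-1 approximation of A
```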
New Approach
Principal Component Analysis (PCA)
- Maximizes the data variance (equivalently, minimizes the projection error).
- The most accurate method; produces the optimal basis vectors automatically.
- Suited to linear systems.
Vector Quantization (VQ)
- Uses K-means clustering; winner-take-all; suited to nonlinear systems.
Non-negative Matrix Factorization (NMF)
- Matrix factorization with non-negativity constraints.
- Sparse NMF: NMF with additional sparseness constraints.
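The key contrast is easy to see on nonnegative data: PCA bases may contain negative entries (cancellation is allowed), while NMF bases stay nonnegative (purely additive parts). A small scikit-learn comparison on an arbitrary nonnegative matrix:

```python
import numpy as np
from sklearn.decomposition import PCA, NMF

X = np.random.default_rng(0).random((100, 20))   # nonnegative data
pca_bases = PCA(n_components=5).fit(X).components_
nmf_bases = NMF(n_components=5, init='nndsvd', max_iter=500).fit(X).components_
print((pca_bases < 0).any())   # True: PCA allows cancelling negative weights
print((nmf_bases < 0).any())   # False: NMF parts are additive only
```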
New Approach
Comparing PCA, VQ, and NMF on image processing (faces):
- NMF: sparse bases give part-based representations.
- VQ: each basis represents a prototype face.
- PCA: bases do not admit any contextual interpretation.
(Image from "Learning the parts of objects by non-negative matrix factorization," by Lee and Seung.)
New Approach
Non-negative matrix factorization with sparseness constraints (SNMF/R), by Kim and Park:
- A: n x m data matrix.
- Control the sparseness of the coefficient matrix H with the parameter beta:

  min_{W >= 0, H >= 0} (1/2) ( ||A - WH||_F^2 + eta ||W||_F^2 + beta sum_{j=1}^{m} ||H(:,j)||_1^2 )
New Approach
Applying sparse NMF to the clustering problem
- NMF with a sparseness constraint on H can be used for clustering: each data point is assigned to the basis with the largest coefficient.
- Example: clustering the leukemia gene expression dataset into 3 clusters with high accuracy. (Image from "Sparse Non-negative Matrix Factorizations via Alternating Non-negativity-constrained Least Squares for Microarray Data Analysis," by Kim and Park.)
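A sketch of the clustering step using scikit-learn's NMF as a stand-in for SNMF/R (sklearn applies a plain L1 penalty to H via alpha_H and l1_ratio rather than the squared-L1 column penalty of Kim and Park, so this is an approximation; assumes scikit-learn >= 1.0):

```python
import numpy as np
from sklearn.decomposition import NMF

def nmf_cluster(A, k):
    """A: nonnegative (n, m) matrix with data points as columns.
    Factor A ~ W @ H (W: n x k, H: k x m) with an L1 penalty on H only,
    then assign column j to the cluster argmax_i H[i, j]."""
    model = NMF(n_components=k, init='nndsvda', max_iter=1000,
                alpha_W=0.0, alpha_H=0.1, l1_ratio=1.0)  # sparsity on H only
    W = model.fit_transform(A)      # n x k basis matrix
    H = model.components_           # k x m sparse coefficient matrix
    return H.argmax(axis=0)         # one cluster label per data point
```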
New Approach
Applying sparse NMF to the motif discovery problem
- Partition the whole dataset into clusters using NMF with sparseness on H.
  Input: the dataset, an integer k (the number of clusters), and sparseness parameters chosen using the quality measurements.
  Output: k clusters.
- Discover a representative motif for each cluster.
  Input: for each cluster, a set of sequences with window size 9.
  Output: the sequence with the smallest cost.
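The slides do not spell out the cost function, so as a minimal sketch the consensus below takes the cluster's mean profile and reports the most frequent amino acid at each of the 9 positions, standing in for "the sequence with the smallest cost":

```python
import numpy as np

AA = list("VLIMFWYGAPSTCHRKQEND")   # amino-acid column order used in the slides

def consensus_motif(segments):
    """segments: (N, 9, 20) profile segments belonging to one cluster.
    Returns the 9-letter consensus built from the mean profile."""
    mean_profile = segments.mean(axis=0)                 # (9, 20)
    return "".join(AA[j] for j in mean_profile.argmax(axis=1))
```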
Problems
- Incorporate secondary-structure information into the data: use the Chou-Fasman parameters.
- When the number of clusters is too large, the chance that NMF assigns a data point to the correct cluster becomes low: further divide a big file into smaller files.
Experiments and Results
Figure: coefficient matrices for K=3 (top) and K=45 (bottom). In the highlighted region, when K=45 many coefficients remain nonzero, each carrying about 10% of the weight, which makes it hard to assign the data point to one cluster out of 45.
Experiments and Results
Table: Chou-Fasman parameters. For each of the 20 amino acids (Alanine, Arginine, Aspartic Acid, Asparagine, Cysteine, Glutamic Acid, Glutamine, Glycine, Histidine, Isoleucine, Leucine, Lysine, Methionine, Phenylalanine, Proline, Serine, Threonine, Tryptophan, Tyrosine, Valine), the table lists the helix propensity P(a), sheet propensity P(b), turn propensity P(turn), and the bend frequencies f(i), f(i+1), f(i+2), f(i+3).
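One way the table can be incorporated: map each profile row to its expected Chou-Fasman propensities (a frequency-weighted average of P(a), P(b), P(turn)) and append them as extra columns, turning each 9 x 20 segment into a 9 x 23 one. A sketch with placeholder propensities (the CF matrix below is illustrative only; substitute the published Chou-Fasman values):

```python
import numpy as np

AA = list("VLIMFWYGAPSTCHRKQEND")
# Placeholder table: rows follow the AA order above; columns are P(a), P(b), P(turn).
# These ones are NOT the published values -- fill in the real Chou-Fasman table.
CF = np.ones((20, 3))

def add_chou_fasman(segment):
    """segment: (9, 20) frequency profile. Appends the expected helix/sheet/turn
    propensity at each position, giving a (9, 23) augmented segment."""
    return np.hstack([segment, segment @ CF])
```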
Experiments and Results
Figure: the FGK model (image from the paper by Chen et al.), with the additional steps for Double FCM+CF+SNMF.
Experiments and Results
Figure: the FGK model (image from the paper by Chen et al.), with the additional steps for FCM+CF+Kmeans.
Experiments and Results
Results

Method                  >60%     >70%     DBI
Traditional             25.82%   10.44%   6.09
FCM                     37.14%   15.57%   4.36
FGK                     42.93%   14.39%   4.63
Single FCM+SNMF/R       24.41%    5.76%   4.09
Double FCM+SNMF/R       44.07%   12.73%   5.42
Double FCM+Kmeans       38.45%   13.73%   6.33
Single FCM+CF+Kmeans    41.30%   16.89%   4.28
Double FCM+CF+Kmeans    42.94%   13.23%   5.67
Double FCM+CF+SNMF/R    48.44%   16.23%   4.81

(>60% and >70% refer to the secondary structural similarity thresholds; DBI is the Davies-Bouldin index, lower is better.)
Experiments and Results
Example motif images: only amino acids occurring with more than 8% frequency at a position are shown.
Conclusion
Sparse NMF (SNMF/R) often produces more consistent clustering results than K-means with random initialization; unlike K-means, its performance does not depend on the initial centroids.
We include the Chou-Fasman parameters in the data representation in order to incorporate secondary-structure information.
Files with too many clusters are divided further, so that no file exceeds 14 clusters.
Single FCM+SNMF/R: best DBI.
Double FCM+CF+sparse NMF: best secondary structural similarity.
Future work: FGK+CF might increase the secondary structural similarity further.