“Using Sequence Motifs for Enhanced Neural Network Prediction of Protein Distance Constraints,” J. Gorodkin, O. Lund, C. A. Andersen, S. Brunak. In ISMB ’99.


“Using Sequence Motifs for Enhanced Neural Network Prediction of Protein Distance Constraints,” J. Gorodkin, O. Lund, C. A. Andersen, S. Brunak. In ISMB ’99. Bio Learning Group Seminar. Speaker: Eom Jae-Hong, 2001/03/19.

2 Abstract Investigate the correlation  Between sequence separation and distance. For pairs of amino acids where the distance between atoms is smaller than the threshold   a characteristic sequence motif is found. The motifs change as the sequence separation increases. Find correlations between the residues in the center of the motif.   Used to design a new NN guided by the statistical analysis. Statistical analysis  Explains why neural networks perform better than simple statistical data-driven approaches (e.g. pair probability density functions).

3 Introduction The ability to predict structure from sequence  Depends on constructing an appropriate cost function for the native structure.  To find this function  Concentrate on finding a method to predict distance constraints  That correlate well with the observed distances in proteins. The neural network approach is  The only approach so far which includes sequence context for the considered pair of amino acids.  Performs better.  Captures more features relating distance constraints and sequence composition.

4 Introduction (cont’d) The analysis includes investigation of the distances  Between amino acids, as well as sequence motifs and correlations for separated residues. Construct a prediction scheme.  Significantly improves on an earlier approach (Lund et al. 1997). For each particular sequence separation  The corresponding distance threshold is computed  As the avg. of all physical distances in a large data set between any two amino acids separated by that number of residues (Lund et al. 1997).
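The threshold computation above can be sketched as follows; the chain representation (a list of Cα coordinates per chain) and the function name are illustrative, not from the paper.

```python
# Sketch: per-separation distance thresholds as the mean of all
# pairwise distances at that sequence separation (Lund et al. 1997).
import math

def separation_thresholds(chains, max_sep):
    """chains: list of lists of (x, y, z) C-alpha coordinates (illustrative)."""
    sums = {s: 0.0 for s in range(2, max_sep + 1)}
    counts = {s: 0 for s in range(2, max_sep + 1)}
    for coords in chains:
        n = len(coords)
        for s in range(2, min(max_sep, n - 1) + 1):
            for i in range(n - s):
                sums[s] += math.dist(coords[i], coords[i + s])
                counts[s] += 1
    # mean distance per separation = the distance-constraint threshold
    return {s: sums[s] / counts[s] for s in sums if counts[s]}
```

On a toy straight chain with unit spacing, the threshold for separation s is simply s, which makes the convention easy to check.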

5 Introduction (cont’d) Here, include an analysis of the distance distributions relative to these thresholds.  Use this to explain the qualitative behavior of the neural network prediction scheme. Analysis of the network weight composition  Reveals intriguing properties of the distance constraints.   “The sequence motifs can be decomposed into sub-motifs associated with each of the hidden units in the neural network.”  As the sequence separation increases there is  A clear correspondence in the change of the mean value, distance distribution, and the sequence motifs describing the distance constraints of the separated amino acids.  The predicted distance constraints  May be used as inputs to threading or loop modeling algorithms.

6 Material and Method – Data extraction Data  Extracted from the Brookhaven Protein Data Bank (Bernstein et al. 1977), containing 5762 proteins.  Entries were excluded if:  The secondary structure of the proteins could not be assigned by the program DSSP (Kabsch & Sander 1983).  The proteins had any physical chain breaks.  They had a resolution value greater than 2.5 Ångström.  Individual chains of entries were discarded if:  They had a length of less than 30 amino acids.  They had less than 50% secondary structure assigned, as defined by the program DSSP.  They had more than 85% non-amino acids in the sequence.  They had more than 10% non-standard amino acids (B, X, Z).
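A minimal sketch of the chain-level filters listed above; the dictionary field names are hypothetical, while the thresholds are the ones quoted on the slide.

```python
# Hypothetical chain record: precomputed length and composition fractions.
def keep_chain(chain):
    """Apply the slide's chain-level inclusion criteria."""
    if chain["length"] < 30:                        # too short
        return False
    if chain["frac_secondary_structure"] < 0.50:    # per DSSP assignment
        return False
    if chain["frac_non_amino"] > 0.85:              # mostly non-amino content
        return False
    if chain["frac_nonstandard"] > 0.10:            # B, X, Z residues
        return False
    return True
```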

7 Material and Method – Data extraction  A representative set with low pairwise sequence similarity was selected  By running algorithm #1 of Hobohm et al. (1992) as implemented in the program RedHom (Lund et al. 1997).  Sequence sorting  According to resolution (all NMR structures were assigned resolution 100).  Sequences with the same resolution were sorted so that higher priority was given to longer proteins.  Sequences aligned (local alignment program)  Ssearch (Myers & Miller 1988; Pearson 1990)  Using the pam120 amino acid substitution matrix (Dayhoff & Orcutt 1978) with gap penalties –12, –4.  A cutoff threshold on sequence similarity (% identity in the alignment) was used.
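Algorithm #1 of Hobohm et al. (1992) can be sketched as a single greedy pass over the sorted list; the `similar` predicate below stands in for the Ssearch alignment plus the identity cutoff, and is supplied by the caller.

```python
# Greedy redundancy reduction: keep a sequence only if it is not
# similar to any sequence already kept. Input must be sorted by
# priority (best resolution first, longer chains first on ties).
def hobohm1(sorted_seqs, similar):
    kept = []
    for seq in sorted_seqs:
        if all(not similar(seq, k) for k in kept):
            kept.append(seq)
    return kept
```

Because the list is priority-sorted, each similarity cluster is represented by its highest-priority member.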

8 Material and Method – Data extraction  Ten cross-validation sets were selected such that  They all contain approximately the same number of residues.  And all have the same length distribution of the chains.  All the data are made publicly available 

9 - Information content / relative entropy measure Relative entropy  Used to measure the information content (Kullback & Leibler 1951) of aligned regions between separated residues.  Information content  The information content at each position will sometimes be displayed as a sequence logo (Schneider & Stephens 1990).  The position-dependent information content is given by I_i = Σ_k q_ik log2(q_ik / p_k), where q_ik is the observed fraction of symbol k at position i and p_k is the background probability of finding symbol k by chance in the sequence.  Symbols in logos turned 180 degrees  the symbol is observed less often than expected from the background.
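The position-dependent information content can be computed directly from the definition above; the symbol set in the usage check is illustrative.

```python
# I_i = sum_k q_ik * log2(q_ik / p_k): relative entropy of the observed
# column distribution q_i against the background distribution p.
import math

def information_content(q_i, p):
    """q_i, p: dicts mapping symbol -> probability."""
    return sum(q * math.log2(q / p[k]) for k, q in q_i.items() if q > 0)
```

When the column matches the background exactly, the information content is 0; a fully conserved symbol against a uniform background over N symbols gives log2(N) bits.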

10 - Neural networks In previous work (Baldi & Brunak 1998, …)  Applied two-layer feed-forward neural networks  Trained by standard back-propagation  To predict whether two residues are below or above a given distance threshold in space. Lund et al. 1997  The inputs were processed as two windows centered around each of the separated amino acids. Here, extend the previous scheme  By allowing the windows to grow towards each other, and even merge into a single large window covering the complete sequence between the separated amino acids.  Increases the computational requirements.  But allows us to search for the optimal coverage between the separated amino acids.
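The growing/merging windows can be sketched as a sparse (one-hot) encoding over the union of the two windows' position sets, so that wide enough windows merge into one region automatically; the window size and out-of-range handling here are illustrative choices, not the paper's exact scheme.

```python
# One-hot encode two windows centred on the separated residues i and j.
# Overlapping windows merge because positions are taken as a set union.
ALPHABET = "ACDEFGHIKLMNPQRSTVWY"

def encode_windows(seq, i, j, half_width):
    positions = sorted(set(range(i - half_width, i + half_width + 1))
                       | set(range(j - half_width, j + half_width + 1)))
    vec = []
    for p in positions:
        one_hot = [0] * len(ALPHABET)
        if 0 <= p < len(seq) and seq[p] in ALPHABET:   # all-zero outside the chain
            one_hot[ALPHABET.index(seq[p])] = 1
        vec.extend(one_hot)
    return vec
```

With this encoding, each input weight of a network reading `vec` corresponds to exactly one amino acid at one position, which is what makes the weight analysis on the later slides possible.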

11 - Neural networks  Positive (contact) and negative (non-contact) windows.  Apply the balanced learning approach (Rost & Sander 1993).  Training  Done by a 10-set cross-validation approach (Bishop 1996).  Calculate the average performance over the partitions.  The performance on each partition is evaluated by the Matthews correlation coefficient (Matthews 1975).  The analysis of the patterns stored in the weights of the network is done through the saliency:  The cost of removing a single weight while keeping the remaining ones.
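The Matthews correlation coefficient used for evaluation, computed from a contact / non-contact confusion matrix:

```python
# MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN));
# 1 = perfect, 0 = no better than chance, -1 = total disagreement.
import math

def mcc(tp, tn, fp, fn):
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0
```

Unlike raw accuracy, the MCC stays meaningful on the unbalanced contact/non-contact classes, which is why it pairs naturally with the balanced learning approach above.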

12 - Neural networks (cont’d)  Each weight  Connected to a hidden unit corresponds exactly to a particular amino acid at a given position in the sequence windows used as inputs  Due to the sparse encoding.  Obtain a ranking of symbols  At each position in the input fields. To compute the saliencies  Use the approximation for two-layer one-output networks (Gorodkin et al. 1997).  The saliencies for the weights between the input and hidden layers can then be written in terms of w_ij, the weight between input i and hidden unit j, and v_j, the weight between hidden unit j and the output.
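The saliency definition above (the cost increase when one weight is removed while the rest stay fixed) can be illustrated by brute force on a tiny two-layer network; Gorodkin et al. (1997) use a closed-form approximation instead, and the network and data here are purely illustrative.

```python
# Brute-force saliency: zero each input-to-hidden weight in turn and
# measure how much the cost on the data set increases.
import math

def forward(x, W, v):
    """Two-layer net, tanh units, one output; W: hidden x input, v: output weights."""
    hidden = [math.tanh(sum(w_i * x_i for w_i, x_i in zip(row, x))) for row in W]
    return math.tanh(sum(v_j * h_j for v_j, h_j in zip(v, hidden)))

def cost(data, W, v):
    return sum((forward(x, W, v) - t) ** 2 for x, t in data)

def saliencies(data, W, v):
    base = cost(data, W, v)
    sal = [[0.0] * len(W[0]) for _ in W]
    for j, row in enumerate(W):
        for i, w in enumerate(row):
            row[i] = 0.0                     # remove weight w_ij
            sal[j][i] = cost(data, W, v) - base
            row[i] = w                       # restore it
    return sal
```

Because of the sparse input encoding, ranking these saliencies position by position ranks the amino acid symbols themselves, which is how the sub-motifs per hidden unit are read off.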

13 Results Conduct a statistical analysis of the data and distance constraints between amino acids. Use the results to  Design and explain the behavior of a neural network prediction scheme with enhanced performance.

14 - Statistical analysis Derive the mean of all the distances  Between pairs of atoms.  Use these means as distance constraint thresholds. To analyze which pairs are above and below the threshold, it is relevant to compare:  The distribution of distances between amino acid pairs below and above the threshold.  The sequence composition of segments  Where the pairs of amino acids are below and above the threshold. Investigate the length distribution of the distances  As a function of the sequence separation.

15 - Statistical analysis (cont’d) From Figure 1:  α-helices make a distinct peak up to ~20 separations.  3₁₀-helices make a distinct peak up to ~5 separations.  The distance distribution at separation 3  Is the most bimodal.  Provides the most distinct partition of the data points.   the best prediction of distance constraints can be obtained there.  The α-helix peak shifts relative to the mean  When the separation is 11 to 13.  For an optimized predictor, it can be slightly harder  To predict distance constraints for separation 12 than for 11 or 13.

16 Distance distribution approaches a universal shape

17 - Statistical analysis (cont’d) The helix length of 12 residues coincides with seq. sep. 12.   hard to predict distance constraints there. Bimodal distribution  unimodal distribution   prediction of distance constraints becomes harder with increasing sequence separation. The universality only appears  When the distribution is displaced by its mean distance.   we can use the mean as a threshold.

18 (figure)

19 - Statistical analysis (cont’d) To use the information available in the sequence:  Sequence segments above the threshold   used to calculate a position-dependent background distribution.  Sequence segments below the threshold   all aligned and displayed in a sequence logo using the computed background distribution.  Sequence information content curves  Figure 3.  Corresponding sequence logos  Figure 4.  For larger sequence separations, the motif consists of 3 peaks  1 center peak and 2 peaks at the separated amino acids.

20 - Statistical analysis (cont ’ d) Sequence information content curves

21 - Statistical analysis (cont’d) The motifs smear out at large separations (universal distance distribution)  the sequence motif becomes “universal”.

22 (figure)

23 (figure)

24 Neural networks: prediction and behavior Use a NN to predict optimal distance constraints  Have to consider the sequence separation distance. Two-layer network  1 output unit, 5 hidden units.  The size of the input field may vary. (Quantitative) Investigate the relation between  The seq. motifs in the logos and the amount of sequence context needed in the prediction scheme.  Choose the amount of seq. context with local windows around the separated amino acids  extend the seq. region r.  For all seq. separations 2 to 99, train 8000 networks, using 10-fold cross-validation.

25 Neural networks: prediction and behavior

26 Due to the lack of motif

27 Neural networks: prediction and behavior The best performing network is the one  Which uses as much context as possible.  Beyond a certain number of residues  The amount of context used is not a factor anymore.  Performance curve fluctuations occur. We can use the networks as an indicator for  When a sequence motif is well defined (using the fluctuations). Independent prediction test on nine CASP3 targets:  Prev. method: 64.5% correct prediction  With correlation coefficient.   70.3%.

28 Neural networks: prediction and behavior Prediction example of distance constraints for R0067. Result  Predictions up to a sequence separation of 30 clearly capture the main part of the distance constraints.

29 Neural networks: prediction and behavior (Qualitative) Investigate the relation between  The network performance and the information content in the sequence logos.   The two curves have the same qualitative behavior as the sequence separation increases (Figure 7):  Peak at separation 3.  Drop at separation 12.  Plateau from seq. sep. 30 on  possibly due to the decreasing sample size.

30 (figure)

31 (figure)