Prediction of Local Structure in Proteins Using a Library of Sequence-Structure Motifs. Christopher Bystroff & David Baker. Paper presented by: Tal Blum.


The Approach: Learn a set of clusters of structure segments that can be identified from short local sequences. Combine the resulting local structure predictions into one whole structure.

Methods - Database: The database contains 471 protein sequence families (Sander & Schneider 1994). Each family contains one sequence of known structure. No two alignments share more than 25% sequence identity. Only well-determined structures of non-membrane proteins are included.

Clustering of Sequence Segments: Each position in the database is described by weighted amino acid frequencies (Vingron & Argos 1989). Similarity between a sequence segment and a cluster is measured by a cross-entropy score. Segments of each length from 3 to 15 residues were clustered with the K-means algorithm, an unsupervised method.
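As a rough illustration of scoring a segment against a cluster, the sketch below computes the textbook cross-entropy between two position-specific frequency profiles. The slide's own formula is not reproduced in the transcript, so the exact form, the function name `cross_entropy`, and the `floor` pseudo-probability are assumptions rather than the authors' definition.

```python
import math

def cross_entropy(segment_profile, cluster_profile, floor=1e-6):
    """Textbook cross-entropy between two profiles of equal length.

    Each profile is a list of dicts mapping amino acid -> frequency,
    one dict per position. Lower values mean the segment looks more
    like the cluster. `floor` (an assumption) avoids log(0).
    """
    if len(segment_profile) != len(cluster_profile):
        raise ValueError("profiles must have the same length")
    total = 0.0
    for seg_col, clu_col in zip(segment_profile, cluster_profile):
        for aa, p in seg_col.items():
            q = max(clu_col.get(aa, 0.0), floor)
            total -= p * math.log(q)
    return total

# A segment that is all alanine at one position, scored against two clusters.
segment = [{"A": 1.0}]
loose_cluster = [{"A": 0.5, "G": 0.5}]
tight_cluster = [{"A": 0.9, "G": 0.1}]
scores = (cross_entropy(segment, loose_cluster),
          cross_entropy(segment, tight_cluster))
```

In a K-means loop, each segment would be assigned to the cluster with the lowest score and the cluster profiles recomputed from their members.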

Assessing Structure within a Cluster and Choice of Paradigm: Structural similarity between two peptide segments is measured by dme (distance matrix error) and mda (maximum deviation in backbone torsion angles). In the dme, α^{s1}_{i->j} is the distance between α-carbon atoms i and j in segment s1. The paradigm for a cluster was chosen from the top 20 segments as the one with the smallest sum of mda/dme values to the others.
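The two similarity measures on this slide can be sketched as follows. This is a minimal version under stated assumptions: the dme is taken as the root-mean-square difference between the two intra-segment α-carbon distance matrices, and the mda as the largest φ or ψ difference over the segment. The `min_sep` parameter and the exact set of residue pairs and angles are illustrative choices, not taken from the slides.

```python
import math

def dme(coords_a, coords_b, min_sep=1):
    """RMS difference between two alpha-carbon distance matrices.

    coords_a, coords_b: equal-length lists of (x, y, z) tuples.
    Only pairs i < j with j - i >= min_sep are compared (assumption).
    """
    if len(coords_a) != len(coords_b):
        raise ValueError("segments must have the same length")
    sq_sum, pairs = 0.0, 0
    n = len(coords_a)
    for i in range(n):
        for j in range(i + min_sep, n):
            d_a = math.dist(coords_a[i], coords_a[j])
            d_b = math.dist(coords_b[i], coords_b[j])
            sq_sum += (d_a - d_b) ** 2
            pairs += 1
    return math.sqrt(sq_sum / pairs)

def angle_diff(a, b):
    """Smallest absolute difference between two angles in degrees."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)

def mda(torsions_a, torsions_b):
    """Largest phi or psi difference between two segments.

    torsions_a, torsions_b: equal-length lists of (phi, psi) in degrees.
    """
    return max(
        max(angle_diff(pa, pb), angle_diff(sa, sb))
        for (pa, sa), (pb, sb) in zip(torsions_a, torsions_b)
    )
```

Angle differences are wrapped so that 170° and -175° count as 15° apart rather than 345°.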

True/False Boundaries in Structure Space: Boundaries are used in the refinement procedure. To find natural boundaries, histograms of dme and mda relative to the paradigm were computed over all segments in the cluster. The boundary was set at the point where the histogram first dropped to ½ of its maximum. If the boundary reached 130° (mda) or 1.3 Å (dme), the cluster was rejected. The average boundaries were 81° and 0.89 Å. In total, 82 clusters were constructed (the I-site library).
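The half-maximum rule can be sketched as below. The bin width, the choice to scan rightward from the peak bin, and the function name `natural_boundary` are assumptions made for illustration; the slide only states that the boundary is where the histogram first drops to half of its maximum and that clusters reaching the rejection limit are discarded.

```python
def natural_boundary(values, bin_width, reject_at):
    """Return the half-maximum boundary, or None if the cluster is rejected.

    values: non-negative mda (degrees) or dme (angstrom) values of all
    segments in the cluster, measured against the paradigm.
    """
    if not values:
        raise ValueError("need at least one value")
    n_bins = int(max(values) // bin_width) + 1
    counts = [0] * n_bins
    for v in values:
        counts[int(v // bin_width)] += 1
    peak = counts.index(max(counts))
    half = max(counts) / 2
    boundary = n_bins * bin_width  # fallback: histogram never drops
    for b in range(peak + 1, n_bins):
        if counts[b] <= half:
            boundary = b * bin_width  # left edge of the first low bin
            break
    return None if boundary >= reject_at else boundary

# Hypothetical mda values (degrees) for one cluster.
mda_values = [5, 8, 12, 14, 15, 18, 22, 25, 28, 35, 55, 95]
boundary = natural_boundary(mda_values, bin_width=10, reject_at=130)
```

With these toy values the peak bin is 10-20° (4 segments), and the first bin at or below 2 segments starts at 30°, so the boundary is 30.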

DME vs. MDA for the 9-residue serine β-hairpin

Iterative Refinement of Clusters: Clustering increases P(cluster|sequence); refinement aims to increase P(structure|cluster). For each cluster with good boundaries: 2 residues on each side of each segment are also included. All segments that are not within the natural boundaries of the paradigm are removed. The frequency profile of the cluster is recalculated. The database is searched with the new profile, and the 400 highest-scoring segments form the new cluster.

Cross-Validation and Confidence: A 10-fold cross-validation was performed. If the 10 paradigms were not structurally the same, or if the 10 runs did not converge to the same profile, the cluster was rejected. Otherwise, a confidence curve was computed as a function of D_pq, the sequence-to-cluster similarity. This makes it possible to compare profiles of different lengths, and it incorporates both P(clust|seq) and P(struct|clust).

Confidence for Similarity

Clustering: What Do We Want? Direction: sequence -> structure. We want clusters of sequences that are as well separated as possible, so that a test sequence can be assigned to one cluster. Each cluster should have one or a few possible structures, which are used to predict the structure of the test protein. P(struct|seq) = Σ_clust P(struct|clust, seq) · P(clust|seq), which reduces to P(struct|clust) · P(clust|seq) when the sequence is assigned to a single cluster and the structure depends on the sequence only through that cluster.
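The decomposition on this slide is a sum over clusters, which a few lines of code make concrete. The sketch assumes that structure depends on sequence only through the cluster, so P(struct|clust, seq) is replaced by P(struct|clust). The cluster names, structure labels, and probabilities are made up for illustration.

```python
def p_structure_given_sequence(p_clust_given_seq, p_struct_given_clust):
    """Marginalise over clusters: sum_c P(struct|c) * P(c|seq)."""
    total = {}
    for clust, p_c in p_clust_given_seq.items():
        for struct, p_s in p_struct_given_clust[clust].items():
            total[struct] = total.get(struct, 0.0) + p_s * p_c
    return total

# Hypothetical numbers: a well-separated clustering puts most of the
# mass of P(clust|seq) on one cluster.
p_clust = {"A": 0.7, "B": 0.3}
p_struct = {
    "A": {"helix_cap": 0.9, "hairpin": 0.1},
    "B": {"helix_cap": 0.2, "hairpin": 0.8},
}
posterior = p_structure_given_sequence(p_clust, p_struct)
```

The sharper both distributions are, the closer the sum comes to the single product P(struct|clust) · P(clust|seq) for the best cluster, which is why well-separated clusters with few structures each are the goal.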

Iterative Peak Removal: In some cases, similar sequences map to different structures. When this happens, the predominant pattern occludes the second one. To find the occluded clusters, refinement was repeated on a subset of the data that excludes members of the predominant class. This helped identify two distinct α-C-cap extensions that were very similar in sequence.

Cluster Weights: Prediction accuracy is improved by weighting the confidence curves. The weights were updated iteratively from F+_C, the false-positive errors of cluster C, and F-_C, its false-negative errors.

Prediction Protocol: Given a sequence to predict:
1. Submit the sequence to PHD (Rost 94) to obtain a set of multiply aligned sequences and hence a profile.
2. Score each segment of the profile against each of the 82 clusters to produce weighted confidences.
3. Sort the confidences.
4. The first (highest-confidence) segment assigns φ and ψ from its paradigm.
5. For each subsequent segment in the sorted list, use its prediction only if it does not conflict with previously assigned φ and ψ.
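Steps 3 to 5 of the protocol amount to a greedy assignment, sketched below. The hit format, the function names, and the `tol` threshold that defines a "conflict" are assumptions made for illustration; the slides do not say how a conflict is measured.

```python
def angle_diff(a, b):
    """Smallest absolute difference between two angles in degrees."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)

def conflicts(assigned, start, angles, tol):
    """True if any already-assigned position disagrees by more than tol."""
    for off, (phi, psi) in enumerate(angles):
        prev = assigned[start + off]
        if prev is not None and (
            angle_diff(prev[0], phi) > tol or angle_diff(prev[1], psi) > tol
        ):
            return True
    return False

def assign_torsions(length, hits, tol=60.0):
    """Greedy assignment of (phi, psi) per residue.

    hits: list of (confidence, start_index, [(phi, psi), ...]) taken from
    the paradigm of the matching cluster. Hits are visited from highest
    to lowest confidence; a hit is skipped if it conflicts with angles
    assigned earlier, otherwise it fills the still-empty positions.
    """
    assigned = [None] * length
    for _conf, start, angles in sorted(hits, key=lambda h: h[0], reverse=True):
        if conflicts(assigned, start, angles, tol):
            continue
        for off, pair in enumerate(angles):
            if assigned[start + off] is None:
                assigned[start + off] = pair
    return assigned

# Hypothetical hits on a 5-residue stretch.
helix = (-60.0, -45.0)
strand = (-120.0, 130.0)
hits = [
    (0.9, 0, [helix] * 3),
    (0.7, 2, [strand] * 3),                       # clashes at residue 2
    (0.5, 2, [(-65.0, -40.0), helix, helix]),     # compatible extension
]
prediction = assign_torsions(5, hits)
```

Residues not covered by any accepted hit stay `None`, which matches the idea that the method predicts only where it is confident.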

Results: Results are reported on the training set and on an independent set of 55 protein families. Local accuracy is measured by agreement over an 8-residue window. An 8-residue segment prediction is considered correct if none of the φ and ψ differences is larger than 120°, or if the rmsd between the correct and predicted structures is less than 1.4 Å. A position is counted as an error if and only if all 8 overlapping segments are incorrect. The mda measure is stricter than the commonly used Q3 score.
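The per-position scoring rule can be sketched as follows. For brevity the sketch uses only the torsion-angle criterion (no difference above 120°) and leaves out the alternative rmsd test; the function names and the handling of chain ends, where fewer than 8 windows overlap a residue, are assumptions.

```python
def angle_diff(a, b):
    """Smallest absolute difference between two angles in degrees."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)

def window_correct(pred, true, limit=120.0):
    """A window is correct if no phi or psi differs by more than limit."""
    return all(
        angle_diff(p[0], t[0]) <= limit and angle_diff(p[1], t[1]) <= limit
        for p, t in zip(pred, true)
    )

def position_accuracy(pred, true, window=8, limit=120.0):
    """Fraction of positions covered by at least one correct window."""
    n = len(pred)
    if n != len(true) or n < window:
        raise ValueError("need two equal-length chains of at least one window")
    ok = [
        window_correct(pred[s:s + window], true[s:s + window], limit)
        for s in range(n - window + 1)
    ]
    correct = 0
    for i in range(n):
        first = max(0, i - window + 1)   # earliest window containing i
        last = min(i, n - window)        # latest window containing i
        if any(ok[first:last + 1]):
            correct += 1
    return correct / n

# Hypothetical 10-residue helix with one badly predicted psi at residue 0.
true_angles = [(-60.0, -45.0)] * 10
pred_angles = [(-60.0, 135.0)] + [(-60.0, -45.0)] * 9
accuracy = position_accuracy(pred_angles, true_angles)
```

Only residue 0 is counted as an error here, because every other residue lies in at least one window that excludes the bad angle.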

Results: Training Set
–471 sequences -> 122,510 residues
–95% of the 471 had at least 1 match with confidence ≥ 0.8
–40% of the residues had confidence ≥ 0.6 and were 71% (mda) correct

Results

Combinations of I-sites and Conventional Secondary Structure Predictions: I-sites was compared with the PHD program. The comparison requires translating torsion angles into secondary structure, or secondary structure into torsion angles. Each program performed better in its own domain. The Q3 score was 64%, reflecting underprediction of loops and overprediction of strands. I-sites was much better in loops and at the specific angles of turns, so it can complement PHD.

Comparison of I-Site & PHD

I-site Library: The 82 clusters represent 13 structural motifs.

Summary of the I-site library

Conclusions: The method is fast, since it requires only profile comparisons. Each prediction comes with a measure of confidence. The authors do not report accuracy over the whole protein. They believe that the strong local sequence-structure relationships (those that occur more than 30 times) are present in the I-site library.

Discussion: NMR studies of isolated peptides of fewer than 30 residues show that such peptides do not have a well-defined structure; the I-site motifs are the exceptions. The motifs may be the regions that adopt structure independently of the rest of the protein. A possible extension is context-specific motifs.

2 Approaches for Global Scoring Functions
Derived from the protein database
–Large # of parameters
–Complicated potentials
Based on chemical intuitions
–Simpler
–Clearer insights into sequence/structure relations
They chose the database approach
–Because of the dangers of crafting a measure for a specific protein family rather than for the whole DB

Scoring Functions: P(Seq|Str) is used when computing sequence profiles for motifs. P(Structure) is the hardest to estimate and contains most of the non-local interactions. For ab initio prediction, P(Structure) captures the features that distinguish folded structures from random-chain (local) configurations.

Radius of Gyration² Scoring Function: Measures the spread of the chain around the center of the fold (the mean squared distance of the residues from the centroid).

Radius of Gyration² Scoring Function
Advantages
–Does not depend on the alpha-beta decomposition: since the generated structures are built from segments of real proteins, their alpha-beta decomposition is much like that of real proteins
Disadvantages
–Structures with paired beta strands are no more probable than those with unpaired beta strands
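For reference, the squared radius of gyration is conventionally the mean squared distance of the atoms from their centroid. The sketch below uses that standard definition with one unweighted point per residue (for example the α-carbon); whether the authors weight atoms or use a different representative atom is not stated in the slides, so that choice is an assumption.

```python
def radius_of_gyration_sq(coords):
    """Mean squared distance of the points from their centroid.

    coords: non-empty list of (x, y, z) tuples, one per residue.
    """
    n = len(coords)
    if n == 0:
        raise ValueError("need at least one coordinate")
    cx = sum(c[0] for c in coords) / n
    cy = sum(c[1] for c in coords) / n
    cz = sum(c[2] for c in coords) / n
    return sum(
        (x - cx) ** 2 + (y - cy) ** 2 + (z - cz) ** 2 for x, y, z in coords
    ) / n

# A compact square of four points versus the same points stretched out.
compact = [(0, 0, 0), (2, 0, 0), (0, 2, 0), (2, 2, 0)]
extended = [(0, 0, 0), (4, 0, 0), (8, 0, 0), (12, 0, 0)]
rg2_compact = radius_of_gyration_sq(compact)
rg2_extended = radius_of_gyration_sq(extended)
```

The score depends only on how compact the chain is, which is consistent with the disadvantage listed above: it cannot tell paired beta strands from unpaired ones at the same compactness.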