Multiple Mapping Method with Multiple Templates (M4T): optimizing sequence-to-structure alignments and combining unique information from multiple templates.

Presentation transcript:

Multiple Mapping Method with Multiple Templates (M4T): optimizing sequence-to-structure alignments and combining unique information from multiple templates
András Fiser
Department of Biochemistry and Seaver Center for Bioinformatics, Albert Einstein College of Medicine, Bronx, New York, USA

Comparative protein structure modeling
Flowchart: START → Template Search → Target–Template Alignment → Model Building → Model Evaluation → END
Methods annotated on the flowchart: Multiple Templates; Multiple Mapping Method; Loop, side chain modeling; Statistical potential

Why do we need sequence alignments?
– Sequence vs. structure: to generate the input alignment for comparative modeling / threading
– Sequence vs. databases: to query sequence databases
– Sequence vs. sequence: to establish residue equivalences between two proteins and locate conserved/variable regions

Ranking of models built on alternative alignments
Problem: None of the currently available methods produces consistently superior results in all cases.

Template    VLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRFKHLKTEAEMKASEDLKKHGVTVLTALGAILKKK
Target CLW  DWTDAERAAIKALWGKIDVGEIGP--QALSRLLIVYPWTQRHFKGFGNISTNAAILGNAKVAEHGKTVMGGLDRAVQNM
Target A2D  DWTDAERAAIKALWGKI--DVGEIGPQALSRLLIVYPWTQRHFKGFGNISTNAAILGNAKVAEHGKTVMGGLDRAVQNM

Template    GHHEAELKPLAQSHATKHKIPIKYLEFISEAIIHVLHSRH-PGDFGADAQGAMNKALELFRKDIAAKYKELGY
Target CLW  DNIKNVYKQLSIKHSEKIHVDPDNFRLLGEIITMCVGAKFGPSAFTPEIHEAWQKFLAVVVSALGRQYH----
Target A2D  DNIKNVYKQLSIKHSEKIHVDPDNFRLLGEIITMCVGAKF-G---PSAFTPEIHEAWQKFLAVVVSALGRQYH

CLW: ClustalW alignment; A2D: Align2D alignment.
Example: Template: 1a6m; Target: 1spg, chain B; ~21% sequence identity
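The "~21% sequence identity" figure is a property of an aligned pair of rows. As a minimal sketch of how such a number can be computed, the function below counts identical residues over the columns where neither row has a gap. The function name and the choice of denominator (aligned, gap-free columns) are assumptions for illustration; other conventions divide by the shorter sequence length or by the full alignment length, and the slides do not say which one was used.

```python
def percent_identity(row_a: str, row_b: str, gap: str = "-") -> float:
    """Percent of identical residues over columns where neither row is gapped.

    Hypothetical helper: the denominator convention is an assumption.
    """
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    aligned = 0
    matches = 0
    for a, b in zip(row_a, row_b):
        if a == gap or b == gap:
            continue  # skip columns that contain a gap in either row
        aligned += 1
        if a == b:
            matches += 1
    if aligned == 0:
        return 0.0
    return 100.0 * matches / aligned


if __name__ == "__main__":
    # Toy rows, not the 1a6m / 1spg alignment from the slide.
    print(round(percent_identity("ACDE", "AC-F"), 1))
```

Because the alternative alignments place gaps differently, the same template and target can yield slightly different identity values depending on which alignment is scored.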

Alternative solutions vs. sequence similarity
Instead of relying on just one alignment method, one should combine the results of several alternative techniques.

Multiple Mapping Method
Idea:
– Improve the accuracy of sequence-to-structure alignment by optimally splicing alternative inputs.
Three components:
– Sampling
– Algorithm
– Scoring function

MMM scoring function: increasing the dimensionality of input information

Template    VLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRFKHLKTEAEMKASEDLKKHGVTVLTALGAIL
Target CLW  DWTDAERAAIKALWGKIDVGEIGP--QALSRLLIVYPWTQRHFKGFGNISTNAAILGNAKVAEHGKTVMGGLDRAV
Template    KKKGHHEAELKPLAQSHATKHKIPIKYLEFISEAIIHVLHSRH-PGDFGADAQGAMNKALELFRKDIAAKYKELGY
Target CLW  QNMDNIKNVYKQLSIKHSEKIHVDPDNFRLLGEIITMCVGAKFGPSAFTPEIHEAWQKFLAVVVSALGRQYH----

Template    VLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRFKHLKTEAEMKASEDLKKHGVTVLTALGAIL
Target A2D  DWTDAERAAIKALWGKI--DVGEIGPQALSRLLIVYPWTQRHFKGFGNISTNAAILGNAKVAEHGKTVMGGLDRAV
Template    KKKGHHEAELKPLAQSHATKHKIPIKYLEFISEAIIHVLHSRH-PGDFGADAQGAMNKALELFRKDIAAKYKELGY
Target A2D  QNMDNIKNVYKQLSIKHSEKIHVDPDNFRLLGEIITMCVGAKF-G---PSAFTPEIHEAWQKFLAVVVSALGRQYH

Each alternative mapping places the residues to be aligned in a different environment.
Assess the “fitness” of each mapping.

Multiple Mapping Method: Algorithm
Step 1: Identify variable regions from the consensus alignment of the input set.
Step 2: Select the best-scoring variable segments, and combine them with the core region of the alignment.

Template    VLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRFKHLKTEAEMKASEDLKKHGVTVLTALGAILKKK
Target CLW  DWTDAERAAIKALWGKIDVGEIGP--QALSRLLIVYPWTQRHFKGFGNISTNAAILGNAKVAEHGKTVMGGLDRAVQNM
Target A2D  DWTDAERAAIKALWGKI--DVGEIGPQALSRLLIVYPWTQRHFKGFGNISTNAAILGNAKVAEHGKTVMGGLDRAVQNM

Template    GHHEAELKPLAQSHATKHKIPIKYLEFISEAIIHVLHSRH-PGDFGADAQGAMNKALELFRKDIAAKYKELGY
Target CLW  DNIKNVYKQLSIKHSEKIHVDPDNFRLLGEIITMCVGAKFGPSAFTPEIHEAWQKFLAVVVSALGRQYH----
Target A2D  DNIKNVYKQLSIKHSEKIHVDPDNFRLLGEIITMCVGAKF-G---PSAFTPEIHEAWQKFLAVVVSALGRQYH

Example: Template 1a6m; Target 1spg, chain B; 21% sequence identity
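The two steps can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the published MMM implementation: each input alignment is reduced to a mapping from target residue index to template residue index, "core" is taken to mean target positions that every input maps to the same template residue, and `score_segment` is a hypothetical stand-in for the real scoring function. All function names are invented for the example.

```python
from typing import Callable, List, Optional, Sequence, Tuple

# target residue index -> template residue index (None = aligned to a gap)
Mapping = List[Optional[int]]


def to_mapping(template_row: str, target_row: str) -> Mapping:
    """Convert two gapped alignment rows into a target -> template mapping."""
    if len(template_row) != len(target_row):
        raise ValueError("aligned rows must have equal length")
    mapping: Mapping = []
    template_idx = 0
    for t_char, q_char in zip(template_row, target_row):
        if q_char != "-":
            mapping.append(template_idx if t_char != "-" else None)
        if t_char != "-":
            template_idx += 1
    return mapping


def variable_regions(mappings: Sequence[Mapping]) -> List[Tuple[int, int]]:
    """Step 1: half-open target ranges where the inputs disagree.

    Assumption: a position is core only if all inputs map it to the same
    template residue (agreeing on a gap does not count as core).
    """
    def is_core(i: int) -> bool:
        first = mappings[0][i]
        return first is not None and all(m[i] == first for m in mappings)

    regions: List[Tuple[int, int]] = []
    start: Optional[int] = None
    for i in range(len(mappings[0])):
        if not is_core(i) and start is None:
            start = i
        elif is_core(i) and start is not None:
            regions.append((start, i))
            start = None
    if start is not None:
        regions.append((start, len(mappings[0])))
    return regions


def splice(mappings: Sequence[Mapping],
           score_segment: Callable[[Mapping, int, int], float]) -> Mapping:
    """Step 2: keep the core and take the best-scoring input per region."""
    result = list(mappings[0])  # core positions are identical in all inputs
    for start, end in variable_regions(mappings):
        best = max(mappings, key=lambda m: score_segment(m, start, end))
        result[start:end] = best[start:end]
    return result


if __name__ == "__main__":
    a = to_mapping("ABCDEFG", "QR-STUV")  # toy rows, not the slide's data
    b = to_mapping("ABCDEFG", "QRS-TUV")
    # Hypothetical score: prefer lower template indices in the segment.
    score = lambda m, s, e: -sum(v for v in m[s:e] if v is not None)
    print(variable_regions([a, b]), splice([a, b], score))
```

Because every variable region is flanked by positions that all inputs map identically, a segment taken from any one input stays in order relative to its neighbors, so the spliced mapping remains a valid alignment.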

MMM example using ideal scoring function

Template    VLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRFKHLKTEAEMKASEDLKKHGVTVLTALGAILKKK
Target MMM  DWTDAERAAIKALWGKI--DVGEIGPQALSRLLIVYPWTQRHFKGFGNISTNAAILGNAKVAEHGKTVMGGLDRAVQNM

Template    GHHEAELKPLAQSHATKHKIPIKYLEFISEAIIHVLHSRH-PGDFGADAQGAMNKALELFRKDIAAKYKELGY
Target MMM  DNIKNVYKQLSIKHSEKIHVDPDNFRLLGEIITMCVGAKFGPSAFTPEIHEAWQKFLAVVVSALGRQYH----

Figure panels (models superposed on the experimental structure): ClustalW, RMSD 2.0 Å; Align2D, RMSD 2.7 Å; MMM, RMSD 1.8 Å
Additional figure labels: CLUSTALW 2.6 Å, ALIGN2D 6.1 Å; CLUSTALW 4.6 Å, ALIGN2D 1.1 Å

Multiple Mapping Method: scoring function (1)
A composite scoring function to assess the compatibility/fit of alternative variable segments in the template structural environment. The composite scoring function consists of three mostly non-overlapping components:
1. Environment-specific substitution matrices (FUGUE [1]).
2. A scoring scheme based on a comparison (PHD vs. DSSP) of the secondary structure types (H3P2 [2]).
3. Statistically derived residue-residue contact energy (Rykunov and Fiser [3]).

[1] Shi et al., J. Mol. Biol. (2001) 310
[2] Rice et al., J. Mol. Biol. (1997) 267
[3] Rykunov & Fiser, Proteins (2007) 67
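The slide names the three components but not how they are combined. One simple possibility, shown purely as an assumption, is a weighted sum of the component scores. The component functions, their names, and the weights below are hypothetical placeholders; the real terms would come from FUGUE-style matrices, the secondary structure comparison, and the contact energy.

```python
from typing import Any, Callable, Mapping

ComponentScorer = Callable[[Any], float]


def composite_score(candidate: Any,
                    components: Mapping[str, ComponentScorer],
                    weights: Mapping[str, float]) -> float:
    """Weighted sum of component scores for one candidate variable segment.

    Assumption: linear combination; the slides do not specify the formula.
    """
    return sum(weights[name] * scorer(candidate)
               for name, scorer in components.items())


if __name__ == "__main__":
    # Placeholder scorers returning fixed numbers, for illustration only.
    components = {
        "environment": lambda seg: 2.0,          # substitution-matrix term
        "secondary_structure": lambda seg: 1.0,  # predicted vs. observed
        "contact_energy": lambda seg: -3.0,      # lower energy is better
    }
    # A negative weight turns the energy term into a "higher is better" score.
    weights = {"environment": 1.0, "secondary_structure": 0.5,
               "contact_energy": -1.0}
    print(composite_score("segment", components, weights))
```

Keeping the components as separate callables reflects the slide's point that they are mostly non-overlapping sources of information, so each can be replaced or reweighted independently.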

MMM performance on 1400 pairs

MMM performance on 87 pairs, compared with the meta-servers ESyPred3D and Consensus

Sampling vs. Scoring

Summary
The Multiple Mapping Method optimally combines alternative alignments obtained from different methods or scoring functions. On a benchmark dataset of 6635 protein pair structural alignments, comparative models built using MMM alignments are on average approximately 0.3 Å more accurate across the whole sequence identity spectrum, and approximately 0.5 Å more accurate in the <30% target-template sequence identity region, than the average accuracy of models built using the alternative input alignments (~3 Å and ~4 Å, respectively).

Optimally combining multiple templates

Selecting multiple templates
Templates for the target sequence are searched for with PSI-BLAST. A hit is selected if its sequence overlap with the target is more than 60% of the actual SCOP domain length, or more than 75% of the PDB chain length when a SCOP classification is missing.
An iterative clustering procedure identifies the most suitable templates to combine. Templates are selected or discarded according to a hierarchical selection procedure that accounts for:
– sequence identity between the templates and the target sequence,
– sequence identity among the templates,
– crystal resolution of the templates,
– contribution of the templates to the target sequence (i.e. whether a region is covered by several templates or by a single template only).
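The coverage rule for accepting a PSI-BLAST hit can be written down directly. The sketch below covers only that first filter, not the iterative clustering or the hierarchical selection. The `Hit` record and its field names are hypothetical; only the 60% and 75% thresholds come from the slide.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Hit:
    """Hypothetical record for one PSI-BLAST hit."""
    template_id: str
    overlap_length: int                       # residues overlapping the target
    pdb_chain_length: int
    scop_domain_length: Optional[int] = None  # None if no SCOP classification


def passes_coverage(hit: Hit,
                    scop_fraction: float = 0.60,
                    chain_fraction: float = 0.75) -> bool:
    """Accept a hit if it overlaps enough of the SCOP domain, or of the PDB
    chain when the SCOP domain length is unavailable."""
    if hit.scop_domain_length is not None:
        return hit.overlap_length > scop_fraction * hit.scop_domain_length
    return hit.overlap_length > chain_fraction * hit.pdb_chain_length


def filter_hits(hits: List[Hit]) -> List[Hit]:
    """Keep only the hits that satisfy the coverage rule."""
    return [h for h in hits if passes_coverage(h)]


if __name__ == "__main__":
    hits = [
        Hit("tmplA", overlap_length=70, pdb_chain_length=200,
            scop_domain_length=100),
        Hit("tmplB", overlap_length=70, pdb_chain_length=100),
    ]
    print([h.template_id for h in filter_hits(hits)])
```

The stricter 75% threshold applies to the whole chain because, without a domain definition, a chain may contain several domains and a short overlap says little about any one of them.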

Single versus multiple templates
Using a dataset of 765 proteins with known structure, two sets of models were built: (1) using one template (the best E-value hit; light bars), and (2) using multiple templates (grey bars).

And… increased coverage
Histogram of the difference in model length. Each query sequence is modeled using both a single template and multiple templates. The histogram shows the frequency of (Lm – Ls), where Lm is the length of the model built using multiple templates and Ls is the length of the model built using a single template.
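The quantity plotted is simple to tabulate. As a minimal sketch, the function below counts how often each value of Lm – Ls occurs; the function name and the input format (a list of (Lm, Ls) pairs) are assumptions, and the numbers in the usage example are made up rather than taken from the 765-protein dataset.

```python
from collections import Counter
from typing import Iterable, Tuple


def length_difference_histogram(pairs: Iterable[Tuple[int, int]]) -> Counter:
    """Frequency of Lm - Ls, given (Lm, Ls) model lengths per query.

    Lm: length of the multiple-template model.
    Ls: length of the single-template model.
    """
    return Counter(lm - ls for lm, ls in pairs)


if __name__ == "__main__":
    # Made-up lengths for illustration only.
    model_lengths = [(120, 100), (100, 100), (150, 130), (90, 95)]
    histogram = length_difference_histogram(model_lengths)
    for difference in sorted(histogram):
        print(difference, histogram[difference])
```

Positive differences correspond to queries where adding templates extended the modeled region, which is the "increased coverage" the slide refers to.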

The x-ray structure, the model built with multiple templates, and the model built with a single template are shown in grey, red, and blue, respectively. The multiple-template model agrees much better with the x-ray structure in two exposed regions, A and B, than the model built using a single template.

Increased Coverage
The x-ray structure, the model built with multiple templates, and the model built with a single template are shown in grey, red, and blue, respectively. The addition of extra templates made it possible to obtain a longer model that includes an extra beta-turn-beta-turn region (20 amino acids), depicted in ribbon.

Acknowledgement
Lab members:
– Dmitrij Rykunov
– Rotem Rubinstein
– J. Eduardo Fajardo
– Carlos J. Madrid-Aliste
– Veena Venkatagiriyappa
– Joseph Dybas
– Mario Pujato
– Brajesh Rai
– Narcis Fernandez-Fuentes
– Elliot Sternberger