Lecture 7. Computing Protein Structures. Current attempts: threading (RAPTOR), consensus (ACE), fragment assembly. Can we compute the protein structures eventually?

Presentation transcript:

Lecture 7. Computing Protein Structures. Current attempts: threading (RAPTOR), consensus (ACE), fragment assembly. Can we compute the protein structures eventually? Your projects. CS882, Fall 2006

Homologous proteins have similar structures and functions. Being homologous means that they have evolved from a common ancestral gene; hence, at least in the past, they had the same structure and function. Caution: old genes can be recruited for new functions. Example: a structural protein in the eye lens is homologous to an ancient glycolytic enzyme. Homology search is done with BLAST, or with PatternHunter for more sensitivity. BLAST works when sequence identity is over 30%.

Conserving core regions. Homologous proteins usually have conserved core regions. When we model one protein after a similar protein with known structure, the main problem becomes modeling the loop regions. Loop modeling can also draw on databases to some degree. Side chains: only a few side-chain conformations occur frequently; they are called rotamers, and there is a database of them.

Primary, secondary, and tertiary. There are many secondary structure prediction programs. However, without considering tertiary structure, predicting secondary structure alone will never be fully correct. Most tertiary structure prediction programs today depend on good secondary structure predictions. This is also not good: you cannot get the right tertiary structure from wrong starting information. The two must be done together.

There are not too many candidates! There are only about 1000 topologically different domain structures. There is no reason whatsoever that we cannot compute their structures accurately. Ab initio method: we have heard about it. Another promising method is threading (separate lecture). After threading, an important step is “refinement”, perhaps by fragment assembly. This will be a separate topic (Xin Gao). Folding membrane proteins is quite a different topic (Richard Jang). Now we go to threading.

Protein Threading. Make a structure prediction by finding an optimal placement (threading) of a protein sequence onto each known structure (structural template). The quality of a “placement” is measured by some statistics-based energy function. The best overall “placement” among all templates may give a structure prediction. Figure: target sequence MTYKLILNGKTKGETTTEAVDAATAEKVFQYANDNGVDGEWTYTE threaded against a template library.
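
The search over templates can be sketched in a few lines of Python. This is only a toy under stated assumptions: placements are gapless, each template is reduced to a string of buried (B) or exposed (E) positions, and the energy simply rewards hydrophobic residues at buried positions. The names `placement_energy`, `thread`, `tmplA` and `tmplB` are hypothetical, and the energy is not RAPTOR's.

```python
# Toy energy (an assumption, not RAPTOR's function): hydrophobic residues
# are rewarded at buried positions, polar residues at exposed positions.
HYDROPHOBIC = set("AVLIMFWC")

def placement_energy(sequence, environments, offset):
    """Energy of a gapless placement of `sequence` starting at `offset`."""
    energy = 0
    for k, residue in enumerate(sequence):
        buried = environments[offset + k] == "B"
        fits = (residue in HYDROPHOBIC) == buried
        energy += -1 if fits else 1
    return energy

def thread(sequence, template_library):
    """Return (energy, template name, offset) of the lowest-energy placement."""
    best = None
    for name, environments in template_library.items():
        for offset in range(len(environments) - len(sequence) + 1):
            energy = placement_energy(sequence, environments, offset)
            candidate = (energy, name, offset)
            if best is None or candidate < best:
                best = candidate
    return best

# Hypothetical two-template library: B = buried, E = exposed.
library = {"tmplA": "EEBBEEBB", "tmplB": "BEBEBEBE"}
best_placement = thread("LVKD", library)
print(best_placement)
```

Real threading also allows gaps and pairwise terms, which is what makes the optimization hard and motivates the integer programming treatment.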

Threading Example

Introduction to Linear Program. Optimize (maximize or minimize) a linear objective function, e.g. 2x+3y+4z. The variables satisfy some linear constraints, e.g. (1) x+y-z >= 1, (2) 2x+y+3z = 3. Integer program (IP) = linear program (LP) + integral variables. LP can be solved in polynomial time by the interior point method. The simplex method also runs fast in practice. We used an IBM package. A polynomial-time algorithm for IP is not likely: IP is NP-hard. An IP can be relaxed to an LP by solving the non-integral version, followed by branch-and-bound or branch-and-cut (which may cost exponential time).
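
The relax-then-branch idea can be illustrated on a tiny integer program. The sketch below uses a 0/1 knapsack purely as a stand-in (an assumption; it is not the threading IP), because its LP relaxation can be solved exactly by a greedy fill, so no solver package is needed. The function names are hypothetical.

```python
from fractions import Fraction

def lp_relaxation(values, weights, capacity, fixed):
    """LP relaxation of a 0/1 knapsack with some variables fixed to 0 or 1.

    Returns (bound, fractional_item), or None if the fixed items do not fit.
    fractional_item is None when the LP optimum is already integral.
    Assumes positive weights.
    """
    cap = capacity - sum(weights[i] for i, v in fixed.items() if v == 1)
    if cap < 0:
        return None
    bound = Fraction(sum(values[i] for i, v in fixed.items() if v == 1))
    free = [i for i in range(len(values)) if i not in fixed]
    # For knapsack, the LP optimum is greedy by value/weight ratio.
    free.sort(key=lambda i: Fraction(values[i], weights[i]), reverse=True)
    for i in free:
        if weights[i] <= cap:
            cap -= weights[i]
            bound += values[i]
        elif cap > 0:
            # Item i is taken fractionally: the relaxation is not integral.
            return bound + Fraction(values[i] * cap, weights[i]), i
        else:
            break
    return bound, None

def branch_and_bound(values, weights, capacity):
    """Return (optimal value, number of LP relaxations solved)."""
    best, nodes, stack = 0, 0, [{}]
    while stack:
        fixed = stack.pop()
        nodes += 1
        result = lp_relaxation(values, weights, capacity, fixed)
        if result is None:
            continue                      # infeasible node
        bound, fractional = result
        if bound <= best:
            continue                      # cannot beat the incumbent
        if fractional is None:
            best = int(bound)             # integral LP optimum: new incumbent
            continue
        stack.append({**fixed, fractional: 0})
        stack.append({**fixed, fractional: 1})
    return best, nodes

print(branch_and_bound([60, 100, 120], [10, 20, 30], 50))
print(branch_and_bound([60, 100, 120], [10, 20, 30], 30))
```

In the second call the relaxation at the root is already integral, so a single LP settles the instance without any branching.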

Why Integer Programming? It treats pairwise potentials rigorously, which is critical for fold-level targets. Existing exact algorithms for pairwise potentials have a high memory requirement or an expensive computational time. It exploits correlations between the various kinds of item scores in the energy function. 99% of real data generate integral solutions directly, with no branch-and-bound needed.

Different approaches. Approximation algorithms: interaction-frozen algorithm (A. Godzik et al.); Monte Carlo sampling (T. Madej et al.); double dynamic programming (D. Jones et al.); recursive dynamic programming (R. Thiele et al.). Exact algorithms: branch-and-bound (R.H. Lathrop et al.), which exploits the relationship among various scoring parameters and gives fast self-threading; divide-and-conquer (Y. Xu et al.), which exploits the topological structure of template contact graphs.

Formulating Protein Threading by LP. Protein threading needs: 1. Construction of the template library. 2. Design of the energy function. 3. Sequence-structure alignment. 4. Template selection and model construction.

Threading Energy Function. E = E_p + E_s + E_m + E_g + E_ss, where: E_s (fitness score) measures how well a residue fits a structural environment; E_p (pairwise potential) measures how preferable it is to put two particular residues nearby; E_g (gap score) is the alignment gap penalty; E_m (mutation score) is the sequence similarity between the query and template proteins; E_ss is the consistency with the secondary structures. Minimize E to find a sequence-structure alignment.

Contact Graph. 1. Each residue is a vertex. 2. There is one edge between two residues if their spatial distance is within a given cutoff. 3. Cores are the most conserved segments in the template: alpha-helices and beta-sheets. Figure: template.
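
As a minimal sketch of points 1 and 2, the contact graph can be built from one 3D coordinate per residue. The 7.0 Å cutoff and the toy coordinates are assumptions for illustration only; the slides do not state the cutoff used. The function name `contact_graph` is hypothetical.

```python
import math

def contact_graph(coords, cutoff=7.0):
    """Return the set of residue index pairs (i, j), i < j, within `cutoff`.

    `coords` holds one (x, y, z) point per residue, e.g. its C-alpha atom.
    The default cutoff of 7.0 angstroms is an illustrative assumption.
    """
    edges = set()
    for i in range(len(coords)):
        for j in range(i + 1, len(coords)):
            if math.dist(coords[i], coords[j]) <= cutoff:
                edges.add((i, j))
    return edges

# Four toy residues; residues 0 and 2 are 7.6 angstroms apart, so no edge.
coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (3.8, 5.0, 0.0)]
edges = contact_graph(coords)
print(sorted(edges))
```

This version keeps edges between sequence neighbours; a practical contact graph might drop them, since they carry no fold-specific information.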

Simplified Contact Graph

Contact Graph and Alignment Diagram

Variables. x(i,l) denotes that core i is aligned to sequence position l. y(i,l,j,k) denotes that core i is aligned to position l and core j is aligned to position k at the same time.
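
With these variables, the threading energy becomes a linear function of x and y. The block below is a hedged sketch of that general shape, not the exact formulation from the slides: the coefficient names s(i,l) and p(i,l,j,k) are assumptions, standing for the energy contributed by one core placement and by a pair of placements, and the no-crossing constraint is left out.

```latex
\begin{aligned}
\min\quad & \sum_{i}\sum_{l} s_{i,l}\,x_{i,l}
          \;+\; \sum_{(i,j)}\sum_{l,k} p_{i,l,j,k}\,y_{i,l,j,k} \\
\text{s.t.}\quad & \sum_{l} x_{i,l} = 1 \quad \text{for every core } i, \\
& y_{i,l,j,k} = x_{i,l}\,x_{j,k}, \\
& x_{i,l},\; y_{i,l,j,k} \in \{0,1\}.
\end{aligned}
```

The first constraint places every core exactly once; the second ties each pair variable to the two placements it describes and is the only nonlinear part.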

Formulation 1. Objective: the terms E_g, E_p and E_s, E_ss, E_m encode the scoring system. Constraints encode the interaction structures: the first makes sure there are no crossings; the second is quadratic, but can be converted to linear: for 0/1 variables, a = bc is equivalent to a ≤ b, a ≤ c, a ≥ b+c-1.
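
The linearization of a product of two 0/1 variables can be checked exhaustively, since there are only four combinations of b and c. The short script below is such a check; the helper names are hypothetical.

```python
from itertools import product

def satisfies_linear(a, b, c):
    """The three linear inequalities that replace the product a = b*c."""
    return a <= b and a <= c and a >= b + c - 1

def feasible_values(b, c):
    """All 0/1 values of a allowed by the linear constraints for given b, c."""
    return [a for a in (0, 1) if satisfies_linear(a, b, c)]

# For every 0/1 choice of b and c, the only feasible a is exactly b*c.
table = {(b, c): feasible_values(b, c) for b, c in product((0, 1), repeat=2)}
for (b, c), allowed in table.items():
    print(b, c, allowed)
```

The equivalence holds only for 0/1 values; for fractional b and c the three inequalities merely bound a.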

Formulation used in RAPTOR. Objective: the terms E_g, E_p and E_s, E_ss, E_m encode the scoring system. Constraints encode the interaction structures.

Solving the Problem Practically. 1. More than 99% of threading instances can be solved directly by linear programming; the rest can be solved by branch-and-bound with only several branch nodes. 2. Less memory consumption. 3. Less computational time. 4. Easy to extend to incorporate other constraints.

CPU Time for CAFASP3 targets

Fold Recognition: Support Vector Machines (SVM) Approach. Features are extracted from the alignments. A threading pair is treated as a positive pattern only if the two proteins are similar at least at the fold level. 60,000 threading pairs are used to train the SVM model. The SVM approach recognizes 5% more targets than the traditional z-score.

Part II. Experiments
Test                       Evaluator    Data Set  Blindness  Public
Lindahl et al. benchmark   us           large     no
LiveBench                  third-party  small     no         yes
CASP/CAFASP                third-party  small     yes

Target Category. CASP5 categories: CM, CM/FR, FR(H), FR(A), NF/FR, NF. CAFASP3 categories: HM easy (family level), HM hard (superfamily level), FR (fold level). Figure: number of targets per category, ordered by prediction difficulty from easy to hard. CM: Comparative Modelling; HM: Homology Modelling; FR: Fold Recognition; NF: New Fold.

Lindahl Benchmark Test. 976*975 = 951,600 threading pairs are tested; the results of other servers are taken from Shi et al.’s paper.

LiveBench Test
LiveBench 6
Month      Rank
August     3
September  4
October    7
November   14
December   9
Total      6
Easy       6
Hard       5
LiveBench 7
Month      Rank
Feb        10
March      1
April      3
May        2
June       6
Total      4
Easy       7
Hard       3

CASP5/CAFASP3. 62 targets. Time allowed for each target: individual servers, 48 hours; meta servers, 48 hours. Predictors: computer programs, no manual intervention (CAFASP3). Evaluated by computer program. RAPTOR was voted by CASP5 attendees as the most novel approach at CAFASP3 (the Third Critical Assessment of Fully Automated Structure Prediction).

CAFASP3 Evaluation Criteria. Model: each server can submit 10 models for each target, but only the first submission is considered. MaxSub (evaluation program): superimpose the predicted structure on the experimental structure; calculate the length of the maximum superimposable subsegment within 5Å RMSD; a prediction is regarded as correct only if that length is above a given value.

CAFASP3 Evaluation Criteria. Sensitivity (N-1 rule): one miss is allowed for each server, i.e., the first models of N-1 out of N targets are ranked. Specificity: rank the first models of all targets according to their z-scores; S(M) is the number of correct models ranked before the M-th false positive; the specificity is the average of S(1), S(2), …, S(5).

Specificity Example
Predicted Model  zScore  Correct? (by MaxSub)
T1               9.1     Yes
T2               8.4     Yes
T3               7.8     No   (first false positive)
T4               7.6     Yes
T5               7.5     No   (second false positive)
T6               7.4     Yes
…                …       …
T30              …       …
S(1)=2, S(2)=3
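
The specificity measure can be written as a short sketch in Python, using the six fully listed rows of the example as data. The function names are hypothetical, and one detail is an assumption not covered by the slides: if a ranking contains fewer than M false positives, S(M) counts every correct model.

```python
def correct_before_false_positive(correct_flags, m):
    """S(M): number of correct models ranked before the M-th false positive.

    `correct_flags` is ordered from highest to lowest z-score. If there are
    fewer than m false positives, all correct models are counted (assumption).
    """
    correct = 0
    false_positives = 0
    for is_correct in correct_flags:
        if is_correct:
            correct += 1
        else:
            false_positives += 1
            if false_positives == m:
                break
    return correct

def specificity(models, max_m=5):
    """Average of S(1)..S(max_m) over models given as (target, zscore, correct)."""
    ranked = sorted(models, key=lambda model: model[1], reverse=True)
    flags = [is_correct for _, _, is_correct in ranked]
    total = sum(correct_before_false_positive(flags, m) for m in range(1, max_m + 1))
    return total / max_m

# The six rows of the example that are fully listed.
models = [("T1", 9.1, True), ("T2", 8.4, True), ("T3", 7.8, False),
          ("T4", 7.6, True), ("T5", 7.5, False), ("T6", 7.4, True)]
flags = [is_correct for _, _, is_correct in models]
print(correct_before_false_positive(flags, 1))  # S(1) = 2, as in the example
print(correct_before_false_positive(flags, 2))  # S(2) = 3, as in the example
```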

Sensitivity on FR targets (1). 30 FR targets, 54 servers (released on Dec., 2002). Table columns: server, sum of MaxSub scores, # correct. Servers in listed order: 3ds5, robetta, pmod, 3ds3, pmode, RAPTOR, shgu, dsn, orfeus, pcons, fugu3, orf_c, …, pdbblast (0.000), …, blast (0.000).

Sensitivity on FR targets (2)
Category   CM/FR  FR(H)  FR(A)  NF/FR  NF
# Correct  6      4      2      1      0
# Targets
1. RAPTOR is weak at recognizing FR(A) targets (needs improvement). 2. RAPTOR cannot deal with NF targets at all (normal).

Sensitivity on Hard HM targets. Table columns: rank, server, score, # correct. Servers in listed order: 1 3ds, ds3, shgu, pmod, pmod, orfeus, orfb, 3dpsm, raptor, fugu3, pco3, robetta, samt, …, 11 pdbblast, …, blast (0.322).

Specificity of Servers (out of 33 targets). Table columns: rank, server, specificity. Servers in listed order: 1 3ds, pmodel, 3dsn, 3ds3, pmodel, pcons3, shgu, inbgu, fugu, ffas03, orfeus, fugsa, raptor, 3dpsm, orf_c, …, pdbblast (13.0), blast (4.0).

CAFASP3 Example. Target ID: T0136_1. Target size: 144. Superimposable size within 5Å: 118. RMSD: 1.9Å. Red: experimental structure. Blue/green: RAPTOR model.

CASP6, T0199-2. ACE buffalo rank: 9th, from the RAPTOR rank 1 model. TM= MaxSub= Good parts: , Left: predicted structure. Right: experimental structure.

CASP6, T0203. ACE buffalo rank: 1st, from the RAPTOR 2nd model. TM=0.6041, MaxSub= Good parts: 19-57, 89-94, , , The RAPTOR first model ranks 5th. Figure: predicted and experimental structures.

CASP6, T0262-2. ACE buffalo rank: 4th, from the Fugue3 6th model. TM=0.4306, MaxSub= Good parts: Fugue’s top model ranks low. Figure: predicted and experimental structures.

CASP6, T0242, NF. ACE buffalo rank: 1st, from the RAPTOR rank 5 model. TM score=0.2784, MaxSub score= However, the RAPTOR top model ranks 44th! Trivial error? Figure: predicted and experimental structures.

CASP6, T0238, NF. ACE buffalo rank: 1st, from the RAPTOR 8th model. TM=0.2748, MaxSub= Good part: High TM score, low MaxSub. The RAPTOR top model ranks 4th. Figure: predicted and experimental structures.

About RAPTOR. Jinbo Xu’s Ph.D. thesis work. The RAPTOR system has benefited significantly from PROSPECT (Ying Xu, Dong Xu, et al.). Currently distributed by BSI. References: J. Xu, M. Li, D. Kim, Y. Xu, Journal of Bioinformatics and Computational Biology, 1:1 (2003); J. Xu, M. Li, PROTEINS: Structure, Function, and Genetics, CASP5 special issue.