Protein Fold recognition Morten Nielsen, CBS, BioCentrum, DTU.


Objectives
Understand the basic concepts of fold recognition
Learn why even sequences with very low sequence similarity can be modeled
–Understand why %id is such a terrible measure of reliability
See the beauty of sequence profiles
–Position specific scoring matrices (PSSMs)

Protein homology modeling?
Identify template(s) – initial alignment (can give you protein function)
Improve alignment (can give you active site)
Backbone generation
Loop modeling (most difficult part)
Side chains
Refinement
Validation

How to do it?
Identify fold (template) for modeling
–Find the structure in the PDB database that resembles your new protein the most
–Can be used to predict function
–And maybe active sites
Align protein sequence to template
–Simple alignment methods
–Sequence profiles
–Threading methods
–Pseudo force fields
Model side chains and loops

Homology modeling and the human genome

Identification of fold
If sequence similarity is high, proteins share structure (safe zone)
If sequence similarity is low, proteins may share structure (twilight zone)
Most proteins do not have a partner with high sequence similarity
Rajesh Nair & Burkhard Rost Protein Science, 2002, 11,

Example.
A post doc in our group did her PhD obtaining the structure of the sequence below
>1K7C.A
TTVYLAGDSTMAKNGGGSGTNGWGEYLASYLSATVVNDAVAGRSARSYTREGRFENIADV
VTAGDYVIVEFGHNDGGSLSTDNGRTDCSGTGAEVCYSVYDGVNETILTFPAYLENAAKL
FTAKGAKVILSSQTPNNPWETGTFVNSPTRFVEYAELAAEVAGVEYVDHWSYVDSIYETL
GNATVNSYFPIDHTHTSPAGAEVVAEAFLKAVVCTGTSLKSVLTTTSFEGTCL
What is the function?
Where is the active site?

What would you do?
Function
–Run Blast against PDB: no significant hits
–Run Blast against NR (sequence database): function is acetylesterase?
Where is the active site?

Example. Where is the active site? 1WAB Acetylhydrolase 1G66 Acetylxylan esterase 1USW Hydrolase

Example. Where is the active site?
Align sequence against structures of known acetylesterases, like 1WAB, 1FXW, …
Cannot be aligned. Sequence similarity is too low
1K7C.A 1WAB._ RMSD
QAL 1K7C.A  71 GHNDGGSLSTDNGRTDCSGTGAEVCYSVYDGVNETILTF
DAL 1WAB._ 160 GHPRAHFLDADPGFVHSDGTISH--HDMYDYLHLSRLGY

Is it really impossible?
"Protein homology modeling is only possible if %id is greater than 30-50%"
WRONG!

Why %id is so bad!! 1200 models sharing 25-95% sequence identity with the submitted sequences (

Identification of correct fold
% ID is a poor measure
–Many evolutionarily related proteins share low sequence similarity
–A short alignment of 5 amino acids can share 100% id, what does this mean?
Alignment score is even worse
–Many sequences will score high against everything (hydrophobic stretches)
P-value or E-value is more reliable
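The point about short alignments can be made concrete with a few lines of Python. This is a minimal sketch, not part of the original slides: the function name `percent_identity` is hypothetical, and counting identity over gap-free columns only is one of several conventions in use.

```python
def percent_identity(aln_a, aln_b):
    """Return (%id, number of compared columns) for two aligned strings.

    Assumption: columns containing a gap ('-') are ignored.
    """
    pairs = [(a, b) for a, b in zip(aln_a, aln_b) if a != "-" and b != "-"]
    matches = sum(1 for a, b in pairs if a == b)
    return 100.0 * matches / len(pairs), len(pairs)


# A 5-residue alignment reaches 100% id, yet says almost nothing about the fold.
short_id, short_len = percent_identity("LAGDS", "LAGDS")
print(short_id, short_len)
```

The second return value is the reason %id alone is uninformative: the same percentage over 5 columns and over 200 columns carries very different evidence.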

What are P and E values?
E-value
–Number of expected hits in database with score higher than match
–Depends on database size
P-value
–Probability that a random hit will have score higher than match
–Database size independent
[Figure: score distribution, P(Score) versus Score]
10 hits with higher score (E=10), 10000 hits in database => P = 10/10000 = 0.001
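The arithmetic on this slide can be reproduced with a small empirical estimate. This is a sketch under a simplifying assumption: the E-value is taken as the observed count of unrelated database hits scoring above the match, and the P-value as that count divided by the database size. The function name and the toy scores are made up for illustration.

```python
def empirical_e_and_p(database_scores, match_score):
    """Estimate E- and P-value from scores of unrelated database entries."""
    n_higher = sum(1 for s in database_scores if s > match_score)
    e_value = n_higher                         # grows with database size
    p_value = n_higher / len(database_scores)  # fraction, size independent
    return e_value, p_value


# Toy database of 10000 scores, 10 of which beat a match score of 50.
toy_scores = [60] * 10 + [20] * 9990
e_value, p_value = empirical_e_and_p(toy_scores, 50)
print(e_value, p_value)
```

Doubling the toy database with the same score distribution would double the E-value and leave the P-value unchanged, which is the size dependence the slide describes.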

What goes wrong when Blast fails?
Conventional sequence alignment uses a (Blosum) scoring matrix to identify amino acid matches in the two protein sequences

Blosum scoring matrix
[Table: 20 x 20 substitution scores, rows and columns A R N D C Q E G H I L K M F P S T W Y V]

What goes wrong when Blast fails?
Conventional sequence alignment uses a (Blosum) scoring matrix to identify amino acid matches in the two protein sequences
This scoring matrix is identical at all positions in the protein sequence!
EVVFIGDSLVQLMHQC X X AGDS.GGGDSAGDS.GGGDS

Alignment accuracy. Scoring functions
Blosum62 score matrix. Fg=1. Ng=0?
[Figure: alignment score matrix]
Alignment
LAGDS
I-GDS
Score = 2 - 1 + 6 + 6 + 4 = 17
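The score of 17 for this alignment can be checked by hand or with the sketch below. Two assumptions are made here: Fg is read as the penalty for the first gap position and Ng as the penalty for each following gap position, and only the four BLOSUM62 entries needed for this alignment are typed in (L/I = 2, G/G = 6, D/D = 6, S/S = 4). The function names are hypothetical.

```python
# Subset of BLOSUM62, just the pairs used in this example.
BLOSUM62_SUBSET = {("L", "I"): 2, ("G", "G"): 6, ("D", "D"): 6, ("S", "S"): 4}


def pair_score(a, b):
    """Look up a substitution score; the matrix is symmetric."""
    if (a, b) in BLOSUM62_SUBSET:
        return BLOSUM62_SUBSET[(a, b)]
    return BLOSUM62_SUBSET[(b, a)]


def alignment_score(aln_a, aln_b, first_gap=1, next_gap=0):
    """Sum substitution scores and subtract gap penalties column by column.

    Simplification: a run of gap columns is treated as one gap even if it
    switches from one sequence to the other.
    """
    score = 0
    in_gap = False
    for a, b in zip(aln_a, aln_b):
        if a == "-" or b == "-":
            score -= next_gap if in_gap else first_gap
            in_gap = True
        else:
            score += pair_score(a, b)
            in_gap = False
    return score


score = alignment_score("LAGDS", "I-GDS")
print(score)  # 2 - 1 + 6 + 6 + 4 = 17
```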

When Blast works! 1PLC._ 1PLB._

When Blast fails! 1PLC._ 1PMY._

When Blast fails, use sequence profiles!

1PLC._ 1PMY._

Sequence profiles
In reality not all positions in a protein are equally likely to mutate
Some amino acids (active sites) are highly conserved, and the penalty for a mismatch must be very high
Other amino acids can mutate almost for free, and the penalty for a mismatch should be lower than in the BLOSUM matrix
Sequence profiles can capture these differences

Protein world Protein fold Protein structure classification Protein superfamily Protein family New Fold

All α: Hemoglobin (1bab)

All β: Immunoglobulin (8fab)

α/β: Triose phosphate isomerase (1hti)

α+β: Lysozyme (1jsf)

Sequence profiles
ADDGSLAFVPSEF--SISPGEKIVFKNNAGFPHNIVFDEDSIPSGVDASKISMSEEDLLN
TVNGAI--PGPLIAERLKEGQNVRVTNTLDEDTSIHWHGLLVPFGMDGVPGVSFPG---I
-TSMAPAFGVQEFYRTVKQGDEVTVTIT-----NIDQIED-VSHGFVVVNHGVSME---I
IE--KMKYLTPEVFYTIKAGETVYWVNGEVMPHNVAFKKGIV--GEDAFRGEMMTKD---
-TSVAPSFSQPSF-LTVKEGDEVTVIVTNLDE------IDDLTHGFTMGNHGVAME---V
ASAETMVFEPDFLVLEIGPGDRVRFVPTHK-SHNAATIDGMVPEGVEGFKSRINDE----
TVNGQ--FPGPRLAGVAREGDQVLVKVVNHVAENITIHWHGVQLGTGWADGPAYVTQCPI
Conserved: matching anything but G => large negative score
Non-conserved: anything can match
TKAVVLTFNTSVEICLVMQGTSIV----AAESHPLHLHGFNFPSNFNLVDPMERNTAGVP
TVNGQ--FPGPRLAGVAREGDQVLVKVVNHVAENITIHWHGVQLGTGWADGPAYVTQCPI

How to make sequence profiles
1. Align (BLAST) sequence against large sequence database (Swiss-Prot)
2. Select significant alignments and make profile (weight matrix) using techniques for sequence weighting and pseudo counts
3. Use weight matrix to align against sequence database to find new significant hits
4. Repeat 2 and 3 (normally 3 times!)
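The profile-building step can be sketched as follows. This is a deliberately simplified illustration, not the Psi-BLAST procedure: it uses a flat pseudo count per amino acid, a uniform background frequency of 1/20, and no sequence weighting. The function names and the toy alignment are invented for the example.

```python
import math

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"


def build_pssm(aligned_seqs, pseudo_count=1.0):
    """Return one dict of log-odds scores per alignment column."""
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background
    pssm = []
    for column in zip(*aligned_seqs):
        residues = [r for r in column if r in AMINO_ACIDS]  # gaps are skipped
        total = len(residues) + pseudo_count * len(AMINO_ACIDS)
        scores = {}
        for aa in AMINO_ACIDS:
            freq = (residues.count(aa) + pseudo_count) / total
            scores[aa] = math.log2(freq / background)
        pssm.append(scores)
    return pssm


def score_against_pssm(pssm, seq):
    """Ungapped score of a sequence placed on the profile columns."""
    return sum(column[aa] for column, aa in zip(pssm, seq))


toy_alignment = ["AGDS", "IGDS", "VGDT", "LGDS"]
toy_pssm = build_pssm(toy_alignment)
print(round(toy_pssm[1]["G"], 2), round(toy_pssm[1]["A"], 2))
```

In the toy alignment the second column is always G, so G scores high there and every other residue scores negative, while the variable first column tolerates several residues. That is the position dependence a single BLOSUM matrix cannot express.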

Example. (SGNH active site)

Example. Where is the active site? Sequence profiles might show you where to look! The active site could be around S9, G42, N74, and H195

Profile-profile scoring matrix 1K7C.A 1WAB._

Example. Where is the active site?
Align using sequence profiles
ALN 1K7C.A 1WAB._ RMSD = % ID
1K7C.A TVYLAGDSTMAKNGGGSGTNGWGEYLASYLSATVVNDAVAGRSARSYTREGRFENIADVVTAGDYVIVEFGHNDGGSLSTDN
S G N
1WAB._ EVVFIGDSLVQLMHQCE---IWRELFS---PLHALNFGIGGDSTQHVLW--RLENGELEHIRPKIVVVWVGTNNHG
1K7C.A GRTDCSGTGAEVCYSVYDGVNETILTFPAYLENAAKLFTAK--GAKVILSSQTPNNPWETGTFVNSPTRFVEYAEL-AAEVA
1WAB._ HTAEQVTGGIKAIVQLVNERQPQARVVVLGLLPRGQ-HPNPLREKNRRVNELVRAALAGHP
1K7C.A GVEYVDHWSYVDSIYETLGNATVNSYFPIDHTHTSPAGAEVVAEAFLKAVVCTGTSL
H
1WAB._ RAHFLDADPG---FVHSDG--TISHHDMYDYLHLSRLGYTPVCRALHSLLLRL---L

Structural superposition Blue: 1K7C.A Red: 1WAB._

Where was the active site? Rhamnogalacturonan acetylesterase (1k7c)

How good are we?

Alignment accuracy. Scoring functions
How to make the most of sequence profiles?
[Figure: profile matrices for 1K7C.A (rows T, T, V, …, G, D, S) and 1WAB._ (rows E, V, …, G, D, S), columns A R N D C Q E G H I L K M F P S T W Y V]

Alignment accuracy

AUC performance measure
Query   Templ   Score  Hit/nonhit
1CJ0.A  1B78.A
1CJ0.A  1B8A.A
1CJ0.A  1B8B.A
1CJ0.A  1B8G.A
1CJ0.A  1B9H.A
1CJ0.A  1BAR.A
1CJ0.A  1BAV.C
Query   Templ   Score  Hit/nonhit
1CJ0.A  1B8G.A
1CJ0.A  1DTY.A
1CJ0.A  1DGD._
1CJ0.A  1GTX.A
1CJ0.A  2GSA.A
1CJ0.A  1BW9.A
1CJ0.A  1AUP._
1CJ0.A  1GTM.A
AUC (area under the ROC curve)
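The AUC for one query can be computed directly from a list of template scores and hit/nonhit labels. The sketch below uses the pairwise definition: the AUC is the fraction of (hit, nonhit) pairs in which the hit scores higher, with ties counting one half. The scores and labels are made-up toy values, not the ones from the slide.

```python
def auc(scores, labels):
    """Area under the ROC curve; labels are 1 for hit and 0 for nonhit."""
    hits = [s for s, lab in zip(scores, labels) if lab == 1]
    nonhits = [s for s, lab in zip(scores, labels) if lab == 0]
    if not hits or not nonhits:
        raise ValueError("need at least one hit and one nonhit")
    wins = 0.0
    for h in hits:
        for n in nonhits:
            if h > n:
                wins += 1.0
            elif h == n:
                wins += 0.5
    return wins / (len(hits) * len(nonhits))


toy_scores = [9.1, 7.4, 6.0, 3.2, 2.5]
toy_labels = [1, 1, 0, 1, 0]
toy_auc = auc(toy_scores, toy_labels)
print(toy_auc)
```

An AUC of 1.0 means every true template outranks every false one for that query, and 0.5 is what random ranking gives on average.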

Fold recognition performance

Outlook
Include position dependent gap penalties
–The conventional alignment methods use equal gap penalties throughout the scoring matrix
–In real proteins, placement of insertions and deletions is highly structure dependent
–No gaps in secondary structure elements
–Gaps most frequent in loops
Distance dependency

Take home message
Identifying the correct fold is only a small step towards successful homology modeling
Do not trust % ID or alignment score to identify the fold. Use P-values
You can do reliable fold recognition AND homology modeling even at low sequence similarity
Use sequence profiles and local protein structure to align sequences

What are (some of) the different available methods?
Simple sequence based methods
–Align (BLAST) sequence against sequence of proteins with known structure (PDB database)
Sequence profile based methods
–Align sequence profile (Psi-BLAST) against sequence of proteins with known structure (PDB, FUGUE)
–Align sequence profile against profile of proteins with known structure (FFAS)
Sequence and structure based methods
–Align profile and predicted secondary structure against proteins with known structure (3D-PSSM, Phyre)
Sequence profiles and structure based methods
–HHpred