IT og Sundhed 2010/11 Sequence based predictors. Secondary structure and surface accessibility Bent Petersen 13 January 2011.



NetSurfP Real Value Solvent Accessibility predictions with amino acid associated reliability

Objective
- Predict residues as being either buried or exposed (25 % threshold): two states/classes, Buried/Exposed
- Predict the Relative Solvent Accessibility (RSA): “real” value

What is ASA?
Accessible Surface Area (solvent-accessible), in Å²
Surface area accessible to a rolling water molecule

RSA
RSA = Relative Solvent Accessibility = ACC / ASA
ACC = Accessible area in protein structure
ASA = Accessible Surface Area in Gly-X-Gly or Ala-X-Ala
Classification networks: Buried = RSA < 25 %, Exposed = RSA > 25 %
“Real” value networks: values 0 - 1, RSA > 1 set to 1
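The definitions on this slide can be sketched in a few lines of Python. This is a minimal illustration, assuming RSA is the ratio of a residue's accessible area (ACC) to its maximum accessible surface area (ASA) in the extended tripeptide. The function names are hypothetical, the example numbers are illustrative rather than taken from a reference table, and treating a value of exactly 25 % as exposed is an assumption.

```python
def relative_solvent_accessibility(acc: float, max_asa: float) -> float:
    """Return RSA = ACC / ASA, with values above 1 set to 1."""
    if max_asa <= 0:
        raise ValueError("max_asa must be positive")
    return min(acc / max_asa, 1.0)


def classify_exposure(rsa: float, threshold: float = 0.25) -> str:
    """Two-state label: 'B' (buried) below the threshold, 'A' (exposed) otherwise.

    Assigning RSA == threshold to 'A' is an assumption made for this sketch.
    """
    return "B" if rsa < threshold else "A"


# Illustrative numbers only; max_asa would normally come from a
# Gly-X-Gly or Ala-X-Ala reference table for residue type X.
example_rsa = relative_solvent_accessibility(acc=30.0, max_asa=150.0)
print(example_rsa, classify_exposure(example_rsa))
```

Keeping the real-value target and the two-state label as separate functions mirrors the two kinds of networks named on the slide.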

Why predict RSA?
Residues exposed on the surface can be:
- Involved in PTMs
- Potential epitopes
- Involved in protein-protein interactions
- Useful for prediction of disease SNPs

How to start?
What do we want?
- We want to be able to predict the exposure of an amino acid
What do we need?
- A training dataset and an independent evaluation dataset
What information do we need?
- True structural information the neural network can train on
Where do we get that?
- PDB, DSSP

Protein Data Bank, PDB
Berman, H.M., et al., The Protein Data Bank. Nucl. Acids Res., 2000. 28(1): p. 235-242.

Define Secondary Structure of Proteins, DSSP
Kabsch, W. and C. Sander, Dictionary of Protein Secondary Structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers, 1983. 22(12).
==== Secondary Structure Definition by the program DSSP, updated CMBI version by ElmK / April 1,2000 ====
DATE=23-MAR
REFERENCE W. KABSCH AND C.SANDER, BIOPOLYMERS 22 (1983)
HEADER TOXIN 12-AUG-98 3BTA
COMPND 2 MOLECULE: PROTEIN (BOTULINUM NEUROTOXIN TYPE A);
SOURCE 2 ORGANISM_SCIENTIFIC: CLOSTRIDIUM BOTULINUM;
AUTHOR R.C.STEVENS,D.B.LACY
TOTAL NUMBER OF RESIDUES, NUMBER OF CHAINS, NUMBER OF SS-BRIDGES(TOTAL,INTRACHAIN,INTERCHAIN)
ACCESSIBLE SURFACE OF PROTEIN (ANGSTROM**2)
TOTAL NUMBER OF HYDROGEN BONDS OF TYPE O(I)-->H-N(J), SAME NUMBER PER 100 RESIDUES
TOTAL NUMBER OF HYDROGEN BONDS IN PARALLEL BRIDGES, SAME NUMBER PER 100 RESIDUES
TOTAL NUMBER OF HYDROGEN BONDS IN ANTIPARALLEL BRIDGES, SAME NUMBER PER 100 RESIDUES
TOTAL NUMBER OF HYDROGEN BONDS OF TYPE O(I)-->H-N(I-5), SAME NUMBER PER 100 RESIDUES
TOTAL NUMBER OF HYDROGEN BONDS OF TYPE O(I)-->H-N(I-4), SAME NUMBER PER 100 RESIDUES
TOTAL NUMBER OF HYDROGEN BONDS OF TYPE O(I)-->H-N(I+2), SAME NUMBER PER 100 RESIDUES
TOTAL NUMBER OF HYDROGEN BONDS OF TYPE O(I)-->H-N(I+3), SAME NUMBER PER 100 RESIDUES
TOTAL NUMBER OF HYDROGEN BONDS OF TYPE O(I)-->H-N(I+4), SAME NUMBER PER 100 RESIDUES
TOTAL NUMBER OF HYDROGEN BONDS OF TYPE O(I)-->H-N(I+5), SAME NUMBER PER 100 RESIDUES
*** HISTOGRAMS OF *** RESIDUES PER ALPHA HELIX, PARALLEL BRIDGES PER LADDER, ANTIPARALLEL BRIDGES PER LADDER, LADDERS PER SHEET
# RESIDUE AA STRUCTURE BP1 BP2 ACC N-H-->O O-->H-N N-H-->O O-->H-N TCO KAPPA ALPHA PHI PSI X-CA Y-CA Z-CA
[Per-residue records for chain A of 3BTA follow, one line per residue; the numeric columns are not recoverable.]
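The ACC column of a DSSP file is the per-residue accessible area used as structural truth for training. Below is a hedged sketch of reading it in Python. It assumes the classic fixed-width DSSP layout (residue number in columns 6-10, chain in column 12, amino acid in column 14, structure code in column 17, ACC in columns 35-38); newer mmCIF-style DSSP output is laid out differently. `DsspResidue` and `parse_dssp_records` are hypothetical names, not part of DSSP itself.

```python
from typing import Iterable, Iterator, NamedTuple


class DsspResidue(NamedTuple):
    resnum: int   # residue number from the PDB entry
    chain: str    # chain identifier
    aa: str       # one-letter amino acid code
    ss: str       # DSSP structure code, blank mapped to 'C'
    acc: int      # accessible area in square Angstrom


def parse_dssp_records(lines: Iterable[str]) -> Iterator[DsspResidue]:
    """Yield per-residue records found after the '  #  RESIDUE' header line."""
    in_table = False
    for line in lines:
        if line.startswith("  #  RESIDUE"):
            in_table = True
            continue
        if not in_table or len(line) < 38:
            continue
        if line[13] == "!":  # chain-break marker, carries no residue data
            continue
        yield DsspResidue(
            resnum=int(line[5:10]),
            chain=line[11],
            aa=line[13],
            ss=line[16] if line[16] != " " else "C",
            acc=int(line[34:38]),
        )
```

Typical use would be `list(parse_dssp_records(open(path)))`, where `path` points to a DSSP output file.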

Define Secondary Structure of Proteins, DSSP
DSSP defines 8 types of secondary structure:
- G = 3-turn helix (3-10 helix)
- H = 4-turn helix (α-helix)
- I = 5-turn helix (π-helix)
- T = Hydrogen bonded turn (3, 4 or 5 turn)
- E = Extended strand
- B = Residue in isolated β-bridge
- S = Bend
- Rest is C = coil

Required datasets
Training/test
- Used for optimization of settings using 10-fold cross-validation
Evaluation
- Used for final evaluation; less than 25 % sequence identity to the training/test dataset

10-fold Cross Validation
- Break dataset into 10 sets of size 1/10
- Train on 9 sets and test on 1
- Repeat 10 times and take the mean accuracy
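The procedure can be sketched as follows. This is a generic illustration in plain Python, not the code used for NetSurfP; the function names, the shuffling and the fixed seed are assumptions made for the example.

```python
import random
from typing import List, Sequence, Tuple


def k_fold_splits(items: Sequence[str], k: int = 10, seed: int = 0
                  ) -> List[Tuple[List[str], List[str]]]:
    """Split items into k folds and return (train, test) pairs, one per fold."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    # Every k-th item goes to the same fold, so fold sizes differ by at most 1.
    folds = [shuffled[i::k] for i in range(k)]
    splits = []
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        splits.append((train, test))
    return splits


def mean_accuracy(accuracies: Sequence[float]) -> float:
    """Average the per-fold test accuracies."""
    return sum(accuracies) / len(accuracies)
```

Splitting by whole sequence rather than by residue keeps all windows of one protein in the same fold, which seems the natural choice for this kind of data.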

Learning / Training dataset
Training set: Cull_1764
- Max. Seq. ID: 25 %
- Resolution: ≤ 2.0 Å
- R-Factor: ≤ 0.2
- Seq. Length AA
- Including X-ray entries only

PISCES

Learning / Training dataset
Homology reduced towards evaluation set CB513 (302 sequences removed)
Final training set:
- 1764 sequences
- Buried: 44.20 % of amino acids
- Exposed: 55.80 % of amino acids

Learning / Training dataset
---Sequence/residue statistics---
Number of seq.: 1764
Longest seq.: 1T3T.A (1283)
Shortest seq.: 1YTV.M (6)
Number of amino acids:
---Assignment category statistics---
B (44.20%)  A (55.80%)
---Amino acid statistics---
H (2.40%)  G (7.59%)  Y (3.57%)  V (7.22%)  E (6.64%)
S (5.84%)  P (4.69%)  A (8.53%)  R (5.13%)  Q (3.72%)
C 5202 (1.24%)  K (5.52%)  L (9.21%)  N (4.25%)  T (5.50%)
F (4.11%)  D (5.92%)  I (5.63%)  W 6365 (1.52%)  M 7353 (1.76%)

Evaluation dataset
Final evaluation dataset: CB513
- 513 non-homologous sequences
- Seq. length: 20-754 aa
- 84,119 amino acids
- Buried: 44.19 %
- Exposed: 55.81 %

Evaluation dataset
---Sequence/residue statistics---
Number of seq.: 513
Longest seq.: 6acn.all (754)
Shortest seq.: 1atpi-1 (20)
Number of amino acids: 84119
---Assignment category statistics---
B (44.19%)  A (55.81%)
---Amino acid statistics---
R 3812 (4.53%)  T 5015 (5.96%)  D 4973 (5.91%)  C 1381 (1.64%)
Y 3065 (3.64%)  G 6657 (7.91%)  N 3976 (4.73%)  V 5795 (6.89%)
I 4642 (5.52%)  A 7267 (8.64%)  S 5222 (6.21%)  K 4976 (5.92%)
P 3903 (4.64%)  E 5050 (6.00%)  L 7134 (8.48%)  Q 3108 (3.69%)
M 1710 (2.03%)  H 1865 (2.22%)  W 1236 (1.47%)  F 3268 (3.88%)
X 19 (0.02%)  B 31 (0.04%)  Z 14 (0.02%)

Neural Network - Input
Position Specific Scoring Matrices, PSSM
Columns: A R N D C Q E G H I L K M F P S T W Y V
Rows (class, residue, chain ID; score values not recoverable):
B H 2BEM.A
A G 2BEM.A
A Y 2BEM.A
A V 2BEM.A
B E 2BEM.A
Generated by iterative PSI-BLAST against nr70
Secondary structure predictions (sec predictor by Pernille Andersen)
Rows (class, residue, chain ID; prediction values not recoverable):
B H 2BEM.A
A G 2BEM.A
A Y 2BEM.A
A V 2BEM.A
B E 2BEM.A

Secondary structure predictor
Developed by Pernille Andersen, incorporated in NetSurfP
Trained on 2,085 sequences using DSSP
- H = H, E = E, C = ., G, I, B, S and T
- H ~ 30 %, E ~ 20 %, C ~ 50 %
Performance of ~80 %
Maximum theoretical limit is ~88 %
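The 8-to-3 state reduction listed on this slide (H stays H, E stays E, everything else becomes C) can be written as a small mapping. The function name is hypothetical, and treating a blank DSSP code the same as "." is an assumption.

```python
def reduce_to_three_states(dssp_string: str) -> str:
    """Map DSSP codes to H/E/C: H -> H, E -> E, all other codes -> C."""
    reduced = []
    for code in dssp_string:
        if code == "H":
            reduced.append("H")
        elif code == "E":
            reduced.append("E")
        else:
            # '.', ' ', G, I, B, S and T are all collapsed into coil.
            reduced.append("C")
    return "".join(reduced)


print(reduce_to_three_states("HHGGEEB.TSI"))
```

Other reduction schemes exist (for example counting G as helix), so the class proportions quoted above apply only to this particular mapping.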

Neural Network - Settings
Window Size:
Hidden units: 10, 20, 25, 30, 40, 50, 75, 150, (200)
Learning rate: 0.01 / (0.005)
Epochs (training rounds):
10-fold cross-validation
- 9/10 used for training, 1/10 for testing

Neural network window
Sliding window over the sequence of 2BEM.A (mol:aa, CHITIN-BINDING PROTEIN)
HGYVESPASRAYQCKLQLNTQCGSVQYEPQSVEGLKGFPQAGPADGHIASADKSTFFELDQQTPTRWNKLNLKTGPNSFT
WKLTARHSTTSWRYFITKPNWDASQPLTRASFDLTPFCQFNDGGAIPAAQVTHQCNIPADRSGSHVILAVWDIADTANAF
YQAIDVNLSK
BAAABBAAAAAAAABBBBABBABBAABBABAABABBBAABBBABBABAAAABBBBABAAABABBBAABABBABAABABAA
ABABBBBAABAAAAAAABBBABABBBAAABAABBBAAAAAABBBBBABBBABABABAABBABBBAAAAAAAAABBBBBAA
AAAAAABABB
Prediction on the middle residue, moving the window one position at a time:
- Serine, buried
- Proline, exposed
- Alanine, exposed
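A sliding window can be sketched as below: each residue becomes the centre of a fixed-length stretch of sequence, and the network predicts the class of that centre residue. The window size of 5 and the "-" padding symbol are assumptions chosen for the example; the window sizes actually used are not recoverable from the slides.

```python
from typing import List

PAD = "-"  # hypothetical padding symbol for positions beyond the termini


def sliding_windows(sequence: str, window_size: int) -> List[str]:
    """Return one window per residue, centred on that residue."""
    if window_size < 1 or window_size % 2 == 0:
        raise ValueError("window_size must be a positive odd number")
    half = window_size // 2
    padded = PAD * half + sequence + PAD * half
    return [padded[i:i + window_size] for i in range(len(sequence))]


# First nine residues of 2BEM.A as shown on the slide.
windows = sliding_windows("HGYVESPAS", window_size=5)
for centre, window in zip("HGYVESPAS", windows):
    print(centre, window)
```

In a real predictor each window position would be encoded numerically, for instance by the PSSM row and the secondary structure prediction for that residue, before being fed to the network.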

Method

Error function: Z-score:

Wisdom of the crowd
- Selecting best performing network architectures based on test performance
- Better than choosing any single network
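One simple way to combine networks is to rank architectures by test performance and average the outputs of the best ones. The sketch below illustrates that idea only; plain averaging, the function names and the network labels are assumptions, and the slides do not state how the selected networks were actually combined.

```python
from typing import Dict, List, Sequence


def select_top_networks(test_scores: Dict[str, float], top_n: int) -> List[str]:
    """Return the names of the top_n networks ranked by test performance."""
    ranked = sorted(test_scores, key=test_scores.get, reverse=True)
    return ranked[:top_n]


def ensemble_average(predictions: Sequence[Sequence[float]]) -> List[float]:
    """Average per-residue outputs across networks (one inner list per network)."""
    n = len(predictions)
    return [sum(column) / n for column in zip(*predictions)]


# Hypothetical test scores for three architectures.
scores = {"net_a": 0.70, "net_b": 0.78, "net_c": 0.75}
print(select_top_networks(scores, top_n=2))
```

Selection here uses test-set performance, as on the slide, which is why a separate evaluation set is needed for the final numbers.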

Results - Classification networks
Training and Evaluation:
                            % Correct    MCC    # Networks
Best Single Architecture
All Architectures
Top 20 Architectures
Comparison:
                            % Correct    MCC
Dor and Zhou                78.8         Not published
NetSurfP
CB500/CB
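Both reported measures can be computed from the four counts of a two-state confusion matrix. The sketch below uses the standard definitions of percent correct and the Matthews correlation coefficient (MCC); the function names are hypothetical and the choice of "A" (exposed) as the positive class is an assumption.

```python
import math
from typing import Tuple


def confusion_counts(true_labels: str, predicted_labels: str,
                     positive: str = "A") -> Tuple[int, int, int, int]:
    """Return (tp, tn, fp, fn) for two-state label strings such as 'BAAB'."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label strings must have the same length")
    tp = tn = fp = fn = 0
    for t, p in zip(true_labels, predicted_labels):
        if p == positive:
            if t == positive:
                tp += 1
            else:
                fp += 1
        elif t == positive:
            fn += 1
        else:
            tn += 1
    return tp, tn, fp, fn


def percent_correct(tp: int, tn: int, fp: int, fn: int) -> float:
    """Percentage of residues assigned to the right class."""
    return 100.0 * (tp + tn) / (tp + tn + fp + fn)


def matthews_cc(tp: int, tn: int, fp: int, fn: int) -> float:
    """MCC = (tp*tn - fp*fn) / sqrt((tp+fp)(tp+fn)(tn+fp)(tn+fn))."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denominator if denominator else 0.0
```

MCC is less sensitive than percent correct to the roughly 44/56 class imbalance in these datasets, which is a common reason for reporting both.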

Results Evaluation

NetSurfP
/usr/cbs/bio/src/NetSurfP/NetSurfP -h

NetSurfP

NetDiseaseSNP
Disease-SNP prediction (Morten Bo Johansen)
Without NetSurfP:
- Cross-validation: MCC=
- Cross-evaluation: MCC=
With NetSurfP:
- Cross-validation: MCC=
- Cross-evaluation: MCC= 0.572

Paper is out... What then?

Statistics Submissions to the webserver from CBS website


As of 12 Jan sequences submitted from unique IPs

First citation 24 October 2009 :-)