Rising accuracy of protein secondary structure prediction Burkhard Rost

Presentation transcript:

Protein secondary structure predictions By: Refael Vivanti & Tal Tabakman

Rising accuracy of protein secondary structure prediction Burkhard Rost

Central dogma of biology: DNA → RNA → strand of amino acids → protein. DNA: AGCTCTCTGAGGCTT; RNA: UCGAGAGACUCCGAA; amino acids: AGHTY; protein structure: ?

Sequence-structure gap: Today far more proteins have been sequenced than have had their structures determined, and the gap is growing rapidly. Problem: determining a protein's structure is not simple. Solution: a good start is to find its secondary structure.

"And the whole earth was of one language, and of one speech." (Genesis 11:1) Comparing methods requires the same terms and tests. Secondary structure types: H = helix, E = β-strand, L/C = other.
seq:  A A P P L L L L M M M G I M M R R I M
pred: E E E E E C C C C H H H H C C C E E E

How to evaluate a prediction? The Q3 test: Q3 = (number of correctly predicted residues) / (total number of residues), usually quoted as a percentage. Of course, all methods should be tested on the same proteins.
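
The Q3 ratio on this slide can be sketched in a few lines of Python. The function name `q3_score` and the two example strings are hypothetical, chosen only for illustration; they are not taken from the presentation.

```python
def q3_score(observed: str, predicted: str) -> float:
    """Return Q3 as a percentage: correctly predicted residues / all residues."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted strings must have equal length")
    if not observed:
        raise ValueError("cannot score an empty sequence")
    # Count positions where the predicted state (H, E or C) matches the observed one.
    correct = sum(1 for o, p in zip(observed, predicted) if o == p)
    return 100.0 * correct / len(observed)


# Hypothetical example: 10 residues, 8 of them predicted correctly.
observed = "HHHHCCEEEE"
predicted = "HHHCCCEEEC"
print(q3_score(observed, predicted))  # 80.0
```

Because Q3 is averaged over all residues, it says nothing about how the errors are distributed between helices, strands and coils.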

Old methods: First generation, single-residue statistics: Chou & Fasman (1974). Some residues have a particular secondary structure preference. Examples: Glu → α-helix, Val → β-strand. Second generation, segment statistics: similar, but also considering adjacent residues.
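
The idea of single-residue statistics can be illustrated with a deliberately tiny sketch. It is not the Chou-Fasman algorithm, which uses measured propensities for all 20 residues plus nucleation rules; the table below holds only the two preferences named on the slide, and `PREFERENCE` and `first_generation_predict` are hypothetical names.

```python
# Toy preference table, limited to the two examples given on the slide.
PREFERENCE = {
    "E": "H",  # Glu (one-letter code E) prefers alpha-helix (state H)
    "V": "E",  # Val (one-letter code V) prefers beta-strand (state E)
}


def first_generation_predict(sequence: str) -> str:
    """Assign each residue its preferred state, ignoring all neighbours."""
    # Residues without a listed preference default to coil (C).
    return "".join(PREFERENCE.get(residue, "C") for residue in sequence)


print(first_generation_predict("EEVVA"))  # HHEEC
```

Each residue is decided in isolation here, which is one way to see why such predictions tend to produce segments that are too short.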

Difficulties: Poor accuracy: below 66% (Q3 results). Q3 of strands (E): 28% - 48%. Predicted segments were too short.

Methods accuracy comparison

3rd generation methods: Third generation methods reached 77% accuracy. They rest on two new ideas: 1. A biological idea: using evolutionary information. 2. A technological idea: using neural networks.

How can evolutionary information help us? Homologues → similar structure. But homologous sequences can differ in up to 85% of their residues. How much a sequence varies at each position depends on the structure there.

How can evolutionary information help us? Where can we find high sequence conservation? Some examples: in defined secondary structures; in segments of the protein core (more hydrophobic); in amphipathic helices (a cycle of hydrophobic and hydrophilic residues).

How can evolutionary information help us? Predictions based on multiple alignments were made manually. Problem: there is no well-defined algorithm! Solution: use neural networks.

Artificial Neural Networks: An attempt to imitate the construction of the human brain (assuming this is how it works). When do we use them? When we can't solve the problem ourselves!

Artificial Neural Network: The basic structure of a neural network: a large number of processors ("neurons"), highly connected, working together.

Artificial Neural Network: What does a neuron do? It receives "signals" from its neighbours. Each signal has a different weight. When the weighted signals reach a certain threshold, the neuron sends signals of its own.

Artificial Neural Network: General structure of an ANN: one input layer, some hidden layers, one output layer. Our ANNs have one-directional (feed-forward) flow!

Artificial Neural Network: A neuron may act as a NOT gate, an AND gate or an OR gate. Because these gates form a complete system, a neural network can compute any Boolean function.
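
A weighted-sum-and-threshold neuron acting as each of the three gates can be sketched as follows. The weights and thresholds are one possible choice among many, and the function names are hypothetical.

```python
def neuron(inputs, weights, threshold):
    """Fire (return 1) when the weighted sum of the inputs reaches the threshold."""
    total = sum(x * w for x, w in zip(inputs, weights))
    return 1 if total >= threshold else 0


def not_gate(a):
    # A single inhibitory input: fires only when the input is silent.
    return neuron([a], [-1], 0)


def and_gate(a, b):
    # Both inputs are needed to reach the threshold of 2.
    return neuron([a, b], [1, 1], 2)


def or_gate(a, b):
    # Either input alone reaches the threshold of 1.
    return neuron([a, b], [1, 1], 1)


for a in (0, 1):
    for b in (0, 1):
        print(a, b, and_gate(a, b), or_gate(a, b), not_gate(a))
```

Only the weights and the threshold differ between the gates; the neuron itself is the same, which is what makes learning by weight adjustment possible.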

Artificial Neural Network: Network training and testing. [Diagram: the training set is fed to the neural network; incorrect outputs lead to back-propagation; once the outputs are correct, the network is evaluated on the test set.] Training set: inputs for which we know the wanted output. Back-propagation: an algorithm for adjusting the "power" (weight) of the signals between neurons. Test set: inputs used for the final test of network performance.
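
The compare-and-adjust loop behind training can be sketched on a toy problem. This is an assumption-laden simplification: it uses the perceptron update rule on a single neuron rather than back-propagation through hidden layers, and the task (learning OR from three examples, then testing on a fourth) is purely illustrative. All names are hypothetical.

```python
def predict(weights, bias, inputs):
    """Single threshold neuron: fire when the weighted sum plus bias is positive."""
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if total > 0 else 0


def train(training_set, epochs=10, rate=1):
    """Adjust weights whenever the output differs from the wanted output."""
    weights = [0] * len(training_set[0][0])
    bias = 0
    for _ in range(epochs):
        errors = 0
        for inputs, wanted in training_set:
            error = wanted - predict(weights, bias, inputs)
            if error != 0:
                errors += 1
                # Strengthen or weaken each connection in proportion to its input.
                weights = [w + rate * error * x for w, x in zip(weights, inputs)]
                bias += rate * error
        if errors == 0:
            break  # every training example is now answered correctly
    return weights, bias


# Training set: inputs with known wanted outputs (three rows of the OR table).
training_set = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1)]
# Test set: an input the network never saw during training.
test_set = [((1, 1), 1)]

weights, bias = train(training_set)
for inputs, wanted in test_set:
    print(inputs, predict(weights, bias, inputs), wanted)
```

Keeping the test example out of the training loop is what makes the final check a fair measure of performance.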

Artificial Neural Network: The network is a "black box": even when it succeeds, it is hard to understand how. It is difficult to derive an algorithm from the network. It is hard to deduce new scientific principles.

Structure of 3rd generation methods: Find homologues using large databases. Create a profile representing the entire protein family. Give the sequence and the profile to the ANN. Output of the ANN: secondary structure prediction.
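
One simple way to picture a family profile is as per-column residue frequencies over an alignment of homologues. The sketch below assumes the sequences are already aligned to equal length and ignores gaps and sequence weighting; the four-sequence family and the name `build_profile` are hypothetical.

```python
from collections import Counter


def build_profile(alignment):
    """Return, for each alignment column, the frequency of every residue seen there."""
    if not alignment:
        raise ValueError("alignment must contain at least one sequence")
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    profile = []
    for column in zip(*alignment):
        counts = Counter(column)
        profile.append({res: n / len(column) for res, n in counts.items()})
    return profile


# Hypothetical family of four aligned homologues.
family = ["ALKV", "ALRV", "AIKV", "GLKV"]
profile = build_profile(family)
for position, frequencies in enumerate(profile):
    print(position, frequencies)
```

A fully conserved column (here the last one) carries a single residue with frequency 1.0, while variable columns spread their frequency over several residues.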

Structure of 3rd generation methods: The ANN learning process. Training & test sets: proteins with known sequence & structure. Training: feed the training set to the ANN as input, compare the output to the known structure, then back-propagate.

3rd generation methods, difficulties: Main problem: unwise selection of training & test sets for the ANN. First problem: unbalanced training. Overall protein composition: helices 32%, strands 21%, coils 47%. What will happen if we train the ANN with random segments?
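
One plausible reading of the question on this slide is that a network trained on random segments can reach roughly the coil share of accuracy just by always answering "coil". The sketch below computes that baseline from the slide's percentages and shows one possible balancing strategy, downsampling to the rarest state. The downsampling approach and all names are assumptions, not the procedure used by any specific method.

```python
import random
from collections import Counter, defaultdict

# Overall composition from the slide, in percent.
COMPOSITION = {"H": 32, "E": 21, "C": 47}


def majority_baseline(composition):
    """Q3 obtained by always predicting the most common state."""
    state = max(composition, key=composition.get)
    return state, composition[state]


def balance(examples, seed=0):
    """Downsample so that every state has as many examples as the rarest one."""
    by_state = defaultdict(list)
    for window, state in examples:
        by_state[state].append((window, state))
    smallest = min(len(group) for group in by_state.values())
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    balanced = []
    for state in sorted(by_state):
        balanced.extend(rng.sample(by_state[state], smallest))
    return balanced


print(majority_baseline(COMPOSITION))  # ('C', 47)

# Hypothetical training examples drawn in the slide's proportions.
examples = [(f"window{i}", state)
            for state, share in COMPOSITION.items()
            for i in range(share)]
balanced = balance(examples)
print(Counter(state for _, state in balanced))
```

Downsampling discards data; presenting rare states more often during training is an alternative with the same goal.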

3rd generation methods, difficulties: Second problem: unwise separation between training & test proteins. What will happen if homology or correlation exists between test & training proteins? Over-optimism: above 80% accuracy in testing. Third problem: similarity between test proteins.

PSI-PRED: "Protein Secondary Structure Prediction Based on Position-specific Scoring Matrices", David T. Jones. PSI-PRED is a 3rd generation method based on the iterated PSI-BLAST algorithm.

PSI-BLAST: Sequence → PSI-BLAST → distant homologues and a PSSM (position-specific scoring matrix). PSI-BLAST outperforms other algorithms in finding distant homologues. The PSSM is the input for PSI-PRED.

PSI-PRED: ANN architecture: two ANNs working together. Sequence + PSSM → 1st ANN → prediction → 2nd ANN → final prediction.

PSI-PRED: Step 1: Create the PSSM from the sequence with 3 iterations of PSI-BLAST. Step 2: 1st ANN. Input: sequence + PSSM, read through a window of 15 residues, e.g. A D C Q E I L H T S T T W Y V. Output: the E/H/C secondary state prediction for the central amino acid.
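
The 15-residue window around each central residue can be sketched as below. This shows only the windowing over residue letters; in the method described on the slide the network input also includes the PSSM values for each position. The padding character and the name `windows` are assumptions.

```python
def windows(sequence, size=15, pad="-"):
    """Return one window per residue, centred on it and padded at the chain ends."""
    if size % 2 == 0:
        raise ValueError("window size must be odd so that a central residue exists")
    half = size // 2
    padded = pad * half + sequence + pad * half
    return [padded[i:i + size] for i in range(len(sequence))]


# The 15-residue example from the slide; its central residue is H.
sequence = "ADCQEILHTSTTWYV"
all_windows = windows(sequence)
print(all_windows[0])  # -------ADCQEILH
print(all_windows[7])  # ADCQEILHTSTTWYV
```

Each window yields one prediction, for its central residue only, so a chain of N residues needs N passes through the first network.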

PSI-PRED: Using PSI-BLAST brings with it PSI-BLAST's difficulties: iteration → extension of the protein family → updating the PSSM → inclusion of non-homologues → a "misleading" PSSM.

PSI-PRED: Step 3: 2nd ANN. So why do we need a second ANN? Possible output of the 1st ANN:
seq:  A A P P L L L L M M M G I M M R R I M
pred: E E E E E C C C C C H C C C C C E E E
What's wrong with that? A one-amino-acid helix doesn't exist. Solution: an ANN that "looks" at the whole context! Input: output of the 1st ANN. Output: final prediction.
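
The problem shown on this slide, an isolated one-residue helix, can be detected mechanically by splitting a prediction into runs of identical states. The sketch below only flags such runs; it does not reproduce what the second ANN does. The `min_length` parameter and both function names are hypothetical.

```python
from itertools import groupby


def segments(states):
    """Split a prediction string into (state, run length) pairs."""
    return [(state, len(list(run))) for state, run in groupby(states)]


def isolated_helices(states, min_length=2):
    """Return (start index, length) of every helix run shorter than min_length."""
    flagged = []
    position = 0
    for state, length in segments(states):
        if state == "H" and length < min_length:
            flagged.append((position, length))
        position += length
    return flagged


# The first-network output shown on the slide.
prediction = "EEEEECCCCCHCCCCCEEE"
print(segments(prediction))
print(isolated_helices(prediction))  # [(10, 1)]
```

A rule-based filter like this can only delete implausible segments, whereas a second network can also use the surrounding context to decide what the residue should be instead.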

PSI-PRED: Training: balanced training; 10% of the proteins were used as an "inner" test. Testing: 187 proteins with highly resolved structures and without structural similarities; PSI-BLAST was used to remove homologues.

PSI-PRED: Jones's reported results: Q3 of 76% - 77%.

PSI-PRED: Reliability numbers: the way the ANN tells us how confident it is in an assignment. Used by many methods. Correlates with accuracy.
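
The slide does not say how reliability numbers are computed. One common convention, assumed here for illustration, scales the gap between the two strongest output units to a 0-9 index. The name `reliability_index` and the output values are hypothetical.

```python
def reliability_index(outputs):
    """Scale the gap between the two highest network outputs to an index from 0 to 9."""
    # Assumption: outputs are values in [0, 1], one per state (H, E, C).
    ranked = sorted(outputs.values(), reverse=True)
    return min(9, int(10 * (ranked[0] - ranked[1])))


# Hypothetical output values for three residues.
print(reliability_index({"H": 0.75, "E": 0.25, "C": 0.0}))  # 5
print(reliability_index({"H": 0.5, "E": 0.5, "C": 0.0}))    # 0
print(reliability_index({"H": 1.0, "E": 0.0, "C": 0.0}))    # 9
```

A clear winner among the outputs gives a high index and a near tie gives a low one, which is consistent with the index correlating with accuracy.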

Performance evaluation: With 3rd generation methods, accuracy jumped by ~10 percentage points. Many 3rd generation methods exist today. Which method is the best one? How can we recognize "over-optimism"?

Performance evaluation CASP - Critical Assessment of Techniques for Protein Structure Prediction. EVA – Automatic Evaluation of Automatic Prediction Servers.

Performance evaluation: Conclusion: PSI-PRED seems to be one of the most reliable methods today. Reasons: the widest evolutionary information (PSI-BLAST profiles); strict training & testing criteria for the ANN.

Improvements: The first 3rd generation method, PHD: ~72% in Q3. Best results of 3rd generation methods: ~77% in Q3. Sources of improvement: larger protein databases; PSI-BLAST. PSI-PRED broke through, and many followed.

Improvements: How can we do better than that? Through larger databases (?). Through a combination of methods; example: combining the 4 best methods gives a Q3 of ~78%. By finding out why certain proteins are predicted poorly.
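
The slide does not say how the methods were combined. A per-residue majority vote is one simple possibility, sketched here under that assumption. The four prediction strings, the tie-breaking rule and the name `consensus` are hypothetical.

```python
from collections import Counter


def consensus(predictions):
    """Combine equal-length prediction strings by per-residue majority vote."""
    if len(set(map(len, predictions))) != 1:
        raise ValueError("need at least one prediction, all of the same length")
    result = []
    for column in zip(*predictions):
        counts = Counter(column)
        best = max(counts.values())
        # On a tie, keep the state proposed by the earliest-listed method.
        tied = [state for state in column if counts[state] == best]
        result.append(tied[0])
    return "".join(result)


# Hypothetical outputs of four methods for the same 8-residue segment.
methods = ["HHHCCEEE", "HHCCCEEE", "HHHCCCEE", "HHHHCEEC"]
print(consensus(methods))  # HHHCCEEE
```

Voting helps most when the methods make different mistakes; methods that fail on the same residues gain little from being combined.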

Improvements: What is the limit of prediction improvement? Some regions of proteins are more mobile than others. 12% of protein structure cannot be assigned even by "manual" methods. The limit of accuracy is therefore 88%!

Secondary structure prediction in practice: protein structure; genome analysis; finding structural switches.

Finding Structural Switches: Young et al.: prediction of secondary structure with several methods. Different results + same preferences → structural switch?

Bibliography:
Jones DT. Protein secondary structure prediction based on position-specific scoring matrices. J Mol Biol. 1999; 292:195-202.
Rost B. Rising accuracy of protein secondary structure prediction. In: Protein structure determination, analysis, and modeling for drug discovery (ed. D Chasman). New York: Dekker, pp. 207-249.