Presentation is loading. Please wait.

Presentation is loading. Please wait.

Rising accuracy of protein secondary structure prediction Burkhard Rost

Similar presentations


Presentation on theme: "Rising accuracy of protein secondary structure prediction Burkhard Rost"— Presentation transcript:

1 Protein secondary structure predictions By: Refael Vivanti & Tal Tabakman

2 Rising accuracy of protein secondary structure prediction Burkhard Rost

3 ? Main dogma in biology D.N.A. R.N.A strand of A.A. protein
AGCTCTCTGAGGCTT UCGAGAGACUCCGAA AGHTY ?

4 Sequence – structure gap
Today we have much more sequenced proteins than protein’s structures. The gap is rapidly increasing. Problem: Finding protein structure isn’t that simple. Solution: A good start : find secondary structure.

5 "ויהי כל הארץ שפה אחת ודברים אחדים" בראשית יא א
"ויהי כל הארץ שפה אחת ודברים אחדים" בראשית יא א Comparing methods requires same terms and tests. Secondary structure types: H - helix E – β strand L\C – other. seq A A P P L L L L M M M G I M M R R I M E E E E E C C C C H H H H C C C E E E pred

6 How to evaluate a prediction?
The Q3 test: correctly predicted residues number of residues Of course, all methods would be tested on the same proteins.

7 Old methods Fasman & Chou (1974) :
First generation – single residue statistics Fasman & Chou (1974) : Some residues have particular secondary structure preference. Examples: Glu α-Helix Val β-strand Second generation – segment statistics Similar, but also considering adjacent residues.

8 Difficulties Q3 of strands (E) : 28% - 48%.
Bad accuracy - below 66% (Q3 results). Q3 of strands (E) : 28% - 48%. Predicted structures were too short.

9 Methods accuracy comparison

10 3rd generation methods Third generation methods reached 77% accuracy.
They consist of two new ideas: 1. A biological idea – Using evolutionary information. 2. A technological idea – Using neural networks.

11 How can evolutionary information help us?
Homologues similar structure But sequences change up to 85% Sequence would vary differently - depends on structure

12 How can evolutionary information help us?
Where can we find high sequence conservation? Some examples: In defined secondary structures. In protein core’s segments (more hydrophobic). In amphipatic helices (cycle of hydrophobic and hydrophilic residues).

13 How can evolutionary information help us?
Predictions based on multiple alignments were made manually. Problem: There isn’t any well defined algorithm! Solution: Use Neural Networks .

14 Artificial Neural Networks
An attempt to imitate the human brain construction, (assuming this is the way it works). When do we use it ? When we can’t solve the problems ourselves!!!

15 Artificial Neural Network
The neural network basic structure : Big amount of processors – “neurons”. Highly connected. Working together.

16 Artificial Neural Network
What does a neuron do? Gets “signals” from its neighbours. Each signal has different weight. When achieving certain threshold - sends signals.

17 Artificial Neural Network
General structure of ANN : One input layer. Some hidden layers. One output layer. Our ANN have one-direction flow !

18 Artificial Neural Network
A neuron may be: NOT gate gate AND OR gate Because this is a complete system, a neural network can compute anything.

19 Artificial Neural Network
Network training and testing : Test set Correct Neural network Training set Incorrect Back - propagation Training set - inputs for which we know the wanted output. Back propagation - algorithm for changing neurons pulses “power”. Test set - inputs used for final network performance test.

20 Artificial Neural Network
The Network is a ‘black box’: Even when it succeeds it’s hard to understand how. It’s difficult to conclude an algorithm from the network It’s hard to deduce new scientific principles.

21 Structure of 3rd generation methods
Find homologues using large data bases. Create a profile representing the entire protein family. Give sequence and profile to ANN. Output of the ANN: 2nd structure prediction.

22 Structure of 3rd generation methods
The ANN learning process: Training & testing set: - Proteins with known sequence & structure. Training: - Insert training set to ANN as input. - Compare output to known structure. - Back propagation.

23 3rd generation methods - difficulties
Main problem - unwise selection of training & test sets for ANN. First problem – unbalanced training Overall protein composition: Helices - 32% Strands - 21% Coils – 47% What will happen if we train the ANN with random segments ?

24 3rd generation methods - difficulties
Second problem – unwise separation between training & test proteins What will happen if homology / correlation exists between test & training proteins? over optimism! Above 80% accuracy in testing. Third problem – similarity between test proteins.

25 PSI - PRED : 3RD generation method based on the iterated
Protein Secondary Structure Prediction Based on Position – specific Scoring Matrices David T. Jones PSI - PRED : 3RD generation method based on the iterated PSI – BLAST algorithm.

26 PSI - BLAST PSSM - position specific scoring matrix Sequence Distant homologues PSI - BLAST outperforms other algorithms in finding distant homologues. PSSM – input for PSI - PRED.

27 PSI - PRED Two ANNs working together. Sequence + PSSM Prediction
ANN’s architecture: Two ANNs working together. Sequence + PSSM 1ST ANN Prediction 2ND ANN Final prediction

28 PSI - PRED Step 1: Step 2: 1ST ANN
Create PSSM from sequence - 3 iterations of PSI – BLAST. Step 2: 1ST ANN Sequence + PSSM st ANN’s input. A D C Q E I L H T S T T W Y V 15 RESIDUES E/H/C output: central amino acid secondary state prediction. A D C Q E I L H T S T T W Y V

29 PSI - PRED Using PSI - BLAST brings up PSI – BLAST difficulties:
Iteration - extension of proteins family Updating PSSM Inclusion of non – homologues “Misleading” PSSM

30 PSI - PRED Step 3: 2nd ANN what’s wrong with that ?
So why do we need a second ANN ? possible output for 1st ANN: one-amino-acid helix doesn’t exist seq A A P P L L L L M M M G I M M R R I M E E E E E C C C C C H C C C C C E E E pred what’s wrong with that ? Solution: ANN that “looks” at the whole context ! Input: output of 1st ANN. Output: final prediction.

31 PSI - PRED Training : Testing :
10% of proteins were used as “inner” test. Balanced training. Testing : 187 proteins, Highly resolved structure. PSI – BLAST was used for removing homologues. Without structural similarities.

32 PSI - PRED Jones’s reported results : Q3 results : 76% - 77%.

33 PSI - PRED Reliability numbers: The way the ANN tells us
how much it is sure about the assignment. Used by many methods. Correlates with accuracy.

34 Performance evaluation
Through 3rd generation methods accuracy jumped ~10%. Many 3rd generation methods exist today. Which method is the best one ? How to recognize “over-optimism” ?

35 Performance evaluation
CASP - Critical Assessment of Techniques for Protein Structure Prediction. EVA – Automatic Evaluation of Automatic Prediction Servers.

36

37 Performance evaluation
Conclusion : PSI-PRED seams to be one of the most reliable method today. Reasons : The widest evolutionary information (PSI - BLAST profiles). Strict training & testing criterions for ANN.

38 Improvements The first 3rd generation method PHD: ~72% in Q3.
3rd generation methods best results: ~77% in Q3 . Sources of improvement : Larger protein data bases. PSI – BLAST PSI – PRED broke through, many followed...

39 Improvements How can we do better than that ?
Through larger data bases (?). Combination of methods. Example: Combining 4 best methods Q3 of ~78% ! Find why certain proteins predicted poorly.

40 Improvements What is the limit of prediction improvement? than others.
Some regions of proteins are more mobile than others. 12% of proteins structure is unknown even by “manual” methods. The limit of accuracy is 88% !

41 Secondary structure prediction in practice
protein structure genome analysis finding structural switches

42 Finding Structural Switches
young et al: Prediction of secondary structure with several methods Different results + same preferences Structural switch ???

43 Bibliography Jones DT. Protein secondary structure prediction based on
position specific scoring matrices. J Mol Biol : Rost B. Rising accuracy of protein secondary structure prediction 'Protein structure determination, analysis, and modeling for drug discovery‘ (ed. D Chasman), New York: Dekker, pp


Download ppt "Rising accuracy of protein secondary structure prediction Burkhard Rost"

Similar presentations


Ads by Google