1
Predicting local Protein Structure Morten Nielsen
2
Use of local structure prediction
Classification of protein structures
Definition of loops (active sites)
Relevant sites for mutagenesis
Use in fold recognition methods
Improvement of alignments
Definition of domain boundaries
Disease-associated SNPs
3
Protein Secondary Structure
4
Secondary Structure Elements: β-strand, helix, turn, bend
5
Helix formation is local. Thyroid hormone receptor (2nll): the helix hydrogen-bonding pattern connects residue i to residue i+4.
6
β-sheet formation is NOT local
7
Secondary Structure Type Descriptions
H = alpha helix
G = 3₁₀-helix
I = π-helix (5-turn helix)
E = extended strand, participates in beta ladder
B = residue in isolated beta-bridge
T = hydrogen-bonded turn
S = bend
C = coil (the rest)
8
Automatic assignment programs:
DSSP (http://www.cmbi.kun.nl/gv/dssp/)
STRIDE (http://www.hgmp.mrc.ac.uk/Registered/Option/stride.html)
DSSPcont (http://cubic.bioc.columbia.edu/services/DSSPcont/)
[Example DSSP output: one line per residue giving residue number, chain, amino acid, assigned structure, bridge partners (BP1, BP2), solvent accessibility (ACC), hydrogen-bond partners and energies, TCO, KAPPA, ALPHA, PHI, PSI and CA coordinates.]
9
Prediction of protein secondary structure What to predict? How to predict? How good are the best?
10
Secondary Structure Prediction
What to predict? All 8 types, or pool types into groups (HEC)?
DSSP classes: H = alpha helix (31%), G = 3₁₀-helix (3.5%), I = π-helix (<0.1%), E = extended strand (21%), B = beta-bridge (1%), T = hydrogen-bonded turn (11%), S = bend (9%), C = coil (23%)
11
Secondary Structure Prediction
What to predict? All 8 types, or pool types into groups: straight HEC (a pooling sketch follows below).
H: alpha helix
E: extended strand
C: hydrogen-bonded turn, bend, coil, 3₁₀-helix, π-helix, beta-bridge
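To make the pooling concrete, here is a minimal sketch in Python of the straight HEC mapping, assuming H and E map to themselves and all remaining DSSP classes are pooled into C (the grouping read from the slide above); the dictionary and function names are illustrative.

```python
# Minimal sketch of pooling the 8 DSSP classes into 3 states (H/E/C).
# Assumes the "straight HEC" convention shown above: H and E map to
# themselves, every other DSSP class (G, I, B, T, S, C) is pooled into C.
DSSP_TO_HEC = {
    "H": "H",  # alpha helix
    "E": "E",  # extended strand
    "G": "C", "I": "C", "B": "C", "T": "C", "S": "C", "C": "C",
}

def pool_dssp(dssp_string: str) -> str:
    """Convert an 8-class DSSP assignment string to a 3-class HEC string."""
    return "".join(DSSP_TO_HEC.get(s, "C") for s in dssp_string)

print(pool_dssp("HHHHHHHHHHHHTGGG."))   # -> "HHHHHHHHHHHHCCCCC"
```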
12
Secondary Structure Prediction
Simple alignments: align to a close homolog whose structure has been solved experimentally.
Heuristic methods (e.g., Chou-Fasman, 1974): assign a score to each amino acid and sum over a window.
Neural networks (different inputs): raw sequence (late 80's); Blosum matrix (e.g., PHD, early 90's); position-specific alignment profiles (e.g., PSIPRED, late 90's); multiple-network balloting, probability conversion, output expansion (Petersen et al., 2000).
13
The pessimistic point of view Prediction by alignment
14
Simple Alignments
A solved structure of a homolog of the query is needed.
Homologous proteins have ~88% identical (3-state) secondary structure.
If no close homolog can be identified, alignments give almost random results.
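As an illustration of prediction by alignment, the sketch below copies a homolog's DSSP assignment onto the query across a precomputed pairwise alignment; the aligned strings and the function name transfer_ss are hypothetical examples, and the alignment step itself is assumed to be done elsewhere.

```python
# Minimal sketch of prediction by alignment: copy the DSSP assignment of a
# solved homolog onto the query across a pairwise alignment. The aligned
# strings and the homolog's DSSP string are assumed to be given (e.g. from
# a separate alignment program); this only does the label transfer.
def transfer_ss(query_aln: str, templ_aln: str, templ_ss: str) -> str:
    """Map template secondary structure onto query positions; gaps -> coil."""
    pred, t_idx = [], 0
    for q_char, t_char in zip(query_aln, templ_aln):
        ss = templ_ss[t_idx] if t_char != "-" else "C"
        if t_char != "-":
            t_idx += 1
        if q_char != "-":            # only query positions get a prediction
            pred.append(ss)
    return "".join(pred)

print(transfer_ss("PITKEVEVE-YLLRR", "PIS-EVEVEKYLLRK", "CCCEEEECCHHHHH"))
```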
15
Improvement of accuracy:
1974 Chou & Fasman: ~50-53%
1978 Garnier: 63%
1987 Zvelebil: 66%
1988 Qian & Sejnowski: 64.3%
1993 Rost & Sander: 70.8-72.0%
1997 Frishman & Argos: <75%
1999 Cuff & Barton: 72.9%
1999 Jones: 76.5%
2000 Petersen et al.: 77.9%
16
Secondary structure predictions of 1st and 2nd generation
Single residues (1st generation): Chou-Fasman, GOR (1957-70/80), 50-55% accuracy.
Segments (2nd generation): GOR III (1986-92), 55-60% accuracy.
Problems: accuracy < 100% (they said 65% was the maximum); strand accuracy < 40% (they said strand formation is non-local); predicted segments too short.
17
Amino acid preferences in α-helix
18
Amino acid preferences in β-strand
19
Amino acid preferences in coil
20
Chou-Fasman parameters:
Name  P(a)  P(b)  P(turn)  f(i)   f(i+1)  f(i+2)  f(i+3)
Ala   142    83     66     0.060  0.076   0.035   0.058
Arg    98    93     95     0.070  0.106   0.099   0.085
Asp   101    54    146     0.147  0.110   0.179   0.081
Asn    67    89    156     0.161  0.083   0.191   0.091
Cys    70   119    119     0.149  0.050   0.117   0.128
Glu   151    37     74     0.056  0.060   0.077   0.064
Gln   111   110     98     0.074  0.098   0.037   0.098
Gly    57    75    156     0.102  0.085   0.190   0.152
His   100    87     95     0.140  0.047   0.093   0.054
Ile   108   160     47     0.043  0.034   0.013   0.056
Leu   121   130     59     0.061  0.025   0.036   0.070
Lys   114    74    101     0.055  0.115   0.072   0.095
Met   145   105     60     0.068  0.082   0.014   0.055
Phe   113   138     60     0.059  0.041   0.065   0.065
Pro    57    55    152     0.102  0.301   0.034   0.068
Ser    77    75    143     0.120  0.139   0.125   0.106
Thr    83   119     96     0.086  0.108   0.065   0.079
Trp   108   137     96     0.077  0.013   0.064   0.167
Tyr    69   147    114     0.082  0.065   0.114   0.125
Val   106   170     50     0.062  0.048   0.028   0.053
21
Chou-Fasman algorithm
1. Assign the appropriate set of parameters to every residue in the peptide.
2. Scan through the peptide and identify regions where 4 out of 6 contiguous residues have P(a-helix) > 100. Such a region is declared an alpha helix. Extend the helix in both directions until a set of four contiguous residues with an average P(a-helix) < 100 is reached; that is the end of the helix. If the segment is longer than 5 residues and its average P(a-helix) > P(b-sheet), the segment is assigned as a helix.
3. Repeat this procedure to locate all of the helical regions in the sequence.
4. Scan through the peptide and identify regions where 3 out of 5 contiguous residues have P(b-sheet) > 100. Such a region is declared a beta sheet. Extend the sheet in both directions until a set of four contiguous residues with an average P(b-sheet) < 100 is reached; that is the end of the sheet. The segment is assigned as a beta sheet if its average P(b-sheet) > 105 and its average P(b-sheet) > P(a-helix).
5. Any region with overlapping alpha-helical and beta-sheet assignments is taken to be helical if its average P(a-helix) > P(b-sheet) for that region; it is a beta sheet if its average P(b-sheet) > P(a-helix) for that region.
6. To identify a bend at residue number j, calculate p(t) = f(j) f(j+1) f(j+2) f(j+3), where the f(i+1) value is taken for residue j+1, the f(i+2) value for residue j+2 and the f(i+3) value for residue j+3. A beta turn is predicted at that location if: (1) p(t) > 0.000075; (2) the average value of P(turn) in the tetrapeptide is > 1.00; and (3) the averages for the tetrapeptide obey the inequality P(a-helix) < P(turn) > P(b-sheet).
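A minimal sketch of the helix-nucleation scan in step 2, assuming the P(a-helix) values from the table above (only a subset is included); helix extension and the sheet and turn rules are omitted, and the function name helix_nuclei is illustrative.

```python
# Minimal sketch of Chou-Fasman helix nucleation (step 2 above):
# find windows of 6 residues where at least 4 have P(a-helix) > 100.
# P_ALPHA holds helix propensities from the table above (subset only);
# extension until P(a-helix) < 100 and the sheet/turn rules are omitted.
P_ALPHA = {"A": 142, "E": 151, "L": 121, "M": 145, "K": 114, "G": 57,
           "P": 57, "V": 106, "I": 108, "F": 113, "T": 83, "R": 98}

def helix_nuclei(seq, window=6, min_hits=4, threshold=100):
    """Return start positions of candidate helix nucleation windows."""
    starts = []
    for i in range(len(seq) - window + 1):
        hits = sum(P_ALPHA.get(aa, 0) > threshold for aa in seq[i:i + window])
        if hits >= min_hits:
            starts.append(i)
    return starts

print(helix_nuclei("PITKEVEVEYLLRRLEE"))   # candidate nucleation sites
```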
22
Chou-Fasman
Generally applicable
Works for sequences with no solved homologs
But the accuracy is low (~50%)!
23
Improvement of accuracy:
1974 Chou & Fasman: ~50-53%
1978 Garnier: 63%
1987 Zvelebil: 66%
1988 Qian & Sejnowski: 64.3%
1993 Rost & Sander: 70.8-72.0%
1997 Frishman & Argos: <75%
1999 Cuff & Barton: 72.9%
1999 Jones: 76.5%
2000 Petersen et al.: 77.9%
24
PHD method (Rost and Sander, 1993!!)
Combines neural networks with sequence profiles: a 6-8 percentage point increase in prediction accuracy over standard neural networks (63% -> 71%).
Uses a second-layer "structure-to-structure" network to filter predictions.
Jury of predictors.
Set up as a mail server.
25
Sequence profiles
26
Neural Networks
Benefits: generally applicable; can capture higher-order correlations; accepts inputs other than sequence information.
Drawbacks: needs a lot of data (distinct solved structures), although such data do exist today (nearly 5000 solved structures with low mutual sequence identity and high resolution); a complex method with several pitfalls.
27
How is it done?
One network (SEQ2STR) takes the sequence (profiles) as input and predicts secondary structure. It cannot handle SS elements as such, i.e. the fact that helices are normally formed by at least 5 consecutive amino acids.
28
[Architecture figure: a sliding window over the sequence IKEEHVIIQAEFYLNPDQSGEF... (here I K E E H V I I Q A E) feeds the input layer; weights connect it to a hidden layer and a three-unit output layer (H, E, C).]
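The sketch below illustrates the sliding-window input encoding suggested by the figure, using a one-hot encoding and an assumed window size of 13; profile-based methods feed profile columns instead of one-hot vectors, but the windowing is the same.

```python
import numpy as np

# Minimal sketch of the sliding-window input encoding used by
# sequence-to-structure networks: each residue is presented together with
# its neighbours inside a fixed window (13 here, an assumption) and
# sparse/one-hot encoded, giving one input row per residue.
AA = "ACDEFGHIKLMNPQRSTVWY"

def encode_windows(seq: str, window: int = 13) -> np.ndarray:
    half = window // 2
    padded = "X" * half + seq + "X" * half           # pad the termini
    rows = []
    for i in range(len(seq)):
        vec = np.zeros((window, len(AA)))
        for j, aa in enumerate(padded[i:i + window]):
            if aa in AA:                              # padding stays all-zero
                vec[j, AA.index(aa)] = 1.0
        rows.append(vec.flatten())
    return np.array(rows)

X = encode_windows("IKEEHVIIQAEFYLNPDQSGEF")
print(X.shape)   # (22, 13 * 20): one 260-dimensional input per residue
```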
29
Example:
PITKEVEVEYLLRRLEE  (Sequence)
HHHHHHHHHHHHTGGG.  (DSSP)
ECCCHEEHHHHHHHCCC  (SEQ2STR)
30
How is it done?
One network (SEQ2STR) takes the sequence (profiles) as input and predicts secondary structure. It cannot handle SS elements as such, i.e. the fact that helices are normally formed by at least 5 consecutive amino acids.
A second network (STR2STR) takes the predictions of the first network as input and predicts secondary structure. It can correct errors in SS elements, e.g. remove isolated single-residue helix predictions or mixtures of strand and helix predictions.
31
[Structure-to-structure network figure: a window of the first network's H/E/C outputs along IKEEHVIIQAEFYLNPDQSGEF... feeds the input layer; weights connect it to a hidden layer and a three-unit H/E/C output layer.]
32
Example:
PITKEVEVEYLLRRLEE  (Sequence)
HHHHHHHHHHHHTGGG.  (DSSP)
ECCCHEEHHHHHHHCCC  (SEQ2STR)
CCCCHHHHHHHHHHCCC  (STR2STR)
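The real STR2STR step is a second neural network trained on the first network's outputs; as a stand-in illustration of what it achieves, this sketch applies a simple majority filter over a window of the SEQ2STR string to smooth away isolated assignments (the window size of 5 is an arbitrary choice).

```python
# Stand-in illustration of the STR2STR idea: smooth the first network's
# ragged H/E/C string with a simple per-position majority vote over a
# small window, so isolated, too-short assignments tend to be removed.
def smooth(pred: str, window: int = 5) -> str:
    half = window // 2
    out = []
    for i in range(len(pred)):
        segment = pred[max(0, i - half): i + half + 1]
        out.append(max("HEC", key=segment.count))    # majority vote in window
    return "".join(out)

print(smooth("ECCCHEEHHHHHHHCCC"))   # smoother than the raw SEQ2STR output
```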
33
Slide courtesy of B. Rost, 2004
34
Prediction accuracy of PHD. Slide courtesy of B. Rost, 2004.
35
Stronger predictions are more accurate!
36
PSI-Pred (Jones)
Uses alignments from iterative sequence searches (PSI-BLAST) as input to a neural network (just like PHDsec).
Better predictions due to better sequence profiles.
Available as a stand-alone program and via the web.
37
Petersen et al. 2000
SEQ2STR (>70 networks): no single network architecture is best for all sequences.
STR2STR (>70 networks) => 4900 combined network predictions (wisdom of the crowd!). A sketch of the combination step follows below.
Other methods use just one network.
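A minimal sketch of the combination step, assuming each network outputs per-residue H/E/C probabilities: the probabilities are averaged over the ensemble and the class with the highest mean is taken; the toy numbers below are placeholders, not real network outputs.

```python
import numpy as np

# Minimal sketch of the "wisdom of the crowd" combination: average the
# per-residue H/E/C probabilities produced by an ensemble of networks and
# take the class with the highest mean probability.
def combine(ensemble_probs: np.ndarray) -> str:
    """ensemble_probs: (n_networks, n_residues, 3) array of H/E/C probabilities."""
    mean = ensemble_probs.mean(axis=0)               # average over the ensemble
    return "".join("HEC"[k] for k in mean.argmax(axis=1))

# Toy example: 3 "networks", 4 residues (placeholder numbers).
probs = np.array([
    [[0.7, 0.1, 0.2], [0.6, 0.2, 0.2], [0.2, 0.1, 0.7], [0.1, 0.8, 0.1]],
    [[0.5, 0.2, 0.3], [0.4, 0.3, 0.3], [0.3, 0.2, 0.5], [0.2, 0.7, 0.1]],
    [[0.6, 0.2, 0.2], [0.3, 0.3, 0.4], [0.1, 0.2, 0.7], [0.3, 0.6, 0.1]],
])
print(combine(probs))   # -> "HHCE"
```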
38
Why so many networks?
39
Why not select the best?
40
Prediction accuracy (Q3 = 81.2%, 2006; method of Petersen et al. 2000)
41
SH3 domain of spectrin
CEEEEEEECCCCCCCCCCCCCCCCEEEEEECCCCCEEEEEECCCEEEECCCCCEECC  (prediction, Petersen method, 93%)
.EEEEESS.B...STTB..B.TT.EEEEEE..SSSEEEEEETTEEEEEEGGGEEE..  (DSSP)
42
Prediction of protein secondary structure:
1980: 55% (simple)
1990: 60% (less simple)
1993: 70% (evolution)
2000: 76% (more evolution)
2006: 80% (more evolution)
2008: >80% (more evolution)
43
Links to servers
Database of links: http://mmtsb.scripps.edu/cgi-bin/renderrelres?protmodel
ProfPHD: http://www.predictprotein.org/
PSIPRED: http://bioinf.cs.ucl.ac.uk/psipred/
JPred: http://www.compbio.dundee.ac.uk/~www-jpred/
44
Surface exposure
45
What is solvent accessible area? The surface area accessible to a rolling water molecule.
46
RSA = Relative Solvent Accessibility
ACC = accessible area of the residue in the protein structure
ASA = accessible surface area of the residue in Gly-X-Gly or Ala-X-Ala
RSA = ACC / ASA
Classification: Buried = RSA < 25%, Exposed = RSA > 25%
"Real" value: values 0-1; RSA > 1 is set to 1
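A small worked example of the definitions above, with RSA = ACC / ASA, clipping at 1, and the 25% buried/exposed cutoff from the slide; the ASA_MAX values are rough placeholders, not the exact Gly-X-Gly reference values.

```python
# Minimal sketch of the RSA definitions above: RSA = ACC / ASA, where ACC is
# the accessible area measured in the protein structure (e.g. from DSSP) and
# ASA is the maximal accessible area of residue X in an extended Gly-X-Gly
# peptide. The ASA_MAX numbers are illustrative placeholders only.
ASA_MAX = {"A": 113.0, "G": 85.0, "L": 180.0, "K": 211.0}   # A^2, assumed values

def rsa(residue: str, acc: float) -> float:
    value = acc / ASA_MAX[residue]
    return min(value, 1.0)              # RSA > 1 is clipped to 1

def classify(value: float, cutoff: float = 0.25) -> str:
    return "Exposed" if value >= cutoff else "Buried"

r = rsa("A", 30.0)                      # ACC = 30 A^2 for an alanine
print(round(r, 2), classify(r))         # -> 0.27 Exposed
```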
47
Method
48
Neural Network - Input
Position Specific Scoring Matrix (PSSM):
              A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V
B H 2BEM.A 1  -4 -3 -2 -4 -6 -2 -3 -5 11 -6 -5 -3 -4 -4 -5 -3 -4 -5 -1 -6
A G 2BEM.A 2  -2 -5 -3 -4 -5 -4 -5  7 -5 -7 -6 -4 -5 -6 -5 -3 -4 -5 -6 -6
A Y 2BEM.A 3  -1  1 -4 -3 -5 -4 -4 -4  1 -4 -1 -4 -1  2 -5  0 -1  4  7 -2
A V 2BEM.A 4  -1 -5 -5 -6 -4 -4 -5 -5 -5  4  1 -5  6 -3 -2 -2  0 -5 -4  4
B E 2BEM.A 5  -2 -4 -3  0 -4 -1  3 -2 -4  0 -3 -2  1 -2 -3  3  3 -5 -4  0
Secondary structure predictions (three class probabilities per residue):
B H 2BEM.A 1  0.003 0.003 0.966
A G 2BEM.A 2  0.018 0.086 0.868
A Y 2BEM.A 3  0.020 0.199 0.752
A V 2BEM.A 4  0.021 0.271 0.679
B E 2BEM.A 5  0.020 0.199 0.752
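As a hedged sketch of how such inputs could be assembled, the code below concatenates a window of PSSM rows with the central residue's predicted secondary-structure probabilities; the window size and feature layout are assumptions, not NetSurfP's published configuration.

```python
import numpy as np

# Sketch of building one input vector for a surface-accessibility network
# as on the slide: a window of PSSM rows (20 scores per residue) plus the
# predicted secondary-structure probabilities of the central residue.
def build_input(pssm: np.ndarray, ss_probs: np.ndarray, i: int, window: int = 5) -> np.ndarray:
    half = window // 2
    padded = np.pad(pssm, ((half, half), (0, 0)))     # zero-pad the termini
    pssm_window = padded[i:i + window].flatten()       # window x 20 PSSM scores
    return np.concatenate([pssm_window, ss_probs[i]])

pssm = np.random.randint(-6, 8, size=(10, 20)).astype(float)   # toy 10-residue PSSM
ss = np.random.dirichlet(np.ones(3), size=10)                   # toy class probabilities
print(build_input(pssm, ss, i=0).shape)                          # -> (5*20 + 3,) = (103,)
```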
49
Wisdom of the crowd
Selecting the best-performing network architectures based on test performance: better than choosing any single network.
[Figure: performance as a function of ensemble size.]
50
Results - real-value networks (Training / Evaluation):
Method                        Train          Evaluated      Type
Ahmad et al. (2003)           Not published  0.48           ANN
Yuan and Huang (2004)         Not published  0.52           SVR
Nguyen and Rajapakse (2006)   Not published  0.66           Two-Stage SVR
Dor and Zhou (2007)           0.738          Not published  ANN
NetSurfP                      0.722          0.70           ANN
51
Accuracy of predictions
Prediction methods will always give an answer; a given method will, for example, predict that 25% of the residues in a protein are exposed. But can you trust these predictions?
Benchmarking gives the average prediction accuracy of a method evaluated on a large independent data set. But what about the reliability of each individual residue prediction?
52
Reliability (one real-valued target value)
[Figure: network with input, hidden and output layers; one target value per input, but two output values, e.g. o = 0.55 and w = 0.8. Optimal value: 0 => w = 0; ∞ => w = 1.]
53
Performance
54
NetSurfP
56
Conclusions
The big breakthrough in SS prediction came with sequence profiles (Rost et al.).
Prediction of secondary structure has not changed fundamentally in the last 5 years: more protein sequences give higher prediction accuracy, but there has been no new theoretical breakthrough.
Accuracy is close to 80% for globular proteins.
If you need a secondary structure prediction, use one of the profile-based methods: PSIPRED or NetSurfP.
Amino acid exposure can be predicted with high accuracy (80%): NetSurfP and Real-SPINE.