Protein Secondary Structures: Assignment and Prediction
Secondary Structure Elements: β-strand, helix, turn, bend
Use of secondary structure
– Classification of protein structures
– Definition of loops (active sites)
– Use in fold recognition methods
– Improvement of alignments
– Definition of domain boundaries
Classification of secondary structure
– Defining features: dihedral angles, hydrogen bonds, geometry
– Assigned manually by crystallographers, or
– Assigned automatically: DSSP (Kabsch & Sander, 1983), STRIDE (Frishman & Argos, 1995), DSSPcont (Andersen et al., 2002)
Dihedral Angles
– phi: dihedral angle about the N-Calpha bond
– psi: dihedral angle about the Calpha-C bond
– omega: dihedral angle about the C-N (peptide) bond
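A dihedral angle such as phi, psi or omega can be computed from the coordinates of the four consecutive backbone atoms that define it. The block below is a minimal sketch in plain Python, not taken from the slides: the function name `dihedral`, the helper names and the toy coordinates are illustrative assumptions, and coordinates are assumed to be (x, y, z) tuples.

```python
import math

def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle in degrees about the p1-p2 bond.

    Backbone atoms to pass in (standard definitions):
      phi(i):   C(i-1), N(i),  CA(i),  C(i)
      psi(i):   N(i),   CA(i), C(i),   N(i+1)
      omega(i): CA(i),  C(i),  N(i+1), CA(i+1)
    """
    b0 = _sub(p0, p1)
    b1 = _sub(p2, p1)
    b2 = _sub(p3, p2)
    norm = math.sqrt(_dot(b1, b1))
    b1 = (b1[0] / norm, b1[1] / norm, b1[2] / norm)
    # Components of the outer bonds perpendicular to the central bond.
    v = _sub(b0, tuple(_dot(b0, b1) * c for c in b1))
    w = _sub(b2, tuple(_dot(b2, b1) * c for c in b1))
    x = _dot(v, w)
    y = _dot(_cross(b1, v), w)
    return math.degrees(math.atan2(y, x))

# Toy coordinates (hypothetical, not from a real structure).
trans = dihedral((0, 1, 0), (0, 0, 0), (1, 0, 0), (1, -1, 0))  # planar, trans
cis = dihedral((0, 1, 0), (0, 0, 0), (1, 0, 0), (1, 1, 0))     # planar, cis
print(abs(trans), abs(cis))
```

The two toy cases correspond to a planar trans arrangement (as for omega in a typical peptide bond) and a planar cis arrangement.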
Helices: H-bond pattern by helix type
– right-handed alpha-helix: i+4
– pi-helix: i+5
– 3-10 helix: i+3
(omega is 180 deg in all cases)
Beta Strands
Figure: hydrogen bond patterns in beta sheets. A four-stranded beta sheet is drawn schematically; it contains three antiparallel strands and one parallel strand. Hydrogen bonds are indicated with red lines (antiparallel strands) and green lines (parallel strands) connecting the hydrogen and the acceptor oxygen.
Secondary Structure Elements: β-strand, helix, turn, bend
Helix formation is local: thyroid hormone receptor (2nll)
β-sheet formation is NOT local
Secondary Structure Type Descriptions
* H = alpha helix
* G = 3-helix (3/10 helix)
* I = 5-helix (pi helix)
* E = extended strand, participates in beta ladder
* B = residue in isolated beta-bridge
* T = hydrogen bonded turn
* S = bend
* C = coil
Automatic assignment programs: DSSP, STRIDE, DSSPcont
Example DSSP output (columns: # RESIDUE AA STRUCTURE BP1 BP2 ACC N-H-->O O-->H-N N-H-->O O-->H-N TCO KAPPA ALPHA PHI PSI X-CA Y-CA Z-CA)
AA column:        EHVIIQAEFYLNPDQSGEFMFDFDGDE
STRUCTURE column:    EEEEEEEETTTTEEEEEEEETTEE
Prediction of protein secondary structure
– What to predict?
– How to predict?
– How good are the best?
Secondary Structure Prediction: what to predict?
– All 8 types, or pool types into groups (H, E, C) for Q3
DSSP grouping:
* H = alpha helix, G = 3-helix (3/10 helix), I = 5-helix (pi helix) => H
* E = extended strand, B = beta-bridge => E
* T = hydrogen bonded turn, S = bend, C = coil => C
Secondary Structure Prediction: what to predict?
– All 8 types, or pool types into groups (H, E, C) for Q3
Straight HEC grouping:
* H = alpha helix => H
* E = extended strand => E
* T = hydrogen bonded turn, S = bend, C = coil, G = 3-helix (3/10 helix), I = 5-helix (pi helix), B = beta-bridge => C
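The two ways of pooling the eight DSSP types into three states can be written as simple lookup tables. This is a sketch under our reading of the two slides above: the "DSSP" grouping is assumed to map H, G, I to H and E, B to E, while the "straight HEC" grouping keeps only H and E and sends everything else to C. The names `DSSP_GROUPING`, `STRAIGHT_GROUPING` and `reduce_states` are illustrative.

```python
# Assumed grouping 1: helix-like types pooled into H, strand-like into E.
DSSP_GROUPING = {"H": "H", "G": "H", "I": "H",
                 "E": "E", "B": "E",
                 "T": "C", "S": "C", "C": "C"}

# Assumed grouping 2 ("straight HEC"): only H and E keep their own class.
STRAIGHT_GROUPING = {"H": "H", "E": "E"}

def reduce_states(dssp_string, grouping):
    """Map an 8-state DSSP string to 3 states (H, E, C).

    Any symbol missing from the grouping (including the blank or '.'
    that marks unassigned residues) is treated as coil.
    """
    return "".join(grouping.get(symbol, "C") for symbol in dssp_string)

dssp = "HHHHHHHHHHHHTGGG."  # DSSP string from the Example slide
print(reduce_states(dssp, DSSP_GROUPING))
print(reduce_states(dssp, STRAIGHT_GROUPING))
```

The choice of grouping changes the reference string, and therefore the Q3 value a predictor is credited with: here the 3-helix residues (G) count as helix in one scheme and as coil in the other.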
Secondary Structure Prediction
– Simple alignments: align to a close homolog for which the structure has been experimentally solved
– Heuristic methods (e.g., Chou-Fasman, 1974): apply scores for each amino acid and sum up over a window
– Neural networks (different inputs):
  – Raw sequence (late 80s)
  – Blosum matrix (e.g., PHD, early 90s)
  – Position-specific alignment profiles (e.g., PSIPRED, late 90s)
  – Multiple networks balloting, probability conversion, output expansion (Petersen et al., 2000)
FoRc HoMo 1D ….the art of being humble The pessimistic point of view Prediction by alignment
Secondary structure predictions of the 1st and 2nd generation
– Single residues (1st generation): Chou-Fasman, GOR
– Segments (2nd generation): GORIII
– Problems:
  – < 100% (they said: 65% max)
  – < 40% (they said: strand non-local)
  – short segments
Improvement of accuracy
1974  Chou & Fasman      ~50-53%
1978  Garnier            63%
1987  Zvelebil           66%
1988  Qian & Sejnowski   64.3%
1993  Rost & Sander
1997  Frishman & Argos   <75%
1999  Cuff & Barton      72.9%
1999  Jones              76.5%
2000  Petersen et al.    77.9%
Simple Alignments
– A solved structure of a homolog of the query is needed
– Homologous proteins have ~88% identical (3-state) secondary structure
– If no close homolog can be identified, alignments give almost random results
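Prediction by alignment amounts to copying the secondary structure of the solved homolog onto the query, column by column. The sketch below assumes the pairwise alignment has already been computed elsewhere and is given as two gapped strings of equal length; the function name `transfer_ss`, the "?" placeholder for unaligned query residues and the toy alignment are all illustrative assumptions.

```python
def transfer_ss(aligned_query, aligned_template, template_ss, unknown="?"):
    """Copy template secondary structure onto the query along an alignment.

    aligned_query / aligned_template: gapped strings ('-' = gap), same length.
    template_ss: one state per ungapped template residue.
    Query residues aligned to a template gap receive `unknown`.
    """
    if len(aligned_query) != len(aligned_template):
        raise ValueError("aligned sequences must have the same length")
    n_template = sum(1 for c in aligned_template if c != "-")
    if n_template != len(template_ss):
        raise ValueError("template_ss must cover every template residue")

    predicted = []
    t = 0  # index into template_ss
    for q, tpl in zip(aligned_query, aligned_template):
        state = None
        if tpl != "-":
            state = template_ss[t]
            t += 1
        if q != "-":
            predicted.append(state if state is not None else unknown)
    return "".join(predicted)

# Toy alignment (hypothetical sequences, not from the slides).
print(transfer_ss("AC-DE", "ACGD-", "HHHE"))
```

Insertions in the query relative to the template have no structural information to inherit, which is one reason the approach degrades quickly when only distant homologs are available.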
Amino acid preferences in α-helix
Amino acid preferences in β-strand
Amino acid preferences in coil
Chou-Fasman parameters
Table columns: Name, P(a), P(b), P(turn), f(i), f(i+1), f(i+2), f(i+3)
Table rows: Ala, Arg, Asp, Asn, Cys, Glu, Gln, Gly, His, Ile, Leu, Lys, Met, Phe, Pro, Ser, Thr, Trp, Tyr, Val
Chou-Fasman
1. Assign all of the residues in the peptide the appropriate set of parameters.
2. Scan through the peptide and identify regions where 4 out of 6 contiguous residues have P(a-helix) > 100. That region is declared an alpha-helix. Extend the helix in both directions until a set of four contiguous residues with an average P(a-helix) < 100 is reached; that marks the end of the helix. If the resulting segment is longer than 5 residues and the average P(a-helix) > P(b-sheet) for that segment, the segment can be assigned as a helix.
3. Repeat this procedure to locate all of the helical regions in the sequence.
4. Scan through the peptide and identify a region where 3 out of 5 of the residues have P(b-sheet) > 100. That region is declared a beta-sheet. Extend the sheet in both directions until a set of four contiguous residues with an average P(b-sheet) < 100 is reached; that marks the end of the sheet. The segment is assigned as a beta-sheet if the average P(b-sheet) > 105 and the average P(b-sheet) > P(a-helix) for that region.
5. Any region containing overlapping alpha-helical and beta-sheet assignments is taken to be helical if the average P(a-helix) > P(b-sheet) for that region. It is a beta-sheet if the average P(b-sheet) > P(a-helix) for that region.
6. To identify a bend at residue number j, calculate p(t) = f(j)f(j+1)f(j+2)f(j+3), where the f(j+1) value for the j+1 residue is used, the f(j+2) value for the j+2 residue is used and the f(j+3) value for the j+3 residue is used. If (1) p(t) > 0.000075; (2) the average value for P(turn) > 1.00 in the tetrapeptide; and (3) the averages for the tetrapeptide obey the inequality P(a-helix) < P(turn) > P(b-sheet), then a beta-turn is predicted at that location.
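The helix part of the procedure (step 2) can be sketched as a nucleation scan followed by an extension. This is one possible reading, not a reference implementation: the extension is assumed to stop when the four-residue window at the growing edge averages below 100, and the propensity values in `TOY_P_ALPHA` are illustrative placeholders, not the published Chou-Fasman table. The function names are hypothetical as well.

```python
def helix_nuclei(seq, p_alpha, window=6, min_hits=4, cutoff=100):
    """Return (start, end) for every window with enough helix formers.

    `end` is exclusive. A residue counts as a former if P(a-helix) > cutoff.
    """
    nuclei = []
    for start in range(len(seq) - window + 1):
        hits = sum(1 for aa in seq[start:start + window] if p_alpha[aa] > cutoff)
        if hits >= min_hits:
            nuclei.append((start, start + window))
    return nuclei

def extend_helix(seq, p_alpha, start, end, probe=4, cutoff=100):
    """Grow a nucleus one residue at a time in both directions.

    Assumed stopping rule: stop when the `probe`-residue window that
    includes the next candidate residue has mean P(a-helix) < cutoff.
    """
    def mean(a, b):
        return sum(p_alpha[aa] for aa in seq[a:b]) / (b - a)

    while start > 0 and mean(start - 1, start - 1 + probe) >= cutoff:
        start -= 1
    while end < len(seq) and mean(end + 1 - probe, end + 1) >= cutoff:
        end += 1
    return start, end

# Placeholder propensities for a six-letter toy alphabet (NOT the real table).
TOY_P_ALPHA = {"A": 140, "E": 150, "L": 120, "G": 55, "P": 55, "S": 75}

toy_seq = "GPSAELAELGPS"
nuclei = helix_nuclei(toy_seq, TOY_P_ALPHA)
segment = extend_helix(toy_seq, TOY_P_ALPHA, *nuclei[0])
print(nuclei, segment)
```

A full predictor would also merge overlapping nuclei, apply the length and P(a-helix) > P(b-sheet) checks, and run the analogous scan for strands and turns.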
Chou-Fasman
– Generally applicable
– Works for sequences with no solved homologs
– But the accuracy is low!
Neural Networks
– Benefits
  – Generally applicable
  – Can capture higher order correlations
  – Can use inputs other than sequence information
– Drawbacks
  – Need large amounts of data (different solved structures). However, these do exist today (nearly 2500 solved structures with low sequence identity/high resolution).
  – Complex method with several pitfalls
How is it done
– One network (SEQ2STR) takes sequence (profiles) as input and predicts secondary structure
  – Cannot take whole SS elements into account, e.g. that helices are normally formed by at least 5 consecutive amino acids
– A second network (STR2STR) takes the predictions of the first network and predicts secondary structure
  – Can correct errors in SS elements, e.g. remove single-residue helix predictions or mixtures of strand and helix predictions
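Architecture
Figure: a window (I K E E H V I I Q A E) slides along the sequence IKEEHVIIQAEFYLNPDQSGEF…; the window feeds the input layer, which is connected through weights to a hidden layer and an output layer with units H, E, C.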
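The input layer of a sequence-to-structure network sees one window of residues at a time. A common way to present a raw-sequence window to a network is sparse (one-hot) encoding, sketched below. The details are assumptions rather than something the slide specifies: 20 amino acid units plus one extra unit for positions that fall outside the sequence, and a default window of 11 residues as drawn in the figure. The function name `encode_window` is illustrative.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
UNITS_PER_POSITION = len(AMINO_ACIDS) + 1  # extra unit marks "off the sequence"

def encode_window(seq, center, window=11):
    """Sparse-encode the residues around `center` as a flat 0/1 list.

    Positions before the N-terminus or after the C-terminus switch on
    the padding unit. Residue letters outside the 20 standard amino
    acids are left as all zeros.
    """
    if window % 2 == 0:
        raise ValueError("window must be odd so that it has a central residue")
    half = window // 2
    vector = []
    for pos in range(center - half, center + half + 1):
        units = [0] * UNITS_PER_POSITION
        if 0 <= pos < len(seq):
            idx = AMINO_ACIDS.find(seq[pos])
            if idx >= 0:
                units[idx] = 1
        else:
            units[-1] = 1
        vector.extend(units)
    return vector

SEQ = "IKEEHVIIQAEFYLNPDQSGEF"  # sequence shown on the slide
x = encode_window(SEQ, center=5)  # window I K E E H V I I Q A E
print(len(x), sum(x))
```

Profile-based methods replace the single 1 per position with a column of substitution scores or alignment frequencies, but the windowing logic stays the same.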
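Secondary networks (Structure-to-Structure)
Figure: a window of H/E/C outputs from the first network, taken along the sequence IKEEHVIIQAEFYLNPDQSGEF…, feeds the input layer, which is connected through weights to a hidden layer and an output layer with units H, E, C.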
Example
PITKEVEVEYLLRRLEE (Sequence)
HHHHHHHHHHHHTGGG. (DSSP)
ECCCHEEHHHHHHHCCC (SEQ2STR)
CCCCHHHHHHHHHHCCC (STR2STR)
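The effect of the second network in this example can be quantified with Q3, the percentage of residues whose three-state prediction matches the assignment. The sketch below assumes the "straight HEC" reduction for the DSSP string (H and E kept, everything else coil); with a different grouping the numbers would differ. The function name `q3` is illustrative.

```python
def q3(observed, predicted):
    """Percentage of positions where the 3-state strings agree."""
    if len(observed) != len(predicted):
        raise ValueError("strings must have the same length")
    if not observed:
        raise ValueError("strings must not be empty")
    correct = sum(1 for o, p in zip(observed, predicted) if o == p)
    return 100.0 * correct / len(observed)

dssp = "HHHHHHHHHHHHTGGG."
seq2str = "ECCCHEEHHHHHHHCCC"
str2str = "CCCCHHHHHHHHHHCCC"

# Assumed reduction: keep H and E, treat every other symbol as coil.
observed = "".join(s if s in "HE" else "C" for s in dssp)

print(round(q3(observed, seq2str), 1), round(q3(observed, str2str), 1))
```

Under this assumption the first-level prediction matches 9 of 17 residues and the filtered prediction 11 of 17; on such a short fragment the absolute values mean little, but the direction of the change illustrates what the structure-to-structure step is for.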
PHD method (Rost and Sander)
– Combine neural networks with sequence profiles
  – 6-8 percentage points increase in prediction accuracy over standard neural networks
– Use a second-layer "structure to structure" network to filter predictions
– Jury of predictors
– Set up as mail server
Sequence profiles
Prediction accuracy PHD
Stronger predictions more accurate!
PSI-Pred (Jones)
– Use alignments from iterative sequence searches (PSI-Blast) as input to a neural network
– Better predictions due to better sequence profiles
– Available as a stand-alone program and via the web
Petersen et al.
– SEQ2STR (>70 networks)
  – No single network architecture is best for all sequences
– STR2STR (>70 networks)
  – => 4900 network predictions; others have 1
– ACT2PROB (not used by others)
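Combining many networks can be illustrated with the simplest possible ballot: average the per-residue class probabilities over all networks and pick the largest. This is a simplified stand-in, not the balloting scheme of Petersen et al.; the function name `ballot` and the toy probabilities are illustrative assumptions.

```python
STATES = "HEC"

def ballot(network_outputs):
    """Average per-residue (pH, pE, pC) tuples over networks.

    network_outputs: one list per network, each holding one
    (pH, pE, pC) tuple per residue. Returns the consensus string
    and the averaged probabilities.
    """
    if not network_outputs:
        raise ValueError("need at least one network")
    n_res = len(network_outputs[0])
    if any(len(net) != n_res for net in network_outputs):
        raise ValueError("all networks must cover the same residues")

    n_nets = len(network_outputs)
    averaged = []
    for i in range(n_res):
        averaged.append(tuple(
            sum(net[i][k] for net in network_outputs) / n_nets
            for k in range(len(STATES))
        ))
    consensus = "".join(
        STATES[max(range(len(STATES)), key=lambda k: probs[k])]
        for probs in averaged
    )
    return consensus, averaged

# Toy outputs of three hypothetical networks for two residues.
net1 = [(0.6, 0.1, 0.3), (0.2, 0.5, 0.3)]
net2 = [(0.3, 0.2, 0.5), (0.1, 0.6, 0.3)]
net3 = [(0.5, 0.1, 0.4), (0.3, 0.3, 0.4)]
consensus, averaged = ballot([net1, net2, net3])
print(consensus)
```

Averaging tends to cancel errors that individual networks make independently, which is the usual argument for keeping many networks rather than selecting a single best one.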
Why so many networks?
Why not select the best?
Prediction accuracy (Q3=81.2%) (Petersen et al. 2000)
Spectrin homology domain (SH3)
CEEEEEEECCCCCCCCCCCCCCCCEEEEEECCCCCEEEEEECCCEEEECCCCCEECC (prediction, Petersen: 93%)
.EEEEESS.B...STTB..B.TT.EEEEEE..SSSEEEEEETTEEEEEEGGGEEE.. (DSSP)
False prediction for engineered proteins!
Benchmarking secondary structure predictions
– CASP
  – Critical Assessment of Structure Predictions
  – Sequences from about-to-be-deposited structures are given to groups, who submit their predictions before the structure is published
  – Every second year
– EVA
  – Newly solved structures are sent to prediction servers
  – Every week
EVA results (Rost et al., 2001)
PROFphd      77.0%
PSIPRED      76.8%
SAM-T99sec   76.1%
SSpro        76.0%
Jpred2       75.5%
PHD          71.7%
– Cubic.columbia.edu/eva
EVA: secondary structure 76% Petersen et al. Proteins 2000
Prediction of protein secondary structure
1980: 55%  simple
1990: 60%  less simple
1993: 70%  evolution
2000: 76%  more evolution
2006: 80%  more evolution
What is the limit?
– 88% for proteins of similar structure
– 80% for 1/5th of proteins with families > 100
Missing through:
– better definition of secondary structure
– including long-range interactions
– structural switches
– chameleon / folding
Links to servers
– Database of links: bin/renderrelres?protmodel
– ProfPHD
– PSIPRED
– JPred
Conclusion
– Prediction of secondary structure has not changed in the last 5 years
  – More protein sequences => higher prediction accuracy
  – No new theoretical breakthrough
– Accuracy is close to 80% for globular proteins
– If you need a secondary structure prediction, use one of the profile-based methods:
  – ProfPHD
  – PSIPRED
  – JPred
– And not one of the older ones, such as:
  – Chou-Fasman
  – Garnier