V2 SS 2009 Membrane Bioinformatics: V2 - Predicting TM helices from sequence


1 V2 - Predicting TM helices from sequence: Review
20 - 25% of all genes code for transmembrane proteins.
The high energetic cost of dehydrating the peptide bond has two consequences:
(a) The amino acid side chains in the TM region must be non-polar.
(b) The polar peptide bond must "self-satisfy" its H-bonding potential.
Therefore, polypeptide chains form alpha-helices or beta-sheets in the hydrophobic core of the membrane.
Jones, Bioinformatics 23, 538 (2007)

2 TM protein topology prediction
Relies on two major topological features:
(1) TM helices are generally formed by hydrophobic stretches.
(2) Gunnar von Heijne observed a bias towards positively charged residues in the regions flanking the hydrophobic stretches, especially on the intracellular side of the membrane. This feature has been termed the positive-inside rule.
Short loops are found to be enriched with Lys and Arg residues on the intracellular side and depleted on the outside.
Jones, Bioinformatics 23, 538 (2007)
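The positive-inside rule can be illustrated with a small sketch. It assumes the TM helices are already known as half-open (start, end) index pairs and simply compares the Lys/Arg counts of alternating loops. The function names and the tie-breaking choice are illustrative assumptions, not part of any published method.

```python
from typing import List, Tuple

def loop_charges(seq: str, helices: List[Tuple[int, int]]) -> List[int]:
    """Count Lys (K) and Arg (R) in each loop.

    helices: sorted, non-overlapping, half-open (start, end) index pairs.
    Returns one count per loop, starting with the N-terminal loop.
    """
    bounds = [0] + [x for h in helices for x in h] + [len(seq)]
    loops = [seq[bounds[k]:bounds[k + 1]] for k in range(0, len(bounds), 2)]
    return [sum(loop.count(c) for c in "KR") for loop in loops]

def n_terminus_inside(seq: str, helices: List[Tuple[int, int]]) -> bool:
    """Guess the orientation: the side with more K/R is taken as 'inside'.

    Loops 0, 2, 4, ... lie on the same side as the N-terminus.
    Ties are resolved as 'inside' (an arbitrary choice for this sketch).
    """
    charges = loop_charges(seq, helices)
    return sum(charges[0::2]) >= sum(charges[1::2])

# Toy sequence: two 20-residue hydrophobic stretches
seq = "MKRK" + "L" * 20 + "DE" + "L" * 20 + "KK"
helices = [(4, 24), (26, 46)]
print(loop_charges(seq, helices), n_terminus_inside(seq, helices))
```

Real predictors weight loops by length and treat long loops separately; this sketch ignores both refinements.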

3 TM prediction: Kyte-Doolittle hydrophobicity scale (1982)
Assign a hydropathy value to each amino acid.
Use a sliding window to identify membrane regions: sum the hydrophobicity values over all w residues in a window of length w.
Use a threshold T to assign a segment as a predicted membrane helix.
w = 19 residues could best discriminate between membrane and globular proteins.
A threshold of T > 1.6 was suggested for the average over 19 residues.
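A minimal sketch of the sliding-window scan described on this slide, using the Kyte-Doolittle scale with w = 19 and a threshold of 1.6 on the window average. The function names are illustrative; merging adjacent above-threshold windows into helices is left out.

```python
# Kyte-Doolittle hydropathy values (Kyte & Doolittle, 1982)
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def hydropathy_profile(seq, w=19):
    """Average hydropathy of every window of length w (one value per window start)."""
    return [sum(KD[a] for a in seq[i:i + w]) / w
            for i in range(len(seq) - w + 1)]

def predicted_tm_windows(seq, w=19, threshold=1.6):
    """Start indices of windows whose average hydropathy exceeds the threshold."""
    return [i for i, v in enumerate(hydropathy_profile(seq, w)) if v > threshold]

# Toy example: a 19-residue poly-Leu stretch flanked by Asp
seq = "D" * 10 + "L" * 19 + "D" * 10
print(predicted_tm_windows(seq))
```

In the toy sequence, only windows that overlap the poly-Leu stretch by at least 14 residues pass the threshold.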

4 First prediction of TM topology: TopPred
TopPred (von Heijne 1992) predicts the complete topology of membrane proteins by using
- hydrophobicity analysis
- automatic generation of possible topologies
- ranking these topologies by the positive-inside rule.
TopPred uses a sliding trapezoid window to detect segments of outstanding hydrophobicity. The two bases of the trapezoid are 11 and 21 residues long.
TopPred chooses its thresholds so that the segments accepted as TM helices yield the optimal difference between the number of positively charged residues on the inside and on the outside.

5 MEMSAT uses dynamic programming
MEMSAT (Jones, 1994) combines statistical tables (log likelihoods) compiled from well-characterized TM proteins with a dynamic programming algorithm to recognize membrane topology models by expectation maximization.
Expectation maximization searches for the model that best explains the given data. Given a function that calculates the total probability for the match of a given model with a given sequence, the model resulting from expectation maximization should correspond to the maximum of this function.
MODEL DEFINITION
The first requirement for expectation maximization is the definition of a model. In the case of transmembrane prediction, such a model includes parameters for
- the number of membrane-spanning segments, n,
- the topology, t (N-terminus in or out),
- the length, l, and the location, i, in the sequence of each segment.

6 MEMSAT structural states
Residues are classified as being in one of 5 structural states:
- L_i: inside loop
- L_o: outside loop
- H_i: inside helix end
- H_m: helix middle
- H_o: outside helix end.
Helix end caps are arbitrarily defined to span 4 adjacent residues (one helical turn).
Idea:
(1) Compile propensities of amino acids for the 5 states.
(2) Calculate a score relating a given sequence to a predicted topology.
(3) Find the optimal score by dynamic programming.

7 MEMSAT propensities
For each of the 5 structural classes, log likelihood ratios for each of the 20 amino acids were calculated:
s_i = \ln(q_i / p_i)
where p_i is the relative frequency of occurrence (or fraction) of amino acid i in all the sequences in the data set, and q_i is the relative frequency of occurrence of amino acid i in a particular structural class.
A positive score indicates a higher than expected frequency for a given amino acid to be found in a particular structural class, and a negative score a lower than expected frequency.
To circumvent the problem of classifying globular domains as loops, loops longer than 100 residues are not classified as loops and are ignored in the calculation of the q_i values. These oversized loops are, however, included in the calculation of the overall relative frequencies of occurrence p_i.
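The log likelihood ratios can be sketched as follows, assuming the score is the logarithm of q_i / p_i as the definitions on this slide suggest. The pseudocount is an assumption added here to avoid taking the logarithm of zero; it is not mentioned on the slide.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def log_likelihood_scores(all_residues, class_residues, pseudocount=1.0):
    """Return {amino acid: ln(q_i / p_i)} for one structural class.

    all_residues:   every residue in the data set (background, gives p_i)
    class_residues: residues observed in the structural class (gives q_i)
    pseudocount:    assumed smoothing term, not part of the slide
    """
    total, cls = Counter(all_residues), Counter(class_residues)
    n_total = sum(total[a] for a in AMINO_ACIDS) + 20 * pseudocount
    n_cls = sum(cls[a] for a in AMINO_ACIDS) + 20 * pseudocount
    scores = {}
    for a in AMINO_ACIDS:
        p = (total[a] + pseudocount) / n_total
        q = (cls[a] + pseudocount) / n_cls
        scores[a] = math.log(q / p)
    return scores

# Toy data: Leu is over-represented in the class, Lys is under-represented
background = "L" * 500 + "K" * 500
helix_middle = "L" * 90 + "K" * 10
scores = log_likelihood_scores(background, helix_middle)
print(round(scores["L"], 2), round(scores["K"], 2))
```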

8 Memsat [figure]

9 Memsat [figure]

10 Memsat: Topology prediction
Using the propensities shown in Figures 2 and 3, it is possible to calculate a score relating to the compatibility of a given sequence with a given topology and secondary structure. We need to find the structural model with the best score.
For a sequence of length m and a given transmembrane topology (n, t), there are approximately 9^n ((m - 21n) / n)^n possible models.
E.g. for a 7-helix TM topology and a sequence of length 250, this gives ca. 7 × 10^14 different models. Clearly, a brute-force approach is inappropriate.
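The estimate for the number of models is easy to check numerically. The short sketch below evaluates 9^n ((m - 21n) / n)^n for the example on the slide; the function name is illustrative.

```python
def approx_model_count(m, n, helix_len=21):
    """Approximate number of models: 9^n * ((m - helix_len * n) / n)^n."""
    return 9 ** n * ((m - helix_len * n) / n) ** n

count = approx_model_count(m=250, n=7)
print(f"{count:.2e}")  # on the order of 7e14
```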

11 Memsat: Topology prediction
Despite the apparent complexity of the problem, the score for a particular residue depends solely on the identity of the residue and its structural environment (L_i, L_o, H_i, H_m, or H_o).
Therefore, as a result of this single dimensionality, it is straightforward to formulate a dynamic programming solution to the problem, which ensures that the globally optimal model is found every time.
The overall problem of determining the optimal position and length of n TM helices in a sequence of length m is divided into n subproblems: determine the optimal position and length of a single TM helix along with its associated C-terminal coil segment.

12 Memsat: Topology prediction
s_il: score associated with a TM helix of length l at position i in the given sequence.
This score is calculated according to the diagram shown in Figure 1, where the helix is divided into three sections (two caps of length 4 and a center region of length l - 8). Whether a cap and its associated loop are inside or outside depends on the initially specified membrane topology.
In order to find the best set of s_il values, MEMSAT uses a recursive algorithm almost identical to the algorithms used for pairwise sequence alignment (Needleman & Wunsch, 1970).

13 Insertion: Needleman-Wunsch algorithm (from Lecture 2, WS 2007/08, Softwarewerkzeuge der Bioinformatik)
Trace-back yields the best alignment in the matrix: start in the bottom right corner and follow the arrows to the top left corner. This gives the best alignment of these two sequences:
COELACANTH
-PELICAN--
[figure: Needleman-Wunsch score matrix for COELACANTH vs. PELICAN]

14 Memsat: dynamic programming approach
Define a matrix: number of TM segments × length of total sequence.
A third parameter is also needed: the length of every TM helix.
The pathway with the highest score will then contain the correct number of TM helices, each with the correct length, at the correct position.
Expectation maximization aspect: optimize the log likelihood scores for the residue propensities.

15 Memsat: Topology prediction
Define a score matrix S_ij (i: 1...n, j: 1...m) as:
[equation not preserved in the transcript]
where A is the minimum length of a loop segment.
In this example, i runs over 3 helices, and the position j runs up to 115 minus the length of a TM helix.
s_ilj: score of TM helix number i of length l at position j in the given sequence.
The second maximum considers the following helices.
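The structure of such a recursion (an outer maximum over the helix length, an inner maximum over the start of the following helix) can be sketched in a strongly simplified form. This is not the MEMSAT implementation: it assumes a single per-residue score h[j] (helix versus loop log odds), scores loops as zero, and ignores helix caps and topology. The names `best_helices`, `lmin`, `lmax` and `A` are illustrative.

```python
NEG = float("-inf")

def best_helices(h, n, lmin=17, lmax=25, A=2):
    """Place exactly n helices to maximise the summed per-residue scores h.

    S[i][j] = max over l of ( score of helix i at j with length l
                              + max over k >= j + l + A of S[i+1][k] )
    Returns (best score, list of half-open (start, end) helices).
    """
    m = len(h)
    prefix = [0.0]
    for x in h:
        prefix.append(prefix[-1] + x)

    S = [[NEG] * (m + 1) for _ in range(n)]
    choice = [[None] * (m + 1) for _ in range(n)]
    for i in range(n - 1, -1, -1):
        # Suffix maxima of the next row: the "second maximum" over later helices
        best_next = [NEG] * (m + 2)
        arg_next = [None] * (m + 2)
        if i < n - 1:
            for k in range(m, -1, -1):
                if S[i + 1][k] > best_next[k + 1]:
                    best_next[k], arg_next[k] = S[i + 1][k], k
                else:
                    best_next[k], arg_next[k] = best_next[k + 1], arg_next[k + 1]
        for j in range(m):
            for l in range(lmin, lmax + 1):
                end = j + l
                if end > m:
                    break
                score, nxt = prefix[end] - prefix[j], None
                if i < n - 1:
                    k = end + A  # earliest allowed start of the next helix
                    if k > m or best_next[k] == NEG:
                        continue
                    score, nxt = score + best_next[k], arg_next[k]
                if score > S[i][j]:
                    S[i][j], choice[i][j] = score, (l, nxt)

    start = max(range(m), key=lambda j: S[0][j], default=None)
    if start is None or S[0][start] == NEG:
        return NEG, []
    helices, j = [], start
    for i in range(n):  # trace back through the stored choices
        l, nxt = choice[i][j]
        helices.append((j, j + l))
        j = nxt
    return S[0][start], helices

# Toy profile: two favourable stretches of length 4
h = [-1] * 5 + [1] * 4 + [-1] * 6 + [1] * 4 + [-1] * 3
print(best_helices(h, n=2, lmin=3, lmax=6, A=2))
```

The suffix maxima avoid rescanning the next row for every cell, so the sketch runs in time proportional to n × m × (lmax - lmin + 1).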

16 Memsat: Topology prediction
Having computed the score matrix S, the highest value in the column for helix 1 (i = 1) is the score for the best path through the matrix, which represents the optimal lengths and positions of n TM helices in the given sequence.
The highest value in column 2 is the optimal path score for n - 1 helices, but with inverted topology, and this can be extended to the other columns.

17 Memsat: Topology prediction
In this way, only two score matrices need to be calculated to evaluate all possible membrane topologies for a given case: one with helix 1 (column 1) defined with the N-terminus on the inside, and the other with the helix 1 N-terminus on the outside.
If we calculated two matrices for n = 7, one matrix would therefore provide optimal paths for topologies +7, -6, +5, -4, +3, -2, and +1, and the other would provide paths for -7, +6, -5, +4, -3, +2, and -1 (where + indicates the N-terminus inside).
The appropriate score for the N-terminal loop must be added to the appropriate matrix values.

18 Memsat: Topology prediction [figure]

19 MEMSAT3 (Jones, 2007)
Replace the log likelihood propensities by the prediction of a neural network classifier similar to PSIPRED (also developed by Jones).
Use a feed-forward neural network comprising 399 inputs (19 × 21), 15 hidden units and 4 output units.
Input: a sequence window of 19 residue positions with 21 inputs per residue. This encoding is the same as that used by the PSIPRED secondary structure prediction method (Jones, 1999), though with a slightly longer window in this case.
A number of different output encodings were considered, but it was decided that a minimal encoding of just four outputs would be preferable for optimal neural network training. The four output targets are: cytoplasmic (O_in), non-cytoplasmic (O_out), transmembrane segment (O_tm) and signal peptide (O_sig).
Optimize the neural network weights on a training data set.
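To make the 19 × 21 = 399 input layout concrete, here is a sketch of a window encoder. It is an assumption-laden simplification: each residue is one-hot encoded over the 20 amino acids, and the 21st input flags window positions that fall beyond the sequence termini. The actual method feeds sequence-profile values rather than one-hot vectors, and the function name is hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def encode_window(seq, center, window=19):
    """Encode the window around `center` as a flat list of window * 21 numbers.

    Slots 0-19 of each position: one-hot amino acid (assumed simplification).
    Slot 20: set to 1.0 when the position lies outside the sequence.
    Unknown residue letters are left as all zeros.
    """
    half = window // 2
    vec = []
    for pos in range(center - half, center + half + 1):
        slot = [0.0] * 21
        if 0 <= pos < len(seq):
            idx = AMINO_ACIDS.find(seq[pos])
            if idx >= 0:
                slot[idx] = 1.0
        else:
            slot[20] = 1.0
        vec.extend(slot)
    return vec

vec = encode_window("ACDEFGHIKLMNPQRSTVWY", center=0)
print(len(vec))  # 399 inputs for a 19-residue window
```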

20 MEMSAT3 (Jones, 2007)
To calculate the most probable topology based on the neural network output, the MEMSAT dynamic programming algorithm (Jones et al., 1994) was used. In the original MEMSAT method, five different regions of a TM protein were defined.
At every position in the target sequence, the four neural network outputs were combined as follows to generate scores for the MEMSAT topology search:
[equations not preserved in the transcript]
where O_x represents the raw neural network output x.

21 MEMSAT3 (Jones, 2007)
MEMSAT3 easily predicts long TM helices, but has problems with short, half-spanning helices. Also, the recognition of signal peptides should be improved further.

22 Using evolutionary information
It is known from predicting secondary structures of globular proteins that using multiple sequence alignment information improves prediction accuracy significantly.
PHDtm: predicts the location and topology of TM helices by a system of neural networks. It was later combined with dynamic programming.

23 Using grammatical rules
The lipid bilayer constrains the structure of the membrane-passing regions of proteins in many ways.
TMHMM (Sonnhammer et al. 1998, Krogh et al. 2001) and HMMTOP (Tusnady & Simon 1998, 2001) implement Hidden Markov Models.

24 Using grammatical rules
TMHMM uses a cyclic model with 7 states for
- the TM helix core
- the TM helix caps on the N- and C-terminal side
- the non-membrane region on the cytoplasmic side
- 2 non-membrane regions on the non-cytoplasmic side (for short and long loops, to account for different membrane insertion mechanisms)
- a globular domain state in the middle of each non-membrane region

25 TMHMM: types of errors [figure]

26 Availability of prediction methods
Many of these servers are also available through the meta-server META-PP at the site of Burkhard Rost.

27 Most methods get number of helices right
All methods based on advanced algorithms tend to underestimate TM helices (%obs > %prd).
a Data set: sequence-unique subset of 36 high-resolution TM helical proteins from PDB. This is the largest subset of all 105 high-resolution membrane chains that fulfils the condition that no pair in the set has significant sequence similarity as defined in Rost (1999).
b Methods
c Per-segment accuracy:
- Q_ok: percentage of proteins for which all TM helices are predicted correctly (allowed deviation of up to 3 residues)
- Q_htm^%obs: percentage of all observed helices that are correctly predicted
- Q_htm^%prd: percentage of all predicted helices that are correctly predicted
- TOPO: percentage of proteins for which the topology (orientation of helices) is correctly predicted (empty for methods that do not predict topology).
d Per-residue accuracy:
- Q_2: percentage of correctly predicted residues in two states: membrane helix / non-membrane helix
- Q_2T^%obs: percentage of all observed TMH helix residues that are correctly predicted
- Q_2T^%prd: percentage of all predicted TMH helix residues that are correctly predicted
- Q_2N^%obs: percentage of all observed non-TMH helix residues that are correctly predicted
- Q_2N^%prd: percentage of all predicted non-TMH helix residues that are correctly predicted.
e ERROR: the estimates for per-segment accuracy resulted from a bootstrap experiment with M = 100 and K = 18; the estimates for per-residue accuracy were obtained by standard deviations over Gaussian distributions for the respective score.
f Numbers in italics: two standard deviations below the numerically highest value in each column (set in bold letters).
NOTE: all methods are tested on the same set of proteins. However, the numbers are NOT from a cross-validation experiment, i.e. some methods may have used some of the proteins for training. Generally, newer methods are more likely to be overestimated than older ones. In particular, HMMTOP2, TMHMM1, and WW have been developed using ALL the proteins listed here.
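The per-residue measures in the legend can be sketched as follows, assuming observed and predicted states are given as equal-length strings of 'T' (TM helix residue) and 'N' (non-TM residue). The dictionary keys and the handling of empty denominators (reported as 0.0) are choices made for this sketch.

```python
def per_residue_accuracy(observed, predicted):
    """Two-state per-residue accuracy in percent.

    Q2       : correctly predicted residues / all residues
    Q2T_obs  : correctly predicted TM residues / observed TM residues
    Q2T_prd  : correctly predicted TM residues / predicted TM residues
    Q2N_obs, Q2N_prd: the same for non-TM residues
    """
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have equal length")
    pairs = list(zip(observed, predicted))
    tp = sum(o == "T" and p == "T" for o, p in pairs)
    tn = sum(o == "N" and p == "N" for o, p in pairs)
    fp = sum(o == "N" and p == "T" for o, p in pairs)
    fn = sum(o == "T" and p == "N" for o, p in pairs)

    def pct(num, den):
        return 100.0 * num / den if den else 0.0

    return {
        "Q2": pct(tp + tn, len(pairs)),
        "Q2T_obs": pct(tp, tp + fn),
        "Q2T_prd": pct(tp, tp + fp),
        "Q2N_obs": pct(tn, tn + fp),
        "Q2N_prd": pct(tn, tn + fn),
    }

result = per_residue_accuracy("NNTTTTNN", "NNNTTTTT")
print(result)
```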

28 Future directions
- Meta servers yield improved predictions. > 90% correct topologies can be obtained by a simple majority vote between the results of various methods.
- TM helix prediction and signal peptide prediction should be combined.
- Useful: databases for particular families of TM proteins and sequence motifs, e.g. the GPCR database.
- Membrane-specific substitution matrices improve database searches, e.g. PHAT by Henikoff & Henikoff improved alignments of TM proteins.
- Account for helix-helix interactions.
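A simple majority vote between methods could look like the sketch below. The topology labels ("in:7" for N-terminus inside with 7 helices) and the method names are hypothetical placeholders, and returning None on a tie is a choice made for this sketch.

```python
from collections import Counter
from typing import Dict, Optional

def majority_topology(predictions: Dict[str, str]) -> Optional[str]:
    """Return the topology predicted by the most methods, or None on a tie.

    predictions maps a method name to its predicted topology label.
    """
    counts = Counter(predictions.values()).most_common()
    if not counts:
        return None
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None  # no unique winner
    return counts[0][0]

# Hypothetical outputs of three methods for one protein
votes = {"method_A": "in:7", "method_B": "in:7", "method_C": "out:6"}
print(majority_topology(votes))
```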

29 TopPred ΔG
Use ΔG values from the Hessa predictor and a variant of the TopPred method.
Briefly, a sliding window of fixed length (l = 21 residues) is scanned across the protein sequence, and ΔG_app values are calculated for each sequence position.
[equation not preserved in the transcript]
Here l is the length of the TM segment. ΔG_app^aa(i) is the contribution of amino acid aa in position i. The expression under the square root is the hydrophobic moment.
Bernsel, PNAS 105, 7177 (2008)
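The formula this slide refers to is not reproduced in the transcript. Judging from the symbols explained on the slide (the segment length l, the position-specific contributions, and a square-root term for the hydrophobic moment), it plausibly has the form published for the Hessa predictor, sketched below. The constants c_0 to c_3 and the 100° angle per residue are assumptions recalled from that work, not values taken from this transcript.

```latex
\Delta G_{app}^{pred}
  = \sum_{i=1}^{l} \Delta G_{app}^{aa(i)}
  + c_0 \sqrt{
      \left( \sum_{i=1}^{l} \Delta G_{app}^{aa(i)} \sin(100^{\circ} i) \right)^{2}
    + \left( \sum_{i=1}^{l} \Delta G_{app}^{aa(i)} \cos(100^{\circ} i) \right)^{2}
    }
  + c_1 + c_2\, l + c_3\, l^{2}
```

The first sum adds the per-residue contributions, the square-root term measures how unevenly they are distributed around the helix axis, and the remaining terms depend only on the segment length.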

30 TopPred ΔG
(1) All minima < ΔG_low are identified and marked as "certain" TM segments. All minima above ΔG_low but below a second, higher cutoff value (ΔG_high) are marked as "putative" TM segments.
(2) All possible topologies, including all certain TM segments and either including or excluding each of the putative TM segments, are generated, and the topology that best complies with the positive-inside rule is chosen as the final prediction.
The parameters ΔG_low and ΔG_high were optimized over a benchmark set of known transmembrane topologies.
Bernsel, PNAS 105, 7177 (2008)
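Step (2) can be sketched as an enumeration over all subsets of the putative segments. The scoring used here (absolute difference in Lys + Arg counts between the two sides) is a simplified stand-in for the positive-inside criterion, and the function name, tuple layout and tie-breaking are assumptions of this sketch. Segments are half-open (start, end) pairs and are assumed not to overlap.

```python
from itertools import product

def best_topology(seq, certain, putative):
    """Pick the helix set and orientation with the largest K/R bias.

    Returns (bias, helices, orientation), where orientation is 'in' if the
    N-terminal side carries at least as many K/R as the other side.
    On equal bias, the topology generated first (fewest putative
    segments) is kept.
    """
    best = None
    for mask in product([False, True], repeat=len(putative)):
        helices = sorted(certain + [p for p, keep in zip(putative, mask) if keep])
        bounds = [0] + [x for h in helices for x in h] + [len(seq)]
        loops = [seq[bounds[k]:bounds[k + 1]] for k in range(0, len(bounds), 2)]
        charges = [loop.count("K") + loop.count("R") for loop in loops]
        same_side = sum(charges[0::2])   # loops on the N-terminal side
        other_side = sum(charges[1::2])
        bias = abs(same_side - other_side)
        orientation = "in" if same_side >= other_side else "out"
        if best is None or bias > best[0]:
            best = (bias, helices, orientation)
    return best

# Toy case: two certain segments and one putative segment
seq = "KKRK" + "L" * 21 + "DD" + "L" * 21 + "KRKR" + "A" * 21 + "KRKR"
certain = [(4, 25), (27, 48)]
putative = [(52, 73)]
print(best_topology(seq, certain, putative))
```

In the toy case, including the putative segment would move four positive charges to the opposite side, so the topology without it is preferred.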

31 TopPred ΔG [figure]
Bernsel, PNAS 105, 7177 (2008)

32 TopPred ΔG
Multi-sequence results are obtained with input from multiple sequence alignments.
TopPred ΔG works as well as the best statistical methods that include hundreds of optimized parameters.
Bernsel, PNAS 105, 7177 (2008)

33 TopPred ΔG
Generally, the "missed helices" have both higher ΔG_app values (4.4 kcal/mol vs. 0.76 kcal/mol) and a higher fraction of surface area in contact with the surrounding protein (67% buried vs. 54% buried surface area) than found for the complete dataset.
There is a strong tendency for highly exposed helices to have lower ΔG_app values (Fig. 2A), indicating that such helices need to be able to insert efficiently by themselves, in the absence of stabilizing interactions with surrounding protein.
Bernsel, PNAS 105, 7177 (2008)

34 TopPred ΔG
Fig. 2A shows that a good part of the surface of the high-ΔG_app helices is buried already within the same polypeptide chain.
However, the mean ΔG_app for the most exposed group of helices (0-20% buried) is considerably higher when considering area buried against the chain than against the whole protein complex, indicating that a number of helices with relatively high ΔG_app are efficiently buried (20%) only upon oligomerization.
Bernsel, PNAS 105, 7177 (2008)

35 TopPred ΔG
Along the same line of thought, there should be more opportunities for helix-helix interactions in proteins containing many TM helices, and such helices might thus be expected to be more polar on average. Indeed, the mean ΔG_app increases with the number of TM helices in the protein (Fig. 2B).
Among the overpredicted helices, more than half are reentrant regions, i.e., they partly penetrate the membrane but enter and exit from the same side.
Bernsel, PNAS 105, 7177 (2008)

36 Summary
- TM helices are typically continuous stretches of mostly hydrophobic residues.
- Simple methods that sum up hydrophobicities work acceptably, but not really well.
- Advanced methods include additional features such as the "positive-inside rule".
- The currently most successful methods are based on Hidden Markov Models or neural networks.
- Prediction accuracy should be evaluated using carefully separated training and test sets.
- It is possible to discriminate signal peptides and TM helices, e.g. with Octopus.
- The new method TopPred ΔG utilizes experimental insertion free energies.

