1 Lesson 5 Protein Prediction and Classification
2 Learning about a protein: What does a protein do? Post-translational modifications (phosphorylation, glycosylation, etc.); identifying patterns and motifs; secondary structure; tertiary/quaternary structure; protein-protein interactions
3 Domains & Motifs
4 Domains: An analysis of known 3-D protein structures reveals that, rather than being monolithic, many of them contain multiple folding units. Each such folding unit is a domain (more than 50 aa and fewer than 500 aa).
5 calcium/calmodulin-dependent protein kinase. SH2 domains interact with phosphorylated tyrosines, and are thus part of intracellular signal-transducing proteins. They are characterized by specific sequences and tertiary structure.
6 What is a motif? A sequence motif is a certain sequence that is widespread and conjectured to have biological significance. Examples: KDEL, an ER-lumen retention signal; PKKKRKV, an NLS (nuclear localization signal).
7 More loosely defined motifs: KDEL (usually) + HDEL (rarely) = [HK]-D-E-L, i.e. H or K at the first position. This is called a pattern (in biology) or a regular expression (in computer science).
8 Syntax of a pattern Example: W-x(9,11)-[FYV]-[FYW]-x(6,7)-[GSTNE].
9 Patterns: W-x(9,11)-[FYV]-[FYW]-x(6,7)-[GSTNE]. Here x(9,11) means any amino acid, between 9 and 11 times, and [FYV] means F or Y or V. Example sequences: WOPLASDFGYVWPPPLAWS, ROPLASDFGYVWPPPLAWS, WOPLASDFGYVWPPPLSQQQ
10 Patterns - syntax
  Residues are written in the standard IUPAC one-letter codes.
  'x' : any amino acid.
  '[]' : residues allowed at the position.
  '{}' : residues forbidden at the position.
  '()' : repetition of a pattern element is indicated in parentheses: x(n) or x(n,m) gives the number or range of repetitions.
  '-' : separates each pattern element.
  '<' : indicates an N-terminal restriction of the pattern.
  '>' : indicates a C-terminal restriction of the pattern.
  '.' : the period ends the pattern.
11 Pattern ~ motif ~ signature: A pattern (similarly to a consensus and a profile) is a way to represent a conserved sequence. Whereas a profile and a consensus usually relate to the entire sequence, a pattern usually relates to a few tens of amino acids.
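The syntax rules above map almost one-to-one onto regular expressions. The following is a minimal sketch of that mapping, assuming only the elements listed on this slide; the function name `prosite_to_regex` is hypothetical, and the converter is illustrative and does not aim for full PROSITE compatibility.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regular expression."""
    pattern = pattern.strip().rstrip(".")          # the period ends the pattern
    parts = []
    for element in pattern.split("-"):             # '-' separates elements
        prefix = suffix = ""
        if element[:1] in ("<", "‹"):              # N-terminal restriction
            prefix, element = "^", element[1:]
        if element[-1:] in (">", "›"):             # C-terminal restriction
            suffix, element = "$", element[:-1]
        # Split an element into its core and an optional (n) or (n,m) repeat.
        m = re.fullmatch(r"([^()]+)(?:\((\d+)(?:,(\d+))?\))?", element)
        if m is None:
            raise ValueError(f"cannot parse pattern element: {element!r}")
        core, low, high = m.groups()
        if core == "x":                            # any amino acid
            core = "."
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"         # forbidden residues
        # '[...]' and single letters are already valid regex syntax.
        if low is not None:
            core += "{" + low + ("," + high if high else "") + "}"
        parts.append(prefix + core + suffix)
    return "".join(parts)


def matches(pattern: str, sequence: str) -> bool:
    """True if the pattern occurs anywhere in the sequence."""
    return re.search(prosite_to_regex(pattern), sequence) is not None


example = "W-x(9,11)-[FYV]-[FYW]-x(6,7)-[GSTNE]."
print(prosite_to_regex(example))                   # W.{9,11}[FYV][FYW].{6,7}[GSTNE]
print(matches(example, "WOPLASDFGYVWPPPLAWS"))     # True
print(matches(example, "ROPLASDFGYVWPPPLAWS"))     # False (no leading W)
print(matches(example, "WOPLASDFGYVWPPPLSQQQ"))    # False (Q is not in [GSTNE])
```

Of the three example sequences from the Patterns slide, only the first satisfies every element of the pattern.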
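12 Profile-pattern-consensus
  Multiple alignment: GTTCAA, GCTGAA, CTTCAC
  Consensus: GTTCAA (or NNTNAN, with N at the variable positions)
  Pattern: [GC]-[TC]-T-[GC]-A-[AC]
  Profile: a table giving, for each column, the frequency of A, C, G and T
  Information: consensus < pattern < profile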
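The three representations can all be derived from the same alignment columns. A small sketch, assuming a gap-free DNA alignment of equal-length sequences; the function names are hypothetical, and ties in the consensus are broken arbitrarily by first occurrence.

```python
from collections import Counter


def column_counts(alignment):
    """One Counter of residues per alignment column."""
    return [Counter(column) for column in zip(*alignment)]


def consensus(alignment):
    """Most frequent residue in each column."""
    return "".join(c.most_common(1)[0][0] for c in column_counts(alignment))


def pattern(alignment):
    """All residues seen in each column, written in pattern syntax."""
    parts = []
    for counts in column_counts(alignment):
        residues = "".join(sorted(counts))         # alphabetical inside brackets
        parts.append(residues if len(residues) == 1 else f"[{residues}]")
    return "-".join(parts)


def profile(alignment, alphabet="ACGT"):
    """Per-column frequency of every letter in the alphabet."""
    n = len(alignment)
    return {r: [c[r] / n for c in column_counts(alignment)] for r in alphabet}


alignment = ["GTTCAA", "GCTGAA", "CTTCAC"]
print(consensus(alignment))                        # GTTCAA
print(pattern(alignment))                          # [CG]-[CT]-T-[CG]-A-[AC]
for residue, freqs in profile(alignment).items():
    print(residue, [round(f, 2) for f in freqs])
```

The consensus keeps one letter per column, the pattern keeps the set of letters, and the profile also keeps how often each letter occurs, which is why the information content grows in that order.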
13 InterPro: a collection of many protein signature databases (PROSITE, Pfam, PRINTS …) integrated into a hierarchical classification system
14 InterPro example
15 PTM – Post-Translational Modification
16 PTM – Post-Translational Modification: Phosphorylation (Tyr, Ser, Thr); glycosylation, i.e. addition of sugars (Asn, Ser, Thr); addition of fatty acids (e.g. N-myristoylation, S-palmitoylation)
17 So how to predict? Take into account: 1. Context (motif): PKC (a kinase) recognizes X-[S/T]-X-[R/K]; N-myristoylation occurs at M-G-X-X-X-[S/T]. Often we don't know the exact motif! 2. Conservation: is the motif found (for instance, in human) also conserved in related organisms (for instance, in chimp)?
18 Prediction problems: the signal for detection is very short; there is not enough biological knowledge for characterizing the signal; tertiary structure
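Both criteria can be sketched in a few lines. The example below scans for the PKC context X-[S/T]-X-[R/K] and then keeps only the sites that reappear at the same position in a related sequence. It assumes the two sequences are already aligned, gap-free and of equal length; the function names and both toy sequences are invented for illustration.

```python
import re

# Lookahead so that overlapping occurrences of X-[ST]-X-[RK] are all reported.
PKC_CONTEXT = re.compile(r"(?=(.[ST].[RK]))")


def pkc_candidate_sites(seq):
    """0-based positions of S/T residues that sit in an X-[ST]-X-[RK] context."""
    return [m.start() + 1 for m in PKC_CONTEXT.finditer(seq)]


def conserved_sites(seq, ortholog):
    """Candidate sites of seq that are also candidates at the same index
    in the ortholog (assumes pre-aligned, equal-length sequences)."""
    in_ortholog = set(pkc_candidate_sites(ortholog))
    return [i for i in pkc_candidate_sites(seq) if i in in_ortholog]


human = "AASARGGTGKAA"   # toy sequence, not a real protein
chimp = "AASARGGTGQAA"   # same, with K -> Q at position 9

print(pkc_candidate_sites(human))      # [2, 7]
print(conserved_sites(human, chimp))   # [2]
```

A motif this short matches by chance very often, so the conservation filter is one simple way to discard some of the spurious hits.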
19 Prediction will be more efficient if more information is available
20 Secondary Structure
21 Secondary Structure Reminder: secondary structure is usually divided into three categories: alpha helix; beta strand (sheet); anything else (turn/loop)
22 Secondary Structure An easier question – what is the secondary structure when the 3D structure is known?
23 DSSP (Define Secondary Structure of Proteins, from the "Dictionary of Protein Secondary Structure") assigns secondary structure to proteins whose 3D structure is known
  H = alpha helix
  B = beta bridge (isolated residue)
  E = extended beta strand
  G = 3-turn helix
  I = 5-turn helix
  T = hydrogen bonded turn
  S = bend
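To compare DSSP output with the three categories used in prediction, the codes are usually collapsed into helix, strand and other. The sketch below uses one common reduction (H, G, I to helix; E, B to strand; everything else to coil); this particular mapping is an assumption, and other reductions are in use.

```python
# One common 8-state to 3-state reduction (an assumption, not the only choice).
DSSP_TO_3STATE = {
    "H": "H", "G": "H", "I": "H",   # helices
    "E": "E", "B": "E",             # strands and bridges
    "T": "C", "S": "C",             # turns and bends
}


def reduce_dssp(assignment):
    """Collapse a DSSP string to H (helix), E (strand) and C (anything else)."""
    return "".join(DSSP_TO_3STATE.get(code, "C") for code in assignment)


print(reduce_dssp("HHGGIEEBTS -"))   # HHHHHEEECCCC
```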
24 Predicting secondary structure from primary sequence
25 Chou and Fasman (1974): the propensity of an amino acid to be part of a certain secondary structure (e.g. proline has a low propensity for both alpha helix and beta sheet, so it acts as a breaker). Table columns: Name, P(a), P(b), P(turn). Rows: Alanine, Arginine, Aspartic Acid, Asparagine, Cysteine, Glutamic Acid, Glutamine, Glycine, Histidine, Isoleucine, Leucine, Lysine, Methionine, Phenylalanine, Proline, Serine, Threonine, Tryptophan, Tyrosine, Valine
26 Chou-Fasman prediction: 1. Look for a series of more than 4 amino acids which all have (for instance) alpha helix values > 100. 2. Extend ( … ). 3. Accept as alpha helix if the average alpha score > average beta score. Example sequence scored for α and β: Ala Pro Tyr Phe Phe Lys Lys His Val Ala Thr
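The nucleation and acceptance steps can be sketched as follows. The extension step is left out, and the propensity scores in the example are illustrative placeholders for a six-letter toy alphabet, not the published Chou-Fasman table; the function names are hypothetical too.

```python
def helix_nuclei(seq, p_alpha, window=5, threshold=100):
    """Half-open (start, end) spans in which every residue of at least one
    window scores above the threshold; overlapping windows are merged."""
    spans = []
    for i in range(len(seq) - window + 1):
        if all(p_alpha[r] > threshold for r in seq[i:i + window]):
            if spans and i <= spans[-1][1]:
                spans[-1] = (spans[-1][0], i + window)   # merge with previous
            else:
                spans.append((i, i + window))
    return spans


def accept_helix(seq, span, p_alpha, p_beta):
    """Accept a span as helix if its mean alpha score beats its mean beta score."""
    start, end = span
    region = seq[start:end]
    mean_alpha = sum(p_alpha[r] for r in region) / len(region)
    mean_beta = sum(p_beta[r] for r in region) / len(region)
    return mean_alpha > mean_beta


# Placeholder scores for illustration only (NOT the published values).
P_ALPHA = {"A": 140, "E": 150, "L": 120, "G": 60, "P": 60, "V": 105}
P_BETA = {"A": 80, "E": 40, "L": 130, "G": 75, "P": 55, "V": 170}

seq = "GPAELAEAGP"
nuclei = helix_nuclei(seq, P_ALPHA)
print(nuclei)                                            # [(2, 8)]
print([accept_helix(seq, s, P_ALPHA, P_BETA) for s in nuclei])   # [True]
```

A window of 5 corresponds to "more than 4" residues; both the window size and the threshold are parameters so that other settings can be tried.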
27 Chou and Fasman (1974) Success rate of 50%
28 Improvements in the 1980s: conservation in an MSA; smarter algorithms (e.g. HMMs, neural networks)
29 Accuracy: prediction accuracy seems to hit a ceiling of 70-80%
  Accuracy    Method
  50%         Chou & Fasman
  69%         Adding the MSA
  70-80%      MSA + sophisticated computations
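Accuracy figures like these are typically per-residue: the fraction of positions whose predicted state equals the observed one (often called Q3 when three states are used). A minimal sketch, assuming both assignments are strings of equal length over the same state alphabet; the function name is hypothetical.

```python
def per_residue_accuracy(predicted, observed):
    """Fraction of positions where the predicted state matches the observed one."""
    if len(predicted) != len(observed):
        raise ValueError("assignments must have the same length")
    if not observed:
        raise ValueError("assignments must not be empty")
    correct = sum(p == o for p, o in zip(predicted, observed))
    return correct / len(observed)


print(per_residue_accuracy("HHHECC", "HHEECC"))   # 5 of 6 positions agree
```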
30 Gene Ontology
31 GO: Gene Ontology, a project for consistent description of gene products in different databases. Consistent description means common key definitions. Example: 'protein synthesis' or 'translation'
32 GO describes proteins in terms of: biological process; cellular component; molecular function. GO is not: a sequence database; a portal for sequence information
33 GO structure: terms form a hierarchy, e.g. (from most general to most specific) cellular component, cell, nucleus, nuclear chromosome
34 GO example: links from the Swiss-Prot entry of human protein kinase C alpha
35 Examples for use of GO: enrichment for a GO category. 1. Do all up-regulated genes in a microarray you built belong to the same GO "molecular function" category? 2. You have predicted a new transcription factor binding site. Do all genes with this site belong to the same GO "biological process" category?
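A hierarchy like this can be stored as a mapping from each term to its parents, which makes it easy to list every more general term that a gene annotation implies. The sketch below uses only the four terms from this slide; the data structure and function name are hypothetical. In the real ontology a term may have several parents, which the list-valued mapping allows for.

```python
# Toy fragment of the hierarchy: term -> list of parent terms.
PARENTS = {
    "nuclear chromosome": ["nucleus"],
    "nucleus": ["cell"],
    "cell": ["cellular component"],
    "cellular component": [],
}


def ancestors(term, parents):
    """All more general terms reachable from `term`, nearest first."""
    found = []
    stack = list(parents.get(term, []))
    while stack:
        current = stack.pop()
        if current not in found:
            found.append(current)
            stack.extend(parents.get(current, []))
    return found


print(ancestors("nuclear chromosome", PARENTS))
# ['nucleus', 'cell', 'cellular component']
```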
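In practice genes rarely all fall into one category, so enrichment is usually phrased as a statistical question: is the category more common in the gene list than expected by chance? One standard way to answer it is a hypergeometric test, sketched below with the standard library only. The choice of test, the function name and the numbers in the example are assumptions made for illustration.

```python
from math import comb


def enrichment_p_value(population, annotated, drawn, observed):
    """P(X >= observed) for a hypergeometric X.

    population: number of genes on the array
    annotated:  genes in the population carrying the GO category
    drawn:      size of the gene list (e.g. up-regulated genes)
    observed:   genes in the list carrying the GO category
    """
    total = comb(population, drawn)
    upper = min(annotated, drawn)
    favourable = sum(
        comb(annotated, i) * comb(population - annotated, drawn - i)
        for i in range(observed, upper + 1)
    )
    return favourable / total


# Hypothetical numbers: 10 genes, 5 annotated, 5 up-regulated, all 5 annotated.
print(enrichment_p_value(10, 5, 5, 5))   # 1/252, about 0.004
```

When many GO categories are tested at once, the resulting p-values would also need a multiple-testing correction.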
36 Evaluation of prediction methods
37 Evaluation of prediction methods: comparing our results to experimentally verified sites
                                Our prediction gives:
  Is the prediction correct?    Positive (hit)                  Negative
  True                          True-positive                   True-negative
  False                         False-positive (false alarm)    False-negative (miss)
38 Method evaluation
                                Our prediction gives:
  Is the prediction correct?    Positive (hit)                  Negative
  True                          True-positive                   True-negative
  False                         False-positive (false alarm)    False-negative (miss)
  A good method will be one with a high level of true-positives and true-negatives, and a low level of false-positives and false-negatives
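The four cells can be counted directly once the predicted sites, the experimentally verified sites and the full set of candidate sites are known. A minimal sketch using sets; the function name and the example site numbers are hypothetical.

```python
def confusion_counts(predicted, verified, all_sites):
    """Count TP, FP, FN and TN for a set of predicted sites.

    predicted: sites the method reports as positive
    verified:  sites that are experimentally confirmed
    all_sites: every site that was evaluated
    """
    predicted, verified, all_sites = set(predicted), set(verified), set(all_sites)
    return {
        "TP": len(predicted & verified),              # predicted and real
        "FP": len(predicted - verified),              # false alarm
        "FN": len(verified - predicted),              # miss
        "TN": len(all_sites - predicted - verified),  # correctly rejected
    }


counts = confusion_counts(
    predicted={1, 2, 3}, verified={2, 3, 4, 5}, all_sites=range(10)
)
print(counts)   # {'TP': 2, 'FP': 1, 'FN': 2, 'TN': 5}
```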
39 Calibrating the method All methods have a parameter (or a score) that can be calibrated to improve the accuracy of the method. For example: the E-value cutoff in BLAST
40 Calibrating E-value cutoff Reminder: the lower the E-value, the more 'significant' the alignment between the query and the hit.
41 Calibrating the E-value: What will happen if we raise the E-value cutoff (for instance, work with all hits with an E-value < 10)? More hits are reported, so true-positives and false-positives can only increase, while true-negatives and false-negatives can only decrease.
42 Calibrating the E-value: On the other hand, if we lower the E-value cutoff (look only at hits with an E-value below a stricter threshold), fewer hits are reported: true-positives and false-positives can only decrease, while true-negatives and false-negatives can only increase.
43 Improving prediction Trade-off between specificity and sensitivity
44 Sensitivity vs. specificity
  Sensitivity = True positives / (True positives + False negatives)
    The denominator represents all the proteins which are really phosphorylated. Sensitivity measures how well we hit real phosphorylations.
  Specificity = True negatives / (True negatives + False positives)
    The denominator represents all the proteins which are really NOT phosphorylated. Specificity measures how well we avoid calling real non-phosphorylations.
45 Raising the E-value cutoff to 10: sensitivity increases, specificity decreases. Lowering the E-value cutoff: sensitivity decreases, specificity increases.
46 Over-predictions: example. Many PTM predictors tend to over-predict: a high level of false positives, hence low specificity. Why? 1. Tertiary structure! (buried/exposed, tertiary motifs) 2. The phosphorylation recognition mechanism is not completely clear!
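Both measures follow directly from the four counts of the evaluation table. A minimal sketch; the function names are hypothetical, and the counts in the example are made up.

```python
def sensitivity(tp, fn):
    """Fraction of real positives that the method detects: TP / (TP + FN)."""
    if tp + fn == 0:
        raise ValueError("no real positives to evaluate")
    return tp / (tp + fn)


def specificity(tn, fp):
    """Fraction of real negatives that the method rejects: TN / (TN + FP)."""
    if tn + fp == 0:
        raise ValueError("no real negatives to evaluate")
    return tn / (tn + fp)


# Hypothetical counts: 2 hits, 2 misses, 5 correct rejections, 1 false alarm.
print(sensitivity(tp=2, fn=2))   # 0.5
print(specificity(tn=5, fp=1))   # 5/6, about 0.83
```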
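The trade-off is easy to see on a toy list of hits. In the sketch below each hit carries an E-value and a flag saying whether it is a real match; all values are invented, and the strict cutoff of 1e-5 is a hypothetical choice made only for the example.

```python
def rates_at_cutoff(hits, cutoff):
    """Return (sensitivity, specificity) when hits with E-value < cutoff are reported.

    hits: list of (e_value, is_real) pairs.
    """
    tp = sum(1 for e, real in hits if e < cutoff and real)
    fp = sum(1 for e, real in hits if e < cutoff and not real)
    fn = sum(1 for e, real in hits if e >= cutoff and real)
    tn = sum(1 for e, real in hits if e >= cutoff and not real)
    return tp / (tp + fn), tn / (tn + fp)


# Invented data: 4 real matches and 4 spurious ones.
hits = [
    (1e-30, True), (1e-10, True), (1e-3, True), (2.0, True),
    (0.5, False), (5.0, False), (8.0, False), (50.0, False),
]

print(rates_at_cutoff(hits, 10))     # (1.0, 0.25): permissive cutoff
print(rates_at_cutoff(hits, 1e-5))   # (0.5, 1.0): strict cutoff
```

With the permissive cutoff every real match is found at the cost of three false alarms; with the strict cutoff there are no false alarms but half of the real matches are missed.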
47 Next time on: Biological Sequences Analysis
48 The Human Genome
49 Horizontal (Lateral) Gene Transfer
50 Alternative splicing
51 Repetitive Elements