1
Conditional Random Fields for the Prediction of Signal Peptide Cleavage Sites
M.W. Mak, The Hong Kong Polytechnic University
S.Y. Kung, Princeton University
ICASSP’09
2
Contents
1. Introduction
   Proteins and Their Subcellular Locations
   Importance of Protein Cleavage-Site Prediction
   Information in Amino Acid Sequences
   Existing Approaches to Cleavage Site Prediction
2. Conditional Random Field (CRF)
   CRF for Cleavage Site Prediction
3. Experiments and Results
   Effectiveness of Different Feature Functions
   Effect of Varying Window Size
   Fusion with SignalP
3
Proteins and Their Destination
A protein consists of a sequence of amino acids.
Newly synthesized proteins must cross intracellular membranes to reach their destinations.
Source: http://redpoll.pharmacy.ualberta.ca
4
Signal Peptide
A short segment of 20 to 100 amino acids, known as the signal peptide, contains information about the destination (address) of the protein.
The signal peptide is cleaved off when the protein passes across the membrane, leaving the mature protein.
Figure labels: signal peptide, cleavage site, mature protein.
Sources: S. R. Goodman, Medical Cell Biology, Elsevier, 2008; http://nobelprize.org
5
Importance of Cleavage Site Prediction
Defects in the protein sorting process can cause serious diseases, e.g., kidney stones.
Source: http://nobelprize.org/nobel_prizes/medicine/laureates/1999/illpres/diseases.html
6
Importance of Cleavage Site Prediction
Many proteins (e.g., insulin) are produced in living cells.
To make the cells secrete these proteins, the proteins are provided with a signal peptide.
Figure label: bioreactor.
Source: http://nobelprize.org/nobel_prizes/medicine/laureates/1999/illpres/diseases.html
7
Information in Sequences
Signal peptides contain some regular patterns.
Although these patterns vary substantially, machine learning tools can detect them.
Figure labels: cleavage site; region rich in hydrophobic amino acids (AA).
8
Existing Methods
Weight matrices (PrediSi)
Neural networks (SignalP 1.1)
HMMs (SignalP 3.0)
9
Weight Matrices
Example sequence: M A R S S L F T F L C L A V F I N G C L S Q I E Q Q
The weight matrix covers 20 amino acids (AA) and 15 positions around position t (..., t-1, t, t+1, ...).
Score at position t = 16+0+8+6+78+7+7+13+10+6+8+6+0+6+7 = 178
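The scoring step on this slide can be sketched in Python as below. This is only an illustration under stated assumptions: the function names and the tiny 3-position matrix are hypothetical and are not PrediSi's actual matrix or code. The only values taken from the slide are the 15 per-position contributions, which sum to 178.

```python
from typing import Dict, List

def score_window(window: str, weights: List[Dict[str, float]]) -> float:
    """Sum the weight of each residue at its position in the window."""
    if len(window) != len(weights):
        raise ValueError("window length must equal the number of matrix positions")
    # Residues missing from a column contribute 0 (an assumption of this sketch).
    return sum(column.get(aa, 0.0) for aa, column in zip(window, weights))

def best_window_start(sequence: str, weights: List[Dict[str, float]]) -> int:
    """Return the start index of the highest-scoring window in the sequence."""
    width = len(weights)
    starts = range(len(sequence) - width + 1)
    return max(starts, key=lambda s: score_window(sequence[s:s + width], weights))

# Hypothetical toy matrix with 3 positions (the slide's matrix has 15).
toy_weights = [{"A": 2.0, "V": 1.0}, {"V": 3.0}, {"A": 5.0, "S": 1.0}]

# The 15 per-position contributions shown on the slide.
slide_terms = [16, 0, 8, 6, 78, 7, 7, 13, 10, 6, 8, 6, 0, 6, 7]
slide_score = sum(slide_terms)
```

Sliding the window along a sequence and taking the best-scoring start is one simple way to turn per-position scores into a predicted site; the slide does not say how the final site is chosen.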
10
SignalP-HMM
Figure labels: signal peptide, mature protein.
Source: Nielsen and Krogh
11
Contents
1. Introduction
   Proteins and Their Subcellular Locations
   Importance of Protein Cleavage-Site Prediction
   Information in Amino Acid Sequences
   Existing Approaches to Cleavage Site Prediction
2. Conditional Random Field (CRF)
   CRF for Cleavage Site Prediction
3. Experiments and Results
   Effectiveness of Amino Acid Properties
   Effectiveness of Different Feature Functions
   Fusion with SignalP
12
Conditional Random Fields
Conditional Random Fields (CRFs) were originally designed for sequence labeling tasks such as Part-of-Speech (POS) tagging.
Given a sequence of observations (e.g., words), a CRF attempts to find the most likely label sequence, i.e., it assigns a label to each observation.
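Finding the most likely label sequence in a linear-chain model is commonly done with Viterbi dynamic programming. The sketch below assumes the per-position label scores and the label-to-label transition scores have already been computed; the function name and the toy numbers are hypothetical and do not come from the talk.

```python
from typing import List, Sequence

def viterbi(state_scores: Sequence[Sequence[float]],
            trans_scores: Sequence[Sequence[float]]) -> List[int]:
    """Return the highest-scoring label path.

    state_scores[t][j]: score of label j at position t.
    trans_scores[i][j]: score of moving from label i to label j.
    """
    if not state_scores:
        return []
    n_labels = len(trans_scores)
    best = list(state_scores[0])   # best path score ending in each label
    back = []                      # back-pointers, one list per position > 0
    for t in range(1, len(state_scores)):
        new_best, pointers = [], []
        for j in range(n_labels):
            # Best previous label for reaching label j at position t.
            i_star = max(range(n_labels), key=lambda i: best[i] + trans_scores[i][j])
            pointers.append(i_star)
            new_best.append(best[i_star] + trans_scores[i_star][j] + state_scores[t][j])
        best = new_best
        back.append(pointers)
    # Trace back from the best final label.
    path = [max(range(n_labels), key=lambda j: best[j])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path

# Toy example: label 0 = signal peptide, label 1 = mature protein (hypothetical scores).
toy_state = [[2.0, 0.0], [0.0, 1.0], [0.0, 3.0]]
toy_trans = [[0.0, -1.0], [-5.0, 0.0]]
toy_path = viterbi(toy_state, toy_trans)
```

With these toy scores the path switches from label 0 to label 1 after the first position; in a cleavage-site setting such a switch point would mark the predicted site.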
13
HMM vs. CRF
Hidden Markov Models: learn the joint distribution p(x, y) of the observations x1, x2, ..., xT and the labels y1, y2, ..., yT.
Conditional Random Fields: learn the conditional distribution p(y | x) of the labels given the observations, which is more direct.
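The formulas on this slide were not preserved in the transcript. As a hedged reminder of the textbook forms (not necessarily the notation used in the talk), a first-order HMM factorizes the joint distribution and reaches the label posterior only indirectly, whereas a CRF parameterizes the posterior itself:

```latex
% Textbook first-order HMM: joint distribution over observations x and labels y
% (p(y_1 | y_0) denotes the initial label distribution).
p(\mathbf{x}, \mathbf{y}) = \prod_{t=1}^{T} p(y_t \mid y_{t-1})\, p(x_t \mid y_t)

% Decoding with an HMM goes through the joint distribution:
\hat{\mathbf{y}} = \arg\max_{\mathbf{y}} p(\mathbf{y} \mid \mathbf{x})
                 = \arg\max_{\mathbf{y}} p(\mathbf{x}, \mathbf{y})

% A CRF instead models p(\mathbf{y} \mid \mathbf{x}) directly.
```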
14
Advantages of CRF
Avoids computing the likelihood p(observation|label); the posterior p(label|observation) is computed directly.
Can model long-range dependencies without making the inference problem intractable.
Guarantees a globally optimal solution.
Example sequence: M A R S S L F T F L C L A V F I N G C L S Q I E Q Q (figure annotation: "Depends on")
15
CRF for Cleavage Site Prediction
[Equation not recovered from the slide. Its annotated parts: cleavage site, transition features, state features, weights, length of sequence, n-grams of amino acids.]
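The slide's equation did not survive extraction. For orientation only, the standard linear-chain CRF can be written as follows; the symbols here are the usual textbook ones and are an assumption, so they may differ from the authors' notation.

```latex
% Standard linear-chain CRF (textbook form, assumed notation):
%   f_i : transition features, with weights \lambda_i
%   g_j : state features, with weights \mu_j
%   T   : length of the sequence
%   x   : observations (here, n-grams of amino acids)
%   y   : label sequence (the cleavage site is read off the labels)
p(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})}
  \exp\!\left( \sum_{t=1}^{T} \left[
      \sum_{i} \lambda_i\, f_i(y_{t-1}, y_t, \mathbf{x}, t)
    + \sum_{j} \mu_j\, g_j(y_t, \mathbf{x}, t)
  \right] \right)

% Normalizer: sum over all possible label sequences y'.
Z(\mathbf{x}) = \sum_{\mathbf{y}'}
  \exp\!\left( \sum_{t=1}^{T} \left[
      \sum_{i} \lambda_i\, f_i(y'_{t-1}, y'_t, \mathbf{x}, t)
    + \sum_{j} \mu_j\, g_j(y'_t, \mathbf{x}, t)
  \right] \right)
```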
16
CRF for Cleavage Site Prediction
Example: bi-grams, with query sequence = T Q T W A G S H S...
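As a small illustration of what bi-grams of the query sequence look like, the sketch below enumerates n-grams of amino acids. The helper name is hypothetical, and only the visible prefix of the slide's truncated sequence is used; how the talk maps n-grams to feature functions is not shown in the transcript.

```python
from typing import List

def amino_acid_ngrams(sequence: str, n: int) -> List[str]:
    """Return every run of n consecutive residues, in order of position."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return [sequence[i:i + n] for i in range(len(sequence) - n + 1)]

# Visible prefix of the query sequence on the slide (the rest is elided there).
query_prefix = "TQTWAGSHS"
bigrams = amino_acid_ngrams(query_prefix, 2)
```

Each n-gram, paired with its position relative to the current residue, could then serve as one observation feature for the CRF.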
17
CRF for Cleavage Site Prediction
[Figure not recovered; axis label: Position.]
18
Contents
1. Introduction
   Proteins and Their Subcellular Locations
   Importance of Protein Cleavage-Site Prediction
   Information in Amino Acid Sequences
   Existing Approaches to Cleavage Site Prediction
2. Conditional Random Field (CRF)
   CRF for Cleavage Site Prediction
3. Experiments and Results
   Effectiveness of Different Feature Functions
   Effect of Varying Window Size
   Fusion with SignalP
19
Experiments
Data: 1937 protein sequences extracted from Swissprot 56.5. The cleavage-site locations of these sequences were biologically determined.
Evaluation: ten-fold cross-validation.
1st-order state features: up to 5-grams of amino acids.
2nd-order state features: up to bi-grams of amino acids.
Software: CRF++.
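The ten-fold cross-validation protocol can be sketched as below. This is a generic illustration, not the authors' script: the function name, the shuffling, and the fixed seed are assumptions, and the slide does not say how sequences were assigned to folds.

```python
import random
from typing import Iterator, List, Tuple

def k_fold_indices(n_items: int, k: int = 10,
                   seed: int = 0) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices) for each of k folds."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)          # reproducible shuffle
    folds = [indices[i::k] for i in range(k)]     # k nearly equal-sized folds
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

# 1937 sequences, as stated on the slide.
splits = list(k_fold_indices(1937, k=10))
```

Each sequence appears in exactly one test fold, so every cleavage site is predicted by a model that did not see that sequence during training.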
20
Results
Effectiveness of using AA properties
Observations:
(1) Amino acids provide the most relevant information.
(2) Hydrophobicity and charge/polarity can help.
21
Results
Effectiveness of different feature functions (transition only vs. transition + state)
Observations:
(1) Transition features alone perform poorly.
(2) Once they are combined with state features, performance improves.
22
Results
Effect of varying the window size, e.g., query sequence = T Q T W A G S H S...
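One plausible reading of "window size" is the number of residues around the current position that the features may look at. The sketch below extracts such a window under that assumption; the helper name, the symmetric window, and the "-" padding symbol are hypothetical and are not taken from the talk.

```python
def context_window(sequence: str, position: int, half_width: int,
                   pad: str = "-") -> str:
    """Return the residues from position - half_width to position + half_width.

    Positions that fall outside the sequence are filled with the pad symbol.
    """
    if not 0 <= position < len(sequence):
        raise IndexError("position is outside the sequence")
    residues = []
    for i in range(position - half_width, position + half_width + 1):
        residues.append(sequence[i] if 0 <= i < len(sequence) else pad)
    return "".join(residues)

# Visible prefix of the query sequence on the slide (the rest is elided there).
query_prefix = "TQTWAGSHS"
window_at_w = context_window(query_prefix, 3, 2)      # centered on 'W'
window_at_start = context_window(query_prefix, 0, 2)  # needs left padding
```

Widening the window gives the model more context per position at the cost of more features to estimate.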
23
Results
Comparison with other predictors
Observations:
(1) CRF is slightly better than SignalP.
(2) CRF is complementary to SignalP.
24
Web Server
http://158.132.148.85:8080/CSitePred/faces/Page1.jsp
25
Web Server
http://158.132.148.85:8080/CSitePred/faces/Page1.jsp
Available in May 2009.
27
Conditional Random Fields
Conditional Random Fields (CRFs) were originally designed for sequence labeling tasks such as Part-of-Speech (POS) tagging.
Given a sequence of observations, a CRF attempts to find the most likely label sequence, i.e., it assigns a label to each observation.
Figure labels: observations (x), labels (y).