M.W. Mak and S.Y. Kung, ICASSP’09 1 Conditional Random Fields for the Prediction of Signal Peptide Cleavage Sites M.W. Mak The Hong Kong Polytechnic University.

Presentation transcript:

Conditional Random Fields for the Prediction of Signal Peptide Cleavage Sites
M.W. Mak, The Hong Kong Polytechnic University
S.Y. Kung, Princeton University
ICASSP’09

Contents
1. Introduction
   Proteins and Their Subcellular Locations
   Importance of Protein Cleavage-Site Prediction
   Information in Amino Acid Sequences
   Existing Approaches to Cleavage Site Prediction
2. Conditional Random Field (CRF)
   CRF for Cleavage Site Prediction
3. Experiments and Results
   Effectiveness of Different Feature Functions
   Effect of Varying Window Size
   Fusion with SignalP

Proteins and Their Destination
A protein consists of a sequence of amino acids.
Newly synthesized proteins must cross intracellular membranes to reach their destination.

Signal Peptide
A short segment of 20 to 100 amino acids (known as a signal peptide) contains information about the destination (address) of the protein.
The signal peptide is cleaved off from the resulting mature protein when the protein passes across the membrane.
[Figure labels: signal peptide, cleavage site, mature protein]
Source: S. R. Goodman, Medical Cell Biology, Elsevier

Importance of Cleavage Site Prediction
Defects in the protein sorting process can cause serious diseases, e.g., kidney stones.

Importance of Cleavage Site Prediction
Many proteins (e.g., insulin) are produced in living cells. To have these proteins secreted out of the cell, they are given a signal peptide.
[Figure label: bioreactor]
Source: /laureates/1999/illpres/diseases.html

Information in Sequences
Signal peptides contain some regular patterns. Although the patterns vary substantially, machine learning tools can detect them.
[Figure labels: cleavage site; region rich in hydrophobic amino acids (AA)]

Existing Methods
Weight matrices (PrediSi)
Neural networks (SignalP 1.1)
HMMs (SignalP 3.0)

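Weight Matrices
Example sequence: M A R S S L F T F L C L A V F I N G C L S Q I E Q Q
The weight matrix covers 20 amino acids (AA) by 15 positions and is applied to a window around positions t-1, t, t+1.
Score at position t = [summands lost in extraction] = 178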
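A minimal sketch of weight-matrix scoring as described on this slide: a matrix of 15 positions by 20 amino acids is slid along the sequence, and the score at position t is the sum of the entries selected by the residues in the window around t. The weights below are made-up placeholders, and the window placement (13 residues before t) is an assumption; the actual PrediSi matrix and alignment are not given in the transcript.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids
N_POSITIONS = 15                      # matrix width, as on the slide
LEFT = 13                             # assumed: window starts 13 residues before t

def toy_weight_matrix():
    """Placeholder weights (not real data): weight[p][aa] = p + index of aa."""
    return [{aa: p + i for i, aa in enumerate(AMINO_ACIDS)} for p in range(N_POSITIONS)]

def score_at(seq, t, weights, left=LEFT):
    """Sum the weights picked out by the residues in the window starting at t - left.
    Window positions that fall outside the sequence contribute nothing."""
    total = 0
    for p in range(len(weights)):
        i = t - left + p
        if 0 <= i < len(seq):
            total += weights[p][seq[i]]
    return total

def best_site(seq, weights):
    """Return the position with the highest score (the predicted cleavage site)."""
    return max(range(len(seq)), key=lambda t: score_at(seq, t, weights))

seq = "MARSSLFTFLCLAVFINGCLSQIEQQ"  # example sequence from the slide
weights = toy_weight_matrix()
scores = [score_at(seq, t, weights) for t in range(len(seq))]
```

With real, trained weights, `best_site(seq, weights)` would give the position the matrix considers most likely to be the cleavage site.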

SignalP-HMM
[Figure labels: signal peptide, mature protein]
Source: Nielsen and Krogh

Contents
1. Introduction
   Proteins and Their Subcellular Locations
   Importance of Protein Cleavage-Site Prediction
   Information in Amino Acid Sequences
   Existing Approaches to Cleavage Site Prediction
2. Conditional Random Field (CRF)
   CRF for Cleavage Site Prediction
3. Experiments and Results
   Effectiveness of Amino Acid Properties
   Effectiveness of Different Feature Functions
   Fusion with SignalP

Conditional Random Fields
Given a sequence of observations (e.g., words), a CRF attempts to find the most likely label sequence, i.e., it assigns a label to each observation.
Conditional Random Fields (CRFs) were originally designed for sequence labeling tasks such as part-of-speech (POS) tagging.

HMM vs. CRF
Hidden Markov Models: learn the joint distribution p(x, y) of observations x and labels y.
Conditional Random Fields: learn the conditional distribution p(y | x), which is more direct.
[Figure: for each model, a chain of labels y_1, y_2, ..., y_T over observations x_1, x_2, ..., x_T]

Advantages of CRF
Avoids computing the likelihood p(observation | label); the posterior p(label | observation) is computed directly.
Can model long-range dependencies without making the inference problem intractable.
Guarantees a globally optimal solution.
[Figure: example sequence M A R S S L F T F L C L A V F I N G C L S Q I E Q Q, annotated "depends on"]
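The point about computing the posterior directly can be illustrated with a toy linear-chain CRF. The sketch below assumes two labels (S for signal peptide, M for mature protein) and made-up scores; the forward recursion and Viterbi decoder are the standard textbook algorithms, not code from the slides.

```python
import math
from itertools import product

LABELS = ["S", "M"]  # assumed labels: signal peptide, mature protein

def path_score(state, trans, path):
    """Unnormalized log-score of one label path."""
    s = sum(state[t][y] for t, y in enumerate(path))
    s += sum(trans[(a, b)] for a, b in zip(path, path[1:]))
    return s

def log_partition(state, trans):
    """log Z(x) via the forward algorithm, i.e. a sum over all label paths."""
    alpha = {y: state[0][y] for y in LABELS}
    for t in range(1, len(state)):
        alpha = {
            y: state[t][y]
            + math.log(sum(math.exp(alpha[p] + trans[(p, y)]) for p in LABELS))
            for y in LABELS
        }
    return math.log(sum(math.exp(a) for a in alpha.values()))

def viterbi(state, trans):
    """Most likely label path by dynamic programming."""
    best = {y: (state[0][y], [y]) for y in LABELS}
    for t in range(1, len(state)):
        new = {}
        for y in LABELS:
            score, path = max(
                ((best[p][0] + trans[(p, y)], best[p][1]) for p in LABELS),
                key=lambda sp: sp[0],
            )
            new[y] = (score + state[t][y], path + [y])
        best = new
    return max(best.values(), key=lambda sp: sp[0])[1]

# Toy per-position state scores and transition scores (illustrative only).
state = [{"S": 2.0, "M": 0.0}, {"S": 1.5, "M": 0.2},
         {"S": 0.1, "M": 1.0}, {"S": 0.0, "M": 2.0}]
trans = {("S", "S"): 0.5, ("S", "M"): 0.0, ("M", "M"): 0.5, ("M", "S"): -5.0}

log_z = log_partition(state, trans)

def posterior(path):
    """p(path | observations), normalized by Z(x)."""
    return math.exp(path_score(state, trans, path) - log_z)

best_path = viterbi(state, trans)
```

In this toy model the cleavage site would be read off as the point where the decoded path switches from S to M. No likelihood p(observation | label) is ever evaluated.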

CRF for Cleavage Site Prediction
[Equation not recovered. Annotations: cleavage site; transition features; state features; weights; length of sequence; n-grams of amino acids]
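The equation on this slide did not survive extraction; only its annotations remain. For reference, the standard linear-chain CRF of Lafferty, McCallum and Pereira has the form below, which is consistent with those annotations. The symbols are assumed notation and may differ from the authors' own.

```latex
p(\mathbf{y} \mid \mathbf{x})
  = \frac{1}{Z(\mathbf{x})}
    \exp\!\Bigg( \sum_{t=1}^{T} \Big[
        \sum_{j} \lambda_j \, t_j(y_{t-1}, y_t, \mathbf{x}, t)
      + \sum_{k} \mu_k \, s_k(y_t, \mathbf{x}, t)
    \Big] \Bigg),
\qquad
Z(\mathbf{x}) = \sum_{\mathbf{y}'} \exp\!\Bigg( \sum_{t=1}^{T} \Big[
        \sum_{j} \lambda_j \, t_j(y'_{t-1}, y'_t, \mathbf{x}, t)
      + \sum_{k} \mu_k \, s_k(y'_t, \mathbf{x}, t)
    \Big] \Bigg)
```

Here T is the length of the sequence, t_j are transition features, s_k are state features (for this task, computed from n-grams of amino acids in x), and \lambda_j, \mu_k are the weights learned in training.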

CRF for Cleavage Site Prediction
Example: bi-gram features, query sequence = T Q T W A G S H S...
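As a small illustration of the bi-gram example, the sketch below lists the n-grams of the visible part of the query sequence. The function name is hypothetical, and how the authors map n-grams to CRF features is not shown in the transcript.

```python
def ngrams(seq, n):
    """All contiguous substrings of length n, in order of position."""
    return [seq[i:i + n] for i in range(len(seq) - n + 1)]

query = "TQTWAGSHS"          # visible part of the slide's query sequence
bigrams = ngrams(query, 2)   # ['TQ', 'QT', 'TW', 'WA', 'AG', 'GS', 'SH', 'HS']
```

With 20 amino acids there are 20 x 20 = 400 possible bi-grams, so each bi-gram feature can be indexed by one of 400 types.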

CRF for Cleavage Site Prediction
[Figure; axis label: Position]


Experiments
Data: 1937 protein sequences extracted from Swiss-Prot.
The cleavage-site locations of these sequences were biologically determined.
Ten-fold cross-validation.
For 1st-order state features, up to 5-grams of amino acids.
For 2nd-order state features, up to bi-grams of amino acids.
CRF++ software was used.
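The ten-fold cross-validation protocol can be sketched as follows. This assumes a simple random partition of the 1937 sequence indices; the transcript does not say how the authors formed their folds, and the function name and seed are illustrative.

```python
import random

def ten_fold_splits(n_items, n_folds=10, seed=0):
    """Yield (train, test) index lists; each item is tested exactly once."""
    idx = list(range(n_items))
    random.Random(seed).shuffle(idx)               # assumed: random assignment
    folds = [idx[k::n_folds] for k in range(n_folds)]
    for k in range(n_folds):
        test = folds[k]
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, test

splits = list(ten_fold_splits(1937))  # 1937 sequences, as in the experiments
```

Since 1937 = 10 x 193 + 7, seven folds hold 194 sequences and three hold 193. In each round the CRF would be trained on `train` and evaluated on `test`.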

Results: Effectiveness of Using AA Properties
Observations:
(1) Amino acids provide the most relevant information.
(2) Hydrophobicity and charge/polarity can help.

Results: Effectiveness of Different Feature Functions
[Figure labels: Transition only; Transition + State]
Observations:
(1) Transition features alone perform poorly.
(2) Once they are combined with state features, performance improves.

Results: Effect of Varying the Window Size
e.g., query sequence = T Q T W A G S H S...
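To show what varying the window size means in practice, the sketch below extracts the residues around a position for a given half-width. The symmetric window and the "-" padding character are assumptions; the transcript does not define the window used in the experiments.

```python
def window(seq, t, half_width, pad="-"):
    """Residues from t - half_width to t + half_width, padded at the sequence ends."""
    return "".join(
        seq[i] if 0 <= i < len(seq) else pad
        for i in range(t - half_width, t + half_width + 1)
    )

query = "TQTWAGSHS"             # visible part of the slide's query sequence
narrow = window(query, 4, 1)    # 'WAG'
wide = window(query, 4, 2)      # 'TWAGS'
```

A larger window gives the state features more context around each position, at the cost of more feature types to learn.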

Results: Comparison with Other Predictors
Observations:
(1) CRF is slightly better than SignalP.
(2) CRF is complementary to SignalP.

Web Server
Available in May 2009

