Truncation of Protein Sequences for Fast Profile Alignment with Application to Subcellular Localization. Man-Wai MAK and Wei WANG, The Hong Kong Polytechnic University.

Presentation transcript:

Truncation of Protein Sequences for Fast Profile Alignment with Application to Subcellular Localization Man-Wai MAK and Wei WANG The Hong Kong Polytechnic University Sun-Yuan KUNG Princeton University

2 Contents
1. Introduction
   – Cell Organelles and Protein Subcellular Localization
   – Signal-Based vs. Homology-Based Methods
2. Speeding Up the Prediction Process
   – Predicting Cleavage Site Location
   – Truncating Profiles vs. Truncating Sequences
   – Perturbational Discriminant Analysis
3. Experiments and Results
4. Conclusions

3 Organelles
Cells have a set of organelles that are specialized for carrying out one or more vital functions. Proteins must be transported to the correct organelles of a cell to perform their functions properly. Therefore, knowing the subcellular localization of a protein is one step towards understanding its function.

4 Proteins and Their Subcellular Location

5 Subcellular Localization Prediction
Two key methods:
1. Signal-based
2. Homology-based

6 Signal-Based Method
The amino acid sequence of a protein contains information about its organelle destination. Typically, the information can be found within a short segment of 20 to 100 amino acids preceding the cleavage site. Signal-based methods (e.g. TargetP) can determine the cleavage site location.
[Figure: a protein sequence with the cleavage site marked. Source: S. R. Goodman, Medical Cell Biology, Elsevier.]

7 Homology-Based Method
Pipeline (from the diagram): full-length query sequence → align with each of the N full-length training sequences S(1) = KNKA···, S(2) = KAKN···, ..., S(N) = KGLL··· → N-dim alignment vector → SVM classifier → subcellular location.
Advantage: Can predict sequences that do not have cleavage sites.
Drawback: A query sequence must be aligned with every sequence in the training set, which leads to long computation times.
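The alignment-vector step above can be sketched in a few lines. This is only an illustration under stated assumptions: it uses a plain Smith-Waterman local alignment with a toy match/mismatch/gap scheme rather than the profile alignment used in the talk, and the function names and the residues beyond the four-letter prefixes shown on the slide are made up.

```python
from typing import List


def local_alignment_score(a: str, b: str, match: int = 2,
                          mismatch: int = -1, gap: int = -2) -> int:
    """Smith-Waterman local alignment score with a toy scoring scheme.

    Assumption: a real system would use a substitution matrix or
    profile-profile scores instead of fixed match/mismatch values.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def alignment_vector(query: str, training: List[str]) -> List[int]:
    """Align the query with every training sequence: one score per sequence."""
    return [local_alignment_score(query, s) for s in training]


# Hypothetical training sequences (only the first four residues echo the slide).
training_seqs = ["KNKAMLSL", "KAKNPLLV", "KGLLMASR"]
vec = alignment_vector("KNKAMLSR", training_seqs)
print(vec)  # N-dim vector that would be fed to the classifier
```

The cost of building this vector grows with both the number of training sequences and the sequence lengths, which is the drawback the slide points out.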

8 Sequence Length Distribution
Many sequences are fairly long, so aligning whole sequences takes a long time. The signal segments (cTP, mTP and SP) are shorter than 100 AAs and contain the most relevant information. Computation can therefore be saved by aligning the signal segments only.
[Figure: distribution of sequence lengths (x-axis: sequence length; y-axis: number of occurrences) for Ext, Mit and Chl, with the SP, mTP and cTP segments and their cleavage sites marked.]

9 Proposed Method: Aligning the Segments that Contain the Most Relevant Information
Pipeline (from the diagram): amino acid sequence (N-terminus to C-terminus) → signal-based cleavage site predictor (e.g. TargetP) → truncate at the predicted cleavage site → truncated sequence → homology-based method → subcellular location.

10 Aligning Profiles vs. Aligning Sequences
For a given query sequence:
Scheme I: Truncate the profiles
Scheme II: Truncate the sequences

11 Perturbational Discriminant Analysis
Input and Hilbert spaces: [figure: mapping from the input space to the Hilbert space; equations not shown]
Empirical space: [equations not shown]

12 Perturbational Discriminant Analysis
The objective of PDA is to find an optimal discriminant function in the Hilbert space or the empirical space. [equation not shown]
The optimal solution in the empirical space is derived in the paper. [equation not shown]
ρ represents the noise (uncertainty) level in the measurement. It also ensures the numerical stability of the matrix inverse. ρ = 1 in this work.
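Since the slide's formula did not survive, the sketch below uses an assumed form: the ridge-regularized least-squares discriminant f(x) = aᵀk(x) + b with a = (K + ρI)⁻¹(y − b·e) and b = yᵀ(K + ρI)⁻¹e / eᵀ(K + ρI)⁻¹e, where e is the all-ones vector. This may differ in detail from the derivation in the paper. The RBF kernel, the 1-D toy data and all function names are illustrative choices, not taken from the talk.

```python
import math
from typing import List, Sequence, Tuple


def rbf(x: float, z: float, gamma: float = 1.0) -> float:
    """RBF kernel on scalars (toy setting)."""
    return math.exp(-gamma * (x - z) ** 2)


def solve(A: List[List[float]], b: Sequence[float]) -> List[float]:
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def train_pda(X: Sequence[float], y: Sequence[float],
              rho: float = 1.0) -> Tuple[List[float], float]:
    """Assumed PDA solution in the empirical space (see lead-in)."""
    n = len(X)
    # K + rho*I: rho perturbs the diagonal and keeps the system well conditioned.
    A = [[rbf(X[i], X[j]) + (rho if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    u = solve(A, [1.0] * n)   # (K + rho I)^-1 e
    v = solve(A, list(y))     # (K + rho I)^-1 y
    b = sum(v) / sum(u)       # bias term
    a = [vi - b * ui for vi, ui in zip(v, u)]
    return a, b


def pda_score(x: float, X: Sequence[float], a: Sequence[float], b: float) -> float:
    """Discriminant score f(x) = a^T k(x) + b."""
    return sum(ai * rbf(x, xi) for ai, xi in zip(a, X)) + b


X_train = [0.0, 0.2, 3.0, 3.2]
y_train = [-1.0, -1.0, 1.0, 1.0]
a, b = train_pda(X_train, y_train, rho=1.0)  # rho = 1 as stated on the slide
print([round(pda_score(x, X_train, a, b), 3) for x in (0.1, 3.1)])
```

Training reduces to solving linear systems with K + ρI, with no iterative optimization.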

13 Perturbational Discriminant Analysis: Example on 2-D Data
[Figure panels: 3 classes of 2-dim data in the input space; RBF kernel matrix K; projection onto the 2-dim PDA space; decision boundaries in the input space.]

14 Perturbational Discriminant Analysis: Application to Sequence Classification
Training (from the diagram): training sequences → PSI-BLAST → training profiles → pairwise alignment → kernel matrix K → compute PDA parameters.
Testing (from the diagram): test sequence → PSI-BLAST → test profile → align with training profiles → compute PDA score.

15 Perturbational Discriminant Analysis Application to Multi-Class Problems 1-vs-Rest PDA Classifier: MAXNET

16 Perturbational Discriminant Analysis: Application to Multi-Class Problems
Cascaded PDA-SVM classifier: test sequence → project onto the (C–1)-dim PDA space → 1-vs-rest SVM classifier → class label.

17 Experiments
Materials:
– Eukaryotic sequences extracted from Swiss-Prot 57.5
– The Ext, Mit and Chl sequences have experimentally determined cleavage sites
– 25% sequence identity (based on BLASTclust)
Performance evaluation:
– 5-fold cross-validation
– Prediction accuracy and Matthews correlation coefficient (MCC)
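For reference, the MCC for one class can be computed from one-vs-rest confusion counts as sketched below. The function names and toy labels are illustrative, and returning 0 when the denominator vanishes is a common convention assumed here, not something stated in the talk.

```python
import math
from typing import Sequence, Tuple


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient from binary confusion counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # assumed convention for degenerate cases
    return (tp * tn - fp * fn) / denom


def one_vs_rest_counts(true_labels: Sequence[str], pred_labels: Sequence[str],
                       cls: str) -> Tuple[int, int, int, int]:
    """Return (tp, tn, fp, fn), treating `cls` as positive and all else as negative."""
    tp = tn = fp = fn = 0
    for t, p in zip(true_labels, pred_labels):
        if p == cls and t == cls:
            tp += 1
        elif p == cls:
            fp += 1
        elif t == cls:
            fn += 1
        else:
            tn += 1
    return tp, tn, fp, fn


# Toy labels, not data from the experiments.
true_labels = ["Ext", "Mit", "Chl", "Mit"]
pred_labels = ["Ext", "Mit", "Mit", "Mit"]
mit_mcc = mcc(*one_vs_rest_counts(true_labels, pred_labels, "Mit"))
print(round(mit_mcc, 3))
```

Unlike plain accuracy, MCC stays informative when class sizes are unbalanced, because every cell of the confusion matrix enters the formula.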

18 Comparing Kernel Matrices
[Figure panels: query sequence; kernel matrix (Scheme I); kernel matrix (Scheme II).]

19 Sensitivity Analysis
The localization performance degrades when the cut-off position drifts away from the ground-truth cleavage site. mTP and cTP are more sensitive to cleavage site prediction errors than Ext.
Setup (from the diagram): sequence → cut at p ± x, where p is the ground-truth cleavage site → subcellular localization (PairProSVM) → subcellular location.
[Figure: subcellular localization accuracy (%) against cut-off position (p−16, p−8, p−2, p, p+2, p+16, p+32, p+64) for Cyt/Nuc, Mit, Chl, Ext and overall.]

20 Performance of Cleavage Site Prediction
The conditional random field (CRF) is better than TargetP(Plant) at predicting the cleavage sites of signal peptides (Ext) but worse than TargetP(NonPlant). CRF is slightly inferior to TargetP in predicting the cleavage sites of mitochondrial proteins, but it is significantly better than TargetP in predicting the cleavage sites of chloroplast proteins.
[Figure: cleavage site prediction accuracy (%) by category for TargetP(Plant), TargetP(NonPlant) and CRF.]

21 Comparing Profile Creation Time
Findings: Profile creation time can be substantially reduced by truncating the protein sequences at the cleavage sites.
Scheme I (from the diagram): query sequence → PSI-BLAST → long profile → cut → short profile → pairwise alignment → score vector → SVM or KPDA → subcellular location.
Scheme II (from the diagram): query sequence → cut → short sequence → PSI-BLAST → short profile → pairwise alignment → score vector → SVM or KPDA → subcellular location.

22 Training and Classification Time
Findings: The training times of 1-vs-rest PDA and cascaded PDA-SVM are substantially shorter than that of SVM.
[Diagram: project onto the (C–1)-dim PDA space → 1-vs-rest SVM classifier.]

23 Comparison with State-of-the-Art Localization Predictors
Findings: In terms of localization accuracy, the proposed "Signal+Homology" method performs slightly better than the signal-based TargetP and substantially better than the homology-based SubLoc.
[Table or figure: localization accuracy (%) and MCC for each predictor, including the conditional random fields variant; values not shown.]

24 Conclusion
Fast subcellular localization prediction can be achieved by a cascaded fusion of signal-based and homology-based methods. As far as localization accuracy is concerned, it does not matter whether we truncate the sequences or truncate the profiles. However, truncating the sequences reduces the profile creation time sixfold.

25 Compare with State-of-the-Art Localization Predictors

26 Performance of Cascaded Fusion
The computation time for full-length profile alignment is a striking 116 hours. Our method not only reduces computation time nearly 20-fold but also improves prediction performance.
[Figure: computation time (hr.) and subcellular localization accuracy (%) for full-length sequences and for sequences truncated at the cleavage site predicted by TargetP(P), TargetP(N) and CRF.]

27 Fusion of Signal- and Homology-Based Methods
1) Cleavage site detection. The cleavage site (if any) of a query sequence is determined by a signal-based method.
2) Pre-sequence selection. The pre-sequence of the query is obtained by selecting the residues from the N-terminus up to the cleavage site.
3) Pairwise alignment. The pre-sequence is aligned with each of the training pre-sequences to form an N-dim vector, which is fed to a one-vs-rest SVM classifier for prediction.
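The three steps above can be sketched as a small pipeline. Every component here is a hypothetical stand-in: the cleavage site predictor is a fixed-position stub (the talk uses TargetP or a CRF), the alignment is a toy position-wise match count, and the classifier is an argmax rather than a one-vs-rest SVM. Keeping the full sequence when no cleavage site is found is also an assumption, since the slide does not say what happens in that case.

```python
from typing import Callable, List, Optional, Sequence


def pre_sequence(seq: str, csite: Optional[int]) -> str:
    """Step 2: residues from the N-terminus up to the cleavage site.

    `csite` is the number of residues before the cleavage point.
    Assumption: with no cleavage site, the full sequence is kept.
    """
    return seq if csite is None else seq[:csite]


def fused_predict(query: str,
                  training_pre_seqs: Sequence[str],
                  predict_csite: Callable[[str], Optional[int]],
                  align: Callable[[str, str], float],
                  classify: Callable[[List[float]], str]) -> str:
    csite = predict_csite(query)                       # step 1: signal-based
    pre = pre_sequence(query, csite)                   # step 2: truncate
    vec = [align(pre, t) for t in training_pre_seqs]   # step 3: N-dim vector
    return classify(vec)                               # classifier on the vector


# Toy stand-ins, for illustration only.
def toy_align(a: str, b: str) -> float:
    return float(sum(1 for x, y in zip(a, b) if x == y))


train_pre = ["MKLLV", "MASRT", "MLSLR"]  # made-up pre-sequences
train_labels = ["Ext", "Mit", "Chl"]     # one label per training pre-sequence


def toy_classify(vec: List[float]) -> str:
    return train_labels[max(range(len(vec)), key=vec.__getitem__)]


result = fused_predict("MASRTQQWERTY", train_pre,
                       lambda s: 5, toy_align, toy_classify)
print(result)
```

Because only the short pre-sequences are aligned, the cost of step 3 no longer depends on the full sequence length.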

28 Perturbational Discriminant Analysis
Spectral space: the kernel matrix K can be factorized via spectral decomposition. [equation not shown]
[Figure: mapping from the empirical space to the spectral space.]
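One standard way to write the factorization mentioned above is given below. The notation is assumed, since the slide's own symbols did not survive: U is an orthogonal matrix of eigenvectors, Λ is the diagonal matrix of eigenvalues (taken to be strictly positive here), and k(x) is the empirical-space vector of kernel values between x and the N training samples.

```latex
\begin{align}
  K &= U^{\top} \Lambda U, \qquad U U^{\top} = I, \qquad
       \Lambda = \operatorname{diag}(\lambda_1, \dots, \lambda_N), \\
  \mathbf{k}(x) &= \bigl[ K(x_1, x), \dots, K(x_N, x) \bigr]^{\top}
       \quad \text{(empirical space)}, \\
  \mathbf{s}(x) &= \Lambda^{-1/2} U \, \mathbf{k}(x)
       \quad \text{(spectral space)}, \\
  \mathbf{s}(x_i)^{\top} \mathbf{s}(x_j)
     &= \mathbf{k}(x_i)^{\top} U^{\top} \Lambda^{-1} U \, \mathbf{k}(x_j)
      = \bigl( K K^{-1} K \bigr)_{ij} = K_{ij}.
\end{align}
```

Under these assumptions, inner products of the spectral-space vectors reproduce the kernel matrix on the training samples, so a linear discriminant in the spectral space corresponds to a kernel discriminant in the original space.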