BIOINFORMATION A new taxonomy-based protein fold recognition approach based on autocross-covariance transformation - - 王红刚 14S

Presentation transcript:

BIOINFORMATION A new taxonomy-based protein fold recognition approach based on autocross-covariance transformation - - 王红刚 14S051054

Outline: Introduction; Materials and Methods; Results and Discussion; Conclusions

Introduction: Importance. Protein fold recognition is one of the most important tasks in computational biology, and it helps fill the gap between known sequences and known structures. Detecting homologies with low sequence identity remains a challenging problem. Some solutions: general sequence comparison methods; threading the query sequence onto template structures; improving prediction performance by either incorporating new features or developing novel algorithms.

Introduction: Problem. Traditional sequence comparison methods fail to identify reliable homologies at low sequence identity. Taxonomic methods are effective alternatives, but their prediction accuracies are around 70%, which is still relatively low for practical use. Authors' solution: a protein sequence runs in a single direction from beginning to end, which makes it analogous to a time series of process data. The sequence is therefore encoded with the autocross-covariance (ACC) transformation and classified with an SVM.

Introduction: Pipeline: each sequence -> (PSI-BLAST) -> PSSM -> (ACC) -> fixed-length vector -> (SVM) -> classification results. PSSM: position-specific score matrix. The element Si,j in the matrix reflects the probability of amino acid i occurring at position j. Feature: only the evolutionary information represented in the form of the PSSM is used. It alone can achieve promising results.

Materials and methods: To evaluate the proposed method and compare it with existing methods, five datasets are used: the D-B dataset, the extended D-B dataset, the F86 dataset, the F199 dataset and the Lindahl dataset.

The D-B dataset: 311 proteins for training and 383 proteins for testing. Sequence identity is <40%, and <35% within the training set. Each fold has at least seven members. Classes: all α, all β, α/β, α+β and small proteins.

The extended D-B dataset: 27 folds, <40% identity, 3202 sequences. The fold names and the number of proteins:

F86 and F199: The F86 dataset contains 86 folds and 5671 sequences; each fold has at least 25 members. The F199 dataset contains 199 folds and 7437 sequences; each fold has at least 10 members.

The Lindahl dataset: used as a benchmark to compare taxonomic fold recognition methods with threading methods. It contains 976 sequences with identity <40%.

ACC transformation: ACC transforms PSSMs of different lengths into fixed-length vectors by measuring the correlation between any two properties. It yields two kinds of variables: auto-covariance (AC), measured between the same property, and cross-covariance (CC), measured between two different properties.

AC variable: measures the correlation of the same property between two residues separated by a distance of lg along the sequence. It is computed from the following quantities: i is one of the residues (amino acid types); L is the length of the protein sequence; Si,j is the PSSM score of amino acid i at position j; S̄i is the average score for amino acid i along the whole sequence. The number of AC variables is therefore 20 ∗ LG, where LG is the maximum value of lg (lg = 1, 2, ..., LG).
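The formula itself did not survive in the transcript. A standard auto-covariance form that is consistent with the symbols listed above is sketched below. Normalising by L − lg is an assumption, not something the slide confirms.

```latex
\[
AC(i, lg) = \frac{1}{L - lg} \sum_{j=1}^{L - lg}
            \left(S_{i,j} - \bar{S}_i\right)\left(S_{i,j+lg} - \bar{S}_i\right),
\qquad
\bar{S}_i = \frac{1}{L} \sum_{j=1}^{L} S_{i,j}
\]
```

With 20 amino acids and lags lg = 1, ..., LG, this gives the 20 ∗ LG AC variables stated on the slide.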

CC variable: measures the correlation of two different properties between two residues separated by lg along the sequence, where i1 and i2 are two different amino acids. The total number of CC variables is 380 ∗ LG. Each protein sequence is represented as a vector of either AC variables or ACC variables, the latter being the combination of AC and CC.
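The AC and CC variables described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code. It assumes the PSSM is stored as L rows (positions) by 20 columns (amino acids), and that each covariance is the mean over j of (S[i1][j] − mean(i1)) · (S[i2][j+lg] − mean(i2)), normalised by L − lg. The function name `acc_transform` and its parameters are hypothetical.

```python
from typing import List


def acc_transform(pssm: List[List[float]], max_lag: int,
                  include_cc: bool = True) -> List[float]:
    """Turn an L x 20 PSSM into a fixed-length AC or ACC vector.

    pssm[j][i] is the score of amino acid i at position j (assumed layout).
    max_lag plays the role of LG on the slides.
    """
    length = len(pssm)
    n_props = len(pssm[0])
    if not 1 <= max_lag < length:
        raise ValueError("LG must satisfy 1 <= LG <= L - 1")

    # Average score of each amino acid along the whole sequence.
    means = [sum(row[i] for row in pssm) / length for i in range(n_props)]

    features: List[float] = []
    for lg in range(1, max_lag + 1):
        span = length - lg
        for i1 in range(n_props):
            for i2 in range(n_props):
                if i1 != i2 and not include_cc:
                    continue  # AC only: keep same-property pairs
                total = sum(
                    (pssm[j][i1] - means[i1]) * (pssm[j + lg][i2] - means[i2])
                    for j in range(span)
                )
                features.append(total / span)
    return features


# Toy 12-residue PSSM with made-up scores, just to check the vector sizes.
toy = [[float((i * j) % 7) for i in range(20)] for j in range(12)]
print(len(acc_transform(toy, 2, include_cc=False)))  # 20 * LG = 40
print(len(acc_transform(toy, 2)))                    # (20 + 380) * LG = 800
```

In this sketch the AC and CC terms are interleaved within each lag rather than stored as two separate blocks. The ordering does not matter for an SVM as long as it is the same for every protein.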

Materials and methods: Support vector machine. Performance metrics: the overall accuracy (Q), sensitivity (Sn) and specificity (Sp).
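The metric formulas are not preserved in the transcript. The sketch below assumes one convention that is common in fold recognition work: Q is the fraction of correctly classified proteins, and for each fold Sn = TP / (TP + FN) and Sp = TP / (TP + FP). Whether the authors used exactly these definitions is an assumption. Note that Sp here is what other fields call precision, not TN / (TN + FP). The function name `fold_metrics` is hypothetical.

```python
from typing import Dict, List, Tuple


def fold_metrics(true: List[str],
                 pred: List[str]) -> Tuple[float, Dict[str, Tuple[float, float]]]:
    """Return overall accuracy Q and per-fold (Sn, Sp) under the assumed definitions."""
    if not true or len(true) != len(pred):
        raise ValueError("true and pred must be non-empty and of equal length")

    # Overall accuracy: correctly predicted proteins / all proteins.
    q = sum(t == p for t, p in zip(true, pred)) / len(true)

    per_fold: Dict[str, Tuple[float, float]] = {}
    for fold in sorted(set(true) | set(pred)):
        tp = sum(t == fold and p == fold for t, p in zip(true, pred))
        fn = sum(t == fold and p != fold for t, p in zip(true, pred))
        fp = sum(t != fold and p == fold for t, p in zip(true, pred))
        sn = tp / (tp + fn) if tp + fn else 0.0  # assumed: TP / (TP + FN)
        sp = tp / (tp + fp) if tp + fp else 0.0  # assumed: TP / (TP + FP)
        per_fold[fold] = (sn, sp)
    return q, per_fold


# Made-up labels for three folds, only to show the call.
q, per_fold = fold_metrics(["a", "a", "b", "b", "c"],
                           ["a", "b", "b", "b", "a"])
print(q, per_fold)
```

A fold with an empty denominator is reported as 0.0 here; that is a choice made for this sketch, not something the slides specify.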

Results and discussion: The impact of LG; performance comparison with existing taxonomy-based methods; performance comparison with threading methods.

The impact of LG: The maximum value of LG is the length of the shortest sequence minus one. D-B dataset: the optimal values of LG for AC and [...]. Extended D-B dataset: the best value of LG is 10.

Performance comparison with existing taxonomy-based methods Results on the D-B dataset The detailed results are given in the Supplementary Material:

Results on the D-B dataset: To give a more comprehensive comparison, we consider several other methods in the literature. The proposed ACCFold method outperforms these methods by 2–14%.

Results on the extended D-B dataset: The extended D-B dataset has the same folds but more sequences (3202). All the methods improve, and the proposed method is higher than the other methods by 9–17%.

Results on the extended D-B dataset: In particular, the performance on folds in the α/β, α+β and small-protein classes is significantly improved.

Results on the F86 and F199 datasets: More folds (86 and 199). Time complexity: SWPSSM: O(n^2 * L^2); ACC: O(n*L^2 + n^2*L). The results indicate that the proposed method can be applied to cases with a large number of folds without significantly affecting its performance, as long as the number of samples in each fold is not too small.

Performance comparison with threading methods: Results on the Lindahl dataset. Threading methods use sequence–template alignments to detect remote homologies of proteins. At the family level, we select the families that contain at least two samples.

Performance comparison with threading methods: Taxonomic methods are not as good as threading methods and are difficult to apply to practical fold recognition. However, the total number of folds is limited while the number of proteins with known structure keeps increasing, so there is more room and opportunity to exploit taxonomic methods to develop effective fold recognition systems.

Conclusions: A method that combines SVM with ACC is introduced for taxonomic protein fold recognition. The ACC transformation is used to convert the PSSMs into fixed-length vectors. The results obtained here represent state-of-the-art performance for taxonomic protein fold recognition.