Artificial Intelligence Research Laboratory Bioinformatics and Computational Biology Program Computational Intelligence, Learning, and Discovery Program.

Presentation transcript:

Artificial Intelligence Research Laboratory, Bioinformatics and Computational Biology Program, Computational Intelligence, Learning, and Discovery Program, Department of Computer Science

ISMB 2006

Acknowledgements: This work is supported in part by grants from the National Science Foundation (IIS ) and the National Institutes of Health (GM ) to Vasant Honavar.

On the Quality of Motifs for Protein Phosphorylation Site Prediction
Yasser EL-Manzalawy, Cornelia Caragea, Drena Dobbs, and Vasant Honavar

Case Study: Phosphorylation Site Prediction
Because of the important role of phosphorylation in signal transduction pathways, discovering the amino acid sequence correlates of phosphorylation sites is an essential step towards understanding phosphorylation. Phosphorylation site prediction has important applications in understanding diseases and, ultimately, in the design of therapies. Several computational methods for predicting kinase-specific phosphorylation sites have been proposed, including motif-based methods that rely on PSSMs and HMMs. However, it is unclear how the different motif-based approaches compare with each other.

Data set used: the Phospho.ELM data set, a resource containing 1805 proteins from different species covering 1372 Tyr, 3175 Ser and 767 Thr experimentally verified phosphorylation sites manually curated from the literature.

Assessing the Quality of Motifs
Reporting the motif performance only at the predetermined threshold score does not provide the whole picture about the motif, since the user is allowed to use different threshold scores. In this work, we propose the use of the Receiver Operating Characteristic (ROC) curve and the area under the ROC curve (AUC) as more accurate statistical measures for assessing the quality of the motif.

Receiver Operating Characteristic
The Receiver Operating Characteristic (ROC) curve is a graphical plot of the relation between the False Positive Rate (FPR) and the True Positive Rate (TPR) for each possible threshold score.
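The ROC construction described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function names `roc_points` and `auc` are hypothetical, labels are assumed to be 1 for functional sites and 0 for non-functional ones, and a sequence is assumed to match the motif when its score is at or above the threshold.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) pairs, one per distinct threshold score.

    Thresholds are visited from the strictest (highest score) to the most
    lenient; a sequence is predicted positive when score >= threshold.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative example")

    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    points = [(0.0, 0.0)]  # threshold above every score: nothing is predicted
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Lowering the threshold to this score admits every tied example at once.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


if __name__ == "__main__":
    # Made-up motif scores for six windows (illustrative values only).
    scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4]
    labels = [1, 1, 0, 1, 0, 0]
    for fpr, tpr in roc_points(scores, labels):
        print(f"FPR={fpr:.2f}  TPR={tpr:.2f}")
    print("AUC =", round(auc(roc_points(scores, labels)), 4))
```

Each printed pair is one operating point of the motif, so a user can read off the FPR and TPR that correspond to any candidate threshold.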
Hence, motif-based tools can assist the user in setting a proper threshold score by visualizing the ROC curve of each motif. Moreover, knowing the FPR and TPR of the motif at the user-selected threshold score leads to a better interpretation of the prediction results.

Problem Description
Position Specific Scoring Matrices (PSSMs) and Hidden Markov Model (HMM) profiles are two widely used probabilistic methods for modeling overrepresented regions in biological sequences (motifs). Both PSSMs and HMM profiles assign a score to an input sequence. The higher the score, the more likely the input sequence matches the motif. A predetermined cutoff score is used to decide whether an input sequence matches the motif or not. Many motif-based tools allow users to set a different threshold. A major problem with this approach is that the motif performance is usually reported only at the predetermined threshold score. Hence, the user has no way of knowing the influence of the user-specified score on the predictive power of the motif (e.g., for a user-specified p-value, what is the true positive rate of the motif?).

We constructed separate data sets for kinase families that are well represented in the database (i.e., families known to recognize more than 50 phosphorylation sites; see Table 1). Functional sequences are extracted using a window of 15 amino acids, W, centered at the functional Ser and Thr sites in each family. Non-functional sequences are collected using the same window, W, centered at Ser and Thr sites that are not known to be targets for phosphorylation by any of the kinases.

[Table 1 layout: columns Kinase: CDK, CK2, MAPK, PKA, PKB, PKC; rows Ser, Thr, Total]

Experimental Methodology
A direct comparison between Scansite and KinasePhos is not feasible since the Scansite PSSM motifs and the KinasePhos HMM profiles are not publicly available. For each kinase family, we used 5-fold cross validation to evaluate the learned PSSM and HMM motifs.
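The window extraction described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function names `extract_windows` and `label_windows` are hypothetical, site positions are assumed to be 1-based, and padding with `X` near the sequence termini is an assumption of this sketch (the poster does not say how sites close to the ends are handled).

```python
def extract_windows(sequence, residues=("S", "T"), width=15, pad="X"):
    """Return (position, window) pairs for every Ser/Thr in the sequence.

    Each window has `width` residues and is centered on the candidate site.
    Positions are 1-based. Padding with `pad` at the termini is an
    assumption of this sketch, not something stated in the source.
    """
    if width % 2 == 0:
        raise ValueError("width must be odd so the site sits at the center")
    half = width // 2
    padded = pad * half + sequence + pad * half
    windows = []
    for i, residue in enumerate(sequence):
        if residue in residues:
            # In `padded`, the site is at index i + half, so the window
            # starts at index i and spans `width` characters.
            windows.append((i + 1, padded[i:i + width]))
    return windows


def label_windows(windows, family_sites, any_kinase_sites):
    """Split windows into functional and non-functional sets.

    family_sites: positions phosphorylated by the kinase family of interest.
    any_kinase_sites: positions known to be phosphorylated by any kinase.
    Sites targeted only by other kinases are left out of both sets.
    """
    functional = [w for pos, w in windows if pos in family_sites]
    non_functional = [w for pos, w in windows if pos not in any_kinase_sites]
    return functional, non_functional


if __name__ == "__main__":
    toy_sequence = "AAAAAAASAAAAAAAT"  # made-up sequence for illustration
    windows = extract_windows(toy_sequence)
    positives, negatives = label_windows(windows, {8}, {8})
    print(positives)
    print(negatives)
```

Keeping `family_sites` and `any_kinase_sites` separate mirrors the text: a Ser or Thr phosphorylated by a different kinase is not a positive for this family, but it is not a valid negative either.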
PSSM motifs were created using the PROFILEWEIGHT program, and HMM profiles were built using the HMMER package.

Results
We report the ROC curves and the area under the ROC curve (AUC) for the learned PSSM and HMM motifs, estimated using 5-fold cross validation (Fig. 3 and Fig. 4).

Fig. 3: Comparison of the AUC for Basic PSSM and Basic HMM profiles for the six kinase families considered (CDK, CK2, MAPK, PKA, PKB, PKC); the higher the AUC, the better the method.

Fig. 4: Comparison of ROC curves for Basic PSSM and Basic HMM for the six kinase families considered.

Table 1: Kinase families considered in our study and the number of Ser and Thr sites known to be phosphorylated.

Conclusions
• Visualizing the ROC curve of the motif can assist users in selecting a proper threshold score and in interpreting the resulting predictions.
• The reported quality of the motifs can help users choose the better performing motif-based prediction tool for a given prediction task.

Discussion
The motifs used by some methods, including the popular Scansite and KinasePhos motifs, are not publicly available to users (except through the online servers that generate predictions based on the motifs). Because the servers do not return scores for negative predictions, it is not straightforward to compare the ROC curves for the corresponding motifs. Such a comparison is essential for an objective assessment of the effectiveness of the respective motifs and/or the underlying algorithms.
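The score-and-threshold scheme described under Problem Description can be illustrated with a toy PSSM. This is a sketch under stated assumptions, not the PROFILEWEIGHT method: it uses plain column counts with a pseudocount, a uniform background frequency of 1/20, and log-odds scores in bits. The names `build_pssm`, `score_window` and `matches_motif` are hypothetical.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(windows, pseudocount=1.0):
    """Build a log-odds PSSM from equal-length functional windows.

    Assumptions of this sketch: unweighted counts, a fixed pseudocount per
    residue, and a uniform background frequency of 1/20.
    """
    width = len(windows[0])
    if any(len(w) != width for w in windows):
        raise ValueError("all windows must have the same length")
    background = 1.0 / len(AMINO_ACIDS)
    pssm = []
    for col in range(width):
        counts = {aa: pseudocount for aa in AMINO_ACIDS}
        for window in windows:
            if window[col] in counts:  # skip padding or unknown symbols
                counts[window[col]] += 1.0
        total = sum(counts.values())
        pssm.append({aa: math.log2((counts[aa] / total) / background)
                     for aa in AMINO_ACIDS})
    return pssm


def score_window(pssm, window):
    """Sum the per-position log-odds; unknown symbols contribute 0."""
    return sum(column.get(aa, 0.0) for column, aa in zip(pssm, window))


def matches_motif(pssm, window, threshold):
    """A window matches the motif when its score reaches the threshold."""
    return score_window(pssm, window) >= threshold


if __name__ == "__main__":
    toy_windows = ["RRS", "RKS", "RRS"]  # made-up 3-residue windows
    pssm = build_pssm(toy_windows)
    for candidate in ("RRS", "AAA"):
        print(candidate, round(score_window(pssm, candidate), 3),
              matches_motif(pssm, candidate, threshold=0.0))
```

Because `matches_motif` depends on the chosen threshold, the same PSSM yields a different TPR and FPR at each cutoff, which is why performance reported at a single predetermined threshold gives an incomplete picture.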