Protein Family Classification using Sparse Markov Transducers. Proceedings of the Eighth International Conference on Intelligent Systems for Molecular Biology.

Presentation transcript:

Protein Family Classification using Sparse Markov Transducers. Proceedings of the Eighth International Conference on Intelligent Systems for Molecular Biology (ISMB 2000). E. Eskin, W.N. Grundy, and Y. Singer. Presented by Cho, Dong-Yeon.

Abstract
Classifying proteins into families using sparse Markov transducers (SMTs)
- Estimation of a probability distribution conditioned on an input sequence
- Similar to probabilistic suffix trees, but allowing for wild-cards
- Two models
- Efficient data structures

Introduction
Protein classification
- Pairwise similarity
- Creating profiles for protein families
- Consensus patterns using motifs
- HMM-based approaches
- Probabilistic suffix trees (PSTs)
  - A PST is a model that predicts the next symbol in a sequence based on the previous symbols.
  - The approach relies on the presence of common short sequences (motifs) throughout the protein family.
  - One drawback of PSTs is that they rely on exact matches to the conditioning sequences; in 3-hydroxyacyl-CoA dehydrogenase, for example, the motif occurs as both VAVIGSGT and VGVLGLGT, and only the wild-card pattern V*V*G*GT captures both as one context (see the sketch below).
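To make the exact-match drawback concrete, here is a minimal Python sketch (hypothetical code, not from the paper) of a PST-style predictor: it keeps next-symbol counts per exact context, so a context it has never seen verbatim yields no prediction at all.

```python
from collections import defaultdict

class ExactContextPredictor:
    """PST-style predictor (sketch): conditions on exact suffixes only."""

    def __init__(self, order):
        self.order = order   # length of the conditioning context
        self.counts = defaultdict(lambda: defaultdict(int))

    def train(self, sequence):
        # Count which symbol follows each exact context of length `order`.
        for i in range(self.order, len(sequence)):
            context = sequence[i - self.order:i]
            self.counts[context][sequence[i]] += 1

    def predict(self, context):
        seen = self.counts.get(context)
        if seen is None:       # exact match required: one substitution in
            return None        # the context and the model is blind
        total = sum(seen.values())
        return {sym: n / total for sym, n in seen.items()}

pst = ExactContextPredictor(order=3)
pst.train("VAVIGSGTVGVLGLGTV")
print(pst.predict("GSG"))   # {'T': 1.0}
print(pst.predict("GSA"))   # None: never seen verbatim, no prediction
```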

Sparse Markov Transducers (SMTs)
- A generalization of PSTs
  - An SMT can condition the probability model on a sequence that contains wild-cards.
  - In a transducer, the input symbol alphabet and the output symbol alphabet can be different.
- Two methods
  - Predicting a single amino acid (one model per family)
  - Predicting the protein family (one model for the entire database)
- Efficient data structures
Experiments
- Pfam database of protein families

Sparse Markov Transducers
A Markov transducer of order L
- Conditional probability distribution P(Y_t | X_t, X_{t-1}, ..., X_{t-(L-1)})
  - The X_k are random variables over an input alphabet.
  - Y_t is a random variable over an output alphabet.
Sparse Markov transducer
- Conditional probability distribution P(Y_t | phi^{n_1} X_{t_1} phi^{n_2} X_{t_2} ... phi^{n_k} X_{t_k})
  - phi is the wild-card symbol; phi^n denotes n consecutive wild-cards in the conditioning sequence (illustrated below).
Two approaches for SMT-based protein classification
- A prediction model for each family: the output is a single amino acid.
- A single model for the entire database: the output is a protein family.
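A minimal sketch of the wild-card conditioning idea (the helper name is an assumption; '*' stands in for phi): a single conditioning sequence with wild-cards covers input contexts that a PST would have to treat as distinct.

```python
PHI = "*"   # wild-card symbol (written phi in the paper)

def matches(template, context):
    """True if the conditioning sequence `template`, possibly containing
    wild-cards, matches the most recent symbols of `context`."""
    if len(context) < len(template):
        return False
    recent = context[-len(template):]
    return all(t == PHI or t == c for t, c in zip(template, recent))

# Both substrings from the dehydrogenase example fall into the same
# conditional distribution under one sparse conditioning sequence.
print(matches("V*V*G*GT", "VAVIGSGT"))   # True
print(matches("V*V*G*GT", "VGVLGLGT"))   # True
print(matches("VAVIGSGT", "VGVLGLGT"))   # False: exact matching fails
```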

Sparse Markov Trees
- Representationally equivalent to SMTs
- The topology of a tree encodes the positions of the wild-cards in the conditioning sequence of the probability distribution.
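One plausible way to encode this in a data structure (a sketch with assumed names, not the authors' implementation): each node either absorbs the next conditioning symbol through a wild-card edge or branches on its concrete value, so every root-to-leaf path spells out one conditioning sequence such as A*C.

```python
from collections import defaultdict

class TreeNode:
    """Node of one sparse Markov tree (sketch)."""
    def __init__(self):
        self.children = {}                      # symbol or '*' -> TreeNode
        self.output_counts = defaultdict(int)   # predictor counts at leaves

def walk(root, context):
    """Route a conditioning sequence to its leaf. The path taken encodes
    where the wild-cards sit in the conditioning sequence."""
    node = root
    for symbol in context:
        if '*' in node.children:          # wild-card edge: absorb symbol
            node = node.children['*']
        elif symbol in node.children:     # concrete edge: branch on value
            node = node.children[symbol]
        else:
            break                         # the topology stops here
    return node

# Build the path for the conditioning sequence A*C and route an example.
root = TreeNode()
root.children['A'] = TreeNode()
root.children['A'].children['*'] = TreeNode()
root.children['A'].children['*'].children['C'] = TreeNode()
leaf = walk(root, "ADC")       # D is absorbed by the wild-card edge
leaf.output_counts['A'] += 1   # update the leaf's predictor
```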

Training a Prediction Tree
- A set of training examples (input sequence, output symbol)
  - The input symbols are used to identify which leaf node is associated with that training example.
  - The output symbol is then used to update the count of the appropriate predictor.
  - The predictor keeps counts of each output symbol it has seen.
  - We smooth each count by adding a constant value to the count of each output symbol (cf. a Dirichlet prior).
- Example: after training on (DACDADDDCAA, C) and (CAAAACAD, D), the leaf reached by the input AACCAAA predicts C -> 0.5 and D -> 0.5 (see the sketch below).
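A self-contained sketch of the leaf predictor for the toy example above (names such as SMOOTH_CONST are assumptions): the leaf has seen C once and D once, so after smoothing C and D remain equally likely, matching the slide's 0.5/0.5 up to the small mass the smoothing constant gives to unseen symbols.

```python
from collections import defaultdict

ALPHABET = "ACD"     # toy output alphabet from the slide's example
SMOOTH_CONST = 0.5   # constant added to every output symbol's count

class LeafPredictor:
    def __init__(self):
        self.counts = defaultdict(int)

    def update(self, output_symbol):
        self.counts[output_symbol] += 1

    def predict(self):
        # Smoothed relative frequencies over the output alphabet.
        smoothed = {a: self.counts[a] + SMOOTH_CONST for a in ALPHABET}
        total = sum(smoothed.values())
        return {a: v / total for a, v in smoothed.items()}

leaf = LeafPredictor()
leaf.update("C")        # from the training example (DACDADDDCAA, C)
leaf.update("D")        # from the training example (CAAAACAD, D)
print(leaf.predict())   # C and D equal (~0.43 each), A small (~0.14)
```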

Mixture of Sparse Prediction Trees
- We do not know in advance which tree topology best estimates the distribution.
- A mixture technique employs a weighted sum of trees as the predictor.
- The weight of each tree is updated for each input string in the data set based on how well the tree performed at predicting the output.
- The prior probability of a tree is defined by its topology.
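The mixture can be sketched as follows (the predict(context) interface and the stand-in tree class are assumptions): each tree's weight starts at its prior and is multiplied by the probability the tree assigned to each observed output, so well-performing topologies come to dominate the weighted sum.

```python
from collections import defaultdict

class ConstantTree:
    """Stand-in for a sparse prediction tree: just enough interface
    (predict returning a distribution) to demonstrate the mixture."""
    def __init__(self, dist):
        self.dist = dist

    def predict(self, context):
        return self.dist

def mixture_predict(trees, weights, context):
    """Weighted sum of the trees' predictions, normalized by total weight."""
    total = sum(weights)
    mix = defaultdict(float)
    for tree, w in zip(trees, weights):
        for sym, p in tree.predict(context).items():
            mix[sym] += (w / total) * p
    return dict(mix)

def mixture_update(trees, weights, context, output):
    """w_T <- w_T * P_T(output | context): trees that predicted the
    observed output well gain influence."""
    return [w * tree.predict(context).get(output, 0.0)
            for tree, w in zip(trees, weights)]

trees = [ConstantTree({"C": 0.9, "D": 0.1}), ConstantTree({"C": 0.5, "D": 0.5})]
weights = [1.0, 1.0]          # prior: here uniform over the two trees
for out in ["C", "C", "C"]:   # three observed outputs
    weights = mixture_update(trees, weights, "AA", out)
print(weights)                # [0.729, 0.125]: the first tree dominates
print(mixture_predict(trees, weights, "AA"))
```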

Implementation of SMTs
- Two important parameters
  - MAX_DEPTH: the maximum depth of the tree
  - MAX_PHI: the maximum number of wild-cards at every node
- Ten trees in the mixture if MAX_DEPTH = 2 and MAX_PHI = 1

Template tree
- We store only the nodes that are reached during training.
- Example: after training, only the nodes for the conditioning sequences AA, AC and CD are stored (sketched below).
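A sketch of the store-only-what-is-reached idea (the representation is assumed: nodes keyed by their conditioning path, allocated on first visit).

```python
from collections import defaultdict

nodes = {}   # conditioning path (tuple of symbols) -> output counts

def touch(path):
    """Return the node for `path`, allocating it on first visit only."""
    if path not in nodes:
        nodes[path] = defaultdict(int)
    return nodes[path]

# Only the three nodes actually reached during training are allocated,
# as in the slide's AA, AC and CD example.
touch(("A", "A"))["C"] += 1
touch(("A", "C"))["D"] += 1
touch(("C", "D"))["C"] += 1
print(sorted(nodes))   # [('A', 'A'), ('A', 'C'), ('C', 'D')]
```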

Efficient Data Structures
- Performance of the SMT typically improves with higher MAX_PHI and MAX_DEPTH.
- Memory usage becomes the bottleneck: it restricts these parameters to values that allow the tree to fit in memory.

Lazy Evaluation
- We store the tails of the training sequences and recompute that part of the tree on demand when necessary.
- Example with EXPAND_SEQUENCE_COUNT = 4: the sequences ACDACAC(A), DACADAC(C), DACAAAC(D), ACACDAC(A), ADCADAC(D), ACDACAC(D) arrive one by one; once four tails have accumulated at a node, it is expanded (see the sketch below).
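A sketch of the lazy scheme (class and method names are assumptions; EXPAND_SEQUENCE_COUNT is from the slide): a node only buffers the tails of the sequences that reach it and builds its children once the threshold is crossed.

```python
EXPAND_SEQUENCE_COUNT = 4

class LazyNode:
    def __init__(self):
        self.stored = []       # tails of the training sequences seen here
        self.children = None   # subtree; recomputed from `stored` on demand

    def add(self, tail, output):
        self.stored.append((tail, output))
        if self.children is not None:
            self._route(tail, output)
        elif len(self.stored) >= EXPAND_SEQUENCE_COUNT:
            self.children = {}
            for t, o in self.stored:   # recompute the subtree from the
                self._route(t, o)      # stored tails, now that it pays off

    def _route(self, tail, output):
        if tail:
            self.children.setdefault(tail[0], LazyNode()).add(tail[1:], output)

root = LazyNode()
for seq, out in [("ACDACAC", "A"), ("DACADAC", "C"),
                 ("DACAAAC", "D"), ("ACACDAC", "A"), ("ADCADAC", "D")]:
    root.add(seq, out)                 # the 4th sequence triggers expansion
print(root.children is not None)       # True: the root has been expanded
```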

Methodology
Data
- Two versions of the Pfam database
  - Version 1.0: for comparing results to previous work
  - Version 5.2: the latest version at the time
- 175 protein families of single-domain protein sequences
- Training and test data with a ratio of 4:1 for each family
  - Example: the transmembrane receptor family contains 530 protein sequences.
  - The 424 sequences of its training set give the subsequences that are used to train the model.

Building SMT Prediction Models
- A prediction model for each protein family
  - A sliding window of size 11: a1, ..., a11
  - Prediction of the middle symbol a6 using the neighboring symbols
  - The input symbols are ordered nearest-first: a5 a7 a4 a8 a3 a9 a2 a10 a1 a11.
  - MAX_DEPTH = 7 and MAX_PHI = 1
Classification of a Sequence using an SMT Prediction Model
- Compute the likelihood of an unknown sequence under each model.
- A sequence is classified into a family by computing the likelihood of the fit for each of the 175 models (see the sketch below).
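A sketch of the windowing and scoring (the model's predict(context) interface is an assumption): the conditioning symbols are taken nearest-first around the middle position, and a sequence's fit to one family model is, e.g., the sum of log-probabilities of the middle symbols over all windows.

```python
import math

def conditioning_order(window):
    """For an 11-symbol window a1..a11, return the target a6 and the
    conditioning symbols in the slide's order a5 a7 a4 a8 ... a1 a11."""
    assert len(window) == 11
    mid = 5                                  # 0-based index of a6
    context = []
    for d in range(1, 6):                    # nearest neighbours first
        context += [window[mid - d], window[mid + d]]
    return window[mid], context

def log_likelihood(model, sequence):
    """Score a sequence under one family's prediction model (sketch)."""
    ll = 0.0
    for i in range(len(sequence) - 10):
        target, context = conditioning_order(sequence[i:i + 11])
        ll += math.log(model.predict(context)[target])
    return ll

# ('A', ['A', 'G', 'L', 'I', 'V', 'V', 'K', 'R', 'M', 'T'])
print(conditioning_order("MKVLAAGIVRT"))
```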

Building the SMT Classifier Model
- Estimation of the probability over protein families given a sequence of amino acids
  - Input sequence: an amino acid sequence from a protein family
  - Output symbol: the protein family name
  - A sliding window of 10 amino acids: a1, ..., a10
  - MAX_DEPTH = 5 and MAX_PHI = 1
Classification of a Sequence using an SMT Classifier
- Each position of the sequence gives a probability distribution over the 175 families, measuring how likely it is that the substring originated from each family (a combination sketch follows).
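The classifier side can be sketched the same way (predict(window) returning a distribution over family names is an assumed interface, and summing log-probabilities is one plausible way to combine the per-position distributions; the paper's exact aggregation may differ).

```python
import math

def classify(model, sequence, families):
    """Combine per-window family distributions into one classification."""
    scores = {f: 0.0 for f in families}
    for i in range(len(sequence) - 9):
        window = sequence[i:i + 10]       # input symbols a1..a10
        dist = model.predict(window)      # distribution over family names
        for f in families:
            scores[f] += math.log(dist.get(f, 1e-12))  # floor for safety
    return max(scores, key=scores.get)    # best-scoring of the families
```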

Results
- Time-space-performance tradeoffs

Results of Protein Classification using SMTs
- The SMT models outperform the PST models.
- Performance: SMT Classifier > SMT Prediction > PST Prediction

Discussion
Sparse Markov Transducers (SMTs)
- We have presented two methods for protein classification using sparse Markov transducers (SMTs).
Future Work
- Incorporating biological information, such as Dirichlet mixture priors, into the model
- Combining generative and discriminative models
- Using both positive and negative examples in training