Generalizations of Markov model to characterize biological sequences


Generalizations of Markov model to characterize biological sequences
Authors: Junwen Wang and Sridhar Hannenhalli
CISC841: Bioinformatics
Presented by: Nikhil Shirude, November 20, 2007

Outline
- Motivation
- Model Implementation
  - Training
  - Testing
- Results
- Challenges
- Conclusion

Motivation
- Markov model: a statistical technique for modeling sequences in which the probability of a sequence element depends on a limited context preceding that element.
- Current kth-order Markov models generate a single base (model unit size = 1) according to a probability distribution conditioned on the k bases immediately preceding the generated base (gap = 0).
- Used in DNA sequence recognition problems such as promoter and gene prediction.
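As a concrete illustration, a standard kth-order Markov chain can be trained by counting each base together with its k-base context. This is a minimal sketch (function and variable names are ours, not the authors'):

```python
from collections import defaultdict

def train_kth_order(seqs, k):
    # Count (context, next-base) pairs: the context is the k bases
    # immediately preceding the generated base
    counts = defaultdict(lambda: defaultdict(int))
    for s in seqs:
        for i in range(k, len(s)):
            counts[s[i - k:i]][s[i]] += 1
    # Normalize counts into conditional probabilities P(base | context)
    return {ctx: {b: n / sum(c.values()) for b, n in c.items()}
            for ctx, c in counts.items()}

model = train_kth_order(["ACGTACGTACGT"], 2)
# In this toy sequence, "G" always follows the context "AC"
```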

Motivation (cont'd)
- Longer-range dependencies and joint dependencies of neighboring bases have been observed in protein and DNA sequences.
- The CG di-nucleotide characterizes CpG islands, so a model with unit size 2 is appropriate to capture this joint dependency.
- Longer-range dependencies (gap > 0) are useful for modeling the periodicity of the helix pattern.

Model Implementation
- Generalized Markov Model (GMM): a configurable tool that allows for these generalizations.
- Posterior bases: the bases whose probability is to be computed.
- Prior bases: the bases on which that probability is conditioned.
- Six parameters specify the Markov model.
- Other parameters include the type of biological sequence, the threshold on the minimum prior count below which a k-mer is eliminated, and the pseudo-count for k-mers absent from the training set.

Model Implementation (cont'd)
The window is laid out as O prior units U1 ... UO (each of size L1, separated by gaps of g1), then a gap of G, then the posterior bases X1 ... XL2 (separated by gaps of g2). Parameters:
- L1: model unit size in the prior
- O: order, i.e., the number of prior units
- g1: spacing between prior units
- L2: model unit size in the posterior
- g2: spacing between posterior bases
- G: gap between prior and posterior

Model Implementation (cont'd): Examples
- A gap of length 2 within the posterior of an amino-acid model captures the joint dependency between the first and fourth residues, which are likely to form a hydrogen bond vital to the protein helix structure.
- For a model in which each tri-nucleotide depends on the previous 4 bases, set the configurable parameters as L1=4, O=1, L2=3, g1=g2=G=0.
- To use the 4 bases after skipping the immediately preceding 3 bases, set G=3.

Training
- A k-mer refers to a specific nucleotide or amino-acid sequence that can be used to identify particular regions within biomolecules such as DNA or proteins.
- For statistical robustness, only k-mers above a certain frequency threshold in the positive sequences are considered.
- For the current model, the default frequency threshold for positive sequences is 300; for nucleosome sequences it is 50, owing to the smaller data set.

Training (cont'd)
- Slide a window one base at a time along the training sequence; for each window, extract the words corresponding to the prior and the posterior.
- Window size = L1*O + g1*(O-1) + G + L2 + g2*(L2-1)
- Example with user-defined parameters L1=1, O=6, L2=2, g1=0, G=1, g2=1: the window size is 10. For the window ACTGATGCAG, the di-nucleotide CG represents the posterior.
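The window-size formula and the extraction step can be sketched as follows; `split_window` is a hypothetical helper that assumes the layout from the parameter slide (prior units first, then the gap G, then the posterior bases):

```python
def window_size(L1, O, g1, G, L2, g2):
    # Prior units + gaps between them + gap G + posterior bases + their gaps
    return L1 * O + g1 * (O - 1) + G + L2 + g2 * (L2 - 1)

def split_window(win, L1, O, g1, G, L2, g2):
    # Prior: O units of length L1, each separated by g1 bases
    prior = "".join(win[i * (L1 + g1): i * (L1 + g1) + L1] for i in range(O))
    # Posterior: L2 single bases, each separated by g2 bases, after the gap G
    start = L1 * O + g1 * (O - 1) + G
    posterior = "".join(win[start + j * (1 + g2)] for j in range(L2))
    return prior, posterior

# Slide example: L1=1, O=6, L2=2, g1=0, G=1, g2=1 gives a window of size 10
prior, posterior = split_window("ACTGATGCAG", 1, 6, 0, 1, 2, 1)
```

On the slide's example window ACTGATGCAG this yields the prior ACTGAT and the posterior CG (the base G at position 7 is skipped by the gap, and the A between C and G is skipped by g2=1).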

Training (cont'd)
- Increment the k-mer counts: ACTGATCG (6th order), CTGATCG (5th order), ..., down to CG (0th order).
- Thus 7 sub-models are present, one for each order.
- After processing the training sequences, calculate the transition probabilities from the k-mer counts:
  - For the 0th order, the probability is the composition of the L2-mers.
  - For higher orders, compute the sum of the frequencies of all k-mers of that form (e.g., for the 4th-order TGATCG, compute the sum of the frequencies of all hexamers of the form TGAT**).
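The per-order count update can be sketched as a toy illustration (not the paper's code):

```python
from collections import defaultdict

def count_all_orders(prior, posterior, counts):
    # One increment per sub-model: the full prior, then successively
    # shorter suffixes of it, down to the posterior alone (0th order)
    for order in range(len(prior), -1, -1):
        counts[order][prior[len(prior) - order:] + posterior] += 1

counts = defaultdict(lambda: defaultdict(int))
count_all_orders("ACTGAT", "CG", counts)
# counts now holds 7 sub-models: ACTGATCG (6th), CTGATCG (5th), ..., CG (0th)
```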

Training (cont'd)
- If the sum exceeds the threshold, calculate the probability by dividing the count of that k-mer by the sum.
- Otherwise, the program automatically falls back to the (k-1)-mer.
- Finally, convert the probability of each k-mer into a log-odds score.
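The threshold test and the fallback to the (k-1)-mer might look like this sketch, where `counts[order]` maps full k-mers to their training frequencies (the names and the structure of `counts` are our assumptions; 300 is the slide's default promoter threshold):

```python
import math

def transition_prob(counts, prior, posterior, threshold=300):
    # Try the full prior first; back off one base at a time whenever the
    # summed context frequency does not exceed the threshold
    for order in range(len(prior), -1, -1):
        ctx = prior[len(prior) - order:]
        total = sum(n for kmer, n in counts[order].items() if kmer[:order] == ctx)
        if total > threshold or order == 0:
            return counts[order].get(ctx + posterior, 0) / total if total else 0.0
    return 0.0

def log_odds(p_pos, p_neg):
    # Final score: log-odds of the positive-model vs. negative-model probability
    return math.log(p_pos / p_neg)
```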

Testing
- The program reads the model, i.e., the k-mer log-odds scores.
- Scoring proceeds in the same sliding-window fashion: to score a window, consider the highest order first; if the string exists in the table, use its score, otherwise look for the string corresponding to the next lower order.
- The sequence score is obtained by adding all the window scores.
- Example: to score ACTGATGCAG, first look up the 6th-order dependence, i.e., ACTGATCG, in the 8-mer table; then look for the 5th order, and so on down to the 0th order.
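The highest-order-first lookup can be sketched as below; `kmers_by_order` and `tables` are hypothetical structures mapping each order to its candidate k-mer and to its log-odds score table, respectively:

```python
def score_window(kmers_by_order, tables):
    # Use the highest-order k-mer present in the score tables;
    # otherwise fall back to the next lower order
    for order in sorted(kmers_by_order, reverse=True):
        kmer = kmers_by_order[order]
        if kmer in tables.get(order, {}):
            return tables[order][kmer]
    return 0.0

def score_sequence(windows, tables):
    # Sequence score = sum of the per-window log-odds scores
    return sum(score_window(w, tables) for w in windows)

# Toy example: the 6th-order 8-mer is known for the first window only,
# so the second window falls back to its 0th-order di-nucleotide
tables = {6: {"ACTGATCG": 1.5}, 0: {"CG": 0.2, "AG": -0.1}}
s = score_sequence([{6: "ACTGATCG", 0: "CG"}, {6: "CTGATGAG", 0: "AG"}], tables)
```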

Results
Tested on:
- Human promoter sequences
  - CpG-poor promoters
  - All promoters
- Human exon dataset
- Nucleosome positioning sequences

Model Evaluation
- 10-fold cross-validation was used to train and test the models: sequences were partitioned into 10 equal parts, and each part was tested after training on the other 9 parts.
- Once the models were trained, scores were calculated on the training set, and a score cutoff was chosen from the specificity-sensitivity curve, i.e., the cutoff giving the best correlation coefficient on the training set.
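The partitioning step might be sketched as follows (an illustrative version, not the authors' protocol; the strided split is our choice):

```python
def ten_fold_splits(seqs, folds=10):
    # Partition into `folds` roughly equal parts; each part in turn is the
    # held-out test set, and the remaining parts form the training set
    parts = [seqs[i::folds] for i in range(folds)]
    for i in range(folds):
        train = [s for j, p in enumerate(parts) if j != i for s in p]
        yield train, parts[i]

splits = list(ten_fold_splits(list(range(100))))
```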

Model Evaluation (cont'd)
- Score the independent test set and apply this cutoff to obtain the CC values; report the mean and standard deviation over the 10 CC values.
- Sensitivity: Sn = TP / (TP + FN)
- Specificity: Sp = TP / (TP + FP)
- Correlation coefficient: CC = (TP*TN - FP*FN) / sqrt((TP+FP)*(TP+FN)*(TN+FP)*(TN+FN))
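These three formulas translate directly into code:

```python
import math

def classification_metrics(TP, FP, TN, FN):
    # Sensitivity and specificity as defined on the slide
    # (note: Sp here is TP/(TP+FP), i.e., what is usually called precision)
    Sn = TP / (TP + FN)
    Sp = TP / (TP + FP)
    # Correlation coefficient; guard against a zero denominator
    denom = math.sqrt((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))
    CC = (TP * TN - FP * FN) / denom if denom else 0.0
    return Sn, Sp, CC

Sn, Sp, CC = classification_metrics(TP=50, FP=10, TN=30, FN=10)
```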

Model Evaluation (cont'd)
The total number of prior bases is 6 for all three models; classification accuracy for the three sequence classes was tested using these configurations:
- 6th-order single-nucleotide model: L1 = L2 = 1, O = 6, g1 = G = g2 = 0
- 3rd-order di-nucleotide model: L1 = L2 = 2, O = 3, g1 = G = g2 = 0
- 2nd-order tri-nucleotide model: L1 = L2 = 3, O = 2, g1 = G = g2 = 0

Model Evaluation (cont'd)
Classification of CpG-poor promoters (CC, mean ± s.d.):

Sample (size)              | Single nucleotide | Di-nucleotide | Tri-nucleotide
CpG-poor promoters (1,466) | 0.24 ± 0.05       | 0.28 ± 0.03   | 0.34 ± 0.04

Model Evaluation (cont'd)
Classification of all promoters and of exons (CC, mean ± s.d.):

Sample (size)          | Single nucleotide | Di-nucleotide | Tri-nucleotide
All promoters (12,333) | 0.54 ± 0.02       | 0.54 ± 0.03   | 0.56 ± 0.02
All exons (219,624)    | 0.63 ± 0.00       | 0.64 ± 0.00   | 0.67 ± 0.00

Model Evaluation (cont'd)
- Classification of nucleosome positioning sequences (112 sequences).
- Best classification accuracy at G = 4, 15, and 25; worst at G = 7 and 18.

Model Evaluation (cont'd): Run-time comparison of the three models
- Training time: 55.8 minutes for the single-nucleotide model, reduced to 23.8 minutes for the di-nucleotide model and 18.9 minutes for the tri-nucleotide model.
- Testing time: reduced from 22.9 minutes to 15.4 and 14.0 minutes for the di-nucleotide and tri-nucleotide models, respectively.

Conclusion
- A configurable tool to explore generalizations of Markov models that incorporate joint and long-range dependencies of sequence elements.
- Evaluated on 4 classes of sequences, comparing two special cases (the di-nucleotide and the tri-nucleotide model) against the traditional single-nucleotide model.
- The evaluation shows improved classification accuracy for the di- and tri-nucleotide models.
- The software also runs faster for the di- and tri-nucleotide models.

Thank You!!!!