Applications of HMMs in Computational Biology BMI/CS 576 Colin Dewey Fall 2010.

Slides:



Advertisements
Similar presentations
Gene Prediction: Similarity-Based Approaches
Advertisements

Markov models and applications
GS 540 week 5. What discussion topics would you like? Past topics: General programming tips C/C++ tips and standard library BLAST Frequentist vs. Bayesian.
Hidden Markov Model in Biological Sequence Analysis – Part 2
BIOINFORMATICS GENE DISCOVERY BIOINFORMATICS AND GENE DISCOVERY Iosif Vaisman 1998 UNIVERSITY OF NORTH CAROLINA AT CHAPEL HILL Bioinformatics Tutorials.
HIDDEN MARKOV MODELS IN COMPUTATIONAL BIOLOGY CS 594: An Introduction to Computational Molecular Biology BY Shalini Venkataraman Vidhya Gunaseelan.
Hidden Markov Models in Bioinformatics
Ab initio gene prediction Genome 559, Winter 2011.
Profile Hidden Markov Models Bioinformatics Fall-2004 Dr Webb Miller and Dr Claude Depamphilis Dhiraj Joshi Department of Computer Science and Engineering.
MNW2 course Introduction to Bioinformatics
1 Profile Hidden Markov Models For Protein Structure Prediction Colin Cherry
1 DNA Analysis Amir Golnabi ENGS 112 Spring 2008.
Ka-Lok Ng Dept. of Bioinformatics Asia University
Hidden Markov models for detecting remote protein homologies Kevin Karplus, Christian Barrett, Richard Hughey Georgia Hadjicharalambous.
Profiles for Sequences
Hidden Markov Models (HMMs) Steven Salzberg CMSC 828H, Univ. of Maryland Fall 2010.
Gene Finding BCH364C/391L Systems Biology / Bioinformatics – Spring 2015 Edward Marcotte, Univ of Texas at Austin Edward Marcotte/Univ. of Texas/BCH364C-391L/Spring.
درس بیوانفورماتیک December 2013 مدل ‌ مخفی مارکوف و تعمیم ‌ های آن به نام خدا.
Hidden Markov Models in Bioinformatics Example Domain: Gene Finding Colin Cherry
Hidden Markov Models Sasha Tkachev and Ed Anderson Presenter: Sasha Tkachev.
HIDDEN MARKOV MODELS IN MULTIPLE ALIGNMENT. 2 HMM Architecture Markov Chains What is a Hidden Markov Model(HMM)? Components of HMM Problems of HMMs.
. Sequence Alignment via HMM Background Readings: chapters 3.4, 3.5, 4, in the Durbin et al.
Lyle Ungar, University of Pennsylvania Hidden Markov Models.
Master’s course Bioinformatics Data Analysis and Tools
HIDDEN MARKOV MODELS IN MULTIPLE ALIGNMENT
CSE182-L10 Gene Finding.
CSE182-L12 Gene Finding.
Comparative ab initio prediction of gene structures using pair HMMs
Genome evolution: a sequence-centric approach Lecture 3: From Trees to HMMs.
Eukaryotic Gene Finding
Markov models and applications Sushmita Roy BMI/CS 576 Oct 7 th, 2014.
CSE182-L10 MS Spec Applications + Gene Finding + Projects.
Eukaryotic Gene Finding
Probabilistic Sequence Alignment BMI 877 Colin Dewey February 25, 2014.
Alignment Statistics and Substitution Matrices BMI/CS 576 Colin Dewey Fall 2010.
Multiple Sequence Alignment BMI/CS 576 Colin Dewey Fall 2010.
Applications of HMMs Yves Moreau Overview Profile HMMs Estimation Database search Alignment Gene finding Elements of gene prediction Prokaryotes.
Gene Finding BIO337 Systems Biology / Bioinformatics – Spring 2014 Edward Marcotte, Univ of Texas at Austin Edward Marcotte/Univ. of Texas/BIO337/Spring.
CSCE555 Bioinformatics Lecture 6 Hidden Markov Models Meeting: MW 4:00PM-5:15PM SWGN2A21 Instructor: Dr. Jianjun Hu Course page:
MNW2 course Introduction to Bioinformatics Lecture 22: Markov models Centre for Integrative Bioinformatics FEW/FALW
Hidden Markov Models for Sequence Analysis 4
Genome Alignment. Alignment Methods Needleman-Wunsch (global) and Smith- Waterman (local) use dynamic programming Guaranteed to find an optimal alignment.
A markovian approach for the analysis of the gene structure C. MelodeLima 1, L. Guéguen 1, C. Gautier 1 and D. Piau 2 1 Biométrie et Biologie Evolutive.
HMMs for alignments & Sequence pattern discovery I519 Introduction to Bioinformatics.
Mark D. Adams Dept. of Genetics 9/10/04
Comp. Genomics Recitation 9 11/3/06 Gene finding using HMMs & Conservation.
Introduction to ab initio and evidence-based gene finding Wilson Leung08/2015.
Interpolated Markov Models for Gene Finding BMI/CS 776 Mark Craven February 2002.
CZ5226: Advanced Bioinformatics Lecture 6: HHM Method for generating motifs Prof. Chen Yu Zong Tel:
Genes and Genomes. Genome On Line Database (GOLD) 243 Published complete genomes 536 Prokaryotic ongoing genomes 434 Eukaryotic ongoing genomes December.
JIGSAW: a better way to combine predictions J.E. Allen, W.H. Majoros, M. Pertea, and S.L. Salzberg. JIGSAW, GeneZilla, and GlimmerHMM: puzzling out the.
Markov Chain Models BMI/CS 576 Colin Dewey Fall 2015.
Hidden Markov Model and Its Application in Bioinformatics Liqing Department of Computer Science.
(H)MMs in gene prediction and similarity searches.
1 Applications of Hidden Markov Models (Lecture for CS498-CXZ Algorithms in Bioinformatics) Nov. 12, 2005 ChengXiang Zhai Department of Computer Science.
Introducing Hidden Markov Models First – a Markov Model State : sunny cloudy rainy sunny ? A Markov Model is a chain-structured process where future states.
Hidden Markov Model Parameter Estimation BMI/CS 576 Colin Dewey Fall 2015.
More on HMMs and Multiple Sequence Alignment BMI/CS 776 Mark Craven March 2002.
1 Gene Finding. 2 “The Central Dogma” TranscriptionTranslation RNA Protein.
Hidden Markov Models BMI/CS 576
Genome Annotation (protein coding genes)
Markov Chain Models BMI/CS 776
What is a Hidden Markov Model?
Interpolated Markov Models for Gene Finding
Eukaryotic Gene Finding
Ab initio gene prediction
HIDDEN MARKOV MODELS IN COMPUTATIONAL BIOLOGY
CISC 667 Intro to Bioinformatics (Fall 2005) Hidden Markov Models (IV)
Profile HMMs GeneScan TMMOD
4. HMMs for gene finding HMM Ability to model grammar
Presentation transcript:

Applications of HMMs in Computational Biology BMI/CS Colin Dewey Fall 2010

The Gene Finding Task Given: an uncharacterized DNA sequence Do: locate the genes in the sequence, including the coordinates of individual exons and introns image from the UCSC Genome Browser

Eukaryotic Gene Structure

Each shape represents a functional unit of a gene or genomic region Pairs of intron/exon units represent the different ways an intron can interrupt a coding sequence (after 1 st base in codon, after 2 nd base or after 3 rd base) Complementary submodel (not shown) detects genes on opposite DNA strand The GENSCAN HMM for Eukaryotic Gene Finding [Burge & Karlin ‘97]

The GENSCAN HMM for each sequence type, GENSCAN models –the length distribution –the sequence composition length distribution models vary depending on sequence type –nonparametric (using histograms) –parametric (using geometric distributions) –fixed-length sequence composition models vary depending on type –5 th -order, inhomogeneous –5 th -order homogenous –0 th and 1 st -order inhomogeneous –tree-structured variable memory

Human Intron & Exon Lengths

Representing Exons in GENSCAN for exons, GENSCAN uses –Histograms to represent exon lengths – 5 th -order, inhomogeneous Markov models to represent exon sequences 5 th -order, inhomogeneous models can represent statistics about pairs of neighboring codons

Inference with the Gene-Finding HMM given: an uncharacterized DNA sequence do: find the most probable path through the model for the sequence This path will specify the coordinates of the predicted genes (including intron and exon boundaries) The Viterbi algorithm is used to compute this path

Parsing a DNA Sequence ACCGTTACGTGTCATTCTACGTGATCATCGGATCCTAGAATCATCGATCCGTGCGATCGATCGGATTAGCTAGCTTAGCTAGGAGAGCATCGATCGGATCGAGGAGGAGCCTATATAAATCAA The Viterbi path represents a parse of a given sequence, predicting exons, introns, etc

Assessing the Accuracy of a Trained Model two issues –What data should we use? –Which metrics should we use? Can we measure accuracy on the data set that was used to train the model? NO! This will result in accuracy estimates that are biased (too high).

Assessing the Accuracy of a Trained Model need to have a test set that is disjoint from the training set more generally, can use cross validation 3-fold CV illustrated Figure from

Accuracy true positives (TP) true negatives (TN) false positives (FP) false negatives (FN) positive negative positivenegative predicted class actual class

Accuracy Metrics true positives (TP) true negatives (TN) false positives (FP) false negatives (FN) positive negative positivenegative predicted class actual class sometimes specificity is defined this way

Accuracy of GENSCAN (and TWINSCAN) Figure from Flicek et al., Genome Research, 2003 genes exactly correct? exons exactly correct? nucleotides correct?

The Protein Classification Task Given: amino-acid sequence of a protein Do: predict the family to which it belongs GDLSTPDAVMGNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVCVLAHHFGKEFTPPVQAAYAKVVAGVANALAHKYH

Alignment of Globin Family Proteins The sequences in a family may vary in length Some positions are more conserved than others

Profile HMMs i 2 i 3 i 1 i 0 d 1 d 2 d 3 m 1 m 3 m 2 startend Match states represent key conserved positions Insert states account for extra characters in some sequences Delete states are silent; they Account for characters missing in some sequences profile HMMs are commonly used to model families of sequences A0.01 R0.12 D0.04 N0.29 C0.01 E0.03 Q0.02 G0.01 Insert and match states have emission distributions over sequence characters

Multiple Alignment of SH3 Domain Figure from A. Krogh, An Introduction to Hidden Markov Models for Biological Sequences

A Profile HMM Trained for the SH3 Domain Figure from A. Krogh, An Introduction to Hidden Markov Models for Biological Sequences match states delete states (silent) insert states

Profile HMMs To classify sequences according to family, we can train a profile HMM to model the proteins of each family of interest Given a sequence x, use Bayes’ rule to make classification β-galactosidase β-glucanase β-amylase α-amylase

Profile HMM Accuracy Figure from Jaakola et al., ISMB 1999 BLAST-based methods profile HMM-based methods classifying 2447 proteins into 33 families x-axis represents the median # of negative sequences that score as high as a positive sequence for a given family’s model

Other Issues in Markov Models there are many interesting variants and extensions of the models/algorithms we considered here (some of these are covered in BMI/CS 776) –separating length/composition distributions with semi- Markov models –modeling multiple sequences with pair HMMs –learning the structure of HMMs –going up the Chomsky hierarchy: stochastic context free grammars –discriminative learning algorithms (e.g. as in conditional random fields) –etc.