Applying Hidden Markov Models to Bioinformatics

Outline What are Hidden Markov Models? Why are they a good tool for Bioinformatics? Applications in Bioinformatics

Statistical Models Definition: any mathematical construct that attempts to parameterize a random process. Example: a normal distribution, with its assumptions, parameters, estimation procedure and usage. HMMs are just a little more complicated…
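
For instance, the normal distribution in the example is fully described by two parameters, a mean μ and a standard deviation σ, which are estimated from data (a standard formula, not from the slides):

$$ f(x \mid \mu, \sigma) \;=\; \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left( -\frac{(x - \mu)^2}{2\sigma^2} \right) $$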

History of Hidden Markov Models HMMs were first described in a series of statistical papers by Leonard E. Baum and other authors in the second half of the 1960s. One of the first applications of HMMs was speech recognition, starting in the mid-1970s; they are commonly used in speech recognition systems to help determine the words represented by captured sound waveforms. In the second half of the 1980s, HMMs began to be applied to the analysis of biological sequences, in particular DNA. Since then, they have become ubiquitous in bioinformatics.

What are Hidden Markov Models? HMM: a formal foundation for making probabilistic models of linear sequence 'labeling' problems. They provide a conceptual toolkit for building complex models just by drawing an intuitive picture.

What are Hidden Markov Models? A machine learning approach in bioinformatics. Machine learning algorithms are presented with training data, which are used to derive important insights about the (often hidden) parameters. Once an algorithm has been trained, it can apply these insights to the analysis of a test sample. As the amount of training data increases, the accuracy of the machine learning algorithm typically increases as well.

Hidden Markov Models An HMM has N states, called S1, S2, ... SN. There are discrete timesteps, t = 0, t = 1, ... (Diagram: three states S1, S2, S3; N = 3, t = 0.)

Hidden Markov Models An HMM has N states, called S1, S2, ... SN. There are discrete timesteps, t = 0, t = 1, ... At each timestep, the system is in exactly one of the available states. (Diagram: three states S1, S2, S3; N = 3, t = 0.)
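
The "Markov" part is the assumption that, writing q_t for the state occupied at timestep t, the next state depends only on the current one (a standard statement of the Markov property, not taken from the slides):

$$ P(q_{t+1} = S_j \mid q_t, q_{t-1}, \ldots, q_0) \;=\; P(q_{t+1} = S_j \mid q_t) $$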

Hidden Markov Models (Diagram: the states S1, S2, S3 drawn as a Bayesian network with time slices.)

A Markov Chain Bayes' theorem (statistics): a theorem describing how the conditional probability of a set of possible causes for a given observed event can be computed from knowledge of the probability of each cause and the conditional probability of the outcome given each cause.
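
In symbols (a standard statement of the theorem, not taken from the slides):

$$ P(\text{cause} \mid \text{event}) \;=\; \frac{P(\text{event} \mid \text{cause})\, P(\text{cause})}{P(\text{event})}, \qquad P(\text{event}) \;=\; \sum_{\text{causes}} P(\text{event} \mid \text{cause})\, P(\text{cause}) $$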

Building a Markov Chain Concrete Example Two friends, Alice and Bob, live far apart from each other and talk together daily over the telephone about what they did that day. Bob is only interested in three activities: walking in the park, shopping, and cleaning his apartment. The choice of what to do is determined exclusively by the weather on a given day. Alice has no definite information about the weather where Bob lives, but she knows general trends. Based on what Bob tells her he did each day, Alice tries to guess what the weather must have been like. Alice believes that the weather operates as a discrete Markov chain. There are two states, "Rainy" and "Sunny", but she cannot observe them directly; that is, they are hidden from her. On each day, there is a certain chance that Bob will perform one of the following activities, depending on the weather: "walk", "shop", or "clean". Since Bob tells Alice about his activities, those are the observations. Source: Wikipedia.org
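
A minimal sketch of how this example can be written down as data in Python. The state and observation names come from the story above; the specific probability values are illustrative placeholders (roughly the ones used in the Wikipedia article), not numbers taken from these slides.

    # Hidden states and observable activities from the Alice-and-Bob example
    states = ('Rainy', 'Sunny')
    observations = ('walk', 'shop', 'clean')

    # Alice's general knowledge of the weather (illustrative numbers)
    start_p = {'Rainy': 0.6, 'Sunny': 0.4}

    # How the weather tends to change from one day to the next
    trans_p = {
        'Rainy': {'Rainy': 0.7, 'Sunny': 0.3},
        'Sunny': {'Rainy': 0.4, 'Sunny': 0.6},
    }

    # What Bob tends to do in each kind of weather
    emit_p = {
        'Rainy': {'walk': 0.1, 'shop': 0.4, 'clean': 0.5},
        'Sunny': {'walk': 0.6, 'shop': 0.3, 'clean': 0.1},
    }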

Hidden Markov Models

Building a Markov Chain

What now? * Find out the most probable state path for the observed sequence: the Viterbi algorithm, a dynamic programming algorithm for finding the most likely sequence of hidden states – called the Viterbi path – that results in a given sequence of observed events.
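
A sketch of the Viterbi algorithm in Python, written against the nested-dict representation from the weather example above (states, start_p, trans_p and emit_p are assumed to be defined as in that sketch).

    def viterbi(observations, states, start_p, trans_p, emit_p):
        """Most likely hidden state path for a sequence of observations.
        A standard dynamic-programming sketch over the nested-dict tables."""
        # V[t][s]: probability of the best path that ends in state s at time t
        V = [{s: start_p[s] * emit_p[s][observations[0]] for s in states}]
        back = [{}]
        for t in range(1, len(observations)):
            V.append({})
            back.append({})
            for s in states:
                prob, prev = max(
                    (V[t - 1][p] * trans_p[p][s] * emit_p[s][observations[t]], p)
                    for p in states
                )
                V[t][s] = prob
                back[t][s] = prev
        # Trace back from the best final state
        prob, state = max((V[-1][s], s) for s in states)
        path = [state]
        for t in range(len(observations) - 1, 0, -1):
            state = back[t][state]
            path.insert(0, state)
        return prob, path

For example, viterbi(['walk', 'shop', 'clean'], states, start_p, trans_p, emit_p) returns the single most likely weather sequence for those three days, together with its probability.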

Viterbi Results

Gene Finding (An Ideal HMM Domain) Our Objective: To find the coding and non-coding regions of an unlabeled string of DNA nucleotides Our Motivation: Assist in the annotation of genomic data produced by genome sequencing methods Gain insight into the mechanisms involved in transcription, splicing and other processes

Gene Finding Terminology A string of DNA nucleotides containing a gene will have separate regions: Introns – non-coding regions within a gene. Exons – coding regions. These regions are separated by functional sites: start and stop codons, and splice sites (acceptors and donors).

Gene Finding Challenges Need the correct reading frame Introns can interrupt an exon in mid-codon There is no hard and fast rule for identifying donor and acceptor splice sites Signals are very weak

Bioinformatics Example Assume we are given a DNA sequence that begins in an exon, contains one 5' splice site and ends in an intron. Identify where the switch from exon to intron occurs: where is the splice site?

Bioinformatics Example In order for us to guess, the sequences of exons, splice sites and introns must have different statistical properties. Let's say... Exons have a uniform base composition on average: A/C/T/G at 25% each. Introns are A/T rich: A and T at 40% each, C and G at 10% each. The 5' splice site consensus nucleotide is almost always a G: G 95%, A 5%.

Bioinformatics Example We can build a Hidden Markov Model. We have three states: "E" for exon, "5" for the 5' splice site (SS), "I" for intron. Each state has its own emission probabilities, which model the base composition of exons, introns and the consensus G at the 5' SS. Each state also has transition probabilities (arrows).
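
As a sketch, here is that toy model written in the same nested-dict style as the weather example. The emission probabilities are the ones given two slides back; the transition probabilities are made-up placeholders for illustration, since the slides only show them as arrows.

    # States: 'E' = exon, '5' = 5' splice site, 'I' = intron
    gene_states = ('E', '5', 'I')

    # Emission probabilities from the slide above
    gene_emit_p = {
        'E': {'A': 0.25, 'C': 0.25, 'G': 0.25, 'T': 0.25},  # uniform in exons
        '5': {'A': 0.05, 'G': 0.95},                         # consensus G at the splice site
        'I': {'A': 0.40, 'T': 0.40, 'C': 0.10, 'G': 0.10},   # A/T-rich introns
    }

    # Transition probabilities (illustrative placeholders, not from the slides)
    gene_trans_p = {
        'E': {'E': 0.9, '5': 0.1},   # stay in the exon or hit the splice site
        '5': {'I': 1.0},             # the splice site is a single position
        'I': {'I': 1.0},             # stay in the intron until the sequence ends
    }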

HMM: A Bioinformatics Visual We can use HMMs to generate a sequence. When we visit a state, we emit a nucleotide based on the state's emission probability distribution. We also choose a state to visit next according to the state's transition probability distribution. We generate two strings of information: the observed sequence and the underlying state path.
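
A sketch of that generating process in Python, using the dictionaries from the previous sketch; the function name and the fixed length argument are assumptions made for illustration.

    import random

    def generate(trans_p, emit_p, start_state, length):
        """Sample an (observed sequence, state path) pair from an HMM.
        At each position we emit a symbol from the current state's emission
        distribution, then pick the next state from its transition
        distribution, exactly as the slide describes."""
        state, sequence, path = start_state, [], []
        for _ in range(length):
            symbols, weights = zip(*emit_p[state].items())
            sequence.append(random.choices(symbols, weights)[0])   # emit a nucleotide
            path.append(state)
            next_states, weights = zip(*trans_p[state].items())
            state = random.choices(next_states, weights)[0]        # move to the next state
        return ''.join(sequence), ''.join(path)

    # e.g. generate(gene_trans_p, gene_emit_p, 'E', 26) yields one observed
    # DNA string and the hidden E/5/I path that produced it.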

HMM: A Bioinformatics Visual The state path is a Markov chain. Since we're only given the observed sequence, this underlying state path is a hidden Markov chain. Therefore... we can apply Bayesian probability.

HMM: A Bioinformatics Visual S – observed sequence. π – state path. Θ – parameters. The probability P(S, π | HMM, Θ) is the product of all emission probabilities and transition probabilities. Let's look at an example...
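
Written out in one standard notation (with a for transition probabilities, e for emission probabilities, and π_0 denoting the begin state), the joint probability of a sequence S of length L and a state path π is:

$$ P(S, \pi \mid \Theta) \;=\; \prod_{i=1}^{L} a_{\pi_{i-1}\pi_i}\, e_{\pi_i}(S_i) $$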

HMM: A Bioinformatics Visual There are 27 transitions and 26 emissions. Multiply all 53 probabilities together (and take the log, since these are small numbers) and you'll calculate log P(S, π | HMM, Θ).
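
A sketch of that calculation in Python, using the gene-model dictionaries defined earlier. For simplicity it ignores the extra transitions from the begin state and into the end state, which is why the slide counts 27 transitions for 26 emissions.

    import math

    def log_joint(sequence, path, trans_p, emit_p):
        """log P(S, pi | model) for one particular state path:
        the sum of log transition and log emission probabilities."""
        assert len(sequence) == len(path)
        logp = math.log(emit_p[path[0]][sequence[0]])          # first emission
        for i in range(1, len(sequence)):
            logp += math.log(trans_p[path[i - 1]][path[i]])    # transition into state i
            logp += math.log(emit_p[path[i]][sequence[i]])     # emission at position i
        return logp

    # e.g. log_joint(dna_string, state_string, gene_trans_p, gene_emit_p)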

HMM: A Bioinformatics Visual The model parameters and overall sequence scores are all probabilities. Therefore we can use Bayesian probability theory to manipulate these numbers in standard, powerful ways, including optimizing parameters and interpreting the significance of scores.

HMM: A Bioinformatics Visual Posterior Decoding: an alternative state path has the splice site falling on the 6th G instead of the 5th (log probabilities of … versus …). How confident are we that the fifth G is the right choice?

HMM: A Bioinformatics Visual We can calculate our confidence directly. The probability that nucleotide i was emitted by state k is the sum of the probabilities of all state paths that use state k to generate i, normalized by the sum over all possible state paths. Result: we get a probability of 46% that the best-scoring fifth G is correct and 28% that the sixth G position is correct.
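
In symbols, using the notation introduced earlier, the posterior probability that position i was generated by state k is:

$$ P(\pi_i = k \mid S, \Theta) \;=\; \frac{\sum_{\pi \,:\, \pi_i = k} P(S, \pi \mid \Theta)}{\sum_{\pi} P(S, \pi \mid \Theta)} $$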

Further Possibilities The toy model provided by the article is a simple example, but we can go further: we could add a more realistic consensus GTRAGT at the 5' splice site, putting a row of six HMM states in place of the '5' state to model a six-base ungapped consensus motif. The possibilities are not limited to this.

The catch HMMs don't deal well with correlations between nucleotides, because they assume that each emitted nucleotide depends only on one underlying state. Example of a bad use for an HMM: conserved RNA base pairs, which induce long-range pairwise correlations; one position might be any nucleotide, but the base-paired partner must be complementary. An HMM state path has no way of 'remembering' what a distant state generated.


What makes a good HMM problem space? Characteristics: classification problems. There are two main types of output from an HMM: scoring of sequences (protein family modeling) and labeling of observations within a sequence (gene finding).

HMM Problem Characteristics Continued The observations in a sequence should have a clear and meaningful order; unordered observations will not map easily to states. It's beneficial, but not necessary, for the observations to follow some sort of grammar – it makes it easier to design an architecture. Examples: gene finding, protein family modeling.

HMM Requirements So you've decided you want to build an HMM; here's what you need: An architecture – probably the hardest part; it should be biologically sound and easy to interpret. A well-defined success measure – necessary for any form of machine learning.

HMM Requirements Continued Training data Labeled or unlabeled – it depends You do not always need a labeled training set to do observation labeling, but it helps Amount of training data needed is: Directly proportional to the number of free parameters in the model Inversely proportional to the size of the training sequences

Why HMMs might be a good fit for Gene Finding Classification: classifying observations within a sequence. Order: a DNA sequence is a set of ordered observations. Grammar / Architecture: the exon-intron-splice-site structure gives us our grammatical structure (and the beginnings of our architecture). Success measure: the number of complete exons correctly labeled. Training data: available from various genome annotation projects.

HMM Advantages Statistical Grounding Statisticians are comfortable with the theory behind hidden Markov models Freedom to manipulate the training and verification processes Mathematical / theoretical analysis of the results and processes HMMs are still very powerful modeling tools – far more powerful than many statistical methods

HMM Advantages continued Modularity HMMs can be combined into larger HMMs Transparency of the Model Assuming an architecture with a good design People can read the model and make sense of it The model itself can help increase understanding

HMM Advantages continued Incorporation of Prior Knowledge Incorporate prior knowledge into the architecture Initialize the model close to something believed to be correct Use prior knowledge to constrain training process

How does Gene Finding make use of HMM advantages? Statistics: Many systems alter the training process to better suit their success measure Modularity: Almost all systems use a combination of models, each individually trained for each gene region Prior Knowledge: A fair amount of prior biological knowledge is built into each architecture

HMM Disadvantages Markov Chains Distant positions are supposed to be independent: P(y) must be independent of P(x), and vice versa, for positions x and y far apart. This usually isn't true. We can get around it when relationships are local, but it is not good for RNA folding problems.
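
Stated as an equation, in the notation used earlier: once the state path is fixed, the probability of the observed sequence factorizes over positions, so no emission can depend on what a distant state emitted.

$$ P(S \mid \pi, \Theta) \;=\; \prod_{i=1}^{L} e_{\pi_i}(S_i) $$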

HMM Disadvantages continued Standard Machine Learning Problems Watch out for local maxima Model may not converge to a truly optimal parameter set for a given training set Avoid over-fitting You’re only as good as your training set More training is not always good

HMM Disadvantages continued Speed!!! Almost everything one does in an HMM involves: “enumerating all possible paths through the model” There are efficient ways to do this Still slow in comparison to other methods

HMM Gene Finders: VEIL A straight HMM Gene Finder Takes advantage of grammatical structure and modular design Uses many states that can only emit one symbol to get around state independence

HMM Gene Finders: HMMGene Uses an extended HMM called a CHMM CHMM = HMM with classes Takes full advantage of being able to modify the statistical algorithms Uses high-order states Trains everything at once

HMM Gene Finders: Genie Uses a generalized HMM (GHMM) Edges in model are complete HMMs States can be any arbitrary program States are actually neural networks specially designed for signal finding

Conclusions HMMs have problems where they excel, and problems where they do not You should consider using one if: Problem can be phrased as classification Observations are ordered The observations follow some sort of grammatical structure (optional)

Conclusions Advantages: Statistics Modularity Transparency Prior Knowledge Disadvantages: State independence Over-fitting Local Maximums Speed

Some final words… Lots of problems can be phrased as classification problems: homology search, sequence alignment. If an HMM does not fit, there are all sorts of other methods to try from ML/AI: neural networks, decision trees, probabilistic reasoning and support vector machines have all been applied to bioinformatics.