Applying Hidden Markov Models to Bioinformatics


Applying Hidden Markov Models to Bioinformatics Conor Buckley

Outline What are Hidden Markov Models? Why are they a good tool for Bioinformatics? Applications in Bioinformatics

History of Hidden Markov Models HMMs were first described in a series of statistical papers by Leonard E. Baum and other authors in the second half of the 1960s. One of the first applications of HMMs was speech recognition, starting in the mid-1970s; they are commonly used in speech recognition systems to help determine the words represented by captured sound waveforms. In the second half of the 1980s, HMMs began to be applied to the analysis of biological sequences, in particular DNA. Since then, they have become ubiquitous in bioinformatics. Source: http://en.wikipedia.org/wiki/Hidden_Markov_model#History

What are Hidden Markov Models? HMM: A formal foundation for making probabilistic models of linear sequence 'labeling' problems. They provide a conceptual toolkit for building complex models just by drawing an intuitive picture. Source: http://www.nature.com/nbt/journal/v22/n10/full/nbt1004-1315.html#B1

What are Hidden Markov Models? A machine learning approach in bioinformatics: machine learning algorithms are presented with training data, which are used to derive important insights about the (often hidden) parameters. Once an algorithm has been trained, it can apply these insights to the analysis of a test sample. As the amount of training data increases, the accuracy of the machine learning algorithm typically increases as well. Source: http://www.nature.com/nbt/journal/v22/n10/full/nbt1004-1315.html#B1

Hidden Markov Models An HMM has N states, called S1, S2, ..., SN, and there are discrete timesteps, t = 0, t = 1, ... (Diagram: three states S1, S2, S3, with N = 3 and t = 0.) Source: http://www.autonlab.org/tutorials/hmm.html

Hidden Markov Models An HMM has N states, called S1, S2, ..., SN, and there are discrete timesteps, t = 0, t = 1, ... At each timestep, the system is in exactly one of the available states. (Diagram: three states S1, S2, S3, with N = 3 and t = 0.)

Hidden Markov Models An HMM can be seen as a Bayesian network with time slices. (Diagram: states S1, S2, S3 unrolled over time.) Bayesian network image: http://en.wikipedia.org/wiki/File:Hmm_temporal_bayesian_net.svg

A Markov Chain Bayes' theorem (statistics): a theorem describing how the conditional probability of a set of possible causes for a given observed event can be computed from knowledge of the probability of each cause and the conditional probability of the outcome of each cause. Source: http://wordnetweb.princeton.edu/perl/webwn?s=bayes%27%20theorem
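In symbols, with H a hypothesized cause and O the observed event (the letters are our notation, not the source's):

```latex
P(H \mid O) = \frac{P(O \mid H)\, P(H)}{P(O)}
```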

Building a Markov Chain A concrete example: two friends, Alice and Bob, live far apart from each other and talk together daily over the telephone about what they did that day. Bob is only interested in three activities: walking in the park, shopping, and cleaning his apartment. The choice of what to do is determined exclusively by the weather on a given day. Alice has no definite information about the weather where Bob lives, but she knows general trends. Based on what Bob tells her he did each day, Alice tries to guess what the weather must have been like.

Alice believes that the weather operates as a discrete Markov chain. There are two states, "Rainy" and "Sunny", but she cannot observe them directly; that is, they are hidden from her. On each day, there is a certain chance that Bob will perform one of the following activities, depending on the weather: "walk", "shop", or "clean". Since Bob tells Alice about his activities, those are the observations. Source: Wikipedia.org
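A sketch of this model as plain data in Python; the numeric values are the illustrative ones commonly attached to this Wikipedia example, not measurements:

```python
# Hidden states (the weather) and observable activities (what Bob reports).
states = ("Rainy", "Sunny")
observations = ("walk", "shop", "clean")

# Alice's prior belief about the weather on day one.
start_p = {"Rainy": 0.6, "Sunny": 0.4}

# Transition probabilities between hidden weather states.
trans_p = {
    "Rainy": {"Rainy": 0.7, "Sunny": 0.3},
    "Sunny": {"Rainy": 0.4, "Sunny": 0.6},
}

# Emission probabilities: chance of each activity given the weather.
emit_p = {
    "Rainy": {"walk": 0.1, "shop": 0.4, "clean": 0.5},
    "Sunny": {"walk": 0.6, "shop": 0.3, "clean": 0.1},
}
```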

Hidden Markov Models

Building a Markov Chain

What now? Find the most probable output sequence using the Viterbi algorithm: a dynamic programming algorithm for finding the most likely sequence of hidden states (called the Viterbi path) that results in a sequence of observed events.
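A minimal Python sketch of Viterbi, reusing the Alice/Bob tables above (an illustration, not the implementation behind the link on the next slide):

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return (probability, path) for the most likely hidden state sequence."""
    # V[t][s] = probability of the best path that ends in state s at time t.
    V = [{s: start_p[s] * emit_p[s][obs[0]] for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        V.append({})
        back.append({})
        for s in states:
            # Pick the best previous state to transition from.
            prev = max(states, key=lambda p: V[t - 1][p] * trans_p[p][s])
            V[t][s] = V[t - 1][prev] * trans_p[prev][s] * emit_p[s][obs[t]]
            back[t][s] = prev
    # Trace back from the most probable final state.
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.insert(0, back[t][path[0]])
    return V[-1][last], path

prob, path = viterbi(("walk", "shop", "clean"), states, start_p, trans_p, emit_p)
print(path, prob)  # ['Sunny', 'Rainy', 'Rainy'] with probability 0.01344
```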

Viterbi Results http://pcarvalho.com/forward_viterbi/

Bioinformatics Example Assume we are given a DNA sequence that begins in an exon, contains one 5' splice site, and ends in an intron. Identify where the switch from exon to intron occurs: where is the splice site? Source: http://www.nature.com/nbt/journal/v22/n10/full/nbt1004-1315.html#B1

Bioinformatics Example In order for us to guess, the sequences of exons, splice sites and introns must have different statistical properties. Let's say...
Exons have a uniform base composition on average: A/C/G/T at 25% each.
Introns are A/T rich: A/T at 40% each, C/G at 10% each.
The 5' splice site consensus nucleotide is almost always a G: G at 95%, A at 5%.
Source: http://www.nature.com/nbt/journal/v22/n10/full/nbt1004-1315.html#B1

Bioinformatics Example We can build a Hidden Markov Model with three states: "E" for exon, "5" for the 5' splice site, and "I" for intron. Each state has its own emission probabilities, which model the base composition of exons, introns and the consensus G at the 5' splice site. Each state also has transition probabilities (arrows). Source: http://www.nature.com/nbt/journal/v22/n10/full/nbt1004-1315.html#B1
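A sketch of these tables in Python; the emission values follow the previous slide, while the transition values are assumptions in the spirit of the article's figure, not quoted from it:

```python
# Hidden states: E = exon, 5 = 5' splice site, I = intron.
gene_states = ("E", "5", "I")

# Emission probabilities over nucleotides, as described above.
gene_emit_p = {
    "E": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
    "5": {"A": 0.05, "C": 0.00, "G": 0.95, "T": 0.00},
    "I": {"A": 0.40, "C": 0.10, "G": 0.10, "T": 0.40},
}

# Transition probabilities (the arrows); "end" is a silent terminal state.
gene_trans_p = {
    "E": {"E": 0.9, "5": 0.1},
    "5": {"I": 1.0},
    "I": {"I": 0.9, "end": 0.1},
}
gene_start_p = {"E": 1.0}  # the sequence is assumed to begin in an exon
```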

HMM: A Bioinformatics Visual We can use an HMM to generate a sequence. When we visit a state, we emit a nucleotide based on that state's emission probability distribution; we also choose a state to visit next according to the state's transition probability distribution. We generate two strings of information: the observed sequence and the underlying state path. Source: http://www.nature.com/nbt/journal/v22/n10/full/nbt1004-1315.html#B1
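A minimal sketch of that generative walk, reusing the (partly assumed) tables above:

```python
import random

def draw(dist):
    """Sample one key from a {outcome: probability} dict."""
    r, acc = random.random(), 0.0
    for outcome, p in dist.items():
        acc += p
        if r < acc:
            return outcome
    return outcome  # guard against floating-point round-off

def generate():
    """Emit (observed sequence, hidden state path) from the toy gene model."""
    state, seq, path = "E", [], []
    while state != "end":
        path.append(state)
        seq.append(draw(gene_emit_p[state]))   # emit a nucleotide
        state = draw(gene_trans_p[state])      # move to the next state
    return "".join(seq), "".join(path)

sequence, state_path = generate()  # two parallel strings of information
```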

HMM: A Bioinformatics Visual The state path is a Markov chain. Since we're only given the observed sequence, this underlying state path is a hidden Markov chain. Therefore, we can apply Bayesian probability. Source: http://www.nature.com/nbt/journal/v22/n10/full/nbt1004-1315.html#B1

HMM: A Bioinformatics Visual S – observed sequence; π – state path; Θ – parameters. The probability P(S, π|HMM, Θ) is the product of all emission probabilities and transition probabilities. Let's look at an example... Source: http://www.nature.com/nbt/journal/v22/n10/full/nbt1004-1315.html#B1
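Written out, with a for transition probabilities, e for emission probabilities, L the sequence length, and π₀ the start state (standard HMM notation, not the article's symbols):

```latex
P(S, \pi \mid \mathrm{HMM}, \Theta) = \prod_{i=1}^{L} a_{\pi_{i-1}\,\pi_i} \; e_{\pi_i}(S_i)
```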

HMM: A Bioinformatics Visual There are 27 transitions and 26 emissions. Multiply all 53 probabilities together (and take the log, since these are small numbers) and you'll calculate log P(S, π|HMM, Θ) = -41.22 Source: http://www.nature.com/nbt/journal/v22/n10/full/nbt1004-1315.html#B1
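A sketch of that calculation for an arbitrary sequence/path pair over the toy model above; natural logs are assumed here, since they appear to match the -41.22 scale:

```python
import math

def log_joint(seq, path):
    """Natural-log joint probability log P(S, pi | HMM, Theta).

    Assumes the path starts in 'E' and ends in 'I', so that the
    transitions from 'start' and to 'end' are defined.
    """
    logp = math.log(gene_start_p[path[0]])                     # start -> first state
    logp += math.log(gene_emit_p[path[0]][seq[0]])             # first emission
    for i in range(1, len(seq)):
        logp += math.log(gene_trans_p[path[i - 1]][path[i]])   # transition
        logp += math.log(gene_emit_p[path[i]][seq[i]])         # emission
    return logp + math.log(gene_trans_p[path[-1]]["end"])      # final state -> end
```

For a 26-nucleotide sequence this sums 27 transition terms (start, 25 internal, end) and 26 emission terms, matching the 53 probabilities mentioned above.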

HMM: A Bioinformatics Visual The model parameters and overall sequence scores are all probabilities. Therefore, we can use Bayesian probability theory to manipulate these numbers in standard, powerful ways, including optimizing parameters and interpreting the significance of scores. Source: http://www.nature.com/nbt/journal/v22/n10/full/nbt1004-1315.html#B1

HMM: A Bioinformatics Visual Posterior decoding: an alternative state path puts the splice site on the sixth G instead of the fifth (log probabilities of -41.71 versus -41.22). How confident are we that the fifth G is the right choice? Source: http://www.nature.com/nbt/journal/v22/n10/full/nbt1004-1315.html#B1

HMM: A Bioinformatics Visual We can calculate our confidence directly. The probability that nucleotide i was emitted by state k is the sum of the probabilities of all the state paths that use state k to generate i, normalized by the sum over all possible state paths. Result: we get a probability of 46% that the best-scoring fifth G is correct and 28% that the sixth G position is correct. Source: http://www.nature.com/nbt/journal/v22/n10/full/nbt1004-1315.html#B1
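In the usual forward/backward notation (our notation, not the article's), with f_k(i) the forward probability of emitting S₁..S_i and ending in state k, and b_k(i) the backward probability of emitting the remainder of the sequence starting from state k:

```latex
P(\pi_i = k \mid S) = \frac{f_k(i)\, b_k(i)}{P(S)}
```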

Further Possibilities The toy model provided by the article is a simple example, but we can go further: we could add a more realistic consensus, GTRAGT, at the 5' splice site, putting a row of six HMM states in place of the '5' state to model a six-base ungapped consensus motif. The possibilities are not limited to these.

The catch HMMs don't deal well with correlations between nucleotides, because they assume that each emitted nucleotide depends only on one underlying state. Example of a bad use for an HMM: conserved RNA base pairs, which induce long-range pairwise correlations; one position might be any nucleotide, but the base-paired partner must be complementary. An HMM state path has no way of 'remembering' what a distant state generated. Source: http://www.nature.com/nbt/journal/v22/n10/full/nbt1004-1315.html#B1

Credits
http://www.nature.com/nbt/journal/v22/n10/full/nbt1004-1315.html#B1
http://en.wikipedia.org/wiki/Viterbi_algorithm
http://en.wikipedia.org/wiki/Hidden_Markov_model
http://en.wikipedia.org/wiki/Bayesian_network
http://www.daimi.au.dk/~bromille/PHM/Storm.pdf

Questions?