BIOINFORMATICS
Dr. Aladdin Hamwieh, Khalid Al-shamaa, Abdulqader Jighly
Lecture 5: Hidden Markov Model
Aleppo University, Faculty of Technical Engineering, Department of Biotechnology
GENE PREDICTION: METHODS
Gene prediction can be based upon:
- Coding statistics (statistical approach)
- Gene structure
- Comparison (similarity-based approach)
GENE PREDICTION: CODING STATISTICS
Coding regions of the sequence have different, non-random properties compared with non-coding regions:
- GC content
- Codon bias (codon usage)
MARKOV MODEL
A Markov model is a process which moves from state to state, where each move depends only on the previous n states. For example, consider calculating the probability of getting this sequence of weather states in one week in March: Sunny, Sunny, Cloudy, Rainy, Rainy, Sunny, Cloudy.
- If today is Cloudy, it is more likely to be Rainy tomorrow.
- In March, it is more likely to start with a Sunny day than with any other state.
- And so on.
MARKOV MODEL
[Figure: state-transition diagram and transition-probability table; rows give the weather today (Sunny, Cloudy, Rainy), columns give the weather tomorrow (Sunny, Cloudy, Rainy).]
EXAMPLE:
P(Sunny, Sunny, Cloudy, Rainy | Model)
= Π(Sunny) * P(Sunny | Sunny) * P(Cloudy | Sunny) * P(Rainy | Cloudy)
= 0.6 * 0.5 * 0.25 * …
[Figure: the same transition table as above, with the entries used in the product highlighted.]
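To make the chain-rule computation concrete, here is a minimal Python sketch of the weather chain. Only Π(Sunny) = 0.6, P(Sunny | Sunny) = 0.5, and P(Cloudy | Sunny) = 0.25 are given above; every other number in the tables below is an invented placeholder for illustration.

```python
# Minimal sketch of the weather Markov chain. pi(Sunny) = 0.6 and the first
# two transition values come from the example above; all other numbers are
# assumed placeholders.

initial = {"Sunny": 0.6, "Cloudy": 0.25, "Rainy": 0.15}  # only 0.6 is given

transition = {  # transition[today][tomorrow] = P(tomorrow | today)
    "Sunny":  {"Sunny": 0.5,  "Cloudy": 0.25, "Rainy": 0.25},  # first two given
    "Cloudy": {"Sunny": 0.25, "Cloudy": 0.25, "Rainy": 0.5},   # assumed
    "Rainy":  {"Sunny": 0.25, "Cloudy": 0.25, "Rainy": 0.5},   # assumed
}

def sequence_probability(states):
    """P(s1, ..., sn) = pi(s1) * product over k of P(s_k | s_{k-1})."""
    p = initial[states[0]]
    for prev, cur in zip(states, states[1:]):
        p *= transition[prev][cur]
    return p

print(sequence_probability(["Sunny", "Sunny", "Cloudy", "Rainy"]))
# 0.6 * 0.5 * 0.25 * 0.5 = 0.0375 with the assumed P(Rainy | Cloudy) = 0.5
```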
HIDDEN MARKOV MODELS
- States are not observable.
- Observations are probabilistic functions of state.
- State transitions are still probabilistic.
CG ISLANDS AND THE "FAIR BET CASINO"
The CG-islands problem can be modeled after a problem named "The Fair Bet Casino". The game is to flip coins, which results in only two possible outcomes: Head or Tail.
- The Fair coin gives Heads and Tails with the same probability, ½.
- The Biased coin gives Heads with probability ¾.
THE "FAIR BET CASINO" (CONT'D)
Thus, we define the probabilities:
P(H|F) = P(T|F) = ½
P(H|B) = ¾, P(T|B) = ¼
The crooked dealer changes between the Fair and Biased coins with probability 10%.
HMM FOR FAIR BET CASINO (CONT'D)
[Figure: HMM model for the Fair Bet Casino problem.]
HMM PARAMETERS
Σ: set of emission characters. Examples:
- Σ = {H, T} for coin tossing
- Σ = {1, 2, 3, 4, 5, 6} for dice tossing
- Σ = {A, C, G, T} for DNA sequences
Q: set of hidden states, each emitting symbols from Σ. Examples:
- Q = {F, B} for coin tossing
- Q = {Non-coding, Coding, Regulatory} for sequences
HMM PARAMETERS (CONT'D)
A = (a_kl): a |Q| x |Q| matrix of the probability of changing from state k to state l.
a_FF = 0.9, a_FB = 0.1
a_BF = 0.1, a_BB = 0.9
E = (e_k(b)): a |Q| x |Σ| matrix of the probability of emitting symbol b while being in state k (writing 0 for Tails and 1 for Heads):
e_F(0) = ½, e_F(1) = ½
e_B(0) = ¼, e_B(1) = ¾
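As a concrete illustration, the Fair Bet Casino HMM can be written out as plain Python dictionaries. The transition and emission values are exactly the ones above; the uniform initial distribution is an assumption, since Π is not specified for this model.

```python
# The Fair Bet Casino HMM as plain Python data structures. The later
# forward/Viterbi sketches reuse these names.

states = ["F", "B"]        # Q: Fair coin, Biased coin
alphabet = ["T", "H"]      # Sigma: Tails (0), Heads (1)

start = {"F": 0.5, "B": 0.5}  # assumed uniform start; Pi is not given above

trans = {  # A = (a_kl): probability of moving from state k to state l
    "F": {"F": 0.9, "B": 0.1},
    "B": {"F": 0.1, "B": 0.9},
}

emit = {  # E = (e_k(b)): probability of emitting symbol b in state k
    "F": {"T": 0.5,  "H": 0.5},
    "B": {"T": 0.25, "H": 0.75},
}
```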
HMM
[Figure: a three-state HMM (Q1, Q2, Q3) that emits colored balls (Yellow, Red, Green, Blue); the diagram shows the states, their emission probabilities, and the transitions between the i-th turn and the (i+1)-th turn.]
THE THREE BASIC PROBLEMS OF HMMs
Problem 1: Given an observation sequence Σ = O1 O2 … OT and a model M = (Π, A, E), compute P(Σ | M).
Problem 2: Given an observation sequence Σ = O1 O2 … OT and a model M = (Π, A, E), how do we choose a corresponding state sequence Q = q1 q2 … qT which best "explains" the observation?
Problem 3: How do we adjust the model parameters Π, A, E to maximize P(Σ | {Π, A, E})?
THE THREE BASIC PROBLEMS OF HMMs
Problem 1: Given an observation sequence Σ = O1 O2 … OT and a model M = (Π, A, E), compute P(Σ | M). For example: P([a sequence of colored balls] | M).
PROBLEM 1: PROBABILITY OF AN OBSERVATION SEQUENCE
The probability of an observation sequence is the sum of the probabilities of all possible state sequences in the HMM. Naive computation is very expensive: given T observations and N states, there are N^T possible state sequences. Even small HMMs, e.g. T = 10 and N = 10, contain 10 billion different paths. The solution to this problem, and to Problem 2, is to use dynamic programming.
Problem 1: Given an observation sequence Σ = O1 O2 … OT and a model M = (Π, A, E), compute P(Σ | M).
Solution: the forward algorithm.
[Figure: a trellis of the states Q1, Q2, Q3 unrolled over the observation sequence. At each step, the probability of each state is obtained by multiplying the previous step's probabilities by the corresponding transition and emission probabilities and summing over all incoming paths; P(Σ | M) is the sum over the final column.]
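A minimal sketch of the forward algorithm in Python, reusing the casino dictionaries defined earlier; the observation sequence HHHT is an arbitrary example. The loop visits each observation once and each pair of states once, so the cost is O(T * N^2) instead of the naive O(N^T).

```python
def forward(obs, states, start, trans, emit):
    """Forward algorithm: P(obs | M), summed over all state paths."""
    # Initialisation: f_k(1) = pi(k) * e_k(o1)
    f = {k: start[k] * emit[k][obs[0]] for k in states}
    # Recursion: f_l(t) = e_l(o_t) * sum over k of f_k(t-1) * a_kl
    for o in obs[1:]:
        f = {l: emit[l][o] * sum(f[k] * trans[k][l] for k in states)
             for l in states}
    # Termination: sum over the final column of the trellis
    return sum(f.values())

# Using the casino HMM defined above:
print(forward(["H", "H", "H", "T"], states, start, trans, emit))
```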
THE THREE BASIC PROBLEMS OF HMMs
Problem 2: Given an observation sequence Σ = O1 O2 … OT and a model M = (Π, A, E), how do we choose a corresponding state sequence Q = q1 q2 … qT which best "explains" the observation? For example: what is the most probable state sequence q1 q2 q3 q4 given the observation sequence?
PROBLEM 2: DECODING
The solution to Problem 1 gives us the sum over all paths through an HMM efficiently. For Problem 2, we want to find the single path with the highest probability.
[Figure: the same trellis as in the forward-algorithm example, but at each step only the largest incoming path probability is kept rather than the sum; the decoded state sequence is the path whose probability is THE LARGEST.]
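The standard dynamic-programming solution to the decoding problem is the Viterbi algorithm (not named on the slide, but it is what the "keep the largest" picture computes): it has the same structure as the forward algorithm, except that it takes the maximum over incoming paths instead of the sum, and keeps back-pointers so the best path can be recovered. A sketch, again reusing the casino model:

```python
def viterbi(obs, states, start, trans, emit):
    """Viterbi algorithm: the single most probable state path for obs."""
    # v[k] = probability of the best path ending in state k
    v = {k: start[k] * emit[k][obs[0]] for k in states}
    back = []  # back-pointers, one dict per step
    for o in obs[1:]:
        pointers, new_v = {}, {}
        for l in states:
            best_k = max(states, key=lambda k: v[k] * trans[k][l])
            pointers[l] = best_k
            new_v[l] = v[best_k] * trans[best_k][l] * emit[l][o]
        back.append(pointers)
        v = new_v
    # Trace back from the most probable final state
    last = max(states, key=lambda k: v[k])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return v[last], path[::-1]

prob, path = viterbi(["H", "H", "H", "T"], states, start, trans, emit)
print(prob, path)  # best path probability and the Fair/Biased label per flip
```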
HIDDEN MARKOV MODEL AND GENE PREDICTION
How is it connected to gene prediction?
[Figure: the three-state colored-ball HMM from before, repeated as a reminder.]
How is it connected to gene prediction?
[Figure: the same picture applied to DNA: the observed symbols are the nucleotides A, C, G, T, and the hidden states are the functional labels Exon, Intron, and UTR.]
HIDDEN MARKOV MODELS (HMM) FOR GENE PREDICTION
Basic probabilistic model of gene structure.
Signals:
- B: begin sequence
- S: start translation
- D: donor site (GT)
- A: acceptor site (AG)
- T: stop translation
- F: end sequence
Hidden states:
- 5': 5' UTR
- EI: initial exon
- E: exon
- I: intron
- FE: final exon
- SE: single exon
- 3': 3' UTR
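As a toy illustration of how the decoding machinery applies to DNA, the sketch below labels each base of a short sequence as Exon or Intron using the viterbi function from the earlier block. All probabilities here are invented for the example; real gene finders (e.g. GENSCAN) use the much richer state set shown above and estimate their parameters from training data.

```python
# Toy gene-prediction HMM: observed symbols are nucleotides, hidden states
# are functional labels. All numbers are invented for this sketch.

dna_states = ["Exon", "Intron"]
dna_start = {"Exon": 0.5, "Intron": 0.5}
dna_trans = {
    "Exon":   {"Exon": 0.9, "Intron": 0.1},
    "Intron": {"Exon": 0.1, "Intron": 0.9},
}
dna_emit = {  # exons GC-rich, introns AT-rich (toy numbers)
    "Exon":   {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
    "Intron": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
}

seq = "ATGGCGTGCATTTAA"
# Reuses viterbi() from the decoding sketch above.
prob, labels = viterbi(list(seq), dna_states, dna_start, dna_trans, dna_emit)
print(list(zip(seq, labels)))  # each base paired with its predicted label
```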
EUKARYOTIC GENE FEATURES
[Figure: gene structure along the DNA, showing the start codon ATG, introns bounded by GT (donor) and AG (acceptor) signals, and a stop codon such as TAG.]
THANK YOU