1 DNA Analysis Part II Amir Golnabi ENGS 112 Spring 2008
2 What we saw in Part I:
1. Markov Chain
2. DNA and Modeling
3. Markovian Models for DNA Sequences
4. HMM for DNA Sequences
Part II:
1. DNA Methylation and CpG islands
2. Markov Chain Model
3. Hidden Markov Model
4. Finding the State Path
5. Parameter Estimation for HMMs
6. References
3 1. DNA Methylation and CpG islands
– In the human genome, the C of a CG dinucleotide is typically modified by methylation.
– Methyl-C has a high chance of mutating into a T, so CG dinucleotides are rarer in the genome than the independent frequencies of C and G would predict.
– Methylation is suppressed in short stretches of the genome, such as around the promoters or start regions of many genes. These stretches contain more CG dinucleotides and are called CpG islands.
– The "p" in CpG indicates that the C and the G are connected by a phosphodiester bond.
Two questions:
– Given a short stretch of genomic sequence, how would we decide whether it comes from a CpG island?
– Given a long piece of sequence, how would we find the CpG islands in it?
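4 2. Given a short stretch of genomic sequence, how would we decide whether it comes from a CpG island?
Markov chain:
– Transition probabilities: $a_{st} = P(x_i = t \mid x_{i-1} = s)$
– Probability of a sequence $x$ of length $L$: $P(x) = P(x_1) \prod_{i=2}^{L} a_{x_{i-1} x_i}$
– Beginning and end of sequences: modeled by adding silent begin and end states, which emit no symbol.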
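The sequence probability of a first-order Markov chain (the probability of the first symbol times the product of the transition probabilities between adjacent symbols) can be sketched as follows. This is a minimal illustration: the `INITIAL` and `TRANSITION` tables are hypothetical placeholder values, not estimates from real data, and the function name is chosen only for this example. Logarithms are used to avoid underflow on long sequences.

```python
import math

# Hypothetical first-order Markov chain over DNA symbols.
# All numbers are illustrative placeholders, not estimates from real data.
INITIAL = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
TRANSITION = {
    "A": {"A": 0.4, "C": 0.2, "G": 0.2, "T": 0.2},
    "C": {"A": 0.2, "C": 0.4, "G": 0.2, "T": 0.2},
    "G": {"A": 0.2, "C": 0.2, "G": 0.4, "T": 0.2},
    "T": {"A": 0.2, "C": 0.2, "G": 0.2, "T": 0.4},
}


def sequence_log_prob(seq, initial, transition):
    """Return log P(x) = log P(x_1) + sum over i >= 2 of log a_{x_{i-1} x_i}."""
    if not seq:
        raise ValueError("sequence must be non-empty")
    log_p = math.log(initial[seq[0]])
    # Walk over adjacent pairs (x_{i-1}, x_i) and add each log transition.
    for prev, curr in zip(seq, seq[1:]):
        log_p += math.log(transition[prev][curr])
    return log_p
```

For example, `math.exp(sequence_log_prob("AAC", INITIAL, TRANSITION))` multiplies the initial probability of A by the A→A and A→C transition probabilities.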
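5 Two Markov chain models:
1. CpG islands (the '+' model)
2. Remainder of the sequence (the '-' model)
Transition probabilities for CpG islands, using the maximum likelihood estimator: $a^+_{st} = \frac{c^+_{st}}{\sum_{t'} c^+_{st'}}$, where $c^+_{st}$ is the number of times letter $t$ follows letter $s$ in the labeled CpG island regions. The $a^-_{st}$ are estimated in the same way from the remaining regions.
Tables of frequencies (one 4×4 table per model, rows and columns indexed by A, C, G, T):
– Each row sums to 1.
– The tables are asymmetric.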
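Maximum likelihood estimation of the transition probabilities amounts to counting how often each letter follows each other letter and normalizing every row. A minimal sketch is given below; the function name and the optional `pseudocount` parameter are assumptions added for this example (a pseudocount avoids zero rows when the training data is small), and the training sequences in the usage note are toy strings, not real genomic data.

```python
from collections import Counter

ALPHABET = "ACGT"


def estimate_transitions(sequences, pseudocount=0.0):
    """Estimate a_st = c_st / sum_t' c_st' from a list of training sequences.

    `pseudocount` (an assumption of this sketch) is added to every count.
    """
    counts = Counter()
    for seq in sequences:
        # Count each adjacent pair (s, t): letter t following letter s.
        for prev, curr in zip(seq, seq[1:]):
            counts[(prev, curr)] += 1

    table = {}
    for s in ALPHABET:
        row_total = sum(counts[(s, t)] + pseudocount for t in ALPHABET)
        if row_total == 0:
            raise ValueError(f"no transitions observed out of {s!r}; use a pseudocount")
        table[s] = {t: (counts[(s, t)] + pseudocount) / row_total for t in ALPHABET}
    return table
```

Calling `estimate_transitions(plus_sequences)` on sequences labeled as CpG islands would give the '+' table, and the same call on the remaining sequences the '-' table; here `plus_sequences` is a hypothetical variable name.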
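6 To use this model for discrimination:
– Log-odds ratio: $S(x) = \log \frac{P(x \mid \text{model } +)}{P(x \mid \text{model } -)} = \sum_{i=1}^{L} \log \frac{a^+_{x_{i-1} x_i}}{a^-_{x_{i-1} x_i}} = \sum_{i=1}^{L} \beta_{x_{i-1} x_i}$, with $x_0$ denoting the silent begin state
– $x$ is the sequence.
– $\beta_{x_{i-1} x_i}$ is the log likelihood ratio of the corresponding transition probabilities.
(Table: β values, rows and columns indexed by A, C, G, T.)
(Figure: histogram of the length-normalized scores, S(x), for all the sequences, ~60,000 nucleotides.)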
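The log-odds score can be sketched as a sum of per-transition log ratios. In the example below, the `PLUS` and `MINUS` tables are invented for illustration only (they are not the values from the slide's tables), base-2 logarithms are an assumption so that scores come out in bits, and the begin-state term is omitted for simplicity.

```python
import math

ALPHABET = "ACGT"

# Hypothetical transition tables, invented for illustration only.
PLUS = {
    "A": {"A": 0.20, "C": 0.30, "G": 0.40, "T": 0.10},
    "C": {"A": 0.20, "C": 0.35, "G": 0.25, "T": 0.20},
    "G": {"A": 0.15, "C": 0.35, "G": 0.35, "T": 0.15},
    "T": {"A": 0.10, "C": 0.35, "G": 0.35, "T": 0.20},
}
MINUS = {
    "A": {"A": 0.30, "C": 0.20, "G": 0.30, "T": 0.20},
    "C": {"A": 0.30, "C": 0.30, "G": 0.10, "T": 0.30},
    "G": {"A": 0.25, "C": 0.25, "G": 0.30, "T": 0.20},
    "T": {"A": 0.20, "C": 0.25, "G": 0.25, "T": 0.30},
}

# beta[s][t] = log2(a+_st / a-_st), the per-transition log likelihood ratio.
BETA = {
    s: {t: math.log2(PLUS[s][t] / MINUS[s][t]) for t in ALPHABET}
    for s in ALPHABET
}


def log_odds_score(seq):
    """Sum beta over adjacent pairs (the begin-state term is omitted here)."""
    return sum(BETA[prev][curr] for prev, curr in zip(seq, seq[1:]))


def normalized_score(seq):
    """Length-normalized score: average bits per nucleotide."""
    if not seq:
        raise ValueError("sequence must be non-empty")
    return log_odds_score(seq) / len(seq)
```

A positive score suggests the '+' (CpG island) model explains the sequence better, a negative score the '-' model. With these made-up tables, CG-rich strings score positive and AT-rich strings negative.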
7 3. Given a long piece of sequence, how would we find the CpG islands in it?
– Use a single model for the entire sequence that incorporates both Markov chains: an HMM.
– Transition probabilities within each set ('+' or '-') are similar to those of the corresponding Markov chain.
– There is a small chance of switching between '+' and '-' regions.
– There is no one-to-one correspondence between states and symbols: each symbol can be emitted by either a '+' state or a '-' state.
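One possible way to combine the two chains into a single eight-state transition table is sketched below. The construction is an assumption of this example, not necessarily the one used in the lecture: with probability `1 - p_switch` the model stays in its current set and follows that set's chain, and with probability `p_switch` it switches to the other set and follows the other chain. The function name and the default `p_switch` value are hypothetical.

```python
ALPHABET = "ACGT"


def combine_chains(plus, minus, p_switch=0.01):
    """Build an 8-state transition table over states such as 'C+' and 'C-'.

    Assumed construction: stay in the same set with probability 1 - p_switch,
    switch to the other set with probability p_switch.
    """
    if not 0.0 <= p_switch <= 1.0:
        raise ValueError("p_switch must be a probability")
    chains = {"+": plus, "-": minus}
    hmm = {}
    for sign in "+-":
        other = "-" if sign == "+" else "+"
        for s in ALPHABET:
            row = {}
            for t in ALPHABET:
                # Stay in the same set: close to the original chain.
                row[t + sign] = (1.0 - p_switch) * chains[sign][s][t]
                # Switch sets: small probability, distributed by the other chain.
                row[t + other] = p_switch * chains[other][s][t]
            hmm[s + sign] = row
    return hmm
```

Because each row of the input tables sums to 1, each row of the combined table also sums to 1.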
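8 Sequence of states (path $\pi$):
– Transition probabilities: $a_{kl} = P(\pi_i = l \mid \pi_{i-1} = k)$
– The state sequence is hidden in an HMM.
Sequence of symbols:
– Emission probabilities: $e_k(b) = P(x_i = b \mid \pi_i = k)$, the probability that symbol $b$ is seen in state $k$
– Emission probabilities of the CpG island model: 0 or 1
A sequence can be generated from an HMM as follows:
– A state $\pi_1$ is chosen according to $a_{0i}$.
– In $\pi_1$, an observation is emitted according to $e_{\pi_1}$.
– A new state $\pi_2$ is chosen according to $a_{\pi_1 i}$.
– And so forth: the result is a sequence of random observations.
– $P(x)$ = probability that $x$ was generated by the model
– Joint probability of an observed sequence $x$ and a state sequence $\pi$: $P(x, \pi) = a_{0\pi_1} \prod_{i=1}^{L} e_{\pi_i}(x_i)\, a_{\pi_i \pi_{i+1}}$, where state 0 denotes the begin and end state and $\pi_{L+1} = 0$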
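The generation procedure and the joint probability can be sketched in a few lines. To keep the example short it uses a simplified two-state model ('+' and '-', each emitting all four letters) rather than the eight-state CpG model of the slides; all parameter values are hypothetical, and no end state is modeled, so the final transition into state 0 is left out.

```python
import random

# Hypothetical two-state HMM; all values are illustrative assumptions.
START = {"+": 0.5, "-": 0.5}                          # a_{0k}
TRANS = {"+": {"+": 0.9, "-": 0.1},                   # a_{kl}
         "-": {"+": 0.1, "-": 0.9}}
EMIT = {"+": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},  # e_k(b)
        "-": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}}


def draw(dist, rng):
    """Draw one key from a {key: probability} dictionary."""
    keys = list(dist)
    return rng.choices(keys, weights=[dist[k] for k in keys])[0]


def generate(length, rng):
    """Generate (symbols, path): choose a state, emit, choose the next state."""
    path, symbols = [], []
    state = draw(START, rng)
    for _ in range(length):
        path.append(state)
        symbols.append(draw(EMIT[state], rng))
        state = draw(TRANS[state], rng)
    return "".join(symbols), path


def joint_prob(symbols, path):
    """P(x, pi) = a_{0 pi_1} e_{pi_1}(x_1) * prod_{i>=2} a_{pi_{i-1} pi_i} e_{pi_i}(x_i)."""
    if not path or len(symbols) != len(path):
        raise ValueError("symbols and path must be non-empty and of equal length")
    p = START[path[0]] * EMIT[path[0]][symbols[0]]
    for i in range(1, len(path)):
        p *= TRANS[path[i - 1]][path[i]] * EMIT[path[i]][symbols[i]]
    return p
```

A seeded generator such as `random.Random(0)` makes the sampled sequence reproducible, and `joint_prob` can then be evaluated on the sampled pair of symbols and path.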
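9 Example: probability of the sequence 'CGCG' being emitted by the state sequence (C+, G-, C-, G+):
$a_{0,C_+} \times 1 \times a_{C_+,G_-} \times 1 \times a_{G_-,C_-} \times 1 \times a_{C_-,G_+} \times 1 \times a_{G_+,0}$
This is not very useful in practice because the path is not known → path estimation:
– Viterbi algorithm: finds the most probable path
– Forward or backward algorithm: sums over all paths to obtain P(x) and the posterior state probabilities
Example: CpG model generating the symbol sequence CGCG
– Candidate state sequences: (C+,G+,C+,G+), (C-,G-,C-,G-), (C+,G-,C-,G+)
– (C+,G-,C-,G+): switches back and forth between '+' and '-', which is unlikely
– (C-,G-,C-,G-): small probability of CG in the '-' group
– (C+,G+,C+,G+): best option!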
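The Viterbi algorithm finds the most probable state path by dynamic programming: for each position it keeps, per state, the probability of the best path ending there, plus a back-pointer to the previous state. The sketch below works in log space and uses a simplified two-state model with hypothetical parameters, not the eight-state CpG model of the slides; no end state is modeled.

```python
import math

# Hypothetical two-state HMM; all values are illustrative assumptions.
STATES = ["+", "-"]
START = {"+": 0.5, "-": 0.5}
TRANS = {"+": {"+": 0.9, "-": 0.1},
         "-": {"+": 0.1, "-": 0.9}}
EMIT = {"+": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
        "-": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}}


def viterbi(symbols):
    """Return the most probable state path for `symbols` (log-space Viterbi)."""
    if not symbols:
        return []
    # v[k]: log probability of the best path that ends in state k.
    v = {k: math.log(START[k]) + math.log(EMIT[k][symbols[0]]) for k in STATES}
    back = []  # back[i][l]: best predecessor of state l at position i + 1
    for x in symbols[1:]:
        pointers, new_v = {}, {}
        for l in STATES:
            best_k = max(STATES, key=lambda k: v[k] + math.log(TRANS[k][l]))
            pointers[l] = best_k
            new_v[l] = v[best_k] + math.log(TRANS[best_k][l]) + math.log(EMIT[l][x])
        back.append(pointers)
        v = new_v
    # Trace back from the best final state.
    state = max(STATES, key=lambda k: v[k])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path
```

With these made-up parameters, a short CG-only string decodes entirely to '+', while a sufficiently long CG-rich run embedded in AT-rich flanks decodes as a '+' segment between '-' segments, because the emission gain outweighs the cost of two switches.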
10 5. Parameter Estimation for HMMs
HMM models:
1. Design the structure: states and their connections.
2. Assign the parameter values: the transition and emission probabilities, $a_{kl}$ and $e_k(b)$, using Baum-Welch or Viterbi training.
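For reference, the Baum-Welch re-estimation step can be written out in its standard textbook form. The notation here is an assumption, since the slide does not define it: $x^j$ is the $j$-th training sequence, $f^j_k(i)$ and $b^j_k(i)$ are its forward and backward variables, and $A_{kl}$ and $E_k(b)$ are the expected numbers of times transition $k \to l$ is used and symbol $b$ is emitted from state $k$.

```latex
\begin{align*}
A_{kl}  &= \sum_j \frac{1}{P(x^j)} \sum_i f^j_k(i)\, a_{kl}\, e_l(x^j_{i+1})\, b^j_l(i+1) \\
E_k(b)  &= \sum_j \frac{1}{P(x^j)} \sum_{\{i \,\mid\, x^j_i = b\}} f^j_k(i)\, b^j_k(i) \\
a_{kl}  &= \frac{A_{kl}}{\sum_{l'} A_{kl'}}, \qquad
e_k(b)   = \frac{E_k(b)}{\sum_{b'} E_k(b')}
\end{align*}
```

The two steps (computing expected counts with the current parameters, then renormalizing) are repeated until the likelihood stops improving. Viterbi training follows the same scheme but replaces the expected counts with counts taken along the most probable path of each sequence.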
11 6. References
– Bandyopadhyay, Sanghamitra. Gene Identification: Classical and Computational Intelligence Approach. 38 vols. IEEE, Jan. 2008.
– Durbin, R., S. Eddy, and A. Krogh. Biological Sequence Analysis. Cambridge: Cambridge University.
– Koski, Timo. Hidden Markov Models for Bioinformatics. Sweden: Kluwer Academic.
– Birney, E. "Hidden Markov models in biological sequence analysis". July 2001.
– Haussler, David, David Kulp, Martin Reese, and Frank Eeckman. "A Generalized Hidden Markov Model for the Recognition of Human Genes in DNA".
– Boufounos, Petros, Sameh El-Difrawy, and Dan Ehrlich. "Hidden Markov Models for DNA Sequencing".