DNA Analysis Part II. Amir Golnabi, ENGS 112, Spring 2008.



Slide 2: Outline

What we saw in Part I:
1. Markov Chain
2. DNA and Modeling
3. Markovian Models for DNA Sequences
4. HMM for DNA Sequences

Part II:
1. DNA Methylation and CpG islands
2. Markov Chain Model
3. Hidden Markov Model
4. Finding the State Path
5. Parameter Estimation for HMMs
6. References

Slide 3: 1. DNA Methylation and CpG islands

- In the human genome, the cytosine (C) of a CG dinucleotide is typically modified by methylation.
- Methyl-C has a high chance of mutating into a T, so CG dinucleotides are rarer in the genome than the independent frequencies of C and G would predict.
- Methylation is suppressed in short stretches of the genome, such as around the promoters or start regions of many genes. These stretches contain more CG dinucleotides and are called CpG islands.
- The "p" in CpG indicates that the C and the G are connected by a phosphodiester bond.

Two questions:
- Given a short stretch of genomic sequence, how would we decide whether it comes from a CpG island?
- Given a long piece of sequence, how would we find the CpG islands in it?

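Slide 4: 2. Given a short stretch of genomic sequence, how would we decide whether it comes from a CpG island?

Markov chain:
- Transition probabilities: $a_{st} = P(x_i = t \mid x_{i-1} = s)$
- Probability of a sequence $x$ of length $L$: $P(x) = P(x_1) \prod_{i=2}^{L} a_{x_{i-1} x_i}$
- Beginning and end of sequences: modeled by adding silent begin and end states, which emit no symbol. With a silent begin state $B$ taken as $x_0$, $P(x_1 = s) = a_{Bs}$ and the product runs over $i = 1, \dots, L$.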
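The sequence probability of a first-order Markov chain can be sketched in a few lines. This is only an illustration: the function name `markov_log_prob` and the uniform toy parameters are assumptions, not part of the slides, and the computation is done in log space to avoid underflow on long sequences.

```python
import math

def markov_log_prob(seq, start, trans):
    """Return log P(x) = log P(x_1) + sum_i log a_{x_{i-1} x_i}."""
    logp = math.log(start[seq[0]])
    # Walk over adjacent pairs (x_{i-1}, x_i) and add each log transition.
    for prev, cur in zip(seq, seq[1:]):
        logp += math.log(trans[prev][cur])
    return logp

# Hypothetical toy parameters: every start and transition probability is 0.25.
start = {b: 0.25 for b in "ACGT"}
uniform = {s: {t: 0.25 for t in "ACGT"} for s in "ACGT"}

logp = markov_log_prob("CGCG", start, uniform)
prob = math.exp(logp)  # equals 0.25 ** 4 for the uniform model
```

Under the uniform model every sequence of length 4 has the same probability, which makes the result easy to check by hand.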

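Slide 5: Two Markov chain models

1. CpG islands (the '+' model)
2. Remainder of the sequence (the '-' model)

Transition probabilities are obtained with the maximum likelihood estimator. For CpG islands:

$a^+_{st} = \frac{c^+_{st}}{\sum_{t'} c^+_{st'}}$

where $c^+_{st}$ is the number of times nucleotide $t$ follows nucleotide $s$ in CpG island regions. The '-' model is estimated in the same way from the remainder of the sequence.

Tables of frequencies (rows: current nucleotide $s$; columns: next nucleotide $t$). Each row sums to 1. The tables are asymmetric.

+      A      C      G      T
A    0.180  0.274  0.426  0.120
C    0.171  0.368  0.274  0.188
G    0.161  0.339  0.375  0.125
T    0.079  0.355  0.384  0.182

-      A      C      G      T
A    0.300  0.205  0.285  0.210
C    0.322  0.298  0.078  0.302
G    0.248  0.246  0.298  0.208
T    0.177  0.239  0.292  (value missing in the transcript; because each row sums to 1, it is approximately 0.292)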
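The maximum likelihood estimate is just a row-normalized count of adjacent nucleotide pairs. The sketch below assumes the training data is a list of labeled sequences; the function name `estimate_transitions` and the two toy "island" sequences are made up for illustration and are not the data behind the tables.

```python
from collections import Counter

def estimate_transitions(sequences, alphabet="ACGT"):
    """Maximum likelihood estimate a_st = c_st / sum_t' c_st' from pair counts."""
    counts = Counter()
    for seq in sequences:
        for prev, cur in zip(seq, seq[1:]):
            counts[(prev, cur)] += 1
    trans = {}
    for s in alphabet:
        row_total = sum(counts[(s, t)] for t in alphabet)
        # A nucleotide never seen as a predecessor gets an all-zero row here;
        # real training data would normally use pseudocounts instead.
        trans[s] = {
            t: (counts[(s, t)] / row_total if row_total else 0.0)
            for t in alphabet
        }
    return trans

# Hypothetical toy training set standing in for annotated CpG islands.
toy_islands = ["CGCGGC", "GCGCGA"]
a_plus = estimate_transitions(toy_islands)
```

Running the same function on sequences from outside the islands would give the corresponding '-' table.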

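Slide 6: To use this model for discrimination

Log-odds ratio:

$S(x) = \log \frac{P(x \mid \text{model } +)}{P(x \mid \text{model } -)} = \sum_{i=1}^{L} \log \frac{a^+_{x_{i-1} x_i}}{a^-_{x_{i-1} x_i}} = \sum_{i=1}^{L} \beta_{x_{i-1} x_i}$

where $x$ is the sequence (with $x_0$ the silent begin state) and $\beta$ is the log likelihood ratio of the corresponding transition probabilities. The values below are in bits (base-2 logarithms).

β      A       C       G       T
A   -0.740   0.419   0.580  -0.803
C   -0.913   0.302   1.812  -0.685
G   -0.624   0.461   0.331  -0.730
T   -1.169   0.573   0.393  -0.679

[Figure: histogram of the length-normalized scores S(x) for all the sequences (~60,000 nucleotides).]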
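The log-odds score can be recomputed directly from the two transition tables. The sketch below makes two assumptions that are not in the slides: the missing last entry of the '-' table's T row is taken as 0.292 (from the row summing to 1), and the begin-state term is ignored because no begin transitions are given, so only adjacent pairs contribute. The names `beta`, `log_odds`, and `length_normalized` are illustrative.

```python
import math

BASES = "ACGT"

# Transition tables from slide 5 (row: current base, columns: A, C, G, T).
PLUS = {
    "A": [0.180, 0.274, 0.426, 0.120],
    "C": [0.171, 0.368, 0.274, 0.188],
    "G": [0.161, 0.339, 0.375, 0.125],
    "T": [0.079, 0.355, 0.384, 0.182],
}
MINUS = {
    "A": [0.300, 0.205, 0.285, 0.210],
    "C": [0.322, 0.298, 0.078, 0.302],
    "G": [0.248, 0.246, 0.298, 0.208],
    "T": [0.177, 0.239, 0.292, 0.292],  # last value ASSUMED from row sum
}

def beta(s, t):
    """Log likelihood ratio in bits for the transition s -> t."""
    j = BASES.index(t)
    return math.log2(PLUS[s][j] / MINUS[s][j])

def log_odds(seq):
    """S(x): sum of beta over adjacent pairs (begin-state term omitted)."""
    return sum(beta(s, t) for s, t in zip(seq, seq[1:]))

def length_normalized(seq):
    """Score per nucleotide, so sequences of different length are comparable."""
    return log_odds(seq) / len(seq)

cg_score = log_odds("CGCG")   # positive: looks like a CpG island
ta_score = log_odds("TATA")   # negative: looks like background
```

A positive score favors the '+' model and a negative score favors the '-' model. Because the tables are rounded to three decimals, recomputed β values can differ from the tabulated ones in the last digit.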

Slide 7: 3. Given a long piece of sequence, how would we find the CpG islands in it?

- Use a single model for the entire sequence that incorporates both Markov chains: a hidden Markov model (HMM).
- Transition probabilities within each set of states (+ or -) are similar to those of the corresponding Markov chain.
- There is a small chance of switching between + and - regions.
- There is no one-to-one correspondence between states and symbols.

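Slide 8:

- Sequence of states (the path $\pi$, with $i$-th state $\pi_i$). Transition probabilities: $a_{kl} = P(\pi_i = l \mid \pi_{i-1} = k)$. The state sequence is hidden in an HMM.
- Sequence of symbols $x$. Emission probabilities: $e_k(b) = P(x_i = b \mid \pi_i = k)$, the probability that symbol $b$ is seen in state $k$. In the CpG island model every emission probability is 0 or 1, because each state emits exactly one symbol.

A sequence can be generated from an HMM as follows:
- A state $\pi_1$ is chosen according to the probabilities $a_{0i}$, where 0 denotes the begin state.
- In $\pi_1$ an observation is emitted according to the distribution $e_{\pi_1}$.
- A new state $\pi_2$ is chosen according to the transition probabilities $a_{\pi_1 i}$.
- And so forth, producing a sequence of random observations.

$P(x)$ is the probability that $x$ was generated by the model. The joint probability of an observed sequence $x$ and a state sequence $\pi$ is

$P(x, \pi) = a_{0 \pi_1} \prod_{i=1}^{L} e_{\pi_i}(x_i) \, a_{\pi_i \pi_{i+1}}$

where $\pi_{L+1} = 0$ denotes the end state.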
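The joint probability of a sequence and a path is a running product of transition and emission terms. The sketch below uses a deliberately simplified two-state HMM ('+' and '-', each emitting all four nucleotides) rather than the eight-state CpG model, does not model an end state, and all parameter values and the name `joint_prob` are hypothetical.

```python
import itertools

# Hypothetical two-state HMM: '+' is CG-rich, '-' is AT-rich.
STATES = "+-"
START = {"+": 0.5, "-": 0.5}
TRANS = {"+": {"+": 0.9, "-": 0.1}, "-": {"+": 0.1, "-": 0.9}}
EMIT = {
    "+": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
    "-": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
}

def joint_prob(x, path):
    """P(x, pi) = a_{0,pi_1} e_{pi_1}(x_1) * prod_i a_{pi_{i-1},pi_i} e_{pi_i}(x_i)."""
    p = START[path[0]] * EMIT[path[0]][x[0]]
    for i in range(1, len(x)):
        p *= TRANS[path[i - 1]][path[i]] * EMIT[path[i]][x[i]]
    return p

p_cg = joint_prob("CG", "++")  # 0.5 * 0.4 * 0.9 * 0.4

# Sanity check: summed over every sequence and every path of length 2,
# the joint probabilities must add up to 1.
total = sum(
    joint_prob("".join(x), "".join(path))
    for x in itertools.product("ACGT", repeat=2)
    for path in itertools.product(STATES, repeat=2)
)
```

Summing `joint_prob` over all paths for a fixed `x` gives $P(x)$, which is what the forward algorithm computes without enumerating the paths.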

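Slide 9:

Example: the probability of the sequence CGCG being emitted by the state sequence (C+, G-, C-, G+) is

$a_{0,C+} \times 1 \times a_{C+,G-} \times 1 \times a_{G-,C-} \times 1 \times a_{C-,G+} \times 1 \times a_{G+,0}$

where each factor of 1 is an emission probability. This is not very useful in practice because the path is not known.

→ Path estimation: find the most likely path.
- Viterbi algorithm
- Forward or backward algorithm

Example (CpG model): generating the symbol sequence CGCG
- Possible state sequences: (C+, G+, C+, G+), (C-, G-, C-, G-), (C+, G-, C-, G+)
- (C+, G-, C-, G+): requires switching back and forth between + and -, which has a small probability
- (C-, G-, C-, G-): small probability of CG in the '-' group
- (C+, G+, C+, G+): best option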
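The Viterbi algorithm finds the most likely path by dynamic programming instead of enumerating all state sequences. The sketch below runs it in log space on a simplified two-state HMM ('+' and '-') with hypothetical parameters; it is not the eight-state CpG model from the slides, and it does not model an end state.

```python
import math

# Hypothetical two-state HMM: '+' is CG-rich, '-' is AT-rich.
STATES = "+-"
START = {"+": 0.5, "-": 0.5}
TRANS = {"+": {"+": 0.9, "-": 0.1}, "-": {"+": 0.1, "-": 0.9}}
EMIT = {
    "+": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
    "-": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
}

def viterbi(x):
    """Return (most likely path, its log joint probability) for sequence x."""
    # v[k]: log probability of the best path that ends in state k so far.
    v = {k: math.log(START[k]) + math.log(EMIT[k][x[0]]) for k in STATES}
    backpointers = []
    for sym in x[1:]:
        ptr, new_v = {}, {}
        for l in STATES:
            # Best predecessor state for landing in state l at this position.
            best = max(STATES, key=lambda k: v[k] + math.log(TRANS[k][l]))
            ptr[l] = best
            new_v[l] = v[best] + math.log(TRANS[best][l]) + math.log(EMIT[l][sym])
        backpointers.append(ptr)
        v = new_v
    # Trace back from the best final state.
    last = max(STATES, key=lambda k: v[k])
    path = [last]
    for ptr in reversed(backpointers):
        path.append(ptr[path[-1]])
    path.reverse()
    return "".join(path), v[last]

cg_path, cg_logp = viterbi("CGCG")
at_path, at_logp = viterbi("ATAT")
```

With these toy parameters the CG-rich sequence is assigned entirely to '+' and the AT-rich one entirely to '-', mirroring the slide's conclusion that the path staying in the '+' states is the best option for CGCG.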

Slide 10: 5. Parameter Estimation for HMMs

Building an HMM involves two steps:
1. Design the structure: the states and their connections.
2. Assign the parameter values: the transition and emission probabilities, estimated with Baum-Welch and Viterbi training.

Slide 11: 6. References

- Bandyopadhyay, Sanghamitra. Gene Identification: Classical and Computational Intelligence Approach. 38 vols. IEEE, Jan. 2008.
- Durbin, R., S. Eddy, and A. Krogh. Biological Sequence Analysis. Cambridge: Cambridge University, 1998.
- Koski, Timo. Hidden Markov Models for Bioinformatics. Sweden: Kluwer Academic, 2001.
- Birney, E. "Hidden Markov models in biological sequence analysis". July 2001.
- Haussler, David, David Kulp, Martin Reese, and Frank Eeckman. "A Generalized Hidden Markov Model for the Recognition of Human Genes in DNA".
- Boufounos, Petros, Sameh El-Difrawy, and Dan Ehrlich. "Hidden Markov Models for DNA Sequencing".

