Finding Genes Based on Comparative Genomics. Robin Raffard, November 30th, 2004, CS 374.


1 Finding Genes Based on Comparative Genomics. Robin Raffard, November 30th, 2004, CS 374

2 References
Main references:
Multiple-sequence functional annotation and the generalized hidden Markov phylogeny. McAuliffe J., Pachter L., Jordan M. 2004.
Computational identification of evolutionarily conserved exons. Siepel A., Haussler D. 2004.
Additional references:
Phylogenetic shadowing of primate sequences to find functional regions of the human genome. Boffelli D., McAuliffe J., Ovcharenko D., Lewis K., Ovcharenko I., Pachter L., Rubin E.
A hidden Markov model approach to variation among sites in rate of evolution. Felsenstein J., Churchill G.
Statistics for Biology and Health. Ewens W., Grant G.

3 Problem formulation: DNA consists of genes (functional sequences) separated by intergenic regions (nonfunctional sequences). [Figure: DNA strand with Gene 1, Gene 2, and Gene 3 separated by intergenic regions; example sequence ATCATTACGCGGCTTAGCCCTTATAGCGATACGATGACAGATGACAA]

4 Problem formulation: DNA consists of genes (functional sequences) separated by intergenic regions (nonfunctional sequences). [Figure: DNA strand with Gene 1, Gene 2, and Gene 3]

6 Problem formulation: DNA consists of genes (functional sequences) separated by intergenic regions (nonfunctional sequences). Problem: find genes using comparative genomics. Key idea: exons are conserved through evolution. [Figure: DNA strand with Gene 1, Gene 2, and Gene 3]

7 In Practice
>human
AGTGAGACACGACGAGCCTACTATCAGGACGAGAGCAGGAGAGTGAT
GATGAGTAGCGCACAGCGACGATCATCACGAGAGAGTAAGAAGCAGTG
ATGATGTAGAGCGACGAGAGCACAGCGGCGACTACTACTAGG
>mouse
AGTGTGTCTCGTCGTGCCTACTTTCAGGACGAGAGCAGGTGAGTGTTG
ATGAGTTGCGCTCTGCGACGTTCATCTCGAGTGAGTTAGAAAGTGAAG
GTATAACACAAGGTGTGAAGGCAGTGATGATGTAGAGCGACGAGAGCA
CAGCGGCGGGATGATATATCTAGGAGGATGCCCAATTTTTTTTT
>platypus
CTCTGCGGCGTTCGTCTCGGGTGGGTTGGGGGGTGGGGGTGTGGCG
CAAGGTGTGAAGCACGACGACGATCTACGACGAGCGAGTGATGAGAG
TGATGAGCGACGACGAGCACTAGAAGCGACGACTACTATCGACGAGCA
GCCGAGATGATGATGAAAGAGAGAGAA
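The input on this slide is in FASTA format: a ">" header line naming the species, followed by one or more sequence lines. As a minimal sketch of how such input could be loaded before any comparison, the hypothetical helper `parse_fasta` below (not part of the presentation) joins the sequence lines under each header. The short sequences in the example are truncated for illustration.

```python
def parse_fasta(text):
    """Return {name: sequence} from FASTA-formatted text."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Header line: everything after ">" is used as the record name.
            name = line[1:].strip()
            records[name] = []
        elif name is not None:
            # Sequence line: belongs to the most recent header.
            records[name].append(line)
    return {key: "".join(parts) for key, parts in records.items()}


example = ">human\nAGTG\nAGAC\n>mouse\nAGTGTG\n"
sequences = parse_fasta(example)
print(sequences)
```

The parsed sequences still have to be aligned before they can be compared column by column; alignment is outside the scope of this sketch.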

8 Two questions. 1st question: which genomes to compare, human/mouse or human/primates? 2nd question: how to extract genes from this comparison?

9 Outline
Human/Mouse vs Human/Primate
– Advantages of Human/Mouse
– Advantages of Human/Primate
– Conclusion
Gene Finding
– Phylogenetic tree
– Hidden Markov Chain
– Hidden Markov Phylogeny
Contributions of the 2 papers

10 Functional sequences in Human/Mouse/Primates [Figure: % similarity plotted along the DNA sequence]

11 Advantage of Human/Mouse: it is easy to identify the functional sequences.
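The similarity plot behind this comparison can be approximated by sliding a window along two aligned sequences and counting matching positions. The sketch below assumes the sequences are already aligned to the same length; the function name `window_identity`, the window and step sizes, and the toy sequences are all made up for illustration.

```python
def window_identity(seq_a, seq_b, window=10, step=5):
    """Percent identity of two aligned, equal-length sequences per window."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    scores = []
    for start in range(0, len(seq_a) - window + 1, step):
        a = seq_a[start:start + window]
        b = seq_b[start:start + window]
        # A position counts as a match only if both bases agree and are not gaps.
        matches = sum(1 for x, y in zip(a, b) if x == y and x != "-")
        scores.append((start, 100.0 * matches / window))
    return scores


# Toy aligned sequences (invented, not real human/mouse data).
human = "ATGGCCAAGTTTACGGATCC"
mouse = "ATGGCCAAGTCATCGAATGC"
profile = window_identity(human, mouse)
for start, pct in profile:
    print(start, pct)
```

Windows with high identity are candidate functional regions in a human/mouse comparison, because non-coding sequence has had time to diverge.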

12 Disadvantage of Human/Mouse: some human genes are not present in the mouse genome, so they cannot be extracted from a Human/Mouse comparison. [Figure: Human vs. Mouse]

13 Human/Primates

14 Phylogenetic shadowing

15 Phylogenetic shadowing on real data [Figure: likelihood of mutation (log scale) plotted along the DNA sequence]

16 Motivating example: gene apo(a). Plasma protein. Important predictor of cardiovascular disease risk. [Figure labels: Absent, Present]

17 Phylogenetic shadowing of apo(a) [Figure: likelihood of mutation (log scale) plotted along the DNA sequence]

18 So Human/Mouse or Human/Primate? Old genes: Human/Mouse (non-coding sequences are strongly different). New genes: Human/Primate (straightforward alignment of coding sequences).

19 Outline
Human/Mouse vs Human/Primate
– Advantages of Human/Mouse
– Advantages of Human/Primate
– Conclusion
Gene Finding
– Phylogenetic tree
– Hidden Markov Chain
– Hidden Markov Phylogeny
Contributions of the 2 papers

20 Naive way of extracting genes. Drawbacks: 1. Is not flexible/probabilistic. 2. Does not respect gene structure.

21 1st step: Phylogenetic tree. Given a nucleotide, is it functional or not? [Figure labels: Species, Nucleotide 1, Nucleotide 2]

22 Primate phylogeny [Figure: tree with nucleotides A, T, T, G, A, A]

23 [Figure: tree with observed nucleotides A, A, T, A, G, A, A, A, C, A] Which nucleotide? Which rate α?

24 Algorithm: given the observed nucleotides, find the most likely rate α, that is, the α that maximizes P(observed nucleotides | α). [Slide equations not preserved in the transcript]
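As a rough sketch of this step, the likelihood of one alignment column for a given rate α can be computed with Felsenstein's pruning recursion and then maximized over a grid of candidate rates. Everything concrete below is an assumption made for illustration: the Jukes-Cantor substitution model, the three-species tree, its branch lengths, and the function names. The slides do not say which substitution model or optimizer the papers use.

```python
import math

BASES = "ACGT"


def jc_prob(a, b, t):
    """Jukes-Cantor probability that base a becomes b along a branch of length t."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if a == b else 0.25 - 0.25 * e


def partial(node, column, alpha):
    """Pruning step: P(observed leaves below node | base at node), for each base."""
    if isinstance(node, str):  # leaf, identified by species name
        return {x: 1.0 if x == column[node] else 0.0 for x in BASES}
    result = {x: 1.0 for x in BASES}
    for child, length in node:  # internal node: list of (child, branch length)
        below = partial(child, column, alpha)
        for x in BASES:
            # The rate alpha rescales every branch length.
            result[x] *= sum(jc_prob(x, y, alpha * length) * below[y] for y in BASES)
    return result


def column_likelihood(tree, column, alpha):
    """P(column | tree, alpha) with a uniform base distribution at the root."""
    root = partial(tree, column, alpha)
    return sum(0.25 * root[x] for x in BASES)


def best_rate(tree, column, rates):
    """Grid search for the rate that maximizes the column likelihood."""
    return max(rates, key=lambda a: column_likelihood(tree, column, a))


# Hypothetical tree and branch lengths, chosen only for the example.
tree = [([("human", 0.1), ("chimp", 0.1)], 0.05), ("baboon", 0.2)]
rates = [0.1, 1.0, 3.0]
conserved = {"human": "A", "chimp": "A", "baboon": "A"}
variable = {"human": "A", "chimp": "G", "baboon": "T"}
print(best_rate(tree, conserved, rates), best_rate(tree, variable, rates))
```

A column where all species agree is best explained by the slowest rate, and a column where they all differ by the fastest one, which is the signal phylogenetic shadowing uses to separate conserved from fast-evolving positions.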

25 Phylogenetic tree: Results Drawback: No biological model built in

26 Gene structure. A gene finder should account for: the promoter region, about 50 bases upstream of the gene; TATA: start of transcription; the 5’ untranslated region; the 3’ untranslated region.

27 Gene Model [Figure: state diagram with states S1 to S6; labels: TATA, Exon, Intron]

28 Hidden Markov Chain Model. Composed of:
1. A sequence of unobservable states S1, S2, S3, …, Sn, with Si = exon or intron. The jump from Si to Si+1 follows a Markov chain: P(Si+1 | Si).
2. A sequence of (sequences of) letters O1, O2, O3, …, On, which are emitted by the states (according to P(Oi | Si)) and which are observed.
[Figure: chain S1 → S2 → … → S7, each Si emitting Oi; O = ACGTACG…; example transition P(S5 | S4); example emission P(O1 | S1)]

29 Viterbi Algorithm. Given a sequence of observed letters O1, …, On, find the most likely sequence of unobservable states S1, …, Sn. Mathematically, find the S that maximizes Prob(S, O). 2 steps:
1. Compute max Prob(S, O) via dynamic programming: max Prob(S1,…,Si+1, O) = f(max Prob(S1,…,Si, O)).
2. Find a sequence of states that achieves the optimum: Si = argmax max Prob(S1,…,Si, O).
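The two steps can be sketched as follows for a plain two-state HMM. The state names follow the slides (exon, intron), but the start, transition, and emission probabilities are invented for the example, and the computation is done in log space to avoid underflow on long sequences.

```python
import math


def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most likely state path for obs under an HMM (log-space dynamic programming)."""
    # Step 1: best[i][s] = max log Prob(S1..Si with Si = s, O1..Oi).
    best = [{s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]]) for s in states}]
    back = []
    for o in obs[1:]:
        scores, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda r: best[-1][r] + math.log(trans_p[r][s]))
            pointers[s] = prev
            scores[s] = (best[-1][prev] + math.log(trans_p[prev][s])
                         + math.log(emit_p[s][o]))
        best.append(scores)
        back.append(pointers)
    # Step 2: trace back from the best final state.
    path = [max(states, key=lambda s: best[-1][s])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


# Toy parameters (assumed): exons are GC-rich, introns are AT-rich.
states = ["exon", "intron"]
start_p = {"exon": 0.5, "intron": 0.5}
trans_p = {"exon": {"exon": 0.9, "intron": 0.1},
           "intron": {"exon": 0.1, "intron": 0.9}}
emit_p = {"exon": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
          "intron": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4}}
path = viterbi("GCGCGCATATAT", states, start_p, trans_p, emit_p)
print(path)
```

The running time is proportional to the sequence length times the square of the number of states, which is what makes the dynamic program practical on genomic sequences.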

30 Generalized hidden Markov phylogeny. Combines the 2 concepts: phylogenetic tree + hidden Markov chain = generalized hidden Markov phylogeny.

31 Global Method
1. Get a series of DNA sequences.
2. Align them.
3. Build the Generalized Hidden Markov Model.
4. Train the parameters on sample genes.
5. Find the hidden states Si.
6. The coding sequences are the exons.
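A heavily simplified sketch of the decoding step of this pipeline: each hidden state has its own evolutionary rate, the emission probability of an alignment column is its likelihood on the tree at that rate, and Viterbi picks the best state per column. This is not the papers' model. The Jukes-Cantor model, the tree, the rates, the transition probability `stay`, and the toy alignment are all assumptions; training is skipped, states emit single columns rather than whole segments, and gaps are not handled.

```python
import math

BASES = "ACGT"


def jc_prob(a, b, t):
    """Jukes-Cantor probability of base a becoming b over branch length t."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if a == b else 0.25 - 0.25 * e


def column_likelihood(tree, column, rate):
    """P(column | tree, rate) by pruning; a node is a name or a list of (child, length)."""
    def partial(node):
        if isinstance(node, str):
            return {x: float(x == column[node]) for x in BASES}
        out = {x: 1.0 for x in BASES}
        for child, length in node:
            below = partial(child)
            for x in BASES:
                out[x] *= sum(jc_prob(x, y, rate * length) * below[y] for y in BASES)
        return out
    return sum(0.25 * p for p in partial(tree).values())


def decode(alignment, tree, rates, stay=0.9):
    """Viterbi over alignment columns; each state emits a column at its own rate."""
    states = list(rates)
    species = list(alignment)
    n = len(alignment[species[0]])
    switch = (1.0 - stay) / (len(states) - 1)

    def log_emit(s, i):
        col = {sp: alignment[sp][i] for sp in species}
        return math.log(column_likelihood(tree, col, rates[s]))

    def log_trans(r, s):
        return math.log(stay if r == s else switch)

    best = {s: math.log(1.0 / len(states)) + log_emit(s, 0) for s in states}
    back = []
    for i in range(1, n):
        ptr = {s: max(states, key=lambda r: best[r] + log_trans(r, s)) for s in states}
        best = {s: best[ptr[s]] + log_trans(ptr[s], s) + log_emit(s, i) for s in states}
        back.append(ptr)
    path = [max(states, key=lambda s: best[s])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]


# Hypothetical tree, rates, and gap-free alignment: 6 conserved then 6 variable columns.
tree = [([("human", 0.1), ("chimp", 0.1)], 0.05), ("baboon", 0.2)]
rates = {"exon": 0.2, "intron": 2.0}
alignment = {"human": "ATGGCCACGTAC", "chimp": "ATGGCCCGTACG", "baboon": "ATGGCCGTACGT"}
labels = decode(alignment, tree, rates)
print(labels)
```

Columns labeled exon in the output correspond to the last step of the method: the predicted coding sequence is the run of positions whose hidden state is exon.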

32 Contributions of the 1st paper: first to implement the Hidden Markov Phylogeny on the Primate/Human phylogeny. Requires only 5 primate species. Able to find the apo(a) gene. [Figure: Gene Finders]

33 Contributions of the 2nd paper: implements a sophisticated Hidden Markov Phylogeny on the Human/Mouse phylogeny.
1. Context-dependent phylogenetic models (high-order Markov chain: the emission of one state also depends on the neighboring states). More computationally expensive, but better.
2. Explicit modeling of conserved non-coding sequences.
3. Modeling of insertions and deletions.

34 Results of the 2nd paper [Figure: Gene Finders]

35 Conclusion. Genes are found by comparing genomes: Mouse/Human for old genes, Primate/Human for recent genes. In either case, the same tool extracts the coding sequences: Hidden Markov Phylogeny. Future: improve the Markov model, sequence more genomes.

36 Thank you! Questions?

