
Center for Biological Sequence Analysis Prokaryotic gene finding Marie Skovgaard Ph.D. student


1 Center for Biological Sequence Analysis Prokaryotic gene finding Marie Skovgaard Ph.D. student marie@cbs.dtu.dk

2 Prokarya


4 Can you spot the gene?
>AE006641
GTATACTCTTCTTCCCTATACATTGTCGCAGCAAGCTTAGTTTCTTTAGCCTCTCTGCTTTCATTATTAC
TTATAATCTTAATAGCAAGGAGACATATGATAGAGTATTTCTATATGATTCCTTCGTTCGTTTATATGAA
CTTTATTGTCGCACTAAACTTCACTGCAATATTTTTAGAGTTAATAAGAGCACCTAGAGTGTGGGTAAAA
ACTGAAAGAAGTGCCAAGGTTACGGGGGAGGTCATGGGATGATAACTGAATTTTTACTTAAAAAGAAATT
AGAAGAACATTTAAGCCATGTAAAGGAAGAGAATACGATATATGTAACAGATTTAGTAAGATGCCCCAGA
AGAGTAAGATATGAGAGTGAATACAAGGAGCTTGCAATCTCTCAGGTTTACGCGCCTTCAGCTATTTTAG
GGGACATATTGCATCTCGGTCTTGAAAGCGTATTAAAAGGGAACTTTAATGCAGAAACTGAAGTTGAAAC
TCTGAGAGAAATTAACGTCGGAGGTAAAGTTTATAAAATTAAAGGAAGAGCCGATGCAATAATTAGAAAT
GACAACGGGAAGAGTATTGTAATTGAGATAAAAACTTCTAGAAGTGATAAAGGATTACCTCTAATTCATC
ATAAAATGCAGCTACAGATATATTTATGGTTATTTAGTGCAGAAAAAGGTATACTAGTTTACATAACTCC
AGATAGGATAGCTGAGTATGAAATAAACGAACCTTTAGATGAAGCAACAATAGTAAGACTTGCAGAGGAT
ACAATAATGTTACAAAACTCACCTAGATTCAACTGGGAATGTAAATATTGCATATTTTCCGTCATTTGCC
CAGCTAAACTAACCTAAAATTAAAATCTCTCATCGATATAATTAAATTGTGCACACTAGACCAGTAGTTG
CCACAATAGCTGGGAGTGACAGTGGAGGAGGTGCTGGATTACAGGCTGATCTAAAGACGTTTAGCGCATT
AGGAGTTTTTGGTACAACAATAATAACCGGTTTAACAGCACAGAATACAAGAACAGTTACAAAAGTATTA
GAGATACCATTAGATTTCATTGAAGCTCAGTTTGATGCGGTTTGCCTAGATTTACATCCAACTCACGCCA
AAACTGGAATGTTAGCTTCTGGTAAAGTGGTAGAACTTGTACTGAGAAAAATTAGAGAGTATAACATAAA
ACTAGTTTTAGATCCAGTGATGGTTGCGAAATCTGGATCATTATTGGTAACAGAGGATATCTCGGAGCAA
ATAAAAAAGGCGATGAAGGAGGCCATAATATCTACTCCAAACAGATATGAAGCTGAGATAATAAATAAGA
CAAAGATTAATAGTCAAGATGATGTTATAAAAGCGGCAAGGGAAATTTATTCTAAGTATGGGAATGTTGT
AGTTAAAGGATTTAATGGAGTAGATTACGCCATAATTGACGGAGAAGAAATAGAGTTAAAAGGTGATTAC
ATCAGTACTAAAAATACACATGGTAGTGGAGACGTATTTTCTGCCTCCATAACTGCATATCTTGCCTTGG
GATACAAACTTAAAGATGCATTAATAAGAGCTAAAAAATTCGCTACAATGACAGTCAAATACGGTTTGGA
CTTAGGAGGAGGATATGGACCAGTAGATCCCTTTGCCCCTATAGAGTCCATAGTGAAGAGAGAAGAAGGA
AGAAATCAGCTAGAAAACTTACTTTGGTACTTAGAGTCTAATCTTAACGTTATACTTAAACTAATTAACG
/(ATG|TTG|GTG)((...)*?)(TGA|TAG|TAA)/

5 Identifying open reading frames
/(ATG|TTG|GTG)((...)*?)(TGA|TAG|TAA)/
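The pattern on this slide can be tried out directly. The following is a minimal Python sketch, assuming the three dots in the pattern stand for "any three characters" (one codon) and that only the forward strand is scanned. The function name `find_orfs`, the `min_length` parameter and the toy sequence are illustrative and not part of the original slides.

```python
import re

# The slide's pattern wrapped in a lookahead, so that ORFs overlapping in
# different reading frames are all reported (assumption: forward strand only).
# The lazy codon run stops at the first in-frame stop codon.
ORF_RE = re.compile(r"(?=((?:ATG|TTG|GTG)(?:...)*?(?:TGA|TAG|TAA)))")


def find_orfs(dna, min_length=0):
    """Return (start, end, sequence) tuples; 0-based start, exclusive end."""
    orfs = []
    for match in ORF_RE.finditer(dna.upper()):
        orf = match.group(1)
        if len(orf) >= min_length:
            start = match.start(1)
            orfs.append((start, start + len(orf), orf))
    return orfs


if __name__ == "__main__":
    # Toy sequence, not taken from AE006641.
    for start, end, orf in find_orfs("CCATGAAATTTTAGGG"):
        print(start, end, orf)
```

Every start codon that is followed by an in-frame stop is reported, which is why a plain ORF scan tends to produce many more candidates than there are real genes.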

6 A. pernix (43% AT)

7 Why care about over-annotated genes?
Genome comparison:
  Fraction of known proteins
  Average gene length
  Amino acid composition
The quality of our databases
To gain biological knowledge

8 Regular expression
Regular expression: /[AT][CG][AC][ACGT]*A[TG][CG]/
The regular expression finds all possible sequences, but it does not distinguish between the consensus sequence and a highly unlikely sequence: ACAC--ATC or TGCT--AGG.
Weight matrices can be used to score the sequence, but they do not deal with insertions and deletions.
ACA---ATG
TCAACTATC
ACAC--AGC
AGA---ATG
ACCG--ATC
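To make the limitation concrete, here is a small Python check showing that the regular expression accepts both the consensus and the unlikely sequence equally. The gap characters are removed before matching; the dictionary names are illustrative.

```python
import re

MOTIF = re.compile(r"[AT][CG][AC][ACGT]*A[TG][CG]")

# Sequences from the slide with the alignment gaps ("-") removed.
candidates = {
    "consensus": "ACAC--ATC".replace("-", ""),
    "unlikely": "TGCT--AGG".replace("-", ""),
}

# A regular expression only answers yes or no; it gives no score.
results = {name: bool(MOTIF.fullmatch(seq)) for name, seq in candidates.items()}

if __name__ == "__main__":
    print(results)
```

Both entries come out as matches, so the pattern alone cannot rank one above the other.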

9 Markov model
Emission probabilities read from the figure ("-" means no value is shown):
State    A    C    G    T
1        0.8  -    -    0.2
2        -    0.8  0.2  -
3        0.8  0.2  -    -
4        1.0  -    -    -
5        -    -    0.2  0.8
6        -    0.8  0.2  -
Insert   0.2  0.4  0.2  0.2
Transition probabilities shown in the figure: 1.0, 0.4, 1.0, 0.6, 0.4
Alignment:
ACA---ATG
TCAACTATC
ACAC--AGC
AGA---ATG
ACCG--ATC
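Emission probabilities of this kind can be estimated by counting letters per alignment column. The following is a minimal Python sketch, assuming that columns without gaps correspond to match states and that all letters in gapped columns are pooled into a single insert state; the function name is illustrative.

```python
from collections import Counter

ALIGNMENT = ["ACA---ATG", "TCAACTATC", "ACAC--AGC", "AGA---ATG", "ACCG--ATC"]


def column_frequencies(alignment):
    """Return (match_state_frequencies, insert_state_frequencies)."""
    match_states = []
    insert_counts = Counter()
    for i in range(len(alignment[0])):
        column = [seq[i] for seq in alignment]
        if "-" in column:
            # Gapped column: pool its letters into the insert state.
            insert_counts.update(c for c in column if c != "-")
        else:
            counts = Counter(column)
            match_states.append({b: n / len(column) for b, n in counts.items()})
    total = sum(insert_counts.values())
    insert_state = {b: n / total for b, n in insert_counts.items()}
    return match_states, insert_state


if __name__ == "__main__":
    match_states, insert_state = column_frequencies(ALIGNMENT)
    for number, freqs in enumerate(match_states, start=1):
        print(number, freqs)
    print("insert", insert_state)
```

For the alignment as printed, the first five match columns and the insert state reproduce the values on the slide. The last column gives C 0.6 and G 0.4, whereas the slide shows C 0.8 and G 0.2, so the figure may have been drawn from a slightly different alignment.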

10 Profile HMM
Profile HMMs have a predefined architecture, and their parameters are estimated from multiple sequence alignments.
Profile HMMs are not useful for gene finding, since the genes in an organism cannot all be aligned in a meaningful way.
[Figure: profile HMM architecture running from a Begin state to an End state]

11 Markov model for gene finding
Define a simple architecture: /(ATG|TTG|GTG)((...)*?)(TGA|TAG|TAA)/
[Figure: five states. S1 emits a start codon (ATG, GTG, TTG); S2, S3 and S4 each emit one nucleotide (A, T, G, C); S5 emits a stop codon (TAG, TAA, TGA)]

12 Markov models
Knowledge of the structure of genes is used to define the architecture of the model.
Sequences (x) from known genes are used to estimate the parameters of the model; this is the training of the model.
Training is done by counting the number of times a nucleotide occurs in a given state and dividing this count by the number of sequences used in training, which gives the frequencies.
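Training by counting can be sketched in a few lines of Python. This is a minimal sketch, assuming the five-state architecture (S1 start codon, S2 to S4 the three codon positions, S5 stop codon) and a made-up training set. For the looping states S2 to S4 the counts are divided by the number of visits to the state, which equals the number of sequences only for states visited once per gene; that normalisation is an assumption of this sketch.

```python
from collections import Counter

STATES = ("S1", "S2", "S3", "S4", "S5")


def train(genes):
    """Estimate emission frequencies per state from known gene sequences."""
    counts = {state: Counter() for state in STATES}
    for gene in genes:
        if len(gene) < 6 or len(gene) % 3 != 0:
            raise ValueError("gene length must be a multiple of 3: " + gene)
        counts["S1"][gene[:3]] += 1      # start codon
        counts["S5"][gene[-3:]] += 1     # stop codon
        for i, base in enumerate(gene[3:-3]):
            counts[STATES[1 + i % 3]][base] += 1   # codon position 1, 2 or 3
    model = {}
    for state, counter in counts.items():
        total = sum(counter.values())
        model[state] = {sym: n / total for sym, n in counter.items()} if total else {}
    return model


if __name__ == "__main__":
    # Hypothetical training genes, for illustration only.
    toy_genes = ["ATGAAATAG", "ATGACGTAA", "GTGCATTAA", "ATGAAATGA"]
    for state, freqs in train(toy_genes).items():
        print(state, freqs)
```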

13 Training
Sequence: x1 x2 x3 x4 x5 x6 x7 x8 x9 x10 ... xn
States: S1 S2 S3 S4 S5
[Figure: the sequence positions are assigned to the states of the five-state model: S1 (ATG, GTG, TTG), S2 to S4 (A, T, G, C each), S5 (TAG, TAA, TGA)]

14 Model after training
S1 (start codon): ATG: 0.77 TTG: 0.11 GTG: 0.12 CTG: 0.00
S2: A: 0.22 T: 0.24 G: 0.27 C: 0.27
S3: A: 0.25 T: 0.23 G: 0.27 C: 0.25
S4: A: 0.26 T: 0.24 G: 0.25 C: 0.25
S5 (stop codon): TAG: 0.6 TAA: 0.3 TGA: 0.1
Transition probability shown in the figure: 0.98
The trained model can be used to search for genes in DNA sequences.

15 Searching with the HMM
Sequence: ATG A T T T C G C G C G A T ... T A G
States   ATG    A             T                  T
S1       0.77   0.00          0.00               0.00
S2       0.00   (0.22*0.77)   0.00               0.00
S3       0.00   0.00          (0.23*0.22*0.77)   0.00
S4       0.00   0.00          0.00               (0.24*0.23*0.22*0.77)
S5       0.00   0.00          0.00               0.00
Continuing in this way to the end of the sequence gives P(x|M).
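The running products on this slide can be reproduced with a short Python sketch. It assumes the emission values from the "Model after training" slide, with the three nucleotide distributions assigned to S2, S3 and S4 in the order implied by the arithmetic above. Like the slide, it ignores transition probabilities and stops before the stop codon; the function name is illustrative.

```python
START = {"ATG": 0.77, "TTG": 0.11, "GTG": 0.12, "CTG": 0.00}   # S1
CODON_POSITIONS = [
    {"A": 0.22, "T": 0.24, "G": 0.27, "C": 0.27},   # S2
    {"A": 0.25, "T": 0.23, "G": 0.27, "C": 0.25},   # S3
    {"A": 0.26, "T": 0.24, "G": 0.25, "C": 0.25},   # S4
]


def prefix_probabilities(seq):
    """Probability after the start codon and after each following base."""
    probs = [START.get(seq[:3], 0.0)]
    for i, base in enumerate(seq[3:]):
        # Cycle through S2, S3, S4 for consecutive bases.
        probs.append(probs[-1] * CODON_POSITIONS[i % 3][base])
    return probs


if __name__ == "__main__":
    for step, p in enumerate(prefix_probabilities("ATGATT")):
        print(step, p)
```

The values shrink with every additional base, which is the motivation for working with log-odds scores rather than raw probabilities.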

16 Log-odds score
The probability of a sequence becomes vanishingly small as the sequence x gets longer.
This is handled by defining a background (NULL) model, for example a random distribution: A=T=C=G=0.25.
From this the log-odds score can be calculated: log(P(x|M)/P(x|NULL))
A high log-odds score corresponds to a sequence that looks more like the gene model than the background model.
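A minimal Python sketch of the log-odds idea, using the convention log(P(x|M)/P(x|NULL)) so that a positive score favours the gene model. The emission values are those shown for S2 to S4 on the "Model after training" slide; summing per-base log ratios instead of multiplying probabilities, the base-2 logarithm, and the function name are choices made for this sketch.

```python
import math

CODING = [
    {"A": 0.22, "T": 0.24, "G": 0.27, "C": 0.27},   # S2
    {"A": 0.25, "T": 0.23, "G": 0.27, "C": 0.25},   # S3
    {"A": 0.26, "T": 0.24, "G": 0.25, "C": 0.25},   # S4
]
NULL_PROB = 0.25   # background model: A=T=C=G=0.25


def log_odds(bases, emissions=CODING, null_prob=NULL_PROB):
    """Sum of per-base log2 ratios between the model and the background."""
    return sum(
        math.log2(emissions[i % len(emissions)][base] / null_prob)
        for i, base in enumerate(bases)
    )


if __name__ == "__main__":
    # The raw probability of a long sequence underflows to zero in floating point.
    print(0.25 ** 2000)
    # The log-odds score stays in a usable range.
    print(log_odds("ATT"), log_odds("GGG"))
```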

17 Is the model too simple?
[Figure: the five-state model: S1 (ATG, GTG, TTG), S2 to S4 (A, T, G, C each), S5 (TAG, TAA, TGA)]

18 Codon usage
Synonymous codons encode the same amino acid.
If codon choice were random, synonymous codons would be expected to be used with equal frequencies.
In real genomes, synonymous codons are used with different frequencies.
Different species have consistent and characteristic codon biases.
Laterally transferred genes and genes from plasmids and phages will have atypical codon usage.
Variations in codon usage within an organism can be modelled with different coding models in the HMM.

19 Codon usage table
Fields: [number] [amino acid]. Rows give the 1st and 3rd position, columns give the 2nd position.
1st  3rd | 2nd: U     | 2nd: C     | 2nd: A      | 2nd: G
U    U   | 30,407 Phe | 11,523 Ser | 22,048 Tyr  | 7,062 Cys
U    C   | 22,581 Phe | 11,766 Ser | 16,669 Tyr  | 8,846 Cys
U    A   | 18,943 Leu | 9,793 Ser  | 2,706 Stop  | 1,260 Stop
U    G   | 18,629 Leu | 12,195 Ser | 326 Stop    | 20,756 Trp
C    U   | 15,018 Leu | 9,569 Pro  | 17,631 His  | 28,458 Arg
C    C   | 15,104 Leu | 7,491 Pro  | 13,272 His  | 29,968 Arg
C    A   | 5,316 Leu  | 11,496 Pro | 20,912 Gln  | 4,860 Arg
C    G   | 71,710 Leu | 31,614 Pro | 39,285 Gln  | 7,404 Arg
A    U   | 41,375 Ile | 12,223 Thr | 24,189 Asn  | 11,982 Ser
A    C   | 34,261 Ile | 31,889 Thr | 29,529 Asn  | 21,907 Ser
A    A   | 5,967 Ile  | 9,683 Thr  | 45,812 Lys  | 2,899 Arg
A    G   | 37,994 Met | 19,682 Thr | 14,076 Lys  | 1,694 Arg
G    U   | 24,910 Val | 20,808 Ala | 43,817 Asp  | 33,731 Gly
G    C   | 20,800 Val | 34,770 Ala | 25,996 Asp  | 40,396 Gly
G    A   | 14,850 Val | 27,468 Ala | 53,780 Glu  | 10,902 Gly
G    G   | 35,979 Val | 45,862 Ala | 24,312 Glu  | 15,118 Gly
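Codon bias can be read off such a table by comparing synonymous codons. The following is a minimal Python sketch for leucine, assuming the counts are arranged in the standard codon-table layout (third position in the order U, C, A, G within each block); the names are illustrative.

```python
# Leucine counts taken from the table, assuming the standard layout.
LEU_COUNTS = {
    "UUA": 18943, "UUG": 18629,
    "CUU": 15018, "CUC": 15104, "CUA": 5316, "CUG": 71710,
}


def relative_usage(counts):
    """Fraction of each codon among the synonymous codons supplied."""
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()}


if __name__ == "__main__":
    usage = relative_usage(LEU_COUNTS)
    uniform = 1 / len(LEU_COUNTS)   # expectation if usage were random
    for codon, fraction in sorted(usage.items(), key=lambda kv: -kv[1]):
        print(codon, round(fraction, 3), "uniform:", round(uniform, 3))
```

Under these assumptions CUG accounts for close to half of all leucine codons, far from the one-in-six share expected if the six codons were used equally.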

20 Is the model too simple?
[Figure: S1 emits a start codon (ATG, GTG, TTG), S2 emits any of the 64 codons (AAA ... CCC), S3 emits a stop codon (TAG, TAA, TGA)]

21 HMM for gene finding
[Figure: four states S1 to S4: a start-codon state (ATG, GTG, TTG), two coding states that each emit any of the 64 codons (AAA ... CCC), and a stop-codon state (TAG, TAA, TGA)]

22 Multiple coding models
[Figure: a start state S (ATG, GTG, TTG), a set of parallel coding states that each emit any of the 64 codons (AAA ... CCC), and an end state E (TAG, TAA, TGA)]

23 Order of the model
A zero-order Markov model (state) has a probability for each letter in the state; the probabilities are independent of the preceding sequence.
The NULL model is a zero-order Markov model (A=T=G=C=0.25).
The probability of a letter in a first-order Markov model depends on the previous letter (dinucleotide distributions).
In a second-order model it depends on the two previous letters (corresponding to a codon).
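The effect of the model order can be illustrated by estimating conditional letter frequencies for different context lengths. This is a minimal Python sketch with a made-up training sequence; the function name and the data are illustrative.

```python
from collections import Counter, defaultdict


def train_markov(sequences, order):
    """Estimate P(letter | previous `order` letters) by counting."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for i in range(order, len(seq)):
            context = seq[i - order:i]   # empty string for order 0
            counts[context][seq[i]] += 1
    return {
        context: {letter: n / sum(c.values()) for letter, n in c.items()}
        for context, c in counts.items()
    }


if __name__ == "__main__":
    toy = ["ATGATGATC"]   # hypothetical training data
    for order in (0, 1, 2):
        print(order, train_markov(toy, order))
```

An order-k model over four letters has up to 4^k contexts with four probabilities each, which is why higher orders need more training data.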

24 Order of the coding model
Inter-codon dependencies are correlations between amino acids typically found in proteins. They reflect typical features of proteins and can be used to improve the performance of the gene finder.
Using higher-order coding models in gene finding is a way to capture these inter-codon dependencies.
Higher-order models require more training data and more computation time when searching.

25 The Shine-Dalgarno sequence
The ribosome binds to the messenger RNA through base pairing between the mRNA and the 16S rRNA of the 30S ribosomal subunit. The binding site on the mRNA is the Shine-Dalgarno sequence (SD).
The SD is a purine-rich sequence (consensus sequence: AGGAG) at the 5' end of most prokaryotic mRNAs.
The SD is found 5-10 bases upstream of the start codon.
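A search for the SD upstream of a candidate start codon can be sketched as follows. This minimal Python example assumes an exact match to the consensus AGGAG and interprets "5-10 upstream" as the number of bases between the end of the motif and the start codon. Real SD sequences vary, so a practical gene finder would score the site rather than require an exact match. The function name, parameters and test sequence are illustrative.

```python
def find_shine_dalgarno(seq, start, motif="AGGAG", min_gap=5, max_gap=10):
    """Return positions of `motif` ending min_gap..max_gap bases before `start`."""
    hits = []
    for gap in range(min_gap, max_gap + 1):
        pos = start - gap - len(motif)
        if pos >= 0 and seq[pos:pos + len(motif)] == motif:
            hits.append(pos)
    return hits


if __name__ == "__main__":
    # Hypothetical sequence: SD motif, a 6-base spacer, then the start codon.
    seq = "TT" + "AGGAG" + "ACTATT" + "ATGAAATAA"
    start = 13   # index of the ATG start codon
    print(find_shine_dalgarno(seq, start))
```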

26 EasyGene


28 R. prowazekii

29 GeneMark.hmm
http://opal.biology.gatech.edu/GeneMark/gmhmm2_prok.cgi
Lukashin A. and Borodovsky M., “GeneMark.hmm: new solutions for gene finding”, NAR, 1998, Vol. 26, No. 4, pp. 1107-1115.
EasyGene
http://cbs.dtu.dk/services/EasyGene
Schou Larsen T. and Krogh A., “EasyGene – A prokaryotic gene finder that ranks ORFs by statistical significance”, BMC Bioinformatics 2003, 4:21.

