”Gene Finding in Eukaryotic Genomes”


1 ”Gene Finding in Eukaryotic Genomes”
DTU course #27803 Fall 2003 Nikolaj Blom Center for Biological Sequence Analysis BioCentrum-DTU Technical University of Denmark

2 Human Genome Published
HUGO: Nature, 15 Feb 2001
Celera: Science, 16 Feb 2001

3 We Have the Human Genome Sequence...now what?
So, what is the problem? Well...
We don’t know how many genes there are!
We don’t know where they are!
We don’t know what they do!

4

5 The cellular machinery recognizes genes without access to GenBank, SwissProt or computers – can we?

6 Needles Hiding in Genome Haystacks...
Genes are embedded in the genome sequence
Coding regions constitute only 2% of the human genome
Can we distinguish the gene features from the background?

7 Can U spot ’Spot’?

8 TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTATGCTGAACAGGCCAGAGAATTCATCTAAATAGCCTAAGCAGGCTGGGTGCTGTGGCTCACCTGTAATCCCAACACTTGGGAGGCCGAGGTGGGCAGATCACCTGAGGTCAGGAGTTCAAGGCTAGCCTAGCCAACATGACAAAACCCCATCTCTACTAAAAAAATACAAAAATTAGCCAGGCATAGTGGCGCCTATAGTTCCAGCTACTTGGGGGCTGAGGTAGGAAGATCGCTAGAGCCTGGGAGGTTAAGGCTGCGGTGAGCTGTGATTGTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTGGACTCTTCTGTTTGATGGTGGTCTTCCTCATCCTCTTCATCATGTGAAGCTCCATGGAGATCACCTACCCATACCTGCTTCTGTGACCTCATTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTGCCATTCCTGGTGTTGGAATGTGCCAAGGTTTGCCATTAAACACACATTTCTCATTTCATAATTTCATATATATTATATATATGTGTGTGTGTGTGTGTTTATATATGCGTGTGTGTTGTGTGTGTTATATATATAAAATATATAGGAAGAGGCACCAGAGAGCTCTCTGCATAGTCACAGAGGAAAGGTCATGTGAGGACAGCCAGAAGGCAGATGTCACAAGCCTCACCAGCAACCTACCATACCCTGCTTGTACCTCCATCCTGGAAGTCCAGCTTCTAAAATTAGAAGAAAATAGTCGGGTGTGTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTGTGGCTCGCACCTATAATCCCAGCACTTTGGGAGGCTGATGTGGGAGGATCATTTGAGGTCAAGAGTTTGAAACCAGCCTAGGCAACATAGGGAGACCCTGTCTTTAAAAAAAATTTTTTTTTGTTTTAATTAGCTGGGTGTGATGGTGCACACCTGAG
TCCTAGCTACTTGGGAGGCTGAGGTAGGAGGATCCCCTGAGCCCAGGGAAGTGGAGGCTGCAGTGAGCCATGATCACACCTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTCCTTGTCAGGTTTTCACCCCATGCTCCTCCATTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTGGGCTAGTCTGCTCTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTGCTTCCCGTCTTACTGGAAGACCAGCAGCATTTGACAGAGTTGGTCACTCTCTCCTCCTTGGACACCTTTTCTTCACTTGGTTTCCAGAACAGCATTATCTCCTGCTTATTGTCTTCCTCAGTCTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT

9

10 Can U spot the Gene? Can U spot the Gin? Ooops
AAGAGGTAATTAAAGCTAAATGAAGTTGTAAGAGTGGCCCTATCGCATAGGACTAGTGTCCCTATAAGAACACGAAGAAATCACCTTAGAAAGGCTGAGAAAGGGCTGCAGGGCAGTGGGAGTGCAGACTGAAAGATGCAGACCACTGGGCTTCTACTTCTGTTTCCATTTCTGATCCGGCCTGCATCTGCCTCCTTCCTGAACAGGCCAGAGAATTCATCTAAATAGCCTAAGCAGGCTGGGTGCTGTGGCTCACCTGTAATCCCAACACTTGGGAGGCCGAGGTGGGCAGATCACCTGAGGTCAGGAGTTCAAGGCTAGCCTAGCCAACATGACAAAACCCCATCTCTACTAAAAAAATACAAAAATTAGCCAGGCATAGTGGCGCCTATAGTTCCAGCTACTTGGGGGCTGAGGTAGGAAGATCGCTAGAGCCTGGGAGGTTAAGGCTGCGGTGAGCTGTGATTGTGCCACTGCACTCCAGCCTGGGTGACAGAGCAAGACCCTGCCTCAAAAATAAATAAATAAATAAATAAATAAAAATAAGAGTGCTTGGCAGCTTGATCAAGCTATGCCAGGAACCCATCTCTCAAGCAGCAGCTCTTCTCCTGTGCCATTGTCAGCTTTGTCCTGTCTGAGTCCATGGGACTCTTCTGTTTGATGGTGGTCTTCCTCATCCTCTTCATCATGTGAAGCTCCATGGAGATCACCTACCCATACCTGCTTCTGTGACCTCATGCCATTCCTGGTGTTGGAATGTGCCAAGGTTTGCCATTAAACACACATTTCTCATTTCATAATTTCATATATATTATATATATGTGTGTGTGTGTGTGTTTATATATGCGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTATATATATATATATATATATATATATATATATATATATATAAAATATATAGGAAGAGGCACCAGAGAGCTCTCTGCATAGTCACAGAGGAAAGGTCATGTGAGGACAGCCAGAAGGCAGATGTCACAAGCCTCACCAGCAACCTACCATACCCTGCTTGTACCTCCATCCTGGAAGTCCAGCTTCTAAAATTAGAAGAAAATAGTCGGGTGTGGTGGCTCGCACCTATAATCCCAGCACTTTGGGAGGCTGATGTGGGAGGATCATTTGAGGTCAAGAGTTTGAAACCAGCCTAGGCAACATAGGGAGACCCTGTCTTTAAAAAAAATTTTTTTTTGTTTTAATTAGCTGGGTGTGATGGTGCACACCTGAGTCCTAGCTACTTGGGAGGCTGAGGTAGGAGGATCCCCTGAGCCCAGGGAAGTGGAGGCTGCAGTGAGCCATGATCACACCACTGCAATACAGCCTGGGTGACAGAGCAAGACCTTATCTCAAAATAAACAAACAAACAAAAAAGATGACAAAATAAATGTCTGTCGTTTAAGTCACCCATTCTGTGATATCTTGTTACGGCAGCCTGAACTGACCAATACACTTCCTCACCCAGTTTAAATTCCATGCTCAATCATAATCAGCCATTGCAATTACCCTCAACTGTATTATCAACCCTCAATTTGTATTAGTTGCTTGGCAAAACCCAAACCCTTGTGAAATCCAGTTCTTCTATATCTACATCGATGCTGCCGAATATGGCTGAAGAAAAGCAACTGTGTTGACTGGACTGCTTTAAATTCATGACCACTTACCTCAAGTGGGCACTTAACTTCCTGGCAATTATTCTACATTTTTCTAGTCCATTAACTCTCCTCCTCTCTGAGTTAATTATTTCACAGCTTTTCCTCCCTCTTTATACATGTTCCATCCTAACTCTCTGCTGATGACCTTGTTTCTTATTTCACTAATGGAGGCCACCAGGAGAGAACTCCCACAGCCATCAAATTCACCAAGCCAACAGCATCCTTACACAAATCCTCTGCCTTCTCTCTGGGCTGGCTGTGCCCTCTCTTTGCTCCTGCAATTTCCCTAACTCTCCTATACTGTTGTTATTCACTCTCCAGTGGA
TAATCACCATCAGGATGCAAAGATGCTGTACTAGCTTCTGAACTCTCCAAAAACCCAGGAAACAAAAAGGCAAAGGCTAAGCTTTTTCTTATTCCCCCTTATATACATATATATATATAGTAGGCACTCAATAAACATTCACTGAATGAATGAACAGTAATGCTCACTTGCCCATAAATACAAGTACCTCATCTTTTACCACAAAGGGTATTTGTAAATATTTAGGTTGTTTCTACCCAGATTATGGCTTGGTAATTCTTTTTTTTTTTTTCTAATTTTTATTTTTTTTCTAGGGACAGGGTCTCACTATGTTGCCCAGGATGGTCTTGAACTCCTGGGCTCAAGCATTCTGCCTGCCTTGGCCTCCTAAAGTGCTGAGATTACAGGCATGAGCCACCGTGCCTGCCTTCATGTATGTTTTTAGAACACAGAGAAAATGTGTTCTAAATGTGCTCATTGCTCAGCAATGAGCAAAGGCTTATGCAGTCACCACCAATCAAAAACTTTTTTTTTTTTTTTTGAGACAAGATCTTGCTCTGTTGCCCAGGCTGGAGTGCAGTGGCAGGATCATAGCAAGCTGCAGTCTTGACCTCATAGGCCTAAATCATCCTCCCACCTCAGCCTCACAAGTAGCTAAGACCACAGGTACAAGCCACCGTATCTAGCTAACTTTCAAAATTTTTTGAATTTTTAAATTTAAAAATTTTGAGGCCAGGCTGGCCTCAAACTCCTGAGCTCAAGCAATCCTCCCACCTTGGCTTCCCAAAGTGCTGGGATTATAGGCGTGAGCAACTGTACCTGGCAAAAACTTTTTAAGAGCTTCGCTTCCAGATTTAGGTTGTTTCTACCCAGATTATGGCTTGGTAATTCTTTTTTTTTTTTTCTAATTTTTATTTTTTTTCTAGGGACAGGGTCTCACTATGTTGCCCAGGATGGTCTTGAACTCCTGGGCTCAAGCATTCTGCCTGCCTTGGCCTCCTAAAGTGCTGAGATTACAGGCATGAGCCACCGTGCCTGCCTTCATGTATGTTTTTAGAACACAGAGAAAATGTGTTCTAAATGTGCTCATTGCTCAGCAATGAGCAAAGGCTTATGCAGTCACCACCAATCAAAAACTTTTTTTTTTTTTTTTGAGACAAGATCTTGCTCTGTTGCCCAGGCTGGAGTGCAGTGGCAGGATCATAGCAAGCTGCAGTCTTGACCTCATAGGCCTAAATCATCCTCCCACCTCAGCCTCACAAGTAGCTAAGACCACAGGTACAAGCCACCGTATCTAGCTAACTTTCAAAATTTTTTGAATTTTTAAATTTAAAAATTTTGAGGCCAGGCTGGCCTCAGATTAGGCAACTTTAACCTTCAACAGTGATCATAACCCTTAGTTTTCAGATCCGATTAAGGGAAATGTGTAATGTCTTACTGACACACTAATCCCATCACTGCTCACACCACCCACAATTAGCTGAG Can U spot the Gene? Can U spot the Gin? Ooops

11 AAGAGGTAATTAAAGCTAAATGAAGTTGTAAGAGTGGCCCTATCGCATAGGACTAGTGTCCCTATAAGAACACGAAGAAATCACCTTAGAAAGGCTGAGAAAGGGCTGCAGGGCAGTGGGAGTGCAGACTGAAAGATGCAGACCACTGGGCTTCTACTTCTGTTTCCATTTCTGATCCGGCCTGCATCTGCCTCCTTCCTGAACAGGCCAGAGAATTCATCTAAATAGCCTAAGCAGGCTGGGTGCTGTGGCTCACCTGTAATCCCAACACTTGGGAGGCCGAGGTGGGCAGATCACCTGAGGTCAGGAGTTCAAGGCTAGCCTAGCCAACATGACAAAACCCCATCTCTACTAAAAAAATACAAAAATTAGCCAGGCATAGTGGCGCCTATAGTTCCAGCTACTTGGGGGCTGAGGTAGGAAGATCGCTAGAGCCTGGGAGGTTAAGGCTGCGGTGAGCTGTGATTGTGCCACTGCACTCCAGCCTGGGTGACAGAGCAAGACCCTGCCTCAAAAATAAATAAATAAATAAATAAATAAAAATAAGAGTGCTTGGCAGCTTGATCAAGCTATGCCAGGAACCCATCTCTCAAGCAGCAGCTCTTCTCCTGTGCCATTGTCAGCTTTGTCCTGTCTGAGTCCATGGGACTCTTCTGTTTGATGGTGGTCTTCCTCATCCTCTTCATCATGTGAAGCTCCATGGAGATCACCTACCCATACCTGCTTCTGTGACCTCATGCCATTCCTGGTGTTGGAATGTGCCAAGGTTTGCCATTAAACACACATTTCTCATTTCATAATTTCATATATATTATATATATGTGTGTGTGTGTGTGTTTATATATGCGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTATATATATATATATATATATATATATATATATATATATATAAAATATATAGGAAGAGGCACCAGAGAGCTCTCTGCATAGTCACAGAGGAAAGGTCATGTGAGGACAGCCAGAAGGCAGATGTCACAAGCCTCACCAGCAACCTACCATACCCTGCTTGTACCTCCATCCTGGAAGTCCAGCTTCTAAAATTAGAAGAAAATAGTCGGGTGTGGTGGCTCGCACCTATAATCCCAGCACTTTGGGAGGCTGATGTGGGAGGATCATTTGAGGTCAAGAGTTTGAAACCAGCCTAGGCAACATAGGGAGACCCTGTCTTTAAAAAAAATTTTTTTTTGTTTTAATTAGCTGGGTGTGATGGTGCACACCTGAGTCCTAGCTACTTGGGAGGCTGAGGTAGGAGGATCCCCTGAGCCCAGGGAAGTGGAGGCTGCAGTGAGCCATGATCACACCACTGCAATACAGCCTGGGTGACAGAGCAAGACCTTATCTCAAAATAAACAAACAAACAAAAAAGATGACAAAATAAATGTCTGTCGTTTAAGTCACCCATTCTGTGATATCTTGTTACGGCAGCCTGAACTGACCAATACACTTCCTCACCCAGTTTAAATTCCATGCTCAATCATAATCAGCCATTGCAATTACCCTCAACTGTATTATCAACCCTCAATTTGTATTAGTTGCTTGGCAAAACCCAAACCCTTGTGAAATCCAGTTCTTCTATATCTACATCGATGCTGCCGAATATGGCTGAAGAAAAGCAACTGTGTTGACTGGACTGCTTTAAATTCATGACCACTTACCTCAAGTGGGCACTTAACTTCCTGGCAATTATTCTACATTTTTCTAGTCCATTAACTCTCCTCCTCTCTGAGTTAATTATTTCACAGCTTTTCCTCCCTCTTTATACATGTTCCATCCTAACTCTCTGCTGATGACCTTGTTTCTTATTTCACTAATGGAGGCCACCAGGAGAGAACTCCCACAGCCATCAAATTCACCAAGCCAACAGCATCCTTACACAAATCCTCTGCCTTCTCTCTGGGCTGGCTGTGCCCTCTCTTTGCTCCTGCAATTTCCCTAACTCTCCTATACTGTTGTTATTCACTCTCCAGT
GGATAATCACCATCAGGATGCAAAGATGCTGTACTAGCTTCTGAACTCTCCAAAAACCCAGGAAACAAAAAGGCAAAGGCTAAGCTTTTTCTTATTCCCCCTTATATACATATATATATATAGTAGGCACTCAATAAACATTCACTGAATGAATGAACAGTAATGCTCACTTGCCCATAAATACAAGTACCTCATCTTTTACCACAAAGGGTATTTGTAAATATTTAGGTTGTTTCTACCCAGATTATGGCTTGGTAATTCTTTTTTTTTTTTTCTAATTTTTATTTTTTTTCTAGGGACAGGGTCTCACTATGTTGCCCAGGATGGTCTTGAACTCCTGGGCTCAAGCATTCTGCCTGCCTTGGCCTCCTAAAGTGCTGAGATTACAGGCATGAGCCACCGTGCCTGCCTTCATGTATGTTTTTAGAACACAGAGAAAATGTGTTCTAAATGTGCTCATTGCTCAGCAATGAGCAAAGGCTTATGCAGTCACCACCAATCAAAAACTTTTTTTTTTTTTTTTGAGACAAGATCTTGCTCTGTTGCCCAGGCTGGAGTGCAGTGGCAGGATCATAGCAAGCTGCAGTCTTGACCTCATAGGCCTAAATCATCCTCCCACCTCAGCCTCACAAGTAGCTAAGACCACAGGTACAAGCCACCGTATCTAGCTAACTTTCAAAATTTTTTGAATTTTTAAATTTAAAAATTTTGAGGCCAGGCTGGCCTCAAACTCCTGAGCTCAAGCAATCCTCCCACCTTGGCTTCCCAAAGTGCTGGGATTATAGGCGTGAGCAACTGTACCTGGCAAAAACTTTTTAAGAGCTTCGCTTCCAGATTTAGGTTGTTTCTACCCAGATTATGGCTTGGTAATTCTTTTTTTTTTTTTCTAATTTTTATTTTTTTTCTAGGGACAGGGTCTCACTATGTTGCCCAGGATGGTCTTGAACTCCTGGGCTCAAGCATTCTGCCTGCCTTGGCCTCCTAAAGTGCTGAGATTACAGGCATGAGCCACCGTGCCTGCCTTCATGTATGTTTTTAGAACACAGAGAAAATGTGTTCTAAATGTGCTCATTGCTCAGCAATGAGCAAAGGCTTATGCAGTCACCACCAATCAAAAACTTTTTTTTTTTTTTTTGAGACAAGATCTTGCTCTGTTGCCCAGGCTGGAGTGCAGTGGCAGGATCATAGCAAGCTGCAGTCTTGACCTCATAGGCCTAAATCATCCTCCCACCTCAGCCTCACAAGTAGCTAAGACCACAGGTACAAGCCACCGTATCTAGCTAACTTTCAAAATTTTTTGAATTTTTAAATTTAAAAATTTTGAGGCCAGGCTGGCCTCAGATTAGGCAACTTTAACCTTCAACAGTGATCATAACCCTTAGTTTTCAGATCCGATTAAGGGAAATGTGTAATGTCTTACTGACACACTAATCCCATCACTGCTCACACCACCCACAATTAGCTGAG

12 Needles Hiding in Genome Haystacks...
Intron-exon structure of genes
Large introns (average 3365 bp)
Small exons (average 145 bp)
Long genes (average 27 kb)

13 Manual Genefinding Start codon: ATG Stop codons: TAA, TAG, TGA
Donor splice site: ^GT[AG]AG Acceptor splice site: [CT]AG^ >U70368 (950 bp) 351 CTCCCTTAGA AGACTCCAGC AAGTTATTTG AAGAGGTCTT TGGAGACATG 401 GTGGTTTCTC TTTCCTTCCC AGAAGGTAAG TCTCACTGTA AGGTCTTTAT 451 GTCTTGTGTG TCCCCCAGCA GCCTTGTCAT CTCCGGCTGC CCTAGACCTG 501 CATAAGGACA GATTGAGTGT GCTGGGATAG ACTTTTGTTG ACAAAGGGGC 551 TGCTCTGCCC TTCTAAGAGG TTGAGTCTCA TCATAAGGCC TTTTGCAGCT 601 TGCATGTGTA GTGCCAGGAA AGAGTAGTCA TCCCCCAAAA CCAGACAGGA 651 ACTGACGAGA TGCAATCACT GTGTGGACTT TTTACCAGCT AGCTAGGGCA 701 CTACCATGAG CCACTGTCTA GCAGGGAGGC TTTGGGGATG GTGTGCCCCG 751 AATATCTCTC AGGGTAAGAG TTTACAGTAA GCAGCAAGCA GAGGGGTGTG 801 GGTGAGTGTG CAAGTATCTA ATTGGCTAGT TTTTGTGGCC TGTAACATAT 851 TGGTGGGTGT TGGGAGTCAT AAGCTAAATG TTTGCTTTCC TCTGCATTGG 901 TGGTCATTAG GGAGGGGGCA GATTATGAAC CTAGGTTGCA GATCTGTTGG 951 AGTAATAACA AGACACTGGT CTTGTTGGGG GTATAACCTA GAGACTCGAT 1001 TTATGTTCAT GTTTGGTTTG GGATGGGTTT TATGTGAGTG TTTTCTTTTT 1051 TGGGGAGGGG GTCGGTTAAC TTGGAAAGTA ATGCTAGGTA CTGTCCTGTT 1101 CATTTCCCTG AGGTGAAAGT TAGGTCAGGT TTTCTAGAAT GGAGTCTGAA 1151 GGTAAAARAT TTGGCCACTG GCATGCCCTA AAGTCTTTTT GTGTTCTTGT 1201 CCCCTAGCAG ATCCAGCCCT ATCATCTCCT GGTGCCCAAC AGCTGCATCA 1251 GGATGAAGCT CAGGTAGTGG TGGAGCTAAC TGCCAATGAC AAGCCCAGTC
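The manual search on this slide can be sketched in a few lines of Python. This is only an illustration, not the course's method: the function name `find_signals` and the choice of 0-based positions are assumptions, and the patterns are taken literally from the slide (the `^` boundary marker is dropped because it is not part of the sequence).

```python
import re

# Signal patterns as quoted on the slide; '^' (the exon/intron boundary) is not matched.
SIGNALS = {
    "start": "ATG",
    "stop": "TAA|TAG|TGA",
    "donor": "GT[AG]AG",    # intron would begin at the G of GT
    "acceptor": "[CT]AG",   # intron would end after the AG
}

def find_signals(seq):
    """Return {signal name: 0-based start positions}, overlapping hits included."""
    seq = seq.upper()
    hits = {}
    for name, pattern in SIGNALS.items():
        # Wrapping the pattern in a lookahead makes finditer report overlapping matches.
        hits[name] = [m.start() for m in re.finditer(f"(?=({pattern}))", seq)]
    return hits

# First 50 bases of the U70368 excerpt on the slide (positions 351-400), blocks joined.
fragment = "".join("CTCCCTTAGA AGACTCCAGC AAGTTATTTG AAGAGGTCTT TGGAGACATG".split())

if __name__ == "__main__":
    for name, positions in find_signals(fragment).items():
        print(name, positions)
```

Even on 50 bases the short patterns match several times, which is the point of the exercise: each signal on its own says very little about where the gene is.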

14 Manual Genefinding Start codon: ATG
P(ATG)=p(A) x p(T) x p(G) ~ ¼ x ¼ x ¼ = 1/64 (in 950 bp = 14.8 ATG expected; observed = 16) >U70368 (950 bp) 351 CTCCCTTAGA AGACTCCAGC AAGTTATTTG AAGAGGTCTT TGGAGACATG 401 GTGGTTTCTC TTTCCTTCCC AGAAGGTAAG TCTCACTGTA AGGTCTTTAT 451 GTCTTGTGTG TCCCCCAGCA GCCTTGTCAT CTCCGGCTGC CCTAGACCTG 501 CATAAGGACA GATTGAGTGT GCTGGGATAG ACTTTTGTTG ACAAAGGGGC 551 TGCTCTGCCC TTCTAAGAGG TTGAGTCTCA TCATAAGGCC TTTTGCAGCT 601 TGCATGTGTA GTGCCAGGAA AGAGTAGTCA TCCCCCAAAA CCAGACAGGA 651 ACTGACGAGA TGCAATCACT GTGTGGACTT TTTACCAGCT AGCTAGGGCA 701 CTACCATGAG CCACTGTCTA GCAGGGAGGC TTTGGGGATG GTGTGCCCCG 751 AATATCTCTC AGGGTAAGAG TTTACAGTAA GCAGCAAGCA GAGGGGTGTG 801 GGTGAGTGTG CAAGTATCTA ATTGGCTAGT TTTTGTGGCC TGTAACATAT 851 TGGTGGGTGT TGGGAGTCAT AAGCTAAATG TTTGCTTTCC TCTGCATTGG 901 TGGTCATTAG GGAGGGGGCA GATTATGAAC CTAGGTTGCA GATCTGTTGG 951 AGTAATAACA AGACACTGGT CTTGTTGGGG GTATAACCTA GAGACTCGAT 1001 TTATGTTCAT GTTTGGTTTG GGATGGGTTT TATGTGAGTG TTTTCTTTTT 1051 TGGGGAGGGG GTCGGTTAAC TTGGAAAGTA ATGCTAGGTA CTGTCCTGTT 1101 CATTTCCCTG AGGTGAAAGT TAGGTCAGGT TTTCTAGAAT GGAGTCTGAA 1151 GGTAAAARAT TTGGCCACTG GCATGCCCTA AAGTCTTTTT GTGTTCTTGT 1201 CCCCTAGCAG ATCCAGCCCT ATCATCTCCT GGTGCCCAAC AGCTGCATCA 1251 GGATGAAGCT CAGGTAGTGG TGGAGCTAAC TGCCAATGAC AAGCCCAGTC
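The back-of-the-envelope estimate above can be reproduced as follows. This is a minimal sketch that assumes independent, equally frequent bases, as the slide does; the helper name `expected_motif_count` is made up for illustration.

```python
from fractions import Fraction

def expected_motif_count(motif, seq_len, base_freq=None):
    """Probability of the motif at one position and its expected count in seq_len bases."""
    if base_freq is None:
        # Assumption from the slide: each base has probability 1/4.
        base_freq = {b: Fraction(1, 4) for b in "ACGT"}
    p = Fraction(1)
    for base in motif:
        p *= base_freq[base]
    # The slide multiplies by the full length; strictly there are
    # seq_len - len(motif) + 1 possible start positions, which changes little here.
    return p, p * seq_len

p_atg, expected = expected_motif_count("ATG", 950)
print(p_atg, float(expected))  # prints: 1/64 14.84375
```

The observed count of 16 is close to the expectation for random sequence, so the presence of an ATG alone is not evidence of a start codon.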

15 Manual Genefinding Start codon: ATG Stop codons: TAA, TAG, TGA
>U70368 (950 bp) 351 CTCCCTTAGA AGACTCCAGC AAGTTATTTG AAGAGGTCTT TGGAGACATG 401 GTGGTTTCTC TTTCCTTCCC AGAAGGTAAG TCTCACTGTA AGGTCTTTAT 451 GTCTTGTGTG TCCCCCAGCA GCCTTGTCAT CTCCGGCTGC CCTAGACCTG 501 CATAAGGACA GATTGAGTGT GCTGGGATAG ACTTTTGTTG ACAAAGGGGC 551 TGCTCTGCCC TTCTAAGAGG TTGAGTCTCA TCATAAGGCC TTTTGCAGCT 601 TGCATGTGTA GTGCCAGGAA AGAGTAGTCA TCCCCCAAAA CCAGACAGGA 651 ACTGACGAGA TGCAATCACT GTGTGGACTT TTTACCAGCT AGCTAGGGCA 701 CTACCATGAG CCACTGTCTA GCAGGGAGGC TTTGGGGATG GTGTGCCCCG 751 AATATCTCTC AGGGTAAGAG TTTACAGTAA GCAGCAAGCA GAGGGGTGTG 801 GGTGAGTGTG CAAGTATCTA ATTGGCTAGT TTTTGTGGCC TGTAACATAT 851 TGGTGGGTGT TGGGAGTCAT AAGCTAAATG TTTGCTTTCC TCTGCATTGG 901 TGGTCATTAG GGAGGGGGCA GATTATGAAC CTAGGTTGCA GATCTGTTGG 951 AGTAATAACA AGACACTGGT CTTGTTGGGG GTATAACCTA GAGACTCGAT 1001 TTATGTTCAT GTTTGGTTTG GGATGGGTTT TATGTGAGTG TTTTCTTTTT 1051 TGGGGAGGGG GTCGGTTAAC TTGGAAAGTA ATGCTAGGTA CTGTCCTGTT 1101 CATTTCCCTG AGGTGAAAGT TAGGTCAGGT TTTCTAGAAT GGAGTCTGAA 1151 GGTAAAARAT TTGGCCACTG GCATGCCCTA AAGTCTTTTT GTGTTCTTGT 1201 CCCCTAGCAG ATCCAGCCCT ATCATCTCCT GGTGCCCAAC AGCTGCATCA 1251 GGATGAAGCT CAGGTAGTGG TGGAGCTAAC TGCCAATGAC AAGCCCAGTC

16 Manual Genefinding Start codon: ATG Stop codons: TAA, TAG, TGA
>U70368 (950 bp) 351 CTCCCTTAGA AGACTCCAGC AAGTTATTTG AAGAGGTCTT TGGAGACATG 401 GTGGTTTCTC TTTCCTTCCC AGAAGGTAAG TCTCACTGTA AGGTCTTTAT 451 GTCTTGTGTG TCCCCCAGCA GCCTTGTCAT CTCCGGCTGC CCTAGACCTG 501 CATAAGGACA GATTGAGTGT GCTGGGATAG ACTTTTGTTG ACAAAGGGGC 551 TGCTCTGCCC TTCTAAGAGG TTGAGTCTCA TCATAAGGCC TTTTGCAGCT 601 TGCATGTGTA GTGCCAGGAA AGAGTAGTCA TCCCCCAAAA CCAGACAGGA 651 ACTGACGAGA TGCAATCACT GTGTGGACTT TTTACCAGCT AGCTAGGGCA 701 CTACCATGAG CCACTGTCTA GCAGGGAGGC TTTGGGGATG GTGTGCCCCG 751 AATATCTCTC AGGGTAAGAG TTTACAGTAA GCAGCAAGCA GAGGGGTGTG 801 GGTGAGTGTG CAAGTATCTA ATTGGCTAGT TTTTGTGGCC TGTAACATAT 851 TGGTGGGTGT TGGGAGTCAT AAGCTAAATG TTTGCTTTCC TCTGCATTGG 901 TGGTCATTAG GGAGGGGGCA GATTATGAAC CTAGGTTGCA GATCTGTTGG 951 AGTAATAACA AGACACTGGT CTTGTTGGGG GTATAACCTA GAGACTCGAT 1001 TTATGTTCAT GTTTGGTTTG GGATGGGTTT TATGTGAGTG TTTTCTTTTT 1051 TGGGGAGGGG GTCGGTTAAC TTGGAAAGTA ATGCTAGGTA CTGTCCTGTT 1101 CATTTCCCTG AGGTGAAAGT TAGGTCAGGT TTTCTAGAAT GGAGTCTGAA 1151 GGTAAAARAT TTGGCCACTG GCATGCCCTA AAGTCTTTTT GTGTTCTTGT 1201 CCCCTAGCAG ATCCAGCCCT ATCATCTCCT GGTGCCCAAC AGCTGCATCA 1251 GGATGAAGCT CAGGTAGTGG TGGAGCTAAC TGCCAATGAC AAGCCCAGTC

17 Genes and Signals

18

19 Manual Genefinding Start codon: ATG Stop codons: TAA, TAG, TGA
Donor splice site: ^GT[AG]AG Acceptor splice site: [CT]AG^ >U70368 (950 bp) 351 CTCCCTTAGA AGACTCCAGC AAGTTATTTG AAGAGGTCTT TGGAGACATG 401 GTGGTTTCTC TTTCCTTCCC AGAAGGTAAG TCTCACTGTA AGGTCTTTAT 451 GTCTTGTGTG TCCCCCAGCA GCCTTGTCAT CTCCGGCTGC CCTAGACCTG 501 CATAAGGACA GATTGAGTGT GCTGGGATAG ACTTTTGTTG ACAAAGGGGC 551 TGCTCTGCCC TTCTAAGAGG TTGAGTCTCA TCATAAGGCC TTTTGCAGCT 601 TGCATGTGTA GTGCCAGGAA AGAGTAGTCA TCCCCCAAAA CCAGACAGGA 651 ACTGACGAGA TGCAATCACT GTGTGGACTT TTTACCAGCT AGCTAGGGCA 701 CTACCATGAG CCACTGTCTA GCAGGGAGGC TTTGGGGATG GTGTGCCCCG 751 AATATCTCTC AGGGTAAGAG TTTACAGTAA GCAGCAAGCA GAGGGGTGTG 801 GGTGAGTGTG CAAGTATCTA ATTGGCTAGT TTTTGTGGCC TGTAACATAT 851 TGGTGGGTGT TGGGAGTCAT AAGCTAAATG TTTGCTTTCC TCTGCATTGG 901 TGGTCATTAG GGAGGGGGCA GATTATGAAC CTAGGTTGCA GATCTGTTGG 951 AGTAATAACA AGACACTGGT CTTGTTGGGG GTATAACCTA GAGACTCGAT 1001 TTATGTTCAT GTTTGGTTTG GGATGGGTTT TATGTGAGTG TTTTCTTTTT 1051 TGGGGAGGGG GTCGGTTAAC TTGGAAAGTA ATGCTAGGTA CTGTCCTGTT 1101 CATTTCCCTG AGGTGAAAGT TAGGTCAGGT TTTCTAGAAT GGAGTCTGAA 1151 GGTAAAARAT TTGGCCACTG GCATGCCCTA AAGTCTTTTT GTGTTCTTGT 1201 CCCCTAGCAG ATCCAGCCCT ATCATCTCCT GGTGCCCAAC AGCTGCATCA 1251 GGATGAAGCT CAGGTAGTGG TGGAGCTAAC TGCCAATGAC AAGCCCAGTC

20 Manual Genefinding Start codon: ATG Stop codons: TAA, TAG, TGA
Donor splice site: ^GT[AG]AG Acceptor splice site: [CT]AG^ >U70368 (950 bp) 351 CTCCCTTAGA AGACTCCAGC AAGTTATTTG AAGAGGTCTT TGGAGACATG 401 GTGGTTTCTC TTTCCTTCCC AGAAGGTAAG TCTCACTGTA AGGTCTTTAT 451 GTCTTGTGTG TCCCCCAGCA GCCTTGTCAT CTCCGGCTGC CCTAGACCTG 501 CATAAGGACA GATTGAGTGT GCTGGGATAG ACTTTTGTTG ACAAAGGGGC 551 TGCTCTGCCC TTCTAAGAGG TTGAGTCTCA TCATAAGGCC TTTTGCAGCT 601 TGCATGTGTA GTGCCAGGAA AGAGTAGTCA TCCCCCAAAA CCAGACAGGA 651 ACTGACGAGA TGCAATCACT GTGTGGACTT TTTACCAGCT AGCTAGGGCA 701 CTACCATGAG CCACTGTCTA GCAGGGAGGC TTTGGGGATG GTGTGCCCCG 751 AATATCTCTC AGGGTAAGAG TTTACAGTAA GCAGCAAGCA GAGGGGTGTG 801 GGTGAGTGTG CAAGTATCTA ATTGGCTAGT TTTTGTGGCC TGTAACATAT 851 TGGTGGGTGT TGGGAGTCAT AAGCTAAATG TTTGCTTTCC TCTGCATTGG 901 TGGTCATTAG GGAGGGGGCA GATTATGAAC CTAGGTTGCA GATCTGTTGG 951 AGTAATAACA AGACACTGGT CTTGTTGGGG GTATAACCTA GAGACTCGAT 1001 TTATGTTCAT GTTTGGTTTG GGATGGGTTT TATGTGAGTG TTTTCTTTTT 1051 TGGGGAGGGG GTCGGTTAAC TTGGAAAGTA ATGCTAGGTA CTGTCCTGTT 1101 CATTTCCCTG AGGTGAAAGT TAGGTCAGGT TTTCTAGAAT GGAGTCTGAA 1151 GGTAAAARAT TTGGCCACTG GCATGCCCTA AAGTCTTTTT GTGTTCTTGT 1201 CCCCTAGCAG ATCCAGCCCT ATCATCTCCT GGTGCCCAAC AGCTGCATCA 1251 GGATGAAGCT CAGGTAGTGG TGGAGCTAAC TGCCAATGAC AAGCCCAGTC

21 Gene Features
Codon frequency/bias: organism dependent; hexamer statistics
Transcriptional: promoters/enhancers
Exon/introns: length distributions; ORFs
Splicing: donor/acceptor sites; branchpoints
Translational: start codon context

22 Codon Bias
tRNA availability
Expression level
Gene finders are often organism specific
Coding regions are often modelled by a 5th-order Markov chain (hexamers/di-codons)
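To make the hexamer idea concrete, here is a minimal sketch of a 5th-order Markov chain used as a coding-versus-background score. The function names, the pseudocount, the uniform fallback for unseen contexts and the toy training sequences are all assumptions for illustration; production gene finders typically add refinements such as frame-specific models that are not shown here.

```python
import math
from collections import defaultdict

ORDER = 5  # each base is conditioned on the previous five, i.e. hexamer statistics

def train(seqs, pseudocount=1.0):
    """Estimate P(next base | previous ORDER bases) from training sequences."""
    counts = defaultdict(lambda: dict.fromkeys("ACGT", pseudocount))
    for seq in seqs:
        seq = seq.upper()
        for i in range(ORDER, len(seq)):
            context, base = seq[i - ORDER:i], seq[i]
            if base in "ACGT" and set(context) <= set("ACGT"):
                counts[context][base] += 1
    model = {}
    for context, c in counts.items():
        total = sum(c.values())
        model[context] = {b: c[b] / total for b in "ACGT"}
    return model

def log_prob(seq, model):
    """Log-probability of seq; unseen contexts or symbols fall back to 0.25."""
    seq = seq.upper()
    total = 0.0
    for i in range(ORDER, len(seq)):
        probs = model.get(seq[i - ORDER:i])
        total += math.log(probs.get(seq[i], 0.25) if probs else 0.25)
    return total

def coding_score(seq, coding_model, background_model):
    """Log-odds score; positive means seq resembles the coding training set more."""
    return log_prob(seq, coding_model) - log_prob(seq, background_model)

# Toy training data (hypothetical, far too small for real use).
coding_model = train(["ATGGCCGCCGCCGCCGCCGCCTAA"])
background_model = train(["ATATATATATATATATATATATAT"])
```

Because the statistics are learned from training genes, a model trained on one organism transfers poorly to another, which is one reason gene finders are organism specific.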

23 Exon Size

24 Intron Size

25 Intron Prevalence

26 Exon definition model

27 Eukaryotic Gene Prediction
Prediction relies on integration of several gene features
Each gene feature carries a low signal (e.g. ATG, splice sites, etc.)
Combinatorial explosion
Some are mutually exclusive (e.g. reading frame)
Sensor-based HMMs are well suited for gene prediction

28 Gene Prediction: Take home messages
Human genome sequence is known
Number of human genes is unknown!
Before 2001: est. 30, ,000
As of 2003: 25,000-40,000
Why? Because gene structure prediction is hard!
Location, structure and function of many human genes are unknown!
Genes may be discovered by different means and methods ...

29 The End

30

31 Gene Finding Challenges
Need the correct reading frame
Introns can interrupt an exon in mid-codon
There is no hard and fast rule for identifying donor and acceptor splice sites
Signals are very weak

32 Overpredicting Genes
Easy to predict all exons
Report all sequences flanked by ..AG and GT.. as exons
Sensitivity = 100%
Specificity ~ 0%
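The overprediction strategy can be sketched as below. Two assumptions are worth stating: "specificity" is computed here as the fraction of predictions that are correct, which is how the term is commonly used in gene-finding evaluations, and the sequence, the "true" exon and the `max_len` cut-off are invented for illustration.

```python
def candidate_exons(seq, max_len=300):
    """All 0-based half-open spans (start, end) preceded by AG and followed by GT."""
    seq = seq.upper()
    starts = [i + 2 for i in range(len(seq) - 1) if seq[i:i + 2] == "AG"]
    ends = [j for j in range(len(seq) - 1) if seq[j:j + 2] == "GT"]
    return [(s, e) for s in starts for e in ends if 0 < e - s <= max_len]

def sensitivity_specificity(predicted, true):
    """Sensitivity = TP / true exons; specificity = TP / predicted exons."""
    predicted, true = set(predicted), set(true)
    tp = len(predicted & true)
    sn = tp / len(true) if true else 0.0
    sp = tp / len(predicted) if predicted else 0.0
    return sn, sp

# Hypothetical 20-base example with one annotated exon at (4, 7).
toy_seq = "TTAGGCCGTAAAGCCCGTTT"
toy_true = [(4, 7)]
toy_pred = candidate_exons(toy_seq)
sn, sp = sensitivity_specificity(toy_pred, toy_true)
print(len(toy_pred), sn, sp)
```

Every real internal exon is among the candidates, so sensitivity stays at 100%, but the number of AG/GT pairs grows roughly with the square of the sequence length, which drives specificity towards zero on genomic-scale input.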

33 Sensor-based methods
Similarity searches miss some/many genes
cDNA/EST libraries are not perfect
Ab initio gene finders:
HMM-based (Hidden Markov Model): GenScan, HMMgene
Neural network-based: GRAIL, NetGene2 (splice sites)

34 Gene Prediction
”Isolated” methods
Predict individual features, e.g. splice sites, coding regions
NetGene (Neural network)
”Integrated” methods
Predict genes in context
”Grammar” of genes: certain elements in specific order are required
HMMgene, GenScan (HMM-based)

35 Gene Grammar
Isolated features
HAPPYEUGENEAWASGUYFINDER

36 Gene Grammar
Isolated features
HAPPYEUGENEAWASGUYFINDER
Intron 3’UTR Exon Promoter Exon RBS

37 Gene Grammar
Integrated features
HAPPYEUGENEAWASGUYFINDER
EUGENEFINDERWASAHAPPYGUY

38 Gene Grammar
Integrated features
EUGENEFINDERWASAHAPPYGUY
PromRBSExonIntronExon3’UTR

39 Gene Grammar
”Isolated” methods (e.g. NN): HAPPYEUGENEAWASGUYFINDER
”Integrated” methods (e.g. HMM): EUGENEFINDERWASAHAPPYGUY

40 HMMs for genefinding
GenScan principle
E=exon, I=intron, F=5’ UTR, T=3’ UTR, P=promoter, N=intergenic
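How an HMM turns such a state diagram into a labelled sequence can be sketched with a toy Viterbi decoder. This is not GenScan's model: only three of the slide's states (N, E, I) are used, all probabilities are invented, and the function assumes a non-empty sequence of A, C, G and T. Forbidding the N to I and I to N transitions is a small example of the "grammar" an integrated method enforces.

```python
import math

STATES = ("N", "E", "I")  # intergenic, exon, intron (a subset of the slide's labels)

# Hypothetical parameters, chosen only to make the example readable.
START = {"N": 0.8, "E": 0.1, "I": 0.1}
TRANS = {
    "N": {"N": 0.90, "E": 0.10, "I": 0.00},  # an intron cannot follow intergenic DNA
    "E": {"N": 0.05, "E": 0.85, "I": 0.10},
    "I": {"N": 0.00, "E": 0.10, "I": 0.90},  # an intron must return to an exon
}
EMIT = {
    "N": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
    "E": {"A": 0.15, "C": 0.35, "G": 0.35, "T": 0.15},
    "I": {"A": 0.35, "C": 0.15, "G": 0.15, "T": 0.35},
}

def _log(p):
    """Natural log that maps probability 0 to minus infinity."""
    return math.log(p) if p > 0 else float("-inf")

def viterbi(seq):
    """Return the most probable state path for seq as a string of state labels."""
    scores = {s: _log(START[s]) + _log(EMIT[s][seq[0]]) for s in STATES}
    back = []  # one dict of back-pointers per position after the first
    for base in seq[1:]:
        new_scores, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda r: scores[r] + _log(TRANS[r][s]))
            new_scores[s] = scores[prev] + _log(TRANS[prev][s]) + _log(EMIT[s][base])
            pointers[s] = prev
        scores = new_scores
        back.append(pointers)
    state = max(STATES, key=scores.get)
    path = [state]
    for pointers in reversed(back):  # trace the best path backwards
        state = pointers[state]
        path.append(state)
    return "".join(reversed(path))

if __name__ == "__main__":
    print(viterbi("ATATGCGCGCGCGCGCGCGCATATATATATATATGCGCGCGCGCGCAT"))
```

The decoder scores whole labellings rather than individual signals, so a weak local signal can still be accepted when it fits the surrounding context.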

41 Genscan http://genes.mit.edu/GENSCAN.html

42 Genscan

43 Genscan http://genes.mit.edu/GENSCAN.html

44 Genscan

45 Genscan

46 HMMgene http://www.cbs.dtu.dk/services/HMMgene/

47 HMMgene http://www.cbs.dtu.dk/services/HMMgene/
Columns:
Sequence identifier
Program name
Prediction (see table below for the meaning)
Beginning
End
Score between 0 and 1
Strand: + for direct and - for complementary
Frame (for exons it is the position of the donor in the frame)
Group to which the prediction belongs. If several CDSs are found they will be called cds_1, cds_2, etc. `bestparse:' is there because alternative predictions will also be available (see below).

Name: Meaning
firstex: The coding part of the first coding exon, starting with the first base of the start codon.
exon_N: The N'th predicted internal coding exon.
lastex: The coding part of the last coding exon, ending with the last base of the stop codon.
singleex: The coding part of an exon in a gene with only one coding exon.
CDS: Coding region composed of the exon predictions prior to this line.
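The nine columns listed above map naturally onto a small record type. The sketch below assumes tab-separated fields in the order given on the slide; the class name, the parser and the sample line are hypothetical and were not taken from real HMMgene output.

```python
from dataclasses import dataclass

@dataclass
class Prediction:
    seq_id: str
    program: str
    feature: str   # firstex, exon_N, lastex, singleex or CDS
    begin: int
    end: int
    score: float   # between 0 and 1
    strand: str    # "+" direct, "-" complementary
    frame: str     # kept as text in case a line carries a placeholder
    group: str     # e.g. "bestparse:cds_1"

def parse_line(line):
    """Parse one prediction line, assuming tab-separated columns."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 9:
        raise ValueError(f"expected 9 columns, got {len(fields)}")
    seq_id, program, feature, begin, end, score, strand, frame = fields[:8]
    group = "\t".join(fields[8:])
    return Prediction(seq_id, program, feature, int(begin), int(end),
                      float(score), strand, frame, group)

# Invented example line, for illustration only.
sample = "U70368\tHMMgene\tfirstex\t398\t426\t0.578\t+\t0\tbestparse:cds_1"
record = parse_line(sample)
print(record.feature, record.begin, record.end, record.score)
```

Filtering the parsed records on `feature` and `group` is then enough to rebuild each predicted CDS from its exons.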

48 Defining the term ’exon’
Gene prediction programs often use Exon = CDS (coding sequence)
Real exons may contain 5’ or 3’ UTRs (untranslated regions)

49 Gene Prediction – NetGene 2

50 Gene Prediction – NetGene 2

51 Gene Prediction – NetGene 2

52 Gene Prediction – NetGene 2

53 NIX – Visualizing Gene Predictions
NO method is always best!

54 Gene Prediction – Performance of Genscan

55 Performance of Genscan – Exon Length

56 RepeatMasker
Repetitive sequences in human/eukaryotic genomes are a problem
Up to 45% of human genomic sequence is derived from transposable/repetitive elements
Run gene predictions on large genomic regions before and after masking of repetitive sequence
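A before/after comparison needs a masked copy of the sequence. The sketch below assumes the input is already soft-masked, a common convention in which repeats are written in lowercase, and converts it to a hard-masked sequence with N in place of repeats. The function names and the example string are illustrative, and this does not detect repeats itself.

```python
def hard_mask(seq):
    """Replace lowercase (soft-masked) bases with N, leaving other bases unchanged."""
    return "".join("N" if ch.islower() else ch for ch in seq)

def masked_fraction(seq):
    """Fraction of the sequence that is soft-masked (lowercase)."""
    return sum(ch.islower() for ch in seq) / len(seq) if seq else 0.0

# Hypothetical soft-masked fragment: the lowercase run stands for a repeat.
soft = "ACGTacgtACGT"
print(hard_mask(soft), round(masked_fraction(soft), 3))
```

Running the same gene finder on the original and on the hard-masked sequence shows which predictions depend on repetitive sequence.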

57 Repeatmasker

58 Future Challenges
Bootstrapping: prediction improves as more genes become known
’Extreme’ genes (long/short) are still difficult
Initial and terminal exons are predicted with lower confidence
Combine with sequence similarity matches
Non-coding RNAs: most gene prediction programs only predict protein-coding genes; tRNA and rRNA genes are not predicted
Predict alternative splicing, enhancers and silencers
Predict matrix- and scaffold-attachment regions, insulators and boundary elements

59 Gene Prediction: Take home messages
Human genome sequence is known
Number of human genes is unknown!
Before 2001: est. 30, ,000
As of 2003: 25,000-40,000
Location, structure and function of many human genes are unknown!
Genes may be discovered by different means and methods ...

60 Gene Prediction: Take home messages
Genes may be predicted by computer programs
Masking of repetitive sequences may be required for large genomic sequences
’Unusual’ genes are difficult (high GC%, short or terminal exons)
HMM-based gene prediction programs are suitable for “Gene Grammar”
Prediction methods are not perfect!

61 The End

