1
"Gene Finding in Eukaryotic Genomes"
PhD course #27803, Spring 2003
Nikolaj Blom
Center for Biologisk Sekvensanalyse (Center for Biological Sequence Analysis)
BioCentrum-DTU, Technical University of Denmark
nikob@cbs.dtu.dk
2
Human Genome Published
- Public Human Genome Project (International Human Genome Sequencing Consortium): Nature, 15 Feb 2001
- Celera: Science, 16 Feb 2001
3
We Have the Human Genome Sequence... now what?
So, what is the problem? Well...
- We don't know how many genes there are!
- We don't know where they are!
- We don't know what they do!
4
5
The cellular machinery recognizes genes without access to GenBank, SwissProt or computers. Can we?
6
Needles in Haystacks...
- Only 2% of the human genome is coding sequence
- Genes have an intron-exon structure
- Large introns (average 3365 bp)
- Small exons (average 145 bp)
- Long genes (average 27 kb)
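A rough back-of-the-envelope check of how these averages fit together. This is only a sketch: it assumes a gene is simply n average-sized exons alternating with n - 1 average-sized introns, which ignores UTRs and the skewed length distributions of real genes.

```python
# Averages quoted on the slide (human genome)
AVG_EXON_BP = 145
AVG_INTRON_BP = 3365
AVG_GENE_BP = 27_000

# Assumption (not from the slide): gene = n exons + (n - 1) introns, all of average size.
# n * exon + (n - 1) * intron = gene  =>  n = (gene + intron) / (exon + intron)
exons_per_gene = (AVG_GENE_BP + AVG_INTRON_BP) / (AVG_EXON_BP + AVG_INTRON_BP)

# Exonic sequence per gene and its share of the gene length
exonic_bp = exons_per_gene * AVG_EXON_BP
exonic_fraction = exonic_bp / AVG_GENE_BP

print(f"implied exons per gene: {exons_per_gene:.1f}")
print(f"exonic sequence per gene: {exonic_bp:.0f} bp ({exonic_fraction:.1%} of the gene)")
```

Under this simplification, only a small percentage of an average gene is exon, which is the "needle in a haystack" point of the slide.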
7
Center for Biologisk Sekvensanalyse AAGAGGTAATTAAAGCTAAATGAAGTTGTAAGAGTGGCCCTATCGCATAGGACTAGTGTCCCTATAAGAACACGAAGAAATCACCTTAGAAAGGCTGAGAAAGGGCTGCAGGGCAGTG GGAGTGCAGACTGAAAGATGCAGACCACTGGGCTTCTACTTCTGTTTCCATTTCTGATCCGGCCTGCATCTGCCTCCTTCCTGAACAGGCCAGAGAATTCATCTAAATAGCCTA AGCAGGCTGGGTGCTGTGGCTCACCTGTAATCCCAACACTTGGGAGGCCGAGGTGGGCAGATCACCTGAGGTCAGGAGTTCAAGGCTAGCCTAGCCAACATGACAAAACCCCAT CTCTACTAAAAAAATACAAAAATTAGCCAGGCATAGTGGCGCCTATAGTTCCAGCTACTTGGGGGCTGAGGTAGGAAGATCGCTAGAGCCTGGGAGGTTAAGGCTGCGGTGAGC TGTGATTGTGCCACTGCACTCCAGCCTGGGTGACAGAGCAAGACCCTGCCTCAAAAATAAATAAATAAATAAATAAATAAAAATAAGAGTGCTTGGCAGCTTGATCAAGCTATG CCAGGAACCCATCTCTCAAGCAGCAGCTCTTCTCCTGTGCCATTGTCAGCTTTGTCCTGTCTGAGTCCATGGGACTCTTCTGTTTGATGGTGGTCTTCCTCATCCTCTTCATCA TGTGAAGCTCCATGGAGATCACCTACCCATACCTGCTTCTGTGACCTCATGCCATTCCTGGTGTTGGAATGTGCCAAGGTTTGCCATTAAACACACATTTCTCATTTCATAATT TCATATATATTATATATATGTGTGTGTGTGTGTGTTTATATATGCGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTATATATATATATATATATATATATATATATATATATA TATAAAATATATAGGAAGAGGCACCAGAGAGCTCTCTGCATAGTCACAGAGGAAAGGTCATGTGAGGACAGCCAGAAGGCAGATGTCACAAGCCTCACCAGCAACCTACCATAC CCTGCTTGTACCTCCATCCTGGAAGTCCAGCTTCTAAAATTAGAAGAAAATAGTCGGGTGTGGTGGCTCGCACCTATAATCCCAGCACTTTGGGAGGCTGATGTGGGAGGATCA TTTGAGGTCAAGAGTTTGAAACCAGCCTAGGCAACATAGGGAGACCCTGTCTTTAAAAAAAATTTTTTTTTGTTTTAATTAGCTGGGTGTGATGGTGCACACCTGAGTCCTAGC TACTTGGGAGGCTGAGGTAGGAGGATCCCCTGAGCCCAGGGAAGTGGAGGCTGCAGTGAGCCATGATCACACCACTGCAATACAGCCTGGGTGACAGAGCAAGACCTTATCTCA AAATAAACAAACAAACAAAAAAGATGACAAAATAAATGTCTGTCGTTTAAGTCACCCATTCTGTGATATCTTGTTACGGCAGCCTGAACTGACCAATACACTTCCTCACCCAGT TTAAATTCCATGCTCAATCATAATCAGCCATTGCAATTACCCTCAACTGTATTATCAACCCTCAATTTGTATTAGTTGCTTGGCAAAACCCAAACCCTTGTGAAATCCAGTTCT TCTATATCTACATCGATGCTGCCGAATATGGCTGAAGAAAAGCAACTGTGTTGACTGGACTGCTTTAAATTCATGACCACTTACCTCAAGTGGGCACTTAACTTCCTGGCAATT ATTCTACATTTTTCTAGTCCATTAACTCTCCTCCTCTCTGAGTTAATTATTTCACAGCTTTTCCTCCCTCTTTATACATGTTCCATCCTAACTCTCTGCTGATGACCTTGTTTC TTATTTCACTAATGGAGGCCACCAGGAGAGAACTCCCACAGCCATCAAATTCACCAAGCCAACAGCATCCTTACACAAATCCTCTGCCTTCTCTCTGGGCTGGCTGTGCCCTCT 
CTTTGCTCCTGCAATTTCCCTAACTCTCCTATACTGTTGTTATTCACTCTCCAGTGGATAATCACCATCAGGATGCAAAGATGCTGTACTAGCTTCTGAACTCTCCAAAAACCC AGGAAACAAAAAGGCAAAGGCTAAGCTTTTTCTTATTCCCCCTTCCAGCTATTGTACTGTTTCTCTGCTTTTAATTTATTTTTATTTATTTATTTATTTATTTATTTATTTATT TTTGAGATGGAGCTTCACTCTTGTTGCCCAGGCTGGAGCGCAATGGCGCGATCTCAGCTCACCGCAACCTCTACTTCCCGAATTCAAGTGATTGTCCTGCCTCAGCCTCCCGAG TAGCCGGGATTACAGGCATGCGCCACCACGCCTGGCTAATTTTGTACTTTTAGTAGAGACGGGGTTTCTCCATGTTGCTCAGCCTGGTCACAAACTCCCGATCTCAGGTGATCT GCCTGCCTCGGCCTCCCAAAGTGCTGGGATTACAGGCGTGAGCCACCACGCCCCACCGTCTCTGTTCTCTTTTAAAGCACAATCCCTCAACACAAGTGTCTATACTCAGCGTCT CCACTTTCCCTCCATCTGGTCTTCCCAGTGCCCCCTTGTCAGGTTTTCACCCCATGCTCCTCCAGGGCTAGTCTGCTCTTGCTTCCCGTCTTACTGGAAGACCAGCAGCATTTG ACAGAGTTGGTCACTCTCTCCTCCTTGGACACCTTTTCTTCACTTGGTTTCCAGAACAGCATTATCTCCTGCTTATTGTCTTCCTCAGTCTACCTCAGTGAAAAGCTTTACTGG TTCCTCCACATCTCCCAGACCTCCAGTAATAACAGGAATGTACCATGCCATTGCTCTCTCTCTCTCCTTTTTTTTTTTTTTTTTTTTTTTTTGTTGAGACAGAGTCTCAATTTT ATCACCCAGACTGAAGCACAATGGCATGATCATAGCTCATTGCAGTCTCGAACTCGTGGGCTCAAGCAATCCTCCCACCTCAGCCTCCTGAATAGCTGGGACTACAAGCAACAC CACCATGCCCAGCTAACTTTCTATTTTTTATTTTTATTTTTTGTAGAGATGAGGTTTTACTATGTTGCCTAGGCTAGTCTTGAACTCCTGGGCCCAAATGATCCTCCCACCTTG GTCTCCCAAAGTGCTGGGATTATAGGCGTGAGCCACCGTGTCCAACTTCTCTTTCTTAATGGAATTTAGGCAAAAGTTATTACTCATGGCCTTGGAATGCTCTTTCCTCAGATA GCCACATGGCTCACCATTACTTCCTTCCAGCTTTCTTCAAAGATCCACTTCTCAGTGAAGCTTTGTCCTGACCACCCAGCTGAAAATTGCAATCCTCTTCTGTCTACCATGTAC ATACTCTCTATTTGCTTTCCTTCCTTTATTTCTCTCTGTAGGTGTGACCTAACATAACATATAATTTACTTCTGTACCTTGTTTGCTTTCTGTCTTCCCCTTTAGAACATAAGC TCCATGAGGGAAGGCGTTTTTGCCTGCTTTAGTCACTTTATCTCCAGCAACTACAACTATATGTATATATACACACACATATATATACACACACATATATATACACACACATAT ATATATACATATATATATATAGTAGGCACTCAATAAACATTCACTGAATGAATGAACAGTAATGCTCACTTGCCCATAAATACAAGTACCTCATCTTTTACCACAAAGGGTATT TGTAAATATTTAGGTTGTTTCTACCCAGATTATGGCTTGGTAATTCTTTTTTTTTTTTTCTAATTTTTATTTTTTTTCTAGGGACAGGGTCTCACTATGTTGCCCAGGATGGTC TTGAACTCCTGGGCTCAAGCATTCTGCCTGCCTTGGCCTCCTAAAGTGCTGAGATTACAGGCATGAGCCACCGTGCCTGCCTTCATGTATGTTTTTAGAACACAGAGAAAATGT 
GTTCTAAATGTGCTCATTGCTCAGCAATGAGCAAAGGCTTATGCAGTCACCACCAATCAAAAACTTTTTTTTTTTTTTTTGAGACAAGATCTTGCTCTGTTGCCCAGGCTGGAG TGCAGTGGCAGGATCATAGCAAGCTGCAGTCTTGACCTCATAGGCCTAAATCATCCTCCCACCTCAGCCTCACAAGTAGCTAAGACCACAGGTACAAGCCACCGTATCTAGCTA ACTTTCAAAATTTTTTGAATTTTTAAATTTAAAAATTTTGAGGCCAGGCTGGCCTCAAACTCCTGAGCTCAAGCAATCCTCCCACCTTGGCTTCCCAAAGTGCTGGGATTATAG GCGTGAGCAACTGTACCTGGCAAAAACTTTTTAAGAGCTTCGCTTCCAGGATTAGGCAACTTTAACCTTCAACAGTGATCATAACCCTTAGTTTTCAGATCCGATTAAGGGAAA TGTGTAATGTCTTACTGACACACTAATCCCATCACTGCTCACACCACCCACAATTAGCTGAG
8
(The same genomic sequence as on slide 7 is shown again.)
9
10
Genes and Signals
11
Gene Features
- Codon frequency/bias
  - Organism dependent
  - Hexamer statistics
- Transcriptional
  - Promoters/enhancers
- Exons/introns
  - Length distributions
  - ORFs
- Splicing
  - Donor/acceptor sites
  - Branchpoints
- Translational
  - Ribosome binding sites
12
Codon Bias
- Gene finders are often organism specific
- Coding regions are often modelled by a 5th-order Markov chain (hexamers/dicodons)
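To make the 5th-order Markov idea concrete, here is a minimal sketch: each base is predicted from the five bases before it, so the model is estimated from hexamer counts. The training sequences, the pseudocount and all function names are illustrative assumptions. Real gene finders typically train on large annotated sets and use separate models for each codon position, which this sketch does not do.

```python
import math
from collections import Counter

def hexamer_counts(seqs):
    """Count all overlapping hexamers in a list of training sequences."""
    counts = Counter()
    for s in seqs:
        for i in range(len(s) - 5):
            counts[s[i:i + 6]] += 1
    return counts

def cond_log_prob(counts, context, base, pseudo=1.0):
    """log P(base | previous 5 bases), with a pseudocount for unseen hexamers."""
    total = sum(counts[context + b] for b in "ACGT")
    return math.log((counts[context + base] + pseudo) / (total + 4 * pseudo))

def log_odds(seq, coding, background):
    """Sum of log-likelihood ratios (coding vs. background) over the sequence."""
    score = 0.0
    for i in range(5, len(seq)):
        ctx, base = seq[i - 5:i], seq[i]
        score += cond_log_prob(coding, ctx, base) - cond_log_prob(background, ctx, base)
    return score

# Toy training data (hypothetical, for illustration only)
coding_model = hexamer_counts(["GCC" * 10])
background_model = hexamer_counts(["AT" * 15])

print(log_odds("GCCGCCGCCGCC", coding_model, background_model))  # positive: looks "coding"
print(log_odds("ATATATATATAT", coding_model, background_model))  # negative: looks like background
```

Because the hexamer statistics are learned from training sequences, a model trained on one organism can score another organism's genes poorly, which is one reason gene finders tend to be organism specific.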
13
Exon Size
14
Intron Size
15
Intron Prevalence
16
Gene Finding Challenges
- Need the correct reading frame
- Introns can interrupt an exon in mid-codon
- There is no hard and fast rule for identifying donor and acceptor splice sites
- Signals are very weak
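The "mid-codon" point can be illustrated by computing the phase of each intron, that is, how many bases of a codon precede it. The sketch below uses the GenBank CDS coordinates for HoxA10 that appear in the exercise table later in this deck; the function name and the convention (1-based, inclusive coordinates, CDS starting in frame) are assumptions made for this example.

```python
def intron_phases(cds_exons):
    """Return the phase (0, 1 or 2) of each intron.

    cds_exons: list of (start, end) coding segments, 1-based inclusive,
    in transcript order. Phase = number of codon bases already read
    when the intron begins (0 means the intron falls between codons).
    """
    phases = []
    coding_len = 0
    for start, end in cds_exons[:-1]:  # an intron follows every exon except the last
        coding_len += end - start + 1
        phases.append(coding_len % 3)
    return phases

# GenBank annotation for Seq#1 (HoxA10) from the exercise slide
hoxa10 = [(320, 1226), (2401, 2675)]
total_cds = sum(end - start + 1 for start, end in hoxa10)

print(intron_phases(hoxa10))  # phase of the single intron
print(total_cds % 3)          # 0 if the spliced CDS is a whole number of codons
```

With these coordinates the first coding segment is 907 bp long, so the intron starts after the first base of a codon; the reading frame of the second exon can only be determined once the splice site is placed exactly.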
17
18
Overpredicting Genes
- It is easy to predict all exons:
  - Report every sequence flanked by ...AG (acceptor) and GT... (donor) as an exon
- Sensitivity = 100%
- Specificity ~ 0%
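The trivial predictor described here can be sketched in a few lines. The toy sequence, the position of its "true" exon and the length cap are invented for illustration. Note that "specificity" is computed as TP / (TP + FP), the convention commonly used in gene-finding evaluations, which is an assumption about what the slide means.

```python
def candidate_exons(seq, max_len=300):
    """Report every interval preceded by AG and followed by GT as an exon.

    Returns (start, end) pairs as 0-based, end-exclusive coordinates.
    """
    seq = seq.upper()
    starts = [i + 2 for i in range(len(seq) - 1) if seq[i:i + 2] == "AG"]  # base after AG
    ends = [i for i in range(len(seq) - 1) if seq[i:i + 2] == "GT"]        # GT begins here
    return [(s, e) for s in starts for e in ends if s < e <= s + max_len]

# Hypothetical toy sequence with one "true" internal exon, ATGGCCAAG at [6, 15)
toy = "TTTCAGATGGCCAAGGTAAGTCCAGGCTGAGGTTT"
true_exons = {(6, 15)}

predicted = set(candidate_exons(toy))
tp = len(predicted & true_exons)
sensitivity = tp / len(true_exons)   # fraction of true exons that were found
specificity = tp / len(predicted)    # fraction of predictions that are true

print(len(predicted), sensitivity, round(specificity, 2))
```

Even on this 35-base toy sequence most of the reported intervals are false positives; on genomic sequence the number of AG/GT pairs grows so quickly that the specificity approaches zero.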
19
Sensor-based methods
- Similarity searches miss some/many genes
  - cDNA/EST libraries are not perfect
- Ab initio gene finders
  - HMM-based: GenScan, HMMgene
  - Neural network-based: GRAIL, NetGene2 (splice sites)
20
Gene Prediction
- "Isolated" methods
  - Predict individual features, e.g. splice sites, coding regions
  - NetGene2 (neural network): http://www.cbs.dtu.dk/services/NetGene2/
- "Integrated" methods
  - Predict genes in context
  - "Grammar" of genes: certain elements are required in a specific order
  - HMMgene: http://www.cbs.dtu.dk/services/HMMgene/
  - GenScan (HMM-based): http://genes.mit.edu/GENSCAN.html
21
Gene Grammar
Isolated features: HAPPYEUGENEAWASGUYFINDER
22
Gene Grammar
Isolated features: HAPPYEUGENEAWASGUYFINDER
Feature labels: Intron, 3'UTR, Exon, Promoter, Exon, RBS
23
Gene Grammar
Integrated features: EUGENEFINDERWASAHAPPYGUY
Isolated features: HAPPYEUGENEAWASGUYFINDER
24
Gene Grammar
Integrated features: EUGENEFINDERWASAHAPPYGUY
Feature labels: Prom, RBS, Exon, Intron, Exon, 3'UTR
25
Gene Grammar
"Isolated" methods (e.g. NN): HAPPYEUGENEAWASGUYFINDER
"Integrated" methods (e.g. HMM): EUGENEFINDERWASAHAPPYGUY
26
HMMs for gene finding
GenScan principle
State labels: E = exon, I = intron, F = 5' UTR, T = 3' UTR, P = promoter, N = intergenic
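A toy illustration of how an HMM imposes a "grammar" on a sequence: the sketch below runs the Viterbi algorithm over just three of the states listed above (N, E, I). All probabilities are invented for illustration, and the model is far simpler than GenScan, which uses many more states, explicit length distributions and splice-site models.

```python
import math

STATES = ["N", "E", "I"]  # intergenic, exon, intron

# Hypothetical parameters, chosen only to make the example readable
START = {"N": 1.0, "E": 0.0, "I": 0.0}             # a sequence starts in intergenic DNA
TRANS = {
    "N": {"N": 0.90, "E": 0.10, "I": 0.00},        # grammar: N cannot jump straight to I
    "E": {"N": 0.05, "E": 0.90, "I": 0.05},
    "I": {"N": 0.00, "E": 0.10, "I": 0.90},        # an intron must be followed by an exon
}
EMIT = {
    "N": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
    "E": {"A": 0.10, "C": 0.40, "G": 0.40, "T": 0.10},  # GC-rich exons (toy assumption)
    "I": {"A": 0.40, "C": 0.10, "G": 0.10, "T": 0.40},  # AT-rich introns (toy assumption)
}

def lg(p):
    """Natural log that maps probability 0 to -infinity."""
    return math.log(p) if p > 0 else float("-inf")

def viterbi(seq):
    """Return the most probable state path for seq as a string of state labels."""
    v = [{s: lg(START[s]) + lg(EMIT[s][seq[0]]) for s in STATES}]
    back = []
    for x in seq[1:]:
        row, ptr = {}, {}
        for s in STATES:
            best = max(STATES, key=lambda r: v[-1][r] + lg(TRANS[r][s]))
            row[s] = v[-1][best] + lg(TRANS[best][s]) + lg(EMIT[s][x])
            ptr[s] = best
        v.append(row)
        back.append(ptr)
    # Trace back from the best final state
    state = max(STATES, key=lambda s: v[-1][s])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return "".join(reversed(path))

seq = "ATATATAT" + "GC" * 10 + "ATATATAT"
path = viterbi(seq)
print(seq)
print(path)
```

The zero entries in the transition table are what encode the grammar: whatever the local base composition looks like, the decoded path can never place an intron directly next to intergenic sequence.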
27
Genscan
http://genes.mit.edu/GENSCAN.html
28
Genscan
29
Genscan
http://genes.mit.edu/GENSCAN.html
30
Genscan
31
Genscan
32
HMMgene
http://www.cbs.dtu.dk/services/HMMgene/
33
HMMgene
http://www.cbs.dtu.dk/services/HMMgene/
Output columns:
1. Sequence identifier
2. Program name
3. Prediction (see the table below for the meaning)
4. Beginning
5. End
6. Score between 0 and 1
7. Strand: + for direct and - for complementary
8. Frame (for exons it is the position of the donor in the frame)
9. Group to which the prediction belongs. If several CDSs are found they will be called cds_1, cds_2, etc. `bestparse:' is there because alternative predictions will also be available (see below).

Name      Meaning
firstex   The coding part of the first coding exon, starting with the first base of the start codon.
exon_N    The N'th predicted internal coding exon.
lastex    The coding part of the last coding exon, ending with the last base of the stop codon.
singleex  The coding part of an exon in a gene with only one coding exon.
CDS       Coding region composed of the exon predictions prior to this line.
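A minimal sketch of reading one line of such nine-column output into named fields. The example line is hypothetical: the coordinates and score are taken from the exercise table later in this deck, but the sequence identifier, program string, frame value and whitespace layout are assumptions, so check them against real HMMgene output before relying on this.

```python
from typing import NamedTuple

class Prediction(NamedTuple):
    seq_id: str
    program: str
    feature: str   # firstex, exon_N, lastex, singleex or CDS
    begin: int
    end: int
    score: float
    strand: str    # "+" or "-"
    frame: str     # kept as text; may be a placeholder for some features
    group: str     # e.g. "bestparse:cds_1"

def parse_line(line):
    """Split one whitespace-separated prediction line into the nine columns."""
    fields = line.split(None, 8)
    if len(fields) < 9:
        raise ValueError(f"expected 9 columns, got {len(fields)}")
    return Prediction(fields[0], fields[1], fields[2], int(fields[3]), int(fields[4]),
                      float(fields[5]), fields[6], fields[7], fields[8].strip())

# Hypothetical example line (layout assumed, values partly from the exercise slide)
example = "Seq1\tHMMgene\tfirstex\t320\t1226\t0.744\t+\t1\tbestparse:cds_1"
p = parse_line(example)
print(p.feature, p.begin, p.end, p.end - p.begin + 1, p.score)
```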
34
Defining the term 'exon'
- Gene prediction programs often use exon = CDS (coding sequence)
- Real exons may contain 5' or 3' UTRs (untranslated regions)
35
Gene Prediction – NetGene2
36
Gene Prediction – NetGene2
37
Gene Prediction – NetGene2
38
Gene Prediction – NetGene2
39
NIX – Visualizing Gene Predictions
http://www.hgmp.mrc.ac.uk/NIX/
40
Gene Prediction – Performance of Genscan
41
Performance of Genscan – Exon Length
42
RepeatMasker
- Repetitive sequences in human/eukaryotic genomes are a problem
- Run gene predictions on large genomic regions before and after masking of repetitive sequence: http://ftp.genome.washington.edu/cgi-bin/RepeatMasker
- Up to 45% of human genomic sequence is derived from transposable/repetitive elements
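What "masking" means in practice can be sketched as follows: positions annotated as repeats are replaced by N so that a gene finder cannot score them as coding sequence. The repeat coordinates below are hypothetical, and the function is a stand-in for illustration, not RepeatMasker's own code or interface.

```python
def hard_mask(seq, repeats):
    """Replace every base inside a repeat interval with 'N'.

    repeats: list of (start, end) intervals, 1-based inclusive.
    """
    masked = list(seq)
    for start, end in repeats:
        for i in range(start - 1, min(end, len(seq))):
            masked[i] = "N"
    return "".join(masked)

def masked_fraction(seq):
    """Fraction of the sequence that has been masked."""
    return seq.count("N") / len(seq)

genomic = "ACGTACGTAC"
repeat_hits = [(3, 6)]  # hypothetical repeat annotation

masked = hard_mask(genomic, repeat_hits)
print(masked, masked_fraction(masked))
```

Comparing gene predictions on `genomic` and on `masked` is the before/after comparison the slide recommends.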
43
RepeatMasker
44
Future Challenges
- Bootstrapping: prediction improves as more genes become known
- 'Extreme' genes (long/short) are still difficult
- Initial and terminal exons are predicted with lower confidence
- Combine with sequence similarity matches
- Non-coding RNAs
  - Most gene prediction programs only predict protein-coding genes
  - tRNA and rRNA genes are not predicted
- Prokaryotic gene finding
  - Much easier (no introns), but still not perfect
  - Short genes (<300 bp) are especially difficult
45
Gene Prediction: Take-home messages
- The human genome sequence is known
- The number of human genes is unknown!
  - Before 2001: estimated 30,000-140,000
  - As of 2003: 30,000-40,000
- The location, structure and function of many human genes are unknown!
- Genes may be discovered by different means and methods...
46
Gene Prediction: Take-home messages
- Genes may be predicted by computer programs
- Masking of repetitive sequences may be required for large genomic sequences
- 'Unusual' genes are difficult (high GC%, short or terminal exons)
- HMM-based gene prediction programs are suitable for "Gene Grammar"
- Prediction methods are not perfect!
47
The End
49
Gene Prediction Exercises
I. Gene Finding in Prokaryotic Sequence
II. Gene Finding in Eukaryotic Sequence
Exercises at:
http://www.cbs.dtu.dk/phdcourse/programme.html
http://www.cbs.dtu.dk/phdcourse/cookbooks/genefinding/pro.html
http://www.cbs.dtu.dk/phdcourse/cookbooks/genefinding/euk.html
50
Gene Prediction Exercise

Sequence       | GenBank    | Genscan         | HMMgene         | NetGene2
Seq#1 (HoxA10) | 320..1226  | 320 1226 0.871  | 320 1226 0.744  | Donor 1227 0.95H
               | 2401..2675 | 2401 2675 0.988 | 2401 2675 0.971 | Acc. 2400 1.00H
Seq#2 (Dub-2)  | 398..425   | -               | 398 425 0.418   | Donor 426 0.87
               | 1208..2817 | 1208 2817 0.800 | 1208 2817 0.735 | Acc. 1207 0.42
               |            |                 |                 | Acc. 1210 0.71

http://www.cbs.dtu.dk/dtucourse/cookbooks/nikob/exercises/gf_exercise_solution.html
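The comparison in this exercise can be summarised as exon-level accuracy. The sketch below assumes the table is read as follows: GenBank gives the true CDS exons, Genscan misses the first exon of Seq#2 (the "-" entry), and HMMgene predicts all four exons. A prediction counts as correct only if both boundaries match exactly, and "specificity" is again TP / (TP + FP). Function and variable names are illustrative.

```python
genbank = {
    "Seq#1": {(320, 1226), (2401, 2675)},
    "Seq#2": {(398, 425), (1208, 2817)},
}
genscan = {
    "Seq#1": {(320, 1226), (2401, 2675)},
    "Seq#2": {(1208, 2817)},                 # first exon not predicted
}
hmmgene = {
    "Seq#1": {(320, 1226), (2401, 2675)},
    "Seq#2": {(398, 425), (1208, 2817)},
}

def exon_accuracy(truth, pred):
    """Return (sensitivity, specificity) at the exact-exon level."""
    tp = sum(len(truth[k] & pred.get(k, set())) for k in truth)
    n_true = sum(len(v) for v in truth.values())
    n_pred = sum(len(v) for v in pred.values())
    return tp / n_true, tp / n_pred

for name, pred in [("Genscan", genscan), ("HMMgene", hmmgene)]:
    sn, sp = exon_accuracy(genbank, pred)
    print(f"{name}: sensitivity {sn:.2f}, specificity {sp:.2f}")
```

With only four exons this is an illustration of the bookkeeping, not a benchmark of either program.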
51
53
Gene Prediction – Performance of Genscan
54
Genome Browsing - Exercise #1
Q: How many exons are encoded by the HoxA10 gene?
A: 2 exons
Q: How many base pairs long is the transcript?
A: 2542 bp
55
Genome Browsing - Exercise #1
Q: On what chromosome is the HoxA10 gene?
A: Human chr. 7
Q: On which arm (short/p or long/q)?
A: p
Q: What gene is located ca. 500 kb downstream of HoxA10?
A: Scap2
Q: On what mouse chromosome is the ortholog/homolog of human HoxA10 located?
A: Mouse chr. 6
Q: In the overview panel, there is a gene located ca. 300 kb downstream of HoxA10. What is its name?
A: Scap2
56
http://www.cbs.dtu.dk/dtucourse/cookbooks/nikob/exercises/gf_exercise_solution.html