”Gene Finding in Eukaryotic Genomes” DTU course #27011 23.03.2004 Nikolaj Blom Center for Biological Sequence Analysis BioCentrum-DTU Technical University of Denmark nikob@cbs.dtu.dk
Today’s plan 13.00-13.30 Lecture on gene finding Gene features, Repeatmasker, etc. 13.30-14.00 Get notebooks (building 208; secretary)+Pause 14.00-16.00 Work on project Nikolaj present from 14.00-14.30 Lars present from 14.45-15.15
Practical Stuff Webpage, Literature, Textbooks Report writing format Contribution from each student specified E.g. Lars & Dorte mainly wrote the Introduction and Methods; Lise & Jens wrote the Results and Discussion sections Repeatmasker http://www.repeatmasker.org/
Gene Features Codon frequency/bias: Organism dependent, Hexamer statistics Transcriptional: Promoters/enhancers Exon/introns: Length distributions, ORFs Splicing: Donor/acceptor sites, Branchpoints Translational: Start codon context
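As a rough illustration of the "ORFs" feature listed above, the following minimal sketch scans the three forward reading frames for ATG...stop open reading frames. The function name `find_orfs` and the `min_codons` threshold are assumptions made for this example, not part of the course material. Note that in eukaryotic genomic DNA the coding sequence is interrupted by introns, so a plain ORF scan like this one is only one weak signal among the features above, not a gene finder.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_codons=10):
    """Return (start, end, frame) tuples for forward-strand ORFs.

    start is the 0-based index of the ATG, end is exclusive and includes
    the stop codon. min_codons counts codons from the ATG up to, but not
    including, the stop codon. (Hypothetical helper for illustration.)
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        # Walk the sequence codon by codon in this reading frame.
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # first ATG opens an ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None                   # look for the next ATG
    return orfs
```

The reverse strand is ignored here for brevity; a fuller scan would repeat the search on the reverse complement.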
Codon Bias tRNA availability Expression level Gene Finders are often organism specific Coding regions often modelled by 5th order Markov chain (hexamers/di-codons)
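To make the 5th order Markov chain idea concrete, here is a minimal sketch that estimates P(base | previous 5 bases) from example sequences and scores a query by its log-odds between a "coding" and a "non-coding" model. The function names, the pseudocount, and the single (non-periodic) model are assumptions made for illustration; real gene finders typically use separate models per codon position and train on large, organism-specific gene sets.

```python
import math
from collections import defaultdict

BASES = set("ACGT")

def train_markov5(seqs, pseudocount=1.0):
    """Estimate P(base | previous 5 bases) from training sequences.

    Returns a function prob(context, base). Hexamers containing
    characters other than A, C, G, T are skipped.
    """
    counts = defaultdict(lambda: defaultdict(float))
    for seq in seqs:
        seq = seq.upper()
        for i in range(5, len(seq)):
            context, base = seq[i - 5:i], seq[i]
            if set(context + base) <= BASES:
                counts[context][base] += 1

    def prob(context, base):
        # Pseudocounts keep unseen hexamers from getting probability zero.
        total = sum(counts[context].values()) + 4 * pseudocount
        return (counts[context][base] + pseudocount) / total

    return prob

def log_odds(seq, coding, noncoding):
    """Sum of log(P_coding / P_noncoding) over every hexamer in seq."""
    seq = seq.upper()
    score = 0.0
    for i in range(5, len(seq)):
        context, base = seq[i - 5:i], seq[i]
        if set(context + base) <= BASES:
            score += math.log(coding(context, base) / noncoding(context, base))
    return score
```

A positive score means the hexamer content of the query looks more like the coding training set than the non-coding one. Because the hexamer statistics depend on codon bias, a model trained on one organism will generally transfer poorly to another, which is one reason gene finders are often organism specific.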
Human genes: Short exons Long introns
Human genes: Intron lengths have a broad distribution Min. length ca. 60 bp
Intron Prevalence
Gene Prediction – Performance of Genscan
NIX – Visualizing Gene Predictions http://www.hgmp.mrc.ac.uk/NIX/ NO method is always best!
Performance of Genscan – Exon Length Low performance at short exon lengths
Future Challenges Bootstrapping: prediction improves as more genes become known ’Extreme’ genes (long/short) still difficult Initial and terminal exons are predicted with lower confidence Combine with Sequence Similarity Matches Non-coding RNAs Most gene prediction programs only predict protein-coding genes tRNA and rRNA genes are not predicted Predict alternative splicing, enhancers and silencers Predict matrix- and scaffold-attachment regions, insulators and boundary elements
Gene Prediction Take home messages Prediction methods are not perfect! Genes may be predicted by computer programs Masking of repetitive sequences may be required for large genomic sequences ’Unusual’ genes are difficult (high GC%, short or terminal exons) HMM-based gene prediction programs are suitable for “Gene Grammar” Prediction methods are not perfect!
Repeatmasker Repetitive sequences in human/eukaryotic genomes are a problem Run gene predictions on large genomic regions before and after masking of repetitive sequence: Up to 45% of human genomic sequence derived from transposable/repetitive elements
Repeatmasker http://www.repeatmasker.org/ Screens DNA sequences for interspersed repeats and low complexity DNA sequences Matches against database of known repeat elements Repeats in genomic sequence may cause wrong gene predictions
Select ”html” format
>chr19_not_repeatmasked hg16_dna range=chr19:6318243-6334922 5'pad=0 3'pad=0 revComp=FALSE strand=? repeatMasking=none AGGTGTGTTGGCACACGCCTGTAATCCCAGCTACTGAGGAGGCTGAGGCATGAGAATCGCTTGAACCTGAGAGGCGGAGGTTGTAGTGAGTCGAGATTGCACCACTGCACTCCAGCCTGGGTGACAAAGTGAGACCCTGTCTCAAAAAAAAAAAAAAAAAAAAAGTGAATGTTCCACAGCATCACAGATGAATTTTGCAAATATGTTGCATGAAAGAAGAATAAACACTCTGTGATTCCATTTATTTAAACTATAAAAACAAGGAGAGCTAATTTATGCTGTTAGAGGAGTGGTTGCTTTGGGGTATGGGGAGGGGGTGGCAAGGATTAGTGACTGTCGTGGGCCCAAGTGGGGTTTCAGGGGTGCTGGCATTATTCCATCTCTTGGTCTGGGTGCTGGTCCTGTAGGGTATGTTCAGTCTGAAAATCCATCCCACCAGACATTTACGAATCATGCCCTTTCCTGGGTGTATATTATACATCAATAACAATTTTTTTTTTTTTTTGAGATGGAGTCTTGCTTTGTTGCCCAGGCTGGAGTGCAGTGGTGCAGTCTCCACCTCCCAGATTTAAGTGATTCTCATACCTCAGCCTCCCTAGTAGCTGGGATTACAGGCGTGTGCCACCACACCTGGCTCATTTTTGTATTTTTAGTAGAGACAGGGTTTCACCATGTTGGCCATGGTGAAACTTTGAAGGCCAATGGTGAAACATGAGGCCAAACTCCTGGCCTCAAGTGGTCCACCCACCT >chr19_repeatmasked hg16_dna range=chr19:6318243-6334922 5'pad=0 3'pad=0 revComp=FALSE strand=? repeatMasking=N nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnGTGAATGTTCnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
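When comparing a sequence before and after masking, it can help to check how much of it was actually masked. The sketch below parses FASTA text and reports the fraction of positions replaced by `n`/`N`, as in the masked record above. The helper names `parse_fasta` and `masked_fraction` are assumptions for this example, and the parser expects each header and each sequence line on its own line, as in a regular FASTA file.

```python
def parse_fasta(text):
    """Parse FASTA text into {record_name: sequence}.

    The record name is the first whitespace-delimited word after '>'.
    """
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {key: "".join(parts) for key, parts in records.items()}

def masked_fraction(seq):
    """Fraction of positions that are masked with 'n' or 'N'."""
    if not seq:
        return 0.0
    return sum(1 for base in seq if base in "nN") / len(seq)
```

For example, calling `masked_fraction` on each value returned by `parse_fasta` gives a per-record masking fraction that can be compared with the overall repeat content expected for the genome.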
Repetitive Elements Up to 45% of human genomic sequence LINE = Long interspersed elements SINE = Short interspersed elements
The End