Nikolaj Blom, Center for Biological Sequence Analysis (Center for Biologisk Sekvensanalyse), BioCentrum-DTU, Technical University of Denmark


"Gene Finding in Eukaryotic Genomes"
PhD course #27803, Spring 2003

Human Genome Published
HUGO: Nature, 15 Feb 2001
Celera: Science, 16 Feb 2001

We Have the Human Genome Sequence... now what?
So, what is the problem? Well...
  We don't know how many genes there are!
  We don't know where they are!
  We don't know what they do!

The cellular machinery recognizes genes without access to GenBank, SwissProt or computers. Can we?

Needles in Haystacks...
Only 2% of the human genome is coding regions
Intron-exon structure of genes
  Large introns (average 3365 bp)
  Small exons (average 145 bp)
  Long genes (average 27 kb)

Center for Biologisk Sekvensanalyse AAGAGGTAATTAAAGCTAAATGAAGTTGTAAGAGTGGCCCTATCGCATAGGACTAGTGTCCCTATAAGAACACGAAGAAATCACCTTAGAAAGGCTGAGAAAGGGCTGCAGGGCAGTG GGAGTGCAGACTGAAAGATGCAGACCACTGGGCTTCTACTTCTGTTTCCATTTCTGATCCGGCCTGCATCTGCCTCCTTCCTGAACAGGCCAGAGAATTCATCTAAATAGCCTA AGCAGGCTGGGTGCTGTGGCTCACCTGTAATCCCAACACTTGGGAGGCCGAGGTGGGCAGATCACCTGAGGTCAGGAGTTCAAGGCTAGCCTAGCCAACATGACAAAACCCCAT CTCTACTAAAAAAATACAAAAATTAGCCAGGCATAGTGGCGCCTATAGTTCCAGCTACTTGGGGGCTGAGGTAGGAAGATCGCTAGAGCCTGGGAGGTTAAGGCTGCGGTGAGC TGTGATTGTGCCACTGCACTCCAGCCTGGGTGACAGAGCAAGACCCTGCCTCAAAAATAAATAAATAAATAAATAAATAAAAATAAGAGTGCTTGGCAGCTTGATCAAGCTATG CCAGGAACCCATCTCTCAAGCAGCAGCTCTTCTCCTGTGCCATTGTCAGCTTTGTCCTGTCTGAGTCCATGGGACTCTTCTGTTTGATGGTGGTCTTCCTCATCCTCTTCATCA TGTGAAGCTCCATGGAGATCACCTACCCATACCTGCTTCTGTGACCTCATGCCATTCCTGGTGTTGGAATGTGCCAAGGTTTGCCATTAAACACACATTTCTCATTTCATAATT TCATATATATTATATATATGTGTGTGTGTGTGTGTTTATATATGCGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTATATATATATATATATATATATATATATATATATATA TATAAAATATATAGGAAGAGGCACCAGAGAGCTCTCTGCATAGTCACAGAGGAAAGGTCATGTGAGGACAGCCAGAAGGCAGATGTCACAAGCCTCACCAGCAACCTACCATAC CCTGCTTGTACCTCCATCCTGGAAGTCCAGCTTCTAAAATTAGAAGAAAATAGTCGGGTGTGGTGGCTCGCACCTATAATCCCAGCACTTTGGGAGGCTGATGTGGGAGGATCA TTTGAGGTCAAGAGTTTGAAACCAGCCTAGGCAACATAGGGAGACCCTGTCTTTAAAAAAAATTTTTTTTTGTTTTAATTAGCTGGGTGTGATGGTGCACACCTGAGTCCTAGC TACTTGGGAGGCTGAGGTAGGAGGATCCCCTGAGCCCAGGGAAGTGGAGGCTGCAGTGAGCCATGATCACACCACTGCAATACAGCCTGGGTGACAGAGCAAGACCTTATCTCA AAATAAACAAACAAACAAAAAAGATGACAAAATAAATGTCTGTCGTTTAAGTCACCCATTCTGTGATATCTTGTTACGGCAGCCTGAACTGACCAATACACTTCCTCACCCAGT TTAAATTCCATGCTCAATCATAATCAGCCATTGCAATTACCCTCAACTGTATTATCAACCCTCAATTTGTATTAGTTGCTTGGCAAAACCCAAACCCTTGTGAAATCCAGTTCT TCTATATCTACATCGATGCTGCCGAATATGGCTGAAGAAAAGCAACTGTGTTGACTGGACTGCTTTAAATTCATGACCACTTACCTCAAGTGGGCACTTAACTTCCTGGCAATT ATTCTACATTTTTCTAGTCCATTAACTCTCCTCCTCTCTGAGTTAATTATTTCACAGCTTTTCCTCCCTCTTTATACATGTTCCATCCTAACTCTCTGCTGATGACCTTGTTTC TTATTTCACTAATGGAGGCCACCAGGAGAGAACTCCCACAGCCATCAAATTCACCAAGCCAACAGCATCCTTACACAAATCCTCTGCCTTCTCTCTGGGCTGGCTGTGCCCTCT 
CTTTGCTCCTGCAATTTCCCTAACTCTCCTATACTGTTGTTATTCACTCTCCAGTGGATAATCACCATCAGGATGCAAAGATGCTGTACTAGCTTCTGAACTCTCCAAAAACCC AGGAAACAAAAAGGCAAAGGCTAAGCTTTTTCTTATTCCCCCTTCCAGCTATTGTACTGTTTCTCTGCTTTTAATTTATTTTTATTTATTTATTTATTTATTTATTTATTTATT TTTGAGATGGAGCTTCACTCTTGTTGCCCAGGCTGGAGCGCAATGGCGCGATCTCAGCTCACCGCAACCTCTACTTCCCGAATTCAAGTGATTGTCCTGCCTCAGCCTCCCGAG TAGCCGGGATTACAGGCATGCGCCACCACGCCTGGCTAATTTTGTACTTTTAGTAGAGACGGGGTTTCTCCATGTTGCTCAGCCTGGTCACAAACTCCCGATCTCAGGTGATCT GCCTGCCTCGGCCTCCCAAAGTGCTGGGATTACAGGCGTGAGCCACCACGCCCCACCGTCTCTGTTCTCTTTTAAAGCACAATCCCTCAACACAAGTGTCTATACTCAGCGTCT CCACTTTCCCTCCATCTGGTCTTCCCAGTGCCCCCTTGTCAGGTTTTCACCCCATGCTCCTCCAGGGCTAGTCTGCTCTTGCTTCCCGTCTTACTGGAAGACCAGCAGCATTTG ACAGAGTTGGTCACTCTCTCCTCCTTGGACACCTTTTCTTCACTTGGTTTCCAGAACAGCATTATCTCCTGCTTATTGTCTTCCTCAGTCTACCTCAGTGAAAAGCTTTACTGG TTCCTCCACATCTCCCAGACCTCCAGTAATAACAGGAATGTACCATGCCATTGCTCTCTCTCTCTCCTTTTTTTTTTTTTTTTTTTTTTTTTGTTGAGACAGAGTCTCAATTTT ATCACCCAGACTGAAGCACAATGGCATGATCATAGCTCATTGCAGTCTCGAACTCGTGGGCTCAAGCAATCCTCCCACCTCAGCCTCCTGAATAGCTGGGACTACAAGCAACAC CACCATGCCCAGCTAACTTTCTATTTTTTATTTTTATTTTTTGTAGAGATGAGGTTTTACTATGTTGCCTAGGCTAGTCTTGAACTCCTGGGCCCAAATGATCCTCCCACCTTG GTCTCCCAAAGTGCTGGGATTATAGGCGTGAGCCACCGTGTCCAACTTCTCTTTCTTAATGGAATTTAGGCAAAAGTTATTACTCATGGCCTTGGAATGCTCTTTCCTCAGATA GCCACATGGCTCACCATTACTTCCTTCCAGCTTTCTTCAAAGATCCACTTCTCAGTGAAGCTTTGTCCTGACCACCCAGCTGAAAATTGCAATCCTCTTCTGTCTACCATGTAC ATACTCTCTATTTGCTTTCCTTCCTTTATTTCTCTCTGTAGGTGTGACCTAACATAACATATAATTTACTTCTGTACCTTGTTTGCTTTCTGTCTTCCCCTTTAGAACATAAGC TCCATGAGGGAAGGCGTTTTTGCCTGCTTTAGTCACTTTATCTCCAGCAACTACAACTATATGTATATATACACACACATATATATACACACACATATATATACACACACATAT ATATATACATATATATATATAGTAGGCACTCAATAAACATTCACTGAATGAATGAACAGTAATGCTCACTTGCCCATAAATACAAGTACCTCATCTTTTACCACAAAGGGTATT TGTAAATATTTAGGTTGTTTCTACCCAGATTATGGCTTGGTAATTCTTTTTTTTTTTTTCTAATTTTTATTTTTTTTCTAGGGACAGGGTCTCACTATGTTGCCCAGGATGGTC TTGAACTCCTGGGCTCAAGCATTCTGCCTGCCTTGGCCTCCTAAAGTGCTGAGATTACAGGCATGAGCCACCGTGCCTGCCTTCATGTATGTTTTTAGAACACAGAGAAAATGT 
GTTCTAAATGTGCTCATTGCTCAGCAATGAGCAAAGGCTTATGCAGTCACCACCAATCAAAAACTTTTTTTTTTTTTTTTGAGACAAGATCTTGCTCTGTTGCCCAGGCTGGAG TGCAGTGGCAGGATCATAGCAAGCTGCAGTCTTGACCTCATAGGCCTAAATCATCCTCCCACCTCAGCCTCACAAGTAGCTAAGACCACAGGTACAAGCCACCGTATCTAGCTA ACTTTCAAAATTTTTTGAATTTTTAAATTTAAAAATTTTGAGGCCAGGCTGGCCTCAAACTCCTGAGCTCAAGCAATCCTCCCACCTTGGCTTCCCAAAGTGCTGGGATTATAG GCGTGAGCAACTGTACCTGGCAAAAACTTTTTAAGAGCTTCGCTTCCAGGATTAGGCAACTTTAACCTTCAACAGTGATCATAACCCTTAGTTTTCAGATCCGATTAAGGGAAA TGTGTAATGTCTTACTGACACACTAATCCCATCACTGCTCACACCACCCACAATTAGCTGAG


Genes and Signals

Gene Features
Codon frequency/bias
  Organism dependent
  Hexamer statistics
Transcriptional
  Promoters/enhancers
Exon/introns
  Length distributions
  ORFs
Splicing
  Donor/acceptor sites
  Branchpoints
Translational
  Ribosome binding sites

Codon Bias
Gene finders are often organism-specific
Coding regions are often modelled by a 5th-order Markov chain (hexamers/dicodons)
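The hexamer idea on the Codon Bias slide can be sketched in a few lines of Python. This is a minimal illustration, not the model used by any particular gene finder: the function names (train_markov, log_odds), the pseudocount, and the toy training sequences are all assumptions made for this sketch, and it ignores codon position (real gene finders typically use frame-specific models).

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"
ORDER = 5  # 5th order: each base is conditioned on the previous five (hexamer counts)


def train_markov(sequences, order=ORDER, pseudocount=1.0):
    """Estimate P(base | previous `order` bases) from training sequences.

    Returns a function prob(context, base). The pseudocount (an assumption of
    this sketch) keeps unseen hexamers from getting probability zero.
    """
    counts = defaultdict(lambda: defaultdict(float))
    for seq in sequences:
        seq = seq.upper()
        for i in range(order, len(seq)):
            context, base = seq[i - order:i], seq[i]
            if base in ALPHABET and all(c in ALPHABET for c in context):
                counts[context][base] += 1.0

    def prob(context, base):
        row = counts.get(context, {})
        total = sum(row.values()) + pseudocount * len(ALPHABET)
        return (row.get(base, 0.0) + pseudocount) / total

    return prob


def log_odds(seq, coding_prob, noncoding_prob, order=ORDER):
    """Sum of log(P_coding / P_noncoding) over the sequence; > 0 favours coding."""
    seq = seq.upper()
    score = 0.0
    for i in range(order, len(seq)):
        context, base = seq[i - order:i], seq[i]
        score += math.log(coding_prob(context, base) / noncoding_prob(context, base))
    return score


if __name__ == "__main__":
    # Toy training data, invented for illustration only.
    coding = train_markov(["ATGGCTGCTGCTGCTGCTGCTGCTTAA"])
    noncoding = train_markov(["ATATATATATATATATATATATATAT"])
    print(log_odds("GCTGCTGCTGCT", coding, noncoding))
    print(log_odds("ATATATATATAT", coding, noncoding))
```

Because the two models are trained on organism-specific sequence, the same scoring code gives different answers for different organisms, which is one reason gene finders are organism-specific.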

Exon Size

Intron Size

Intron Prevalence

Gene Finding Challenges
Need the correct reading frame
Introns can interrupt an exon in mid-codon
There is no hard and fast rule for identifying donor and acceptor splice sites
Signals are very weak

Overpredicting Genes
Easy to predict all exons
Report all sequences flanked by ..AG and GT.. as exons
  Sensitivity = 100%
  Specificity ~ 0%
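The overprediction argument can be made concrete with a small Python sketch. It assumes the gene-finding convention in which sensitivity is TP / (TP + FN) and "specificity" is TP / (TP + FP); the function names, the toy sequence and the "true" exon coordinates are invented for illustration.

```python
def naive_exons(seq, min_len=1):
    """Report every interval that has 'AG' just upstream and 'GT' just downstream.

    Intervals are (start, end) with 0-based start and exclusive end.
    The number of candidates grows roughly quadratically with sequence length.
    """
    seq = seq.upper()
    starts = [i + 2 for i in range(len(seq) - 1) if seq[i:i + 2] == "AG"]
    ends = [i for i in range(len(seq) - 1) if seq[i:i + 2] == "GT"]
    return {(s, e) for s in starts for e in ends if e - s >= min_len}


def sensitivity_specificity(predicted, true):
    """Exon-level Sn = TP / (TP + FN) and Sp = TP / (TP + FP)."""
    tp = len(predicted & true)
    sn = tp / len(true) if true else 0.0
    sp = tp / len(predicted) if predicted else 0.0
    return sn, sp


if __name__ == "__main__":
    # Toy sequence with two hypothetical true exons.
    seq = "CCAGATGCCCGTAAAGTTTAGGGCATGTCC"
    true_exons = {(4, 10), (21, 26)}
    candidates = naive_exons(seq)
    print(len(candidates), sensitivity_specificity(candidates, true_exons))
```

Every true exon bounded by AG and GT is found, so sensitivity stays at 100%, but the candidate set explodes on real genomic sequence and specificity falls toward zero.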

Sensor-based methods
Similarity searches miss some/many genes
  cDNA/EST libraries are not perfect
Ab initio gene finders
  HMM-based
    GenScan
    HMMgene
  Neural network-based
    GRAIL
    NetGene2 (splice sites)

Gene Prediction
"Isolated" methods
  Predict individual features
  E.g. splice sites, coding regions
  NetGene (neural network)
"Integrated" methods
  Predict genes in context
  "Grammar" of genes
  Certain elements in a specific order are required
  HMMgene
  GenScan (HMM-based)

Gene Grammar
Isolated features: HAPPYEUGENEAWASGUYFINDER
  Intron, 3'UTR, Exon, Promoter, Exon, RBS
Integrated features: EUGENEFINDERWASAHAPPYGUY
  Prom → RBS → Exon → Intron → Exon → 3'UTR

Gene Grammar
"Isolated" methods (e.g. NN): HAPPYEUGENEAWASGUYFINDER
"Integrated" methods (e.g. HMM): EUGENEFINDERWASAHAPPYGUY
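One way to picture the "gene grammar" is as a pattern over an ordered list of features. The Python sketch below is only an analogy: the one-letter codes, the regular expression, and the generalisation to any number of intron/exon pairs are assumptions of this example, not how HMM-based gene finders are implemented.

```python
import re

# Hypothetical one-letter codes for the features named on the slides.
CODES = {"promoter": "P", "rbs": "R", "exon": "E", "intron": "I", "3utr": "T"}

# Promoter, RBS, first exon, then zero or more (intron, exon) pairs, then 3'UTR.
GRAMMAR = re.compile(r"PRE(IE)*T")


def follows_grammar(features):
    """Return True if the ordered feature list is a valid 'sentence'."""
    encoded = "".join(CODES[name] for name in features)
    return GRAMMAR.fullmatch(encoded) is not None


if __name__ == "__main__":
    integrated = ["promoter", "rbs", "exon", "intron", "exon", "3utr"]
    isolated = ["intron", "3utr", "exon", "promoter", "exon", "rbs"]
    print(follows_grammar(integrated))
    print(follows_grammar(isolated))
```

An isolated predictor can find each feature correctly and still report them in an order that makes no sense as a gene; an integrated method only accepts combinations that parse.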

HMMs for gene finding
GenScan principle
States: E = exon, I = intron, F = 5' UTR, T = 3' UTR, P = promoter, N = intergenic
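The decoding step of an HMM gene finder can be illustrated with a toy Viterbi implementation. This sketch uses only two states (E and I) with made-up start, transition and emission probabilities; the GenScan model listed above has many more states and far richer emission models, so this is an illustration of the principle under those stated assumptions.

```python
import math

# Toy two-state model; all probabilities are invented for illustration.
STATES = ("E", "I")  # E = exon (GC-rich here), I = intron (AT-rich here)
START = {"E": 0.5, "I": 0.5}
TRANS = {"E": {"E": 0.9, "I": 0.1}, "I": {"E": 0.1, "I": 0.9}}
EMIT = {
    "E": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
    "I": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
}


def viterbi(seq):
    """Return the most probable state path for seq as a string of state letters."""
    seq = seq.upper()
    if not seq:
        return ""
    # Log-probability of the best path ending in each state at the first base.
    score = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    back = []  # back[t][s] = best predecessor of state s at position t + 1
    for base in seq[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            new_score[s] = score[prev] + math.log(TRANS[prev][s]) + math.log(EMIT[s][base])
            pointers[s] = prev
        back.append(pointers)
        score = new_score
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return "".join(reversed(path))


if __name__ == "__main__":
    print(viterbi("GCGCGCGCATATATAT"))
```

The transition table is where the grammar lives: setting a transition probability to zero forbids that order of elements outright.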

Genscan

HMMgene

HMMgene
Columns
1. Sequence identifier
2. Program name
3. Prediction (see table below for the meaning)
4. Beginning
5. End
6. Score between 0 and 1
7. Strand: + for direct and - for complementary
8. Frame (for exons it is the position of the donor in the frame)
9. Group to which the prediction belongs. If several CDSs are found they will be called cds_1, cds_2, etc. `bestparse:' is there because alternative predictions will also be available (see below).

Name      Meaning
firstex   The coding part of the first coding exon, starting with the first base of the start codon.
exon_N    The N'th predicted internal coding exon.
lastex    The coding part of the last coding exon, ending with the last base of the stop codon.
singleex  The coding part of an exon in a gene with only one coding exon.
CDS       Coding region composed of the exon predictions prior to this line.
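A prediction line with the nine columns described above could be read into a small record as sketched below. The sketch assumes whitespace-separated fields in the listed order; the Prediction class, the parse_line helper and the sample line are hypothetical and are not taken from real HMMgene output.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    seq_id: str    # 1. sequence identifier
    program: str   # 2. program name
    feature: str   # 3. prediction type, e.g. firstex, exon_1, lastex, singleex, CDS
    start: int     # 4. beginning
    end: int       # 5. end
    score: float   # 6. score between 0 and 1
    strand: str    # 7. '+' for direct, '-' for complementary
    frame: str     # 8. frame (kept as text; may not be numeric for every feature)
    group: str     # 9. group, e.g. bestparse:cds_1

    @property
    def length(self):
        """Feature length in bases, assuming inclusive start and end coordinates."""
        return self.end - self.start + 1


def parse_line(line):
    """Parse one whitespace-separated prediction line (assumed nine-column layout)."""
    fields = line.split(None, 8)
    if len(fields) != 9:
        raise ValueError(f"expected 9 columns, got {len(fields)}: {line!r}")
    seq_id, program, feature, start, end, score, strand, frame, group = fields
    return Prediction(seq_id, program, feature, int(start), int(end),
                      float(score), strand, frame, group.strip())


if __name__ == "__main__":
    # Hypothetical line, invented for illustration.
    sample = "seq1 HMMgene firstex 100 250 0.95 + 0 bestparse:cds_1"
    print(parse_line(sample))
```

Filtering the parsed records on the score column is a simple way to keep only high-confidence exons.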

Defining the term 'exon'
Gene prediction programs often use Exon = CDS (coding sequence)
Real exons may contain 5' or 3' UTRs (untranslated regions)

Gene Prediction: NetGene2

NIX: Visualizing Gene Predictions

Gene Prediction: Performance of Genscan

Performance of Genscan: Exon Length

RepeatMasker
Repetitive sequences in human/eukaryotic genomes are a problem
Run gene predictions on large genomic regions before and after masking of repetitive sequence:
  bin/RepeatMasker
  bin/RepeatMasker
Up to 45% of human genomic sequence is derived from transposable/repetitive elements
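What masking does to a sequence can be shown with a short Python sketch. It does not call or reimplement RepeatMasker; it assumes the repeat coordinates are already known (for example from a repeat-finding program) and simply replaces those bases with N before gene prediction. The function names and the example coordinates are hypothetical.

```python
def hard_mask(seq, repeats):
    """Replace bases inside repeat intervals with 'N'.

    `repeats` is a list of (start, end) pairs, 0-based with exclusive end.
    Intervals may overlap; coordinates outside the sequence are clipped.
    """
    masked = list(seq)
    for start, end in repeats:
        for i in range(max(start, 0), min(end, len(seq))):
            masked[i] = "N"
    return "".join(masked)


def masked_fraction(masked_seq):
    """Fraction of the sequence that is masked."""
    return masked_seq.count("N") / len(masked_seq) if masked_seq else 0.0


if __name__ == "__main__":
    # Toy sequence and repeat interval, invented for illustration.
    masked = hard_mask("ACGTACGTAC", [(2, 5)])
    print(masked, masked_fraction(masked))
```

Running the gene finder on both the original and the masked sequence, as the slide suggests, shows which predicted exons were really repeats.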

RepeatMasker

Future Challenges
Bootstrapping: prediction improves as more genes become known
  'Extreme' genes (long/short) are still difficult
  Initial and terminal exons are predicted with lower confidence
  Combine with sequence similarity matches
Non-coding RNAs
  Most gene prediction programs only predict protein-coding genes
  tRNA and rRNA genes are not predicted
Prokaryotic gene finding
  Much easier (no introns), but still not perfect
  Especially short genes (<300 bp) are difficult

Gene Prediction: Take-home messages
Human genome sequence is known
Number of human genes is unknown!
  Before 2001: est. 30, ,000
  As of 2003: 30,000-40,000
Location, structure and function of many human genes are unknown!
Genes may be discovered by different means and methods...

Gene Prediction: Take-home messages
Genes may be predicted by computer programs
Masking of repetitive sequences may be required for large genomic sequences
'Unusual' genes are difficult (high GC%, short or terminal exons)
HMM-based gene prediction programs are suitable for "Gene Grammar"
Prediction methods are not perfect!

The End

Gene Prediction Exercises
I. Gene Finding in Prokaryotic Sequence
II. Gene Finding in Eukaryotic Sequence
Exercises at:

Gene Prediction Exercise
Columns: Sequence, GenBank, Genscan, HMMgene, NetGene2
Seq#1 (HoxA10) Donor H Acc H
Seq#2 (Dub-2) Donor Acc Acc
/exercises/gf_exercise_solution.html

Gene Prediction: Performance of Genscan

Genome Browsing - Exercise #1
How many exons are encoded by the hoxA10 gene? 2 exons
How many base pairs is the transcript length? 2542 bp

Genome Browsing - Exercise #1
On what chromosome is the hoxA10 gene? Human chr. 7
On which arm (short/p or long/q)? p
What gene is located ca. 500 kb downstream of HoxA10? Scap2
On what mouse chromosome is the ortholog/homolog of human HoxA10 located? Mouse chr. 6
In the overview panel, there is a gene located ca. 300 kb downstream of HoxA10. What is its name? Scap2

/exercises/gf_exercise_solution.html