Gene predictions for eukaryotes attgccagtacgtagctagctacacgtatgctattacggatctgtagcttagcgtatct gtatgctgttagctgtacgtacgtatttttctagagcttcgtagtctatggctagtcgt agtcgtagtcgttagcatctgtatgctgttagctgtacgtacgtatttttctagagctt cgtagtctatggctagtcgtagtcgtagtcgttagcatctgtatgctgttagctgtacg tacgtatttttctaggggagcttcgtagtctatggctagtcgtagtcgtagtcgttagc atctgtatgctgttagctgtacgtacgtatttttctaggggagcttcgtagtctatggc tagtcgtagtcgtagtcgttagcatctgtatgctgttagctgtacgtacgtatttttct aggggagcttcgtagtctatggctagtcgtagtcgtagtcgttagcttagtcgtgtagt cttgatctacgtacgtatttttctagagcttcgtagtctatggctagtcgtagtcgtag tcgttagcatctgtatgctgttagctgtacgtacgtatttttctagagcttcgtagtct atggctagtcgtagtcgtagtcgttagcatctgtatgctgttagctgtacgtacgtatt tttctaggggagcttcgtagtctatggctagtcgtagtcgtagtcgttagcatctgtat gctgttagctgtacgtacgtatttttctaggggagcttcgtagtctatggctagtcgta gtcgtagtcgttagcatctgtatgtacgtacgtatttttctagagcttcgtagtctatg gctagtcgtagtcgtagtcgttagcatctgtatgctgttagctgtacgtacgtattttt ctagagcttcgtagtctatggctagtcgtagtcgtagtcgttagcatctgtatgctgtt agctgtacgtacgtatttttctaggggagcttcgtagtctatggctagtcgtagtcgta gtcgttagcatctgtatgctgttagctgtacgtacgtatttttctaggggagcttcgta gtctatggctagtcgtagtcgtagtcgttagcatctgtatggtcgtagtcgttagcatc tgtatgctgttagctgtacgtacgtatttttctaggggagcttcgtagtctatggctag
Gene predictions for eukaryotes attgccagtacgtagctagctacacgtatgctattacggatctgtagcttagcgtatct gtatgctgttagctgtacgtacgtatttttctagagcttcgtagtctatggctagtcgt agtcgtagtcgttagcatctgtatgctgttagctgtacgtacgtatttttctagagctt cgtagtctatggctagtcgtagtcgtagtcgttagcatctgtatgctgttagctgtacg tacgtatttttctaggggagcttcgtagtctatggctagtcgtagtcgtagtcgttagc atctgtatgctgttagctgtacgtacgtatttttctaggggagcttcgtagtctatggc tagtcgtagtcgtagtcgttagcatctgtatgctgttagctgtacgtacgtatttttct aggggagcttcgtagtctatggctagtcgtagtcgtagtcgttagcttagtcgtgtagt cttgatctacgtacgtatttttctagagcttcgtagtctatggctagtcgtagtcgtag tcgttagcatctgtatgctgttagctgtacgtacgtatttttctagagcttcgtagtct atggctagtcgtagtcgtagtcgttagcatctgtatgctgttagctgtacgtacgtatt tttctaggggagcttcgtagtctatggctagtcgtagtcgtagtcgttagcatctgtat gctgttagctgtacgtacgtatttttctaggggagcttcgtagtctatggctagtcgta gtcgtagtcgttagcatctgtatgtacgtacgtatttttctagagcttcgtagtctatg gctagtcgtagtcgtagtcgttagcatctgtatgctgttagctgtacgtacgtattttt ctagagcttcgtagtctatggctagtcgtagtcgtagtcgttagcatctgtatgctgtt agctgtacgtacgtatttttctaggggagcttcgtagtctatggctagtcgtagtcgta gtcgttagcatctgtatgctgttagctgtacgtacgtatttttctaggggagcttcgta gtctatggctagtcgtagtcgtagtcgttagcatctgtatggtcgtagtcgttagcatc tgtatgctgttagctgtacgtacgtatttttctaggggagcttcgtagtctatggctag
Gene predictions for eukaryotes
Three different approaches to computational gene- finding: Intrinsic: use statistical information about known genes (Hidden Markov Models) Extrinsic: compare genomic sequence with known proteins / genes Cross-species sequence comparison: search for similarities among genomes
Hidden-Markov-Models (HMM) for gene prediction s B F F U U U U U F F F F F F E φ For sequence s and parse φ: P(φ) probability of φ P(φ,s) joint probability of φ and s = P(φ) * P(s|φ) P(φ|s) a-posteriori probability of φ
Hidden-Markov-Models (HMM) for gene prediction B F F U U U U U F F F F F F E Goal: find path φ with maximum a-posteriori probability P(φ|s) Equivalent: find path that maximizes joint probability P(φ,s) Optimal path calculated by dynamic programming (Viterbi algorithm)
Hidden-Markov-Models (HMM) for gene prediction B F F U U U U U F F F F F F E Program parameters learned from training data
Hidden-Markov-Models (HMM) for gene prediction Application to gene prediction: A T A A T G C C T A G T C s (DNA) Z Z Z E E E E E E I I I I φ (parse) Introns, exons etc modeled as states in GHMM („generalized HMM“) Given sequence s, find parse that maximizes P(φ|s) (S. Karlin and C. Burge, 1997)
AUGUSTUS Basic model for GHMM-based intrinsic gene finding comparable to GenScan (M. Stanke)
AUGUSTUS
Features of AUGUSTUS: Intron length model Initial pattern for exons Similarity-based weighting for splice sites Interpolated HMM Internal 3’ content model
Hidden-Markov-Models (HMM) for gene prediction A T A A T G C C T A G T C s (DNA) Z Z Z E E E E I I I I φ (parse) Explicit intron length model computationally expensive.
AUGUSTUS Intron length model: Explicit length distribution for short introns Geometric tail for long introns Intron (fixed) Exon Intron (expl.) Exon Intron (geo.)
AUGUSTUS
AUGUSTUS+ Extension of AUGUSTUS using include extrinsic information: Protein sequences EST sequences Syntenic genomic sequences User-defined constraints
Gene prediction by phylogenetic footprinting Comparison of genomic sequences (human and mouse)
Gene prediction by phylogenetic footprinting
AUGUSTUS+ Extended GHMM using extrinsic information Additional input data: collection h of `hints’ about possible gene structure φ for sequence s Consider s, φ and h result of random process. Define probability P(s,h,φ) Find parse φ that maximizes P(φ|s,h) for given s and h.
AUGUSTUS+ Hints created using Alignments to EST sequences Alignments to protein sequences Combined EST and protein alignment (EST alignments supported by protein alignments) Alignments of genomic sequences User-defined hints
AUGUSTUS+ Alignment to EST: hint to (partial) exon EST G1
AUGUSTUS+ EST alignment supported by protein: hint to exon (part), start codon EST G1 Protein
AUGUSTUS+ Alignment to ESTs, Proteins: hints to introns, exons ESTs, Protein G1
AUGUSTUS+ Alignment of genomic sequences: hint to (partial) exon G2 G1
AUGUSTUS+ Consider different types of hints: type of hints: start, stop, dss, ass, exonpart, exon, introns Hint associated with position i in s (exons etc. associated with right end position) max. one hint of each type allowed per position in s Each hint associated with a grade g that indicates its source.
AUGUSTUS+ h i,t = information about hint of type t at position i h i,t = [grade, strand, (length, reading frame)] if hint available (hints created by protein alignments contain information about reading frame) h i,t = $ if no hint of type t available at i
AUGUSTUS+ Standard program version, without hints A T A A T G C C T A G T C s (sequence) Z Z Z E E E E E E I I I I φ (parse) Find parse that maximizes P(φ|s)
AUGUSTUS+ AUGUSTUS+ using hints A T A A T G C C T A G T C s (sequence) $ $ $ $ $ $ $ X $ $ $ $ $ h (type 1) $ $ $ $ $ $ $ $ $ $ $ $ $ h (type 2) $ $ $ $ X $ $ $ $ $ $ $ $ h (type 3).. Z Z Z E E E E E E I I I I φ (parse) Find parse that maximizes P(φ|s,h)
AUGUSTUS+ As in standard HMM theory: maximize joint probability P(φ,s,h) How to calculate P(φ,s,h) ?
AUGUSTUS+ Simplifying assumption: Hints of different types t and at different positions i independent of each other (for redundant hints: ignore „weaker“ types).
AUGUSTUS+ Simplifying assumption: Hints of different types t and at different positions i independent of each other (for redundant hints: ignore „weaker“ types).
AUGUSTUS+ Simplifying assumption: Hints of different types t and at different positions i independent of each other (for redundant hints: ignore „weaker“ types).
AUGUSTUS+ Results: Gene (sub-)structures supported by hints receive bonus compared to non-supported structures Gene (sub-)structures not supported by hints receive malus (M. Stanke et al. 2006, BMC Bioinformatics)
AUGUSTUS+
Using hints from DIALIGN alignments: 1. Obtain large human/mouse sequence pairs (up to 50kb) from UCSC 2. Run CHAOS to find anchor points 3. Run DIALIGN using CHAOS anchor points 4. Create hints h from DIALIGN fragments 5. Run AUGUSTUS with hints
AUGUSTUS+ Hints from DIALIGN fragments: Consider fragments with score ≥ 20 Distinguish high scores (≥ 45) from low scores Consider reading frame given by DIALIGN Consider strand given by DIALIGN => 2*2*2 = 8 grades
AUGUSTUS+ EGASP competition to evaulate and compare gene-prediction methods (Sanger Center, 2005) AUGUSTUS best ab-initio method at EGASP
EGASP test results
Application of AUGUSTUS in genome projects Brugia malayi (TIGR) Aedes aegypti (TIGR) Schistosoma mansoni (TIGR) Tetrahymena thermophilia (TIGR) Galdieria Sulphuraria (Michigan State Univ.) Coprinus cinereus (Univ. Göttingen) Tribolium castaneum (Univ. Göttingen)