Gene predictions for eukaryotes attgccagtacgtagctagctacacgtatgctattacggatctgtagcttagcgtatct gtatgctgttagctgtacgtacgtatttttctagagcttcgtagtctatggctagtcgt.



Three different approaches to computational gene-finding:
- Intrinsic: use statistical information about known genes (Hidden Markov Models)
- Extrinsic: compare genomic sequence with known proteins / genes
- Cross-species sequence comparison: search for similarities among genomes

Hidden-Markov-Models (HMM) for gene prediction
Figure: sequence s with state path φ = B F F U U U U U F F F F F F E
For sequence s and parse φ:
- P(φ): probability of φ
- P(φ,s): joint probability of φ and s, P(φ,s) = P(φ) * P(s|φ)
- P(φ|s): a-posteriori probability of φ

Hidden-Markov-Models (HMM) for gene prediction
State path φ: B F F U U U U U F F F F F F E
- Goal: find the path φ with maximum a-posteriori probability P(φ|s)
- Equivalent: find the path that maximizes the joint probability P(φ,s), because P(φ|s) = P(φ,s) / P(s) and P(s) does not depend on φ
- The optimal path is calculated by dynamic programming (Viterbi algorithm)
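The Viterbi step can be illustrated with a minimal Python sketch. It is a toy example, not the gene-finding model itself: the two states reuse the labels F and U from the slide, while the coin-flip alphabet (H, T) and all probabilities are invented for illustration. `log_joint` evaluates log P(φ,s) = log P(φ) + log P(s|φ) for a given path, and `viterbi` finds the path that maximizes it.

```python
import math

# Toy two-state HMM; every number below is an illustrative assumption.
STATES = ("F", "U")
START = {"F": 0.5, "U": 0.5}
TRANS = {"F": {"F": 0.9, "U": 0.1}, "U": {"F": 0.1, "U": 0.9}}
EMIT = {"F": {"H": 0.5, "T": 0.5}, "U": {"H": 0.9, "T": 0.1}}


def log_joint(path, seq, start_p, trans_p, emit_p):
    """Return log P(path, seq) = log P(path) + log P(seq | path)."""
    lp = math.log(start_p[path[0]]) + math.log(emit_p[path[0]][seq[0]])
    for i in range(1, len(seq)):
        lp += math.log(trans_p[path[i - 1]][path[i]])
        lp += math.log(emit_p[path[i]][seq[i]])
    return lp


def viterbi(seq, states, start_p, trans_p, emit_p):
    """Return (best_path, log_prob) maximizing the joint probability P(path, seq)."""
    # best[k]: highest log joint probability of any path ending in state k
    best = {k: math.log(start_p[k]) + math.log(emit_p[k][seq[0]]) for k in states}
    backpointers = []
    for symbol in seq[1:]:
        new_best, pointers = {}, {}
        for k in states:
            prev = max(states, key=lambda j: best[j] + math.log(trans_p[j][k]))
            new_best[k] = (best[prev] + math.log(trans_p[prev][k])
                           + math.log(emit_p[k][symbol]))
            pointers[k] = prev
        best = new_best
        backpointers.append(pointers)
    # Trace back from the best final state.
    last = max(states, key=lambda k: best[k])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, best[last]


if __name__ == "__main__":
    path, score = viterbi("HTHTHHHHHHTH", STATES, START, TRANS, EMIT)
    print("".join(path), score)
```

Working in log space avoids numerical underflow on long sequences; the running time grows linearly with sequence length and quadratically with the number of states.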

Hidden-Markov-Models (HMM) for gene prediction
State path φ: B F F U U U U U F F F F F F E
Program parameters are learned from training data.

Hidden-Markov-Models (HMM) for gene prediction
Application to gene prediction:
s (DNA):   A T A A T G C C T A G T C
φ (parse): Z Z Z E E E E E E I I I I
- Introns, exons etc. are modeled as states in a GHMM ("generalized HMM")
- Given sequence s, find the parse that maximizes P(φ|s)
(S. Karlin and C. Burge, 1997)

AUGUSTUS
Basic model for GHMM-based intrinsic gene finding, comparable to GenScan (M. Stanke)

AUGUSTUS

Features of AUGUSTUS:
- Intron length model
- Initial pattern for exons
- Similarity-based weighting for splice sites
- Interpolated HMM
- Internal 3' content model

Hidden-Markov-Models (HMM) for gene prediction
s (DNA):   A T A A T G C C T A G T C
φ (parse): Z Z Z E E E E I I I I
An explicit intron length model is computationally expensive.

AUGUSTUS
Intron length model:
- Explicit length distribution for short introns
- Geometric tail for long introns
Figure labels: Intron (fixed), Exon, Intron (expl.), Exon, Intron (geo.)
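A mixed length model of this kind can be sketched in a few lines of Python. The cutoff, the explicit probabilities, the tail mass and the geometric parameter below are all invented for illustration and are not AUGUSTUS parameters. Lengths up to the cutoff d are looked up in an explicit table; beyond d, the probability decays geometrically, which is what a single self-looping HMM state produces and is therefore cheap to compute.

```python
# Illustrative values only (assumptions, not trained parameters).
EXPLICIT = [0.0, 0.1, 0.3, 0.2]  # P(length = 1..4); sums to 1 - TAIL_MASS
TAIL_MASS = 0.4                  # total probability of lengths > len(EXPLICIT)
Q = 0.99                         # per-base continuation probability in the tail


def intron_length_prob(length, explicit, tail_mass, q):
    """P(intron length) under an explicit-plus-geometric-tail model."""
    d = len(explicit)
    if length < 1:
        return 0.0
    if length <= d:
        # Short introns: explicit length distribution.
        return explicit[length - 1]
    # Long introns: geometric tail, normalized so the tail sums to tail_mass.
    k = length - d
    return tail_mass * (1.0 - q) * q ** (k - 1)


if __name__ == "__main__":
    for n in (2, 4, 5, 100):
        print(n, intron_length_prob(n, EXPLICIT, TAIL_MASS, Q))
```

The explicit part costs one table entry per length, so a small cutoff keeps the expensive part of the model bounded while the tail still allows arbitrarily long introns.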

AUGUSTUS

AUGUSTUS+
Extension of AUGUSTUS to include extrinsic information:
- Protein sequences
- EST sequences
- Syntenic genomic sequences
- User-defined constraints

Gene prediction by phylogenetic footprinting
Comparison of genomic sequences (human and mouse)

Gene prediction by phylogenetic footprinting

AUGUSTUS+
Extended GHMM using extrinsic information
- Additional input data: a collection h of "hints" about the possible gene structure φ for sequence s
- Consider s, φ and h to be the result of a random process, and define the probability P(s,h,φ)
- Find the parse φ that maximizes P(φ|s,h) for given s and h

AUGUSTUS+
Hints are created using:
- Alignments to EST sequences
- Alignments to protein sequences
- Combined EST and protein alignments (EST alignments supported by protein alignments)
- Alignments of genomic sequences
- User-defined hints

AUGUSTUS+
Alignment to EST: hint to (partial) exon
Figure labels: EST, G1

AUGUSTUS+
EST alignment supported by protein: hint to exon (part), start codon
Figure labels: EST, G1, Protein

AUGUSTUS+
Alignment to ESTs, proteins: hints to introns, exons
Figure labels: ESTs, Protein, G1

AUGUSTUS+
Alignment of genomic sequences: hint to (partial) exon
Figure labels: G2, G1

AUGUSTUS+
Consider different types of hints:
- Types of hints: start, stop, dss (donor splice site), ass (acceptor splice site), exonpart, exon, introns
- Each hint is associated with a position i in s (exons etc. are associated with their right end position)
- At most one hint of each type is allowed per position in s
- Each hint is associated with a grade g that indicates its source

AUGUSTUS+
h_{i,t} = information about the hint of type t at position i
- h_{i,t} = [grade, strand, (length, reading frame)] if a hint is available (hints created by protein alignments contain information about the reading frame)
- h_{i,t} = $ if no hint of type t is available at i
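One way to hold such hints in memory is a table keyed by (position, type). The Python sketch below is an assumed data layout for illustration, not the AUGUSTUS implementation: the class and field names are hypothetical, the type names are copied from the slides, and `None` plays the role of the `$` symbol for "no hint".

```python
from dataclasses import dataclass
from typing import Optional

# Hint type names as listed on the slides.
HINT_TYPES = ("start", "stop", "dss", "ass", "exonpart", "exon", "introns")


@dataclass(frozen=True)
class Hint:
    grade: int                    # indicates the source of the hint
    strand: str                   # "+" or "-"
    length: Optional[int] = None  # optional, e.g. for exon-like hints
    frame: Optional[int] = None   # optional, e.g. from protein alignments


class HintTable:
    """Stores at most one hint of each type per sequence position."""

    def __init__(self):
        self._hints = {}

    def add(self, pos, hint_type, hint):
        if hint_type not in HINT_TYPES:
            raise ValueError(f"unknown hint type: {hint_type}")
        key = (pos, hint_type)
        if key in self._hints:
            raise ValueError("only one hint of each type is allowed per position")
        self._hints[key] = hint

    def get(self, pos, hint_type):
        # None corresponds to "$": no hint of this type at this position.
        return self._hints.get((pos, hint_type))


if __name__ == "__main__":
    table = HintTable()
    table.add(8, "start", Hint(grade=1, strand="+", frame=0))
    print(table.get(8, "start"), table.get(8, "stop"))
```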

AUGUSTUS+
Standard program version, without hints:
s (sequence): A T A A T G C C T A G T C
φ (parse):    Z Z Z E E E E E E I I I I
Find the parse that maximizes P(φ|s)

AUGUSTUS+
AUGUSTUS+ using hints:
s (sequence): A T A A T G C C T A G T C
h (type 1):   $ $ $ $ $ $ $ X $ $ $ $ $
h (type 2):   $ $ $ $ $ $ $ $ $ $ $ $ $
h (type 3):   $ $ $ $ X $ $ $ $ $ $ $ $
...
φ (parse):    Z Z Z E E E E E E I I I I
Find the parse that maximizes P(φ|s,h)

AUGUSTUS+
As in standard HMM theory: maximize the joint probability P(φ,s,h).
How to calculate P(φ,s,h)?

AUGUSTUS+
Simplifying assumption: hints of different types t and at different positions i are independent of each other (for redundant hints: ignore "weaker" types).
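The formulas from these slides did not survive in the transcript. As a hedged reconstruction: the first equality below is just the chain rule, and the second is one plausible reading of the independence assumption, namely that the hints h_{i,t} are conditionally independent given φ and s. It should not be taken as the authors' exact formula.

```latex
P(\varphi, s, h) \;=\; P(\varphi, s)\, P(h \mid \varphi, s)
\;=\; P(\varphi, s) \prod_{i} \prod_{t} P\!\left(h_{i,t} \mid \varphi, s\right)
```

Under this reading, P(φ,s) is the quantity the hint-free GHMM already computes, so the hints enter only as one extra multiplicative factor per position and type.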


AUGUSTUS+
Results:
- Gene (sub-)structures supported by hints receive a bonus compared to non-supported structures
- Gene (sub-)structures not supported by hints receive a penalty (malus)
(M. Stanke et al. 2006, BMC Bioinformatics)
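The bonus and penalty effect can be made concrete with a toy calculation in Python. This is a sketch under strong assumptions and not the published model: each candidate site either carries a hint or not, and a hint is assumed to be observed with probability `P_HINT_IF_FEATURE` when the parse places a matching feature there and with probability `P_HINT_IF_NONE` otherwise. Both values and all names are invented for illustration.

```python
import math

# Assumed toy parameters (not taken from AUGUSTUS).
P_HINT_IF_FEATURE = 0.5   # chance of seeing a hint where the parse has the feature
P_HINT_IF_NONE = 0.01     # chance of seeing a hint where it does not


def hint_log_factor(sites, parse_features, hints):
    """Sum of log P(h_site | parse) over a fixed list of candidate sites."""
    total = 0.0
    for site in sites:
        p = P_HINT_IF_FEATURE if site in parse_features else P_HINT_IF_NONE
        total += math.log(p if site in hints else 1.0 - p)
    return total


if __name__ == "__main__":
    sites = [(3, "start"), (7, "start")]
    hints = {(7, "start")}
    supported = hint_log_factor(sites, {(7, "start")}, hints)
    neutral = hint_log_factor(sites, set(), hints)
    unsupported = hint_log_factor(sites, {(3, "start")}, hints)
    print(supported, neutral, unsupported)
```

With these numbers, the parse whose start codon matches the hint scores highest, while a parse that places a start codon where no hint exists scores lower than one that places none at all; that ordering is the bonus and penalty behavior described on the slide.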

AUGUSTUS+

Using hints from DIALIGN alignments:
1. Obtain large human/mouse sequence pairs (up to 50 kb) from UCSC
2. Run CHAOS to find anchor points
3. Run DIALIGN using the CHAOS anchor points
4. Create hints h from the DIALIGN fragments
5. Run AUGUSTUS with hints

AUGUSTUS+
Hints from DIALIGN fragments:
- Consider fragments with score ≥ 20
- Distinguish high scores (≥ 45) from low scores
- Consider the reading frame given by DIALIGN
- Consider the strand given by DIALIGN
=> 2*2*2 = 8 grades
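The grading scheme can be sketched as a small Python function. The thresholds 20 and 45 come from the slide; everything else is an assumption made for illustration. In particular, the slide does not say how reading frame and strand are turned into two-valued properties, so the sketch simply takes two boolean flags (`frame_flag`, `forward_strand`) and packs the three binary properties into a number from 0 to 7.

```python
MIN_SCORE = 20    # fragments below this score are ignored (from the slide)
HIGH_SCORE = 45   # boundary between low and high scores (from the slide)


def fragment_grade(score, frame_flag, forward_strand):
    """Return a grade in 0..7, or None if the fragment is not used.

    frame_flag and forward_strand are hypothetical two-valued encodings of
    the reading-frame and strand information reported by DIALIGN.
    """
    if score < MIN_SCORE:
        return None
    high = score >= HIGH_SCORE
    # Pack three binary properties into one of 2*2*2 = 8 grades.
    return (int(high) << 2) | (int(frame_flag) << 1) | int(forward_strand)


if __name__ == "__main__":
    print(fragment_grade(19, True, True))    # fragment ignored
    print(fragment_grade(30, False, True))   # low score
    print(fragment_grade(50, True, False))   # high score
```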

AUGUSTUS+
EGASP competition to evaluate and compare gene-prediction methods (Sanger Center, 2005)
AUGUSTUS was the best ab-initio method at EGASP

EGASP test results

Application of AUGUSTUS in genome projects:
- Brugia malayi (TIGR)
- Aedes aegypti (TIGR)
- Schistosoma mansoni (TIGR)
- Tetrahymena thermophila (TIGR)
- Galdieria sulphuraria (Michigan State Univ.)
- Coprinus cinereus (Univ. Göttingen)
- Tribolium castaneum (Univ. Göttingen)