Eukaryotic Gene Finding


BMI/CS 776, Spring 2018
www.biostat.wisc.edu/bmi776/
Anthony Gitter, gitter@biostat.wisc.edu
These slides, excluding third-party material, are licensed under CC BY-NC 4.0 by Mark Craven, Colin Dewey, and Anthony Gitter

Goals for Lecture
Key concepts:
Incorporating sequence signals into gene finding with HMMs
Modeling durations with generalized HMMs
Modeling conservation with pair HMMs
Modern gene finding and genome annotation

Sources of Evidence for Gene Finding
Signals: the sequence signals (e.g. splice junctions) involved in gene expression
Content: statistical properties that distinguish protein-coding DNA from non-coding DNA
Conservation: signal and content properties that are conserved across related sequences (e.g. orthologous regions of the mouse and human genomes)

Eukaryotic Gene Structure

Splice Signals
[Figure: example donor sites and acceptor sites, with positions labeled -3 to 6 around the exon boundary. Figures from Yi Xing.]
There are significant dependencies among non-adjacent positions in donor splice signals.
These dependencies are informative for inferring the hidden state of the HMM.

Parsing a DNA Sequence
The HMM Viterbi path represents a parse of a given sequence and predicts exons, acceptor sites, introns, etc.
Hidden states (figure legend): 5’ UTR, Exon, Intron, Intergenic
Parse: ACCGTTA | CGTGTCATTCTACGTGATCAT | CGGATCCTAGAATCATCGATC | CGTGCGATCGATCGGATTAGCTAGCTTAGCTAGGA
Observed sequence: ACCGTTACGTGTCATTCTACGTGATCATCGGATCCTAGAATCATCGATCCGTGCGATCGATCGGATTAGCTAGCTTAGCTAGGA
How can we properly model the transitions from one state to another?
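To make the idea of a parse concrete, here is a minimal sketch of Viterbi decoding on a toy two-state HMM. The state names (`intergenic`, `exon`), all probabilities, the input string, and the function name `viterbi` are illustrative assumptions; they are not parameters from the lecture or from any real gene finder.

```python
import math

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most probable state path for obs, computed in log space."""
    # score[i][s]: best log-probability of any path that ends in state s at position i
    score = [{s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]]) for s in states}]
    back = [{}]
    for i in range(1, len(obs)):
        score.append({})
        back.append({})
        for s in states:
            # Best predecessor of state s at position i
            prev = max(states, key=lambda r: score[i - 1][r] + math.log(trans_p[r][s]))
            score[i][s] = (score[i - 1][prev] + math.log(trans_p[prev][s])
                           + math.log(emit_p[s][obs[i]]))
            back[i][s] = prev
    # Trace back from the best final state
    path = [max(states, key=lambda s: score[-1][s])]
    for i in range(len(obs) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return path[::-1]

# Hypothetical toy parameters: "exon" prefers G/C, "intergenic" prefers A/T.
states = ["intergenic", "exon"]
start_p = {"intergenic": 0.9, "exon": 0.1}
trans_p = {"intergenic": {"intergenic": 0.9, "exon": 0.1},
           "exon": {"intergenic": 0.1, "exon": 0.9}}
emit_p = {"intergenic": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
          "exon": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1}}

path = viterbi("ATATGCGCGCGCATAT", states, start_p, trans_p, emit_p)
print("".join("E" if s == "exon" else "-" for s in path))
```

Each run of identical states in the returned path corresponds to one segment of the parse. A real gene finder uses many more states (UTRs, introns, splice sites) and trained parameters.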

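Length Distributions of Introns/Exons
[Figure from Burge & Karlin, Journal of Molecular Biology, 1997: length histograms for introns, initial exons, internal exons, and terminal exons]
Introns: a geometric distribution provides a good fit.
Initial, internal, and terminal exons: a geometric distribution provides a poor fit.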

Duration Modeling in HMMs
Semi-Markov models are well-motivated for some sequence elements (e.g. exons).
Semi-Markov: explicitly model the length (duration) of hidden states.
Also called a generalized hidden Markov model.
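The following sketch illustrates why duration modeling matters: a state with a self-loop in a standard HMM implies a geometric length distribution, whose most probable length is always 1. The function names and the explicit length table are hypothetical, and the numbers are made up for illustration.

```python
def geometric_pmf(length, stay_prob):
    """P(duration = length) for a state with self-transition probability stay_prob."""
    # The state is occupied for `length` steps: (length - 1) self-loops, then one exit.
    return stay_prob ** (length - 1) * (1.0 - stay_prob)

def mean_geometric(stay_prob):
    """Expected duration of a state with self-transition probability stay_prob."""
    return 1.0 / (1.0 - stay_prob)

# In a standard HMM the most probable duration is 1,
# no matter how large the self-transition probability is.
stay = 0.99
pmf = [geometric_pmf(d, stay) for d in range(1, 501)]
mode = 1 + pmf.index(max(pmf))
print(mode, mean_geometric(stay))

# A semi-Markov (generalized) HMM instead attaches an explicit length
# distribution to a state. These probabilities are invented for illustration.
explicit_exon_lengths = {50: 0.1, 100: 0.3, 150: 0.4, 200: 0.2}
explicit_mode = max(explicit_exon_lengths, key=explicit_exon_lengths.get)
print(explicit_mode)
```

With an explicit table, the peak of the length distribution can sit away from 1, which a self-loop cannot express.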

The GENSCAN HMM for Eukaryotic Gene Finding [Burge & Karlin ‘97]
[Figure from Burge & Karlin, Journal of Molecular Biology, 1997]
Each shape represents a functional unit of a gene or genomic region.
Pairs of intron/exon units represent the different ways an intron can interrupt a coding sequence (after the 1st base in a codon, after the 2nd base, or after the 3rd base).
A complementary submodel (not shown) detects genes on the opposite DNA strand.
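The three ways an intron can interrupt a coding sequence are commonly called intron phases. As a small sketch, the phase of each intron can be computed from the lengths of the coding exons. The function name `intron_phases` and the exon lengths are hypothetical; the code assumes lengths count coding bases only, starting at the first base of the start codon, and uses the common convention that phase 0 means the intron falls between codons (after the 3rd base), phase 1 after the 1st base, and phase 2 after the 2nd base.

```python
def intron_phases(coding_exon_lengths):
    """Phase of the intron that follows each exon except the last."""
    phases = []
    coding_bases = 0
    for length in coding_exon_lengths[:-1]:
        coding_bases += length
        # Number of bases of the current codon already read when the intron starts
        phases.append(coding_bases % 3)
    return phases

# Made-up exon lengths; the total (243) is a multiple of 3, as a complete
# coding sequence must be.
print(intron_phases([100, 50, 62, 31]))  # [1, 0, 2]
```

A gene model has to track this phase across each intron so that the reading frame resumes correctly in the next exon, which is why the model needs separate intron/exon units for each case.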

Parsing a DNA Sequence
The Viterbi path represents a parse of a given sequence, predicting exons, introns, etc.
Sequence: ACCGTTACGTGTCATTCTACGTGATCATCGGATCCTAGAATCATCGATCCGTGCGATCGATCGGATTAGCTAGCTTAGCTAGGAGAGCATCGATCGGATCGAGGAGGAGCCTATATAAATCAA
Parse: ACCGTTA | CGTGTCATTCTACGTGATCAT | CGGATCCTAGAATCATCGATC | CGTGCGATCGATCGGATTAGCTAGCTTAGCTAGGA | GAGCATCGATCGGATCGAGGAGGAGCCTATATAAATCAA

Comparative Algorithms
Genes are among the most conserved elements in the genome, so use conservation to help infer the locations of genes.
Some signals associated with genes are short and occur frequently in the genome, so use conservation to eliminate false candidate sites from consideration.
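As a rough sketch of the second point, the code below finds every occurrence of a short signal (here the `GT` dinucleotide that begins most introns) and keeps only the candidates that are also present at the same position in a second, aligned sequence. The function names and both toy sequences are hypothetical, and the sequences are assumed to be already aligned with no gaps.

```python
def candidate_sites(seq, motif):
    """Start positions of every occurrence of motif in seq (overlaps allowed)."""
    width = len(motif)
    return [i for i in range(len(seq) - width + 1) if seq[i:i + width] == motif]

def conserved_sites(aligned1, aligned2, motif):
    """Candidate sites in aligned1 where aligned2 has the same motif at the same columns."""
    width = len(motif)
    return [i for i in candidate_sites(aligned1, motif) if aligned2[i:i + width] == motif]

# Hypothetical, already-aligned toy sequences of equal length.
human = "ACGTAAGTCCGTTAGGTGA"
mouse = "ACCTAAGTCCATTAGGTCA"

print(candidate_sites(human, "GT"))         # [2, 6, 10, 15]
print(conserved_sites(human, mouse, "GT"))  # [6, 15]
```

Exact positional matching is a deliberate simplification; real comparative gene finders score conservation probabilistically over an alignment rather than filtering on identity.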

Pair Hidden Markov Models
Each non-silent state emits one character or a pair of characters.
[Figure: pair HMM state diagram labeled with transition probabilities]
H: homology (match) state
I: insert state
D: delete state

Pair HMM Paths are Alignments
sequence 1: AAGCGC
sequence 2: ATGTC
hidden:   B H H I I H D H E
observed:   A A G C G - C
            A T - - G T C
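A short sketch of how a pair HMM state path determines an alignment, using the sequences and path from this slide. The function name `path_to_alignment` is hypothetical, and the code assumes, as the slide's example suggests, that I emits a character from sequence 1 only, D emits from sequence 2 only, and B and E are silent begin and end states.

```python
def path_to_alignment(path, seq1, seq2):
    """Turn a pair HMM state path into two gapped alignment rows."""
    row1, row2 = [], []
    i = j = 0
    for state in path:
        if state == "H":      # homology: one character from each sequence
            row1.append(seq1[i])
            row2.append(seq2[j])
            i += 1
            j += 1
        elif state == "I":    # insert: character from sequence 1, gap in sequence 2
            row1.append(seq1[i])
            row2.append("-")
            i += 1
        elif state == "D":    # delete: gap in sequence 1, character from sequence 2
            row1.append("-")
            row2.append(seq2[j])
            j += 1
        elif state not in ("B", "E"):  # B and E are silent
            raise ValueError(f"unknown state: {state}")
    if i != len(seq1) or j != len(seq2):
        raise ValueError("path does not consume both sequences exactly")
    return "".join(row1), "".join(row2)

rows = path_to_alignment("BHHIIHDHE", "AAGCGC", "ATGTC")
print(rows[0])  # AAGCG-C
print(rows[1])  # AT--GTC
```

Because every path that consumes both sequences maps to exactly one alignment, finding the most probable path with Viterbi also yields the most probable alignment under the model.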

Generalized Pair HMMs
Represent a parse π as a sequence of states and a sequence of associated lengths for each input sequence.
[Figure: a sequence of hidden states (F+, P+, N, Einit+); each hidden state generates a pair of duration times and a pair of sequences, and there may be gaps in the sequences.]
SLAM: Pachter et al. RECOMB 2001

Modern Genome Annotation
RNA-Seq, mass spectrometry, and other technologies provide powerful information for genome annotation.

Modern Genome Annotation
[Figure from Yandell et al., Nature Reviews Genetics 2012]

Modern Genome Annotation
protein-coding genes, isoforms, translated regions
small RNAs
long non-coding RNAs
promoters and enhancers
pseudogenes
Mudge and Harrow, Nature Reviews Genetics 2016