Ab initio gene prediction

Slides:



Advertisements
Similar presentations
Gene Prediction: Similarity-Based Approaches
Advertisements

GS 540 week 5. What discussion topics would you like? Past topics: General programming tips C/C++ tips and standard library BLAST Frequentist vs. Bayesian.
Hidden Markov Model in Biological Sequence Analysis – Part 2
HIDDEN MARKOV MODELS IN COMPUTATIONAL BIOLOGY CS 594: An Introduction to Computational Molecular Biology BY Shalini Venkataraman Vidhya Gunaseelan.
Ab initio gene prediction Genome 559, Winter 2011.
Markov Models Charles Yan Markov Chains A Markov process is a stochastic process (random process) in which the probability distribution of the.
 CpG is a pair of nucleotides C and G, appearing successively, in this order, along one DNA strand.  CpG islands are particular short subsequences in.
Patterns, Profiles, and Multiple Alignment.
1 DNA Analysis Amir Golnabi ENGS 112 Spring 2008.
Ka-Lok Ng Dept. of Bioinformatics Asia University
Profiles for Sequences
Gene Finding BCH364C/391L Systems Biology / Bioinformatics – Spring 2015 Edward Marcotte, Univ of Texas at Austin Edward Marcotte/Univ. of Texas/BCH364C-391L/Spring.
CISC667, F05, Lec18, Liao1 CISC 467/667 Intro to Bioinformatics (Fall 2005) Gene Prediction and Regulation.
Gene prediction and HMM Computational Genomics 2005/6 Lecture 9b Slides taken from (and rapidly mixed) Larry Hunter, Tom Madej, William Stafford Noble,
Hidden Markov Models I Biology 162 Computational Genetics Todd Vision 14 Sep 2004.
. Sequence Alignment via HMM Background Readings: chapters 3.4, 3.5, 4, in the Durbin et al.
S. Maarschalkerweerd & A. Tjhang1 Parameter estimation for HMMs, Baum-Welch algorithm, Model topology, Numerical stability Chapter
Gene Finding Charles Yan.
Lecture 9 Hidden Markov Models BioE 480 Sept 21, 2004.
CSE182-L10 Gene Finding.
Comparative ab initio prediction of gene structures using pair HMMs
1 Markov Chains Algorithms in Computational Biology Spring 2006 Slides were edited by Itai Sharon from Dan Geiger and Ydo Wexler.
Elze de Groot1 Parameter estimation for HMMs, Baum-Welch algorithm, Model topology, Numerical stability Chapter
Eukaryotic Gene Finding
Bioinformatics Unit 1: Data Bases and Alignments Lecture 3: “Homology” Searches and Sequence Alignments (cont.) The Mechanics of Alignments.
Lecture 12 Splicing and gene prediction in eukaryotes
Eukaryotic Gene Finding
Sequence comparison: Score matrices Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas
Biological Motivation Gene Finding in Eukaryotic Genomes
Information theoretic interpretation of PAM matrices Sorin Istrail and Derek Aguiar.
What is comparative genomics? Analyzing & comparing genetic material from different species to study evolution, gene function, and inherited disease Understand.
Gene Finding BIO337 Systems Biology / Bioinformatics – Spring 2014 Edward Marcotte, Univ of Texas at Austin Edward Marcotte/Univ. of Texas/BIO337/Spring.
Gene finding with GeneMark.HMM (Lukashin & Borodovsky, 1997 ) CS 466 Saurabh Sinha.
BINF6201/8201 Hidden Markov Models for Sequence Analysis
Doug Raiford Lesson 3.  Have a fully sequenced genome  How identify the genes?  What do we know so far? 10/13/20152Gene Prediction.
Pairwise Sequence Alignment. The most important class of bioinformatics tools – pairwise alignment of DNA and protein seqs. alignment 1alignment 2 Seq.
Gene finding and gene structure prediction M. Fatih BÜYÜKAKÇALI Computational Bioinformatics 2012.
A markovian approach for the analysis of the gene structure C. MelodeLima 1, L. Guéguen 1, C. Gautier 1 and D. Piau 2 1 Biométrie et Biologie Evolutive.
Construction of Substitution Matrices
Exploring Alternative Splicing Features using Support Vector Machines Feature for Alternative Splicing Alternative splicing is a mechanism for generating.
Bioinformatics Ayesha M. Khan 9 th April, What’s in a secondary database?  It should be noted that within multiple alignments can be found conserved.
Gene Prediction: Similarity-Based Methods (Lecture for CS498-CXZ Algorithms in Bioinformatics) Sept. 15, 2005 ChengXiang Zhai Department of Computer Science.
Mark D. Adams Dept. of Genetics 9/10/04
Comp. Genomics Recitation 9 11/3/06 Gene finding using HMMs & Conservation.
Gene, Proteins, and Genetic Code. Protein Synthesis in a Cell.
Basic Overview of Bioinformatics Tools and Biocomputing Applications II Dr Tan Tin Wee Director Bioinformatics Centre.
Genes and Genomes. Genome On Line Database (GOLD) 243 Published complete genomes 536 Prokaryotic ongoing genomes 434 Eukaryotic ongoing genomes December.
JIGSAW: a better way to combine predictions J.E. Allen, W.H. Majoros, M. Pertea, and S.L. Salzberg. JIGSAW, GeneZilla, and GlimmerHMM: puzzling out the.
Chapter 3 The Interrupted Gene.
Applications of HMMs in Computational Biology BMI/CS 576 Colin Dewey Fall 2010.
Step 3: Tools Database Searching
(H)MMs in gene prediction and similarity searches.
1 Applications of Hidden Markov Models (Lecture for CS498-CXZ Algorithms in Bioinformatics) Nov. 12, 2005 ChengXiang Zhai Department of Computer Science.
BIOINFORMATICS Ayesha M. Khan Spring 2013 Lec-8.
Genetic Code and Interrupted Gene Chapter 4. Genetic Code and Interrupted Gene Aala A. Abulfaraj.
1 Gene Finding. 2 “The Central Dogma” TranscriptionTranslation RNA Protein.
bacteria and eukaryotes
EGASP 2005 Evaluation Protocol
What is a Hidden Markov Model?
EGASP 2005 Evaluation Protocol
Visualization of genomic data
Eukaryotic Gene Finding
Recitation 7 2/4/09 PSSMs+Gene finding
Pair Hidden Markov Model
Hidden Markov Models (HMMs)
Finding regulatory modules
HIDDEN MARKOV MODELS IN COMPUTATIONAL BIOLOGY
4. HMMs for gene finding HMM Ability to model grammar
Hidden Markov Model Lecture #6
The Toy Exon Finder.
Presentation transcript:

Ab initio gene prediction Genome 559, Winter 2012

Ab initio gene prediction method Define parameters of real genes (based on experimental evidence): Use those parameters to obtain a best interpretation of genes from any region from genome sequence alone. Splice donor sequence model Splice acceptor sequence model Intron and exon length distribution Open reading frame requirement in coding exons Requirement that introns maintain reading frame Transcription start and stop models (difficult to predict, often omitted). ab initio = "from the beginning" (i.e. without experimental evidence)

Sites we might want to predict Translation start Translation stop Splice donor site Splice acceptor site (some predictors only deal with coding exons; the 5' and 3' ends are harder to predict.)

Open reading frames (random sequence) 61 of 64 codons are not stop codons (0.953 assuming equal nucleotide frequencies). Probability of not having a stop codon in a particular reading frame along a length L of DNA is a geometric distribution that decays rapidly with L. There are 3 reading frames on each DNA strand.

Geometric distribution in random sequence of distance to first stop codon (p=3/64) k amino acid codons followed by one stop codon long open reading frames are rare in random sequence (distance in codons)

Splice donor and acceptor information exon intron donor, C. elegans (sums to ~8 bits) intron exon acceptor, C. elegans (sums to ~9 bits) Note – these show a log-odds measure of information content compared to background nucleotide frequencies. Similar to BLOSUM matrix log-odds.

Position Specific Score Matrix (PSSM) splice donor 1 2 3 4 5 6 7 8 9 10 11 12 13 14 A C G T Slide PSSM along DNA, computing a score at every position. (this is a conceptual example, the real thing would be computed as log-odds values, similar to BLOSUM matrices)

Intron length distribution (C. elegans) Note: intron length distributions in Drosophila melanogaster and Homo sapiens (and most other animals) are longer and broader.

Other information that can be used Splice donor and acceptor must be paired and donor must be upstream of acceptor (duh). Introns in coding regions must maintain reading frame of the flanking exons. Nucleotide content analysis (e.g. introns tend to be AT rich).

Simple conceptual example (plus strand only) Sites scored on basis of PSSM matches to known splice donor sequence model. Arrow length reflects quality of match (worse matches not shown).

Add splice acceptor information (example cont.) Add splice acceptor information Where would you infer introns? (one reasonable interpretation)

stop codon before highest scoring splice donor! (example cont.) stop codon before highest scoring splice donor! reinterpreted (avoids stop codon by using lower scoring splice donor):

Real example (end result) 1 2 3 4 Note that this gene has no mRNA sequences (EST and ORFeome tracks empty). This is a pure ab initio prediction (made by Phil Green's genefinder).

Hidden Markov Model (HMM) Markov chain - a linear series of states in which each state is dependent only on the previous state. HMM - a model that uses a Markov chain to infer the most likely states in data with unknown states ("hidden" states). A Markov chain has states and transition probabilities: A B pAB pBA (since there are only two states, the probability of staying in state A is 1-pAB and the probability of staying in state B is 1-pBA ) states A and B

A B 0.98 0.02 0.4 0.6 A -> B B -> A What will the series of states look like (qualitatively) for this Markov chain? It will have long stretches of A states, interspersed with short stretches of B states.

Hidden Markov Model We have a Markov chain with appropriate states and known transition probabilities (e.g. best fit to experimentally known genes). We have a DNA sequence with unknown states. Find the series of Markov chain states with the maximum likelihood for the DNA sequence. Solved with the Viterbi algorithm (we won't cover this, but it is another dynamic programming algorithm). See http://en.wikipedia.org/wiki/Viterbi_algorithm

Gene Prediction HMM States coding exon states (three frames) intron states (three frames of codon they insert into) special first (init) and last (term) coding exon states (splice acceptor) (splice donor) taken from Stormo lab paper

A way to connect the HMM formalism to specifics Note – these probabilities are qualitative and are intended only to portray the local trends.

Long open reading frames favor exon state

Ab initio gene prediction Requires a Markov chain with transition probabilities (typically "trained" on known genes or genes in a closely-related species). Produces a parse of a DNA sequence into each of the Markov states (intron, exon, intergenic, etc). Accuracy varies and always imperfect.

Intron positions and reading frame exon M I L E S D A V I S The intron can be any length and still produce the same exons This particular splice is between two codons (0-shifting) The splice position can move and maintain coding frame as long as both positions move coordinately. If one splice endpoint moves it may change reading frame

Using evolutionary conservation as a guide to improve ab initio predictions - an example using dot plots is shown in the next two slides. The principle is that introns usually diverge much faster than coding exons. (though not always true)

other possible corrections DNA dot matrix comparison of two ab initio gene predictions in related genomes dubious exon 1 exon 2 too long good exon add intron? Gene A (ab initio model) remove intron? other possible corrections Gene B (ab initio model)

After correction of exons 1 and 2 in the red gene