Gene prediction and HMM Computational Genomics 2005/6 Lecture 9b Slides taken (and rapidly mixed) from William Stafford Noble, Larry Hunter, and Eyal Pribman.

Presentation transcript:

Gene prediction and HMM Computational Genomics 2005/6 Lecture 9b Slides taken (and rapidly mixed) from William Stafford Noble, Larry Hunter, and Eyal Pribman. Partially modified by Benny Chor.

Annotation of Genomic Sequence Given the sequence of an organism's genome, we would like to be able to identify: –Genes –Exon boundaries & splice sites –Beginning and end of translation –Alternative splicings –Regulatory elements (e.g. promoters) (The slide groups these into primary goals and secondary goals.) The only certain way to do this is experimentally, but it is time consuming and expensive. Computational methods can achieve reasonable accuracy quickly, and help direct experimental approaches.

Prokaryotic Gene Structure Promoter CDS Terminator transcription Genomic DNA mRNA –Most bacterial promoters contain the Pribnow box, a signal at about -10 that has the consensus sequence 5'-TATAAT-3'. (The Shine-Dalgarno signal is a different element: the ribosome-binding site just upstream of the start codon.) –The terminator: a signal at the end of the coding sequence that terminates the transcription of RNA. –The coding sequence is composed of nucleotide triplets. Each triplet codes for an amino acid. The AAs are the building blocks of proteins.

Pieces of a (Eukaryotic) Gene (on the genome) [Figure: a gene on the genome drawn at the Mbp and kbp scales, showing exons (cds & utr), introns, the polyadenylation site, the promoter (~10^3 bp), enhancers, and other regulatory sequences; the remaining length annotations are not recoverable.]

What is it about genes that we can measure (and model)? Most of our knowledge is biased towards protein-coding characteristics –ORF (Open Reading Frame): a sequence defined by in-frame AUG and stop codon, which in turn defines a putative amino acid sequence. –Codon Usage: most frequently measured by CAI (Codon Adaptation Index) Other phenomena –Nucleotide frequencies and correlations: value and structure –Functional sites: splice sites, promoters, UTRs, polyadenylation sites
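The ORF definition above can be sketched in a few lines. This is a minimal illustration, not a gene finder: it scans only the forward strand of a DNA string (so the start codon is written ATG rather than AUG), and the function name `find_orfs` and its `min_codons` parameter are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(dna, min_codons=1):
    """Return sorted (start, end) pairs of forward-strand ORFs.

    An ORF here runs from an in-frame ATG to the first in-frame stop codon.
    `end` is exclusive and includes the stop codon.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None  # index of the currently open ATG in this frame, if any
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                # Count the codons before the stop, including the ATG.
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return sorted(orfs)

print(find_orfs("CCATGAAATGACC"))
```

Raising `min_codons` is the simplest way to apply the ORF-length filter discussed on the next slide.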

A simple measure: ORF length Comparison of Annotation and Spurious ORFs in S. cerevisiae Basrai MA, Hieter P, and Boeke J Genome Research :

Codon Adaptation Index (CAI) Parameters are empirically determined by examining a “large” set of example genes This is not perfect –Genes sometimes have unusual codons for a reason –The predictive power is dependent on length of sequence

CAI Example: Counts per 1000 codons
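As a rough sketch of how such counts turn into a score: CAI is usually computed as the geometric mean of each codon's relative adaptiveness, that is, its reference count divided by the count of the most frequent synonymous codon. The counts below are invented for illustration and cover only two amino acids; `relative_adaptiveness` and `cai` are hypothetical helper names.

```python
import math

# Hypothetical counts per 1000 codons from a reference gene set (illustrative only).
REFERENCE_COUNTS = {
    "GCT": 20.0, "GCC": 10.0, "GCA": 5.0, "GCG": 5.0,  # alanine
    "AAA": 30.0, "AAG": 10.0,                          # lysine
}
SYNONYMS = [("GCT", "GCC", "GCA", "GCG"), ("AAA", "AAG")]

def relative_adaptiveness(counts, families):
    """w[codon] = count / count of the most used synonymous codon."""
    w = {}
    for family in families:
        best = max(counts[c] for c in family)
        for c in family:
            w[c] = counts[c] / best
    return w

def cai(cds, w):
    """Geometric mean of w over the codons of `cds` that appear in w."""
    codons = [cds[i:i + 3] for i in range(0, len(cds) - 2, 3)]
    scored = [w[c] for c in codons if c in w]
    if not scored:
        raise ValueError("no scorable codons in sequence")
    return math.exp(sum(math.log(x) for x in scored) / len(scored))

w = relative_adaptiveness(REFERENCE_COUNTS, SYNONYMS)
print(cai("GCTAAA", w), cai("GCCAAG", w))
```

Because the score is an average over codons, short sequences give noisy values, which is the length dependence noted above.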

Splice signals (mice): GT, AG

General Things to Remember about (Protein-coding) Gene Prediction Software –It is, in general, organism-specific. –It works best on genes that are reasonably similar to something seen previously. –It finds protein-coding regions far better than non-coding regions. –In the absence of external (direct) information, alternative forms will not be identified. –It is imperfect! (It’s biology, after all…)

Simple HMM: Prokaryotes $x_m(i)$ = probability of being in state $m$ at position $i$; $H(m, y_i)$ = probability of emitting character $y_i$ in state $m$; $\Phi_{mk}$ = probability of transition from state $k$ to $m$.
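These three quantities combine in the standard forward recursion: the value for state m at position i is the emission probability of the i-th character in state m, times the sum over previous states k of the k-to-m transition probability times the value for k at position i-1. The sketch below assumes that reading (the forward variable is the joint probability of the prefix and the state). The two-state parameters are hypothetical, and `trans[k][m]` stores the transition from k to m.

```python
def forward(seq, states, init, trans, emit):
    """Forward algorithm: x[i][m] = Pr(seq[:i+1], state m at position i)."""
    x = [{m: init[m] * emit[m][seq[0]] for m in states}]
    for y in seq[1:]:
        prev = x[-1]
        x.append({
            m: emit[m][y] * sum(trans[k][m] * prev[k] for k in states)
            for m in states
        })
    return x

# Hypothetical two-state model: "I" (intergenic) and "G" (gene).
STATES = "IG"
INIT = {"I": 0.9, "G": 0.1}
TRANS = {"I": {"I": 0.9, "G": 0.1}, "G": {"I": 0.1, "G": 0.9}}
EMIT = {
    "I": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
    "G": {"A": 0.10, "C": 0.40, "G": 0.40, "T": 0.10},
}

table = forward("ACGCGT", STATES, INIT, TRANS, EMIT)
print(sum(table[-1].values()))  # total probability of the sequence under the model
```

For long sequences the products underflow, so a practical version would rescale each column or work in log space.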

Outline: Rest of Lecture Eukaryotic gene structure Modeling gene structure Using the model to make predictions Improving the model topology Modeling fixed-length signals

A eukaryotic gene This is the human p53 tumor suppressor gene on chromosome 17. Genscan is one of the most popular gene prediction algorithms.

A eukaryotic gene 3’ untranslated region Final exon Initial exon Introns Internal exons This particular gene lies on the reverse strand.

An Intron 3’ splice site 5’ splice site revcomp(CT)=AG revcomp(AC)=GT GT: signals start of intron AG: signals end of intron
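For a gene on the reverse strand, the GT and AG signals therefore show up on the forward strand as AC and CT. A small sketch of the `revcomp` operation used on the slide (the implementation is an illustration, not taken from the lecture):

```python
# Translation table mapping each base to its complement (both cases).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def revcomp(dna):
    """Reverse complement of a DNA string."""
    return dna.translate(_COMPLEMENT)[::-1]

print(revcomp("CT"), revcomp("AC"))
```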

Signals vs contents In gene finding, a small pattern within the genomic DNA is referred to as a signal, whereas a region of genomic DNA is a content. Examples of signals: splice sites, starts and ends of transcription or translation, branch points, transcription factor binding sites Examples of contents: exons, introns, UTRs, promoter regions

Prior knowledge We want to build a probabilistic model of a gene that incorporates our prior knowledge. E.g., the translated region must have a length that is a multiple of 3.

Prior knowledge The translated region must have a length that is a multiple of 3. Some codons are more common than others. Exons are usually shorter than introns. The translated region begins with a start signal and ends with a stop codon. 5’ splice sites (exon to intron) are usually GT; 3’ splice sites (intron to exon) are usually AG. The distribution of nucleotides and dinucleotides is usually different in introns and exons.

A simple gene model Transcription stop Transcription start StartEnd Gene Intergenic

A probabilistic gene model Transcription stop Transcription start Start End Gene Intergenic Every box stores transition probabilities for outgoing arrows. Every arrow stores emission probabilities for emitted nucleotides. Pr(TACAGTAGATATGA) = … Pr(AACAGT) = … Pr(AACAGTAC) = …

Parse For a given sequence, a parse is an assignment of gene structure to that sequence. In a parse, every base is labeled with the content it is predicted to belong to. In our simple model, the parse contains only “I” (intergenic) and “G” (gene). A more complete model would contain, e.g., “-” for intergenic, “E” for exon and “I” for intron.
S = ACTGACTACTACGACTACGATCTACTACGGGCGCGACCTATGCGTATGTTTTGAACTGACTATGCGATCTACGACTCGACTAGCTAC
P = IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIGGGGGGGGGGGGGGGIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII

The probability of a parse [Figure: the simple gene model, with Start, Intergenic, Gene and End connected through the transcription start and transcription stop transitions.]
Pr(ACTGACTACTACGACTACGATCTACTACGGGCGCGACCT) = …
Pr(ATGCGTATGTTTTGA) = …
Pr(ACTGACTATGCGATCTACGACTCGACTAGCTAC) = …
Pr(parse P | sequence S, model M) = 0.67 × … × 1.00 × … × 0.75 × … = …
S = ACTGACTACTACGACTACGATCTACTACGGGCGCGACCTATGCGTATGTTTTGAACTGACTATGCGATCTACGACTCGACTAGCTAC
P = IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIGGGGGGGGGGGGGGGIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
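The product on this slide can be mimicked with a per-base HMM: multiply one transition and one emission probability for every labeled base. The sketch below computes the joint log-probability of a sequence and its parse; it simplifies the slide's model, which emits whole segments along arrows, and the function name `parse_log_prob` and all parameters are hypothetical.

```python
import math

def parse_log_prob(seq, parse, init, trans, emit):
    """Log of Pr(seq, parse | model) for a per-base HMM.

    `parse` is a string of state labels, one per base of `seq`.
    """
    if len(seq) != len(parse):
        raise ValueError("sequence and parse must have the same length")
    logp = math.log(init[parse[0]]) + math.log(emit[parse[0]][seq[0]])
    for i in range(1, len(seq)):
        logp += math.log(trans[parse[i - 1]][parse[i]])  # move between labels
        logp += math.log(emit[parse[i]][seq[i]])         # emit base i
    return logp

# Hypothetical two-state parameters.
INIT = {"I": 0.5, "G": 0.5}
TRANS = {"I": {"I": 0.9, "G": 0.1}, "G": {"I": 0.1, "G": 0.9}}
EMIT = {
    "I": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
    "G": {"A": 0.10, "C": 0.40, "G": 0.40, "T": 0.10},
}

print(parse_log_prob("ACGC", "IGGG", INIT, TRANS, EMIT))
```

Dividing the joint probability by the total probability of the sequence gives Pr(parse | sequence, model); since that denominator is the same for every parse, ranking parses by the joint value is enough.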

Finding the best parse For a given sequence S, the model M assigns a probability Pr(P|S,M) to every parse P. We want to find the parse P* that receives the highest probability.
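One standard way to find P* without enumerating every parse is the Viterbi algorithm, listed under parsing algorithms on the next slide. The following is a minimal log-space sketch for a per-base HMM with single-character state labels; the two-state parameters are hypothetical.

```python
import math

def viterbi(seq, states, init, trans, emit):
    """Return the most probable state path (parse) for seq."""
    score = {m: math.log(init[m]) + math.log(emit[m][seq[0]]) for m in states}
    back = []  # back[i][m] = best predecessor of state m at position i + 1
    for y in seq[1:]:
        new_score, pointers = {}, {}
        for m in states:
            best = max(states, key=lambda k: score[k] + math.log(trans[k][m]))
            pointers[m] = best
            new_score[m] = (score[best] + math.log(trans[best][m])
                            + math.log(emit[m][y]))
        back.append(pointers)
        score = new_score
    # Trace back from the best final state.
    path = [max(states, key=lambda m: score[m])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return "".join(reversed(path))

# Hypothetical parameters: "G" (gene) is GC-rich, "I" (intergenic) is uniform.
STATES = "IG"
INIT = {"I": 0.9, "G": 0.1}
TRANS = {"I": {"I": 0.9, "G": 0.1}, "G": {"I": 0.1, "G": 0.9}}
EMIT = {
    "I": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
    "G": {"A": 0.10, "C": 0.40, "G": 0.40, "T": 0.10},
}

print(viterbi("AAAACGCGCGCGCGAAAA", STATES, INIT, TRANS, EMIT))
```

The running time is proportional to the sequence length times the square of the number of states, which is what makes whole-genome parsing feasible.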

Beyond Simplest Model Improving the gene model topology Fixed-length signals –PSSMs –Dependencies between positions Variable-length contents –Using HMMs –Semi-Markov models Parsing algorithms –Viterbi –Posterior decoding Including other types of data –Expressed sequence tags –Orthology

Improved model topology Draw a model that includes introns Transcription stop Transcription start StartEnd Gene Intergenic 2 Intergenic 1 Intergenic 4 Intergenic 3

Improved model topology Transcription stop Transcription start Start End 5’ splice site 3’ splice site

Improved model topology Transcription stop Transcription start Start End 5’ splice site 3’ splice site 4 intergenics 1 intron 4 exons

Improved model topology Transcription stop Transcription start Start End 5’ splice site 3’ splice site Single exonInitial exon Intron Internal exon Final exon

Modeling the 5’ splice site Most introns begin with the letters “GT.” We can add this signal to the model. 5’ splice site 3’ splice site Intron GT

Modeling the 5’ splice site Most introns begin with the letters “GT.” We can add this signal to the model. Indeed, we can model each nucleotide with its own arrow. 5’ splice site 3’ splice site Intron GT G arrow: Pr(A)=0 Pr(C)=0 Pr(G)=1 Pr(T)=0 T arrow: Pr(A)=0 Pr(C)=0 Pr(G)=0 Pr(T)=1

Modeling the 5’ splice site Like most biological phenomena, the splice site signal admits exceptions. The resulting model of the 5’ splice site is a length-2 PSSM. 5’ splice site 3’ splice site Intron GT G arrow: Pr(A)=0.01 Pr(C)=0.01 Pr(G)=0.97 Pr(T)=0.01 T arrow: Pr(A)=0.01 Pr(C)=0.01 Pr(G)=0.01 Pr(T)=0.97
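A length-2 PSSM of this kind is just a list of per-position probability tables, and the probability of a candidate site is the product of one entry per position. The sketch below uses the probabilities from the slide, assuming the first column is the G position and the second the T position; the names `SPLICE_5_PSSM` and `pssm_prob` are hypothetical.

```python
# One dict per position; values taken from the slide (assumed order: G, then T).
SPLICE_5_PSSM = [
    {"A": 0.01, "C": 0.01, "G": 0.97, "T": 0.01},
    {"A": 0.01, "C": 0.01, "G": 0.01, "T": 0.97},
]

def pssm_prob(site, pssm):
    """Probability of `site` under a PSSM, assuming independent positions."""
    if len(site) != len(pssm):
        raise ValueError("site length must match PSSM length")
    p = 1.0
    for letter, column in zip(site, pssm):
        p *= column[letter]
    return p

print(pssm_prob("GT", SPLICE_5_PSSM), pssm_prob("GC", SPLICE_5_PSSM))
```

A non-canonical site such as GC is no longer impossible under this model, only about a hundred times less probable than GT.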

Real splice sites Real splice sites show some conservation at positions beyond the first two. We can add additional arrows to model these states. weblogo.berkeley.edu

Modeling the 5’ splice site 5’ splice site 3’ splice site Intron

Adding signals Transcription stop Transcription start Start End 5’ splice site 3’ splice site Single exonInitial exon Intron Internal exon Final exon Red ellipses correspond to signal models like this:

Positional Independence Pr(“ACTT”|M) = Pr(“A” at position 1 and “C” at position 2 and “T” at position 3 and “T” at position 4|M) = Pr(“A” at position 1|M) × Pr(“C” at position 2|M) × Pr(“T” at position 3|M) × Pr(“T” at position 4|M). In general, probabilities of independent events get multiplied. A PSSM assumes independence among nucleotides at different positions.

Positional dependence In this data, every time a “G” appears in position 1, an “A” appears in position 3. Conversely, an “A” in position 1 always occurs with a “T” in position 3. ACTGACTTGCACACTTACTAGCATACTAACTTACTGACTTGCACACTTACTAGCATACTAACTT

nth-order PSSM Normally, PSSM entry (i,j) gives the score for observing the i-th letter in position j. In an nth-order PSSM, each score is conditioned on the preceding letters in the sequence. The entries A|A, C|A, G|A and T|A should sum to 1. [Table: 2nd-order PSSM with rows A|A, A|C, A|G, A|T, C|A, …, T|T.]

nth-order PSSM Normally, PSSM entry (i,j) gives the score for observing the i-th letter in position j. In an nth-order PSSM, each score is conditioned on the preceding letters in the sequence. How many rows are in a 3rd-order PSSM for nucleotides? nth-order? [Table: 2nd-order PSSM with rows A|A, A|C, A|G, A|T, C|A, …, T|T and columns 1 to 4. Example entry: the probability of observing an “A” in position 3, given that we already observed a “C” in position 2.]

Conditional probability What is the probability of observing an “A” at position 2, given that we observed a “C” at the previous position? GCG CAG CCG GCG CCG CCG GCG CCT CCG GGG CGG GCG AGG CAG CCT CAT CCT GCG

Conditional probability What is the probability of observing an “A” at position 2, given that we observed a “C” at the previous position? Answer: total number of CA’s divided by total number of C’s in position 1. 3/11 = 27% Probability of observing CA = 3/18 = 17%. GCG CAG CCG GCG CCG CCG GCG CCT CCG GGG CGG GCG AGG CAG CCT CAT CCT GCG

Conditional probability The conditional probability Pr(x|y) = (number of occurrences of y:x) / (number of occurrences of y:*), where * is any letter. GCG CAG CCG GCG CCG CCG GCG CCT CCG GGG CGG GCG AGG CAG CCT CAT CCT GCG

Conditional probability What is the probability of observing a “G” at position 3, given that we observed a “C” at the previous position? GCG CAG CCG GCG CCG CCG GCG CCT CCG GGG CGG GCG AGG CAG CCT CAT CCT GCG

Conditional probability What is the probability of observing a “G” at position 3, given that we observed a “C” at the previous position? Answer: 9/12 = 75%. GCG CAG CCG GCG CCG CCG GCG CCT CCG GGG CGG GCG AGG CAG CCT CAT CCT GCG
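Both answers can be checked by counting over the 18 triplets on the slides. The sketch below is an illustration only; `conditional_prob` is a hypothetical helper that uses 1-based positions to match the slide wording.

```python
from fractions import Fraction

TRIPLETS = ("GCG CAG CCG GCG CCG CCG GCG CCT CCG "
            "GGG CGG GCG AGG CAG CCT CAT CCT GCG").split()

def conditional_prob(seqs, pos, letter, given):
    """Pr(letter at 1-based position pos | given at position pos - 1)."""
    with_given = [s for s in seqs if s[pos - 2] == given]
    hits = sum(1 for s in with_given if s[pos - 1] == letter)
    return Fraction(hits, len(with_given))

print(conditional_prob(TRIPLETS, 2, "A", "C"))  # "A" at 2 given "C" at 1
print(conditional_prob(TRIPLETS, 3, "G", "C"))  # "G" at 3 given "C" at 2
```

Each column of a higher-order PSSM is filled with estimates of exactly this kind, one per (letter, preceding letter) pair.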

Modeling signals Transcription stop Transcription start Start End 5’ splice site 3’ splice site Single exonInitial exon Intron Internal exon Final exon Red ellipses may correspond to n th -order PSSMs.

Modeling variable-length regions Exon length

Modeling variable-length regions 1. The easy way, using standard HMMs. 2. And why that’s not so great. How are variable-length insertions modeled in protein HMMs?

The HMM solution 5’ splice site 3’ splice site Intron Fixed-length signals Variable-length content 5’ splice site 3’ splice site Intron

Codons start translation end translation Single exon start translation end translation Single exon

The complete model Transcription stop Transcription start Start End 5’ splice site 3’ splice site Single exonInitial exon Intron Internal exon Final exon Red ellipses correspond to n th -order PSSMs. Every arrow contains an invisible box with a self-loop.

A small problem Say that each blue arrow emits one letter. What is the probability that the intron will be exactly 2 letters long? 3 letters long? 4 letters long? 5’ splice site 3’ splice site Intron

A small problem Say that each blue arrow emits one letter. What is the probability that the intron will be exactly 2 letters long? 10% 3 letters long? 9% 4 letters long? 8.1% 5’ splice site 3’ splice site Intron

A small problem HMMs tend to produce geometric length distributions: with a self-loop, each additional letter multiplies the probability by the same factor. Real contents are not necessarily geometric.
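The 10%, 9% and 8.1% answers on the previous slide are consistent with a self-loop that is left with probability 0.1 at each step. A minimal sketch under that assumption (the exit probability is inferred from the answers, and `length_prob` is a hypothetical name):

```python
def length_prob(length, p_exit=0.1, min_length=2):
    """Probability that a self-looping region is exactly `length` letters long.

    Each extra letter beyond `min_length` costs one more pass through the
    self-loop (probability 1 - p_exit) before the exit transition is taken.
    """
    if length < min_length:
        return 0.0
    return (1.0 - p_exit) ** (length - min_length) * p_exit

for n in (2, 3, 4):
    print(n, length_prob(n))
```

The shortest length is always the most probable one, which is a poor fit for content whose lengths peak well above the minimum.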

Building an HMM Input: annotated gene sequences Output: HMM parameters –Emission distributions within each content –Length distributions of contents –Transition distributions between contents

A more realistic (and complex) HMM model for Gene Prediction (Genie) Kulp, D., PhD Thesis, UCSC 2003

Assessing performance: Sensitivity and Specificity Testing of predictions is performed on sequences where the gene structure is known. Sensitivity is the fraction of known genes (or bases or exons) correctly predicted –“Am I finding the things that I’m supposed to find?” Specificity is the fraction of predicted genes (or bases or exons) that correspond to true genes –“What fraction of my predictions are true?” In general, increasing one decreases the other.
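At the base level, both measures can be read off by comparing a known parse with a predicted one. The sketch below follows the definitions on this slide (so specificity here is the fraction of predicted gene bases that are truly gene); the function name and the example parses are hypothetical.

```python
def sensitivity_specificity(true_parse, pred_parse, positive="G"):
    """Base-level sensitivity and specificity as defined on the slide.

    Raises ZeroDivisionError if there are no true or no predicted positives.
    """
    pairs = list(zip(true_parse, pred_parse))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    sensitivity = tp / (tp + fn)  # fraction of true gene bases found
    specificity = tp / (tp + fp)  # fraction of predicted gene bases that are true
    return sensitivity, specificity

print(sensitivity_specificity("IIGGGGII", "IGGGIIII"))
```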

Graphic View of Specificity and Sensitivity

Quantifying the tradeoff: Correlation Coefficient
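The slide's formula did not survive in this transcript. A commonly used definition in gene-prediction benchmarks is the correlation between true and predicted labels computed from the four confusion counts, sketched here as an assumption rather than as the lecture's own formula.

```python
import math

def correlation_coefficient(tp, tn, fp, fn):
    """Correlation coefficient from confusion counts, ranging from -1 to 1."""
    denom = math.sqrt((tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))
    if denom == 0:
        raise ValueError("undefined when a row or column of counts is empty")
    return (tp * tn - fp * fn) / denom

print(correlation_coefficient(tp=2, tn=3, fp=1, fn=2))
```

A single number like this is convenient for ranking programs, because it rewards neither predicting everything as gene nor predicting nothing.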

Specificity/Sensitivity Tradeoffs Ideal Distribution of Scores More Realistically…

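Bayesian Statistics Bayes’ Rule, with M the model and D the data or evidence: $P(M \mid D) = \frac{P(D \mid M)\,P(M)}{P(D)}$, where $P(M \mid D)$ is the posterior, $P(D \mid M)$ the likelihood, $P(M)$ the prior, and $P(D)$ the marginal.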

Basic Bayesian Statistics Bayes’ Rule is at the heart of much predictive software In the simplest example, we can simply compare two models, and reduce it to a log-odds ratio
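When two models are compared on the same data, the marginal cancels, and with equal priors the comparison reduces to the log of the likelihood ratio. A minimal sketch with two hypothetical zero-order nucleotide models (the frequencies are invented for illustration):

```python
import math

def log_odds(seq, model_a, model_b):
    """Log2 likelihood ratio of seq under model_a versus model_b.

    Assumes independent positions; positive values favor model_a.
    """
    return sum(math.log2(model_a[c] / model_b[c]) for c in seq)

# Hypothetical base frequencies for a GC-rich coding model and a uniform background.
CODING = {"A": 0.10, "C": 0.40, "G": 0.40, "T": 0.10}
BACKGROUND = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

print(log_odds("GCGC", CODING, BACKGROUND), log_odds("ATAT", CODING, BACKGROUND))
```

Unequal priors would simply add the constant log2(P(A)/P(B)) to every score.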

Genetic + Genetic - short + short - intergenic Initiation + Initiation - Termination - Termination + overlap 0 overlap 1 overlap 2 overlap 3 Prokaryotes HMMs: Taking Overlaps on Two Strands into Account

Genetic + Genetic - short + short - Initiation + Initiation - Termination - Termination + overlap 0 overlap 1 overlap 2 overlap 3 Coding region (genes) intergenic

AAAAACAAG …. TTT Transition from any codon to any other. Model of all possible 64 codons Coding region (genes)

Intergenic regions and overlap regions: Model Design (3) –Two consecutive genes either overlap each other or are separated by an intergenic region. –The overlapping segment or the intergenic region is bordered in one of 4 possible ways: Tail–Head, Head–Tail, Tail–Tail, Head–Head. [Figure: 5'/3' strand diagrams of an intermediate intergenic region and of an overlapping region for each Tail/Head combination.]

Example 1 Genetic + Genetic - short + short - Initiation + Initiation - Termination - Termination + overlap 0 overlap 1 overlap 2 overlap 3 Transition between two genes on the same strand. 5' 3' 5' intergenic

Example 2 Genetic + Genetic - short + short - Initiation + Initiation - Termination - Termination + overlap 0 overlap 1 overlap 2 overlap 3 Two genes on the opposite strands. 5' 3' 5' intergenic

Transitions between genes Genetic + Genetic - short + short - intergenic Initiation + Initiation - Termination - Termination + overlap 0 overlap 1 overlap 2 overlap 3 5' 3' 5'

Intergenic Regions Intergenic regions are modeled by profile HMMs. We model two different types of intergenic regions: 1. Short intergenic sequences: 9 bases long. –Model situations where two same-strand genes are close together. –This situation is common in polycistronic operons. 2. Long intergenic sequences are the more common case. They are modeled by the following 2 profile HMMs: –Transcription termination signal: 18 bases long. –Promoter region including the Shine-Dalgarno signal: 25 bases long.

Overlap Regions (1) Overlap regions of 1 or 4 bases: weight matrix models (WMM) are used to represent overlapping regions of 1 or 4 bases, consisting of the stop codon of the previous gene and the start codon of the next one. –1 base overlap of stop codon TAA or TGA with init codon ATG. –4 bases overlap: first gene terminated by TGA, second gene starts with [AG]TG (NNATGANN). [Figure: WMM tables with one row per base (A, C, G, T) and one column per position for each overlap type; the entries are not recoverable.]

Overlap Regions (2) Overlap regions of 6 or more bases: for each one of the 4 possible paths described (head–head, head–tail, tail–tail, tail–head), all possible frame differences are allowed. For example, a tail–head transition allows a shift of 1 or 2 bases in the reading frame. [Figure: frames 1, 2 and 3 with init and stop codons on the 5'/3' strands.]