Bacterial Gene Finding and Glimmer (also Archaeal and viral gene finding)
Arthur L. Delcher and Steven Salzberg
Center for Bioinformatics and Computational Biology
University of Maryland

Outline
A (very) brief overview of microbial gene-finding
  – Codon composition methods
  – GeneMark: Markov models
Glimmer1 & 2
  – Interpolated Markov Model (IMM)
  – Interpolated Context Model (ICM)
Glimmer3
  – Reducing false positives
  – Improving coding initiation site predictions
  – Running Glimmer3

Step One
Find open reading frames (ORFs).
…TAGAAAAATGGCTCTTTAGATAAATTTCATGAAAAATATTGA…
[Figure label: Stop codon]

Step One
Find open reading frames (ORFs). But ORFs generally overlap …
Forward strand: …TAGAAAAATGGCTCTTTAGATAAATTTCATGAAAAATATTGA…
Reverse strand: …ATCTTTTTACCGAGAAATCTATTTAAAGTACTTTTTATAACT…
[Figure labels: Stop codon; Shifted Stop]
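The ORF scan on this slide can be sketched in a few lines of Python. This is only an illustration, not Glimmer's code: it assumes the standard stop codons (TAA, TAG, TGA), treats an ORF as a stop-free run of codons closed by a stop, and does not look for start codons. The function names are hypothetical.

```python
# Sketch: scan all six reading frames for stop-terminated open reading frames.
STOPS = {"TAA", "TAG", "TGA"}              # assumption: standard stop codons
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]

def orfs_in_frame(seq, frame, min_codons=1):
    """Yield (start, end) half-open spans of stop-free codon runs in one frame."""
    run_start = frame
    for i in range(frame, len(seq) - 2, 3):
        if seq[i:i + 3] in STOPS:
            if (i - run_start) // 3 >= min_codons:
                yield run_start, i + 3     # span includes the closing stop codon
            run_start = i + 3

def find_orfs(seq, min_codons=1):
    """Return (strand, frame, start, end); coordinates are on the scanned strand."""
    seq = seq.upper()
    found = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            for start, end in orfs_in_frame(s, frame, min_codons):
                found.append((strand, frame, start, end))
    return found

slide_seq = "TAGAAAAATGGCTCTTTAGATAAATTTCATGAAAAATATTGA"
for orf in find_orfs(slide_seq):
    print(orf)
```

On the slide's sequence this reports overlapping ORFs in different frames, which is the point the slide makes. Minus-strand coordinates refer to the reverse-complemented string and would need converting for real use.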

Campylobacter jejuni RM %GC
All ORFs on both strands shown; color indicates reading frame.
Longest ORFs likely to be protein-coding genes.
Note the low GC content.

Campylobacter jejuni RM %GC
Purple lines are the predicted genes.
Purple ORFs show annotated (“true”) genes.

Campylobacter jejuni RM %GC
Mycobacterium smegmatis MC2 67.4%GC
Note what happens in a high-GC genome.

Campylobacter jejuni RM %GC
Mycobacterium smegmatis MC2 67.4%GC
Purple lines show annotated genes.

The Problem
Need to decide which orfs are genes.
  – Then figure out the coding start sites
Can do homology searches, but that won’t find novel genes.
  – Besides, there are errors in the databases
Generally can assume that there are some known genes to use as a training set.
  – Or just find the obvious ones

Codon Composition
Find patterns of nucleotides in known coding regions (assumed to be available).
  – Nucleotide distribution at the 3 codon positions
  – Hexamers
  – GC-skew, (G-C)/(G+C), computed in windows of size N
  – Amino-acid composition
Use these to decide which orfs are genes.
  – Prefer longer orfs
  – Must deal with overlaps
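The GC-skew statistic from the list above is simple to compute. The sketch below assumes non-overlapping windows and returns 0.0 for a window with no G or C. Both choices are assumptions, since the slide only gives the formula and the window size N.

```python
def gc_skew(seq, window):
    """Return (G - C) / (G + C) for consecutive non-overlapping windows."""
    seq = seq.upper()
    skews = []
    for start in range(0, len(seq) - window + 1, window):
        chunk = seq[start:start + window]
        g, c = chunk.count("G"), chunk.count("C")
        # Assumption: a window without any G or C gets skew 0.0
        skews.append((g - c) / (g + c) if g + c else 0.0)
    return skews

print(gc_skew("GGGCATCCCG", 5))
```

Plotting these values along a genome gives the kind of GC-skew plot shown later for Borrelia burgdorferi.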

Bacterial Replication Early replication Theta structure

Termination of Replication E. coli B. subtilis

Borrelia burgdorferi (Lyme disease pathogen) GC-skew plot

Codon Composition
Nucleotide variation at codon position:
Campylobacter jejuni
       Codon Position
       1      2      3
  a    36%
  c    13%    17%    9%
  g    30%    14%    10%
  t    21%    33%    44%
Mycobacterium smegmatis
       Codon Position
       1      2      3
  a    19%    23%    6%
  c    27%    28%    48%
  g    42%    20%    39%
  t    12%    28%    7%

Codon-Composition Gene Finders
ZCURVE
  – Guo, Ou & Zhang, NAR 31, 2003
  – Based on nucleotide and di-nucleotide frequency in codons
  – Uses Z-transform and Fisher linear discriminant
MED
  – Ouyang, Zhu, Wang & She, JBCB 2(2) 2004
  – Based on amino-acid frequencies
  – Uses nearest-neighbor classification on entropies

Probabilistic Methods
Create models that have a probability of generating any given sequence.
Train the models using examples of the types of sequences to generate.
The “score” of an orf is the probability of the model generating it.
  – Can also use a negative model (i.e., a model of non-orfs) and make the score be the ratio of the probabilities (i.e., the odds) of the two models.
  – Use logs to avoid underflow
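A small sketch of why logs are needed: multiplying thousands of per-base probabilities underflows double precision, while the sum of log ratios stays finite. The two models here are hypothetical zeroth-order base frequencies invented for illustration, not trained on any genome.

```python
import math

def log_odds(seq, positive, negative):
    """Sum of per-base log-probability ratios under two zeroth-order models."""
    return sum(math.log(positive[b]) - math.log(negative[b]) for b in seq)

# Hypothetical base frequencies, for illustration only.
coding = {"a": 0.20, "c": 0.30, "g": 0.30, "t": 0.20}
noncoding = {"a": 0.25, "c": 0.25, "g": 0.25, "t": 0.25}

orf = "gc" * 1000
direct = math.prod(coding[b] for b in orf)   # product of 2000 small numbers
score = log_odds(orf, coding, noncoding)     # same comparison in log space

print(direct)   # underflows to 0.0
print(score)    # finite, positive: the sequence fits the "coding" model better
```

A positive log-odds score favors the positive model, a negative one favors the negative model.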

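Fixed-Order Markov Models
A kth-order Markov model bases the probability of an event on the preceding k events.
Example: With a 3rd-order model, the probability of a sequence is the product, over its positions, of the probability of each target base given the 3 bases preceding it.
[Figure: example sequence with the target base marked]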

Fixed-Order Markov Models
Advantages:
  – Easy to train. Count frequencies of (k+1)-mers in training data.
  – Easy to compute probability of sequence.
Disadvantages:
  – Many (k+1)-mers may be undersampled in training data.
  – Models data as fixed-length chunks.
[Figure: target base with fixed-length context]
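Training by counting (k+1)-mers can be sketched as follows. The add-one pseudocount is an assumption added here so that unseen (k+1)-mers do not get probability zero, and the code assumes sequences contain only the letters a, c, g, t. The first k bases of a sequence are not scored because they lack a full context.

```python
import math
from collections import defaultdict

def train_markov(seqs, k, alphabet="acgt", pseudocount=1.0):
    """Count (k+1)-mers and return a function giving P(base | preceding k bases)."""
    counts = defaultdict(lambda: dict.fromkeys(alphabet, 0.0))
    for s in seqs:
        for i in range(k, len(s)):
            counts[s[i - k:i]][s[i]] += 1

    def prob(context, base):
        # An unseen context falls back to all-zero counts (uniform after smoothing).
        row = counts.get(context) or dict.fromkeys(alphabet, 0.0)
        total = sum(row.values()) + pseudocount * len(alphabet)
        return (row[base] + pseudocount) / total

    return prob

def log_prob(seq, prob, k):
    """Log-probability of seq, scoring every position that has a full k-base context."""
    return sum(math.log(prob(seq[i - k:i], seq[i])) for i in range(k, len(seq)))

model = train_markov(["acgacgacg"], k=2)
print(model("ac", "g"), model("tt", "a"))
print(log_prob("acgacg", model, 2))
```

The undersampling problem is visible even in this toy: the context `tt` never occurs in training, so its distribution is pure pseudocount.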

GeneMark
Borodovsky & McIninch, Comp. Chem 17
Uses 5th-order Markov model.
Model is 3-periodic, i.e., a separate model for each nucleotide position in the codon.
DNA region gets 7 scores: 6 reading frames & non-coding; high score wins.
Lukashin & Borodovsky, Nucl. Acids Res. 26, 1998 is the HMM version.

Interpolated Markov Models (IMM)
Introduced in Glimmer 1.0
Salzberg, Delcher, Kasif & White, NAR 26
Probability of the target position depends on a variable number of previous positions (sometimes 2 bases, sometimes 3, 4, etc.)
How many is determined by the specific context.
E.g., for context ggtta the next position might depend on the previous 3 bases, tta. But for context catta all 5 bases might be used.

Real IMMs
Model has additional probabilities, λ, that determine which parts of the context to use.
E.g., the probability of g occurring after context atca is computed by combining estimates from contexts of several lengths, weighted by the λ values.

Real IMMs
Result is a linear combination of different Markov orders, with weights b_j determined by the λ’s.
Can view this as interpolating the results of different-order models.
The probability of a sequence is still the product of the probabilities of the bases in the sequence.
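As a sketch of what such an interpolation looks like for the context atca, assuming non-negative weights b_j that sum to one (the notation below is an assumption, chosen to match the λ and b_j symbols used on these slides):

```latex
P_{\mathrm{IMM}}(g \mid atca)
  = b_0\,P(g) + b_1\,P(g \mid a) + b_2\,P(g \mid ca)
  + b_3\,P(g \mid tca) + b_4\,P(g \mid atca),
\qquad \sum_{j=0}^{4} b_j = 1 .
```

One way to obtain such weights is a recursion in which each order is blended with the next shorter one, with λ depending on the context:

```latex
\mathrm{IMM}_k(s_x) = \lambda_k(s_{x-1})\,P_k(s_x)
  + \bigl(1 - \lambda_k(s_{x-1})\bigr)\,\mathrm{IMM}_{k-1}(s_x)
```

Unrolling the recursion yields the b_j as products of λ and (1 − λ) terms, which is why determining the λ’s and determining the b_j’s are equivalent.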

Real IMMs
Problem: How to determine the λ’s (or equivalently the b_j’s)?
Traditionally done with EM algorithm using cross-validation (deleted estimation).
  – Slow
  – Hard to understand results
  – Overtraining can be a problem
We will cover EM later as part of HMMs

Glimmer IMM
Glimmer assumes:
  – Longer context is always better
  – Only reason not to use it is undersampling in training data.
If a context occurs frequently enough in training data, use it, i.e., λ = 1.
Otherwise, use frequency and χ² significance to set λ.
Interpolation is always between only 2 adjacent model lengths.

More Precisely
Suppose a context of length k+1 occurs a times, a < 400, with target distribution D1; and the context of length k occurs ≥ 400 times, with target distribution D2.
Let p be the χ² probability that D1 differs from D2.
Then the interpolated distribution will be a weighted mixture of D1 and D2, with the weight set from a and p.
This formula computes the probability of a base B at any position in a sequence, given the context preceding that position.
This formula is repeatedly invoked for all positions in an ORF.
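A sketch of this rule in Python follows. The slide gives the 400-occurrence threshold; the exact weight used here (λ = p·a/400, and λ = 0 when p is below 0.5) is an assumption based on the published description of Glimmer 1.0 and should be checked against that paper. The χ² probability p is passed in rather than computed.

```python
def glimmer_lambda(a, p, threshold=400, min_conf=0.5):
    """Weight on the longer context. ASSUMED rule, see the text above."""
    if a >= threshold:
        return 1.0              # context seen often enough: use it fully
    if p < min_conf:
        return 0.0              # D1 not clearly different from D2: ignore it
    return p * a / threshold    # otherwise scale by confidence and count

def interpolate(d1, d2, a, p):
    """Mix D1 (longer context, seen a times) with D2 (shorter context)."""
    lam = glimmer_lambda(a, p)
    return {base: lam * d1[base] + (1.0 - lam) * d2[base] for base in d1}

d1 = {"a": 0.7, "c": 0.1, "g": 0.1, "t": 0.1}   # hypothetical distributions
d2 = {"a": 0.4, "c": 0.2, "g": 0.2, "t": 0.2}
print(interpolate(d1, d2, a=200, p=0.8))
```

With a = 200 and p = 0.8 the weight is 0.4, so the result sits between the two distributions and still sums to one.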

IMMs vs Fixed-Order Models
Performance
  – IMM generally should do at least as well as a fixed-order model.
  – Some risk of overtraining.
IMM result can be stored and used like a fixed-order model.
IMM will be somewhat slower to train and will use more memory.
[Figure: target base with variable-length context]

Interpolated Context Model (ICM)
Introduced in Glimmer 2.0
Delcher, Harmon, et al., Nucl. Acids Res. 27
Doesn’t require adjacent bases in the window preceding the target position.
Choose set of positions that are most informative about the target position.
[Figure: target base with variable-position context]

ICM
For all windows, compare the distribution at each context position with the target position.
Choose the position with max mutual information.
[Figure label: Compare]

ICM
Continue for windows with fixed base at chosen positions.
Recurse until too few training windows.
  – Result is a tree; depth is # of context positions used
Then same interpolation as IMM, between node and parent in tree.
[Figure label: Compare]
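The selection step at one tree node can be sketched as below: compute the mutual information between each context position and the target across all training windows, then pick the maximum. This covers only the first choice, not the recursion or the interpolation, and the function names are hypothetical.

```python
import math
from collections import Counter

def mutual_information(pairs):
    """I(X;Y) in bits for a list of (x, y) observations."""
    n = len(pairs)
    joint = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    return sum((c / n) * math.log2(c * n / (px[x] * py[y]))
               for (x, y), c in joint.items())

def best_context_position(windows):
    """Windows are equal-length strings whose last base is the target."""
    target = len(windows[0]) - 1
    scores = [mutual_information([(w[j], w[target]) for w in windows])
              for j in range(target)]
    best = max(range(target), key=scores.__getitem__)
    return best, scores[best]

# Toy windows: position 0 always equals the target, position 1 is unrelated.
windows = ["aca", "ata", "gcg", "gtg"]
print(best_context_position(windows))
```

To continue building the tree, the windows would be split by the base found at the chosen position and the same function applied to each subset, stopping when a subset has too few windows.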

Sample ICM Model ver = 2.00 len = 12 depth = 7 periodicity = 3 nodes =
|---|---|*-?
|---|---|a*?
|---|---|c*?
|---|---|g*?
|---|---|t*?
|---|--*|aa?
|---|--*|ac?
|---|--*|ag?
|---|*--|at?
|---|--*|ca?
|---|--*|cc?
|---|--*|cg?
|---|--*|ct?
|---|--*|ga?
|---|--*|gc?
|---|--*|gg?
|---|--*|gt?
|---|--*|ta?
|---|--*|tc?
|---|--*|tg?
|---|-*-|tt?
|---|-*a|aa?
|---|-*c|aa?
|---|-*g|aa?
|--*|--t|aa?
|---|-*a|ac?

ICM
ICMs are not just for modeling genes.
Can use to model any set of training sequences
  – Doesn’t even need to be DNA
Have used recently to find contaminant sequences in shotgun sequencing projects.
  – Used a non-periodic (i.e., homogeneous) model

Fixed-Length Sequences and ICMs
Array of ICMs, a different one for each position in the fixed-length sequence.
Can model fixed regions around transcription start sites or splice sites.
Positions in region can be permuted.

Overlapping Orfs
Glimmer1 & 2 used rules.
For overlapping orfs A and B, the overlap region AB is scored separately:
  – If AB scores higher in A’s reading frame, and A is longer than B, then reject B.
  – If AB scores higher in B’s reading frame, and B is longer than A, then reject A.
  – Otherwise, output both A and B with a “suspicious” tag.
Also try to move start site to eliminate overlaps.
Leads to high false-positive rate, especially in high-GC genomes.
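The three rules can be written as one small decision function. This sketch takes the lengths and the two overlap-region scores as given and does not attempt the start-site adjustment; the function and argument names are hypothetical.

```python
def resolve_overlap(len_a, len_b, score_in_a_frame, score_in_b_frame):
    """Apply the Glimmer1 & 2 style overlap rules to orfs A and B."""
    if score_in_a_frame > score_in_b_frame and len_a > len_b:
        return "reject B"
    if score_in_b_frame > score_in_a_frame and len_b > len_a:
        return "reject A"
    # Score and length disagree (or tie): keep both and flag them.
    return "keep both (suspicious)"

print(resolve_overlap(1560, 891, 0.99, 0.01))
print(resolve_overlap(891, 1560, 0.99, 0.01))
```

The second call shows the weak spot: when the shorter orf wins the overlap score, both predictions are kept, which is one way false positives accumulate.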

Glimmer 2.0 Overlap Comments
OrfID Start End Frame Len Score Comments
[+1 L=1191 r=-1.151] [ShadowedBy #3]
[-1 L=1560 r=-1.317] [Contains #2] [LowScoreBy #4 L=257 S=0]
[+2 L= 891 r=-1.156] [OlapWith #3 L=257 S=99]
[+2 L=1152 r=-1.184]
[+1 L= 582 r=-1.223] [DelayedBy #5 L=216]
[+1 L=2115 r=-1.181] [OlapWith #9 L=2103 S=99]
[-1 L=2121 r=-1.365] [LowScoreBy #8 L=2103 S=0]
[+3 L=2526 r=-1.155]
[+1 L= 780 r=-1.217]
[+1 L= 798 r=-1.192]
[+1 L=1029 r=-1.165] [ShorterThan #15 L=865 S=99]
[-3 L=1449 r=-1.295] [LowScoreBy #13 L=865 S=0] [LowScoreBy #
[+3 L=1056 r=-1.228] [OlapWith #15 L=585 S=99] [DelayedBy #13 L=
[+2 L= 840 r=-1.161]
[-3 L=1011 r=-1.230]
[+1 L= 327 r=-1.284]
[+3 L=1731 r=-1.175]
[+2 L=1767 r=-1.266] [DelayedBy #20 L=1263]
[+2 L=22569 r=-1.144]
[-1 L=1002 r=-1.200] [ShadowedBy #26]
[+1 L=1443 r=-1.401] [Contains #25] [LowScoreBy #27 L=167 S=0]
[+2 L= 453 r=-1.261] [ShorterThan #26 L=167 S=99]
[+1 L=1335 r=-1.169]
[+1 L= 525 r=-1.153]
[-1 L= 567 r=-1.298] [DelayedBy #29 L=363]
[-2 L= 405 r=-1.275]
[-1 L= 282 r=-1.273]
[+3 L= 792 r=-1.178]

Glimmer3
Uses a dynamic programming algorithm to choose the highest-scoring set of orfs and start sites.
Not quite an HMM
  – Allows small overlaps between genes; “small” is user-defined
  – Scores of genes are not necessarily probabilities.
  – Score includes component for likelihood of start site

Reverse Scoring
In Glimmer3 orfs are scored from 3’ end to 5’ end, i.e., from stop codon back toward start codon.
Helps find the start site.
  – The start should appear near the peak of the cumulative score in this direction.
  – Keeps the context part of the model entirely in the coding portion of gene, which it was trained on.
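The idea of locating the start near the peak of the cumulative score can be sketched as follows. The per-codon scores are assumed to be log-odds values already computed by some model, and the sketch considers every codon as a candidate rather than only true start codons, which is a simplification.

```python
from itertools import accumulate

def best_start_by_reverse_scan(codon_scores):
    """codon_scores[i] is the score of codon i, in 5' to 3' order (stop excluded).
    Returns (index of the codon at the cumulative-score peak, peak value)."""
    reversed_scores = codon_scores[::-1]            # walk from the stop backwards
    cumulative = list(accumulate(reversed_scores))
    peak = max(range(len(cumulative)), key=cumulative.__getitem__)
    return len(codon_scores) - 1 - peak, cumulative[peak]

# Hypothetical scores: two non-coding-like codons upstream, then coding-like ones.
scores = [-1.0, -2.0, 0.5, 1.0, 2.0]
print(best_start_by_reverse_scan(scores))
```

In the toy input the cumulative score rises while the scan is inside the coding-like region and falls once it runs upstream of it, so the peak marks the likely 5’ boundary.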

Reverse Scoring

Finding Start Sites
Can specify a position weight matrix (PWM) for the ribosome binding site.
Used to boost the score of potential start sites that have a match.
Would like to try:
  – A fixed-length ICM model of start sites.
  – A decision-tree model of start sites.
  – Hard to get enough reliable data.
Glimmer2 had a hybridization-energy calculation with the 16S rRNA tail sequence that didn’t work well.
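A minimal sketch of building a log-odds PWM from aligned sites and scanning an upstream region for the best match. The uniform 0.25 background, the add-one pseudocount, and the example sites are assumptions for illustration; this is not the matrix format Glimmer3 reads.

```python
import math

def build_pwm(sites, alphabet="acgt", pseudocount=1.0, background=0.25):
    """Log-odds position weight matrix from aligned, equal-length sites."""
    total = len(sites) + pseudocount * len(alphabet)
    pwm = []
    for j in range(len(sites[0])):
        column = [s[j] for s in sites]
        pwm.append({b: math.log2((column.count(b) + pseudocount) / total / background)
                    for b in alphabet})
    return pwm

def best_match(pwm, region):
    """Return (score, offset) of the best-scoring window in region."""
    width = len(pwm)
    return max((sum(pwm[j][region[i + j]] for j in range(width)), i)
               for i in range(len(region) - width + 1))

# Hypothetical ribosome-binding-site examples.
pwm = build_pwm(["aggagg", "aggagg", "aggaga"])
score, offset = best_match(pwm, "ttttaggaggtttt")
print(round(score, 2), offset)
```

A gene finder could add a bonus to a candidate start codon when a window with a high score lies a short distance upstream of it.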

Glimmer3 vs. Glimmer2
Table 3. Probability models trained on the output of the long-orfs program. Predictions compared to genes with annotated function.
Table 4. Probability models trained on the output of the long-orfs program. Predictions compared to all annotated genes.

Other Glimmer3 Features
Can specify any set of start or stop codons
  – Including by NCBI translation table #
Can specify list of “guaranteed” genes
  – Or just orfs and let Glimmer choose the starts
Can specify genome regions to avoid
  – Guarantee no prediction will overlap these regions
  – Useful for RNA genes (tRNAs, rRNAs)
Output more meaningful orf scores.
Allow genes to extend beyond end of sequence
  – Important for annotation of incomplete genomes

Glimmer3 Output
               Start                       Length              Scores
ID   Frame   of Orf   of Gene   Stop   of Orf   of Gene   Raw   Frame   NC

Finding Initial Training Set
Glimmer1 & 2 have program long-orfs.
It finds orfs longer than min-length that do not overlap other such orfs.
  – Current version automatically finds min-length.
Works OK for low- to medium-GC genomes.
Terrible for high-GC genomes
  – Up to half the orfs can be wrong.
  – Uses first possible start site; even if the orf is correct, much of the sequence isn’t.
Can use codon composition information in Glimmer3 version (similar to MED).

Running Glimmer3
#1 Find long, non-overlapping orfs to use as a training set
long-orfs --no_header -t 1.0 tpall.1con tp.longorfs
#2 Extract the training sequences from the genome file
extract --nostop tpall.1con tp.longorfs > tp.train
#3 Build the icm from the training sequences
build-icm -r tp.icm < tp.train
#4 Run first Glimmer
glimmer3 -o50 -g110 -t30 tpall.1con tp.icm tp.run1
#5 Get training coordinates from first predictions
tail -n +2 tp.run1.predict > tp.coords
#6 Create a position weight matrix (PWM) from the regions
#  upstream of the start locations in tp.coords
upstream-coords.awk 25 0 tp.coords | extract tpall.1con - > tp.upstream
elph tp.upstream LEN=6 | get-motif-counts.awk > tp.motif
#7 Determine the distribution of start-codon usage in tp.coords
set startuse = `start-codon-distrib --3comma tpall.1con tp.coords`
#8 Run second Glimmer
glimmer3 -o50 -g110 -t30 -b tp.motif -P $startuse tpall.1con tp.icm tp

A novel application of Glimmer
P. didemni is a photosynthetic microbe that lives as an endosymbiont in the sea squirt L. patella.
P. didemni can only be cultured in L. patella cells.
[Figure: L. patella (sea squirt)]

A novel application of Glimmer
Generated 82,337 shotgun reads
Bacterial genome 5 Mb
Host genome estimated at 160 Mb
Depth of coverage therefore much greater for bacterial contigs
Singleton reads primarily belong to host

A novel application of Glimmer
Create training sets by classifying reads from scaffolds > 10 kb as bacterial
  – 36,920 reads
Reads where both read and mate were singletons were treated as sea squirt
  – 21,276 reads
21,141 reads unclassified

A novel application of Glimmer
Train a non-periodic IMM on both sets of data
2 IMMs created
Then classify reads using the ratio of scores from the two IMMs
In a 5-fold cross-validation, classification accuracy was
  – 98.9% on P. didemni reads
  – 99.9% on L. patella reads
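The classification step can be sketched with two models and a log score ratio. To keep the example short, a plain fixed-order Markov model with add-one smoothing stands in for the IMM, so this illustrates the decision rule only, not the models used in the study; the training strings are toy data.

```python
import math
from collections import Counter

def train(seqs, k):
    """Fixed-order Markov model (a stand-in for an IMM) with add-one smoothing."""
    ctx, full = Counter(), Counter()
    for s in seqs:
        for i in range(k, len(s)):
            ctx[s[i - k:i]] += 1
            full[s[i - k:i + 1]] += 1
    # Assumption: 4-letter alphabet, hence the +4 in the denominator.
    return lambda context, base: (full[context + base] + 1) / (ctx[context] + 4)

def log_likelihood(seq, model, k):
    return sum(math.log(model(seq[i - k:i], seq[i])) for i in range(k, len(seq)))

def classify(read, bacterial, host, k):
    """Label a read by the sign of the log ratio of the two model scores."""
    ratio = log_likelihood(read, bacterial, k) - log_likelihood(read, host, k)
    return "bacterial" if ratio > 0 else "host"

bacterial_model = train(["gcgcgcgcgcgc"], k=1)   # toy training data
host_model = train(["atatatatatat"], k=1)
print(classify("gcgcgc", bacterial_model, host_model, 1))
print(classify("atatat", bacterial_model, host_model, 1))
```

Cross-validation as described on the slide would repeat this with the training reads split into five parts, training on four and classifying the fifth.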

A novel application of Glimmer
Finally, re-assemble using ONLY bacterial reads
Original assembly:
  – 65 scaffolds of 20 Kb or longer
  – Total scaffold length 5.74 Mb
Improved assembly:
  – 58 scaffolds of 20 Kb or longer
  – Total scaffold length 5.84 Mb

Acknowledgements
Art Delcher
Steven Salzberg
Owen White (TIGR)
Simon Kasif (Boston U.)
Doug Harmon (Loyola College)
Kirsten Bratke
Edwin Powers (Johns Hopkins U.)
Dan Haft (TIGR)
Bill Nelson (TIGR)