Microbial gene identification using interpolated Markov models

Microbial gene identification using interpolated Markov models
Nucleic Acids Research, 1998, Vol. 26, No. 2, pp. 544-548
Steven L. Salzberg, Arthur L. Delcher, Simon Kasif, Owen White
Weekly Lab. Seminar, 2001/05/04. Speaker: Eom Jae-Hong

Abstract
Describes a new system, GLIMMER, a gene-finding tool for microbial genomes. It has proven to be very accurate at locating virtually all the genes in the sequences tested (Haemophilus influenzae, Helicobacter pylori), outperforming previous methods and finding more than 97% of all genes. GLIMMER uses interpolated Markov models (IMMs) as a framework for capturing dependencies between nearby nucleotides in a DNA sequence. An IMM-based method makes predictions based on a variable context (a variable-length oligomer in the DNA sequence), which makes it more flexible and powerful than fixed-order Markov methods. 2019-04-30

Introduction
The abundance of data demands new, highly accurate computational analysis tools to explore these genomes and maximize the scientific knowledge gained from them. One of the first steps in the analysis of a microbial genome is the identification of all its genes. Microbial genome sequences tend to be gene-rich (>90% coding sequence), which gives them a different character from eukaryotic genomes (<10% coding sequence). The most difficult problem is determining which of two or more overlapping open reading frames (orfs) represent true genes.

Introduction (2)
Other problems include identifying the start of translation and finding regulatory signals such as promoters and terminators. The most reliable way to identify a gene in a new genome is to find a close homolog from another organism, e.g. with the BLAST or FASTA programs, which search all the entries in GenBank. However, many of the genes in new genomes still have no significant homology to known genes, so one must rely on computational methods of scoring the coding region. The best-known program for this is GeneMark, which uses a Markov chain model to score coding regions.

Introduction (3)
Here, a new system called GLIMMER is introduced. It uses the interpolated Markov model (IMM) technique to identify coding regions in microbial DNA; IMMs are more powerful than fixed-order Markov chains and produce more accurate results for finding genes in bacterial DNA. A fixed-order Markov model predicts each base of a DNA sequence from a fixed number of preceding bases in the sequence (e.g. the 5th-order model of GeneMark). However, with insufficient training data it is difficult to accurately estimate the probability of each base occurring after every possible combination of 5 preceding bases.

Introduction (4)
A kth-order Markov model requires 4^(k+1) probabilities to be estimated from the training data (e.g. a 5th-order model: 4096 probabilities). An IMM overcomes this problem by combining probabilities from contexts of varying length, using only those contexts (oligomers) for which sufficient data are available; e.g. a given 8mer may occur infrequently in the training data while its constituent 5mers occur frequently. The IMM makes its prediction using a linear combination of probabilities obtained from several lengths of oligomers, with high weights for frequently occurring oligomers and low weights for rare ones. An IMM uses a longer context to make a prediction whenever possible; when the longer oligomers are too infrequent, it falls back on shorter oligomers to make its predictions.
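The interpolation idea can be sketched numerically. All weights and probabilities below are made-up illustrative values, not trained GLIMMER parameters:

```python
# Illustrative IMM-style blend: combine P(next base | context) estimates
# from contexts of length 2, 1 and 0. All numbers here are hypothetical.
p_given_ctx = {2: 0.40, 1: 0.30, 0: 0.25}  # e.g. P(g|'ac'), P(g|'c'), P(g)

# Interpolation weights: the longest context gets most of the weight
# when it was seen frequently in training.
weights = {2: 0.7, 1: 0.2, 0: 0.1}

p_interpolated = sum(weights[k] * p_given_ctx[k] for k in weights)
print(round(p_interpolated, 3))  # 0.7*0.40 + 0.2*0.30 + 0.1*0.25 = 0.365
```

When a long context is rare in the training data, its weight shrinks and the shorter, better-estimated contexts dominate the blend.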

Introduction (5)
GLIMMER weights each oligomer based on its frequency of occurrence and predictive value. The system was tested using the H. influenzae, Helicobacter pylori and Escherichia coli genomes, and has recently been used to find the genes in two newly completed genomes: Borrelia burgdorferi, the bacterium that causes Lyme disease, and Treponema pallidum, the bacterium that causes syphilis. URL: http://www.tigr.org/softlab/glimmer/glimmer.html

Interpolated Markov Models - Markov chains
Represent a sequence as a process described by a sequence of random variables X_1, X_2, ..., where X_i corresponds to position i in the sequence and takes one of the values {a, c, g, t}. The variable X_i is in state a if x_i = a. The model of Fig. 1 can model a DNA sequence of any length; e.g. with a transition probability of 0.2 into state a at every step, the sequence "aaaaa" has probability (0.2)^5 = 0.00032.
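A minimal sketch of how such a chain assigns a probability to a sequence; the start and transition probabilities below are hypothetical, keeping only the 0.2 probability into state a from the slide's example:

```python
# Probability of a sequence under a first-order Markov chain over
# {a, c, g, t}: P(seq) = P(x1) * product of P(x_i | x_{i-1}).
def chain_probability(seq, start_p, trans_p):
    p = start_p[seq[0]]
    for prev, cur in zip(seq, seq[1:]):
        p *= trans_p[(prev, cur)]
    return p

# Illustrative parameters: every step into 'a' has probability 0.2,
# so "aaaaa" scores (0.2)**5 = 0.00032, matching the slide.
start_p = {"a": 0.2, "c": 0.3, "g": 0.3, "t": 0.2}
trans_p = {("a", "a"): 0.2}  # only the transition this example needs
print(chain_probability("aaaaa", start_p, trans_p))  # ~0.00032
```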

Interpolated Markov Models - Markov chains (2)
In a first-order Markov chain, the probability that a variable takes a particular value depends only on the preceding variable (in a kth-order chain, on the k preceding variables). Two essential computational issues must be considered in building and using these probabilistic models: the learning problem, which involves learning a good model for coding regions in microbial DNA, and the evaluation problem, which involves assigning a score to a new DNA sequence that represents the likelihood that the sequence is coding.

Interpolated Markov Models - Markov chains (3)
To use a Markov chain model, one needs to build at least six submodels, one for each of the possible reading frames (3 forward, 3 reverse), plus a seventh, separate model for non-coding regions. Each model makes different predictions for the bases in the three codon positions. In a 1st-order model, a base depends on the previous base, so sixteen probabilities P(x|y), for x, y in {a, c, g, t}, must be computed. To score a new sequence, the model considers two bases at a time (the current base and the previous one).

Interpolated Markov Models - Markov chains (4)
In a 2nd-order model, the output of a state depends on the two previous bases: to predict a base in the third codon position, the model looks at the first and second codon positions, and to predict a base in the first codon position, it looks at the second and third codon positions of the previous codon. Using the Markov models for each of the six possible frames plus the model of non-coding DNA, one can straightforwardly produce a simple gene-finding algorithm: score every orf using all seven models, and choose the model with the highest score. However, the overlapping-gene problem then occurs.
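The simple scheme above, score an orf with all seven models and pick the best, can be sketched as follows; the seven "models" here are hypothetical stand-in scorers, not trained Markov submodels:

```python
# Pick the highest-scoring of the seven models for a candidate orf.
def classify_orf(orf, models):
    """Return the name of the model that scores this orf highest."""
    return max(models, key=lambda name: models[name](orf))

models = {
    # Hypothetical scorers: a real system would use the six trained
    # frame-specific Markov submodels plus a non-coding model.
    "frame+1": lambda s: 0.9 * len(s),
    "frame+2": lambda s: 0.4 * len(s),
    "frame+3": lambda s: 0.3 * len(s),
    "frame-1": lambda s: 0.2 * len(s),
    "frame-2": lambda s: 0.1 * len(s),
    "frame-3": lambda s: 0.2 * len(s),
    "noncoding": lambda s: 0.5 * len(s),
}
print(classify_orf("atgaaatttggg", models))  # frame+1 scores highest here
```

An orf whose best model is "noncoding" would be rejected; the overlapping-gene problem arises when two overlapping orfs both score best under coding models.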

-Interpolated models
Ideally one would use the highest-order Markov model possible: a higher-order model should always do at least as well as, and frequently better than, lower-order models. The problem that arises in practice is that as we move to higher orders, the number of probabilities that must be estimated from the data increases exponentially. A kth-order Markov model requires 4^(k+1) probabilities, so six submodels require 6 * 4^(k+1) probabilities to be estimated (e.g. a 5th-order model: 24576 probabilities). In some cases there are not enough n-mer frequencies in the training data to train the nth-order model's probability parameters; this is where IMMs come in.
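The parameter counts quoted above follow from a one-line calculation:

```python
# A kth-order model over {a,c,g,t} needs one probability per
# (context, next base) pair: 4**k contexts * 4 bases = 4**(k+1).
def n_params(k, submodels=1):
    return submodels * 4 ** (k + 1)

print(n_params(5))               # 4096 probabilities for one 5th-order model
print(n_params(5, submodels=6))  # 24576 across the six reading-frame models
```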

-Interpolated models (2)
An IMM uses a combination of all the probabilities based on the 0, 1, 2, ..., k previous bases, where k is a parameter (GLIMMER uses k = 8). In order to 'smooth' its predictions, an IMM uses predictions from the lower-order models (for which much more data are available) to adjust the predictions made from the higher-order models. Training GLIMMER (k = 8): compute the probability of each base a, c, g, t given each context, and for each k-mer compute a weight. Once the weights are computed, GLIMMER evaluates new sequences by computing the probability that the model M generated the sequence S, P(S|M).

-Interpolated models (3)
IMM_8(S_x), the 8th-order interpolated Markov model score, is computed recursively as

  IMM_k(S_x) = lambda_k(S_{x-1}) * P_k(S_x) + [1 - lambda_k(S_{x-1})] * IMM_{k-1}(S_x)

where lambda_k(S_{x-1}) is the numeric weight associated with the k-mer ending at position x-1 in the sequence S, and P_k(S_x) is the estimate, obtained from the training data, of the probability of the base located at x under the kth-order model. This is the solution to the evaluation problem mentioned earlier.
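A minimal sketch of this recursion, with hypothetical weights lambda and trained probabilities P supplied as small dictionaries:

```python
# IMM_k = lambda_k * P_k + (1 - lambda_k) * IMM_{k-1}, bottoming out at
# the 0th-order base composition. All parameter values are illustrative.
def imm_score(seq, x, k, lambdas, probs):
    if k == 0:
        return probs[("", seq[x])]          # no context: base composition
    ctx = seq[x - k:x]                      # k bases ending at position x-1
    lam = lambdas.get(ctx, 0.0)             # unseen context -> weight 0
    p_k = probs.get((ctx, seq[x]), 0.0)
    return lam * p_k + (1.0 - lam) * imm_score(seq, x, k - 1, lambdas, probs)

lambdas = {"ac": 0.8, "c": 0.5}
probs = {("ac", "g"): 0.6, ("c", "g"): 0.4, ("", "g"): 0.25}
# IMM_2 for the 'g' in "acg":
#   0.8*0.6 + 0.2*(0.5*0.4 + 0.5*0.25) = 0.48 + 0.2*0.325 = 0.545
print(imm_score("acg", 2, 2, lambdas, probs))
```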

-Interpolated models (4)
An IMM is in principle always preferable to a fixed-order Markov model: in an IMM, not only the longer but also the shorter oligomers help improve performance. Even when there are rare k-mers for which insufficient data are available, the IMM can fall back on the much more reliable predictions made by the (k-1)-mers.

Algorithm and System Design -Setting IMM parameters
To compute parameter values for the kth-order IMM, a set of known coding sequences must be assembled into a training set. GLIMMER uses only very long orfs and sequences with homology to known genes from other organisms, since these can be identified a priori without knowing anything else about the genome being analyzed. From this training set of genes, the frequencies of occurrence of all possible substring patterns of length 1 to k+1 are tabulated in each of the six reading frames; the last base in the substring defines the reading frame.

Algorithm and System Design -Setting IMM parameters (2)
Consider just a single reading frame, and let f(S) denote the number of occurrences of the string S. The probability of base s_x given the i previous bases is estimated from these counts, and with each such estimate P_i(S_x) we associate a weight lambda: a measure of our confidence in the accuracy of this value as an estimate of the true probability. GLIMMER uses two criteria to determine lambda. Frequency of occurrence (sufficient case): if the number of occurrences of the context string in the training data meets or exceeds a threshold, lambda for this context is set to 1.0.
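The count-based estimate, f(context followed by base) divided by f(context), can be sketched on a toy training string (the sequence below is illustrative, not real genomic data):

```python
# Tabulate (k+1)-mer and k-mer counts from a training sequence and
# estimate P(next base | context) as a ratio of counts.
from collections import Counter

def context_probs(training_seq, k):
    counts = Counter(training_seq[i:i + k + 1]
                     for i in range(len(training_seq) - k))
    ctx_counts = Counter(training_seq[i:i + k]
                         for i in range(len(training_seq) - k))
    return lambda ctx, base: counts[ctx + base] / ctx_counts[ctx]

p = context_probs("acgacgacgt", 2)
print(p("ac", "g"))  # every "ac" in this toy string is followed by "g" -> 1.0
```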

Algorithm and System Design -Setting IMM parameters (3)
Frequency of occurrence (insufficient case): when the context occurs too few times, an additional criterion is employed to assign a lambda value. For a given context string S_{x,i} of length i, compare the observed frequencies of the four bases that follow it, and determine how likely it is that these four observed frequencies are consistent with the IMM values from the next-shorter context. If they differ significantly from the IMM values, the observed frequencies are likely to be better predictors of the next base, so they are given a high lambda value.
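The consistency check can be sketched with a chi-squared statistic comparing observed next-base counts against the shorter-context probabilities; all counts and probabilities below are invented for illustration:

```python
# Chi-squared statistic: sum over bases of (observed - expected)^2 / expected,
# where expected counts come from the shorter-context IMM probabilities.
def chi2_stat(observed, expected_probs):
    total = sum(observed.values())
    stat = 0.0
    for base in "acgt":
        expected = expected_probs[base] * total
        stat += (observed[base] - expected) ** 2 / expected
    return stat

observed = {"a": 30, "c": 5, "g": 3, "t": 2}            # counts after context
shorter_ctx = {"a": 0.25, "c": 0.25, "g": 0.25, "t": 0.25}
print(round(chi2_stat(observed, shorter_ctx), 2))
```

A large statistic means the longer context predicts differently from the shorter one, so its observed frequencies earn a high lambda; a small statistic means the longer context adds little and gets a low lambda.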

Algorithm and System Design -Setting IMM parameters (4)
Calculate the confidence c that the observed frequencies are inconsistent with the shorter-context predictions, and assign higher lambda values based on a combination of predictive value and accuracy: a low-confidence context gets lambda = 0 (falling back entirely on the shorter context), while a high-confidence context gets a lambda that grows with both c and the number of observations of the context. The lambda values so defined are then used in the interpolated score.

-The GLIMMER system
Consists of two programs: build-imm, which takes an input set of sequences and builds and outputs the IMM for them as described above, and glimmer, which uses this IMM to identify putative genes in an entire genome. It does not use a sliding window to score regions; the final output of the program is a list of putative gene coordinates in the genome.

Method and Results
Data: H. influenzae, H. pylori; compared with the GeneMark system. Comparison on H. influenzae: restricted to orfs that are >500 bases long and do not overlap any other orf longer than 500 bp; all other conditions the same, both systems self-trained.

Method and Results (2)
Gene-finding accuracy on H. pylori: of the 1590 annotated genes of Helicobacter, 1548 were found by GLIMMER, along with an additional 314 potential orfs; some candidate genes that conflicted with ribosomal RNAs and tRNAs were eliminated. False-negative rate for GLIMMER: 0.44-2.6%. GeneMark vs. GLIMMER on the set of 974 genes: GLIMMER found 21 genes that GeneMark missed, and GeneMark found one gene that GLIMMER missed; overall GeneMark missed 28 genes and GLIMMER missed 8. The two systems agreed on 945 of the 974 genes.
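The upper end of the false-negative range quoted above follows directly from these counts:

```python
# Fraction of the 1590 annotated H. pylori genes found/missed by GLIMMER.
annotated, found = 1590, 1548
miss_rate = (annotated - found) / annotated
print(f"{found / annotated:.1%} found, {miss_rate:.1%} missed")
```

The found fraction is consistent with the abstract's claim of more than 97% of all genes, and the missed fraction with the 2.6% upper false-negative figure.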

Method and Results (3)
GLIMMER vs. GeneMarkHMM: GeneMarkHMM missed 23 genes and did not find any genes that GLIMMER missed, while GLIMMER found 15 genes that GeneMarkHMM missed. The two systems agreed on 951/974 (97.6%) of the genes. All experiments used a fully automatic training protocol. -END-
