Microbial gene identification using interpolated Markov models
Steven L. Salzberg, Arthur L. Delcher, Simon Kasif, Owen White
Nucleic Acids Research, 1998, Vol. 26, No. 2, pp. 544-548.
Weekly Lab Seminar, 2001/05/04. Speaker: Eom Jae-Hong
Abstract
Describes a new system, GLIMMER, a gene-finding tool for microbial genomes.
It has proven very accurate at locating virtually all the genes in the test sequences (Haemophilus influenzae, Helicobacter pylori), outperforming previous methods: it finds more than 97% of all genes.
It uses interpolated Markov models (IMMs) as a framework for capturing dependencies between nearby nucleotides in a DNA sequence.
An IMM-based method makes predictions based on a variable context (a variable-length oligomer in the DNA sequence), which makes it more flexible and powerful than fixed-order Markov methods.
2019-04-30
Introduction
The abundance of genome sequence data demands new, highly accurate computational analysis tools in order to explore these genomes and maximize the scientific knowledge gained from them.
One of the first steps in the analysis of a microbial genome is the identification of all its genes.
Microbial genomes tend to be gene-rich (>90% coding sequence) and so have a different character than eukaryotic genomes (<10% coding sequence).
The most difficult problem is determining which of two or more overlapping open reading frames (ORFs) represent true genes.
Introduction (2)
Further problems: identifying the start of translation and finding regulatory signals such as promoters and terminators.
The most reliable way to identify a gene in a new genome is to find a close homolog from another organism, e.g. with the BLAST and FASTA programs, which search all the entries in GenBank.
However, many of the genes in new genomes still have no significant homology to known genes.
For these, one must rely on a computational method of scoring the coding region.
GeneMark is the best-known program for this; it uses a Markov chain model to score coding regions.
Introduction (3)
Here the authors introduce a new system called "GLIMMER".
It uses an interpolated Markov model (IMM) technique to identify coding regions in microbial DNA.
IMMs are more powerful than Markov chains and produce more accurate results for finding genes in bacterial DNA.
A fixed-order Markov model predicts each base of a DNA sequence using a fixed number of preceding bases in the sequence (e.g. the 5th-order model of GeneMark).
However, with insufficient training data it is difficult to accurately estimate the probability of each base occurring after each of the $4^5 = 1024$ possible combinations of 5 preceding bases.
Introduction (4)
A kth-order Markov model requires $4^{k+1}$ probabilities to be estimated from the training data (e.g. 5th-order model: $4^6 = 4096$ probabilities).
An IMM overcomes this problem by combining probabilities from contexts of varying length, using only those contexts (oligomers) for which sufficient data are available.
E.g. some 5-mers may occur too infrequently to give reliable estimates, while some 8-mers occur frequently enough.
The IMM uses a linear combination of probabilities obtained from several lengths of oligomers to make its prediction: high weights for oligomers that occur frequently, low weights for those that do not.
An IMM uses a longer context to make a prediction whenever possible.
When there are insufficient data for the longer oligomers, an IMM falls back on shorter oligomers to make its predictions.
Introduction (5)
GLIMMER: the interpolation is based on frequency of occurrence and predictive value.
The system was tested on the H. influenzae, Helicobacter pylori, and Escherichia coli genomes.
It has recently been used to find the genes in two newly completed genomes:
Borrelia burgdorferi, the bacterium that causes Lyme disease.
Treponema pallidum, the bacterium that causes syphilis.
URL: http://www.tigr.org/softlab/glimmer/glimmer.html
Interpolated Markov Models - Markov chains
A Markov chain represents a sequence as a process that may be described as a sequence of random variables $X_1, X_2, \ldots$
$X_i$ corresponds to position $i$ in the sequence and takes a value in $\{a, c, g, t\}$.
Variable $X_i$ is in state $a$ if $X_i = a$.
The model in Fig. 1 can model a DNA sequence of any length.
Example: the sequence "aaaaa" has probability $(0.2)^5 = 0.00032$.
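The "aaaaa" example can be checked with a few lines of Python. This is only a sketch: it assumes the simplest chain, in which each base is emitted independently of its neighbours, and apart from P(a) = 0.2 the base probabilities are made-up values. The name `sequence_probability` is hypothetical.

```python
from math import prod

# Hypothetical base probabilities; only P(a) = 0.2 is taken from the slide.
BASE_PROB = {"a": 0.2, "c": 0.3, "g": 0.3, "t": 0.2}

def sequence_probability(seq, base_prob=BASE_PROB):
    """Probability of seq when every position is drawn independently."""
    return prod(base_prob[base] for base in seq)

print(sequence_probability("aaaaa"))  # approximately 0.00032
```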
Interpolated Markov Models - Markov chains (2)
In a first-order Markov chain, the probability that $X_i$ takes a particular value depends only on the preceding variable $X_{i-1}$ (in a kth-order Markov chain, on the $k$ preceding variables).
Two essential computational issues must be considered in building and using these probabilistic models:
The learning problem: learning a good model for coding regions in microbial DNA.
The evaluation problem: assigning a score to a new DNA sequence that represents the likelihood that the sequence is coding.
Interpolated Markov Models - Markov chains (3)
To use a Markov chain model, at least six submodels must be built, one for each of the possible reading frames (3 forward, 3 reverse).
A seventh, separate model covers non-coding regions.
Each model makes different predictions for the bases in the three codon positions.
In a 1st-order model, a base is dependent on the previous base.
The model must compute sixteen probabilities ($p(a|a), p(a|c), \ldots, p(t|t)$).
In order to score a new sequence, it considers two bases at a time (the current base and the previous one).
Interpolated Markov Models - Markov chains (4)
In a 2nd-order model, the output of a state depends on the two previous bases.
To predict a base in the third codon position, the model looks at the first and second codon positions.
To predict a base in the first codon position, the model looks at the second and third codon positions of the previous codon.
Using the Markov models for each of the six possible frames plus the model of non-coding DNA, one can straightforwardly produce a simple gene-finding algorithm:
simply score every ORF using all seven models, and choose the model with the highest score.
The overlapping-gene problem still occurs.
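The "score with every model and keep the best" step can be sketched as follows. This is an illustration under simplifying assumptions, not the program's actual code: it uses two toy first-order models instead of seven, ignores codon position, and sums log probabilities to avoid underflow. The names `log_score` and `best_model` and both probability tables are hypothetical.

```python
import math

BASES = "acgt"

def log_score(seq, k, cond_prob):
    """Sum of log P(base | previous k bases).

    cond_prob maps (context, base) -> probability; the first k bases of
    seq are skipped because they lack a full context.
    """
    total = 0.0
    for x in range(k, len(seq)):
        total += math.log(cond_prob[(seq[x - k:x], seq[x])])
    return total

def best_model(seq, k, models):
    """Return the name of the model that gives seq the highest log score."""
    return max(models, key=lambda name: log_score(seq, k, models[name]))

# Toy tables (made-up numbers): a uniform model and one that favours
# repeating the previous base.
uniform = {(p, b): 0.25 for p in BASES for b in BASES}
sticky = {(p, b): (0.7 if p == b else 0.1) for p in BASES for b in BASES}
models = {"noncoding": uniform, "frame+1": sticky}

print(best_model("aaaacccc", 1, models))
print(best_model("acgtacgt", 1, models))
```

In a fuller version, `models` would hold the six frame-specific models plus the non-coding model, each trained separately.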
-Interpolated models
In principle, use the highest-order Markov model possible.
The higher-order model should always do at least as well as, and frequently better than, lower-order models.
The problem that arises in practice is that, as we move to higher-order models, the number of probabilities that must be estimated from the data increases exponentially.
A kth-order Markov model requires $4^{k+1}$ probabilities.
Six submodels require $6 \times 4^{k+1}$ probabilities to be estimated.
E.g. 5th-order model: $6 \times 4^6 = 24576$ probabilities.
In some cases an n-mer does not occur often enough in the training data, so the probability parameters of the nth-order model cannot be trained reliably.
Solution: use an IMM.
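The parameter counts quoted above follow directly from the $4^{k+1}$ formula, as this small sketch shows. The function name `num_probabilities` is hypothetical.

```python
def num_probabilities(k, submodels=1):
    """Number of probabilities needed by `submodels` kth-order Markov models."""
    return submodels * 4 ** (k + 1)

print(num_probabilities(5))               # one 5th-order model
print(num_probabilities(5, submodels=6))  # six reading-frame submodels
print(num_probabilities(8))               # one 8th-order model
```

The jump from k = 5 to k = 8 multiplies the count by 64, which is why a fixed 8th-order model is hard to train on a single microbial genome.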
-Interpolated models (2)
An IMM uses a combination of all the probabilities based on 0, 1, 2, …, k previous bases (where k is a parameter).
GLIMMER: k = 8.
In order to "smooth" its predictions, an IMM uses predictions from the lower-order models (for which much more data are available) to adjust the predictions made by the higher-order models.
Training GLIMMER (k = 8):
Compute the probability of each base a, c, g, t following each context.
For each k-mer, compute a weight.
Once the weights are computed, GLIMMER evaluates new sequences by computing the probability that the model M generated the sequence S, $P(S|M)$.
-Interpolated models (3)
$\mathrm{IMM}_8(S_x)$, the 8th-order interpolated Markov model score, is computed as
$\mathrm{IMM}_k(S_x) = \lambda_k(S_{x-1}) \cdot P_k(S_x) + [1 - \lambda_k(S_{x-1})] \cdot \mathrm{IMM}_{k-1}(S_x)$
$\lambda_k(S_{x-1})$: the numeric weight associated with the k-mer ending at position x-1 in the sequence S.
$P_k(S_x)$: the estimate, obtained from the training data, of the probability of the base located at x in the kth-order model.
This is the solution to the evaluation problem mentioned earlier.
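A minimal sketch of how such an interpolated score might be evaluated, assuming the usual recursive form in which the order-k estimate is blended with the order-(k-1) IMM score using a per-context weight. The tables `lam` and `prob`, their layout, and the function name `imm_score` are hypothetical, and the numbers are made up.

```python
def imm_score(seq, x, k, lam, prob):
    """Interpolated probability of seq[x] using up to k preceding bases.

    lam[context]  -> weight in [0, 1] for that context string
    prob[context] -> dict mapping base -> P(base | context)
    """
    k = min(k, x)                      # cannot look back past the sequence start
    if k == 0:
        return prob[""][seq[x]]        # order 0: plain base frequency
    context = seq[x - k:x]
    lower = imm_score(seq, x, k - 1, lam, prob)
    weight = lam.get(context, 0.0)     # unseen context: rely on shorter ones
    if weight == 0.0:
        return lower
    return weight * prob[context][seq[x]] + (1.0 - weight) * lower

# Toy tables (made-up numbers).
prob = {
    "": {"a": 0.25, "c": 0.25, "g": 0.25, "t": 0.25},
    "a": {"a": 0.7, "c": 0.1, "g": 0.1, "t": 0.1},
}
lam = {"a": 0.5}

print(imm_score("aa", 1, 8, lam, prob))   # 0.5 * 0.7 + 0.5 * 0.25
print(imm_score("caa", 2, 2, lam, prob))  # context "ca" unseen, falls back to "a"
```

The second call shows the fall-back behaviour: a context with no weight contributes nothing, so the score comes entirely from the shorter context.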
-Interpolated models (4)
An IMM is in principle always preferable to a fixed-order Markov model.
In an IMM, not only longer but also shorter oligomers help improve performance.
Even if there are some rare k-mers for which insufficient data are available, the IMM can fall back on the much more reliable predictions made by the (k-1)-mers in such cases.
Algorithm and System Design -Setting IMM parameters
Computing parameter values for the kth-order IMM:
A set of known coding sequences must be assembled into a training set.
Use only very long ORFs and sequences with homology to known genes from other organisms.
These can be identified a priori without knowing anything else about the genome being analyzed.
From the training set of genes, the frequencies of occurrence of all possible substring patterns of length 1 to k+1 are tabulated in each of the six reading frames.
The last base in the substring defines the reading frame.
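The tabulation step might look like the sketch below. It is a simplification: it handles only the three codon positions of the forward strand (the reverse frames could be handled by running the same code on reverse-complemented genes), and the name `count_contexts` is hypothetical.

```python
from collections import Counter

def count_contexts(genes, k):
    """Tabulate substrings of length 1..k+1, grouped by reading frame.

    counts[frame][s] is the number of occurrences of substring s whose
    last base sits at codon position `frame` (0, 1 or 2).
    """
    counts = [Counter() for _ in range(3)]
    for gene in genes:
        for end in range(len(gene)):
            frame = end % 3            # the last base defines the frame
            longest = min(k + 1, end + 1)
            for length in range(1, longest + 1):
                counts[frame][gene[end - length + 1:end + 1]] += 1
    return counts

counts = count_contexts(["atgaaa"], k=1)
print(counts[0])  # substrings ending at the first codon position
```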
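Algorithm and System Design -Setting IMM parameters (2)
Consider just a single reading frame.
$f(S)$: the number of occurrences of string (sequence) $S = s_1 s_2 \ldots s_n$ in the training set.
Compute $P_i(S_x)$, the probability of base $s_x$ given the $i$ previous bases (the context string $S_{x,i}$, with $0 \le i \le k$):
$P_i(S_x) = P(s_x \mid S_{x,i}) = \frac{f(S_{x,i}, s_x)}{\sum_{b \in \{a,c,g,t\}} f(S_{x,i}, b)}$
$\lambda_i(S_{x-1})$, the weight that we associate with $P_i(S_x)$: a measure of our confidence in the accuracy of this value as an estimate of the true value.
GLIMMER uses two criteria to determine $\lambda_i(S_{x-1})$.
Frequency of occurrence (sufficient case): if the number of occurrences of the context string in the training data is at least a threshold, $\lambda_i(S_{x-1})$ is set to 1.0.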
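Estimating the probability of the next base from the tabulated frequencies is a relative-frequency calculation. The sketch below assumes the counts are stored in a dictionary keyed by substring; the name `base_probabilities` and the example counts are hypothetical.

```python
def base_probabilities(counts, context):
    """Estimate P(base | context) from substring counts.

    counts maps a substring to its number of occurrences in the training
    set. Returns (probabilities, n) where n is how often the context was
    seen followed by any base; probabilities is None when n == 0.
    """
    followers = {b: counts.get(context + b, 0) for b in "acgt"}
    n = sum(followers.values())
    if n == 0:
        return None, 0
    return {b: c / n for b, c in followers.items()}, n

# Made-up counts for the context "ac".
counts = {"aca": 6, "acc": 2, "acg": 1, "act": 1}
probs, n = base_probabilities(counts, "ac")
print(probs, n)
```

The returned `n` is the quantity that would be compared against the frequency threshold when choosing the weight for this context.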
Algorithm and System Design -Setting IMM parameters (3)
Frequency of occurrence (insufficient case): employ an additional criterion to assign a value to the weight.
For a given context string $S_{x,i}$ of length $i$, compare the observed frequencies of the four bases that follow it with the IMM values from the next shorter context.
Determine how likely it is that the four observed frequencies are consistent with those IMM values.
If the observed frequencies differ significantly from the IMM values, use them as better predictors of the next base (give them a high weight).
If they are consistent with the IMM values, they add little predictive value, so give them a low weight.
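Algorithm and System Design -Setting IMM parameters (4)
Calculate the confidence c that the observed frequencies are not consistent with the IMM values from the next shorter context.
Assign higher weight values based on a combination of predictive value and accuracy.
With both criteria applied, the weight values are now defined.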
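The two criteria can be combined as in the sketch below. The slides do not say how c is computed or which thresholds are used, so everything specific here is an assumption: a chi-square test with 3 degrees of freedom for c, a frequency threshold of 400, a cut-off of 0.5 on c, and a weight that scales with both c and the sample size. The names `chi2_cdf_3df` and `context_weight` are hypothetical.

```python
import math

def chi2_cdf_3df(x):
    """CDF of the chi-square distribution with 3 degrees of freedom."""
    if x <= 0:
        return 0.0
    return (math.erf(math.sqrt(x / 2))
            - math.sqrt(2 * x / math.pi) * math.exp(-x / 2))

def context_weight(observed, shorter_probs, threshold=400, min_conf=0.5):
    """Weight for one context string.

    observed      -> dict base -> count of that base after the context
    shorter_probs -> dict base -> IMM probability from the next shorter
                     context (assumed to be strictly positive)
    threshold, min_conf -> assumed values, not taken from the slides
    """
    n = sum(observed.values())
    if n >= threshold:                 # sufficient data: trust fully
        return 1.0
    if n == 0:
        return 0.0
    stat = sum((observed[b] - n * shorter_probs[b]) ** 2 / (n * shorter_probs[b])
               for b in "acgt")
    c = chi2_cdf_3df(stat)             # confidence that the counts differ
    if c < min_conf:                   # consistent with shorter context
        return 0.0
    return c * n / threshold           # predictive value times accuracy

uniform = {b: 0.25 for b in "acgt"}
print(context_weight({"a": 100, "c": 0, "g": 0, "t": 0}, uniform))
print(context_weight({"a": 25, "c": 25, "g": 25, "t": 25}, uniform))
```

The first call has strongly skewed counts from a modest sample, so it receives a partial weight; the second matches the shorter context exactly and receives none.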
-The GLIMMER system
Consists of two programs (build-imm, glimmer).
build-imm: takes an input set of sequences, then builds and outputs the IMM for them as described above.
glimmer: uses this IMM to identify putative genes in an entire genome.
It does not use a sliding window to score regions.
The final output of the program is a list of putative gene coordinates in the genome.
Method and Results
Data: H. influenzae, H. pylori.
Compared with the GeneMark system.
Comparison on H. influenzae:
ORFs are >500 bases long and do not overlap any other ORF longer than 500 bp.
All runs use the same conditions. Self-trained.
Method and Results (2)
Gene-finding accuracy on H. pylori:
1590 annotated genes of Helicobacter were identified.
1548 of the 1590 genes were found by GLIMMER.
An additional 314 potential ORFs were found.
Some genes were eliminated because they conflict with ribosomal RNA and tRNA genes.
False negative rate for GLIMMER: 0.44 ~ 2.6%.
GeneMark vs. GLIMMER (on the set of 974 genes):
GLIMMER found 21 genes that GeneMark missed.
GeneMark found one gene that GLIMMER missed.
GeneMark missed 28 genes; GLIMMER missed 8 genes.
The two systems agreed on 945 of 974 genes.
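The upper end of the false negative range follows from the H. pylori counts, as this short check shows. The function name `found_and_missed` is hypothetical.

```python
def found_and_missed(found, total):
    """Return (fraction of genes found, fraction missed)."""
    sensitivity = found / total
    return sensitivity, 1.0 - sensitivity

sens, fn = found_and_missed(1548, 1590)
print(f"{sens:.1%} found, {fn:.1%} missed")
```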
Method and Results (3)
GLIMMER vs. GeneMarkHMM:
GeneMarkHMM missed 23 genes and did not find any genes that GLIMMER missed.
GLIMMER found 15 genes that GeneMarkHMM missed.
The two systems agreed on 951/974 (97.6%) of the genes.
All experiments used a fully automatic training protocol.
-END-