Center for Biological Sequence Analysis Prokaryotic gene finding Marie Skovgaard Ph.D. student



Prokarya


>AE
GTATACTCTTCTTCCCTATACATTGTCGCAGCAAGCTTAGTTTCTTTAGCCTCTCTGCTTTCATTATTAC
TTATAATCTTAATAGCAAGGAGACATATGATAGAGTATTTCTATATGATTCCTTCGTTCGTTTATATGAA
CTTTATTGTCGCACTAAACTTCACTGCAATATTTTTAGAGTTAATAAGAGCACCTAGAGTGTGGGTAAAA
ACTGAAAGAAGTGCCAAGGTTACGGGGGAGGTCATGGGATGATAACTGAATTTTTACTTAAAAAGAAATT
AGAAGAACATTTAAGCCATGTAAAGGAAGAGAATACGATATATGTAACAGATTTAGTAAGATGCCCCAGA
AGAGTAAGATATGAGAGTGAATACAAGGAGCTTGCAATCTCTCAGGTTTACGCGCCTTCAGCTATTTTAG
GGGACATATTGCATCTCGGTCTTGAAAGCGTATTAAAAGGGAACTTTAATGCAGAAACTGAAGTTGAAAC
TCTGAGAGAAATTAACGTCGGAGGTAAAGTTTATAAAATTAAAGGAAGAGCCGATGCAATAATTAGAAAT
GACAACGGGAAGAGTATTGTAATTGAGATAAAAACTTCTAGAAGTGATAAAGGATTACCTCTAATTCATC
ATAAAATGCAGCTACAGATATATTTATGGTTATTTAGTGCAGAAAAAGGTATACTAGTTTACATAACTCC
AGATAGGATAGCTGAGTATGAAATAAACGAACCTTTAGATGAAGCAACAATAGTAAGACTTGCAGAGGAT
ACAATAATGTTACAAAACTCACCTAGATTCAACTGGGAATGTAAATATTGCATATTTTCCGTCATTTGCC
CAGCTAAACTAACCTAAAATTAAAATCTCTCATCGATATAATTAAATTGTGCACACTAGACCAGTAGTTG
CCACAATAGCTGGGAGTGACAGTGGAGGAGGTGCTGGATTACAGGCTGATCTAAAGACGTTTAGCGCATT
AGGAGTTTTTGGTACAACAATAATAACCGGTTTAACAGCACAGAATACAAGAACAGTTACAAAAGTATTA
GAGATACCATTAGATTTCATTGAAGCTCAGTTTGATGCGGTTTGCCTAGATTTACATCCAACTCACGCCA
AAACTGGAATGTTAGCTTCTGGTAAAGTGGTAGAACTTGTACTGAGAAAAATTAGAGAGTATAACATAAA
ACTAGTTTTAGATCCAGTGATGGTTGCGAAATCTGGATCATTATTGGTAACAGAGGATATCTCGGAGCAA
ATAAAAAAGGCGATGAAGGAGGCCATAATATCTACTCCAAACAGATATGAAGCTGAGATAATAAATAAGA
CAAAGATTAATAGTCAAGATGATGTTATAAAAGCGGCAAGGGAAATTTATTCTAAGTATGGGAATGTTGT
AGTTAAAGGATTTAATGGAGTAGATTACGCCATAATTGACGGAGAAGAAATAGAGTTAAAAGGTGATTAC
ATCAGTACTAAAAATACACATGGTAGTGGAGACGTATTTTCTGCCTCCATAACTGCATATCTTGCCTTGG
GATACAAACTTAAAGATGCATTAATAAGAGCTAAAAAATTCGCTACAATGACAGTCAAATACGGTTTGGA
CTTAGGAGGAGGATATGGACCAGTAGATCCCTTTGCCCCTATAGAGTCCATAGTGAAGAGAGAAGAAGGA
AGAAATCAGCTAGAAAACTTACTTTGGTACTTAGAGTCTAATCTTAACGTTATACTTAAACTAATTAACG
Can you spot the gene?
/(ATG|TTG|GTG)((...)*?)(TGA|TAG|TAA)/

Identifying open reading frames
/(ATG|TTG|GTG)((...)*?)(TGA|TAG|TAA)/
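The open reading frame pattern can be tried out directly in Python. This is a minimal sketch, not the method used by any particular gene finder: the function name `find_orfs`, the `min_codons` parameter and the example sequence are made up for illustration. It scans the given strand only and returns non-overlapping matches.

```python
import re

# The slide's pattern: a start codon, a lazily matched run of whole codons,
# and the first in-frame stop codon.
ORF_PATTERN = re.compile(r"(ATG|TTG|GTG)((...)*?)(TGA|TAG|TAA)")

def find_orfs(dna, min_codons=0):
    """Return (start, end, sequence) for non-overlapping ORFs on the given strand.

    min_codons is a hypothetical filter on the number of codons between
    the start codon and the stop codon.
    """
    orfs = []
    for match in ORF_PATTERN.finditer(dna.upper()):
        inner_codons = len(match.group(2)) // 3
        if inner_codons >= min_codons:
            orfs.append((match.start(), match.end(), match.group(0)))
    return orfs

example = "CCATGAAACCCGGGTAGTT"  # made-up sequence
print(find_orfs(example))
```

Because the pattern accepts any in-frame start/stop pair, it reports many ORFs that are not genes, which is why a statistical model is needed to rank them.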

A. pernix (43% AT)

Why care about over-annotated genes?
Genome comparison: fraction of known proteins, average gene length, amino acid composition.
The quality of our databases.
To gain biological knowledge.

Regular expression
Regular expression: /[AT][CG][AC][ACGT]*A[TG][CG]/
The regular expression is able to find all possible sequences, but it does not distinguish between the consensus sequence and a highly unlikely sequence: ACAC--ATC or TGCT--AGG.
Weight matrices can be used to score the sequence, but they do not deal with insertions and deletions.
ACA---ATG
TCAACTATC
ACAC--AGC
AGA---ATG
ACCG--ATC
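A short check of the point made above: the regular expression gives a yes/no answer only, so the consensus-like sequence and the unlikely sequence are accepted equally. The helper name `matches_motif` is an illustrative choice. Gap characters are removed before matching.

```python
import re

MOTIF = re.compile(r"[AT][CG][AC][ACGT]*A[TG][CG]")

aligned = ["ACA---ATG", "TCAACTATC", "ACAC--AGC", "AGA---ATG", "ACCG--ATC"]
# Ungapped training sequences plus the consensus-like and the unlikely sequence.
candidates = [seq.replace("-", "") for seq in aligned] + ["ACACATC", "TGCTAGG"]

def matches_motif(seq):
    """True if the whole sequence is accepted by the regular expression."""
    return MOTIF.fullmatch(seq) is not None

for seq in candidates:
    print(seq, matches_motif(seq))
```

Every candidate is accepted, so the pattern alone carries no information about how typical a sequence is.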

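Markov model
Emission probabilities per state (letters without a value on the slide are shown as -):
State        A     C     G     T
Position 1   0.8   -     -     0.2
Position 2   -     0.8   0.2   -
Position 3   0.8   0.2   -     -
Insert       0.2   0.4   0.2   0.2
Position 4   1.0   -     -     -
Position 5   -     -     0.2   0.8
Position 6   -     0.8   0.2   -
ACA---ATG
TCAACTATC
ACAC--AGC
AGA---ATG
ACCG--ATC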
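The emission probabilities can be estimated by counting letters in the alignment columns. The sketch below assumes that the six ungapped columns are the position states and that the three gapped columns are pooled into one insert state. The column indices and function names are illustrative. With the alignment exactly as printed, the last column comes out as C 0.6 / G 0.4; the slide's C 0.8 / G 0.2 would follow if the fourth sequence ended in ATC.

```python
from collections import Counter

ALIGNMENT = ["ACA---ATG", "TCAACTATC", "ACAC--AGC", "AGA---ATG", "ACCG--ATC"]
MATCH_COLUMNS = [0, 1, 2, 6, 7, 8]  # assumption: ungapped columns are position states
INSERT_COLUMNS = [3, 4, 5]          # assumption: gapped columns form one insert state

def frequencies(letters):
    """Relative frequency of each nucleotide in a list of letters."""
    counts = Counter(letters)
    total = sum(counts.values())
    return {base: counts[base] / total for base in "ACGT"}

def estimate_emissions(alignment):
    """Return (per-position frequencies, pooled insert-state frequencies)."""
    match = [frequencies([seq[col] for seq in alignment]) for col in MATCH_COLUMNS]
    insert_letters = [
        seq[col] for seq in alignment for col in INSERT_COLUMNS if seq[col] != "-"
    ]
    return match, frequencies(insert_letters)

match_emissions, insert_emissions = estimate_emissions(ALIGNMENT)
for position, emission in enumerate(match_emissions, start=1):
    print(position, emission)
print("insert", insert_emissions)
```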

Profile HMM
Profile HMMs have a predefined architecture, and their parameters are estimated from multiple sequence alignments.
Profile HMMs are not useful for gene finding, since all the genes in an organism cannot be aligned in a meaningful way.
[Figure: profile HMM architecture running from a Begin state to an End state]

Markov Model for gene finding
Define a simple architecture: /(ATG|TTG|GTG)((...)*?)(TGA|TAG|TAA)/
[Figure: five states in a row. S1 emits a start codon (ATG, GTG, TTG); S2, S3 and S4 each emit one nucleotide (A, T, G, C); S5 emits a stop codon (TAG, TAA, TGA).]

Markov models
Knowledge of the structure of genes is used to define the architecture of the model.
Sequences (x) from known genes are used to estimate the parameters of the model; this is the training of the model.
Training is done by counting the number of times each nucleotide occurs in a given state and dividing by the total number of observations in that state (for states visited once per gene, such as the start codon state, this is the number of training sequences), which gives the frequencies.

Training
Sequence: x1 x2 x3 x4 x5 x6 x7 x8 x9 x10 ... xn
States: S1 S2 S3 S4 S5
[Figure: the five-state model, with S1 emitting a start codon (ATG, GTG, TTG), S2 to S4 each emitting one nucleotide (A, T, G, C), and S5 emitting a stop codon (TAG, TAA, TGA).]
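Training by counting can be sketched as follows. It assumes that the start codon is assigned to S1, that the nucleotides between start and stop cycle through S2, S3 and S4, and that the stop codon is assigned to S5. The function names and the four toy genes are invented for illustration and are not real training data.

```python
from collections import Counter

START_CODONS = ("ATG", "GTG", "TTG")
STOP_CODONS = ("TAG", "TAA", "TGA")

def label_states(gene):
    """Pair each emitted symbol of a gene with its state (S1..S5)."""
    if len(gene) % 3 or gene[:3] not in START_CODONS or gene[-3:] not in STOP_CODONS:
        raise ValueError("not a complete gene: " + gene)
    labels = [("S1", gene[:3])]
    for i, base in enumerate(gene[3:-3]):
        labels.append((("S2", "S3", "S4")[i % 3], base))
    labels.append(("S5", gene[-3:]))
    return labels

def train(genes):
    """Count symbols per state and divide by the observations in that state."""
    counts = {state: Counter() for state in ("S1", "S2", "S3", "S4", "S5")}
    for gene in genes:
        for state, symbol in label_states(gene):
            counts[state][symbol] += 1
    return {
        state: {symbol: n / sum(c.values()) for symbol, n in c.items()}
        for state, c in counts.items()
        if c
    }

toy_genes = ["ATGAAACCCTAG", "ATGGCTTAA", "GTGACGTTTTAG", "ATGCATTGA"]  # made up
model = train(toy_genes)
print(model["S1"])
print(model["S5"])
```

S1 and S5 are visited once per gene, so their counts are divided by the number of genes. S2 to S4 are visited once per codon, so their counts are divided by the number of codons seen.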

Model after training
S1 (start codon): ATG: 0.77 TTG: 0.11 GTG: 0.12 CTG: 0.00
S2: A: 0.22 T: 0.24 G: 0.27 C: 0.27
S3: A: 0.25 T: 0.23 G: 0.27 C: 0.25
S4: A: 0.26 T: 0.24 G: 0.25 C: 0.25
S5 (stop codon): TAG: 0.6 TAA: 0.3 TGA: 0.1
The trained model can be used to search for genes in DNA sequences.

Searching with the HMM
States: S1 S2 S3 S4 S5
Sequence: ATG A T T T C G C G C G A T ... T A G
Running product of the emission probabilities: (0.22*0.77), (0.23*0.22*0.77), (0.24*0.23*0.22*0.77), ... = P(x|M)
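The running product can be written out in a few lines. This sketch uses the trained values from the previous slide, assigning the three nucleotide tables to S2, S3 and S4 in the order implied by the factors 0.22, 0.23 and 0.24. It ignores transition probabilities, which the slides do not give, and the function names are illustrative.

```python
MODEL = {
    "S1": {"ATG": 0.77, "TTG": 0.11, "GTG": 0.12, "CTG": 0.00},
    "S2": {"A": 0.22, "T": 0.24, "G": 0.27, "C": 0.27},
    "S3": {"A": 0.25, "T": 0.23, "G": 0.27, "C": 0.25},
    "S4": {"A": 0.26, "T": 0.24, "G": 0.25, "C": 0.25},
    "S5": {"TAG": 0.6, "TAA": 0.3, "TGA": 0.1},
}

def emission_path(orf):
    """Yield (state, symbol) pairs: start codon, cycling S2-S4, stop codon."""
    yield "S1", orf[:3]
    for i, base in enumerate(orf[3:-3]):
        yield ("S2", "S3", "S4")[i % 3], base
    yield "S5", orf[-3:]

def probability(orf, model=MODEL):
    """P(x|M) as the product of emission probabilities (transitions ignored)."""
    p = 1.0
    for state, symbol in emission_path(orf):
        p *= model[state].get(symbol, 0.0)
    return p

print(probability("ATGATTTAG"))  # 0.77 * 0.22 * 0.23 * 0.24 * 0.6
```

Each extra nucleotide multiplies the result by a factor of roughly 0.25, so the value shrinks quickly with sequence length.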

Log-Odds score
The probability of a sequence becomes vanishingly small as the sequence x becomes longer.
This is solved by defining a background (NULL) model, for example a random distribution: A=T=C=G=0.25.
From this the Log-Odds score can be calculated: log(P(x|M)/P(x|NULL))
A high Log-Odds score corresponds to a sequence that looks more like the gene model than the background model.
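A minimal sketch of the Log-Odds score, computed as log(P(x|M)/P(x|NULL)) so that higher scores mean a more gene-like sequence. It reuses the trained values shown earlier with the same assumed state assignment, ignores transitions, and sums logarithms rather than multiplying probabilities so that long sequences do not underflow. The name `log_odds` is illustrative.

```python
import math

MODEL = {
    "S1": {"ATG": 0.77, "TTG": 0.11, "GTG": 0.12, "CTG": 0.00},
    "S2": {"A": 0.22, "T": 0.24, "G": 0.27, "C": 0.27},
    "S3": {"A": 0.25, "T": 0.23, "G": 0.27, "C": 0.25},
    "S4": {"A": 0.26, "T": 0.24, "G": 0.25, "C": 0.25},
    "S5": {"TAG": 0.6, "TAA": 0.3, "TGA": 0.1},
}
NULL_PROBABILITY = 0.25  # background model: A=T=C=G=0.25

def log_odds(orf, model=MODEL):
    """log(P(x|M) / P(x|NULL)), accumulated in log space."""
    symbols = [("S1", orf[:3])]
    symbols += [(("S2", "S3", "S4")[i % 3], base) for i, base in enumerate(orf[3:-3])]
    symbols.append(("S5", orf[-3:]))
    score = 0.0
    for state, symbol in symbols:
        p = model[state].get(symbol, 0.0)
        if p == 0.0:
            return float("-inf")  # the model cannot emit this symbol
        # A codon symbol covers three nucleotides of the background model.
        score += math.log(p) - len(symbol) * math.log(NULL_PROBABILITY)
    return score

print(log_odds("ATGATTTAG"))
```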

Is the model too simple?
[Figure: the five-state model, with S1 emitting a start codon (ATG, GTG, TTG), S2 to S4 each emitting one nucleotide (A, T, G, C), and S5 emitting a stop codon (TAG, TAA, TGA).]

Codon usage
Synonymous codons encode the same amino acid.
At random, synonymous codons would be expected to be used with equal frequencies.
In real life, synonymous codons have different frequencies.
Different species have consistent and characteristic codon biases.
Laterally transferred genes and genes from plasmids and phages will have atypical codon usage.
Variations in codon usage within an organism can be modelled with different coding models in the HMM.

Codon usage table. Fields: [number] [amino acid]
1st   2nd position                                              3rd
pos.  U             C             A              G              pos.
U     30,407 Phe    11,523 Ser    22,048 Tyr      7,062 Cys     U
      22,581 Phe    11,766 Ser    16,669 Tyr      8,846 Cys     C
      18,943 Leu     9,793 Ser     2,706 Stop     1,260 Stop    A
      18,629 Leu    12,195 Ser       326 Stop    20,756 Trp     G
C     15,018 Leu     9,569 Pro    17,631 His     28,458 Arg     U
      15,104 Leu     7,491 Pro    13,272 His     29,968 Arg     C
       5,316 Leu    11,496 Pro    20,912 Gln      4,860 Arg     A
      71,710 Leu    31,614 Pro    39,285 Gln      7,404 Arg     G
A     41,375 Ile    12,223 Thr    24,189 Asn     11,982 Ser     U
      34,261 Ile    31,889 Thr    29,529 Asn     21,907 Ser     C
       5,967 Ile     9,683 Thr    45,812 Lys      2,899 Arg     A
      37,994 Met    19,682 Thr    14,076 Lys      1,694 Arg     G
G     24,910 Val    20,808 Ala    43,817 Asp     33,731 Gly     U
      20,800 Val    34,770 Ala    25,996 Asp     40,396 Gly     C
      14,850 Val    27,468 Ala    53,780 Glu     10,902 Gly     A
      35,979 Val    45,862 Ala    24,312 Glu     15,118 Gly     G
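Codon bias can be read off the table by comparing each codon's share among its synonymous codons with the share expected under equal use. The sketch below does this for the six leucine codons. The codon labels assume the standard layout of the table (first position by row, second by column, third within each cell), and the function name is illustrative.

```python
# Leucine counts taken from the codon usage table (assumed standard layout).
LEU_COUNTS = {
    "UUA": 18943, "UUG": 18629,
    "CUU": 15018, "CUC": 15104, "CUA": 5316, "CUG": 71710,
}

def relative_usage(counts):
    """Fraction of each codon among the synonymous codons supplied."""
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()}

usage = relative_usage(LEU_COUNTS)
uniform = 1 / len(LEU_COUNTS)  # expected share if all six codons were used equally

for codon, fraction in sorted(usage.items(), key=lambda item: -item[1]):
    print(f"{codon}: {fraction:.3f} (equal use would give {uniform:.3f})")
```

In these counts CUG accounts for almost half of all leucine codons, while CUA accounts for less than 4%.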

Is the model too simple?
[Figure: S1 emits a start codon (ATG, GTG, TTG); S2 is a single coding state that emits any of the 64 codons; S3 emits a stop codon (TAG, TAA, TGA).]

HMM for gene finding
[Figure: S1 emits a start codon (ATG, GTG, TTG); S2 and S3 are coding states that each emit any of the 64 codons; S4 emits a stop codon (TAG, TAA, TGA).]

Multiple coding models
[Figure: a start state S (ATG, GTG, TTG) and an end state E (TAG, TAA, TGA) connected by several parallel coding models, each built from states that emit any of the 64 codons.]

Order of the model
A zero-order Markov model (state) assigns each letter a probability in the state; the probabilities are independent of the preceding sequence.
The NULL model is a zero-order Markov model (A=T=G=C=0.25).
The probability of a letter in a first-order Markov model depends on the previous letter (dinucleotide distributions).
In a second-order model it depends on the two previous letters (corresponding to a codon).
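The difference between the orders can be made concrete by counting. The sketch below estimates P(letter | previous k letters) for any order k. The function name `train_markov` and the short example sequence are made up for illustration.

```python
from collections import Counter, defaultdict

def train_markov(sequences, order):
    """Estimate P(next letter | previous `order` letters) by counting."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for i in range(order, len(seq)):
            context = seq[i - order:i]  # empty string for a zero-order model
            counts[context][seq[i]] += 1
    return {
        context: {letter: n / sum(c.values()) for letter, n in c.items()}
        for context, c in counts.items()
    }

training = ["ATGATGCC"]  # made-up sequence
zero_order = train_markov(training, order=0)
first_order = train_markov(training, order=1)

print(zero_order[""])    # one distribution, independent of context
print(first_order["G"])  # distribution of the letter following a G
```

In the zero-order model every letter has the same probability here, while the first-order model captures that A is always followed by T in this sequence.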

Order of the coding model
Inter-codon dependencies are correlations between amino acids typically found in proteins.
They reflect typical features of proteins and can be used to improve the performance of the gene finder.
The use of higher-order coding models in gene finding is a way to capture these inter-codon dependencies.
Higher-order models require more training data and more computational time when searching.
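The need for more training data follows from how fast the number of probabilities grows with the order. As a rough illustration, the count below covers a single state over the four-letter DNA alphabet. The function name is hypothetical, and a real gene finder has several such states.

```python
def emission_parameters(order, alphabet_size=4):
    """Number of conditional probabilities P(letter | context) in one state."""
    contexts = alphabet_size ** order  # possible preceding k-mers
    return contexts * alphabet_size    # one probability per context and letter

for order in range(6):
    print(order, emission_parameters(order))
```

Each step up in order multiplies the number of values to estimate by four, so the same training set supports each estimate with fewer observations.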

The Shine-Dalgarno sequence
The ribosome binds to the messenger RNA through base pairing with the 16S rRNA of the 30S ribosomal subunit.
The binding site is the Shine-Dalgarno sequence (SD).
The SD is a purine-rich sequence (consensus sequence: AGGAG) at the 5' end of most prokaryotic mRNAs.
The SD is found 5-10 base pairs upstream of the start codon.
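A simple way to use this signal is to look for the consensus at the stated distance from a candidate start codon. The sketch below searches for an exact AGGAG match only and assumes the 5-10 base distance is measured from the last base of the motif to the first base of the start codon. The function name, the parameters and the example sequences are made up for illustration. Real SD sites often deviate from the consensus, so a gene finder would score the site rather than require an exact match.

```python
SD_CONSENSUS = "AGGAG"

def has_shine_dalgarno(dna, start_index, min_spacer=5, max_spacer=10):
    """True if the SD consensus ends min_spacer..max_spacer bases before the start codon.

    Assumption: the spacer counts the bases between the end of the motif
    and the first base of the start codon at start_index.
    """
    for spacer in range(min_spacer, max_spacer + 1):
        motif_start = start_index - spacer - len(SD_CONSENSUS)
        if motif_start < 0:
            continue  # not enough upstream sequence for this spacer
        if dna[motif_start:motif_start + len(SD_CONSENSUS)] == SD_CONSENSUS:
            return True
    return False

# Made-up example: SD motif, a 7-base spacer, then a short ORF.
with_sd = "TT" + "AGGAG" + "GTTTTTT" + "ATGAAATAG"
without_sd = "CCCCCCCCCCCCCC" + "ATGAAATAG"

print(has_shine_dalgarno(with_sd, with_sd.index("ATG")))
print(has_shine_dalgarno(without_sd, without_sd.index("ATG")))
```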

EasyGene


R. prowazekii

GeneMark.hmm: Lukashin A. and Borodovsky M., "GeneMark.hmm: new solutions for gene finding", NAR, 1998, Vol. 26, No. 4, pp
EasyGene: Schou Larsen T. and Krogh A., "EasyGene – A prokaryotic gene finder that ranks ORFs by statistical significance", BMC Bioinformatics, 2003, 4:21