Bioinformatics Gene detection and prediction Gene predictions in prokaryotes Gene predictions in eukaryotes Difficulties of gene prediction Statistical.

Presentation transcript:

Bioinformatics, Lecture 11
Gene detection and prediction
Gene prediction in prokaryotes
Gene prediction in eukaryotes
Difficulties of gene prediction
Statistical measure of prediction accuracy

Gene detection and prediction
Gene detection in genomic DNA has been a challenging problem since the early 1980s. Intensive sequencing of large genomic regions and entire genomes has made gene prediction even more important in the last few years. The ever-growing number of gene prediction programs is the best demonstration of the effort invested. It is also an indication of the difficulties involved, which are caused by the tremendous variety of gene signals. The accuracy of modern gene prediction programs varies broadly, from 0.45 to 0.9, depending on gene type, species, software, etc.

Gene prediction in prokaryotic genomes
Gene prediction in prokaryotic organisms is generally easier than in eukaryotic ones, mainly because prokaryotic genes usually lack introns. Another reason is the more conserved structure of the promoter region. A smaller number of patterns to recognise, the more conserved nature of these patterns, and a single ORF per gene make the objective more achievable. As a result, gene prediction in prokaryotes is more reliable.
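Because an intron-free gene is a single uninterrupted ORF, the simplest prokaryotic gene finder is an ORF scan. The sketch below is only an illustration of that idea, not the method of any particular program: the function name find_orfs and the min_codons threshold are assumptions, only the forward strand is scanned, and only ATG is treated as a start codon.

```python
STOPS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_codons=30):
    """Return (start, end, frame) for each ATG...stop ORF on the forward strand.

    start is the 0-based index of the A in ATG, end is one past the stop codon,
    and an ORF is reported only if it has at least min_codons codons before the stop.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None  # position of the first ATG since the last stop in this frame
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOPS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None
    return orfs

# Toy sequence (hypothetical): one short ORF in the third frame.
toy = "CC" + "ATG" + "GCT" * 5 + "TAA" + "GG"
print(find_orfs(toy, min_codons=3))
```

A real prokaryotic gene finder would also scan the reverse strand, accept alternative start codons, and combine ORF length with promoter and codon statistics.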

HMM of an E. coli gene
This model is based on the simple assumption that codons are independent, which allows a Markov process to be used. More complex assumptions are possible, and programs implementing them exist.
Round → match state
Diagonal → insert state
Square → delete state
A model trained on E. coli is capable of finding genes in an unknown E. coli sequence. If a sequence were absolutely accurate, only match states would be needed in the model. The insert and delete states allow an ORF with an extra or missing base to be recognised.
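Under the codon-independence assumption, the score of a candidate ORF reduces to a sum of per-codon terms. The sketch below shows only that match-state part as a log-odds score against a uniform background; it is not the full HMM with insert and delete states, and the function name, the pseudo-frequency for unseen codons, and the toy frequencies are all assumptions made for illustration.

```python
import math

def codon_log_odds(orf, coding_freq, background=1 / 64, unseen=1e-6):
    """Sum of log2(coding_freq[codon] / background) over the in-frame codons of orf.

    coding_freq maps codons to their frequency in known genes of the species.
    Codons missing from the table get the small pseudo-frequency `unseen`.
    """
    orf = orf.upper()
    usable = len(orf) - len(orf) % 3  # ignore a trailing partial codon
    score = 0.0
    for i in range(0, usable, 3):
        codon = orf[i:i + 3]
        score += math.log2(coding_freq.get(codon, unseen) / background)
    return score

# Toy frequencies (hypothetical, not taken from any real codon table).
toy_freq = {"ATG": 1 / 32, "GCT": 1 / 16}
print(codon_log_odds("ATGGCT", toy_freq))
```

A positive total means the codons are, on average, more typical of coding sequence than of the background; a model with dependent codons would replace the per-codon table with conditional frequencies.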

Ab initio gene prediction in eukaryotes
Knowledge of the basic structural elements of genes makes it possible to predict genes in long stretches of unknown DNA. There are two major types of information used in ab initio gene prediction programs:
Prediction of numerous signals (next slides)
Prediction of exons/ORFs

Eukaryotic gene organization and signals
[Figure: a gene and its mRNA transcript drawn 5’ to 3’, with labels for the promoter, transcriptional initiation site, 5’UTR, translational start site, CDS, translational stop site, 3’UTR, poly(A) site and transcriptional termination site; the mRNA transcript carries a cap and an AAA...AA tail.]
This gene contains 6 exons (3 of which can be translated) and numerous signals/motifs, including ORF, splicing signals, promoter sites, and sites responsible for initiation and termination of transcription and translation. There are other essential signals not shown in this figure. Nearly every one of these signals may vary significantly between genes and species.

Ab initio gene prediction: features requiring recognition in eukaryotic genes
Promoter area
Transcriptional initiation site
ORFs
Splicing signals
Translational initiation site (start codon ATG)
Translational stop site (a stop codon)
Poly(A) site
Transcriptional termination site
Some of these sites are highly conserved, while others vary significantly. None of the above features alone provides sufficient information to predict a gene in a sea of non-coding sequence.

Difficulties of ORF finding in eukaryotes
Prediction of ORFs alone is not always reliable, because only long ORFs can be discriminated statistically from non-coding stretches that lack stop codons by chance. False-positive predictions for relatively short sequences become a significant issue. I used 2 different programs to predict the number of exons for two genes with known structure (3 and 14 exons respectively). In both cases the predicted numbers of exons were higher than the real ones (10-15 and … respectively). As the frequencies of the trinucleotides TAA, TAG and TGA, which serve as stop codons, are rather low, there is a high probability that long stretches of non-coding DNA will not contain any of them. Such a stretch has no stop codons and can therefore be falsely identified as an exon. False-negative predictions can also be an issue, particularly when exons are very short (< 15 nucleotides).
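The false-positive problem can be made concrete with a back-of-the-envelope calculation. The sketch below assumes, purely for illustration, that non-coding DNA is random with all 64 trinucleotides equally likely, so that each in-frame triplet is a stop with probability 3/64; real genomes deviate from this, and the function name is an assumption.

```python
def p_no_stop(n_codons, p_stop=3 / 64):
    """Probability that n_codons consecutive in-frame triplets contain no stop codon,
    assuming independent triplets that are each a stop with probability p_stop."""
    return (1 - p_stop) ** n_codons

# Compare a very short exon-sized window with longer stop-free stretches.
for n in (5, 20, 50, 100):
    print(n, "codons:", round(p_no_stop(n), 4))
```

Under this assumption the chance of a stop-free window falls geometrically with length: short windows are stop-free most of the time, which is why short exons are hard to tell apart from random sequence, while a stop-free stretch of 100 codons is already rare.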

ORF predictions and codon usage tables
Prediction accuracy is higher in more sophisticated programs, which assume that codons may not be independent, since codons may carry much more information than just the coding of a single amino acid. Such programs use codon frequencies, which differ between species (next slide). The codon frequencies of a particular species reflect a long evolutionary history and carry a lot of hidden information about the structure of exons and genes, including the periodicity of DNA in exons, the distribution of intron phases and exon symmetry. The Codon Usage DataBase can be found at:
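A codon usage table of the kind these programs rely on can be built by counting in-frame codons in known coding sequences. The following is a minimal sketch, assuming the input is a coding sequence that starts in frame; the function name and the toy sequence are illustrative and not taken from the Codon Usage Database.

```python
from collections import Counter

def codon_usage(cds):
    """Return {codon: relative frequency} for the in-frame codons of a coding sequence.

    The frequencies sum to 1; a trailing partial codon is ignored.
    """
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    counts = Counter(cds[i:i + 3] for i in range(0, usable, 3))
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()}

# Toy coding sequence (hypothetical): ATG GCT GCT TAA
usage = codon_usage("ATGGCTGCTTAA")
for codon in sorted(usage):
    print(codon, usage[codon])
```

For a species-level table the counts would be pooled over many annotated genes before normalising, so that the frequencies reflect the genome rather than one gene.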

Codon usage is species specific
Codon usage frequencies for C. elegans (Σ of codon frequencies = 1)
AAA AAC AAG AAT ACA ACC ACG ACT AGA AGC AGG AGT ATA ATC ATG ATT
CAA CAC CAG CAT CCA CCC CCG CCT CGA CGC CGG CGT CTA CTC CTG CTT
GAA GAC GAG GAT GCA GCC GCG GCT GGA GGC GGG GGT GTA GTC GTG GTT
TAA TAC TAG TAT TCA TCC TCG TCT TGA TGC TGG TGT TTA TTC TTG TTT

Gene > 45328_AB protein_id:BAA ; Homo sapiens DNA for GPI-anchored molecule-like protein, complete; intron(phase:11,size:609,5302,intr_sum:5911); exon(size:73,108,296,ex_sum:477); {splice:gtag,gtag} ATGCTCCTCTTTGCCTTACTCCTAGCCATGGAGCTCCCATTGGTGGCAGCCAGTGCCACCATGCGCGCTCAGTgtaagtatcattccctctcactgtcctggagaggacgagaattccacctggggtgctgggggtcac tgggatgattggctgcaacgtggagcaagcctccgttagctggggcctgcattgtctgtgtaatcaggggtgggcactagggcagtccaggagtagtcatgagcaaggagagggttaggatgaaggagcagctgaccag ggaccaagggggaaccttgatgtggcccttccccatcagcgccaggcaggaggggctctgtccagggaaacccaggaggatggcggacccctgtgagtatccagtcttccttggcgaggtgagccaggtctgcagagca tagcaatcccgtatgtgaccaccaagtggcgctctctggagcctgcgttggagagcagggaaagctctccttgtgcctggcctccctcccaggagctagcctgggccagactcagactgcatagagagctgagctgtgc aggctaggagaagtccttggaagcagaggggaagggctggccgctgaagaagggtggagtgagctggtaatgggtggaaaaggcgtagtggagcagaagcctgaagcctgctttctcccctctcagGGACTTACAGTTT GAGATGCCATGACTGTGCGGTCATAAATGACTTCAACTGTCCCAACATTAGAGTATGTCCGTATCATATTAGGCGCTGTATGACAATCTCCATTCgtaagtacctcttggtcatttggacacattgtagattagtcccc tacctgggtagtttctggggccagggccagtctgctttcttctctgaacccagctctgtttccccttccctcatgtcctcccatcctgagtgcgtttctgcacgcttgggtctcagcctcatgataggccagcatgcat catcttgtggagccaggtactctgcaaagtagtacagtctgtccacacatctgcagcgtctccagggggtgggagcattgttggaccgcagagcctctactgtccctggttgtgtgtgtagacaccccatgctgcctgg gtgagtcctcactggccattcaatttctggtagtttagagagtgcttccagttcggaaggtcaaagaagtggttgggccatggcactccccccagggaaaaacctaccacacacagttgggagacaagcatgaggtcag ggcacagcctgcacttccaggatgctcttgctttttcctcagagccctctggtgccctctctccccagggccctaacctgcagaagatgtatggccagaggccagtgaccaatggagcaagcagggagggtgcagccag tgtatgccctggcgcacaggtggagcctcgtctggggctctgctcagggcctttcctgtagcccttttgtccccggagtctgtggttgtgcttgcaacttcacatgtcattctgtgttctaagctttgtgcagaaataa tcccctgattaccacctcggtgtggttgggttatgtccccacccagatctcatcttgaattgtaactcccataatccccgtgtgtcatgggagggacctggtgggaggtaattgattcatctgggtggttaccctcatg 
atgtgctcgtgatagcgagtgcccacaagggcgaaacctgccacacacagatgggccggacaagcatgcggtgcagggcacagactgcacttccaggatgctcttgctttttcctcagagcccgctggtgccctctctc cccagggccctaacctgcagaagatctgatggttttctaaggagggttttccccggaacttctcctgcctgccatcattcgaagaatgtgttggttacccctgctgttgtcatcgtaagtttcctgaggcctccccagc cacgcggaactgtgagtcagttaaaccgcttgcctttataaattgcccagtctcgggcagttcttatagcagcgtgagaagagatcatacacacctcttcttcagaccaactgctgatttgggacatgggcaggaggct gagaatctgggctttgtctccacagtggccactgctggggaccttgatggcatggttttgtcatgttgggattactgcctctaaatgaggctgagaactatctttcccataactccctcgtggggtaatagttggcaaa aagaggaacttgagtgagattcgcaaggtatgggtgaagcagagatcactcttctggaaggtcacgatggtttagtgaagtgaaggacactggtggactccaatctgtcctccttcatgtccaccccatcttctccact ttgtgagcctttgactgaaagcaattctaagactccagcaggcatttggccactggcctgcctaggtgatgggttctccagtggcgcaggttaaatgtctctgcaagagactcctctgcacagccttacttgagaggat aacggtgcatggcctttgatattgccctgcagagttgggctgtgccagcctgccccagtgtaagtgacctagctctggactgttgctcctctgatcttcagccttacctgactgacttcctctcttggctcccccatgg tcatggcagctgcaacacttttatttaataacttagacctagactgttttagaggctccattttcctgaatgaaccctgatggatattaggacaaagaaattgggaatgctggaatgctgggacatttttcctttcagg agatttgctgatttctggggctgtgcaggatggtaagcaaaagacctctatggaaagaaacgcaggcagtgtcctgagacaaggggcctgggtatgaggcatataggaactctgcactatctctgcaacttttctgaaa acctaaaactatttttaaaataaaaggttcatacacacacacacacacacacacacacacacacaccataaccccactaccttgtggattcaataacagaattgcaatgtggtttacaatttgaaacccatcagtttaa ttcaccatattaacaaagaaaagggcagagatgattgtagatgctgaaatgttttgataaaatccaacaccctttccaaatctctgagggaaatgtctaccatcaccacttctattcccaatggaactggagttgttag gtagttcagttaggcgagggggagtattaagtctcatggagcggaggaagcaagaatttaacctgcttactattcatnagagacaaccaacttcgaataacaaatatctttaggcaagctcctggagacaagataagta ttccgaaatcagtttaccaaatcagttctccttttgtacatcaagaaacagattggaaatgaaatccgaggcatggtgctctttaaagtatngccaaatcaagtaccacaaatctagctggagttgtgcaagatttcaa cactcaacctgcaagaacactattgaagacttaaaacaccattaaagacttaaacacatggaattgtacacgatacttatggatcatgagaatcnaacagagtaaggatatctcttctccaattaacctacaaatccaa 
cagaatttcagtcaaagtctcagcaggggtatgtttgtgttttgacaagttggtgccgaatgctttttgaaaatatgaagggctaggaataaccaaagcagtcttggggaaaaagaaaaaagctcgaggacttgtgctc tgaactataaaaccggtttccttaatcaaaacggtttggtgttggatcctgtggtcaggattgaaaaataacaacattagaagagaaataattcagaaaaagccacatacttctggtttcttgatctctgacaaagttg tccctgcagtgcagcaggtaaagacacttctctgcgctgagcagtagtggatatgttagatatttatacacaatatctgcaccatataaaagaagcagtttcaggtggtttgtagatctcaatgtaaatattaaaccaa taaagtttttagagaaaaatgtaagttgctgtctttatgacttgtaggaattaagaatttaaaaaatcgaacaaaatagtgctagtcacataagaatgaaattgataggttgaattacctctaacttcaaaacttctgt tcctaagaagatacctgtgaaatagtataaccgcaagccacggatgaggagaaagcatttcaaaacatgtatttactaaagtacttgtgttcataatataatgaacttattcctatgaattaataaaaaactgatcaat ggaaatttaaagaaaaggtttaaagggaattggccagaaaatggtgtctggatgtgtaacagacataagcgcagttgttccctgtcgttcttcttgaaaggaatgcagcttaaaaccacaacacaataacgccacagga aatgactcagtatcctaaagtgatcataccaagtgttgtcagggttctgaactgtgcacaatccctcagactatagataagttcaacccagtgtagaaaactcttcgatgttgtcggctaaagctgcgtatccaacact gagactcagcgcttctactcctaggtacatagctaattataaaaaagcttatacaggctcaccttaaggcgaatacaagacaagagtattcattgcagcactatttggagtagacaaaaaccagaaacaacctcaatgt ccatcaacaataaattgtggtatgttcaccaatgtaatgctatgcaattttgagaatgaacaatatacaacagtgtgcaacaatagggataaatttcataaacacagattagagcaaaagaagccatccacaaaagcag atgtaatgtataatttcatttcataaagttcaaaaccaaacagaactaacctataggtttagaagcccatacatttcctgtctttggcctgggctgctcacagcagtgtctgcatggaggaacagagtgaagaatcctg gggctctctcttgcagctctgtttctgatgctgcatgctgagtactgcctgtgggcacttggtgacaatccattgagttgtggacgtatggtcaggcacatttctgtagaatggtgaaccccaataaactttttaaatg gaaaataacagaaaacaagtgactggttcgtattcatcatttgtttctggtgttatagaggagacaagacccagacccatgaatagataaatcaattatggtaagttccagaataaagtatttgtgtaaacactggata gacttcaaggggaacagaagtagtgatgagggccaaggtatttaaccctgggattcttgaaggaggtctnagatggaactttggtcaggtgccattggtcaactgaggcccctggacaagccctctagtgaagattatt gatcacattagtcgtgatcacattagttgcgatcgataaccttcaccagagggcttgtccatcggataacacgtacctagttggaaaggtgcatcaagtctgtactggggaggaggttggcaccaagggtcaagagcat 
ttatatgttggtccaggagagctgaaacaccaggctgagggtcccaggaaggagtttacccgagaggggtcttgccaactgcagaaaatctagttaaggctacattaatagggacacctctgttaaaggtccatgtgat ctctagaccataagtgaaacatgatttctggtctgcatttaatttctcgggctctagaattcttaccataagaaatgaacaggacatttcattgcgtgtagctacatgancaaggtaccatgataccattttgggaggg ccagagncaannagaatgggcagctctagnatagaaaactatctccagaagaagttgttctgggagtgatgagagcccgctagtggacttggatgttctctcttagtcaagtgttctagacaattttatcacagcgtgg gagtgtagaatgtgtacatggagctaattatggttgaatgtgaggtgtatgtgcctcaatatttacaagcagaaaatgtgaaatcaattattttcattgctgcttctttttttagGCATAAATTCTCGTGAACTACTTG TTTATAAGAACTGTACAAACAACTGCACATTTGTATATGCAGCTGAACAGCCTCCTGAAGCCCCAGGAAAAATCTTCAAAACTAATAGCTTCTACTGGGTTTGTTGTTGTAATAGCATGGTTTGCAATGCAGGAGGACC TACTAATCTTGAAAGGGACATGTTACCCGATGAAGTAACTGAGGAGGAGCTTCCAGAAGGAACTGTGAGGCTGGGGGTATCAAAACTGTTGCTGAGTTTTGCCTCTATCATAGTCAGCAATATATTGCCATGA

Output of the efficient HMM-based gene prediction program Genscan
GENSCANW output for sequence 21:16:46
GENSCAN 1.0 Date run: 22-Mar-104 Time: 21:16:46
Sequence 21:16:46 : 6388 bp : 45.81% C+G : Isochore 2 ( C+G%)
Parameter matrix: HumanIso.smat
Predicted genes/exons:
Gn.Ex Type S.Begin...End.Len Fr Ph I/Ac Do/T CodRg P....Tscr
Init
Intr
Intr
Predicted peptide sequence(s):
>21:16:46|GENSCAN_predicted_peptide_1|127_aa
MELPLVAASATMRAQWTYSLRCHDCAVINDFNCPNIRVCPYHIRRCMTISIRINSRELLV
YKNCTNNCTFVYAAEQPPEAPGKIFKTNSFYWVCCCNSMVCNAGGPTNLERDMLPDEVTE
EELPEGT

Genscan (HMM) output
Explanation
Gn.Ex : gene number, exon number (for reference)
Type : Init = Initial exon (ATG to 5' splice site)
       Intr = Internal exon (3' splice site to 5' splice site)
       Term = Terminal exon (3' splice site to stop codon)
       Sngl = Single-exon gene (ATG to stop)
       Prom = Promoter (TATA box / initiation site)
       PlyA = poly-A signal (consensus: AATAAA)
S : DNA strand (+ = input strand; - = opposite strand)
Begin : beginning of exon or signal (numbered on input strand)
End : end point of exon or signal (numbered on input strand)
Len : length of exon or signal (bp)
Fr : reading frame (a forward strand codon ending at x has frame x mod 3)
Ph : net phase of exon (exon length modulo 3)
I/Ac : initiation signal or 3' splice site score (tenth bit units)
Do/T : 5' splice site or termination signal score (tenth bit units)
CodRg : coding region score (tenth bit units)
P : probability of exon (sum over all parses containing exon)
Tscr : exon score (depends on length, I/Ac, Do/T and CodRg scores)
Comments
The SCORE of a predicted feature (e.g., exon or splice site) is a log-odds measure of the quality of the feature based on local sequence properties. For example, a predicted 5' splice site with score > 100 is strong; 50-100 is moderate; 0-50 is weak; and below 0 is poor (more than likely not a real donor site).
The PROBABILITY of a predicted exon is the estimated probability under GENSCAN's model of genomic sequence structure that the exon is correct. This probability depends in general on global as well as local sequence properties, e.g., it depends on how well the exon fits with neighboring exons. It has been shown that predicted exons with higher probabilities are more likely to be correct than those with lower probabilities.
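Two of the columns described above are simple enough to recompute by hand, which is a useful sanity check when reading GENSCAN tables. The sketch below is an illustration only: the function names are assumptions, the 50-100 band for "moderate" follows the standard GENSCAN output notes, and the exon sizes used are the annotated ones (73, 108, 296) from the GPI-anchored molecule-like protein record shown earlier in this transcript.

```python
def net_phase(exon_length):
    """Ph column: exon length modulo 3."""
    return exon_length % 3

def donor_strength(score):
    """Verbal class for a 5' splice site score given in tenth-bit units."""
    if score > 100:
        return "strong"
    if score >= 50:
        return "moderate"
    if score >= 0:
        return "weak"
    return "poor"

annotated_exons = [73, 108, 296]  # exon sizes from the annotated record
print([net_phase(n) for n in annotated_exons])
# The coding exons of a complete gene must add up to a whole number of codons.
print(sum(annotated_exons) % 3 == 0)
print(donor_strength(120), donor_strength(75), donor_strength(10), donor_strength(-5))
```

If the phases of a predicted exon chain do not add up to a multiple of 3, the parse cannot encode a complete protein in a single frame, which is one quick way to spot a mispredicted exon boundary.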

Genscan (HMM)
> 45774_HSAJ9610 protein_id:CAA ; Homo sapiens AIRE gene.;
intron(phase: ,size:418,246,383,753,1198,185,1026,1091,590,612,580,1879,1206,intr_sum:10167);
exon(size:132,175,156,75,114,146,81,116,100,183,32,103,63,72,ex_sum:1548);
{splice:gtag,gtag,gtag,gtag,gtag,gtag,gtag,gtag,gtag,gtag,cgag,gtag,gtag}
GENSCAN 1.0 Date run: 22-Mar-104 Time: 20:53:16
Sequence seq1 : bp : 64.78% C+G : Isochore 4 ( C+G%)
Parameter matrix: HumanIso.smat
Predicted genes/exons:
Gn.Ex Type S. Begin..End.Len Fr Ph I/Ac Do/T CodRg P... Tscr
Init Intr Intr Intr Intr Intr Intr Intr Intr Intr Intr Intr Intr Term
Some discrepancies between reality and prediction are shown in red

Genscan (HMM) output

Comparative gene prediction in eukaryotes
The rationale behind comparative, or similarity-based, gene prediction methods is that protein-coding regions of the genome are generally more conserved during evolution than non-coding regions. There are two main classes of similarity-based approaches to gene identification:
1. Comparison of the DNA query sequence with a protein or cDNA sequence, or with a DB of such sequences;
2. Comparison of two or more genomic sequences.
Both methods are also useful for prokaryotes. ORFs are prominent features of genes. BLASTX translates a genomic query in all 6 possible reading frames and then compares the translations against a protein DB. A similar approach is implemented in BLASTN and FASTA for comparisons with a DB of cDNA. Comparing a genomic query against a genomic target has become more popular in the last few years; it is based on the assumption that conserved regions tend to correspond to coding exons of homologous genes. A number of specialised programs have been developed. The ever-growing size of the databases makes these methods much more efficient and popular.
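The six-frame translation step that precedes a BLASTX-style protein search can be sketched in a few lines. This is an illustration of the idea, not BLASTX itself: the function names are assumptions, the standard genetic code is used, and any codon containing a non-ACGT character is translated as X. The test sequence is the start of the coding region (ATGGAGCTCCCATTG) from the GPI-anchored molecule-like protein record shown earlier in this transcript.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

def translate(seq):
    """Translate from position 0, codon by codon; '*' marks a stop codon."""
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))

def six_frame(seq):
    """Return translations for frames +1..+3 (input strand) and -1..-3 (opposite strand)."""
    seq = seq.upper()
    rc = reverse_complement(seq)
    frames = {}
    for k in range(3):
        frames[f"+{k + 1}"] = translate(seq[k:])
        frames[f"-{k + 1}"] = translate(rc[k:])
    return frames

for name, peptide in six_frame("ATGGAGCTCCCATTG").items():
    print(name, peptide)
```

Frame +1 of this fragment gives MELPL, which matches the start of the peptide reported in the Genscan output above; each of the six translations would then be compared against the protein DB.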

Measure of prediction accuracy
Accuracy can be evaluated at the nucleotide, exon, and gene levels. At each level there are two basic measures, sensitivity and specificity, which essentially measure prediction errors of the first and the second kind. Sensitivity is the proportion of real elements (e.g. exons) that have been correctly predicted. Specificity is the proportion of predicted elements that are correct. Both measures vary between 0 and 1, and prediction is perfect when both are equal to 1. Neither alone constitutes a good measure of global accuracy, since one can have high sensitivity with little specificity and vice versa. At the nucleotide level a single combined measure is the correlation coefficient:
CC = (TP × TN − FP × FN) / sqrt((TP + FN)(TN + FP)(TP + FP)(TN + FN))
TP – true positives, TN – true negatives, FP – false positives, FN – false negatives
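The measures defined on this slide can be computed directly from the four counts. The sketch below uses the definitions given in the text (specificity as the proportion of predictions that are correct) and, for the correlation coefficient, the standard formula used for nucleotide-level gene prediction accuracy, which is assumed here to be the one the slide showed. The function name and the example counts are hypothetical.

```python
import math

def prediction_accuracy(tp, tn, fp, fn):
    """Return (sensitivity, specificity, correlation coefficient).

    sensitivity = TP / (TP + FN)   proportion of real elements that were predicted
    specificity = TP / (TP + FP)   proportion of predicted elements that are real
    A measure whose denominator is zero is returned as NaN.
    """
    nan = float("nan")
    sn = tp / (tp + fn) if tp + fn else nan
    sp = tp / (tp + fp) if tp + fp else nan
    denom = math.sqrt((tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))
    cc = (tp * tn - fp * fn) / denom if denom else nan
    return sn, sp, cc

# Hypothetical nucleotide counts for one test sequence.
sn, sp, cc = prediction_accuracy(tp=450, tn=5000, fp=150, fn=50)
print(round(sn, 3), round(sp, 3), round(cc, 3))
```

Unlike sensitivity or specificity taken alone, the correlation coefficient drops when either false positives or false negatives grow, which is why it serves as the single summary figure at the nucleotide level.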