Genome Annotation
Annotation is the process of adding information to a DNA sequence. The information usually has a DNA coordinate. Features could be repeats, genes, promoters, protein domains and so on. Features can be linked to other databases, e.g. Pfam or PubMed.
Genome Annotation
  Genome databases
  The EMBL file format
  Editing EMBL files with Artemis
  The annotation process
  Common pitfalls
Public Databases
GenBank, EMBL and DDBJ. The three databases update each other automatically.
EMBL and TrEMBL
Patricia Rodriguez-Tomé, Peter J. Stoehr, Graham N. Cameron and Tomas P. Flores, "The European Bioinformatics Institute (EBI) databases", Nucleic Acids Res. 24:6-13, 1996.
EMBL currently contains 14366182 entries.
EMBL File
Contains:
  A header containing information about the sequence:
    Organism
    Authors
    References
    Comments
  A feature table containing:
    Sequence features and co-ordinates
Header File
ID   PFMAL1P4   standard; DNA; INV; 66441 BP.
XX
AC   AL031747;
SV   AL031747.8
DT   24-SEP-1998 (Rel. 57, Created)
DT   27-APR-2000 (Rel. 63, Last updated, Version 13)
DE   Plasmodium falciparum DNA from MAL1P4
KW   HTG; rifin; telomere; var; var-like hypothetical protein.
OS   Plasmodium falciparum (malaria parasite P. falciparum)
OC   Eukaryota; Alveolata; Apicomplexa; Haemosporida; Plasmodium.
RN   [1]
RA   Oliver K., Bowman S., Churcher C., Harris B., Harris D., Lawson D.,
RA   Quail M., Rajandream M., Barrell B.;
RT   ;
RL   Submitted (24-SEP-1998) to the EMBL/GenBank/DDBJ databases.
RL   P.falciparum Genome Sequencing Consortium, The Sanger Centre, Wellcome
RL   Trust Genome Campus, Hinxton, Cambridge CB10 1S.
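The two-letter line codes make the header easy to split by machine. The following is a minimal sketch in Python, assuming the usual flat-file layout in which every line starts with its code followed by spaces. The function name parse_header and the shortened sample text are illustrative only and are not part of any EMBL tool.

```python
from collections import defaultdict

# A few header lines copied from the example entry; the exact spacing
# after each line code is an assumption about the original file.
HEADER = """\
ID   PFMAL1P4   standard; DNA; INV; 66441 BP.
XX
AC   AL031747;
DE   Plasmodium falciparum DNA from MAL1P4
KW   HTG; rifin; telomere; var; var-like hypothetical protein.
OS   Plasmodium falciparum (malaria parasite P. falciparum)
"""


def parse_header(text):
    """Group header lines by their two-letter line code (ID, AC, DE, ...)."""
    fields = defaultdict(list)
    for line in text.splitlines():
        code, value = line[:2], line[2:].strip()
        if code == "XX" or not value:  # XX is only a spacer line
            continue
        fields[code].append(value)
    # Lines that repeat the same code are joined into one string.
    return {code: " ".join(values) for code, values in fields.items()}


header = parse_header(HEADER)
accession = header["AC"].rstrip(";")
keywords = [k.strip() for k in header["KW"].rstrip(".").split(";")]
# The sequence length is the last field of the ID line, e.g. "66441 BP."
length_bp = int(header["ID"].split(";")[-1].split()[0])

print(accession, length_bp, keywords)
```

Joining repeated codes into one string suits multi-line fields such as RA or RL; fields that repeat with distinct meanings (the two DT lines, for example) would need to be kept as a list instead.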
EMBL File Feature Table
A feature is anything that can have a coordinate on a DNA sequence.
  attenuator C_region CAAT_signal CDS conflict D-loop D_segment enhancer
  exon GC_signal gene iDNA intron J_segment LTR mat_peptide misc_binding
  misc_difference misc_feature misc_recomb misc_RNA misc_signal
  misc_structure modified_base mRNA N_region old_sequence polyA_signal
  polyA_site precursor_RNA prim_transcript primer_bind promoter
  protein_bind RBS repeat_region repeat_unit rep_origin rRNA S_region
  satellite scRNA sig_peptide snRNA snoRNA source stem_loop STS
  TATA_signal terminator transit_peptide tRNA unsure V_region V_segment
  variation 3'clip 3'UTR 5'clip 5'UTR -10_signal -35_signal
Feature qualifiers
Additional information about a feature.
  /allele="text"
  /citation=[number]
  /codon=(seq:"text",aa:<amino_acid>)
  /codon_start=<1
  /db_xref="<database>:<identifier>"
  /EC_number="text"
  /evidence=<evidence_value>
  /exception="text"
  /function="text"
  /gene="text"
  /label=feature_label
  /map="text"
  /note="text"
  /number=unquoted
  /product="text"
  /protein_id="<identifier>"
  /pseudo
  /standard_name="text"
  /translation="text"
  /transl_except=(pos:<base_range>,aa:<amino_acid>)
  /transl_table
  /usedin=accnum:feature_label
Features
Annotation in Artemis
FT   CDS             732..1415
FT                   /db_xref="IPR002038"
FT                   /gene="PfLtest.01"
FT                   /label=PfLtest.01
FT                   /note="PfLtest.01. len=227aa. Asp-rich protein.Predicted
FT                   by Genefinder, Phat and GlimmerM. Similar to Plasmodium
FT                   falciparum hypothetical 132.2 kDa protein TR:O97242
FT                   (EMBL:AL034558) (1114 aa) fasta scores: E(): 7.1e-21,
FT                   44.388% id in 196 aa."
FT                   /product="Asp-rich hypothetical protein"
FT                   /colour=10
FT                   /fasta_file="fasta/sanger_100kb.embl.seq.00001.out"
FT   misc_feature    complement(1855..1871)
FT                   /fasta_file="fasta/TEST100.tab.seq.00105.out"
FT   CDS             3151..4821
FT                   /gene="PfLtest.02"
FT                   /label=PfLtest.02
FT                   /note="PfLtest.02. len=556aa. Predicted by Genefinder,
FT                   Phat and GlimmerM. Unknown hypothetical protein"
FT                   /product="unknown hypothetical protein"
FT                   /colour=8
FT                   /fasta_file="fasta/sanger_100kb.embl.seq.00002.out"
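To make the structure of the feature table concrete, here is a rough Python sketch that reads FT lines into feature records. It assumes the flat-file layout in which a feature key starts in column 6 and qualifier lines are indented further, and it handles only simple start..end and complement(start..end) locations. The names parse_location and parse_features are made up for this example, and the sample text is a shortened copy of the entry above.

```python
import re

# Shortened copy of the example; column positions are an assumption.
FT_TEXT = """\
FT   CDS             732..1415
FT                   /gene="PfLtest.01"
FT                   /product="Asp-rich hypothetical protein"
FT                   /colour=10
FT   misc_feature    complement(1855..1871)
FT   CDS             3151..4821
FT                   /gene="PfLtest.02"
FT                   /note="PfLtest.02. len=556aa. Predicted by Genefinder,
FT                   Phat and GlimmerM. Unknown hypothetical protein"
"""

# Only plain ranges, optionally wrapped in complement(...), are recognised.
LOCATION = re.compile(r"^(complement\()?(\d+)\.\.(\d+)\)?$")


def parse_location(text):
    """Return (start, end, strand) for a simple location string."""
    match = LOCATION.match(text)
    if match is None:
        raise ValueError(f"unsupported location: {text!r}")
    strand = "-" if match.group(1) else "+"
    return int(match.group(2)), int(match.group(3)), strand


def parse_features(text):
    """Turn FT lines into a list of feature dictionaries."""
    features = []
    current = None  # name of the qualifier a continuation line belongs to
    for line in text.splitlines():
        if not line.startswith("FT"):
            continue
        if line[5:6].strip():  # text in column 6 means a new feature key
            key, location = line[2:].split()
            start, end, strand = parse_location(location)
            features.append({"key": key, "start": start, "end": end,
                             "strand": strand, "qualifiers": {}})
            current = None
            continue
        body = line[2:].strip()
        qualifiers = features[-1]["qualifiers"]
        if body.startswith("/"):  # start of a new qualifier
            name, _, value = body[1:].partition("=")
            qualifiers[name] = value
            current = name
        elif current is not None:  # wrapped text of the previous qualifier
            qualifiers[current] += " " + body
    for feature in features:  # drop the surrounding double quotes
        feature["qualifiers"] = {name: value.strip('"')
                                 for name, value in feature["qualifiers"].items()}
    return features


features = parse_features(FT_TEXT)
for feature in features:
    print(feature["key"], feature["start"], feature["end"], feature["strand"],
          feature["qualifiers"].get("gene", ""))
```

A wrapped /note line that happened to begin with a slash would be misread as a new qualifier, so a production parser should track open quotes rather than rely on the leading character.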
CDS features
CDS stands for coding sequence and is used to denote genes and pseudogenes. These features are automatically translated on submission, and the resulting protein is added to the protein databases.
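Because a CDS is translated directly from its coordinates, it is worth checking that the coordinates and the stated protein length agree. The sketch below uses the two CDS features from the Artemis example (732..1415 with len=227aa and 3151..4821 with len=556aa). It assumes a single-exon CDS whose coordinates include the stop codon; the function name cds_protein_length is illustrative.

```python
def cds_protein_length(start, end):
    """Number of amino acids encoded by a single-exon CDS.

    Coordinates are 1-based and inclusive, as in the feature table.
    Assumption: the range includes the stop codon, which is not counted.
    """
    n_bases = end - start + 1
    if n_bases % 3 != 0:
        raise ValueError(f"{n_bases} bases is not a whole number of codons")
    return n_bases // 3 - 1


print(cds_protein_length(732, 1415))   # 684 bases, 228 codons -> 227
print(cds_protein_length(3151, 4821))  # 1671 bases, 557 codons -> 556
```

Both values match the len= figures recorded in the /note qualifiers, which is a quick sanity check before submission.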
/note
The note field contains all the evidence for a gene call, plus anything else of interest:
  Similarity (FASTA or BLAST)
  Domain/motif information (Pfam, TMHMM etc.)
  Unusual features (repeats, amino acid richness)
/product
The name of the gene product, e.g. alcohol dehydrogenase.
Unless there is proof, we must qualify the name:
  Putative
  Possible
Always be conservative, e.g. putative dehydrogenase, dehydrogenase like protein.
This is the only piece of annotation added to the protein databases.
Naming protocols
  Hypothetical protein: unknown function and no homology.
  Conserved hypothetical protein: unknown function WITH homology.
  Alcohol dehydrogenase like: looks a bit like it, but may not be.
  Putative alcohol dehydrogenase: probably an alcohol dehydrogenase.
  Alcohol dehydrogenase: this has previously been characterised and shown to be alcohol dehydrogenase in this organism.
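The naming ladder above can be written down as a small lookup so that product names stay consistent across an annotation team. This is only a sketch: the evidence labels (no_homology, homology_only and so on) and the function product_name are invented for illustration, and the wording of each template follows the slide rather than any database rule.

```python
# Hypothetical evidence labels mapped to the product-name wording above.
TEMPLATES = {
    "no_homology": "hypothetical protein",
    "homology_only": "conserved hypothetical protein",
    "weak_similarity": "{name} like",
    "probable": "putative {name}",
    "characterised": "{name}",
}


def product_name(evidence, name=None):
    """Return a conservative /product string for the given evidence level."""
    if evidence not in TEMPLATES:
        raise ValueError(f"unknown evidence level: {evidence!r}")
    template = TEMPLATES[evidence]
    if "{name}" in template:
        if not name:
            raise ValueError("this evidence level needs a product name")
        return template.format(name=name)
    return template


print(product_name("no_homology"))
print(product_name("probable", "alcohol dehydrogenase"))
print(product_name("weak_similarity", "alcohol dehydrogenase"))
```

Keeping the wording in one table means a change of policy, such as preferring "possible" over "putative", is made in a single place.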
/gene
The gene name.
Only transfer a gene name if it is meaningful, e.g. ADH1.
Never transfer a gene name like PfB0024.
Is it a gene family? Make sure two genes have the same name.
Transitive Annotation
AKA annotation catastrophe.
Junk in = junk out.
Misannotations spread through incorrect database submissions.