Genome Annotation.


Genome Annotation

Genome Annotation
Annotation is the process of adding information to a DNA sequence.
Each piece of information (a feature) usually has a coordinate on the DNA sequence.
Features can be repeats, genes, promoters, protein domains and so on.
Features can be linked to other databases, e.g. Pfam or PubMed.

Genome Annotation
Genome databases
The EMBL file format
Editing EMBL files with Artemis
The annotation process
Common pitfalls

Public Databases
GenBank, EMBL and DDBJ.
The three databases exchange data and update each other automatically.

EMBL and TrEMBL
Patricia Rodriguez-Tomé, Peter J. Stoehr, Graham N. Cameron and Tomas P. Flores, "The European Bioinformatics Institute (EBI) databases", Nucleic Acids Res. 24:6-13, 1996.
EMBL currently contains 14366182 entries.

EMBL File
Contains:
A header containing:
  Information about the sequence
  Organism
  Authors
  References
  Comments
A feature table containing:
  Sequence features and coordinates

Header File
ID   PFMAL1P4   standard; DNA; INV; 66441 BP.
XX
AC   AL031747;
SV   AL031747.8
DT   24-SEP-1998 (Rel. 57, Created)
DT   27-APR-2000 (Rel. 63, Last updated, Version 13)
DE   Plasmodium falciparum DNA from MAL1P4
KW   HTG; rifin; telomere; var; var-like hypothetical protein.
OS   Plasmodium falciparum (malaria parasite P. falciparum)
OC   Eukaryota; Alveolata; Apicomplexa; Haemosporida; Plasmodium.
RN   [1]
RA   Oliver K., Bowman S., Churcher C., Harris B., Harris D., Lawson D.,
RA   Quail M., Rajandream M., Barrell B.;
RT   ;
RL   Submitted (24-SEP-1998) to the EMBL/GenBank/DDBJ databases.
RL   P.falciparum Genome Sequencing Consortium, The Sanger Centre, Wellcome
RL   Trust Genome Campus, Hinxton, Cambridge CB10 1S.
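The header layout above (a two-letter line code followed by a value) can be read with very little code. The following is a minimal sketch only, assuming the line code always occupies the first two characters; the function name `parse_embl_header` is hypothetical and not part of Artemis or any EMBL tool, and the sample is a shortened copy of the header shown on the slide.

```python
from collections import defaultdict


def parse_embl_header(text):
    """Group EMBL header lines by their two-letter line code (ID, AC, DE, ...)."""
    fields = defaultdict(list)
    for line in text.splitlines():
        code, value = line[:2], line[2:].strip()
        if code == "XX" or not code.strip():
            continue  # XX is a spacer line; blank lines carry no data
        fields[code].append(value)
    return dict(fields)


# Shortened copy of the header from the slide.
header = """\
ID   PFMAL1P4   standard; DNA; INV; 66441 BP.
XX
AC   AL031747;
DE   Plasmodium falciparum DNA from MAL1P4
OS   Plasmodium falciparum (malaria parasite P. falciparum)
RA   Oliver K., Bowman S., Churcher C., Harris B., Harris D., Lawson D.,
RA   Quail M., Rajandream M., Barrell B.;
"""

record = parse_embl_header(header)
entry_name = record["ID"][0].split()[0]
# The sequence length is the last ";"-separated item on the ID line, e.g. "66441 BP."
length_bp = int(record["ID"][0].rstrip(".").split(";")[-1].split()[0])
authors = " ".join(record["RA"])  # RA may span several lines

print(entry_name, length_bp)
```

Line codes that repeat (such as RA or RL) are kept as lists so that multi-line values can be joined by the caller.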

EMBL File Feature Table
A feature is anything that can have a coordinate on a DNA sequence. Feature keys:
attenuator, C_region, CAAT_signal, CDS, conflict, D-loop, D_segment, enhancer, exon, GC_signal, gene, iDNA, intron, J_segment, LTR, mat_peptide, misc_binding,
misc_difference, misc_feature, misc_recomb, misc_RNA, misc_signal, misc_structure, modified_base, mRNA, N_region, old_sequence, polyA_signal, polyA_site, precursor_RNA, prim_transcript, primer_bind, promoter, protein_bind, RBS, repeat_region, repeat_unit, rep_origin, rRNA, S_region, satellite, scRNA, sig_peptide, snRNA, snoRNA, source, stem_loop, STS, TATA_signal, terminator,
transit_peptide, tRNA, unsure, V_region, V_segment, variation, 3'clip, 3'UTR, 5'clip, 5'UTR, -10_signal, -35_signal

Feature qualifiers
Additional information about a feature:
/allele="text"
/citation=[number]
/codon=(seq:"text",aa:<amino_acid>)
/codon_start=<1
/db_xref="<database>:<identifier>"
/EC_number="text"
/evidence=<evidence_value>
/exception="text"
/function="text"
/gene="text"
/label=feature_label
/map="text"
/note="text"
/number=unquoted
/product="text"
/protein_id="<identifier>"
/pseudo
/standard_name="text"
/translation="text"
/transl_except=(pos:<base_range>,aa:<amino_acid>)
/transl_table
/usedin=accnum:feature_label

Features

Annotation in Artemis
FT   CDS             732..1415
FT                   /db_xref="IPR002038"
FT                   /gene="PfLtest.01"
FT                   /label=PfLtest.01
FT                   /note="PfLtest.01. len=227aa. Asp-rich protein.Predicted
FT                   by Genefinder, Phat and GlimmerM. Similar to Plasmodium
FT                   falciparum hypothetical 132.2 kDa protein TR:O97242
FT                   (EMBL:AL034558) (1114 aa) fasta scores: E(): 7.1e-21,
FT                   44.388% id in 196 aa."
FT                   /product="Asp-rich hypothetical protein"
FT                   /colour=10
FT                   /fasta_file="fasta/sanger_100kb.embl.seq.00001.out"
FT   misc_feature    complement(1855..1871)
FT                   /fasta_file="fasta/TEST100.tab.seq.00105.out"
FT   CDS             3151..4821
FT                   /gene="PfLtest.02"
FT                   /label=PfLtest.02
FT                   /note="PfLtest.02. len=556aa. Predicted by Genefinder,
FT                   Phat and GlimmerM. Unknown hypothetical protein"
FT                   /product="unknown hypothetical protein"
FT                   /colour=8
FT                   /fasta_file="fasta/sanger_100kb.embl.seq.00002.out"
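To make the structure of these feature table lines concrete, here is a rough sketch of a reader for them. It assumes simple locations of the form start..end, optionally wrapped in complement(...), and quoted values that may continue over several FT lines; joins and other location operators are not handled. The names `parse_feature_table`, `location_span` and `coding_length_aa` are hypothetical, and the sample is an abridged copy of the Artemis entry on the slide.

```python
import re


def parse_feature_table(text):
    """Read FT lines into a list of {key, location, qualifiers} dictionaries."""
    features = []
    feature = None
    open_qualifier = None  # quoted qualifier still waiting for its closing quote
    for line in text.splitlines():
        if not line.startswith("FT"):
            continue
        body = line[2:].strip()
        if open_qualifier is not None:
            # Continuation line of a multi-line quoted value.
            feature["qualifiers"][open_qualifier] += " " + body
            if body.endswith('"'):
                open_qualifier = None
        elif body.startswith("/"):
            name, _, value = body[1:].partition("=")
            feature["qualifiers"][name] = value
            if value.startswith('"') and not (len(value) > 1 and value.endswith('"')):
                open_qualifier = name
        else:
            key, location = body.split(None, 1)
            feature = {"key": key, "location": location, "qualifiers": {}}
            features.append(feature)
    for item in features:
        item["qualifiers"] = {k: v.strip('"') for k, v in item["qualifiers"].items()}
    return features


def location_span(location):
    """Return (start, end, is_complement) for a simple start..end location."""
    start, end = map(int, re.search(r"(\d+)\.\.(\d+)", location).groups())
    return start, end, location.startswith("complement(")


def coding_length_aa(location):
    """Protein length implied by a CDS location, excluding the stop codon."""
    start, end, _ = location_span(location)
    return (end - start + 1) // 3 - 1


# Abridged copy of the feature table from the slide.
table = """\
FT   CDS             732..1415
FT                   /gene="PfLtest.01"
FT                   /note="PfLtest.01. len=227aa. Asp-rich protein.Predicted
FT                   by Genefinder, Phat and GlimmerM."
FT                   /product="Asp-rich hypothetical protein"
FT   misc_feature    complement(1855..1871)
FT   CDS             3151..4821
FT                   /gene="PfLtest.02"
"""

features = parse_feature_table(table)
for f in features:
    if f["key"] == "CDS":
        print(f["qualifiers"]["gene"], coding_length_aa(f["location"]))
```

The computed lengths agree with the len= values recorded in the /note fields (227 aa and 556 aa), which is a quick sanity check that coordinates and notes have not drifted apart during editing.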

CDS features
CDS stands for coding sequence and is used to denote genes and pseudogenes.
These features are automatically translated on submission, and the resulting protein is added to the protein databases.
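The automatic translation of a CDS can be illustrated with a small sketch. It assumes the standard genetic code (translation table 1) and a forward-strand CDS with no introns; the function name `translate` and the toy sequence are made up for illustration and do not come from the slides or from any database's submission software.

```python
from itertools import product

# Standard genetic code (translation table 1), codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(cds, codon_start=1):
    """Translate a forward-strand CDS; codon_start is 1-based as in /codon_start."""
    seq = cds.upper()[codon_start - 1:]
    usable = len(seq) - len(seq) % 3  # ignore a trailing partial codon
    # Codons containing ambiguous bases are reported as X.
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, usable, 3))


protein = translate("ATGGATGATTGGTAA")  # hypothetical toy CDS
print(protein)  # MDDW*
```

A CDS on the reverse strand would need to be reverse-complemented first, and an organism with a different genetic code would need a different table, which is what the /transl_table qualifier records.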

/note
The note field contains all the evidence for a gene call, plus anything else:
Similarity (fasta or blast)
Domain/motif information (Pfam, TMHMM etc.)
Unusual features (repeats, amino acid richness)

/product
The name of the gene product, e.g. alcohol dehydrogenase.
Unless there is proof, we must qualify the name: "putative", "possible".
Always be conservative, e.g. "putative dehydrogenase" or "dehydrogenase-like protein".
This is the only piece of annotation added to the protein databases.

Naming protocols
Hypothetical protein: unknown function and no homology.
Conserved hypothetical protein: unknown function WITH homology.
Alcohol dehydrogenase-like: looks a bit like it, but may not be.
Putative alcohol dehydrogenase: probably an alcohol dehydrogenase.
Alcohol dehydrogenase: this has previously been characterised and shown to be alcohol dehydrogenase in this organism.
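The naming protocol above amounts to a lookup from evidence level to product name. A small sketch of that idea follows; the evidence labels ("none", "homology_only", and so on) and the function name `product_name` are invented here for illustration and are not an established vocabulary.

```python
# Hypothetical evidence labels mapped to the product-name patterns listed above.
NAME_TEMPLATES = {
    "none": "hypothetical protein",                     # unknown function, no homology
    "homology_only": "conserved hypothetical protein",  # unknown function, with homology
    "weak_similarity": "{function}-like protein",       # looks a bit like it
    "strong_similarity": "putative {function}",         # probably this function
    "characterised": "{function}",                      # shown experimentally in this organism
}


def product_name(evidence, function=""):
    """Return a conservative /product value for the given evidence level."""
    if evidence not in NAME_TEMPLATES:
        raise ValueError(f"unknown evidence level: {evidence!r}")
    template = NAME_TEMPLATES[evidence]
    if "{function}" in template and not function:
        raise ValueError("a function name is required for this evidence level")
    return template.format(function=function)


print(product_name("strong_similarity", "alcohol dehydrogenase"))
print(product_name("none"))
```

Keeping the rule in one table makes it harder for an unqualified name to slip through when the evidence is only a similarity hit.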

/gene
The gene name.
Only transfer a gene name if it is meaningful, e.g. ADH1.
Never transfer a gene name like PfB0024.
Is it a gene family? Make sure two genes have the same name.
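One way to screen candidate names before transferring them is a simple pattern check. The sketch below is a heuristic only: it assumes that a systematic identifier looks like a run of letters followed by four or more digits (as PfB0024 does), which is a guess rather than a documented naming scheme, and `transferable_gene_name` is a hypothetical helper.

```python
import re

# Assumed shape of a systematic identifier: letters, then at least four digits,
# optionally followed by one lowercase letter. This is a heuristic, not a standard.
SYSTEMATIC_ID = re.compile(r"^[A-Za-z]+\d{4,}[a-z]?$")


def transferable_gene_name(name):
    """Return True if the name looks meaningful rather than a systematic identifier."""
    return bool(name) and not SYSTEMATIC_ID.match(name)


for candidate in ("ADH1", "PfB0024"):
    print(candidate, transferable_gene_name(candidate))
```

A check like this only flags obvious identifiers; whether a name is truly meaningful still needs a curator's judgement.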

Transitive Annotation
Also known as annotation catastrophe.
Junk in = junk out.
Misannotations spread through incorrect database submissions.