Automatic annotation of biological sequences: ontologies and pipeline systems. Arthur Gruber, Instituto de Ciências Biomédicas, Universidade de São Paulo.


Sequence annotation: Annotation is the process of adding information to a DNA sequence. The information usually has DNA coordinates. Features could be repeats, genes, promoters, protein domains and so on. Features can be linked to other databases, e.g. Pfam or PubMed. AG-ICB-USP

Public databases GenBank, EMBL and DDBJ. All databases update each other automatically AG-ICB-USP

Feature table Format definition Covers DDBJ/EMBL/GenBank Defines all accepted annotation terms and hierarchy AG-ICB-USP

Annotation file: Contains a header (information about the sequence, organism, authors, references, comments) and a feature table (sequence features and coordinates).

Header (EMBL)
ID   PFMAL1P4 standard; DNA; INV; BP.
XX
AC   AL031747;
XX
SV   AL
XX
DT   24-SEP-1998 (Rel. 57, Created)
DT   27-APR-2000 (Rel. 63, Last updated, Version 13)
XX
DE   Plasmodium falciparum DNA from MAL1P4
XX
KW   HTG; rifin; telomere; var; var-like hypothetical protein.
XX
OS   Plasmodium falciparum (malaria parasite P. falciparum)
OC   Eukaryota; Alveolata; Apicomplexa; Haemosporida; Plasmodium.
XX
RN   [1]
RA   Oliver K., Bowman S., Churcher C., Harris B., Harris D., Lawson D.,
RA   Quail M., Rajandream M., Barrell B.;
RT   ;
RL   Submitted (24-SEP-1998) to the EMBL/GenBank/DDBJ databases.
RL   P.falciparum Genome Sequencing Consortium, The Sanger Centre, Wellcome
RL   Trust Genome Campus, Hinxton, Cambridge CB10 1S.
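A minimal Python sketch of how the two-letter line codes in an EMBL header can be grouped into fields. It assumes each header line starts with its code followed by spaces, and that `XX` lines are spacers; the function name `parse_embl_header` is hypothetical, and only a few lines of the header above are used as input.

```python
from collections import defaultdict


def parse_embl_header(lines):
    """Group EMBL header lines by their two-letter line code (ID, AC, DT, ...)."""
    fields = defaultdict(list)
    for line in lines:
        code, _, text = line.partition(" ")
        text = text.strip()
        # XX lines are spacers and carry no information.
        if code == "XX" or not text:
            continue
        fields[code].append(text)
    return dict(fields)


header = [
    "ID   PFMAL1P4 standard; DNA; INV; BP.",
    "XX",
    "AC   AL031747;",
    "XX",
    "DT   24-SEP-1998 (Rel. 57, Created)",
    "DT   27-APR-2000 (Rel. 63, Last updated, Version 13)",
    "XX",
    "OS   Plasmodium falciparum (malaria parasite P. falciparum)",
]

fields = parse_embl_header(header)
print(fields["AC"])
print(fields["DT"])
```

Repeated codes such as `DT` are kept as a list in file order, which matches how multi-line fields appear in the flat file.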

NCBI Header
LOCUS       PFMAL1P bp DNA linear INV 02-DEC-2004
DEFINITION  Plasmodium falciparum DNA from MAL1P4, complete sequence.
ACCESSION   AL AL
VERSION     AL GI:
KEYWORDS    HTG; rifin; telomere; var; var-like hypothetical protein.
SOURCE      Plasmodium falciparum 3D7
  ORGANISM  Plasmodium falciparum 3D7
            Eukaryota; Alveolata; Apicomplexa; Haemosporida; Plasmodium.
REFERENCE   1
  AUTHORS   Hall,N., Pain,A., Berriman,M., Churcher,C., Harris,B., Harris,D.,
  TITLE     Sequence of Plasmodium falciparum chromosomes 1, 3-9 and 13
  JOURNAL   Nature 419 (6906), (2002)
  PUBMED
REFERENCE   2
  AUTHORS   Oliver,K., Pain,A., Berriman,M., Bowman,S., Churcher,C., Harris,B.,
            Harris,D., Lawson,D., Quail,M., Rajandream,M., Hall,N. and Barrell,B.
  TITLE     Direct Submission
  JOURNAL   Submitted (24-SEP-1998) P.falciparum Genome Sequencing Consortium,
            The Sanger Centre, Wellcome Trust Genome Campus, Hinxton,
            Cambridge CB10 1SA, UK
COMMENT     On Oct 2, 2002 this sequence version replaced gi:
            For more information about this sequence or the Malaria Project, see

Feature: A region of DNA that was annotated with a key and qualifiers. Keys: CDS, intron, miscellaneous, etc. Qualifier: notes or extra information about a feature, e.g. exon (key) /gene="adh" (qualifier).
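The key/qualifier idea can be sketched in Python as a small record type. This is only an illustration: the `Feature` class, the `parse_qualifier` helper and the location `100..250` are hypothetical, not part of any database toolkit.

```python
from dataclasses import dataclass, field


@dataclass
class Feature:
    key: str                      # feature key, e.g. "exon" or "CDS"
    location: str                 # coordinates on the sequence
    qualifiers: dict = field(default_factory=dict)


def parse_qualifier(text):
    """Split '/name="value"' into (name, value); a bare flag such as '/pseudo' maps to True."""
    body = text.lstrip("/")
    name, sep, value = body.partition("=")
    if not sep:
        return name, True
    return name, value.strip('"')


# The slide's example: an exon (key) carrying /gene="adh" (qualifier).
exon = Feature(key="exon", location="100..250")  # location is made up
name, value = parse_qualifier('/gene="adh"')
exon.qualifiers[name] = value
print(exon)
```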

Feature keys: 3'clip, 3'UTR, 5'clip, 5'UTR, -10_signal, -35_signal, attenuator, C_region, CAAT_signal, CDS, conflict, D-loop, D_segment, enhancer, exon, GC_signal, gene, iDNA, intron, J_segment, LTR, mat_peptide, misc_binding, misc_difference, misc_feature, misc_recomb, misc_RNA, misc_signal, misc_structure, modified_base, mRNA, N_region, old_sequence, polyA_signal, polyA_site, precursor_RNA, prim_transcript, primer_bind, promoter, protein_bind, RBS, repeat_region, repeat_unit, rep_origin, rRNA, S_region, satellite, scRNA, sig_peptide, snRNA, snoRNA, source, stem_loop, STS, TATA_signal, terminator, transit_peptide, tRNA, unsure, V_region, V_segment, variation.

Feature qualifier Additional information about a feature /allele="text" /citation=[number] /codon=(seq:"text",aa: ) /codon_start= : " /EC_number="text" /evidence= /exception="text" /function="text" /gene="text" /label=feature_label /map="text" /note="text" /number=unquoted /product="text" /protein_id=" " /pseudo /standard_name="text" /translation="text" /transl_except=(pos:,aa: ) /transl_table /usedin=accnum:feature_label AG-ICB-USP

Features (EMBL) AG-ICB-USP

Features (NCBI)
FEATURES             Location/Qualifiers
     source
                     /organism="Plasmodium falciparum 3D7"
                     /mol_type="genomic DNA"
                     /isolate="3D7"
                     /db_xref="taxon:36329"
                     /chromosome="1"
     repeat_region
                     /note="telomeric repeat"
     repeat_region
                     /note="14bp repeat"
     gene            join( , )
                     /gene="MAL1P4.01"
                     /note="synonyms: PFA0005w, VAR"
     CDS             join( , )
                     /gene="MAL1P4.01"
                     /note="Subtelomeric var gene Pfam hit to PF03011 Similar to
                     Plasmodium falciparum VaR, mal1p4.01 vaR SWALL:Q9NFB6
                     (EMBL:AL031747) (2163 aa) fasta scores: E(): 0, 100% id in
                     2163 aa"
                     /codon_start=1
                     /product="erythrocyte membrane protein 1 (PfEMP1)"
                     /protein_id="CAB "
                     /db_xref="GI: "
                     /db_xref="GOA:Q9NFB6"
                     /db_xref="UniProtKB/TrEMBL:Q9NFB6"
                     /translation="MVTQSSGGGAAGSSGEEDAKHVLDEFGQQVYNEKVEKYANSKIY
                     KEALKGDLSQASILSELAGTYKPCALEYEYYKHTNGGGKGKRYPCTELGEKVEPRFSDTLGGQCTNK
                     KIEGNKYIKGKDVGACAPYRRLHLCSHNLESIQ

CDS features CDS stands for coding sequence and is used to denote genes and pseudogenes. These features are automatically translated on submission and the protein added to the protein databases. AG-ICB-USP

/note: The note field contains all the evidence for a gene call, plus anything else: similarity (fasta or blast); domain/motif information (Pfam, TMHMM, etc.); unusual features (repeats, aa richness).

/product: The name of the gene product, e.g. alcohol dehydrogenase. Unless there is proof, we must qualify the name (putative, possible). Always be conservative, e.g. putative dehydrogenase, dehydrogenase-like protein. This is the only piece of annotation added to the protein databases.

Naming protocols
Hypothetical protein: unknown function and no homology.
Conserved hypothetical protein: unknown function WITH homology.
Alcohol dehydrogenase like: looks a bit like it, but may not be.
Putative alcohol dehydrogenase: probably an alcohol dehydrogenase.
Alcohol dehydrogenase: this has previously been characterised and shown to be alcohol dehydrogenase in this organism.
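The naming rules can be written down as a small decision function. This is a hedged Python sketch: the function `product_name` and the confidence labels (`"characterised"`, `"probable"`, `"weak"`) are assumptions made for illustration; how strong the evidence is remains the annotator's judgement.

```python
def product_name(function=None, has_homology=False, confidence="weak"):
    """Pick a conservative /product name.

    function:     best functional guess (e.g. "alcohol dehydrogenase"), or None
    has_homology: whether similar sequences exist in the databases
    confidence:   "characterised", "probable" or "weak" (hypothetical labels)
    """
    if function is None:
        if has_homology:
            return "conserved hypothetical protein"
        return "hypothetical protein"
    if confidence == "characterised":
        # Only when the function was shown experimentally in this organism.
        return function
    if confidence == "probable":
        return f"putative {function}"
    # Anything weaker gets the most cautious name.
    return f"{function} like"


print(product_name())
print(product_name(has_homology=True))
print(product_name("alcohol dehydrogenase", True, "probable"))
```

Defaulting to the weakest label follows the "always be conservative" rule: a name is only upgraded when the caller states stronger evidence.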

/gene The gene name eg ADH1 Only transfer a gene name if it is meaningful Never transfer a gene name like PfB0024. Is it a gene family? make sure two genes have the same name. AG-ICB-USP

Transitive Annotation AKA annotation catastrophe Junk in = Junk out Mis-annotations spread through incorrect database submissions. AG-ICB-USP

How can we standardize the annotation terms? AG-ICB-USP

Through a dynamic controlled vocabulary AG-ICB-USP

So what does that mean? From a practical view, an ontology is the representation of something we know about. "Ontologies" consist of a representation of things that are detectable or directly observable, and the relationships between those things.

Ontology Structure: Directed Acyclic Graph (DAG), multiple parentage allowed. [Figure labels: cell, membrane, chloroplast, mitochondrial, chloroplast membrane]

GO topology The ontologies are structured as directed acyclic graphs Similar to hierarchies but differ in that a more specialized term (child) can be related to more than one less specialized term (parent). For example, hexose biosynthetic process has two parents, hexose metabolic process and monosaccharide biosynthetic process. AG-ICB-USP
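A short Python sketch of what multiple parentage means in practice: collecting all ancestors of a term requires walking every parent, not a single chain. The `parents` dictionary is a hand-written fragment built around the hexose example, and the shared grandparent term is included only for illustration.

```python
# child term -> list of parent terms (a term may have more than one parent)
parents = {
    "hexose biosynthetic process": [
        "hexose metabolic process",
        "monosaccharide biosynthetic process",
    ],
    "hexose metabolic process": ["monosaccharide metabolic process"],
    "monosaccharide biosynthetic process": ["monosaccharide metabolic process"],
    "monosaccharide metabolic process": [],
}


def ancestors(term, parents):
    """Return every term reachable by following parent links upward."""
    seen = set()
    stack = list(parents.get(term, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


print(sorted(ancestors("hexose biosynthetic process", parents)))
```

The `seen` set matters in a DAG: the shared grandparent is reached through both parents but is reported only once.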

True Path Violations Create Incorrect Definitions: "the pathway from a child term all the way up to its top-level parent(s) must always be true". [Diagram: chromosome part_of nucleus]

True Path Violations: [Diagram: mitochondrial chromosome is_a chromosome]

True Path Violations: [Diagram: mitochondrial chromosome is_a chromosome; chromosome part_of nucleus] A mitochondrial chromosome is not part of a nucleus!

True Path Violations: [Diagram, corrected: nuclear chromosome is_a chromosome and part_of nucleus; mitochondrial chromosome is_a chromosome and part_of mitochondrion]
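The chromosome example can be replayed in a few lines of Python. The sketch below assumes the usual inference rule that following is_a and part_of links upward yields a part_of statement as soon as one part_of link has been crossed; the function name `implied_part_of` and the two edge dictionaries are hypothetical.

```python
def implied_part_of(term, edges):
    """Return every term that `term` is inferred to be part_of.

    edges maps a term to a list of (relationship, parent) pairs.
    """
    result = set()
    seen = set()
    stack = [(term, False)]  # (node, crossed a part_of link yet?)
    while stack:
        node, crossed = stack.pop()
        if (node, crossed) in seen:
            continue
        seen.add((node, crossed))
        for relationship, parent in edges.get(node, []):
            now_crossed = crossed or relationship == "part_of"
            if now_crossed:
                result.add(parent)
            stack.append((parent, now_crossed))
    return result


# Violating structure: part_of nucleus is attached to the general term.
bad = {
    "mitochondrial chromosome": [("is_a", "chromosome")],
    "chromosome": [("part_of", "nucleus")],
}

# Corrected structure: part_of is attached to the specific child terms.
fixed = {
    "nuclear chromosome": [("is_a", "chromosome"), ("part_of", "nucleus")],
    "mitochondrial chromosome": [("is_a", "chromosome"), ("part_of", "mitochondrion")],
    "chromosome": [],
}

print(implied_part_of("mitochondrial chromosome", bad))
print(implied_part_of("mitochondrial chromosome", fixed))
```

The code cannot tell which inference is biologically false; it only makes the inherited statements explicit so that a curator can spot the one that is not always true.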

GO Definitions: Each GO term has 2 definitions. (1) A written definition, by a biologist: necessary and sufficient conditions (not computable). (2) The graph structure: necessary conditions, formal (computable).

Term-term relationship is_a: The is_a relationship is a simple class-subclass relationship, where A is_a B means that A is a subclass of B. For example, nuclear chromosome is_a chromosome. Terms shown: GO: : intracellular non-membrane-bound organelle; GO: : chromosome; GO: : nuclear chromosome.

Term-term relationship part_of: C part_of D means that whenever C is present, it is always a part of D, but C does not always have to be present. For example, periplasmic flagellum part_of periplasmic space. Terms shown: GO: : cell part; GO: : cell projection; GO: : flagellum; GO: : flagellin-based flagellum; GO: : periplasmic flagellum; and GO: : periplasmic space; GO: : periplasmic flagellum.

Current Ontologies Molecular function: tasks performed by gene product Biological process: broad biological goals accomplished by ordered assemblies of molecular functions Cellular component: subcellular structures, locations and macromolecular complexes AG-ICB-USP

Search result for toxin AG-ICB-USP

Relationships in GO “is-a” “part of” AG-ICB-USP

GO paths to terms AG-ICB-USP

GO definitions AG-ICB-USP

Pyruvate dehydrogenase AG-ICB-USP

Why the interest in GO?
● Universal ontology
● Functional classification scheme with many different levels in a DAG
● Widespread interest from the scientific community
● Mappings to SP keywords already exist, as do gene-product annotations for some organisms

GO Evidence codes
Experimental Evidence Codes: EXP: Inferred from Experiment; IDA: Inferred from Direct Assay; IPI: Inferred from Physical Interaction; IMP: Inferred from Mutant Phenotype; IGI: Inferred from Genetic Interaction; IEP: Inferred from Expression Pattern
Computational Analysis Evidence Codes: ISS: Inferred from Sequence or Structural Similarity; ISO: Inferred from Sequence Orthology; ISA: Inferred from Sequence Alignment; ISM: Inferred from Sequence Model; IGC: Inferred from Genomic Context; RCA: Inferred from Reviewed Computational Analysis
Author Statement Evidence Codes: TAS: Traceable Author Statement; NAS: Non-traceable Author Statement
Curator Statement Evidence Codes: IC: Inferred by Curator; ND: No biological Data available
Automatically-assigned Evidence Codes: IEA: Inferred from Electronic Annotation
Obsolete Evidence Codes: NR: Not Recorded
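A possible way to use the evidence codes in a pipeline is to group them and filter annotations by group, for instance to set aside purely electronic (IEA) assignments. The Python sketch below is illustrative only: the group labels, the helper names and the three example annotations (with made-up gene identifiers) are assumptions.

```python
EVIDENCE_GROUPS = {
    "experimental": {"EXP", "IDA", "IPI", "IMP", "IGI", "IEP"},
    "computational": {"ISS", "ISO", "ISA", "ISM", "IGC", "RCA"},
    "author statement": {"TAS", "NAS"},
    "curator statement": {"IC", "ND"},
    "automatic": {"IEA"},
    "obsolete": {"NR"},
}


def evidence_group(code):
    """Return the group an evidence code belongs to."""
    for group, codes in EVIDENCE_GROUPS.items():
        if code in codes:
            return group
    raise ValueError(f"unknown evidence code: {code}")


def manually_reviewed(annotations):
    """Keep annotations whose evidence code is neither automatic nor obsolete."""
    return [a for a in annotations
            if evidence_group(a[2]) not in ("automatic", "obsolete")]


# (gene product, GO term, evidence code); gene identifiers are hypothetical.
annotations = [
    ("geneA", "oxidoreductase activity", "IDA"),
    ("geneB", "oxidoreductase activity", "IEA"),
    ("geneC", "nuclear chromosome", "ISS"),
]

print(manually_reviewed(annotations))
```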

Current Mappings to GO Consortium mappings -MGD, SGD, FlyBase Swiss-Prot keywords EC numbers InterPro entries Medline ID Commercial companies -CompuGen, Proteome AG-ICB-USP

InterPro-to-GO

EC number-to-GO AG-ICB-USP

SP keyword-to-GO AG-ICB-USP

GO doesn’t cover… Gene products: e.g. cytochrome c is not in the ontologies, but attributes of cytochrome c, such as oxidoreductase activity, are. Processes, functions or components that are unique to mutants or diseases: e.g. oncogenesis is not a valid GO term because causing cancer is not the normal function of any gene. Attributes of sequence such as intron/exon parameters: these are not attributes of gene products and will be described in a separate sequence ontology (see Sequence Ontology). Protein domains or structural features. Protein-protein interactions. Environment, evolution and expression. Anatomical or histological features above the level of cellular components, including cell types. AG-ICB-USP

Sequence Ontology: The four major aspects of the complete Sequence Ontology are: located sequence features, for objects that can be located on sequence in coordinates; sequence attributes, for describing the properties of features; consequences of mutation, for the annotation of the effects of a mutation; and chromosome variation, to describe large-scale variations.

Sequence Ontology: How to edit an ontology file? OBO-Edit, an ontology editor for biologists. OBO-Edit compliant format.

Generic feature format 3: A generic format for sequence annotation interchange. Tab-delimited text file. Represents features in a hierarchical view. Uses a controlled vocabulary and is compliant with the Sequence Ontology.

Generic feature format 3: The tab-delimited file has 9 columns:
Column 1: "seqid"
Column 2: "source"
Column 3: "type"
Columns 4 & 5: "start" and "end"
Column 6: "score"
Column 7: "strand". The strand of the feature: + for positive strand (relative to the landmark), - for minus strand
Column 8: "phase"
Column 9: "attributes"
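The nine columns map naturally onto a small parser. The Python sketch below is an illustration, not a full GFF3 implementation: it assumes one feature per line, `tag=value` pairs separated by `;` in column 9, and comma-separated multiple values. The function name `parse_gff3_line` is hypothetical and the coordinates in the example line are illustrative.

```python
from urllib.parse import unquote

COLUMNS = ("seqid", "source", "type", "start", "end",
           "score", "strand", "phase", "attributes")


def parse_gff3_line(line):
    """Split one GFF3 feature line into a dict keyed by column name."""
    parts = line.rstrip("\n").split("\t")
    if len(parts) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(parts)}")
    record = dict(zip(COLUMNS, parts))
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    attributes = {}
    for pair in record["attributes"].split(";"):
        if pair:
            tag, _, value = pair.partition("=")
            # A tag such as Parent may list several comma-separated values.
            attributes[tag] = [unquote(v) for v in value.split(",")]
    record["attributes"] = attributes
    return record


# Illustrative line; the coordinates are assumed for the example.
line = "ctg123\t.\texon\t1300\t1500\t.\t+\t.\tID=exon00002;Parent=mRNA00001,mRNA00002"
record = parse_gff3_line(line)
print(record["type"], record["start"], record["end"], record["attributes"]["Parent"])
```

Undefined columns are left as the literal `.` string, so the caller decides how to treat a missing score or phase.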


How to annotate these splicing variants using Sequence Ontology terms and the GFF3?

The annotated genome region is named "ctg123". A gene named EDEN extends from coordinates 1000 to 9000. The gene encodes three alternatively-spliced variants: EDEN.1, EDEN.2 and EDEN.3. Transcript EDEN.3 presents two alternative translation start points. There is a transcriptional factor binding site (a promoter) located 50 bp upstream of the transcriptional start site of EDEN.1.

##gff-version 3
##sequence-region ctg
ctg123. gene ID=gene00001;Name=EDEN
ctg123. TF_binding_site ID=tfbs00001;Parent=gene00001
ctg123. mRNA ID=mRNA00001;Parent=gene00001;Name=EDEN.1
ctg123. mRNA ID=mRNA00002;Parent=gene00001;Name=EDEN.2
ctg123. mRNA ID=mRNA00003;Parent=gene00001;Name=EDEN.3
ctg123. exon ID=exon00001;Parent=mRNA00003
ctg123. exon ID=exon00002;Parent=mRNA00001,mRNA00002
ctg123. exon ID=exon00003;Parent=mRNA00001,mRNA00003
ctg123. exon ID=exon00004;Parent=mRNA00001,mRNA00002,mRNA00003
ctg123. exon ID=exon00005;Parent=mRNA00001,mRNA00002,mRNA00003
ctg123. CDS ID=cds00001;Parent=mRNA00001;Name=edenprotein.1
ctg123. CDS ID=cds00001;Parent=mRNA00001;Name=edenprotein.1
ctg123. CDS ID=cds00001;Parent=mRNA00001;Name=edenprotein.1
ctg123. CDS ID=cds00001;Parent=mRNA00001;Name=edenprotein.1
ctg123. CDS ID=cds00002;Parent=mRNA00002;Name=edenprotein.2
ctg123. CDS ID=cds00002;Parent=mRNA00002;Name=edenprotein.2
ctg123. CDS ID=cds00002;Parent=mRNA00002;Name=edenprotein.2
ctg123. CDS ID=cds00003;Parent=mRNA00003;Name=edenprotein.3
ctg123. CDS ID=cds00003;Parent=mRNA00003;Name=edenprotein.3
ctg123. CDS ID=cds00003;Parent=mRNA00003;Name=edenprotein.3
ctg123. CDS ID=cds00004;Parent=mRNA00003;Name=edenprotein.4
ctg123. CDS ID=cds00004;Parent=mRNA00003;Name=edenprotein.4
ctg123. CDS ID=cds00004;Parent=mRNA00003;Name=edenprotein.4
(The start, end, score, strand and phase columns are missing from this transcript.)
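The hierarchy in a GFF3 file is carried entirely by the ID and Parent tags, so the exon composition of each splice variant can be recovered from column 9 alone. A minimal Python sketch, using the attribute strings of the five exon records; the helper name `children_by_parent` is hypothetical.

```python
from collections import defaultdict

# Column 9 of the five exon records in the EDEN example.
exon_attributes = [
    "ID=exon00001;Parent=mRNA00003",
    "ID=exon00002;Parent=mRNA00001,mRNA00002",
    "ID=exon00003;Parent=mRNA00001,mRNA00003",
    "ID=exon00004;Parent=mRNA00001,mRNA00002,mRNA00003",
    "ID=exon00005;Parent=mRNA00001,mRNA00002,mRNA00003",
]


def children_by_parent(attribute_columns):
    """Map each Parent ID to the list of feature IDs that point to it."""
    children = defaultdict(list)
    for column in attribute_columns:
        attrs = dict(pair.split("=", 1) for pair in column.split(";") if pair)
        for parent in attrs.get("Parent", "").split(","):
            if parent:
                children[parent].append(attrs["ID"])
    return dict(children)


exons = children_by_parent(exon_attributes)
for mrna in sorted(exons):
    print(mrna, exons[mrna])
```

Because one exon record can name several parents, shared exons are written once and reused by every transcript that contains them.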

Generic feature format 3: If you write a GFF file, you can test it! There is an online validator:

Testing the GFF3 Validator


Let’s change the feature names

Annotation viewing and editing Artemis Artemis is a free genome viewer and annotation tool developed by Kim Rutherford (Sanger Institute, UK). It allows for visualization of sequence features and results of analyses, in the context of the sequence and its six-frame translation. AG-ICB-USP

Annotation viewing and editing Artemis Artemis is written in Java, and is available for UNIX, GNU/Linux, BSD, Macintosh and MS-Windows systems. It can read complete EMBL and GENBANK database entries or sequence in FASTA or raw format. Extra sequence features can be in EMBL, GENBANK or GFF format. AG-ICB-USP
