An introduction to DNA and Protein Sequence Databases
Alastair Kerr, Ph.D., WTCCB Bioinformatics Core

Questions to address
- What are the main sequence databases?
- Which one to use for:
  - Looking up a gene name/identifier from a paper
- Identifiers
  - What should I use and why?
  - Coordinate-based systems
- Annotation
  - Protein domains
  - Gene Ontology

Database Varieties
- Sequence warehouses: “everything under one roof”
- Genome databases: containing single genome dataset(s)
- Reference sets: often human curated, the 'standard' for a particular gene or protein from which variants are defined
- Specialist
  - Short reads from next-generation sequencing (Short Read Archive)
  - [EST] Expressed sequence tags and [GSS] Genome survey sequences

Sharing primary data: NCBI GenBank, EMBL and DDBJ

NCBI
- Warehouse: GenBank
  - NR dataset: NR = non-redundant (but it is not...)
- Reference dataset: RefSeq
- Genome datasets: NCBI Genomes

EMBL
- Warehouse: EMBL
- Historically
  - The protein set was called translated EMBL (TrEMBL)
  - The gold-standard reference set was called SwissProt
- Reference set = UniProt
  - UniProtKB/Swiss-Prot: manually annotated and reviewed
  - UniProtKB/TrEMBL: automatically annotated and not reviewed
- Genome database: Ensembl

Live Demo
- Search GenBank for human ADH4
- How many are there?
- How many should there be?
- Why are some different to those found in UniProt?
- Are there better databases to use?
- Which identifier should you use in your lab book?

We should now be able to answer these:
- What are the main sequence databases?
- Which one to use for:
  - Looking up a gene identifier from a paper
  - Searching for a gene name
  - Searching for an orthologous gene from another species

Identifiers Or what to write in your lab book

How to identify a feature
- Gene/protein name
  - Common name
  - Standardised name
- Database identifier
  - Unique for each database
  - Some have revision numbers
- Position in genome
  - Dependent on genome build
- Position in a gene/protein
  - Protein domains

Never use common names Example of EPHB2

Consortia identifiers
- Most key species have a consortium / group / community that provides the key identifiers in the field
- Humans: was HUGO (HUman Genome Organisation), now the HGNC (HUGO Gene Nomenclature Committee)

Database Identifiers
- Every dataset has its own system for identifying a gene/protein
- Example: human ADH4
  - Ensembl: ENSG ENST ENSP
  - SwissProt: ADH4_HUMAN P08319
  - RefSeq: NM_ NP_
  - GenBank: gi| |ref|NP_ |
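
Since each database uses its own identifier style, a small helper can guess which database an identifier in a lab book came from. The sketch below is illustrative only: the regular expressions are assumptions based on the prefixes listed on the slide and on the published UniProt accession format, `classify_identifier` is a hypothetical name, and the Ensembl and RefSeq identifiers in the usage lines are made-up placeholders, not real records.

```python
import re

# Patterns are assumptions for illustration; real databases use more prefixes
# (for example other RefSeq molecule types) than the ones shown on the slide.
PATTERNS = [
    ("Ensembl gene", re.compile(r"ENSG\d+(?:\.\d+)?")),
    ("Ensembl transcript", re.compile(r"ENST\d+(?:\.\d+)?")),
    ("Ensembl protein", re.compile(r"ENSP\d+(?:\.\d+)?")),
    ("RefSeq mRNA", re.compile(r"NM_\d+(?:\.\d+)?")),
    ("RefSeq protein", re.compile(r"NP_\d+(?:\.\d+)?")),
    ("UniProt accession", re.compile(
        r"(?:[OPQ][0-9][A-Z0-9]{3}[0-9]"
        r"|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2})")),
    ("UniProt entry name", re.compile(r"[A-Z0-9]{1,5}_[A-Z0-9]{1,5}")),
]


def classify_identifier(identifier):
    """Return a label for the first pattern that matches the whole string."""
    for label, pattern in PATTERNS:
        if pattern.fullmatch(identifier):
            return label
    return "unknown"


# P08319 and ADH4_HUMAN come from the slide; the others are placeholders.
for example in ["P08319", "ADH4_HUMAN", "ENSG00000000001", "NM_000001.2", "adh4"]:
    print(example, "->", classify_identifier(example))
```

A common name such as "adh4" matches none of the patterns, which is one more reason to record a database identifier rather than a name.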

Keeping Track of Changes
- Gene models can change
- Will the ID you used yesterday still get the same sequence today?
- Or: how do you get the latest version of a sequence?

Keeping Track of Changes
- GenBank: GI or “GenBank identifier”
  - The GI number changes each time the sequence changes; the old one is often removed when it is superseded
- SwissProt: accession and ID
  - The accession (P08319) is the stable identifier; the ID, or entry name (ADH4_HUMAN), can change
- RefSeq and Ensembl: revision-based IDs (NM_, ENSG)
  - Written as XXX.number
  - XXX always retrieves the latest version
  - XXX.number retrieves that specific version
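
The XXX.number convention is easy to handle in code. The following is a minimal sketch, assuming the version is always the digits after the last dot; `split_version` is a hypothetical helper and the identifiers used are made-up placeholders.

```python
def split_version(identifier):
    """Split 'XXX.number' into (XXX, number); return (identifier, None) if unversioned."""
    base, dot, version = identifier.rpartition(".")
    if dot and version.isdigit():
        return base, int(version)
    return identifier, None


# Placeholder identifiers, not real records.
print(split_version("NM_000001.3"))      # versioned: a fixed revision
print(split_version("ENSG00000000001"))  # unversioned: means "latest"
```

Recording the versioned form in a lab book lets someone else retrieve exactly the sequence that was used, even after the record has been revised.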

Demo: Retrieving old data

Defining: Chromosome coordinates. Demo: Ensembl

Chromosome Positions
- Features are identified by chromosome and position
- File formats: BED, WIG, GFF, ...
- All major genome databases store features as coordinates
- Ubiquitous in deep sequencing studies
- Note: coordinates change depending on the assembly
- Always note the build number of the genome assembly if you are using coordinates
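
One practical detail when moving coordinates between these file formats is that they do not count positions the same way: BED intervals are 0-based and half-open, while GFF intervals are 1-based and closed. The sketch below shows the conversion in both directions; the function names are hypothetical, and zero-length BED features are deliberately rejected to keep the example simple.

```python
def bed_to_gff_interval(bed_start, bed_end):
    """Convert a 0-based half-open BED interval to a 1-based closed GFF interval."""
    if bed_start < 0 or bed_end <= bed_start:
        raise ValueError("expected 0 <= start < end")
    return bed_start + 1, bed_end


def gff_to_bed_interval(gff_start, gff_end):
    """Convert a 1-based closed GFF interval to a 0-based half-open BED interval."""
    if gff_start < 1 or gff_end < gff_start:
        raise ValueError("expected 1 <= start <= end")
    return gff_start - 1, gff_end


# The first 100 bases of a chromosome in each convention.
print(bed_to_gff_interval(0, 100))
print(gff_to_bed_interval(1, 100))
```

The feature length is `end - start` in BED and `end - start + 1` in GFF, so an off-by-one error here shifts every feature by a base.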

Coordinates
- New concept of PATCH
  - This is an assembly update that does not change the primary sequence
  - However, additional 'improved' contigs map to the reference
  - These will be in the next assembly: you may wish to use them
- Genome assembly names can differ by institution but refer to the same underlying sequence: GenBank/UCSC
- DEMO: liftOver
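
Tools such as liftOver convert coordinates between assemblies using a map of aligned blocks. The toy sketch below illustrates the idea only and is not liftOver's actual algorithm or file format: the block list, the `lift_position` name and all numbers are invented, and the example assumes a single chromosome and the same strand in both assemblies.

```python
from bisect import bisect_right

# Hypothetical aligned blocks: (old_start, old_end, new_start),
# 0-based half-open, sorted by old_start and non-overlapping.
BLOCKS = [
    (0, 100, 0),      # unchanged region
    (120, 200, 150),  # region shifted by an insertion in the new assembly
]


def lift_position(position, blocks):
    """Map a position from the old assembly to the new one, or None if unmapped."""
    starts = [block[0] for block in blocks]
    index = bisect_right(starts, position) - 1
    if index < 0:
        return None
    old_start, old_end, new_start = blocks[index]
    if position >= old_end:
        return None  # falls in a gap between aligned blocks
    return new_start + (position - old_start)


for pos in (50, 130, 110):
    print(pos, "->", lift_position(pos, BLOCKS))
```

Positions that fall outside every block cannot be converted, which is why a real conversion normally reports unmapped features separately.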

Protein Domains: Protein Positions

Protein Domains
- InterPro: a site that stores information on known protein domains from different projects
- Covered by InterPro
  - Similarities between proteins
  - Conserved regions in an alignment
  - Conserved protein folds
- Not covered by InterPro
  - Predicted features on the primary protein sequence
  - Trans-membrane regions
  - Low-complexity regions
  - Phosphorylation sites

Domain Complexity
- Many different types of domains
- Vast amounts of domain-based data
- Many different projects identifying them

Old way of interacting with a database
- Request information
- Retrieve information
- From a single source

Distributed Annotation System (DAS)

DAS clients
- Different types of software can have a DAS client built in
  - Genome browsers: Ensembl, IGB, IGV, ...
  - Multiple alignment editors: Jalview, STRAP
  - 3D structures: Spice
  - 3D electron microscopy data: PeppeR
- Demo

Annotation

Problem: Many ways to name a gene
- Reductase = oxidase = dehydrogenase
- Gene Ontology Consortium [GO]
  - GO terms standardise naming
  - Note that errors may still occur in the assignment of terms
  - Found in RefSeq, UniProt and most genome databases
  - GO browsers, e.g. AmiGO

Gene Ontology
- all [ gene products]
  - GO: : biological_process [ gene products]
  - GO: : cellular_component [ gene products]
  - GO: : molecular_function [ gene products]

Gene Ontology: a directed acyclic graph (like a tree, but a term can have more than one parent)
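
Because a GO term can have more than one parent, finding everything a term "is a kind of" means walking all paths up to the root rather than following a single branch. The sketch below shows that traversal on a toy ontology; the term names and the `PARENTS` table are invented for illustration and are not real GO terms.

```python
# Toy ontology (hypothetical terms): each term maps to its direct parents.
PARENTS = {
    "root": [],
    "process_a": ["root"],
    "process_b": ["root"],
    "specific_term": ["process_a", "process_b"],  # two parents: not a strict tree
}


def ancestors(term, parents):
    """Return every term reachable by following parent links upwards from `term`."""
    seen = set()
    stack = list(parents.get(term, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


print(sorted(ancestors("specific_term", PARENTS)))
```

A gene product annotated to a specific term is implicitly annotated to all of that term's ancestors, which is what makes the ontology useful for general classification.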

Evidence Codes
- Experimental
  - EXP: Inferred from Experiment
  - IDA: Inferred from Direct Assay
  - IPI: Inferred from Physical Interaction
  - IMP: Inferred from Mutant Phenotype
  - IGI: Inferred from Genetic Interaction
  - IEP: Inferred from Expression Pattern
- Computational
  - ISS: Inferred from Sequence or Structural Similarity
  - ISO: Inferred from Sequence Orthology
  - ISA: Inferred from Sequence Alignment
  - ISM: Inferred from Sequence Model
  - IGC: Inferred from Genomic Context
  - RCA: Inferred from Reviewed Computational Analysis
- Author Statement
  - TAS: Traceable Author Statement
  - NAS: Non-traceable Author Statement
- Curator Statement
  - IC: Inferred by Curator
  - ND: No biological Data available
- Automatically assigned
  - IEA: Inferred from Electronic Annotation
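
Evidence codes make it possible to separate annotations by how they were derived. The sketch below groups the codes listed on the slide by category and filters a list of annotations down to those with experimental support; the annotation records, gene names and term names are hypothetical, as is the `with_category` helper.

```python
# Category -> evidence codes, as listed on the slide.
CODES_BY_CATEGORY = {
    "Experimental": ["EXP", "IDA", "IPI", "IMP", "IGI", "IEP"],
    "Computational": ["ISS", "ISO", "ISA", "ISM", "IGC", "RCA"],
    "Author Statement": ["TAS", "NAS"],
    "Curator Statement": ["IC", "ND"],
    "Automatically assigned": ["IEA"],
}

# Reverse lookup: evidence code -> category.
CATEGORY_OF = {
    code: category
    for category, codes in CODES_BY_CATEGORY.items()
    for code in codes
}


def with_category(annotations, category):
    """Keep only (gene, term, code) records whose code belongs to `category`."""
    return [a for a in annotations if CATEGORY_OF.get(a[2]) == category]


# Hypothetical annotation records: (gene, term, evidence code).
annotations = [
    ("geneA", "term_1", "IDA"),
    ("geneA", "term_2", "IEA"),
    ("geneB", "term_1", "ISS"),
]
print(with_category(annotations, "Experimental"))
```

Filtering out IEA records is a common first step when only manually reviewed annotation is wanted, at the cost of losing most of the coverage.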

Best annotation?
- Use DAS clients to get more information on genomic, gene or protein features
- Protein domains are especially useful
- The Gene Ontology is useful for general classification
- BUT be aware of where the annotation was derived from