A Rough Guide to Biological Databases Alastair Kerr, Ph.D. Bioinformatician Wellcome Trust Centre for Cell Biology.

Presentation transcript:

A Rough Guide to Biological Databases Alastair Kerr, Ph.D. Bioinformatician Wellcome Trust Centre for Cell Biology

SLIDES Follow links to Recent Presentations

Goals
Understand the differences between data sources
Know where to go if you have a sequence ID (e.g. RefSeq, SwissProt, Ensembl)
Understand the best data source to use for your task (e.g. looking for orthologs or splice variants, a proteomics experiment)
Understand how identifiers work
Understand how to get the best annotation

Overview of Sequence Databases
Brief history
Varieties of data sources (databases and datasets)
  – Utility/drawbacks of each
Use of identifiers
DNA/Protein annotation
  – Distributed Annotation System [DAS]
      – DAS clients: outline and demo

Dawn of the Age of Sequencing
Mid 50s: first protein sequence, by Fred Sanger
Late 70s: clone-based sequencing arrived with the Sanger dideoxy method, along with the first bioinformatics tools (STADEN, alignments)
All sequences were published in papers; a central warehouse was clearly needed to keep them all

Sharing PRIMARY sequence data
NCBI GenBank, EMBL, DDBJ

Sequence Warehouses
NCBI (National Center for Biotechnology Information)
  – GenBank
  – Protein and DNA database: GenBank NR [Non-redundant]
EMBL (European Molecular Biology Laboratory)
  – Historically DNA: EMBL; Protein: translated EMBL (TrEMBL)
  – Now called EBI UniProt

Sources of DNA Error
Vector contamination
  – Now mainly eliminated in the sequencing pipeline, but still possible with rarer vectors
Sequencing artefacts
  – Recombination, mutations, contamination
  – Insertions/deletions (more frequent in next-gen sequencing?)
  – Most are removed from ‘finished’ sequence by comparing multiple reads, but they are still prevalent in ESTs and GSS

Protein Sequence Error
Most proteins are based on a model of the gene
This gene model is often deduced from a combination of:
  – EST data
  – ORF finding programs
  – Splice site finding programs
Protein interpretations may change
  – Transcription/translation start sites
  – Splice variants
  – DNA errors
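
As a rough illustration of what an ORF finding program does, the sketch below scans the three forward reading frames for stretches that run from ATG to the first in-frame stop codon. It is a toy, not any particular published program: it ignores the reverse strand, introns and splice sites, and the names `find_orfs` and `min_codons` are invented for this example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=3):
    """Return (start, end) pairs of forward-strand ORFs.

    Positions are 0-based with the end exclusive; each ORF runs from an
    ATG to the first in-frame stop codon, stop codon included.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # open a candidate ORF at the first ATG
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None  # close the ORF and look for the next ATG
    return sorted(orfs)
```

A single wrong base that creates or destroys a stop codon changes the predicted protein, which is one way DNA errors propagate into protein databases.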

Varieties of data sources
Sequence Warehouses
  – “everything under one roof”
Genome Databases
  – Containing single genome dataset(s)
Single pass reads
  – [EST] Expressed sequence tag (cDNA)
  – [GSS] Genome survey sequence
Curated sets

DNA Sequence Warehouses
Pros
  – Retrieve a specific sequence
      – e.g. an identifier from a paper
  – Comprehensive sequence search
Cons
  – >100 BILLION bases from 165,000 organisms
  – Redundancy
      – Variants, both real and artefacts (may complicate mass-spec searches)
  – Still may contain junk
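
To make the redundancy point concrete, here is a minimal sketch that groups FASTA records sharing exactly the same sequence. It only collapses exact duplicates, so real variants and sequencing artefacts still appear as separate entries. The function names are illustrative and not taken from any database toolkit.

```python
def parse_fasta(text):
    """Yield (identifier, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            parts = line[1:].split()
            # The identifier is the first word of the header line.
            header, chunks = (parts[0] if parts else ""), []
        else:
            chunks.append(line.upper())
    if header is not None:
        yield header, "".join(chunks)


def collapse_identical(records):
    """Map each distinct sequence to the list of identifiers that carry it."""
    groups = {}
    for identifier, seq in records:
        groups.setdefault(seq, []).append(identifier)
    return groups
```

Running `collapse_identical(parse_fasta(text))` on a warehouse download gives a quick estimate of how many entries are exact repeats before a search is set up.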

UniProt

Single Genome databases/datasets
Pros
  – Truly non-redundant
  – If complete, you know the gene copy number and gene families
  – Can search genomic DNA for known but unannotated genes
  – Can ‘browse’ the genome
  – Usually very good computer annotation
Cons
  – Not always assembled correctly
  – May be incomplete (despite saying otherwise)
  – Often no human intervention with annotation

ensEMBL
From EBI / Sanger
Many vertebrate and model organism genomes
  – All cross-annotated (easy to find orthologs)
Access via web with built-in DAS client (see later)
  – BioMart access
  – Also SQL / Perl API

Low Quality Sequences
Expressed sequence tags (ESTs) are reads of cDNA, of variable quality but usually very numerous
  – Ideal for determining which part of a genome is transcribed…
      – but not necessarily coding! [ncRNAs]
  – Recent improvements in visualisation, ORF identification and clustering
Genomic survey sequences [GSS] are single pass reads
  – Genomes in early stage of sequencing
  – Environmental sampling (metagenomics)

Using EST databases
Gene model verification (e.g. checking a splice variant)
  – Search EST databases with genomic sequence or cDNA

Curated (Protein) Data Sets
Several efforts to create high quality databases
SwissProt (1986): the first gold standard in protein functional annotation
  – Originally every entry entered by Amos Bairoch!
  – Now integrated into UniProt (EBI / EMBL)
RefSeq
  – NCBI's human-involved annotation effort
IMPORTANT: these are site specific and NOT shared between NCBI / EBI

IPI
International Protein Index
From EBI
Contains proteomes of higher eukaryotes
Effectively maintains a database of cross references between the primary data sources
Minimally redundant yet maximally complete sets of proteins for featured species (one sequence per transcript)

Primary sequence crossover between databases
Primary sequence is exchanged between databases
  – But what is primary sequence?
  – Sometimes only ‘finished’ sequence is shared
NOT SHARED
  – Gene models (and therefore proteins)
  – Some EST/GSS
  – Sequence from some genome projects not yet published

Identifiers

Consortia identifiers
Most key species have a consortium / group / community that provides the key identifiers in the field
Humans
  – Was HUGO (HUman Genome Organisation)
  – Now the HGNC (HUGO Gene Nomenclature Committee)
Some are of limited use as they have incomplete coverage

Database Identifiers
Every dataset has its own system of identifying a gene/protein
Example: Human ADH4
  – Ensembl: ENSG ENST ENSP
  – SwissProt: ADH4_HUMAN P08319
  – RefSeq: NM_ NP_
  – GenBank: gi| |ref|NP_ |
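
Because each source has a recognisable identifier shape, a first guess at where an ID came from can be made by pattern matching. The sketch below is one hedged way to do that. The patterns are assumptions based on commonly documented formats (human Ensembl stable IDs with 11 digits, RefSeq `NM_`/`NP_` prefixes, UniProt's published accession pattern), and `classify` is a name invented for this example.

```python
import re

# First match wins, so the more specific patterns are listed first.
PATTERNS = [
    ("Ensembl gene", re.compile(r"^ENSG\d{11}(\.\d+)?$")),
    ("Ensembl transcript", re.compile(r"^ENST\d{11}(\.\d+)?$")),
    ("Ensembl protein", re.compile(r"^ENSP\d{11}(\.\d+)?$")),
    ("RefSeq mRNA", re.compile(r"^NM_\d+(\.\d+)?$")),
    ("RefSeq protein", re.compile(r"^NP_\d+(\.\d+)?$")),
    ("UniProt accession", re.compile(
        r"^([OPQ][0-9][A-Z0-9]{3}[0-9]"
        r"|[A-NR-Z][0-9]([A-Z][A-Z0-9]{2}[0-9]){1,2})$")),
    ("UniProt entry name", re.compile(r"^[A-Z0-9]{1,10}_[A-Z0-9]{1,5}$")),
]


def classify(identifier):
    """Return a label for the identifier's likely source, or 'unknown'."""
    for label, pattern in PATTERNS:
        if pattern.match(identifier):
            return label
    return "unknown"
```

With these patterns, `classify("P08319")` returns `"UniProt accession"` and `classify("ADH4_HUMAN")` returns `"UniProt entry name"`. A pattern match only says what an identifier looks like, not that the record exists.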

Keeping Track of Changes
Gene models can change
  – Will the ID you used yesterday still get the same sequence today?
  – Or: how do you get the latest version of a sequence?

Keeping Track of Changes
GenBank
  – GI number changes each time the sequence changes; the old one is often removed when it is superseded
SwissProt
  – The accession (P08319) is the stable identifier; the entry name (ADH4_HUMAN) is easier to read but can change
RefSeq and Ensembl
  – Revision-based IDs
      – NM_ ENSG
  – XXX.number
      – XXX always retrieves the latest
      – XXX.number retrieves that version
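
The `XXX.number` convention is easy to handle in code. A minimal sketch, assuming the version is always a trailing integer after the last dot; `split_version` is a hypothetical helper and the example ID uses placeholder digits.

```python
def split_version(identifier):
    """Split 'XXX.number' into ('XXX', number).

    The version is None when the identifier carries no numeric suffix,
    which by convention means "give me the latest".
    """
    base, dot, version = identifier.rpartition(".")
    if dot and version.isdigit():
        return base, int(version)
    return identifier, None


def same_record(id_a, id_b):
    """True when two identifiers point at the same record, ignoring version."""
    return split_version(id_a)[0] == split_version(id_b)[0]
```

Recording the full versioned form (for example `NM_000000.3`, placeholder digits) in a lab notebook or methods section is what makes it possible to retrieve the same sequence again later.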

Converting between identifiers
Most databases understand an identifier from another database; some are better than others
  – Best to use UniProt
NEVER rely on common names

Annotation

Standard Keywords and annotation
Problem: many ways to name a gene
  – Reductase = oxidase = dehydrogenase
Gene Ontology Consortium [GO]
  – GO terms standardise naming
  – Note that errors may still occur in the assignment of terms
  – Found in RefSeq, SwissProt and most genome databases
GO browsers, e.g. AmiGO

Annotation Issues
Most annotation is by inference from homology/orthology
Domain-specific function rather than protein-specific function
Programs may give different answers; it is not possible for one source to store all possibilities

Distributed Annotation
Information on a protein or gene can be stored on multiple (often specialised) servers
  – Distributed Annotation System [DAS]
Data can be accessed using client software that checks these sites
  – Built into the ensEMBL website
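
A DAS client asks each server for the features that fall on a region of a shared reference sequence. The sketch below builds such a request, assuming the DAS 1.x `features` command layout (`<server>/das/<source>/features?segment=<reference>:<start>,<end>`). The server address and source name are placeholders, not real services, and `das_features_url` is a name made up for this example.

```python
from urllib.parse import urlencode


def das_features_url(server, source, reference, start, end):
    """Build a DAS 'features' request URL for one segment of a reference."""
    if start > end:
        raise ValueError("start must not be greater than end")
    # Keep ':' and ',' readable; they delimit the segment in the query.
    query = urlencode({"segment": f"{reference}:{start},{end}"}, safe=":,")
    return f"{server.rstrip('/')}/das/{source}/features?{query}"


# Hypothetical server and source, for illustration only.
url = das_features_url("http://das.example.org", "my_cpg_islands", "1", 1000, 2000)
```

Each server answers for its own annotation only; the client is what overlays the answers on the reference.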

DISTRIBUTED ANNOTATION SYSTEM
[Diagram: DAS clients, DAS servers, reference data]

DAS Servers
Free one in ensEMBL
I have my own in-house
  – Currently storing CpG islands
  – Speak to me if you need to use it
Found on the DAS registry
  – Many DAS clients can look up this registry

Some DAS Clients
Dedicated web site
  – Dasty2 [Protein DAS]
Genome browsers
  – ensEMBL web site [Protein and Gene DAS]
Protein structure viewers
  – Spice [3D Protein DAS]
Alignment programs
  – Jalview [Protein and Gene DAS]

DAS

CpG island tracks

SPICE: a protein DAS client

Gotchas
Coordinate systems MUST match
  – Will change with each genome assembly
Use a single source only
  – e.g. UniProt for proteins
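
One way to guard against mismatched coordinate systems is to carry the assembly name with every feature and refuse to compare features from different assemblies. A minimal sketch under that assumption; the `Feature` class, its fields and the assembly labels in the usage note are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Feature:
    assembly: str  # name of the genome assembly the coordinates refer to
    chrom: str
    start: int     # 1-based, inclusive
    end: int       # 1-based, inclusive


def overlaps(a, b):
    """Return True if two features overlap on the same chromosome.

    Raises ValueError when the features come from different assemblies,
    because their coordinates are not comparable.
    """
    if a.assembly != b.assembly:
        raise ValueError(
            f"coordinate systems differ: {a.assembly} vs {b.assembly}"
        )
    return a.chrom == b.chrom and a.start <= b.end and b.start <= a.end
```

For example, comparing a CpG island annotated on `assemblyA` with a gene annotated on `assemblyB` raises an error instead of silently reporting a meaningless overlap.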

Linked databases

More key databases 1
Conserved Domains (see later seminar)
  – InterPro: hosts domains and names from multiple projects
  – Each domain often functionally annotated
  – More in following talk
Mendelian Inheritance in Man (at NCBI)
  – Phenotype centric, literature curated
Population differences
  – dbSNP: polymorphisms (also in Ensembl)
  – PharmGKB: drug efficacy

More key databases 2
Microarray
  – EBI: ArrayExpress
  – NCBI: GEO
Pathway Databases
  – Reactome
  – KEGG
  – Cytoscape application

References
NCBI
EBI
UniProt
Sanger Centre
ensEMBL
Firefox and BioBar
SPICE (DAS client)
Gene Ontology Consortium
AmiGO
GMOD
BioMoby
Dasty2
InterPro:
Reactome: http//
KEGG: kegg/ kegg2.html
Cytoscape
PharmGKB: PharmGkb.org/
