A Rough Guide to Biological Databases Alastair Kerr, Ph.D. Bioinformatician Wellcome Trust Centre for Cell Biology.

SLIDES http://bifx1.bio.ed.ac.uk Follow links to Recent Presentations

Goals
Understand the differences between data sources
Know where to go if you have a sequence ID (e.g. RefSeq, SwissProt, Ensembl)
Understand the best data source to use for your task (e.g. looking for orthologs or splice variants, or a proteomics experiment)
Understand how identifiers work
Understand how to get the best annotation

Overview of Sequence Databases
Brief history
Varieties of data sources (databases and datasets)
  – Utility/drawbacks of each
Use of identifiers
DNA/Protein annotation
  – Distributed Annotation System [DAS]
    DAS clients: outline and demo

Dawn of the Age of Sequencing
Mid 50s: first protein sequence, by Fred Sanger
Late 70s: clone-based sequencing arrived with the Sanger dideoxy method, along with the first bioinformatics tools (STADEN, alignments)
All sequences were published in papers; a central warehouse was clearly needed to keep them all

Sharing PRIMARY sequence data NCBI GenBank EMBL DDBJ

Sequence Warehouses
NCBI (National Center for Biotechnology Information): GenBank
  – Protein and DNA database
  – GenBank NR [non-redundant]
EMBL (European Molecular Biology Laboratory)
  – Historically DNA: EMBL; protein: translated EMBL (TrEMBL)
  – Now called EBI UniProt

Sources of DNA Error
Vector contamination
  – Now mainly eliminated in the sequencing pipeline, but still possible with rarer vectors
Sequencing artefacts
  – Recombination, mutations, contamination
  – Insertions/deletions (more prone in next-gen sequencing?)
  – Most are removed from ‘finished’ sequence by comparing multiple reads, but are still prevalent in ESTs and GSS

Protein Sequence Error
Most proteins are based on a model of the gene
This gene model is often deduced from a combination of:
  – EST data
  – ORF-finding programs
  – Splice-site-finding programs
Protein interpretations may change
  – Transcription/translation start sites
  – Splice variants
  – DNA errors
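To make the "ORF-finding programs" bullet concrete, here is a minimal sketch of an open reading frame scan. It is an illustration only: it looks at the three forward frames, accepts only ATG as a start codon, and ignores introns and the reverse strand, all of which real gene-model pipelines handle. The function name `find_orfs`, its `min_codons` parameter and the example sequence are invented for this sketch.

```python
# Minimal forward-strand ORF scan (illustrative; not a real gene finder).
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=3):
    """Return sorted (start, end) pairs for ATG...stop ORFs in the 3 forward frames.

    Coordinates are 0-based, end-exclusive, so dna[start:end] is the ORF
    including its stop codon. min_codons counts the stop codon as well.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None  # position of the first ATG seen since the last stop
        for pos in range(frame, len(dna) - 2, 3):
            codon = dna[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos
            elif start is not None and codon in STOP_CODONS:
                end = pos + 3
                if (end - start) // 3 >= min_codons:
                    orfs.append((start, end))
                start = None
    return sorted(orfs)


if __name__ == "__main__":
    example = "CCATGAAATTTTAGGG"  # made-up sequence
    for start, end in find_orfs(example):
        print(start, end, example[start:end])
```

A single inserted or deleted base upstream of the stop codon shifts the frame and changes or removes the reported ORF, which is one way a DNA error becomes a protein sequence error.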

Varieties of data sources
Sequence warehouses
  – “Everything under one roof”
Genome databases
  – Containing single genome dataset(s)
Single-pass reads
  – [EST] Expressed sequence tag (cDNA)
  – [GSS] Genome survey sequence
Curated sets

Sequence Warehouses
Pros
  – Retrieve a specific sequence
    e.g. an identifier from a paper
  – Comprehensive sequence search
Cons
  – >100 BILLION bases from 165,000 organisms
  – Redundancy
    Variants, both real and artefacts (may complicate mass-spec searches)
  – May still contain junk DNA
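The redundancy point can be illustrated with a small sketch that groups identifiers sharing exactly the same sequence. This is an assumption-laden toy: the function name `collapse_identical` and the example records are invented, and only exact duplicates are collapsed. Near-identical variants, the case that complicates mass-spec searches, remain as separate entries.

```python
from collections import defaultdict


def collapse_identical(records):
    """Group identifiers by identical sequence (case-insensitive).

    records: iterable of (identifier, sequence) pairs.
    Returns a dict mapping each distinct upper-cased sequence to the list of
    identifiers that carry it, in input order.
    """
    groups = defaultdict(list)
    for identifier, sequence in records:
        groups[sequence.upper()].append(identifier)
    return dict(groups)


if __name__ == "__main__":
    # Hypothetical records, not taken from any real database.
    records = [("seqA", "MKTAY"), ("seqB", "mktay"), ("seqC", "MKTAF")]
    for sequence, ids in collapse_identical(records).items():
        print(sequence, ids)
```

Here `seqA` and `seqB` collapse into one entry, while `seqC`, which differs by a single residue, stays separate.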

UniProt http://www.uniprot.org/

Single Genome databases/datasets
Pros
  – Truly non-redundant
  – If complete, you know the gene copy number and gene families
  – Can search genomic DNA for known but un-annotated genes
  – Can ‘browse’ the genome
  – Usually very good computer annotation
Cons
  – Not always assembled correctly
  – May be incomplete (despite saying otherwise)
  – Often no human intervention with annotation

ensEMBL
From EBI / Sanger
Many vertebrate and model organism genomes
  – All cross-annotated (easy to find orthologs)
Access via web with built-in DAS client (see later)
  – BioMart access
  – Also SQL / Perl API

Low Quality Sequences
Expressed sequence tags (ESTs) are reads of coding DNA, of variable quality but usually very numerous
  – Ideal for determining which part of a genome is transcribed…
    but not necessarily coding! [ncRNAs]
  – Recent improvements in visualisation, ORF identification and clustering
Genomic survey sequences [GSS] are single-pass reads
  – Genomes in early stages of sequencing
  – Environmental sampling (metagenomics)

Using EST databases
Gene model verification (e.g. checking a splice variant)
  – Search EST databases with genomic sequence or cDNA

Curated (Protein) Data Sets
Several efforts to create high quality databases
SwissProt (1986): the first gold standard in protein functional annotation
  – Originally every entry entered by Amos Bairoch!
  – Now integrated into UniProt (EBI / EMBL)
RefSeq
  – NCBI's human-involved annotation effort
IMPORTANT: these are site specific and NOT shared between NCBI / EBI

IPI
International Protein Index
From EBI
Contains proteomes of higher eukaryotes
Effectively maintains a database of cross references between the primary data sources
Minimally redundant yet maximally complete sets of proteins for featured species (one sequence per transcript)

Primary sequence crossover between databases
Primary sequence is exchanged between databases
  – But what is primary sequence?
  – Sometimes only ‘finished’ sequence is shared
NOT SHARED
  – Gene models (and therefore proteins)
  – Some EST/GSS
  – Sequence from some genome projects not yet published

Identifiers

Consortia identifiers
Most key species have a consortium / group / community that provides the key identifiers in the field
Humans
  – Was HUGO (HUman Genome Organisation)
  – Now the HGNC (HUGO Gene Nomenclature Committee)
Some are of limited use as they have incomplete coverage

Database Identifiers
Every dataset has its own system of identifying genes/proteins
Example: Human ADH4
  – Ensembl
    ENSG00000198099 ENST00000423445 ENSP00000397939
  – SwissProt
    ADH4_HUMAN P08319
  – RefSeq
    NM_000670.3 NP_000661.2
  – GenBank
    gi|71565152|ref|NP_000661.2|
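Because each source uses a distinctive identifier shape, a first guess at where an ID came from can often be made from its pattern alone. The sketch below tries this on the ADH4 identifiers above. The regular expressions are simplified assumptions chosen to fit these examples (the UniProt accession pattern follows the format UniProt documents; the others are rough), and the function name `classify` and its labels are invented here. A pattern match says nothing about whether the record actually exists.

```python
import re

# Ordered list: the first matching pattern wins.
PATTERNS = [
    ("GenBank GI line", re.compile(r"^gi\|\d+\|")),
    # ENS + optional species letters + G/T/P + 11 digits + optional version
    ("Ensembl", re.compile(r"^ENS[A-Z]*[GTP]\d{11}(\.\d+)?$")),
    # Two-letter prefix, underscore, digits, optional version (e.g. NM_, NP_)
    ("RefSeq", re.compile(r"^[A-Z]{2}_\d+(\.\d+)?$")),
    ("UniProt accession", re.compile(
        r"^([OPQ][0-9][A-Z0-9]{3}[0-9]"
        r"|[A-NR-Z][0-9]([A-Z][A-Z0-9]{2}[0-9]){1,2})$")),
    ("UniProt entry name", re.compile(r"^[A-Z0-9]{1,10}_[A-Z0-9]{1,5}$")),
]


def classify(identifier):
    """Return a best-guess source label for an identifier, or 'unknown'."""
    for label, pattern in PATTERNS:
        if pattern.match(identifier):
            return label
    return "unknown"


if __name__ == "__main__":
    for example in ["ENSG00000198099", "ADH4_HUMAN", "P08319",
                    "NM_000670.3", "gi|71565152|ref|NP_000661.2|", "ADH4"]:
        print(example, "->", classify(example))
```

Note that the bare common name `ADH4` matches none of the patterns, which fits the advice never to rely on common names.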

Keeping Track of Changes
Gene models can change
  – Will the ID you used yesterday still get the same sequence today?
  – Or: how do you get the latest version of a sequence?

Keeping Track of Changes
GenBank
  – GI number changes each time; often removed when it gets superseded
SwissProt
  – Accession changes each time (P08319) but the ID remains constant (ADH4_HUMAN)
RefSeq and Ensembl
  – Revision-based IDs
    NM_000670.3 ENSG00000198099.1
  – XXX.number
    XXX always retrieves the latest
    XXX.number retrieves that version
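The XXX.number convention is easy to handle in code: split off the trailing version, query with the bare ID to get the latest record, or keep the version to pin an exact one. A minimal sketch, with the helper name `split_version` invented for illustration:

```python
def split_version(identifier):
    """Split a revision-based ID into (stable_id, version).

    'NM_000670.3' -> ('NM_000670', 3); an ID without a numeric '.N'
    suffix is returned unchanged with version None.
    """
    stable, dot, version = identifier.rpartition(".")
    if dot and version.isdigit():
        return stable, int(version)
    return identifier, None


if __name__ == "__main__":
    for example in ["NM_000670.3", "ENSG00000198099.1", "ENSG00000198099"]:
        print(example, "->", split_version(example))
```

Recording the full versioned ID alongside results makes it possible to tell later whether the sequence behind an analysis has since been revised.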

Converting between identifiers
Most databases understand an identifier from another database; some are better than others
  – Best to use UniProt
NEVER rely on common names

Annotation

Standard Keywords and annotation
Problem: many ways to name a gene
  – Reductase = oxidase = dehydrogenase
Gene Ontology Consortium [GO]
  – GO terms standardise naming
  – Note that errors may still occur in the assignment of terms
  – Found in RefSeq, SwissProt and most genome databases
GO browsers, e.g. AmiGO

Annotation Issues
Most annotation is by inference from homology/orthology
Domain-specific function rather than protein-specific function
Programs may give different answers; it is not possible for one source to store all possibilities

Distributed Annotation
Information on a protein or gene can be stored on multiple (often specialised) servers
  – Distributed Annotation System [DAS]
Data can be accessed using client software that checks these sites
  – Built into the ensEMBL website

DISTRIBUTED ANNOTATION SYSTEM
[Diagram: DAS clients, DAS servers, reference data]

DAS Servers
Free one in ensEMBL
I have my own in-house
  – Currently storing CpG islands
  – Speak to me if you need to use it
Found on the DAS registry
  – Many DAS clients can look up this registry

Some DAS Clients
Dedicated web site
  – Dasty2 [Protein DAS]
Genome browsers
  – ensEMBL web site [Protein and Gene DAS]
Protein structure viewers
  – Spice [3D Protein DAS]
Alignment programs
  – Jalview [Protein and Gene DAS]

CpG island tracks

SPICE: a protein DAS client

Gotchas
Coordinate systems MUST match
  – Will change with each genome assembly
Use a single source only
  – e.g. UniProt for proteins
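One way to guard against the coordinate-system gotcha is to carry the assembly name with every feature and refuse to compare features from different assemblies. The sketch below assumes a hypothetical `Feature` record and `overlaps` helper; the assembly labels and coordinates are placeholders, not real data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Feature:
    assembly: str  # genome assembly the coordinates refer to (placeholder labels below)
    chrom: str
    start: int     # 1-based, inclusive
    end: int       # 1-based, inclusive


def overlaps(a, b):
    """Return True if two features overlap; refuse to mix assemblies."""
    if a.assembly != b.assembly:
        raise ValueError(
            f"coordinate systems differ: {a.assembly} vs {b.assembly}")
    return a.chrom == b.chrom and a.start <= b.end and b.start <= a.end


if __name__ == "__main__":
    gene = Feature("assembly_v1", "chr4", 100, 200)
    island = Feature("assembly_v1", "chr4", 150, 250)
    print(overlaps(gene, island))
    stale = Feature("assembly_v2", "chr4", 150, 250)
    try:
        overlaps(gene, stale)
    except ValueError as err:
        print("refused:", err)
```

Failing loudly is a deliberate choice here: silently comparing positions across assemblies produces plausible-looking but wrong overlaps.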

Linked databases

More key databases 1
Conserved domains (see later seminar)
  – InterPro: hosts and names from multiple projects
  – Each domain often functionally annotated
  – More in following talk
Mendelian Inheritance in Man (at NCBI)
  – Phenotype centric, literature curated
Population differences
  – dbSNP: polymorphisms (also in Ensembl)
  – PharmGKB: drug efficacy

More key databases 2
Microarray
  – EBI: ArrayExpress
  – NCBI: GEO
Pathway databases
  – Reactome
  – KEGG
  – Cytoscape application

References
NCBI: www.ncbi.nlm.nih.gov
EBI: www.ebi.ac.uk
UniProt: http://beta.uniprot.org
Sanger Centre: www.sanger.ac.uk
ensEMBL: www.ensembl.org
Firefox and BioBar: www.mozilla.org
SPICE (DAS client): http://www.efamily.org.uk/software/dasclients/spice/
Gene Ontology Consortium: www.geneontology.org
AmiGO: www.godatabase.org
GMOD: www.gmod.org
BioMoby: www.biomoby.org
Dasty2: http://www.ebi.ac.uk/dasty/
InterPro: http://www.ebi.ac.uk/interpro/
Reactome: http://www.reactome.org/
KEGG: http://www.genome.jp/kegg/kegg2.html
Cytoscape: http://www.cytoscape.org
PharmGKB: http://www.PharmGkb.org/
