Published by Charles Carson. Modified over 9 years ago.
1
A Rough Guide to Biological Databases
Alastair Kerr, Ph.D.
Bioinformatician, Wellcome Trust Centre for Cell Biology
2
SLIDES http://bifx1.bio.ed.ac.uk Follow links to Recent Presentations
3
Goals
Understand the differences between data sources
Know where to go if you have a sequence ID (e.g. RefSeq, SwissProt, Ensembl)
Understand the best data source to use for your task (e.g. looking for orthologs or splice variants, or a proteomics experiment)
Understand how identifiers work
Understand how to get the best annotation
4
Overview of Sequence Databases
Brief history
Varieties of data sources (databases and datasets)
  – Utility/drawbacks of each
Use of identifiers
DNA/Protein annotation
  – Distributed annotation system [DAS]
  – DAS clients: outline and demo
5
Dawn of the Age of Sequencing
Mid-1950s: first protein sequence, by Fred Sanger
Late 1970s: clone-based sequencing arrived with the Sanger dideoxy method, along with the first bioinformatics tools (STADEN, alignments)
All sequences were published in papers; a central warehouse was clearly needed to keep them all
6
Sharing PRIMARY sequence data
NCBI GenBank
EMBL
DDBJ
7
Sequence Warehouses
NCBI (National Center for Biotechnology Information): GenBank
  – Protein and DNA database
  – GenBank NR [Non-redundant]
EMBL (European Molecular Biology Laboratory)
  – Historically DNA: EMBL
  – Protein: translated EMBL (TrEMBL)
  – Now called EBI UniProt
9
Sources of DNA Error
Vector contamination
  – Now mainly eliminated in the sequencing pipeline, but still possible with rarer vectors
Sequencing artefacts
  – Recombination, mutations, contamination
  – Insertions/deletions (more prone in next-gen sequencing?)
  – Most are removed from ‘finished’ sequence by comparing multiple reads, but they are still prevalent in ESTs and GSS
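The idea of removing read errors by comparing multiple reads can be sketched as a per-column majority vote. This is only a minimal illustration, assuming the reads are already aligned and of equal length; the function name `consensus` and the example reads are hypothetical and not taken from any particular sequencing pipeline.

```python
from collections import Counter


def consensus(aligned_reads):
    """Majority-vote consensus over pre-aligned reads of equal length.

    Assumption: alignment has already been done elsewhere, so column i
    of every read refers to the same position.
    """
    if not aligned_reads:
        return ""
    length = len(aligned_reads[0])
    if any(len(read) != length for read in aligned_reads):
        raise ValueError("reads must be aligned to the same length")
    # zip(*reads) yields one tuple of bases per alignment column.
    return "".join(
        Counter(column).most_common(1)[0][0] for column in zip(*aligned_reads)
    )


# Hypothetical example: the third read carries a single-base error.
reads = ["ACGTAC", "ACGTAC", "ACCTAC"]
corrected = consensus(reads)
```

With only one read per region, as in ESTs and GSS, there is nothing to vote against, which is why such errors persist there.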
10
Protein Sequence Error
Most proteins are based on a model of the gene
This gene model is often deduced from a combination of:
  – EST data
  – ORF finding programs
  – Splice site finding programs
Protein interpretations may change
  – Transcription/translation start sites
  – Splice variants
  – DNA errors
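To make the "ORF finding programs" step concrete, here is a deliberately naive sketch of an open reading frame scan. It assumes an unspliced forward-strand sequence, ATG as the only start codon and the three standard stop codons; `find_orfs` and `min_codons` are hypothetical names, and real gene predictors are far more elaborate.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=3):
    """Return (start, end, frame) tuples for ATG...stop ORFs.

    Coordinates are 0-based, end-exclusive, forward strand only.
    min_codons counts the start and stop codons as well.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first ATG opens the ORF
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None  # look for the next ORF in this frame
    return orfs


# Hypothetical sequence with one short ORF in frame 2.
example = "CCATGAAATTTTAGGG"
orfs = find_orfs(example)
```

A single inserted or deleted base upstream shifts the frame and changes or removes the predicted ORF, which is one way a DNA error becomes a protein sequence error.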
11
Varieties of data sources
Sequence Warehouses
  – “everything under one roof”
Genome Databases
  – Containing single genome dataset(s)
Single pass reads
  – [EST] Expressed sequence tag (cDNA)
  – [GSS] Genome survey sequence
Curated sets
12
Sequence Warehouses
Pros
  – Retrieve a specific sequence, e.g. an identifier from a paper
  – Comprehensive sequence search
Cons
  – >100 BILLION bases from 165,000 organisms
  – Redundancy: variants, both real and artefacts (may complicate mass-spec searches)
  – May still contain junk DNA
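The redundancy problem can be illustrated with a small sketch that groups records by exact sequence. It assumes the input is an iterable of (identifier, sequence) pairs; the function name `collapse_identical` and the identifiers in the example are hypothetical.

```python
from collections import defaultdict


def collapse_identical(records):
    """Map each distinct sequence to the list of identifiers that carry it.

    Only exact (case-insensitive) duplicates are merged; variants that
    differ by even one residue stay as separate entries.
    """
    groups = defaultdict(list)
    for seq_id, seq in records:
        groups[seq.upper()].append(seq_id)
    return dict(groups)


# Hypothetical records: two exact duplicates and one single-residue variant.
records = [
    ("entryA", "MKTAYIAK"),
    ("entryB", "mktayiak"),
    ("entryC", "MKTAYIAR"),
]
groups = collapse_identical(records)
```

Exact duplicates collapse easily, but near-identical variants, whether real or artefacts, remain as separate entries. Those are the ones that can complicate a mass-spec search.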
14
UniProt http://www.uniprot.org/
15
Single Genome databases/datasets
Pros
  – Truly non-redundant
  – If complete, you know the gene copy number and gene families
  – Can search genomic DNA for known but unannotated genes
  – Can ‘browse’ the genome
  – Usually very good computer annotation
Cons
  – Not always assembled correctly
  – May be incomplete (despite saying otherwise)
  – Often no human intervention with annotation
16
ensEMBL
From EBI / Sanger
Many vertebrate and model organism genomes
  – All cross-annotated (easy to find orthologs)
Access via web with built-in DAS client (see later)
  – BIOMART access
  – Also SQL / Perl API
18
Low Quality Sequences
Expressed sequence tags (ESTs) are reads of coding DNA, of variable quality but usually very numerous
  – Ideal for determining which part of a genome is transcribed… but not necessarily coding! [ncRNAs]
  – Recent improvements in visualisation, ORF identification and clustering
Genomic survey sequences [GSS] are single pass reads
  – Genomes in an early stage of sequencing
  – Environmental sampling (meta-genomics)
19
Using EST databases
Gene model verification (e.g. checking a splice variant)
  – Search EST databases with genomic sequence or cDNA
20
Curated (Protein) Data Sets
Several efforts to create high quality databases
SwissProt (1986): the first gold standard in protein functional annotation
  – Originally every entry entered by Amos Bairoch!
  – Now integrated into UniProt (EBI / EMBL)
RefSeq
  – NCBI's human-involved annotation effort
IMPORTANT: These are site-specific and NOT shared between NCBI / EBI
21
IPI: International Protein Index
From EBI
Contains proteomes of higher eukaryotes
Effectively maintains a database of cross references between the primary data sources
Minimally redundant yet maximally complete sets of proteins for featured species (one sequence per transcript)
22
Primary sequence crossover between databases
Primary sequence is exchanged between databases
  – But what is primary sequence?
  – Sometimes only ‘finished’ sequence is shared
NOT SHARED
  – Gene models (and therefore proteins)
  – Some EST/GSS
  – Sequence from some genome projects not yet published
23
Identifiers
24
Consortia identifiers
Most key species have a consortium / group / community that provides the key identifiers in the field
Humans
  – Was HUGO (HUman Genome Organisation)
  – Now the HGNC (HUGO Gene Nomenclature Committee)
Some are of limited use because they have incomplete coverage
25
Database Identifiers
Every dataset has its own system of identifying genes/proteins
Example: Human ADH4
  – Ensembl: ENSG00000198099, ENST00000423445, ENSP00000397939
  – SwissProt: ADH4_HUMAN, P08319
  – RefSeq: NM_000670.3, NP_000661.2
  – GenBank: gi|71565152|ref|NP_000661.2|
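Because each source uses a recognisable identifier shape, a rough guess at where an ID comes from can be made with pattern matching. The sketch below is illustrative only. The UniProt accession pattern follows the format UniProt documents; the other patterns are simplified assumptions that cover the ADH4 examples and will not catch every valid identifier. The name `classify_identifier` is hypothetical.

```python
import re

# Order matters: the more specific patterns are tried first.
PATTERNS = [
    ("RefSeq", re.compile(r"^(NM|NP|NR|NC|NG|XM|XP|XR)_\d+(\.\d+)?$")),
    # Assumed Ensembl shape: ENS + optional species letters + G/T/P + 11 digits.
    ("Ensembl", re.compile(r"^ENS[A-Z]*?[GTP]\d{11}(\.\d+)?$")),
    ("GenBank gi", re.compile(r"^gi\|\d+\|")),
    ("UniProt accession", re.compile(
        r"^([OPQ][0-9][A-Z0-9]{3}[0-9]"
        r"|[A-NR-Z][0-9]([A-Z][A-Z0-9]{2}[0-9]){1,2})$")),
    ("UniProt entry name", re.compile(r"^[A-Z0-9]{1,10}_[A-Z0-9]{1,5}$")),
]


def classify_identifier(identifier):
    """Return a best-guess source name for an identifier, or 'unknown'."""
    for source, pattern in PATTERNS:
        if pattern.match(identifier):
            return source
    return "unknown"


# The Human ADH4 identifiers from the slide, plus the bare common name.
for example in ["ENSG00000198099", "ADH4_HUMAN", "P08319",
                "NM_000670.3", "gi|71565152|ref|NP_000661.2|", "ADH4"]:
    print(example, "->", classify_identifier(example))
```

A bare common name such as ADH4 matches none of the patterns, so this approach says nothing about where a common name belongs.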
26
Keeping Track of Changes
Gene models can change
  – Will the ID you used yesterday still get the same sequence today?
  – Or: how do you get the latest version of a sequence?
27
Keeping Track of Changes
GenBank
  – GI number changes each time the sequence changes; the old one is often removed when it gets superseded
SwissProt
  – The accession (P08319) is the stable identifier to cite; the entry name (ADH4_HUMAN) is readable but may change
RefSeq and Ensembl
  – Revision-based IDs: NM_000670.3, ENSG00000198099.1
  – XXX always retrieves the latest version; XXX.number retrieves that specific version
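The XXX.number convention is easy to handle in code. The following is a minimal sketch that assumes the version is whatever integer follows the final dot. The helper names `split_version` and `is_newer` are hypothetical, and NM_000670.4 is used purely as a made-up later revision.

```python
def split_version(identifier):
    """Split 'NM_000670.3' into ('NM_000670', 3).

    Identifiers without a numeric suffix are returned with version None.
    """
    base, sep, version = identifier.rpartition(".")
    if sep and version.isdigit():
        return base, int(version)
    return identifier, None


def is_newer(candidate, current):
    """True if both IDs name the same record and candidate is a later revision."""
    cand_base, cand_version = split_version(candidate)
    curr_base, curr_version = split_version(current)
    if cand_base != curr_base or cand_version is None or curr_version is None:
        return False
    return cand_version > curr_version


# Record the full versioned ID in your notes; use the bare ID to ask for "latest".
base, version = split_version("NM_000670.3")
```

Storing the versioned form alongside results makes it possible to tell later whether the sequence behind an analysis has since been revised.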
28
Converting between identifiers
Most databases understand an identifier from another database; some are better than others
  – Best to use UniProt
NEVER rely on common names
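Conceptually, conversion is a lookup through a cross-reference table with one source acting as the hub. The sketch below hard-codes the Human ADH4 identifiers from the earlier slide as a stand-in for a real mapping service. The table layout and the names `XREFS`, `build_index` and `to_hub` are assumptions for illustration only.

```python
# Hub accession -> identifiers in other sources (values from the ADH4 example).
XREFS = {
    "P08319": [
        "ADH4_HUMAN",
        "ENSG00000198099",
        "ENST00000423445",
        "ENSP00000397939",
        "NM_000670.3",
        "NP_000661.2",
    ],
}


def strip_version(identifier):
    """Drop a trailing '.number' so any revision maps to the same record."""
    base, sep, version = identifier.rpartition(".")
    return base if sep and version.isdigit() else identifier


def build_index(xrefs):
    """Build a reverse index from every known identifier to its hub accession."""
    index = {}
    for hub, others in xrefs.items():
        index[hub] = hub
        for other in others:
            index[strip_version(other)] = hub
    return index


def to_hub(identifier, index):
    """Return the hub accession for an identifier, or None if it is unknown."""
    return index.get(strip_version(identifier))


INDEX = build_index(XREFS)
```

Note that the common name ADH4 is deliberately absent from the table, so `to_hub("ADH4", INDEX)` returns None rather than guessing.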
29
Annotation
30
Standard Keywords and annotation
Problem: Many ways to name a gene
  – Reductase = oxidase = dehydrogenase
Gene Ontology Consortium [GO]
  – GO terms standardise naming
  – Note that errors may still occur in the assignment of terms
  – Found in RefSeq, SwissProt and most genome databases
GO browsers, e.g. AmiGO
32
Annotation Issues
Most annotation is by inference from homology/orthology
Domain-specific function rather than protein-specific function
Programs may give different answers; it is not possible for one source to store all possibilities
33
Distributed Annotation
Information on a protein or gene can be stored on multiple (often specialised) servers
  – Distributed Annotation System [DAS]
Data can be accessed using client software that checks these sites
  – Built into the ensEMBL website
34
DISTRIBUTED ANNOTATION SYSTEM
[Diagram: DAS clients, DAS servers, reference data]
35
DAS Servers
Free one in ensEMBL
I have my own in-house
  – Currently storing CpG islands
  – Speak to me if you need to use it
Found on DAS registry
  – Many DAS clients can look up this registry
36
Some DAS Clients
Dedicated web site
  – Dasty2 [Protein DAS]
Genome browsers
  – ensEMBL web site [Protein and Gene DAS]
Protein structure viewers
  – Spice [3D Protein DAS]
Alignment programs
  – Jalview [Protein and Gene DAS]
37
DAS
38
CpG island tracks
39
SPICE: a protein DAS client
40
Gotchas
Coordinate systems MUST match
  – Will change with each genome assembly
Use a single source only
  – e.g. UniProt for proteins
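The coordinate system gotcha can be guarded against in code by refusing to overlay features whose assembly differs from the reference. This is a hedged sketch, not part of any DAS client. The `Feature` class, the `overlay` function, the source names and the coordinates are all hypothetical; only the assembly names NCBI36 and GRCh37 refer to real human assemblies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Feature:
    assembly: str  # genome assembly the coordinates refer to
    chrom: str
    start: int
    end: int
    label: str


def overlay(reference_assembly, tracks):
    """Merge features from several sources onto one reference.

    tracks maps a source name to a list of Feature objects. A feature on a
    different assembly raises ValueError instead of being silently misplaced.
    """
    merged = []
    for source, features in tracks.items():
        for feature in features:
            if feature.assembly != reference_assembly:
                raise ValueError(
                    f"{source}: {feature.label} uses {feature.assembly}, "
                    f"expected {reference_assembly}"
                )
            merged.append((source, feature))
    return sorted(merged, key=lambda pair: (pair[1].chrom, pair[1].start))


# Hypothetical tracks: both sources agree on the assembly, so overlay succeeds.
tracks = {
    "genes": [Feature("GRCh37", "4", 2000, 3000, "gene_x")],
    "cpg_islands": [Feature("GRCh37", "4", 1500, 1900, "cpg_1")],
}
merged = overlay("GRCh37", tracks)
```

Failing loudly is the design choice here: a feature drawn at the right numbers on the wrong assembly looks plausible and is very hard to spot by eye.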
41
Linked databases
43
More key databases 1
Conserved Domains (see later seminar)
  – Interpro: hosts and names domains from multiple projects
  – Each domain often functionally annotated
  – More in following talk
Mendelian Inheritance in Man (at NCBI)
  – Phenotype-centric, literature curated
Population differences
  – dbSNP: polymorphisms (also in Ensembl)
  – PharmGkb: drug efficacy
44
More key databases 2
Microarray
  – EBI: ArrayExpress
  – NCBI: GEO
Pathway Databases
  – Reactome
  – KEGG
  – Cytoscape application
45
References
NCBI: www.ncbi.nlm.nih.gov
EBI: www.ebi.ac.uk
UniProt: http://beta.uniprot.org
Sanger Centre: www.sanger.ac.uk
ensEMBL: www.ensembl.org
Firefox and BioBar: www.mozilla.org
SPICE (DAS client): http://www.efamily.org.uk/software/dasclients/spice/
Gene Ontology Consortium: www.geneontology.org
AmiGO: www.godatabase.org
GMOD: www.gmod.org
BioMoby: www.biomoby.org
Dasty2: http://www.ebi.ac.uk/dasty/
Interpro: http://www.ebi.ac.uk/interpro/
Reactome: http://www.reactome.org/
KEGG: http://www.genome.jp/kegg/kegg2.html
Cytoscape: http://www.cytoscape.org
PharmGkb: http://www.PharmGkb.org/