Published by Anthony Patrick. Modified over 9 years ago.
1
EBI is an Outstation of the European Molecular Biology Laboratory. Every genome deserves a home Dan Lawson EMBL-EBI
2
Disclosure - my background VectorBase http://www.vectorbase.org NIAID-funded Bioinformatics Resource Center focused on arthropod vectors of human pathogens Collaborates with sequencers and community on 1° annotation Community resource, ‘One stop shop’ Ensembl Genomes http://www.ensemblgenomes.org Extending Ensembl across taxonomic space 5 taxonomic portals to present genome assemblies and annotation Integrated resource for cross-species interrogation
3
Find a home for every genome
4
Every genome deserves a home Sequencing the genome of your favourite species is a beginning You will want to make your genome: Useful to your group/community Useful to other communities You will (hopefully) want to update/improve: Assembly (new sequencing technologies, mapping strategies) Gene predictions (new models, correct existing models, delete unsupported models) Gene annotation (add gene names/symbols, descriptions) Data richness (new high-throughput datasets, xrefs to relevant resources)
5
Finding a home for every genome All genomes deserve a home Houses Apartments/Flats Dormitories/Barracks Genomic information infrastructure after the deluge Julian Parkhill, Ewan Birney and Paul Kersey Genome Biology 2010, 11:402 http://genomebiology.com/2010/11/7/402
6
Anatomy of a home Genome browser Similarity searches BLAST/BLAT Query tools Simple keyword Complex queries Downloads [Figure labels: Similarity searches, Query tool, Downloads, Browser, Compara]
7
Finding a home Factors to take into account when choosing a home for your genome Required functionality Data access (Bulk download, tailored download, computational) Visualization (Genome browser) Search (Sequence based, simple keyword queries, complex queries) Extendability for new data types (e.g. NGS transcriptomics, variation) Resources required for maintenance Compute/servers Staff (with appropriate skills)
8
Tier 2 databases: VectorBase One of 4 NIAID Bioinformatics Resource Centers Integrated genomic resource for arthropod vectors of human pathogens Collaboration of 3 European and 3 US Institutes VectorBase is: Both service provider and content generator A collator of genomic information A genome annotation group (gene structure prediction) A provider of tools for browsing and data mining vector genomes A helpdesk for community queries Responsible for data submissions to the public archival databases Committed to regular release cycles (5-6 releases per year)
9
VectorBase highlights 2012 Website orientated around data rather than species Consolidation of legacy sections Faceted universal search Scalable handling of: organism strain assembly gene set Ensembl genome browser Extensive user data upload facilities More species Community Annotation Portal overhaul
10
Tier 3 databases: Ensembl Genomes
11
Ensembl Genomes release 18 (http://metazoa.ensembl.org) 43 species Stakeholders: VectorBase FlyBase WormBase BeetleBase Hymenoptera Genome Database Other highlights Lepidoptera (3 spp. one to come) Sole location of a number of arthropod genomes
12
Ensembl Genomes - home analogy Integration into the Ensembl relational database schema Genome browser Data centric views Downloads Similarity searches (Blast/Blat) Comparative analysis with other species Programmatic access (Perl API) BioMart query tool Data consistency across species
13
Benefits of inclusion in Ensembl Genomes Integration with a wide range of other species Ability to include other data types Variation Functional genomics Alignments Community data sets (configuration of site) BAMs (RNA-seq, re-sequencing) VCFs (SNPs, CNVs) Wiggle plots for regulatory elements/ChIP-seq etc. User addition of data sets (temporary visualization) Downstream usage by 3rd party tools/analyses
14
Choosing a solution Look at existing solutions “Off the shelf” Generic Model Organism Database project (http://www.gmod.org/wiki/Main_Page) Ensembl (http://www.ensembl.org) “Roll your own” Content Management Systems (Drupal) Wikis (many flavours)
15
Publicise your resource Meetings Mailing lists Publication NAR Database issue a little bit of SEO Google/Bing etc. Social media
16
Make your data available in common formats Just as we use a lingua franca to communicate between nationalities, we use common formats when sharing data Sequences Fasta format http://www.ebi.ac.uk/help/formats.html Assembly AGP (Golden Path) GenBank http://www.ncbi.nlm.nih.gov/projects/genome/assembly/agp/AGP_Specification.shtml Annotation GFF3 (Generic Feature Format version 3) Sequence Ontology http://www.sequenceontology.org/gff3.shtml
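As an illustration of why common formats matter, here is a minimal Python sketch that reads FASTA records and parses one GFF3 feature line using only the standard library. The function names (`read_fasta`, `parse_gff3_line`) and the example identifiers are hypothetical and not part of the talk; real projects would more likely rely on an established parser, and this sketch ignores GFF3 directives, multi-valued attributes and embedded FASTA sections.

```python
from urllib.parse import unquote


def read_fasta(lines):
    """Return {sequence_id: sequence} from an iterable of FASTA lines."""
    records = {}
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The identifier is the first whitespace-delimited token of the header.
            header = line[1:].split()
            current = header[0] if header else ""
            records[current] = []
        elif current is not None:
            records[current].append(line)
    return {name: "".join(parts) for name, parts in records.items()}


def parse_gff3_line(line):
    """Parse one tab-delimited GFF3 feature line into a dictionary."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError(f"expected 9 columns, found {len(cols)}")
    seqid, source, ftype, start, end, score, strand, phase, attr_text = cols
    attributes = {}
    for pair in attr_text.split(";"):
        if "=" not in pair:
            continue
        key, value = pair.split("=", 1)
        # GFF3 percent-encodes reserved characters in attribute values.
        attributes[unquote(key.strip())] = unquote(value.strip())
    return {
        "seqid": seqid,
        "source": source,
        "type": ftype,
        "start": int(start),  # 1-based, inclusive
        "end": int(end),
        "strand": strand,
        "attributes": attributes,
    }
```

Because both formats are plain text with simple conventions, a consumer in any language can write an equivalent reader in a few lines, which is much of the argument for publishing in them.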
17
Bulk downloads are not an afterthought... The provision of data as bulk downloads should not be an afterthought for your project Make data available in common formats Be responsive to community needs (in terms of alternative formats, other data types) Run quality assurance over the download files Completeness Within files Across files ‘Round trip’ data where possible - “I have a dream”
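The "across files" completeness check mentioned on this slide could be sketched as follows. This is an assumption about what such a check might look like, not the procedure used by VectorBase or Ensembl Genomes: it verifies that every feature in a GFF3 download refers to a sequence present in the matching FASTA download and that its coordinates fall within that sequence. The function name and the input shapes are hypothetical.

```python
def check_annotation_against_sequences(seq_lengths, gff3_lines):
    """Return a list of problems found when comparing GFF3 features
    with the sequences they claim to annotate.

    seq_lengths: {sequence_id: length}, e.g. derived from the FASTA download.
    gff3_lines:  iterable of lines from the GFF3 download.
    """
    problems = []
    for number, line in enumerate(gff3_lines, start=1):
        # Skip blank lines, comments and GFF3 directives.
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            problems.append(f"line {number}: expected 9 columns, found {len(cols)}")
            continue
        seqid = cols[0]
        try:
            start, end = int(cols[3]), int(cols[4])
        except ValueError:
            problems.append(f"line {number}: non-numeric coordinates")
            continue
        if seqid not in seq_lengths:
            problems.append(f"line {number}: unknown sequence '{seqid}'")
        elif not 1 <= start <= end <= seq_lengths[seqid]:
            problems.append(
                f"line {number}: coordinates {start}-{end} outside "
                f"'{seqid}' (length {seq_lengths[seqid]})"
            )
    return problems
```

An empty result does not prove the files are correct, only that they are mutually consistent on these two points. A fuller quality-assurance pass would also check things such as parent/child links between features and that every sequence expected to carry annotation actually has some.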
18
but by far the most important thing is Submission to the public archival databases
19
Why submit to the public archival databases? Visibility Integration with the widest possible community xrefs back to your resource Longevity Funding for INSDC is always going to be more secure than your database Accreditation Publication Many funders and journals require submission prior to publication NCBI/EBI/UCSC Browser agreement Only assemblies submitted to INSDC can be visualised through these resources Personally - I don’t consider a genome to be in the public domain until it has been submitted to INSDC
20
Submission makes you do a number of things Requirement to conform to standards Some are mandatory, some advisory Opportunity to capture metadata Minimum information about a genome sequence (MIGS) Encourages good practice Explicit nomenclature and versioning Caveat that you need to make updates!
21
GenBank nomenclature BioProject accessions WGS accessions Assembly accessions
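To make the three accession types concrete, the sketch below classifies an identifier by its shape. The regular expressions are simplified assumptions based on commonly seen accessions (for example the BioProject `PRJNA163993`, which is the i5k project identifier in the NCBI URL later in the talk); they are not the official INSDC specifications, and newer WGS prefixes may be longer than the four letters assumed here.

```python
import re

# Simplified, assumed patterns; consult the INSDC documentation for the
# authoritative accession formats.
ACCESSION_PATTERNS = [
    # Assembly: GCA_ (GenBank) or GCF_ (RefSeq), nine digits, then a version.
    ("Assembly", re.compile(r"GC[AF]_\d{9}\.\d+")),
    # BioProject: PRJ plus a two-letter registry code and a number.
    ("BioProject", re.compile(r"PRJ[A-Z]{2}\d+")),
    # WGS: four-letter project prefix, two-digit version, contig number.
    ("WGS", re.compile(r"[A-Z]{4}\d{8,10}")),
]


def classify_accession(accession):
    """Return the assumed accession type, or None if no pattern matches."""
    for name, pattern in ACCESSION_PATTERNS:
        if pattern.fullmatch(accession):
            return name
    return None
```

For instance, `classify_accession("PRJNA163993")` returns `"BioProject"`. Explicit, versioned identifiers of this kind are what allow other resources to link back to a specific assembly rather than to "the genome" in general.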
22
i5k BioProject at INSDC We encourage communities to submit data to the appropriate public archival database: GenBank/ENA/DDBJ, the Sequence Read Archive (SRA), etc. We encourage you to join us and add your project when submitting data to INSDC http://www.ncbi.nlm.nih.gov/bioproject/163993
23
Encourage collaboration “Many cooks spoil the broth” v “Many hands make light work” Send your genome to school to learn Encourage collaboration within your community Encourage the next generation of researchers Don’t be afraid to ask “experts” for specific help Fort Lauderdale agreement Outcome from a 2003 meeting Sequencing group reserves right to publish Strike a balance between fair use (i.e. no pre-emptive publication) and early disclosure.
24
http://arthropodgenomes.org
25
arthropodgenomes.org > 600 registered users from 178 institutes worldwide 30 community resources/databases ≅ 800 species nominated by individuals, consortia, museums or societies
26
Built around Person & Organism pages
27
Stakeholders - Databases Outreach opportunity Includes species (living in this home) Contact details for the project Contact details for the developers References
28
Stakeholders - Resources Outreach opportunity Includes species (living in this home) Contact details for the project Contact details for the developers References
29
Encourage collaboration
30
Finding “experts” from outside your community Genome papers, supplemental data
31
Future challenges Scaling bioinformatics infrastructure to deal with 1000s of genomes Centralised or federated models Democratisation of genome analysis “Best practices” for genome assembly & annotation Metrics for assessing genome assemblies and annotations e.g. Assemblathon (http://assemblathon.org) Facilitating and improving community involvement in genome projects e.g. VectorBase Community Annotation Portal (CAP), WebApollo.
32
Contact lawson@ebi.ac.uk or bugadmin@arthropodgenomes.org