EBI is an Outstation of the European Molecular Biology Laboratory. Every genome deserves a home Dan Lawson EMBL-EBI.

Presentation transcript:

EBI is an Outstation of the European Molecular Biology Laboratory. Every genome deserves a home Dan Lawson EMBL-EBI

Disclosure - my background. VectorBase: NIAID-funded Bioinformatic Resource Center focused on arthropod vectors of human pathogens; collaborates with sequencers and community on 1° (primary) annotation; community resource, ‘one stop shop’. Ensembl Genomes: extending Ensembl across taxonomic space; 5 taxonomic portals to present genome assemblies and annotation; integrated resource for cross-species interrogation.

Find a home for every genome

Every genome deserves a home. Sequencing the genome of your favourite species is a beginning. You will want to make your genome useful to your group/community and useful to other communities. You will (hopefully) want to update/improve: the assembly (new sequencing technologies, mapping strategies); gene predictions (new models, correct existing models, delete unsupported models); gene annotation (add gene names/symbols, descriptions); data richness (new high-throughput datasets, xrefs to relevant resources).

Finding a home for every genome. All genomes deserve a home: houses, apartments/flats, dormitories/barracks. Reference: Julian Parkhill, Ewan Birney and Paul Kersey, “Genomic information infrastructure after the deluge”, Genome Biology 2010, 11:402.

Anatomy of a home. Genome browser; similarity searches (BLAST/BLAT); query tools (simple keyword, complex queries); downloads. [Figure labels: Similarity searches, Query tool, Downloads, Browser, Compara]

Finding a home. Factors to take into account when choosing a home for your genome. Required functionality: data access (bulk download, tailored download, computational); visualization (genome browser); search (sequence based, simple keyword queries, complex queries); extendability for new data types (e.g. NGS transcriptomics, variation). Resources required for maintenance: compute/servers; staff (with appropriate skills).

Tier 2 databases: VectorBase. One of 4 NIAID Bioinformatics Resource Centers; an integrated genomic resource for arthropod vectors of human pathogens; a collaboration of 3 European and 3 US institutes. VectorBase is: both service provider and content generator; a collator of genomic information; a genome annotation group (gene structure prediction); a provider of tools for browsing and data mining vector genomes; a helpdesk for community queries; responsible for data submissions to the public archival databases; committed to regular release cycles (5-6 releases per year).

VectorBase highlights 2012. Website orientated around data rather than species; consolidation of legacy sections; faceted universal search; scalable handling of organism, strain, assembly and gene set; Ensembl genome browser; extensive user data upload facilities; more species; Community Annotation Portal overhaul.

Tier 3 databases: Ensembl Genomes

Ensembl Genomes release 18: 43 species. Stakeholders: VectorBase, FlyBase, WormBase, BeetleBase, Hymenoptera Genome Database. Other highlights: Lepidoptera (3 spp., one to come); sole location of a number of arthropod genomes.

Ensembl Genomes - home analogy. Integration into the Ensembl relational database schema; genome browser; data centric views; downloads; similarity searches (BLAST/BLAT); comparative analysis with other species; programmatic access (Perl API); BioMart query tool; data consistency across species.

Benefits of inclusion in Ensembl Genomes. Integration with a wide range of other species. Ability to include other data types: variation, functional genomics, alignments. Community data sets (configuration of site): BAMs (RNA-seq, re-sequencing); VCFs (SNPs, CNVs); wiggle plots for regulatory elements/ChIP-Seq etc. User addition of data sets (temporary visualization). Downstream usage by 3rd party tools/analyses.

Choosing a solution. Look at existing solutions. “Off the shelf”: Generic Model Organism Database project; Ensembl. “Roll your own”: Content Management Systems (Drupal); wikis (many flavours).

Publicise your resource. Meetings; mailing lists; publication (NAR Database issue); a little bit of SEO (Google/Bing etc.); social media.

Make your data available in common formats. Just as we use a lingua franca to communicate between nationalities, we use the same in sharing data. Sequences: FASTA format. Assembly: AGP (Golden Path), GenBank. Annotation: GFF3 (Generic Feature Format version 3), Sequence Ontology.
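
As an illustration of the formats named on this slide, here is a minimal Python sketch that writes one FASTA record and one GFF3 feature line. The function names, the `example` source label and the identifiers are hypothetical; only the FASTA header convention and the nine tab-separated GFF3 columns are taken from the formats themselves, and the feature type is assumed to be a Sequence Ontology term such as `gene`.

```python
def fasta_record(seq_id, sequence, width=60):
    """Return one FASTA record: a '>' header line, then fixed-width sequence lines."""
    chunks = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join([f">{seq_id}"] + chunks) + "\n"


def gff3_line(seqid, source, feature_type, start, end, strand, attributes,
              score=".", phase="."):
    """Return one GFF3 feature line (nine tab-separated columns).

    Coordinates are 1-based and inclusive; feature_type is assumed to be a
    Sequence Ontology term (for example 'gene' or 'mRNA').
    """
    if start < 1 or end < start:
        raise ValueError("GFF3 coordinates are 1-based with start <= end")
    attribute_text = ";".join(f"{key}={value}" for key, value in attributes.items())
    columns = [seqid, source, feature_type, str(start), str(end),
               score, strand, phase, attribute_text]
    return "\t".join(columns)


if __name__ == "__main__":
    # Hypothetical scaffold and gene identifiers, for illustration only.
    print(fasta_record("scaffold_1", "ACGTACGTACGT", width=8), end="")
    print("##gff-version 3")
    print(gff3_line("scaffold_1", "example", "gene", 1, 12, "+", {"ID": "gene0001"}))
```

The sketch does not escape reserved characters in attribute values, which a real GFF3 writer would need to do.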

Bulk downloads are not an afterthought... The provision of data as bulk downloads should not be an afterthought for your project. Make data available in common formats. Be responsive to community needs (in terms of alternative formats, other data types). Run quality assurance over the download files: completeness, within files and across files. ‘Round trip’ data where possible: “I have a dream”.
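
One way to picture an "across files" check is to confirm that every feature in an annotation file refers to a sequence that exists in the matching FASTA file and lies within its length. The Python sketch below is an assumption about what such a check could look like, not a description of any database's actual pipeline; the function names are hypothetical, and it ignores an embedded `##FASTA` section and does not guard against non-numeric coordinates.

```python
def read_fasta_lengths(fasta_text):
    """Map each sequence ID (first word of the header) to its sequence length."""
    lengths = {}
    current = None
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            current = line[1:].split()[0]
            lengths[current] = 0
        elif current is not None:
            lengths[current] += len(line)
    return lengths


def check_gff3_against_fasta(gff3_text, lengths):
    """Return a list of human-readable problems; an empty list means no issues found."""
    problems = []
    for number, line in enumerate(gff3_text.splitlines(), start=1):
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, directives and comments
        columns = line.split("\t")
        if len(columns) != 9:
            problems.append(f"line {number}: expected 9 columns, found {len(columns)}")
            continue
        seqid, start, end = columns[0], int(columns[3]), int(columns[4])
        if seqid not in lengths:
            problems.append(f"line {number}: sequence {seqid} is not in the FASTA file")
        elif start < 1 or start > end or end > lengths[seqid]:
            problems.append(
                f"line {number}: coordinates {start}-{end} fall outside "
                f"{seqid} (length {lengths[seqid]})"
            )
    return problems
```

Running a check like this before each release is cheap, and it catches the common case where an assembly is updated but the annotation download is not.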

...but by far the most important thing is submission to the public archival databases.

Why submit to the public archival databases? Visibility: integration with the widest possible community; xrefs back to your resource. Longevity: funding for INSDC is always going to be more secure than your database. Accreditation. Publication: many funders and journals require submission prior to publication. NCBI/EBI/UCSC Browser agreement: only assemblies submitted to INSDC can be visualised through these resources. Personally, I don’t consider a genome to be in the public domain until it has been submitted to INSDC.

Submission makes you do a number of things. Requirement to conform to standards (some are mandatory, some advisory). Opportunity to capture metadata: Minimum Information about a Genome Sequence (MIGS). Encourages good practice: explicit nomenclature and versioning. Caveat: you need to make updates!

GenBank nomenclature: BioProject accessions, WGS accessions, assembly accessions.
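
The three accession types listed here have recognisably different shapes, which the Python sketch below uses to classify an identifier. The regular expressions are assumptions based on commonly documented patterns (a `PRJ` prefix for BioProject, `GCA_`/`GCF_` plus nine digits and a version for assemblies, four or six letters followed by digits for WGS) and should be checked against current INSDC documentation before being relied on. The function name and the example accessions are illustrative only.

```python
import re

# Assumed accession shapes; verify against current INSDC/NCBI documentation.
ACCESSION_PATTERNS = {
    "bioproject": re.compile(r"PRJ[DEN][A-Z]\d+"),
    "assembly": re.compile(r"GC[AF]_\d{9}\.\d+"),
    "wgs": re.compile(r"[A-Z]{4}\d{8,}|[A-Z]{6}\d{9,}"),
}


def classify_accession(accession):
    """Return 'bioproject', 'assembly', 'wgs', or None if no pattern matches."""
    for kind, pattern in ACCESSION_PATTERNS.items():
        if pattern.fullmatch(accession):
            return kind
    return None


if __name__ == "__main__":
    # Illustrative identifiers chosen only for their format.
    for example in ["PRJNA12345", "GCA_000001405.15", "AAAB01000000", "scaffold_1"]:
        print(example, classify_accession(example))
```

Note that the version suffix on an assembly accession (the `.15` above) is part of the explicit versioning that submission encourages.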

i5k BioProject at INSDC. We encourage communities to submit data to the appropriate public archival database (GenBank/ENA/DDBJ), Short Read Archive (SRA), etc. We encourage you to join us and add your project when submitting data to INSDC.

Encourage collaboration. “Many cooks spoil the broth” v “Many hands make light work”. Send your genome to school to learn. Encourage collaboration within your community. Encourage the next generation of researchers. Don’t be afraid to ask “experts” for specific help. Fort Lauderdale agreement: outcome from a 2003 meeting; the sequencing group reserves the right to publish; strike a balance between fair use (i.e. no pre-emptive publication) and early disclosure.

arthropodgenomes.org: > 600 registered users from 178 institutes worldwide; 30 community resources/databases; ≅ 800 species nominated by individuals, consortia, museums or societies.

Built around Person & Organism pages

Stakeholders - Databases. Outreach opportunity. Includes: species (living in this home); contact details for the project; contact details for the developers; references.

Stakeholders - Resources. Outreach opportunity. Includes: species (living in this home); contact details for the project; contact details for the developers; references.

Encourage collaboration

Finding “experts” from outside your community: genome papers, supplemental data.

Future challenges. Scaling bioinformatics infrastructure to deal with 1000s of genomes: centralised or federated models. Democratisation of genome analysis: “best practices” for genome assembly & annotation; metrics for assessing genome assemblies and annotations, e.g. Assemblathon. Facilitating and improving community involvement in genome projects, e.g. VectorBase Community Annotation Portal (CAP), WebApollo.
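
The slide does not name a specific assembly metric. As one widely used example, the Python sketch below computes N50: the length of the shortest contig or scaffold in the smallest set of longest sequences that together cover at least half of the assembly. The function name is hypothetical, and N50 is offered here only as an illustration of the kind of metric meant.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths."""
    if not lengths:
        raise ValueError("at least one sequence length is required")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


if __name__ == "__main__":
    # Nine hypothetical contigs totalling 54 bases.
    print(n50([2, 3, 4, 5, 6, 7, 8, 9, 10]))
```

N50 rewards contiguity but says nothing about correctness, which is one reason a single number is rarely enough to compare assemblies.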

Contact or