for EGI/EUDAT EMBL/ELIXIR use-cases
Tony Wildish
What is EMBL-EBI?
- Europe’s home for biological data services, research and training
- A trusted data provider for the life sciences
- Part of the European Molecular Biology Laboratory, an intergovernmental research organisation
- International: 570 members of staff from 57 nations
- Home of the ELIXIR Technical hub
A distributed data infrastructure for Europe
- EMBL-EBI is a founding member of ELIXIR: Europe’s distributed research infrastructure for biological information
- Mission: to support life science research and its translation to medicine, the environment, the bioindustries and society
- ELIXIR Nodes represent centres of excellence throughout Europe
Data resources available from EMBL-EBI
- Genes, genomes & variation: European Nucleotide Archive, European Variation Archive, European Genome-phenome Archive, Ensembl, Ensembl Genomes, GWAS Catalog, Metagenomics portal
- Gene, protein & metabolite expression: RNA Central, Array Express, Expression Atlas, Metabolights, PRIDE
- Protein sequences, families & motifs: InterPro, Pfam, UniProt
- Molecular structures: Protein Data Bank in Europe, Electron Microscopy Data Bank
- Chemical biology: ChEMBL, ChEBI
- Reactions, interactions & pathways: IntAct, Reactome, MetaboLights
- Systems: BioModels, Enzyme Portal, BioSamples
- Literature & ontologies: Europe PubMed Central, Gene Ontology, Experimental Factor Ontology
ELIXIR: Driven by 4 scientific use-cases
- Marine Metagenomics
- Genomic & Phenotypic data for Crop and Forest plants
- Rare Diseases
- Human Genetic Data
  - Will not start with human data due to security constraints
- All scientific use cases require either private or public data sets to be replicated from the source or between analysis sites
Use-case characteristics
- Data volumes from tens to several hundreds of GB per month
- Human data likely to be the largest in volume and traffic
- Replication between a handful of sites
- Periodic updates to reference datasets => metadata handling needed to describe datasets consistently
- Download of smaller subsets for individual analyses
- End-users widely distributed
Use-case characteristics
- Metadata replication not a target for the pilot
  - Complex, domain-specific, well established
  - No clear gain in replicating it at this time
- Decouple dataset-description metadata from file-location and transfer metadata
  - Allow file-distribution to be explored and understood without digging into details of what the data is about
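The decoupling described above can be sketched as a small replica catalogue that records only where files are and how to verify them, keyed by a dataset ID it treats as an opaque string. This is a minimal illustration, not the pilot's design: the class names, fields and site labels (`FileReplica`, `ReplicaCatalogue`, "EBI", "CESNET") are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class FileReplica:
    """File-location/transfer metadata only; says nothing about content."""
    path: str        # path relative to the dataset root
    size_bytes: int
    checksum: str    # hex digest used to verify a transfer
    site: str        # hypothetical site label, e.g. "EBI"


@dataclass
class ReplicaCatalogue:
    """Maps an opaque dataset ID to the replicas of its files.

    The catalogue never interprets the ID: what the dataset describes
    stays with the source archive's own metadata.
    """
    replicas: Dict[str, List[FileReplica]] = field(default_factory=dict)

    def register(self, dataset_id: str, replica: FileReplica) -> None:
        self.replicas.setdefault(dataset_id, []).append(replica)

    def sites_holding(self, dataset_id: str) -> List[str]:
        """Return the sorted list of sites holding any file of the dataset."""
        return sorted({r.site for r in self.replicas.get(dataset_id, [])})

    def bytes_at(self, dataset_id: str, site: str) -> int:
        """Return the total size of the dataset's files held at one site."""
        return sum(r.size_bytes
                   for r in self.replicas.get(dataset_id, [])
                   if r.site == site)


catalogue = ReplicaCatalogue()
catalogue.register("ds-0001", FileReplica("a.fa", 100, "c1", "EBI"))
catalogue.register("ds-0001", FileReplica("b.fa", 250, "c2", "EBI"))
catalogue.register("ds-0001", FileReplica("a.fa", 100, "c1", "CESNET"))
```

Because nothing in the catalogue depends on what "ds-0001" means, file distribution can be tested with any set of files.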
Use-case characteristics
- Subscription-based model
  - Datasets subscribed to a destination, new versions distributed automatically as they become available
  - Need to understand metadata requirements to allow this
- Need an opaque ID for data that can be shared between EBI and EUDAT/EGI to identify dataset versions
  - Rely on EBI source archive for determining what the ID represents
- File-transfer system needs to handle overlapping datasets (partial updates to existing datasets)
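One way to picture the subscription model with partial updates is the sketch below. It assumes a dataset version can be described as a manifest mapping relative path to checksum, so that only new or changed files are sent to each subscribed destination. The manifest format, the `SubscriptionRegistry` class and the destination labels are illustrative assumptions, not part of the prototype.

```python
from typing import Dict, List, Set

# Assumed representation of a dataset version: {relative path: checksum}.
Manifest = Dict[str, str]


def files_to_transfer(new_version: Manifest, held: Manifest) -> List[str]:
    """Return paths that are missing or have a different checksum at the destination."""
    return sorted(path for path, checksum in new_version.items()
                  if held.get(path) != checksum)


class SubscriptionRegistry:
    """Tracks which destinations are subscribed to which dataset."""

    def __init__(self) -> None:
        self._subs: Dict[str, Set[str]] = {}

    def subscribe(self, dataset: str, destination: str) -> None:
        self._subs.setdefault(dataset, set()).add(destination)

    def plan_update(self, dataset: str, new_version: Manifest,
                    held_by: Dict[str, Manifest]) -> Dict[str, List[str]]:
        """For each subscriber, list the files a new version requires."""
        return {dest: files_to_transfer(new_version, held_by.get(dest, {}))
                for dest in sorted(self._subs.get(dataset, ()))}


v1 = {"a.fa": "c1", "b.fa": "c2"}
v2 = {"a.fa": "c1", "b.fa": "c3", "c.fa": "c4"}  # b.fa changed, c.fa added

registry = SubscriptionRegistry()
registry.subscribe("ds-0001", "CESNET")
registry.subscribe("ds-0001", "Amazon")

# CESNET already holds v1; Amazon holds nothing yet.
plan = registry.plan_update("ds-0001", v2, {"CESNET": v1})
```

Here CESNET receives only the two files that differ from v1, while Amazon, which holds nothing, receives the full version.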
Initial prototype
- Standalone prototype, first investigate metadata issues
- Provide a flat list of files to transfer
- Use globus-connect endpoints & CLI to perform transfer
- Side-step issues with dependency on AAI
  - Switch to using AAI as soon as possible (ELIXIR, EGI, EUDAT)
- Currently works on EBI-Embassy, CESNET, and Amazon
- Integrate with ELIXIR portal
  - Allow data-discovery followed by subscription to ELIXIR/EGI/EUDAT destinations
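The "flat list of files" step might look like the following sketch, which expands relative paths into one source/destination pair per line. The one-pair-per-line layout, the function name and the example roots are assumptions for illustration; the exact input a given transfer tool expects should be checked against that tool's documentation.

```python
import posixpath
from typing import Iterable, List


def build_transfer_list(relative_paths: Iterable[str],
                        source_root: str,
                        dest_root: str) -> List[str]:
    """Return 'source destination' lines, one per file.

    Assumes POSIX-style paths without whitespace; absolute or
    whitespace-containing entries are rejected rather than guessed at.
    """
    lines = []
    for rel in relative_paths:
        if rel.startswith("/") or any(ch.isspace() for ch in rel):
            raise ValueError(f"unsupported path: {rel!r}")
        src = posixpath.join(source_root, rel)
        dst = posixpath.join(dest_root, rel)
        lines.append(f"{src} {dst}")
    return lines


# Hypothetical dataset roots on the source and destination endpoints.
transfer_lines = build_transfer_list(
    ["x/a.fa", "b.fa"], "/data/ds-0001", "/replica/ds-0001")
```

Keeping the list flat means the transfer step needs no knowledge of dataset structure beyond file paths.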
Summary
- Initial pilot to investigate issues
  - Data-description metadata out of scope for pilot
- File-distribution based on AAI from multiple providers
  - Start with globus-connect for simplicity, move to gridFTP once AAI is in place
- File-replica metadata to be handled by prototype
  - TBD: how to do this, tools, technologies…
- Integrate with ELIXIR cloud portal (under development)
- Early days, lots to learn...