MMG: from proof-of-concept to production services at scale Lars Ailo Bongo (ELIXIR-NO, WP6) WP4 F2F, 8-9 February 2017, Stockholm, Sweden
MMG on ELIXIR compute clouds
Proof-of-concept:
- META-pipe on cPouta √
- EMG on Embassy cloud √
- Webinar: ELIXIR Compute Platform Roadmap
TODO:
- Test META-pipe and EMG at scale
- Deploy META-pipe and EMG as production services on cloud
- Document best practices
- Integrate META-pipe and EMG with other ELIXIR platforms
- Incorporate other MMG pipelines such as BioMaS
Issues:
- Missing policies: who is paying for resources? Which resources can different users use? …
- Missing technology: how to do accounting? How to ensure a stable service? …
Outline
- META-pipe:
  - User feedback
  - ELIXIR compute TUCs and other components used (need your help here)
  - Future plans
- Other MMG/WP6 activities
- EMG presentation to follow
META-pipe: analysis as a service √
1. Login
2. Upload data
3. Select analysis tool parameters
4. Execute analysis
5. Download results
META-pipe: architecture √
META-pipe: front-end technical solutions √
- Login: authorization server integrated with ELIXIR AAI
- Upload data: Incoming! web app library; META-pipe storage server
- Select analysis parameters: META-pipe web app
- Execute analysis: META-pipe job queue; META-pipe execution environment
- Download result
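The submit/execute split above can be sketched as a minimal in-process queue. This is illustrative only: the real META-pipe job queue is a separate service, and all names here (`submit`, `run_next`, the job fields) are hypothetical.

```python
import queue

# Hypothetical in-process sketch of the submit/execute split:
# the web app enqueues jobs, the execution environment consumes them.
jobs = queue.Queue()

def submit(user, dataset, params):
    """Called by the web front end after login and data upload."""
    job = {"user": user, "dataset": dataset, "params": params,
           "status": "queued"}
    jobs.put(job)
    return job

def run_next(execute):
    """Called by an execution-environment worker; `execute` stands in
    for handing the job to the analysis engine."""
    job = jobs.get()
    job["status"] = "running"
    job["result"] = execute(job)
    job["status"] = "done"
    return job
```

The point of the split is that the web app never blocks on an analysis: it records the job and returns, while workers drain the queue at whatever rate the execution environment allows.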
META-pipe: front-end policies
- Login: all ELIXIR users can log in, which gives (user, home institution)
  - Who can pay for the resources?
  - Who is allowed to use tools and resources (academic vs. industry)?
- Upload data: data size gives the computation requirements
  - Small jobs for free? Medium on pre-allocated resources? Large as special cases?
- Select analysis parameters / execute analysis:
  - Which resource to use? Who decides? Commercial clouds?
  - Scheduling/prioritization of jobs? Response-time guarantees?
  - Who is responsible for maintaining and monitoring resources?
- Download result: private vs. (eventually) public?
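One open question above is how data size should map to resources. A minimal sketch of such a tiering rule, with entirely hypothetical thresholds (the actual limits are exactly the unresolved policy question):

```python
def classify_job(upload_bytes,
                 small_limit=10 * 2**30,     # hypothetical: 10 GiB
                 medium_limit=100 * 2**30):  # hypothetical: 100 GiB
    """Map input data size to a resource tier.

    The thresholds are placeholders, not META-pipe policy.
    """
    if upload_bytes <= small_limit:
        return "small"   # e.g. run for free
    if upload_bytes <= medium_limit:
        return "medium"  # e.g. run on pre-allocated resources
    return "large"       # e.g. handled as a special case
```

Even a rule this simple forces the policy decisions to be explicit: who sets the limits, and who pays for each tier.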
META-pipe (and EMG): back-end layers (√)
- Pipeline tools & DBs: META-pipe
- Pipeline specification: Spark program
- Analysis engine: Spark, NFS
- Cloud setup: cPouta Ansible playbook
META-pipe: cloud execution
Pipeline tools & reference DBs:
- Mostly 3rd-party binaries
- Hundreds of GB of reference DBs
- Packaged in the META-pipe Jenkins server
- Not in a container/VM (no benefits for now)
- TODO: standardize description/provenance data reporting (WP4?)
- TODO: summarize best practices (WP4/?)
Spark program:
- Regular Spark program + abstractions/interfaces for running 3rd-party binaries
- TODO: better error detection, logging, and handling (WP6)
- TODO: more secure execution (WP6/WP4)
- TODO: accounting and payment (WP4)
- TODO: use our approach for other pipelines? (WP4)
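The TODO on better error detection and logging around 3rd-party binaries could start from a wrapper like the following. This is a sketch only; the real interface lives inside the Spark program and is not shown in these slides, and `run_tool` is a hypothetical name.

```python
import logging
import subprocess

def run_tool(cmd, stdin_data=None, timeout=None):
    """Run one third-party pipeline tool and fail loudly on error."""
    logging.info("running %s", " ".join(cmd))
    result = subprocess.run(cmd, input=stdin_data,
                            capture_output=True, timeout=timeout)
    if result.returncode != 0:
        # Surface stderr so per-tool failures are diagnosable from logs
        # instead of silently producing empty output downstream.
        raise RuntimeError(f"{cmd[0]} failed ({result.returncode}): "
                           f"{result.stderr[:200]!r}")
    return result.stdout
```

Inside the Spark program such a wrapper would be invoked per partition (e.g. from `mapPartitions`), so a failing binary aborts that task with a diagnosable error rather than corrupting the pipeline output.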
META-pipe: cloud execution
Spark, NFS execution environment:
- Standalone Spark
- NFS, since some tools need a shared file system
- TODO: optimize execution environments (WP6/WP4)
- TODO: test scalability (WP6/WP4)
- ?: integrate the META-pipe storage server with ELIXIR storage & transfer
cPouta Ansible playbook:
- Sets up the Spark and NFS execution environment on cPouta OpenStack
- Ongoing work: set up the execution environment on OpenNebula (CZ)
- TODO: port to other clouds (WP4?)
- TODO: provide best-practice guidelines (WP4)
- TODO: long-term maintenance of setup tools (?)
WP6 deliverables
The comprehensive metagenomics standards environment √
- Paper to be submitted on Friday
- Provenance of sampling standard
- Provenance of sequencing standard
- Provenance of analysis best practices
- Archiving of analysis discussion
Marine metagenomics portal (MMP) √
- https://mmp.sfb.uit.no/
- Marine reference databases (MarRef, MarDB, MarCat)
- META-pipe used to process data for MarCat
WP6 deliverables
MMG analysis pipelines: August 2018
- Test META-pipe and MMG at scale
- Deploy META-pipe and MMG on ELIXIR compute clouds
Evaluation of tools
- Synthetic benchmark metagenomes
- Federated search engine
Training and workshops
- Metagenomics data analysis, 3-6 April 2017, Helsinki, Finland
- Metagenomics data analysis, ?, ?, Portugal
BioMaS pipeline on INDIGO-DataCloud
- BioMaS is a taxonomic classification pipeline (ELIXIR-IT)
- Provided as an on-demand Galaxy instance
- Based on INDIGO-DataCloud
Pyttipanna (odds and ends)
- Who is the user of cloud services? Pipeline providers? End users?
- Data transfer vs. storage vs. AAI: 3 services, or 1 distributed file storage?
- EMG cloud proof-of-concept = plant use case
  - Set up VMs, transfer data, allow users to run analyses
Summary
- 2 MMG pipelines can be run on ELIXIR clouds
- Need resources to test at scale
- Need policies and TUCs (21 and 22) for production use of clouds
TUCs
TUC1/TUC3 (Federated ID/ELIXIR Identity):
- Give access to the service
- Get information needed for accounting and payment
TUC2 (Other ID):
- Give access to non-European academic users
TUC4 (Cloud IaaS services):
- Cloud providers that can run the execution environment
TUC5 (HTC/HPC cluster):
- Run batch jobs to produce reference databases
TUC6 (PRACE cluster):
- We do not need PRACE-scale resources
TUCs
TUC7 (Network file storage):
- Not provided (we set up NFS as part of the execution environment)
TUC8 (File transfer):
- Not needed (file transfer time is low)
TUC9/TUC11 (Infrastructure service directory/registry):
- Not needed
TUC10 (Credential translation):
TUC11 (Service access management):
- Needed to maintain user-submitted data
TUCs
TUC12/TUC13 (Virtual machine library/container library):
- We provide analysis as a service
- VMs/containers useful for visualization tools
TUC14 (Module library):
- We have META-pipe in a deployment server
TUC15 (Data set replication):
- Not needed (our datasets are small)
TUC17 (Endorsed…):
- User-submitted data management
TUC18 (Cloud storage):
- Replace the META-pipe storage server
TUCs
TUC19 (PID and metadata registry):
- Provide in reference databases?
TUC20/TUC23 (Federated cloud/HPC/HTC):
- Not exposed to our end users
TUC21 (Operational integration):
- Service availability monitoring is needed
TUC22 (Resource accounting):
- Very much needed