Scaling bio-analyses from computational clusters to grids
George Byelas, University Medical Centre Groningen, the Netherlands
IWSG-2013, Zürich, Switzerland, June 3rd, 2013

Content
- Bio-workflows
- Workflow modeling
- Job generation
- Tool deployment
- Data management
- Workflow execution
- Implementation details
- Recent developments
- Conclusion and further steps

Bio-workflows

Example: NGS alignment workflow
[Diagram: raw data samples from the HiSeq sequencer pass through the alignment workflow to result data. Per project: aligned reads, QC reports, SNP lists; 20 – 200 days of compute, 80 – 800 GB of data.]

Alignment & SNP calling workflow
- Input: analysis protocols, sample DNA data, reference DNA data
- Analysis: scripts are generated and executed
- Output: aligned DNA and QC reports
31 steps, ≥ 2 days per sample

An analysis job (script) generated from a protocol

#!/bin/bash
#PBS -q test
#PBS -l nodes=1:ppn=4
#PBS -l walltime=08:00:00
#PBS -l mem=6gb
#PBS -e $GCC/test_compute/projects/batch4/intermediate/test1/err/err_test1_BwaElement1A102a_FC81D90ABXX_L7.err
#PBS -o $GCC/test_compute/projects/batch4/intermediate/test1/out/out_test1_BwaElement1A102a_FC81D90ABXX_L7.out

mkdir -p $GCC/test_compute/projects/batch4/intermediate/test1/err
mkdir -p $GCC/test_compute/projects/batch4/intermediate/test1/out

printf "test1_BwaElement1A102a_FC81D90ABXX_L7_started " >> $GCC/test_compute/projects/batch4/intermediate/test1/log_test1.txt
date "+DATE: %m/%d/%y%tTIME: %H:%M:%S" >> $GCC/test_compute/projects/batch4/intermediate/test1/log_test1.txt
date "+start time: %m/%d/%y%t %H:%M:%S" >> $GCC/test_compute/projects/batch4/intermediate/test1/test1_BwaElement1A102a_FC81D90ABXX_L7.txt
echo running on node: `hostname` >> $GCC/test_compute/projects/batch4/intermediate/test1/test1_BwaElement1A102a_FC81D90ABXX_L7.txt

# analysis specific
/target/gpfs2/gcc/tools//bwa-0.5.8c_patched/bwa aln \
  /target/gpfs2/gcc/resources/hg19/indices/human_g1k_v37.fa \
  $GCC/test_compute/projects/batch4/rawdata/110121_I288_FC81D90ABXX_L7_HUMrutRGADIAAPE_1.fq.gz \
  -t 4 \
  -f $GCC/test_compute/projects/batch4/intermediate/A102a_110121_I288_FC81D90ABXX_L7_HUMrutRGADIAAPE_1.fq.gz.sai

printf "test1_BwaElement1A102a_FC81D90ABXX_L7_finished " >> $GCC/test_compute/projects/batch4/intermediate/test1/log_test1.txt
date "+finish time: %m/%d/%y%t %H:%M:%S" >> $GCC/test_compute/projects/batch4/intermediate/test1/test1_BwaElement1A102a_FC81D90ABXX_L7.txt
date "+DATE: %m/%d/%y%tTIME: %H:%M:%S" >> $GCC/test_compute/projects/batch4/intermediate/test1/log_test1.txt
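A generated script like this is an ordinary PBS job. A minimal sketch of how it would be submitted and followed on the cluster, assuming the standard PBS/Torque client commands (the script file name is illustrative):

qsub test1_BwaElement1A102a_FC81D90ABXX_L7.sh   # submit; prints a job id
qstat -u $USER                                  # check the queue state of our jobs
tail -f $GCC/test_compute/projects/batch4/intermediate/test1/log_test1.txt   # follow the job log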

Imputation workflow
[Chart: number of jobs per workflow step in one imputation run]

Bio-workflow complexity
- Many analysis steps
- Many analysis jobs
- Different analysis tools and their dependencies
- Large and varied data volumes
- Heterogeneous compute resources

Workflow design and generation

MOLGENIS approach
Model → Generate → Use
[Diagram: models of analyses, species, projects, ... are fed to the generator, which produces usable applications]

Workflow design
- Jobs are generated from the model
- Every job has an analysis target (e.g. a genome region)

Command-line generator (IWSG-2012)
- Generates jobs (scripts) from a model described in files
- Suitable for workflows (PBS cluster) and single jobs (gLite grid)

Database solution with the MOLGENIS software toolkit (1)
[Diagram: a model (xml) is fed to the generator (java), which produces the Molgenis/compute workflow application, used via the web by e.g. Animal Observatory, NextGenSeq and model-organism projects]

Database solution with the MOLGENIS software toolkit (2)
Model: workflow.xml / 100 loc, ui.xml / 25 loc
Generated: *.sql / 1722 loc, *.java / … loc

Workflow design view in the generated Molgenis web-UI
[Screenshot: workflow steps with their analysis protocols and links to previous steps]

Workflow run-time view (analysis jobs)

Failed jobs overview
[Screenshot of the job monitor, showing for a failed job:]
Running on node: v33-45.gina.sara.nl
Error: terminate called after throwing an instance of 'std::bad_alloc'
  what(): St9bad_alloc
How much memory: virtual memory (kbytes, -v)
chr: 4 from: to:
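The "how much memory" line suggests the job wrapper records the node's virtual-memory limit so that out-of-memory failures like the std::bad_alloc above can be diagnosed. A minimal sketch of such diagnostic logging, assuming plain bash (the log file name is illustrative):

LOG=job_diagnostics.log                          # illustrative log file
echo "Running on node: $(hostname)" >> $LOG      # which worker node ran the job
echo "How much memory: virtual memory (kbytes, -v) $(ulimit -v)" >> $LOG
# a C++ tool that exceeds this limit typically dies with std::bad_alloc,
# which is exactly what the failed-jobs overview above shows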

Workflow deployment

Computational environments: ease of use vs. redundancy & scale
- Tool environment: local compute/cloud → cluster compute → (inter)national grid
- Data environment: local storage/cloud → cluster storage → distributed grid storage

“Harmonized” tool management: simple download vs. on-site build deployment
- Tool in input sandbox, “getFile(‘tool.zip’)”: downloaded into $WORKDIR
- Tool deployed as “load module”: downloaded, built and configured in $VO_BBMRI_NL_SW_DIR

‘Harmonized’ tool management: modules
- Built using the standard ‘modulecmd’ tool
- Software has to be deployed at all grid sites
- The module file has to be added at all sites
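For illustration, a minimal sketch of how such a module is used inside a job, assuming the standard Environment Modules commands; the module name matches the bwa version used elsewhere in these slides:

module avail bwa                # list the bwa versions deployed at this site
module load bwa/0.5.8c_patched  # put the patched bwa build on the PATH
which bwa                       # verify that the tool now resolves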

Workflow execution

Execution topology
[Diagram: started pilot jobs retrieve analysis jobs from the Molgenis server]

Workflow execution with pilots (1)
The pilot is submitted via gLite; once started, it reports to the server and receives an analysis script:

glite-wms-job-submit \
  -d $USER … $HOME/maverick.jdl

curl … -F status=started \
  -F backend=ui.grid.sara.nl \
  > script.sh
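The contents of maverick.jdl are not shown in the slides; a minimal sketch of what such a pilot JDL could look like, assuming standard gLite JDL attributes (all file names are illustrative):

cat > $HOME/maverick.jdl <<'EOF'
[
  Executable    = "maverick.sh";
  StdOutput     = "pilot.out";
  StdError      = "pilot.err";
  InputSandbox  = {"maverick.sh"};
  OutputSandbox = {"pilot.out", "pilot.err"};
]
EOF
# submit, assuming a proxy was delegated earlier with:
#   glite-wms-job-delegate-proxy -d $USER
glite-wms-job-submit -d $USER $HOME/maverick.jdl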

Workflow execution with pilots (2)
The pilot runs the analysis script in the background and reports completion:

bash -l script.sh 2>&1 \
  | tee -a log.log &

curl … -F status=done \
  -F …

Workflow execution with pilots (3)
A heartbeat loop tells the server whether the analysis process is still alive:

while [ 1 ]; do
  …
  check_process "script.sh"
  CHECK_RET=$?
  if [ $CHECK_RET -eq 0 ]; then
    …
    curl … -F status=nopulse \
      -F …
    …
  elif …
    curl … -F status=pulse \
      -F …
  …
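For completeness, a minimal self-contained sketch of such a pilot heartbeat, assuming plain bash; check_process, the server URL, the form fields and the sleep interval are illustrative stand-ins for the elided parts above, not the actual Molgenis pilot:

#!/bin/bash
SERVER=https://example.org/api/pilot        # illustrative server endpoint

check_process() {                           # 0 if no matching process, 1 if alive
  pgrep -f "$1" > /dev/null && return 1 || return 0
}

bash -l script.sh 2>&1 | tee -a log.log &   # run the analysis, keep a log

while true; do
  check_process "script.sh"
  if [ $? -eq 0 ]; then                     # analysis no longer running
    curl "$SERVER" -F status=nopulse -F log=@log.log
    break
  else                                      # still alive: send a heartbeat
    curl "$SERVER" -F status=pulse -F backend=$(hostname)
  fi
  sleep 60                                  # heartbeat interval (illustrative)
done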

Back-end independent analysis templates

Template structure

//header
#MOLGENIS walltime=15:00 nodes=1 cores=4 mem=6
//tool management
module load bwa/${bwaVersion}
//data management
getFile ${indexfile}
getFile ${leftbarcodefqgz}
//template of the actual analysis
bwa aln \
  ${indexfile} ${leftbarcodefqgz} \
  -t ${bwaaligncores} -f ${leftbwaout}
//data management
putFile ${leftbwaout}

Data transfer
getFile and putFile are back-end specific: currently they either check that the files are present (cluster or localhost) or do an srm/lfn file transfer (grid).

Input:
getFile ${studyInputPedMapChr}.map

Generated output:
getFile $WORKDIR/groups/gonl/projects/imputationBenchmarking/eQtl/hapmap2r24ceu/chr20.map
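A minimal sketch of what a back-end specific getFile could look like on the grid back-end, assuming gLite's lcg-cp client for the srm/lfn transfer; the function body is illustrative, not the actual Molgenis implementation:

getFile() {
  local src="$1"
  local name="$(basename "$src")"
  if [ -f "$src" ]; then                 # cluster/localhost: file already present
    ln -sf "$src" "$name"                # just make it visible in the work dir
  else                                   # grid: stage in via the logical file name
    lcg-cp "lfn:$src" "file:$PWD/$name"
  fi
}

# usage, as in the generated script below:
# getFile $WORKDIR/resources/hg19/indices/human_g1k_v37.fa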

Generated back-end independent script

//header
#MOLGENIS walltime=15:00 nodes=1 cores=4 mem=6
//tool management
module load bwa/0.5.8c_patched
//data management
getFile $WORKDIR/resources/hg19/indices/human_g1k_v37.fa
getFile $WORKDIR/groups/gcc/projects/cardio/run01/rawdata/121128_SN163_0484_AC1D3HACXX_L8_CAACCT_1.fq.gz
//template of the actual analysis
bwa aln \
  human_g1k_v37.fa 121128_SN163_0484_AC1D3HACXX_L8_CAACCT_1.fq.gz \
  -t 4 \
  -f 121128_SN163_0484_AC1D3HACXX_L8_CAACCT_1.bwa_align.human_g1k_v37.sai
//data management
putFile $WORKDIR/groups/gcc/projects/cardio/run01/results/121128_SN163_0484_AC1D3HACXX_L8_CAACCT_1.bwa_align.human_g1k_v37.sai

Current developments

Pilots dashboard
[Screenshots: the dashboard during execution, and after the workflow has completed]

Dashboard for job monitoring (work in progress)

Enhancements and further steps
- What if not all parameters are known at generation time? Run-time parameters are passed to the DB from previous steps (implemented)
- Advanced pilot management: re-using pilots
- Better workflow visualization: showing workflow elements and their properties

H. Byelas and M. Swertz, “Visualization of bioinformatics workflows for ease of understanding and design activities,” Proceedings of the BIOSTEC BIOINFORMATICS-2013 conference, pp. 117–123, 2013.

Conclusion

- One protocol template style suitable for different back-ends
- Workflow tool deployment using the module system
- Data management hidden in the scripts
- Workflow execution using pilot jobs

All available as open source. Thank you! Questions?