Simulation in a Distributed Computing Environment


Simulation in a Distributed Computing Environment
S. Guatelli (1), J. Moscicki (2), M.G. Pia (1)
(1) INFN Genova, Italy
(2) CERN, Geneva, Switzerland

Speed of Monte Carlo simulation
- Speed of execution is often a concern in Monte Carlo simulation
- There is often a trade-off between the precision of the simulation and the speed of execution
- Typical use cases:
  - semi-interactive response: detector design optimisation, oncological radiotherapy
  - very long execution time: high statistics simulation, high precision simulation
- Methods for faster simulation response:
  - fast simulation
  - variance reduction techniques (event biasing)
  - inverse Monte Carlo methods
  - parallelisation

Features of this study
- Geant4 application in a distributed computing environment:
  - architecture
  - implications on simulation applications
- Environments: PC farm, GRID
- Two use cases, from the Geant4 Advanced Examples:
  - semi-interactive response (brachytherapy)
  - high statistics (medical_linac)
- By-product: results for Geant4 medical applications
- Quantitative study; results to be submitted for publication

Requirements
- Architectural requirements:
  - transparent execution in sequential/parallel mode
  - transparent execution on a PC farm and on the Grid
- Semi-interactive simulation: Geant4 brachytherapy
  - execution time for 20 M events: 5 hours
  - goal: execution time of a few minutes
- High statistics simulation: Geant4 medical_linac
  - execution time for 10^9 events: ~10 days
  - goal: execution time of a few hours
- Reference: sequential mode on a Pentium IV, 3 GHz

Parallel mode: local cluster / GRID
- Both applications have the same computing model:
  - a job consists of a number of independent tasks, which may be executed in parallel
  - the result of each task is a small data packet (a few kB), which is merged as the job runs
- In a cluster, computing resources are used for parallel execution:
  - the user connects to a (possibly remote) cluster
  - the input data for the job must be available on the site
  - typically there is a shared file system and a queuing system
  - the network is fast
- GRID computing uses resources from multiple computing centres:
  - typically there is no shared file system
  - (parts of) the input data must be replicated on remote sites
  - the network connection is slower than within a cluster
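A minimal sketch of this computing model in plain Python, using multiprocessing as a stand-in for the distributed workers; this illustrates the job/task split, not DIANE's actual API:

```python
# Toy version of the computing model above: a job is split into independent
# tasks whose small result packets (here, a histogram) are merged as they
# complete. Illustration only, not DIANE code.
from multiprocessing import Pool
import random

N_EVENTS = 200_000   # toy job size (the real productions run 20 M events)
N_TASKS = 20         # independent tasks, executable in parallel

def run_task(task_id):
    """One task: simulate its share of events, return a small packet."""
    rng = random.Random(task_id)          # independent seed per task
    histo = [0] * 10                      # stand-in for a dose histogram
    for _ in range(N_EVENTS // N_TASKS):
        histo[int(rng.random() * 10)] += 1    # stand-in for the real physics
    return histo

def merge(total, packet):
    """Merging is cheap: each packet is only a few kB."""
    return [t + p for t, p in zip(total, packet)]

if __name__ == "__main__":
    result = [0] * 10
    with Pool() as pool:
        for packet in pool.imap_unordered(run_task, range(N_TASKS)):
            result = merge(result, packet)    # merged as the job runs
    print(sum(result), "events simulated")
```

Because the tasks are fully independent and the packets are small, the same split works whether the workers sit on one cluster or are scattered across GRID sites.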

Overview
- Architectural issues:
  - DIANE
  - how to dianize a Geant4 application
- Performance tests:
  - on a single CPU
  - on clusters
  - on the GRID
- Conclusions:
  - lessons learned
  - outlook
- Quantitative, documented results
- Publicly distributed: DIANE and the Geant4 application code

DIANE http://cern.ch/DIANE
- Master-Worker R&D project, developed by J. Moscicki; started in 2001 in CERN/IT with very limited resources
- Collaboration with Geant4 groups at CERN, INFN, ESA
- Successful prototypes running on LSF and EDG
- Prototype for an intermediate layer between applications and the GRID, hiding the complex details of the underlying technology
DIANE is a layer which provides applications with an easy and convenient way of executing in a distributed cluster environment.
Design goals:
- easy customisation and adaptability to different needs
- hide the details of the underlying technology (allow for easy migration)
- location independent: accessible from anywhere in the network
- limited to the Master-Worker model, which covers most of the typical jobs and needs in HEP
- parallel cluster processing, with easy fine tuning and customisation
- transparent use of GRID technology
- application independent

Practical example: Geant4 simulation with analysis
Distributed processing for Geant4 applications:
- task = N events; job = M tasks; the tasks may be executed in parallel
- each task produces a file with histograms/ntuples; the job result is the sum of the histograms produced by the tasks
- the task output is automatically combined (histograms are added, ntuples are appended)
Master-Worker model:
- the Client starts a job and gets the results
- the Master steers the execution of the job: it automatically splits the job and merges the results
- each Worker initialises the Geant4 application, executes macros and produces histograms
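The combination rule above can be made concrete with plain Python stand-ins (the actual analysis objects are AIDA/PI histograms and ntuples, not Python lists):

```python
# Histograms from two tasks are combined by adding bin contents; ntuples are
# row-wise event records, so they are combined by appending.
def combine_histograms(h1, h2):
    """Requires identical binning; merge by adding bin contents."""
    assert len(h1) == len(h2)
    return [a + b for a, b in zip(h1, h2)]

def combine_ntuples(n1, n2):
    """Each row is one event, so merging is concatenation."""
    return n1 + n2

task_a = {"histo": [3, 1, 4], "ntuple": [(0.5, 1.2)]}
task_b = {"histo": [2, 7, 1], "ntuple": [(0.9, 0.3), (1.1, 2.2)]}
job = {
    "histo": combine_histograms(task_a["histo"], task_b["histo"]),
    "ntuple": combine_ntuples(task_a["ntuple"], task_b["ntuple"]),
}
print(job)  # histo: [5, 8, 5]; ntuple: three events appended
```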

UML deployment diagram for a Geant4 simulation with DIANE
- Completely transparent to the user: the Geant4 application code is the same
- The G4Simulation class is responsible for managing the simulation:
  - random number seeds
  - Geant4 initialisation
  - macros to be executed in batch mode
  - termination
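The seed-handling duty deserves a note: each task must draw a distinct random seed, or parallel tasks would replay identical event sequences and the merged statistics would be wrong. A sketch with invented names (not the real G4Simulation interface):

```python
# Illustrative seed management per task; class and method names are
# hypothetical, not the actual DIANE/Geant4 API.
class G4SimulationSketch:
    def __init__(self, base_seed=12345):
        self.base_seed = base_seed

    def seed_for_task(self, task_id):
        # any scheme producing disjoint random streams works;
        # a simple large offset per task is shown here
        return self.base_seed + 1_000_003 * task_id

    def run_task(self, task_id, macro):
        seed = self.seed_for_task(task_id)
        # then: initialise Geant4, apply the seed, execute the macro in
        # batch mode, terminate cleanly (the steps listed on the slide)
        print(f"task {task_id}: seed={seed}, macro={macro}")

G4SimulationSketch().run_task(7, "run.mac")
```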

Development costs
Strategy to minimise the cost of migrating a Geant4 simulation to a distributed environment:
- the DIANE Active Workflow framework provides automatic communication/synchronisation mechanisms
- the application is "glued" to the framework using a small Python module
- in most cases no code changes to the original application are required
- load balancing and error recovery policies may be plugged in in the form of simple Python functions
- transparent adaptation to clusters/GRIDs, shared/local file systems, shared/private queues
Development/modification of the application code:
- the original source code is unmodified
- an interface class is added, which binds together the application and the Master-Worker framework
The application developer is shielded from the complexity of the underlying technology via DIANE.
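To give an idea of what such a glue module might look like, here is a hypothetical sketch; the hook names (split_job, merge_results, on_task_failure) are invented for illustration and are not DIANE's real interface:

```python
# Hypothetical glue module: the policies live in a few plain Python
# functions while the original Geant4 application stays unmodified.

def split_job(n_events, n_tasks):
    """Job-splitting policy: divide the events evenly among the tasks."""
    base, extra = divmod(n_events, n_tasks)
    return [base + (1 if i < extra else 0) for i in range(n_tasks)]

def merge_results(accumulated, task_output):
    """Merging policy: add this task's histogram into the running total."""
    return [a + t for a, t in zip(accumulated, task_output)]

def on_task_failure(task_id, attempts, error):
    """Error-recovery policy: resubmit a failed task once, then give up."""
    return "RESUBMIT" if attempts < 2 else "IGNORE"

print(split_job(20_000_000, 180)[:3])   # sizes of the first three tasks
```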

Test results
Performance of the execution of the dianized Brachytherapy example:
- test on a single CPU
- test on a dedicated farm (60 CPUs)
- test on a farm shared with other users (LSF, CERN)
- test on the GRID (LCG)
Tools and libraries:
- simulation toolkit: Geant4 7.0.p01
- analysis tools: AIDA 3.2.1 and PI 1.3.3
- DIANE 1.4.2
- CLHEP 1.9.1.2
- G4EMLOW 2.3

Overhead at initialisation/termination
Test on a single dedicated CPU (Intel Pentium IV, 3.00 GHz): execution via DIANE compared with sequential execution, running 1 event.
- Standalone application: 4.6 ± 0.2 s
- Application via DIANE, simulation only: 8.8 ± 0.8 s
- Application via DIANE, with analysis integration: 9.5 ± 0.5 s
Overhead: ~5 s, negligible in a high statistics job

Overhead due to DIANE
Test on a single dedicated CPU (Intel Pentium IV, 3.00 GHz): execution via DIANE compared with sequential execution.
[Plot: execution time, and its ratio to sequential execution, vs. number of events in the job]
The overhead of DIANE is negligible in high statistics jobs.

Farm: execution time and efficiency
- Dedicated farm: 30 identical bi-processors (Pentium IV, 3 GHz)
- Thanks to the Regional Operation Centre (ROC) Team, Taiwan, and to Hurng-Chun Lee (Academia Sinica Grid Computing Center, Taiwan)
- Load balancing: optimisation of the number of tasks and workers

Optimizing the number of tasks
- The job ends when all the tasks have been executed by the workers; the overall job time is therefore determined by the last worker to finish its last task
- If the job is split into a larger number of tasks, the chance that the workers finish their tasks at the same time is higher
[Plots: time (seconds) per worker for two jobs; one whose balancing can be improved from a performance point of view, and one example of good job balancing]
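A toy self-scheduling model (our illustration, not a measurement from the tests) shows why more tasks help: a free worker always has another task to pull, so the workers finish closer together and the last task ends sooner.

```python
# Greedy self-scheduling with task durations that fluctuate by +/-50%:
# more, smaller tasks shrink the makespan toward the ideal value.
import random

def makespan(n_tasks, n_workers, total_work=1000.0, seed=1):
    rng = random.Random(seed)
    task_len = total_work / n_tasks
    busy_until = [0.0] * n_workers
    for _ in range(n_tasks):
        w = min(range(n_workers), key=busy_until.__getitem__)  # next free worker
        busy_until[w] += task_len * rng.uniform(0.5, 1.5)      # variable duration
    return max(busy_until)   # the job ends when the last worker finishes

for n_tasks in (30, 60, 180, 360):
    print(f"{n_tasks:4d} tasks on 30 workers -> makespan {makespan(n_tasks, 30):6.1f}")
```

The trade-off noted in the lessons learned still applies: each task also pays a fixed overhead, so the splitting cannot be pushed arbitrarily far.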

Farm shared with other users
- Real-life case: a farm shared with other users
- Execution in parallel mode on 5 workers of CERN LSF, with DIANE used as the intermediate layer
- Preliminary! The load of the cluster changes quickly in time, so the conditions of the test are not reproducible
- Highly variable performance

Parallel execution in a PC farm
- Required production for Brachytherapy: 20 M events
- 20 M events in sequential mode: 16646 s (~4 h 37 min) on an Intel Pentium IV, 3.00 GHz
- The same simulation runs in ~5 minutes in parallel on 56 CPUs: appropriate for clinical usage
- Similar results for the Geant4 medical_linac Advanced Example:
  - sequential execution requires ~10 days to obtain significant results
  - production can become compatible with usage for the verification of IMRT treatment planning
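Taking the quoted ~5 minutes as 300 s, a back-of-the-envelope check (ours, not a reported measurement) shows that the scaling is close to linear:

$$ S = \frac{T_{\text{seq}}}{T_{\text{par}}} \approx \frac{16646\ \text{s}}{300\ \text{s}} \approx 55, \qquad \varepsilon = \frac{S}{N_{\text{CPU}}} \approx \frac{55}{56} \approx 0.99 $$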

Running on the Grid (LCG)
- G4Brachy executed on the GRID (LCG), on nodes located in Spain, Russia, Italy, Germany and Switzerland
- Conditions of the test: the load of the GRID changes quickly in time, so the conditions are not reproducible
- Efficiency: evaluating the efficiency with the same criterion as on a dedicated farm does not make much sense in this context; instead, we study the "efficiency" of DIANE as automated job management w.r.t. manual submission through simple scripts

Test results
[Plots: time (seconds) per worker, for execution on the GRID through DIANE (20 M events, 180 tasks, 30 workers) and for execution on the GRID without DIANE]
Through DIANE:
- all the tasks are executed successfully, on 22 workers
- not all the workers are initialised and used: ongoing investigation
Without DIANE:
- 2 jobs were not executed successfully, due to set-up problems of the workers

How the GRID load changes
- Execution time of Brachytherapy in two different conditions of the GRID, with DIANE used as the intermediate layer
- 20 M events, 60 workers initialised, 360 tasks
[Plots: time (seconds) per worker for the two runs]
Very different results!

Farm/GRID execution
Preliminary indication (the conditions are not reproducible):
Brachy, 20 M events, 180 tasks
- Taipei cluster: 29 machines, 734 s (~12 minutes)
- GRID: 27 machines, 1517 s (~25 minutes)

Lessons learned
- DIANE as an intermediate layer:
  - transparency
  - good separation of the subsystems
  - good management of CPU resources
  - negligible overhead
- Load balancing:
  - a relatively large number of tasks increases the efficiency of parallel execution in a farm
  - trade-off between the optimisation of task splitting and the overhead introduced
- Farm:
  - controlled and real-life situations are quite different
  - a dedicated farm is needed for critical usage (e.g. in a hospital)
- Grid:
  - a highly variable environment, not yet mature for critical usage
  - automated management through a smart system is mandatory
  - work in progress; details still to be understood quantitatively

Conclusions
- General approach to the execution of a Geant4 simulation in a distributed computing environment:
  - transparent sequential/parallel execution
  - transparent execution on a local farm or on the Grid
  - the user code is the same
- Quantitative, documented results:
  - a reference for users and for further improvement
  - ongoing work to understand the details