MultiJob PanDA Pilot
Danila Oleynik, 28/05/2015


Overview
– Initial PanDA pilot concept & HPC
– Motivation
– PanDA Pilot workflow in a nutshell
– MultiJob Pilot in detail

Initial PanDA pilot concept & HPC
Pilot definition: «The Panda pilot is an execution environment used to prepare the computing element, request the actual payload (a production or user analysis job), execute it, and clean up when the payload has finished»
– One HPC limitation is the restricted number of jobs (pilots) that can be launched under one account (usually fewer than 10), although a single job may occupy a lot of resources (tens to hundreds of nodes)
– For the moment ATLAS has no payloads that can be executed on more than one node (MPI)

Motivation
– There is no way to get MPI ATLAS production payloads quickly
– HPC resources should be used as efficiently as possible: there is no gain in launching only a few PanDA jobs simultaneously when far more resources are available
– The potential output of a machine like Titan is comparable to at least a Tier-2 center
– A possible solution that allows a significant increase in HPC usage efficiency is launching a set of PanDA jobs together as one MPI job

PanDA Pilot workflow in a nutshell
The basic steps of the pilot workflow are:
– Retrieve job information
– Set up the environment
– Stage in input data
– Execute payloads
– Stage out output data and logs
During execution, the pilot monitors available disk resources and output files, and updates the PanDA server with the status of the PanDA job.
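The steps above can be sketched as one pilot cycle. This is purely illustrative: none of these class or method names are the real pilot API, and FakeServer is a hypothetical stand-in for the PanDA server.

```python
def run_pilot(server):
    """One pilot cycle over the basic steps; after retrieving the job,
    each step reports its status back to the server, as the slide
    describes. The real work of each step is elided."""
    job = server.get_job()                       # retrieve job information
    for step in ("setup", "stagein", "running", "stageout"):
        server.update_status(job, step)          # pilot keeps server updated
        # ... actual work of the step would go here ...
    server.update_status(job, "finished")
    return job

class FakeServer:
    """Minimal stand-in so the sketch runs without a real PanDA server."""
    def get_job(self):
        return {"id": 1, "status": "starting"}
    def update_status(self, job, status):
        job["status"] = status
```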

MultiJob Pilot
– The current MultiJob pilot is implemented with the same workflow and framework as the regular PanDA pilot
– Most core components and basic procedures of the regular pilot were modified to serve multiple jobs with different states
– The procedures for intercommunication between the runJob and Monitor processes were slightly redesigned (without changing the underlying technology)
– The current version was designed as a proof of concept

MultiJob Pilot. Requesting jobs
– For the moment there is no method on the PanDA server to retrieve a set of jobs
– The set of jobs is collected from the server in a cycle, one by one; each request takes ~1 s, so this will not scale well to a large number of jobs
– It is important to collect only jobs from a single task in one bunch, to avoid complications with environment setup later
– The number of requested jobs is fitted to the available backfill resources
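The one-by-one collection loop might look like the sketch below. The ~1 s per-request cost and the single-task constraint come from the slide; the function name and the `taskID` field are assumptions.

```python
def collect_jobs(get_job, max_jobs):
    """Request jobs from the server one at a time (each call is a full
    server round-trip, ~1 s), keeping only jobs that belong to the same
    task as the first job received."""
    jobs = []
    task_id = None
    for _ in range(max_jobs):
        job = get_job()                  # one server round-trip
        if job is None:                  # server has no more jobs
            break
        if task_id is None:
            task_id = job["taskID"]      # pin the bunch to the first task
        if job["taskID"] == task_id:
            jobs.append(job)             # keep only same-task jobs
    return jobs
```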

MultiJob Pilot. Environment setup and verification
– Environment setup is in most cases experiment-specific; it is organized for each job in the set
– Optimized by reducing the number of repeated identical checks
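The de-duplication of identical checks can be sketched with simple memoization; the check names here are illustrative, not the pilot's actual checks.

```python
import functools

calls = {"n": 0}  # counts how often a real check actually runs

@functools.lru_cache(maxsize=None)
def verify(check_name):
    """Stand-in for one expensive environment check; the cache makes
    repeated identical checks for later jobs in the set free."""
    calls["n"] += 1
    return True

# Five jobs in the set, each needing the same two checks:
for _ in range(5):
    assert verify("release_installed")
    assert verify("scratch_writable")
# Each distinct check ran only once, not once per job.
```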

MultiJob Pilot. StageIn
– Optimized by reducing the number of remote stage-ins when the data has already been copied locally (for another job in the set)
– This simple optimization gives a significant reduction of the overall stage-in time
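A minimal sketch of that optimization, assuming each job lists its input files by name; the function and field names are illustrative.

```python
def stage_in_bunch(jobs, remote_copy):
    """Stage in input files for a bunch of jobs, invoking the expensive
    remote copy at most once per distinct file; later jobs in the set
    reuse the local copy."""
    local = set()                        # files already on local disk
    for job in jobs:
        for name in job["input_files"]:
            if name not in local:        # remote transfer only if needed
                remote_copy(name)
                local.add(name)
    return local
```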

MultiJob Pilot. Payload execution
– The number of jobs is adjusted once more according to the backfill; jobs that do not fit are failed with the sub-status "rejected"
– PanDA jobs are launched as separate MPI ranks through a special wrapper
– The transformation name and input parameters are passed through a file
– CPU consumption time and the trf exit code are published in a rank report file
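The per-rank wrapper could look roughly like this: it reads the transformation name and parameters from a file, runs the payload, and writes the CPU time and trf exit code to a rank report file. The JSON file formats and function name are assumptions, not the pilot's actual formats.

```python
import json
import resource
import subprocess

def run_rank(params_path, report_path):
    """Run one PanDA job as a single MPI rank would: read the
    transformation and its arguments from a file, execute it, and
    publish the exit code and CPU time in a rank report file."""
    with open(params_path) as f:
        params = json.load(f)            # e.g. {"trf": "...", "args": [...]}
    proc = subprocess.run([params["trf"]] + params["args"])
    usage = resource.getrusage(resource.RUSAGE_CHILDREN)
    with open(report_path, "w") as f:    # read back by the pilot
        json.dump({"trf_exit_code": proc.returncode,
                   "cpu_time": usage.ru_utime + usage.ru_stime}, f)
    return proc.returncode
```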

MultiJob Pilot. StageOut
– Requires no special optimization for the moment, since it is not a time-critical operation on HPC
– Optimization will be revisited as the scale grows to hundreds of PanDA jobs launched simultaneously by one pilot

First results
– The MultiJob pilot was tested with jobs from a validated ATLAS production task; jobs were executed ( events generated)
– The scale was increased from 3 to 20 simultaneously launched jobs
– No significant increase in the execution time of simultaneously launched jobs was observed