PanDA HPC integration. Current status. Danila Oleynik, BigPanDA F2F meeting, 13 August 2013.

Outline
- HPC access, architecture, specialty
- Current PanDA implementation
- PanDA architecture for Kraken, Titan
- Initial testing
- Next step: Pilot - SAGA integration

HPC specialty
- Kraken: Cray XT5 (access since the beginning of August); 9,408 nodes; each node: 12 cores, 16 GB RAM
- Titan: Cray XK7 (access request in process); 18,688 nodes; each node: 16 cores, 32 GB RAM (2 GB per core)
- Parallel file system shared between nodes
- Access only to interactive nodes (worker nodes have extremely limited connectivity)
- One-Time Password authentication
- Internal job management tool: PBS/TORQUE (see the submission sketch below)
- One job occupies a minimum of one node (12-16 cores)
- Limited number of jobs per user in the scheduler
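Since PBS/TORQUE allocates whole nodes and submission happens from the interactive node, a pilot would have to build and submit a batch script itself. Below is a minimal, hypothetical Python sketch of such a submission; the queue name, walltime and aprun options are illustrative assumptions, not the actual Kraken/Titan configuration.

import subprocess
import tempfile

# Hypothetical sketch: submit a whole-node job to PBS/TORQUE from the
# interactive node. Queue name, walltime and aprun options are
# illustrative assumptions, not the real Kraken/Titan settings.
PBS_SCRIPT = """#!/bin/bash
#PBS -N panda_payload
#PBS -l nodes=1:ppn=12
#PBS -l walltime=01:00:00
#PBS -q batch
cd $PBS_O_WORKDIR
# On Cray systems the payload runs on the compute node via aprun
aprun -n 12 ./payload.sh
"""

def submit_pbs_job(script_text):
    """Write the batch script to a temporary file and submit it with qsub."""
    with tempfile.NamedTemporaryFile("w", suffix=".pbs", delete=False) as f:
        f.write(script_text)
        path = f.name
    # qsub prints the job id on success, e.g. "123456.sdb"
    return subprocess.check_output(["qsub", path]).decode().strip()

if __name__ == "__main__":
    print("Submitted:", submit_pbs_job(PBS_SCRIPT))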

Current PanDA implementation
- One Pilot per WN (worker node)
- Pilot executes on the same node as the job
- SW distribution through CVMFS

PanDA architecture for Kraken, Titan
- Pilot(s) execute on the HPC interactive node
- Pilot interacts with the local job scheduler to manage jobs
- Number of executing pilots = number of available slots in the local scheduler (see the sketch below)
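A hedged sketch of how the "pilots = available slots" rule could be enforced: count the jobs the user already has in the scheduler and only launch new pilots while below the per-user limit. The limit value and the qstat parsing are assumptions for illustration, not the actual site policy.

import getpass
import subprocess

# Illustrative per-user limit; the real Kraken/Titan policy may differ.
MAX_JOBS_PER_USER = 10

def count_my_jobs():
    """Count the current user's jobs known to PBS/TORQUE via qstat."""
    user = getpass.getuser()
    out = subprocess.check_output(["qstat", "-u", user]).decode()
    # Crude parse: count output lines that mention the user name
    return sum(1 for line in out.splitlines() if user in line)

def free_slots():
    """Number of additional pilots that may still be started."""
    return max(0, MAX_JOBS_PER_USER - count_my_jobs())

if __name__ == "__main__":
    print("Pilots that can still be started:", free_slots())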

Initial testing
Some initial testing was done to prove that PanDA components are able to run in the HPC environment on the interactive nodes:
- Sergey successfully started APF and pilots on Titan; outbound HTTPS connectivity was confirmed, so pilots can communicate with PanDA.
- I successfully tested the SAGA API on Kraken; in general, the SAGA API allows managing jobs through the local HPC job scheduler (see the sketch below).
- Because the interactive node and the worker nodes share a file system, no special internal data-management process was needed.
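For reference, a minimal SAGA-Python sketch of the kind of job management exercised on Kraken; the service URL and job parameters are placeholders, not the actual test configuration.

import saga  # saga-python bindings (later packaged as radical.saga)

def run_test_job():
    """Submit and track a small job through the local PBS/TORQUE scheduler."""
    # Service URL and job parameters below are illustrative placeholders.
    js = saga.job.Service("pbs://localhost")

    jd = saga.job.Description()
    jd.executable      = "/bin/hostname"
    jd.total_cpu_count = 12          # one full Kraken node
    jd.wall_time_limit = 10          # minutes
    jd.output          = "saga_test.out"
    jd.error           = "saga_test.err"

    job = js.create_job(jd)
    job.run()                        # submits to the local scheduler
    print("Job id:", job.id, "state:", job.state)
    job.wait()                       # block until the job finishes
    print("Final state:", job.state, "exit code:", job.exit_code)

if __name__ == "__main__":
    run_test_job()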

Next step: Pilot - SAGA integration
This is actually a rather big step, which can be technically split into:
- SAGA source integration with the pilot code
- Reviewing and reverse-engineering the runJob class
- Implementing a runJobHPC class based on the SAGA API (see the sketch below)
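As a rough illustration of the last item, a hypothetical outline of what a SAGA-based runJobHPC could look like; the method names and payload description beyond runJobHPC itself are assumptions, not the pilot's actual interface.

import saga  # saga-python bindings

class RunJobHPC(object):
    """Hypothetical sketch: instead of executing the payload locally
    (as runJob does), hand it to the local scheduler via the SAGA API.
    Method names and the payload description are assumptions."""

    def __init__(self, scheduler_url="pbs://localhost"):
        self.job_service = saga.job.Service(scheduler_url)

    def submit_payload(self, executable, args, cores, walltime_min):
        """Describe the payload and submit it to the local scheduler."""
        jd = saga.job.Description()
        jd.executable      = executable
        jd.arguments       = args
        jd.total_cpu_count = cores
        jd.wall_time_limit = walltime_min
        job = self.job_service.create_job(jd)
        job.run()
        return job

    def monitor(self, job):
        """Block until the payload leaves the scheduler, return its outcome."""
        job.wait()
        return job.state, job.exit_code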