HPC pilot code
Danila Oleynik, 18 December 2013
Outline
- Titan HPC specifics
- PanDA architecture for Titan
- Modification: PanDA Pilot side
- Modification: SAGA API side
- Next steps
Titan HPC specifics
- Titan is a Cray XK7 (access request in process)
- 18,688 nodes; each node: 16 cores, 32 + 6 GB RAM (2 GB per core)
- Parallel file system shared between nodes
- Access only to interactive nodes (worker nodes have extremely limited connectivity)
- One-Time Password authentication
- Internal job management tool: PBS/TORQUE
- One job occupies a minimum of one node (16 cores)
- The scheduler limits the number of queued jobs per user
- Special data transfer nodes for high-speed stage-in/out
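To make the PBS/TORQUE submission model above concrete, here is a minimal batch-script sketch of the kind Titan's scheduler accepts. The project ID, queue name, and payload path are hypothetical, and the aprun invocation is an illustration rather than a verified Titan configuration.

```
#!/bin/bash
#PBS -A PRJ123                 # project allocation (hypothetical ID)
#PBS -q batch                  # queue name (assumption)
#PBS -l nodes=1                # one Titan node = 16 cores, the minimum allocation
#PBS -l walltime=01:00:00
#PBS -o payload.out
#PBS -e payload.err

cd $PBS_O_WORKDIR
# aprun is the Cray launcher that places the payload on worker nodes
aprun -n 16 ./payload.sh
```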
PanDA architecture for Titan
- Pilot(s) execute on an HPC interactive node
- The pilot interacts with the local job scheduler to manage its job
- Number of executing pilots = number of available slots in the local scheduler
Modification: PanDA Pilot side
- The native PanDA pilot started successfully on Titan interactive nodes; only a correct definition of the PanDA queue was needed.
- The main modification was to the payload execution part: the runJobTitan.py module was developed, based on the runJob.py module.
- The method that invokes the payload was changed to set up, run, and collect the results of the job through PBS (see the sketch below).
- Some minor modifications were made to the cleanup procedures (subdirectory cleanup).
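The slides do not show the code, so the following is only a minimal sketch of what a PBS-backed payload launch can look like with the saga-python API of that era; the function name and the resource values (total_cpu_count, wall_time_limit) are illustrative, not the actual runJobTitan.py implementation.

```python
import saga

def run_payload_via_pbs(executable, args, cwd):
    """Illustrative stand-in for the payload-execution step of
    runJobTitan.py: submit the payload to the local PBS/TORQUE
    scheduler through SAGA and wait for it to finish."""
    jd = saga.job.Description()
    jd.executable        = executable
    jd.arguments         = args
    jd.working_directory = cwd
    jd.total_cpu_count   = 16          # one Titan node is the minimum allocation
    jd.wall_time_limit   = 60          # minutes (assumed limit)
    jd.output            = "payload.out"
    jd.error             = "payload.err"

    # "pbs://localhost" targets the PBS installation reachable from
    # the interactive node where the pilot runs
    js  = saga.job.Service("pbs://localhost")
    job = js.create_job(jd)
    job.run()
    job.wait()                         # block until PBS reports a final state
    return job.state                   # e.g. saga.job.DONE or saga.job.FAILED
```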
Modification: SAGA API side
- Some modifications on the SAGA API side were made, in collaboration with the SAGA developers, to fit Titan's specifics.
- The PBS adaptor was fixed to transmit the job definition through a file (mandatory on Titan, supported by other PBS implementations).
- Titan interprets some PBS parameters differently from other Crays; this was fixed as well.
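To make "transmitting the job definition through a file" concrete: instead of passing directives on the qsub command line, the job description is rendered into a PBS script and the file itself is submitted. The sketch below illustrates that idea under stated assumptions; it is not the actual adaptor code.

```python
import os
import subprocess
import tempfile

def submit_via_script(executable, arguments, nodes=1, walltime="01:00:00"):
    """Sketch of file-based submission: render the job description
    into a PBS script and hand the file to qsub, as Titan requires."""
    script = "\n".join([
        "#!/bin/bash",
        "#PBS -l nodes=%d" % nodes,       # Titan allocates whole nodes
        "#PBS -l walltime=%s" % walltime,
        "cd $PBS_O_WORKDIR",
        "%s %s" % (executable, " ".join(arguments)),
    ])
    fd, path = tempfile.mkstemp(suffix=".pbs")
    with os.fdopen(fd, "w") as f:
        f.write(script + "\n")
    # qsub prints the new PBS job id on success
    return subprocess.check_output(["qsub", path]).strip()
```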
Summary
- A set of light modifications made it possible to run the "native" PanDA pilot in the Titan HPC environment.
- The pilot executes on interactive nodes, which have enough connectivity to communicate with the PanDA server.
- The SAGA API lets the pilot manage the execution of payloads on Titan worker nodes through PBS calls.
Next steps
- Optimization of stage-in/out procedures: basic Titan interactive nodes do not support high-speed data transfers, so a Titan-specific data mover that moves data through the DTNs (Data Transfer Nodes) should be implemented.
- A simple service for managing the number of pilots executing in parallel on Titan under one account should be developed.
- Extend the monitoring information with the state of the payload in the PBS queue (one possible shape for this is sketched below).
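Assuming the pilot keeps the saga.job.Job handle for its payload, the monitoring extension could simply poll that handle's state and forward it to the pilot's monitor. The report callback is a hypothetical hook; this is a sketch, not a design decision from the slides.

```python
import time
import saga

def report_pbs_state(job, report, interval=60):
    """Poll the SAGA job handle for the payload's PBS-side state and
    forward it to the pilot's monitoring (the 'report' callback is a
    hypothetical hook) until the job reaches a final state."""
    while True:
        state = job.state              # e.g. PENDING, RUNNING, DONE, FAILED
        report(state)
        if state in (saga.job.DONE, saga.job.FAILED, saga.job.CANCELED):
            break
        time.sleep(interval)
```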