Managing GPUs in HTCondor 8.1/8.2
John (TJ) Knoeller, Condor Week 2014

Better support for GPUs in HTCondor 8.1/8.2

- GPUs as a form of custom resource
- Custom resources enhanced
- Assign a specific GPU to a job
- Simpler configuration

Defining a custom resource

- Define a custom STARTD resource:
  MACHINE_RESOURCE_<tag>
  MACHINE_RESOURCE_INVENTORY_<tag>
- <tag> is case preserving, case insensitive
- For GPU resources, use the tag "GPUs" (the plural, not the singular, like "Cpus"), because matchmaking depends on the name
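A minimal configuration sketch for a custom resource, using a hypothetical tag "Scanners" (the tag, quantity, and script path are illustrative, not from the talk):

  # Declare a fixed quantity known at configuration time...
  MACHINE_RESOURCE_Scanners = 2
  # ...or let an inventory script report the quantity/ids at startup
  # MACHINE_RESOURCE_INVENTORY_Scanners = /usr/local/libexec/scanner_inventory

A job would then ask for the resource with REQUEST_Scanners = 1 in its submit file, mirroring the GPUs examples later in the talk.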

Fungible resources

- Works with HTCondor 8.0
- For OS-virtualized resources: Cpus, Memory, Disk
- For intangible resources: Bandwidth, Licenses?
- Works with static and partitionable slots
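As a hedged sketch of the "Licenses" idea above (the tag and quantity are illustrative): a pool of license tokens is just a fungible quantity, configured on the machine and requested per job:

  # machine configuration: 4 license tokens on this node
  MACHINE_RESOURCE_Licenses = 4

  # user submit file: hold one token while the job runs
  REQUEST_Licenses = 1

The bandwidth example on the next slides works exactly the same way.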

Fungible custom resource example: bandwidth (1)

  > condor_config_val -dump Bandwidth
  MACHINE_RESOURCE_Bandwidth = 1000

  > grep -i bandwidth userjob.submit
  REQUEST_Bandwidth = 200

Fungible custom resource example: bandwidth (2)

Assuming 4 static slots, each slot receives an equal share of the 1000 units (1000/4 = 250):

  > condor_status -long | grep -i bandwidth
  Bandwidth = 250
  DetectedBandwidth = 1000
  TotalBandwidth = 1000
  TotalSlotBandwidth = 250
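A hedged note on how this gets matched (reconstructed from HTCondor's general handling of request_<tag>, not stated on the slide): the submit file's REQUEST_Bandwidth becomes the job-ad attribute RequestBandwidth, and a clause along these lines is folded into the job's Requirements so it only matches slots with enough of the resource:

  (TARGET.Bandwidth >= RequestBandwidth)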

Non-fungible resources

- New for HTCondor 8.1/8.2
- For resources not virtualized by the OS: GPUs, instruments, directories
- Configure by listing resource ids; quantity is inferred
- Specific id(s) are assigned to slots
- Works with static and partitionable slots

Non-fungible custom resource example: GPUs (1)

  > condor_config_val -dump gpus
  MACHINE_RESOURCE_GPUs = CUDA0, CUDA1
  ENVIRONMENT_FOR_AssignedGPUs = CUDA_VISIBLE_DEVICES
  ENVIRONMENT_VALUE_FOR_UnAssignedGPUs = 10000

  > grep -i gpus userjob.submit
  REQUEST_GPUs = 1

Non-fungible custom resource example: GPUs (2)

  > condor_status -long slot1 | grep -i gpus
  AssignedGpus = "CUDA0"
  DetectedGPUs = 2
  GPUs = 1
  TotalSlotGPUs = 1
  TotalGPUs = 2

Non-fungible custom resource example: GPUs (3)

Environment of a job running on that slot:

  > env | grep -i CUDA
  _CONDOR_AssignedGPUs=CUDA0
  CUDA_VISIBLE_DEVICES=0

Additional resource attributes

- Run a resource inventory script: MACHINE_RESOURCE_INVENTORY_<tag>
- The script must return either
  Detected<tag> = <quantity>
  or
  Detected<tag> = "<list-of-ids>"
- All script output is published in all slots
- Script output must be in ClassAd syntax
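A minimal sketch of such an inventory script, reusing the hypothetical "Scanners" tag from earlier (the ids and the extra attribute are made up); it just prints ClassAd-syntax lines on stdout:

  #!/bin/sh
  # Report the detected resource ids; the quantity is inferred from the list.
  echo 'DetectedScanners = "SCAN0, SCAN1"'
  # Any additional output is published in every slot ad.
  echo 'ScannersVendor = "Acme"'

Wire it up by pointing MACHINE_RESOURCE_INVENTORY_Scanners at the script.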

condor_gpu_discovery

  > condor_gpu_discovery -properties
  DetectedGPUs = "CUDA0, CUDA1"
  CUDACapability = 2.0
  CUDADeviceName = "GeForce GTX 480"
  CUDADriverVersion = 4.2
  CUDAECCEnabled = false
  CUDAGlobalMemoryMb = 1536
  CUDARuntimeVersion = 4.10

condor_gpu_discovery extras

- More attributes with the -extra option: clock speed, CUs
- Dynamic attributes with the -dynamic option: fan speed, power usage, die temperature
- Non-homogeneous attributes have the GPU id in their name, e.g. CUDA0PowerUsage_mw
- Fake it with the -simulate[:n,m] option
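For illustration, the options above would be invoked like this (output omitted, since the exact attribute sets vary by GPU and driver):

  > condor_gpu_discovery -properties -extra
  > condor_gpu_discovery -properties -dynamic
  > condor_gpu_discovery -properties -simulate:0,2

The -simulate form here assumes the slide's n selects a simulated device model and m the device count; that reading is a guess from the option name, not confirmed by the slide.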

Using condor_gpu_discovery

In your configuration file, add:

  use feature : gpus

The line above expands to:

  MACHINE_RESOURCE_INVENTORY_GPUs = $(LIBEXEC)/condor_gpu_discovery -properties $(GPU_DISCOVERY_EXTRA)
  ENVIRONMENT_FOR_AssignedGPUs = GPU_DEVICE_ORDINAL=/(CUDA|OCL)// CUDA_VISIBLE_DEVICES
  ENVIRONMENT_VALUE_FOR_UnAssignedGPUs = 10000
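With that metaknob in place, a minimal submit-file sketch (the executable and file names are hypothetical; REQUEST_GPUs is from the earlier example slide):

  universe     = vanilla
  executable   = my_cuda_app
  output       = gpu_job.out
  error        = gpu_job.err
  log          = gpu_job.log
  REQUEST_GPUs = 1
  queue

The matched job then finds its assigned device via _CONDOR_AssignedGPUs and CUDA_VISIBLE_DEVICES, as shown on the GPUs (3) slide.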

Taking a GPU offline

- Add the following to your configuration:
  OFFLINE_MACHINE_RESOURCE_GPUs = CUDA0
- The configuration can be set remotely:
  condor_config_val -startd -set
- Then restart the STARTD:
  condor_restart [-peaceful] -startd
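Put together, a hedged sketch of the full remote sequence (the hostname is hypothetical, and -set assumes the pool's security policy permits remote configuration):

  > condor_config_val -name gpu-node-01 -startd -set "OFFLINE_MACHINE_RESOURCE_GPUs=CUDA0"
  > condor_restart -peaceful -startd -name gpu-node-01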

What’s new in 8.1 (review)

- Non-fungible custom resources
- Take a custom resource offline
- condor_gpu_discovery now defines a non-fungible GPUs resource
- STARTD policy for custom resources:
  don't abort when the resource quantity is 0;
  give out the resource until it is gone, then give out 0

Any Questions?