Cluster Optimisation using Cgroups
Gang Qin, Gareth Roy, David Crooks, Sam Skipsey, Gordon Stewart and David Britton
ACAT2016, 18th Jan 2016
Prof. David Britton, GridPP Project leader, University of Glasgow
David Britton, University of Glasgow IET, Oct 09
The Issue
In 2015 ATLAS started to provide 8-core multicore jobs requiring 16GB of memory.
  This led to a quantization problem with some generations of hardware at Glasgow, and to performance problems even with fully usable configurations.
We initiated a project to explore the memory use of workloads at the Glasgow Tier-2 site in order to:
  Maximize site throughput by tuning the memory allocation.
  Optimize the purchase of future hardware: how much memory?
  Find out whether monitoring memory could identify stuck or suspicious workloads.
At the end of 2015, ATLAS confirmed 2GB/hyper-threaded-core but expressed an interest in some larger (4GB) memory resources.
In a procurement currently in progress at Glasgow, the difference between 2GB/core and 4GB/core is 12% of the total cost.
David Britton, University of Glasgow ACAT2016
Tools
Control Groups (cgroups)
  Linux kernel feature to limit/isolate resource usage among user-defined groups of processes.
  We used the cpu and memory cgroups Resource Controllers (or subsystems).
Condor Cgroups
  Condor puts each job into a dedicated cgroup for the selected subsystem.
  Can track the subsystem usage of all processes started by a job.
  Can control CPU usage at job level: jobs can use more CPUs than allocated if there are still free CPUs.
  Can control RSS (physical memory) usage at job level:
    soft: jobs can access more memory than allocated if there is still free physical memory available in the system.
    hard: jobs can't access more physical memory than allocated.
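As an illustration of the per-job accounting described above, the sketch below reads the counters that the cgroup v1 memory controller exposes as files inside a cgroup's directory. The file names are the standard cgroup v1 ones. The directory under which Condor places each job depends on the site configuration, so the path in the example is a hypothetical assumption, and this is not the monitoring code used at Glasgow.

```python
from pathlib import Path

# Standard cgroup v1 memory controller files; all values are in bytes.
COUNTERS = {
    "usage": "memory.usage_in_bytes",
    "max_usage": "memory.max_usage_in_bytes",
    "hard_limit": "memory.limit_in_bytes",
    "soft_limit": "memory.soft_limit_in_bytes",
}


def read_memory_counters(cgroup_dir):
    """Return the memory counters (in bytes) found in one cgroup directory.

    Counter files that do not exist are skipped, so a missing or
    already-removed cgroup yields an empty dict.
    """
    result = {}
    for name, filename in COUNTERS.items():
        path = Path(cgroup_dir) / filename
        if path.is_file():
            result[name] = int(path.read_text().strip())
    return result


def to_mb(n_bytes):
    """Convert bytes to megabytes (1 MB = 1024 * 1024 bytes)."""
    return n_bytes / (1024 * 1024)


if __name__ == "__main__":
    # Hypothetical location of one job's cgroup; adjust to the local setup.
    job_dir = "/sys/fs/cgroup/memory/htcondor/example_job"
    for name, value in read_memory_counters(job_dir).items():
        print(f"{name}: {to_mb(value):.0f} MB")
```

Sampling max_usage periodically for every running job is one way a MaxMemory value per job could be collected; the sampling interval is a trade-off against load on the monitoring host.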
ATLAS Test Job
Re-ran the same test job with different (hard) memory limits:
  Maximum memory needed is 2700MB.
  ~20% increase in run time when RSS limited to 1000MB (37% of maximum).
  ~1.5% increase in run time when RSS limited to 2000MB (75% of maximum).
So we might speculate that it is not necessary to always provision to max-requested-memory…
…but this job has a very linear memory footprint.
Workloads and Databases
Data collected in three local databases:
  Condor database – e.g. which machine the job ran on.
  Panda database – e.g. provides Job Type and ID for ATLAS workloads.
  Cgroups database – e.g. provides the data on memory and CPU usage.
Data Set
[Plot annotations: Transfer Failure; Cgmemd problem]
Data taken over about a year.
Glasgow Condor cluster scaled up to ~4600 cores by July.
4.64m jobs went through Panda: 5.5% requested 8 cores and 94.5% requested 1 core.
Memory monitoring was shut down in June to reduce load on the Condor central manager, then restarted in July with a lower sampling frequency and only on part of the cluster.
2.5m jobs monitored.
1: ATLAS Multicore Simulation
55,000 jobs.
All run on 8 cores on Ivybridge machines, with 16GB memory.
MaxMemory shows structure (different tasks).
MaxMemory goes up to about 11GB; averaged 5.7GB.
Job efficiency was around 95% except in the first quarter.
So could there be swapping because 1 core needs more than 2GB?
Can we monitor memory and reduce to 12GB or even 8GB?
2: ATLAS Multicore Reconstruction
29,000 jobs.
All run on 8 cores on Ivybridge machines, with 16GB memory.
MaxMemory averages about 11GB but extends to ~19GB.
Job efficiency averages around 60%-70% (but varies).
So could there be swapping because 1 core needs more than 2GB?
The 16GB of memory requested is needed.
Normal vs Quick Reconstruction
[Figure labels: Digi, Reco, xAOD, DQ]
3: ATLAS Single-core Simulation
294,000 jobs.
All run on 1 core on Ivybridge machines, with 3GB memory.
MaxMemory averages about 1GB but extends to ~2GB.
Job efficiency consistently about 95%.
Memory could be reduced from 3GB to < 2GB (1.8GB?).
4: ATLAS Single-core Analysis
463,000 jobs. [Plots shown on log scale]
All run on 1 core on Ivybridge machines, with 4GB memory.
MaxMemory averages about 0.4GB but extends to >4GB.
MadEvent jobs in Q3 attempted to use 30+ cores (log plot!) but were contained via cgroups.
~98% have MaxMemory < 1.8GB.
4: ATLAS Single-core Analysis - continued
5: ATLAS Sequential single-core Analysis
210,000 jobs. [Plots shown on log scale]
All run on 1 core on Ivybridge machines, with 4GB memory.
MaxMemory averages about 0.6GB but extends to >4GB.
The tail of events shown in Q3 (log plot) requested >1 core.
~96% have MaxMemory < 1.8GB.
5: ATLAS Sequential single-core Analysis - continued
6,7,8: Non-ATLAS Single-core jobs
VO          Allocated   Used
ILC         2GB         < ~1.2GB
PhenoGrid   2GB         < ~1.2GB
GridPP VO   2GB         < ~0.8GB
Results
On this basis, we reduced the memory allocation of all single-core jobs to 1.8GB for the last 6 months:
  Running was stable.
  We achieved 10% higher CPU usage.
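The reasoning behind a 1.8GB allocation is that the large majority of single-core jobs peak below it. A minimal sketch of that check follows: given the peak memory of each monitored job, it reports the fraction that would fit under a candidate allocation. The function name and the sample values are made up for illustration; they are not measurements from the Glasgow databases.

```python
def fraction_within(max_memory_gb, allocation_gb):
    """Fraction of jobs whose peak memory is strictly below allocation_gb."""
    if not max_memory_gb:
        raise ValueError("no jobs supplied")
    fitting = sum(1 for m in max_memory_gb if m < allocation_gb)
    return fitting / len(max_memory_gb)


# Hypothetical peak-memory values in GB (illustrative only).
sample = [0.3, 0.4, 0.4, 0.5, 0.6, 0.9, 1.1, 1.5, 1.7, 4.2]

for allocation in (1.8, 2.0, 4.0):
    share = fraction_within(sample, allocation)
    print(f"allocation {allocation} GB: {share:.0%} of jobs fit")
```

With soft limits, the jobs in the tail above the allocation can still run as long as free physical memory is available on the node, which is why a high but not complete coverage can be acceptable.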
Example
Sandybridge: 32 CPUs with 64GB of memory (32GB physical + 32GB swap).
With allocations of 16/4GB for multi-core (MC)/single-core (SC) jobs, we can in principle fit:
  4 MC + 0 SC (32 CPUs; 64GB memory allocated)
  3 MC + 4 SC (28 CPUs; 64GB memory allocated)
  2 MC + 8 SC (24 CPUs; 64GB memory allocated)
  1 MC + 12 SC (20 CPUs; 64GB memory allocated)
  0 MC + 16 SC (16 CPUs; 64GB memory allocated)
With allocations of 16/2GB for MC/SC jobs, we can in principle fit:
  3 MC + 8 SC (32 CPUs; 64GB memory allocated)
  2 MC + 16 SC (32 CPUs; 64GB memory allocated)
  1 MC + 24 SC (32 CPUs; 64GB memory allocated)
  0 MC + 32 SC (32 CPUs; 64GB memory allocated)
95% of work is SC.
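The packing lists on this slide can be reproduced with a short enumeration. The sketch below assumes an MC job takes 8 cores and an SC job takes 1 core (as stated on the workload slides), and for each possible number of MC jobs fills the remaining CPUs and memory with as many SC jobs as fit. The function name and signature are illustrative, and memory sizes are kept as whole GB.

```python
def packings(cpus, memory_gb, mc_mem_gb, sc_mem_gb, mc_cores=8, sc_cores=1):
    """List (n_mc, n_sc, cpus_used, memory_used_gb) for every MC job count.

    For each number of MC jobs that fits on the node, the remaining CPUs
    and memory are filled with the largest possible number of SC jobs.
    """
    results = []
    max_mc = min(cpus // mc_cores, memory_gb // mc_mem_gb)
    for n_mc in range(max_mc, -1, -1):
        free_cpus = cpus - n_mc * mc_cores
        free_mem = memory_gb - n_mc * mc_mem_gb
        n_sc = min(free_cpus // sc_cores, free_mem // sc_mem_gb)
        cpus_used = n_mc * mc_cores + n_sc * sc_cores
        mem_used = n_mc * mc_mem_gb + n_sc * sc_mem_gb
        results.append((n_mc, n_sc, cpus_used, mem_used))
    return results


for sc_mem in (4, 2):
    print(f"SC allocation {sc_mem} GB:")
    for n_mc, n_sc, used_cpus, used_mem in packings(32, 64, 16, sc_mem):
        print(f"  {n_mc} MC + {n_sc} SC ({used_cpus} CPUs; {used_mem}GB)")
```

With a 4GB SC allocation the node is memory-bound in every mixed configuration, so CPUs are left idle; with 2GB every configuration uses all 32 CPUs. The enumeration also reports the 4 MC + 0 SC case for the 2GB allocation, which the slide does not list because it is unchanged.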
Example
Our cluster has 4600 cores. When the cluster was fully loaded, there were always ~600 cores unused because there was not enough memory and/or the quantization didn’t work.
Reducing the memory request for SC jobs to 2GB could occupy all cores, but two generations of hardware (Westmere and Sandybridge) became very slow when fully loaded, so we tuned the memory to keep a few cores free on each of these boxes.
The sweet spot was to set the SC request to 1.8GB and keep 1-2 cores free on these two types of machine.
This enabled 500 of the 600 previously unused cores to be used when the system is fully loaded.
500/4600 gives a ~10% gain in CPU usage on the cluster.
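The arithmetic behind the quoted gain can be written out explicitly. The sketch below uses only the core counts given on this slide; the variable and function names are illustrative. It shows the gain both as a share of the whole cluster and relative to the cores that were already in use.

```python
total_cores = 4600
unused_before = 600   # cores idle at full load before tuning
recovered = 500       # previously idle cores now usable

used_before = total_cores - unused_before
used_after = used_before + recovered


def percent(part, whole):
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


print("utilisation before:", percent(used_before, total_cores), "%")
print("utilisation after: ", percent(used_after, total_cores), "%")
print("gain as share of cluster:", percent(recovered, total_cores), "%")
print("gain relative to cores used before:", percent(recovered, used_before), "%")
```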
Future Plans
If ATLAS were to produce two separate types of multicore jobs (simulation and reconstruction), we could try reducing memory for the simulation jobs from 16GB to 12GB (?).
Develop a memory monitoring system so that we can:
  Set more fine-grained and more dynamic limits.
  Spot stuck or suspicious workloads.
  Provide feedback to ATLAS on memory requirements on a real-time basis so that memory requests can be adjusted (benefits all sites).
Stick to 2GB/hyperthreaded-core in the current procurement and spend the last 12% on additional CPU rather than extra memory.
Backup Slides
Multicore Simulation Jobs (5)
Multicore Reconstruction Jobs (6)
[Figure labels: Digi, Reco, xAOD, DQ]