Cluster Optimisation using Cgroups


Cluster Optimisation using Cgroups
Gang Qin, Gareth Roy, David Crooks, Sam Skipsey, Gordon Stewart and David Britton
ACAT2016, 18th Jan 2016
Prof. David Britton, GridPP Project Leader, University of Glasgow

The Issue
In 2015 ATLAS started to provide 8-core multicore jobs requiring 16GB of memory. This led to a quantization problem with some generations of hardware at Glasgow, and to performance problems even with fully usable configurations. It prompted a project to explore the memory use of workloads at the Glasgow Tier-2 site in order to:
Maximize site throughput by tuning the memory allocation.
Optimize the purchase of future hardware: how much memory is needed?
Possibly identify stuck or suspicious workloads through memory monitoring.
At the end of 2015, ATLAS confirmed a requirement of 2GB per hyper-threaded core but expressed an interest in some larger (4GB) memory resources. In a procurement currently in progress at Glasgow, the difference between 2GB/core and 4GB/core is 12% of the total cost.

Tools
Control Groups (cgroups): a Linux kernel feature to limit and isolate resource usage among user-defined groups of processes. We used the cpu and memory cgroup Resource Controllers (or subsystems).
Condor Cgroups: Condor puts each job into a dedicated cgroup for the selected subsystems, so it can track the subsystem usage of all processes started by a job.
It can control CPU usage at job level: jobs can use more CPUs than allocated if there are still free CPUs.
It can control RSS (physical memory) usage at job level:
soft: jobs can access more memory than allocated if there is still free physical memory available in the system.
hard: jobs cannot access more physical memory than allocated.
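To make the per-job accounting concrete, here is a minimal sketch of reading the usage counters the kernel exposes for a job's cgroup, assuming a cgroup v1 hierarchy mounted under /sys/fs/cgroup; the job cgroup path below is a hypothetical example, since the real name depends on the site's HTCondor configuration and slot.

```python
from pathlib import Path

# Hypothetical per-job cgroup created by the batch system (cgroup v1 layout);
# the real path depends on the site's HTCondor configuration and the slot name.
JOB_CGROUP = "htcondor/condor_slot1_1"

def read_int(path: Path) -> int:
    """Read a single integer counter from a cgroup control file."""
    return int(path.read_text().strip())

def job_usage(job_cgroup: str = JOB_CGROUP) -> dict:
    """Return peak/current memory usage and cumulative CPU time for one job cgroup."""
    mem = Path("/sys/fs/cgroup/memory") / job_cgroup
    cpu = Path("/sys/fs/cgroup/cpuacct") / job_cgroup
    return {
        # memory.*usage_in_bytes counts RSS plus page cache charged to the group
        "peak_memory_bytes": read_int(mem / "memory.max_usage_in_bytes"),
        "current_memory_bytes": read_int(mem / "memory.usage_in_bytes"),
        "cpu_time_ns": read_int(cpu / "cpuacct.usage"),
    }

if __name__ == "__main__":
    print(job_usage())
```

A sampler along these lines, run periodically on each worker node, is essentially what feeds the cgroups database described later.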

ATLAS Test Job
We re-ran the same test job with different (hard) memory limits:
The maximum memory needed was 2700MB.
~1.5% increase in run time when RSS was limited to 2000MB (75% of maximum).
~20% increase in run time when RSS was limited to 1000MB (37% of maximum).
So one might speculate that it is not necessary to always provision the maximum requested memory... but this job had a very linear memory footprint.
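For reference, a hard RSS limit of this kind can be imposed with the cgroup v1 memory controller by writing to memory.limit_in_bytes before attaching the job's process. The sketch below is illustrative only (it needs root, and it attaches the process after launch, so very early allocations are not charged); in HTCondor itself this behaviour is normally selected through the CGROUP_MEMORY_LIMIT_POLICY setting rather than done by hand.

```python
import subprocess
from pathlib import Path

MEM_ROOT = Path("/sys/fs/cgroup/memory")  # cgroup v1 memory controller

def run_with_hard_limit(cmd, limit_mb, name="memtest"):
    """Run `cmd` inside a fresh memory cgroup with a hard limit (requires root)."""
    cg = MEM_ROOT / name
    cg.mkdir(exist_ok=True)
    # Hard limit on memory charged to the group (RSS + page cache);
    # exceeding it triggers reclaim and, ultimately, the OOM killer.
    (cg / "memory.limit_in_bytes").write_text(str(limit_mb * 1024 * 1024))
    child = subprocess.Popen(cmd)
    # Attach the child; all of its descendants inherit the cgroup.
    (cg / "cgroup.procs").write_text(str(child.pid))
    return child.wait()

# Hypothetical usage, mimicking the test above:
# run_with_hard_limit(["run_test_job.sh"], limit_mb=2000)
```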

Workloads and Databases
Data were collected in three local databases:
Condor database: e.g. which machine the job ran on.
Panda database: e.g. the Job Type and ID for ATLAS workloads.
Cgroups database: e.g. the data on memory and CPU usage.
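The per-workload results that follow come from joining these databases on the job identifiers. As a sketch of the kind of query involved (the schema, table and column names below are hypothetical, not the actual Glasgow databases):

```python
import sqlite3

# Hypothetical merged schema: cgroup_usage(condor_id, panda_id, max_rss_mb),
# panda_jobs(panda_id, job_type), condor_jobs(condor_id, machine).
QUERY = """
SELECT p.job_type,
       COUNT(*)          AS njobs,
       AVG(c.max_rss_mb) AS avg_peak_mb,
       MAX(c.max_rss_mb) AS worst_peak_mb
FROM   cgroup_usage c
JOIN   panda_jobs   p ON p.panda_id  = c.panda_id
JOIN   condor_jobs  h ON h.condor_id = c.condor_id
GROUP  BY p.job_type
ORDER  BY worst_peak_mb DESC;
"""

def memory_by_job_type(db_path="monitoring.db"):
    """Summarise peak memory per job type across the merged monitoring data."""
    with sqlite3.connect(db_path) as conn:
        return conn.execute(QUERY).fetchall()
```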

Data Set
Data were taken over about a year. [Plot annotations: Transfer Failure; Cgmemd problem.]
The Glasgow Condor cluster scaled up to ~4600 cores by July.
4.64m jobs went through Panda: 5.5% requested 8 cores and 94.5% requested 1 core.
Memory monitoring was shut down in June to reduce load on the Condor central manager, and then restarted in July with a lower sampling frequency and only on part of the cluster.
2.5m jobs were monitored.

1: ATLAS Multicore Simulation (55,000 Jobs)
All ran on 8 cores on Ivybridge machines with 16GB of memory.
MaxMemory shows structure (different tasks); it goes up to about 11GB and averaged 5.7GB.
Job efficiency was around 95% except in the first quarter. Could there be swapping because one core needs more than 2GB?
Can we monitor memory and reduce the allocation to 12GB or even 8GB?

2: ATLAS Multicore Reconstruction (29,000 Jobs)
All ran on 8 cores on Ivybridge machines with 16GB of memory.
MaxMemory averages about 11GB but extends to ~19GB. Could there be swapping because one core needs more than 2GB?
Job efficiency averages around 60%-70% (but varies).
The 16GB of memory requested is needed.

Normal vs Quick Reconstruction
[Plots comparing normal and quick reconstruction; panels: Digi, Reco, xAOD, DQ.]

3: ATLAS Single-core Simulation (294,000 Jobs)
All ran on 1 core on Ivybridge machines with 3GB of memory.
MaxMemory averages about 1GB but extends to ~2GB.
Job efficiency is consistently about 95%.
Memory could be reduced from 3GB to below 2GB (1.8GB?).

4: ATLAS Single-core Analysis (463,000 Jobs; plots on a log scale)
All ran on 1 core on Ivybridge machines with 4GB of memory.
MaxMemory averages about 0.4GB but extends beyond 4GB.
MadEvent jobs in Q3 attempted to use 30+ cores (visible on the log plot) but were contained via cgroups.
~98% have MaxMemory < 1.8GB.
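The deck describes Condor's CPU control as soft (jobs may spill onto idle cores), which is presumably how the 30+-core MadEvent processes were contained once the machine was busy. For comparison, the cgroup v1 cpu controller can also impose a hard cap via the CFS quota interface; the sketch below shows that alternative and is purely illustrative, not the mechanism confirmed in the talk.

```python
from pathlib import Path

CPU_ROOT = Path("/sys/fs/cgroup/cpu")  # cgroup v1 cpu controller

def cap_cgroup_cpus(cgroup_name, ncores, period_us=100_000):
    """Hard-cap a cgroup at `ncores` worth of CPU time per scheduling period."""
    cg = CPU_ROOT / cgroup_name
    cg.mkdir(parents=True, exist_ok=True)
    (cg / "cpu.cfs_period_us").write_text(str(period_us))
    # quota = ncores * period: the group gets at most ncores CPUs of runtime
    # per period, no matter how many threads it spawns.
    (cg / "cpu.cfs_quota_us").write_text(str(ncores * period_us))

# Hypothetical example: keep a single-core analysis job that spawns many
# threads pinned to roughly one core's worth of CPU time.
# cap_cgroup_cpus("htcondor/slot1_5", ncores=1)
```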

4: ATLAS Single-core Analysis (continued)

5: ATLAS Sequential Single-core Analysis (210,000 Jobs; plots on a log scale)
All ran on 1 core on Ivybridge machines with 4GB of memory.
MaxMemory averages about 0.6GB but extends beyond 4GB.
A tail of events, shown in Q3 (log plot), requested more than 1 core.
~96% have MaxMemory < 1.8GB.

5: ATLAS Sequential Single-core Analysis (continued)

6, 7, 8: Non-ATLAS Single-core Jobs
ILC: jobs allocated 2GB; jobs use < ~1.2GB.
PhenoGrid: jobs allocated 2GB; jobs use < ~1.2GB.
GridPP VO: jobs allocated 2GB; jobs use < ~0.8GB.

Results
On this basis, we reduced the memory allocation of all single-core jobs to 1.8GB for the last 6 months:
Running was stable.
We achieved 10% higher CPU usage.

Example
A Sandybridge node has 32 CPUs with 64GB of memory (32GB physical + 32GB swap).
With allocations of 16GB/4GB for multicore (MC) / single-core (SC) jobs, in principle we can fit:
4 MC + 0 SC (32 CPUs; 64GB memory allocated)
3 MC + 4 SC (28 CPUs; 64GB memory allocated)
2 MC + 8 SC (24 CPUs; 64GB memory allocated)
1 MC + 12 SC (20 CPUs; 64GB memory allocated)
0 MC + 16 SC (16 CPUs; 64GB memory allocated)
With allocations of 16GB/2GB for MC/SC jobs, in principle we can fit:
3 MC + 8 SC (32 CPUs; 64GB memory allocated)
2 MC + 16 SC (32 CPUs; 64GB memory allocated)
1 MC + 24 SC (32 CPUs; 64GB memory allocated)
0 MC + 32 SC (32 CPUs; 64GB memory allocated)
Since 95% of the work is SC, the 4GB allocation exhausts memory while CPUs sit idle, whereas the 2GB allocation lets every core be used; a small packing sketch follows.
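The packing arithmetic above can be reproduced with a few lines of Python; the job footprints (8 cores and 16GB per MC job, 1 core per SC job) are taken from the slide, and the single-core memory request is left as a parameter.

```python
def packings(total_cpus=32, total_mem_gb=64, mc_cpus=8, mc_mem=16, sc_mem=4):
    """Enumerate (MC jobs, SC jobs) mixes that fit on one node."""
    fits = []
    for mc in range(total_cpus // mc_cpus + 1):
        cpus_left = total_cpus - mc * mc_cpus
        mem_left = total_mem_gb - mc * mc_mem
        sc = min(cpus_left, mem_left // sc_mem)
        if sc >= 0:
            fits.append({"MC": mc, "SC": sc,
                         "cpus_used": mc * mc_cpus + sc,
                         "mem_used_gb": mc * mc_mem + sc * sc_mem})
    return fits

# 16GB/4GB allocation: SC-heavy mixes exhaust memory before the CPUs.
print(packings(sc_mem=4))
# 16GB/2GB allocation: every mix (including the pure-MC one) fills all 32 CPUs.
print(packings(sc_mem=2))
```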

Example
Our cluster has 4600 cores. When the cluster was fully loaded, there were always ~600 cores unused because there was not enough memory and/or the quantization didn't work out.
Reducing the memory request for SC jobs to 2GB could occupy all the cores, but two generations of hardware (Westmere and Sandybridge) became very slow when fully loaded, so we tuned the memory request to keep a few cores free on each of these boxes.
The sweet spot was to set the SC request to 1.8GB and keep 1-2 cores free on these two types of machine.
This enabled 500 (out of the 600) previously unused cores to be used when the system is fully loaded: 500/4600 ≈ 10% gain in CPU usage on the cluster.

Future Plans
If ATLAS were to produce two separate types of multicore jobs (simulation and reconstruction), we could try reducing the memory for the simulation jobs from 16GB to 12GB (?).
Develop a memory monitoring system so that we can:
Set more fine-grained and more dynamic limits.
Spot stuck or suspicious workloads.
Provide feedback to ATLAS on memory requirements in real time so that memory requests can be adjusted (this benefits all sites).
Stick to 2GB/hyper-threaded core in the current procurement and spend the last 12% on additional CPU rather than extra memory.

Backup Slides

Multicore Simulation Jobs (5)
[Backup plots.]

Multicore Reconstruction Jobs (6)
[Backup plots; panels: Digi, Reco, xAOD, DQ.]