Effective use of cgroups with HTCondor

Tom Downes, Center for Gravitation, Cosmology and Astrophysics, University of Wisconsin-Milwaukee
LIGO Scientific Collaboration
HTCondor Week 2017

What are Control Groups (cgroups)?
- Condor is a fault-tolerant system for running jobs; Control Groups provide fault tolerance for systems by managing access to resources such as CPU, memory, devices, and network
- Tightly coupled to systemd: RHEL 7+, Debian 8+, Ubuntu 15.04+
- Part of the broader movement to isolate processes from one another
- Vastly improve measurement of job resource utilization

I got my philosophy
- I "speak for the systems," not the jobs
- Services working should be the normal state of affairs
- The system needs enough autonomy that admins can live life: go on vacation, work on important things

2000345.0   scaudill   4/18 23:58   Error from slot1_3@execute1068.nemo.uwm.edu: Job has gone over memory limit of 5120 megabytes. Peak usage: 5320 megabytes.

The punchline

$ cat /etc/condor/config.d/cgroups
BASE_CGROUP=/system.slice/condor.service
CGROUP_MEMORY_LIMIT_POLICY=soft

How cgroups work
- Controllers each manage a single resource (CPU, memory, etc.) hierarchically
- Each controller is a k-ary tree exposed as a directory structure (e.g. memory → system.slice → condor.service → slot1_1, slot1_2; alongside cron.service and user.slice/user-1296)
- Processes are assigned to nodes in the tree
- Condor daemons live in the systemd cgroup; Condor adds leaf nodes for jobs!

$ cat /sys/fs/cgroup/memory/system.slice/condor.service/tasks
1640
...
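The directory structure above can be walked with ordinary tools. This is a small sketch (not part of Condor) that lists every node in a controller subtree with its task count; the commented path is what the tree looks like on an execute node:

```shell
# Sketch: child cgroups are subdirectories, and every cgroup node has a
# "tasks" file listing its member PIDs. Walk a subtree and report counts.
list_cgroup_tree() {
  local root="$1" d
  find "$root" -type d | while read -r d; do
    # grep -c '' counts lines, i.e. PIDs assigned to this node
    printf '%s: %s task(s)\n' "${d#$root}" "$(grep -c '' "$d/tasks")"
  done
}

# e.g. list_cgroup_tree /sys/fs/cgroup/memory/system.slice/condor.service
```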

Memory controller
- Measures RAM usage (resident set size)
- Separately measures combined RAM and swap usage
- Actual swap usage is determined by swappiness, which is itself a cgroup setting!
- RHEL 7 docs misleadingly suggest that you can prevent swap usage; the only unused swap is no swap!
- Hierarchical accounting: descendant cgroups count toward their ancestors
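Because RAM and RAM+swap are reported in separate files, swap actually in use by a cgroup is their difference. A minimal sketch (the helper name and the example path are illustrative, not Condor tooling):

```shell
# Sketch: the v1 memory controller exposes RAM usage and combined
# RAM+swap usage in separate files; swap in use is the difference.
# A real path would look like:
# /sys/fs/cgroup/memory/system.slice/condor.service/slot1_1
swap_in_use() {
  local cg="$1" mem memsw
  mem=$(cat "$cg/memory.usage_in_bytes")
  memsw=$(cat "$cg/memory.memsw.usage_in_bytes")
  echo $((memsw - mem))
}
```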

Debian / Ubuntu
- By default, Debian does not enable the memory controller
- Neither Debian nor Ubuntu enables the swap features within the controller

$ grep GRUB_CMDLINE_LINUX_DEFAULT /etc/default/grub
GRUB_CMDLINE_LINUX_DEFAULT="quiet cgroup_enable=memory swapaccount=1"
$ update-grub
$ shutdown -r now

- Preseed at install time (avoids first-boot problems!) with:
grub-pc grub2/linux_cmdline_default string quiet cgroup_enable=memory swapaccount=1

Limiting Condor execute nodes
- In addition to per-job limits, let's limit the total memory used by Condor daemons and processes
- Use ExecStartPost to limit RAM+swap (thanks, RHEL 7 docs!)

$ cat /etc/systemd/system/condor.service.d/memory.conf
[Service]
MemoryAccounting=true
MemoryLimit=4G
# this value must be greater than or equal to MemoryLimit
ExecStartPost=/bin/bash -c "echo 4G > /sys/fs/cgroup/memory/system.slice/condor.service/memory.memsw.limit_in_bytes"
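Note that MemoryLimit accepts a suffixed value while the raw memory.memsw.limit_in_bytes file reports plain bytes, so comparing the two takes a little arithmetic. A tiny converter, offered only as a sketch for sanity-checking (not part of systemd or Condor):

```shell
# Sketch: convert a systemd-style suffixed size (K/M/G, powers of 1024)
# to bytes, for comparison against raw cgroup files.
to_bytes() {
  case "$1" in
    *G) echo $(( ${1%G} * 1024 * 1024 * 1024 )) ;;
    *M) echo $(( ${1%M} * 1024 * 1024 )) ;;
    *K) echo $(( ${1%K} * 1024 )) ;;
    *)  echo "$1" ;;
  esac
}

to_bytes 4G   # prints 4294967296
```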

Limiting users on a submit node
- Add the same condor.service limits as on an execute node
- The configuration below limits the total RAM used by all users at the command line
- For users who log in via ssh, it sets a per-process virtual memory limit

$ cat /etc/systemd/system/user.slice.d/50-MemoryLimit.conf
[Slice]
MemoryAccounting=true
MemoryLimit=6G

$ cat /etc/systemd/system/openssh-server.service.d/memory.conf
[Service]
LimitAS=2147483648

Details of enforcement (hard limit on Condor)
- When the hard limit on the whole condor service is reached:
  - The kernel attempts to reclaim memory from the cgroup and its descendants: swap out / drop file cache until below the hard limit (or the soft limit, if enabled)
  - If that fails, the OOM killer is invoked on the cgroup and its descendants
- The OOM killer targets jobs with a high /proc/[pid]/oom_score
  - "Bad" jobs are the ones closest to their limit
  - Jobs have oom_score_adj set so they appear, at minimum, to be at 80%+ of their limit
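The selection rule above ("closest to its limit wins") can be illustrated with a toy helper; pick_oom_target is hypothetical and reads "name usage limit" lines, not anything Condor or the kernel actually exposes in this form:

```shell
# Sketch: among candidate jobs, pick the one whose usage is the largest
# fraction of its own limit, mirroring how a high oom_score makes a job
# the preferred OOM-kill target.
pick_oom_target() {
  awk '{ ratio = $2 / $3
         if (ratio > best) { best = ratio; who = $1 } }
       END { print who }'
}

printf '%s\n' "job1 400 1000" "job2 900 1000" | pick_oom_target   # prints job2
```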

Details of enforcement (soft limits on jobs)
- Soft limits on jobs mean the OOM event will occur above the slot cgroup
  - With a hard limit on Condor, see the previous slide
  - Otherwise, it occurs when all system RAM and swap are exhausted
- In both cases, Condor intercepts the OOM event to perform the kill and cleanup

Differences in behavior
- Hitting the Condor service hard limit is different from exhausting the system
- If system resources are exhausted, a global OOM occurs "outside" of cgroups
  - In this case, the oom_score_adj set by Condor is very important!
- If triggered by the Condor hard limit, every job sees the OOM event!
  - Here the oom_score_adj set by Condor doesn't matter, because all jobs have the same value (-800)
  - Condor < 8.6 responded by killing every job on the node!
  - Condor >= 8.6, with the default IGNORE_LEAF_OOM=True, examines each job before killing, and doesn't kill it if it is below 90% of its request

cgroups-v2
- Controllers are unified into a much simpler tree structure
  - Processes live only in leaf nodes
- The memory controller eschews soft/hard limits in favor of low/high/max
  - Built-in understanding that job memory usage is hard to predict: be flexible
  - Eliminates userspace OOM handling
  - Encourages applications to monitor memory and change limits dynamically
- The memory controller measures swap as its own resource
- It is possible to use the v2 memory interface while using v1 for other controllers
- Talk by the cgroups author: https://www.youtube.com/watch?v=PzpG40WiEfM
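As a configuration sketch, the v2 low/high/max split for a single job cgroup might look like the following; the /sys/fs/cgroup/htcondor/job1 path is purely illustrative (it is not Condor's actual layout), and the values are examples:

```shell
# Illustrative v2 memory settings; low/high/max replace v1's soft/hard
# limits, and swap is capped as a separate resource.
CG=/sys/fs/cgroup/htcondor/job1        # assumed path, not Condor's real layout
echo 2G > "$CG/memory.low"             # best-effort protection below this
echo 4G > "$CG/memory.high"            # reclaim/throttle above this
echo 6G > "$CG/memory.max"             # hard cap; OOM kill above this
echo 1G > "$CG/memory.swap.max"        # swap accounted as its own resource
```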

Thoughts
- Define system stability as the goal rather than constraining jobs
- Memory management:
  - Soft limits don't really do much except encourage a bit of swapping
  - Soft and hard limits are separate settings; we could set both for jobs!
  - There must be a swappiness knob
  - Reconsider the IGNORE_LEAF_OOM behavior for jobs at the 90% level
  - Consider an alternative approach in the spirit of cgroups-v2
- cgroups-v2 will be a production opt-in on Debian 9 in the next few months
  - Whether swap should exist at all is a matter of religion
- Might be able to use systemd/cgroups to really make backfill work nicely
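The "set both limits, plus swappiness" idea can be sketched against the v1 memory controller like this; the slot path is illustrative and the values are examples, not a recommendation:

```shell
# Illustrative v1 settings for one job cgroup: a soft limit (reclaim
# target under memory pressure), a hard RAM cap, and the per-cgroup
# swappiness knob mentioned above.
CG=/sys/fs/cgroup/memory/system.slice/condor.service/slot1_1
echo 4G > "$CG/memory.soft_limit_in_bytes"   # reclaim toward this under pressure
echo 5G > "$CG/memory.limit_in_bytes"        # hard RAM cap
echo 10 > "$CG/memory.swappiness"            # discourage (but allow) swapping
```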

Increase use of systemd?
- Condor could mimic systemd by creating many scope units
- In preparing this talk, I concluded that one still needs to write directly to the cgroups API, but it may be worth examining the long-term benefits of using systemd as a stable interface to cgroups
- If it can be made compatible with /etc configuration files or templates, this would let users apply cgroups to other resources without developing new Condor knobs