Scaling Puppet and Foreman for HPC


Trey Dockendorf, HPC Systems Engineer, Ohio Supercomputer Center

Introduction
- Puppet: configuration management
- Hiera: YAML data backend for Puppet
- Foreman: provisioning
- HPC environment runs from an NFS root
- Deployed to 1000 HPC and infrastructure systems
- First used on Owens, an 824-node cluster

Motivation
- Any large HPC center must scale the provisioning and management of its HPC clusters
- Common provisioning and configuration management between compute and infrastructure
- Unified management of PXE, DHCP, and DNS
- Audit network interfaces
- Support testing of configuration changes, both unit and system

Foreman
- Host life-cycle management
- DNS, DHCP, and TFTP for both infrastructure and HPC
- Tuning: PassengerMaxPoolSize, to prevent overload of the Foreman host
- NFS root support required custom provisioning templates
  - "Local Boot" PXE template overridden to always network boot
- Workflow change for HPC: no "Build" mode; key-value parameters used instead

Foreman Key-Value Storage
- Key-value pairs stored in Foreman as Parameters
- Change behavior during boot: nfsroot_build
- Change TFTP templates: nfsroot_path, nfsroot_host, nfsroot_kernel_version
- Hierarchical storage provides inheritance: base/owens -> base/owens/compute -> FQDN
- Managed using the web UI, and from scripts via the API (host-parameter.py & hostgroup-parameter.py); a sketch of such a script follows
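A minimal sketch of what a parameter script in the spirit of host-parameter.py could look like against the Foreman API v2; the server URL, credentials, and example host below are illustrative assumptions, not the actual OSC tooling:

```python
#!/usr/bin/env python
# Sketch of a host parameter getter/setter against the Foreman API v2.
# FOREMAN_URL, AUTH, and the example host are illustrative assumptions.
import requests

FOREMAN_URL = "https://foreman.example.com"  # assumption
AUTH = ("admin", "changeme")                 # assumption

def get_parameter(host, name):
    r = requests.get("%s/api/hosts/%s/parameters" % (FOREMAN_URL, host),
                     auth=AUTH)
    r.raise_for_status()
    for param in r.json()["results"]:
        if param["name"] == name:
            return param["value"]
    return None

def set_parameter(host, name, value):
    # POST creates a new parameter; updating an existing one would be a
    # PUT to /api/hosts/:host/parameters/:id instead.
    payload = {"parameter": {"name": name, "value": value}}
    requests.post("%s/api/hosts/%s/parameters" % (FOREMAN_URL, host),
                  auth=AUTH, json=payload).raise_for_status()

if __name__ == "__main__":
    set_parameter("c0001.example.com", "nfsroot_build", "true")
    print(get_parameter("c0001.example.com", "nfsroot_build"))
```

hostgroup-parameter.py would presumably follow the same pattern against the /api/hostgroups/:id/parameters endpoint.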

Foreman NFS Root Provisioning
- Provisioning handled by the read/write host; Foreman's NFS root support was written from scratch
- Read-only hosts have specific writable locations defined by /etc/statetab and /etc/rwtab
  - statetab: persists across reboots; used only when absolutely necessary, so that rebooting retains the ability to reset a system
  - rwtab: does not persist across reboots; all Puppet-managed resources go in rwtab
- Read-only rebuild: triggered by the nfsroot_build parameter, set with host-parameter.py; a Foreman role gives students the ability to rebuild nodes
- osc-partition service: checks the key-value parameter in Foreman; partition scripts are generated by Puppet from a partition schema defined in Hiera
- partition-wait script: runs when nfsroot_build=false and waits for LVM to become available (see the sketch below)
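The partition-wait idea can be pictured as a small polling loop; the device path and timeout below are hypothetical, not the actual OSC script:

```python
#!/usr/bin/env python
# Sketch of the partition-wait idea: block until the LVM device backing
# the node's writable state appears. Device path and timeout are
# hypothetical assumptions.
import os
import sys
import time

DEVICE = "/dev/mapper/vg_state-lv_state"  # hypothetical LVM device
TIMEOUT = 300                             # seconds

deadline = time.time() + TIMEOUT
while not os.path.exists(DEVICE):
    if time.time() > deadline:
        sys.exit("Timed out waiting for %s" % DEVICE)
    time.sleep(5)
print("%s is available, continuing boot" % DEVICE)
```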

NFS Root Boot Workflow

Parameter Manipulation Timing

    Operation                   Avg Time (sec)   StdDev (sec)
    host get                    0.522            0.021
    host list                   0.467            0.025
    host set                    0.581            0.024
    host set w/ TFTP sync       1.321            0.057
    host delete                 0.433            0.026
    host delete w/ TFTP sync    1.148            0.038
    hostgroup get               0.510            -
    hostgroup list              0.514            0.034
    hostgroup set               0.526            0.030
    hostgroup delete            0.489            -

Scaling Puppet – Standalone Hosts
- Typically a master compiles the catalog for agents
- Scaling achieved by load balancing between masters; the number of CPUs available to the masters is the number of concurrent agents
- Subject Alternative Name certificates: any master can act as the CA
- Masters synced with mcollective and r10k
- Environments isolated by the r10k control repo and git branching; modules not in the control repo are defined in a Puppetfile, mostly community modules
- Foreman acts as the ENC (External Node Classifier) for Puppet, supplying important data such as IP address and hostgroup (a sketch of the ENC contract follows)
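The ENC contract itself is simple: Puppet executes a configured script with the node's certname and reads YAML from stdout. Foreman ships its own ENC script; the classes and parameter values in this minimal sketch are hypothetical stand-ins:

```python
#!/usr/bin/env python
# Sketch of the ENC contract: Puppet runs this script with a certname
# argument and parses YAML from stdout. The classes and parameters
# below are hypothetical; Foreman's real ENC pulls them from its API.
import sys
import yaml

def classify(certname):
    # In the real setup this data would come from Foreman.
    return {
        "environment": "production",
        "classes": ["profile::compute"],  # hypothetical class
        "parameters": {
            "hostgroup": "base/owens/compute",
            "ip": "10.0.0.1",
        },
    }

if __name__ == "__main__":
    print(yaml.safe_dump(classify(sys.argv[1]), default_flow_style=False))
```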

Puppet Performance – Standalone Hosts

                           Min    Max    Mean   Std
    Resource Count         644    9093   1313   806
    Compile Times (sec)    14     82     21     9

Scaling Puppet – HPC Systems
- Scaling achieved by removing the master and running masterless
- Primary bottleneck is the performance of reading manifests and modules, then compiling locally
- /opt/puppet: manifests and modules synced by mcollective and r10k
- Read-write hosts still use Puppet masters
- Masterless Puppet is run via papply (sketched below)
- Environment isolation defaults to the read-write host's environment
- Stateful runs use PuppetDB and the Foreman ENC
- Stateless runs use a minimal catalog applied in two stages at boot, and manage the writable locations in rwtab
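A papply-style wrapper can be sketched as little more than a puppet apply invocation against the synced tree; the paths and flags below are assumptions about the layout, not the actual OSC papply:

```python
#!/usr/bin/env python
# Sketch of a papply-style masterless run: apply manifests and modules
# that mcollective and r10k synced to /opt/puppet. Exact paths and
# flags are assumptions, not the actual OSC papply.
import subprocess
import sys

PUPPET_ROOT = "/opt/puppet"  # synced tree

cmd = [
    "puppet", "apply",
    "--modulepath", "%s/modules" % PUPPET_ROOT,
    "--hiera_config", "%s/hiera.yaml" % PUPPET_ROOT,
    "%s/manifests/site.pp" % PUPPET_ROOT,
]
sys.exit(subprocess.call(cmd))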

NFS Root Boot Workflow

Puppet Performance – Masterless

               Min (sec)   Max (sec)   Mean (sec)   Std (sec)
    Early      27          257         115          57
    Late       87          414         185          73
    Compile    3           19          9            -

- The Late stage contains 71 managed resources
- The Late stage includes a 60-second wait for filesystems, so that GPFS is mounted before pbs_mom and NHC run
- Times collected from system bring-up after a maintenance

Cluster Definitions - YAML
- YAML files define the cluster
- A script syncs the YAML with Foreman
- Loaded into Puppet as Hiera data, making Puppet aware of cluster nodes and their properties
- Populates clustershell, pdsh, Torque, conman, powerman, SSH host-based auth, etc.
- YAML deployed to the root filesystem as a Python pickle and a Ruby marshal (see the sketch below)
- The data on the root filesystem is loaded to populate facts, both Ruby and Python based
  - Informational facts, such as node location
  - Facts that determine behavior when Puppet runs
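A minimal sketch of the deploy step, assuming hypothetical file locations and YAML layout; the real sync script also pushes the data into Foreman and writes a Ruby marshal alongside the pickle:

```python
#!/usr/bin/env python
# Sketch of deploying a cluster-definition YAML to the root filesystem
# as a Python pickle for fast loading by facts at boot. Both file
# paths and the YAML layout are illustrative assumptions.
import pickle
import yaml

SRC = "/etc/cluster/owens.yaml"        # hypothetical YAML definition
DST = "/var/lib/cluster/owens.pickle"  # hypothetical pickle location

with open(SRC) as f:
    cluster = yaml.safe_load(f)  # e.g. {"nodes": {"c0001": {"rack": "r01"}}}

with open(DST, "wb") as f:
    pickle.dump(cluster, f)
```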

Custom Fact Example Usage
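Facter treats executable scripts placed under its facts.d directory as external facts, parsing key=value lines from their output, so a Python fact over the deployed pickle might look like the following sketch (the pickle path and fact names are assumptions):

```python
#!/usr/bin/env python
# Sketch of a Python external fact: Facter executes scripts in its
# facts.d directory and parses key=value lines from stdout. The pickle
# path and fact names are illustrative assumptions.
import pickle
import socket

with open("/var/lib/cluster/owens.pickle", "rb") as f:
    cluster = pickle.load(f)

shortname = socket.gethostname().split(".")[0]
node = cluster.get("nodes", {}).get(shortname, {})
print("cluster_rack=%s" % node.get("rack", "unknown"))
print("cluster_switch=%s" % node.get("switch", "unknown"))
```

Facts like these give Puppet manifests node-specific data (location, role-determining flags) without any master round trip.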

Repositories
- Foreman Templates: https://github.com/treydock/osc-foreman-templates
- Foreman Plugin: https://github.com/treydock/foreman_osc
- NFS Root Module: https://github.com/treydock/puppet-nfsroot
- Puppet Masterless Module: https://github.com/treydock/puppet-puppet_masterless
- Cluster Facts Module: https://github.com/treydock/puppet-osc_facts