
FROM QUATTOR TO PUPPET A T2 point of view

BACKGROUND
GRIF: a distributed T2 site
- 6 sub-sites
- Used quattor for …
- GRIF-LAL is the home of 2 well-known quattor gurus
GRIF-IRFU (CEA) subsite
- Runs a 4200-core cluster with 2.3 PiB of DPM storage
  o About 50% of GRIF CPU resources
- Is the only non-IN2P3 subsite
- Had 3 sysadmins in the past
- Has local policies and requirements others don't (seem to) have
Started looking at, and migrating to, puppet after HEPiX 2012 (05/2012)

SOME REASONS TO CHANGE
IRFU was the quattor black sheep in GRIF
- Always had to hack, and maintain those hacks, to abide by local policies and requirements
We were uncomfortable
- with compile times
  o under Windows/Eclipse: 1 to 10+ minutes on a laptop i7
  o under Linux (for deploying), on a 4-core Xeon: additional minutes!
- Debugging and understanding was SO time-consuming
We did not have control over security updates

SOME REASONS TO CHANGE (2)
Quattor at GRIF suffers from several SPOFs
- Power cut at LAL: no management tool
- Network failure at LAL: no management
- SVN failure: nothing
- Power/network maintenance: no work…
- Want to add some package? Connect as root on…
Quattor is time-consuming
- poor quattor/QWG documentation
- grep -er with files? Slow as hell, even on SSD. Even in memory.
- SPMA (no yum) was really getting on our nerves
  o Checkdeps? Not working.
Cluster-wide failures were common
- Special award for the cluster-breaking ncm-accounts

2012: THE DECISIVE YEAR
May 2012: I tried to set up an EMI1 WMS+LB on a single host
- starting to get pressure to migrate gLite 3.2 services
- wanted to avoid the SPOF, hence chose a "WMS+LB"
- Spent about one month (or two?) on this
- There were issues everywhere. The first one was the design.
- And diving (drowning) into Perl objects was a nightmare
September 2012: end of gLite
- Sites were required to migrate to EMI
- GRIF and IRFU failed to meet the deadline
- Most FR T2s also failed
- Mainly because quattor templates were not ready

2012: THE DECISIVE YEAR (2)
Manpower
- IRFU grid team lost one main sysadmin in 2011
  o We fought hard to keep the position and trained a new sysadmin
- IRFU lost 3 computing people recently
- 2 more people to retire in 2014
- No replacement
Conclusion
- Losing time was not possible anymore
- Quattor was not meeting our expectations
- We had to try something else

WHAT DID WE WANT, AS A T2?
Something with the potential to increase (a lot) our efficiency
- that would allow a "test before break" approach
- that we could control, reproduce, manage and update. Ourselves.
- that would allow upgrades when they are out, not 1 year later
We wanted to spend our time working, not waiting/fixing/hacking/maintaining management software

SO WE CHOSE TO TRY PUPPET
Because CERN chose it.
Because we wanted our temporary sysadmin to master something award-winning
- should we fail to hire her permanently
Because the community is huge.
Because the documentation is good.
Because developers are responsive.
Because it's easy to understand
- (most of the time)
Because we know how to fix a module.
- But not yet Ruby ones ;)

THE ROAD TO PUPPET
Was NOT easy
- It took us 2 years to migrate everything
- We spent many hours late at night on this
- Puppet and Foreman are not perfect
We were in a hurry
- Always had to upgrade something with quattor
- wanted to meet deadlines
- wanted to avoid quattor upgrades (SPMA yum, JSON…)
We started with easy things: virtual machines
- we spent months writing "base modules" that configure the base machines as we want: OS packages, fixed IPv4, repositories, NTP, firewalls, DNS, network…
Then came the Foreman/puppetmaster
- Managed by puppet itself
- Complex, even with puppet modules
Then we started implementing easy things:
- perfSONAR (PS, MDM)
- National accounting machine (MySQL server)
- NFS servers…
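A "base module" of this kind might look like the sketch below. This is purely illustrative: the class name, parameters and template path are hypothetical, not the actual GRIF-IRFU code.

```puppet
# Hypothetical base class applied to every machine (names are examples only)
class base (
  $ntp_servers = ['ntp1.example.org', 'ntp2.example.org'],
) {
  # OS packages every node should carry
  package { ['openssh-clients', 'vim-enhanced']:
    ensure => installed,
  }

  # NTP: install, render the config, restart the daemon on change
  package { 'ntp': ensure => installed }

  file { '/etc/ntp.conf':
    ensure  => file,
    content => template('base/ntp.conf.erb'),  # hypothetical template
    require => Package['ntp'],
    notify  => Service['ntpd'],
  }

  service { 'ntpd':
    ensure => running,
    enable => true,
  }
}
```

The package/file/service chain with `require` and `notify` is the standard Puppet idiom for "configure and keep running", which is what such base modules mostly consist of.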

THE ROAD TO PUPPET (2)
Next step was grid machines
- We wrote grid and yaim modules
  o The first one calling the second one…
- And hardcoded a few static things
  o VO details, account UIDs…
We implemented from lowest to highest difficulty/risk
- WMS
- Computing (CREAM CE + Torque + Maui)
- Storage (DPM)
We faced requirements and issues along the way
- Even CERN modules sometimes are not so good
- ARGUS, NGI ARGUS
- EMI3 accounting/migration
- gLExec
- DPM module patching over and over
- Xrootd federation setup

OUR ERRORS
We learned puppet as we were using it
- We wrote modules with too many inter-dependencies
  o This prevents pushing them to GitHub or Puppet Forge without refactoring
- We do think some or many of our modules need a huge refactoring to be considered usable by others. We will do this when we have time.
We avoided using hiera at first
- But hiera is deeply hard-linked to the CERN modules, so we enabled it in the end
- Hiera is simple, and in any case allowed us to separate puppet code (the modules) from configuration data (site name, IPs, filesystem UUIDs…)
We patched stuff that then evolved :'(
We put passwords and md5 hashes in git
Maybe git is an error too…
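The code/data split that hiera provides can be sketched as follows. Everything here is hypothetical (file paths, keys, values), just to show the mechanism: site-specific facts live in YAML, and the module stays generic.

```puppet
# Hypothetical hiera data file, e.g. hieradata/common.yaml:
#
#   dpm_headnode: 'dpmhead.example.org'
#   site_name:    'MY-SITE'
#
# The module itself carries no site specifics; it looks the data up
# at catalog compile time (hiera() is the Puppet 3-era lookup function):
class dpm_client {
  $headnode = hiera('dpm_headnode')

  file { '/etc/dpm_client.conf':
    ensure  => file,
    content => "DPM_HOST=${headnode}\n",
  }
}
```

With this split, publishing the module does not leak site data, and changing an IP or hostname is a data-only commit.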

ACHIEVEMENTS
It took 2 years to fully migrate to puppet
- But we did it with very limited manpower
We not only migrated to puppet:
- We reinstalled everything with SL6, EMI3
- Deployed preprod and dev environments
With only
- 3 days of downtime for storage
- ~1 week for computing
We are managing one Debian server
- with the exact same manifests. No or little extra work.
We were (one of) the first FR sites fully EMI3 compliant
- One month ahead of the deadline
- While half of the French sites again failed to meet the EMI3 deadline
- Even some GRIF subsites failed
We helped, and are ready to help, other French sites get puppet-kickstarted
We are now "devops" ready (?)

WHAT NEXT?
If/when yaim dies, we will replace it
We are testing SLURM
- We will then replace torque2/maui (which might die anyway at the next CVE)
- And enable multithreaded jobs on our grid site
We want to test/deploy Ceph to replace *NFS in our cluster

STORY END

EXTRA 1: ARCHITECTURE
We are currently running puppet with Foreman 1.4
- With one single puppetmaster for 359 hosts
- ~40% at peak times
We have 3 puppet environments mapping to 3 git branches
- Dev
- Preprod
- Prod
Each git push instantly updates the 3 branches on the puppetmaster.
We develop in the dev branch, then merge into preprod.
- If preprod does not fail, we then merge into prod.
- We sometimes create local branches to track changes of huge module updates
We recently deployed PuppetDB in order to automate monitoring setup.
- Our check_mk is now automated: new machines are automatically monitored
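The dev → preprod → prod promotion described above boils down to plain git branch merges, roughly like this (repository location and file names are illustrative; only the branch names come from the slide):

```shell
# Sketch of the branch-per-environment workflow; assumes git is installed.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email 'admin@example.org'
git config user.name 'T2 Admin'
git checkout -qb dev

# New manifest work always lands in dev first
echo 'include base' > site.pp
git add site.pp
git commit -qm 'dev: new base profile'

# Promote to preprod once the dev nodes converge cleanly
git checkout -qb preprod
git merge -q dev

# Promote to prod after a preprod soak period
git checkout -qb prod
git merge -q preprod
```

Since each branch is served as a Puppet environment, a node's environment setting alone decides whether it sees dev, preprod or prod code.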

EXTRA 2: PERFORMANCE ISSUES
Master load: the client splay option helped
Graph analysis (using Gephi) also helped to limit dependencies and eradicate useless N-to-M dependencies
- this is a "simple" WN graph…
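The splay option mentioned above is an agent-side setting: it adds a random sleep before each run so that all nodes do not hit the master at the same instant. A minimal sketch of the relevant puppet.conf section (values are illustrative, not the GRIF-IRFU ones):

```ini
# puppet.conf on the agents
[agent]
runinterval = 1h
splay       = true
splaylimit  = 30m   ; random delay drawn from [0, 30m) before each run
```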