FROM QUATTOR TO PUPPET: A T2 point of view
BACKGROUND
- GRIF: distributed T2 site with 6 sub-sites; used Quattor
- GRIF-LAL is the home of two well-known Quattor gurus
- GRIF-IRFU (CEA) subsite:
  - Runs a 4200-core cluster with 2.3 PiB of DPM storage (about 50% of GRIF CPU resources)
  - Is the only non-IN2P3 subsite
  - Had 3 sysadmins in the past
  - Has local policies and requirements the others don't (seem to) have
- Started looking at, and migrating to, Puppet after HEPiX 2012 (05/2012)
SOME REASONS TO CHANGE
- IRFU was the Quattor black sheep in GRIF
  - We always had to write hacks, and maintain them, to abide by local policies and requirements
- We were uncomfortable with compile times
  - Under Windows/Eclipse: 1 to 10+ minutes on an i7 laptop
  - Under Linux (for deploying), on a 4-core Xeon: 1 to 10+ additional minutes!
- Debugging and understanding was SO time consuming
- We had no control over security updates
SOME REASONS TO CHANGE (2)
- Quattor at GRIF suffers from several SPOFs:
  - Power cut at LAL: no management tool
  - Network failure at LAL: no management
  - SVN failure: nothing
  - Power/network maintenance: no work...
  - Want to add a package? Connect as root on quattorsrv@LAL
- Quattor is time consuming:
  - Poor Quattor/QWG documentation
  - grep -r over 23000 files? Slow as hell, even on SSD. Even in memory.
- SPMA (no yum) was really getting on our nerves
  - Checkdeps? Not working. Cluster-wide failures were common.
  - Special award for the cluster-breaking ncm-accounts
2012: THE DECISIVE YEAR
- May 2012: I tried to set up an EMI1 WMS+LB on a single host
  - We were starting to get pressure to migrate gLite 3.2 services
  - We wanted to avoid the SPOF LB@LAL, hence chose a "WMS+LB"
  - Spent about one month (or two?) on this
  - There were issues everywhere. The first one was the design.
  - And diving (drowning) into Perl objects was a nightmare
- September 2012: end of gLite
  - Sites were required to migrate to EMI
  - GRIF and IRFU failed to meet the deadline
  - Most FR T2s also failed, mainly because the Quattor templates were not ready
2012: THE DECISIVE YEAR (2)
- Manpower:
  - The IRFU grid team lost one main sysadmin in 2011
    - We fought hard to keep the position, and a new sysadmin we had trained
  - IRFU lost 3 computing people recently
  - 2 more people retire in 2014, with no replacement
- Conclusion:
  - Losing time was not possible anymore
  - Quattor was not meeting our expectations
  - We had to try something else
WHAT DID WE WANT, AS A T2?
- Something with the potential to increase our efficiency (a lot)
- Something that would allow a "test before break" approach
- Something we could control, reproduce, manage and update. Ourselves.
- Something that would allow upgrades when they are out, not one year later
- We wanted to spend our time working, not waiting/fixing/hacking/maintaining management software
SO WE CHOSE TO TRY PUPPET
- Because CERN chose it
- Because we wanted our temporary sysadmin to master something award-winning, should we fail to hire her permanently
- Because the community is huge
- Because the documentation is good
- Because the developers are reactive
- Because it's easy to understand (most of the time)
- Because we know how to fix a module. But not yet the Ruby ones ;)
https://www.flickr.com/photos/19779889@N00/7369247848
THE ROAD TO PUPPET
- It was NOT easy:
  - It took us 2 years to migrate everything
  - We spent many hours late at night on this
  - Puppet and Foreman are not perfect
- We were in a hurry:
  - We always had to upgrade something with Quattor
  - We wanted to meet deadlines
  - We wanted to avoid Quattor upgrades (SPMA yum, JSON...)
- We started with easy things: virtual machines
  - We spent months writing "base modules" that configure the base machines as we want: OS packages, fixed IPv4, repositories, NTP, firewalls, DNS, network...
- Then came the Foreman/puppetmaster
  - Managed by Puppet itself
  - Complex, even with Puppet modules
- Then we started implementing easy things:
  - perfSONAR (PS, MDM)
  - National accounting machine (MySQL server)
  - NFS servers...
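A "base module" of the kind described above could look like the following minimal sketch. The class and parameter names are illustrative, not the actual GRIF-IRFU code, and the firewall rule assumes the puppetlabs/firewall module is installed:

```puppet
# Hypothetical sketch of a "base" module applied to every node:
# NTP and a basic firewall rule. Names and defaults are illustrative.
class base (
  $ntp_server = 'ntp.example.org',
) {
  package { 'ntp':
    ensure => installed,
  }

  file { '/etc/ntp.conf':
    ensure  => file,
    content => "server ${ntp_server} iburst\n",
    require => Package['ntp'],
    notify  => Service['ntpd'],
  }

  service { 'ntpd':
    ensure => running,
    enable => true,
  }

  # Assumes the puppetlabs/firewall module provides the firewall type
  firewall { '100 allow ssh':
    dport  => '22',
    proto  => 'tcp',
    action => 'accept',
  }
}
```

One such class per concern (NTP, repositories, DNS, network...), all pulled in by a single node-level profile, is the usual way to build the "base machine" layer the slide mentions.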
THE ROAD TO PUPPET (2)
- The next step was the grid machines
  - We wrote grid and yaim modules
    - The first one calling the second one...
  - And we hardcoded a few static things
    - VO details, account UIDs...
- We implemented from lowest to highest difficulty/risk:
  - WMS
  - Computing (CREAM CE + Torque + Maui)
  - Storage (DPM)
- We faced requirements and issues along the way:
  - Even the CERN modules are sometimes not so good
  - ARGUS, NGI ARGUS
  - EMI3 accounting/migration
  - glexec
  - Patching the DPM modules over and over
  - Xrootd federation setup
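The "grid module calling the yaim module, with a few static things hardcoded" pattern might be sketched like this. All class names, parameters and the ERB template are hypothetical, and join() comes from puppetlabs-stdlib:

```puppet
# Hypothetical sketch: a grid profile hardcodes static site data
# (VO details, account UIDs...) and delegates to a yaim wrapper class.
class grid::cream_ce {
  $supported_vos = ['atlas', 'cms', 'dteam']  # illustrative VO list
  $pool_uid_base = 40000                      # illustrative UID base

  class { 'yaim':
    node_types => ['creamCE', 'TORQUE_server'],
    vos        => $supported_vos,
    uid_base   => $pool_uid_base,
  }
}

class yaim ($node_types, $vos, $uid_base) {
  # Render site-info.def from the data, re-run yaim only when it changes.
  # The template is hypothetical; join() requires puppetlabs-stdlib.
  file { '/opt/glite/yaim/etc/site-info.def':
    ensure  => file,
    content => template('yaim/site-info.def.erb'),
  }

  exec { 'run-yaim':
    command     => "/opt/glite/yaim/bin/yaim -c -s /opt/glite/yaim/etc/site-info.def -n ${join($node_types, ' -n ')}",
    refreshonly => true,
    subscribe   => File['/opt/glite/yaim/etc/site-info.def'],
  }
}
```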
OUR ERRORS
- We learned Puppet as we were using it
- We wrote modules with too many inter-dependencies
  - This prevents pushing them to GitHub or Puppet Forge without refactoring
  - We think some or many of our modules need a huge refactoring to be considered usable by others. We will do this when we have time.
- We avoided using Hiera at first
  - But Hiera is deeply hard-linked to the CERN modules, so we enabled it in the end
  - Hiera is simple, and it allowed us to separate the Puppet code (the modules) from the configuration data (site name, IPs, filesystem UUIDs...)
- We patched stuff that then evolved :'(
- We put passwords and MD5 hashes in git
- Maybe git is an error too...
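The code/data split that Hiera enables might look like this minimal sketch, using the Puppet 3 hiera() function. The keys, class names and file paths are illustrative, not the actual GRIF-IRFU setup:

```puppet
# Hypothetical sketch of code vs. data separation with Hiera.
# The module only says *what* to configure; per-site values live in
# a Hiera YAML file, e.g. hieradata/common.yaml containing:
#   site::name: 'GRIF-IRFU'
#   site::dpm_head: 'dpmhead.example.org'
class site::params {
  # Puppet 3 style lookups; the second argument is a default value
  $site_name = hiera('site::name', 'UNSET')
  $dpm_head  = hiera('site::dpm_head')
}

class dpm::client {
  include site::params

  file { '/etc/sysconfig/dpm-client':
    ensure  => file,
    content => "DPM_HOST=${site::params::dpm_head}\n",
  }
}
```

With this split, moving the same modules to another site only requires changing the YAML data, not the Puppet code.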
ACHIEVEMENTS
- It took 2 years to fully migrate to Puppet. But we did it with very limited manpower.
- We did not only migrate to Puppet:
  - We reinstalled everything with SL6 and EMI3
  - We deployed preprod and dev environments
  - With only 3 days of downtime for storage, and about one week for computing
- We are managing one Debian server with the exact same manifests. No or little extra work.
- We were (one of) the first FR sites fully EMI3 compliant
  - One month ahead of the deadline
  - While half of the French sites again failed to meet the EMI3 deadline. Even some GRIF subsites failed.
- We helped, and are ready to help, other French sites to get puppet-kickstarted
- We are now "devops" ready (?)
https://www.flickr.com/photos/7870793@N03/8266423479/
WHAT NEXT?
- If/when yaim dies, we will replace it
- We are testing Slurm
  - We will then replace torque2/maui (which might die anyway at the next CVE)
  - And enable multithreaded jobs on our grid site
- We want to test/deploy CEPH to replace *NFS in our cluster
STORY END https://www.flickr.com/photos/stf-o/9617058578/sizes/h/in/photostream/
EXTRA 1: ARCHITECTURE
- We are currently running Puppet 3.5.1 with Foreman 1.4
  - With one single puppetmaster for 359 hosts, loaded at ~40% at peak times
- We have 3 Puppet environments mapping to 3 git branches:
  - dev
  - preprod
  - prod
- Each git push instantly updates the 3 branches on the puppetmaster
- We develop in the dev branch, then merge into preprod. If preprod does not fail, we then merge into prod.
- We sometimes create local branches, to track changes during huge module updates
- We recently deployed PuppetDB, in order to automate the monitoring setup
  - Our check_mk is now automated: new machines are automatically monitored
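The PuppetDB-backed monitoring automation mentioned above is typically done with exported resources: every node exports a monitoring entry, and the monitoring server collects them all. A minimal sketch, with hypothetical paths and class names:

```puppet
# Hypothetical sketch: automatic check_mk registration via PuppetDB.
class monitoring::client {
  # '@@' marks the resource as exported: it is stored in PuppetDB
  # instead of being applied on this node.
  @@file { "/etc/check_mk/conf.d/hosts/${::fqdn}.mk":
    ensure  => file,
    content => "all_hosts += [ '${::fqdn}' ]\n",
    tag     => 'check_mk_host',
  }
}

class monitoring::server {
  # Collect every exported host entry, and re-run the check_mk
  # inventory whenever a new one appears.
  File <<| tag == 'check_mk_host' |>> {
    notify => Exec['check_mk-inventory'],
  }

  exec { 'check_mk-inventory':
    command     => '/usr/bin/check_mk -I && /usr/bin/check_mk -O',
    refreshonly => true,
  }
}
```

Once every node includes monitoring::client, a new machine shows up in check_mk after its first Puppet run, with no manual registration.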
EXTRA 2: PERFORMANCE ISSUES
- Master load: the client splay option helped
- Graph analysis (using Gephi) also helped us limit dependencies and eradicate useless N-to-M dependencies
  - This is a "simple" WN graph...
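The splay option adds a random delay before each agent run, so that several hundred agents do not all hit the master at the same moment. A sketch of the agent-side settings, with illustrative values:

```ini
# /etc/puppet/puppet.conf (agent side) -- illustrative values
[agent]
  runinterval = 1800   # check in every 30 minutes
  splay       = true   # add a random delay before each run
  splaylimit  = 900    # cap that delay at 15 minutes
```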