FROM QUATTOR TO PUPPET: A T2 point of view
BACKGROUND
- GRIF: distributed T2 site with 6 sub-sites; used Quattor
- GRIF-LAL is the home of two well-known Quattor gurus
- GRIF-IRFU (CEA) subsite:
  - Runs a 4200-core cluster with 2.3 PiB of DPM storage (about 50% of GRIF CPU resources)
  - Is the only non-IN2P3 subsite
  - Had 3 sysadmins in the past
  - Has local policies and requirements the others don't (seem to) have
- Started looking at, and migrating to, Puppet after HEPiX 2012 (05/2012)
SOME REASONS TO CHANGE
- IRFU was the Quattor black sheep in GRIF
  - We always had to write hacks, and maintain them, to abide by local policies and requirements
- We were uncomfortable with compile times
  - Under Windows/Eclipse: 1 to 10+ minutes on an i7 laptop
  - Under Linux (for deploying), on a 4-core Xeon: 1 to 10+ additional minutes!
- Debugging and understanding was SO time consuming
- We had no control over security updates
SOME REASONS TO CHANGE (2)
- Quattor at GRIF suffers from several SPOFs:
  - Power cut at LAL: no management tool
  - Network failure at LAL: no management
  - SVN failure: nothing
  - Power/network maintenance: no work...
  - Want to add a package? Connect as root on quattorsrv@LAL
- Quattor is time consuming:
  - Poor Quattor/QWG documentation
  - grep -r over 23000 files? Slow as hell, even on SSD. Even in memory.
- SPMA (no yum) was really getting on our nerves
  - Checkdeps? Not working. Cluster-wide failures were common.
  - Special award for the cluster-breaking ncm-accounts
2012: THE DECISIVE YEAR
- May 2012: I tried to set up an EMI1 WMS+LB on a single host
  - We were starting to get pressure to migrate gLite 3.2 services
  - We wanted to avoid the SPOF LB@LAL, hence chose a "WMS+LB"
  - Spent about one month (or two?) on this
  - There were issues everywhere. The first one was the design.
  - And diving (drowning) into Perl objects was a nightmare
- September 2012: end of gLite
  - Sites were required to migrate to EMI
  - GRIF and IRFU failed to meet the deadline
  - Most FR T2s also failed, mainly because the Quattor templates were not ready
2012: THE DECISIVE YEAR (2)
- Manpower:
  - The IRFU grid team lost one main sysadmin in 2011
    - We fought hard to keep the position, and a new sysadmin we had trained
  - IRFU lost 3 computing people recently
  - 2 more people retire in 2014, with no replacement
- Conclusion:
  - Losing time was not possible anymore
  - Quattor was not meeting our expectations
  - We had to try something else
WHAT DID WE WANT, AS A T2?
- Something with the potential to increase our efficiency (a lot)
- Something that would allow a "test before break" approach
- Something we could control, reproduce, manage and update. Ourselves.
- Something that would allow upgrades when they are out, not one year later
- We wanted to spend our time working, not waiting/fixing/hacking/maintaining management software
SO WE CHOSE TO TRY PUPPET
- Because CERN chose it
- Because we wanted our temporary sysadmin to master something award-winning, should we fail to hire her permanently
- Because the community is huge
- Because the documentation is good
- Because the developers are reactive
- Because it's easy to understand (most of the time)
- Because we know how to fix a module. But not yet the Ruby ones ;)
https://www.flickr.com/photos/19779889@N00/7369247848
THE ROAD TO PUPPET
- It was NOT easy:
  - It took us 2 years to migrate everything
  - We spent many hours late at night on this
  - Puppet and Foreman are not perfect
- We were in a hurry:
  - We always had to upgrade something with Quattor
  - We wanted to meet deadlines
  - We wanted to avoid Quattor upgrades (SPMA yum, JSON...)
- We started with easy things: virtual machines
  - We spent months writing "base modules" that configure the base machines as we want: OS packages, fixed IPv4, repositories, NTP, firewalls, DNS, network...
- Then came the Foreman/puppetmaster
  - Managed by Puppet itself
  - Complex, even with Puppet modules
- Then we started implementing easy things:
  - perfSONAR (PS, MDM)
  - National accounting machine (MySQL server)
  - NFS servers...
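A "base module" of the kind described above could look like the following minimal sketch. The class and parameter names are illustrative, not the actual GRIF-IRFU code, and the firewall rule assumes the puppetlabs/firewall module is installed:

```puppet
# Hypothetical sketch of a "base" module applied to every node:
# NTP and a basic firewall rule. Names and defaults are illustrative.
class base (
  $ntp_server = 'ntp.example.org',
) {
  package { 'ntp':
    ensure => installed,
  }

  file { '/etc/ntp.conf':
    ensure  => file,
    content => "server ${ntp_server} iburst\n",
    require => Package['ntp'],
    notify  => Service['ntpd'],
  }

  service { 'ntpd':
    ensure => running,
    enable => true,
  }

  # Assumes the puppetlabs/firewall module provides the firewall type
  firewall { '100 allow ssh':
    dport  => '22',
    proto  => 'tcp',
    action => 'accept',
  }
}
```

One such class per concern (NTP, repositories, DNS, network...), all pulled in by a single node-level profile, is the usual way to build the "base machine" layer the slide mentions.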
THE ROAD TO PUPPET (2)
- The next step was the grid machines
  - We wrote grid and yaim modules
    - The first one calling the second one...
  - And we hardcoded a few static things
    - VO details, account UIDs...
- We implemented from lowest to highest difficulty/risk:
  - WMS
  - Computing (CREAM CE + Torque + Maui)
  - Storage (DPM)
- We faced requirements and issues along the way:
  - Even the CERN modules are sometimes not so good
  - ARGUS, NGI ARGUS
  - EMI3 accounting/migration
  - glexec
  - Patching the DPM modules over and over
  - Xrootd federation setup
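The "grid module calling the yaim module, with a few static things hardcoded" pattern might be sketched like this. All class names, parameters and the ERB template are hypothetical, and join() comes from puppetlabs-stdlib:

```puppet
# Hypothetical sketch: a grid profile hardcodes static site data
# (VO details, account UIDs...) and delegates to a yaim wrapper class.
class grid::cream_ce {
  $supported_vos = ['atlas', 'cms', 'dteam']  # illustrative VO list
  $pool_uid_base = 40000                      # illustrative UID base

  class { 'yaim':
    node_types => ['creamCE', 'TORQUE_server'],
    vos        => $supported_vos,
    uid_base   => $pool_uid_base,
  }
}

class yaim ($node_types, $vos, $uid_base) {
  # Render site-info.def from the data, re-run yaim only when it changes.
  # The template is hypothetical; join() requires puppetlabs-stdlib.
  file { '/opt/glite/yaim/etc/site-info.def':
    ensure  => file,
    content => template('yaim/site-info.def.erb'),
  }

  exec { 'run-yaim':
    command     => "/opt/glite/yaim/bin/yaim -c -s /opt/glite/yaim/etc/site-info.def -n ${join($node_types, ' -n ')}",
    refreshonly => true,
    subscribe   => File['/opt/glite/yaim/etc/site-info.def'],
  }
}
```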
OUR ERRORS
- We learned Puppet as we were using it
- We wrote modules with too many inter-dependencies
  - This prevents pushing them to GitHub or Puppet Forge without refactoring
  - We think some or many of our modules need a huge refactoring to be considered usable by others. We will do this when we have time.
- We avoided using Hiera at first
  - But Hiera is deeply hard-linked to the CERN modules, so we enabled it in the end
  - Hiera is simple, and it allowed us to separate the Puppet code (the modules) from the configuration data (site name, IPs, filesystem UUIDs...)
- We patched stuff that then evolved :'(
- We put passwords and MD5 hashes in git
- Maybe git is an error too...
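The code/data split that Hiera enables might look like this minimal sketch, using the Puppet 3 hiera() function. The keys, class names and file paths are illustrative, not the actual GRIF-IRFU setup:

```puppet
# Hypothetical sketch of code vs. data separation with Hiera.
# The module only says *what* to configure; per-site values live in
# a Hiera YAML file, e.g. hieradata/common.yaml containing:
#   site::name: 'GRIF-IRFU'
#   site::dpm_head: 'dpmhead.example.org'
class site::params {
  # Puppet 3 style lookups; the second argument is a default value
  $site_name = hiera('site::name', 'UNSET')
  $dpm_head  = hiera('site::dpm_head')
}

class dpm::client {
  include site::params

  file { '/etc/sysconfig/dpm-client':
    ensure  => file,
    content => "DPM_HOST=${site::params::dpm_head}\n",
  }
}
```

With this split, moving the same modules to another site only requires changing the YAML data, not the Puppet code.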
ACHIEVEMENTS
- It took 2 years to fully migrate to Puppet. But we did it with very limited manpower.
- We did not only migrate to Puppet:
  - We reinstalled everything with SL6 and EMI3
  - We deployed preprod and dev environments
  - With only 3 days of downtime for storage, and about one week for computing
- We are managing one Debian server with the exact same manifests. No or little extra work.
- We were (one of) the first FR sites fully EMI3 compliant
  - One month ahead of the deadline
  - While half of the French sites again failed to meet the EMI3 deadline. Even some GRIF subsites failed.
- We helped, and are ready to help, other French sites to get puppet-kickstarted
- We are now "devops" ready (?)
https://www.flickr.com/photos/7870793@N03/8266423479/
WHAT NEXT?
- If/when yaim dies, we will replace it
- We are testing Slurm
  - We will then replace torque2/maui (which might die anyway at the next CVE)
  - And enable multithreaded jobs on our grid site
- We want to test/deploy CEPH to replace *NFS in our cluster
STORY END https://www.flickr.com/photos/stf-o/9617058578/sizes/h/in/photostream/
EXTRA 1: ARCHITECTURE
- We are currently running Puppet 3.5.1 with Foreman 1.4
  - With one single puppetmaster for 359 hosts, loaded at ~40% at peak times
- We have 3 Puppet environments mapping to 3 git branches:
  - dev
  - preprod
  - prod
- Each git push instantly updates the 3 branches on the puppetmaster
- We develop in the dev branch, then merge into preprod. If preprod does not fail, we then merge into prod.
- We sometimes create local branches, to track changes during huge module updates
- We recently deployed PuppetDB, in order to automate the monitoring setup
  - Our check_mk is now automated: new machines are automatically monitored
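The PuppetDB-backed monitoring automation mentioned above is typically done with exported resources: every node exports a monitoring entry, and the monitoring server collects them all. A minimal sketch, with hypothetical paths and class names:

```puppet
# Hypothetical sketch: automatic check_mk registration via PuppetDB.
class monitoring::client {
  # '@@' marks the resource as exported: it is stored in PuppetDB
  # instead of being applied on this node.
  @@file { "/etc/check_mk/conf.d/hosts/${::fqdn}.mk":
    ensure  => file,
    content => "all_hosts += [ '${::fqdn}' ]\n",
    tag     => 'check_mk_host',
  }
}

class monitoring::server {
  # Collect every exported host entry, and re-run the check_mk
  # inventory whenever a new one appears.
  File <<| tag == 'check_mk_host' |>> {
    notify => Exec['check_mk-inventory'],
  }

  exec { 'check_mk-inventory':
    command     => '/usr/bin/check_mk -I && /usr/bin/check_mk -O',
    refreshonly => true,
  }
}
```

Once every node includes monitoring::client, a new machine shows up in check_mk after its first Puppet run, with no manual registration.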
EXTRA 2: PERFORMANCE ISSUES
- Master load: the client splay option helped
- Graph analysis (using Gephi) also helped us limit dependencies and eradicate useless N-to-M dependencies
  - This is a "simple" WN graph...
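The splay option adds a random delay before each agent run, so that several hundred agents do not all hit the master at the same moment. A sketch of the agent-side settings, with illustrative values:

```ini
# /etc/puppet/puppet.conf (agent side) -- illustrative values
[agent]
  runinterval = 1800   # check in every 30 minutes
  splay       = true   # add a random delay before each run
  splaylimit  = 900    # cap that delay at 15 minutes
```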