Summary of the HEPiX Spring 2014 Meeting
Arne Wiebalck, Ben Jones, Vincent Brillault
CERN IT Department, CH-1211 Genève 23, Switzerland – www.cern.ch/it
CERN ITTF, June 06, 2014
Wiebalck, Jones, Brillault: Summary of the HEPiX Spring 2014 Meeting

HEPiX – www.hepix.org
– Global organization of service managers and support staff providing computing facilities for the HEP community
– Participating sites include BNL, CERN, DESY, FNAL, IN2P3, INFN, NIKHEF, RAL, TRIUMF, …
– Meetings are held twice per year: Spring in Europe, Autumn in the U.S./Asia
– Reports on status and recent work, work in progress & future plans – usually no showing off, but an honest exchange of experiences
Outline
– HEPiX News & 2014 Spring Meeting
– Site reports
– Storage & File Systems
– End User Services
– Basic IT Services
– Computing & Batch Systems
– IT Facilities
– Clouds & Virtualisation
– Networking & Security
Presenters: Arne, Ben, Vincent
HEPiX News
– New (American) Co-Chair: Sandy Philpott reaches the end of her mandate; a call for nominations will be sent out in June, and the election will take place at the autumn HEPiX
– New HEPiX website: the first version is planned to go live end of June, with migration of content in the coming months
Next HEPiX Meetings
– Autumn 2014: University of Nebraska, Lincoln (NE), U.S. – Oct 13–17, 2014
– Spring 2015: Oxford, U.K. – Mar 23–27, 2015
– Autumn 2015: DESY Zeuthen (?)
HEPiX Spring 2014
– May 19–23 at LAPP, Annecy-le-Vieux: very well organized, with a rich program; network access via eduroam (as in Bologna and Ann Arbor)
– 105 registered participants: Europe 77 (France 26), U.S./Canada 10, Asia 6 (CERN 15); many first-timers; 12 participants from 6 companies
– 71 presentations, 26 hours of talks, and many offline discussions
– Sponsors: WD, DDN, and Univa
Updates from the WGs (1)
– IPv6: no formal report, but some IPv6-related talks in the networking track; sites are encouraged to attend the pre-WLCG in June
– Configuration management: report by Ben; the objectives are to share information & experiences, not to replace the Puppet community; YAIM is a concern (Puppet wrappers)
– Benchmarking: a new SPEC CPU suite is expected by the end of the year; work on identifying experiment code against which to compare benchmark candidates starts now
Updates from the WGs (2)
– Energy efficiency: no enthusiastic response from the community; stopped for now
– Batch systems: no formal report; the survey data and the findings from the pre-GDB will be put on the website
– Bit preservation: report from German; focus on long-term costing evaluations
Site reports (1)
– Configuration management: many sites have moved to Puppet or are moving; few sites are left that use Quattor (RAL, INFN) or other systems
– Batch system reviews ongoing: discussions about “suitability”; HTCondor takes the lead among non-proprietary solutions
– Cloud storage: Ceph is still a hot topic, evaluated at several sites – RAL (see later), BNL, ASGC, CNAF, AGLT2, …; “Dropbox”-like services enter production: KIT (bwSync, based on PowerFolder, for 450,000 users) and DESY (see later)
Site reports (2)
– Hardware: 4 GB RAM per core; 10 GbE; should we share procurement experiences?
– Windows: XP still runs at many sites, but is usually confined to VLANs or blocked by the firewall, i.e. without internet access or mail; active migration to Windows 7 and 8 (Windows 8 is already the default at various sites)
– Monitoring: Logstash, Elasticsearch, and Kibana are becoming more widespread
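The Logstash/Elasticsearch/Kibana stack mentioned above is typically wired together with a small pipeline configuration along these lines (an illustrative sketch, not taken from any site's setup; the log path, grok pattern, and index name are invented, and the exact elasticsearch output options vary between Logstash versions):

```conf
input {
  file {
    path => "/var/log/batch/*.log"   # hypothetical log location
    type => "batch"
  }
}
filter {
  grok {
    # split each line into timestamp, level, and free-text message
    match => { "message" => "%{TIMESTAMP_ISO8601:ts} %{LOGLEVEL:level} %{GREEDYDATA:msg}" }
  }
}
output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "site-logs-%{+YYYY.MM.dd}"   # daily indices, browsable in Kibana
  }
}
```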
Site reports (3)
– “Cooling problem” at INFN: a failure of the cooling system took out their computer centre for one week (the Tier-1 was back after 36 hours) – and it happened on a Sunday at 1am …
Site reports (4)
– FNAL: PDU plug incident; an alarm was triggered, but the firemen could not locate the problem due to the rapid air exchange; thermographic imaging is now used for early in-situ detection
Storage
– Ceph @ RAL: cloud backend (OpenNebula) of ~1 PB currently being installed; CASTOR replacement of ~1.8 PB accessed from batch via CephFS; a special ATLAS PanDA queue for the ARC CE with CephFS; works well so far
– Cloud storage @ DESY: the motivation is data locality; requested features: big, fast, reliable, ACLs, support on all clients, web access, …; DESY’s approach is ownCloud on top of dCache; no quotas, “unlimited” space (~5,000 users); billing?
End User Services (1)
Session on “Future OS for HEP”
– Alan Silverman on the creation of Scientific Linux: 10 years ago there were several Unix flavors and “Linux?”; in 2003 Red Hat went commercial and the binary distribution needed a license; various labs negotiated good deals with Red Hat; FNAL and CERN tried to do so too, but it didn’t work out; in spring 2004 FNAL and CERN started to rebuild from source: Scientific Linux; since then, SL has been a major community driving force in HEP
– Karanbir Singh on CentOS: started for similar reasons and around the same time as SL; a stable platform to solve everyday problems (unlike Fedora); compatible with RHEL; in January 2014 CentOS and Red Hat “joined forces”: RH became the main sponsor, pays the main developers (open source unit), and owns the CentOS trademarks; Special Interest Groups
End User Services (2)
– FNAL and CERN presented their points of view on what “SL?7” should be: Should we continue to build a separate distribution? Should we adopt CentOS and become part of a bigger community? Is CentOS already ready for this? Does it need to be? Do we lose the “community factor” of SL? What if Red Hat changes its plans for CentOS? Aren’t we already in a situation where we don’t have a single OS?
– Summary and conclusion: CentOS is still too much in flux to take a firm decision now; the HEP community should try to influence the CentOS discussion (request for a “Scientific SIG”); FNAL and CERN will observe the situation and stay in close contact; both teams will take their own (not final) decision on how to build “SL7”; the situation will be re-assessed at the next HEPiX; both teams acknowledge the preference for a common solution
Basic IT Services
– Ten talks: Quattor & Puppet, data transformations with Lavoisier, cluster consolidation at NERSC, Agile Infrastructure updates
– Quattor: positive and not-so-positive updates
– RAL described positive developments in the Quattor community: adopting Aquilon to replace SCDB; development moved to GitHub, improving collaboration; a release manager & regular updates; YUM!; ncm-puppet
Basic IT Services (2)
– IRFU – migration from Quattor to Puppet: highlighted the lack of documentation and debugging tools in Quattor; chose Puppet due to its large community and developer base, but also due to CERN; the migration took 2 years, but they are now fully on Puppet, EMI-3, and SLC6
– Secrets: DESY presented results of an initial POC, which is slightly complicated due to key management
– Lavoisier: a tool to describe data transformations between different formats
– Cluster consolidation at NERSC: improving a supercomputer cluster by managing heterogeneous nodes using the OS, xCAT & CFEngine
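As a flavour of what a migration like IRFU's ends up with, a minimal Puppet manifest for a worker node might look like this (purely illustrative – the class name, package, and file paths are invented, not IRFU's actual modules):

```puppet
# Hypothetical worker-node profile: package, config file, service.
class profile::worker {
  package { 'htcondor':
    ensure => installed,
  }

  file { '/etc/condor/condor_config.local':
    ensure  => file,
    source  => 'puppet:///modules/profile/condor_config.local',
    require => Package['htcondor'],
    notify  => Service['condor'],   # restart the daemon on config change
  }

  service { 'condor':
    ensure => running,
    enable => true,
  }
}
```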
Basic IT Services (3)
– AI updates: all three tracks reported progress; addition of a customer-oriented talk by a service manager (Jérôme for Batch)
– Takeaways: more sites are migrating to Puppet; the sites still using Quattor are enjoying some improvements, with more nimble development & the adoption of Aquilon; YUM for all; the NERSC cluster consolidation shows that the choice of configuration management matters less than just doing it properly
Computing & Batch Systems
– Ten talks covering: batch systems, with specific talks on HTCondor and Univa GE; multi-core scheduling and Linux control groups; CPU/system evaluations and comparisons; benchmarking
– HTCondor – three talks from very different stages of adoption: CERN reported on HTCondor scalability testing with a view to adoption; RAL described experiences after running HTCondor for 1 year – they were using Torque/Maui, but the installation had become brittle; they rejected LSF/Univa GE in favour of an OSS solution; SLURM had scalability issues above 6k job slots, worse with plugins
Computing & Batch (2)
– RAL’s experience with HTCondor has been very smooth: no issues scaling to 14k cores; very configurable; used with the ARC CE, CREAM will be decommissioned; lots of other UK sites are moving to HTCondor based on this experience
– Nebraska presented a much longer experience with HTCondor: the “Swiss army knife” of high-throughput computing, batch being just one facet; lots of useful best practices for monitoring, security, accounting, and extensibility
– Univa Grid Engine is used by IN2P3 as a migration path from Oracle: they originally tested Sun GE before the Oracle acquisition; Oracle support was very disappointing; the migration to Univa felt like a version update rather than a new product
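For sites weighing a move like RAL's, the basic unit of HTCondor is the job description file passed to condor_submit; a minimal one looks roughly like this (illustrative – the executable name and resource figures are made up):

```conf
# sketch.sub – hypothetical single-core job description
universe       = vanilla
executable     = run_analysis.sh
arguments      = input.dat
request_cpus   = 1
request_memory = 2048   # MB
output         = job.out
error          = job.err
log            = job.log
queue 1
```

Submitted with `condor_submit sketch.sub`; the `request_*` lines are what the negotiator matches against slot resources.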
Computing & Batch (3)
– Univa themselves presented on Linux cgroups: an interesting alternative to virtualization for compartmentalization; safer job suspension & job reaping, CPU isolation, memory limits
– Multi-core is a hot topic: the batch system review from the GDB highlighted its increasing importance to WLCG; it is desirable to use the same resources for multi-core & single-core jobs; CC-IN2P3 reported that mixing is not currently possible with Univa GE; the experiment perspective on multi-core was provided by LHCb – some difficulties in understanding whether the VO or the site is responsible for resolving issues
– Benchmarking: HS06 is coming to the end of its useful life; the WG is looking at a change, with “HS16” as the working title of the new spec; some discussion as to whether it should be free
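The cgroup primitives Univa described (memory caps, CPU isolation, clean reaping) can also be expressed declaratively on systemd-based hosts; a hypothetical slice unit capping one batch job might look like this (unit name and limits invented, shown only to illustrate what cgroups offer – directive names are those of systemd of that era):

```conf
# /etc/systemd/system/batchjob.slice  (hypothetical)
[Unit]
Description=Resource envelope for one batch job

[Slice]
MemoryLimit=2G    # hard memory cap, backed by memory.limit_in_bytes
CPUShares=1024    # relative CPU weight, backed by cpu.shares
```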
Computing & Batch (4)
– Ivy Bridge vs. Opteron: per core, Ivy Bridge scores the same in HS06 as Opteron; per system, the Intel boxes have a clear advantage; the Opterons scale better for multi-job throughput, possibly due to SMT; the Opteron purchase price per HS06 is lower
– Avoton (Atom system-on-chip): aimed at the micro-server & cloud-storage market; very power efficient; CNAF tested a half-pizza-box machine & HP “Moonshot” (4.3U, 42 server cartridges); a factor of three less per core on HS06 compared with a modern Xeon, but a factor-of-three advantage in HS06 per watt; this would mean many more nodes for equivalent workloads
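The trade-off in the Avoton numbers can be made concrete with a little arithmetic (the absolute HS06 and wattage figures below are invented for illustration; only the factor-of-three ratios come from the talk):

```python
# Hypothetical per-core figures: Xeon ~3x faster per core,
# Avoton ~3x better in HS06 per watt (ratios from the talk, absolutes invented).
xeon_hs06_per_core, xeon_watt_per_core = 12.0, 6.0
avoton_hs06_per_core, avoton_watt_per_core = 4.0, 0.67

target_hs06 = 12000.0  # capacity we want to deploy

xeon_cores = target_hs06 / xeon_hs06_per_core      # cores needed with Xeon
avoton_cores = target_hs06 / avoton_hs06_per_core  # 3x more cores/nodes with Avoton

xeon_power = xeon_cores * xeon_watt_per_core       # total W with Xeon
avoton_power = avoton_cores * avoton_watt_per_core # ~1/3 the power with Avoton

print(f"Xeon:   {xeon_cores:.0f} cores, {xeon_power:.0f} W")
print(f"Avoton: {avoton_cores:.0f} cores, {avoton_power:.0f} W")
```

Same HS06 capacity, three times the node count, a third of the power – exactly the "many more nodes" caveat above.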
Networking & Security (1)
– IPv6:
– FZU (Prague): dual-stack DPM (head node, disk nodes, WNs); WAN IPv6 not really used; some migration difficulties – helpful insights for other deployments; partial monitoring using Nagios (example: the dynamic default route)
– QMUL (UK): dual stack OK, but: small-MTU issues due to blocked ICMPv6; IPv6 over software routing is slow; a one-way wrong-route IPv6 issue was fixed
– UK: deployment is not uniform; lots of sites are involved in various tests
– Pre-GDB on IPv6 next week
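Dual-stack services like the DPM above must handle both address families in application code; a small sketch of family-agnostic address handling with Python's stdlib (the addresses are documentation-range placeholders):

```python
import ipaddress
import socket

def split_by_family(addrs):
    """Partition literal IP addresses into IPv4 and IPv6 lists."""
    v4, v6 = [], []
    for a in addrs:
        ip = ipaddress.ip_address(a)
        (v6 if ip.version == 6 else v4).append(a)
    return v4, v6

# A dual-stack node typically publishes both kinds of records:
addrs = ["192.0.2.10", "2001:db8::10"]
v4, v6 = split_by_family(addrs)
print("IPv4:", v4, "IPv6:", v6)

# Preferring IPv6 when connecting is just sorting getaddrinfo() results:
def prefer_v6(host, port):
    infos = socket.getaddrinfo(host, port, proto=socket.IPPROTO_TCP)
    return sorted(infos, key=lambda i: i[0] != socket.AF_INET6)
```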
Networking & Security (2)
– perfSONAR:
– The deployment task force ended in April: only 8 sites are missing, but 64 use an outdated version, and too much data is still missing (configuration/firewall issues)
– A new task force will fix the deployment issues and improve the metrics and their usage
– Measuring WLCG data streams at batch-job level: monitoring network traffic in userland, by process/job; overhead of 5% CPU on one core; first results were shown
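Per-job traffic accounting of the kind described can start from the kernel's own counters; a sketch that parses /proc/net/dev-style text into byte counters (the parser follows the real /proc/net/dev layout; the sample line itself is made up):

```python
def parse_net_dev(text):
    """Parse /proc/net/dev content into {iface: (rx_bytes, tx_bytes)}."""
    counters = {}
    for line in text.splitlines()[2:]:  # first two lines are column headers
        iface, data = line.split(":", 1)
        fields = data.split()
        # rx_bytes is field 0, tx_bytes is field 8 in /proc/net/dev
        counters[iface.strip()] = (int(fields[0]), int(fields[8]))
    return counters

sample = """Inter-|   Receive                                                |  Transmit
 face |bytes    packets errs drop fifo frame compressed multicast|bytes    packets errs drop fifo colls carrier compressed
  eth0: 123456789  54321    0    0    0     0          0         0  987654321  12345    0    0    0     0       0          0
"""
print(parse_net_dev(sample))
```

On a live node one would read `open("/proc/net/dev").read()` (or the per-process view under `/proc/<pid>/net/dev` for network namespaces) and diff successive samples to get rates.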
Networking & Security (3)
– New Windows security at CEA and IRFU, by compliance level:
– Non-compliant computers: no internet, dedicated VLAN
– Standard PC: Chrome & IE, banned software removed
– Management: no internet, no mail except webmail, no remote access
– Management+: no mail at all, special admin accounts
– Management++ (AD root): special room
– Emergency suspension list: an efficient suspension/unsuspension mechanism for the Grid; hierarchical infrastructure: CERN – NGIs – sites; monitoring of the NGIs only, by EGI
– Security update: crypto-currency mining; SSL/X.509 nightmares; Windigo
IT Facilities (1)
– Open Compute @ CERN (ITTF 2014/03/14): interesting results from tests, but procurement limitations
– New data centre in Orsay: 1.5 MW in 220 m²; long delays due to budget issues (project reported); high energy efficiency required – a long-term investment; French electrical regulation (16 A, 30 mA differential) required a custom PDU; passive racks ran hot – mixing active/passive racks gives better air flow
– Wigner: ¾ of the site is for Tier-0 capacity; the first room was operated while the site was still under construction; only minor issues so far – the specifications were not detailed enough
IT Facilities (2)
– INFN cooling issue:
– Incident: one chiller burning; due to a shared power supply for the control logic, the safety system killed 5 of the 6 chillers; the overheating computer centre triggered an emergency shutdown: a power cut!
– Damage: lost BIOS configurations (dead batteries) and IPMI configurations; broken: 30% of IPMI interfaces, 1% of PCI cards (mostly network cards)
– They are designing a new emergency shutdown procedure – but how to test it?
– Business continuity at DESY: ongoing effort towards ISO 27001 – they should be conformant, but it needs writing up; computer-centre climate & power are managed by 2 other departments; the focus is on optimizing the incident reaction rather than finding single points of failure; incident: one power line was cut and the other line overheated – saved by the batteries
Virtualization (1)
– Big-data transfers: the LHC is only one of the big data sources, and not the biggest; TCP/IP tests are planned with various configurations & infrastructures
– FermiCloud on-demand services: address concurrent peak utilization; code deployment via CVMFS; using GlideinWMS, pushing to different clouds; various issues, including a limit of 100 VMs due to the Squid setup at FNAL; AWS price: $125 for 1,088 jobs (1k CPU-hours), network included
– Experiences with the vacuum model: the idea is that the host creates VMs by itself, and the pilot job handles the rest; flavor of a new VM: targeted shares vs. exchanges between hosts; VM vs. batch efficiency: 99.37% (sigma: 0.57%)
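A quoted figure like "99.37% (sigma: 0.57%)" is simply the mean and standard deviation over per-VM CPU efficiencies; for illustration (the sample efficiencies below are invented, not the measured data):

```python
import statistics

# Hypothetical per-VM efficiencies: CPU time / wall-clock time, as percentages.
efficiencies = [99.8, 99.1, 98.9, 99.6, 99.4, 99.9, 98.8, 99.5]

mean = statistics.fmean(efficiencies)
sigma = statistics.stdev(efficiencies)  # sample standard deviation

print(f"efficiency: {mean:.2f}% (sigma: {sigma:.2f}%)")
```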
Virtualization (2)
– Virtualization @ RAL Tier-1: most services are virtualized using Hyper-V; live-migration issues are not clearly understood; a cloud prototype (for federated clouds); using HTCondor power management for dynamic worker nodes
– Helix Nebula update: successful test deployments over multiple suppliers, using the ATLAS PanDA framework + SlipStream, for high-CPU jobs with low I/O; issues: supplier heterogeneity (e.g. VM deployment takes 5–25 min) and costs that are high or undefined