Wiebalck, Bonfillou, Peon: HEPiX Fall 2015 Summary

2 HEPiX Fall Meeting 2015
Brookhaven National Laboratory, Upton, NY, US
Arne Wiebalck, Eric Bonfillou, Alberto Rodriguez Peon

3 Outline
2015 Spring Meeting & General HEPiX News
Site Reports (21)
Grids, Clouds, and Virtualization (14)
End User Services & Operating Systems (5)
Storage and File Systems (10)
Basic IT Services (10)
Computing and Batch (8)
IT Facilities (6)
Security and Networking (8)
Closing remarks
Presenters: Arne, Alberto, Eric

4 HEPiX
Global organization of service managers and support staff providing computing facilities for the HEP community
Participating sites include BNL, CERN, DESY, FNAL, IN2P3, INFN, KEK, LAL, LBNL, NERSC, NIKHEF, RAL, TRIUMF, …
Working groups tackle specific/current issues and topics: (Storage), (Virtualization), Configuration Management, IPv6, Benchmarking
Meetings are held twice per year: Spring in Europe, Autumn in the U.S./Asia
Reports on status, recent work, work in progress & future plans; usually no showing off, an honest exchange of experiences

5 HEPiX Fall 2015
Oct 12-16, 2015 at Brookhaven National Laboratory, Upton, NY, US
Combined with the GDB
110 registered participants (high!), many first-timers again
45% from Europe, 35% from North America, 10% Asia/Pacific, 10% from companies; 32 different affiliations
82 contributions, 1630 minutes

6 HEPiX Working Groups
Benchmarking: experiments have started to create their own benchmark suites in addition to HS06 ("LHCb fast", ROOT marks, ATLAS validation kit, …)
Four areas for further work (from the MB):
- CPU power as seen by the job slot
- whole-server benchmarking
- accounting
- storage/transport of benchmarking information (machine/job features)
Group being formed now; HEPiX experts will collaborate on the first two items

7 Site Reports (1)
HTCondor is the dominant batch system: 80% of the site reports mentioned HTCondor, with a mixture of CEs (HTCondor-CE, ARC, CREAM); LSF and UGE much less visible than at previous meetings
Requests for "large memory" jobs (4-6 GB): a small fraction of total jobs so far, but a tendency towards an increased RAM-to-core ratio
Integration of diverse compute resources: cover peak loads, use cheap opportunistic resources, incorporate HPC centers, federated clouds
FNAL: HEPCloud; BNL/AUS: AWS; CNAF: a bank (!)

8 Site Reports (2)
Sync services becoming more popular: bwSync (KIT), "DESYbox", IHEPbox, CERNbox
ZFS in production at various sites; its features are wanted so much that some sites even consider FreeBSD (a sketch of those features follows below)
Puppet is the dominant configuration management system; the Quattor flag is still held up by some (very few) sites; Ansible gaining popularity (in production at CSC, NERSC)
SL vs. CentOS: not a hot topic
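The appeal is easiest to see from the command line. A minimal sketch, assuming a hypothetical pool name and device names:

    # Double-parity pool with transparent compression and instant snapshots
    zpool create tank raidz2 sdb sdc sdd sde
    zfs set compression=lz4 tank
    zfs snapshot tank@before-upgrade   # cheap, instant point-in-time copy
    zpool scrub tank                   # verify all checksums end-to-end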

9 Site Reports (3)
Continuing trend: enabling HEP labs to support other sciences. SLAC reported this time:
- provide small to mid-range HPC solutions, incl. storage
- provide assistance in developing computing models
- pay-as-you-use (storage), LSF analytics for accounting/dashboards
- considering a lease model for the hardware
New cooling system evaluated at PIC (see slide 27)

10 Site Reports (4)

11 Virtualization (1)
OpenStack seems to be becoming the de-facto standard for managing private clouds; OpenNebula mentioned as well
Interest in commercial cloud offerings, with several open questions:
- how to spot and deal with performance variability & inhomogeneity?
- how to assess presumed vs. perceived performance?
- how to procure commercial cloud resources?
Presentation by D. Giordano et al.: ran several procurements, developed a benchmark suite

12 Virtualization (2)
VM sizes (cores), before and after tuning:

    VM size (cores)   Before   After
    4x 8              7.8%     3.3%   (batch WN)
    2x 16             16%      4.6%   (batch WN)
    1x 24             20%      5.0%   (batch WN)
    1x 32             20.4%    3-6%   (SLC6 … WN)

Performance: optimization dependencies on NUMA, pinning, huge pages, EPT
Pre-deployment testing: small issues can have major impact
Performance monitoring: continuous benchmarks needed to detect performance changes
Containers being evaluated for various use cases (a usage sketch follows below):
- s/w development for various Linux flavors (BNL)
- compute: short-lived single applications, very low performance overhead (BNL); HTCondor support to come
- service migration from VMs to containers & Mesos (RAL)
- even on a Cray! (NERSC)
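For the software-development use case, a minimal sketch (image tags and build command are illustrative, not BNL's actual setup):

    # Build and test one source tree against several Linux flavors
    docker run --rm -v "$PWD":/src -w /src centos:6 make test
    docker run --rm -v "$PWD":/src -w /src centos:7 make test
    docker run --rm -v "$PWD":/src -w /src debian:8 make test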

13 End-User Services & OS
SL update:
- SL5.11 is the last 5.x version (EOL: March 2017)
- SL6.7 released in Aug 2015 (driver updates)
- SL7.1 released in Apr 2015 (OverlayFS in technical preview)
- RHEL7.2 in beta
- WIP: SL Docker images
Remaining talks from CERN:
- … and CC7 update
- self-service kiosk for Macs
- collaboration services stack

14 Outline
2015 Spring Meeting & General HEPiX News
Site Reports (21)
Grids, Clouds, and Virtualization (14)
End User Services & Operating Systems (5)
Storage and File Systems (10)
Basic IT Services (10)
Computing and Batch (8)
IT Facilities (6)
Security and Networking (8)
Closing remarks
Presenters: Arne, Alberto, Eric

15 Storage & File Systems
10 talks, 2 from CERN:
- Alberto Pace: future home directory at CERN
- Alberto Peón: CvmFS deployment status and trends
Ceph usage increasing across sites:
- RACF: two Ceph clusters that sum up to 1 PB of data; the main user of the deployment is the ATLAS Event Service
- RAL: 5,220 TB storage cluster; erasure coding for a more efficient use of disk space, but at the cost of CPU time and extra concurrency (a sketch of such a pool follows below)
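A minimal sketch of creating an erasure-coded Ceph pool; the profile name, k/m split and placement-group count are assumptions, not RAL's actual values:

    # k data chunks + m coding chunks instead of full replicas
    ceph osd erasure-code-profile set ecprofile k=8 m=3
    ceph osd pool create ecpool 4096 4096 erasure ecprofile

Each object is split into 8 data chunks plus 3 coding chunks, surviving any 3 OSD failures at ~1.4x space overhead instead of the 3x of replication, at the cost of CPU for encoding.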

16 Storage & File Systems (2)
Large dCache deployments at a few US sites:
- Fermilab: different technologies per use case (BlueArc/Lustre/EOS), but looking to consolidate into dCache on top of tape
- BNL: also using tape as backend; part of FAX (Federated ATLAS storage using XRootD); using 10.7 PB of data out of 14.2 PB available (a federated read is sketched below)
Two presentations from sponsor companies:
- DDN Storage: reducing file system latency through parallelism; IME parallel file system
- Western Digital: machine-learning techniques to improve disk performance
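The point of FAX is that any file can be read by logical name through a redirector, wherever its replica lives. A hedged one-liner (redirector host and file name are hypothetical):

    xrdcp root://fax-redirector.example.org//atlas/rucio/user/jdoe/somefile.root /tmp/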

17 Basic IT Services
10 talks, 2 from CERN:
- Miguel: CERN monitoring status update
- Alberto: update on configuration management at CERN
ELK (Elasticsearch, Logstash, Kibana) being consolidated as the monitoring infrastructure at most sites, with some variations (Logstash/Flume, Kibana/Grafana); a minimal pipeline is sketched below
RAL considering InfluxDB + Grafana as an alternative to Ganglia
NERSC using RabbitMQ for data transport and collectd for statistics collection
Real-time stream processing gaining traction at CERN and LBNL
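A minimal Logstash pipeline sketch of the ELK pattern (log path and index name are assumptions):

    input  { file { path => "/var/log/messages" type => "syslog" } }
    filter { grok { match => { "message" => "%{SYSLOGLINE}" } } }
    output { elasticsearch { hosts => ["localhost:9200"] index => "syslog-%{+YYYY.MM.dd}" } }

Kibana (or Grafana, in the variations above) then queries the Elasticsearch indices for dashboards.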

18 Basic IT Services (2)
Puppet deployment at KIT:
- Puppet environments based on git branches, generated by GitLab CI
- Foreman for provisioning and host discovery
- standard eyaml for storing secrets (a usage sketch follows below)
A few sites still on the Quattor side:
- active community (RAL, LAL, Brussels)
- adopting Aquilon as replacement for (S)CDB as the data store
Foreman and GitLab presentations by BNL: a general description of what the software does and how they use it
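How hiera-eyaml keeps secrets in version control, as a minimal sketch (the key name and value are hypothetical):

    # Encrypt a value against the site's public key...
    eyaml encrypt -l 'db_password' -s 'S3cret!'
    # ...then paste the printed block into hiera data, e.g. common.eyaml:
    #   db_password: ENC[PKCS7,MIIBe...]

Puppet resolves the value like any other hiera lookup; only hosts holding the private key can decrypt it.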

19 Computing & Batch
8 talks, 2 from CERN:
- Helge: update on benchmarking
- Jerome: future of batch processing at CERN
HTCondor is the most popular batch system; most sites use (or plan to use) HTCondor
The HTCondor team presented some of the new features in version 8.4, including:
- support for dual-stack IPv4/IPv6
- improved scalability and stability
- Docker support
BNL presented their strategy to accommodate multi-core jobs using partitionable slots (a configuration sketch follows below)
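A hedged sketch of a partitionable-slot configuration, not necessarily BNL's exact settings: each machine advertises one big slot that is carved up dynamically as single- and multi-core jobs arrive.

    # condor_config on the worker node
    NUM_SLOTS_TYPE_1 = 1
    SLOT_TYPE_1 = cpus=100%,mem=100%,disk=100%
    SLOT_TYPE_1_PARTITIONABLE = TRUE

    # a multi-core job then just states its needs in the submit file:
    #   request_cpus   = 8
    #   request_memory = 16000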

20 Computing & Batch (2)
LSF and Univa Grid Engine also present at a few sites; IN2P3 very happy with the support provided by Univa, although considering what to do after the contract expires
Discussion on CPU benchmarking:
- HS06 is well established but considered insufficient in some areas; it does not measure performance on multi-core processors well
- proposal for a fast benchmark that can measure the performance of a job slot in minutes (an illustrative sketch follows below)
- the HEPiX benchmarking WG is being re-formed to address these issues
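Purely illustrative of the idea, not the benchmark the WG defined: time a small, fixed CPU-bound workload inside the job slot and turn it into a throughput score.

    # fast_slot_probe.py: minutes-scale probe of the slot actually assigned to a job
    import time

    def fast_slot_score(iterations=5_000_000, repeats=3):
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            x = 0.0
            for i in range(1, iterations):
                x += i ** 0.5            # fixed floating-point workload
            best = min(best, time.perf_counter() - start)
        return iterations / best         # higher = faster slot

    if __name__ == "__main__":
        print(f"slot score: {fast_slot_score():.0f} ops/s")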

21 Computing & Batch (3)
Evaluation of NVMe drives by BNL:
- NVMe eliminates the latency and bandwidth limitations imposed by SAS/SATA controllers
- benchmarks show ~100% more performance compared with SSDs
- still an expensive technology, but prices are expected to come down over time
A typical measurement of this kind is sketched below.
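A hedged example of the kind of fio run used for such comparisons (device path and parameters are assumptions; stick to read tests unless the drive is disposable):

    fio --name=randread --filename=/dev/nvme0n1 --direct=1 \
        --ioengine=libaio --rw=randread --bs=4k --iodepth=32 \
        --numjobs=4 --runtime=60 --time_based --group_reporting

Running the same job against a SAS/SATA SSD makes the controller overhead directly visible in the IOPS and latency percentiles.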

22 Outline
2015 Spring Meeting & General HEPiX News
Site Reports (21)
Grids, Clouds, and Virtualization (14)
End User Services & Operating Systems (5)
Storage and File Systems (10)
Basic IT Services (10)
Computing and Batch (8)
IT Facilities (6)
Security and Networking (8)
Closing remarks
Presenters: Arne, Alberto, Eric

23 IT Facilities & Business Continuity (1)
NERSC Computational Research and Theory Facility (CRT) – Elizabeth Bautista
- four-story building, 140k square feet, cost: 143 MUSD
- entirely based on free cooling (about 23°C all year); server-room heat is used to warm up the offices
- 27 MW of redundant power dedicated to the building; 42 MW non-redundant available in addition
- PUE <= 1.1
- seismic floor in the computer room; the design was successfully tested

24 IT Facilities & Business Continuity (2)
The HP IT Data Centre Transformation Journey – Dave Rotheroe
- data centers exist because of a concrete business case
- HP reduced the number of its data centers from 85 (worldwide, 2005) to 4 (US, 2015)
- vendors offer ever more powerful servers and larger storage solutions, but most companies still run a mix of ancient and newest technologies
- renewing legacy infrastructure does not always make sense, depending on the type of business
- HP predicts that much of what runs in enterprise DCs today will run in a cloud tomorrow

25 IT Facilities & Business Continuity (3)
Proposal for a new Data Centre at BNL – Imran Latif
Current data center, located on the 1st floor of a 1960s building:
- 22k square feet gross area, 24M kWh/year usage, 1 MW UPS + 2.2 MW diesel backup power
- 1.2 kTon cooling via chilled water; 80% of the cooling capacity unavailable while on backup power
- limiting layout and cooling deficiencies, power system deficiencies, inadequate and limited physical space
Proposal to renovate the light-source building for completion in 2020:
- new cooling infrastructure, with maximized use of free cooling
- new electrical infrastructure
- architectural modifications and life safety

26 IT Facilities & Business Continuity (4)
Asset management in CERN data centres – Eric Bonfillou
- CERN has used Infor EAM site-wide since 1997 for this purpose
- only tangible assets, characterized by procurement, financial and technical attributes, are tracked
- the data structure is mostly a tree whose top assets are the racks in the data centers; sites (Meyrin, Wigner), buildings and rooms are only "containers"
- 20k assets totaling a purchasing value of 43 MCHF are now recorded in Infor EAM, but performing the inventory was a really tedious task!
- data in Infor EAM is used for multiple purposes (PPE project, AIRU, power monitoring, …)
- usage will be extended in 2016 to cover spare-parts stock management and integration with the CERN Geographic Information System

27 IT Facilities & Business Continuity (5)
Energy efficiency upgrades at PIC – Jose Flix Molina
- PIC, the largest grid center in Spain, is located at the Autonomous University of Barcelona; all cooling and electrical equipment needed a revision
- introduction of free cooling: replacement of CRAHs, integration of a chilled-water backup system, separation of hot/cold aisles, etc.
- work completed in September 2014; electricity savings of ~100 kEUR/year; PUE is around 1.45, down from >1.6 in 2010 before the rework
- replacement of the UPS system and power distribution panels
- introduction of oil-immersion techniques: GRC CarnotJet system, 4x 46U tanks capable of dissipating up to 45 kW; oil temperature is 50°C max; expected PUE of 1.05 to 1.1

28 IT Facilities & Business Continuity (6)
Energy savings performance contracting – Michael Ross
- the talk covered contracts concluded by U.S. federal agencies
- the aim of these contracts is to ensure efficient use of U.S. federal data centers and reduce their energy footprint
- basic idea: infrastructure measures that reduce the energy footprint are paid for out of the energy savings achieved, over periods of up to 25 years
- once paid off, the energy savings benefit entirely the organisation running the data centre

29 Security & Networking (1)
News from the HEPiX IPv6 working group – David Kelsey
- North America ran out of IPv4 addresses on 24/09/2015; IPv6 now accounts for 9% of global traffic
- the working group has been running testbeds with GridFTP, dCache, FTS3 and XRootD; various problems are encountered every now and then
- while testing goes on, IPv6 is not the highest priority for the experiments; the aim, however, is to gradually move to dual-stack services
- most importantly, IPv4 services must continue to work, especially since a number of issues were identified in dual-stack systems
- perfSONAR is used to monitor IPv6 traffic; the aim is to obtain the same performance over IPv6 as over IPv4
- today, CERN and six Tier-1 sites are IPv6 capable
- security concerns and issues have been reviewed

30 Security & Networking (2)
Status of the IPv6 OSG software stack tests – Edgar Fajardo Hernandez
- definition of compliance: client and server are both dual-stack, and the software continues to work correctly if one side is IPv4-only (the basic property is sketched below)
- tests have been run between FNAL, Wisconsin and UCSD: 23/44 packages tagged as fully compliant, 12/44 as not compliant
- storage services are overall compliant, except for Hadoop; file-access issues with BeStMan and dCache clients
- authorisation and authentication issues with VOMS, GUMS, gLExec and edg-mkgridmap
- CEs and job submission work: HTCondor and GRAM successfully tested
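A minimal sketch of the dual-stack property in Python (the host name is hypothetical): a compliant service resolves to both an A and an AAAA record, so clients on either stack can reach it.

    import socket

    def stacks_for(host, port=443):
        infos = socket.getaddrinfo(host, port, socket.AF_UNSPEC, socket.SOCK_STREAM)
        return {family.name for family, *_ in infos}

    # a compliant endpoint should yield {'AF_INET', 'AF_INET6'}
    print(stacks_for("dual-stack-ce.example.org"))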

31 Security & Networking (3)
Network infrastructure for the CERN data centre – Eric Sallaz
- goals: support 10GbE, 40GbE and 100GbE network infrastructure; mostly optical (LC-compatible connectors) as well as copper; low-bandwidth devices still use 1GbE copper networks
- even more than before, the "inspect and test before you connect" practice is enforced: issues spotted with dirty fibers, damaged connectors, etc.
- evolving the aging (1995) copper infrastructure is based on standard methods
- for high-speed networks (>10GbE), new types of cables have to be considered: OM3 for lengths up to 100 m, OM4 for lengths up to 150 m
- with different cables come different types of connectors (MPO-12, MPO-24, etc.); coping with heterogeneity is mandatory!

32 Security & Networking (4)
WLCG Network and Transfer Metrics WG after one year – Shawn McKee
- working group created a year ago; progress achieved via monthly meetings
- use cases have been translated into measurable metrics and reported at CHEP
- perfSONAR is the main monitoring tool: it allows differentiating between network and application issues
- data is made widely available via various mechanisms, in collaboration with OSG and ESnet
- a concrete example targeting FTS transfers showed asymmetries between ATLAS and CMS
- one observation: the infrastructure is fine-tuned for long flows and big files; is this the most appropriate choice over time?
- a proposal suggests an (automated) unit to intervene on network issues; for the time being, the collaboration must continue to collect more data from the metrics

33 Security & Networking (5)
Update on the WLCG/OSG perfSONAR infrastructure – Shawn McKee
- goals: identify network issues and characterize network use
- the current perfSONAR deployment comprises 278 instances, 245 of them active
- metrics provided by perfSONAR allow faster and more precise analysis; standard metrics include packet loss, delays, bandwidth measurements, etc.
- data collected from the perfSONAR infrastructure is earmarked for long-term storage
- path analysis is supported in perfSONAR 3.5, now used by most sites
- deployment of the software still requires physical hardware, but powerful VMs are being evaluated, which would reduce the cost of the infrastructure

34 Security & Networking (6)
Using VPLS for VM mobility – Carles Kishimoto Bisbe
- the CERN data centre has racks spread over Meyrin and Wigner; the two sites are interconnected via 100GbE links; the network is routed, and no VLANs are configured
- the need to move VMs is real, mostly triggered by the decommissioning of aging hypervisors; the difficulty lies in migrating VMs transparently over the existing network infrastructure
- a solution based on VPLS (Virtual Private LAN Service) at the router level has been designed; it requires router configuration but also some additional cabling, achieved by putting loops in place on the routers so that traffic passes along the proper path
- testing is successful, and workflows are being put in place to go to production

35 Security & Networking (7)
Computer security update – Liviu Valsan
- no major evolution since HEPiX Spring 2015, except that exploit kits keep getting more advanced
- mobile devices are increasingly targeted; crypto-lockers are used with increasing frequency, generating large "profits" for attackers
- most attacks target widely deployed software such as IE and Flash plugins
- unfortunately, large phishing campaigns are still ongoing and sometimes partly successful
- software protections are not always enough to prevent issues, mostly due to delays in integrating patches against malicious code
- a thorough example was presented, covering a poorly designed commercial software product and the weak usage recommendations from the company selling it
- the usual recommendations for preventing security breaches were restated

36 Security & Networking (8)
Building a large-scale security operations center – Liviu Valsan
- design of a centralized, unified platform for ingestion, storage and analytics of multiple data sources
- scale-out integration within the CERN IT ecosystem, using bare-metal hardware but also OpenStack VMs
- several components already in successful use were presented: Bro (network analysis framework), MISP (Malware Information Sharing Platform), CIF (Collective Intelligence Framework); a tiny Bro example is sketched below
- collected data is stored and processed in Hadoop; about 500 GB/day is generated
- services are not necessarily the main targets; admins and users are
- the talk ended with the analysis of a phishing mail sent to HEPiX participants :-)
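A minimal Bro policy sketch of the kind of event such a platform ingests (the watched subnet is hypothetical):

    const watched = 192.0.2.0/24;

    event connection_established(c: connection)
        {
        if ( c$id$orig_h in watched )
            print fmt("established: %s -> %s", c$id$orig_h, c$id$resp_h);
        }

In a real SOC the print would be replaced by a notice or a log stream feeding the Hadoop store mentioned above.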

37 Outline
2015 Spring Meeting & General HEPiX News
Site Reports (21)
Grids, Clouds, and Virtualization (14)
End User Services & Operating Systems (5)
Storage and File Systems (10)
Basic IT Services (10)
Computing and Batch (8)
IT Facilities (6)
Security and Networking (8)
Closing remarks
Presenters: Arne, Alberto, Eric

38 HEPiX Board News
New website finally online (hosted at DESY)
Next meetings:
- Spring 2016: DESY Zeuthen (DE), April 18-22, 2016
- Fall 2016: LBNL, Berkeley (US), Oct 17-21 (back-to-back with CHEP, which is the week before)
- Spring 2017: firm proposal for a European meeting
- discussions about swapping the European/US location cycle and considering Asia
