Lightweight computing resources
WLCG workshop, 19th June 2017
Maarten Litmaath, CERN, v1.0

How to enable computing
Services that currently are or may be needed to enable computing at a grid site:
- Computing Element
- Batch system
- Cloud setups
- Authorization system
- Info system
- Accounting
- CVMFS
- Squid (probed together with CVMFS in the sketch below)
- Monitoring
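A minimal sketch, not from the slides, of how a worker node could verify the CVMFS and Squid items above: check that the needed repositories are mounted and that the local Squid answers HTTP requests. The repository list, proxy host and test URL are placeholders.

```python
import os
import urllib.request

# Placeholder values; a real site would take these from its configuration.
CVMFS_REPOS = ["atlas.cern.ch", "cms.cern.ch"]    # repositories the supported VOs need
SQUID_PROXY = "http://squid.example.site:3128"    # hypothetical local Squid endpoint
TEST_URL = "http://cvmfs-stratum-one.cern.ch/cvmfs/atlas.cern.ch/.cvmfspublished"  # placeholder

def cvmfs_ok(repo):
    """A repository counts as available if its /cvmfs mount point is populated."""
    path = os.path.join("/cvmfs", repo)
    return os.path.isdir(path) and bool(os.listdir(path))

def squid_ok(proxy, test_url=TEST_URL):
    """Fetch a small file through the Squid to confirm the proxy is alive."""
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({"http": proxy}))
    try:
        with opener.open(test_url, timeout=10) as resp:
            return resp.status == 200
    except OSError:
        return False

if __name__ == "__main__":
    for repo in CVMFS_REPOS:
        print(f"CVMFS {repo}: {'OK' if cvmfs_ok(repo) else 'MISSING'}")
    print(f"Squid {SQUID_PROXY}: {'OK' if squid_ok(SQUID_PROXY) else 'UNREACHABLE'}")
```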

T2 vs. T3 sites
- T3 sites have not signed the WLCG MoU
  - Typically dedicated to a single experiment → can take advantage of shortcuts
- T2 sites have rules that apply:
  - Availability / reliability targets
  - Accounting into the EGI / OSG / WLCG repository
  - EGI: presence in the info system for the Ops VO
  - Security regulations: mandatory OS and MW updates and upgrades, isolation, traceability, security tests and challenges
- Evolution is possible
  - Some rules could be adjusted
  - The infrastructure machinery can evolve

Lightweight sites – classic view
How might sites provide resources with less effort?
- Storage → see the Data Management session on Tue
- Computing → follow-up in Ops Coordination
- Here we are mostly concerned with EGI sites
  - US-ATLAS and US-CMS projects: see the CHEP talk
- Site responses to a questionnaire show the potential benefits of shared repositories:
  - OpenStack images: pre-built services, pre-configured where possible
  - Docker containers: ditto (a container-launch sketch follows this slide)
  - Puppet modules: for site-specific configuration
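As an illustration of the container route, here is a minimal sketch using the Docker SDK for Python; the image name, registry and environment variables are hypothetical stand-ins for whatever a shared repository would actually publish.

```python
import docker  # Docker SDK for Python (pip install docker)

client = docker.from_env()

# Hypothetical pre-built, pre-configured worker-node image from a shared repository.
IMAGE = "registry.example.org/wlcg/lightweight-wn:latest"

client.images.pull(IMAGE)

# Only site-specific configuration is passed in; everything else is baked into the image.
container = client.containers.run(
    IMAGE,
    detach=True,
    name="wlcg-wn-01",
    environment={
        "SITE_NAME": "EXAMPLE-SITE",                       # placeholder site name
        "SQUID_PROXY": "http://squid.example.site:3128",   # placeholder local Squid
    },
    volumes={"/cvmfs": {"bind": "/cvmfs", "mode": "ro"}},  # reuse the host's CVMFS mount
)
print(container.name, container.status)
```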

Lightweight sites – alternative view
- A CE + batch system is not strictly needed: cloud VMs or containers could be sufficient
  - They can be managed e.g. with Vac or Vcycle
  - Several GridPP sites are doing that already; all 4 experiments are covered
  - The resources are properly accounted
- They can directly receive work from an experiment's central task queue
- Or they can instead join a regional or global HTCondor pool to which an experiment submits work (see the pool-query sketch below)
  - Proof of concept used by GridPP sites for ALICE
  - Cf. the CMS global GlideinWMS pool → scalable to O(100k)
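A minimal sketch of what joining such a pool looks like from the outside, using the HTCondor Python bindings; the central manager hostname is a placeholder. Each VM or container that has joined the pool runs a startd and advertises itself to the collector:

```python
import htcondor  # HTCondor Python bindings

# Hypothetical central manager of a regional pool.
collector = htcondor.Collector("condor-pool.example.org")

# List the execute-node ads that joined VMs/containers have registered.
startd_ads = collector.query(
    htcondor.AdTypes.Startd,
    projection=["Machine", "State", "Activity", "TotalCpus", "Memory"],
)

for ad in startd_ads:
    print(ad.get("Machine"), ad.get("State"), ad.get("Activity"),
          ad.get("TotalCpus"), ad.get("Memory"))
```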

Distributed site operations model
- A site needs to provide resources at an agreed QoS level
  - HW needs to be administered by the site
  - Other admin operations could be done by a remote, possibly distributed team of experts
- Site resources within a region could be amalgamated (and hidden) by a regional virtual HTCondor batch system
  - VMs/containers of willing sites may join the pool directly
  - CEs and batch systems of other sites can be addressed through Condor-G
- The virtual site exposes an HTCondor CE interface through which customers submit jobs to the region (see the submission sketch below)
  - HTCondor then routes the jobs according to fair-share etc.
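For illustration only, a sketch of a Condor-G style submission to such an HTCondor CE with the HTCondor Python bindings; the CE hostname and payload are placeholders, and the exact grid_resource string should be checked against the HTCondor-CE documentation.

```python
import htcondor

# Hypothetical HTCondor CE endpoint of the regional virtual site.
CE_HOST = "ce.virtual-region.example.org"

# Condor-G style submission: a grid-universe job targeting a remote HTCondor CE.
sub = htcondor.Submit({
    "universe":      "grid",
    "grid_resource": f"condor {CE_HOST} {CE_HOST}:9619",
    "executable":    "payload.sh",   # placeholder payload script
    "output":        "job.out",
    "error":         "job.err",
    "log":           "job.log",
})

schedd = htcondor.Schedd()           # local schedd acting as the Condor-G client
result = schedd.submit(sub)
print(f"Submitted cluster {result.cluster()} to {CE_HOST}")
```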

Volunteer computing …
- The LHC@home project coordinates volunteer computing activities across the experiments
- ATLAS have benefited from 1-2% extra resources for simulation workloads
  - See this recent talk by David Cameron
- It could become a way for a computing-only lightweight site to provide its resources
  - The central infrastructure can scale, at least for simulation jobs
  - The resources can be properly accounted in APEL

… and lightweight sites
- Real sites can be trusted
  - No need for the volunteer CA or the data bridge
  - A separate, easier infrastructure would be set up
- BOINC might even coexist with a batch system on the same WN
- Some ideas to explore…
- Also here HTCondor is used under the hood
  - Standard for experiments and service managers

Recent ATLAS T3 stats

Recent CMS T3 stats

Volunteer potential – 8th BOINC Pentathlon 2017, SixTrack [chart]

Computing resource SLAs
- The resources themselves can also be "lightweight"
  - Please see this recent talk by Gavin McCance
- Extra computing resources could be made available at a lower QoS than usual
  - Disk server CPU cycles, spot market, HPC backfill, intervention draining, …
  - Jobs might e.g. get lower IOPS and would typically be pre-emptible
  - Machine/Job Features (MJF) functionality can help smooth their use (see the sketch below)
- They would have an SLA between those of standard and volunteer resources → a mid-SLA
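A minimal sketch of how a pre-emptible payload could use MJF, assuming the site publishes the MACHINEFEATURES directory described in the Machine/Job Features specification; the "shutdowntime" key and the 30-minute threshold are my assumptions and should be checked against the spec and the experiment framework.

```python
import os
import time

def read_feature(dir_env, key, default=None):
    """MJF publishes each feature as a small file named after the key,
    in the directory pointed to by $MACHINEFEATURES or $JOBFEATURES."""
    directory = os.environ.get(dir_env)
    if not directory:
        return default
    try:
        with open(os.path.join(directory, key)) as f:
            return f.read().strip()
    except OSError:
        return default

# "shutdowntime" (epoch seconds) announces when the machine will be drained
# or pre-empted; key name taken from the MJF specification as I recall it.
shutdown = read_feature("MACHINEFEATURES", "shutdowntime")

if shutdown is not None:
    seconds_left = int(shutdown) - int(time.time())
    if seconds_left < 1800:   # assumed threshold: 30 minutes
        print(f"Only {seconds_left}s left: checkpoint now and stop taking new payloads")
    else:
        print(f"{seconds_left}s until shutdown: keep running")
else:
    print("No shutdown announced: run normally")
```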