Exploiting the massive volunteer computing resource for HEP computation
Wenjing Wu, Computer Center, IHEP, China
ISGC 2018, 2018-03-21
Outline
- A brief timeline of the @home projects in HEP
- Different implementations
- Recent development
- Common challenges
- Summary
Volunteer Computing
- Well-known projects: CAS@home, SETI@home, LHC@home, ATLAS@home
- BOINC is the commonly used middleware
- Harnesses the idle CPU resources on personal computers (desktops, laptops, smart phones)
- Computing tasks run at a lower priority (only when CPUs are idle)
- Suitable for CPU-intensive, latency-tolerant computing
A brief timeline of VC in HEP
- 2004, LHC@home (CERN): SixTrack, accelerator simulation. Obstacles: software size, heterogeneous OS, workflow integration, security
- New technologies developed since then: VirtualBox/BOINC VM, CVMFS, other BOINC features
- 2014, ATLAS@home (IHEP/CERN): event simulation
- 2016, CMS@home (CERN): event simulation; LHCb@home
- 2017, LHC@home (CERN): consolidated all the LHC projects
Other projects under construction
- BelleII and BESIII developments have been going on since 2014
- Status: BelleII@home is a prototype; BESIII@home is in beta test
- The motive: most of the LHC projects need much more computing power than the grid sites can deliver; volunteer computing is a resource of great potential to be exploited
Common solutions
- Virtualization on non-Linux OS: VM images are built and dispatched to volunteer computers
- CVMFS for software distribution inside the VM
- A gateway service is developed to bridge the workflows
Different implementations
- One of the challenges is to integrate BOINC into a workflow designed for grid computing: the WMS (Workload Management System) and its pilot implementation
- GSI authentication required by grid services vs. the untrusted nature of volunteer computers
- Different projects have to adopt different solutions to address these issues
ATLAS@home
- The ARC CE acts as the gateway: fetches jobs from PanDA and forwards them to BOINC
- Caches input/output data, downloading from and uploading to the Grid SE
- Authentication and communication with grid services are done on the gateway, sparing the worker nodes from storing the credentials
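The gateway's job flow can be sketched as a simple loop. All names below are illustrative stand-ins, not the real ARC CE/PanDA APIs; the point is that volunteer hosts only ever talk to the gateway, which alone holds the grid credentials.

```python
from collections import deque

# Illustrative stand-ins for the real services (hypothetical data):
panda_queue = deque([{"id": 1, "input": "evgen1.root"},
                     {"id": 2, "input": "evgen2.root"}])
boinc_queue = deque()   # work units the BOINC server hands to volunteers
cache = {}              # gateway-side cache of input/output data

def fetch_and_forward():
    """Fetch jobs from the PanDA side, stage their input data into the
    gateway cache, and forward the jobs to the BOINC server."""
    while panda_queue:
        job = panda_queue.popleft()
        cache[job["input"]] = b"...staged from Grid SE..."  # download step
        boinc_queue.append(job)

def upload_result(job_id, output):
    """Volunteer output lands in the cache; the gateway, which holds the
    grid credentials, later uploads it to the Grid SE."""
    cache[f"out-{job_id}"] = output

fetch_and_forward()
upload_result(1, b"simulated events")
```

The design choice this illustrates: authentication stays on the gateway, so untrusted volunteer machines never need a grid credential.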
CMS@home
- Challenge: different workflows; CRAB3 uses "push" but BOINC uses "pull" for jobs
- Developed DataBridge as the gateway: it acts as a plugin to CRAB3, receives job descriptions from CRAB3 and stores them in a message queue
- Ceph buckets are used for input/output data, so BOINC can access them
- Data are staged in/out between the Grid SE and the Ceph buckets
- The grid credential is stored in DataBridge to interact with grid services
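The push-to-pull adaptation can be sketched with a plain in-memory queue (hypothetical function and field names; the real DataBridge additionally handles Ceph buckets and grid credentials):

```python
import queue

# CRAB3 "pushes" job descriptions; BOINC clients "pull" work.
# A message queue in the middle converts one model to the other.
job_queue = queue.Queue()

def crab3_push(job_description):
    """Plugin side: receive a job description from CRAB3 and enqueue it."""
    job_queue.put(job_description)

def boinc_pull():
    """BOINC side: a volunteer asks for work and pulls the next job,
    or gets nothing if the queue is empty."""
    try:
        return job_queue.get_nowait()
    except queue.Empty:
        return None

crab3_push({"task": "cms-sim", "events": 500})
job = boinc_pull()
```

Here the queue decouples the two services in time: CRAB3 can push whenever a task arrives, and volunteers can pull whenever they have idle CPU.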
LHCb@home
- Challenge: merging two different authentication schemes; GSI authentication is closely coupled with the payload running in the DIRAC pilot
- Developed WMSSecureGW as the gateway, which acts as:
  - intermediate DIRAC services to the volunteer computers, so that with a fake proxy a volunteer computer can request jobs and stage data from the gateway
  - a DIRAC client and pilot to the production DIRAC services, to fetch jobs and stage data
The current scale
[Figure: daily CPU time of successful jobs in LHC@home, by application (SixTrack, ATLAS, ...)]
- Includes SixTrack, ATLAS, CMS, LHCb and some other HEP applications
- Daily CPU usage reaches 25K CPU days; the average core power is 10 HS06, equivalent to a cluster of 400 kHS06 (assuming an average CPU efficiency of 60%)
- SixTrack gains the most CPU because it does not require virtualization, which makes it easy for volunteers to start with
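The cluster-equivalence figure follows directly from the numbers above:

```python
# 25K CPU days of successful jobs per day, at 10 HS06 per core,
# matched against a conventional cluster whose average CPU efficiency is 60%.
cpu_days_per_day = 25_000
hs06_per_core = 10
cluster_cpu_efficiency = 0.60

# Raw HS06 a cluster would need to deliver the same useful work:
equivalent_cluster_hs06 = cpu_days_per_day * hs06_per_core / cluster_cpu_efficiency
# ~417 kHS06, i.e. roughly the 400 kHS06 quoted on the slide
```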
ATLAS@home
[Figure: CPU time of good jobs of all ATLAS sites over a week]
- The total ATLAS CPU time per day is between 300K and 400K CPU days; ATLAS@home remains among the top 10 sites
- ATLAS@home averages 7000 CPU days per day from good jobs, 2.26% of the ATLAS total
- Average BOINC core power: 11 HS06
Recent development
- Lightweight/native: use containers instead of virtual machines
- Use BOINC to backfill busy cluster nodes: the average CPU utilization rate of clusters is between 50% and 70%
- An ATLAS grid site uses ATLAS@home to exploit extra CPU from its fully loaded cluster
Lightweight model for ATLAS@home
[Diagram: PanDA → aCT → ARC CE → BOINC server, dispatching work to three kinds of hosts]
- Volunteer host (Windows/Mac): runs the ATLAS app inside a VirtualBox VM
- Volunteer host (Linux), cloud nodes, PCs and Linux servers at grid sites: run inside a Singularity container
- Volunteer host (SLC/CentOS 6): runs the app directly
100% wall utilization != 100% CPU utilization, due to job CPU efficiency
[Figure: timeline of jobs on one worker node, 12 wall hours per job]
- With jobs 1-2 only (one job per core): 100% wall utilization, but if the job CPU efficiency is 75%, then 25% of the CPU is wasted
- With jobs 1-4 (adding two low-priority jobs): 200% wall utilization and 100% CPU utilization, with job efficiencies of 75% and 25%
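The arithmetic above can be checked with a short calculation, using the efficiency numbers from the slide:

```python
# One job per core: wall utilization is 100%, but the job only uses
# 75% of the CPU, so 25% of the CPU cycles are wasted.
cpu_eff_grid_job = 0.75
wasted = 1.0 - cpu_eff_grid_job          # 0.25

# Two jobs per core: a 75%-efficient grid job plus a 25%-efficient
# backfill job; wall utilization is 200%, but the efficiencies sum
# to full CPU utilization.
cpu_eff_backfill_job = 0.25
cpu_util = cpu_eff_grid_job + cpu_eff_backfill_job   # 1.0
```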
Put more jobs on worker nodes
- Run 2 jobs on each core: 1 grid job at normal priority (pri=20) and 1 BOINC job at the lowest priority (pri=39)
- With these static priorities, the grid job occupies the CPU for as long as it can use it; the BOINC job only gets the CPU cycles that the grid jobs leave idle
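The priority numbers above are the values reported by top, where for normal processes PR = 20 + NI (niceness). A minimal sketch of launching a backfill payload at the lowest priority follows; the payload command here is just a placeholder.

```python
import subprocess

def top_priority(nice_value: int) -> int:
    """Priority as displayed by top for a normal process: PR = 20 + NI."""
    return 20 + nice_value

# Grid job: default niceness 0  -> priority 20
# BOINC job: niceness 19 (lowest) -> priority 39
grid_pri = top_priority(0)
boinc_pri = top_priority(19)

# Launch a placeholder payload at niceness 19; it only receives CPU
# cycles that higher-priority processes leave idle.
proc = subprocess.run(["nice", "-n", "19", "sleep", "1"])
```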
Experience from ATLAS@home at the BEIJING Tier 2 site
- The grid jobs' walltime utilization is 87.8%, while the grid CPU utilization is only 65.6%
- BOINC exploits an extra 23% of CPU time, so the node CPU utilization reaches 89%
- More details from here: BEIJING_BOINC
Looking at one node:
- The overall CPU utilization is 98.44% on this node over 24 hours
Common challenges in the future
- Discontinued development of the BOINC software
- More flexible scheduling for better utilization of diverse resources (availability, power)
- Scalability issues: I/O bottlenecks; ATLAS@home has already hit this bottleneck
- Outreach: how to attract more volunteer computers
Summary
- The application of volunteer computing in HEP started over a decade ago, but it has only served the big experiments for a few years, thanks to the development of key technologies
- Each experiment needs its own implementation in order to integrate the VC resource into its existing workflow
- Volunteer computing has been providing a very considerable amount of CPU to HEP computing
- These @home projects can be used in a broad range of scenarios:
  - managing the internal computing devices of Tier 3 sites and institutes
  - backfilling clusters
Thanks!