Exploiting the massive volunteer computing resource for HEP computation
Wenjing Wu, Computer Center, IHEP, China
ISGC 2018, 2018-03-21
Outline
- A brief timeline of the @home projects in HEP
- Different implementations
- Recent development
- Common challenges
- Summary
Volunteer Computing
- Well-known projects: CAS@home, SETI@home, LHC@home, ATLAS@home
- BOINC is the commonly used middleware
- Harnesses the idle CPU resources on personal computers (desktops, laptops, smart phones)
- Computing tasks run at a lower priority (only when CPUs are idle)
- Suitable for CPU-intensive, latency-tolerant computing
A brief timeline of VC in HEP
- 2004, LHC@home (CERN): SixTrack, accelerator simulation. Obstacles: software size, heterogeneous OS, workflow integration, security
- New technologies developed since then: VirtualBox/BOINC VM, CVMFS, other BOINC features
- 2014, ATLAS@home (IHEP/CERN): event simulation
- 2016, CMS@home (CERN): event simulation; LHCb@home
- 2017, LHC@home (CERN): consolidated all the LHC projects
Other projects under construction
- BelleII and BESIII developments have been going on since 2014
- Status: BelleII@home is a prototype; BESIII@home is in beta test
- The motive: most of the LHC projects need much more computing power than the grid sites can deliver; volunteer computing is a resource of great potential to be exploited
Common solutions
- Virtualization on non-Linux OS: VM images are built and dispatched to volunteer computers
- CVMFS for software distribution inside the VM
- A gateway service is developed to bridge the workflows
Different implementations
- One of the challenges is to integrate BOINC into a workflow designed for grid computing: the WMS (Workload Management System) and its pilot implementation
- GSI authentication required by grid services vs. the untrusted nature of volunteer computers
- Different projects have to adopt different solutions to address these issues
ATLAS@home
- The ARC CE acts as the gateway: fetches jobs from PanDA and forwards them to BOINC
- Caches input/output data, downloading from and uploading to the Grid SE
- Authentication and communication with grid services are done on the gateway, sparing the worker nodes from storing the credentials
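The gateway's job flow can be sketched as a simple loop. All names below are illustrative stand-ins, not the real ARC CE/PanDA APIs; the point is that volunteer hosts only ever talk to the gateway, which alone holds the grid credentials.

```python
from collections import deque

# Illustrative stand-ins for the real services (hypothetical data):
panda_queue = deque([{"id": 1, "input": "evgen1.root"},
                     {"id": 2, "input": "evgen2.root"}])
boinc_queue = deque()   # work units the BOINC server hands to volunteers
cache = {}              # gateway-side cache of input/output data

def fetch_and_forward():
    """Fetch jobs from the PanDA side, stage their input data into the
    gateway cache, and forward the jobs to the BOINC server."""
    while panda_queue:
        job = panda_queue.popleft()
        cache[job["input"]] = b"...staged from Grid SE..."  # download step
        boinc_queue.append(job)

def upload_result(job_id, output):
    """Volunteer output lands in the cache; the gateway, which holds the
    grid credentials, later uploads it to the Grid SE."""
    cache[f"out-{job_id}"] = output

fetch_and_forward()
upload_result(1, b"simulated events")
```

The design choice this illustrates: authentication stays on the gateway, so untrusted volunteer machines never need a grid credential.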
CMS@home
- Challenge: different workflows; CRAB3 uses "push" but BOINC uses "pull" for jobs
- Developed DataBridge as the gateway: it acts as a plugin to CRAB3, receives job descriptions from CRAB3 and stores them in a message queue
- Ceph buckets are used for input/output data, so BOINC can access them
- Data are staged in/out between the Grid SE and the Ceph buckets
- The grid credential is stored in DataBridge to interact with grid services
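The push-to-pull adaptation can be sketched with a plain in-memory queue (hypothetical function and field names; the real DataBridge additionally handles Ceph buckets and grid credentials):

```python
import queue

# CRAB3 "pushes" job descriptions; BOINC clients "pull" work.
# A message queue in the middle converts one model to the other.
job_queue = queue.Queue()

def crab3_push(job_description):
    """Plugin side: receive a job description from CRAB3 and enqueue it."""
    job_queue.put(job_description)

def boinc_pull():
    """BOINC side: a volunteer asks for work and pulls the next job,
    or gets nothing if the queue is empty."""
    try:
        return job_queue.get_nowait()
    except queue.Empty:
        return None

crab3_push({"task": "cms-sim", "events": 500})
job = boinc_pull()
```

Here the queue decouples the two services in time: CRAB3 can push whenever a task arrives, and volunteers can pull whenever they have idle CPU.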
LHCb@home
- Challenge: merging two different authentication schemes; GSI authentication is closely coupled with the payload running in the DIRAC pilot
- Developed WMSSecureGW as the gateway, which acts as:
  - intermediate DIRAC services to the volunteer computers, so that with a fake proxy a volunteer computer can request jobs and stage data from the gateway
  - a DIRAC client and pilot to the production DIRAC services, to fetch jobs and stage data
The current scale
[Figure: daily CPU time of successful jobs in LHC@home, by application (SixTrack, ATLAS, ...)]
- Includes SixTrack, ATLAS, CMS, LHCb and some other HEP applications
- Daily CPU usage reaches 25K CPU days; the average core power is 10 HS06, equivalent to a cluster of 400 kHS06 (assuming an average CPU efficiency of 60%)
- SixTrack gains the most CPU because it does not require virtualization, which makes it easy for volunteers to start with
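The cluster-equivalence figure follows directly from the numbers above:

```python
# 25K CPU days of successful jobs per day, at 10 HS06 per core,
# matched against a conventional cluster whose average CPU efficiency is 60%.
cpu_days_per_day = 25_000
hs06_per_core = 10
cluster_cpu_efficiency = 0.60

# Raw HS06 a cluster would need to deliver the same useful work:
equivalent_cluster_hs06 = cpu_days_per_day * hs06_per_core / cluster_cpu_efficiency
# ~417 kHS06, i.e. roughly the 400 kHS06 quoted on the slide
```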
ATLAS@home
[Figure: CPU time of good jobs of all ATLAS sites over a week]
- The total ATLAS CPU time per day is between 300K and 400K CPU days; ATLAS@home remains among the top 10 sites
- ATLAS@home averages 7000 CPU days per day from good jobs, 2.26% of the ATLAS total
- Average BOINC core power: 11 HS06
Recent development
- Lightweight/native: use containers instead of virtual machines
- Use BOINC to backfill busy cluster nodes: the average CPU utilization rate of clusters is between 50% and 70%
- An ATLAS grid site uses ATLAS@home to exploit extra CPU from its fully loaded cluster
Lightweight model for ATLAS@home
[Diagram: PanDA → aCT → ARC CE → BOINC server, dispatching work to three kinds of hosts]
- Volunteer host (Windows/Mac): runs the ATLAS app inside a VirtualBox VM
- Volunteer host (Linux), cloud nodes, PCs and Linux servers at grid sites: run inside a Singularity container
- Volunteer host (SLC/CentOS 6): runs the app directly
100% wall utilization != 100% CPU utilization, due to job CPU efficiency
[Figure: timeline of jobs on one worker node, 12 wall hours per job]
- With jobs 1-2 only (one job per core): 100% wall utilization, but if the job CPU efficiency is 75%, then 25% of the CPU is wasted
- With jobs 1-4 (adding two low-priority jobs): 200% wall utilization and 100% CPU utilization, with job efficiencies of 75% and 25%
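The arithmetic above can be checked with a short calculation, using the efficiency numbers from the slide:

```python
# One job per core: wall utilization is 100%, but the job only uses
# 75% of the CPU, so 25% of the CPU cycles are wasted.
cpu_eff_grid_job = 0.75
wasted = 1.0 - cpu_eff_grid_job          # 0.25

# Two jobs per core: a 75%-efficient grid job plus a 25%-efficient
# backfill job; wall utilization is 200%, but the efficiencies sum
# to full CPU utilization.
cpu_eff_backfill_job = 0.25
cpu_util = cpu_eff_grid_job + cpu_eff_backfill_job   # 1.0
```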
Put more jobs on worker nodes
- Run 2 jobs on each core: 1 grid job at normal priority (pri=20) and 1 BOINC job at the lowest priority (pri=39)
- With these static priorities, the grid job occupies the CPU for as long as it can use it; the BOINC job only gets the CPU cycles that the grid jobs leave idle
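The priority numbers above are the values reported by top, where for normal processes PR = 20 + NI (niceness). A minimal sketch of launching a backfill payload at the lowest priority follows; the payload command here is just a placeholder.

```python
import subprocess

def top_priority(nice_value: int) -> int:
    """Priority as displayed by top for a normal process: PR = 20 + NI."""
    return 20 + nice_value

# Grid job: default niceness 0  -> priority 20
# BOINC job: niceness 19 (lowest) -> priority 39
grid_pri = top_priority(0)
boinc_pri = top_priority(19)

# Launch a placeholder payload at niceness 19; it only receives CPU
# cycles that higher-priority processes leave idle.
proc = subprocess.run(["nice", "-n", "19", "sleep", "1"])
```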
Experience from ATLAS@home at the BEIJING Tier 2 site
- The grid jobs' walltime utilization is 87.8%, while the grid CPU utilization is only 65.6%
- BOINC exploits an extra 23% of CPU time, so the node CPU utilization reaches 89%
- More details from here: BEIJING_BOINC
Looking at one node:
- The overall CPU utilization is 98.44% on this node over 24 hours
Common challenges in the future
- Discontinued development of the BOINC software
- More flexible scheduling for better utilization of diverse resources (availability, power)
- Scalability issues: I/O bottlenecks; ATLAS@home has already hit this bottleneck
- Outreach: how to attract more volunteer computers
Summary
- The application of volunteer computing in HEP started over a decade ago, but it has only served the big experiments for a few years, thanks to the development of key technologies
- Each experiment needs its own implementation in order to integrate the VC resource into its existing workflow
- Volunteer computing has been providing a very considerable amount of CPU to HEP computing
- These @home projects can be used in a broad range of scenarios:
  - managing the internal computing devices of Tier 3 sites and institutes
  - backfilling clusters
Thanks!