CPU efficiency. Since May, CMS Offline and Computing (O+C) has run a dedicated task force, chaired by D. Colling (Imperial College, UK), to investigate CMS CPU efficiency. The focus is on CPU efficiency because that is what WLCG measures, even though the metric penalizes those who reduce their CPU needs, or who deliberately trade CPU efficiency for better physics throughput. The task force has nevertheless found opportunities to improve CPU efficiency, resource utilization, and physics throughput together. For a significant part of 2017 we were not CPU-resource constrained; the situation has been changing over the last month, so some plots still have insufficient statistics. Even so, a lot of understanding has been gained, which has led to implemented solutions. Global efficiency can be factorized as infrastructure_efficiency * payload_efficiency: the payload term covers calibration and code loading, IO wait, final stage-out, job length, and so on, while the infrastructure term covers the pilot infrastructure, interaction with batch systems, job matching, pilot draining, and so on.
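As a minimal sketch of the factorization (illustrative numbers of our own choosing, not CMS monitoring code):

```python
# Sketch of how a WLCG-style CPU efficiency factorizes into an infrastructure
# term and a payload term, for a single hypothetical multi-core pilot.

def infrastructure_efficiency(core_seconds_claimed, core_seconds_provisioned):
    """Fraction of the pilot's core-time actually handed to payload jobs."""
    return core_seconds_claimed / core_seconds_provisioned

def payload_efficiency(cpu_seconds_used, core_seconds_claimed):
    """Fraction of the claimed core-time the payloads spend on the CPU
    (the rest is calibration/code loading, IO wait, stage-out, ...)."""
    return cpu_seconds_used / core_seconds_claimed

# Hypothetical 8-core pilot that lived 48 hours:
provisioned = 8 * 48 * 3600        # core-seconds delivered by the pilot
claimed     = 0.89 * provisioned   # core-seconds matched to payload jobs (assumed)
cpu_used    = 0.70 * claimed       # CPU-seconds actually burned by payloads (assumed)

infra   = infrastructure_efficiency(claimed, provisioned)  # ~0.89
payload = payload_efficiency(cpu_used, claimed)            # ~0.70
global_eff = infra * payload                               # ~0.62, what WLCG sees
print(f"infrastructure={infra:.2f} payload={payload:.2f} global={global_eff:.2f}")
```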

Infrastructure “inefficiency” – implemented solutions. Causes and mitigations for idle CPU, with estimates of their contributions to the ~10% average and ~15% maximum sustained idle CPU originally caused by multi-core glidein scheduling issues. Early improvements: depth-wise filling of glideins (pilots), implemented in June (estimated gain ~1-2%); shortening the time a glidein can sit completely idle from 20 minutes to 10 minutes in July, a limit driven by the Negotiator cycle length (estimated gain ~1%); and removing glideins from site and factory batch queues after 1-3 hours, also in July. For the last item, we found that when job submission is bursty the glideinWMS factory had already requested new glideins that finally started hours or even days after the job demand had evaporated, resulting in massive, unpredictable CPU wastage of up to 60% over many hours. Removing long-queued glideins couples the demand and supply of resources more tightly, but is perhaps not the perfect solution, since it causes some batch-queue churn at sites by removing queued jobs. A toy illustration of the depth-wise filling change follows below.
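The sketch below (our own toy model, not glideinWMS code) illustrates why depth-wise filling reduces wasted cores: completely empty glideins can retire quickly under the 10-minute idle timeout, while partially busy glideins must keep their empty cores alive.

```python
# Toy comparison of breadth-first vs depth-first placement of single-core
# jobs into 8-core glideins, counting idle cores "stranded" inside
# glideins that still run at least one job and therefore cannot retire.

CORES_PER_GLIDEIN = 8

def breadth_first(n_glideins, n_jobs):
    """Spread jobs evenly across all glideins."""
    busy = [0] * n_glideins
    for i in range(n_jobs):
        busy[i % n_glideins] += 1
    return busy

def depth_first(n_glideins, n_jobs):
    """Fill one glidein completely before touching the next."""
    busy = [0] * n_glideins
    for i in range(n_jobs):
        busy[i // CORES_PER_GLIDEIN] += 1
    return busy

def stranded_idle_cores(busy_per_glidein):
    """Idle cores in partially busy glideins; fully empty glideins are
    assumed to retire after the idle timeout and are not counted."""
    return sum(CORES_PER_GLIDEIN - b
               for b in busy_per_glidein if 0 < b < CORES_PER_GLIDEIN)

n_glideins, n_jobs = 10, 40   # hypothetical: 80 cores provisioned, 40 jobs left
print("breadth-first stranded idle cores:",
      stranded_idle_cores(breadth_first(n_glideins, n_jobs)))   # 40
print("depth-first stranded idle cores:",
      stranded_idle_cores(depth_first(n_glideins, n_jobs)))     # 0
```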

Infrastructure “inefficiency” – analysis and plans. The reference study period after the improvements, with sustained job pressure, is Aug 23rd-29th, 2017. The total remaining idle CPU in multi-core pilots is 11.4%. A legitimate use case for leaving CPU idle is high-memory dynamic slots for RAM-hungry workflows, on both the Grid and the HLT (3.4%). Analysis jobs with unrealistic wall-clock time requests cause idle CPU (1.6%), since finite-lifetime glideins (typically 48h) will not live long enough to match them; automatic job splitting in CRAB3 is under test to address this. Draining glideins after 30h accounts for 4.2%; we are currently optimizing this against job failure rates and may consider not draining at all to eliminate the component entirely. In total, 9.2% is understood, leaving 2.2% unexplained, or about 1 hour on average for a 48h pilot (the bookkeeping is spelled out below). Roughly 6% of the next gains may therefore come from improving job wall-clock time estimates and eliminating pilot draining. Note that the reference period had steady production job pressure, not bursty submission; comparisons with bursty weeks show that burstiness adds a further ~6% idle CPU (due to additional cyclical draining of pilots), even after limiting idle glidein queue time to 1 hour. Scale tests in the Global Pool ITB with über-glideins may allow us to estimate the individual effects of the improvements by backing them out one by one.
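The bookkeeping above is simple arithmetic; the short check below just re-derives the quoted totals from the quoted components.

```python
# Back-of-the-envelope check of the idle-CPU accounting for the
# Aug 23-29, 2017 reference week (all inputs are the numbers quoted above).

total_idle        = 11.4   # % of multi-core pilot core-time left idle
high_memory_slots = 3.4    # dynamic slots held for RAM-hungry workflows
bad_walltime_req  = 1.6    # analysis jobs requesting more wall time than the pilot has left
pilot_draining    = 4.2    # glideins drained after 30 h

understood = high_memory_slots + bad_walltime_req + pilot_draining   # 9.2 %
remaining  = total_idle - understood                                 # 2.2 %

pilot_lifetime_h = 48
print(f"understood: {understood:.1f}%  remaining: {remaining:.1f}%")
print(f"remaining idle per 48h pilot: {remaining / 100 * pilot_lifetime_h:.1f} h")  # ~1 h
print(f"potential next gains: {bad_walltime_req + pilot_draining:.1f}%")            # ~6 %
```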

Submission Infrastructure Monitoring. The multi-core pool is now filled to >97%. The monitoring plots break down the reasons for idle cores (y-axis: number of cores); note the order-of-magnitude differences in scale between the components. HLT and Grid memory-starved slots (red and purple) are a valid use case coming from high-memory workflows. The retiring policy and the accuracy of requested job lengths can, in principle, fix the yellow and green components. The remaining ~2% (blue) can be investigated further but is possibly irreducible.

CMS (Payload) Workload. Over the last 1B core-hours, the approximate usage is shown below. Each row represents a different source of inefficiency, some of which may be irreducible, and each calls for a different approach (and cost!) to improve. The goal, given finite effort, is to maximize the payoff.

Job type             | % of CPU resources | Payload CPU efficiency | % of idle payload CPU
Analysis             | 37%                | 64%                    | 41%
Simulation           | 27%                | 74%                    | 22%
MC reconstruction    | –                  | 68%                    | 29%
Data reconstruction  | 7%                 | 73%                    | 5%
Other                | 2%                 | 56%                    | 8%

Overall, WLCG CPU efficiency = (pilot scheduling efficiency) * (payload CPU efficiency), and the WLCG measurements roughly match the collected CMS monitoring. A rough cross-check follows below.
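The sketch below is our own cross-check of the factorized efficiency against the table; the MC-reconstruction share of CPU resources is missing from the transcript, so it is assumed here to be the remainder of that column (~27%).

```python
# Rough consistency check (our arithmetic, one assumed number) of the
# payload table against the factorized WLCG CPU efficiency.

rows = {
    # job type:           (share of CPU resources, payload CPU efficiency)
    "analysis":            (0.37, 0.64),
    "simulation":          (0.27, 0.74),
    "mc_reconstruction":   (0.27, 0.68),   # share ASSUMED as the column remainder
    "data_reconstruction": (0.07, 0.73),
    "other":               (0.02, 0.56),
}

payload_eff = sum(share * eff for share, eff in rows.values())   # ~0.68

infrastructure_eff = 1 - 0.114   # ~88.6%, from the Aug 23-29 reference week above
wlcg_cpu_eff = infrastructure_eff * payload_eff
print(f"payload ~{payload_eff:.2f}, WLCG CPU efficiency ~{wlcg_cpu_eff:.2f}")
```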

Difficulties: MC reconstruction. The most CPU-efficient workflows are those with infinite loops. Example: CMS MC reconstruction can be done with either premixed or classic pile-up. Premixed mode reduces the CPU per event by 2x and the data read per event by 10x, but it requires a 10x larger background library (~1 PB of disk). We host these samples at large disk sites (FNAL, CERN) and stream them to the reprocessing sites, which provide 30% of the CPU resources for this campaign. The result is ~60% CPU efficiency overall, and 65% when run at CERN or FNAL. Today we take the CPU-efficiency hit and deliver >2x more events per month than with the prior approach. Alternate options with higher CPU efficiency would be to run premixing only at CERN and FNAL (far fewer CPUs), to run classic mode only (involving more sites, but with a far slower application), or to distribute the premixed background more widely. Also under investigation: having ROOT better interleave IO and CPU usage. Take-home message: we sometimes trade off CPU efficiency for better physics, or to avoid IO (which is not metered by WLCG). A toy version of this trade-off is sketched below.
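The sketch below shows the mechanism of the trade-off: the 2x saving in CPU per event outweighs a modest drop in CPU efficiency. The classic-mode efficiency is a hypothetical number of ours; only the 2x CPU-per-event reduction and the ~60% premixed efficiency come from the slide, and the full >2x gain seen in production also reflects effects not modeled here (for example the 10x smaller per-event data volume).

```python
# Toy event-throughput comparison: premixed vs classic pile-up.
# Events delivered scale as (core-hours * CPU efficiency) / (CPU per event).

core_hours = 1.0                               # arbitrary unit of delivered CPU

classic_eff, classic_cpu_per_evt = 0.75, 1.0   # classic efficiency ASSUMED
premix_eff,  premix_cpu_per_evt  = 0.60, 0.5   # premixing: ~2x less CPU per event

classic_events = core_hours * classic_eff / classic_cpu_per_evt
premix_events  = core_hours * premix_eff  / premix_cpu_per_evt
print(f"premixed / classic throughput ~ {premix_events / classic_events:.1f}x")  # ~1.6x here
```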

Difficulties: Input Data. Much ado about a measurable quantity. Over the last year, 17% of our wall-clock hours went to jobs that read data from offsite, and on average this reduces their CPU efficiency by 10%, so the CPU-efficiency cost of remote reads is about 2% globally (see the arithmetic below). Counting this as a cost assumes we have other jobs that could have used the same CPU, which is a bad assumption given that we spent most of 2017 under-utilized. Sites with “popular” input data and performant storage may have fewer remote-IO jobs. Today we take the CPU-efficiency hit, at a cost of a 2% decrease in overall CPU efficiency; the alternate options would be to run lower-priority jobs or to leave sites or cores unused. Simulation jobs (which have no input dataset) are 27% of CMS's CPU workload; the remainder requires varying amounts of IO. Take-home message: remote IO gives us flexibility in workflow planning; it is a complex trade-off between utilization, prioritization, and CPU efficiency.
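The "about 2% globally" figure is just the product of the two quoted numbers:

```python
# Global CPU-efficiency cost of remote reads, from the two figures quoted above.

remote_read_wall_fraction = 0.17   # share of wall-clock hours in jobs reading data offsite
per_job_efficiency_loss   = 0.10   # average CPU-efficiency reduction for those jobs

global_cost = remote_read_wall_fraction * per_job_efficiency_loss
print(f"global CPU-efficiency cost of remote reads ~ {global_cost:.1%}")   # ~1.7%, i.e. about 2%
```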