HTCondor at the RACF: Seamlessly Integrating Multicore Jobs and Other Developments
William Strecker-Kellogg
RHIC/ATLAS Computing Facility, Brookhaven National Laboratory
April 2014
RACF Overview
Main HTCondor pools:
◦PHENIX—12.2k CPUs
◦STAR—12.0k CPUs
◦ATLAS—13.1k CPUs
STAR and PHENIX are RHIC detectors
◦Loose federation of individual users
ATLAS—tightly controlled, subordinate to the PANDA workflow management system, strict structure
Smaller experiments:
◦LBNE
◦Daya Bay
◦LSST
Supporting Multicore: Priorities and Requirements
Keep the management workload to a minimum
◦No manual re-partitioning
◦Needs to be dynamic and/or partitionable
Support automatic demand-spillover
◦Need to retain the group-quota system with the accept-surplus feature
Maximize throughput—minimize latency
◦Same goals as above
Attempt to support future developments in the same framework
◦More than just multicore
◦High-memory is already used in this context (with a caveat)
Principle of proportional pain
◦Okay to make multicore jobs wait longer—but no starvation is allowed
Supporting Multicore
STEP 1: PARTITIONABLE SLOTS EVERYWHERE
Required a change to monitoring
◦Job count is no longer the correct metric for measuring occupancy
Minor script change to handle SlotIDs
◦Dynamic slots are named slotN_M under their parent partitionable slot slotN
◦Works with no side effects
STEP 2: POLICY CHANGES
Preemption is no longer possible
◦OK for now, since it is not needed
SlotWeight can only be Cpus
◦Needs to change in the future
Defragmentation is necessary
◦Details on the next slide
Dedicated queues for now
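For occupancy, summing CPUs over dynamic slots gives the right number where a job count no longer does. A minimal sketch, assuming the standard DynamicSlot and Cpus slot attributes (the one-liner itself is our illustration, not the production monitoring script):

# Count busy CPUs rather than busy slots or jobs
condor_status -constraint 'DynamicSlot =?= True' -format "%d\n" Cpus | awk '{sum += $1} END {print sum}'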
Defragmentation Policy
DEFRAGMENTATION DAEMON
Start defragmentation:
◦(PartitionableSlot && !Offline && TotalCpus > 12)
End defragmentation:
◦(Cpus >= 10)
Rate: at most 4 machines/hour
KEY CHANGE: NEGOTIATOR POLICY
The default policy is breadth-first filling of equivalent machines:
◦(Kflops – SlotId)
Depth-first filling preserves contiguous blocks longer:
◦(-Cpus)
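As a sketch, the policy above maps onto the condor_defrag and negotiator knobs roughly as follows (the knob names are standard; which rank expression a negotiator uses by default depends on version and local config):

# Which machines are candidates for draining
DEFRAG_REQUIREMENTS = PartitionableSlot && !Offline && TotalCpus > 12
# A machine with this many free CPUs counts as already defragmented
DEFRAG_WHOLE_MACHINE_EXPR = Cpus >= 10
# Rate limit on draining
DEFRAG_DRAINING_MACHINES_PER_HOUR = 4
# Depth-first packing: prefer machines with fewer free CPUs
NEGOTIATOR_POST_JOB_RANK = -Cpus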
Multicore in Dedicated Queues
Works well now that the earlier issues have been resolved
◦Allocation is handled by group quotas and surplus-sharing
Dedicated queue per species of job
◦Currently two—high-memory (6 GB) and 8-core
◦Changing requirements still demand manual action
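A minimal sketch of what a per-species group-quota setup might look like (the knobs are standard; the group names and numbers here are hypothetical):

GROUP_NAMES = group_himem, group_mcore
GROUP_QUOTA_group_himem = 500
GROUP_QUOTA_group_mcore = 1000
# Let each queue spill over into unused capacity
GROUP_ACCEPT_SURPLUS_group_himem = True
GROUP_ACCEPT_SURPLUS_group_mcore = True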
Multicore in a Common Queue
Works well, no starvation
◦But there is no control over the strict allocation of m-core vs. n-core jobs within the queue
Not currently done in ATLAS
◦The allocation-control issue is only part of the reason why
◦Structural, and not likely to change soon
Fixing Bugs in HTCondor
Last summer: a period of intensive development and testing in collaboration with the HTCondor team
◦Built a VM testbed for rapid build and test of patches from the HTCondor team
◦Built a new monitoring interface
◦After many iterations, had a working config with partitionable slots and hierarchical group quotas with accept_surplus
[Plot annotation: usage “should stay here”—now it does!]
Bugfix Testbed Details
Rapid build, test, deploy cycle from git patches
◦Apply patch
◦Rebuild condor
◦Run test-feeder
Job feeder
◦Defines groups in a config file, with different random length-ranges and requirements
◦Variable workload—keeps N jobs idle in each queue
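As a sketch, the submit description such a feeder might generate per group could look like this (all values are illustrative; $RANDOM_INTEGER is a real submit-file macro that gives each queued job its own value):

universe     = vanilla
executable   = /bin/sleep
# Random job length within this group's configured range
arguments    = $RANDOM_INTEGER(60, 600)
+AccountingGroup = "group_himem.feeder"
request_cpus   = 1
request_memory = 6144
# Top up the queue so N jobs stay idle
queue 20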
Current Multicore Status
Fully utilize partitionable slots
◦All traditional (x86_64) nodes can share the same config:
SLOT_TYPE_1 = 100%
NUM_SLOTS = 1
NUM_SLOTS_TYPE_1 = 1
SLOT_TYPE_1_PARTITIONABLE = True
SlotWeight = Cpus
The fix works when accept_surplus is on for any combination of groups
Major limitation: SlotWeight = Cpus
◦High-memory jobs can be accommodated by asking for more CPUs
◦Need the ability to partition better and to interact with global resource limits
◦SlotWeight should be a configurable function of all consumable resources
Other limitation: no preemption
◦Preemption is required to support opportunistic usage
Issue with High-Memory Jobs
1…ok, 2…ok, 3…not ok!
General problem:
◦Inefficiencies when scheduling heterogeneous jobs onto granular resources
◦Worse as you add dimensions: imagine GPUs, disks, co-processors, etc.
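The slide's cartoon becomes concrete with assumed numbers (an 8-core node with 16 GB of RAM, i.e. 2 GB/core, and the 6 GB request from our high-memory queue):

# job 1: 7 cores, 10 GB free: ok
# job 2: 6 cores,  4 GB free: ok
# job 3: needs 6 GB but only 4 GB remain: not ok, and 6 cores sit idle
# Memory, not CPU, becomes the binding constraint; the stranded cores are the inefficiency.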
Goals
◦Need all the same behavior we have now regarding groups and accept_surplus
◦Want to be able to slice by any resource
◦Sane and configurable defaults/quantization of requests
◦Defragmentation inefficiencies should be kept to a minimum—we are mostly there already!
Overall, we are something like ¾ of the way to our ideal configuration.
Problem of Weights and Costs
What does SlotWeight mean with heterogeneous resources?
◦It becomes the administrator's job to decide how much to “charge” for each requested resource
◦E.g. cpus + 1.5 × (ram exceeding the per-core share)
◦Are these weights normalized to what CPU counting would give?
◦If not, what does the sum of SlotWeights represent?
Quotas are tied to the sum of SlotWeights, which must be constant pool-wide and independent of the current job allocation—if quotas are specified as static numbers!
◦Does that mean cost functions need to be linear?
◦Otherwise only dynamic quotas work (e.g. 80% X + 20% Y)…
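If SlotWeight were freed from Cpus, the charging rule above might be expressed like this (hypothetical; only SlotWeight = Cpus was supported at the time, the 2048 MB/core baseline is an assumption, and ifThenElse is a standard ClassAd function):

# Charge 1.5x for memory beyond the per-core share of 2048 MB
SLOT_WEIGHT = Cpus + 1.5 * ifThenElse(Memory > Cpus * 2048, (Memory - Cpus * 2048) / 2048.0, 0)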
Implications
A picture is worth 1000 words…
The more barriers between nodes that can be broken down, the better
◦MPI-like batch software with NUMA-aware scheduling, treating other machines as more-distant NUMA nodes?
ATLAS Load Balancing
PANDA contains knowledge of upcoming work
Wouldn't it be nice to adapt the group allocation accordingly?
◦A few knobs can be tuned—surplus and quota (see the sketch below)
◦Dynamic adjustment based on the currently pending work
◦Gather heuristics on past behavior to help
Timeframe: Fall 2014—the project will be picked up by a summer student this year
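As a sketch of the mechanism only (not the planned design), a quota computed from the PANDA backlog could be pushed to the central manager at runtime; condor_config_val -rset and condor_reconfig are real commands, while the group name and value are hypothetical:

# Raise the multicore quota, then tell the negotiator to re-read its config
condor_config_val -rset "GROUP_QUOTA_group_mcore = 1200"
condor_reconfig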
PHENIX Job Placement
Data is stored on PHENIX compute nodes (dCache)
◦New this year: RAW data is placed there hot off the DAQ
◦Reprocessing no longer requires a second read from tape
◦Less tape wear, faster—no stage-in latency
◦Analysis continues to read input from dCache
No intelligent placement of jobs yet
◦HDFS-like job placement would be great—but without sacrificing throughput
Approach:
◦Need to know where the files are first!
◦Need to suggest placement without waiting for the perfect slot
◦Started as a proof-of-concept for the efficacy of a non-flat network
◦Testing Infiniband fabrics with a tree-based topology
PHENIX Job Placement
File placement is harvested from the nodes and stored in a database
◦The database maps file→machine
◦It also maps machine→rack and rack→rack-group
Machines run a STARTD_CRON job to query and advertise their location
A job RANK statement steers jobs towards the machines holding their files
◦E.g.: 3*(Machine == “a” || Machine == “c”) + 2*(Rack == “21-6”) + (RackGroup == “10”)
◦Slight increase in negotiation time; upgraded hardware to compensate
◦Several thousand matches/hr, each with a possibly unique RANK statement
Working on a modified dCache client to read a file directly when it is on the local node
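A sketch of the location-advertising side (the STARTD_CRON knobs are standard HTCondor; the script path and attribute names are our illustration):

STARTD_CRON_JOBLIST = $(STARTD_CRON_JOBLIST) LOCATION
STARTD_CRON_LOCATION_EXECUTABLE = /usr/local/bin/advertise_location
STARTD_CRON_LOCATION_PERIOD = 1h
# The script looks up this host in the placement database and prints
# ClassAd attributes for the startd to advertise, e.g.:
#   Rack = "21-6"
#   RackGroup = "10"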
PHENIX Job Placement Results
We achieved the expected results for machine-local jobs
◦>80% on an empty farm
◦~10% on a full farm
All localization within the rack-group:
◦>90% on an empty farm
◦~15% on a full farm
This argues that great benefit could be gained from exploiting multi-tier networking
◦Without it, only machine-local jobs benefit
PHENIX Job Placement Results
[Plots by Alexandr Zaytsev: time series of (1) machine-local jobs, (2) rack-local jobs (exclusive of machine-local), (3) all localized jobs (sum), and (4) all jobs; plus histograms, taken in 1-hour intervals, of the fraction of jobs in each locality state, comparing simulated no-localization against measured results with localization.]
THANK YOU
Questions? Comments? Stock Photo?
Configuration Changes
TAKING ADVANTAGE OF CONFIG-DIR
Since 7.8, Condor supports a config.d/ directory to read configuration from
More easily allows programmatic/automated management of configuration
Refactored the configuration files at RACF to take advantage
Old Way
Main config: LOCAL_CONFIG_FILES = /dir/a, /dir/b
Read order:
1. /etc/condor/condor_config (or $CONDOR_CONFIG)
2. /dir/a
3. /dir/b
New Way
Main config: LOCAL_CONFIG_DIR = /etc/condor/config.d and LOCAL_CONFIG_FILES = /dir/a, /dir/b
Read order:
1. /etc/condor/condor_config (or $CONDOR_CONFIG)
2. /etc/condor/config.d/* (in alphanumeric order)
3. /dir/a
4. /dir/b