Talking Points: Deployment on Big Infrastructures
INFN HTCondor Workshop, October 2016
Examples

Example: UW-Madison CHTC
- Pool size: ~15k slots
- Central Manager: 8 cores (load average of 2), 8GB RAM (5GB in use), no special config
- Submit machines: ~80 submit machines, 3 "big" general-purpose ones; each big one typically has ~10k running / 100k queued jobs, 32 cores, 96GB RAM, SSD

Example: Global CMS Pool
- Pool size: ~150k - 200k slots
- Central Manager: collector tree, no preemption
- Submit machines: 15, with ~15k running
Central Manager Planning
- Memory: ~1GB of RAM per 4,000 slots, plus RAM for any other services (e.g. monitoring)... or even better, run those services somewhere else
- CPU: 4 cores can work if < 20k slots; 8 cores if bigger or if there are many users
  - Speed per core (clock rate) helps
- A 1 gig network connection is OK
- Create CCB brokers separate from the Central Manager at > ~20k slots
Central Manager Planning, cont.
- Use a "collector tree" if using strong authentication / encryption, especially over the WAN
  - Hides latency, gives more parallelism
  - [Diagram: the Negotiator talks to a top-level Collector; child Collectors, one per ~1,500 execute nodes, sit between the top-level Collector and the execute nodes]
- See the HOWTO at https://htcondor-wiki.cs.wisc.edu/index.cgi/wiki?p=HowToConfigCollectors
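A minimal configuration sketch along the lines of that HOWTO, assuming one extra child collector; the daemon name, port, and host name below are illustrative, not values from the talk:

    # On the central manager: run a child collector next to the main one
    COLLECTOR_CHILD1 = $(COLLECTOR)
    COLLECTOR_CHILD1_ARGS = -f -p 10002
    COLLECTOR_CHILD1_ENVIRONMENT = "_CONDOR_COLLECTOR_LOG=$(LOG)/CollectorChild1Log"
    DAEMON_LIST = $(DAEMON_LIST) COLLECTOR_CHILD1

    # Child collectors forward ads up to the top-level collector
    # (via CONDOR_VIEW_HOST; see the HOWTO for the exact per-daemon setup)

    # On execute nodes: report to a child collector instead of the top level
    COLLECTOR_HOST = cm.example.org:10002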
Submit Machine Planning
- Memory: ~50KB per queued job, ~1MB per running job (actual is ~400KB; the rest is a safety factor)
- CPU: 2 or 3 cores are fine, BUT base the CPU decision on the needs of logged-in users (i.e. compiling, test jobs, etc.)
- More than 5-10k jobs? Buy an SSD!
- Our setup typically has a dedicated, small, low-latency SSD for the job queue, AND large high-throughput (striped) storage for user home/working directories
- Network: 1 gig, or 10 gig if primarily using HTCondor File Transfer
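One way to split the queue onto an SSD while keeping bulk data on big storage is the JOB_QUEUE_LOG knob; the mount points below are hypothetical:

    # Put the schedd's job queue transaction log on a small, low-latency SSD
    JOB_QUEUE_LOG = /ssd/condor/job_queue.log
    # Keep the rest of SPOOL (staged job sandboxes) on large striped storage
    SPOOL = /data/condor/spool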
How to move files from the submit machine to the execute machine?
- Shared file system (NFS, AFS, Gluster, …)
  - Pro: Less work for users - no need to specify input files
  - Con: No management often leads to meltdown
- HTCondor File Transfer (see the submit-file sketch below)
  - Con: Users need to specify input and/or output files
  - Pro: File transfers are managed
  - Pro: Makes it simpler to run the job offsite
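A minimal submit-file sketch using HTCondor File Transfer; the executable and file names are illustrative:

    executable              = analyze.sh
    should_transfer_files   = YES
    when_to_transfer_output = ON_EXIT
    transfer_input_files    = data.csv, config.json
    output                  = analyze.out
    error                   = analyze.err
    log                     = analyze.log
    queue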
Note Brian B's warning…
Horizontal Scaling
- Submit node scaling problems? Add more submit nodes
  - A pool can have an arbitrary number of schedds
  - How many are needed? Depends on many things:
    - Hertz rate of jobs (a schedd is safe at ~10-20+ job starts/sec)
    - Submitting one job at a time vs. in big batches
    - Amount of job I/O
  - How to detect a scaling problem? RecentDaemonCoreDutyCycle > 98% (see the query sketch below)
  - SCHEDD_HOST in ~/.condor/user_config can point users to a remote schedd
- Central manager scaling problems? Add more central managers, then federate them via "flocking"
  - How to detect a scaling problem? Metrics on dropped packets and negotiation cycle time (UW-Madison's is typically a couple of minutes)
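A sketch of how to spot a saturated schedd and of schedd-side flocking config; the host name is illustrative, and the other pool must also be configured to accept flocked jobs (FLOCK_FROM plus the usual ALLOW settings):

    # Values approaching 0.98+ mean the schedd is saturated
    condor_status -schedd -af Name RecentDaemonCoreDutyCycle

    # Schedd-side flocking to a second pool's central manager
    FLOCK_TO = cm2.example.org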
Some User/Admin Training
- Train users to submit jobs in large batches. Instead of running condor_submit 5,000 times, do:
    executable = /bin/foo.exe
    initialdir = run_$(Process)
    queue 5000
- Train users on: what is a reasonable number of queued jobs? a reasonable job runtime?
- Avoid constant polling with condor_[q|status]
  - Consider the job event log or a DAGMan POST script (see the condor_wait sketch below)
  - Consider monitoring with condor_gangliad or Fifemon
- Use selection and projection
  - Bad:  condor_status -l | grep Busy
  - Good: condor_status -cons 'Activity=="Busy"' -af Name
- Custom print formats (https://is.gd/jB7m4q)
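One way to avoid polling condor_q in a loop is to block on the job event (user) log with condor_wait; the log file name and timeout below are illustrative:

    # Returns when all jobs writing to this log have finished, or after one hour
    condor_wait -wait 3600 myjobs.log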
Tuning and Customization for large scale
- Kernel tuning: done automatically with HTCondor v8.4.x+
- Enable the shared port daemon: done automatically with HTCondor v8.5.x+
- CCB is required to let one schedd have more than ~25k running jobs
- "Circuit breaker" config knobs - we have lots of knobs (see the sketch below)
  - Schedd: MAX_JOBS_PER_OWNER, MAX_JOBS_PER_SUBMISSION, MAX_JOBS_RUNNING, FILE_TRANSFER_DISK_LOAD_THROTTLE, MAX_CONCURRENT_UPLOADS / MAX_CONCURRENT_DOWNLOADS, …
  - Central Manager: NEGOTIATOR_MAX_TIME_PER_SCHEDD, NEGOTIATOR_MAX_TIME_PER_SUBMITTER
  - Schedd: SUBMIT_REQUIREMENTS
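A sketch of how a few of these circuit-breaker knobs and a SUBMIT_REQUIREMENTS rule might look; the numeric values and the requirement name are illustrative, not site recommendations:

    # Schedd-side limits (illustrative values)
    MAX_JOBS_PER_OWNER = 100000
    MAX_JOBS_PER_SUBMISSION = 20000
    MAX_JOBS_RUNNING = 15000
    FILE_TRANSFER_DISK_LOAD_THROTTLE = 2.0

    # Example submit requirement: reject jobs that do not request memory
    SUBMIT_REQUIREMENT_NAMES = $(SUBMIT_REQUIREMENT_NAMES) HasMemoryRequest
    SUBMIT_REQUIREMENT_HasMemoryRequest = RequestMemory > 0
    SUBMIT_REQUIREMENT_HasMemoryRequest_REASON = "Please set request_memory in your submit file"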
Tuning, cont.
- Improve scalability by disabling unneeded features, e.g.:
  - Preemption: NEGOTIATOR_CONSIDER_PREEMPTION = False
  - Job ranking of machines: NEGOTIATOR_IGNORE_JOB_RANKS = True
  - Durable commits in the event of power failure: CONDOR_FSYNC = False
- Improve scalability by enabling experimental features
Questions?