Talking Points: Deployment on Big Infrastructures
INFN HTCondor Workshop, Oct 2016

Examples

Example: UW-Madison CHTC
- Pool size: ~15k slots
- Central Manager: 8 cores (load average of 2), 8GB RAM (5GB in use), no special config
- Submit machines: ~80 submit machines; 3 "big" general-purpose ones, each typically with ~10k running / 100k queued jobs, 32 cores, 96GB RAM, and an SSD

Example: Global CMS Pool
- Pool size: ~150k - 200k slots
- Central Manager: collector tree, no preemption
- Submit machines: 15, with ~15k running jobs

Central Manager Planning
- Memory: ~1GB of RAM per 4,000 slots, plus RAM for other services (e.g. monitoring)... or even better, run those services somewhere else
- CPU: 4 cores can work if < 20k slots; 8 cores if bigger or if there are many users
- Clock speed per core helps
- A 1-gigabit network connection is OK
- Create CCB brokers separate from the Central Manager at > ~20k slots (see the sketch below)
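A minimal configuration sketch for the last point; CCB brokers are just additional condor_collector daemons run on their own machines, and the hostnames below are illustrative, not from the slide.

# On execute (and other firewalled) nodes: register with dedicated CCB brokers
# instead of the central manager's collector
CCB_ADDRESS = ccb1.example.edu:9618, ccb2.example.edu:9618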

Central Manager Planning, cont.
- Use a "collector tree" if using strong authentication / encryption, especially over the WAN: child collectors (roughly one per 1,500 execute nodes) receive ads from the execute nodes and forward them to the top-level collector that sits alongside the negotiator
- This hides latency and gives more parallelism
- See the HOWTO at https://htcondor-wiki.cs.wisc.edu/index.cgi/wiki?p=HowToConfigCollectors (a simplified configuration sketch follows)
(Slide diagram: Negotiator and Top Level Collector at the top, Child collectors below, Execute Nodes at the bottom)
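A simplified sketch of the collector tree; daemon names, ports, and hostnames are illustrative assumptions, and the HOWTO linked above gives the tested recipe, including the per-daemon scoping of the forwarding knob.

# On the central manager: run extra "child" collectors on their own ports,
# roughly one per 1,500 execute nodes
COLLECTOR_CHILD1      = $(COLLECTOR)
COLLECTOR_CHILD1_ARGS = -f -p 9620
COLLECTOR_CHILD2      = $(COLLECTOR)
COLLECTOR_CHILD2_ARGS = -f -p 9621
DAEMON_LIST = $(DAEMON_LIST) COLLECTOR_CHILD1 COLLECTOR_CHILD2

# Each child forwards the ads it receives to the top-level collector; this is
# done with CONDOR_VIEW_HOST scoped to the child collectors (exact syntax in
# the HOWTO)

# On each execute node: report to one of the child collectors
COLLECTOR_HOST = cm.example.edu:9620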

Submit Machine Planning
- Memory: ~50KB per queued job, ~1MB per running job (actual usage is ~400KB; the rest is a safety factor)
- CPU: 2 or 3 cores are fine, BUT base the CPU decision on the needs of logged-in users (i.e. compiling, test jobs, etc.)
- More than 5-10k jobs? Buy an SSD!
- Our setup typically has a dedicated, small, low-latency SSD for the job queue, AND large high-throughput (striped) storage for user home/working directories
- Network: 1 gigabit, or 10 gigabit if primarily using HTCondor File Transfer

How to move files from the submit machine to the execute machine?
- Shared file system (NFS, AFS, Gluster, ...)
  - Pro: Less work for users - no need to specify input files
  - Con: No management, which often leads to meltdown
- HTCondor File Transfer (a minimal submit-file sketch follows)
  - Con: Users need to specify input and/or output files
  - Pro: File transfers are managed
  - Pro: It then becomes simpler to run the job offsite
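For the HTCondor File Transfer case, a minimal submit-file sketch; the executable and file names are made up for illustration.

# Minimal submit file using HTCondor-managed file transfer
executable              = analyze.sh
arguments               = data.csv
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
transfer_input_files    = data.csv, params.ini
output                  = analyze.out
error                   = analyze.err
log                     = analyze.log
queue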

Note Brian B's warning…

Horizontal Scaling
- Submit node scaling problems? Add more submit nodes
  - A pool can have an arbitrary number of schedds
  - How many are needed? It depends on many things: the rate of job starts (a schedd is safe at ~10-20+ starts/sec), whether jobs are submitted one at a time or in big batches, and the amount of job I/O
  - How to detect a scaling problem? RecentDaemonCoreDutyCycle > 98% (see the sketch after this list)
  - SCHEDD_HOST in .condor/user_config can point a user at a remote schedd
- Central manager scaling problems? Add more central managers, then federate the pools via "flocking"
  - How to detect a scaling problem? Metrics on dropped packets and negotiation cycle time (at UW-Madison this is typically a couple of minutes)
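A few illustrative commands and knobs for the points above; the hostnames are made up, and the flocking line shows only the submit-side half (the target pool must also authorize the flocked jobs).

# Detect an overloaded schedd: a duty cycle above 0.98 means trouble
condor_status -schedd -af Name RecentDaemonCoreDutyCycle

# Per-user redirect to a less-loaded schedd, placed in ~/.condor/user_config
SCHEDD_HOST = submit2.example.edu

# Submit-side half of flocking to a second pool's central manager
FLOCK_TO = cm2.example.edu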

Some User/Admin Training
- Train users to submit jobs in large batches. Instead of running condor_submit 5,000 times, use one submit file (expanded sketch below):
    executable = /bin/foo.exe
    initialdir = run_$(Process)
    queue 5000
- Train users on what a reasonable number of queued jobs is, and what a reasonable job runtime is
- Avoid constant polling with condor_q / condor_status
  - Consider the job event log or a DAGMan POST script instead
  - Consider monitoring with condor_gangliad or Fifemon
- Use selection and projection
  - Bad:  condor_status -l | grep Busy
  - Good: condor_status -cons 'Activity=="Busy"' -af Name
  - Custom print formats: https://is.gd/jB7m4q
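A slightly fuller version of the batch-submission idea above, combined with the "avoid polling" advice; the paths are illustrative, and the run_N directories are assumed to already exist before submission.

# One condor_submit call queues 5,000 jobs, each in its own pre-created run_N directory
executable = /bin/foo.exe
initialdir = run_$(Process)
output     = foo.out
error      = foo.err
# one shared job event log for the whole batch (path illustrative)
log        = /home/alice/jobs/foo.log
queue 5000

Then, instead of polling condor_q, block on the event log with: condor_wait /home/alice/jobs/foo.log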

Tuning and Customization for Large Scale
- Kernel tuning: done automatically with HTCondor v8.4.x+
- Enable the shared port daemon: done automatically with HTCondor v8.5.x+
- CCB is required to let one schedd have more than ~25k running jobs
- "Circuit breaker" config knobs; we have lots of knobs (a configuration sketch follows)
  - Schedd: MAX_JOBS_PER_OWNER, MAX_JOBS_PER_SUBMISSION, MAX_JOBS_RUNNING, FILE_TRANSFER_DISK_LOAD_THROTTLE, MAX_CONCURRENT_UPLOADS / MAX_CONCURRENT_DOWNLOADS, ...
  - Central Manager: NEGOTIATOR_MAX_TIME_PER_SCHEDD, NEGOTIATOR_MAX_TIME_PER_SUBMITTER
  - Schedd: SUBMIT_REQUIREMENTS
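A configuration sketch using the knobs named above; the numeric values are placeholders to show the shape of the settings, not recommendations.

# Schedd-side "circuit breakers"
MAX_JOBS_PER_OWNER        = 50000
MAX_JOBS_PER_SUBMISSION   = 20000
MAX_JOBS_RUNNING          = 15000
FILE_TRANSFER_DISK_LOAD_THROTTLE = 2.0
MAX_CONCURRENT_UPLOADS    = 100
MAX_CONCURRENT_DOWNLOADS  = 100

# Central-manager-side limits (seconds) on negotiation time per schedd / submitter
NEGOTIATOR_MAX_TIME_PER_SCHEDD    = 120
NEGOTIATOR_MAX_TIME_PER_SUBMITTER = 60

# Reject bad submissions at the door with a submit requirement
SUBMIT_REQUIREMENT_NAMES = MemLimit
SUBMIT_REQUIREMENT_MemLimit = RequestMemory <= 64000
SUBMIT_REQUIREMENT_MemLimit_REASON = "RequestMemory must be at most 64000 MB"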

Tuning, cont.
- Improve scalability by disabling unneeded features, e.g. (snippet below):
  - Preemption: NEGOTIATOR_CONSIDER_PREEMPTION = False
  - Job ranking of machines: NEGOTIATOR_IGNORE_JOB_RANKS = True
  - Durable commits in the event of power failure: CONDOR_FSYNC = False
- Improve scalability by enabling experimental features
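The same settings as a config snippet, with one way to apply and check them; where each knob belongs (central manager vs. submit machine) follows the bullets above.

# Central manager: skip preemption and job-rank matchmaking
NEGOTIATOR_CONSIDER_PREEMPTION = False
NEGOTIATOR_IGNORE_JOB_RANKS    = True

# Submit machines: trade crash durability of the job queue for schedd speed
CONDOR_FSYNC = False

# Apply and verify
#   condor_reconfig
#   condor_config_val NEGOTIATOR_CONSIDER_PREEMPTION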

Questions?