Yeti Operations: Introduction and Day 1 Settings
Rob Lane, HPC Support, Research Computing Services, CUIT
Topics
  1. Yeti Operations Committee
  2. Introduction to Yeti
  3. Rules of Operation
1. Yeti Operations Committee
  Determines cluster policy
  In the process of being set up
  In the meantime, we need a policy for day 1 of operations
2. Introduction to Yeti
Final Node Count
  Node Type              Number of Nodes
  Standard (64 GB)       38
  Intermediate (128 GB)  8
  High Memory (256 GB)   35
  Infiniband             16
  GPU                    4
  Total                  101
Meet Your New Neighbors
  Group
  afsis   ocp
  astro   psych
  ccls    sscc
  eeeng   stats
  journ   xenon
Group Shares
  Group   Share %   Group   Share %
  afsis    2.12     ocp     10.60
  astro    6.36     psych    2.12
  ccls    19.43     sscc    19.08
  eeeng    2.12     stats   33.92
  journ    2.12     xenon    2.12
Other Groups
  Renters
  Free Tier
  CUIT
Rules of Operation
  1. Job Priority
  2. Job Characteristics
  3. Queues
  4. Guaranteed Access
Job Priority
  Every job waiting to run is assigned a priority by the scheduling software
  The priority determines the order of jobs waiting in the queue
Job Priority Components
  Group’s share vs. recent usage
  User’s recent usage
  Other factors
Recent Usage
  What does “recent” mean?
  It’s configurable
  Yeti’s setting: 7 days
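
The comparison of a group's share against its recent usage can be sketched as follows. This is only an illustration under stated assumptions: the slides do not give the scheduler's actual formula, so the functions `recent_usage` and `share_delta`, the record layout, and the use of core-hours as the usage unit are all hypothetical. Only the 7-day window and the share percentages come from the slides.

```python
from datetime import datetime, timedelta

# Yeti's "recent usage" setting from the slide: 7 days.
WINDOW = timedelta(days=7)


def recent_usage(records, now, window=WINDOW):
    """Sum usage per group for jobs that ended inside the window.

    records: iterable of (group, end_time, core_hours) tuples
    (a hypothetical layout chosen for this sketch).
    """
    totals = {}
    for group, end_time, core_hours in records:
        if now - window <= end_time <= now:
            totals[group] = totals.get(group, 0.0) + core_hours
    return totals


def share_delta(shares, usage):
    """Return target share (%) minus share of recent usage (%) per group.

    A positive value means the group has recently used less than its
    share; a negative value means it has used more.
    """
    total = sum(usage.values())
    deltas = {}
    for group, target in shares.items():
        used_pct = 100.0 * usage.get(group, 0.0) / total if total else 0.0
        deltas[group] = target - used_pct
    return deltas
```

A scheduler could raise the priority of jobs from groups with a positive delta and lower it for groups with a negative one; how strongly it does so, and how user-level usage and the other factors are weighted, is not specified in the slides.
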
Job Characteristics
  Nodes and cores
  Time
  Memory
Job Queues (subject to change)
  Queue        Time Limit  Memory Limit  Max. User Run
  Batch 1      12 hours    4 GB          512
  Batch 2      12 hours    16 GB         128
  Batch 3      5 days      16 GB         64
  Batch 4      3 days      None          8
  Interactive  4 hours     None          4
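
To show how a job's time and memory requests map onto the queue table, here is a minimal sketch. The limits are copied from the table; everything else is an assumption made for illustration: the function name `pick_queue`, the first-fit ordering, the exclusion of the Interactive queue, and the reading of "Max. User Run" as the maximum number of jobs one user may have running at once. The slides do not say how the scheduler actually routes jobs.

```python
from datetime import timedelta

# (queue name, time limit, memory limit in GB or None for no limit,
#  max. user run). Values are taken from the queue table.
QUEUES = [
    ("Batch 1", timedelta(hours=12), 4, 512),
    ("Batch 2", timedelta(hours=12), 16, 128),
    ("Batch 3", timedelta(days=5), 16, 64),
    ("Batch 4", timedelta(days=3), None, 8),
]


def pick_queue(walltime, memory_gb):
    """Return the first batch queue whose limits fit the request.

    Returns None if no batch queue can hold the job. First-fit in
    table order is an assumption of this sketch.
    """
    for name, time_limit, mem_limit, _max_user_run in QUEUES:
        fits_time = walltime <= time_limit
        fits_memory = mem_limit is None or memory_gb <= mem_limit
        if fits_time and fits_memory:
            return name
    return None
```

For example, `pick_queue(timedelta(hours=2), 2)` selects Batch 1, while a 4-day job that needs 64 GB fits no batch queue, because the only queue without a memory limit stops at 3 days.
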
Guaranteed Access
  New mechanism
  Subject to review by Yeti Operations Committee
  We’re going to try it out in the meantime
Guaranteed Access
  Groups have each been assigned systems
  Group jobs get priority access to their own systems
  “Guaranteed Access” means there will be a known maximum wait time before your job starts running
Guaranteed Access Example
  The group astro owns the node Brussels
  Only two types of jobs will be allowed on Brussels:
  1. Astro jobs
  2. Short jobs
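
The Brussels rule can be sketched as a simple admission check. The slides do not define how long a "short" job may be, so `SHORT_JOB_LIMIT` below is a hypothetical placeholder, as are the function name `may_run_on` and the treatment of nodes with no recorded owner. If only short jobs from other groups can occupy an owned node, the time an owner's job waits behind those jobs is bounded by the short-job limit, which is one plausible reading of the "known maximum wait time".

```python
from datetime import timedelta

# Hypothetical threshold: the slides do not say what counts as "short".
SHORT_JOB_LIMIT = timedelta(hours=12)

# Node ownership from the example slide.
NODE_OWNERS = {"Brussels": "astro"}


def may_run_on(node, job_group, walltime,
               owners=NODE_OWNERS, short_limit=SHORT_JOB_LIMIT):
    """Return True if a job may be placed on the node.

    Owner-group jobs are always allowed; other groups' jobs are
    allowed only if they are short.
    """
    owner = owners.get(node)
    if owner is None:
        # Assumption: a node with no recorded owner is unrestricted.
        return True
    return job_group == owner or walltime <= short_limit
```

Under this sketch a 5-day astro job may run on Brussels, a 4-hour stats job may also run there, and a 3-day stats job may not.
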
Guaranteed Access Debate
  Good because researchers have guaranteed access rights to nodes
  Bad because long jobs lose access to many nodes
Thanks! Comments and Questions?