HTCondor at
Collin Mehring
Introduction: My background

Using HTCondor Since 2011
Madagascar 3 was the first film released after we started using HTCondor (released June 8, 2012).
We also use it for DVD and TV holiday specials, as well as theme park rides.

Animation Studio Background
Productions are our customers; artists are the end users.
Production stages and their teams: Layout -> Animation -> Lighting / FX -> Finaling
The production hierarchy: Production -> Sequence -> Shot -> Frames
Frames are composed of many steps composited together.
Each frame has a left- and right-eye version for the 3D effect.
A movie contains roughly 260k frames.
We support many different applications.
Hard deadlines lead to large amounts of work during crunch time.

Who interacts with HTCondor, and how?
Artists: Submit to the farm and expect frames back. Focus on the art; no technical knowledge of HTCondor required.
Technical Directors: Configure artists' software to use the submission tools. Debug issues on the shot-setup side.
TRAs (Technical Resource Admins / Render Wranglers): Manage the HTCondor farm jobs. Answer artists' questions about the farm and provide help.
JoSE (Job Submission and Execution, R&D team): Configure HTCondor. Develop and maintain tools to help the TRAs manage the farm. Develop the submission tools.
We expect our end users, the artists, to need little technical knowledge of the farm, and we design our systems with this in mind; a lot of the inner workings are abstracted away from them.

Why do we configure HTCondor the way we do?
End users shouldn't require any technical knowledge of the scheduling system. The available settings should be things they care about; everything else is automatic.
The scheduling system should not noticeably impact the end users.
Admins should be able to easily manage large numbers of jobs.
Admins should have easy access to all relevant information and statistics: this makes troubleshooting easier, helps establish causation, and makes it easier to present information to productions.
Prioritize throughput, but consider turnaround time as well. Minimize wasted compute hours.
The new renderer scales very well with cores, so prioritize scheduling large jobs.
Accounting groups should always get their minimum allocation.
Help productions meet deadlines any way possible.

How do we have HTCondor configured?
All DAG jobs: there are many steps involved in rendering a frame.
GroupId.NodeId.JobId instead of ClusterId: easier communication between departments.
No preemption (yet): deadlines are important, so no lost work. Checkpointing is coming soon in the new renderer.
Heavy use of group accounting: Render Units (RU), the scaled core-hour. Productions pay for their share of the farm.
Execution host configuration profiles, e.g. desktops only run jobs at night. Easy deployment and profile switching.
Load data from JobLog/Spool files into Postgres, Influx, and analytics databases.

Quick Facts
Central Manager and backup (HA), on separate physical servers.
One schedd per show, scaling up to ten, split across two physical servers.
About 1400 execution hosts: ~45k server cores, ~15k desktop cores. Almost all partitionable slots.
We complete an average of 160k jobs daily.
An average frame takes 1200 core-hours over its lifecycle; Trolls took ~60 million core-hours.

Notes:
DAG jobs: steps in different programs, frames of a shot, pieces of a frame that get comped together. Autoclustering so that the jobs in a node get clustered.
GroupId: the HTCondor way is DAG ClusterId -> ClusterId.JobId, spread across many schedds, 40 at a time. GroupId is always unique, mainly to hide the underlying Condor implementation.
Preemption: starting to consider this now; having checkpointing will help a lot.
Group accounting: a daemon generates shares from a DB table, with a GUI to modify the table.
Desktops: beefy machines for artists to use lighting tools. We don't want artists to notice farm usage. The file system gets hit hard when they open, so they are now staggered.
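
To make the per-show schedds, custom GroupId/NodeId attributes, and group accounting more concrete, here is a minimal sketch using the HTCondor Python bindings. The GroupId/NodeId attribute names come from the slides; the schedd name, accounting group, paths, and resource requests are invented for illustration, and the studio's actual submissions go through DAGs and their Subway API rather than this direct call.

```python
# Hypothetical sketch using the HTCondor Python bindings (recent htcondor versions).
import htcondor

# Locate the schedd dedicated to one show (the slides describe one schedd per show).
collector = htcondor.Collector()
schedd_ad = collector.locate(htcondor.DaemonTypes.Schedd,
                             "schedd_trolls@centralmanager.example.com")  # name is illustrative
schedd = htcondor.Schedd(schedd_ad)

# A single render node of a DAG, tagged with studio-style GroupId/NodeId
# identifiers instead of relying on the raw ClusterId.
sub = htcondor.Submit({
    "executable": "/opt/pipeline/bin/render_frame",   # illustrative path
    "arguments": "--shot seq010_sh0200 --frame 101",
    "request_cpus": "16",
    "request_memory": "32GB",
    "accounting_group": "trolls.lighting",             # productions pay for their share
    "+GroupId": "12345",                                # custom ClassAd attributes ('+' submit syntax)
    "+NodeId": "7",
    "concurrency_limits": "renderer_license",           # per-software license limit
})

result = schedd.submit(sub, count=1)
print("Submitted cluster", result.cluster())
```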

What additional configuration have we added?
Lots of additional ClassAd attributes (~50).
Concurrency limits: each group has its own limit. Software limits can be per host, and can be released early.
Error & Production Error statuses: differentiating between held and errored jobs.
Subway, our Python submission API, expressed in terms of studio-specific constructs. Supports deferred submissions; v4 provides a REST API.
Job Policy: predefined templates of several job attributes.
Heavy use of pre- and post-priorities.

Notes:
Clims: a cron job that regenerates the software limit files periodically.
Status: mainly for ease of use by artists and TRAs.
Subway Policy: made possible by the submission service; useful because we run a wide range of job types.
Restricted hosts: not ideal, since their use has to be justified.
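
Below is a minimal sketch of what a "Clims"-style cron could look like: a small script that rewrites a concurrency-limit config fragment from current license counts. The license source, file path, and limit names are assumptions; only the <NAME>_LIMIT macro form is standard HTCondor negotiator configuration. Subway's real API is not shown in the slides, so it is not reproduced here.

```python
# Hypothetical "Clims"-style cron: regenerate the software concurrency-limit
# config fragment from current license counts.
from pathlib import Path

def fetch_license_counts():
    """Stand-in for a query against the license servers (values are made up)."""
    return {"NUKE": 120, "HOUDINI": 80, "RENDERER": 4000}

def write_limits(counts, path="/etc/condor/config.d/99-software-limits.config"):
    # HTCondor reads per-resource concurrency limits from <NAME>_LIMIT macros.
    lines = [f"{name}_LIMIT = {count}" for name, count in sorted(counts.items())]
    Path(path).write_text("\n".join(lines) + "\n")

if __name__ == "__main__":
    write_limits(fetch_license_counts())
    # A condor_reconfig of the negotiator would then pick up the new limits.
```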

How do we manage our HTCondor pool?
The Farm Manager (web app): a GUI for managing the HTCondor pool, used by TRAs, TDs, artists, etc.
See specific details: group progress; job stats and information; logs, charts, etc.; finished and canceled jobs.
Perform actions on jobs: supports batched actions on nodes and groups, and can modify jobs that haven't been submitted yet by the DAG.
Filter your view: only see the groups relevant to you.
Hides most low-level HTCondor data: ClassAds, DAGs, SDFs, etc.
Allocate resources between shares, with separate allocations for day and night.
Monitor execution hosts: data and charts, just like jobs, plus links to other monitoring tools.
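
As a rough illustration of the kind of batched action the Farm Manager performs behind the scenes, here is a sketch using the HTCondor Python bindings. The GroupId attribute comes from the slides; the constraint values, priority, and use of the local schedd are illustrative, and the Farm Manager's actual implementation is not described in the slides.

```python
# Hypothetical batched action on all jobs in one group, as a wrangler might
# trigger from the Farm Manager UI.
import htcondor

schedd = htcondor.Schedd()  # or locate the relevant per-show schedd via the collector

group_constraint = "GroupId == 12345"  # example group

# Hold every job in the group (e.g. pausing a problematic shot).
schedd.act(htcondor.JobAction.Hold, group_constraint)

# Bump the priority of the still-idle jobs in that group, then release everything.
schedd.edit(f"({group_constraint}) && JobStatus == 1", "JobPrio", "50")
schedd.act(htcondor.JobAction.Release, group_constraint)
```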

Farm Manager - All Groups

Farm Manager - Alerts filter

Farm Manager - Jobs view

Right-clicking on a job to edit it via drop-down menu

View selection / Allocation Manager

Farm Manager - Hosts

How do we monitor pool stats in real time?
Grafana, primarily used by the TRAs / Render Wranglers.
Quickly detect issues and receive alerts.
At-a-glance overview of the render farm.
Diagnose problems by correlating events between metrics.
More dashboards for specific use cases: software license usage, HTCondor negotiator stats, etc.
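
A minimal sketch of the kind of collector poll that could feed dashboards like these, assuming the InfluxDB 1.x Python client and an existing "htcondor" database; the hostnames, measurement name, and fields are illustrative, not the studio's actual schema.

```python
# Hypothetical poller: count busy/idle cores across the pool and push a point
# to InfluxDB for Grafana to chart.
import htcondor
from influxdb import InfluxDBClient

collector = htcondor.Collector()
influx = InfluxDBClient(host="influx.example.com", port=8086, database="htcondor")

# Query slot ads for their state and core count.
slots = collector.query(htcondor.AdTypes.Startd, projection=["State", "Cpus"])
busy = sum(ad.get("Cpus", 0) for ad in slots if ad.get("State") == "Claimed")
idle = sum(ad.get("Cpus", 0) for ad in slots if ad.get("State") == "Unclaimed")

influx.write_points([{
    "measurement": "pool_cores",
    "fields": {"busy": int(busy), "idle": int(idle)},
}])
```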

TRA Dashboard 1

TRA Dashboard 2

Viewing Historical Data
Tableau, for the big picture: trends over time and comparisons between productions.
Used primarily for scheduling:
Can we fit all of the rendering we're planning to do onto the render farm concurrently?
How do we move things around to make it all fit?
Are there areas we can optimize to make better use of the existing farm resources?
Are we still on schedule?
Historical data is stored in a separate database.

RU Per Frame
Shows historically how much compute is being used for each sequence.
Tracks overall trends and identifies complex sequences.
Useful for scheduling production work and allocating resources between teams.
Tableau - RU Per Frame GIF
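
Since the slides define a Render Unit (RU) as a scaled core-hour, the rollup behind this chart is essentially "total RU divided by frame count per sequence." A toy illustration of that arithmetic follows; the scale factor, record format, and numbers are invented for illustration only.

```python
# Toy "RU per frame" rollup: RU is treated as a scaled core-hour.
from collections import defaultdict

RU_PER_CORE_HOUR = 1.0  # assumed scale factor; the studio's actual scaling is not public

# (sequence, shot, frame, core_hours) records, e.g. pulled from the analytics DB.
records = [
    ("seq010", "sh0200", 101, 900.0),
    ("seq010", "sh0200", 102, 1100.0),
    ("seq020", "sh0050", 14, 2400.0),
]

ru_by_sequence = defaultdict(lambda: [0.0, 0])  # sequence -> [total RU, frame count]
for seq, shot, frame, core_hours in records:
    ru_by_sequence[seq][0] += core_hours * RU_PER_CORE_HOUR
    ru_by_sequence[seq][1] += 1

for seq, (total_ru, frames) in sorted(ru_by_sequence.items()):
    print(f"{seq}: {total_ru / frames:.0f} RU per frame across {frames} frames")
```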

Sequence-Shot Details
Shows RU usage for every farm job, broken down by sequence and shot.
Useful for identifying outliers and specific issues.
Tableau - Seq-Shot details GIF

Overnight Rendering Summary
Tracks nightly render farm performance:
Number of jobs submitted by each production, grouped by priority, with percent completed.
Amount of RU used by each production compared to their allocations, broken down by team.
Total RU used compared to capacity, broken down by production.
Proportion of capacity allocated to each production compared to what they actually used.
Memory usage compared to capacity.
Tableau - Rendering Summary. Sent to the Sup-TDs of each production.

Question Time