Large, Fast, and Out of Control: Tuning Condor for Film Production
Jason A. Stowe, Software Engineer Lead - Condor, CORE Feature Animation

CORE's Farm & Middleware
● User facing: Submitter, Session Manager, FAMDB, Condor View
● Back end: condor_schedd, condor_startd, condor_starter, condor_render
● Hardware: GHz-class Linux processors with 4GB RAM, 64 Mac procs, 4 managing machines
● Storage: several filers, terabytes of capacity
● 50 million renders so far (Vanilla Universe)

Goals and Software
Goals:
● High throughput & efficiency
● Easy Condor submission and integration
Priority management - key to throughput

Initial Configuration
Software/Policies:
● User priority
● Behavior flags - STARTD
Issues:
● NFS issues
● Out-of-order execution
● Priority management
Farm: 320 procs, 1 main filer, RenderMan; schedd server, workstation schedds (schedule everything else), middleware, central manager

How CG Productions Work
Traditionally, movie scripts = group of sequences
Movie's sequences ~ play's scenes
Sequence = group of shots
Assets = sets/characters/props/...
Two pipelines:
● Assets: Design, Model, Texture, Surfacing
● Shots: Design, Layout, Animation, Lighting, Composite
Prioritize work-units instead of users?

Accounting Groups: Take 1
Software/Policies:
● Contracted Wisconsin: Accounting Groups (AG)
● Job = unique AG
● Added filers, fixed drivers
Issues:
● Accountant overload
● Slow finishing procs
Farm: many filers, general schedd server, workstation schedds (schedule certain jobs), middleware, central manager, 16 Mac procs

Accounting Groups: Take 1
Every job got some resources, but not enough to finish quickly for Production. Moved quickly to Take 2...

Accounting Groups: Take 2
Software/Policies:
● Shots get a unique AG
● Unify schedds to fix out-of-order cases
Issues:
● Wanted: farm % priority
● Classic schedd overload: "Claimed Idle"s
Farm: 360 procs, many filers, general schedd server, fewer workstation schedds (schedule certain jobs), middleware, central manager, 32 Mac procs
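As a minimal sketch of the per-shot scheme described above: a Condor submit description can attach a job to an accounting group by setting the AccountingGroup attribute. The group name "shot_seq010_sh020", the user "render", and the arguments are illustrative, not taken from the talk.

```
# Hypothetical submit description: tag every render job for one shot
# with that shot's accounting group, so the negotiator shares the farm
# by shot rather than by submitting user.
universe         = vanilla
executable       = condor_render
arguments        = --frame $(Process)
+AccountingGroup = "shot_seq010_sh020.render"
queue 100
```

With one group per shot, fair-share accounting happens at the work-unit level the production actually cares about, which is the motivation given on the "How CG Productions Work" slide.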

Accounting Groups: Final?
Software/Policies:
● "Priority User" - p1 p2 p3
● Multiple servers & schedds
● ASAP & department flags
Issues:
● Department "pools"
● Preemption = bad
Farm: 500 procs, many filers, 3 schedd servers, middleware, central manager, 32 Mac procs

Accounting Groups: Final?
Sharing power is a difficult task for anyone, especially for users with deadlines. We needed a quality-of-service guarantee: resources always available to each department, without preemptive department pools...

Group Quotas Save the Day
Software/Policies:
● Department groups: g_lfx, g_mdl, g_chr, etc.
● Quality of service
● Nighttime priority
Issues:
● Long negotiation cycles: total cycle 6 minutes, server loads >6
Farm: 1000 procs, many filers, 3 schedd servers, middleware, central manager, 64 Mac procs
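A hedged sketch of what the department-group quotas above could look like in central manager (negotiator) configuration. The group names follow the slide (g_lfx, g_mdl, g_chr); the quota numbers are invented for illustration and do not come from the talk.

```
# Hypothetical negotiator configuration: carve the 1000-proc farm into
# guaranteed per-department quotas.
GROUP_NAMES           = g_lfx, g_mdl, g_chr
GROUP_QUOTA_g_lfx     = 400
GROUP_QUOTA_g_mdl     = 350
GROUP_QUOTA_g_chr     = 250
# Allow a group to exceed its quota when other groups leave machines idle:
GROUP_AUTOREGROUP     = True
```

This is what provides the quality-of-service guarantee: each department is assured its quota without needing machine-specific "pools" or preemption.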

Middleware Performance Optimization
Goal: speed up the negotiator
● Remove many groups
● Significant attributes (SIGNIFICANT_ATTRIBUTES)
● Schedd submit algorithm
● Separate middleware & central manager servers
● Negotiator cycle delay: 20 sec => 3 sec (NEGOTIATOR_CYCLE_DELAY)
Farm: 1000 procs, many filers, 2 schedd servers, central manager, 64 Mac procs
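The two configuration knobs named on this slide can be sketched as follows. The delay values mirror the talk (20 seconds cut to 3); the attribute list for SIGNIFICANT_ATTRIBUTES is illustrative, since the talk does not say which attributes CORE kept.

```
# Hypothetical tuning sketch for the knobs named above.
# Start a new negotiation cycle 3s after a change instead of the 20s default:
NEGOTIATOR_CYCLE_DELAY = 3
# Restrict the attributes used to auto-cluster jobs, so the thousands of
# near-identical render jobs negotiate as a handful of classes rather than
# one at a time (attribute list is an assumption):
SIGNIFICANT_ATTRIBUTES = Requirements, Rank, AccountingGroup
```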

Optimization Results
Performance before => after:
● Removed groups: 6 => 5.5 min
● Significant attributes: 5.5 => 3 min
● Schedd algorithm: 3 => 1.5 min
● Separate servers: 1.5 => 0.6 min
● Cycle delay: 0.6 => 0.33 min
● Server loads: <1 middleware, <2 central manager

Lessons Learned
● Remove preemption where possible
● Simplify startd/negotiator (control) policies:
   ● Make consistent / remove special cases
   ● Understandable farm behavior
● Keep server functions simple
● Use accounting groups to guarantee relative percentage allocation of resources
● Use group quotas instead of machine-specific RANK policies for better throughput
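The first lesson, removing preemption, can be expressed as a small negotiator configuration sketch; this is a standard way to turn preemption off pool-wide, not necessarily the exact settings CORE used.

```
# Hypothetical sketch: never let the negotiator preempt a running job
# for a higher-priority user or group.
PREEMPTION_REQUIREMENTS = False
# With preemption disabled, the preemption ranking expression is moot:
PREEMPTION_RANK         = 0
```

For a render farm, letting jobs run to completion wastes fewer CPU-hours than killing and restarting partially finished frames, which is why preemption hurt throughput here.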

Thank You
Condor Team, University of Wisconsin
CORE
Any questions?