HTCondor Project Plans Zach Miller OSG AHM 2013

Even More Lies Zach Miller OSG AHM 2013

chtc.cs.wisc.edu Outline › Fewer Lies / Predictions › More Accomplishments › Some hints at Future Work

chtc.cs.wisc.edu HTCondor › As you’ve probably noticed, the project’s name has changed. › HTCondor specifically refers to the software developed by the Center for High-Throughput Computing at UW-Madison. › However, the code, the names of binaries, and all configuration file names and entries have NOT changed.

chtc.cs.wisc.edu User Tools › condor_ssh_to_job  If enabled by the admin, allows users to get a shell in the remote execution sandbox, as the user that is running the job  Great for debugging!
% condor_ssh_to_job
Welcome to
Your condor job is running with pid(s)
> ls
condor_exec.exe  _condor_stderr  _condor_stdout
> whoami
zmiller
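A minimal sketch of how an administrator might enable this, assuming the ENABLE_SSH_TO_JOB configuration knob is the one that controls it on the execute side:
# In the execute node's condor_config: allow condor_ssh_to_job into running jobs
ENABLE_SSH_TO_JOB = True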

chtc.cs.wisc.edu User Tools › condor_q -analyze  No longer needs to fetch user priorities (though it still can), which removes the need to contact the negotiator daemon and greatly improves response time on busy pools  Can also analyze the Requirements of the condor_startd, in addition to the Requirements of the job
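For example (the job ID 1234.0 is hypothetical):
# Analyze why a job is idle, without contacting the negotiator:
% condor_q -analyze 1234.0
# Include the machine-side Requirements in the analysis:
% condor_q -better-analyze 1234.0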

chtc.cs.wisc.edu User Tools › condor_tail  Allows users to view the output of their jobs while the job is still running  Like the UNIX “tail -f”, it allows following the contents of a file (real-time streaming)  Not yet part of the HTCondor release (but should be there Real Soon Now™)

chtc.cs.wisc.edu User Tools › condor_qsub  Allows a user to submit a PBS or SGE job to HTCondor directly  Translates the command-line arguments, as well as inline (#PBS or #$) directives, to their equivalent condor_submit commands  This is in no way complete: we are not hoping to emulate every feature of qsub, but rather to capture the main functionality that covers the majority of simple use cases.
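A sketch of the intended usage; the script name is hypothetical and only the common output/error directives are shown:
% cat job.sh
#!/bin/sh
#PBS -e job.err
#PBS -o job.out
./my_program
% condor_qsub job.sh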

chtc.cs.wisc.edu User Tools › condor_ping  Tests the authentication, mapping, and authorization of a user submitting a job to the running HTCondor daemons  Tries to provide helpful debugging info in the case of failure  Similar in spirit to globusrun -a
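A sketch of how a user might check their submit-side access, assuming the local schedd and the WRITE authorization level used for submission:
# Can I authenticate to, and am I authorized to WRITE to, the local schedd?
% condor_ping -verbose -type SCHEDD WRITE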

chtc.cs.wisc.edu Admin Tools › condor_ping  Tests the authentication, mapping, and authorization of daemon-to-daemon communications of running HTCondor daemons  Helps assure administrators they have configured things correctly  Tries to provide helpful debugging info in the case of failure
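As a sketch for the daemon-to-daemon case (the exact workflow here is an assumption): run from an execute node to verify that it can authenticate to the pool's collector at the DAEMON authorization level:
# Check daemon-level trust between this host and the central manager's collector
% condor_ping -verbose -type COLLECTOR DAEMON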

chtc.cs.wisc.edu Admin Tools › condor_who  Shows which jobs, owned by which users, are running on the local machine  Does not depend on contacting HTCondor daemons – it gets all info from logs and ‘ps’
% condor_who
OWNER  CLIENT             SLOT  JOB  RUNTIME  PID  PROGRAM
       ingwe.cs.wisc.edu              :00:          /scratch/

chtc.cs.wisc.edu Networking › IPv6 support in HTCondor has been around for some time, but we continue to test it and harden that code › The condor_shared_port daemon allows an HTCondor instance to listen on a single port, easing configuration in firewalled environments › We would like this to be the default, with port 9618 (registered by name with IANA)
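A minimal configuration sketch, using the knob names documented for HTCondor releases of this era:
# Start the shared port daemon and route all incoming traffic through it
DAEMON_LIST = $(DAEMON_LIST) SHARED_PORT
USE_SHARED_PORT = True
# Listen on the IANA-registered port
SHARED_PORT_ARGS = -p 9618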

chtc.cs.wisc.edu Security › Creation of an official security policy › Documented on the HTCondor web site  Reporting process  Release process  Known vulnerabilities › Running Coverity scans nightly

chtc.cs.wisc.edu Security › Added ability to increase number of bits in delegated proxies › condor_ping already mentioned › Audit Log  Will record the authenticated identity of each user and as much information as is practical about each job that runs  In progress…
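As a sketch, assuming the GSI_DELEGATION_KEYBITS knob is the one that controls the key length used when delegating a proxy:
# Request 2048-bit keys for delegated proxies (knob name is an assumption)
GSI_DELEGATION_KEYBITS = 2048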

chtc.cs.wisc.edu Sandboxing › Per-job PID namespaces  Makes it impossible for jobs to see outside of their own process tree, and therefore unable to interfere with the system or with other jobs, even if they are owned by the same user  Also allows filesystem namespaces, so that each job can have its own /tmp directory, which is actually mounted inside HTCondor’s temporary execute directory
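A configuration sketch using the corresponding knobs on the execute node (the directory list is illustrative):
# Put each job in its own PID namespace
USE_PID_NAMESPACES = True
# Give each job a private /tmp and /var/tmp, mounted inside its execute directory
MOUNT_UNDER_SCRATCH = /tmp, /var/tmp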

chtc.cs.wisc.edu Sandboxing › cgroups  Allows for more accurate accounting of a job’s memory and CPU usage  Guarantees proper cleanup of jobs
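A sketch of enabling cgroup-based tracking, assuming a cgroup named htcondor has been set up on the execute node:
# Track each job's processes under this cgroup
BASE_CGROUP = htcondor
# Soft limit: a job may exceed its requested memory if the machine has memory to spare
CGROUP_MEMORY_LIMIT_POLICY = soft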

chtc.cs.wisc.edu Partitionable Slots › p-slots  Can contain “generic” resources, beyond CPU, Memory, Disk  Now work in HTCondor’s “parallel” universe  Support for “quick claiming”, where a whole machine can be subdivided and claimed in a single negotiation cycle › Can lead to “fragmenting” of a pool
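A minimal sketch of configuring one partitionable slot per machine, including a generic admin-defined resource (the resource name gpus is illustrative):
# One partitionable slot covering the whole machine
NUM_SLOTS = 1
NUM_SLOTS_TYPE_1 = 1
SLOT_TYPE_1 = 100%
SLOT_TYPE_1_PARTITIONABLE = True
# A "generic" resource beyond CPU/Memory/Disk; jobs would ask for it with request_gpus
MACHINE_RESOURCE_gpus = 2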

chtc.cs.wisc.edu Defrag › condor_defrag  A daemon which periodically drains and recombines a certain portion of the pool  Leads to the recreation of larger slots which can then be used for larger jobs  Necessarily causes some “badput” › condor_drain is a command-line tool that does this on an individual machine
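A configuration sketch for the central manager (the draining rates are arbitrary examples), plus the by-hand equivalent:
# Run the defrag daemon alongside the other daemons
DAEMON_LIST = $(DAEMON_LIST) DEFRAG
# How aggressively to drain, and how many whole machines to aim for
DEFRAG_DRAINING_MACHINES_PER_HOUR = 1.0
DEFRAG_MAX_WHOLE_MACHINES = 20
# Or drain a single machine manually (host name is hypothetical):
% condor_drain node01.example.edu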

chtc.cs.wisc.edu Statistics › Condor daemons can now report a wide variety of statistics to the condor_collector  Statistics about the daemons, like response times to incoming requests  About jobs and quantities of data transferred › What is yet to be done is to include tools that help make sense of those statistics, as either a new and improved CondorView or as a Gratia probe
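For example, the raw numbers can already be pulled out of the daemon ads today (a sketch; attribute names vary by daemon and version):
# Dump the statistics attributes a schedd publishes (many start with "Recent")
% condor_status -schedd -long | grep -i Recent
# Ask the schedd to publish a more detailed statistics level
STATISTICS_TO_PUBLISH = SCHEDD:2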

chtc.cs.wisc.edu Scalability › Working on reducing the memory footprint of daemons, particularly the condor_shadow › Queued file transfers are now processed round-robin instead of FIFO, so individual users are not starved › ClassAd caching in the schedd and collector has resulted in 30-40% savings in memory

chtc.cs.wisc.edu HTCondor Version › Will contain all of the above goodness › Should be released approximately “during HTCondor week”, April 29 – May 3 › What lies beyond?

chtc.cs.wisc.edu Future Work › Scalability  We always need to be improving this in an attempt to stay ahead of Igor the curve  New tools will be needed – nobody wants to run condor_status and see individual state for 100,000 cores  Reducing the amount of per-job memory used on the submit machine  Collector hierarchies to deal with high-latency, wide area cloud pools

chtc.cs.wisc.edu Future Work › Weather report: 100% chance of clouds  Support for more types of resource acquisition models: EC2 Spot Instances, Azure, OpenStack, The Next Big Thing™  Simple creation of single-purpose clusters: homogeneous, ephemeral, single user, single job  Seamless integration of cloud resources with locally deployed infrastructure

chtc.cs.wisc.edu Future Work › Dealing with more hardware complexity  More and more cores  GPUs › Simplifying deployment on widely disparate use cases › Improve support for black-box / commercial applications › Meeting increasing data challenges

chtc.cs.wisc.edu Conclusion › Many things accomplished… › Many more to do… › Questions? Ask me! Or me at › Thanks!