Interactive MPI on Demand
Greg Thain, Computer Sciences Department, University of Wisconsin-Madison (cs.wisc.edu)

Unix Tool Philosophy
› 1) Individual tools do one thing well
› 2) Tools communicate via ASCII streams
› 3) Tools are composable

The Paradox
› Universal assent that it's good
› Yet almost no one uses it, except for shell one-liners:
  grep '^abc' | sort | uniq -c | sort -n
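
A hedged expansion of that one-liner, assuming bash and a hypothetical input file access.log, shows each tool doing exactly one job:

grep '^abc' access.log |  # select lines beginning with "abc"
  sort |                  # bring identical lines together
  uniq -c |               # count each distinct line
  sort -n                 # order by count, ascending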

More Than Just Shell Scripts
Dividing work across Unix processes provides:
› Restartability
› Better security
› Scalability across multiple cores

For Example: qmail
› Secure and stable
› Implemented as roughly a dozen cooperating processes

Getting Back to Condor…
› Condor uses this design in some places:
  - The GAHPs
  - condor_master
  - Replaceable shadow/starter pairs
  - multi_shadow vs. many shadows
› But not everywhere: not the schedd

Condor Daemons as Components
› A very successful strategy:
  - Glide-in
  - Personal Condor (see the sketch below)
  - "Hoffman" and schedds as jobs
  - Condor-C
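
A minimal sketch of the personal Condor pattern, assuming bash, hypothetical paths, and a private condor_config that lists the daemons to run:

export CONDOR_CONFIG=$HOME/personal-condor/condor_config  # private config, not the system one
mkdir -p $HOME/personal-condor/{spool,log,execute}        # private working directories
condor_master -f                                          # master starts the configured daemons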

Case Study: MPI on Demand
› The problem:
  - A pool with lots of machines
  - Running very long (weeks) vanilla jobs
  - Need to run big but short MPI jobs
  - Can't reboot the startds
› MPI needs the dedicated scheduler, which requires dedicated machines

Possible Solutions
› Add a "suspension slot"
  - Requires a reboot
› Submit the MPI job normally
  - Preempts the vanilla job

COD Refresher
› COD: Computing On Demand
  - No scheduling
  - No file transfer
  - When a COD job runs, the vanilla job suspends ("checkpoint to swap")
  - Needs security configured to work: COD users must be explicitly allowed (see the sketch below)
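
A hedged sketch of that last point, assuming the VALID_COD_USERS configuration macro from the Condor manual and a hypothetical user gthain:

# condor_config on the execute machines: only listed users may issue COD commands
VALID_COD_USERS = gthain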

Startd as a COD Job
› Overview:
› Launch a personal Condor
› Run startds as COD jobs on the base pool
  - They report to the personal Condor
  - The base pool's jobs suspend
› Submit the parallel job to the personal Condor
› Remove the COD startds

Startd Under COD: Details
› Two condor_config files are involved: careful!
› COD provides no file transfer
  - Can reuse the existing startd binary
  - Must pre-stage the config file or put it on NFS (sketched below)
› Don't lose the claim ID!
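
A hedged sketch of what the pre-staged /nfs/new_config might contain, assuming the personal Condor's collector runs on a hypothetical host mypc.cs.wisc.edu:

# /nfs/new_config: make the COD-launched startd report to the personal pool
CONDOR_HOST = mypc.cs.wisc.edu
DAEMON_LIST = MASTER, STARTD
LOCAL_DIR   = /tmp/p-condor
RELEASE_DIR = /usr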

Example Code
HOSTS="a b c"
for h in $HOSTS; do
  condor_cod request -name $h > claimid.$h  # save each claim ID to a file
done
for n in claimid.*; do
  condor_cod activate -id `cat $n` -jobad ja
done
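
The slides don't show the removal step; a sketch, assuming the saved claim-id files above and the standard condor_cod subcommands:

for n in claimid.*; do
  condor_cod deactivate -id `cat $n`  # stop the COD startd; the suspended vanilla job resumes
  condor_cod release -id `cat $n`     # return the claim to the base pool
done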

COD Job Ad (the file "ja" passed to -jobad above)
› CMD = "/nfs/path/run-startd.sh"
› IWD = "/tmp"
› Out = "startd.out"
› Err = "startd.err"
› Universe = 5 (vanilla)

Run-startd.sh
#!/bin/bash
mkdir -p p-condor/{spool,log,execute}   # personal Condor working directories
export CONDOR_CONFIG=/nfs/new_config    # the pre-staged config from above
exec /usr/sbin/condor_master -f -t      # -f: foreground, -t: log to the terminal
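
To finish the overview's "submit the parallel job" step, a hedged sketch of a parallel-universe submit file for the personal Condor, with hypothetical names throughout:

# mpi_job.sub: submit with "condor_submit mpi_job.sub"
universe      = parallel
executable    = /nfs/path/my_mpi_app
machine_count = 8
log           = mpi.log
queue

Running condor_status against the personal pool should show the COD startds before submitting.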

Summary
› Use Condor daemons as components
› Mix and match as needed

Questions?
› Thank you!