Condor Week Summary March 14-16, 2005 Madison, Wisconsin

Overview Annual meeting at UW-Madison. About 80 participants at this year's meeting, coming from universities, research labs, and industry. A single plenary track with talks from users and developers.

Overview Topics ranged from basic to advanced. Selected highlights in today’s talk. Slides from this year’s talks can be found at

CondorWeek Topics: distributed computing and Condor; data handling and Condor; 3rd-party contributions to Condor; reports from the field; Condor roadmap.

Condor Grids (by Alan De Smet) Various alternatives for accessing remote computing resources (distributed computing, flocking, Globus/Condor-G, Condor-C, etc.). Discussed the pros and cons of each approach (ACF uses Globus/Condor-G).

Condor-G Status and News Globus Toolkit 2 is stable. Globus Toolkit 3 is supported – but we think most people are moving to… Globus Toolkit 4, which is in progress – the GT4 beta works now in Condor – Condor will officially support it soon after the official GT4 release.

Glidein (by Dan Bradley) You have access to a cluster running some other batch system. You want Condor features, such as –queue management –matchmaking –checkpoint migration

What Does Glidein Do? Installation and setup of Condor. –May be done remotely. Launching Condor. –Through Condor-G submission to Globus. –Or you run the startup script however you like.
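A rough sketch of what such a Condor-G submission might look like, assuming a hypothetical Globus gatekeeper address and a locally prepared glidein startup script (both names are illustrative, not from the talk):

    # Condor-G submit description: run the glidein startup script
    # on a remote Globus resource (hostname and script are illustrative).
    universe            = globus
    globusscheduler     = gatekeeper.example.edu/jobmanager-pbs
    executable          = glidein_startup.sh
    transfer_executable = true
    output              = glidein.out
    error               = glidein.err
    log                 = glidein.log
    queue

Once the remote daemons start, the borrowed nodes report into the user's own Condor pool and ordinary matchmaking takes over.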

Condor and DBMS (by Jeff Naughton) Premise: A running Condor system is awash in data: –Operational data –Historical data –User data DBMS technology can help capture, organize, manage, archive, and query this data.

Three potential levels of involvement: 1. Passively collect and organize data, expose it through DB query interfaces. 2. Move/extend some data-related portions of Condor to a DBMS (Condor writes to and reads from the DBMS). 3. Provide services to help users manage their data.

Why do this? For Condor administrators – easier to analyze and troubleshoot; – easier to audit; – easier to explore current and past system status and behavior.

Our projects and plans Quill: Transparently provide a DBMS query interface to job_queue and history data. [ready to deploy!] CondorDB: Transparently captures and provides an interface to critical data from all Condor daemons. [status: partial prototype working in our own “sandbox”]

Quill Job ClassAd information mirrored into an RDBMS. Both active jobs and historical jobs. Benefits BOTH scalability and accessibility. [Architecture diagram: the Quill component alongside the Schedd mirrors the Job Queue log into RDBMS queue and history tables, which the Master, Startd, and other daemons can then query.]
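Because the mirrored data lives in an ordinary RDBMS, it can be inspected with plain SQL. A hypothetical query sketch (the table and column names below are illustrative, not the actual Quill schema):

    -- Count completed jobs per owner from the mirrored history data
    -- (table and column names are assumed for illustration).
    SELECT owner, COUNT(*) AS completed_jobs
    FROM   history_jobs
    WHERE  jobstatus = 4        -- 4 = Completed in Condor's JobStatus encoding
    GROUP  BY owner
    ORDER  BY completed_jobs DESC;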

Longer-term plans Tight integration of DBMS technology and Condor [status: thinking hard!]. DBMS-inspired data management services to help Condor users manage their own data. [status: thinking really hard!]

Stork (by Tevfik Kosar) Condor tool for data movement. First available in v Will be included in next stable release (6.8.0). Prototypes deployed at various sites.

[Slide: data-intensive application areas motivating data placement – Bioinformatics (BLAST), High Energy Physics (LHC), Astronomy (LSST, 2MASS, SDSS, DPOSS, GSC-II, WFCAM, VISTA, NVSS, FIRST, GALEX, ROSAT, OGLE, ...), and Educational Technology (WCER EVP) – with annual data volumes quoted as 500 TB/year, 2-3 PB/year, 11 PB/year, and 20 TB - 1 PB/year.]

Stork: Data Placement Scheduler First scheduler specialized for data movement/placement. De-couples data placement from computation. Understands the characteristics and semantics of data placement jobs. Can make smart scheduling decisions for reliable and efficient data placement.

Stork can also: allocate/de-allocate (optical) network links; allocate/de-allocate storage space; register/un-register files with a Meta Data Catalog; locate the physical location of a logical file name; control concurrency levels on storage servers.
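A sketch of a Stork data-placement job in the ClassAd-style description format used by Stork (the URLs are placeholders); such a description would be handed to stork_submit:

    [
      // Move a file from local disk to a GridFTP server
      // (source and destination URLs are placeholders).
      dap_type = "transfer";
      src_url  = "file:///scratch/data/run042.dat";
      dest_url = "gsiftp://storage.example.edu/data/run042.dat";
    ]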

Storage Management (by Jeff Weber) NeST (Network Storage Technology) is another project at UW-Madison. It is to be coupled to Condor and Stork. No stable release is available yet.

Overview of NeST NeST: Network Storage Technology. Lightweight: configuration and installation can be performed in minutes. Multi-protocol: supports Chirp, GridFTP, NFS, HTTP – Chirp is NeST's internal protocol. Secure: GSI authentication. Allocation: NeST negotiates “mini storage contracts” between users and the server.

Why storage allocations? Users need both temporary storage and long-term guaranteed storage. Administrators need a storage solution with configurable limits and policy. Administrators will also benefit from NeST's autonomous reclamation of expired storage allocations.

Storage allocations in NeST Lot – an abstraction for a storage allocation with an associated handle – the handle is used for all subsequent operations on this lot. The client requests a lot of a specified size and duration. The server accepts or rejects the client's request.

Condor and SRM (by Derek Wright) Coordinate computation and data movement with Condor. A Condor ClassAd hook (STARTD_CRON_JOBS) queries the DRM for files in its cache and publishes them in the ClassAd of each node (a configuration sketch follows). An FSM keeps track of all files required by jobs in the system and contacts the HRM if required files are missing. Regular Condor matchmaking then schedules jobs where the files already exist.
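A rough sketch of such a hook configuration; the knob names follow the startd-cron style of configuration (the exact STARTD_CRON_* syntax varies across Condor versions), and the job name, script path, and published attribute are purely illustrative:

    # Run a probe every 5 minutes and merge its output into the
    # machine ClassAd (job name, script, and attribute are illustrative).
    STARTD_CRON_JOBLIST = DRMCACHE
    STARTD_CRON_DRMCACHE_EXECUTABLE = /usr/local/bin/query_drm_cache.sh
    STARTD_CRON_DRMCACHE_PERIOD = 5m

    # The probe script prints ClassAd attributes on stdout, e.g.
    #   CachedFiles = "lfn_001.root,lfn_002.root"
    # which matchmaking expressions can then reference.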

3rd-party contributions to Condor High availability features (Technion Institute). Privilege separation in Condor (Univ. of Cambridge). Optimizing Condor throughput (CORE Feature Animation). Web interface to Condor (University College London).

[Diagram: the current Condor pool – a single Central Manager running the Collector and Negotiator, with the Startd and Schedd nodes reporting to it.]

[Diagram: a Highly Available Condor Pool – one active Central Manager plus idle backup Central Managers together form the Highly Available Central Manager; the Startd and Schedd nodes talk to whichever manager is active.]

Highly Available Central Manager Our solution - Highly Available Central Manager –Automatic failure detection –Transparent failover to backup matchmaker (no global configuration change for the pool entities) –“Split brain” reconciliation after network partitions –State replication between active and backups –No changes to Negotiator/Collector code
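A rough configuration sketch for two candidate central-manager machines (host names and port numbers are illustrative, and the knob names should be checked against the condor_had documentation for the version in use):

    # Run the HAD and replication daemons alongside the usual
    # central-manager daemons on each candidate machine.
    DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR, HAD, REPLICATION

    # All potential central managers, in priority order.
    HAD_LIST            = cm1.example.edu:51450, cm2.example.edu:51450
    REPLICATION_LIST    = cm1.example.edu:41450, cm2.example.edu:41450
    HAD_USE_REPLICATION = TRUE

    # The master starts the negotiator only when HAD decides this
    # machine is the active central manager.
    MASTER_NEGOTIATOR_CONTROLLER = HAD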

What is privilege separation? Isolation of those parts of the code that run at different privilege levels. [Diagram: contrasts the privilege domains of root, the Condor daemons, and the Condor job without privilege separation and with privilege separation.]

Throughput Optimization (CORE Feature Animation) Performance before => after: ● Removed Groups: 6 => 5.5 min ● Significant Attributes: 5.5 => 3 min ● Schedd Algorithm: 3 => 1.5 min ● Separate Servers: 1.5 => 0.6 min ● Cycle delay: 0.6 => 0.33 min ● Server loads: <1 on the middleware machine, <2 on the Central Manager

Web Service Interface to Condor Facilitate the development of third-party applications capable of interacting with Condor (remotely). – E.g. build a higher-level, application-specific scheduler that submits jobs to multiple Condor pools based on application semantics. – These can be built using a wide range of languages/SOAP packages. – BirdBath has been tested with: Java (Apache Axis, XSUL), Python (ZSI), C# (.NET), C/C++ (gSOAP). Makes Condor accessible from platforms where its command-line tools are not supported/installed.
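Enabling the interface is a configuration change on the machines whose daemons should speak SOAP; a rough sketch (the port and host pattern are illustrative, and the knob names should be checked against the BirdBath documentation for the Condor version in use):

    # Turn on the Web Service (SOAP) interface.
    ENABLE_SOAP = TRUE
    # Restrict which hosts may issue SOAP requests (pattern is illustrative).
    HOSTALLOW_SOAP = *.example.edu
    # Give the schedd a fixed, well-known port for SOAP clients to contact.
    SCHEDD_ARGS = -p 8080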

Condor Plans (by Todd Tannenbaum) Condor 6.8.0 (stable series) available in May '05. Fail-over, persistence, and other features. Improved scalability and accessibility (APIs, Grid middleware, Web-based interfaces, etc.). Grid universe and security improvements.

Condor can now transfer job data files larger than 2 GB in size – on all platforms that support 64-bit file offsets. Real-time spooling of stdout/err/in in any universe, including VANILLA – enables real-time monitoring of job progress. The Condor installer on Win32 now uses MSI (thanks Micron!). condor_transfer_data (DZero). STARTD_VM_EXPRS (INFN). New condor_vacate_job tool. condor_status -negotiator. BAM! More tasty Condor goodness!

And More… New startd policy expression MaxJobRetirementTime – specifies the maximum amount of time (in seconds) that the startd is willing to wait for a job to finish on its own when the startd needs to preempt the job (see the configuration sketch below). New -peaceful option to condor_off and condor_restart. noop_job = True. Preliminary support for the Tool Daemon Protocol (TDP) – the TDP goal is to provide a generic way for scheduling systems (daemons) to interact with monitoring tools – users can specify a ``tool'' that should be spawned alongside their regular Condor job – on Linux, the ability to let a monitoring tool attach with ptrace() before the job's main() function is called.
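As a configuration sketch (the value is illustrative), a startd policy that gives every job up to two hours to finish on its own before an eviction actually takes effect could look like:

    # Let jobs run up to 2 hours (in seconds) after a preemption
    # decision before they are actually evicted.
    MaxJobRetirementTime = 2 * 60 * 60

Combined with the new -peaceful option, draining a machine waits for its running jobs to finish rather than killing them.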

Hey Jobs! We’re watching you! condor_starter enforces limits – the starter is already monitoring many job characteristics (image size, CPU usage, etc.) – threshold expressions: use more resources than you said you would, and BAM! Local Universe – just like the Scheduler Universe, but there is a condor_starter – all the advantages of the starter. [Diagram: on the submit machine the schedd spawns a starter for the job; on the execute machine the startd spawns a starter for the job – “Hey, job, behave or else!”]

ClassAd Improvements in Condor! Conditionals – IfThenElse(condition, then, else). String functions – strcat(), strcmp(), toUpper(), etc. StringList functions – example of a “string list” (CSV style): Mylist = “Joe, Jon, Jeff, Jim, Jake” – StrListContains(), StrListAppend(), StrListRemove(), etc. Others – type tests, some math functions.
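A small sketch of how these might be used from a submit file (attribute values are illustrative; function names are as listed in the talk):

    # Define a custom string-list attribute on the job and use the new
    # ClassAd functions in policy expressions (illustrative only).
    +Mylist      = "Joe, Jon, Jeff, Jim, Jake"
    Rank         = IfThenElse(StrListContains(MY.Mylist, "Jeff"), 10, 1)
    Requirements = (strcmp(toUpper(OpSys), "LINUX") == 0)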

Accounting Groups and Group Quota Support Accounting groups (w/ CORE Feature Animation). Accounting group quotas (inspired by Fermi). – Sample problem: a cluster w/ 500 nodes; the Chemistry Dept purchased 100 of them, and Chemistry users must always be able to use them. – Could use Machine Rank… but this ties jobs to specific machines. – Or could use the new group support: each group can be given a quota in the config file; job ads can specify group membership; group quotas are satisfied first; accounting is done by user and by group (see the configuration sketch below).
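A configuration sketch for the sample problem above (the group and user names are illustrative):

    # Central manager configuration: define the group and give it
    # a quota of 100 machines.
    GROUP_NAMES = group_chemistry
    GROUP_QUOTA_group_chemistry = 100

    # In a Chemistry user's submit file: claim membership in the group
    # (user name is illustrative).
    +AccountingGroup = "group_chemistry.alice"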

Improved Scalability Much faster negotiation – SIGNIFICANT_ATTRIBUTES determined automatically – the schedd uses non-blocking TCP connects to the startd – negotiator caching – the collector forks for queries – more…

What’s brewing for after v6.8.0? More data, data, data – Stork distributed w/ v6.8.0, incl. DAGMan support – NeST to manage Condor spool files and ckpt servers – Stork used for Condor job data transfers. Virtual Machines (and the future of the Standard Universe). Condor and Shibboleth (with Georgetown Univ). Least-Privilege Security Access (with U of Cambridge). Dynamic Temporary Accounts (with EGEE, Argonne). Leverage Database Technology (with UW DB group). ‘Automatic’ Glideins (NMI Nanohub – Purdue, U of Florida). Easier Updates. New ClassAds (integration with Optena). Hierarchical Matchmaking. Can I commit this to CVS??