Condor-G – A few lessons learned, by Igor Sfiligoi (UCSD). Condor Week 2010, Madison, Apr 2010.

Presentation transcript:

Condor Week 2010
Condor-G – A few lessons learned
by Igor Sfiligoi, UCSD

How do I use Condor-G
● GlideinWMS
● Part of the development team
● Operate the OSG glidein factory
● Condor-G is used for glidein submission
● OSG Scalability, Reliability and Usability area
● Testing CE scalability using Condor-G
  – Both production and in-development CE software

Condor-G is old, nothing to learn
● This is what one would expect
● This is why we use Condor-G in glideinWMS
● Solid reliable technology
● Use-and-forget
● etc.
● Turns out, it is not really like that

Condor-G is not old
● Just been around for a long time
● since v6.2 – year 2001
● It is changing all the time
● New Grid universes added with time
  – v6.8 adds Condor, GT3 and GT4
  – v7.0 adds nordugrid, unicore, pbs and lsf
  – v7.2 adds Amazon EC2
  – v7.4 adds CREAM and GT5
● Existing protocols being tuned

Some benchmarks first
● As part of the OSG Scalability effort
● Tested
● GT2 (still by far the most used Grid universe in OSG)
● GRAM5
● CREAM
● Using a mix of v7.4.X and v7.5.X

What about the other universes?
● GT4, the WS-based Globus Gatekeeper, never gained wide acceptance in OSG
● Plus Globus has deprecated it (not in Globus 5)
● I plan to test Nordugrid soon
● Just did not have time yet
● Amazon is interesting but it is quite different
● Unicore, pbs, lsf and Condor-C are not interesting for OSG right now
Do I need to mention GT3?

GT2 performance
● How fast can I submit GT2 jobs?
● It really depends what you want to measure (with 10 jobmanagers, single user)
● Job submission alone – 45 jobs/min
● Job submission while jobs terminating – 33 jobs/min
● Processing job termination – 15 jobs/min
● Processing job termination while jobs being submitted – 7.5 jobs/min
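
To make these rates concrete, the following minimal sketch converts the four measured GT2 rates into wall-clock time for a batch of jobs. Only the rates come from the benchmark; the batch size of 1000 jobs and the helper name `minutes_for` are assumptions chosen for illustration, and the calculation assumes a steady rate, which real submissions will not have.

```python
# Rates reported in the talk (jobs per minute, 10 jobmanagers, single user).
GT2_RATES = {
    "submit_only": 45.0,
    "submit_while_terminating": 33.0,
    "terminate_only": 15.0,
    "terminate_while_submitting": 7.5,
}


def minutes_for(jobs: int, rate_per_min: float) -> float:
    """Return the minutes needed to process `jobs` at a steady rate."""
    if rate_per_min <= 0:
        raise ValueError("rate must be positive")
    return jobs / rate_per_min


if __name__ == "__main__":
    batch = 1000  # hypothetical batch size, not taken from the talk
    for phase, rate in GT2_RATES.items():
        print(f"{phase}: {minutes_for(batch, rate):.1f} min")
```

Under these assumptions the slowest phase dominates: at 7.5 jobs/min, processing the termination of 1000 jobs takes a little over two hours.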

GT2 performance
● How fast can I submit GT2 jobs?
● It also depends on how far away you are from the Grid gatekeeper
● Submitting across the world will cut rates in half

GT2 performance
● How fast can I submit GT2 jobs?
● It also depends on how many jobs you have in the queue
● Try submitting 20k jobs at the same time and you may have to wait hours to get the first job started!
● Use dagman -maxidle 1000 if you have a lot of jobs
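
One way to apply the throttle above is to wrap the jobs in a DAG with one independent node per job and submit it with `condor_submit_dag -maxidle`. The sketch below only builds the DAG text and the command line; the file names (`job0.sub`, `many_jobs.dag`) and both helper functions are hypothetical, and each submit file is assumed to queue a single job.

```python
def build_dag(submit_files: list[str]) -> str:
    """Return DAG text with one independent JOB node per submit file."""
    lines = [f"JOB job{i} {name}" for i, name in enumerate(submit_files)]
    return "\n".join(lines) + "\n"


def submit_command(dag_path: str, max_idle: int = 1000) -> list[str]:
    """Build the condor_submit_dag command line with an idle-job throttle."""
    return ["condor_submit_dag", "-maxidle", str(max_idle), dag_path]


if __name__ == "__main__":
    # Hypothetical submit files; a real run would list thousands of them.
    print(build_dag([f"job{i}.sub" for i in range(3)]), end="")
    print(" ".join(submit_command("many_jobs.dag")))
```

With no dependencies between nodes, DAGMan releases new jobs only while the number of idle jobs stays below the limit, so the queue never holds the full 20k at once.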

GRAM5
● What is GRAM5?
● The new non-WS Gatekeeper coming with Globus 5
● Major scalability improvement over GT2
● Almost backwards compatible with GT2
● Removed just streaming support
● Too bad Condor-G relies on this feature in GT2!
● Condor 7.4 adds explicit GT5 support

GRAM5 performance
● Is it really faster than GT2?
● Yes!
● Both submit and termination rates exceeded 100 jobs/min
● But a lot of variation in time
● Average still over 2x of GT2
● Globus still has bugs
● Not ready for production yet, but looks promising!

CREAM
● What is CREAM?
● A brand new Grid Gatekeeper coming from Italy
● Web-based, runs inside Tomcat
● State stored in a relational database
● Support of multi-home deployments planned
● Uses its own protocol
● Condor 7.4 adds the CREAM Grid universe

CREAM performance
● How does it compare to GT2 and GT5?
● Much faster!
● Consistent rates in the 50 jobs/minute range
● About 5x faster than GT2, and 2x faster than GT5
● But the reliability is dismal
● Out of 10k jobs, over 2k failed to run!
● We managed to overload the Condor-G client!

How did CREAM overload the client?
● CREAM moves files directly to and from the worker nodes
● No intermediate staging to the CE like in Globus
● Each worker node will independently contact the GridFTP server running on the Condor-G node
● No coordination between WNs
[Diagram: flow of the job, input files and output files between Condor-G, the gatekeeper and the WN, shown once for CREAM and once for Globus]

GridFTP on the client?
● Wait! I have a GridFTP running on my client?
● That can access any file I own?
● Can read /etc/passwd? My tax return?
● Can overwrite ~/.bashrc? ~/.ssh/authorized_keys?
● My proxy is of course needed, but I delegated the proxy to the Grid node!
● I am not sure I want to use CREAM anymore
● But Globus uses an equivalent GASS Server

Condor-G basically insecure!
● It takes a lot of trust to use Condor-G
● At least for GT2, GT5 and CREAM
● If any Grid admin wants to compromise your client node, they can
● And there is no way to prevent that!
● The good news is that vanilla Condor is much more secure
● Condor team tells me remote admins can only access/modify files in the submit directory
You have to log in, right?
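
The difference between the two trust models comes down to which paths a remote party may touch. Purely as an illustration (this is not how Condor, GridFTP or the GASS server is implemented), the sketch below shows the kind of containment check that would confine access to a submit directory. The function name and the example paths are hypothetical, and a Unix-like file system is assumed.

```python
import os


def is_inside(base_dir: str, requested: str) -> bool:
    """Return True if `requested` resolves to a path within `base_dir`."""
    base = os.path.realpath(base_dir)
    # Resolve relative paths against base_dir, then follow symlinks and "..".
    target = os.path.realpath(os.path.join(base, requested))
    return os.path.commonpath([base, target]) == base


if __name__ == "__main__":
    submit_dir = "/srv/submit/job1"  # hypothetical submit directory
    for path in ("out/result.txt", "/etc/passwd", "../job2/secret"):
        print(path, is_inside(submit_dir, path))
```

A server that ran every request through a check like this would refuse /etc/passwd or ~/.ssh/authorized_keys; a server that accepts any path the proxy owner can reach offers no such boundary.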

What about portals?
● Portals can submit Grid jobs for multiple users
● Using user-provided proxies
● I know of portals that use a single system account to host all the user files
● No user run code on the portal node, so why not?
● Makes administration easier
● Use of Condor-G gives every user they serve access to all portal files

The glideinWMS factory
● The glideinWMS factory used that philosophy
● Had to find a solution!
● We now have one user account per user served
● For storing user files and submitting user jobs
● But no logins!
● Factory processes run as a dedicated user
● Need a way to switch to the user account when needed
● But do not want to run as root!

Condor privsep
● Since v7.0, Condor ships with a tool that allows account switching without becoming root
● Sort of a lightweight sudo
● But more limited in scope, thus
  – easier to configure
  – more secure, if you can live within the limits
● GlideinWMS v2.4 now uses Condor privsep to operate safely with Condor-G

Condor privsep
● Documentation was a problem
● The switchboard configuration is actually well documented
● But how to use it... had to look through the code!
● It is relatively easy to use
● But not by hand
● Wrote a Python module that made it simple to use in glideinWMS

Summary
● Condor-G a wonderful tool to access the Grid
● Easy to use, flexible, etc.
● But its security model can be a real problem
● Not really Condor's fault
● But users only know about Condor, so security implications should be clearly stated
● Condor privsep can help
● Although this is currently not a supported use-case

Condor Week 2010
Additional resources

Links
● OSG Scalability area results
● OSG Scalability test tools
● GlideinWMS home page

Acknowledgements
● Many thanks to the whole Condor team for the continuous support
● And in particular to Dan Bradley and Jaime Frey
● This work was partially supported by
● DoE grant DE-FC02-06ER41436
● NSF grants PHY , PHY and OCI