European Condor Week 2006 - CDF experience with Condor glide-ins and GCB - Igor Sfiligoi1 European Condor Week 2006 Using Condor Glide-Ins and GCB to run.

Slides:



Advertisements
Similar presentations
Todd Tannenbaum Condor Team GCB Tutorial OGF 2007.
Advertisements

Jaime Frey Computer Sciences Department University of Wisconsin-Madison OGF 19 Condor Software Forum Routing.
Current methods for negotiating firewalls for the Condor ® system Bruce Beckles (University of Cambridge Computing Service) Se-Chang Son (University of.
Dan Bradley Computer Sciences Department University of Wisconsin-Madison Schedd On The Side.
Condor and GridShell How to Execute 1 Million Jobs on the Teragrid Jeffrey P. Gardner - PSC Edward Walker - TACC Miron Livney - U. Wisconsin Todd Tannenbaum.
First steps implementing a High Throughput workload management system Massimo Sgaravatto INFN Padova
The SAM-Grid Fabric Services Gabriele Garzoglio (for the SAM-Grid team) Computing Division Fermilab.
The Glidein Service Gideon Juve What are glideins? A technique for creating temporary, user- controlled Condor pools using resources from.
OSG Site Provide one or more of the following capabilities: – access to local computational resources using a batch queue – interactive access to local.
LcgCAF:CDF submission portal to LCG Federica Fanzago for CDF-Italian Computing Group Gabriele Compostella, Francesco Delli Paoli, Donatella Lucchesi, Daniel.
BaBar MC production BaBar MC production software VU (Amsterdam University) A lot of computers EDG testbed (NIKHEF) Jobs Results The simple question:
Building a distributed software environment for CDF within the ESLEA framework V. Bartsch, M. Lancaster University College London.
CDF Grid Status Stefan Stonjek 05-Jul th GridPP meeting / Durham.
3rd June 2004 CDF Grid SAM:Metadata and Middleware Components Mòrag Burgon-Lyon University of Glasgow.
CHEP 2003Stefan Stonjek1 Physics with SAM-Grid Stefan Stonjek University of Oxford CHEP th March 2003 San Diego.
Interactive Job Monitor: CafMon kill CafMon tail CafMon dir CafMon log CafMon top CafMon ps LcgCAF: CDF submission portal to LCG resources Francesco Delli.
CDF Grid at KISTI 정민호, 조기현 *, 김현우, 김동희 1, 양유철 1, 서준석 1, 공대정 1, 김지은 1, 장성현 1, 칸 아딜 1, 김수봉 2, 이재승 2, 이영장 2, 문창성 2, 정지은 2, 유인태 3, 임 규빈 3, 주경광 4, 김현수 5, 오영도.
Use of Condor on the Open Science Grid Chris Green, OSG User Group / FNAL Condor Week, April
Report from USA Massimo Sgaravatto INFN Padova. Introduction Workload management system for productions Monte Carlo productions, data reconstructions.
Tarball server (for Condor installation) Site Headnode Worker Nodes Schedd glidein - special purpose Condor pool master DB Panda Server Pilot Factory -
Evolution of a High Performance Computing and Monitoring system onto the GRID for High Energy Experiments T.L. Hsieh, S. Hou, P.K. Teng Academia Sinica,
Condor Project Computer Sciences Department University of Wisconsin-Madison Grids and Condor Barcelona,
Condor Week 2004 The use of Condor at the CDF Analysis Farm Presented by Sfiligoi Igor on behalf of the CAF group.
Pilot Factory using Schedd Glidein Barnett Chiu BNL
OSG Site Admin Workshop - Mar 2008Using gLExec to improve security1 OSG Site Administrators Workshop Using gLExec to improve security of Grid jobs by Alain.
Eileen Berman. Condor in the Fermilab Grid FacilitiesApril 30, 2008  Fermi National Accelerator Laboratory is a high energy physics laboratory outside.
Dan Bradley Condor Project CS and Physics Departments University of Wisconsin-Madison CCB The Condor Connection Broker.
HTCondor Security Basics HTCondor Week, Madison 2016 Zach Miller Center for High Throughput Computing Department of Computer Sciences.
Integration of Physics Computing on GRID S. Hou, T.L. Hsieh, P.K. Teng Academia Sinica 04 March,
Campus Grid Technology Derek Weitzel University of Nebraska – Lincoln Holland Computing Center (HCC) Home of the 2012 OSG AHM!
Gabi Kliot Computer Sciences Department Technion – Israel Institute of Technology Adding High Availability to Condor Central Manager Adding High Availability.
Open Science Grid Consortium Storage on Open Science Grid Placing, Using and Retrieving Data on OSG Resources Abhishek Singh Rana OSG Users Meeting July.
Why you should care about glexec OSG Site Administrator’s Meeting Written by Igor Sfiligoi Presented by Alain Roy Hint: It’s about security.
BaBar & Grid Eleonora Luppi for the BaBarGrid Group TB GRID Bologna 15 febbraio 2005.
3 Compute Elements are manageable By hand 2 ? We need middleware – specifically a Workload Management System (and more specifically, “glideinWMS”) 3.
CDF Monte Carlo Production on LCG GRID via LcgCAF Authors: Gabriele Compostella Donatella Lucchesi Simone Pagan Griso Igor SFiligoi 3 rd IEEE International.
UCS D OSG Summer School 2011 Life of an OSG job OSG Summer School A peek behind the scenes The life of an OSG job by Igor Sfiligoi University of.
Introduction to Distributed HTC and overlay systems Tuesday morning, 9:00am Igor Sfiligoi Leader of the OSG Glidein Factory Operations University of California.
Condor Week 2007Glidein Factories - by I. Sfiligoi1 Condor Week 2007 Glidein Factories (and in particular, the glideinWMS) by Igor Sfiligoi.
Introduction to the Grid and the glideinWMS architecture Tuesday morning, 11:15am Igor Sfiligoi Leader of the OSG Glidein Factory Operations University.
OSG Consortium Meeting - March 6th 2007Evaluation of WMS for OSG - by I. Sfiligoi1 OSG Consortium Meeting Evaluation of Workload Management Systems for.
UCS D OSG Summer School 2011 Overlay systems OSG Summer School An introduction to Overlay systems Also known as Pilot systems by Igor Sfiligoi University.
Condor Week 2006, University of Wisconsin 1 Matthew Norman Using Condor Glide-ins and GCB to run in a grid environment Elliot Lipeles, Matthew Norman,
Madison, Apr 2010Igor Sfiligoi1 Condor World 2010 Condor-G – A few lessons learned by Igor UCSD.
Condor Week 09Condor WAN scalability improvements1 Condor Week 2009 Condor WAN scalability improvements A needed evolution to support the CMS compute model.
Condor Week May 2012No user requirements1 Condor Week 2012 An argument for moving the requirements out of user hands - The CMS experience presented.
Honolulu - Oct 31st, 2007 Using Glideins to Maximize Scientific Output 1 IEEE NSS 2007 Making Science in the Grid World - Using Glideins to Maximize Scientific.
Arlington, Dec 7th 2006 Glidein Based WMS 1 A pilot-based (PULL) approach to the Grid An overview by Igor Sfiligoi.
IFAE Apr CDF Computing Experience - Gabriele Compostella1 IFAE Apr CDF Computing Experience Gabriele Compostella, University.
HTCondor Networking Concepts
LcgCAF:CDF submission portal to LCG
HTCondor Networking Concepts
HTCondor Security Basics
Dynamic Deployment of VO Specific Condor Scheduler using GT4
Operating a glideinWMS frontend by Igor Sfiligoi (UCSD)
Outline Expand via Flocking Grid Universe in HTCondor ("Condor-G")
Workload Management System
High Availability in HTCondor
The CMS use of glideinWMS by Igor Sfiligoi (UCSD)
Adding High Availability to Condor Central Manager Tutorial
Building Grids with Condor
HTCondor Security Basics HTCondor Week, Madison 2016
Condor Glidein: Condor Daemons On-The-Fly
The Condor JobRouter.
Condor: Firewall Mirroring
GRID Workload Management System for CMS fall production
Data Processing for CDF Computing
Condor-G Making Condor Grid Enabled
Condor-G: An Update.
Presentation transcript:

European Condor Week CDF experience with Condor glide-ins and GCB - Igor Sfiligoi1 European Condor Week 2006 Using Condor Glide-Ins and GCB to run on the Grid The CDF experience Elliot Lipeles, Matthew Norman, Frank Würthwein, University of California, San Diego Subir Sarkar, Igor Sfiligoi, INFN

European Condor Week CDF experience with Condor glide-ins and GCB - Igor Sfiligoi2 Traditional use of Condor in CDF Single Point of Submission system ● CAF Daemons accept and authenticate user jobs, handle output. ● Condor Schedd does the real job CAF (Portal) Worker Node (WN) Startd Collector Schedd Submitter Daemons Negotiator Legacy protocol Monitoring Daemons CDF Analysis Farm (CAF)...

European Condor Week CDF experience with Condor glide-ins and GCB - Igor Sfiligoi3 Had to move to the Grid “The Grid” ● HEP moving to the Grid ● Nobody wants to finance dedicated nodes, anymore ● Need to move ● Want to preserve user interface

European Condor Week CDF experience with Condor glide-ins and GCB - Igor Sfiligoi4 Why not plain Condor-G? Works, but... ● No central matchmaking ● No control over priorities ● Site selection problem ● Black holes can eat most of the user jobs Grid Pool Globus Submitter Daemons Grid Pool Globus Batch queue CAF (Portal) User jobs Schedd Batch queue Batch Slot User Job

European Condor Week CDF experience with Condor glide-ins and GCB - Igor Sfiligoi5 All jobs done CDF decided to go with Condor glide-ins Grid Pool Globus Submitter Daemons Grid Pool Globus Batch queue CAF (Portal) User jobs Schedd Batch queue Collector/Neg Schedd Glidekeeper Monitor Glideins Batch Slot Glidein User Job Glide-ins solve all the problems

European Condor Week CDF experience with Condor glide-ins and GCB - Igor Sfiligoi6 What is the Glidekeeper ● Just a simple script – Checks number jobs and number glide-ins – Makes sure there is always at least a glide-in per CE, when idle jobs in the queue ● Does not need to be complex – Just blindly submit to all the CEs – But could be made as complex as needed

European Condor Week CDF experience with Condor glide-ins and GCB - Igor Sfiligoi7 What is a glide-in condor_config DAEMON_LIST = MASTER,STARTD NEGOTIATOR_HOST = $(HEAD_NODE) COLLECTOR_HOST = $(HEAD_NODE) TMOUT= MaxJobRetirementTime=$(TMOUT) SHUTDOWN_GRACEFUL_TIMEOUT=$(TMOUT) # How long will it wait in an unclaimed state STARTD_NOCLAIM_SHUTDOWN = 1200 HEAD_NODE = cdfhead.fnal.gov glidein_startup.sh validate_node() local_config()./condor_master -r $mins -dyn -f ● Glide-ins are properly configured startds – Same old binaries used ● CDF uses a startup script to validate the node before starting startd – Prevent job failure

European Condor Week CDF experience with Condor glide-ins and GCB - Igor Sfiligoi8 CDF using glide-ins for a year, now CNAF one year history plot: Note over a thousand running jobs for long period of time. SDSC six months history plot CNAF (Bologna, Italy), the first GlideCAF SDSC (San Diego, CA, USA) Fermilab (Batavia, IL, USA)IN2P3 (Lyon, France)MIT (Boston, MA, USA) 1k

European Condor Week CDF experience with Condor glide-ins and GCB - Igor Sfiligoi9 Problems found along the road ● File delivery ● Firewalls ● Security concerns

European Condor Week CDF experience with Condor glide-ins and GCB - Igor Sfiligoi10 Problems found along the road ● File delivery ● Firewalls ● Security concerns

European Condor Week CDF experience with Condor glide-ins and GCB - Igor Sfiligoi11 Condor binaries delivery glidein.submit Universe = Globus GlobusScheduler = mysite/jobmanager-mybatch Executable = glidein_startup.sh transfer_Input_files = condor.tgz Queue Only the executable guaranteed to be transfered on EGEE. Other files need gLite tools. Works fine on OSG Cannot use Condor-G transfer mechanism!

European Condor Week CDF experience with Condor glide-ins and GCB - Igor Sfiligoi12 glidein_startup.sh validate_node() local_config() tar -czf condor.tgz./condor_master -r $retmins -dyn -f HTTPd for file transfer glidein_startup.sh validate_node() wget sha1sum knownSHA1 condor.tgz if [ $? -eq 0 ]; then tar -xzf condor.tgz local_config()./condor_master -r $retmins -dyn -f fi GlideCAF (Portal) Collector/Neg Main Schedd Glidekeeper Glide-in Schedd HTTPd (1) (2) (3) (4) Submitter Daemons (2) (3) HTTPd serves files

European Condor Week CDF experience with Condor glide-ins and GCB - Igor Sfiligoi13 Add HTTP Proxy to reduce load GlideCAF (Portal) Collector/Neg Main Schedd Glidekeeper Glide-in Schedd HTTPd (1) (2) (3) (4) Submitter Daemons glidein_startup.sh validate_node() env http_proxy=proxy1.fnal.gov wget sha1sum knownSHA1 condor.tgz if [ $? -eq 0 ]; then tar -xzf condor.tgz local_config()./condor_master -r $retmins -dyn -f fi HTTP Proxy (once)

European Condor Week CDF experience with Condor glide-ins and GCB - Igor Sfiligoi14 Problems found along the road ● File delivery ● Firewalls ● Security concerns

European Condor Week CDF experience with Condor glide-ins and GCB - Igor Sfiligoi15 Working with remote Grid sites - Dumb Collector Main Schedd Available Batch Slot Startd Globus WAN ? ? UDP packets often lost over WAN Available Batch Slot Startd Firewall/ NAT Firewalls and NATs make it impossible Globus Glide-in Schedd Available Batch Slot Startd Globus UDP User Job

European Condor Week CDF experience with Condor glide-ins and GCB - Igor Sfiligoi16 Working with remote Grid sites - Smart Collector Main Schedd Available Batch Slot Startd Globus WAN TCP Available Batch Slot Startd Globus Glide-in Schedd Available Batch Slot Startd Globus UDP User Job UDP is fast TCP traffic is reliable Firewall/NAT Use GCB to bridge firewalls GCB Outgoing TCP User Job

European Condor Week CDF experience with Condor glide-ins and GCB - Igor Sfiligoi17 Generic Connection Brokering (GCB) ● Condor proxy service – Fully integrated in Condor – Just a configuration parameter ● For scalability, use multiple instances – Point different glide-ins to different GCBs condor_config DAEMON_LIST = MASTER,STARTD NEGOTIATOR_HOST = $(HEAD_NODE) COLLECTOR_HOST = $(HEAD_NODE) TMOUT= MaxJobRetirementTime=$(TMOUT) SHUTDOWN_GRACEFUL_TIMEOUT=$(TMOUT) # How long will it wait in an unclaimed state STARTD_NOCLAIM_SHUTDOWN = 1200 HEAD_NODE = cdfhead.fnal.gov # GCB configuration NET_REMAP_INAGENT = cdfgcb1.fnal.gov NET_REMAP_ENABLE = true NET_REMAP_SERVICE = GCB

European Condor Week CDF experience with Condor glide-ins and GCB - Igor Sfiligoi18 How GCB works Firewall/NAT Collector Schedd Grid Pool GCB Broker Startd Globus 4) Schedd address forwarded 3) Send own address 6) A job can be sent to the startd GCB must be on a public network TCP 2) Advertise itself and the GCB connection 5) Establish a TCP connection to the schedd Just a proxy User Job 1) Establish a persistent TCP connection to a GCB TCP

European Condor Week CDF experience with Condor glide-ins and GCB - Igor Sfiligoi19 The use of GCB at CDF ● Using a pre-production version of GCB – Waiting for Condor v ● N.American CAF just opened to users – Condor collector and schedds at Fermilab – Using 2 GCBs, both at Fermilab – Gliding into Fermilab, MIT and San Diego – Rest of US and Canada Grid sites to follow

European Condor Week CDF experience with Condor glide-ins and GCB - Igor Sfiligoi20 Problems found along the road ● File delivery ● Firewalls ● Security concerns

European Condor Week CDF experience with Condor glide-ins and GCB - Igor Sfiligoi21 Security in the pull model ● Real user known only after glide-in starts – Cannot use user proxy when submitting to CE – Real user proxy delivered by Condor ● User job can use it for further authentication ● All glide-ins use a single service proxy – Site admin does not know about the real user – All jobs potentially run under the same UID ● No system protection between users ● Glide-in and job run under the same UID – User can steal service proxy

European Condor Week CDF experience with Condor glide-ins and GCB - Igor Sfiligoi22 Security problem at a glance Submitter Daemons Grid Pool Globus CAF (Portal) User jobs Schedd Batch queue Collector/Neg Schedd Glidekeeper Monitor Glideins Batch Slot Glidein User Job Batch Slot Glidein User Job Node uidX ProxyA uidX ProxyB ProxyX

European Condor Week CDF experience with Condor glide-ins and GCB - Igor Sfiligoi23 Use gLExec to authenticate on the WN ● Authenticate on WN before starting the user job – Can change UID on the fly ● Grid admin knows who is running Globus Batch queue Batch Slot Glidein uidX ProxyX User Job ProxyA uidA gLExec ProxyA Schedd Collector/Neg Schedd Grid Pool ProxyX

European Condor Week CDF experience with Condor glide-ins and GCB - Igor Sfiligoi24 Status of gLExec ● Current gLExec works only on CE – Cannot use it as-is ● Work in progress to make it usable for the glideins – See later talk ● Condor team promised to make use of gLExec once available

European Condor Week CDF experience with Condor glide-ins and GCB - Igor Sfiligoi25 Summary ● Glide-ins and GCB took CDF to the Grid – Without leaving the Condor environment ● Using the pull model made our life easy – No need to select a site in advance – Matchmaking done at a global level – Policies kept in CDF hands ● No need for any add-on at Grid sites – Will use what is found ● gLExec and a local Squid desirable