Preparing for the Grid: Changes in Batch Systems at Fermilab
HEPiX Batch System Workshop, Karlsruhe, Germany, May 12, 2005
Ken Schumacher, Steven Timm

Introduction
All of the big experiments at Fermilab (CDF, D0, CMS) are moving to grid-based processing. This talk covers:
– Batch scheduling at Fermilab before the grid
– The change of the big Fermilab clusters to Condor, and why it happened
– Future requirements for batch scheduling at Fermilab

Before the Grid: FBSNG
Fermilab had four main clusters: the CDF Reconstruction Farm, the D0 Reconstruction Farm, the General Purpose Farm, and CMS. All used FBSNG (Farms Batch System Next Generation).
Most early activity on these farms was reconstruction of experimental data and generation of Monte Carlo, so they are all referred to generically as “Reconstruction Farms”.

FBSNG Scheduling in the Reconstruction Farms
Dedicated reconstruction farms (CDF, D0)
– Large cluster dedicated to one experiment
– A small team of experts submits all jobs
– Scheduling is trivial
Shared reconstruction farm (General Purpose)
– Small cluster shared by 10 experiments, each with one or more queues
– Each experiment has a maximum quota of CPUs it can use at once
– Each experiment has a maximum share of the farm it can use when the farm is oversubscribed
– Most queues have no time limits; priority is calculated taking into account the average time that jobs in the queue have been running (see the sketch below)
– Special queues exist for I/O jobs that run on the head node and move data to and from mass storage
– Guaranteed scheduling means that everything will eventually run: other queues may be manually held to let a job run, and some nodes may have to be idled temporarily to let a large parallel job start up
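
FBSNG's actual scheduling code is not shown in these slides; the following Python sketch only illustrates the kind of quota-and-share bookkeeping described above. The experiment names, quotas, and shares are made-up examples.

    # Illustrative sketch of quota-and-share scheduling of the kind used in the
    # shared General Purpose Farms.  This is NOT FBSNG code; the experiment
    # names, quotas, and shares below are made up.

    experiments = {
        # name: (cpu_quota, share_of_farm)
        "exp_a": (40, 0.40),
        "exp_b": (30, 0.35),
        "exp_c": (20, 0.25),
    }
    running = {"exp_a": 35, "exp_b": 10, "exp_c": 5}   # CPUs in use per experiment
    total_cpus = 100

    def can_start(exp):
        """An experiment may only start another job while under its CPU quota."""
        return running[exp] < experiments[exp][0]

    def pick_next():
        """When the farm is oversubscribed, favor the experiment that is
        furthest below its configured share of the farm."""
        candidates = [e for e in experiments if can_start(e)]
        if not candidates:
            return None
        return min(candidates,
                   key=lambda e: running[e] / total_cpus - experiments[e][1])

    print(pick_next())   # exp_b: furthest below its share in this example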

FBSNG Advantages and Disadvantages
Advantages
– Light resource consumption by the batch system daemons
– Simple design, based on resource counting rather than load measuring and balancing
– Cost: no per-node license fee
– Customized for Fermilab's strong authentication requirements (Kerberos)
– Quite reliable: the FBSNG software rarely, if ever, fails
Disadvantages
– Designed strictly for Fermilab Run II production
– Lacks grid-friendly features such as x509 authentication, although they could be added

The Grid Can Use Any Batch System, So Why Condor?
Free software (but support can be purchased).
Supported by a large team at the University of Wisconsin, not by Fermilab programmers.
Widely deployed on multi-hundred-node clusters.
New versions of Condor allow Kerberos 5 and x509 authentication.
Comes with Condor-G, which simplifies submission of grid jobs (see the sketch below).
Condor-C components allow independent Condor pools to interoperate.
Some of our grid-enabled users already take advantage of the extended Condor features, so Condor is the fastest way to get our users onto the grid.
The USCMS production cluster at Fermilab has switched to Condor, and the CDF reconstruction farm is switching. The General Purpose Farms, which are smaller, also plan to switch to Condor to be compatible with the two biggest compute resources on site.
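
As an illustration of how Condor-G simplifies grid submission, the sketch below writes a minimal "globus"-universe submit description and hands it to condor_submit. The gatekeeper contact string and file names are hypothetical, and the keywords shown are those of Condor 6.x-era Condor-G; the local Condor manual should be checked for the exact syntax.

    # Sketch: submit a job to a remote gatekeeper through Condor-G.
    # The gatekeeper contact string and file names are hypothetical.
    import subprocess, textwrap

    submit_description = textwrap.dedent("""\
        universe        = globus
        globusscheduler = gatekeeper.example.org/jobmanager-condor
        executable      = myjob.sh
        output          = myjob.out
        error           = myjob.err
        log             = myjob.log
        queue
        """)

    with open("myjob.sub", "w") as f:
        f.write(submit_description)

    # condor_submit hands the job to the local schedd, which then manages the
    # remote Globus submission on our behalf.
    subprocess.run(["condor_submit", "myjob.sub"], check=True)

The job can then be watched and removed with the usual condor_q and condor_rm commands, which is what makes this convenient for users already running local Condor jobs.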

Rise of Analysis Clusters
Experiments now use multi-hundred-node Linux clusters for analysis as well, replacing expensive central machines:
– The CDF Central Analysis Facility (CAF) originally used FBSNG and has now switched to Condor
– The D0 Central Analysis Backend (CAB) uses PBS/Torque
– The USCMS User Analysis Facility (UAF) used FBSNG as a primitive load balancer for interactive shells and will switch to a Cisco load balancer shortly
These clusters have a heterogeneous job mix, and many different users and groups have to be prioritized within each experiment.

CAF Software
In CDF terms, CAF refers to both the cluster and the software that makes it go.
CDF collaborators (UCSD and INFN) wrote a series of wrappers around FBSNG, referred to as “CAF”:
– The wrappers allow connecting to a running job to debug it, tailing files of a running job, and many other things
– They also added monitoring functions
– Users are tracked by Kerberos principal and prioritized with different batch queues, but all jobs run under just a few userIDs, making management easy (see the sketch below)
dCAF is the distributed CAF: the same setup replicated at dedicated CDF resources around the world. Info at
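
The CAF wrapper code itself is not part of these slides; the sketch below only illustrates the bookkeeping idea of the last bullet: run jobs under a handful of shared accounts, but log every submission against the submitter's Kerberos principal so the work stays traceable to a person. The pool account names, job ID, and log path are hypothetical.

    # Sketch of the CAF bookkeeping idea: jobs run under a few shared accounts,
    # but every submission is logged against the submitter's Kerberos principal.
    # Pool account names, job id, and log path are hypothetical, not real CAF code.
    import subprocess, time

    POOL_ACCOUNTS = ["cafuser1", "cafuser2", "cafuser3"]   # hypothetical shared uids

    def default_principal():
        """Read the default principal out of the Kerberos credential cache."""
        out = subprocess.run(["klist"], capture_output=True, text=True,
                             check=True).stdout
        for line in out.splitlines():
            if line.startswith("Default principal:"):
                return line.split(":", 1)[1].strip()
        raise RuntimeError("no Kerberos ticket found")

    def record_submission(jobid, principal, account):
        """Append an audit record mapping the batch job back to a real person."""
        with open("caf_submissions.log", "a") as log:
            log.write(f"{time.ctime()} job={jobid} principal={principal} uid={account}\n")

    # Example: pick a pool account round-robin by job id and log the mapping.
    jobid = 12345
    account = POOL_ACCOUNTS[jobid % len(POOL_ACCOUNTS)]
    record_submission(jobid, default_principal(), account)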

CondorCAF in Production
CDF changed the batch system in the analysis facility to Condor and also rewrote the monitoring software to work with Condor.
Condor's “computing on demand” capability allows users to list files, tail files, and debug on the batch nodes.
It took a lot of work from the Condor team to get Kerberos authentication and the large number of nodes (~700) going.
Half of the CDF reconstruction farm is now also running Condor; the rest will convert once validation is complete.
SAM is the data delivery and bookkeeping mechanism:
– Used to fetch data files, keep track of intermediate files, and store the results
– Replaces a user-written bookkeeping system that was high-maintenance
Next step: GlideCAF, to make the CAF work with Condor glide-ins across the grid on non-dedicated resources.

[Screenshot: CondorCAF monitoring display]

SAMGrid
D0 is using SAMGrid for all remote generation of Monte Carlo and for reprocessing at several sites worldwide. The D0 farms at FNAL are the biggest site.
Special job managers were written to handle production and Monte Carlo requests intelligently.
All job requests and data requests go through the head nodes to the outside network.
There are significant scalability issues, but it is in production.
The D0 reconstruction farms at Fermilab will continue to use FBSNG.

Open Science Grid
A continuation of the efforts begun in Grid3.
Integration testing has been ongoing since February; provisioning and deployment are occurring as we speak.
At Fermilab, the USCMS production cluster and the General Purpose Farms will be the initial presence on OSG.
Ten Virtual Organizations so far, mostly US-based:
– USATLAS (ATLAS collaboration)
– USCMS (CMS collaboration)
– SDSS (Sloan Digital Sky Survey)
– fMRI (functional Magnetic Resonance Imaging, based at Dartmouth)
– GADU (applied genomics, based at Argonne)
– GRASE (engineering applications, based at SUNY Buffalo)
– LIGO (Laser Interferometer Gravitational-Wave Observatory)
– CDF (Collider Detector at Fermilab)
– STAR (Solenoidal Tracker at RHIC, BNL)
– iVDGL (International Virtual Data Grid Laboratory)

Structure of the General Purpose Farms OSG Compute Element
One node runs the Globus gatekeeper and does all communication with the grid (see the test sketch below).
The software comes from the VDT (Virtual Data Toolkit).
In this configuration the gatekeeper is also the Condor master; the Condor software is part of the VDT. A separate Condor head node will be set up later, once the software configuration is stable.
All grid software is exported by NFS to the compute nodes, so no change to the compute node install is necessary.
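
A simple way to check that such a gatekeeper is handing work to its local Condor pool is to push a trivial job through the pre-WS GRAM job manager. The sketch below assumes a valid grid proxy already exists and uses a hypothetical gatekeeper hostname.

    # Sketch: verify that a Globus gatekeeper forwards jobs into its Condor pool
    # by running a trivial command through the pre-WS GRAM job manager.
    # The gatekeeper hostname is a hypothetical example; a valid grid proxy
    # (e.g. from grid-proxy-init) is assumed to exist already.
    import subprocess

    GATEKEEPER = "gatekeeper.example.org/jobmanager-condor"

    result = subprocess.run(["globus-job-run", GATEKEEPER, "/bin/hostname"],
                            capture_output=True, text=True, check=True)
    print("Job ran on worker node:", result.stdout.strip())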

Fermigrid
Fermigrid is an internal project at Fermilab to make the different Fermilab resources interoperate and be available to the Open Science Grid.
Fermilab will start with the General Purpose Farms and CMS being available to OSG and to each other.
All non-Fermilab organizations will send jobs through a common site gatekeeper.
The site gatekeeper will route jobs to the appropriate cluster, probably using Condor-C; the details are to be determined.
Fermigrid provides a VOMS server to manage all the Fermilab-based Virtual Organizations.
Fermigrid provides a GUMS server to map grid Distinguished Names to unix userIDs (see the sketch below).
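
GUMS itself is a Java web service and its configuration is not covered in this talk; the sketch below merely illustrates the kind of mapping it performs, using grid-mapfile-style entries that pair a certificate Distinguished Name with a local account. The DNs and account names are invented.

    # Sketch of the mapping GUMS performs: certificate DN -> local unix account.
    # The entries below are made-up examples in grid-mapfile style.
    GRID_MAPFILE = '''
    "/DC=org/DC=doegrids/OU=People/CN=Alice Example 12345" uscms01
    "/DC=org/DC=doegrids/OU=People/CN=Bob Example 67890" sdss01
    '''

    def parse_mapfile(text):
        mapping = {}
        for raw in text.strip().splitlines():
            line = raw.strip()
            if not line:
                continue
            # Each entry is:  "quoted DN"  local_account
            dn = line.rsplit('"', 1)[0].strip().strip('"')
            account = line.rsplit(None, 1)[1]
            mapping[dn] = account
        return mapping

    def map_dn(dn, mapping):
        """Return the local account a grid job should run as, or None to reject."""
        return mapping.get(dn)

    users = parse_mapfile(GRID_MAPFILE)
    print(map_dn("/DC=org/DC=doegrids/OU=People/CN=Alice Example 12345", users))  # uscms01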

[Diagram: Current Farms Configuration. FNSFO, the FBSNG head node, with NFS RAID and FBS submit; 102 General Purpose Farms FBSNG worker nodes; ENCP connection to ENSTORE mass storage.]

[Diagram: Configuration with Grid. FNPCSRV1, the FBSNG head node, with NFS RAID and FBS submit; 102 General Purpose Farms FBSNG worker nodes; FNGP-OSG gatekeeper with 14 Condor worker nodes currently and 40 new Condor worker nodes coming this summer; Fermigrid1 site gatekeeper with Condor submit; jobs arrive from OSG and from Fermilab; ENSTORE mass storage.]

Requirements
Scheduling
– The current FBSNG installation in the General Purpose Farms has complicated shares and quotas
– We have to find the best way to replicate this in Condor (see the configuration sketch below)
– The hardest case to handle: low-priority long jobs come into the farm while it is idle and fill it up. Do we preempt? Suspend?
Grid credentials and mass storage
– Need to verify that we can use the Storage Resource Manager and gridftp from the compute nodes, not just the head node
Grid credentials: authentication and authorization
– Condor has Kerberos 5 and x509 authentication
– We need a way to pass these credentials through the Globus GRAM bridge to the batch system
– Otherwise both local and grid jobs end up running unauthenticated and simply trusting the gatekeeper
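
How the FBSNG shares and quotas will map onto Condor is still an open question; the fragment below is only a sketch of condor_config settings that could express per-group quotas and a no-preemption policy (one possible answer to the preempt-or-suspend question). The group names and numbers are hypothetical, and the macro names should be checked against the Condor version actually deployed.

    # Sketch of condor_config settings that could express per-group quotas and a
    # "never preempt" policy.  Group names and numbers are hypothetical; macro
    # names should be checked against the deployed Condor version.
    import textwrap

    config_fragment = textwrap.dedent("""\
        # Accounting groups with fixed machine quotas (negotiator side).
        # Jobs would claim a group with +AccountingGroup = "group_exp_a.user"
        # in their submit files.
        GROUP_NAMES              = group_exp_a, group_exp_b, group_exp_c
        GROUP_QUOTA_group_exp_a  = 40
        GROUP_QUOTA_group_exp_b  = 30
        GROUP_QUOTA_group_exp_c  = 20

        # One possible answer to "preempt or suspend?": never kick a running
        # job and rely on the quotas above to contain long low-priority work.
        PREEMPT                  = False
        PREEMPTION_REQUIREMENTS  = False
        """)

    with open("condor_config.local.quotas", "w") as f:
        f.write(config_fragment)

The alternative, suspending low-priority jobs instead of letting them run to completion, would be expressed with the startd SUSPEND and CONTINUE expressions and needs more careful tuning before we would deploy it.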

Requirements 2
Accounting and auditing (see the accounting sketch below)
– Need features to track which groups and which users are using the resources
– VOs need to know who within the VO is using resources
– Site admins need to know who is crashing their batch system
Extended VO privilege
– It should be possible to set priorities in the batch system and the mass storage system by virtual organization and role
– In other words, a production manager should be able to jump ahead of Joe Graduate Student in the queue
Practical sysadmin concerns
– Some grid user-mapping scenarios envision hundreds of pool userIDs per VO
– All of these accounts need quotas, home directories, etc.
– It would be very nice to do as CondorCAF does and run with a few userIDs traceable back to a Kerberos principal or grid credential
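
Condor does not provide VO-level accounting out of the box; as a rough sketch of the per-user summaries that VOs and site admins would want, the script below totals wall-clock time per job owner from condor_history. The ClassAd attribute names are the standard ones, but their availability can vary with the Condor version.

    # Rough sketch of per-user accounting from Condor's job history: total
    # wall-clock seconds charged to each job owner.
    import subprocess
    from collections import defaultdict

    out = subprocess.run(
        ["condor_history",
         "-format", "%s ", "Owner",
         "-format", "%f\n", "RemoteWallClockTime"],
        capture_output=True, text=True, check=True).stdout

    usage = defaultdict(float)
    for line in out.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue                      # skip records missing either attribute
        owner, seconds = parts
        usage[owner] += float(seconds)

    for owner, seconds in sorted(usage.items(), key=lambda kv: -kv[1]):
        print(f"{owner:20s} {seconds / 3600.0:10.1f} wall-clock hours")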