Slide 1
Condor in the Fermilab Grid Facilities
Eileen Berman, Fermi National Accelerator Laboratory
April 30, 2008
Slide 2
Fermi National Accelerator Laboratory is a high energy physics laboratory outside of Chicago. Many different experiments, comprising thousands of users working over many years, rely on Fermilab to provide the core services and software necessary to enable the research that leads to scientific discoveries. The Fermilab Grid Facilities are strategic in the execution of this mission.
Slide 3
The Fermilab Grid Facility supports many different activities in support of these scientists. The scale of these activities places demands that often require close collaborations:
- Petabytes of data to store and manage
- Billions of events to analyze
- Tens of millions of jobs to run per year
Collaboration with the Condor team over the years has enabled Fermilab to deliver the quality software and services that this large-scale science demands.
Slide 4
Outline:
- Fermilab Campus Grid (FermiGrid)
- DZero Data Handling and Job Submission System (SAMGrid)
- Resource Selection Service (ReSS)
- Glidein Workload Management System (WMS)
- Grid/Site Security Boundary (OSE)
- End-to-End Security
Slide 5
[Figure: "The Grid", showing four sites (Site A through Site D), each contributing computing resources and storage resources.]
Slide 6
[Figure: FermiGrid site-wide gateway. Exterior services (VOMRS, VOMS, GUMS, and SAZ servers) synchronize periodically with their interior counterparts. The Condor site-wide gateway fronts the campus clusters (CMS WC1/WC2/WC3, CDF OSG1/OSG2/OSG3/4, D0 CAB1/CAB2, GP Grid), which send their ClassAds to the gateway; ReSS publishes the campus ClassAds. Shared services include the FermiGrid SE (dCache SRM), BlueArc storage, and Gratia accounting.]
- Step 1: The user registers with their VO (VOMRS server).
- Step 2: The user issues voms-proxy-init and receives VOMS-signed credentials.
- Step 3: The user submits their grid job via globus-job-run, globus-job-submit, or Condor-G.
- Step 4: The gateway checks the job against the Site Authorization Service (SAZ).
- Step 5: The gateway requests a GUMS mapping based on VO and role.
- Step 6: The grid job is forwarded to the target cluster.
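As a concrete illustration of Step 3, a Condor-G submission to the site-wide gateway might look roughly like the sketch below. The gateway hostname is a placeholder rather than the real FermiGrid endpoint, and jobmanager-cemon is the gateway jobmanager named on slide 8; treat this as a sketch, not a verbatim FermiGrid recipe.

    # Sketch of a Condor-G submit description targeting the site-wide gateway.
    # The hostname is hypothetical.
    universe      = grid
    grid_resource = gt2 gateway.example.edu/jobmanager-cemon
    executable    = analyze.sh
    arguments     = run2008.cfg
    output        = analyze.out
    error         = analyze.err
    log           = analyze.log
    # The VOMS proxy obtained in Step 2 is used to authenticate to the gateway.
    queue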
Slide 7
As data volumes increase, the demand for data processing power has grown beyond what a diverse, disconnected environment can support. FermiGrid is a strategic cross-campus grid supporting the experiments at Fermilab:
- 11,786 cores among the clusters, 7,586 of them managed by Condor
- Access to several petabytes (10^15 bytes) of storage
- O(1000) users
FermiGrid provides a common mechanism for supporting the users while minimizing the expense of service delivery.
Slide 8
Condor is the underlying batch system on most of the clusters. The site-wide gateway is a natural pairing of Condor-G with a hierarchical grid deployment:
- The gateway uses jobmanager-cemon, which is based on jobmanager-condor.
- It matches jobs against cluster information via a local Condor matchmaking service (ReSS) and forwards each job to the matched sub-cluster (see the sketch below).
- This enhances scalability; fault tolerance is under development.
- Additional gateways will provide job access to TeraGrid resources.
- Significant scalability work was done by the Condor team to support these usage patterns.
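A rough sketch of the kind of match the gateway performs through ReSS is shown below. The attribute names are illustrative stand-ins; the slides do not spell out the exact attributes ReSS publishes for each sub-cluster.

    # Illustrative fragment of a sub-cluster ClassAd known to the matchmaker
    # (attribute names are hypothetical).
    Name         = "GPGrid"
    FreeCPUs     = 312
    SupportedVOs = "fermilab,dzero,cdf"

    # Illustrative job-side expressions: match only sub-clusters that support
    # the job's VO and advertise free CPUs, and prefer the emptiest cluster.
    Requirements = stringListMember(MY.VO, TARGET.SupportedVOs) && (TARGET.FreeCPUs > 0)
    Rank         = TARGET.FreeCPUs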
Slide 9
[Figure only; no text was recoverable from this slide.]
Slide 10
SAMGrid is the data handling and job submission system supporting the data processing and simulated event generation needs of the DZero collaboration (500 physicists):
- Hundreds of terabytes of data needing processing
- Billions of events, with data and metadata to manage and deliver
- A widely distributed system was needed to support the data delivery and data processing needs of an experiment located across the globe
SAMGrid was an early adopter of the Condor matchmaking service with Condor-G, used to select clusters instead of machines; we had to learn how to describe clusters. It uses Condor matchmaking capabilities to throttle the flow of jobs from the schedulers to the sites in a stable, configurable way. The throttling uses cluster ClassAd requirements, and policies can be implemented based on job status values (e.g. running, held), as in the sketch below.
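A minimal sketch of such a throttling policy, assuming each cluster's ClassAd carries counts of this scheduler's jobs by status, is shown below; the attribute names and thresholds are hypothetical.

    # Hypothetical throttling expression for SAMGrid jobs: do not match a job
    # to a cluster that already has many of this scheduler's jobs there, or
    # any of them held.  RunningJobsHere, IdleJobsHere and HeldJobsHere are
    # illustrative attributes assumed to be kept in each cluster's ClassAd.
    Requirements = ( TARGET.RunningJobsHere + TARGET.IdleJobsHere < 200 ) && \
                   ( TARGET.HeldJobsHere == 0 )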
Slide 11
The Condor team implemented the multi-tiered job submission mechanism necessary to take advantage of the diverse clusters on the Grid.
[Figure: a Condor schedd submitting jobs out to the Grid through multiple tiers.]
Slide 12
[Figure: the same FermiGrid site-wide gateway picture as slide 6, here highlighting that the clusters send their ClassAds via CEMon to the site-wide gateway and ReSS.]
Slide 13
[Figure: ReSS architecture. On the grid site side, CEMon on each CE (Gate1, Gate2, Gate3) gathers cluster and job information from the GIP and publishes ClassAds across the Grid/Site interface to the ReSS Info Gatherer, which feeds the Condor Match Maker. On the VO/Grid interface side, users can query the published ClassAds; the Condor scheduler asks the matchmaker "what gate?", receives an answer (e.g. Gate3), and the job is sent to that CE's jobmanagers.]
- Users express cluster requirements in the JDL.
- Jobs sit in the site batch system queue until execution.
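For illustration, the information CEMon and the GIP publish for one CE can be pictured as a ClassAd roughly like the one below. The attribute names follow the Glue-schema style but are a simplified, assumed subset rather than the exact ReSS attribute set, and the hostname is a placeholder.

    # Illustrative CE ClassAd as gathered by CEMon/GIP and published to the
    # ReSS matchmaker (simplified; hostname and values are hypothetical).
    Name                        = "Gate3"
    GlueCEInfoContactString     = "gate3.example.edu:2119/jobmanager-condor"
    GlueCEStateFreeCPUs         = 512
    GlueCEStateWaitingJobs      = 37
    GlueCEAccessControlBaseRule = "VO:dzero"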
Slide 14
We needed the ability to match diverse job needs with available resources across an entire grid (OSG) without requiring users to know the details.
- The Resource Selection Service (ReSS) uses the same concept as SAMGrid, generalized for the Open Science Grid: Condor-G matchmaking (sketched below).
- ReSS is the production OSG resource selection service, grid-wide.
- FermiGrid uses ReSS internally for matching jobs to the campus clusters.
For more information see Mats Rynge's talk, "OSGMM and ReSS – Matchmaking on OSG".
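The Condor-G matchmaking pattern that ReSS enables can be sketched as a submit description along the following lines. The $$() substitution pulls the gatekeeper contact string out of the matched CE ClassAd; the Glue-style attribute names are carried over from the previous sketch and remain illustrative.

    # Sketch of a Condor-G job that lets the matchmaker pick the OSG site.
    universe      = grid
    # Filled in from the matched CE ClassAd at match time via $$() substitution.
    grid_resource = gt2 $$(GlueCEInfoContactString)
    requirements  = ( TARGET.GlueCEStateFreeCPUs > 0 )
    executable    = reco.sh
    output        = reco.out
    error         = reco.err
    log           = reco.log
    queue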
Slide 15
[Figure: Glidein WMS architecture, courtesy of I. Sfiligoi (Condor Week 2007). Legend: collector, glidein factory, VO frontend, central manager, execute daemon, submit node and user(s), gatekeeper, worker node.]
Slide 16
The Glidein WMS provides glideins in a more transparent manner, insulating the user from the diverse nature of the grid:
- Supports vanilla and standard universe jobs.
- Works across firewalls; this required GCB to be made production quality.
- Support for calling glExec is built into Condor directly, allowing the execution of the glidein pilot to be separated from the user job.
- Scaling upgrades were done to support O(10,000) jobs running simultaneously.
- It creates a virtual Condor pool, so distributed grid resources look like dedicated local resources (see the sketch below).
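From the user's point of view, submitting to this virtual pool looks like an ordinary local Condor submission, roughly as sketched below. The site-steering requirement is optional and illustrative; the attributes the glideins actually advertise are not spelled out in these slides.

    # Sketch of an ordinary vanilla-universe job run on the glidein pool; the
    # glideins make remote grid slots look like local execute machines.
    universe     = vanilla
    executable   = simulate.sh
    arguments    = nevents=10000
    # Optional, illustrative: run only on slots that advertise a site name.
    requirements = ( GLIDEIN_Site =!= UNDEFINED )
    output       = sim.out
    error        = sim.err
    log          = sim.log
    queue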
Slide 17
Work is being done at Fermilab at the boundary between the Condor-G world and Condor, determining the security considerations on this boundary. This work defines how site installations can operate securely when interfacing with a grid environment.
Slide 18
[Figure: end-to-end security across the Grid. The job payload (JPld) and job-specific credentials (JCrd), signed with the user credentials (UCrd), pass through a chain of WMS instances (WMS 1, WMS 2, ..., WMS n) and a data store before reaching a worker node (WN) at one of the grid sites (Site A through Site D), each with its own computing and storage resources.]
Acronyms: UCrd = user credentials; JCrd = job-specific user credentials; JPld = job payload.
Slide 19
With the implementation of glidein WMSs, a job may make many stops on its way to a worker node. How do you ensure job integrity? We need to be able to determine, e.g. on a worker node, whether the job has been tampered with. This work implements signed ClassAds to help protect the integrity of the user job as it travels on the Grid (a conceptual sketch follows). It is a collaborative investigation between Fermilab and the Condor team; see Ian's talk (Thursday), "End-to-end security in Condor".
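Conceptually, a signed job ClassAd can be pictured along the lines below. The attribute names are hypothetical illustrations of the idea only, not the attributes used by the actual Fermilab/Condor implementation described in Ian's talk.

    # Conceptual sketch of a signed job ClassAd (all attribute names are
    # hypothetical).  The submitter signs the security-relevant attributes;
    # any hop, or the worker node itself, can recompute the hash of those
    # attributes and verify the signature against the submitter's certificate
    # to detect tampering.
    Cmd                = "reco.sh"
    Args               = "run2008.cfg"
    Owner              = "dzerouser"
    SignedAttributes   = "Cmd,Args,Owner"
    SignerDN           = "/DC=org/DC=exampledomain/OU=People/CN=Example User"
    AttributeSignature = "<base64-encoded signature over the listed attributes>"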
Slide 20
Fermilab provides a secure, production grid environment offering grid software and services that enable scientists to produce scientific results on the scale required by today's petascale experiments. The productive relationship between Fermilab and the Condor team has contributed to this success. Condor is part of the Grid Facility at Fermilab, serving the production needs of many experiments and of the wider OSG.