Grid Job and Information Management (JIM) for D0 and CDF Gabriele Garzoglio for the JIM Team.


1 Grid Job and Information Management (JIM) for D0 and CDF Gabriele Garzoglio for the JIM Team

2 Overview: Introduction; Grid-level Management (SAM-Grid = SAM + JIM, Job Management, Information Management); Fabric-level Management (Running jobs on grid resources, Local sandbox management, The DZero Application Framework); Running MC at UWisc.

3 Context: The D0 Grid project started in 2001-2002 to handle D0’s expanded needs for globally distributed computing. JIM complements the data handling system (SAM) with job and information management. JIM is funded by PPDG (our team here) and GridPP (Rod Walker in the UK). It is a collaborative effort with the experiments; CDF joined later in 2002.

4 History: Delivered the JIM prototype for D0 on Oct 10, 2002, with remote job submission, brokering based on cached data, and Web-based monitoring. SC-2002 demo: 11 sites (D0, CDF), a big success. May 2003: started deployment of V1. Now: working on running MC in production on the Grid.

5 Overview (next section: Grid-level Management): SAM-Grid = SAM + JIM, Job Management, Information Management.

6

7 SAM-Grid Logistics [Diagram. Labeled components: User Interface, Submission Client, Global Job Queue, Match Making, Site Resource Selector, Info Collector, Info Gatherer, Resource Optimizer, Grid Gateway, Local Job Handler (CAF, D0MC, BS, ...), JIM Advertise, Worker Nodes, Cluster, AAA, Dist. FS; Global DH Services (SAM Naming Server, SAM Log Server, SAM DB Server, RC MetaData Catalog, Bookkeeping Service); SAM Station (+ other services), SAM Stager(s), MSS, Cache; Info Manager, XML DB server (site configuration, global/local JID map, ...), Info Providers, MDS, Site Web Server, Grid Monitoring, User Tools. Arrows show the flow of job, data, and meta-data.]

8 Job Management Highlights: We distinguish grid-level (global) job scheduling (selecting a cluster to run on) from local scheduling (distributing the job within the cluster). We consider 3 types of jobs: analysis (data intensive), Monte Carlo (CPU intensive), and reconstruction (data and CPU intensive).

9 Job Management, Distinct JIM Features: Decision making is based on both information that exists irrespective of jobs (resource description) and functions of (job, resource) pairs. Decision making is interfaced with the data handling middleware. Decision making is entirely within the Condor framework (no RB of our own), which strongly promotes standards and interoperability. Brokering algorithms can be extended via plug-ins.
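The brokering idea on slides 8 and 9 can be illustrated with a small, self-contained sketch. It is not JIM or Condor code: the class names, fields, and ranking rule are all assumptions made for illustration. It only shows how a rank can combine a job-independent resource description (free CPUs) with a function of the (job, resource) pair (the fraction of the job's input files already cached at the site), and how the rank function can be swapped like a plug-in.

```python
from dataclasses import dataclass, field


@dataclass
class Resource:
    """Job-independent description of an execution site (hypothetical fields)."""
    name: str
    free_cpus: int
    cached_files: set = field(default_factory=set)


@dataclass
class Job:
    """A grid job; job_type is one of the three types named on slide 8."""
    job_type: str  # "analysis", "monte_carlo", or "reconstruction"
    cpus_needed: int
    input_files: set = field(default_factory=set)


def cached_fraction(job, resource):
    """A function of (job, resource): share of the job's input already cached."""
    if not job.input_files:
        return 0.0
    return len(job.input_files & resource.cached_files) / len(job.input_files)


def rank(job, resource):
    """Illustrative rank: data locality first, free CPUs as the tie-break."""
    return (cached_fraction(job, resource), resource.free_cpus)


def select_site(job, resources, rank_fn=rank):
    """Return the best-ranked resource that meets the job's CPU requirement."""
    eligible = [r for r in resources if r.free_cpus >= job.cpus_needed]
    if not eligible:
        return None
    return max(eligible, key=lambda r: rank_fn(job, r))


sites = [
    Resource("site_a", free_cpus=10, cached_files={"f1", "f2"}),
    Resource("site_b", free_cpus=50, cached_files={"f1"}),
    Resource("site_c", free_cpus=0, cached_files={"f1", "f2", "f3"}),
]

analysis_job = Job("analysis", cpus_needed=1, input_files={"f1", "f2", "f3"})
mc_job = Job("monte_carlo", cpus_needed=1)

best_for_analysis = select_site(analysis_job, sites)
best_for_mc = select_site(mc_job, sites)
print(best_for_analysis.name, best_for_mc.name)
```

In this toy setup the data-intensive analysis job goes to the site holding most of its input, while the Monte Carlo job, which has no input files, falls back to the site with the most free CPUs. Passing a different `rank_fn` changes the policy without touching the selection loop.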

10 Job Management [Diagram. A job goes from the User Interface through the Submission Client and Queuing System to the Broker (Match Making Service and Information Collector), then to Execution Sites #1 ... #n. Each execution site is labeled with Computing Element, Storage Element(s), Grid Sensors, and Data Handling System.]

11 Information Management: In JIM’s view, this includes the configuration framework, resource description for job brokering, and infrastructure for monitoring. Main features: monitoring of sites (resources) and jobs; distributed knowledge about jobs, etc.; incremental knowledge building; GMA for current-state inquiries and logging for recent-history studies; all Web based.

12 Information Management via Site Configuration [Diagram. Labels: Main Site/Cluster Config (XML DB); Template XML and XSLT; Resource Advertisement (classad); Monitoring Configuration (LDIF); Service Instantiation (XML); …]
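One reading of slide 12 is that a single site/cluster configuration in XML is transformed into several derived forms, such as a classad for resource advertisement and LDIF for monitoring. The sketch below illustrates that idea only. It uses plain Python with `xml.etree.ElementTree` in place of the XSLT named on the slide, and every element name, attribute name, and output attribute is invented for the example rather than taken from the JIM schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical site configuration; element and attribute names are invented.
SITE_CONFIG = """
<site name="example_site">
  <cluster name="farm1" batch_system="condor" cpus="100"/>
  <cluster name="farm2" batch_system="pbs" cpus="40"/>
</site>
"""


def to_classad(site_name, cluster):
    """Render one cluster as a ClassAd-style resource advertisement."""
    lines = [
        f'  Site = "{site_name}";',
        f'  Cluster = "{cluster.get("name")}";',
        f'  BatchSystem = "{cluster.get("batch_system")}";',
        f'  Cpus = {int(cluster.get("cpus"))};',
    ]
    return "[\n" + "\n".join(lines) + "\n]"


def to_ldif(site_name, cluster):
    """Render the same cluster as an LDIF-style entry (illustrative attributes)."""
    name = cluster.get("name")
    return "\n".join([
        f"dn: cluster={name},site={site_name}",
        f"cluster: {name}",
        f"cpus: {int(cluster.get('cpus'))}",
    ])


site = ET.fromstring(SITE_CONFIG)
site_name = site.get("name")
classads = [to_classad(site_name, c) for c in site.findall("cluster")]
ldif_entries = [to_ldif(site_name, c) for c in site.findall("cluster")]

print(classads[0])
print(ldif_entries[1])
```

Keeping one source document and deriving every other representation from it means that a change to the site description only has to be made in one place.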

13 Overview (next section: Fabric-level Management): Running jobs on grid resources, Local sandbox management, The DZero Application Framework.

14 Running jobs on Grid resources. The trend: Grid resources are not dedicated to a single experiment. Translation: no daemons running on the worker nodes of a batch system, and no experiment-specific software installed.

15 Running jobs on Grid resources. The situation today is in transition: Worker nodes typically access the software via a shared FS, which is not scalable. Generally, experiments can install specific services on a node close to the cluster. Local resource configuration is still too diverse to plug easily into the Grid.

16 The JIM local sandbox management: It keeps the job executable (from the Grid) at the head node and knows where its product dependencies are. It transports the software to the worker node and installs it there. It can instantiate services at the worker node. It sets up the environment for the job to run. It packages the output and hands it over to the Grid, so that it becomes available for download at the submission site.
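The sandbox steps listed on slide 16 can be sketched as a small driver. This is not the JIM sandbox; the function name, the `SANDBOX_DIR` variable, and the `output` directory convention are assumptions for illustration. The sketch installs "products" into a scratch area, sets up an environment, runs the job there, and packages the output as a tarball that could be handed back to the submission side.

```python
import os
import subprocess
import sys
import tarfile
import tempfile
from pathlib import Path


def run_in_sandbox(command, products, output_tar):
    """Run `command` in a scratch directory and tar up its output directory.

    `products` maps relative paths to file contents and stands in for the
    software and configuration shipped to the worker node.
    """
    with tempfile.TemporaryDirectory() as scratch:
        work = Path(scratch)
        # Transport and install: place the product files in the scratch area.
        for rel_path, content in products.items():
            target = work / rel_path
            target.parent.mkdir(parents=True, exist_ok=True)
            target.write_text(content)
        # Set up the environment for the job (SANDBOX_DIR is a made-up name).
        env = dict(os.environ, SANDBOX_DIR=str(work))
        out_dir = work / "output"
        out_dir.mkdir()
        result = subprocess.run(command, cwd=work, env=env)
        # Package the output so it can be handed back to the submitter.
        with tarfile.open(output_tar, "w:gz") as tar:
            tar.add(out_dir, arcname="output")
        return result.returncode


# Toy "job": upper-case a shipped configuration file into the output area.
job = [
    sys.executable,
    "-c",
    "from pathlib import Path; "
    "Path('output/result.txt').write_text("
    "Path('conf/card.txt').read_text().upper())",
]
dest = Path(tempfile.mkdtemp()) / "job_output.tar.gz"
exit_code = run_in_sandbox(job, {"conf/card.txt": "events=10"}, dest)

with tarfile.open(dest) as tar:
    packaged = tar.getnames()
    payload = tar.extractfile("output/result.txt").read().decode()
print(exit_code, packaged, payload)
```

The scratch directory is removed when the job ends, so nothing has to be pre-installed or left behind on a non-dedicated worker node; only the output tarball survives.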

17 Running a DZero application. We have the JIM sandbox, so where is the problem now? The JIM sandbox could immediately use the DZero Run Time Environment (RTE), but not all DZero packages are RTE compliant, and users don’t have experience with, or incentives for, using it today.

18 Overview (next section: Running MC at UWisc).

19 Running Monte Carlo at UWisc: The University of Wisconsin offered DZero the opportunity to use a 1000-node non-dedicated Condor cluster. We are concentrating on putting it to use to run MC with mc_runjob (in production by year end).

20 The challenges I: MC code is not RTE compliant today. The MC chain has 3-5 stages; each binary is 50-200 MB and dynamically linked. The binaries are compiled from 40 packages (out of 621 in total for D0), and these packages are needed at run time for RPC files. Root, Motif, X11, and Ace libraries appear as dependencies (for MC generators…). MC tarballs exist but are hand-crafted (and bug-prone) every time. The unpacked size is 2 GB (versus 12-15 GB for the full D0 app tree).
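Since the binaries are dynamically linked and the tarballs are assembled by hand, one step that lends itself to automation is listing the shared libraries a binary needs. The sketch below is not part of JIM or mc_runjob. It parses text in the usual `ldd` output format, and the sample text, library paths, and function name are made up for illustration.

```python
def parse_ldd(output):
    """Split ldd-style output into resolved libraries and missing ones.

    Returns (resolved, missing): a dict mapping library name to path, and a
    list of library names reported as "not found".
    """
    resolved, missing = {}, []
    for line in output.splitlines():
        line = line.strip()
        if "=>" not in line:
            continue  # e.g. the dynamic loader line, which has no "=>"
        name, _, rest = line.partition("=>")
        name, rest = name.strip(), rest.strip()
        if rest.startswith("not found"):
            missing.append(name)
        elif rest.startswith("("):
            continue  # virtual library: an address but no file on disk
        else:
            resolved[name] = rest.split(" (")[0].strip()
    return resolved, missing


# Made-up sample in ldd's format; a real run would capture `ldd <binary>`.
SAMPLE_LDD_OUTPUT = """
    libX11.so.6 => /usr/X11R6/lib/libX11.so.6 (0x40020000)
    libXm.so.2 => not found
    libc.so.6 => /lib/libc.so.6 (0x40100000)
    linux-gate.so.1 => (0xffffe000)
    /lib/ld-linux.so.2 (0x40000000)
"""

resolved, missing = parse_ldd(SAMPLE_LDD_OUTPUT)
print(sorted(resolved), missing)
```

The resolved paths are candidates for inclusion in a tarball, and the missing list flags a node where the job would fail at load time. This covers only shared libraries, not the run-time package files the slide also mentions.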

21 The challenges II: Just about every advanced C++ feature, every libc library call, and every system call is used. One can get different results on two RedHat 7.2 systems. The total release tree takes N hours (up to 20+) to build, which is not easy to do dynamically at a remote site.

22 Summary: The SAM-Grid offers an extensible working framework for Grid-level Job/Data/Info Management. JIM provides Fabric-level management tools for sandboxing. The applications need to be improved to run on Grid resources.

