Report on HEPiX Spring 2005, Forschungszentrum Karlsruhe, 9-13 May
Storage and data management
Batch scheduling workshop
Storage and data management (1)
Five miscellaneous talks, then a session on Grid service challenges and new software from the GD group.
ENSTORE at FNAL:
–Overview of data management at FNAL. Nothing new except the need to checksum at all stages of file movement. Several (corrected) failures per week.
Performance results on Panasas, Lustre and AFS file systems from CASPUR (Rome):
–Borrowed NAS switches and DataDirect, Infortrend and Panasas (intelligent) disk trays.
–Compared I/O performance for the different hardware and file systems; results ranged from 350 MB/s for AFS to 800 MB/s for Panasas.
–Different HEP workload types were also tested (a simple throughput sketch follows below).
–Prices are now 2 to 4 Euro per GB for good performance.
–Comprehensive results – to be looked at.
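A measurement of this kind boils down to timing large sequential writes on each file system. The following is only a minimal sketch of such a measurement in Python, not the CASPUR benchmark itself; the path and sizes are placeholders.

import os
import time

def write_throughput(path, total_mb=1024, block_mb=8):
    """Write total_mb MB to path in block_mb MB blocks and return MB/s."""
    block = b"\0" * (block_mb * 1024 * 1024)
    start = time.time()
    with open(path, "wb") as f:
        for _ in range(total_mb // block_mb):
            f.write(block)
        f.flush()
        os.fsync(f.fileno())  # make sure the data really reaches the file system
    return total_mb / (time.time() - start)

if __name__ == "__main__":
    # Point this at a file on the file system under test (AFS, Lustre, Panasas, ...).
    print("%.0f MB/s" % write_throughput("/tmp/iotest.dat"))

A real comparison would also cover reads, many concurrent clients and the HEP workload mixes mentioned in the talk.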
Storage and data management (2)
IBRIX Fusion file system for US CMS at FNAL:
–Evaluation started in autumn 2002 for shared, highly reliable 'user disk' space for US CMS, holding code, work areas and mini-DSTs.
–Chose IBRIX, a commercial software solution. Can use IBRIX clients or NFS mounts.
–Had major stability problems for 12 months, but it is now in use. Will grow to 30 TB.
Xrootd infrastructure at RAL:
–Extended ROOT daemon developed at SLAC/INFN.
–Single name space.
–Client connections are redirected, with load balancing, to the servers hosting the data (a conceptual sketch follows below).
–Very thorough failover architecture, including for open files.
FNAL SATA disk experiences:
–Does not give the same performance as more expensive architectures.
–Evaluate the total cost of ownership and carefully select vendors.
–You get what you pay for!
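The redirection idea can be pictured as a simple lookup: a redirector holds the single name space and points each client at a lightly loaded server that has the file, and the client reopens the file there. This is only a conceptual sketch in Python, not the xrootd protocol; the catalogue and load figures are invented.

# Toy redirector: map a logical path to the least-loaded server holding it.
catalogue = {
    "/store/run123/events.root": ["server-a", "server-b"],
    "/store/run124/events.root": ["server-b"],
}
load = {"server-a": 0.7, "server-b": 0.2}  # hypothetical load metric

def redirect(logical_path):
    """Return the data server the client should reopen the file against."""
    hosts = catalogue.get(logical_path)
    if not hosts:
        raise FileNotFoundError(logical_path)
    return min(hosts, key=lambda h: load[h])

print(redirect("/store/run123/events.root"))  # -> server-b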
Grid Service Challenges (1)
US ATLAS Tier 1 facility at BNL, planning for SC3:
–Lessons from SC2 include the need for multiple TCP streams and parallel file transfers to fill up the network pipe (see the sketch below).
–Found sluggish parallel I/O with ext2 and ext3 file systems; XFS was better.
–Goal is 150 MB/s from CERN to disk, 60 MB/s to tape.
–Use dCache in front of HPSS. Will need more than 2 tape drives.
–Will select a small number (2) of Tier 2 sites for one-way send transfers.
LCG Tier 2 at LAL/DAPNIA:
–Building a Tier 2 facility for simulation and analysis with LAL (Orsay), DAPNIA (Saclay) and LPNHE (Paris). Investing 1.7 MEuro up to 2007 in 1500 kSI2k of CPU and 350 TB of disk.
–Efficient use and management of storage is seen as the main challenge.
–Will participate in SC3 (no details).
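The SC2 lesson is essentially that one stream rarely saturates a long, fat pipe, so several transfers are kept in flight at once. Below is a hedged sketch of that pattern in Python using a thread pool over plain local copies; the real transfers used the Grid transfer tools, and the files here are generated on the fly just so the example runs.

import os
import shutil
import tempfile
from concurrent.futures import ThreadPoolExecutor

def copy_one(pair):
    src, dst = pair
    shutil.copyfile(src, dst)  # one transfer; several run concurrently
    return dst

# Build a few dummy source files so the sketch runs anywhere.
workdir = tempfile.mkdtemp()
transfers = []
for i in range(4):
    src = os.path.join(workdir, "in_%d.dat" % i)
    dst = os.path.join(workdir, "out_%d.dat" % i)
    with open(src, "wb") as f:
        f.write(os.urandom(1024 * 1024))
    transfers.append((src, dst))

# Several copies in flight at once, analogous to parallel streams/transfers.
with ThreadPoolExecutor(max_workers=4) as pool:
    for done in pool.map(copy_one, transfers):
        print("done:", done)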
Grid Service Challenges (2)
Ramping up the LCG service:
–Jamie Shiers' talk, well delivered by Sophie Lemaitre.
–SC2 met its throughput goals, with more than 600 MB/s sustained for 10 days and with more sites than planned. It still cannot be called a service.
–For SC3, the gLite file transfer software and an SRM service need to be widely deployed.
Service Challenge 3 - Phases
High-level view:
Setup phase (includes the throughput test)
–Two weeks sustained in July 2005; the "obvious target" is the GDB of July 20th.
–Primary goals: 150 MB/s disk to disk to Tier 1s; 60 MB/s disk (T0) to tape (T1s).
–Secondary goals: include a few named T2 sites (T2 -> T1 transfers); encourage the remaining T1s to start disk-to-disk transfers.
Service phase
–September to end 2005.
–Start with ALICE and CMS; add ATLAS and LHCb in October/November.
–All offline use cases except for analysis.
–More components: WMS, VOMS, catalogues, experiment-specific solutions.
–Implies a production setup (CE, SE, …).
New software from the GD group
Updates on talks already given to the recent GD workshop.
LHC File Catalog (J-P. Baud):
–Replaces the EDG Replica Location Service.
–Fixes the scalability and performance problems found in the RLS.
gLite File Transfer Service (S. Lemaitre):
–Its only functionality is file transfer.
–Will be distributed with LCG but can run stand-alone.
Lightweight Disk Pool Manager (J-P. Baud):
–Similar to dCache but much easier to install and configure.
–Thoroughly tested.
–Intended for Tier 2 sites to satisfy the gLite requirement for an SRM (Storage Resource Manager) interface.
Batch scheduling workshop
Aim: to enhance communication between users and developers of local resource scheduling (LRS) systems and Grid-level resource scheduling.
HEP sites use various systems giving fine-grained control over heterogeneous local systems and applying local policies. Grid scheduling often assumes that sites are homogeneous and equally available to all virtual organisations (which for HEP means the LHC experiments plus grid developers).
Can Grid-level scheduling reflect local scheduling, and if not, what should sites do?
Sessions
Thursday morning: how local batch schedulers are used at HEP sites.
–There were 9 site reports, which answered this question.
Thursday afternoon: local and grid schedulers, status and plans.
–There were two reports on sites' views of their problems with the Grid-LRS interfaces.
–Four commercial scheduler vendors presented their plans, but not in relation to any HEP grid activities.
–An overview of Condor (the best supported model for an EGEE LRS).
–A report on an EGEE gLite interface to LRS.
Friday morning: developing a common batch interface.
–A GLUE schema status-and-plans talk, followed by a discussion.
–A proposal to standardise the sets of environment variables available to the LRS, relating to local and Grid attributes.
How are local batch schedulers used at HEP sites?
4 Platform LSF: CERN, SLAC, JLAB, BNL
2 Sun Grid Engine: London e-Science, DESY
1 OpenPBS: JLAB Lattice QCD cluster (will drop LSF)
1 Torque (an OpenPBS variant) with the Maui scheduler: RAL
FNAL: home-grown, changing to Condor
BNL: changing from LSF to Condor
IN2P3: home-grown
All sites have (very) heterogeneous hardware.
All sites use (or are going to use) the same farms for grid and non-grid work.
Most have local groups and grid VO allocations and use a fair-share mechanism.
All sites have CPU-time-based queues but allow other resources to be specified; the most common are work space and memory (see the sketch below).
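To make the last point concrete, here is a minimal sketch of a submission that selects a queue by CPU time and additionally requests memory and scratch space, written for a Torque/PBS-style site. The queue name, limits, resource keywords and the run_analysis executable are illustrative assumptions; each site has its own conventions and the resource names differ between schedulers.

import subprocess
import tempfile

# The directives below ask for 12 hours of CPU time (which effectively picks
# the queue), 1 GB of memory and 5 GB of scratch space.
job = """#!/bin/sh
#PBS -q medium
#PBS -l cput=12:00:00
#PBS -l mem=1gb
#PBS -l file=5gb
cd $TMPDIR
./run_analysis
"""

with tempfile.NamedTemporaryFile("w", suffix=".pbs", delete=False) as f:
    f.write(job)
    script = f.name

# Hand the script to the local scheduler (requires a Torque/PBS installation).
subprocess.run(["qsub", script], check=True)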
Local and grid schedulers: status and plans (1)
Personal 'musings' from Jeff Templon (NIKHEF):
–We need to be able to give local allocations among VOs but allow one VO to use what the others do not. A small number of high-priority operations jobs also needs to run, and maybe cycle scavengers.
–Users range from polite to sneaky. Some experiments over-submit pilot jobs (the real job is in a database) to many sites, which then spend many minutes scheduling large numbers of do-nothing jobs.
–If we are only running one VO's jobs then another VO would get the next free slot, but how do we publish this fact to the Grid?
–Efficient usage of local resources will need reasonable job run-time estimates, in normalised units, to be attached to jobs.
–We need self-disabling sites (or worker nodes at a site) to stop the problem of black holes 'eating' job queues with serial failures (a sketch of this idea follows below).
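One concrete idea from the list above is the self-disabling worker node: if a node fails several jobs in a row, each after only a few seconds, it is probably a black hole and should take itself out of service instead of draining the queue. The sketch below only illustrates that heuristic; the thresholds and the disable hook are assumptions, not an existing tool.

MAX_FAILURES = 5   # consecutive failures before the node gives up
MIN_RUNTIME = 30   # seconds; anything shorter looks like a black hole

class BlackHoleGuard:
    def __init__(self):
        self.short_failures = 0

    def record(self, exit_code, runtime_seconds):
        """Call after every job; disable the node on repeated fast failures."""
        if exit_code != 0 and runtime_seconds < MIN_RUNTIME:
            self.short_failures += 1
        else:
            self.short_failures = 0
        if self.short_failures >= MAX_FAILURES:
            self.disable()

    def disable(self):
        # Site-specific hook: e.g. close the local queue or stop accepting jobs.
        print("node disabled: repeated fast failures suggest a black hole")

guard = BlackHoleGuard()
for code, runtime in [(1, 3), (1, 2), (1, 4), (1, 1), (1, 2)]:
    guard.record(code, runtime)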
Local and grid schedulers: status and plans (2)
BQS problems and solutions for the Grid (IN2P3):
–A well structured list of the problems currently seen with the Grid-BQS interfaces, but they are common to most sites. The local solutions are mostly temporary, pending hoped-for Grid improvements.
–The Resource Broker does not pass important scheduling information, such as the requested CPU or memory, to the Compute Element (the BQS interface). IN2P3 assume the maximum resources are required, but this leads to reduced overall efficiency.
–Grid certificates map to local team accounts. Fast traceback to the real user is needed for resolving problems.
–Grid job stdout and stderr files are hidden under unique subdirectories, making local problem debugging difficult. They request that the RB somehow indicate these names.
Local and grid schedulers: status and plans (3)
Platform LSF:
–The current release has been tested to support 5000 execution hosts, large numbers of queued jobs and 100 users simultaneously using the server.
IBM LoadLeveler:
–The main new feature is advance reservation of resources, mainly used for weather forecasting.
Sun Grid Engine:
–The commercial product is now called N1 Grid Engine 6. It is not clear whether this is the same as the open-source version. Advance reservation has also been added.
PBSPro:
–Has also released advance reservation.
Commercial vendors respond to the requirements of their (high-paying) customers!
Local and grid schedulers: status and plans (4)
Condor and Grid challenges:
–This was an overview of Condor and some of its plans.
–A very rich R&D programme with many components (9 million lines of code).
–Well supported, with 35 staff, many of them permanent.
–They have released Condor-G, which sits between the user application and the grid middleware. Condor worker-node pools sit between the middleware and the local fabric.
–Solutions to the mismatch between local and grid scheduler capabilities involve 'gliding in' a Condor startd under the middleware, or a whole new job manager called Stork; however, I could not relate the resulting architecture diagram to our grid architecture.
–Please read the talk.
–It was followed by a long, rambling discussion.
Local and grid schedulers: status and plans (last)
The last talk was on BLAHP:
–The EGEE Batch Local ASCII Helper, from the EGEE JRA1 group.
–A gLite component used by Condor-C and CREAM (?) to manage jobs on the batch systems. It could easily be interfaced to other systems.
–It specifies 3 external scripts for local batch systems (a minimal sketch of this contract follows below):
Submit a job with parameters such as the queue name; returns a status.
Query a job's status; must return this for running and finished jobs (for some time) and distinguish between successful and failed jobs.
Cancel a job, returning success or failure.
–To address overloading with status queries they have developed a caching server that monitors the batch system's job logs.
–To address job proxy expiry they have developed a proxy receiver, started by the job wrapper on each worker node, which listens for an updated proxy from the CE.
–This was an important talk but the only one from gLite. How does it relate to the GLUE schema (see the next talk), and what does it imply for EGEE job submission User Interfaces and Resource Brokers?
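The three-script contract can be pictured as follows. This is a hedged sketch in Python of what such helpers might do for a Torque/PBS-like system, not the real BLAHP scripts; their exact arguments and output formats are not reproduced here, and a real status helper must keep its own record of finished jobs rather than relying on qstat alone.

import subprocess

def submit(queue, executable):
    """Submit a job to the local batch system; return the local job id."""
    out = subprocess.run(["qsub", "-q", queue, executable],
                         capture_output=True, text=True, check=True)
    return out.stdout.strip()

def status(job_id):
    """Say whether the job is still active or has finished.
    A real helper must also distinguish success from failure for jobs that
    have already left the queue, e.g. from cached job-log information."""
    res = subprocess.run(["qstat", job_id], capture_output=True, text=True)
    return "active" if res.returncode == 0 else "finished"

def cancel(job_id):
    """Cancel the job; return True on success."""
    return subprocess.run(["qdel", job_id]).returncode == 0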
Developing a common batch interface
GLUE schema: status and plans (Laurence Field)
–GLUE stands for Grid Laboratory Uniform Environment.
–Started in 2002 with no real users; the design had no real VO concept.
–Describes the attributes and groupings that are published to the web.
–The CE (usually a site) on which to run a job is chosen from the best ETT (estimated job traversal time). The ETT calculation needs improving.
–Uses dynamic plugins to interface to each LRS (a toy example follows below). Volunteer sites were asked to support the dynamic plugins; CERN will do the LSF ones.
–The next version will allow multiple VOs per queue with separate ETT calculations. It still assumes homogeneous hardware.
–Work on a new version will start in November 2005 and it does not have to be backwards compatible – give your input!
–Observation: the relationship with gLite is unclear to me. I learned later that EGEE will probably support GLUE. There will be parallel EDG and gLite job management components that can coexist at a site (but maybe not on the same grid server hosts).
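As an illustration only, a dynamic plugin is essentially a small program that queries the local scheduler and prints a few per-queue attributes, including an ETT-like estimate. The attribute names below loosely mimic the GLUE CE schema, and the ETT formula is an invented placeholder, not the one used by the real information providers.

def publish_queue(queue, running, waiting, free_slots, avg_job_seconds):
    """Print GLUE-style attributes for one queue (toy dynamic plugin)."""
    # Naive ETT: zero if a slot is free, otherwise scale the backlog by the
    # average job length and the number of slots currently working on it.
    ett = 0 if free_slots > 0 else (waiting * avg_job_seconds) // max(running, 1)
    print("GlueCEName:", queue)
    print("GlueCEStateRunningJobs:", running)
    print("GlueCEStateWaitingJobs:", waiting)
    print("GlueCEStateFreeCPUs:", free_slots)
    print("GlueCEStateEstimatedResponseTime:", ett)

publish_queue("lsf-short", running=420, waiting=80, free_slots=0, avg_job_seconds=3600)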
Common batch interface: some discussion points
There are mismatches between local and grid-level scheduling, problems with user identity mapping, and exposed resources that are not acted on.
The industrial partners did not see any fast solutions coming from their side.
Abusive users should be punished (although if the ETT predictions were correct there should be no need for abuse!).
Should jobs be pulled or pushed to worker nodes? There was more enthusiasm for pull, but not yet.
The Condor ClassAds mechanism (a language to describe resources and policy) should be supported. Sites would like, for example, to publish local resources per VO.
HEPiX encourages resource broker developers and LRS managers to find a forum in which to meet.
Proposal for standardising the working environment for an LCG/EGEE job
From the CCIN2P3 Grid computing team. They proposed several sets of environment variables and a naming convention to be implemented on all LCG/EGEE compute elements:
–POSIX base: HOME, PATH, PWD, SHELL, TMPDIR, USER
–Globus: GLOBUS_LOCATION, GLOBUS_PATH, ...
–EDG: EDG_LOCATION, EDG_WL_JOBID, ...
–LCG: LCG_LOCATION, LCG_CATALOG_TYPE, ...
–Middleware independent: GRID_WORKDIR, GRID_SITENAME, GRID_HOSTNAME, ...
This was welcomed by all as part of solving our problems (e.g. one variable contains the attributes of the certificate of the submitting user, such as their address). They will distribute a document to site administrators and applications managers for comments, aiming to start deployment at the end of June.
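To show how a job might consume these variables, here is a small sketch that collects a few of them for a log file. The variable names come from the proposal; whether and how they are set on a given compute element is an assumption, so missing ones are reported as unset.

import os

def grid_job_context():
    """Gather a few of the proposed variables for logging/debugging."""
    keys = ["GRID_WORKDIR", "GRID_SITENAME", "GRID_HOSTNAME",
            "EDG_WL_JOBID", "LCG_LOCATION", "TMPDIR", "USER"]
    return {k: os.environ.get(k, "<unset>") for k in keys}

if __name__ == "__main__":
    for key, value in grid_job_context().items():
        print(key + "=" + value)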
My conclusions from the batch workshop
We have a good understanding of what Local Resource Schedulers can do and how they are being used.
EGEE JRA1 should be encouraged to support the GLUE schema, and both sets of developers should communicate.
HEP sites should volunteer to support GLUE schema interfaces to their LRS and provide input for the new version.
The workshop in itself did not enhance communications with the grid developers as much as hoped, but we must make sure this happens (e.g. as at the recent EGEE/LCG Grid Operations workshop in Bologna).