Presentation is loading. Please wait.

Presentation is loading. Please wait.

ALICE and LCG Stefano Bagnasco I.N.F.N. Torino

Similar presentations


Presentation on theme: "ALICE and LCG Stefano Bagnasco I.N.F.N. Torino"— Presentation transcript:

1 ALICE and LCG Stefano Bagnasco I.N.F.N. Torino
Tier-2 Referee Meeting Torino Nov 28, 2005 ALICE and LCG Stefano Bagnasco I.N.F.N. Torino EGEE is a project funded by the European Union under contract IST

2 What do we need to do? Configure, submit and track jobs
User interface with massive production support Job DB (Production and user) Job monitoring Install software on sites Package Managers Distribute and execute (possibly interactive) jobs Workload Management System (Broker, L&B,…) Computing Element software Information Services Interactive analysis jobs Store and catalogue data Data catalogues (file, replica, metadata, local,…) Storage Element software Move data around File Trasfer services and schedulers Access data files I/O services Prestaging services Monitor all that stuff Monitoring infrastructure Sensors Presentation ..on top of that: Enforce security! Tier 2 Referee Meeting  Torino November 28,

3 The purveyors Tier 2 Referee Meeting  Torino November 28,

4 AliEn “ALIce ENvironment” started as early as 2000, using as much as possible existing, open-source, standard technology (<5% “native” code) Core services Task Queue Data & Metadata Catalogue Job & Transfer optimizers Monitoring (MonALISA from Caltech) Stand-alone site services CE, SE, FTD, network proxy (ClusterMonitor), monitoring producers PackMan,… Tier 2 Referee Meeting  Torino November 28,

5 Towards integration Several Grid infrastructures are available: LCG, INFNGRID, possibly others, maybe in the U.S. Lots of resources but, in principle, different middlewares Pull-model is well-suited for implementing higher-level submission systems, since it does not require knowledge about the periphery, that may be very complex: “A Grid is a system that […] coordinates resources that are not subject to centralized control […] using standard, open, general-purpose protocols and interfaces […] to deliver nontrivial qualities of service.” I. Foster “What is the Grid? A three Point Checklist” Grid Today (2001) Tier 2 Referee Meeting  Torino November 28,

6 Boundary conditions Many tools not yet industrial-strength
Or even not existing… ALICE will use a single catalogue Not one per grid! Lots of data already registered there LCG requires access through “official” common software stack WMS Monitoring AliEn/gLite integration a sociologist’s dream But a physicist’s nightmare… Some services interoperable, some a bad case of round hole and square peg Tier 2 Referee Meeting  Torino November 28,

7 The first version – “Interface sites”
AliEn CE LCG UI AliEn SE LCG RB Job submission Server LCG Site EDG CE LCG SE Status report grid012.to.infn.it WN AliEn The philosophy: A whole Grid (namely, a Resource Broker) is seen by the server as a single AliEn CE, and the whole Grid storage is seen as a single, large AliEn SE. Data Registration Data Catalogue Replica Catalogue Data Registration Tier 2 Referee Meeting  Torino November 28,

8 PDC 2004 Phase 1: Production of RAW and shipment to CERN
Large output files (up to 1GB/event in ~25 files) 1a: Central events (long jobs, large files) 1b: Peripheral events (short jobs, smaller files) Alice::Torino::LCG Interface to INFNGRID Alice::CERN::LCG Interface to LCG-2 Tier 2 Referee Meeting  Torino November 28,

9 Lessons learned LCG Workload Management proved itself slow but reasonably robust We never lost a job due because of RB failures The remote site configuration was the major source of problems, LCG-side. And got worse with the use of LCG storage… Software management tools were (and still are) rudimentary Tier-1s have often tighter security restrictions & other idiosincracies Investigating and fixing problems is hard and time-consuming The most difficult part of the management is monitoring LCG through a “keyhole”. Only integrated information available natively MonALISA for AliEn, GridICE for LCG Some safety mechanisms are too coarse for this approach (queue blocking) Anti-blackhole mechanisms needed Based on run time statistics, CPUTime/WallClockTime ratio… Or remove the problem by using Job Agents For shorter jobs, submission time (and thus the interface system performance) can limit the number of jobs But the system is trivially scalable Tier 2 Referee Meeting  Torino November 28,

10 Recent developments JobAgents Distributed data catalogue
Following LHCb example Create a safe playpen before starting real job Distributed data catalogue Core catalogue is a StorageIndex (to be compatible with RB) Local catalogues may have different flavours (including LFC) Tighter integration with root Xrootd replaces aiod and gLite I/O (will run on top of SRM/DPM/…) Proof for interactive analysis Better security Apache service containers All communications secure (still need some logging) LCG-compatible user authentication Tier 2 Referee Meeting  Torino November 28,

11 The VO-Box New concept: VO-Specific services
Also called “Edge services” The ALICE VO-Box software was developed by SB and Pablo Saiz in Torino and at CERN “Thin” and “virtual” services Minimise impact on site operations Minimise duplication of functionality and code Fast-changing service, in production environment “Extreme programming” paradigm: release early, release often Tune the services in production environment Don’t waste CPUs (and manpower): merge Service Challenge and Data Challenge efforts Tier 2 Referee Meeting  Torino November 28,

12 The VO-Box cont’d Interface services… …and some added functionality
“CE” interface between TQ and RB “SE” interface to SRM (via xrootd) and LFC “FTD” interface to FTS …and some added functionality ClusterMonitor to proxy communications between WNs and the core services (reduced need for outbound connectivity) PackMan for software management (no common replacement currently exists) Agents monitoring Tier 2 Referee Meeting  Torino November 28,

13 Current version: site VO-Box
LCG RB Job submission TQ VO-Box LCG Site File Catalogue CE EDG CE LCG SE SE LFN Registration Request configuration PackMan WN JobAgent The philosophy: A set of interface services is run on a dedicated machine at each site, (mostly) providing common interfaces to underlying site services PFN Registration LFC Tier 2 Referee Meeting  Torino November 28,

14 Data Access model VO-Box FTS SERVER LCG Site CE xrootd SE SE SRM
File Catalogue VO-Box LCG Site CE xrootd SE (DPM) SE SRM PackMan xrootd GridFTP WN JobAgent LFC Tier 2 Referee Meeting  Torino November 28,

15 What do we do? Configure, submit and track jobs
User interface with massive production support Job DB (Production and user) Job monitoring Install software on sites Package Managers Distribute and execute jobs Workload Management System (Broker, L&B,…) Computing Element software Information Services Interactive analysis jobs Store and catalogue data Data catalogues (file, replica, metadata, local,…) Storage Element software Move data around File Trasfer services and schedulers Access data files I/O services Prestaging services Monitor all that stuff Monitoring infrastructure Sensors Presentation ..on top of that: Enforce security! MIXED COMMON ? COMMON MIXED MonALISA PROOF Tier 2 Referee Meeting  Torino November 28,

16 LCG as seen from ALICE Tier 2 Referee Meeting  Torino November 28,

17 PDC2005 (and LCG SC3) Better monitoring
Can gather info from single sites without any special trick Better use of local site services At the expense of some abstraction Locally optimised JobAgent submission prevents “JobAgent” floods But things can still be improved… Tier 2 Referee Meeting  Torino November 28,

18 Future developments Submission service Interactive analysis
Was a geographic service, it’s now local but can be “virtualized” Or better to have a broker “plugin” pulling directly from TQ? Interactive analysis Need very efficient storage access, much work still to be done Root →proof→xrootd→SRM→DPM But minimizing number of VO-specific services! Very fast development In a nearly-production environment Torino is currently the “development” site for VO-Box and site services Tier 2 Referee Meeting  Torino November 28,


Download ppt "ALICE and LCG Stefano Bagnasco I.N.F.N. Torino"

Similar presentations


Ads by Google