Update on experiment requirements
Federico Carminati, CERN, Geneva

2 At the beginning there was BS (Baseline Services)

4 BS WG conclusions

Service                            ALICE  ATLAS  CMS  LHCb
Storage Element                    A      A      A    A
Basic Transfer Tools               A      A      A    A
Reliable File Transfer Service     A      A      A/B  A
Catalogue Service                  B      B      B    B
Catalogue and DM tools             C      C      C    C
Compute Element                    A      A      A    A
Workload Management                A/B    A      A    C
VO agents                          A      A      A    A
VOMS                               A      A      A    A
Database Services                  A      A      A    A
Posix I/O                          C      C      C    C
Application Software Installation  C      C      C    C
Job Monitoring Tools               C      C      C    C
Reliable Messaging Service         C      C      C    C
Information System                 A      A      A    A

A: common solution mandatory; B: common solution required, an alternative exists; C: common solution desirable

5 Then came SC3

6 SC3
Most of the experiments planned to verify (one part of) their computing model
– One central point was the export of data from the T0 to the T1s
A large number of problems emerged
– FTS: the migration from "test-bed" to production mode exposed the functional limitations and the stability problems of the service
– CASTOR 2: integration into the experiment frameworks proved difficult; performance, stability and functionality were limited
– SRM implementations
– LFC performance
– Stability and functionality of sites
Access to data still seems problematic
Target performance figures were not reached
It is difficult to reconcile experiment and service goals without overloading either

7 SC3
Negative side
– We reached our end-to-end goals only partially
– Tests were limited to a few services, and not the same ones for all the experiments
– The computing models are still untested in full
– The fact that we could not fully use the resources we asked for is turning into a political embarrassment
Positive side
– Tremendous learning exercise
– Many of the problems that were there have been solved
– Everybody has been very collaborative and helpful
– The task forces have proved to be effective tools for cooperation
We now have a clearer idea of our requirements

8 Summary of requirements

9 Security, authorisation, authentication
VOMS available and stable
Groups and roles used by all middleware
– How are they propagated to the WN and the DN?
Automatic proxy and Kerberos credential renewal (a renewal sketch follows below)
Recommendations on how to develop experiment-specific secure services
– Secure services using delegated and automatically renewed user credentials
– An API or "development guide" for security delegation
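
As an illustration of the credential-renewal requirement (not part of the original slides): a minimal sketch of the loop a long-lived experiment service could run, assuming the standard voms-proxy-info / voms-proxy-init command-line tools are available and that a renewable credential is provided out of band. The VO name and threshold are placeholders.

```python
# Hedged sketch: periodic VOMS proxy renewal for a long-lived experiment service.
# Assumes the voms-proxy-info / voms-proxy-init CLIs are installed and that a
# renewable credential (e.g. via MyProxy) is configured out of band.
import subprocess

VO = "alice"            # placeholder VO name
MIN_LIFETIME = 3600     # renew when less than one hour is left (seconds)

def proxy_timeleft() -> int:
    """Return the remaining proxy lifetime in seconds (0 if no valid proxy)."""
    try:
        out = subprocess.run(["voms-proxy-info", "--timeleft"],
                             capture_output=True, text=True, check=True)
        return int(out.stdout.strip())
    except (subprocess.CalledProcessError, ValueError):
        return 0

def renew_proxy() -> None:
    """Create a fresh proxy carrying the VOMS attributes of the configured VO."""
    subprocess.run(["voms-proxy-init", "--voms", VO], check=True)

if proxy_timeleft() < MIN_LIFETIME:
    renew_proxy()
```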

10 Information System
Stable access to a static Information System
– BDII or equivalent (!); an example query is sketched below
– Split static and dynamic information?
Access to the IS from the WN
Publishing of experiment-specific information
More complete use of the Glue schema
– It should be the same in gLite and LCG
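
For illustration (mine, not the slides'): the BDII is an LDAP server, so a static query can be expressed as a plain LDAP search over the Glue schema. The sketch below wraps the ldapsearch CLI and assumes the conventional port 2170 and base "o=grid"; the BDII host name is a placeholder.

```python
# Hedged sketch: query a BDII (LDAP) for the storage elements published in the
# Glue schema. The BDII host is a placeholder; port 2170 and base "o=grid" are
# the usual conventions, but individual deployments may differ.
import subprocess

BDII = "ldap://lcg-bdii.example.org:2170"   # placeholder endpoint

def list_storage_elements() -> list[str]:
    """Return the GlueSEUniqueID values of all published storage elements."""
    out = subprocess.run(
        ["ldapsearch", "-x", "-LLL", "-H", BDII, "-b", "o=grid",
         "(objectClass=GlueSE)", "GlueSEUniqueID"],
        capture_output=True, text=True, check=True)
    return [line.split(":", 1)[1].strip()
            for line in out.stdout.splitlines()
            if line.startswith("GlueSEUniqueID:")]

if __name__ == "__main__":
    for se in list_storage_elements():
        print(se)
```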

11 Storage Management
SRM 2.1 supported by all SE services
– Space reservation, file pinning, bulk operations
– Disk quota management at group and user level
Homogeneous, consistent and efficient implementations
– Smooth transition from SRM 1 to SRM 2
– SE interoperability issues must be solved
– Operations should be "safe"
SRM client libraries should be available to the applications (a SURL-to-TURL sketch follows below)
– Full SRM 2.1 for CASTOR, dCache and DPM not before 3Q06?
– Also need xrootd -> SRM and SRM -> xrootd (?)
CASTOR 2
– Pre-staging for file transfer
– Coherent usage of multiple pools
Do we have a consistent model for data access?
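
A small illustration (not from the slides) of the kind of SRM interaction the client libraries have to hide from applications: asking the storage element, through its SRM interface, for a transport URL (TURL) corresponding to a given SURL. It wraps the lcg-gt utility from lcg-utils; the SURL and protocol are placeholders and the output parsing is an assumption about the tool's print format.

```python
# Hedged sketch: resolve an SRM SURL into a transport URL (TURL) with the
# lcg-gt client from lcg-utils. SURL and protocol below are placeholders.
import subprocess

def surl_to_turl(surl: str, protocol: str = "gsiftp") -> str:
    """Ask the SRM interface of the SE for a TURL usable with `protocol`."""
    out = subprocess.run(["lcg-gt", surl, protocol],
                         capture_output=True, text=True, check=True)
    # lcg-gt typically prints the TURL first, followed by request identifiers.
    return out.stdout.splitlines()[0].strip()

print(surl_to_turl("srm://se.example.org/dpm/example.org/home/myvo/file.root"))
```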

12 File transfer
Applications would prefer to talk to the FPS (sic!)
– Routing, plug-ins, multiple replicas / broadcast, proxy renewal, error and timeout handling
– But retain control of the FC update via plug-ins
Improved FTS clients on all sites, WNs and VOBOXes (a submission sketch follows below)
Ideally GUID (SE1 -> SE2); however (SURL,SE1) -> (SURL,SE2) is acceptable
A central entry point and a transparent mechanism for all Tx -> Ty transfers
– Going beyond static FTS channels (which are being exposed to users!)
Full integration with SRM
Possibility to specify the type of space, the lifetime of a pinned file, etc.
– Need for tactical "hot" storage
Dynamic priorities
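
To make the FTS-client requirement concrete, here is a rough sketch (my own, not from the slides) of a transfer submission and status poll through the glite-transfer command-line clients, wrapped in Python. The FTS endpoint, SURLs, option names and terminal-state strings should all be treated as assumptions; they varied between FTS releases.

```python
# Hedged sketch: submit a source -> destination copy to FTS with the
# glite-transfer-* CLI clients and poll until a final state is reached.
# Endpoint, SURLs, options and state names are assumptions, not a reference.
import subprocess
import time

FTS = "https://fts.example.org:8443/glite-data-transfer-fts/services/FileTransfer"

def submit(src_surl: str, dst_surl: str) -> str:
    """Submit one file copy and return the FTS job identifier (printed on stdout)."""
    out = subprocess.run(["glite-transfer-submit", "-s", FTS, src_surl, dst_surl],
                         capture_output=True, text=True, check=True)
    return out.stdout.strip()

def wait(job_id: str, poll: int = 30) -> str:
    """Poll the job state until FTS reports what we assume is a terminal state."""
    while True:
        out = subprocess.run(["glite-transfer-status", "-s", FTS, job_id],
                             capture_output=True, text=True, check=True)
        state = out.stdout.strip()
        if state in ("Done", "Finished", "Failed", "Canceled"):
            return state
        time.sleep(poll)
```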

13 File catalogue services
LFC as the global and/or local file catalogue
– Support for replica attributes: tape, tape with cache, pinned cache, disk, archived tape, custodial flag
POOL interface to the LFC
– Access to metadata?
Performance
– Read access, different kinds of queries (a replica-lookup sketch follows below)
– Unauthenticated read-only access; deep queries should be possible for catalogue administrators even with reduced performance
– In general, the access frequency is estimated to be ~1 Hz
Bulk operations for file and replica registration should be supported
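
As a small illustration (not in the original slides) of the read-access pattern the catalogue has to sustain, the sketch below lists the replicas (SURLs) registered for a given LFN by wrapping the lcg-lr client; the VO and LFN are placeholders, and the catalogue endpoint is assumed to come from the LFC_HOST environment variable as usual.

```python
# Hedged sketch: list the SURLs registered in the LFC for one logical file name,
# using the lcg-lr client from lcg-utils. VO and LFN are placeholders; the tool
# picks up the catalogue endpoint from the LFC_HOST environment variable.
import subprocess

def replicas(lfn: str, vo: str = "myvo") -> list[str]:
    """Return the list of SURLs known to the catalogue for `lfn`."""
    out = subprocess.run(["lcg-lr", "--vo", vo, f"lfn:{lfn}"],
                         capture_output=True, text=True, check=True)
    return [line.strip() for line in out.stdout.splitlines() if line.strip()]

for surl in replicas("/grid/myvo/user/f/foo/data.root"):
    print(surl)
```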

14 Grid data management tools
lcg-utils available in production
C/C++ POSIX file access based on the LFN
– gfal library
– Efficient choice of the "best replica" in a running job (a download sketch follows below)
Possibility of multiple instances of the LFC for high availability and efficiency
Reliable registration service
ACL support and propagation
Bulk operations (registration, replication, staging, deletion)
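
For illustration (mine, not the slides'): the job-side access the slide asks for can already be approximated by letting lcg-utils resolve the LFN in the catalogue, pick a replica and copy it to local scratch space, while in-place POSIX access goes through the gfal C library's POSIX-like calls (gfal_open, gfal_read, ...), not shown here. VO, LFN and destination path are placeholders.

```python
# Hedged sketch: copy a file identified by its LFN to local scratch space,
# letting lcg-cp resolve the LFN through the catalogue and choose a replica.
# VO, LFN and destination path are placeholders.
import subprocess

def fetch(lfn: str, local_path: str, vo: str = "myvo") -> None:
    """Download one catalogued file to `local_path` for local POSIX access."""
    subprocess.run(["lcg-cp", "--vo", vo, f"lfn:{lfn}", f"file:{local_path}"],
                   check=True)

fetch("/grid/myvo/prod/run123/esd.root", "/tmp/esd.root")
```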

15 Workload management
Stable and redundant service
– Load balancing and failover
Matchmaking based on CPU slots and file location (a JDL sketch follows below)
Efficient input sandbox management
Latency for job execution and status reporting proportional to the expected job duration
Support for different priorities based on VOMS groups/roles
– Rearrange jobs in the central and local task queues (TQs)
Fair shares across users in the same group
Interactive access to the running job sandbox (debugging)
Computing Element accessible by services/clients other than the RB
– Monitoring and submission
Changing the identity of a job running on the worker nodes
Standard CPU time limits
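
To make the matchmaking bullet concrete, here is a sketch (not from the slides) that generates a minimal JDL description in which the broker is asked to match on data location via InputData; the attribute names follow the usual LCG/gLite JDL conventions, while the executable, LFN and sandbox names are placeholders. Submission itself would then go through the standard edg/gLite job-submission client.

```python
# Hedged sketch: generate a minimal JDL file whose InputData attribute lets the
# resource broker match the job to a site close to the data. Executable, LFN
# and sandbox contents are placeholders.
JDL = """\
Executable    = "run_analysis.sh";
Arguments     = "esd.root";
StdOutput     = "job.out";
StdError      = "job.err";
InputSandbox  = {{"run_analysis.sh"}};
OutputSandbox = {{"job.out", "job.err"}};
InputData     = {{"lfn:{lfn}"}};
DataAccessProtocol = {{"gsiftp"}};
Requirements  = other.GlueCEStateFreeCPUs > 0;
"""

def write_jdl(lfn: str, path: str = "analysis.jdl") -> str:
    """Write the JDL to disk; submit it afterwards with the RB/WMS client."""
    with open(path, "w") as f:
        f.write(JDL.format(lfn=lfn))
    return path

write_jdl("/grid/myvo/prod/run123/esd.root")
```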

16 Monitoring / accounting
Need to monitor
– Transfer traffic, statistics for file opening and I/O by file/dataset from the SEs
– VO-specific information for global operations
– Job status/failure/progress information
– Publish/subscribe to logging, bookkeeping and local batch system events for all jobs in the VO
– Add a heartbeat to the SFT (Site Functional Tests)?
Accounting
– Site, user and (VOMS) group granularity
– Aggregation by a VO- (user-) specified tag
– Application type (MC, reconstruction, ...), executable, dataset
– SE accounting aggregated by dataset (e.g. by PFN directory)

17 Various
xrootd and a VOBOX at all sites
– They should be considered basic Grid services
– Support model?
Tools for site-dependent VO environments
Secure hosting of long-lived processes
– A standard set of secure containers
PROOF?
– Precise requirements are being defined

18 And now the bright future

19 SC4
Our last chance to test the computing model before the "reality check"
We have to test there all the components that we intend to use for data taking
The questions are
– Which of the requirements will be satisfied?
– By whom?
– On which timescale?

20 EGEE / LCG
EGEE is going to "tag" gLite 1.5 by the end of the year
This will be merged with LCG 2.7 into gLite 3.0 around March
The experiments badly need stability, and some need the new features of gLite
The timing with respect to SC4 is critical
We have to make sure that LCG/GD, the task forces and EGEE JRA1 have a clear idea of our requirements and priorities, and that they work together
– But how?

21 Metrics of success
We should make a special effort to define metrics
I am not sure we know
– "how far" from the mark we fail
– "how serious" a failure is
– At least this is true for ALICE!
This is important in order to provide clear targets to the developers

22 Plans for SC4
The experiments' plans for SC4 are to
– Do what they could not do in SC3
– Add more components of the computing model
A careful middleware choice will be instrumental in ensuring success
– Here I am very much afraid of an embarrassment of riches

23 ALICE plans
Quantitative plans still under study
– Data storage / reconstruction / analysis at the level of 20% of a standard data-taking year
"Classical" approach
– Large distributed MC production at the T1/T2s
– Copy of the files back to CERN
– "Push out" of RAW data with FTS (hopefully FPS)
– Quasi-online T0 processing
– Distributed T1s
– Distributed analysis
– Special emphasis on user data analysis: fast and efficient access to ESDs, resource sharing between production and analysis
Usage of raw data, alignment and calibration must be included
Planning document in preparation

25 LHCb (figure only, not transcribed)

26 LHCb (figure only, not transcribed)

33 Conclusions
We are observing steady progress in understanding and fixing problems
– However, the derivative is still not right
– Every SC has increasingly ambitious objectives and an increasing backlog of problems to fix!
We have to make sure that all available resources work in the same direction and with the same objectives / priorities
(Some) experiment-specific solutions are unavoidable and should be supported in a form acceptable to all parties
– We simply have no time for conflicts; we need to make what is there work, like it or loathe it
Proper prioritisation of the experiment requirements is now mandatory in order to satisfy them