Storage Federations and FAX (the ATLAS Federation) Wahid Bhimji University of Edinburgh

Outline
Introductory: what storage federation and FAX are, and their goals (as stated by the ATLAS FAX project)
Some personal perspectives
UK deployment status
Testing / Monitoring / Use cases
Concerns and benefits

What is Federation and FAX?
Description (from the FAX Twiki): The Federated ATLAS Xrootd (FAX) system is a storage federation that aims to bring Tier1, Tier2 and Tier3 storage together as if it were a single giant storage system, so that users do not have to think about where the data is or how to access it. Client software such as ROOT or xrdcp interacts with FAX behind the scenes and reaches the data wherever it is in the federation.
Goals (from Rob Gardner's talk at the Lyon federation meeting, 2012), similar to CMS:
Common ATLAS namespace across all storage sites, accessible from anywhere
Easy to use, homogeneous access to data (see the client-side sketch after this list)
Use as failover for existing systems
Gain access to more CPUs using WAN direct read access
Use as a caching mechanism at sites to reduce local data management tasks
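As an aside not on the original slide, a minimal PyROOT sketch of what that client-side access looks like; the regional redirector hostname and the file path below are placeholder assumptions, not a real dataset, and a grid proxy plus an xrootd-enabled ROOT build are assumed.

import ROOT

# Placeholder endpoints: substitute your region's FAX redirector and a real
# ATLAS logical path. Requires a valid grid proxy.
redirector = "atlas-xrd-uk.cern.ch"   # assumed/example regional redirector
path = "/atlas/dq2/mc12_8TeV/NTUP_SMWZ/some.file.root"  # placeholder path

f = ROOT.TFile.Open("root://%s/%s" % (redirector, path))
if f and not f.IsZombie():
    print("Opened %s (%d bytes) via the federation" % (f.GetName(), f.GetSize()))
    f.Close()
else:
    print("Could not open the file via the federation")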

Other details / oddities of FAX (some of this is my perspective)
Started in the US with pure xrootd and xrootd-on-dCache; now worldwide, including the UK, IT, DE and CERN (EOS).
Uses a topology of "regional" redirectors (see next slide).
The ATLAS federation uses a "Name2Name" LFC lookup, unlike CMS.
Now moving from R&D to production, but not (quite) there yet IMHO.
There is interest in an http(s) federation instead / as well, particularly from the Europeans, but this is nowhere near as far along.

Regional redirectors

UK FAX Deployment Status
dCache and Lustre (though not StoRM) setups are widely used in the US.
DPM has a nice xrootd server now: details. The UK has been a testbed for this, but it is now an entirely YAIM-driven setup (since 1.8.4). Thanks to David Smith, many issues (in xrootd, not DPM) have all been solved (including SL6).
CASTOR required a somewhat custom setup by Shaun / Alastair, building on the configuration of others.
A regional redirector is set up for the UK: physically at CERN, currently managed by them, though we could do it.
UK sites working now:
DPM: ECDF, Glasgow, Liverpool, Oxford – working; Lancaster – almost there; Manchester – intending to deploy. The EMI push means all sites could install now.
Lustre: QMUL – working.
CASTOR: RAL – working.
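For illustration (not part of the slide), a sketch of how a newly federated site can be spot-checked by copying the same file through the site's own xrootd door and through the regional redirector. The Edinburgh door name is taken from a related deck; the redirector and the file path are placeholder assumptions.

import subprocess

# Placeholder endpoints -- substitute a real site door, redirector and path.
site_door = "root://srm.glite.ecdf.ed.ac.uk//atlas/dq2/some/dataset/file.root"
via_redirector = "root://atlas-xrd-uk.cern.ch//atlas/dq2/some/dataset/file.root"

for url in (site_door, via_redirector):
    # xrdcp exits non-zero if the copy fails; -f overwrites any previous copy
    rc = subprocess.call(["xrdcp", "-f", url, "/tmp/fax_test.root"])
    print("%s -> %s" % (url, "OK" if rc == 0 else "FAILED (rc=%d)" % rc))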

Testing and Monitoring
FAX-specific tests: basic status (see previous page).
Now in the normal ATLAS monitoring too:
ssb.cern.ch/dashboard/request.py/siteview?view=FAXMON#currentView=FAXMON&fullscreen=true&highlight=false

Traffic monitoring
xrootd configuration (needs to be on all disk nodes too; set up by YAIM):

xrootd.monitor all rbuff 32k auth flush 30s window 5s dest files info user io redir atl-prod05.slac.stanford.edu:9930
xrd.report atl-prod05.slac.stanford.edu:9931 every 60s all -buff -poll sync

Display: prod07.slac.stanford.edu:8080/display
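If you want to check locally that the summary stream is actually being emitted (an unofficial debugging aid, not part of the setup on the slide), you can temporarily point xrd.report at a test host and listen for the UDP packets; a minimal Python sketch, with the port matching the configuration above:

import socket

# xrd.report sends XML summary reports over UDP; port 9931 matches the
# destination in the config above. Bind on a test host you have temporarily
# pointed the report stream at.
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.bind(("0.0.0.0", 9931))
print("Waiting for xrootd summary reports on UDP 9931...")
while True:
    data, addr = sock.recvfrom(65535)
    print("Report from %s:%d (%d bytes)" % (addr[0], addr[1], len(data)))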

Aside – stress testing DPM xrootd
The ANALY_GLASGOW_XROOTD queue stress-tested "local" xrootd access. For direct access we saw some server load (the same as we do for rfio). David did offer to help – we didn't follow up much. I am still optimistic that xrootd will offer better performance than rfio.
Trying panda failover tricks: not done yet, but hammercloud tests are planned for the dress rehearsal (see next page).
ASGC have done extensive hammerclouds on (non-FAX) dpm-xrootd: promising results; using it in production now (?)
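For context, the read pattern such a direct-access stress test exercises can be mimicked with a simple PyROOT loop reading every entry of a tree over xrootd; the URL and tree name below are placeholder assumptions, not the actual hammercloud workload.

import ROOT

# Placeholder URL and tree name: point at a file on the site's xrootd door.
url = "root://srm.glite.ecdf.ed.ac.uk//atlas/dq2/some/dataset/file.root"
tree_name = "physics"  # hypothetical tree name

f = ROOT.TFile.Open(url)
tree = f.Get(tree_name)
read_bytes = 0
for i in range(tree.GetEntries()):
    read_bytes += tree.GetEntry(i)  # GetEntry returns bytes read for that entry
print("Read %d entries, %.1f MB over xrootd" % (tree.GetEntries(), read_bytes / 1e6))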

This week: FAX dress rehearsal
Initially just a continuation of existing tests, ramping up to O(10) jobs, using standard datasets placed at sites.
Also some hammercloud tests – again at a low level. Direct access (the most interesting part) is not working yet.
_All_ increases will be discussed with the cloud / sites.
Not everything is ready to run (datasets are still being placed), so this is likely to take some weeks.

Use cases – revisiting the goals
Common ATLAS namespace across all storage sites, accessible from anywhere; easy to use, homogeneous access to data: Done – implicit in the setup. Keen users are being encouraged to try it: tutorials etc.
Use as failover for existing systems: Production jobs can now retry from the federation if all local tries fail… works – not tried in anger. (A sketch of this fallback idea follows this list.)
Gain access to more CPUs using WAN direct read access: WAN access works – no reason not to use it in principle. Timing info from WAN tests is ready for brokering – not yet used (AFAIK).
Use as a caching mechanism at sites to reduce local data management tasks: Nothing has been done on this yet (AFAIK).
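A minimal sketch of the failover idea referenced above (an illustration only, not the actual PanDA failover code; the redirector and URLs are placeholder assumptions):

import ROOT

def open_with_fax_failover(local_url, global_name):
    """Try the local copy first; fall back to the federation redirector."""
    redirector = "atlas-xrd-uk.cern.ch"  # assumed regional redirector
    for url in (local_url, "root://%s/%s" % (redirector, global_name)):
        f = ROOT.TFile.Open(url)
        if f and not f.IsZombie():
            return f
    raise IOError("Could not open %s locally or via the federation" % global_name)

# Hypothetical usage:
# f = open_with_fax_failover("root://local-se.example//atlas/dq2/.../file.root",
#                            "/atlas/dq2/.../file.root")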

Some of my concerns
If users started using this, it could result in a lot of unexpected traffic of files served from sites (and read in by WNs):
The service is not yet in production – no SAM test, no clear expectations of the service, etc.
Communication with sites currently goes directly to the site admin (not via the cloud or GGUS).
We know some network paths are slow, and this is a new path involving the WN and the WAN.
Multiple-VO support: currently separate server instances.
BUT a SAM test etc. is coming, as is configurable bandwidth limiting.
And many site failures (and user frustrations) are storage related, so if it solves those then it's worth it.

What about http(s)?
Storage federation based on http (WebDAV) has been demonstrated – see for example Fabrizio's GDB talk.
DPM has a WebDAV interface, as does dCache, and Chris has something with StoRM at QMUL.
It sits better with the wider goals of interoperability with other communities. However, it doesn't have the same level of HEP/LHC uptake so far.
ATLAS, however, want to use it within Rucio for renaming files. ECDF is involved in those tests – not going that well, but we will iron those issues out.
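To illustrate what WebDAV access to such an endpoint looks like at the HTTP level, a sketch using a PROPFIND request; the hostname and path are placeholder assumptions, and a real endpoint would require an X.509 proxy and proper CA verification rather than verify=False.

import requests

# Placeholder DPM WebDAV endpoint and path; real access needs a grid
# certificate/proxy (requests' `cert` argument) and CA verification.
url = "https://srm.glite.ecdf.ed.ac.uk/dpm/ed.ac.uk/home/atlas/some/file.root"

# A WebDAV PROPFIND returns metadata for the resource; "Depth: 0" asks only
# about the file itself, not a whole directory tree.
resp = requests.request("PROPFIND", url, headers={"Depth": "0"}, verify=False)
print(resp.status_code)
print(resp.text[:500])  # the multistatus XML body, truncated for display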

Conclusion / Discussion
Significant progress in UK FAX integration: we are well placed to completely deploy this in the coming months.
Things appear to work but have not been stressed.
ATLAS are stating deadlines of April for both xrootd and WebDAV (though they don't really have the use cases yet). CMS had a deadline of the end of last year, I believe.
Now is the time to voice concerns! And also to decide as a cloud on the benefits and push in that direction. But we should also progress deployment and see what bottlenecks we have.
Chris made a wiki page for site status: