Ceph Status Report @ RAL
Tom Byrne, Bruno Canning, George Vasilakakos, Alastair Dewhurst, Ian Johnson, Alison Packer

Ceph at RAL is used for two large projects:
Echo
- Disk-only storage service to replace Castor for the LHC VOs
- Emphasis on TB/£ rather than IOPS/£ (thick storage nodes with erasure coding)
- 10 PB usable with EC procured in FY15-16; a further 5 PB usable procured in FY16-17
Sirius
- Provides low-latency block storage to the STFC cloud
- Thin storage nodes with replication, c. 600 TB raw space
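As an illustration of the two configurations, an erasure-coded data pool and a replicated pool can be created with the standard Ceph CLI along these lines. This is a minimal sketch: the profile name, k/m values, pool names and PG counts are assumptions, not the RAL production settings.
# Erasure-coded data pool in the Echo style (k/m values are assumed, not the production profile)
ceph osd erasure-code-profile set echo-ec k=8 m=3 plugin=jerasure
ceph osd pool create atlas 2048 2048 erasure echo-ec
# Replicated pool in the Sirius style (3-way replication assumed)
ceph osd pool create sirius-rbd 1024 1024 replicated
ceph osd pool set sirius-rbd size 3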

Production Status
Echo is now accepting production data from the LHC VOs.
- GridFTP and XRootD are currently supported as production I/O protocols; VO data pools can be accessed via either protocol.
- 7.1 PB of the WLCG pledge is provided by Echo this year, shared between ATLAS, CMS and LHCb.
- Already storing >1.7 PB of data for ATLAS.
- Lots of hard work: testing, development, on-call procedures. The gateway development has been a large part of the work.

VO     Castor currently (TB)  To decommission (TB)  New Castor capacity (TB)  Echo allocation (TB)  Capacity provided by Echo
ALICE  480                    -                     -                         -                     0%
ATLAS  5257                   1693                  980                       3100                  41%
CMS    2287                   1337                  653                       2500                  61%
LHCb   4906                   1010                  545                       1500                  25%

Storage Today

Gateway Specification
Aim for a simple, lightweight solution that supports established HEP protocols: XRootD and GridFTP.
- No CephFS presenting a POSIX interface to the protocols; stay as close to a simple object store as possible.
- Data needs to be accessible via both protocols.
- Ready to support existing customers (the HEP VOs).
- Want to support new customers with industry-standard protocols: S3.
- GridPP makes 10% of its resources available to non-LHC activities.

Service Architecture
[Diagram: WLCG VO data systems (e.g. FTS3) and new VO services access the Ceph cluster through three gateway services: globus-gridftp-server (gsiftp) and the xrootd rados plugin write to the VO data pools via libradosstriper, while ceph-radosgw (S3) uses its own rgw pool on RADOS.]

GridFTP Plugin
The GridFTP plugin was completed at the start of October 2016.
- Development has been done by Ian Johnson (STFC); the work was started by Sébastien Ponce (CERN) in 2015.
- ATLAS are using GridFTP for most transfers to RAL (XRootD is used for most transfers to the batch farm).
- CMS debug traffic and load tests also use GridFTP.
Recent improvements:
- Deletion timeouts
- Check-summing (to align with XRootD)
- Multi-streamed transfers into EC pools
https://github.com/stfc/gridFTPCephPlugin
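For illustration, a multi-stream GridFTP transfer into an EC-backed pool through a gateway might look like the following; the gateway hostname and path are placeholders, not the production endpoint.
# Four parallel streams; hostname and destination path are placeholders
globus-url-copy -vb -p 4 file:///data/testfile gsiftp://ceph-gw.example.ac.uk:2811/atlas/testfile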

XRootD Plugin
The XRootD plugin was developed by CERN.
- The plugin is part of the XRootD server software, but you have to build it yourself.
XRootD developments were needed to enable features unsupported for object-store backends:
Done
- Check-summing
- Redirection
- Caching proxy
To do
- Over-writing a file
- Name-to-name mapping (N2N): work done, needs testing
- Memory consumption
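A minimal sketch of the relevant xrootd configuration, assuming the plugin libraries are installed under their default names; the exported path and library locations are illustrative, not RAL's production config.
# Export the VO namespace and load the Ceph storage plugin in place of the default filesystem OSS
all.export /atlas
ofs.osslib libXrdCeph.so
ofs.xattrlib libXrdCephXattr.so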

XRootD/GridFTP AuthZ/AuthN
[Diagram: the client (xrdcp / globus-url-copy) contacts the gateway, which authenticates the user against the gridmap file ("is the user valid?") and authorises the operation against the AuthDB ("is the user allowed to perform this op?"); only if both succeed does data flow to storage via libradosstriper, otherwise an error is returned.]
- It's nice to have both plugins using the same AuthN/AuthZ method; it keeps things sane.
- The gridmap file is used for authentication.
- Authorisation is done via XRootD's AuthDB; Ian Johnson added support for this in the GridFTP plugin.
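For illustration, the two files might contain entries like the following; the DN, local account name and path are made up.
# grid-mapfile: map a certificate DN to a local account (hypothetical entry)
"/C=UK/O=eScience/OU=CLRC/L=RAL/CN=some user" atlas001
# XRootD AuthDB: grant that account full rights under its VO prefix (hypothetical entry)
u atlas001 /atlas a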

Removing the Gateway Bottleneck for WNs
Having all worker nodes talk to Ceph through the gateways presents a bottleneck.
- Gateway functionality can be added to the worker nodes themselves: they require the Ceph and XRootD configuration files, the gridmap file and a keyring.
- Worker nodes can then talk directly to the Ceph object store.
- Testing on a small number of WNs running an XRootD gateway in a container; the container work is led by Andrew Lahiff.
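A sketch of how such a gateway container might be started on a WN; the image name and bind mounts are assumptions, not the actual deployment.
# Hypothetical image name; configuration, gridmap file and keyring are bind-mounted read-only
docker run -d --net=host \
  -v /etc/ceph:/etc/ceph:ro \
  -v /etc/xrootd:/etc/xrootd:ro \
  -v /etc/grid-security:/etc/grid-security:ro \
  xrootd-ceph-gateway:latest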

XRootD Architecture
[Diagram: external VO redirectors point at the external gateways (each running xrootd + cmsd), which sit in front of the Ceph backend; WNs run their own xrootd gateway talking directly to Ceph.]
All XRootD gateways will have a caching proxy:
- On the external gateways it is large, as AAA traffic from other sites doesn't use lazy download.
- On the WN gateways it will be small and used to protect against pathological jobs.
WNs will be installed with an XRootD gateway, allowing a direct connection to Echo; 80% of the batch farm is now using SL7 + containers.
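A minimal sketch of what a caching-proxy gateway configuration could look like with the standard XRootD proxy and file-cache plugins; the origin host, cache path and sizes are assumptions, not the RAL settings.
# Run as a proxy in front of the Ceph-backed origin (hostname assumed)
ofs.osslib libXrdPss.so
pss.origin ceph-gw.example.ac.uk:1094
# Enable the disk cache; location and sizes are illustrative only
pss.cachelib libXrdFileCache.so
oss.localroot /xcache
pfc.ram 16g
pfc.diskusage 0.90 0.95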

Operational Issues: Stuck PG
A placement group (PG) is a section of a logical object pool that is mapped onto a set of object storage daemons (OSDs).
- While a storage node was being rebooted for patching in February, a placement group in the atlas pool became stuck in a peering state; I/O hung for any object in this PG.
- Typical remedies for fixing peering problems were not effective (restarting/removing the primary OSD, restarting the set, etc.).
- It seemed to be a communication issue between two OSDs in the set.
pg 1.323 is remapped+peering, acting [2147483647,1391,240,127,937,362,267,320,7,634,716]
- Peering is the process of all OSDs in the set establishing communication.
- Unlikely to be networking: the storage nodes shared 200 PGs.
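The kind of commands used to investigate a PG in this state, taking the PG id from the status line above; exact output varies by Ceph release.
# Show unhealthy PGs and query the stuck one in detail
ceph health detail
ceph pg dump_stuck inactive
ceph pg 1.323 query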

Stuck PG
To restore service availability, it was decided we would manually recreate the PG, accepting the loss of all 2300 files (160 GB of data) in that PG.
- Again, the typical methods (force_create) failed because the PG could not peer.
- Manually purging the PG from the set was revealing: an error was seen on the OSD that caused the issue.
- A Ceph developer suggested this was a LevelDB corruption on that OSD.
- Reformatting that OSD and manually marking the PG complete caused the PG to become active+clean, and the cluster was back to a healthy state.
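A hedged sketch of the kind of commands involved in those steps on a Ceph release of that era; the OSD id and data path are placeholders, and these are last-resort operations on a production cluster.
# Ask the cluster to recreate the PG (this is what failed while the PG could not peer)
ceph pg force_create_pg 1.323
# With the affected OSD stopped, mark the PG complete directly in its object store (OSD id/path are placeholders)
systemctl stop ceph-osd@1391
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-1391 --pgid 1.323 --op mark-complete
systemctl start ceph-osd@1391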

Stuck PG: Conclusions
- A major concern with using Ceph for storage has always been recovering from these types of events; this showed we had the knowledge and support network to handle them.
- The data loss occurred due to late discovery of the correct remedy: we would have been able to recover without data loss if we had identified the problem (and the problem OSD) before we manually removed the PG from the set.
- http://tracker.ceph.com/issues/18960 goes into a lot more detail.
- Having the expertise to recover from these states is vital in running a production storage system.

S3 / Swift
We believe S3 / Swift are the industry-standard protocols we should be supporting.
- If users want to build their own software directly on top of S3 / Swift, that's fine, but they need to sign an agreement to ensure credentials are looked after properly.
- We expect most new users will want help: we are currently developing a basic storage service product that can quickly be used to work with data.
- S3 / Swift API access to Echo will be the only service offered to new users wanting disk-only storage at RAL.
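For illustration, a user with S3 credentials could talk to the radosgw endpoint with any standard S3 client; the endpoint hostname and bucket name below are placeholders, not the production service.
# Endpoint and bucket are placeholders; credentials come from the user's s3cmd configuration
s3cmd --host=s3.example.stfc.ac.uk --host-bucket="%(bucket)s.s3.example.stfc.ac.uk" mb s3://mybucket
s3cmd --host=s3.example.stfc.ac.uk --host-bucket="%(bucket)s.s3.example.stfc.ac.uk" put testfile s3://mybucket/testfile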

DynaFed
We believe DynaFed is the best tool to allow small VOs secure access.
- The S3/Swift credentials are stored on the DynaFed box; users use a certificate/proxy.
- Provides a file-system-like structure.
- Good support for transfers to existing Grid storage.
[Diagram: 1. the user sends a proxy + request to DynaFed; 2. DynaFed returns a pre-signed URL; 3. the user transfers the data directly against the S3 / Swift store.]

RAL DynaFed Setup
- The service sits behind a DNS alias (dynafed.stfc.ac.uk) and HA proxies.
- Currently just one VM (or container), but easily scalable.
- Will be in production in the next 6 months; Ian Johnson will be spending 50% of his time working on DynaFed.
- Anyone with an atlas or dteam certificate can try it out: https://dynafed.stfc.ac.uk/gridpp
CLI:
# voms-proxy-init
# davix-ls -P grid davs://dynafed.stfc.ac.uk/gridpp/dteam/disk/
# davix-put -P grid testfile davs://dynafed.stfc.ac.uk/gridpp/dteam/disk/testfile
Or:
# gfal-ls davs://dynafed.stfc.ac.uk/gridpp/echo/
# gfal-copy file:///home/tier1/dewhurst/testfile davs://dynafed.stfc.ac.uk/gridpp/dteam/disk/testfile2

Conclusion
- Echo is in production! There has been a massive amount of work in getting to where we are.
- Support for GridFTP and XRootD on a Ceph object store is mature.
- Thanks to Andy Hanushevsky, Sébastien Ponce, Dan Van Der Ster and Brian Bockelman for all their help, advice and hard work.
- Looking forward: industry-standard protocols are all we want to support, and the tools are there to provide stepping stones for VOs.