First Experiences with Ceph on the WLCG Grid
Rob Appleyard, Shaun de Witt, James Adams, Brian Davies

Contents
- Who am I and where am I coming from?
- What is Ceph? What is an object store?
- Why are we interested in it?
- Comparative performance
- What will it cost us?

Introduction
- Me: Rob Appleyard
  - Sysadmin at the Rutherford Appleton Laboratory
  - Working on LHC data storage at RAL for 2.5 years
- Where this talk is coming from…
  - A discussion of Ceph and how it works
  - Findings of an internal evaluation exercise

RAL – An LHC Tier 1 Computing Centre
- Our current situation
  - CASTOR disk/tape
  - ~17PB of RAID 6 storage nodes
  - ~13PB of tape
- Our plans
  - We need a better disk-only file system
  - Don't touch the tapes!

What is an Object Store?
- An object store manages each chunk of data that goes into it as one or more objects.
- The structure is flat: the system simply holds a collection of objects with associated metadata, as opposed to a file system, which uses a directory hierarchy.
- The lower levels of storage are abstracted away.
- Capabilities:
  - Distributed, redundant metadata kept separate from the data
  - Scalable to multi-petabyte levels
- You can then impose a filesystem/namespace on top of the object store (in Ceph: CephFS), or do whatever else you like. The object store doesn't care.
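To make the flat namespace concrete, here is a minimal sketch (not from the talk) using the python-rados bindings to store and read back an object; the configuration path, pool name and object name are assumptions for the example.

    import rados

    # Connect using the standard client configuration (path and pool name
    # are assumptions for this sketch).
    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()

    ioctx = cluster.open_ioctx('test-pool')
    try:
        # Objects live in a flat namespace: just a name, data and metadata.
        ioctx.write_full('experiment-file-001', b'payload bytes')
        ioctx.set_xattr('experiment-file-001', 'checksum', b'adler32:12345678')
        print(ioctx.read('experiment-file-001'))
    finally:
        ioctx.close()
        cluster.shutdown()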

Why do we want to use Ceph?
- A generic, FREE, non-domain-specific solution
  - Client support is incorporated into the Linux kernel.
- CERN's plan for CASTOR tape is to run Ceph as the underlying file system. Cut out the middleman!
- Improved resilience
  - Under CASTOR, the loss of one node loses all files on that node; with Ceph's distributed placement groups, this is not a problem.
- Flexibility
  - Ceph is also planned for Tier 1 and departmental cloud storage.
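To make the placement-group point concrete: each object maps, via CRUSH, to a placement group and from there to a set of OSDs, so a failed node only affects the placement groups that included it. A quick way to inspect the mapping is a sketch like the one below (the pool and object names are examples only).

    import subprocess

    # 'ceph osd map <pool> <object>' reports the placement group and the
    # OSD set an object name would be stored on.
    subprocess.run(['ceph', 'osd', 'map', 'data', 'example-object'], check=True)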

Performance
- In an early-2013 performance comparison exercise for disk-only storage, Ceph looked… not great.

Performance (Ctd.)
- Why so slow?
  - The Ceph instance was configured for one master copy and one replica, and it waits for both copies to be written before acknowledging the write.
- Performance testing on a new test instance is coming soon.
- A new feature, cache tiering, could help:
  - It manages a fast (SSD) cache sitting at the front, then flushes data to the back-end storage pool.
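For reference, a rough sketch of the configuration knobs involved, driven from Python only to keep the examples in one language; the pool names 'data' and 'ssd-cache' are assumptions, and the cache pool is assumed to exist already.

    import subprocess

    def ceph(*args):
        """Run a ceph CLI command and raise if it fails."""
        subprocess.run(['ceph', *args], check=True)

    # Replication behaviour behind the slow writes: the primary OSD only
    # acknowledges once every replica has the data, so size=2 means two
    # synchronous copies per write.
    ceph('osd', 'pool', 'set', 'data', 'size', '2')

    # Cache tiering: put a fast (SSD-backed) pool in front of the data pool
    # and let Ceph flush objects to the backing pool asynchronously.
    ceph('osd', 'tier', 'add', 'data', 'ssd-cache')
    ceph('osd', 'tier', 'cache-mode', 'ssd-cache', 'writeback')
    ceph('osd', 'tier', 'set-overlay', 'data', 'ssd-cache')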

Cost Modelling: The Test
- Examine whether Ceph is a viable replacement for CASTOR from a hardware cost perspective.
  - Budgets are squeezed; we can't exceed CASTOR's hardware budget.
- Use a vendor's website to price up nodes for CASTOR and for Ceph.
- The requirements differ:
  - CASTOR needs better drives, RAID controllers, etc. …and headnodes (not included here).
  - But Ceph needs more raw drives.

Cost Modelling: The Numbers
- Based on commodity nodes from Broadberry with 36 × 3TB SATA drives; prices from December 2013.
- CASTOR:
  - RAID 6 at node level: $113/TB (we actually buy SAS drives, so this is an underestimate)
- Ceph:
  - 1 master copy with 2 additional replicas: $313/TB
  - 1 master copy with 1 additional replica: $208/TB
  - A single copy with erasure coding, 2 coding disks per 16 data disks: $119/TB
  - A single copy with erasure coding, 1 coding disk per 16 data disks: $111/TB
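The relative prices follow mostly from the raw-to-usable overhead of each layout. A back-of-the-envelope sketch is shown below; the raw $/TB figure is an illustrative assumption, not the Broadberry quote, and only the overhead factors come from the layouts discussed on the slide.

    # Rough $/TB-usable comparison for a 36 x 3TB node.
    RAW_COST_PER_TB = 105.0  # assumed $/TB of raw disk + chassis (illustrative)

    layouts = {
        'RAID6 + hot spare (33 data disks per 36-drive node)': 36 / 33,
        '3-way replication (1 master + 2 replicas)': 3.0,
        '2-way replication (1 master + 1 replica)': 2.0,
        'Erasure coding 16+2 (single copy)': 18 / 16,
        'Erasure coding 17+1 (single copy)': 18 / 17,
    }

    for name, overhead in layouts.items():
        print(f'{name}: ~${RAW_COST_PER_TB * overhead:.0f}/TB usable')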

Cost Modelling: The Conclusion
- Ceph must fit into CASTOR's budget, so we can't use straight replication.
- The cost difference between 1 and 2 erasure-coded disks is pretty small, and 2 is much better than 1!
- Not included: power, cooling, staff effort… but a lot of this should be similar to CASTOR (we hope!)

The Future
- RAL: a large (1PB) test instance
  - Performance should be better than last time.
  - 1 replica initially, then try erasure codes…
- Look to deploy into production as the CASTOR replacement in early 2015
  - One really big instance rather than one per experiment.
- Risks…
  - A big change
  - Erasure coding not yet stable
- Future development for WLCG?
  - CERN are working on a plug-in bundle optimised for XRootD, and on an optimised file system to replace CephFS.

Any Questions? Contact:

Spare Slides…

Why are we interested in Ceph?
- Up to now, we have not seen a reason to move away from CASTOR.
  - We did a full survey of our options during 2012 and found nothing sufficiently superior to CASTOR to be worth the effort of migration.
- But things move on…
  - CASTOR is proving problematic with the new WLCG protocol (xroot).
  - CERN are seriously considering running Ceph under CASTOR for tape; if we'll be running it anyway, why not cut out the middleman?
  - Some issues previously identified in Ceph are, or will soon be, addressed: erasure encoding, stability of CephFS.

Why Ceph?
- The CASTOR team want to drop support for the current file system.
- If Ceph works as well as planned, it gets us out of particle-physics-specific software
  - …except that CERN are contributing to the code base.
- Improved resilience
  - Currently, the loss of 3 disks on a server will (probably) mean the loss of all files on it.
  - Under Ceph, a 3-disk loss will lose less (not yet quantified), assuming suitable erasure encoding/duplication (2 erasure-coded disks per 16 physical).
- Improved support
  - Ceph is also planned for Tier 1 and SCD cloud storage, so more cross-team knowledge.

Plans and Times
- Currently developing a Quattor component.
- The plan is to deploy a 'small' test instance for all VOs:
  - 1PB nominal capacity, less overhead
  - Initially using CephFS and dual copy; move to erasure encoding as soon as possible
  - NO SRM
  - Anticipate deployment late April/early May
  - Integration of the XRootD RADOS plugin as soon as it is available
- After some months of VO testing (Oct. 2014?):
  - Start migrating data from CASTOR to Ceph
    - Need to work with the VOs to minimise pain; migrate the fewest possible files.
  - Start moving capacity from CASTOR to Ceph.

Costs
- Ceph with dual copy is too expensive long term; we need erasure encoding.
- We could deploy with a setup like the current one (1 copy, RAID6, 1 hot spare)… but this is not recommended: we would lose the resilience advantage against disk loss.
- With erasure encoding…
  - A single erasure-encoded copy (no hot spare, 1 erasure disk per 17 data disks) is cheaper than the current setup, but less resilient.
  - A dual erasure-encoded copy (no hot spare, 2 erasure disks per 16 data disks) is about the same price, with better resilience.
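For reference, a sketch (not from the talk) of how the 16+2 layout could be expressed with the standard ceph CLI, again driven from Python; the profile name, pool name and placement-group count are invented for the example.

    import subprocess

    def ceph(*args):
        subprocess.run(['ceph', *args], check=True)

    # 16 data chunks + 2 coding chunks per object, i.e. the '2 erasure disks
    # per 16 data disks' layout discussed above (names are examples).
    ceph('osd', 'erasure-code-profile', 'set', 'wlcg-k16m2', 'k=16', 'm=2')

    # Create an erasure-coded pool using that profile; 1024 placement groups
    # is only a placeholder value.
    ceph('osd', 'pool', 'create', 'wlcg-disk', '1024', '1024', 'erasure', 'wlcg-k16m2')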

Proposed Deployment
- Ideally a 'single' instance with quotas…
  - 'Single' meaning 1 disk instance and 1 tape instance (the latter still under CASTOR)
  - Using Ceph pools
  - Simpler to manage than the 4 instances currently set up
  - Easier to shift space around according to demand
- The problem may be the ALICE security model
  - It may force us to run with 2 instances.
  - Work with ALICE to see if this can be mitigated.
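A sketch of how the 'single instance with per-VO quotas' idea could look in practice using Ceph pools and pool quotas; the pool names, sizes and placement-group count are invented for illustration.

    import subprocess

    def ceph(*args):
        subprocess.run(['ceph', *args], check=True)

    TIB = 1024 ** 4

    # One pool per VO inside a single Ceph instance, each capped with a quota,
    # so capacity can be shifted by changing quotas rather than rebuilding
    # separate instances. Names and sizes are examples only.
    vo_quotas_tib = {'atlas': 600, 'cms': 300, 'lhcb': 200, 'alice': 100}

    for vo, quota_tib in vo_quotas_tib.items():
        pool = f'{vo}-disk'
        ceph('osd', 'pool', 'create', pool, '1024')
        ceph('osd', 'pool', 'set-quota', pool, 'max_bytes', str(quota_tib * TIB))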

Risks
Lots of them... This is just a sample.

Risk                           | Likelihood | Impact | Mitigation
Erasure encoding not ready     | Low        | High   | None
CephFS not performant          | Medium     | Medium | Switch to HTTP access
CephFS not stable              | Low        | High   | Switch to HTTP access
XRootD/RADOS plugin not ready  | Medium     | High   | Use POSIX CephFS
Difficulty in migrating data   | High       | High   | Minimise data to be migrated
Difficult to administer        | Medium     | Medium | Use testing time to learn about the system
Ceph moves to a support model  | Low        | Medium | Buy support from Inktank (or another vendor)