Ceph at the Tier-1 Tom Byrne

Outline
Using Ceph for grid storage
Echo and data distribution
Echo downtime reviews
Future of Ceph and conclusions

Introduction
Echo has been in production at the Tier-1 for over 2 years now.
Ceph strengths:
Erasure coding addresses the limitations of hardware RAID as disk capacities increase, and is cheaper than replication.
Data is aggressively balanced across the cluster, maximising throughput.
Ceph is very flexible: the same software can run clusters that provide a cloud backend, a file system or an object store.

Ceph Grid Setups
Ceph can be configured in several ways to provide storage for the LHC VOs:
Object store with GridFTP + XRootD plugins
CephFS + GridFTP servers
RBD + dCache
RadosGW (S3) + DynaFed
Listed in order of production readiness. Some setups have only been run using replication, not erasure coding.

Echo
Big Ceph cluster for WLCG Tier-1 object storage (and other users)
181 storage nodes, 4700+ OSDs (6-12TB)
36/28PB raw/usable – 16PB data stored
Density and throughput over latency
EC 8+3, 64MB rados objects
Been through 3 major Ceph versions; ~30% of OSDs still FileStore
Mon stores just moved onto RocksDB
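The erasure-coding scheme above can be illustrated with a minimal sketch of how an 8+3 EC profile and pool are created in Ceph; the profile name, pool name, PG count and failure domain below are illustrative assumptions, not Echo's actual settings.

    # Define an 8+3 erasure-code profile: 8 data chunks + 3 coding chunks,
    # with each chunk placed on a different host.
    ceph osd erasure-code-profile set ec-8-3 k=8 m=3 crush-failure-domain=host

    # Create an erasure-coded pool for a VO using that profile.
    ceph osd pool create atlas-echo 4096 4096 erasure ec-8-3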

Cluster balancing
CRUSH's algorithmic data placement results in a roughly normal distribution of disk fullness if OSD fullness is left unmanaged.
Pre-Luminous, OSD reweights were used to improve this distribution, but this method was found to be inadequate for a cluster the size of Echo.
A pain point in 2018 was dealing with a very full cluster while adding hardware and increasing VO quotas.
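For reference, the pre-Luminous reweighting mentioned above is typically driven by the utilisation-based reweight commands; a sketch follows (the 120% threshold is the usual default, shown explicitly here):

    # Dry run: report which OSDs would have their reweight adjusted.
    ceph osd test-reweight-by-utilization 120

    # Apply: lower the override reweight of OSDs more than 20% fuller than average.
    ceph osd reweight-by-utilization 120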

Data Distribution

The Upmap Balancer
New feature in Ceph Luminous v12.2 – 'upmap': explicitly map a specific placement group to a specific disk, with the mapping stored alongside the CRUSH map.
Automatic balancing is implemented using this feature; a relatively small number of upmaps is needed to perfectly distribute a large number of PGs (and hence data).
A little juggling is required to move from a reweight-balanced cluster to an upmap-balanced one without mass data movement – Dan van der Ster wrote a script to make this a trivial operation*.
Greatly improves data distribution, and therefore the total space available.
*https://gitlab.cern.ch/ceph/ceph-scripts/blob/master/tools/upmap/upmap-remapped.py
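Enabling the automatic upmap balancer on a Luminous or later cluster looks roughly like the following sketch (not necessarily the exact procedure used on Echo):

    # Upmap entries require all clients to understand the Luminous feature set.
    ceph osd set-require-min-compat-client luminous

    # Switch the mgr balancer module to upmap mode and turn it on.
    ceph balancer mode upmap
    ceph balancer on

    # Check what the balancer is doing.
    ceph balancer status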

The Upmap Balancer in action

Inconsistent PGs
When a placement group carries out an internal consistency check, known as scrubbing, any read errors will flag the PG as inconsistent.
These inconsistencies can be dangerous on low-replica pools with genuine corruption, but they are usually trivial to identify and rectify, e.g. 10/11 EC shards consistent, 1/11 shards unreadable.
Inconsistent PGs are unavoidable; it's just a question of how many you deal with, depending on the size of the cluster and the likelihood of disks throwing bad sectors. For a <1000-disk cluster, you might get one a week.
Ceph has internal tools to repair inconsistent PGs, but human intervention is required to start the process. Improving (automating) this process is on the roadmap.
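The manual repair process referred to above typically looks something like the following sketch, where the PG ID 1.2f3 is a placeholder:

    # Find PGs flagged as inconsistent after scrubbing.
    ceph health detail

    # Inspect the PG to confirm which shard is unreadable or corrupt.
    rados list-inconsistent-obj 1.2f3 --format=json-pretty

    # Ask the primary OSD to repair the PG from the healthy shards.
    ceph pg repair 1.2f3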

Downtimes
Echo became a production service at the start of February 2017.
2017-02-24: 47 minutes – stuck PG in ATLAS pool; normal remedies didn't fix it.
2017-08-19: 7 days – backfill bug.
2018-08-10: 7 days – memory usage.

Stuck PG
While rebooting a storage node for patching in February 2017, a placement group in the ATLAS pool became stuck in a peering state, and I/O hung for any object in this PG.
To restore service availability, it was decided we would manually recreate the PG, accepting the loss of all 2,300 files/160GB of data in that PG.
The data loss occurred due to late discovery of the correct remedy: we would have been able to recover without data loss if we had identified the problem (and the problem OSD) before we manually removed the PG from the set.
http://tracker.ceph.com/issues/18960
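The kind of diagnostics that help identify a problem OSD in a case like this is sketched below; the PG ID is a placeholder and this is not a record of the commands actually run at the time:

    # Show PGs stuck in peering and any blocked requests.
    ceph health detail

    # Query the stuck PG: its peering state shows which OSDs it is waiting on.
    ceph pg 1.2f3 query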

Backfill bug
In August 2017 we encountered a backfill bug specific to erasure-coded pools when adding 30 new nodes to the cluster: http://tracker.ceph.com/issues/18162
A read error on an object shard on any existing OSD in a backfilling PG will crash the primary OSD, then the next acting primary, and so on, until the PG goes down.
Misdiagnosis of the issue led to the loss of an ATLAS PG: 23,000 files lost.
Once the issue was understood we could handle recurrences with no further data loss, and a Ceph upgrade fixed the issue for good.
https://www.gridpp.ac.uk/wiki/RAL_Tier1_Incident_20170818_first_Echo_data_loss

Memory Usage
Storage node memory usage climbed, causing OSD memory to go into swap and the cluster to grind to a halt, even after a full cluster restart.
The cause was identified as a bug in the BlueStore OSD RocksDB trimming code, which was fixed in a later point release.
It was also decided that some generations of storage nodes needed more RAM to cope with extreme cluster circumstances.
Identifying the bug, installing extra RAM in ~80 storage nodes, upgrading Ceph and restoring full service happened in under a week, with no data loss.
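On more recent Ceph releases, BlueStore OSD memory consumption is bounded by a per-OSD target; a sketch of setting it is shown below (the value is illustrative and this is not necessarily what was changed on Echo):

    # Cap each BlueStore OSD at roughly 4 GiB of memory (Mimic or later).
    ceph config set osd osd_memory_target 4294967296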

Non-Downtimes
The following things happened but didn't cause Ceph to stop working:
Site-wide power outage (UPS worked! Pity about the rest of the site…)
Central network service interruptions
Two major upgrades (rolling interventions could be done)
Entire disk server failures
High disk failure rates caused by excessive heat at the back of racks
Any security patching

Future of Ceph
Continual development from the open source community to improve Ceph cluster performance, stability and ease of management.
Automatic balancing has been added and is stable; automatic PG splitting/merging is coming soon.
An ambitious project is underway to rewrite the OSD data paths, aiming for much better performance from network to memory.
Large, growing community: the SKA precursor (MeerKAT) uses a Ceph object store for its analysis pipeline and general storage, and STFC has joined the Ceph Foundation as an academic member.

Conclusion
Using Ceph as the storage backend is working very well for the disk storage at the Tier-1, and Ceph is very well suited to a Tier-2-sized cluster.
Operational issues are to be expected, but being able to update, patch and deal with storage node failures without loss of availability has been fantastic.
The future of Ceph is looking promising; there continues to be lots of uptake, both in scientific and commercial areas.

Other sites
[Diagram: worker nodes running jobs in containers access the Ceph cluster's VO data pool, either through external XRootD/GridFTP gateways (XRootD server or GridFTP server with the rados plugin and libradosstriper) or through an XRootD gateway container running on each WN]
Having all worker nodes talk to Ceph through the gateways seems like a bad idea.
You just need ceph.conf + a keyring and any machine can talk directly to Ceph (a minimal sketch is given below).
All WNs now run an XRootD gateway in containers.
Gateways have a caching layer.
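As an illustration of the "just ceph.conf + keyring" point above, a minimal client setup might look like the sketch below; the monitor hostnames, client name, key and pool are placeholders rather than Echo's real configuration:

    # /etc/ceph/ceph.conf – just enough for a client to find the monitors:
    [global]
    mon host = mon1.example.org,mon2.example.org,mon3.example.org

    # /etc/ceph/ceph.client.atlas.keyring – the VO's client key (placeholder):
    [client.atlas]
    key = <base64 key goes here>

    # With these two files in place, any machine can talk to the cluster directly:
    rados --id atlas -p atlas-echo ls | head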