Russian academic institutes' participation in the WLCG Data Lake project
Andrey Kiryanov, Xavier Espinal, Alexei Klimentov, Andrey Zarochentsev
LHC Run 3 and HL-LHC Run 4 Computing Challenges
[Figure: raw data volume, with a "We are here" marker]
HL-LHC storage needs are a factor of 10 above what the expected technology evolution and flat funding can deliver. We need to optimize storage hardware usage and operational costs.
Motivation
The HL-LHC will be a multi-exabyte challenge: the anticipated storage and compute needs are a factor of ten above what projected technology evolution and flat funding can deliver.
The WLCG community needs to evolve its current models to store and manage data more efficiently.
Technologies that address the HL-LHC computing challenges may also help other communities manage large-scale data volumes (SKA, DUNE, CTA, LSST, BELLE-II, JUNO, etc.). Co-operation is in progress.
Storage software that emerged from the HENP scientific community
DPM: WLCG storage solution for small sites (T2s). Initially supported GridFTP+SRM, now being reworked as DOME with HTTP/WebDAV/xrootd support as well. No tapes.
dCache: a versatile storage system from DESY for both disks and tapes. Used by many T1s.
xrootd: both a protocol and a storage system optimized for physics data access. Can be extended extensively by plug-ins. Used as a basis for the ATLAS FAX and CMS AAA federations.
EOS: based on xrootd; adds a smart namespace and many extra features such as automatic redundancy and geo-awareness.
DynaFed: designed as a dynamic federation layer on top of HTTP/WebDAV-based storage.
CASTOR: CERN’s tape storage solution, to be replaced by CTA.
On top of these, various data management solutions exist: FTS, Rucio, etc.
What is a Data Lake
Not another software or storage solution.
It is a way of organizing a group of data and computing centers so that together they can process data effectively.
A scientific community defines the “shape” of its Data Lake, which may differ from one community to another.
We see the Data Lake model as an evolution of the current infrastructure that reduces storage costs.
Data Lake
We’re not alone
Several storage-related R&D projects are conducted in parallel:
  Data Carousel
  Data Lake
  Data Ocean (Google)
  Data Streaming
All of them are in progress as part of the DOMA and/or IRIS-HEP global R&D for HL-LHC.
It is important to develop a coherent solution to the HL-LHC data challenges and to coordinate the above and future projects.
Requirements for a future WLCG data storage infrastructure
Common namespace and interoperability
Coexistence of different QoS
Geo-awareness
File transitioning based on namespace rules
File layout flexibility
Distributed redundancy
Fast access to data, latency (>20 ms) compensation
File R/W cache
Namespace cache
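The latency-compensation requirement can be motivated with a bandwidth-delay calculation. The sketch below is illustrative and not part of the project: it assumes a 10 Gbps link, the 20 ms round-trip time named in the requirement, and a hypothetical client that issues one synchronous 1 MiB read per round trip. It estimates how much data must be in flight to keep the link busy and how much throughput such a client would reach.

```python
def bandwidth_delay_product_bytes(link_bps: float, rtt_s: float) -> float:
    """Bytes that must be in flight to keep the link fully used."""
    return link_bps * rtt_s / 8


def sync_read_throughput_bps(request_bytes: int, rtt_s: float, link_bps: float) -> float:
    """Throughput, in bits per second, of one-request-per-round-trip reads."""
    transfer_s = request_bytes * 8 / link_bps      # time on the wire
    return request_bytes * 8 / (rtt_s + transfer_s)


LINK_BPS = 10e9            # assumed 10 Gbps link
RTT_S = 0.020              # 20 ms, the threshold named in the requirements
REQUEST_BYTES = 1 << 20    # assumed 1 MiB per read request (hypothetical)

bdp = bandwidth_delay_product_bytes(LINK_BPS, RTT_S)
rate = sync_read_throughput_bps(REQUEST_BYTES, RTT_S, LINK_BPS)

print(f"data in flight needed to fill the link: {bdp / 1e6:.0f} MB")
print(f"synchronous 1 MiB reads reach {rate / 1e6:.0f} Mbps "
      f"({rate / LINK_BPS:.1%} of the link)")
```

Under these assumptions a naive client uses only a few percent of the link, which is one way to see why read-ahead, file caches and namespace caches appear in the requirement list.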
WLCG Data Lake Prototype: EUlake
Currently based on EOS
  Implies xrootd as the primary transfer protocol
  Other storage technologies and their possible interoperability are also considered
The primary namespace server (MGM) is at CERN
  Deployment of a secondary namespace server at NRC “KI” is planned
  Multi-MGM deployment was unsupported for a while because of the EOS transition from an in-memory namespace to QuarkDB
Storage endpoints run a simple EOS filesystem (FST) daemon
  Deployed at CERN, SARA, NIKHEF, RAL, JINR, PNPI, PIC and UoM
perfSONAR endpoints are deployed at participating sites
Performance tests (HC) are running continuously
Russian Federated Storage Project
Started in 2015
EOS + dCache
Sites in Russia:
  SPb Region: SPbSU, PNPI
  Moscow Region: JINR, NRC “KI”, MEPhI, SINP, ITEP
External sites: CERN
Russia in EUlake (1)
Why?
  Extensive expertise in deploying and testing distributed storage
  Russian institutes, including those that make up NRC “KI”, are geographically distributed
  Network interconnection between Russian sites is constantly improving
  A similar prototype was successfully deployed on Russian sites (Russian Federated Storage Project)
  An appealing universal infrastructure may be useful not only for HL-LHC and HEP, but also for other experiments and fields of science relevant to us (NICA, PIK, XFEL)
NRC “KI” equipment for EUlake is located at PNPI in Gatchina
  10 Gbps connection, IPv6
  ~100 TB of block storage
  Storage and compute endpoints on VMs
JINR equipment for EUlake is located in Dubna
  10 Gbps connection
  ~16 TB of block storage
  Storage endpoints on VMs
Russia in EUlake (2)
Manpower (NRC “KI” + JINR + SPbSU)
Infrastructure deployment
  FSTs
  Hierarchical GeoTags
  Placement-related attributes
Synthetic tests
  File I/O performance
  Metadata performance
Real-life tests
  HammerCloud
Monitoring
NRC “KI” + JINR international network infrastructure
[Network map; labelled sites: PNPI, JINR]
100 Gbps routes
NRC “KI” in EUlake
[Architecture diagram; labels as follows]
CERN: Primary Head Node, Disk Nodes
PNPI: Secondary Head Node, Disk Nodes, running on VM hosts
Other participants: JINR, PIC, NIKHEF, RAL, SARA, UoM
Flows between Clients and the sites: metadata request, redirection, data transfer, replication & fall-back
Reliable back-end with redundancy at PNPI:
  Ceph nodes x20, 128 TB each
  10 Gbps links, 10 Gbps switch, 160 Gbps stack
  2x10 Gbps trunk on each node
  Replication via a dedicated fabric
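The back-end figures on this slide (20 Ceph nodes, 128 TB each, a 2x10 Gbps trunk per node) can be turned into rough capacity numbers. The sketch below is a back-of-the-envelope estimate, not a description of the actual PNPI pool configuration: the slides do not state the replication factor or erasure-coding profile, so the two redundancy schemes used here are assumptions chosen for illustration.

```python
NODES = 20              # "Ceph Nodes x20" from the slide
TB_PER_NODE = 128       # "128TB each" from the slide
TRUNK_GBPS = 2 * 10     # "2x10 Gbps trunk on each node" from the slide


def usable_tb(raw_tb: float, data_chunks: int, redundancy_chunks: int) -> float:
    """Usable capacity when every data_chunks of payload cost
    data_chunks + redundancy_chunks of raw space."""
    return raw_tb * data_chunks / (data_chunks + redundancy_chunks)


raw_tb = NODES * TB_PER_NODE
aggregate_trunk_gbps = NODES * TRUNK_GBPS

# Hypothetical redundancy schemes; the real pool settings are not given.
SCHEMES = {
    "3x replication (assumed)": (1, 2),
    "erasure coding 4+2 (assumed)": (4, 2),
}

print(f"raw capacity: {raw_tb} TB")
print(f"sum of node trunks: {aggregate_trunk_gbps} Gbps")
for name, (data, redundancy) in SCHEMES.items():
    print(f"{name}: about {usable_tb(raw_tb, data, redundancy):.0f} TB usable")
```

The raw total is far larger than the ~100 TB of block storage quoted for EUlake, which fits the statement that the PNPI data centre is shared with other uses.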
Highlights
Why Ceph?
  Deploying EOS on physical storage is perfectly suitable for CERN, but the PNPI Data Centre is not a dedicated facility for HEP computing
  Ceph adds the necessary flexibility in block storage management (we also use it for other purposes, such as VM images)
Storage configuration
  We started with Luminous but quickly moved to Mimic
    CephFS performance improved significantly in the new release
  We have four different “types” of Ceph storage exposed to EOS:
    CephFS with a replicated data pool
    CephFS with an Erasure Coded data pool
    Block device from a replicated pool
    Block device from an Erasure Coded pool
  Functional and performance tests are ongoing
Auxiliary infrastructure
  Repository with stable EOS releases (the CERN repo changes too fast, sometimes breaking functionality)
  Web server with a visualization framework and test results storage
  Compute nodes for HC tests
Ceph performance measurements
CephFS metadata performance is much lower than that of a dedicated RBD (this is expected)
Block I/O performance is on par, but CPU usage is lower with CephFS
Ultimate goals
Evaluate the fusion of local (Ceph) and global (EOS, dCache) storage technologies
  Figure out the strong and weak points
  Come up with a high-performance, flexible yet easily manageable storage solution for major scientific centers participating in multiple collaborations
Further plans on testing converged solutions (Compute + Storage)
Evaluate Data Lake as a storage platform for Russian scientific centers and major experiments (NICA, XFEL, PIK)
  Possibility to have dedicated storage resources with configurable redundancy in a global system
  Geolocation awareness and dynamic storage optimization
  Data relocation and replication with proper use of fast networks
  Federated system with interoperable storage endpoints based on different solutions with a common transfer protocol
Synthetic file locality tests on EUlake
The following combinations of layouts and placement policies were put in place and tested from a single client with geotag RU::PNPI:
  Layouts: Plain, Replica (2 stripes), RAIN (4+2 stripes)
  Placement policies: Gathered, Hybrid, Simple (based on client geotag)
Expected results
  Availability of geo-local replicas should improve file read (stage-in) speed
  The ability to tie directories to local storage (FSTs) should improve write (stage-out) speed
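The three layouts differ in how many stripes each file produces and how much raw space it occupies. The sketch below works this out for the 100 MB test files, assuming the usual accounting in which a 4+2 RAIN layout splits a file into 4 data stripes plus 2 redundancy stripes of equal size. The `Layout` class is a hypothetical helper for this illustration, not an EOS API, and the figure of 100 files is an assumption suggested by the replica counts for the plain layout, which sum to 100.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Layout:
    """Hypothetical description of a file layout (not an EOS class)."""
    name: str
    data_stripes: int
    redundancy_stripes: int

    @property
    def stripes(self) -> int:
        return self.data_stripes + self.redundancy_stripes

    def stored_mb(self, file_mb: float) -> float:
        # Each stripe holds 1/data_stripes of the file.
        return file_mb * self.stripes / self.data_stripes


LAYOUTS = [
    Layout("plain", 1, 0),
    Layout("replica (2 stripes)", 1, 1),
    Layout("raid6 / RAIN (4+2 stripes)", 4, 2),
]

FILE_MB = 100   # file size used in the tests
N_FILES = 100   # assumption, see the lead-in

for layout in LAYOUTS:
    print(f"{layout.name}: {layout.stripes} stripes per file, "
          f"{layout.stored_mb(FILE_MB):.0f} MB stored per file, "
          f"{layout.stripes * N_FILES} stripes for {N_FILES} files")
```

With these assumptions the stripe totals come out as 100, 200 and 600, which matches most of the replica-count totals reported for the plain, replica and raid6 runs.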
100 MB file I/O performance tests with different layouts and placement policies (1)
[Plots: write and read performance for each configuration]
Plain layout, configurations in slide order:
  sys.forced.placementpolicy="gathered:RU", sys.forced.layout=plain
  no sys.forced.placementpolicy (based on client geotag), sys.forced.layout=plain
Replica counts, in slide order:
  CERN::0513 1, CERN::HU 3, ES::PIC 2, RU::Dubna 46, RU::PNPI 48
  CERN::HU 6, ES::PIC 1, RU::PNPI 93
raid6 layout, configurations in slide order:
  sys.forced.placementpolicy="gathered:RU", sys.forced.layout=raid6
  no sys.forced.placementpolicy (based on client geotag), sys.forced.layout=raid6
Replica counts, in slide order:
  RU::Dubna 400, RU::PNPI 200
  CERN::0513 71, CERN::HU 129, ES::PIC 100, RU::Dubna 100, RU::PNPI 200
100 MB file I/O performance tests with different layouts and placement policies (2)
[Plots: write and read performance for each configuration]
Replica layout, configurations in slide order:
  no sys.forced.placementpolicy (based on client geotag), sys.forced.layout=replica
  sys.forced.placementpolicy="gathered:RU", sys.forced.layout=replica
Replica counts, in slide order:
  RU::Dubna 100, RU::PNPI 100
  CERN::0513 13, CERN::HU 21, ES::PIC 35, RU::Dubna 19, RU::PNPI 112
sys.forced.placementpolicy="hybrid:RU", sys.forced.layout=replica
  Replica counts: CERN::HU 14, CERN::9918 20, ES::PIC 31, RU::Dubna 65, RU::PNPI 7
Observations
  Replica scattering (rebalancing) was observed within a couple of days.
  Reads are always redirected to the closest server.
  RAIN impacts I/O performance the most.
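One way to compare the placement results is to aggregate the replica counts by the top-level part of each geotag and look at the share that stayed in RU. The sketch below does this for the three replica-layout count sets reported on this slide. The first two sets are labelled by slide order rather than by placement policy, and `share_by_region` is a hypothetical helper written for this illustration, not an EOS tool.

```python
def share_by_region(counts: dict[str, int]) -> dict[str, float]:
    """Fraction of replicas per top-level geotag (the part before '::')."""
    totals: dict[str, int] = {}
    for geotag, n in counts.items():
        region = geotag.split("::", 1)[0]
        totals[region] = totals.get(region, 0) + n
    grand_total = sum(totals.values())
    return {region: n / grand_total for region, n in totals.items()}


# Replica counts copied from the slide (replica layout).
REPLICA_SETS = {
    "replica, set 1": {"RU::Dubna": 100, "RU::PNPI": 100},
    "replica, set 2": {"CERN::0513": 13, "CERN::HU": 21, "ES::PIC": 35,
                       "RU::Dubna": 19, "RU::PNPI": 112},
    "replica, hybrid:RU": {"CERN::HU": 14, "CERN::9918": 20, "ES::PIC": 31,
                           "RU::Dubna": 65, "RU::PNPI": 7},
}

for name, counts in REPLICA_SETS.items():
    shares = share_by_region(counts)
    print(f"{name}: {sum(counts.values())} replicas, "
          f"{shares.get('RU', 0.0):.1%} in RU")
```

By this tally all replicas of the first set are in RU, roughly two thirds of the second set, and roughly half of the hybrid:RU set. The hybrid:RU counts were taken from a run in which rebalancing was observed, so they may not reflect the initial placement.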
ES RU in EOS
[Diagram; labels as follows]
Sites: ES:SiteA, ES:SiteB, RU:SiteA, RU:SiteB, RU:SiteC
Groups:
  group.X: replica 2 + CTA
  group.Y: plain + CTA
  group.Z: CTA
  group.W: replica 2
  group.U: raid6
Additional site labels: RU:SiteB, RU:SiteC
Summary
EUlake is currently operational as a proof-of-concept
Functional and real-life tests are ongoing
Effort is being made to publish various metrics to a centralized monitoring system
Expansion of network capacity between major scientific centers in Russia enables efficient data management for future experiments
Future plans
Extensive testing of different types of QoS (possibly simulated) with different storage groups
Explore different caching schemes
Test automatic data migration
Evolve the infrastructure from a simple proof-of-concept into one capable of measuring the performance of possible future distributed storage models
Thank you!
Acknowledgements
This work was supported by the NRC "Kurchatov Institute" (№ 1608).
The authors express appreciation to the computing centers of NRC "Kurchatov Institute", JINR and other institutes for the resources provided.