CASPUR / CERN / CSP / DataDirect / Panasas / RZ Garching
Im Storage etwas Neues (Not All is Quiet on the Storage Front)
Andrei Maslennikov, CASPUR Consortium, May 2005

Participants:
- CASPUR: S.Eccher, A.Maslennikov(*), M.Mililotti, M.Molowny, G.Palumbo
- CERN: B.Panzer-Steindel, R.Többicke
- CSP: R.Boraso
- DataDirect Networks: T.Knözinger
- Panasas: C.Weeden
- RZ Garching: H.Reuter
(*) Project Coordinator

Sponsors for these test sessions:
- CISCO Systems: loaned a Catalyst 3750G-24TS switch
- Cluster File Systems: provided the latest Lustre software
- DataDirect Networks: loaned an S2A 8500 disk system and participated in the tests
- E4 Computer Engineering: loaned 5 assembled biprocessor nodes
- Extreme Networks: loaned a 48-port Summit switch
- Panasas: loaned 3 ActiveScale shelves and actively participated in the tests

Contents
- Goals
- Components under test
- Measurements
- Conclusions
- Cost considerations

Goals for this test series
- This time we decided to re-evaluate one seasoned NAS solution (AFS) and to take a look at two new ones (Lustre 1.4.1, Panasas 2.2.2). These three distributed file systems all provide a single file system image, are highly scalable, and may be considered as possible datastore candidates for large Linux farms with commodity GigE NICs.
- Two members of our collaboration were interested in NFS, so we tested it as well.
- As usual, we based our setup on the most recent components that we could obtain (disk systems and storage servers). One of the goals was to check and compare the performance of the RAID controllers involved.

Components
Disk systems:
- 4x Infortrend EonStor A16F-G2221 16-bay SATA-to-FC arrays:
  16 Hitachi 7K250 250 GB SATA disks (7200 rpm) per controller
  Two 2 Gbit Fibre Channel outlets per controller
  Cache: 1 GB per controller
- 1x DataDirect S2A 8500 system:
  2 controllers
  160 Maxtor MaXLine II 250 GB SATA disks (7200 rpm)
  8 Fibre Channel outlets at 2 Gbit
  Cache: 2.56 GB

Infortrend EonStor A16F-G2221
- Two 2 Gbps Fibre host channels
- NEW: ASIC 266 MHz architecture (twice the power of the A16F-G1A2): 300+ MB/s
- RAID levels supported: RAID 0, 1 (0+1), 3, 5, 10, 30, 50, NRAID and JBOD
- Multiple arrays configurable with dedicated or global hot spares
- Automatic background rebuild
- Configurable stripe size and write policy per array
- Up to 1024 LUNs supported
- 3.5", 1" high 1.5 Gbps SATA disk drives
- Variable stripe size per logical drive
- Up to 64 TB per LD
- Up to 1 GB SDRAM

DataDirect S2A 8500
- Single 2U S2A8500 with four 2 Gb/s ports, or dual 4U with eight 2 Gb/s ports
- Up to 1120 disk drives; 8192 LUNs supported
- 5 TB to 336 TB with FC or SATA disks
- Sustained performance 1.5 GB/s (dual 4U, sequential large block)
- Full Fibre Channel duplex performance on every port
- PowerLUN™: 1.5 GB/s+ individual LUNs without host-based striping
- Up to 20 GB of cache, LUN-in-Cache Solid State Disk functionality
- Real-time any-to-any virtualization
- Very fast rebuild rate

Components
- High-end Linux units for both servers and clients
  Servers: 2-way Intel Nocona 3.4 GHz, 2 GB RAM, 2x QLA2310 2 Gbit HBAs
  Clients: 2-way Intel Xeon 2.4+ GHz, 1 GB RAM
  OS: SuSE SLES 9 on servers, SLES 9 / RHEL 3 on clients
- Network: nonblocking GigE switches
  CISCO 3750G-24TS (24 ports)
  Extreme Networks Summit (48 ports)
- SAN: QLogic SANbox 5200, 32 ports
- Appliances: 3 Panasas ActiveScale shelves
  Each shelf had 3 DirectorBlades and 8 StorageBlades

(Panasas diagram: Storage Cluster components. A shelf combines DirectorBlades and StorageBlades behind an integrated GE switch; the shelf front holds 1 DB and 10 SBs, the shelf rear a midplane routing GE and power, plus a battery module and 2 power units.)

SATA / FC Systems

SATA / FC: RAID controllers
- DataDirect S2A 8500: each of the two singlet controllers had 4 FC outlets, so we used two storage server nodes (2x 3.4 GHz, 2x QLA2310 HBAs each) to fully load one singlet; 4 logical drives per singlet.
- Infortrend A16F-G2221: the new ASIC 266 based controller is capable of delivering more than 200 MB/s, so we needed 2 HBAs at 2 Gbit to attach it to one server node (2x 3.4 GHz); 2 logical drives.

SATA / FC: RAID controllers
In all configurations, every server node had two logical drives (IFT or DDN). We tried to use them either separately or joined in software RAID (MD). To get the most out of our RAID systems, we tried different file systems (EXT3 and XFS) and varied the number of jobs doing I/O in the system. Every job was an instance of "lmdd" from the "lmbench" suite. We wrote / read files of 10 GB (a sketch of how the jobs were launched follows):
    lmdd of=/LD/file bs=1000k count=10000 fsync=1
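The slides do not show how the parallel jobs were driven; below is a minimal sketch of such a driver, assuming two jobs per server and mount points named /LD1, /LD2 (the paths, job count and timing logic are illustrative, not from the original):

    #!/bin/bash
    # Hypothetical driver: start N parallel lmdd writers, one per logical drive,
    # and report the aggregate streaming write rate.
    NJOBS=2                      # assumed: 1 or 2 jobs were used per server
    SIZE_MB=10000                # 10 GB per file, as in the slides
    start=$(date +%s)
    for i in $(seq 1 $NJOBS); do
        # /LD$i stands for the EXT3 or XFS mount of logical drive $i (assumed path)
        lmdd of=/LD$i/file$i bs=1000k count=$SIZE_MB fsync=1 &
    done
    wait                         # block until all writers have finished
    end=$(date +%s)
    elapsed=$(( end - start )); [ $elapsed -gt 0 ] || elapsed=1
    echo "aggregate write rate: $(( NJOBS * SIZE_MB / elapsed )) MB/s"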

SATA / FC: RAID controllers
When XFS is used, one DDN singlet is more than two times faster than one IFT. The situation with EXT3 is different: two IFTs behave more or less like one DDN.

One DDN S2A 8500 singlet              EXT3 (MB/sec)      XFS (MB/sec)
                                      Write    Read      Write    Read
  LD1,2,3,4 - 4 jobs
  MD1(2 LDs) + MD2(2 LDs) - 4 jobs

One IFT A16F-G2221                    EXT3 (MB/sec)      XFS (MB/sec)
                                      Write    Read      Write    Read
  LD1 - 1 job
  LD1,2 - 2 jobs
  MD1(2 LDs) - 1 job
  MD1(2 LDs) - 2 jobs

SATA / FC: Expandability and costs
Infortrend: one array with 1 GB of cache and a backup battery (without disks) sells for 5350 E. The disks cost either 16*170 = 2720 E (250 GB) or 16*350 = 5600 E (400 GB). All together, this means:
  300 MB/sec, 4.0 TB  ->  8070 E   ->  2.0 E/GB
  300 MB/sec, 6.4 TB  ->  10950 E  ->  1.7 E/GB
The new model A24F-R with the same controller and 24 drives yields (very approximate pricing):
  300 MB/sec, 9.6 TB  ->  1.8 E/GB
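As a quick check, the totals and per-GB figures above follow directly from the quoted prices; a trivial shell verification using only the numbers from this slide:

    # Arithmetic check of the Infortrend figures (prices in euro)
    echo "4.0 TB config: $(( 5350 + 16 * 170 )) E"             # 8070 E
    echo "6.4 TB config: $(( 5350 + 16 * 350 )) E"             # 10950 E
    echo "E/GB at 4.0 TB: $(echo "scale=1; 8070/4000"  | bc)"  # ~2.0
    echo "E/GB at 6.4 TB: $(echo "scale=1; 10950/6400" | bc)"  # ~1.7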

SATA / FC: Expandability and costs
Data Direct Networks: this system definitely belongs to another pricing class. The 2-controller system with 20 TB and 8 FC outlets may cost around 80 KE:
  1500 MB/sec, 20 TB  ->  80000 E  ->  4.0 E/GB
while similar power may be achieved with 5 IFT systems at ~40 KE:
  1500 MB/sec, 20 TB  ->  40000 E  ->  2.0 E/GB
At a price per GB two times higher, the DDN system has far more expandability and redundancy options than the IFT. As all logical drives may be zoned internally to any of the FC outlets, one can build different failover configurations. DDN drive rebuilds are exceptionally fast and do not lead to visible performance penalties.

File System Tests

Test setup (NFS, AFS, Lustre)
- Load farm: 16 biprocessor nodes at 2.4+ GHz, connected over Gigabit Ethernet (CISCO/Extreme switches)
- 4 storage servers (Server 1-4), each serving two logical drives (LD1, LD2), plus an MDS node for Lustre
- Servers attached to the disk systems through the QLogic 5200 SAN
- On each server, 2 Gigabit Ethernet NICs were bonded (bonding-ALB); a configuration sketch follows this list
- LD1, LD2 could be IFT or DDN; each LD was zoned to a distinct HBA
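The slides do not include the bonding commands; a minimal sketch of an adaptive-load-balancing (balance-alb) bond on a 2.6-kernel server of that era might look like this, with the interface names and IP address as placeholder assumptions:

    # Hypothetical bonding setup (balance-alb) for the two GigE NICs of one
    # storage server; eth0/eth1 and the address below are placeholders.
    modprobe bonding mode=balance-alb miimon=100
    ip link set bond0 up
    ifenslave bond0 eth0 eth1          # enslave both physical NICs
    ip addr add 192.168.1.11/24 dev bond0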

Configuration details (NFS, AFS, Lustre)
- In all 3 cases we used SuSE SLES 9 with a 2.6.x kernel on the server nodes (highly recommended by the Lustre people).
- NFS: all servers were started with either 64 or 128 threads. Each of them exported to all the clients 2 partitions set up on the two logical drives. The clients ran RHEL 3 with an ELsmp kernel. Mount options: -o tcp,rsize=32768,wsize= (an export/mount sketch follows this list).
- AFS: we used a variant of MR-AFS prepared by H.Reuter for SLES 9. All client nodes were SLES 9 with an smp kernel. Cache: in memory, 64 MB. Each of the server nodes exported 2 /vicep partitions organized on two independent logical drives.
- Lustre: installation was straightforward (just 2 RPMs). As our disks were very fast, there was little point in using striping (although we tested it as well). Each of the servers (OSS) exported two independent OSTs set up on 2 independent LDs. We used RHEL 3 on the clients, with Lustre's version of the smp kernel.
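A minimal sketch of the NFS server export and client mount described above; the export paths, host name, async export option and the wsize value (assumed here to match rsize) are illustrative assumptions, not taken from the slides:

    # --- server (SLES 9), hypothetical paths and options ---
    cat >> /etc/exports <<'EOF'
    /LD1  *(rw,async,no_root_squash)
    /LD2  *(rw,async,no_root_squash)
    EOF
    exportfs -ra                        # publish the two per-LD exports
    rpc.nfsd 128                        # 64 or 128 nfsd threads were used

    # --- client (RHEL 3), mount over TCP with 32 KB transfer sizes ---
    mount -t nfs -o tcp,rsize=32768,wsize=32768 server1:/LD1 /mnt/ld1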

Configuration details (Panasas)
- Panasas is an easily configurable appliance; physically it is a blade server. A 3U enclosure may host up to 11 blades. Each blade is a standalone Linux host complete with a CPU and one or two disk drives.
- All blades in a shelf are connected to an internal GigE switch. Four ports of this switch serve to communicate with the external world and should be connected via inter-switch trunking (802.3ad) to the clients' network.
- There are 2 types of blades: DirectorBlades and StorageBlades. DirectorBlades administer the data traffic requests; the real data are served by StorageBlades. The number of blades of each type in a shelf may vary; we used 1 DB and 10 SBs.
- Client access: either the native protocol (DirectFlow), which requires installation of a matching kernel module (we used RHEL 3 with an ELsmp kernel), or NFS/CIFS.
- Clients mount a single-image file system from one of the DirectorBlades. Adding more shelves does not disrupt operation and increases the aggregate throughput and capacity (we have not yet tested this).

Dynamic load balancing (Panasas)
- StorageBlade load balancing: eliminates capacity imbalances
  Passive: uses file creates to rebalance
  Active: re-stripes to balance new storage
- DirectorBlade load balancing: assigns clients across the DirectorBlade cluster and balances metadata management; DirectorBlades are automatically added/deleted
- Key point: eliminates the need to constantly monitor and redistribute capacity and load

What we measured
1) Massive aggregate I/O (large files, lmdd)
   - All 16 clients were unleashed together; file sizes varied in the range 5-10 GB
   - Gives a good idea of the system's overall throughput
2) Pileup. This special benchmark was developed at CERN by R.Többicke.
   - Emulation of an important use case foreseen in one of the LHC experiments
   - Several (64-128) 2 GB files are first prepared on the file system under test
   - The files are then read by a growing number of reader threads (ramp-up)
   - Every thread randomly selects one file from the list
   - In a single read act, an arbitrary offset within the file is calculated and a small (KB-sized) block is read starting at this offset
   - The output is the number of operations times bytes read per time interval
   - Pileup results are important for future service planning (a rough emulation sketch follows this list)
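The original Pileup code is not reproduced in the slides; the following is a rough shell emulation of its access pattern, with the mount point, the per-operation read size and the ramp-up schedule as assumptions:

    #!/bin/bash
    # Hypothetical emulation of the Pileup access pattern: each reader repeatedly
    # picks a random 2 GB file and reads one small block at a random offset.
    # (No throughput accounting here; only the I/O pattern is emulated.)
    DIR=/mnt/testfs            # assumed mount point of the file system under test
    NFILES=64                  # 64-128 prepared 2 GB files, per the slides
    BLOCK_KB=64                # read size per operation: an assumption
    FILE_BLOCKS=$(( 2 * 1024 * 1024 / BLOCK_KB ))   # blocks of BLOCK_KB in 2 GB

    reader() {
        while true; do
            f=$DIR/file$(( RANDOM % NFILES ))          # pick a random file
            off=$(( RANDOM * RANDOM % FILE_BLOCKS ))   # random block offset
            dd if=$f of=/dev/null bs=${BLOCK_KB}k skip=$off count=1 2>/dev/null
        done
    }

    # Ramp up the number of concurrent readers, as in the benchmark
    for n in 1 2 4 8 16 32; do
        for i in $(seq 1 $n); do reader & done
        sleep 60; kill $(jobs -p)    # run each step for a minute, then stop
    done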

A typical Pileup curve (figure)

3) Emulation of a DAQ data buffer
   - A very common scenario in HEP DAQ architectures
   - Data constantly arrives from the detector and has to end up on tertiary storage (tapes)
   - A temporary storage area on the way to tape serves for reorganization of streams, preliminary real-time analysis, and as a security buffer against interruptions of the archival system
   - Of big interest for service planning: the general throughput of a balanced data buffer
   - A DAQ manager may moderate the data influx (for instance, by tuning certain trigger rates), thus balancing it with the outflux
   - We ran 8 writers and 8 readers, one process per client. Each file was accessed at any given moment by one and only one process. On the writer nodes we could moderate the writer speed by adding dummy "CPU eaters" (a minimal emulation sketch follows this list).
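A minimal sketch of how such a balanced data-buffer run could be emulated with one writer or reader process per client; the buffer path, file size, number of CPU eaters and the rename-based handoff are assumptions, not the actual tool used in the tests:

    #!/bin/bash
    # Hypothetical DAQ-buffer emulation: writers stream files into the buffer and
    # hand them over by renaming; readers claim finished files, stream them out
    # (as if migrating to tape) and delete them.
    BUF=/mnt/buffer                 # assumed mount point of the file system under test
    SIZE_MB=2000                    # assumed file size
    ROLE=$1                         # "writer" or "reader", one process per client

    if [ "$ROLE" = writer ]; then
        # Optional moderation of the influx: dummy CPU eaters on the writer node
        for i in 1 2; do while :; do :; done & done
        n=0
        while true; do
            f=$BUF/$(hostname).part.$n
            lmdd of=$f bs=1000k count=$SIZE_MB fsync=1
            mv $f $BUF/$(hostname).ready.$n      # hand the finished file to a reader
            n=$(( n + 1 ))
        done
    else
        while true; do
            for f in $BUF/*.ready.*; do
                [ -e "$f" ] || { sleep 1; continue; }
                mv $f $f.$$ 2>/dev/null || continue   # claim the file exclusively
                dd if=$f.$$ of=/dev/null bs=1000k     # emulate migration to tape
                rm -f $f.$$
            done
        done
    fi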

DAQ Data Buffer (figure)

Results for 8 GigE outlets

                    Massive I/O (MB/sec)     Balanced DAQ buffer     Pileup
                    Write        Read        influx (MB/sec)         (MB/sec)
  NFS      IFT
  AFS      IFT
  LUSTRE   IFT
  LUSTRE   DDN
  PANASAS  (x2)

Remarks:
- Each of the storage nodes had 2 GigE NICs. We tried adding a third NIC to see if we could get more out of a node; there was a modest improvement of less than 10 percent, so we decided to use 8 NICs on 4 nodes per run.
- The Panasas shelf had 4 NICs, so we report its results multiplied by 2 in order to compare it with the other 8-NIC configurations.

Some caveats
NFS:
- There was not a single glitch while we were running all three tests on it. However, once we started a VERY HIGH number of writer threads per server (192 threads per box), we were able to hang some of them.
- We did not use the latest SLES kernel update, nor did we have time to debug the problem. Whoever wants to entrust themselves to this (or any other) version of NFS should run a massive stress test on it first.
Lustre:
- The (undocumented) parameter that governs the Lustre client's read-ahead behaviour was of crucial importance to our tests. It is hard to find a "golden" value for this parameter that satisfies both the Pileup and the massive-read benchmarks: if you are fast with Pileup, you are slow with large streaming reads, and vice versa (a tuning sketch follows this list).
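The slide does not name the parameter; in later Lustre releases the client read-ahead limit is exposed as llite.*.max_read_ahead_mb, and a tuning sweep might look like the sketch below. The parameter path and the values tried are assumptions, and the actual 1.4.1 tunable referred to here may differ:

    #!/bin/bash
    # Hypothetical read-ahead sweep on a Lustre client. The tunable name below
    # (max_read_ahead_mb) is taken from later Lustre releases and is assumed here.
    for mb in 0 4 16 40; do
        for f in /proc/fs/lustre/llite/*/max_read_ahead_mb; do
            echo $mb > $f
        done
        echo "read-ahead = $mb MB"
        lmdd if=/mnt/lustre/bigfile of=/dev/null bs=1000k count=10000   # streaming read
        # ...then re-run the Pileup benchmark with the same setting and compare
    done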

Conclusions
1) With 8 GigE NICs in the system, one would expect a throughput in excess of 800 MB/sec for large streaming I/O. Lustre and Panasas can clearly deliver this, and NFS also does quite well. The very fact that we were operating around 800 MB/sec with this hardware means that our storage nodes were well balanced (no bottlenecks; we may even have had a reserve of 100 MB/sec per setup).
2) Pileup results were relatively good for AFS and best for Panasas. The outcome of this benchmark is correlated with the number of spindles in the system: two Panasas shelves had 40 spindles, while the 4 storage nodes used 64 spindles. So the Panasas file system was doing a much better job per spindle than any other solution we tested (NFS, AFS, Lustre).

3) Of the three distributed file systems we tested, Lustre and Panasas demonstrated the best results. Panasas may have an edge over Lustre, as it performs uniformly well for both large streaming and random-access I/O.
4) AFS should still not be discarded, as it appears that it may even be made economically affordable (see the next slide).

Cost estimates for 800 MB/sec (very approximate!)
An IFT-based storage node:
  1 machine .......................... KE
  2 HBAs ............................. KE
  1 IFT with 2.5 TB .................. KE
  -----------------------------------------
  Total .............................. KE
Option 1: AFS "alignment" (add 1 machine) ......... KE
Option 2: Lustre commercial version ............... 5+ KE
Panasas: 1 shelf (bulk buy of more than N units, including one year of golden 24/7 support) ... KE

TOTALS at the 10 TB, 800 MB/sec comparison point
  4 IFT, Lustre public(*) (10 TB, 800 MB/sec) .... KE
  4 IFT, AFS (10 TB, 800 MB/sec) ................. KE
  4 IFT, Lustre commercial (10 TB, 800 MB/sec) ... KE
  2 Panasas shelves (10 TB, 800 MB/sec) .......... KE
Of course, these numbers are a bit "tweaked", as one would never put only 2.5 TB in one Infortrend array. The same table with the 4 IFT units completely filled with disks looks different (see the next page).
(*) We obtained all the numbers using the commercial version of Lustre; it may become public in 1-2 years from now.

TOTALS at 800 MB/sec (different capacities)
  4 IFT, Lustre public, 25.6 TB ... KE
  4 IFT, AFS, 25.6 TB ............. KE
  4 IFT, Lustre public, 38.4 TB ... KE
  4 IFT, AFS, 38.4 TB ............. KE
Two Panasas shelves contain 40 disk drives. Replacing the 250 GB drives at 170 E with 500 GB drives at 450 E apiece, we arrive at:
  2 Panasas shelves, 20 TB ........ KE
IFT may grow in steps of (200 MB/s - 10 TB - 22 KE); Panasas may grow in steps of (400 MB/s - 10 TB - 35 KE).