TPT-RAID: A High Performance Multi-Box Storage System

Presentation transcript:

TPT-RAID: A High Performance Multi-Box Storage System
Erez Zilber, Yitzhak Birk (Technion)

Agenda
Introduction
Improving Communication Efficiency
Relieving the Controller Bottleneck
Performance

Basic Terminology
SCSI (Small Computer System Interface): a standard protocol between computers and peripheral devices (mainly storage devices). It was developed in the T10 working group of ANSI and uses a client-server model.
iSCSI (Internet SCSI): a mapping of SCSI over TCP. The iSCSI client (e.g., a host computer) is called the ‘initiator’; the iSCSI server (e.g., a disk box) is called the ‘target’.

Basic Terminology (cont.)
RAID (Redundant Array of Inexpensive Disks): uses multiple drives to store data redundantly across the drives. It specifies a number of prototype “RAID levels”:
RAID-1: an exact copy of the data on two or more disks.
RAID-4: striping with a dedicated parity disk.
RAID-5: similar to RAID-4, with the parity data distributed across all member disks.
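As a minimal sketch of the parity idea behind RAID-4 and RAID-5 (not code from the presentation): the parity block is the bytewise XOR of the data blocks in a stripe, so any single lost block can be rebuilt by XORing the surviving blocks with the parity. The helper names `xor_blocks` and `parity` and the two-byte blocks are illustrative assumptions.

```python
from functools import reduce


def xor_blocks(a: bytes, b: bytes) -> bytes:
    """Bytewise XOR of two equally sized blocks."""
    if len(a) != len(b):
        raise ValueError("blocks must have equal length")
    return bytes(x ^ y for x, y in zip(a, b))


def parity(blocks):
    """XOR of all blocks in a stripe (hypothetical helper)."""
    return reduce(xor_blocks, blocks)


# Three tiny data blocks standing in for one stripe.
data = [b"\x01\x02", b"\x10\x20", b"\xff\x00"]
p = parity(data)

# Pretend the disk holding data[1] failed: XOR the survivors with the parity.
rebuilt = parity([data[0], data[2], p])
print(rebuilt == data[1])
```

The same XOR serves both for computing parity and for reconstruction, which is why a single parity block per stripe suffices to survive one failure.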

RAID - Examples
[Figure: example layouts for RAID-1 (mirroring), RAID-4, and RAID-5; the RAID-5 panel marks a new data block and a new parity block.]

Storage Trends
Originally: direct-attached storage that belongs to its computer.
1990s: “mainframe” storage servers.
2000: separation of control from the actual storage boxes.
Control: network-attached storage (NAS) with a file interface, or storage area networks (SAN) with a block interface.
Storage boxes: RAID of some type.
In almost all of these, an entire RAID group is within a single box.

The problem
Storage devices are becoming cheaper. However, highly available single-box storage systems are still expensive. Even such systems are susceptible to failures that affect the entire box.
[Figure: single-box storage system with a RAID controller and disks.]

Multi-Box RAID
A single, fault-tolerant controller is connected to multiple storage boxes (targets). Any given parity group uses at most one disk drive from any given box. The controller and the disks reside in separate machines. iSCSI may be used to send SCSI commands and data.
[Figure: multi-box storage system.]

Multi-Box RAID (cont.)
Advantages: There is no single point of storage-box failure. Highly available, expensive storage boxes are no longer needed.
Disadvantages: Transferring data over a network is not as efficient as using the DMA engine in a single-box RAID system; merely using storage protocols (e.g., iSCSI) over conventional network infrastructure is not enough. The controller is a bottleneck, which leads to poor scalability. Preserving the storage-box capacity (cost effectiveness) may be problematic for the controller.
[Figure: multi-box storage system.]


InfiniBand
InfiniBand defines a high-speed network for interconnecting processing nodes and I/O nodes (>10 Gbit/s end-to-end).
InfiniBand supports RDMA (Remote DMA): high speed, low latency, very lean, and no CPU involvement.

iSCSI Extensions for RDMA (iSER)
iSER is an IETF standard that maps iSCSI over a network that provides RDMA services. Data is transferred directly into SCSI I/O buffers without intermediate data copies.
iSER splits control and data: RDMA is used for data transfer, while the sending of control messages is left unchanged. The same physical path may be used for both.

iSCSI over iSER: Read Requests
[Figure: message sequence for a read request; legend: TCP packets, RDMA.]

iSCSI over iSER: Write Requests
[Figure: message sequence for a write request; legend: TCP packets, RDMA.]

iSER + Multi-Box RAID
iSCSI over iSER solves the problem of inefficient data transfer. The separation of control and data is really a protocol separation over the same path.
The scalability problem remains: all data passes through the controller, and when using RAID-4/5 the controller has to perform the parity calculations.


Removing the Controller from the Data Path – 3rd Party Transfer
3rd Party Transfer: one iSCSI entity instructs a 2nd iSCSI entity to read or write data to a 3rd iSCSI entity.
Data is transferred directly between hosts and targets under controller command. This gives lower zero-load latency, especially for large requests (one hop instead of two), and the controller’s memory, busses and InfiniBand link do not become a bottleneck.
Out-of-band controllers already exist, but RDMA makes out-of-band data transfers more transparent, and we carry the idea into the RAID.

RDMA and Out-of-Band Controller
3rd Party Transfer is more transparent when combined with RDMA: it is transparent from a host point of view and almost transparent from a target point of view.
Adding 3rd Party Transfer to iSCSI over iSER is essential for removing the controller from the data path.

Distributed Parity Calculation
The controller is not in the data path, so it cannot compute parity.
Side benefit: relieves another possible controller bottleneck.

Distributed Parity Calculation – a Binary Tree
[Figure: binary tree of XOR operations. Top row: data block 0, data block 1, data block 2, data block 3, ..., data block N-2, and the parity block. Each XOR of two inputs yields a temporary result; temporary results are XORed in turn until the new parity block is produced.]
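The tree-shaped reduction can be sketched in a few lines. This is an illustrative model only, assuming each list element is a block held by a different target and each round pairs neighbours, with the XOR performed at the receiving target; the function names and the one-byte blocks are hypothetical and not taken from the presentation.

```python
def xor_blocks(a: bytes, b: bytes) -> bytes:
    """Bytewise XOR of two equally sized blocks."""
    if len(a) != len(b):
        raise ValueError("blocks must have equal length")
    return bytes(x ^ y for x, y in zip(a, b))


def tree_parity(blocks):
    """Pairwise XOR reduction; returns (result, number_of_rounds).

    Each round models one level of the binary tree: neighbouring blocks
    are XORed in pairs, and an unpaired block is carried up unchanged.
    """
    if not blocks:
        raise ValueError("need at least one block")
    level = list(blocks)
    rounds = 0
    while len(level) > 1:
        nxt = [xor_blocks(level[i], level[i + 1])
               for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])
        level = nxt
        rounds += 1
    return level[0], rounds


result, rounds = tree_parity([b"\x01", b"\x02", b"\x04", b"\x08", b"\x10"])
print(result.hex(), rounds)
```

With pairs combined in parallel, the number of sequential rounds grows logarithmically with the number of blocks rather than linearly, which is the appeal of the tree layout.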

Example: 3rd Party Transfer and Distributed Parity Calculation
The host sends a command to the RAID controller.
The RAID controller sends commands to the targets.
The targets perform RDMA operations to the host.
The RAID controller sends commands to recalculate the parity block (only for WRITE requests).
The targets calculate the new parity block, using target-to-target data transfers and an XOR operation in the receiving target.
[Figure: host, RAID controller, and targets connected by numbered arrows 1 to 4, labelled CMD, RDMA, and RDMA (parity calculation).]

Compared Systems
Baseline RAID: hosts, in-band controller (Baseline controller), targets, iSCSI over iSER.
TPT-RAID: out-of-band controller (TPT controller), TPT targets, 3rd Party Transfer, Distributed Parity Calculation.

Amount of Transferred Data (READ)
Baseline system:
Controller: read from the target: 1; write to the host: 1. Total: 2 blocks.
Targets: write to the controller: 1. Total: 1 block.
TPT system:
Controller: no data transfers. Total: 0 blocks.
Targets: write to the host: 1.

Amount of Transferred Data (WRITE)
Baseline system:
Controller: read from the host: 1; read old data from the targets: 2; write new data and parity to the targets: 2. Total: 5 blocks.
Targets: write old data to the controller: 2; read new data and parity from the controller: 2. Total: 4 blocks.
TPT system:
Controller: no data transfers. Total: 0 blocks.
Targets: read new data from the host: 1; parity calculation between targets: 1. Total: 2 blocks.
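The counts for a single-block write are consistent with the usual read-modify-write parity update, sketched below. The slides do not spell out the formula, so this assumes the two "old" blocks are the old data block and the old parity block, and that the new parity is old data XOR new data XOR old parity. The function name and the one-byte blocks are hypothetical.

```python
def xor_blocks(a: bytes, b: bytes) -> bytes:
    """Bytewise XOR of two equally sized blocks."""
    if len(a) != len(b):
        raise ValueError("blocks must have equal length")
    return bytes(x ^ y for x, y in zip(a, b))


def small_write_parity(old_data: bytes, new_data: bytes,
                       old_parity: bytes) -> bytes:
    """Parity after overwriting one data block (assumed update rule)."""
    return xor_blocks(xor_blocks(old_data, new_data), old_parity)


# A stripe of three one-byte data blocks and its parity.
stripe = [b"\x0f", b"\xf0", b"\x33"]
old_parity = xor_blocks(xor_blocks(stripe[0], stripe[1]), stripe[2])

# Overwrite the middle block without touching the other data blocks.
new_block = b"\x55"
new_parity = small_write_parity(stripe[1], new_block, old_parity)

# Cross-check against recomputing parity over the whole updated stripe.
full = xor_blocks(xor_blocks(stripe[0], new_block), stripe[2])
print(new_parity == full)
```

Only the old data, the new data, and the old parity are needed, which is why the update involves two old blocks and two new blocks regardless of how wide the stripe is.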

RDP with 3rd Party Transfer
Row-Diagonal Parity (RDP) is an extension to RAID-5. It calculates two sets of parity information, row parity and diagonal parity, and can tolerate two failures.

RDP with 3rd Party Transfer (cont.)
READ commands: similar to RAID-5.
WRITE commands: more parity calculations are required.
3rd Party Transfer and Distributed Parity Calculation relieve the RAID controller bottleneck.

Mirroring with 3rd Party Transfer
READ commands: similar to RAID-5.
WRITE commands may be executed in one of the following two ways: either all targets read the new data directly from the host, or a single target reads the new data directly from the host and transfers it to the other targets.
Using 3rd Party Transfer for mirroring relieves the RAID controller bottleneck.
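The two WRITE options shift load between the host link and the target-to-target links. The tally below is a rough sketch of that trade-off per written block. It assumes that in the second option the first target sends the block to every other mirror itself (the slides do not say whether it fans out or relays along a chain), and the function and strategy names are hypothetical.

```python
def mirror_write_transfers(num_mirrors: int, strategy: str) -> dict:
    """Count block transfers for one mirrored write (illustrative model)."""
    if num_mirrors < 1:
        raise ValueError("need at least one mirror target")
    if strategy == "all_from_host":
        # Every target reads the new block directly from the host.
        return {"host_to_target": num_mirrors, "target_to_target": 0}
    if strategy == "via_first_target":
        # One target reads from the host, then forwards to the others.
        return {"host_to_target": 1, "target_to_target": num_mirrors - 1}
    raise ValueError(f"unknown strategy: {strategy}")


print(mirror_write_transfers(3, "all_from_host"))
print(mirror_write_transfers(3, "via_first_target"))
```

In both cases the total number of transfers is the same and none of them passes through the controller; the options differ only in which link carries them.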

Degraded Mode
When a target fails, the system moves to degraded mode. Failure identification is similar to the Baseline system.
Execution of READ commands is similar to the execution of WRITE commands in normal mode, so the performance improvement achieved for WRITE commands in normal mode is also achieved for READ commands in degraded mode.

Required Protocol Changes
Host: minor change. The host must accept InfiniBand connection requests.
RAID controller and targets:
SCSI: additional commands. No SCSI hardware changes are required.
iSCSI: small changes in the login/logout process. An extra field is added to the iSCSI Command PDU.
iSER: added and modified iSER primitives.
InfiniBand: no changes were made. However, allowing scatter-gather of remote memory handles could have improved performance.


Test Setup
Hardware: nodes (all types) with Intel dual-Xeon 3.2GHz processors and memory disks; Mellanox MHEA28-1T (10Gb/s) InfiniBand HCA; Mellanox MTS2400 InfiniBand switch.
Software: Linux SuSE 9.1 Professional (2.6.4-52 kernel); Voltaire InfiniBand host stack; Voltaire iSER initiator and target.

System Configurations
Baseline system: host, in-band RAID controller, 5 targets.
TPT-RAID system: TPT RAID controller, 5 TPT targets.
Both systems use iSER (RDMA) over InfiniBand.

Scalability
TPT-RAID adds almost no work relative to the Baseline: no extra disk (media) operations and no extra XOR operations.
It does add communication to the targets: more commands and more data transfers. The extra communication is divided among all targets.

Controller Scalability – RAID-5 (WRITE)
Unlimited number of hosts. Unlimited number of targets.
Req. size Block size Max. hosts Baseline TPT 1MB 32KB 1 (75%) 1 64KB 1 (72%) 2 8MB 1 (78%) 4
InfiniBand BW is not a limiting factor (multiple hosts and targets).

Max. Thpt. with One Host – RAID-5 (WRITE)
Single host. Single target set.
Even when only a single host is used, the Baseline controller is the bottleneck!

Controller Scalability – RDP (WRITE)
Unlimited number of hosts. Unlimited number of targets.
Req. size Block size Max. hosts Baseline TPT 1MB 32KB 1 (33%) 1 (70%) 64KB 1 (80%) 8MB 2 3

Controller Scalability – Mirroring (WRITE)
Unlimited number of hosts. Unlimited number of targets.
Req. size (Blk=32KB) Max. hosts Baseline TPT 256KB 1 (50%) 9 512KB 18 1MB 36 8MB 293

Max. Thpt. with One Host - Mirroring (WRITE)
Single host. Single target set.
Even when a single host is used, the Baseline controller is a bottleneck. For TPT, the bottleneck is the host or the targets.

Degraded Mode
The performance of READ commands in degraded mode is the same as the performance of WRITE commands in normal mode.

Summary
Multi-box RAID: improved availability and low cost.
Using a single controller retains simplicity.
The single-box DMA engine is replaced by RDMA.
Adding 3rd Party Transfer and Distributed Parity Calculation allows scalability: the controller can manage a larger system with more activity, and for a given workload a larger maximum throughput leads to shorter waiting times and therefore lower latency.
Cost reduction is taken another step while retaining performance and simplicity.

InfiniBand support
InfiniBand currently allows scattering/gathering of memory. Memory registration returns a memory handle.
Scattering/gathering of memory handles would improve the performance of the TPT-RAID dramatically.