TPT-RAID: A High Performance Multi-Box Storage System


1 TPT-RAID: A High Performance Multi-Box Storage System
Erez Zilber and Yitzhak Birk, Technion

2 Agenda
Introduction
Improving Communication Efficiency
Relieving the Controller Bottleneck
Performance

3 Basic Terminology
SCSI (Small Computer System Interface):
Standard protocol between computers and peripheral devices (mainly storage devices).
Developed by the T10 technical committee (ANSI/INCITS).
Uses a client-server model.
iSCSI (Internet SCSI):
Mapping of SCSI over TCP.
The iSCSI client (e.g., a host computer) is called the ‘initiator’.
The iSCSI server (e.g., a disk box) is called the ‘target’.

4 Basic Terminology (cont.)
RAID (Redundant Array of Inexpensive Disks):
Combines multiple drives so that data is striped and/or replicated across them for performance and fault tolerance.
Specifies a number of prototype "RAID levels":
RAID-1: An exact copy (mirror) of the data on two or more disks.
RAID-4: Striping with a dedicated parity disk.
RAID-5: Similar to RAID-4, with the parity data distributed across all member disks.
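For RAID-4/5, the parity block is the bitwise XOR of the data blocks in its parity group, which is what allows any single lost block to be rebuilt from the survivors. A minimal illustrative sketch (not from the original slides):

```python
from functools import reduce

def xor_blocks(blocks):
    """Bitwise XOR of equal-sized blocks - the RAID-4/5 parity operation."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

data = [b"\x01\x02", b"\x0f\x00", b"\xf0\xff"]   # three data blocks of one parity group
parity = xor_blocks(data)                        # the parity block

# Any single lost block can be rebuilt by XORing the parity with the survivors.
assert xor_blocks([parity, data[1], data[2]]) == data[0]
```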

5 RAID - Examples
(Figure: example layouts for RAID-1 (mirroring), RAID-4 and RAID-5, showing where a new data block and the corresponding new parity block are written in RAID-5.)

6 Storage Trends
Originally: direct-attached storage that belongs to its computer.
1990s: "mainframe" storage servers.
2000: separation of control from the actual storage boxes:
Control: network-attached storage (NAS) with a file interface, or storage area networks (SAN) with a block interface.
Storage boxes: RAID of some type.
In almost all of these, an entire RAID group is within a single box.

7 The Problem
Storage devices are becoming cheaper. However, highly available single-box storage systems are still expensive, and even such systems are susceptible to failures that affect the entire box.
(Figure: single-box storage system - a RAID controller and its disks in one box.)

8 Multi-Box RAID
A single, fault-tolerant controller is connected to multiple storage boxes (targets).
Any given parity group utilizes at most one disk drive from any given box.
The controller and the disks reside in separate machines.
iSCSI may be used to send SCSI commands and data.
(Figure: multi-box storage system - the controller and several target boxes connected over a network.)
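The placement constraint above (at most one drive per box in any parity group) is what guarantees that losing an entire box costs each parity group at most one member, so the loss can be repaired like a single-disk failure. A minimal sketch of one such placement (the layout and names are illustrative assumptions, not the paper's algorithm):

```python
def build_parity_groups(num_boxes: int, disks_per_box: int):
    """One simple placement that satisfies the rule: parity group g takes
    disk g from every box, so no group ever has two disks in the same box."""
    return [[(box, g) for box in range(num_boxes)]      # (box id, disk id) pairs
            for g in range(disks_per_box)]

# 5 boxes with 4 disks each -> 4 parity groups, each spanning all 5 boxes.
for group in build_parity_groups(num_boxes=5, disks_per_box=4):
    assert len({box for box, _ in group}) == len(group)   # at most one disk per box
    print(group)
```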

9 Multi-Box RAID (cont.)
Advantages:
There is no single point of storage-box failure.
Highly available, expensive storage boxes are no longer needed.
Disadvantages:
Transferring data over a network is not as efficient as using the DMA engine in a single-box RAID system; merely using storage protocols (e.g., iSCSI) over conventional network infrastructure is not enough.
Bottleneck in the controller → poor scalability.
Preserving the storage boxes' capacity (cost effectiveness) may be problematic for the controller.
(Figure: multi-box storage system.)

10 Agenda
Introduction
Improving Communication Efficiency
Relieving the Controller Bottleneck
Performance

11 InfiniBand
InfiniBand defines a high-speed network for interconnecting processing nodes and I/O nodes (>10 Gbit/s end to end).
InfiniBand supports RDMA (Remote DMA):
High speed
Low latency
Very lean, with no CPU involvement on the remote side

12 iSCSI Extensions for RDMA (iSER)
iSER is an IETF standard that maps iSCSI onto a network that provides RDMA services.
Data is transferred directly into SCSI I/O buffers without intermediate data copies.
It splits control and data:
RDMA is used for data transfer.
The sending of control messages is left unchanged.
The same physical path may be used for both.
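To make the control/data split concrete, the toy sketch below simulates an iSER-style READ: the initiator's command carries a reference to its registered buffer, and the target places the data there with an RDMA write before returning the SCSI response. All class and field names are hypothetical; this is not the actual iSER wire format.

```python
from dataclasses import dataclass

@dataclass
class ReadCommand:           # stand-in for an iSCSI command PDU (control path)
    lba: int
    length: int
    remote_addr: int         # where the target should RDMA-write the data
    rkey: int                # remote key of the initiator's registered buffer

class ToyTarget:
    """Toy iSER-style target serving reads from an in-memory 'disk'."""
    def __init__(self, disk: bytearray):
        self.disk = disk

    def handle_read(self, cmd: ReadCommand, rdma_write):
        data = self.disk[cmd.lba:cmd.lba + cmd.length]
        rdma_write(cmd.remote_addr, cmd.rkey, data)   # data path: RDMA write into the initiator's buffer
        return "GOOD"                                 # control path: SCSI response message

# The 'RDMA write' is modeled as direct placement into the initiator's buffer.
initiator_buffer = bytearray(8)
def rdma_write(addr, rkey, data):
    initiator_buffer[addr:addr + len(data)] = data

target = ToyTarget(disk=bytearray(range(64)))
status = target.handle_read(ReadCommand(lba=16, length=8, remote_addr=0, rkey=0x1234), rdma_write)
assert status == "GOOD" and bytes(initiator_buffer) == bytes(range(16, 24))
```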

13 iSCSI over iSER: Read Requests
(Figure: read-request message flow, contrasting TCP packet transfers with RDMA transfers.)

14 iSCSI over iSER: Write Requests
(Figure: write-request message flow, contrasting TCP packet transfers with RDMA transfers.)

15 iSER + Multi-Box RAID
iSCSI over iSER solves the problem of inefficient data transfer.
However, the separation of control and data is really only a protocol separation over the same physical path.
The scalability problem therefore remains:
All data still passes through the controller.
When using RAID-4/5, the controller also has to perform the parity calculations.

16 Agenda
Introduction
Improving Communication Efficiency
Relieving the Controller Bottleneck
Performance

17 Removing the Controller from the Data Path – 3rd Party Transfer
3rd Party Transfer: one iSCSI entity instructs a second iSCSI entity to read data from, or write data to, a third iSCSI entity.
Data is transferred directly between hosts and targets under controller command:
Lower zero-load latency, especially for large requests (one hop instead of two).
The controller's memory, busses and InfiniBand link do not become a bottleneck.
Out-of-band controllers already exist, but:
RDMA makes out-of-band data transfers more transparent.
We carry the idea into the RAID.

18 RDMA and Out-of-Band Controller
3rd Party Transfer is more transparent when combined with RDMA:
Transparent from the host's point of view.
Almost transparent from the target's point of view.
Adding 3rd Party Transfer to iSCSI over iSER is essential for removing the controller from the data path.

19 Distributed Parity Calculation
The controller is not in the data path → it cannot compute the parity itself, so the parity is computed by the targets in a distributed fashion.
Side benefit: this relieves another possible controller bottleneck.

20 Distributed Parity Calculation – a Binary Tree
(Figure: a binary XOR tree - data blocks 0 through N-2 are XORed in pairs, the temporary results are XORed level by level, and the final XOR yields the new parity block.)
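A minimal sketch of that pairwise reduction (plain single-process code for illustration; in TPT-RAID each XOR node would run in the target that received its sibling's block):

```python
from functools import reduce

def xor_blocks(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def tree_parity(blocks):
    """XOR the blocks pairwise, level by level, as in the binary tree above."""
    level = list(blocks)
    while len(level) > 1:
        nxt = [xor_blocks(level[i], level[i + 1]) for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:            # an odd block out is carried up unchanged
            nxt.append(level[-1])
        level = nxt
    return level[0]

data_blocks = [bytes([i]) * 4 for i in range(6)]                      # data blocks 0..N-2
assert tree_parity(data_blocks) == reduce(xor_blocks, data_blocks)    # same result as a flat XOR
```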

21 Example: 3rd Party Transfer and Distributed Parity Calculation
1. The host sends a command to the RAID controller.
2. The RAID controller sends commands to the targets.
3. The targets perform RDMA operations directly to/from the host.
4. For WRITE requests only, the RAID controller sends commands to recalculate the parity block; the targets compute the new parity block using target-to-target data transfers, with the XOR operation performed in the receiving target.
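The sketch below simulates that sequence for a full-stripe WRITE and checks that no payload data ever crosses the controller. It follows the four steps above, but every function and message name is an illustrative assumption, not the paper's protocol.

```python
def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def tpt_write(host_blocks, targets, parity_idx):
    """Toy simulation of steps 1-4 for a full-stripe WRITE. Only control
    messages (size 0) touch the controller; data moves host->target and
    target->target."""
    log = [("CMD", "host", "controller", 0)]                           # step 1
    n = len(host_blocks)
    for i in range(n):
        log.append(("CMD", "controller", f"target{i}", 0))             # step 2
        targets[i] = host_blocks[i]                                     # step 3: RDMA read from the host
        log.append(("RDMA", "host", f"target{i}", len(host_blocks[i])))
    # Step 4: parity recomputed by the targets themselves (a simple chain here;
    # the previous slide uses a binary XOR tree).
    parity = host_blocks[0]
    for i in range(1, n):
        log.append(("RDMA", f"target{i-1}", f"target{i}", len(parity)))
        parity = xor(parity, host_blocks[i])
    log.append(("RDMA", f"target{n-1}", f"target{parity_idx}", len(parity)))
    targets[parity_idx] = parity
    log.append(("RSP", "controller", "host", 0))
    # No payload data ever flows through the controller.
    assert all(size == 0 for _, src, dst, size in log if "controller" in (src, dst))
    return log

targets = [b""] * 4
for entry in tpt_write([b"\x01" * 4, b"\x02" * 4, b"\x04" * 4], targets, parity_idx=3):
    print(entry)
```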

22 Compared Systems
Baseline RAID: hosts, an in-band controller (the Baseline controller) and targets, all using iSCSI over iSER.
TPT-RAID: hosts, an out-of-band controller (the TPT controller) and TPT targets, adding 3rd Party Transfer and Distributed Parity Calculation.

23 Amount of Transferred Data (READ)
Baseline system:
Controller: read from the target: 1; write to the host: 1. Total: 2 blocks.
Targets: write to the controller: 1. Total: 1 block.
TPT system:
Controller: no data transfers. Total: 0 blocks.
Targets: write to the host: 1. Total: 1 block.

24 Amount of Transferred Data (WRITE)
Baseline system:
Controller: read new data from the host: 1; read old data from the targets: 2; write new data and parity to the targets: 2. Total: 5 blocks.
Targets: write old data to the controller: 2; read new data and parity from the controller: 2. Total: 4 blocks.
TPT system:
Controller: no data transfers. Total: 0 blocks.
Targets: read new data from the host: 1; parity calculation transfers between targets: 1. Total: 2 blocks.
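These per-block counts can be folded into a trivial back-of-the-envelope model of controller load; the sketch below just encodes the numbers from the two slides above (ignoring command traffic and full-stripe write optimizations):

```python
# Blocks moved through the controller per data block, taken from the READ and
# WRITE slides above (RAID-5, single-block update).
CONTROLLER_BLOCKS = {
    ("baseline", "read"): 2,    # read from target + write to host
    ("baseline", "write"): 5,   # read from host + read old data + write new data and parity
    ("tpt", "read"): 0,         # data bypasses the controller entirely
    ("tpt", "write"): 0,
}

def controller_bytes(system: str, op: str, request_bytes: int) -> int:
    """Bytes crossing the controller for one request of the given size."""
    return CONTROLLER_BLOCKS[(system, op)] * request_bytes

# A 1 MB write pushes 5 MB through the Baseline controller and nothing through the TPT one.
assert controller_bytes("baseline", "write", 1 << 20) == 5 * (1 << 20)
assert controller_bytes("tpt", "write", 1 << 20) == 0
```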

25 RDP with 3rd Party Transfer
Row-Diagonal Parity (RDP) is an extension to RAID-5:
Calculates two sets of parity information: row parity and diagonal parity.
Can tolerate two failures.
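For intuition, the sketch below encodes one stripe the way RDP is usually described (a prime p, p-1 data disks, a row-parity disk and a diagonal-parity disk); this is an illustrative reconstruction of the published RDP scheme, not code or notation from the presentation.

```python
def xor_blocks(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def rdp_encode(data, p, block_size):
    """Encode one RDP stripe. data[r][c] is the block in row r (0..p-2) on
    data disk c (0..p-2); returns (row_parity_disk, diag_parity_disk)."""
    zero = bytes(block_size)
    # Row parity: XOR of each row across the data disks (as in RAID-4/5).
    row_parity = []
    for r in range(p - 1):
        acc = zero
        for c in range(p - 1):
            acc = xor_blocks(acc, data[r][c])
        row_parity.append(acc)
    # Diagonal parity: block (r, c) lies on diagonal (r + c) mod p, where the
    # row-parity disk counts as column p-1; the 'missing' diagonal p-1 is not stored.
    diag_parity = [zero] * (p - 1)
    for r in range(p - 1):
        for c in range(p):
            block = data[r][c] if c < p - 1 else row_parity[r]
            d = (r + c) % p
            if d != p - 1:
                diag_parity[d] = xor_blocks(diag_parity[d], block)
    return row_parity, diag_parity

# Demo: p = 5 -> 4 data disks, 4 rows per stripe, 2-byte blocks.
p, bs = 5, 2
data = [[bytes([16 * r + c]) * bs for c in range(p - 1)] for r in range(p - 1)]
row_p, diag_p = rdp_encode(data, p, bs)
print(len(row_p), len(diag_p))   # 4 parity blocks on each of the two parity disks
```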

26 RDP with 3rd Party Transfer (cont.)
READ commands: similar to RAID-5.
WRITE commands: more parity calculations are required.
3rd Party Transfer and Distributed Parity Calculation relieve the RAID controller bottleneck.

27 Mirroring with 3rd Party Transfer
READ commands: similar to RAID-5.
WRITE commands may be executed in one of the following two ways:
All targets read the new data directly from the host.
A single target reads the new data directly from the host and then transfers it to the other targets.
Using 3rd Party Transfer for mirroring relieves the RAID controller bottleneck.

28 Degraded Mode
When a target fails, the system moves to degraded mode.
Failure identification is similar to the Baseline system.
Execution of READ commands is similar to the execution of WRITE commands in normal mode.
The same performance improvement that is achieved for WRITE commands in normal mode is therefore achieved for READ commands in degraded mode.

29 Required Protocol Changes
Host: minor change only - the host must accept InfiniBand connection requests.
RAID controller and targets:
SCSI: additional commands; no SCSI hardware changes are required.
iSCSI: small changes in the login/logout process; an extra field added to the iSCSI Command PDU.
iSER: added and modified iSER primitives.
InfiniBand: no changes were made; however, allowing scatter-gather of remote memory handles could have improved performance.

30 Agenda
Introduction
Improving Communication Efficiency
Relieving the Controller Bottleneck
Performance

31 Test Setup
Hardware:
Nodes (all types): Intel dual-Xeon 3.2 GHz
Memory disks
Mellanox MHEA28-1T (10 Gb/s) InfiniBand HCA
Mellanox MTS2400 InfiniBand switch
Software:
Linux SuSE 9.1 Professional ( kernel)
Voltaire InfiniBand host stack
Voltaire iSER initiator and target

32 System Configurations
Baseline system: host, in-band RAID controller, 5 targets.
TPT-RAID system: host, TPT RAID controller, 5 TPT targets.
Both systems use iSER (RDMA) over InfiniBand.

33 Scalability
TPT-RAID adds (almost) no work relative to the Baseline:
No extra disk (media) operations.
No extra XOR operations.
It does add communication at the targets: more commands and more data transfers.
However, the extra communication is divided among all targets.

34 Controller Scalability – RAID-5 (WRITE)
Assumptions: unlimited number of hosts and targets.
Max. hosts (Baseline / TPT):
1MB requests, 32KB blocks: Baseline 1 (75%) / TPT 1
64KB blocks: Baseline 1 (72%) / TPT 2
8MB requests: Baseline 1 (78%) / TPT 4
InfiniBand BW is not a limiting factor (multiple hosts and targets).

35 Max. Thpt. with One Host – RAID-5 (WRITE)
Setup: single host, single target set.
Even when only a single host is used, the Baseline controller is the bottleneck!

36 Controller Scalability – RDP (WRITE)
Assumptions: unlimited number of hosts and targets.
Max. hosts (Baseline / TPT):
1MB requests, 32KB blocks: Baseline 1 (33%) / TPT 1 (70%)
64KB blocks: 1 (80%)
8MB requests: Baseline 2 / TPT 3

37 Controller Scalability – Mirroring (WRITE)
Assumptions: unlimited number of hosts and targets.
Max. hosts (Baseline / TPT), 32KB blocks:
256KB requests: Baseline 1 (50%) / TPT 9
512KB requests: TPT 18
1MB requests: TPT 36
8MB requests: TPT 293

38 Max. Thpt. with One Host - Mirroring (WRITE)
Setup: single host, single target set.
Even when a single host is used, the Baseline controller is a bottleneck.
For TPT, the bottleneck is the host or the targets.

39 Degraded Mode
Degraded-mode READ performance is the same as the performance of WRITE commands in normal mode.

40 Summary
Multi-box RAID: improved availability and low cost.
Using a single controller retains simplicity.
The single-box DMA engine is replaced by RDMA.
Adding 3rd Party Transfer and Distributed Parity Calculation provides scalability:
The same controller can manage a larger system with more activity.
For a given workload: larger max. throughput → shorter waiting times → lower latency.
Cost reduction is taken another step while retaining performance and simplicity.

41 InfiniBand Support
InfiniBand currently allows scattering/gathering of memory regions.
Memory registration returns a memory handle.
Support for scattering/gathering of memory handles would dramatically improve the performance of TPT-RAID.

