Modularized Redundant Parallel Virtual System


1 Modularized Redundant Parallel Virtual System
Sheng-Kai Hung, HPCC Lab

2 Parallel Virtual File System Overview
Developed by Clemson University
Uses RAID-0-like striping to distribute files
Claims high read/write performance
Based on a TCP/IP client-server model
Centralized metadata server
POSIX- and MPI-IO-compliant
No fault-tolerance mechanism provided
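
To make the RAID-0-style striping concrete, here is a minimal C sketch of mapping a logical file offset to a server and a local offset; the stripe size, the server count, and the round-robin layout are illustrative assumptions, not PVFS's actual defaults or code.

```c
#include <stdint.h>
#include <stdio.h>

#define STRIPE_SIZE (64 * 1024)  /* assumed stripe size           */
#define NSERVERS    16           /* assumed number of I/O servers */

/* Map a logical file offset to (server index, offset within that
 * server's local file) under round-robin RAID-0-style striping. */
static void map_offset(uint64_t file_off, int *server, uint64_t *local_off)
{
    uint64_t stripe = file_off / STRIPE_SIZE;       /* global stripe index */
    *server    = (int)(stripe % NSERVERS);          /* round-robin server  */
    *local_off = (stripe / NSERVERS) * STRIPE_SIZE  /* full local stripes  */
               + file_off % STRIPE_SIZE;            /* plus the remainder  */
}

int main(void)
{
    int s;
    uint64_t off;
    map_offset(200 * 1024, &s, &off);           /* offset 200 KB ...      */
    printf("server %d, local offset %llu\n",    /* ... lands on server 3, */
           s, (unsigned long long)off);         /* local offset 8192      */
    return 0;
}
```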

3 Our Previous Design
Parity information is stored at the metadata server
A single point of failure
Read/write performance
Uses "delayed write" to reduce the parity overhead
A buffer stores the differences of the blocks being written (see the sketch below)
Reading the corresponding blocks is also needed
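
The "delayed write" idea described above can be sketched as follows in C; the buffering granularity and the function names are assumptions for illustration only.

```c
#include <stddef.h>
#include <stdint.h>

/* "Delayed write": instead of updating the parity on every write,
 * buffer the XOR difference between the old and the new block, then
 * fold the buffered differences into the parity later in one pass.
 * Reading the old block is still required, which is the extra read
 * overhead mentioned above. */
static void buffer_diff(uint8_t *diff, const uint8_t *old_data,
                        const uint8_t *new_data, size_t len)
{
    for (size_t i = 0; i < len; i++)
        diff[i] = old_data[i] ^ new_data[i];
}

/* Later, fold a buffered difference into the parity block. */
static void apply_diff_to_parity(uint8_t *parity, const uint8_t *diff,
                                 size_t len)
{
    for (size_t i = 0; i < len; i++)
        parity[i] ^= diff[i];
}
```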

4 MTTF Formula
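
The formula on this slide did not survive as text. A plausible reconstruction, assuming the standard single-fault-tolerant array model of Patterson, Gibson, and Katz, with N total nodes and parity groups of size G:

```latex
% No redundancy: the system fails when any of the N nodes fails.
% With parity: a group of size G loses data only if a second node
% fails while the first is still being repaired (within MTTR).
\[
  \mathrm{MTTF}_{\mathrm{no\;redundancy}} = \frac{\mathrm{MTTF}_{\mathrm{node}}}{N},
  \qquad
  \mathrm{MTTF}_{\mathrm{parity}} \approx
    \frac{\mathrm{MTTF}_{\mathrm{node}}^{\,2}}{N\,(G-1)\,\mathrm{MTTR}}
\]
```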

5 Examples of MTTF
Assumptions:
MTTF_D is no less than 100,000 hours (around 10 years)
MTTF_S is 10,000 hours (around 1 year)
MTTR is usually shorter than 4 hours
The number of nodes is 16

Scheme     MTTF (hours)   Group Size
PVFS       528            -
PVFSraid   624            1
RPVFS      86088

6 MTTF Result

7 Overhead of Using Parity
Read: not involved in the process of parity construction
Read-modify-write: some blocks in a stripe are dirtied; needs 2 reads and 2 writes (see the sketch below)
Write: the whole striping unit is overwritten; 1 read and 2 writes
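
The read-modify-write case follows the classic RAID small-write rule, sketched here in C (the generic technique, not this system's code): the new parity is the old parity XORed with the old and the new data, which is exactly why 2 reads (old data, old parity) and 2 writes (new data, new parity) are needed.

```c
#include <stddef.h>
#include <stdint.h>

/* Classic RAID small-write parity update:
 *   new_parity = old_parity XOR old_data XOR new_data
 * Costs 2 reads (old data, old parity) and 2 writes
 * (new data, new parity). */
static void rmw_parity_update(uint8_t *parity, const uint8_t *old_data,
                              const uint8_t *new_data, size_t len)
{
    for (size_t i = 0; i < len; i++)
        parity[i] ^= old_data[i] ^ new_data[i];
}
```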

8 System Architecture

9 Parity Cache Table (1/3)
A pinned-down memory region within the metadata node
4K entries; each entry contains N data blocks plus an inode-number tag and a reference count (see the sketch below)
Can aggregate written blocks to reduce the number of parity writes
We delay the writing and generation of parity blocks
Several nearby blocks can be combined into a single write
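
A hedged sketch of what one table entry might look like in C; the field names, the block size, and the ready-mask are illustrative assumptions. Only the 4K entry count, the N data blocks, the inode tag, and the reference count come from the slide.

```c
#include <stdint.h>

#define BLOCK_SIZE 4096   /* assumed block size              */
#define N          4      /* assumed blocks per parity group */
#define TABLE_SIZE 4096   /* "4K entries", per the slide     */

/* One parity cache table entry: the N data blocks that feed one
 * parity block, tagged with the owning file's inode number and
 * guarded by a reference count.  In the real system this region
 * would be pinned (e.g., mlock'd) in the metadata node's memory. */
struct parity_cache_entry {
    uint32_t inode;                 /* inode-number tag                   */
    uint32_t refcount;              /* outstanding users of this entry    */
    uint32_t ready_mask;            /* bit i set when block i has arrived */
    uint8_t  blocks[N][BLOCK_SIZE]; /* buffered data blocks               */
};

static struct parity_cache_entry parity_cache[TABLE_SIZE];
```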

10 Parity Cache Table (2/3)

11 Parity Cache Table (3/3)
When to write back the cache? (sketched below)
Replacement: choose the bigger {N,N}
Ready: when all the blocks needed to compute a parity block are ready
Flush: a routine like bdflush runs every 30 secs
Potential data loss? On average 15 secs (half the flush period)
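
The write-back triggers can be sketched as predicates over a cache entry (C, illustrative only; the 30-second period is from the slide, the field names are assumed):

```c
#include <stdbool.h>
#include <stdint.h>

#define N 4  /* assumed blocks per parity group */

struct entry {
    uint32_t ready_mask;    /* bit i set when block i has arrived */
    uint64_t last_touched;  /* seconds, for the periodic flush    */
};

/* "Ready": all N blocks feeding the parity are present, so the parity
 * can be computed and written exactly once. */
static bool write_back_ready(const struct entry *e)
{
    return e->ready_mask == (1u << N) - 1;
}

/* "Flush": a bdflush-like daemon forces aged entries out every 30 s,
 * so unflushed data is at risk for 15 s on average. */
static bool flush_due(const struct entry *e, uint64_t now)
{
    return now - e->last_touched >= 30;
}
```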

12 Write Performance

13 Read-Modify-Write Performance

14 Mirrored Parity Scheme (1/3)
RAID-1
Cannot tolerate two faults in the same mirrored group
Across different groups, 3 faults can be tolerated
Disk overhead is 100%
RAID-4 (RAID-5)
Can only tolerate a single fault
Disk overhead is always less than 33.3%
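
The overhead figures follow from how much redundant storage each scheme adds; assuming a parity group holds G - 1 data blocks plus one parity block:

```latex
% RAID-1 duplicates every block; RAID-4/5 pays one parity
% block per group of G - 1 data blocks.
\[
  \mathrm{overhead}_{\mathrm{RAID\text{-}1}} = 100\%,
  \qquad
  \mathrm{overhead}_{\mathrm{RAID\text{-}4/5}} = \frac{1}{G-1}
  \;\le\; 33.3\% \;\;\text{for}\; G \ge 4
\]
```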

15 Mirrored Parity Scheme (2/3)
Can tolerate faults that occur in the same mirrored group
D1 and P12 fault; D0 and P01 fault
Can tolerate at most 3 faults, except one case
D1, P12, and D0 all fault
The concept of grouping disappears
Uses the same disk overhead as RAID-1

16 Mirrored Parity Scheme (3/3)
Pros
MTTF is higher
Can tolerate more simultaneous faults than RAID-1, with the same disk overhead
Cons
Needs at most N XOR operations to recover the corrupted data, where N is the number of nodes involved in a parity group
XOR is a cheap operation, but reading 3 blocks may be a problem (see the recovery sketch below)
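
The recovery step is the generic RAID reconstruction rule, sketched in C below (not this system's code): the lost block is the XOR of all surviving blocks in its parity group, so the XOR work is cheap but every surviving block must be read first.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Reconstruct one lost block by XOR-ing the surviving members of its
 * parity group (the other data blocks plus the parity block).  The
 * XOR work is trivial; the real cost is reading `nsurvivors` blocks
 * back from the other nodes. */
static void recover_block(uint8_t *out, const uint8_t *const *survivors,
                          size_t nsurvivors, size_t block_size)
{
    memset(out, 0, block_size);
    for (size_t i = 0; i < nsurvivors; i++)
        for (size_t j = 0; j < block_size; j++)
            out[j] ^= survivors[i][j];
}
```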

17 Separate Metadata Cache
Accessing metadata is a serialized process
Only a single metadata server with one disk
Separate the metadata cache from the real data cache
Either on clients or on servers
If on clients, a cache hit saves a socket connection
Distributed metadata
Handling the parity cache table
Parity information must also be distributed
Block-based parity needs to be modified
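
One common way to realize distributed metadata (a sketch of the general technique only; the talk leaves the actual scheme open) is to hash the inode number across metadata servers, which would also shard the parity cache table:

```c
#include <stdint.h>

/* Spread metadata, and with it a shard of the parity cache table,
 * across several metadata servers by hashing the inode number.
 * The mixing constants are the splitmix64 finalizer, used here to
 * avoid clustering of sequential inode numbers. */
static int metadata_server_for(uint64_t inode, int nservers)
{
    inode ^= inode >> 30; inode *= 0xbf58476d1ce4e5b9ULL;
    inode ^= inode >> 27; inode *= 0x94d049bb133111ebULL;
    inode ^= inode >> 31;
    return (int)(inode % (uint64_t)nservers);
}
```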

