1
GPFS Parallel File System
Yuri Volobuev, GPFS development, Austin, TX
Presented by: Luigi Brochard, Lead Deep Computing Architect
2
GPFS: parallel file access
Parallel cluster file system based on the shared disk model:
Cluster – fabric-interconnected nodes (IP, SAN, …)
Shared disk – all data and metadata on fabric-attached disk
Parallel – data and metadata flow from all of the nodes to all of the disks in parallel under control of a distributed lock manager (see the sketch below)
[Diagram: GPFS file system nodes – switching fabric (system or storage area network) – shared disks (SAN-attached or network block device)]
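As a rough illustration of what the distributed lock manager has to decide when several nodes access the same file on shared disks, here is a minimal C sketch of a byte-range lock compatibility check: two locks conflict only if their ranges overlap and at least one of them is a write. This is purely conceptual; the struct, function, and example ranges are invented for the sketch and are not GPFS's actual token or locking interfaces.

```c
/*
 * Conceptual sketch of byte-range lock compatibility, as a distributed lock
 * manager might decide it. Not GPFS's actual token/lock protocol.
 */
#include <stdbool.h>
#include <stdio.h>

typedef struct {
    long start;     /* first byte covered by the lock            */
    long length;    /* number of bytes covered                   */
    bool write;     /* true = exclusive (write), false = shared  */
} range_lock;

/* Can a node be granted 'want' while another node holds 'held'? */
static bool compatible(range_lock held, range_lock want)
{
    bool overlap = want.start < held.start + held.length &&
                   held.start < want.start + want.length;
    return !overlap || (!held.write && !want.write);
}

int main(void)
{
    range_lock node1 = { 0,       1048576, true  };  /* node 1 writes bytes 0..1 MiB      */
    range_lock node2 = { 1048576, 1048576, true  };  /* node 2 writes the next 1 MiB      */
    range_lock node3 = { 524288,  4096,    false };  /* node 3 reads inside node 1's range */

    printf("node2 vs node1: %s\n", compatible(node1, node2) ? "granted" : "must wait");
    printf("node3 vs node1: %s\n", compatible(node1, node3) ? "granted" : "must wait");
    return 0;
}
```

Writers touching disjoint ranges proceed in parallel; only genuinely conflicting accesses serialize, which is what lets many nodes write one file at once.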
3
What GPFS is NOT
Not a client-server file system like NFS, DFS, or AFS: no single-server bottleneck, no protocol overhead for data transfer
Not like some SAN file systems: no distinct metadata server (which is a potential bottleneck), no requirement for a full SAN (all nodes see all LUNs)
4
Simulated SAN: Virtual Shared Disk
Still difficult to scale a SAN:
Expensive switches
Limits on the number of devices are below GPFS scaling limits
SAN duplicates the existing fabric (Gig E, Federation, Myrinet, …)
Solution: simulate a SAN over the existing switching fabric (see the sketch below)
VSD (Virtual Shared Disk) on AIX: the client is a block device driver, the server is an interrupt-level storage router; highly AIX kernel dependent
NSD (Network Shared Disk), part of GPFS, simulates a SAN on Linux
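To make the "simulate a SAN over the existing fabric" idea concrete, here is a toy C sketch in the spirit of VSD/NSD: a node with a locally attached disk serves fixed-size blocks to remote nodes over ordinary TCP/IP. The port number, one-word request format, and block size are invented for the example; this is not the real VSD/NSD protocol or implementation.

```c
/*
 * Toy block server: serve fixed-size blocks of a local disk/file over TCP,
 * so remote nodes get block-level access without a physical SAN connection.
 * Illustrative only; protocol and constants are made up for this sketch.
 */
#include <arpa/inet.h>
#include <fcntl.h>
#include <netinet/in.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/socket.h>
#include <unistd.h>

#define BLOCK_SIZE (256 * 1024)  /* large block, as GPFS typically uses */
#define PORT       5555          /* arbitrary port for this sketch      */

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <local-disk-or-file>\n", argv[0]);
        return 1;
    }
    int disk = open(argv[1], O_RDONLY);       /* the locally attached "disk" */
    if (disk < 0) { perror("open"); return 1; }

    int srv = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr = { 0 };
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(PORT);
    if (bind(srv, (struct sockaddr *)&addr, sizeof addr) < 0) { perror("bind"); return 1; }
    listen(srv, 8);

    char *buf = malloc(BLOCK_SIZE);
    for (;;) {
        int c = accept(srv, NULL, NULL);      /* a remote node connects       */
        if (c < 0) continue;

        uint32_t req;                         /* request = one block number   */
        while (read(c, &req, sizeof req) == sizeof req) {
            uint32_t blockno = ntohl(req);
            ssize_t n = pread(disk, buf, BLOCK_SIZE,
                              (off_t)blockno * BLOCK_SIZE);
            if (n <= 0) break;
            write(c, buf, (size_t)n);         /* reply = raw block contents   */
        }
        close(c);
    }
}
```

A matching client would send a block number and read back the block, which is all the file system layer above needs to treat the remote disk as if it were SAN-attached.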
5
Why is GPFS needed? Clustered applications impose new requirements on the file system:
Parallel applications need fine-grained access within a file from multiple nodes
Serial applications dynamically assigned to processors based on load need high-performance access to their data from wherever they run
Both require good availability of data and normal file system semantics
GPFS supports this via:
uniform access – single-system image across the cluster
conventional POSIX interface – no program modification (see the sketch after this list)
high capacity – multi-TB files, petabyte file systems
high throughput – wide striping, large blocks, many GB/sec to one file
parallel data and metadata access – shared disk and distributed locking
reliability and fault tolerance – tolerating node and disk failures
online system management – dynamic configuration and monitoring
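The "conventional POSIX interface, no program modification" point can be made concrete with a small sketch: each node (or process) writes its own disjoint byte range of one shared file using ordinary open() and pwrite(). The file name, rank argument, and chunk size are invented for the example; on GPFS, byte-range locking lets such writers proceed in parallel.

```c
/*
 * Each writer owns bytes [rank*CHUNK, (rank+1)*CHUNK) of one shared file and
 * writes them with plain POSIX calls. Arguments and sizes are illustrative.
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define CHUNK (1 << 20)   /* 1 MiB per writer, arbitrary for the sketch */

int main(int argc, char **argv)
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s <rank> <shared-file>\n", argv[0]);
        return 1;
    }
    int rank = atoi(argv[1]);

    /* Same pathname on every node: the cluster sees a single-system image. */
    int fd = open(argv[2], O_WRONLY | O_CREAT, 0644);
    if (fd < 0) { perror("open"); return 1; }

    char *buf = malloc(CHUNK);
    memset(buf, 'A' + (rank % 26), CHUNK);

    /* Disjoint offsets: writer N writes its own byte range of the shared file. */
    if (pwrite(fd, buf, CHUNK, (off_t)rank * CHUNK) != CHUNK) {
        perror("pwrite");
        return 1;
    }
    close(fd);
    return 0;
}
```

Running the same unmodified binary on several nodes with different rank arguments produces one file written in parallel, which is exactly the fine-grained multi-node access described above.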
6
GPFS Architecture
High capacity: large number of disks in a single FS
High-BW access to a single file: large block size, full-stride I/O to RAID; wide striping – one file over all disks; multiple nodes read/write in parallel
Fault tolerant – nodes: proven and robust VSD/NSD failover, log recovery restores consistency after a node failure; data and metadata: RAID and/or replication
On-line management (add/remove disks or nodes without un-mounting)
Single-system image, standard POSIX interface; distributed locking for read/write semantics
How does GPFS achieve this extreme scalability? This is GPFS in a nutshell …
· To make a big file system, GPFS allows combining a large number of disks (up to 4000) into a single file system. As in the previous example, each of these disks may actually be a RAID device composed of some number of individual disks.
· To achieve high bandwidth, data is striped across all disks in the file system. Block 0 of a file is stored on one disk, block 1 is stored on another disk, block 2 on the next disk, etc. To store or retrieve a large file, we can therefore start multiple I/Os in parallel. Each block is larger than in traditional file systems, typically 256k. This allows retrieving a large amount of data with a single I/O operation, which minimizes the effect of seek latency on disk throughput. Finally, byte-range locking allows multiple nodes to write to the same file concurrently, so write throughput to a single file can exceed the data rate a single node could sustain. (The sketch after these notes illustrates the block-to-disk mapping.)
· In a large system you cannot expect all components to work flawlessly all of the time. Therefore, GPFS was designed so that a failure of one node does not affect file system operations on other nodes. Each node logs all of its metadata updates, and all of the logs are stored on shared disk just like all other data and metadata. When a node fails, some other node runs log recovery from the failed node’s log. After that, files being updated by the failed node are known to be consistent again and can be accessed normally. We also have to worry about disk failures. Therefore, large systems typically use RAID instead of conventional disks, and GPFS will stripe its data across multiple RAID devices. In addition, GPFS provides optional replication in the file system, that is, GPFS will store copies of all data or metadata on two different disks. Furthermore, GPFS does not require taking down the file system to make configuration changes, such as adding, removing, or replacing disks in an existing file system, or adding nodes to an existing cluster.
· Finally, GPFS uses distributed locking to synchronize all file system operations between nodes. Thus processes running on different nodes in the cluster see the same file system behavior as if all processes were running on a single machine.
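The round-robin striping described in the notes (block 0 on one disk, block 1 on the next, and so on) comes down to simple arithmetic. A small C sketch of that mapping, using an assumed 256 KiB block size and an example count of 8 disks (illustrative values, not a statement of GPFS internals):

```c
/*
 * Striping arithmetic: file block b of a file striped round-robin over N
 * disks lands on disk (b % N), at byte offset (b / N) * BLOCK_SIZE there.
 */
#include <stdio.h>

#define BLOCK_SIZE (256 * 1024)   /* "typically 256k" per the notes */
#define NUM_DISKS  8              /* example disk count             */

/* Which disk holds file block b, and where on that disk it lives. */
static void locate_block(long b, int *disk, long *disk_offset)
{
    *disk = (int)(b % NUM_DISKS);                 /* round-robin striping    */
    *disk_offset = (b / NUM_DISKS) * BLOCK_SIZE;  /* byte offset on the disk */
}

int main(void)
{
    /* A 2 MiB read touches 8 consecutive blocks -> 8 disks can work in parallel. */
    for (long b = 0; b < 8; b++) {
        int disk;
        long off;
        locate_block(b, &disk, &off);
        printf("file block %ld -> disk %d, offset %ld\n", b, disk, off);
    }
    return 0;
}
```

Because consecutive file blocks land on different disks, a large sequential read or write keeps every disk busy at once, which is where the aggregate single-file bandwidth comes from.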
7
GPFS beyond the Cluster
Problem: in a large data center, multiple clusters and other nodes need to share data over the SAN
Solution: eliminate the notion of a fixed cluster
“Control” nodes for admin, managing locking, recovery, …
File access from “client” nodes
Client nodes authenticate to control nodes to mount a file system
Client nodes are trusted to enforce access control
Clients still directly access disk data and metadata
8
GPFS Value
Proven, robust failover characteristics
Mature AIX (pSeries) and Linux (x86, PowerPC64, Opteron, EM64T) products
AIX offering since mid-1998, Linux since mid-2001
Several hundred GPFS customer deployments of up to 2400 nodes
Applications in HPC, Life Sciences, Digital Media, Query/Decision Support, Finance, ...
Policy-based disk-to-tape hierarchical storage management via IBM TSM
Distributed/scalable metadata management and file I/O
Single-file I/O rates at GB/sec scale in current configurations
Multiple clusters of hundreds of TB today, a 2 PB install pending
Internal replication allows for active/active two-site setups
9
Current support
AIX
Linux: RHEL3/RHEL4 and SLES8/SLES9 on PowerPC64, Opteron/EM64T, IA32, and IA64 (not GA)
10
Future trends
Clusters growing larger
More nodes (Linux), larger SMP nodes (AIX)
More storage, larger files and file systems (multiple petabytes)
Storage fabrics growing larger (more ports)
Variety of storage fabric technologies: Fibre Channel, cluster switch, 1000bT, 10000bT, InfiniBand, …
Technology blurs the distinction among SAN, LAN, and WAN
“Why can’t I use whatever fabric I have for storage?”
“Why do I need separate fabrics for storage and communication?”
“Why can’t I access storage over long distances?”
SAN-wide file sharing: heterogeneous, mixed clusters; sharing among clusters, across distance
Life cycle management: archiving, Hierarchical Storage Management (HSM), general data management
11
Priorities
Improved performance/capacity
Improve file-serving performance, especially SAMBA for Windows support
Deliver on improved data management capabilities
Meet expectations for additional multi-clustering functions (split tokens, more security granularity, debug improvements)
Keep up with improvements in networking: industry-standard protocols on InfiniBand, …
Manageability and reporting in the context of storage
Linux currency
Additional platforms
12
Release schedule
GPFS 2.3 shipped in 12/04
GPFS 2.4 scheduled for 4/06
GPFS 2.5 scheduled for 4/07
GPFS 2.6 scheduled for 4/08
13
GPFS 2.3
Linux 2.6 kernel
Larger file system / disk address support
NFSv4 (AIX)
RSCT dependence removed (simplified setup)
WAN / multi-cluster (TeraGrid, DEISA)
Larger clusters
Performance monitoring tools
Disaster recovery capabilities
More disk types, more interconnects
More performance
14
GPFS 2.3 upgrades
InfiniBand support using IP for xSeries (4/05)
RHEL4 support for Intel, AMD and Power (8/05)
InfiniBand for pSeries (11/05 +)
NFS V3 performance work on AIX (8/05)
BGL (11/05)
Samba support work
HPSS as an archive system