1
GPFS: A Shared-Disk File System for Large Computing Clusters
Frank Schmuck & Roger Haskin, IBM Almaden Research Center
2
Introduction
Machines are getting more powerful, but we can always find bigger problems to solve
Faster networks let machines form clusters, which promise to solve big problems
GPFS (General Parallel File System):
Mimics the semantics of a POSIX file system running on a single machine
Runs on 6 of the top 10 supercomputers
3
Introduction
Web server workloads: multiple nodes access multiple files
Supercomputer workloads: a single node can access a file stored across multiple nodes; multiple nodes can access the same file stored across multiple nodes
Need to access files and metadata in parallel
Need to perform administrative functions in parallel
4
GPFS Overview
Cluster nodes access shared disks through a switching fabric (e.g., a storage area network)
5
General Large File System Issues
Data striping and allocation, prefetch, and write-behind
Large directory support
Logging and recovery
6
Data Striping and Prefetch
Striping is implemented at the file system level: better control, fault tolerance, load balancing
GPFS recognizes sequential, reverse-sequential, and various strided access patterns, and prefetches data accordingly
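A minimal sketch of the two ideas on this slide, assuming a simple round-robin striping layout and an invented prefetch depth; the block size comes from the slides, everything else (disk count, detector logic) is illustrative, not GPFS code.

```python
# Hypothetical sketch: round-robin striping of file blocks across disks,
# plus a trivial sequential-access detector that decides what to prefetch.

BLOCK_SIZE = 256 * 1024          # GPFS large-file block size from the slides
NUM_DISKS = 8                    # assumed cluster configuration

def block_location(file_block_no: int) -> tuple[int, int]:
    """Map a file block number to (disk index, block index on that disk)."""
    return file_block_no % NUM_DISKS, file_block_no // NUM_DISKS

class PrefetchDetector:
    """Recognize simple sequential access and prefetch the next few blocks."""
    def __init__(self, depth: int = 4):
        self.last_block = None
        self.depth = depth       # how many blocks to prefetch ahead (assumed)

    def on_read(self, file_block_no: int) -> list[int]:
        sequential = (self.last_block is not None
                      and file_block_no == self.last_block + 1)
        self.last_block = file_block_no
        if sequential:
            return [file_block_no + i for i in range(1, self.depth + 1)]
        return []                # unrecognized pattern: no prefetch

if __name__ == "__main__":
    det = PrefetchDetector()
    for blk in (0, 1, 2):        # a sequential scan
        print(blk, block_location(blk), "prefetch:", det.on_read(blk))
```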
7
Allocation
Large files are stored in 256KB blocks; small files are stored in 8KB subblocks (1/32 of a block)
Disks of different sizes pose a trade-off:
Maximizing space utilization: larger disks receive more I/O requests and become a bottleneck
Maximizing parallel performance: larger disks are under-utilized
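The trade-off can be made concrete with two block-placement policies. This is an illustration under assumed disk sizes, not the GPFS allocator.

```python
# Hypothetical sketch: two ways to pick a disk for the next block,
# illustrating the space-vs-parallelism trade-off on the slide.

import random

disk_sizes = [100, 100, 400]     # GB; one disk is 4x larger (assumption)

def pick_by_capacity() -> int:
    """Maximize space utilization: bigger disks get proportionally more
    blocks, so they also receive proportionally more I/O (bottleneck)."""
    return random.choices(range(len(disk_sizes)), weights=disk_sizes, k=1)[0]

_rr = 0
def pick_round_robin() -> int:
    """Maximize parallelism: spread blocks evenly across disks, leaving
    the big disk under-utilized in terms of space."""
    global _rr
    disk = _rr % len(disk_sizes)
    _rr += 1
    return disk
```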
8
Large Directory Support
GPFS uses extensible hashing to support very large directories
[Diagram: directory blocks containing (hash bits, name) entries, e.g. 0100 | file1, 1001 | file2, 0011 | dir1, 1110 | file2_hardlink; some blocks empty]
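A toy extensible-hashing sketch to make the idea concrete: the low-order bits of a name's hash select its directory block, and an overflowing block triggers a split. This simplified version doubles the whole table on overflow (real extensible hashing, and GPFS, split only the overflowing block); capacity and hash function are arbitrary choices for the demo.

```python
# Minimal extensible-hashing sketch (illustration only, not GPFS code).

import hashlib

BLOCK_CAPACITY = 2               # tiny, to force splits in the demo

def h(name: str) -> int:
    return int.from_bytes(hashlib.md5(name.encode()).digest()[:4], "little")

class ExtensibleDir:
    def __init__(self):
        self.depth = 0                        # number of hash bits in use
        self.blocks = [[]]                    # 2**depth directory blocks

    def _slot(self, name: str) -> int:
        return h(name) & ((1 << self.depth) - 1)

    def insert(self, name: str):
        blk = self.blocks[self._slot(name)]
        if len(blk) < BLOCK_CAPACITY:
            blk.append(name)
            return
        # Overflow: use one more hash bit, redistribute entries, retry.
        self.depth += 1
        old = [e for b in self.blocks for e in b]
        self.blocks = [[] for _ in range(1 << self.depth)]
        for e in old:
            self.blocks[self._slot(e)].append(e)
        self.insert(name)

    def lookup(self, name: str) -> bool:
        return name in self.blocks[self._slot(name)]

d = ExtensibleDir()
for f in ["file1", "file2", "dir1", "file3", "file4"]:
    d.insert(f)
print(d.lookup("file2"), d.depth)
```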
9
Logging and Recovery
In a large file system, there is no time to run fsck after a crash
Metadata updates use journaling with a write-ahead log; data are not logged
Each node has a separate log, stored on the shared disks so it can be read by all nodes
Any node can perform recovery on behalf of a failed node
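A hedged sketch of write-ahead logging for metadata: the log record is forced to disk before the metadata itself is written back, so a surviving node can replay it. The file path and record format are invented for illustration.

```python
# Illustration of a per-node metadata write-ahead log (not GPFS code).

import json, os

class MetadataLog:
    def __init__(self, path: str):
        self.path = path

    def append(self, record: dict):
        with open(self.path, "a") as f:
            f.write(json.dumps(record) + "\n")
            f.flush()
            os.fsync(f.fileno())          # force the record to disk first

    def replay(self, apply):
        """Recovery: reapply every logged update (updates must be idempotent)."""
        if not os.path.exists(self.path):
            return
        with open(self.path) as f:
            for line in f:
                apply(json.loads(line))

# Usage: log the intended inode change, then update the inode block itself.
log = MetadataLog("/tmp/node3.log")       # hypothetical per-node log location
log.append({"inode": 42, "op": "set_size", "value": 1048576})
```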
10
Managing Parallelism and Consistency in a Cluster
11
Distributed Locking vs. Centralized Management
Goal: reading and writing in parallel from all nodes in the cluster
Constraint: POSIX semantics require synchronizing access to data and metadata from multiple nodes
Example: if two processes on two nodes access the same file, a read on one node must see either all or none of the data written by a concurrent write
12
Distributed Locking vs. Centralized Management
Two approaches to locking:
Distributed: consult with all other nodes before acquiring a lock; allows greater parallelism
Centralized: consult a designated node; better for frequently updated metadata
13
Lock Granularity
Too small: high locking overhead
Too large: many contending lock requests
14
The GPFS Distributed Lock Manager
A centralized global lock manager on one node, plus local lock managers on each node
The global lock manager hands out lock tokens (the right to grant locks locally) to the local lock managers
Once a node holds a token, it can grant and release locks locally without further messages
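A simplified sketch of the token idea, with the messaging and data structures invented for illustration; the real GPFS protocol is more involved (lock modes, flushing dirty data on revoke, etc.).

```python
# Token-based locking sketch: one global token server, one local lock
# manager per node. A node only contacts the server on a token miss.

class TokenServer:                        # runs on one node
    def __init__(self):
        self.holder = {}                  # object -> manager holding its token

    def acquire(self, obj, requester):
        current = self.holder.get(obj)
        if current is not None and current is not requester:
            current.revoke(obj)           # current holder gives the token up
        self.holder[obj] = requester

class LocalLockManager:                   # runs on every node
    def __init__(self, name, server):
        self.name, self.server = name, server
        self.tokens = set()

    def lock(self, obj):
        if obj not in self.tokens:        # only talk to the server on a miss
            self.server.acquire(obj, self)
            self.tokens.add(obj)
        # ...grant the lock locally; repeated lock/unlock needs no messages...

    def revoke(self, obj):
        # In a real system the holder would first flush state covered by
        # the token before releasing it.
        self.tokens.discard(obj)

server = TokenServer()
n1, n2 = LocalLockManager("node1", server), LocalLockManager("node2", server)
n1.lock("inode:42")                       # token fetched from the server
n1.lock("inode:42")                       # satisfied locally, no message
n2.lock("inode:42")                       # forces a revoke at node1
```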
15
Parallel Data Access
How can multiple nodes write to the same file?
Byte-range locking synchronizes reads and writes while allowing concurrent writes to different parts of the same file
16
Byte-Range Tokens
The first write request from a node acquires a token for the whole file, which is efficient when no other node writes concurrently
A second write request to the same file from another node revokes part of the byte-range token held by the first node
Knowing the reference pattern helps predict how to break up the byte ranges
17
Byte-Range Tokens
Byte ranges are rounded to block boundaries so that two nodes cannot modify the same block
This avoids false sharing: a shared block being repeatedly shipped between nodes because each updates a different part of it
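A hedged sketch of splitting a byte-range token when a second writer shows up, rounding the requested range to 256KB block boundaries as described above. The splitting policy and data representation are illustrative, not the GPFS implementation.

```python
# Byte-range token split sketch (illustration only).

BLOCK = 256 * 1024
INF = float("inf")

def round_to_blocks(start, end):
    """Round a requested byte range outward to block boundaries."""
    lo = (start // BLOCK) * BLOCK
    hi = end if end == INF else -(-end // BLOCK) * BLOCK   # ceil to a block
    return lo, hi

def split_token(held, requested):
    """Return (ranges kept by the holder, range granted to the requester)."""
    grant = round_to_blocks(*requested)
    kept = []
    h_lo, h_hi = held
    if h_lo < grant[0]:
        kept.append((h_lo, grant[0]))       # part below the granted range
    if grant[1] < h_hi:
        kept.append((grant[1], h_hi))       # part above the granted range
    return kept, grant

# Node 1 wrote first and holds the whole file; node 2 now writes 1MB at 10MB.
held_by_node1 = (0, INF)
kept, granted = split_token(held_by_node1, (10 * 2**20, 11 * 2**20))
print("node1 keeps:", kept)
print("node2 gets :", granted)
```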
18
Synchronizing Access to File Metadata
When multiple nodes write to the same file, they concurrently update the inode and indirect blocks
Synchronizing every such update is very expensive
19
Synchronizing Access to File Metadata
GPFS uses a shared write lock on the inode; updates are merged by keeping the largest file size and the latest timestamp
How do multiple nodes append to the same file concurrently? One node, elected dynamically, is responsible for updating the inode on behalf of the others
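A small sketch of the merge rule under the shared write lock: each writer reports its view, and the elected node combines them by taking the largest size and latest timestamp. The record layout is invented for illustration.

```python
# Merging concurrent inode updates (illustration only).

from dataclasses import dataclass

@dataclass
class InodeUpdate:
    size: int        # file size as seen by this writer
    mtime: float     # modification time set by this writer

def merge_inode_updates(updates: list[InodeUpdate]) -> InodeUpdate:
    """Commutative merge: the order of concurrent updates does not matter."""
    return InodeUpdate(
        size=max(u.size for u in updates),
        mtime=max(u.mtime for u in updates),
    )

# Two nodes wrote to the same file; the merged inode reflects both.
merged = merge_inode_updates([InodeUpdate(4096, 100.0), InodeUpdate(8192, 99.5)])
print(merged)      # InodeUpdate(size=8192, mtime=100.0)
```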
20
Allocation Maps
32 bits per block are needed because each block is divided into subblocks
The map is divided into n separate lockable regions; each region tracks 1/n of the blocks on every disk, so a node allocating from its own region can still stripe across all disks while minimizing lock conflicts
One node maintains the free-space statistics, which are updated periodically
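A sketch of the regionalized map, with all sizes assumed: each region covers a slice of every disk, so a node that has locked one region can still stripe a file across all disks without conflicting with other nodes. Not the actual GPFS data structure.

```python
# Regionalized allocation map sketch (assumed sizes, illustration only).

NUM_DISKS = 4
BLOCKS_PER_DISK = 1000
NUM_REGIONS = 8                  # ideally at least the number of nodes

def region_of(disk: int, block: int) -> int:
    """Blocks are dealt round-robin to regions within each disk."""
    return block % NUM_REGIONS

class Region:
    """One lockable slice of the allocation map, used by a single node."""
    def __init__(self, region_id: int):
        self.id = region_id
        # Free blocks of this region, grouped per disk, so allocation can
        # rotate over disks (striping) without touching other regions.
        self.free = {d: [b for b in range(BLOCKS_PER_DISK)
                         if region_of(d, b) == region_id]
                     for d in range(NUM_DISKS)}
        self.next_disk = 0

    def allocate(self):
        """Allocate one block, rotating over disks to keep files striped."""
        for _ in range(NUM_DISKS):
            d = self.next_disk
            self.next_disk = (self.next_disk + 1) % NUM_DISKS
            if self.free[d]:
                return d, self.free[d].pop()
        raise RuntimeError("region exhausted; switch to another region")

r = Region(3)                    # this node has locked region 3
print([r.allocate() for _ in range(4)])   # one block on each of the 4 disks
```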
21
Other File System Metadata
Some metadata uses centralized management to coordinate updates, e.g., the quota manager
22
Token Manager Scaling
File sizes are unbounded, so the number of byte-range tokens is also unbounded and can use up the token manager's entire memory
The token manager must monitor and prevent unbounded growth: revoke tokens as necessary and reuse tokens freed by deleted files
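One simple way to bound token-table memory is a capped table that revokes the least-recently-used token when full; this is an assumed policy for illustration, not the one GPFS uses.

```python
# Bounded token table sketch (assumed LRU revocation policy).

from collections import OrderedDict

class BoundedTokenTable:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.tokens = OrderedDict()       # token key -> holder node

    def grant(self, key, node):
        if key in self.tokens:
            self.tokens.move_to_end(key)  # refresh recency
        elif len(self.tokens) >= self.capacity:
            victim, holder = self.tokens.popitem(last=False)
            self._revoke(victim, holder)  # reclaim the oldest token's slot
        self.tokens[key] = node

    def file_deleted(self, file_id):
        """Drop all tokens of a deleted file so their slots can be reused."""
        for key in [k for k in self.tokens if k[0] == file_id]:
            del self.tokens[key]

    def _revoke(self, key, holder):
        pass                              # real system: message the holder to release it

table = BoundedTokenTable(capacity=2)
table.grant(("fileA", 0, 4096), "node1")
table.grant(("fileB", 0, 4096), "node2")
table.grant(("fileC", 0, 4096), "node3")  # revokes fileA's token
```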
23
Fault Tolerance
Node failures
Communication failures
Disk failures
24
Node Failures
Periodic heartbeat messages detect node failures
A surviving node runs log recovery on behalf of the failed node
The token manager releases the tokens held by the failed node, and other nodes can resend committed updates
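A hedged sketch of heartbeat-based detection; the timeout and recovery stubs are assumptions, and the recovery steps are just the outline from the slide, not real GPFS calls.

```python
# Heartbeat failure detection sketch (illustration only).

import time

HEARTBEAT_TIMEOUT = 10.0          # seconds (assumed configuration value)

class FailureDetector:
    def __init__(self, nodes):
        now = time.monotonic()
        self.last_seen = {n: now for n in nodes}

    def heartbeat(self, node):
        self.last_seen[node] = time.monotonic()

    def failed_nodes(self):
        now = time.monotonic()
        return [n for n, t in self.last_seen.items()
                if now - t > HEARTBEAT_TIMEOUT]

def on_failure(node):
    # Outline of the recovery steps from the slide (stubs):
    # 1. a surviving node replays `node`'s write-ahead log
    # 2. the token manager releases tokens held by `node`
    # 3. other nodes resend committed updates
    print(f"recovering on behalf of {node}")

detector = FailureDetector(["node1", "node2", "node3"])
for n in detector.failed_nodes():
    on_failure(n)
```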
25
Communication Failures
A network partition splits the cluster into groups; if both sides continued operating, the file system could be corrupted
The file system therefore remains accessible only to the group containing a majority of the nodes in the cluster
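The majority rule guarantees that at most one side of a partition can stay active, since two disjoint groups cannot both hold more than half the nodes. A minimal check, for illustration:

```python
# Majority (quorum) rule sketch.

def has_quorum(group_size: int, cluster_size: int) -> bool:
    return group_size > cluster_size // 2

CLUSTER_SIZE = 7
print(has_quorum(4, CLUSTER_SIZE))   # True: this side keeps the file system accessible
print(has_quorum(3, CLUSTER_SIZE))   # False: this side stops accessing the disks
```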
26
Disk Failures
Dual-attached RAID controllers
Files can also be replicated
27
Scalable Online System Utilities
Adding, deleting, and replacing disks; rebalancing file system content; defragmentation, quota-check, fsck
A file system manager coordinates these administrative activities
28
Experiences
Workload skew: small management overheads can affect parallel applications in significant ways
Because a tightly coupled parallel application runs at the speed of its slowest node, one node slowing down by 1% slows the whole 512-node cluster by 1%, the equivalent of leaving about 5 nodes (512 × 1% ≈ 5) completely idle
Hence the need for dedicated administrative nodes
29
Experiences
Even the rarest failures can happen, e.g., data loss within a RAID, or a bad batch of disk drives
30
Related Work
Storage area network file systems: centralized metadata server
SGI's XFS file system: not a clustered file system
Frangipani, Global File System: do not support multiple nodes accessing the same file
31
Summary and Conclusions
GPFS uses distributed locking and recovery
Uses RAID and replication for reliability
Scales up to the largest supercomputers in the world
Provides fault tolerance and system management functions