Architectural and Design Issues in the General Parallel File System
IBM Research Lab in Haifa
Benny Mandler, May 12, 2002
Agenda
- What is GPFS? A file system for deep computing
- GPFS uses
- General architecture
- How does GPFS meet its challenges? Architectural issues:
  - Performance
  - Scalability
  - High availability
  - Concurrency control
Scalable Parallel Computing (What is GPFS?)
- RS/6000 SP scalable parallel computer
  - Nodes connected by a high-speed switch
  - 1-16 CPUs per node (Power2 or PowerPC)
  - >1 TB of disk per node
  - 500 MB/s full duplex per switch port
- Scalable parallel computing enables I/O-intensive applications:
  - Deep computing: simulation, seismic analysis, data mining
  - Server consolidation: aggregating file and web servers onto a centrally managed machine
  - Streaming video and audio for multimedia presentation
  - Scalable object store for large digital libraries, web servers, databases, ...
GPFS Addresses SP I/O Requirements (What is GPFS?)
- High performance
  - Multiple GB/s to/from a single file
  - Concurrent reads and writes; parallel data access within a file and across files
  - Fully parallel access to both file data and metadata
  - Client caching enabled by distributed locking
  - Wide striping, large data blocks, prefetch
- Scalability
  - Scales up to 512 nodes (n-way SMP): storage nodes, file system nodes, disks, adapters, ...
- High availability
  - Fault tolerance via logging, replication, and RAID support
  - Survives node and disk failures
- Uniform access via shared disks: single-image file system
- High capacity: multiple TB per file system, 100s of GB per file
- Standards compliant (X/Open 4.0 "POSIX") with minor exceptions
GPFS vs. Local and Distributed File Systems on the SP2 (What is GPFS?)
- Native AIX file system (JFS)
  - No file sharing: an application can only access files on its own node
  - Applications must do their own data partitioning
- DCE Distributed File System (successor to AFS)
  - Application nodes (DCE clients) share files on a server node
  - The switch is used as a fast LAN
  - Coarse-grained (file or segment level) parallelism
  - The server node is a performance and capacity bottleneck
- GPFS parallel file system
  - GPFS file systems are striped across multiple disks on multiple storage nodes
  - Independent GPFS instances run on each application node
  - GPFS instances use storage nodes as "block servers": all instances can access all disks
Tokyo Video on Demand Trial (GPFS uses)
- Video on demand for a new "borough" of Tokyo
- Applications: movies, news, karaoke, education, ...
- Video distribution via hybrid fiber/coax
- Trial "live" since June '96; currently 500 subscribers
- 6 Mbit/s MPEG video streams
- 100 simultaneous viewers (75 MB/s)
- 200 hours of video online (700 GB)
- 12-node SP-2 (7 distribution, 5 storage)
Engineering Design (GPFS uses)
- Major aircraft manufacturer
- Uses CATIA for large designs, Elfini for structural modeling and analysis
- The SP is used for modeling and analysis
- GPFS stores CATIA designs and structural modeling data
- GPFS allows all nodes to share designs and models
Shared Disks: Virtual Shared Disk Architecture (General architecture)
- A file system consists of one or more shared disks
  - An individual disk can contain data, metadata, or both
  - Each disk is assigned to a failure group
  - Data and metadata are striped to balance load and maximize parallelism
- Recoverable Virtual Shared Disk (VSD) for accessing disk storage
  - Disks are physically attached to SP nodes
  - VSD allows clients to access disks over the SP switch
  - The VSD client looks like a disk device driver on the client node
  - The VSD server executes I/O requests on the storage node
  - VSD supports JBOD or RAID volumes, fencing, and multi-pathing (where the physical hardware permits)
- GPFS only assumes a conventional block I/O interface
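The last bullet is the key architectural assumption. Below is a toy C sketch of what such a conventional block I/O interface amounts to (hypothetical names and an in-memory stand-in, not the real VSD API): sector-granular reads and writes addressed by disk id and sector number, looking the same from every node.

```c
/* A minimal sketch (hypothetical names, not the real VSD API) of the
 * "conventional block I/O interface" GPFS assumes from the shared disk
 * layer. In GPFS these calls are served by the VSD client, which ships
 * the request over the SP switch to the storage node owning the disk. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define SECTOR_SIZE 512
#define NSECTORS    128                      /* toy in-memory "disk" */

typedef uint32_t disk_id_t;
typedef uint64_t sector_t;

static char disk0[NSECTORS][SECTOR_SIZE];    /* stands in for one shared VSD volume */

static int blk_read(disk_id_t d, sector_t s, void *buf)
{
    (void)d;
    if (s >= NSECTORS) return -1;
    memcpy(buf, disk0[s], SECTOR_SIZE);
    return 0;
}

static int blk_write(disk_id_t d, sector_t s, const void *buf)
{
    (void)d;
    if (s >= NSECTORS) return -1;
    memcpy(disk0[s], buf, SECTOR_SIZE);
    return 0;
}

int main(void)
{
    char out[SECTOR_SIZE] = "hello, shared disk";
    char in[SECTOR_SIZE];

    blk_write(0, 7, out);                    /* any node may address any sector */
    blk_read(0, 7, in);
    printf("%s\n", in);
    return 0;
}
```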
GPFS Architecture Overview (General architecture)
- Implications of the shared disk model
  - All data and metadata reside on globally accessible disks (VSD)
  - All access to permanent data goes through the disk I/O interface
  - Distributed protocols (e.g., distributed locking) coordinate disk access from multiple nodes
  - Fine-grained locking allows parallel access by multiple clients
  - Logging and shadowing restore consistency after node failures
- Implications of large scale
  - Supports up to 4096 disks of up to 1 TB each (4 PB); the largest system in production is 75 TB
  - Failure detection and recovery protocols handle node failures
  - Replication and/or RAID protect against disk and storage node failures
  - On-line dynamic reconfiguration (add, delete, replace disks and nodes; rebalance the file system)
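The 4 PB figure follows directly from the stated limits (taking 1 PB = 1024 TB):

$$4096\ \text{disks} \times 1\ \text{TB per disk} = 4096\ \text{TB} = 4\ \text{PB}.$$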
GPFS Architecture: Node Roles (General architecture)
- Three types of nodes: file system, storage, and manager; each node can perform any of these functions
- File system nodes
  - Run user programs; read/write data to/from storage nodes
  - Implement the virtual file system interface
  - Cooperate with manager nodes to perform metadata operations
- Manager nodes (one per file system)
  - Global lock manager, recovery manager, global allocation manager, quota manager, file metadata manager
  - Admin services, fail-over
- Storage nodes
  - Implement the block I/O interface
  - Shared access from file system and manager nodes
  - Interact with manager nodes for recovery (e.g., fencing)
  - File data and metadata are striped across multiple disks on multiple storage nodes
GPFS Software Structure (General architecture)
[figure: GPFS software structure]
Disk Data Structures: Files (General architecture)
- Large block size allows efficient use of disk bandwidth
- Fragments reduce space overhead for small files
- No designated "mirror" and no fixed placement function
  - Flexible replication (e.g., replicate only metadata, or only important files)
  - Dynamic reconfiguration: data can migrate block by block
- Multi-level indirect blocks
  - Each disk address is a list of pointers to replicas
  - Each pointer is a disk id + sector number
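A rough C rendering of that addressing scheme (field names and sizes are hypothetical, not GPFS's actual on-disk format): a disk address is a small array of replica pointers, each a (disk id, sector) pair, and an indirect block is simply an array of such addresses.

```c
/* Sketch of per-block addressing with replication, as described on the
 * slide above. Layout and sizes are illustrative only. */
#include <stdint.h>
#include <stdio.h>

#define MAX_REPLICAS 2                      /* e.g., data plus one replica */

typedef struct {
    uint32_t disk_id;                       /* which shared disk (VSD volume) */
    uint64_t sector;                        /* sector number on that disk */
} block_ptr_t;

typedef struct {
    block_ptr_t replica[MAX_REPLICAS];      /* unused entries left zeroed */
} disk_addr_t;

/* An indirect block is an array of disk addresses; large files use
 * multiple levels of them (inode -> indirect blocks -> data blocks). */
typedef struct {
    disk_addr_t addr[256];
} indirect_block_t;

int main(void)
{
    disk_addr_t a = { .replica = { { .disk_id = 3, .sector = 4096 },
                                   { .disk_id = 7, .sector = 8192 } } };

    /* A read may be satisfied from any valid replica ("read one"). */
    printf("primary copy: disk %u, sector %llu\n",
           a.replica[0].disk_id,
           (unsigned long long)a.replica[0].sector);
    return 0;
}
```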
Large File Block Size (Performance)
- Conventional file systems store data in small blocks to pack data more densely
- GPFS uses large blocks (256 KB default) to optimize disk transfer speed
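To see the effect, assume (illustrative figures, not from the slides) about 10 ms of positioning time per I/O and a 50 MB/s media transfer rate. For block size $B$ and media rate $r$,

$$\text{effective rate} = \frac{B}{t_{\text{pos}} + B/r},$$

so a 256 KB block delivers roughly $0.25/(0.010 + 0.005) \approx 17$ MB/s per disk, while a 4 KB block delivers only about $0.004/0.010 \approx 0.4$ MB/s, because almost all of the time goes into positioning rather than transfer.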
Parallelism and Consistency
- Distributed locking: acquire an appropriate lock for every operation; used for updates to user data
- Centralized management: conflicting operations are forwarded to a designated node; used for file metadata
- Distributed locking plus centralized hints: used for space allocation
- Central coordinator: used for configuration changes
- I/O slowdown effects: lock conflicts show up mainly as additional I/O activity rather than token server overload
Parallel File Access from Multiple Nodes (Performance)
- GPFS allows parallel applications on multiple nodes to access non-overlapping ranges of a single file with no conflict
- Global locking serializes access to overlapping ranges of a file
- Global locking is based on "tokens", which convey access rights to an object (e.g., a file) or a subset of an object (e.g., a byte range)
- Tokens can be held across file system operations, enabling coherent data caching in clients
- Cached data is discarded or written to disk when a token is revoked
- Performance optimizations: required/desired ranges, metanode, data shipping, special token modes for file-size operations
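A compact C sketch of the byte-range conflict rule behind such tokens (a simplified model, not GPFS's token manager): two tokens conflict only if their ranges overlap and at least one of them is a write token, which is why non-overlapping writers proceed without any interaction.

```c
/* Simplified byte-range token compatibility check. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef enum { TOKEN_READ, TOKEN_WRITE } token_mode_t;

typedef struct {
    uint64_t start, end;        /* byte range [start, end) */
    token_mode_t mode;
    int holder_node;
} br_token_t;

static bool ranges_overlap(const br_token_t *a, const br_token_t *b)
{
    return a->start < b->end && b->start < a->end;
}

/* A request conflicts with a held token only if the ranges overlap
 * and at least one of the two is a write token. */
static bool conflicts(const br_token_t *req, const br_token_t *held)
{
    if (!ranges_overlap(req, held))
        return false;
    return req->mode == TOKEN_WRITE || held->mode == TOKEN_WRITE;
}

int main(void)
{
    br_token_t held = { 0, 1 << 20, TOKEN_WRITE, 1 };        /* node 1 writes first 1 MB  */
    br_token_t req1 = { 1 << 20, 2 << 20, TOKEN_WRITE, 2 };  /* node 2 writes next 1 MB   */
    br_token_t req2 = { 0, 4096, TOKEN_READ, 3 };            /* node 3 reads first 4 KB   */

    printf("non-overlapping writers conflict: %d\n", conflicts(&req1, &held));  /* 0 */
    printf("overlapping read vs. write conflict: %d\n", conflicts(&req2, &held)); /* 1 */
    return 0;
}
```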
Deep Prefetch for High Throughput (Performance)
- GPFS stripes successive blocks across successive disks
- Disk I/O for sequential reads and writes is done in parallel
- GPFS measures application "think time", disk throughput, and cache state to automatically determine the optimal parallelism
- Prefetch algorithms now recognize strided and reverse sequential access
- Accepts hints
- Write-behind policy
- Example: the application reads at 15 MB/s and each disk reads at 5 MB/s, so three I/Os are executed in parallel
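The example on the slide is at heart a rate calculation. The sketch below (not GPFS's actual heuristic, which also weighs think time and cache state) computes how many block I/Os must be in flight for the disks to keep pace with the application.

```c
/* How many parallel block I/Os are needed so aggregate disk throughput
 * matches the application's consumption rate. */
#include <stdio.h>

static int prefetch_depth(double app_mb_per_s, double disk_mb_per_s)
{
    double d = app_mb_per_s / disk_mb_per_s;
    int depth = (int)d;
    if (d > depth)
        depth++;                 /* round up: a fraction of a disk is a whole I/O */
    return depth < 1 ? 1 : depth;
}

int main(void)
{
    /* Slide example: application reads at 15 MB/s, each disk delivers 5 MB/s. */
    printf("parallel I/Os needed: %d\n", prefetch_depth(15.0, 5.0));   /* prints 3 */
    return 0;
}
```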
GPFS Throughput Scaling for Non-cached Files (Scalability)
- Hardware: Power2 wide nodes, SSA disks
- Experiment: sequential read/write from a large number of GPFS nodes to a varying number of storage nodes
- Result: throughput increases nearly linearly with the number of storage nodes
- Bottlenecks:
  - The Micro Channel bus limits node throughput to 50 MB/s
  - System throughput is limited by the available storage nodes
Disk Data Structures: Allocation Map (Scalability)
- Segmented block allocation map
  - Each segment contains bits representing blocks on all disks
  - Each segment is a separately lockable unit
  - Minimizes contention for the allocation map when writing files on multiple nodes
  - The allocation manager service provides hints about which segments to try
- The inode allocation map is handled similarly
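A simplified C sketch of the idea (hypothetical sizes and layout, not GPFS's on-disk format): because each segment covers blocks on every disk, a node can still stripe a file widely while touching, and locking, only its own segment of the map.

```c
/* Segmented block allocation map, greatly simplified. */
#include <stdio.h>
#include <string.h>

#define NDISKS            4
#define BLOCKS_PER_DISK  64
#define NSEGMENTS         8
#define SEG_BLOCKS       (BLOCKS_PER_DISK / NSEGMENTS)  /* blocks per disk covered by one segment */

/* Each segment covers a slice of blocks on every disk and is a
 * separately lockable unit (the locking itself is omitted here). */
typedef struct {
    unsigned char used[NDISKS][SEG_BLOCKS];   /* one bit per block in the real map */
} segment_t;

static segment_t map[NSEGMENTS];

/* Allocate one free block from the segment the allocation manager hinted
 * at; returns 0 and fills (disk, block), or -1 if the segment is full. */
static int alloc_from_segment(int seg, int *disk, int *block)
{
    segment_t *s = &map[seg];
    for (int d = 0; d < NDISKS; d++)
        for (int b = 0; b < SEG_BLOCKS; b++)
            if (!s->used[d][b]) {
                s->used[d][b] = 1;
                *disk = d;
                *block = seg * SEG_BLOCKS + b;   /* block number within that disk */
                return 0;
            }
    return -1;
}

int main(void)
{
    memset(map, 0, sizeof map);
    int disk, block;

    /* Two nodes writing concurrently are hinted to different segments,
     * so they never contend for the same part of the allocation map. */
    alloc_from_segment(0, &disk, &block);
    printf("node 1 allocated disk %d, block %d\n", disk, block);
    alloc_from_segment(5, &disk, &block);
    printf("node 2 allocated disk %d, block %d\n", disk, block);
    return 0;
}
```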
Logging and Recovery (High Availability)
- Problem: detect and fix file system inconsistencies after a failure of one or more nodes
  - All updates that may leave inconsistencies if uncompleted are logged
  - Write-ahead logging policy: the log record is forced to disk before the dirty metadata is written
  - Redo log: replaying all log records at recovery time restores file system consistency
- Logged updates:
  - I/O to replicated data
  - Directory operations (create, delete, move, ...)
  - Allocation map changes
- Other techniques:
  - Ordered writes
  - Shadowing
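A toy C sketch of the write-ahead rule and redo recovery (in-memory stand-ins, not GPFS's log format): the redo record is forced to the log before the metadata block is written back, so replaying the log after a crash restores consistency even if the write-back never happened.

```c
/* Write-ahead logging with a redo log, reduced to its essentials. */
#include <stdio.h>
#include <string.h>

#define NBLOCKS 16
#define LOGSIZE 32

typedef struct { int block; char data[32]; } log_rec_t;

static char disk[NBLOCKS][32];        /* stands in for metadata blocks on disk */
static log_rec_t log_disk[LOGSIZE];   /* stands in for the on-disk redo log */
static int log_len;

/* Step 1: force the redo record to the log (the write-ahead rule). */
static void log_force(int block, const char *newval)
{
    log_rec_t r = { .block = block };
    strncpy(r.data, newval, sizeof r.data - 1);
    log_disk[log_len++] = r;          /* a real log would sync to disk here */
}

/* Step 2: only now may the dirty metadata block be written back. */
static void write_back(int block, const char *newval)
{
    strncpy(disk[block], newval, sizeof disk[block] - 1);
}

/* Recovery: replay every log record. Redo is idempotent, so replaying
 * records whose write-back already completed is harmless. */
static void redo_recovery(void)
{
    for (int i = 0; i < log_len; i++)
        strncpy(disk[log_disk[i].block], log_disk[i].data, sizeof disk[0] - 1);
}

int main(void)
{
    log_force(3, "dir entry: foo -> inode 42");
    /* ...suppose the node crashes here, before write_back() runs... */
    redo_recovery();
    printf("block 3 after recovery: %s\n", disk[3]);
    (void)write_back;                 /* unused in this crash scenario */
    return 0;
}
```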
Node Failure Recovery (High Availability)
- Application node failure
  - The force-on-steal policy ensures that all changes visible to other nodes have been written to disk and will not be lost
  - All potential inconsistencies are protected by a token and are logged
  - The file system manager runs log recovery on behalf of the failed node; after successful log recovery, tokens held by the failed node are released
  - Actions taken: restore metadata being updated by the failed node to a consistent state, release resources held by the failed node
- File system manager failure
  - A new node is appointed to take over
  - The new file system manager restores volatile state by querying the other nodes
  - The new file system manager may have to undo or finish a partially completed configuration change (e.g., adding or deleting a disk)
- Storage node failure
  - Dual-attached disk: use the alternate path (VSD)
  - Single-attached disk: treat as a disk failure
Handling Disk Failures (High Availability)
- When a disk failure is detected
  - The node that detects the failure informs the file system manager
  - The file system manager updates the configuration data to mark the failed disk as "down" (quorum algorithm)
- While a disk is down
  - Read one / write all available copies
  - A "missing update" bit is set in the inode of modified files
- When/if the disk recovers
  - The file system manager searches the inode file for missing-update bits
  - All data and metadata of files with missing updates are copied back to the recovering disk (one file at a time, using the normal locking protocol)
  - Until missing-update recovery is complete, data on the recovering disk is treated as write-only
- Unrecoverable disk failure
  - The failed disk is deleted from the configuration or replaced by a new one
  - New replicas are created on the replacement or on other disks
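A simplified C sketch (hypothetical structures) of the "write all available copies" rule: replicas on down disks are skipped, and the inode is marked so those replicas can be rebuilt when the disk returns.

```c
/* "Read one / write all available" for replicated blocks, simplified. */
#include <stdbool.h>
#include <stdio.h>

#define NREPLICAS 2
#define NDISKS    8

typedef struct { int disk_id; long sector; } block_ptr_t;
typedef struct { bool missing_update; } inode_t;

static bool disk_up[NDISKS] = { true, true, false, true };   /* disk 2 is down */

static void write_sector(int disk, long sector, const char *data)
{
    printf("write %s to disk %d, sector %ld\n", data, disk, sector);
}

/* Write every replica whose disk is up; if any replica is skipped,
 * record that fact in the inode so recovery can find the file later. */
static void write_replicated(inode_t *ino, const block_ptr_t rep[NREPLICAS],
                             const char *data)
{
    for (int i = 0; i < NREPLICAS; i++) {
        if (disk_up[rep[i].disk_id])
            write_sector(rep[i].disk_id, rep[i].sector, data);
        else
            ino->missing_update = true;   /* this replica must be rebuilt */
    }
}

int main(void)
{
    inode_t ino = { false };
    block_ptr_t rep[NREPLICAS] = { { 1, 100 }, { 2, 100 } };  /* one replica on down disk 2 */

    write_replicated(&ino, rep, "blockdata");
    printf("missing-update bit: %d\n", ino.missing_update);   /* prints 1 */
    return 0;
}
```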
Cache Management
- The total cache is divided into pools
  - General pool: clock list; buffers can be merged and re-mapped
  - Per-block-size pools: a clock list each
  - Each pool keeps statistics (sequential/random, optimal, total)
- Pools are balanced dynamically according to usage patterns
- Avoid fragmentation, both internal and external
- Unified steal across pools
- Periodic re-balancing
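The "clock list" each pool keeps is presumably a second-chance eviction list; below is a generic C sketch of that algorithm (not GPFS's pool manager): a hand sweeps the buffers, clearing reference bits, and steals the first buffer that has not been referenced since the last sweep.

```c
/* Generic clock (second-chance) eviction for one cache pool. */
#include <stdbool.h>
#include <stdio.h>

#define POOL_BUFFERS 8

typedef struct {
    long block;          /* which file block the buffer holds, -1 if empty */
    bool referenced;     /* set on every cache hit */
} buffer_t;

static buffer_t pool[POOL_BUFFERS];
static int hand;         /* clock hand: next buffer to examine */

/* Pick a victim buffer to steal for new data. */
static int clock_steal(void)
{
    for (;;) {
        buffer_t *b = &pool[hand];
        int victim = hand;
        hand = (hand + 1) % POOL_BUFFERS;
        if (b->referenced)
            b->referenced = false;   /* second chance: clear the bit and move on */
        else
            return victim;           /* not referenced since the last sweep: steal it */
    }
}

int main(void)
{
    for (int i = 0; i < POOL_BUFFERS; i++)
        pool[i] = (buffer_t){ .block = i, .referenced = true };
    pool[5].referenced = false;      /* block 5 was not touched recently */

    printf("stolen buffer: %d\n", clock_steal());   /* prints 5 */
    return 0;
}
```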
Epilogue
- Used on six of the ten most powerful supercomputers in the world, including the largest (ASCI White)
- Installed at several hundred customer sites, on clusters ranging from a few nodes with less than a TB of disk up to 512 nodes with 140 TB of disk in 2 file systems
- IP rich: ~20 filed patents
- State of the art: TeraSort
  - World record of 17 minutes
  - Using a 488-node SP: 432 file system nodes and 56 storage nodes (604e, 332 MHz)
  - Total of 6 TB of disk space
- References
  - GPFS home page
  - FAST 2002
  - TeraSort
  - Tiger Shark