IBM Research Lab in Haifa
Architectural and Design Issues in the General Parallel File System
Benny Mandler - May 12, 2002

Agenda
- What is GPFS? - a file system for deep computing
- GPFS uses
- General architecture
- How does GPFS meet its challenges - architectural issues:
  - performance
  - scalability
  - high availability
  - concurrency control

Scalable Parallel Computing (What is GPFS?)
RS/6000 SP Scalable Parallel Computer:
- nodes connected by a high-speed switch
- 1-16 CPUs per node (Power2 or PowerPC)
- >1 TB disk per node
- 500 MB/s full duplex per switch port
Scalable parallel computing enables I/O-intensive applications:
- Deep computing - simulation, seismic analysis, data mining
- Server consolidation - aggregating file and web servers onto a centrally managed machine
- Streaming video and audio for multimedia presentation
- Scalable object store for large digital libraries, web servers, databases, ...

HRLHRL  High Performance - multiple GB/s to/from a single file  concurrent reads and writes, parallel data access - within a file and across files Support fully parallel access both to file data and metadata  client caching enabled by distributed locking  wide striping, large data blocks, prefetch  Scalability  scales up to 512 nodes (N-Way SMP). Storage nodes, file system nodes, disks, adapters...  High Availability  fault-tolerance via logging, replication, RAID support  survives node and disk failures  Uniform access via shared disks - Single image file system  High capacity multiple TB per file system, 100s of GB per file.  Standards compliant (X/Open 4.0 "POSIX") with minor exceptions GPFS addresses SP I/O requirements What is GPFS?

GPFS vs. local and distributed file systems on the SP2
- Native AIX File System (JFS)
  - no file sharing - an application can only access files on its own node
  - applications must do their own data partitioning
- DCE Distributed File System (successor to AFS)
  - application nodes (DCE clients) share files on a server node
  - the switch is used as a fast LAN
  - coarse-grained (file- or segment-level) parallelism
  - the server node is a performance and capacity bottleneck
- GPFS Parallel File System
  - GPFS file systems are striped across multiple disks on multiple storage nodes
  - independent GPFS instances run on each application node
  - GPFS instances use storage nodes as "block servers" - all instances can access all disks

HRLHRL  Video on Demand for new "borough" of Tokyo  Applications: movies, news, karaoke, education...  Video distribution via hybrid fiber/coax  Trial "live" since June '96  Currently 500 subscribers  6 Mbit/sec MPEG video streams  100 simultaneous viewers (75 MB/sec)  200 hours of video on line (700 GB)  12-node SP-2 (7 distribution, 5 storage) Tokyo Video on Demand Trial

Engineering Design (GPFS uses)
Major aircraft manufacturer:
- using CATIA for large designs, Elfini for structural modeling and analysis
- SP used for modeling/analysis
- using GPFS to store CATIA designs and structural modeling data
- GPFS allows all nodes to share designs and models

HRLHRL  File systems consist of one or more shared disks ? Individual disk can contain data, metadata, or both ? Disks are designated to failure group ? Data and metadata are striped to balance load and maximize parallelism  Recoverable Virtual Shared Disk for accessing disk storage ? Disks are physically attached to SP nodes ? VSD allows clients to access disks over the SP switch ? VSD client looks like disk device driver on client node ? VSD server executes I/O requests on storage node. ? VSD supports JBOD or RAID volumes, fencing, multi- pathing (where physical hardware permits)  GPFS only assumes a conventional block I/O interface Shared Disks - Virtual Shared Disk architecture General architecture

HRLHRL  Implications of Shared Disk Model ? All data and metadata on globally accessible disks (VSD) ? All access to permanent data through disk I/O interface ? Distributed protocols, e.g., distributed locking, coordinate disk access from multiple nodes ? Fine-grained locking allows parallel access by multiple clients ? Logging and Shadowing restore consistency after node failures  Implications of Large Scale ? Support up to 4096 disks of up to 1 TB each (4 Petabytes) The largest system in production is 75 TB ? Failure detection and recovery protocols to handle node failures ? Replication and/or RAID protect against disk / storage node failure ? On-line dynamic reconfiguration (add, delete, replace disks and nodes; rebalance file system) GPFS Architecture Overview General architecture

HRLHRL  Three types of nodes: file system, storage, and manager ? Each node can perform any of these functions ? File system nodes  run user programs, read/write data to/from storage nodes  implement virtual file system interface  cooperate with manager nodes to perform metadata operations ? Manager nodes (one per “file system”)  global lock manager  recovery manager  global allocation manager  quota manager  file metadata manager  admin services fail over ? Storage nodes  implement block I/O interface  shared access from file system and manager nodes  interact with manager nodes for recovery (e.g. fencing)  file data and metadata striped across multiple disks on multiple storage nodes GPFS Architecture - Node Roles General architecture

GPFS Software Structure (General architecture)
[diagram-only slide showing the GPFS software structure]

HRLHRL  Large block size allows efficient use of disk bandwidth  Fragments reduce space overhead for small files  No designated "mirror", no fixed placement function:  Flexible replication (e.g., replicate only metadata, or only important files)  Dynamic reconfiguration: data can migrate block-by-block  Multi level indirect blocks ?Each disk address: list of pointers to replicas ?Each pointer: disk id + sector no. Disk Data Structures: Files General architecture

HRLHRL  Conventional file systems store data in small blocks to pack data more densely  GPFS uses large blocks (256KB default) to optimize disk transfer speed Large File Block Size Performance

Parallelism and consistency
- Distributed locking - acquire the appropriate lock for every operation; used for updates to user data
- Centralized management - conflicting operations are forwarded to a designated node; used for file metadata
- Distributed locking + centralized hints - used for space allocation
- Central coordinator - used for configuration changes
- I/O slowdown shows up as additional I/O activity rather than token-server overload

HRLHRL  GPFS allows parallel applications on multiple nodes to access non- overlapping ranges of a single file with no conflict  Global locking serializes access to overlapping ranges of a file  Global locking based on "tokens" which convey access rights to an object (e.g. a file) or subset of an object (e.g. a byte range)  Tokens can be held across file system operations, enabling coherent data caching in clients  Cached data discarded or written to disk when token is revoked  Performance optimizations: required/desired ranges, metanode, data shipping, special token modes for file size operations Parallel File Access From Multiple Nodes Performance

HRLHRL  GPFS stripes successive blocks across successive disks  Disk I/O for sequential reads and writes is done in parallel  GPFS measures application "think time",disk throughput, and cache state to automatically determine optimal parallelism  Prefetch algorithms now recognize strided and reverse sequential access  Accepts hints  Write-behind policy Application reads at 15 MB/sec Each disk reads at 5 MB/sec Three I/Os executed in parallel Deep Prefetch for High Throughput Performance

HRLHRL  Hardware: Power2 wide nodes, SSA disks  Experiment: sequential read/write from large number of GPFS nodes to varying number of storage nodes  Result: throughput increases nearly linearly with number of storage nodes  Bottlenecks: ? microchannel limits node throughput to 50MB/s ? system throughput limited by available storage nodes GPFS Throughput Scaling for Non-cached Files Scalability

HRLHRL  Segmented Block Allocation MAP:  Each segment contains bits representing blocks on all disks  Each segment is a separately lockable unit  Minimizes contention for allocation map when writing files on multiple nodes  Allocation manager service provides hints which segments to try Similar: inode allocation map Disk Data Structures: Allocation map Scalability

HRLHRL  Problem: detect/fix file system inconsistencies after a failure of one or more nodes ? All updates that may leave inconsistencies if uncompleted are logged ? Write-ahead logging policy: log record is forced to disk before dirty metadata is written ? Redo log: replaying all log records at recovery time restores file system consistency  Logged updates: ? I/O to replicated data ? directory operations (create, delete, move,...) ? allocation map changes  Other techniques: ? ordered writes ? shadowing High Availability - Logging and Recovery High Availability

HRLHRL  Application node failure: ? force-on-steal policy ensures that all changes visible to other nodes have been written to disk and will not be lost ? all potential inconsistencies are protected by a token and are logged ? file system manager runs log recovery on behalf of the failed node after successful log recovery tokens held by the failed node are released ? actions taken: restore metadata being updated by the failed node to a consistent state, release resources held by the failed node  File system manager failure: ? new node is appointed to take over ? new file system manager restores volatile state by querying other nodes ? New file system manager may have to undo or finish a partially completed configuration change (e.g., add/delete disk)  Storage node failure: ? Dual-attached disk: use alternate path (VSD) ? Single attached disk: treat as disk failure Node Failure Recovery High Availability

HRLHRL  When a disk failure is detected ? The node that detects the failure informs the file system manager ? File system manager updates the configuration data to mark the failed disk as "down" (quorum algorithm)  While a disk is down ? Read one / write all available copies ? "Missing update" bit set in the inode of modified files  When/if disk recovers ? File system manager searches inode file for missing update bits ? All data & metadata of files with missing updates are copied back to the recovering disk (one file at a time, normal locking protocol) ? Until missing update recovery is complete, data on the recovering disk is treated as write-only  Unrecoverable disk failure ? Failed disk is deleted from configuration or replaced by a new one ? New replicas are created on the replacement or on other disks Handling Disk Failures

Cache Management
- The total cache is split into a general pool (clock list, merge, re-map) and per-block-size pools (clock list), each tracking sequential/random statistics (optimal, total)
- The balance between pools adapts dynamically to usage patterns (a re-balancing sketch follows)
- Avoids fragmentation - internal and external
- Unified steal (replacement) across pools
- Periodic re-balancing
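
A sketch of one possible periodic re-balancing policy, dividing a fixed total cache among per-block-size pools in proportion to recent usage; the slide only says the balance adapts to usage patterns, so the proportional rule here is an assumption.

    /* Illustrative sketch only: re-dividing a fixed cache budget among pools
     * in proportion to recent usage statistics. */
    #include <stdio.h>

    #define NUM_POOLS   3
    #define TOTAL_PAGES 4096

    typedef struct {
        unsigned block_size;   /* bytes served by this pool */
        unsigned recent_hits;  /* usage gathered since the last re-balance */
        unsigned quota;        /* pages currently assigned to this pool */
    } cache_pool_t;

    static void rebalance(cache_pool_t pools[], unsigned n) {
        unsigned total_hits = 0;
        for (unsigned i = 0; i < n; i++) total_hits += pools[i].recent_hits;
        if (total_hits == 0) return;
        for (unsigned i = 0; i < n; i++) {
            pools[i].quota = (unsigned)((unsigned long long)TOTAL_PAGES *
                                        pools[i].recent_hits / total_hits);
            pools[i].recent_hits = 0;   /* start a new measurement interval */
        }
    }

    int main(void) {
        cache_pool_t pools[NUM_POOLS] = {
            { 16 * 1024,  100, 0 },
            { 64 * 1024,  300, 0 },
            { 256 * 1024, 600, 0 },
        };
        rebalance(pools, NUM_POOLS);
        for (unsigned i = 0; i < NUM_POOLS; i++)
            printf("pool %u (block size %u): %u pages\n", i,
                   pools[i].block_size, pools[i].quota);
        return 0;
    }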

Epilogue
- Used on six of the ten most powerful supercomputers in the world, including the largest (ASCI White)
- Installed at several hundred customer sites, on clusters ranging from a few nodes with less than a TB of disk up to 512 nodes with 140 TB of disk in 2 file systems
- IP rich - ~20 filed patents
- State of the art: TeraSort
  - world record of 17 minutes
  - on a 488-node SP: 432 file system nodes and 56 storage nodes (604e, 332 MHz)
  - 6 TB total disk space
- References
  - GPFS home page
  - FAST 2002 paper ("GPFS: A Shared-Disk File System for Large Computing Clusters", Schmuck and Haskin)
  - TeraSort
  - Tiger Shark