Nswap: A Reliable, Adaptable Network RAM System for General Purpose Clusters
Tia Newhall, Daniel Amato, Alexandr Pshenichkin
Computer Science Department, Swarthmore College, Swarthmore, PA, USA
newhall@cs.swarthmore.edu
Cluster08, 2008

Network RAM
- Cluster nodes share each other's idle RAM as a remote swap partition.
- When one node's RAM is overcommitted, its pages are swapped out over the network and stored in the idle RAM of other nodes.
  + Avoids swapping to the slower local disk.
  + There is almost always a significant amount of idle RAM somewhere in the cluster, even when some nodes are overloaded.
[Figure: Node A swaps a page out over the network to idle RAM on Node B rather than to its local disk.]

Nswap Design Goals
- Scalable: no central authority.
- Adaptable: a node's RAM usage varies, and remotely swapped page data should not cause more swapping on the node storing it, so the amount of RAM made available for storing remote pages needs to grow and shrink with local usage.
- Fault tolerant: without reliability support, a single node failure can lose pages belonging to processes running on remote nodes, so one node's failure can affect unrelated processes on other nodes.

Nswap
- A network-swapping loadable kernel module (LKM) for Linux clusters; runs entirely in kernel space on an unmodified Linux 2.6 kernel.
- Completely decentralized: each node runs a multi-threaded client and server.
  - The client is active when the node is swapping; it uses local information to find a "good" server when it swaps out a page.
  - The server is active when the node has idle RAM available.
[Figure: two nodes, each running an Nswap client and server in kernel space; a page swapped out by Node A's client is stored in Node B's Nswap Cache, with the nodes connected through the Nswap communication layer over the network.]

How Pages Move Around the System
- Swap-out: a page moves from client A to server B.
- Swap-in: a page moves from server B back to client A (B remains the backing store for that page).
- Migrate: when server B shrinks its Nswap Cache, it sends stored pages to another server C.

Adding Reliability
- Reliability requires extra time and space; we want to minimize these costs, particularly for nodes that are swapping.
- Avoid reliability solutions that use disk: use cluster-wide idle RAM for reliability data instead.
- Any solution has to work with Nswap's:
  - dynamic resizing of the Nswap Cache,
  - varying Nswap Cache capacity at each node,
  - support for migrating remotely swapped page data between servers.
  => Reliability solutions that require fixed placement of page and reliability data won't work.

Centralized Dynamic Parity
- RAID-4-like: a single, dedicated parity server node.
- In large clusters, nodes are divided into Parity Partitions, and each partition has its own dedicated Parity Server.
- The Parity Server stores parity pages, keeps track of parity groups, and implements page recovery.
  + Clients and servers don't need to know about parity groups.
[Figure: nodes 1 through m-1 form Parity Partition 1 with its own Parity Server; nodes m+1 through 2m-1 form Parity Partition 2.]

Centralized Dynamic Parity (cont.)
- Like RAID 4:
  - parity group pages are striped across the cluster's idle RAM,
  - parity pages all live on a single parity server.
- With some differences:
  - parity group size and assignment are not fixed,
  - pages can leave and enter a given parity group (garbage collection, migration, merging of parity groups).
[Figure: data pages of Group 1 and Group 2 striped across Nodes 1-4, with each group's parity page P stored on the Parity Server.]

Page Swap-out, Case 1: New Page Swap
- Parity Pool at the client: the client stores a set of in-progress parity pages.
- As each page is swapped out, it is XORed into a page in the pool (see the sketch below).
  - Minor computation overhead on the client (XOR of 4 KB pages).
- As parity pages fill, they are sent to the Parity Server.
  - One extra page sent to the Parity Server every ~N swap-outs.
[Figure: the client XORs each swapped-out page into its Parity Pool as the pages go to servers; a completed parity page is sent to the Parity Server.]
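A minimal user-space sketch of the parity-pool idea, assuming a fixed parity-group size of 4 pages; parity_slot and parity_add_page are illustrative names, not Nswap's real structures. Each swapped-out 4 KB page is XORed into an in-progress parity page, and once the group fills, that parity page would be shipped to the Parity Server.

```c
#include <stdint.h>
#include <string.h>
#include <stdio.h>

#define PAGE_SIZE   4096
#define GROUP_SIZE  4          /* pages per parity group (assumed value) */

/* One in-progress parity page in the client's parity pool. */
struct parity_slot {
    uint8_t parity[PAGE_SIZE]; /* running XOR of swapped-out pages */
    int     count;             /* pages folded in so far           */
};

/* Fold a swapped-out page into the parity slot; return 1 when the
 * parity page is full and ready to be sent to the Parity Server.  */
int parity_add_page(struct parity_slot *slot, const uint8_t *page)
{
    for (size_t i = 0; i < PAGE_SIZE; i++)
        slot->parity[i] ^= page[i];          /* cheap XOR: the only client-side cost */

    if (++slot->count == GROUP_SIZE)
        return 1;                            /* caller sends slot->parity, then resets */
    return 0;
}

int main(void)
{
    struct parity_slot slot;
    uint8_t page[PAGE_SIZE];

    memset(&slot, 0, sizeof(slot));
    for (int n = 0; n < GROUP_SIZE; n++) {
        memset(page, n + 1, sizeof(page));   /* stand-in for a real swapped-out page */
        if (parity_add_page(&slot, page))
            printf("parity page full after %d swap-outs; send to Parity Server\n",
                   slot.count);
    }
    return 0;
}
```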

Page Swap-out, Case 2: Overwrite
- The server already has an old copy of the swapped-out page; the client simply sends the new page to the server.
  - No extra overhead on the client side vs. non-reliable Nswap.
- The server computes the XOR of the old and new versions of the page and sends it to the Parity Server (UPDATE_XOR) before overwriting the old version with the new one.
[Figure: Node A swaps the page out to Node B; Node B XORs the old and new versions and sends the result to the Parity Server, which XORs it into the stored parity page.]
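A small sketch of the server-side XOR just described; make_parity_delta is an illustrative name, not Nswap's function. XORing the old and new versions of the page yields exactly the delta the Parity Server must fold into the stored parity page, so only this one page travels to the Parity Server.

```c
#include <stdint.h>
#include <stddef.h>

#define PAGE_SIZE 4096

/* delta = old XOR new; sent to the Parity Server as the UPDATE_XOR payload. */
void make_parity_delta(const uint8_t old_page[PAGE_SIZE],
                       const uint8_t new_page[PAGE_SIZE],
                       uint8_t delta[PAGE_SIZE])
{
    for (size_t i = 0; i < PAGE_SIZE; i++)
        delta[i] = old_page[i] ^ new_page[i];
}

int main(void)
{
    static uint8_t old_page[PAGE_SIZE], new_page[PAGE_SIZE], delta[PAGE_SIZE];
    new_page[0] = 0xFF;                       /* pretend one byte changed */
    make_parity_delta(old_page, new_page, delta);
    return delta[0] == 0xFF ? 0 : 1;          /* delta captures exactly the change */
}
```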

Page Recovery
- The node that detects the failure sends a RECOVERY message to the Parity Server.
- Page recovery runs concurrently with cluster applications.
- The Parity Server rebuilds all pages that were stored at the crashed node.
- As it recovers each page, it migrates the page to a non-failed Nswap server.
  - The page may stay in the same parity group or be added to a new one.
- The server receiving the recovered page tells the client of the page's new location.
[Figure: the Parity Server XORs the parity page with the pages held by the surviving servers in the parity group, migrates the recovered page to a new server, and that server updates the lost page's owner.]
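A sketch of the XOR-based rebuild, using the standard parity identity that a lost page equals the group's parity page XORed with all surviving data pages of the group; recover_page and the 3-page group in main are illustrative, not Nswap's code.

```c
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE 4096

/* Rebuild the page lost in a crash from the parity page and the pages
 * still held by the surviving servers in the same parity group. */
void recover_page(const uint8_t *parity_page,
                  const uint8_t *surviving[], int n_surviving,
                  uint8_t *recovered)
{
    memcpy(recovered, parity_page, PAGE_SIZE);
    for (int p = 0; p < n_surviving; p++)
        for (size_t i = 0; i < PAGE_SIZE; i++)
            recovered[i] ^= surviving[p][i];
    /* the recovered page would then be migrated to a live server, and the
     * owning client told its new location */
}

int main(void)
{
    static uint8_t a[PAGE_SIZE], b[PAGE_SIZE], lost[PAGE_SIZE];
    static uint8_t parity[PAGE_SIZE], out[PAGE_SIZE];
    a[0] = 1; b[0] = 2; lost[0] = 7;
    for (size_t i = 0; i < PAGE_SIZE; i++)
        parity[i] = a[i] ^ b[i] ^ lost[i];    /* parity over a 3-page group */
    const uint8_t *surviving[] = { a, b };
    recover_page(parity, surviving, 2, out);
    return out[0] == 7 ? 0 : 1;               /* lost page rebuilt from parity */
}
```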

Decentralized Dynamic Parity
- Like RAID 5: no dedicated parity server; data pages and parity pages are striped across the Nswap servers.
  + Not limited by a Parity Server's RAM capacity or by Parity Partitioning.
  - Every node is now a Client, a Server, and a Parity Server.
- Each data page is stored with its parity server and parity group ID, since for each page we need to know its parity server and which group it belongs to (see the sketch below).
  - A page's parity group ID and parity server can change due to migration or the merging of two small parity groups: they are first set by the client at swap-out (as part of parity logging), and a server can change them when the page is migrated or parity groups are merged.
- The client still uses a parity pool; it finds a node to take the parity page as it starts a new parity group.
  - One extra message per parity group to find a server for the parity page.
- Every Nswap server has to recover lost pages that belong to parity groups whose parity page it stores.
  +/- Decentralized recovery.
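A hypothetical per-page metadata layout matching the bookkeeping this slide describes; the struct and field names are assumptions, not Nswap's actual data structures.

```c
#include <stdint.h>
#include <stdio.h>

/* In the decentralized scheme, each stored data page carries enough
 * metadata to find its parity page: which node currently holds the parity
 * page and which parity group the page belongs to. Both fields can change
 * on migration or when two small parity groups are merged. */
struct nswap_page_meta {
    uint32_t owner_node;     /* client that swapped the page out            */
    uint32_t slot;           /* client-side swap slot index                 */
    uint32_t parity_server;  /* node currently storing the parity page      */
    uint32_t parity_group;   /* parity group ID, set by client at swap-out  */
};

int main(void)
{
    printf("per-page metadata: %zu bytes\n", sizeof(struct nswap_page_meta));
    return 0;
}
```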

Kernel Benchmark Results

Execution time (speedup over disk swapping in parentheses):

Workload                    Swapping to Disk   Nswap (No Reliability)   Nswap (Centralized Parity)
(1) Sequential R&W          220.31             116.28 (speedup 1.9)     117.10 (1.9)
(2) Random R&W              2462.90            105.24 (23.4)            109.15 (22.6)
(3) Random R&W & File I/O   3561.66            105.50 (33.8)            110.19 (32.3)

8-node Linux 2.6 cluster (Pentium 4, 512 MB RAM, TCP/IP over 1 Gbit Ethernet, 80 GB IDE disk at 100 MB/s).
Workloads:
(1) Sequential reads and writes to a large chunk of memory (the best case for disk swapping).
(2) Random reads and writes to memory (more disk-arm seeks within the swap partition).
(3) One process doing large file I/O plus one running workload (2) (disk-arm seeks between the swap and file partitions).

Parallel Benchmark Results

Workload   Swapping to Disk   Nswap (No Reliability)   Nswap (Centralized Parity)
Linpack    1745.05            418.26 (speedup 4.2)     415.02 (4.2)
LU         33464.99           3940.12 (8.5)            109.15 (8.2)
Radix      464.40             96.01 (4.8)              97.65 (4.8)
FFT        156.58             94.81 (1.7)              95.95 (1.6)

8-node Linux 2.6 cluster (Pentium 4, 512 MB RAM, TCP/IP over 1 Gbit Ethernet, 80 GB IDE disk at 100 MB/s).
Application processes run on half of the nodes (the Nswap clients); the other half run no benchmark processes and act as Nswap servers.

Recovery Results
- Timed execution of applications with and without concurrent page recovery (a simulated node failure and recovery of the pages it lost): concurrent recovery does not slow down the application.
- Measured the time it takes the Parity Server to recover each page of lost data: ~7,000 pages recovered per second.
  - Parity group size ~5: 0.15 ms per page.
  - Parity group size ~6: 0.18 ms per page.

Conclusions
- Nswap's adaptable design makes adding reliability support difficult.
- Our dynamic parity solutions overcome these difficulties and should provide the best solutions in terms of time and space efficiency.
- Results from testing our Centralized solution support implementing the Decentralized solution:
  + more adaptable
  + no dedicated Parity Server or its fixed-size RAM limitations
  - more complicated protocols
  - more overlapping, potentially interfering operations
  - each node is now a Client, Server, and Parity Server

Acknowledgments
Swarthmore students: Dan Amato '07, Alexandr Pshenichkin '07, Jenny Barry '07, Heather Jones '06, America Holloway '05, Ben Mitchell '05, Julian Rosse '04, Matti Klock '03, Sean Finney '03, Michael Spiegel '03, Kuzman Ganchev '03.
More information: http://www.cs.swarthmore.edu/~newhall/nswap.html

Nswap's Design Goals
- Transparent: the user should not have to do anything to enable swapping over the network.
- Adaptable: a network RAM system that constantly runs on a cluster must adjust to changes in each node's local memory usage; local processes should get local RAM before remote processes do.
- Efficient: swapping in and out should be fast, and the system should use a minimal amount of local memory for state.
- Scalable: the system should scale to large clusters (or networked systems).
- Reliable: a crash of one node should not affect unrelated processes running on other nodes.

Complications
- Simultaneous conflicting operations: asynchrony and threading allow many fast operations at once, but some overlapping operations can conflict (e.g., a migration and a new swap-out for the same page).
- Garbage pages in the system: when a process terminates, we need to remove its remotely swapped pages from servers, but the swap interface has no call telling the device to free slots, since this isn't a problem for disk swap.
- Node failure: can lose remotely swapped page data.

How Pages Move Around the System
- SWAP-OUT: Node A's Nswap client asks Node B's Nswap server "SWAP_OUT?" for slot i; the server answers OK, the page is sent and stored in B's Nswap Cache, and B is recorded for slot i in A's shadow slot map.
- SWAP-IN: Node A's Nswap client sends SWAP_IN for slot i to Node B's Nswap server, which returns the page from its Nswap Cache.
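A hypothetical C rendering of the message exchange sketched above; the enum values and header layout are illustrative and do not reflect Nswap's actual wire format.

```c
#include <stdint.h>

/* Illustrative message set modelling the slide's protocol: a swap-out is a
 * request/acknowledge pair followed by the page data; a swap-in names the
 * slot and gets the page back; servers also migrate pages among themselves. */
enum nswap_msg {
    NSWAP_SWAP_OUT_REQ,   /* client -> server: "can you take the page for slot i?" */
    NSWAP_SWAP_OUT_OK,    /* server -> client: yes, send the 4 KB page             */
    NSWAP_SWAP_IN_REQ,    /* client -> server: "send me the page for slot i"       */
    NSWAP_SWAP_IN_DATA,   /* server -> client: the requested page                  */
    NSWAP_MIGRATE         /* server -> server: move a stored page                  */
};

/* Assumed fixed-size header preceding any page data on the wire. */
struct nswap_msg_hdr {
    uint32_t type;   /* one of enum nswap_msg          */
    uint32_t slot;   /* client swap-slot index i       */
    uint32_t len;    /* bytes of page data that follow */
};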

Nswap Client
- Implemented as a device driver and added as a swap device on each node; the kernel swaps pages to it just like any other swap device.
- A shadow slot map stores state about the remote location of each swapped-out page.
  - Extra space overhead that must be minimized.
- Swapping out a page (see the sketch below):
  (1) the kernel finds a free swap slot i in its slot map,
  (2) the kernel calls our driver's write function,
  (3) the driver adds the server's info for slot i to the shadow slot map,
  (4) the driver sends the page to that server (B).
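A rough user-space model of the shadow slot map and step (3) above, with assumed names and layout (not Nswap's real structures): the kernel's own slot map only records that slot i is in use, so the driver keeps a parallel array mapping each slot to the server that holds its page.

```c
#include <stdint.h>
#include <stdlib.h>

/* One shadow entry per swap slot: which server is storing that slot's page. */
struct shadow_entry {
    uint32_t server_ip;   /* node storing the page, 0 = slot unused */
};

struct shadow_slot_map {
    struct shadow_entry *slots;
    size_t               nslots;
};

/* Called from the driver's write path after the kernel picks free slot i
 * and before the page is sent to the chosen server. */
void shadow_record(struct shadow_slot_map *m, size_t i, uint32_t server_ip)
{
    m->slots[i].server_ip = server_ip;
}

int main(void)
{
    struct shadow_slot_map m = { calloc(1024, sizeof(struct shadow_entry)), 1024 };
    shadow_record(&m, 42, 0x0A000002);   /* slot 42 now lives on server 10.0.0.2 */
    free(m.slots);
    return 0;
}
```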

Nswap Server
- The Nswap Cache manages the local idle RAM currently allocated for storing remote pages.
- Handles swapping requests:
  - swap-out: allocate a page of RAM to store the remote page,
  - swap-in: fast lookup of a page it stores.
- Grows and shrinks the amount of local RAM it makes available based on the node's local memory usage (see the sketch below):
  - acquires pages from the paging system when there is idle RAM,
  - releases pages to the paging system when they are needed locally; remotely swapped page data may then be migrated to other servers.
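A toy sketch of a grow/shrink policy consistent with this slide; the reserve threshold and function name are assumptions, not Nswap's policy. The idea is that the Nswap Cache targets only the RAM the node can truly spare, shrinking toward zero (and migrating stored pages away) as local memory pressure rises.

```c
#include <stdio.h>

/* Target Nswap Cache size, in pages: whatever local free RAM exceeds a
 * reserve kept for local processes. */
long nswap_cache_target(long free_local_pages, long reserve_pages)
{
    long spare = free_local_pages - reserve_pages;  /* RAM the node can spare */
    return spare > 0 ? spare : 0;   /* shrink to 0 under pressure, grow as RAM idles */
}

int main(void)
{
    printf("idle node:   cache %ld pages\n", nswap_cache_target(100000, 16384));
    printf("loaded node: cache %ld pages\n", nswap_cache_target(8000, 16384));
    return 0;
}
```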

Finding a Server to Take a Page
- The client uses local information to pick the "best" server.
- A local IP Table stores the available RAM for each node:
  - servers periodically broadcast their size values,
  - clients update entries as they swap to servers.
- The IP Table also caches open sockets to nodes.
  + No centralized remote-memory server.
- On a swap-out, the client looks up a good candidate server in the IP Table, gets an open socket to it, and records that server in the shadow slot map (see the sketch below).
[Figure: example IP Table with HOST, available RAM, and open-socket columns, e.g. B 20, C 10, F 35.]
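A toy version of the client-side choice, using the example values from the slide's IP Table; the table layout and pick_server are assumptions, not Nswap's implementation, and a real policy might weigh more than just the advertised free RAM.

```c
#include <stdint.h>
#include <stddef.h>

/* One cached IP Table entry per known server. */
struct ip_table_entry {
    uint32_t host;        /* server's address                       */
    long     avail_pages; /* last size value broadcast by that host */
    int      sock;        /* cached open socket, -1 if none         */
};

/* Pick the server currently advertising the most available RAM. */
int pick_server(const struct ip_table_entry *tbl, size_t n)
{
    int best = -1;
    long best_avail = 0;
    for (size_t i = 0; i < n; i++) {
        if (tbl[i].avail_pages > best_avail) {
            best_avail = tbl[i].avail_pages;
            best = (int)i;
        }
    }
    return best;   /* -1 means no server has room; fall back to disk swap */
}

int main(void)
{
    struct ip_table_entry tbl[] = {
        { 0x0A000002, 20, -1 },   /* host B: 20 pages available */
        { 0x0A000003, 10, -1 },   /* host C: 10 pages           */
        { 0x0A000006, 35, -1 },   /* host F: 35 pages           */
    };
    int best = pick_server(tbl, 3);
    return best == 2 ? 0 : 1;     /* picks host F, the most idle RAM */
}
```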

Solution 1: Mirroring
- On swap-out: send the page to both a primary and a back-up server.
- On migrate: if the new server already has a copy of the page, it will not accept the MIGRATE request and the old server picks another candidate.
  + Easy to implement.
  - Two pages sent on every swap-out.
  - Requires 2x as much RAM space for pages.
  - Increases the size of the shadow slot map.
[Figure: Node A swaps the same page out to both Node B and Node C.]