PETAL: DISTRIBUTED VIRTUAL DISKS
E. K. Lee and C. A. Thekkath, DEC SRC

Highlights
The paper presents a distributed storage management system:
– Petal consists of a collection of network-connected servers that cooperatively manage a pool of physical disks
– Clients see Petal as highly available block-level storage partitioned into virtual disks

Introduction
Petal is a distributed storage system that
– Tolerates single component failures
– Can be geographically distributed to tolerate site failures
– Transparently reconfigures to expand in performance or capacity
– Uniformly balances load and capacity
– Provides fast, efficient support for backup and recovery

Petal User Interface
Petal appears to its clients as a collection of virtual disks:
– Block-level interface
– Lower-level service than a DFS
– Makes the system easier to model, design, implement, and tune
– Can support heterogeneous clients and applications
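
To make the block-level abstraction concrete, here is a minimal sketch of what a client-side wrapper around a Petal virtual disk might look like; the class and method names (VirtualDisk, rpc.read, rpc.write) are illustrative assumptions, not Petal's actual RPC interface.

```python
# Hypothetical sketch of the client's view; the names here are assumptions,
# not the calls of Petal's real 24-call RPC interface.
class VirtualDisk:
    """A Petal virtual disk as seen by a client: a flat array of blocks."""

    def __init__(self, rpc, vdisk_id):
        self.rpc = rpc            # stub that talks to any Petal server
        self.vdisk_id = vdisk_id  # identifies the virtual disk, not a file

    def read(self, offset, count):
        # The client names data only by (virtual disk, offset); it never sees
        # which server or physical disk actually holds the blocks.
        return self.rpc.read(self.vdisk_id, offset, count)

    def write(self, offset, data):
        return self.rpc.write(self.vdisk_id, offset, data)

# A file system (BSD FFS, NTFS, ext2, ...) can be built directly on top of
# such a disk, which is what makes the service lower-level than a DFS.
```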

Client view
[Diagram: clients running BSD FFS, NTFS, and EXT2 FS see Petal virtual disks through a scalable network]

Physical view
[Diagram: the same clients (BSD FFS, NTFS, EXT2 FS) connect through a scalable network to a set of Petal storage servers]

Petal Server Modules
– Liveness Module
– Global State Module
– Data Access Module
– Recovery Module
– Virtual-to-Physical Translation Module

Overall design (I)
All state information is maintained on servers
– Clients maintain only hints
The liveness module ensures that all servers agree on the system's operational status
– Uses majority consensus and periodic exchanges of "I'm alive"/"You're alive?" messages
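
A minimal sketch of this heartbeat idea follows; the names (LivenessModule, send_heartbeat) and the timeout values are invented for illustration and are not Petal's actual liveness protocol.

```python
# Sketch: servers periodically exchange "I'm alive" messages, and a server
# only treats the system as operational if a majority responded recently.
import time

FAILURE_TIMEOUT = 5.0   # silence longer than this marks a server as down (assumed value)

class LivenessModule:
    def __init__(self, my_id, all_server_ids, send_heartbeat):
        self.my_id = my_id
        self.all_servers = list(all_server_ids)
        self.send_heartbeat = send_heartbeat          # callable(server_id) -> None
        self.last_heard = {s: time.time() for s in self.all_servers}

    def on_heartbeat(self, sender_id):
        # Called when an "I'm alive" message arrives from another server.
        self.last_heard[sender_id] = time.time()

    def tick(self):
        # Periodically announce ourselves to every other server.
        for s in self.all_servers:
            if s != self.my_id:
                self.send_heartbeat(s)

    def live_servers(self):
        now = time.time()
        return [s for s in self.all_servers
                if s == self.my_id or now - self.last_heard[s] < FAILURE_TIMEOUT]

    def in_operational_majority(self):
        # Majority consensus: only act while we can hear from most servers,
        # so a partitioned minority cannot keep serving on its own.
        return len(self.live_servers()) > len(self.all_servers) // 2
```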

Overall design (II)
Information describing
– the current members of the storage system and
– the currently supported virtual disks
is replicated across all servers
The global state module keeps this information consistent
– Uses Lamport's Paxos algorithm
– Assumes fail-silent server failures

Overall design (III)
Data access and recovery modules
– Control how client data are distributed and stored
– Support two data placement schemes:
   Simple data striping without redundancy
   Chained declustering, which distributes mirrored data in a way that balances load in the event of a failure

Address translation (I)
Must translate virtual addresses into physical addresses
The mechanism should be fast and fault-tolerant

Address translation (II)
Uses three replicated data structures:
– Virtual disk directory (VDir): translates a virtual disk ID into a global map ID
– Global map (GMap): locates the server responsible for translating the given offset (block number)
– Physical map (PMap): locates the physical disk and computes the physical offset within that disk

Virtual-to-physical mapping
[Diagram: a client presents (vdiskID, offset) to any server; the replicated VDir and GMap route the request to the server whose PMap yields the diskID and diskOffset]

Address translation (III)
Three-step process:
1. The VDir translates the virtual disk ID given by the client into a GMap ID
2. The specified GMap finds the server that can translate the given offset
3. The PMap of that server translates the GMap ID and offset into a physical disk and a disk offset
The last two steps are almost always performed by the same server

Address translation (IV)
There is one GMap per virtual disk
That GMap specifies
– The tuple of servers spanned by the virtual disk
– The redundancy scheme used to protect data
GMaps are immutable
– Cannot be modified
– Must create a new GMap to change the mapping

Address translation (V)
PMaps are similar to page tables
– Each PMap entry maps 64 KB of physical disk space
– The server that performs the translation will usually perform the disk I/O
Keeping GMaps and PMaps separate minimizes the amount of global information that must be replicated
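
The toy code below walks through the three translation steps using the structures above; the concrete map contents, the offset ranges, and the server-selection rule are invented for illustration, while the 64 KB PMap granularity comes from the slide.

```python
# Toy illustration of the VDir -> GMap -> PMap translation path; the concrete
# maps below are made up, and real GMaps also record the redundancy scheme.
PMAP_ENTRY_SIZE = 64 * 1024          # each PMap entry covers 64 KB of disk

# Virtual disk directory (replicated on every server): vdiskID -> GMap ID
vdir = {7: 3}

# Global map 3 (immutable): which server translates which offset range
gmap = {3: [("server0", 0, 1 << 20), ("server1", 1 << 20, 2 << 20)]}

# Physical maps, kept per server: entry index -> (physical disk, base offset)
pmaps = {
    "server0": {0: ("disk5", 0), 1: ("disk2", 5 * PMAP_ENTRY_SIZE)},
    "server1": {0: ("disk9", 0)},
}

def translate(vdisk_id, offset):
    gmap_id = vdir[vdisk_id]                         # step 1: VDir lookup
    for server, lo, hi in gmap[gmap_id]:             # step 2: GMap picks a server
        if lo <= offset < hi:
            break
    else:
        raise ValueError("offset outside the virtual disk")
    entry = (offset - lo) // PMAP_ENTRY_SIZE         # step 3: that server's PMap
    disk, base = pmaps[server][entry]
    return server, disk, base + (offset - lo) % PMAP_ENTRY_SIZE

print(translate(7, 70 * 1024))       # ('server0', 'disk2', 333824)
```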

Support for backups
Petal supports snapshots of virtual disks
Snapshots are immutable copies of virtual disks
– Created using copy-on-write
The VDir maps a virtual disk ID to a (GMap ID, epoch number) pair
– The epoch number distinguishes the current version of a virtual disk from snapshots of past versions
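
A toy sketch of copy-on-write snapshots keyed by epoch numbers follows; the SnapshotStore class and its in-memory block map are invented to illustrate the idea and are not Petal's actual metadata.

```python
# Sketch: writes are tagged with the current epoch, so taking a snapshot is
# just an epoch bump and old data is never overwritten in place.
class SnapshotStore:
    def __init__(self):
        self.epoch = 0
        self.blocks = {}              # (block_number, epoch) -> data

    def snapshot(self):
        # Taking a snapshot advances the epoch; no data is copied here.
        self.epoch += 1

    def write(self, block, data):
        # Copy-on-write: new data always lands under the current epoch,
        # leaving blocks written in earlier epochs intact for old snapshots.
        self.blocks[(block, self.epoch)] = data

    def read(self, block, epoch=None):
        # A read at epoch e returns the newest version written at or before e.
        e = self.epoch if epoch is None else epoch
        while e >= 0:
            if (block, e) in self.blocks:
                return self.blocks[(block, e)]
            e -= 1
        return b"\0"                  # never-written blocks read as zeros

store = SnapshotStore()
store.write(0, b"v1")
store.snapshot()                      # freeze epoch 0
store.write(0, b"v2")                 # does not disturb the snapshot
assert store.read(0) == b"v2" and store.read(0, epoch=0) == b"v1"
```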

Incremental reconfiguration (I)
Used to add or remove servers and disks
Three simple steps:
1. Create a new GMap
2. Update the VDir entries
3. Redistribute the data
The challenge is to perform the reconfiguration concurrently with normal client requests

Incremental reconfiguration (II)
To solve the problem (see the sketch below)
– Read requests
   Try the new GMap first
   Switch to the old GMap if the new GMap has no appropriate translation
– Write requests always use the new GMap
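
These two rules can be summarized in a few lines; the GMap objects, their load/store methods, and the None-for-unmoved convention below are assumptions made for illustration.

```python
# Sketch of the request paths used while data is being redistributed.
def write_block(new_gmap, block, data):
    # Writes always go through the new GMap, so fresh data is never
    # stranded in the old layout.
    new_gmap.store(block, data)

def read_block(new_gmap, old_gmap, block):
    # Reads try the new GMap first and fall back to the old one if the
    # block has not been relocated yet (assumed to be signalled by None).
    data = new_gmap.load(block)
    if data is None:
        data = old_gmap.load(block)
    return data
```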

Incremental reconfiguration (III)
Observe that the new GMap must be created before any data are moved
– Too many read requests would then have to consult both GMaps
– This seriously degrades system performance
Instead, perform the change incrementally over a small fenced region of the virtual disk at a time

Chained declustering (I)
[Diagram: blocks D0–D7 of a virtual disk mirrored over four servers]
Server 0: D0 D3 D4 D7
Server 1: D1 D0 D5 D4
Server 2: D2 D1 D6 D5
Server 3: D3 D2 D7 D6
Each server holds the primary copy of some blocks (e.g. D0 and D4 on Server 0) and the secondary copy of the preceding server's blocks (e.g. D3 and D7 on Server 0)

Chained declustering (II)
If one server fails, its workload will be almost equally distributed among the remaining servers
Petal uses a primary/secondary scheme for managing the copies
– Read requests can go to either the primary or the secondary copy
– Write requests must go first to the primary copy
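
A small sketch of chained-declustered placement and these read/write rules, assuming the usual formulation in which block i's primary copy lives on server i mod n and its secondary copy on the next server (this matches the D0–D7 layout two slides back); the servers dictionary and its store call are illustrative.

```python
def replicas(block, n_servers):
    # Primary copy on server (block mod n), secondary on the next server.
    primary = block % n_servers
    secondary = (block + 1) % n_servers
    return primary, secondary

def read(block, n_servers, is_up):
    # Reads may be served by either copy, so when one server fails its read
    # traffic is split between its two neighbours instead of doubling on one.
    primary, secondary = replicas(block, n_servers)
    return primary if is_up[primary] else secondary

def write(block, data, n_servers, servers):
    # Writes go to the primary copy first, then to the secondary copy.
    primary, secondary = replicas(block, n_servers)
    servers[primary].store(block, data)
    servers[secondary].store(block, data)
```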

Petal prototype
Four servers
– Each has fourteen 4.3 GB disks
Four clients
Links are 155 Mb/s ATM links
The Petal RPC interface has 24 calls

Latency of a virtual disk

Throughput of a virtual disk
Throughput is mostly limited by CPU overhead (233 MHz CPUs!)

File system performance (Modified Andrew Benchmark)

Conclusion
A block-level interface is simpler and more flexible than a file system interface
The use of distributed software solutions allows geographic distribution
Petal's performance is acceptable except for write requests
– Writes must wait for both the primary and the secondary copy to be successfully updated

Paxos: the main idea
Proposers propose decision values from an arbitrary input set and try to collect acceptances from a majority of the accepters
Learners observe this ratification process and attempt to detect that ratification has occurred
Agreement is enforced because any two majorities of accepters overlap, so only one value can gain a majority of votes

Paxos: the assumptions
An algorithm for consensus in a message-passing system
Assumes the existence of failure detectors that let processes give up on stalled processes after some amount of time
Processes can act as proposers, accepters, and learners
– A single process may combine all three roles

Paxos: the tricky part
The tricky part is to avoid deadlocks when
– There is more than one proposal
– Some of the processes fail
Paxos lets
– Proposers make new, higher-numbered proposals
– Accepters release their earlier votes for losing proposals
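
A compact single-decree sketch of the protocol, written to match the vocabulary of these slides (proposers and accepters); learners, message passing, retries with higher proposal numbers, and failure detection are all omitted, and the class and variable names are invented.

```python
# Sketch of single-decree Paxos: a later proposal must adopt the value of the
# highest-numbered proposal already accepted, which is what forces agreement.
class Accepter:
    def __init__(self):
        self.promised = -1            # highest proposal number promised
        self.accepted = (-1, None)    # (proposal number, value) last accepted

    def prepare(self, n):
        # Promise to ignore proposals numbered below n; report any prior vote.
        if n > self.promised:
            self.promised = n
            return True, self.accepted
        return False, self.accepted

    def accept(self, n, value):
        if n >= self.promised:
            self.promised = n
            self.accepted = (n, value)
            return True
        return False

def propose(accepters, n, value):
    majority = len(accepters) // 2 + 1
    # Phase 1: collect promises from a majority of accepters.
    promises = [a.prepare(n) for a in accepters]
    granted = [acc for ok, acc in promises if ok]
    if len(granted) < majority:
        return None                   # must retry with a higher proposal number
    # Adopt the value of the highest-numbered proposal already accepted, if any;
    # this prevents two competing proposals from choosing different values.
    prior = max(granted, key=lambda acc: acc[0])
    if prior[1] is not None:
        value = prior[1]
    # Phase 2: ask the accepters to accept (n, value).
    votes = sum(a.accept(n, value) for a in accepters)
    return value if votes >= majority else None

accepters = [Accepter() for _ in range(5)]
print(propose(accepters, n=1, value="new-gmap-A"))   # -> "new-gmap-A"
print(propose(accepters, n=2, value="new-gmap-B"))   # still "new-gmap-A"
```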