Leases and cache consistency. Jeff Chase, Fall 2015.

Distributed mutual exclusion
It is often necessary to grant some node/process the "right" to "own" some given data or function. Ownership rights often must be mutually exclusive:
– At most one owner at any given time.
How to coordinate ownership?

One solution: lock service
[Diagram: clients A and B each send acquire to a lock service; the service grants the lock to one client at a time, each client executes x=x+1 while holding it, then releases.]

Definition of a lock (mutex)
Acquire + release ops on L are strictly paired.
– After acquire completes, the caller holds (owns) the lock L until the matching release.
Acquire + release pairs on each L are ordered.
– Total order: each lock L has at most one holder.
– That property is mutual exclusion; L is a mutex.
Some lock variants weaken mutual exclusion in useful and well-defined ways.
– Reader/writer or SharedLock: see OS notes (later).
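For concreteness, here is a minimal single-machine sketch of these rules (ordinary Python threads and a threading.Lock standing in for nodes and a lock service):

    import threading

    x = 0
    L = threading.Lock()          # L has at most one holder at any given time

    def increment():
        global x
        L.acquire()               # caller now holds (owns) L...
        try:
            x = x + 1             # ...so this update cannot interleave with another's
        finally:
            L.release()           # ...until the matching release

    threads = [threading.Thread(target=increment) for _ in range(10)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(x)                      # always 10: acquire/release pairs are totally ordered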

A lock service in the real world
[Diagram: client A acquires the lock and then fails (X) before releasing it; client B's acquire is left waiting indefinitely. Who releases the lock?]

Leases (leased locks)
A lease is a grant of ownership or control for a limited time.
The owner/holder can renew or extend the lease.
If the owner fails, the lease expires and is free again.
The lease might end early.
– The lock service may recall or evict it.
– The holder may release or relinquish it.
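A holder-side sketch of these rules in Python (names and structure are hypothetical; a real lease service also needs renewal RPCs, recall callbacks, and server-side state):

    import time

    class Lease:
        """Holder-side view of a leased lock: an illustrative sketch."""

        def __init__(self, duration):
            self.duration = duration
            self.expires = time.monotonic() + duration    # ownership for a limited time

        def valid(self):
            # If the holder fails (or just stops renewing), the lease simply runs out
            # and the lock service can grant it to someone else.
            return time.monotonic() < self.expires

        def renew(self):
            # The owner/holder can renew or extend the lease while it is still valid.
            if self.valid():
                self.expires = time.monotonic() + self.duration

        def release(self):
            # The lease might end early: the holder relinquishes it voluntarily
            # (or the service recalls/evicts it).
            self.expires = 0.0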

A lease service in the real world
[Diagram: client A acquires a lease and then fails (X); after the lease expires, the service grants it to client B, which executes x=x+1 and releases.]

A network partition
A network partition is any event that blocks all message traffic between subsets of nodes.

Two kings?
[Diagram: client A holds the lease but is cut off by a partition (X?); the service grants the lease to client B, which updates x. Could A and B both act as owner at the same time?]

Never two kings at once
[Diagram: client A's lease expires during the partition, so A stops acting as owner; only after that does the service grant the lease to client B, which executes x=x+1 and releases.]

Leases and time
The lease holder and the lease service must agree when a lease has expired.
– i.e., that its expiration time is in the past
– Even if they can't communicate!
We all have our own clocks, but do they agree? (synchronized clocks)
For leases, it is sufficient for the clocks to have a known bound ε on clock drift:
– |T(C_i) − T(C_j)| < ε
– Build slack time > ε into the lease protocols as a safety margin.
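A sketch of that safety margin (EPSILON and the function names are illustrative, not from any particular system):

    import time

    EPSILON = 0.5   # assumed known bound on drift between the two clocks, in seconds

    def holder_may_use(expires_at, now=None):
        """Holder side: stop using the lease EPSILON early, so it never acts on a
        lease that the service already considers expired."""
        now = time.time() if now is None else now
        return now < expires_at - EPSILON

    def service_may_regrant(expires_at, now=None):
        """Service side: wait an extra EPSILON past expiry before granting the lease
        to another node, so the grant never overlaps with a slow holder's clock."""
        now = time.time() if now is None else now
        return now > expires_at + EPSILON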

Using locks to coordinate data access
Ownership transfers on a lock are serialized.
[Diagram: clients A and B access a storage service SS while holding the lock: A writes W(x)=v and releases; B is granted the lock, reads R(x) and observes v, then writes W(x)=u.]

Coordinating data access
[Diagram: the same lock-mediated accesses to the storage service SS as on the previous slide.]
Thought question: must the storage service integrate with the lock service? Or, put another way: does my memory system need to see the synchronization accesses made by the processors?

History

Network File System (NFS, 1985)
[Figure, ucla.edu: the NFS architecture, built on Remote Procedure Call (RPC) and External Data Representation (XDR).]

NFS: revised picture
[Diagram: applications on the client call into the client's FS and buffer cache, which communicate over the network with the file server's FS and buffer cache.]

Multiple clients
[Diagram: several clients, each with applications, an FS layer, and a buffer cache, all sharing one file server.]

Multiple clients
[Diagram: a client issues Read(server=xx.xx…, inode=i27412, blockID=27, …) to the shared file server and caches the block.]

Multiple clients
[Diagram: a client issues Write(server=xx.xx…, inode=i27412, blockID=27, …), modifying that block.]

Multiple clients
What if another client reads that block? Will it get the right data? What is the "right" data? Will it get the "last" version of the block written?
How to coordinate reads/writes and caching on multiple clients? How to keep the copies "in sync"?

Cache consistency
How do we ensure that each read sees the value stored by the most recent write (or at least some reasonable value)?
This problem also appears in multi-core architectures, and in distributed data systems of various kinds (e.g., DNS, the Web).
Various solutions are available:
– It may be OK for clients to read data that is "a little bit stale".
– In some cases, the clients themselves don't change the data.
But for "strong" consistency (single-copy semantics) we can use leased locks… and we have to integrate them with the cache.

Lease example: network file cache
A read lease ensures that no other client is writing the data. The holder is free to read from its cache.
A write lease ensures that no other client is reading or writing the data. The holder is free to read/write from its cache.
A writer must push modified (dirty) cached data to the server before relinquishing its write lease.
– This ensures that another client sees all updates before it is able to acquire a lease allowing it to read or write.
If some client requests a conflicting lock, the server may recall or evict existing leases.
– Callback RPC from server to lock holder: "please release now."
– Writers get a grace period to push cached writes and release.
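A sketch of the client-side rules in Python (ToyServer, the method names, and the block IDs are all hypothetical; real systems add lease timeouts, retries, and per-block lease state on the server):

    class ToyServer:
        """Stand-in for the file server + lock service, just enough to run the sketch."""
        def __init__(self):
            self.blocks = {}
        def acquire_lease(self, block, mode): pass    # a real server blocks or recalls here
        def release_lease(self, block): pass
        def fetch(self, block): return self.blocks.get(block, b"")
        def push(self, block, data): self.blocks[block] = data

    class CachingClient:
        def __init__(self, server):
            self.server = server
            self.cache = {}        # blockID -> data held under a lease
            self.dirty = set()     # blocks written locally, not yet pushed

        def read(self, block):
            self.server.acquire_lease(block, mode="read")   # no other client is writing
            if block not in self.cache:
                self.cache[block] = self.server.fetch(block)
            return self.cache[block]

        def write(self, block, data):
            self.server.acquire_lease(block, mode="write")  # exclusive: no readers or writers
            self.cache[block] = data
            self.dirty.add(block)

        def recall(self, block):
            # Callback from the server: "please release now."
            if block in self.dirty:
                self.server.push(block, self.cache[block])  # push dirty data before releasing
                self.dirty.discard(block)
            self.cache.pop(block, None)
            self.server.release_lease(block)

    # Usage: a dirty cached block becomes visible to others only after the recall/push.
    server = ToyServer()
    c = CachingClient(server)
    c.write(27, b"new data")
    c.recall(27)
    print(server.blocks[27])       # b'new data'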

Lease example: network file cache consistency
This approach is used in NFS and various other networked data services.

A few points about leases
Classical leases for cache consistency are in essence a distributed reader/writer lock.
– Add in callbacks and some push and purge operations on the local cache, and you are done.
These techniques are used in essentially all scalable/parallel file systems.
– But what is the performance? Would you use it for a shared database? How to reduce lock contention?
The basic technique is ubiquitous in distributed systems.
– Timeout-based failure detection with synchronized clock rates.
– E.g., designate a leader or primary replica.

SharedLock: Reader/Writer Lock
A reader/writer lock or SharedLock is a new kind of "lock" that is similar to our old definition:
– supports Acquire and Release primitives
– assures mutual exclusion for writes to shared state
But: a SharedLock provides better concurrency for readers when no writer is present.

    class SharedLock {
        AcquireRead();   /* shared mode */
        AcquireWrite();  /* exclusive mode */
        ReleaseRead();
        ReleaseWrite();
    }
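One way to realize that interface, as a minimal Python sketch (no fairness policy, so a steady stream of readers can starve a writer; production implementations usually add one):

    import threading

    class SharedLock:
        """Minimal reader/writer lock matching the interface above."""
        def __init__(self):
            self._cond = threading.Condition()
            self._readers = 0
            self._writer = False

        def AcquireRead(self):            # shared mode: many readers at once
            with self._cond:
                while self._writer:
                    self._cond.wait()
                self._readers += 1

        def AcquireWrite(self):           # exclusive mode: wait for all readers and any writer
            with self._cond:
                while self._writer or self._readers > 0:
                    self._cond.wait()
                self._writer = True

        def ReleaseRead(self):
            with self._cond:
                self._readers -= 1
                if self._readers == 0:
                    self._cond.notify_all()

        def ReleaseWrite(self):
            with self._cond:
                self._writer = False
                self._cond.notify_all()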

Reader/Writer Lock Illustrated
Multiple readers may hold the lock concurrently in shared mode. Writers always hold the lock in exclusive mode, and must wait for all readers or the writer to exit.

    mode         read   write   max allowed
    shared       yes    no      many
    exclusive    yes    yes     one
    not holder   no     no      many

If each thread acquires the lock in exclusive (write) mode, SharedLock functions exactly as an ordinary mutex.

Google File System (GFS)
Similar: Hadoop HDFS, pNFS, and many other parallel file systems.
A master server stores metadata (names, file maps) and acts as a lock server.
Clients call the master to open a file, acquire locks, and obtain metadata. Then they read/write directly to a scalable array of data servers for the actual data.
File data may be spread across many data servers: the maps say where it is.
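A toy sketch of that asymmetric structure (all names are hypothetical): the master holds only the file map, and clients go to the data servers directly for the bytes.

    class DataServer:
        def __init__(self): self.chunks = {}
        def write(self, chunk_id, data): self.chunks[chunk_id] = data
        def read(self, chunk_id): return self.chunks.get(chunk_id, b"")

    class Master:
        """Stores only metadata: which chunks make up a file, and where they live."""
        def __init__(self): self.filemap = {}     # path -> [(chunk_id, data_server), ...]
        def open(self, path): return self.filemap[path]

    # Usage: one small RPC to the master for the map, then bulk I/O to data servers.
    ds = DataServer(); ds.write("c1", b"hello "); ds.write("c2", b"world")
    master = Master(); master.filemap["/f"] = [("c1", ds), ("c2", ds)]
    print(b"".join(server.read(cid) for cid, server in master.open("/f")))   # b'hello world'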

GFS: leases
The primary must hold a "lock" on its chunks. GFS uses leased locks to tolerate primary failures. Quoting the GFS paper:
"We use leases to maintain a consistent mutation order across replicas. The master grants a chunk lease to one of the replicas, which we call the primary. The primary picks a serial order for all mutations to the chunk. All replicas follow this order when applying mutations. Thus, the global mutation order is defined first by the lease grant order chosen by the master, and within a lease by the serial numbers assigned by the primary.
The lease mechanism is designed to minimize management overhead at the master. A lease has an initial timeout of 60 seconds. However, as long as the chunk is being mutated, the primary can request and typically receive extensions from the master indefinitely. These extension requests and grants are piggybacked on the HeartBeat messages regularly exchanged between the master and all chunkservers. … Even if the master loses communication with a primary, it can safely grant a new lease to another replica after the old lease expires."
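A compressed sketch of that bookkeeping (class and method names are made up for illustration; the real master also tracks replica health and chunk versions):

    import time

    LEASE_SECONDS = 60   # initial chunk-lease timeout described in the GFS paper

    class ChunkMaster:
        """Master-side lease state for one chunk."""
        def __init__(self, replicas):
            self.replicas = replicas
            self.primary = None
            self.lease_expiry = 0.0

        def grant_lease(self, now=None):
            now = time.monotonic() if now is None else now
            if self.primary is None or now > self.lease_expiry:
                # The old lease has expired, so it is safe to pick a new primary,
                # even if the master cannot reach the old one.
                self.primary = self.replicas[0]       # pick a replica (here simply the first)
                self.lease_expiry = now + LEASE_SECONDS
            return self.primary

        def heartbeat(self, replica, wants_extension, now=None):
            # Extension requests and grants ride on the regular heartbeat exchange.
            now = time.monotonic() if now is None else now
            if replica is self.primary and wants_extension and now < self.lease_expiry:
                self.lease_expiry = now + LEASE_SECONDS

    class Primary:
        """The primary serializes mutations by stamping them with serial numbers."""
        def __init__(self): self.next_serial = 0
        def order(self, mutation):
            serial = self.next_serial
            self.next_serial += 1
            return (serial, mutation)     # all replicas apply mutations in serial order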

Parallel File Systems 101
Manage data sharing in large data stores. [Renu Tewari, IBM]
Asymmetric: e.g., PVFS2, Lustre, High Road, Ceph, GFS.
Symmetric: e.g., GPFS, Polyserve. Classical: Frangipani.

Parallel NFS (pNFS)
Modifications to the standard NFS protocol (v4.1) to offload bulk data storage to a scalable cluster of block servers or OSDs. Based on an asymmetric structure similar to GFS and Ceph.
[Diagram, David Black, SNIA: pNFS clients exchange metadata and control with an NFSv4+ server, while data flows directly between clients and block (FC) / object (OSD) / file (NFS) storage.]

pNFS architecture
Only the client-to-server metadata/control path is covered by the pNFS protocol. The client-to-storage data path and the server-to-storage control path are specified elsewhere, e.g.:
– SCSI Block Commands (SBC) over Fibre Channel (FC)
– SCSI Object-based Storage Device (OSD) over iSCSI
– Network File System (NFS)
[Diagram, David Black, SNIA: pNFS clients, NFSv4+ server (metadata, control), and block (FC) / object (OSD) / file (NFS) storage (data).]

pNFS basic operation
The client gets a layout from the NFS server.
The layout maps the file onto storage devices and addresses.
The client uses the layout to perform direct I/O to storage.
At any time the server can recall the layout (leases/delegations).
The client commits changes and returns the layout when it's done.
pNFS is optional: the client can always use regular NFSv4 I/O.
[Diagram, David Black, SNIA: clients, NFSv4+ server, storage, layout.]
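A sketch of that flow from the client's point of view (the objects and method names here are invented for illustration; the actual protocol defines operations such as LAYOUTGET, LAYOUTCOMMIT, and LAYOUTRETURN, plus a callback for recalls):

    # Hypothetical client-side flow for a pNFS write (illustrative only).
    def pnfs_write(server, path, offset, data):
        layout = server.layout_get(path)                 # layout: file ranges -> devices/addresses
        if layout is None:
            return server.write(path, offset, data)      # pNFS is optional: plain NFSv4 I/O
        try:
            device = layout.device_for(offset)
            device.write(layout.address(offset), data)   # direct I/O, bypassing the NFS server
            server.commit(path)                          # then commit the changes via the server
        finally:
            server.layout_return(path, layout)           # return the layout when done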