Speculations: Speculative Execution in a Distributed File System and Rethink the Sync


Speculations: Speculative Execution in a Distributed File System [1] and Rethink the Sync [2]
Edmund Nightingale [1,2], Kaushik Veeraraghavan [2], Peter Chen [1,2], Jason Flinn [1,2]
Presentation by Ji-Yong Shin (some slides are from Nightingale’s talk)

Agenda
CAP theorem, Consistency Semantics, Consistency Model
Papers
– Speculative Execution in a Distributed File System (Award Paper from SOSP’05)
– Rethink the Sync (Best Paper from OSDI’06)

CAP Theorem by Eric Brewer
At most two of C, A, and P can be satisfied simultaneously
– Consistency: correctness of data
– Availability: guaranteed immediate access to data
– Partition Tolerance: guaranteed functioning despite network disruption or partition
[Figure: nodes N1 and N2 connected by a Sync link, with the C, A, and P properties labeled]

ACID vs BASE (not exactly opposites, but...)
ACID
– Atomicity
– Consistency
– Isolation
– Durability
BASE
– Basically Available
– Soft-state
– Eventually consistent
Distributed File System Design

Consistency Semantics by Leslie Lamport
Atomic (single copy)
– Every read returns the value of the most recent write
Regular
– A read not concurrent with any write returns the most recent write
– A read concurrent with some writes returns either the most recent write or the value of a concurrent write
Safe
– A read not concurrent with any write returns the most recent write
– A read concurrent with some writes returns any value

Consistency Models
Strict consistency
– All executions in strict order
Sequential consistency
– All execution results exposed in strict order
Causal consistency
– All execution results with causal dependency exposed in strict order
Close-to-open consistency
– All execution results of processes that closed the file should be exposed to a process opening the file
Delta consistency
– After a fixed period of time, all memory parts will be consistent
Eventual consistency
– After a sufficiently long period of time, all memory parts will be consistent

Consistency and CAP theorem
Papers
– Speculative Execution in a Distributed File System (Award Paper from SOSP’05)
– Rethink the Sync (Best Paper from OSDI’06)

Authors
Edmund B. Nightingale
– PhD from UMich (Jason Flinn)
– Microsoft Research
– Both papers are part of his PhD thesis
Kaushik Veeraraghavan
– PhD student at UMich (Jason Flinn)
Peter M. Chen
– PhD from UCB (David Patterson)
– Faculty at UMich
Jason Flinn
– PhD from CMU (Mahadev Satyanarayanan)
– Faculty at UMich

Speculation
Execute using an assumption
– If the assumption holds: performance gain
– If the assumption fails: restart execution, with not much performance overhead
Examples
– Branch prediction
– Transactions
– Thread-level speculation in multiprocessors (or multicores)
Branch prediction example: if (a == 1) { b = 0; c = 1; } else { b = 1; c = 0; }
[Figure: four-stage pipeline (IF, DEC, EX, WB) over clock cycles. During the sync delay for the branch condition, b = 0 and c = 1 run speculatively. When the sync completes and the guess was wrong, the pipeline rolls back and restarts with b = 1 and c = 0.]
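
The idea on this slide can be sketched in a few lines of Python. This is only an illustration of "guess, run ahead, roll back on a wrong guess", not how a CPU or Speculator implements it; `run_speculatively` and `branch` are hypothetical names, and a deep copy stands in for a real checkpoint.

```python
import copy

def run_speculatively(state, predicted, actual_fn, continuation):
    """Run `continuation` on a guessed value; redo it if the guess was wrong.

    Returns (final_state, rolled_back).
    """
    checkpoint = copy.deepcopy(state)   # save state before speculating
    continuation(state, predicted)      # run ahead using the predicted value
    actual = actual_fn()                # the slow operation finally completes
    if actual == predicted:
        return state, False             # guess held: keep the speculative work
    state = checkpoint                  # guess failed: roll back ...
    continuation(state, actual)         # ... and re-execute with the real value
    return state, True

def branch(state, a):
    # Same branch as in the slide's example.
    if a == 1:
        state["b"], state["c"] = 0, 1
    else:
        state["b"], state["c"] = 1, 0
```

Calling `run_speculatively({}, 1, lambda: 0, branch)` takes the rollback path, because the predicted value 1 does not match the actual value 0.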

Motivation and Approach
Distributed file system
– Significant cost for consistency and safety: block and wait on sync messages and writes
– Tradeoff between consistency and performance: weak consistency for high performance
Speculative distributed file system
– Execute sync operations in an async manner
– While syncing, execute the next operation on cached files
– Check correctness later and roll back if necessary
– Guarantee single-copy semantics

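Big Idea: Slow Way
[Figure: the client sends an RPC request to the server and blocks until the RPC response arrives]
Big Idea: Speculator
1) Checkpoint
2) Speculate!
3) Correct? Yes: discard the checkpoint. No: restore the process and re-execute.
[Figure: the client issues successive RPC requests while earlier RPC responses are still outstanding]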

Conditions for Success
1. Highly predictable operations
– Misprediction can worsen performance
– Mispredictions are rare
2. Checkpointing faster than remote I/O
– Slow checkpointing is not worth doing
– 52 us for a small process, less than network I/O
3. Spare resources available for speculation
– Speculation requires memory and CPU cycles
– Modern computers have abundant resources
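
The first two conditions can be made concrete with a back-of-the-envelope model. The model below is an assumption for illustration, not taken from the papers: a correct guess hides the I/O wait completely, and a wrong guess pays for the rollback plus the full I/O wait. Only the 52 us checkpoint figure comes from the slide; the other numbers in the usage note are hypothetical.

```python
def expected_cost_us(io_us, ckpt_us, miss_rate, rollback_us):
    """Expected per-operation latency with speculation (toy model)."""
    # Every operation pays for a checkpoint. A misprediction additionally
    # pays for the rollback and then waits for the real I/O.
    return ckpt_us + miss_rate * (rollback_us + io_us)

def speculation_pays_off(io_us, ckpt_us, miss_rate, rollback_us):
    """True if speculating is cheaper on average than blocking on the I/O."""
    return expected_cost_us(io_us, ckpt_us, miss_rate, rollback_us) < io_us
```

With a hypothetical 1000 us network round trip, a 1% misprediction rate, and a 52 us rollback, `expected_cost_us(1000, 52, 0.01, 52)` is about 62.5 us, far below the 1000 us spent blocking. If the I/O itself took only 40 us, the 52 us checkpoint alone would make speculation a loss.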

Implementing Speculation
1) System call
2) Create speculation (create_speculation)
– Checkpoint: copy-on-write fork() of the process
– Spec: tracks kernel objects that depend on it
– Undo log: ordered list of speculative operations
[Figure: timeline of the process with its checkpoint, speculation, and undo log]

Speculation Success
1) System call
2) Create speculation
3) Commit speculation (commit_speculation)
– Spec: tracks kernel objects that depend on it
– Undo log: ordered list of speculative operations
[Figure: timeline of the process with its checkpoint, speculation, and undo log]

Speculation Failure
1) System call
2) Create speculation
3) Fail speculation (fail_speculation)
– Spec: tracks kernel objects that depend on it
– Undo log: ordered list of speculative operations
[Figure: timeline of the process with its checkpoint, speculation, and undo log]
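
The three slides above can be modeled in user-level Python as a rough sketch. The names `create_speculation`, `commit_speculation`, and `fail_speculation` come from the slides, but the signatures and everything else here are assumptions: plain dicts stand in for the process and for kernel objects, and a dict copy stands in for the copy-on-write fork().

```python
_MISSING = object()  # marks "key did not exist before the speculative write"

class Speculation:
    def __init__(self, process_state):
        self.checkpoint = dict(process_state)  # stand-in for copy-on-write fork()
        self.undo_log = []                     # ordered (object, key, old value)

def create_speculation(process_state):
    return Speculation(process_state)

def speculative_write(spec, obj, key, value):
    # Record the old value first so the write can be undone later.
    spec.undo_log.append((obj, key, obj.get(key, _MISSING)))
    obj[key] = value

def commit_speculation(spec):
    # The guess was right: the checkpoint and undo log are no longer needed.
    spec.undo_log.clear()
    spec.checkpoint = None

def fail_speculation(spec, process_state):
    # Undo speculative operations in reverse order, then restore the process.
    for obj, key, old in reversed(spec.undo_log):
        if old is _MISSING:
            obj.pop(key, None)
        else:
            obj[key] = old
    spec.undo_log.clear()
    process_state.clear()
    process_state.update(spec.checkpoint)
```

Undoing in reverse order matters when the same key is written more than once during a speculation: the oldest recorded value is the one restored last.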

Multi-Process Speculation
Processes often cooperate
– Example: “make” forks children to compile, link, etc.
– Would block if speculation were limited to one task
Supports
– Propagating dependencies among objects
– Rolling objects back to prior states when speculations fail

Multi-Process Speculation
[Figure: Spec 1 and Spec 2; checkpoints for pid 8000 and pid 8001; undo-log entries Chown-1 and Write-1 on pid 8000 and on inode 3456; operations Stat A and Stat B]

Ensuring Correctness
Speculative state must never be visible to
1. The user or an external device
2. A process that does not depend on it
Controlling a speculative process
– Block access to the external environment: read-only calls (getpid) and private state updates (dup2) are allowed
– Buffer writes to external devices
– Propagate speculation if necessary
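
The "buffer writes to external devices" rule can be sketched as follows. This is a minimal illustration under assumed names (`SpeculativeOutput`, `sink`); in the real system the buffering happens inside the kernel.

```python
class SpeculativeOutput:
    """Holds output of a speculative process until the speculation resolves."""

    def __init__(self, sink):
        self.sink = sink      # list standing in for the screen or network
        self.pending = []     # output not yet allowed to become visible

    def write(self, data, speculative):
        # Once anything is pending, later output queues behind it so that
        # the order seen by the user never changes.
        if speculative or self.pending:
            self.pending.append(data)
        else:
            self.sink.append(data)

    def commit(self):
        # Speculation succeeded: release buffered output in order.
        self.sink.extend(self.pending)
        self.pending.clear()

    def fail(self):
        # Speculation failed: the output never happened as far as users know.
        self.pending.clear()
```

After a failure the process is re-executed from its checkpoint, so any output it should produce is regenerated from non-speculative state.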

Multi-Process Speculation
Supports
– Objects in the distributed file system (explained in the next slides)
– Objects in the local memory file system (RAMFS)
– Objects in the local disk file system
  Uses a buffering strategy for speculation
  Shared on-disk metadata: only valid state is committed, using redo and undo
  Journal: only non-speculative operations are committed
– Others: pipes, FIFOs, Unix sockets, signals, fork, exit
Doesn’t support
– Write-shared memory, including System V IPC and futexes

Using Speculation
[Figure: Client 1 runs “cat foo > bar” and Client 2 runs “cat bar” against the server; the nodes are annotated with file states such as foo(0), bar(0), foo(1), bar(1)]

Using Speculation
[Figure: same scenario; Client 1 runs “cat foo > bar” holding foo(0) while the server holds foo(1), bar(0)]
Mutating operation
– Server determines speculation success or failure
  State at the server is never speculative
  Can be durable across a server crash
– Requires the server to track failed speculations
– Requires in-order processing of messages
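
One way to picture "the server determines speculation success or failure" is a version check on every mutating operation. The sketch below is an assumption-laden toy, not the SpecNFS or BlueFS protocol: `VersionedServer` and `mutate` are hypothetical names, and the client simply sends the file versions it relied on.

```python
class VersionedServer:
    """Toy file server whose committed state is never speculative."""

    def __init__(self, files):
        # name -> [version, data]
        self.files = {name: [0, data] for name, data in files.items()}

    def version(self, name):
        return self.files[name][0]

    def mutate(self, name, data, assumed_versions):
        """Apply a write only if every version the client assumed is current.

        Returns True if the client's speculation succeeded, else False.
        """
        for dep, ver in assumed_versions.items():
            if self.files[dep][0] != ver:
                return False          # client speculated on stale data
        entry = self.files.setdefault(name, [-1, None])
        entry[0] += 1                 # a new version becomes visible
        entry[1] = data
        return True
```

In the slide's scenario, a client that cached foo at version 0 and writes bar from it is rejected once the server holds foo at version 1; the client then rolls back and re-executes with fresh data. Processing messages in order, as the slide requires, keeps a later operation from being applied after an earlier one it depended on has failed.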

Group Commit
– Previously sequential operations are now concurrent
– Sync operations are usually committed to disk
– Speculator makes group commit possible
– Can significantly improve disk throughput
[Figure: the client sends several writes, updating different files, followed by a commit to the server]
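
Group commit can be illustrated with a small append-only log that pays for one `os.fsync` per batch instead of one per record. This is a sketch under assumptions: `GroupCommitLog` and `demo` are hypothetical names, and a real file system commits through its journal rather than through a user-level log.

```python
import os
import tempfile

class GroupCommitLog:
    """Append-only log that makes a whole batch durable with one fsync."""

    def __init__(self, path):
        self.fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
        self.pending = []   # records accepted but not yet durable
        self.syncs = 0      # number of fsync calls issued so far

    def append(self, record):
        self.pending.append(record)   # returns without touching the disk

    def commit(self):
        """Write all pending records and sync once; return the batch size."""
        if not self.pending:
            return 0
        os.write(self.fd, b"".join(self.pending))
        os.fsync(self.fd)             # one disk flush covers the whole batch
        self.syncs += 1
        count = len(self.pending)
        self.pending.clear()
        return count

    def close(self):
        os.close(self.fd)

def demo(records):
    # Helper for trying the log out on a temporary file.
    with tempfile.TemporaryDirectory() as directory:
        path = os.path.join(directory, "log")
        log = GroupCommitLog(path)
        for record in records:
            log.append(record)
        committed = log.commit()
        log.close()
        with open(path, "rb") as handle:
            data = handle.read()
    return committed, log.syncs, data
```

Three appended records cost a single sync here, whereas a synchronous design would have issued three.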

Implementation
SpecNFS
– Modified NFSv3 in the Linux 2.4 kernel to support Speculator
– Same RPCs issued (but many are now asynchronous)
– SpecNFS has the same close-to-open consistency and safety as NFS
BlueFS
– New file system for Speculator
– Single-copy semantics
– Each file, directory, etc. has a version number
– Checks with the server on every operation
Test setup
– Two Dell Precision 370 desktops as the client and file server
– Packets routed through the NISTnet network emulator to insert delay

Apache Build
With delays, SpecNFS is up to 14 times faster

The Cost of Rollback
With all files out of date, SpecNFS is up to 11x faster

Group Commit & Sharing State

Conclusion
– Speculator greatly improves the performance of existing distributed file systems
– Speculator enables new file systems to be safe, consistent in some sense, and fast

Discussion
– Starvation (infinite rollback)?
– Overhead for maintaining speculation? Memory or CPU?
– Multiple-server environment?
– Consistency? Consistency semantics? Consistency model? CAP theorem?

CAP theorem, Consistency Semantics, Consistency Model
Papers
– Speculative Execution in a Distributed File System (Award Paper from SOSP’05)
– Rethink the Sync (Best Paper from OSDI’06)

Synchronization
Asynchronous I/O
– High performance: non-blocking
– Low reliability: vulnerable to crashes, ordering not guaranteed
Synchronous I/O
– Low performance: blocking
– High reliability: resilient to crashes, guaranteed ordering of I/O
External Synchrony
– High performance, close to async I/O: async-like execution until externalization
– High reliability, close to synchronous I/O: user-centric view of guaranteed durability

External Synchrony
Delay the commit of data until an externally observable operation is necessary
– Print to screen
– Packet send
Externally observable behavior implies that
– Operations before the observed behavior have been committed

Example: Synchronous I/O
Process: write(buf_1); write(buf_2); print(“work done”); foo();
[Figure: process, OS kernel, and disk; the application blocks; the screen then shows “work done”]

Example: External synchrony
Process: write(buf_1); write(buf_2); print(“work done”); foo();
[Figure: process, OS kernel, and disk; the screen shows “work done”]
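
The external synchrony example can be imitated with a toy model in which write() only buffers and the commit is forced by the first externally visible output. The class and attribute names are hypothetical, and the model is simplified: the real system lets the process keep running while the commit proceeds and releases the buffered output afterwards, whereas this sketch commits inline.

```python
class ExternallySynchronousFS:
    """Toy model: data is committed before any output that follows it."""

    def __init__(self):
        self.dirty = []     # modifications buffered in memory
        self.disk = []      # modifications that are durable
        self.screen = []    # output the user has actually seen
        self.commits = 0    # number of (group) commits performed

    def write(self, buf):
        self.dirty.append(buf)   # returns immediately, like async I/O

    def print(self, text):
        # Output is externally observable, so everything written before it
        # must be durable before the user can see it.
        self._commit()
        self.screen.append(text)

    def _commit(self):
        if self.dirty:
            self.disk.extend(self.dirty)   # one commit covers all dirty data
            self.dirty.clear()
            self.commits += 1
```

Running the slide's sequence, two writes followed by `print("work done")`, yields a single commit of both buffers before the message appears, which is the user-visible guarantee synchronous I/O gives at the cost of blocking on each write.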

Improving Performance
Group commit of multiple modifications
– Atomic commit reduces disk accesses
Buffering of output
– The output function runs while the commit is in progress
– Buffered output is released after the commit completes

Multiprocess support
Necessary functions
– Tracking causal dependencies: concept borrowed from Speculator
– Output-triggered commit: output buffering borrowed from Speculator

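Multiprocess support
Process 1: write(file1); do_something();
Process 2: print(“hello”); read(file1); print(“world”);
[Figure: Process 1, Process 2, OS kernel, and disk. Process 1’s write creates commit dependency 1. After Process 2 reads file1 it inherits that dependency. The screen shows “hello” first and “world” after the commit.]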
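
The dependency tracking in this example can be sketched as follows. `CausalKernel` and its methods are hypothetical names for a toy model: each write creates a pending commit, a process that reads the file inherits that dependency, and its later output is held until the commit happens.

```python
class CausalKernel:
    """Toy model of commit dependencies propagating between processes."""

    def __init__(self):
        self.next_commit = 1
        self.uncommitted = set()   # commit ids not yet on disk
        self.file_deps = {}        # path -> commit ids its contents depend on
        self.proc_deps = {}        # process -> commit ids it depends on
        self.held = []             # (pending deps, text) awaiting release
        self.screen = []

    def write(self, proc, path):
        cid = self.next_commit
        self.next_commit += 1
        self.uncommitted.add(cid)
        self.file_deps.setdefault(path, set()).add(cid)
        self.proc_deps.setdefault(proc, set()).add(cid)
        return cid

    def read(self, proc, path):
        # Reading uncommitted data makes the reader causally dependent on it.
        deps = self.file_deps.get(path, set())
        self.proc_deps.setdefault(proc, set()).update(deps)

    def print(self, proc, text):
        pending = self.proc_deps.get(proc, set()) & self.uncommitted
        if pending:
            self.held.append((pending, text))   # buffer until durable
        else:
            self.screen.append(text)

    def commit(self, cid):
        self.uncommitted.discard(cid)
        still_held = []
        for deps, text in self.held:
            if deps & self.uncommitted:
                still_held.append((deps, text))
            else:
                self.screen.append(text)        # dependencies are durable
        self.held = still_held
```

In the slide's scenario "hello" is printed before Process 2 reads file1, so it appears immediately, while "world" is held until commit 1 completes. The sketch preserves output order within a process but does not try to order output across unrelated processes.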

Limitations
Application-specific recovery is difficult
– Delayed commit makes it difficult to trace back
Commit may be delayed indefinitely
– A 5-second rule is applied, which may not meet the user’s expectations
Data in multiple file systems is difficult to commit in a single transaction
– Journals are in different locations

Implementation
Speculator
– Hides speculative state until the RPC response
– Traces causal dependencies for commit and rollback
– Buffers output
– Group commit
External Synchrony
– Delays commit until externalization
– Traces causal dependencies for commit
– Buffers output
– Group commit
Implemented an externally synchronous file system, Xsyncfs
– Based on the ext3 file system and Speculator
– Uses journaling to preserve the order of writes
– Uses write barriers to flush the volatile cache
– Write to disk guaranteed

Evaluation
Compare Xsyncfs to 3 other file systems
– Default asynchronous ext3
– Default synchronous ext3
– Synchronous ext3 with write barriers

When is data safe?
File system configuration     | Data durable on write() | Data durable on fsync()
Asynchronous                  | No                      | Not on power failure
Synchronous                   | Not on power failure    | Not on power failure
Synchronous w/ write barriers | Yes                     | Yes
External synchrony            | Yes                     | Yes

Postmark benchmark
Xsyncfs is within 7% of ext3 mounted asynchronously

The MySQL benchmark
MySQL’s group commit can reach Xsyncfs performance when the number of clients is large

Specweb99 throughput
Xsyncfs is within 8% of ext3 mounted asynchronously
Lots of operations buffered, more externalization

Conclusion
– A new concept, external synchrony, is proposed
– External synchrony performs within 8% of async

Discussion
– What happens when an external synchrony system fails?
– Consistency? Consistency semantics? Consistency model? CAP theorem?