Skippy: Enabling Long-Lived Snapshots of the Long-Lived Past
Ross Shaull, Liuba Shrira, Hao Xu
Lab for Experimental Software Systems, Brandeis University

Presentation transcript:

Motivation

An old problem: time travel in databases. Approaches that retain past state at the logical record level (ImmortalDB, Postgres) change the arrangement of current state, while file system-level approaches (VSS) block transactions to get consistency.

A new solution: split snapshots.
- Integrates with the page cache and transaction manager to provide disk page-level snapshots.
- The application declares transactionally-consistent snapshots at any time, with any frequency.
- Snapshots are retained incrementally using copy-on-write (COW), without reorganizing the database.
- All applications and access methods run unmodified on persistent, on-line snapshots.
- Good performance is achieved in the same manner as the database itself: leverage db recovery to defer snapshot writes.

A new problem: how do we index copy-on-write split snapshots?

Indexing Split Snapshots

Each snapshot needs its own page table (an SPT), which points to current-state and COW'd pages. Updating SPTs on disk would be costly, since one COW may change the pointers in multiple SPTs.

[Figure: database pages and snapshot pages P1, P2, P3, with SPT 1 and SPT 2 pointing into both.]

Order of events:
1. Snapshot 1 declared
2. Page 1 modified
3. Snapshot 2 declared
4. Page 1 modified again
5. Page 2 modified

After these events, the snapshot copy of P2 is shared by SPT 1 and SPT 2, and P3 has not been modified, so SPT 1 and SPT 2 both point to P3 in the database.

Accessing Snapshots with Maplog

Instead of maintaining many SPTs, append mappings to snapshot pages into a log, the maplog (a sketch in code appears below).
- Ordering invariant: all mappings retained for snapshot X are written into the maplog before mappings retained for snapshot X+1.
- Construct the SPT for snapshot X by scanning forward from Start(X) for first-encountered mappings (FEMs).
- Any page for which a mapping is not found in the maplog is still "in the database" (i.e., it has not been COW'd yet).

[Figure: maplog with Start markers for Snap 1 and Snap 2 and mappings for P1, P1, P2.]

Cost of Scan

Skew hurts:
1. Let the overwrite cycle length L be the number of page updates required to overwrite the entire database.
2. The overwrite cycle length determines the number of mappings that must be scanned to construct an SPT.
3. Let N be the number of pages in the database.
4. For a uniformly random workload, L = N ln N (by the "coupon collector's waiting time" problem).
5. Skew in the update workload lengthens the overwrite cycle by introducing many more repeated mappings.
6. For example, a skew of 80/20 (80% of updates go to 20% of the pages) increases L by a factor of 4 (a derivation and a quick simulation appear below).

Combat Skew with Skippy

For a faster scan, create higher-level logs of FEMs with fewer repeated mappings (also sketched below):
- Divide the maplog into equal-sized chunks called nodes.
- Copy each FEM in a maplog node into Skippy Level 1.
- At the end of each node, record an up-link that points to the next position in Skippy Level 1 where a mapping will be stored.
- To construct Skippy Level N, recursively apply the same procedure to the previous Skippy level.
- When scanning, follow the up-links into the Skippy levels (a Skippy scan). A Skippy scan that begins at Start(X) constructs the same SPT X as a maplog scan.

[Figure: maplog divided into nodes for Snap 1 through Snap 6, with their FEMs copied up into Skippy Level 1; solid arrows denote pointers, dotted arrows indicate copying.]

Analysis

Skippy acceleration is the ratio of the number of mappings in the top Skippy level to the number in the maplog. Acceleration depends on:
- node size (repeated mappings are only eliminated if they appear in the same node), and
- the workload (the likelihood of repetitions).

The expected cost of building an SPT is calculated by factoring in the acceleration, the cost of reading sequentially at each level, and the cost of the disk seeks between levels.

Setup: 100 MB database; 50 KB node (holds 2560 mappings, which is 1/10th the number of database pages); 10,000 rpm disk. The plot shows the time to build an SPT versus the number of Skippy levels for various skews:

Skew  | # Skippy Levels | Time to Build SPT (s)
50/50 | …               | …
80/20 | …               | …
99/1  | …               | …

Conclusions: Skippy could counteract 80/20 skew in 3 levels; 99/1 skew has a hot section much smaller than the node size, so one level is enough.

Skippy-based Snapshot System

Implemented in Berkeley DB (BDB); a sketch of the interception logic appears below.
- Page-level snapshots are created using COW.
- The page cache is augmented to intercept requests for pages; pages that are write-locked get COW'd on the first request after the declaration of a snapshot.
- Snapshot pages are flushed periodically from a background thread to a second disk.
- Snapshot metadata is created in memory at transaction commit; the metadata (and any unflushed pages COW'd before the checkpoint) are written to disk during the checkpoint.
- Recovery of snapshot pages and metadata can be done in one log pass.
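To make this concrete, here is a minimal in-memory Python sketch of the maplog and the FEM scan described under "Accessing Snapshots with Maplog". All names are mine, and Python lists and dicts stand in for the on-disk log and the SPTs; the actual system is C code inside Berkeley DB.

```python
from dataclasses import dataclass, field

@dataclass
class Maplog:
    """Append-only log of mappings from page ids to snapshot-page addresses."""
    entries: list = field(default_factory=list)  # (page_id, snap_addr), append-only
    start: dict = field(default_factory=dict)    # snapshot id -> Start(X) log position

    def declare_snapshot(self, snap_id):
        # Ordering invariant: every mapping appended from this point on is
        # retained for snap_id or for a later snapshot.
        self.start[snap_id] = len(self.entries)

    def append(self, page_id, snap_addr):
        # Called when a page is COW'd: snap_addr is where the pre-state
        # copy of page_id was written in the snapshot store.
        self.entries.append((page_id, snap_addr))

    def build_spt(self, snap_id):
        # Scan forward from Start(snap_id), keeping only first-encountered
        # mappings (FEMs). Pages never seen are still in the database.
        spt = {}
        for page_id, snap_addr in self.entries[self.start[snap_id]:]:
            spt.setdefault(page_id, snap_addr)
        return spt

# The "Order of events" example from above:
log = Maplog()
log.declare_snapshot(1)
log.append("P1", "copy1")   # Page 1 modified: its pre-state is COW'd
log.declare_snapshot(2)
log.append("P1", "copy2")   # Page 1 modified again
log.append("P2", "copy3")   # Page 2 modified
assert log.build_spt(1) == {"P1": "copy1", "P2": "copy3"}
assert log.build_spt(2) == {"P1": "copy2", "P2": "copy3"}  # P2 shared; P3 absent
```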
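Likewise, a toy version of the Skippy construction and scan from "Combat Skew with Skippy". The per-node FEM copying and the up-links follow the description above; the flat-list representation, the max_levels bound, and the handling of partial nodes are my simplifications.

```python
def build_skippy(maplog_entries, node_size, max_levels):
    """Build Skippy levels above the maplog. levels[0] is the maplog itself;
    uplinks[k][i] is where a scan leaving node i of level k resumes in level k+1."""
    levels, uplinks = [list(maplog_entries)], []
    for _ in range(max_levels):
        below, above, links = levels[-1], [], []
        for base in range(0, len(below), node_size):
            seen = set()
            for page_id, addr in below[base:base + node_size]:
                if page_id not in seen:              # copy only the node's FEMs
                    seen.add(page_id)
                    above.append((page_id, addr))
            links.append(len(above))                 # up-link recorded at node end
        levels.append(above)
        uplinks.append(links)
        if len(above) <= node_size:                  # top level fits in one node
            break
    return levels, uplinks

def skippy_scan(levels, uplinks, start_pos, node_size):
    """Build an SPT from maplog position start_pos, climbing one level at
    each node boundary; equivalent to a full maplog scan from start_pos."""
    spt, pos = {}, start_pos
    for k in range(len(levels) - 1):
        node = pos // node_size
        for page_id, addr in levels[k][pos:(node + 1) * node_size]:
            spt.setdefault(page_id, addr)
        links = uplinks[k]
        pos = links[node] if node < len(links) else len(levels[k + 1])
    for page_id, addr in levels[-1][pos:]:           # top level: read to the end
        spt.setdefault(page_id, addr)
    return spt

# Continuing the example above: a Skippy scan matches the plain maplog scan.
levels, uplinks = build_skippy(log.entries, node_size=2, max_levels=4)
assert skippy_scan(levels, uplinks, log.start[1], node_size=2) == log.build_spt(1)
assert skippy_scan(levels, uplinks, log.start[2], node_size=2) == log.build_spt(2)
```

Under 80/20 skew, repeated hot-page mappings collapse inside each node, so every level above the maplog is shorter and a Skippy scan touches far fewer mappings than a plain maplog scan.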
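Finally, a sketch of the page-cache interception described under "Skippy-based Snapshot System", reusing the Maplog class above. This is a hypothetical structure for exposition: the real implementation augments BDB's page cache in C, defers snapshot writes via database recovery, and handles metadata at commit and checkpoint, none of which is modeled here.

```python
class SnapshotPageCache:
    """Toy snapshot-aware page cache: COW write-locked pages on the first
    write request after a snapshot declaration, and log the mapping."""

    def __init__(self, pages, maplog):
        self.pages = pages            # page_id -> current page contents
        self.maplog = maplog          # e.g. the Maplog sketch above
        self.epoch = 0                # bumped at every snapshot declaration
        self.copied_at = {}           # page_id -> epoch of its last COW
        self.snap_store = []          # in-memory snapshot pages; a background
                                      # thread flushes these to a second disk

    def declare_snapshot(self, snap_id):
        self.epoch += 1
        self.maplog.declare_snapshot(snap_id)

    def get_page_for_write(self, page_id):
        # Intercept write-locked requests: COW once per snapshot epoch.
        if self.copied_at.get(page_id, -1) < self.epoch:
            self.snap_store.append(self.pages[page_id])      # copy the pre-state
            self.maplog.append(page_id, len(self.snap_store) - 1)
            self.copied_at[page_id] = self.epoch
        return self.pages[page_id]   # caller now updates the current page

cache = SnapshotPageCache({"P1": b"v0", "P2": b"v0", "P3": b"v0"}, Maplog())
cache.declare_snapshot(1)
cache.get_page_for_write("P1")       # first write after snapshot 1: P1 is COW'd
cache.get_page_for_write("P1")       # same epoch: no second copy is made
```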
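The overwrite-cycle claims in "Cost of Scan" follow from the coupon collector's waiting time. A sketch of the standard argument, assuming independent random updates and the usual two-level model for 80/20 skew:

```latex
% Uniform workload over N pages: the coupon collector's waiting time gives
\mathbb{E}[L_{\mathrm{uniform}}] = N \, H_N = N \sum_{i=1}^{N} \frac{1}{i} \approx N \ln N .

% 80/20 skew: a cold page (one of the 0.8N pages that share 20% of the
% updates) is hit with probability 0.2/(0.8N) = 1/(4N) per update, so
% collecting the cold set dominates the cycle:
\mathbb{E}[L_{80/20}] \approx 4N \, H_{0.8N} \approx 4N \ln(0.8N) \approx 4 \, N \ln N .
```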
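A quick Monte Carlo check of both estimates (my own harness, not part of the poster):

```python
import math, random

def overwrite_cycle_length(n_pages, hot_frac=0.0, hot_prob=0.0, seed=42):
    """Count updates until every page has been overwritten at least once.
    hot_frac=0.2, hot_prob=0.8 models the 80/20 skew; the defaults are uniform."""
    rng = random.Random(seed)
    hot_cut = int(n_pages * hot_frac)
    seen, unseen, updates = [False] * n_pages, n_pages, 0
    while unseen:
        updates += 1
        if rng.random() < hot_prob:
            p = rng.randrange(hot_cut)           # update some hot page
        else:
            p = rng.randrange(hot_cut, n_pages)  # update some cold page
        if not seen[p]:
            seen[p] = True
            unseen -= 1
    return updates

N = 20_000
uniform = overwrite_cycle_length(N)
skewed = overwrite_cycle_length(N, hot_frac=0.2, hot_prob=0.8)
print(f"uniform: L = {uniform}, N ln N = {N * math.log(N):.0f}")
print(f"80/20:   L = {skewed}, ratio to uniform = {skewed / uniform:.2f}")
```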
Experimental Evaluation

Can we create split snapshots with a Skippy index efficiently? The plot shows the time to complete a single-threaded updating workload of 100,000 transactions in a 66 MB database under each of the 50/50, 80/20, and 99/1 skews. We can retain a snapshot after every transaction for a 6–8% penalty.

More Stuff

Across-Time Execution (ATE): Skippy can be used to cheaply construct SPTs for a window of snapshots, for "across-time" analysis.

Garbage collection: collecting old records from a log is simple, just chop off the tail. Skippy enables cheap generational garbage collection of snapshots according to priorities (Thresher, Shrira and Xu, USENIX '06).

References

- Shaull, R., Shrira, L., and Xu, H. Skippy: Enabling Long-Lived Snapshots of the Long-Lived Past. ICDE.
- Shrira, L., van Ingen, C., and Shaull, R. Time Travel in the Virtualized Past. SYSTOR.
- Shrira, L., and Xu, H. Snap: Efficient Snapshots for Back-In-Time Execution. ICDE.