
A chicken in every pot: a persistent snapshot memory scaled in time
Liuba Shrira and Hao Xu, Brandeis University

Storage systems: the 7-year itch
1984: rotational delay - FFS
1991: large memory - LFS
1998: cheaper disk - Elephant
2005: a chicken in every pot - a snapshot box on the side

Trends
Hardware: disk is cheap ($1/GB) and getting cheaper
Software industry: Forbes (12/2004) says the need for keeping past state is growing

Trends, cont.
- A casino chases a card counter
- An IT department is chased by Sarbanes-Oxley
- A Hippocratic DB is audited on patient privacy preservation
All need to analyze past activity

SNAP: a snapshot system for an object storage system
Goal: a storage-system capability for back-in-time execution (BITE): an application runs against read-only snapshots, without synchronization, to analyze the past in retrospect

Baseline requirements for BITE
- Consistent snapshots: the same (old) invariants hold
- BITE of general code: after-the-fact, ad-hoc analysis (vs. predefined SQL access methods)
- App chooses the snapshot: a snapshot state meaningful to the app (vs. "some time in the past")
- High time "resolution": fine-grained analysis of the past (vs. backup for recovery)

Over long time-scales…
Living with the past: how close? Today: too close (temporal DBs, CVFS) or too far (warehouses - Netezza)
Snapshots can be of long-term importance or transient. Today: uniform treatment - apps cannot discriminate
Inherent tension: latency of access vs. cost of representation (space and time). Today: limited adaptation - compress or not

Capturing past states: two ways
Cheap - no-overwrite update: the past stays put, the new state is copied. Less to write, but a bloated DB, and the past inherits the same representation
Opportunistic - in-place update: the past is copied out and kept separate. More to write, but the writes can be smart, the past representation can be tailored, and the DB stays clustered (vigor)

Our requirements
Non-disruptive past: at just the right distance - separated
At an adaptive distance: e.g., faster BITE on more recent states
Discriminated past: the application classifies, the snapshot system filters. Some snapshots outlive others; some can be accessed faster
Flexible classification: e.g., after the fact

Snapshot system operations
Request to take a snapshot (declaration): sid = snapshot_request(filter_spec)
Request to access a snapshot: snapshot_access(sid)
Request to specify a filter for a snapshot after the fact: lazy_filter(sid, filter_spec)
Snapshot declarations interleave with transactions in the commit order: T1, T2, S1, T3, T4, T5, S2, …
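A minimal sketch of this interface in Python (only snapshot_request, snapshot_access, and lazy_filter come from the talk; the SnapshotSystem class, the page-table capture, and the read-only view are assumptions for illustration):

class ReadOnlyView:
    # read-only wrapper over a snapshot page table (illustrative)
    def __init__(self, page_table):
        self._pt = dict(page_table)

    def lookup(self, page_no):
        return self._pt[page_no]

class SnapshotSystem:
    def __init__(self, live_page_table):
        self._live_pt = live_page_table
        self._next_sid = 0
        self._filters = {}   # sid -> filter_spec, possibly set lazily
        self._tables = {}    # sid -> page table captured at declaration

    def snapshot_request(self, filter_spec=None):
        # declare a snapshot; a None filter_spec defers classification
        sid = self._next_sid
        self._next_sid += 1
        self._filters[sid] = filter_spec
        self._tables[sid] = dict(self._live_pt)  # conceptual capture
        return sid

    def snapshot_access(self, sid):
        # hand BITE a read-only view of snapshot sid
        return ReadOnlyView(self._tables[sid])

    def lazy_filter(self, sid, filter_spec):
        # supply or revise a snapshot's filter after the fact
        self._filters[sid] = filter_spec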

Baseline storage system
General interface: pages and a page table; transactions access objects on pages
Server:
- DB disk: slotted pages of objects, physical oids (page#, o#), and a page table
- transaction log
- cache: pages and a modified-object cache

Storage system, cont.
Optimistic concurrency control + ARIES
Clients fetch pages, run transactions, and send modified objects to the server
The server validates and commits (WAL), and caches committed modifications: no-force, no-steal
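A minimal sketch of that commit path, assuming Thor-style optimistic validation (the server object and its fields are invented for illustration):

def commit_transaction(server, tx):
    """Validate optimistically, then commit via write-ahead logging.
    No-force: committed pages are not written to the DB at commit time.
    No-steal: uncommitted modifications never reach the DB disk."""
    if not server.validate(tx.read_set, tx.modified_objects):
        return "abort"
    server.log.append(tx.modified_objects)     # WAL: log before acking
    server.mob_cache.add(tx.modified_objects)  # installed later by the cleaner
    return "commit"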

The snapshot system
Archive separated from the DB: archive I/O is sequential, DB I/O is random
Copy-on-write (COW): snapshot states are copied out into the archive just before the DB is updated, during cleaning

Snapshot interface
Same as the DB: snapshot pages and a snapshot page table
So BITE is transparent: BITE on snapshot S(v) uses PageTable(v)

Snapshot system, below the interface
Some S(v) pages are in the archive, some in the DB, and pages in the archive can have different representations

BITE(v): namespace redirection
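A minimal sketch of the redirection, assuming each snapshot page-table entry says whether that version of the page still sits in the DB or has been copied to the archive (the PageLocation layout is an assumption):

from collections import namedtuple

PageLocation = namedtuple("PageLocation", ["in_archive", "address"])

def bite_read_page(snapshot_page_table, page_no, db, archive):
    # BITE resolves page_no through the snapshot's page table:
    # the namespace is the same, only the location is redirected
    loc = snapshot_page_table[page_no]
    if loc.in_archive:
        return archive.read(loc.address)  # archived copy (maybe re-encoded)
    return db.read(loc.address)           # unchanged since the snapshot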

Creating non-disruptive snapshots (I/O-bound system)
Archiving snapshot states during cleaning can slow cleaning down compared to a system without snapshots
Copying to the archive disk (sequential I/O) in parallel with database I/O (random) can partially hide the archiving cost behind database I/O

Creating snapshots: how well can you hide the cost?
Determined by:
- how much is archived: compactness of the snapshot representation, snapshot frequency, update workload (overwriting)
- the cost of archiving: sequential writes, other archive traffic (BITE)

Creating snapshots: some issues
- avoid overwriting snapshot states (without blocking, pinning, etc.)
- update snapshot metadata efficiently (large, dynamic page tables)
- filter out long-lived snapshots (the focus here)

New techniques for copy-out snapshots
- VMOB: an in-memory versioned data structure that preserves snapshot states without blocking
- LPT: an incrementally archived page table with logarithmic reconstruction cost
- Filtering: exploit a smart representation for past states (the focus here)
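A minimal sketch of the VMOB idea, assuming it acts as a modified-object buffer whose entries are versioned by snapshot epoch (the talk names the structure but gives no code; everything below is illustrative):

from collections import defaultdict

class VMOB:
    """Versioned modified-object buffer (sketch). Writers append a new
    version when a snapshot has been declared since their last write, so
    snapshot states survive without blocking writers or the archiver."""
    def __init__(self):
        self.epoch = 0                     # advanced by each declaration
        self.versions = defaultdict(list)  # oid -> [(epoch, value), ...]

    def declare_snapshot(self):
        self.epoch += 1
        return self.epoch

    def write(self, oid, value):
        vs = self.versions[oid]
        if vs and vs[-1][0] == self.epoch:
            vs[-1] = (self.epoch, value)    # same epoch: update in place
        else:
            vs.append((self.epoch, value))  # preserve the older version

    def read_as_of(self, oid, epoch):
        # latest version written during or before the given epoch
        for e, v in reversed(self.versions[oid]):
            if e <= epoch:
                return v
        return None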

Filtering: motivation
We want an unlimited past at high resolution, but some snapshots are transient while others are of long-term interest to the application: the application needs to discriminate between snapshots

Thresher: a filtering system for SNAP

Snapshot representation
What can representation do for filtering?
- lifetime-based allocation: avoids fragmentation
- diff-based encoding: reduces the cost of copying
- an adaptive combination: the real winner

Example: hierarchical snapshots at multiple time granularities
An ICU patient-monitoring DB takes snapshots:
- minute by minute: vital-sign monitor readings
- hourly: includes the nurse's writeup summarizing the monitor readings
- daily: includes the doctor's notes summarizing the nurse's checkups
The doctor's snapshots have a longer lifetime than the nurse's…
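Sketched in terms of the operations above, using the SnapshotSystem sketch from earlier, the ICU policy might look like this (the lifetime values and spec format are hypothetical; the talk does not fix a filter-spec syntax):

# Hypothetical filter specs for the three snapshot tiers
MINUTE = {"lifetime": "hours"}  # vital-sign readings: transient
HOURLY = {"lifetime": "weeks"}  # nurse's writeups
DAILY  = {"lifetime": "years"}  # doctor's notes

snap = SnapshotSystem(live_page_table={})
sid = snap.snapshot_request(MINUTE)  # declared minute by minute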

Brief overview: snapshot creation
Some notation: snapshot span, recorded pages
Example: … v4, T: w(x_P), T': w(y_S), v5, T'' …
Span of v4: T, T' (the transactions that commit between the declarations of v4 and v5)
Pages recorded by snapshot v4: P, S (the pages those transactions modify)

Incremental snapshot creation
Archived snapshot pages are dispersed along the archive: v4 records P, S; v5 records P, Q; …
Archived snapshot page tables (PT): PT(v4): addr(P4), addr(S4); PT(v5): addr(P5), addr(Q5); …
Another talk covers how to construct archived page tables:
APT(v4) = recorded(v4) + APT(v5)
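A minimal sketch of that recurrence (the dict-based page tables are illustrative; the logarithmic-cost construction is the subject of the other talk and not reproduced here):

def construct_apt(sid, recorded, latest_sid, live_page_table):
    """Archived page table for snapshot sid:
    APT(sid) = recorded(sid) + APT(sid + 1).
    Pages sid did not record are inherited from the next snapshot's
    table, and ultimately from the live DB page table."""
    if sid > latest_sid:
        return dict(live_page_table)       # base case: current DB state
    apt = construct_apt(sid + 1, recorded, latest_sid, live_page_table)
    apt.update(recorded.get(sid, {}))      # own recordings take precedence
    return apt

# The slide's example: v4 records P, S; v5 records P, Q
recorded = {4: {"P": "addr(P4)", "S": "addr(S4)"},
            5: {"P": "addr(P5)", "Q": "addr(Q5)"}}
apt4 = construct_apt(4, recorded, 5, {"P": "db", "Q": "db", "S": "db"})
# apt4 == {"P": "addr(P4)", "S": "addr(S4)", "Q": "addr(Q5)"}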

Filtering example: filter out short-lived v5
Archive: v4 (doctor's) records P, S; v5 (nurse's) records P, Q; then v6 …
Filter: v4 is long-lived; reclaim v5: reclaim P5 but retain Q5 (v4 needs it)
Filtering incremental snapshots creates fragmentation
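A minimal sketch of that retention rule (the dict encoding is illustrative): a recorded page version is kept iff some retained snapshot still reads its state through the incremental page-table recurrence above.

def retained_page_versions(recorded, retained_sids):
    """Decide which archived page versions survive filtering (sketch).
    recorded: sid -> set of page names recorded by that snapshot.
    The version of page p recorded by snapshot r is the state seen by
    every snapshot declared after p's previous recorder and up to r;
    keep it iff one of those snapshots is retained."""
    by_page = {}
    for sid in sorted(recorded):
        for page in recorded[sid]:
            by_page.setdefault(page, []).append(sid)
    keep = set()
    for page, recorders in by_page.items():
        prev = float("-inf")
        for r in recorders:
            if any(prev < u <= r for u in retained_sids):
                keep.add((page, r))
            prev = r
    return keep

# The slide's example: retain the doctor's v4, filter the nurse's v5
keep = retained_page_versions({4: {"P", "S"}, 5: {"P", "Q"}}, {4})
# keep == {("P", 4), ("S", 4), ("Q", 5)}: P5 reclaimed, Q5 kept for v4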

Problem: fragmentation
A fragmented archive, over time, means non-sequential archive writes, or random reads to copy out long-lived states

Our approach: the filter-spec
A filter spec determines relative snapshot lifetimes
"App knows best": the app supplies a filter spec, the system filters

Avoiding fragmentation with the filter-spec
- Known at snapshot declaration: use lifetime-based allocation
- After the fact: use a flexible representation to filter lazily; the representation allows an adaptive trade-off between the cost of filtering and the cost of BITE

App specifies the filter at declaration
Long-lived pages (P4, S4, Q5) are allocated in long-lived archive areas; short-lived pages (P5) in short-lived areas
Invariant: to reclaim without fragmentation, short-lived areas store no long-lived pages
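A minimal sketch of lifetime-based allocation under that invariant (the region bookkeeping is invented; in Thresher the FilterTree on the next slide plays this role):

class LifetimeAllocator:
    """Append each archived page version to the region of its lifetime
    class, where a page's class is the longest lifetime among the
    snapshots that inherit it. A short-lived region then never holds a
    long-lived page, so it can be reclaimed whole, with no fragmentation."""
    def __init__(self, classes=("short", "long")):
        self.regions = {c: [] for c in classes}

    def archive_page(self, page_version, lifetime_class):
        region = self.regions[lifetime_class]
        region.append(page_version)          # sequential append per class
        return (lifetime_class, len(region) - 1)

    def reclaim(self, lifetime_class):
        # the invariant guarantees nothing long-lived is stranded here
        self.regions[lifetime_class].clear()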

FilterTree: filter pages for free

After-the-fact (lazy) filtering
Some applications want to defer filter specification
Lazy filtering requires copying; we can specialize the representation (make it compact) to reduce the copying cost

Compact representation: diffs
Two components, filtered separately:
- compact diffs: reduce the cost of copying (diffs clustered by page)
- checkpoints: accelerate BITE (page-based snapshots, system-declared, can use the FilterTree)

Adaptive trade-off
Like a recovery log: less frequent checkpoints increase compactness; more frequent checkpoints accelerate BITE
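A minimal sketch of BITE page reconstruction under the diff-based representation, assuming snapshot v's page is the nearest checkpoint at or before v rolled forward through the later diffs (the data layout and apply_diff are assumptions):

import bisect

def read_page_at(page_no, sid, checkpoints, diffs, apply_diff):
    """checkpoints[page_no]: sorted (ckp_sid, full_page) pairs.
    diffs[page_no]: sorted (snap_sid, diff) pairs, clustered by page.
    Like recovery-log replay: sparser checkpoints mean a more compact
    archive but a longer roll-forward, i.e. slower BITE."""
    ckps = checkpoints[page_no]
    i = bisect.bisect_right([c for c, _ in ckps], sid) - 1
    assert i >= 0, "sketch assumes a checkpoint at or before sid"
    ckp_sid, page = ckps[i]
    for d_sid, d in diffs.get(page_no, ()):
        if ckp_sid < d_sid <= sid:
            page = apply_diff(page, d)   # roll forward to snapshot sid
    return page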

Lazy filtering: checkpoints filtered for free
[Figure: a FilterTree over checkpoints B1, B2, B3, …, alongside archive regions for diff extents E1, E2, E3 grouped into diff generations G1, G2]

But some applications want more: lazy filtering and faster BITE
e.g., an app runs BITE on a batch of recent snapshots to decide which ones to retain, and needs fast BITE to keep up…

Combined hybrid: faster BITE in a recent window, plus lazy filtering

Hybrid: checkpoints, and checkpoints filtered for free

Status
Implemented: SNAP and Thresher for the Thor storage system
Performance results are encouraging; here is a 5000-foot view:

Performance metrics
Cost of filtering: non-disruptiveness = rate-of-drain / rate-of-pour; t_clean determines the rate-of-drain; workload parameter: overwriting
Compactness of the diff-based rep: retention relative to the page-based rep; R_diff is fixed, R_ckp is tunable via checkpoint frequency; workload parameter: density
BITE: page-based snapshots vs. diff-based vs. the DB

Non-disruptiveness
Storage system with hybrid snapshots vs. without snapshots (Thor): how much does rate-of-drain / rate-of-pour drop?

Experimental configuration
Workloads: extend multiuser OO7 to control density and overwriting
System configuration: single client, medium OO7 - small DB (185MB); multiple clients - large DB (140GB)

FilterTree: free!

Non-disruptiveness, single client: "summertime… life is easy"

Non-disruptiveness, multiuser: "the DB works harder"

Summary: non-disruptive snapshot memory
An unlimited, filtered past is cheaper than you may think… a chicken in every pot: every storage system can have a snapshot box on the side

To get there
Generalize: ARIES / STEAL - underway; file systems - need extended interfaces
Beyond: upgrades - have techniques; provenance - need ideas…