A chicken in every pot: a persistent snapshot memory scaled in time
Liuba Shrira and Hao Xu
Brandeis University
Storage systems: the 7-year itch
1984: rotational delay - FFS
1991: large memory - LFS
1998: cheaper disk - Elephant
2005: ... a chicken in every pot - a snapshot box on the side
Trends
Hardware: disk is cheap ($1/GB) and getting cheaper
Software industry: Forbes (12/2004) says the need for keeping past state is growing
Trends, cont.
- A casino chases a card counter
- An IT dept. chased by Sarbanes-Oxley
- A Hippocratic DB audited about patient privacy preservation
Need to analyze past activity
SNAP: a snapshot system for an object storage system
Goal: a storage-system capability for back-in-time execution (BITE): an application runs against read-only snapshots without synchronization - analysis in retrospect.
Baseline requirements for BITE
- Consistent snapshots: the same (old) invariants hold
- BITE of general code: after-the-fact, ad-hoc analysis (vs. predefined SQL access methods)
- App chooses the snapshot: snapshot state meaningful to the app (vs. "some time in the past")
- High time "resolution": fine-grained analysis of the past (vs. backup for recovery)
Over long time-scales...
Living with the past: how close?
- Today: too close (temporal DBs, CVFS) or too far (warehouse - Netezza)
Snapshots can be of long-term importance, or transient
- Today: uniform treatment - apps cannot discriminate
Inherent tension: latency of access vs. cost of representation (space and time)
- Today: limited adaptation - compress or not
Capturing past states - two ways:
- Cheap: no-overwrite update. The past stays put; the new state is copied. Less to write, but the DB bloats and the past inherits the same representation.
- Opportunistic: in-place update. The past is copied out and kept separate. More to write, but the writes can be done smartly, the past representation can be tailored, and the DB stays clustered (vigor).
Our requirements:
- Non-disruptive past: at just the right distance - separated
- At adaptive distance: e.g., faster BITE on more recent states
- Discriminated past: the application classifies, the snapshot system filters. Some snapshots outlive others; some can be accessed faster.
- Flexible classification: e.g., after the fact
Snapshot system operations
- Request to take a snapshot (declaration): sid = snapshot_request(filter_spec)
- Request to access snapshot v: snapshot_access(sid)
- Request to specify a filter for snapshot v: lazy_filter(sid, filter_spec)
Example stream: T1, T2, S1, T3, T4, T5, S2, ...
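A minimal sketch of how these operations might look as a client-visible C++ interface; the type and member names (SnapshotSystem, FilterSpec, Sid) are assumptions, not SNAP's actual API:

```cpp
// Sketch of the three snapshot operations; all names are assumptions.
#include <cstdint>
#include <string>

using Sid = std::uint64_t;        // snapshot id returned by a declaration
struct FilterSpec {               // app-defined relative-lifetime class
    std::string lifetime_class;   // e.g. "minute", "hourly", "daily"
};

class SnapshotSystem {
public:
    virtual ~SnapshotSystem() = default;
    // Declare a snapshot at this point in the transaction stream.
    virtual Sid snapshot_request(const FilterSpec& spec) = 0;
    // Run back-in-time execution (BITE) against read-only snapshot sid.
    virtual void snapshot_access(Sid sid) = 0;
    // Supply (or revise) the filter after the fact - lazy classification.
    virtual void lazy_filter(Sid sid, const FilterSpec& spec) = 0;
};
```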
Baseline storage system
General interface: pages and a page table; transactions access objects on pages
Server:
- DB disk: slotted pages of objects; physical oid = (page#, o#) and a page table
- Transaction log
- Cache: pages and a modified-object cache
Storage system, cont.
Optimistic CC + ARIES
- Clients fetch pages, run transactions, and send modified objects to the server
- The server validates, commits (WAL), and caches committed modifications
- No-force, no-steal
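To make the baseline concrete, a sketch of the addressing scheme and the commit path described on these two slides; all type and member names are assumptions, not Thor's actual interfaces:

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

struct Oid { std::uint32_t page_no; std::uint16_t slot_no; };  // physical oid (page#, o#)
struct Obj { Oid oid; std::vector<std::uint8_t> bytes; };

struct Txn {
    std::vector<Oid> read_set;   // objects read at the client
    std::vector<Obj> modified;   // modified objects shipped at commit
};

struct Server {
    std::unordered_map<std::uint32_t, std::uint64_t> page_table;  // page# -> disk addr
    std::vector<Obj> log;   // stand-in for the transaction log
    std::vector<Obj> mob;   // modified-object cache

    bool validate(const Txn&) { return true; }  // OCC validation (stubbed)

    // No-force: commit writes no DB pages; no-steal: uncommitted
    // modifications never reach the DB disk.
    bool commit(const Txn& t) {
        if (!validate(t)) return false;                               // conflict: abort
        log.insert(log.end(), t.modified.begin(), t.modified.end());  // WAL first
        mob.insert(mob.end(), t.modified.begin(), t.modified.end());
        return true;
    }
};
```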
The snapshot system
Archive separated from the DB: archive I/O is sequential, DB I/O is random
Copy-on-write (COW): snapshot states are copied out into the archive just before the DB is updated, during cleaning.
Snapshot interface
Same as the DB:
- Snapshot pages
- Snapshot page table
So BITE is transparent: BITE on snapshot S(v) uses PageTable(v)
Snapshot system, below the interface:
Some of S(v)'s pages are in the archive, some in the DB, and pages in the archive can have different representations.
BITE(v): namespace redirection
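A sketch of the redirection step, assuming a per-snapshot page table that records only archived pages and falls through to the live DB otherwise; all names here are hypothetical:

```cpp
// A snapshot page table maps each page either to an archive address or,
// when the page was never overwritten after v, back to the live DB.
#include <cstdint>
#include <unordered_map>

struct SnapshotPageTable {
    // page# -> archive address; absent means "still current in the DB"
    std::unordered_map<std::uint32_t, std::uint64_t> archived;
};

enum class Store { ARCHIVE, DATABASE };
struct Location { Store store; std::uint64_t addr; };

Location resolve(const SnapshotPageTable& pt, std::uint32_t page_no,
                 std::uint64_t db_addr /* from the live DB page table */) {
    auto it = pt.archived.find(page_no);
    if (it != pt.archived.end())
        return {Store::ARCHIVE, it->second};   // redirected to the archive
    return {Store::DATABASE, db_addr};         // unchanged since v
}
```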
Creating non-disruptive snapshots (an I/O-bound system)
Archiving snapshot states during cleaning can slow down cleaning, compared to a system without snapshots. Copying to the archive disk (sequential I/O) in parallel with database I/O (random) can partially hide the archiving cost behind the database I/O.
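A sketch of the copy-out step during cleaning under these assumptions (all names hypothetical); the archive append is sequential and can overlap the random DB write:

```cpp
#include <cstdint>
#include <map>
#include <vector>

struct Page { std::uint32_t page_no; std::vector<std::uint8_t> bytes; };

struct Archive {                       // sequential, append-only snapshot store
    std::vector<Page> log;
    bool snapshot_needs(std::uint32_t) const { return true; }  // stub predicate
    void append(const Page& p) { log.push_back(p); }           // sequential write
};

struct Database {                      // random-access DB disk
    std::map<std::uint32_t, Page> pages;
    Page read_current(std::uint32_t no) { return pages[no]; }
    void write_in_place(const Page& p) { pages[p.page_no] = p; }
};

// Before a dirty page overwrites its DB home, copy out the pre-state a
// live snapshot still needs; the two I/O streams can run in parallel.
void clean_page(const Page& dirty, Archive& archive, Database& db) {
    if (archive.snapshot_needs(dirty.page_no))
        archive.append(db.read_current(dirty.page_no));
    db.write_in_place(dirty);
}
```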
Creating snapshots: how well can you hide?
Determined by:
- How much is archived: compactness of the snapshot representation, snapshot frequency, update workload (overwriting)
- The cost of archiving: sequential I/O, other archive traffic (BITE)
Creating snapshots: some issues
- Avoid overwriting snapshot states (without blocking, pinning, etc.)
- Update snapshot metadata efficiently (large, dynamic page tables)
- Filter long-lived snapshots from transient ones (focus here)
New techniques for copy-out snapshots:
- VMOB: an in-memory versioned data structure that preserves snapshot states without blocking
- LPT: an incrementally archived page table with logarithmic reconstruction cost
- Filtering: exploit a smart representation for past states (focus here)
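A sketch of the VMOB idea under stated assumptions (the real VMOB's layout differs; names here are hypothetical): versions are kept per snapshot epoch, so commits never block on archiving, and the archiver drains old epochs asynchronously.

```cpp
#include <cstdint>
#include <map>
#include <unordered_map>
#include <vector>

using SnapshotId = std::uint64_t;
using ObjectId = std::uint64_t;
using Bytes = std::vector<std::uint8_t>;

class Vmob {
    // object -> (snapshot epoch -> state as of that epoch)
    std::unordered_map<ObjectId, std::map<SnapshotId, Bytes>> versions_;
    SnapshotId current_ = 0;
public:
    void declare_snapshot() { ++current_; }   // start a new snapshot epoch

    // A committed modification updates only the current epoch's version;
    // pre-states belonging to earlier epochs stay put until archived.
    void record(ObjectId oid, Bytes state) {
        versions_[oid][current_] = std::move(state);
    }

    // The archiver drains versions older than the current epoch.
    void drain_into_archive(SnapshotId upto,
                            void (*archive)(ObjectId, const Bytes&)) {
        for (auto& [oid, vmap] : versions_)
            for (auto it = vmap.begin();
                 it != vmap.end() && it->first < upto; it = vmap.erase(it))
                archive(oid, it->second);
    }
};
```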
Filtering: motivation
We want an unlimited past at high resolution, but some snapshots are transient while others are of long-term interest to the application. The application needs to discriminate between snapshots.
Thresher: a filtering system for SNAP
Snapshot representation
What can representation do for filtering?
- Lifetime-based allocation: avoids fragmentation
- Diff-based encoding: reduces the cost of copying
- An adaptive combination: the real winner
Example: hierarchical snapshots at multiple time granularities
An ICU patient-monitoring DB takes snapshots:
- Minute by minute: vital-sign monitor readings
- Hourly: includes a nurse's write-up summarizing the monitor readings
- Daily: includes a doctor's notes summarizing the nurse's checkups
Doctors' snapshots have a longer lifetime than nurses'...
Brief overview: snapshot creation
Some notation. Example stream: ..., v4, T: w(x_P), T': w(y_S), v5, T'', ...
- Span of v4: T, T'
- Pages recorded by snapshot v4: P, S
Incremental snapshot creation
Archived snapshot pages are dispersed: v4: P, S; v5: P, Q; ...
Archived snapshot page tables (PT): PT(v4): addr(P4), addr(S4); PT(v5): addr(P5), addr(Q5); ...
Another talk: how to construct archived page tables:
Construct APT(v4) = recorded(v4) + Construct APT(v5)
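A sketch of the reconstruction rule above as a naive linear recursion (the LPT achieves logarithmic reconstruction cost; the names here are assumptions): a snapshot's table is its own recorded mappings, falling back to the next snapshot's table for pages it did not record.

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

using PageNo = std::uint32_t;
using ArchiveAddr = std::uint64_t;
using Apt = std::unordered_map<PageNo, ArchiveAddr>;

// recorded[v] = mappings for pages copied out during snapshot v's span
Apt construct_apt(const std::vector<Apt>& recorded, std::size_t v) {
    if (v >= recorded.size()) return {};
    Apt apt = construct_apt(recorded, v + 1);   // Construct APT(v+1)
    for (const auto& [page, addr] : recorded[v])
        apt[page] = addr;                       // v's own record takes precedence
    return apt;
}
```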
Filtering example: filter out short-lived v5
(v4 is the doctor's snapshot, v5 the nurse's.) Archive: v4: P, S; v5: P, Q; v6: ...
Filter: v4 is long-lived, reclaim v5:
- Reclaim P5
- Retain Q5 (v4 needs it)
Filtering incremental snapshots creates fragmentation.
Problem: fragmentation
A fragmented archive, over time, means non-sequential archive writes, or random reads to copy out the long-lived states.
Our approach: filter-spec
A filter spec determines relative snapshot lifetimes.
"App knows best": the app supplies the filter spec; the system filters.
Avoiding fragmentation with a filter-spec
- Known at snapshot declaration: use lifetime-based allocation
- After the fact: use a flexible representation to filter lazily
The representation allows an adaptive trade-off: the cost of filtering vs. the cost of BITE.
App specifies the filter at declaration
Long-lived pages (P4, S4, Q5) and short-lived pages (P5) are allocated to separate archive areas.
Invariant: to reclaim without fragmentation, short-lived areas store no long-lived pages.
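A sketch of lifetime-based allocation under these assumptions (names hypothetical): pages are appended to per-lifetime regions, so reclaiming a short-lived region frees a contiguous extent and never strands a long-lived page.

```cpp
#include <cstdint>
#include <vector>

struct Region {                               // append-only archive extent
    std::vector<std::uint64_t> pages;         // archived page addresses
    void append(std::uint64_t p) { pages.push_back(p); }
    void reclaim() { pages.clear(); }         // whole-extent, no fragmentation
};

struct LifetimeAllocator {
    Region short_lived, long_lived;
    // The invariant: a page needed by any long-lived snapshot goes to the
    // long-lived region, even if the snapshot that recorded it is short-lived
    // (e.g. Q5, retained because v4 needs it).
    void place(std::uint64_t page, bool needed_by_long_lived) {
        (needed_by_long_lived ? long_lived : short_lived).append(page);
    }
};
```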
FilterTree: filter pages for free
After-the-fact (lazy) filtering
Some applications want to defer the filter specification. Lazy filtering requires copying; we can specialize the representation (make it compact) to reduce the copying cost.
Compact representation: diffs
Two components, filtered separately:
- Compact diffs: reduce the cost of copying (diffs clustered by page)
- Checkpoints: accelerate BITE (page-based snapshots, system-declared; can use the FilterTree)
Adaptive trade-off
Like a recovery log: less frequent checkpoints increase compactness; more frequent checkpoints accelerate BITE.
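A sketch of diff-based BITE under these assumptions (names and diff format are hypothetical): a snapshot page is rebuilt from the nearest earlier checkpoint plus the per-page diffs up to snapshot v. Checkpoint spacing is the knob: closer checkpoints mean fewer diffs to apply (faster BITE); sparser checkpoints mean a more compact archive.

```cpp
#include <cstdint>
#include <vector>

struct Diff {                        // one object modification within a page
    std::uint16_t offset;            // byte offset of the modified object
    std::vector<std::uint8_t> bytes; // new contents (assumed to fit the page)
};

using PageImage = std::vector<std::uint8_t>;

PageImage reconstruct(PageImage checkpoint,
                      const std::vector<Diff>& diffs_since_checkpoint) {
    for (const Diff& d : diffs_since_checkpoint)
        for (std::size_t i = 0; i < d.bytes.size(); ++i)
            checkpoint[d.offset + i] = d.bytes[i];   // apply in log order
    return checkpoint;
}
```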
Lazy filtering: checkpoints filtered for free
[Figure: a FilterTree over checkpoints B1, B2, B3, ..., alongside archive regions G1, G2 holding diff extents E1, E2, E3]
But some applications want more: lazy filtering and faster BITE
E.g., an app runs BITE on a batch of recent snapshots to decide which ones to retain; it needs fast BITE to keep up.
Combined hybrid
Faster BITE in a recent window, and lazy filtering.
Hybrid: checkpoints, with checkpoints filtered for free
Status
Implemented: SNAP and Thresher for the Thor storage system.
Performance results are encouraging; here is a 5000-foot view:
Performance metrics
- Cost of filtering: non-disruptiveness = rate-of-drain / rate-of-pour; t_clean determines the rate-of-drain. Workload parameter: overwriting.
- Compactness of the diff-based representation: retention relative to the page-based representation. R_diff is fixed; R_ckp is tunable via checkpoint frequency. Workload parameter: density.
- BITE: page-based snapshots vs. diff-based vs. the DB.
Non-disruptiveness
Storage system with hybrid snapshots vs. without snapshots (Thor): how much does rate-of-drain / rate-of-pour drop?
Experimental configuration
Workloads: multiuser OO7, extended to control density and overwriting.
System configuration: single client, medium OO7, small DB (185MB); multiple clients, large DB (140GB).
FilterTree: free!
Non-disruptiveness, single client: "summertime... life is easy"
Non-disruptiveness, multi-user: "the DB works harder"
Summary: non-disruptive snapshot memory
An unlimited, filtered past is cheaper than you may think...
A chicken in every pot: every storage system can have a snapshot box on the side.
To get there:
- Generalize: ARIES/STEAL (underway); file systems (need extended interfaces)
- Beyond: upgrades (have techniques); provenance (need ideas)