Transactional Flash
V. Prabhakaran, T. L. Rodeheffer, L. Zhou (MSR, Silicon Valley), OSDI 2008
Shimin Chen, Big Data Reading Group
Introduction
- SSDs expose the same block-level API as disks
- A lost opportunity
- Goal: new abstractions that better match the nature of the new medium as well as the needs of file systems and databases
Idea: Transactional Flash (TxFlash)
- An SSD (w/ new features)
- Addressing: a linear array of pages
- Supports read and write operations
- Supports a simple transactional construct
- Each tranx (transaction) consists of a series of write operations
- Atomicity, isolation, durability
Why is this useful?
- A transaction abstraction is required in many places: file system journals, etc.
- Each application implements its own, with costs in complexity, redundant work, and reliability of the implementation
- It would be great if the storage layer provided a transactional API
Previous Work: disk-based
- Copy-on-write (CoW) + logging
- Fragmentation leads to poor read performance
- Checkpointing and cleaning incur a cleaning cost
- SSDs mitigate these problems: SSDs already do CoW for flash-related reasons, and random read accesses are fast
Outline
- Introduction
- The Case for TxFlash
- Commit Protocols
- Implementation
- Evaluation
- Conclusion
TxFlash Architecture & API
- WriteAtomic(p1…pn): p1…pn are in one tranx; followed by write(p1)…write(pn); provides atomicity, isolation, durability
- Abort: aborts the in-progress tranx
- In-progress tranx: conflicting writes are not issued
- Core of TxFlash
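The call sequence on this slide can be illustrated with a toy in-memory model. This is only a sketch of the interface semantics, not how the device works internally: the class `ToyTxFlash`, its method names, and the restriction to one tranx at a time are all assumptions made for the example.

```python
class ToyTxFlash:
    """Toy model of WriteAtomic / write / Abort (illustrative, hypothetical names)."""

    def __init__(self):
        self.committed = {}   # page number -> data visible to reads
        self.pending = None   # page set declared by the in-progress tranx
        self.staged = {}      # writes received so far for that tranx

    def write_atomic(self, pages):
        # Declare the pages of a new tranx; this toy allows one tranx at a time.
        if self.pending is not None:
            raise RuntimeError("a tranx is already in progress")
        self.pending = set(pages)
        self.staged = {}

    def write(self, page, data):
        if self.pending is None:
            # Plain write outside any tranx takes effect immediately.
            self.committed[page] = data
            return
        if page not in self.pending:
            # Simplification: other writes are rejected while a tranx is open.
            raise ValueError("page is not part of the in-progress tranx")
        self.staged[page] = data
        if set(self.staged) == self.pending:
            # Last write of the tranx arrived: publish all pages together.
            self.committed.update(self.staged)
            self.pending, self.staged = None, {}

    def abort(self):
        # Drop every staged write of the in-progress tranx.
        self.pending, self.staged = None, {}

    def read(self, page):
        return self.committed.get(page)


dev = ToyTxFlash()
dev.write_atomic([1, 2])
dev.write(1, "a")
seen_midway = dev.read(1)   # None: page 1 is not visible until page 2 arrives
dev.write(2, "b")
seen_after = dev.read(1)    # "a": both pages became visible together
```

An aborted tranx leaves `committed` untouched, which is the atomicity property the slide lists.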
Simple Interface
- WriteAtomic: multi-page writes
- Useful for file systems
- Not a full-fledged tranx: no reads in a tranx
- Reduces complexity
- Backward compatible
Flash is good for this purpose
- Copy-on-write: already supported by the FTL
- Fast random reads
- High concurrency: multiple flash chips inside
- New device: a new interface is more likely to be accepted
Traditional Commit
- First write to a log: intention record, …, intention record, commit record
- Intention record: (data, page# & version#, tranx ID)
- Tranx is committed == commit record exists
- Intention records modify the original data
- Once the modifications are done, the records can be garbage collected
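A minimal sketch of the "committed == commit record exists" rule, assuming the log is a simple in-order list. The names `IntentionRecord`, `CommitRecord`, and `replay` are hypothetical and chosen only for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IntentionRecord:
    tranx_id: int
    page: int
    version: int
    data: bytes


@dataclass(frozen=True)
class CommitRecord:
    tranx_id: int


def replay(log):
    """Apply the intentions of committed tranx in log order; return page -> data."""
    # A tranx counts as committed only if its commit record is in the log.
    committed = {r.tranx_id for r in log if isinstance(r, CommitRecord)}
    pages = {}
    for r in log:
        if isinstance(r, IntentionRecord) and r.tranx_id in committed:
            pages[r.page] = r.data
    return pages


log = [
    IntentionRecord(1, 10, 1, b"x"),
    IntentionRecord(1, 11, 1, b"y"),
    CommitRecord(1),
    IntentionRecord(2, 10, 2, b"z"),   # tranx 2 has no commit record
]
state = replay(log)   # tranx 2's write to page 10 is ignored
```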
Traditional Commit on SSDs
- Optimizations: all writes can be issued in parallel; the original data is not updated, only the remap table
- Problem: the commit record
- Extra latency after the other writes
- Garbage collection is complicated: must know whether all the updates have completed
New Proposal (1): Simple Cyclic Commit
- No commit record
- Intention records of the same tranx use next links to form a cycle
- Intention record: (data, page# & version#, next page# & version#)
- Tranx is committed == all intention records are written
- Flash page (4KB) + metadata (128B) are co-located
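How the next links of one tranx close into a cycle can be sketched as follows. The record layout follows the tuple on the slide; the function name `build_cycle` and the "latest version + 1" numbering are assumptions for illustration, and the pages of one tranx are assumed to be distinct.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CyclicIntention:
    page: int
    version: int
    next_page: int
    next_version: int
    data: bytes


def build_cycle(writes, current_versions):
    """writes: list of (page, data) for one tranx; current_versions: page -> latest version."""
    # Assumed numbering: each page's new version is its latest version plus one.
    new = [(page, current_versions.get(page, 0) + 1, data) for page, data in writes]
    records = []
    for i, (page, version, data) in enumerate(new):
        # The last record links back to the first, closing the cycle.
        next_page, next_version, _ = new[(i + 1) % len(new)]
        records.append(CyclicIntention(page, version, next_page, next_version, data))
    return records


records = build_cycle([(10, b"a"), (20, b"b"), (30, b"c")], {10: 4})
# records[0] links 10.v5 -> 20.v1, records[2] links 30.v1 -> 10.v5
```

A single-page tranx produces a record whose next link points to itself, so no separate commit record is needed in that case either.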
Problem
Solution: Any uncommitted intention on the stable storage must be erased before any new writes are issued to the same or a referenced page
Operations
- Initialization: set version# to 0 and next-link to self
- Transaction
- Garbage collection: for any uncommitted intention; for a committed page if a newer version is committed
- Recovery: scan all pages, then look for cycles
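The recovery step ("scan all pages, then look for cycles") might look like the sketch below. It is a simplification: it assumes no record of a committed cycle has been garbage collected yet, and the function name `recover` and the tuple layout are hypothetical.

```python
def recover(scanned):
    """scanned: iterable of (page, version, next_page, next_version) read from flash.

    Returns page -> highest version that lies on a complete cycle.
    """
    stored = {(p, v): (np, nv) for p, v, np, nv in scanned}

    def on_complete_cycle(key):
        cur = stored[key]
        steps = 0
        while cur != key:
            # A missing record, or a walk that never returns, means no cycle through key.
            if cur not in stored or steps > len(stored):
                return False
            cur = stored[cur]
            steps += 1
        return True

    latest = {}
    for page, version in stored:
        if on_complete_cycle((page, version)) and version > latest.get(page, -1):
            latest[page] = version
    return latest


scanned = [
    (1, 0, 1, 0), (2, 0, 2, 0),   # initial state: version 0, next-link to self
    (1, 1, 2, 1), (2, 1, 1, 1),   # committed tranx: 1.v1 -> 2.v1 -> 1.v1
    (1, 2, 2, 2),                 # 2.v2 was never written, so 1.v2 is uncommitted
]
last_committed = recover(scanned)
```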
New Proposal (2): Back Pointer Cyclic Commit
- Another way to deal with ambiguity
- Intention record: (data, page# & version#, next-link, link to last committed version)
A3 is a straddler of A2. This adds some complexity to garbage collection and recovery.
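A small sketch of the straddler test, under the assumption (suggested by the A3/A2 example) that a higher version straddles a lower one when its back pointer to the last committed version skips over it. The function name and the dictionary layout are hypothetical.

```python
def straddlers(back_pointers, target):
    """back_pointers: version -> last committed version recorded in that intention,
    for a single page. Returns the versions that straddle `target`."""
    # Assumed rule: version v straddles target when back < target < v.
    return sorted(v for v, back in back_pointers.items() if back < target < v)


# Page A: A1 points back to A0, A2 to A1, and A3 also to A1 (A2 never committed).
page_a = {1: 0, 2: 1, 3: 1}
over_a2 = straddlers(page_a, 2)   # [3]: A3 straddles A2
over_a1 = straddlers(page_a, 1)   # []: nothing skips over A1
```

Under this reading, an intention with a straddler can be treated as uncommitted without erasing it first, which is the ambiguity the back pointer is meant to resolve.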
Protocol Comparison
Implementation
- Simulator: DiskSim trace-driven SSD simulator (USENIX’08), with modifications for TxFlash; supports tranx of maximum size 4MB
- Pseudo-device driver for recording traces
- TxExt3: employs TxFlash for the Ext3 file system; tranx: Ext3 journal commit
Experimental Setup
- TxFlash device: 32GB (8 x 4GB flash packages); 4 I/O operations within every flash package; 15% of space reserved for garbage collection
- Workloads on top of Ext3:
- IOzone: micro benchmark (no sync writes)
- Linux-build (no sync writes)
- Maildir (sync writes)
- TPC-B: simulates 10,000 credit-debit-like operations on the TxExt3 file system (sync writes)
- Synthetic workloads
Cyclic commit vs. Traditional commit
Unlike database logging, tranx sizes are large: no sync; data are included
Simple cyclic commit has a high cost if there are aborts
TxFlash vs. SSD
- Remove WriteAtomic from the traces
- Use the SSD simulator
- The SSD does not provide any transaction guarantees (so it should have better performance)
Space comparison: TxFlash needs 25% more main memory than the SSD
- 4+1 MB per 4GB flash
- 40 MB for the 32GB TxFlash device
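The numbers on this slide are consistent with each other, as the short calculation below shows. It assumes that "4+1 MB" means 4 MB needed by the plain SSD plus 1 MB added by TxFlash per 4GB package; the slide does not spell this out.

```python
DEVICE_GB = 32
PACKAGE_GB = 4
SSD_MB_PER_PACKAGE = 4        # assumed baseline SSD memory per package
TXFLASH_EXTRA_MB = 1          # assumed TxFlash addition per package

packages = DEVICE_GB // PACKAGE_GB                                  # 8 packages
ssd_mb = SSD_MB_PER_PACKAGE * packages                              # 32 MB
txflash_mb = (SSD_MB_PER_PACKAGE + TXFLASH_EXTRA_MB) * packages     # 40 MB
overhead = (txflash_mb - ssd_mb) / ssd_mb                           # 0.25, i.e. 25% more
```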
End-to-end performance
- TxFlash: run the pseudo-device driver on a real SSD; the performance is close to that of TxFlash
- Ext3: use the SSD as the journal
- SSD cache is disabled in both cases
Summary
- TxFlash: adds a transaction interface to the SSD
- Cyclic commit protocols
- A nice solution for file system journaling