CPSC 426: Building Decentralized Systems
Persistence
the new memory hierarchy (Taken from a LADIS 2015 talk by Andy Warfield)
what’s a filesystem?
abstraction for durable storage: the file
– sparse, byte-addressable address space
– has a filename
– organized into hierarchical directories
consists of data + metadata
– metadata maps logical filenames to physical block addresses on disk
– e.g.: byte 16 of file “/tmp/xyz” → block 589 on disk 0
hides complexity of storage hardware
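A minimal sketch of the logical-to-physical mapping that filesystem metadata performs; the type names and the 4 KB block size here are assumptions for illustration, not any real filesystem's layout:

```go
package main

import "fmt"

const blockSize = 4096 // assumed 4 KB blocks

// fileMeta maps a file's sparse byte address space to physical block addresses.
type fileMeta struct {
	name   string
	blocks map[int64]int64 // logical block index -> physical block number on disk
}

// physicalAddr resolves a byte offset within the file to (disk block, offset in block).
func (f *fileMeta) physicalAddr(byteOff int64) (block, off int64) {
	return f.blocks[byteOff/blockSize], byteOff % blockSize
}

func main() {
	f := &fileMeta{name: "/tmp/xyz", blocks: map[int64]int64{0: 589}}
	blk, off := f.physicalAddr(16)
	fmt.Printf("byte 16 of %s -> block %d, offset %d\n", f.name, blk, off)
	// byte 16 of /tmp/xyz -> block 589, offset 16
}
```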
conventional file systems
write in-place, read in-place
inodes contain file metadata (e.g. data blocks)
directories are just special files: lists of filenames and inode numbers
too many random writes
[diagram: on-disk layout with metadata clustered at the front of the disk; S = superblock, I = inode, A = allocation map, D = data]
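A rough sketch of the two structures the slide names; the field names and sizes are illustrative, not any particular filesystem's on-disk format:

```go
package main

import "fmt"

// inode holds a file's metadata, including pointers to its data blocks.
type inode struct {
	size       int64
	nlink      uint16
	dataBlocks [12]int64 // direct pointers to physical data blocks; real FSes add indirect blocks
}

// dirEntry is one record in a directory file: directories are just special
// files whose contents are a list of (filename, inode number) pairs.
type dirEntry struct {
	name string
	inum uint32
}

func main() {
	tmp := inode{size: 4096, nlink: 2, dataBlocks: [12]int64{589}}
	root := []dirEntry{{name: "tmp", inum: 2}}
	fmt.Println(tmp.size, root[0].name)
}
```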
important numbers
Hard disks:
– random R/W latency = 10 milliseconds
– random R/W bandwidth = 100 IOPS (< 1 MB/s)
– sequential R/W bandwidth = 100 MB/s
SSDs:
– random R/W latency = 200 microseconds
– random R/W bandwidth = 30K – 300K IOPS
– sequential R/W bandwidth = 100s of MB/s
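A back-of-envelope check of the "< 1 MB/s" figure; the 4 KB request size is an assumption, not stated on the slide:

```go
package main

import "fmt"

func main() {
	const iops = 100     // random operations per second on a hard disk
	const reqSize = 4096 // assumed 4 KB per random request
	bytesPerSec := iops * reqSize
	fmt.Printf("%d IOPS x %d B = %.1f MB/s\n", iops, reqSize, float64(bytesPerSec)/1e6)
	// 100 IOPS x 4096 B = 0.4 MB/s, i.e. well under 1 MB/s
}
```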
1990: three technology trends
CPUs getting faster → systems are IO-bound
disks staying slow
memory getting bigger
– better caches, fewer reads to disk
– larger write buffers… (with a catch)
Today: more cores instead of faster cores; disks still slow but flash is fast; memory (for now) getting bigger
conventional file systems: problems
information is spread out on disk
rely on synchronous writes to disk
– multiple pieces of metadata must be written in order (e.g. write inode before pointing to it from directory entry)
– which property is this trying to ensure?
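The ordering requirement can be illustrated with a sketch of file creation: the inode must be durable before the directory entry that points to it, so a crash can leave at worst an orphaned inode rather than a dangling reference. Everything below (writeInode, writeDirEntry, sync) is a hypothetical stand-in, not a real filesystem API:

```go
package main

// Stand-ins that model durable block writes; a real filesystem issues these
// through the block layer.
type inode struct{ num uint32 }
type directory struct{ entries map[string]uint32 }

func allocInode() inode      { return inode{num: 7} }
func writeInode(inode) error { return nil } // pretend: durable write of the inode block
func writeDirEntry(d *directory, name string, inum uint32) error {
	d.entries[name] = inum // pretend: durable write of the directory block
	return nil
}
func sync() {} // pretend: wait for the previous write to reach disk

// create shows the required ordering: the inode reaches disk before the
// directory entry that points to it. A crash in between leaves only an
// orphaned inode; the reverse order could leave a directory entry pointing
// at uninitialized metadata. The property being protected is crash
// consistency of the on-disk metadata.
func create(dir *directory, name string) error {
	ino := allocInode()
	if err := writeInode(ino); err != nil {
		return err
	}
	sync()
	if err := writeDirEntry(dir, name, ino.num); err != nil {
		return err
	}
	sync()
	return nil
}

func main() {
	d := &directory{entries: map[string]uint32{}}
	_ = create(d, "xyz")
}
```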
log-structured storage (the good…)
write all changes sequentially to disk in a log
converts random writes to sequential writes
[diagram: the log grows left to right as data blocks (D), inodes (I), the inode map (Imap), and a checkpoint region (CR) are appended]
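A toy illustration of the core trick: every update, wherever it logically belongs, is appended at the log head, so the device only ever sees sequential writes. The in-memory map standing in for the inode map is a simplification of this sketch:

```go
package main

import "fmt"

const pageSize = 4096

// logFS appends every update to the tail of a single on-disk log and keeps an
// in-memory map from logical block to the latest location in the log.
type logFS struct {
	log  [][]byte      // stand-in for the on-disk log (one entry per block written)
	head int           // next free position in the log
	imap map[int64]int // logical block number -> position of its newest copy
}

// write appends the new contents of logical block lbn at the log head; the
// old copy (if any) simply becomes garbage to be cleaned later.
func (fs *logFS) write(lbn int64, data []byte) {
	fs.log = append(fs.log, data)
	fs.imap[lbn] = fs.head
	fs.head++
}

func (fs *logFS) read(lbn int64) []byte { return fs.log[fs.imap[lbn]] }

func main() {
	fs := &logFS{imap: map[int64]int{}}
	fs.write(42, make([]byte, pageSize)) // a "random" write to block 42...
	fs.write(7, make([]byte, pageSize))  // ...and to block 7 both land sequentially in the log
	fmt.Println(fs.head, len(fs.read(42)))
}
```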
garbage collection (the bad…)
the Achilles’ heel of any log-structured system…
[diagram: as files are overwritten and deleted, the log fills with a mix of live and dead data blocks, inodes, inode maps, and checkpoint regions; free space becomes fragmented]
Solution 1: threading
Solution 2: copying
LFS uses a combination: segments are threaded, but must be copied out before reuse
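A sketch of the copying half of that combination: to reuse a segment, the cleaner first copies its still-live blocks to the log head, after which the whole segment can be reclaimed. The segment bookkeeping here is simplified and hypothetical:

```go
package main

import "fmt"

// segment is a fixed-size region of the log; live blocks must be preserved,
// dead blocks are garbage left behind by later overwrites and deletes.
type segment struct {
	blocks []int64 // logical block numbers stored in this segment
	live   []bool  // whether each block is still the newest copy
}

// clean copies the live blocks of seg to the current log head (appendFn),
// after which the entire segment can be marked free and reused in bulk.
func clean(seg *segment, appendFn func(lbn int64)) {
	for i, lbn := range seg.blocks {
		if seg.live[i] {
			appendFn(lbn) // rewrite the live block at the log head
		}
	}
	seg.blocks, seg.live = nil, nil // the whole segment is now free
}

func main() {
	seg := &segment{blocks: []int64{10, 11, 12}, live: []bool{true, false, true}}
	clean(seg, func(lbn int64) { fmt.Println("copied live block", lbn) })
}
```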
cleaning policy (the ugly…)
which segment to clean?
when to run cleaner?
how many segments to clean?
how to write out live blocks?
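For "which segment to clean?", the original LFS paper's cost-benefit policy picks the segment maximizing (1 - u) * age / (1 + u), where u is the fraction of the segment still live. A quick sketch of that scoring; the age values below are arbitrary illustrative numbers:

```go
package main

import "fmt"

// costBenefit scores a candidate segment per the LFS cost-benefit cleaner:
// space reclaimed = (1 - u), cost = 1 (read the segment) + u (rewrite live data),
// weighted by the age of the data (older data is assumed to stay cold longer).
func costBenefit(u, age float64) float64 {
	return (1 - u) * age / (1 + u)
}

func main() {
	// A mostly-empty but young segment vs. a half-full but cold one.
	fmt.Printf("u=0.1, age=1:  %.2f\n", costBenefit(0.1, 1))
	fmt.Printf("u=0.5, age=20: %.2f\n", costBenefit(0.5, 20))
}
```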
problems
requires extra space for good performance; performs poorly when disk utilization is high
files no longer have spatial locality for reads
random reads are incredibly disruptive
GC activity interferes with sequential writes
SSDs have different performance characteristics
sequential writes can be randomized in virtualized settings
the evolution of LFS
The Logical Disk (SOSP 1993): same ideas under the block API instead of the FS API
NetApp WAFL: enable access to older versions
Linux btrfs: data structure on a log
LSM trees… Google’s LevelDB
later performance studies underlined LFS issues…
LFS Redux: SSD design
SSDs support fast random reads…
… but random writes are problematic
NAND flash is organized into erase blocks
– each erase block has many (e.g. 64) 4KB pages
– a page cannot be overwritten unless the erase block is erased
– as you erase, flash wears out
SSD design
SSD FTLs map from a logical address space to physical flash pages
[diagram: an array of flash pages, each marked valid, garbage, or empty; the FS talks to the FTL via read/write/trim]
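A toy page-mapped FTL, sketching why those three page states exist: writes always land on an empty page, the page holding the old copy of that logical address becomes garbage, and trim lets the FS declare a logical page dead without writing it. Erase-block garbage collection is omitted; this is a simplified assumption, not any real controller's design:

```go
package main

import "fmt"

type pageState int

const (
	empty pageState = iota
	valid
	garbage
)

// ftl maps logical page numbers to physical flash pages, log-structured:
// every write consumes the next empty physical page.
type ftl struct {
	state []pageState // per physical page
	next  int         // next empty physical page to write
	l2p   map[int]int // logical page -> physical page holding its newest copy
}

func (f *ftl) write(lpn int) {
	if old, ok := f.l2p[lpn]; ok {
		f.state[old] = garbage // the previous copy is now stale
	}
	f.state[f.next] = valid
	f.l2p[lpn] = f.next
	f.next++ // real FTLs garbage-collect erase blocks to replenish empty pages
}

func (f *ftl) trim(lpn int) {
	if old, ok := f.l2p[lpn]; ok {
		f.state[old] = garbage // FS declares the data dead without writing it
		delete(f.l2p, lpn)
	}
}

func main() {
	f := &ftl{state: make([]pageState, 8), l2p: map[int]int{}}
	f.write(3)
	f.write(3) // overwrite: the old physical page becomes garbage
	f.trim(3)
	fmt.Println(f.state[:3]) // [2 2 0] -> two garbage pages, rest empty
}
```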
the new memory hierarchy (Taken from a LADIS 2015 talk by Andy Warfield)
puzzle
I bought a new SSD. I ran a random read benchmark on it and got 2X the expected throughput. Why?
hint: as I wrote data to the SSD, the random read benchmark slowed down. Once I had written all blocks on the SSD, I got the expected throughput.
that’s all!