Knowledge is Power Remzi Arpaci-Dusseau University of Wisconsin, Madison
Systems Without Knowledge System designers often have limited knowledge About the applications they run About the other systems they interact with Result: The “curse of generality” Missed performance optimizations Limited functionality Costly, too
Didacticism and Systems How to gain knowledge? Depends on environment Sometimes it’s easy A scientific application w/ cooperative developers Sometimes it’s not Internals of Microsoft file system
What We Do Build systems that acquire and exploit knowledge “Gray box” techniques Make assumptions, probe + measure, learn something about how something works Use knowledge to control systems in unexpected ways Result Increase functionality, improve performance, increase robustness and manageability too
Outline Overview Knowledge and its applications Gray-box file placement Semantically-smart disks Scientific apps, the Grid, and I/O Conclusions
The People Gray-box file placement With James Nugent, Andrea Arpaci-Dusseau Semantically-smart disks With Muthian Sivathanu, Vijayan Prabhakaran, Florentina Popovici, Tim Denehy, Andrea Arpaci-Dusseau Scientific apps, the Grid, and I/O With John Bent, Doug Thain, Andrea Arpaci-Dusseau, Miron Livny
Gray-box Control over File Placement
Controlled File Placement Typical “Unix” file system: Little control over layout Just a simple API of open(), read(), write(), close() Some applications want more control e.g., a web server that knows which files are often accessed together Usual default: Use the raw disk Harder to manage, doesn’t integrate w/ other apps
What Might Be Better Use normal file system Convenience Expose control over layout to applications Control Do the above without changing the file system Can’t always change the system you’re using
PLACE A gray-box “Information and Control Layer” (ICL) It’s just a library Simple API for file placement Exposes “FFS-like” groups Place_Creat(file, mode, groupNumber); No changes to underlying file system [Figure: two processes (P) using PLACE, layered above the file system]
PLACE Outline Basic operation Gray-box knowledge Key techniques Assessment Accuracy Performance Conclusions
Allocation Knowledge Gray-box assumption: “FFS”-like allocation Splits disk into numerous consecutive “groups” Spreads directories across groups Puts files (inodes/data) that are within same directory into same “group” Many variants Our focus: ext2 (but with other variants in mind)
Exploiting Knowledge for Control Key structure: Shadow Directory Tree (SDT) To create a file /foo/bar in group 1: Create file /.H/1/bar Rename /.H/1/bar to /foo/bar [Figure: directory tree with shadow root /.H/ holding group directories 1/, 2/, ..., n/, alongside foo/ containing bar]
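The create-then-rename step above can be sketched in a few lines. This is a minimal illustration, not the PLACE library itself: `place_creat`, the `root` argument, and the `.H/<group>` directory naming are assumptions modeled on the slide. It also assumes the shadow directory for the group already sits in the right on-disk group, and that the file system allocates a new inode in its parent directory's group.

```python
import os


def place_creat(root, path, group, mode=0o644):
    """Create `path` (relative to `root`) via the shadow directory of `group`.

    Hypothetical sketch: the file is first created under root/.H/<group>/,
    so an FFS-like allocator places its inode in that directory's group,
    and is then renamed to its final name.
    """
    shadow_dir = os.path.join(root, ".H", str(group))
    # A real shadow tree would be built ahead of time; created here only
    # so the sketch is self-contained.
    os.makedirs(shadow_dir, exist_ok=True)

    shadow_path = os.path.join(shadow_dir, os.path.basename(path))
    # Step 1: create the file inside the shadow directory.
    fd = os.open(shadow_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, mode)
    os.close(fd)

    # Step 2: rename to the final path. Rename changes the directory
    # entry, not where the inode lives on disk.
    final_path = os.path.join(root, path)
    os.makedirs(os.path.dirname(final_path), exist_ok=True)
    os.rename(shadow_path, final_path)
    return final_path
```

Calling `place_creat(root, "foo/bar", 1)` leaves `foo/bar` in place and nothing behind in `.H/1/`. Whether the data blocks follow the inode depends on the underlying file system, as the later slide on complications notes.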
Challenge: Building the SDT How to ensure that shadow directory for each group K is in the right on-disk location? Basic approach to creating a directory in a group: Repeat: Mkdir(tmp); If (tmp is in the desired group) Break; Bias(); Point of portability: Bias() routine Must account for different allocation algorithms
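The mkdir-check-bias loop can be sketched against a toy allocator. Everything here is hypothetical: `ToyFS` stands in for a file system whose allocator puts each new directory in the group that currently holds the fewest directories, and "bias" for that allocator simply means leaving the misplaced directory in place so its group looks fuller on the next attempt. A real Bias() routine would have to match the actual allocation algorithm.

```python
class ToyFS:
    """Toy allocator: a new directory goes to the least-loaded group."""

    def __init__(self, num_groups):
        self.dirs_per_group = [0] * num_groups
        self.group_of = {}

    def mkdir(self, name):
        # Ties are broken in favor of the lowest group number.
        groups = range(len(self.dirs_per_group))
        group = min(groups, key=lambda g: self.dirs_per_group[g])
        self.dirs_per_group[group] += 1
        self.group_of[name] = group

    def rmdir(self, name):
        self.dirs_per_group[self.group_of.pop(name)] -= 1

    def rename(self, old, new):
        self.group_of[new] = self.group_of.pop(old)


def place_dir(fs, name, target_group, max_tries=100):
    """Create directory `name` in `target_group`; return the number of tries."""
    leftovers = []
    try:
        for attempt in range(max_tries):
            tmp = f".tmp{attempt}"
            fs.mkdir(tmp)
            if fs.group_of[tmp] == target_group:
                fs.rename(tmp, name)
                return attempt + 1
            # Bias step for this toy allocator: keep the misplaced
            # directory so the next mkdir is pushed to another group.
            leftovers.append(tmp)
        raise RuntimeError("could not place directory in target group")
    finally:
        # Remove the temporary directories used for biasing.
        for tmp in leftovers:
            fs.rmdir(tmp)
```

With four groups and an empty toy file system, placing a directory in group 2 takes three attempts, and only the placed directory remains afterwards.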
Some Complications Controlled directory placement Similar to system initialization (hence, slow) To speed up, use shadow cache of directories Crash recovery Crash may leave junk in SDT Periodic sweep of SDT cleaner fixes this Level of control depends on underlying FS e.g., FFS vs. ext2 behavior for large files
Assessment
Does it work? Non-PLACE: 250 files in 1 directory Non-PLACE: 250 files in 10 directories Non-PLACE: 250 files in 100 directories PLACE: 250 files in 100 directories into 1 group
Performance (Small Files) Performance of KB file reads (random)
Performance (Big Files) Each point: Bandwidth attained reading 100-MB file
PLACE Conclusions PLACE: Gray-box approach to file layout Simple and effective control over placement Main technique: Shadow Directory Tree Use to control placement Construction and maintenance are keys Controlled layout can improve performance Micro-benchmarks Web server and I/O parameterization (see USENIX ’03)
Outline Overview Knowledge and its applications Gray-box file placement Semantically-smart disks Scientific apps, the Grid, and I/O Conclusions
Semantically-smart Disk Systems
Semantically-Smart Disk System (SDS) Disk system that understands file system Data structures Operations Operates underneath unmodified FS Must discover layout + on-disk structures Must “reverse engineer” block stream Exploits knowledge and “smarts” to implement new class of services [Figure: file system running on top of an SDS that has its own cache ($) and CPU]
SDS Outline Semantic Knowledge: Acquisition Off-line On-line Semantic Knowledge: Exploitation Case studies Conclusions
Static Knowledge: File System Layout Challenge: How to discover layout information? White-box approach: Embed knowledge in SDS Trend: FS layout does not change frequently [Figure: on-disk layout: Superblk, then Group 1 (I-Bitmap, D-Bitmap, Inodes, Data) and Group 2 (I-Bitmap, D-Bitmap, Inodes, Data)]
Layout Discovery with EOF EOF: Extraction Of File-systems Tool to automatically determine layout Uses gray-box techniques Basic operation Start with “soft” model of file system Probe process (P): Initiates traffic SDS: Monitors activity from FS Two distinct tasks: Classifying blocks by type Identifying fields within an inode Result: “Hardened” model of file system structures + fields [Figure: probe process (P), file system, and SDS]
EOF: More Details Multi-phase procedure: Bootstrap: Summary blocks Data/data bitmaps Inodes/inode bitmaps Inode fields, directory entries Key techniques Known patterns: Data blocks Isolation: Know all but one block, one block must be… Assertions: Check assumptions at each step
EOF: Simplified Example Create file: Touches many data structures Directory data, directory inode, file data (known pattern), file inode, data bitmap, inode bitmap Reset to beginning of file, write block again File data (known pattern), file inode Now, can classify inode block (isolation) Assertion: only two blocks observed
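The isolation step in this example can be sketched as follows. This is an illustration under stated assumptions, not the EOF tool: `KNOWN_PATTERN`, `find_inode_block`, and the representation of the observed traffic as a dict from block number to written bytes are all hypothetical. The idea is that the rewrite touches exactly two blocks, one of which carries the known data pattern, so the other must be the inode block.

```python
# Hypothetical marker that the probe process writes into file data.
KNOWN_PATTERN = b"\xAB" * 16


def find_inode_block(observed_writes):
    """Return the block number that must hold the file's inode.

    observed_writes maps block number -> bytes written, as seen by the
    disk while the probe rewrites the first block of an existing file.
    """
    # Assertion from the slide: only two blocks should be observed.
    if len(observed_writes) != 2:
        raise ValueError("expected exactly two blocks, model is wrong")

    data_blocks = [
        block
        for block, payload in observed_writes.items()
        if payload.startswith(KNOWN_PATTERN)
    ]
    others = [block for block in observed_writes if block not in data_blocks]

    # Isolation: all but one block is known, so the remaining one is
    # the inode block.
    if len(data_blocks) != 1 or len(others) != 1:
        raise ValueError("could not isolate a single unknown block")
    return others[0]
```

For example, if the disk sees a write carrying the pattern to block 5000 and one other write to block 37, block 37 is classified as an inode block.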
EOF: Overhead and Summary Performance: A few minutes per GB Probably OK, only done “once” per new file system Scales well with faster disks (sequential bandwidth) Limitations: “FFS”-like file systems (ext2/3, BSD FFS)
Have Knowledge, Will Innovate Knowing structures is not enough (sometimes) Data block overloading (data, pointer, directory) High-level operations not known (create, delete) Requires new on-line techniques Direct classification Indirect classification Block association Operation inferencing
A Simple Example: Smarter Caching Modern RAID may have significant cache Volatile (DRAM) Non-volatile (NVRAM) How to exploit semantic information to cache more intelligently? [Figure: file system above an SDS with a cache ($)]
Storing Meta-Data in NVRAM Start with simple meta-data: inodes, bitmaps, etc. Good for meta-data intensive workloads [Figure: disk layout (Super, then I-Bit, D-Bit, Inode, Data per group) alongside the NVRAM cache]
Direct Classification Given a disk address, determine its type directly by checking it against the known region bounds (superblock, bitmaps, inodes, general data block) [Figure: disk layout: Super, I-Bit, D-Bit, Inode, Data, I-Bit, D-Bit, Inode, Data]
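A bounds check of this kind reduces to a lookup in a sorted table of region start addresses. The sketch below assumes a made-up two-group layout: the block numbers in `REGIONS`, the `DISK_BLOCKS` size, and the `classify` name are all hypothetical, standing in for the layout that the disk would have learned earlier.

```python
from bisect import bisect_right

# Hypothetical layout: (first block of region, region type), sorted by
# start address. A real SDS would fill this in from layout discovery.
REGIONS = [
    (0, "superblock"),
    (1, "inode-bitmap"),
    (2, "data-bitmap"),
    (3, "inodes"),
    (35, "data"),
    (1000, "inode-bitmap"),
    (1001, "data-bitmap"),
    (1002, "inodes"),
    (1034, "data"),
]
DISK_BLOCKS = 2000
STARTS = [start for start, _ in REGIONS]


def classify(block):
    """Return the type of the region that contains `block`."""
    if not 0 <= block < DISK_BLOCKS:
        raise ValueError("block address out of range")
    # The containing region is the last one whose start is <= block.
    index = bisect_right(STARTS, block) - 1
    return REGIONS[index][1]
```

Note that this only yields the static type. A block in a "data" region may still hold file data, indirect pointers, or directory contents, which is why the other on-line techniques are needed.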
Getting Rid Of The Dead If file blocks are deleted, remove them from cache No need to keep dead blocks around Problem: How to determine if a file is deleted? Need to look for signs of deletion Three different places to look: Inode bitmaps Directory that contains file Inode itself Operation inferencing via block differencing
Operation Inferencing: Detecting Deletes (Inode Bitmap) [Figure: when the inode bitmap is written, the SDS reads the old version and diffs it against the new one; the result is the set of deleted files]
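Block differencing on an inode bitmap can be sketched as a bitwise comparison of the old and new versions of the block. The function name `deleted_inodes` and the bit ordering (bit 0 of byte 0 is the first inode, as in ext2-style bitmaps) are assumptions made for this illustration.

```python
def deleted_inodes(old_bitmap, new_bitmap, first_inode=0):
    """Return inode numbers allocated in old_bitmap but free in new_bitmap.

    Both arguments are bytes objects holding the same bitmap block
    before and after a write. Assumes the least significant bit of each
    byte maps to the lowest-numbered inode in that byte.
    """
    if len(old_bitmap) != len(new_bitmap):
        raise ValueError("bitmap versions must be the same size")

    deleted = []
    for byte_index, (old, new) in enumerate(zip(old_bitmap, new_bitmap)):
        # Bits that were set in the old version and are clear in the new.
        cleared = old & ~new & 0xFF
        for bit in range(8):
            if cleared & (1 << bit):
                deleted.append(first_inode + byte_index * 8 + bit)
    return deleted
```

The disk needs the old version of the block to compute the difference, which is where the space and time overheads discussed on the next slide come from.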
Operation Inferencing: Overheads Space overhead Block cache of inodes, indirect pointers, bitmaps, etc. (could be substantial) Time overhead CPU: Difference operation is like an extra copy Disk: May require block read (if small/no cache) [In paper: Quantified time and space overheads] Main point: There is a CPU and memory cost
Case Studies
Experimental Set-up Problem: Don’t have SDS hardware to use (yet!) “Cost-effective” alternative: Software prototype Insert driver underneath the FS Much like software RAID Good because… Traffic stream similar Bad because… CPU, memory not isolated from host [Figure: file system above the SDS driver, both inside the OS]
Fast RAID Reconstruction Observe: When reconstructing data onto hot spare, no need to reconstruct data that isn’t live Trend: Less live data in performance-sensitive I/O systems Question: How can we perform reconstruction quickly? [Figure: mirror and hot spare]
Traditional Approaches Why not in the file system? File system doesn’t know what RAID is Why not in the storage system? RAID doesn’t know what blocks are live (minimally it does, if block has never been written)
The Semantic Way Easy: Scan disk, only copy live blocks Key piece of knowledge: Bitmaps Plus, need to watch for “unmapped” writes Optionally, can copy “dead” blocks later Useful if SDS doesn’t feel “sure” about its knowledge Guaranteed correct with prioritized recovery
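The reconstruction policy above can be sketched with in-memory lists standing in for disks. The names `reconstruct` and `copy_deferred`, the list-of-booleans bitmap, and the `unmapped_writes` set are hypothetical. The first pass copies blocks the bitmap marks live plus any block that was written while marked free; the dead blocks are returned so they can be copied later if the disk is not sure of its knowledge.

```python
def reconstruct(source, live, unmapped_writes=()):
    """First pass: copy live blocks from `source` onto a fresh spare.

    source: list of block contents on the surviving disk.
    live: list of booleans taken from the data bitmaps.
    unmapped_writes: block numbers written while marked free; they are
    copied too, since the bitmap seen by the disk may be stale.
    Returns (spare, copied_blocks, deferred_blocks).
    """
    unmapped = set(unmapped_writes)
    spare = [None] * len(source)
    copied, deferred = [], []
    for block in range(len(source)):
        if live[block] or block in unmapped:
            spare[block] = source[block]
            copied.append(block)
        else:
            deferred.append(block)
    return spare, copied, deferred


def copy_deferred(source, spare, deferred):
    """Optional second pass: copy the blocks believed to be dead."""
    for block in deferred:
        spare[block] = source[block]
```

Running both passes copies every block, so correctness never depends on the semantic knowledge; the knowledge only decides which blocks are recovered first.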
Fast Reconstruction: A Graph Fast reconstruction: Less live data -> less time How data is spread across disk affects recovery time RAID-5, IBM Disks
Semantic Conclusions Innovation in traditional storage stack is limited File system: high but not low-level info Storage system: low but not high-level info Semantically-smart disks: Best of both worlds? Takes advantage of “smart” disk systems Exploit low-level information… …with high-level knowledge of file system A remaining challenge Overcoming the “file system obfuscation” problem
Outline Overview Knowledge and its applications Gray-box file placement Semantically-smart disks Scientific apps, the Grid, and I/O Conclusions
Trends in Scientific Computing What constitutes a job is increasingly complex Not your simple process anymore Data demands increasing Not just cycles anymore Wide-area collaboration “Grids” facilitate sharing
The Question How to run scientific workloads on the WAN? [Figure: home site and remote site connected by a WAN]
Scientific Outline Typical “scientific” jobs Structure Properties Migratory file services Components Performance Conclusions
First Things First Study of modern scientific applications A “measure then build” approach Suite of six applications BLAST: Searches genomic databases for matching proteins IBIS: Global-scale simulation of earth systems CMS: High-energy physics testing software Nautilus: Simulation of molecular dynamics Messkit Hartree-Fock: Simulation of atomic interactions AMANDA: Astrophysics simulation of cosmic events
An Example: AMANDA A single “job” is a multi-process pipeline -> batch pipelined Each process is a blue circle There are many types of I/O Endpoint (red): unique input/output of pipeline Pipeline private (green): shared between pipe processes Batch shared (yellow): shared across all pipes in batch [Figure: AMANDA pipeline; I/O volume labels 4K, 1M, 23M, 126M, 26M, 5M, 3M, 505M; time labels 2188s, 42s, 955s, 3601s]
Some Things We Learned Demands of a single pipeline are modest Modern PC with disk can handle demand Aggregation of I/O could be harder (WAN) Lots of sharing of data within and across pipelines Systems should (have to?) take advantage of this
Towards Systems Support
Systems Support Need to build systems support for global execution Should support “batch-pipelined” jobs effectively Goals Performance: Throughput is what matters (NOT simple metrics like “availability”) Failures: Must be handled effectively (again, with goal of improving performance)
Migratory File Services Migratory file service I/O environment for “batch-pipelined” workloads Integrates performance and failure management Key: Understanding of workloads Three pieces of implementation Virtual batch overlay Migratory proxies Workflow manager
The Virtual Batch Overlay Want familiar and controllable remote environment But often are stuck w/ particular queueing system Further, cannot assume all relevant s/w installed Glide in our own “virtual batch system” On each node, run master, virtual machine, and migratory proxy (described next) [Figure: node running master (M), virtual machine (VM), and migratory proxy (MP)]
Migratory Proxies Migratory proxies: Run on each remote node Fetch and cache data from home node Cooperative cache for batch inputs Localize I/O that is pipeline local [Figure: remote node running M, VM, MP, and a job (J), connected over the WAN to the home node]
Workflow Manager Where workload knowledge is encapsulated Takes workflow description Job dependencies File indicators Runs each while taking failures into account Transactional management Proxy failure and job failure are not catastrophic (just rerun the job!) Proactive data replication
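The "just rerun the job" policy can be sketched as a loop over jobs in dependency order. This is a minimal illustration, not the actual workflow manager: `run_workflow`, the `deps` dictionary format, the `run_job` callback, and the `max_attempts` limit are all assumptions. It uses `graphlib.TopologicalSorter` from the Python standard library (3.9 or later) to order jobs after their prerequisites.

```python
from graphlib import TopologicalSorter


def run_workflow(deps, run_job, max_attempts=3):
    """Run jobs in dependency order, rerunning any job that fails.

    deps maps each job name to the set of jobs it depends on.
    run_job(name) returns True on success and False on failure.
    Returns (order, attempts), where attempts[name] is the number of
    times that job was run.
    """
    order = list(TopologicalSorter(deps).static_order())
    attempts = {}
    for job in order:
        for attempt in range(1, max_attempts + 1):
            attempts[job] = attempt
            if run_job(job):
                break  # success, move on to the next job
        else:
            # Every attempt failed; give up on the whole workflow.
            raise RuntimeError(f"job {job!r} failed {max_attempts} times")
    return order, attempts
```

A job failure here costs only a rerun of that job, which is the sense in which failures are handled with throughput in mind. Proactive data replication and transactional management of outputs are outside this sketch.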
Performance By exploiting knowledge, order of magnitude improvement over naïve approach
Outline Overview Knowledge and its applications Gray-box file placement Semantically-smart disks Scientific apps, the Grid, and I/O Conclusions
The theme: Knowledge is power If you know how FS decides on file layout, you can control it (PLACE) If you know details of FS on-disk structures, you can gain FS-level knowledge behind a block-based interface (Semantic disks) If you know something about workloads and their I/O behaviors, you can optimize performance and handle failure gracefully
“Beware of false knowledge; it is more dangerous than ignorance.” Bernard Shaw