
1 Knowledge is Power
Remzi Arpaci-Dusseau, University of Wisconsin, Madison

2 Systems Without Knowledge
- System designers often have limited knowledge
  - About the applications they run
  - About the other systems they interact with
- Result: the "curse of generality"
  - Missed performance optimizations
  - Limited functionality
  - Costly, too

3 Didacticism and Systems
- How to gain knowledge? Depends on the environment
- Sometimes it's easy
  - A scientific application with cooperative developers
- Sometimes it's not
  - The internals of a Microsoft file system

4 What We Do
- Build systems that acquire and exploit knowledge
- "Gray-box" techniques
  - Make assumptions, probe and measure, and learn how the underlying system works
  - Use that knowledge to control systems in unexpected ways
- Result
  - Increased functionality, improved performance, and better robustness and manageability too

5 Outline
- Overview: knowledge and its applications
- Gray-box file placement
- Semantically-smart disks
- Scientific apps, the Grid, and I/O
- Conclusions

6 The People
- Gray-box file placement: with James Nugent, Andrea Arpaci-Dusseau
- Semantically-smart disks: with Muthian Sivathanu, Vijayan Prabhakaran, Florentina Popovici, Tim Denehy, Andrea Arpaci-Dusseau
- Scientific apps, the Grid, and I/O: with John Bent, Doug Thain, Andrea Arpaci-Dusseau, Miron Livny

7 Gray-box Control over File Placement

8 Controlled File Placement
- Typical "Unix" file system: little control over layout
  - Just a simple API of open(), read(), write(), close()
- Some applications want more control
  - e.g., a web server that knows which files are often accessed together
- Usual default: use the raw disk
  - Harder to manage; doesn't integrate with other applications

9 What Might Be Better
- Use the normal file system: convenience
- Expose control over layout to applications: control
- Do the above without changing the file system
  - Can't always change the system you're using

10 PLACE
- A gray-box "Information and Control Layer" (ICL)
  - It's just a library
- Simple API for file placement
  - Exposes "FFS-like" groups
  - Place_Creat(file, mode, groupNumber);
- No changes to the underlying file system
[Diagram: processes (P) call into the PLACE library, which sits above the unmodified file system]
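A minimal usage sketch of the PLACE API, assuming Place_Creat() is exported by a C library as shown on the slide; the header name "place.h", the file paths, and the group number are hypothetical, not part of the original system.

```c
#include <stdio.h>
#include "place.h"   /* assumed header exposing Place_Creat() */

int main(void)
{
    /* A web server that knows index.html and logo.png are usually fetched
     * together asks PLACE to allocate both in on-disk group 7. */
    int fd1 = Place_Creat("/www/index.html", 0644, 7);
    int fd2 = Place_Creat("/www/logo.png",   0644, 7);
    if (fd1 < 0 || fd2 < 0) {
        perror("Place_Creat");
        return 1;
    }
    /* ... write() file contents as usual, then close() ... */
    return 0;
}
```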

11 PLACE Outline
- Basic operation
  - Gray-box knowledge
  - Key techniques
- Assessment
  - Accuracy
  - Performance
- Conclusions

12 Allocation Knowledge
- Gray-box assumption: "FFS"-like allocation
  - Splits the disk into numerous consecutive "groups"
  - Spreads directories across groups
  - Puts files (inodes and data) from the same directory into the same "group"
- Many variants
  - Our focus: ext2 (but with other variants in mind)

13 Exploiting Knowledge for Control
- Key structure: Shadow Directory Tree (SDT)
- To create a file /foo/bar in group 1:
  - Create the file /.H/1/bar
  - Rename /.H/1/bar to /foo/bar
[Diagram: the shadow tree /.H/ contains one directory per group (1, 2, ..., n); the real tree holds foo/bar]
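A minimal sketch of the create-then-rename trick, assuming an ext2-style file system that allocates a new file's inode and data in the group of its parent directory. The function name, the "/.H" shadow-tree path, and the simplified error handling are illustrative, not the actual PLACE implementation.

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/* Create 'path' so that its blocks land in 'group': create it inside the
 * shadow directory for that group, then rename it into its real location.
 * The rename moves only the name; the inode (and thus the group) stays put. */
int place_creat(const char *path, mode_t mode, int group)
{
    char shadow[4096];
    const char *slash = strrchr(path, '/');
    const char *name  = slash ? slash + 1 : path;   /* final path component */

    snprintf(shadow, sizeof(shadow), "/.H/%d/%s", group, name);
    int fd = open(shadow, O_CREAT | O_WRONLY, mode); /* inode allocated in 'group' */
    if (fd < 0)
        return -1;
    if (rename(shadow, path) < 0) {
        close(fd);
        unlink(shadow);
        return -1;
    }
    return fd;
}
```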

14 Challenge: Building the SDT
- How to ensure that the shadow directory for each group K lands in the right on-disk location?
- Basic approach to creating a directory in a group:
  Repeat:
    Mkdir(tmp);
    If (tmp is in the desired group) Break;
    Bias();
- Point of portability: the Bias() routine
  - Must account for different allocation algorithms
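A sketch of this loop under stated assumptions: that an inode number can be mapped to its group via the inodes-per-group count (as in ext2), and that creating scratch directories can bias where the allocator puts the next one. The helper names (group_of, bias), the constant, and the exact biasing policy are illustrative guesses, not the PLACE code.

```c
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

#define INODES_PER_GROUP 16384   /* assumed; read from the superblock in practice */

static int group_of(const char *path)
{
    struct stat st;
    if (stat(path, &st) < 0)
        return -1;
    return (int)((st.st_ino - 1) / INODES_PER_GROUP);   /* ext2 inode-to-group rule */
}

static void bias(void)
{
    /* Nudge the allocator toward a different group; the real Bias() must be
     * tailored to the specific allocation algorithm of the underlying FS. */
    mkdir("/.H/scratch", 0700);
    rmdir("/.H/scratch");
}

/* Create the shadow directory for 'group', retrying until it lands there. */
int build_shadow_dir(int group)
{
    char tmp[64], final[64];
    snprintf(tmp, sizeof(tmp), "/.H/tmp-%d", group);
    for (;;) {
        if (mkdir(tmp, 0700) < 0)
            return -1;
        if (group_of(tmp) == group)
            break;                 /* landed in the desired group */
        rmdir(tmp);                /* wrong group: remove it and bias the allocator */
        bias();
    }
    snprintf(final, sizeof(final), "/.H/%d", group);
    return rename(tmp, final);
}
```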

15 Some Complications
- Controlled directory placement
  - Similar to system initialization (hence, slow)
  - To speed up, use a shadow cache of directories
- Crash recovery
  - A crash may leave junk in the SDT
  - A periodic sweep by the SDT cleaner fixes this
- Level of control depends on the underlying FS
  - e.g., FFS vs. ext2 behavior for large files

16 Assessment

17 Does it work?
- Non-PLACE: 250 files in 1 directory
- Non-PLACE: 250 files in 10 directories
- Non-PLACE: 250 files in 100 directories
- PLACE: 250 files in 100 directories placed into 1 group

18 Performance (Small Files)
[Graph: performance of 250 random reads of 200-KB files]

19 Performance (Big Files)
[Graph: each point is the bandwidth attained reading a 100-MB file]

20 PLACE Conclusions
- PLACE: a gray-box approach to file layout
  - Simple and effective control over placement
- Main technique: the Shadow Directory Tree
  - Used to control placement
  - Construction and maintenance are the keys
- Controlled layout can improve performance
  - Micro-benchmarks
  - Web server and I/O parameterization (see USENIX '03)

21 Outline
- Overview: knowledge and its applications
- Gray-box file placement
- Semantically-smart disks
- Scientific apps, the Grid, and I/O
- Conclusions

22 Semantically-smart Disk Systems

23 Semantically-Smart Disk System (SDS)
- A disk system that understands the file system
  - Data structures
  - Operations
- Operates underneath an unmodified FS
  - Must discover the layout and on-disk structures
  - Must "reverse engineer" the block stream
- Exploits knowledge and "smarts" to implement a new class of services
[Diagram: the file system sits above an SDS containing a cache ($) and a CPU]

24 SDS Outline
- Semantic knowledge: acquisition
  - Off-line
  - On-line
- Semantic knowledge: exploitation
  - Case studies
- Conclusions

25 Static Knowledge: File System Layout
- Challenge: how to discover layout information?
- White-box approach: embed knowledge in the SDS
  - Trend: FS layout does not change frequently
[Diagram: on-disk layout across two groups; Group 1 holds the superblock, inode bitmap, data bitmap, inodes, and data, and Group 2 repeats the bitmaps, inodes, and data]

26 Layout Discovery with EOF
- EOF: Extraction Of File-systems
  - A tool to automatically determine layout
  - Uses gray-box techniques
- Basic operation
  - Start with a "soft" model of the file system
  - Probe process (P): initiates traffic
  - SDS: monitors activity from the FS
- Two distinct tasks:
  - Classifying blocks by type
  - Identifying fields within an inode
- Result: a "hardened" model of file system structures and fields
[Diagram: the probe process P drives the file system while the SDS observes the resulting block traffic]

27 EOF: More Details
- Multi-phase procedure:
  - Bootstrap: summary blocks
  - Data / data bitmaps
  - Inodes / inode bitmaps
  - Inode fields, directory entries
- Key techniques
  - Known patterns: data blocks
  - Isolation: know all but one block; that one block must be…
  - Assertions: check assumptions at each step

28 EOF: Simplified Example
- Create a file: touches many data structures
  - Directory data, directory inode, file data (known pattern), file inode, data bitmap, inode bitmap
- Reset to the beginning of the file, write the block again
  - Only file data (known pattern) and file inode are touched
- Now the inode block can be classified (isolation)
  - Assertion: only two blocks observed
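A probe-side sketch of this example, assuming the file system is mounted so that each step's writes reach the SDS promptly (e.g., via fsync). The SDS-side classification logic is not shown; the function name, pattern value, and block size are illustrative.

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define BLOCK_SIZE 4096

int probe_inode_isolation(const char *dir)
{
    char path[4096], pattern[BLOCK_SIZE];
    memset(pattern, 0xAB, sizeof(pattern));       /* known pattern the SDS looks for */
    snprintf(path, sizeof(path), "%s/probe-file", dir);

    /* Step 1: create + write; touches directory data/inode, file data/inode,
     * data bitmap, and inode bitmap. */
    int fd = open(path, O_CREAT | O_WRONLY, 0644);
    if (fd < 0)
        return -1;
    write(fd, pattern, sizeof(pattern));
    fsync(fd);

    /* Step 2: rewind and rewrite the same block; now only the file's data
     * block (known pattern) and its inode are written, so the SDS can
     * classify the remaining block as an inode block by isolation. */
    lseek(fd, 0, SEEK_SET);
    write(fd, pattern, sizeof(pattern));
    fsync(fd);
    close(fd);
    return 0;
}
```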

29 EOF: Overhead and Summary
- Performance: a few minutes per GB
  - Probably OK; only done "once" per new file system
  - Scales well with faster disks (sequential bandwidth)
- Limitations: "FFS"-like file systems (ext2/3, BSD FFS)

30 Have Knowledge, Will Innovate
- Knowing the structures is not enough (sometimes)
  - Data block overloading (data, pointer, directory)
  - High-level operations not known (create, delete)
- Requires new on-line techniques
  - Direct classification
  - Indirect classification
  - Block association
  - Operation inferencing

31 A Simple Example: Smarter Caching
- A modern RAID may have a significant cache
  - Volatile (DRAM)
  - Non-volatile (NVRAM)
- How to exploit semantic information to cache more intelligently?
[Diagram: the file system above an SDS containing a cache ($)]

32 Storing Meta-Data in NVRAM
- Start with simple meta-data: inodes, bitmaps, etc.
- Good for meta-data intensive workloads
[Diagram: meta-data blocks (superblock, bitmaps, inodes) from the on-disk layout are held in the NVRAM cache]

33 Direct Classification
- Given an address, determine its type directly
- Direct classification via bounds check
  - Given a disk address, check bounds to determine its type (superblock, bitmaps, inodes, general data block)
[Diagram: each block type occupies a fixed range within every group's layout]
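A minimal sketch of a bounds-check classifier, assuming an ext2-like layout whose per-group region boundaries were learned off-line (e.g., by EOF). The constants and the exact region order are illustrative, not read from a real superblock.

```c
#include <stdint.h>

enum block_type { SUPERBLOCK, INODE_BITMAP, DATA_BITMAP, INODE_BLOCK, DATA_BLOCK };

#define BLOCKS_PER_GROUP   32768   /* assumed layout parameters */
#define INODE_TABLE_BLOCKS 512

enum block_type classify(uint64_t blkno)
{
    uint64_t off = blkno % BLOCKS_PER_GROUP;   /* offset within the group */

    if (off == 0)
        return SUPERBLOCK;                     /* summary block at the start of the group */
    if (off == 1)
        return INODE_BITMAP;
    if (off == 2)
        return DATA_BITMAP;
    if (off < 3 + INODE_TABLE_BLOCKS)
        return INODE_BLOCK;                    /* falls inside the inode table */
    return DATA_BLOCK;                         /* everything else is general data */
}
```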

34 Getting Rid Of The Dead
- If a file's blocks are deleted, remove them from the cache
  - No need to keep dead blocks around
- Problem: how to determine that a file has been deleted?
  - Need to look for signs of deletion
- Three different places to look:
  - Inode bitmaps
  - The directory that contains the file
  - The inode itself
- Operation inferencing via block differencing

35 Operation Inferencing: Detecting Deletes (Inode Bitmap)
[Diagram: when the FS writes a new inode-bitmap block, the SDS reads its cached old version and diffs the two; the result identifies deleted files]
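A sketch of the differencing step, assuming the SDS keeps the previously written version of each inode-bitmap block in its cache. The callback names, block size, and inode numbering convention are illustrative.

```c
#include <stdint.h>
#include <stddef.h>

#define BLOCK_SIZE 4096

/* Invoked when the file system writes an inode-bitmap block. Any bit that
 * was 1 in the old copy and is 0 in the new copy corresponds to a freed
 * (deleted) inode; report each one to the caller. */
void on_inode_bitmap_write(const uint8_t *old_blk, const uint8_t *new_blk,
                           uint32_t first_inode_in_group,
                           void (*inode_deleted)(uint32_t ino))
{
    for (size_t byte = 0; byte < BLOCK_SIZE; byte++) {
        uint8_t cleared = old_blk[byte] & ~new_blk[byte];   /* 1 -> 0 transitions */
        for (int bit = 0; bit < 8; bit++) {
            if (cleared & (1u << bit))
                inode_deleted(first_inode_in_group + (uint32_t)(byte * 8 + bit));
        }
    }
}
```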

36 Operation Inferencing: Overheads
- Space overhead
  - Block cache of inodes, indirect pointers, bitmaps, etc. (could be substantial)
- Time overhead
  - CPU: the difference operation is like an extra copy
  - Disk: may require a block read (if the cache is small or absent)
- [In the paper: quantified time and space overheads]
- Main point: there is a CPU and memory cost

37 Case Studies

38 Experimental Set-up
- Problem: we don't have SDS hardware to use (yet!)
- "Cost-effective" alternative: a software prototype
  - Insert a driver underneath the FS, much like software RAID
- Good because… the traffic stream is similar
- Bad because… CPU and memory are not isolated from the host
[Diagram: the file system and the SDS driver both run inside the host OS]

39 Fast RAID Reconstruction
- Observe: when reconstructing data onto a hot spare, there is no need to reconstruct data that isn't live
- Trend: less live data in performance-sensitive I/O systems
- Question: how can we perform reconstruction quickly?
[Diagram: a mirrored pair being reconstructed onto a hot spare]

40 Traditional Approaches
- Why not in the file system?
  - The file system doesn't know what RAID is
- Why not in the storage system?
  - The RAID doesn't know which blocks are live (minimally it does, if a block has never been written)

41 The Semantic Way
- Easy: scan the disk, copy only live blocks
  - Key piece of knowledge: the bitmaps
  - Plus, need to watch for "unmapped" writes
- Optionally, copy "dead" blocks later
  - Useful if the SDS doesn't feel "sure" about its knowledge
  - Guaranteed correct with prioritized recovery
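A sketch of live-only reconstruction driven by the data bitmaps. It assumes helpers that read a group's data bitmap and rebuild a single block onto the hot spare (e.g., by copying from the surviving mirror); those helper names and the group size are illustrative.

```c
#include <stdbool.h>
#include <stdint.h>

#define BLOCKS_PER_GROUP 32768

extern void read_data_bitmap(int group, uint8_t *bitmap /* BLOCKS_PER_GROUP/8 bytes */);
extern void reconstruct_block(uint64_t blkno);   /* rebuild this block onto the hot spare */

void fast_reconstruct(int num_groups)
{
    uint8_t bitmap[BLOCKS_PER_GROUP / 8];

    for (int g = 0; g < num_groups; g++) {
        read_data_bitmap(g, bitmap);
        for (uint64_t b = 0; b < BLOCKS_PER_GROUP; b++) {
            bool live = bitmap[b / 8] & (1u << (b % 8));
            if (live)                                   /* skip dead blocks entirely */
                reconstruct_block((uint64_t)g * BLOCKS_PER_GROUP + b);
        }
    }
    /* Optionally, a background pass copies the remaining "dead" blocks later,
     * keeping the result correct even if the bitmap knowledge was uncertain. */
}
```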

42 Fast Reconstruction: A Graph
- Fast reconstruction: less live data -> less time
- How data is spread across the disk affects recovery time
[Graph: reconstruction time vs. amount of live data; RAID-5, IBM disks]

43 Semantic Conclusions
- Innovation in the traditional storage stack is limited
  - File system: high-level but not low-level info
  - Storage system: low-level but not high-level info
- Semantically-smart disks: the best of both worlds?
  - Take advantage of "smart" disk systems
  - Exploit low-level information…
  - …with high-level knowledge of the file system
- A remaining challenge
  - Overcoming the "file system obfuscation" problem

44 Outline
- Overview: knowledge and its applications
- Gray-box file placement
- Semantically-smart disks
- Scientific apps, the Grid, and I/O
- Conclusions

45 Trends in Scientific Computing
- What constitutes a job is increasingly complex
  - Not your simple process anymore
- Data demands are increasing
  - Not just cycles anymore
- Wide-area collaboration
  - "Grids" facilitate sharing

46 The Question
- How do we run scientific workloads on the WAN?
[Diagram: a home site connected over the WAN to remote execution sites]

47 Scientific Outline
- Typical "scientific" jobs
  - Structure
  - Properties
- Migratory file services
  - Components
  - Performance
- Conclusions

48 First Things First
- A study of modern scientific applications
  - A "measure then build" approach
- Suite of six applications
  - BLAST: searches genomic databases for matching proteins
  - IBIS: global-scale simulation of earth systems
  - CMS: high-energy physics testing software
  - Nautilus: simulation of molecular dynamics
  - Messkit Hartree-Fock: simulation of atomic interactions
  - AMANDA: astrophysics simulation of cosmic events

49 An Example: AMANDA
- A single "job" is a multi-process pipeline -> batch-pipelined
  - Each process is a blue circle in the diagram
- There are many types of I/O
  - Endpoint (red): unique input/output of the pipeline
  - Pipeline private (green): shared between processes in a pipe
  - Batch shared (yellow): shared across all pipes in the batch
[Diagram: the AMANDA pipeline annotated with per-stage runtimes (42s to 3601s) and data sizes (4 KB to 505 MB)]

50 Some Things We Learned
- Demands of a single pipeline are modest
  - A modern PC with a disk can handle the demand
  - Aggregation of I/O could be harder (over the WAN)
- Lots of sharing of data within and across pipelines
  - Systems should (have to?) take advantage of this

51 Towards Systems Support

52 Systems Support
- Need to build systems support for global execution
  - Should support "batch-pipelined" jobs effectively
- Goals
  - Performance: throughput is what matters (NOT simple metrics like "availability")
  - Failures: must be handled effectively (again, with the goal of improving performance)

53 Migratory File Services
- A migratory file service
  - An I/O environment for "batch-pipelined" workloads
  - Integrates performance and failure management
  - Key: an understanding of the workloads
- Three pieces of implementation
  - Virtual batch overlay
  - Migratory proxies
  - Workflow manager

54 The Virtual Batch Overlay
- Want a familiar and controllable remote environment
  - But we are often stuck with a particular queueing system
  - Further, we cannot assume all relevant software is installed
- Glide in our own "virtual batch system"
  - On each node, run a master (M), a virtual machine (VM), and a migratory proxy (MP, described next)

55 Migratory Proxies
- Migratory proxies: run on each remote node
  - Fetch and cache data from the home node
  - Cooperative cache for batch inputs
  - Localize I/O that is pipeline-local
[Diagram: a job (J) on a remote node talks to its migratory proxy, which reaches the home node over the WAN]

56 Workflow Manager
- Where workload knowledge is encapsulated
- Takes a workflow description
  - Job dependencies
  - File indicators
- Runs each job while taking failures into account
  - Transactional management
  - Proxy failure and job failure are not catastrophic (just rerun the job!)
  - Proactive data replication

57 Performance
- By exploiting knowledge, an order-of-magnitude improvement over the naïve approach

58 Outline
- Overview: knowledge and its applications
- Gray-box file placement
- Semantically-smart disks
- Scientific apps, the Grid, and I/O
- Conclusions

59 The theme: Knowledge is power
- If you know how the FS decides on file layout, you can control it (PLACE)
- If you know the details of FS on-disk structures, you can gain FS-level knowledge behind a block-based interface (semantic disks)
- If you know something about workloads and their I/O behaviors, you can optimize performance and handle failures gracefully

60 "Beware of false knowledge; it is more dangerous than ignorance." (Bernard Shaw)
http://www.cs.wisc.edu/wind

