Knowledge is Power
Remzi Arpaci-Dusseau, University of Wisconsin, Madison

Systems Without Knowledge
- System designers often have limited knowledge:
  - About the applications they run
  - About the other systems they interact with
- Result: the "curse of generality"
  - Missed performance optimizations
  - Limited functionality
  - Costly, too

Didacticism and Systems
- How to gain knowledge? Depends on the environment.
- Sometimes it's easy: a scientific application w/ cooperative developers
- Sometimes it's not: the internals of a Microsoft file system

What We Do
- Build systems that acquire and exploit knowledge
- "Gray-box" techniques: make assumptions, probe and measure, learn how the underlying system works
- Use knowledge to control systems in unexpected ways
- Result: increased functionality, improved performance, better robustness and manageability

Outline
- Overview
- Knowledge and its applications:
  - Gray-box file placement
  - Semantically-smart disks
  - Scientific apps, the Grid, and I/O
- Conclusions

The People
- Gray-box file placement: with James Nugent, Andrea Arpaci-Dusseau
- Semantically-smart disks: with Muthian Sivathanu, Vijayan Prabhakaran, Florentina Popovici, Tim Denehy, Andrea Arpaci-Dusseau
- Scientific apps, the Grid, and I/O: with John Bent, Doug Thain, Andrea Arpaci-Dusseau, Miron Livny

Gray-box Control over File Placement

Controlled File Placement
- Typical "Unix" file system: little control over layout
  - Just a simple API of open(), read(), write(), close()
- Some applications want more control
  - e.g., a web server that knows which files are often accessed together
- Usual default: use the raw disk
  - Harder to manage, doesn't integrate w/ other apps

What Might Be Better
- Use the normal file system (convenience)
- Expose control over layout to applications (control)
- Do the above without changing the file system (can't always change the system you're using)

PLACE
- A gray-box "Information and Control Layer" (ICL): it's just a library
- Simple API for file placement; exposes "FFS-like" groups
  - Place_Creat(file, mode, groupNumber);
- No changes to the underlying file system
[Figure: processes (P) layered on the PLACE library, which sits atop the file system]

PLACE Outline
- Basic operation: gray-box knowledge, key techniques
- Assessment: accuracy, performance
- Conclusions

Allocation Knowledge
- Gray-box assumption: "FFS"-like allocation
  - Splits the disk into numerous consecutive "groups"
  - Spreads directories across groups
  - Puts files (inodes/data) that are within the same directory into the same "group"
- Many variants
  - Our focus: ext2 (but with other variants in mind)

Exploiting Knowledge for Control
- Key structure: Shadow Directory Tree (SDT)
- To create a file /foo/bar in group 1:
  - Create the file /.H/1/bar
  - Rename /.H/1/bar to /foo/bar (sketched below)
[Figure: SDT rooted at /.H with per-group subdirectories 1, 2, ..., n; bar is created under /.H/1/ and then renamed into /foo/]
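Below is a minimal C sketch of this creat-then-rename trick, assuming an SDT rooted at /.H with one subdirectory per group; the place_creat() wrapper and its error handling are illustrative, not PLACE's actual code. Within a single file system, rename() only rewires directory entries, so the blocks allocated in the shadow group stay put.

```c
/* Sketch of the creat-then-rename trick (assumptions noted above).
 * The file is first created in the shadow directory for the target
 * group, so the FS allocates its inode and data there, then renamed
 * to its real path; rename() does not move the on-disk allocation. */
#include <fcntl.h>
#include <libgen.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/stat.h>

int place_creat(const char *path, mode_t mode, int group)
{
    char tmp[4096], shadow[4096];

    /* basename() may modify its argument, so work on a copy of path. */
    strncpy(tmp, path, sizeof(tmp) - 1);
    tmp[sizeof(tmp) - 1] = '\0';
    snprintf(shadow, sizeof(shadow), "/.H/%d/%s", group, basename(tmp));

    int fd = open(shadow, O_CREAT | O_EXCL | O_WRONLY, mode);
    if (fd < 0) {
        perror("open(shadow)");
        return -1;
    }
    /* Move the file to its intended name; its blocks stay in 'group'. */
    if (rename(shadow, path) < 0) {
        perror("rename");
        close(fd);
        unlink(shadow);
        return -1;
    }
    return fd;
}

int main(void)
{
    int fd = place_creat("/foo/bar", 0644, 1);  /* place /foo/bar in group 1 */
    if (fd >= 0)
        close(fd);
    return fd >= 0 ? 0 : 1;
}
```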

Challenge: Building the SDT
- How to ensure that the shadow directory for each group K lands in the right on-disk location?
- Basic approach to creating a directory in group K (sketched below):
  - Repeat: Mkdir(tmp); if tmp is in the desired group, break; otherwise Bias()
- Point of portability: the Bias() routine
  - Must account for different allocation algorithms
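A sketch of that retry loop under the slide's assumptions: in_group() and bias() are hypothetical stand-ins for PLACE's real probes, and the inodes-per-group value is assumed to have been discovered beforehand (on an ext2-like FS, a directory's group can be derived from its inode number).

```c
/* Sketch: create a shadow directory and retry until it lands in the
 * desired group. in_group() and bias() are hypothetical stand-ins. */
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

static long inodes_per_group = 16384;  /* assumed known (e.g., from probing) */

static int in_group(const char *dir, int group)
{
    struct stat st;
    if (stat(dir, &st) < 0)
        return 0;
    /* ext2-style inode numbers start at 1; map inode -> group. */
    return (long)(st.st_ino - 1) / inodes_per_group == group;
}

static void bias(void)
{
    /* Placeholder: nudge the allocator (e.g., create and remove dummy
     * directories) so the next mkdir lands in a different group. The
     * real routine must account for the FS's allocation algorithm. */
}

int make_shadow_dir(const char *path, int group)
{
    for (int tries = 0; tries < 1000; tries++) {
        if (mkdir(path, 0755) < 0)
            return -1;
        if (in_group(path, group))
            return 0;               /* landed in the desired group */
        rmdir(path);                /* wrong group: remove and retry */
        bias();
    }
    return -1;                      /* gave up */
}

int main(void)
{
    /* Illustrative: build the shadow directory for group 1. */
    return make_shadow_dir("/.H/1", 1) == 0 ? 0 : 1;
}
```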

Some Complications
- Controlled directory placement
  - Similar to system initialization (hence, slow)
  - To speed up, use a shadow cache of directories
- Crash recovery
  - A crash may leave junk in the SDT
  - A periodic sweep by the SDT cleaner fixes this
- Level of control depends on the underlying FS
  - e.g., FFS vs. ext2 behavior for large files

Assessment

Does it work?
- Non-PLACE: 250 files in 1 directory
- Non-PLACE: 250 files in 10 directories
- Non-PLACE: 250 files in 100 directories
- PLACE: 250 files in 100 directories, placed into 1 group

Performance (Small Files)
- Performance of KB-sized file reads (random)

Performance (Big Files)
- Each point: bandwidth attained reading a 100-MB file

PLACE Conclusions
- PLACE: a gray-box approach to file layout
  - Simple and effective control over placement
- Main technique: the Shadow Directory Tree
  - Used to control placement
  - Construction and maintenance are the keys
- Controlled layout can improve performance
  - Micro-benchmarks
  - Web server and I/O parameterization (see USENIX '03)

Outline
- Overview
- Knowledge and its applications:
  - Gray-box file placement
  - Semantically-smart disks
  - Scientific apps, the Grid, and I/O
- Conclusions

Semantically-smart Disk Systems

Semantically-Smart Disk System (SDS)
- A disk system that understands the file system: its data structures and operations
- Operates underneath an unmodified FS
  - Must discover layout + on-disk structures
  - Must "reverse engineer" the block stream
- Exploits knowledge and "smarts" to implement a new class of services
[Figure: file system above an SDS containing a cache ($) and CPU]

SDS Outline
- Semantic knowledge: acquisition
  - Off-line
  - On-line
- Semantic knowledge: exploitation
  - Case studies
- Conclusions

Static Knowledge: File System Layout
- Challenge: how to discover layout information?
- White-box approach: embed knowledge in the SDS
  - Trend: FS layout does not change frequently
[Figure: on-disk layout: superblock, then per-group inode bitmap, data bitmap, inodes, and data blocks (Group 1, Group 2)]

Layout Discovery with EOF
- EOF: Extraction Of File-systems
  - A tool to automatically determine layout, using gray-box techniques
- Basic operation
  - Start with a "soft" model of the file system
  - Probe process (P): initiates traffic
  - SDS: monitors the resulting activity from the FS
- Two distinct tasks: classifying blocks by type, identifying fields within an inode
- Result: a "hardened" model of file-system structures + fields
[Figure: probe process (P) above the file system, with the SDS monitoring below]

EOF: More Details
- Multi-phase procedure:
  - Bootstrap: summary blocks
  - Data / data bitmaps
  - Inodes / inode bitmaps
  - Inode fields, directory entries
- Key techniques
  - Known patterns: data blocks
  - Isolation: know all but one block; the one remaining block must be…
  - Assertions: check assumptions at each step

EOF: Simplified Example
- Create a file: touches many data structures
  - Directory data, directory inode, file data (known pattern), file inode, data bitmap, inode bitmap
- Reset to the beginning of the file, write the block again (sketched below)
  - Only file data (known pattern) and file inode are touched
- Now the inode block can be classified (isolation)
  - Assertion: only two blocks observed
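A sketch of the probe side of this example, with an illustrative file name and fill pattern: write a pattern-filled file and sync it, then rewrite its first block and sync again, so the monitoring SDS can isolate the inode block from the second, smaller write set.

```c
/* Probe-side sketch for the EOF example; the path and pattern are
 * illustrative. Pass 1 dirties directory data, directory inode, file
 * data (known pattern), file inode, and both bitmaps; pass 2 rewrites
 * the same block, dirtying only file data and file inode. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define BLOCK_SIZE 4096

int main(void)
{
    char block[BLOCK_SIZE];
    memset(block, 0xAB, sizeof(block));    /* known, recognizable pattern */

    /* Path on the file system being probed (illustrative). */
    int fd = open("/mnt/probe/eof_probe", O_CREAT | O_TRUNC | O_WRONLY, 0644);
    if (fd < 0) { perror("open"); return 1; }

    /* Pass 1: create and write the file, then force it to disk. */
    if (write(fd, block, sizeof(block)) != sizeof(block)) { perror("write"); return 1; }
    fsync(fd);

    /* Pass 2: seek back and rewrite the same block, then sync again. */
    lseek(fd, 0, SEEK_SET);
    if (write(fd, block, sizeof(block)) != sizeof(block)) { perror("write"); return 1; }
    fsync(fd);

    close(fd);
    return 0;
}
```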

EOF: Overhead and Summary
- Performance: a few minutes per GB
  - Probably OK; only done "once" per new file system
  - Scales well with faster disks (sequential bandwidth)
- Limitations: "FFS"-like file systems (ext2/3, BSD FFS)

Have Knowledge, Will Innovate
- Knowing the structures is (sometimes) not enough
  - Data-block overloading (data, pointer, directory)
  - High-level operations not known (create, delete)
- Requires new on-line techniques
  - Direct classification
  - Indirect classification
  - Block association
  - Operation inferencing

A Simple Example: Smarter Caching
- A modern RAID may have a significant cache
  - Volatile (DRAM)
  - Non-volatile (NVRAM)
- How to exploit semantic information to cache more intelligently?
[Figure: file system above an SDS with a cache ($)]

Storing Meta-Data in NVRAM
- Start with simple meta-data: inodes, bitmaps, etc.
- Good for meta-data-intensive workloads
[Figure: on-disk layout with the superblock, bitmaps, and inode blocks held in the NVRAM cache]

Direct Classification
- Given an address, determine the block's type directly
- Direct classification via bounds check (sketched below)
  - Given a disk address, check bounds within its group to determine the type (superblock, bitmaps, inodes, general data block)
[Figure: per-group layout regions: superblock, inode bitmap, data bitmap, inodes, data]
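A minimal sketch of such a bounds check against an ext2-like per-group layout; the region offsets, sizes, and blocks-per-group constant are placeholders that EOF (or embedded white-box knowledge) would supply, not real ext2 parameters.

```c
/* Sketch: classify a disk block by bounds-checking its offset within an
 * ext2-like group. All layout constants are placeholders. */
#include <stdio.h>

enum block_type { SUPERBLOCK, INODE_BITMAP, DATA_BITMAP, INODE_BLOCK, DATA_BLOCK };

#define BLOCKS_PER_GROUP   32768L   /* placeholder */
#define INODE_TABLE_BLOCKS 512L     /* placeholder */

static enum block_type classify(long block)
{
    long off = block % BLOCKS_PER_GROUP;   /* offset within the block's group */

    if (off == 0)
        return SUPERBLOCK;                 /* superblock (or its per-group copy) */
    if (off == 1)
        return INODE_BITMAP;
    if (off == 2)
        return DATA_BITMAP;
    if (off < 3 + INODE_TABLE_BLOCKS)
        return INODE_BLOCK;
    return DATA_BLOCK;                     /* everything else: general data */
}

int main(void)
{
    printf("block 40000 classified as type %d\n", (int)classify(40000));
    return 0;
}
```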

Getting Rid Of The Dead
- If a file's blocks are deleted, remove them from the cache
  - No need to keep dead blocks around
- Problem: how to determine that a file has been deleted?
  - Need to look for signs of deletion
  - Three different places to look: the inode bitmap, the directory that contains the file, the inode itself
- Technique: operation inferencing via block differencing

Operation Inferencing: Detecting Deletes (Inode Bitmap)
- When an inode-bitmap block is written, the SDS reads the cached old version and diffs it against the new one (sketched below)
- Bits that change from allocated to free identify the deleted files
[Figure: SDS diffs the newly written inode bitmap against the old version; result: deleted files]
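A sketch of the differencing step, assuming the SDS has cached the previous version of the inode-bitmap block; the bitmap size and inode numbering below are illustrative.

```c
/* Sketch: infer deleted files by diffing a newly written inode-bitmap
 * block against the cached old version. A bit that flips from 1
 * (allocated) to 0 (free) means that inode's file was deleted. */
#include <stdio.h>

#define BITMAP_BYTES 4096

static void detect_deletes(const unsigned char *old_bm,
                           const unsigned char *new_bm, long first_inode)
{
    for (long i = 0; i < BITMAP_BYTES; i++) {
        unsigned char cleared = old_bm[i] & ~new_bm[i];   /* bits that went 1 -> 0 */
        for (int b = 0; b < 8; b++)
            if (cleared & (1u << b))
                printf("inode %ld deleted\n", first_inode + i * 8 + b);
    }
}

int main(void)
{
    static unsigned char old_bm[BITMAP_BYTES], new_bm[BITMAP_BYTES];
    old_bm[0] = 0x07;   /* inodes 11, 12, 13 allocated in the old version */
    new_bm[0] = 0x05;   /* inode 12's bit cleared in the new version */
    detect_deletes(old_bm, new_bm, 11);   /* prints: inode 12 deleted */
    return 0;
}
```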

Operation Inferencing: Overheads
- Space overhead
  - Block cache of inodes, indirect pointers, bitmaps, etc. (could be substantial)
- Time overhead
  - CPU: the difference operation is like an extra copy
  - Disk: may require a block read (if the cache is small or absent)
- [In paper: quantified time and space overheads]
- Main point: there is a CPU and memory cost

Case Studies

Experimental Set-up
- Problem: don't have SDS hardware to use (yet!)
- "Cost-effective" alternative: a software prototype
  - Insert a driver underneath the FS, much like software RAID
  - Good because: the traffic stream is similar
  - Bad because: CPU and memory are not isolated from the host
[Figure: file system above the SDS driver, both inside the host OS]

Fast RAID Reconstruction
- Observe: when reconstructing data onto a hot spare, there is no need to reconstruct data that isn't live
- Trend: less live data in performance-sensitive I/O systems
- Question: how can we perform reconstruction quickly?
[Figure: mirrored disks reconstructing onto a hot spare]

Traditional Approaches
- Why not in the file system? The file system doesn't know what RAID is.
- Why not in the storage system? The RAID doesn't know which blocks are live (at best, it knows a block is dead only if it has never been written).

The Semantic Way
- Easy: scan the disk, copy only the live blocks (sketched below)
  - Key piece of knowledge: the bitmaps
  - Plus, need to watch for "unmapped" writes
- Optionally, copy "dead" blocks later
  - Useful if the SDS doesn't feel "sure" about its knowledge
  - Guaranteed correct with prioritized recovery
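A sketch of bitmap-guided reconstruction under these assumptions; the layout constants and the read_data_bitmap(), reconstruct_block(), and write_spare() helpers are hypothetical hooks into the array, stubbed out here so the sketch compiles.

```c
/* Sketch: copy only live blocks to the hot spare, guided by the file
 * system's data bitmaps. Constants and helpers are placeholders. */
#include <stdint.h>
#include <string.h>

#define BLOCKS_PER_GROUP 32768L              /* placeholder layout constant */
#define BITMAP_BYTES (BLOCKS_PER_GROUP / 8)
#define BLOCK_SIZE 4096

/* Stubbed hooks into the array internals. */
static void read_data_bitmap(long group, uint8_t *bm) {
    (void)group; memset(bm, 0xFF, BITMAP_BYTES);   /* pretend all blocks live */
}
static void reconstruct_block(long block, uint8_t *buf) {
    (void)block; memset(buf, 0, BLOCK_SIZE);       /* stand-in for XOR rebuild */
}
static void write_spare(long block, const uint8_t *buf) {
    (void)block; (void)buf;                        /* stand-in for spare write */
}

static void reconstruct_group(long group)
{
    uint8_t bm[BITMAP_BYTES], buf[BLOCK_SIZE];

    read_data_bitmap(group, bm);
    for (long i = 0; i < BLOCKS_PER_GROUP; i++) {
        if (!(bm[i / 8] & (1u << (i % 8))))
            continue;                              /* dead block: skip (or copy later) */
        long block = group * BLOCKS_PER_GROUP + i;
        reconstruct_block(block, buf);             /* rebuild from surviving disks */
        write_spare(block, buf);                   /* place it on the hot spare */
    }
}

int main(void)
{
    for (long g = 0; g < 4; g++)                   /* a few groups, for illustration */
        reconstruct_group(g);
    return 0;
}
```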

Fast Reconstruction: A Graph
- Fast reconstruction: less live data -> less time
- How data is spread across the disk affects recovery time
- Setup: RAID-5, IBM disks

Semantic Conclusions
- Innovation in the traditional storage stack is limited
  - File system: high-level but not low-level information
  - Storage system: low-level but not high-level information
- Semantically-smart disks: the best of both worlds?
  - Take advantage of "smart" disk systems
  - Exploit low-level information… …with high-level knowledge of the file system
- A remaining challenge: overcoming the "file system obfuscation" problem

Outline
- Overview
- Knowledge and its applications:
  - Gray-box file placement
  - Semantically-smart disks
  - Scientific apps, the Grid, and I/O
- Conclusions

Trends in Scientific Computing
- What constitutes a job is increasingly complex
  - Not your simple process anymore
- Data demands are increasing
  - Not just cycles anymore
- Wide-area collaboration
  - "Grids" facilitate sharing

The Question
- How to run scientific workloads on the WAN?
[Figure: home site connected to remote sites over the WAN]

Scientific Outline
- Typical "scientific" jobs
  - Structure
  - Properties
- Migratory file services
  - Components
  - Performance
- Conclusions

First Things First
- Study of modern scientific applications: a "measure, then build" approach
- Suite of six applications:
  - BLAST: searches genomic databases for matching proteins
  - IBIS: global-scale simulation of earth systems
  - CMS: high-energy physics testing software
  - Nautilus: simulation of molecular dynamics
  - Messkit Hartree-Fock: simulation of atomic interactions
  - AMANDA: astrophysics simulation of cosmic events

An Example: AMANDA
- A single "job" is a multi-process pipeline -> batch-pipelined
- There are many types of I/O:
  - Endpoint: unique input/output of the pipeline
  - Pipeline-private: shared between processes of one pipeline
  - Batch-shared: shared across all pipelines in the batch
[Figure: AMANDA pipeline stages (each process a circle), annotated with per-stage I/O volumes (KB to hundreds of MB) and runtimes (tens of seconds to about an hour)]

Some Things We Learned
- Demands of a single pipeline are modest
  - A modern PC with a disk can handle the demand
  - Aggregate I/O could be harder to handle (over the WAN)
- Lots of sharing of data within and across pipelines
  - Systems should (have to?) take advantage of this

Towards Systems Support

Systems Support
- Need to build systems support for global execution
  - Should support "batch-pipelined" jobs effectively
- Goals
  - Performance: throughput is what matters (NOT simple metrics like "availability")
  - Failures: must be handled effectively (again, with the goal of improving performance)

Migratory File Services
- Migratory file service: an I/O environment for "batch-pipelined" workloads
  - Integrates performance and failure management
  - Key: understanding of workloads
- Three pieces of implementation
  - Virtual batch overlay
  - Migratory proxies
  - Workflow manager

The Virtual Batch Overlay
- Want a familiar and controllable remote environment
  - But we are often stuck w/ a particular queueing system
  - Further, cannot assume all relevant s/w is installed
- Glide in our own "virtual batch system"
  - On each node, run a master (M), virtual machine (VM), and migratory proxy (MP, described next)

Migratory Proxies
- Migratory proxies run on each remote node
  - Fetch and cache data from the home node
  - Cooperative cache for batch inputs
  - Localize I/O that is pipeline-local
[Figure: jobs (J) on remote nodes talk to their local master/VM/migratory proxy, which reaches the home node over the WAN]

Workflow Manager
- Where workload knowledge is encapsulated
- Takes a workflow description
  - Job dependencies
  - File indicators
- Runs each job while taking failures into account
  - Transactional management
  - Proxy failure and job failure are not catastrophic (just rerun the job!)
  - Proactive data replication

Performance
- By exploiting knowledge, an order-of-magnitude improvement over the naïve approach

Outline
- Overview
- Knowledge and its applications:
  - Gray-box file placement
  - Semantically-smart disks
  - Scientific apps, the Grid, and I/O
- Conclusions

The theme: Knowledge is power
- If you know how the FS decides on file layout, you can control it (PLACE)
- If you know the details of the FS's on-disk structures, you can gain FS-level knowledge behind a block-based interface (semantic disks)
- If you know something about workloads and their I/O behaviors, you can optimize performance and handle failures gracefully

“Beware of false knowledge; it is more dangerous than ignorance.” Bernard Shaw