Filesystem Optimizations for Static Content Multimedia Servers
Review of academic papers for TDC573
Jeff Absher

Papers Reviewed

Implementation and Evaluation of EXT3NS Multimedia File System
- Baik-Song Ahn, Sung-Hoon Sohn, Chei-Yol Kim, Gyu-Il Cha, Yun-Cheol Baek, Sung-In Jung, Myung-Joon Kim
- Presented at the 12th Annual ACM International Conference on Multimedia, October 10–16, 2004, New York, New York, USA

The Tiger Shark File System
- Roger L. Haskin, Frank B. Schmuck
- IBM Journal of Research and Development, 1998

What is the problem?

A multimedia server with relatively static content:
- Prerecorded movies
- Audio
- Lectures
- Commercials

- End users can start/stop/pause/seek.
- Many different simultaneous users.
- Massive transfer of data from disks to the NIC.
- We can safely avoid focusing on real-time writes: it is a server, and we assume data is collected non-real-time. (Both systems could easily be extended within their scope to handle real-time writes.)
- Should be backward compatible for legacy requests.

Scope Limitations and Design Goals

Limitations:
- Single server, or a cluster with a single shared set of disks; no distributed nodes. (Distributed filesystems, P2P filesystems, and related areas are slightly different lines of research.)
- Single local filesystem, which may consist of an array of multiple disks.

Design goals, in order of importance:
- Pump as much data as possible from the disks to the NICs, chiefly by avoiding kernel memcpys.
- Fast seeking.
- Quick recoverability for very large filesystems (journaling).
- Legacy compatibility.

Problems with the “old” filesystem block transfer to the NIC in the network-server context (simplified):
- Multiple memcpy() calls across the user/kernel boundary.
- Disk blocks optimized for small files.
- Many context switches: the kernel must be involved in both reading from the disk and writing to the NIC.
- Bus contention with other I/O.
- The block cache lives in main memory, which may not be fast enough from a hardware perspective.
- Data may be slow to “bubble down” the networking layer due to redirectors, routing, etc.
- Checksum calculations and the like for networking happen in software.
The copy-heavy path is sketched below.
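
A minimal sketch of the classic serving loop, assuming 'disk_fd' is an open file and 'sock' a connected socket (both hypothetical names). Every block crosses the user/kernel boundary twice, with two system calls per block; Linux's sendfile(2) later removed the user-space detour, and the papers reviewed here go further still.

```c
#include <sys/sendfile.h>
#include <unistd.h>

/* Copy-heavy legacy path: kernel -> user buffer -> kernel socket. */
static int serve_copy_loop(int disk_fd, int sock)
{
    char buf[4096];                 /* typical legacy block size */
    ssize_t n;
    while ((n = read(disk_fd, buf, sizeof buf)) > 0)   /* kernel -> user */
        if (write(sock, buf, (size_t)n) != n)          /* user -> kernel */
            return -1;
    return (int)n;                  /* 0 at EOF, -1 on read error */
}

/* sendfile(2) keeps the data in kernel space the whole way. */
static int serve_sendfile(int disk_fd, int sock, off_t len)
{
    off_t off = 0;
    while (off < len)
        if (sendfile(sock, disk_fd, &off, (size_t)(len - off)) < 0)
            return -1;
    return 0;
}
```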

The newer MM filesystems: classes of requests

Both of the studied filesystems assign some type of class to FS requests; the minimum needed is two classes.

Legacy requests:
- Read/write data for small files, not needed quickly at the NIC.

High-performance requests:
- Read data for large, likely-contiguous files that must be dumped quickly to the NIC.
- This is similar to our newer networking paradigm: “not all traffic is equal.”
- An unaddressed question that I had: can we take the concept of discardability and apply it to filesystems?

Classes of requests

EXT3NS:
- Two classes, determined by an argument to the system call in a user buffer address. The fastpath class dumps data onto the NIC; the legacy class handles legacy filesystem requests.
- The data itself has no inherent class; the client process explicitly declares its class.

Tiger Shark:
- Real-time class, fine-grained into subclasses, because Tiger Shark has resource reservation and admission control: if the controllers and disks cannot handle the predicted load, the request is denied. (A sketch of such an admission check follows below.)
- Legacy class, with a legacy interface for old filesystem access interfaces.
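
An illustrative sketch of bandwidth-based admission control in the style of Tiger Shark's resource reservation; the structure and names here are my own, not from the paper. A stream is admitted only if its reserved rate still fits within the capacity benchmarked against the synthetic workload.

```c
#include <stdbool.h>

struct admission_ctl {
    long capacity_kbps;   /* measured against a synthetic workload */
    long reserved_kbps;   /* sum of rates already promised         */
};

/* Admit a new real-time stream only if existing guarantees survive. */
static bool admit_stream(struct admission_ctl *ac, long rate_kbps)
{
    if (ac->reserved_kbps + rate_kbps > ac->capacity_kbps)
        return false;                 /* deny: predicted overload */
    ac->reserved_kbps += rate_kbps;   /* reserve disk bandwidth   */
    return true;
}

/* Release the reservation when the stream ends. */
static void release_stream(struct admission_ctl *ac, long rate_kbps)
{
    ac->reserved_kbps -= rate_kbps;
}
```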

EXT3NS caching, quantization, and scheduling optimizations

- The hardware is designed for a minimum block size of 256 KB, up to a maximum of 2 MB; normal Linux block devices have a maximum block size of 4 KB.
- Some compromises were made in the disk metadata block design for SDA (what is SDA? the substitute for RAID) so that it stays compatible with EXT3FS.
- The large block sizes lead to a large maximum addressable file size: 275 GB at first-level indirection, ~2^53 bytes at maximal indirection. (The general indirection arithmetic is sketched below.)
- The memory contained on the NS card is actually a buffer in the current version of EXT3NS; the authors plan to add caching capability to it. (If you don't know the difference between a buffer and a cache, look it up!)
- Asynchronous I/O is not currently supported, but plans are in place.
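
For flavor, here is the general arithmetic by which block size drives maximum addressable file size in an ext2/ext3-style inode. The 4-byte pointer width and 12 direct slots are ext2-style assumptions, not SDA specifics, so the totals printed illustrate the trend rather than reproduce the paper's exact 275 GB / 2^53 B figures.

```c
#include <stdio.h>
#include <stdint.h>

int main(void)
{
    const uint64_t block   = 256 * 1024;  /* 256 KB SDA-sized block   */
    const uint64_t ptrs    = block / 4;   /* 4-byte pointers per block */
    const uint64_t ndirect = 12;          /* ext2-style direct slots   */

    uint64_t direct = ndirect * block;
    uint64_t single = ptrs * block;          /* first-level indirection */
    uint64_t twolvl = ptrs * ptrs * block;   /* second level            */

    printf("direct blocks:   %llu bytes\n", (unsigned long long)direct);
    printf("single indirect: %llu bytes\n", (unsigned long long)single);
    printf("double indirect: %llu bytes\n", (unsigned long long)twolvl);
    return 0;
}
```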

Tiger Shark caching, quantization, and scheduling optimizations

- “Deadline scheduling” instead of elevator algorithms. This is an interesting aspect of Tiger Shark: it benchmarks the hardware against a “synthetic workload” to determine the best order in which to schedule disk requests, and the best thresholds at which to start denying requests. (Deadline-ordered queueing is sketched below.)
- Block size is 256 KB by default; normal AIX uses 4 KB.
- Tiger Shark “chunks” contiguous block reads better than the default filesystems do, to work with its large block size.
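
A minimal sketch of deadline-ordered request queueing; this is my own illustration of the idea, and the paper's scheduler is more elaborate and tuned against the synthetic workload. Requests are dispatched by earliest deadline rather than by platter position, as an elevator algorithm would.

```c
#include <stddef.h>

struct disk_req {
    unsigned long block;        /* block to read                  */
    unsigned long deadline_us;  /* when the stream needs the data */
    struct disk_req *next;
};

/* Insert into a singly linked list kept sorted by deadline. */
static void enqueue_edf(struct disk_req **head, struct disk_req *r)
{
    while (*head && (*head)->deadline_us <= r->deadline_us)
        head = &(*head)->next;
    r->next = *head;
    *head = r;
}

/* Dispatch always takes the most urgent request. */
static struct disk_req *dequeue_edf(struct disk_req **head)
{
    struct disk_req *r = *head;
    if (r)
        *head = r->next;
    return r;
}
```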

EXT3NS streamlining of operations to get the data from the platter to the NIC

- EXT3NS has special hardware that avoids memcpys and most kernel calculations.
- This hardware takes the data output from the disk hardware buffer directly onto a custom PCI bus, then copies it through on-card buffers directly to the NIC on the SAME card.
- The hardware avoids the system's PCI bus when the fastpath option is used.
- Joint network interface and disk controller.
- Hardware speedups also calculate IP/TCP/UDP headers and checksums to speed up processing. (The Internet checksum such hardware offloads is shown below.)
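
The Internet checksum (RFC 1071) that such cards compute in hardware, shown here in software purely to illustrate the per-packet work the fastpath removes from the CPU.

```c
#include <stddef.h>
#include <stdint.h>

static uint16_t inet_checksum(const void *data, size_t len)
{
    const uint16_t *p = data;
    uint32_t sum = 0;

    while (len > 1) {              /* sum 16-bit words */
        sum += *p++;
        len -= 2;
    }
    if (len)                       /* trailing odd byte */
        sum += *(const uint8_t *)p;

    while (sum >> 16)              /* fold carries back in */
        sum = (sum & 0xffff) + (sum >> 16);

    return (uint16_t)~sum;         /* one's complement */
}
```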

Tiger Shark streamlining of operations to get the data from the platter to the NIC

- A running daemon pre-allocates OS resources such as buffer space, disk bandwidth, and controller time. Not a hardware-dependent solution.
- Even though it has no shared-memory hardware, Tiger Shark copies data from the disks into a shared memory area; essentially this is a very large extension of the kernel's disk block cache.
- Tiger Shark's VFS layer intercepts repeated calls and serves them from the shared memory area, saving kernel memcpys on subsequent requests. (A sketch of such a cache lookup follows below.)
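
An illustrative sketch, with my own names rather than Tiger Shark's, of serving repeated reads from a shared-memory block cache: the first request faults the block in from disk, and later requests for the same block are satisfied from shared memory with no further disk I/O. `read_block_from_disk` is a hypothetical stub.

```c
#include <stdint.h>

#define BLOCK_SIZE  (256 * 1024)  /* Tiger Shark default block size */
#define CACHE_SLOTS 1024

struct cache_slot {
    uint64_t block_no;            /* which disk block is resident      */
    int      valid;
    char     data[BLOCK_SIZE];    /* lives in a shared memory segment  */
};

extern int read_block_from_disk(uint64_t block_no, void *buf);  /* stub */

/* Direct-mapped lookup: fill on miss, return resident slot on hit. */
static struct cache_slot *lookup_or_fill(struct cache_slot *cache,
                                         uint64_t block_no)
{
    struct cache_slot *s = &cache[block_no % CACHE_SLOTS];
    if (!s->valid || s->block_no != block_no) {  /* miss: fill from disk */
        if (read_block_from_disk(block_no, s->data) < 0)
            return NULL;
        s->block_no = block_no;
        s->valid = 1;
    }
    return s;    /* hit: no disk I/O, no extra kernel copy */
}
```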

Platter layout and scaling optimizations for contiguous streaming

EXT3NS:
- The hardware uses a RAID-3-like cross-platter optimization called “SDA,” which distributes blocks across multiple disk platters (simple striping, not interleaving). (The block-to-disk mapping is sketched below.)
- Maximum of 4 platters as implemented.

Tiger Shark:
- Striping across a maximum of platters.
- The striping method is unspecified; it looks flexible, and can be extended to include redundancy if desired.
- Keeps all members of a block group contiguous (per journaling FS concepts) and attempts to keep the block groups contiguous.
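
A sketch of simple striping (no interleaving, no parity) in the spirit of SDA's block distribution; the constants are illustrative, not from the paper. Logical block b lands on disk (b mod N) at per-disk offset (b div N), so a contiguous stream fans out across all platters.

```c
#include <stdint.h>

#define NUM_DISKS 4u              /* SDA's implemented maximum */

struct stripe_loc {
    uint32_t disk;                /* which platter/disk       */
    uint64_t offset;              /* block index on that disk */
};

static struct stripe_loc map_block(uint64_t logical_block)
{
    struct stripe_loc loc;
    loc.disk   = (uint32_t)(logical_block % NUM_DISKS);
    loc.offset = logical_block / NUM_DISKS;
    return loc;
}
```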

Seeking Optimizations

EXT3NS:
- None noted beyond the large block size.

Tiger Shark:
- Byte-range locking: allows multiple clients to access different areas of a file with real-time guarantees, provided they don't step on each other. (A POSIX byte-range lock is sketched below.)
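
For flavor, byte-range locking as exposed by POSIX fcntl(2). Tiger Shark implements its own range locks inside the filesystem, so this illustrates only the semantics, not its mechanism: the call locks [start, start+len) for reading without blocking readers of other regions.

```c
#include <fcntl.h>
#include <unistd.h>

static int lock_range_for_read(int fd, off_t start, off_t len)
{
    struct flock fl = {
        .l_type   = F_RDLCK,   /* shared read lock          */
        .l_whence = SEEK_SET,
        .l_start  = start,     /* first byte of the region  */
        .l_len    = len,       /* 0 would mean "to EOF"     */
    };
    return fcntl(fd, F_SETLK, &fl);  /* non-blocking acquire */
}
```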

Legacy nods

EXT3NS:
- If the legacy option (slow class) is used, the disk contents are copied into the system's page cache through the system bus, as if EXT3FS were being used. The paper does not go into it, but my guess is that this is a rather wasteful operation given the large block size of SDA.
- Other legacy tools such as fsck and mkfs are also available for EXT3NS.

Tiger Shark:
- Compatible with JFS; VFS/JFS calls go through the kernel interface with some block translation.

Both EXT3NS and Tiger Shark:
- Fully compatible with VFS (the Virtual Filesystem layer) on their respective platforms.

Current research, future directions, and Jeff's questions

- Tiger Shark gives us filesystem QoS, but can we do better by integrating VBR/ABR into the system? What about peeling in a VBR system to save resources?
- Replication and redundancy are always an issue, but are not addressed in this scope.
- For a software-based system such as Tiger Shark, where in the OS should we put these optimizations? (Kernel, tacked-on daemon, middleware?)
- Legacy disk accesses carry a huge cost in both of these systems; how can we minimize it?

EXT3NS final thoughts

- A valid, but not novel, approach: custom hardware does not represent an incremental step forward in universal knowledge.
- EXT3NS is built for exactly one thing: network streaming of data. An engineering change was made to the hardware design of a computer system, and some software optimizations were made to take advantage of it. The authors are not advocating a radical design change to all computers.
- It violates a few “design principles,” so it must be relegated to a customized, specific-purpose system.
- Empirical data confirm that the EXT3NS design can squeeze more concurrent sessions out of a multimedia server than were available previously. There is still a saturation point where the memory of the NS card or the capabilities of the card's internal bus break down, and the system cannot scale beyond that point.
- Better than best effort.

Tiger Shark final thoughts

- A valid, somewhat novel approach: it adds QoS guarantees to current disk interface architectures.
- Built to be extensible beyond MM disk access, but definitely optimized for it.
- Empirical data confirm that the Tiger Shark design can serve more concurrent sessions from a multimedia server than were available previously, BUT there is still a kernel bottleneck for the initial block load.
- Better suited to multiple concurrent access than EXT3NS. It currently appears scalable beyond any reasonable (modern) demands; as usual in computer science, though, future demands may find the point where the system's scaling breaks down.
- Guaranteed QoS: many later QoS filesystems extend this concept and tweak aspects of it, such as scheduling.

The fundamental academic question at the end of the day

The two major competing solution paradigms:
- Fundamentally alter the hardware datapath in a computer and present a customized hardware solution, with the relevant changes in the OS. Scaling: not addressed.
- Retrofit current operating systems with tacked-on, task-specific optimizations and tweaked settings, keeping the system and the hardware generic. Scaling: buy more hardware.

Or can we find an alternate third paradigm?