Operating System Support for Space Allocation in Grid Storage Systems
Douglas Thain, University of Notre Dame
IEEE Grid Computing, Sep 2006



Bad News: Many large distributed systems fall to pieces under heavy load!

Example: Grid3 (OSG)
Robert Gardner, et al. (102 authors), "The Grid3 Production Grid: Principles and Practice," IEEE HPDC 2004.
"The Grid2003 Project has deployed a multi-virtual organization, application-driven grid laboratory that has sustained for several months the production-level services required by... ATLAS, CMS, SDSS, LIGO..."

Grid2003: The Details
The good news:
- 27 sites with 2800 CPUs
- 40985 CPU-days provided over 6 months
- 10 applications with 1300 simultaneous jobs
The bad news on ATLAS jobs:
- 40-70 percent utilization
- 30 percent of jobs would fail
- 90 percent of failures were site problems
- Most site failures were due to disk space!

A Thought Experiment
[Figure: 1,000,000 jobs, each an in/task/out pipeline on its own CPU, all writing to one shared disk.]
1 - Only a problem when load > capacity.
2 - Grids are employed by users with infinite needs!

Need Space Allocation
Grid storage managers:
- SRB - Storage Resource Broker at SDSC.
- SRM - Storage Resource Manager at LBNL.
- NeST - Networked Storage at UW-Madison.
- IBP - Internet Backplane Protocol at UTK.
But they do not have any help from the OS:
- A runaway logfile can invalidate the careful accounting of the grid storage manager.

Outline Grids Need OS Support for Allocation A Model of Space Allocation Three Implementations –User-Level Library –Loopback Devices –AllocFS: Kernel Filesystem Application to a Cluster

A Model of Space Allocation
[Figure: an allocation tree. A root allocation (1000 GB) contains allocations jobs and home; jobs contains j1 and j2; home contains alice and betty. Every allocation records a size and a used count, and the used counts grow as files (e.g. data, core) are written inside them.]
Three commands:
mkalloc (dir) (size)
lsalloc (dir)
rm -rf (dir)
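The model can be simulated in a few lines (an illustrative Python sketch, not the paper's implementation): each allocation has a size and a used count, a new sub-allocation must fit in its parent's free space and charges the parent for its full size up front, and writes are refused once an allocation is full.

```python
# Illustrative model of hierarchical space allocation (all names hypothetical).
class Alloc:
    def __init__(self, name, size_gb, parent=None):
        # mkalloc: a new allocation must fit in the parent's free space,
        # and it charges the parent for its full size up front.
        if parent is not None and size_gb > parent.free():
            raise OSError("ENOSPC: allocation does not fit in parent")
        self.name, self.size, self.used, self.parent = name, size_gb, 0, parent
        if parent is not None:
            parent.used += size_gb

    def free(self):
        return self.size - self.used

    def write(self, gb):
        # A runaway logfile cannot exceed its allocation.
        if gb > self.free():
            raise OSError("ENOSPC: allocation full")
        self.used += gb

root = Alloc("root", 1000)
jobs = Alloc("jobs", 100, parent=root)   # mkalloc /jobs 100GB
j1 = Alloc("j1", 10, parent=jobs)        # mkalloc /jobs/j1 10GB
j1.write(5)
print(root.used, jobs.used, j1.used)     # 100 10 5
```

Note the key design point: a child charges its parent for its full size at creation time, so the accounting is enforced locally at every level of the tree.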

No Built-In Allocation Policy
In order to make an allocation:
- Must have permission to mkdir.
- New allocation must fit in available space.
Need something more complex?
- Check a remote database for a global quota?
- Delete the allocation after a certain time?
- Send a notification when the allocation is full?
Use a storage manager at a higher level:
- SRB, SRM, NeST, IBP, etc.

No Built-In Allocation Policy
[Figure: a grid storage manager mediates requests. A user asks for 10 GB; the manager checks a database, charges a credit card, or consults a human; then issues mkalloc /jobs/j5 10GB and setacl /jobs/j5 alice write inside a jobs allocation (size 100 GB), and replies "ok, use jobs/j5". The user's tasks (task1, task2, 5 GB each) then perform ordinary file access inside the allocation.]
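One way to read this slide: policy lives in the storage manager, mechanism lives in the OS. A minimal sketch of that split (hypothetical Python; the policy hook and command runner are assumptions, only the mkalloc and setacl commands come from the talk):

```python
import subprocess

def request_allocation(user, path, size_gb, policy_ok, run=subprocess.run):
    """Grid storage manager: apply site policy, then ask the OS to enforce it."""
    # Policy is the manager's job: quota databases, billing, human approval...
    if not policy_ok(user, size_gb):
        return None
    # Mechanism is the OS's job: create the allocation and grant access.
    run(["mkalloc", path, f"{size_gb}GB"], check=True)
    run(["setacl", path, user, "write"], check=True)
    return path  # "ok, use jobs/j5"
```

A granted request for 10 GB from alice would run mkalloc /jobs/j5 10GB followed by setacl /jobs/j5 alice write, exactly the exchange shown in the figure.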

Outline Grids Need OS Support for Allocation A Model of Space Allocation Three Implementations –User-Level Library –Loopback Devices –AllocFS: Kernel Filesystem Application to a Cluster

User Level Library
[Figure: the application links against LibAlloc. On each write, the library updates the allocation state at every level of the tree: 1 - lock/read, 2 - stat/write, 3 - write/unlock, so the used counts of root, jobs, and j1 all grow together.]

User Level Library
Some details about locking: see paper.
Applicability:
- Must modify apps or servers to employ.
- Fails if non-enabled apps interfere.
- But can be employed anywhere without privileges.
Performance:
- Optimization: cache locks until idle for 2 seconds.
- At best, writes double in latency.
- At worst, shared directories ping-pong locks.
Recovery:
- fixalloc: traverses the directory structure and recomputes current allocations.
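The lock/read ... write/unlock cycle might look like this sketch (assumed Python, using a plain "size used" state file per directory and an fcntl lock; the real library's file format and locking details are in the paper):

```python
import fcntl

def charge_allocation(state_path, nbytes):
    """Add nbytes to one directory's allocation state file, under a lock."""
    with open(state_path, "r+") as f:
        fcntl.flock(f, fcntl.LOCK_EX)            # 1 - lock and read state
        size, used = (int(x) for x in f.read().split())
        if used + nbytes > size:                 # 2 - check against the limit
            raise OSError("ENOSPC: allocation full")
        f.seek(0)
        f.truncate()
        f.write(f"{size} {used + nbytes}")       # 3 - write back; the lock
        # is released when the file is closed
```

A write must repeat this at every level of the allocation tree, which is why write latency roughly doubles even in the best case.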

Loopback Filesystems
[Figure: the allocation tree (root 1000 GB, jobs 100 GB, j1 and j2 10 GB each) with each allocation backed by its own loopback filesystem.]
dd if=/dev/zero of=/jobs.fs bs=1M count=102400   # 100 GB backing file
losetup /dev/loopN /jobs.fs
mke2fs /dev/loopN
mount /dev/loopN /jobs

Loopback Filesystems
Applicability:
- Works with any standard application.
- Must be root to deploy and manage allocations.
- Limited to approximately 10-100 allocations.
Performance:
- Ordinary reads and writes: no overhead.
- Allocations: must touch every block to reserve!
- Massively increases I/O traffic to disk.
Recovery:
- Must scan the hierarchy, then fsck and mount every allocation.
- Disastrous for large file systems!

AllocFS: Kernel-Level Filesystem #uidsizeusedparent GB700 GB GB99 GB GB5 GB root jobs j1j2 file Inode Table 1 – To update allocation state, update fields in incore-inode. 2 – To create/delete an allocation, update the parent’s allocation state, which is already cached for other reasons.

AllocFS: Kernel-Level Filesystem
Applicability:
- Works with any ordinary application.
- Must load module and be root to install.
- Binary compatible with existing EXT2 filesystem.
- Once loaded, ordinary users may employ.
Performance:
- No measurable overhead on I/O.
- Creating an allocation: touch two inodes.
- Deleting an allocation: same as deleting a directory.
Recovery:
- fixalloc: traverses the directory structure and recomputes current allocations.
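The two update rules can be sketched as operations on a tiny in-memory inode table (illustrative Python, not the kernel module): creating or deleting an allocation touches only the child record and its parent, so both are constant-time.

```python
# Illustrative inode table: inode number -> {uid, size, used, parent}.
inodes = {}

def mkalloc(ino, uid, size, parent_ino=None):
    parent = inodes.get(parent_ino)
    if parent is not None:
        if parent["size"] - parent["used"] < size:
            raise OSError("ENOSPC")
        parent["used"] += size            # parent inode is already cached
    inodes[ino] = {"uid": uid, "size": size, "used": 0, "parent": parent_ino}

def rmalloc(ino):
    rec = inodes.pop(ino)                 # delete like an ordinary directory
    parent = inodes.get(rec["parent"])
    if parent is not None:
        parent["used"] -= rec["size"]     # refund the parent

mkalloc(2, 0, 1000)                # root allocation
mkalloc(3, 0, 100, parent_ino=2)   # jobs charges root immediately
print(inodes[2]["used"])           # 100
rmalloc(3)
print(inodes[2]["used"])           # 0
```

Because the parent's inode is typically already in the cache, these two touches add no extra disk traffic, which matches the measured 32 usec allocation cost.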

Library Adds Latency

Allocation Performance
Loopback filesystem:
- 1 second per 25 MB of allocation (40 sec/GB).
- Must touch every single block.
- Big increase in unnecessary I/O traffic!
Allocation library:
- 227 usec regardless of size.
- Several synchronous disk ops.
Kernel-level filesystem:
- 32 usec regardless of size.
- Touch one inode.

Comparison

          Priv. Reqd.           Guarantee?  Max #     Write Perf.  Alloc Perf.   Recovery
Library   any user              no          no limit  2x latency   227 usec      fixalloc once
Loopback  root to install, use  yes         10-100    no change    secs to mins  fsck and mount each alloc
Kernel    root to install       yes         no limit  no change    32 usec       fixalloc once

Outline Grids Need OS Support for Allocation A Model of Space Allocation Three Implementations –User-Level Library –Loopback Devices –AllocFS: Kernel Filesystem Application to a Cluster

A Physical Experiment
[Figure: many jobs, each an in/task/out pipeline on its own CPU, sharing one disk with space for only 10 jobs.]
Four configurations:
1 - No allocations.
2 - Backoff when failures are detected.
3 - Heuristic: don't start a job unless space > threshold.
4 - Allocate space for each job.
Vary the load: the number of simultaneous jobs.

Allocations Improve Robustness

Summary
- Grids require space allocations in order to become robust under heavy loads.
- Explicit operating system support for allocations is needed in order to make them manageable and efficient.
- User-level approximations are possible, but have overheads in performance and management.
- AllocFS provides allocations compatible with EXT2 with no measurable overhead.

Library Implementation
Runs on Solaris, Linux, Mac, Windows.
Start the server with -Q 100GB

Kernel Implementation
Works with Linux. Install over an existing EXT2 FS (and uninstall without loss).
% mkalloc /mnt/alloctest/adir 25M
mkalloc: /mnt/alloctest/adir allocated blocks.
% lsalloc -r /mnt/alloctest
USED   TOTAL  PCT PATH
25.01M 87.14M 28% /mnt/alloctest
10.00M 25.00M 39% /mnt/alloctest/adir

A Final Thought
"[Some think] traditional OS issues are either solved problems or minor problems. We believe that building such vast distributed systems upon the fragile infrastructure provided by today's operating systems is analogous to building castles on sand."
- The Persistent Relevance of the Local Operating System to Global Applications, Jay Lepreau, Bryan Ford, and Mike Hibler, SIGOPS European Workshop, September 1996

For More Information
Cooperative Computing Lab - Douglas Thain
Related talks:
- "Grid Deployment of Bioinformatics Apps..." Session 4A Friday
- "Cacheable Decentralized Groups..." Session 5B Friday

Extra Slides

Existing Tools Not Suitable for the Grid
User and group quotas:
- Don't always correspond to allocation needs! A user might want one allocation per job, or many users may want to share an allocation.
Disk partitions:
- Very expensive to create, change, manage.
- Not hierarchical: only root can manage.
ZFS allocations:
- Cheap to create, change, manage.
- Not hierarchical: only root can manage.

Library Suffers on Small Writes

Recovery Linear wrt # of Files
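fixalloc is only characterized here by its cost, but its shape follows from the allocation model: one pass over the tree, recomputing each allocation's used count from its plain files plus the full size of each child allocation (an assumed sketch in Python; the real tool walks the on-disk directory structure, hence the linear cost in the number of files).

```python
def fixalloc(alloc):
    """Recompute 'used' bottom-up: plain-file bytes plus each child's size."""
    used = alloc["file_bytes"]
    for child in alloc["children"]:
        fixalloc(child)                # visit every entry once: linear cost
        used += child["size"]          # a child charges its full size
    alloc["used"] = used

jobs = {"size": 100, "file_bytes": 10, "children": []}
root = {"size": 1000, "file_bytes": 0, "children": [jobs]}
fixalloc(root)
print(root["used"], jobs["used"])   # 100 10
```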