Fragmentation in Large Object Repositories
Russell Sears, Catharine van Ingen
CIDR 2007
This work was performed at Microsoft Research San Francisco with input from the NTFS and SQL Server teams.

Background
Web services store large objects for users
–e.g. Wikipedia, Flickr, YouTube, GFS, Hotmail
Replicate BLOBs or files
–No update-in-place
Benchmark before deployment
–Then encounter storage performance problems
We set out to make some sense of this
[Architecture diagram: Clients, Application Servers, Object Stores, DB (metadata), Replication / Data scrubbing]

Problems with partial updates
Multiple changes per application request
–Atomicity (distributed transactions)
Most updates change object size
–Must fragment, or relocate data
Reading / writing the entire object addresses these issues; a minimal sketch follows.
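A minimal sketch of that get/put discipline in Python (the dict-backed store and the function names are illustrative, not from the paper): read the whole object, apply every change at once, and write back a complete replacement instead of patching in place.

    def update_object(store: dict, key: str, transform) -> None:
        blob = store.get(key, b"")   # read the entire object
        new_blob = transform(blob)   # apply all changes at once...
        store[key] = new_blob        # ...then put the whole object back
        # Because the final size is known before any byte is written,
        # the store is free to place the new version contiguously.

    store = {"photo": b"old bytes"}
    update_object(store, "photo", lambda b: b + b", new bytes")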

Experimental Setup
Single storage node
Compared a filesystem and a database
–NTFS on Windows Server 2003 R2
–SQL Server 2005 beta
Repeatedly update (free, then reallocate) objects
–Randomly chose the sizes and the objects to update
–Unrealistic, but easy to understand
Measured throughput and fragmentation
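A rough sketch of that update workload, run against a local directory purely for illustration (the paper's harness drove NTFS and SQL Server storage nodes over their network APIs; directory name, object counts, and sizes here are assumptions):

    import os
    import random

    OBJECT_DIR = "objects"            # hypothetical target volume
    SIZES = [256 * 2**10, 2**20]      # e.g. 256 KB and 1 MB objects

    os.makedirs(OBJECT_DIR, exist_ok=True)
    paths = []
    for i in range(100):              # create the initial population
        path = os.path.join(OBJECT_DIR, f"obj{i}")
        with open(path, "wb") as f:
            f.write(os.urandom(random.choice(SIZES)))
        paths.append(path)

    for _ in range(400):              # repeatedly free and reallocate
        victim = random.choice(paths) # randomly chosen object...
        os.remove(victim)             # ...freed...
        with open(victim, "wb") as f: # ...then reallocated at a random size
            f.write(os.urandom(random.choice(SIZES)))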

Reasoning about time
Existing metrics
–Wall-clock time: requires a trace to be meaningful; cannot compare different workloads
–Updates per volume: coupled to volume size
Storage age: the average number of updates per object
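Storage age, as the slide defines it, is simple to compute, and unlike wall-clock time or updates-per-volume it is comparable across workloads and volume sizes (function name is ours):

    def storage_age(total_updates: int, object_count: int) -> float:
        """Average number of updates per object on the volume."""
        return total_updates / object_count

    # e.g. the simulation above: 400 updates over 100 objects
    print(storage_age(400, 100))  # 4.0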

Read performance
Clean system
–SQL Server: good small-object performance (inexpensive opens)
–NTFS: significantly faster with objects >> 1 MB
After updates
–SQL Server degraded quickly
–NTFS small-object performance was low, but constant
[Chart: read throughput (MB/s) vs. updates per object (0, 2, 4) for 256 KB and 1 MB objects, NTFS vs. SQL Server]

10 MB object fragmentation
NTFS approaching an asymptote
SQL Server degrades linearly
–No BLOB defragmenter
[Chart: fragments/object vs. storage age, SQL Server vs. NTFS]

Rules of Thumb
Classic pitfalls
–Low free space (< 10%)
–Repeated allocation and deallocation (high storage age)
One new problem
–Small volumes (< x object size)
Implicit tuning knobs
–Size of write requests
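A quick guard against the low-free-space pitfall; shutil.disk_usage is part of the Python standard library, the 10% threshold is the rule of thumb above, and the path is a placeholder for the BLOB volume:

    import shutil

    usage = shutil.disk_usage("/")
    free_fraction = usage.free / usage.total
    if free_fraction < 0.10:
        print(f"warning: {free_fraction:.0%} free space; expect fragmentation")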

Append is expensive!
Neither system can take advantage of the final object size during allocation
Both APIs provide “append”
–Leave gaps for future appends
–Place objects without knowing their length
We observed the same behavior with a single fixed object size and with random object sizes
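A sketch contrasting append-style writes (the allocator never learns the final size) with declaring the size up front. os.posix_fallocate exists only on Unix-like systems, and the file names and sizes here are illustrative, not from the paper's harness:

    import os

    chunks = [b"x" * 4096] * 256      # a 1 MB object arriving in 4 KB pieces

    # Append-style: each write may extend the file into whatever free
    # space is nearby, which is what invites fragmentation.
    with open("appended.bin", "wb") as f:
        for chunk in chunks:
            f.write(chunk)

    # Size-aware: reserve the full extent first, then fill it in.
    total = sum(len(c) for c in chunks)
    with open("preallocated.bin", "wb") as f:
        os.posix_fallocate(f.fileno(), 0, total)   # hint the final size
        for chunk in chunks:
            f.write(chunk)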

Conclusions
Get/put storage is important in practice
Storage age
–A metric for comparing implementations and workloads
–Fragmentation behaviors vary significantly
Append leads to poor layout

----BACKUP SLIDES----

Theory vs. Practice
Theory focuses on contiguous layout of objects of known size
Objects that are allocated in groups are freed in groups
–Good allocation algorithms exploit this
–Generally ignored in average-case results
–Leads to pathological behavior in some cases

Small volumes
Small objects / large volumes
–Percent free space
Large objects / small volumes
–Number of free objects

Efficient Get/Put
No update-in-place
–Partial updates complicate apps
–Objects change size
Pipeline requests
–Small write buffers, I/O parallelism (sketched below)
[Diagram: application server pipelining chunks 1–4]
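A sketch of pipelining a put through a small write buffer so the application server never holds the whole object in memory; the 64 KB chunk size and the function names are illustrative:

    from typing import BinaryIO, Iterator

    CHUNK = 64 * 2**10  # small write buffer

    def stream_chunks(src: BinaryIO) -> Iterator[bytes]:
        """Yield the object in small pieces so reads and writes overlap."""
        while chunk := src.read(CHUNK):
            yield chunk

    def put(dst: BinaryIO, src: BinaryIO) -> None:
        for chunk in stream_chunks(src):
            dst.write(chunk)  # chunk N can be in flight while N+1 is read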

Lessons learned
Target systems avoid update-in-place
–No use for database data models
Quantified fragmentation behavior
–Across implementations and workloads
Common APIs complicate allocation
–The filesystem / BLOB API is too expressive

Example systems
SharePoint
–Everything in the database, one copy per version
Wikipedia
–One BLOB per document version; images are files
Flickr / YouTube
GFS
–Scalable append; chunk data into 64 MB files
Hotmail
–Each mailbox is stored as a single opaque BLOB

The folklore is accurate, so why do application designers…
…benchmark, then deploy the “wrong” technology?
…switch to the “right one” a year later?
…then switch back?!?
Performance problems crop up over time

Conclusions
Existing systems vary widely
–Measuring clean systems is inadequate, but standard practice
Support for append is expensive
Unpredictable storage is difficult to reliably scale and manage
–See the paper for more information about predicting and managing fragmentation in existing systems

Comparing data layout strategies
Study the impact of
–Volume size
–Object size
–Workload
–Update strategies
–Maintenance tasks
–System implementation
Need a metric that is independent of these factors

Related work
Theoretical results
–Worst-case performance is unacceptable
–Average case is good for certain workloads
–Structure in deallocation requests leads to poor real-world performance
Buddy system
–Places structural limitations on file layout
–Bounds fragmentation, but fails on large files (see the sketch below)
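A minimal illustration of the buddy system's structural limitation: every request is rounded up to the next power of two, which bounds external fragmentation but wastes internal space, and the waste is worst just past a power-of-two boundary; one reason it fails on large files. (The helper names are ours, not from the cited work.)

    def buddy_block_size(requested: int) -> int:
        """Smallest power-of-two block that holds the request."""
        size = 1
        while size < requested:
            size *= 2
        return size

    def internal_waste(requested: int) -> float:
        """Fraction of the allocated block left unused."""
        return 1 - requested / buddy_block_size(requested)

    # A 65 MB file lands in a 128 MB block: ~49% of the space is wasted.
    print(f"{internal_waste(65 * 2**20):.0%}")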

Introduction
Content-rich web services require large, predictable, and reliable storage
Characterizing fragmentation behavior
Opportunities for improvement

Data-intensive web applications
Simple data model (BLOBs)
–Hotmail: user mailbox
–Flickr: photograph(s)
Replication
–Instead of backup
–Load balancing
–Scalability
[Architecture diagram: Clients, Application Servers, Object Stores, DB (metadata), Replication / Data scrubbing]

Databases vs. Filesystems
Manageability should be the primary concern
–No need for advanced storage features
–Disk bound
Folklore
–File opens are slow
–Database interfaces stream data poorly

Clean system performance
Single node
–Used network APIs
Random workload
–Get/put one object at a time
Large objects lead to sequential I/O
[Chart: read throughput (MB/sec) vs. object size (…, 512K, 1M), SQL Server vs. NTFS]

Revisiting Fragmentation
Data-intensive web services
–Long-term predictability
–Simple data model: get/put opaque objects
Performance of existing systems
Opportunities for improvement

Introduction
Large object updates and web services
–Replication for scalability, reliability
–Get / put vs. partial updates
Storage age
–Characterizing fragmentation behavior
–Comparing multiple approaches
State-of-the-art approach
–Lay out data without knowing final object size
–Change the interface?