
UC Santa Cruz
Providing High Reliability in a Minimum Redundancy Archival Storage System
Deepavali Bhagwat, Kristal Pollack, Darrell D. E. Long, Ethan L. Miller (Storage Systems Research Center, University of California, Santa Cruz)
Thomas Schwarz (Computer Engineering Department, Santa Clara University)
Jehan-François Pâris (Department of Computer Science, University of Houston, Texas)

2 Introduction
- Archival data will increase ten-fold from 2007 to 2010 [J. McKnight, T. Asaro, and B. Babineau, Digital Archiving: End-User Survey and Market Forecast, The Enterprise Strategy Group, Jan.]
- Data compression techniques are used to reduce storage costs
- Deep Store, an archival storage system:
  - uses interfile and intrafile compression techniques
  - uses chunking
- Compression hurts reliability: loss of a shared chunk → disproportionate data loss
- Our solution: reinvest the saved storage space to improve reliability through selective replication of chunks
- Our results: better reliability than mirrored Lempel-Ziv compressed files, using only about half the storage space

3 Deep Store: An Overview
- Whole-file hashing: content addressable storage
- Delta compression
- Chunk-based compression:
  - the file is broken down into variable-length chunks using a sliding-window technique
  - a chunk identifier/digest is used to look for identical chunks
  - only unique chunks are stored
[Diagram: a fixed-size window slides over the file; a fingerprint of the window determines chunk boundaries, yielding variable-size chunks, each addressed by a chunk ID (content address)]
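The sliding-window chunking the slide depicts can be sketched as follows. The fingerprint here is a toy rolling hash rather than the Rabin fingerprint a real implementation would use, and the mask and size parameters are illustrative, not Deep Store's:

```python
import hashlib

def chunk_stream(data, mask=0x1FFF, min_size=256, max_size=8192):
    """Split data into variable-length chunks: a boundary is declared
    when the low bits of a rolling fingerprint are all zero (or the
    chunk hits max_size). Each chunk is named by its content digest."""
    chunks = []
    start, h = 0, 0
    for i in range(len(data)):
        h = ((h << 1) + data[i]) & 0xFFFFFFFF  # toy rolling fingerprint
        size = i - start + 1
        if (size >= min_size and (h & mask) == 0) or size == max_size:
            chunk = bytes(data[start:i + 1])
            cid = hashlib.sha1(chunk).hexdigest()  # chunk ID (content address)
            chunks.append((cid, chunk))
            start, h = i + 1, 0
    if start < len(data):  # trailing partial chunk
        tail = bytes(data[start:])
        chunks.append((hashlib.sha1(tail).hexdigest(), tail))
    return chunks
```

Because boundaries depend only on content, identical data always yields identical chunk IDs, which is what lets the store keep a single copy of a chunk shared by many files.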

4 Effects of Compression on Reliability
- Chunk-based compression → interfile dependencies
- Loss of a shared chunk → disproportionate amount of data loss
[Diagram: files sharing chunks]

5 Effects of Compression on Reliability (continued)
- Simple experiment to show the effects of interfile dependencies:
  - 9.8 GB of data from several websites, The Internet Archive
  - compressed using chunking to 1.83 GB (5.62 GB using gzip)
  - chunks were mirrored and distributed evenly onto 179 devices of 20 MB each

6 Compression and Reliability
- Chunking minimizes redundancies:
  - gives us excellent compression ratios
  - introduces interfile dependencies, which are detrimental to reliability
- Hence, reintroduce redundancies: selective replication of chunks
- Some chunks are more important than others. How important?
  - the amount of data depending on a chunk (byte count)
  - the number of files depending on a chunk (reference count)
- Selective replication strategy: weight of a chunk (w) → number of replicas for the chunk (k)
  - we use a heuristic function to calculate k

7 Heuristic Function
- k: number of replicas
- w: weight of a chunk
- a: base level of replication, independent of w
- b: boosts the number of replicas for chunks with high weight
- Every chunk is mirrored
- k_max: maximum number of replicas
  - as replicas increase, the reliability gained from each additional replica diminishes
- k is rounded off to the nearest integer
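The transcript names the parameters but not the function itself, so the following is only a plausible sketch consistent with the stated constraints (every chunk mirrored, diminishing returns per additional replica, cap at k_max, rounding to the nearest integer). The logarithmic growth in w is an assumption, not the paper's formula:

```python
import math

def replicas(w, a=0, b=0.55, k_max=4):
    """Hypothetical replica-count heuristic: base level a, weight boost
    b, sublinear (log) growth in w to model diminishing returns, a
    floor of 2 because every chunk is mirrored, and a cap of k_max."""
    k = a + b * math.log2(max(w, 1))
    return min(k_max, max(2, round(k)))
```

With these illustrative defaults, a chunk nothing else depends on gets the mirrored minimum of 2 copies, while heavily shared chunks saturate at k_max.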

8 Distribution of Chunks
- An archival system receives files in batches
- Files are stored onto a disk until the disk is full
- For every file:
  - chunks are extracted and compressed
  - unique chunks are stored
- A non-unique chunk is stored only if:
  - the present disk does not contain the chunk, and
  - for this chunk, k < k_max
- At the end of the batch, all chunks are revisited and replicas are made for the appropriate chunks
- A chunk is not proactively replicated:
  - wait for a chunk's replica to arrive as a chunk of a future file
  - this reduces inter-device dependencies for a file
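The per-file placement rule above can be sketched as follows; the `disk` set and `replica_count` index are illustrative stand-ins for the system's real data structures:

```python
def store_file(disk, replica_count, file_chunks, k_max=4):
    """Apply the slide's rule to one file: a never-seen chunk is always
    stored; a known chunk gets an opportunistic extra copy only if the
    current disk lacks it and it still has fewer than k_max replicas."""
    for cid, _chunk in file_chunks:
        count = replica_count.get(cid, 0)
        if count == 0:
            disk.add(cid)                    # unique chunk: store it
            replica_count[cid] = 1
        elif cid not in disk and count < k_max:
            disk.add(cid)                    # opportunistic replica
            replica_count[cid] = count + 1
```

Replicas thus accumulate passively as shared chunks recur in later files on other disks, which is exactly the "not proactively replicated" behavior the slide describes.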

9 Experimental Setup
- We measure robustness: the fraction of the data available given a certain percentage of unavailable storage devices
- We use replication to introduce redundancies (future work will investigate erasure codes)
- Data sets:
  - HTML, PDF, and image files from The Internet Archive (9.8 GB)
  - HTML, image (JPG and TIFF), PDF, and Microsoft Word files from The Santa Cruz Sentinel (40.22 GB)
- We compare the robustness and storage space utilization of archives that use:
  - chunking with selective redundancies, and
  - Lempel-Ziv compression with mirroring
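The robustness metric, the fraction of data still reachable when some fraction of devices is unavailable, can be estimated by a Monte Carlo simulation along these lines; the data structures are illustrative, not the paper's:

```python
import random

def robustness(placement, file_chunks, fail_frac, trials=200, seed=1):
    """Estimate the fraction of bytes recoverable when fail_frac of the
    devices are down. placement: chunk id -> set of device ids holding
    a replica; file_chunks: file name -> list of (chunk_id, size)."""
    devices = sorted({d for devs in placement.values() for d in devs})
    total = sum(sz for cs in file_chunks.values() for _, sz in cs)
    rng = random.Random(seed)
    avail = 0.0
    for _ in range(trials):
        down = set(rng.sample(devices, int(len(devices) * fail_frac)))
        ok = 0
        for cs in file_chunks.values():
            for cid, sz in cs:
                if placement[cid] - down:   # at least one replica survives
                    ok += sz
        avail += ok / total
    return avail / trials
```

Counting surviving bytes per chunk (rather than per file) matches the byte-count flavor of the metric; a per-file variant would count a file only if all of its chunks survive.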

10 Details of the Experimental Data

11 Weight of a Chunk
- When using dependent data (byte count) as a heuristic: w = D/d
  - D: sum of the sizes of all files depending on the chunk
  - d: average size of a chunk
- When using the number of files (reference count) as a heuristic: w = F
  - F: number of files depending on the chunk
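Both weight definitions can be written down directly; `dep_file_sizes`, a list of the sizes of all files that depend on the chunk, is a hypothetical input name:

```python
def chunk_weight(dep_file_sizes, avg_chunk_size, use_bytes=True):
    """Weight of a chunk under the two heuristics on the slide:
    w = D/d (dependent bytes over average chunk size) when use_bytes is
    True, else w = F (the number of dependent files)."""
    if use_bytes:
        return sum(dep_file_sizes) / avg_chunk_size   # w = D/d
    return len(dep_file_sizes)                        # w = F
```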

12 Robustness, Effect of varying a, w=F, b=1, k_max=4, The Internet Archive

13 Robustness, Effect of varying a, w=D/d, b=0.4, k_max=4, The Internet Archive

14 Robustness, Effect of limiting k, w=D/d, b=0.55, a=0, The Internet Archive

15 Robustness, Effect of varying b, w=D/d, a=0, k_max=4, The Internet Archive

16 Robustness, Effect of varying b, w=D/d, a=0, k_max=4, The Sentinel

17 Choice of a Heuristic
- The choice of a heuristic depends on the corpus:
  - if file size is indicative of file importance, choose w = D/d
  - if a file's importance is independent of its size, choose w = F
- Use the same metric to measure robustness

18 Future Work
- Study the reliability of Deep Store:
  - with a recovery model in place
  - when using delta compression
- Use different redundancy mechanisms such as erasure codes
- Data placement in conjunction with hardware statistics

19 Related Work
- Many archival systems use content addressable storage: EMC's Centera; variable-length chunks: LBFS; fixed-size chunks: Venti
- OceanStore aims to provide continuous access to persistent data:
  - uses automatic replication for high reliability
  - uses erasure codes for high availability
- FARSITE, a distributed file system:
  - replicates metadata; replication was chosen to avoid the overhead of data reconstruction when using erasure codes
- PASIS and Glacier use aggressive replication as protection against data loss
- LOCKSS provides long-term access to digital data:
  - uses a peer-to-peer audit and repair protocol to preserve the integrity of, and long-term access to, document collections

20 Conclusion
- Chunking gives excellent compression ratios but introduces interfile dependencies that adversely affect system reliability.
- Selective replication of chunks using heuristics gives:
  - better robustness than mirrored LZ-compressed files
  - significantly higher storage space efficiency: only about half of the space used by mirrored LZ-compressed files
- We use simple replication; our results will only improve with other forms of redundancy.