Liangshan Song, Yuhui Deng, Junjie Xie
Department of Computer Science, Jinan University (暨南大学)
Outline
Motivation
Challenges
Related work
Our idea
System architecture
Evaluation
Conclusion
The Explosive Growth of Data
Industrial manufacturing, e-commerce, social networks...
IDC: 1,800 EB of data in 2011, growing 40-60% annually
YouTube: 72 hours of video are uploaded per minute
Facebook: 1 billion active users upload 250 million photos per day
This puts greater pressure on traditional data centers
Up to 60% of the data stored in backup systems is redundant
WAN bandwidth: suppose we want to send 10 TB from U.C. Berkeley to Amazon in Seattle, Washington. Garfinkel measured bandwidth to S3 from three sites and found an average write bandwidth of 5 to 18 Mbit/s. Even assuming 20 Mbit/s over a WAN link, the transfer would take:
(10 × 10^12 Bytes × 8 bits/Byte) / (20 × 10^6 bits/s) = 4,000,000 seconds, i.e. more than 45 days
Amazon would also charge about $1,000 in network transfer fees when it receives the data.
S. Garfinkel. An evaluation of Amazon's grid computing services: EC2, S3 and SQS. Tech. Rep. TR-08-07, Harvard University, August 2007.
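A quick sanity check of this arithmetic, as a minimal Python sketch (the 20 Mbit/s figure is the assumed sustained link speed from above):

```python
# Back-of-the-envelope estimate: time to push 10 TB over a 20 Mbit/s WAN link.
data_bytes = 10 * 10**12             # 10 TB
link_bits_per_second = 20 * 10**6    # assumed sustained WAN bandwidth

seconds = data_bytes * 8 / link_bits_per_second
print(f"{seconds:,.0f} seconds = {seconds / 86400:.1f} days")
# 4,000,000 seconds = 46.3 days, i.e. more than 45 days
```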
Big Data storage ⇒ data deduplication
To speed up the identification of redundant data chunks, a fingerprint is calculated to represent each data chunk.
A fingerprint table is used to determine whether a chunk is redundant.
The fingerprint information grows with the volume of data, so some fingerprints have to be stored on disk.
However, because fingerprints lack locality, they cannot be cached effectively and generate random disk accesses.
On-disk fingerprint lookup therefore becomes a major overhead in deduplication systems.
Chunking Algorithms
Fixed-Size Partitioning (FSP): fast and efficient, but vulnerable to changes within a file
Variable-Size Partitioning (VSP): e.g. content-defined chunking (CDC), the sliding block (SB) algorithm, etc.; resilient to changes within a file
CDC uses the data content within files to locate chunk boundaries, thus avoiding the impact of data shifting (see the sketch below).
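A minimal sketch of the CDC idea; the window size, boundary mask, and chunk-size bounds are illustrative values, and Python's built-in hash stands in for a real Rabin fingerprint:

```python
def cdc_chunks(data: bytes, win=48, mask=0x1FFF,
               min_size=2048, max_size=65536):
    """Split data at content-defined boundaries: declare a boundary when the
    hash of the last `win` bytes matches the mask, so boundaries move with
    the content rather than with fixed byte offsets."""
    chunks, start = [], 0
    for i in range(len(data)):
        size = i - start + 1
        if size < min_size:
            continue
        # Placeholder rolling hash; a real CDC implementation uses a Rabin
        # fingerprint updated incrementally as the window slides.
        window_hash = hash(data[i - win + 1:i + 1])
        if (window_hash & mask) == mask or size >= max_size:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks
```

Because the boundary decision depends only on local content, inserting bytes near the start of a file shifts at most the surrounding chunks, while later chunks keep the same boundaries and fingerprints.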
Advantages of deduplication
Saves disk space and bandwidth
Higher throughput than traditional data compression methods
Saves other related costs
Challenges
Throughput: data must be stored within the given limited backup window
Disk bottleneck:
The number of fingerprints grows with the volume of data
Traditional cache algorithms are not effective for handling fingerprints
A low cache hit ratio degrades the performance of data deduplication
Bloom Filter (Summary Vector)
A summary vector kept in memory excludes unnecessary lookups in advance and avoids extra disk I/Os when searching the fingerprint table.
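A minimal Bloom filter sketch; the bit-array size and hash count are illustrative, not the tuned parameters of a production summary vector:

```python
import hashlib

class BloomFilter:
    """Summary vector kept in RAM: a negative answer means the fingerprint
    is definitely not stored, so the on-disk lookup can be skipped."""

    def __init__(self, num_bits=1 << 20, num_hashes=4):
        self.num_bits, self.num_hashes = num_bits, num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, item: bytes):
        # Derive k bit positions from salted SHA-1 digests of the item.
        for salt in range(self.num_hashes):
            digest = hashlib.sha1(bytes([salt]) + item).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: bytes):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: bytes) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))
```

Only fingerprints for which might_contain returns True need an on-disk lookup; false positives cost an extra lookup but never affect correctness.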
Extreme Binning
A hierarchical index policy: a two-tier chunk index with the primary index in RAM and bins on disk.
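A minimal sketch of the two-tier idea, with a per-file representative fingerprint (here the minimum) pointing from the RAM index to an on-disk bin; the in-memory dicts stand in for the real disk-resident structures:

```python
# RAM-resident primary index: representative fingerprint -> bin id.
primary_index: dict[bytes, int] = {}
# Bins of chunk fingerprints; a dict stands in for on-disk storage here.
bins: dict[int, set[bytes]] = {}

def lookup_file(chunk_fps: list[bytes]) -> set[bytes]:
    """Return the fingerprints of the incoming file that are already stored."""
    representative = min(chunk_fps)              # representative chunk fingerprint
    bin_id = primary_index.get(representative)
    if bin_id is None:
        return set()                             # no similar file seen before
    stored = bins[bin_id]                        # one disk access loads the whole bin
    return {fp for fp in chunk_fps if fp in stored}
```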
LRU-based Index Partitioning
Enforces access locality of fingerprint lookups in the way fingerprints are stored.
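A minimal sketch of keeping recently used fingerprint partitions in RAM with an LRU policy; partition granularity and eviction are simplified relative to the cited scheme:

```python
from collections import OrderedDict

class LRUPartitionCache:
    """Keep the most recently used fingerprint partitions in RAM."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.partitions = OrderedDict()   # partition id -> set of fingerprints

    def get(self, part_id):
        if part_id not in self.partitions:
            return None                            # miss: partition must be read from disk
        self.partitions.move_to_end(part_id)       # mark as most recently used
        return self.partitions[part_id]

    def put(self, part_id, fingerprints):
        self.partitions[part_id] = fingerprints
        self.partitions.move_to_end(part_id)
        if len(self.partitions) > self.capacity:
            self.partitions.popitem(last=False)    # evict the least recently used
```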
Our idea
A fingerprint prefetching algorithm that leverages file similarity and data locality:
Request fingerprints from disk drives in advance
Significantly improve the cache hit ratio, enhancing the performance of data deduplication
Traditional deduplication system architecture
Chunking Module
Chunking algorithms:
Fixed-Size Partitioning (FSP): fast and efficient
Variable-Size Partitioning (VSP): content-defined chunking (CDC), the sliding block (SB) algorithm, etc.; resilient to changes within a file
Fingerprint Generator
Calculates a fingerprint for each chunk.
Fingerprint: short (e.g., 128 bits), uniquely represents a chunk, and expedites the process of chunk comparison.
Hash algorithms: MD5, SHA-1
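A minimal sketch of fingerprint generation with Python's hashlib (MD5 gives the 128-bit digest mentioned above; SHA-1 would give 160 bits):

```python
import hashlib

def fingerprint(chunk: bytes) -> bytes:
    """128-bit MD5 digest used as the chunk's fingerprint; comparing these
    short digests is much cheaper than comparing the chunks themselves."""
    return hashlib.md5(chunk).digest()

# Two chunks are treated as identical when their fingerprints match.
assert fingerprint(b"same data") == fingerprint(b"same data")
```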
Fingerprint Lookup
Determines whether the chunk represented by the current fingerprint is a duplicate; two chunks are considered identical if their fingerprints are the same.
This step becomes time-consuming when the fingerprint table grows large.
If the fingerprint exists, the chunk is not stored again; if it does not exist, the chunk is stored.
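A minimal sketch of this decision, with an in-memory dict standing in for the fingerprint table (which in practice partly lives on disk):

```python
fingerprint_table: dict[bytes, int] = {}   # fingerprint -> location of the stored chunk
chunk_store: list[bytes] = []              # stands in for the on-disk chunk store

def dedup_store(chunk: bytes, fp: bytes) -> int:
    """Store the chunk only if its fingerprint has not been seen before,
    and return the location of the (possibly pre-existing) stored chunk."""
    if fp in fingerprint_table:            # exists -> do not store again
        return fingerprint_table[fp]
    chunk_store.append(chunk)              # does not exist -> store the chunk
    fingerprint_table[fp] = len(chunk_store) - 1
    return fingerprint_table[fp]
```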
FPP deduplication system architecture
Similar File Identification Module: identifies similar files, which share most of their chunks and fingerprints
Fingerprint Prefetching Module: accelerates the process of fingerprint lookup
Sequential Arrangement Module: preserves data locality
Similar File Identification
Target: identify similar files, which share most of their chunks and fingerprints.
If file A is similar to file B and file B has already been stored, place the fingerprints of file B in RAM before performing fingerprint lookup for file A; most of the lookups will then succeed in RAM.
Steps (a sketch of the similarity test follows below):
Step 1: extract a group of sample chunks from the target file
Step 2: calculate fingerprints for these chunks
Step 3: compare the fingerprints; two files are considered similar if the degree of similarity between their sampled fingerprints reaches a certain threshold
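A minimal sketch of Step 3; the similarity measure (shared sampled fingerprints over the smaller sample) and the 0.5 threshold are illustrative, not the paper's exact definition:

```python
def is_similar(sample_fps_a: set, sample_fps_b: set, threshold: float = 0.5) -> bool:
    """Two files are considered similar if enough of their sampled chunk
    fingerprints match."""
    if not sample_fps_a or not sample_fps_b:
        return False
    shared = len(sample_fps_a & sample_fps_b)
    return shared / min(len(sample_fps_a), len(sample_fps_b)) >= threshold
```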
How to sample chunks (see the sketch below)
Step 1: calculate the Rabin fingerprint of the sliding window
Step 2: if it meets the predefined condition, stop; otherwise move the sliding window
Step 3: if the movement exceeds the upper threshold, stop; otherwise go to Step 1
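A minimal sketch of this loop; the window size, condition mask, and movement limit are illustrative, and Python's built-in hash again stands in for the Rabin fingerprint:

```python
def next_sample_boundary(data: bytes, start: int = 0, win: int = 48,
                         mask: int = 0xFFF, max_move: int = 65536) -> int:
    """Slide a window forward from `start`; return the first position whose
    window hash meets the predefined condition, or stop once the movement
    exceeds the upper threshold."""
    end = min(len(data), start + max_move)
    for i in range(start + win, end):
        window_hash = hash(data[i - win:i])       # Step 1: hash the sliding window
        if (window_hash & mask) == mask:          # Step 2: condition met -> stop
            return i
    return end                                    # Step 3: movement limit reached -> stop
```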
Sequential Arrangement
Traditional cache algorithms are not effective because fingerprints generated by cryptographic hash functions are essentially random.
Instead, fingerprints are stored in the order in which files occur in the data stream, preserving data locality.
Fingerprint Prefetching
Target: accelerate the process of fingerprint lookup by combining file similarity and data locality.
Two prefetching schemes (sketched below):
Scheme 1: read all the unique fingerprints of the similar file from disk into the cache.
Scheme 2: read a portion of fingerprints from the recently visited location of the fingerprint database into the cache.
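A minimal sketch contrasting the two schemes; the list standing in for the on-disk fingerprint database, the region tuple, and the 4096-entry window are all illustrative assumptions:

```python
# A flat list stands in for the on-disk fingerprint database; fingerprints are
# assumed to be arranged sequentially (see Sequential Arrangement above).
fingerprint_db: list[bytes] = []

def prefetch_scheme1(similar_file_region: tuple, cache: set) -> None:
    """Scheme 1: load all unique fingerprints of the identified similar file."""
    start, count = similar_file_region           # where that file's fingerprints live
    cache.update(fingerprint_db[start:start + count])

def prefetch_scheme2(last_hit_pos: int, cache: set, window: int = 4096) -> None:
    """Scheme 2: read a portion of fingerprints around the recently visited
    location of the fingerprint database."""
    cache.update(fingerprint_db[last_hit_pos:last_hit_pos + window])
```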
Experiment Setup
Datasets:
Dataset1: 78 files (Word documents, PDF documents, PowerPoint presentations, etc.), 1.4 GB
Dataset2: 4 virtual machine disk images, 1.8 GB
Hardware: Intel(R) Core(TM) (dual core, 3.1 GHz) with 2 GB memory; hard disk drive (Seagate, 7200 RPM, 2000 GB)
Experiment Results
Overall performance of fingerprint prefetching:
① Data compression ratio
② Cache hit ratio of fingerprint access
③ Fingerprint lookup time
④ Deduplication time
Impact of RAM size on fingerprint prefetching
① Data Compression Ratio
Dataset1 is compressed from 1.4 GB to 724 MB
Dataset2 is compressed from 1.8 GB to 1.5 GB
Analysis: Dataset1 consists of documents revised and stored in multiple versions and copies, whereas virtual machine disk images contain less redundant data.
② Cache Hit Ratio of Fingerprint Access
Dataset1: improved from about 50% to 95%
Dataset2: improved from about 15% to 90%
Fingerprint prefetching improves the cache hit ratio significantly.
③ Fingerprint Lookup Time
T_L: total fingerprint lookup time, T_L = T_S + T_P + T_R
T_S: similar file identification time
T_P: fingerprint prefetching time
T_R: fingerprint retrieval time, which does not include similar file identification or fingerprint prefetching
The fingerprint prefetching algorithm is more effective for large chunk sizes
The fingerprint prefetching algorithm is more effective for large files
The fingerprint prefetching algorithm is more effective for large files with small chunk sizes
④ Deduplication Time
The fingerprint prefetching algorithm is effective for large files rather than small files.
Impact of RAM Size on Fingerprint Prefetching
Experiment setup:
Datasets:
N-Dataset1: Dataset1 reduced from 1.4 GB to 270 MB
N-Dataset2: Dataset2 reduced from 1.8 GB to 302 MB
Hardware: Intel(R) Core(TM) i5-2520M (quad core, 2.50 GHz) running CentOS 6.3
The RAM size is varied from 256 MB to 1024 MB.
The fingerprint prefetching algorithm is more effective in the 256 MB case.
Analysis:
With limited RAM, prefetching fingerprints can save a large amount of time.
With limited RAM, the fingerprint prefetching algorithm can effectively alleviate the disk bottleneck of data deduplication.
For "Big Data", the fingerprint prefetching algorithm can significantly improve the performance of a deduplication system.
Conclusion
The fingerprint prefetching algorithm improves the throughput of data deduplication:
It helps improve the cache hit ratio
It reduces the fingerprint lookup time
It achieves a significant performance improvement for deduplication
Sample chunks: the number of chunks to sample, and how to sample chunks better
Identify similar files: how to identify similar files more accurately
References
B. Zhu, K. Li, and H. Patterson. Avoiding the disk bottleneck in the Data Domain deduplication file system. In Proceedings of FAST, 2008.
D. Bhagwat, K. Eshghi, D. Long, and M. Lillibridge. Extreme Binning: Scalable, parallel deduplication for chunk-based file backup. In MASCOTS, 2009.
A. Broder and M. Mitzenmacher. Network applications of Bloom filters: A survey. Internet Mathematics, 1(4), 2004.
B. H. Bloom. Space/time trade-offs in hash coding with allowable errors. Communications of the ACM, 13(7), 1970.
M. Lillibridge, K. Eshghi, D. Bhagwat, V. Deolalikar, G. Trezis, and P. Camble. Sparse indexing: Large scale, inline deduplication using sampling and locality. In Proceedings of FAST, 2009.
A. Muthitacharoen, B. Chen, and D. Mazieres. A low-bandwidth network file system. In ACM SIGOPS Operating Systems Review, 35(5), 2001.
Y. Won, J. Ban, J. Min, J. Hur, S. Oh, and J. Lee. Efficient index lookup for de-duplication backup system. In MASCOTS, 2008.
D. T. Meyer and W. J. Bolosky. A study of practical deduplication. ACM Transactions on Storage, 7(4):14, 2012.
F. Guo and P. Efstathopoulos. Building a high-performance deduplication system. In Proceedings of the USENIX Annual Technical Conference, 2011.
Y. Tan, H. Jiang, D. Feng, L. Tian, and Z. Yan. CABdedupe: A causality-based deduplication performance booster for cloud backup services. In IPDPS, 2011.
B. Debnath, S. Sengupta, and J. Li. ChunkStash: Speeding up inline storage deduplication using flash memory. In Proceedings of the USENIX Annual Technical Conference, 2010.
HPCC 2013: The 15th IEEE International Conference on High Performance Computing and Communications, Zhangjiajie, China, November 13-15, 2013