
1 Department of Computer Science, Jinan University (暨南大学). Liangshan Song, Yuhui Deng, Junjie Xie

2 Outline  Motivation  Challenges  Related work  Our idea  System architecture  Evaluation  Conclusion

3 The Explosive Growth of Data  Industrial manufacturing, e-commerce, social networks...  IDC: 1,800 EB of data in 2011, growing 40-60% annually  YouTube: 72 hours of video are uploaded per minute  Facebook: 1 billion active users upload 250 million photos per day  This puts greater pressure on traditional data centers  Up to 60% of the data stored in backup systems is redundant  Image from http://www.buzzfeed.com

4 WAN bandwidth: Assume that we want to send 10 TB from U.C. Berkeley to Amazon in Seattle, Washington. Garfinkel measured bandwidth to S3 from three sites and found an average write bandwidth of 5 to 18 Mbit/s [S. Garfinkel. An Evaluation of Amazon's Grid Computing Services: EC2, S3 and SQS. Tech. Rep. TR-08-07, Harvard University, August 2007]. Suppose we get 20 Mbit/s over a WAN link; the transfer would take 10×10^12 bytes × 8 bits/byte / (20×10^6 bits/s) = 4×10^6 seconds, more than 45 days. Amazon would also charge about $1,000 in network transfer fees to receive the data.
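The transfer-time arithmetic above can be reproduced in a few lines (a sketch; the 20 Mbit/s link speed is the slide's assumption):

```python
# Back-of-the-envelope WAN transfer time, using the slide's numbers.
data_bytes = 10 * 10**12        # 10 TB payload
link_bps = 20 * 10**6           # assumed WAN bandwidth: 20 Mbit/s

seconds = data_bytes * 8 / link_bps   # convert bytes to bits, divide by rate
days = seconds / 86_400

print(f"{seconds:,.0f} s ≈ {days:.0f} days")
```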

5 Big Data Store ⇒ Data Deduplication  To speed up the identification of redundant data chunks, a fingerprint is calculated to represent each data chunk  A fingerprint table is used to determine whether a chunk is redundant  The fingerprint information grows with the volume of data, so some fingerprints have to be stored on disk  However, due to the lack of locality, fingerprints cannot be cached effectively, and fingerprint accesses generate random disk I/Os  On-disk fingerprint lookup therefore becomes a major overhead in deduplication systems

6 Chunking Algorithm  Fixed-Size Partition (FSP): fast and efficient, but vulnerable to changes within a file  Variable-Size Partition (VSP): the CDC (content-defined chunking) algorithm, the SB (sliding block) algorithm, etc.: not vulnerable to changes within a file  CDC uses the data content within files to locate chunk boundaries, thus avoiding the impact of data shifting
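A minimal CDC sketch: a simple polynomial rolling hash stands in for a true Rabin fingerprint, and the window size, mask, and chunk bounds are illustrative values, not the paper's parameters:

```python
# Content-defined chunking sketch: cut a chunk wherever the hash of a small
# sliding window matches a mask, subject to min/max chunk-size bounds.
WINDOW, MASK = 48, 0x1FFF           # mask gives ~8 KB average chunks
MIN_CHUNK, MAX_CHUNK = 2048, 65536  # illustrative size bounds

def rolling_hash(window: bytes) -> int:
    h = 0
    for b in window:
        h = (h * 257 + b) & 0xFFFFFFFF
    return h

def cdc_chunks(data: bytes) -> list:
    chunks, start, i = [], 0, 0
    while i < len(data):
        i += 1
        size = i - start
        at_boundary = (size >= MIN_CHUNK and i >= WINDOW
                       and rolling_hash(data[i - WINDOW:i]) & MASK == 0)
        if at_boundary or size >= MAX_CHUNK or i == len(data):
            chunks.append(data[start:i])
            start = i
    return chunks
```

Because boundaries depend only on the content inside the window, inserting bytes near the start of a file shifts only nearby boundaries; later chunks are unchanged, which is why CDC resists data shifting.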

7 Advantages  Save disk space and bandwidth  Higher throughput than traditional data compression methods  Reduce other related costs

8 Disk Bottlenecks  Throughput requirement: the data must be stored within a limited backup window  The number of fingerprints grows with the volume of data  Traditional cache algorithms do not handle fingerprints effectively  A low cache hit ratio degrades the performance of data deduplication

9 Bloom Filter  A summary vector kept in memory  Excludes unnecessary lookups in advance and avoids extra disk I/Os A Bloom filter (summary vector) for searching the fingerprint table
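A minimal Bloom-filter sketch: a bit array plus k salted hashes (sizes are illustrative). A negative answer is definite, so the on-disk fingerprint lookup can be skipped:

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter: k bit positions per item, derived from salted blake2b."""
    def __init__(self, m_bits: int = 1 << 20, k: int = 4):
        self.m, self.k = m_bits, k
        self.bits = bytearray(m_bits // 8)

    def _positions(self, item: bytes):
        for i in range(self.k):
            h = hashlib.blake2b(item, salt=i.to_bytes(8, "big")).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, item: bytes) -> None:
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, item: bytes) -> bool:
        # False means definitely absent (skip disk); True means maybe present.
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))
```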

10 Extreme Binning  A hierarchical index policy: a two-tier chunk index with the primary index in RAM and bins on disk

11 LRU-based Index Partitioning  Enforces access locality of fingerprint lookups when storing fingerprints
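The general mechanism can be sketched as an LRU-managed in-RAM fingerprint cache (a generic illustration, not the cited paper's exact data structure; the capacity is illustrative):

```python
from collections import OrderedDict

class LRUFingerprintCache:
    """Keep recently used fingerprints in RAM; evict the least recently used."""
    def __init__(self, capacity: int = 4):
        self.capacity = capacity
        self.entries = OrderedDict()          # fingerprint -> chunk location

    def get(self, fp):
        if fp not in self.entries:
            return None                       # miss: caller falls back to disk
        self.entries.move_to_end(fp)          # mark as most recently used
        return self.entries[fp]

    def put(self, fp, location) -> None:
        self.entries[fp] = location
        self.entries.move_to_end(fp)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict least recently used
```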

12 Our Idea  A fingerprint prefetching algorithm that leverages file similarity and data locality  Requests fingerprints from disk drives in advance  Significantly improves the cache hit ratio, enhancing the performance of data deduplication

13 Traditional deduplication system architecture

14 Chunking Module  Fixed-Size Partition (FSP): fast and efficient  Variable-Size Partition (VSP): the CDC (content-defined chunking) algorithm, the SB (sliding block) algorithm, etc.: not vulnerable to changes within a file

15 Fingerprint Generator  Calculates a fingerprint for each chunk  A fingerprint is short (e.g., 128 bits) and uniquely represents the chunk  Expedites the process of chunk comparison  Hash algorithms: MD5, SHA-1
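Fingerprint generation is a direct hash of the chunk; a sketch using Python's hashlib (MD5 yields a 128-bit digest, SHA-1 a 160-bit one):

```python
import hashlib

def fingerprint(chunk: bytes, algo: str = "md5") -> str:
    """Return a hex fingerprint of a chunk (128-bit MD5 by default)."""
    return hashlib.new(algo, chunk).hexdigest()
```

Identical chunks always yield identical fingerprints, so comparing two short fingerprints replaces byte-by-byte comparison of whole chunks.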

16 Fingerprint Lookup  Determines whether the chunk represented by the current fingerprint is a duplicate  Two chunks are considered identical if their fingerprints are the same  Tends to be time-consuming when the fingerprint table becomes large  If the fingerprint exists, the chunk is not stored; if it does not exist, the chunk is stored
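The exists/does-not-exist decision can be sketched as a tiny in-memory store (names are hypothetical; in a real system the fingerprint table spills to disk):

```python
import hashlib

class DedupStore:
    """Toy dedup store: a chunk is written only when its fingerprint is new."""
    def __init__(self):
        self.index = {}                      # fingerprint -> chunk

    def write(self, chunk: bytes) -> bool:
        """Return True if the chunk was actually stored (i.e. it was new)."""
        fp = hashlib.sha1(chunk).hexdigest()
        if fp in self.index:                 # fingerprint exists: duplicate,
            return False                     # do not store the chunk again
        self.index[fp] = chunk               # fingerprint absent: store chunk
        return True
```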

17 FPP Deduplication System Architecture  Similar File Identification Module: identifies similar files, which share most of their identical chunks and fingerprints  Fingerprint Prefetching Module: accelerates the process of fingerprint lookup  Sequential Arrangement Module: preserves data locality

18 Similar File Identification Target: identify similar files, which share most of their identical chunks and fingerprints. If file A is similar to file B and file B has been stored before, place the fingerprints of file B in RAM before performing the fingerprint lookup for file A; most of the lookups will then succeed in RAM

19 Sample Chunks: Steps  Step 1: extract a group of sampled chunks from the target file  Step 2: calculate fingerprints for these chunks  Step 3: compare fingerprints; two files are considered similar if the degree of similarity between their fingerprints reaches a certain threshold
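Step 3 reduces to a set comparison over the sampled fingerprints; a sketch, where the 0.5 threshold is illustrative rather than the paper's value:

```python
def is_similar(fps_a: set, fps_b: set, threshold: float = 0.5) -> bool:
    """Two files are deemed similar when enough sampled fingerprints overlap."""
    if not fps_a or not fps_b:
        return False
    # Fraction of the smaller sample that also appears in the other file.
    overlap = len(fps_a & fps_b) / min(len(fps_a), len(fps_b))
    return overlap >= threshold
```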

20 How to Sample Chunks  Step 1: calculate the Rabin fingerprint of the sliding window  Step 2: if it meets the predefined condition, stop; otherwise move the sliding window  Step 3: if the movement exceeds the upper threshold, stop; otherwise go to Step 1

21 Sequential Arrangement  Traditional cache algorithms are not effective because fingerprints generated by a cryptographic hash function are random  Therefore, fingerprints are stored in the order in which files occur in the data stream

22 Fingerprint Prefetching Target: accelerate the process of fingerprint lookup by combining file similarity and locality. Two prefetching schemes:  Scheme 1: read all the unique fingerprints of the similar file from disk into the cache  Scheme 2: read a portion of fingerprints from the recently visited location of the fingerprint database into the cache
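The two schemes can be sketched as follows; `disk_index` and `fingerprint_log` are hypothetical structures standing in for the on-disk fingerprint database:

```python
def prefetch_similar_file(disk_index: dict, file_id: str, cache: set) -> None:
    # Scheme 1: load every unique fingerprint of the similar file into cache.
    cache.update(disk_index[file_id])

def prefetch_neighborhood(fingerprint_log: list, pos: int, window: int,
                          cache: set) -> None:
    # Scheme 2: load a window of fingerprints starting at the recently
    # visited location of the fingerprint database (locality prefetch).
    cache.update(fingerprint_log[pos:pos + window])
```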

23 Experiment Setup Datasets:  Dataset1: 78 files (Word documents, PDF documents, PowerPoint presentations, etc.), 1.4 GB  Dataset2: 4 virtual machine disk images, 1.8 GB Hardware:  Intel Core (dual core, 3.1 GHz) with 2 GB memory  Hard disk drive (Seagate, 7,200 RPM, 2,000 GB)

24 Experiment Results Overall performance of fingerprint prefetching: ① Data Compression Ratio ② Cache Hit Ratio of Fingerprint Access ③ Fingerprint Lookup Time ④ Deduplication Time Impact of RAM Size on Fingerprint Prefetching

25 ① Data Compression Ratio  Dataset1 is compressed from 1.4 GB to 724 MB  Dataset2 is compressed from 1.8 GB to 1.5 GB Analysis:  Dataset1 consists of documents revised and stored as multiple versions and copies, so it contains much redundant data  Virtual machine disk images contain less redundant data

26 ② Cache Hit Ratio of Fingerprint Access  The cache hit ratio rises from 50% to 95% for Dataset1  The cache hit ratio rises from 15% to 90% for Dataset2  The prefetching algorithm improves the cache hit ratio significantly

27 ③ Fingerprint Lookup Time  T_L: total fingerprint lookup time, so that T_L = T_S + T_P + T_R  T_S: similar file identification time  T_P: fingerprint prefetching time  T_R: fingerprint retrieval time, which excludes similar file identification and fingerprint prefetching  The fingerprint prefetching algorithm is more effective for large chunk sizes  The fingerprint prefetching algorithm is more effective for large files  The fingerprint prefetching algorithm is more effective for large files with small chunk sizes


29 ④ Deduplication Time  The fingerprint prefetching algorithm is effective for large files rather than small files

30 Impact of RAM Size on Fingerprint Prefetching Experiment setup: Datasets:  N-Dataset1: Dataset1 reduced from 1.4 GB to 270 MB  N-Dataset2: Dataset2 reduced from 1.8 GB to 302 MB Hardware:  Intel Core i5-2520M (quad core, 2.50 GHz) running CentOS 6.3

31 The RAM size is varied from 256 MB to 1,024 MB  The fingerprint prefetching algorithm is most effective in the 256 MB case

32 Analysis:  With limited RAM, prefetching fingerprints saves a large amount of time  With limited RAM, the fingerprint prefetching algorithm can effectively alleviate the disk bottleneck of data deduplication  For "Big Data", the fingerprint prefetching algorithm can significantly improve the performance of the deduplication system

33 Conclusion  Improves the throughput of data deduplication  Helps improve the cache hit ratio  Reduces fingerprint lookup time  Achieves a significant performance improvement for deduplication

34 Sample chunks:  Number of chunks  How to sample chunks better Identify similar files:  How to identify similar files more accurately

35 References
B. Zhu, K. Li, and H. Patterson. Avoiding the Disk Bottleneck in the Data Domain Deduplication File System. In FAST 2008, pp. 269-282.
D. Bhagwat, K. Eshghi, D. Long, and M. Lillibridge. Extreme Binning: Scalable, Parallel Deduplication for Chunk-based File Backup. In MASCOTS 2009.
A. Broder and M. Mitzenmacher. Network Applications of Bloom Filters: A Survey. Internet Mathematics, 1(4):485-509, 2004.
B. Bloom. Space/Time Trade-offs in Hash Coding with Allowable Errors. Communications of the ACM, 13(7):422-426, 1970.
M. Lillibridge, K. Eshghi, D. Bhagwat, V. Deolalikar, G. Trezis, and P. Camble. Sparse Indexing: Large Scale, Inline Deduplication Using Sampling and Locality. In FAST 2009, pp. 111-123.
A. Muthitacharoen, B. Chen, and D. Mazieres. A Low-Bandwidth Network File System. In SOSP 2001 (ACM SIGOPS Operating Systems Review, 35(5):174-187).
Y. Won, J. Ban, J. Min, J. Hur, S. Oh, and J. Lee. Efficient Index Lookup for De-duplication Backup System. In MASCOTS 2008.
D. Meyer and W. Bolosky. A Study of Practical Deduplication. ACM Transactions on Storage, 7(4):14, 2012.
F. Guo and P. Efstathopoulos. Building a High-Performance Deduplication System. In USENIX ATC 2011.
Y. Tan, H. Jiang, D. Feng, L. Tian, and Z. Yan. CABdedupe: A Causality-based Deduplication Performance Booster for Cloud Backup Services. In IPDPS 2011.
B. Debnath, S. Sengupta, and J. Li. ChunkStash: Speeding Up Inline Storage Deduplication Using Flash Memory. In USENIX ATC 2010.

36 HPCC 2013: The 15th IEEE International Conference on High Performance Computing and Communications, Zhangjiajie, China, November 13-15, 2013

