Yongtao Zhou, Yuhui Deng, Junjie Xie


Leverage Similarity and Locality to Enhance Fingerprint Prefetching of Data Deduplication
Yongtao Zhou, Yuhui Deng, Junjie Xie
Department of Computer Science, Jinan University, Guangzhou, 510632, P. R. China
ICPADS 2014

Agenda
Introduction
Related work
Motivation
System overview
Evaluation

Introduction
IDC: 95% of the data in backup systems is redundant, as is 75% of the data across the digital world.
Redundant data consumes IT resources and expensive network bandwidth.
Data deduplication eliminates redundant data by storing only one copy of the data.
The fingerprints are too large to be stored entirely in memory.
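The idea of storing only one copy per fingerprint can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes fixed-size chunks and SHA-1 fingerprints, and the names `deduplicate` and `restore` are hypothetical.

```python
import hashlib


def deduplicate(data: bytes, chunk_size: int = 4096):
    """Split data into fixed-size chunks and keep one copy per fingerprint.

    SHA-1 is an assumption here; the slides do not name the hash function.
    """
    store = {}   # fingerprint -> chunk bytes (the single stored copy)
    recipe = []  # ordered fingerprints needed to rebuild the original data
    for offset in range(0, len(data), chunk_size):
        chunk = data[offset:offset + chunk_size]
        fp = hashlib.sha1(chunk).hexdigest()
        store.setdefault(fp, chunk)  # only the first occurrence is stored
        recipe.append(fp)
    return store, recipe


def restore(store, recipe) -> bytes:
    """Rebuild the original data from the chunk store and the recipe."""
    return b"".join(store[fp] for fp in recipe)
```

Every chunk costs one fingerprint lookup in `store`. When that index no longer fits in memory, each lookup may become a disk access, which is the bottleneck the following slides discuss.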

Introduction
Querying fingerprints incurs a disk bottleneck.
The fingerprints are too large to be cached in memory.
The cache hit ratio is very low because fingerprint lookups lack temporal locality.
The IOPS of disk drives is limited.
A large portion of the fingerprints has to be stored on disk drives.

Locality-based strategies: DDFS
Locality: segments tend to reappear in the same or very similar sequences with other segments, because most of the data in a backup differs only slightly from the previous backup.
This locality is what makes fingerprint prefetching possible.
Deduplication performance is poor when there is little or no locality in the data sets.
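A locality-based prefetching cache can be sketched as below. This is a simplified illustration in the spirit of DDFS, not its actual design: the `containers` dictionary stands in for on-disk groups of fingerprints written in stream order, and the class name, the LRU policy and the `capacity` parameter are assumptions.

```python
from collections import OrderedDict


class LocalityCache:
    """On a cache miss, load the whole group of fingerprints that were
    written next to the requested one, so neighbours hit in memory."""

    def __init__(self, containers, capacity=2):
        # container id -> fingerprints in stream order (stands in for disk)
        self.containers = containers
        # stands in for the full on-disk fingerprint index
        self.location = {fp: cid
                         for cid, fps in containers.items() for fp in fps}
        self.cache = OrderedDict()  # container id -> set of fingerprints
        self.capacity = capacity
        self.disk_reads = 0

    def lookup(self, fp):
        hit = next((cid for cid, fps in self.cache.items() if fp in fps), None)
        if hit is not None:
            self.cache.move_to_end(hit)  # mark as most recently used
            return True
        self.disk_reads += 1             # index lookup goes to disk
        cid = self.location.get(fp)
        if cid is None:
            return False                 # fingerprint is new
        self.cache[cid] = set(self.containers[cid])  # prefetch neighbours
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)           # evict least recent
        return True
```

When the incoming stream follows the stored order, one disk read serves a whole run of lookups. When the stream has little locality, almost every lookup misses, which matches the weakness noted above.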

Similarity-based method: Extreme Binning
It fails to identify, and thus to remove, significant amounts of redundant data when there is a lack of similarity among files.
It uses a two-level index structure made up of the similarity characteristic value and the bin.
Extreme Binning stores the similarity characteristic values in RAM.
Extreme Binning only identifies redundant data within the same bin, even though neighbouring bins may hold identical data blocks. Some redundant data blocks are therefore stored more than once, which degrades the deduplication ratio.
The deduplication ratio of Extreme Binning relies heavily on the degree of similarity in the data streams.
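The two-level index and its blind spot can be sketched as follows. This is an illustration under stated assumptions: the similarity characteristic value is taken to be the minimum chunk fingerprint of a file, fingerprints are passed in directly as strings, and the class name `ExtremeBinningSketch` is hypothetical.

```python
class ExtremeBinningSketch:
    """Two-level index: an in-RAM map from a file's similarity
    characteristic value to a bin of chunk fingerprints."""

    def __init__(self):
        self.primary = {}       # characteristic value -> bin (set of fps)
        self.stored_chunks = 0  # chunks written to storage so far

    def backup(self, fps):
        """Deduplicate one file, given its chunk fingerprints.

        Returns the number of chunks that had to be stored.
        """
        rep = min(fps)  # assumed characteristic value: minimum fingerprint
        bin_ = self.primary.setdefault(rep, set())
        new = 0
        for fp in fps:
            if fp not in bin_:  # duplicates are only detected inside this bin
                bin_.add(fp)
                new += 1
        self.stored_chunks += new
        return new
```

Backing up `["01", "aa"]` and then `["02", "aa"]` stores chunk `aa` twice, because the two files land in different bins. This is the loss of deduplication ratio described above.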

SSD-based approach
Fingerprint lookup suffers from a disk bottleneck because the IOPS of disk drives is limited.
Some studies, such as dedupv1 and ChunkStash, alleviate the disk bottleneck by using SSDs.
SSDs are still very expensive compared with disk drives.
The performance of small random writes becomes a new bottleneck on SSDs.
HDD vs. SSD

Our approach
A fingerprint prefetching approach that uses file similarity to enhance deduplication performance.
The locality of fingerprints is maintained by arranging the fingerprints in the order of the backup data stream.
The overhead of different similarity identification algorithms is investigated, and the impact of those algorithms on data deduplication is evaluated against previous studies (Extreme Binning, SiLo, FPP).
This approach does not affect the deduplication ratio.

System architecture
Implementation in LessFS
Implementation in Tokyo Cabinet

Storage structure for fingerprints
The locality of fingerprints is maintained by arranging the fingerprints in the order of the backup data stream.
Loss of the locality of fingerprints

The process of fingerprint prefetching
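The combination of stream-ordered fingerprint storage and similarity-driven prefetching can be sketched as below. This is a rough illustration, not the LessFS or Tokyo Cabinet implementation: the similar file is assumed to have been identified already (for example by FPP, PAS or Simhash), and `PrefetchingIndex`, `extent` and `similar_file_id` are hypothetical names.

```python
class PrefetchingIndex:
    """Fingerprints are logged in backup stream order, so the fingerprints
    of one file form a contiguous run that can be read sequentially."""

    def __init__(self):
        self.log = []       # fingerprints in stream order (stands in for disk)
        self.extent = {}    # file id -> (start, length) of its run in the log
        self.known = set()  # stands in for the full on-disk fingerprint index
        self.cache = set()  # prefetched fingerprints held in memory
        self.disk_lookups = 0

    def prefetch(self, similar_file_id):
        start, length = self.extent[similar_file_id]
        self.cache = set(self.log[start:start + length])  # one sequential read

    def is_duplicate(self, fp):
        if fp in self.cache:
            return True
        self.disk_lookups += 1  # fall back to the on-disk index
        return fp in self.known

    def backup(self, file_id, fps, similar_file_id=None):
        """Deduplicate one file and return the number of new chunks."""
        if similar_file_id is not None:
            self.prefetch(similar_file_id)
        start = len(self.log)
        new = 0
        for fp in fps:
            if not self.is_duplicate(fp):
                self.known.add(fp)
                new += 1
            self.log.append(fp)  # keep stream order, duplicates included
        self.extent[file_id] = (start, len(fps))
        return new
```

Every fingerprint is still checked against the full index on a cache miss, so duplicates are never overlooked. That is consistent with the claim that the approach does not affect the deduplication ratio.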

Evaluation
We implement a real prototype based on LessFS and Tokyo Cabinet.
Three similarity identification algorithms, FPP, PAS and Simhash, are implemented in the Similar File Identification Module.
Test platform: Ubuntu operating system (kernel version 3.5.0-17), 1GB memory, 2.4GHz Intel(R) Xeon(R) CPU.
We take four full backups to evaluate the system, as DDFS does.
The four data sets backup1, backup2, backup3 and backup4 are 10GB, 15GB, 20GB and 25GB in size and contain 3073, 4694, 6539 and 9910 files, respectively.
We choose a fixed-size chunking algorithm with chunk sizes of 4KB, 8KB, 16KB, 32KB, 64KB and 128KB.

FPP and PAS

Simhash
Simhash is a member of the locality-sensitive hashing family.
Simhash has the property that the fingerprints of similar files differ in a small number of bit positions.
It runs in production in the Google web search engine.
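The Simhash construction can be sketched as follows. This is a minimal version under stated assumptions: features are plain strings with equal weight, each feature is hashed with MD5 truncated to 64 bits, and how a file is split into features is left open because the slides do not specify it.

```python
import hashlib


def simhash(features, bits=64):
    """Compute a Simhash fingerprint from an iterable of string features."""
    counts = [0] * bits
    for feature in features:
        digest = hashlib.md5(feature.encode("utf-8")).digest()
        h = int.from_bytes(digest[:bits // 8], "big")
        for i in range(bits):
            # each feature votes +1 or -1 on every bit position
            counts[i] += 1 if (h >> i) & 1 else -1
    # a bit is set in the fingerprint when the positive votes win
    return sum(1 << i for i, c in enumerate(counts) if c > 0)


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")
```

Two files would be treated as similar when `hamming(simhash(f1), simhash(f2))` is at most some small threshold. The threshold is a tuning parameter that the slides do not give.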

Data sets
The file size distribution matches that of previous studies.

Deduplication ratio
We measure the size of the unique data blocks under three similarity identification algorithms, FPP, PAS and Simhash, with four full backups.
When the chunk size is 4KB, the unique data blocks total 14GB and the data deduplication ratio is 3.93 in all three cases.
This is the same as the baseline system LessFS.

Time overhead of fingerprint lookup
$T_s$: the time of similarity detection
$T_p$: the time of fingerprint prefetching
$T_f$: the time of fingerprint lookup
The overall overhead of fingerprint lookup is $T_t = T_s + T_p + T_f$.
For the baseline (Base), $T_s = T_p = 0$.

Time overhead of fingerprint lookup

CPU utilization

Memory utilization

Conclusion
We propose a fingerprint prefetching approach that preserves the locality of fingerprints by storing them in the order of the backup data stream and that takes advantage of file similarity.
The proposed method effectively alleviates the disk bottleneck of fingerprint lookup with acceptable CPU, memory and storage overhead, thus improving the throughput of data deduplication.
It does not affect the data deduplication ratio.

References
SSD: http://en.wikipedia.org/wiki/Solid-state_drive
HDD vs SSD: http://www.diffen.com/difference/HDD_vs_SSD
D. Bhagwat, K. Eshghi, D. D. Long, and M. Lillibridge, “Extreme binning: Scalable, parallel deduplication for chunk-based file backup,” in Modeling, Analysis & Simulation of Computer and Telecommunication Systems, 2009. MASCOTS’09. IEEE International Symposium on. IEEE, 2009, pp. 1–9.
W. Xia, H. Jiang, D. Feng, and Y. Hua, “SiLo: a similarity-locality based near-exact deduplication scheme with low RAM overhead and high throughput,” in Proceedings of the 2011 USENIX conference on USENIX annual technical conference. USENIX Association, 2011, pp. 26–28.
A. Z. Broder, M. Charikar, A. M. Frieze, and M. Mitzenmacher, “Min-wise independent permutations,” Journal of Computer and System Sciences, vol. 60, no. 3, pp. 630–659, 2000.
Y. Zhou, Y. Deng, X. Chen, and J. Xie, “Identifying file similarity in large data sets by modulo file length,” in Proceedings of the 14th International Conference on Algorithms and Architectures for Parallel Processing. IEEE, 2014.
D. Meister and A. Brinkmann, “dedupv1: Improving deduplication throughput using solid state drives (SSD),” in Mass Storage Systems and Technologies (MSST), 2010 IEEE 26th Symposium on. IEEE, 2010, pp. 1–6.
B. Debnath, S. Sengupta, and J. Li, “ChunkStash: speeding up inline storage deduplication using flash memory,” in Proceedings of the 2010 USENIX conference on USENIX annual technical conference. USENIX Association, 2010, pp. 16–16.
Y. Deng, “What is the future of disk drives, death or rebirth?” ACM Computing Surveys (CSUR), vol. 43, no. 3, p. 23, 2011.
B. Zhu, K. Li, and R. H. Patterson, “Avoiding the disk bottleneck in the data domain deduplication file system,” in FAST, vol. 8, 2008, pp. 1–14.
J. Gantz and D. Reinsel, “The digital universe decade-are you ready,” IDC iView, 2010.
S. Quinlan and S. Dorward, “Venti: A new approach to archival storage,” in FAST, vol. 2, 2002, pp. 89–101.
M. Ruijter, “Lessfs,” http://www.lessfs.com/wordpress/.
F. Labs, “Tokyo cabinet,” http://fallabs.com/tokyocabinet/.
M. Lillibridge, K. Eshghi, D. Bhagwat, V. Deolalikar, G. Trezise, and P. Camble, “Sparse indexing: Large scale, inline deduplication using sampling and locality,” in FAST, vol. 9, 2009, pp. 111–123.
G. S. Manku, A. Jain, and A. Das Sarma, “Detecting near-duplicates for web crawling,” in Proceedings of the 16th international conference on World Wide Web. ACM, 2007, pp. 141–150.

Thank you! Questions?