1 Metadedup: Deduplicating Metadata in Encrypted Deduplication via Indirection
Jingwei Li*, Patrick P. C. Lee#, Yanjing Ren*, and Xiaosong Zhang*
*University of Electronic Science and Technology of China  #The Chinese University of Hong Kong
MSST 2019

Hello everyone, I'm Yanjing Ren from the University of Electronic Science and Technology of China. This is joint work with Jingwei Li, Patrick Lee, and Xiaosong Zhang. The title of our work is Metadedup: Deduplicating Metadata in Encrypted Deduplication via Indirection.

2 Background
The global datasphere is expected to grow from 33 ZB in 2018 to 175 ZB in 2025 [1] (note: 1 ZB = 1 trillion GB)
Outsourcing data management to public cloud storage is popular
Various cloud services have applied deduplication on stored data, in order to save maintenance cost [1]

According to an IDC report, the global datasphere will grow from 33 ZB in 2018 to 175 ZB in 2025. To manage this huge volume of data, outsourcing data to public clouds is more popular than before. In the cloud, various cloud service providers apply deduplication to save the maintenance cost of their storage services.

3 Deduplication
Deduplication ≈ coarse-grained compression
Unit: chunk (fixed- or variable-size)
Store only one copy of chunks with same content; other chunks refer to the copy by references (pointers)
Storage space saved by 5/12 ≈ 42%!

Deduplication is regarded as coarse-grained compression. It works on the unit of chunks, which can be either fixed- or variable-size, and stores only one copy of duplicate chunks. In the example, we have three files and each file includes four chunks. After deduplication, we only need to store 7 non-duplicate chunks. This implies that we save storage space by 42%.
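To make the bookkeeping concrete, here is a minimal Python sketch of chunk-level deduplication (our own illustration, not the paper's code): a fingerprint index maps SHA-256 fingerprints to stored chunks, and duplicate chunks become references.

```python
import hashlib

def dedup_store(chunks, store):
    """Store only one physical copy per unique chunk; return the
    per-chunk references (fingerprints) that a file keeps instead."""
    refs = []
    for chunk in chunks:
        fp = hashlib.sha256(chunk).hexdigest()  # chunk fingerprint
        if fp not in store:                     # duplicate check via index
            store[fp] = chunk                   # first copy: store physically
        refs.append(fp)                         # later copies: reference only
    return refs

# Toy version of the slide's example: 3 files x 4 chunks = 12 logical chunks
files = [[b"A", b"B", b"C", b"D"],
         [b"A", b"B", b"E", b"F"],
         [b"A", b"G", b"C", b"D"]]
store = {}  # fingerprint index and chunk store in one dict
for f in files:
    dedup_store(f, store)
print(len(store))  # 7 unique chunks kept, i.e., 5/12 ~ 42% space saved
```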

4 Encrypted Deduplication
Encrypted deduplication augments deduplication with encryption for securing outsourced data against cloud service providers
Message-locked encryption (MLE) [Bellare, EUROCRYPT'13] uses a key derived from the content of each chunk
Enables deduplication of the same data from distinct users
MLE instantiations:
Historical MLE [Bellare, EUROCRYPT'13] uses the chunk fingerprint as the key
Server-aided MLE [Bellare, Security'13] derives the key from the chunk fingerprint and a global secret
Robust against brute-force attacks

Encrypted deduplication augments deduplication with an encryption layer, so as to achieve both security and storage efficiency. The cryptographic foundation of encrypted deduplication is message-locked encryption, abbreviated as MLE. It derives keys from the contents of chunks, which allows deduplication of the same chunks from different users. Two MLE instantiations represent the state of the art. Historical MLE (also known as convergent encryption) directly uses the fingerprint of a chunk as its key. Server-aided MLE derives keys from both the chunk fingerprint and a global secret, and is thus robust against brute-force attacks.
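The two key derivations can be sketched as follows (a minimal illustration; real server-aided schemes such as DupLESS run an oblivious PRF protocol with a dedicated key server, for which the HMAC call below is only a stand-in):

```python
import hashlib
import hmac

def historical_mle_key(chunk: bytes) -> bytes:
    """Historical MLE (convergent encryption): the key is the chunk
    fingerprint itself, so identical chunks yield identical keys."""
    return hashlib.sha256(chunk).digest()

def server_aided_mle_key(chunk: bytes, global_secret: bytes) -> bytes:
    """Server-aided MLE: the key also depends on a global secret, so an
    adversary without the secret cannot brute-force keys offline."""
    fp = hashlib.sha256(chunk).digest()
    return hmac.new(global_secret, fp, hashlib.sha256).digest()
```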

5 Metadata
In addition to storing non-duplicate data, encrypted deduplication storage keeps metadata:
Fingerprint index: fingerprints of already stored chunks
Deduplication metadata: fingerprint-to-chunk mappings (file recipe)
Key metadata: fingerprint-to-key mappings (key recipe)
Key recipes need to be managed separately from file recipes, and protected by users' master keys

In addition to storing non-duplicate data, encrypted deduplication storage keeps three types of metadata. The fingerprint index tracks the fingerprints of all chunks that have already been stored, and enables efficient duplicate detection. Deduplication metadata, commonly stored as a file recipe, holds the mappings from the chunks in a file to the references of the corresponding physical copies; it is used to reconstruct the original file from its chunks. Key metadata, stored as a key recipe, tracks the chunk-to-key mappings to allow the decryption of individual files. A key recipe includes sensitive information, so it needs to be managed separately from file recipes and protected by the user's master key.
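One plausible way to picture the per-file metadata (the field names here are our assumptions, not the paper's exact layout):

```python
from dataclasses import dataclass
from typing import List

@dataclass
class ChunkEntry:
    fingerprint: bytes  # reference to the physical copy of the chunk
    size: int           # chunk size, needed to reassemble the file

@dataclass
class FileRecipe:
    entries: List[ChunkEntry]  # one entry per chunk, in file order

@dataclass
class KeyRecipe:
    keys: List[bytes]  # one MLE key per chunk; the whole recipe is
                       # encrypted under the user's master key
```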

6 Overall Metadata Storage
Metadata storage overhead in two backup workloads:
X axis: number of backups
Y axis: cumulative size of data or metadata
The cumulative size of metadata approximates (FSL) or even exceeds (VM) that of physical data

Next, we show the high metadata storage overhead in encrypted deduplication. We consider two real-world backup workloads, FSL and VM. The x axis is the number of FSL or VM backups, and the y axis is the cumulative size of data or metadata. We see that the cumulative size of metadata increases with the number of backups, and even exceeds the cumulative size of physical data in the VM dataset.

7 Metadata Storage Breakdown
X axis: number of backups
Y axis: cumulative size of different metadata
File recipes and key recipes take more than 99.5% of overall metadata storage

We present the breakdown of the different metadata components. File recipes and key recipes account for a significant fraction of the overall metadata storage.

8 Challenges
Grouping & re-chunking [Romanski, SYSTOR'11; Kruus, FAST'10]
Depends on prior knowledge of deduplication results
Compression [Meister, FAST'13]
Cannot be applied to key recipes
Key management [Li, TPDS'14; Zhou, MSST'15; Li, DSN'16]
Adds significant overhead to deduplication metadata or degrades storage efficiency of data
How to enable space-efficient metadata management?

To address the storage overhead of recipes, existing approaches can be classified into three categories. Grouping & re-chunking groups adjacent duplicate chunks into a big chunk, while re-chunking non-duplicate chunks into small chunks; it depends on prior knowledge of deduplication results, which can be abused to extract secret information against encrypted deduplication. Compression replaces the fingerprints in file recipes with short codewords, but it cannot be applied to key recipes that are encrypted by file owners' master keys. Some key management approaches apply deduplication to keys and suffer the additional storage overhead of the metadata of key deduplication; other approaches generate one key for a group of chunks and degrade the storage efficiency of data. Thus, we pose our question: how to enable space-efficient metadata management in encrypted deduplication?

9 Our Contributions
Metadedup: an encrypted deduplication storage system with space-efficient metadata management
Build on indirection to apply deduplication on metadata
Preserve security guarantees for both data and metadata
Extensive prototype and trace-driven evaluations
Metadedup suppresses the storage space of metadata, while incurring limited performance and indexing penalties

In this work, we introduce Metadedup, an encrypted deduplication storage system with space-efficient metadata management. The key idea is to build on the power of indirection and apply deduplication to metadata at a coarse granularity. Our approach preserves the same security guarantees for both data and metadata as encrypted deduplication. We implement a Metadedup prototype and conduct extensive evaluations, which show that Metadedup significantly suppresses the storage space of metadata, while incurring limited performance and indexing penalties.

10 Architecture
[Architecture diagram: each client chunks its files and sends encrypted data & metadata over the Internet; the server runs deduplication against a fingerprint index and stores non-duplicate chunks, file recipes, and key recipes]
Target backup workloads
Support deduplication of data from multiple clients

Metadedup builds on a client-server architecture and targets the outsourced storage of backup workloads. Each client partitions input files into fine-grained chunks and uploads encrypted data chunks, as well as the corresponding metadata, to the server. The server maintains a fingerprint index to support deduplication of data from different clients, and it stores only the file recipes, key recipes, and non-duplicate chunks.

11 Design Goals
Low storage overhead of metadata
Suppress storage space of file recipes and key recipes
Add small storage overhead to fingerprint index
Ensure security for data and metadata
Prevent adversarial server from accessing any data or metadata
Prevent adversarial clients from accessing unauthorized data or metadata
Incur limited performance overhead
Add small performance overhead for writes

We have the following design goals for Metadedup. Low storage overhead of metadata: Metadedup suppresses the storage space of both file recipes and key recipes, and it only adds small storage overhead to the fingerprint index due to metadata deduplication. Security for data and metadata: in our threat model, we consider an adversarial server and adversarial clients. Metadedup prevents the adversarial server from accessing any data or metadata, and it prevents adversarial clients from accessing unauthorized data or metadata (where unauthorized means beyond the access scope of the adversarial clients). Limited performance overhead: compared to existing encrypted deduplication storage systems, Metadedup only adds small performance overhead to the write operation.

12 Main Idea
Build an indirect index, called metadata chunks
Each metadata chunk stores metadata of multiple contiguous chunks
File recipes and key recipes store references to metadata chunks
Metadata chunks are highly redundant and can be deduplicated
Changes to backups are limited to a few regions of data
Unchanged regions have long sequences of data, as well as metadata, in identical orders

Our idea builds on an indirect index, called metadata chunks. Each metadata chunk stores the metadata of multiple contiguous data chunks, and file recipes and key recipes keep only the references to metadata chunks. Our observation is that metadata chunks are highly redundant and hence amenable to deduplication: changes to backups are usually limited to a few regions of data, so the unchanged regions have long sequences of data, and hence metadata, in identical orders.
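As a rough illustration of the gain (our numbers, not the paper's): with, say, 8KB average chunks and 1MB average segments, one metadata chunk covers about 128 data chunks, so the file and key recipes shrink by roughly two orders of magnitude, and the per-chunk metadata that moved into metadata chunks now deduplicates across backups.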

13 Segmentation
Metadedup works after the encryption procedure
Each data chunk has been encrypted with MLE
Apply variable-size segmentation scheme [Lillibridge, FAST'09] on encrypted data chunks
Create boundaries if chunk fingerprints match a specific pattern
Address boundary-shift problem and achieve high effectiveness of metadata deduplication

Metadedup works after the encryption procedure; in other words, we assume that each data chunk has already been encrypted with MLE. It applies a variable-size segmentation scheme on the encrypted data chunks and creates a segment boundary if the fingerprint of an encrypted chunk matches a specific pattern. This addresses the boundary-shift problem of segments and enables highly effective metadata deduplication.
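A minimal sketch of this content-defined segmentation over encrypted chunks (the divisor and the pattern test below are illustrative parameters, not the paper's):

```python
import hashlib

DIVISOR = 128  # expected segment size ~ 128 chunks (illustrative)

def segment(encrypted_chunks):
    """End the current segment whenever a chunk fingerprint matches the
    pattern. Since the decision depends only on chunk content, an
    insertion or deletion perturbs only nearby segments (no boundary
    shift)."""
    segments, current = [], []
    for chunk in encrypted_chunks:
        current.append(chunk)
        fp = hashlib.sha256(chunk).digest()
        if int.from_bytes(fp[:8], "big") % DIVISOR == 0:  # pattern match
            segments.append(current)
            current = []
    if current:  # trailing partial segment
        segments.append(current)
    return segments
```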

14 Metadata Management
[Diagram: the metadata (fingerprint, size, key, other info) of each data chunk in a segment is packed into a metadata chunk; data chunks and metadata chunks go through deduplication & storage, while file recipes and key recipes are stored directly]
Encrypt metadata chunks with historical MLE
Apply deduplication to both data chunks and metadata chunks

For each segment, Metadedup creates a metadata chunk that keeps the fingerprint, size, and key of each data chunk in the segment. It encrypts metadata chunks with historical MLE. Both data chunks and metadata chunks are uploaded to the server for deduplication and storage. Metadedup organizes file recipes and key recipes to store the metadata of the metadata chunks only. It encrypts the key recipe with the master key, and stores both the file recipe and the encrypted key recipe on the server.
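A sketch of building and sealing one metadata chunk, assuming PyCryptodome for AES; the fixed-width serialization and the deterministic AES-CTR construction are our simplifications of "encrypt with historical MLE":

```python
import hashlib
from Crypto.Cipher import AES  # pip install pycryptodome (assumed)

def build_metadata_chunk(entries):
    """Pack (fingerprint, size, key) of every data chunk in a segment
    into one blob, using a naive fixed-width layout."""
    return b"".join(fp + size.to_bytes(8, "big") + key
                    for fp, size, key in entries)

def seal_metadata_chunk(blob: bytes):
    """Historical MLE: key = H(plaintext), deterministic encryption, so
    identical metadata chunks yield identical ciphertexts that the
    server deduplicates like ordinary chunks."""
    mle_key = hashlib.sha256(blob).digest()
    # Fixed nonce: determinism is intended here, unlike ordinary encryption
    cipher = AES.new(mle_key, AES.MODE_CTR, nonce=b"\x00" * 8)
    ciphertext = cipher.encrypt(blob)
    fingerprint = hashlib.sha256(ciphertext).digest()  # server index key
    return ciphertext, fingerprint, mle_key  # mle_key goes into the key recipe
```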

15 Operations
Write operation
Client: generate and encrypt data chunks; construct metadata chunks; create file recipe and key recipe
Server: store non-duplicate data and metadata chunks, as well as file recipes and key recipes
Restore operation
First stage: retrieve file recipes, key recipes, and metadata chunks
Second stage: retrieve data chunks and assemble the original file

To sum up, Metadedup supports write and restore operations. In the write operation, the client partitions files into data chunks and encrypts the data chunks with MLE; it groups the data chunks into segments, generates a metadata chunk for each segment, and encrypts the metadata chunks with historical MLE; it further creates the file recipes and key recipes for the metadata chunks and protects the key recipes. The server applies deduplication to data chunks and metadata chunks, and it stores the non-duplicate data and metadata chunks, as well as the file recipes and key recipes. In the restore operation, Metadedup follows two stages to reconstruct the original files: in the first stage, the client retrieves the metadata, including the file recipes, key recipes, and metadata chunks; in the second stage, based on the retrieved metadata, it retrieves the data chunks and assembles the original files.
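Putting the pieces together, here is an end-to-end toy of the two operations under heavy simplifications (an XOR keystream instead of a real cipher, fixed three-chunk segments instead of content-defined ones, a plain dict standing in for the server, and a key recipe kept in the clear rather than sealed with the master key); all names are ours, not the prototype's API:

```python
import hashlib, json

def prf(key: bytes, n: int) -> bytes:
    """Toy keystream (stand-in for a real cipher such as AES-CTR)."""
    out, ctr = b"", 0
    while len(out) < n:
        out += hashlib.sha256(key + ctr.to_bytes(4, "big")).digest()
        ctr += 1
    return out[:n]

def mle_seal(data: bytes):
    """Deterministic encryption with key = H(data)."""
    key = hashlib.sha256(data).digest()
    ct = bytes(a ^ b for a, b in zip(data, prf(key, len(data))))
    return ct, key

def mle_open(ct: bytes, key: bytes) -> bytes:
    return bytes(a ^ b for a, b in zip(ct, prf(key, len(ct))))

SEG = 3       # toy fixed-size segments
server = {}   # fingerprint -> ciphertext; setdefault() is the dedup

def write(chunks):
    """Client write: returns (file_recipe, key_recipe), which reference
    only metadata chunks, never individual data chunks."""
    file_recipe, key_recipe = [], []
    for i in range(0, len(chunks), SEG):
        entries = []
        for c in chunks[i:i + SEG]:
            ct, key = mle_seal(c)                        # MLE on data chunk
            fp = hashlib.sha256(ct).hexdigest()
            server.setdefault(fp, ct)                    # dedup data chunk
            entries.append([fp, key.hex()])
        meta_ct, meta_key = mle_seal(json.dumps(entries).encode())
        mfp = hashlib.sha256(meta_ct).hexdigest()
        server.setdefault(mfp, meta_ct)                  # dedup metadata chunk
        file_recipe.append(mfp)
        key_recipe.append(meta_key.hex())
    return file_recipe, key_recipe

def restore(file_recipe, key_recipe):
    data = b""
    for mfp, mkey in zip(file_recipe, key_recipe):       # stage 1: metadata
        entries = json.loads(mle_open(server[mfp], bytes.fromhex(mkey)))
        for fp, key in entries:                          # stage 2: data
            data += mle_open(server[fp], bytes.fromhex(key))
    return data

chunks = [b"aaaa", b"bbbb", b"cccc", b"dddd", b"eeee"]
fr, kr = write(chunks)
assert restore(fr, kr) == b"".join(chunks)
```

Note how deduplication falls out of the `setdefault` calls: a second backup with mostly unchanged segments reproduces identical metadata-chunk ciphertexts, so the recipes only need to re-store their fingerprints.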

16 Security Analysis
Retain confidentiality for data chunks as in encrypted deduplication
If server-aided MLE is applied, ensure security for all data chunks
If historical MLE is applied, ensure security for unpredictable data chunks
Case I: server-aided MLE is applied to data chunks
Metadata chunks are fully protected
Case II: historical MLE is applied to data chunks
Infeasible to launch brute-force attack against metadata chunks

Metadedup preserves the security of data chunks as in encrypted deduplication. Specifically, if server-aided MLE is applied, it ensures security for all data chunks; if historical MLE is applied, it ensures security for unpredictable data chunks. Based on how the data chunks are protected, Metadedup ensures the security of metadata in two cases. If server-aided MLE is applied, then even though we use historical MLE to protect metadata chunks, the metadata chunks are fully protected. If historical MLE is applied, an adversary can sample the metadata of data chunks and launch a brute-force attack against metadata chunks; we show that such a brute-force attack is much more computationally expensive (intuitively, the adversary must correctly guess the metadata of every data chunk in a segment to reconstruct a single metadata chunk) and is infeasible to launch in practice.

17 Implementation
Metadedup builds on our prior system CDStore [Li, ATC'15]
Augment CDStore with metadata deduplication
~7.5K LoC in total
Open issues:
Protect sensitive filenames via obfuscation
Optimize restore performance

We build on our prior system CDStore to implement Metadedup. The prototype includes around 7.5K lines of code in total and augments CDStore with metadata deduplication. Our implementation raises some open issues, such as protecting sensitive filenames via obfuscation and optimizing the restore performance.

18 Experimental Setup
Testbed: Xeon 2.40GHz machines connected via a 1Gb/s switch
Datasets:
FSL: file system snapshots (115 full daily backups, 56.20TB in total)
VM: virtual image snapshots (26 full daily backups, 39.61TB in total)
Evaluation goals: performance penalty? storage savings? can they be further improved?

We evaluate Metadedup on our local testbed and use two real-world datasets for our storage evaluation. Our evaluation seeks to answer three high-level questions. What is the performance penalty of Metadedup compared to the conventional encrypted deduplication approach that does not apply metadata deduplication? Can Metadedup achieve storage savings through metadata deduplication? And can Metadedup further improve storage savings when combined with existing compression approaches?

19 Performance
Note: Md-X denotes a Metadedup instance with an average segment size of X
Metadedup adds small write (~4.05%) and medium restore overheads
Metadedup incurs limited metadata processing overhead (details in our paper)

We compare the performance of Metadedup with CDStore in the write and restore operations. On the x axis, the notation Md-X means the Metadedup instance configured with an average segment size of X. We can see that Metadedup adds a small write overhead and a medium restore overhead to CDStore. More experiments on metadata processing overhead and scalability can be found in our paper.

20 Overall Storage Efficiency
Components/Metrics      Raw    512KB    1MB      2MB      4MB
FSL  Total (GB)         —      12.211   10.767   10.428   10.704
     Storage saving     -      97.07%   97.46%   97.55%   97.47%
     Index overhead     -      1.94%    1.07%    0.60%    0.33%
VM   Total (GB)         —      27.438   26.981   37.900   42.204
     Storage saving     -      95.74%   95.81%   94.03%   93.33%
     Index overhead     -      1.91%    0.96%    0.70%    0.39%

Metadedup reduces the storage space of metadata, while incurring limited indexing penalties (see breakdown in our paper)
Saves metadata storage by 97.46% at the expense of 1.07% index overhead

We compare the metadata storage of Metadedup under different average segment sizes with the raw metadata size without deduplication or compression. The table shows the results after storing all FSL or VM backups. We find that Metadedup significantly reduces the storage space of metadata, while incurring limited indexing penalties. For an average segment size of 1MB, it saves metadata storage by 97.46% at the expense of 1.07% index overhead.

21 Storage Saving
Storage saving of Metadedup is significantly higher than those of the baseline compression approaches [Meister, FAST'13]

We compare the storage efficiency of Metadedup with that of existing compression approaches in long-term backup; the details of the compression approaches can be found in our paper. We can see that the metadata storage saving of Metadedup is significantly higher than those of the baseline compression approaches.

22 Index Overhead
Compared with the baseline compression approaches [Meister, FAST'13], Metadedup adds small index overhead

We plot the index overhead in the long term. Metadedup only adds a small index overhead.

23 Combined with Compression
Components/Metrics                FSL      VM
Metadedup + ZC  Total (GB)        10.679   26.682
                Storage saving    97.48%   95.86%
                Index overhead    1.07%    0.96%
Metadedup + PC  Total (GB)        9.949    26.630
                Storage saving    97.68%   95.85%
                Index overhead    1.11%    1.00%

Combination: apply Metadedup, followed by compression, to suppress the metadata of metadata chunks
Metadedup is only marginally improved
Compression cannot be applied to the physical metadata chunks

We consider combined approaches, where we apply Metadedup followed by compression to further compress the metadata of the metadata chunks. We only consider two compression approaches here, because the other approaches lead to high index overhead. Metadedup is only marginally improved by the combination. The reason is that metadata chunks take a large part of the metadata storage of Metadedup, but the compression approaches cannot be applied to the encrypted metadata chunks.

24 Conclusion
Metadedup: an encrypted deduplication storage system with space-efficient metadata management
Build on indirection to apply metadata deduplication
Implement a prototype
Conduct extensive prototype and trace-driven evaluations
Software:

Finally, we conclude our work. We present Metadedup, which implements metadata deduplication to augment encrypted deduplication storage with space-efficient metadata management. We implement a Metadedup prototype and conduct extensive prototype and trace-driven evaluations. The source code of our prototype can be found on this website. That is all. Thank you. I am glad to answer your questions.

