EndRE: An End-System Redundancy Elimination Service Bhavish Aggarwal, Aditya Akella, Ashok Anand, Athula Balachandran, Pushkar Chitnis, Chitra Muthukrishnan, Ramachandran Ramjee and George Varghese

Identify and remove redundancy. Implemented either at the IP layer or at the socket layer. Accomplished in two steps: fingerprinting, then matching and encoding.

Fingerprinting: selecting a few “representative regions” for the current block of data handed down by application(s).

Matching and encoding: two approaches for identifying redundant content once representative regions have been identified: Chunk-Match and Max-Match. The two approaches differ in the trade-off between the memory overhead imposed on the server and the effectiveness of RE.

Fingerprinting: Balancing Server Computation with Effectiveness

Notation and terminology. Data block S: the amount of data handed down by an application. w (with S >> w): the size of the minimum redundant string (contiguous bytes) to be identified. Number of potential candidate strings: every w-byte substring of the block is a candidate, i.e., S − w + 1 ≈ S candidates.

A fraction 1/p of these candidates are chosen as representatives; p is varied based on load. Markers: the first byte of each chosen candidate string. Chunks: the strings of bytes between two consecutive markers. Fingerprints: a pseudo-random hash of the fixed w-byte string beginning at each marker. Chunk-hashes: hashes of the variable-sized chunks.
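
To make the terminology concrete, here is a minimal sketch (not from the original slides) that picks markers, derives chunks as the byte ranges between consecutive markers, and computes fingerprints and chunk-hashes. The marker rule is a placeholder (how markers are actually picked is what MODP/MAXP/FIXED/SAMPLEBYTE differ on), and SHA-1 stands in for the hashes used in EndRE.

    import hashlib

    def fingerprint(block: bytes, pos: int, w: int) -> str:
        # Pseudo-random hash of the fixed w-byte string starting at a marker (stand-in hash).
        return hashlib.sha1(block[pos:pos + w]).hexdigest()[:8]

    def chunk_hash(chunk: bytes) -> str:
        # Hash of a variable-sized chunk (stand-in hash).
        return hashlib.sha1(chunk).hexdigest()[:8]

    def pick_markers(block: bytes, p: int, w: int) -> list:
        # Placeholder rule: roughly one candidate in p becomes a marker.
        return list(range(0, len(block) - w + 1, p))

    block = b"example payload example payload " * 8
    w, p = 8, 16
    markers = pick_markers(block, p, w)
    chunks = [block[a:b] for a, b in zip(markers, markers[1:] + [len(block)])]
    print(len(markers), "markers,", len(chunks), "chunks,",
          len([fingerprint(block, m, w) for m in markers]), "fingerprints,",
          len([chunk_hash(c) for c in chunks]), "chunk-hashes")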

Fingerprinting algorithms: MODP, MAXP, FIXED, SAMPLEBYTE.

MODP: marker identification and fingerprinting are both handled by the same hash function; the per-block computational cost is independent of the sampling period p.
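
The slide does not spell out the hash. In prior RE work, MODP computes a fingerprint over every w-byte window of the block and samples those windows whose fingerprint is 0 mod p, which is why the work is proportional to the block size regardless of p. A rough sketch, with a simple polynomial rolling hash standing in for Rabin fingerprints:

    def modp_fingerprints(block: bytes, w: int, p: int,
                          base: int = 257, mod: int = (1 << 31) - 1):
        # Roll a hash over every w-byte window; keep windows whose hash % p == 0.
        # Cost: one hash update per byte, independent of the sampling period p.
        if len(block) < w:
            return []
        selected = []
        h = 0
        top = pow(base, w - 1, mod)              # weight of the byte leaving the window
        for i, b in enumerate(block):
            if i >= w:                           # slide window: drop block[i - w], add block[i]
                h = (h - block[i - w] * top) % mod
            h = (h * base + b) % mod
            if i >= w - 1 and h % p == 0:
                selected.append((i - w + 1, h))  # (marker offset, fingerprint)
        return selected

    print(modp_fingerprints(b"the quick brown fox jumps over the lazy dog" * 20, w=32, p=32)[:5])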

MAXP: markers are chosen as bytes that are local maxima over each region of p bytes of the data block. Once a marker byte is chosen, an efficient hash function such as Jenkins Hash can be used to compute the fingerprint. By increasing p, fewer maxima-based markers need to be identified.
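
A rough sketch of maxima-based marker selection, simplified to non-overlapping p-byte regions whose largest byte becomes the marker (the exact window and tie-breaking rules in EndRE may differ); zlib.crc32 stands in for Jenkins Hash.

    import zlib

    def maxp_markers(block: bytes, p: int):
        # In each p-byte region, the offset of the largest byte value is the marker.
        # Larger p means fewer markers.
        markers = []
        for start in range(0, len(block), p):
            region = block[start:start + p]
            if region:
                markers.append(start + max(range(len(region)), key=region.__getitem__))
        return markers

    def fingerprints(block: bytes, markers, w: int):
        # Jenkins Hash in the paper; crc32 as a stand-in here.
        return [zlib.crc32(block[m:m + w]) for m in markers]

    data = bytes(range(256)) * 8
    ms = maxp_markers(data, p=64)
    print(len(ms), "markers,", len(fingerprints(data, ms, w=32)), "fingerprints")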

FIXED: a content-agnostic approach. Select every p-th byte as a marker (incurring no marker-selection cost). Once markers are chosen, S/p fingerprints are computed using Jenkins Hash, as in MAXP.
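
A one-function sketch of FIXED, again with zlib.crc32 standing in for Jenkins Hash:

    import zlib

    def fixed_fingerprints(block: bytes, p: int, w: int):
        # Every p-th byte is a marker, giving roughly S/p fingerprints per S-byte block.
        return [(m, zlib.crc32(block[m:m + w])) for m in range(0, len(block) - w + 1, p)]

    print(len(fixed_fingerprints(b"x" * 65536, p=64, w=32)))  # ~65536/64 = 1024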

SAMPLEBYTE: uses a 256-entry lookup table with a few predefined positions set. A byte is chosen as a marker if the corresponding entry in the lookup table is set; its fingerprint is computed using Jenkins Hash, and p/2 bytes of content are skipped before the process repeats.

SAMPLEBYTE
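
The original slide here presumably showed the SAMPLEBYTE pseudocode figure. A minimal re-creation of the scan loop based on the description above; the table contents are hypothetical and crc32 stands in for Jenkins Hash.

    import zlib

    def samplebyte(block: bytes, table, w: int, p: int):
        # A byte whose table entry is set becomes a marker: fingerprint the w bytes
        # at the marker, then skip p/2 bytes to bound the sampling rate.
        fingerprints = []
        i = 0
        while i < len(block) - w:
            if table[block[i]]:
                fingerprints.append((i, zlib.crc32(block[i:i + w])))
                i += p // 2
            else:
                i += 1
        return fingerprints

    # Hypothetical table: mark a handful of byte values (the real table is trace-derived).
    table = [False] * 256
    for b in b" e\n0t":
        table[b] = True
    print(len(samplebyte(b"GET /index.html HTTP/1.1\r\nHost: example.com\r\n" * 50, table, w=32, p=64)))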

SAMPLEBYTE lookup table creation: using network traces from one of the enterprise sites, sort the characters in decreasing order of their presence in the identified redundant content, and set the corresponding entries in the lookup table to 1 until the compression gain diminishes. The intuition is to increase the probability that characters likely to be part of redundant content are selected as markers.
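
A sketch of the table-construction step. The input here is made-up stand-in content rather than real enterprise traces, and the stopping rule ("until compression gain diminishes") is reduced to a fixed number of entries.

    from collections import Counter

    def build_sample_table(redundant_bytes: bytes, max_set: int = 8):
        # Rank byte values by how often they appear in redundant content and
        # set the top entries. The real rule adds entries only while the
        # measured compression gain keeps improving.
        table = [False] * 256
        for value, _count in Counter(redundant_bytes).most_common(max_set):
            table[value] = True
        return table

    sample = b"GET /logo.png HTTP/1.1\r\nAccept: image/png\r\n\r\n" * 100
    table = build_sample_table(sample)
    print([chr(i) for i, is_set in enumerate(table) if is_set])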

Matching and Encoding: Optimizing Storage and Client Computation

Matching and encoding is accomplished in two ways. Chunk-Match: identifies chunks that repeat in full across data blocks. Max-Match: identifies maximal matches around fingerprints that are repeated across data blocks.

Chunk-Match: chunk-hashes computed from the payloads of future data blocks are looked up in the chunk-hash store to determine whether one or more chunks have been encountered earlier. Once matching chunks are identified, they are replaced by metadata.
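
A minimal sketch of the Chunk-Match lookup, with a Python dict standing in for the chunk-hash store and a simple tagged token standing in for EndRE's actual metadata encoding.

    import hashlib

    chunk_store = {}          # chunk-hash -> id of a previously seen chunk
    next_id = 0

    def encode_block(chunks):
        # Replace chunks whose hash is already in the store with a small
        # metadata token; remember new chunks for future blocks.
        global next_id
        encoded = []
        for chunk in chunks:
            h = hashlib.sha1(chunk).hexdigest()
            if h in chunk_store:
                encoded.append(("MATCH", chunk_store[h]))   # metadata instead of bytes
            else:
                chunk_store[h] = next_id
                encoded.append(("RAW", chunk))
                next_id += 1
        return encoded

    print(encode_block([b"hello world", b"payload A"]))
    print(encode_block([b"payload B", b"hello world"]))      # second call finds a match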

EndRE optimization: offload all storage management and computation to the server. The client simply maintains a fixed-size circular FIFO log of data blocks. For each matching chunk, the server encodes and sends a four-byte tuple locating the chunk in the client’s cache. The client “de-references” the offsets sent by the server and reconstructs the compressed regions from its local cache.
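
A sketch of the asymmetric split described above: the client keeps only a fixed-size circular FIFO byte cache and "de-references" (offset, length) references against it. The cache size and tuple layout here are illustrative, not EndRE's wire format.

    class ClientCache:
        # Fixed-size circular FIFO log of the data the client has received.
        def __init__(self, size: int):
            self.buf = bytearray(size)
            self.size = size
            self.head = 0                       # next write position

        def append(self, data: bytes):
            for b in data:                      # byte-at-a-time wrap-around for clarity
                self.buf[self.head] = b
                self.head = (self.head + 1) % self.size

        def deref(self, offset: int, length: int) -> bytes:
            return bytes(self.buf[(offset + i) % self.size] for i in range(length))

    cache = ClientCache(1 << 20)                # e.g. a 1 MB client cache
    cache.append(b"previously transferred content ...")
    # Server sends a compressed region as (offset, length) into the client's cache:
    print(cache.deref(offset=11, length=11))    # -> b"transferred"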

Drawbacks of Chunk-Match: it can only detect exact matches of the chunks computed for a data block, so it may miss redundancies that span contiguous portions of neighboring chunks, or redundancies that cover only portions of chunks.

Max-Match: for each matching fingerprint, the corresponding data block is retrieved from the cache and the match region is expanded byte by byte in both directions to obtain the maximal region of redundant bytes. Matched regions are then encoded as tuples.
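
A sketch of the byte-by-byte expansion around a matching fingerprint. It is simplified to a single cached block rather than EndRE's full payload cache, and returns a plain (offset, length) pair.

    def expand_match(new: bytes, new_pos: int, cached: bytes, cached_pos: int, w: int):
        # Starting from a w-byte fingerprint match at (new_pos, cached_pos), grow the
        # match left and right one byte at a time; return (cached offset, length).
        left = 0
        while new_pos - left > 0 and cached_pos - left > 0 \
                and new[new_pos - left - 1] == cached[cached_pos - left - 1]:
            left += 1
        right = w
        while new_pos + right < len(new) and cached_pos + right < len(cached) \
                and new[new_pos + right] == cached[cached_pos + right]:
            right += 1
        return cached_pos - left, left + right

    cached = b"....the quick brown fox jumps over the lazy dog...."
    new    = b"abcdthe quick brown fox jumps over itabcd"
    print(expand_match(new, new_pos=4, cached=cached, cached_pos=4, w=8))   # -> (4, 31)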

EndRE optimization: because Max-Match relies on byte-by-byte comparison, any fingerprint hash collision is detected and recovered from.

EndRE optimization: simple hashing can be used, since a byte-by-byte comparison is needed anyway. Optimize the representation of the fingerprint hash table to limit storage needs: the fingerprint table is a contiguous set of offsets, indexed by the fingerprint hash value. For a 16 MB (2^24-byte) cache and p = 64, this gives 2^24/64 = 2^18 fingerprints.
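
A sketch of this compact fingerprint table: a contiguous array of 4-byte cache offsets indexed directly by bits of the fingerprint hash, rather than a full hash map. Collisions simply overwrite older entries and are caught later by Max-Match's byte-by-byte comparison. Sizes follow the slide's example (2^24-byte cache, p = 64, so 2^18 slots); the sentinel value and masking scheme are illustrative.

    from array import array

    CACHE_BYTES = 1 << 24             # 16 MB payload cache
    P = 64                            # sampling period
    SLOTS = CACHE_BYTES // P          # 2^18 fingerprint slots
    EMPTY = 0xFFFFFFFF                # sentinel for "no entry"

    # Contiguous table of 4-byte cache offsets, indexed by the low bits of the hash.
    fp_table = array("I", [EMPTY]) * SLOTS

    def insert(fp_hash: int, cache_offset: int):
        fp_table[fp_hash & (SLOTS - 1)] = cache_offset      # overwrite on collision

    def lookup(fp_hash: int):
        off = fp_table[fp_hash & (SLOTS - 1)]
        return None if off == EMPTY else off                # verified byte-by-byte later

    insert(0xDEADBEEF, 4096)
    print(lookup(0xDEADBEEF), lookup(0x12345678))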

Evaluation