Cluster-Based Delta Compression of a Collection of Files
Zan Ouyang, Nasir Memon, Torsten Suel, Dimitre Trendafilov
CIS Department, Polytechnic University, Brooklyn, NY 11201

Overview:
1. Introduction and Motivation
2. Some Known Approaches and Results
3. A Cluster-Based Approach
4. Experimental Results
5. Conclusions

Problem: "how to achieve better compression of large collections of files by exploiting similarity among the files"

1. Introduction

Delta Compression (differential compression)
Goal: improve communication and storage efficiency by exploiting file similarity
Example: distribution of software updates
- new and old versions are similar
- only need to send a patch to update
- delta compression: how to compute a concise patch
A delta compressor encodes a target file w.r.t. a reference file by computing a highly compressed representation of their differences.

Application Scenario: the server has a copy of both files, so computing the delta is a local problem at the server.
[Diagram: client sends a request; server, holding F_new and F_old, returns d(F_new, F_old)]
Remote File Synchronization (e.g., rsync): the server does not know F_old, so a protocol between the two parties is needed.

Applications:
- patches for software upgrades
- revision control systems (RCS, CVS)
- versioning file systems
- improving HTTP performance
  - optimistic deltas for latency (AT&T)
  - deltas for wireless links
- storing document collections (web sites, AvantGo)
- incremental backup
- compression of file collections (improving tar+gzip)

Problem in this Paper:
- we are given a set of N files
- some of the files share similarities, but we do not know which are most similar
- what is the best way to efficiently encode the collection using delta compression?
- limitation: we allow only a single reference file for each target file
- need to find a good sequence of delta compression operations between pairs of files

Delta Compression Tools and Approaches
- compute an edit sequence between the files
  - used by the diff and bdiff utilities
  - what is a good distance measure? (block move, copy)
- practical approach based on Lempel/Ziv: encode repeated substrings by pointing to previous occurrences, including occurrences in the reference file (see the sketch below)
- available tools:
  - vdelta/vcdiff: K.-P. Vo (AT&T)
  - xdelta: J. MacDonald (UCB)
  - zdelta: Trendafilov, Memon, Suel (Poly)
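To make the Lempel/Ziv approach concrete, here is a minimal sketch using Python's zlib preset-dictionary feature as a stand-in for vcdiff/xdelta/zdelta; this is an assumption of the sketch, not how those tools work internally (zlib can only reference the last 32 KB of the dictionary, while the real tools window over the whole reference file and compress better). The names delta_encode/delta_decode are illustrative.

```python
import zlib

def delta_encode(target: bytes, reference: bytes) -> bytes:
    # Priming the compressor with the reference lets substrings shared
    # with it be encoded as back-references, like an LZ delta compressor.
    comp = zlib.compressobj(zdict=reference[-32768:])  # 32 KB window limit
    return comp.compress(target) + comp.flush()

def delta_decode(delta: bytes, reference: bytes) -> bytes:
    decomp = zlib.decompressobj(zdict=reference[-32768:])
    return decomp.decompress(delta) + decomp.flush()

old = b"".join(b"record %d: alpha beta gamma delta\n" % i for i in range(500))
new = old.replace(b"record 250:", b"record 250 (updated):")
patch = delta_encode(new, old)
assert delta_decode(patch, old) == new
print(len(new), len(zlib.compress(new)), len(patch))  # patch is typically far smaller
```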

Some Technical Details:
- in reality, there are some more difficulties
- how to express distances in the reference file
  - sources of copies in F_ref are aligned with positions in F_target
  - allows better coding
- how to handle long files: window shifting in both files
[Diagrams: alignment of copy sources between F_target and F_ref; sliding windows over both files]

Performance of delta compressors: gzip, xdelta, vcdiff, zdelta
- files: gcc and gnu/emacs distributions (Hunt/Vo/Tichy)
- caveat for running times: I/O issues
  - gzip & zdelta: first number direct access, second standard I/O
  - vcdiff numbers: standard I/O

gcc      size (KB)   time (s)
orig         ?           ?
gzip         ?          ?/30
xdelta       ?           ?
vcdiff       ?           ?
zdelta       ?          ?/32

emacs    size (KB)   time (s)
orig         ?           ?
gzip         ?          ?/35
xdelta       ?           ?
vcdiff       ?           ?
zdelta       ?          ?/42

(remaining numeric entries were lost in transcription)

2. Some Known Approaches and Results
- reduction to the Maximum Branching Problem in a weighted directed graph
- a branching is an acyclic subset of edges with at most one incoming edge per node
- a maximum branching is a branching of maximum weight (sum of the weights of its edges)
- can be solved in near-linear time

Reduction to Optimum Branching:
[Diagram: nodes for the empty file and documents D_1 ... D_4; w(i,j) = size of delta between D_i and D_j]
- one node for each document
- directed edges between every pair of nodes
- weight equal to the benefit due to delta compression
Algorithm:
1. generate the graph
2. compute an optimum branching
3. delta-compress according to the branching
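A sketch of this reduction, assuming the networkx library's maximum_branching (Edmonds' algorithm) and reusing the illustrative delta_encode() helper from the earlier sketch as the weight oracle:

```python
import zlib
import networkx as nx

def optimal_references(files):
    # fallback cost: compress each file stand-alone
    standalone = [len(zlib.compress(f)) for f in files]
    G = nx.DiGraph()
    G.add_nodes_from(range(len(files)))
    for i, ref in enumerate(files):        # N^2 delta computations --
        for j, tgt in enumerate(files):    # the bottleneck that the
            if i == j:                     # paper's clustering phase avoids
                continue
            benefit = standalone[j] - len(delta_encode(tgt, ref))
            if benefit > 0:                # keep only edges that help
                G.add_edge(i, j, weight=benefit)
    branching = nx.maximum_branching(G)    # at most one incoming edge per node
    # map each file to its chosen reference; None = compress stand-alone
    refs = {}
    for j in range(len(files)):
        preds = list(branching.predecessors(j)) if j in branching else []
        refs[j] = preds[0] if preds else None
    return refs
```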

Summary of known results:
- optimal delta in polynomial time if each file uses only one reference file (optimal branching)
- problems with the optimal branching approach:
  - N^2 edges between N documents
  - the solution might have long chains
- NP-complete if we limit the length of chains [Tate 91]
- NP-complete if we use more than one reference per file (hypergraph branching, Adler/Mitzenmacher 2001)

Experimental Results for Optimum Branching:
- fast implementation based on the Boost library
- tested on collections of pages from web sites
- inefficient for more than a few hundred files
- most time spent on computing the edge weights
- the branching itself is much faster (but also N^2)

3. A Cluster-Based Approach
Challenge: compute a good branching without computing all N^2 edges precisely.
General Framework:
(1) Collection Analysis: perform a clustering computation that identifies pairs of files that are good candidates for delta compression; build a sparse subgraph G' with edges between such pairs.
(2) Weight Assignment: compute or approximate the edge weights of G'.
(3) Maximum Branching: perform a maximum branching computation on G'.

Techniques for efficient clustering
- recent clustering techniques based on sampling, by Broder and by Indyk
  - first generate all q-grams (substrings of length q)
  - files are similar if they have many q-grams in common
  - use minwise independent hashing to select a small number w of representative q-grams per file
- Broder: O(w * N^2) to estimate the distances between all N^2 pairs of files (similarly Manber/Wu)
- Indyk: O(w * N * log N) to identify pairs of files with distance below a chosen threshold

Techniques for efficient clustering (cont.)
Broder: minwise hashing (MH)
- use a random hash function to map q-grams to integers
- retain the smallest integer obtained as the sample
- repeat w times to get w samples (can be optimized)
- use the samples to estimate the distance between each pair
Indyk: locality-sensitive hashing (LSH)
- sampling same as before
- instead of computing the distance between each pair of samples, construct signatures from the samples
- signature = bitwise concatenation of several samples
- two samples have a large intersection if they share an identical signature
- repeat a few times
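A sketch of both sampling schemes, assuming one independent hash function per sample slot (the slide notes the w repetitions can be optimized); the helper names are illustrative:

```python
import hashlib

def qgrams(data: bytes, q: int = 4) -> set:
    return {data[i:i + q] for i in range(len(data) - q + 1)}

def minhash_sample(data: bytes, w: int = 100) -> list:
    # one hash function per slot, keyed by a salt; keeping the minimum
    # hash over all q-grams selects one representative q-gram per slot
    def h(gram: bytes, seed: int) -> int:
        digest = hashlib.blake2b(gram, digest_size=8,
                                 salt=seed.to_bytes(8, "big")).digest()
        return int.from_bytes(digest, "big")
    grams = qgrams(data)
    return [min(h(g, seed) for g in grams) for seed in range(w)]

def estimated_resemblance(s1: list, s2: list) -> float:
    # fraction of slots whose minima agree estimates |A ∩ B| / |A ∪ B|
    return sum(a == b for a, b in zip(s1, s2)) / len(s1)

def lsh_signature(sample: list, slots: tuple = (0, 1, 2, 3)) -> tuple:
    # Indyk-style LSH: concatenate several samples into one signature and
    # repeat for a few slot choices; only files with identical signatures
    # become candidate pairs, avoiding all-pairs comparisons
    return tuple(sample[i] for i in slots)
```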

Discussion of Distance Measures
Two possible distance measures on q-grams, where S(F) is the set of q-grams of file F:
- Intersection: d(F, F') = |S(F) ∩ S(F')| / |S(F) ∪ S(F')|
- Containment: d(F, F') = |S(F) ∩ S(F')| / |S(F)|
Intersection is symmetric; containment is asymmetric.
There are formal bounds relating both measures to edit distance with and without block moves.
Some inaccuracy is due to the distance measure and to sampling; in practice this is OK (as we will see).
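For reference, the two measures computed exactly on full q-gram sets (the sampled estimate above approximates the intersection measure), reusing the qgrams() helper from the previous sketch:

```python
def intersection_measure(f1: bytes, f2: bytes) -> float:
    a, b = qgrams(f1), qgrams(f2)
    return len(a & b) / len(a | b)   # symmetric

def containment_measure(f1: bytes, f2: bytes) -> float:
    a, b = qgrams(f1), qgrams(f2)
    return len(a & b) / len(a)       # asymmetric: how much of f1 appears in f2
```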

Design decisions:
- intersection or containment measure?
- which edges to keep?
  - all edges above a certain similarity threshold
  - the k best incoming edges for each file (NN)
- approximate or exact weights for G'?
- different sampling approaches
- sampling ratios
- value of q for q-grams (we used q = 4)
- the Indyk approach supports only the intersection measure and a similarity threshold (extensions are difficult), and it does not provide weight estimates

4. Experimental Results
- implementation in C++ using the zdelta compressor
- experiments on P4 and UltraSparc architectures
- data sets:
  - union of 6 commercial web sites used before (49 MB)
  - pages from the poly.edu domain (257 MB)
  - gcc and emacs distributions (>1000 files, 25 MB)
  - changed pages from web sites
- comparison with gzip, tar+gzip, and cat+gzip
- see the paper and technical report for details

Threshold-Based Approach:
- keep edges with distance below a threshold
- based on Broder (MH)
- fixed sample size and fixed sampling ratio

Discussion of the threshold-based approach:
- the sampling method is not critical
- containment is slightly better than intersection (but can result in undesirable pairings)
- there is no really good threshold:
  - a low threshold keeps too many edges (inefficient)
  - a high threshold cuts too many edges (bad compression)
- possible solutions:
  - use the k nearest neighbors
  - use estimates instead of precise edge weights
  - use other heuristics to prune edges from dense areas

Nearest Neighbor-Based Approach:
- keep edges from the k nearest neighbors (k = 1, 2, 4, 8)
- based on Broder (MH) with fixed sampling ratio
- good compression even for small k
- time for computing the weights increases linearly with k
- clustering time is linear in the sampling ratio and quadratic in the number of files
- open theoretical question on the performance of k-NN
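A sketch of the k-NN pruning step, assuming the minhash samples and resemblance estimator from the earlier sketch; it keeps only the k best incoming edges per file before exact weights and the branching are computed:

```python
def knn_candidate_edges(samples, k=2):
    # still O(N^2) sample comparisons, but each comparison touches only
    # w integers rather than two whole files
    n = len(samples)
    edges = []
    for j in range(n):
        sims = sorted(((estimated_resemblance(samples[i], samples[j]), i)
                       for i in range(n) if i != j), reverse=True)
        edges.extend((i, j) for _, i in sims[:k])  # k best incoming edges for j
    return edges
```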

Nearest Neighbor with Estimated Weights:
- keep edges from the k nearest neighbors (k = 1, 2, 3, 4)
- use the intersection estimate as the weight of an edge
- compression only slightly worse than with exact weights
- time for computing the weights is reduced to zero
- clustering time is still quadratic in the number of files, but acceptable for up to thousands of files
- zdelta time is affected by disk I/O (compared to tar+gzip)

Indyk (LSH) with Pruning Heuristic:
- threshold and fixed sampling ratio
- heuristic: keep only a sparse subgraph "everywhere"
- compression degrades somewhat
- could try a more sound approach (biased sampling)
- clustering time better than Broder's for a few thousand files or more
- could run Indyk (LSH) followed by Broder (MH)

Experiments on the Large Data Set: pages from poly.edu, 257 MB
- results still very unoptimized in terms of speed
- however: cat+bzip2 does even better in some cases!
  - depends on page ordering and file sizes
  - our techniques are (now) faster

5. Conclusions
- we studied compression of collections of files
- a framework for cluster-based delta compression
- implemented and optimized several approaches
- a rich set of possible problems and techniques
- results still preliminary
- part of an ongoing research effort on delta compression and file synchronization

Current and future work
- extending zdelta to several base files
- compressing a collection using several base files
  - optimal solution is NP-complete
  - approximation results for the optimum benefit
- approach based on Hamiltonian paths (see the sketch below)
  - greedy heuristic for a Hamiltonian path
  - almost as good as maximum branching
  - useful for the case of several base files
  - useful for tar+gzip and tar+bzip2
- goal: towards a better generic utility (than tar+gzip)
- study based on data sets from ftp sites and the web
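A sketch of the greedy Hamiltonian-path heuristic mentioned above, again assuming the minhash helpers from the earlier sketch: it orders the files so that each follows a similar one, which can then serve as the archive order for tar+gzip/tar+bzip2 or as a chain of delta-compression steps.

```python
def greedy_file_order(samples):
    remaining = set(range(len(samples)))
    order = [remaining.pop()]          # arbitrary start file
    while remaining:
        last = order[-1]
        # append the remaining file most similar to the current tail
        nxt = max(remaining,
                  key=lambda i: estimated_resemblance(samples[last], samples[i]))
        remaining.remove(nxt)
        order.append(nxt)
    return order
```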

Application: Large Web Archival System
- consider a massive collection of web pages
- Internet Archive: 10 billion pages, 100 TB
- many versions of each page: the Wayback Machine
- the Archive does not currently employ delta compression [Kahle 2002]
- how to build a terabyte-scale storage system that:
  - employs delta compression
  - has good insertion and random read performance
  - has good streaming performance
- limited to very simple heuristics