Compression & Networking

Background  network links are getting faster and faster but  many clients still connected by fairly slow links (mobile?)  people wish to send more and more data  battery life is a king problem how can we make this transparent to the user? senderreceiver data knowledge about data at receiver

Two standard techniques
caching: "avoid sending the same object again"
- done on the basis of "atomic" objects
- thus only works if objects are unchanged
- what about objects that are only slightly changed?
compression: "remove redundancy in transmitted data"
- avoid repeated substrings in transmitted data
- can be extended to the history of past transmissions (at some overhead)
- what if the sender has never seen the data at the receiver?

Types of techniques
Common knowledge between sender & receiver
- unstructured file: delta compression
"Partial" knowledge
- unstructured files: file synchronization
- record-based data: set reconciliation

Formalization
Delta compression [diff, zdelta, REBL, …]
- compress file f by deploying a known file f'
- compress a group of files
- speed up web access by sending the difference between the requested page and the ones available in cache
File synchronization [rsync, zsync]
- client updates its old file f_old with f_new available on a server
- mirroring, shared crawling, content distribution networks
Set reconciliation
- client updates its structured file f_old with the file f_new available on a server
- update of contacts or appointments, intersection of inverted lists in a P2P search engine
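
The set reconciliation case can be illustrated with a deliberately naive digest-exchange sketch (function names and the 8-byte digest truncation are my own choices, not from the slides; research protocols get the communication close to the number of differences d, while this one is O(n)):

```python
import hashlib

def digest(rec: bytes) -> bytes:
    # Short fingerprint of a record (8 bytes of SHA-256; an assumption).
    return hashlib.sha256(rec).digest()[:8]

def server_reconcile(server_recs, client_digests):
    # Server side: given the client's record digests, return the records
    # the client is missing and the digests it should drop.  Naive O(n)
    # communication; real reconciliation schemes do far better.
    mine = {digest(r): r for r in server_recs}
    missing = [r for d, r in mine.items() if d not in client_digests]
    stale = [d for d in client_digests if d not in mine]
    return missing, stale
```

The client then inserts `missing` and deletes the records behind `stale`, after which both sides hold the same set.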

Z-delta compression (one-to-one)
Problem: we have two files, f_known (known to both parties) and f_new, and the goal is to compute a file f_d of minimum size such that f_new can be derived from f_known and f_d
- assume that block moves and copies are allowed
- find an optimal covering set of f_new based on f_known
- the LZ77 scheme provides an efficient, optimal solution: f_known is the "previously encoded text", so compress the concatenation f_known · f_new, emitting output only from f_new onward
- zdelta is one of the best implementations
Used e.g. in version control, backups, and transmission.
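
zdelta itself is not in the Python standard library, but the same LZ77 idea can be sketched with zlib's preset-dictionary support: priming the compressor with f_known lets repeated substrings of f_new be encoded as back-references into the reference file. A minimal sketch (note that zlib's 32 KB window limits how much of the reference is actually usable):

```python
import zlib

def delta_compress(f_known: bytes, f_new: bytes) -> bytes:
    # Prime the compressor with the reference file as a preset
    # dictionary, so substrings of f_known occurring in f_new are
    # emitted as short LZ77-style copies instead of literals.
    c = zlib.compressobj(9, zdict=f_known)
    return c.compress(f_new) + c.flush()

def delta_decompress(f_known: bytes, f_d: bytes) -> bytes:
    # The receiver must supply the same reference file to decode.
    d = zlib.decompressobj(zdict=f_known)
    return d.decompress(f_d) + d.flush()
```

For a file that differs from the reference by a small edit, the delta is tiny even when the content itself is incompressible.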

Efficient web access
Use zdelta to reduce traffic:
- the old version is available at both ends of the slow link (one in the client cache, one on the proxy)
- restricted to pages already visited (30% cache hits), or to URL-prefix matches
Dual-proxy architecture: a pair of proxies (client cache + proxy), located on each side of the slow link, use a proprietary protocol to increase communication performance
[Diagram: client request → cache reference in the small client cache → delta encoding over the slow link → proxy request → web page reference over the fast link]

Cluster-based delta compression
Problem: we wish to compress a group of files F
- useful for a dynamic collection of web pages, back-ups, …
Apply pairwise zdelta: find a good reference for each f ∈ F
Reduction to the Min Branching problem on directed graphs:
- build a (complete?) weighted graph G_F: nodes = files, weights = zdelta sizes
- insert a dummy node connected to all nodes, with weights = gzip-coding sizes
- compute the directed spanning tree of minimum total cost covering G's nodes

               space    time
uncompressed   30 MB    ---
tgz            20%      linear
THIS           8%       quadratic

Improvement (group of files)
Problem: constructing G is very costly, n² edge calculations (zdelta executions); we wish to exploit some pruning approach
- collection analysis: cluster the files that appear similar and are thus good candidates for zdelta compression
- build a sparse weighted graph G'_F containing only edges between pairs of files in the same cluster
- assign weights: estimate appropriate edge weights for G'_F, thus saving zdelta executions; nonetheless, still n² time

               space    time
uncompressed   260 MB   ---
tgz            12%      2 mins
THIS           8%       16 mins

File Synchronization

File synch: the problem
- the client requests an update of an old file
- the server has the new file but does not know the old file
- goal: update f_old without sending the entire f_new
- rsync: a file-synch tool distributed with Linux
[Diagram: client holds f_old and sends a request; server returns an update towards f_new]
Delta compression is a sort of "local" synch, since there the server knows both files.

The rsync algorithm
[Diagram: client holds f_old and sends a few hashes (block signatures) to the server; server holds f_new and returns the encoded file]

The rsync algorithm (cont.)
- simple, widely used, single round trip
- optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals
- choice of block size is problematic (default: max{700, √n} bytes)
- not good in theory: the granularity of changes may disrupt the use of blocks
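
The round trip above can be sketched end to end. This is a simplification, not rsync's actual wire format: the checksums are a toy Adler-style rolling hash plus full MD5 (rsync truncates its strong hash), and all function names are my own. The client builds a block signature of f_old; the server scans f_new with the rolling hash and sends copy/literal instructions:

```python
import hashlib

def weak_hash(block: bytes) -> int:
    # Adler-style checksum split into two 16-bit halves so it can roll.
    a = sum(block) % 65536
    b = sum((len(block) - i) * x for i, x in enumerate(block)) % 65536
    return (b << 16) | a

def roll(h: int, out: int, inn: int, k: int) -> int:
    # Slide the k-byte window one byte: drop `out`, append `inn`.
    a = ((h & 0xffff) - out + inn) % 65536
    b = ((h >> 16) - k * out + a) % 65536
    return (b << 16) | a

def make_signature(f_old: bytes, k: int) -> dict:
    # Client side: weak hash -> (block offset, strong hash), one entry
    # per aligned k-byte block (weak collisions overwrite; a sketch).
    return {weak_hash(f_old[i:i + k]):
            (i, hashlib.md5(f_old[i:i + k]).digest())
            for i in range(0, len(f_old) - k + 1, k)}

def encode(f_new: bytes, sig: dict, k: int) -> list:
    # Server side: scan f_new with the rolling hash; emit ("copy", off)
    # for blocks the client already holds, ("lit", bytes) otherwise.
    out, lit, i = [], bytearray(), 0
    h = weak_hash(f_new[:k]) if len(f_new) >= k else 0
    while i + k <= len(f_new):
        m = sig.get(h)
        if m and hashlib.md5(f_new[i:i + k]).digest() == m[1]:
            if lit:
                out.append(("lit", bytes(lit)))
                lit = bytearray()
            out.append(("copy", m[0]))
            i += k
            if i + k <= len(f_new):
                h = weak_hash(f_new[i:i + k])
        else:
            lit.append(f_new[i])
            if i + k < len(f_new):
                h = roll(h, f_new[i], f_new[i + k], k)
            i += 1
    lit.extend(f_new[i:])
    if lit:
        out.append(("lit", bytes(lit)))
    return out

def decode(ops: list, f_old: bytes, k: int) -> bytes:
    # Client side: rebuild f_new from its old file plus the delta.
    return b"".join(f_old[arg:arg + k] if op == "copy" else arg
                    for op, arg in ops)
```

An insertion in the middle of the file only costs a short literal run until the rolling hash re-aligns with the next old block boundary, which is exactly why the rolling (rather than block-aligned) scan matters.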

A new framework: zsync
- the server sends the hashes (in rsync it is the client that does), and the client checks them
- the server deploys the common f_ref to compress the new f_tar (rsync just compresses f_tar on its own)

Small differences (e.g. an agenda)
- the file is split into n/k blocks of k elements each, giving log(n/k) levels of hashes
- the server sends hashes to the client level by level; the client checks equality against the hashes of its own substrings
- if there are d differences, then on each level at most d hashes fail to match and need to be refined further
- communication complexity is O(d · log(n/k) · log n) bits [one upward path per differing k-block]
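
The level-by-level exchange can be sketched as a recursive hash comparison over both copies (names are my own; for simplicity this sketch assumes equal-length files whose length is a power-of-two multiple of k, and it halves ranges instead of sending whole levels at once):

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def find_diffs(server: bytes, client: bytes, k: int) -> list:
    # Compare the hash of a range; if it matches, prune the whole
    # subtree, otherwise split in half and recurse down to k-byte
    # blocks.  With d differing blocks, only d root-to-leaf paths of
    # length log(n/k) are explored, matching the stated bound.
    diffs = []
    def rec(lo: int, hi: int) -> None:
        if h(server[lo:hi]) == h(client[lo:hi]):
            return
        if hi - lo <= k:
            diffs.append(lo)      # differing leaf block
            return
        mid = lo + (hi - lo) // 2
        rec(lo, mid)
        rec(mid, hi)
    rec(0, len(server))
    return diffs
```

Once the differing block offsets are known, only those k-element blocks need to be transmitted.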