Compression of documents


Compression of documents
Prof. Paolo Ferragina, Algoritmi per "Information Retrieval"
Paolo Ferragina, Dipartimento di Informatica, Università di Pisa

Raw docs are needed


LZ77
Algorithm's step:
- output <dist, len, next-char>
- advance by len + 1
A buffer "window" of fixed length slides over the text; the dictionary consists of all substrings starting inside the window.
[Slide figure: the window moving over the text "a a c a a c a b c …", with example triples <6,3,a> and <3,4,c>.]

Example: LZ77 with window
Window size = 6. At each step the encoder finds the longest match within the window W, outputs (dist, len, next-char), and advances past the match plus the next character: the example text is parsed into the triples (0,0,a) (1,1,c) (3,4,b) (3,3,a) (1,2,c).
gzip -1…-9: the levels trade speed for compression ratio by varying the effort spent searching for matches in the window.
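The greedy parsing step can be sketched in a few lines of Python. This is a toy encoder, not gzip's implementation; the function name and the tiny default window are illustrative:

```python
def lz77_encode(text, window=6):
    """Toy LZ77 encoder: at each step, find the longest match starting
    inside the sliding window, emit (dist, len, next_char), advance."""
    i, out = 0, []
    while i < len(text):
        best_len, best_dist = 0, 0
        for j in range(max(0, i - window), i):   # candidate match starts
            l = 0
            # a match may run past position i (self-overlap is allowed);
            # stop one char early so a next-char always remains
            while i + l < len(text) - 1 and text[j + l] == text[i + l]:
                l += 1
            if l > best_len:
                best_len, best_dist = l, i - j
        out.append((best_dist, best_len, text[i + best_len]))
        i += best_len + 1
    return out

# e.g. lz77_encode("aacaacabc")
#   -> [(0, 0, 'a'), (1, 1, 'c'), (3, 4, 'b'), (0, 0, 'c')]
```

Note that the third triple copies 4 characters from distance 3, i.e. the copy overlaps the text being produced, which is legal in LZ77.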

LZ77 Decoding
Which is faster in decompression among gzip -1…-9?
The decoder keeps the same dictionary window as the encoder:
- it finds the substring referenced by <dist, len, next-char> in the previously decoded text
- it inserts a copy of it, followed by the next character
What if len > dist? (the copy overlaps the text still being written)
E.g. seen = abcd, next codeword is <2,9,e>:
for (i = 0; i < len; i++) out[cursor + i] = out[cursor - dist + i];
Output is correct: abcdcdcdcdcdce
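A toy decoder matching the loop on the slide; copying strictly left to right is what makes the overlapping case (len > dist) come out right:

```python
def lz77_decode(triples):
    """Toy LZ77 decoder for (dist, len, next_char) triples.
    Copying left to right handles len > dist: later characters of the
    copy may come from bytes produced earlier in the same copy."""
    out = []
    for dist, length, ch in triples:
        cursor = len(out)
        for i in range(length):
            out.append(out[cursor - dist + i])
        out.append(ch)
    return "".join(out)

# The slide's overlap example: after "abcd", the codeword <2,9,e> expands to
# lz77_decode([(0,0,'a'), (0,0,'b'), (0,0,'c'), (0,0,'d'), (2,9,'e')])
#   -> "abcdcdcdcdcdce"
```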

You find this at: www.gzip.org/zlib/

Transfer + Decompression

https://quixdb.github.io/squash-benchmark/

Compression & Networking

Background
[Slide figure: a sender transmits data over the network to a receiver, which already holds some knowledge about that data.]
- network links are getting faster and faster, but
- many clients are still connected by fairly slow links (mobile?)
- people wish to send more and more data
- battery life is a key problem
How can we make this transparent to the user?

Two standard techniques
- caching: "avoid sending the same object again"
  - done on the basis of "atomic" objects
  - thus it only works if objects are unchanged
  - what about objects that are slightly changed?
- compression: "remove redundancy in transmitted data"
  - avoid repeated substrings in the transmitted data
  - can be extended to the history of past transmissions (overhead)
  - what if the sender has never seen the data at the receiver?

Types of Techniques
Common knowledge between sender & receiver:
- unstructured file: delta compression
"Partial" knowledge:
- unstructured files: file synchronization
- record-based data: set reconciliation

Formalization
- Delta compression [diff, zdelta, REBL, …]: compress a file f deploying a known file f'; compress a group of files; speed up web access by sending the differences between the requested page and the ones available in cache.
- File synchronization [rsync, zsync]: the client updates its old file f_old with f_new available on a server. Applications: mirroring, shared crawling, content distribution networks.
- Set reconciliation: the client updates its structured file f_old with the file f_new available on a server. Applications: updating contacts or appointments, intersecting inverted lists in a P2P search engine.

Z-delta compression (one-to-one)
Problem: we have two files, f_known (known to both parties) and f_new, and the goal is to compute a file f_d of minimum size such that f_new can be derived from f_known and f_d.
- assume that block moves and copies are allowed
- find an optimal covering set of f_new based on f_known
- the LZ77 scheme provides an efficient, optimal solution: take f_known as the "previously encoded text" and compress the concatenation f_known·f_new starting from f_new
- zdelta is one of the best implementations
Used e.g. in version control, backups, and transmission.
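zlib's preset-dictionary mode gives a quick way to experiment with this idea. It is not zdelta itself, just the same LZ77 trick of treating f_known as previously seen text, and zlib only consults the last 32 KB of the dictionary:

```python
import zlib

def delta_compress(f_known: bytes, f_new: bytes) -> bytes:
    """Encode f_new using f_known as a preset dictionary."""
    c = zlib.compressobj(zdict=f_known)
    return c.compress(f_new) + c.flush()

def delta_decompress(f_known: bytes, f_delta: bytes) -> bytes:
    """Recover f_new from f_known and the delta file."""
    d = zlib.decompressobj(zdict=f_known)
    return d.decompress(f_delta) + d.flush()
```

With nearly identical files the delta is typically far smaller than compressing f_new alone, which is exactly the effect zdelta exploits.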

Efficient Web Access
Dual-proxy architecture: a pair of proxies (client cache + server-side proxy) located on each side of the slow link use a proprietary protocol to increase communication performance.
[Slide figure: Client with a small cache — Proxy — slow link — Proxy — fast link — Web; requests and cache references travel towards the server, delta-encoded pages travel back.]
Use zdelta to reduce traffic:
- the old version is available at both proxies (one in the client cache, one at the proxy)
- restricted to pages already visited (~30% cache hits), or to URL-prefix matches

Cluster-based delta compression
Problem: we wish to compress a group of files F. Useful on a dynamic collection of web pages, back-ups, …
Apply pairwise zdelta: find a good reference for each f ∈ F. Reduction to the Min Branching problem on DAGs:
- build a (complete?) weighted graph G_F: nodes = files, edge weights = zdelta sizes
- insert a dummy node connected to all nodes, with weights equal to the gzip-coded sizes
- compute the directed spanning tree of minimum total cost covering G's nodes
[Slide figure: a small example graph with a dummy node and its edge weights.]
Experimental results:
          space   time
uncompr   30 MB   ---
tgz       20%     linear
THIS      8%      quadratic
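As a sanity check on the reduction, here is a brute-force sketch for a handful of files: edge weights are approximated with zlib's preset-dictionary mode instead of zdelta, the dummy node is parent -1, and the minimum branching is found by enumerating all acyclic parent assignments (the real method would run an efficient min-branching algorithm on the graph):

```python
import itertools
import zlib

def delta_size(ref: bytes, tgt: bytes) -> int:
    """Approximate the zdelta size of tgt given ref, via zlib's
    preset-dictionary mode (ref should fit in the 32 KB window)."""
    c = zlib.compressobj(zdict=ref)
    return len(c.compress(tgt) + c.flush())

def best_references(files):
    """For each file choose either a reference file (its parent) or
    plain compression (parent = -1, the dummy node), minimizing the
    total encoded size while keeping references acyclic.
    Brute force over parent assignments: viable only for a few files."""
    n = len(files)
    plain = [len(zlib.compress(f)) for f in files]
    delta = [[delta_size(files[j], files[i]) if j != i else None
              for j in range(n)] for i in range(n)]
    best_cost, best_par = None, None
    for par in itertools.product(range(-1, n), repeat=n):
        if any(par[i] == i for i in range(n)):
            continue
        ok = True
        for i in range(n):            # every chain must end at the dummy node
            seen, j = set(), i
            while j != -1 and ok:
                if j in seen:
                    ok = False
                seen.add(j)
                j = par[j]
        if not ok:
            continue
        cost = sum(plain[i] if par[i] == -1 else delta[i][par[i]]
                   for i in range(n))
        if best_cost is None or cost < best_cost:
            best_cost, best_par = cost, par
    return best_par, best_cost
```

Since the all-dummy assignment is always a candidate, the optimum can never be worse than gzip-compressing every file independently.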

Improvement (group of files)
Problem: constructing G is very costly: n² edge calculations (zdelta executions). We wish to exploit some pruning approach:
- collection analysis: cluster the files that appear similar and are thus good candidates for zdelta compression
- build a sparse weighted graph G'_F containing only the edges between pairs of files in the same cluster
- assign weights: estimate appropriate edge weights for G'_F, thus saving zdelta executions; nonetheless, strictly n² time
          space    time
uncompr   260 MB   ---
tgz       12%      2 mins
THIS      8%       16 mins

File Synchronization

File synch: the problem
[Slide figure: the client sends an update request; the server holds f_new and sends back an update for the client's f_old.]
- the client requests an update of its old file
- the server has the new file but does not know the old file
- the goal is to update f_old without sending the entire f_new
rsync: a file-synchronization tool distributed with Linux.
Delta compression is a sort of "local" synchronization, since there the server knows both files.

The rsync algorithm
[Slide figure: the client sends a few hashes of f_old to the server; the server replies with an encoded file from which the client reconstructs f_new.]

The rsync algorithm (contd.)
- simple, widely used, single roundtrip
- optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals
- the choice of the block size is problematic (default: max{700, √n} bytes)
- not good in theory: the granularity of the changes may disrupt the use of blocks
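The rolling hash is what lets the server slide a block-sized window one byte at a time in O(1) per position. A sketch of an rsync-style weak checksum (the exact constants and byte layout differ in real rsync):

```python
M = 1 << 16   # both halves of the checksum live modulo 2^16

def weak_checksum(block):
    """rsync-style weak checksum of a block: a plain byte sum and a
    position-weighted sum (the first byte weighted most)."""
    a = sum(block) % M
    b = sum((len(block) - i) * x for i, x in enumerate(block)) % M
    return a, b

def roll(a, b, out_byte, in_byte, k):
    """Slide a k-byte window one position: drop out_byte, add in_byte,
    in O(1) instead of rescanning the whole block."""
    a = (a - out_byte + in_byte) % M
    b = (b - k * out_byte + a) % M
    return a, b
```

Every window position thus gets a cheap 4-byte fingerprint; only positions whose weak checksum matches a block of f_old are checked with the stronger (MD5) hash.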

A new framework: zsync
- the server sends the hashes (unlike rsync, where the client does), and the client checks them
- the server deploys the common f_ref to compress the new f_tar (rsync just compresses f_tar)

Small differences (e.g. agenda)
- the server sends its hashes to the client level by level
- the client checks equality with its own hash-substrings
- there are log(n/k) levels; each level covers the n/k blocks of k elements each
- if there are d differences, then on each level at most d hashes fail to match and need to be expanded further
Communication complexity is O(d · log(n/k) · log n) bits [one upward path per differing k-block].
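The level-by-level comparison can be pictured as a hash tree that both sides compute over their k-byte blocks: the comparison descends only where hashes disagree, so with d differing blocks it touches at most d nodes per level. A toy sketch, assuming two equal-length files and MD5 as the hash:

```python
import hashlib

def block_hashes(data, k):
    """Hash each k-byte block of data (the leaves of the tree)."""
    return [hashlib.md5(data[i:i + k]).digest() for i in range(0, len(data), k)]

def diff_blocks(mine, theirs, k):
    """Return the indices of k-byte blocks where the two files differ,
    descending a binary hash tree only along mismatching branches."""
    hm, ht = block_hashes(mine, k), block_hashes(theirs, k)

    def subtree_hash(hs, lo, hi):
        return hashlib.md5(b"".join(hs[lo:hi])).digest()

    def walk(lo, hi, out):
        if subtree_hash(hm, lo, hi) == subtree_hash(ht, lo, hi):
            return                      # whole subtree matches: stop here
        if hi - lo == 1:
            out.append(lo)              # a single differing block found
            return
        mid = (lo + hi) // 2
        walk(lo, mid, out)
        walk(mid, hi, out)

    out = []
    walk(0, min(len(hm), len(ht)), out)
    return out
```

In the protocol the two trees live on different machines, so each level of this recursion costs one exchange of the surviving hashes, giving the O(d · log(n/k) · log n) bound above.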