IR IL Compression

γ code for integer encoding
Defined for x > 0, with Length = ⌊log2 x⌋ + 1: write Length − 1 zeros, then x in binary (Length bits).
e.g., 9 is represented as 000 1001.
The γ code for x takes 2⌊log2 x⌋ + 1 bits (i.e., a factor of 2 from optimal).
Optimal for Pr(x) = 1/(2x²), and i.i.d. integers.
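A minimal Python sketch of the encoder described on this slide. The function name `gamma_encode` and the use of a '0'/'1' string instead of packed bits are illustrative assumptions, not part of the slides.

```python
def gamma_encode(x: int) -> str:
    """Return the gamma code of x > 0 as a string of '0'/'1' characters."""
    if x <= 0:
        raise ValueError("the gamma code is defined only for x > 0")
    binary = bin(x)[2:]                      # x in binary: floor(log2 x) + 1 bits
    return "0" * (len(binary) - 1) + binary  # Length - 1 zeros, then x itself


for x in (1, 2, 9, 19):
    code = gamma_encode(x)
    print(x, code, len(code))  # the length equals 2 * floor(log2 x) + 1
```

A real index would pack the bits into machine words; the string form only keeps the layout visible.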

It is a prefix-free encoding… Given the following sequence of γ-coded integers, reconstruct the original sequence:
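Because the code is prefix-free, a concatenated stream can be decoded left to right without separators. The sketch below assumes the layout "Length − 1 zeros, then the binary value". The function name `gamma_decode_stream` is hypothetical, and the sample bit string is made up for illustration; it is not the sequence from the slide.

```python
def gamma_decode_stream(bits: str) -> list:
    """Decode a concatenation of gamma codes given as a '0'/'1' string."""
    values = []
    i = 0
    while i < len(bits):
        zeros = 0
        while i < len(bits) and bits[i] == "0":  # leading zeros = Length - 1
            zeros += 1
            i += 1
        if i + zeros + 1 > len(bits):
            raise ValueError("truncated gamma code")
        values.append(int(bits[i:i + zeros + 1], 2))  # next Length bits are the value
        i += zeros + 1
    return values


# Illustrative stream: the gamma codes 00101, 1 and 0001100 concatenated.
print(gamma_decode_stream("0010110001100"))
```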

δ code for integer encoding
Use γ-coding to reduce the length of the first field (the Length part).
Useful for medium-sized integers.
e.g., 19 (binary 10011, Length = 5): the first field becomes γ(5) = 00101.
δ-coding x takes about ⌊log2 x⌋ + 2⌊log2(⌊log2 x⌋)⌋ + 2 bits.
Optimal for Pr(x) = 1/(2x(log x)²), and i.i.d. integers.
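A sketch of one common δ layout, shown as an assumption: the textbook Elias δ code writes γ(Length) and then x in binary without its leading 1 bit, which is implied once Length is known. The slide's own layout may keep that bit, which would cost one more bit per integer. The names `delta_encode` and `delta_decode` are illustrative.

```python
def gamma_encode(x: int) -> str:
    binary = bin(x)[2:]
    return "0" * (len(binary) - 1) + binary


def delta_encode(x: int) -> str:
    """Elias-delta style code: gamma(Length), then x without its leading 1."""
    if x <= 0:
        raise ValueError("the delta code is defined only for x > 0")
    binary = bin(x)[2:]
    return gamma_encode(len(binary)) + binary[1:]


def delta_decode(bits: str) -> list:
    """Decode a concatenation of delta codes produced by delta_encode."""
    values, i = [], 0
    while i < len(bits):
        zeros = 0
        while bits[i] == "0":                 # gamma prefix of the Length field
            zeros += 1
            i += 1
        length = int(bits[i:i + zeros + 1], 2)
        i += zeros + 1
        values.append(int("1" + bits[i:i + length - 1], 2))  # restore implicit 1
        i += length - 1
    return values


stream = "".join(delta_encode(x) for x in (19, 1, 300))
print(delta_decode(stream))
```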

Variable-byte codes [10.2 bits per TREC12]
Wish to get very fast (de)compression ⇒ byte-align.
Given the binary representation of an integer:
Prepend 0s to get a multiple-of-7 number of bits.
Form groups of 7 bits each.
Tag each group with one extra bit: 0 for the last group, 1 for the other groups.
e.g., v=  binary(v) =
Note: we waste 1 bit per byte, and on average 4 bits in the first byte. But it is a prefix code, and it also encodes the value 0!
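The grouping and tagging steps above can be sketched as follows. Placing the tag in the most significant bit of each byte is an assumption (the slide only says a tag bit is added), and the names `vbyte_encode` and `vbyte_decode` are illustrative.

```python
def vbyte_encode(v: int) -> bytes:
    """Encode v >= 0 in 7-bit groups, most significant group first."""
    if v < 0:
        raise ValueError("only non-negative integers are supported")
    groups = []
    while True:
        groups.append(v & 0x7F)  # lowest 7 bits
        v >>= 7
        if v == 0:
            break
    groups.reverse()
    out = bytearray(g | 0x80 for g in groups[:-1])  # tag 1: more bytes follow
    out.append(groups[-1])                          # tag 0: last byte of this value
    return bytes(out)


def vbyte_decode(data: bytes) -> list:
    """Decode a concatenation of variable-byte codes."""
    values, current = [], 0
    for byte in data:
        current = (current << 7) | (byte & 0x7F)
        if (byte & 0x80) == 0:   # tag 0 closes the current value
            values.append(current)
            current = 0
    return values


encoded = b"".join(vbyte_encode(v) for v in (0, 127, 128, 2**14 + 1))
print(encoded.hex(), vbyte_decode(encoded))
```

Decoding needs only shifts and masks on whole bytes, which is where the speed comes from compared with bit-aligned codes.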

Rice code (simplification of Golomb code)
It is a parametric code: it depends on k.
Quotient q = ⌊(v − 1)/k⌋, and the rest is r = v − k·q − 1.
Codeword layout: q in unary (q 0s followed by a 1), then r in log2 k bits.
Useful when integers are concentrated around k.
How do we choose k? Usually k ≈ 0.69 · mean(v) [Bernoulli model].
Optimal for Pr(v) = p(1 − p)^(v−1), where mean(v) = 1/p, and i.i.d. integers.
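A sketch of the codeword layout above, assuming k is a power of two so that the rest fits in exactly log2 k bits (the usual Rice restriction). The function names are illustrative, and the choice k = 4 in the example is arbitrary.

```python
def rice_encode(v: int, k: int) -> str:
    """Rice code of v > 0: q zeros, a 1, then r in log2(k) bits."""
    if v <= 0:
        raise ValueError("v must be positive")
    if k <= 0 or k & (k - 1):
        raise ValueError("k must be a power of two")
    q = (v - 1) // k
    r = v - k * q - 1
    nbits = k.bit_length() - 1                 # log2 k
    rest = format(r, "b").zfill(nbits) if nbits else ""
    return "0" * q + "1" + rest


def rice_decode(bits: str, k: int) -> list:
    """Decode a concatenation of Rice codes that share the same k."""
    nbits = k.bit_length() - 1
    values, i = [], 0
    while i < len(bits):
        q = 0
        while bits[i] == "0":                  # unary quotient
            q += 1
            i += 1
        i += 1                                 # skip the terminating 1
        r = int(bits[i:i + nbits], 2) if nbits else 0
        i += nbits
        values.append(q * k + r + 1)
    return values


stream = "".join(rice_encode(v, 4) for v in (1, 4, 5, 7, 9, 20))
print(rice_decode(stream, 4))
```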

Interpolative coding
= M =
Recursive coding ⇒ preorder traversal of a balanced binary tree.
At every step we know (initially, they are encoded): num = |M| = 12, Lidx = 1, low = 1, Ridx = 12, hi = 21.
Take the middle element: h = ⌊(Lidx + Ridx)/2⌋ = 6 ⇒ M[6] = 9, num_left = h − Lidx = 5, num_right = Ridx − h = 6.
Bounds: low + num_left = 1 + 5 = 6 ≤ M[h] ≤ hi − num_right = 21 − 6 = 15.
We can encode 9 − 6 = 3 in ⌈log2(15 − 6 + 1)⌉ = 4 bits.
Recurse on the left part with low = 1, hi = 9 − 1 = 8, num = 5, and on the right part with low = 9 + 1 = 10, hi = 21, num = 6.
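A sketch of the recursion, using fixed-width binary for each offset. The sample list is hypothetical: it is chosen only to match the numbers quoted on the slide (12 elements, sixth element 9, range 1..21). Indices are 0-based here, so the slide's M[6] is `M[5]`. The function names are illustrative.

```python
def bits_needed(span: int) -> int:
    """Bits required to write a value in [0, span - 1], i.e. ceil(log2 span)."""
    return (span - 1).bit_length()


def interp_encode(M: list, low: int, hi: int) -> str:
    """Encode a strictly increasing list M with all values in [low, hi]."""
    out = []

    def rec(left, right, low, hi):
        if left > right:
            return
        h = (left + right) // 2
        lo_bound = low + (h - left)     # h - left smaller values must fit below
        hi_bound = hi - (right - h)     # right - h larger values must fit above
        width = bits_needed(hi_bound - lo_bound + 1)
        if width:                       # width 0: the value is forced, emit nothing
            out.append(format(M[h] - lo_bound, "0{}b".format(width)))
        rec(left, h - 1, low, M[h] - 1)
        rec(h + 1, right, M[h] + 1, hi)

    rec(0, len(M) - 1, low, hi)
    return "".join(out)


def interp_decode(bits: str, n: int, low: int, hi: int) -> list:
    """Inverse of interp_encode; n, low and hi are assumed known to the decoder."""
    M = [0] * n
    pos = 0

    def rec(left, right, low, hi):
        nonlocal pos
        if left > right:
            return
        h = (left + right) // 2
        lo_bound = low + (h - left)
        hi_bound = hi - (right - h)
        width = bits_needed(hi_bound - lo_bound + 1)
        offset = int(bits[pos:pos + width], 2) if width else 0
        pos += width
        M[h] = lo_bound + offset
        rec(left, h - 1, low, M[h] - 1)
        rec(h + 1, right, M[h] + 1, hi)

    rec(0, n - 1, low, hi)
    return M


sample = [1, 2, 3, 5, 7, 9, 11, 15, 18, 19, 20, 21]  # hypothetical list
code = interp_encode(sample, 1, 21)
print(code[:4], len(code), interp_decode(code, len(sample), 1, 21) == sample)
```

The first four bits of the output encode 9 − 6 = 3, matching the step worked out on the slide.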

PForDelta coding
[Figure: a block of 128 numbers]
Use b (e.g., 2) bits to encode each of the 128 numbers, or create exceptions.
Encode exceptions: ESC or pointers.
Choose b to encode 90% of the values, or trade off: a larger b wastes more bits, a smaller b creates more exceptions.
Translate data: [base, base + 2^b − 1] → [0, 2^b − 1].
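A simplified sketch of the idea, under stated assumptions: exceptions are kept in a plain position-to-value dictionary instead of the ESC codes or chained pointers mentioned on the slide, slots are packed into a '0'/'1' string, and b is the smallest width covering at least 90% of the block. The names `pfor_encode`, `pfor_decode` and `coverage` are hypothetical.

```python
def pfor_encode(block: list, coverage: float = 0.9):
    """Return (base, b, n, packed_bits, exceptions) for one block."""
    base = min(block)
    shifted = [v - base for v in block]        # translate to [0, ...]
    b = 0
    while sum(1 for s in shifted if s < (1 << b)) < coverage * len(block):
        b += 1                                 # smallest b covering enough values
    exceptions = {i: v for i, v in enumerate(block) if v - base >= (1 << b)}
    slots = [0 if i in exceptions else s for i, s in enumerate(shifted)]
    packed = "".join(format(s, "0{}b".format(b)) for s in slots) if b else ""
    return base, b, len(block), packed, exceptions


def pfor_decode(base: int, b: int, n: int, packed: str, exceptions: dict) -> list:
    out = []
    for i in range(n):
        if i in exceptions:
            out.append(exceptions[i])          # value stored outside the b-bit slots
        else:
            slot = int(packed[i * b:(i + 1) * b], 2) if b else 0
            out.append(base + slot)
    return out


# Illustrative block of 128 numbers: small values plus two outliers.
block = [1 + (i % 4) for i in range(128)]
block[7], block[10] = 23, 42
base, b, n, packed, exceptions = pfor_encode(block)
print(base, b, len(packed), exceptions)
```

With this block, 126 of the 128 values fit in 2 bits, so only the two outliers become exceptions.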