Storage 1 Some of these slides are based on Stanford IR Course slides at

Basic assumptions of Information Retrieval Collection: A set of documents –Assume it is a static collection for the moment Goal: Retrieve documents with information that is relevant to the user’s information need and helps the user complete a task 2 Sec. 1.1

The classic search model: a user task ("Get rid of mice in a politically correct way") gives rise to an info need ("Info about removing mice without killing them"), which is formulated as a query ("how trap mice alive") sent to a search engine over the collection, producing results, possibly followed by query refinement. The gap between task and info need is a possible misconception; the gap between info need and query is a possible misformulation. 3

Boolean Retrieval Boolean retrieval is a simplified version of the actual search problem. Simplifying Assumption 1: The user accurately translates his task into a query (=Boolean combination of keywords) –(trap or remove) AND mice AND NOT kill Simplifying Assumption 2: A document is relevant to the user’s task if and only if it satisfies the Boolean combination of keywords 4

Boolean Retrieval Limitations Precise matching of the query to documents; in real life this might –Miss task-relevant documents –Return non-task-relevant documents No ranking of the quality of results HOWEVER: A good start for understanding and modeling information retrieval –We will start by assuming the Boolean model! 5

Problem at Hand Given: –Huge collection of documents –Boolean keyword query Return: –Documents satisfying the query Dimension tradeoffs: –Speed –Memory size –Types of queries to be supported 6

Ideas? 7

Option 1: Store “As Is” Pages are stored "as is" as files in the file system Can find words in files using a grep style tool –Grep is a command-line utility for searching plain-text data sets for lines matching a regular expression. –Uses Boyer-Moore algorithm for substring search To process the data, it must be transferred from disk to main memory, and then searched for substrings –For large data, disk transfer is already a bottleneck! 8

Typical System Parameters (2007) 9
Average Seek Time: 5 ms = 5×10^-3 s
Transfer Time per byte: 0.02 μs = 2×10^-8 s
Low-level Processor Operation: 0.01 μs = 10^-8 s
Size of Main Memory: several GBs
Size of Disk Space: 1 TB
Bottom Line: Seek and transfer are expensive operations! Try to avoid them as much as possible.

What do you think Suppose we have 10MB of text stored contiguously. –How long will it take to read the data? Suppose we have 1GB of text stored in 100 contiguous chunks. –How long will it take to read the data? Are queries processed quickly? Is this space efficient? 10
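
As a sanity check, here is a back-of-the-envelope sketch in Python using the 2007 parameters from the previous slide (the variable names and the two scenarios are just illustrations of the arithmetic):

SEEK = 5e-3        # average seek time in seconds
TRANSFER = 2e-8    # transfer time per byte in seconds

# 10 MB stored contiguously: one seek, then one sequential read.
print(SEEK + 10e6 * TRANSFER)       # ~0.205 s

# 1 GB stored in 100 contiguous chunks: 100 seeks plus the sequential read.
print(100 * SEEK + 1e9 * TRANSFER)  # ~20.5 s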

Option 2: Relational Database How would we find documents containing rain? Rain and Spain? Rain and not Spain? Is this better or worse than using the file system with grep? 11
Model A:
DocID  Doc
1      Rain, rain, go away...
2      The rain in Spain falls mainly in the plain

DB: Other Ways to Model the Data 12
Model B: APPEARS(DocID, Word, ...)
Model C: WORD_INDEX(Word, Wid, ...) and APPEARS(DocId, Wid, ...)
Two options. Which is better?

Relational Database Example 13 DocID 1: "Rain, rain go away." DocID 2: "The rain in Spain falls mainly on the plain."

Relational Database Example 14 (The WORD_INDEX and APPEARS tables populated for the two documents above.) Note the case-folding. More about this later.

Query Processing How are queries processed? Example query: rain –SELECT DocId –FROM WORD_INDEX W, APPEARS A –WHERE W.Wid=A.Wid and W.Word='rain' How can we answer the queries: –rain and go ? –rain and not Spain ? 15 Is Model C better than Model A?

Space Efficiency? Does it save more space than saving as files? –Depends on word frequency! Why? If a word appears in a thousand documents, then its wid will be repeated 1000 times. Why waste the space? If a word appears in a thousand documents, we will have to access a thousand rows in order to find the documents 16

Query Efficiency? Does not easily support queries that require multiple words Note: Some databases have special support for textual queries. Special purpose indices 17

Option 3: Bitmaps 18 There is a vector of 1s and 0s for each word. Queries are computed using bitwise operations on the vectors – efficiently implemented in the hardware.

Option 3: Bitmaps 19 How would you compute: Q1 = rain Q2 = rain and Spain Q3 = rain or not Spain
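
A minimal sketch of how such bitwise query evaluation could look, using Python integers as bit vectors (the document ids, vectors, and names below are made up for illustration):

# One bit per document; bit i is 1 if the word appears in document i.
rain  = 0b110    # hypothetical: appears in docs 1 and 2
spain = 0b010    # hypothetical: appears in doc 2 only
ALL   = 0b111    # mask of all documents, needed for negation

q1 = rain                   # rain
q2 = rain & spain           # rain AND Spain
q3 = rain | (ALL & ~spain)  # rain OR NOT Spain
print(bin(q1), bin(q2), bin(q3))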

Bitmaps Tradeoffs Bitmaps can be efficiently processed. However, they have high memory requirements. Example: –1M of documents, each with 1K of terms –500K distinct terms in total –What is the size of the matrix? –How many 1s will it have? Summary: A lot of wasted space for the 0s 20

The Index Repository A Good Solution 21

Two Structures Dictionary: –list of all terms in the documents –For each term in the document, store a pointer to its list in the inverted file Inverted Index: –For each term in the dictionary, an inverted list that stores pointers to all occurrences of the term in the documents. –Usually, pointers = document numbers –Usually, pointers are sorted –Sometimes also store term locations within documents (Why?) 22

Example Doc 1: A B C Doc 2: E B D Doc 3: A B D F How do you find documents with A and D? 23
Dictionary (Lexicon) → Posting Lists (Inverted Index):
A → 1, 3
B → 1, 2, 3
C → 1
D → 2, 3
E → 2
F → 3
The Devil is in the Details!
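
One way to answer "A and D" is the standard merge of two sorted posting lists; a sketch (the function name is ours, the lists are the ones from the toy example):

def intersect(p1, p2):
    # Both inputs are sorted lists of document ids; walk them in lockstep.
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i]); i += 1; j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer

print(intersect([1, 3], [2, 3]))   # A and D -> [3]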

Goal Store dictionary in main memory Store inverted index on disk Use compression techniques to save space –Saves a little money on storage –Keep more stuff in memory, to increase speed –Increase speed of data transfer from disk to memory [read compressed data | decompress] can be faster than [read uncompressed data] 24

Coming Up… Document Statistics: –How big will the dictionary be? –How big will the inverted index be? Storing the dictionary –Space saving techniques Storing the inverted index –Compression techniques 25

Document Statistics Empirical Laws 26

Some Terminology Collection: set of documents Token: “word” appearing in at least one document, sometimes also called a term Vocabulary size: Number of different tokens appearing in the collection Collection size: Number of tokens appearing in the collection 27

Vocabulary vs. collection size How big is the term vocabulary? –That is, how many distinct words are there? Can we assume an upper bound? In practice, the vocabulary will keep growing with the collection size 28

Vocabulary vs. collection size Heaps Law estimates the size of the vocabulary as a function of the size of the collection: M = kT^b where: –M is the size of the vocabulary –T is the number of tokens in the collection –Typically 30 ≤ k ≤ 100 and b ≈ 0.5 In a log-log plot of vocabulary size M vs. T, Heaps’ law predicts a line with slope about ½ (“empirical law”) 29

Heaps’ Law For RCV1, the dashed line log10 M = 0.49 log10 T + 1.64 is the best least squares fit. Thus, M = 10^1.64 T^0.49, so k = 10^1.64 ≈ 44 and b = 0.49. Good empirical fit for Reuters RCV1! For the first 1,000,020 tokens, the law predicts 38,323 terms; actually, 38,365 terms 30
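
The fit is easy to check; a quick sketch with k = 44 and b = 0.49:

k, b = 44, 0.49
T = 1_000_020                # tokens seen so far
print(round(k * T ** b))     # ≈ 38,323 predicted terms (38,365 were observed)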

Collection Size In natural language, there are a few very frequent terms and many very rare terms. Zipf’s law states that the i-th most frequent term has frequency proportional to 1/i: cf_i = K/i where –cf_i is the number of occurrences of the i-th most frequent token –K is a normalizing constant 31

Zipf consequences If the most frequent term (the) occurs cf_1 times –then the second most frequent term (of) occurs cf_1/2 times –the third most frequent term (and) occurs cf_1/3 times … Equivalent: cf_i = K/i where K is a normalizing factor, so –log cf_i = log K - log i –Linear relationship between log cf_i and log i 32

Zipf’s law for Reuters RCV1 33

The Dictionary Data Structures 34

Dictionary: Reminder Doc 1: A B C Doc 2: E B D Doc 3: A B D F Want to store: –Terms –Their frequencies –Pointer from each term to inverted index 35 (Figure: the dictionary A–F with its posting lists, as in the earlier example.)

The Dictionary Assumptions: we are interested in simple queries: –No phrases –No wildcards Goals: –Efficient (i.e., log) access –Small size (fit in main memory) Want to store: –Word –Address of inverted index entry –Length of inverted index entry = word frequency (why?) 36

Why compress the dictionary? Search begins with the dictionary We want to keep it in memory Memory footprint competition with other applications Embedded/mobile devices may have very little memory Even if the dictionary isn’t in memory, we want it to be small for a fast search startup time So, compressing the dictionary is important 37

Some Assumptions To assess different storage solutions, in this part we will assume: –There are 400,000 different terms –Average word length is 8 letters (Why? Average token length is 4.5 letters!) –Each letter requires a byte of storage –Term frequency can be stored in 4 bytes –Pointers to the inverted index require 4 bytes We will see a series of different storage options 38

Dictionary storage - first cut Array of fixed-width entries, assuming maximum word length of 20 Search Complexity? Size: –400,000 terms –20 letters per word –4 bytes for frequency –4 bytes for posting list pointer –Total: 11.2 MB. 39

Fixed-width terms are wasteful Most of the bytes in the Term column are wasted – we allot 20 bytes for 1 letter terms. –Avg. dictionary word length is 8 characters –On average, we wasted 12 characters per word! And we still can’t handle words longer than 20 letters, like: supercalifragilisticexpialidocious 40

Compressing the term list: Dictionary-as-a-String Store dictionary as a (long) string of characters: –Pointer to next word shows end of current word ….systilesyzygeticsyzygialsyzygyszaibelyiteszczecinszomo…. 41

Compressing the term list: Dictionary-as-a-String How do we know where terms end? How do we search the dictionary? –Complexity? ….systilesyzygeticsyzygialsyzygyszaibelyiteszczecinszomo…. 42

Compressing the term list: Dictionary-as-a-String String length: –400,000 terms * 8 bytes on avg. per term = 3.2MB Array size: –400,000 terms * (4 bytes for frequency + 4 bytes for posting list pointer + 3 bytes for pointer into string) = 4.4MB Total: 7.6MB ….systilesyzygeticsyzygialsyzygyszaibelyiteszczecinszomo…. 43 Think About it: Why 3 bytes per pointer into String?
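
To make the lookup concrete, here is a sketch of binary search over such a string dictionary; the terms come from the slide, but the entry layout, frequencies, and pointers are made up for illustration:

# The dictionary is one long string plus an array of entries sorted by term.
# Each entry: (offset of the term in the string, frequency, posting-list pointer).
string = "systilesyzygeticsyzygialsyzygyszaibelyite"
entries = [(0, 71, 0), (7, 12, 1), (16, 8, 2), (24, 5, 3), (30, 3, 4)]

def term_at(i):
    # A term ends where the next term begins (the last ends at the end of the string).
    start = entries[i][0]
    end = entries[i + 1][0] if i + 1 < len(entries) else len(string)
    return string[start:end]

def lookup(term):
    lo, hi = 0, len(entries) - 1      # O(log n) comparisons
    while lo <= hi:
        mid = (lo + hi) // 2
        t = term_at(mid)
        if t == term:
            return entries[mid]
        if t < term:
            lo = mid + 1
        else:
            hi = mid - 1
    return None

print(lookup("syzygy"))   # (24, 5, 3)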

Blocking Blocking is a method to save on storage of pointers into the string of words –Instead of storing a pointer for each term, we store a pointer to every k-th term –In order to know where words end, we also store term lengths …. systilesyzygeticsyzygialsyzygyszaibelyiteszczecinszomo…. 44 (Table columns: Term Ptr | Length | Posting Ptr | Freq)

Blocking Why are there term pointers missing below? Why is there a length value missing below? How is search performed? Complexity? …. systilesyzygeticsyzygialsyzygyszaibelyiteszczecinszomo…. 45 (Table columns: Term Ptr | Length | Posting Ptr | Freq)

Blocking How many bytes should we use to store the length? How much space does this index require, as a function of k? How much space when k = 4? …. systilesyzygeticsyzygialsyzygyszaibelyiteszczecinszomo…. 46 (Table columns: Term Ptr | Length | Posting Ptr | Freq)

Front Coding We now consider an alternative method of saving on space. Adjacent words tend to have common prefixes –Why? Size of the string can be reduced if we take advantage of common prefixes With front coding we –Remove common prefixes –Store the common prefix size –Store pointer into the concatenated string 47

Front Coding Example 48 Terms: jezebel, jezer, jezerit, jeziah, jeziel Concatenated string: …ebelritiahel… (Table columns: Term Ptr | Prefix size | Posting Ptr | Freq)

Front Coding Example What is the search time? What is the size of the index, assuming that the common prefix is of size 3, on average? 49 Concatenated string: …ebelritiahel… (Table columns: Term Ptr | Prefix size | Posting Ptr | Freq)

(k-1)-in-k Front Coding Front coding saves space, but binary search of the index is no longer possible To allow for binary search, “(k-1)-in-k” front coding can be used In this method, in every block of k words, the first is completely given, and all others are front-coded Binary search can be based on the complete words to find the correct block Combines ideas of blocking and front coding 50

3-in-4 Front Coding Example 51 Terms: jezebel, jezer, jezerit, jeziah, jeziel Concatenated string: …jezebelritiahjeziel… (Table columns: Term Ptr | Prefix size | Length | Posting Ptr | Freq) What is the search time? Why are there missing prefix values? What is the size of the index, assuming that the common prefix is of size 3, on average?
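
A sketch of a (k-1)-in-k front coder (the function and entry layout are ours); run on the five terms above it reproduces the concatenated string shown on the slide:

def front_code(terms, k=4):
    # Every k-th term is stored in full; the others store (prefix size, suffix).
    string, entries, prev = "", [], None
    for i, t in enumerate(terms):
        if i % k == 0:
            entries.append((len(string), None, len(t)))    # full word: (offset, -, length)
            string += t
        else:
            p = 0
            while p < min(len(t), len(prev)) and t[p] == prev[p]:
                p += 1
            entries.append((len(string), p, len(t) - p))   # (offset, prefix size, suffix length)
            string += t[p:]
        prev = t
    return string, entries

s, e = front_code(["jezebel", "jezer", "jezerit", "jeziah", "jeziel"])
print(s)   # jezebelritiahjeziel
print(e)   # [(0, None, 7), (7, 4, 1), (8, 5, 2), (10, 3, 3), (13, None, 6)]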

52 Inverted Index

Inverted Index: Reminder Doc 1: A B C Doc 2: E B D Doc 3: A B D F Want to store: –Document ids 53 (Figure: the dictionary A–F with its posting lists, as before.)

The Inverted Index The inverted index is a set of posting lists, one for each term in the lexicon Example: Doc 1: A C F Doc 2: B E D B Doc 3: A B D F 54 If we only want to store docIDs, B’s posting list will be: 2, 3 If we also want to store positions within docIDs, B’s posting list will be: (2; 1, 4), (3; 2) Positions increase the size of the posting list!

The Inverted Index The inverted index is a set of posting lists, one for each term in the lexicon From now on, we will assume that posting lists are simply lists of document ids Document ids in a posting list are sorted –A posting list is simply an increasing list of integers The inverted index is very large –We discuss methods to compress the inverted index 55

Postings compression The postings file is much larger than the dictionary, factor of at least 10. Key desideratum: store each posting compactly. A posting for our purposes is a docID. For Reuters (800,000 documents), we would use 32 bits per docID when using 4-byte integers. Alternatively, we can use log2 800,000 ≈ 20 bits per docID. Our goal: use a lot less than 20 bits per docID.

Postings: two conflicting forces A term like arachnocentric occurs in maybe one doc out of a million – we would like to store this posting using log2 1M ≈ 20 bits. A term like the occurs in virtually every doc, so 20 bits/posting is too expensive. –Prefer 0/1 bitmap vector in this case

Postings file entry We store the list of docs containing a term in increasing order of docID. –computer: 33,47,154,159,202 … Consequence: it suffices to store gaps. –33,14,107,5,43 … Hope: most gaps can be encoded/stored with far fewer than 20 bits. –What happens if we use fixed length encoding?
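
A sketch of gap encoding and decoding of a posting list, using the list from the slide (function names are ours):

def to_gaps(docids):
    # The first posting stays as-is; each later posting stores its distance to the previous one.
    return [docids[0]] + [b - a for a, b in zip(docids, docids[1:])]

def from_gaps(gaps):
    out = []
    for g in gaps:
        out.append(g if not out else out[-1] + g)
    return out

print(to_gaps([33, 47, 154, 159, 202]))   # [33, 14, 107, 5, 43]
print(from_gaps([33, 14, 107, 5, 43]))    # [33, 47, 154, 159, 202]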

Three postings entries (table of example gap-encoded posting lists not reproduced)

Variable length encoding Aim: –For arachnocentric, we will use ~20 bits/gap entry. –For the, we will use ~1 bit/gap entry. If the average gap for a term is G, we want to use ~log2 G bits/gap entry. Key challenge: encode every integer (gap) with about as few bits as needed for that integer. This requires a variable length encoding Variable length codes achieve this by using short codes for small numbers

Types of Compression Methods By unit of encoding length: –Variable Byte –Variable bit By prior information used for encoding/decoding: –Non-parameterized –Parameterized 61

Types of Compression Methods We will start by discussing non-parameterized methods –Variable byte –Variable bit Afterwards we discuss two parameterized methods that are both variable bit 62

Variable Byte Compression Document ids (=numbers) are stored using a varying number of bytes Numbers are byte-aligned Many compression methods have been developed. We discuss: –Varint –Length-Precoded Varint –Group Varint 63

Varint codes For a gap value G, we want to use close to the fewest bytes needed to hold log2 G bits Begin with one byte to store G and dedicate 1 bit in it to be a continuation bit c If G ≤ 127, binary-encode it in the 7 available bits and set c = 1 Else encode G’s higher-order 7 bits first and then use additional bytes to encode the remaining lower-order bits, 7 at a time, using the same algorithm At the end set the continuation bit of the last byte to 1 (c = 1) – and for the other bytes c = 0.

Example
docIDs:      824                 829        215406
gaps:                            5          214577
varint code: 00000110 10111000   10000101   00001101 00001100 10110001
Postings are stored as the concatenation of these bytes. Key property: varint-encoded postings are uniquely prefix-decodable. For a small gap (5), VB uses a whole byte.
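
A sketch of the variable byte scheme just described (higher-order 7-bit groups first, continuation bit set on the last byte); the function names are ours:

def vb_encode_number(n):
    # Split n into 7-bit groups, most significant group first.
    out = []
    while True:
        out.insert(0, n % 128)
        if n < 128:
            break
        n //= 128
    out[-1] += 128                 # set the continuation bit on the last byte
    return out

def vb_decode(stream):
    numbers, n = [], 0
    for byte in stream:
        if byte < 128:             # continuation bit not set: more bytes follow
            n = n * 128 + byte
        else:                      # last byte of this number
            numbers.append(n * 128 + byte - 128)
            n = 0
    return numbers

print([format(b, "08b") for b in vb_encode_number(824)])           # ['00000110', '10111000']
print(vb_decode(vb_encode_number(5) + vb_encode_number(214577)))   # [5, 214577]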

Length-Precoded Varint Currently, we must check the first bit of each byte before deciding how to proceed. Length-Precoded Varint aims at lowering the number of “branch-and-checks” Store each number in 1-4 bytes. Use the first 2 bits of the first byte to indicate the number of bytes used 66

Example Varint encoding: –7 bits per byte with continuation bit Length-Precoded Varint encoding: –Encode byte length as low 2 bits

Length-Precoded Varint: Pros and Cons Pros –Less branching –Fewer bit shifts Cons –Still requires branching/bit shifts –What is the largest number that can be represented? 68

Group Varint Encoding Introduced by Jeff Dean (Google) Idea: encode groups of 4 values in 5-17 bytes –Pull out the four 2-bit binary lengths into a single byte prefix –Decoding uses a 256-entry table to determine the masks of the four values that follow 69

Example Length-Precoded Varint encoding: –Encode byte length as low 2 bits Group Varint encoding:
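
A sketch of the group varint idea (this is one plausible layout with big-endian value bytes; the exact byte order used in production systems may differ, and the example values are ours):

def group_varint_encode(values):
    # Exactly 4 non-negative values, each fitting in at most 4 bytes.
    assert len(values) == 4
    tag, body = 0, bytearray()
    for v in values:
        nbytes = max(1, (v.bit_length() + 7) // 8)   # 1..4 bytes
        tag = (tag << 2) | (nbytes - 1)              # 2 bits per value in the tag byte
        body += v.to_bytes(nbytes, "big")
    return bytes([tag]) + bytes(body)

def group_varint_decode(data):
    tag, pos, values = data[0], 1, []
    for shift in (6, 4, 2, 0):                       # same order as the encoder
        nbytes = ((tag >> shift) & 0b11) + 1
        values.append(int.from_bytes(data[pos:pos + nbytes], "big"))
        pos += nbytes
    return values

enc = group_varint_encode([1, 15, 511, 131071])
print(len(enc), group_varint_decode(enc))            # 8 [1, 15, 511, 131071]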

Other Variable Unit codes Instead of bytes, we can also use a different “unit of alignment”: 32 bits (words), 16 bits, 4 bits (nibbles). –When would smaller units of alignment be superior? When would larger units of alignment be superior? Variable byte codes: –Used by many commercial/research systems –Good low-tech blend of variable-length coding and sensitivity to computer memory alignment matches (vs. bit-level codes, which we look at next).

Variable bit Codes In variable bit codes, each code word can use a different number of bits to encode Examples: –Unary codes –Gamma codes –Delta codes Other well-known examples: –Golomb codes, Rice codes 72

Unary code Represent n as n-1 1s with a final 0. Unary code for 3 is 110. Unary code for 40 is thirty-nine 1s followed by a 0. Unary code for 80 is seventy-nine 1s followed by a 0. This doesn’t look promising, but…. 73

Gamma codes We can compress better with bit-level codes –The Gamma code is the best known of these. Represent a gap G as a pair of length and offset offset is G in binary, with the leading bit cut off –For example 13 → 1101 → 101 length is the length of the binary code –For 13 (1101), this is 4. We encode length with unary code: 1110. Gamma code of 13 is the concatenation of length and offset: 1110101

Gamma code examples
number   length        offset       γ-code
0        none (why is this ok for us?)
1        0                          0
2        10            0            10,0
3        10            1            10,1
4        110           00           110,00
9        1110          001          1110,001
13       1110          101          1110,101
24       11110         1000         11110,1000
511      111111110     11111111     111111110,11111111
1025     11111111110   0000000001   11111111110,0000000001

Gamma code properties G is encoded using 2⌊log2 G⌋ + 1 bits All gamma codes have an odd number of bits Almost within a factor of 2 of the best possible, log2 G Gamma code is uniquely prefix-decodable Gamma code is parameter-free

Delta codes Similar to gamma codes, except that the length is encoded in gamma code Example: Compute the delta code of 9; then decode your result Gamma codes = more compact for smaller numbers Delta codes = more compact for larger numbers 77
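
A sketch of the unary, gamma, and delta encoders under the slide's conventions (n-1 ones in the unary code); the function names are ours:

def unary(n):
    return "1" * (n - 1) + "0"       # n-1 ones followed by a zero

def gamma(g):
    b = bin(g)[2:]                   # binary representation of g
    return unary(len(b)) + b[1:]     # unary(length) followed by the offset

def delta(g):
    b = bin(g)[2:]
    return gamma(len(b)) + b[1:]     # like gamma, but the length is itself gamma-coded

print(gamma(13))   # 1110101  (length 4 -> 1110, offset 101)
print(delta(9))    # 11000001 (length 4 -> gamma(4) = 11000, offset 001)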

Disadvantages of Variable Bit Codes Machines have word boundaries – 8, 16, 32, 64 bits –Operations that cross word boundaries are slower Compressing and manipulating at the granularity of bits can be slow Variable byte encoding is aligned and thus potentially more efficient Regardless of efficiency, variable byte is conceptually simpler at little additional space cost

Think About It Question: Can we do binary search on a Gamma or Delta Coded sequence of increasing numbers? Question: Can we do binary search on a Varint Coded sequence of increasing numbers? 79

Parameterized Methods A parameterized encoding gets the probability distribution of the input symbols, and creates encodings accordingly We will discuss 2 important parameterized methods for compression: –Canonical Huffman codes –Arithmetic encoding These methods can also be used for compressing the dictionary! 80

Huffman Codes: Review Surprising history of Huffman codes Huffman codes are optimal prefix codes for symbol-by-symbol encoding, i.e., codes in which no codeword is a prefix of another Input: a set of symbols, along with a probability for each symbol 81
A: 0.1, B: 0.2, C: 0.05, D: 0.05, E: 0.3, F: 0.2, G: 0.1

Creating a Huffman Code: Greedy Algorithm Create a node for each symbol and assign it the probability of the symbol While there is more than 1 node without a parent –choose the 2 nodes with the lowest probabilities and create a new node with both nodes as children –Assign the new node the sum of the probabilities of its children The tree derived gives the code for each symbol (leftwards is 0, and rightwards is 1) 82
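
A compact sketch of the greedy construction that returns only the codeword lengths, which is all the canonical code below will need (names are ours; when probabilities tie, other equally optimal trees exist):

import heapq
from itertools import count

def huffman_lengths(probs):
    # probs: dict symbol -> probability. Returns dict symbol -> codeword length.
    tiebreak = count()               # keeps the heap from ever comparing dicts
    heap = [(p, next(tiebreak), {s: 0}) for s, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, d1 = heapq.heappop(heap)   # the two lowest-probability nodes
        p2, _, d2 = heapq.heappop(heap)
        merged = {s: l + 1 for s, l in {**d1, **d2}.items()}   # every leaf moves one level deeper
        heapq.heappush(heap, (p1 + p2, next(tiebreak), merged))
    return heap[0][2]

probs = {"A": 0.1, "B": 0.2, "C": 0.05, "D": 0.05, "E": 0.3, "F": 0.2, "G": 0.1}
print(huffman_lengths(probs))        # one optimal assignment of codeword lengths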

Problems with Huffman Codes The tree must be stored for decoding. –Can significantly increase memory requirements If the tree does not fit into main memory, then traversal (for decoding) is very expensive Solution: Canonical Huffman Codes 83

Canonical Huffman Codes Intuition: A canonical Huffman code can be efficiently described by just giving: –The list of symbols –Length of codeword of each symbol This information is sufficient for decoding 84

Properties of Canonical Huffman Codes 1. Codewords of a given length are consecutive binary numbers 2. Given two symbols s, s’ with codewords of the same length, then cw(s) < cw(s’) if and only if s < s’ 3. The first, shortest codeword is a string of 0s 4. The last, longest codeword is a string of 1s 85

Properties of Canonical Huffman Codes (cont) 5. Suppose that –d is the last codeword of length i –the next length of codeword appearing in the code is j –the first codeword of length j is c Then c = 2^(j-i) (d+1) 86

Try it Suppose that we have the following lengths per symbol, what is the canonical Huffman code? 87
A: 3, B: 2, C: 4, D: 4, E: 2, F: 3, G: 3
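
A sketch that assigns canonical codewords from the lengths, following properties 1-5 above (names are ours); run on the table above it produces one answer to the exercise:

def canonical_code(lengths):
    # lengths: dict symbol -> codeword length.
    syms = sorted(lengths, key=lambda s: (lengths[s], s))   # by length, then lexicographically
    codes, code, prev_len = {}, 0, None
    for s in syms:
        l = lengths[s]
        if prev_len is None:
            code = 0                                # the first, shortest codeword is all zeros
        else:
            code = (code + 1) << (l - prev_len)     # property 5: c = 2^(j-i) * (d + 1)
        codes[s] = format(code, f"0{l}b")
        prev_len = l
    return codes

print(canonical_code({"A": 3, "B": 2, "C": 4, "D": 4, "E": 2, "F": 3, "G": 3}))
# {'B': '00', 'E': '01', 'A': '100', 'F': '101', 'G': '110', 'C': '1110', 'D': '1111'}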

Decoding Let l_1,…,l_n be the lengths of codewords appearing in the canonical code The decoding process will use the following information, for each distinct length l_i: –the codeword c_i of the first symbol with length l_i –the number of words n_i of length l_i –easily computed using the information about symbol lengths 88

Decoding (cont) i = 0 Repeat –i = i+1 –Let d be the number derived by reading l_i bits of the input Until d ≤ c_i + n_i - 1 Return the (d - c_i + 1)-th symbol (in lexicographic order) of length l_i Example: Decode a bit string of your choice using the code from the previous exercise
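
A sketch of this decoding loop in Python; the per-length tables (first codeword c_i, count n_i, symbols in lexicographic order) are hardcoded here for the canonical code produced by the previous exercise, and the names are ours:

def canonical_decode(bits, first_code, cnt, symbols):
    out, pos = [], 0
    while pos < len(bits):
        for l in sorted(first_code):                 # try lengths from shortest to longest
            d = int(bits[pos:pos + l], 2)            # read l bits as a number
            if d <= first_code[l] + cnt[l] - 1:      # a codeword of length l was read
                out.append(symbols[l][d - first_code[l]])
                pos += l
                break
    return "".join(out)

first_code = {2: 0b00, 3: 0b100, 4: 0b1110}   # c_i per length
cnt        = {2: 2, 3: 3, 4: 2}               # n_i per length
symbols    = {2: "BE", 3: "AFG", 4: "CD"}     # symbols of each length, in order

print(canonical_decode("011001111", first_code, cnt, symbols))   # EAD (01 = E, 100 = A, 1111 = D)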

Some More Details How do we compute the lengths for each symbol? How do we compute the probabilities of each symbol? –model per posting list –single model for all posting lists –model for each group of posting lists (grouped by size) 90

Huffman Code Drawbacks Each symbol is coded separately Each symbol uses a whole number of bits Can be very inefficient when there are extremely likely/unlikely values 91

How Much can We Compress? Given: (1) A set of symbols, (2) Each symbol s has an associated probability P(s) Shannon’s lower bound on the average number of bits per symbol needed is: Σ_s -P(s) log2 P(s) –Roughly speaking, each symbol s with probability P(s) needs at least -log2 P(s) bits to represent –Example: the outcome of a fair coin needs -log2 0.5 = 1 bit to represent Ideally, we aim to find a compression method that reaches Shannon’s bound 92
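
A tiny sketch of the bound itself (names ours):

from math import log2

def entropy(probs):
    # Shannon's lower bound: average bits per symbol = sum of -P(s) * log2 P(s)
    return sum(-p * log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))     # 1.0 bit per symbol for a fair coin
print(entropy([0.99, 0.01]))   # ≈ 0.081 bits per symbol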

Example Suppose A has probability 0.99 and B has probability 0.01. How many bits will Huffman’s code use for 10 A-s? (At least one bit per symbol, so 10 bits.) Shannon’s bound gives us a requirement of -log2(0.99) ≈ 0.015 bits per symbol, i.e., only 0.15 bits in total! The inefficiency of Huffman’s code is bounded from above by P(s_m) + 0.086, where s_m is the most likely symbol 93

Arithmetic Coding Comes closer to Shannon’s bound by coding symbols together Input: Set of symbols S with probabilities, input text s_1,…,s_n Output: the length n of the input text and a number (written in binary) in [0,1) In order to explain the algorithm, numbers will be shown as decimal, but obviously they are always binary 94

ArithmeticEncoding(s_1 … s_n) 95
low := 0
high := 1
for i = 1 to n do
  (low, high) := Restrict(low, high, s_i)
return any number between low and high

Restrict(low, high, s_i) 96
low_bound := sum{ P(s) | s ∈ S and s < s_i }
high_bound := low_bound + P(s_i)
range := high - low
new_low := low + range * low_bound
new_high := low + range * high_bound
return (new_low, new_high)

ArithmeticDecoding(k, n) 97
low := 0
high := 1
for i = 1 to n do
  for each s ∈ S do
    (new_low, new_high) := Restrict(low, high, s)
    if new_low ≤ k < new_high then
      Output “s”
      low := new_low
      high := new_high
      break
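
A runnable Python sketch of the three routines above (names follow the pseudocode; returning the interval midpoint is just one valid choice of "any number between low and high"):

def restrict(low, high, sym, probs):
    # probs: dict symbol -> probability; symbols are ordered by their natural (string) order.
    low_bound = sum(p for s, p in probs.items() if s < sym)
    high_bound = low_bound + probs[sym]
    rng = high - low
    return low + rng * low_bound, low + rng * high_bound

def arithmetic_encode(text, probs):
    low, high = 0.0, 1.0
    for sym in text:
        low, high = restrict(low, high, sym, probs)
    return (low + high) / 2, len(text)        # any number in [low, high) would do

def arithmetic_decode(k, n, probs):
    low, high, out = 0.0, 1.0, []
    for _ in range(n):
        for sym in sorted(probs):
            new_low, new_high = restrict(low, high, sym, probs)
            if new_low <= k < new_high:
                out.append(sym)
                low, high = new_low, new_high
                break
    return "".join(out)

P = {"A": 0.5, "B": 0.5}
k, n = arithmetic_encode("ABA", P)
print(k, arithmetic_decode(k, n, P))          # 0.3125 ABA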

Think about it Decode the number 0.34 with length 3, given an alphabet consisting of A, B, both with prob 0.5 In general, what is the size of the encoding of an input? –to store a number in an interval of size high-low, we need -log2(high-low) bits –The size of the final interval is P(s_1)·P(s_2)·…·P(s_n), and so it needs -log2(P(s_1)·…·P(s_n)) = Σ_i -log2 P(s_i) bits 98

Adaptive Arithmetic Coding In order to decode, the probabilities of each symbol must be known. –This must be stored, which adds to overhead The probabilities may change over the course of the text –Cannot be modeled thus far In adaptive arithmetic coding the encoder (and decoder) compute the probabilities on the fly by counting symbol frequencies 99
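
A minimal sketch of the adaptive variant (names ours), run on the bccb example worked through in the next slides; up to rounding it reproduces the final interval shown there:

def adaptive_encode(text, alphabet):
    counts = {s: 1 for s in alphabet}          # zero-frequency fix: start every counter at 1
    low, high = 0.0, 1.0
    for sym in text:
        total = sum(counts.values())
        low_bound = sum(c for s, c in counts.items() if s < sym) / total
        high_bound = low_bound + counts[sym] / total
        rng = high - low
        low, high = low + rng * low_bound, low + rng * high_bound
        counts[sym] += 1                       # update the model after coding the symbol
    return low, high

print(adaptive_encode("bccb", "abc"))          # ≈ (0.6389, 0.65)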

An example - I String bccb from the alphabet {a,b,c} The zero-frequency problem is solved by initializing all character counters at 1 When the first b is to be coded, all symbols have a 33% probability (why?) The arithmetic coder maintains two numbers, low and high, which represent a subinterval [low,high) of the range [0,1) Initially low=0 and high=1 100

An example - II The range between low and high is divided between the symbols of the alphabet, according to their probabilities 101 (Figure: the interval [low, high) split into subintervals for a, b, c, each with probability 1/3.)

An example - III 102 Coding b restricts the interval to b’s subinterval: low = 0.3333, high = 0.6667 New probabilities (after counting the coded b): P[a]=1/4, P[b]=2/4, P[c]=1/4

An example - IV 103 Coding c (with P[a]=1/4, P[b]=2/4, P[c]=1/4) restricts the interval to: low = 0.5833, high = 0.6667 New probabilities: P[a]=1/5, P[b]=2/5, P[c]=2/5

An example - V 104 Coding the second c (with P[a]=1/5, P[b]=2/5, P[c]=2/5) restricts the interval to: low = 0.6333, high = 0.6667 New probabilities: P[a]=1/6, P[b]=2/6, P[c]=3/6

An example - VI Coding the final b (with P[a]=1/6, P[b]=2/6, P[c]=3/6) gives the final interval [0.6390, 0.6501); we can send 0.64

An example - summary Starting from the range between 0 and 1, we restrict ourselves each time to the subinterval that encodes the given symbol At the end the whole sequence can be encoded by any of the numbers in the final range (but mind the brackets...) 106

An example - summary (Figure: the nested intervals)
Start: [0, 1) with probabilities a, b, c = 1/3, 1/3, 1/3
After b: [0.3333, 0.6667) with probabilities 1/4, 2/4, 1/4
After c: [0.5833, 0.6667) with probabilities 1/5, 2/5, 2/5
After c: [0.6333, 0.6667) with probabilities 1/6, 2/6, 3/6
After b: [0.6390, 0.6501); send 0.64