Search Engines WS 2009 / 2010
Prof. Dr. Hannah Bast
Chair of Algorithms and Data Structures
Department of Computer Science
University of Freiburg
Lecture 4, Thursday November 12th, 2009 (IO- and Cache-Efficiency, Compression)

Goals
Learn why compression is important
– cache efficiency: sequential vs. random access to memory
– IO-efficiency: sequential vs. random access to disk
Learn about various compression schemes
– Elias, Golomb, Variable Byte, Simple-9
Learn some important theoretical foundations
– entropy = optimal compressibility of data
– which scheme is optimal under which circumstances?
– how fast can one compress / decompress?

IO-Efficiency (IO = Input / Output)
Sequential vs. random access to disk
– typical disk transfer rate is 50 MB / second
– typical disk seek time is 5 milliseconds
This means that
– … when we write code for data that is so large that it does not completely fit into memory (as is typical for search apps) …
– then we must take great care that whenever we read from (or write to) disk, we actually read (or write) big *blocks of data*
– an algorithm which does that is called IO-efficient
– more formally  next slide
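As an illustration, a minimal Python sketch that reads a large file in big sequential blocks rather than item by item; the file name and the 256 KB block size are arbitrary choices, not from the lecture:

```python
# Read a large file sequentially in big blocks instead of byte by byte.
BLOCK_SIZE = 256 * 1024  # 256 KB per read; an arbitrary choice for this sketch

def count_newlines_blocked(path: str) -> int:
    count = 0
    with open(path, "rb") as f:
        while True:
            block = f.read(BLOCK_SIZE)   # one sequential block read
            if not block:
                break
            count += block.count(b"\n")
    return count

if __name__ == "__main__":
    print(count_newlines_blocked("postings.bin"))  # hypothetical input file
```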

IO-Efficiency — Formalization
Standard RAM model (RAM = random access machine)
– count the number of operations
– one operation = an arithmetic operation, a read / write of a single memory cell, etc.
– for example, sorting n integers needs O(n ∙ log n) operations
IO-model, or: external memory model
– data is read / written in blocks of B contiguous bytes
– count the number of block reads / writes
– ignore all other operations / computational costs
– for example, the IO-complexity of sorting is O(n/B ∙ log_{M/B} (n/B)), where M is the size of the main memory
– briefly explain bound
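A brief sketch of where the sorting bound comes from (the standard external merge sort argument, written out here rather than on the slide):

```latex
% One pass over the data reads/writes n/B blocks. External merge sort first builds
% sorted runs of size M, then merges M/B runs at a time (one block per run fits in
% memory), so about log_{M/B}(n/B) passes suffice:
\[
  O\!\left(\frac{n}{B}\cdot\log_{M/B}\frac{n}{B}\right)\ \text{block reads/writes.}
\]
```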

Cache-Efficiency
Basic functioning of a cache
– draw interactively

Comparison: Disk / Memory
Both have in common that
– sequential access is much more efficient than random access
The ratios are very different though
– disk: ~ 50 MB / second transfer rate, ~ 5 milliseconds seek time
  this implies an optimal block size of ~ 250 KB
– memory: ~ 1 GB / second sequential read, ~ 100 nanoseconds for a random access / cache miss
  this implies an optimal block size of ~ 100 bytes
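The two "optimal block size" figures follow from a back-of-the-envelope balance between access latency and transfer time; a worked version of that calculation:

```latex
% Rough balance point: choose the block size so that transferring one block takes
% about as long as the latency of one random access.
\[
  \text{disk: } 50~\mathrm{MB/s}\times 5~\mathrm{ms}\approx 250~\mathrm{KB},
  \qquad
  \text{memory: } 1~\mathrm{GB/s}\times 100~\mathrm{ns}\approx 100~\mathrm{bytes}.
\]
```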

IO / Cache Efficiency
Understand
– considering IO and cache efficiency is *key* for the efficiency of any program you write that deals with non-trivial amounts of data
– an algorithm that tries to access data in blocks as much as possible is said to have good locality of access
In the exercise …
– … you will write two programs that compute exactly the same function, but with very different locality of access
– let's see which running time differences you measure
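In the spirit of that exercise (but not the exercise itself), a small Python sketch that computes the same sum twice with very different locality of access; the array size and function names are my own choices:

```python
# Sum the same array twice: once in sequential order, once in a random order.
# Both loops compute the same result but differ sharply in locality of access.
import random
import time

N = 10_000_000
data = list(range(N))
random_order = list(range(N))
random.shuffle(random_order)

def timed_sum(indices):
    start = time.time()
    total = sum(data[i] for i in indices)
    return total, time.time() - start

_, t_seq = timed_sum(range(N))        # sequential access
_, t_rnd = timed_sum(random_order)    # random access
print(f"sequential: {t_seq:.2f}s   random: {t_rnd:.2f}s")
```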

Compression
Now we can understand why compression is important
– for information retrieval / search
– or for any application that deals with large amounts of data
Namely
– consider a query
– assume that query needs 50 MB of data from disk
– then it takes 1 second just to read this data
– assume the data is stored compressed using only 5 MB
– now we can read it in just 0.1 seconds
– assume it takes us 0.1 seconds to decompress the data
– then we are still 5 times faster than without compression (assuming that the internal computation takes << 0.1 seconds)

Compressing Inverted Lists
Example of an inverted list of document ids
– 3, 17, 21, 24, 34, 38, 45, …, 11876, 11899, 11913, …
– numbers can become very large
– need 4 bytes to store each, for web search even more
But we can also store the list like this
– +3, +14, +4, +3, +10, +4, +7, …, +12, +23, +14, …
– this is called gap-encoding
– works as long as we process the lists from left to right
– now we have a sequence of mostly small numbers
– need a scheme which stores small numbers in few bits
– such a scheme is called universal encoding  next slide
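A minimal sketch of gap-encoding and decoding in Python; the function names are my own, and no bit-level coding happens yet:

```python
# Turn a sorted list of doc ids into gaps and back.
def gap_encode(doc_ids):
    gaps, previous = [], 0
    for doc_id in doc_ids:
        gaps.append(doc_id - previous)
        previous = doc_id
    return gaps

def gap_decode(gaps):
    doc_ids, current = [], 0
    for gap in gaps:
        current += gap
        doc_ids.append(current)
    return doc_ids

ids = [3, 17, 21, 24, 34, 38, 45]
assert gap_decode(gap_encode(ids)) == ids
print(gap_encode(ids))   # [3, 14, 4, 3, 10, 4, 7]
```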

Universal Encoding
Goal of universal encoding
– store a number x in ~ log2 x bits
– less is not possible, why?  Exercise
Elias code
– write the number x to be encoded in binary
– prepend z zeroes, where z = floor(log2 x)
– examples: 1  1,  2  010,  3  011,  4  00100,  5  00101
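A sketch of the Elias code exactly as stated above, in Python; the function name is my own:

```python
# Encode a positive integer with the Elias code from the slide:
# write x in binary and prepend floor(log2 x) zeroes.
def elias_encode(x: int) -> str:
    assert x >= 1
    binary = bin(x)[2:]                         # binary representation, no leading zeroes
    return "0" * (len(binary) - 1) + binary     # len(binary) - 1 == floor(log2 x)

for x in [1, 2, 3, 4, 5, 17]:
    print(x, elias_encode(x))
# codewords: 1, 010, 011, 00100, 00101, 000010001
```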

Prefix-Freeness
For our purposes, codes should be prefix-free
– prefix-free = no encoding of a symbol is a prefix of the encoding of some other symbol
– assume the following code (which is not prefix-free): A encoded by 1, B encoded by 11
– now what does the sequence 1111 encode? could be AAAA or ABA or BAA or AAB or BB
– for a prefix-free code, decoding is unambiguous
– the Elias code is prefix-free  Exercise
– and so are all the codes we will consider in this lecture
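Because the Elias code is prefix-free, a decoder can walk through a concatenation of codewords without ever having to guess where one codeword ends; a sketch (mine, matching the encoder above):

```python
# Decode a bit string that is a concatenation of Elias codewords.
def elias_decode(bits: str):
    numbers, i = [], 0
    while i < len(bits):
        z = 0
        while bits[i] == "0":                      # count the prepended zeroes
            z += 1
            i += 1
        numbers.append(int(bits[i:i + z + 1], 2))  # the next z + 1 bits are x in binary
        i += z + 1
    return numbers

print(elias_decode("1" + "010" + "011" + "00100"))  # [1, 2, 3, 4]
```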

Elias-Gamma Encoding
Elias encoding
– uses ~ 2 ∙ log2 x bits to encode a number x
– that is about a factor of 2 off the optimum
– the reason is the prepended zeroes (unary encoding)
Elias-Gamma encoding
– encode the prepended zeroes with Elias
– show example
– now ~ log2 x + 2 ∙ log2 log2 x bits
– this can be iterated  ~ log2 x + log2 log2 x + … + 2 ∙ log2^(k) x bits, where log2^(k) is the k-fold iterated logarithm
– what is the optimal k?  Exercise
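For reference, the bit counts of the two variants written out; the exact additive constants are my own bookkeeping and not from the slide:

```latex
% Plain Elias code: unary length prefix + binary value.
\[
  |\gamma(x)| \;=\; 2\lfloor \log_2 x\rfloor + 1
\]
% Variant from the slide, with the length prefix itself Elias-encoded:
\[
  |\delta(x)| \;=\; \lfloor \log_2 x\rfloor
              \;+\; 2\lfloor \log_2 (\lfloor \log_2 x\rfloor + 1)\rfloor + 1
\]
```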

Entropy Encoding
What if the numbers are not in sorted order
– or not numbers at all but just symbols: C C B A D B B A B B C B B C B D
Entropy encoding
– give each symbol a code corresponding to its frequency
– frequencies in our example: A: 2, B: 8, C: 4, D: 2
– prefix-free codes: B  1, C  01, D  0010, A  0001
– requires 8 ∙ 1 + 4 ∙ 2 + 2 ∙ 4 + 2 ∙ 4 = 32 bits
– that is 2 bits / symbol on average
– better than the obvious 3-bit code
– but is it the best we can get?
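A small sketch (mine, not from the lecture) that recomputes the bit count of the code given above for the example sequence:

```python
# Check the bit count of the prefix-free code from the slide.
from collections import Counter

sequence = "CCBADBBABBCBBCBD"
code = {"B": "1", "C": "01", "D": "0010", "A": "0001"}

freq = Counter(sequence)                                    # {'B': 8, 'C': 4, 'A': 2, 'D': 2}
total_bits = sum(freq[s] * len(code[s]) for s in freq)
print(total_bits, total_bits / len(sequence))               # 32 bits, 2.0 bits per symbol
```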

Definition of Entropy
Entropy
– defined for a discrete random variable X (that is, for a probability distribution with finite range)
– assume w.l.o.g. that X takes values in the range {1, …, m}
– let p_i = Prob(X = i)
– then the entropy of X is written and defined as H(X) = − Σ_{i=1,…,m} p_i ∙ log2 p_i
Examples
– equidistribution: p_i = 1/m  H = log2 m
– deterministic: one p_i = 1, all others 0  H = 0
Intuitively: entropy = average #bits needed to encode a symbol
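To answer the question from the previous slide, a quick check of the entropy of that example distribution (a small Python sketch of my own):

```python
# Entropy of the symbol distribution A: 2/16, B: 8/16, C: 4/16, D: 2/16.
from math import log2

probabilities = [8/16, 4/16, 2/16, 2/16]
H = -sum(p * log2(p) for p in probabilities)
print(H)   # 1.75 -- so the 2 bits/symbol code from the previous slide is not quite optimal
```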

Source Coding Theorem
By Claude Shannon, 1948
– let X be a random variable with finite range
– let C be a (binary) prefix-free code for the possible values
– C(x) = code for value x from the range, L(x) = length of that code
– then E[ L(X) ] ≥ H(X) for every such code C
– and there is a prefix-free code C with E[ L(X) ] ≤ H(X) + 1
Claude Shannon, *1916 Michigan, †2001 Massachusetts

Proof of Source Coding Theorem
– prove the lower bound, give hints on the upper bound
– key tool: the Kraft inequality
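Since the slide only names the key tool, here is the standard statement of the Kraft inequality and a compressed version of the usual lower-bound argument; this write-up is mine, not the lecture's:

```latex
% Kraft inequality: any binary prefix-free code with codeword lengths l_1,...,l_m satisfies
\[
  \sum_{i=1}^{m} 2^{-l_i} \;\le\; 1 .
\]
% Lower bound: set q_i := 2^{-l_i} / \sum_j 2^{-l_j}. Then
\[
  E[L(X)] - H(X)
  \;=\; \sum_i p_i \log_2 \frac{p_i}{q_i} \;-\; \log_2 \Big(\sum_j 2^{-l_j}\Big)
  \;\ge\; 0 ,
\]
% because the first term is a relative entropy (always >= 0) and the second
% logarithm is <= 0 by the Kraft inequality.
```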

Entropy Encoding vs. Universal Encoding
Recall
– entropy-optimal encoding gives a code with log2 (1/p(x)) bits to a symbol which occurs with probability p(x)
– optimal universal encoding gives a code with c ∙ log2 x + O(1) bits to a positive integer x
Therefore, by the source coding theorem
– universal encoding is the entropy-optimal code when number x occurs with probability ~ 1 / x^c
– for example, the Elias code is optimal when number x occurs with probability ~ 1 / x^2
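The link between the two statements is a one-line calculation, spelled out here for completeness:

```latex
% If Prob(X = x) ~ 1/x^c, the entropy-ideal code length for x is
\[
  \log_2 \frac{1}{p(x)} \;=\; c \cdot \log_2 x + O(1),
\]
% which matches the length of a universal code with constant c
% (c = 2 for the plain Elias code of this lecture).
```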

Golomb Encoding
By Solomon Golomb, 1966
– comes with a parameter M (the modulus)
– write the positive integer x as q ∙ M + r
– where q = x div M and r = x mod M
– the codeword for x is then the concatenation of
  – the quotient q, written in unary with 0s
  – a single 1 (as a delimiter)
  – the remainder r, written in binary
– examples
Solomon Golomb, *1932 Maryland
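A sketch of the Golomb rule above in Python, restricted to a modulus M that is a power of two so that "the remainder in binary" unambiguously means log2(M) bits (general M would need a truncated binary code, which the slide does not spell out); the function name and the choice M = 4 are mine:

```python
# Golomb code for a power-of-two modulus M: unary quotient (in 0s), a 1 as
# delimiter, then the remainder in log2(M) bits.
def golomb_encode(x: int, M: int) -> str:
    assert x >= 1 and M > 0 and M & (M - 1) == 0
    q, r = x // M, x % M
    r_bits = M.bit_length() - 1                  # log2(M)
    return "0" * q + "1" + format(r, f"0{r_bits}b")

for x in [3, 14, 4, 10]:                         # some gaps from the earlier example
    print(x, golomb_encode(x, 4))
# codewords: 111, 000110, 0100, 00110
```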

Golomb Encoding — Analysis
Show that Golomb encoding is optimal
– for gap-encoding inverted lists
– assuming the m doc ids in a list are a random subset of all doc ids 1..n
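For orientation, the standard result this analysis arrives at, stated here without derivation (the constant is the textbook one, not taken from the slide):

```latex
% Under the random-subset assumption each gap is approximately geometrically
% distributed with success probability p = m/n; the Golomb code is then
% (essentially) entropy-optimal for the modulus
\[
  M \;\approx\; \frac{\ln 2}{p} \;=\; \ln 2 \cdot \frac{n}{m} \;\approx\; 0.69\,\frac{n}{m}.
\]
```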

Simpler Encodings — Variable Byte
Variable-byte encoding
– always use a multiple of 8 bits  codes aligned to byte boundaries
– the most significant bit of each byte indicates whether the code continues
– examples
– advantages: simple, and faster to decompress than non-byte-aligned codes
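A sketch of variable-byte coding in Python; the slide does not fix the exact convention, so this assumes a common one where a set most-significant bit means "another byte follows" and the low 7 bits of x come first:

```python
# Variable-byte code: 7 data bits per byte, most significant bit = continuation flag.
def vbyte_encode(x: int) -> bytes:
    assert x >= 0
    out = []
    while True:
        if x < 128:
            out.append(x)                   # last byte: continuation bit 0
            break
        out.append((x & 0x7F) | 0x80)       # 7 low bits, continuation bit 1
        x >>= 7
    return bytes(out)

def vbyte_decode(data: bytes) -> int:
    x, shift = 0, 0
    for byte in data:
        x |= (byte & 0x7F) << shift
        shift += 7
        if byte < 128:                      # continuation bit 0: done
            break
    return x

for x in [5, 127, 128, 11876]:
    encoded = vbyte_encode(x)
    print(x, encoded.hex(), vbyte_decode(encoded))
```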

Simpler Encodings — Simple-9
Simple-9 encoding (Anh and Moffat, 2005)
– align to full machine words (used to be: 4-byte ints)
– each int is split into two parts: x (4 bits) and y (28 bits)
– x says how y is to be interpreted
– depending on x, y is interpreted as 14 (small) numbers of 2 bits each, or 9 (small) numbers of 3 bits each, or … or 1 number of 28 bits
– advantage: decompression of a whole 4-byte int can be hard-coded for each possible x
– this gives super-fast decompression
– compression ratio is not optimal but ok
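A sketch of Simple-9 packing and unpacking in Python; the selector table is the standard one associated with Anh and Moffat's scheme, but its exact ordering here should be treated as an assumption, and the function names are mine:

```python
# The 4-bit selector x determines how the 28 payload bits y are split.
SELECTORS = [(28, 1), (14, 2), (9, 3), (7, 4), (5, 5), (4, 7), (3, 9), (2, 14), (1, 28)]

def simple9_decode_word(word: int):
    selector = word >> 28                    # top 4 bits: x
    payload = word & ((1 << 28) - 1)         # low 28 bits: y
    count, bits = SELECTORS[selector]
    mask = (1 << bits) - 1
    # unpack `count` numbers of `bits` bits each, lowest field first
    return [(payload >> (i * bits)) & mask for i in range(count)]

def simple9_encode_word(numbers) -> int:
    for selector, (count, bits) in enumerate(SELECTORS):
        if len(numbers) == count and all(n < (1 << bits) for n in numbers):
            payload = 0
            for i, n in enumerate(numbers):
                payload |= n << (i * bits)
            return (selector << 28) | payload
    raise ValueError("numbers do not fit one machine word with these selectors")

word = simple9_encode_word([3, 14, 4, 3, 10, 4, 7])   # seven 4-bit numbers, selector 3
print(simple9_decode_word(word))                      # [3, 14, 4, 3, 10, 4, 7]
```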