INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID


INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID Lecture # 11 Compression

ACKNOWLEDGEMENTS
The presentation of this lecture has been taken from the following sources:
- "Introduction to information retrieval" by Prabhakar Raghavan, Christopher D. Manning, and Hinrich Schütze
- "Managing gigabytes" by Ian H. Witten, Alistair Moffat, and Timothy C. Bell
- "Modern information retrieval" by Ricardo Baeza-Yates
- "Web Information Retrieval" by Stefano Ceri, Alessandro Bozzon, and Marco Brambilla

Outline
- Compression for inverted indexes
- Dictionary storage
- Dictionary-as-a-String
- Blocking

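Basic indexing pipeline
- Documents to be indexed: "Friends, Romans, countrymen."
- Tokenizer produces the token stream: Friends, Romans, Countrymen
- Linguistic modules produce the modified tokens: friend, roman, countryman
- Indexer produces the inverted index: each term (friend, roman, countryman) points to a postings list of document IDs (IDs recoverable from the figure: 2, 4, 13, 16, 1)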

Why compression for inverted indexes?
- Dictionary
  - Make it small enough to keep in main memory
  - Make it so small that you can keep some postings lists in main memory too
- Postings file(s)
  - Reduce disk space needed
  - Decrease time needed to read postings lists from disk
  - Large search engines keep a significant part of the postings in memory; compression lets you keep more in memory

Dictionary storage - first cut
- Array of fixed-width entries
- Each entry: term (20 bytes), frequency (4 bytes), pointer to postings (4 bytes)
- 500,000 terms × 28 bytes/term = 14 MB
- Allows for fast binary search into the dictionary
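The fixed-width layout above can be sketched in a few lines of Python. This is only an illustration, assuming ASCII terms and little-endian 4-byte integers; the names RECORD, build_table and lookup are made up for this sketch and are not from the lecture.

```python
import struct

TERM_WIDTH = 20                               # bytes reserved for every term
RECORD = struct.Struct(f"<{TERM_WIDTH}sII")   # term, frequency, postings pointer


def build_table(entries):
    """Pack (term, freq, postings_ptr) triples, sorted by term, into one byte array."""
    table = bytearray()
    for term, freq, ptr in sorted(entries):
        raw = term.encode("ascii")
        if len(raw) > TERM_WIDTH:
            raise ValueError(f"term does not fit in a fixed-width slot: {term!r}")
        table += RECORD.pack(raw, freq, ptr)  # struct pads short terms with NUL bytes
    return bytes(table)


def lookup(table, term):
    """Binary search over the fixed-width records; returns (freq, ptr) or None."""
    key = term.encode("ascii")
    lo, hi = 0, len(table) // RECORD.size
    while lo < hi:
        mid = (lo + hi) // 2
        raw, freq, ptr = RECORD.unpack_from(table, mid * RECORD.size)
        current = raw.rstrip(b"\0")           # drop the padding before comparing
        if current == key:
            return freq, ptr
        if current < key:
            lo = mid + 1
        else:
            hi = mid
    return None
```

Because every record has the same size (28 bytes here), record i starts at byte 28 × i, which is what makes binary search possible without any extra index. The price is that a 1-letter term still occupies 20 bytes and a term longer than 20 bytes is rejected.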

Fixed-width terms are wasteful
- Most of the bytes in the Term column are wasted: we allot 20 bytes even for 1-letter terms.
- And we still can't handle supercalifragilisticexpialidocious.
- Written English averages ~4.5 characters per word.
  - Exercise: Why is/isn't this the number to use for estimating the dictionary size?
  - Short words dominate token counts.
- The average dictionary word in English is ~8 characters. Explain this.
- What are the corresponding numbers for Italian text?

Compressing the term list: Dictionary-as-a-String
- Store the dictionary as one (long) string of characters:
  ….systilesyzygeticsyzygialsyzygyszaibelyiteszczecinszomo….
- Each term has a pointer into the string; the pointer to the next word shows the end of the current word.
- Hope to save up to 60% of dictionary space.
- Total string length = 500K terms × 8 bytes = 4 MB
- Pointers resolve 4M positions: log2(4M) = 22 bits = 3 bytes
- Binary search these pointers.
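A minimal Python sketch of the dictionary-as-a-string idea, using the terms from the example string. The function names are hypothetical, and a Python list of integers stands in for the packed 3-byte pointers a real implementation would use.

```python
import math


def build_string_dictionary(terms):
    """Concatenate the sorted terms into one string and record where each starts."""
    ordered = sorted(terms)
    offsets = []
    pos = 0
    for term in ordered:
        offsets.append(pos)
        pos += len(term)
    return "".join(ordered), offsets


def term_at(string, offsets, i):
    """The next term's offset marks the end of term i."""
    end = offsets[i + 1] if i + 1 < len(offsets) else len(string)
    return string[offsets[i]:end]


def find(string, offsets, term):
    """Binary search over the pointers; returns the term's index or -1."""
    lo, hi = 0, len(offsets)
    while lo < hi:
        mid = (lo + hi) // 2
        current = term_at(string, offsets, mid)
        if current == term:
            return mid
        if current < term:
            lo = mid + 1
        else:
            hi = mid
    return -1


def pointer_bytes(string_length):
    """Whole bytes needed to address every position in the string."""
    bits = math.ceil(math.log2(string_length))
    return math.ceil(bits / 8)
```

For a string of 4,000,000 characters, pointer_bytes gives 3, matching the 22-bit estimate on the slide. No term length is stored: it is implied by the distance to the next pointer.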

Total space for compressed list
- 4 bytes per term for Freq.
- 4 bytes per term for pointer to Postings.
- 3 bytes per term pointer
- Avg. 8 bytes per term in term string
- Total space = 500K terms × (4 bytes Freq + 4 bytes Postings + 3 bytes term pointer + 8 bytes per term in term string) = 500K terms × 19 bytes/term = 9.5 MB
- The term itself now costs avg. 11 bytes (3-byte pointer + 8 bytes in the string), not 20.
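The arithmetic can be checked with a short Python calculation. The constant names are chosen for this sketch, and 1 MB is taken as 1,000,000 bytes, which is the convention the slide figures appear to use.

```python
TERMS = 500_000
FREQ_BYTES = 4            # term frequency
POSTINGS_PTR_BYTES = 4    # pointer to the postings list
TERM_PTR_BYTES = 3        # pointer into the term string
AVG_TERM_CHARS = 8        # average bytes per term inside the string
FIXED_TERM_BYTES = 20     # term slot in the fixed-width layout

# Fixed-width array: every entry costs 20 + 4 + 4 = 28 bytes.
fixed_width_total = TERMS * (FIXED_TERM_BYTES + FREQ_BYTES + POSTINGS_PTR_BYTES)

# Dictionary-as-a-string: 4 + 4 + 3 + 8 = 19 bytes per term on average.
per_term = FREQ_BYTES + POSTINGS_PTR_BYTES + TERM_PTR_BYTES + AVG_TERM_CHARS
string_total = TERMS * per_term

# Cost of the term itself: 3-byte pointer + 8 bytes of characters, versus 20.
term_cost = TERM_PTR_BYTES + AVG_TERM_CHARS

print(fixed_width_total / 1e6, "MB fixed-width")
print(string_total / 1e6, "MB dictionary-as-a-string")
print(term_cost, "bytes per term instead of", FIXED_TERM_BYTES)
```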

Blocking
- Store pointers to every kth term in the term string. Example below: k=4.
- Need to store term lengths (1 extra byte per term):
  ….7systile9syzygetic8syzygial6syzygy11szaibelyite8szczecin9szomo….
- Save 9 bytes on 3 pointers.
- Lose 4 bytes on term lengths.
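A rough Python sketch of blocked storage with k=4, assuming ASCII terms of at most 255 bytes so that each length fits in one byte. The helper names are invented for illustration. Lookup does a binary search over the block pointers and then a short linear scan inside the chosen block.

```python
def build_blocked(terms, k=4):
    """Store each term as <1-byte length><term>; keep a pointer only to every kth term."""
    data = bytearray()
    block_ptrs = []
    for i, term in enumerate(sorted(terms)):
        raw = term.encode("ascii")
        if len(raw) > 255:
            raise ValueError("term length must fit in one byte")
        if i % k == 0:
            block_ptrs.append(len(data))   # start of a new block
        data.append(len(raw))              # the extra length byte
        data += raw
    return bytes(data), block_ptrs


def read_term(data, pos):
    """Return the term stored at pos and the position of the next term."""
    length = data[pos]
    start = pos + 1
    return data[start:start + length], start + length


def contains(data, block_ptrs, term, k=4):
    key = term.encode("ascii")
    # Binary search for the last block whose first term is <= key.
    lo, hi = 0, len(block_ptrs)
    while lo < hi:
        mid = (lo + hi) // 2
        first, _ = read_term(data, block_ptrs[mid])
        if first <= key:
            lo = mid + 1
        else:
            hi = mid
    if lo == 0:
        return False                       # key sorts before the first term
    # Linear scan through at most k terms of that block.
    pos = block_ptrs[lo - 1]
    for _ in range(k):
        if pos >= len(data):
            break
        current, pos = read_term(data, pos)
        if current == key:
            return True
        if current > key:
            break
    return False
```

The scan inside a block is the cost of blocking: the larger k is, the fewer pointers are stored, but the more terms must be read one by one after the binary search.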

Net
- Without blocking we used 3 bytes/pointer, i.e. 3 × 4 = 12 bytes for every 4 terms; with blocking (k=4) we use 3 + 4 = 7 bytes for the same 4 terms.
- Shaved another ~0.5MB; can save more with larger k.
- Why not go with larger k?
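The saving can be worked out for any block size with a small helper. This is a back-of-the-envelope sketch assuming 3-byte pointers, 1-byte term lengths and 500K terms; the function name is made up here.

```python
def blocking_saving_bytes(num_terms, k, ptr_bytes=3, len_bytes=1):
    """Bytes saved by keeping one pointer per block of k terms plus a length byte per term."""
    blocks = num_terms // k
    unblocked = k * ptr_bytes              # one pointer per term
    blocked = ptr_bytes + k * len_bytes    # one pointer per block + k length bytes
    return blocks * (unblocked - blocked)


for k in (4, 8, 16):
    print(k, blocking_saving_bytes(500_000, k))
```

With k=4 the saving is 5 bytes per block of 4 terms, which over 500K terms comes to 625,000 bytes under these assumptions, the same order as the roughly half a megabyte quoted above. Larger k saves more space but lengthens the linear scan inside each block, which is the trade-off behind the closing question.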

Resources
- Chapter 5 of IIR
- Resources at http://ifnlp.org/ir
- Original publication on word-aligned binary codes by Anh and Moffat (2005); also: Anh and Moffat (2006a)
- Original publication on variable byte codes by Scholer, Williams, Yiannis and Zobel (2002)
- More details on compression (including compression of positions and frequencies) in Zobel and Moffat (2006)