Evidence from Content INST 734 Module 2 Doug Oard.

Agenda
Character sets
Terms as units of meaning
Boolean retrieval
→ Building an index

An “Inverted Index”
Term      Postings
aid       4, 8
all       2, 4, 6
back      1, 3, 7
brown     1, 3, 5, 7
come      2, 4, 6, 8
dog       3, 5
fox       3, 5, 7
good      2, 4, 6, 8
jump      3
lazy      1, 3, 5, 7
men       2, 4, 8
now       2, 6, 8
over      1, 3, 5, 7, 8
party     6, 8
quick     1, 3
their     1, 5, 7
time      2, 4, 6
(The slide also shows the term-document matrix for Doc 1 through Doc 8 and a term index keyed on first letters and two-letter prefixes: A, B, C, D, F, G, J, L, M, N, O, P, Q, T; AI, AL, BA, BR, TH, TI.)

Deconstructing the Inverted Index
The term index: quick, brown, fox, over, lazy, dog, back, now, time, all, good, men, come, jump, aid, their, party
The postings file (one list of document numbers per term, in the same order): 1, 3 | 1, 3, 5, 7 | 3, 5, 7 | 1, 3, 5, 7, 8 | 1, 3, 5, 7 | 3, 5 | 1, 3, 7 | 2, 6, 8 | 2, 4, 6 | 2, 4, 6 | 2, 4, 6, 8 | 2, 4, 8 | 2, 4, 6, 8 | 3 | 4, 8 | 1, 5, 7 | 6, 8
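
The split into a term index and a postings file can be sketched in a few lines of Python. This is only an illustration under simple assumptions: terms are lowercased, whitespace-separated tokens, and the function names `build_inverted_index` and `boolean_and` as well as the three toy documents are made up for this example (they are not the eight documents from the slide).

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to a sorted list of the document numbers that contain it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            postings[term].add(doc_id)
    # Sorted keys play the role of the term index; the values are the postings.
    return {term: sorted(ids) for term, ids in sorted(postings.items())}

def boolean_and(index, term_a, term_b):
    """Boolean AND: merge two sorted postings lists, keeping the common doc numbers."""
    a, b = index.get(term_a, []), index.get(term_b, [])
    i = j = 0
    result = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return result

# Hypothetical toy collection (an assumption for this sketch).
docs = {
    1: "quick brown fox",
    2: "now is the time",
    3: "quick brown fox jump over lazy dog",
}
index = build_inverted_index(docs)
print(index["quick"])                      # postings for "quick"
print(boolean_and(index, "brown", "fox"))  # documents containing both terms
```

Because each postings list is kept sorted, the AND query needs only one pass over the two lists.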

Computational Complexity
Time complexity: how long will it take:
  – At index-creation time?
  – At query time?
Space complexity: how much memory is needed:
  – In RAM?
  – On disk?

Linear Dictionary Lookup
Suppose we want to find the word “complex” in an unsorted dictionary:
relaxation, astronomical, zebra, belligerent, subterfuge, daffodil, cadence, wingman, loiter, peace, arcade, respondent, complex, tax, kingdom, jambalaya
Found it!
Worst-case time: proportional to the number of dictionary entries
This algorithm is O(n) (a “linear time” algorithm)

With a Sorted Dictionary
Let’s try again, except this time with a sorted dictionary: find “complex”
arcade, astronomical, belligerent, cadence, complex, daffodil, jambalaya, kingdom, loiter, peace, relaxation, respondent, subterfuge, tax, wingman, zebra
Found it!
Worst-case time: proportional to the number of halvings (1, 2, 4, 8, … 1024, 2048, 4096, …)
We call this “binary search”
  – An “O(log n) time” algorithm
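
The two lookups can be compared directly by counting how many entries each one inspects. The sketch below uses the sixteen words from the slides; the function names `linear_lookup` and `binary_lookup` are illustrative, and the comparison counter is added only to make the difference visible.

```python
def linear_lookup(words, target):
    """Scan entries in order; return (position, entries inspected), position -1 if absent."""
    for inspected, word in enumerate(words, start=1):
        if word == target:
            return inspected - 1, inspected
    return -1, len(words)

def binary_lookup(sorted_words, target):
    """Halve the search range each step; return (position, entries inspected)."""
    lo, hi = 0, len(sorted_words) - 1
    inspected = 0
    while lo <= hi:
        mid = (lo + hi) // 2
        inspected += 1
        if sorted_words[mid] == target:
            return mid, inspected
        if sorted_words[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1, inspected

unsorted_words = [
    "relaxation", "astronomical", "zebra", "belligerent", "subterfuge",
    "daffodil", "cadence", "wingman", "loiter", "peace", "arcade",
    "respondent", "complex", "tax", "kingdom", "jambalaya",
]
sorted_words = sorted(unsorted_words)

print(linear_lookup(unsorted_words, "complex"))  # (12, 13): 13 entries inspected
print(binary_lookup(sorted_words, "complex"))    # (4, 4): 4 entries inspected
```

With only 16 entries the gap is small, but the linear count grows with n while the binary count grows with the number of halvings.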

“Asymptotic” Complexity

Term Index Size
Heaps’ Law predicts vocabulary size: $V = K n^{\beta}$
  – V is the vocabulary size
  – n is the number of documents
  – K and $\beta$ are constants
The term index will usually fit in RAM
  – For a collection of any size
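
A quick calculation shows why the vocabulary grows slowly. The values of K and β below are purely illustrative assumptions (the slide does not give them), and `heaps_vocabulary_size` is a made-up helper name; with β = 0.5, a collection 100 times larger has a vocabulary only 10 times larger.

```python
def heaps_vocabulary_size(n, K, beta):
    """Heaps' Law estimate of vocabulary size: V = K * n**beta."""
    return K * n ** beta

# Assumed constants for illustration only; real values must be fit to a collection.
K, beta = 50, 0.5

small = heaps_vocabulary_size(1_000_000, K, beta)
large = heaps_vocabulary_size(100_000_000, K, beta)
print(round(small), round(large), round(large / small))
```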

Building a Term Index
Simplest solution is a single sorted array
  – Fast lookup using binary search
  – But sorting is expensive [it’s O(n log n)]
  – And adding one document means starting over
Tree structures allow easy insertion
  – But the worst-case lookup time is O(n)
Balanced trees provide the best of both
  – Fast lookup [O(log n)] and easy insertion [O(log n)]
  – But they require 45% more disk space
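
The O(n) worst case for a plain (unbalanced) tree is easy to provoke: insert the terms in sorted order and the tree degenerates into a chain. The sketch below is a minimal unbalanced binary search tree written for this illustration; `Node`, `insert`, `lookup_depth`, `build`, and the synthetic term names are all assumptions, and no balancing scheme is implemented.

```python
import random

class Node:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None

def insert(root, key):
    """Insert key into an unbalanced binary search tree and return the root."""
    if root is None:
        return Node(key)
    node = root
    while True:
        if key == node.key:
            return root  # already present
        side = "left" if key < node.key else "right"
        child = getattr(node, side)
        if child is None:
            setattr(node, side, Node(key))
            return root
        node = child

def lookup_depth(root, key):
    """Number of nodes visited to find key, or None if the key is absent."""
    visited, node = 0, root
    while node is not None:
        visited += 1
        if key == node.key:
            return visited
        node = node.left if key < node.key else node.right
    return None

def build(keys):
    root = None
    for key in keys:
        root = insert(root, key)
    return root

terms = [f"term{i:04d}" for i in range(1000)]  # synthetic, already sorted
chain = build(terms)                           # sorted insertion: every node has only a right child

shuffled = terms[:]
random.Random(0).shuffle(shuffled)
bushy = build(shuffled)                        # random insertion order: much shallower tree

print(lookup_depth(chain, "term0999"))         # 1000 nodes visited
print(max(lookup_depth(bushy, t) for t in terms))
```

A balanced tree guarantees the shallow shape regardless of insertion order, which is the "best of both" case on the slide.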

Postings File Size
Fairly compact for Boolean retrieval
  – About 10% of the size of the documents
Not much larger for ranked retrieval
  – Perhaps 20%
Enormous for proximity operators
  – Sometimes larger than the documents!
Most postings must be stored on disk

Large Postings Cause Slow Queries
Disks are 200,000 times slower than RAM!
  – Typical RAM: size: 2 GB, access time: 50 ns
  – Typical disk: size: 1 TB, access time: 10 ms
Smaller postings require fewer disk reads
Two strategies for reducing postings size:
  – Stopword removal
  – Index compression
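
The 200,000 figure follows directly from the two access times on the slide. The short calculation below works in integer nanoseconds to avoid rounding; the helper `disk_wait_ms` and the example of 5 disk reads are hypothetical additions for illustration.

```python
RAM_ACCESS_NS = 50                # 50 ns per RAM access (from the slide)
DISK_ACCESS_NS = 10 * 1_000_000   # 10 ms per disk access, expressed in ns

slowdown = DISK_ACCESS_NS // RAM_ACCESS_NS
print(slowdown)  # 200000

def disk_wait_ms(num_disk_reads, per_read_ms=10):
    """Rough time spent waiting on the disk, ignoring caching and transfer time."""
    return num_disk_reads * per_read_ms

print(disk_wait_ms(5))  # a query needing 5 disk reads waits about 50 ms
```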

Zipf’s “Long Tail” Law
For many distributions, the rank r of an element (r = 1 for the most frequent) is related to its frequency f by: $f = c / r$, or equivalently $r \cdot f = c$
  – f = frequency, r = rank, c = constant
Only a few words occur very frequently
  – Very frequent words are rarely useful query terms
  – Stopword removal yields faster query processing

Word Frequency in English Frequency of 50 most common words in English (sample of 19 million words)

Demonstrating Zipf’s Law
The following shows r*(f/n)*1000, where:
  – r is the rank of word w in the sample
  – f is the frequency of word w in the sample
  – n is the total number of word occurrences in the sample
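
The quantity r*(f/n)*1000 can be computed for any token sample. The sketch below assumes whitespace tokenization; the function name `zipf_table` and the tiny sample sentence are made up for this example, and a real demonstration would need a far larger sample. If Zipf's law holds, the last column stays roughly constant across ranks.

```python
from collections import Counter

def zipf_table(tokens, top=10):
    """Return rows of (rank, word, frequency, r*(f/n)*1000) for the most common words."""
    n = len(tokens)
    rows = []
    for r, (word, f) in enumerate(Counter(tokens).most_common(top), start=1):
        rows.append((r, word, f, r * (f / n) * 1000))
    return rows

# Tiny made-up sample: 11 tokens, "the" occurs 4 times and "and" twice.
tokens = "the cat and the dog and the bird saw the fish".split()
rows = zipf_table(tokens, top=2)
for r, word, f, value in rows:
    print(r, word, f, round(value, 1))
```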

Index Compression
CPUs are much faster than disks
  – A disk can transfer 1,000 bytes in ~20 ms
  – The CPU can do ~10 million instructions in that time
Compressing the postings file is a big win
  – Trade decompression time for fewer disk reads
Key idea: reduce redundancy
  – Trick 1: store relative offsets (some will be the same)
  – Trick 2: use a near-optimal coding scheme

Compression Example
Raw postings: 7 one-byte Doc-IDs (56 bits)
  37, 42, 43, 48, 97, 98, 243
Difference encoding (e.g., 42-37=5)
  37, 5, 1, 5, 49, 1, 145
Variable-length binary Huffman code
  0:1, 10:5, 110:37, 1110:49, 1111:145
Compressed postings (17 bits; 30% of raw)
  110 10 0 10 1110 0 1111
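
The example can be replayed end to end. The sketch below takes the code table from the slide as given (it does not construct the code), represents bits as a string of '0' and '1' characters for readability, and uses illustrative function names (`to_gaps`, `from_gaps`, `encode`, `decode`). Decoding works because no codeword is a prefix of another.

```python
# Code table from the slide: value -> codeword.
CODE = {1: "0", 5: "10", 37: "110", 49: "1110", 145: "1111"}

def to_gaps(doc_ids):
    """Keep the first doc-ID, then store each later ID as the difference from its predecessor."""
    if not doc_ids:
        return []
    return [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]

def from_gaps(gaps):
    """Invert to_gaps by taking running sums."""
    ids, total = [], 0
    for gap in gaps:
        total += gap
        ids.append(total)
    return ids

def encode(gaps, code=CODE):
    return "".join(code[gap] for gap in gaps)

def decode(bits, code=CODE):
    """Read bits left to right, emitting a value whenever a full codeword has been seen."""
    inverse = {word: value for value, word in code.items()}
    gaps, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:
            gaps.append(inverse[current])
            current = ""
    if current:
        raise ValueError("trailing bits do not form a codeword")
    return gaps

postings = [37, 42, 43, 48, 97, 98, 243]
gaps = to_gaps(postings)   # [37, 5, 1, 5, 49, 1, 145]
bits = encode(gaps)        # 17 bits
print(gaps, bits, len(bits))
print(from_gaps(decode(bits)) == postings)
```

The 17 bits replace 56 bits of raw one-byte Doc-IDs; the code table itself would also have to be stored, which this sketch ignores.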

Summary
Slow indexing yields fast query processing
  – Key fact: most terms don’t appear in most documents
We use extra disk space to save query time
  – Index space is in addition to document space
  – Time and space complexity must be balanced
Disk reads are the critical resource
  – This makes index compression a big win