Today's Agenda
- Search engines: What are the main challenges in building a search engine?
- Structure of the data index: naïve solutions and their problems; the lexicon (dictionary) + inverted index idea
- The dictionary: expected size (Heaps' Law, Zipf's Law) and storage solutions

What do you think?

Some of the Main Challenges
- Speed of answer: the Web is huge and there are many users.
- Ranking: How can the search engine make sure to return the "best" pages at the top?
- Coverage: How can a search engine be sure that it covers a sufficiently large portion of the web (including the hidden Web)?
- Storage: Data from web pages is stored locally at the search engine. How can so much information be stored using a reasonable amount of memory?

Search Engine Components
- Index Repository: storage of web pages (and additional data).
- Indexer: program that gets a web page (found by the crawler) and inserts the data from the page into the Index Repository.
- Crawler: program that "crawls" the web to find web pages.
- Note that the Crawler and Indexer are constantly running in the "background". They are NOT run for specific user queries.

Search Engine Components (cont.)
- Query Processor: gets the query from the user interface and finds satisfying documents in the Index Repository.
- Ranker: ranks the documents found according to how well they "match" the query.
- User Interface: what the user sees.

[Diagram: search engine architecture – Web, Crawler, Indexer, Index Repository, Query Processor, Ranker]

[Same architecture diagram] Many users query (and many crawlers crawl) in parallel. Challenge: coordinating among all these processes.

Information Retrieval
Information Retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).

[Charts: unstructured (text) vs. structured (database) data, compared at two points in time]

Brainstorming
[Diagram: the Index Repository]

The Problem
- We want to store (information about) a lot of pages.
- Given a list of words, we want to find the documents containing all of the words.
- Note this simplification – we assume that the user task is exactly reflected in the query! Ignore ranking for now.
- Tradeoff dimensions: speed, memory size, and the types of queries to be supported.

Typical System Parameters (2007)
  Average seek time:              5 ms    = 5×10^-3 s
  Transfer time per byte:         0.02 µs = 2×10^-8 s
  Low-level processor operation:  0.01 µs = 10^-8 s
  Size of main memory:            several GBs
  Size of disk space:             1 TB
Bottom line: seek and transfer are expensive operations! Try to avoid them as much as possible.

Ideas?

Option 1: Store "As Is"
- Pages are stored "as is", as files in the file system.
- We can find words in the files using a grep-style tool.
- Suppose we have 10MB of text stored contiguously. How long will it take to read the data?
- Suppose we have 1GB of text stored in 100 contiguous chunks. How long will it take to read the data?
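A back-of-the-envelope sketch of those two questions, using the 2007 parameters above and assuming one seek per contiguous chunk (the slides leave the exact answer open):

```python
# Rough read-time estimates from the 2007 system parameters above.
SEEK = 5e-3          # average seek time: 5 ms
TRANSFER = 2e-8      # transfer time per byte: 0.02 microseconds

def read_time(total_bytes, chunks):
    """One seek per contiguous chunk, then sequential transfer."""
    return chunks * SEEK + total_bytes * TRANSFER

print(read_time(10 * 10**6, 1))    # 10 MB, contiguous:  ~0.205 s
print(read_time(10**9, 100))       # 1 GB in 100 chunks: ~20.5 s
```

Either way, answering a single query by scanning all the stored text is dominated by seeks and transfers.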

What do you think?
- Are queries processed quickly?
- Is this space efficient?

Option 2: Relational Database
Model A:
  DocID | Doc
  ------+---------------------------------------------
  1     | Rain, rain, go away...
  2     | The rain in Spain falls mainly in the plain
- How would we find documents containing rain? Rain and Spain? Rain and not Spain?
- Is this better or worse than using the file system with grep?

DB: Other Ways to Model the Data
Model B:
  APPEARS(DocID, Word, ...)
Model C:
  WORD_INDEX(Word, Wid, ...)
  APPEARS(DocId, Wid, ...)
Two options. Which is better?

Relational Database Example
  DocID 1: The rain in Spain falls mainly on the plain.
  DocID 2: Rain, rain go away.

Relational Database Example (cont.)
[Tables: WORD_INDEX and APPEARS populated from the two documents]
Note the case-folding. More about this later.

Is This a Good Idea?
- Does it save more space than saving as files? Depends on word frequency! Why?
- How are queries processed? Example query: rain
    SELECT DocId
    FROM WORD_INDEX W, APPEARS A
    WHERE W.Wid = A.Wid AND W.Word = 'rain'
- How can we answer the queries: rain and go? rain and not Spain?
- Is Model C better than Model A?
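One way to answer those two queries over Model C, sketched with Python's sqlite3 on a toy instance. The table and column names follow the slides; the data loading and specific SQL formulations are illustrative assumptions, not the slides' official answer:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE WORD_INDEX (Wid INTEGER, Word TEXT);
    CREATE TABLE APPEARS (DocId INTEGER, Wid INTEGER);
""")
# Toy data: doc 1 = "the rain in spain ...", doc 2 = "rain rain go away".
con.executemany("INSERT INTO WORD_INDEX VALUES (?, ?)",
                [(1, 'rain'), (2, 'spain'), (3, 'go')])
con.executemany("INSERT INTO APPEARS VALUES (?, ?)",
                [(1, 1), (1, 2), (2, 1), (2, 3)])

# "rain AND go": one join per query term (k terms => a k-way self-join).
print(con.execute("""
    SELECT A1.DocId
    FROM WORD_INDEX W1, APPEARS A1, WORD_INDEX W2, APPEARS A2
    WHERE W1.Word = 'rain' AND W1.Wid = A1.Wid
      AND W2.Word = 'go'   AND W2.Wid = A2.Wid
      AND A1.DocId = A2.DocId
""").fetchall())   # [(2,)]

# "rain AND NOT spain": negation via NOT EXISTS.
print(con.execute("""
    SELECT A.DocId
    FROM WORD_INDEX W, APPEARS A
    WHERE W.Word = 'rain' AND W.Wid = A.Wid
      AND NOT EXISTS (SELECT 1 FROM WORD_INDEX W2, APPEARS A2
                      WHERE W2.Word = 'spain' AND W2.Wid = A2.Wid
                        AND A2.DocId = A.DocId)
""").fetchall())   # [(2,)]
```

Note how each additional query term costs another join, which is exactly the awkwardness the next slide complains about.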

Is it good to use a relational DB?
- If a word appears in a thousand documents, its wid will be repeated 1000 times. Why waste the space?
- If a word appears in a thousand documents, we will have to access a thousand rows in order to find the documents.
- It does not easily support queries that require multiple words.
- Note: some databases have special support for textual queries, via special-purpose indices.

Option 3: Bitmaps
- There is a vector of 1s and 0s for each word: bit j is 1 if the word appears in document j.
- Queries are computed using bitwise operations on the vectors – efficiently implemented in hardware.

Option 3: Bitmaps (cont.)
How would you compute:
  Q1 = rain
  Q2 = rain AND Spain
  Q3 = rain OR NOT Spain
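A sketch of the three queries using Python integers as bit vectors (bit j stands for document j; the two toy documents are carried over from the relational example above):

```python
# One bit vector per word; bit j is 1 iff the word appears in document j.
# Toy collection: doc 0 = "the rain in spain ...", doc 1 = "rain rain go away".
NUM_DOCS = 2
ALL = (1 << NUM_DOCS) - 1        # mask of all documents, needed for NOT

rain  = 0b11                     # appears in docs 0 and 1
spain = 0b01                     # appears in doc 0 only

q1 = rain                        # Q1 = rain
q2 = rain & spain                # Q2 = rain AND Spain
q3 = rain | (ALL & ~spain)       # Q3 = rain OR NOT Spain

def docs(bits):
    return [j for j in range(NUM_DOCS) if bits >> j & 1]

print(docs(q1), docs(q2), docs(q3))   # [0, 1] [0] [0, 1]
```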

Bitmap Tradeoffs
- Bitmaps can be processed efficiently.
- However, they have high memory requirements. Example: 1M documents, each with 1K terms; 500K distinct terms in total. What is the size of the matrix? How many 1s will it have?
- Summary: a lot of wasted space for the 0s.
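Working those numbers out (the slide leaves the answer open; this is one straightforward reading of it):

```python
docs, terms_per_doc, vocab = 10**6, 10**3, 5 * 10**5

matrix_bits = vocab * docs                 # 5e11 bits in the term-document matrix
print(matrix_bits / 8 / 2**30)             # ~58 GiB (~62.5 GB) just for the matrix
ones = docs * terms_per_doc                # at most 1e9 entries are 1
print(ones / matrix_bits)                  # <= 0.002, i.e. at most 0.2% are 1s
```

So at least 99.8% of the matrix is 0s: this is the wasted space the summary refers to.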

A Good Solution

Two Structures
- Dictionary: a list of all terms in the documents. For each term, we store a pointer to its list in the inverted file.
- Inverted Index: for each term in the dictionary, an inverted list that stores pointers to all occurrences of the term in the documents.
  - Usually, pointers = document numbers.
  - Usually, the pointers are sorted.
  - Sometimes term locations within documents are also stored. (Why?)

Example
  Doc 1: A B C
  Doc 2: E B D
  Doc 3: A B D F
Dictionary (Lexicon) -> Posting Lists (Inverted Index):
  A -> 1, 3
  B -> 1, 2, 3
  C -> 1
  D -> 2, 3
  E -> 2
  F -> 3
How do you find the documents with A and D? The Devil is in the Details!
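A minimal sketch of the two structures and the standard sorted-list intersection ("merge") for the A AND D query; the data structures and names are one illustrative choice:

```python
from collections import defaultdict

docs = {1: "A B C", 2: "E B D", 3: "A B D F"}

# Dictionary maps each term to its posting list (doc IDs kept sorted).
index = defaultdict(list)
for doc_id in sorted(docs):
    for term in set(docs[doc_id].split()):
        index[term].append(doc_id)

def intersect(p1, p2):
    """Merge two sorted posting lists in O(len(p1) + len(p2))."""
    i = j = 0
    out = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i]); i += 1; j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return out

print(intersect(index["A"], index["D"]))   # [3]
```

Keeping the posting lists sorted is what makes the merge linear; this is one reason the slides note that pointers are usually sorted.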

Compression
- Use less disk space: saves (a little) money.
- Keep more stuff in memory: increases speed.
- Increase the speed of data transfer from disk to memory: [read compressed data + decompress] is faster than [read uncompressed data]. Premise: decompression algorithms are fast.

Why compression for the Index Repository?
- Dictionary:
  - Make it small enough to keep in main memory.
  - Make it so small that you can keep some postings lists in main memory too.
- Postings file(s):
  - Reduce the disk space needed.
  - Decrease the time needed to read postings lists from disk.
  - Large search engines keep a significant part of the postings in memory; compression lets you keep more in memory.

How big will the dictionary be?

Vocabulary vs. collection size
- How big is the term vocabulary? That is, how many distinct words are there?
- Can we assume an upper bound? In practice, the vocabulary keeps growing with the collection size.

Vocabulary vs. collection size
- Heaps' law: M = kT^b
  - M is the size of the vocabulary, T is the number of tokens in the collection.
  - Typical values: 30 ≤ k ≤ 100 and b ≈ 0.5.
- In a log-log plot of vocabulary size M vs. T, Heaps' law predicts a line with slope about ½.
- An empirical finding ("empirical law").

Heaps' Law for RCV1
- For RCV1, the dashed line log10(M) = 0.49·log10(T) + 1.64 is the best least-squares fit. Thus M = 10^1.64 · T^0.49, so k = 10^1.64 ≈ 44 and b = 0.49.
- Good empirical fit for Reuters RCV1! For the first 1,000,020 tokens, the law predicts 38,323 terms; the actual number is 38,365 terms.

Try It
- Heaps' law: M = kT^b
- Compute the vocabulary size M for this scenario:
  - Looking at a collection of web pages, you find that there are 3,000 different terms in the first 10,000 tokens and 30,000 different terms in the first 1,000,000 tokens.
  - Assume a search engine indexes a total of 20,000,000,000 (2 × 10^10) pages, containing 200 tokens on average.
  - What is the size of the vocabulary of the indexed collection, as predicted by Heaps' law?
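One possible worked solution, fitting k and b from the two measurements (the slide itself leaves the answer open):

```python
from math import log10

# Fit from: 3000 = k * (10**4)**b  and  30000 = k * (10**6)**b.
# Dividing the equations: 10 = (10**2)**b  =>  b = 0.5; then k = 3000 / 10**2 = 30.
b = log10(30000 / 3000) / log10(10**6 / 10**4)   # 0.5
k = 3000 / (10**4) ** b                          # 30.0

T = 2 * 10**10 * 200                             # 4e12 tokens in the whole collection
M = k * T ** b
print(b, k, M)                                   # 0.5 30.0 60000000.0 (~60M terms)
```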

Try It
- Heaps' law: M = kT^b
- Suppose that you know that the parameter b for a specific collection is ½. What percentage of the text do you have to read in order to see 90% of the different words in the text?
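A worked answer under the same law (again, the slide leaves it open):

```python
# With b = 1/2, reading a fraction x of the T tokens yields
# M(x*T) / M(T) = (x*T)**0.5 / T**0.5 = x**0.5 of the vocabulary.
# We want x**0.5 = 0.9, so x = 0.9**2 = 0.81: reading 81% of the text suffices.
x = 0.9 ** 2
print(x)   # 0.81
```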

Zipf's law
- Heaps' law gives the vocabulary size in collections. We also study the relative frequencies of terms.
- In natural language, there are a few very frequent terms and very many very rare terms.
- Zipf's law: the i-th most frequent term has frequency proportional to 1/i: cf_i = K/i, where K is a normalizing constant and cf_i is the collection frequency, the number of occurrences of the term t_i in the collection.

Zipf consequences
- If the most frequent term (the) occurs cf_1 times, then:
  - the second most frequent term (of) occurs cf_1/2 times,
  - the third most frequent term (and) occurs cf_1/3 times, ...
- Equivalently: cf_i = K/i where K is a normalizing factor, so log(cf_i) = log(K) - log(i): a linear relationship between log(cf_i) and log(i).

Zipf's law for Reuters RCV1
[Figure: log-log plot of term rank vs. collection frequency]

Try It
- Suppose that t_2, the second most common word in the text, appears 10,000 times. How many times will t_10 appear?
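A worked answer under Zipf's law (the slide leaves it open):

```python
# Zipf: cf_i = K / i. From cf_2 = 10000 we get K = 2 * 10000 = 20000,
# so cf_10 = K / 10 = 2000: t_10 appears about 2,000 times.
K = 2 * 10_000
print(K / 10)   # 2000.0
```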

Data Structures

The Dictionary
- Assumptions: we are interested in simple queries: no phrases, no wildcards.
- Goals: efficient (i.e., logarithmic) access; small size (fits in main memory).
- We want to store, per term: the word, the address of its inverted index entry, and the length of the inverted index entry = the word frequency. (Why?)

Why compress the dictionary?
- Search begins with the dictionary, and we want to keep it in memory.
- Its memory footprint competes with other applications.
- Embedded/mobile devices may have very little memory.
- Even if the dictionary isn't in memory, we want it to be small for a fast search startup time.
- So, compressing the dictionary is important.

Dictionary storage – first cut
- Array of fixed-width entries: ~400,000 terms; 28 bytes/term = 11.2 MB.
- Per entry: 20 bytes for the term, plus 4 bytes each for the frequency and the postings pointer.

Fixed-width terms are wasteful
- Most of the bytes in the Term column are wasted – we allot 20 bytes even for 1-letter terms. And we still can't handle supercalifragilisticexpialidocious.
- Question: written English averages ~4.5 characters/word. Why is/isn't this the number to use for estimating the dictionary size?
- Question: the average dictionary word in English is ~8 characters. How do we get to ~8 characters per dictionary term?

Compressing the term list: Dictionary-as-a-String
- Store the dictionary as a (long) string of characters; the pointer to the next word shows the end of the current word.
- Hope to save up to 60% of dictionary space.
    ...systilesyzygeticsyzygialsyzygyszaibelyiteszczecinszomo...
- Total string length = 400K × 8B = 3.2MB.
- Pointers must resolve 3.2M positions: log2(3.2M) ≈ 22 bits = 3 bytes.

Space for dictionary-as-a-string
- How do we know where terms end? How do we search the dictionary? What is the size?
- 4 bytes per term for the frequency.
- 4 bytes per term for the pointer to the postings.
- 3 bytes per term pointer (into the string).
- Avg. 8 bytes per term in the term string.
- 400K terms × 19 bytes → 7.6 MB (as opposed to 11.2 MB for fixed width).

Blocking
- Blocks of size k: store pointers to every k-th term string. Example below: k = 4.
- Need to store term lengths (1 extra byte per term):
    ...7systile9syzygetic8syzygial6syzygy11szaibelyite8szczecin9szomo...
- Per block: save 9 bytes on 3 pointers, lose 4 bytes on term lengths.

Net Size
- Example for block size k = 4: what is the size now?
- What about using a larger k? Advantages? Disadvantages?
- How much query slowdown is expected? We look at an example.
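Working out the k = 4 case from the per-block savings above (a sketch of the arithmetic the slide asks for):

```python
terms, block = 400_000, 4

saved_per_block = 3 * 3 - 4 * 1      # drop 3 pointers (3B each), add 4 length bytes
saved = terms // block * saved_per_block
print(saved / 10**6)                 # 0.5 MB saved in total
print(7.6 - saved / 10**6)           # ~7.1 MB instead of 7.6 MB
```

A larger k saves more pointer bytes but lengthens the linear scan inside each block, which is the slowdown examined next.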

Dictionary search without blocking
- [Figure: binary search tree over 8 terms; double arrows indicate traversal during binary search]
- Assuming each dictionary term is equally likely in a query (not really so in practice!), the average number of comparisons = (1 + 2·2 + 4·3 + 4)/8 ≈ 2.6.

Dictionary search with blocking
- Binary search down to the 4-term block, then linear search (single arrows) through the terms in the block.
- Blocks of 4 (binary tree): avg. = (1 + 2·2 + 2·3 + 2·4 + 5)/8 = 3 comparisons.

Front Coding
- Adjacent words (in sorted order) tend to have common prefixes. Why?
- The size of the string can be reduced if we take advantage of the common prefixes.
- With front coding we: remove the common prefixes; store the common prefix size; store a pointer into the concatenated string.

Front Coding Example
- Terms: jezebel, jezer, jezerit, jeziah, jeziel
- Concatenated string (suffixes only): ...ebelritiahel...
- Each dictionary entry stores: freq, postings pointer, string pointer, prefix size.

Front Coding Example (cont.)
- String: ...ebelritiahel...; each entry for a term t stores its freq, the disk address of t's postings, the address of term t in the string, and the prefix size.
- Assume 1 byte to store the prefix size.
- Assuming 3 letters in the common prefix (on average), what is the size of the dictionary? What is the search time?

3-in-4 Front Coding
- Front coding saves space, but binary search of the index is no longer possible.
- To allow for binary search, "3-in-4" front coding can be used: in every block of 4 words, the first is given completely, and the other three are front-coded.
- Binary search can be based on the complete words to find the correct block.
- This combines the ideas of blocking and front coding. What will be the size now?
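A minimal sketch of 3-in-4 front coding over a sorted term list (encoding only; the function name and the (prefix size, suffix) tuple layout are illustrative choices):

```python
def front_code_3_in_4(terms, k=4):
    """Every k-th term is stored in full; each other term stores
    (prefix_len, suffix) relative to the previous term in its block."""
    blocks = []
    for start in range(0, len(terms), k):
        block = terms[start:start + k]
        entries = [block[0]]                     # first term: stored completely
        for prev, cur in zip(block, block[1:]):
            p = 0
            while p < min(len(prev), len(cur)) and prev[p] == cur[p]:
                p += 1
            entries.append((p, cur[p:]))         # front-coded entry
        blocks.append(entries)
    return blocks

terms = ["jezebel", "jezer", "jezerit", "jeziah", "jeziel"]
print(front_code_3_in_4(terms))
# [['jezebel', (4, 'r'), (5, 'it'), (3, 'iah')], ['jeziel']]
```

Binary search compares against the full leading word of each block, then the three front-coded entries are decoded linearly.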

Think about and analyze these solutions

Trie
- Runtime? Size?
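A minimal trie sketch to make the runtime question concrete: lookup takes O(length of the word) character steps, independent of the number of terms, while the per-node child maps dominate the size. The node layout is one illustrative choice, not the slides' definition:

```python
class TrieNode:
    def __init__(self):
        self.children = {}    # char -> TrieNode
        self.postings = None  # set on nodes that end a dictionary term

def insert(root, word, postings_ptr):
    node = root
    for ch in word:
        node = node.children.setdefault(ch, TrieNode())
    node.postings = postings_ptr

def lookup(root, word):
    """O(len(word)) steps, regardless of dictionary size."""
    node = root
    for ch in word:
        node = node.children.get(ch)
        if node is None:
            return None
    return node.postings

root = TrieNode()
insert(root, "rain", [1, 2])
insert(root, "spain", [2])
print(lookup(root, "rain"), lookup(root, "rail"))   # [1, 2] None
```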

Patricia Trie (a trie in which chains of single-child nodes are collapsed into one edge)
- Runtime? Size?

Hashtable
- Runtime? Size?