Today's Agenda
- Search engines: what are the main challenges in building a search engine?
- Structure of the data index
  - Naïve solutions and their problems
  - Lexicon (dictionary) + inverted index idea
- Dictionary
  - Expected size: Heaps' law, Zipf's law
  - Storage solutions
What do you think?
Some of the Main Challenges
- Speed of answer: huge Web, many users.
- Ranking: how can the search engine make sure to return the "best" pages at the top?
- Coverage: how can a search engine be sure that it covers a sufficiently large portion of the web? (The hidden Web.)
- Storage: data from web pages is stored locally at the search engine. How can so much information be stored using a reasonable amount of memory?
Search Engine Components
- Index Repository: storage of web pages (and additional data).
- Indexer: program that gets a web page (found by the crawler) and inserts the data from the page into the Index Repository.
- Crawler: program that "crawls" the web to find web pages.
Note that the Crawler and Indexer are constantly running in the background. They are NOT run for specific user queries.
Search Engine Components (cont.)
- Query Processor: gets the query from the user interface and finds satisfying documents in the index repository.
- Ranker: ranks the documents found according to how well they "match" the query.
- User Interface: what the user sees.
[Architecture diagram: Web → Crawler → Indexer → Index Repository → Query Processor → Ranker]
[Same architecture diagram] Many users querying (and many crawlers crawling) run in parallel. Challenge: coordinate between all these processes.
Information Retrieval
Information Retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).
[Chart: unstructured (text) vs. structured (database) data in 1996]
[Chart: unstructured (text) vs. structured (database) data in 2009]
Brainstorming: the Index Repository
The Problem
- We want to store (information about) a lot of pages.
- Given a list of words, we want to find the documents containing all of the words.
  - Note this simplification: assume that the user's task is exactly reflected in the query!
  - Ignore ranking for now.
- Tradeoff dimensions: speed, memory size, types of queries to be supported.
Typical System Parameters (2007)
- Average seek time: 5 ms = 5×10^-3 s
- Transfer time per byte: 0.02 µs = 2×10^-8 s
- Low-level processor operation: 0.01 µs = 10^-8 s
- Size of main memory: several GBs
- Size of disk space: 1 TB
Bottom line: seek and transfer are expensive operations! Try to avoid them as much as possible.
Ideas?
Option 1: Store "As Is"
- Pages are stored "as is" as files in the file system.
- We can find words in the files using a grep-style tool.
- Suppose we have 10 MB of text stored contiguously. How long will it take to read the data?
- Suppose we have 1 GB of text stored in 100 contiguous chunks. How long will it take to read the data?
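Using the 2007 system parameters from the earlier slide (5 ms average seek, 2×10^-8 s transfer per byte), the two read-time questions can be estimated with a one-line cost model: one seek per contiguous chunk, plus the transfer time for all bytes.

```python
SEEK = 5e-3      # average seek time, in seconds
TRANSFER = 2e-8  # transfer time per byte, in seconds

def read_time(total_bytes, chunks):
    """Seek once per contiguous chunk, then stream all the bytes."""
    return chunks * SEEK + total_bytes * TRANSFER

print(read_time(10 * 10**6, 1))  # 10 MB contiguous: about 0.205 s
print(read_time(10**9, 100))     # 1 GB in 100 chunks: about 20.5 s
```

Note that for the contiguous 10 MB the seek is negligible next to the transfer, while for the fragmented 1 GB the 100 seeks add a visible half second on top of 20 s of transfer.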
What do you think?
- Are queries processed quickly?
- Is this space efficient?
Option 2: Relational Database
Model A:
DocID | Doc
1 | Rain, rain, go away...
2 | The rain in Spain falls mainly in the plain
- How would we find documents containing rain? Rain and Spain? Rain and not Spain?
- Is this better or worse than using the file system with grep?
DB: Other Ways to Model the Data
Model B: APPEARS(DocID, Word, ...)
Model C: WORD_INDEX(Word, Wid, ...) and APPEARS(DocID, Wid, ...)
Two options. Which is better?
Relational Database Example
DocID 1: The rain in Spain falls mainly on the plain.
DocID 2: Rain, rain go away.
Relational Database Example
[Tables WORD_INDEX and APPEARS populated from the two documents]
Note the case-folding. More about this later.
Is This a Good Idea?
- Does it save more space than saving as files? Depends on word frequency! Why?
- How are queries processed? Example query: rain
  SELECT DocID
  FROM WORD_INDEX W, APPEARS A
  WHERE W.Wid = A.Wid AND W.Word = 'rain'
- How can we answer the queries: rain and go? rain and not Spain?
- Is Model C better than Model A?
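A minimal sketch of Model C in SQLite shows how the multi-word queries come out as set operations over the single-word query. The schema and the wid assignments here are hypothetical, reduced to just the columns the slides name:

```python
import sqlite3

# Hypothetical minimal Model C: WORD_INDEX(word, wid), APPEARS(docid, wid),
# populated from the two example documents (case-folded).
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE WORD_INDEX (word TEXT, wid INTEGER);
CREATE TABLE APPEARS (docid INTEGER, wid INTEGER);
INSERT INTO WORD_INDEX VALUES ('rain', 1), ('spain', 2), ('go', 3);
INSERT INTO APPEARS VALUES (1, 1), (1, 3), (2, 1), (2, 2);
""")

def docs_with(word):
    """The single-word query from the slide, as a join of the two tables."""
    rows = con.execute(
        "SELECT A.docid FROM WORD_INDEX W JOIN APPEARS A ON W.Wid = A.Wid "
        "WHERE W.word = ?", (word,))
    return {r[0] for r in rows}

print(docs_with('rain') & docs_with('go'))     # rain AND go -> {1}
print(docs_with('rain') - docs_with('spain'))  # rain AND NOT spain -> {1}
```

The AND/NOT queries could also be written in pure SQL (INTERSECT, EXCEPT), but either way each word costs a join, which is the slide's point about accessing a row per occurrence.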
Is it good to use a relational DB?
- If a word appears in a thousand documents, its wid will be repeated 1000 times. Why waste the space?
- If a word appears in a thousand documents, we will have to access a thousand rows in order to find the documents.
- It does not easily support queries that require multiple words.
Note: some databases have special support for textual queries, via special-purpose indices.
Option 3: Bitmaps
There is a vector of 1s and 0s for each word. Queries are computed using bitwise operations on the vectors, which are efficiently implemented in hardware.
Option 3: Bitmaps
How would you compute:
- Q1 = rain
- Q2 = rain and Spain
- Q3 = rain or not Spain
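A quick sketch of the three queries, using Python integers as bit vectors (bit i set means the word occurs in document i). The vectors below are the ones implied by the two example documents; the AND/OR/NOT of the queries map directly onto the bitwise operators:

```python
# Bit 0 = doc 1, bit 1 = doc 2 (toy vectors from the example documents).
rain  = 0b11  # rain appears in both documents
spain = 0b10  # spain appears only in doc 2
go    = 0b01  # go appears only in doc 1

ALL = 0b11    # mask of all documents, needed to complement for NOT

q1 = rain                   # rain                -> docs 1, 2
q2 = rain & spain           # rain AND spain      -> doc 2
q3 = rain | (ALL & ~spain)  # rain OR NOT spain   -> docs 1, 2
print(bin(q1), bin(q2), bin(q3))
```

The mask for NOT matters: a plain `~spain` on an unbounded integer would set infinitely many bits, so the complement must be taken relative to the set of existing documents.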
Bitmap Tradeoffs
- Bitmaps can be processed efficiently.
- However, they have high memory requirements. Example: 1M documents, each with 1K terms; 500K distinct terms in total.
  - What is the size of the matrix?
  - How many 1s will it have?
Summary: a lot of space is wasted on the 0s.
A Good Solution
Two Structures
- Dictionary: list of all terms in the documents. For each term, store a pointer to its list in the inverted file.
- Inverted index: for each term in the dictionary, an inverted list stores pointers to all occurrences of the term in the documents.
  - Usually, pointers = document numbers.
  - Usually, the pointers are sorted.
  - Sometimes we also store term locations within documents. (Why?)
Example
Doc 1: A B C
Doc 2: E B D
Doc 3: A B D F
Dictionary (Lexicon) and Posting Lists (Inverted Index):
A → 1, 3
B → 1, 2, 3
C → 1
D → 2, 3
E → 2
F → 3
How do you find documents with A and D?
The Devil is in the Details!
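The standard way to answer "A and D" is to walk the two sorted posting lists in lockstep, emitting the document numbers that appear in both. A minimal sketch over the example's lists:

```python
def intersect(p1, p2):
    """Merge two sorted postings lists in O(len(p1) + len(p2)) time."""
    i = j = 0
    out = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i]); i += 1; j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return out

postings = {'A': [1, 3], 'B': [1, 2, 3], 'C': [1],
            'D': [2, 3], 'E': [2], 'F': [3]}
print(intersect(postings['A'], postings['D']))  # -> [3]
```

This is why the slides insist the pointers be sorted: the merge only works, and only runs in linear time, on sorted lists.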
Compression
- Use less disk space: saves a little money.
- Keep more stuff in memory: increases speed.
- Increase the speed of data transfer from disk to memory: [read compressed data, then decompress] is faster than [read uncompressed data].
Premise: decompression algorithms are fast.
Why compression for the Index Repository?
- Dictionary:
  - Make it small enough to keep in main memory.
  - Make it so small that you can keep some postings lists in main memory too.
- Postings file(s):
  - Reduce the disk space needed.
  - Decrease the time needed to read postings lists from disk.
Large search engines keep a significant part of the postings in memory; compression lets you keep more in memory.
33
How big will the dictionary be? 33
Vocabulary vs. collection size
- How big is the term vocabulary? That is, how many distinct words are there?
- Can we assume an upper bound?
- In practice, the vocabulary keeps growing with the collection size.
Vocabulary vs. collection size
Heaps' law: M = kT^b
- M is the size of the vocabulary; T is the number of tokens in the collection.
- Typical values: 30 ≤ k ≤ 100 and b ≈ 0.5.
- In a log-log plot of vocabulary size M vs. T, Heaps' law predicts a line with slope of about ½.
- It is an empirical finding (an "empirical law").
Heaps' Law
For RCV1, the dashed line log10 M = 0.49 log10 T + 1.64 is the best least-squares fit. Thus M = 10^1.64 T^0.49, so k = 10^1.64 ≈ 44 and b = 0.49.
Good empirical fit for Reuters RCV1! For the first 1,000,020 tokens, the law predicts 38,323 terms; actually, there are 38,365 terms.
Try It
Heaps' law: M = kT^b
Compute the vocabulary size M for this scenario:
- Looking at a collection of web pages, you find that there are 3,000 different terms in the first 10,000 tokens and 30,000 different terms in the first 1,000,000 tokens.
- Assume a search engine indexes a total of 20,000,000,000 (2×10^10) pages, containing 200 tokens on average.
- What is the size of the vocabulary of the indexed collection, as predicted by Heaps' law?
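One way to work this exercise: the two measurements determine b as the slope between the points in log-log space, then k follows, and the law extrapolates to the full collection.

```python
import math

# Two measurements: M1 distinct terms in the first T1 tokens, M2 in the first T2.
T1, M1 = 10_000, 3_000
T2, M2 = 1_000_000, 30_000

b = math.log(M2 / M1) / math.log(T2 / T1)  # slope in log-log space: 0.5
k = M1 / T1 ** b                           # 3000 / 10000^0.5 = 30

T = 20_000_000_000 * 200  # 2e10 pages x 200 tokens/page = 4e12 tokens
M = k * T ** b            # 30 * sqrt(4e12) = 6e7
print(b, k, M)            # predicted vocabulary: 60 million terms
```

The ratio trick is the key step: M2/M1 = (T2/T1)^b, so b = log(10)/log(100) = 0.5 without any curve fitting.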
Try It
Heaps' law: M = kT^b
Suppose you know that the parameter b for a specific collection is ½. What percentage of the text do you have to read in order to see 90% of the different words in the text?
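For this one, the k cancels: after reading a fraction x of the T tokens, the fraction of vocabulary seen is k(xT)^b / (kT^b) = x^b, so we just solve x^b = 0.9.

```python
# With b = 1/2, the fraction of distinct words seen after reading a
# fraction x of the text is x^b. Solve x^b = 0.9 for x:
b = 0.5
x = 0.9 ** (1 / b)
print(x)  # 0.81: reading 81% of the text shows 90% of the distinct words
```

Note how sublinear growth bites here: the last 10% of the vocabulary hides in the last 19% of the text.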
Zipf's law
- Heaps' law gives the vocabulary size in collections. We also study the relative frequencies of terms.
- In natural language, there are a few very frequent terms and very many very rare terms.
- Zipf's law: the i-th most frequent term has frequency proportional to 1/i.
  - cf_i = K/i, where K is a normalizing constant.
  - cf_i is the collection frequency: the number of occurrences of the term t_i in the collection.
Zipf consequences
If the most frequent term (the) occurs cf_1 times, then
- the second most frequent term (of) occurs cf_1/2 times,
- the third most frequent term (and) occurs cf_1/3 times, and so on.
Equivalently: cf_i = K/i where K is a normalizing factor, so log cf_i = log K - log i. There is a linear relationship between log cf_i and log i.
[Plot: Zipf's law for Reuters RCV1]
Try It
Suppose that t_2, the second most common word in the text, appears 10,000 times. How many times will t_10 appear?
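The computation is two lines once you notice that any single frequency pins down the constant K = cf_i · i:

```python
# Zipf: cf_i = K / i. Knowing cf_2 fixes K, which then predicts cf_10.
cf2 = 10_000
K = cf2 * 2   # K = cf_i * i = 20,000
cf10 = K / 10
print(cf10)   # Zipf predicts 2,000 occurrences for the 10th most common word
```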
Data Structures
The Dictionary
Assumptions: we are interested in simple queries: no phrases, no wildcards.
Goals:
- Efficient (i.e., logarithmic) access.
- Small size (fits in main memory).
We want to store:
- the word,
- the address of its inverted index entry,
- the length of the inverted index entry = the word frequency. (Why?)
Why compress the dictionary?
- Search begins with the dictionary, so we want to keep it in memory.
- It competes for memory footprint with other applications.
- Embedded/mobile devices may have very little memory.
- Even if the dictionary isn't in memory, we want it to be small for a fast search startup time.
So, compressing the dictionary is important.
Dictionary storage - first cut
Array of fixed-width entries: Term (20 bytes), Freq. (4 bytes), Postings pointer (4 bytes).
~400,000 terms × 28 bytes/term = 11.2 MB.
Fixed-width terms are wasteful
- Most of the bytes in the Term column are wasted: we allot 20 bytes even for 1-letter terms. And we still can't handle supercalifragilisticexpialidocious.
- Question: written English averages ~4.5 characters/word. Why is/isn't this the number to use for estimating the dictionary size?
- Question: the average dictionary word in English is ~8 characters. How do we get down to ~8 characters per dictionary term?
Compressing the term list: Dictionary-as-a-String
Store the dictionary as one (long) string of characters; the pointer to the next word marks the end of the current word. We hope to save up to 60% of the dictionary space.
….systilesyzygeticsyzygialsyzygyszaibelyiteszczecinszomo….
- Total string length = 400K × 8 B = 3.2 MB.
- Pointers must resolve 3.2M positions: log2(3.2M) ≈ 22 bits ≈ 3 bytes.
Space for dictionary-as-a-string
How do we know where terms end? How do we search the dictionary? What is the size?
- 4 bytes per term for Freq.
- 4 bytes per term for the pointer to Postings.
- 3 bytes per term pointer (into the string).
- Avg. 8 bytes per term in the term string.
400K terms × 19 B ≈ 7.6 MB (as opposed to 11.2 MB for fixed width).
Blocking
Blocks of size k: store a term pointer only for every k-th term. Example below: k = 4. We now need to store term lengths (1 extra byte per term).
….7systile9syzygetic8syzygial6syzygy11szaibelyite8szczecin9szomo….
Per block, we save 9 bytes on 3 pointers but lose 4 bytes on term lengths.
Net Size
Example for block size k = 4: what is the size now?
What about using a larger k?
- Advantages? Disadvantages?
- How much query slowdown is expected? We look at an example.
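The three size estimates so far can be put side by side. This sketch just redoes the slides' arithmetic with the stated constants (400K terms, 20-byte fixed terms, 8-byte average term, 4-byte freq and postings pointer, 3-byte string pointer, 1-byte length for blocking):

```python
TERMS = 400_000
FREQ, PTR, TERM_PTR, AVG_LEN, LEN_BYTE = 4, 4, 3, 8, 1

fixed = TERMS * (20 + FREQ + PTR)                  # fixed-width entries
string = TERMS * (FREQ + PTR + TERM_PTR + AVG_LEN) # dictionary-as-a-string

k = 4  # block size: one string pointer per block, one length byte per term
blocked = TERMS * (FREQ + PTR + AVG_LEN + LEN_BYTE) + (TERMS // k) * TERM_PTR

for name, size in [("fixed", fixed), ("string", string), ("blocked", blocked)]:
    print(name, size / 10**6, "MB")  # 11.2, 7.6, and 7.1 MB respectively
```

Growing k keeps shaving pointer bytes, but with diminishing returns: the per-term costs (freq, postings pointer, length byte, characters) dominate, while the linear scan inside each block gets longer.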
Dictionary search without blocking
Double arrows indicate the traversal during binary search. Assuming each dictionary term is equally likely in a query (not really so in practice!), the average number of comparisons is (1 + 2·2 + 4·3 + 4)/8 ≈ 2.6.
Dictionary search with blocking
Binary search down to the 4-term block, then linear search (single arrows) through the terms in the block. With blocks of 4 (binary tree), the average is (1 + 2·2 + 2·3 + 2·4 + 5)/8 = 3 comparisons.
Front Coding
Adjacent words tend to have common prefixes. (Why?) The size of the string can be reduced if we take advantage of the common prefixes. With front coding we:
- remove the common prefixes,
- store the common prefix size,
- store a pointer into the concatenated string.
Front Coding Example
Words: jezebel, jezer, jezerit, jeziah, jeziel
Concatenated suffix string: …ebelritiahel…
word | prefix size | suffix
jezebel | 3 | ebel
jezer | 4 | r
jezerit | 5 | it
jeziah | 3 | iah
jeziel | 4 | el
Each dictionary entry stores: freq, postings pointer, string pointer, prefix size.
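A small encode/decode sketch makes the scheme concrete. This is a simplified variant: each word is coded against its immediate predecessor, and the first word is stored in full (prefix size 0) rather than against an external prefix as in the slide's table.

```python
def front_encode(words):
    """Encode a sorted word list as (prefix_len, suffix) pairs."""
    out, prev = [], ""
    for w in words:
        p = 0
        while p < min(len(w), len(prev)) and w[p] == prev[p]:
            p += 1  # length of the prefix shared with the previous word
        out.append((p, w[p:]))
        prev = w
    return out

def front_decode(pairs):
    """Rebuild the word list by reusing prefixes of the previous word."""
    words, prev = [], ""
    for p, suffix in pairs:
        prev = prev[:p] + suffix
        words.append(prev)
    return words

words = ["jezebel", "jezer", "jezerit", "jeziah", "jeziel"]
enc = front_encode(words)
print(enc)  # [(0, 'jezebel'), (4, 'r'), (5, 'it'), (3, 'iah'), (4, 'el')]
assert front_decode(enc) == words
```

Sorting matters here too: only in sorted order do neighboring words share long prefixes, which is what makes the suffixes short.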
Front Coding Example
Each entry stores freq, the postings pointer (a disk address), a pointer into the string, and the prefix size (assume 1 byte to store the prefix size).
- Assuming 3 letters in the common prefix (on average), what is the size of the dictionary?
- What is the search time?
3-in-4 Front Coding
- Front coding saves space, but binary search of the index is no longer possible.
- To allow for binary search, "3-in-4" front coding can be used: in every block of 4 words, the first is given completely, and the other three are front-coded.
- Binary search can then use the complete words to find the correct block.
- This combines the ideas of blocking and front coding. What will the size be now?
58
Think about and analyze these solutions 58
Trie
Runtime? Size?
Patricia Trie
Runtime? Size?
Hashtable
Runtime? Size?