
Index Construction Some of these slides are based on Stanford IR Course slides at http://www.stanford.edu/class/cs276/




1 Index Construction Some of these slides are based on Stanford IR Course slides at http://www.stanford.edu/class/cs276/

2 How would you construct the dictionary and inverted index?
You are given a large set of documents, and you would like to construct the following data structures: a Dictionary (Lexicon) and Posting Lists (an Inverted Index). [Figure: dictionary terms A–F, each pointing to its posting list of document ids, e.g., 1, 3, 2]

3 Index Construction
Two main issues: (1) finding the documents (crawling – coming up in a few weeks), and (2) constructing the structures. Constructing is difficult. Why?

4 The Naïve Strategy
IndexCorpus(C)   // C is a set of documents
  H ← in-memory HashTable<Word, IdList>
  for each d ∈ C do
    for each token t ∈ d do
      if t already appears in H then
        add d to the idList of t in H
      else
        add t to H with a new idList {d}
  write H to disk
(When) will this work?
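A minimal Java sketch of this naïve in-memory strategy (the class name, the Map-based layout, and the tokenized input format are illustrative assumptions, not part of the slides):

```java
import java.util.*;

// Naïve in-memory indexer: builds term -> list of docIds entirely in RAM.
class NaiveIndexer {
    // Dictionary and posting lists live in one hash table.
    private final Map<String, List<Integer>> index = new HashMap<>();

    // 'docs' maps a docId to its token stream (tokenization not shown).
    void indexCorpus(Map<Integer, List<String>> docs) {
        for (Map.Entry<Integer, List<String>> doc : docs.entrySet()) {
            int docId = doc.getKey();
            for (String token : doc.getValue()) {
                // Create the posting list on first sight of the term, then append the docId
                // (skipping duplicate ids per document is omitted for brevity).
                index.computeIfAbsent(token, t -> new ArrayList<>()).add(docId);
            }
        }
        // "Write H to disk" would go here; omitted in this sketch.
    }

    List<Integer> postings(String term) {
        return index.getOrDefault(term, Collections.emptyList());
    }
}
```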

5 Hardware Basics The indexing process must be designed around the characteristics of the hardware: limited main memory, and the speed of memory vs. disk. We recall some hardware basics.

6 Hardware basics Sec. 4.1 Access to data in memory is much faster than access to data on disk. Disk seeks: No data is transferred from disk while the disk head is being positioned. Therefore: Transferring one large chunk of data from disk to memory is faster than transferring many small chunks. Disk I/O is block-based: Reading and writing of entire blocks (as opposed to smaller chunks). Block sizes: 8KB to 256 KB.

7 Sec. 4.1 Hardware basics Servers used in IR systems now typically have several GB of main memory, sometimes tens of GB. Available disk space is several (2–3) orders of magnitude larger. Fault tolerance is very expensive: It’s much cheaper to use many regular machines rather than one fault tolerant machine.

8 Hardware assumptions for this lecture
Sec. 4.1
symbol   statistic                                            value
s        average seek time                                    5 ms = 5 × 10^−3 s
b        transfer time per byte                               0.02 μs = 2 × 10^−8 s
         processor's clock rate                               10^9 s^−1
p        low-level operation (e.g., compare & swap a word)    0.01 μs = 10^−8 s
         size of main memory                                  several GB
         size of disk space                                   1 TB or more

9 RCV1: Our collection for this lecture
Sec. 4.2 As an example of applying scalable index construction algorithms, we will use the Reuters RCV1 collection. This is one year of Reuters newswire (part of 1995 and 1996). (Actually, not that large.)

10 Sec. 4.2 A Reuters RCV1 document

11 Reuters RCV1 statistics
Sec. 4.2
symbol   statistic                        value
N        documents                        800,000
L        avg. # tokens per doc            200
M        terms (= word types)             400,000
         avg. # bytes per token           6 (incl. spaces/punct.)
         avg. # bytes per token           4.5 (no spaces/punct.)
         avg. # bytes per term            7.5
         non-positional postings          100,000,000

4.5 bytes per word token vs. 7.5 bytes per word type: why?

12 Will Our Naïve Strategy Work?
How much space is needed in a Java implementation? For the 400,000 words, each String object requires:
4 bytes for the reference to the char array
3 × 4 bytes for the three int fields that a String has
8 bytes for the "object header" of the String
8 bytes for the "object header" of the char array
4 bytes for the length field of the char array
7.5 × 2 bytes for the actual characters
Memory is rounded up to a multiple of 8.
Grand total (only for words): 56 × 400,000 = 22.4 MB

13 Will Our Naïve Strategy Work?
How much space is needed in a Java implementation? There are 100,000,000 non-positional entries; on average, each ArrayList<Integer> (= posting list) will have 250 entries. Each ArrayList needs:
8 bytes for the "object header"
4 bytes × 250 for references to the Integer objects
16 bytes × 250 for the Integer objects
Grand total (only for posting lists): about 7,000 bytes per posting list × 400,000 lists ≈ 2800 MB

14 Will Our Naïve Strategy Work?
How much space is needed in a Java implementation? A Hashtable with 400,000 entries needs:
8 bytes for the "object header"
16 bytes for "primitive fields"
12 bytes for slot array information
4 bytes per slot (there are usually ~1.5 times more slots than entries)
~24 bytes per entry (for the mapping objects)
Grand total (only for the hashtable): approximately 16.8 MB
In total, close to 3 GB of memory.
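The back-of-envelope totals from the last three slides can be checked with a few lines of Java; the per-object sizes used here are the slides' assumptions, not measured values:

```java
// Back-of-envelope memory estimate for the naïve in-memory index,
// using the sizes assumed on the previous three slides.
public class MemoryEstimate {
    public static void main(String[] args) {
        long terms = 400_000L;            // M: distinct terms
        long stringBytes = 56L * terms;                 // ~22.4 MB for the term Strings
        long postingBytes = 7_000L * terms;             // ~2800 MB: ~7,000 bytes per posting list
        long hashtableBytes = (long) (16.8 * 1_000_000); // ~16.8 MB for the hashtable

        long totalMB = (stringBytes + postingBytes + hashtableBytes) / 1_000_000;
        System.out.println("Approximate total: " + totalMB + " MB"); // close to 3 GB
    }
}
```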

15 Will Our Naïve Strategy Work?
Assuming we only have 1 GB of main memory available, we can fit posting lists for ~133,000 tokens in memory. This means that approximately 66,000,000 times we will try to add an id to a posting list that is not in memory, and will have to read it from disk:
66,000,000 × (5 × 10^−3 s seek + 2 × 10^−8 s/byte × 7,000 bytes transfer) ≈ 339,240 sec = 5654 min = 94.23 hours = 3.9 days!!
(Not quite as bad in practice, due to locality of reference.)
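The same estimate as a small Java program, using the hardware constants from slide 8; the 66,000,000 misses and 7,000 bytes per posting list are the slides' own numbers:

```java
// Estimated disk cost of the naïve strategy when posting lists spill to disk.
public class NaiveDiskCost {
    public static void main(String[] args) {
        double seek = 5e-3;          // average seek time: 5 ms
        double perByte = 2e-8;       // transfer time per byte: 0.02 μs
        long misses = 66_000_000L;   // postings whose list is not in memory
        long listBytes = 7_000L;     // ~7,000 bytes per posting list

        double seconds = misses * (seek + perByte * listBytes);
        System.out.printf("%.0f sec = %.0f min = %.2f hours = %.1f days%n",
                seconds, seconds / 60, seconds / 3600, seconds / 86400);
        // ≈ 339,240 sec ≈ 5,654 min ≈ 94.2 hours ≈ 3.9 days
    }
}
```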

16 Index Construction Solutions
We will discuss two different solutions for constructing the index: Sort-based construction Merge-based construction

17 Sort-Based Construction: Intuition
First create the dictionary in memory (hopefully it will fit). To create the inverted index: read and parse the files; while reading, write out all pairs of (termId, docId). This is efficient, since we just append to the end of a file. Note that termId values are available, since the dictionary has already been created. Finally, sort the pairs and create the posting lists.

18 Index construction
Documents are parsed to extract words, and these are saved together with the document ID. This slide shows pairs of (term, docId); really, we will use termIds and docIds. [Example documents: Doc 1: "I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me." Doc 2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious."]

19 Key step
After all documents have been parsed, the inverted file is sorted by term (with a secondary sort by docID). We focus on this sort step. We have a large number of items to sort.

20 Final Step At this point, the entries of each posting list are consecutive (and in the correct order). Scan the sorted (termID, docID) list and create the posting lists. Each time a posting list is created, update its pointer in the dictionary. Compression of posting lists can be performed at this point; see the sketch below.
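A minimal sketch of this final scan, assuming the sorted pairs arrive as a list of (termId, docId) records; the Pair record and the in-memory output map are illustrative, and a real implementation would write each list to disk and store its offset in the dictionary:

```java
import java.util.*;

public class BuildPostings {
    // A (termId, docId) pair; the input is assumed sorted by termId, then docId.
    record Pair(int termId, int docId) {}

    // Scan the sorted pairs once and emit one posting list per termId.
    static Map<Integer, List<Integer>> buildPostingLists(List<Pair> sortedPairs) {
        Map<Integer, List<Integer>> postings = new LinkedHashMap<>();
        int prevTerm = -1, prevDoc = -1;
        List<Integer> current = null;
        for (Pair p : sortedPairs) {
            if (p.termId() != prevTerm) {
                // New term: start a new posting list
                // (here the dictionary pointer would be updated).
                current = new ArrayList<>();
                postings.put(p.termId(), current);
                prevTerm = p.termId();
                prevDoc = -1;
            }
            if (p.docId() != prevDoc) {      // skip duplicate (termId, docId) pairs
                current.add(p.docId());
                prevDoc = p.docId();
            }
        }
        return postings;
    }
}
```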

21 The Hard Part: Sorting
The list of pairs will not all fit into memory. How can we sort efficiently? Think about some basic sorting algorithms: how will they behave if not all the data fits in memory? We need an external sorting algorithm. We now discuss such an algorithm, and then come back to apply it to the problem at hand.

22 Reminder: MergeSort A divide-and-conquer technique
Each unsorted collection is split into 2, then again, and again… until we have collections of size 1. Now we merge sorted collections, until finally we merge the two halves.

23 MergeSort(array a, indexes low, high)
If (low < high) middlelower((low + high)/2) MergeSort(a,low,middle) // split 1 MergeSort(a,middle+1,high) // split 2 Merge(a,low,middle,high) // merge 1+2

24 Merge(arrays a, index low, mid, high)
bempty array, pHmid+1, ilow, pLlow while (pL<=mid AND pH<=high) if (a[pL]<=a[pH]) b[i]a[pL] ii+1, pLpL+1 else b[i]a[pH] ii+1, pHpH+1 if pL<=mid copy a[pL…mid] into b[i…] elseif pH<=high copy a[pH…high] into b[i…] copy b[low…high] onto a[low…high]

25 An example [Figure: an array is repeatedly split (Initial → Split → Split → Split) down to single elements, then merged back (Merge → Merge → Merge) into one sorted array]

26 Spotlight on Merging Pointer to first value of each list
Copy lower value and increment pointer 25, 37, 48, 57 12, 33, 86, 92

27 Spotlight on Merging Pointer to first value of each list
Copy lower value and increment pointer 25, 37, 48, 57 12, 33, 86, 92 12

28 Spotlight on Merging Pointer to first value of each list
Copy lower value and increment pointer 25, 37, 48, 57 12, 33, 86, 92 12, 25

29 Spotlight on Merging Pointer to first value of each list
Copy lower value and increment pointer 25, 37, 48, 57 12, 33, 86, 92 12, 25, 33

30 Spotlight on Merging Pointer to first value of each list
Copy lower value and increment pointer 25, 37, 48, 57 12, 33, 86, 92 12, 25, 33, 37

31 Spotlight on Merging Pointer to first value of each list
Copy lower value and increment pointer 25, 37, 48, 57 12, 33, 86, 92 12, 25, 33, 37, 48

32 Spotlight on Merging Pointer to first value of each list
Copy lower value and increment pointer 25, 37, 48, 57 12, 33, 86, 92 12, 25, 33, 37, 48, 57

33 Spotlight on Merging Pointer to first value of each list
Copy lower value and increment pointer 25, 37, 48, 57 12, 33, 86, 92 12, 25, 33, 37, 48, 57, 86, 92 Single Linear Pass over Data!

34 Complexity Analysis of MergeSort
Standard analysis: at every split, we halve the collection. How many times can this be done? log2 n. At each level, the merging takes total time n. Total: n log2 n.
I/O complexity: assume n data items stored in N blocks. 2N log2 n I/O operations (read and write N blocks, log2 n times).

35 Simple 2-Way Merge Sort We now adapt to external memory
Intuition: sort by breaking the data into small subfiles, sorting each subfile, and merging the subfiles. Note: a sorted subfile is called a run. Memory needed: 3 buffer pages.

36 The Algorithm
First, read each page into memory, sort it, and write it back.
While the number of runs at the end of the previous pass > 1:
  While there are runs to be merged from the previous pass:
    choose 2 runs (from the previous pass)
    read each run, page by page, into an input buffer
    merge the runs and write into the output buffer
    flush the output buffer to disk one page at a time

37 2-Way Sort: Requires 3 Buffers
Pass 1: read a page, sort it, write it back; only one buffer page is used. Passes 2, 3, …: three buffer pages are used. [Figure: two input buffers and one output buffer in main memory, with pages streamed between disk and memory]

38 Two-Way External Merge Sort
Assume a block (page) holds 2 numbers. In each pass we read and write every page in the file. With N pages in the file, the number of passes is ⌈log2 N⌉ + 1, so the total cost is 2N(⌈log2 N⌉ + 1) I/Os. Idea: divide and conquer: sort subfiles and merge.
[Example, N = 7 pages:
Input file:            3,4 | 6,2 | 9,4 | 8,7 | 5,6 | 3,1 | 2
Pass 0 (1-page runs):  3,4 | 2,6 | 4,9 | 7,8 | 5,6 | 1,3 | 2
Pass 1 (2-page runs):  2,3 4,6 | 4,7 8,9 | 1,3 5,6 | 2
Pass 2 (4-page runs):  2,3 4,4 6,7 8,9 | 1,2 3,5 6
Pass 3 (8-page run):   1,2 2,3 3,4 4,5 6,6 7,8 9]

39 How Do We Merge These Runs With Only 3 Buffer Blocks?
Assume each block can hold a pair of numbers. Same technique as before! Note that we retain linear time. Run 1: 2,4 | 4,6 | 9,10    Run 2: 1,2 | 3,5 | 6

40 Improving External Merge Sort
The algorithm just discussed is inefficient, since it does not take advantage of more than 3 buffer pages. We improve it as follows. Suppose that we have B buffer pages: in the first phase, read B pages at a time, sort them, and write back runs of B pages each; in later phases, perform a (B−1)-way merge, using B−1 buffer pages for input and one for output.

41 General External Merge Sort
To sort a file with N pages using B buffer pages: Pass 0: use B buffer pages; produce sorted runs of B pages each. Later passes: merge B−1 runs at a time. [Figure: B−1 input buffers and one output buffer in main memory, with runs streamed between disk and memory]
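A sketch of the (B−1)-way merge step in Java, assuming each sorted run is exposed as an Iterator<Integer>; page-at-a-time buffering and disk I/O are abstracted away:

```java
import java.util.*;

public class MultiwayMerge {
    // Merge any number of sorted runs into one sorted output list.
    // With B buffers, up to B-1 runs would be merged at a time.
    static List<Integer> merge(List<Iterator<Integer>> runs) {
        // Heap entries: (current value, index of the run it came from).
        PriorityQueue<int[]> heap =
                new PriorityQueue<>(Comparator.comparingInt((int[] e) -> e[0]));
        for (int r = 0; r < runs.size(); r++)
            if (runs.get(r).hasNext())
                heap.add(new int[]{runs.get(r).next(), r});

        List<Integer> out = new ArrayList<>();   // stands in for the output buffer
        while (!heap.isEmpty()) {
            int[] smallest = heap.poll();
            out.add(smallest[0]);
            if (runs.get(smallest[1]).hasNext())          // refill from the same run
                heap.add(new int[]{runs.get(smallest[1]).next(), smallest[1]});
        }
        return out;
    }

    public static void main(String[] args) {
        List<Iterator<Integer>> runs = List.of(
                List.of(2, 4, 4, 6, 9, 10).iterator(),
                List.of(1, 2, 3, 5, 6).iterator(),
                List.of(0, 7, 8).iterator());
        System.out.println(merge(runs)); // [0, 1, 2, 2, 3, 4, 4, 5, 6, 6, 7, 8, 9, 10]
    }
}
```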

42 I/O Cost of External Merge Sort
Number of passes: 1 + ⌈log_(B−1) ⌈N/B⌉⌉. Cost = 2N × (# of passes). There are many improvements to this algorithm that further lower the I/O cost. Now, back to the index construction problem…
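For concreteness, a few lines of Java that evaluate this formula; the example values N = 10,000,000 pages and B = 1,000 buffers are made up for illustration:

```java
public class ExternalSortCost {
    // Number of passes of general external merge sort: 1 + ceil(log_{B-1}(ceil(N/B))).
    static long passes(long n, long b) {
        long runs = (n + b - 1) / b;           // sorted runs after pass 0
        long p = 1;
        while (runs > 1) {                     // each later pass merges B-1 runs at a time
            runs = (runs + (b - 1) - 1) / (b - 1);
            p++;
        }
        return p;
    }

    public static void main(String[] args) {
        long n = 10_000_000L, b = 1_000L;      // illustrative values
        long p = passes(n, b);
        System.out.println("passes = " + p + ", I/O cost = " + (2 * n * p) + " page I/Os");
        // passes = 3, I/O cost = 60,000,000 page I/Os
    }
}
```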

43 BSBI: Blocked sort-based Indexing: Algorithm
BSBI-IndexCorpus(C)
  CreateDictionary(C)
  n ← 0
  while (not all documents in C have been processed) do
    n ← n+1
    block ← ParseNextBlock(C)    // produces (termId, docId) pairs
    InMemorySort(block)
    WriteBlockToDisk(block, f_n)
  MergeBlocks(f_1, …, f_n, f_merged)

44 Other Issues: Constructing the Dictionary
Our assumption was: we can keep the dictionary in memory. We need the dictionary (which grows dynamically) in order to implement a term to termID mapping. May be very large! Requires 2 passes over the data (one to construct dictionary, one to construct inverted index) Actually, we could work with term,docID postings instead of termID,docID postings . . . . . . but then intermediate files become very large. (We would end up with a scalable, but very slow index construction method.)

45 Think About It Suppose that we want to sort data of size N
How much disk space is needed, for external merge sort? Hint: think about what happens halfway through the last merge.

46 An alternative construction method: Merge-based Construction
Key idea 1: Generate separate dictionaries for each block – no need to maintain term-termID mapping across blocks. Key idea 2: Don’t sort. Accumulate postings in postings lists as they occur. With these two ideas we can generate a complete inverted index for each block. These separate indexes can then be merged into one big index.

47 The Algorithm
While there is available memory, read the next token. Look up this token in an in-memory hash table. If it appears, get the in-memory posting list from the hash table and add the new entry to it. If it does not appear, add it to the hash table, create a new posting list, and add the new entry to it. When memory runs out: write the dictionary, sorted, to disk; write the posting lists, sorted, to disk. Recognize this?

48 The Algorithm
Repeat, until all documents have been completely parsed: While there is available memory, read the next token. Look up this token in an in-memory hash table. If it appears, get the in-memory posting list from the hash table and add the new entry to it. If it does not appear, add it to the hash table, create a new posting list, and add the new entry to it. When memory runs out: write the dictionary, sorted, to disk; write the posting lists, sorted, to disk.

49 The Algorithm
Repeat, until all documents have been completely parsed: While there is available memory, read the next token. Look up this token in an in-memory hash table. If it appears, get the in-memory posting list from the hash table and add the new entry to it. If it does not appear, add it to the hash table, create a new posting list, and add the new entry to it. When memory runs out: write the dictionary, sorted, to disk; write the posting lists, sorted, to disk. Finally, merge the dictionaries and merge the posting lists; a sketch follows below.
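A compact Java sketch of one pass of this merge-based construction (essentially SPIMI, single-pass in-memory indexing, from the IIR book); the token-stream format, the postings-count memory test, and the block file naming are simplifying assumptions:

```java
import java.io.*;
import java.util.*;

public class MergeBasedIndexer {
    // Build inverted-index blocks: accumulate postings in memory,
    // flush a sorted block to disk whenever the block is "full".
    static void indexTokens(Iterator<String[]> tokenStream,   // each item: {term, docId}
                            int maxPostingsPerBlock) throws IOException {
        Map<String, List<Integer>> block = new HashMap<>();
        int postings = 0, blockNo = 0;
        while (tokenStream.hasNext()) {
            String[] t = tokenStream.next();
            // Create the posting list on first sight of the term, then append.
            block.computeIfAbsent(t[0], k -> new ArrayList<>())
                 .add(Integer.parseInt(t[1]));
            if (++postings >= maxPostingsPerBlock) {           // crude "memory full" test
                writeBlock(block, "block-" + (blockNo++) + ".txt");
                block.clear();
                postings = 0;
            }
        }
        if (!block.isEmpty()) writeBlock(block, "block-" + blockNo + ".txt");
        // Merging the per-block dictionaries and posting lists is a separate
        // sequential pass over the sorted block files (not shown).
    }

    // Write the block with terms in sorted order, one line per posting list.
    static void writeBlock(Map<String, List<Integer>> block, String file) throws IOException {
        try (PrintWriter out = new PrintWriter(new FileWriter(file))) {
            for (String term : new TreeMap<>(block).keySet())
                out.println(term + "\t" + block.get(term));
        }
    }
}
```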

50 Some Questions to Think About
How hard is it to write sorted dictionary files during the process? How hard is it to write sorted posting lists during the process? How hard is it to merge dictionary files? How hard is it to merge posting lists? Compression, when?

51 Distributed indexing
Sec. 4.4 For web-scale indexing (don't try this at home!) we must use a distributed computing cluster. Individual machines are fault-prone: they can unpredictably slow down or fail. How do we exploit such a pool of machines? Use the MapReduce paradigm for indexing! Discussed later in the course.

52 Other Issues: Dynamic Indexing
Up to now, we have assumed that collections are static. They rarely are: Documents come in over time and need to be inserted. Documents are deleted and modified. This means that the dictionary and postings lists have to be modified: Postings updates for terms already in dictionary New terms added to dictionary

53 Simplest approach
Maintain a "big" main index; new docs go into a "small" auxiliary index. Search across both and merge the results. Deletions: keep an invalidation bit-vector for deleted docs, and filter the documents returned by a search with it. Periodically, re-index into one main index. A sketch of this query-time merge appears below.
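A minimal sketch of query-time behaviour with a main index, an auxiliary index, and an invalidation bit-vector; the Index interface and the docId-keyed BitSet are assumptions for illustration:

```java
import java.util.*;

public class MainPlusAuxSearch {
    // Minimal view of an index: a term lookup returning sorted docIds.
    interface Index { List<Integer> postings(String term); }

    private final Index main;          // large, rebuilt periodically
    private final Index auxiliary;     // small, receives new documents
    private final BitSet deleted;      // invalidation bit-vector, indexed by docId

    MainPlusAuxSearch(Index main, Index auxiliary, BitSet deleted) {
        this.main = main;
        this.auxiliary = auxiliary;
        this.deleted = deleted;
    }

    // Search both indexes, merge the results, and drop deleted documents.
    List<Integer> search(String term) {
        SortedSet<Integer> result = new TreeSet<>();
        result.addAll(main.postings(term));
        result.addAll(auxiliary.postings(term));
        result.removeIf(deleted::get);          // filter by the invalidation bit-vector
        return new ArrayList<>(result);
    }

    void delete(int docId) { deleted.set(docId); }  // deletion = flip a bit, no index change
}
```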

54 Issues with main and auxiliary indexes
Problem of frequent merges – you touch stuff a lot. Poor performance during a merge. Actually, merging the auxiliary index into the main index is efficient if we keep a separate file for each postings list; the merge is then a simple append. But then we would need a lot of files – inefficient for the O/S. Assumption for the rest of the lecture: the index is one big file. In reality: use a scheme somewhere in between (e.g., split very large postings lists, collect postings lists of length 1 in one file, etc.)

55 Zoom in On Merge Problem
Suppose you have a main index I for your data. As changes occur, you create an auxiliary index I0. When I0 is full, you merge it with I; you define full = some size n, where perhaps n is the available main memory. And so on… So, if you have T data in total, you end up with (T/n) − 1 merges. Together, these merges take time O(T²/n), since each posting may be touched on the order of T/n times.

56 Logarithmic merge
Sec. 4.5 Maintain a series of indexes, each twice as large as the previous one; at any time, some of these powers of 2 are instantiated. Keep the smallest (Z0) in memory; the larger ones (I0, I1, …) on disk. If Z0 gets too big (> n), write it to disk as I0, or merge it with I0 (if I0 already exists) into Z1. Then either write Z1 to disk as I1 (if there is no I1), or merge it with I1 to form Z2, and so on. A sketch of this cascade appears below.
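A sketch of the cascading merge in Java, following the Z0/I0/I1 scheme above; the toy Index type, its concatenation-based merge, and the size test are assumptions standing in for real on-disk indexes:

```java
import java.util.*;

public class LogarithmicMerge {
    // Toy "index": just a bag of postings; merging is concatenation.
    static class Index {
        final List<String> postings = new ArrayList<>();
        int size() { return postings.size(); }
        Index mergeWith(Index other) {
            Index merged = new Index();
            merged.postings.addAll(this.postings);
            merged.postings.addAll(other.postings);
            return merged;
        }
    }

    private final int n;                       // capacity of the in-memory index Z0
    private Index z0 = new Index();            // in-memory index
    private final List<Index> onDisk = new ArrayList<>();  // I0, I1, ... (null = empty slot)

    LogarithmicMerge(int n) { this.n = n; }

    void add(String posting) {
        z0.postings.add(posting);
        if (z0.size() > n) flush();
    }

    // Cascade: Z0 becomes I0, or merges with I0 into Z1, which becomes I1 or merges on, etc.
    private void flush() {
        Index carry = z0;
        z0 = new Index();
        for (int i = 0; ; i++) {
            if (i == onDisk.size()) { onDisk.add(carry); return; }        // new largest index
            if (onDisk.get(i) == null) { onDisk.set(i, carry); return; }  // empty slot I_i
            carry = carry.mergeWith(onDisk.get(i));   // Z_{i+1} = Z_i merged with I_i
            onDisk.set(i, null);
        }
    }
}
```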

57 Sec. 4.5

58 Logarithmic merge
Sec. 4.5 Each posting is merged O(log T) times, so the complexity is O(T log T). Thus logarithmic merge is much more efficient for index construction. But query processing now requires merging O(log T) indexes, whereas it is O(1) if you just have a main and an auxiliary index.

59 Further issues with multiple indexes
Sec. 4.5 Collection-wide statistics are hard to maintain. E.g., when we spoke of spell-correction: which of several corrected alternatives do we present to the user? This is determined by word popularity (a unigram model). How do we maintain the popularity values with multiple indexes and invalidation bit-vectors? One possibility: ignore everything but the main index for such ordering. We will see more such statistics used in results ranking.

60 Dynamic indexing at search engines
Sec. 4.5 All the large search engines now do dynamic indexing. Their indices have frequent incremental changes: news items, blogs, new topical web pages. But (sometimes/typically) they also periodically reconstruct the index from scratch. Query processing is then switched to the new index, and the old index is deleted.

61 Sec. 4.5 [Screenshots: Webmasterworld, Search Engine Watch]

62 Other sorts of indexes
Sec. 4.5 Positional indexes: the same sort of sorting problem, just larger. Building character n-gram indexes: as the text is parsed, enumerate n-grams. For each n-gram, we need pointers to all dictionary terms containing it – the "postings". Note that the same "postings entry" will arise repeatedly when parsing the docs – we need efficient hashing to keep track of this. E.g., that the trigram uou occurs in the term deciduous will be discovered on each text occurrence of deciduous. We only need to process each term once. Why?

