Modified from Stanford CS276 slides Lecture 4: Index Construction
How would you construct the dictionary and inverted index?
You are given a large set of documents, and you would like to construct the following data structures: a dictionary (lexicon) and posting lists (the inverted index).
[Figure: dictionary entries A–F, each pointing to its posting list of docIds.]
Index Construction
Two main issues:
Finding the documents (crawling – coming up in a few weeks)
Constructing the structures
Constructing is difficult. Why?
Not enough memory to hold the entire index – many page faults
Posting lists should be contiguous in memory, but we don't know how much space to allocate. (Pre-reading the documents doesn't help!)
Index Construction Solutions
We will discuss two different solutions for constructing the index: sort-based construction and merge-based construction.
Sort-Based Construction: Intuition
First create the dictionary in memory (hopefully it will fit).
To create the inverted index:
Read and parse the files.
While reading, write out all pairs of (termId, docId); this is efficient since we just write to the end of a file (see the sketch below). Note that you have termId values available since you already created the dictionary.
Sort the pairs and create the posting lists.
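A minimal sketch of this pair-emission step, assuming the dictionary (term to termId) was built in an earlier pass; tokenize is a deliberately naive stand-in for real parsing.

def tokenize(text):
    # Deliberately naive tokenizer, for illustration only.
    return text.lower().split()

def emit_pairs(docs, dictionary, pairs_path):
    """docs: iterable of (docId, text); dictionary: term -> termId (assumed pre-built).
    Writes one (termId, docId) pair per token to a flat file."""
    with open(pairs_path, "w") as out:
        for doc_id, text in docs:
            for term in tokenize(text):
                term_id = dictionary[term]            # termId is already known
                out.write(f"{term_id}\t{doc_id}\n")   # cheap: sequential writes to the end of the file

The pairs file is then sorted externally (coming up) and scanned to build the posting lists.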
Index construction
Documents are parsed to extract words, and these are saved together with the document ID. This slide shows pairs of (term, docId); really, we will use termIds and docIds.
Doc 1: I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.
Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious
Key step
After all documents have been parsed, the inverted file is sorted by terms (with a sub-sort by docID). We focus on this sort step. We have a large number of items to sort.
Final Step
At this point, the entries of each posting list are consecutive (and in the correct order).
Scan the sorted (termID, docID) list and create the posting lists.
Each time a posting list is created, update its pointer in the dictionary.
Compression of the posting lists can be performed at this point.
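A sketch of this final scan, assuming the (termID, docID) pairs are already sorted by termID and then docID; the returned dict stands in for the dictionary's pointers to posting lists.

def build_posting_lists(sorted_pairs):
    """sorted_pairs: iterable of (termId, docId), sorted by termId then docId.
    Returns termId -> posting list (sorted docIds, duplicates removed)."""
    postings = {}
    for term_id, doc_id in sorted_pairs:
        plist = postings.setdefault(term_id, [])   # new termId: start its posting list ("update pointer")
        if not plist or plist[-1] != doc_id:       # entries are consecutive, so duplicates are adjacent
            plist.append(doc_id)
    return postings

print(build_posting_lists([(1, 3), (1, 7), (2, 3), (2, 3), (2, 9)]))   # {1: [3, 7], 2: [3, 9]}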
The Hard Part: Sorting
The list of pairs will not all fit into memory. How can we sort efficiently?
Think about some basic sorting algorithms. How will they behave if not all the data fits in memory? TOO SLOW!!
We need an external sorting algorithm. We now discuss an external sorting algorithm, and then come back to apply it to the problem at hand.
Reminder: MergeSort
A divide-and-conquer technique:
Each unsorted collection is split into 2, then again, and again, until we have collections of size 1.
Now we merge sorted collections, until finally we merge the two halves.
MergeSort(array a, indexes low, high)
if (low < high)
  middle ← (low + high) / 2
  MergeSort(a, low, middle)       // split 1
  MergeSort(a, middle+1, high)    // split 2
  Merge(a, low, middle, high)     // merge 1+2
Merge(array a, indexes low, mid, high)
b ← empty array, pH ← mid+1, i ← low, pL ← low
while (pL <= mid AND pH <= high)
  if (a[pL] <= a[pH])
    b[i] ← a[pL]
    i ← i+1, pL ← pL+1
  else
    b[i] ← a[pH]
    i ← i+1, pH ← pH+1
if pL <= mid
  copy a[pL…mid] into b[i…]
else if pH <= high
  copy a[pH…high] into b[i…]
copy b[low…high] onto a[low…high]
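The same algorithm as the pseudocode above, written as a runnable Python sketch (it returns a new sorted list rather than sorting in place).

def merge_sort(a):
    """Return a sorted copy of list a (divide and conquer, as above)."""
    if len(a) <= 1:
        return a
    mid = len(a) // 2
    return merge(merge_sort(a[:mid]),      # split 1
                 merge_sort(a[mid:]))      # split 2, then merge 1+2

def merge(left, right):
    """Single linear pass over two sorted lists."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            out.append(left[i]); i += 1
        else:
            out.append(right[j]); j += 1
    out.extend(left[i:])     # copy whatever remains on either side
    out.extend(right[j:])
    return out

print(merge_sort([25, 37, 48, 57, 12, 33, 86, 92]))   # [12, 25, 33, 37, 48, 57, 86, 92]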
An example
[Figure: an array is repeatedly split into halves down to single elements, then the sorted pieces are merged back together step by step.]
Spotlight on Merging
Keep a pointer to the first value of each list. Repeatedly copy the lower value to the output and increment that pointer.
List 1: 25, 37, 48, 57
List 2: 12, 33, 86, 92
The output grows one value per step: 12, then 12, 25, then 12, 25, 33, and so on, until the full result 12, 25, 33, 37, 48, 57, 86, 92.
Single Linear Pass over Data!
Complexity Analysis of MergeSort
Standard analysis: with every split, we halve the collection. How many times can this be done? log2 n times.
At each level, the merges take total time n. Total: n log2 n.
I/O complexity: assume n data items stored in N blocks. 2N log2 n I/O operations (read and write N blocks, log2 n times).
Simple 2-Way Merge Sort We now adapt to external memory
Intuition: sort by breaking the data into small subfiles, sorting each subfile, and merging the subfiles.
Note: a sorted subfile is called a run.
Memory needs: 3 buffer pages.
The Algorithm
Read each page into memory, sort it, write it back.
While the number of runs at the end of the previous pass > 1:
  While there are runs to be merged from the previous pass:
    choose 2 runs (from the previous pass)
    read each run into an input buffer
    merge the runs and write into the output buffer
    flush the output buffer to disk one page at a time
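A minimal Python sketch of this 2-way scheme; disk pages and runs are simulated with small in-memory lists, so it only illustrates the pass structure, not real buffered I/O.

import heapq

def two_way_external_sort(pages):
    """pages: list of small lists (each inner list is one 'page').
    Pass 0 sorts each page; later passes merge runs two at a time."""
    runs = [sorted(p) for p in pages]            # read a page, sort it, write it back
    while len(runs) > 1:
        next_runs = []
        for i in range(0, len(runs), 2):
            if i + 1 < len(runs):
                # 2-way merge of two sorted runs (two input buffers, one output buffer)
                next_runs.append(list(heapq.merge(runs[i], runs[i + 1])))
            else:
                next_runs.append(runs[i])        # odd run out: carried over to the next pass
        runs = next_runs
    return runs[0] if runs else []

pages = [[3, 4], [6, 2], [9, 4], [8, 7], [5, 6], [3, 1], [2]]   # the 7-page example below
print(two_way_external_sort(pages))   # [1, 2, 2, 3, 3, 4, 4, 5, 6, 6, 7, 8, 9]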
2-Way Sort: Requires 3 Buffers
Pass 1: read a page, sort it, write it back; only one buffer page is used.
Passes 2, 3, …, etc.: three buffer pages are used (two input buffers, one output buffer).
[Figure: INPUT 1 and INPUT 2 buffers and an OUTPUT buffer in main memory, streaming pages between disk and memory.]
Two-Way External Merge Sort
Assume a block holds 2 numbers.
Each pass we read + write each page in the file. With N pages in the file, the number of passes is ⌈log2 N⌉ + 1, so the total cost is 2N(⌈log2 N⌉ + 1).
Idea: divide and conquer: sort subfiles and merge.
Example:
Input file:           3,4 | 6,2 | 9,4 | 8,7 | 5,6 | 3,1 | 2
Pass 0 (1-page runs): 3,4 | 2,6 | 4,9 | 7,8 | 5,6 | 1,3 | 2
Pass 1 (2-page runs): 2,3 4,6 | 4,7 8,9 | 1,3 5,6 | 2
Pass 2 (4-page runs): 2,3 4,4 6,7 8,9 | 1,2 3,5 6
Pass 3 (8-page run):  1,2 2,3 3,4 4,5 6,6 7,8 9
How Do We Merge These Runs With Only 3 Buffer Blocks?
Assume each block can hold a pair of numbers.
Same technique as before! Note that we retain linear time!
Run 1: 2,4 | 4,6 | 9,10
Run 2: 1,2 | 3,5 | 6
Improving External Merge Sort
The algorithm discussed is highly inefficient, since it does not take advantage of more than 3 buffer pages. We improve it in the following manner. Suppose that we have B buffer pages:
In the first phase, read in B pages at a time, sort them, and write back runs of size B.
In later phases, perform a (B-1)-way merge, using B-1 buffer pages for input and one for output.
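A sketch of the (B-1)-way merge step, using heapq.merge to stream any number of sorted runs in a single pass; real buffered page I/O is omitted.

import heapq

def multiway_merge(runs):
    """Merge any number of sorted runs in one pass.
    With B buffer pages we would hand this B-1 runs at a time, each read through
    its own input buffer, and stream the output through the remaining page."""
    return list(heapq.merge(*runs))

print(multiway_merge([[2, 3, 4, 6], [4, 7, 8, 9], [1, 3, 5, 6], [2]]))
# [1, 2, 2, 3, 3, 4, 4, 5, 6, 6, 7, 8, 9]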
General External Merge Sort
To sort a file with N pages using B buffer pages:
Pass 0: use B buffer pages. Produce ⌈N/B⌉ sorted runs of B pages each.
Pass 1, 2, …, etc.: merge B-1 runs at a time.
[Figure: B-1 INPUT buffers and one OUTPUT buffer in main memory, streaming runs between disk and memory.]
I/O Cost of External Merge Sort
Number of passes: 1 + ⌈log_{B-1}(⌈N/B⌉)⌉
Cost = 2N * (# of passes)
There are many improvements to this algorithm that further lower the I/O cost.
Now, back to the index construction problem….
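A small worked example of this cost formula, computing the pass count 1 + ⌈log_{B-1}(⌈N/B⌉)⌉ directly (the specific N and B are made up for illustration).

import math

def num_passes(N, B):
    """Passes of general external merge sort: pass 0 plus the (B-1)-way merge passes."""
    runs = math.ceil(N / B)                              # runs produced by pass 0
    return 1 + math.ceil(math.log(runs, B - 1)) if runs > 1 else 1

N, B = 1_000_000, 101         # one million pages, 101 buffer pages
p = num_passes(N, B)          # ceil(N/B) = 9901 runs; ceil(log_100(9901)) = 2, so 3 passes
print(p, 2 * N * p)           # 3 passes => 6,000,000 page I/Os in total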
BSBI: Blocked sort-based Indexing
Sort before writing to disk
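A minimal in-memory sketch of BSBI's per-block step, under the assumption that one block of documents and its (termId, docId) pairs fit in memory; in a real system the sorted block would be written to disk and all block files merged at the end.

def bsbi_invert_block(block, dictionary):
    """block: list of (docId, text); dictionary: term -> termId (assumed pre-built).
    Collect (termId, docId) pairs, sort them, then group them into posting lists."""
    pairs = []
    for doc_id, text in block:
        for term in text.lower().split():        # naive tokenizer, for illustration
            pairs.append((dictionary[term], doc_id))
    pairs.sort()                                 # the key step: sort before writing to disk
    block_index = {}
    for term_id, doc_id in pairs:
        plist = block_index.setdefault(term_id, [])
        if not plist or plist[-1] != doc_id:
            plist.append(doc_id)
    return block_index                           # in practice: written to disk, merged later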
Other Issues: Constructing the Dictionary
Our assumption was: we can keep the dictionary in memory. We need the dictionary (which grows dynamically) in order to implement a term-to-termID mapping. It may be very large!
This requires 2 passes over the data (one to construct the dictionary, one to construct the inverted index).
Actually, we could work with (term, docID) postings instead of (termID, docID) postings . . . but then intermediate files become very large. (We would end up with a scalable, but very slow, index construction method.)
Think About It Suppose that we want to sort data of size N
How much disk space is needed for external merge sort? Hint: think about what happens halfway through the last merge.
An alternative construction method: Merge-based Construction
Key idea 1: Generate separate dictionaries for each block – no need to maintain a term-termID mapping across blocks.
Key idea 2: Don't sort. Accumulate postings in postings lists as they occur.
With these two ideas we can generate a complete inverted index for each block. These separate indexes can then be merged into one big index.
The Algorithm
Repeat, until all documents have been completely parsed:
  While there is available memory, read the next token:
    Look up this token in an in-memory hash table.
    If it appears, get the in-memory posting list from the hash table and add the new entry to the posting list.
    If it does not appear, add it to the hash table, create a new posting list, and add the new entry to the posting list.
  When memory runs out:
    write the dictionary, sorted, to disk
    write the posting lists, sorted, to disk
Finally, merge the dictionaries and merge the posting lists.
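A sketch of one memory-bounded block of this merge-based construction, assuming a stream of (term, docId) tokens; the memory limit is simulated with a simple counter, and the sorted block would normally be flushed to disk rather than returned.

def build_block(token_stream, max_postings=1_000_000):
    """Consume (term, docId) tokens until the (simulated) memory budget is hit.
    Returns this block's dictionary with its in-memory posting lists, sorted by term."""
    index = {}                                   # in-memory hash table: term -> posting list
    used = 0
    for term, doc_id in token_stream:
        plist = index.setdefault(term, [])       # unseen term: create a new posting list
        if not plist or plist[-1] != doc_id:
            plist.append(doc_id)                 # add the new entry to the posting list
            used += 1
        if used >= max_postings:                 # "memory runs out"
            break
    return dict(sorted(index.items()))           # write dictionary + postings, sorted, to disk

def merge_blocks(blocks):
    """Merge the per-block indexes into one big index.
    Assumes blocks were built in increasing docId order, so appending keeps lists sorted."""
    merged = {}
    for block in blocks:
        for term, plist in block.items():
            merged.setdefault(term, []).extend(plist)
    return merged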
Some details
Writing sorted dictionary files
Writing sorted posting lists
Merging dictionary files
Merging posting lists
Compression – when?
Other Issues: Dynamic Indexing
Up to now, we have assumed that collections are static. They rarely are:
Documents come in over time and need to be inserted.
Documents are deleted and modified.
This means that the dictionary and postings lists have to be modified:
Postings updates for terms already in the dictionary
New terms added to the dictionary
Simplest approach
Maintain a "big" main index; new docs go into a "small" auxiliary index.
Search across both, merge the results.
Deletions: keep an invalidation bit-vector for deleted docs; filter the docs returned by a search through this invalidation bit-vector.
Periodically, re-index into one main index.
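A toy sketch of this scheme, assuming both indexes map a term to a list of docIds and the invalidation bit-vector is represented as a set of deleted docIds.

def search(term, main_index, aux_index, deleted):
    """Query both indexes, merge the results, then filter via the invalidation set."""
    hits = main_index.get(term, []) + aux_index.get(term, [])
    return [d for d in hits if d not in deleted]

main_index = {"caesar": [1, 2, 5]}   # "big" main index
aux_index  = {"caesar": [9]}         # "small" auxiliary index for new docs
deleted    = {2}                     # docs marked as deleted
print(search("caesar", main_index, aux_index, deleted))   # [1, 5, 9]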
Issues with main and auxiliary indexes
Problem of frequent merges – you touch stuff a lot. Poor performance during a merge.
Actually: merging the auxiliary index into the main index is efficient if we keep a separate file for each postings list; the merge is then the same as a simple append.
But then we would need a lot of files – inefficient for the O/S. Usually several posting lists are kept in a single file, and special methods are needed to make this work efficiently.
See Chapter 4 in …