1
Inverted Index, Compressing the Inverted Index, and Computing Scores in a Complete Search System
Chintan Mistry, Mrugank Dalal
2
Indexing in a Search Engine
Pipeline: documents go through linguistic preprocessing to produce normalized terms, which are stored in an already built inverted index. A user query is looked up against this index to find the documents that contain the query terms, and the returned documents are ranked according to their relevancy to produce the results.
3
Forward index. What is an inverted index? First, look at the forward index!
A forward index maps documents to the words they contain:
Document 1: hat, dog, the, cow, is, now
Document 2: cow, run, away, morning, in, tree
Document 3: what, family, at, some, is, take
Querying the forward index requires iterating sequentially through each document, and through each of its words, to verify a match. Too much time, memory, and resources required!
4
What is an inverted index? As opposed to the forward index, it stores a list of documents per word (the posting list; each entry in it is one posting), so we can directly access the set of documents containing a word.
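A minimal sketch in Python of the contrast on this slide, using a small hypothetical collection: the forward index maps each document to its words (and must be scanned at query time), while the inverted index maps each word to its posting list, so the documents containing a term can be fetched directly.

```python
from collections import defaultdict

# Hypothetical toy collection: docID -> text.
docs = {
    1: "the cow is in the tree",
    2: "the dog ran away in the morning",
}

# Forward index: docID -> list of terms (querying requires scanning every document).
forward_index = {doc_id: text.split() for doc_id, text in docs.items()}

# Inverted index: term -> posting list of docIDs (direct lookup at query time).
inverted_index = defaultdict(set)
for doc_id, terms in forward_index.items():
    for term in terms:
        inverted_index[term].add(doc_id)

print(sorted(inverted_index["the"]))   # [1, 2] -- postings for "the"
print(sorted(inverted_index["cow"]))   # [1]
```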
5
How to build an inverted index? (1/3)
Build the index in advance: 1. Collect the documents. 2. Turn each document into a list of tokens. 3. Do linguistic preprocessing, producing a list of normalized tokens, which become the indexing terms. 4. Index the documents: for each word (the dictionary), record the documents that contain it (the postings).
6
How to build an inverted index? (2/3)
Given two documents:
Document 1: This is the first document. Microsoft's products are Office, Visio, and SQL Server.
Document 2: This is the second document. Google's services are Gmail, Google Labs, and Google Code.
7
How to build an inverted index? (3/3)
Sort-based indexing: 1. Sort the terms alphabetically. 2. Group instances of the same term by word and then by documentID. 3. Separate the terms from the documentIDs. This reduces the storage requirement; the dictionary is commonly kept in memory while the postings lists are kept on disk.
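A small sketch of this sort-based procedure, assuming the (term, docID) pairs fit in memory: collect the pairs, sort them by term and then by docID, and group them into a dictionary plus posting lists.

```python
from itertools import groupby

# Hypothetical (term, docID) pairs produced by tokenization and normalization.
pairs = [
    ("this", 1), ("is", 1), ("first", 1), ("document", 1),
    ("this", 2), ("is", 2), ("second", 2), ("document", 2),
]

# 1. Sort by term, then by docID.
pairs.sort()

# 2-3. Group instances of the same term and separate the terms from the docIDs.
dictionary = {}
for term, group in groupby(pairs, key=lambda p: p[0]):
    dictionary[term] = sorted({doc_id for _, doc_id in group})

print(dictionary["document"])  # [1, 2]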
8
Blocked sort-based indexing (BSBI)
Uses termIDs instead of terms. Main memory is insufficient to collect all termID-docID pairs, so we need an external sorting algorithm that uses disk: 1. Segment the collection into parts of equal size. 2. Sort and group the termID-docID pairs of each part in memory. 3. Store the intermediate result on disk. 4. Merge all intermediate results into the final index. Running time: O(T log T), where T is the number of termID-docID pairs.
9
Single-pass in-memory indexing
SPIMI uses terms instead of termIDs. It writes each block's dictionary to disk and then starts a new dictionary for the next block. Assume we have a stream of term-docID pairs: tokens are processed one by one, and when a term occurs for the first time it is added to the dictionary and a new posting list is created for it.
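A simplified single-block sketch of SPIMI, assuming a stream of (term, docID) pairs: a term is added to the block's dictionary the first time it is seen, postings are appended directly without sorting the pairs, and the dictionary is sorted only when the block is written to disk.

```python
import json

def spimi_invert(token_stream, block_path):
    """Build one SPIMI block from a stream of (term, docID) pairs and write it to disk."""
    dictionary = {}
    for term, doc_id in token_stream:
        # When a term occurs for the first time, a new posting list is created.
        postings = dictionary.setdefault(term, [])
        # Add the posting directly; consecutive duplicates within a document are skipped.
        if not postings or postings[-1] != doc_id:
            postings.append(doc_id)
    # Terms are sorted only once, when the block is written out.
    with open(block_path, "w") as f:
        json.dump(dict(sorted(dictionary.items())), f)

# Hypothetical stream for one block.
stream = [("google", 2), ("labs", 2), ("google", 2), ("office", 1)]
spimi_invert(stream, "block0.json")
```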
10
Difference between BSBI and SPIMI
SPIMI: adds postings directly to the posting list; faster than BSBI because no sorting is necessary; saves memory because no termIDs need to be stored; time complexity O(T).
BSBI: collects termID-docID pairs, sorts them, and then creates the postings lists; slower than SPIMI; requires storing termIDs, so it needs more space; time complexity O(T log T).
11
Distributed Indexing (1/4)
We cannot perform web-scale index construction on a single computer, so web search engines use distributed indexing algorithms: the work is partitioned across several machines. They use the MapReduce architecture, a general architecture for distributed computing, which divides the work into chunks that can easily be assigned and reassigned, and proceeds in a map phase and a reduce phase.
12
Distributed Indexing (2/4)
13
Distributed Indexing (3/4)
Map phase: map the splits of the input data to key-value pairs. The machines that do this are called parsers; each parser writes its output to local segment files. Reduce phase: partition the keys into j term partitions and have the parsers write the key-value pairs for each term partition into a separate segment file, one per term partition.
14
Distributed Indexing (4/4)
Reduce phase (cont.): collecting all values (docIDs) for a given key (termID) into one list is the task of an inverter. The master assigns each term partition to a different inverter. Finally, the list of values for each key is sorted and written to the final sorted postings list.
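An in-process sketch of the two phases, assuming plain terms stand in for termIDs: the map phase turns document splits into (term, docID) pairs, and the reduce phase collects all docIDs for each term into a sorted posting list. A real deployment would run parsers and inverters on separate machines with segment files between them.

```python
from collections import defaultdict

def map_phase(split):
    """Parser: map a split of (docID, text) records to (term, docID) pairs."""
    for doc_id, text in split:
        for term in text.lower().split():
            yield term, doc_id

def reduce_phase(pairs):
    """Inverter: collect all docIDs for each term into one sorted posting list."""
    index = defaultdict(set)
    for term, doc_id in pairs:
        index[term].add(doc_id)
    return {term: sorted(doc_ids) for term, doc_ids in index.items()}

splits = [
    [(1, "cow runs in the morning")],
    [(2, "the dog and the cow")],
]
all_pairs = [pair for split in splits for pair in map_phase(split)]
print(reduce_phase(all_pairs)["cow"])  # [1, 2]
```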
15
Dynamic indexing. Motivation: what we have seen so far assumed a static collection of documents; what if documents are added, updated, or deleted? Maintain two indexes: a main index and an auxiliary index. The auxiliary index is kept in memory; searches are run across both indexes and the results are merged. When the auxiliary index becomes too large, it is merged into the main index. Deleted documents can be filtered out while returning the results.
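A sketch of this main/auxiliary scheme: new documents go into a small in-memory auxiliary index, queries run against both indexes with deleted documents filtered out, and the auxiliary index is merged into the main index once it grows past a threshold (the threshold value here is an arbitrary assumption).

```python
class DynamicIndex:
    def __init__(self, merge_threshold=1000):
        self.main = {}            # term -> set of docIDs (on disk in a real system)
        self.aux = {}             # small in-memory auxiliary index
        self.deleted = set()      # docIDs of deleted documents
        self.merge_threshold = merge_threshold

    def add_document(self, doc_id, terms):
        for term in terms:
            self.aux.setdefault(term, set()).add(doc_id)
        if sum(len(p) for p in self.aux.values()) > self.merge_threshold:
            self._merge()

    def delete_document(self, doc_id):
        self.deleted.add(doc_id)

    def search(self, term):
        # Run the search across both indexes, merge, and filter out deleted docs.
        hits = self.main.get(term, set()) | self.aux.get(term, set())
        return sorted(hits - self.deleted)

    def _merge(self):
        # Merge the auxiliary index into the main index and start a fresh one.
        for term, postings in self.aux.items():
            self.main.setdefault(term, set()).update(postings)
        self.aux = {}

idx = DynamicIndex()
idx.add_document(1, ["cow", "tree"])
idx.add_document(2, ["cow", "dog"])
idx.delete_document(2)
print(idx.search("cow"))  # [1]
```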
16
Querying distributed indexes (1/2)
Partition by terms: partition the dictionary of index terms into subsets, each node holding the postings lists for its terms. A query is routed only to the nodes owning its terms, which allows greater concurrency; however, long postings lists must be sent between nodes for merging, and this cost is very high and outweighs the greater concurrency. Partition by documents: each node contains the index for a subset of all documents. A query is distributed to all nodes, then the results are merged.
17
Querying distributed indexes (2/2)
Partition by documents (cont.): Problem: idf must be calculated over the entire collection even though the index at a single node covers only a subset of the documents. The query is broadcast to each of the nodes, and the top k results from each node are merged to find the top k documents for the query.
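A sketch of querying a document-partitioned index, with a hypothetical per-node scoring function: the query is broadcast to every node, each node returns its local top k, and the local lists are merged into the global top k. The idf values are assumed to have been computed over the whole collection and distributed to the nodes beforehand.

```python
import heapq

def query_node(node_index, query_terms, idf, k):
    """Score this node's subset of documents and return its local top k as (score, docID)."""
    scores = {}
    for term in query_terms:
        for doc_id, tf in node_index.get(term, []):
            scores[doc_id] = scores.get(doc_id, 0.0) + tf * idf[term]
    return heapq.nlargest(k, ((score, doc_id) for doc_id, score in scores.items()))

def distributed_query(nodes, query_terms, idf, k=10):
    """Broadcast the query to all nodes and merge their local top-k lists."""
    merged = []
    for node_index in nodes:
        merged.extend(query_node(node_index, query_terms, idf, k))
    return heapq.nlargest(k, merged)

# Hypothetical partitioned indexes: term -> list of (docID, term frequency).
nodes = [
    {"cow": [(1, 2)], "dog": [(1, 1)]},
    {"cow": [(5, 1)], "tree": [(6, 3)]},
]
idf = {"cow": 1.0, "dog": 2.0, "tree": 1.5}
print(distributed_query(nodes, ["cow", "dog"], idf, k=2))
```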
18
Index compression (1/8). Compression techniques for the dictionary and the postings lists. Advantages: less disk space; better use of caching, since frequently used terms can be cached in memory for faster processing and compression allows more terms to be stored in memory; faster data transfer from disk to memory, because the total time to transfer compressed data from disk and decompress it is less than the time to transfer the uncompressed data.
19
Index compression (2/8) Dictionary compression:
The dictionary is small compared to the postings lists, so why compress it? Because when a large part of the dictionary (think of the millions of terms in it!) sits on disk, many more disk seeks are necessary. The goal is to fit the dictionary into memory for fast response times.
20
Index compression (3/8) 1. Dictionary as an array:
The dictionary can be stored in an array of fixed-width entries. For example, with 400,000 terms and 20 bytes per term plus 4 bytes for the document frequency and 4 bytes for the postings pointer: 400,000 × (20 + 4 + 4) = 11.2 MB.
21
Index compression (4/8) Any problems with storing the dictionary as an array?
1. The average length of a term in English is about eight characters, so we waste about 12 of the 20 characters per entry. 2. There is no way to store terms longer than 20 characters, such as hydrochlorofluorocarbons. Solution? 2. Dictionary as a string: store the terms as one long string of characters; a term pointer marks the end of the preceding term and the beginning of the next.
22
Index compression (5/8) 2. Dictionary as a string (cont.):
400,000 × (4 + 4 + 3 + 8) = 7.6 MB (compared to 11.2 MB earlier): 4 bytes for the document frequency, 4 bytes for the postings pointer, 3 bytes for the term pointer, and 8 bytes per term on average.
23
Index compression (6/8) 3. Blocked storage:
Group the terms in the string into blocks of size k and keep a term pointer only for the first term of each block. With k = 4 we save (k − 1) × 3 = 9 bytes of term pointers per block, but need an additional 4 bytes per block to store the term lengths, a net saving of 5 bytes per block: 400,000 × (1/4) × 5 = 0.5 MB saved, giving 7.1 MB (compared to 7.6 MB).
24
Index compression (7/8) 4. Blocked storage with front coding:
Consecutive terms in a sorted block often share common prefixes, which can be stored once. According to the experiments reported by the authors, the dictionary size is reduced to 5.9 MB (compared to 7.1 MB).
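A minimal sketch of front coding within one block: each term stores only the length of the prefix it shares with the previous term plus the remaining suffix. The exact on-disk layout varies; this only illustrates the prefix-sharing idea behind the reduction quoted above.

```python
import os

def front_encode(block_terms):
    """Encode a sorted block of terms as (shared prefix length, suffix) pairs."""
    encoded, prev = [], ""
    for term in block_terms:
        prefix_len = len(os.path.commonprefix([prev, term]))
        encoded.append((prefix_len, term[prefix_len:]))
        prev = term
    return encoded

def front_decode(encoded):
    terms, prev = [], ""
    for prefix_len, suffix in encoded:
        term = prev[:prefix_len] + suffix
        terms.append(term)
        prev = term
    return terms

block = ["automata", "automate", "automatic", "automation"]
enc = front_encode(block)
print(enc)                          # [(0, 'automata'), (7, 'e'), (7, 'ic'), (8, 'on')]
print(front_decode(enc) == block)   # True
```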
25
Index compression (8/8) Posting file compression:
By encoding gaps: the gaps between consecutive postings are much smaller numbers than the docIDs themselves, so we can store the gaps rather than the postings.
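A sketch of gap encoding combined with variable-byte compression, a common way to exploit the short gaps mentioned above: a posting list is stored as its first docID followed by the gaps between consecutive docIDs, and each number is written in as few bytes as needed.

```python
def to_gaps(postings):
    """Convert a sorted posting list of docIDs into a first docID plus gaps."""
    return [postings[0]] + [b - a for a, b in zip(postings, postings[1:])]

def from_gaps(gaps):
    postings, current = [], 0
    for gap in gaps:
        current += gap
        postings.append(current)
    return postings

def vb_encode(numbers):
    """Variable-byte encode: 7 payload bits per byte, high bit set on the last byte."""
    out = bytearray()
    for n in numbers:
        chunk = [n & 0x7F]
        n >>= 7
        while n:
            chunk.append(n & 0x7F)
            n >>= 7
        chunk.reverse()
        chunk[-1] |= 0x80            # mark the final byte of this number
        out.extend(chunk)
    return bytes(out)

postings = [824, 829, 215406]
gaps = to_gaps(postings)             # [824, 5, 214577]
print(len(vb_encode(gaps)))          # 6 bytes instead of three 4-byte integers
print(from_gaps(gaps) == postings)   # True
```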
26
Review: Scoring and term weighting
Metadata: information about a document. Metadata generally consists of fields, e.g. date of creation, authors, title. Zone: similar to a field, but the difference is that a zone contains arbitrary free text, e.g. an abstract or an overview. When we look for query terms, we sometimes look in these zones first; in papers, sections such as the related work or the references are examples of zones.
27
Review: Scoring and term weighting
Term frequency (tf_t,d): the number of occurrences of a term in a document. We want to score documents based on the query terms, so why not simply count how many times each term appears in a document and rank by that? Problem: document size leads to inappropriate rankings; a big article in which a term happens to occur many times would be ranked higher than an important but small article. This motivates weighting terms by how informative they are. Document frequency (df_t): the number of documents in the collection that contain the query term t. Inverse document frequency: idf_t = log(N / df_t), where N is the total number of documents. Significance of idf: if it is low, the term is common (e.g. a stop word); if it is high, the term is rare (e.g. apothecary).
28
Review: Scoring and term weighting
Tf-idf weighting: tf-idf_t,d = tf_t,d × idf_t. It is high when the term occurs many times in a small number of documents, low when the term occurs few times in a document or occurs in many documents, and lowest when the term occurs in almost all documents. Score of a document: Score(q, d) = Σ_{t ∈ q} tf-idf_t,d, i.e. the score of document d for query q is the sum of the tf-idf weights of every term in the query.
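A sketch of these formulas, assuming raw term counts for tf: idf_t = log(N / df_t), tf-idf_t,d = tf_t,d × idf_t, and Score(q, d) = Σ_{t ∈ q} tf-idf_t,d.

```python
import math
from collections import Counter

docs = {
    1: "the cow jumped over the moon",
    2: "the apothecary mixed a rare tonic",
    3: "the dog chased the cow",
}
N = len(docs)
tf = {d: Counter(text.split()) for d, text in docs.items()}                 # tf_t,d
df = Counter(term for counts in tf.values() for term in counts)             # df_t

def idf(term):
    return math.log(N / df[term]) if df[term] else 0.0

def score(query, doc_id):
    # Score(q, d) = sum over query terms t of tf_t,d * idf_t
    return sum(tf[doc_id][t] * idf(t) for t in query.split())

for d in docs:
    print(d, round(score("rare cow", d), 3))
# A term like "the" has df = N, so idf = 0 and it contributes nothing to any score.
```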
29
Computing scores in a complete search system
30
Inexact top K document retrieval
Motivation: to reduce the cost of computing scores for all N documents, we compute scores only for the documents whose scores are likely to be high with respect to the given query. How: find a set A of contender documents, where K < |A| << N, and return the K top-scoring documents from A. What is K? It is simply the number of documents retrieved for a query in the first shot; search engines generally choose K = 10.
31
Index Elimination
Idf preset threshold: only traverse postings for terms whose idf exceeds a preset threshold. Benefit: low-idf terms have long postings lists, so we remove them from the score computation. Why does this work? Low-idf terms are generally stop words and contribute little to scoring; e.g. for "parrot in the cage" we keep parrot and cage as terms and discard "in" and "the".
Many query terms: only traverse documents that contain many (ideally all) of the query terms. Danger: we may end up with fewer than K documents in the end.
32
Champion lists (also called fancy lists or top docs)
For each term t in the dictionary, precompute a set of the r documents in which the weights for t are highest. How to create set A: take the union of the champion lists of the query terms and compute scores only for the documents in that union. (Set A is the collection of contender documents, fewer than N but more than K.) How and when to decide r: it is highly application dependent; the lists are created at indexing time. Potential problem: the champion lists are created well in advance, but we do not know K until the query is received, so we may end up with r < K and too few contenders.
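A sketch of champion lists, assuming per-term document weights (e.g. tf-idf) are available at indexing time: for each term, keep the r highest-weighted documents; at query time, the union of the query terms' champion lists is the contender set A, and only those documents are scored.

```python
import heapq

# Hypothetical precomputed weights: term -> {docID: tf-idf weight}.
weights = {
    "parrot": {1: 3.2, 4: 2.9, 7: 0.4},
    "cage":   {1: 1.1, 2: 2.5, 9: 2.0},
}

def build_champion_lists(weights, r):
    """At indexing time, keep the r highest-weighted documents for each term."""
    return {t: {d for d, _ in heapq.nlargest(r, w.items(), key=lambda kv: kv[1])}
            for t, w in weights.items()}

def top_k(query_terms, champions, k):
    # Set A = union of the champion lists of the query terms; score only docs in A.
    contenders = set().union(*(champions.get(t, set()) for t in query_terms))
    scores = {d: sum(weights[t].get(d, 0.0) for t in query_terms) for d in contenders}
    return heapq.nlargest(k, scores.items(), key=lambda kv: kv[1])

champions = build_champion_lists(weights, r=2)
print(top_k(["parrot", "cage"], champions, k=3))
# If r is chosen smaller than K, set A may hold fewer than K docs -- the problem noted above.
```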
33
Static quality scores and ordering
In many search engines there is a measure of quality g(d) for each document, and the net score is calculated as a combination of g(d) and the tf-idf score. How to exploit this: keep each document posting list in decreasing order of g(d), so only the first few documents in each list need to be traversed. Global champion lists: choose the r documents with the highest value of g(d) + tf-idf. What is meant by quality? On a newspaper website, for example, documents with favourable reviews are ranked higher than those with negative ones; such documents are treated as good-quality documents.
34
Cluster pruning (1/2) We cluster the documents in a preprocessing step:
Pick √N documents and call them leaders. For each document that is not a leader (a follower), compute its nearest leader. Each leader then has approximately √N followers.
35
Cluster pruning (2/2) How does it help?
Given a query q, find the leader L nearest to q, i.e. compute scores only for the √N leaders. Set A then contains L together with its roughly √N followers, and only those documents are scored.
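A sketch of cluster pruning with a toy similarity function (term overlap standing in for the cosine similarity a real system would use): √N documents are sampled as leaders during preprocessing, every other document follows its nearest leader, and at query time only the nearest leader and its followers are scored.

```python
import math
import random

def similarity(a, b):
    """Toy similarity: term overlap (a real system would use cosine similarity)."""
    return len(set(a) & set(b))

def cluster(docs):
    """Preprocessing: pick sqrt(N) random leaders; attach each other doc to its nearest leader."""
    doc_ids = list(docs)
    leaders = random.sample(doc_ids, max(1, int(math.sqrt(len(doc_ids)))))
    followers = {leader: [] for leader in leaders}
    for d in doc_ids:
        if d not in followers:
            nearest = max(leaders, key=lambda l: similarity(docs[d], docs[l]))
            followers[nearest].append(d)
    return followers

def query(q_terms, docs, followers, k=2):
    # Find the leader nearest to the query; set A = that leader plus its followers.
    leader = max(followers, key=lambda l: similarity(q_terms, docs[l]))
    candidates = [leader] + followers[leader]
    return sorted(candidates, key=lambda d: similarity(q_terms, docs[d]), reverse=True)[:k]

docs = {
    1: "cows graze in the field".split(),   2: "the cow and the calf".split(),
    3: "tall trees in the forest".split(),  4: "the forest has many trees".split(),
    5: "dogs chase cows".split(),           6: "birds nest in trees".split(),
    7: "the calf follows the cow".split(),  8: "pine trees and oak trees".split(),
    9: "cows and trees on the farm".split(),
}
followers = cluster(docs)
print(query(["cows", "trees"], docs, followers))
```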
36
Tiered indexes: address the issue of the contender set A containing fewer than K documents.
Tier 1 (preset threshold set to 20): auto → Doc 1, Doc 2; car → Doc 1, Doc 2, Doc 3; best → Doc 4
Tier 2 (preset threshold set to 10): auto → Doc 1; car → Doc 1; best → Doc 4
If we fail to get K documents from tier 1, we fall back to tier 2.
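A sketch of tiered retrieval matching the example above: tiers are searched from the highest threshold down, and we fall back to the next tier only when the current tiers cannot supply K contenders. The posting contents here are illustrative.

```python
def tiered_search(tiers, query_terms, k):
    """Search tiers in order; fall back to the next tier if fewer than k docs are found."""
    results = set()
    for tier in tiers:                      # tiers ordered from highest threshold down
        for term in query_terms:
            results.update(tier.get(term, []))
        if len(results) >= k:
            break                           # enough contenders, no need for lower tiers
    return sorted(results)[:k]

# Hypothetical tiers: tier 1 (threshold 20), tier 2 (threshold 10).
tier1 = {"auto": [1, 2], "car": [1, 2, 3], "best": [4]}
tier2 = {"auto": [1, 5], "car": [5, 6], "best": [4, 7]}
print(tiered_search([tier1, tier2], ["car", "best"], k=5))  # falls back to tier 2
```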
37
A complete search system
(Diagram of a complete search system, combining: parsing and linguistics; a free text query parser with spell correction and k-gram indexes on the query side; indexers building a document cache, metadata in zone and field indexes, and a tiered inverted positional index on the document side; inexact top-K retrieval; and scoring and ranking driven by scoring parameters learned from a training set with machine-learned ranking (MLR), producing the result page for the user query.)
38
Questions? Thank you.