Accumulator Representations. Dr. Susan Gauch.


1 Accumulator Representations Dr. Susan Gauch

2 Criteria
- Fast look up by docid
  - Need to be able to add posting data efficiently
  - Acc.Add (docid, wt)
- Small space in memory
  - Most documents do not contain any of the query words
  - Accumulator is thus a sparse array
  - Avoid storing buckets for non-matching documents
- Fast sort by total weight
  - After scores are accumulated, sort by total weight before presenting top matches to the user

3 Option 1: Array
- One element per document
- Fast lookup by docid – YES
  - Acc[docid] += wt
  - O(1)
- Small space in memory – NO
  - Store one element per document in the collection
  - What if there are billions of documents?
  - O(N) buckets where N = number of docs in collection
- Fast sort by total weight – MAYBE
  - If just sort the array – NO (array can be huge)
  - O(N log N) where N = number of docs in collection
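
A minimal sketch of the array accumulator described in Option 1. The class name `ArrayAccumulator` and the sample postings are hypothetical and not from the slides; docids are assumed to be integers in the range 0 to N-1.

```python
class ArrayAccumulator:
    """One slot per document in the collection (Option 1)."""

    def __init__(self, num_docs):
        # O(N) space, even though most slots stay at 0.0
        self.weights = [0.0] * num_docs

    def add(self, docid, wt):
        # O(1): the docid is used directly as the array index
        self.weights[docid] += wt


# Hypothetical postings for a 10-document collection
acc = ArrayAccumulator(num_docs=10)
for docid, wt in [(3, 0.5), (7, 1.25), (3, 0.25)]:
    acc.add(docid, wt)
```

Only two of the ten slots end up non-zero here, which is the sparseness that the space criterion is concerned with.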

4 More Efficient Sorts
- Take advantage of 2 things:
- 1) Array stores mostly 0
  - Keep track of number of non-0 entries
  - Copy those into new array
  - Sort that smaller array
  - O(r log r) where r is number of non-0 results
  - r << N
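
The first idea, copying the non-0 entries out and sorting only those, might look like the sketch below. The function name `sort_nonzero` and the sample weights are illustrative assumptions. Note that the copy itself still takes one O(N) pass over the array; only the sort shrinks to O(r log r).

```python
def sort_nonzero(weights):
    """Return (docid, wt) pairs for non-zero entries, highest weight first."""
    # One O(N) pass copies the r non-zero entries into a smaller list
    nonzero = [(docid, wt) for docid, wt in enumerate(weights) if wt != 0]
    # Sorting the smaller list costs O(r log r) instead of O(N log N)
    nonzero.sort(key=lambda pair: pair[1], reverse=True)
    return nonzero


# Hypothetical accumulator contents: 3 matches out of 8 documents
weights = [0.0, 0.0, 0.4, 0.0, 1.5, 0.0, 0.9, 0.0]
ranked = sort_nonzero(weights)
```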

5 More Efficient Sorts
- 2) Take advantage of the fact that usually only p results are presented, p << r (10? 20? 100?)
  - Use a bounded-size data structure to store the top-weighted results so far: a heap or a bounded-size linked list
  - Iterate over Acc
    - If list not full
      - Add (docid, wt) to list in sorted location
    - Else if (wt > list->tail.wt)
      - Add (docid, wt) to list in sorted location
      - Remove tail element

6 More Efficient Sorts (2)
- Before long, most (docid, wt) don’t make it past the cut-off and are immediately rejected
- O(A) where A is the size of the accumulator when p << r << A
  - You must loop over accumulator, but most of the time, no inserts actually happen
  - When inserting, it is O(p) where p is the size of the linked list
- For the array accumulator, this is O(N)
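
The bounded-list scan from the two "More Efficient Sorts" slides can be sketched as follows. This version swaps the linked list for a sorted Python list maintained with `bisect.insort`, which is an assumption made for brevity; the function name `top_p` and the sample weights are also hypothetical.

```python
import bisect


def top_p(acc_items, p):
    """Return the p highest-weighted (docid, wt) pairs, best first."""
    if p <= 0:
        return []
    # Entries are stored as (-wt, docid), so ascending list order means
    # descending weight and the tail (last element) is the current cut-off.
    top = []
    for docid, wt in acc_items:
        if wt == 0:
            continue  # non-matching document
        if len(top) < p:
            bisect.insort(top, (-wt, docid))  # O(p) insert in sorted location
        elif wt > -top[-1][0]:
            bisect.insort(top, (-wt, docid))
            top.pop()  # remove the old tail element
        # otherwise the entry is rejected after a single comparison
    return [(docid, -neg_wt) for neg_wt, docid in top]


# Hypothetical array accumulator contents
weights = [0.0, 0.2, 0.0, 1.5, 0.7, 0.0, 0.9, 0.1]
best = top_p(enumerate(weights), p=3)
```

Once the list is full, an entry below the cut-off costs one comparison, which is why the scan as a whole tends toward O(A).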

7 Option 2: Hashtable
- Size of hashtable: number of expected non-0 results * 3 (r * 3)
- Fast lookup by docid – YES
  - Loc = hashfn (docid)
  - HT[Loc] += wt
  - O(c) where c is number of collisions + 1
- Small space in memory – YES
  - O(r)
- Fast sort by total weight – MAYBE
  - Can use same sort approaches as for Option 1: Array
  - O(A) == O(r)
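
A small sketch of the hashtable accumulator in Option 2. The slides do not say which hash function or collision strategy is used, so `docid % size` and linear probing are assumptions here, as are the class name `HashAccumulator` and the sample postings.

```python
class HashAccumulator:
    """Open-addressing hash table sized at 3x the expected non-zero results."""

    def __init__(self, expected_results):
        self.size = max(3, expected_results * 3)
        self.slots = [None] * self.size  # each slot: [docid, total_wt] or None
        self.count = 0

    def add(self, docid, wt):
        loc = docid % self.size  # assumed hash function
        for _ in range(self.size):
            slot = self.slots[loc]
            if slot is None:  # first posting seen for this docid
                self.slots[loc] = [docid, wt]
                self.count += 1
                return
            if slot[0] == docid:  # found the document: accumulate
                slot[1] += wt
                return
            loc = (loc + 1) % self.size  # collision: probe the next slot
        raise RuntimeError("hash table is full")

    def items(self):
        # Only matching documents occupy slots, so this is O(r) entries
        return [tuple(s) for s in self.slots if s is not None]


# Hypothetical postings; docids 5 and 17 collide in a 12-slot table
acc = HashAccumulator(expected_results=4)
for docid, wt in [(5, 0.5), (17, 1.0), (5, 0.25)]:
    acc.add(docid, wt)
```

In practice a built-in dictionary would do the same job in Python; the explicit table is used only to make the probe count c visible.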

8 Option 3: Heap
- Can bound the heap to approximate size p
- Height of the heap: h = log2 p
- Fast lookup by docid – NO
  - Must walk the whole heap, O(p)
- Small space in memory – YES
  - Store one element for each result you plan to present to the user (just keep top p at any time)
  - O(p)
- Fast sort by total weight – YES
  - Results are always partially sorted
  - Just remove top element iteratively to present results at the end
  - O(p log p) == heap sort
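
One possible reading of Option 3 is sketched below using Python's `heapq` as a min-heap, so the weakest kept result sits at the top and can be evicted cheaply. The class name `HeapAccumulator` and the eviction policy are assumptions; in particular, a document that is evicted loses the weight it had accumulated, so the totals are approximate.

```python
import heapq


class HeapAccumulator:
    """Keeps at most p entries; heap[0] is the weakest one (min-heap)."""

    def __init__(self, p):
        self.p = p
        self.heap = []  # entries are [total_wt, docid]

    def add(self, docid, wt):
        for entry in self.heap:  # lookup by docid walks the heap: O(p)
            if entry[1] == docid:
                entry[0] += wt
                heapq.heapify(self.heap)  # restore heap order, O(p)
                return
        if len(self.heap) < self.p:
            heapq.heappush(self.heap, [wt, docid])
        elif self.heap and wt > self.heap[0][0]:
            heapq.heapreplace(self.heap, [wt, docid])  # evict the weakest

    def results(self):
        # Pop in ascending order, then reverse: O(p log p), as in heap sort
        heap = list(self.heap)
        ordered = [heapq.heappop(heap) for _ in range(len(heap))]
        return [(docid, wt) for wt, docid in reversed(ordered)]


# Hypothetical postings with p = 2
acc = HeapAccumulator(p=2)
for docid, wt in [(1, 0.5), (2, 0.25), (1, 0.5), (3, 0.75)]:
    acc.add(docid, wt)
top = acc.results()
```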

9 Option 4: Hashtable + Heap
- Use both a hashtable AND a heap
  - Both store pointers to nodes that contain (docid, total_weight)
- Fast look up by docid – YES
  - O(c) in hashtable
- Small space in memory – YES
  - O(r) + O(p) for hashtable and heap
- Fast sort by total weight – YES
  - O(p log p) from heap
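
A rough sketch of how a hashtable and a heap could be combined, not necessarily the design the slides have in mind. Here a dictionary accumulates shared nodes by docid, and `heapq.nlargest` (which keeps a bounded heap internally) selects the top p nodes once accumulation is done. The function name `rank` and the sample postings are hypothetical, and this variant spends O(r log p) on selection rather than maintaining the heap during accumulation.

```python
import heapq


def rank(postings, p):
    """Accumulate weights by docid, then return the top p (docid, wt) pairs."""
    nodes = {}  # hashtable: docid -> node [docid, total_weight]
    for docid, wt in postings:
        node = nodes.get(docid)
        if node is None:
            nodes[docid] = [docid, wt]  # first posting for this document
        else:
            node[1] += wt  # fast lookup and update by docid
    # The bounded heap only references existing nodes; nothing is copied
    top = heapq.nlargest(p, nodes.values(), key=lambda node: node[1])
    return [(docid, wt) for docid, wt in top]


# Hypothetical postings from several query terms
postings = [(1, 0.5), (2, 0.25), (1, 0.5), (3, 0.75), (4, 0.125)]
best = rank(postings, p=2)
```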

