Accumulator Representations Dr. Susan Gauch
Criteria
  Fast lookup by docid
    Need to be able to add posting data efficiently: Acc.Add(docid, wt)
  Small space in memory
    Most documents do not contain any of the query words
    The accumulator is thus a sparse array
    Avoid storing buckets for non-matching documents
  Fast sort by total weight
    After scores are accumulated, sort by total weight before presenting the top matches to the user
Option 1: Array
  One element per document
  Fast lookup by docid – YES
    Acc[docid] += wt
    O(1)
  Small space in memory – NO
    Stores one element per document in the collection
    What if there are billions of documents?
    O(N) buckets, where N = number of docs in the collection
  Fast sort by total weight – MAYBE
    If we just sort the array – NO (the array can be huge)
    O(N log N), where N = number of docs in the collection
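A minimal sketch of the array accumulator, to make the O(1) update and the O(N) space cost concrete. The function names `make_array_accumulator` and `add` are illustrative and not from the slides, and docids are assumed to be integers in the range 0..N-1.

```python
def make_array_accumulator(num_docs):
    """Allocate one bucket per document in the collection: O(N) space."""
    return [0.0] * num_docs


def add(acc, docid, wt):
    """O(1) update: the docid is used directly as the array index."""
    acc[docid] += wt


# Tiny collection of 8 documents; only docs 2 and 5 match the query.
acc = make_array_accumulator(8)
for docid, wt in [(2, 0.5), (5, 1.0), (2, 0.25)]:
    add(acc, docid, wt)
```

Even in this toy case six of the eight buckets stay at 0, which is the sparsity the space criterion is concerned with.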
More Efficient Sorts
  Take advantage of two things:
  1) The array stores mostly 0s
    Keep track of the number of non-0 entries
    Copy those into a new array
    Sort that smaller array
    O(r log r), where r is the number of non-0 results, r << N
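The copy-then-sort idea can be sketched as follows. The helper name `sorted_nonzero` is hypothetical. Note that for an array accumulator, finding the non-0 entries still takes one pass over all N buckets; only the sort itself is reduced to O(r log r).

```python
def sorted_nonzero(acc):
    """Copy the non-0 buckets into a smaller list and sort only those."""
    # One O(N) pass to collect the r matching documents.
    hits = [(docid, wt) for docid, wt in enumerate(acc) if wt != 0]
    # O(r log r) sort by weight, highest first.
    hits.sort(key=lambda pair: pair[1], reverse=True)
    return hits


ranked = sorted_nonzero([0, 0, 0.75, 0, 0, 1.0, 0, 0.1])
```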
More Efficient Sorts
  2) Take advantage of the fact that we usually present only p results, p << r (10? 20? 100?)
    Use a bounded-size data structure to store the top-weighted results so far: a heap or a bounded-size linked list
    Iterate over Acc
      If list not full
        Add (docid, wt) to list in sorted location
      Else if (wt > list->tail.wt)
        Add (docid, wt) to list in sorted location
        Remove tail element
More Efficient Sorts (2)
  Before long, most (docid, wt) pairs don’t make it past the cut-off and are immediately rejected
  O(A), where A is the size of the accumulator, when p << r << A
    You must loop over the whole accumulator, but most of the time no insert actually happens
    When an insert does happen, it is O(p), where p is the size of the linked list
  For the array accumulator, A = N, so this is O(N)
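A rough sketch of the bounded top-p list, using a Python list kept in sorted order in place of the linked list on the slide. The function name `top_p` is illustrative. Two assumptions are made here that the slides do not state: empty (0-weight) buckets are skipped, and weights are stored negated so that the standard `bisect.insort` keeps the highest weight first and the lowest weight at the tail.

```python
import bisect


def top_p(acc_items, p):
    """Return the p highest-weighted (docid, wt) pairs, best first."""
    if p <= 0:
        return []
    top = []  # sorted by (-wt, docid); top[-1] is the lowest weight kept
    for docid, wt in acc_items:
        if wt == 0:
            continue  # assumption: empty buckets are never candidates
        if len(top) < p:
            # List not full: add in sorted location.
            bisect.insort(top, (-wt, docid))
        elif wt > -top[-1][0]:
            # Beats the current tail: add in sorted location, drop the tail.
            bisect.insort(top, (-wt, docid))
            top.pop()
        # Otherwise the pair is rejected without touching the list.
    return [(docid, -neg_wt) for neg_wt, docid in top]


best = top_p([(1, 0.2), (2, 0.9), (3, 0.5), (4, 0.1), (5, 0.7)], p=3)
```

Each accepted pair costs O(p) because inserting into a Python list shifts elements, which matches the per-insert cost given for the linked list.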
Option 2: Hashtable
  Size of hashtable: number of expected non-0 results * 3 (r * 3)
  Fast lookup by docid – YES
    Loc = hashfn(docid)
    HT[Loc] += wt
    O(c), where c is the number of collisions + 1
  Small space in memory – YES
    O(r)
  Fast sort by total weight – MAYBE
    Can use the same sort approaches as for Option 1: Array
    O(A) == O(r), since A = 3r
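One possible shape for the hashtable accumulator is sketched below. The slides do not say how collisions are resolved or which hash function is used, so linear probing and `docid % size` are assumptions made purely for illustration, as are the class and method names.

```python
class HashAccumulator:
    """Open-addressing hashtable sized at 3x the expected number of hits."""

    def __init__(self, expected_hits):
        self.size = max(1, expected_hits * 3)  # r * 3 buckets
        self.keys = [None] * self.size
        self.weights = [0.0] * self.size

    def _slot(self, docid):
        """Probe from the home bucket; cost is (collisions + 1) steps."""
        loc = docid % self.size  # assumed hash function
        for _ in range(self.size):
            if self.keys[loc] is None or self.keys[loc] == docid:
                return loc
            loc = (loc + 1) % self.size  # assumed: linear probing
        raise RuntimeError("hashtable is full")

    def add(self, docid, wt):
        loc = self._slot(docid)
        self.keys[loc] = docid
        self.weights[loc] += wt

    def items(self):
        """All (docid, total_weight) pairs; scans A = 3r buckets."""
        return [(k, w) for k, w in zip(self.keys, self.weights) if k is not None]


acc = HashAccumulator(expected_hits=2)
acc.add(7, 0.5)
acc.add(13, 1.0)   # 13 % 6 == 7 % 6, so this collides and probes once
acc.add(7, 0.25)
```

The output of `items()` can be fed to either of the sort approaches described for the array, at O(r) rather than O(N) scan cost.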
Option 3: Heap
  Can bound the heap to approximately size p
  Height of the heap: h = log_2 p
  Fast lookup by docid – NO
    Must walk the whole heap, O(p)
  Small space in memory – YES
    Store one element for each result you plan to present to the user (just keep the top p at any time)
    O(p)
  Fast sort by total weight – YES
    Results are always partially sorted
    Just remove the top element iteratively to present results at the end
    O(p log p) == heap sort
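A sketch of a heap bounded to size p, using Python's standard `heapq` module. One common way to enforce the bound, assumed here, is a min-heap whose root is the weakest result kept, so that the element to evict is always at the top; the final ranking is then obtained by popping everything and reversing. The function name `heap_top_p` is illustrative, and the input is assumed to be final (docid, wt) scores, since the heap offers no fast lookup by docid for accumulating.

```python
import heapq


def heap_top_p(scored, p):
    """Keep only the p best (docid, wt) pairs in a bounded min-heap."""
    heap = []  # entries are (wt, docid); heap[0] is the weakest kept so far
    for docid, wt in scored:
        if len(heap) < p:
            heapq.heappush(heap, (wt, docid))      # O(log p)
        elif heap and wt > heap[0][0]:
            heapq.heapreplace(heap, (wt, docid))   # evict weakest, O(log p)
    # Pop in ascending order (p pops of O(log p) each), then reverse.
    ordered = [heapq.heappop(heap) for _ in range(len(heap))]
    return [(docid, wt) for wt, docid in reversed(ordered)]


best = heap_top_p([(1, 0.2), (2, 0.9), (3, 0.5), (4, 0.1), (5, 0.7)], p=3)
```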
Option 4: Hashtable + Heap
  Use both a hashtable AND a heap
  Both store pointers to nodes that contain (docid, total_weight)
  Fast lookup by docid – YES
    O(c) in hashtable
  Small space in memory – YES
    O(r) + O(p) for hashtable and heap
  Fast sort by total weight – YES
    O(p log p) from heap
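A simplified sketch of the combined approach, using a Python `dict` as the hashtable and `heapq.nlargest` for the heap-based selection. This is a simplification rather than the design on the slide: the heap is built once after accumulation instead of being maintained alongside the hashtable, because `heapq` has no public operation for re-positioning a node whose weight has changed. The function name `rank` and the node layout `[docid, total_weight]` are illustrative.

```python
import heapq


def rank(postings, p):
    """Accumulate weights by docid, then select the top p by total weight."""
    nodes = {}  # hashtable: docid -> shared node [docid, total_weight]
    for docid, wt in postings:
        node = nodes.get(docid)
        if node is None:
            nodes[docid] = [docid, wt]   # first posting for this doc
        else:
            node[1] += wt                # fast lookup, update in place
    # Heap-based selection over the r nodes; holds references to the same nodes.
    top = heapq.nlargest(p, nodes.values(), key=lambda node: node[1])
    return [(docid, total) for docid, total in top]


results = rank([(3, 0.5), (1, 0.25), (3, 0.25), (2, 1.0), (1, 0.125)], p=2)
```

Only the r matching documents are ever stored, and the selection step never materializes a fully sorted list of all of them.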