
Accumulator Representations Dr. Susan Gauch

Criteria
- Fast lookup by docid
  - Need to be able to add posting data efficiently
  - Acc.Add(docid, wt)
- Small space in memory
  - Most documents do not contain any of the query words
  - The accumulator is thus a sparse array
  - Avoid storing buckets for non-matching documents
- Fast sort by total weight
  - After scores are accumulated, sort by total weight before presenting the top matches to the user

Option 1: Array
- One element per document
- Fast lookup by docid – YES
  - Acc[docid] += wt
  - O(1)
- Small space in memory – NO
  - Stores one element per document in the collection
  - What if there are billions of documents?
  - O(N) buckets, where N = number of docs in the collection
- Fast sort by total weight – MAYBE
  - If we just sort the whole array – NO (the array can be huge)
  - O(N log N), where N = number of docs in the collection
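A minimal Python sketch of the array option, assuming docids are integers in the range 0..N-1. The function name `accumulate_array` and the sample postings are illustrative and not from the slides.

```python
def accumulate_array(num_docs, postings):
    """Dense accumulator: one slot per document in the collection."""
    acc = [0.0] * num_docs      # O(N) space, even if only a few docs match
    for docid, wt in postings:
        acc[docid] += wt        # O(1) lookup and update by docid
    return acc


# Hypothetical postings for a two-term query: (docid, wt) pairs.
postings = [(3, 0.5), (7, 1.2), (3, 0.25)]
acc = accumulate_array(10, postings)
```

Only two of the ten slots end up non-zero here, which is the space problem the slide points out.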

More Efficient Sorts
- Take advantage of 2 things:
- 1) The array stores mostly 0s
  - Keep track of the number of non-0 entries
  - Copy those into a new array
  - Sort that smaller array
  - O(r log r), where r is the number of non-0 results
  - r << N
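The first idea can be sketched as follows, assuming the accumulator is a Python list indexed by docid. The name `sort_nonzero` is hypothetical.

```python
def sort_nonzero(acc):
    """Copy the non-zero entries into a smaller list and sort only that list."""
    # One O(N) pass to collect the r non-zero (docid, wt) pairs.
    results = [(docid, wt) for docid, wt in enumerate(acc) if wt != 0]
    # O(r log r) sort, highest total weight first.
    results.sort(key=lambda pair: pair[1], reverse=True)
    return results


ranked = sort_nonzero([0, 0.5, 0, 2.0, 0, 1.0])
```

The scan over the array is still O(N), but the comparison sort now runs only over the r matching documents.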

More Efficient Sorts
- 2) Usually only p results are presented, p << r (10? 20? 100?)
  - Use a bounded-size data structure to store the top-weighted results so far: a heap or a bounded-size linked list
  - Iterate over Acc
    If list not full
        Add (docid, wt) to list in sorted location
    Else
        if (wt > list->tail.wt)
            Add (docid, wt) to list in sorted location
            Remove tail element

More Efficient Sorts (2)
- Before long, most (docid, wt) pairs don't make it past the cut-off and are immediately rejected
- O(A), where A is the size of the accumulator, when p << r << A
  - You must loop over the whole accumulator, but most of the time no insert actually happens
  - When an insert does happen, it costs O(p), where p is the size of the linked list
  - For the array accumulator, A = N, so this is O(N)
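A sketch of the bounded top-p pass, with two stated assumptions: a sorted Python list stands in for the bounded linked list, and it is kept in ascending order, so the cut-off is the head `top[0]` rather than the tail. The name `top_p` is illustrative.

```python
import bisect


def top_p(acc_items, p):
    """Keep the p highest-weighted (docid, wt) pairs in one pass over the accumulator."""
    if p <= 0:
        return []
    top = []  # (wt, docid) tuples, ascending; top[0] is the current cut-off
    for docid, wt in acc_items:
        if wt == 0:
            continue                          # empty accumulator slot
        if len(top) < p:
            bisect.insort(top, (wt, docid))   # list not full: insert in sorted position
        elif wt > top[0][0]:
            bisect.insort(top, (wt, docid))   # beats the cut-off: insert ...
            top.pop(0)                        # ... and drop the weakest entry
        # otherwise the pair is rejected immediately
    return [(docid, wt) for wt, docid in reversed(top)]


best = top_p(enumerate([0, 0.5, 0, 2.0, 0, 1.0, 3.0]), p=2)
```

Each accepted insert costs O(p) because elements shift, matching the slide's bound for the linked list. Rejections cost O(1).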

Option 2: Hashtable
- Size of hashtable: number of expected non-0 results * 3 (r * 3)
- Fast lookup by docid – YES
  - Loc = hashfn(docid)
  - HT[Loc] += wt
  - O(c), where c is the number of collisions + 1
- Small space in memory – YES
  - O(r)
- Fast sort by total weight – MAYBE
  - Can use the same sort approaches as for Option 1: Array
  - O(A) == O(r), since A = 3r
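A sketch of a fixed-size hashtable accumulator with r * 3 slots. The slides do not say how collisions are resolved, so linear probing is an assumption here, as are the class name `HashAccumulator` and the use of Python's built-in `hash` as the hash function.

```python
class HashAccumulator:
    """Open-addressing table sized at 3x the expected number of results."""

    def __init__(self, expected_results):
        self.size = max(1, expected_results * 3)   # r * 3 slots
        self.keys = [None] * self.size
        self.weights = [0.0] * self.size

    def add(self, docid, wt):
        loc = hash(docid) % self.size              # Loc = hashfn(docid)
        for _ in range(self.size):                 # probe at most once per slot
            if self.keys[loc] is None:             # empty slot: first posting for docid
                self.keys[loc] = docid
                self.weights[loc] = wt
                return
            if self.keys[loc] == docid:            # found it: HT[Loc] += wt
                self.weights[loc] += wt
                return
            loc = (loc + 1) % self.size            # collision: try the next slot
        raise RuntimeError("accumulator full; expected_results was too small")

    def items(self):
        """Return the occupied (docid, wt) slots, skipping empty buckets."""
        return [(k, w) for k, w in zip(self.keys, self.weights) if k is not None]


acc = HashAccumulator(expected_results=2)          # 6 slots
for docid, wt in [(3, 0.5), (9, 1.0), (3, 0.25)]:  # 3 and 9 collide modulo 6
    acc.add(docid, wt)
```

In everyday Python code a plain `dict` would serve the same purpose; the explicit table is shown only to make the sizing and collision cost visible.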

Option 3: Heap
- Can bound the heap to approximately size p
- Height of the heap: h = log2 p
- Fast lookup by docid – NO
  - Must walk the whole heap, O(p)
- Small space in memory – YES
  - Store one element for each result you plan to present to the user (just keep the top p at any time)
  - O(p)
- Fast sort by total weight – YES
  - Results are always partially sorted
  - Just remove the top element iteratively to present results at the end
  - O(p log p) == heap sort
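A sketch of a heap bounded to size p, using Python's `heapq` module. It assumes each docid arrives once with its final total weight, since the heap offers no fast lookup for updating a partial score. A min-heap is used so the weakest kept result sits at the root and is cheap to evict; the name `top_p_heap` is hypothetical.

```python
import heapq


def top_p_heap(scored_docs, p):
    """Return the p highest-weighted (docid, wt) pairs, best first."""
    if p <= 0:
        return []
    heap = []  # min-heap of (wt, docid); heap[0] is the weakest kept result
    for docid, wt in scored_docs:
        if len(heap) < p:
            heapq.heappush(heap, (wt, docid))      # O(log p)
        elif wt > heap[0][0]:
            heapq.heapreplace(heap, (wt, docid))   # evict the weakest, O(log p)
    # Remove the root repeatedly (heap sort), then reverse to get best first.
    ordered = []
    while heap:
        wt, docid = heapq.heappop(heap)
        ordered.append((docid, wt))
    ordered.reverse()
    return ordered


best = top_p_heap([(1, 0.5), (2, 2.0), (3, 1.0), (4, 3.0)], p=2)
```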

Option 4: Hashtable + Heap
- Use both a hashtable AND a heap
- Both store pointers to nodes that contain (docid, total_weight)
- Fast lookup by docid – YES
  - O(c) in hashtable
- Small space in memory – YES
  - O(r) + O(p) for hashtable and heap
- Fast sort by total weight – YES
  - O(p log p) from heap
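A simplified sketch of the combined option. A Python `dict` plays the hashtable and points at shared nodes holding (docid, total_weight). As a simplification that the slides do not prescribe, the heap is applied once after accumulation through `heapq.nlargest`, rather than being maintained during it. The names `Node` and `rank` are illustrative.

```python
import heapq


class Node:
    """Shared record referenced by both the hashtable and the heap."""
    __slots__ = ("docid", "total_weight")

    def __init__(self, docid):
        self.docid = docid
        self.total_weight = 0.0


def rank(postings, p):
    table = {}                                   # hashtable: docid -> Node
    for docid, wt in postings:
        node = table.get(docid)                  # fast lookup by docid
        if node is None:
            node = table[docid] = Node(docid)    # only matching docs get a node
        node.total_weight += wt
    # Heap-based selection of the p best nodes, returned best first.
    best = heapq.nlargest(p, table.values(), key=lambda n: n.total_weight)
    return [(n.docid, n.total_weight) for n in best]


top = rank([(3, 0.5), (7, 1.0), (3, 0.75), (9, 0.25)], p=2)
```

The dictionary holds r nodes and the selection step keeps at most p of them at a time, in line with the O(r) + O(p) space estimate above.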