Document Indexing: SPIMI
Contents:
1. Single-pass in-memory indexing (SPIMI)
2. Distributed Indexing
3. Some simple examples
Problems With Earlier Approaches
SPIMI: Single-pass in-memory indexing (Sec. 4.3)
SPIMI-Invert (Sec. 4.3)
Merging of blocks is analogous to BSBI.
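The block-building and merging steps can be sketched as follows. This is a minimal illustration, not the reference algorithm from the cited texts: the names `spimi_invert`, `merge_blocks`, `build_index` and the `max_postings` limit are hypothetical, and in-memory lists stand in for the block files that a real system would write to disk.

```python
import heapq
from collections import defaultdict


def spimi_invert(token_stream, max_postings):
    """Build one block from an iterator of (term, doc_id) pairs.

    Terms go straight into a per-block dictionary (no global term IDs);
    the block is closed once it holds max_postings postings, which here
    stands in for "memory is full" (an assumption of this sketch).
    """
    dictionary = defaultdict(list)
    count = 0
    for term, doc_id in token_stream:
        postings = dictionary[term]
        if not postings or postings[-1] != doc_id:
            postings.append(doc_id)
            count += 1
        if count >= max_postings:
            break
    # Terms are sorted only when the block is finished.
    return sorted(dictionary.items())


def merge_blocks(blocks):
    """Merge sorted blocks into one index, as in the BSBI merge step."""
    merged = []
    for term, postings in heapq.merge(*blocks, key=lambda entry: entry[0]):
        if merged and merged[-1][0] == term:
            target = merged[-1][1]
            for doc_id in postings:
                if target[-1] != doc_id:  # a document may straddle two blocks
                    target.append(doc_id)
        else:
            merged.append((term, list(postings)))
    return merged


def build_index(pairs, max_postings):
    """Cut the token stream into blocks, then merge them."""
    stream = iter(pairs)
    blocks = []
    while True:
        block = spimi_invert(stream, max_postings)
        if not block:
            break
        blocks.append(block)
    return merge_blocks(blocks)


docs = [(1, "a b"), (2, "b c"), (3, "a c")]
pairs = [(term, doc_id) for doc_id, text in docs for term in text.split()]
index = build_index(pairs, max_postings=2)
```

Because documents are processed in increasing ID order and the merge visits blocks in creation order for equal terms, the merged postings lists stay sorted without an extra sort.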
SPIMI: Compression (Sec. 4.3)
Compression makes SPIMI even more efficient.
Compression of terms
Compression of postings
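The slides do not name a compression scheme. As one illustration of compressing postings, the sketch below stores the gaps between consecutive document IDs and encodes each gap with variable-byte codes; the choice of scheme and the function names (`vb_encode_number`, `compress_postings`, `decompress_postings`) are assumptions of this example.

```python
def vb_encode_number(n):
    """Variable-byte code: 7 payload bits per byte, high bit set on the last byte."""
    out = [n % 128]
    n //= 128
    while n > 0:
        out.insert(0, n % 128)
        n //= 128
    out[-1] += 128
    return bytes(out)


def vb_decode(data):
    """Decode a concatenation of variable-byte codes back into integers."""
    numbers, n = [], 0
    for byte in data:
        if byte < 128:
            n = 128 * n + byte
        else:
            numbers.append(128 * n + byte - 128)
            n = 0
    return numbers


def compress_postings(doc_ids):
    """Encode a sorted postings list as variable-byte coded gaps."""
    encoded, previous = b"", 0
    for doc_id in doc_ids:
        encoded += vb_encode_number(doc_id - previous)
        previous = doc_id
    return encoded


def decompress_postings(data):
    """Rebuild document IDs by summing the decoded gaps."""
    doc_ids, total = [], 0
    for gap in vb_decode(data):
        total += gap
        doc_ids.append(total)
    return doc_ids


postings = [824, 829, 215406]
packed = compress_postings(postings)
restored = decompress_postings(packed)
```

Here the three document IDs take six bytes instead of the twelve needed for three fixed 32-bit integers, because small gaps fit in a single byte.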
Distributed indexing (Sec. 4.4)
For web-scale indexing: must use a distributed computing cluster.
Individual machines are fault-prone: they can unpredictably slow down or fail.
How do we exploit such a pool of machines?
Distributed indexing (Sec. 4.4)
Uses a large number of inexpensive servers instead of a single expensive machine.
Maintains a master machine that directs the indexing job across the cluster; the master is considered "safe".
Breaks the indexing job into sets of (parallel) tasks and passes them to different machines (nodes).
The master machine assigns each task to an idle machine from a pool.
MapReduce is a distributed programming tool designed for indexing and analysis tasks.
Example “Collection” (Ref: Information Retrieval in Practice, Addison Wesley, 2008)
Simple Inverted Index (Ref: Information Retrieval in Practice, Addison Wesley, 2008)
Inverted Index with counts: supports better ranking algorithms (Ref: Information Retrieval in Practice, Addison Wesley, 2008)
Inverted Index with positions: supports proximity matches (Ref: Information Retrieval in Practice, Addison Wesley, 2008)
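The three index variants can be compared on a toy collection. The figures from the cited book are not reproduced here, so the documents below are made up for illustration, and `build_indexes` and `within` are hypothetical helper names; tokenization is a plain lowercase whitespace split.

```python
from collections import defaultdict


def build_indexes(docs):
    """Return (simple, counts, positions) inverted indexes for {doc_id: text}."""
    simple = defaultdict(list)     # term -> [doc_id, ...]
    counts = defaultdict(list)     # term -> [(doc_id, term_frequency), ...]
    positions = defaultdict(list)  # term -> [(doc_id, [word positions]), ...]
    for doc_id, text in sorted(docs.items()):
        where = defaultdict(list)
        for pos, term in enumerate(text.lower().split(), start=1):
            where[term].append(pos)
        for term, pos_list in where.items():
            simple[term].append(doc_id)
            counts[term].append((doc_id, len(pos_list)))
            positions[term].append((doc_id, pos_list))
    return simple, counts, positions


def within(positions, a, b, k):
    """Doc IDs in which terms a and b occur at most k words apart."""
    pa = dict(positions.get(a, []))
    pb = dict(positions.get(b, []))
    return [
        doc_id
        for doc_id in sorted(pa.keys() & pb.keys())
        if any(abs(i - j) <= k for i in pa[doc_id] for j in pb[doc_id])
    ]


docs = {
    1: "tropical fish include fish found in tropical environments",
    2: "fish live in water",
    3: "tropical water is warm for fish",
}
simple, counts, positions = build_indexes(docs)
adjacent = within(positions, "tropical", "fish", 1)
```

The simple index can only say that documents 1 and 3 contain both "tropical" and "fish"; the counts add term frequencies a ranking function can use, and the positions let `within` tell that the two words are adjacent only in document 1.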
Data flow (Sec. 4.4)
Fig: A simple Map-Reduce system (ref: Information Retrieval, Cambridge, 2009). The master assigns splits to parsers (map phase); each parser writes its output to segment files partitioned by term range (a-f, g-p, q-z); the master assigns each term range to an inverter (reduce phase), which produces the postings.
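The data flow can be imitated in a single process. This is only a sketch under simplifying assumptions: there is no real cluster, scheduling, or failure handling, in-memory lists stand in for segment files, terms are assumed to start with a letter a-z, and the names `parse`, `invert` and `run_job` are hypothetical.

```python
PARTITIONS = [("a", "f"), ("g", "p"), ("q", "z")]  # term ranges from the figure


def partition_of(term):
    """Index of the term range that the term's first letter falls into."""
    for i, (low, high) in enumerate(PARTITIONS):
        if low <= term[0] <= high:
            return i
    raise ValueError("term outside a-z: " + term)


def parse(split):
    """Map phase: turn a split of (doc_id, text) into one segment per term range."""
    segments = [[] for _ in PARTITIONS]
    for doc_id, text in split:
        for term in text.lower().split():
            segments[partition_of(term)].append((term, doc_id))
    return segments


def invert(pairs):
    """Reduce phase: sort (term, doc_id) pairs and collect postings lists."""
    postings = {}
    for term, doc_id in sorted(pairs):
        plist = postings.setdefault(term, [])
        if not plist or plist[-1] != doc_id:
            plist.append(doc_id)
    return postings


def run_job(splits):
    """Play the master: hand each split to a parser, each term range to an inverter."""
    parser_outputs = [parse(split) for split in splits]
    index = {}
    for p in range(len(PARTITIONS)):
        pairs = [pair for output in parser_outputs for pair in output[p]]
        index.update(invert(pairs))
    return index


splits = [
    [(1, "caesar came")],
    [(2, "caesar died"), (3, "rome grew")],
]
index = run_job(splits)
```

Each inverter sees every pair for its term range regardless of which parser produced it, which is why the parsers partition their output by term rather than by document.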
References
Information Retrieval, Cambridge, 2009.
Information Retrieval in Practice, Addison Wesley, 2008.
Original publication on SPIMI: Heinz and Zobel (2003).
Original publication on MapReduce: Dean and Ghemawat (2004).