Document Indexing: SPIMI


Contents:
1. Single-pass in-memory indexing (SPIMI)
2. Distributed indexing
3. Some simple examples

Problems With Earlier Approaches

SPIMI: Single-pass in-memory indexing (Sec. 4.3)

SPIMI-Invert (Sec. 4.3)
Merging of blocks is analogous to BSBI.
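The slide names SPIMI-Invert without showing it, so here is a minimal Python sketch of the idea under stated assumptions: the token stream is an iterable of (term, docID) pairs in docID order, "memory is full" is approximated by a hypothetical `max_postings` budget, and blocks are stored as JSON lines. The function names and the file layout are illustrative only, and the final merge collects the result in memory for brevity where a real indexer would stream it to disk.

```python
import heapq
import json
import os
import tempfile


def write_block(dictionary, block_dir, block_id):
    """Sort the terms of one in-memory block and write it to disk."""
    path = os.path.join(block_dir, f"block{block_id}.jsonl")
    with open(path, "w", encoding="utf-8") as f:
        for term in sorted(dictionary):
            f.write(json.dumps([term, dictionary[term]]) + "\n")
    return path


def spimi_invert(token_stream, block_dir, max_postings):
    """Single pass over (term, doc_id) pairs; returns the block file paths."""
    block_paths, dictionary, n_postings = [], {}, 0
    for term, doc_id in token_stream:
        # A new term goes straight into the block's own dictionary;
        # no global term-to-termID mapping is needed.
        postings = dictionary.setdefault(term, [])
        if not postings or postings[-1] != doc_id:
            postings.append(doc_id)
            n_postings += 1
        if n_postings >= max_postings:  # stand-in for "memory is full"
            block_paths.append(write_block(dictionary, block_dir, len(block_paths)))
            dictionary, n_postings = {}, 0
    if dictionary:
        block_paths.append(write_block(dictionary, block_dir, len(block_paths)))
    return block_paths


def read_block(path):
    """Yield (term, postings) entries of one block in sorted term order."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            term, postings = json.loads(line)
            yield term, postings


def merge_blocks(block_paths):
    """Multi-way merge of sorted blocks, as in BSBI."""
    merged = {}
    entries = heapq.merge(*(read_block(p) for p in block_paths), key=lambda e: e[0])
    for term, postings in entries:
        target = merged.setdefault(term, [])
        for doc_id in postings:
            if not target or target[-1] != doc_id:  # a document may span two blocks
                target.append(doc_id)
    return merged


def demo():
    # Hypothetical three-document collection, for illustration only.
    docs = {1: "new home sales", 2: "home sales rise", 3: "new rise"}
    stream = ((t, d) for d, text in sorted(docs.items()) for t in text.split())
    with tempfile.TemporaryDirectory() as tmp:
        blocks = spimi_invert(stream, tmp, max_postings=4)
        return len(blocks), merge_blocks(blocks)
```

With the budget of four postings per block, `demo()` writes two blocks and merges them into a single index with one sorted postings list per term.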

SPIMI: Compression (Sec. 4.3)
Compression makes SPIMI even more efficient.
Compression of terms
Compression of postings
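The slide does not say which compression scheme is meant. As one possible illustration of compressing postings, the sketch below stores gaps between docIDs and encodes each gap with variable-byte codes (7 payload bits per byte, high bit set on the last byte of each number). This is an assumed technique chosen for the example, not necessarily the one the slides have in mind, and the function names are hypothetical.

```python
def to_gaps(postings):
    """Replace each docID (after the first) by its distance to the previous one."""
    return [doc_id - prev for prev, doc_id in zip([0] + postings[:-1], postings)]


def vb_encode(numbers):
    """Variable-byte encode a list of positive integers."""
    out = bytearray()
    for n in numbers:
        chunk = [n % 128]
        n //= 128
        while n > 0:
            chunk.insert(0, n % 128)
            n //= 128
        chunk[-1] += 128  # mark the final byte of this number
        out.extend(chunk)
    return bytes(out)


def vb_decode(data):
    """Inverse of vb_encode."""
    numbers, n = [], 0
    for byte in data:
        if byte < 128:
            n = 128 * n + byte
        else:
            numbers.append(128 * n + byte - 128)
            n = 0
    return numbers


def from_gaps(gaps):
    """Rebuild docIDs from gaps by a running sum."""
    postings, total = [], 0
    for gap in gaps:
        total += gap
        postings.append(total)
    return postings
```

For the postings list `[824, 829, 215406]` the gaps are `[824, 5, 214577]`, which encode into six bytes; decoding and summing the gaps restores the original list.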

Distributed indexing (Sec. 4.4)
For web-scale indexing, we must use a distributed computing cluster.
Individual machines are fault-prone: they can unpredictably slow down or fail.
How do we exploit such a pool of machines?

Distributed indexing (Sec. 4.4)
Use a large number of inexpensive servers instead of a single expensive machine.
Maintain a master machine, considered "safe", that directs the indexing job across the cluster.
Break the indexing job into sets of (parallel) tasks and pass them to different machines (nodes).
The master machine assigns each task to an idle machine from a pool.
MapReduce is a distributed programming tool designed for indexing and analysis tasks.
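To make the master/worker arrangement concrete, here is a small single-process sketch in which a thread pool stands in for the pool of machines. The names `index_split` and `run_master` are hypothetical, and the sketch deliberately leaves out what a real cluster needs most: detecting slow or failed workers and reassigning their tasks.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def index_split(split):
    """One task: turn a split of (doc_id, text) pairs into sorted (term, doc_id) pairs."""
    pairs = set()
    for doc_id, text in split:
        for term in text.lower().split():
            pairs.add((term, doc_id))
    return sorted(pairs)


def run_master(splits, n_workers=2):
    """The 'master': hand each split to whichever worker is idle, then collect results."""
    results = {}
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # The executor keeps a queue of tasks and gives the next one
        # to any worker that becomes idle.
        futures = {pool.submit(index_split, s): i for i, s in enumerate(splits)}
        for future in as_completed(futures):
            results[futures[future]] = future.result()
    return [results[i] for i in range(len(splits))]
```

Calling `run_master([[(1, "a b")], [(2, "b c")]])` returns one sorted pair list per split, in the order the splits were given, regardless of which worker finished first.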

Example "Collection"
Ref: Information Retrieval in Practice, Addison Wesley, 2008

Simple Inverted Index
Ref: Information Retrieval in Practice, Addison Wesley, 2008

Inverted Index with counts
Supports better ranking algorithms.
Ref: Information Retrieval in Practice, Addison Wesley, 2008

Inverted Index with positions
Supports proximity matches.
Ref: Information Retrieval in Practice, Addison Wesley, 2008
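The figures for these three index variants did not survive in the transcript, so the following sketch builds all three from a tiny made-up collection. The two documents and the function name `build_indexes` are assumptions for illustration; they are not the example collection from the referenced book.

```python
from collections import defaultdict


def build_indexes(docs):
    """Return (simple, with_counts, with_positions) inverted indexes for docs."""
    simple = defaultdict(list)     # term -> [doc_id, ...]
    counts = defaultdict(list)     # term -> [(doc_id, term frequency), ...]
    positions = defaultdict(list)  # term -> [(doc_id, [word positions]), ...]
    for doc_id in sorted(docs):
        seen = defaultdict(list)
        for pos, term in enumerate(docs[doc_id].lower().split(), start=1):
            seen[term].append(pos)
        for term, pos_list in seen.items():
            simple[term].append(doc_id)
            counts[term].append((doc_id, len(pos_list)))
            positions[term].append((doc_id, pos_list))
    return dict(simple), dict(counts), dict(positions)


# Hypothetical two-document collection.
docs = {1: "tropical fish include fish", 2: "fish live in water"}
simple, counts, positions = build_indexes(docs)
```

Here `simple["fish"]` is `[1, 2]`, `counts["fish"]` is `[(1, 2), (2, 1)]`, and `positions["fish"]` is `[(1, [2, 4]), (2, [1])]`. The counts are what a ranking function can weight by, and the positions are what a proximity or phrase match needs to compare.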

Data flow (Sec. 4.4)
Fig: A simple Map-Reduce system (ref: Information Retrieval, Cambridge, 2009).
Map phase: the master assigns splits to parsers; each parser writes segment files partitioned by term range (a-f, g-p, q-z).
Reduce phase: the master assigns each term partition (a-f, g-p, q-z) to an inverter, which produces the postings.
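The data flow in the figure can be sketched in a single process as follows. The three term partitions (a-f, g-p, q-z) come from the figure; everything else, including the function names and the choice to send any term not starting with a-p to the last partition, is an assumption made for the example. In a real system each `parse` and `invert` call would run on a different machine chosen by the master.

```python
from collections import defaultdict

PARTITIONS = ("a-f", "g-p", "q-z")


def partition_of(term):
    """Pick the term partition by first letter."""
    first = term[0]
    if "a" <= first <= "f":
        return "a-f"
    if "g" <= first <= "p":
        return "g-p"
    return "q-z"  # assumption: everything else lands in the last partition


def parse(split):
    """Map phase: one parser emits (term, doc_id) pairs into per-partition segments."""
    segments = {p: [] for p in PARTITIONS}
    for doc_id, text in split:
        for term in text.lower().split():
            segments[partition_of(term)].append((term, doc_id))
    return segments


def invert(pairs):
    """Reduce phase: one inverter sorts its pairs and collects postings lists."""
    postings = defaultdict(list)
    for term, doc_id in sorted(set(pairs)):
        postings[term].append(doc_id)
    return dict(postings)


def build_index(splits):
    """Run all parsers, then one inverter per term partition."""
    all_segments = [parse(split) for split in splits]
    index = {}
    for p in PARTITIONS:
        pairs = [pair for segments in all_segments for pair in segments[p]]
        index[p] = invert(pairs)
    return index


# Hypothetical input: two splits with one document each.
splits = [[(1, "caesar came home")], [(2, "home was quiet")]]
index = build_index(splits)
```

Each inverter sees every pair for its term range, no matter which parser produced it, so the postings list for "home" ends up as `[1, 2]` in the g-p partition.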

References
Information Retrieval, Cambridge, 2009.
Information Retrieval in Practice, Addison Wesley, 2008.
Original publication on SPIMI: Heinz and Zobel (2003).
Original publication on MapReduce: Dean and Ghemawat (2004).