MR Applications with Optimizations for Performance and Scalability
Ch. 4, Lin and Dyer
1/13/2019
Inverted Indexing for Text Retrieval
- Web search is inherently a big-data problem.
- Common misconceptions:
  - that the search engine goes into data-gathering mode after the user types in the search words
  - that search is executed directly by the MR programming model
- In reality, data is prepared ahead of time and curated before the search is applied to the well-positioned data.
  - The data is analyzed to discover patterns and other information.
  - Indices are generated for scalable access to the data.
- Search engines rely on a data structure called an inverted index.
  - A regular index provides the location of an item within a document. Example: an index on the primary key in a relational database.
  - An inverted index provides the list of documents that a term is found in, plus other details such as frequency, proximity to something, hits, etc.
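The regular-vs-inverted distinction can be shown in a few lines. This is a minimal sketch, not the chapter's implementation; the documents and terms are made up:

```python
# A forward index maps each document to its terms; an inverted index
# maps each term to the documents it appears in.
docs = {
    "d1": ["web", "search", "engine"],
    "d2": ["search", "index"],
    "d3": ["web", "index", "engine"],
}

inverted = {}
for docid, terms in docs.items():
    for term in terms:
        inverted.setdefault(term, []).append(docid)

print(inverted["search"])  # → ['d1', 'd2']
```

Looking up a term now returns the postings list directly, instead of requiring a scan over every document.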
The Analysis of the Whole Problem
- The web search problem decomposes into 3 problems:
  - Gathering web content (web crawling)
  - Construction of the inverted index (indexing)
  - Ranking documents given a query (retrieval)
- The first two are offline problems. They need to be scalable and efficient, but do not have to operate in real time; updates can be made incrementally as the content changes.
- Retrieval is an online problem that demands stringent timings: sub-second response times.
  - Concurrent queries
  - Query latency
  - Load on the servers
  - Other circumstances: time of day
- Resource consumption for retrieval can be spiky or highly variable; the resource requirements for indexing are more predictable.
Web Crawling
- Start with a "seed" URL (say, a Wikipedia page) and start collecting the content by following the links in the seed page; the depth of traversal is also specified by the input.
- What are the issues? See page 67.
Inverted Index
- An inverted index consists of postings lists, one associated with each term that appears in the corpus:
  - <t, posting>
  - <t, <docid, tf>>
  - <t, <docid, tf, other info>>
- A key/value pair where the key is the term (word) and the value is the docid, followed by a "payload":
  - The payload can be empty for a simple index.
  - The payload can be complex, providing details such as co-occurrences, additional linguistic processing, page rank of the doc, etc.
- Example postings lists:
  - <t2, <d1, d4, d67, d89>>
  - <t3, <d4, d6, d7, d9, d22>>
- Document numbers typically do not have semantic content, but docs from the same corpus are numbered together, or the numbers could be assigned based on page ranks.
Inverted Index: Baseline Implementation Using MR
- Input to the mapper consists of a docid and the actual content.
- Each document is analyzed and broken down into terms.
- Processing pipeline, assuming HTML docs:
  - Strip HTML tags
  - Strip JavaScript code
  - Tokenize using a set of delimiters
  - Case-fold
  - Remove stop words (a, an, the, ...)
  - Remove domain-specific stop words
  - Stem different forms (-ing, -ed; dogs becomes dog)
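The pipeline above can be sketched in a few lines of Python. The stop-word list, regexes, and strip-trailing-s stemmer are illustrative placeholders, not the chapter's actual implementation:

```python
import re

STOP_WORDS = {"a", "an", "the"}  # a stand-in for a real stop-word list

def normalize(html):
    text = re.sub(r"<script.*?</script>", " ", html, flags=re.S | re.I)  # strip JavaScript
    text = re.sub(r"<[^>]+>", " ", text)                                 # strip HTML tags
    tokens = re.split(r"[^A-Za-z]+", text)                               # tokenize on delimiters
    tokens = [t.lower() for t in tokens if t]                            # case-fold
    tokens = [t for t in tokens if t not in STOP_WORDS]                  # remove stop words
    tokens = [t[:-1] if t.endswith("s") else t for t in tokens]          # crude stand-in for stemming
    return tokens

print(normalize("<p>The dogs ran</p>"))  # → ['dog', 'ran']
```

A production indexer would use a proper tokenizer and a real stemmer (e.g., Porter), but the stages and their order are the same.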
Baseline MR for II

class Mapper
    procedure Map(docid n, doc d)
        H = new AssociativeArray
        for all term t in doc d do
            H[t] = H[t] + 1
        for all term t in H do
            Emit(term t, posting <n, H[t]>)

class Reducer
    procedure Reduce(term t, postings [<n1, f1>, <n2, f2>, ...])
        P = new List
        for all posting <n, f> in postings do
            Append(P, <n, f>)
        Sort(P)
        Emit(term t, postings P)
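The baseline algorithm can be simulated locally in Python, with a dictionary standing in for the framework's shuffle. The documents are made up; in a real job the grouping step is done by the MR runtime, not by user code:

```python
from collections import Counter, defaultdict

docs = {1: "hello world hello", 2: "world of data"}

# Map: for each document, emit (term, (docid, term-frequency)) pairs.
intermediate = []
for docid, text in docs.items():
    for term, tf in Counter(text.split()).items():
        intermediate.append((term, (docid, tf)))

# Shuffle: group postings by term (the framework does this in real MR).
grouped = defaultdict(list)
for term, posting in intermediate:
    grouped[term].append(posting)

# Reduce: sort each postings list by docid and emit it.
index = {term: sorted(postings) for term, postings in grouped.items()}
print(index["world"])  # → [(1, 1), (2, 1)]
```

Note that the reducer must buffer and sort each postings list in memory, which is exactly the cost the revised implementation below the baseline removes.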
Sort and Shuffle Phase
- The MR runtime performs a large, distributed group-by of the postings by term.
- Without any additional effort by the programmer, the execution framework brings together all the postings that belong in the same postings list. This reduces the work of the reducer.
- Note the sort at the end of the reducer, which orders the list by docid; the shuffle sorts by key, not by value.
- The number of index files depends on the number of reducers. See Figure 4.3.
- There is no need to consolidate the reducer output files.
- This is a very concise implementation of II.
Inverted Index: Revised Implementation
- From the baseline to an improved version: observe the sort done by the reducer. Is there any way to push it into the MR runtime?
- Instead of emitting (term t, posting <docid, f>), emit (tuple <t, docid>, tf f).
- This is known as the value-to-key conversion design pattern.
- This switch ensures the keys arrive at the reducer in sorted order.
- Small memory footprint; less buffer space is needed at the reducer.
- See Fig. 4.4.
Improved MR for II

class Mapper
    method Map(docid n, doc d)
        H = new AssociativeArray
        for all term t in doc d do
            H[t] = H[t] + 1
        for all term t in H do
            Emit(tuple <t, n>, tf H[t])

class Reducer
    method Initialize
        tprev = null
        P = new PostingsList
    method Reduce(tuple <t, n>, tf [f])
        if t != tprev and tprev != null then
            Emit(term tprev, postings P)
            P.Reset()
        P.Add(<n, f>)
        tprev = t
    method Close
        Emit(term tprev, postings P)
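The value-to-key pattern can also be simulated locally. Sorting the (term, docid) composite keys stands in for the framework's shuffle sort, so the reducer only streams and detects term boundaries, with no in-memory sort. The documents are made up:

```python
from collections import Counter

docs = {1: "big data big", 2: "data tools"}

# Map: emit ((term, docid), tf) pairs.
pairs = []
for docid, text in docs.items():
    for term, tf in Counter(text.split()).items():
        pairs.append(((term, docid), tf))

# The shuffle sorts by the composite key; simulate it here.
pairs.sort()

# Reduce: postings for each term now arrive in docid order.
index, postings, tprev = {}, [], None
for (term, docid), tf in pairs:
    if term != tprev and tprev is not None:
        index[tprev] = postings   # term boundary: emit previous list
        postings = []
    postings.append((docid, tf))
    tprev = term
if tprev is not None:             # Close: flush the final postings list
    index[tprev] = postings

print(index["data"])  # → [(1, 1), (2, 1)]
```

The boundary test and the final flush mirror the Reduce and Close methods in the pseudocode above.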
Index Compression for Space
- Section 4.5
- Postings with absolute docids: (5,2), (7,3), (12,1), (49,1), (51,2) ...
- The same postings with docid gaps (d-gaps): (5,2), (2,3), (5,1), (37,1), (2,2) ...
- Storing the difference between successive docids yields smaller numbers, which compress better with variable-length codes.
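The d-gap transformation on the slide is a two-line loop in each direction. This sketch reproduces the slide's example; the function names are mine, not from the text:

```python
def gap_encode(postings):
    # Replace each docid with its difference from the previous docid.
    prev, out = 0, []
    for docid, tf in postings:
        out.append((docid - prev, tf))
        prev = docid
    return out

def gap_decode(gaps):
    # Recover absolute docids by a running sum over the gaps.
    total, out = 0, []
    for gap, tf in gaps:
        total += gap
        out.append((total, tf))
    return out

postings = [(5, 2), (7, 3), (12, 1), (49, 1), (51, 2)]
print(gap_encode(postings))  # → [(5, 2), (2, 3), (5, 1), (37, 1), (2, 2)]
```

Decoding forces sequential traversal of the list, which is why postings lists are typically read front to back anyway.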
What About Retrieval?
- While MR is great for indexing, it is not great for retrieval.