Published by Ada Saarinen. Modified over 6 years ago.
MR Application with optimizations for performance and scalability
Ch. 4, Lin and Dyer 1/13/2019
Inverted Indexing for Text Retrieval
Web search is inherently a big-data problem.
Common misconceptions:
  that the search engine starts gathering data only after the user types in the search term
  that the search is executed directly by the MR programming model
In reality, data is prepared and curated ahead of time, and the search is applied to this well-positioned data.
Data is analyzed to discover patterns and other information.
Indices are generated for scalable access to the data.
Search engines rely on a data structure called an inverted index.
A regular index provides the location of an item within a document.
  Example: an index on the primary key in a relational database
An inverted index provides the list of documents that a term is found in, and other details such as frequency, proximity, hits, etc.
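The term-to-documents mapping can be sketched in a few lines of Python. This is a minimal illustration only: the toy corpus, the whitespace tokenizer, and the function name `build_inverted_index` are assumptions made for the example, not part of the slides.

```python
from collections import defaultdict

# Hypothetical toy corpus: docid -> document text
docs = {
    1: "one fish two fish",
    2: "red fish blue fish",
    3: "one red bird",
}

def build_inverted_index(corpus):
    """Map each term to the sorted list of docids that contain it."""
    index = defaultdict(set)
    for docid, text in corpus.items():
        for term in text.split():  # naive whitespace tokenization
            index[term].add(docid)
    return {term: sorted(ids) for term, ids in index.items()}

inverted = build_inverted_index(docs)
print(inverted["fish"])  # docids of all documents containing "fish"
```

Looking up a term is then a single dictionary access, which is why the index is built ahead of time rather than at query time.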
The Analysis of the whole problem
The web search problem decomposes into three problems:
  Gathering web content (web crawling)
  Construction of the inverted index (indexing)
  Ranking documents given a query (retrieval)
The first two are offline problems. They need to be scalable and efficient but do not have to operate in real time; updates can be made incrementally as the content changes.
Retrieval is an online problem with stringent timing demands: sub-second response times. It has to deal with:
  Concurrent queries
  Query latency
  Load on the servers
  Other circumstances: time of day
Resource consumption for retrieval can be spiky or highly variable.
The resource requirement for indexing is more predictable.
Web Crawling
Start with a “seed” URL, say a Wikipedia page, and collect content by following the links in the seed page; the depth of traversal is also specified in the input.
What are the issues? See page 67.
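The seed-plus-depth traversal can be sketched as a breadth-first search. To keep the example runnable without network access, the web is replaced by a hypothetical in-memory link graph (`LINKS`), and `fetch_links` is an assumed callback that a real crawler would implement with HTTP requests and HTML parsing.

```python
from collections import deque

# Hypothetical link graph standing in for the web: URL -> outgoing links
LINKS = {
    "seed": ["a", "b"],
    "a": ["c", "seed"],
    "b": ["c"],
    "c": ["d"],
    "d": [],
}

def crawl(seed, max_depth, fetch_links):
    """Breadth-first traversal from seed, visiting each URL at most once."""
    visited = [seed]
    seen = {seed}
    queue = deque([(seed, 0)])
    while queue:
        url, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not follow links beyond the depth limit
        for link in fetch_links(url):
            if link not in seen:  # skip pages already discovered
                seen.add(link)
                visited.append(link)
                queue.append((link, depth + 1))
    return visited

pages = crawl("seed", 2, lambda url: LINKS.get(url, []))
print(pages)
```

The `seen` set handles one of the obvious issues, cycles and duplicate links; politeness, scale, and content freshness are outside the scope of this sketch.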
Inverted Index
An inverted index consists of postings lists, one associated with each term that appears in the corpus.
  <t, posting>^n
  <t, <docid, tf>>^n
  <t, <docid, tf, other info>>^n
Each entry is a key-value pair where the key is the term (word) and the value is the docid, followed by a “payload”.
  The payload can be empty for a simple index.
  The payload can be complex: it provides such details as co-occurrences, additional linguistic processing, page rank of the doc, etc.
Examples:
  <t2, <d1, d4, d67, d89>>
  <t3, <d4, d6, d7, d9, d22>>
Document numbers typically do not have semantic content, but docs from the same corpus are numbered together, or the numbers could be assigned based on page rank.
Inverted Index: Baseline Implementation using MR
Input to the mapper consists of the docid and the actual content.
Each document is analyzed and broken down into terms.
Processing pipeline, assuming HTML docs:
  Strip HTML tags
  Strip JavaScript code
  Tokenize using a set of delimiters
  Case fold
  Remove stop words (a, an, the, …)
  Remove domain-specific stop words
  Stem different forms (..ing, ..ed, …; dogs → dog)
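The pipeline steps above can be sketched with the standard `re` module. Everything here is a simplification for illustration: the stop-word set is an assumed three-word list, regex-based tag stripping is not a robust HTML parser, and `naive_stem` is a crude suffix stripper rather than a real stemming algorithm.

```python
import re

STOP_WORDS = {"a", "an", "the"}  # assumed, illustrative stop-word list

def strip_markup(html):
    # Remove <script> blocks first so their code is not treated as text
    html = re.sub(r"<script\b.*?</script>", " ", html,
                  flags=re.DOTALL | re.IGNORECASE)
    # Then remove the remaining tags
    return re.sub(r"<[^>]+>", " ", html)

def naive_stem(token):
    # Crude suffix stripping for illustration only; not a real stemmer
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def analyze(html):
    text = strip_markup(html)
    tokens = re.split(r"[^A-Za-z0-9]+", text)            # tokenize on delimiters
    tokens = [t.lower() for t in tokens if t]            # case fold, drop empties
    tokens = [t for t in tokens if t not in STOP_WORDS]  # remove stop words
    return [naive_stem(t) for t in tokens]               # stem

page = "<html><script>var x = 1;</script><p>The dogs barked at a walking dog</p></html>"
terms = analyze(page)
print(terms)
```

A domain-specific stop-word step would simply be a second filter with a different word set.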
Baseline MR for II
class Mapper
  procedure Map(docid n, doc d)
    H = new AssociativeArray
    for all term t in doc d do
      H[t] = H[t] + 1
    for all term t in H do
      Emit(term t, posting (n, H[t]))
class Reducer
  procedure Reduce(term t, postings [(n1, f1), (n2, f2), ...])
    P = new List
    for all posting (n, f) in postings [(n1, f1), (n2, f2), ...] do
      Append(P, (n, f))
    Sort(P)
    Emit(term t, postings P)
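The baseline algorithm can be simulated in a single process to see what each phase contributes. This is only a sketch: the names `map_doc`, `reduce_term`, and `run_baseline` are made up for the example, a dictionary stands in for the framework's shuffle, and tokenization is a plain whitespace split.

```python
from collections import Counter, defaultdict

def map_doc(docid, text):
    """Mapper: emit (term, (docid, tf)) for every distinct term in the doc."""
    counts = Counter(text.split())
    for term, tf in counts.items():
        yield term, (docid, tf)

def reduce_term(term, postings):
    """Reducer: buffer all postings for the term, then sort them by docid."""
    return term, sorted(postings)

def run_baseline(corpus):
    groups = defaultdict(list)  # stand-in for the shuffle: group values by key
    for docid, text in corpus.items():
        for term, posting in map_doc(docid, text):
            groups[term].append(posting)
    # Keys reach the reducers in sorted order; values arrive in no guaranteed order
    return dict(reduce_term(t, p) for t, p in sorted(groups.items()))

# Documents deliberately listed out of docid order
corpus = {3: "one red bird", 1: "one fish two fish", 2: "red fish blue fish"}
index = run_baseline(corpus)
print(index["fish"])
```

The `sorted(postings)` call is the in-reducer sort: it requires the whole postings list for a term to fit in the reducer's memory.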
Sort and Shuffle Phase
The MR runtime performs a large, distributed group-by of the postings by term.
Without any additional effort by the programmer, the execution framework brings together all the postings that belong in the same postings list. This reduces the work of the reducer.
Note the sort at the end of the reducer, which orders the list by docid.
Note that the shuffle sorts by key, not by value.
The number of index files depends on the number of reducers. See Figure 4.3.
There is no need to consolidate the reducer output files.
This is a very concise implementation of II.
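How the number of reducers determines the number of index files can be illustrated with a small partitioning sketch. The hash-modulo `partition` function is an assumption modeled on a typical default hash partitioner; `zlib.crc32` is used only because it is stable across Python runs, and the sample pairs are invented.

```python
import zlib
from collections import defaultdict

def partition(term, num_reducers):
    # crc32 is stable across runs, unlike Python's built-in hash() for strings
    return zlib.crc32(term.encode("utf-8")) % num_reducers

def shuffle(pairs, num_reducers):
    """Group (term, posting) pairs by reducer, then by term in sorted key order."""
    parts = [defaultdict(list) for _ in range(num_reducers)]
    for term, posting in pairs:
        parts[partition(term, num_reducers)][term].append(posting)
    # Keys are sorted within each partition; values keep their arrival order
    return [dict(sorted(p.items())) for p in parts]

pairs = [("fish", (1, 2)), ("red", (2, 1)), ("fish", (2, 2)),
         ("bird", (3, 1)), ("red", (3, 1))]
files = shuffle(pairs, 3)
for i, part in enumerate(files):
    print(f"reducer {i}: {part}")
```

Each term lands in exactly one partition, so a lookup only needs the one file that holds the term, which is why the output files do not have to be merged.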
Inverted Index: Revised implementation
From the baseline to an improved version
Observe the sort done by the reducer. Is there any way to push this into the MR runtime?
Instead of emitting (term t, posting <docid, f>), emit (tuple <t, docid>, tf f).
This is known as the value-to-key conversion design pattern.
This switch ensures that the postings for a term arrive at the reducer already sorted by docid.
Small memory footprint; less buffer space needed at the reducer.
See Fig. 4.4.
Improved MR for II
class Mapper
  method Map(docid n, doc d)
    H = new AssociativeArray
    for all term t in doc d do
      H[t] = H[t] + 1
    for all term t in H do
      Emit(tuple <t, n>, tf H[t])
class Reducer
  method Initialize
    tprev = ∅
    P = new PostingsList
  method Reduce(tuple <t, n>, tf [f])
    if t ≠ tprev and tprev ≠ ∅ then
      Emit(term tprev, postings P)
      P.Reset()
    P.Add(<n, f>)
    tprev = t
  method Close
    Emit(term tprev, postings P)
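The revised reducer can be simulated in the same single-process style. This is a sketch under assumptions: the class and function names are invented, a plain `list.sort` on the composite key stands in for the framework's sort, and the final flush in `close` reflects the usual reading of the Close method (emit the last pending postings list).

```python
from collections import Counter

def map_doc(docid, text):
    """Mapper: emit ((term, docid), tf) so the framework sorts by term, then docid."""
    for term, tf in Counter(text.split()).items():
        yield (term, docid), tf

class Reducer:
    def __init__(self):
        self.prev = None    # term of the postings list being built
        self.postings = []
        self.output = []

    def reduce(self, key, tf):
        term, docid = key
        if self.prev is not None and term != self.prev:
            # Term changed: the previous term's list is complete, so emit it
            self.output.append((self.prev, self.postings))
            self.postings = []
        self.postings.append((docid, tf))  # already in docid order, no sort needed
        self.prev = term

    def close(self):
        if self.prev is not None:  # flush the last postings list
            self.output.append((self.prev, self.postings))

def run_improved(corpus):
    pairs = [kv for docid, text in corpus.items() for kv in map_doc(docid, text)]
    pairs.sort(key=lambda kv: kv[0])  # stand-in for the framework's sort on the key
    reducer = Reducer()
    for key, tf in pairs:
        reducer.reduce(key, tf)
    reducer.close()
    return reducer.output

corpus = {3: "one red bird", 1: "one fish two fish", 2: "red fish blue fish"}
result = run_improved(corpus)
print(result)
```

In a real job, a custom partitioner that looks only at the term would also be needed so that all tuples for one term reach the same reducer; the single reducer in this sketch sidesteps that.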
Index compression for space
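Section 4.5
Postings with absolute docids: (5,2), (7,3), (12,1), (49,1), (51,2)…
The same postings with each docid replaced by its gap from the previous docid: (5,2), (2,3), (5,1), (37,1), (2,2)…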
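The relationship between the two sequences on this slide can be checked with a short gap-encoding sketch. The function names `to_gaps` and `from_gaps` are made up for the example; the sketch covers only the gap transformation, not the variable-length integer coding that would normally follow it to actually save space.

```python
def to_gaps(postings):
    """Replace each docid with its difference from the previous docid."""
    prev = 0
    gapped = []
    for docid, tf in postings:
        gapped.append((docid - prev, tf))
        prev = docid
    return gapped

def from_gaps(gapped):
    """Recover absolute docids by accumulating the gaps."""
    docid = 0
    postings = []
    for gap, tf in gapped:
        docid += gap
        postings.append((docid, tf))
    return postings

postings = [(5, 2), (7, 3), (12, 1), (49, 1), (51, 2)]
gaps = to_gaps(postings)
print(gaps)
```

Gap encoding relies on the postings being sorted by docid, so that every gap is a small positive number.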
What about retrieval?
While MR is great for indexing, it is not great for retrieval.