MR Applications with Optimizations for Performance and Scalability
Ch. 4, Lin and Dyer
1/13/2019
Inverted Indexing for Text Retrieval
- Web search is inherently a big-data problem.
- Common misconceptions:
  - that the search engine goes into data-gathering mode after the user types in the search words
  - that search is executed directly by the MR programming model
- In reality, data is prepared ahead of time and curated before the search is applied to the well-positioned data.
  - The data is analyzed to discover patterns and other information.
  - Indices are generated for scalable access to the data.
- Search engines rely on a data structure called an inverted index.
  - A regular index provides the location of an item within a document. Example: an index on the primary key in a relational database.
  - An inverted index provides the list of documents that a term is found in, plus other details such as frequency, proximity to something, hits, etc.
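The regular-vs-inverted distinction can be shown in a few lines. This is a minimal sketch, not the chapter's implementation; the documents and terms are made up:

```python
# A forward index maps each document to its terms; an inverted index
# maps each term to the documents it appears in.
docs = {
    "d1": ["web", "search", "engine"],
    "d2": ["search", "index"],
    "d3": ["web", "index", "engine"],
}

inverted = {}
for docid, terms in docs.items():
    for term in terms:
        inverted.setdefault(term, []).append(docid)

print(inverted["search"])  # → ['d1', 'd2']
```

Looking up a term now returns the postings list directly, instead of requiring a scan over every document.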
The Analysis of the Whole Problem
- The web search problem decomposes into 3 problems:
  - Gathering web content (web crawling)
  - Construction of the inverted index (indexing)
  - Ranking documents given a query (retrieval)
- The first two are offline problems. They need to be scalable and efficient, but do not have to operate in real time; updates can be made incrementally as the content changes.
- Retrieval is an online problem that demands stringent timings: sub-second response times.
  - Concurrent queries
  - Query latency
  - Load on the servers
  - Other circumstances: time of day
- Resource consumption for retrieval can be spiky or highly variable; the resource requirements for indexing are more predictable.
Web Crawling
- Start with a "seed" URL (say, a Wikipedia page) and start collecting the content by following the links in the seed page; the depth of traversal is also specified by the input.
- What are the issues? See page 67.
Inverted Index
- An inverted index consists of postings lists, one associated with each term that appears in the corpus:
  - <t, posting>
  - <t, <docid, tf>>
  - <t, <docid, tf, other info>>
- A key/value pair where the key is the term (word) and the value is the docid, followed by a "payload":
  - The payload can be empty for a simple index.
  - The payload can be complex, providing details such as co-occurrences, additional linguistic processing, page rank of the doc, etc.
- Example postings lists:
  - <t2, <d1, d4, d67, d89>>
  - <t3, <d4, d6, d7, d9, d22>>
- Document numbers typically do not have semantic content, but docs from the same corpus are numbered together, or the numbers could be assigned based on page ranks.
Inverted Index: Baseline Implementation Using MR
- Input to the mapper consists of a docid and the actual content.
- Each document is analyzed and broken down into terms.
- Processing pipeline, assuming HTML docs:
  - Strip HTML tags
  - Strip JavaScript code
  - Tokenize using a set of delimiters
  - Case-fold
  - Remove stop words (a, an, the, ...)
  - Remove domain-specific stop words
  - Stem different forms (-ing, -ed; dogs becomes dog)
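The pipeline above can be sketched in a few lines of Python. The stop-word list, regexes, and strip-trailing-s stemmer are illustrative placeholders, not the chapter's actual implementation:

```python
import re

STOP_WORDS = {"a", "an", "the"}  # a stand-in for a real stop-word list

def normalize(html):
    text = re.sub(r"<script.*?</script>", " ", html, flags=re.S | re.I)  # strip JavaScript
    text = re.sub(r"<[^>]+>", " ", text)                                 # strip HTML tags
    tokens = re.split(r"[^A-Za-z]+", text)                               # tokenize on delimiters
    tokens = [t.lower() for t in tokens if t]                            # case-fold
    tokens = [t for t in tokens if t not in STOP_WORDS]                  # remove stop words
    tokens = [t[:-1] if t.endswith("s") else t for t in tokens]          # crude stand-in for stemming
    return tokens

print(normalize("<p>The dogs ran</p>"))  # → ['dog', 'ran']
```

A production indexer would use a proper tokenizer and a real stemmer (e.g., Porter), but the stages and their order are the same.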
Baseline MR for II

class Mapper
    procedure Map(docid n, doc d)
        H = new AssociativeArray
        for all term t in doc d do
            H[t] = H[t] + 1
        for all term t in H do
            Emit(term t, posting <n, H[t]>)

class Reducer
    procedure Reduce(term t, postings [<n1, f1>, <n2, f2>, ...])
        P = new List
        for all posting <n, f> in postings do
            Append(P, <n, f>)
        Sort(P)
        Emit(term t, postings P)
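The baseline algorithm can be simulated locally in Python, with a dictionary standing in for the framework's shuffle. The documents are made up; in a real job the grouping step is done by the MR runtime, not by user code:

```python
from collections import Counter, defaultdict

docs = {1: "hello world hello", 2: "world of data"}

# Map: for each document, emit (term, (docid, term-frequency)) pairs.
intermediate = []
for docid, text in docs.items():
    for term, tf in Counter(text.split()).items():
        intermediate.append((term, (docid, tf)))

# Shuffle: group postings by term (the framework does this in real MR).
grouped = defaultdict(list)
for term, posting in intermediate:
    grouped[term].append(posting)

# Reduce: sort each postings list by docid and emit it.
index = {term: sorted(postings) for term, postings in grouped.items()}
print(index["world"])  # → [(1, 1), (2, 1)]
```

Note that the reducer must buffer and sort each postings list in memory, which is exactly the cost the revised implementation below the baseline removes.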
Sort and Shuffle Phase
- The MR runtime performs a large, distributed group-by of the postings by term.
- Without any additional effort by the programmer, the execution framework brings together all the postings that belong in the same postings list. This reduces the work of the reducer.
- Note the sort at the end of the reducer, which orders the list by docid; the shuffle sorts by key, not by value.
- The number of index files depends on the number of reducers. See Figure 4.3.
- There is no need to consolidate the reducer output files.
- This is a very concise implementation of II.
Inverted Index: Revised Implementation
- From the baseline to an improved version: observe the sort done by the reducer. Is there any way to push it into the MR runtime?
- Instead of emitting (term t, posting <docid, f>), emit (tuple <t, docid>, tf f).
- This is known as the value-to-key conversion design pattern.
- This switch ensures the keys arrive at the reducer in sorted order.
- Small memory footprint; less buffer space is needed at the reducer.
- See Fig. 4.4.
Improved MR for II

class Mapper
    method Map(docid n, doc d)
        H = new AssociativeArray
        for all term t in doc d do
            H[t] = H[t] + 1
        for all term t in H do
            Emit(tuple <t, n>, tf H[t])

class Reducer
    method Initialize
        tprev = null
        P = new PostingsList
    method Reduce(tuple <t, n>, tf [f])
        if t != tprev and tprev != null then
            Emit(term tprev, postings P)
            P.Reset()
        P.Add(<n, f>)
        tprev = t
    method Close
        Emit(term tprev, postings P)
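The value-to-key pattern can also be simulated locally. Sorting the (term, docid) composite keys stands in for the framework's shuffle sort, so the reducer only streams and detects term boundaries, with no in-memory sort. The documents are made up:

```python
from collections import Counter

docs = {1: "big data big", 2: "data tools"}

# Map: emit ((term, docid), tf) pairs.
pairs = []
for docid, text in docs.items():
    for term, tf in Counter(text.split()).items():
        pairs.append(((term, docid), tf))

# The shuffle sorts by the composite key; simulate it here.
pairs.sort()

# Reduce: postings for each term now arrive in docid order.
index, postings, tprev = {}, [], None
for (term, docid), tf in pairs:
    if term != tprev and tprev is not None:
        index[tprev] = postings   # term boundary: emit previous list
        postings = []
    postings.append((docid, tf))
    tprev = term
if tprev is not None:             # Close: flush the final postings list
    index[tprev] = postings

print(index["data"])  # → [(1, 1), (2, 1)]
```

The boundary test and the final flush mirror the Reduce and Close methods in the pseudocode above.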
Index Compression for Space
- Section 4.5
- Postings with absolute docids: (5,2), (7,3), (12,1), (49,1), (51,2) ...
- The same postings with docid gaps (d-gaps): (5,2), (2,3), (5,1), (37,1), (2,2) ...
- Storing the difference between successive docids yields smaller numbers, which compress better with variable-length codes.
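The d-gap transformation on the slide is a two-line loop in each direction. This sketch reproduces the slide's example; the function names are mine, not from the text:

```python
def gap_encode(postings):
    # Replace each docid with its difference from the previous docid.
    prev, out = 0, []
    for docid, tf in postings:
        out.append((docid - prev, tf))
        prev = docid
    return out

def gap_decode(gaps):
    # Recover absolute docids by a running sum over the gaps.
    total, out = 0, []
    for gap, tf in gaps:
        total += gap
        out.append((total, tf))
    return out

postings = [(5, 2), (7, 3), (12, 1), (49, 1), (51, 2)]
print(gap_encode(postings))  # → [(5, 2), (2, 3), (5, 1), (37, 1), (2, 2)]
```

Decoding forces sequential traversal of the list, which is why postings lists are typically read front to back anyway.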
What About Retrieval?
- While MR is great for indexing, it is not great for retrieval.