Presentation is loading. Please wait.

Presentation is loading. Please wait.

Map-Reduce examples 1. So, what is it? A two phase process geared toward optimizing broad, widely distributed parallel computing platforms Apache Hadoop.

Similar presentations


Presentation on theme: "Map-Reduce examples 1. So, what is it? A two phase process geared toward optimizing broad, widely distributed parallel computing platforms Apache Hadoop."— Presentation transcript:

1 Map-Reduce examples 1

2 So, what is it? A two phase process geared toward optimizing broad, widely distributed parallel computing platforms Apache Hadoop is a MapReduce file system. MapReduce is Googles version (and it is proprietary). Phases 1. Take a series of keys and transform them into a different series of values, generally, ones that have some semantic context 2. Perform a second pass where the new series of values are compressed into far fewer values 2

3 In its strictest sense… Map-reduce is a two phase operation First, convert a list of data into a list of a different kind of data Second, turn the second list into a single or a list of scalar values, often the cardinality of the items created in the first step 3

4 Relevant computing/data application types For aggregate database processing, and not so much for set-oriented, and certainly not for object-based querying Fits well with cluster-based environments, where there are lots of opportunities for parallel processing Fits query patterns that calculate the cardinality of sets and the removal of duplicates 4

5 Strategy for M-R We try to do the computing on the machines where the data sits So we try to engineer the storage of data so that it accommodates the chaining of M-R operations 5

6 The key bottom line concept In a relational database, we try to minimize the I/O costs of moving large volumes of data from the server to the client, so that it can then be scanned and aggregated In a database that supports MP, we trying to screen (and sometimes aggregate) data on the server where it sits We also use parallel processing within cluster servers to minimize the cost of doing that aggregation if it cannot all be done on a single server housing the original data. 6

7 Another way of looking at this… We have seen the tradeoff between moving data and moving processing logic in the context of distributed, homogenous distributed data Often, in distributed databases, it is far cheaper to ship processing logic instead of data, even if it causes extra processing to have to happen This is another context in which we often choose to send processing code to a server in order to minimize the movement of large volumes of data 7

8 Example 1. We start with a set of person keys and map each of these to the names of the people. Key 1 -> Harry Key 2 -> Harry Key 3 -> Tommy 2. We aggregate the list of people names by counting how many unique names are in the list. Harry, Harry, Tommy -> 2 8

9 What actually happens? Informally: Each key leads to a name field. Then, the names are isolated. Then, each is passed to a “mapper”, which returns the name, along with a 1. Then, a “reducer” takes each name and makes a list of 1’s. The reducer adds up the 1’s for each name and returns a list of (name, count) pairs. 9

10 From wikipedia Imagine that for a database of 1.1 billion people, one would like to compute the average number of social contacts a person has according to age. SELECT age AS Y, AVG(contacts) AS A FROM social.person GROUP BY age ORDER BY age function Map is input: integer K1 between 1 and 1100, representing a batch of 1 million social.person records for each social.person record in the K1 batch do let Y be the person's age let N be the number of contacts the person has produce one output record repeat end function function Reduce is input: age (in years) Y for each input record do Accumulate in S the sum of N Accumulate in C the count of records so far repeat let A be S/C produce one output record end function 10

11 From NoSQL Distilled: 1. Creating a list with a map 11

12 2. Aggregating with a reduce 12

13 3. Partitioning the output of mappers: parallelism & adding a phase that merges the results of the reducers 13

14 4. Introducing a combiner operation to minimize the movement of redundant data – the output format must be the same as the input format 14

15 5. A combiner that removes duplicate product-customer pairs 15

16 6. Concatenating a combining and a reduce (counting) operation 16

17 7. Maintaining the 1’s counts in the mapping phase 17

18 8. Adding temporal information to the map/reduce process 18

19 9. Using a reduce operator to create product per month totals 19

20 10. A second mapper that creates base year by year comparisons 20

21 11. A reduce operation combines records for a given year 21

22 Complaints M-R is low level. It is rigid. It exists to optimize the distributed cluster model – only. It demands that an application fit perfectly into the paradigm. It takes careful planning and knowledge of exactly how the data will be used to structure the database to optimally serve a series of map/reduce operations It thus does not accommodate on-the-fly browsing 22


Download ppt "Map-Reduce examples 1. So, what is it? A two phase process geared toward optimizing broad, widely distributed parallel computing platforms Apache Hadoop."

Similar presentations


Ads by Google