MapReduce “MapReduce allows us to stop thinking about fault tolerance.” Cathy O’Neil & Rachel Schutt, 2013.


1 MapReduce “MapReduce allows us to stop thinking about fault tolerance.” Cathy O’Neil & Rachel Schutt, 2013


3 What is Big Data? How big is Big Data? Big Data is disruption!!!

4 Big Data “You are dealing with Big Data when you are working with data that does not fit into your computer unit … Today, Big Data means working with data that does not fit in one computer” (O’Neil & Schutt, 2013)

5 Big Data & MapReduce We can try to process lots of data on one computer, but the more data we add (holding our computing power constant), the higher the likelihood that our “fan-in,” where the results of computations are sent to the controller, will fail because of a bandwidth problem.

6 Big Data & MapReduce What we need is a tree, where every group of 10 machines sends data to one local controller, and the local controllers then all send their results to a super controller. This will probably work.
[Figure: groups of 10 machines, each reporting to a local controller; the local controllers report to a super controller]

7 Big Data & MapReduce But can we do this with 1,000 machines? The answer is no. At that scale, one or more computers will fail. If you do the math, assuming each computer fails with probability .001, the chance that none of 1,000 computers is broken is about .999^1000 ≈ .37, which is less than even odds. This is not robust. What to do?
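
The arithmetic behind this slide can be checked directly. The following is a minimal sketch, assuming each machine fails independently with the same probability during a job; the per-machine figure of 0.001 and the function name `prob_no_failures` are illustrative assumptions, not values taken from the slides.

```python
def prob_no_failures(num_machines: int, p_fail: float) -> float:
    """Probability that none of num_machines fails.

    Assumes failures are independent and equally likely on every machine.
    """
    return (1.0 - p_fail) ** num_machines


# Assumed per-machine failure probability (illustrative only).
P_FAIL = 0.001

for n in (10, 100, 1000):
    print(n, "machines -> P(no failure) =", round(prob_no_failures(n, P_FAIL), 3))
```

Under these assumptions the chance of a failure-free run drops well below one half by the time 1,000 machines are involved, which is why the tree alone is not robust.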

8 Fault Tolerance Add fault tolerance to the tree approach. This involves replicating the input (the default is to have about 3 copies of everything) and making the different copies available to different machines, so if one blows, another one will still have the good data. In general, we need a system that detects errors and restarts work automatically when it detects them.
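
The idea of replication plus automatic retry can be sketched in a few lines. This is a toy simulation, not how any particular framework does it: `MachineDown`, `read_block`, and `read_with_failover` are hypothetical names, and a "failed machine" is just an id in a set.

```python
class MachineDown(Exception):
    """Raised when a simulated machine cannot serve its copy of the data."""


def read_block(replica_id: int, failed: set) -> str:
    """Return the data held by one replica, or raise if that machine is down."""
    if replica_id in failed:
        raise MachineDown(f"replica {replica_id} is down")
    return "block-data"


def read_with_failover(replicas: list, failed: set) -> str:
    """Try each copy in turn; succeed as long as one copy is reachable."""
    for replica_id in replicas:
        try:
            return read_block(replica_id, failed)
        except MachineDown:
            continue  # error detected: retry on the next copy
    raise MachineDown("all replicas are down")


# Three copies of the input; machine 0 has failed, so copy 1 is used.
print(read_with_failover([0, 1, 2], failed={0}))
```

With three copies, the read only fails when all three machines holding them are down at once.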

9 MapReduce MapReduce allows us to stop thinking about fault tolerance; it is a platform that does the fault tolerance work for us. Programming 1,000 computers is now easier than programming 100 because of MapReduce (O’Neil & Schutt, 2013).

10 MapReduce: How To? To use MapReduce, you write two functions: a mapper function and a reducer function. The framework takes these functions and runs them on many machines that are local to your stored data. All of the fault tolerance is done for you automatically once you place your code into the MapReduce framework.

11 MapReduce: The Mapper The mapper takes each data point and produces an ordered pair of the form (key, value). The framework then sorts output via the “shuffle,” and in particular finds all the keys that match and puts them together in a pile. Then it sends these piles to machines that process them using the reducer function
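
The shuffle step described above can be sketched on a single machine. This is a simplified, in-memory stand-in for what the framework does across many machines; the function name `shuffle` and the sample pairs are assumptions made for illustration.

```python
from collections import defaultdict


def shuffle(pairs):
    """Group mapper output by key: [(k, v), ...] -> {k: [v, ...]}."""
    piles = defaultdict(list)
    # Sort by key to mirror the framework's sort; the dict grouping
    # below would also work on unsorted input.
    for key, value in sorted(pairs, key=lambda kv: kv[0]):
        piles[key].append(value)
    return dict(piles)


mapped = [("red", 1), ("blue", 1), ("red", 1)]
print(shuffle(mapped))  # {'blue': [1], 'red': [1, 1]}
```

Each pile (one key with its list of values) is what a reducer receives.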

12 MapReduce: The Reducer
The reducer function’s outputs are of the form (key, new value), where the new value is some aggregate function of the old values

13 MapReduce: An Example
Counting words: The objective of our code is simple: to count the number of times each word appears in a corpus of text. For each word, we send an ordered pair with the key as that word and the value being the integer 1:
[Data] → (“key”, “value”)
Red → (“red”, 1)
Blue → (“blue”, 1)

14 MapReduce: An Example This goes into the “shuffle” (via the “fan-in”) and we get a pile of (“red”, 1)’s, which we can rewrite as (“red”, 1, 1). This gets sent to the reducer function, which just adds up all the 1’s. We end up with (“red”, 2), (“blue”, 1). Main point: one reducer handles all the values for a fixed key.
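
The word-count example can be written out end to end. This is a single-process sketch under stated assumptions: `mapper`, `reducer`, and `word_count` are illustrative names, the shuffle is simulated with a dictionary, and words are split on whitespace and lowercased.

```python
from collections import defaultdict


def mapper(line):
    """Emit (word, 1) for every word in a line of text."""
    for word in line.lower().split():
        yield (word, 1)


def reducer(key, values):
    """Add up all the 1's collected for one word."""
    return (key, sum(values))


def word_count(lines):
    # Simulated shuffle: pile up the values for each key.
    piles = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            piles[key].append(value)
    # One reducer call handles all the values for a fixed key.
    return [reducer(key, values) for key, values in sorted(piles.items())]


print(word_count(["Red Blue", "red"]))  # [('blue', 1), ('red', 2)]
```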

15 MapReduce: Have More Data?
Obviously, yes! What to do? Increase the number of map workers and reduce workers. In other words, do it on more computers! MapReduce flattens the complexity of working with many computers. It's elegant, and people use it even when they shouldn't (we will for A8). Like all tools, it gets overused.
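
Scaling by adding workers can be imitated on one machine. In this sketch threads stand in for map workers, which is an assumption for illustration only (in CPython they do not give true CPU parallelism); `count_chunk` and `parallel_word_count` are hypothetical names.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_chunk(lines):
    """Map worker: count the words in one chunk of the input."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def parallel_word_count(lines, num_workers):
    """Split the input across num_workers workers and merge their counts."""
    # Deal the lines out round-robin, one chunk per worker.
    chunks = [lines[i::num_workers] for i in range(num_workers)]
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partials = list(pool.map(count_chunk, chunks))
    # Reduce step: add the partial counts together.
    total = Counter()
    for partial in partials:
        total.update(partial)
    return dict(total)


corpus = ["red blue", "red", "blue blue green"]
print(parallel_word_count(corpus, num_workers=2))
```

The answer does not depend on `num_workers`; only the amount of work each worker receives changes.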

16 MapReduce: Another Example
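Counting words is easy to write as one function. Splitting it into two functions, a mapper and a reducer, is not intuitive at first. For the prior example to work well, the keys must be distributed roughly uniformly, so that no single reducer receives most of the data.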

17 MapReduce: Another Example
If all your words are the same, they all go to one machine during the shuffle, which causes huge problems. Google has solved this with CountSketch. Now assume you want to count how many unique users saw ads from each zip code and how many clicked at least once. How do you use MapReduce for this?
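
The skew problem is easy to see by counting how many records each reducer would receive. This sketch assumes keys are assigned to reducers by a hash of the key modulo the number of reducers, a common partitioning scheme but an assumption here; it illustrates the problem only and is not an implementation of CountSketch. The names `reducer_for` and `load_per_reducer` are hypothetical.

```python
import zlib


def reducer_for(key: str, num_reducers: int) -> int:
    """Pick a reducer for a key using a deterministic hash (assumed scheme)."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def load_per_reducer(words, num_reducers):
    """Count how many (word, 1) pairs each reducer would receive."""
    load = [0] * num_reducers
    for word in words:
        load[reducer_for(word, num_reducers)] += 1
    return load


varied = [f"word{i}" for i in range(1000)]  # 1,000 distinct keys
skewed = ["the"] * 1000                     # one key repeated 1,000 times

print("varied keys:", load_per_reducer(varied, 4))
print("single key: ", load_per_reducer(skewed, 4))
```

With a single repeated key, one reducer receives all 1,000 pairs and the other three sit idle, no matter how many reducers are added.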

18 MapReduce: Another Example
You could run MapReduce keyed by zip code, so that a record for a person living in a given zip code is sent as
[Data] → (“key”, {“saw_value”, “click_value”})
30606 → (“30606”, {1, 1}) [saw and clicked]
30606 → (“30606”, {1, 0}) [saw but did not click]

19 MapReduce: Another Example
At the reducer stage, this would count the total number of impressions and clicks by zip code, producing output of the form
[Data] → (“key”, {“saw_value”, “click_value”})
30606 → (“30606”, {700, 333}) [700 saw the ad, 333 of them clicked]
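
The zip-code example can be sketched the same way as word count, with a pair of counters as the value. This sketch assumes the input already holds one record per user, of the hypothetical form `(user_id, zip_code, clicked)`; deduplicating users who appear more than once would need an extra step that is not shown.

```python
from collections import defaultdict


def mapper(record):
    """(user_id, zip_code, clicked) -> (zip_code, (saw, clicked))."""
    _user_id, zip_code, clicked = record
    return (zip_code, (1, 1 if clicked else 0))


def reducer(zip_code, values):
    """Sum impressions and clicks for one zip code."""
    saw = sum(s for s, _ in values)
    clicks = sum(c for _, c in values)
    return (zip_code, (saw, clicks))


def ads_by_zip(records):
    piles = defaultdict(list)  # simulated shuffle, keyed by zip code
    for record in records:
        key, value = mapper(record)
        piles[key].append(value)
    return [reducer(key, values) for key, values in sorted(piles.items())]


records = [
    ("u1", "30606", True),   # saw and clicked
    ("u2", "30606", False),  # saw but did not click
    ("u3", "30601", True),
]
print(ads_by_zip(records))  # [('30601', (1, 1)), ('30606', (2, 1))]
```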

20 MapReduce: Getting Fancy
What about something more complicated, like using MapReduce to implement a statistical model such as linear regression? Is that possible? Yes, it is. Check this paper out to learn how.
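
One way to see why regression fits the MapReduce pattern: for simple least squares, the fit depends on the data only through a few sums, and sums can be computed per chunk and then added. The sketch below uses that idea for a one-variable regression; it is an assumed illustration and not necessarily the method in the paper the slide refers to. The names `mapper` and `reducer` and the sample data are hypothetical.

```python
def mapper(chunk):
    """Emit the sums needed for least squares from one chunk of (x, y) points."""
    n = len(chunk)
    sx = sum(x for x, _ in chunk)
    sy = sum(y for _, y in chunk)
    sxy = sum(x * y for x, y in chunk)
    sxx = sum(x * x for x, _ in chunk)
    return (n, sx, sy, sxy, sxx)


def reducer(values):
    """Add the per-chunk sums and solve for slope and intercept."""
    n, sx, sy, sxy, sxx = (sum(column) for column in zip(*values))
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return slope, intercept


# Two chunks of points lying on the line y = 2x + 1.
chunks = [[(0, 1), (1, 3)], [(2, 5), (3, 7)]]
print(reducer([mapper(chunk) for chunk in chunks]))  # (2.0, 1.0)
```

Each mapper only ever sees its own chunk, and the reducer only sees five numbers per chunk, regardless of how large the chunks are.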

21 MapReduce: Sky is the Limit
Sometimes, to understand what something is, it can help to understand what it isn’t. So, what can’t MapReduce do? Well, I personally can think of lots of things; for example, it can’t give me a massage, which would be very nice. You will be forgiven for thinking that MapReduce can solve any data problem coming your way…

22 MapReduce: Conclusions & Wednesday’s Class
MapReduce is changing the way we process data. It is fault tolerant, cheaper (these are commodity machines), and faster (parallel processing). For Wednesday, read the Google paper! We will do MapReduce in R.

