MapReduce “MapReduce allows us to stop thinking about fault tolerance.” Cathy O’Neil & Rachel Schutt, 2013
What is Big Data? How big is Big Data? Big Data is disruption!!!
Big Data “You are dealing with Big Data when you are working with data that does not fit into your computer unit … Today, Big Data means working with data that does not fit in one computer” (O’Neil & Schutt, 2013)
Big Data & MapReduce We can try to process lots of data on a single computer, but that does not scale. If we instead split the work across many machines, then as we add more data (holding our computing power constant) the “fan-in”, the step where the results of the computations are sent back to a single controller, becomes more and more likely to fail because of a bandwidth problem.
Big Data & MapReduce What we need is a tree, where every group of 10 machines sends its data to one local controller, and the local controllers then send their results back to a super controller. This will probably work. [Diagram: groups of 10 machines feed local controllers, which feed the super controller]
Big Data & MapReduce But can we do this with 1,000 machines? The answer is no: at that scale, one or more computers will almost surely fail during the job (if you do the math, with 1,000 computers the chance that none of them breaks is tiny; see the calculation below). This is not robust. What to do?
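To make the arithmetic explicit (the per-machine failure rate is an assumption here, since the slide does not pin one down): if each machine independently fails with probability $p$ during the job, then

$$\Pr(\text{no machine breaks}) = (1 - p)^{1000},$$

which is already only about $0.37$ for $p = 1/1000$ and roughly $0.001$ once $p$ is around $0.007$. Either way, the chance that every machine survives shrinks quickly as we add machines.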
Fault Tolerance Take a fault tolerance approach to the tree design. This involves replicating the input (the default is about 3 copies of everything) and making the different copies available to different machines, so if one blows up, another one will still have the good data. In general, we need a system that detects errors and restarts the work automatically when it detects them.
MapReduce Allows us to stop thinking about fault tolerance; it is a platform that does the fault tolerance work for us. Programming 1,000 computers is now easier than programming 100 because of MapReduce (O’Neil & Schutt, 2013)
MapReduce: How To? To use MapReduce, you write two functions: a mapper and a reducer. The framework takes these functions and runs them on many machines that are local to your stored data. All of the fault tolerance is done automatically for you once you place your code into the MapReduce framework.
MapReduce: The Mapper The mapper takes each data point and produces an ordered pair of the form (key, value). The framework then sorts the output via the “shuffle”: it finds all the pairs whose keys match and puts them together in a pile. It then sends these piles to machines that process them using the reducer function.
MapReduce: The Reducer The reducer function’s outputs are of the form (key, new value), where the new value is some aggregate function of the old values
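To make the mapper/reducer contract concrete, here is a minimal sketch in R (the language we will use for the assignment). It is not the real MapReduce platform: run_mapreduce() is a toy stand-in for the framework that performs the shuffle between the two user-supplied functions.

```r
# Toy stand-in for the MapReduce framework (illustrative only):
# map_fn turns one record into a list of (key, value) pairs,
# the "shuffle" piles up values by key, and reduce_fn collapses each pile.
run_mapreduce <- function(records, map_fn, reduce_fn) {
  pairs  <- unlist(lapply(records, map_fn), recursive = FALSE)   # map step
  keys   <- vapply(pairs, function(p) p$key, character(1))
  values <- lapply(pairs, function(p) p$value)
  piles  <- split(values, keys)                                  # shuffle step
  Map(reduce_fn, names(piles), piles)                            # reduce step
}
```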
MapReduce: An Example Counting words: the objective of our code is simple, to count the number of times a certain word appears in a corpus of text. For each word, we send an ordered pair with the key being that word and the value being the integer 1:
[Data]  (“key”, “value”)
Red     (“red”, 1)
Blue    (“blue”, 1)
MapReduce: An Example This goes into the “shuffle” (via the “fan-in”) and we get a pile of (“red”, 1)’s, which we can rewrite as (“red”, 1, 1). This pile gets sent to the reducer function, which just adds up all the 1’s. We end up with (“red”, 2), (“blue”, 1). Main point: one reducer handles all the values for a fixed key.
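Using the toy run_mapreduce() sketch above, word counting looks roughly like this (function names are illustrative, not part of any real framework):

```r
word_mapper <- function(line) {
  # Emit ("word", 1) for every word in the line
  words <- unlist(strsplit(tolower(line), "\\s+"))
  lapply(words, function(w) list(key = w, value = 1L))
}

word_reducer <- function(word, counts) {
  # All the 1's for a fixed word land at one reducer; just add them up
  list(key = word, value = sum(unlist(counts)))
}

run_mapreduce(c("Red Blue Red"), word_mapper, word_reducer)
# $blue -> ("blue", 1); $red -> ("red", 2)
```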
MapReduce: Have More Data? Obviously, yes! What to do? Increase the number of map workers and reduce workers; in other words, do it on more computers! MapReduce flattens the complexity of working with many computers. It's elegant, and people use it even when they shouldn't (we will for A8). Like all tools, it gets overused.
MapReduce: Another Example Counting words was one easy job. Splitting a problem into a mapper and a reducer like this is not always so intuitive. Also, for the prior example to work well, the distribution of keys (words) must be roughly uniform.
MapReduce: Another Example If all your words are the same, they all go to one machine during the shuffle, which causes huge problems. Google has addressed this with CountSketch. Now assume you want to count how many unique users saw ads from each zip code and how many clicked at least once. How do you use MapReduce for this?
MapReduce: Another Example You could run MapReduce keyed by zip code, so that a record for a person living in zip code 30606 is sent as:
[Data]  (“key”, {“saw_value”, “click_value”})
30606   (“30606”, {1, 1})   [saw and clicked]
30606   (“30606”, {1, 0})   [saw but did not click]
MapReduce: Another Example At the reducer stage, this would count the total number of impressions and clicks by zip code, producing output of the form:
(“key”, {“saw_value”, “click_value”})
(“30606”, {700, 333})   [700 saw the ad, 333 clicked at least once]
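A hedged sketch of this zip-code example with the same toy framework (the record layout is an assumption: one record per user, with saw = 1 and clicked = 1 if that user clicked at least once):

```r
zip_mapper <- function(record) {
  # record is assumed to look like list(zip = "30606", saw = 1L, clicked = 0L)
  list(list(key = record$zip,
            value = c(saw = record$saw, clicked = record$clicked)))
}

zip_reducer <- function(zip_code, pairs) {
  # Element-wise sum across users gives (total impressions, total clickers)
  list(key = zip_code, value = Reduce(`+`, pairs))
}

users <- list(list(zip = "30606", saw = 1L, clicked = 1L),
              list(zip = "30606", saw = 1L, clicked = 0L))
run_mapreduce(users, zip_mapper, zip_reducer)
# $`30606` -> ("30606", {saw = 2, clicked = 1})
```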
MapReduce: Getting Fancy What about something more complicated, like using MapReduce to implement a statistical model such as linear regression? Is that possible? Yes, it is. Check this paper out to learn how.
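One common way to do this (a sketch of the general idea, not necessarily the paper's exact recipe): OLS only needs X'X and X'y, and both are sums over rows, so each mapper can emit its chunk's partial sums and a reducer adds them up before we solve for the coefficients.

```r
ols_mapper <- function(chunk) {
  # chunk is assumed to be list(X = <matrix>, y = <vector>): one machine's slice of the data
  list(list(key = "XtX", value = crossprod(chunk$X)),            # X'X for this chunk
       list(key = "Xty", value = crossprod(chunk$X, chunk$y)))   # X'y for this chunk
}

ols_reducer <- function(key, partials) {
  list(key = key, value = Reduce(`+`, partials))   # add the partial sums across chunks
}

# Tiny illustration with two "machines"
set.seed(1)
X <- matrix(rnorm(300), ncol = 3)
y <- X %*% c(1, -2, 0.5) + rnorm(100, sd = 0.1)
chunks <- list(list(X = X[1:50, ],   y = y[1:50]),
               list(X = X[51:100, ], y = y[51:100]))
out <- run_mapreduce(chunks, ols_mapper, ols_reducer)
beta_hat <- solve(out$XtX$value, out$Xty$value)    # = (X'X)^{-1} X'y
```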
MapReduce: Sky is the Limit Sometimes, to understand what something is, it can help to understand what it isn't. So, what can't MapReduce do? Well, I personally can think of lots of things, for example, give me a massage, which would be very nice. You will be forgiven for thinking that MapReduce can solve any data problem coming your way…
MapReduce: Conclusions & Wednesday's Class MapReduce is changing the way we process data: it is fault tolerant, cheaper (these are commodity machines), and faster (parallel processing). For Wednesday, read the Google paper! We will do MapReduce in R.