MapReduce “MapReduce allows us to stop thinking about fault tolerance.” Cathy O’Neil & Rachel Schutt, 2013.


What is Big Data? How big is Big Data? Big Data is disruption!

Big Data “You are dealing with Big Data when you are working with data that does not fit into your computer unit … Today, Big Data means working with data that does not fit in one computer” (O’Neil & Schutt, 2013)

Big Data & MapReduce We can try to process lots of data on one computer, but the more data we add (holding our computing power constant), the more likely it becomes that the “fan-in” (the step where the results of computations are sent back to the controller) will fail because of a bandwidth bottleneck

Big Data & MapReduce What we need is a tree: every group of 10 machines sends its data to one local controller, and the local controllers all send their results back to a super controller. This will probably work

Big Data & MapReduce But can we do this with 1,000 machines? The answer is no, because at that scale one or more computers will almost certainly fail (if you do the math, with 1,000 computers the chance that none is broken is about .001, which is tiny). This is not robust. What to do?
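The arithmetic behind that claim is easy to check. Assuming (purely for illustration; the per-machine figure is our assumption, not a measured number) that each machine independently stays healthy through a job with probability 0.993, the chance that all 1,000 stay healthy is 0.993^1000, which is on the order of .001:

```python
# Probability that none of n machines fails, assuming each machine
# independently stays healthy with probability p_healthy.
# (p_healthy = 0.993 is an illustrative assumption.)
p_healthy = 0.993
n_machines = 1_000

p_none_broken = p_healthy ** n_machines
print(f"{p_none_broken:.4f}")  # on the order of 0.001
```

So even very reliable machines, in large enough numbers, virtually guarantee at least one failure per job.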

Fault Tolerance We add fault tolerance to the tree approach. This means replicating the input (the default is about 3 copies of everything) and making the different copies available to different machines, so that if one machine blows up, another still has the good data. In general, we need a system that detects errors and automatically restarts the work when it detects them

MapReduce MapReduce allows us to stop thinking about fault tolerance: it is a platform that does the fault-tolerance work for us. Programming 1,000 computers is now easier than programming 100, because of MapReduce (O’Neil & Schutt, 2013)

MapReduce: How To? To use MapReduce, you write two functions: a mapper function and a reducer function. The framework takes these functions and runs them on many machines that are local to your stored data. All of the fault tolerance is done automatically for you once you place your code into the MapReduce framework

MapReduce: The Mapper The mapper takes each data point and produces an ordered pair of the form (key, value). The framework then sorts the mappers’ output via the “shuffle”: it finds all the pairs whose keys match and puts them together in a pile. It then sends these piles to machines that process them using the reducer function

MapReduce: The Reducer The reducer function’s outputs are of the form (key, new value), where the new value is some aggregate function of the old values

MapReduce: An Example Counting words: the objective of our code is simple, to count the number of times each word appears in a corpus of text. For each word, we emit an ordered pair with the key being that word and the value being the integer 1: [Data] → (“key”, “value”) Red → (“red”, 1) Blue → (“blue”, 1)

MapReduce: An Example This goes into the “shuffle” (via the “fan-in”) and we get a pile of (“red”, 1) pairs, which we can think of as (“red”, [1, 1]). This pile gets sent to the reducer function, which just adds up all the 1’s. We end up with (“red”, 2) and (“blue”, 1). Main point: one reducer handles all the values for a fixed key
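The whole word-count pipeline can be sketched in a few lines of Python. This is a single-process toy (the parallelism and fault tolerance are exactly what the real framework would add for you), and the names `mapper`, `shuffle`, and `reducer` are ours, not part of any framework API:

```python
from collections import defaultdict

def mapper(word):
    # Each data point becomes an ordered (key, value) pair.
    return (word.lower(), 1)

def shuffle(pairs):
    # Group all values that share a key into one pile.
    piles = defaultdict(list)
    for key, value in pairs:
        piles[key].append(value)
    return piles

def reducer(key, values):
    # Aggregate the pile of old values into one new value.
    return (key, sum(values))

corpus = ["Red", "Blue", "Red"]
pairs = [mapper(w) for w in corpus]   # [("red", 1), ("blue", 1), ("red", 1)]
counts = dict(reducer(k, v) for k, v in shuffle(pairs).items())
print(counts)  # {'red': 2, 'blue': 1}
```

Note that `shuffle` delivers every value for a given key to a single `reducer` call, which is the “one reducer handles all the values for a fixed key” point above.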

MapReduce: Have More Data? Obviously, yes! What to do? Increase the number of map workers and reduce workers; in other words, do it on more computers! MapReduce flattens the complexity of working with many computers. It’s elegant, and people use it even when they shouldn’t (we will for A8). Like all tools, it gets overused

MapReduce: Another Example Counting words was one easy function. Splitting a problem into a mapper and a reducer is not always this intuitive. The prior example also quietly assumed that the distribution of key values is roughly uniform

MapReduce: Another Example If all your words are the same, they all go to one machine during the shuffle, which causes huge problems (Google has addressed this kind of skew with CountSketch). Now assume you want to count, for each zip code, how many unique users saw ads and how many clicked at least once. How do you use MapReduce for this?

MapReduce: Another Example You could run MapReduce keyed by zip code, so that a record for a person living in zip code 30606 is sent as [Data] → (“key”, {“saw_value”, “click_value”}) 30606 → (“30606”, {1, 1}) [saw and clicked] 30606 → (“30606”, {1, 0}) [saw but did not click]

MapReduce: Another Example At the reducer stage, this would count the total number of impressions and clicks by zip code, producing output of the form [Data] → (“key”, {“saw_total”, “click_total”}) 30606 → (“30606”, {700, 333}) [700 users saw an ad, 333 clicked]
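This reducer is just a pairwise sum over the pile for each zip code. A minimal Python sketch (the record layout and the toy numbers are our assumptions for illustration):

```python
from collections import defaultdict

def reducer(zip_code, records):
    # Each record is (saw, clicked), each 0 or 1 per user.
    # The reducer totals impressions and clicks for one zip code.
    saw_total = sum(saw for saw, _ in records)
    click_total = sum(clicked for _, clicked in records)
    return (zip_code, (saw_total, click_total))

# Shuffled piles keyed by zip code (toy data).
piles = defaultdict(list)
for key, value in [("30606", (1, 1)), ("30606", (1, 0)), ("30605", (1, 1))]:
    piles[key].append(value)

totals = dict(reducer(k, v) for k, v in piles.items())
print(totals)  # {'30606': (2, 1), '30605': (1, 1)}
```

The per-user deduplication (“unique users”) would happen upstream, before these records reach the shuffle.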

MapReduce: Getting Fancy What about something more complicated, like using MapReduce to fit a statistical model such as linear regression? Is that possible? Yes, it is. Check this paper out to learn how
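One common way this works (a sketch under our own assumptions, not necessarily the method of the paper referenced above): for a simple regression y = a + b·x, each mapper computes the sufficient statistics (n, Σx, Σy, Σx², Σxy) for its chunk of the data, and a single reducer sums them and solves the normal equations:

```python
def mapper(chunk):
    # Sufficient statistics for one chunk of (x, y) pairs.
    n = len(chunk)
    sx = sum(x for x, _ in chunk)
    sy = sum(y for _, y in chunk)
    sxx = sum(x * x for x, _ in chunk)
    sxy = sum(x * y for x, y in chunk)
    return (n, sx, sy, sxx, sxy)

def reducer(stats):
    # Sum the per-chunk statistics, then solve the normal equations
    # for the slope b and intercept a of y = a + b*x.
    n, sx, sy, sxx, sxy = (sum(t) for t in zip(*stats))
    b = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    a = (sy - b * sx) / n
    return a, b

# Two "machines", each holding a chunk of data from y = 1 + 2x.
chunks = [[(0, 1), (1, 3)], [(2, 5), (3, 7)]]
a, b = reducer([mapper(c) for c in chunks])
print(a, b)  # 1.0 2.0
```

The key design point is that the mapper output is tiny and additive, so the reducer’s work does not grow with the size of the raw data.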

MapReduce: Sky is the Limit Sometimes, to understand what something is, it helps to understand what it isn’t. So, what can’t MapReduce do? Well, I personally can think of lots of things (give me a massage, for example, which would be very nice). You will be forgiven for thinking that MapReduce can solve any data problem coming your way…

MapReduce: Conclusions & Wednesday’s Class MapReduce is changing the way we process data: it is fault tolerant, cheaper (it runs on commodity machines), and faster (parallel processing). For Wednesday, read the Google paper! We will do MapReduce in R