7/14/2015 EECS 584, Fall 2011. MapReduce: Simplified Data Processing on Large Clusters. Yunxing Dai, Huan Feng.


Real world problem: Count the number of occurrences of each word in a huge collection of word lists. –sample input: the seven books of Harry Potter

Real world problem: Count the number of occurrences of each word in a huge collection of word lists.

Word      Occurrences
The       15414
Good      5435
Never     6546
Tie

Possible solution: a hash table –each entry is a key-value pair, (word, occurrence) –scan all the files, putting each word into the hash table
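A minimal sketch of this single-node hash-table approach (the function name `count_words` and the list-of-lines input are illustrative assumptions, not from the slides):

```python
from collections import Counter

def count_words(lines):
    # Each entry of the hash table is a (word, occurrence) pair;
    # scan every line and bump the count for each word seen.
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts

# count_words(["the good the bad", "the tie"])
# -> Counter({'the': 3, 'good': 1, 'bad': 1, 'tie': 1})
```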

Real world problem--follow up: What if you are given a huge set of files and access to a large set of machines? Problems with the hash table: –low concurrency –hard to scale: if one node fails, all work restarts. Hence the MapReduce solution.

Map primitive. An idea from functional languages. Given a function, apply it to each element of the list INDIVIDUALLY and combine the results into a new list. e.g. increment each element of a list by 1.

Reduce primitive. An idea from functional languages. Apply a function to all elements of a list, combining them into a single result. e.g. calculate the sum of a list.
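Both primitives exist directly in Python; a small sketch of the two slide examples (the list `nums` is illustrative):

```python
from functools import reduce

nums = [1, 2, 3, 4]

# Map: apply a function to each element individually -> a new list
incremented = list(map(lambda x: x + 1, nums))   # [2, 3, 4, 5]

# Reduce: fold all elements of a list into a single result
total = reduce(lambda acc, x: acc + x, nums, 0)  # 10
```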

Map reduce solution--Single node. Map each single word into a (key, value) pair. –"Good" -> ("Good", 1). Put together all the pairs that have the same key. Input these pairs to a reduce program, which adds the values together. –[("Good", 1), ("Good", 1), ("Good", 1)] -> 3
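The single-node steps above can be sketched as follows (the function name `word_count` and the dict output are illustrative choices, not from the slides):

```python
from itertools import groupby
from operator import itemgetter

def word_count(words):
    # Map: each single word -> (word, 1)
    pairs = [(w, 1) for w in words]
    # Put together all pairs that have the same key (sort, then group)
    pairs.sort(key=itemgetter(0))
    # Reduce: add the values together for each key
    return {key: sum(v for _, v in group)
            for key, group in groupby(pairs, key=itemgetter(0))}

# word_count(["Good", "Good", "Good"]) -> {"Good": 3}
```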

Map reduce solution (dataflow diagram): Files (in_01, in_02, in_03, ...) -> Map -> pairs such as (The, 1), (Good, 1), (Bad, 1), (Not, 1) -> Merge and Sort -> lists of pairs grouped by key, e.g. [(The, 1), (The, 1), ...], [(Good, 1), (Good, 1), ...], [(Therofiery, 1)] -> Reduce -> the final word counts (The 15414, Good 5435, Is 6546, Therofiery 1).

Map reduce solution: What if we are now given a huge number of files and a large number of machines?

It can be scalable! Map can be applied to different parts of the input in parallel. If some map tasks fail, only those tasks need to be restarted instead of restarting everything.

Map reduce solution--scalable version. Map: split the input into several parts and apply the map function to each part. Shuffle: distribute the intermediate results into different buckets according to the hash of each key, and assign the buckets to several reducers; each reducer sorts its pairs by key. Reduce: apply the reduce function to all elements that share a key and produce the result.
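The shuffle step can be sketched like this (the function name `shuffle` and the bucket layout are illustrative assumptions; a real implementation would use a deterministic hash such as CRC32 rather than Python's per-process `hash`):

```python
def shuffle(mapped_parts, R):
    # Distribute intermediate (key, value) pairs into R buckets
    # according to the hash of the key; one bucket per reducer.
    buckets = [[] for _ in range(R)]
    for part in mapped_parts:
        for key, value in part:
            buckets[hash(key) % R].append((key, value))
    # Each reducer then sorts its bucket's pairs by key.
    for bucket in buckets:
        bucket.sort()
    return buckets
```

Because routing depends only on the key, every pair for a given word lands in the same bucket, so one reducer sees all of that word's counts.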

MapReduce Generalized. A software framework. Users are only responsible for providing two functions: map and reduce. Easy to scale to a large number of machines.

Split the input files into several pieces

Each piece is assigned to one worker (mapper)

Before sorting, the key-value pairs are hashed by key into R buckets.

(The, 1), (Good, 1), (The, 1), (Never, 1)

(Is, 1), (Is, 1), (Tie, 1), (Work, 1)

(The, 1), (Good, 1), (The, 1), (Never, 1) (Is, 1), (Is, 1), (Tie, 1), (Work, 1) Each bucket is read by one worker (reducer), which then sorts the pairs and produces the results.

Master program: controls the process and assigns work to workers.

Fault tolerance. Worker failure: simply assign the work to another worker. Master failure: restart the whole job.

Implementation details. Locality –take the location of the input files into account –assign a map task to the machine closest to its input data.

Implementation details. Backup tasks –abnormally slow machines lengthen a job's total time –when a job is almost finished, schedule duplicates of the remaining in-progress tasks as backup tasks –whenever either the primary or a backup execution finishes, mark the task as completed. (Diagram: TASK1, TASK2, TASK3, each shown as in progress or completed.)

Useful Extensions. Partitioning Functions –hash-based or range-based –user-defined partition functions. Combiner Function (similar to the Reduce Function) –resolves significant repetition in intermediate outputs. Skipping Bad Records –errors or bugs –acceptable to ignore a few records. Local Execution –helps facilitate debugging, profiling and testing.
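A sketch of a combiner for the word-count example (the function name `map_with_combiner` is an illustrative assumption): the map side pre-aggregates its own (word, 1) pairs, so the shuffle moves one (word, n) pair per distinct word instead of n separate pairs.

```python
from collections import Counter

def map_with_combiner(words):
    # Map: conceptually emit (word, 1) pairs, then combine locally.
    # The combiner has the same shape as the reducer: sum the values.
    combined = Counter()
    for word in words:
        combined[word] += 1
    return sorted(combined.items())

# map_with_combiner(["the", "the", "good"]) -> [("good", 1), ("the", 2)]
```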

Performance & Evaluation. Cluster Configuration –1800 nodes –2×2GHz CPUs, 4GB memory, 2×160GB IDE disks, Gb Ethernet link per node –2-level tree-shaped switched network. Grep –scans 10^10 100-byte records –M = 15000, R = 1 –takes ~150 seconds.

Performance & Evaluation (Sort). Sort –10^10 100-byte records –M = 15000, R = 4000 –Normal, No-Backup, and 200-tasks-killed runs. A few things to note –the input rate is higher than the shuffle & output rates –with no backup tasks, the execution flow is similar except for a long tail –when tasks are killed, they are restarted and the rate briefly drops to zero.

Applications of MapReduce. Broadly applicable –large-scale machine learning problems –clustering problems for Google News –extraction of data & properties –graph computations. Large-Scale Indexing –the indexing code is simpler and smaller (~3800 lines down to ~700) –the indexing process is much easier to operate & easier to speed up.

MapReduce & Parallel DBMS. MapReduce is not novel at all –an entirely new paradigm? MapReduce is a step backwards –no schema –no high-level access language. MapReduce is a poor implementation –no indexes –overlooks skew –lots of peer-to-peer network traffic in the shuffle phase. Missing features –indexes, updates, transactions. Not compatible with DBMS tools.

Parallel databases –have a significant performance advantage –take a lot of time to tune and set up –are not general enough (UDFs, UDTs) –SQL is not always easy & straightforward. MapReduce –easy to set up & easy to program –scalable & fault-tolerant –a brute-force solution.

What is MapReduce? A parallel programming model / data processing paradigm rather than a complete DBMS –does not target everything a DBMS targets –it's simple, but it works. Works for those who –have a lot of data (of some specific type) –find UDTs and UDFs too complex to tune –would rather program in a sequential language than in SQL –have no need to index data because the data changes all the time –do not want to pay.

Questions?