MapReduce Computing Paradigm Basics Fall 2013 Elke A. Rundensteiner

Presentation transcript:

CS525: Big Data Analytics MapReduce Computing Paradigm Basics Fall 2013 Elke A. Rundensteiner

MapReduce Phases Deciding what will be the key and what will be the value is the developer's responsibility.

About Key-Value Pairs The developer provides the Mapper and Reducer functions, decides what is the key and what is the value, and must follow the key-value pair interface. Mappers: consume <key, value> pairs and produce <key, value> pairs. Shuffling and Sorting: groups all values with the same key from all mappers, sorts them, and passes them to one particular reducer in the form <key, <list of values>>. Reducers: consume <key, <list of values>> and produce <key, value>.

Processing Granularity Mappers run on a record-by-record basis: your code processes one record and may produce zero, one, or many outputs. Reducers run on a group-of-records (same key) basis: your code processes one group and may likewise produce zero, one, or many outputs.

Example 1: Word Count Job: Count the occurrences of each word in a data set. Map tasks emit (word, 1) pairs; reduce tasks sum the counts for each word.

What it looks like in Java Provide an implementation of Hadoop's Mapper class (the map function), an implementation of Hadoop's Reducer class (the reduce function), and the job configuration, as in the sketch below.
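A minimal WordCount sketch against Hadoop's newer org.apache.hadoop.mapreduce API; it is not copied from the original slides, and class names such as WordCount, TokenizerMapper, and IntSumReducer are only illustrative.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: consumes <byte offset, line of text>, produces <word, 1> for each word
  public static class TokenizerMapper
       extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reducer: consumes <word, [1, 1, ...]>, produces <word, total count>
  public static class IntSumReducer
       extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  // Job configuration: wires together the mapper, the reducer, and the I/O paths
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}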

Example 2: Color Count Job: Count the number of each color in a data set. Input blocks on HDFS go to map tasks, which produce (k, v) pairs such as (color, 1). Shuffle and sorting is based on k. Reduce tasks consume (k, [v]) pairs such as (color, [1,1,1,1,1,1,...]) and produce (k', v') pairs such as (color, 100). The output file has 3 parts (Part0001, Part0002, Part0003), which may lie on 3 different machines.

Example 3: Color Filter Job: Select only the blue and the green colors. Each map task selects only the blue or green records; there is no need for a reduce phase. Input blocks on HDFS go to map tasks, which produce (k, v) pairs for the matching records and write them directly to HDFS. The output file has 4 parts (Part0001 through Part0004), one per map task. A sketch of such a map-only job follows.
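A minimal sketch of this map-only filter, assuming one color name per input line; the class name is illustrative, and NullWritable (from org.apache.hadoop.io) is used because no meaningful value is needed. Setting the number of reduce tasks to 0 is what makes the job map-only.

public static class ColorFilterMapper
     extends Mapper<Object, Text, Text, NullWritable> {
  public void map(Object key, Text value, Context context)
      throws IOException, InterruptedException {
    String color = value.toString().trim();
    if (color.equals("blue") || color.equals("green")) {
      context.write(value, NullWritable.get());   // keep only blue and green records
    }
  }
}

// In the job configuration: zero reducers make the mappers write straight to HDFS
job.setMapperClass(ColorFilterMapper.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(NullWritable.class);
job.setNumReduceTasks(0);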

Optimization 1: Mapper Side In the Color Count example, what if the number of distinct colors is small? Each map function keeps a small main-memory hash table of (color, count). With each input line it updates the local hash table and produces nothing; when done, it reports each color and its local count. Gain: reduces the amount of shuffled/sorted data sent over the network. Q1: Where do we build the hash table? Q2: When and how do we know the mapper is done?

Mapper Class setup() is called once before any record (here you can build the hash table); map() is called for each record; cleanup() is called once after all records (here you can produce the output). The Reducer class has similar functions. A sketch of in-mapper combining for Color Count, using these hooks, follows.
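A sketch of Optimization 1 (in-mapper combining) for Color Count; it assumes one color name per input line and reuses the imports of the WordCount sketch above plus java.util.HashMap and java.util.Map. The class name is illustrative.

public static class InMapperColorCountMapper
     extends Mapper<Object, Text, Text, IntWritable> {
  private Map<String, Integer> counts;

  @Override
  protected void setup(Context context) {
    // Q1: the hash table is built here, once, before any record is seen
    counts = new HashMap<String, Integer>();
  }

  @Override
  protected void map(Object key, Text value, Context context) {
    // called for each record: update the local table, emit nothing
    String color = value.toString().trim();
    Integer c = counts.get(color);
    counts.put(color, c == null ? 1 : c + 1);
  }

  @Override
  protected void cleanup(Context context)
      throws IOException, InterruptedException {
    // Q2: the framework calls cleanup() once after the last record,
    // so this is where each color and its local count are reported
    for (Map.Entry<String, Integer> e : counts.entrySet()) {
      context.write(new Text(e.getKey()), new IntWritable(e.getValue()));
    }
  }
}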

Opt. 2: Map-Combine-Reduce On each machine we partially aggregate results from the mappers. A combiner is a reducer that runs on each machine to locally aggregate (via user code) the mappers' outputs from that machine. The combiners' output is then shuffled and sorted for the "real" reducers.

Tell Hadoop to use a Combiner Not all jobs can use a combiner: the aggregation must give the same result when applied partially and repeatedly. For Word Count and Color Count the reducer itself can serve as the combiner, as sketched below.
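A one-line sketch, reusing IntSumReducer from the WordCount sketch above; summing partial counts is associative and commutative, so the same class can run as both combiner and reducer.

job.setCombinerClass(IntSumReducer.class);   // combine map outputs locally before the shuffle
job.setReducerClass(IntSumReducer.class);    // the same class also runs as the final reducer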

Opt. 3: Speculative Execution If one node is slow, it may slow down the entire job. Speculative execution: Hadoop can automatically run extra copies of a straggling task in parallel on different nodes. The first copy to finish wins and its result is used; the others are killed.
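A small sketch of how this can be toggled per job; the property names below are the Hadoop 2.x ones (older releases used mapred.map.tasks.speculative.execution and its reduce counterpart), so treat them as an assumption for your version.

// Speculative execution is typically on by default; it can be switched per job
Configuration conf = new Configuration();
conf.setBoolean("mapreduce.map.speculative", true);      // allow extra copies of slow map tasks
conf.setBoolean("mapreduce.reduce.speculative", false);  // disable it for reduce tasks
Job job = Job.getInstance(conf, "color count");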

Opt. 4: Locality Locality: run the map code on the same machine that holds the relevant data block. If that is not possible, use a machine in the same rack. This is best effort; no guarantees can be given.

DB Operations as Hadoop Jobs? Select (Filter): map-only job. Projection: map-only job. Grouping and aggregation: Map-Reduce job. Duplicate Elimination: map produces (key = hash code of the tuple, value = the tuple itself); the reducer then emits each distinct tuple once (see the sketch after this slide). Join: Map-Reduce job (many variations).
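A minimal sketch of duplicate elimination; it is a simplified variant of the slide's scheme that uses the whole tuple as the key rather than its hash code, so identical tuples meet at the same reducer. Class names are illustrative.

public static class DedupMapper extends Mapper<Object, Text, Text, NullWritable> {
  public void map(Object key, Text tuple, Context context)
      throws IOException, InterruptedException {
    context.write(tuple, NullWritable.get());     // the whole tuple acts as the key
  }
}

public static class DedupReducer
     extends Reducer<Text, NullWritable, Text, NullWritable> {
  public void reduce(Text tuple, Iterable<NullWritable> values, Context context)
      throws IOException, InterruptedException {
    context.write(tuple, NullWritable.get());     // emit each distinct tuple once
  }
}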

Joining Two Large Datasets: Partition Join HDFS stores the data blocks of dataset A and dataset B, whose records carry different join key values (replicas are not shown). Each mapper processes one block (split) and produces (join key, record) pairs. The shuffling and sorting phase sends, over the network, all records with the same join key from both datasets to the same reducer. The reducers perform the actual join. A mapper sketch follows.
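A sketch of the mapper for such a repartition (reduce-side) join; the comma-separated layout, the join key in column 0, and the convention that dataset A's files have names starting with "A" are all assumptions. The reducer would then, for each join key, pair up the records tagged "A:" with those tagged "B:".

// Needs import org.apache.hadoop.mapreduce.lib.input.FileSplit in addition to the
// imports of the WordCount sketch above.
public static class JoinTaggingMapper extends Mapper<Object, Text, Text, Text> {
  public void map(Object key, Text value, Context context)
      throws IOException, InterruptedException {
    String[] fields = value.toString().split(",");
    String joinKey = fields[0];
    // Which dataset this split came from is derived from the input file name
    String file = ((FileSplit) context.getInputSplit()).getPath().getName();
    String tag = file.startsWith("A") ? "A:" : "B:";
    context.write(new Text(joinKey), new Text(tag + value.toString()));
  }
}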

Join with a Small Dataset: Broadcast/Replication Join Dataset A is large, dataset B is small, and their records carry different join key values. HDFS stores the data blocks (replicas are not shown). Every map task processes one block of A and the entire dataset B, and performs the join itself (a map-only job). This avoids the expensive shuffling and reducing phases. A mapper sketch follows.
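A sketch of a broadcast (map-side) join mapper, assuming the small dataset B is a comma-separated file at the hypothetical HDFS path /data/B.csv with the join key in column 0; setup() loads B into a hash table and map() probes it for every record of A. Needs java.io.BufferedReader, java.io.InputStreamReader, java.util.HashMap, java.util.Map, and org.apache.hadoop.fs.FileSystem in addition to the imports above.

public static class BroadcastJoinMapper extends Mapper<Object, Text, Text, Text> {
  private Map<String, String> small = new HashMap<String, String>();

  @Override
  protected void setup(Context context) throws IOException {
    // Load the entire small dataset B into memory once per map task
    FileSystem fs = FileSystem.get(context.getConfiguration());
    try (BufferedReader in = new BufferedReader(
             new InputStreamReader(fs.open(new Path("/data/B.csv"))))) {
      String line;
      while ((line = in.readLine()) != null) {
        String[] f = line.split(",", 2);
        small.put(f[0], f[1]);                  // join key -> rest of B's record
      }
    }
  }

  @Override
  protected void map(Object key, Text value, Context context)
      throws IOException, InterruptedException {
    String[] f = value.toString().split(",", 2);
    String match = small.get(f[0]);
    if (match != null) {                        // join A's record with B's record
      context.write(new Text(f[0]), new Text(f[1] + "," + match));
    }
  }
}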

Hadoop Fault Tolerance Intermediate data between mappers and reducers is materialized (written to disk), which allows for straightforward fault tolerance. What if a task fails (map or reduce)? The tasktracker detects the failure, sends a message to the jobtracker, and the jobtracker re-schedules the task. What if a datanode fails? Both the namenode and the jobtracker detect the failure; all tasks on the failed node are re-scheduled, and the namenode re-replicates the users' data to another node. What if the namenode or the jobtracker fails? The entire cluster goes down.

More About Execution Phases

Execution Phases

Reminder about Covered Phases Job: Count the number of each color in a data set. Input blocks on HDFS go to map tasks, which produce (k, v) pairs such as (color, 1). Shuffle and sorting is based on k. Reduce tasks consume (k, [v]) pairs such as (color, [1,1,1,1,1,1,...]) and produce (k', v') pairs such as (color, 100). The output file has 3 parts (Part0001, Part0002, Part0003), probably on 3 different machines.

Partitioners The output of the mappers needs to be partitioned: the number of partitions equals the number of reducers, and the same key in all mappers must go to the same partition (and hence the same reducer). The default partitioning is hash-based; users can customize it as they need.

Customized Partitioner The partitioner returns a partition id; a minimal sketch follows.
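A minimal custom partitioner sketch (the class name is illustrative); this particular body reproduces what the default hash partitioner does. It extends org.apache.hadoop.mapreduce.Partitioner.

public static class MyPartitioner extends Partitioner<Text, IntWritable> {
  @Override
  public int getPartition(Text key, IntWritable value, int numPartitions) {
    // Return a partition id in [0, numPartitions)
    return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
  }
}

// In the job configuration:
job.setPartitionerClass(MyPartitioner.class);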

Opt: Balance Load among Reducers Assume N reducers but many keys {K1, K2, ..., Km}, and the distribution is skewed: K1 and K2 have many records. Then send K1 to reducer 1, send K2 to reducer 2, and assign the rest (K3, K5, K7, K10, K20, ...) by hashing, as sketched below.
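A sketch of such a skew-aware partitioner, assuming the heavy keys are literally named "K1" and "K2" and that there are at least three reducers; both assumptions are illustrative.

public static class SkewAwarePartitioner extends Partitioner<Text, IntWritable> {
  @Override
  public int getPartition(Text key, IntWritable value, int numPartitions) {
    String k = key.toString();
    if (k.equals("K1")) return 0;               // heavy key gets its own reducer
    if (k.equals("K2")) return 1;               // second heavy key gets its own reducer
    // all remaining keys are spread by hash over the other reducers
    return 2 + ((k.hashCode() & Integer.MAX_VALUE) % (numPartitions - 2));
  }
}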

Input/Output Formats Hadoop's "data model": any data in any format is fine (text, binary, or some custom structure). How does Hadoop understand and read the data? An input format is the code that understands the data and how to read it. Hadoop has several built-in input formats, e.g., text files and binary sequence files.

Input Formats A record reader reads bytes and converts them to records.

Tell Hadoop which Input/Output Formats to use Define the formats in the job configuration, as sketched below.
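A sketch of selecting built-in formats in the job configuration; TextInputFormat and SequenceFileOutputFormat are two of Hadoop's built-ins (from org.apache.hadoop.mapreduce.lib.input and org.apache.hadoop.mapreduce.lib.output).

job.setInputFormatClass(TextInputFormat.class);             // line-oriented text input
job.setOutputFormatClass(SequenceFileOutputFormat.class);   // binary sequence-file output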

Data Storage: HDFS

HDFS and Placement Policy Default (rack-aware) replica placement policy: the first copy is written to the node creating the file (write affinity); the second copy is written to a data node within the same rack; the third copy is written to a data node in a different rack. Objective: load balancing and fault tolerance.

Hadoop Ecosystem