Overview of MapReduce and Hadoop

Why Parallelism?
- Data size is increasing, and single-node architectures are reaching their limits: scanning 100 TB on one node @ 100 MB/s takes about 12 days.
- A standard, affordable architecture is emerging: clusters of commodity Linux nodes with gigabit Ethernet interconnect.
- Question: where did the "gain" come from? Parallel disk access!

Design Goals for Parallelism
- Scalability to large data volumes: scanning 100 TB on one node @ 100 MB/s takes about 12 days; scanning on a 1000-node cluster takes about 16 minutes!
- Cost-efficiency: commodity nodes (cheap, but unreliable); a commodity network; automatic fault tolerance (fewer admins); easy to use (fewer programmers).

What is MapReduce?
- A data-parallel programming model for clusters of commodity machines.
- Pioneered by Google, which used it to process 20 PB of data per day.
- Popularized by the open-source Hadoop project; used by Yahoo!, Facebook, Amazon, ...
- In 2014, Google retired MapReduce in favor of a new hyper-scale analytics system: http://www.datacenterknowledge.com/archives/2014/06/25/google-dumps-mapreduce-favor-new-hyper-scale-analytics-system

What is MapReduce used for?
- At Google: index building for Google Search; article clustering for Google News; statistical machine translation.
- At Yahoo!: index building for Yahoo! Search; spam detection for Yahoo! Mail.
- At Facebook: data mining; ad optimization; spam detection.
- In research: analyzing Wikipedia conflicts (PARC); natural language processing (CMU); bioinformatics (Maryland); astronomical image analysis (Washington); ocean climate simulation (Washington); graph OLAP (NUS); ... <your application>

Challenges
- Cheap nodes fail, especially when you have many of them.
- Commodity network = low bandwidth.
- Programming distributed systems is hard.

Typical Hadoop Cluster
- Aggregation switch above rack switches; 40 nodes per rack, 1000-4000 nodes per cluster.
- 1 Gbps bandwidth within a rack, 8 Gbps out of the rack.
- Node specs (Yahoo! terasort): 8 x 2.0 GHz cores, 8 GB RAM, 4 disks (= 4 TB?).
- In 2011 it was estimated that Google had about 1M machines: http://bit.ly/Shh0RO

Hadoop Components
- Distributed file system (HDFS): single namespace for the entire cluster; replicates data 3x for fault tolerance.
- MapReduce implementation: executes user jobs specified as "map" and "reduce" functions; manages work distribution and fault tolerance.

Hadoop Distributed File System
- Files are BIG (100s of GB to TB). What about small files? HDFS favors a few big files: every file adds namenode metadata overhead.
- Typical usage patterns: append-only writes; data are rarely updated in place; reads are common.
- Optimized for large files and sequential reads. Why? Append-only writes imply no ordering of the data, so readers must sequentially scan the whole file anyway.
- Files are split into 64-128 MB blocks (called chunks).
- Blocks are replicated (usually 3 times) across several datanodes (called chunk or slave nodes); chunk nodes are compute nodes too.
[Diagram: a namenode maps File1 to blocks 1-4; each block appears on three of the datanodes.]

Hadoop Distributed File System
- A single namenode (master node) stores the metadata (file names, block locations, etc.); the namenode itself may be replicated.
- A client library handles file access: it asks the master which chunk servers hold the data, then connects directly to the chunk servers. The master node is therefore not a bottleneck.
- Computation is done at the chunk nodes, close to the data.
- Hadoop follows the master/slave architecture of the original MapReduce paper. Since Hadoop consists of two parts, data storage (HDFS) and a processing engine (MapReduce), there are two types of master node: for HDFS the master is the namenode and the slaves are datanodes; for MapReduce in Hadoop the master is the jobtracker and the slaves are tasktrackers.
[Diagram: the same namenode/datanode block layout as on the previous slide.]

MapReduce Programming Model
- Data type: key-value records. A file is a bag of (key, value) records.
- Map function: (K_in, V_in) -> list(K_inter, V_inter). Takes a key-value pair and outputs a set of key-value pairs; e.g., the key is a line number and the value is one line of the file. There is one Map call for every (k_in, v_in) pair.
- Reduce function: (K_inter, list(V_inter)) -> list(K_out, V_out). All values v_inter with the same key k_inter are reduced together and processed in v_inter order. There is one Reduce call per unique key k_inter.
- What is the restriction on the size of a (key, value) pair? Each pair must be small enough to fit on a single machine: a document is fine, an image is fine, but a 1 PB value is probably not.
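One rough way to pin these signatures down is as Python type aliases; this is illustrative only, and the names below are not part of Hadoop's API:

    from typing import Callable, Iterable, TypeVar

    Kin, Vin = TypeVar("Kin"), TypeVar("Vin")
    Kinter, Vinter = TypeVar("Kinter"), TypeVar("Vinter")
    Kout, Vout = TypeVar("Kout"), TypeVar("Vout")

    # One Map call per input record; it may emit any number of pairs.
    Mapper = Callable[[Kin, Vin], Iterable[tuple[Kinter, Vinter]]]

    # One Reduce call per unique intermediate key, with all of its values.
    Reducer = Callable[[Kinter, Iterable[Vinter]], Iterable[tuple[Kout, Vout]]]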

Example: Word Count Execution
Input -> Map -> Sort & Shuffle -> Reduce -> Output
- Input lines: "the quick brown fox", "the fox ate the mouse", "how now brown cow".
- Each Map emits (word, 1) pairs for its line, e.g., (the, 1), (quick, 1), (brown, 1), (fox, 1).
- Sort & shuffle groups the pairs by word across all mappers.
- Reduce sums the counts, producing: (ate, 1), (brown, 2), (cow, 1), (fox, 2), (how, 1), (mouse, 1), (now, 1), (quick, 1), (the, 3).
- Users only need to write a map and a reduce function! How many instances to launch, and where, is left to the execution engine.

Example: Word Count

    def mapper(line):
        for word in line.split():
            output(word, 1)

    def reducer(key, values):
        output(key, sum(values))
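To make the dataflow concrete, here is a minimal, self-contained Python simulation of the map -> shuffle -> reduce pipeline; the in-memory grouping stands in for what Hadoop does across the network, and run_mapreduce is an illustrative name, not a framework function:

    from collections import defaultdict

    def mapper(line):
        for word in line.split():
            yield (word, 1)

    def reducer(key, values):
        yield (key, sum(values))

    def run_mapreduce(records, map_fn, reduce_fn):
        groups = defaultdict(list)          # shuffle: group values by key
        for record in records:
            for k, v in map_fn(record):
                groups[k].append(v)
        out = []
        for k in sorted(groups):            # keys reach a reducer in sorted order
            out.extend(reduce_fn(k, groups[k]))
        return out

    lines = ["the quick brown fox", "the fox ate the mouse", "how now brown cow"]
    print(run_mapreduce(lines, mapper, reducer))
    # [('ate', 1), ('brown', 2), ('cow', 1), ('fox', 2), ('how', 1),
    #  ('mouse', 1), ('now', 1), ('quick', 1), ('the', 3)]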

Can the Word Count algorithm be improved? Some questions to consider:
- How many key-value pairs are emitted? What if the same word appears many times in the same line or document?
- What is the overhead, and how could you reduce that number?

An Optimization: The Combiner
- A combiner is a local aggregation function for repeated keys produced by the same map.
- It applies to associative operations like sum, count, and max, and it decreases the size of the intermediate data.
- Example: local counting for Word Count:

    def combiner(key, values):
        output(key, sum(values))

NOTE: For Word Count, this turns out to be exactly what the Reducer does!

Word Count with Combiner
Input -> Map & Combine -> Shuffle & Sort -> Reduce -> Output
- Same three input lines as before, but each mapper now pre-aggregates its own output: the mapper for "the fox ate the mouse" emits (the, 2) instead of two separate (the, 1) pairs.
- The reducers receive fewer pairs but produce the same final counts: (brown, 2), (fox, 2), (how, 1), (now, 1), (the, 3), (ate, 1), (cow, 1), (mouse, 1), (quick, 1).

Care in Using a Combiner
- The combiner is usually the same as the reducer, but only when the operation is associative and commutative, e.g., sum.
- How about average? How about median? (See the sketch below.)
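To see why average breaks, here is a small standalone sketch (all names are mine): averaging per-mapper averages weights each mapper equally, while having the combiner emit (sum, count) partials keeps the result exact. Median is worse still: it cannot be reconstructed from per-mapper medians at all.

    # Values for one key, split across two mappers.
    m1, m2 = [1, 2, 3], [10, 20]

    # WRONG: combine by averaging, then average the averages.
    avg = lambda xs: sum(xs) / len(xs)
    wrong = avg([avg(m1), avg(m2)])          # (2 + 15) / 2 = 8.5

    # RIGHT: combiner emits (sum, count); reducer merges the partials.
    partials = [(sum(m1), len(m1)), (sum(m2), len(m2))]
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    right = total / count                    # 36 / 5 = 7.2, the true mean

    print(wrong, right)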

The Complete WordCount Program
[Slide shows the full program source: the Map function, the Reduce function, the driver code invoking Map, Combiner & Reduce, and how to run the program as a MapReduce job.]

Example 2: Search
- Input: (lineNumber, line) records
- Output: lines matching a given pattern
- Map:
      if line matches pattern:
          output(line)
- Reduce: identity function
- Alternative? Skip the reduce phase entirely with a map-only job: job.setNumReduceTasks(0);

Example 3: Inverted Index
Source documents:
- hamlet.txt: "to be or not to be"
- 12th.txt: "be not afraid of greatness"
Map output: (to, hamlet.txt), (be, hamlet.txt), (or, hamlet.txt), (not, hamlet.txt), (be, 12th.txt), (not, 12th.txt), (afraid, 12th.txt), (of, 12th.txt), (greatness, 12th.txt)
Inverted index: afraid -> (12th.txt); be -> (12th.txt, hamlet.txt); greatness -> (12th.txt); not -> (12th.txt, hamlet.txt); of -> (12th.txt); or -> (hamlet.txt); to -> (hamlet.txt)

Example 3: Inverted Index
- Input: (filename, text) records
- Output: the list of files containing each word
- Map:
      for word in text.split():
          output(word, filename)
- Combine: merge the filenames for each word
- Reduce:
      def reduce(word, filenames):
          output(word, sort(filenames))
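A standalone Python sketch of the same idea (a dict stands in for the shuffle; names are illustrative):

    from collections import defaultdict

    docs = [("hamlet.txt", "to be or not to be"),
            ("12th.txt", "be not afraid of greatness")]

    postings = defaultdict(set)          # shuffle: group filenames by word
    for filename, text in docs:          # map: emit (word, filename)
        for word in text.split():
            postings[word].add(filename)

    # reduce: sort each posting list into canonical form
    index = {w: sorted(fs) for w, fs in sorted(postings.items())}
    print(index)
    # {'afraid': ['12th.txt'], 'be': ['12th.txt', 'hamlet.txt'], ...}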

Example 4: Most Popular Words
- Input: (filename, text) records
- Output: the 100 words occurring in the most files
Two-stage solution:
- Job 1: build an inverted index, giving (word, list(file)) records.
- Job 2: map each (word, list(file)) record to (count, word), then sort these records by count as in the sort job. Two MapReduce jobs are needed in total.
Optimizations:
- Map to (word, 1) instead of (word, file) in Job 1.
- Estimate the count distribution in advance by sampling.
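A standalone sketch of the two-job pipeline on the toy documents from Example 3 (illustrative names; both "jobs" run in memory here):

    from collections import defaultdict

    docs = [("hamlet.txt", "to be or not to be"),
            ("12th.txt", "be not afraid of greatness")]

    # Job 1: inverted index -> (word, set of files).
    postings = defaultdict(set)
    for filename, text in docs:
        for word in text.split():
            postings[word].add(filename)

    # Job 2: map each (word, files) to (count, word), then sort by count.
    pairs = [(len(files), word) for word, files in postings.items()]
    top = sorted(pairs, reverse=True)[:100]
    print(top)   # [(2, 'not'), (2, 'be'), ...] - words occurring in most files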

Note
- The numbers of mappers, reducers, and physical nodes need not be equal.
- For different problems the Map and Reduce functions differ, but the workflow is the same: read a lot of data; Map (extract something you care about); sort and shuffle; Reduce (aggregate, summarize, filter, or transform); write the result.

Refinement: Partition Function
- We want to control how keys get partitioned. Inputs to map tasks are created by contiguous splits of the input file, and Reduce needs all records with the same intermediate key to end up at the same worker.
- The system uses a default partition function: hash(key) mod R.
- It is sometimes useful to override it: e.g., hash(hostname(URL)) mod R ensures that URLs from the same host end up in the same output file. What about sorting? (See Example 5 and the sketches below.)
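A small standalone illustration of the default and the overridden partitioner; Python's built-in hash() stands in for the framework's hash function (it is consistent within one run):

    from urllib.parse import urlparse

    R = 4  # number of reducers

    def default_partition(key):
        return hash(key) % R

    def host_partition(url):
        # All URLs from one host go to the same reducer/output file.
        return hash(urlparse(url).hostname) % R

    a = host_partition("http://example.com/page1")
    b = host_partition("http://example.com/page2")
    assert a == b   # same host, same partition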

Example 5: Sort
- Input: (key, value) records
- Output: the same records, sorted by key
- Map: identity function. Why? The framework's shuffle already sorts by key.
- Reduce: identity function. Why? Each reducer receives its keys in sorted order.
- Trick: pick a partitioning function h such that k1 < k2 implies h(k1) <= h(k2), e.g., one reducer handles keys in [A-M] and another handles [N-Z]; concatenating the reducers' outputs then yields a globally sorted result.
[Diagram: records such as (ant, bee), (cow), (pig), (sheep, yak), (zebra), (aardvark, elephant) flow through identity mappers to two range-partitioned reducers, producing aardvark, ant, bee, cow, elephant on the [A-M] reducer and pig, sheep, yak, zebra on the [N-Z] reducer.]
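A sketch of such an order-preserving partition function, assuming lowercase alphabetic keys and R = 2 reducers:

    def range_partition(key):
        # Bucket 0 for a-m, bucket 1 for n-z; order-preserving across buckets.
        return 0 if key[0].lower() <= "m" else 1

    animals = ["zebra", "ant", "pig", "aardvark", "yak", "cow"]
    buckets = [[], []]
    for k in animals:
        buckets[range_partition(k)].append(k)

    # Sort within each bucket (as each reducer does), then concatenate:
    result = sorted(buckets[0]) + sorted(buckets[1])
    print(result)  # ['aardvark', 'ant', 'cow', 'pig', 'yak', 'zebra']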

Job Configuration Parameters
Hadoop exposes 190+ job configuration parameters; set them manually, or defaults are used.

Not All Tasks Fit the MapReduce Model
- Consider a data set of n observations and k variables, e.g., k = 10,000 different stock symbols or indices, and n observations representing stock price signals (up/down) measured at n different times.
- Problem: find very high correlations, ideally with time lags so you can make a profit, e.g., if Google is up today, Facebook is likely to be up tomorrow.
- Solving this requires computing k * (k-1) / 2 correlations (about 50 million for k = 10,000).
- You cannot simply split the 10,000 stock symbols into 1,000 clusters of 10 symbols each and then use MapReduce: the vast majority of the correlations involve a symbol in one cluster and a symbol in another, and these cross-cluster computations make MapReduce useless here.
- The same issue arises if you replace "correlation" with any other function f computed on two variables rather than one.
- MapReduce is good for "embarrassingly parallel" jobs. What else is it not good for? Real-time processing. Make sure you pick the right tool!

High-Level Languages on Top of Hadoop
- MapReduce is great: many algorithms can be expressed as a series of MR jobs.
- But it is low-level: you must think about keys, values, partitioning, etc.
- Can we capture common "job patterns"?

Pig
- Started at Yahoo! Research; now runs about 30% of Yahoo!'s jobs.
- Motivating example: you have user data in one file and website data in another, and you need to find the top 5 most-visited pages by users aged 18-25.
- Features: expresses sequences of MapReduce jobs; data model of nested "bags" of items; relational (SQL) operators (JOIN, GROUP BY, etc.); easy to plug in Java functions; Pig Pen development environment for Eclipse.
- Dataflow for the example: Load Users, Load Pages -> Filter by age -> Join on name -> Group on url -> Count clicks -> Order by clicks -> Take top 5.
Example from http://wiki.apache.org/pig-data/attachments/PigTalksPapers/attachments/ApacheConEurope09.ppt

Ease of Translation
Notice how naturally the components of the job translate into Pig Latin:

    Users    = load 'users' as (name, age);
    Filtered = filter Users by age >= 18 and age <= 25;
    Pages    = load 'pages' as (user, url);
    Joined   = join Filtered by name, Pages by user;
    Grouped  = group Joined by url;
    Summed   = foreach Grouped generate group, count(Joined) as clicks;
    Sorted   = order Summed by clicks desc;
    Top5     = limit Sorted 5;
    store Top5 into 'top5sites';

The script compiles into three MapReduce jobs: Job 1 covers the loads, the filter, and the join; Job 2 the group and count; Job 3 the order-by and the top 5. There is no need to think about how many MapReduce jobs this decomposes into, how to connect the data flows between them, or even that this is running on top of MapReduce at all.

Hive
- Developed at Facebook; used for the majority of Facebook's jobs.
- A "relational database" built on Hadoop: maintains a list of table schemas; SQL-like query language (HQL); can call Hadoop Streaming scripts from HQL; supports table partitioning, clustering, complex data types, and some optimizations.

Creating a Hive Table

    CREATE TABLE page_views(viewTime INT, userid BIGINT,
        page_url STRING, referrer_url STRING,
        ip STRING COMMENT 'User IP address')
    COMMENT 'This is the page view table'
    PARTITIONED BY(dt STRING, country STRING)  -- break the table into separate files
    STORED AS SEQUENCEFILE;

Sample Query
Find all page views coming from xyz.com during March 2008:

    SELECT page_views.*
    FROM page_views
    WHERE page_views.dt >= '2008-03-01'
      AND page_views.dt <= '2008-03-31'
      AND page_views.referrer_url like '%xyz.com';

Because the filter is on the partition column dt, Hive reads only the March 2008 partitions instead of scanning the entire table.

Announcement
Project teams: submit the list of members of your team ASAP. Mail to tankl@comp.nus.edu.sg

MapReduce Environment
The MapReduce environment takes care of:
- Partitioning the input data
- Scheduling the program's execution across a set of machines
- Performing the "group-by-key" step in the sort and shuffle phase
- Handling machine failures
- Managing the required inter-machine communication

MapReduce Execution Details
- A single master controls job execution on multiple slaves, as well as user scheduling.
- Mappers are preferentially placed on the same node, or the same rack, as their input block: push computation to the data and minimize network use.
- Mappers save their outputs to local disk rather than pushing them directly to reducers. This allows recovery if a reducer crashes, and it also allows more reduce tasks than nodes, since map output waits on disk until a reducer is scheduled to pull it.
- The outputs of reducers are stored in HDFS and may be replicated.

Coordinator: Master Node
The master node takes care of coordination:
- Task status: idle, in-progress, or completed.
- Idle tasks get scheduled as workers become available.
- When a map task completes, it sends the master the locations and sizes of its R intermediate files, one per reducer; the master pushes this information to the reducers.
- The master pings workers periodically to detect failures.

[Diagram: overall execution flow. (1) The user program submits the job to the master. (2) The master schedules map tasks and reduce tasks onto workers. (3) Map workers read their input splits (split 0 through split 4). (4) Map output is written to local disk. (5) Completed mappers report their output locations to the master, and (6) the master forwards the reducer input locations. (7) Reduce workers remotely read the intermediate files and (8) write the final output (output file 0, output file 1). Phases: input files, map phase, intermediate files on local disk, reduce phase, output files.]

MapReduce – Single reduce task

MapReduce – Multiple reduce tasks

MapReduce – No reduce tasks

MapReduce: Shuffle and sort

Two Additional Notes
- There is a barrier between the map and reduce phases: Reduce cannot start processing until all maps have completed, but copying of intermediate data can begin as soon as a mapper completes.
- Keys arrive at each reducer in sorted order; there is no enforced ordering across reducers.

How Many Map and Reduce Tasks?
- M map tasks, R reduce tasks.
- Rule of thumb: make M much larger than the number of nodes in the cluster. One DFS chunk per map task is common. This improves dynamic load balancing and speeds up recovery from worker failures.
- Usually R is smaller than M, because the output is spread across R files.
- In Hadoop, M can be hinted via JobConf's conf.setNumMapTasks(int num).

Fault Tolerance in MapReduce
1. If a task crashes:
- Retry it on another node. This is OK for a map because it has no dependencies; it is OK for a reduce because the map outputs are still on disk.
- If the same task repeatedly fails, fail the job or skip that input block (user-controlled).
- Q: How do parallel systems traditionally handle fault tolerance? Redo everything, or checkpointing.
- Note: for this and the other fault-tolerance features to work, your map and reduce tasks must be side-effect-free.

Fault Tolerance in MapReduce
2. If a node crashes:
- Relaunch its current tasks on other nodes.
- Also relaunch any maps the node previously ran. Why? Their output files were lost along with the crashed node.

Fault Tolerance in MapReduce
3. If a task is running slowly (a straggler):
- Launch a second copy of the task on another node; take the output of whichever copy finishes first and kill the other.
- This is critical for performance in large clusters: stragglers occur frequently due to failing hardware, bugs, misconfiguration, etc.

Takeaways
By providing a data-parallel programming model, MapReduce can control job execution in useful ways:
- Automatic division of a job into tasks
- Automatic placement of computation near data
- Automatic load balancing
- Recovery from failures and stragglers
The user focuses on the application, not on the complexities of distributed computing.

Conclusions
- MapReduce's data-parallel programming model hides the complexity of distribution and fault tolerance.
- Principal philosophies: make it scale, so you can throw hardware at problems; make it cheap, saving hardware, programmer, and administration costs (but requiring fault tolerance).
- Hive and Pig further simplify programming.
- MapReduce is not suitable for all problems, but when it works, it may save you a lot of time.

Resources
- Hadoop: http://hadoop.apache.org/core/
- Hadoop docs: http://hadoop.apache.org/core/docs/current/
- Pig: http://hadoop.apache.org/pig
- Hive: http://hadoop.apache.org/hive
- Hadoop video tutorials from Cloudera: http://www.cloudera.com/hadoop-training

Acknowledgements
This lecture is adapted from slides from multiple sources, including RAD Lab (Berkeley), Stanford, Duke, Maryland, and others.

Word Length Histogram: counting occurrences of different word lengths in documents. What is the Map task and what is the Reduce task?
- Map task: emit a count for each category of word length.
- Reduce task: combine the counts for each category of word length. (See the sketch below.)
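A standalone sketch of that map/reduce split (illustrative names; a dict stands in for the shuffle):

    from collections import defaultdict

    def length_mapper(line):
        for word in line.split():
            yield (len(word), 1)          # Map: one (length, 1) pair per word

    def length_reducer(length, counts):
        return (length, sum(counts))      # Reduce: total per length category

    groups = defaultdict(list)            # shuffle: group counts by length
    for line in ["the quick brown fox", "the fox ate the mouse"]:
        for k, v in length_mapper(line):
            groups[k].append(v)

    print([length_reducer(k, vs) for k, vs in sorted(groups.items())])
    # [(3, 6), (5, 3)]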

Which of these are effects of reducing the number of key-value pairs that the Map function outputs (e.g., by using a combiner)? Select all that apply.
- The amount of work done in the Map phase is reduced because the output set is smaller.
- The amount of work done in the Reduce phase is reduced because the input set is smaller. (True)
- The amount of data that travels across the network from machines executing Map tasks to machines executing Reduce tasks is reduced. (True)
- Reducing the number of key-value pairs output by the Map function does not have any effects.

Schematic of Parallel Processing
- Start with a big data file.
- The file is split into smaller chunks, which are distributed to k nodes.
- A function f operates on each data item.
- The result is a distributed set of answers.