Hadoop Ida Mele

Parallel programming Parallel programming is used to improve performance and efficiency. In a parallel program, the processing is broken up into parts that run concurrently on different machines. Supercomputers can be replaced by large clusters of CPUs; these CPUs may be on the same machine or in a network of computers.

Parallel programming [Figure: a supercomputer (web graphic, Janet E. Ward, 2000) contrasted with a cluster of desktops]

Parallel programming We have to identify the sets of tasks that can run concurrently. Parallel programming is well suited to large datasets that can be split into equal-size portions. The number of tasks we can perform concurrently depends on the size of the original dataset and on how many CPUs (nodes) we have.

Parallel programming One node is the master: it divides the problem into sub-problems and assigns them to the other nodes, called workers. Each worker processes its sub-problem and returns the partial result to the master. Once the master has received the partial results from all the workers, it combines them to compute the final result.

Parallel programming: examples A big array can be split into small sub-arrays. A large document collection can be split into documents, and each document can in turn be broken up into paragraphs, lines, and so on. Note: not all problems can be parallelized; for example, a computation where each value depends on the previous one cannot be split this way.
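To make the master/worker pattern concrete, here is a minimal Java sketch (an illustration added here, not taken from the original slides) that sums a large array by splitting it into equal-size sub-arrays, one per worker thread:

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelSum {
    public static void main(String[] args) throws Exception {
        int[] data = new int[1_000_000];
        java.util.Arrays.fill(data, 1);

        int workers = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        List<Future<Long>> partials = new ArrayList<>();

        // master: split the array into equal-size chunks, one per worker
        int chunk = (data.length + workers - 1) / workers;
        for (int w = 0; w < workers; w++) {
            final int from = w * chunk;
            final int to = Math.min(from + chunk, data.length);
            // worker: solve one sub-problem and return the partial result
            partials.add(pool.submit(() -> {
                long sum = 0;
                for (int i = from; i < to; i++) sum += data[i];
                return sum;
            }));
        }

        // master: combine the partial results into the final result
        long total = 0;
        for (Future<Long> p : partials) total += p.get();
        pool.shutdown();
        System.out.println(total); // prints 1000000
    }
}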

MapReduce MapReduce was developed by Google for processing large amounts of raw data. It is used for distributed computing on clusters of computers. MapReduce is an abstraction that allows programmers to implement distributed programs. Distributed programming is complex, so MapReduce hides the issues related to parallelization, data distribution, load balancing, and fault tolerance.

MapReduce MapReduce is inspired by the map and reduce combinators of Lisp. Map: (key1, val1) → [(key2, val2)]. The map function takes an input pair and produces a set of zero or more intermediate pairs. The framework groups together all the intermediate values associated with the same intermediate key and passes them to the reducer. Reduce: (key2, [val2]) → [val3]. The reduce function aggregates the values of a key using a binary operation, such as the sum.
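As an illustrative trace (using the word-count example developed later): map("doc1", "a rose is a rose") emits ("a",1), ("rose",1), ("is",1), ("a",1), ("rose",1); the framework groups these into ("a",[1,1]), ("rose",[1,1]), ("is",[1]); reduce then sums each list, yielding ("a",2), ("rose",2), ("is",1).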

MapReduce: dataflow Input reader: it reads the data from stable storage, divides it into portions (splits or shards), and assigns one split to each map function. Map function: it takes a series of pairs and processes each of them to create zero or more intermediate pairs. Partition function: between the map and the reduce stages, the data is shuffled: parallel-sorted and exchanged among nodes, so that each intermediate pair moves from its map node to the shard in which it will be reduced. To do this, the partition function receives the key and the number of reducers and returns the index of the desired reducer; spreading the keys evenly across reducers provides load balancing.
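A minimal sketch of such a partition function (illustrative; it mirrors the behavior of Hadoop's default HashPartitioner):

// Map an intermediate key to one of numReducers shards.
// Masking with Integer.MAX_VALUE keeps the index non-negative
// even when hashCode() returns a negative value.
int partition(Object key, int numReducers) {
    return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
}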

MapReduce: dataflow Comparison function: the input of each reducer is sorted using the comparison function. Reduce function: it takes the sorted intermediate pairs and aggregates the values by key, producing a single output for each key. Output writer: it writes the output of the reducer to stable storage, usually the distributed file system.

Example Consider the following problem: given a large collection of documents, we want to compute the number of occurrences of each term in the documents. How can we solve it in parallel?

Example We assume we have a set of workers. 1. Divide the collection of documents among them, for example one document per worker. 2. Each worker returns the count of each word in its document. 3. Sum up the counts from all the documents to obtain the overall number of occurrences of each word in the collection.

Example Pseudo-code:

map(String input_key, String input_value):
  // input_key: document name
  // input_value: document contents
  for each word w in input_value:
    EmitIntermediate(w, "1");

reduce(String output_key, Iterator intermediate_values):
  // output_key: a word
  // intermediate_values: a list of counts
  int result = 0;
  for each v in intermediate_values:
    result += ParseInt(v);
  Emit(AsString(result));

Example [two figure-only slides in the original deck illustrate this word-count dataflow]

Execution overview [figure-only slide: diagram of the MapReduce execution overview]

Execution overview 1. The map invocations are distributed across multiple CPUs by automatically partitioning the input data into M splits (or shards). The shards can be processed on different CPUs concurrently. 2. One copy of the program is the master, and it assigns the work to the other copies (workers). In particular, it has M map tasks and R reduce tasks to assign, so the master picks idle workers and assigns each one a map or reduce task. 3. A worker with a map task reads the content of its input shard, applies the user-defined map operation, and produces the intermediate pairs.

Execution overview 4. The intermediate pairs are periodically written to the local disk, partitioned into R regions by the partitioning function. The locations of these pairs are passed to the master, which forwards them to the reduce workers. 5. A reduce worker reads the intermediate pairs and sorts them by intermediate key. 6. The reduce worker iterates over the sorted pairs and applies the reduce function, which aggregates the values for each key. The output of the reducer is then appended to the output file.
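In Hadoop's classic API, R is set explicitly while M follows from the input; a minimal sketch (assuming a job class like the WordCount used later):

// requires org.apache.hadoop.mapred.JobConf (classic API)
JobConf conf = new JobConf(WordCount.class); // the job's main class
conf.setNumReduceTasks(4); // R = 4 reduce tasks, hence 4 output shards
// M is not set directly: it equals the number of input splits,
// typically one per HDFS block of the input files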

Reliability MapReduce achieves high reliability: to detect failures, the master pings every worker periodically. If a worker is silent for longer than a given time interval, the master marks it as failed and reassigns the failed worker's work to another node. When a failure occurs, completed map tasks have to be re-executed, since their output is stored on the local disk of the failed machine and is therefore inaccessible. Completed reduce tasks do not need to be re-executed, because their output is stored in the global file system. Some operations are atomic to ensure there are no conflicts.

Hadoop Apache Hadoop is an open-source framework for reliable, scalable, and distributed computing. It implements the computational paradigm named MapReduce. Useful links: hadoop.apache.org

Hadoop The project includes several modules: Hadoop Common: the common utilities that support the other Hadoop modules. Hadoop Distributed File System (HDFS): a distributed file system that provides high-throughput access to application data. Hadoop YARN: a framework for job scheduling and cluster resource management. Hadoop MapReduce: a YARN-based system for parallel processing of large data sets.

Install Hadoop Download a Hadoop release from the Apache Hadoop website (hadoop.apache.org). The directory conf contains all the configuration files. Set JAVA_HOME by editing the file conf/hadoop-env.sh: export JAVA_HOME=<path to your Java installation>

Install Hadoop Optional: if needed, we can specify the classpath. Optional: we can specify the maximum amount of heap to use. The default is 1000 MB, but we can increase it by editing the file conf/hadoop-env.sh: export HADOOP_HEAPSIZE=2000

Install Hadoop Optional: we can specify the directory for temporary output. Edit the file conf/core-site.xml, adding the following lines:

<property>
  <name>hadoop.tmp.dir</name>
  <value>/tmp/hadoop-tmp-${user.name}</value>
  <description>A base for other temporary directories.</description>
</property>

Example: WordCounter Download WordCounter.jar and text.txt. Put WordCounter.jar in the Hadoop directory. In the Hadoop directory, create the sub-directory einput and copy the input file text.txt into it. Run the word counter by issuing the following command: bin/hadoop jar WordCounter.jar mapred.WordCount einput/ eoutput/ Note: make sure that the output directory does not already exist.

Example: WordCounter The Map class (shown as a code screenshot in the original slides):
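A minimal sketch of what such a Map class looks like in the classic org.apache.hadoop.mapred API (a reconstruction, since the screenshot is not reproduced here; names and details are assumptions):

// requires: java.io.IOException, java.util.StringTokenizer,
//           org.apache.hadoop.io.*, org.apache.hadoop.mapred.*
public static class Map extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output,
                    Reporter reporter) throws IOException {
        // emit the intermediate pair (word, 1) for every token of the line
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            output.collect(word, one);
        }
    }
}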

Example: WordCounter The Reduce class (also shown as a code screenshot in the original slides):
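Again a minimal sketch in the classic API (a reconstruction; names and details are assumptions):

// requires: java.io.IOException, java.util.Iterator,
//           org.apache.hadoop.io.*, org.apache.hadoop.mapred.*
public static class Reduce extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output,
                       Reporter reporter) throws IOException {
        // sum all the counts emitted for this word
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
    }
}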

Example: WordCounter more eoutput/part-00000

'1500,' 1
'Come, 1
'Go 1
'Hareton 1
'Here 1
'I 1
'If 1
'Joseph!' 1
'Mr. 2
'No 1
'No, 1
'Not 1
'She's 1

Each line of the output shows a word and its number of occurrences.

Example: WordCounter sort -k2 -n -r eoutput/part-00000 | more

the 93
of 73
and 64
a 60
I 57
to 47
my 27
in 23
his 19
with 18
have 16
that 15

The most frequent words: word frequencies sorted in decreasing order.