
1 Hadoop Ida Mele

2 Parallel programming
Parallel programming is used to improve performance and efficiency. In a parallel program, the processing is broken up into parts that run concurrently on different machines. Supercomputers can be replaced by large clusters of CPUs; these CPUs may sit in a single machine or be spread across a network of computers.

3 Parallel programming
[Figures: a supercomputer (web graphic, Janet E. Ward, 2000) and a cluster of desktops]

4 Parallel programming
We have to identify the sets of tasks that can run concurrently. Parallel programming suits large datasets that can be split into equal-size portions. The number of tasks we can perform concurrently depends on the size of the original dataset and on how many CPUs (nodes) we have.

5 Parallel programming
One node is the master: it divides the problem into sub-problems and assigns them to the other nodes, called workers. Each worker processes its sub-problem and returns the partial result to the master. Once the master has received the partial results from all the workers, it combines them to compute the final result.

6 Parallel programming: examples
A big array can be split into small sub-arrays (as in the sketch below). A large document collection can be split into documents, and each document can be further broken up into paragraphs, lines, and so on.
Note: not all problems can be parallelized, for example when each value to compute depends on the previous one.
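As a minimal sketch of the master/worker scheme on the array example (an illustration, not from the slides), the main thread below acts as the master: it splits an array into chunks, hands each chunk to a worker thread, and sums the partial results.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    public class ParallelSum {
        public static void main(String[] args) throws Exception {
            long[] data = new long[1_000_000];
            for (int i = 0; i < data.length; i++) data[i] = i;

            int workers = Runtime.getRuntime().availableProcessors();
            ExecutorService pool = Executors.newFixedThreadPool(workers);
            int chunk = (data.length + workers - 1) / workers;

            // Master: split the array and hand one sub-array to each worker
            List<Future<Long>> partials = new ArrayList<>();
            for (int w = 0; w < workers; w++) {
                final int from = w * chunk;
                final int to = Math.min(from + chunk, data.length);
                partials.add(pool.submit(() -> {
                    long sum = 0;                      // Worker: process its sub-array
                    for (int i = from; i < to; i++) sum += data[i];
                    return sum;                        // Partial result back to the master
                }));
            }

            // Master: combine the partial results into the final result
            long total = 0;
            for (Future<Long> p : partials) total += p.get();
            pool.shutdown();
            System.out.println("total = " + total);
        }
    }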

7 MapReduce
MapReduce was developed by Google for processing large amounts of raw data. It is used for distributed computing on clusters of computers. MapReduce is an abstraction that allows programmers to write distributed programs: distributed programming is complex, so MapReduce hides the issues related to parallelization, data distribution, load balancing, and fault tolerance.

8 MapReduce
MapReduce is inspired by the map and reduce combinators of Lisp.
Map: (key1, val1) → (key2, val2). The map function takes input pairs and produces a set of zero or more intermediate pairs.
The framework groups together all the intermediate values associated with the same intermediate key and passes them to the reducer.
Reduce: (key2, [val2]) → [val3]. The reduce function aggregates the values of a key using a binary operation, such as the sum.
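To make the signatures concrete, here is a small worked example of the pair flow for word counting (an illustration, not from the original slides):

    map("doc1", "a rose is a rose")  emits  ("a",1), ("rose",1), ("is",1), ("a",1), ("rose",1)
    grouping yields                         ("a",[1,1]), ("rose",[1,1]), ("is",[1])
    reduce (sum) yields                     ("a",2), ("rose",2), ("is",1)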

9 MapReduce: dataflow
Input reader: reads the data from stable storage and divides it into portions (splits or shards), then assigns one split to each map function.
Map function: takes a series of pairs and processes each of them to create zero or more intermediate pairs.
Partition function: between the map and the reduce stages, data is shuffled (parallel-sorted and exchanged among nodes) and moved from the map node to the shard in which it will be reduced. To do that, the partition function receives the key and the number of reducers and returns the index of the desired reducer, which also provides load balancing (see the sketch below).
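A minimal sketch of such a partition function, modeled on Hadoop's default hash partitioner (added here for illustration, not shown on the slides):

    // Returns the index of the reducer that will receive the given key.
    // Masking the sign bit keeps the modulo result non-negative.
    static int getPartition(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

Because a key's hash is stable, all intermediate pairs with the same key land on the same reducer, while different keys spread roughly evenly across the R reducers.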

10 MapReduce: dataflow
Comparison function: the input of each reducer is sorted using the comparison function.
Reduce function: takes the sorted intermediate pairs and aggregates the values by key, producing a single output for each key.
Output writer: writes the output of the reducer to stable storage, usually the distributed file system.

11 Example
Consider the following problem: given a large collection of documents, we want to compute the number of occurrences of each term. How can we solve it in parallel?

12 Example
We assume to have a set of workers:
1. Divide the collection of documents among them, for example one document per worker
2. Each worker returns the count of a given word in its document
3. Sum up the counts from all the documents to obtain the overall number of occurrences of each word in the collection

13 Example (pseudo-code)

map(String input_key, String input_value):
  // input_key: document name
  // input_value: document contents
  for each word w in input_value:
    EmitIntermediate(w, "1");

reduce(String output_key, Iterator intermediate_values):
  // output_key: a word
  // intermediate_values: a list of counts
  int result = 0;
  for each v in intermediate_values:
    result += ParseInt(v);
  Emit(AsString(result));

14 Example
[Figure: word-count example, slide image not preserved in the transcript]

15 Example
[Figure: word-count example (continued), slide image not preserved in the transcript]

16 Execution overview
[Figure: MapReduce execution overview, slide image not preserved in the transcript]

17 Execution overview
1. The map invocations are distributed across multiple CPUs by automatically partitioning the input data into M splits (or shards). The shards can be processed on different CPUs concurrently.
2. One copy of the program is the master, which assigns work to the other copies (the workers). It has M map tasks and R reduce tasks to assign, so the master picks idle workers and assigns each one a map or a reduce task.
3. A worker with a map task reads the content of its input shard, applies the user-defined map operation, and produces the intermediate pairs.

18 Execution overview
4. The intermediate pairs are periodically written to the local disk and partitioned into R regions by the partition function. The locations of these pairs are passed to the master, which forwards them to the reduce workers.
5. Each reduce worker reads the intermediate pairs and sorts them by the intermediate key.
6. The reduce worker iterates over the sorted pairs and applies the reduce function, which aggregates the values for each key. The output of the reducer is then appended to the output file.

19 Reliability
MapReduce provides high reliability: to detect failures, the master pings every worker periodically. If a worker is silent for longer than a given time interval, the master marks it as failed and reassigns its work to another node.
When a failure occurs, completed map tasks have to be re-executed, since their output is stored on the local disk of the failed machine and is inaccessible. Completed reduce tasks do not need to be re-executed, because their output is stored in the global file system.
Some operations are atomic to ensure no conflicts.

20 Hadoop
Apache Hadoop is an open-source framework for reliable, scalable, and distributed computing. It implements the computational paradigm named MapReduce.
Useful links:
http://hadoop.apache.org/
http://hadoop.apache.org/docs/stable/
http://hadoop.apache.org/releases.html#Download

21 Hadoop
The project includes several modules:
Hadoop Common: the common utilities that support the other Hadoop modules
Hadoop Distributed File System (HDFS): a distributed file system that provides high-throughput access to application data
Hadoop YARN: a framework for job scheduling and cluster resource management
Hadoop MapReduce: a YARN-based system for parallel processing of large data sets

22 Install Hadoop
Download release 1.1.1 of Hadoop, available at:
http://www.dis.uniroma1.it/~mele/WebIR.html
or one of the latest releases of Hadoop, available at:
http://hadoop.apache.org/releases.html#Download
The directory conf contains all configuration files.
Set JAVA_HOME by editing the file conf/hadoop-env.sh:
export JAVA_HOME=%JavaPath
(where %JavaPath stands for the path of your Java installation)

23 Install Hadoop
Optional: (if needed) we can specify the classpath.
Optional: we can specify the maximum amount of heap to use. The default is 1000 MB, but we can increase it by editing the file conf/hadoop-env.sh:
export HADOOP_HEAPSIZE=2000

24 Install Hadoop
Optional: we can specify the directory for temporary output. Edit the file conf/core-site.xml, adding the following lines:

<property>
  <name>hadoop.tmp.dir</name>
  <value>/tmp/hadoop-tmp-${user.name}</value>
  <description>A base for other temporary directories.</description>
</property>

25 Example: WordCounter
Download WordCounter.jar and text.txt, available at:
http://www.dis.uniroma1.it/~mele/WebIR.html
Put WordCounter.jar in the Hadoop directory.
In the Hadoop directory, create the sub-directory einput and copy the input file text.txt into it.
Run the word counter by issuing the following command:
bin/hadoop jar WordCounter.jar mapred.WordCount einput/ eoutput/
Note: make sure that the output directory does not already exist.

26 Example: WordCounter
The Map class:
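The slide shows the class as an image that did not survive the transcript. A representative sketch, assuming the classic WordCount Mapper of the Hadoop 1.x (org.apache.hadoop.mapred) API used by this release:

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class Map extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        // Emits the intermediate pair (word, 1) for every token in the input line
        public void map(LongWritable key, Text value,
                        OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                output.collect(word, one);
            }
        }
    }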

27 Example: WordCounter
The Reduce class:
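Again the slide image is not preserved; a representative sketch under the same assumption (classic Hadoop 1.x WordCount Reducer):

    import java.io.IOException;
    import java.util.Iterator;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    public class Reduce extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {
        // Sums the counts collected for each word and emits (word, total)
        public void reduce(Text key, Iterator<IntWritable> values,
                           OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            output.collect(key, new IntWritable(sum));
        }
    }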

28 Example: WordCounter
more eoutput/part-00000
Output pairs (word, occurrences), one per line:
'1500,' 1
'Come, 1
'Go 1
'Hareton 1
'Here 1
'I 1
'If 1
'Joseph!' 1
'Mr. 2
'No 1
'No, 1
'Not 1
'She's 1

29 Example: WordCounter
sort -k2 -n -r eoutput/part-00000 | more
Most frequent words (word frequencies sorted in decreasing order):
the 93
of 73
and 64
a 60
I 57
to 47
my 27
in 23
his 19
with 18
have 16
that 15

