
1 MapReduce & Hadoop IT332 Distributed Systems

2 Outline: MapReduce, Hadoop, Cloudera Hadoop, Tutorial 2

3 MapReduce: MapReduce is a programming model for data processing. Its power lies in its ability to scale to hundreds or thousands of computers, each with several processor cores. How large a workload? Web-scale data on the order of hundreds of GBs to TBs or PBs. Such an input data set is unlikely to fit on a single computer's hard drive; hence, a distributed file system (e.g., the Google File System, GFS) is typically required.

4 MapReduce Characteristics: MapReduce ties smaller, more reasonably priced machines together into a single cost-effective commodity cluster. MapReduce divides the workload into multiple independent tasks and schedules them across cluster nodes. The work performed by each task is done in isolation from the other tasks.

5 Data Distribution: In a MapReduce cluster, data is distributed to all the nodes of the cluster as it is being loaded in. An underlying distributed file system (e.g., GFS) splits large data files into chunks that are managed by different nodes in the cluster. Even though the file chunks are distributed across several machines, they form a single namespace. [Diagram: a large input file is split into chunks, with one chunk of input data on each of Node 1, Node 2, and Node 3]
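As an illustration of how a file's chunks (blocks) end up on different nodes (this code is not part of the original slides), the sketch below uses Hadoop's FileSystem Java API to print which hosts hold each block of a file; the HDFS path is a hypothetical example, and the cluster configuration is assumed to be on the classpath.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListBlockLocations {
    public static void main(String[] args) throws Exception {
        // Picks up core-site.xml / hdfs-site.xml from the classpath to locate the cluster
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical file already loaded into HDFS
        Path file = new Path("/user/cloudera/input/large-file.txt");
        FileStatus status = fs.getFileStatus(file);

        // Each chunk (block) of the file is stored, and replicated, on a set of cluster nodes
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("offset=" + block.getOffset()
                    + " length=" + block.getLength()
                    + " hosts=" + String.join(",", block.getHosts()));
        }
    }
}
```

Even though the blocks live on different machines, the program addresses the file through a single path in the HDFS namespace.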

6 MapReduce: A Bird's-Eye View: In MapReduce, chunks are processed in isolation by tasks called Mappers. The outputs from the mappers are called intermediate outputs (IOs) and are brought into a second set of tasks called Reducers. The process of bringing the IOs together into a set of Reducers is known as the shuffling process. The Reducers produce the final outputs (FOs). Overall, MapReduce breaks the data flow into two phases: the map phase and the reduce phase. [Diagram: chunks C0–C3 feed mappers M0–M3, which produce intermediate outputs IO0–IO3; these are shuffled into reducers R0–R1, which produce final outputs FO0–FO1]

7 Keys and Values: The programmer in MapReduce has to specify two functions, the map function and the reduce function, which implement the Mapper and the Reducer of a MapReduce program. In MapReduce, data elements are always structured as key-value (i.e., (K, V)) pairs. The map and reduce functions receive and emit (K, V) pairs. [Diagram: input splits as (K, V) pairs → map function → intermediate (K', V') pairs → reduce function → final (K'', V'') pairs]
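To make these (K, V) signatures concrete (this sketch is not from the original slides), a minimal Hadoop Mapper/Reducer skeleton in Java could look like the following; the class names and the specific key/value types are assumptions chosen for illustration.

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper<K, V, K', V'>: consumes (K, V) records from an input split and emits intermediate (K', V') pairs.
// Here the input key is the byte offset of a line (LongWritable) and the value is the line itself (Text).
class MyMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    // Override map(LongWritable key, Text value, Context context) and
    // call context.write(K', V') zero or more times per input record.
}

// Reducer<K', V', K'', V''>: receives each intermediate key together with all of its values
// (grouped by the shuffle) and emits the final (K'', V'') pairs.
class MyReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    // Override reduce(Text key, Iterable<IntWritable> values, Context context) and
    // call context.write(K'', V'') for each final output pair.
}
```

The framework calls map once per input record and reduce once per distinct intermediate key, which is what lets the work be split across the cluster.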

8 Hadoop: Hadoop is an open-source implementation of MapReduce and is currently enjoying wide popularity. Hadoop presents MapReduce as an analytics engine and, under the hood, uses a distributed storage layer referred to as the Hadoop Distributed File System (HDFS). HDFS mimics the Google File System (GFS).

9 Cloudera Hadoop

10 Cloudera Virtual Machine: The Cloudera VM contains a single-node Apache Hadoop cluster along with everything you need to get started with Hadoop. Requirements: – A 64-bit host OS. – Virtualization software: VMware Player, KVM, or VirtualBox. The virtualization software requires a machine that supports virtualization; if you are unsure, one way to check is to look in your BIOS and see whether virtualization is enabled. – At least 4 GB of total RAM. The total system memory required varies depending on the size of your data set and on the other processes that are running.

11 Installation: Step #1: Download & run VMware. Step #2: Download the Cloudera VM. Step #3: Extract it to the Cloudera folder. Step #4: Open "cloudera-quickstart-vm-4.4.0-1-vmware".

12 Once you have the software installed, fire up the VirtualBox image of the Cloudera QuickStart VM and you should see an initial screen similar to the one below. [Screenshot: Cloudera QuickStart VM initial screen]

13 WordCount Tutorial: This example computes the occurrence frequency of each word in a text file. Steps: 1. Set up the Hadoop environment. 2. Upload the input files into HDFS. 3. Execute the Java MapReduce functions in Hadoop (a sketch of such a job follows below). Tutorial: http://edataanalyst.com/2013/08/the-definitive-cloudera-hadoop-wordcount-tutorial/
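The linked tutorial walks through these steps on the Cloudera VM. As a sketch (not the tutorial's exact code), the canonical Hadoop WordCount job in Java looks roughly like this, taking the HDFS input and output paths from the command line:

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: emit (word, 1) for every word in every input line
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reduce phase: sum the counts for each word
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    // Driver: configure and submit the job; args[0] = HDFS input path, args[1] = HDFS output path
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Assuming the class is packaged into a jar (a hypothetical wordcount.jar) and the input text has been uploaded to HDFS, the job can be launched with something like: hadoop jar wordcount.jar WordCount /user/cloudera/input /user/cloudera/output, where both HDFS paths are example values.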

