Inter-process Communication in Hadoop

Different types of network interactions found in Hadoop (source: Cloudera)

1. Hadoop RPC calls – These are performed by clients using the Hadoop API, by MapReduce jobs, and among Hadoop services (JobTracker, TaskTrackers, NameNodes, DataNodes).
2. HDFS data transfer – Done when reading or writing data to HDFS, by clients using the Hadoop API, by MapReduce jobs, and among Hadoop services. HDFS data transfers use TCP/IP sockets directly.
3. MapReduce shuffle – The shuffle part of a MapReduce job is the process of transferring data from the map tasks to the reduce tasks. As this transfer is typically between different nodes in the cluster, the shuffle is done using the HTTP protocol.
4. Web UIs – The Hadoop daemons provide web UIs for users and administrators to monitor jobs and the status of the cluster. The web UIs use the HTTP protocol.
5. FSImage operations – These are metadata transfers between the NameNode and the Secondary NameNode. They are done using the HTTP protocol.

So Hadoop uses three different network communication protocols: Hadoop RPC, direct TCP/IP, and HTTP.

IPC in the Hadoop Distributed System

IPC: inter-process communication. Based on: "Using Hadoop IPC/RPC for distributed applications," by Kelvin Tan, http://www.supermind.org/blog/520/using-hadoop-ipcrpc-for-distributed-applications

Hadoop IPC is the underlying mechanism used whenever out-of-process applications need to communicate with one another.

Hadoop IPC

Hadoop IPC:
1. uses binary serialization via java.io.DataOutputStream and java.io.DataInputStream, unlike SOAP or XML-RPC;
2. is a simple, low-fuss RPC mechanism;
3. is unicast; it does not support multicast.

Why use Hadoop IPC over RMI or java.io.Serializable? (RMI: Remote Method Invocation; RPC: Remote Procedure Call)
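
To make point 1 concrete, here is a small self-contained sketch (ours, not from the original slides) of the compact DataOutputStream/DataInputStream binary style that Hadoop IPC builds on, in contrast to verbose text formats like SOAP or XML-RPC:

import java.io.*;

// Illustrative only: binary round-trip of two fields with no
// markup overhead, the serialization style Hadoop IPC favors.
public class BinaryRoundTrip {
    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        out.writeUTF("OK");    // length-prefixed UTF-8 string
        out.writeLong(42L);    // exactly 8 bytes
        out.flush();

        // Fields must be read back in the order they were written.
        DataInputStream in = new DataInputStream(
                new ByteArrayInputStream(bytes.toByteArray()));
        System.out.println(in.readUTF() + " " + in.readLong());
    }
}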

Server code

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.ipc.RPC;
import org.apache.hadoop.ipc.Server;

Configuration conf = new Configuration();
// Start an RPC server on localhost:16000; "this" must implement
// a protocol interface that extends VersionedProtocol.
Server server = RPC.getServer(this, "localhost", 16000, conf);
server.start();

Client code

import java.net.InetSocketAddress;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.ipc.RPC;

Configuration conf = new Configuration();
// The server's address and port.
InetSocketAddress addr = new InetSocketAddress("localhost", 16000);
// Block until the server is up, then return a typed proxy.
ClientProtocol client = (ClientProtocol) RPC.waitForProxy(
        ClientProtocol.class, ClientProtocol.versionID, addr, conf);
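
Once the proxy is in hand, remote methods are invoked as if they were local calls. A hedged usage sketch (ClientProtocol and HeartbeatResponse are defined on the next slides; RPC.stopProxy is the old API's cleanup call):

// Invoke a remote method through the proxy as if it were local.
HeartbeatResponse response = client.heartbeat();
// Release the proxy and its connection when done.
RPC.stopProxy(client);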

ClientProtocol interface

interface ClientProtocol extends org.apache.hadoop.ipc.VersionedProtocol {
    public static final long versionID = 1L;
    HeartbeatResponse heartbeat();
}

Note: HeartbeatResponse is a class (shown next).

HeartbeatResponse

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.UTF8;

public class HeartbeatResponse implements org.apache.hadoop.io.Writable {
    String status;

    public void write(DataOutput out) throws IOException {
        UTF8.writeString(out, status);
    }

    public void readFields(DataInput in) throws IOException {
        this.status = UTF8.readString(in);
    }
}

Summary

1. A server which implements an interface (which itself extends VersionedProtocol).
2. One or more clients which call the interface methods.
3. Any method arguments or objects returned by methods must implement org.apache.hadoop.io.Writable.

Learn about RPC, RMI, TCP/IP, SOAP, REST, … Also see ;) http://www.cse.buffalo.edu/~bina/cse486/spring2008/RMIJan30.ppt
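
Putting the three pieces together, a minimal server-side sketch, assuming the old (0.x/1.x) org.apache.hadoop.ipc API; the class name HeartbeatServer is ours, for illustration:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.ipc.RPC;
import org.apache.hadoop.ipc.Server;

// Sketch: implements ClientProtocol so RPC.getServer can dispatch
// remote heartbeat() calls to this instance.
public class HeartbeatServer implements ClientProtocol {

    // Required by VersionedProtocol: client and server confirm they
    // speak the same version of the protocol.
    public long getProtocolVersion(String protocol, long clientVersion)
            throws IOException {
        return versionID;
    }

    // The remote method itself: returns a Writable result object.
    public HeartbeatResponse heartbeat() {
        HeartbeatResponse response = new HeartbeatResponse();
        response.status = "alive";
        return response;
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Server server = RPC.getServer(new HeartbeatServer(),
                                      "localhost", 16000, conf);
        server.start();
        server.join();  // block until the server is stopped
    }
}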

Now to answer Troy's and Saurab's questions:

The output of a mapper is buffered. When the buffer is 80% full, its contents are spilled to the local file system (as "spill files"), not to HDFS. Since the data is local and transient, there is no need for replication.

"The map outputs are copied to the reduce tasktracker's memory if they are small enough (the buffer's size is controlled by mapred.job.shuffle.input.buffer.percent, which specifies the proportion of the heap to use for this purpose); otherwise, they are copied to disk. …" — T. White, Hadoop: The Definitive Guide

So the question is: how will you design/decide the size of the input split to the mapper? How many mappers? Memory (RAM) size/heap size and the bandwidth between machines are the important factors.
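
One way to reason about it: the number of map tasks is roughly the total input size divided by the split size. A hedged sketch using the FileInputFormat helpers from the org.apache.hadoop.mapreduce API (the class name and all sizes are ours, for illustration):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

// Sketch: how split size drives the number of mappers.
public class SplitSizing {
    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "split-sizing");

        // Cap each split at 128 MB; one map task is launched per split,
        // so ~10 GB of input yields about 10 GB / 128 MB = 80 mappers.
        FileInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);
        FileInputFormat.setMinInputSplitSize(job, 64L * 1024 * 1024);
    }
}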

There was a question about the partitioner. Here is the interface (old mapred API):

public interface Partitioner<K, V> extends JobConfigurable {
    int getPartition(K key, V value, int numPartitions);
}

The return value is the partition number, i.e., the number of the reducer that will receive the record.
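
As a concrete sketch, here is a minimal implementation (the class name is ours; the logic mirrors what Hadoop's default HashPartitioner does):

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

// Same key always maps to the same reducer; the mask keeps the
// result non-negative even when hashCode() is negative.
public class SimpleHashPartitioner<K, V> implements Partitioner<K, V> {
    public void configure(JobConf job) { }  // no configuration needed

    public int getPartition(K key, V value, int numPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}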

Hadoop task flow (diagram): HDFS → Map → Partition → Combine → buffer/cache and local disk → Shuffle/Copy (up to #Mappers × #Reducers transfers) → Reduce → HDFS.

What sort could the partition and shuffle use?

Improvements? Parameters to tune in mapred-site.xml:

io.sort.mb — the output buffer of a mapper. When this buffer fills up, the data is sorted and spilled to disk. Ideally you avoid having too many spills. Note that this memory is part of the map task's heap.

mapred.map.child.java.opts — the heap size of a map task; the higher this is, the larger you can make the output buffer.

mapred.reduce.tasks — in principle the number of reduce tasks also influences shuffle speed. The number of reduce rounds is the total number of reduce tasks divided by the number of reduce slots. Note that the initial shuffle (during the map phase) only copies data to the active reducers.

io.sort.factor — the number of streams merged at once during the merge sort, on both the map and the reduce side.

Compression also has a large impact: it speeds up the transfer from mapper to reducer, but the compression/decompression comes at a CPU cost.

mapred.job.shuffle.input.buffer.percent — the proportion of the reducer's heap used to store map output in memory.
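
As a sketch, such knobs are set in mapred-site.xml; the values below are illustrative assumptions, not recommendations:

<configuration>
  <!-- Illustrative values only. -->
  <property>
    <name>io.sort.mb</name>
    <value>256</value>  <!-- map-side sort buffer, in MB -->
  </property>
  <property>
    <name>io.sort.factor</name>
    <value>50</value>  <!-- streams merged at once -->
  </property>
  <property>
    <name>mapred.map.child.java.opts</name>
    <value>-Xmx1024m</value>  <!-- map task heap; must exceed io.sort.mb -->
  </property>
  <property>
    <name>mapred.job.shuffle.input.buffer.percent</name>
    <value>0.70</value>  <!-- fraction of reducer heap for map outputs -->
  </property>
  <property>
    <name>mapred.compress.map.output</name>
    <value>true</value>  <!-- compress map output to cut shuffle traffic -->
  </property>
</configuration>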

More info

Whether increasing the sort buffer memory yields a performance gain depends on:
a) the size and total number of the keys emitted by the mapper;
b) the nature of the mapper tasks (I/O-intensive vs. CPU-intensive);
c) the available primary memory and map/reduce slots on the given node;
d) data skewness.