Software Engineering Introduction to Apache Hadoop Map Reduce


Software Engineering Introduction to Apache Hadoop Map Reduce Lamprakis Panagiotis – 03113118

What is Apache Hadoop? Apache Hadoop is a framework designed for processing big data sets distributed across large clusters of commodity hardware. The basic ideas were taken from the Google File System (GFS or GoogleFS).

What does Apache Hadoop encompass? Hadoop Common: utilities that are used by the other modules. Hadoop Distributed File System (HDFS): a distributed file system similar to the one developed by Google under the name GFS. Hadoop YARN: this module provides the job scheduling and cluster resource management used by the MapReduce framework. Hadoop MapReduce: a framework designed to process huge amounts of data.

HDFS Architecture (1) Assumptions of HDFS Hardware failure is treated as the norm rather than the exception. Instead of relying on expensive fault-tolerant hardware, commodity hardware is chosen. This not only reduces the investment for the whole cluster setup but also enables easy replacement of failed hardware. As Hadoop is designed for batch processing of large data sets, requirements that stem from the more user-centered POSIX standard are relaxed. Hence low-latency access to random parts of a file is less important than streaming whole files with high throughput.

HDFS Architecture (2) Assumptions of HDFS The applications using Hadoop process large data sets that reside in large files. Hence Hadoop is tuned to handle a few big files rather than many small ones. Most big data applications write the data once and read it often (log files, HTML pages, user-provided images, etc.). Therefore Hadoop assumes that a file, once created, is never updated. This simplifies the coherency model and enables high throughput.

NameNode and DataNodes A picture is worth a thousand words ...

HDFS: Robust against failures (1) DataNode failure: Each DataNode periodically sends a heartbeat message to the NameNode. If the NameNode does not receive a heartbeat from a DataNode for a specific amount of time, the node is considered dead and no further operations are scheduled for it. A dead DataNode decreases the replication factor of the data blocks it stored. To prevent data loss, the NameNode can start new replication processes to restore the replication factor of these blocks. Network partitions: If the cluster breaks up into two or more partitions, the NameNode loses the connection to a set of DataNodes. These DataNodes are considered dead and no further I/O operations are scheduled for them.
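The replication factor and the heartbeat timing described above are configurable per cluster. A minimal hdfs-site.xml sketch, assuming a stock Hadoop 2.x installation (the property names are the standard ones; the values shown are only illustrative, matching the common defaults):

```xml
<configuration>
  <!-- How many copies of each data block HDFS keeps (default: 3). -->
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
  <!-- How often each DataNode sends a heartbeat to the NameNode, in seconds. -->
  <property>
    <name>dfs.heartbeat.interval</name>
    <value>3</value>
  </property>
</configuration>
```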

HDFS: Robust against failures (2) NameNode failure: As the NameNode is a single point of failure, it is critical that its data (EditLog and FsImage) can be restored. Therefore one can configure the NameNode to store more than one copy of the EditLog and FsImage files. Although this decreases the speed with which the NameNode can process operations, it ensures at the same time that multiple copies of the critical files exist.
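Storing multiple copies of the FsImage and EditLog is done by listing several directories (ideally on different physical disks or an NFS mount) in hdfs-site.xml. A sketch, assuming a stock Hadoop installation; the paths are hypothetical examples:

```xml
<configuration>
  <!-- The NameNode writes its FsImage and EditLog to every directory
       in this comma-separated list, giving redundant copies. -->
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/data/disk1/namenode,/data/disk2/namenode</value>
  </property>
</configuration>
```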

Hadoop MapReduce Jeffrey Dean and Sanjay Ghemawat

MapReduce Architecture

YARN YARN (Yet Another Resource Negotiator) was introduced to Hadoop with version 2.0 and solves a few issues with the resource scheduling of MapReduce in version 1.0.

Before YARN

Cheat Sheet A MapReduce job is split by the framework into tasks (map tasks, reduce tasks), and each task is run on one of the DataNode machines in the cluster. For the execution of tasks, each DataNode machine provided a predefined number of slots (map slots, reduce slots). The JobTracker was responsible for reserving execution slots for the different tasks of a job and monitored their execution. If execution failed, it reserved another slot and restarted the task. It also cleaned up temporary resources and made the reserved slot available to other tasks. Having only one instance of the JobTracker limits scalability (for very large clusters with thousands of nodes). The concept of predefined map and reduce slots also caused resource problems when all map slots were in use while reduce slots were still available, and vice versa.

With YARN

Cheat Sheet Hence Hadoop 2.0 introduced YARN as the resource manager, which no longer uses slots to manage resources. Instead, nodes have "resources" (such as memory and CPU cores) which applications can allocate on a per-request basis. This way MapReduce jobs can run together with non-MapReduce jobs in the same cluster. The heart of YARN is the ResourceManager (RM), which runs on the master node and acts as a global resource scheduler; it also arbitrates resources between competing applications. In contrast to the ResourceManager, the NodeManagers (NM) run on the slave nodes and communicate with the RM. The NodeManager is responsible for creating the containers in which the applications run; it monitors their CPU and memory usage and reports it to the RM.
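The per-node resources that each NodeManager advertises to the ResourceManager are declared in yarn-site.xml. A minimal sketch, assuming a stock Hadoop 2.x installation (standard property names; the values are only illustrative):

```xml
<configuration>
  <!-- Total memory on this node that the NM may hand out to containers, in MB. -->
  <property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>8192</value>
  </property>
  <!-- Total CPU cores on this node available for containers. -->
  <property>
    <name>yarn.nodemanager.resource.cpu-vcores</name>
    <value>8</value>
  </property>
</configuration>
```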

Let’s see a hands-on example on Hadoop MapReduce…
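Real Hadoop jobs are written in Java against the Hadoop API (or in other languages via Hadoop Streaming), but the programming model itself fits in a few lines. Below is a plain-Python sketch of the classic word count, with the three phases the framework performs (map, shuffle/sort, reduce) written out explicitly; the function names are our own, not Hadoop API calls:

```python
from itertools import groupby
from operator import itemgetter

def map_phase(records):
    # Map: for each (offset, line) input record, emit a (word, 1) pair per word.
    for _, line in records:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle_phase(pairs):
    # Shuffle/sort: group the intermediate pairs by key,
    # as the framework does between the map and reduce phases.
    for key, group in groupby(sorted(pairs, key=itemgetter(0)), key=itemgetter(0)):
        yield key, [value for _, value in group]

def reduce_phase(grouped):
    # Reduce: sum the counts emitted for each word.
    for word, counts in grouped:
        yield word, sum(counts)

def word_count(lines):
    # Model TextInputFormat: records are (byte-offset-like index, line) pairs.
    records = enumerate(lines)
    return dict(reduce_phase(shuffle_phase(map_phase(records))))

if __name__ == "__main__":
    print(word_count(["the quick brown fox", "the lazy dog"]))
    # → {'brown': 1, 'dog': 1, 'fox': 1, 'lazy': 1, 'quick': 1, 'the': 2}
```

In real Hadoop the same logic lives in a Mapper and a Reducer class, and the shuffle is done by the framework across machines; this sketch only shows the data flow on a single process.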