What is Apache Hadoop? Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware Created.

Slides:

Advertisements

Similar presentations

Introduction to Hadoop Richard Holowczak Baruch College.

Advertisements

 Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware  Created by Doug Cutting and.

MapReduce Online Created by: Rajesh Gadipuuri Modified by: Ying Lu.

 Need for a new processing platform (BigData)  Origin of Hadoop  What is Hadoop & what it is not ?  Hadoop architecture  Hadoop components (Common/HDFS/MapReduce)

CS 345A Data Mining MapReduce. Single-node architecture Memory Disk CPU Machine Learning, Statistics “Classical” Data Mining.

Lecture 2 – MapReduce CPE 458 – Parallel Programming, Spring 2009 Except as otherwise noted, the content of this presentation is licensed under the Creative.

Google Distributed System and Hadoop Lakshmi Thyagarajan.

Take An Internal Look at Hadoop Hairong Kuang Grid Team, Yahoo! Inc

Hadoop, Hadoop, Hadoop!!! Jerome Mitchell Indiana University.

HADOOP ADMIN: Session -2

Advanced Topics: MapReduce ECE 454 Computer Systems Programming Topics: Reductions Implemented in Distributed Frameworks Distributed Key-Value Stores Hadoop.

Jeffrey D. Ullman Stanford University. 2 Chunking Replication Distribution on Racks.

SOFTWARE SYSTEMS DEVELOPMENT MAP-REDUCE, Hadoop, HBase.

Map Reduce and Hadoop S. Sudarshan, IIT Bombay

Map Reduce for data-intensive computing (Some of the content is adapted from the original authors’ talk at OSDI 04)

CS525: Special Topics in DBs Large-Scale Data Management Hadoop/MapReduce Computing Paradigm Spring 2013 WPI, Mohamed Eltabakh 1.

HBase A column-centered database 1. Overview An Apache project Influenced by Google’s BigTable Built on Hadoop ▫A distributed file system ▫Supports Map-Reduce.

MapReduce: Simplified Data Processing on Large Clusters Jeffrey Dean and Sanjay Ghemawat.

MapReduce and Hadoop 1 Wu-Jun Li Department of Computer Science and Engineering Shanghai Jiao Tong University Lecture 2: MapReduce and Hadoop Mining Massive.

1 The Map-Reduce Framework Compiled by Mark Silberstein, using slides from Dan Weld’s class at U. Washington, Yaniv Carmeli and some other.

Presented by CH.Anusha.  Apache Hadoop framework  HDFS and MapReduce  Hadoop distributed file system  JobTracker and TaskTracker  Apache Hadoop NextGen.

MapReduce: Hadoop Implementation. Outline MapReduce overview Applications of MapReduce Hadoop overview.

Introduction to Apache Hadoop Zibo Wang. Introduction  What is Apache Hadoop?  Apache Hadoop is a software framework which provides open source libraries.

Hadoop/MapReduce Computing Paradigm 1 Shirish Agale.

Introduction to Hadoop and HDFS

Contents HADOOP INTRODUCTION AND CONCEPTUAL OVERVIEW TERMINOLOGY QUICK TOUR OF CLOUDERA MANAGER.

Grid Computing at Yahoo! Sameer Paranjpye Mahadev Konar Yahoo!

Presented by: Katie Woods and Jordan Howell. * Hadoop is a distributed computing platform written in Java. It incorporates features similar to those of.

MapReduce and the New Software Stack CHAPTER 2 1.

By Jeff Dean & Sanjay Ghemawat Google Inc. OSDI 2004 Presented by : Mohit Deopujari.

CS525: Big Data Analytics MapReduce Computing Paradigm & Apache Hadoop Open Source Fall 2013 Elke A. Rundensteiner 1.

Hadoop/MapReduce Computing Paradigm 1 CS525: Special Topics in DBs Large-Scale Data Management Presented By Kelly Technologies

MapReduce: Simplified Data Processing on Large Clusters By Dinesh Dharme.

INTRODUCTION TO HADOOP. OUTLINE  What is Hadoop  The core of Hadoop  Structure of Hadoop Distributed File System  Structure of MapReduce Framework.

1 Student Date Time Wei Li Nov 30, 2015 Monday 9:00-9:25am Shubbhi Taneja Nov 30, 2015 Monday9:25-9:50am Rodrigo Sanandan Dec 2, 2015 Wednesday9:00-9:25am.

BIG DATA/ Hadoop Interview Questions.

CCD-410 Cloudera Certified Developer for Apache Hadoop (CCDH) Cloudera.

Map reduce Cs 595 Lecture 11.

MapReduce Compiler RHadoop

Hadoop Aakash Kag What Why How 1.

Apache hadoop & Mapreduce

Software Systems Development

INTRODUCTION TO BIGDATA & HADOOP

HADOOP ADMIN: Session -2

An Open Source Project Commonly Used for Processing Big Data Sets

Chapter 10 Data Analytics for IoT

Large-scale file systems and Map-Reduce

Hadoop MapReduce Framework

Auburn University COMP7330/7336 Advanced Parallel and Distributed Computing MapReduce - Introduction Dr. Xiao Qin Auburn.

Introduction to MapReduce and Hadoop

Introduction to HDFS: Hadoop Distributed File System

Hadoop Clusters Tess Fulkerson.

Software Engineering Introduction to Apache Hadoop Map Reduce

Central Florida Business Intelligence User Group

MapReduce Computing Paradigm Basics Fall 2013 Elke A. Rundensteiner

Ministry of Higher Education

Database Applications (15-415) Hadoop Lecture 26, April 19, 2016

MapReduce Simplied Data Processing on Large Clusters

The Basics of Apache Hadoop

湖南大学-信息科学与工程学院-计算机与科学系

Map reduce use case Giuseppe Andronico INFN Sez. CT & Consorzio COMETA

Overview of big data tools

Lecture 16 (Intro to MapReduce and Hadoop)

CS 345A Data Mining MapReduce This presentation has been altered.

Charles Tappert Seidenberg School of CSIS, Pace University

5/7/2019 Map Reduce Map reduce.

MapReduce: Simplified Data Processing on Large Clusters

Presentation transcript:

What is Apache Hadoop? Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware Created by Doug Cutting and Mike Carafella in 2005. Cutting named the program after his son’s toy elephant.

Uses for Hadoop Data-intensive text processing Assembly of large genomes Graph mining Machine learning and data mining Large scale social network analysis

Who Uses Hadoop?

The Hadoop Ecosystem Hadoop Common Contains Libraries and other modules HDFS Hadoop Distributed File System Hadoop YARN Yet Another Resource Negotiator Hadoop MapReduce A programming model for large scale data processing

Motivations for Hadoop What considerations led to its design

Motivations for Hadoop What were the limitations of earlier large- scale computing? What requirements should an alternative approach have? How does Hadoop address those requirements?

Early Large Scale Computing Historically computation was processor- bound Data volume has been relatively small Complicated computations are performed on that data Advances in computer technology has historically centered around improving the power of a single machine

Cray-1

Advances in CPUs Moore’s Law The number of transistors on a dense integrated circuit doubles every two years Single-core computing can’t scale with current computing needs

Single-Core Limitation Power consumption limits the speed increase we get from transistor density

Distributed Systems Allows developers to use multiple machines for a single task

Distributed System: Problems Programming on a distributed system is much more complex Synchronizing data exchanges Managing a finite bandwidth Controlling computation timing is complicated

Distributed System: Problems “You know you have a distributed system when the crash of a computer you’ve never heard of stops you from getting any work done.” –Leslie Lamport Distributed systems must be designed with the expectation of failure Example of failure issues. Linux lab is distributed file system, if the file server fails, what happens.

Distributed System: Data Storage Typically divided into Data Nodes and Compute Nodes At compute time, data is copied to the Compute Nodes Fine for relatively small amounts of data Modern systems deal with far more data than was gathering in the past

How much data? Facebook Yahoo eBay 500 TB per day Yahoo Over 170 PB eBay Over 6 PB Getting the data to the processors becomes the bottleneck

Requirements for Hadoop Must support partial failure Must be scalable

Partial Failures Failure of a single component must not cause the failure of the entire system only a degradation of the application performance Failure should not result in the loss of any data

Component Recovery If a component fails, it should be able to recover without restarting the entire system Component failure or recovery during a job must not affect the final output

Scalability Increasing resources should increase load capacity Increasing the load on the system should result in a graceful decline in performance for all jobs Not system failure

Hadoop Based on work done by Google in the early 2000s “The Google File System” in 2003 “MapReduce: Simplified Data Processing on Large Clusters” in 2004 The core idea was to distribute the data as it is initially stored Each node can then perform computation on the data it stores without moving the data for the initial processing

Core Hadoop Concepts Applications are written in a high-level programming language No network programming or temporal dependency Nodes should communicate as little as possible A “shared nothing” architecture Data is spread among the machines in advance Perform computation where the data is already stored as often as possible

High-Level Overview When data is loaded onto the system it is divided into blocks Typically 64MB or 128MB Tasks are divided into two phases Map tasks which are done on small portions of data where the data is stored Reduce tasks which combine data to produce the final output A master program allocates work to individual nodes Example of map and reduce

Fault Tolerance Failures are detected by the master program which reassigns the work to a different node Restarting a task does not affect the nodes working on other portions of the data If a failed node restarts, it is added back to the system and assigned new tasks The master can redundantly execute the same task to avoid slow running nodes

Hadoop Distributed File System HDFS

Overview Responsible for storing data on the cluster Data files are split into blocks and distributed across the nodes in the cluster Each block is replicated multiple times

HDFS Basic Concepts HDFS is a file system written in Java based on the Google’s GFS Provides redundant storage for massive amounts of data

HDFS Basic Concepts HDFS works best with a smaller number of large files Millions as opposed to billions of files Typically 100MB or more per file Files in HDFS are write once An advantage of this technique is that you don't have to bother with synchronization. Since you write once, your reader are guaranteed that the data will not be manipulated while they read. Optimized for streaming reads of large files and not random reads

How are Files Stored Files are split into blocks Blocks are split across many machines at load time Different blocks from the same file will be stored on different machines Blocks are replicated across multiple machines The NameNode keeps track of which blocks make up a file and where they are stored Default replication is 3-fold

HDFS Architecture Cluster Membership NameNode 1. filename Secondary NameNode 2. BlckId, DataNodes o Client 3.Read data Cluster Membership NameNode : Maps a file to a file-id and list of MapNodes DataNode : Maps a block-id to a physical location on disk SecondaryNameNode: Periodic merge of Transaction log DataNodes

Running on a single machine, the NameNode daemon determines and tracks where the various blocks of a data file are stored. The DataNode daemon manages the data stored on each machine. If a client application wants to access a particular file stored in HDFS, the application contacts the NameNode. NameNode provides the application with the locations of the various blocks for that file. The application then communicates with the appropriate DataNodes to access the file. Each DataNode periodically builds a report about the blocks stored on the DataNode and sends the report to the NameNode.

For performance reasons, the NameNode resides in a machine’s memory. Because the NameNode is critical to the operation of HDFS, any unavailability or corruption of the NameNode results in a data unavailability event on the cluster. Thus, the NameNode is viewed as a single point of failure in the Hadoop environment. To minimize the chance of a NameNode failure and to improve performance, the NameNode is typically run on a dedicated machine.

A third daemon, the Secondary NameNode. It provides the capability to perform some of the NameNode tasks to reduce the load on the NameNode. Such tasks include updating the file system image with the contents of the file system edit logs. In the event of a NameNode outage, the NameNode must be restarted and initialized with the last file system image file and the contents of the edits logs.

Data Replication Default replication is 3-fold

Data Retrieval When a client wants to retrieve data Communicates with the NameNode to determine which blocks make up a file and on which data nodes those blocks are stored Then communicated directly with the data nodes to read the data

The latest versions of Hadoop provide an HDFS High Availability (HA) feature. This feature enables the use of two NameNodes: one in an active state, and the other in a standby state. If an active NameNode fails, the standby NameNode takes over. When using the HDFS HA feature, a Secondary NameNode is unnecessary. Figure illustrates a Hadoop cluster with ten machines and the storage of one large file requiring three HDFS data blocks. Furthermore, this file is stored using triple replication. The machines running the NameNode and the Secondary NameNode are considered master nodes.

MapReduce Distributing computation across nodes

Motivation for MapReduce (why) Large-Scale Data Processing Want to use 1000s of CPUs But don’t want hassle of managing things MapReduce Architecture provides Automatic parallelization & distribution Fault tolerance I/O scheduling Monitoring & status updates Why: Programming model for organizing computations MR is an OS for using clusters (Cloud Computing) Research : How to map problem solution to MR architecture/Cloud L06MapReduce Prasad

MapReduce Overview A method for distributing computation across multiple nodes Each node processes the data that is stored at that node Consists of two main phases Map Reduce

MapReduce Features Automatic parallelization and distribution Fault-Tolerance Provides a clean abstraction for programmers to use

JobTracker and TaskTracker Determines the execution plan for the job Assigns individual tasks TaskTracker Keeps track of the performance of an individual mapper or reducer

The Mapper Reads data as key/value pairs The key is often discarded Outputs zero or more key/value pairs

Shuffle and Sort Output from the mapper is sorted by key All values with the same key are guaranteed to go to the same machine

The Reducer Called once for each unique key Gets a list of all values associated with a key as input The reducer outputs zero or more final key/value pairs Usually just one output per input key

Typical Large Data Problem

MapReduce - Dataflow

MapReduce - Features Fine grained Map and Reduce tasks Improved load balancing Faster recovery from failed tasks Automatic re-execution on failure In a large cluster, some nodes are always slow or flaky Framework re-executes failed tasks Locality optimizations With large data, bandwidth to data is a problem Map-Reduce + HDFS is a very effective solution Map-Reduce queries HDFS for locations of input data Map tasks are scheduled close to the inputs when possible

Example: Word Count We have a large file of words, one word to a line Count the number of times each distinct word appears in the file Sample application: analyze web server logs to find popular URLs L06MapReduce Prasad

MapReduce Input: a set of key/value pairs User supplies two functions: map(k,v)  list(k1,v1) reduce(k1, list(v1))  v2 (k1,v1) is an intermediate key/value pair Output is the set of (k1,v2) pairs L06MapReduce Prasad

MapReduce: Word Count

MAP REDUCE FLOW

Combined Hadoop Architecture

Anatomy of a Cluster What parts actually make up a Hadoop cluster

Overview NameNode Secondary NameNode DataNode JobTracker TaskTracker Holds the metadata for the HDFS Secondary NameNode Performs housekeeping functions for the NameNode DataNode Stores the actual HDFS data blocks JobTracker Manages MapReduce jobs TaskTracker Monitors individual Map and Reduce tasks

The NameNode Stores the HDFS file system information in a fsimage Updates to the file system (add/remove blocks) do not change the fsimage file They are instead written to a log file When starting the NameNode loads the fsimage file and then applies the changes in the log file

The Secondary NameNode NOT a backup for the NameNode Periodically reads the log file and applies the changes to the fsimage file bringing it up to date Allows the NameNode to restart faster when required

JobTracker and TaskTracker Determines the execution plan for the job Assigns individual tasks TaskTracker Keeps track of the performance of an individual mapper or reducer

Hadoop Ecosystem Other available tools

Why do these tools exist? MapReduce is very powerful, but can be awkward to master These tools allow programmers who are familiar with other programming styles to take advantage of the power of MapReduce

Other Tools Hive Pig Cascading HBase Flume Hadoop processing with SQL Hadoop processing with scripting Cascading Pipe and Filter processing model HBase Database model built on top of Hadoop Flume Designed for large scale data movement