Introduction to Hadoop
Richard Holowczak, Baruch College



Problems of Scale
As data size and processing complexity grow:
- Contention for disks: disks have limited throughput
- Processing cores per server/OS image are limited, so processing throughput is limited
- Reliability of distributed systems: tightly coupled distributed systems fall apart when one component (disk, network, CPU, etc.) fails. What happens to processing jobs when there is a failure?
- Rigid structure of distributed systems: consider our ETL processes, where the target schema is fixed ahead of time

Hadoop
A distributed data processing ecosystem that is scalable, reliable, and fault tolerant.
A collection of projects currently maintained under the Apache Foundation: hadoop.apache.org
- Storage layer: Hadoop Distributed File System (HDFS)
- Scheduling layer: Hadoop YARN
- Execution layer: Hadoop MapReduce
Plus many more projects built on top of these.

Hadoop Distributed File System (HDFS)
- Built on top of commodity hardware and operating systems: any functioning Linux (or Windows) system can be set up as a node
- Files are split into 64 MB blocks that are distributed and replicated across nodes; typically at least 3 copies of each block are made
- File I/O semantics are simplified: write once (no notion of update), read many times as a stream (no random file I/O)
- When a node fails, additional copies of its blocks are created on other nodes
- A special Name Node keeps track of how a file's blocks are stored across the different nodes
- Location designations: node, rack, data center
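The block-splitting and replication idea above can be sketched in a few lines of plain Python. This is an illustrative toy, not the real HDFS implementation or API: the round-robin placement and node names are invented for the example (real HDFS placement also considers racks and free space).

```python
# Toy sketch of HDFS-style block splitting and replication.
BLOCK_SIZE = 64 * 1024 * 1024   # 64 MB blocks, as on the slide
REPLICATION = 3                 # at least 3 copies of each block

def split_into_blocks(file_size):
    """Number of blocks needed for a file of file_size bytes."""
    return (file_size + BLOCK_SIZE - 1) // BLOCK_SIZE

def place_replicas(block_id, nodes, replication=REPLICATION):
    """Pick `replication` distinct nodes for one block (simple round-robin)."""
    return [nodes[(block_id + i) % len(nodes)] for i in range(replication)]

nodes = ["node1", "node2", "node3", "node4"]
file_size = 150 * 1024 * 1024            # a 150 MB file
blocks = split_into_blocks(file_size)    # 3 blocks
placement = {b: place_replicas(b, nodes) for b in range(blocks)}
```

The `placement` dictionary plays the role of the Name Node's metadata in the examples that follow: it records which nodes hold a copy of each block.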

HDFS Example 1

Name Node metadata:
  File     Block
  xyz.txt  Block1_N1
  xyz.txt  Block2_N1
  xyz.txt  Block1_N2
  xyz.txt  Block1_N3
  xyz.txt  Block2_N3
  xyz.txt  Block2_N4
  ...

Blocks of xyz.txt stored on each node (nodes connected over the network):
  Node 1: Block1, Block2
  Node 2: Block1
  Node 3: Block1, Block2
  Node 4: Block2

HDFS Example 2: one of the nodes fails.

Name Node metadata (unchanged so far):
  File     Block
  xyz.txt  Block1_N1
  xyz.txt  Block2_N1
  xyz.txt  Block1_N2
  xyz.txt  Block1_N3
  xyz.txt  Block2_N3
  xyz.txt  Block2_N4
  ...

Blocks of xyz.txt stored on each node:
  Node 1: Block1, Block2
  Node 2: Block1
  Node 3: Block1, Block2
  Node 4: Block2

HDFS Example 3: blocks from the failed node are replicated to the remaining nodes.

Name Node metadata (note the new entry Block1_N4):
  File     Block
  xyz.txt  Block1_N1
  xyz.txt  Block2_N1
  xyz.txt  Block1_N2
  xyz.txt  Block1_N3
  xyz.txt  Block2_N3
  xyz.txt  Block2_N4
  xyz.txt  Block1_N4
  ...

Blocks of xyz.txt stored on each node:
  Node 1: Block1, Block2
  Node 2: Block1
  Node 3: Block1, Block2
  Node 4: Block2, Block1 (new copy)

Hadoop Execution Layer: MapReduce
- The processing architecture for Hadoop: processing functions are sent to the nodes where the data reside
- Map function: mainly concerned with parsing and filtering data; collects instances of values V for each key K. This function is programmed by the developer
- Shuffle: merges the instances of { Ki, Vi }; this step is done automatically by MapReduce
- Reduce function: mainly concerned with summarizing data; summarizes the set of V for each key K. This function is programmed by the developer
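The Map, Shuffle, Reduce flow above can be shown as a minimal in-memory sketch using word counting. This is plain Python standing in for Hadoop's Java API, just to make the three stages concrete:

```python
# Minimal Map -> Shuffle -> Reduce flow, illustrated with word counting.
from collections import defaultdict

def map_fn(line):
    # Map: parse the input and emit (key, value) pairs.
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    # Shuffle: group all values by key (Hadoop does this automatically).
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return groups

def reduce_fn(key, values):
    # Reduce: summarize the set of values for one key.
    return key, sum(values)

lines = ["big data big clusters", "big data"]
pairs = [kv for line in lines for kv in map_fn(line)]
result = dict(reduce_fn(k, vs) for k, vs in shuffle(pairs).items())
# result == {"big": 3, "data": 2, "clusters": 1}
```

In real Hadoop, the map and reduce functions run on different nodes and the shuffle moves data across the network; the logic per stage is the same.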

Hadoop Scheduling Layer
- The Job Tracker writes out a plan for completing a job and then tracks its progress
- A job is broken up into independent tasks; each task is routed to a CPU close to the data (same node, then same rack, then a different rack)
- Nodes run Task Trackers that carry out the tasks required to complete a job
- When a node fails, the Job Tracker automatically restarts its tasks on a new node
- The scheduler may also distribute the same task to multiple nodes and keep the result from the node that finishes first
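The locality preference above (same node, then same rack, then any node) can be sketched as a small placement function. The node and rack names here are hypothetical, and real Hadoop schedulers weigh many more factors; this only illustrates the preference order:

```python
# Toy sketch of locality-aware task placement.
def pick_node(block_replicas, racks, free_nodes):
    """block_replicas: nodes holding the block; racks: node -> rack id;
    free_nodes: nodes with a spare task slot."""
    # 1. Prefer a free node that already stores the block (same node).
    for node in block_replicas:
        if node in free_nodes:
            return node
    # 2. Otherwise prefer a free node in the same rack as some replica.
    replica_racks = {racks[n] for n in block_replicas}
    for node in free_nodes:
        if racks[node] in replica_racks:
            return node
    # 3. Otherwise take any free node (different rack).
    return free_nodes[0]

racks = {"n1": "r1", "n2": "r1", "n3": "r2", "n4": "r2"}
# The block lives on n1 and n3, but only n2 and n4 have free slots:
node = pick_node(["n1", "n3"], racks, ["n2", "n4"])   # n2: same rack as n1
```

Moving the task rather than the data is the point: network transfer is the scarce resource, so the scheduler spends effort avoiding it.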

MapReduce Example
Goal: compare 2012 total sales with 2011 total sales, broken down by product category.
Data set: sales transaction records: Date, Product, ProductCategory, CustomerName, …, Quantity, Price
- Key: [ Year, ProductCategory ]
- Value: [ Price * Quantity ]
- Map function: for every record, form the key, multiply Price * Quantity, and assign the product to the value
- Shuffle: merge/sort all of the pairs on their common key
- Reduce function: for each key K, sum up all of the associated values V
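The sales example can be worked through in plain Python. The records below are invented sample data matching the slide's format (other fields omitted), and the stages map one-to-one onto the description above:

```python
# The sales example: key = (year, category), value = price * quantity.
from collections import defaultdict

records = [
    # (date, category, quantity, price)
    ("6/02/2011", "Electronics", 3, 130.0),
    ("7/13/2011", "Electronics", 1, 125.0),
    ("3/15/2012", "Outdoors",    4,  12.0),
    ("8/16/2012", "Outdoors",    1,  41.0),
]

# Map: form the key (year, category); the value is price * quantity.
pairs = [((date.split("/")[-1], cat), qty * price)
         for date, cat, qty, price in records]

# Shuffle: merge all values sharing a key.
groups = defaultdict(list)
for key, value in pairs:
    groups[key].append(value)

# Reduce: sum the values for each key.
totals = {key: sum(values) for key, values in groups.items()}
# totals[("2011", "Electronics")] == 515.0
```

Comparing 2012 against 2011 for a category is then just a lookup of two keys in `totals`.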

MapReduce Example (continued)

Name Node metadata for xyz.txt is as in the HDFS examples: Block1 on Nodes 1, 2, and 3; Block2 on Nodes 1, 3, and 4.

Sample records in the blocks:
  Block1:
    6/02/2011, Electronics, …, 3, $130
    7/13/2011, Electronics, …, 1, $125
    7/14/2011, Kitchen, …, 1, $65
  Block2:
    3/15/2012, Outdoors, …, 4, $12
    8/16/2012, Outdoors, …, 1, $41

Job Tracker plan:
  Job   Task  Node  Block
  J101  Ta    N1    Block1
  J101  Ta    N2    Block1
  J101  Tb    N3    Block2
  ...

Each node's Task Manager runs its assigned tasks (e.g. Node 1: Ta, Tx, Ty; Node 2: Ta, Tz; Node 3: Tb, Tz).

Common MapReduce Domains
- Indexing documents or web pages
- Counting word frequencies
- Processing log files
- ETL processing
- Image archives

Common characteristics:
- Files/blocks can be independently processed and the results easily merged
- Scales with the number of nodes, the size of the data, and the number of CPUs

Additional Apache/Hadoop Projects
- HBase: large-table NoSQL database
- Hive: data warehousing infrastructure / SQL support
- Pig: data processing scripting / MapReduce
- Oozie: workflow scheduling
- Flume: distributed log file processing
- Mahout: machine learning libraries