
HADOOP
Priyanshu Jha, A.D. Dilip, 6th IT

Map Reduce
A patented[1] software framework introduced by Google to support distributed computing on large data sets on clusters of computers. MapReduce is a framework for processing huge datasets on certain kinds of distributable problems using a large number of computers (nodes), collectively referred to as a cluster. Computational processing can occur on data stored either in a filesystem (unstructured) or within a database (structured).

“Map” step
The master node takes the input, chops it up into smaller sub-problems, and distributes those to worker nodes. A worker node may do this again in turn, leading to a multi-level tree structure. The worker node processes the smaller problem and passes the answer back to its master node.

“Reduce” Step
The master node then takes the answers to all the sub-problems and combines them to produce the output: the answer to the problem it was originally trying to solve.

Map Reduce
The advantage of MapReduce is that it allows for distributed processing of the map and reduction operations. Provided each mapping operation is independent of the others, all maps can be performed in parallel, though in practice this is limited by the data source and/or the number of CPUs near that data. Similarly, a set of 'reducers' can perform the reduction phase; all that is required is that all outputs of the map operation which share the same key are presented to the same reducer at the same time.

The Map and Reduce functions of MapReduce are both defined with respect to data structured in (key, value) pairs. Map takes one pair of data with a type in one data domain, and returns a list of pairs in a different domain:

Map(k1, v1) -> list(k2, v2)

The map function is applied in parallel to every item in the input dataset. This produces a list of (k2, v2) pairs for each call. After that, the MapReduce framework collects all pairs with the same key from all lists and groups them together, creating one group for each of the generated keys. The Reduce function is then applied in parallel to each group, which in turn produces a collection of values in the same domain:

Reduce(k2, list(v2)) -> list(v3)

Each Reduce call typically produces either one value v3 or an empty return, though one call is allowed to return more than one value. The returns of all calls are collected as the desired result list.
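
A minimal, self-contained Java sketch (illustrative only, not part of Hadoop; all class and variable names here are made up) that simulates these three phases in memory: map is applied to every input record, the intermediate pairs are grouped by key, and reduce is applied to each group.

import java.util.*;
import java.util.function.BiConsumer;

public class MiniMapReduce {
    // map: (document name, contents) -> emits one (word, 1) pair per word
    static void map(String name, String document, BiConsumer<String, Integer> emit) {
        for (String w : document.split("\\s+")) {
            emit.accept(w, 1);
        }
    }

    // reduce: (word, list of partial counts) -> total count for that word
    static int reduce(String word, List<Integer> partialCounts) {
        int sum = 0;
        for (int c : partialCounts) sum += c;
        return sum;
    }

    public static void main(String[] args) {
        Map<String, String> inputs = Map.of(
                "doc1", "the cat sat on the mat",
                "doc2", "the dog sat");

        // "Shuffle" step: collect all intermediate values under their key.
        Map<String, List<Integer>> groups = new TreeMap<>();
        inputs.forEach((name, text) ->
                map(name, text, (k, v) ->
                        groups.computeIfAbsent(k, key -> new ArrayList<>()).add(v)));

        // Reduce each group independently (on a real cluster, in parallel).
        groups.forEach((word, counts) ->
                System.out.println(word + "\t" + reduce(word, counts)));
    }
}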

Example
void map(String name, String document):
    // name: document name
    // document: document contents
    for each word w in document:
        EmitIntermediate(w, "1");

void reduce(String word, Iterator partialCounts):
    // word: a word
    // partialCounts: a list of aggregated partial counts
    int result = 0;
    for each pc in partialCounts:
        result += ParseInt(pc);
    Emit(AsString(result));

Example
Here, each document is split into words, and each word is counted initially with a "1" value by the Map function, using the word as the result key. The framework puts together all the pairs with the same key and feeds them to the same call to Reduce, so this function just needs to sum all of its input values to find the total appearances of that word.
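
As a concrete trace (the input sentence is illustrative, not from the original slides): for a single document containing "to be or not to be", Map emits (to,1) (be,1) (or,1) (not,1) (to,1) (be,1); the framework groups these into to -> [1,1], be -> [1,1], or -> [1], not -> [1]; Reduce then produces (to,2) (be,2) (or,1) (not,1).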

HADOOP
Apache Hadoop is a Java software framework that supports data-intensive distributed applications under a free license. The Apache Hadoop project develops open-source software for reliable, scalable, distributed computing.

Hadoop includes:
Hadoop Common: the common utilities.
Avro: a data serialization system with scripting-language integration.
Chukwa: a data collection system for managing large distributed systems.
HBase: a scalable, distributed database for large tables.
HDFS: a distributed file system.
Hive: data summarization and ad hoc querying.
MapReduce: distributed processing on compute clusters.
Pig: a high-level data-flow language for parallel computation.
ZooKeeper: a coordination service for distributed applications.

HDFS File System
The HDFS filesystem stores large files (an ideal file size is a multiple of 64 MB[10]) across multiple machines. It achieves reliability by replicating the data across multiple hosts, and hence does not require RAID storage on hosts. With the default replication value, 3, data is stored on three nodes: two on the same rack, and one on a different rack.
The filesystem is built from a cluster of data nodes, each of which serves up blocks of data over the network using a block protocol specific to HDFS. They also serve the data over HTTP, allowing access to all content from a web browser or other client. Data nodes can talk to each other to rebalance data, to move copies around, and to keep the replication of data high.
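
A small hedged sketch of how a client program could observe this through the Hadoop FileSystem API (the HDFS path below is hypothetical):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsBlockInfo {
    public static void main(String[] args) throws Exception {
        // Reads the cluster settings (default filesystem, replication, ...) from the configuration on the classpath.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/user/demo/bigfile.dat");   // hypothetical file
        FileStatus status = fs.getFileStatus(file);

        System.out.println("Replication factor: " + status.getReplication());
        System.out.println("Block size (bytes): " + status.getBlockSize());

        // Each block is stored on several data nodes; list the hosts holding each block.
        for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.println("Block at offset " + block.getOffset()
                    + " on hosts " + String.join(", ", block.getHosts()));
        }
        fs.close();
    }
}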

A filesystem requires one unique server, the name node. This is a single point of failure for an HDFS installation. If the name node goes down, the filesystem is offline. When it comes back up, the name node must replay all outstanding operations. This replay process can take over half an hour for a big cluster.[11] The filesystem includes what is called a Secondary Namenode, which misleads some people into thinking that when the primary Namenode goes offline, the Secondary Namenode takes over. In fact, the Secondary Namenode regularly connects with the primary Namenode, downloads a snapshot of its directory information, and saves it to a local directory. This Secondary Namenode is used together with the edit log of the primary Namenode to create an up-to-date directory structure.

Another limitation of HDFS is that it cannot be directly mounted by an existing operating system. Getting data into and out of the HDFS file system, an action that often needs to be performed before and after executing a job, can be inconvenient. A Filesystem in Userspace (FUSE) driver has been developed to address this problem, at least for Linux and some other Unix systems.
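
In practice, data is usually moved in and out with the Hadoop client API or command-line tools rather than through a mounted filesystem. A minimal Java sketch (the file paths are illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsCopy {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // Copy a local input file into HDFS before running a job...
        fs.copyFromLocalFile(new Path("/tmp/input.txt"), new Path("/user/demo/input.txt"));

        // ...and copy a result file back to local disk afterwards.
        fs.copyToLocalFile(new Path("/user/demo/output/part-00000"), new Path("/tmp/result.txt"));

        fs.close();
    }
}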

Replicating data three times is costly. To alleviate this cost, recent versions of HDFS have erasure coding support, whereby multiple blocks of the same file are combined to generate a parity block. HDFS creates parity blocks asynchronously and then decreases the replication factor of the file from 3 to 2. Studies have shown that this technique decreases the physical storage requirement from a factor of 3 to a factor of around 2.2.
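
As a rough, illustrative calculation of where a figure near 2.2 can come from (the exact numbers depend on the erasure-coding scheme, which the slide does not specify): if a parity block is computed over a stripe of 10 data blocks, and both the data blocks and the parity block are kept at replication 2, then 10 blocks of user data occupy 10 x 2 + 1 x 2 = 22 block copies, i.e. a storage factor of 2.2 instead of 3.0.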

Word Count Example
Read text files and count how often words occur.
The input is text files.
The output is a text file; each line: word, tab, count.
Map: produce pairs of (word, count).
Reduce: for each word, sum up the counts.

Map Class
public static class Map extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output,
                    Reporter reporter) throws IOException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            output.collect(word, one);   // emit (word, 1) for every token
        }
    }
}

public static class Reduce extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output,
                       Reporter reporter) throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();   // add up the partial counts for this word
        }
        output.collect(key, new IntWritable(sum));
    }
}

public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");

    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    conf.setMapperClass(Map.class);
    conf.setCombinerClass(Reduce.class);   // the reducer also serves as a combiner
    conf.setReducerClass(Reduce.class);

    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    JobClient.runJob(conf);
}
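
Assuming the Map class, Reduce class, and this main method live in a single WordCount class compiled into a jar (the jar name below is illustrative), such a job is typically launched with the hadoop command-line tool, for example: hadoop jar wordcount.jar WordCount /user/demo/input /user/demo/output, where the first path is an HDFS directory of input text files and the second is an output directory that must not exist yet.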

HDFS Limitations
“Almost” GFS (Google FS).
No file update operations (record append, etc.); all files are write-once.
Does not implement demand replication.
Designed for streaming; random seeks devastate performance.