MIT 802 Introduction to Data Platforms and Sources Lecture 2

MIT 802 Introduction to Data Platforms and Sources Lecture 2 Willem S. van Heerden wvheerden@cs.up.ac.za

MapReduce
- MapReduce is a programming model
  - Processes big data sets in parallel over many nodes
  - Different implementations exist (e.g. Hadoop MapReduce)
- Composed of two procedures
  - Map procedure
    - Performs filtering and sorting
    - Splits the task up into components
    - Maps inputs into intermediate key-value pairs
  - Reduce procedure
    - Performs a summary operation
    - Produces a result
    - Combines intermediate key-value pairs into final results

MapReduce
- A simple example
  - We have a set of documents
  - We want to count the occurrences of each word
- Strategy
  - The framework creates a series of mapper nodes
  - Documents (or parts of documents) are given to the mapper nodes
  - Each mapper node
    - Processes its document (or part of a document)
    - Produces intermediate key-value pairs
      - Key: a word occurring in the document
      - Value: the count of the word (1, because each pair records one occurrence)
    - Stores its intermediate results locally on the node

MapReduce (illustration of the map phase)
- Mapper node 1
  - Document: Hello World Bye World
  - Key-value pairs: (Hello, 1), (World, 1), (Bye, 1), (World, 1)
- Mapper node 2
  - Document: Hello Test Hello Bye
  - Key-value pairs: (Hello, 1), (Test, 1), (Hello, 1), (Bye, 1)
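
The map step in the illustration can be imitated in a few lines of plain Python. This is only a single-process sketch of the idea, not Hadoop code; the function name `map_document` and the node labels are made up for the illustration.

```python
# Toy, single-process sketch of the map phase; no Hadoop is involved.

def map_document(document):
    """Emit one (word, 1) pair per word occurrence, in document order."""
    return [(word, 1) for word in document.split()]


# The two documents from the illustration, one per (hypothetical) mapper node.
documents = {
    "mapper node 1": "Hello World Bye World",
    "mapper node 2": "Hello Test Hello Bye",
}

# Each mapper works only on its own document.
intermediate = {node: map_document(doc) for node, doc in documents.items()}

for node, pairs in intermediate.items():
    print(node, pairs)
```

Note that a mapper does not add anything up: a word that occurs twice in a document simply produces two pairs.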

MapReduce
- Strategy
  - The framework creates a series of reducer nodes
  - All key-value pairs with the same key are gathered together
    - These key-value pairs are fed to the same reducer node
    - This is referred to as shuffling
  - The key-value pairs arriving at a reducer node are ordered by key
    - Pairs with the same key are thereby grouped together
    - This is referred to as sorting
  - Finally, each reducer node processes its key-value pairs
    - Totals the values for the key-value pairs
    - Outputs the key with the total count as the final result
    - This is referred to as reducing

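MapReduce (illustration of the shuffle and reduce phases)
- Key-value pairs from the mapper nodes
  - Mapper node 1: (Hello, 1), (World, 1), (Bye, 1), (World, 1)
  - Mapper node 2: (Hello, 1), (Test, 1), (Hello, 1), (Bye, 1)
- Reducer node 1: receives (Bye, 1), (Bye, 1) and outputs (Bye, 2)
- Reducer node 2: receives (Hello, 1), (Hello, 1), (Hello, 1) and outputs (Hello, 3)
- Reducer node 3: receives (Test, 1) and outputs (Test, 1)
- Reducer node 4: receives (World, 1), (World, 1) and outputs (World, 2)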
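
Shuffling, sorting and reducing can be imitated in plain Python as well. The sketch below starts from the mapper outputs for the two example documents ("Hello World Bye World" and "Hello Test Hello Bye"). The helper names `shuffle` and `reduce_counts` are made up, and a real framework would distribute the keys over several reducer nodes instead of looping in one process.

```python
from collections import defaultdict

# Output of the two mapper nodes for the example documents.
mapper_outputs = [
    [("Hello", 1), ("World", 1), ("Bye", 1), ("World", 1)],
    [("Hello", 1), ("Test", 1), ("Hello", 1), ("Bye", 1)],
]


def shuffle(outputs):
    """Gather all values that share a key, regardless of which mapper sent them."""
    groups = defaultdict(list)
    for pairs in outputs:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_counts(key, values):
    """Total the values for one key."""
    return key, sum(values)


grouped = shuffle(mapper_outputs)
# Sorting the keys stands in for the sort step; each key is reduced once.
results = [reduce_counts(key, grouped[key]) for key in sorted(grouped)]
print(results)
```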

Apache Hadoop
- Open-source software framework
  - Used for distributed storage and processing of big data
  - Designed to work on commodity hardware
  - Runs on computer clusters
- Components
  - Core ecosystem
  - Query engines
  - External data storage
- A Hadoop cluster consists of
  - At least one master node
  - Multiple slave or worker nodes

Core Hadoop Ecosystem (components shown in the diagram)
- Zookeeper, Oozie
- Pig, Hive
- Data ingestion: Sqoop, Flume, Kafka
- MapReduce, Tez, Spark
- HBase, Apache Storm
- YARN, Mesos
- HDFS

HDFS
- Hadoop Distributed File System
- Java-based file system
- Scalable and reliable data storage
- Breaks data into smaller pieces stored over nodes
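
To make "breaks data into smaller pieces stored over nodes" concrete, here is a toy Python sketch. It is not how HDFS is implemented: the tiny block size, the node names and the round-robin placement are assumptions chosen for illustration (real HDFS uses much larger blocks and its own placement policy).

```python
def split_into_blocks(data, block_size):
    """Cut a byte string into consecutive blocks of at most block_size bytes."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks, nodes, replication):
    """Assign each block to `replication` distinct nodes, round-robin (an assumption)."""
    return {
        block_id: [nodes[(block_id + r) % len(nodes)] for r in range(replication)]
        for block_id in range(num_blocks)
    }


data = b"Hello World Bye World"                    # 21 bytes
blocks = split_into_blocks(data, block_size=8)     # hypothetical 8-byte blocks
nodes = ["node-a", "node-b", "node-c", "node-d"]   # hypothetical worker nodes
placement = place_replicas(len(blocks), nodes, replication=3)

for block_id, holders in placement.items():
    print(block_id, blocks[block_id], holders)
```

Because every block lives on several nodes, losing one node does not lose any data.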

HDFS
- Hadoop Distributed File System
- Primarily consists of
  - A NameNode
    - Resides on the master node
    - Manages file system metadata, primarily the file system index
    - May be a standalone server in large clusters
    - There may also be a secondary NameNode that snapshots the NameNode's memory
  - DataNodes
    - Reside on worker nodes
    - Store the actual data
- DataNodes send heartbeats to the NameNode
  - A DataNode is considered dead if the NameNode doesn't receive its heartbeat
  - The NameNode then replicates the dead DataNode's blocks on another DataNode
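
The heartbeat mechanism can be sketched as a small simulation. Everything here is hypothetical: the class `ToyNameNode`, the timeout of 10 time units and the node and block names are invented for the example, and the real NameNode is considerably more involved.

```python
class ToyNameNode:
    """Toy model of heartbeat tracking and re-replication (not the HDFS API)."""

    def __init__(self, timeout, replication):
        self.timeout = timeout          # time without a heartbeat before a node counts as dead
        self.replication = replication  # desired number of replicas per block
        self.last_heartbeat = {}        # DataNode name -> time of last heartbeat
        self.block_locations = {}       # block id -> set of DataNode names

    def heartbeat(self, node, now):
        self.last_heartbeat[node] = now

    def add_block(self, block_id, nodes):
        self.block_locations[block_id] = set(nodes)

    def check_datanodes(self, now):
        """Drop dead DataNodes and re-replicate their blocks on live ones."""
        live = {node for node, seen in self.last_heartbeat.items()
                if now - seen <= self.timeout}
        for holders in self.block_locations.values():
            holders.intersection_update(live)
            if not holders:
                continue  # no surviving replica to copy from
            for node in sorted(live - holders):
                if len(holders) >= self.replication:
                    break
                holders.add(node)
        return live


nn = ToyNameNode(timeout=10, replication=2)
for name in ("dn1", "dn2", "dn3"):
    nn.heartbeat(name, now=0)
nn.add_block("blk_1", ["dn1", "dn2"])

# dn1 goes silent; dn2 and dn3 keep reporting.
nn.heartbeat("dn2", now=8)
nn.heartbeat("dn3", now=8)
live = nn.check_datanodes(now=15)
print(live, nn.block_locations)
```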

YARN
- Yet Another Resource Negotiator
- Manages computing resources on the cluster
- Schedules user applications for resource use
  - An application is a single job or a set of jobs organised in a graph
- Consists of
  - A global ResourceManager
    - Arbitrates resources between all applications
  - A per-application ApplicationMaster
    - Negotiates resources from the ResourceManager
  - A per-machine NodeManager
    - Works with the ApplicationMaster to execute and monitor tasks
- Mesos is an alternative to YARN
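
The idea of a global arbiter handing out resources can be sketched as follows. This is a deliberately simplified toy and not the YARN API: the class `ToyResourceManager`, the first-fit policy, the memory figures and the application names are all assumptions made for the illustration.

```python
class ToyResourceManager:
    """Toy arbiter that grants memory on worker machines to applications."""

    def __init__(self, node_memory_mb):
        # Free memory per machine, as its (hypothetical) node manager would report it.
        self.free_mb = dict(node_memory_mb)

    def allocate(self, app_id, memory_mb):
        """Grant a container on the first machine with enough free memory, else None."""
        for node in sorted(self.free_mb):
            if self.free_mb[node] >= memory_mb:
                self.free_mb[node] -= memory_mb
                return (app_id, node, memory_mb)
        return None

    def release(self, container):
        """Return a finished container's memory to its machine."""
        _, node, memory_mb = container
        self.free_mb[node] += memory_mb


rm = ToyResourceManager({"node-a": 4096, "node-b": 2048})

# The application master of app-1 negotiates three containers.
granted = [rm.allocate("app-1", 2048) for _ in range(3)]

# The cluster is now full, so app-2 has to wait ...
denied = rm.allocate("app-2", 2048)

# ... until app-1 releases a container.
rm.release(granted[0])
retry = rm.allocate("app-2", 2048)
print(granted, denied, retry)
```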

YARN

Hadoop MapReduce
- Hadoop implementation of MapReduce
- Uses mapper and reducer methods
- Resilient to failure
  - The ApplicationMaster monitors mappers and reducers
  - Will restart a task on a new node if a failure is detected
- Written in Java
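
Restarting a failed task on a new node can be pictured with a small retry loop. The sketch is plain Python under stated assumptions: `run_with_retries`, the node names and the simulated failure are invented, and Hadoop's actual failure handling is not shown here.

```python
def run_with_retries(task, nodes):
    """Try the task on each node in turn until one attempt succeeds."""
    for attempt, node in enumerate(nodes, start=1):
        try:
            return task(node), node, attempt
        except RuntimeError:
            continue  # failure detected: move on to the next node
    raise RuntimeError("task failed on every available node")


def count_words(node):
    """A mapper-like task that fails on one (hypothetical) broken node."""
    if node == "node-a":
        raise RuntimeError("simulated node failure")
    return len("Hello World Bye World".split())


result, node, attempts = run_with_retries(count_words, ["node-a", "node-b", "node-c"])
print(result, node, attempts)
```

Re-running is safe here because the task has no side effects: the same input always gives the same output, whichever node runs it.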

Hadoop MapReduce Example
- Let's look at our previous simple example
- One class (WordCount) encapsulates
  - Mapper class (TokenizerMapper)
  - Reducer class (IntSumReducer)
  - Main method
- First, the mapper class (TokenizerMapper)
  - Contains the map method
    - Receives text as input (value)
    - Tokenizes value to get individual words
    - Also receives a context (context)
      - Used to interact with the rest of the Hadoop system
      - Provides an interface for output
    - Writes each word and a count of 1 as a key-value pair to context

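Hadoop MapReduce Example

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    // Emits (word, 1) for every token in the input text.
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // The WordCount class continues with the reducer class and the main method.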

Hadoop MapReduce Example
- Second, the reducer class (IntSumReducer)
  - Contains the reduce method
    - Receives a textual key (key)
    - Receives an iterable structure of values (values)
    - Iterates over values, adding to sum
    - Also receives a context (context)
    - Writes the key and sum as a result to the context

Hadoop MapReduce Example

  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable result = new IntWritable();

    // Sums all counts received for one word and writes (word, total).
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

Hadoop MapReduce Example
- Finally, the main function
  - Sets up details related to the job
    - The job's general configuration
    - Job input and output details
  - Sets the mapper class using setMapperClass
  - Sets the reducer class using setReducerClass
  - Sets a combiner class using setCombinerClass
    - Performs local aggregation of the intermediate outputs
    - Helps cut down the amount of data transferred from mapper to reducer
    - In this case, the same as the reducer class

Hadoop MapReduce Example

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    // args[0] is the input path, args[1] the output path.
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
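
The effect of the combiner set with setCombinerClass can be illustrated outside Hadoop. The following plain-Python sketch (the helper names `map_words` and `combine` are made up) aggregates one mapper's output locally before anything would be sent to a reducer.

```python
from collections import Counter


def map_words(document):
    """Mapper: one (word, 1) pair per word occurrence."""
    return [(word, 1) for word in document.split()]


def combine(pairs):
    """Combiner: sum the counts per word on the mapper's own node."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return sorted(totals.items())


mapped = map_words("Hello World Bye World")
combined = combine(mapped)

print(len(mapped), "pairs without a combiner:", mapped)
print(len(combined), "pairs with a combiner:", combined)
```

The reducer class can double as the combiner in the word count because addition is associative and commutative: summing partial sums gives the same total as summing the individual ones.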

Hadoop MapReduce Streaming
- Allows interfacing of MapReduce with other languages
- For example, mrjob and pydoop in Python
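
With Hadoop Streaming, the mapper and the reducer are ordinary programs that read lines of text from standard input and write lines to standard output, with key and value separated by a tab by default. The sketch below mimics that contract with two Python generator functions so that it can be run without a cluster; in an actual streaming job each function would be fed `sys.stdin` in its own script, which is not shown here.

```python
from itertools import groupby


def mapper(lines):
    """Turn each input line into 'word<TAB>1' output lines."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reducer(sorted_lines):
    """Sum the counts of consecutive lines that share a key.

    Relies on the input being sorted by key, which the framework
    guarantees for the reducer's input.
    """
    parsed = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(parsed, key=lambda pair: pair[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


map_output = list(mapper(["Hello World Bye World", "Hello Test Hello Bye"]))
# sorted() stands in for the framework's shuffle and sort between the two steps.
reduce_output = list(reducer(sorted(map_output)))
print(reduce_output)
```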

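Python mrjob Example

from mrjob.job import MRJob
import re

# Matches runs of word characters and apostrophes.
WORD_RE = re.compile(r"[\w']+")


class MRWordFreqCount(MRJob):

    def mapper(self, _, line):
        # Emit (word, 1) for every word in the input line.
        for word in WORD_RE.findall(line):
            yield word, 1

    def combiner(self, word, counts):
        # Local aggregation on the mapper's node.
        yield word, sum(counts)

    def reducer(self, word, counts):
        # Final total per word.
        yield word, sum(counts)


if __name__ == '__main__':
    MRWordFreqCount.run()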

Apache Pig and Hive
- Apache Pig
  - High-level platform
  - Used to create programs that run on Hadoop
  - High-level scripting language called Pig Latin
  - Abstracts away from MapReduce
  - Compiles to MapReduce
- Hive has a similar objective to Pig
  - A data set file looks and acts like a relational database
  - SQL-like syntax

Apache Pig Example

input_lines = LOAD 'input.txt' AS (line:chararray);
words = FOREACH input_lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
filtered_words = FILTER words BY word MATCHES '\\w+';
word_groups = GROUP filtered_words BY word;
word_count = FOREACH word_groups GENERATE COUNT(filtered_words) AS count, group AS word;
ordered_word_count = ORDER word_count BY count DESC;
STORE ordered_word_count INTO 'output.txt';
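
For readers who do not know Pig Latin, the same pipeline can be traced in plain Python. This is a rough analogue and not what Pig executes: it assumes that splitting on whitespace is close enough to TOKENIZE for this input, and the tie-break by word in the final ordering is an addition to make the output deterministic.

```python
import re
from collections import Counter


def word_count(lines):
    # FOREACH ... FLATTEN(TOKENIZE(line)): one word per row (whitespace split assumed).
    words = [word for line in lines for word in line.split()]
    # FILTER ... MATCHES '\\w+': keep rows that consist entirely of word characters.
    filtered = [word for word in words if re.fullmatch(r"\w+", word)]
    # GROUP ... BY word, then COUNT per group.
    counts = Counter(filtered)
    # ORDER ... BY count DESC, as (count, word) rows; ties broken by word.
    return sorted(((n, w) for w, n in counts.items()), key=lambda row: (-row[0], row[1]))


rows = word_count(["Hello World Bye World", "Hello Test Hello Bye"])
print(rows)
```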

Apache HBase
- Open-source, non-relational, distributed database
- Runs on HDFS
- Fault-tolerant way of storing large quantities of data
- Supports database compression
- Tables in HBase can serve as
  - Input to MapReduce programs
  - Output from MapReduce programs

Apache Storm
- Distributed stream processing framework
- Processes streaming data
- Implemented in Clojure
- General framework structure is similar to MapReduce
  - Main difference is that data is processed in real time
  - As opposed to batch processing
  - Thus Storm topologies run indefinitely, not just for a set of jobs
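
The contrast with batch processing can be shown with a small generator-based sketch in Python. It has nothing to do with the Storm API: `sentence_stream` and `running_counts` are invented names, and the endless generator merely stands in for a data source that never finishes.

```python
from collections import Counter
from itertools import islice


def sentence_stream():
    """An unbounded source: keeps producing sentences forever."""
    sentences = ["Hello World", "Bye World"]
    while True:
        yield from sentences


def running_counts(stream):
    """Update the word counts after every sentence instead of once at the end."""
    counts = Counter()
    for sentence in stream:
        counts.update(sentence.split())
        yield dict(counts)


# The computation never terminates by itself, so only the first three updates are taken.
updates = list(islice(running_counts(sentence_stream()), 3))
for update in updates:
    print(update)
```

A batch job would wait for the complete input and emit one final result; here a fresh result is available after every incoming item.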

Data Ingestion
- Sqoop
  - Tool for efficiently transferring bulk data between Hadoop and structured datastores (e.g. relational databases)
  - Command-line interface
- Flume
  - Tool for efficiently processing log data
    - Collection, aggregation and moving
  - Processes large amounts of log data
  - Based on streaming data flows
- Kafka
  - Distributed streaming platform
  - Transfers data between systems or applications

External Data Storage
- Cassandra
  - Open source
  - Distributed NoSQL database management system
  - Designed to handle large amounts of data
- MySQL
  - Open source relational database management system
- MongoDB
  - Open source and cross-platform
  - Document-oriented database
  - NoSQL