MIT 802 Introduction to Data Platforms and Sources Lecture 2

Willem S. van Heerden

MapReduce
  MapReduce is a programming model
    Different implementations exist (e.g. Hadoop MapReduce)
    Processes big data sets in parallel over many nodes
  Composed of two procedures
    Map procedure
      Performs filtering and sorting
      Splits the task up into components
      Maps inputs into intermediate key-value pairs
    Reduce procedure
      Performs a summary operation
      Produces a result
      Combines intermediate key-value pairs into final results
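
The two procedures can be sketched in plain Python on a single machine. This is only an illustration of the model, not any framework's API: the names map_reduce, map_fn, reduce_fn, word_lengths and total are hypothetical, and a real implementation would spread the work over many nodes.

```python
from collections import defaultdict


def map_reduce(inputs, map_fn, reduce_fn):
    """Single-process imitation of the MapReduce model (illustrative only)."""
    # Map: turn every input into zero or more intermediate (key, value) pairs.
    intermediate = []
    for item in inputs:
        intermediate.extend(map_fn(item))

    # Group the intermediate values by key.
    grouped = defaultdict(list)
    for key, value in intermediate:
        grouped[key].append(value)

    # Reduce: combine the values of each key into one final result.
    return {key: reduce_fn(key, values) for key, values in grouped.items()}


def word_lengths(line):
    # Hypothetical mapper: the key is a word length, the value is 1.
    return [(len(word), 1) for word in line.split()]


def total(key, values):
    # Hypothetical reducer: a summary operation (here, a sum).
    return sum(values)


result = map_reduce(["Hello World", "Bye World"], word_lengths, total)
```

Here result maps each word length to the number of words of that length. Swapping in a different map_fn and reduce_fn changes the job without touching the surrounding machinery, which is the point of the model.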

MapReduce: A simple example
  We have a set of documents
  We want to count the occurrences of each word
  Strategy
    The framework creates a series of map nodes
    Documents (or parts of documents) are given to the map nodes
    Each map node
      Processes its document (or part of a document)
      Produces intermediate key-value pairs
        Key: a word occurring in the document
        Value: count of the word (1, because each pair records a single occurrence)
      Stores its intermediate results locally on the node

MapReduce: map phase (diagram)
  Map node 1
    Document: Hello World Bye World
    Key-value pairs: (Hello, 1) (World, 1) (Bye, 1) (World, 1)
  Map node 2
    Document: Hello Test Hello Bye
    Key-value pairs: (Hello, 1) (Test, 1) (Hello, 1) (Bye, 1)
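
The map step for this example can be sketched in a few lines of Python. This is a minimal single-machine sketch, assuming words are separated by whitespace; the function name map_document is hypothetical, and each list entry stands in for the output of one map node.

```python
def map_document(document):
    """Emit one (word, 1) pair per word occurrence in the document."""
    return [(word, 1) for word in document.split()]


# Each document is handed to its own (simulated) map node.
documents = ["Hello World Bye World", "Hello Test Hello Bye"]
intermediate = [map_document(doc) for doc in documents]
```

Note that a word occurring twice in a document yields two separate pairs; no totalling happens at this stage.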

MapReduce
  Strategy (continued)
    The framework creates a series of reduce nodes
    All key-value pairs with the same key are gathered together
      These key-value pairs are fed to the same reduce node
      This is referred to as shuffling
    Each reduce node
      Totals the values of its key-value pairs
      Outputs the key with the total count as the final result

MapReduce: shuffle and reduce phase (diagram)
  Reduce node 1: (Bye, 1) (Bye, 1) gives (Bye, 2)
  Reduce node 2: (Hello, 1) (Hello, 1) (Hello, 1) gives (Hello, 3)
  Reduce node 3: (Test, 1) gives (Test, 1)
  Reduce node 4: (World, 1) (World, 1) gives (World, 2)
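
Shuffling and reducing can be sketched the same way. This is a single-machine illustration, assuming the intermediate pairs of both map nodes have already been collected into one list; the names shuffle and reduce_counts are hypothetical.

```python
from collections import defaultdict


def shuffle(pairs):
    """Gather all values that share a key, as if routing them to one reduce node."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def reduce_counts(key, values):
    """Total the values for one key and return the final (key, total) pair."""
    return key, sum(values)


# Intermediate pairs from both map nodes in the word-count example.
pairs = [
    ("Hello", 1), ("World", 1), ("Bye", 1), ("World", 1),
    ("Hello", 1), ("Test", 1), ("Hello", 1), ("Bye", 1),
]
grouped = shuffle(pairs)
totals = dict(reduce_counts(key, values) for key, values in grouped.items())
```

Each entry of grouped corresponds to the input of one reduce node, and totals holds the final word counts.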

Apache Hadoop
  Open-source software framework
    Used for distributed storage and processing of big data
    Designed to work on commodity hardware
    Consists of computer clusters
  Components
    Core ecosystem
    Query engines
    External data storage

Core Hadoop Ecosystem (diagram)
  Components shown: Zookeeper, Oozie, Pig, Hive, Data Ingestion (Sqoop, Flume, Kafka), MapReduce, Tez, Spark, HBase, Apache Storm, YARN, Mesos, HDFS

HDFS
  Hadoop Distributed File System
  Java-based file system
  Scalable and reliable data storage
  Primarily consists of
    A NameNode that manages the file system metadata
    DataNodes that store the actual data
  Breaks data into smaller pieces (blocks) stored over the nodes
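
The split between metadata and data can be illustrated with a toy model. This sketch is not the HDFS API: the function names, the tiny block size, the replication factor of 2 and the round-robin placement are all assumptions made for illustration (real HDFS uses much larger blocks and a rack-aware placement policy).

```python
def split_into_blocks(data, block_size):
    """Break a byte string into fixed-size blocks; the last one may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_blocks(num_blocks, datanodes, replication):
    """Return NameNode-style metadata: block index -> DataNodes holding a copy.

    Round-robin placement is an assumption made to keep the sketch short.
    """
    placement = {}
    for block_id in range(num_blocks):
        placement[block_id] = [
            datanodes[(block_id + r) % len(datanodes)] for r in range(replication)
        ]
    return placement


data = b"x" * 10
blocks = split_into_blocks(data, block_size=4)
metadata = place_blocks(len(blocks), ["dn1", "dn2", "dn3"], replication=2)
```

The metadata dictionary plays the role of the NameNode (it knows where every block lives), while the blocks themselves would be stored on the DataNodes it names.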

YARN
  Yet Another Resource Negotiator
  Manages computing resources on the cluster
  Schedules user applications for resource use
  Consists of
    A global ResourceManager
    A per-application ApplicationMaster
  Mesos is an alternative to YARN

Hadoop MapReduce
  Hadoop implementation of MapReduce
  Uses mapper and reducer methods
  Resilient to failure
    The application master monitors mappers and reducers
    Will restart a task on a new node if a failure is detected
  Written in Java
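
The restart-on-failure behaviour can be pictured with a small simulation. This is a hedged sketch of the idea only, not how Hadoop implements it: run_with_retries, flaky_task, the node names and the limit of three attempts are all hypothetical.

```python
def run_with_retries(task, nodes, max_attempts=3):
    """Try the task on successive nodes until one attempt succeeds."""
    last_error = None
    for attempt in range(max_attempts):
        node = nodes[attempt % len(nodes)]
        try:
            return node, task(node)
        except RuntimeError as error:
            # A failure was detected: remember it and move on to the next node.
            last_error = error
    raise RuntimeError(f"task failed after {max_attempts} attempts") from last_error


def flaky_task(node):
    # Hypothetical task that fails on node "n1" only.
    if node == "n1":
        raise RuntimeError("node n1 is down")
    return "done"


node, status = run_with_retries(flaky_task, ["n1", "n2", "n3"])
```

In this run the first attempt fails on n1 and the task completes on n2, which mirrors a task being rescheduled on a new node.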

Hadoop MapReduce Example
  Let's look at our previous simple example
  One class (WordCount) encapsulates
    Mapper class (TokenizerMapper)
    Reducer class (IntSumReducer)
    Main method
  First, the mapper class (TokenizerMapper)
    Contains the map method
      Receives text as input (value)
      Tokenizes value to get individual words
      Writes each word and a count of 1 as a key-value pair

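import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        // Emits (word, 1) for every token in the input text.
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }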

Let's look at our previous simple example
  Second, the reducer class (IntSumReducer)
    Contains the reduce method
      Receives a textual key (key)
      Receives an iterable structure of values (values)
      Iterates over values, adding to sum
      Writes the key and sum as a result

    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

        private IntWritable result = new IntWritable();

        // Totals all counts received for one key and writes (key, sum).
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

Let's look at our previous simple example
  Finally, the main function
    Sets the mapper class using setMapperClass
    Sets the reducer class using setReducerClass
    Sets a combiner class using setCombinerClass
      Performs local aggregation of the intermediate outputs
      Helps cut down the amount of data transferred from mapper to reducer
      In this case, the same as the reducer class

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);  // local aggregation on the map side
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // input path
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output path
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}  // end of class WordCount
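
The effect of the combiner can be illustrated outside Hadoop. The Python sketch below is a hedged, single-machine illustration with hypothetical function names (map_document, combine); it shows how local aggregation shrinks the output of one map node before it is transferred to the reducers.

```python
from collections import Counter


def map_document(document):
    """Emit one (word, 1) pair per word occurrence."""
    return [(word, 1) for word in document.split()]


def combine(pairs):
    """Locally total the counts produced by a single map node."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return list(totals.items())


pairs = map_document("Hello Test Hello Bye")  # 4 pairs leave the mapper
combined = combine(pairs)                     # 3 pairs leave the node
```

Reusing the reducer as the combiner works for word counting because summing partial sums gives the same total; that would not hold for every reduce operation (an average, for instance).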

Hadoop MapReduce Streaming
  Allows interfacing of MapReduce with other languages
  For example, mrjob and pydoop in Python
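
With Hadoop Streaming, the mapper and reducer are ordinary programs that read lines from standard input and write tab-separated key-value lines to standard output. The sketch below models that contract with two generator functions so that it can be tried locally; the names mapper and reducer and the local sorted() call (standing in for the framework's sort between the two phases) are assumptions of this illustration.

```python
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' line per word occurrence."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reducer(lines):
    """Total the counts per word; the input lines must already be sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


# Local simulation: sorted() plays the part of the sort between map and reduce.
mapped = sorted(mapper(["Hello World Bye World"]))
reduced = list(reducer(mapped))
```

In an actual streaming job, each function would be placed in its own script that iterates over sys.stdin and prints every yielded line.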

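Python mrjob Example

from mrjob.job import MRJob
import re

# Matches runs of word characters and apostrophes.
WORD_RE = re.compile(r"[\w']+")


class MRWordFreqCount(MRJob):

    def mapper(self, _, line):
        # Emit (word, 1) for every word in the line.
        for word in WORD_RE.findall(line):
            yield word, 1

    def combiner(self, word, counts):
        # Local aggregation on the map side.
        yield word, sum(counts)

    def reducer(self, word, counts):
        # Final total per word.
        yield word, sum(counts)


if __name__ == '__main__':
    MRWordFreqCount.run()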

Apache Pig and Hive
  Pig
    High-level platform
    Used to create programs that run on Hadoop
    High-level scripting language called Pig Latin
    Abstracts away from MapReduce
    Compiles to MapReduce
  Hive
    Has a similar objective to Pig
    Data set file looks and acts like a relational database
    SQL-like syntax

Apache Pig Example

input_lines = LOAD 'input.txt' AS (line:chararray);
words = FOREACH input_lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
filtered_words = FILTER words BY word MATCHES '\\w+';
word_groups = GROUP filtered_words BY word;
word_count = FOREACH word_groups GENERATE COUNT(filtered_words) AS count, group AS word;
ordered_word_count = ORDER word_count BY count DESC;
STORE ordered_word_count INTO 'output.txt';
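
For readers more familiar with Python, the same pipeline can be approximated step by step. This is a rough single-machine analogue, not a translation that Pig produces: the function name word_count is hypothetical, and splitting on whitespace only approximates Pig's TOKENIZE, which also splits on some punctuation.

```python
import re
from collections import Counter


def word_count(lines):
    # TOKENIZE + FLATTEN: one word per element (whitespace split is an approximation).
    words = [word for line in lines for word in line.split()]
    # FILTER ... MATCHES '\\w+': keep words made up entirely of word characters.
    filtered = [word for word in words if re.fullmatch(r"\w+", word)]
    # GROUP ... BY word + COUNT: occurrences per distinct word.
    counts = Counter(filtered)
    # ORDER ... BY count DESC: (count, word) tuples, highest count first.
    return sorted(
        ((count, word) for word, count in counts.items()),
        key=lambda pair: pair[0],
        reverse=True,
    )


ordered = word_count(["Hello World Bye World", "Hello Test Hello Bye"])
```

Each Python statement corresponds to one Pig Latin relation, which shows how Pig expresses a whole MapReduce job as a short sequence of data transformations.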

Apache HBase
  Open-source, non-relational, distributed database
  Runs on HDFS
  Fault-tolerant way of storing large quantities of data
  Supports database compression
  Tables in HBase can serve as
    Input to MapReduce programs
    Output from MapReduce programs

Apache Storm
  Distributed stream processing framework
  Processes streaming data
  In general the topology is similar to MapReduce
  Main difference is that data is processed in real time
    As opposed to batch processing
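
The contrast with batch processing can be sketched in Python. This does not use Storm's API; running_word_counts is a hypothetical generator that updates its result as each line arrives, instead of waiting for the whole data set as a batch job would.

```python
from collections import Counter


def running_word_counts(stream):
    """Yield an updated snapshot of the word counts after every incoming line."""
    counts = Counter()
    for line in stream:
        counts.update(line.split())
        yield dict(counts)


# A list stands in for an unbounded stream of incoming lines.
snapshots = list(running_word_counts(["Hello World", "Bye World"]))
```

A result is available after every line, so a consumer never has to wait for the end of the input, which for a true stream never comes.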

Data Ingestion
  Sqoop
    Tool for efficiently transferring bulk data between Hadoop and structured datastores (e.g. relational databases)
  Flume
    Used for efficiently collecting, aggregating and moving large amounts of log data
  Kafka
    Distributed streaming platform that transfers data between systems or applications

External Data Storage
  Cassandra
    Open-source distributed NoSQL database management system designed to handle large amounts of data
  MySQL
    Open-source relational database management system
  MongoDB
    Open-source, cross-platform, document-oriented NoSQL database
