Map/Reduce Programming Model


1 Map/Reduce Programming Model
Ahmed Abdelsadek

2 Outline Introduction What is Map/Reduce? Framework Architecture
Map/Reduce Algorithm Design Tools and Libraries built on top of Map/Reduce

3 Introduction Big Data Scaling ‘out’ not ‘up’
Scaling ‘everything’ linearly with data size Data-intensive applications

4 Map/Reduce Origins Google Map/Reduce Hadoop Map/Reduce
The Map and Reduce functions are both defined with respect to data structured in (key, value) pairs.

5 Mapper The Map function takes a key/value pair, processes it, and generates zero or more output key/value pairs. The input and output types of the mapper can be different from each other.

6 Reducer The Reduce function takes a key and a series of all values associated with it, processes it, and generates zero or more output key/value pairs. The input and output types of the reducer can be different from each other.

7 Mappers/Reducers map: (k1, v1) -> [(k2, v2)]
reduce: (k2, [v2]) -> [(k3, v3)]

8 WordCount Example Problem: count the number of occurrences of every word in a text collection.
Map(docid a, doc d)
  for all term t in doc d do
    Emit(term t, count 1)
Reduce(term t, counts [c1, c2, …])
  sum = 0
  for all count c in counts do
    sum = sum + c
  Emit(term t, count sum)
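The pseudocode above can be simulated in plain Python. This is a single-process sketch of the framework's map, shuffle-and-sort, and reduce phases; `run_mapreduce` is a toy driver for illustration, not Hadoop's API.

```python
from itertools import groupby

def word_count_map(docid, doc):
    # Emit (term, 1) for every term occurrence in the document.
    for term in doc.split():
        yield (term, 1)

def word_count_reduce(term, counts):
    # Sum all partial counts associated with a term.
    yield (term, sum(counts))

def run_mapreduce(documents, mapper, reducer):
    # Toy driver: map, shuffle-and-sort by key, then reduce.
    intermediate = [kv for docid, doc in documents for kv in mapper(docid, doc)]
    intermediate.sort(key=lambda kv: kv[0])
    output = {}
    for key, group in groupby(intermediate, key=lambda kv: kv[0]):
        for k, v in reducer(key, [v for _, v in group]):
            output[k] = v
    return output

docs = [("d1", "the cat sat"), ("d2", "the cat ran")]
# run_mapreduce(docs, word_count_map, word_count_reduce)
# → {'cat': 2, 'ran': 1, 'sat': 1, 'the': 2}
```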

9 Map/Reduce Framework Architecture and Execution Overview

10 Architecture - Overview
Map/Reduce runs on top of DFS

11 Data Flow

12 Job Timeline

13–22 Job Work Flow (animation frames stepping through the job work flow)

23 Fault Tolerance Task fails: re-execution. TaskTracker fails:
removes the node from the pool of TaskTrackers and re-schedules its tasks. JobTracker fails: single point of failure; the job fails.

24 Map/Reduce Framework Features
Locality Move code to the data Task Granularity The number of mappers and reducers should be much larger than the number of machines, however, not too large! Dynamic load balancing! Backup Tasks Avoid slow workers near completion

25 Map/Reduce Framework Features
Skipping bad records Many failures on the same record Local execution Debug in isolation Status information Progress of computations User Counters, report progress Periodically propagated to the master node

26 Hadoop Streaming and Pipes
APIs for MapReduce that allow you to write your map and reduce functions in languages other than Java Hadoop Streaming Uses Unix standard streams as the interface between Hadoop and your program You can use any language that can read standard input and write to standard output Hadoop Pipes (for C++) Pipes uses sockets as the channel to communicate with the process running the C++ map or reduce function JNI is not used
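In the Streaming style, WordCount can be written as two small scripts that read lines and write tab-separated key-value lines. A minimal sketch in Python follows; the command line in the comment is illustrative (jar and path names are assumptions), and in a real job each function would be wrapped around `sys.stdin`.

```python
def streaming_mapper(lines):
    # Streaming mapper: read text lines, emit "word<TAB>1" per word.
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def streaming_reducer(lines):
    # Hadoop sorts the mapper output by key before the reducer sees it,
    # so identical words arrive on consecutive lines.
    current, total = None, 0
    for line in lines:
        word, count = line.split("\t")
        if word != current:
            if current is not None:
                yield f"{current}\t{total}"
            current, total = word, 0
        total += int(count)
    if current is not None:
        yield f"{current}\t{total}"

# In a real job each function would wrap sys.stdin, e.g. (paths illustrative):
#   hadoop jar hadoop-streaming.jar -input in/ -output out/ \
#       -mapper mapper.py -reducer reducer.py
```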

27 Keep in Mind Programmer has little control over many aspects of execution Where a mapper or reducer runs (i.e., on which node in the cluster). When a mapper or reducer begins or finishes Which input key-value pairs are processed by a specific mapper. Which intermediate key-value pairs are processed by a specific reducer.

28 Map/Reduce Algorithm Design

29 Partitioners Dividing up the intermediate key space. Simplest: hash of the key mod the number of reducers Assigns the same number of keys to each reducer Only considers the key and ignores the value May yield large differences in the number of values sent to each reducer A more complex partitioning algorithm can handle the imbalance in the amount of data associated with each key
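The simplest scheme above can be sketched in a few lines. This uses a stable CRC32 hash for determinism; Hadoop's default `HashPartitioner` works analogously with `key.hashCode()`.

```python
import zlib

def hash_partition(key, num_reducers):
    # Default-style partitioner: stable hash of the key modulo the number
    # of reducers. Only the key is examined, never the value, so skewed
    # value distributions can still overload a single reducer.
    return zlib.crc32(key.encode("utf-8")) % num_reducers
```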

30 Combiners In the WordCount example: the amount of intermediate data is larger than the input collection itself
Combiners are an optimization for local aggregation before the shuffle and sort phase Compute a local count for a word over all the documents processed by the mapper Think of combiners as “mini-reducers” However, combiners and reducers are not always interchangeable Combiner input and output pairs are the same as mapper output pairs Same as reducer input pairs Combiner may be invoked zero, one, or multiple times Combiner can emit any number of key-value pairs
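The "mini-reducer" idea can be sketched as a function applied to one mapper's output before the shuffle; this is an illustration of the aggregation logic, not Hadoop's Combiner API.

```python
from collections import defaultdict

def combine(mapper_output):
    # "Mini-reducer": sum counts locally so the mapper ships one
    # (word, partial_count) pair per distinct word instead of one
    # pair per occurrence.
    local = defaultdict(int)
    for word, count in mapper_output:
        local[word] += count
    return sorted(local.items())

# combine([("the", 1), ("cat", 1), ("the", 1)]) → [("cat", 1), ("the", 2)]
```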

31 Complete View of Map/Reduce

32 Local Aggregation Network and disk latency are high!
Features help local aggregation Single (Java) Mapper object for multiple (key,value) pairs in an input split (preserve state across multiple calls of the map() method) Share in-object data structures and counters Initialization, and finalization code across all map() calls in a single task JVM reuse across multiple tasks on the same machine

33 Basic WordCount Example

34 Per-Document Aggregation
Associative array inside the map() call to sum up term counts within a single document Emits a key-value pair for each unique term, instead of a key-value pair for each term occurrence in the document Substantial savings in the number of intermediate key-value pairs emitted
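A sketch of the per-document aggregation described above, using a `Counter` as the associative array:

```python
from collections import Counter

def map_per_document(docid, doc):
    # Aggregate within one document: emit one (term, count) pair per
    # unique term, rather than one (term, 1) pair per token.
    for term, count in Counter(doc.split()).items():
        yield (term, count)
```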

35 Per-Mapper Aggregation
Associative array inside the Mapper object to sum up term counts across multiple documents

36 In-Mapper Combining Pros
More control over when local aggregation occurs and how exactly it takes place (recall: no guarantees on combiners) More efficient than using actual combiners No additional overhead of object creation and of serializing, reading, and writing the key-value pairs Cons Breaks the functional programming model (not a big deal!) Scalability bottleneck Needs sufficient memory to store intermediate results Solution: block and flush, after every N key-value pairs have been processed or every M bytes have been used.
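The block-and-flush variant can be sketched as a mapper object that preserves its associative array across map() calls; the `max_entries` threshold and method names are illustrative, not Hadoop's Mapper API.

```python
class InMapperCombiningMapper:
    # Preserves an associative array across map() calls within one task,
    # flushing whenever it grows past max_entries to bound memory use.
    def __init__(self, max_entries=100_000):
        self.counts = {}
        self.max_entries = max_entries

    def map(self, docid, doc):
        for term in doc.split():
            self.counts[term] = self.counts.get(term, 0) + 1
        if len(self.counts) >= self.max_entries:
            yield from self.flush()

    def flush(self):
        # Also called from the task's cleanup/close hook at the end.
        for pair in sorted(self.counts.items()):
            yield pair
        self.counts = {}
```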

37 Correctness with Local Aggregation
Combiners are viewed as optional optimizations Correctness of the algorithm should not depend on their computations Combiners and reducers are not interchangeable Unless the reduce computation is both commutative and associative Make sure of the semantics of your aggregation algorithm (computing a mean, for example, is neither in its naive form)

38 Pairs and Stripes In some problems: a common approach is to construct complex keys and values to achieve more efficiency Example: the problem of building a word co-occurrence matrix from a large document collection Formally, the co-occurrence matrix of a corpus is a square N x N matrix, where N is the number of unique words in the corpus Cell Mij contains the number of times word Wi co-occurred with word Wj

39 Pairs Approach Mapper: emits co-occurring words pair as the key and the integer one Reducer: sums up all the values associated with the same co- occurring word pair

40 Pairs Approach Pairs algorithm generates a massive number of key-value pairs Combiners have few opportunities to perform local aggregation The sparsity of the key space also limits the effectiveness of in-memory combining
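The pairs approach can be sketched as follows; the neighbor `window` parameter is an assumption for illustration (the slides do not fix a co-occurrence definition).

```python
def pairs_map(docid, doc, window=2):
    # Emit ((w, u), 1) for every co-occurrence of w with a neighbor u
    # inside a window of nearby terms.
    terms = doc.split()
    for i, w in enumerate(terms):
        for u in terms[max(0, i - window):i] + terms[i + 1:i + window + 1]:
            yield ((w, u), 1)

def pairs_reduce(pair, counts):
    # Sum all values associated with the same co-occurring word pair.
    yield (pair, sum(counts))
```

Note how every single co-occurrence becomes its own key-value pair, which is exactly why the key space explodes and combiners find few duplicates to merge.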

41 Stripes Approach Store co-occurrence information in an associative array Mapper: emits words as keys and associative arrays as values Reducer: element-wise sum of all associative arrays of the same key

42 Stripes Approach Much more compact representation
Much fewer intermediate key-value pairs More opportunities to perform local aggregation May cause potential scalability bottlenecks of the algorithm.
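The stripes approach can be sketched the same way, again with an assumed neighbor window:

```python
from collections import defaultdict

def stripes_map(docid, doc, window=2):
    # One associative array (stripe) per term: {neighbor: count}.
    terms = doc.split()
    stripes = defaultdict(lambda: defaultdict(int))
    for i, w in enumerate(terms):
        for u in terms[max(0, i - window):i] + terms[i + 1:i + window + 1]:
            stripes[w][u] += 1
    for w, stripe in stripes.items():
        yield (w, dict(stripe))

def stripes_reduce(word, stripes):
    # Element-wise sum of all stripes sharing the same key word.
    total = defaultdict(int)
    for stripe in stripes:
        for neighbor, count in stripe.items():
            total[neighbor] += count
    yield (word, dict(total))
```

One key per word instead of one key per pair, at the cost of each stripe having to fit in memory.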

43 Which approach is faster?
APW (Associated Press Worldstream): a corpus of 2.27 million documents totaling 5.7 GB

44 Computing Relative Frequencies
In the previous example, (Wi,Wj) co-occurrence may be high just because one of the words is very common! Solution: Compute relative frequencies

45 Relative Frequencies with Stripes
Straightforward! In the reducer: Sum the counts of all words that co-occur with the key word Divide each count by that sum to get the relative frequency! Lessons: Use of complex data structures to coordinate distributed computations Appropriate structuring of keys and values brings together all the pieces of data required to perform a computation Drawback? As before, this algorithm assumes that each associative array fits into memory (scalability bottleneck!)
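The stripes-based relative-frequency reducer described above, as a sketch:

```python
def relative_frequency_reduce(word, stripes):
    # Element-wise sum of the stripes, then divide each count by the
    # row total to obtain f(Wj | Wi).
    total = {}
    for stripe in stripes:
        for neighbor, count in stripe.items():
            total[neighbor] = total.get(neighbor, 0) + count
    marginal = sum(total.values())
    yield (word, {n: c / marginal for n, c in total.items()})
```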

46 Relative Frequencies with Pairs
The reducer receives (Wi,Wj) as the key and the counts as the value From this alone it is not possible to compute f(Wj | Wi) Hint: reducers, like mappers, can preserve state across multiple keys Solution: at the reducer side, buffer in memory all the words that co-occur with Wi In essence building the associative array of the stripes approach Problem? Word pairs can arrive in arbitrary order! Solution: we must define the sort order of the pair Keys are first sorted by the left word, and then by the right word So that: when the left word changes -> sum, calculate and emit the results, flush the memory

47 Relative Frequencies with Pairs
Problem? Pairs with the same left word may be sent to different reducers! Solution? We must ensure that all pairs with the same left word are sent to the same reducer How? Custom partitioners!! Pay attention to the left word and partition based on its hash only Will it work? Yes! Drawback? Still a scalability bottleneck!

48 Relative Frequencies with Pairs
Another approach, with no bottlenecks? Can we compute (or ‘have’) the sum before processing the pair counts? The notion of ‘before’ and ‘after’ can be seen in the ordering of the key-value pairs This insight lies in properly sequencing the data presented to the reducer The programmer defines the sort order of keys so that data needed earlier is presented to the reducer earlier So now, we need two things Compute the sum for a given word Wi Send that sum to the reducer before any word pair where Wi is its left side

49 Relative Frequencies with Pairs
How? To get the sum Modify the mapper to additionally emit a ‘special’ key (Wi, *), with a value of one To ensure the order Define the sort order of the keys so that pairs with the special symbol, of the form (Wi, *), are ordered before any other key-value pairs whose left word is Wi In addition: the partitioner pays attention only to the left word
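The whole order-inversion mechanism can be sketched in one place. The `group_sorted` helper simulates the framework's shuffle-and-sort; note that `"*"` sorts before any ASCII letter, so the marginal key (Wi, *) reaches the reducer before every (Wi, Wj).

```python
from itertools import groupby

def rf_pairs_map(docid, doc, window=2):
    # For each co-occurrence, emit both the pair count and a special
    # marginal key (w, "*") used to compute the row sum.
    terms = doc.split()
    for i, w in enumerate(terms):
        for u in terms[max(0, i - window):i] + terms[i + 1:i + window + 1]:
            yield ((w, u), 1)
            yield ((w, "*"), 1)

def group_sorted(pairs):
    # Simulate shuffle-and-sort: sort by key and group the values.
    pairs = sorted(pairs)
    return [(k, [v for _, v in g]) for k, g in groupby(pairs, key=lambda kv: kv[0])]

def rf_pairs_reduce(sorted_pairs):
    # "*" < "a" lexicographically, so (w, "*") arrives first; the reducer
    # preserves the marginal across keys sharing the same left word.
    marginal = 0
    for (left, right), counts in sorted_pairs:
        total = sum(counts)
        if right == "*":
            marginal = total
        else:
            yield ((left, right), total / marginal)
```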

50 Relative Frequencies with Pairs
Example Memory bottlenecks? No!

51 Order Inversion Design Pattern
To summarize Emitting a special key-value pair for getting the sum Controlling the sort order of the intermediate key Defining a custom partitioner Preserving state across multiple keys in the reducer Quite a common pattern in many problems The key insight Convert the sequencing of computations into a sorting problem

52 Secondary Sort In addition to sorting by key, we also need to sort by value Implemented in Google’s MapReduce, but not in Hadoop Two main techniques Buffer all the values in memory and then sort May lead to too much memory consumption Value-to-key conversion Move part of the value into the intermediate key to form a composite key We must define the intermediate key sort order We must define the partitioner so that all pairs associated with the same original key are sent to the same reducer The reducer will need to preserve state across multiple pairs May lead to too many intermediate pairs
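Value-to-key conversion can be sketched on hypothetical sensor readings (the sensor example is an assumption for illustration):

```python
def secondary_sort(records):
    # records: (sensor_id, timestamp, reading) tuples.
    # Value-to-key conversion: move the timestamp into a composite key
    # (sensor_id, timestamp) so the framework's sort delivers each
    # sensor's readings in time order. A custom partitioner would hash
    # only sensor_id so all of a sensor's records reach one reducer.
    composite = [((sensor_id, ts), reading) for sensor_id, ts, reading in records]
    composite.sort(key=lambda kv: kv[0])  # simulates the shuffle's sort
    return composite
```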

53 Relational Joins For databases, data warehousing, and data analytics
Semi-structured data Example of a join: S and T are datasets (relations), k is the key we want to join on, si and ti are the unique IDs of tuples in S and T respectively, and Si and Ti are the rest of the tuple attributes

54 Reduce-side Join One-to-one join
Emit the tuple’s join attribute as the key, the rest of the attributes as the value One-to-many join Buffer all tuples in memory, or use the value-to-key pattern

55 Reduce-side Join Many-to-many join
The previous algorithm works as well The smaller set should come first The reducer will buffer it in memory Lessons The basic idea is to repartition the two datasets by the join key Not efficient, since it shuffles both datasets across the network
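The repartition join above can be sketched as follows; tuples are tagged with their source relation so the reducer can pair them up (the sort simulates the shuffle).

```python
from itertools import groupby

def reduce_side_join(s_tuples, t_tuples):
    # Repartition join: tag each tuple with its relation, shuffle by the
    # join key, and cross the two sides inside the reducer.
    tagged = [(k, ("S", rest)) for k, *rest in s_tuples]
    tagged += [(k, ("T", rest)) for k, *rest in t_tuples]
    tagged.sort(key=lambda kv: kv[0])  # simulates shuffle-and-sort
    for key, group in groupby(tagged, key=lambda kv: kv[0]):
        values = [v for _, v in group]
        s_side = [rest for tag, rest in values if tag == "S"]
        t_side = [rest for tag, rest in values if tag == "T"]
        for s in s_side:          # the smaller side is buffered in memory
            for t in t_side:
                yield (key, s + t)
```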

56 Map-side Joins Assume the datasets are
Both sorted by the join key Divided into the same number of files Partitioned in the same manner by the join key In each file, tuples are sorted by the join key We can then perform a join by scanning through both datasets simultaneously This is known as a merge join Parallelize by partitioning and sorting both datasets in the same way Map over one of the datasets (the larger one) Inside the mapper, read the corresponding part of the other dataset A non-local read Perform the merge join

57 Map-side Joins More efficient than a reduce-side join
Doesn’t shuffle the datasets Drawback: strong assumptions on the input file format Advice If used in a workflow with multiple Map/Reduce jobs, ensure the previous reducer writes its output in a convenient format.

58 Memory-backed Join If one of the datasets can fit in memory
Load it in memory Map over the other dataset Use random access to tuples based on the join key Great performance improvement
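A sketch of the memory-backed (hash) join: the small relation becomes an in-memory lookup table, and the map over the large relation does constant-time probes.

```python
def memory_backed_join(small, large):
    # Build a hash table over the small dataset keyed by the join key,
    # then stream the large dataset through it with O(1) lookups.
    lookup = {}
    for k, v in small:
        lookup.setdefault(k, []).append(v)
    for k, v in large:
        for sv in lookup.get(k, []):
            yield (k, sv, v)
```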

59 Summary In-mapper combining
Aggregates partial results Emits fewer intermediate pairs Pairs and stripes Keep track of joint events, one by one or in stripe fashion Order inversion Convert the sequencing of computations into a sorting problem Value-to-key conversion Scalable solution for secondary sorting Moving part of the value into the key

60 Before we go! Remember: Limitations of the Map/Reduce Model
Map/Reduce is mainly designed for batch processing, not for online queries It prevents modifying or adding input data while a job is running, as well as modifying the number of machines A Map/Reduce job has a single entry and a single exit We cannot keep it alive waiting for an event to trigger it Map/Reduce works on flat files Lack of schema support

61 What’s Next?

62 Map/Reduce vs RDBMS A living debate in the databases and data analytics communities In 2008, D. DeWitt and M. Stonebraker wrote “MapReduce: A major step backwards” A giant step backward in the programming paradigm An implementation that uses brute force instead of indexing Not novel at all -- well-known techniques developed nearly 25 years ago Missing most of the features that are routinely included in current DBMSs Incompatible with all of the tools DBMS users have come to depend on MapReduce is missing features Indexing, Bulk loading, Updates, Transactions, Integrity constraints, Referential integrity, Views MapReduce is incompatible with DBMS tools Report writers, Business intelligence tools, Data mining tools, Replication tools, Database design tools

63 Map/Reduce vs RDBMS In 2010, the same authors and others wrote
“MapReduce and Parallel DBMSs: Friends or Foes?”, where they argue that Map/Reduce is a complement to DBMSs, not a competitor They are used in different application domains Parallel DBMSs excel at efficient querying of large data sets MR-style systems excel at ETL (extract-transform-load) tasks

64 NoSQL A mechanism for storage and retrieval of data that uses looser consistency models than traditional relational databases To achieve higher scalability and availability Usually in the form of a key-value store Built on top of distributed file systems Examples Google BigTable Apache HBase Apache Cassandra Amazon Dynamo

65 Tools on top of Hadoop Apache Pig
Apache Pig is a high-level procedural language platform developed to simplify querying large data sets in Apache Hadoop and MapReduce Apache Pig features Pig Latin, a relational data-flow language that enables SQL-like queries to be performed on distributed datasets within Hadoop applications Pig originated as a Yahoo! Research project In 2007, Pig became an open source project of the Apache Software Foundation

66 Apache Pig Pig Latin Example

67 Apache Pig Pig execution flow

68 Tools on top of Hadoop Apache Hive
Hive is a data warehouse system for the open source Apache Hadoop project. Hive features HiveQL, an SQL-like language that facilitates data analysis and summarization for large datasets stored in Hadoop-compatible file systems. Hive originated at Facebook and later became an open source project under the Apache Software Foundation.

69 Apache Hive HiveQL Example

70 Pig vs Hive They are/were independent projects and there was no centrally coordinated goal. They were in different spaces early on and have grown to overlap with time as both projects expand Some differences are Pig Latin is procedural, where HiveQL is declarative. Pig Latin allows developers to insert their own code almost anywhere in the data pipeline. Both compiles to Map and Reduce jobs.

71 Libraries on top of Hadoop
Mahout A machine learning library for building scalable machine learning algorithms.

72 Libraries on top of Hadoop
HIPI (Hadoop Image Processing Interface) A framework that provides an API for performing image processing tasks in a distributed computing environment

73 Summary Map/Reduce Framework Architecture Map/Reduce Algorithm Design
Tools and Libraries built on top of Map/Reduce

74 Demo Starting Hadoop cluster Copying data to HDFS
Compiling our Java Map/Reduce code and create the Jar file. Submit Hadoop job Show progress and dash boards Retrieve the output from HDFS Shut down Hadoop cluster

75 Appendix Hadoop Configurations Single Node Cluster setup Simple guide
More detailed: linux-single-node-cluster/ Cluster setup linux-multi-node-cluster/

76 Appendix Packages to install on Linux Hadoop:
/hadoop tar.gz Oracle Java 7: linux-x64.tar.gz SSH $ sudo apt-get install ssh $ sudo apt-get install rsync

77 Appendix Studying materials
“Data-Intensive Text Processing with MapReduce” Jimmy Lin and Chris Dyer “Hadoop: The Definitive Guide” Tom White “MapReduce Design Patterns” Donald Miner and Adam Shook

78 Questions?

