Download presentation
Presentation is loading. Please wait.
Published byBlake Baker Modified over 9 years ago
1
MapReduce Design Patterns CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook
2
Agenda Summarization Patterns Filtering Patterns Data Organization Patterns Joins Patterns Metapatterns I/O Patterns Bloom Filters
3
SUMMARIZATION PATTERNS Numerical Summarizations, Inverted Index, Counting with Counters
4
Overview Top-down summarization of large data sets Most straightforward patterns Calculate aggregates over entire data set or groups Build indexes
5
Numerical Summarizations Group records together by a field or set of fields and calculate a numerical aggregate per group Build histograms or calculate statistics from numerical values
6
Known Uses Word Count Record Count Min/Max/Count Average/Median/Standard Deviation
7
Structure
8
Performance Perform well, especially when combiner is used Need to be concerned about data skew with from the key
9
Example Discover the first time a StackOverflow user posted, the last time a user posted, and the number of posts in between User ID, Min Date, Max Date, Count
10
public class MinMaxCountTuple implements Writable { private Date min = new Date(); private Date max = new Date(); private long count = 0; private final static SimpleDateFormat frmt = new SimpleDateFormat( "yyyy-MM-dd'T'HH:mm:ss.SSS"); public Date getMin() { return min; } public void setMin(Date min) { this.min = min; } public Date getMax() { return max; } public void setMax(Date max) { this.max = max; } public long getCount() { return count; } public void setCount(long count) { this.count = count; } public void readFields(DataInput in) { min = new Date(in.readLong()); max = new Date(in.readLong()); count = in.readLong(); } public void write(DataOutput out) { out.writeLong(min.getTime()); out.writeLong(max.getTime()); out.writeLong(count); } public String toString() { return frmt.format(min) + "\t" + frmt.format(max) + "\t" + count; }
11
public static class MinMaxCountMapper extends Mapper { private Text outUserId = new Text(); private MinMaxCountTuple outTuple = new MinMaxCountTuple(); private final static SimpleDateFormat frmt = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSS"); public void map(Object key, Text value, Context context) { Map parsed = xmlToMap(value.toString()); String strDate = parsed.get("CreationDate"); String userId = parsed.get("UserId"); Date creationDate = frmt.parse(strDate); outTuple.setMin(creationDate); outTuple.setMax(creationDate) outTuple.setCount(1); outUserId.set(userId); context.write(outUserId, outTuple); }
12
public static class MinMaxCountReducer extends Reducer { private MinMaxCountTuple result = new MinMaxCountTuple(); public void reduce(Text key, Iterable values, Context context) { result.setMin(null); result.setMax(null); result.setCount(0); int sum=0; for (MinMaxCountTuple val : values) { if (result.getMin() == null || val.getMin().compareTo(result.getMin()) < 0) { result.setMin(val.getMin()); } if (result.getMax() == null || val.getMax().compareTo(result.getMax()) > 0) { result.setMax(val.getMax()); } sum += val.getCount(); } result.setCount(sum); context.write(key, result); }
14
public static void main(String[] args) { Configuration conf = new Configuration(); String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs(); if (otherArgs.length != 2) { System.err.println("Usage: MinMaxCountDriver "); System.exit(2); } Job job = new Job(conf, "Comment Date Min Max Count"); job.setJarByClass(MinMaxCountDriver.class); job.setMapperClass(MinMaxCountMapper.class); job.setCombinerClass(MinMaxCountReducer.class); job.setReducerClass(MinMaxCountReducer.class); job.setOutputKeyClass(Text.class); job.setOutputValueClass(MinMaxCountTuple.class); FileInputFormat.addInputPath(job, new Path(otherArgs[0])); FileOutputFormat.setOutputPath(job, new Path(otherArgs[1])); System.exit(job.waitForCompletion(true) ? 0 : 1); }
15
Inverted Index Generate an index from a data set to enable fast searches or data enrichment Building an index takes time, but can greatly reduce the amount of time to search for something Output can be ingested into key/value store
16
Structure
17
Counting with Counters Use MapReduce framework’s counter utility to calculate global sum entirely on the map side, producing no output Small number of counters only!!
18
Known Uses Count number of records Count a small number of unique field instances Sum fields of data together
19
Structure
20
FILTERING PATTERNS Filtering, Bloom Filtering, Top Ten, Distinct
21
Filtering Discard records that are not of interest Create subsets of your big data sets that you want to further analyze
22
Known Uses Closer view of the data Tracking a thread of events Distributed grep Data cleansing Simple random sampling
23
Structure
24
Bloom Filtering Keep records that are a member of a large predefined set of values Inherent possibility of false positives
25
Known Uses Removing most of the non-watched values Pre-filtering a data set prior to expensive membership test
26
Structure
27
Top Ten Retrieve a relatively small number of top K records based on a ranking scheme Find the outliers or most interesting records
28
Known Uses Outlier analysis Selecting interesting data Catchy dashboards
29
Structure
30
Distinct Remove duplicate entries of your data, either full records or a subset of fields That fourth V nobody talks about that much
31
Known Uses Deduplicate data Get distinct values Protect from inner join explosion
32
Structure
33
DATA ORGANIZATION PATTERNS Structured to Hierarchical, Partitioning, Binning, Total Order Sorting, Shuffling
34
Structured to Hierarchical Transformed row-based data to a hierarchical format Reformatting RDBMS data to a more conducive structure
35
Known Uses Pre-joining data Prepare data for HBase or MongoDB
36
Structure
37
Partitioning Partition records into smaller data sets Enables faster future query times due to partition pruning
38
Known Uses Partition pruning by continuous value Partition pruning by category Sharding
39
Structure
40
Binning File records into one or more categories – Similar to partitioning, but the implementation is different Can be used to solve similar problems to Partitioning
41
Known Uses Pruning for follow-on analytics Categorizing data
42
Structure
43
Total Order Sorting Sort your data set in parallel Difficult to apply “divide and conquer” technique of MapReduce
44
Known Uses Sorting
45
Structure
47
Shuffling Set of records that you want to completely randomize Instill some anonymity or create some repeatable random sampling
48
Known Uses Anonymize the order of the data set Repeatable random sampling after shuffled
49
Structure
50
JOIN PATTERNS Join Refresher, Reduce-Side Join w/ and w/o Bloom Filter, Replicated Join, Composite Join, Cartesian Product
51
Join Refresher A join is an operation that combines records from two or more data sets based on a field or set of fields, known as a foreign key Let’s go over the different types of joins before talking about how to do it in MapReduce
52
A Tale of Two Tables
53
Inner Join
54
Left Outer Join
55
Right Outer Join
56
Full Outer Join
57
Antijoin
58
Cartesian Product
59
How to implement? Reduce-Side Join w/ and w/o Bloom Filter Replicated Join Composite Join Cartesian Product stands alone
60
Reduce Side Join Two or more data sets are joined in the reduce phase Covers all join types we have discussed – Exception: Mr. Cartesian All data is sent over the network – If applicable, filter using Bloom filter
61
Structure
62
Performance Need to be concerned about data skew 2 PB joined on 2 PB means 4 PB of network traffic
63
Replicated Join Inner and Left Outer Joins Removes need to shuffle any data to the reduce phase Very useful, but requires one large data set and the remaining data sets to be able to fit into memory of each map task
64
Structure
65
Performance Fastest type of join Map-only Limited based on how much data you can safely store inside JVM Need to be concerned about growing data sets Could optionally use a Bloom filter
66
Composite Join Leverages built-in Hadoop utilities to join the data Requires the data to be already organized and prepared in a specific way Really only useful if you have one large data set that you are using a lot
67
Data Structure
68
Structure
69
Performance Good performance, join operation is done on the map side Requires the data to have the same number of partitions, partitioned in the same way, and each partition must be sorted
70
Cartesian Product Pair up and compare every single record with every other record in a data set Allows relationships between many different data sets to be uncovered at a fine-grain level
71
Known Uses Document or image comparisons Math stuff or something
72
Structure
73
Performance Massive data explosion! Can use many map slots for a long time Effectively creates a data set size O(n 2 ) – Need to make sure your cluster can fit what you are doing
74
METAPATTERNS Job Chaining, Chain Folding, Job Merging
75
Job Chaining One job is often not enough Need a combination of patterns discussed to do your workflow Sequential vs Parallel
76
Methodologies In the Driver In a Bash run script With the JobControl utility
77
Chain Folding Each record can be submitted to multiple mappers, then a reducer, then a mapper Reduces amount of data movement in the pipeline
78
Structure
80
Methodologies Just do it ChainMapper/ChainReducer
81
Job Merging Merge unrelated jobs together into the same pipeline
82
Structure
83
Methodologies Tag map output records Use MultipleOutputs
84
I/O PATTERNS Generating Data, External Source Output, External Source Input, Partition Pruning
85
Customizing I/O Unstructured and semi-structured data often calls for a custom input format to be developed
86
Generating Data Generate lots of data in parallel from nothing Random or representative big data sets for you to test your analytics with
87
Known Uses Benchmarking your new cluster Making more data to represent a sample you were given
88
Structure
89
External Source Output You want to write MapReduce output to some non-native location Direct loading into a system instead of using HDFS as a staging area
90
Known Uses Write directly out to some non-HDFS solution – Key/Value Store – RDBMS – In-Memory Store Many of these are already written
91
Structure
92
External Source Input You want to load data in parallel from some other source Hook other systems into the MapReduce framework
93
Known Uses Skip the staging area and load directly into MapReduce Key/Value store RDBMS In-Memory store
94
Structure
95
Partition Pruning Abstract away how the data is stored to load what data is needed based on the query
96
Known Uses Discard unneeded files based on the query Abstract data storage from query, allowing for powerful middleware to be built
97
Structure
98
References “MapReduce Design Patterns” – O’Reilly 2012 www.github.com/adamjshook/mapreducepatt erns www.github.com/adamjshook/mapreducepatt erns http://en.wikipedia.org/wiki/Bloom_filter
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.