CSCI5570 Large Scale Data Processing Systems Distributed Data Analytics Systems James Cheng CSE, CUHK Slide Ack.: modified based on the slides from Mosharaf Chowdhury
FlumeJava Easy, Efficient Data-Parallel Pipelines PLDI 2010 Craig Chambers, Ashish Raniwala, Frances Perry, Stephen Adams, Robert R. Henry, Robert Bradshaw, Nathan Weizenbaum
Problem
- Long and complicated data-parallel pipelines: long chains of MapReduce jobs, iterative jobs, ...
- Difficult to program and manage
- Each MapReduce job needs to keep intermediate results
- High overhead at the synchronization barrier between MapReduce jobs
  - Curse of the last reducer
  - Start-up cost
Solution
- Expose a limited set of parallel operations on immutable parallel collections
- Optimize the resulting execution plan, e.g., from 16 data-parallel operations down to 2
Goals
- Expressiveness: abstractions
- Performance: data representation, implementation strategy, lazy evaluation, dynamic optimization
- Usability & deployability: implemented as a Java library
FlumeJava Workflow
1. Write a Java program using the FlumeJava library:

    PCollection<String> words = lines.parallelDo(new DoFn<String,String>() {
      void process(String line, EmitFn<String> emitFn) {
        for (String word : splitIntoWords(line)) {
          emitFn.emit(word);
        }
      }
    }, collectionOf(strings()));

2. FlumeJava.run();
3. Optimize
4. Execute
Core Abstractions
- Parallel collections: PCollection<T>, PTable<K, V>
- Data-parallel operations
  - Primitives: parallelDo(), groupByKey(), combineValues(), flatten()
  - Derived operations: count(), join(), top()
Parallel Collections
- PCollection<T>: an immutable bag of elements of type T
  - Ordered (a sequence) or unordered (a collection)
  - T can be built-in or user-defined
- PTable<K, V>: an immutable unordered bag of key-value pairs (i.e., an immutable multi-map), where keys are of type K and values are of type V
  - Same as PCollection<Pair<K, V>>
Primitive Operations: parallelDo()
- Supports elementwise computation over an input PCollection<T> to produce a new output PCollection<S>
- Takes a DoFn<T, S> argument that maps each element of PCollection<T> to zero or more elements of PCollection<S>
- E.g., split lines into words (by DoFn) in parallel (by parallelDo):

    PCollection<String> words = lines.parallelDo(new DoFn<String,String>() {
      void process(String line, EmitFn<String> emitFn) {
        for (String word : splitIntoWords(line)) {
          emitFn.emit(word);
        }
      }
    }, collectionOf(strings()));
Primitive Operations: parallelDo() (cont.)
- Can express both the map and the reduce step of MapReduce
- Subclasses of DoFn: MapFn, FilterFn, ...
- DoFn functions should not access any global mutable state (for data consistency, as they run in parallel)
  - DoFn objects can maintain state in local variables
  - Multiple DoFn replicas can operate concurrently with no shared state
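Since the FlumeJava library itself is not public, the DoFn contract can be illustrated with a minimal local simulation. All names below (DoFnSketch, parallelDo, splitLines) are hypothetical stand-ins, and the loop runs sequentially where FlumeJava would shard it across workers:

```java
import java.util.ArrayList;
import java.util.List;

// Minimal local stand-ins for DoFn/EmitFn (hypothetical, not the real FlumeJava API).
public class DoFnSketch {
    interface EmitFn<S> { void emit(S value); }
    interface DoFn<T, S> { void process(T input, EmitFn<S> emitFn); }

    // Sequential stand-in for parallelDo(): applies fn to each element.
    // Because fn touches no global mutable state, many replicas of it
    // could just as well process disjoint shards of the input in parallel.
    static <T, S> List<S> parallelDo(List<T> input, DoFn<T, S> fn) {
        List<S> out = new ArrayList<>();
        for (T t : input) fn.process(t, out::add);
        return out;
    }

    // Example DoFn: split each line into words (zero or more outputs per input).
    static List<String> splitLines(List<String> lines) {
        DoFn<String, String> splitter = (line, emitFn) -> {
            for (String word : line.split("\\s+")) {
                if (!word.isEmpty()) emitFn.emit(word);
            }
        };
        return parallelDo(lines, splitter);
    }
}
```

The key point the sketch captures: the DoFn sees one element at a time and emits through EmitFn, so it has no reason to touch shared state.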
Primitive Operations: groupByKey()
- Converts a multi-map of type PTable<K, V> (which can have many key/value pairs with the same key) into a uni-map of type PTable<K, Collection<V>>, where each key maps to an unordered collection of all the values with that key
- Captures the essence of the shuffle step of MapReduce
- E.g., compute a table mapping each URL to the collection of documents that link to it:

    // For each URL in a doc, output a (url, doc) pair
    PTable<URL,DocInfo> backlinks =
      docInfos.parallelDo(new DoFn<DocInfo, Pair<URL,DocInfo>>() {
        void process(DocInfo docInfo, EmitFn<Pair<URL,DocInfo>> emitFn) {
          for (URL targetUrl : docInfo.getLinks()) {
            emitFn.emit(Pair.of(targetUrl, docInfo));
          }
        }
      }, tableOf(recordsOf(URL.class), recordsOf(DocInfo.class)));
    // Group all docs with the same URL (i.e., key) into one collection
    PTable<URL,Collection<DocInfo>> referringDocInfos = backlinks.groupByKey();
Primitive Operations: combineValues()
- Takes an input PTable<K, Collection<V>> and an associative combining function on the elements of Collection<V>, and returns a PTable<K, V> where all elements of each Collection<V> are combined into a single output value
- Could be implemented with parallelDo, but the declared associativity lets FlumeJava implement it more efficiently via MapReduce combiners (partial combining on the map side)
- E.g., count the occurrences of each distinct word:

    // For each word, output a (word, 1) pair
    PTable<String,Integer> wordsWithOnes =
      words.parallelDo(new DoFn<String, Pair<String,Integer>>() {
        void process(String word, EmitFn<Pair<String,Integer>> emitFn) {
          emitFn.emit(Pair.of(word, 1));
        }
      }, tableOf(strings(), ints()));
    // Group all the 1s with the same key (i.e., the same word)
    PTable<String,Collection<Integer>> groupedWordsWithOnes =
      wordsWithOnes.groupByKey();
    // Combine all the 1s for each word into their sum
    PTable<String,Integer> wordCounts =
      groupedWordsWithOnes.combineValues(SUM_INTS);
Primitive Operations: flatten()
- Takes a list of PCollection<T>s and returns a single PCollection<T> that contains all the elements of the inputs
- Does not actually copy the inputs, but rather creates a view of them as one logical PCollection
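The copy-free "view" behavior can be sketched with a chained iterator over plain Java lists (FlattenSketch and its method are hypothetical names, not the real API):

```java
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// A sketch of flatten() as a view: the returned Iterable walks the input
// collections in place; no elements are copied into a new collection.
public class FlattenSketch {
    static <T> Iterable<T> flatten(List<List<T>> parts) {
        return () -> new Iterator<T>() {
            private int part = 0;
            private Iterator<T> current = Collections.emptyIterator();

            public boolean hasNext() {
                // Advance to the next non-empty underlying list on demand.
                while (!current.hasNext() && part < parts.size()) {
                    current = parts.get(part++).iterator();
                }
                return current.hasNext();
            }

            public T next() {
                if (!hasNext()) throw new NoSuchElementException();
                return current.next();
            }
        };
    }
}
```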
Derived Operations: count()
- Takes a PCollection<T> and returns a PTable<T, Integer> that maps each distinct element of the input to the number of times it occurs
- Implemented using parallelDo(), groupByKey(), and combineValues()
- E.g., count the occurrences of each distinct word (same result as the code on the previous slide):

    PTable<String,Integer> wordCounts = words.count();
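The derivation from the three primitives can be simulated on plain Java maps (CountSketch is a hypothetical stand-in; the comments mark which primitive each step corresponds to):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Local simulation of how count() derives from the primitives
// (hypothetical names, not the real FlumeJava API).
public class CountSketch {
    static <T> Map<T, Integer> count(List<T> input) {
        // parallelDo step: map each element t to a (t, 1) pair;
        // groupByKey step: collect the 1s for each distinct key.
        Map<T, List<Integer>> grouped = new HashMap<>();
        for (T t : input) {
            grouped.computeIfAbsent(t, k -> new ArrayList<>()).add(1);
        }
        // combineValues step: fold each key's collection with SUM_INTS.
        Map<T, Integer> counts = new HashMap<>();
        for (Map.Entry<T, List<Integer>> e : grouped.entrySet()) {
            int sum = 0;
            for (int one : e.getValue()) sum += one;
            counts.put(e.getKey(), sum);
        }
        return counts;
    }
}
```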
Derived Operations: join(), top()
- join(): takes a multi-map PTable<K, V1> and a multi-map PTable<K, V2>, and returns a uni-map PTable<K, Tuple2<Collection<V1>, Collection<V2>>> such that, for each key in either input table, Collection<V1> is the collection of all values with that key in the first table and Collection<V2> is the collection of all values with that key in the second table
  - Various joins can be computed from Tuple2<Collection<V1>, Collection<V2>>
- top(): takes a comparison function and a count k, and returns the greatest k elements according to the comparison function
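The shape of join()'s output can be shown on plain Java maps (JoinSketch and Tuple2 below are hypothetical stand-ins); note that a key present in only one input still appears, paired with an empty collection on the other side:

```java
import java.util.Collection;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// A sketch of join() on in-memory multi-maps (hypothetical, not the real API).
public class JoinSketch {
    record Tuple2<A, B>(A first, B second) {}

    static <K, V1, V2> Map<K, Tuple2<Collection<V1>, Collection<V2>>> join(
            Map<K, List<V1>> left, Map<K, List<V2>> right) {
        // Every key appearing in either input produces one output entry.
        Set<K> keys = new HashSet<>(left.keySet());
        keys.addAll(right.keySet());
        Map<K, Tuple2<Collection<V1>, Collection<V2>>> out = new HashMap<>();
        for (K k : keys) {
            Collection<V1> l = left.getOrDefault(k, List.of());
            Collection<V2> r = right.getOrDefault(k, List.of());
            out.put(k, new Tuple2<>(l, r));
        }
        return out;
    }
}
```

Inner, left-outer, or full-outer joins then differ only in how the caller interprets empty collections in the Tuple2.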
FlumeJava Workflow (recap)
1. Write a Java program using the FlumeJava library (DONE!)
2. FlumeJava.run(); (NEXT)
3. Optimize (NEXT)
4. Execute (NEXT)
Deferred Evaluation
- To enable optimization, FlumeJava applies deferred evaluation to its parallel operations
- Each PCollection object is represented internally in either deferred (not yet computed) or materialized (computed) state
- A deferred PCollection holds a pointer to the deferred operation that computes it
- A deferred operation holds references to the PCollections that are its arguments (deferred or materialized) and to the deferred PCollections that are its results
- When a FlumeJava operation such as parallelDo() is called, it just creates a ParallelDo deferred operation object and returns a new deferred PCollection that points to it
- The result of executing a series of FlumeJava operations is therefore a directed acyclic graph (DAG) of deferred PCollections and operations, called the execution plan
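A toy version of this deferred/materialized split fits in a few lines (DeferredSketch and its fields are hypothetical names, not FlumeJava's internals): calling map() only records a node in the plan, and nothing runs until run() walks the DAG.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Toy sketch of deferred evaluation (hypothetical, not the real FlumeJava API).
public class DeferredSketch {
    static class DeferredColl<T> {
        List<T> materialized;       // non-null once computed
        DeferredOp<?, T> producer;  // the deferred operation that computes it

        static <T> DeferredColl<T> of(List<T> data) {
            DeferredColl<T> c = new DeferredColl<>();
            c.materialized = data;
            return c;
        }

        // Deferred: only create an operation node and a new deferred collection.
        <S> DeferredColl<S> map(Function<T, S> fn) {
            DeferredColl<S> result = new DeferredColl<>();
            result.producer = new DeferredOp<>(this, fn);
            return result;
        }

        boolean isDeferred() { return materialized == null; }

        // run(): recursively materialize arguments, then apply this node's op.
        List<T> run() {
            if (materialized == null) materialized = producer.execute();
            return materialized;
        }
    }

    static class DeferredOp<T, S> {
        final DeferredColl<T> arg;
        final Function<T, S> fn;
        DeferredOp(DeferredColl<T> arg, Function<T, S> fn) { this.arg = arg; this.fn = fn; }
        List<S> execute() {
            List<S> out = new ArrayList<>();
            for (T t : arg.run()) out.add(fn.apply(t));
            return out;
        }
    }
}
```

Between plan construction and run() is exactly where FlumeJava's optimizer gets to rewrite the DAG.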
Optimization
- Optimizer strategy: sink Flattens; lift CombineValues; insert fusion blocks; fuse ParallelDos; fuse MSCRs
- Optimizer output: MSCR, Flatten, and Operate operations
ParallelDo Fusion
- Producer-consumer fusion: one ParallelDo operation performs function f, and its result is consumed by another ParallelDo operation that performs function g
  - Replaced by a single multi-output ParallelDo that computes both f and g ∘ f, e.g., A and D replaced by (A + D) to give A.1 and D.0
  - If the result of f is not needed by other operations, it is not produced, e.g., A and B replaced by (A + B) to give B.0 only; A.0 is not produced
- Sibling fusion: two or more ParallelDo operations read the same input PCollection
  - Fused into a single multi-output ParallelDo operation that computes the results of all the fused operations in a single pass over the input, e.g., B, C, and D are fused into (B + C + D)
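The payoff of producer-consumer fusion can be sketched in plain Java (FusionSketch is a hypothetical stand-in, not the real optimizer): instead of one pass computing f and a second pass rereading f's output to compute g, a single fused pass emits both f(x) and g(f(x)).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Sketch of a fused multi-output pass (hypothetical, not FlumeJava's optimizer).
public class FusionSketch {
    record TwoOutputs<A, B>(List<A> fOut, List<B> gOfFOut) {}

    static <T, A, B> TwoOutputs<A, B> fusedPass(List<T> input,
                                                Function<T, A> f,
                                                Function<A, B> g) {
        List<A> fOut = new ArrayList<>();
        List<B> gOut = new ArrayList<>();
        for (T t : input) {        // a single traversal of the input
            A a = f.apply(t);      // f's output, kept only because it is needed
            fOut.add(a);
            gOut.add(g.apply(a));  // g o f, without rereading f's output
        }
        return new TwoOutputs<>(fOut, gOut);
    }
}
```

If no later operation needed f's output, the optimizer would drop fOut entirely, as on the slide above.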
MapShuffleCombineReduce (MSCR)
- Transforms combinations of the four primitives into a single MapReduce
- Generalizes MapReduce: multiple input channels, multiple reducers/combiners, multiple outputs per reducer, pass-through outputs
MSCR Fusion
- An MSCR operation is produced from a set of related GroupByKey operations
- GroupByKey operations are related if they consume the same input, or inputs created by the same (fused) ParallelDo operations
- E.g., MSCR fusion seeded by three GroupByKey operations (starred PCollections are needed by later operations)
Optimization: Let's do it!
- Optimizer strategy: sink Flattens; lift CombineValues; insert fusion blocks; fuse ParallelDos; fuse MSCRs
- Optimizer output: MSCR, Flatten, and Operate operations
An Example: Step 1 Initially 16 data-parallel operations After sinking Flattens
An Example: Step 2 After Step 1 After ParallelDo fusion
An Example: Step 3 After Step 2 After MSCR fusion
An Example: Final Result From 16 data-parallel operations to 2 MSCR operations (executed as 2 MapReduce jobs!)
Some Results
- 5x reduction in the average number of MapReduce stages
- Faster than other approaches, except hand-optimized MapReduce chains
- 319 users over a one-year period at Google