1
A MapReduce-Based Maximum-Flow Algorithm for Large Small-World Network Graphs
Felix Halim, Roland H.C. Yap, Yongzheng Wu
2
Outline
Background and Motivation
Overview
  Maximum-Flow algorithm (the Ford-Fulkerson method)
  MapReduce (MR) framework
Parallelizing the Ford-Fulkerson Method
  Incremental update, bi-directional search, multiple excess paths
MapReduce Optimizations
  Stateful extension, trading off space vs. number of rounds
Experimental Results
Conclusion
3
Background and Motivation
Large small-world network graphs naturally arise in the World Wide Web, social networks, biology, etc. They have been shown to have small diameter and to be robust.
Maximum-flow on large graphs can be used to:
  Isolate a group of spam sites on the WWW
  Identify communities
  Defend against Sybil attacks
Challenge: the computation, storage, and memory requirements exceed the capacity of a single machine.
Our approach: cloud/cluster computing. Run max-flow on top of the MapReduce framework.

Nowadays, large graphs such as the WWW and social networks have grown very large. These graphs have been shown to exhibit the small-world network property: they have a small diameter and are robust. Robust here means robust to edge/vertex removal (the expected diameter stays small). Given such a graph, we usually want to run some analysis on it. For example, we may want to discover a group of spam sites on the WWW. Since there are only a few links from trusted sites to spam sites (unless they are compromised), we can compute a max-flow from a trusted site to one of the spam sites to cut and isolate the spammer sites. The same technique can be used to identify a community, and max-flow can also be used to limit the acceptance of Sybil accounts. The challenge is how to practically run a max-flow algorithm on these gigantic graphs. The computational complexity of useful graph algorithms such as max-flow is typically quadratic, far too much for a single quad-core machine. Moreover, the size of the graph can reach terabytes, which exceeds the memory capacity of a single conventional machine. Unless we have an expensive supercomputer with terabytes of memory and thousands of cores, we need a distributed-systems approach: distribute the data across several machines in a cluster and process them in parallel. However, it is not as easy as it sounds, because conventional graph algorithms were designed to run sequentially. We need to modify the existing max-flow algorithm into a highly parallel version to utilize the computing power of each machine. In the next section, we will see how to parallelize a sequential max-flow algorithm and make it effective on a distributed-system framework, the MapReduce framework. But first, let us look at an overview of the max-flow algorithm and the MapReduce framework.
4
The Maximum-Flow Overview
Given a flow network G = (V, E), a source s, and a sink t
Residual network Gf = (V, Ef)
Augmenting path: a simple path from s to t in the residual network
To compute the maximum flow, the Ford-Fulkerson method, O(|f*| E):
  While an augmenting path p is found in Gf, augment the flow along the path p

The max-flow problem is a classical graph problem. You are given a graph with a set of vertices V, a set of edges E, and two special vertices s and t, and you have to find a flow from s to t whose value is maximum. The graph together with a flow on its edges induces the residual network. A path from s to t in the residual network is called an augmenting path. The Ford-Fulkerson method computes the max-flow by repeatedly finding an augmenting path and augmenting the flow along the edges of that path until no more augmenting path can be found. Let us look at an example (a minimal reference implementation is also sketched below).
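As a point of reference, here is a minimal single-machine sketch of the Ford-Fulkerson method, using BFS to find augmenting paths (the Edmonds-Karp variant). The adjacency-dict representation and the name max_flow are illustrative choices, not the implementation used in the paper.

```python
from collections import deque, defaultdict

def max_flow(capacity, s, t):
    """Ford-Fulkerson with BFS augmenting paths (Edmonds-Karp).
    capacity: dict of dicts, capacity[u][v] = capacity of edge (u, v)."""
    residual = defaultdict(lambda: defaultdict(int))
    for u in capacity:
        for v, c in capacity[u].items():
            residual[u][v] += c
    total_flow = 0
    while True:
        # Breadth-first search for an augmenting path in the residual network
        parent = {s: None}
        queue = deque([s])
        while queue and t not in parent:
            u = queue.popleft()
            for v, r in residual[u].items():
                if r > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if t not in parent:
            return total_flow  # no augmenting path left: max-flow reached
        # Bottleneck: the smallest residual capacity along the path
        bottleneck, v = float("inf"), t
        while parent[v] is not None:
            bottleneck = min(bottleneck, residual[parent[v]][v])
            v = parent[v]
        # Augment: push flow forward, add reverse capacity for cancellation
        v = t
        while parent[v] is not None:
            residual[parent[v]][v] -= bottleneck
            residual[v][parent[v]] += bottleneck
            v = parent[v]
        total_flow += bottleneck
```

For example, max_flow({'s': {'a': 10, 'b': 4}, 'a': {'t': 6}, 'b': {'t': 7}}, 's', 't') returns 10.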
5
Maximum-Flow Example
[Diagram: a flow network from s to t with edge capacities (10, 9, 2, 8, 6, 7); legend: source node, sink node, intermediate node, residual r = C - F, augmenting path p; the highlighted path is augmented by 6]
You can think of the graph as a network where a vertex represents a junction and an edge represents a water pipe: water can flow along the edge direction, limited by the weight (capacity) of the edge. The first augmenting path is found (the red edges). When it is augmented, the flow is added to its edges; at most 6 units of water can flow along this path. When the residual network is updated, a reverse edge capacity is added to allow cancelling the previous flow.
6
Maximum-Flow Example
[Diagram: the updated residual network; the next augmenting path p from s to t is highlighted and augmented]
The next augmenting path is found and augmented.
7
Maximum-Flow Example
[Diagram: the updated residual network; another augmenting path p from s to t is highlighted and augmented]
Another augmenting path is found and augmented.
8
Maximum-Flow Example
[Diagram: the final network labeled with flow/capacity on each edge (7/10, 7/9, 1/2, 8/8, 6/6, 8/9, 7/7); Max-Flow = 14]
Finally, no more augmenting paths can be found, and the Ford-Fulkerson method guarantees that the max-flow has been reached. The maximum flow here is 14.
9
MapReduce Framework Overview
Introduced by Google in 2004; open-source implementation: Hadoop
Operates on very large datasets
  Consisting of key/value pairs
  Across thousands of commodity machines
  On the order of terabytes of data
Abstracts away distributed-computing problems
  Data partitioning and distribution, load balancing
  Scheduling, fault tolerance, communication, etc.

When we have a very large dataset to process on a large number of machines, we need a framework that deals with distributed-systems problems. There is a specialized graph-processing framework, Pregel, built by Google, but it is a closed system and its open-source counterpart (Apache Hama) is not yet mature. At this point, we are left with the MapReduce framework.
10
MapReduce Model: User-Defined Map and Reduce Functions (stateless)
Input: a list of tuples of key/value pairs (k1/v1)
  The user's map function is applied to each key/value pair
  It produces a list of intermediate key/value pairs
Output: a list of tuples of key/value pairs (k2/v2)
  The intermediate values are grouped by key
  The user's reduce function is applied to each group
Each tuple is independent
  It can be processed in isolation and in a massively parallel manner
  The total input can be far larger than the workers' total memory

MapReduce is very simple to use: the user only needs to define map and reduce functions and model the data as tuples of <key, value> pairs. There are catches, though. The map/reduce functions must be stateless: they cannot accumulate information across tuples, as that would blow up the worker's memory; they must forget or serialize results to disk, although a constant number of event counters can be used. Each tuple must be independent: while a tuple is being processed by a map/reduce function, it cannot communicate with mappers/reducers that are processing other tuples, so all information required to process a tuple must be inside the tuple. Once you have stateless map/reduce functions and model the data as independent tuples, the MapReduce framework can process them in a massively parallel manner, and it scales to datasets far larger than the workers' total memory, as long as each tuple's memory requirement fits within a single worker's capacity. A toy simulation of this model is sketched below.
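To make the model concrete, here is a toy, single-process simulation of one MapReduce round (map, group by key, reduce). The names run_mapreduce, wc_map, and wc_reduce are illustrative; this is a sketch of the model, not the Hadoop API.

```python
from itertools import groupby
from operator import itemgetter

def run_mapreduce(records, map_fn, reduce_fn):
    """Toy, single-process simulation of one MapReduce round.
    records: iterable of (key, value) input tuples."""
    # Map phase: each input tuple is processed in isolation
    intermediate = []
    for k, v in records:
        intermediate.extend(map_fn(k, v))
    # Shuffle phase: group intermediate tuples by key
    intermediate.sort(key=itemgetter(0))
    # Reduce phase: each key group is processed independently
    output = []
    for k, group in groupby(intermediate, key=itemgetter(0)):
        output.extend(reduce_fn(k, [v for _, v in group]))
    return output

# Example: word count (the classic MapReduce "hello world")
def wc_map(doc_name, contents):
    return [(w, 1) for w in contents.split()]

def wc_reduce(word, counts):
    return [(word, sum(counts))]

print(run_mapreduce([("d1", "a b a"), ("d2", "b c")], wc_map, wc_reduce))
# [('a', 2), ('b', 2), ('c', 1)]
```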
11
Max-Flow on MR - Observations
Input: a small-world graph with small diameter D
A vertex in the graph is modeled as an input tuple in MR
A naive translation of the Ford-Fulkerson method to MapReduce requires O(|f*| D) MR rounds
Breadth-first search on MR requires O(D) MR rounds
  BFSMR using 20 machines takes 9 rounds and 6 hours for a SWN with ~400M vertices and ~31B edges
  It would take years to compute a max-flow (|f*| > 1000)
Question: how can we minimize the number of rounds?

We know that small-world graphs have small diameter and are robust. We can model each vertex (with its neighbors, flows, and edges as values) as a tuple of the MR input. If we run even the simplest graph algorithm, BFS, on such a graph, it takes about O(D) MR rounds (each BFS level is executed in one MR round/job/iteration); a sketch of one such round follows below. In a multi-round MR algorithm, it makes sense to measure complexity in terms of the number of rounds, since a single MR round is an expensive job that can take hours. If we "naively" translated the sequential max-flow algorithm into MapReduce, it would take years to compute a max-flow between two vertices, since it requires O(|f*| D) MR rounds. So the question is: how can we improve this complexity?
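For intuition, a hedged sketch of how one BFS level can be phrased as a vertex-centric MapReduce round: each input tuple is (vertex_id, state), where state holds the adjacency list and the current distance from s (0 for s, None otherwise). The field names are assumptions for illustration; running one such round per BFS level (for example by plugging these functions into the run_mapreduce toy shown earlier) gives the O(D) rounds mentioned above.

```python
def bfs_map(vertex_id, state):
    """state = {'adj': [neighbor ids], 'dist': int or None}.
    Re-emit the vertex record and send tentative distances to neighbors."""
    out = [(vertex_id, state)]
    if state['dist'] is not None:
        for nbr in state['adj']:
            out.append((nbr, {'adj': None, 'dist': state['dist'] + 1}))
    return out

def bfs_reduce(vertex_id, states):
    """Merge the master vertex record with the smallest tentative distance received."""
    merged = {'adj': [], 'dist': None}
    for s in states:
        if s['adj'] is not None:
            merged['adj'] = s['adj']  # the master copy of the vertex
        if s['dist'] is not None and (merged['dist'] is None
                                      or s['dist'] < merged['dist']):
            merged['dist'] = s['dist']
    return [(vertex_id, merged)]
```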
12
FF1: A Parallel Ford-Fulkerson
Goal: minimize the number of rounds through parallelism
  Do more work per round, avoid spilling tasks to the next round
  Use speculative execution to increase parallelism
Incremental updates
Bi-directional search
  Doubles the parallelism (number of active vertices)
  Effectively halves the expected number of rounds
Maintain parallelism (keep a large number of active vertices)
  Multiple excess paths (k) -> most effective
  Each vertex stores k excess paths (to avoid becoming inactive)
Result: lowers the number of rounds required from O(|f*| D) to ~D
  Large number of augmenting paths per round (MR bottleneck)

We can lower the number of rounds by doing more work in each round and avoiding spilling computation to the next round. To do more work per round, we introduce speculative execution, which in turn increases parallelism. We do not waste the work done in previous rounds: we continue and update the data structures incrementally, which keeps the parallelism high. Bi-directional search is effective in doubling the parallelism and halving the expected number of rounds. To maintain high parallelism we want a high number of active vertices at all times and want to avoid a vertex becoming inactive (having nothing to do). To achieve this, we allow each vertex to store more than one excess path (as backups), so that if some of them are saturated, the vertex still has paths to extend in the next round; our preliminary results show that this is the most effective technique in minimizing the number of rounds (a simplified sketch of this bookkeeping follows below). With all this parallelism we can generate a lot of augmenting paths per round. These augmenting paths need to be checked for conflicts before they are augmented: if two augmenting paths share an edge, only one of them can be augmented. The decision can be made locally (without a global view of the graph), but the worker responsible for this decision is overwhelmed by the large number of augmenting paths. This creates a convoy effect where all reducers have finished except the one processing vertex t (since all augmenting paths found are forwarded/shuffled directly to t). Moreover, it also generates a lot of communication overhead in shuffling all these augmenting paths to t. The next question: what can we do to resolve this bottleneck?
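To illustrate the multiple-excess-paths idea, here is a speculative, much-simplified sketch of what a vertex might do with up to k stored excess paths: extend them to neighbors in one phase, and keep the best k of the incoming ones in the other. The record layout, the value of K, and the "widest path first" tie-breaking are assumptions for illustration only; the actual FF1 data structures differ.

```python
K = 8  # maximum excess paths stored per vertex (illustrative value)

def extend_excess_paths(vertex_id, neighbors, excess_paths):
    """Speculatively extend every stored excess path to every neighbor.
    neighbors: dict neighbor_id -> residual capacity of the outgoing edge.
    excess_paths: list of (path_from_s, bottleneck_capacity) ending at this vertex."""
    messages = []
    for path, bottleneck in excess_paths:
        for nbr, residual_cap in neighbors.items():
            if residual_cap > 0 and nbr not in path:
                messages.append((nbr, (path + [nbr], min(bottleneck, residual_cap))))
    return messages

def accept_excess_paths(vertex_id, incoming):
    """Keep at most K of the incoming excess paths (here, the widest ones),
    so the vertex stays active even if some of its paths are later saturated."""
    incoming.sort(key=lambda p: p[1], reverse=True)
    return incoming[:K]
```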
13
MR Optimizations Goal: minimize MR bottlenecks and overheads
FF2: external worker(s) as a stateful MR extension
  The reducer for vertex t is the bottleneck in FF1
  An external process is used to handle augmenting-path acceptance
FF3: Schimmy method [Lin10]
  Avoid shuffling the master graph
FF4: eliminate object instantiations
FF5: minimize MR shuffle (communication) costs
  Monitor the extended excess paths for saturation and resend as needed
  Recompute/reprocess the graph instead of shuffling the delta (not in paper)

To offload the heavy decision-making from the reducer processing vertex t, we created an external process (worker) outside the MapReduce framework. Any augmenting path found in the reduce phase can be sent directly to the external worker and decided immediately (instead of spilling the computation to the next round to forward it to vertex t); a toy sketch of this acceptance logic follows below. Our preliminary results show that even with millions of augmenting paths generated in a round, the external process completes as soon as the last reducer finishes, hence no more convoy effect. The next MR optimizations are due to the Schimmy design pattern and the well-known object-instantiation overhead. Last but not least, we discovered that the communication cost (the reduce shuffle bytes) has a very high correlation with the runtime of a round, so we aim to minimize it. Since MR is a push model, a vertex cannot know whether the excess path it forwarded to a neighbor was accepted or rejected, and would therefore need to keep resending new excess paths every round. We devised a technique that assumes the extended excess path was accepted, monitors it for saturation, and resends only when necessary. This avoids resending new excess paths each round (minimizing redundancy). When the residual graph is updated, the delta must normally be shuffled so that the reducer has up-to-date information about the latest residual graph. We can avoid shuffling the delta by having the reducer recompute it instead. With these optimizations, FF5 can handle very large graphs effectively and efficiently.
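As an illustration of the FF2 idea, a hedged, minimal sketch of the acceptance logic such an external worker might run: accept an augmenting path only if every edge on it still has enough residual capacity after the paths accepted so far. The class name and data model are assumptions; the real external worker also has to exchange these requests with the reducers over the network, which is omitted here.

```python
class AugmentingPathAcceptor:
    """Toy version of an external acceptance worker: augmenting paths arrive as
    reducers find them and are accepted greedily if they do not conflict with
    (over-saturate an edge used by) previously accepted paths."""

    def __init__(self, residual):
        # residual[(u, v)] = remaining capacity of edge (u, v)
        self.residual = dict(residual)
        self.accepted = []

    def offer(self, path, amount):
        edges = list(zip(path, path[1:]))
        if any(self.residual.get(e, 0) < amount for e in edges):
            return False  # conflicts with an already accepted path
        for e in edges:
            self.residual[e] -= amount
        self.accepted.append((path, amount))
        return True
```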
14
Experimental Results: Facebook Sub-Graphs and Cluster Setup
Cluster: Hadoop v0.21-RC-0 installed on 21 nodes, each with 8 hyper-threaded cores (2 Intel CPUs), 3 SATA hard disks, and CentOS 5.4 (64-bit).

Graph | #Vertices | #Edges   | Size (HDFS) | Max Size
FB1   | 21 M      | 112 M    | 587 MB      | 8 GB
FB2   | 73 M      | 1,047 M  | 6 GB        | 54 GB
FB3   | 97 M      | 2,059 M  | 13 GB       | 111 GB
FB4   | 151 M     | 4,390 M  | 30 GB       | 96 GB
FB5   | 225 M     | 10,121 M | 69 GB       | 424 GB
FB6   | 441 M     | 31,239 M | 238 GB      | 1,281 GB

These are the subgraphs we collected; each smaller graph is a subset of the larger ones. Notice that the size grows exponentially in terms of the number of edges.
15
Handling Large Max-Flow Values
This shows that even for large max-flow values, our FF5 requires a very small number of MR rounds. The runtime increases linearly while the max-flow value increases exponentially. FF5MR is able to process FB6 with a very small number of MR rounds (close to the graph diameter).
16
MapReduce Optimizations
FF1 (parallel FF) to FF5 (MR optimized) vs. BFSMR
We can see the improvements from the successive optimizations:
FF1 is the parallelized FF with incremental updates, bi-directional search, and multiple excess paths
FF2 adds the stateful extension for MR: an external worker for augmenting-path acceptance
FF3 adds the Schimmy method
FF4 eliminates object instantiation
FF5 prevents redundant messages, reducing communication costs
From FF1 to FF5: FB1 is 5.43 times faster; FB4 is times faster
We see that it is not much worse than a pure BFS on MR.
17
Shuffle Bytes Reduction
This shows the effectiveness of FF5 in reducing the communication cost (reduce shuffle bytes). The bottleneck in MR is the shuffled bytes, and FF5 optimizes them.
18
Scalability (#Machines and Graph Size)
It scales linearly with the number of machines (blue -> green -> red). It scales linearly with the size of the graph (FB1 … FB6). It is only a constant factor slower than a pure BFS on MR.
19
Conclusion
We showed how to parallelize a sequential max-flow algorithm, minimizing the number of MR rounds
  Incremental updates, bi-directional search, and multiple excess paths
We showed MR optimizations for max-flow
  Stateful extension, minimized communication costs
Computing max-flow on large small-world graphs is practical using FF5MR
20
Q & A Thank you
21
Backup Slides
Related Works
Multiple Excess Paths – Results
Edges processed / second – Results
MapReduce example (word count)
MapReduce execution flow
FF1 Map Function
FF1 Reduce Function
22
Related Work The Push-Relabel algorithm
Distributed (no global view of the graph required)
Has been developed for SMP architectures
Needs sophisticated heuristics to push the flows
Not suitable for the MapReduce model
  Needs locks and must pull information from its neighbors
  Low number of active vertices
  Pushing flow into the wrong sub-graph can lead to a huge number of rounds

There exists a distributed max-flow algorithm, Push-Relabel, that requires no global view of the graph. However, it is not suitable for the MapReduce model, since it requires locks and communication between vertices, and it can incur a high number of rounds when the heuristic makes a bad decision.
23
Multiple Excess Paths - Effectiveness
Going beyond k = 512 is meaningless, as most vertices have fewer than 300 connections. The larger k is, the fewer MR rounds are required (it keeps the number of active vertices high).
24
# Edges processed / second: the larger the graph, the more effective.
25
MapReduce Example – Word Count
map(key, value):            // key: document name, value: contents
    foreach word w in value
        EmitIntermediate(w, 1)

reduce(key, values):        // key: a word, values: a list of counts
    freq = 0
    foreach v in values
        freq = freq + v
    Emit(key, freq)
26
MapReduce Execution Flow
27
MapReduce Simple Applications
Word Count
Distributed Grep
Count URL access frequency
Reverse Web-Link Graph
Term-Vector per host
Inverted Index
Distributed Sort
All these tasks can be completed in one MR job
30
General MR Optimizations
External worker as a stateful extension for MR
  (Dedicated) external workers outside mappers / reducers
  Immediately process requests, no need to wait until mappers / reducers complete
  Flexible synchronization point
Minimize shuffling of intermediate tuples
  Can avoid shuffling by re-processing / re-computing in the reduce
  Use flags in the data structure to prevent re-shuffling
Eliminate object instantiation
  Use binary serializations