Multithreaded Algorithms for Graph Coloring

Alex Pothen, Purdue University, CSCAPES Institute
www.cs.purdue.edu/homes/apothen/
with Assefaw Gebremedhin, Mahantesh Halappanavar (PNNL), John Feo (PNNL), Umit Catalyurek (Ohio State)
CSC'11 Workshop, May 2011
References
Multithreaded algorithms for graph coloring. Catalyurek, Feo, Gebremedhin, Halappanavar and Pothen. 40 pp., submitted to Parallel Computing.
New multithreaded ordering and coloring algorithms for multicore architectures. Patwary, Gebremedhin and Pothen. 12 pp., EuroPar 2011.
Graph coloring for derivative computation and beyond: Algorithms, software and analysis. Gebremedhin, Nguyen, Pothen and Patwary. 32 pp., submitted to TOMS.
Distributed memory parallel algorithms for matching and coloring. Catalyurek, Dobrian, Gebremedhin, Halappanavar and Pothen. 10 pp., IPDPS Workshop PCO, 2011.
[Diagram: the interplay of Graph, Architecture, and Algorithm: thread granularity, synchronization, concurrency, latency tolerance, and performance]
Outline
The many-core and multithreaded world
◦ Intel Nehalem
◦ Sun Niagara
◦ Cray XMT
A case study on multithreaded graph coloring
◦ An iterative and speculative coloring algorithm
◦ A dataflow algorithm
RMAT graphs: ER, G, and B
Experimental results
Conclusions
Architectural Features

Proc.           Threads/Core  Cores/Socket  Threads  Cache      Clock    Multithreading, Other Detail
Intel Nehalem   2             4             16       Shared L3  2.5 GHz  Simultaneous; cache coherence protocol
Sun Niagara 2   8             8             128      Shared L2  1.2 GHz  Simultaneous
Cray XMT        128           128 procs.    16,384   None       500 MHz  Interleaved; fine-grained synchronization
Multithreaded: Iterative Greedy Algorithm
[Figure: a vertex v, the colors forbidden to it by its neighbors, and its assigned color c_i]
Multithreaded: Dataflow Algorithm
[Figure: the dataflow algorithm on a vertex v: forbidden colors and assigned color c_i]
Moore's Law and Performance
Figure: K. Olukotun, L. Hammond, H. Sutter, and Burton Smith
RMAT Graphs
R-MAT: Recursive MATrix method
Experiments
◦ RMAT-ER (0.25, 0.25, 0.25, 0.25)
◦ RMAT-G (0.45, 0.15, 0.15, 0.25)
◦ RMAT-B (0.55, 0.15, 0.15, 0.15)
Chakrabarti, D. and Faloutsos, C. Graph mining: Laws, generators, and algorithms. ACM Comput. Surv. 38(1), 2006.
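To make the quadrant probabilities concrete, here is a minimal C++ sketch of R-MAT edge generation (our own illustration, not the generator used in the experiments): each edge descends the recursive 2 x 2 partition of the adjacency matrix, picking a quadrant with probabilities (a, b, c, d) at every level.

    // Minimal R-MAT edge generator (illustrative sketch).
    // An edge picks one of the four quadrants of the (2^scale x 2^scale)
    // adjacency matrix with probabilities a, b, c, d at each recursion level.
    #include <cstdint>
    #include <random>
    #include <utility>

    std::pair<std::uint64_t, std::uint64_t>
    rmat_edge(int scale, double a, double b, double c, std::mt19937_64& rng) {
        std::uniform_real_distribution<double> uni(0.0, 1.0);
        std::uint64_t row = 0, col = 0;
        for (int level = 0; level < scale; ++level) {
            double r = uni(rng);
            row <<= 1; col <<= 1;
            if (r < a)              { /* top-left quadrant */ }
            else if (r < a + b)     { col |= 1; }             // top-right
            else if (r < a + b + c) { row |= 1; }             // bottom-left
            else                    { row |= 1; col |= 1; }   // bottom-right, prob. d
        }
        return {row, col};
    }

    int main() {
        std::mt19937_64 rng(42);
        // RMAT-B parameters from this slide: (a, b, c, d) = (0.55, 0.15, 0.15, 0.15)
        for (std::uint64_t i = 0; i < (1u << 20); ++i) {
            auto [u, v] = rmat_edge(24, 0.55, 0.15, 0.15, rng);  // 2^24 vertices
            (void)u; (void)v;   // store or stream the edge (u, v)
        }
        return 0;
    }

Skew in (a, b, c, d) controls the degree distribution: equal probabilities give Erdős-Rényi-like graphs (ER), while increasing a yields the heavier-tailed G and B instances.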
RMAT Graphs
[Figure: recursive partitioning of the adjacency matrix into quadrants with probabilities a, b, c, d]
Nehalem and Niagara: Strong Scaling
[Plots: RMAT-ER, RMAT-G, RMAT-B]
Cray XMT: Strong and Weak Scaling
[Plots: Iter-G, Iter-B, DF-G, DF-B]
Comparing Three Platforms
[Plots: a) ER, c) Good (G), e) Bad (B)]
Number of Colors in Parallel Algorithms
[Plots: a) ER, b) G, c) B]
Computing SL Orderings in Parallel: RMAT-G Graphs (Nehalem)
[Plots: SL ordering; relaxed SL ordering]
Our Contributions: Multithreaded Coloring
Massive multithreading
◦ Can tolerate memory latency for graphs/sparse matrices
◦ Dataflow algorithms are easier to implement than distributed-memory versions
◦ Thread concurrency ameliorates the lack of caches and lower clock speeds
◦ Thread parallelism can be exploited at fine grain if supported by lightweight synchronization
◦ Graph structure critically influences performance
Many-core machines
◦ Developed an iterative algorithm for greedy coloring (distance-1 and -2) and ordering algorithms that port across machines
◦ Simultaneous multithreading can hide latency (X threads on 1 core vs. 1 thread on X cores)
◦ Decompose into tasks at a finer grain than the distributed-memory version, and relax synchronization to enhance concurrency
◦ Many-core processors will form the nodes of peta- and exa-scale machines, so single-node performance studies are needed
Computing Smallest-Last Ordering: Strict and Relaxed Orderings on SC Graphs
Note: O(|E|)-time implementations are possible for all four orderings. If B is the maximum back degree over the entire sequence, then B+1 colors suffice to color G.
Computing SL Orderings in Parallel: RMAT-B Graphs
Computing SL Orderings in Parallel: Strict and Relaxed Orderings
Conflicts from Speculation: Three Platforms
[Plots: a) conflicts, b) iterations, c) conflicts per iteration]
A Renaissance in Architectures
Good news
◦ Moore's Law marches on
Bad news
◦ Power limits improvement in clock speed
◦ Memory accesses are the bottleneck for high-throughput computing
A major paradigm change: a huge opportunity for innovation, though the eventual consequences are unclear
Current response: multi- and many-core processors
◦ The computation/communication ratio will get worse
◦ Makes life harder for applications
◦ Heterogeneous programming models for extreme-scale machines built from many-core nodes
Architectural Wish List for Sparse Matrices/Graphs
Small messages with low latency / high bandwidth
Latency tolerance
Lightweight synchronization mechanisms
Global address space
◦ No partitioning of the problem required
◦ Avoids the memory-consuming profusion of ghost nodes
◦ Correctness and performance are easier to achieve
One solution: multithreaded computation
Multithreaded Parallelism
Memory access times determine performance
By issuing multiple threads, memory latency can be masked as long as a ready thread is available when a functional unit becomes free
Interleaved vs. simultaneous multithreading (IMT vs. SMT)
Figure from Robert Golla, Sun
Multicore: Sun Niagara 2
Two 8-core sockets, 8 hardware threads per core
1.2 GHz processors linked by an 8 x 9 crossbar to the L2 cache banks
Simultaneous multithreading: two threads from a core can be issued in a cycle
Shallow pipeline
Multicore: Intel Nehalem
Two quad-core sockets, 2.5 GHz
Two hyperthreads per core support SMT
Off-chip data latency: 106 cycles
Advanced architectural features: cache coherence protocol to reduce traffic, loop-stream detection, improved branch prediction, out-of-order execution
Massive Multithreading: Cray XMT
Latency tolerance via massive multithreading
◦ Context switch between threads in a single clock cycle
◦ Global address space, hashed to memory banks to reduce hot spots
◦ No cache or local memory; average latency 600 cycles
A memory request does not stall the processor
◦ Other threads work while the request is fulfilled
Lightweight, word-level synchronization (full/empty bits)
Notes:
◦ 500 MHz clock
◦ 128 hardware thread streams per processor
◦ Interleaved multithreading
Multithreaded Algorithms for Graph Coloring
We developed two kinds of multithreaded algorithms for graph coloring:
◦ An iterative, coarse-grained method for generic shared-memory architectures
◦ A dataflow algorithm designed for massively multithreaded architectures with hardware support for fine-grained synchronization, such as the Cray XMT
Benchmarked the algorithms on three systems: Cray XMT, Sun Niagara 2 and Intel Nehalem
Excellent speedup observed on all three platforms
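One natural realization of the dataflow idea, as a minimal C++ sketch (an analogue, not the Cray XMT code): each vertex blocks until all of its lower-indexed neighbors are colored, then takes the smallest color they do not use. On the XMT the spin-wait below is a single full/empty-bit synchronized load; here we approximate it with atomics and OpenMP.

    // Dataflow coloring sketch (compile with -fopenmp). color[v] == 0 means
    // v is uncolored; each vertex depends only on its lower-indexed neighbors.
    // OpenMP dynamic chunks are handed out in increasing iteration order, so a
    // waiting vertex's dependences keep making progress.
    #include <atomic>
    #include <vector>

    void dataflow_color(const std::vector<std::vector<int>>& adj,
                        std::vector<std::atomic<int>>& color) {
        const int n = (int)adj.size();
        #pragma omp parallel for schedule(dynamic, 64)
        for (int v = 0; v < n; ++v) {
            std::vector<char> forbidden(adj[v].size() + 2, 0);
            for (int u : adj[v]) {
                if (u >= v) continue;   // depend only on lower-indexed neighbors
                int cu;
                while ((cu = color[u].load(std::memory_order_acquire)) == 0) {
                    // dataflow dependence: wait until u is colored
                }
                // colors beyond deg(v)+1 can never block the smallest choice
                if (cu < (int)forbidden.size()) forbidden[cu] = 1;
            }
            int c = 1;
            while (forbidden[c]) ++c;   // smallest color unused by those neighbors
            color[v].store(c, std::memory_order_release);
        }
    }

By construction this yields exactly the coloring that serial greedy produces in the same vertex order; the parallelism comes from vertices whose dependences are already satisfied.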
Coloring Algorithms
Greedy Coloring Algorithms
Distance-k, star, and acyclic coloring are NP-hard
Approximating the chromatic number to within O(n^{1-ε}) is NP-hard for any ε > 0

GREEDY(G = (V, E))
  Order the vertices in V
  for i = 1 to |V| do
    Determine the colors forbidden to v_i
    Assign v_i the smallest permissible color
  end for

A greedy heuristic usually gives a near-optimal solution
The key is to find good orderings for coloring, and many have been developed
Ref: Gebremedhin, Tarafdar, Manne and Pothen, SIAM J. Sci. Comput. 29:1042-1072, 2007.
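A direct C++ rendering of GREEDY (a minimal sketch; production implementations live in ColPack): the mark array lets each vertex gather its forbidden colors in O(d(v)) time, so the whole pass runs in O(|V| + |E|).

    // Greedy distance-1 coloring in a given vertex order (illustrative sketch).
    // mark[c] records the last vertex that forbade color c, so no per-vertex
    // reset of the forbidden set is needed.
    #include <vector>

    std::vector<int> greedy_color(const std::vector<std::vector<int>>& adj,
                                  const std::vector<int>& order) {
        const int n = (int)adj.size();
        std::vector<int> color(n, 0);       // 0 = uncolored; colors start at 1
        std::vector<int> mark(n + 1, -1);
        for (int v : order) {
            for (int u : adj[v])            // determine colors forbidden to v
                if (color[u] != 0) mark[color[u]] = v;
            int c = 1;
            while (mark[c] == v) ++c;       // smallest permissible color
            color[v] = c;
        }
        return color;
    }

Different order vectors (natural, smallest-last, and so on) plug in unchanged, which is why ordering is where the quality battle is fought.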
Distance-1 Coloring, Greedy Algorithm
[Figure: vertex v and its already-colored neighbors; v receives the smallest color not used among them]
Many-core greedy coloring
Given a graph, parallelize greedy coloring on many-core machines such that
◦ speedup is attained, and
◦ the number of colors is roughly the same as in serial
This is difficult: greedy is inherently sequential, computation is small relative to communication, and data accesses are irregular
D1 coloring: approaches based on Luby's parallel algorithm for maximal independent sets had limited success
Gebremedhin and Manne (2000) developed a parallel greedy coloring algorithm for shared-memory machines (sketched below)
◦ Uses speculative coloring to enhance concurrency, randomized partitioning to reduce conflicts, and serial conflict resolution
◦ The number of conflicts is bounded, so this approach yields an effective algorithm
◦ Extended to distance-2 coloring by Gebremedhin, Manne and Pothen (2002)
We adapt this approach to implement the greedy algorithm for many-core computing
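A minimal OpenMP sketch of the iterative speculative scheme (our assumptions: adjacency lists and a shared color array; this is not the paper's exact code). Each round colors the current work list in parallel, races and all, then a detection sweep re-queues one endpoint of every conflicting edge:

    // Iterative speculative coloring (OpenMP sketch; compile with -fopenmp).
    // Reads of color[] may race with concurrent writes; that is the
    // speculation (a production version would use relaxed atomics).
    #include <vector>

    void iterative_color(const std::vector<std::vector<int>>& adj,
                         std::vector<int>& color) {
        const int n = (int)adj.size();
        std::vector<int> work(n);
        for (int i = 0; i < n; ++i) work[i] = i;
        color.assign(n, 0);                       // 0 = uncolored
        while (!work.empty()) {
            // Phase 1: color everything in the work list speculatively.
            #pragma omp parallel for schedule(guided)
            for (std::size_t i = 0; i < work.size(); ++i) {
                int v = work[i];
                std::vector<char> forbidden(adj[v].size() + 2, 0);
                for (int u : adj[v]) {
                    int cu = color[u];
                    if (cu > 0 && cu < (int)forbidden.size()) forbidden[cu] = 1;
                }
                int c = 1;
                while (forbidden[c]) ++c;
                color[v] = c;
            }
            // Phase 2: detect conflicts; the higher-indexed endpoint retries.
            std::vector<int> recolor;
            #pragma omp parallel
            {
                std::vector<int> local;
                #pragma omp for schedule(guided) nowait
                for (std::size_t i = 0; i < work.size(); ++i) {
                    int v = work[i];
                    for (int u : adj[v])
                        if (color[u] == color[v] && u < v) { local.push_back(v); break; }
                }
                #pragma omp critical
                recolor.insert(recolor.end(), local.begin(), local.end());
            }
            work.swap(recolor);
        }
    }

Because conflicts can only arise between vertices colored in the same round, the work list shrinks rapidly; the "Conflicts from Speculation" slide measures exactly this.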
Parallel Coloring
Parallel Coloring: Speculation
[Figure: adjacent vertices v and w colored concurrently by different threads may speculatively receive the same color; the conflict is detected and one of them is recolored]
Experimental Results: Iterative vs. Dataflow
Cray XMT: RMAT-G with 2^24, ..., 2^27 vertices and 134M, ..., 1B edges
Experimental Results: Niagara 2, Iterative
Doubling the threads on a core gives the same performance as doubling the cores!
Experimental Results: All Platforms
RMAT-G with 2^24 = 16M vertices and 134M edges; RMAT-B with 2^24 vertices and 134M edges
Iterative Greedy Coloring: Multithreaded Algorithm
Cost per vertex v:
◦ Coloring: d(v) reads each of Adj(v), color(w), and forbidden(v); d(v) writes to forbidden(v)
◦ Conflict detection: d(v) reads each of Adj(v) and color(w)
Experimental Results: All Platforms
RMAT-G with 2^24 = 16M vertices and 134M edges
Tentative Conclusions, Future Work
Future Plans: Multithreaded Coloring
Massive multithreading
◦ Microbenchmarking to understand where the cycles go: thread management, data accesses, synchronization, instruction scheduling, functional-unit limitations
◦ Develop a performance model of the computation
◦ Experiment with other graph classes
◦ Consider new algorithmic paradigms
Many-core machines
◦ The four items above
◦ Ordering for coloring: an archetype of the problem of computing a sequential ordering in a parallel environment (Mostofa Patwary and Assefaw Gebremedhin)
◦ Extend to nodes of peta-scale machines, so that single-node performance is enhanced, and complete our work on the Blue Gene and the Cray XT5
Thanks
Rob Bisseling, Erik Boman, Ümit Çatalyürek, Karen Devine, Florin Dobrian, John Feo, Assefaw Gebremedhin, Mahantesh Halappanavar, Bruce Hendrickson, Paul Hovland, Gary Kumfert, Fredrik Manne, Ali Pınar, Sivan Toledo, Jean Utke
Further reading
www.cscapes.org
Gebremedhin and Manne. Scalable parallel graph coloring algorithms. Concurrency: Practice and Experience 12:1131-1146, 2000.
Gebremedhin, Manne and Pothen. Parallel distance-k coloring algorithms for numerical optimization. Lecture Notes in Computer Science 2400:912-921, 2002.
Bozdag, Gebremedhin, Manne, Boman and Catalyurek. A framework for scalable greedy coloring on distributed-memory parallel computers. J. Parallel Distrib. Comput. 68(4):515-535, 2008.
Catalyurek, Feo, Gebremedhin, Halappanavar and Pothen. Multithreaded algorithms for graph coloring. Preprint, Aug. 2010.
Combinatorial Scientific Computing and Petascale Simulations (CSCAPES) Institute
One of four DOE Scientific Discovery through Advanced Computing (SciDAC) Institutes (2006-2012), and the only one in applied mathematics
◦ Excellence in research, education and training
◦ Collaborations with science projects in SciDAC
Focus not on a specific application, but on algorithms and software for combinatorial problems
Participants from Purdue, Sandia, Argonne, Ohio State, Colorado State
CSCAPES workshops with talks, tutorials on software, and discussions on collaborations
www.cscapes.org
CSCAPES Institute
Advertisements!
Journal of the ACM: a new section on Scientific and High Performance Computing, a venue for publishing work of excellent quality in these areas that has a computer science component
SIAM Books, SIAM Monographs in Computational Science and Engineering: potential book ideas? Please contact me!
SIAM Workshop on Combinatorial Scientific Computing, CSC11, May 19-21, Darmstadt, www.siam.org/meetings/csc11/
Application Complexity Is Increasing
Leading-edge scientific applications increasingly include:
◦ Adaptive, unstructured data structures
◦ Complex, multiphysics simulations
◦ Multiscale computations in space and time
◦ Complex synchronizations (e.g., discrete events)
Significant parallelization challenges on today's machines
◦ Finite degree of coarse-grained parallelism
◦ Load balancing and memory-hierarchy optimization
Dramatically harder on millions of cores
Huge need for new algorithmic paradigms
Massive Multithreading: Cray XMT
128 processors, 500 MHz
128 hardware thread streams per processor; in each cycle a processor issues one ready thread
Deeply pipelined M, A, C functional units
Cache-less, globally shared memory
Efficient hardware synchronization via full/empty bits
Data mapped randomly in 8-byte blocks, no locality
Average memory latency: 600 cycles
Multithreaded: Dataflow
Parallelizing greedy coloring
Goal: given a distributed graph, parallelize greedy coloring such that
◦ speedup is attained
◦ the number of colors used is roughly the same as in serial
This is difficult: greedy is inherently sequential, computation is small relative to communication, and data accesses are irregular
D1 coloring: approaches based on Luby's parallel algorithm for maximal independent sets had very limited success
D2 coloring: no practical parallel algorithms existed
We developed a framework for effective parallelization of greedy coloring on distributed-memory architectures
Using the framework, we designed various specialized parallel algorithms for D1 and D2 coloring
◦ The first MPI implementations to yield speedup
Framework: Distributed-Memory Parallel Greedy Coloring
Exploit features of the initial data distribution
◦ Distinguish between interior and boundary vertices
Proceed in rounds, each having two phases:
◦ Tentative coloring
◦ Conflict detection
The coloring phase is organized in supersteps
◦ A processor communicates only after coloring a subset of its assigned vertices: infrequent, coarse-grained communication
Randomization is used in resolving conflicts
[Figure: two rounds, each with supersteps of coloring followed by communication, ending in conflict detection]
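The control flow, as a compile-only C++ skeleton (a sketch under our assumptions; the real code is MPI-based and shipped with Zoltan, and the communication and detection steps here are elided stand-in stubs):

    // Round/superstep skeleton of the framework (illustrative only).
    #include <algorithm>
    #include <vector>

    using WorkList = std::vector<int>;
    const std::size_t SUPERSTEP = 1000;   // vertices colored between communications

    void color_boundary(WorkList work) {
        while (!work.empty()) {                                // one round
            for (std::size_t i = 0; i < work.size(); i += SUPERSTEP) {
                std::size_t end = std::min(work.size(), i + SUPERSTEP);
                for (std::size_t j = i; j < end; ++j) {
                    // tentatively color work[j] from the current local view
                }
                // superstep boundary: exchange recently assigned colors
            }
            WorkList conflicts;
            // detect conflicts; randomization decides which endpoint recolors
            work.swap(conflicts);                              // next round's work
        }
    }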
Specializations of the framework
◦ Color selection strategies: First Fit; Staggered First Fit
◦ Coloring order: interior before boundary; interior after boundary; interior interleaved with boundary
◦ Local vertex ordering: various degree-based techniques
◦ Supersteps: synchronous; asynchronous
◦ Inter-processor communication: customized; broadcast-based
Implementation and experimentation
Using the framework (JPDC, 2008)
◦ Designed specialized parallel algorithms for distance-1 coloring
◦ Experimentally studied how to tune "parameters" according to the size, density, and distribution of the input graph; the number of processors; and the computational platform
Extending the framework (SISC, under review)
◦ Designed parallel algorithms for D2 and restricted star coloring (to support Hessian computation); a serial distance-2 sketch follows this list
◦ Designed parallel algorithms for D2 coloring of bipartite graphs (to support Jacobian computation)
◦ New challenge: an efficient mechanism for exchanging information between processors hosting D2-neighboring vertices needs to be devised
Software
◦ MPI implementations of D1 and D2 coloring made available in Zoltan
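For reference, here is what distance-2 coloring means in code, as a minimal serial C++ sketch (our own illustration, not the Zoltan code): a vertex's color must differ from that of every vertex within two hops.

    // Greedy distance-2 coloring (illustrative sketch): the colors of
    // neighbors and of neighbors' neighbors are all forbidden.
    #include <vector>

    std::vector<int> greedy_d2_color(const std::vector<std::vector<int>>& adj) {
        const int n = (int)adj.size();
        std::vector<int> color(n, 0), mark(n + 1, -1);
        for (int v = 0; v < n; ++v) {
            for (int u : adj[v]) {
                if (color[u] != 0) mark[color[u]] = v;                // distance 1
                for (int w : adj[u])
                    if (w != v && color[w] != 0) mark[color[w]] = v;  // distance 2
            }
            int c = 1;
            while (mark[c] == v) ++c;
            color[v] = c;
        }
        return color;
    }

On the bipartite graphs used for Jacobians, the same idea runs over the column vertices, with colors forbidden through shared row vertices.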
D1 Coloring, IBM Blue Gene/P
Our Contributions: Parallel Coloring
Global address-space machines
◦ Parallel algorithms for distance-1 and distance-2 coloring of graphs for Jacobian and Hessian computations
◦ Employ randomized graph partitioning and speculative coloring to improve scalability
Distributed-memory machines
◦ Developed a parallelization framework for greedy coloring
◦ Employs graph partitioning, speculative coloring, and optimized communication granularity to achieve scalability on 16,000 processors of the IBM Blue Gene and Cray XT5
◦ Designed specialized parallel algorithms for D1 and D2 coloring
◦ Deployed implementations via the Zoltan toolkit
Coloring Algorithms: Our Contributions
Serial algorithms and software
◦ Jacobian computation via distance-2 coloring algorithms on bipartite graphs
◦ Developed novel algorithms for acyclic, star, distance-k (k = 1, 2) and other coloring problems; developed associated matrix recovery algorithms
◦ Delivered implementations via the software package ColPack (released Oct. 2008)
◦ Interfaced ColPack with the AD tool ADOL-C
Application highlights
◦ Enabled Jacobian computation in simulated moving beds
◦ Enabled Hessian computation in optimizing electric power flow
Parallel algorithms and software
◦ Developed a parallelization framework for greedy coloring
◦ Designed specialized parallel algorithms for D1 and D2 coloring
◦ Deployed implementations via the Zoltan toolkit
Further reading
www.cscapes.org
Gebremedhin, Manne and Pothen. What color is your Jacobian? Graph coloring for computing derivatives. SIAM Review 47(4):627-705, 2005.
Gebremedhin, Tarafdar, Manne and Pothen. New acyclic and star coloring algorithms with applications to computing Hessians. SIAM J. Sci. Comput. 29:1042-1072, 2007.
Gebremedhin, Pothen and Walther. Exploiting sparsity in Jacobian computation via coloring and automatic differentiation: a case study in a simulated moving bed process. AD2008, LNCSE 64:339-349, 2008.
Gebremedhin, Pothen, Tarafdar and Walther. Efficient computation of sparse Hessians using coloring and automatic differentiation. INFORMS Journal on Computing 21:209-223, 2009.
Bozdag, Gebremedhin, Manne, Boman and Catalyurek. A framework for scalable greedy coloring on distributed-memory parallel computers. J. Parallel Distrib. Comput. 68(4):515-535, 2008.
Gebremedhin, Nguyen, Patwary and Pothen. ColPack: Graph coloring software for derivative computation and beyond. Submitted, Oct. 2010.
Applications Also Getting More Complex
Leading-edge scientific applications increasingly include:
◦ Adaptive, unstructured data structures
◦ Complex, multiphysics simulations
◦ Multiscale computations in space and time
◦ Complex synchronizations (e.g., discrete events)
Significant parallelization challenges on today's machines
◦ Finite degree of coarse-grained parallelism
◦ Load balancing and memory-hierarchy optimization
Dramatically harder on millions of cores
Huge need for new algorithmic ideas; CSC will be critical
Architectural Challenges for Graph Algorithms
Runtime is dominated by latency
◦ Particularly true for data-centric applications
◦ Random accesses to a global address space
◦ Perhaps many at once: fine-grained parallelism
Essentially no computation to hide access time
Access pattern is data dependent
◦ Prefetching is unlikely to help
◦ Usually only a small part of each cache line is wanted
Potentially abysmal locality at all levels of the memory hierarchy