Slide 1: Graph Analysis With High-Performance Computing
Jonathan Berry, Scalable Algorithms Department, Sandia National Laboratories
April 18, 2008
Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy's National Nuclear Security Administration under contract DE-AC04-94AL85000.
Slide 2: Informatics Datasets Are Different
Informatics: the analysis of datasets arising from "information" sources such as the WWW (not physical simulation).
Motivating applications:
– Homeland security
– Computer security (DOE emphasis)
– Biological networks, etc.
Primary HPC implication: any partitioning is "bad."
"One of the interesting ramifications of the fact that the PageRank calculation converges rapidly is that the web is an expander-like graph." (Page, Brin, Motwani, Winograd 1999)
[Figures: web-graph images from UCSD '08 and Broder, et al. '00]
Slide 3: Informatics Algorithms Are Different As Well
Connected components: find groupings of vertices such that all vertices within a group can reach each other.
"[in power law graphs] there is a giant component…of size O(n)" (Aiello, Chung, Lu 2000)
S-T connectivity: find a short path from vertex S to vertex T.
Single-source shortest paths (SSSP): from a given vertex, find the shortest paths to all other vertices.
"The single largest performance bottleneck in the distributed connected components algorithm is the effect of poor vertex distribution…Several methods…have been implemented but none has been successful as of yet." (D. Gregor, from the Parallel Boost Graph Library documentation on connected components)
Slide 4: Informatics Problems Demand New Architectures

Distributed Memory Architectures          | Massively Multithreaded Architectures          | Key Issues
------------------------------------------|------------------------------------------------|--------------------------------------------
Fast CPU (~3 GHz)                         | Slow CPU (~200-500 MHz)                        | Power, concurrency
Elaborate memory hierarchy                | Almost no memory hierarchy                     | Is cache justified?
Memory per-processor, partitioned         | Global address space                           | Can you partition?
Operating system for threading, synchronization | Hardware for threading, synchronization  | How fine-grained is your data interaction?
Programming paradigm is standardized (MPI)| Programming paradigm is machine-specific (mta-pe) | Portability, debuggability

Multithreaded architectures show promise for informatics problems, but more work is necessary.
Slide 5: We Are Developing The MultiThreaded Graph Library
– Enables multithreaded graph algorithms
– Builds upon a community standard (the Boost Graph Library)
– Abstracts data structures and other application specifics
– Hides some shared-memory issues
– Preserves good multithreaded performance
[Figures: MTGL adapter diagram; S-T connectivity and SSSP scaling on the MTA-2, solve time (sec) vs. MTA-2 processors]
Slide 6: "MTGL-ized" Code for S-T Connectivity
– "u" is the next vertex out of the queue
– "numEdges" is an index
– "endVertex" is an array of endpoints
– "adjVertices" is a generic MTGL function
– "myIter" is a thread-local random-access iterator
The compiler inlines away most of the apparent extra work, and (perhaps) the data copying relieves a hot spot (we trap less). A sketch of the pattern appears below.
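The slide's source code is not reproduced in this capture, so the following is a minimal, portable sketch of the pattern it describes, not the actual MTGL code. The graph representation, the std::atomic counter, and the name expand_frontier are assumptions; u, numEdges, endVertex, adjVertices, and myIter follow the slide.

    // Minimal sketch (assumed, not MTGL source): expanding one queue vertex.
    #include <atomic>
    #include <vector>

    using vertex_t = int;

    // "adj" stands in for the generic MTGL adjVertices() accessor.
    // endVertex must be pre-sized to hold all recorded endpoints.
    void expand_frontier(vertex_t u,                                  // next vertex out of the queue
                         const std::vector<std::vector<vertex_t>>& adj,
                         std::vector<vertex_t>& endVertex,            // array of endpoints
                         std::atomic<int>& numEdges)                  // shared index
    {
        // Copy the adjacency pointer into a thread-local random-access
        // iterator ("myIter"); the copy can relieve a hot spot.
        const vertex_t* myIter = adj[u].data();
        int deg = static_cast<int>(adj[u].size());
        for (int i = 0; i < deg; ++i) {
            int pos = numEdges.fetch_add(1);   // atomically reserve a slot
            endVertex[pos] = myIter[i];        // record the endpoint
        }
    }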
Slide 7: Initial Algorithmic Impacts of MTGL on XMT Are Promising
S-T connectivity:
– Gathering evidence for a 2005 prediction
– This plot shows results for ≤ 2 billion edges
– Even with Threadstorm 2.0 limitations, XMT predictions look good
Connected components:
– Simple Shiloach-Vishkin (SV) is fast, but hot-spots
– The multilevel Kahan algorithm scales (but XMT data are incomplete)
– No current competitors for large power-law graphs
[Figures: MTGL Shiloach-Vishkin algorithm, time (s) vs. # XMT processors, against a 32,000-processor Blue Gene/L run and MTGL/MTA on 10 processors; MTGL Kahan's algorithm, time (s) vs. # edges, MTGL/XMT on 10 processors]
MTGL on XMT sets the performance standard for informatics problems.
Slide 8: A Recent Comparison With PBGL Finds an Efficiency Gap
Parallel Boost Graph Library (PBGL):
– Runs the Boost Graph Library on clusters
– Some graph algorithms can scale on some inputs
PBGL-MTA comparison on SSSP:
– PBGL SSSP can scale on non-power-law graphs
– We compared against a pre-MTGL C code on the MTA-2
– One order of magnitude difference in raw speed
– One order of magnitude difference in processor efficiency
[Figure: PBGL SSSP vs. MTA SSSP, time (s) vs. # processors]
Even when distributed memory approaches scale, massively multithreaded approaches are currently faster and more efficient. [Parallel Processing Letters '07]
Slide 9: Multithreading Means Thinking Differently (e.g., MatVec)
[Figure, Attempt #1: a four-vertex example graph and its matrix-vector product (entries 1/2, 1/3, 1, …; result 11/6, 1/4, 1/3, 7/12), with one access pattern annotated "No hot spot; compiler handled" and another "Hot spot"]
Slide 10: Multithreading Case Study: MatVec
[Figure, Attempt #2: the same four-vertex example recomputed through intermediate per-edge values (1/3, 1/4, 1/2, …), assembling the result (11/6, 1/4, 1/3, 7/12) without a hot spot]
Slide 11: MatVec Case Study: MTA-2 Results
Simple example: compute the in-degree of each vertex.
– Dotted line: attempt #1 (hot spot)
– Solid line: attempt #2 (more memory, but no hot spot)
Instance:
– 2^25 vertices
– 2^28 directed edges
– "R-MAT" graph
A sketch of the two attempts follows.
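The slides show only the plots, so below is one plausible reconstruction of the two in-degree computations in portable C++, an assumption rather than the original MTA-2 code; indegree_attempt1, indegree_attempt2, and the edge-list layout are invented for illustration.

    #include <atomic>
    #include <cstddef>
    #include <utility>
    #include <vector>

    using edge_t = std::pair<int, int>;  // (source, destination)

    // Attempt #1: every edge atomically increments the counter of its
    // destination. On a power-law graph the counters of the few
    // high-degree vertices become memory hot spots.
    void indegree_attempt1(const std::vector<edge_t>& edges,
                           std::vector<std::atomic<int>>& deg)
    {
        for (const edge_t& e : edges)      // parallel loop on the MTA
            deg[e.second].fetch_add(1);
    }

    // Attempt #2: trade memory for contention. Bucket the edges by
    // destination (shown serially here), then let each vertex read its
    // own bucket -- no shared word is contended in the parallel pass.
    void indegree_attempt2(const std::vector<edge_t>& edges,
                           std::vector<int>& deg)
    {
        std::vector<std::vector<int>> buckets(deg.size());
        for (const edge_t& e : edges)
            buckets[e.second].push_back(e.first);
        for (std::size_t v = 0; v < deg.size(); ++v)  // parallel, contention-free
            deg[v] = static_cast<int>(buckets[v].size());
    }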
Slide 12: The MTGL And Integration
Algorithms/architectures/visualization integration:
– Sandia architects profiled early MTGL to predict performance on the XMT
– The Titan open-source visualization framework uses the MTGL
– Qthreads/MTGL
Slide 13: Qthreads
A lightweight, scalable threading API:
– Efficient memory-synchronization primitives
– Explicit "loop parallelization"
A development platform for future multithreaded architectures:
– POSIX Threads implementation
– Sandia Structural Simulation Toolkit support, for emerging-architecture analysis
– MTA port planned
Develop code today, efficient execution tomorrow.
Slide 14: MTGL Code Example

    template <typename T>
    T mt_incr(T& target, T inc)
    {
    #ifdef __MTA__
        T res = int_fetch_add(&(target), inc);
    #elif USING_QTHREADS
        T res = qthread_incr((aligned_t*)&target, inc) - inc;
    #else
        T res = target;
        target += inc;
    #endif
        return res;
    }

    template <typename T>
    T mt_readff(T& toread)
    {
    #ifdef __MTA__
        return readff(&(toread));
    #elif USING_QTHREADS
        T ret;
        qthread_readFF(qthread_self(), &ret, &toread);
        return ret;
    #else
        return toread;
    #endif
    }
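These wrappers select the platform's synchronization primitive at compile time: int_fetch_add on the MTA/XMT, qthread_incr and qthread_readFF under Qthreads, and plain serial code otherwise. As a hypothetical usage sketch (the buffer and filtering loop below are not from the slides), mt_incr is how code can claim unique slots in a shared array:

    // Hypothetical usage (not from the slides): iterations of a parallel
    // loop claim unique slots in a shared output buffer via mt_incr.
    void record_hits(const int* candidates, int n, int* buffer)
    {
        int next_free = 0;  // shared index; mt_incr makes increments atomic
        for (int k = 0; k < n; ++k) {                 // parallelized on the MTA/XMT
            if (candidates[k] >= 0) {
                int my_slot = mt_incr(next_free, 1);  // returns the value before the add
                buffer[my_slot] = candidates[k];      // unique slot per hit
            }
        }
    }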
Slide 15: Qthreads / MTGL Scalability
[Figure: Qthreads/MTGL scalability results]
Slide 16: Community Detection in the MTGL
– Community detection in large networks is an active field.
– We have an algorithm that iteratively groups vertices with representatives.
– A subroutine of our algorithm is unconstrained facility location.
– We've ported an open-source code for a Lagrangian-based subgradient algorithm ("Volume") to the MTGL.
– In serial it's as fast as the current fastest algorithm for community detection, and we're working on MTA-2 scaling.
Slide 17: Unconstrained Facility Location
– Servers $S$, customers $C$
– Cost to open server $i$ is $f(i)$
– Cost for server $i$ to serve customer $j$ is $c_{ij}$
– Indicator variables: facility opening $y_i$, service $x_{ij}$
The standard formulation is sketched below.
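The formulation itself appeared as an image on the slide; the following is a reconstruction of the standard UFL integer program from the definitions above, not a copy of the slide:

    \begin{aligned}
    \min\; & \sum_{i \in S} f(i)\, y_i + \sum_{i \in S} \sum_{j \in C} c_{ij}\, x_{ij} \\
    \text{s.t.}\; & \sum_{i \in S} x_{ij} \ge 1 && \forall j \in C \quad \text{(every customer is served)} \\
    & x_{ij} \le y_i && \forall i \in S,\ j \in C \\
    & x_{ij},\, y_i \in \{0,1\}.
    \end{aligned}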
Slide 18: Lagrangian Relaxation for Facility Location Problems
What is the biggest challenge in solving this formulation well? The constraint that every customer must be served.
Slide 19: Lagrangian Relaxation for UFL
Solution strategy: lift those tough constraints out of the constraint matrix and into the objective (e.g., Avella, Sassano, Vasil'ev 2003, for p-median).
– Good news: the remaining problem is easy to solve!
– Bad news: some of the original constraints might not be satisfied.
Slide 20: Lifting the Service Constraints
An example violated constraint (customer $j$ not fully served): $\sum_{i \in S} x_{ij} < 1$.
For the $j$th service constraint, $\lambda_j$ is a Lagrange multiplier.
The multiplier $\lambda_j$ weights the cost of violating the $j$th service constraint.
New objective: reconstructed below.
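The new objective also appeared as an image; here is a reconstruction consistent with the formulation on slide 17 (the slide's exact sign conventions may have differed):

    \min\; \sum_{i \in S} f(i)\, y_i
         + \sum_{i \in S} \sum_{j \in C} c_{ij}\, x_{ij}
         + \sum_{j \in C} \lambda_j \Bigl(1 - \sum_{i \in S} x_{ij}\Bigr)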
Slide 21: Solving the Relaxed Problem
New problem: minimize the relaxed objective above, with the service constraints gone.
– Set $y_i = 1$ for locations $i$ with negative reduced cost.
– Set $x_{ij} = 1$ if $y_i = 1$ and $c_{ij} - \lambda_j < 0$; set $x_{ij} = 0$ otherwise.
This gives a valid lower bound on the best UFL cost, in linear space and time. The reduced cost is spelled out below.
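For concreteness, the reduced cost of opening location $i$ in the relaxed problem can be written as follows; the symbol $\rho$ is introduced here for illustration, as the slide's own notation did not survive extraction:

    \rho(i) = f(i) + \sum_{j \in C} \min\bigl(0,\; c_{ij} - \lambda_j\bigr)

Opening exactly the locations with $\rho(i) < 0$, and serving $j$ from an open $i$ whenever $c_{ij} - \lambda_j < 0$, minimizes the relaxed objective term by term, which is why its optimal value is a valid lower bound on the UFL optimum.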
Slide 22: MTA/XMT Programming: Use the Compiler
– Here we sum the reduced costs of the neighbors of one vertex.
– Removing the reduction on "sum" prevents a hot spot.
– This output is from "canal," an MTA/XMT compiler-analysis tool.
A sketch of the loop in question follows.
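The canal listing itself is not reproduced in this capture; below is a hedged sketch of the kind of loop the slide analyzes. The names neighbor_cost_sum, neighbors, and reduced_cost are assumptions, not the slide's identifiers.

    // Assumed form of the analyzed loop (not the slide's listing): sum the
    // reduced costs of one vertex's neighbors. How the accumulation into
    // "sum" is compiled determines whether a memory hot spot appears.
    double neighbor_cost_sum(const int* neighbors, int deg,
                             const double* reduced_cost)
    {
        double sum = 0.0;
        for (int k = 0; k < deg; ++k)   // compiler parallelizes; "sum" is a reduction
            sum += reduced_cost[neighbors[k]];
        return sum;
    }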
Slide 23: An Early Attempt to Scale on the MTA-2
Instance: 262K vertices, 2M edges.
Pseudo-inverse-power-law degree distribution:
– 1 node of degree 262K, 64 of degree ~4000, 256K of degree ~5
Preliminary MTA-2 implementation, without profiling/tuning.
Linux serial time: 573 s.

Number of MTA-2 processors | Running time
 1                         | 1812 s
10                         |  325 s
20                         |  302 s
40                         |  312 s

In serial, runtimes are comparable to "CNM," the O(n log^2 n) community finder of Clauset, Newman, and Moore, but good MTA scaling is expected.
Slide 24: Current MTGL Algorithms
– Connected components (psearch, visit_edges, visit_adj)
– Strongly connected components (psearch)
– Maximal independent set (visit_edges)
– Typed subgraph isomorphism (psearch, visit_edges)
– S-T connectivity (bfs)
– Single-source shortest paths (psearch)
– Betweenness centrality (bfs-like)
– Community detection (all kernels)
– Connection subgraphs (bfs, sparse matrix, mt-quicksort)
– Find triangles (psearch)
– Find assortativity (psearch)
– Find modularity (psearch)
– PageRank (matvec)
– Network simplex for max-flow
Under development: motif detection and more.
Berkeley open-source license pending.
Slide 25: Acknowledgements
Multithreading background:
– Simon Kahan (Google, formerly Cray)
– Petr Konecny (Google, formerly Cray)
Multithreading/distributed-memory comparisons:
– Bruce Hendrickson (1415)
– Douglas Gregor (Indiana U.)
– Andrew Lumsdaine (Indiana U.)
MTGL algorithm design and development:
– Vitus Leung (1415)
– Kamesh Madduri (Georgia Tech.)
– William McLendon (1423)
– Cynthia Phillips (1412)
MTGL integration:
– Brian Wylie (1424)
– Kyle Wheeler (Notre Dame)
– Brian Barrett (1422)