Presentation transcript:

Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy’s National Nuclear Security Administration under contract DE-AC04-94AL.

Graph Analysis With High-Performance Computing
Jonathan Berry
Scalable Algorithms Department, Sandia National Laboratories
April 18, 2008

Informatics Datasets Are Different. Informatics: the analysis of datasets arising from "information" sources such as the WWW (not physical simulation). Motivating applications: homeland security, computer security (DOE emphasis), biological networks, etc. Primary HPC implication: any partitioning is "bad." "One of the interesting ramifications of the fact that the PageRank calculation converges rapidly is that the web is an expander-like graph" (Page, Brin, Motwani, Winograd 1999). [Figures: web-graph visualizations, from UCSD '08 and Broder et al. '00]

Informatics Algorithms Are Different As Well. "The single largest performance bottleneck in the distributed connected components algorithm is the effect of poor vertex distribution… Several methods… have been implemented but none has been successful as of yet." (D. Gregor, from the Parallel Boost Graph Library documentation on connected components.) Connected Components: find groupings of vertices such that all vertices within a group can reach each other. "[In power law graphs] there is a giant component… of size O(n)" (Aiello, Chung, Lu, 2000). S-T Connectivity: find a short path from vertex S to vertex T. Single-Source Shortest Paths (SSSP): from a given vertex, find the shortest paths to all other vertices.
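The connected-components definition above can be made concrete with a small serial sketch using union-find. This is an illustrative example, not an MTGL kernel; all names here (`UnionFind`, `connected_components`) are hypothetical:

```cpp
#include <cassert>
#include <numeric>
#include <utility>
#include <vector>

// Connected components via union-find: vertices end up grouped so
// that all vertices in a group can reach each other (the definition
// on the slide).
struct UnionFind {
  std::vector<int> parent;
  explicit UnionFind(int n) : parent(n) {
    std::iota(parent.begin(), parent.end(), 0);  // each vertex starts alone
  }
  int find(int x) {
    while (parent[x] != x) x = parent[x] = parent[parent[x]];  // path halving
    return x;
  }
  void unite(int a, int b) { parent[find(a)] = find(b); }
};

// Returns a component label for every vertex; two vertices share a
// label exactly when some path connects them.
std::vector<int> connected_components(
    int nverts, const std::vector<std::pair<int, int>>& edges) {
  UnionFind uf(nverts);
  for (const auto& e : edges) uf.unite(e.first, e.second);
  std::vector<int> label(nverts);
  for (int v = 0; v < nverts; ++v) label[v] = uf.find(v);
  return label;
}
```

On power-law inputs, the giant component quoted above means most vertices end up with the same label, which is exactly what makes vertex distribution hard for distributed implementations.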

Informatics Problems Demand New Architectures

Distributed Memory Architectures | Massively Multithreaded Architectures | Key Issues
Fast CPU (~3 GHz) | Slow CPU (~ MHz) | Power, concurrency
Elaborate memory hierarchy | Almost no memory hierarchy | Is cache justified?
Memory per-processor, partitioned | Global address space | Can you partition?
Operating system for threading, synchronization | Hardware for threading, synchronization | How fine-grained is your data interaction?
Programming paradigm is standardized (MPI) | Programming paradigm is machine-specific (mta-pe) | Portability, debuggability

Multithreaded architectures show promise for informatics problems, but more work is necessary.

We Are Developing The MultiThreaded Graph Library (MTGL). It enables multithreaded graph algorithms, builds upon a community standard (the Boost Graph Library), abstracts data structures and other application specifics, hides some shared-memory issues, and preserves good multithreaded performance. [Figures: MTGL adapter architecture; S-T connectivity and SSSP solve time (sec) vs. number of MTA-2 processors]

"MTGL-ized" Code for S-T Connectivity. "u" is the next vertex out of the queue; "numEdges" is an index; "endVertex" is an array of endpoints; "adjVertices" is a generic MTGL function; "myIter" is a thread-local random-access iterator. The compiler inlines away most of the apparent extra work, and (perhaps) data copying relieves a hot spot (we trap less).
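The variable names on the slide suggest a compressed-sparse-row layout ("numEdges is an index, endVertex is an array of endpoints"). A minimal serial sketch of s-t connectivity over that layout may help; the `CSRGraph` struct and the BFS driver are illustrative assumptions, not actual MTGL code:

```cpp
#include <cassert>
#include <queue>
#include <vector>

// CSR graph: the adjacency of vertex v occupies
// endVertex[numEdges[v] .. numEdges[v+1]-1], mirroring the slide's
// description of numEdges as an index into an endpoint array.
struct CSRGraph {
  std::vector<int> numEdges;   // size |V|+1
  std::vector<int> endVertex;  // size |E|
};

// Serial BFS from s; returns true if t is reachable.
bool st_connected(const CSRGraph& g, int s, int t) {
  if (s == t) return true;
  std::vector<bool> visited(g.numEdges.size() - 1, false);
  std::queue<int> q;
  visited[s] = true;
  q.push(s);
  while (!q.empty()) {
    int u = q.front();  // "u" is the next vertex out of the queue
    q.pop();
    for (int i = g.numEdges[u]; i < g.numEdges[u + 1]; ++i) {
      int v = g.endVertex[i];
      if (v == t) return true;
      if (!visited[v]) {
        visited[v] = true;
        q.push(v);
      }
    }
  }
  return false;
}
```

On the MTA/XMT, the inner loop over each adjacency range is the part the MTGL parallelizes, and a thread-local iterator like the slide's "myIter" keeps threads from contending on shared traversal state.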

Initial Algorithmic Impacts of MTGL on XMT Are Promising. S-T Connectivity: gathering evidence for a 2005 prediction; results shown for ≤ 2 billion edges; even with Threadstorm 2.0 limitations, XMT predictions look good. Connected Components: simple Shiloach-Vishkin (SV) is fast, but hot-spots; the multilevel Kahan algorithm scales (but XMT data is incomplete); no current competitors for large power-law graphs. [Figures: time (s) vs. number of XMT processors for the MTGL Shiloach-Vishkin algorithm, comparing a 32000p Blue Gene/L with 10p MTGL/MTA and 10p MTGL/XMT; time (s) vs. number of edges for MTGL Kahan's algorithm] MTGL on XMT sets the performance standard for informatics problems.

A Recent Comparison With PBGL Finds an Efficiency Gap. The Parallel Boost Graph Library (PBGL) runs the Boost Graph Library on clusters; some graph algorithms can scale on some inputs. PBGL-MTA comparison on SSSP: PBGL SSSP can scale on non-power-law graphs; we compared it to a pre-MTGL C code on the MTA-2 and found one order of magnitude difference in raw speed and one order of magnitude in processor efficiency. [Figure: PBGL vs. MTA SSSP time (s) vs. number of processors] Even when distributed memory approaches scale, massively multithreaded approaches are currently faster and more efficient. [Parallel Processing Letters '07]

Multithreading Means Thinking Differently (e.g., MatVec). Attempt #1: parts of the update had no hot spot and the compiler handled them; other parts produced a hot spot. [Figure: worked matrix-vector multiply example]

Multithreading Case Study: MatVec. Attempt #2. [Figure: the same matrix-vector example reworked to avoid the hot spot]

MatVec Case Study: MTA-2 Results. Simple example: compute the in-degree of each vertex. Dotted line: attempt #1 (hot spot). Solid line: attempt #2 (more memory, but no hot spot). Instance: 2^25 vertices, 2^28 directed edges, "R-MAT" graph.
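The two attempts can be sketched portably with `std::thread` and `std::atomic` standing in for the MTA's hardware synchronization; this is an illustrative reconstruction (function names are hypothetical), not the MTA code behind the plot:

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <utility>
#include <vector>

// Attempt #1: every thread increments a shared per-vertex counter,
// the moral equivalent of the MTA's int_fetch_add. High-degree
// vertices become contended hot spots.
std::vector<int> indegree_hotspot(const std::vector<std::pair<int, int>>& edges,
                                  int nverts, int nthreads) {
  std::vector<std::atomic<int>> deg(nverts);
  for (auto& d : deg) d.store(0);
  std::vector<std::thread> pool;
  for (int t = 0; t < nthreads; ++t)
    pool.emplace_back([&, t] {
      for (std::size_t e = t; e < edges.size(); e += nthreads)
        deg[edges[e].second].fetch_add(1);  // contended on hub vertices
    });
  for (auto& th : pool) th.join();
  std::vector<int> out(nverts);
  for (int v = 0; v < nverts; ++v) out[v] = deg[v].load();
  return out;
}

// Attempt #2: each thread accumulates into a private copy, then the
// copies are reduced. More memory, but no contended location.
std::vector<int> indegree_private(const std::vector<std::pair<int, int>>& edges,
                                  int nverts, int nthreads) {
  std::vector<std::vector<int>> local(nthreads, std::vector<int>(nverts, 0));
  std::vector<std::thread> pool;
  for (int t = 0; t < nthreads; ++t)
    pool.emplace_back([&, t] {
      for (std::size_t e = t; e < edges.size(); e += nthreads)
        ++local[t][edges[e].second];  // private counter, no hot spot
    });
  for (auto& th : pool) th.join();
  std::vector<int> out(nverts, 0);
  for (int t = 0; t < nthreads; ++t)
    for (int v = 0; v < nverts; ++v) out[v] += local[t][v];
  return out;
}
```

The trade the plot illustrates is exactly this one: attempt #2 spends O(threads × vertices) extra memory to eliminate the serialized updates that flatten attempt #1's scaling on power-law degree distributions.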

The MTGL And Integration. Algorithms/architectures/visualization integration: Sandia architects profiled early MTGL to predict performance on the XMT; the Titan open-source visualization framework uses MTGL; Qthreads/MTGL integration is underway.

Qthreads. A lightweight, scalable threading API: efficient memory synchronization primitives and explicit "loop parallelization." A development platform for future multithreaded architectures: POSIX Threads implementation; Sandia Structural Simulation Toolkit support for emerging-architecture analysis; an MTA port is planned. Develop code today, efficient execution tomorrow.

MTGL Code Example

template <typename T>
T mt_incr(T& target, T inc) {
#ifdef __MTA__
  T res = int_fetch_add(&(target), inc);
#elif USING_QTHREADS
  T res = qthread_incr((aligned_t*)&target, inc) - inc;
#else
  T res = target;
  target += inc;
#endif
  return res;
}

template <typename T>
T mt_readff(T& toread) {
#ifdef __MTA__
  return readff(&(toread));
#elif USING_QTHREADS
  T ret;
  qthread_readFF(qthread_self(), &ret, &toread);
  return ret;
#else
  return toread;
#endif
}

Qthreads / MTGL Scalability

Community Detection in the MTGL Community detection in large networks is an active field We have an algorithm that iteratively groups vertices with representatives A subroutine of our algorithm is Unconstrained Facility Location We’ve ported an open-source code for a Lagrangian-based subgradient algorithm (“Volume”) to the MTGL In serial, it’s as fast as the current fastest algorithm for community detection – and we’re working on MTA-2 scaling

Unconstrained Facility Location. Servers S, customers C. Cost to open server i is f(i). Cost for server i to serve customer j is c_ij. Indicator variables: facility opening y_i; service x_ij.
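The slide's equations did not survive transcription; the standard integer-programming formulation implied by these definitions is (a reconstruction, not copied from the slide):

```latex
\min \sum_{i \in S} f(i)\, y_i \;+\; \sum_{i \in S}\sum_{j \in C} c_{ij}\, x_{ij}
\quad \text{subject to} \quad
\sum_{i \in S} x_{ij} = 1 \;\; \forall j \in C, \qquad
x_{ij} \le y_i \;\; \forall i \in S,\, j \in C, \qquad
x_{ij},\, y_i \in \{0, 1\}.
```

The first constraint family ("every customer must be served") is the one lifted into the objective on the following slides.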

Lagrangian Relaxation for Facility Location Problems. What is the biggest challenge in solving this formulation well? "Every customer must be served."

Lagrangian Relaxation for UFL. Solution strategy: lift those tough constraints out of the constraint matrix into the objective (e.g., Avella, Sassano, Vasil'ev, 2003, for p-median). Good news: the remaining problem is easy to solve! Bad news: some of the original constraints might not be satisfied.

Lifting the Service Constraints. An example violated constraint (customer j not fully served): Σ_i x_ij < 1. For the j-th service constraint, λ_j is a Lagrange multiplier; λ_j weights the cost of violating the j-th service constraint. New objective: minimize Σ_i f(i) y_i + Σ_i Σ_j c_ij x_ij + Σ_j λ_j (1 − Σ_i x_ij).

Solving the Relaxed Problem. New problem: set y_i = 1 for locations i with a negative opening value f(i) + Σ_j min(0, c_ij − λ_j). Set x_ij = 1 if y_i = 1 and c_ij − λ_j < 0; x_ij = 0 otherwise. This gives a valid lower bound on the best UFL cost. Linear space and time.
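The recipe above fits in a few lines of code. A serial sketch of the bound evaluation for fixed multipliers (an illustrative reconstruction, not the MTGL/"Volume" code; `ufl_lower_bound` is a hypothetical name):

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Evaluate the relaxed UFL problem for fixed multipliers lambda:
// open each server i whose opening value
//   f(i) + sum_j min(0, c[i][j] - lambda[j])
// is negative, serving j from open i whenever c[i][j] - lambda[j] < 0.
// Returns L(lambda) = sum_j lambda[j] + sum of negative opening
// values, a valid lower bound on the optimal UFL cost.
double ufl_lower_bound(const std::vector<double>& f,
                       const std::vector<std::vector<double>>& c,
                       const std::vector<double>& lambda) {
  const std::size_t nservers = f.size(), ncustomers = lambda.size();
  double bound = 0.0;
  for (std::size_t j = 0; j < ncustomers; ++j) bound += lambda[j];
  for (std::size_t i = 0; i < nservers; ++i) {
    double open_value = f[i];  // cost of opening server i...
    for (std::size_t j = 0; j < ncustomers; ++j)
      open_value += std::min(0.0, c[i][j] - lambda[j]);  // ...offset by profitable service
    if (open_value < 0.0) bound += open_value;  // y_i = 1 only when it helps
  }
  return bound;  // one pass over the (i, j) pairs: linear space and time
}
```

A subgradient method such as "Volume" repeatedly calls an evaluation like this, adjusting λ toward multipliers whose bound is tightest.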

MTA/XMT Programming: Use the Compiler. Here, we sum the reduced costs of the neighbors of one vertex. The removal of the reduction of "sum" prevents a "hot spot." [Figure: annotated loop output from "canal," an MTA/XMT compiler analysis tool]

An Early Attempt to Scale on the MTA-2. Instance: 262K vertices, 2M edges, with a pseudo-inverse-power-law degree distribution: 1 node of degree 262K, 64 of degree ~4000, 256K of degree ~5. Preliminary MTA-2 implementation without profiling/tuning. Linux serial time: 573s.

Number of MTA-2 processors | Running time
 1                         | 1812s
10                         |  325s
20                         |  302s
40                         |  312s

In serial, runtimes are comparable to "CNM," the O(n log^2 n) community finder of Clauset, Newman, and Moore, but good MTA scaling is expected.

Current MTGL Algorithms
Connected components (psearch, visit_edges, visit_adj)
Strongly-connected components (psearch)
Maximal independent set (visit_edges)
Typed subgraph isomorphism (psearch, visit_edges)
S-t connectivity (bfs)
Single-source shortest paths (psearch)
Betweenness centrality (bfs-like)
Community detection (all kernels)
Connection subgraphs (bfs, sparse matrix, mt-quicksort)
Find triangles (psearch)
Find assortativity (psearch)
Find modularity (psearch)
PageRank (matvec)
Network simplex for MaxFlow
Under development: motif detection, and more
Berkeley open-source license pending

Acknowledgements
MultiThreading background: Simon Kahan (Google, formerly Cray); Petr Konecny (Google, formerly Cray).
MultiThreading/distributed-memory comparisons: Bruce Hendrickson (1415); Douglas Gregor (Indiana U.); Andrew Lumsdaine (Indiana U.).
MTGL algorithm design and development: Vitus Leung (1415); Kamesh Madduri (Georgia Tech); William McLendon (1423); Cynthia Phillips (1412).
MTGL integration: Brian Wylie (1424); Kyle Wheeler (Notre Dame); Brian Barrett (1422).