VTK-m: Building a Visualization Toolkit for Massively Threaded Architectures
Ultrascale Visualization Workshop
Kenneth Moreland, Sandia National Laboratories
November 16, 2015

Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-AC04-94AL85000.
SAND NO C
Extreme Scale: Threads, Threads, Threads!

A clear trend in supercomputing is ever increasing parallelism
Clock increases are long gone
  “The Free Lunch Is Over” (Herb Sutter)

              Jaguar – XT5     Titan – XK7               Exascale*
Cores         224,256          299,008 and 18,688 gpu    1 billion
Concurrency   224,256 way      70 – 500 million way      10 – 100 billion way
Memory        300 Terabytes    700 Terabytes             128 Petabytes

*Source: Scientific Discovery at the Exascale, Ahern, Shoshani, Ma, et al.
My new computer's got the clocks, it rocks
But it was obsolete before I opened the box
  − “Weird” Al Yankovic, It’s All About the Pentiums, circa 1999

Moore’s Law is dead.
  − Gordon Moore, circa 2005
Amdahl vs. Gustafson-Barsis

Amdahl’s Law
  Any algorithm has data dependencies that make some fraction of the software inherently serial. Parallelism is ultimately limited by this serial fraction. See also Span Law.

Gustafson-Barsis Law
  Increasing the amount of data can potentially increase the number of independent operations and allow an algorithm to increase parallelism indefinitely.
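The contrast between the two laws can be made concrete with their usual textbook formulas. The sketch below is illustrative only: the function names and the 5% serial fraction in the usage note are assumptions, not taken from the slides.

```cpp
#include <cassert>
#include <cmath>

// Amdahl's Law (fixed problem size): a fraction s of the work is serial,
// so speedup on N processors is 1 / (s + (1 - s) / N) and can never exceed 1 / s.
// Hypothetical helper name, for illustration only.
double amdahlSpeedup(double serialFraction, double processors)
{
  return 1.0 / (serialFraction + (1.0 - serialFraction) / processors);
}

// Gustafson-Barsis Law (problem size grows with N): s is the serial fraction
// of the run time measured on the parallel machine, and the scaled speedup is
// s + (1 - s) * N, which keeps growing as processors are added.
// Hypothetical helper name, for illustration only.
double gustafsonSpeedup(double serialFraction, double processors)
{
  return serialFraction + (1.0 - serialFraction) * processors;
}
```

With an assumed serial fraction of 0.05, amdahlSpeedup stays below 20 no matter how many processors are used, while gustafsonSpeedup(0.05, 1000) evaluates to 950.05 because the parallel part of the workload scales with the machine.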
AMD x86
  Full x86 Core + Associated Cache
  8 cores per die
  MPI-Only feasible

NVIDIA GPU
  2,880 cores collected in 15 SMX
  Shared PC, Cache, Mem Fetches
  Reduced control logic
  MPI-Only not feasible

[Figure labels: 1mm; 1 x86 core; 1 Kepler core]
[Figure labels: Inter-Node Parallelism; Intra-Node Parallelism]
Example Algorithm: Contours
Total: 11
How Many Architectures to Support?

GPU (NVIDIA)
  Sub-architectures: Fermi, Kepler, Maxwell
  Multiple Memory Types: Global, shared, constant, texture
  Memory Amount: Up to 12 GB
  1000s of threads
  Grids, blocks, and warps

CPU/MIC
  Multiple ISAs
  Vector unit widths: 2, 4, 8 / 16
  Single Memory Type
    Except when not (cache, HSM)
  Larger Memory Size
  Up to 60/260 threads
  No explicit organization
Performance Portability
[Diagram labels: A, B, C, D, E, F; Algorithm; Architecture]

Performance Portability
[Diagram labels: A, B, C, D, E, F; Algorithm; Backend; VTK-m]
VTK-m Framework

Execution Environment
  Cell Operations
  Field Operations
  Basic Math
  Make Cells

Control Environment
  Grid Topology
  Array Handle
  Invoke

Device Adapter
  Allocate
  Transfer
  Schedule
  Sort
  …

Worklet
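One way to picture how a worklet, an invoke step, and a device adapter's Schedule operation fit together is the standalone sketch below. It is not VTK-m code and does not use the VTK-m API: Square, SerialDevice, ThreadedDevice, and MapField are all hypothetical names, and std::thread stands in for a real massively threaded backend.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical "worklet": a small stateless functor applied to one element,
// with no dependence on any other element.
struct Square
{
  float operator()(float x) const { return x * x; }
};

// Hypothetical serial device adapter: Schedule visits every index in order.
struct SerialDevice
{
  template <typename Functor>
  static void Schedule(Functor f, std::size_t n)
  {
    for (std::size_t i = 0; i < n; ++i) { f(i); }
  }
};

// Hypothetical threaded device adapter: Schedule splits the index range into
// contiguous chunks and runs each chunk on its own std::thread.
struct ThreadedDevice
{
  template <typename Functor>
  static void Schedule(Functor f, std::size_t n)
  {
    unsigned hw = std::thread::hardware_concurrency();
    std::size_t numThreads = (hw == 0) ? 2 : hw;
    std::size_t chunk = (n + numThreads - 1) / numThreads;
    std::vector<std::thread> pool;
    for (std::size_t begin = 0; begin < n; begin += chunk)
    {
      std::size_t end = (begin + chunk < n) ? begin + chunk : n;
      pool.emplace_back([f, begin, end]() {
        for (std::size_t i = begin; i < end; ++i) { f(i); }
      });
    }
    for (std::thread& t : pool) { t.join(); }
  }
};

// Control-side "invoke": written once and parameterized on the device.
// Each index writes only its own output slot, so no locking is needed.
template <typename Device, typename Worklet>
std::vector<float> MapField(const std::vector<float>& input, Worklet worklet)
{
  std::vector<float> output(input.size());
  const float* in = input.data();
  float* out = output.data();
  Device::Schedule([=](std::size_t i) { out[i] = worklet(in[i]); }, input.size());
  return output;
}
```

Calling MapField<SerialDevice>(values, Square()) and MapField<ThreadedDevice>(values, Square()) on the same input gives the same output; only the scheduling differs, which is the property that lets an algorithm be written once and retargeted by swapping the device.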
CUDA SDK: 561 Lines
PISTON: 505 Lines
VTK-m: 283 Lines
[Chart titles: Contour Times; Surface Simplification Times]
VTK-m is separate from VTK
VTK-m is not a replacement for VTK
[Diagram labels: Reader; Filter; Rendering; Algorithm; Simulation]
Acknowledgements

This material is based upon work supported by the U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research, under Award Numbers , , and

Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-AC04-94AL

Lots of credit goes out to all our collaborators: Chris Sewell, Jeremy Meredith, David Pugmire, Berk Geveci, Robert Maynard, Hank Childs, and many others.