Dax: Rethinking Visualization Frameworks for Extreme-Scale Computing
DOECGF 2011, April 28, 2011
Kenneth Moreland, Sandia National Laboratories
SAND P
Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the U.S. Department of Energy's National Nuclear Security Administration under contract DE-AC04-94AL85000.
Serial Visualization Pipeline (diagram: a single pipeline of Contour and Clip filters)
Parallel Visualization Pipeline (diagram: the Contour → Clip pipeline replicated across three distributed-memory processes)
Exascale Projection

              Jaguar – XT5     Exascale*                   Increase
Cores         224,256          100 million – 1 billion     ~1,000×
Concurrency   224,256 way      10 billion way              ~50,000×
Memory        300 Terabytes    128 Petabytes               ~500×

*Source: International Exascale Software Project Roadmap, J. Dongarra, P. Beckman, et al.
Exascale Projection (same table as previous slide)

MPI Only?
– Vis object code + state: 20 MB
– On Jaguar: 20 MB × 200,000 processes = 4 TB
– On Exascale: 20 MB × 10 billion processes = 200 PB !
Exascale Projection (same table as previous slide)

Visualization pipeline too heavyweight?
– On Jaguar: 1 trillion cells → 5 million cells/thread
– On Exascale: 500 trillion cells → 50K cells/thread
Hybrid Parallel Pipeline (diagram: Contour → Clip pipelines replicated with distributed-memory parallelism, with shared-memory parallel processing inside each pipeline)
Threaded Programming is Hard
Example: Marching Cubes
Easy because cubes can be processed in parallel, right?
– How do you resolve coincident points?
– How do you capture topological connections?
– How do you pack the results?
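One common way to handle the coincident-point question, sketched below in illustrative C++ (not the Dax API; all names here are hypothetical), is to key every generated vertex on the grid edge it lies on, then sort and unique those keys so duplicates merge and the connectivity falls out of the lookup:

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Identifies the grid edge a generated vertex lies on (pair of grid point ids).
struct EdgeKey {
  std::uint64_t p0, p1;
  bool operator<(const EdgeKey &o) const {
    return p0 < o.p0 || (p0 == o.p0 && p1 < o.p1);
  }
  bool operator==(const EdgeKey &o) const { return p0 == o.p0 && p1 == o.p1; }
};

// keys: one entry per generated triangle vertex (3 per triangle), produced
// independently by each thread.  Returns the merged (unique) point keys and
// fills connectivity with the merged point id of each original vertex.
std::vector<EdgeKey> MergeCoincidentPoints(const std::vector<EdgeKey> &keys,
                                           std::vector<std::size_t> &connectivity)
{
  std::vector<EdgeKey> points(keys);
  std::sort(points.begin(), points.end());
  points.erase(std::unique(points.begin(), points.end()), points.end());

  connectivity.resize(keys.size());
  for (std::size_t i = 0; i < keys.size(); ++i) {
    connectivity[i] = static_cast<std::size_t>(
        std::lower_bound(points.begin(), points.end(), keys[i]) - points.begin());
  }
  return points;
}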
Multicore ≠ SMP
(Source: ExaScale Software Study: Software Challenges in Extreme Scale Systems)
GPU: Memory Management and Scheduling (diagram: memory hierarchy — host memory, device memory, multiprocessor shared memory and caches, processor registers)
Revisiting the Pipeline (diagram: Filter)
– Lightweight Object
– Serial Execution
– No explicit partitioning
– No access to larger structures
– No state
function ( in, out )
Worklet: function ( in, out )
Iteration Mechanism (diagram: Executive runs "foreach element" and invokes the Worklet as a functor)
Conceptual iteration. Reality: iterations can be scheduled in parallel.
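A minimal sketch, assuming nothing beyond standard C++ and OpenMP (not the Dax API), of the executive/worklet split described above: the worklet is a stateless functor over one element, and the executive owns the foreach and is free to schedule the iterations in parallel.

#include <cstddef>
#include <vector>

// A worklet: function(in, out) over a single element, no state, no access
// to neighboring elements.
struct SquareWorklet {
  void operator()(float in, float &out) const { out = in * in; }
};

// The executive: owns the iteration.  Conceptually a foreach; nothing in the
// worklet prevents these iterations from running in parallel (the OpenMP
// pragma is honored when compiled with OpenMP, otherwise ignored).
template <typename Worklet>
void Execute(const Worklet &worklet,
             const std::vector<float> &in, std::vector<float> &out)
{
  // out is assumed to be pre-sized to in.size() by the caller.
  #pragma omp parallel for
  for (std::ptrdiff_t i = 0; i < static_cast<std::ptrdiff_t>(in.size()); ++i) {
    worklet(in[i], out[i]);
  }
}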
Comparison (diagram): Executive + Worklet, with the "foreach element" loop in the Executive, vs. a Filter that contains its own "foreach element" loop.
Comparison (diagram): Executive with a single "foreach element" loop invoking Worklet 1 then Worklet 2, vs. Filter 1 and Filter 2 each running their own "foreach element" loop.
Dax System Layout (diagram: Executive in the Control Environment, Worklets in the Execution Environment)
Worklet vs. Filter

__worklet__ void CellGradient(...)
{
  daxFloat3 parametric_cell_center = (daxFloat3)(0.5, 0.5, 0.5);
  daxConnectedComponent cell;
  daxGetConnectedComponent(work, in_connections, &cell);
  daxFloat scalars[MAX_CELL_POINTS];
  uint num_elements = daxGetNumberOfElements(&cell);
  daxWork point_work;
  for (uint cc = 0; cc < num_elements; cc++)
  {
    point_work = daxGetWorkForElement(&cell, cc);
    scalars[cc] = daxGetArrayValue(point_work, inputArray);
  }
  daxFloat3 gradient = daxGetCellDerivative(
    &cell, 0, parametric_cell_center, scalars);
  daxSetArrayValue3(work, outputArray, gradient);
}

int vtkCellDerivatives::RequestData(...)
{
  ...[allocate output arrays]...
  ...[validate inputs]...
  for (cellId = 0; cellId < numCells; cellId++)
  {
    ...
    input->GetCell(cellId, cell);
    subId = cell->GetParametricCenter(pcoords);
    inScalars->GetTuples(cell->PointIds, cellScalars);
    scalars = cellScalars->GetPointer(0);
    cell->Derivatives(subId, pcoords, scalars, 1, derivs);
    outGradients->SetTuple(cellId, derivs);
  }
  ...[cleanup]...
}
Execution Types: Map
Example Usage: Vector Magnitude
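For concreteness, a minimal sketch of what a map-type worklet might look like, written in plain C++ rather than the Dax worklet dialect shown earlier (the types and names here are illustrative): each invocation reads one element and writes one element, with no access to neighbors, so the executive can schedule the map in any order or in parallel.

#include <cmath>

struct Vec3 { float x, y, z; };

// function(in, out): vector magnitude as a map worklet
inline void VectorMagnitude(const Vec3 &in, float &out)
{
  out = std::sqrt(in.x * in.x + in.y * in.y + in.z * in.z);
}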
Execution Type: Cell Connectivity
Example Usages: Cell to Point, Normal Generation
Execution Type: Topological Reduce
Example Usages: Cell to Point, Normal Generation
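A hedged sketch of a topological reduce in plain C++: values from all cells incident on a point are combined into one point value (a simple average, as a cell-to-point conversion would use). The cells_using_point list is assumed to be supplied by the executive; none of these names are Dax API.

#include <vector>

// Reduce the values of the cells incident on one point to a single value.
float CellToPointAverage(const std::vector<int> &cells_using_point,
                         const std::vector<float> &cell_values)
{
  if (cells_using_point.empty()) { return 0.0f; }
  float sum = 0.0f;
  for (int cellId : cells_using_point) {
    sum += cell_values[cellId];
  }
  return sum / static_cast<float>(cells_using_point.size());
}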
Execution Types: Generate Geometry
Example Usages: Subdivide, Marching Cubes
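As a sketch of the generate-geometry pattern, the illustrative C++ below subdivides one input triangle into four output triangles at its edge midpoints; the types and the fixed 1-to-4 output are assumptions made for the example, not the Dax interface.

#include <array>

struct Point { float x, y, z; };
struct Triangle { Point p0, p1, p2; };

static Point Mid(const Point &a, const Point &b)
{
  return Point{0.5f * (a.x + b.x), 0.5f * (a.y + b.y), 0.5f * (a.z + b.z)};
}

// function(in, out): one input triangle generates four output triangles.
inline void Subdivide(const Triangle &in, std::array<Triangle, 4> &out)
{
  const Point m01 = Mid(in.p0, in.p1);
  const Point m12 = Mid(in.p1, in.p2);
  const Point m20 = Mid(in.p2, in.p0);
  out[0] = Triangle{in.p0, m01, m20};   // three corner triangles...
  out[1] = Triangle{m01, in.p1, m12};
  out[2] = Triangle{m20, m12, in.p2};
  out[3] = Triangle{m01, m12, m20};     // ...plus the center triangle
}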
Execution Types: Pack
Example Usage: Marching Cubes
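A common way to implement pack, sketched below in illustrative C++ (not the Dax scheduler), is an exclusive prefix sum over the per-element output counts (e.g., 0 to 5 triangles per cell in marching cubes); the scan gives each element its offset into the packed output array, and the scan itself can also be performed in parallel.

#include <cstddef>
#include <vector>

// Exclusive prefix sum of per-element output counts.  offsets[i] is where
// element i writes its outputs in the packed array; total is the packed size.
std::vector<std::size_t> PackOffsets(const std::vector<std::size_t> &counts,
                                     std::size_t &total)
{
  std::vector<std::size_t> offsets(counts.size());
  std::size_t running = 0;
  for (std::size_t i = 0; i < counts.size(); ++i) {
    offsets[i] = running;
    running += counts[i];
  }
  total = running;
  return offsets;
}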
Initial Observations
– Implementing engines is hard, but implementing functors is easy.
– Efficient memory management is the most challenging problem (so far).
– It is often easier to write specialized functors than to chain basic functors.
– GPUs scream…
– …but we have yet to implement all but the obvious map engines on them.
Conclusion
Why now? Why not before?
– Rules of efficiency have changed.
– Concurrency: Coarse → Fine
– Execution cycles become free
– Minimizing DRAM I/O critical
The current approach is unworkable
– The incremental approach is unmanageable
Designing for exascale requires lateral thinking