EAVL: Extreme-scale Analysis and Visualization Library
Jeremy Meredith
SDAV Next-Gen Library Meeting, September 2012


History
Originally ORNL LDRD
  – Jeremy Meredith, Sean Ahern, Dave Pugmire
  – plus Rob Sisneros joined as a postdoc
Many hours sitting in conference rooms arguing over things like “what does it mean to have one of your dimensions be unstructured?”
  – then determine what to do that’s practical without falling off the data modeling deep end...
Exascale focus

Approaching the Exascale Problems
Update the traditional data model to handle modern simulation codes and a wider range of data.
Investigate how an updated data and execution model can achieve the necessary computational, I/O, and memory efficiency.
Explore methods for visualization algorithm developers to achieve these efficiency gains and better support exascale architectures.

DATA MODELING CHALLENGES

A Traditional Data Set Model
[Diagram: a Data Set is one of three kinds]
  Rectilinear: Dimensions, 3D Axis Coordinates, Cell Fields, Point Fields
  Structured: Dimensions, 3D Point Coordinates, Cell Fields, Point Fields
  Unstructured: Connectivity, 3D Point Coordinates, Cell Fields, Point Fields

Challenge: Non-Physical Data Analysis
Graph Data
  – topologically 0D vertices, 1D edges
  – non-spatial; storing X/Y/Z values is wasted space
Pure Parameter Studies
  – e.g. reaction rate of combustion
      FOUR “spatial” dimensions
  – e.g. methane concentration vs oxygen concentration vs temperature vs pressure
      more complex reaction → higher dimensionality
[Figure: axes labeled oxygen, methane, temperature, pressure]

Challenge: Molecular Data (e.g., LAMMPS, VASP)
To represent using vtkPolyData or vtkUnstructuredGrid:
  – VTK_VERTEX cells for the atoms
  – VTK_LINE cells for the bonds
Any field data must exist on both element types
  – Not only inefficient: dummy bond strengths on the atoms? dummy atomic numbers on the bonds?
  – But also incorrect: e.g. average(BondStrength) uses dummy values from atoms?
[Figure: molecule of H and C atoms; labels BondStr 1, 2 and AtomicNum 6, 1]

Challenge: Side Sets (e.g. Exodus, flux surfaces)
The flow from A to B is defined on a set of faces
The flux variable is defined only on those faces
  – do you combine them into a single mesh?
      waste space on dummy values, potentially introducing errors
  – or create a separate mesh and lose the mapping info?
      horribly expensive and error-prone to recalculate mapping
[Figure: regions A and B; the flux surface lives inside the volumetric mesh]

Challenge: Dimensionality, Refinement (e.g. GenASiS)
(a) seven (or eight) dimensional mesh
  – f(x,y,z,ϴ,ϕ,λ,F)=E, plus time
(b) refinement occurs on a per-cell basis
  – can’t assume per-block refinement
  – sometimes referred to as “unstructured AMR”

Challenge: Unique Mesh Topologies (e.g. MADNESS)
MADNESS does not have a traditional mesh
  – Just a quad-tree with polynomial coefficients
  – Up to 30 refinement levels / tree depth
[Figure: spatial structure and internal tree representation, each with its root marked]

Challenge: Very High Order Fields (e.g. MADNESS)
Legendre polynomial series at each tree node
  – Each tree node has K^dim coefficients
  – K can be up to approx. 20
      i.e. 400 coeffs per tree node in 2D, 8000 in 3D
[Figure: example with K=3, dim=2]

THE EAVL DATA MODEL

A Traditional Data Set Model (again)
[Diagram: a Data Set is one of three kinds]
  Rectilinear: Dimensions, 3D Axis Coordinates, Cell Fields, Point Fields
  Structured: Dimensions, 3D Point Coordinates, Cell Fields, Point Fields
  Unstructured: Connectivity, 3D Point Coordinates, Cell Fields, Point Fields

The EAVL Data Set Model
[Diagram: a Data Set holds Cells[], Points[], and Fields[]]
  CellSet kinds: Explicit, Structured, QuadTree, Subset
  CellSet contents shown: Connectivity, Dimensions, Tree, CellList
  Coords: FieldName, Component
  Field: Name, Association, Values

Example: An Unstructured Grid (with interleaved coordinates)
eavlDataSet: Cells[1], Points[1], Fields[1]
  eavlExplicitCellSet
    Connectivity: (a bunch of cells)
  eavlCoordinates
    FieldName: “c” “c” “c”
    Component:
  eavlField: Name: “c”, Association: Points, Values[3*npts]

Example: An Unstructured Grid (with separated coordinates)
eavlDataSet: Cells[1], Points[1], Fields[3]
  eavlExplicitCellSet
    Connectivity: (a bunch of cells)
  eavlCoordinates
    FieldName: “x” “y” “z”
    Component:
  eavlField #0: Name: “x”, Association: Points, Values[npts]
  eavlField #1: Name: “y”, Association: Points, Values[npts]
  eavlField #2: Name: “z”, Association: Points, Values[npts]

Example: A Curvilinear Grid
eavlDataSet: Cells[1], Points[1], Fields[3]
  eavlStructuredCellSet
    RegularStructure:
  eavlCoordinates
    FieldName: “x” “y” “z”
    Component:
  eavlField #0: Name: “x”, Association: Points, Values[npts]
  eavlField #1: Name: “y”, Association: Points, Values[npts]
  eavlField #2: Name: “z”, Association: Points, Values[npts]

Example: A Rectilinear Grid
eavlDataSet: Cells[1], Points[1], Fields[3]
  eavlStructuredCellSet
    RegularStructure:
  eavlCoordinates
    FieldName: “x” “y” “z”
    Component:
  eavlField #0: Name: “x”, Association: LogicalDim0, Values[ni]
  eavlField #1: Name: “y”, Association: LogicalDim1, Values[nj]
  eavlField #2: Name: “z”, Association: LogicalDim2, Values[nk]
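The rectilinear layout above stores one short array per logical dimension instead of one coordinate triple per point. The following is a minimal sketch of how a point's position could be recovered from such arrays. It is not EAVL code: `RectilinearCoords` and `pointCoords` are hypothetical names, and the point ordering (dimension 0 varying fastest) is an assumption.

```cpp
#include <array>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for three fields associated with logical dimensions.
struct RectilinearCoords {
    std::vector<float> x; // ni values, associated with logical dimension 0
    std::vector<float> y; // nj values, associated with logical dimension 1
    std::vector<float> z; // nk values, associated with logical dimension 2
};

// Recover the (x, y, z) position of a point from its flat index.
// Assumption: index = i + ni * (j + nj * k), i.e. dimension 0 varies fastest.
std::array<float, 3> pointCoords(const RectilinearCoords &c, std::size_t pointIndex)
{
    const std::size_t ni = c.x.size();
    const std::size_t nj = c.y.size();
    const std::size_t i = pointIndex % ni;
    const std::size_t j = (pointIndex / ni) % nj;
    const std::size_t k = pointIndex / (ni * nj);
    std::array<float, 3> p = {{c.x[i], c.y[j], c.z[k]}};
    return p;
}
```

Under these assumptions the coordinates occupy ni + nj + nk floats rather than 3 * ni * nj * nk.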

Example: High-Dimensional Grid
eavlDataSet: Cells[1], Points[1], Fields[5]
  eavlStructuredCellSet
    RegularStructure:
  eavlCoordinates
    FieldName: “x” “y” “z” “μ” “ϴ”
    Component:
  eavlField #0: Name: “x”, Association: LogicalDim0, Values[ni]
  eavlField #1: Name: “y”, Association: LogicalDim1, Values[nj]
  eavlField #2: Name: “z”, Association: LogicalDim2, Values[nk]
  eavlField #3: Name: “μ”, Association: LogicalDim3, Values[nμ]
  eavlField #4: Name: “ϴ”, Association: LogicalDim4, Values[nϴ]

Example: Geospatial Data
eavlDataSet: Cells[1], Points[2], Fields[3]
  eavlStructuredCellSet
    RegularStructure:
  eavlCoordinates
    FieldName: “lat” “lon”
    Component: 0 0
  eavlCoordinates
    FieldName: “c” “c” “c”
    Component:
  eavlField #0: Name: “lat”, Association: LogicalDim0, Values[ni]
  eavlField #1: Name: “lon”, Association: LogicalDim1, Values[nj]
  eavlField #2: Name: “c”, Association: Points, Values[3*npts]

Example: Molecular Data
eavlDataSet: Cells[2], Points[1], Fields[3]
  eavlExplicitCellSet #0
    Connectivity: the atoms
  eavlExplicitCellSet #1
    Connectivity: the bonds
  eavlCoordinates
    FieldName: “c” “c” “c”
    Component:
  eavlField #0: Name: “c”, Association: Points, Values[3*npts]
  eavlField #1: Name: “atomic number”, Association: Cell Set #0, Values[ncells #0]
  eavlField #2: Name: “bond strength”, Association: Cell Set #1, Values[ncells #1]
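Associating a field with one specific cell set is what removes the dummy values from the molecular-data challenge. The sketch below illustrates the idea only. `CellField` and `average` are hypothetical simplifications, not EAVL classes.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical, simplified field: values plus the cell set they live on.
struct CellField {
    std::string name;
    std::size_t cellSet;       // index of the cell set the values belong to
    std::vector<float> values; // exactly one value per cell of that cell set
};

// Average over the cells the field is defined on.
// Because no other cell set contributes entries, no dummy values are mixed in.
float average(const CellField &f)
{
    if (f.values.empty())
        return 0.0f;
    float sum = 0.0f;
    for (float v : f.values)
        sum += v;
    return sum / static_cast<float>(f.values.size());
}
```

With bond strengths stored only on the bond cell set, average(BondStrength) sees bonds and nothing else.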

Example: Face-centered Data
eavlDataSet: Cells[2], Points[1], Fields[2]
  eavlExplicitCellSet
    Connectivity: volumetric
  eavlAllFacesOfExplicit
    Parent: ( )
  eavlCoordinates
    FieldName: “c” “c” “c”
    Component:
  eavlField #0: Name: “c”, Association: Points, Values[3*npts]
  eavlField #1: Name: “facevariable”, Association: Cell Set #2, Values[nfaces]

FILTERING IN EAVL

Data flow networks in EAVL (or not)
A “Filter” is a stage in a data flow network
  – Creates a new data set from an old one
Many operations do not change a mesh structure (assuming data model is sufficiently descriptive)
  – Arithmetic expressions: only modifies fields
  – External facelist: points and structure remain
  – Feature edges: just a new cell set with old points
  – Smooth, displace, elevate: only modify coordinates
So: eavlMutator is an alternative to eavlFilter
  – Modifies a data set in-place

eavlMutator
In-place data set modification
Support for destructive in-place operation
  – free memory as you go
Execute multiple mutators simultaneously on the same data set (barring conflicts)
  – e.g. displace (coords) + threshold (cells) concurrently
How about data flow network support?
  – encapsulate an eavlMutator through an eavlFilterFromMutator facade
Of course, some operations are natively eavlFilters
  – can facade through eavlMutatorFromFilter (?)

Example: Thresholding an RGrid (a)
Explicit cells can be combined with structured coordinates.
Before:
  eavlStructuredCellSet
    RegularStructure:
  eavlCoordinates
    FieldName: “x” “y” “z”
    Component:
  eavlField #0: Name: “x”, Association: LogicalDim0, Values[ni]
  eavlField #1: Name: “y”, Association: LogicalDim1, Values[nj]
  eavlField #2: Name: “z”, Association: LogicalDim2, Values[nk]
After:
  eavlExplicitCellSet
    Connectivity: (a bunch of cells)
  eavlCoordinates and eavlField #0 to #2: unchanged

Example: Thresholding an RGrid (b)
A second Cell Set can be added which refers to the first one
Before:
  eavlStructuredCellSet
    RegularStructure:
  eavlCoordinates
    FieldName: “x” “y” “z”
    Component:
  eavlField #0: Name: “x”, Association: LogicalDim0, Values[ni]
  eavlField #1: Name: “y”, Association: LogicalDim1, Values[nj]
  eavlField #2: Name: “z”, Association: LogicalDim2, Values[nk]
After:
  eavlStructuredCellSet
    RegularStructure:
  eavlSubset
    Cells: (…)
    Parent: ( )
  eavlCoordinates and eavlField #0 to #2: unchanged

Example: Structured External Facelist
Add six new subset-cell sets to original mesh
Before:
  eavlStructuredCellSet
    RegularStructure:
  eavlCoordinates
    FieldName: “x” “y” “z”
    Component:
  eavlField #0: Name: “x”, Association: Points, Values[npts]
  eavlField #1: Name: “y”, Association: Points, Values[npts]
  eavlField #2: Name: “z”, Association: Points, Values[npts]
After:
  eavlStructuredCellSet
    RegularStructure:
  eavlStructSubset x6 (each 30x30 or 30x40)
  eavlCoordinates and eavlField #0 to #2: unchanged

Example: Elevating a Structured Grid
No problem-sized data modifications.
  – Interleaved and separated coordinates can be used simultaneously.
Before:
  eavlStructuredCellSet
    RegularStructure:
  eavlCoordinates
    FieldName: “c” “c”
    Component: 0 1
  eavlField #0: Name: “c”, Association: Points, Value[2*npts]
  eavlField #1: Name: “val”, Association: Points, Values[npts]
After:
  eavlStructuredCellSet
    RegularStructure:
  eavlCoordinates
    FieldName: “c” “c” “val”
    Component:
  eavlField #0 and #1: unchanged

Example: Elevating a Regular Grid
No problem-sized data modifications.
  – Some axes on logical dims, with others on the points.
Before:
  eavlStructuredCellSet
    RegularStructure:
  eavlCoordinates
    FieldName: “x” “y”
    Component: 0 0
  eavlField #0: Name: “x”, Association: LogicalDim0, Values[ni]
  eavlField #1: Name: “y”, Association: LogicalDim1, Values[nj]
  eavlField #2: Name: “val”, Association: Points, Values[npts]
After:
  eavlStructuredCellSet
    RegularStructure:
  eavlCoordinates
    FieldName: “x” “y” “val”
    Component:
  eavlField #0 to #2: unchanged

DEALING WITH CONCURRENCY

Concurrency at Multiple Levels
Distributed Parallelism
  – Message passing still works well
  – Avoid global communication
      local domain interconnectivity information
  – Hybrid (e.g. spatiotemporal) parallelism
Task Parallelism
  – Fine-grain dependency tracking
      e.g. displace (coords) + threshold (cells) concurrently
  – eavlMutator helps
  – single eavlDataSet container class helps
Thread Parallelism
  – Fine-grain data parallelism; CUDA, OpenMP

Data Parallelism for Developers
Functor + iterator paradigm
Iteration patterns for mesh topologies
CUDA + OpenMP execution back-ends

A Simple Data-Parallel Operation

Internal Library API Provides This:

    void CellToCellDivide(Field &a, Field &b, Field &c)
    {
        for_each(i)
            c[i] = a[i] / b[i];
    }

Algorithm Developer Writes This:

    void CalculateDensity(...)
    {
        //...
        CellToCellDivide(mass, volume, density);
    }

Functor + Iterator Approach

Algorithm Developer Writes This:

    void CalculateDensity(...)
    {
        //...
        CellToCellBinaryOp(mass, volume, density, Divide());
    }

Internal Library API Provides This:

    template <class T>
    void CellToCellBinaryOp(Field &a, Field &b, Field &c, T &f)
    {
        for_each(i)
            f(a[i],b[i],c[i]);
    }

    struct Divide
    {
        void operator()(float &a, float &b, float &c)
        {
            c = a / b;
        }
    };

Custom Functor

Algorithm Developer Writes These:

    void CalculateDensity(...)
    {
        //...
        CellToCellBinaryOp(mass, volume, density, MyFunctor());
    }

    struct MyFunctor
    {
        void operator()(float &a, float &b, float &c)
        {
            c = a + 2*log(b);
        }
    };

Internal Library API Provides This:

    template <class T>
    void CellToCellBinaryOp(Field &a, Field &b, Field &c, T &f)
    {
        for_each(i)
            f(a[i],b[i],c[i]);
    }
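The slide code is pseudocode (`for_each(i)` is not defined). A minimal compilable sketch of the same functor + iterator split might look like the following. It assumes a field is a flat `std::vector<float>`, runs serially, and takes the functor by value so that a temporary such as `Divide()` can be passed; none of this is the actual EAVL interface.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

using Field = std::vector<float>; // assumption: a field is a flat float array

// Library side: one iteration pattern, reused for every binary functor.
// Assumes b has at least as many elements as a.
template <class F>
void CellToCellBinaryOp(const Field &a, const Field &b, Field &c, F f)
{
    c.resize(a.size());
    for (std::size_t i = 0; i < a.size(); ++i)
        f(a[i], b[i], c[i]);
}

// Developer side: only the per-element work.
struct Divide {
    void operator()(const float &a, const float &b, float &c) const { c = a / b; }
};

struct MyFunctor {
    void operator()(const float &a, const float &b, float &c) const
    {
        c = a + 2 * std::log(b);
    }
};
```

Swapping `Divide()` for `MyFunctor()` at the call site changes the arithmetic without touching the loop, which is the point of the pattern.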

Functor Efficiency on CPU and GPU
Data: noise.silo
Surface normal
[Chart not preserved in transcript]

Binding Values to Functors

    struct ScaleByConst
    {
        float scale;
        ScaleByConst(float s) : scale(s) { }
        void operator()(float &a, float &b)
        {
            b = a * scale;
        }
    };

    void CalculateDensity(...)
    {
        //...
        cell_volume = mesh_volume / mesh_numcells;
        CellToCellUnaryOp(mass, density, ScaleByConst(1.0/cell_volume));
    }

DATA PARALLELISM BASICS

Map with 1 input, 1 output
Simplest data-parallel operation. Each result item can be calculated from its corresponding input item alone.

    struct f
    {
        float operator()(float x) { return x*2; }
    };

[Figure: x → f → result]

Map with 2 inputs, 1 output
With two input arrays, the functor takes two inputs. You can also have multiple outputs.

    struct f
    {
        float operator()(float a, float b) { return a+b; }
    };

[Figure: x, y → f → result]
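As a serial reference, both map shapes correspond directly to `std::transform` from the C++ standard library. The function names `mapDouble` and `mapAdd` are illustrative only; a data-parallel back-end would run the same per-element functor across threads.

```cpp
#include <algorithm>
#include <vector>

// Map with 1 input: result[i] depends only on x[i].
std::vector<float> mapDouble(const std::vector<float> &x)
{
    std::vector<float> result(x.size());
    std::transform(x.begin(), x.end(), result.begin(),
                   [](float v) { return v * 2; });
    return result;
}

// Map with 2 inputs: result[i] depends only on x[i] and y[i].
// Assumes y has at least as many elements as x.
std::vector<float> mapAdd(const std::vector<float> &x, const std::vector<float> &y)
{
    std::vector<float> result(x.size());
    std::transform(x.begin(), x.end(), y.begin(), result.begin(),
                   [](float a, float b) { return a + b; });
    return result;
}
```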

Scatter with 1 input (and thus 1 output)
Possibly inefficient, risks of race conditions and uninitialized results. (Can also scatter to larger array if desired.) Often used in a scatter_if-type construct.
No functor.
[Figure: x, indices → result]

Gather with 1 input (and thus 1 output)
Unlike scatter, no risk of uninitialized data or race condition. Plus, parallelization is over a shorter indices array, and caching helps more, so can be more efficient.
No functor.
[Figure: x, indices → result]
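The difference between the two access patterns is easiest to see side by side. This is a serial sketch with hypothetical function names; the `fill` parameter is an addition here to make the "uninitialized results" risk of scatter visible.

```cpp
#include <cstddef>
#include <vector>

// Gather: result[i] = x[indices[i]].
// Every output element is written exactly once, so there is nothing to race on.
std::vector<float> gather(const std::vector<float> &x, const std::vector<int> &indices)
{
    std::vector<float> result(indices.size());
    for (std::size_t i = 0; i < indices.size(); ++i)
        result[i] = x[indices[i]];
    return result;
}

// Scatter: result[indices[i]] = x[i].
// Output slots that no index names keep the fill value, and duplicate
// indices would make a parallel version race on the same slot.
std::vector<float> scatter(const std::vector<float> &x, const std::vector<int> &indices,
                           std::size_t outputSize, float fill)
{
    std::vector<float> result(outputSize, fill);
    for (std::size_t i = 0; i < indices.size(); ++i)
        result[indices[i]] = x[i];
    return result;
}
```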

Reduction with 1 input (and thus 1 output)
Example: max-reduction. Sum is also common. Often a fat-tree-based implementation.

    struct f
    {
        float operator()(float a, float b) { return a>b ? a : b; }
    };

[Figure: x → f → result]
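A serial reference for the max-reduction can be written with `std::accumulate`. The names `Max` and `maxReduce` are illustrative. A parallel version would combine pairs in a tree instead, which gives the same answer because the operator is associative.

```cpp
#include <numeric>
#include <vector>

// Binary combining functor, same shape as the one on the slide.
struct Max {
    float operator()(float a, float b) const { return a > b ? a : b; }
};

// Serial max-reduction; assumes x is non-empty.
float maxReduce(const std::vector<float> &x)
{
    return std::accumulate(x.begin() + 1, x.end(), x.front(), Max());
}
```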

Inclusive Prefix Sum (a.k.a. Scan) with 1 input/output
Value at result[i] is sum of values x[0]..x[i]. Surprisingly efficient parallel implementation. Basis for many more complex algorithms.
No functor.
[Figure: x → result]

Exclusive Prefix Sum (a.k.a. Scan) with 1 input/output
Initialize with zero, value is sum of only up to x[i-1]. May be more commonly used than inclusive scan.
No functor.
[Figure: x → result]
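The two scans differ only in whether x[i] itself is included in result[i]. A serial sketch with illustrative function names follows; it shows the definitions only, not the parallel algorithm.

```cpp
#include <cstddef>
#include <numeric>
#include <vector>

// Inclusive scan: result[i] = x[0] + ... + x[i].
std::vector<int> inclusiveScan(const std::vector<int> &x)
{
    std::vector<int> result(x.size());
    std::partial_sum(x.begin(), x.end(), result.begin());
    return result;
}

// Exclusive scan: result[0] = 0, result[i] = x[0] + ... + x[i-1].
std::vector<int> exclusiveScan(const std::vector<int> &x)
{
    std::vector<int> result(x.size());
    int running = 0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        result[i] = running; // sum of everything before x[i]
        running += x[i];
    }
    return result;
}
```

For x = {3, 1, 4, 1}, the inclusive scan is {3, 4, 8, 9} and the exclusive scan is {0, 3, 4, 8}.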

DATA PARALLELISM ON MESHES

Example: Surface Normal
For each 2D cell (i.e. each polygon):
  – Get three adjacent points
  – Pair-wise vector subtract
  – Cross product
Data-parallel:
  – Repeat for all cells

Example: Surface Normal
INPUT:
  – 3-dimensional coordinates array on the mesh NODES
  – example: length = 9
OUTPUT:
  – 3-component surface normals array on the mesh CELLS
  – example: length = 4
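The 9-node, 4-cell example can be reproduced with a small host-only sketch. The mesh layout (a 3x3 grid of nodes in the z=0 plane forming 2x2 quads) is an assumption chosen to match those lengths, and `Mesh`, `faceNormal`, `cellNormals`, and `makeGridMesh` are hypothetical names, not EAVL types.

```cpp
#include <array>
#include <cstddef>
#include <vector>

struct Mesh {
    std::vector<float> x, y, z;            // node coordinates
    std::vector<std::array<int, 4>> cells; // node ids of each quad
};

// Per-cell work: two adjacent edge vectors, then their cross product.
std::array<float, 3> faceNormal(const float x[], const float y[], const float z[])
{
    const float ax = x[1] - x[0], ay = y[1] - y[0], az = z[1] - z[0];
    const float bx = x[2] - x[1], by = y[2] - y[1], bz = z[2] - z[1];
    std::array<float, 3> n = {{ay * bz - az * by, az * bx - ax * bz, ax * by - ay * bx}};
    return n;
}

// Node-to-cell pattern: gather each cell's node values, then call the per-cell work.
std::vector<std::array<float, 3>> cellNormals(const Mesh &m)
{
    std::vector<std::array<float, 3>> normals(m.cells.size());
    for (std::size_t c = 0; c < m.cells.size(); ++c) {
        float x[4], y[4], z[4];
        for (int n = 0; n < 4; ++n) {
            const int id = m.cells[c][n];
            x[n] = m.x[id];
            y[n] = m.y[id];
            z[n] = m.z[id];
        }
        normals[c] = faceNormal(x, y, z);
    }
    return normals;
}

// Build the assumed example mesh: 3x3 nodes, 2x2 quads, all in the z=0 plane.
Mesh makeGridMesh()
{
    Mesh m;
    for (int j = 0; j < 3; ++j)
        for (int i = 0; i < 3; ++i) {
            m.x.push_back(static_cast<float>(i));
            m.y.push_back(static_cast<float>(j));
            m.z.push_back(0.0f);
        }
    for (int j = 0; j < 2; ++j)
        for (int i = 0; i < 2; ++i) {
            const int n0 = j * 3 + i;
            // counter-clockwise node order, so the normal points along +z
            std::array<int, 4> quad = {{n0, n0 + 1, n0 + 4, n0 + 3}};
            m.cells.push_back(quad);
        }
    return m;
}
```

The outer loop over cells has no dependencies between iterations, which is what makes the operation data-parallel.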

Under the Covers: Node-to-Cell on CPU

    void NodeToCellOp3::ExecuteCPU()
    {
    #pragma omp parallel for
        for (int i=0; i<input->NumCells(); i++)
        {
            // get cell node indices
            int nNodes, nodeIds[8];
            float nodeValues[3][8];
            conn.GetCellNodes(i, nNodes, nodeIds);

            // get coordinates for nodes
            for (int n=0; n<nNodes; n++)
            {
                nodeValues[0][n] = array0[nodeIds[n]];
                nodeValues[1][n] = array1[nodeIds[n]];
                nodeValues[2][n] = array2[nodeIds[n]];
            }

            // call functor
            functor(nodeValues[0], nodeValues[1], nodeValues[2],
                    &out0[i], &out1[i], &out2[i]);
        }
    }

Under the Covers: Node-to-Cell on GPU

    void NodeToCellOp3::ExecuteGPU()
    {
        float *d_arr0 = (float*)array0->GetCUDAArray();
        float *d_arr1 = (float*)array1->GetCUDAArray();
        float *d_arr2 = (float*)array2->GetCUDAArray();
        float *d_out0 = (float*)output0->GetCUDAArray();
        // (d_out1, d_out2 likewise)

        // calculate CUDA thread grid num blocks & threads
        // based on input->NumCells()

        NodeToCellKernel3<<< ... >>>(d_arr0, d_arr1, d_arr2,
                                     d_out0, d_out1, d_out2,
                                     conn, functor);
    }

Under the Covers: The CUDA Kernel

    template <class F>
    __global__ void
    NodeToCellKernel3(float *array0, float *array1, float *array2,
                      float *out0, float *out1, float *out2,
                      ExplicitConnectivity conn,
                      F functor)
    {
        const int index = blockIdx.x * blockDim.x + threadIdx.x;

        // get cell node indices
        int nNodes, nodeIds[8];
        float nodeValues[3][8];
        conn.GetCellNodes(index, nNodes, nodeIds);

        // get coordinates for nodes
        for (int i=0; i<nNodes; i++)
        {
            nodeValues[0][i] = array0[nodeIds[i]];
            nodeValues[1][i] = array1[nodeIds[i]];
            nodeValues[2][i] = array2[nodeIds[i]];
        }

        // call functor
        functor(nodeValues[0], nodeValues[1], nodeValues[2],
                &out0[index], &out1[index], &out2[index]);
    }

WRITING ALGORITHMS IN EAVL

Face Surface Normal Functor+Iterator

    struct PolyNormalFunctor
    {
        void operator()(int nvals, int shapetype,
                        float x[], float y[], float z[],
                        float *nx, float *ny, float *nz)
        {
            // get two adjacent edge vectors
            float ax = x[1]-x[0], ay = y[1]-y[0], az = z[1]-z[0];
            float bx = x[2]-x[1], by = y[2]-y[1], bz = z[2]-z[1];

            // calculate their cross product
            *nx = ay*bz - az*by;
            *ny = az*bx - ax*bz;
            *nz = ax*by - ay*bx;
        }
    };

    void CalculateFaceNormals(...)
    {
        eavlExecutor::AddOperation(
            new eavlTopologyMap_3_3(inputcells, EAVL_NODES_OF_CELLS,
                                    xcoord, ycoord, zcoord,
                                    xnormal, ynormal, znormal,
                                    PolyNormalFunctor()));
    }

Making it Easy for Developers
Single functor code works on either CPU or GPU
Iteration patterns have multiple back-ends
CPU vs GPU execution choice is made at runtime
EAVL array class automatically supports heterogeneous memory spaces for GPUs
Internal APIs support CUDA if developers want to write their own CUDA kernels

Algorithms are Sequences of Operations
(a) calculate face surface normals with node-to-cell operation
(b) average to point normals via cell-to-node operation

EXAMPLE: THRESHOLD

Starting Mesh
We want to threshold a mesh based on its density values (shown here)
[Figure: mesh with per-cell density values]
If we threshold 35 < density < 45, we want this result:
[Figure: thresholded mesh]

Which Cells to Include?
Evaluate a Map operation with this functor:

    struct InRange
    {
        float lo, hi;
        InRange(float l, float h) : lo(l), hi(h) { }
        int operator()(float x) { return x>lo && x<hi; }
    };

[Figure: density → InRange() → inrange]

How Many Cells in Output?
Evaluate a Reduce operation using the Add<> functor. We can use this to create output cell length arrays
[Figure: inrange → plus → result: 6]

Where Do the Output Cells Go?
How do we create this mapping?
[Figure: input indices and output indices, with each output cell linked to its input cell]

Create Input-to-Output Indexing?
Exclusive Scan (exclusive prefix sum) gives us the output index positions
[Figure: inrange → startidx]

Create Output-to-Input Indexing?
We want to work in the shorter output-length arrays and use gathers. A specialized scatter in EAVL creates this reverse index
[Figure: revindex]

Gather Input Mesh Arrays to Output?
We can now use simple gathers to pull input arrays (density, pressure) into the output mesh
[Figure: density gathered through revindex into output_density]
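The threshold steps above (map, reduce, exclusive scan, scatter to a reverse index, gather) can be chained in a short serial sketch. The array names `inrange`, `startidx`, and `revindex` follow the slides; the function names are hypothetical, and the scatter here is a plain conditional scatter standing in for the specialized EAVL operation, whose details the slides do not give.

```cpp
#include <cstddef>
#include <vector>

struct InRange {
    float lo, hi;
    InRange(float l, float h) : lo(l), hi(h) {}
    int operator()(float x) const { return x > lo && x < hi; }
};

// Returns, for each output cell, the index of the input cell it came from.
std::vector<int> buildReverseIndex(const std::vector<float> &density, InRange pred)
{
    const std::size_t n = density.size();

    // 1. Map: flag the cells that pass the threshold.
    std::vector<int> inrange(n);
    for (std::size_t i = 0; i < n; ++i)
        inrange[i] = pred(density[i]);

    // 2. Reduce (sum): number of output cells.
    int count = 0;
    for (std::size_t i = 0; i < n; ++i)
        count += inrange[i];

    // 3. Exclusive scan: output position of each passing input cell.
    std::vector<int> startidx(n);
    int running = 0;
    for (std::size_t i = 0; i < n; ++i) {
        startidx[i] = running;
        running += inrange[i];
    }

    // 4. Conditional scatter: invert the mapping into an output-length array.
    std::vector<int> revindex(count);
    for (std::size_t i = 0; i < n; ++i)
        if (inrange[i])
            revindex[startidx[i]] = static_cast<int>(i);
    return revindex;
}

// 5. Gather: pull any input cell array into the output mesh.
std::vector<float> gatherField(const std::vector<float> &field,
                               const std::vector<int> &revindex)
{
    std::vector<float> out(revindex.size());
    for (std::size_t i = 0; i < revindex.size(); ++i)
        out[i] = field[revindex[i]];
    return out;
}
```

Once `revindex` exists, every additional cell field (pressure, for instance) costs only one more gather.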

PLANS AND STATUS

Plans and Status
Now Open Source (BSD “2-clause”)
  – Project website at
  – Source code, docs at
Data Model
  – Most infrastructure in place
  – Still a few areas not fully fleshed out
  – The client-facing API needs work
Execution Model
  – Currently imperative style
  – Exploring pipeline paths, contract/metadata needs
Productization Status
  – More iterators
  – More optimization
  – Lots of internal cleanup