LLNL-PRES-600932 This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under contract DE-AC52-07NA27344. Lawrence Livermore National Security, LLC PyHPC 2012 Workshop Cyrus Harrison, Lawrence Livermore National Laboratory Paul Navrátil, Texas Advanced Computing Center, Univ of Texas at Austin Maysam Moussalem, Department of Computer Science, Univ of Texas at Austin Ming Jiang, Lawrence Livermore National Laboratory Hank Childs, Lawrence Berkeley National Laboratory Friday Nov 16, 2012
2 Motivation, System Architecture, Framework Components, Execution Strategies, Evaluation Methodology, Evaluation Results
4 A Python-fueled HPC research success story.
Our goal: start to address the uncertainty around future HPC hardware architectures and programming models.
This work: explores moving a key visualization and analysis capability to many-core architectures.
Why Python? Productivity plus powerful tools (PLY, NumPy, PyOpenCL).
Motivation
5 Creating new fields from existing fields in simulation data: a critical component of scientific visualization and analysis tool suites.
Example Expressions:
Motivation
6 Derived field generators are present in many post-processing tools: ParaView, VisIt, etc.
They include three key components:
- A set of primitives that can be used to create derived quantities.
- An interface that allows users to compose these primitives.
- A mechanism that transforms and executes the composed primitives.
Ongoing issues:
- Lack of flexibility to exploit many-core architectures.
- Inefficiency in executing composed primitives.
Motivation
7 Unique Contributions:
1) First-ever implementation targeting many-core architectures.
2) A flexible Python infrastructure that enables the design and testing of a wide range of execution strategies.
3) An evaluation exploring the tradeoffs between runtime performance and memory constraints.
Motivation
9 Framework Components
Host Application Interface: our framework is designed to work in situ with codes that provide a NumPy interface to mesh data fields. Ndarrays are used as the input/output data interface.
PLY-based Parser Front-end: transforms user expressions into a dataflow specification.
Dataflow Network Module: coordinates OpenCL execution using PyOpenCL; designed to support multiple execution strategies.
System Architecture
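The parser front-end's job can be illustrated without PLY. This is a minimal hand-rolled tokenizer and recursive-descent parser, a sketch rather than the framework's actual PLY grammar, that turns an expression like sqrt(x*x+y*y+z*z) into a flat dataflow specification: one (op, inputs, output) node per primitive, in dependency order.

```python
import re

def parse_to_dataflow(expr):
    """Parse names, calls, '+' and '*' into a flat dataflow spec:
    a list of (op, input_names, output_name) tuples in dependency order."""
    tokens = re.findall(r"[A-Za-z_]\w*|[()+*,]", expr.replace(" ", ""))
    state = {"pos": 0, "n": 0}
    nodes = []

    def peek():
        return tokens[state["pos"]] if state["pos"] < len(tokens) else None

    def take():
        tok = peek()
        state["pos"] += 1
        return tok

    def emit(op, inputs):
        state["n"] += 1
        out = f"f{state['n']}"          # auto-named intermediate result
        nodes.append((op, inputs, out))
        return out

    def atom():
        tok = take()
        if tok == "(":
            val = add_expr()
            take()                      # consume ')'
            return val
        if peek() == "(":               # function call, e.g. sqrt(...)
            take()                      # consume '('
            args = [add_expr()]
            while peek() == ",":
                take()
                args.append(add_expr())
            take()                      # consume ')'
            return emit(tok, args)
        return tok                      # plain variable name

    def mul_expr():
        left = atom()
        while peek() == "*":
            take()
            left = emit("mult", [left, atom()])
        return left

    def add_expr():
        left = mul_expr()
        while peek() == "+":
            take()
            left = emit("add", [left, mul_expr()])
        return left

    add_expr()
    return nodes
```

For "sqrt(x*x+y*y+z*z)" this yields six nodes, ending with a sqrt over the summed squares, which is exactly the network the later slides draw.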
10 [Architecture diagram: user expressions from the host application are processed by the PLY-based expression parser and handed to the Python dataflow network, whose execution strategies drive the OpenCL target device(s) through PyOpenCL.]
System Architecture
11 Basic Features
- Simple “create and connect” API for network definition; the API used by the parser front-end is also usable directly by humans.
- Execution is decoupled from network definition and traversal: a topological sort is used to ensure precedence, and results are managed by a reference-counting registry.
- A straightforward filter API is used to implement derived field primitives.
- Network structure can be visualized using Graphviz.
System Architecture
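The features above can be sketched in a few dozen lines: a hypothetical Network class (not the framework's actual API) with a "create and connect" add() method, Kahn's-algorithm topological sort for precedence, and a reference-counting pass that frees each intermediate result as soon as nothing downstream needs it.

```python
from collections import defaultdict, deque

class Network:
    """Sketch of a 'create and connect' dataflow API with topological execution."""
    def __init__(self):
        self.filters = {}   # name -> callable implementing the filter
        self.inputs = {}    # name -> list of upstream result names

    def add(self, name, func, inputs=()):
        self.filters[name] = func
        self.inputs[name] = list(inputs)
        return name

    def topo_order(self):
        """Kahn's algorithm: a node is emitted only after all of its inputs."""
        indegree = {n: len(self.inputs[n]) for n in self.filters}
        downstream = defaultdict(list)
        for n, ins in self.inputs.items():
            for i in ins:
                downstream[i].append(n)
        ready = deque(n for n, d in indegree.items() if d == 0)
        order = []
        while ready:
            n = ready.popleft()
            order.append(n)
            for m in downstream[n]:
                indegree[m] -= 1
                if indegree[m] == 0:
                    ready.append(m)
        return order

    def execute(self):
        results = {}
        # Reference count = how many downstream filters consume each result.
        refcount = {n: sum(ins.count(n) for ins in self.inputs.values())
                    for n in self.filters}
        for n in self.topo_order():
            results[n] = self.filters[n](*[results[i] for i in self.inputs[n]])
            for i in self.inputs[n]:
                refcount[i] -= 1
                if refcount[i] == 0:
                    del results[i]   # release an intermediate once unused
        return results[n]            # last node in topological order
```

Usage: net.add("x", lambda: 3.0); net.add("sq", lambda a: a*a, ["x"]); net.add("out", lambda a: a + 1.0, ["sq"]); net.execute() then evaluates the chain in precedence order.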
12 OpenCL Environment
Built using PyOpenCL.
Records and categorizes OpenCL timing events:
- Host-to-device transfers (inputs)
- Kernel executions
- Device-to-host transfers (results)
Manages OpenCL device buffers:
- Tracks allocated device buffers, available global device memory, and the global memory high-water mark.
- Enables reuse of allocated buffers.
System Architecture
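The buffer bookkeeping described above can be sketched in plain Python, leaving the actual pyopencl.Buffer allocations out. The DevicePool class and its method names are hypothetical, but the behavior mirrors the slide: track allocated bytes against device capacity, record the high-water mark, and hand back released buffers of a matching size instead of allocating fresh ones.

```python
class DevicePool:
    """Sketch of device-buffer bookkeeping: allocated bytes, the global-memory
    high-water mark, and a free list that enables buffer reuse."""
    def __init__(self, capacity_bytes):
        self.capacity = capacity_bytes
        self.allocated = 0            # bytes currently held on the device
        self.high_water = 0           # peak allocation seen so far
        self.free_by_size = {}        # size -> count of reusable buffers

    def request(self, nbytes):
        # Prefer reusing a previously released buffer of the same size.
        if self.free_by_size.get(nbytes, 0) > 0:
            self.free_by_size[nbytes] -= 1
            return ("reused", nbytes)
        if self.allocated + nbytes > self.capacity:
            raise MemoryError("would exceed device global memory")
        self.allocated += nbytes
        self.high_water = max(self.high_water, self.allocated)
        return ("fresh", nbytes)

    def release(self, nbytes):
        # Keep the buffer on the device for reuse rather than freeing it.
        self.free_by_size[nbytes] = self.free_by_size.get(nbytes, 0) + 1
```

A 3 GB GPU would be modeled as DevicePool(3 * 2**30); a request that would overflow global memory raises instead of allocating, which is how a strategy learns it must stream.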
13 Execution Strategies control data movement and how the OpenCL kernels of each primitive are composed to compute the final result.
Implementations leverage the features of our dataflow network module:
- Precedence from the dataflow graph
- Reference counting for intermediate results
OpenCL kernels for the primitives are written once and used by all strategies.
System Architecture
14 Roundtrip: dispatches a single kernel for each primitive and transfers each intermediate result from the OpenCL target device back to the host environment.
Staged: dispatches a single kernel for each primitive, but stores intermediate results in the global memory of the OpenCL target device.
Fusion: employs kernel fusion to construct and execute a single OpenCL kernel that composes all selected primitives.
System Architecture
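The Fusion strategy can be illustrated by generating one elementwise OpenCL C kernel from a chain of primitives. This is a sketch of the idea, not the framework's actual code generator: intermediates become registers, and only the final expression writes to global memory.

```python
def fuse_kernel(name, inputs, ops, result):
    """Generate one elementwise OpenCL kernel composing a primitive chain.
    ops: (output_name, expression) pairs; only the last writes device memory."""
    args = ", ".join(f"__global const float *{v}" for v in inputs)
    lines = [f"__kernel void {name}({args}, __global float *{result})",
             "{",
             "    int i = get_global_id(0);"]
    *body, (_final, final_expr) = ops
    for out, expr in body:
        lines.append(f"    float {out} = {expr};")    # intermediate stays in a register
    lines.append(f"    {result}[i] = {final_expr};")  # the single result write
    lines.append("}")
    return "\n".join(lines)

# The running example, mag = sqrt(x*x+y*y+z*z), fused into one kernel:
src = fuse_kernel("vmag", ["x", "y", "z"],
                  [("f1", "x[i]*x[i]"), ("f2", "y[i]*y[i]"), ("f3", "z[i]*z[i]"),
                   ("f4", "f1+f2"), ("f5", "f4+f3"), ("mag", "sqrt(f5)")],
                  "out")
```

The kernel name and argument layout here are invented for illustration; the point is that the six Roundtrip/Staged dispatches collapse into one kernel launch with no intermediate traffic, matching the Fusion column of the counts table later in the deck.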
15 Example expression: mag = sqrt(x*x+y*y+z*z)
[Diagram: the corresponding dataflow network, with inputs x, y, z feeding mult, add, and sqrt filters that produce mag.]
System Architecture
16 [Diagram: executing mag = sqrt(x*x+y*y+z*z) as six primitive kernels, f1 = mult(x,x), f2 = mult(y,y), f3 = mult(z,z), f4 = add(f1,f2), f5 = add(f4,f3), f6 = sqrt(f5), with operands sent from the OpenCL host and each intermediate result transferred back after every kernel.]
System Architecture
17 [Diagram: the same six primitive kernels, with x, y, z transferred to the OpenCL target once, intermediate results kept in device memory, and only the final result read back.]
System Architecture
18 [Diagram: the six primitives composed into a single kernel; x, y, z are transferred to the OpenCL target once and mag is computed in one kernel execution.]
System Architecture
19 [Diagram: side-by-side comparison of the Roundtrip, Staged, and Fusion strategies.]
System Architecture
21 Evaluation expressions: detection of vortical structures in a turbulent mixing simulation.
Host application: VisIt
Three studies:
- Single device performance
- Single device memory usage
- Distributed-memory parallel
Test environment: LLNL’s Edge HPC cluster, which provides OpenCL access to both NVIDIA Tesla M2050s and Intel Xeon processors.
Evaluation Methodology
22 We selected three expressions used for vortex detection and analysis.
Vector magnitude:
v_mag = sqrt(u*u + v*v + w*w)
Vorticity magnitude:
du = grad3d(u,dims,x,y,z)
dv = grad3d(v,dims,x,y,z)
dw = grad3d(w,dims,x,y,z)
w_x = dw[1] - dv[2]
w_y = du[2] - dw[0]
w_z = dv[0] - du[1]
w_mag = sqrt(w_x*w_x + w_y*w_y + w_z*w_z)
Q-criterion: (next slide)
Evaluation Methodology
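These two magnitude expressions can be checked on the host with NumPy alone. Here np.gradient stands in for the framework's grad3d primitive; this is a sketch of the math, not the framework's kernels.

```python
import numpy as np

def velocity_magnitude(u, v, w):
    return np.sqrt(u*u + v*v + w*w)

def vorticity_magnitude(u, v, w, x, y, z):
    """w_mag = |curl(u,v,w)| on a regular grid, with np.gradient in place
    of grad3d; axes (0,1,2) of each field correspond to (x,y,z)."""
    du = np.gradient(u, x, y, z)   # du[0]=du/dx, du[1]=du/dy, du[2]=du/dz
    dv = np.gradient(v, x, y, z)
    dw = np.gradient(w, x, y, z)
    w_x = dw[1] - dv[2]
    w_y = du[2] - dw[0]
    w_z = dv[0] - du[1]
    return np.sqrt(w_x*w_x + w_y*w_y + w_z*w_z)
```

A quick sanity check: solid-body rotation (u,v,w) = (-y, x, 0) has vorticity (0, 0, 2), so w_mag should be 2 everywhere (exact here, since the finite differences are exact on a linear field).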
23 Evaluation Methodology
Q-criterion:
du = grad3d(u,dims,x,y,z)
dv = grad3d(v,dims,x,y,z)
dw = grad3d(w,dims,x,y,z)
s_1 = 0.5 * (du[1] + dv[0])
s_2 = 0.5 * (du[2] + dw[0])
s_3 = 0.5 * (dv[0] + du[1])
s_5 = 0.5 * (dv[2] + dw[1])
s_6 = 0.5 * (dw[0] + du[2])
s_7 = 0.5 * (dw[1] + dv[2])
w_1 = 0.5 * (du[1] - dv[0])
w_2 = 0.5 * (du[2] - dw[0])
w_3 = 0.5 * (dv[0] - du[1])
w_5 = 0.5 * (dv[2] - dw[1])
w_6 = 0.5 * (dw[0] - du[2])
w_7 = 0.5 * (dw[1] - dv[2])
s_norm = du[0]*du[0] + s_1*s_1 + s_2*s_2 + s_3*s_3 + dv[1]*dv[1] + s_5*s_5 + s_6*s_6 + s_7*s_7 + dw[2]*dw[2]
w_norm = w_1*w_1 + w_2*w_2 + w_3*w_3 + w_5*w_5 + w_6*w_6 + w_7*w_7
q_crit = 0.5 * (w_norm - s_norm)
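This expression transliterates directly to NumPy. The sketch below exploits the symmetries in the expression (s_3 = s_1, s_6 = s_2, s_7 = s_5, and w_3 = -w_1, w_6 = -w_2, w_7 = -w_5), folding the duplicate terms into factors of two; np.gradient again stands in for grad3d.

```python
import numpy as np

def q_criterion(u, v, w, x, y, z):
    """Q = 0.5*(|Omega|^2 - |S|^2), with S and Omega the symmetric and
    antisymmetric parts of the velocity gradient, as in the expression above."""
    du = np.gradient(u, x, y, z)
    dv = np.gradient(v, x, y, z)
    dw = np.gradient(w, x, y, z)
    s_1 = 0.5 * (du[1] + dv[0]); w_1 = 0.5 * (du[1] - dv[0])
    s_2 = 0.5 * (du[2] + dw[0]); w_2 = 0.5 * (du[2] - dw[0])
    s_5 = 0.5 * (dv[2] + dw[1]); w_5 = 0.5 * (dv[2] - dw[1])
    s_norm = (du[0]**2 + dv[1]**2 + dw[2]**2
              + 2*(s_1**2 + s_2**2 + s_5**2))   # diagonal + doubled off-diagonal
    w_norm = 2*(w_1**2 + w_2**2 + w_5**2)
    return 0.5 * (w_norm - s_norm)
```

For pure solid-body rotation (u,v,w) = (-y, x, 0) there is no strain, so s_norm = 0, w_norm = 2, and Q = 1 everywhere: rotation dominates, which is what Q > 0 is meant to flag in vortex detection.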
24 A 3072³ timestep of a Rayleigh–Taylor instability simulation: a DNS with intricate embedded vortical features.
27 billion cells; 3072 sub-grids (each 192x192x256 cells).
Data courtesy of Bill Cabot and Andy Cook, LLNL
Evaluation Methodology
25 Sub-grids for single device evaluation: 12 sub-grids varying from 9.3 to 113.2 million cells.
Fields: mesh coordinates (x,y,z) and the velocity vector field (u,v,w).
[Image: velocity magnitude rendering]
Data courtesy of Bill Cabot and Andy Cook, LLNL
Evaluation Methodology
26 VisIt’s Python Interfaces
[Diagram: local components (GUI, CLI, Python clients) connect over a network to the viewer (state manager), which drives the MPI-parallel compute engine on the cluster. Python entry points: the Python Client Interface (state control) and the Python Filter Runtime (direct mesh manipulation).]
Evaluation Methodology
27 Single Device Evaluation
Recorded runtime performance and memory usage.
Two OpenCL target devices:
- GPU: Tesla M2050 (3 GB RAM)
- CPU: Intel Xeons (96 GB RAM, shared with the host environment)
144 test cases per device:
- Three test expressions
- Four strategies (our three strategies and a reference kernel)
- 12 RT3D sub-grids, with sizes ranging from 9.6 million to 113 million cells
Evaluation Methodology
28 Distributed-Memory Parallel Test
A smoke test: Q-criterion using the Fusion strategy on 128 nodes with two Tesla M2050s per node.
Data: the full mesh from a single RT3D timestep, 3072 sub-grids each with 192x192x256 cells, 27 billion total cells plus ghost data.
Each of the 256 Teslas streams 12 sub-grids.
Evaluation Methodology
30 [Chart: Velocity Magnitude]
Evaluation Results
31 [Chart: Vorticity Magnitude]
Evaluation Results
32 [Chart: Q-criterion]
Evaluation Results
33 [Chart: Velocity Magnitude]
Evaluation Results
34 [Chart: Vorticity Magnitude]
Evaluation Results
35 [Chart: Q-criterion]
Evaluation Results
36
Expression            Strategy    Device Writes   Device Reads   Kernel Executions
Velocity Magnitude    Roundtrip   11              6              6
Velocity Magnitude    Staged      3               1              6
Velocity Magnitude    Fusion      3               1              1
Vorticity Magnitude   Roundtrip   32              12             18
Vorticity Magnitude   Staged      7               1              18
Vorticity Magnitude   Fusion      7               1              1
Q-criterion           Roundtrip   123             57             67
Q-criterion           Staged      7               1              67
Q-criterion           Fusion      7               1              1
Evaluation Results
37 [Chart: Q-criterion of the 27 billion cell mesh.]
Evaluation Results
38 Strategy Comparison:
- Roundtrip: slowest, and the least constrained by target device memory.
- Staged: faster than Roundtrip, but the most constrained by target device memory.
- Fusion: fastest, with the least data movement.
Device Comparison:
- GPU: best runtime performance for test cases that fit into the 3 GB of global device memory.
- CPU: successfully completed all test cases.
Evaluation Results
39 Our framework provides a flexible path forward for exploring strategies for efficient derived field generation on future many-core architectures.
The Python ecosystem made this research possible.
Future work:
- Distributed-memory parallel performance
- Strategies for streaming and for using multiple devices on-node
Thanks PyHPC 2012!
Contact Info: Cyrus Harrison