1 Parallel Visualization of Large-Scale Datasets for the Earth Simulator. Li Chen, Issei Fujishiro, Kengo Nakajima. Research Organization for Information Science and Technology (RIST), Japan. 3rd ACES Workshop, May 5-10, 2002, Maui, Hawaii. Visualization: Basic Design & Parallel/SMP/Vector Algorithm.

2 Background: Role of the Visualization Subsystem
Earth Simulator: Software, Hardware.
GeoFEM: Mesh gen., Solver, Application analysis, Vis Subsys.
Tools for: (1) Post Processing, (2) Data Mining, etc.

3 Background: Requirements
Target 1: Powerful visualization functions. Translate data from numerical form to visual form and give researchers immense assistance in understanding their computational results. → We have developed many visualization techniques in GeoFEM, for scalar, vector and tensor data fields, to reveal the data distribution from many aspects.
Target 2: Suitable for large-scale datasets; high parallel performance. → Our modules have been parallelized and achieve high parallel performance.
Target 3: Available for unstructured datasets; complicated grids. → All of our modules are based on unstructured datasets and can be extended to hybrid grids.
Target 4: SMP cluster architecture oriented; effective on the SMP cluster architecture. → A three-level hybrid parallel programming model is adopted in our modules.

4 Work after 2nd ACES (Oct. 2000)
 Developed more visualization techniques for GeoFEM
 Improved parallel performance
 Please visit our poster for details!

5 Overview
Visualization Subsystem in GeoFEM
Newly Developed Parallel Volume Rendering (PVR)
– Algorithm
– Parallel/Vector Efficiency
Examples
Future Work

6 Parallel Visualization: File Version (or "DEBUGGING" Version)
[Pipeline diagram: mesh files #0 ... #n-1 feed the FEM analysis processes FEM-#0 ... FEM-#n-1 (I/O + Solver), which write result files #0 ... #n-1; the visualization processes VIS-#0 ... VIS-#n-1 read the result files and write visualization result files (UCD etc.); a viewer (AVS etc.) on the client produces the images. Visualization on the client includes simplification, combination, etc.]

7 Large-Scale Data in GeoFEM
1km x 1km x 1km mesh for a 1000km x 1000km x 100km "local" region
1000 x 1000 x 100 = 10^8 grid points
1 GB/variable/time step, ~10 GB/time step for 10 variables
TB scale for 100 steps!
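A rough back-of-the-envelope check (assuming 8-byte double-precision values, an assumption rather than a figure from the slide): 10^8 grid points x 8 bytes ≈ 0.8 GB per variable per time step; 10 variables give roughly 8 GB per step; 100 steps give on the order of 1 TB, consistent with the figures above.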

8 Parallel Visualization: Memory/Concurrent Version
[Pipeline diagram: mesh files #0 ... #n-1 feed combined FEM + visualization processes (FEM-#i I/O + Solver + VIS-#i) running together on the GeoFEM platform; visualization result files (UCD etc.) go to a viewer (AVS etc.) on the client, which produces the images.]
Dangerous if the detailed physics is not yet clear.

9 Parallel Visualization Techniques in GeoFEM
Scalar field: Cross-sectioning, Isosurface-fitting, Surface-fitting, Interval Volume-fitting, Volume Rendering
Vector field: Streamlines, Particle Tracking, Topological Map, LIC Volume Rendering
Tensor field: Hyperstreamlines
In the following, we take the Parallel Volume Rendering module as an example to demonstrate our strategies for improving parallel performance.
Available June 2002, http://geofem.tokyo.rist.or.jp/

10 Visualization Subsystem in GeoFEM
Newly Developed Parallel Volume Rendering (PVR)
– Algorithm
– Parallel/Vector Efficiency
Examples
Future Work

11 Design of Visualization Methods
Principle: take account of parallel performance, huge data size, and unstructured grids.
Classification of current volume rendering methods:
– Traversal approach: image-order volume rendering (ray casting), object-order volume rendering (cell projection), hybrid-order volume rendering
– Grid type: regular, curvilinear, unstructured
– Projection: parallel, perspective
– Composition approach: from front to back, from back to front

12 Design of Visualization Methods
Principle: take account of running concurrently with the computational process.
Classification of parallelism:
– Object-space parallelism: partition object space; each PE gets a portion of the dataset and calculates an image of its sub-volume.
– Image-space parallelism: partition image space; each PE calculates a portion of the whole image.
– Time-space parallelism: partition time; each PE calculates the images of several timesteps.

13 Design for Parallel Volume Rendering: Unstructured → Locally Refined Octree/Hierarchical
Why not an unstructured grid?
– Hard to build a hierarchical structure
– Connectivity information must be found beforehand
– An unstructured grid makes image composition and load balancing difficult
– Irregular cell shapes make sampling slower
Why not a regular grid?
– Large storage requirement
– Slows down the volume rendering process

14 Parallel Transformation: Unstructured → Hierarchical
One solution: FEM data → resampling → hierarchical data → ray-casting PVR → VR image
[Diagram: the original GeoFEM meshes are distributed over PE#0 ... PE#17 and overlaid with background cells and voxels.]
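As a rough illustration of the resampling step, the sketch below fills a regular background voxel grid by evaluating the FEM solution at each voxel center. sample_fem_value() is a hypothetical helper (not a GeoFEM routine) that would locate the containing element and interpolate the nodal values; in the parallel code each PE would only cover its own partition.

    #include <math.h>

    /* Hypothetical: locate the element containing (x,y,z) and interpolate
       the nodal field there; returns NAN for points outside the mesh. */
    extern double sample_fem_value(double x, double y, double z);

    /* Resample unstructured FEM results onto a regular voxel grid of
       nx*ny*nz cells with spacing dx and lower corner (xmin,ymin,zmin). */
    void resample_to_voxels(double xmin, double ymin, double zmin,
                            double dx, int nx, int ny, int nz,
                            double *voxel)
    {
        for (int k = 0; k < nz; k++)
            for (int j = 0; j < ny; j++)
                for (int i = 0; i < nx; i++) {
                    double x = xmin + (i + 0.5) * dx;
                    double y = ymin + (j + 0.5) * dx;
                    double z = zmin + (k + 0.5) * dx;
                    voxel[((long)k * ny + j) * nx + i] =
                        sample_fem_value(x, y, z);
                }
    }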

15 Accelerated Ray-Casting PVR
Input: VR parameters, hierarchical datasets.
1. Determine sampling and mapping parameters.
2. Build a branch-on-need octree.
3. Generate subimages on each PE:
   for each subvolume
     for j = startj to endj
       for i = starti to endi
         quickly find the voxels intersected by ray (i, j)
         compute (r, g, b) at each intersected voxel based on the volume illumination model and transfer functions
         compute (r, g, b) for pixel (i, j) by front-to-back composition
4. Build the topological structure of the subvolumes on all PEs.
5. Composite the subimages from front to back.
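The innermost compositing step can be pictured as below: a minimal sketch of front-to-back composition along one ray with early ray termination, assuming the (r, g, b, opacity) samples at the intersected voxels have already been produced by the transfer functions and illumination model (the octree-accelerated traversal itself is not shown).

    /* samples[s] = {r, g, b, opacity} at the s-th intersected voxel,
       ordered front to back along ray (i, j). */
    void composite_ray(const double samples[][4], int n_samples,
                       double pixel_rgb[3])
    {
        double acc_alpha = 0.0;
        pixel_rgb[0] = pixel_rgb[1] = pixel_rgb[2] = 0.0;
        for (int s = 0; s < n_samples; s++) {
            /* remaining transparency times this sample's opacity */
            double w = (1.0 - acc_alpha) * samples[s][3];
            pixel_rgb[0] += w * samples[s][0];
            pixel_rgb[1] += w * samples[s][1];
            pixel_rgb[2] += w * samples[s][2];
            acc_alpha += w;
            if (acc_alpha > 0.99)   /* early ray termination */
                break;
        }
    }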

16 Visualization Subsystem in GeoFEM
Newly Developed Parallel Volume Rendering (PVR)
– Algorithm
– Parallel/Vector Efficiency
Examples
Future Work

17 SMP Cluster Type Architectures: Earth Simulator, ASCI hardware
Various types of communication and parallelism: inter-SMP node, intra-SMP node, individual PE.
[Diagram: several SMP nodes, each with multiple PEs sharing one memory.]

18 Optimum Programming Models for the Earth Simulator?
Intra-node: F90 + directives (OpenMP), or MPI
Inter-node: MPI, or HPF
Each PE: F90

19 Three-Level Hybrid Parallelization
Flat MPI parallelization: each PE is independent.
Hybrid parallel programming model based on the memory hierarchy:
– Inter-SMP node: MPI
– Intra-SMP node: OpenMP for parallelization
– Individual PE: compiler directives for vectorization/pseudo-vectorization
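A minimal sketch of how the three levels combine in C, assuming one MPI process per SMP node, OpenMP threads across the PEs inside a node, and a long stride-1 innermost loop left to the compiler's (pseudo-)vectorizer; the vectorization directives themselves are vendor-specific and only hinted at in a comment, and the computation is a placeholder rather than a GeoFEM kernel.

    #include <mpi.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);            /* level 1: MPI between SMP nodes */
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        long n = 1000000;
        double *v = malloc(n * sizeof(double));
        for (long i = 0; i < n; i++) v[i] = (double)rank;

        /* level 2: OpenMP across the PEs of one SMP node */
        #pragma omp parallel for
        for (long i = 0; i < n; i++) {
            /* level 3: a long, simple, stride-1 loop body that the
               vectorizing compiler (with vendor directives where needed)
               can turn into vector / pseudo-vector instructions */
            v[i] = 2.0 * v[i] + 1.0;
        }

        /* inter-node communication (halo exchange, image composition)
           would follow here via MPI; a reduction as a stand-in: */
        double local = v[0], global;
        MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

        free(v);
        MPI_Finalize();
        return 0;
    }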

20 Flat MPI vs. OpenMP/MPI Hybrid
[Diagram: the hybrid model has a hierarchical structure, with the PEs of each SMP node sharing one memory; flat MPI treats every PE as an independent process.]

21 Three-Level Hybrid Parallelization
Previous work on hybrid parallelization:
– R. Falgout and J. Jones, "Multigrid on Massively Parallel Architectures", 1999.
– F. Cappello and D. Etiemble, "MPI versus MPI+OpenMP on the IBM SP for the NAS Benchmarks", 2000.
– K. Nakajima and H. Okuda, "Parallel Iterative Solvers for Unstructured Grids using Directive/MPI Hybrid Programming Model for GeoFEM Platform on SMP Cluster Architectures", 2001.
All of these are in the computational research area; no visualization papers were found on this topic.
Previous parallel visualization methods, classified by platform:
– Shared-memory machines: J. Nieh and M. Levoy 1992; P. Lacroute 1996
– Distributed-memory machines: U. Neumann 1993; C. M. Wittenbrink and A. K. Somani 1997
– SMP cluster machines: almost no papers found.

22 SMP Cluster Architecture
[Diagram: four SMP nodes (Node-0 ... Node-3), each with its own memory and PEs; the data domain is partitioned across the nodes.]
The Earth Simulator: 640 SMP nodes, with 8 vector processors in each SMP node.

23 Three-Level Hybrid Parallelization
Criteria to achieve high parallel performance:
– Local operation and no global dependency
– Continuous memory access
– Sufficiently long loops

24 Vectorization for Each PE: Constructing Vectorizable Loops
Combine short loops into one long loop by reordering:
    for(i=0;i<MAX_N_VERTEX;i++)
      for(j=0;j<3;j++) { p[i][j]= ...; }
  becomes
    for(i=0;i<MAX_N_VERTEX*3;i++) { p[i/3][i%3]= ...; }
Exchange the innermost and outer loops to make the innermost loop longer:
    for(i=0;i<MAX_N_VERTEX;i++)
      for(j=0;j<3;j++) { p[i][j]= ...; }
  becomes
    for(j=0;j<3;j++)
      for(i=0;i<MAX_N_VERTEX;i++) { p[i][j]= ...; }
Avoid tree and single/double linked-list data structures, especially in inner loops.
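A compilable version of the loop-collapsing example above, with the elided loop bodies replaced by an arbitrary assignment (MAX_N_VERTEX and the formula are illustrative only):

    #define MAX_N_VERTEX 100000
    static double p[MAX_N_VERTEX][3];

    /* Original form: the innermost loop has length 3, too short to vectorize well. */
    void fill_nested(void)
    {
        for (int i = 0; i < MAX_N_VERTEX; i++)
            for (int j = 0; j < 3; j++)
                p[i][j] = 0.1 * i + j;
    }

    /* Collapsed form: one loop of length 3*MAX_N_VERTEX over the same
       contiguous storage, a much better candidate for vectorization. */
    void fill_collapsed(void)
    {
        for (int i = 0; i < MAX_N_VERTEX * 3; i++)
            p[i / 3][i % 3] = 0.1 * (i / 3) + (i % 3);
    }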

25 Intra-SMP Node Parallelization: OpenMP (http://www.openmp.org)
Multi-coloring for removing the data race [Nakajima et al. 2001].
Example: gradient computation in PVR (pseudocode):
    #pragma omp parallel
    {
      for(i=0;i<num_element;i++) {
        compute jacobian matrix of shape function;
        for(j=0;j<8;j++)
          for(k=0;k<8;k++)
            accumulate gradient value of vertex j contributed by vertex k;
      }
    }
[Diagram: the hexahedral elements are colored and distributed over PE#0 ... PE#3 so that elements processed simultaneously do not share vertices.]
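A minimal sketch of the multi-colored version of this accumulation in C with OpenMP. The coloring itself (grouping elements so that no two elements of the same color share a vertex) and the FEM arrays are assumed to exist already; they are placeholders, not GeoFEM data structures. Colors are processed one after another, and inside a color the parallel loop is race-free, so no atomics are needed.

    /* color_start: n_colors+1 offsets into elem_of_color[]
       elem_of_color: element ids grouped by color
       elem_vertex[e][0..7]: the 8 vertex ids of hexahedral element e
       grad[v]: gradient value accumulated at vertex v */
    void accumulate_gradients(int n_colors,
                              const int *color_start,
                              const int *elem_of_color,
                              const int (*elem_vertex)[8],
                              double *grad)
    {
        for (int c = 0; c < n_colors; c++) {       /* colors in sequence */
            #pragma omp parallel for
            for (int e = color_start[c]; e < color_start[c + 1]; e++) {
                int id = elem_of_color[e];
                for (int j = 0; j < 8; j++)
                    for (int k = 0; k < 8; k++)
                        /* placeholder contribution of vertex k to vertex j;
                           no race: same-color elements share no vertex */
                        grad[elem_vertex[id][j]] += 1.0 / 64.0;
            }
        }
    }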

26 Inter-SMP Node Parallelization: MPI
Parallel data structure in GeoFEM: internal nodes, external nodes, communication.
Overlapped elements are used for reducing communication among SMP nodes.
Overlap removal is necessary for the final results.
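A minimal sketch of the exchange of overlapped (external) node values between neighbouring SMP nodes using standard non-blocking MPI calls; the neighbour lists and send/receive buffers stand in for the GeoFEM communication tables and are assumptions, not the actual GeoFEM interface.

    #include <mpi.h>

    /* For each neighbouring node p: send the values of my internal nodes
       it needs and receive the values of my external nodes from it. */
    void exchange_halo(int n_neighbors, const int *neighbor_rank,
                       double *const *send_buf, const int *send_count,
                       double *const *recv_buf, const int *recv_count,
                       MPI_Comm comm)
    {
        MPI_Request req[2 * n_neighbors];          /* C99 variable-length array */
        for (int p = 0; p < n_neighbors; p++) {
            MPI_Irecv(recv_buf[p], recv_count[p], MPI_DOUBLE,
                      neighbor_rank[p], 0, comm, &req[p]);
            MPI_Isend(send_buf[p], send_count[p], MPI_DOUBLE,
                      neighbor_rank[p], 0, comm, &req[n_neighbors + p]);
        }
        MPI_Waitall(2 * n_neighbors, req, MPI_STATUSES_IGNORE);
    }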

27 Dynamic Load Repartition: Why?
The initial partition on each PE is the same as for the analysis computation, but the load on each PE during the PVR process depends on:
– the number of non-empty voxels
– the opacity transfer functions
– the viewpoint
Rendered voxels often accumulate in small portions of the field during visualization, so a dynamic repartition is needed to keep an almost equal number of rendered voxels on each PE (load balance during PVR).

28 Dynamic Load Repartition
Most previous methods use scattered decomposition [K.-L. Ma et al., 1997].
– Advantage: very good load balance is obtained easily.
– Disadvantage: a large amount of intermediate results has to be stored, requiring large extra memory and large extra communication.
Our approach: assign several contiguous subvolumes to each PE.
– Count the number of rendered voxels during the grid transformation.
– Move a subvolume from a PE with a larger number of rendered voxels to another PE with a smaller one.

29 Dynamic Load Repartition
Assign several contiguous subvolumes to each PE; count the number of rendered voxels during the grid transformation; move a subvolume from a PE with a larger number of rendered voxels to another PE with a smaller one.
[Diagram: initial partition vs. repartition of the subvolumes among PE0 ... PE3.]
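A minimal sketch of the repartition decision, assuming the per-PE counts of rendered voxels have been gathered and, for simplicity, that every migrated subvolume carries roughly the same number of rendered voxels; move_subvolume() is hypothetical and would ship the subvolume's data between PEs.

    /* load[p]: rendered voxels currently assigned to PE p.
       Greedily move one subvolume at a time from the most loaded PE to the
       least loaded PE until a further move would not reduce the imbalance. */
    void repartition(long *load, int n_pe, long voxels_per_subvolume)
    {
        for (;;) {
            int max_p = 0, min_p = 0;
            for (int p = 1; p < n_pe; p++) {
                if (load[p] > load[max_p]) max_p = p;
                if (load[p] < load[min_p]) min_p = p;
            }
            if (load[max_p] - load[min_p] <= voxels_per_subvolume)
                break;                      /* moving more would overshoot */
            /* move_subvolume(max_p, min_p);   hypothetical data transfer */
            load[max_p] -= voxels_per_subvolume;
            load[min_p] += voxels_per_subvolume;
        }
    }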

30 Visualization Subsystem in GeoFEM
Newly Developed Parallel Volume Rendering (PVR)
– Algorithm
– Parallel/Vector Efficiency
Examples
Future Work

31 Speedup Test 1
Purpose: demonstrate the effect of three-level hybrid parallelization.
Dataset: Pin Grid Array (PGA) dataset; simulates the Mises stress distribution on a pin grid board by linear elastostatic analysis. Data size: 7,869,771 nodes and 7,649,024 elements (data courtesy of H. Okuda and S. Ezure).
Running environment: SR8000. Each node: 8 PEs, 8 GFLOPS peak performance, 8 GB memory. Total system: 128 nodes (1024 PEs), 1.0 TFLOPS peak performance, 1.0 TB memory.

32 Speedup Test 1
Volume-rendered images (top view and bottom view) showing the equivalent scalar value of stress from the linear elastostatic analysis for a PGA dataset with 7,869,771 nodes and 7,649,024 elements (data courtesy of H. Okuda and S. Ezure).

33 Speedup Test 1
Comparison of speedup performance between flat MPI and the hybrid parallel method for our parallel volume rendering module.
Original (MPI) to vector version (hybrid) speed-up for 1 PE: 4.30.
128^3 uniform cubes for PVR.

34 Speedup Test 2
Purpose: demonstrate the effect of three-level hybrid parallelization.
Test dataset: core dataset (data courtesy of H. Matsui, GeoFEM); simulates thermal convection in a rotating spherical shell. Data size: 257,414 nodes and 253,440 elements.
Test module: Parallel Surface Rendering module.
Running environment: SR8000. Each node: 8 PEs, 8 GFLOPS peak performance, 8 GB memory. Total system: 128 nodes (1024 PEs), 1.0 TFLOPS peak performance, 1.0 TB memory.

35 Speedup Test 2
Pressure isosurfaces and temperature cross-sections for a core dataset with 257,414 nodes and 253,440 elements. The speedup of our three-level parallel method is 231.7 for 8 nodes (64 PEs) on the SR8000.

36 Speedup Test 2
Comparison of speedup performance between flat MPI and the hybrid parallel method for our parallel surface rendering module.
Original (MPI) to vector version (hybrid) speed-up for 1 PE: 4.00.

37 Speedup Test 3
Purpose: demonstrate the effect of dynamic load repartition.
Dataset: underground water dataset; simulates groundwater flow and convection/diffusion transport through heterogeneous porous media. 200 x 100 x 100 region with different water conductivities; three meshes of 16,000, 128,000 and 1,024,000 elements (∆h = 5.00 / 2.50 / 1.25); 100 timesteps.
Running environment: Compaq Alpha 21164 cluster machine (8 PEs, 600 MHz/PE, 512 MB RAM/PE).
Result for mesh 3 (about 10 million cubes and 100 timesteps): without dynamic load repartition, 8.15 seconds per time step on average; with dynamic load repartition, 3.37 seconds per time step on average (about 2.4x faster).

38 Speedup Test 3
Effects of convection & diffusion for different mesh sizes: ∆h = 5.00, ∆h = 2.50, ∆h = 1.25 (groundwater flow channel).

39 Application (2): Flow/Transport
50 x 50 x 50 region, with a different water conductivity for each (∆h = 5)^3 cube.
Boundary conditions: dφ/dx = 0.01, φ = 0 at x_max.
100^3 meshes (∆h = 0.50).
64 PEs: Hitachi SR2201.

40 Parallel Performance: Convection & Diffusion
13,280 steps for 200 time units; 10^6 meshes, 1,030,301 nodes.
3,984 sec. elapsed time, including communication, on the Hitachi SR2201 with 64 PEs
– 3,934 sec. real CPU time
– 98.7% parallel performance

41 Convection & Diffusion: Visualization by PVR (groundwater flow channel)

42 Conclusions and Future Work
Future work: improve the parallel performance of the Visualization Subsystem in GeoFEM
● Improve the parallel performance of the visualization algorithms
● Three-level hybrid parallelization based on the SMP cluster architecture: inter-SMP node, MPI; intra-SMP node, OpenMP for parallelization; individual PE, compiler directives for vectorization/pseudo-vectorization
● Dynamic load balancing
Tests on the Earth Simulator: http://www.es.jamstec.go.jp/

