Supercomputing '99
Parallelization of a Dynamic Unstructured Application using Three Leading Paradigms
Leonid Oliker, NERSC, Lawrence Berkeley National Laboratory
Rupak Biswas, MRJ Technology Solutions, NASA Ames Research Center

Motivation and Objectives
- Real-life computational simulations generally require irregular data structures and dynamic algorithms
- Large-scale parallelism is needed to solve these problems within a reasonable time frame
- Several parallel architectures with distinct programming methodologies have emerged
- Report experience with the parallelization of a dynamic unstructured mesh adaptation code using three popular programming paradigms on three state-of-the-art supercomputers

2D Unstructured Mesh Adaptation
- Powerful tool for efficiently solving computational problems with evolving physical features (shocks, vortices, shear layers, crack propagation)
- Complicated logic and data structures
- Difficult to parallelize efficiently
  - Irregular data access patterns (pointer chasing)
  - Workload grows/shrinks at runtime (dynamic load balancing)
- Three types of element subdivision
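The slides do not show the adaptation code's data structures; the following C sketch only illustrates the kind of pointer-chasing connectivity a 2D triangular mesh adaptor typically carries. The Triangle, Edge, and Vertex types and their fields are illustrative assumptions, not taken from the original code.

```c
/* Hypothetical connectivity records for a 2D adaptive triangular mesh.
 * Traversals follow pointers (triangle -> edges -> neighbor triangles), so
 * memory accesses are irregular and hard to block for cache or vectorize. */
typedef struct Vertex {
    double x, y;               /* coordinates */
    int    id;                 /* global vertex number */
} Vertex;

typedef struct Edge {
    struct Vertex   *v[2];     /* endpoints */
    struct Triangle *tri[2];   /* adjacent triangles (NULL on boundary) */
    int              marked;   /* set by the edge-marking phase */
} Edge;

typedef struct Triangle {
    struct Edge   *e[3];       /* bounding edges */
    struct Vertex *v[3];       /* corner vertices */
    int            level;      /* refinement level */
} Triangle;

/* Example of pointer chasing: visit the neighbors of one triangle. */
static void visit_neighbors(Triangle *t, void (*visit)(Triangle *))
{
    for (int i = 0; i < 3; i++) {
        Edge *e = t->e[i];
        Triangle *nbr = (e->tri[0] == t) ? e->tri[1] : e->tri[0];
        if (nbr != NULL)
            visit(nbr);
    }
}
```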

Parallel Code Development
- Programming paradigms
  - Message passing (MPI)
  - Shared memory (OpenMP-style pragma compiler directives)
  - Multithreading (Tera compiler directives)
- Architectures
  - Cray T3E
  - SGI Origin2000
  - Tera MTA
- Critical factors
  - Runtime
  - Scalability
  - Programmability
  - Portability
  - Memory overhead

Test Problem
- Computational mesh to simulate flow over an airfoil
- Mesh geometrically refined 5 levels in specific regions to better capture fine-scale phenomena
  - Initial mesh: 14,605 vertices, 28,404 triangles
  - Refined mesh: 488,574 vertices, 1,291,834 triangles
- Serial code: 6.4 secs on a 250 MHz R10K

Distributed-Memory Implementation
- 512-node Cray T3E (450 MHz DEC Alpha processors)
- 32-node SGI Origin2000 (250 MHz dual MIPS R10K processors)
- Code implemented in MPI within the PLUM framework
  - Initial dual graph used for load balancing adapted meshes
  - Parallel repartitioning of adapted meshes (ParMeTiS)
  - Remapping algorithm assigns new partitions to processors
  - Efficient data movement scheme (predictive & asynchronous)
- Three major steps (refinement, repartitioning, remapping)
- Overhead
  - Programming (to maintain consistent data structures for shared objects)
  - Memory (mostly for bulk communication buffers)
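The data movement scheme is described only at a high level in the slides. A minimal MPI sketch of a predictive, asynchronous bulk transfer might look like the following: message sizes are exchanged first so receive buffers can be sized and receives pre-posted, then payloads move with non-blocking sends and receives. All function and buffer names are illustrative, not PLUM code, and the send buffer is assumed to be packed in rank order.

```c
#include <mpi.h>
#include <stdlib.h>

/* Sketch of asynchronous bulk remapping: every rank first tells every other
 * rank how much element data it will send (the "predictive" step), then the
 * actual payloads are moved with non-blocking sends and receives. */
void remap_elements(double *sendbuf, int *sendcounts,
                    double **recvbuf_out, int *recvtotal_out, MPI_Comm comm)
{
    int nprocs;
    MPI_Comm_size(comm, &nprocs);

    /* Step 1: exchange message sizes so receive buffers can be allocated. */
    int *recvcounts = malloc(nprocs * sizeof(int));
    MPI_Alltoall(sendcounts, 1, MPI_INT, recvcounts, 1, MPI_INT, comm);

    int recvtotal = 0;
    for (int p = 0; p < nprocs; p++) recvtotal += recvcounts[p];
    double *recvbuf = malloc(recvtotal * sizeof(double));

    /* Step 2: post non-blocking receives and sends for the bulk data. */
    MPI_Request *reqs = malloc(2 * nprocs * sizeof(MPI_Request));
    int nreq = 0, roff = 0, soff = 0;
    for (int p = 0; p < nprocs; p++) {
        if (recvcounts[p] > 0)
            MPI_Irecv(recvbuf + roff, recvcounts[p], MPI_DOUBLE, p, 0, comm,
                      &reqs[nreq++]);
        roff += recvcounts[p];
    }
    for (int p = 0; p < nprocs; p++) {
        if (sendcounts[p] > 0)
            MPI_Isend(sendbuf + soff, sendcounts[p], MPI_DOUBLE, p, 0, comm,
                      &reqs[nreq++]);
        soff += sendcounts[p];
    }
    MPI_Waitall(nreq, reqs, MPI_STATUSES_IGNORE);

    free(recvcounts);
    free(reqs);
    *recvbuf_out = recvbuf;
    *recvtotal_out = recvtotal;
}
```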

Overview of PLUM
[Flowchart: INITIALIZATION (initial mesh, partitioning, mapping) feeds the FLOW SOLVER; the MESH ADAPTOR performs edge marking, coarsening, and refinement; the LOAD BALANCER tests "Balanced?" and "Expensive?" before repartitioning, reassignment, and remapping.]
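The flowchart's control flow can be summarized in a short code sketch. This is one reading of the diagram, not PLUM source code, and every function below is a stand-in stub.

```c
/* Control loop implied by the PLUM flowchart (an interpretation of the
 * diagram, not PLUM source).  The stubs stand in for the real solver,
 * mesh adaptor, and load balancer components. */
#include <stdbool.h>

typedef struct Mesh Mesh;

/* --- stubs standing in for the real components ------------------------ */
static void flow_solver(Mesh *m)                { (void)m; }
static void adapt_mesh(Mesh *m)                 { (void)m; } /* edge marking, coarsening, refinement */
static bool balanced(const Mesh *m)             { (void)m; return true; }
static void repartition(Mesh *m)                { (void)m; } /* e.g. ParMeTiS */
static bool remap_too_expensive(const Mesh *m)  { (void)m; return false; }
static void reassign_and_remap(Mesh *m)         { (void)m; }

/* --- driver: the loop the flowchart describes -------------------------- */
void plum_driver(Mesh *mesh, int nsteps)
{
    for (int step = 0; step < nsteps; step++) {
        flow_solver(mesh);
        adapt_mesh(mesh);                  /* MESH ADAPTOR */
        if (!balanced(mesh)) {             /* LOAD BALANCER */
            repartition(mesh);
            if (!remap_too_expensive(mesh))
                reassign_and_remap(mesh);  /* new partitions assigned, data moved */
        }
    }
}
```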

Performance of MPI Code
- More than 32 procs required to outperform serial case
- Reasonable scalability for refinement & remapping
- Scalable repartitioner would improve performance
- Data volume different due to different word sizes

Shared-Memory Implementation
- 32-node SGI Origin2000 (250 MHz dual MIPS R10K processors)
- Complexities of partitioning & remapping absent
  - Parallel dynamic loop scheduling for load balance
- GRAPH_COLOR strategy (significant overhead)
  - Use SGI's native pragma directives to create IRIX threads
  - Color triangles (new ones on the fly) to form independent sets
  - All threads process each set to completion, then synchronize
- NO_COLOR strategy (too fine grained)
  - Use low-level locks instead of graph coloring
  - When a thread processes a triangle, lock its edges & vertices
  - Processors idle while waiting for blocked objects
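A minimal sketch of the GRAPH_COLOR strategy, written with OpenMP as a stand-in for the SGI-native IRIX pragmas the original code used. The coloring itself is assumed to be computed elsewhere, and all names are illustrative.

```c
#include <omp.h>

typedef struct Triangle Triangle;

/* Process triangles one color (independent set) at a time: within a color
 * no two triangles share an edge or vertex, so no locking is needed; the
 * implicit barrier at the end of each parallel loop synchronizes threads
 * before the next color starts. */
void refine_by_color(Triangle **tri, const int *color, int ntri, int ncolors,
                     void (*refine)(Triangle *))
{
    for (int c = 0; c < ncolors; c++) {
        /* dynamic scheduling provides the load balance the slide mentions */
        #pragma omp parallel for schedule(dynamic)
        for (int i = 0; i < ntri; i++) {
            if (color[i] == c)
                refine(tri[i]);
        }
        /* implicit barrier: the whole independent set finishes here */
    }
}
```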

Performance of Shared-Memory Code
- Poor performance due to flat memory assumption
- System overloaded by false sharing
- Page migration unable to remedy the problem
- Need to consider data locality and cache effects to improve performance (requires partitioning & reordering)
- For GRAPH_COLOR
  - Cache misses: 15 M (serial) to 85 M (P=1)
  - TLB misses: 7.3 M (serial) to 53 M (P=1)

Multithreaded Implementation
- 8-processor 250 MHz Tera MTA
  - 128 streams/proc, flat hashed memory, full/empty bit on each word for synchronization
  - Executes a pipelined instruction from a different stream at each clock tick
- Dynamically assigns triangles to threads
  - Implicit load balancing
  - Low-level synchronization variables ensure adjacent triangles do not update shared edges or vertices simultaneously
- No partitioning, remapping, or graph coloring required
  - Basically, the NO_COLOR strategy
- Minimal programming to create the multithreaded version
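A portable sketch of the NO_COLOR-style synchronization: on the Tera MTA the per-word full/empty bits provide this fine-grained locking in hardware, so the C11 atomic_flag spinlocks below only emulate the idea on conventional hardware. All type and field names are assumptions, not the original code.

```c
#include <stdatomic.h>

/* Emulation of fine-grained object locking: before refining a triangle, a
 * thread "locks" each of its edges and vertices so adjacent triangles are
 * never updated concurrently.  On the MTA the hardware full/empty bit plays
 * this role; here atomic_flag spinlocks stand in for it.  Each flag must be
 * initialized with ATOMIC_FLAG_INIT when the mesh objects are created. */
typedef struct { atomic_flag lock; /* ... edge data ... */   } Edge;
typedef struct { atomic_flag lock; /* ... vertex data ... */ } Vertex;
typedef struct { Edge *e[3]; Vertex *v[3]; /* ... */         } Triangle;

static void acquire(atomic_flag *f) { while (atomic_flag_test_and_set(f)) { /* spin */ } }
static void release(atomic_flag *f) { atomic_flag_clear(f); }

void refine_triangle_nocolor(Triangle *t)
{
    /* Lock every edge and vertex of this triangle.  (In practice a fixed
     * global ordering of objects would be needed to avoid deadlock.) */
    for (int i = 0; i < 3; i++) acquire(&t->e[i]->lock);
    for (int i = 0; i < 3; i++) acquire(&t->v[i]->lock);

    /* ... subdivide the triangle and update the shared edges/vertices ... */

    for (int i = 0; i < 3; i++) release(&t->v[i]->lock);
    for (int i = 0; i < 3; i++) release(&t->e[i]->lock);
}
```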

Performance of Multithreading Code
- Sufficient instruction-level parallelism exists to tolerate memory access overhead and lightweight synchronization
- Number of streams changed via compiler directive

Schematic of Different Paradigms
[Figure: the mesh before and after adaptation under the shared-memory, multithreading, and distributed-memory paradigms (P=2 for distributed memory).]

Comparison and Conclusions
- Different programming paradigms require varying numbers of operations and overheads
- Multithreaded systems offer tremendous potential for solving some of the most challenging real-life problems on parallel computers