Evaluation and Optimization of a Titanium Adaptive Mesh Refinement
Amir Kamil, Ben Schwarz, Jimmy Su

Adaptive Mesh Refinement
Grid-based V-cycle code written in Titanium, severely limited by communication overheads:
–Communication between levels of the V-cycle: interpolation
–Communication within a level of the V-cycle: boundary exchange
AMR uses a hierarchy of grids, with fine grids nested inside coarse ones. Grids are refined only where extra resolution is needed.
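To make the communication pattern concrete, here is a minimal, hypothetical sketch of the V-cycle control flow in Java; the class and method names (Level, relax, restrictTo, interpolateFrom, exchangeBoundaries) are placeholders, not the actual Titanium AMR code, and the stub bodies only mark where communication would occur.

```java
// Hypothetical sketch of the AMR V-cycle control flow (not the actual Titanium code).
class Level {
    final int refinement;     // grid spacing is halved at each finer level
    final Level coarser;      // next coarser level, or null at the bottom

    Level(int refinement, Level coarser) {
        this.refinement = refinement;
        this.coarser = coarser;
    }

    void exchangeBoundaries()     { /* within-level communication: boundary exchange */ }
    void relax()                  { /* local smoothing on this level's grids */ }
    void restrictTo(Level c)      { /* between-level communication: fine -> coarse */ }
    void interpolateFrom(Level c) { /* between-level communication: coarse -> fine (interpolation) */ }

    // One V-cycle: smooth, go down to the coarsest level, then come back up.
    void vCycle() {
        exchangeBoundaries();
        relax();
        if (coarser != null) {
            restrictTo(coarser);
            coarser.vCycle();
            interpolateFrom(coarser);
            exchangeBoundaries();
            relax();
        }
    }

    public static void main(String[] args) {
        Level coarse = new Level(1, null);
        Level fine   = new Level(2, coarse);
        fine.vCycle();            // walk down to the coarsest level and back
    }
}
```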

Unix Processes vs. Pthreads
Pthreads were expected to perform better, due to shared-memory communication. However, some parts of the code perform worse with Pthreads.
[Table: time in seconds for Solve, AMR Exchange, GSRB, CFInterp, and Initialization under the processes/Pthreads-per-process configurations 1/1, 2/1, 1/2, 4/1, 1/4, 8/1, and 1/8; the numeric values did not survive the transcript.]

Static vs. Instance Accesses
Static variable accesses are slow in Titanium, and the slowdown is much greater with Pthreads (~7 times slower than with processes). Indirect accesses through wrappers can eliminate much of the cost.
[Table: time in seconds for instance, static, and indirect accesses, for 10M and 100M accesses, each under the 8/1 and 1/8 processes/Pthreads-per-process configurations; the numeric values did not survive the transcript.]
This only accounts for a small fraction of the slowdown in AMR.
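The slides do not show the wrapper code; the following is a minimal Java sketch of the general idea, where the hot loop reads the state through an instance reference fetched once instead of going through the static field on every access. The class names StaticState and StaticWrapper are hypothetical.

```java
// Hypothetical illustration of replacing hot static-field accesses with
// indirect accesses through an instance "wrapper"; not the actual Titanium code.
class StaticState {
    static long counter;                 // slow path: every access goes through the static
}

class StaticWrapper {
    long counter;                        // same state, but held in an instance
}

public class IndirectAccessDemo {
    public static void main(String[] args) {
        final int N = 10_000_000;

        // Direct static accesses in the hot loop.
        for (int i = 0; i < N; i++) {
            StaticState.counter++;
        }

        // Indirect accesses: fetch the wrapper once, then use instance accesses.
        StaticWrapper w = new StaticWrapper();
        for (int i = 0; i < N; i++) {
            w.counter++;
        }

        System.out.println(StaticState.counter + " " + w.counter);
    }
}
```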

Cache Misses
Since Pthreads share the heap segment of memory, false sharing can occur, resulting in many more cache misses. The two slowest lines of code in initialization show many more data cache misses with Pthreads than with processes.
[Table: time (seconds) and cache misses (millions) for the two slowest lines, each under the 8/1 and 1/8 processes/Pthreads-per-process configurations; the line numbers and values did not survive the transcript.]
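The slides do not identify the offending data structures; as a generic illustration of false sharing under shared-heap threads, the Java sketch below has worker threads update adjacent elements of one array (likely to land on the same cache line) versus padded slots. All names and constants here are hypothetical.

```java
// Generic false-sharing illustration (hypothetical, not from the AMR code):
// threads that update adjacent array slots hit the same cache line, while
// padding each thread's slot apart avoids the effect.
public class FalseSharingDemo {
    static final int THREADS = 8;
    static final long ITERS = 50_000_000L;
    static final int PAD = 16;                   // 16 longs = 128 bytes between slots

    static long run(int stride) throws InterruptedException {
        long[] counters = new long[THREADS * stride];
        Thread[] workers = new Thread[THREADS];
        long start = System.nanoTime();
        for (int t = 0; t < THREADS; t++) {
            final int slot = t * stride;
            workers[t] = new Thread(() -> {
                for (long i = 0; i < ITERS; i++) counters[slot]++;
            });
            workers[t].start();
        }
        for (Thread w : workers) w.join();
        return (System.nanoTime() - start) / 1_000_000;   // milliseconds
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("adjacent slots: " + run(1) + " ms");
        System.out.println("padded slots:   " + run(PAD) + " ms");
    }
}
```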

Load Balancing
Load balancing conflicts with optimizing to reduce communication:
–Interpolation requires communication between coarse grids and fine grids.
–This favors allocating overlapping coarse and fine grids to a single processor, so that interpolation requires no communication.
Rough estimate of the imbalance: time spent waiting at barriers (different data sets for each configuration).
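The slides do not show how the barrier-wait time was collected; one simple way to obtain such a rough estimate, sketched below in plain Java rather than Titanium, is to accumulate per-thread time spent blocked at each barrier. All names here are illustrative.

```java
import java.util.concurrent.BrokenBarrierException;
import java.util.concurrent.CyclicBarrier;

// Hypothetical sketch: estimate load imbalance as per-thread time spent waiting at barriers.
public class BarrierWaitEstimate {
    public static void main(String[] args) throws Exception {
        final int THREADS = 4;
        final int STEPS = 100;
        final CyclicBarrier barrier = new CyclicBarrier(THREADS);
        final long[] waitNanos = new long[THREADS];

        Thread[] workers = new Thread[THREADS];
        for (int t = 0; t < THREADS; t++) {
            final int id = t;
            workers[t] = new Thread(() -> {
                try {
                    for (int s = 0; s < STEPS; s++) {
                        doLocalWork(id);                    // imbalanced work per thread
                        long start = System.nanoTime();
                        barrier.await();                    // wait for the slowest thread
                        waitNanos[id] += System.nanoTime() - start;
                    }
                } catch (InterruptedException | BrokenBarrierException e) {
                    Thread.currentThread().interrupt();
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) w.join();
        for (int t = 0; t < THREADS; t++) {
            System.out.printf("thread %d waited %.1f ms at barriers%n", t, waitNanos[t] / 1e6);
        }
    }

    // Stand-in for the real per-thread work; thread 0 is deliberately slower.
    static void doLocalWork(int id) {
        long x = 0;
        long n = (id == 0) ? 2_000_000 : 500_000;
        for (long i = 0; i < n; i++) x += i;
        if (x == -1) System.out.println(x);   // prevent dead-code elimination
    }
}
```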

Optimizing with Communication Primitives
Prefer Titanium's exchange over broadcast during initialization.
[Diagram: a distributed data structure with data blocks D1–D4 spread across processors P1 and P2.]
–Broadcast: each data block (D) is broadcast to everyone else; broadcasts execute serially.
–Exchange: exchanges can happen in parallel.
* Results pending
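The Titanium code itself is not shown on the slide; the following Java-threads analogy (not Titanium syntax) illustrates the difference: a broadcast-style setup publishes one block per round with everyone synchronizing each round, while an exchange-style setup lets every thread publish its block in the same round with a single synchronization. The directory structure and names are hypothetical.

```java
import java.util.concurrent.CyclicBarrier;

// Java-threads analogy (not Titanium code) for building a distributed directory:
// "broadcast" publishes one block per round, serializing the rounds, while
// "exchange" lets every thread publish its block concurrently in one round.
public class ExchangeVsBroadcast {
    public static void main(String[] args) throws Exception {
        final int P = 4;
        final double[][] directory = new double[P][];   // slot p holds thread p's block
        final CyclicBarrier barrier = new CyclicBarrier(P);

        Thread[] threads = new Thread[P];
        for (int p = 0; p < P; p++) {
            final int me = p;
            threads[p] = new Thread(() -> {
                try {
                    double[] myBlock = new double[1024];

                    // Broadcast-style: one publisher per round, P rounds in total.
                    for (int sender = 0; sender < P; sender++) {
                        if (sender == me) directory[me] = myBlock;
                        barrier.await();                 // everyone waits for each round
                    }

                    // Exchange-style: everyone publishes at once, one round.
                    directory[me] = myBlock;
                    barrier.await();                     // single synchronization point

                    double[] neighbor = directory[(me + 1) % P];  // now readable by all
                    if (neighbor == null) System.out.println("missing block");
                } catch (Exception e) {
                    Thread.currentThread().interrupt();
                }
            });
            threads[p].start();
        }
        for (Thread t : threads) t.join();
    }
}
```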

Exchange
The exchange operation copies the elements in the overlap sections between boxes. It accounts for 30% of total running time on the sequential backend. The original code has two implementations of the exchange operation:
–Array copy
–Foreach loop over the intersection domain
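For reference, a minimal Java-style sketch of the second variant, a loop over the points of a rectangular intersection domain, is shown below. The Box class, the inline intersection computation, and the flat-array indexing are simplified stand-ins for the Titanium arrays and domains used in the real code.

```java
// Simplified stand-in for the "foreach over the intersection domain" exchange:
// copy every point of the rectangular overlap of two 2-D boxes from src to dst.
class Box {
    int xlo, ylo, xhi, yhi;          // inclusive bounds
    double[] data;                   // row-major storage

    Box(int xlo, int ylo, int xhi, int yhi) {
        this.xlo = xlo; this.ylo = ylo; this.xhi = xhi; this.yhi = yhi;
        this.data = new double[(xhi - xlo + 1) * (yhi - ylo + 1)];
    }

    int index(int x, int y) { return (y - ylo) * (xhi - xlo + 1) + (x - xlo); }

    // Copy the overlap region of src into this box, point by point.
    void copyOverlapFrom(Box src) {
        int xl = Math.max(xlo, src.xlo), xh = Math.min(xhi, src.xhi);
        int yl = Math.max(ylo, src.ylo), yh = Math.min(yhi, src.yhi);
        for (int y = yl; y <= yh; y++)
            for (int x = xl; x <= xh; x++)
                data[index(x, y)] = src.data[src.index(x, y)];
    }

    public static void main(String[] args) {
        Box a = new Box(0, 0, 7, 7), b = new Box(4, 4, 11, 11);
        java.util.Arrays.fill(b.data, 1.0);
        a.copyOverlapFrom(b);
        System.out.println(a.data[a.index(5, 5)]);   // 1.0: point in the overlap
    }
}
```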

Exchange Optimizations
Domain intersection calculation accounts for 46% of the running time of exchange; amortize the intersection cost by caching the calculation.
About 20% of the intersections contain only a single element.
–Most result from the intersections of the corners of two boxes, which are not needed in the calculation.
–Checking for this case yields a 44% speedup in exchange.
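A minimal sketch of the caching idea, under the assumption that the boxes do not move between V-cycles, might memoize each box pair's overlap and skip single-point overlaps such as touching corners. The IntersectionCache class and its method names below are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the two exchange optimizations: cache each box pair's
// intersection (the boxes are static across V-cycles), and skip degenerate
// single-point overlaps such as box corners, which the stencil does not need.
public class IntersectionCache {
    // A rectangular region with inclusive integer bounds.
    record Rect(int xlo, int ylo, int xhi, int yhi) {
        boolean isEmpty() { return xlo > xhi || ylo > yhi; }
        long size()       { return isEmpty() ? 0 : (long) (xhi - xlo + 1) * (yhi - ylo + 1); }
        Rect intersect(Rect o) {
            return new Rect(Math.max(xlo, o.xlo), Math.max(ylo, o.ylo),
                            Math.min(xhi, o.xhi), Math.min(yhi, o.yhi));
        }
    }

    private final Map<Long, Rect> cache = new HashMap<>();

    // Returns the overlap of boxes a and b, or null if it is empty or a
    // single point (e.g. touching corners) and can be skipped entirely.
    Rect overlap(int idA, Rect a, int idB, Rect b) {
        long key = ((long) idA << 32) | (idB & 0xffffffffL);
        Rect r = cache.computeIfAbsent(key, k -> a.intersect(b));
        return (r.isEmpty() || r.size() == 1) ? null : r;
    }

    public static void main(String[] args) {
        IntersectionCache c = new IntersectionCache();
        Rect a = new Rect(0, 0, 7, 7), b = new Rect(7, 7, 11, 11);   // touch only at (7,7)
        System.out.println(c.overlap(0, a, 1, b));                   // null: corner overlap skipped
    }
}
```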

Box Overlap Representation
Both the boxes and the box intersections are static throughout the execution of the program. Box overlaps are currently represented as intersections of Titanium arrays. Even with caching of the intersection calculations, some overheads are unavoidable:
–Loop overhead for small intersections
–Translation of points in a domain to virtual addresses

Experimental Representation
Represent all intersections as an array of source/destination C pointer pairs: 40% speedup over the intersection-caching version.
Work in progress on the initial version to reduce the size of the representation:
–Represent small intersections as pointer pairs
–Represent large intersections as a base pointer and strides
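As a rough illustration of the idea (in Java, using flat arrays and precomputed indices where the real version stores raw C pointers), the overlap copy reduces to a single loop over precomputed source/destination pairs, plus a base-offset-and-stride form for large intersections. All names below are illustrative.

```java
// Illustrative (Java) analogue of the experimental representation: precompute,
// once, flat source/destination positions for every overlap point, so the
// exchange itself is a plain indexed copy loop with no domain arithmetic.
class PrecomputedOverlap {
    // Small intersections: explicit (src, dst) position pairs.
    int[] srcIdx;
    int[] dstIdx;

    void copySmall(double[] src, double[] dst) {
        for (int i = 0; i < srcIdx.length; i++)
            dst[dstIdx[i]] = src[srcIdx[i]];
    }

    // Large intersections: base offsets plus strides for a rows x cols block.
    int srcBase, dstBase, srcStride, dstStride, rows, cols;

    void copyLarge(double[] src, double[] dst) {
        for (int r = 0; r < rows; r++) {
            int s = srcBase + r * srcStride;
            int d = dstBase + r * dstStride;
            for (int c = 0; c < cols; c++)
                dst[d + c] = src[s + c];
        }
    }

    public static void main(String[] args) {
        PrecomputedOverlap o = new PrecomputedOverlap();
        o.srcIdx = new int[]{3, 4, 5};
        o.dstIdx = new int[]{0, 1, 2};
        double[] src = {0, 0, 0, 7, 8, 9}, dst = new double[3];
        o.copySmall(src, dst);
        System.out.println(java.util.Arrays.toString(dst));   // [7.0, 8.0, 9.0]
    }
}
```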

Conclusion
The Titanium AMR code is faster than the C++/MPI implementation of Chombo.
More speedups are pending from work in progress.