Barnes-Hut – A Broad Review
Abhinav S Bhatele
April 27, 2006

Overview
 - Introduction
 - Sequential Code
 - OpenMP, MPI, Charm++ Implementations
 - Performance Results
 - Match with other models

Introduction
 - Barnes-Hut is a solution to the N-body problem that takes O(n log n) time
 - The basic steps are:
   - Initialise the tree (masses, positions)
   - Compute the forces on the particles
   - Update the positions and migrate the particles
 - We optimize the force calculation on the particles to reduce the cost from O(n^2) to O(n log n)
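
The optimization behind the O(n log n) bound is the standard Barnes-Hut opening criterion: a cell that is far enough away is treated as a single pseudo-particle at its centre of mass. A minimal sketch, assuming hypothetical TreeNode fields size, cmX and cmY (see the node sketch on the next slide) and a tunable opening angle theta:

    #include <math.h>

    #define THETA 0.5   /* typical opening angle; smaller means more accurate */

    /* Is the cell 'node' far enough from the particle at (px, py) to be
       approximated by its centre of mass?  Criterion: s/d < theta. */
    int farEnough(TreeNode* node, double px, double py) {
        double dx = node->cmX - px;
        double dy = node->cmY - py;
        double d  = sqrt(dx*dx + dy*dy);   /* distance from particle to cell */
        return (node->size / d) < THETA;   /* s = side length of the cell */
    }

If the test passes, the whole subtree contributes a single force term; otherwise the traversal descends into the cell's children.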

Data Structures
 - To reduce the number of particles involved in each force calculation, we use quad trees (2D) and oct trees (3D)
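
A minimal sketch of what a quad-tree node might look like, reusing the isLeaf and quadrant[2][2] fields that appear in the serial code later in the deck (the remaining fields are assumptions):

    typedef struct TreeNode {
        int    isLeaf;                    /* 1 if this node holds a single particle */
        struct TreeNode* quadrant[2][2];  /* the four children of an internal node */
        double size;                      /* side length of the region this node covers */
        double totalMass;                 /* total mass of the particles in this subtree */
        double cmX, cmY;                  /* centre of mass of the subtree */
        double x, y;                      /* particle position (leaf nodes only) */
    } TreeNode;

Internal nodes summarise their subtree (total mass and centre of mass), which is exactly what the opening criterion above needs.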

Sequential Code

    for (t=0; t<N; t++) {
        initialise the tree
        calculate forces on each particle due to every other particle
        update the positions and velocities
        migrate the particles if required
    }

Parallelization
 - Common steps across all programming paradigms:
   - Divide the initialised quad tree among the processors
   - Each processor is responsible for calculating the forces on the particles in its subtree
   - Update the positions and velocities
   - Finally, migrate the particles and update the entire tree across all the processors

Serial Implementation

    void doTimeStep(TreeNode* subTree, TreeNode* root) {
        int i, j;
        if (subTree) {
            if (!subTree->isLeaf) {
                for (i = 0; i < 2; i++)
                    for (j = 0; j < 2; j++)
                        if (subTree->quadrant[i][j])
                            doTimeStep(subTree->quadrant[i][j], root);
            }
            else  // subTree is a leaf cell
                calcForces(subTree, root);
        }
        return;
    }

Shared Memory: OpenMP
 - Writing the OpenMP code was relatively easy compared to ??? (we'll see soon!)
 - Since the data structure is a tree rather than an array, simple pragmas like "parallel for" could not be used
 - Instead, threads were created and different parts of the tree were explicitly assigned to them (see the sketch below)
 - Each thread does the calculations for its respective subtree; the rest is managed by OpenMP!!
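
A minimal sketch of that approach, assuming the four top-level quadrants of the tree are handed out to explicitly created threads (doTimeStep and TreeNode are the ones from the serial code; the thread-to-quadrant mapping is an assumption):

    #include <omp.h>

    /* Give each top-level quadrant of the quad tree to a different thread.
       Each thread then runs the serial recursion on its own subtree only. */
    void parallelTimeStep(TreeNode* root) {
        #pragma omp parallel num_threads(4)
        {
            int tid = omp_get_thread_num();
            int i = tid / 2, j = tid % 2;                /* map thread id to a quadrant */
            if (root->quadrant[i][j])
                doTimeStep(root->quadrant[i][j], root);  /* full tree still readable for forces */
        }
    }

Because the whole tree lives in shared memory, each thread can read any remote node during its force calculation without explicit communication; only the work is partitioned.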

Message Passing Models
 - Two ways of solving the problem in this case:
   - Request the parts of the tree you want to work on from the other threads, and then calculate the forces yourself
   - Send your particles' information to the other threads and let them calculate the forces for you
 - The second approach is easier but does not scale very well, because you end up sending a huge number of messages
 - The first approach is better if we use a local cache to bring in part of the remote subtree and then reuse it for all the particles
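
A rough sketch of the caching idea behind the first approach (the cache layout and the fetchNodeFromOwner helper are purely illustrative assumptions): a remote tree node is fetched once and then reused for every local particle that needs it.

    #define CACHE_SIZE 4096

    typedef struct {
        int      owner, nodeId;   /* which process owns the node, and its id there */
        TreeNode node;            /* local copy of the remote node's data */
        int      valid;
    } CacheEntry;

    static CacheEntry cache[CACHE_SIZE];

    /* Return a local copy of a remote tree node, fetching it only on a cache miss. */
    TreeNode* getRemoteNode(int owner, int nodeId) {
        CacheEntry* e = &cache[(owner * 31 + nodeId) % CACHE_SIZE];  /* simple hash */
        if (!e->valid || e->owner != owner || e->nodeId != nodeId) {
            fetchNodeFromOwner(owner, nodeId, &e->node);  /* hypothetical request/reply message */
            e->owner = owner; e->nodeId = nodeId; e->valid = 1;
        }
        return &e->node;
    }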

MPI Implementation
 - Tough going till now!
 - Initial synchronization: all the threads should know about the root and the number of particles held by each thread
 - For all the particles in your tree, send a message with your particles' info to the other threads and receive the answer (the force) in reply
 - Optimization: send info about multiple particles (all those at a leaf) at a time and receive the replies in the same batched fashion
 - Still having issues with the implementation
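
The data flow of this "ship your particles, receive forces" scheme can also be expressed with collectives, which sidesteps the matching of many point-to-point messages. This is only a sketch of the idea, not the slide's actual code: forceFromLocalTree is a hypothetical helper that evaluates the force our subtree exerts on one position, and every rank is assumed to own nLocal particles stored as (x, y) pairs.

    #include <mpi.h>
    #include <stdlib.h>

    void computeAllForces(TreeNode* localTree, double* myPos, double* myForces,
                          int nLocal, int nProcs) {
        int nTotal = nLocal * nProcs;
        double* allPos  = malloc(2 * nTotal * sizeof(double));
        double* partial = calloc(2 * nTotal, sizeof(double));
        double* total   = malloc(2 * nTotal * sizeof(double));

        /* every rank gets every rank's particle positions */
        MPI_Allgather(myPos, 2 * nLocal, MPI_DOUBLE,
                      allPos, 2 * nLocal, MPI_DOUBLE, MPI_COMM_WORLD);

        /* contribution of my subtree to every particle in the system */
        for (int i = 0; i < nTotal; i++)
            forceFromLocalTree(localTree, allPos[2*i], allPos[2*i+1],
                               &partial[2*i], &partial[2*i+1]);

        /* sum the per-rank contributions; each rank reads off its own slice */
        MPI_Allreduce(partial, total, 2 * nTotal, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        for (int i = 0; i < 2 * nLocal; i++)
            myForces[i] = total[2 * nLocal * rank + i];

        free(allPos); free(partial); free(total);
    }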

Charm++ Implementation
 - Not much different from the MPI implementation
 - To get a better degree of virtualization, each chare gets a small part of the tree
 - Entry methods are used for asynchronous calls:
   - sending particle info and asking other chares to do the calculations
   - sending the calculated answers back to the requesting chare
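
A Charm++-flavoured sketch of that asynchronous pattern (the chare name, entry method signatures and the forceFromLocalTree helper are assumptions, not the actual implementation). The .ci interface declarations the entry methods would need are shown as a comment:

    /* .ci sketch:
     *   array [1D] TreePiece {
     *     entry TreePiece();
     *     entry void computeForces(int requester, int nCoords, double pos[nCoords]);
     *     entry void receiveForces(int nCoords, double f[nCoords]);
     *   };
     */
    // #include "treepiece.decl.h"   // generated from the .ci file
    #include <vector>

    class TreePiece : public CBase_TreePiece {
        TreeNode* mySubtree;   // the part of the global tree owned by this chare
    public:
        TreePiece() : mySubtree(0) {}

        // Invoked asynchronously by other chares: compute the forces our subtree
        // exerts on the particles they sent, then reply asynchronously.
        void computeForces(int requester, int nCoords, double* pos) {
            std::vector<double> f(nCoords, 0.0);
            for (int i = 0; i < nCoords / 2; i++)
                forceFromLocalTree(mySubtree, pos[2*i], pos[2*i+1], &f[2*i], &f[2*i+1]);
            thisProxy[requester].receiveForces(nCoords, &f[0]);   // asynchronous entry method call
        }

        // Accumulate a reply for our own particles (accumulation omitted in this sketch).
        void receiveForces(int nCoords, double* f) { /* ... */ }
    };

Because entry method invocations are asynchronous, a chare can keep working on its local particles while requests and replies for remote work are in flight, which is where the extra virtualization pays off.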

Performance
 - The tests for the serial and OpenMP code were performed on Cobalt, the SGI Altix machine at NCSA
 - With OpenMP, there is a speedup of 2 on two processors, but the performance then gets worse (for 2000 particles)
 - For particles, we get speedups on 2 and 8 processors, and the time on 4 processors is almost the same as on a single processor

Performance of OpenMP

Other Models: Good ones!
 - We can look at how suitable other models are from two viewpoints:
   - Ease of coding
   - Performance
 - OpenMP is by far the best model in terms of coding, but not the best when it comes to performance
 - Shared-memory models like UPC and CID, which are designed for irregular data structures, might be a good match
 - Multithreaded programming models like Cilk, where you can spawn off functions, would also be a good match

The Evil Devils!
 - Models which only support shared or distributed arrays are not suitable – HPF, CAF, GA
 - Message-passing / explicit-synchronization models like MPI, BSP and Charm++ are cumbersome
 - Don't know if you can even write it in these without killing yourself multiple times: Linda and a few others …

Thank You!!! Questions???