
Course Outline
- Introduction to algorithms and applications
- Parallel machines and architectures: overview of parallel machines, trends in the top-500, cluster computers, BlueGene
- Programming methods, languages, and environments: message passing (SR, MPI, Java); higher-level language: HPF
- Applications: N-body problems, search algorithms, bioinformatics
- Grid computing: multimedia content analysis on Grids (guest lecture by Frank Seinstra)

N-Body Methods
Source: "Load Balancing and Data Locality in Adaptive Hierarchical N-Body Methods: Barnes-Hut, Fast Multipole, and Radiosity" by Singh, Holt, Totsuka, Gupta, and Hennessy (except Sections , 4.2, 9, and 10)

N-body problems
- Given are N bodies (molecules, stars, ...)
- The bodies exert forces on each other (Coulomb, gravity, ...)
- Problem: simulate the behavior of the system over time
- Many applications:
  - Astrophysics (stars in a galaxy)
  - Plasma physics (ions/electrons)
  - Molecular dynamics (atoms/molecules)
  - Computer graphics (radiosity)

Basic N-body algorithm

for each timestep do
    Compute forces between all bodies
    Compute new positions and velocities
od

- O(N²) compute time per timestep
- Too expensive for realistic problems (e.g., galaxies)
- Barnes-Hut is an O(N log N) algorithm for hierarchical N-body problems
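
The direct method can be sketched in a few lines. Below is a minimal, illustrative Java version of one O(N²) timestep in 2D; the Body class, the constants G and DT, and the softening term EPS are assumptions for the sketch, not taken from the slides.

    // One direct O(N^2) timestep: all-pairs forces, then position/velocity update.
    // Body, G, DT and the softening EPS are illustrative choices for this sketch.
    class Body {
        double x, y;        // position
        double vx, vy;      // velocity
        double fx, fy;      // accumulated force for the current timestep
        double mass;
    }

    class DirectNBody {
        static final double G = 6.674e-11, DT = 0.01, EPS = 1e-9;

        static void timestep(Body[] bodies) {
            for (Body b : bodies) { b.fx = 0; b.fy = 0; }
            // Compute forces between all bodies: N(N-1)/2 interactions.
            for (int i = 0; i < bodies.length; i++) {
                for (int j = i + 1; j < bodies.length; j++) {
                    Body a = bodies[i], b = bodies[j];
                    double dx = b.x - a.x, dy = b.y - a.y;
                    double dist = Math.sqrt(dx * dx + dy * dy) + EPS;
                    double f = G * a.mass * b.mass / (dist * dist);
                    a.fx += f * dx / dist;  a.fy += f * dy / dist;
                    b.fx -= f * dx / dist;  b.fy -= f * dy / dist;
                }
            }
            // Compute new velocities and positions.
            for (Body b : bodies) {
                b.vx += DT * b.fx / b.mass;  b.vy += DT * b.fy / b.mass;
                b.x  += DT * b.vx;           b.y  += DT * b.vy;
            }
        }
    }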

Hierarchical N-body problems
- Exploit the physics of many applications: forces fall off very rapidly with the distance between bodies, so long-range interactions can be approximated
- Key idea: a group of distant bodies is approximated by a single body with the same mass and center-of-mass
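
Concretely, a distant group of bodies with masses m_i at positions x_i is summarized by its total mass M = Σ m_i and its center of mass X = (Σ m_i · x_i) / M; the force on a far-away body is then computed as if the whole group were a single body of mass M located at X.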

Data structure
- Octree (3D) or quadtree (2D): hierarchical representation of physical space
- Building the tree:
  - Start with one cell containing all bodies (bounding box)
  - Recursively split cells with multiple bodies into sub-cells
- Example: Fig. 5 from the paper
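
The construction step can be illustrated with a small quadtree sketch in Java (the 2D case; the octree is analogous with eight children). The Cell class and its fields are assumptions for this sketch and reuse the illustrative Body class from the earlier code; real implementations also cap the recursion depth to cope with coincident bodies.

    import java.util.ArrayList;
    import java.util.List;

    class Cell {
        double x, y, size;                        // lower-left corner and side length
        List<Body> bodies = new ArrayList<>();    // bodies contained in this cell
        Cell[] children;                          // null for a leaf
        double mass, cmX, cmY;                    // filled in by the center-of-mass pass

        Cell(double x, double y, double size) { this.x = x; this.y = y; this.size = size; }

        // Recursively split cells containing more than one body into four sub-cells.
        void split() {
            if (bodies.size() <= 1) return;
            double h = size / 2;
            children = new Cell[] {
                new Cell(x, y, h),      new Cell(x + h, y, h),
                new Cell(x, y + h, h),  new Cell(x + h, y + h, h)
            };
            for (Body b : bodies) {
                int i = (b.x < x + h ? 0 : 1) + (b.y < y + h ? 0 : 2);
                children[i].bodies.add(b);
            }
            for (Cell c : children) c.split();
        }
    }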

Barnes-Hut algorithm

for each timestep do
    Build tree
    Compute center-of-mass for each cell
    Compute forces between all bodies
    Compute new positions and velocities
od

- Building the tree: recursive algorithm (can be parallelized)
- Center-of-mass: upward pass through the tree
- Compute forces: 90% of the time
- Update positions and velocities: simple (given the forces)
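
The center-of-mass step is a single bottom-up traversal. A possible Java rendering, continuing the illustrative Cell/Body sketch from the previous slides (not code from the paper):

    // Upward pass: compute total mass and center of mass for every cell.
    static void computeCenterOfMass(Cell cell) {
        cell.mass = 0; cell.cmX = 0; cell.cmY = 0;
        if (cell.children == null) {                       // leaf cell
            for (Body b : cell.bodies) {
                cell.mass += b.mass;
                cell.cmX  += b.mass * b.x;
                cell.cmY  += b.mass * b.y;
            }
        } else {
            for (Cell c : cell.children) {
                computeCenterOfMass(c);                    // children first (upward pass)
                cell.mass += c.mass;
                cell.cmX  += c.mass * c.cmX;
                cell.cmY  += c.mass * c.cmY;
            }
        }
        if (cell.mass > 0) { cell.cmX /= cell.mass; cell.cmY /= cell.mass; }
    }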

Force computation of Barnes-Hut

for each body B do
    B.force := ComputeForce(tree.root, B)
od

function ComputeForce(cell, B): float;
    if distance(B, cell.CenterOfMass) > threshold then
        return DirectForce(B.position, B.Mass, cell.CenterOfMass, cell.Mass)
    else
        sum := 0.0
        for each subcell C in cell do
            sum +:= ComputeForce(C, B)
        od
        return sum
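
A possible Java rendering of this pseudocode, reusing the illustrative Cell/Body classes from the earlier sketches. The slide's opening criterion is a plain distance threshold; production Barnes-Hut codes usually open a cell when its size is large relative to the distance (s/d > θ) instead. THRESHOLD and G are illustrative values.

    static final double THRESHOLD = 1.0;    // illustrative opening threshold
    static final double G = 6.674e-11;

    static void computeForce(Cell cell, Body b) {
        if (cell.mass == 0) return;                        // empty cell
        double dx = cell.cmX - b.x, dy = cell.cmY - b.y;
        double dist = Math.sqrt(dx * dx + dy * dy);
        if (dist == 0) return;                             // the cell holds only b itself
        if (dist > THRESHOLD || cell.children == null) {
            // Far enough away (or a leaf): interact with the cell's center of mass.
            double f = G * b.mass * cell.mass / (dist * dist);
            b.fx += f * dx / dist;
            b.fy += f * dy / dist;
        } else {
            // Too close: open the cell and recurse into its subcells.
            for (Cell c : cell.children) computeForce(c, b);
        }
    }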

Parallelizing Barnes-Hut
- Distribute bodies over all processors; in each timestep, processors work on different bodies
- Communication/synchronization needed during:
  - Tree building
  - Center-of-mass computation
  - Force computation
- Key problem: efficient parallelization of the force computation
- Issues: load balancing, data locality

Load balancing
- Goal: each processor must get the same amount of work
- Problem: the amount of work per body differs widely

Data locality
- Goal: each CPU must access a small number of bodies many times; this reduces communication overhead
- Problems:
  - Access patterns to bodies are not known in advance
  - Distribution of bodies in space changes (slowly)

Simple distribution strategies
- Distribute iterations (iteration = computations on a single body in one timestep)
- Strategy 1 (static distribution): each processor gets an equal number of iterations
- Strategy 2 (dynamic distribution): distribute iterations dynamically
- Problems:
  - Distributing iterations does not take locality into account
  - Static distribution leads to load imbalances
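
For illustration, the two strategies might look roughly like this in Java (the names and the shared counter are assumptions for the sketch, not from the paper):

    // Strategy 1 (static): each processor owns an equal-sized, fixed block of iterations.
    static int staticOwner(int bodyIndex, int numBodies, int numProcs) {
        int block = (numBodies + numProcs - 1) / numProcs;   // ceiling(N / P)
        return bodyIndex / block;
    }

    // Strategy 2 (dynamic): processors repeatedly grab the next unprocessed body
    // from a shared counter (simple self-scheduling); reset the counter each timestep.
    static final java.util.concurrent.atomic.AtomicInteger next =
            new java.util.concurrent.atomic.AtomicInteger(0);

    static int grabNextIteration(int numBodies) {
        int i = next.getAndIncrement();
        return i < numBodies ? i : -1;                       // -1: no work left
    }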

More advanced distribution strategies
- Load balancing: cost model
  - Associate a computational cost with each body
  - Cost = amount of work (number of interactions) during the previous timestep
  - Each processor gets the same total cost
  - Works well because the system changes slowly
- Data locality: costzones
  - Observation: the octree more or less represents the spatial (physical) distribution of bodies
  - Thus: partition the tree, not the iterations
  - Costzone: contiguous zone of costs
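
A rough costzones sketch: traverse the tree in a fixed, depth-first child order and hand out contiguous zones of roughly equal total cost to successive processors. This reuses the illustrative Cell/Body classes from the earlier sketches and assumes two extra per-body fields, cost (interactions counted in the previous timestep) and owner, which are not in the original code.

    static void assignCostzones(Cell root, int numProcs) {
        double perProc = totalCost(root) / numProcs;         // target cost per processor
        assign(root, new double[] {0.0}, perProc, numProcs);
    }

    static double totalCost(Cell cell) {
        double sum = 0;
        if (cell.children == null) {
            for (Body b : cell.bodies) sum += b.cost;
        } else {
            for (Cell c : cell.children) sum += totalCost(c);
        }
        return sum;
    }

    // costSoFar is threaded through the traversal so that each processor's zone is a
    // contiguous piece of the tree, which is what gives costzones their data locality.
    static void assign(Cell cell, double[] costSoFar, double perProc, int numProcs) {
        if (cell.children == null) {
            for (Body b : cell.bodies) {
                b.owner = Math.min(numProcs - 1, (int) (costSoFar[0] / perProc));
                costSoFar[0] += b.cost;
            }
        } else {
            for (Cell c : cell.children) assign(c, costSoFar, perProc, numProcs);
        }
    }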

Example costzones
Optimization: improve locality using a clever child-numbering scheme

Experimental system: DASH
- DASH multiprocessor
  - Designed at Stanford University
  - One of the first NUMAs (Non-Uniform Memory Access machines)
- DASH architecture
  - Memory is physically distributed
  - The programmer sees a shared address space
  - Hardware moves data between processors and caches it
  - Implemented using a directory-based cache-coherence protocol

DASH prototype
- 48-node DASH system: 12 clusters of 4 processors (MIPS R3000) each
- Shared bus within each cluster
- Mesh network between clusters
- Remote reads are 4x more expensive than local reads
- Also built a simulator: more flexible than the real hardware, but much slower

Performance results on DASH
- Costzones reduce load imbalance and communication overhead
- Moderate improvement in speedups on DASH, due to the low communication/computation ratio

Speedups measured on DASH (figure 17)

Simulator statistics (figure 18)

Conclusions
- Parallelizing the efficient O(N log N) algorithm is much harder than parallelizing the O(N²) algorithm
- Barnes-Hut has nonuniform, dynamically changing behavior
- Key issues to obtain good speedups for Barnes-Hut:
  - Load balancing -> cost model
  - Data locality -> costzones
- Optimizations exploit physical properties of the application