Lecture 10 CSS314 Parallel Computing


Lecture 10 CSS314 Parallel Computing
Book: “An Introduction to Parallel Programming” by Peter Pacheco
http://instructor.sdu.edu.kz/~moodle/
M.Sc. Bogdanchikov Andrey, Suleyman Demirel University

Content (Chapter 6)
Two n-body solvers
The problem
Two serial programs
Parallelizing the n-body solvers
Performance of the MPI solvers

Parallel Program Development In this chapter, we’ll look at a couple of larger examples: solving n-body problems and solving the traveling salesperson problem. For each problem, we’ll start by looking at a serial solution and examining modifications to the serial solution. As we apply Foster’s methodology, we’ll see that there are some striking similarities between developing shared- and distributed-memory programs.

Two n-body solvers In an n-body problem, we need to find the positions and velocities of a collection of interacting particles over a period of time. For example, an astrophysicist might want to know the positions and velocities of a collection of stars, while a chemist might want to know the positions and velocities of a collection of molecules or atoms. An n-body solver is a program that finds the solution to an n-body problem by simulating the behavior of the particles.

n-body problem The input to the problem is the mass, position, and velocity of each particle at the start of the simulation, and the output is typically the position and velocity of each particle at a sequence of user-specified times, or simply the position and velocity of each particle at the end of a user-specified time period. Let’s first develop a serial n-body solver. Then we’ll try to parallelize it for both shared- and distributed-memory systems.

The problem For the sake of explicitness, let’s write an n-body solver that simulates the motions of planets or stars. We’ll use Newton’s second law of motion and his law of universal gravitation to determine the positions and velocities. Thus, if particle q has position s_q(t) at time t, and particle k has position s_k(t), then the force on particle q exerted by particle k is given by

f_qk(t) = -G m_q m_k (s_q(t) - s_k(t)) / |s_q(t) - s_k(t)|^3,

where G is the gravitational constant and m_q and m_k are the masses of particles q and k. The remaining formulas (the total force on a particle and the finite-difference updates of position and velocity) are left for self-study.

The problem We’ll suppose that we either want to find the positions and velocities at the times t = 0,dt,2dt, … ,Tdt, or, more often, simply the positions and velocities at the final time Tdt. Here, dt and T are specified by the user, so the input to the program will be n, the number of particles, dt, T, and, for each particle, its mass, its initial position, and its initial velocity.
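For the update step itself, the serial solver advances each particle from t to t + dt with a first-order (Euler) scheme; a minimal sketch, assuming 2D positions and velocities:

```c
/* One Euler step for a single particle: advance the position with the
   current velocity, then the velocity with the current total force
   (acceleration = force / mass, by Newton's second law). */
void euler_step(double pos[2], double vel[2],
                const double force[2], double mass, double dt) {
    pos[0] += dt * vel[0];
    pos[1] += dt * vel[1];
    vel[0] += dt / mass * force[0];
    vel[1] += dt / mass * force[1];
}
```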

Two serial programs In outline, a serial n-body solver can be based on the following pseudocode:
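A sketch of that outline, lightly paraphrased from the book:

```
Get input data;
for each timestep {
   if (timestep output desired)
      Print positions and velocities of particles;
   for each particle q
      Compute total force on q;
   for each particle q
      Compute position and velocity of q;
}
Print positions and velocities of particles;
```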

Refined program We can use our formula for the total force on a particle (Formula 6.2) to refine our pseudocode for the computation of the forces in Lines 4–5:
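A paraphrased sketch of that refinement, assuming arrays pos, masses, and forces:

```
for each particle q {
   forces[q] = 0;
   for each particle k != q {
      x_diff = pos[q][X] - pos[k][X];
      y_diff = pos[q][Y] - pos[k][Y];
      dist = sqrt(x_diff*x_diff + y_diff*y_diff);
      dist_cubed = dist*dist*dist;
      forces[q][X] -= G*masses[q]*masses[k]/dist_cubed * x_diff;
      forces[q][Y] -= G*masses[q]*masses[k]/dist_cubed * y_diff;
   }
}
```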

Observation We can use Newton’s third law of motion, that is, for every action there is an equal and opposite reaction, to halve the total number of calculations required for the forces. If the force on particle q due to particle k is fqk, then the force on k due to q is -fqk.

Modified solution Our original solver simply adds all of the entries in row q to get forces[q]. In our modified solver, when q = 0, the body of the loop for each particle q will add the entries in row 0 into forces[0]. It will also add the kth entry in column 0 into forces[k] for k = 1, 2, … , n-1. In general, the qth iteration will add the entries to the right of the diagonal (that is, to the right of the 0) in row q into forces[q], and the entries below the diagonal in column q will be added into their respective forces; that is, the kth entry will be added into forces[k].

Modified Solution

Two n-body solutions In order to distinguish between the two algorithms, we’ll call the n-body solver with the original force calculation the basic algorithm, and the solver with the reduced number of calculations the reduced algorithm. Further discussion of the implementations is left for self-study (Section 6.1.2).

Parallelizing the n-body solvers Let’s try to apply Foster’s methodology to the n-body solver. Since we initially want lots of tasks, we can start by making our tasks the computations of the positions, the velocities, and the total forces at each timestep. We begin with the basic algorithm, in which the total force on each particle is calculated directly from Formula 6.2.

Communications among tasks in the basic n-body solver

Agglomeration The communications among the tasks are illustrated in the previous figure. The figure makes it clear that most of the communication occurs among the tasks associated with an individual particle, so if we agglomerate the computations of s_q(t), v_q(t), and F_q(t), our intertask communication is greatly simplified. Now the tasks correspond to the particles, and in the figure we’ve labeled the communications with the data being communicated. For example, the arrow from particle q at timestep t to particle r at timestep t is labeled with s_q, the position of particle q.

Communications among agglomerated tasks in the basic n-body solver

Reduced algorithm For the reduced algorithm, the “intra-particle” communications are the same. That is, to compute sq(t+dt) we’ll need sq(t) and vq(t), and to compute vq(t+dt), we’ll need vq(t) and Fq(t). Therefore, once again it makes sense to agglomerate the computations associated with a single particle into a composite task.

Communications among agglomerated tasks in the reduced n-body solver (q < r)

Reduced Algorithm Recall that in the reduced algorithm we make use of the fact that the force f_rq = -f_qr. So if q < r, then the communication from task r to task q is the same as in the basic algorithm: in order to compute F_q(t), task/particle q will need s_r(t) from task/particle r. However, the communication from task q to task r is no longer s_q(t); it’s the force on particle q due to particle r, that is, f_qr(t).

Mapping The final stage in Foster’s methodology is mapping. If we have n particles and T timesteps, then there will be nT tasks in both the basic and the reduced algorithm. Astrophysical n-body problems typically involve thousands or even millions of particles, so n is likely to be several orders of magnitude greater than the number of available cores. However, T may also be much larger than the number of available cores. So, in principle, we have two “dimensions” to work with when we map tasks to cores.

Mapping basic algorithm At first glance, it might seem that any assignment of particles to cores that assigns roughly n/thread_count particles to each core will do a good job of balancing the workload among the cores, and for the basic algorithm this is the case. In the basic algorithm the work required to compute the position, velocity, and force is the same for every particle.

Mapping reduced algorithm However, in the reduced algorithm the work required in the forces computation loop is much greater for lower-numbered iterations than for higher-numbered iterations. For example, when q = 0 we’ll make n-1 passes through the for each particle k > q loop, while when q = n-1 we won’t make any passes through the loop. Thus, for the reduced algorithm we would expect that a cyclic partition of the particles would do a better job than a block partition of evenly distributing the computation.

Shared-memory system However, in a shared-memory setting, a cyclic partition of the particles among the cores is almost certain to result in a much higher number of cache misses than a block partition; and in a distributed-memory setting, the overhead involved in communicating data that has a cyclic distribution will probably be greater than the overhead involved in communicating data that has a block distribution.

Therefore
1. A block distribution will give the best performance for the basic n-body solver.
2. For the reduced n-body solver, a cyclic distribution will best distribute the workload in the computation of the forces. However, this improved performance may be offset by the cost of reduced cache performance in a shared-memory setting and additional communication overhead in a distributed-memory setting.
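To make the load-balance argument concrete, here is a small illustrative sketch (the owner_* and work_* helpers are hypothetical, not from the book) that counts how many reduced-algorithm pair updates each process receives under a block vs. a cyclic partition:

```c
/* Owner of particle q under a block vs. a cyclic partition of n
   particles among p processes (assumes p divides n evenly). */
int owner_block(int q, int n, int p) { return q / (n / p); }
int owner_cyclic(int q, int p)       { return q % p; }

/* In the reduced algorithm, iteration q performs n-1-q pair updates.
   Sum the work assigned to one process under each ownership rule. */
long work_block(int rank, int n, int p) {
    long w = 0;
    for (int q = 0; q < n; q++)
        if (owner_block(q, n, p) == rank) w += n - 1 - q;
    return w;
}
long work_cyclic(int rank, int n, int p) {
    long w = 0;
    for (int q = 0; q < n; q++)
        if (owner_cyclic(q, p) == rank) w += n - 1 - q;
    return w;
}
```

For n = 8 and p = 2, the block partition gives the first process 22 pair updates and the second only 6, while the cyclic partition gives 16 and 12, a much smaller imbalance.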

Experiments
6.1.8 Parallelizing the solvers using pthreads
6.1.9 Parallelizing the basic solver using MPI
6.1.10 Parallelizing the reduced solver using MPI
Implementations of these solvers are left for your self-study.

Performance of the MPI solvers

Homework
1. Exercise 6.1
2. Find and explain the difference between cyclic partitioning and block partitioning.
3. Describe the strengths and weaknesses of systems implemented with MPI.
4. Describe the strengths and weaknesses of systems implemented with Pthreads.

THE END Thank you