High Performance Computing: Concepts, Methods & Means
Parallel Algorithms 2
Prof. Thomas Sterling
Department of Computer Science, Louisiana State University
March 6, 2007
Topics: Introduction, Midterm Exam Review, Matrix Multiplication, N-Body Problem, Fast Fourier Transform (FFT), Summary – Materials for Test
Half-Way Through (almost)
More of the same. Today: some basic algorithms (in MPI):
– Matrix-matrix multiply
– N-body
– FFT
But first: a brief walk-through in preparation for the Midterm exam! (Good Luck)
Topics: Introduction, Midterm Exam Review, Matrix Multiplication, N-Body Problem, Fast Fourier Transform (FFT), Summary – Materials for Test
How to Prepare for Midterm
Closed-book exam; it will look like a problem set (use the problem sets as a template).
Study aids:
– Summary slide at the end of each lecture
– Problem sets
Will emphasize:
– Basic knowledge
– Skills
– Performance models
Note:
– To be held in room 338 Johnston Hall
– 1 hour 15 minutes
– Bring a calculator (know how to use it)
HPC in Overview (1st half)
Supercomputing evolves as an interplay of:
– Device technology
– Computer architecture
– Execution models and programming methods
Three classes of parallel computing: capacity, cooperative, capability
Three execution models: throughput, shared-memory multithreaded, communicating sequential processes (message passing)
Three programming formalisms: Condor, OpenMP, MPI
Performance modeling and measurement: metrics, models, measurement tools
S1 – L3: Benchmarking
– Basic performance metrics (slide 4)
– Definition of benchmark in your own words; purpose of benchmarking; properties of a good benchmark (slides 5, 6, 7)
– Linpack: what it is, what it measures, concepts and complexities (slides 15, 17, 18)
– HPL: algorithms and concepts (slides 21 through 24)
– Linpack compare and contrast (slide 25)
– General knowledge about the HPCC and NPB suites (slides 31, 34, 35)
– Benchmark result interpretation (slides 49, 50)
S1 – L4: Capacity Computing
– Understand the material on slides (4, 5), (7, 8)
– Understand the example detailed in slides 17, 18
– Understand (19) and be able to derive (20, 21), (22, 23)
– Understand the Condor concepts detailed in slides 30, 31, 32
– Condor commands (37–47): know what the basic commands are, what they do, and how to interpret the output they present (no need to memorize command-line options)
– Understand the issues listed on slide 53
– Required reading material: http://www.cct.lsu.edu/~cdekate/7600/beowulf-chapter-rev1.pdf (specific pages to focus on: 3–16)
S2 – L1: Architecture
– Need to know the content on slides 11, 15, 22, 23, 33
– Understand how each of the technologies listed on slide 7 affects performance
– Understand the concepts on slides 8, 9
– Understand the concepts on slides 17, 18, 20
– Understand the pipelining concepts and equations detailed in slides 27, 28
– Understand the vector processing concepts and equations detailed in slides 29, 30
S2 – L2: SMP
– Make sure you have addressed all points outlined on slide 5
– Understand the content on slide 7
– Understand the concepts, equations, and problems on slides 11, 12, 13
– Understand the content on slides 21, 24, 26, 29
– Understand the concepts on slides 32, 33
– Understand the content on slides 36, 55
– Required reading material: http://arstechnica.com/articles/paedia/hardware/pcie.ars/1
S2 – L3: PThreads
– Performance & CPI: slide 8
– Multithreading concepts: slides 13, 16, 18, 19, 22, 24, 31
– Thread implementations: slides 35–37
– Pthreads: slides 43–45, 48, 55
S2 – L4: OpenMP
– Components: slide 6
– Compiling: slides 9, 12
– Environment variables: slides 13, 14
– Top level: slide 15
– Shared data: slides 18, 19, 20
– Parallel flow control: slides 23, 24, 25
– Synchronization: slides 32, 34
– Performance: slide 39
– Synopsis: slide 44
S2 – L5: Performance 2
– Measuring system operation: slides 11, 13, 17
– gprof: slides 21, 22
– PerfSuite: slides 25, 29
– PAPI: slides 33–36 (inclusive)
– TAU: slides 56–60 (inclusive)
S3 – L1: Communicating Sequential Processes
– Basics: slides 6–9, 16
– CSP: slide 19
– Unix: slides 24, 28–30
S3 – L2: MPI
– MPI standard: slides 4, 7
– Compile and run an MPI program: slides 10, 11
– Environment functions: slides 12, 14
– Point-to-point functions: slides 27, 28
– Blocking vs. nonblocking: slides 25, 26
– Deadlock: slides 29–31
– Basic collective functions: slides 33, 34, 36, 38, 40, 41, 43
S3 – L3: Performance 3
– Essential MPI: slide 9
– Performance models: slides 12, 15, 16, 18 (Hockney)
– LogP: slides 20–23
– Effective bandwidth: slide 30
– TAU/MPI: slides 41, 43
S3 – L4: Parallel Algorithms
– Introduction: slides 4, 5, 6
– Array decomposition: slides 11, 12
– Mandelbrot load balancing: slides 25, 26
– Monte Carlo, creating communicators: slides 40, 42
System Level Overview
Understand the 3 classes of parallel computing (capacity, cooperative, capability).
Software system:
– Understand the software stack (e.g., OS, compilers, …) used in various supercomputers
– Conceptual understanding of different parallel programming models (e.g., shared memory, message passing, …) and the advantages and disadvantages of each
Computer architecture:
– Understand and be able to discuss the different sources of performance degradation (latency, overhead, etc.)
– Understand Amdahl's Law and be able to solve problems related to it, as well as scalability, efficiency, and CPI
– Understand and be able to describe the different forms of hardware parallelism (pipelining, ILP, multiprocessors (SIMD, MIMD, etc.))
– Understand the numerical problems provided in the Section 1 problem sets (1, 2, 3) and the associated equations and theory behind them
Execution Models
Throughput execution model (e.g., Condor):
– Be aware of the various Condor commands
– Thoroughly understand core Condor concepts (e.g., ClassAds and matchmaking) and how these concepts work together
Shared-memory multithreaded (e.g., OpenMP):
– Understand sources of contention (race conditions) and how to resolve them (critical sections, etc.)
– Understand the various OpenMP constructs and how they work (e.g., be able to answer questions like how and when to use the "critical" construct and its performance implications)
– Understand the concept of shared, private, and reduction variables
– Be able to read and understand simple OpenMP code (C) and be able to make conceptual changes where and when asked
Execution Models and Performance
Communicating Sequential Processes:
– Conceptual understanding of CSP
– Know the meaning and common usage of the various MPI constructs
– Understand fundamental concepts like deadlock and how to resolve it
– Be able to read a small code snippet and correct conceptual (NOT syntactical) problems; you do not need to memorize the syntax of MPI constructs
Performance & benchmarking:
– Be aware of the Top500 list and the benchmarks used
– Be aware of the different benchmarks and what each of them stresses (Linpack, HPL, the different components of HPCC, …)
– Be aware of the different performance tools discussed in class and what they measure
– Understand and be able to solve problems related to the LogP model
Key Terms and Concepts
Speedup: relative reduction of execution time of a fixed-size workload through parallel execution.
Efficiency: ratio of the actual performance to the best possible performance.
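In symbols (a standard formulation consistent with these definitions, with T(1) the one-processor time and T(p) the p-processor time):

S(p) = T(1) / T(p)
E(p) = S(p) / p ≤ 1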
Ideal Speedup Example
(Figure: a workload of W = 2^20 work units is divided into 2^10 tasks, w_1 … w_1024, of 2^10 units each, and executed on P = 2^8 processors.)
T(1) = 2^20
T(2^8) = 2^12
Units: steps
Ideal Speedup Issues
– W is the total workload measured in elemental pieces of work (e.g., operations, instructions, etc.)
– T(p) is the total execution time measured in elemental time steps (e.g., clock cycles), where p is the number of execution sites (e.g., processors, threads)
– w_i is the work for a given task i
– Example: here we divide a million (really Mega) operation workload, W, into a thousand tasks, w_1 to w_1024, each of 1 K operations
– Assume 256 processors performing the workload in parallel: T(256) = 4096 steps, speedup = 256, Eff = 1
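Working the example through (the arithmetic implied by the slide): S = T(1) / T(256) = 2^20 / 2^12 = 2^8 = 256, and E = S / P = 256 / 256 = 1, i.e., ideal speedup.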
Amdahl's Law
(Timeline diagram: the original run of length T_O contains a portion T_F that can be accelerated; in the accelerated run that portion shrinks to T_F/g, giving total time T_A.)
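In formula form (using the figure's symbols: T_O = original time, T_F = the accelerable portion, g = the gain applied to that portion):

T_A = (T_O − T_F) + T_F / g
S = T_O / T_A = 1 / ((1 − f) + f / g),   where f = T_F / T_O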
Overhead
(Figure: a workload split into four tasks, each costing overhead v plus work w, so W = 4v + 4w.)
v = overhead
w = work unit
W = total work
T_i = execution time with i processors
P = # processors
Assumption: workload is infinitely divisible
Scalability & Overhead
v = overhead
w_g = work unit
W = total work
T_i = execution time with i processors
P = # processors
J = # tasks
When W ≫ v, speedup approaches P.
Scalability and Overhead for Fixed-Sized Work Tasks
– W is divided into J tasks of size w_g
– Each task requires overhead work v to manage
– For P processors there are approximately J/P tasks to be performed in sequence, so T_P is J(w_g + v)/P
– Note that S = T_1 / T_P
– So, S = P / (1 + v / w_g)
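Filling in the implied intermediate step (assuming the serial run pays no per-task overhead, so T_1 = J · w_g):

S = T_1 / T_P = (J · w_g) / (J (w_g + v) / P) = P · w_g / (w_g + v) = P / (1 + v / w_g)

so S approaches P as w_g grows large relative to v, matching the W ≫ v condition on the previous slide.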
Measuring LogP Parameters
Finding L + 2*o:
– Proc 0: (MPI_Send() then MPI_Recv()) x N
– Proc 1: (MPI_Recv() then MPI_Send()) x N
– L + 2*o = total time / N
(Figure 1: time diagram for benchmark 1; (a) is the time diagram of processor 0, (b) is the time diagram of processor 1.)
Measuring LogP Parameters
Finding o:
– Proc 0: (MPI_Send() then some_work then MPI_Recv()) x N
– Proc 1: (MPI_Recv() then MPI_Send() then some_work) x N
– o = (1/2) * total time / N − time(some_work)
– Requires time(some_work) > 2*L + 2*o
(Figure 2: time diagram for benchmark 2 with X > 2*L + Or + Os; (a) is the time diagram of processor 1, (b) is the time diagram of processor 2.)
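A minimal MPI sketch of benchmark 1 (my illustration, not lecture code; the one-byte message, tag 0, and N = 10000 are arbitrary choices):

#include "mpi.h"
#include <stdio.h>

#define N 10000   /* number of send/recv iterations */

int main(int argc, char *argv[])
{
    int rank;
    char byte = 0;
    MPI_Status status;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    double t0 = MPI_Wtime();
    for (int i = 0; i < N; i++) {
        if (rank == 0) {            /* Proc 0: send then receive */
            MPI_Send(&byte, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(&byte, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD, &status);
        } else if (rank == 1) {     /* Proc 1: receive then send */
            MPI_Recv(&byte, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &status);
            MPI_Send(&byte, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t = MPI_Wtime() - t0;
    if (rank == 0)   /* benchmark 1's accounting: L + 2*o = total time / N */
        printf("L + 2*o estimate: %g s\n", t / N);
    MPI_Finalize();
    return 0;
}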
Performance Metrics
– Peak floating point operations per second (flops)
– Peak instructions per second (ips)
– Sustained throughput: flops, Mflops (Megaflops), Gflops (Gigaflops), Tflops (Teraflops), Pflops (Petaflops); ips, Mips, …
– Cycles per instruction: cpi (alternatively: instructions per cycle, ipc)
– Memory access latency: cycles or seconds
– Memory access bandwidth: bytes per second (or Gigabytes per second, GBps, GB/s)
– Bi-section bandwidth: bytes per second
CPI (continued)
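A compact restatement of the standard CPI relations this metric rests on (my summary, not the slide's own equations):

T_exec = IC × CPI × T_cycle = IC × CPI / f_clock
CPI = Σ_i (f_i × CPI_i),   where f_i is the fraction of dynamic instructions in class i and CPI_i is that class's cycles per instruction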
Basic Performance Metrics
Time related:
– Execution time [seconds]: wall clock time; system and user time
– Latency
– Response time
Rate related:
– Rate of computation: floating point operations per second [flops]; integer operations per second [ops]
– Data transfer (I/O) rate [bytes/second]
Effectiveness:
– Efficiency [%]
– Memory consumption [bytes]
– Productivity [utility/($*second)]
Modifiers:
– Sustained
– Peak
– Theoretical peak
Basic Parallel (MPI) Program Steps
– Establish logical bindings
– Initialize application execution environment
– Distribute data and work
– Perform core computations in parallel (across nodes)
– Synchronize and exchange intermediate data results (optional for non-embarrassingly-parallel, i.e., cooperative, computation)
– Detect "stop" condition (may be implicit, e.g., with a barrier)
– Aggregate final results (often a reduction operation)
– Output results and error code
– Terminate and return to OS
(A minimal skeleton mapping these steps onto MPI calls follows.)
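A minimal skeleton (my sketch; the sum-of-squares workload is a placeholder) mapping the steps above onto MPI calls:

#include "mpi.h"
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, size;
    MPI_Init(&argc, &argv);                    /* initialize execution environment */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);      /* establish logical bindings */
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    long partial = (long)(rank + 1) * (rank + 1);  /* distribute work; core computation */
    long total = 0;
    MPI_Reduce(&partial, &total, 1, MPI_LONG,  /* aggregate final results (reduction) */
               MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)                             /* output results */
        printf("sum of squares 1..%d = %ld\n", size, total);
    MPI_Finalize();                            /* terminate and return to OS */
    return 0;
}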
The Essential MPI
API elements:
– MPI_Init(), MPI_Finalize()
– MPI_Comm_size(), MPI_Comm_rank()
– MPI_COMM_WORLD
– Error checking using MPI_SUCCESS
– MPI basic data types (slide 27)
– Blocking: MPI_Send(), MPI_Recv()
– Non-blocking: MPI_Isend(), MPI_Irecv(), MPI_Wait()
– Collective calls: MPI_Barrier(), MPI_Bcast(), MPI_Gather(), MPI_Scatter(), MPI_Reduce()
Commands:
– Running MPI programs: mpirun
– Compiling: mpicc (C), mpif77 (Fortran 77)
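A typical build-and-run sequence using the commands above (file name and process count are illustrative):

mpicc -o mpi_mm mpi_mm.c
mpirun -np 4 ./mpi_mm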
Topics: Introduction, Midterm Exam Review, Matrix Multiplication, N-Body Problem, Fast Fourier Transform (FFT), Summary – Materials for Test
Matrices — A Review
An n x m matrix.
(Slides for Parallel Programming Techniques & Applications Using Networked Workstations & Parallel Computers, 2nd ed., by B. Wilkinson & M. Allen, © 2004 Pearson Education Inc. All rights reserved.)
Matrix Multiplication
Multiplication of two matrices, A and B, produces the matrix C whose elements c_{i,j} (0 <= i < n, 0 <= j < m) are computed as:
c_{i,j} = Σ_{k=0}^{l−1} a_{i,k} * b_{k,j}
where A is an n x l matrix and B is an l x m matrix.
Matrix multiplication, C = A x B
(Figure.)
Implementing Matrix Multiplication: Sequential Code
Assume throughout that the matrices are square (n x n). The sequential code to compute A x B could simply be:

for (i = 0; i < n; i++)
   for (j = 0; j < n; j++) {
      c[i][j] = 0;
      for (k = 0; k < n; k++)
         c[i][j] = c[i][j] + a[i][k] * b[k][j];
   }

This algorithm requires n^3 multiplications and n^3 additions, leading to a sequential time complexity of O(n^3). Very easy to parallelize.
Block Matrix Multiplication
(Figure: block matrix multiplication.)
Performance Improvement
Using tree construction, n numbers can be added in log n steps using n processors (a serial sketch of the tree summation follows). This gives a computational time complexity of O(log n) using n^3 processors.
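A serial sketch of the tree summation (my illustration; one processor per active pair would turn each pass of the outer loop into one parallel step, giving log2(n) steps):

#include <stdio.h>

/* In-place tree summation of n numbers, n a power of two.
   Each outer-loop pass halves the number of active partial sums. */
double tree_sum(double *x, int n)
{
    for (int stride = 1; stride < n; stride *= 2)     /* log2(n) passes */
        for (int i = 0; i + stride < n; i += 2 * stride)
            x[i] += x[i + stride];                    /* pairwise add */
    return x[0];
}

int main(void)
{
    double v[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    printf("%g\n", tree_sum(v, 8));   /* prints 36 */
    return 0;
}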
Flowchart for Matrix Multiplication
"master":
– Initialize MPI environment
– Initialize array; partition array into workloads
– Send workload to "workers"
– Wait for "workers" to finish task
– Recv. results
– Print results; End
"workers":
– Initialize MPI environment
– Recv. work
– Calculate matrix product
– Send result
Matrix Multiplication (source code)

#include "mpi.h"
#include <stdio.h>
#define NRA 62              /* number of rows in matrix A */
#define NCA 15              /* number of columns in matrix A */
#define NCB 7               /* number of columns in matrix B */
#define MASTER 0            /* taskid of first task */
#define FROM_MASTER 1       /* setting a message type */
#define FROM_WORKER 2       /* setting a message type */

int main(argc,argv)
int argc;
char *argv[];
{
int numtasks,               /* number of tasks in partition */
    taskid,                 /* a task identifier */
    numworkers,             /* number of worker tasks */
    source,                 /* task id of message source */
    dest,                   /* task id of message destination */
    mtype,                  /* message type */
    rows,                   /* rows of matrix A sent to each worker */
    averow, extra, offset,  /* used to determine rows sent to each worker */
    i, j, k, rc;            /* misc */
double a[NRA][NCA],         /* matrix A to be multiplied */
       b[NCA][NCB],         /* matrix B to be multiplied */
       c[NRA][NCB];         /* result matrix C */
MPI_Status status;

Source : http://www.llnl.gov/computing/tutorials/mpi/samples/C/mpi_mm.c
Matrix Multiplication (source code)

/* Initialize MPI Environment */
MPI_Init(&argc,&argv);
MPI_Comm_rank(MPI_COMM_WORLD,&taskid);
MPI_Comm_size(MPI_COMM_WORLD,&numtasks);
if (numtasks < 2) {
   printf("Need at least two MPI tasks. Quitting...\n");
   MPI_Abort(MPI_COMM_WORLD, rc);
   exit(1);
}
numworkers = numtasks-1;

/* Master block */
if (taskid == MASTER) {
   printf("mpi_mm has started with %d tasks.\n",numtasks);
   printf("Initializing arrays...\n");
   for (i=0; i<NRA; i++)
      for (j=0; j<NCA; j++)
         a[i][j] = i+j;         /* Initialize array a */
   for (i=0; i<NCA; i++)
      for (j=0; j<NCB; j++)
         b[i][j] = i*j;         /* Initialize array b */

   /* Send matrix data to the worker tasks */
   averow = NRA/numworkers;     /* fraction of array to be processed by "workers" */
   extra  = NRA%numworkers;
   offset = 0;
   mtype  = FROM_MASTER;        /* Message tag */

Source : http://www.llnl.gov/computing/tutorials/mpi/samples/C/mpi_mm.c
Matrix Multiplication (source code)

   for (dest=1; dest<=numworkers; dest++) {
      /* To each worker send: start point, number of rows to process,
         and the sub-arrays to process */
      rows = (dest <= extra) ? averow+1 : averow;
      printf("Sending %d rows to task %d offset=%d\n",rows,dest,offset);
      MPI_Send(&offset, 1, MPI_INT, dest, mtype, MPI_COMM_WORLD);
      MPI_Send(&rows, 1, MPI_INT, dest, mtype, MPI_COMM_WORLD);
      MPI_Send(&a[offset][0], rows*NCA, MPI_DOUBLE, dest, mtype, MPI_COMM_WORLD);
      MPI_Send(&b, NCA*NCB, MPI_DOUBLE, dest, mtype, MPI_COMM_WORLD);
      offset = offset + rows;
   }

   /* Receive results from worker tasks */
   mtype = FROM_WORKER;   /* Message tag for messages sent by "workers" */
   for (i=1; i<=numworkers; i++) {
      source = i;
      /* offset stores the (processing) starting point of the work chunk */
      MPI_Recv(&offset, 1, MPI_INT, source, mtype, MPI_COMM_WORLD, &status);
      MPI_Recv(&rows, 1, MPI_INT, source, mtype, MPI_COMM_WORLD, &status);
      /* The array C contains the product of sub-array A and the array B */
      MPI_Recv(&c[offset][0], rows*NCB, MPI_DOUBLE, source, mtype, MPI_COMM_WORLD, &status);
      printf("Received results from task %d\n",source);
   }

   printf("******************************************************\n");
   printf("Result Matrix:\n");
   for (i=0; i<NRA; i++) {
      printf("\n");
      for (j=0; j<NCB; j++)
         printf("%6.2f ", c[i][j]);
   }
   printf("\n******************************************************\n");
   printf("Done.\n");
}
Matrix Multiplication (source code)

/**************************** worker task ************************************/
if (taskid > MASTER) {
   mtype = FROM_MASTER;
   MPI_Recv(&offset, 1, MPI_INT, MASTER, mtype, MPI_COMM_WORLD, &status);
   MPI_Recv(&rows, 1, MPI_INT, MASTER, mtype, MPI_COMM_WORLD, &status);
   MPI_Recv(&a, rows*NCA, MPI_DOUBLE, MASTER, mtype, MPI_COMM_WORLD, &status);
   MPI_Recv(&b, NCA*NCB, MPI_DOUBLE, MASTER, mtype, MPI_COMM_WORLD, &status);

   for (k=0; k<NCB; k++)
      for (i=0; i<rows; i++) {
         c[i][k] = 0.0;
         for (j=0; j<NCA; j++)   /* Calculate the product and store result in C */
            c[i][k] = c[i][k] + a[i][j] * b[j][k];
      }

   mtype = FROM_WORKER;
   MPI_Send(&offset, 1, MPI_INT, MASTER, mtype, MPI_COMM_WORLD);
   MPI_Send(&rows, 1, MPI_INT, MASTER, mtype, MPI_COMM_WORLD);
   /* Worker sends the resultant array to the master */
   MPI_Send(&c, rows*NCB, MPI_DOUBLE, MASTER, mtype, MPI_COMM_WORLD);
}
MPI_Finalize();
}

Source : http://www.llnl.gov/computing/tutorials/mpi/samples/C/mpi_mm.c
Demo : Matrix Multiplication

[cdekate@compute-0-6 matrix_multiplication]$ mpirun -np 4 ./mpi_mm
mpi_mm has started with 4 tasks.
Initializing arrays...
Sending 21 rows to task 1 offset=0
Sending 21 rows to task 2 offset=21
Sending 20 rows to task 3 offset=42
Received results from task 1
Received results from task 2
Received results from task 3
******************************************************
Result Matrix:
0.00 1015.00 2030.00 3045.00 4060.00 5075.00 6090.00
0.00 1120.00 2240.00 3360.00 4480.00 5600.00 6720.00
0.00 1225.00 2450.00 3675.00 4900.00 6125.00 7350.00
0.00 1330.00 2660.00 3990.00 5320.00 6650.00 7980.00
0.00 1435.00 2870.00 4305.00 5740.00 7175.00 8610.00
0.00 1540.00 3080.00 4620.00 6160.00 7700.00 9240.00
0.00 1645.00 3290.00 4935.00 6580.00 8225.00 9870.00
…
0.00 6475.00 12950.00 19425.00 25900.00 32375.00 38850.00
0.00 6580.00 13160.00 19740.00 26320.00 32900.00 39480.00
0.00 6685.00 13370.00 20055.00 26740.00 33425.00 40110.00
0.00 6790.00 13580.00 20370.00 27160.00 33950.00 40740.00
0.00 6895.00 13790.00 20685.00 27580.00 34475.00 41370.00
0.00 7000.00 14000.00 21000.00 28000.00 35000.00 42000.00
0.00 7105.00 14210.00 21315.00 28420.00 35525.00 42630.00
0.00 7210.00 14420.00 21630.00 28840.00 36050.00 43260.00
0.00 7315.00 14630.00 21945.00 29260.00 36575.00 43890.00
0.00 7420.00 14840.00 22260.00 29680.00 37100.00 44520.00
******************************************************
Done.
[cdekate@compute-0-6 matrix_multiplication]$
Topics: Introduction, Midterm Exam Review, Matrix Multiplication, N-Body Problem, Fast Fourier Transform (FFT), Summary – Materials for Test
N Bodies
(Source of this and the following N-body slides: OU Supercomputing Center for Education & Research)
N-Body Problems
An N-body problem is a problem involving N "bodies" – that is, particles (e.g., stars, atoms) – each of which applies a force to all of the others. For example, if you have N stars, then each of the N stars exerts a force (gravity) on all of the other N–1 stars. Likewise, if you have N atoms, then every atom exerts a force on all of the other N–1 atoms; there the forces are Coulombic and van der Waals forces.
(Img src: http://www.lsbu.ac.uk/water)
2-Body Problem
When N is 2, you have – surprise! – a 2-Body Problem: exactly two particles, each exerting a force that acts on the other. The relationship between the 2 particles can be expressed as a differential equation that can be solved analytically, producing a closed-form solution. So, given the particles' initial positions and velocities, you can immediately calculate their positions and velocities at any later time.
N-Body Problems
For N of 3 or more, no one knows how to solve the equations to get a closed-form solution. So, numerical simulation is pretty much the only way to study groups of 3 or more bodies. Popular applications of N-body codes include astronomy and chemistry. Note that, for N bodies, there are on the order of N^2 forces, denoted O(N^2).
N-Body Problems
Given N bodies, each body exerts a force on all of the other N–1 bodies. Therefore, there are N(N–1) forces in total. You can also think of this as N(N–1)/2 forces, in the sense that the force from particle A to particle B is the same (except in the opposite direction) as the force from particle B to particle A.
N-Body Problems
Given N bodies, each body exerts a force on all of the other N–1 bodies, so there are N(N–1) forces in total. In Big-O notation, that's O(N^2) forces to calculate, so calculating the forces takes O(N^2) time to execute. But there are only N particles, each taking up the same amount of memory, so we say that N-body codes are of:
– O(N) spatial complexity (memory)
– O(N^2) time complexity
O(N^2) Forces
(Figure: the forces between particle A and every other particle.) Note that this picture shows only the forces between A and everyone else.
How to Calculate?
Whatever your physics is, you have some function, F(A,B), that expresses the force between two bodies A and B. For example,
F(A,B) = G · m_A · m_B / dist(A,B)^2
where G is the gravitational constant and m is the mass of the particle in question. If you have all of the forces for every pair of particles, then you can calculate their sum, obtaining the force on every particle.
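A sketch of the force accumulation in C (2-D for brevity; the Body type and function name are my choices, with F(A,B) as defined above):

#include <math.h>

typedef struct { double x, y, mass, fx, fy; } Body;

/* Accumulate gravitational forces over all O(N^2) pairs,
   using Newton's third law to visit each pair only once. */
void accumulate_forces(Body *b, int n, double G)
{
    for (int i = 0; i < n; i++)
        for (int j = i + 1; j < n; j++) {
            double dx = b[j].x - b[i].x, dy = b[j].y - b[i].y;
            double r2 = dx * dx + dy * dy;       /* dist(A,B)^2 */
            double r  = sqrt(r2);
            double f  = G * b[i].mass * b[j].mass / r2;
            b[i].fx += f * dx / r;  b[i].fy += f * dy / r;
            b[j].fx -= f * dx / r;  b[j].fy -= f * dy / r;
        }
}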
How to Parallelize?
Okay, so let's say you have a nice serial (single-CPU) code that does an N-body calculation. How are you going to parallelize it? You could:
– have a master feed particles to processes;
– have a master feed interactions to processes;
– have each process decide on its own subset of the particles, and then share around the forces;
– have each process decide on its own subset of the interactions, and then share around the forces.
Do You Need a Master?
Let's say that you have N bodies, and therefore you have ½N(N–1) interactions (every particle interacts with all of the others, but you don't need to calculate both A→B and B→A). Do you need a master? Well, can each processor determine on its own either (a) which of the bodies to process, or (b) which of the interactions? If the answer is yes, then you don't need a master.
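A minimal sketch of the masterless choice (my illustration; names are hypothetical): each rank derives its own contiguous block of bodies from nothing but its rank, so no master process is needed.

/* Compute this rank's block of the nbodies, balancing the remainder
   across the first (nbodies % size) ranks. */
void my_block(int rank, int size, int nbodies, int *first, int *count)
{
    int base = nbodies / size, rem = nbodies % size;
    *first = rank * base + (rank < rem ? rank : rem);
    *count = base + (rank < rem ? 1 : 0);
}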
N-Body "Pipeline" Implementation Flowchart
– Initialize MPI environment
– Create ring communicator
– Initialize particle parameters
– Repeat until all iterations are done:
  – Copy local particle data to send buffer
  – Until particles from all remote nodes have been processed:
    – Initiate transmission of send buffer to the RIGHT neighbor in ring
    – Initiate reception of data from the LEFT neighbor in ring
    – Compute forces between local and send-buffer particles
    – Wait for message exchange to complete
    – Copy particle data from receive buffer to send buffer
  – Update positions of local particles
– Finalize MPI
N-Body (source code)

#include "mpi.h"
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <math.h>
/* Pipeline version of the algorithm... */
/* we really need the velocities as well... */

/* Simplified structure describing parameters of a single particle */
typedef struct {
    double x, y, z;
    double mass;
} Particle;

/* We use leapfrog for the time integration... */
/* Structure to hold force components and old position coordinates of a particle */
typedef struct {
    double xold, yold, zold;
    double fx, fy, fz;
} ParticleV;

void   InitParticles( Particle[], ParticleV [], int );
double ComputeForces( Particle [], Particle [], ParticleV [], int );
double ComputeNewPos( Particle [], ParticleV [], int, double, MPI_Comm );

#define MAX_PARTICLES 4000
#define MAX_P 128
N-Body (source code)

main( int argc, char *argv[] )
{
    Particle  particles[MAX_PARTICLES];   /* Particles on ALL nodes */
    ParticleV pv[MAX_PARTICLES];          /* Particle velocity */
    Particle  sendbuf[MAX_PARTICLES],     /* Pipeline buffers */
              recvbuf[MAX_PARTICLES];
    MPI_Request request[2];
    int counts[MAX_P],                    /* Number on each processor */
        displs[MAX_P];                    /* Offsets into particles */
    int rank, size, npart, i, j,
        offset;                           /* location of local particles */
    int totpart,                          /* total number of particles */
        cnt;                              /* number of times in loop */
    MPI_Datatype particletype;
    double sim_t;                         /* Simulation time */
    double time;                          /* Computation time */
    int pipe, left, right, periodic;
    MPI_Comm commring;
    MPI_Status statuses[2];

    /* Initialize MPI Environment */
    MPI_Init( &argc, &argv );
    MPI_Comm_rank( MPI_COMM_WORLD, &rank );
    MPI_Comm_size( MPI_COMM_WORLD, &size );

    /* Create 1-dimensional periodic Cartesian communicator (a ring) */
    periodic = 1;
    MPI_Cart_create( MPI_COMM_WORLD, 1, &size, &periodic, 1, &commring );
    MPI_Cart_shift( commring, 0, 1, &left, &right );  /* Find the closest neighbors in ring */
N-Body (source code)

    /* Calculate local fraction of particles */
    if (argc < 2) {
        fprintf( stderr, "Usage: %s n\n", argv[0] );
        MPI_Abort( MPI_COMM_WORLD, 1 );
    }
    npart = atoi(argv[1]) / size;
    if (npart * size > MAX_PARTICLES) {
        fprintf( stderr, "%d is too many; max is %d\n", npart*size, MAX_PARTICLES );
        MPI_Abort( MPI_COMM_WORLD, 1 );
    }

    /* Data type corresponding to Particle struct */
    MPI_Type_contiguous( 4, MPI_DOUBLE, &particletype );
    MPI_Type_commit( &particletype );

    /* Get the sizes and displacements */
    MPI_Allgather( &npart, 1, MPI_INT, counts, 1, MPI_INT, commring );
    displs[0] = 0;
    for (i=1; i<size; i++)
        displs[i] = displs[i-1] + counts[i-1];
    totpart = displs[size-1] + counts[size-1];

    /* Generate the initial values */
    InitParticles( particles, pv, npart );
    offset = displs[rank];
    cnt = 10;
    time = MPI_Wtime();
    sim_t = 0.0;

    /* Begin simulation loop */
    while (cnt--) {
        double max_f, max_f_seg;
N-Body (source code)

        /* Load the initial send buffer */
        memcpy( sendbuf, particles, npart * sizeof(Particle) );
        max_f = 0.0;
        for (pipe=0; pipe<size; pipe++) {
            if (pipe != size-1) {
                /* Initiate send to the "right" neighbor, while receiving from the "left" */
                MPI_Isend( sendbuf, npart, particletype, right, pipe, commring, &request[0] );
                MPI_Irecv( recvbuf, npart, particletype, left, pipe, commring, &request[1] );
            }
            /* Compute forces */
            max_f_seg = ComputeForces( particles, sendbuf, pv, npart );
            if (max_f_seg > max_f) max_f = max_f_seg;
            /* Wait for updates to complete and copy received particles to the send buffer */
            if (pipe != size-1)
                MPI_Waitall( 2, request, statuses );
            memcpy( sendbuf, recvbuf, counts[pipe] * sizeof(Particle) );
        }
        /* Compute the changes in position using the already calculated forces */
        sim_t += ComputeNewPos( particles, pv, npart, max_f, commring );
        /* We could do graphics here (move particles on the display) */
    }
    time = MPI_Wtime() - time;
    if (rank == 0) {
        printf( "Computed %d particles in %f seconds\n", totpart, time );
    }
    MPI_Finalize();
    return 0;
}
N-Body (source code)

/* Initialize particle positions, masses and forces */
void InitParticles( Particle particles[], ParticleV pv[], int npart )
{
    int i;
    for (i=0; i<npart; i++) {
        particles[i].x = drand48();
        particles[i].y = drand48();
        particles[i].z = drand48();
        particles[i].mass = 1.0;
        pv[i].xold = particles[i].x;
        pv[i].yold = particles[i].y;
        pv[i].zold = particles[i].z;
        pv[i].fx = 0;
        pv[i].fy = 0;
        pv[i].fz = 0;
    }
}

/* Compute forces (2-D only) */
double ComputeForces( Particle myparticles[], Particle others[], ParticleV pv[], int npart )
{
    double max_f, rmin;
    int i, j;
    max_f = 0.0;
    for (i=0; i<npart; i++) {
        double xi, yi, mi, rx, ry, mj, r, fx, fy;
        rmin = 100.0;
        xi = myparticles[i].x;
        yi = myparticles[i].y;
        fx = 0.0;
        fy = 0.0;
N-Body (source code)

        for (j=0; j<npart; j++) {
            rx = xi - others[j].x;
            ry = yi - others[j].y;
            mj = others[j].mass;
            r  = rx * rx + ry * ry;
            /* ignore overlap and same particle */
            if (r == 0.0) continue;
            if (r < rmin) rmin = r;
            /* compute forces */
            r = r * sqrt(r);
            fx -= mj * rx / r;
            fy -= mj * ry / r;
        }
        pv[i].fx += fx;
        pv[i].fy += fy;
        /* Compute a rough estimate of (1/m)|df / dx| */
        fx = sqrt(fx*fx + fy*fy)/rmin;
        if (fx > max_f) max_f = fx;
    }
    return max_f;
}

/* Update particle positions (2-D only) */
double ComputeNewPos( Particle particles[], ParticleV pv[], int npart, double max_f, MPI_Comm commring )
{
    int i;
    double a0, a1, a2;
    static double dt_old = 0.001, dt = 0.001;
    double dt_est, new_dt, dt_new;
N-Body (source code)

    /* integration is a0 * x^+ + a1 * x + a2 * x^- = f / m */
    a0 = 2.0 / (dt * (dt + dt_old));
    a2 = 2.0 / (dt_old * (dt + dt_old));
    a1 = -(a0 + a2);   /* also -2/(dt*dt_old) */
    for (i=0; i<npart; i++) {
        double xi, yi;
        /* Very, very simple leapfrog time integration. We use a variable
           step version to simplify time-step control. */
        xi = particles[i].x;
        yi = particles[i].y;
        particles[i].x = (pv[i].fx - a1 * xi - a2 * pv[i].xold) / a0;
        particles[i].y = (pv[i].fy - a1 * yi - a2 * pv[i].yold) / a0;
        pv[i].xold = xi;
        pv[i].yold = yi;
        pv[i].fx = 0;
        pv[i].fy = 0;
    }
    /* Recompute a time step. Stability criteria is roughly
       2/sqrt(1/m |df/dx|) >= dt. We leave a little room */
    dt_est = 1.0/sqrt(max_f);
    if (dt_est < 1.0e-6) dt_est = 1.0e-6;
    MPI_Allreduce( &dt_est, &dt_new, 1, MPI_DOUBLE, MPI_MIN, commring );
    /* Modify time step */
    if (dt_new < dt) {
        dt_old = dt;
        dt = dt_new;
    } else if (dt_new > 4.0 * dt) {
        dt_old = dt;
        dt *= 2.0;
    }
    return dt_old;
}
Demo : N-Body Problem

> mpirun -np 4 nbodypipe 4000
Computed 4000 particles in 1.119051 seconds
Topics: Introduction, Midterm Exam Review, Matrix Multiplication, N-Body Problem, Fast Fourier Transform (FFT), Summary – Materials for Test
Serial FFT
Let i = sqrt(−1) and index matrices and vectors from 0.
The Discrete Fourier Transform of an m-element vector v is F*v, where F is the m×m matrix defined as F[j,k] = ω^(j*k), and ω is:
ω = e^(2πi/m) = cos(2π/m) + i*sin(2π/m)
This is a complex number whose m-th power is 1 and is therefore called the m-th root of unity.
E.g., for m = 4: ω = 0+1*i, ω^2 = −1+0*i, ω^3 = 0−1*i, ω^4 = 1+0*i
Source: www.cs.berkeley.edu/~demmel/cs267_Spr99/Lectures/Lect_24_1999-new.ppt
Related Transforms
– Most applications require multiplication by both F and inverse(F).
– Multiplying by F and inverse(F) are essentially the same (inverse(F) is the complex conjugate of F divided by m).
– For solving the Poisson equation and various other applications, we use variations on the FFT: the sin transform (the imaginary part of F) and the cos transform (the real part of F).
– The algorithms are similar, so we will focus on the forward FFT.
Source: www.cs.berkeley.edu/~demmel/cs267_Spr99/Lectures/Lect_24_1999-new.ppt
Serial Algorithm for the FFT
Compute the FFT of an m-element vector v, F*v:
(F*v)[j] = Σ_{k=0}^{m−1} F(j,k)*v(k) = Σ_{k=0}^{m−1} ω^(j*k) * v(k) = Σ_{k=0}^{m−1} (ω^j)^k * v(k) = V(ω^j)
where V is defined as the polynomial V(x) = Σ_{k=0}^{m−1} x^k * v(k)
Source: www.cs.berkeley.edu/~demmel/cs267_Spr99/Lectures/Lect_24_1999-new.ppt
Divide and Conquer FFT
V can be evaluated using divide-and-conquer:
V(x) = Σ_{k=0}^{m−1} x^k * v(k)
     = v[0] + x^2*v[2] + x^4*v[4] + … + x*(v[1] + x^2*v[3] + x^4*v[5] + …)
     = V_even(x^2) + x*V_odd(x^2)
V has degree m−1, so V_even and V_odd are polynomials of degree m/2−1.
We evaluate these at the points (ω^j)^2 for 0 <= j <= m−1. But this is really just m/2 different points, since
(ω^(j+m/2))^2 = (ω^j * ω^(m/2))^2 = ω^(2j) * ω^m = (ω^j)^2
Source: www.cs.berkeley.edu/~demmel/cs267_Spr99/Lectures/Lect_24_1999-new.ppt
Divide-and-Conquer FFT

FFT(v, ω, m)
   if m = 1 return v[0]
   else
      v_even = FFT(v[0:2:m−2], ω^2, m/2)
      v_odd  = FFT(v[1:2:m−1], ω^2, m/2)
      ω-vec = [ω^0, ω^1, … ω^(m/2−1)]   (precomputed)
      return [v_even + (ω-vec .* v_odd), v_even − (ω-vec .* v_odd)]

The .* above is component-wise multiply. The […,…] constructs an m-element vector from two m/2-element vectors.
This results in an O(m log m) algorithm.
Source: www.cs.berkeley.edu/~demmel/cs267_Spr99/Lectures/Lect_24_1999-new.ppt
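A runnable C sketch of the same recursion (my translation of the pseudocode using C99 complex arithmetic; it follows the divide-and-conquer form directly rather than the in-place butterfly used in the code later):

#include <stdio.h>
#include <math.h>
#include <complex.h>

#define PI 3.14159265358979

/* Recursive FFT of v (length m, a power of two) into out.
   stride walks the even/odd sublattices, mirroring v[0:2:m-2]. */
static void fft_rec(const double complex *v, double complex *out, int m, int stride)
{
    if (m == 1) { out[0] = v[0]; return; }
    fft_rec(v, out, m / 2, 2 * stride);                  /* v_even */
    fft_rec(v + stride, out + m / 2, m / 2, 2 * stride); /* v_odd  */
    for (int j = 0; j < m / 2; j++) {
        double complex w = cexp(2.0 * I * PI * j / m);   /* omega^j, omega = e^(2*pi*i/m) */
        double complex e = out[j], o = w * out[j + m / 2];
        out[j]         = e + o;   /* v_even + omega-vec .* v_odd */
        out[j + m / 2] = e - o;   /* v_even - omega-vec .* v_odd */
    }
}

int main(void)
{
    double complex v[4] = {1, 1, 1, 1}, out[4];
    fft_rec(v, out, 4, 1);
    for (int j = 0; j < 4; j++)
        printf("%g %+gi\n", creal(out[j]), cimag(out[j]));  /* expect 4, 0, 0, 0 */
    return 0;
}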
1D FFT: Butterfly Pattern
(Figure: the butterfly communication pattern of the 1D FFT.)
Source: www.cs.berkeley.edu/~demmel/cs267_Spr99/Lectures/Lect_24_1999-new.ppt
Higher Dimension FFTs
FFTs on 2 or 3 dimensions are defined as 1D FFTs on vectors in all dimensions; e.g., a 2D FFT does 1D FFTs on all rows and then all columns. There are 3 obvious possibilities for the 2D FFT:
– (1) 2D blocked layout for the matrix, using 1D algorithms for each row and column
– (2) Block row layout for the matrix, using serial 1D FFTs on rows, followed by a transpose, then more serial 1D FFTs
– (3) Block row layout for the matrix, using serial 1D FFTs on rows, followed by parallel 1D FFTs on columns
– Option 1 is best
For a 3D FFT the options are similar:
– 2 phases done with serial FFTs, followed by a transpose for the 3rd
– can overlap communication with the 2nd phase in practice
Source: www.cs.berkeley.edu/~demmel/cs267_Spr99/Lectures/Lect_24_1999-new.ppt
2-D FFT Flowchart
MASTER:
– Initialize MPI environment
– Initialize matrix
– Distribute matrix by rows (MPI_Scatter)
– Reorganize matrix slice into square chunks
– Compute 1-D FFTs on rows
– Redistribute matrix chunks (MPI_Alltoall)
– Transpose matrix chunks
– Compute 1-D FFTs on rows
– Collect matrix slices (MPI_Gather)
– Transpose assembled matrix
– Finalize MPI
WORKER:
– Initialize MPI environment
– Receive matrix slice (MPI_Scatter)
– Reorganize matrix slice into square chunks
– Compute 1-D FFTs on rows
– Redistribute matrix chunks (MPI_Alltoall)
– Transpose matrix chunks
– Compute 1-D FFTs on rows
– Send matrix slice (MPI_Gather)
– Finalize MPI
Fast Fourier Transform
MPI – Two-Dimensional Fast Fourier Transform – C Version
The image originates on a single processor (SOURCE_PROCESSOR). This image, a[], is distributed by rows to all other processors. Each processor then performs a one-dimensional FFT on the rows of the image stored locally. The image is then transposed using the MPI_Alltoall() routine; this partitions the intermediate image by columns. Each processor then performs a one-dimensional FFT on the columns of the image. Finally, the columns of the image are collected back at the destination processor and the output image is tested for correctness.
Input is a 512x512 complex matrix, initialized with a point source. Output is a 512x512 complex matrix that overwrites the input matrix. Timing and Mflop results are displayed following execution.
A straightforward, unsophisticated 1D FFT kernel is used. It is sufficient to convey the general idea, but be aware that there are better 1D FFTs available on many systems.
2D FFT – Code Walkthrough … 1

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <sys/time.h>
#include "mpi.h"
#include "mpi_2dfft.h"
#define IMAGE_SIZE       512
#define NUM_CELLS        4
#define IMAGE_SLICE      (IMAGE_SIZE / NUM_CELLS)
#define SOURCE_PROCESSOR 0
#define DEST_PROCESSOR   SOURCE_PROCESSOR

int numtasks;   /* Number of processors */
int taskid;     /* ID number for each processor */
mycomplex a[IMAGE_SIZE][IMAGE_SIZE];   /* input matrix: complex numbers */
mycomplex a_slice[IMAGE_SLICE][IMAGE_SIZE];
mycomplex a_chunks[NUM_CELLS][IMAGE_SLICE][IMAGE_SLICE];
mycomplex b[IMAGE_SIZE][IMAGE_SIZE];   /* intermediate matrix */
mycomplex b_slice[IMAGE_SIZE][IMAGE_SLICE];
mycomplex b_chunks[NUM_CELLS][IMAGE_SLICE][IMAGE_SLICE];
mycomplex *collect;
mycomplex w_common[IMAGE_SIZE/2];      /* twiddle factors */
struct timeval etime[10];
int checkpoint;
float dt[10], sum;
2D FFT – Code Walkthrough … 2

int main(argc,argv)
int argc;
char *argv[];
{
    int rc, cell, i, j, n, nx, logn, errors, sign, flops;
    float mflops;

    checkpoint = 0;
    /* Initialize MPI environment and get task's ID and number of
       tasks in the partition. */
    rc  = MPI_Init(&argc,&argv);
    rc |= MPI_Comm_size(MPI_COMM_WORLD,&numtasks);
    rc |= MPI_Comm_rank(MPI_COMM_WORLD,&taskid);

    /* Must have 4 tasks for this program: checking that numtasks
       equals NUM_CELLS (in this case we have set it to 4) */
    if (numtasks != NUM_CELLS) {
        printf("Error: this program requires %d MPI tasks\n", NUM_CELLS);
        exit(1);
    }
    if (rc != MPI_SUCCESS)
        printf("error initializing MPI and obtaining task ID information\n");
    else
        printf("MPI task ID = %d\n", taskid);

    n = IMAGE_SIZE;
    /* compute logn and ensure that n is a power of two */
    nx = n;
    logn = 0;
2D FFT – Code Walkthrough … 3

    /* Checking if IMAGE_SIZE is a power of 2 */
    while ((nx >>= 1) > 0)
        logn++;
    nx = 1;
    for (i=0; i<logn; i++)
        nx = nx*2;
    if (nx != n) {
        (void)fprintf(stderr, "%d: fft size must be a power of 2\n", IMAGE_SIZE);
        exit(0);
    }

    /* Initialize real and imaginary parts of array */
    if (taskid == SOURCE_PROCESSOR) {
        for (i=0; i<n; i++)
            for (j=0; j<n; j++)
                a[i][j].r = a[i][j].i = 0.0;
        /* point source: real and imaginary parts of a[256][256] are
           initialized to 512.0 and the rest to 0.0 */
        a[n/2][n/2].r = a[n/2][n/2].i = (float)n;

        /* print table headings in anticipation of timing results */
        printf("512 x 512 2D FFT\n");
        printf("                  Timings(secs)\n");
        printf("        scatter 1D-FFT-row transpose 1D-FFT-col gather");
        printf(" total\n");
    }

    /* precompute the complex constants (twiddle factors) for the 1D FFTs */
    for (i=0; i<n/2; i++) {
        w_common[i].r = (float)  cos((double)((2.0*PI*i)/(float)n));
        w_common[i].i = (float) -sin((double)((2.0*PI*i)/(float)n));
    }
2D FFT – Code Walkthrough … 4

    /* Distribute Input Matrix By Rows */
    rc = MPI_Barrier(MPI_COMM_WORLD);
    if (rc != MPI_SUCCESS) {
        printf("Error: MPI_Barrier() failed with return code %d\n", rc);
        return(-1);
    }
    gettimeofday(&etime[checkpoint++], (struct timeval*)0);

    /* IMAGE_SLICE = dimension of slice of image per process.
       Each slice of the image is delivered to the corresponding
       process using MPI_Scatter() */
    rc = MPI_Scatter((char *) a, IMAGE_SLICE * IMAGE_SIZE * 2, MPI_FLOAT,
                     (char *) a_slice, IMAGE_SLICE * IMAGE_SIZE * 2, MPI_FLOAT,
                     SOURCE_PROCESSOR, MPI_COMM_WORLD);
    if (rc != MPI_SUCCESS) {
        printf("Error: MPI_Scatter() failed with return code %d\n", rc);
        return(-1);
    }
    gettimeofday(&etime[checkpoint++], (struct timeval*)0);

    /* Perform 1-D Row FFTs.
       a_slice[][] is the buffer containing each individual image chunk;
       for each row in the image slice this section computes a 1D FFT */
    for (i=0; i<IMAGE_SLICE; i++)
        fft(&a_slice[i][0], w_common, n, logn);
    gettimeofday(&etime[checkpoint++], (struct timeval*)0);
2D FFT – Code Walkthrough … 5

    /* Transpose 2-D image */
    for (cell=0; cell<NUM_CELLS; cell++) {
        for (i=0; i<IMAGE_SLICE; i++) {
            for (j=0; j<IMAGE_SLICE; j++) {
                a_chunks[cell][i][j].r = a_slice[i][j + (IMAGE_SLICE * cell)].r;
                a_chunks[cell][i][j].i = a_slice[i][j + (IMAGE_SLICE * cell)].i;
            }
        }
    }

    /* IMAGE_SLICE * IMAGE_SLICE * 2 (because we have real and imaginary);
       each component chunk is delivered to the corresponding process
       using MPI_Alltoall() */
    rc = MPI_Alltoall(a_chunks, IMAGE_SLICE * IMAGE_SLICE * 2, MPI_FLOAT,
                      b_slice, IMAGE_SLICE * IMAGE_SLICE * 2, MPI_FLOAT,
                      MPI_COMM_WORLD);
    if (rc != MPI_SUCCESS) {
        printf("Error: MPI_Alltoall() failed in cell %d return code %d\n", taskid, rc);
        return(-1);
    }
    gettimeofday(&etime[checkpoint++], (struct timeval*)0);
2D FFT – Code Walkthrough … 6

    for (i=0; i<IMAGE_SLICE; i++) {
        for (j=0; j<IMAGE_SIZE; j++) {
            a_slice[i][j].r = b_slice[j][i].r;
            a_slice[i][j].i = b_slice[j][i].i;
        }
    }

    /* Perform 1-D FFTs (effectively on columns) */
    for (i=0; i<IMAGE_SLICE; i++)
        fft(&a_slice[i][0], w_common, IMAGE_SIZE, logn);
    gettimeofday(&etime[checkpoint++], (struct timeval*)0);

    /* Undistribute Output Matrix by Rows */
    collect = (mycomplex *) malloc(IMAGE_SIZE * IMAGE_SIZE * sizeof(mycomplex));

    /* Every process executes MPI_Gather() */
    rc = MPI_Gather(a_slice, IMAGE_SLICE * IMAGE_SIZE * 2, MPI_FLOAT,
                    a, IMAGE_SLICE * IMAGE_SIZE * 2, MPI_FLOAT,
                    DEST_PROCESSOR, MPI_COMM_WORLD);
    if (rc != MPI_SUCCESS) {
        printf("Error: MPI_Gather() failed with return code %d\n", rc);
        fflush(stdout);
    }
2D FFT – Code Walkthrough … 7

    /* If destination processor then perform another transpose of a[][] into b[][] */
    if (taskid == DEST_PROCESSOR) {
        for (i=0; i<IMAGE_SIZE; i++) {
            for (j=0; j<IMAGE_SIZE; j++) {
                b[i][j].r = a[j][i].r;
                b[i][j].i = a[j][i].i;
            }
        }
    }
    gettimeofday(&etime[checkpoint++], (struct timeval*)0);
    fflush(stdout);

    /* Calculate event timings and flops - then print them */
    for (i=1; i<checkpoint; i++)
        dt[i] = ((float) ((etime[i].tv_sec - etime[i-1].tv_sec) * 1000000
                 + etime[i].tv_usec - etime[i-1].tv_usec)) / 1000000.0;
    printf("cell %d: ", taskid);
    for (i=1; i<checkpoint; i++)
        printf("%2.6f ", dt[i]);
    sum = 0;
    for (i=1; i<checkpoint; i++)
        sum += dt[i];
    printf(" %2.6f \n", sum);
2D FFT – Code Walkthrough … 8

    if (taskid == DEST_PROCESSOR) {
        flops  = (n*n*logn)*10;
        mflops = ((float)flops/1000000.0);
        mflops = mflops/(float)sum;
        printf("Total Mflops= %3.4f\n", mflops);

        /* verify the result against the known transform of a point source */
        errors = 0;
        for (i=0; i<n; i++) {
            if (((i+1)/2)*2 == i) sign = 1;
            else sign = -1;
            for (j=0; j<n; j++) {
                if (b[i][j].r > n*sign+EPSILON || b[i][j].r < n*sign-EPSILON ||
                    b[i][j].i > n*sign+EPSILON || b[i][j].i < n*sign-EPSILON) {
                    printf("[%d][%d] is %f,%f should be %f\n",
                           i, j, b[i][j].r, b[i][j].i, (float) n*sign);
                    errors++;
                }
                sign *= -1;
            }
        }
        if (errors) {
            printf("%d errors!!!!!\n", errors);
            exit(0);
        }
    }
    MPI_Finalize();
    exit(0);
}
2D FFT – Code Walkthrough … 9

fft(data,w_common,n,logn)
mycomplex *data,*w_common;
int n,logn;
{
    int incrvec, i0, i1, i2, nx;
    float f0, f1;
    void bit_reverse();

    /* bit-reverse the input vector */
    (void)bit_reverse(data,n);

    /* do the first logn-1 stages of the fft */
    i2 = logn;
    for (incrvec=2; incrvec<n; incrvec<<=1) {
        i2--;
        for (i0 = 0; i0 < incrvec>>1; i0++) {
            for (i1 = 0; i1 < n; i1 += incrvec) {
                f0 = data[i0+i1 + incrvec/2].r * w_common[i0<<i2].r
                   - data[i0+i1 + incrvec/2].i * w_common[i0<<i2].i;
                f1 = data[i0+i1 + incrvec/2].r * w_common[i0<<i2].i
                   + data[i0+i1 + incrvec/2].i * w_common[i0<<i2].r;
                data[i0+i1 + incrvec/2].r = data[i0+i1].r - f0;
                data[i0+i1 + incrvec/2].i = data[i0+i1].i - f1;
                data[i0+i1].r = data[i0+i1].r + f0;
                data[i0+i1].i = data[i0+i1].i + f1;
            }
        }
    }
2D FFT – Code Walkthrough … 10

    /* do the last stage of the fft */
    for (i0 = 0; i0 < n/2; i0++) {
        f0 = data[i0 + n/2].r * w_common[i0].r - data[i0 + n/2].i * w_common[i0].i;
        f1 = data[i0 + n/2].r * w_common[i0].i + data[i0 + n/2].i * w_common[i0].r;
        data[i0 + n/2].r = data[i0].r - f0;
        data[i0 + n/2].i = data[i0].i - f1;
        data[i0].r = data[i0].r + f0;
        data[i0].i = data[i0].i + f1;
    }
}

/* bit_reverse - simple (but somewhat inefficient) bit reverse */
void bit_reverse(a,n)
mycomplex *a;
int n;
{
    int i,j,k;
    j = 0;
    for (i=0; i<n-2; i++) {
        if (i < j) {
            SWAP(a[j],a[i]);
        }
        k = n>>1;
        while (k <= j) {
            j -= k;
            k >>= 1;
        }
        j += k;
    }
}
FFT Header File

/***************************************************************************
 * FILE: mpi_2dfft.h
 * DESCRIPTION: see mpi_2dfft.c
 * AUTHOR: George Gusciora
 * LAST REVISED:
 ***************************************************************************/
#define MAXN 2048            /* max 2d fft size */
#define EPSILON 0.00001      /* for comparing fp numbers */
#define PI 3.14159265358979  /* 4*atan(1.0) */

typedef struct { float r, i; } mycomplex;

/* swap a pair of complex numbers */
#define SWAP(a,b) {float swap_temp=(a).r;(a).r=(b).r;(b).r=swap_temp;\
                   swap_temp=(a).i;(a).i=(b).i;(b).i=swap_temp;}

/* swap a pair of floats */
#define MYSWAP(a,b) {float swap_temp=a;a=b;b=swap_temp;}
Topics: Introduction, Midterm Exam Review, Matrix Multiplication, N-Body Problem, Fast Fourier Transform (FFT), Summary – Materials for Test
Summary – Material for the Test
– Introduction – Slides: 4, 5, 6
– Matrix Multiply basic algorithm – Slides: 49 – 54
– Nbody –
– FFT -