1 Parallelism Parallelism means doing multiple things at the same time: you can get more work done in the same amount of time.

2 Instruction Level Parallelization
Superscalar: perform multiple operations at a time (in each clock cycle) – like a working group with different people working on different parts of a project.
Pipeline: start performing an operation on one piece of data while finishing the same operation on another piece of data (like an assembly line).
Superpipeline: perform multiple pipelined operations at the same time (like a factory with multiple assembly lines).
Vector: load multiple pieces of data into special registers in the CPU and perform the same operation on all of them at the same time (single-instruction, multiple-data/SIMD operations – such CPUs are expensive).

3 Cray-1, one of the first vector supercomputers, 1976
Single vector processor
133 MFLOPS peak speed
1 million 8-byte words = 8 MB of memory
Price: $5-8 million
Liquid cooling system using Freon
Right: Cray-1A on exhibit at the NCAR Mesa Lab.

4 Cray T90, the last major shared-memory parallel vector supercomputer made by a US company
Up to 32 vector processors
1.76 GFLOPS per CPU
512 MW (megawords) = 4 GB of memory
Decommissioned from the San Diego Supercomputer Center in March 2002
See Historic Cray Systems at

5 Cray J90: air-cooled, low-cost vector supercomputers introduced by Cray in the mid-1990s. OU owned a 16-processor J90, which was decommissioned in early 2003.

6 What’s an Instruction?
Load a value from a specific address in main memory into a specific register.
Store a value from a specific register into a specific address in main memory.
Add two specific registers together and put their sum in a specific register – or subtract, multiply, divide, square root, etc.
Determine whether two registers both contain nonzero values (“AND”).
Jump from one sequence of instructions to another (branch).
… and so on.

7 Scalar Operation
z = a * b + c * d
How would this statement be executed?
Load a into register R0
Load b into R1
Multiply R0 and R1
Store the result of R0 * R1 into register R2
Load c into R3
Load d into R4
Multiply R3 and R4
Store the result of R3 * R4 into register R5
Add R6 = R2 + R5
Store R6 into z
It takes 10 operations to complete the calculation.

8 Does Order Matter?
z = a * b + c * d
One order:
Load a into R0
Load b into R1
Multiply R2 = R0 * R1
Load c into R3
Load d into R4
Multiply R5 = R3 * R4
Add R6 = R2 + R5
Store R6 into z
Another order:
Load d into R4
Load c into R3
Multiply R5 = R3 * R4
Load a into R0
Load b into R1
Multiply R2 = R0 * R1
Add R6 = R2 + R5
Store R6 into z
In the cases where order doesn’t matter, we say that the operations are independent of one another.

9 Superscalar Operation
z = a * b + c * d
Load a into R0 AND load b into R1
Multiply R2 = R0 * R1 AND load c into R3 AND load d into R4
Multiply R5 = R3 * R4
Add R6 = R2 + R5
Store R6 into z
So we go from 8 operations down to 5.

10 Superscalar Loops
DO i = 1, n
   z(i) = a(i)*b(i) + c(i)*d(i)
END DO !! i = 1, n
Each of the iterations is completely independent of all of the other iterations; e.g.,
z(1) = a(1)*b(1) + c(1)*d(1)
has nothing to do with
z(2) = a(2)*b(2) + c(2)*d(2)
Operations that are independent of each other can be performed in parallel.

11 Superscalar Loops
for (i = 0; i < n; i++) {
   z[i] = a[i]*b[i] + c[i]*d[i];
} /* for i */
Load a[i] into R0 AND load b[i] into R1
Multiply R2 = R0 * R1 AND load c[i] into R3 AND load d[i] into R4
Multiply R5 = R3 * R4 AND load a[i+1] into R0 AND load b[i+1] into R1
Add R6 = R2 + R5 AND load c[i+1] into R3 AND load d[i+1] into R4
Store R6 into z[i] AND multiply R2 = R0 * R1
etc., etc., etc.

12 Pipelining Pipelining is like an assembly line.
An operation consists of multiple stages. After a set of operands complete a particular stage, they move into the next stage. Then, another set of operands can move into the stage that was just abandoned.

13 Pipelining Example
A five-stage pipeline (Instruction Fetch, Decode, Operand, Execution, Result Writeback) processing four loop iterations:

          t=0   t=1   t=2   t=3   t=4   t=5   t=6   t=7
   i = 1  IF    DEC   OPR   EXE   WB
   i = 2        IF    DEC   OPR   EXE   WB
   i = 3              IF    DEC   OPR   EXE   WB
   i = 4                    IF    DEC   OPR   EXE   WB

(IF = Instruction Fetch, DEC = Decode, OPR = Operand, EXE = Execution, WB = Result Writeback)
If each stage takes, say, one CPU cycle, then once the loop gets going, each iteration of the loop only increases the total time by one cycle. So a loop of length 1000 takes only 1004 cycles, instead of 5000 for the above example.

14 Superpipelining Superpipelining is a combination of superscalar and pipelining. So, a superpipeline is a collection of multiple pipelines that can operate simultaneously. In other words, several different operations can execute simultaneously, and each of these operations can be broken into stages, each of which is filled all the time (e.g., a car engine assembly line and a car body assembly line can operate at the same time). For example, z = a * b + c * d can employ superscalar operations (reducing the operation count from 8 to 5), and pipelining can be achieved at the loop level (as in the previous slide), so you can get multiple operations per CPU cycle. For example, an IBM Power4 can have over 200 different operations “in flight” at the same time.

15 Vector Processing The most powerful CPUs or processors (e.g., NEC SX-6) are vector processors that can perform operations on a stream of data almost simultaneously. A vector here is defined as an ordered list of scalar values. For example, an array stored in memory is a vector. Vector systems have machine instructions (vector instructions) that fetch a vector of values (instead of single ones) from memory, place them into special vector registers, operate on them essentially simultaneously, and store them back to memory. Basically, vector processing is a version of the Single Instruction Multiple Data (SIMD) parallel processing technique. In contrast, scalar processing usually requires one instruction to act on each data value.

16 Vector Processing A vector processor is a CPU design that is able to run mathematical operations on multiple data elements simultaneously, in contrast to a scalar processor, which handles one element at a time. The vast majority of CPUs are scalar (or close to it). Vector processors were common in scientific computing, where they formed the basis of most supercomputers through the 1980s and into the 1990s, but general increases in performance and changes in processor design saw the near disappearance of the pure vector processor as a general-purpose CPU. Today commodity CPU designs include some vector processing instructions, typically known as SIMD (Single Instruction, Multiple Data), such as SSE (Streaming SIMD Extensions) in Intel processors. Modern video game consoles and consumer computer-graphics hardware in particular rely heavily on vector processing in their architecture. In 2000, IBM, Toshiba and Sony collaborated to create the Cell processor, consisting of one scalar processor and eight vector processors, for the Sony PlayStation 3.

17 History of Vector Processing
Vector processing was first worked on in the early 1960s at Westinghouse in their Solomon project. Solomon's goal was to dramatically increase math performance by using a large number of simple math co-processors (ALUs) under the control of a single master CPU. The CPU fed a single common instruction to all of the ALUs, one per "cycle", but with a different data point for each one to work on. This allowed the Solomon machine to apply a single algorithm to a large data set, fed in the form of an array. In 1962 Westinghouse cancelled the project, but the effort was restarted at the University of Illinois as the ILLIAC IV. Their version of the design originally called for a 1 GFLOPS machine with 256 ALUs, but when it was finally delivered in 1972 it had only 64 ALUs and could reach only 100 MFLOPS. Nevertheless, it showed that the basic concept was sound, and when used on data-intensive applications, such as computational fluid dynamics, the "failed" ILLIAC was the fastest machine in the world.
The first successful implementations of a vector processor appear to be the CDC STAR-100 and the Texas Instruments Advanced Scientific Computer (ASC). The basic ASC (i.e., "one pipe") ALU used a pipeline architecture which supported both scalar and vector computations, with peak performance reaching approximately 20 MOPS or MFLOPS, readily achieved when processing long vectors. Expanded ALU configurations supported "two pipes" or "four pipes" with a corresponding 2x or 4x performance gain, and memory bandwidth was sufficient to support these expanded modes. The STAR was otherwise slower than CDC's own supercomputers like the CDC 7600, but at data-related tasks these machines could keep up while being much smaller and less expensive. However, they also took considerable time decoding the vector instructions and getting ready to run, so they required very specific data sets to work on before they actually sped anything up.
The technique was first fully exploited in the famous Cray-1. Instead of leaving the data in memory like the STAR and ASC, the Cray design had eight "vector registers" which held sixty-four 64-bit words each. The vector instructions were applied between registers, which is much faster than talking to main memory. In addition, the design had completely separate pipelines for different instructions: for example, addition/subtraction was implemented in different hardware than multiplication. This allowed a batch of vector instructions to themselves be pipelined, a technique called vector chaining. The Cray-1 normally had a performance of about 80 MFLOPS, but with up to three chains running it could peak at 240 MFLOPS – a respectable number even today.
Other examples followed. CDC tried once again with its ETA-10 machine, but it sold poorly, and CDC took that as an opportunity to leave the supercomputing field entirely. Various Japanese companies (Fujitsu, Hitachi and NEC) introduced register-based vector machines similar to the Cray-1, typically slightly faster and much smaller. Oregon-based Floating Point Systems (FPS) built add-on array processors for minicomputers, later building their own minisupercomputers. However, Cray continued to be the performance leader, continually beating the competition with a series of machines that led to the Cray-2, Cray X-MP and Cray Y-MP. Since then, the supercomputer market has focused much more on massively parallel processing than on better implementations of vector processors.
However, recognizing the benefits of vector processing, IBM developed the Virtual Vector Architecture for use in supercomputers, coupling several scalar processors to act as a vector processor. Today the average computer at home crunches as much data watching a short QuickTime video as did all of the supercomputers of the 1970s. Vector processor elements have since been added to almost all modern CPU designs, although they are typically referred to as SIMD. In these implementations the vector unit runs beside the main scalar CPU and is fed data by programs that know it is there.

18 Processor board of a CRAY Y-MP
Processor board of a CRAY Y-MP vector computer (operational ca. ). The board was liquid cooled and carries one vector processor with shared memory (access to one central memory).

19 Vector Processing – Example
DO I = 1, N
   A(I) = B(I) + C(I)
ENDDO
If the above code is vectorized, the following processes will take place:
A vector of values in B(I) will be fetched from memory.
A vector of values in C(I) will be fetched from memory.
A vector add instruction will operate on pairs of B(I) and C(I) values.
After a short start-up time, a stream of A(I) values will be stored back to memory, one value every clock cycle.
If the code is not vectorized, the following scalar processes will take place:
B(1) will be fetched from memory.
C(1) will be fetched from memory.
A scalar add instruction will operate on B(1) and C(1).
A(1) will be stored back to memory.
Steps (1) to (4) will be repeated N times.

20 Vector Processing Vector processing allows a vector of values to be fed continuously to the vector processor. If the value of N is large enough to make the start-up time negligible in comparison, then on average the vector processor is capable of producing close to one result per clock cycle (imagine the automobile assembly line).

21 Vector Processing Vector processors can often chain operations such as an add and a multiplication together (because they have multiple functional units, e.g., for add and multiply), in much the same way as superscalar processors do, so that both operations can be done in one clock cycle. This further increases the processing speed. It usually helps to have long statements inside vector loops in this case (see the sketch below).
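As a hedged illustration (not from the original slides; the array names are hypothetical), the loop below is the kind of code that benefits from chaining: the multiply unit can stream its results directly into the add unit, so once both pipelines are full roughly one y(i) emerges per clock cycle.

   SUBROUTINE chained_loop(n, a, b, c, y)
      INTEGER, INTENT(IN) :: n
      REAL, INTENT(IN)    :: a(n), b(n), c(n)
      REAL, INTENT(OUT)   :: y(n)
      INTEGER :: i
      DO i = 1, n
         y(i) = a(i)*b(i) + c(i)   ! multiply result feeds the add (chaining)
      END DO
   END SUBROUTINE chained_loop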

22 Vectorization for Vector Computers
Characteristics of Vectorizable Code
Vectorization can only be done within a DO loop, and it must be the innermost DO loop (unless loops at different levels are merged together by the compiler).
There need to be sufficient iterations in the DO loop to offset the start-up time overhead.
Try to put more work into a vectorizable statement (by having more operations) to provide more opportunities for concurrent operations (however, the compiler may not be able to vectorize a loop if it is too complicated).
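A minimal sketch of the first and third points (hypothetical arrays, not from the original slides): the innermost loop is long, and two related updates are combined so that each pass through the loop offers more independent operations to the vector unit.

   DO i = 1, n                        ! innermost (and only) loop, n large
      p(i) = a(i)*b(i) + c(i)*d(i)    ! several independent operations per
      q(i) = a(i)*c(i) - b(i)*d(i)    ! iteration can be vectorized together
   END DO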

23 Vectorization for Vector Computers
Vectorization Inhibitors
Recursive data dependence is one of the most 'destructive' vectorization inhibitors, e.g., A(I) = A(I-1) + B(I) (illustrated below).
Subroutine calls (unless they are inlined by the compiler, i.e., the calls are replaced by the explicit code of the subroutines).
References to external functions.
Input/output statements.
Assigned GOTO statements.
Certain nested IF blocks and backward transfers within loops.
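A side-by-side sketch of the first inhibitor (hypothetical arrays): in the first loop each iteration needs the result of the previous one, so the iterations cannot proceed simultaneously; in the second, the iterations are independent and the loop can vectorize.

   DO i = 2, n
      a(i) = a(i-1) + b(i)   ! recursive dependence on a(i-1): not vectorizable
   END DO

   DO i = 2, n
      c(i) = a(i-1) + b(i)   ! reads a, writes a different array: vectorizable
   END DO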

24 Vectorization for Vector Computers
Inhibitors such as subroutine or function calls inside a loop can be removed by expanding the function or inlining the subroutine at the point of reference.
Vectorization directives – compiler directives can be manually inserted into the code to force or prevent vectorization of specific loops (a sketch follows below).
Most loop vectorization can be achieved automatically by compilers with the proper option (e.g., -parallel with the Intel Fortran compiler ifort version 8 or later).
Although true vector processors (e.g., Cray processors) have become few, limited vector support has been incorporated into superscalar processors in recent years, such as the Pentium 4, AMD processors, and the PowerPC processors used in Macs.
Code that vectorizes well generally also performs well on superscalar processors, since the latter also exploit pipeline parallelism in the program.
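As a sketch of what such a directive looks like (Intel-style !DIR$ syntax is assumed here; the exact spelling varies by compiler, and the loop is hypothetical), a directive placed just before the loop tells the compiler to ignore assumed data dependences that it cannot rule out on its own, for example with indirect addressing:

!DIR$ IVDEP                           ! assert there is no loop-carried dependence
   DO i = 1, n
      a(idx(i)) = a(idx(i)) + b(i)    ! indirect addressing defeats the
   END DO                             ! compiler's own dependence analysis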

25 Shared-memory Parallelization (SMP)
Vectorization occurs only at the loop level, on the same processor; parallelization usually refers to parallel processing across CPUs/cores.
For shared-memory systems, one can partition the workload across processors without explicitly worrying about distributing the data, because all processors have access to the same data (e.g., arrays).
Shared-memory parallelization can occur at the loop level or at the task/subroutine level.
Certain compilers are capable of auto-parallelizing code, usually at the loop level, by analyzing the dependence of operations within loops.
OpenMP is now the standard that one can use to explicitly tell the compiler how to parallelize, through the insertion of directives into the code.
Different compiler options are usually needed to invoke auto-parallelization or OpenMP (for the Intel ifort compiler, they are -parallel and -openmp, respectively).
When running SMP parallel programs, environment variables need to be set to tell the program how many processors (threads) to use. For example: setenv OMP_NUM_THREADS 16

26 Example with OpenMP directives
From Try the exercises! See complete OpenMP tutorial at
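The example code itself did not survive in this transcript, so the following is a minimal stand-in sketch (not the original slide's program) showing loop-level parallelization with OpenMP directives in Fortran:

   PROGRAM omp_example
      USE omp_lib                       ! for omp_get_max_threads()
      IMPLICIT NONE
      INTEGER, PARAMETER :: n = 1000000
      REAL    :: a(n), b(n), c(n)
      INTEGER :: i

      a = 1.0
      b = 2.0

!$OMP PARALLEL DO PRIVATE(i) SHARED(a, b, c)
      DO i = 1, n
         c(i) = a(i) + b(i)             ! iterations are split among threads
      END DO
!$OMP END PARALLEL DO

      PRINT *, 'c(1) =', c(1), '  threads =', omp_get_max_threads()
   END PROGRAM omp_example

Compiled with the OpenMP option mentioned above (e.g., ifort -openmp) and run with OMP_NUM_THREADS set, the iterations of the DO loop are divided among the threads.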

27 Amdahl’s Law
Amdahl’s Law (1967): the speedup of a task on N processors is
   Speedup(N) = 1 / (a + (1 - a)/N)
where a is the time needed for the serial portion of the task (as a fraction of the total serial run time). When N approaches infinity, speedup = 1/a.
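For example, if a = 0.1 (10% of the run time is inherently serial), then with N = 16 processors the speedup is 1/(0.1 + 0.9/16) = 6.4, and no matter how many processors are added the speedup can never exceed 1/0.1 = 10.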

28 Moore’s Law Moore's law is the observation that the number of transistors in a dense integrated circuit doubles approximately every two years. The observation is named after Gordon E. Moore, the co-founder of Intel, whose 1965 paper described a doubling every year in the number of components per integrated circuit, and projected this rate of growth would continue for at least another decade. In 1975, looking forward to the next decade, Moore revised the forecast to doubling every two years. His prediction proved accurate for several decades, and the law was used in the semiconductor industry to guide long-term planning and to set targets for research and development. Advancements in digital electronics are strongly linked to Moore's law: quality-adjusted microprocessor prices, memory capacity, sensors and even the number and size of pixels in digital cameras. The doubling period is often quoted as 18 months because of Intel executive David House, who predicted that chip performance would double every 18 months (being a combination of the effect of more transistors and their being faster).

29

30

31 Message Passing for Distributed Memory Parallel (DMP) Systems
Message Passing Interface (MPI) has become the standard for parallel programming on distributed-memory parallel systems.
MPI is essentially a library of subroutines that handle data exchanges between programs running on different nodes.
Domain decomposition is the commonly used strategy for distributing data and computations among nodes.

32 Domain Decomposition Strategy used by ARPS and WRF Models
Spatial domain decomposition of ARPS grid. Each square in the upper panel corresponds to a single processor with its own memory. The lower panel details a single processor. Grid points having a white background are updated without communication, while those in dark stippling require information from neighboring processors. To avoid communication in the latter, data from the outer border of neighboring processors (light stippling) are stored locally.
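A minimal sketch of the idea (this is not the actual ARPS code; it assumes a 1-D decomposition with a one-point halo and hypothetical variable names): before the stippled points are updated, each processor swaps its border columns with its left and right neighbors so that the needed neighbor data are stored locally.

   SUBROUTINE exchange_halo(u, ny, nx, left, right)
      ! u(1:ny, 0:nx+1) is the local slab plus one halo column on each side;
      ! left/right are neighbor ranks (MPI_PROC_NULL at a physical boundary,
      ! so those exchanges become no-ops).
      USE mpi
      IMPLICIT NONE
      INTEGER, INTENT(IN)    :: ny, nx, left, right
      REAL,    INTENT(INOUT) :: u(ny, 0:nx+1)
      INTEGER :: ierr, status(MPI_STATUS_SIZE)

      ! send first interior column to the left, receive right halo
      CALL MPI_SENDRECV(u(:,1),    ny, MPI_REAL, left,  0,              &
                        u(:,nx+1), ny, MPI_REAL, right, 0,              &
                        MPI_COMM_WORLD, status, ierr)
      ! send last interior column to the right, receive left halo
      CALL MPI_SENDRECV(u(:,nx),   ny, MPI_REAL, right, 1,              &
                        u(:,0),    ny, MPI_REAL, left,  1,              &
                        MPI_COMM_WORLD, status, ierr)
   END SUBROUTINE exchange_halo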

33 Example MPI Fortran Program
From
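The program itself did not come through in this transcript; below is a minimal stand-in sketch (not the original example) showing the typical structure of an MPI Fortran program: initialize, find the rank and number of processes, exchange data, and finalize.

   PROGRAM mpi_example
      USE mpi
      IMPLICIT NONE
      INTEGER :: ierr, rank, nprocs, psum, total

      CALL MPI_INIT(ierr)
      CALL MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
      CALL MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)

      psum = rank + 1                   ! each process contributes a value
      CALL MPI_REDUCE(psum, total, 1, MPI_INTEGER, MPI_SUM, 0,          &
                      MPI_COMM_WORLD, ierr)

      IF (rank == 0) PRINT *, 'Sum over', nprocs, 'processes =', total
      CALL MPI_FINALIZE(ierr)
   END PROGRAM mpi_example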

34

35 Advanced Regional Prediction System (ARPS): http://arps.caps.ou.edu
Performance of the ARPS on various computing platforms. The ARPS is executed using a 19x19x43 km grid per processor, so that the overall grid domain increases proportionally as the number of processors increases. For a perfect system, the run time should not increase as processors are added, which would result in a line of zero slope. Advanced Regional Prediction System (ARPS):

36 Issues with Parallel Computing
Load balance / synchronization
Try to give an equal amount of work to each processor.
Try to give processors that finish first more work to do (load rebalancing).
The goal is to keep all processors as busy as possible.
Communication / locality
Inter-processor or inter-node communication is typically the largest overhead on distributed-memory platforms, because the network is slow relative to the CPU speed.
Try to keep data access local: e.g., a 2nd-order finite difference requires data at 3 points, while a 4th-order finite difference requires data at 5 points (see the sketch below).
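As a sketch of the locality point (hypothetical arrays): a 2nd-order centered difference at point i touches only points i-1 and i+1, so each processor needs just one halo point from each neighbor, while a 4th-order scheme touches i-2 through i+2 and needs two.

   DO i = 2, n-1                     ! 2nd order: 3-point stencil
      dudx(i) = (u(i+1) - u(i-1)) / (2.0*dx)
   END DO

   DO i = 3, n-2                     ! 4th order: 5-point stencil
      dudx(i) = (-u(i+2) + 8.0*u(i+1) - 8.0*u(i-1) + u(i-2)) / (12.0*dx)
   END DO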

37 A Few Simple Rules for Writing Efficient Code
Use multiplies instead of divides whenever possible (more advanced processors may not care).
Make the innermost loop the longest. Slower loop:
   DO 100 i = 1, 1000
     DO 10 j = 1, 10
       a(i,j) = …
10   CONTINUE
100 CONTINUE
Faster loop:
   DO 100 j = 1, 10
     DO 10 i = 1, 1000
       a(i,j) = …
10   CONTINUE
100 CONTINUE
For a short loop like DO I = 1, 3, write out the associated expressions explicitly, since the loop start-up cost may be very high.
Avoid complicated logic (IFs) inside DO loops.
Avoid subroutine and function calls inside long DO loops.
Vectorizable code typically also runs faster on superscalar processors.
KISS – Keep It Simple, Stupid – principle.
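A complete, runnable version of the faster ordering (a sketch, not from the original slides): Fortran stores arrays in column-major order, so the first index should vary in the innermost (and longest) loop so that memory is walked contiguously.

   PROGRAM loop_order
      IMPLICIT NONE
      INTEGER, PARAMETER :: ni = 1000, nj = 10
      REAL    :: a(ni, nj)
      INTEGER :: i, j

      DO j = 1, nj          ! short loop outside
         DO i = 1, ni       ! long loop inside, walks memory contiguously
            a(i, j) = REAL(i + j)
         END DO
      END DO

      PRINT *, 'a(ni,nj) =', a(ni, nj)
   END PROGRAM loop_order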

38 Transition in Computing Architectures at NCAR SCD
This chart depicts major NCAR SCD (National Center for Atmospheric Research Scientific Computing Division) computers from the 1960s onward, along with the sustained gigaflops (billions of floating-point calculations per second) attained by the SCD machines from 1986 to the end of the fiscal year. Arrows at right denote the machines that will be operating at the start of FY00. The division was aiming to bring its collective computing power to 100 Gflops by the end of FY00, 200 Gflops in FY01, and 1 teraflop by FY03. (Source at

39 Earth Simulator
Earth Simulator – No. 1 in the Top500 in 2002, when it achieved 33 sustained teraflops – a project of the Japanese agencies NASDA, JAERI and JAMSTEC to develop a 40 TFLOPS system for climate modeling. Consists of 640 nodes of 8-GFLOPS NEC vector processors (5,120 processors in total).

40 NSF Track-2 supercomputers
NICS/U. of Tennessee, Kraken (~66K cores, ~100K cores by 10/2009).
June 2013: Tianhe-2, a supercomputer developed by China’s National University of Defense Technology, is the world’s new No. 1 system with a performance of 33.86 petaflop/s on the Linpack benchmark. Tianhe-2 will be deployed at the National Supercomputer Center in Guangzhou, China, by the end of the year. Tianhe-2 has 16,000 nodes, each with two Intel Xeon IvyBridge processors and three Xeon Phi processors, for a combined total of 3,120,000 computing cores.

41 Based on World Top 500 Computing Systems

42

43 A paper discussing peta-scale computing issues for local weather prediction
Xue, M., K. K. Droegemeier, and D. Weber, 2007: Numerical prediction of high-impact local weather: A driver for petascale computing. In Petascale Computing: Algorithms and Applications, D. Bader, Ed., Taylor & Francis. Link is

44 Summary An understanding of the underlying computer architecture is necessary for designing efficient algorithms and for optimizing program code.
Massively parallel cluster-type supercomputers have become the norm.
Off-the-shelf workstation-class/PC processors are commonly used for building such systems – the fastest supercomputers don’t necessarily use current-generation processors, for reasons of economy and power/heat efficiency.
Shared-memory systems are limited in scalability, although not all algorithms are suitable for distributed-memory systems. Shared-memory systems also tend to be more expensive.
Message Passing Interface (MPI) has become the de facto standard for distributed-memory parallel programming.
Numerical algorithms that require frequent access to global data (e.g., the spectral method) are becoming less competitive than those that use mostly local data (e.g., explicit finite difference methods).
For more detailed discussions on High Performance Computing and code optimization, attend Dr. Henry Neeman’s seminars and/or go through his presentations (at

45 Programming and coding standard/convention
ARPS coding standard at

