
1 Computing Environment
The computing environment is rapidly evolving - you need to know not only the methods, but also:
- How and when to apply them,
- Which computers to use,
- What type of code to write,
- What kind of CPU time and memory your jobs will need,
- What tools (e.g., visualization software) to use to analyze the output data.
In short, how to take maximum advantage of, and make the most effective use of, available computing resources.

2 Definitions - Clock Cycles, Clock Speed
A computer chip operates at discrete intervals called clock cycles. Clock speed is often quoted in megahertz (MHz) or gigahertz (GHz); the corresponding clock period is measured in nanoseconds (ns).
- 1800 MHz = 1.8 GHz (fastest Pentium 4 as of today) -> clock period of about 0.56 ns
- 100 MHz (Cray J90 vector processor) -> 10 ns
- It may take several clock cycles to do one multiplication.
- Memory access also takes time, not just computation.
- MHz is not the only measure of CPU speed: different CPUs running at the same MHz often differ in speed.
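
Clock period is simply the reciprocal of clock speed, so the two numbers above follow directly:

  clock period = 1 / clock speed
  1.8 GHz: 1 / (1.8 x 10^9 cycles/s) ≈ 0.56 ns per cycle
  100 MHz: 1 / (1.0 x 10^8 cycles/s) = 10 ns per cycle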

3 Definitions - FLOPS
FLOPS = Floating-point Operations Per Second.
- Megaflops - million FLOPS
- Gigaflops - billion FLOPS
- Teraflops - trillion FLOPS
FLOPS is a good measure of code performance - typically one add counts as one flop, and one multiplication is also one flop.
- Cray J90 peak speed = 200 Mflops; most codes achieve only about 1/3 of peak
- Cray T90 peak = 3.2 Gflops
- NEC SX-5 CPU = 8 Gflops
- Fastest workstation-class processor as of today (Alpha EV68) ~ 2 Gflops
See http://www.specbench.org for the latest benchmarks of processors on real-world problems. Specbench numbers are relative.
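
As a simple illustration of flop counting (a sketch, not from the original slides), consider a loop that does one multiply and one add per iteration:

      DO I = 1, N
        A(I) = B(I)*C(I) + D(I)    ! 1 multiply + 1 add = 2 flops per iteration
      ENDDO

The loop performs 2N flops in total; at a sustained 200 Mflops it would take roughly 2N / (2 x 10^8) seconds.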

4 MIPS
Million Instructions Per Second - also a measure of computer speed, used mostly in the old days, when computer architectures were relatively simple.

5 Bandwidth
The speed at which data flows across a network or wire.
- 56K modem = 56 kilobits/sec
- T1 link = 1.544 Mbits/sec
- T3 link = 45 Mbits/sec
- FDDI = 100 Mbits/sec
- Fibre Channel = 800 Mbits/sec
- 100BaseT (Fast) Ethernet = 100 Mbits/sec
- Gigabit Ethernet = 1000 Mbits/sec
- Brain system = 3 Gbits/sec
- 1 byte = 8 bits
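
Since 1 byte = 8 bits, bandwidth figures convert directly into transfer times; for example, for a 1 MB (8 x 10^6 bit) file:

  56K modem:     8 x 10^6 bits / 5.6 x 10^4 bits/s ≈ 143 seconds
  Fast Ethernet: 8 x 10^6 bits / 1.0 x 10^8 bits/s = 0.08 seconds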

6 Hardware Evolution
- Mainframe computers
- Supercomputers
- Workstations
- Microcomputers / Personal Computers
- Desktop Supercomputers
- Workstation Super Clusters
- Handheld, Palmtop, Calculators, etc.

7 Types of Processors
- Scalar (serial): one operation per clock cycle.
- Vector: multiple operations per clock cycle, typically achieved at the loop level, where the instructions are the same or similar for each loop index.
- Superscalar (most of today's microprocessors): several instructions per clock cycle.

8 Types of Computer Systems
- Single-processor scalar (e.g., ENIAC, IBM 704, traditional IBM PC and Mac)
- Single-processor vector (e.g., CDC 7600, Cray-1)
- Multi-processor vector (e.g., Cray X-MP, Cray C90, Cray J90, NEC SX-5)
- Single-processor superscalar (e.g., Sun SPARC workstations)
- Multi-processor scalar (e.g., multi-processor Pentium PCs)
- Multi-processor superscalar (e.g., DEC Alpha based Cray T3E, RS/6000 based IBM SP-2, SGI Origin 2000)
- Clusters of the above (e.g., Linux clusters; the Earth Simulator - a cluster of multiple vector-processor nodes)

9 Memory Architectures
Shared Memory Parallel (SMP) systems:
- Memory can be accessed and addressed uniformly by all processors.
- Fast/expensive CPUs, memory, and networks.
- Easy to use.
- Difficult to scale to many (> 32) processors.
Distributed Memory Parallel (DMP) systems:
- Each processor has its own memory; other processors can access it only via network communications.
- Often built from off-the-shelf components, therefore low cost.
- Hard to use; explicit user specification of communications is often needed.
- Individual CPUs are slow, so DMP systems are not suitable for inherently serial codes.
- High scalability - the largest current system has nearly 10K processors.
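
On a DMP system the communications must be spelled out by the programmer. Below is a minimal sketch of explicit message passing with MPI (the array size, tag, and ranks are arbitrary illustrations, and an MPI library is assumed to be available):

      PROGRAM DMP_EXAMPLE
      ! Processor 0 owns the data; processor 1 cannot see processor 0's
      ! memory and must receive a copy over the network.
      INCLUDE 'mpif.h'
      INTEGER :: IERR, MYRANK, STATUS(MPI_STATUS_SIZE)
      REAL :: A(100)

      CALL MPI_INIT(IERR)
      CALL MPI_COMM_RANK(MPI_COMM_WORLD, MYRANK, IERR)

      IF (MYRANK == 0) THEN
        A = 1.0
        CALL MPI_SEND(A, 100, MPI_REAL, 1, 99, MPI_COMM_WORLD, IERR)
      ELSE IF (MYRANK == 1) THEN
        CALL MPI_RECV(A, 100, MPI_REAL, 0, 99, MPI_COMM_WORLD, STATUS, IERR)
      END IF

      CALL MPI_FINALIZE(IERR)
      END PROGRAM DMP_EXAMPLE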

10 Memory Architectures
Multi-level memory (cache and main memory) architectures:
- Cache - fast and expensive memory.
- Typical L1 cache size in current-day microprocessors ~ 32 KB; L2 size ~ 256 KB to 8 MB.
- Main memory ranges from a few MB to many GB.
- Try to reuse the contents of the cache as much as possible before they are replaced by new data or instructions.
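
A common way to reuse cache contents is loop blocking (tiling). The sketch below (with an illustrative, not tuned, tile size NB) reorders a matrix multiply so that each NB x NB tile is operated on many times while it is still in cache:

      INTEGER, PARAMETER :: N = 512, NB = 64
      REAL :: A(N,N), B(N,N), C(N,N)
      INTEGER :: I, J, K, II, JJ, KK

      A = 1.0; B = 2.0; C = 0.0
      DO JJ = 1, N, NB                       ! loop over tiles
        DO KK = 1, N, NB
          DO II = 1, N, NB
            DO J = JJ, MIN(JJ+NB-1, N)       ! loops within one tile
              DO K = KK, MIN(KK+NB-1, N)
                DO I = II, MIN(II+NB-1, N)
                  C(I,J) = C(I,J) + A(I,K) * B(K,J)
                END DO
              END DO
            END DO
          END DO
        END DO
      END DO
      END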

11 Vector Processing
The most powerful CPUs (e.g., Cray T90 and NEC SX-5) are vector processors that can perform operations on a stream of data in a pipelined fashion. A vector here is defined as an ordered list of scalar values; for example, an array stored in memory is a vector. Vector systems have machine instructions (vector instructions) that fetch a vector of values from memory, operate on them, and store them back to memory. Basically, vector processing is a version of the Single Instruction Multiple Data (SIMD) parallel processing technique. Scalar processing, on the other hand, requires one instruction to act on each data value.

12 Vector Processing - Example

      DO I = 1, N
        A(I) = B(I) + C(I)
      ENDDO

If the above code is vectorized, the following processes will take place:
1. A vector of values in B(I) will be fetched from memory.
2. A vector of values in C(I) will be fetched from memory.
3. A vector add instruction will operate on pairs of B(I) and C(I) values.
4. After a short start-up time, a stream of A(I) values will be stored back to memory, one value every clock cycle.
If the code is not vectorized, the following scalar processes will take place:
1. B(1) will be fetched from memory.
2. C(1) will be fetched from memory.
3. A scalar add instruction will operate on B(1) and C(1).
4. A(1) will be stored back to memory.
5. Steps (1) to (4) will be repeated N times.

13 Vector Processing
Vector processing allows a vector of values to be fed continuously to the vector processor. If the value of N is large enough to make the start-up time negligible in comparison, on average the vector processor is capable of producing close to one result per clock cycle. If the same code is not vectorized (using the J90 as an example), then for every iteration I, e.g., I=1, one clock cycle each is needed to fetch B(1) and C(1), about 4 clock cycles are needed to complete a floating-point add operation, and another clock cycle is needed to store the value A(1). Thus a minimum of 6 clock cycles is needed to produce one result (complete one iteration). We can say that there is a speedup of about 6 times for this example if the code is vectorized. Vector processors can often chain operations such as add and multiplication together, so that both operations can be done in one clock cycle. This further increases the processing speed. It usually helps to have long statements inside vector loops.
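
A short sketch of a loop that benefits from chaining (a code fragment in the style of the example above, not from the original slides):

      DO I = 1, N
        ! the multiply and the add can be chained, so the pipeline
        ! delivers roughly one A(I) per clock cycle once it is full
        A(I) = B(I) * C(I) + D(I)
      ENDDO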

14 Vectorization for Vector Computers
Characteristics of vectorizable code:
- Vectorization can only be done within a DO loop, and it must be the innermost DO loop.
- There need to be sufficient iterations in the DO loop to offset the start-up overhead.
- Try to put more work into a vectorizable statement (by having more operations) to provide more opportunities for concurrent operation (however, the compiler may not vectorize a loop if it is too complicated).
Vectorization inhibitors:
- Recursive data dependencies are one of the most 'destructive' vectorization inhibitors, e.g., A(I) = A(I-1) + B(I)
- Subroutine calls and references to external functions
- Input/output statements
- Assigned GOTO statements
- Certain nested IF blocks and backward transfers within loops
Inhibitors such as subroutine or function calls inside a loop can be removed by expanding the function or inlining the subroutine at the point of reference.
Vectorization directives - compiler directives can be manually inserted into code to force or prevent vectorization of specific loops (see the sketch below).
Most loop vectorizations can be achieved automatically by compilers with the proper options.
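
A sketch of the recurrence inhibitor and of a vectorization directive (the directive shown is Cray-style CDIR$; other compilers use different spellings, e.g., !DIR$, and the offset K is purely illustrative):

      ! Not vectorizable: each iteration needs the result of the previous one
      DO I = 2, N
        A(I) = A(I-1) + B(I)
      ENDDO

      ! Forced vectorization: IVDEP asserts that the offset K creates
      ! no backward dependence, so the compiler may vectorize the loop
CDIR$ IVDEP
      DO I = 1, N
        A(I) = A(I+K) + B(I)
      ENDDO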

15 Parallel Processing
Parallel processing means doing multiple jobs/tasks simultaneously. Vectorization is a type of parallel processing within a processor. Code parallelization usually means parallel processing across many processors, either within a single compute node or across many nodes. One can build a parallel processing system by networking a bunch of PCs together - e.g., a Beowulf Linux cluster.
Amdahl's Law (1967):

  Speedup(N) = 1 / (α + (1 - α)/N)

where α is the fraction of time needed for the serial portion of the task and N is the number of processors. When N approaches infinity, speedup = 1/α.
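
Plugging in numbers makes the limit concrete; for a code that is 10% serial (α = 0.1):

  N = 16:          Speedup = 1 / (0.1 + 0.9/16) ≈ 6.4
  N -> infinity:   Speedup -> 1/α = 10

so no matter how many processors are added, this code can never run more than 10 times faster.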

16 Issues with Parallel Computing
Load balance / synchronization:
- Try to give an equal amount of work to each processor.
- Try to give processors that finish first more work to do (load rebalancing).
- The goal is to keep all processors as busy as possible.
Communication / locality:
- Inter-processor communication is typically the biggest overhead on MPP platforms, because the network is slow relative to CPU speed.
- Try to keep data access local.
- E.g., a 2nd-order finite difference requires data at 3 points, while a 4th-order finite difference requires data at 5 points (see the sketch below).
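
A sketch of the two stencils (standard central-difference formulas for du/dx on a uniform grid; the grid size and test function are arbitrary):

      INTEGER, PARAMETER :: N = 100
      REAL :: U(N), D2(N), D4(N), DX
      INTEGER :: I

      DX = 0.1
      DO I = 1, N
        U(I) = SIN(DX*REAL(I))
      ENDDO
      DO I = 3, N-2
        ! 2nd order: 3-point stencil (I-1 to I+1)
        D2(I) = (U(I+1) - U(I-1)) / (2.0*DX)
        ! 4th order: 5-point stencil (I-2 to I+2), needing a wider halo
        ! of neighboring-processor data in a domain-decomposed code
        D4(I) = (-U(I+2) + 8.0*U(I+1) - 8.0*U(I-1) + U(I-2)) / (12.0*DX)
      ENDDO
      END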

17 A Few Simple Rules for Writing Efficient Code
- Use multiplies instead of divides whenever possible (see the sketch after this slide).
- Make the innermost loop the longest.
Slower loops:

      DO 100 I=1,1000
      DO 10  J=1,10
        A(I,J)=...
   10 CONTINUE
  100 CONTINUE

Faster loops:

      DO 100 J=1,10
      DO 10  I=1,1000
        A(I,J)=...
   10 CONTINUE
  100 CONTINUE

- For a short loop like DO I=1,3, write out the associated expressions explicitly, since the start-up cost may be very high.
- Avoid complicated logic (IFs) inside DO loops.
- Avoid subroutine and function calls inside long DO loops.
- Vectorizable code typically also runs faster on RISC-based superscalar processors.
- KISS principle: Keep It Simple, Stupid.
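
A sketch of the multiply-instead-of-divide rule (the variable names are illustrative): hoist the division out of the loop and multiply by the reciprocal instead.

      ! Slower: one divide per iteration
      DO I = 1, N
        A(I) = B(I) / C
      ENDDO

      ! Faster: compute the reciprocal once, then multiply
      CINV = 1.0 / C
      DO I = 1, N
        A(I) = B(I) * CINV
      ENDDO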

18 Transition in Computing Architectures at NCAR SCD
This chart depicts major NCAR SCD computers from the 1960s onward, along with the sustained gigaflops (billions of floating-point calculations per second) attained by the SCD machines from 1986 to the end of fiscal year 1999. Arrows at right denote the machines that will be operating at the start of FY00. The division is aiming to bring its collective computing power to 100 Gflops by the end of FY00, 200 Gflops in FY01, and 1 teraflop by FY03. (Source: http://www.ucar.edu/staffnotes/9909/IBMSP.html)

