
1 Computer Architectures... High Performance Computing I, Fall 2001, MAE609/Mth667, Abani Patra

2 Microprocessor
- Basic architecture
- CISC vs. RISC
- Superscalar
- EPIC

3 Performance Measures
- Floating Point Operations Per Second (FLOPS)
- 1 MFLOP: workstations
- 1 GFLOP: readily available HPC
- 1 TFLOP: best now!
- 1 PFLOP: ... 2010?

4 Performance
- T_theor: theoretical peak performance; obtained by multiplying the clock rate by the number of CPUs and the number of FPUs per CPU (worked example below)
- T_real: real performance on some specific operation, e.g. vector add and multiply
- T_sustained: sustained performance on an application, e.g. CFD
- T_sustained << T_real << T_theor
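As a worked illustration (the numbers here are hypothetical, not taken from the slides), the peak figure is just the product of clock rate, number of CPUs, and floating-point results per CPU per cycle:

    $$ T_{\mathrm{theor}} = f_{\mathrm{clock}} \times N_{\mathrm{CPU}} \times N_{\mathrm{FPU/CPU}},
       \qquad \text{e.g.}\ 500\,\mathrm{MHz} \times 8 \times 2 = 8\,\mathrm{GFLOPS}. $$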

5 Performance
- Performance degrades if the CPU has to wait for data to operate on
- Fast CPU => need adequately fast memory
- Rule of thumb: memory in MB = T_theor in MFLOPS

6 Making a Supercomputer Faster
- Reduce cycle time
- Pipelining
  - Instruction pipelines
  - Vector pipelines
- Internal parallelism
  - Superscalar
  - EPIC
- External parallelism

7 Making a Supercomputer Faster
- Reduce cycle time: increase the clock rate
  - Limited by semiconductor manufacturing!
  - Current generation 1-2 GHz (immediate future 10 GHz)
- Pipelining: fine subdivision of an operation into sub-operations, leading to a shorter cycle time but a larger start-up time

8 Pipelining: Instruction Pipelining
- 4-stage instruction pipeline: Fetch Instruction, Fetch Data, Execute, Store
- 3 instructions A, B, C; 4 cycles needed by each instruction
- One result per cycle after the pipe is "full" -- the fill time is the startup time

      cycle:      1    2    3    4    5    6
      Fetch Ins   A    B    C
      Fetch Data       A    B    C
      Execute               A    B    C
      Store                      A    B    C

9 Pipelining
- Almost all current computers use some pipelining, e.g. IBM RS6000
- The speedup of instruction pipelining cannot always be achieved!
  - The next instruction may not be known until execution, e.g. a branch
  - Data for execution may not be available

10 Vector Pipelines
- Effective for operations like

      do 10 I = 1, 1000
  10     c(I) = a(I)*b(I)

- The same instruction is executed 1000 times with different data
- Using a "vector pipe", the whole loop is one vector instruction
- Cray XMP, YMP, T90...

11 Vector Pipelining
- For some operations like a(I) = b(I) + c(I)*d(I), the results of the multiply pipeline are chained to the addition pipeline
- Disadvantages: startup time; code has to be vectorized; loops have to be blocked into vector lengths

12 Internal Parallelism
- Use multiple functional units per processor
  - Cray T90 has 2-track vector units; NEC SX4 and Fujitsu VPP300 have 8-track vector units
  - Superscalar: e.g. the IBM RS6000 POWER2 uses 2 arithmetic units
  - EPIC
- Need to provide data to multiple functional units => fast memory access
- The limiting factor is memory-processor bandwidth

13 External Parallelism
- Use multiple processors
- Shared memory (SMP: symmetric multiprocessors)
  - Many processors accessing the same memory
  - Limited by memory-processor bandwidth
  - SUN Ultra2, SGI Octane, SGI Onyx, Compaq...
- [Diagram: CPU 0 and CPU 1 connected to shared memory banks]

14 External Parallelism
- Distributed memory: many processors, each with local memory and some type of high-speed interconnect
- E.g. IBM SPx, Cray T3E, networks of workstations, Beowulf clusters of Pentium PCs
- [Diagram: CPU 0 and CPU 1 with local memories, connected by an interconnection network]

15 External Parallelism
- SMP clusters: nodes with multiple processors that share local memory; nodes connected by an interconnect
- "Best of both?"

16 Classification of Computers
- Hardware
  - SISD (Single Instruction Single Data)
  - SIMD (Single Instruction Multiple Data)
  - MIMD (Multiple Instruction Multiple Data)
- Programming model
  - SPMD (Single Program Multiple Data)
  - MPMD (Multiple Program Multiple Data)

17 Hardware Classification
- SISD (Single Instruction Single Data)
  - Classical scalar/vector computer -- one instruction, one datum
  - Superscalar -- instructions may run in parallel
- SIMD (Single Instruction Multiple Data)
  - Vector computers
  - Data parallel -- Connection Machine etc., extinct now

18 Hardware Classification
- MIMD (Multiple Instruction Multiple Data)
  - The usual parallel computer
  - Each processor executes its own instructions on different data streams
  - Needs synchronization to get meaningful results

19 Programming Model
- SPMD (Single Program Multiple Data)
  - A single program is run on all processors, with different data
  - Each processor knows its ID -- thus constructs such as

        if (procid .eq. N) then
           ....
        else
           ....
        end if

    can be used for program control (a minimal MPI sketch follows below)
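The slides do not show a complete program; the following is a minimal SPMD sketch using MPI (which the lecture introduces a few slides later). The program name and the master/worker split are illustrative assumptions: every processor runs the same executable and branches on its rank.

    ! Minimal SPMD sketch (assumed example, not from the slides): every processor
    ! runs this same program; behaviour differs only through the process rank.
    program spmd_demo
      use mpi
      implicit none
      integer :: rank, nprocs, ierr

      call MPI_Init(ierr)
      call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)   ! the "processor ID"
      call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)

      if (rank == 0) then
         print *, 'master of ', nprocs, ' processes'
      else
         print *, 'worker ', rank, ' works on its own share of the data'
      end if

      call MPI_Finalize(ierr)
    end program spmd_demo

Launched with, e.g., mpirun -np 4 ./spmd_demo, each of the four identical copies takes a different branch.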

20 Programming Model
- MPMD (Multiple Program Multiple Data)
  - Different programs run on different processors
  - Usually a master-slave model is used

21 Topologies/Interconnects
- Hypercube
- Torus

22 Prototype Supercomputers and Bottlenecks

23 Types of Processors/Computers Used in HPC
- Prototype processors
  - Vector processors
  - Superscalar processors
- Prototype parallel computers
  - Shared memory
    - Without cache
    - With cache (SMP)
  - Distributed memory

24 Vector Processors

25 Vector Processors
- Components:
  - Vector registers
  - ADD/logic pipeline and MULTIPLY pipelines
  - Load/store pipelines
  - Scalar registers + pipelines

26 Vector Registers
- Finite length of vector registers: 32/64/128 etc.
- Strip mining to operate on longer vectors (see the sketch below)
- Codes are often manually restructured into vector-length loops
- Sawtooth performance curve -- maxima at multiples of the vector length
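A minimal strip-mining sketch, assuming a hardware vector length of 64 and a loop of 1000 iterations (both numbers illustrative): the long loop is split into chunks that each fit in one vector register.

    ! Strip-mining sketch (vector length and loop bounds assumed).
    ! Each inner loop fits entirely in one vector register of length 64,
    ! so it can be issued as a single vector instruction.
    program strip_mine
      implicit none
      integer, parameter :: n = 1000, vl = 64   ! vl = hardware vector length
      real :: a(n), b(n), c(n)
      integer :: is, i

      b = 1.0
      c = 2.0
      do is = 1, n, vl                     ! loop over strips
         do i = is, min(is + vl - 1, n)    ! one register-sized chunk
            a(i) = b(i)*c(i)
         end do
      end do
      print *, a(1), a(n)
    end program strip_mine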

27 Vector Processors
- Memory-processor bandwidth: performance depends completely on keeping the vector registers supplied with operands from memory
- Size of main memory and extended memory: the bandwidth of main memory is much higher, but main memory is more expensive; its size determines the size of problem that can be run
- Scalar registers/scalar processors for scalar instructions
- I/O goes through a special processor -- the T90 can produce data at 14400 MB/s while a disk delivers 20 MB/s, so a single word can take 720 cycles on the Cray T90!

28 Superscalar Processor
- Workstations and nodes of parallel supercomputers

29 Superscalar Processor
- Main components:
  - Multiple ALUs and FPUs
  - Data and instruction caches
- "Superscalar" because the ALUs and FPUs can operate in parallel, producing more than one result per cycle
- E.g. IBM POWER2 -- 2 FPUs/ALUs that can each operate in parallel, producing up to 4 results per cycle if the operands are in registers

30 Superscalar Processor
- RISC architecture operating at very high clock speeds (>1 GHz now -- more in a year)
- The processor works only on data in registers, which come only from and go only to the data cache
- If data is not in the cache -- a "cache miss" -- the processor is idle while another cache line (4-16 words) is fetched from memory!

31 Superscalar Processor
- Large off-chip Level 2 caches help with data availability
- L1 cache data is accessed in 1-2 cycles, the L2 cache in 3-4 cycles, and memory can take 8 times that!
- Efficiency is directly related to the reuse of data in cache
- Remedies: blocked algorithms, contiguous storage, avoiding strides and random/non-deterministic access

32 Superscalar Processor
- Remedies:
  - Blocked algorithms: e.g. replace

        do I = 1, 1000
           a(I) = ....
        end do

    with

        do j = 1, 20
           do i = (j-1)*50 + 1, j*50
              a(i) = ....
           end do
        end do

  - Contiguous storage; avoid strides and random/non-deterministic access such as a(ix(i)) = ...

  (A fuller cache-blocking sketch follows below.)
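The slide's example only splits a 1-D iteration range; as a further illustration (matrix size and block size assumed, not from the slides), a blocked matrix transpose keeps each tile of both arrays resident in cache while it is being used.

    ! Hedged sketch of cache blocking (tiling); matrix size and block size
    ! are assumed. Each (nb x nb) tile of a and b is reused while it is
    ! still in cache, instead of streaming whole rows and columns.
    program blocked_transpose
      implicit none
      integer, parameter :: n = 1024, nb = 64
      real, allocatable :: a(:,:), b(:,:)
      integer :: ib, jb, i, j

      allocate(a(n,n), b(n,n))
      call random_number(a)
      do jb = 1, n, nb
         do ib = 1, n, nb
            do j = jb, min(jb+nb-1, n)
               do i = ib, min(ib+nb-1, n)
                  b(j,i) = a(i,j)      ! both arrays touched within one tile
               end do
            end do
         end do
      end do
      print *, b(1,2), a(2,1)
    end program blocked_transpose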

33 Superscalar Processors
- Memory bandwidth is critical to performance
- Many engineering applications are difficult to optimize for cache efficiency
- Application efficiency => memory bandwidth
- The size of memory determines the size of problem that can be solved
- DMA (direct memory access) channels take memory-access duties for external requests (I/O, remote processors) away from the CPU

34 Shared Memory Parallel Computer
- Memory in banks is accessed equally, through a switch (crossbar), by the (usually vector) processors
- The processors run "p" independent tasks with possibly shared data
- Usually compilers and preprocessors can extract the fine-grained parallelism available (see the sketch below)
- Shared memory computer: e.g. Cray T90
- [Diagram: processors P1, P2, P3, ... connected through a switch to shared memory]
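OpenMP is not mentioned on the slides, but it is a standard way to express the kind of loop-level parallelism such compilers and preprocessors extract; a minimal sketch (subroutine name and arguments assumed):

    ! Hedged sketch of loop-level (fine-grained) shared-memory parallelism.
    ! OpenMP stands in here for the compiler/preprocessor support the slide
    ! refers to: independent iterations are split across the processors.
    subroutine scaled_add(n, alpha, x, y)
      implicit none
      integer, intent(in) :: n
      real(8), intent(in) :: alpha, x(n)
      real(8), intent(inout) :: y(n)
      integer :: i

      !$omp parallel do
      do i = 1, n                  ! iterations are independent, so each
         y(i) = y(i) + alpha*x(i)  ! processor can take a chunk of them
      end do
      !$omp end parallel do
    end subroutine scaled_add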

35 Shared Memory Parallel...
- Memory contention and bandwidth limit the number of processors that may be connected
- Memory contention can be reduced by increasing the number of banks and reducing the bank busy time (bbt)
- This type of parallel computer is closest in programming model to the general-purpose single-processor computer

36 Symmetric Multiprocessors (SMP)
- Processors are usually superscalar -- SUN Ultra, MIPS R10000 -- with large caches
- A bus/crossbar is used to connect to the memory modules
- With a bus, only 1 processor can access memory at a time
- E.g. Sun Ultra Enterprise 10000, SGI Power Challenge
- [Diagram: processors P1, P2, P3, ... with caches c1, c2, c3 connected by a bus/crossbar to memory modules M1, M2, M3]

37 Symmetric Multiprocessors
- With a shared interconnect there will be memory contention
- Data flows from memory to cache to the processors
- Cache coherence: if a piece of data is changed in one cache, then all other caches that contain that data must update their value; hardware and software must take care of this

38 Symmetric Multiprocessors
- Performance depends dramatically on the reuse of data in cache; fetching data from the larger memory, with potential memory contention, can be expensive!
- Caches and cache lines are also bigger
- A large L2 cache really plays the role of local fast memory, while the memory banks are more like extended memory accessed in blocks

39 Distributed Memory Parallel Computer
- Prototype DMP: processors are superscalar RISC with only LOCAL memory
- Each processor can only work on data in its local memory
- Communication is required for access to remote memory
- E.g. IBM SP, Intel Paragon, SGI Origin 2000
- [Diagram: processor/memory (P/M) pairs connected by a communication network]

40 Distributed Memory Parallel Computer
- Problems need to be broken up into independent tasks with independent memory -- this naturally matches a data-based decomposition of the problem using an "owner computes" rule (see the sketch below)
- Parallelization is mostly at a high granularity level and controlled by the user -- difficult for compilers/automatic parallelization tools
- These computers are scalable to very large numbers of processors
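A minimal sketch of the "owner computes" rule (the block decomposition and the update itself are assumed for illustration): each process updates only the contiguous block of the global array that it owns.

    ! Hedged "owner computes" sketch. A global array of n entries is split
    ! into contiguous blocks; the owning process is the only one that
    ! updates its block.
    subroutine owner_computes(myrank, nprocs, n, a, b)
      implicit none
      integer, intent(in) :: myrank, nprocs, n   ! myrank in 0..nprocs-1
      real(8), intent(in) :: b(n)
      real(8), intent(inout) :: a(n)
      integer :: chunk, ilo, ihi, i

      chunk = (n + nprocs - 1) / nprocs          ! block size (last may be short)
      ilo = myrank*chunk + 1
      ihi = min((myrank + 1)*chunk, n)

      do i = ilo, ihi                            ! update only the owned block
         a(i) = 2.0d0*b(i)
      end do
    end subroutine owner_computes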

41 Distributed Memory Parallel Computer
- Hybrid parallel computer
- NUMA: non-uniform memory access, a classification based on memory access
- Intel Paragon (the 1st teraflop machine had 4 Pentiums per node with a bus)
- The HP Exemplar has a bus at the node
- [Diagram: nodes of processor/memory pairs on a bus, with the nodes connected by a communication network]

42 Distributed Memory Parallel Computer: Semi-autonomous Memory
- Semi-autonomous memory: a processor can access remote memory using memory control units (MCUs)
- E.g. CRAY T3E and SGI Origin 2000
- [Diagram: processor/memory/MCU nodes connected by a communication network]

43 Distributed Memory Parallel Computer: Fully Autonomous Memory
- Memory and processors are equally distributed over the network
- The Tera MTA is the only example
- Latency and data transfer from memory are at the speed of the network!
- [Diagram: memories and processors attached separately to the communication network]

44 Accessing Distributed Memory: Message Passing
- The user transfers all data using explicit send/receive instructions (a minimal MPI sketch follows below)
- Synchronous message passing can be slow
- Programming with a NEW programming model!
- The user must optimize communication
- Asynchronous/one-sided get and put are faster but need more care in programming
- Codes used to be machine specific -- Intel NEXUS etc. -- until standardized to PVM (Parallel Virtual Machine) and subsequently MPI (Message Passing Interface)
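A minimal message-passing sketch with MPI (the buffer size, message tag, and partner ranks are assumed for illustration): rank 0 sends an array to rank 1 with an explicit send/receive pair.

    ! Minimal MPI send/receive sketch (assumed sizes and ranks; run with 2 processes).
    program msg_demo
      use mpi
      implicit none
      integer, parameter :: n = 100
      real(8) :: buf(n)
      integer :: rank, ierr, status(MPI_STATUS_SIZE)

      call MPI_Init(ierr)
      call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)

      if (rank == 0) then
         buf = 1.0d0
         call MPI_Send(buf, n, MPI_DOUBLE_PRECISION, 1, 99, MPI_COMM_WORLD, ierr)
      else if (rank == 1) then
         call MPI_Recv(buf, n, MPI_DOUBLE_PRECISION, 0, 99, MPI_COMM_WORLD, status, ierr)
         print *, 'rank 1 received, buf(1) =', buf(1)
      end if

      call MPI_Finalize(ierr)
    end program msg_demo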

45 Accessing Distributed Memory: Global Distributed Memory
- Physically distributed and globally addressable -- Cray T3E / SGI Origin 2000
- The user formally accesses remote memory as if it were local -- the operating system/compilers translate such accesses into fetches/stores over the communication network
- High Performance Fortran (HPF) -- a software realization of distributed memory -- arrays etc. can be distributed when declared, using compiler directives (sketch below)
- The compiler translates remote memory accesses into appropriate calls (message passing / OS calls as supported by the hardware)
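A minimal HPF sketch (array size and processor arrangement assumed for illustration): the directives describe how the arrays are laid out across processors, while the loop itself stays ordinary Fortran; to a non-HPF compiler the directives are just comments.

    ! Hedged HPF sketch (sizes and processor grid assumed). The directives
    ! distribute a in blocks over 4 abstract processors and keep b aligned
    ! with it; the compiler generates any remote fetches/stores needed.
    program hpf_demo
      implicit none
      integer, parameter :: n = 1000
      real :: a(n), b(n)
      integer :: i
    !HPF$ PROCESSORS p(4)
    !HPF$ DISTRIBUTE a(BLOCK) ONTO p
    !HPF$ ALIGN b(:) WITH a(:)

      b = 1.0
      do i = 1, n
         a(i) = 2.0*b(i)     ! each processor computes the part of a it owns
      end do
    end program hpf_demo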

46 Processor Interconnects/Topologies
- Buses
  - Lower cost -- but only one pair of devices (processors/memories etc.) can communicate at a time
  - E.g. Ethernet used to link workstation networks
- Switches
  - Like the telephone network -- can sustain many simultaneous communications; higher cost!
- The critical measure is bisection bandwidth -- how much data can be passed between the two halves of the machine

47 Processor Interconnects/Topologies [diagram only]

48 Processor Interconnects/Topologies [diagram only]

49 Processor Interconnects/Topologies
- Workstation network on Ethernet
- Very high latency -- processors must participate in communication

50 Processor Interconnects/Topologies
- 1D and 2D meshes and rings/tori

51 Processor Interconnects/Topologies
- 3D meshes and rings/tori

52 Processor Interconnects/Topologies
- d-dimensional hypercubes

53 Processor Scheduling
- Space sharing: processor banks of 4/8/16 etc. assigned to users for specific times
- Time sharing on processor partitions
- Livermore gang scheduling

54 IBM RS/6000 SP
- Distributed memory parallel computer
- An assembly of workstations using the HPS (a crossbar-type switch)
- Comes with a choice of processors -- POWER2 (variants), POWER3, and clusters of PowerPC (also used by the Apple G3, G4, etc.)

55 POWER2 Processor
- Different versions -- with different frequencies, cache sizes, and bandwidths

56 POWER2 Architecture [diagram only]

57 POWER2
- Dual fixed-point and floating-point units -- multiply/add in each
- Max. 4 floating-point results per cycle
- The ICU (with a 32 KB instruction cache) can execute a branch and a condition per cycle
- Up to 8 instructions may be issued and executed per cycle -- truly SUPERSCALAR!

58 Wide 77 MHz Node Performance
- Theoretical peak performance:
  - 2*77 = 154 MFLOPS for a dyad
  - 4*77 = 308 MFLOPS for a triad
- Cache effects dominate performance
- 256 KB cache and a 256-bit path to the cache and from the cache to memory -- 2 words (8 bytes each) may be fetched and 2 words stored per cycle

59 Expected Performance
- Dyad a(i) = b(i)*c(i) or a(i) = b(i)+c(i): needs 2 loads and 1 store, i.e. 6 memory references per cycle to feed the 2 FPUs, but only 4 are available: (2*77)*(4/6) = 102.7 MFLOPS
- Linked triad a(i) = b(i) + s*c(i) (2 loads, 1 store): (4*77)*(4/6) = 205.3 MFLOPS
- Vector triad a(i) = b(i) + c(i)*d(i) (3 loads, 1 store): (4*77)*(4/8) = 154 MFLOPS
- (The three kernels are written out as loops in the sketch below.)
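Written out as loops (the loop bound n is assumed), the three kernels behind these estimates and their per-iteration memory traffic are:

    ! The three kernels from the slide, written as loops (loop bound n assumed;
    ! per-iteration memory traffic and flop counts in the comments).
    subroutine kernels(n, a, b, c, d, s)
      implicit none
      integer, intent(in) :: n
      real(8), intent(in) :: b(n), c(n), d(n), s
      real(8), intent(out) :: a(n)
      integer :: i

      do i = 1, n          ! dyad: 2 loads + 1 store, 1 flop
         a(i) = b(i)*c(i)
      end do

      do i = 1, n          ! linked triad: 2 loads + 1 store, 2 flops
         a(i) = b(i) + s*c(i)
      end do

      do i = 1, n          ! vector triad: 3 loads + 1 store, 2 flops
         a(i) = b(i) + c(i)*d(i)
      end do
    end subroutine kernels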

60 Cache Hit/Miss
- The performance numbers assumed that the data was available in cache
- If data is not in the cache, it must be fetched from memory in cache lines of 256 bytes (32 double-precision words) each, at a much slower pace

62 Term Paper
- Based on the analysis of the POWER2 processor and the IBM SP presented here, prepare a similar analysis (including estimates of performance) for the new POWER4 chip in the IBM SP or for a cluster of Pentium 4s.

