INTRODUCTION-2
Parallel Computing
● Parallel computing: the use of multiple computers or processors working together on a common task
● Parallel computer: a computer that contains multiple processors:
  ➔ each processor works on its own section of the problem
  ➔ processors are allowed to exchange information with other processors
Parallel vs. Serial Computers
Two big advantages of parallel computers:
1. total performance
2. total memory
● Parallel computers enable us to solve problems that:
  ➔ benefit from, or require, fast solution
  ➔ require large amounts of memory
  ➔ example that requires both: weather forecasting
Parallel vs. Serial Computers
Some benefits of parallel computing include:
● more data points
  ➔ bigger domains
  ➔ better spatial resolution
  ➔ more particles
● more time steps
  ➔ longer runs
  ➔ better temporal resolution
● faster execution
  ➔ faster time to solution
  ➔ more solutions in the same time
  ➔ larger simulations in real time
Serial Processor Performance
Although Moore's Law "predicts" that single-processor performance doubles every 18 months, eventually physical limits on manufacturing technology will be reached.
Types of Parallel Processor
The simplest and most useful way to classify modern parallel computers is by their memory model:
➔ shared memory
➔ distributed memory
Shared vs. Distributed Memory
● Shared memory: single address space. All processors have access to a pool of shared memory. (Ex: SGI Origin, Sun E10000)
● Distributed memory: each processor has its own local memory. Must do message passing to exchange data between processors. (Ex: CRAY T3E, IBM SP, clusters)
Shared Memory
● Uniform memory access (UMA): each processor has uniform access to memory. Also known as symmetric multiprocessors, or SMPs. (Ex: Sun E10000)
● Non-uniform memory access (NUMA): time for memory access depends on the location of the data. Local access is faster than non-local access. Easier to scale than SMPs. (Ex: SGI Origin)
Distributed Memory
Processor-memory nodes are connected by some type of interconnect network:
➔ Massively Parallel Processor (MPP): tightly integrated, single system image
➔ Cluster: individual computers connected by software
Processor, Memory & Network
Both shared- and distributed-memory systems have:
➔ processors: now generally commodity RISC processors
➔ memory: now generally commodity DRAM
➔ network/interconnect: between the processors and memory (bus, crossbar, fat tree, torus, hypercube, etc.)
Processor-Related Terms
● Clock period (cp): the minimum time interval between successive actions in the processor. Fixed by the design of the processor. Measured in nanoseconds (~1-5 ns for the fastest processors). Inverse of the clock frequency (MHz or GHz); e.g., a 1 GHz clock gives a 1 ns clock period.
● Instruction: an action executed by a processor, such as a mathematical operation or a memory operation.
● Register: a small, extremely fast location for storing data or instructions in the processor.
Processor-Related Terms
● Functional Unit (FU): a hardware element that performs an operation on an operand or pair of operands. Common FUs are ADD, MULT, INV, SQRT, etc.
● Pipeline: a technique enabling multiple instructions to be overlapped in execution.
● Superscalar: multiple instructions per clock period are possible.
● Flops: floating-point operations per second.
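To make the superscalar and flops definitions concrete, here is a minimal sketch of the usual peak-flops estimate (peak = clock rate x flops issued per cycle); the 500 MHz clock and 2 flops/cycle below are illustrative assumptions, not figures from the slides.

    #include <stdio.h>

    /* Illustrative peak-flops estimate: peak = clock rate x flops per cycle.
       The numbers below are assumptions for the sake of the example.        */
    int main(void)
    {
        double clock_hz        = 500e6; /* assumed 500 MHz clock                        */
        double flops_per_cycle = 2.0;   /* assumed superscalar: 1 add + 1 mult per cycle */

        double peak_flops = clock_hz * flops_per_cycle;
        printf("Peak: %.0f Mflops\n", peak_flops / 1e6);
        return 0;
    }

Real codes rarely sustain this peak; the gap between peak and sustained flops is largely a memory-hierarchy issue, discussed next.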
Processor-Related Terms
● Cache: fast memory (SRAM) near the processor. Helps keep instructions and data close to the functional units so the processor can execute more instructions more rapidly.
● Translation Lookaside Buffer (TLB): keeps the addresses of pages (blocks of memory) in main memory that have recently been accessed (a cache for memory addresses).
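As an illustration of why the cache matters, the C sketch below (the array size is an arbitrary assumption) sums the same matrix in two loop orders: the first walks memory contiguously and reuses each cache line, the second strides through memory and misses far more often.

    #include <stdio.h>

    #define N 1024
    static double a[N][N];   /* stored row-major in C */

    int main(void)
    {
        double sum = 0.0;

        /* Cache-friendly: the inner loop walks consecutive addresses,
           so each cache line fetched from DRAM is fully used.          */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                sum += a[i][j];

        /* Cache-unfriendly: the inner loop strides by N doubles,
           touching a new cache line on almost every access.            */
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                sum += a[i][j];

        printf("sum = %f\n", sum);
        return 0;
    }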
Memory-Related Terms
● SRAM: Static Random Access Memory. Very fast (~10 nanoseconds); made using the same kind of circuitry as the processors, so its speed is comparable.
● DRAM: Dynamic RAM. Longer access times (~100 nanoseconds), but holds more bits and is much less expensive (~10x cheaper).
● Memory hierarchy: the hierarchy of memory in a parallel system, from registers to cache to local memory to remote memory. More on this later.
Interconnect-Related Terms
● Latency:
  ➔ Networks: how long does it take to start sending a "message"? Measured in microseconds.
  ➔ Processors: how long does it take to output the results of an operation, such as a floating-point add or divide, which is pipelined?
● Bandwidth: what data rate can be sustained once the message is started? Measured in Mbytes/sec or Gbytes/sec.
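These two terms are often combined in the simple model t = latency + message size / bandwidth. The C sketch below applies it; the 10 microsecond latency and 100 Mbytes/sec bandwidth are illustrative assumptions, not measurements of any particular machine.

    #include <stdio.h>

    /* Simple message-time model: t = latency + size / bandwidth.
       The latency and bandwidth values below are assumed for illustration. */
    static double msg_time(double bytes, double latency_s, double bandwidth_Bps)
    {
        return latency_s + bytes / bandwidth_Bps;
    }

    int main(void)
    {
        double latency   = 10e-6;   /* assumed 10 microsecond startup cost   */
        double bandwidth = 100e6;   /* assumed 100 Mbytes/sec sustained rate */

        /* Small messages are dominated by latency, large ones by bandwidth. */
        printf("8 B message:  %10.2f us\n", msg_time(8.0, latency, bandwidth) * 1e6);
        printf("1 MB message: %10.2f us\n", msg_time(1e6, latency, bandwidth) * 1e6);
        return 0;
    }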
Interconnect-Related Terms
● Topology: the manner in which the nodes are connected.
  ➔ The best choice would be a fully connected network (every processor connected to every other), but this is unfeasible for cost and scaling reasons.
  ➔ Instead, processors are arranged in some variation of a grid, torus, or hypercube.
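To see why full connectivity is unfeasible while a hypercube scales well, the small C sketch below (an illustration, not from the slides) compares link counts: a fully connected network of p nodes needs p(p-1)/2 links, while a d-dimensional hypercube of p = 2^d nodes needs only p*d/2, and a node's neighbors are found by flipping one bit of its address.

    #include <stdio.h>

    int main(void)
    {
        /* Compare wiring cost of a fully connected network vs. a hypercube
           for p = 2^d processors (d = hypercube dimension).                */
        for (int d = 2; d <= 10; d += 2) {
            long p           = 1L << d;          /* number of nodes          */
            long full_links  = p * (p - 1) / 2;  /* every pair connected     */
            long hcube_links = p * d / 2;        /* d links per node, shared */
            printf("p = %5ld   fully connected: %9ld links   hypercube: %6ld links\n",
                   p, full_links, hcube_links);
        }

        /* In a hypercube, node i's neighbors are i with one address bit flipped. */
        int i = 5, dim = 4;                      /* example node in a 16-node hypercube */
        printf("neighbors of node %d: ", i);
        for (int k = 0; k < dim; k++)
            printf("%d ", i ^ (1 << k));
        printf("\n");
        return 0;
    }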
Putting the Pieces Together
● Shared memory architectures:
  ➔ Uniform Memory Access (UMA): Symmetric Multi-Processors (SMPs). Ex: Sun E10000
  ➔ Non-Uniform Memory Access (NUMA): most common are Distributed Shared Memory (DSM), or cc-NUMA (cache coherent NUMA), systems. Ex: SGI Origin 2000
● Distributed memory architectures:
  ➔ Massively Parallel Processor (MPP): tightly integrated system, single system image. Ex: CRAY T3E, IBM SP
  ➔ Clusters: commodity nodes connected by an interconnect. Ex: Beowulf clusters
Symmetric Multiprocessors (SMPs)
● SMPs connect processors to global shared memory using one of:
  ➔ bus
  ➔ crossbar
● Provides a simple programming model, but has problems:
  ➔ buses can become saturated
  ➔ crossbar size must increase with the number of processors
● The problem grows with the number of processors, limiting the maximum size of SMPs
Shared Memory Programming
● Programming models are easier since message passing is not necessary
● Techniques:
  ➔ autoparallelization via compiler options
  ➔ loop-level parallelism via compiler directives
  ➔ OpenMP
  ➔ pthreads
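As a minimal sketch of loop-level parallelism with compiler directives, the OpenMP example below parallelizes a vector sum; the array size and the reduction clause are illustrative choices, not taken from the slides.

    #include <stdio.h>
    #include <omp.h>

    #define N 1000000

    int main(void)
    {
        static double x[N];
        double sum = 0.0;

        for (int i = 0; i < N; i++)
            x[i] = 1.0;

        /* The directive splits the loop iterations among threads that all
           share the same address space; the reduction clause combines the
           per-thread partial sums safely.                                  */
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < N; i++)
            sum += x[i];

        printf("sum = %f (threads available: %d)\n", sum, omp_get_max_threads());
        return 0;
    }

Built with an OpenMP-aware compiler (e.g., cc -fopenmp), the loop runs on multiple threads; a compiler that ignores the directive still produces a correct serial program.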
Massively Parallel Processors
● Each processor has its own memory:
  ➔ memory is not shared globally
  ➔ adds another layer to the memory hierarchy (remote memory)
● Processor/memory nodes are connected by an interconnect network
  ➔ many possible topologies
  ➔ processors must pass data via messages
  ➔ communication overhead must be minimized
Types of Interconnections
● Fully connected
  ➔ not feasible
● Array and torus
  ➔ Intel Paragon (2D array), CRAY T3E (3D torus)
● Crossbar
  ➔ IBM SP (8 nodes)
● Hypercube and fat tree
  ➔ SGI Origin 2000 (hypercube), Meiko CS-2 (fat tree)
● Combinations of some of the above
  ➔ IBM SP (crossbar & fully connected for up to 80 nodes)
  ➔ IBM SP (fat tree for > 80 nodes)
Distributed Memory Programming
● Message passing is most efficient
  ➔ MPI
  ➔ MPI-2
  ➔ Active/one-sided messages: vendor libraries SHMEM (T3E) and LAPI (SP); coming in MPI-2
● Shared memory models can be implemented in software, but are not as efficient
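A minimal MPI message-passing sketch is shown below; the value sent and the choice of ranks are illustrative, and error handling is omitted for brevity.

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, size;
        double value = 0.0;
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* Each processor works in its own address space, so data must be
           exchanged explicitly: rank 0 sends a value, rank 1 receives it. */
        if (rank == 0 && size > 1) {
            value = 3.14;
            MPI_Send(&value, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(&value, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &status);
            printf("rank 1 received %f from rank 0\n", value);
        }

        MPI_Finalize();
        return 0;
    }

Typically built with mpicc and launched with something like mpirun -np 2 ./a.out, assuming an MPI library is installed.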
Distributed Shared Memory
● More generally called cc-NUMA (cache coherent NUMA)
● Consists of m SMPs with n processors in a global address space:
  ➔ each processor has some local memory (SMP)
  ➔ all processors can access all memory: extra "directory" hardware on each SMP tracks values stored in all SMPs
  ➔ hardware guarantees cache coherency
  ➔ access to memory on other SMPs is slower (NUMA)
Distributed Shared Memory
● Easier to build because slower access to remote memory is acceptable (no expensive bus/crossbar)
● Similar cache problems
● Code writers should be aware of data distribution
  ➔ load balance: minimize accesses to "far" memory (see the sketch below)
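One common way for code writers to be aware of data distribution on cc-NUMA systems is first-touch placement: initialize data with the same threads that will later use it, so each page ends up in memory near those processors. The OpenMP sketch below illustrates the idea; the first-touch page-placement policy is an assumption about the operating system, and the array size is arbitrary.

    #include <stdio.h>
    #include <stdlib.h>
    #include <omp.h>

    #define N 1000000

    int main(void)
    {
        double *x = malloc(N * sizeof *x);
        if (x == NULL) return 1;

        /* Parallel initialization: on a first-touch system, each page is
           placed in the memory local to the thread that first writes it. */
        #pragma omp parallel for schedule(static)
        for (int i = 0; i < N; i++)
            x[i] = 0.0;

        /* Same static schedule: each thread mostly touches the pages it
           initialized, minimizing accesses to "far" memory.              */
        #pragma omp parallel for schedule(static)
        for (int i = 0; i < N; i++)
            x[i] += 1.0;

        printf("x[0] = %f, x[N-1] = %f\n", x[0], x[N - 1]);
        free(x);
        return 0;
    }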