Parallel and Multiprocessor Architectures

Parallel and Multiprocessor Architectures
If the application is implemented correctly to exploit parallelism Higher throughput could be realized (Speed up) Better fault tolerance could be realized Better cost-benefit could be realized It makes economical sense to improve processor speed and efficiency by distributing the computational load amongst several processors Parallelism is key Although parallelism is key, the application suited for parallel processing – if not, could be counter-productive 1 1

As mentioned, with parallelism speed up can be realized The perfect speedup would be, if x number of processors were used, the processing would finish in 1/x of the time – or the runtime would be decreased by a factor of x “Perfect” speedup isn’t a reality because not all of the processors will run at the same speed – the slower processor will be the “bottleneck” Also, any portion of the task that must be processed serially, the other processors will need to wait until it is complete – adding more processors will not bring forth any benefit 2 2

Recall Superscalar architectures – architectures that allow processors to implement instruction-level parallelism (ILP) Recall Very Long Instruction Word (VLIW) architectures – architectures designed to handle very long instructions “setup upfront” for parallelism versus the typical sequenced type instructions setup upfront. In comparing these two, let’s recall Pipelining For every clock cycle, one small step is carried out, and the stages are overlapped. S1. Fetch instruction. S4. Fetch operands. S2. Decode opcode S5. Execute. S3. Calculate effective S6. Store result. address of operands. Superpipelining is when a pipeline stage needs to be executed in less than half a clock cycle – 2nd internal clock added that is twice the speed of the regular cycle – allows two tasks to be completed in a single cycle (versus one) Superpipelining maps to the superscalar architectures – multiple instructions can be executed at the same time in each cycle

Superpipelining is only one aspect of superscalar design. Superscalar architectures include multiple execution units such as specialized integer and floating-point adders and multipliers. A critical component of this architecture is the instruction fetch unit, which can simultaneously retrieve several instructions from memory. A decoding unit determines which of these instructions can be executed in parallel and combines them accordingly. This architecture also requires compilers that make optimum use of the hardware. 4 4

Very long instruction word (VLIW) architectures differ from superscalar architectures in that Superscalar relies on both the hardware and compiler, and VLIW relies ONLY on the compiler The VLIW compiler, instead of a hardware decoding unit, packs independent instructions into one long instruction that is sent down the pipeline to the execution units. Some say this is the best approach because the compiler can better identify instruction dependencies. However, compilers cannot formulate a view of the run time code – so the compiler will be conservative in its scheduling. 5 5

Vector computers are processors that operate on entire vectors or matrices at once. These systems are often called supercomputers. Vector computers are highly pipelined so that arithmetic instructions can be overlapped. Vector processors fall into the SIMD category Vector processors can be categorized according to how operands are accessed. Register-register vector processors require all operands to be in registers. Memory-memory vector processors allow operands to be sent from memory directly to the arithmetic units. 6 6

A disadvantage of register-register vector computers is that large vectors must be broken into fixed-length segments so they will fit into the register sets. Memory-memory vector computers have a longer startup time until the pipeline becomes full. In general, vector machines are efficient because there are fewer instructions to fetch, and corresponding pairs of values can be prefetched because the processor knows it will have a continuous stream of data. 7 7

Parallel MIMD systems can communicate through shared memory or through an interconnection network. Interconnection networks are often classified according to their topology, routing strategy, and switching technique. Of these, the topology is a major determining factor in the overhead cost of message passing. The efficiency of the message passing is limited by: Bandwidth: info-carrying capacity of the network Message Latency: time required for the first bit of a message to reach its destination Transport Latency: the time the message spends in the network Overhead: message-processing activities in the Tx and Rx The objective of network design is to minimize the number of required messages and minimize the distance the messages travel. 8 8

Interconnection networks can be either static or dynamic. Dynamic Networks allow a path between two entities to change (two processors or a processor and a memory) Static Networks do not allow the change Processor-to-memory connections usually employ dynamic interconnections. These can be blocking or nonblocking. Nonblocking interconnections allow new connections to occur in the presence of other simultaneous connections. Blocking type doesn’t allow new connections Processor-to-processor message-passing interconnections are usually static, and can employ any of several different topologies, as shown on the following slide. 9 9

Expensive to build and harder to manage as processors are added Hub can be a bottle, however, provide good connectivity Any entity can directly communicate with its two neighbors Noncyclic structure that has bottleneck potential Any entity can directly communicate with multiple neighbors A mesh network with wrap around Multidimensional Mesh – where each dimension has two processors Simplest and most efficient- bottleneck potential with bus contention as entities grow large 10 10

Switching networks use switches to dynamically alter routing There are two types of switches: (1) crossbar switches or (2) 2  2 switches. Crossbar switches are switches that either open or close Any entity can connect to any other entity by the switch closing (making a connection) Networks using the crossbar switch are fully connected If only one switch is needed per crosspoint, n entities will require n^2 switches A processor can connect to only one memory at a time, so there will at most one closed switch per column 11 11

2x2 Switch can rout its inputs to different destinations. 2x2 Switch has 2 inputs and 2 outputs At any stage, it can be in one of four states: (1) through, (2) cross, (3) upper broadcast and (4) lower broadcast Through state: upper input is directed to upper output and lower input is directed to lower output Cross state: upper input is directed to lower output and lower input is directed to upper output Upper Broadcast state: upper input broadcast to upper and lower outputs Lower Broadcast state: lower input broadcast to upper and lower outputs 12 12

Multistage interconnection (or shuffle) networks are the most advanced class of switching networks. Is built using 2x2 switches using stages with processors on one side and memories on the other side The interior switches dynamically configure to allow a path from any processor to any memory Depending on the configuration of the interior switches, blocking can occur (ie. Given CPU 00 is connected to Memory 00 via 1A and 2A, CPU 10 can’t communicate to Memory 10) By adding more switches and more stages, non-blocking can be achieved A network of x nodes requires log2x stages with x/2 switches per stage 13 13

There are advantages and disadvantages to each switching approach. Bus-based networks, while economical, can be bottlenecks. Parallel buses can alleviate bottlenecks, but are costly. Crossbar networks are nonblocking, but require n2 switches to connect n entities. Omega networks are blocking networks, but exhibit less contention than bus-based networks. They are somewhat more economical than crossbar networks, n nodes needing log2n stages with n / 2 switches per stage. 14 14

Parallel and Multiprocessor Architectures

Similar presentations

Presentation on theme: "Parallel and Multiprocessor Architectures"— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Parallel and Multiprocessor Architectures

Similar presentations

Presentation on theme: "Parallel and Multiprocessor Architectures"— Presentation transcript:

Similar presentations

About project

Feedback