Parallel and Multiprocessor Architectures

If an application is implemented correctly to exploit parallelism, it can realize higher throughput (speedup), better fault tolerance, and a better cost-benefit ratio. It makes economic sense to improve processor speed and efficiency by distributing the computational load among several processors. Parallelism is key, but the application must be suited to parallel processing; if it is not, parallelizing it can be counterproductive.

As mentioned, parallelism can yield speedup. With perfect speedup, using x processors would finish the work in 1/x of the time; that is, the runtime would decrease by a factor of x. Perfect speedup is not achievable in practice because not all processors run at the same speed, so the slowest processor becomes the bottleneck. In addition, any portion of the task that must be processed serially forces the other processors to wait until it completes, and adding more processors brings no benefit for that portion.
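
The effect of the serial portion can be made concrete with Amdahl's law, which the slide does not name but which is the standard way to express this limit. The sketch below assumes a workload whose parallel fraction is known and processors of equal speed; the function name `speedup` is illustrative only.

```python
def speedup(parallel_fraction: float, processors: int) -> float:
    """Amdahl's law: overall speedup when only part of a task can run in parallel.

    parallel_fraction is the share of the runtime (0.0 to 1.0) that can be
    spread across processors; the rest must run serially.
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if processors < 1:
        raise ValueError("processors must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / processors)


if __name__ == "__main__":
    # A fully parallel task gets perfect speedup; a 10% serial part caps the gain.
    for p in (1, 4, 16, 1024):
        print(p, round(speedup(1.0, p), 2), round(speedup(0.9, p), 2))
```

With a 10% serial portion the speedup can never exceed 10, no matter how many processors are added.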

Recall superscalar architectures: architectures that allow processors to exploit instruction-level parallelism (ILP). Recall very long instruction word (VLIW) architectures: architectures in which parallelism is set up in advance by packing operations into very long instructions, rather than relying on conventionally sequenced instructions. To compare the two, recall pipelining: in every clock cycle one small step is carried out, and the stages of successive instructions are overlapped.
S1. Fetch instruction.
S2. Decode opcode.
S3. Calculate effective address of operands.
S4. Fetch operands.
S5. Execute.
S6. Store result.
Superpipelining applies when a pipeline stage can be executed in less than half a clock cycle: a second internal clock running at twice the speed of the regular clock is added, which allows two tasks to complete in a single cycle instead of one. Superpipelining maps to superscalar architectures, in which multiple instructions can be executed at the same time in each cycle.
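
The benefit of overlapping stages can be illustrated with a cycle count. This is a minimal sketch that assumes an ideal pipeline (one cycle per stage, no stalls or hazards), which real processors do not achieve; the function names are illustrative.

```python
def unpipelined_cycles(stages: int, instructions: int) -> int:
    """Each instruction runs all stages before the next one starts."""
    return stages * instructions


def pipelined_cycles(stages: int, instructions: int) -> int:
    """Ideal pipeline: the first instruction takes `stages` cycles to fill
    the pipeline, then one instruction completes every cycle."""
    if instructions == 0:
        return 0
    return stages + instructions - 1


if __name__ == "__main__":
    # Six stages (S1..S6) and 100 instructions.
    print(unpipelined_cycles(6, 100))  # 600
    print(pipelined_cycles(6, 100))    # 105
```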

Superpipelining is only one aspect of superscalar design. Superscalar architectures include multiple execution units such as specialized integer and floating-point adders and multipliers. A critical component of this architecture is the instruction fetch unit, which can simultaneously retrieve several instructions from memory. A decoding unit determines which of these instructions can be executed in parallel and combines them accordingly. This architecture also requires compilers that make optimum use of the hardware.

Very long instruction word (VLIW) architectures differ from superscalar architectures in that superscalar relies on both the hardware and the compiler, whereas VLIW relies only on the compiler. Instead of a hardware decoding unit, the VLIW compiler packs independent instructions into one long instruction that is sent down the pipeline to the execution units. Some argue this is the better approach because the compiler can better identify instruction dependencies. However, the compiler cannot see how the code will behave at run time, so it must be conservative in its scheduling.
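
To illustrate what "packing independent instructions" might look like, here is a deliberately simple sketch. It assumes instructions are described only by a destination register and source registers, and it uses a conservative in-order greedy strategy; real VLIW compilers use far more sophisticated scheduling. The names `Instr` and `pack_bundles` are hypothetical.

```python
from typing import List, Set, Tuple

# An instruction is (destination register, source registers). Hypothetical format.
Instr = Tuple[str, Tuple[str, ...]]


def pack_bundles(program: List[Instr], width: int) -> List[List[Instr]]:
    """Greedily pack consecutive independent instructions into long words.

    A new bundle is started whenever the next instruction depends on one
    already in the current bundle, or the bundle is full.
    """
    bundles: List[List[Instr]] = []
    current: List[Instr] = []
    written: Set[str] = set()  # registers written by the current bundle
    read: Set[str] = set()     # registers read by the current bundle
    for dest, srcs in program:
        depends = (
            any(s in written for s in srcs)  # read-after-write
            or dest in written               # write-after-write
            or dest in read                  # write-after-read
        )
        if current and (depends or len(current) == width):
            bundles.append(current)
            current, written, read = [], set(), set()
        current.append((dest, srcs))
        written.add(dest)
        read.update(srcs)
    if current:
        bundles.append(current)
    return bundles


if __name__ == "__main__":
    program = [
        ("r1", ("r2", "r3")),
        ("r4", ("r5", "r6")),
        ("r7", ("r1", "r4")),  # needs r1 and r4, so it starts a new bundle
        ("r8", ("r2",)),
    ]
    for bundle in pack_bundles(program, width=2):
        print(bundle)
```

Because the packer only knows register names, not run-time values, it has to assume the worst about every dependency, which is the conservatism the slide describes.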

Vector computers are processors that operate on entire vectors or matrices at once. These systems are often called supercomputers. Vector computers are highly pipelined so that arithmetic instructions can be overlapped. Vector processors fall into the SIMD category and can be classified according to how operands are accessed. Register-register vector processors require all operands to be in registers. Memory-memory vector processors allow operands to be sent from memory directly to the arithmetic units.

A disadvantage of register-register vector computers is that large vectors must be broken into fixed-length segments so they will fit into the register sets. Memory-memory vector computers have a longer startup time until the pipeline becomes full. In general, vector machines are efficient because there are fewer instructions to fetch, and corresponding pairs of values can be prefetched because the processor knows it will have a continuous stream of data.
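
Breaking a long vector into register-sized segments can be sketched as follows. The register length of 64 elements is an assumed value for illustration, and the inner list comprehension stands in for what would be a single vector instruction on real hardware.

```python
from typing import List


def vector_add_segmented(a: List[float], b: List[float],
                         register_length: int = 64) -> List[float]:
    """Add two long vectors one register-sized segment at a time."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    if register_length < 1:
        raise ValueError("register_length must be positive")
    result: List[float] = []
    for start in range(0, len(a), register_length):
        # "Load" one segment of each operand into a vector register.
        seg_a = a[start:start + register_length]
        seg_b = b[start:start + register_length]
        # One vector add per segment; the last segment may be shorter.
        result.extend(x + y for x, y in zip(seg_a, seg_b))
    return result


if __name__ == "__main__":
    print(vector_add_segmented(list(range(10)), [1] * 10, register_length=4))
```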

Parallel MIMD systems can communicate through shared memory or through an interconnection network. Interconnection networks are often classified according to their topology, routing strategy, and switching technique. Of these, the topology is a major determining factor in the overhead cost of message passing. The efficiency of message passing is limited by:
Bandwidth: the information-carrying capacity of the network.
Message latency: the time required for the first bit of a message to reach its destination.
Transport latency: the time the message spends in the network.
Overhead: the message-processing activities in the sender and receiver.
The objective of network design is to minimize the number of required messages and the distance the messages travel.

Interconnection networks can be either static or dynamic. Dynamic networks allow the path between two entities (two processors, or a processor and a memory) to change; static networks do not. Processor-to-memory connections usually employ dynamic interconnections, which can be blocking or nonblocking. Nonblocking interconnections allow new connections to be made in the presence of other simultaneous connections; blocking interconnections do not. Processor-to-processor message-passing interconnections are usually static and can employ any of several different topologies, as shown on the following slide.

Completely connected: expensive to build and harder to manage as processors are added.
Star: the hub can be a bottleneck, but the topology provides good connectivity.
Ring: any entity can communicate directly with its two neighbors.
Tree: a noncyclic structure with bottleneck potential.
Mesh: any entity can communicate directly with multiple neighbors.
Torus: a mesh network with wraparound connections.
Hypercube: a multidimensional mesh in which each dimension has two processors.
Bus: the simplest and most efficient topology, but bus contention can become a bottleneck as the number of entities grows large.
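
The cost and distance trade-offs among static topologies can be compared numerically. The sketch below covers four standard topologies (completely connected, star, ring, hypercube) and reports the number of links and the diameter, meaning the longest shortest path between two nodes. It assumes n counts all nodes (including the hub of a star) and that n is at least 3; the function name is illustrative.

```python
from typing import Dict


def topology_metrics(kind: str, n: int) -> Dict[str, int]:
    """Link count and diameter for an n-node static network (n >= 3)."""
    if n < 3:
        raise ValueError("this sketch assumes at least 3 nodes")
    if kind == "completely_connected":
        return {"links": n * (n - 1) // 2, "diameter": 1}
    if kind == "star":
        # One hub plus n - 1 leaves; leaf-to-leaf traffic crosses the hub.
        return {"links": n - 1, "diameter": 2}
    if kind == "ring":
        return {"links": n, "diameter": n // 2}
    if kind == "hypercube":
        dims = n.bit_length() - 1
        if 1 << dims != n:
            raise ValueError("a hypercube needs a power-of-two node count")
        return {"links": dims * n // 2, "diameter": dims}
    raise ValueError(f"unknown topology: {kind}")


if __name__ == "__main__":
    for kind in ("completely_connected", "star", "ring", "hypercube"):
        print(kind, topology_metrics(kind, 8))
```

For 8 nodes, the completely connected network needs 28 links for a diameter of 1, while the hypercube reaches a diameter of 3 with only 12 links.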

Switching networks use switches to dynamically alter routing. There are two types of switches: (1) crossbar switches and (2) 2x2 switches. Crossbar switches are switches that are either open or closed. Any entity can connect to any other entity when the switch between them closes (making a connection), so networks built from crossbar switches are fully connected. If one switch is needed per crosspoint, n entities require n^2 switches. A processor can connect to only one memory at a time, so there will be at most one closed switch per column.
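
The connection rule for a crossbar can be modeled in a few lines. This sketch treats each processor as one column of the crossbar and additionally assumes that a memory module serves only one processor at a time; that second rule is an assumption of the sketch, not something the slide states. The class name `Crossbar` is hypothetical.

```python
from typing import Dict


class Crossbar:
    """Toy model of an n x n processor-to-memory crossbar."""

    def __init__(self, n: int) -> None:
        self.n = n
        # Closed switches, stored as processor -> memory.
        self.closed: Dict[int, int] = {}

    def connect(self, processor: int, memory: int) -> bool:
        """Close the switch at (processor, memory) if the rules allow it."""
        if not (0 <= processor < self.n and 0 <= memory < self.n):
            raise ValueError("index out of range")
        if processor in self.closed:
            return False  # at most one closed switch per processor column
        if memory in self.closed.values():
            return False  # assumed: a memory serves one processor at a time
        self.closed[processor] = memory
        return True

    def disconnect(self, processor: int) -> None:
        """Open whichever switch this processor had closed, if any."""
        self.closed.pop(processor, None)


if __name__ == "__main__":
    xb = Crossbar(4)
    print(xb.connect(0, 2))  # True
    print(xb.connect(0, 3))  # False: processor 0 is already connected
    print(xb.connect(1, 3))  # True: a different pair never blocks
```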

A 2x2 switch has two inputs and two outputs and can route its inputs to different destinations. At any time, it can be in one of four states:
Through: the upper input is directed to the upper output and the lower input to the lower output.
Cross: the upper input is directed to the lower output and the lower input to the upper output.
Upper broadcast: the upper input is broadcast to both the upper and lower outputs.
Lower broadcast: the lower input is broadcast to both the upper and lower outputs.
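
The four states map directly onto a small function. The state names used as strings here are illustrative choices for this sketch.

```python
from typing import Tuple, TypeVar

T = TypeVar("T")


def switch_2x2(state: str, upper: T, lower: T) -> Tuple[T, T]:
    """Return (upper output, lower output) of a 2x2 switch in the given state."""
    if state == "through":
        return upper, lower
    if state == "cross":
        return lower, upper
    if state == "upper_broadcast":
        return upper, upper
    if state == "lower_broadcast":
        return lower, lower
    raise ValueError(f"unknown switch state: {state}")


if __name__ == "__main__":
    for state in ("through", "cross", "upper_broadcast", "lower_broadcast"):
        print(state, switch_2x2(state, "A", "B"))
```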

Multistage interconnection (or shuffle) networks are the most advanced class of switching networks. They are built from stages of 2x2 switches, with processors on one side and memories on the other. The interior switches configure themselves dynamically to allow a path from any processor to any memory. Depending on the configuration of the interior switches, blocking can occur (e.g., given that CPU 00 is connected to Memory 00 via switches 1A and 2A, CPU 10 cannot communicate with Memory 10). By adding more switches and more stages, a nonblocking network can be achieved. A network of x nodes requires log2 x stages with x/2 switches per stage.
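
Routing and blocking in a multistage network can be sketched with the textbook omega network and destination-tag routing. This is an assumption of the sketch: it places a perfect shuffle in front of every stage, so its wiring and line numbering need not match the switch labels (1A, 2A) in the slide's figure, and the specific blocked pairs can differ. The function names are illustrative.

```python
from typing import List


def omega_route(src: int, dst: int, stages: int) -> List[int]:
    """Return the output line used after each stage on the path src -> dst.

    The network has 2**stages inputs. Before every stage the lines are
    perfectly shuffled (address bits rotated left); each 2x2 switch then
    sends the message to its upper (0) or lower (1) output according to
    the next destination bit, most significant bit first.
    """
    mask = (1 << stages) - 1
    pos = src
    path = []
    for i in range(stages):
        pos = ((pos << 1) | (pos >> (stages - 1))) & mask  # perfect shuffle
        bit = (dst >> (stages - 1 - i)) & 1                # routing tag bit
        pos = (pos & ~1) | bit                             # pick switch output
        path.append(pos)
    return path


def blocks(path_a: List[int], path_b: List[int]) -> bool:
    """Two simultaneous paths conflict if they need the same line at any stage."""
    return any(a == b for a, b in zip(path_a, path_b))


if __name__ == "__main__":
    # 4 x 4 network: two stages of two switches each.
    p1 = omega_route(0b00, 0b00, 2)
    p2 = omega_route(0b10, 0b01, 2)
    p3 = omega_route(0b10, 0b10, 2)
    print(p1, p2, blocks(p1, p2))  # these two need the same stage-1 output
    print(p1, p3, blocks(p1, p3))  # these two can coexist in this wiring
```

The point of the sketch is that every source can reach every destination, yet some pairs of connections cannot be active at the same time, which is what makes the network blocking.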

There are advantages and disadvantages to each switching approach. Bus-based networks, while economical, can be bottlenecks. Parallel buses can alleviate bottlenecks, but are costly. Crossbar networks are nonblocking, but require n^2 switches to connect n entities. Omega networks are blocking networks, but exhibit less contention than bus-based networks. They are somewhat more economical than crossbar networks: n nodes need log2 n stages with n/2 switches per stage.
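
The cost difference follows directly from the two formulas. A short sketch, assuming n is a power of two so that log2 n is a whole number of stages:

```python
def crossbar_switches(n: int) -> int:
    """One switch per crosspoint: n^2."""
    return n * n


def omega_switches(n: int) -> int:
    """log2(n) stages with n/2 two-by-two switches per stage."""
    stages = n.bit_length() - 1
    if n < 2 or 1 << stages != n:
        raise ValueError("n must be a power of two, at least 2")
    return stages * (n // 2)


if __name__ == "__main__":
    for n in (8, 64, 1024):
        print(n, crossbar_switches(n), omega_switches(n))
```

For 1024 nodes the crossbar needs 1,048,576 switches, while the omega network needs 5,120.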

Parallel and Multiprocessor Architectures – Shared Memory
Recall that multiprocessors are classified by how memory is organized. Tightly coupled multiprocessor systems use the same memory and are also referred to as shared memory multiprocessors. The processors do not necessarily have to share the same block of physical memory: each processor can have its own memory, but it must share it with the other processors. Configurations such as these are called distributed shared memory multiprocessors.

Alternatively, each processor can have its own local cache memory, used together with a single global memory.

One type of distributed shared memory system is the shared virtual memory system. Each processor contains a cache, and the system has no primary memory. Data is accessed through cache directories maintained in each processor. Processors are connected via two unidirectional rings: the level-one ring can connect 8 to 32 processors, and the level-two ring can connect up to 34 level-one rings. For example, if Processor A references data at location X, Processor B places the data for address X on the ring with a destination address of Processor A.

Shared memory MIMD machines can also be divided into two categories based on how they access memory or synchronize their memory operations: uniform access and nonuniform access. In uniform memory access (UMA) systems, all memory accesses take the same amount of time. A switched-interconnection UMA system becomes very expensive as the number of processors grows. Bus-based UMA systems saturate when the bandwidth of the bus becomes insufficient. Multistage UMA systems run into wiring constraints and latency issues as the number of processors grows. Symmetric multiprocessor UMA systems need memory fast enough to support multiple concurrent accesses, or the whole system slows down. Because the interconnection network of a UMA system limits the number of processors, scalability is limited.

The other category of MIMD machines is the nonuniform memory access (NUMA) systems. NUMA systems can overcome the inherent UMA problems by providing each processor with its own memory. Although the memory is distributed, NUMA processors see it as one contiguous addressable space. A processor can, however, access its own memory much more quickly than it can access memory that is elsewhere. Not only does each processor have its own memory, it also has its own cache, a configuration that can lead to cache coherence problems. Cache coherence problems arise when main memory data is changed and the cached image is not. (We say that the cached value is stale.) To combat this problem, some NUMA machines are equipped with snoopy cache controllers that monitor all caches on the system. These systems are called cache coherent NUMA (CC-NUMA) architectures. A simpler approach is to ask the processor holding the stale value either to invalidate it or to update it with the new value.

When a processor’s cached value is updated concurrently with the update to memory, we say that the system uses a write-through cache update protocol. If the write-through with update protocol is used, a message containing the update is broadcast to all processors so that they may update their caches. If the write-through with invalidate protocol is used, a broadcast asks all processors to invalidate the stale cached value. Write-invalidate uses less bandwidth because it uses the network only the first time the data is updated, but retrieval of the fresh data takes longer. Write-update creates more message traffic, but all caches are kept current. Another approach is the write-back protocol that delays an update to main memory until the modified cache block is replaced and written to memory. At replacement time, the processor writing the cached value must obtain exclusive rights to the data. When rights are granted, all other cached copies are invalidated.
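
The bandwidth trade-off between the two write-through variants can be seen in a toy simulation. This is a sketch under simplifying assumptions: caches are unbounded dictionaries, a write under the invalidate protocol uses the network only when some other cache still holds a copy, and every uncached address reads as 0. The class and attribute names are hypothetical.

```python
from typing import Dict, List


class WriteThroughSystem:
    """Toy model of per-processor caches kept coherent by write-through."""

    def __init__(self, processors: int, protocol: str) -> None:
        if protocol not in ("update", "invalidate"):
            raise ValueError("protocol must be 'update' or 'invalidate'")
        self.protocol = protocol
        self.memory: Dict[str, int] = {}
        self.caches: List[Dict[str, int]] = [{} for _ in range(processors)]
        self.broadcasts = 0  # coherence messages sent on the network
        self.misses = 0      # reads that had to go to main memory

    def read(self, cpu: int, addr: str) -> int:
        cache = self.caches[cpu]
        if addr not in cache:
            self.misses += 1
            cache[addr] = self.memory.get(addr, 0)
        return cache[addr]

    def write(self, cpu: int, addr: str, value: int) -> None:
        # Write-through: cache and main memory are updated together.
        self.caches[cpu][addr] = value
        self.memory[addr] = value
        sharers = [c for i, c in enumerate(self.caches)
                   if i != cpu and addr in c]
        if self.protocol == "update":
            self.broadcasts += 1          # every write is broadcast
            for cache in sharers:
                cache[addr] = value
        elif sharers:
            self.broadcasts += 1          # only needed while copies exist
            for cache in sharers:
                del cache[addr]


def run(protocol: str):
    system = WriteThroughSystem(2, protocol)
    system.read(0, "X")
    system.read(1, "X")
    for value in (1, 2, 3):
        system.write(0, "X", value)
    latest = system.read(1, "X")
    return system.broadcasts, system.misses, latest


if __name__ == "__main__":
    print("update:    ", run("update"))
    print("invalidate:", run("invalidate"))
```

In this run, write-update sends three broadcasts and the second processor's final read hits in its cache; write-invalidate sends one broadcast, but the final read misses and must fetch the fresh value from memory.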

Parallel and Multiprocessor Architectures – Distributed Computing
Distributed computing is another form of multiprocessing. However, the term means different things to different people. In a sense, all multiprocessor systems are distributed systems because the processing load is distributed among processors that work collaboratively. What is usually meant by a distributed system is one in which the processing units are very loosely coupled: each processor is independent, with its own memory and cache, and the processors communicate via a high-speed network. Another name for this is cluster computing. Processing units connected via a bus, by contrast, are considered tightly coupled. Grid computing is an example of distributed computing: it makes use of heterogeneous CPUs and storage devices located in different domains to solve computation problems too large for any single supercomputer. The difference between grid computing and cluster computing is that grid computing can use resources in different domains, whereas cluster computing uses resources in the same domain only. Global computing is grid computing with resources provided by volunteers.

Parallel and Multiprocessor Architectures – Distributed Systems
For general-use computing, transparency is important: details about the distributed nature of the system should be hidden, and using remote system resources should require no more effort than using local ones. An example of this type of distributed system is a ubiquitous computing system (also called a pervasive computing system). These systems are totally embedded in the environment: simple to use, completely connected, mobile, invisible, and in the background. Remote procedure calls (RPCs) enable this transparency. RPCs use resources on remote machines by invoking procedures that reside and are executed on those machines. RPCs are employed by numerous distributed computing architectures, including the Common Object Request Broker Architecture (CORBA) and Java’s Remote Method Invocation (RMI).
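
The transparency that RPCs provide can be demonstrated with XML-RPC from Python's standard library. This is a stand-in chosen for brevity, not CORBA or RMI, and the sketch runs the "remote" server in a background thread on the local machine; the `add` procedure and `demo` function are hypothetical names.

```python
import threading
from xmlrpc.client import ServerProxy
from xmlrpc.server import SimpleXMLRPCServer


def add(a: int, b: int) -> int:
    """The procedure that lives and executes on the server side."""
    return a + b


def demo() -> int:
    # Port 0 asks the operating system for any free port.
    server = SimpleXMLRPCServer(("127.0.0.1", 0), logRequests=False)
    server.register_function(add, "add")
    port = server.server_address[1]
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        with ServerProxy(f"http://127.0.0.1:{port}/") as proxy:
            # Looks like a local call, but runs in the server thread.
            return proxy.add(2, 3)
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    print(demo())
```

The caller writes `proxy.add(2, 3)` exactly as it would write a local call; the marshalling, network transfer, and remote execution are hidden.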

Cloud computing is distributed computing to the extreme. It provides services over the Internet through a collection of loosely-coupled systems. In theory, the service consumer has no awareness of the hardware, or even its location. Your services and data may even be located on the same physical system as that of your business competitor. The hardware might even be located in another country. Security concerns are a major inhibiting factor for cloud computing.
