Parallel and Multiprocessor Architectures

Parallel and Multiprocessor Architectures
If the application is implemented correctly to exploit parallelism:
- Higher throughput (speedup) can be realized
- Better fault tolerance can be realized
- A better cost-benefit ratio can be realized
It makes economic sense to improve processor speed and efficiency by distributing the computational load among several processors. Parallelism is key, but the application must be suited to parallel processing; if it is not, parallelization can be counterproductive.

Parallel and Multiprocessor Architectures
As mentioned, parallelism can yield speedup. Perfect speedup would mean that, if x processors were used, the processing would finish in 1/x of the time; that is, the runtime would decrease by a factor of x. "Perfect" speedup is rarely achieved because not all of the processors run at the same speed, so the slowest processor becomes the bottleneck. Also, for any portion of the task that must be processed serially, the other processors must wait until it is complete, and for that portion adding more processors brings no benefit.
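
The slide describes the serial-portion limit qualitatively; the sketch below is not from the slides and assumes Amdahl's law, the standard formula for this effect, with a made-up 10% serial fraction.

```python
# Minimal sketch, assuming Amdahl's law as the model for the serial-portion limit
# described above; the 10% serial fraction is an illustrative value, not from the slides.

def speedup(serial_fraction: float, x: int) -> float:
    """Speedup with x processors when serial_fraction of the work cannot be parallelized."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / x)

if __name__ == "__main__":
    for x in (2, 4, 8, 16, 1024):
        # With 10% serial work, speedup approaches 10 no matter how many processors are added.
        print(f"{x:5d} processors -> speedup {speedup(0.10, x):.2f}")
```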

Parallel and Multiprocessor Architectures
Recall superscalar architectures: architectures that allow processors to exploit instruction-level parallelism (ILP). Recall very long instruction word (VLIW) architectures: architectures designed around very long instructions in which the parallelism is set up up front, in contrast to the usual stream of individually sequenced instructions.
In comparing these two, let's recall pipelining. For every clock cycle, one small step is carried out, and the stages are overlapped:
S1. Fetch instruction. S2. Decode opcode. S3. Calculate effective address of operands. S4. Fetch operands. S5. Execute. S6. Store result.
Superpipelining occurs when a pipeline stage needs less than half a clock cycle to execute; a second internal clock, running at twice the speed of the regular clock, is added so that two stages can complete in a single external cycle instead of one. Superpipelining maps onto superscalar architectures, where multiple instructions can be executed at the same time in each cycle.
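
As a rough illustration of why overlapping the stages pays off, the sketch below uses the standard textbook timing model for a k-stage pipeline; the instruction count is an assumption, not from the slides.

```python
# Minimal sketch, assuming the standard timing model: the first instruction occupies
# all k stages, and each later instruction completes one cycle behind the previous one
# because the stages overlap. The counts of 100 instructions and 6 stages are illustrative.

def pipelined_cycles(n_instructions: int, n_stages: int) -> int:
    return n_stages + (n_instructions - 1)

def unpipelined_cycles(n_instructions: int, n_stages: int) -> int:
    return n_instructions * n_stages

if __name__ == "__main__":
    n, k = 100, 6  # six stages, matching S1-S6 above
    print("pipelined:  ", pipelined_cycles(n, k), "cycles")
    print("unpipelined:", unpipelined_cycles(n, k), "cycles")
```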

Parallel and Multiprocessor Architectures
Superpipelining is only one aspect of superscalar design. Superscalar architectures include multiple execution units such as specialized integer and floating-point adders and multipliers. A critical component of this architecture is the instruction fetch unit, which can simultaneously retrieve several instructions from memory. A decoding unit determines which of these instructions can be executed in parallel and combines them accordingly. This architecture also requires compilers that make optimum use of the hardware.

Parallel and Multiprocessor Architectures
Very long instruction word (VLIW) architectures differ from superscalar architectures in that superscalar relies on both the hardware and the compiler, while VLIW relies only on the compiler. The VLIW compiler, instead of a hardware decoding unit, packs independent instructions into one long instruction that is sent down the pipeline to the execution units. Some argue this is the better approach because the compiler can identify instruction dependencies more thoroughly. However, the compiler cannot see the run-time behavior of the code, so it must be conservative in its scheduling.
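
The sketch below is a toy illustration of the packing idea, not a real VLIW compiler: it greedily groups instructions into fixed-width bundles and starts a new bundle whenever an instruction conflicts with one already in the bundle. The instruction format, BUNDLE_WIDTH, and the example program are assumptions made for the example.

```python
# Illustrative sketch, not a real VLIW compiler: greedily pack (dest, srcs) instructions
# into fixed-width bundles, breaking the bundle on any register dependency.

BUNDLE_WIDTH = 4  # assumed number of issue slots per long instruction word

def pack_bundles(instructions):
    """instructions: list of (dest, srcs) tuples; returns a list of bundles."""
    bundles, current, written, read = [], [], set(), set()
    for dest, srcs in instructions:
        raw = any(s in written for s in srcs)       # reads a value produced in this bundle
        war_waw = dest in read or dest in written   # overwrites something this bundle uses
        if current and (len(current) == BUNDLE_WIDTH or raw or war_waw):
            bundles.append(current)
            current, written, read = [], set(), set()
        current.append((dest, srcs))
        written.add(dest)
        read.update(srcs)
    if current:
        bundles.append(current)
    return bundles

if __name__ == "__main__":
    prog = [("r1", ("r2", "r3")),   # r1 = r2 op r3
            ("r4", ("r5", "r6")),   # independent: same bundle
            ("r7", ("r1", "r4")),   # depends on r1 and r4: new bundle
            ("r8", ("r2", "r5"))]
    for i, bundle in enumerate(pack_bundles(prog)):
        print(f"bundle {i}: {bundle}")
```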

Parallel and Multiprocessor Architectures
Vector computers are processors that operate on entire vectors or matrices at once. These systems are often called supercomputers. Vector computers are highly pipelined so that arithmetic instructions can be overlapped. Vector processors fall into the SIMD category. Vector processors can be categorized according to how operands are accessed: register-register vector processors require all operands to be in registers, while memory-memory vector processors allow operands to be sent from memory directly to the arithmetic units.
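
As a software analogue of operating on whole vectors at once (not from the slides), the sketch below contrasts an element-by-element loop with a single whole-vector operation using NumPy; the array sizes are arbitrary.

```python
# Minimal sketch, using NumPy only as a familiar stand-in for vector operation:
# one "instruction" is applied to entire vectors instead of looping element by element.
import numpy as np

a = np.arange(100_000, dtype=np.float64)
b = np.ones(100_000, dtype=np.float64)

# Scalar style: one add per loop iteration.
c_scalar = np.empty_like(a)
for i in range(a.size):
    c_scalar[i] = a[i] + b[i]

# Vector style: a single whole-vector add, analogous to a vector instruction.
c_vector = a + b

assert np.array_equal(c_scalar, c_vector)
```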

Parallel and Multiprocessor Architectures
A disadvantage of register-register vector computers is that large vectors must be broken into fixed-length segments so they will fit into the register sets. Memory-memory vector computers have a longer startup time until the pipeline becomes full. In general, vector machines are efficient because there are fewer instructions to fetch, and corresponding pairs of values can be prefetched because the processor knows it will have a continuous stream of data.

Parallel and Multiprocessor Architectures
Parallel MIMD systems can communicate through shared memory or through an interconnection network. Interconnection networks are often classified according to their topology, routing strategy, and switching technique. Of these, the topology is a major determining factor in the overhead cost of message passing. The efficiency of the message passing is limited by:
- Bandwidth: the information-carrying capacity of the network
- Message latency: the time required for the first bit of a message to reach its destination
- Transport latency: the time the message spends in the network
- Overhead: message-processing activities in the transmitter (Tx) and receiver (Rx)
The objective of network design is to minimize the number of required messages and to minimize the distance the messages travel.
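
The slide lists these factors qualitatively; the sketch below uses a common first-order cost model (overhead + latency + size/bandwidth) that is not taken from the slides, with made-up numbers.

```python
# Minimal sketch, assuming a standard first-order model that is not in the slides:
# total message time = fixed overhead + latency for the first bit + time to stream
# the message bytes at the network bandwidth. All numbers are illustrative.

def message_time(size_bytes: float, overhead_s: float, latency_s: float,
                 bandwidth_bytes_per_s: float) -> float:
    return overhead_s + latency_s + size_bytes / bandwidth_bytes_per_s

if __name__ == "__main__":
    for size in (64, 4_096, 1_048_576):
        t = message_time(size, overhead_s=2e-6, latency_s=1e-6,
                         bandwidth_bytes_per_s=10e9)
        print(f"{size:>9} bytes -> {t * 1e6:7.2f} microseconds")
```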

Parallel and Multiprocessor Architectures
Interconnection networks can be either static or dynamic. Dynamic networks allow the path between two entities (two processors, or a processor and a memory) to change; static networks do not. Processor-to-memory connections usually employ dynamic interconnections, which can be blocking or nonblocking. Nonblocking interconnections allow new connections to be made in the presence of other simultaneous connections; blocking interconnections do not. Processor-to-processor message-passing interconnections are usually static, and can employ any of several different topologies, as shown on the following slide.

Parallel and Multiprocessor Architectures
The slide illustrates the common static topologies:
- Completely connected: expensive to build and harder to manage as processors are added
- Star: the hub can be a bottleneck, but it provides good connectivity
- Ring: any entity can directly communicate with its two neighbors
- Tree: a noncyclic structure that has bottleneck potential
- Mesh: any entity can directly communicate with multiple neighbors
- Torus: a mesh network with wraparound connections
- Hypercube: a multidimensional mesh in which each dimension has two processors
- Bus: the simplest and most efficient, but with bottleneck potential from bus contention as the number of entities grows large
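
For a rough sense of wiring cost, the sketch below computes standard link counts for a few of these topologies; the formulas and node counts are not from the slides, and n is assumed to be a power of two for the hypercube and a perfect square for the mesh.

```python
# Not from the slides: standard link counts for some of the topologies above, as a
# rough measure of wiring cost for n nodes.
import math

def links(topology: str, n: int) -> int:
    if topology == "completely connected":
        return n * (n - 1) // 2
    if topology == "star":
        return n - 1
    if topology == "ring":
        return n
    if topology == "hypercube":
        return (n // 2) * int(math.log2(n))
    if topology == "2d mesh":
        side = int(math.isqrt(n))
        return 2 * side * (side - 1)
    raise ValueError(topology)

if __name__ == "__main__":
    for t in ("completely connected", "star", "ring", "2d mesh", "hypercube"):
        print(f"{t:>21}: {links(t, 64):4d} links for 64 nodes")
```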

Parallel and Multiprocessor Architectures
Switching networks use switches to dynamically alter routing. There are two types of switches: (1) crossbar switches and (2) 2x2 switches. Crossbar switches are switches that either open or close: any entity can connect to any other entity when the switch between them closes (making a connection). Networks using the crossbar switch are fully connected. If only one switch is needed per crosspoint, n entities require n^2 switches. A processor can connect to only one memory at a time, so there will be at most one closed switch per column.
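
The sketch below, not from the slides, models an n x n crossbar as a matrix of open/closed switches and enforces the exclusive-connection rule described above; the 4x4 size and the connection requests are illustrative.

```python
# Minimal sketch, assuming a crossbar modeled as an n x n matrix of open/closed
# switches; closing switch (p, m) connects processor p to memory m, and a processor
# or memory that is already in use refuses a second connection.

class Crossbar:
    def __init__(self, n: int):
        self.n = n
        self.closed = [[False] * n for _ in range(n)]  # n^2 switches in total

    def connect(self, processor: int, memory: int) -> None:
        if any(self.closed[p][memory] for p in range(self.n)):
            raise RuntimeError(f"memory {memory} is already in use")
        if any(self.closed[processor][m] for m in range(self.n)):
            raise RuntimeError(f"processor {processor} is already connected")
        self.closed[processor][memory] = True

if __name__ == "__main__":
    xbar = Crossbar(4)        # 4 processors, 4 memories, 16 switches
    xbar.connect(0, 2)
    xbar.connect(1, 3)        # simultaneous, non-conflicting connections succeed
    try:
        xbar.connect(3, 2)    # memory 2 is busy, so this request is refused
    except RuntimeError as e:
        print("refused:", e)
```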

Parallel and Multiprocessor Architectures
A 2x2 switch can route its inputs to different destinations. It has two inputs and two outputs, and at any stage it can be in one of four states: (1) through, (2) cross, (3) upper broadcast, and (4) lower broadcast.
- Through state: the upper input is directed to the upper output and the lower input to the lower output
- Cross state: the upper input is directed to the lower output and the lower input to the upper output
- Upper broadcast state: the upper input is broadcast to both the upper and lower outputs
- Lower broadcast state: the lower input is broadcast to both the upper and lower outputs
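
The four states map directly to a small routing function; the sketch below (not from the slides) expresses each state as a mapping from the pair of inputs to the pair of outputs.

```python
# Minimal sketch: the four states of a 2x2 switch as a routing function from
# (upper_in, lower_in) to (upper_out, lower_out).
from enum import Enum

class SwitchState(Enum):
    THROUGH = "through"
    CROSS = "cross"
    UPPER_BROADCAST = "upper broadcast"
    LOWER_BROADCAST = "lower broadcast"

def route(state: SwitchState, upper_in, lower_in):
    if state is SwitchState.THROUGH:
        return upper_in, lower_in
    if state is SwitchState.CROSS:
        return lower_in, upper_in
    if state is SwitchState.UPPER_BROADCAST:
        return upper_in, upper_in
    if state is SwitchState.LOWER_BROADCAST:
        return lower_in, lower_in
    raise ValueError(state)

if __name__ == "__main__":
    for s in SwitchState:
        print(f"{s.value:16}: inputs (A, B) -> outputs {route(s, 'A', 'B')}")
```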

Parallel and Multiprocessor Architectures
Multistage interconnection (or shuffle) networks are the most advanced class of switching networks. They are built from 2x2 switches arranged in stages, with processors on one side and memories on the other. The interior switches dynamically configure themselves to allow a path from any processor to any memory. Depending on the configuration of the interior switches, blocking can occur (e.g., if CPU 00 is connected to Memory 00 via switches 1A and 2A, CPU 10 cannot communicate with Memory 10). By adding more switches and more stages, non-blocking behavior can be achieved. A network of x nodes requires log2 x stages with x/2 switches per stage.
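
One standard way to set the interior switches, not spelled out on the slide, is destination-tag routing: at stage i the switch looks at bit i of the destination address (most significant bit first) and routes to its upper output for a 0 and its lower output for a 1. The sketch below computes that route for an assumed 8x8 network.

```python
# Sketch of destination-tag routing in an omega-style multistage network.
import math

def omega_route(n_nodes: int, dest: int):
    stages = int(math.log2(n_nodes))
    path = []
    for i in range(stages):
        bit = (dest >> (stages - 1 - i)) & 1    # bit i of the destination, MSB first
        path.append("upper" if bit == 0 else "lower")
    return path

if __name__ == "__main__":
    n = 8  # 8 x 8 network: log2(8) = 3 stages, 8/2 = 4 switches per stage
    print("stages:", int(math.log2(n)), " switches per stage:", n // 2)
    print("route to memory 5 (binary 101):", omega_route(n, 5))
```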

Parallel and Multiprocessor Architectures
There are advantages and disadvantages to each switching approach. Bus-based networks, while economical, can be bottlenecks. Parallel buses can alleviate bottlenecks, but are costly. Crossbar networks are nonblocking, but require n^2 switches to connect n entities. Omega networks are blocking networks, but exhibit less contention than bus-based networks. They are somewhat more economical than crossbar networks, with n nodes needing log2 n stages of n/2 switches each.
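
The sketch below simply evaluates the switch counts quoted above, n^2 for a crossbar versus (n/2) * log2(n) 2x2 switches for an omega network, for a few assumed network sizes.

```python
# Switch-count comparison implied by the text above; the values of n are illustrative.
import math

def crossbar_switches(n: int) -> int:
    return n * n

def omega_switches(n: int) -> int:
    return (n // 2) * int(math.log2(n))

if __name__ == "__main__":
    for n in (8, 64, 1024):
        print(f"n={n:5d}  crossbar={crossbar_switches(n):8d}  omega={omega_switches(n):6d}")
```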

Parallel and Multiprocessor Architectures – Shared Memory
Recall that multiprocessors are classified by how their memory is organized. Tightly-coupled multiprocessor systems use the same memory and are also referred to as shared memory multiprocessors. The processors do not necessarily have to share the same block of physical memory: each processor can have its own memory, but it must share it with the other processors. Configurations such as these are called distributed shared memory multiprocessors.

Parallel and Multiprocessor Architectures – Shared Memory
Also, each processor can have its own local cache memory used with a single global memory.

Parallel and Multiprocessor Architectures – Shared Memory
One type of distributed shared memory system is called a shared virtual memory system. Each processor contains a cache, and the system has no primary memory; data is accessed through cache directories maintained in each processor. Processors are connected via two unidirectional rings: the level-one ring can connect 8 to 32 processors, and the level-two ring can connect up to 34 level-one rings. For example, if Processor A references data at location X, Processor B will place Address X's data on the ring with a destination address of Processor A.

Parallel and Multiprocessor Architectures – Shared Memory
Shared memory MIMD machines can also be divided into two categories based on how they access memory and synchronize their memory operations: the uniform access approach and the non-uniform access approach.
In Uniform Memory Access (UMA) systems, all memory accesses take the same amount of time.
- A switched-interconnection UMA system becomes very expensive as the number of processors grows.
- Bus-based UMA systems saturate when the bandwidth of the bus becomes insufficient.
- Multistage UMA systems run into wiring constraints and latency issues as the number of processors grows.
- A symmetric multiprocessor UMA system must be fast enough to support multiple concurrent accesses to memory, or it will slow down the whole system.
The interconnection network of a UMA system limits the number of processors, so scalability is limited.

Parallel and Multiprocessor Architectures – Shared Memory
The other category of MIMD machines is the Nonuniform Memory Access (NUMA) systems. NUMA systems can overcome the inherent UMA problems by providing each processor with its own memory. Although the memory is distributed, NUMA processors see the memory as one contiguous addressable space. Thus, a processor can access its own memory much more quickly than it can access memory that is elsewhere. Not only does each processor have its own memory, it also has its own cache, a configuration that can lead to cache coherence problems. Cache coherence problems arise when main memory data is changed and the cached image is not. (We say that the cached value is stale.) To combat this problem, some NUMA machines are equipped with snoopy cache controllers that monitor all caches on the system. These systems are called cache coherent NUMA (CC-NUMA) architectures. A simpler approach is to ask the processor holding the stale value either to invalidate the stale cached value or to update it with the new value.

Parallel and Multiprocessor Architectures – Shared Memory
When a processor's cached value is updated concurrently with the update to memory, we say that the system uses a write-through cache update protocol. If the write-through with update protocol is used, a message containing the update is broadcast to all processors so that they may update their caches. If the write-through with invalidate protocol is used, a broadcast asks all processors to invalidate the stale cached value. Write-invalidate uses less bandwidth because it uses the network only the first time the data is updated, but retrieval of the fresh data takes longer. Write-update creates more message traffic, but all caches are kept current. Another approach is the write-back protocol, which delays the update to main memory until the modified cache block is replaced and written back to memory. At replacement time, the processor writing the cached value must obtain exclusive rights to the data; when rights are granted, all other cached copies are invalidated.
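
To make the write-through-with-invalidate idea concrete, here is a toy sketch, not from the slides and far simpler than a real snooping protocol: a write updates memory and invalidates every other processor's cached copy, so their next read fetches the fresh value.

```python
# Minimal sketch of write-through with invalidate. Caches are plain dictionaries and
# the "broadcast" is a loop over the other processors' caches.

class System:
    def __init__(self, n_processors: int):
        self.memory = {}
        self.caches = [dict() for _ in range(n_processors)]

    def read(self, p: int, addr: int):
        if addr not in self.caches[p]:              # miss: fetch from memory
            self.caches[p][addr] = self.memory.get(addr, 0)
        return self.caches[p][addr]

    def write(self, p: int, addr: int, value) -> None:
        self.caches[p][addr] = value
        self.memory[addr] = value                   # write-through to memory
        for q, cache in enumerate(self.caches):     # broadcast invalidation
            if q != p:
                cache.pop(addr, None)

if __name__ == "__main__":
    system = System(2)
    system.write(0, 100, 42)
    print(system.read(1, 100))   # 42: processor 1 misses and reads the fresh value
    system.write(1, 100, 7)
    print(system.read(0, 100))   # 7: processor 0's stale copy was invalidated
```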

Parallel and Multiprocessor Architectures – Distributed Computing
Distributed computing is another form of multiprocessing, although the term means different things to different people. In a sense, all multiprocessor systems are distributed systems because the processing load is distributed among processors that work collaboratively. What is really meant by "distributed system" is that the processing units are very loosely coupled: each processor is independent, with its own memory and cache, and the processors communicate via a high-speed network. Another name for this is cluster computing. (Processing units connected via a bus are considered tightly coupled.) Grid computing is an example of distributed computing that makes use of heterogeneous CPUs and storage devices located in different domains to solve computational problems too large for any single supercomputer. The difference between grid computing and cluster computing is that grid computing can use resources in different domains, whereas cluster computing uses resources only within the same domain. Global computing is grid computing with resources provided by volunteers.

Parallel and Multiprocessor Architectures – Distributed Systems
For general-purpose computing, transparency is important: the details of the distributed nature of the system should be hidden, and using remote system resources should require no more effort than using local ones. An example of this type of distributed system is the ubiquitous computing system (or pervasive computing system): totally embedded in the environment, simple to use, completely connected, mobile, and invisible in the background. Remote procedure calls (RPCs) enable this transparency. RPCs use resources on remote machines by invoking procedures that reside and are executed on the remote machines. RPCs are employed by numerous distributed computing architectures, including the Common Object Request Broker Architecture (CORBA) and Java's Remote Method Invocation (RMI).
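
As a small illustration of the RPC idea: the slide names CORBA and Java RMI, but the hedged sketch below instead uses Python's standard-library XML-RPC modules simply because they fit in a few lines. The host, port, and add() procedure are assumptions made for the example.

```python
# Server side: registers a procedure that remote clients can invoke.
from xmlrpc.server import SimpleXMLRPCServer

def add(a, b):
    return a + b                       # runs on the remote machine

server = SimpleXMLRPCServer(("localhost", 8000), allow_none=True)
server.register_function(add, "add")
server.serve_forever()                 # blocks; run the client in another process
```

```python
# Client side: the call looks like a local function call, which is the transparency
# the slide describes.
import xmlrpc.client

proxy = xmlrpc.client.ServerProxy("http://localhost:8000/")
print(proxy.add(2, 3))                 # executed on the server, result returned here
```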

Parallel and Multiprocessor Architectures – Distributed Systems
Cloud computing is distributed computing taken to the extreme. It provides services over the Internet through a collection of loosely-coupled systems. In theory, the service consumer has no awareness of the hardware or even its location. Your services and data may even be located on the same physical system as that of your business competitor, and the hardware might be located in another country. Security concerns are a major inhibiting factor for cloud computing.