Parallel and Multiprocessor Architectures


Parallel and Multiprocessor Architectures
If an application is implemented correctly to exploit parallelism, it can realize:
- Higher throughput (speedup)
- Better fault tolerance
- A better cost-benefit ratio
It makes economic sense to improve processor speed and efficiency by distributing the computational load among several processors. Parallelism is key, but the application must be suited to parallel processing; if it is not, parallelization can be counterproductive.

Parallel and Multiprocessor Architectures
As mentioned, parallelism can yield speedup. The perfect speedup would be that if x processors were used, the processing would finish in 1/x of the time, i.e., the runtime would decrease by a factor of x.
"Perfect" speedup is not realized in practice because not all of the processors run at the same speed; the slowest processor becomes the bottleneck. In addition, for any portion of the task that must be processed serially, the other processors must wait until it completes, and adding more processors brings no benefit for that portion.
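This ceiling on speedup is the standard Amdahl's law formulation of the point above; here is a minimal Python sketch, where the 10% serial fraction and the processor counts are purely illustrative:

```python
def amdahl_speedup(serial_fraction: float, processors: int) -> float:
    """Speedup predicted by Amdahl's law for a given serial fraction."""
    parallel_fraction = 1.0 - serial_fraction
    return 1.0 / (serial_fraction + parallel_fraction / processors)

# Even a small serial portion caps the achievable speedup:
for p in (2, 4, 16, 1024):
    print(p, round(amdahl_speedup(0.10, p), 2))
# With 10% serial work the speedup approaches 10x no matter how many
# processors are added.
```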

Parallel and Multiprocessor Architectures
Recall superscalar architectures: architectures that allow processors to exploit instruction-level parallelism (ILP).
Recall Very Long Instruction Word (VLIW) architectures: architectures in which the parallelism is set up in advance in very long instructions, rather than in the usual sequence of individual instructions.
In comparing these two, let's recall pipelining. For every clock cycle, one small step is carried out, and the stages are overlapped:
S1. Fetch instruction.
S2. Decode opcode.
S3. Calculate effective address of operands.
S4. Fetch operands.
S5. Execute.
S6. Store result.
Superpipelining is used when a pipeline stage can be executed in less than half a clock cycle: a second internal clock, running at twice the speed of the regular clock, allows two stages (rather than one) to complete in a single external cycle.
Superpipelining maps naturally onto superscalar architectures, in which multiple instructions can be executing at the same time in each cycle.
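A back-of-the-envelope timing model of these ideas (the instruction count, and the assumption that every stage splits cleanly in half for superpipelining, are illustrative and not from the slides):

```python
# k-stage pipeline executing n instructions: once the stages overlap,
# one instruction completes per cycle after the pipeline fills.
def pipeline_cycles(n_instructions: int, stages: int) -> int:
    return stages + (n_instructions - 1)

n, k = 1000, 6
unpipelined = n * k                     # every instruction runs all k steps serially
pipelined = pipeline_cycles(n, k)       # 1005 cycles
# Superpipelining: each stage is split in two and clocked at twice the
# rate, so in units of the original (external) clock:
superpipelined = (2 * k + (n - 1)) / 2  # ~505.5 cycles
print(unpipelined, pipelined, superpipelined)
```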

Parallel and Multiprocessor Architectures
Superpipelining is only one aspect of superscalar design. Superscalar architectures also include multiple execution units, such as specialized integer and floating-point adders and multipliers. A critical component of this architecture is the instruction fetch unit, which can simultaneously retrieve several instructions from memory. A decoding unit determines which of these instructions can be executed in parallel and combines them accordingly. This architecture also requires compilers that make optimum use of the hardware.
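A toy sketch of the decoding unit's job, using a made-up instruction form (one destination register, a tuple of source registers); real hazard detection is done in hardware and is considerably more involved:

```python
from dataclasses import dataclass

@dataclass
class Instr:
    dest: str
    srcs: tuple

def can_issue_together(group, candidate):
    """True if candidate reads nothing written by, and writes nothing read
    or written by, any instruction already selected for this cycle."""
    for instr in group:
        if instr.dest in candidate.srcs:   # read-after-write dependence
            return False
        if candidate.dest == instr.dest:   # write-after-write dependence
            return False
        if candidate.dest in instr.srcs:   # write-after-read dependence
            return False
    return True

fetched = [Instr("r1", ("r2", "r3")),
           Instr("r4", ("r5", "r6")),   # independent of the first
           Instr("r7", ("r1", "r4"))]   # depends on both, must wait

group = []
for instr in fetched:
    if can_issue_together(group, instr):
        group.append(instr)
    else:
        break  # in-order issue: stop at the first dependent instruction
print(len(group), "instructions issue in the first cycle")  # 2
```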

Parallel and Multiprocessor Architectures
Very long instruction word (VLIW) architectures differ from superscalar architectures in that superscalar relies on both the hardware and the compiler, while VLIW relies only on the compiler. The VLIW compiler, instead of a hardware decoding unit, packs independent instructions into one long instruction that is sent down the pipeline to the execution units. Some argue this is the better approach because the compiler can identify instruction dependencies more thoroughly. However, the compiler cannot see the run-time behavior of the code, so it must be conservative in its scheduling.
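A toy sketch of what such a compiler does statically, assuming an issue width of three and a caller-supplied dependence test standing in for the compiler's analysis (the instruction strings are illustrative, not a real ISA):

```python
ISSUE_WIDTH = 3  # assumed number of operation slots per long instruction

def pack_bundles(instrs, independent):
    """Greedy bundling: keep adding operations to the current long
    instruction while they are independent of everything already in it;
    pad unused slots with no-ops."""
    bundles, current = [], []
    for instr in instrs:
        if len(current) < ISSUE_WIDTH and all(independent(p, instr) for p in current):
            current.append(instr)
        else:
            bundles.append(current + ["nop"] * (ISSUE_WIDTH - len(current)))
            current = [instr]
    if current:
        bundles.append(current + ["nop"] * (ISSUE_WIDTH - len(current)))
    return bundles

instrs = ["add r1,r2,r3", "mul r4,r5,r6", "sub r7,r1,r4", "ld r8,0(r7)"]
deps = {("add r1,r2,r3", "sub r7,r1,r4"), ("mul r4,r5,r6", "sub r7,r1,r4"),
        ("sub r7,r1,r4", "ld r8,0(r7)")}
print(pack_bundles(instrs, lambda a, b: (a, b) not in deps))
# [['add r1,r2,r3', 'mul r4,r5,r6', 'nop'], ['sub r7,r1,r4', 'nop', 'nop'],
#  ['ld r8,0(r7)', 'nop', 'nop']]
```

If a dependence is only knowable at run time, the test must answer "not independent", which is the conservative scheduling the slide mentions.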

Parallel and Multiprocessor Architectures
Vector computers are processors that operate on entire vectors or matrices at once. These systems are often called supercomputers. Vector computers are highly pipelined so that arithmetic operations can be overlapped. Vector processors fall into the SIMD category.
Vector processors can be categorized according to how operands are accessed:
- Register-register vector processors require all operands to be in registers.
- Memory-memory vector processors allow operands to be sent from memory directly to the arithmetic units.
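For the programming-model contrast, a small illustrative Python/NumPy example: the vector form expresses the whole element-wise operation at once, where the scalar loop issues one add per element.

```python
import numpy as np

a = np.arange(1_000, dtype=np.float64)
b = np.arange(1_000, dtype=np.float64)

# Scalar style: one fetched and decoded add per element.
c_scalar = [a[i] + b[i] for i in range(len(a))]

# Vector style: a single high-level operation over the whole vector,
# analogous to what a vector instruction expresses in hardware.
c_vector = a + b
```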

Parallel and Multiprocessor Architectures
A disadvantage of register-register vector computers is that long vectors must be broken into fixed-length segments so they will fit into the register sets. Memory-memory vector computers have a longer startup time before the pipeline becomes full.
In general, vector machines are efficient because there are fewer instructions to fetch, and corresponding pairs of values can be prefetched because the processor knows it will have a continuous stream of data.
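A minimal sketch of that segmentation (often called strip-mining); the register length of 64 is an assumption and varies by machine:

```python
VECTOR_REG_LEN = 64  # assumed vector register length

def strip_mined_add(a, b):
    """Process a long vector in chunks no larger than the register length."""
    result = []
    for start in range(0, len(a), VECTOR_REG_LEN):
        # Each chunk would be one vector load / add / store on real hardware.
        chunk_a = a[start:start + VECTOR_REG_LEN]
        chunk_b = b[start:start + VECTOR_REG_LEN]
        result.extend(x + y for x, y in zip(chunk_a, chunk_b))
    return result

print(strip_mined_add(list(range(200)), list(range(200)))[:5])  # [0, 2, 4, 6, 8]
```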

Parallel and Multiprocessor Architectures
Parallel MIMD systems can communicate through shared memory or through an interconnection network. Interconnection networks are often classified according to their topology, routing strategy, and switching technique. Of these, the topology is a major determining factor in the overhead cost of message passing.
The efficiency of message passing is limited by:
- Bandwidth: the information-carrying capacity of the network
- Message latency: the time required for the first bit of a message to reach its destination
- Transport latency: the time the message spends in the network
- Overhead: message-processing activities in the sender and receiver
The objective of network design is to minimize both the number of required messages and the distance the messages must travel.
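A rough, illustrative cost model that combines these factors for a single message (all of the numbers below are made up):

```python
def message_time(bytes_sent, bandwidth_bytes_per_s, first_bit_latency_s, overhead_s):
    """Estimated time = sender/receiver overhead + latency for the first bit
    + time to push the remaining bits through the available bandwidth."""
    return overhead_s + first_bit_latency_s + bytes_sent / bandwidth_bytes_per_s

# 1 KiB message over a 1 GB/s link with 1 us first-bit latency and
# 2 us of software overhead:
print(message_time(1024, 1e9, 1e-6, 2e-6))  # ~4.0e-6 seconds
```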

Parallel and Multiprocessor Architectures
Interconnection networks can be either static or dynamic:
- Dynamic networks allow the path between two entities (two processors, or a processor and a memory) to change from one connection to the next.
- Static networks do not allow paths to change.
Processor-to-memory connections usually employ dynamic interconnections. These can be blocking or nonblocking: nonblocking interconnections allow new connections to be made in the presence of other simultaneous connections, while blocking interconnections do not.
Processor-to-processor message-passing interconnections are usually static, and can employ any of several different topologies, as summarized on the following slide.

Parallel and Multiprocessor Architectures
Static processor-to-processor topologies include:
- Completely connected: every entity links directly to every other; expensive to build and harder to manage as processors are added.
- Star: the hub provides good connectivity, but can itself become a bottleneck.
- Ring: any entity can directly communicate with its two neighbors.
- Tree: a noncyclic structure that has bottleneck potential near the root.
- Mesh: any entity can directly communicate with multiple neighbors.
- Torus: a mesh network with wraparound connections.
- Hypercube: a multidimensional mesh in which each dimension has two processors.
- Bus: the simplest and most efficient for small systems, but bus contention creates a bottleneck as the number of entities grows large.
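A small sketch for one of these topologies: in an n-dimensional hypercube, node labels are n-bit numbers, and each node's neighbors are exactly the labels that differ from it in one bit.

```python
def hypercube_neighbors(node: int, dimensions: int):
    """Neighbors of a node in a hypercube: flip each address bit in turn."""
    return [node ^ (1 << d) for d in range(dimensions)]

# 3-dimensional hypercube (8 nodes): node 5 = 0b101
print([bin(n) for n in hypercube_neighbors(0b101, 3)])
# ['0b100', '0b111', '0b1']  -> nodes 100, 111, and 001
```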

Parallel and Multiprocessor Architectures
Switching networks use switches to dynamically alter routing. There are two types of switches: (1) crossbar switches and (2) 2×2 switches.
Crossbar switches are switches that are either open or closed. Any entity can connect to any other entity by closing the switch at their crosspoint. Networks using crossbar switches are fully connected. If one switch is needed per crosspoint, connecting n entities requires n² switches.
A processor can connect to only one memory at a time, so there will be at most one closed switch per column.
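A toy model of those rules (the size n is arbitrary): n² crosspoint switches, with a request refused when the target memory's column already has a closed switch. Being refused here is a resource conflict over the memory module, not network blocking.

```python
class Crossbar:
    def __init__(self, n):
        self.n = n
        self.closed = [[False] * n for _ in range(n)]  # n^2 crosspoint switches

    def connect(self, processor, memory):
        # At most one closed switch per column: one processor per memory.
        if any(self.closed[p][memory] for p in range(self.n)):
            return False  # memory module already in use by another processor
        self.closed[processor][memory] = True
        return True

xbar = Crossbar(4)
print(xbar.connect(0, 2))  # True
print(xbar.connect(1, 2))  # False: column 2 already has a closed switch
print(xbar.connect(1, 3))  # True
```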

Parallel and Multiprocessor Architectures
A 2×2 switch can route its inputs to different destinations. It has two inputs and two outputs, and at any stage it can be in one of four states: (1) through, (2) cross, (3) upper broadcast, and (4) lower broadcast.
- Through: the upper input is directed to the upper output and the lower input to the lower output.
- Cross: the upper input is directed to the lower output and the lower input to the upper output.
- Upper broadcast: the upper input is broadcast to both the upper and lower outputs.
- Lower broadcast: the lower input is broadcast to both the upper and lower outputs.
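The four states can be modeled in a few lines:

```python
def switch_2x2(upper_in, lower_in, state):
    """Return (upper_out, lower_out) for the four 2x2 switch states."""
    if state == "through":
        return upper_in, lower_in
    if state == "cross":
        return lower_in, upper_in
    if state == "upper_broadcast":
        return upper_in, upper_in
    if state == "lower_broadcast":
        return lower_in, lower_in
    raise ValueError(f"unknown state: {state}")

print(switch_2x2("A", "B", "cross"))  # ('B', 'A')
```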

Parallel and Multiprocessor Architectures
Multistage interconnection (or shuffle) networks are the most advanced class of switching networks. They are built from 2×2 switches arranged in stages, with processors on one side and memories on the other. The interior switches are configured dynamically to create a path from any processor to any memory.
Depending on the configuration of the interior switches, blocking can occur (for example, if CPU 00 is connected to memory 00 via switches 1A and 2A, CPU 10 cannot simultaneously communicate with memory 10). By adding more switches and more stages, nonblocking behavior can be achieved.
A network of x nodes requires log2 x stages with x/2 switches per stage.
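A sketch of the standard destination-tag routing used in omega-style multistage networks, assuming the usual convention that bit i of the destination address (most significant first) selects the switch output at stage i:

```python
import math

def omega_route(dest: int, n: int):
    """Per-stage switch settings for a message headed to node `dest`
    in an n-node omega network (log2(n) stages, n/2 switches per stage)."""
    stages = int(math.log2(n))
    bits = format(dest, f"0{stages}b")
    return ["upper" if b == "0" else "lower" for b in bits]

# 8-node network: 3 stages, 4 switches per stage.
print(omega_route(0b110, 8))  # ['lower', 'lower', 'upper']
```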

Parallel and Multiprocessor Architectures
There are advantages and disadvantages to each switching approach:
- Bus-based networks, while economical, can be bottlenecks. Parallel buses can alleviate bottlenecks, but are costly.
- Crossbar networks are nonblocking, but require n² switches to connect n entities.
- Omega networks are blocking networks, but exhibit less contention than bus-based networks. They are somewhat more economical than crossbar networks, with n nodes needing log2 n stages of n/2 switches each.
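A quick comparison of the switch counts implied by these figures:

```python
import math

# Crossbar: n^2 switches; omega: (n/2) * log2(n) 2x2 switches.
for n in (8, 64, 1024):
    crossbar = n * n
    omega = (n // 2) * int(math.log2(n))
    print(f"n={n:5d}  crossbar={crossbar:8d}  omega={omega:6d}")
# n=    8  crossbar=      64  omega=    12
# n=   64  crossbar=    4096  omega=   192
# n= 1024  crossbar= 1048576  omega=  5120
```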