
1 Chapter 11: Alternative Architectures
Computing Machinery

2 Flynn's Taxonomy

3 Parallel Architectures Functional Diagrams

4 Pipeline Processing

5 PRAM (Parallel Random Access Machine)
EREW - Exclusive Read / Exclusive Write
CREW - Concurrent Read / Exclusive Write
ERCW - Exclusive Read / Concurrent Write (not used)
CRCW - Concurrent Read / Concurrent Write

6 Concurrent Read/Exclusive Write (CREW)
In this model, a particular address in shared memory can be read by multiple processors concurrently. However, only one processor at a time can write to a particular address in shared memory. Concurrent means that the order in which two operations occur does not affect the outcome (or state) of the system.

7 Concurrent Read/Concurrent Write (CRCW)
In the concurrent read, concurrent write PRAM model, multiple processors can read from or write to the same address in shared memory concurrently. A number of alternative interpretations of the concurrent write operation have been studied; the write may be resolved by policies such as RANDOM, PRIORITY, MAX, and SUM.
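As an illustration only (the policy names come from the slide, but the code itself is a hypothetical sketch), here is how those concurrent-write policies might resolve simultaneous writes to a single shared address:

    import random

    def crcw_write(writes, policy="PRIORITY"):
        # writes maps processor id -> value that processor tries to store
        if policy == "PRIORITY":                      # lowest-numbered processor wins
            return writes[min(writes)]
        if policy == "RANDOM":                        # an arbitrary writer wins
            return writes[random.choice(list(writes))]
        if policy == "MAX":                           # the largest value is stored
            return max(writes.values())
        if policy == "SUM":                           # the sum of all values is stored
            return sum(writes.values())
        raise ValueError(policy)

    # Processors 0, 1, 2 write 5, 9, 2 to the same address at the same time:
    print(crcw_write({0: 5, 1: 9, 2: 2}, "PRIORITY"))   # 5
    print(crcw_write({0: 5, 1: 9, 2: 2}, "SUM"))        # 16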

8 Parallel Architecture Performance Analysis
Speed - The speed of a computing system is the amount of work accomplished (e.g. number of instructions completed) in a specified time, so we normally state processing speed in instructions per second.
Speedup - The speedup of a multi-processor system is the ratio of the time required to solve a problem on a single-processor computer to the time required on the multi-processor computer. Since speedup is the ratio of two quantities with the same units (time), it is a unitless quantity.
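In symbols (notation added here, not on the original slide): speedup S = T(1) / T(n), where T(1) is the single-processor solution time and T(n) the time on n processors. As a hypothetical example, if a job takes 60 s on one processor and 10 s on eight processors, S = 60 / 10 = 6.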

9 Efficiency - The efficiency of an n-processor system is defined as the speedup of the multi-processor divided by the number of processors, n. Traditionally it has been assumed that efficiency cannot be greater than unity (1).
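Continuing the hypothetical numbers above: E = S / n = 6 / 8 = 0.75, i.e. the eight processors are, on average, doing useful work 75% of the time.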

10 Pipelining the Fetch-Execute Cycle
Seven operations comprise the Fetch-Execute cycle of the VSC. Some of these operations, such as the PC increment, do not necessarily require register transfers. Generally the fetch-execute cycle can be divided into four steps:
Fetch
Decode
Execute
Write (or "write-back")
Ref: Jon Stokes, "Pipelining: An Overview" (2004)

11 Non-Pipelined Fetch-Execute
One instruction completed in 4 ns
Ref: Jon Stokes, "Pipelining: An Overview"

12 A Four-Stage Pipeline
Principle of Locality - With high probability, the next instruction to be executed in a program is the one located at the next memory address after the current instruction.
Four instructions completed in 4 ns
Ref: Jon Stokes, "Pipelining: An Overview"
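The "four in 4 ns" figure describes the steady state, in which one instruction completes every cycle. Counting pipeline fill (a detail added here, not on the slide), n instructions on a k-stage pipeline with cycle time t take (k + n - 1) x t: with k = 4, t = 1 ns, and n = 4, that is 7 ns for the first four instructions, after which one completes every 1 ns.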

13 An Eight-Stage Pipeline
Eight instructions completed in 4 ns
Ref: Jon Stokes, "Pipelining: An Overview"

14 The Effect of Pipeline Stalls
[Figures: pipeline diagrams showing a two-cycle stall and a ten-cycle stall]
Ref: Jon Stokes, "Pipelining: An Overview"

15 Latency
Latency - The time required for an instruction to pass through the pipeline. In the ideal case for the eight-stage pipeline, we assumed that each stage of the four-stage pipeline could be divided into two stages that each took half the time to complete. In reality some stages will always require a full clock cycle. In addition, each stage in a pipeline must occupy the same amount of time, which means that the actual time for each stage will be the time of the slowest (longest-period) operation.
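A hypothetical worked example (numbers not from the slide): if the four stages of a pipeline take 1.0, 0.8, 1.2, and 1.0 ns of actual work, the clock period must be 1.2 ns (the slowest stage), so the latency is 4 x 1.2 ns = 4.8 ns even though the stage work itself sums to only 4.0 ns.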

16 Superscalar Computing and Pipelining
"throwing hardware at the problem"
Superscalar computing allows a microprocessor to increase the number of instructions per clock that it completes beyond 1 instruction/clock. Recall that 1 instruction/clock was the maximum theoretical instruction throughput for a pipelined processor, as described above. Because a superscalar machine can have multiple instructions in multiple write stages on each clock cycle, it can complete multiple instructions per cycle.
Ref: Jon Stokes, "Pipelining: An Overview"

17 Simultaneous Multithreading and Pipelining
One of the ways that the latest processors from Intel, IBM, and AMD address this problem is by including support for simultaneous multithreading (a.k.a. hyperthreading or "SMT") and then asking the programmer and/or compiler to make the code stream as explicitly parallel as possible. Only multithreaded applications can take full advantage of SMT, and multithreading can only be done by the party that designs the application. Multithreaded application design involves identifying portions of an application that can be split into discrete and independent tasks, and assigning those tasks to separate threads of execution. Hyperthreading and multi-core systems shift the burden of extracting parallelism from the processor to the programmer/compiler.
Ref: Jon Stokes, "Pipelining: An Overview"
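As a minimal sketch of that design style (the task and thread count are hypothetical, and CPython threads illustrate only the structure, not true hardware parallelism):

    from concurrent.futures import ThreadPoolExecutor

    def process_chunk(chunk):
        # A discrete, independent unit of work carved out of the application.
        return sum(x * x for x in chunk)

    data = list(range(1_000_000))
    chunks = [data[i::4] for i in range(4)]      # four independent slices

    # Assign each independent task to its own thread of execution.
    with ThreadPoolExecutor(max_workers=4) as pool:
        results = list(pool.map(process_chunk, chunks))

    print(sum(results))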

18 Simultaneous Multithreading (SMT)
The functional difference between conventional multiprocessing and SMT is that in the former each processor is a separate physical processor, while in the latter one set of arithmetic and logic units is shared among the logical processors within a single physical CPU core.

19 Scheduling Priority in SMT
When two instructions are in contention for a resource, the one from the higher-priority thread slot "wins" the contention. To prevent indefinite postponement, the SMT scheduling policy rotates the priority ranking periodically.
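A toy sketch of that rotation (the slot count and per-cycle rotation period are illustrative assumptions, not from the slide):

    def arbitrate(requests, priority_order):
        # Grant the contended resource to the highest-priority requesting slot.
        for slot in priority_order:
            if slot in requests:
                return slot
        return None

    priority = [0, 1]                            # slot 0 currently outranks slot 1
    for cycle in range(4):
        winner = arbitrate({0, 1}, priority)     # both slots want the resource
        print(f"cycle {cycle}: slot {winner} wins")
        priority = priority[1:] + priority[:1]   # rotate priority to avoid starvation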

20 Internal Organization of an SMT Architecture

21 Array Processor for Video Decoding

22 Shared-Memory Multiprocessor
For speed and efficiency, each processor of a shared-memory multiprocessor system keeps a cache of local memory, periodically updated from a common shared memory. The shared memory of a parallel processing system needs a management scheme that ensures all processors keep a current version of all data values (this is called memory coherence).

23 MESI Protocol
In the MESI protocol, a two-bit tag designates the status of each line in a processor's cache.
modified - The data value in this cache has been altered and is not currently held in the cache of any other processor. This status indicates that the line must be written back to shared memory before it is overwritten.
exclusive - The data value is held only by the current processor and has not been modified. When it is time to write over this value in the cache, it does not need to be written back to shared memory.
shared - Copies of this value may be stored in the caches of other processors.
invalid - This cache line is not valid. In order to validate these data, the cache must be updated from shared memory.
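A condensed sketch of the resulting state changes for one cache line (the event names, and the simplifying assumption that a read miss finds no other sharer, are mine rather than the slide's):

    MODIFIED, EXCLUSIVE, SHARED, INVALID = "M", "E", "S", "I"

    def next_state(state, event):
        # Events: what this cache does locally, or what it snoops on the bus.
        if event == "local_write":
            return MODIFIED                  # we now hold the only, dirty copy
        if event == "local_read_miss":
            return EXCLUSIVE                 # assume no other cache holds the line
        if event == "remote_read":
            return SHARED if state in (MODIFIED, EXCLUSIVE) else state
        if event == "remote_write":
            return INVALID                   # another processor's write stales our copy
        return state

    s = INVALID
    for e in ["local_read_miss", "remote_read", "local_write", "remote_write"]:
        s = next_state(s, e)
        print(e, "->", s)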

24 Multicore Data Coherence
The MOESI protocol is an extension of the MESI protocol that adds a new status called owned. A processor can write to a cache line it owns even if other processors are holding copies. When a processor modifies data it owns, it is responsible for updating the copies being held by other processors. The MOESI protocol is used in multicore CPUs in which processor-to-processor communication is much faster than access to shared memory.

25 4-D Hypercube Interconnections

26 Deep Neural Networks
(possibly multiple hidden layers)

27 The Future of Computer Architecture
the end of Moore's Law
It is believed that the ability to achieve process shrinks will continue only into the early 2010's; the limit will be reached relatively soon. Specifically, the quantum mechanical properties of electrons and atoms begin to dominate in the substrate when the feature size reaches around 50 nanometers. At sizes smaller than this, only a few electrons are needed to saturate the channel, and statistical fluctuations due to thermal effects will make the switching of transistors difficult to control.
