Chapter 11: Alternative Architectures


Computing Machinery, Chapter 11: Alternative Architectures

Flynn's Taxonomy (figure): machines classified by single vs. multiple instruction streams and data streams, giving SISD, SIMD, MISD, and MIMD.

Parallel Architectures Functional Diagrams

Pipeline Processing

PRAM (Parallel Random Access Machine)
EREW - Exclusive Read/Exclusive Write
CREW - Concurrent Read/Exclusive Write
ERCW - Exclusive Read/Concurrent Write (not used)
CRCW - Concurrent Read/Concurrent Write

Concurrent Read/Exclusive Write (CREW)
In this model, a particular address in shared memory can be read by multiple processors concurrently; however, only one processor at a time can write to a particular address in shared memory. Concurrent means that the order in which two operations occur does not affect the outcome (or state) of the system.

Concurrent Read/Concurrent Write (CRCW)
In the concurrent-read, concurrent-write PRAM model, multiple processors can read from or write to the same address in shared memory concurrently. A number of alternative interpretations of the concurrent write operation have been studied; common combining rules include RANDOM, PRIORITY, MAX, and SUM.
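As an illustration (a sketch added here, not part of the original slides), the combining rules named above can be modeled as simple reduction functions over the values that contending processors attempt to write to one address:

```python
import random

def crcw_write(attempts, rule="SUM"):
    """Resolve one CRCW write step in which several processors write the same address.

    attempts: list of (processor_id, value) pairs targeting the same cell.
    rule: how conflicting writes are combined.
    """
    values = [value for _, value in attempts]
    if rule == "SUM":            # the cell receives the sum of all attempted writes
        return sum(values)
    if rule == "MAX":            # the cell receives the largest attempted value
        return max(values)
    if rule == "PRIORITY":       # the lowest-numbered processor wins
        return min(attempts)[1]
    if rule == "RANDOM":         # an arbitrary writer wins
        return random.choice(values)
    raise ValueError("unknown combining rule")

# Three processors try to write 5, 2, and 9 to the same address.
attempts = [(0, 5), (1, 2), (2, 9)]
print(crcw_write(attempts, "SUM"))       # 16
print(crcw_write(attempts, "PRIORITY"))  # 5
```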

Parallel Architecture Performance Analysis
Speed - The speed of a computing system is the amount of work accomplished (e.g., number of instructions completed) in a specified time, so processing speed is normally quoted in instructions per second.
Speedup - The speedup of a multi-processor system is the ratio of the time required to solve a problem on a single-processor computer to the time required on the multi-processor computer. Since speedup is the ratio of two quantities with the same units (time), it is a unitless quantity.

Efficiency - The efficiency of an n-processor multi-processor computer system is defined as the speedup of the multi-processor divided by the number of processors, n. Traditionally it has been assumed that efficiency cannot be greater than unity (1).
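In symbols (added here for clarity, using the definitions above): if T_1 is the time on a single processor and T_n the time on n processors, then

```latex
S(n) = \frac{T_1}{T_n}, \qquad
E(n) = \frac{S(n)}{n} = \frac{T_1}{n \, T_n}, \qquad 0 < E(n) \le 1 \ \text{(traditionally)}
```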

Pipelining the Fetch-Execute Cycle
Seven operations comprise the fetch-execute cycle of the VSC. Some of these operations, such as incrementing the PC, do not necessarily require register transfers. Generally the fetch-execute cycle can be divided into four steps: Fetch, Decode, Execute, Write (or "write-back").
Ref: Jon Stokes, "Pipelining: An Overview", Ars Technica, 9/19/2004. https://arstechnica.com/features/2004/09/pipelining-1/ https://arstechnica.com/features/2004/09/pipelining-2/

Non-Pipelined Fetch-Execute
One instruction completed in 4 ns.

A Four-Stage Pipeline
Principle of Locality - With high probability, the next instruction to be executed in a program is the one located at the next memory address after the current instruction.
Four instructions completed in 4 ns.

An Eight-Stage Pipeline
Eight instructions completed in 4 ns.
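These throughput figures follow from a simple model (a sketch added here, not from the original slides): with no stalls, a k-stage pipeline with cycle time t completes N instructions in (k + N - 1) * t.

```python
def pipeline_time_ns(n_instructions, n_stages, cycle_time_ns):
    """Ideal completion time: fill the pipeline once, then retire one
    instruction per cycle (no stalls, no hazards)."""
    return (n_stages + n_instructions - 1) * cycle_time_ns

print(pipeline_time_ns(4, 1, 4.0))   # non-pipelined, 4 ns/instruction: 16.0 ns
print(pipeline_time_ns(4, 4, 1.0))   # four-stage, 1 ns clock: 7.0 ns
print(pipeline_time_ns(8, 8, 0.5))   # eight-stage, 0.5 ns clock: 7.5 ns
# As the instruction count grows, throughput approaches one instruction per cycle:
# 4 instructions per 4 ns (four-stage) and 8 instructions per 4 ns (eight-stage).
```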

The Effect of Pipeline Stalls
(Figure panels: a two-cycle stall and a ten-cycle stall.)
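In the simplest model (an added sketch; it assumes a single stall that holds up every later instruction), the stalled cycles are simply added to the ideal completion time:

```python
def stalled_pipeline_time_ns(n_instructions, n_stages, cycle_time_ns, stall_cycles):
    # Ideal time plus dead cycles during which no instruction completes.
    return (n_stages + n_instructions - 1 + stall_cycles) * cycle_time_ns

print(stalled_pipeline_time_ns(8, 8, 0.5, 2))    # two-cycle stall: 8.5 ns
print(stalled_pipeline_time_ns(8, 8, 0.5, 10))   # ten-cycle stall: 12.5 ns
```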

Latency
Latency - The time required for an instruction to pass through the pipeline. In the ideal case for the eight-stage pipeline, we assumed that each stage of the four-stage pipeline could be divided into two stages, each taking half the time to complete. In reality some stages will always require a full clock cycle. In addition, every stage in a pipeline must occupy the same amount of time, so the time allotted to each stage is set by the slowest (longest-period) operation.
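A small numeric sketch of that constraint (the per-stage delays below are hypothetical, not from the slides):

```python
# Hypothetical per-stage delays in nanoseconds for a four-stage pipeline.
stage_delays_ns = {"fetch": 0.9, "decode": 0.6, "execute": 1.0, "write": 0.7}

clock_period_ns = max(stage_delays_ns.values())       # the slowest stage sets the clock
latency_ns = clock_period_ns * len(stage_delays_ns)   # time for one instruction to
                                                       # traverse the whole pipeline
print(clock_period_ns, latency_ns)                     # 1.0 ns clock, 4.0 ns latency
```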

Superscalar Computing and Pipelining ("throwing hardware at the problem")
Superscalar computing allows a microprocessor to complete more than one instruction per clock. Recall that 1 instruction/clock was the maximum theoretical instruction throughput for a pipelined processor as described above. Because a superscalar machine can have multiple instructions in the write stage on each clock cycle, it can complete multiple instructions per cycle.
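To make the claim concrete (an added sketch, not from the slide), an ideal w-wide superscalar pipeline retires up to w instructions per clock:

```python
import math

def superscalar_time_ns(n_instructions, n_stages, issue_width, cycle_time_ns):
    """Ideal completion time for a pipeline that can issue and retire
    `issue_width` instructions per cycle (no hazards, perfect scheduling)."""
    cycles = n_stages + math.ceil(n_instructions / issue_width) - 1
    return cycles * cycle_time_ns

# A scalar four-stage pipeline vs. a 2-wide superscalar version, 1 ns clock.
print(superscalar_time_ns(16, 4, 1, 1.0))   # 19.0 ns, IPC approaches 1
print(superscalar_time_ns(16, 4, 2, 1.0))   # 11.0 ns, IPC approaches 2
```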

Simultaneous Multithreading and Pipelining
One of the ways that recent processors from Intel, IBM, and AMD address this problem is by including support for simultaneous multithreading (a.k.a. hyperthreading or "SMT") and then asking the programmer and/or compiler to make the code stream as explicitly parallel as possible. Only multithreaded applications can take full advantage of SMT, and multithreading can only be done by the party that designs the application. Multithreaded application design involves identifying portions of an application that can be split into discrete, independent tasks and assigning those tasks to separate threads of execution. Hyperthreading and multi-core systems shift the burden of finding parallelism from the processor to the programmer and compiler designer.
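As a minimal illustration of splitting work into discrete, independent tasks assigned to separate threads (a sketch using Python's standard threading module; the summing task is hypothetical):

```python
import threading

def process_chunk(chunk, results, index):
    # A hypothetical independent task: sum one slice of the data.
    results[index] = sum(chunk)

data = list(range(1_000_000))
n_threads = 4
chunk_size = len(data) // n_threads
results = [0] * n_threads
threads = []

for i in range(n_threads):
    chunk = data[i * chunk_size:(i + 1) * chunk_size]
    t = threading.Thread(target=process_chunk, args=(chunk, results, i))
    threads.append(t)
    t.start()

for t in threads:
    t.join()

print(sum(results))   # the same total, computed by four independent threads
```

In CPython the global interpreter lock limits the speedup for pure-Python arithmetic, but the structure (independent tasks on separate threads) is exactly what SMT and multicore hardware can exploit.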

Simultaneous Multithreading (SMT)
The functional difference between conventional multiprocessing and SMT is that in the first case each logical processor is a separate physical processor, while in the second case one set of arithmetic and logic units is shared among the logical processors within a physical (multicore) CPU.

Scheduling Priority in SMT
When two instructions are in contention for a resource, the one from the higher-priority thread slot "wins" the contention. To prevent indefinite postponement, the SMT scheduling policy rotates the priority ranking periodically.
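One simple way to picture the rotating priority policy (an added sketch; real hardware policies are more involved):

```python
from itertools import cycle

def resolve_contention(contenders, priority_order):
    """Return the thread slot that wins the shared resource this cycle:
    the contender ranked highest in the current priority ordering."""
    for slot in priority_order:
        if slot in contenders:
            return slot
    return None

rotations = cycle([[0, 1], [1, 0]])   # the priority ranking rotates each cycle

for clock in range(4):
    order = next(rotations)
    winner = resolve_contention({0, 1}, order)   # both thread slots want the resource
    print(f"cycle {clock}: priority {order}, slot {winner} wins")
```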

Internal Organization of an SMT Architecture

Array Processor for Video Decoding

Shared-Memory Multiprocessor
For speed and efficiency, each processor of a shared-memory multiprocessor system keeps a local cache, periodically updated from a common shared memory. The shared memory of a parallel processing system needs a management scheme that ensures all processors see a current version of every data value (this is called memory coherence).

MESI Protocol
In the MESI protocol, a two-bit tag designates the status of each cache line holding shared-memory data.
modified - The data value in this cache has been altered and is not currently held in the cache of any other processor. The line must be written back to shared memory before it is overwritten by another word.
exclusive - The data value is held only by the current processor and has not been modified. When it is time to overwrite this value in the cache, it does not need to be written back to shared memory.
shared - Copies of this value may be stored in the caches of other processors.
invalid - This cache line is not valid. Before these data can be used, the line must be updated from shared memory.
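A compact way to see the four states and a few representative transitions (a simplified sketch, not the full protocol with all bus events):

```python
# Simplified MESI transitions for one cache line, keyed by (state, event).
# Events: 'local_write' (this processor writes the line),
#         'remote_read' / 'remote_write' (observed from another processor).
MESI_TRANSITIONS = {
    ("E", "local_write"):  "M",   # exclusive line modified locally
    ("S", "local_write"):  "M",   # other copies must be invalidated first
    ("M", "remote_read"):  "S",   # write back, then share the line
    ("E", "remote_read"):  "S",
    ("M", "remote_write"): "I",   # another processor takes the line
    ("E", "remote_write"): "I",
    ("S", "remote_write"): "I",
}

state = "E"
for event in ["local_write", "remote_read", "remote_write"]:
    state = MESI_TRANSITIONS.get((state, event), state)
    print(event, "->", state)     # M, then S, then I
```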

Multicore Data Coherence
The MOESI protocol is an extension of the MESI protocol that adds a new status called owned. A processor can write to a cache line it owns even if other processors are holding copies. When a processor modifies data it owns, it is responsible for updating the copies held by other processors. The MOESI protocol is used in multicore CPUs, in which processor-to-processor communication is much faster than access to shared memory.

4-D Hypercube Interconnections

Deep Neural Networks
(Figure: a network with possibly multiple hidden layers.)

The Future of Computer Architecture: the End of Moore's Law
The ability to achieve process shrinks cannot continue indefinitely (predictions once placed the limit in the early 2010's). Specifically, the quantum mechanical properties of electrons and other atoms begin to dominate in the substrate when the feature size reaches around 50 nanometers. At sizes smaller than this, only a few electrons are needed to saturate the channel, and statistical fluctuations due to thermal effects make the switching of transistors difficult to control.