COMPUTER ARCHITECTURE
CS 6354 Fundamental Concepts: Computing Models and ISA Tradeoffs
Samira Khan, University of Virginia, Sep 10, 2018
The content and concept of this course are adapted from CMU ECE 740

AGENDA
Review from last lecture
Fundamental concepts
Computing models
ISA tradeoffs

FLYNN’S TAXONOMY OF COMPUTERS
Mike Flynn, “Very High-Speed Computing Systems,” Proc. of IEEE, 1966
SISD: Single instruction operates on single data element
SIMD: Single instruction operates on multiple data elements
  Array processor, vector processor
MISD: Multiple instructions operate on single data element
  Closest form: systolic array processor, streaming processor
MIMD: Multiple instructions operate on multiple data elements (multiple instruction streams)
  Multiprocessor, multithreaded processor

VECTOR PROCESSOR
-- Works (only) if parallelism is regular (data/SIMD parallelism)
   ++ Vector operations
   -- Very inefficient if parallelism is irregular
      -- How about searching for a key in a linked list? (see the sketch below)
-- Memory (bandwidth) can easily become a bottleneck, especially if
   1. compute/memory operation balance is not maintained
   2. data is not mapped appropriately to memory banks
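To make the regular vs. irregular distinction concrete, a minimal C sketch (my own example, not from the slides; the array and list names are made up): the array loop has fully independent iterations and maps naturally onto vector operations, while the linked-list search serializes on the pointer chase.

    #include <stddef.h>

    /* Regular (data/SIMD) parallelism: every iteration is independent,
       so chunks of the loop can execute as single vector operations. */
    void scale(float *a, const float *b, float s, int n) {
        for (int i = 0; i < n; i++)
            a[i] = s * b[i];
    }

    /* Irregular parallelism: each step needs the pointer loaded by the
       previous step, so a vector unit cannot speed up the traversal. */
    struct node { int key; struct node *next; };

    struct node *find(struct node *head, int key) {
        for (struct node *p = head; p != NULL; p = p->next)
            if (p->key == key)
                return p;
        return NULL;
    }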

VECTOR MACHINE EXAMPLE: CRAY-1
Russell, “The CRAY-1 computer system,” CACM 1978.
Scalar and vector modes
8 64-element vector registers
64 bits per element
16 memory banks
8 64-bit scalar registers
8 24-bit address registers

AMDAHL’S LAW: BOTTLENECK ANALYSIS
Speedup = time_without_enhancement / time_with_enhancement
Suppose an enhancement speeds up a fraction f of a task by a factor of S
  time_enhanced = time_original * (1 - f) + time_original * (f / S)
  Speedup_overall = 1 / ((1 - f) + f / S)
[Diagram: time_original split into a (1 - f) part and an f part; time_enhanced keeps the (1 - f) part and shrinks the f part to f/S]
Focus on bottlenecks with large f (and large S)
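A small worked example of the formula (a C sketch, not part of the lecture): with f = 0.8 and S = 10, the overall speedup is 1 / (0.2 + 0.08) ≈ 3.57, far below 10 because the unenhanced 20% dominates.

    #include <stdio.h>

    /* Overall speedup from Amdahl's Law: fraction f of the task is sped up by factor s. */
    static double amdahl_speedup(double f, double s) {
        return 1.0 / ((1.0 - f) + f / s);
    }

    int main(void) {
        printf("%.2f\n", amdahl_speedup(0.80, 10.0));  /* prints 3.57 */
        printf("%.2f\n", amdahl_speedup(0.80, 1e9));   /* ~5.00: bounded by 1/(1-f) */
        return 0;
    }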

FLYNN’S TAXONOMY OF COMPUTERS
Mike Flynn, “Very High-Speed Computing Systems,” Proc. of IEEE, 1966
SISD: Single instruction operates on single data element
SIMD: Single instruction operates on multiple data elements
  Array processor, vector processor
MISD: Multiple instructions operate on single data element
  Closest form: systolic array processor, streaming processor
MIMD: Multiple instructions operate on multiple data elements (multiple instruction streams)
  Multiprocessor, multithreaded processor

SYSTOLIC ARRAYS

WHY SYSTOLIC ARCHITECTURES?
Idea: Data flows from the computer memory in a rhythmic fashion, passing through many processing elements before it returns to memory
Similar to an assembly line of processing elements
  Different people work on the same car
  Many cars are assembled simultaneously
Why? Special-purpose accelerators/architectures need
  Simple, regular design (keep # unique parts small and regular)
  High concurrency -> high performance
  Balanced computation and I/O (memory) bandwidth

SYSTOLIC ARRAYS
Memory: heart
PEs: cells
Memory pulses data through cells
H. T. Kung, “Why Systolic Architectures?,” IEEE Computer 1982.

SYSTOLIC ARCHITECTURES
Basic principle: Replace one PE with a regular array of PEs and carefully orchestrate the flow of data between the PEs
  Balance computation and memory bandwidth
Differences from pipelining:
  These are individual PEs
  Array structure can be non-linear and multi-dimensional
  PE connections can be multidirectional (and of different speeds)
  PEs can have local memory and execute kernels (rather than a piece of the instruction)

SYSTOLIC COMPUTATION EXAMPLE
Convolution
  Used in filtering, pattern matching, correlation, polynomial evaluation, etc.
  Many image processing tasks

SYSTOLIC ARCHITECTURE FOR CONVOLUTION

[Diagram sequence: inputs x1, x2, x3 stream past PEs holding weights W3, W2, W1; the partial result y1 builds up step by step, from y1 = 0 to y1 = w1x1, then y1 = w1x1 + w2x2, then y1 = w1x1 + w2x2 + w3x3]

CONVOLUTION
y1 = w1x1 + w2x2 + w3x3
y2 = w1x2 + w2x3 + w3x4
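The same computation written as a plain sequential loop (a C sketch of my own, not from the slides; the systolic array instead streams x past one PE per weight so the multiply-accumulates overlap in time):

    /* 1D convolution: y[i] = w[0]*x[i] + w[1]*x[i+1] + ... + w[k-1]*x[i+k-1].
       A systolic array computes the same sums by pulsing x through k PEs,
       each holding one weight, so every PE does one multiply-accumulate per step. */
    void convolve(const float *x, int n, const float *w, int k, float *y) {
        for (int i = 0; i + k <= n; i++) {      /* one output per window position */
            float acc = 0.0f;
            for (int j = 0; j < k; j++)
                acc += w[j] * x[i + j];
            y[i] = acc;
        }
    }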

SYSTOLIC ARRAYS: PROS AND CONS
Advantage: Specialized (computation needs to fit PE organization/functions) -> improved efficiency, simple design, high concurrency/performance -> good at doing more with a smaller memory bandwidth requirement
Downside: Specialized -> not generally applicable because computation needs to fit the PE functions/organization

ISA VS. MICROARCHITECTURE
What is part of the ISA vs. the uarch?
  Gas pedal: interface for “acceleration”
  Internals of the engine: implements “acceleration”
  Add instruction vs. adder implementation
The implementation (uarch) can vary as long as it satisfies the specification (ISA)
  Bit-serial, ripple-carry, carry-lookahead adders (see the sketch below)
  x86 ISA has many implementations: 286, 386, 486, Pentium, Pentium Pro, …
The uarch usually changes faster than the ISA
  Few ISAs (x86, SPARC, MIPS, Alpha) but many uarchs
  Why?
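To make the adder example concrete, a C sketch (my illustration, not from the slides): both functions produce the same ISA-visible result for an 8-bit add, but one mimics a ripple-carry circuit bit by bit; which structure a designer picks is a uarch decision invisible at the ISA level.

    #include <stdint.h>

    /* ISA view: just "add". */
    uint8_t add_spec(uint8_t a, uint8_t b) {
        return (uint8_t)(a + b);
    }

    /* One possible implementation: ripple carry, one full adder per bit,
       each bit's carry feeding the next (slower, but the same result). */
    uint8_t add_ripple_carry(uint8_t a, uint8_t b) {
        uint8_t sum = 0, carry = 0;
        for (int i = 0; i < 8; i++) {
            uint8_t ai = (a >> i) & 1, bi = (b >> i) & 1;
            uint8_t s  = ai ^ bi ^ carry;                         /* sum bit   */
            carry      = (ai & bi) | (ai & carry) | (bi & carry); /* carry out */
            sum |= (uint8_t)(s << i);
        }
        return sum;
    }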

TRADEOFFS: SOUL OF COMPUTER ARCHITECTURE
ISA-level tradeoffs
Uarch-level tradeoffs
System and task-level tradeoffs
  How to divide the labor between hardware and software

ISA-LEVEL TRADEOFFS: SEMANTIC GAP
Where to place the ISA? Semantic gap
  Closer to high-level language (HLL) or closer to hardware control signals?
  -> Complex vs. simple instructions
  RISC vs. CISC vs. HLL machines
    FFT, QUICKSORT, POLY, FP instructions?
    VAX INDEX instruction (array access with bounds checking)
      e.g., accessing A[i][j][k] is one instruction with a bounds check

SEMANTIC GAP
[Diagram: software (high-level language) above, hardware (control signals) below; the ISA sits between them, and the semantic gap is the distance between the HLL and the ISA]

SEMANTIC GAP
[Diagram: same picture with two ISA placements marked; CISC places the ISA closer to the high-level language (small semantic gap), RISC places it closer to the hardware control signals (large semantic gap)]

ISA-LEVEL TRADEOFFS: SEMANTIC GAP
Where to place the ISA? Semantic gap
  Closer to high-level language (HLL) or closer to hardware control signals?
  -> Complex vs. simple instructions
  RISC vs. CISC vs. HLL machines
    FFT, QUICKSORT, POLY, FP instructions?
    VAX INDEX instruction (array access with bounds checking, see the sketch below)
Tradeoffs:
  Simple compiler, complex hardware vs. complex compiler, simple hardware
    Caveat: Translation (indirection) can change the tradeoff!
  Burden of backward compatibility
  Performance?
    Optimization opportunity: example of the VAX INDEX instruction: who (compiler vs. hardware) puts more effort into optimization?
  Instruction size, code size
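As a rough illustration of what a bounds-checked access such as A[i][j][k] entails, a hedged C sketch (the array bounds NI/NJ/NK and the function name are made up, and this is not the exact VAX INDEX semantics): on a simple ISA the compiler emits each compare, branch, multiply, and add itself, whereas VAX folds a checked indexing step into a single instruction.

    #include <stdlib.h>

    #define NI 4
    #define NJ 5
    #define NK 6

    /* Bounds-checked access to A[i][j][k]: a RISC compiler emits the compares,
       branches, multiplies, and adds below explicitly; VAX INDEX performs a
       subscript check plus index calculation as one instruction. */
    int checked_load(const int A[NI][NJ][NK], int i, int j, int k) {
        if (i < 0 || i >= NI || j < 0 || j >= NJ || k < 0 || k >= NK)
            abort();                      /* subscript out of range */
        return A[i][j][k];                /* offset = (i*NJ + j)*NK + k */
    }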

X86: SMALL SEMANTIC GAP: STRING OPERATIONS
REP MOVS DEST SRC
How many instructions does this take in Alpha?
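For comparison, a rough C sketch (my own, not from the slides) of what one REP MOVSB corresponds to on a load/store ISA such as Alpha: an explicit loop of load, store, pointer updates, count update, and a branch.

    #include <stddef.h>

    /* A single x86 "REP MOVSB" copies count bytes from src to dest by itself.
       On a load/store RISC ISA the compiler emits a loop; each iteration is
       roughly a load, a store, two pointer increments, a decrement, and a branch. */
    void rep_movsb(unsigned char *dest, const unsigned char *src, size_t count) {
        while (count != 0) {
            *dest = *src;   /* load + store */
            dest++;         /* advance destination pointer */
            src++;          /* advance source pointer */
            count--;        /* decrement count; branch back while nonzero */
        }
    }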

SMALL SEMANTIC GAP EXAMPLES IN VAX
FIND FIRST
  Find the first set bit in a bit field (see the sketch below)
  Helps OS resource allocation operations
SAVE CONTEXT, LOAD CONTEXT
  Special context-switching instructions
INSQUEUE, REMQUEUE
  Operations on a doubly linked list
INDEX
  Array access with bounds checking
STRING operations
  Compare strings, find substrings, …
Cyclic Redundancy Check instruction
EDITPC
  Implements editing functions to display fixed-format output
Digital Equipment Corp., “VAX11 780 Architecture Handbook,” 1977-78.
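To see how much work FIND FIRST absorbs, a C sketch (my example; the bit-by-bit loop is what a compiler for a simple ISA might emit, while VAX does this in one instruction and many modern ISAs offer a count-trailing-zeros primitive instead):

    /* Return the position of the lowest set bit in x, or -1 if x == 0. */
    int find_first_set(unsigned int x) {
        if (x == 0)
            return -1;
        int pos = 0;
        while ((x & 1u) == 0) {   /* test, shift, increment, branch per bit */
            x >>= 1;
            pos++;
        }
        return pos;
    }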

CISC vs. RISC
A loop of simple instructions:
  X: MOV
     ADD
     COMP
     JMP X
vs. a single complex instruction:
  REPMOVS
Which one is easy to optimize?

SMALL VERSUS LARGE SEMANTIC GAP
CISC vs. RISC
  Complex instruction set computer -> complex instructions
    Initially motivated by “not good enough” code generation
  Reduced instruction set computer -> simple instructions
    John Cocke, mid 1970s, IBM 801
      Goal: enable better compiler control and optimization
RISC motivated by
  Memory stalls (no work done in a complex instruction when there is a memory stall?)
    When is this correct?
  Simplifying the hardware -> lower cost, higher frequency
  Enabling the compiler to optimize the code better
    Find fine-grained parallelism to reduce stalls

SMALL VERSUS LARGE SEMANTIC GAP
John Cocke’s RISC (large semantic gap) concept: compiler generates control signals: open microcode
Advantages of a small semantic gap (complex instructions)
  + Denser encoding -> smaller code size -> saves off-chip bandwidth, better cache hit rate (better packing of instructions)
  + Simpler compiler
Disadvantages
  - Larger chunks of work -> compiler has less opportunity to optimize
  - More complex hardware -> translation to control signals and optimization needs to be done by hardware
Read Colwell et al., “Instruction Sets and Beyond: Computers, Complexity, and Controversy,” IEEE Computer 1985.

HOW HIGH OR LOW CAN YOU GO?
Very large semantic gap
  Each instruction specifies the complete set of control signals in the machine
  Compiler generates control signals
  Open microcode (John Cocke, 1970s)
    Gave way to optimizing compilers
Very small semantic gap
  ISA is (almost) the same as the high-level language
  Java machines, LISP machines, object-oriented machines, capability-based machines

EFFECT OF TRANSLATION
One can translate from one ISA to another ISA to change the semantic gap tradeoffs
Examples
  Intel’s and AMD’s x86 implementations translate x86 instructions into programmer-invisible micro-operations (simple instructions) in hardware
  Transmeta’s x86 implementations translated x86 instructions into “secret” VLIW instructions in software (code morphing software)
Think about the tradeoffs

TRANSLATION LAYER
[Diagram: left, the conventional picture with the ISA between software (high-level language) and hardware (control signals); right, an x86-style design where the exposed ISA (x86) is converted by a translation layer into an internal uISA (uops) that drives the control signals and is not exposed to the programmer]

COMPUTER ARCHITECTURE
CS 6354 Fundamental Concepts: Computing Models
Samira Khan, University of Virginia, Sep 10, 2018
The content and concept of this course are adapted from CMU ECE 740