COMPUTER ARCHITECTURE CS 6354
Fundamental Concepts: Computing Models and ISA Tradeoffs
Samira Khan, University of Virginia, Sep 10, 2018
The content and concept of this course are adapted from CMU ECE 740
AGENDA Review from last lecture Fundamental concepts Computing models ISA tradeoffs
FLYNN'S TAXONOMY OF COMPUTERS
Mike Flynn, "Very High-Speed Computing Systems," Proc. of the IEEE, 1966
SISD: single instruction operates on a single data element
SIMD: single instruction operates on multiple data elements (array processor, vector processor)
MISD: multiple instructions operate on a single data element (closest forms: systolic array processor, streaming processor)
MIMD: multiple instructions operate on multiple data elements, i.e., multiple instruction streams (multiprocessor, multithreaded processor)
VECTOR PROCESSOR
-- Works (only) if parallelism is regular (data/SIMD parallelism): ++ vector operations, but -- very inefficient if parallelism is irregular (how about searching for a key in a linked list?)
-- Memory (bandwidth) can easily become a bottleneck, especially if 1. the compute/memory-operation balance is not maintained, or 2. data is not mapped appropriately to memory banks
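To make the contrast concrete, here is a minimal C sketch (my own illustration, not from the slides): the first loop has independent iterations that a vector unit or vectorizing compiler can exploit, while the linked-list search serializes on the pointer chase.

```c
#include <stddef.h>

/* Regular (SIMD/vector-friendly): every iteration is independent,
 * so the loop maps naturally onto vector operations. */
void saxpy(float *y, const float *x, float a, size_t n) {
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

/* Irregular: each step depends on a pointer loaded by the previous
 * step, so there is no data parallelism for a vector unit to exploit. */
struct node { int key; struct node *next; };

struct node *find(struct node *head, int key) {
    for (struct node *p = head; p != NULL; p = p->next)
        if (p->key == key)
            return p;
    return NULL;
}
```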
VECTOR MACHINE EXAMPLE: CRAY-1
Russell, "The CRAY-1 Computer System," CACM 1978.
Scalar and vector modes; 8 vector registers, 64 elements each, 64 bits per element; 16 memory banks; 8 64-bit scalar registers; 8 24-bit address registers
AMDAHL'S LAW: BOTTLENECK ANALYSIS
Speedup = time_without_enhancement / time_with_enhancement
Suppose an enhancement speeds up a fraction f of a task by a factor of S:
time_enhanced = time_original · (1 − f) + time_original · (f / S)
Speedup_overall = 1 / ((1 − f) + f / S)
[Figure: the original time is split into a (1 − f) portion and an f portion; in the enhanced version the f portion shrinks to f/S while the (1 − f) portion is unchanged.]
Focus on bottlenecks with large f (and large S)
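As a quick numeric check, here is a small C program evaluating the formula (the values f = 0.9 and S = 10 are my own illustrative choices, not from the slides):

```c
#include <stdio.h>

/* Amdahl's Law: a fraction f of the task is sped up by a factor S;
 * the remaining (1 - f) runs unchanged. */
double amdahl(double f, double S) {
    return 1.0 / ((1.0 - f) + f / S);
}

int main(void) {
    /* Speeding up 90% of the task by 10x yields only ~5.26x overall,
     * because the untouched 10% comes to dominate the runtime. */
    printf("speedup = %.2f\n", amdahl(0.9, 10.0));
    return 0;
}
```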
SYSTOLIC ARRAYS
WHY SYSTOLIC ARCHITECTURES?
Idea: data flows from the computer memory in a rhythmic fashion, passing through many processing elements before it returns to memory
Similar to an assembly line of processing elements: different people work on the same car, and many cars are assembled simultaneously
Why? Special-purpose accelerators/architectures need:
- Simple, regular design (keep the number of unique parts small and regular)
- High concurrency → high performance
- Balanced computation and I/O (memory) bandwidth
SYSTOLIC ARRAYS
Analogy: memory is the heart, the PEs are the cells; memory "pulses" data through the cells
H. T. Kung, "Why Systolic Architectures?," IEEE Computer 1982.
SYSTOLIC ARCHITECTURES
Basic principle: replace a single PE with a regular array of PEs and carefully orchestrate the flow of data between the PEs, balancing computation and memory bandwidth
Differences from pipelining:
- The stages are individual PEs
- The array structure can be non-linear and multi-dimensional
- PE connections can be multidirectional (and of different speeds)
- PEs can have local memory and execute kernels (rather than a piece of the instruction)
SYSTOLIC COMPUTATION EXAMPLE
Convolution: used in filtering, pattern matching, correlation, polynomial evaluation, and many image processing tasks
SYSTOLIC ARCHITECTURE FOR CONVOLUTION
The weights w1, w2, w3 stay resident in the PEs while the inputs x1, x2, x3, … are pulsed through:
Step 1: x1 enters; y1 = w1x1 (starting from y1 = 0)
Step 2: x2 enters; y1 = w1x1 + w2x2
Step 3: x3 enters; y1 = w1x1 + w2x2 + w3x3 (y1 complete)
CONVOLUTION y1 = w1x1 + w2x2 + w3x3 y2 = w1x2 + w2x3 + w3x4
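For reference, here is a minimal C sketch (my own illustration, not from the slides) of the 3-tap convolution the array computes; a systolic array produces the same outputs, one per pulse, by streaming the x values past PEs that each hold one weight:

```c
#include <stdio.h>

#define TAPS 3

/* Direct 3-tap convolution: y[i] = w1*x[i] + w2*x[i+1] + w3*x[i+2],
 * matching y1 = w1x1 + w2x2 + w3x3 and y2 = w1x2 + w2x3 + w3x4. */
void conv3(const float w[TAPS], const float *x, float *y, int n_out) {
    for (int i = 0; i < n_out; i++) {
        float acc = 0.0f;
        for (int t = 0; t < TAPS; t++)
            acc += w[t] * x[i + t];   /* w[0]=w1, w[1]=w2, w[2]=w3 */
        y[i] = acc;
    }
}

int main(void) {
    float w[TAPS] = {1.0f, 2.0f, 3.0f};            /* w1, w2, w3 */
    float x[5]    = {1.0f, 2.0f, 3.0f, 4.0f, 5.0f};
    float y[3];
    conv3(w, x, y, 3);
    for (int i = 0; i < 3; i++)
        printf("y%d = %.1f\n", i + 1, y[i]);       /* 14.0, 20.0, 26.0 */
    return 0;
}
```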
SYSTOLIC ARRAYS: PROS AND CONS
Advantage: specialized (the computation needs to fit the PE organization/functions) → improved efficiency, simple design, high concurrency/performance; good for doing more with a lower memory bandwidth requirement
Downside: specialized → not generally applicable, because the computation needs to fit the PE functions/organization
ISA VS. MICROARCHITECTURE
What is part of the ISA vs. the uarch?
Gas pedal: the interface for "acceleration"; internals of the engine: implement "acceleration"
Add instruction vs. adder implementation: the implementation (uarch) can vary as long as it satisfies the specification (ISA), e.g., bit-serial, ripple-carry, or carry-lookahead adders
The x86 ISA has many implementations: 286, 386, 486, Pentium, Pentium Pro, …
The uarch usually changes faster than the ISA: there are few ISAs (x86, SPARC, MIPS, Alpha) but many uarchs. Why?
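To illustrate the specification/implementation split, here is a hedged C sketch (my own, not from the slides): both functions satisfy the same "add" specification, but one models the simple ripple-carry circuit bit by bit while the other relies on whatever fast adder the hardware provides.

```c
#include <stdint.h>
#include <stdio.h>

/* Specification-level add: the C '+' compiles to whatever adder the
 * microarchitecture implements (e.g., carry-lookahead). */
uint8_t add_spec(uint8_t a, uint8_t b) { return (uint8_t)(a + b); }

/* Ripple-carry model: each bit's carry-out feeds the next bit's
 * carry-in, mirroring the slower, simpler adder circuit. */
uint8_t add_ripple(uint8_t a, uint8_t b) {
    uint8_t sum = 0, carry = 0;
    for (int i = 0; i < 8; i++) {
        uint8_t ai = (a >> i) & 1, bi = (b >> i) & 1;
        sum   |= (uint8_t)((ai ^ bi ^ carry) << i);
        carry  = (uint8_t)((ai & bi) | (ai & carry) | (bi & carry));
    }
    return sum;
}

int main(void) {
    /* Same interface, same result: the ISA does not care how. */
    printf("%u %u\n", add_spec(200, 100), add_ripple(200, 100)); /* 44 44 */
    return 0;
}
```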
TRADEOFFS: SOUL OF COMPUTER ARCHITECTURE
ISA-level tradeoffs
Uarch-level tradeoffs
System- and task-level tradeoffs
How to divide the labor between hardware and software
ISA-LEVEL TRADEOFFS: SEMANTIC GAP
Where to place the ISA? Semantic gap: closer to the high-level language (HLL) or closer to hardware control signals?
Complex vs. simple instructions: RISC vs. CISC vs. HLL machines; FFT, QUICKSORT, POLY, FP instructions?
VAX INDEX instruction (array access with bounds checking), e.g., A[i][j][k] accessed and bounds-checked in one instruction
SEMANTIC GAP
[Figure: high-level language (software) at the top, hardware control signals at the bottom; the semantic gap is the distance the ISA must bridge between them.]
SEMANTIC GAP
[Figure: the same diagram with CISC ISAs placed closer to the high-level language (small semantic gap) and RISC ISAs placed closer to the hardware control signals (large semantic gap).]
ISA-LEVEL TRADEOFFS: SEMANTIC GAP
Where to place the ISA? Semantic gap: closer to the high-level language (HLL) or closer to hardware control signals? Complex vs. simple instructions: RISC vs. CISC vs. HLL machines; FFT, QUICKSORT, POLY, FP instructions? VAX INDEX instruction (array access with bounds checking)
Tradeoffs:
- Simple compiler, complex hardware vs. complex compiler, simple hardware
- Caveat: translation (indirection) can change the tradeoff!
- Burden of backward compatibility
- Performance?
- Optimization opportunity: for the VAX INDEX instruction, who (compiler vs. hardware) puts more effort into optimization?
- Instruction size, code size
X86: SMALL SEMANTIC GAP: STRING OPERATIONS REP MOVS DEST SRC How many instructions does this take in Alpha?
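For a sense of scale, here is a hedged C sketch (my own, not from the slides) of what the single x86 instruction REP MOVS expresses; on a RISC ISA such as Alpha, each iteration compiles to several instructions (a load, a store, pointer increments, a count decrement, and a branch):

```c
#include <stddef.h>

/* The semantics of x86's REP MOVS (byte variant) in one instruction:
 * copy 'count' bytes from src to dest.  A RISC ISA spells out the
 * loop explicitly, one simple instruction at a time. */
void rep_movs(unsigned char *dest, const unsigned char *src, size_t count) {
    while (count--)
        *dest++ = *src++;
}
```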
SMALL SEMANTIC GAP EXAMPLES IN VAX
FIND FIRST: find the first set bit in a bit field; helps OS resource allocation operations
SAVE CONTEXT, LOAD CONTEXT: special context-switching instructions
INSQUE, REMQUE: operations on a doubly linked list
INDEX: array access with bounds checking
STRING operations: compare strings, find substrings, …
Cyclic redundancy check (CRC) instruction
EDITPC: implements editing functions to display fixed-format output
Digital Equipment Corp., "VAX-11/780 Architecture Handbook," 1977-78.
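As an illustration of how much work one such instruction folds in, here is a hedged C model of INDEX-style bounds-checked subscript computation (my own sketch; the real instruction's operand details differ):

```c
#include <stdio.h>
#include <stdlib.h>

/* One-instruction semantics of a bounds-checked array subscript in the
 * spirit of VAX INDEX: check the subscript against its bounds, then
 * fold it into a running index scaled by the element size.  Chaining
 * three of these computes a bounds-checked A[i][j][k] offset. */
long vax_index(long subscript, long low, long high, long size, long index_in) {
    if (subscript < low || subscript > high) {
        fprintf(stderr, "subscript range trap\n");
        exit(EXIT_FAILURE);
    }
    return (index_in + subscript) * size;
}
```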
CISC VS. RISC
RISC-style copy loop (several simple instructions):
X: MOV
   ADD
   COMP
   JMP X
CISC-style copy (one complex instruction):
REP MOVS
Which one is easier to optimize?
SMALL VERSUS LARGE SEMANTIC GAP
CISC vs. RISC
Complex instruction set computer → complex instructions; initially motivated by "not good enough" code generation
Reduced instruction set computer → simple instructions (John Cocke, mid-1970s, IBM 801); goal: enable better compiler control and optimization
RISC motivated by:
- Memory stalls (no work is done in a complex instruction when there is a memory stall? When is this correct?)
- Simplifying the hardware → lower cost, higher frequency
- Enabling the compiler to optimize the code better: find fine-grained parallelism to reduce stalls
SMALL VERSUS LARGE SEMANTIC GAP
John Cocke's RISC (large semantic gap) concept: the compiler generates control signals (open microcode)
Advantages of a small semantic gap (complex instructions):
+ Denser encoding → smaller code size → saves off-chip bandwidth, better cache hit rate (better packing of instructions)
+ Simpler compiler
Disadvantages:
- Larger chunks of work → the compiler has less opportunity to optimize
- More complex hardware → translation to control signals and optimization must be done by the hardware
Read: Colwell et al., "Instruction Sets and Beyond: Computers, Complexity, and Controversy," IEEE Computer 1985.
HOW HIGH OR LOW CAN YOU GO?
Very large semantic gap: each instruction specifies the complete set of control signals in the machine; the compiler generates the control signals (open microcode, John Cocke, 1970s); gave way to optimizing compilers
Very small semantic gap: the ISA is (almost) the same as a high-level language; Java machines, LISP machines, object-oriented machines, capability-based machines
EFFECT OF TRANSLATION
One can translate from one ISA to another to change the semantic-gap tradeoffs
Examples:
- Intel's and AMD's x86 implementations translate x86 instructions into programmer-invisible microoperations (simple instructions) in hardware
- Transmeta's x86 implementations translated x86 instructions into "secret" VLIW instructions in software (code morphing software)
Think about the tradeoffs
TRANSLATION LAYER
[Figure: without translation, the ISA sits directly between the high-level language (software) and the control signals (hardware), defining the semantic gap. With a translation layer, software still targets the exposed ISA (x86), which is translated, invisibly to the programmer, into an internal uISA (uops) that drives the control signals.]