CSE 586 Computer Architecture


1 CSE 586 Computer Architecture
Jean-Loup Baer CSE 586 Spring 00

2 Introduction--CSE 586 : Computer Architecture
Architecture of modern computer systems. Central processing unit: pipelined, exhibiting instruction-level parallelism, and allowing speculation. Memory hierarchy: multi-level cache hierarchy and its management, including hardware and software assists for enhanced performance; interaction of hardware/software for virtual memory systems. Input/output: buses; disks – performance and reliability (RAIDs). Multiprocessors: SMPs and cache coherence CSE 586 Spring 00

3 Course mechanics Please see Web page:
Instructors: Jean-Loup Baer: Vivek Sahasranaman: Textbook and reading material: John Hennessy and David Patterson, Computer Architecture: A Quantitative Approach, 2nd Edition, Morgan Kaufmann, 1996 Papers drawn from the literature: the list will be updated every week in the “outline” link of the Web page CSE 586 Spring 00

4 Course mechanics (cont’d)
Prerequisites Knowledge of computer organization as taught in a UG class (cf., e.g., CSE 378 at UW) Material on pipelining and memory hierarchy as found in Hennessy and Patterson, Computer Organization and Design, 2nd Edition, Morgan Kaufmann, Chapters 6 and 7 Your first assignment (due next week, 4/6/00; see Web page) will test that knowledge CSE 586 Spring 00

5 Course mechanics (cont’d)
Assignments (almost one a week) Paper and pencil (from the book or closely related) Programming (simulation). We will provide skeletons in C but you can use any language you want Paper review (one at the end of the quarter) Exam One take-home final Grading Assignments 75% Final 25% CSE 586 Spring 00

6 Class list Please subscribe right away to the mailing list for this class. See the Web page for instructions (use Majordomo) CSE 586 Spring 00

7 Course outline (will most certainly be modified; look at web page periodically)
Weeks 1 and 2: Performance of computer systems. ISA. RISC and CISC. Current trends (EPIC) and extensions (MMX) Review of pipelining Weeks 3 and 4 Simple branch predictors Instruction-level parallelism (scoreboard, Tomasulo algorithm) Multiple issue: superscalar and out-of-order execution Predication CSE 586 Spring 00

8 Course outline (cont’d)
Weeks 5 and 6 Caches and performance enhancements. TLB’s Week 7 Hardware/software interactions at the Virtual memory level Week 8 Buses Disk Weeks 9 and 10 Multiprocessors. SMP. Cache coherence Synchronization CSE 586 Spring 00

9 Technological improvements
CPU: Annual rate of speed improvement was 35% before 1985 and has been 60% since 1985 Slightly faster than the increase in transistors Memory: Annual rate of speed improvement is < 10% Density quadruples in 3 years. I/O: Access time has improved by 30% in 10 years Density improves by 50% every year CSE 586 Spring 00

10 Processor-Memory Performance Gap
(Figure: log-scale plot of relative performance vs. year. Memory system performance improved about 10x over 8 years, while memory densities increased 100x over the same period; x86 CPU performance improved about 100x over 10 years.) CSE 586 Spring 00

11 Improvements in Processor Speed
Technology Faster clock (commercially 700 MHz available; prototype 1.5 GHz) More transistors = more functionality Instruction-level parallelism (ILP) Multiple functional units, superscalar or out-of-order execution 10 million transistors, and Moore’s law still applies. Extensive pipelining From a single 5-stage pipe to multiple pipes as deep as 20 stages Sophisticated instruction fetch units Branch prediction; register renaming; trace caches On-chip memory One or two levels of caches. TLB’s for instruction and data CSE 586 Spring 00

12 Intel x86 Progression CSE 586 Spring 00

13 Speed improvement: expose ISA to the compiler/user
Pipeline level Scheduling to remove hazards, reduce load and branch delays Control flow prediction Static prediction and/or predication; code placement in cache Loop unrolling Reduce branching but increase register pressure Memory hierarchy level Instructions to manage the data cache (prefetch, purge) Etc… CSE 586 Spring 00

14 Performance evaluation basics
Performance inversely proportional to execution time Elapsed time includes: user + system; I/O; memory accesses; CPU per se CPU execution time (for a given program): 3 factors Number of instructions executed Clock cycle time (or rate) CPI: number of cycles per instruction (or its inverse IPC) CPU execution time = Instruction count * CPI * clock cycle time CSE 586 Spring 00
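The formula above can be sanity-checked with a few lines of Python; the instruction count, CPI, and clock rate below are made-up illustrative values, not figures from the course:

```python
# CPU time = instruction count * CPI * clock cycle time,
# with clock cycle time = 1 / clock rate.
def cpu_time(instr_count, cpi, clock_hz):
    """Execution time in seconds for a given program."""
    return instr_count * cpi / clock_hz

# Hypothetical program: 1M instructions, CPI 1.5, 700 MHz clock.
t = cpu_time(instr_count=1_000_000, cpi=1.5, clock_hz=700e6)
print(t)  # about 2.14 ms
```

Note that halving the CPI and doubling the clock rate each cut execution time by the same factor, which is why the formula separates the three terms.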

15 Components of the CPI CPI for single instruction issue with ideal pipeline = 1 Previous formula can be expanded to take into account classes of instructions For example in RISC machines: branches, f.p., load-store. For example in CISC machines: string instructions CPI = Σi CPIi * fi where fi is the frequency of instructions in class i Will talk about “contributions to the CPI” from, e.g.: memory hierarchy branch (misprediction) hazards etc. CSE 586 Spring 00
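The class-based CPI formula can be sketched directly in Python; the instruction classes and their frequencies below are invented for illustration:

```python
# CPI = sum over classes i of CPI_i * f_i, where f_i is the dynamic
# frequency of instructions in class i (the f_i must sum to 1).
def overall_cpi(classes):
    """classes: iterable of (cpi_i, freq_i) pairs."""
    assert abs(sum(f for _, f in classes) - 1.0) < 1e-9
    return sum(cpi * f for cpi, f in classes)

# Hypothetical RISC mix: ALU ops (CPI 1, 60%), load-store (CPI 2, 30%),
# branches (CPI 3, 10%).
print(overall_cpi([(1.0, 0.6), (2.0, 0.3), (3.0, 0.1)]))  # 1.5
```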

16 Benchmarking Measure a real workload for your installation
Weight programs according to frequency of execution If weights are not available, normalize so each program takes equal time on a given machine CSE 586 Spring 00

17 Available benchmark suites
SPEC95 (integer and floating-point), now SPEC CPU2000 Too many SPEC-specific compiler optimizations Other “specific” SPEC: SPEC Web, SPEC JVM etc. Perfect Club and NASA benchmarks Mostly for scientific and parallelizable programs TPC-A, TPC-B, TPC-C, TPC-D benchmarks Transaction processing (response time); decision support (data mining) Desktop applications Recent UW paper CSE 586 Spring 00

18 Comparing and summarizing benchmark performance
For execution times, use (weighted) arithmetic mean: Weighted Exec. Time = Σi Weighti * Timei For rates, use (weighted) harmonic mean: Weighted Rate = 1 / Σi (Weighti / Ratei) See paper by Jim Smith (link in outline): “Simply put, we consider one computer to be faster than another if it executes the same set of programs in less time” CSE 586 Spring 00
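Both means can be written down in a few lines; the program times, rates, and weights below are arbitrary example values:

```python
# Weighted arithmetic mean for execution times, weighted harmonic mean
# for rates, as on the slide.
def weighted_arith_mean(times, weights):
    return sum(w * t for w, t in zip(weights, times))

def weighted_harm_mean(rates, weights):
    return 1.0 / sum(w / r for w, r in zip(weights, rates))

# Two equally weighted programs, one at 2 (s or MIPS), one at 4.
print(weighted_arith_mean([2.0, 4.0], [0.5, 0.5]))  # 3.0
print(weighted_harm_mean([2.0, 4.0], [0.5, 0.5]))   # ~2.67, not 3.0
```

The harmonic mean of a set of rates is lower than their arithmetic mean, which is why it is the right way to average rates: it corresponds to total work divided by total time.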

19 Normalized execution times
Compute an aggregate performance measure before normalizing Average normalized (wrt another machine) execution time: Either with arithmetic mean or like here with geometric mean: Geometric mean does not measure execution time CSE 586 Spring 00

20 Computer design: Make the common case fast
Amdahl’s law (speedup) Speedup = (performance with enhancement)/(performance base case) Or equivalently Speedup = (exec. time base case)/(exec. time with enhancement) Application to parallel processing: with s the fraction of the program that is sequential, the speedup S is at most 1/s That is, if 20% of your program is sequential, the maximum speedup with an infinite number of processors is at most 5 CSE 586 Spring 00

21 Instruction Set Architecture
Part of the interface between hardware and software that is visible to the programmer Instruction Set RISC, CISC, VLIW-EPIC but also other issues such as how branches are handled, multimedia/graphics extensions etc. Addressing modes (including how the PC is used) Registers Integer, floating-point, but also flat vs. windows, and special-purpose registers, e.g. for multiply/divide or for condition codes or for predication CSE 586 Spring 00

22 What is not part of the ISA (but interesting!)
Caches, TLB’s etc. Branch prediction mechanisms Register renaming Number of instruction issued per cycle Number of functional units, pipeline structure etc ... CSE 586 Spring 00

23 CPU-centric operations (arith-logical)
Registers-only (load-store architectures) Synonymous with RISC? In general 3 operands (2 sources, 1 result) Fixed-size instructions. Few formats. Registers + memory (CISC). Varies between 2 and 3 operands (depends on instruction formats). At the extreme can have n operands (Vax) Variable-size instructions (expanding opcodes and operand specifiers) Stack oriented Historical? But what about JVM bytecodes Memory only (historical?) Used for “long-string” instructions CSE 586 Spring 00

24 RISC vs. CISC (highly abstracted)
CSE 586 Spring 00

25 Addressing modes for either load-store or cpu-centric ops
Basic : Register, Immediate, Indexed or displacement (subsumes register indirect), Absolute Between RISC and CISC Basic , Base + Index (e.g., in IBM Power PC), Scale-index (index multiplied by the size of the element being addressed) Very CISC-like Memory indirect, Auto-increment and decrement in conjunction with all others (increased complexity in pipelined units) CSE 586 Spring 00

26 Flow of control – Conditional branches
About 30% of executed instructions Generally PC-relative Compare and branch Only one instr. but a “heavy one”; Often limited to equality/inequality or comparisons with 0 Condition register Simple but uses a register and uses two instructions Condition codes CC is extra state (bad for pipelining) but can be set “for free” and allow for branch executing in 0 time CSE 586 Spring 00

27 Unconditional transfers
Jumps (long branches and indirect jumps) Procedure call and return Who saves what (return address, registers, parameters etc.), where (stack, register) and when (caller, callee) Combination of hardware/software protocols CSE 586 Spring 00

28 Registers (visible to the ISA)
Integer (Generally 32 but see x86) Floating-point (Generally 32 but see x86) Some GPR are special stack pointer, frame pointer, even the PC (VAX) Some registers have special functions control registers, segment registers (x86) Flat registers vs. windows (Sparc) or hierarchy (Cray) CSE 586 Spring 00

29 A sample of less conventional features
Decimal operations (HP-PA; handled by F-P in x86) String operations (x86, IBM mainframes) Lack of byte instructions (Alpha) Synchronization (to be seen at end of quarter) Atomic swap, load linked / store cond., “fence” Predicated execution (conditional move) Cache hints (prefetch, flush) Interaction with the operating system (PALcode) TLB instructions (TLB miss handled by software in MIPS) CSE 586 Spring 00

30 Extensions to basic ISA
Multimedia/Graphics extensions MMX (Intel) for some SIMD (single-instruction-multiple-data, i.e., vector-like) instructions using f-p registers and “slicing” them in 8-bit units Sun SPARC Visual Instruction Set for on-chip graphics functional units Full-fledged multimedia processors (e.g., Equator Map 1000 with hundreds of SIMD instructions) Vector processors The “old” Cray-like supercomputers Might make a comeback for stream processing CSE 586 Spring 00

31 From 32-bit to 64-bit ISAs The most costly design error: addressing space too small The 16-bit PDP-11 led to the 32-bit VAX 32 bits can only address 4 Gbytes, so move to 64 bits Start from scratch: DEC Alpha Still more or less tailored after the MIPS ISA Start from scratch: Intel IA-64 (Merced) With backward compatibility MIPS III (implemented as MIPS R4000) and MIPS IV (R10000) HP PA-8000 Sun Sparc V8 and V9 (UltraSparc) IBM/Motorola PowerPC 620 CSE 586 Spring 00

32 Instruction Execution Cycle
Loop forever: Fetch next instruction and increment PC Decode Read operands Execute or compute memory address or compute branch address Store result or access memory or modify PC But this logical decomposition does not correspond well to a break-up in steps of the same complexity CSE 586 Spring 00

33 Multiple cycle implementation (not pipelined)
Instruction fetch and increment PC Decode and read registers; branch address computation ALU use for either arithmetic or memory address computation; branch condition and next PC determined Memory access Store result in register With this decomposition, instructions will take either 3 cycles (branch) 4 cycles (everything else except branch and load) 5 cycles (load) CSE 586 Spring 00

34 (Figure: multiple-cycle datapath. Stages IF, ID/RR, EXE, Mem, WB; components: PC, instruction memory, IR, register file, sign-extend, ALU, data memory; inter-step registers NPC, A, B, ALUout, MD, target; the ALU “zero” output drives the branch decision.) CSE 586 Spring 00

35 Multiple cycle implementation (RTL level)
For example (using MIPS R3000 ISA): 1. Instruction fetch and increment PC IR <- Mem[PC] NPC <- PC + 4 2. Instruction decode and register read A <- Reg[IR[25:21]] (1st input to ALU) B <- Reg[IR[20:16]] (2nd input to ALU) target <- NPC + sign-extend(IR[15:0]) * 4 (in case the instruction is a branch) CSE 586 Spring 00

36 Multiple cycle impl. (cont’d)
3. ALU execution (Part I) ALUoutput <- A op B (if arith-logic) ALUoutput <- A + sign-extend(IR[15:0]) for immediate or memory address computation If (A = B) then PC <- target else PC <- NPC for branches; of course can be other conditions 4. Memory access or ALU execution (Part II) Memory-data <- Memory[ALUoutput] (load) Memory[ALUoutput] <- B (store) Reg[IR[15:11]] <- ALUoutput (if arith-logic) 5. Write-back stage Reg[IR[20:16]] <- Memory-data CSE 586 Spring 00
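The RTL steps above can be sketched as a tiny interpreter. This is a simplified sketch, not the course's code: instructions are pre-decoded tuples rather than 32-bit MIPS words (so the IR field extraction of slide 35 is elided), memory is a dict indexed by byte address, and only add, lw, sw, and beq are modeled.

```python
# One instruction through the five multiple-cycle steps of slides 35-36.
def step(insn, pc, regs, mem):
    """Returns the next PC; updates regs/mem in place."""
    npc = pc + 4                                        # 1. fetch, PC + 4
    op = insn[0]
    if op == "add":                                     # add rd, rs, rt
        _, rd, rs, rt = insn
        a, b = regs[rs], regs[rt]                       # 2. register read
        regs[rd] = a + b                                # 3. ALU, 4. write
        return npc
    if op == "lw":                                      # lw rt, off(rs)
        _, rt, off, rs = insn
        addr = regs[rs] + off                           # 3. address calc
        regs[rt] = mem[addr]                            # 4. access, 5. WB
        return npc
    if op == "sw":                                      # sw rt, off(rs)
        _, rt, off, rs = insn
        mem[regs[rs] + off] = regs[rt]                  # 4. memory access
        return npc
    if op == "beq":                                     # beq rs, rt, imm
        _, rs, rt, imm = insn
        target = npc + imm * 4                          # 2. branch target
        return target if regs[rs] == regs[rt] else npc  # 3. compare
    raise ValueError(op)
```

Note how the branch resolves entirely within steps 1–3, matching the 3-cycle count of slide 33, while the load needs all five steps.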

37 Tracing an arithmetic instruction (figure: the multiple-cycle datapath with the path through IF, ID/RR, EXE, and WB highlighted; the Mem step is not used). CSE 586 Spring 00

38 Tracing a load instruction (figure: the multiple-cycle datapath with all five steps used: IF, ID/RR, EXE for address computation, Mem, WB). CSE 586 Spring 00

39 Tracing a branch instruction (figure: the multiple-cycle datapath with only IF, ID/RR, and EXE used; the comparison and next-PC selection happen in EXE). CSE 586 Spring 00

40 Pipelining One instruction/result every cycle (ideal)
Not in practice because of hazards Increase throughput Throughput = number of results/second Improve speed-up In the ideal case, with n stages, the speed-up will be close to n. Can’t make n too large: load balancing between stages & hazards Might slightly increase the latency of individual instructions (pipeline overhead) CSE 586 Spring 00
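These claims can be put in back-of-envelope form under the usual textbook assumptions (n equally long stages, each hazard costing one bubble cycle); this approximation is not a formula from the slides:

```python
# Ideal n-stage pipeline: one result per cycle, so speed-up over the
# unpipelined machine approaches n. Stalls dilute it:
# speedup = n / (1 + stall cycles per instruction).
def pipeline_speedup(n_stages, stalls_per_instr=0.0):
    return n_stages / (1.0 + stalls_per_instr)

print(pipeline_speedup(5))       # 5.0 in the ideal case
print(pipeline_speedup(5, 0.5))  # ~3.33 with half a stall per instruction
```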

41 Basic pipeline implementation
Five stages: IF, ID, EXE, MEM, WB What are the resources needed and where ALU’s, Registers, Multiplexers etc. What info. is to be passed between stages Requires pipeline registers between stages: IF/ID, ID/EXE, EXE/MEM and MEM/WB What is stored in these pipeline registers? Design of the control unit. CSE 586 Spring 00

42 (Figure: pipelined datapath. The five stages IF, ID/RR, EXE, Mem, WB are separated by the pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB; data and control travel down the pipe together, and values such as the PC and the destination register (Rd) are carried along.) CSE 586 Spring 00

43 Five instructions in progress, one of each color (figure: the same pipelined datapath with one instruction occupying each of IF, ID/RR, EXE, Mem, and WB simultaneously). CSE 586 Spring 00

44 Hazards Structural hazards Data dependencies Control hazards
Resource conflict (mostly in multiple issue machines; also for resources that are used for more than one cycle; see later) Data dependencies Most common RAW but also WAR and WAW in OOO execution Control hazards Branches and other flow-of-control disruptions Consequence: stalls in the pipeline Equivalently: insertion of bubbles or of no-ops CSE 586 Spring 00

45 Pipeline speed-up CSE 586 Spring 00

46 Example of structural hazard
For a single issue machine: common data and instruction memory (unified cache) Pipeline stalls on every load-store instruction (control easy to implement) Better solutions Separate I-cache and D-cache Instruction buffers Both + sophisticated instruction fetch unit! Will see more cases in multiple issue machines CSE 586 Spring 00

47 Data hazards Data dependencies between instructions that are in the pipe at the same time. For single pipeline in order issue: Read After Write hazard (RAW) Add R1, R2, R3 #R1 is result register Sub R4, R1,R2 #conflict with R1 Add R3, R5, R1 #conflict with R1 Or R6,R1,R2 #conflict with R1 Add R5, R2, R1 #R1 OK now (5 stage pipe) CSE 586 Spring 00

48 (Figure: pipeline timing diagram, stages IF ID EXE MEM WB. Add R1,R2,R3 makes R1 available at the end of its WB stage. Sub R4,R1,R2 and ADD R3,R5,R1 need R1 in their ID stages, before it is written. OR R6,R1,R2 is OK if the ID stage can write the register file in the 1st part of the cycle and read it in the 2nd part. Add R5,R1,R2 is OK: R1 has been written by then in a 5-stage pipe.) CSE 586 Spring 00

49 Forwarding Result of ALU operation is known at end of EXE stage
Forwarding between: EXE/MEM pipeline register to ALUinput for instructions i and i+1 MEM/WB pipeline register to ALUinput for instructions i and i+2 Note that if the same register has to be forwarded, forward the last one to be written Forwarding through register file (write 1st half of cycle, read 2nd half of cycle) Forwarding between load and store (memory copy) CSE 586 Spring 00
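The priority rule above (“forward the last one to be written”) can be sketched as a simple check; the instruction representation (a destination register plus a list of sources) is an assumption made for illustration:

```python
# Decide where each source operand of the current instruction comes
# from: the EX/MEM register (result of instruction i-1), the MEM/WB
# register (result of i-2), or the register file. EX/MEM wins when both
# older instructions wrote the same register, i.e. the last write.
def forward_sources(curr_srcs, exmem_dest, memwb_dest):
    sources = {}
    for r in curr_srcs:
        if r == exmem_dest:
            sources[r] = "EX/MEM"
        elif r == memwb_dest:
            sources[r] = "MEM/WB"
        else:
            sources[r] = "regfile"
    return sources

# Sub R4,R1,R2 immediately after Add R1,R2,R3:
print(forward_sources(["R1", "R2"], "R1", None))
# R1 is forwarded from EX/MEM; R2 is read from the register file
```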

50 (Figure: same sequence with forwarding, stages IF ID EXE MEM WB. Add R1,R2,R3 makes R1 available at the end of EXE; Sub R4,R1,R2 and ADD R3,R5,R1 receive R1 through the forwarding paths; OR R6,R1,R2 gets it through the register file, written in the 1st half of the cycle and read in the 2nd; Add R5,R1,R2 is OK without forwarding.) CSE 586 Spring 00

51 Other data hazards Write After Write (WAW). Can happen in
Pipelines with more than one write stage More than one functional unit with different latencies (see later) Write After Read (WAR). Very rare With VAX-like autoincrement addressing modes CSE 586 Spring 00

52 Forwarding cannot solve all conflicts
At least in our simple MIPS-like pipeline Lw R1, 0(R2) #Result at end of MEM stage Sub R4, R1,R2 #conflict with R1 Add R3, R5, R1 #OK with forwarding Or R6,R1,R2 # OK with forwarding CSE 586 Spring 00
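The hazard detection implied by this example can be sketched in a few lines; the (op, dest, srcs) tuple format is an assumption for illustration:

```python
# A bubble is needed exactly when the previous instruction is a load
# whose destination is a source of the current instruction: the loaded
# value appears only at the end of MEM, one cycle too late to forward
# into the consumer's EXE stage.
def needs_bubble(prev, curr):
    prev_op, prev_dest, _ = prev
    _, _, curr_srcs = curr
    return prev_op == "lw" and prev_dest in curr_srcs

lw_r1 = ("lw", "R1", ["R2"])
sub   = ("sub", "R4", ["R1", "R2"])
add   = ("add", "R3", ["R5", "R1"])
print(needs_bubble(lw_r1, sub))  # True: stall one cycle
print(needs_bubble(sub, add))    # False: forwarding covers it
```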

53 (Figure: load-use timing, stages IF ID EXE MEM WB. LW R1,0(R2) makes R1 available only at the end of its MEM stage. Sub R4,R1,R2 needs R1 at the start of its EXE stage, one cycle too early: no forwarding path can help. ADD R3,R5,R1 and OR R6,R1,R2 are OK with forwarding.) CSE 586 Spring 00

54 (Figure: load-use timing with a bubble inserted after LW R1,0(R2). Delaying Sub R4,R1,R2 by one cycle lets R1 be forwarded from the MEM/WB register; ADD R3,R5,R1 and OR R6,R1,R2 follow normally.) CSE 586 Spring 00

55 Compiler solution: pipeline scheduling
Try to fill the “load delay” slot More difficult for multiple issue machines Increases register pressure Increases sequence of loads (might be harder on the cache) CSE 586 Spring 00

