1 CE 454 Computer Architecture, Lecture 8, Ahmed Ezzat: The Microarchitecture Level, Ch. 4.4, 4.5

CE 454 Ahmed Ezzat 2 Outline
Design of the Microarchitecture Level
– Speed vs Cost
– Reducing Execution Path Length
– Instruction Prefetching Design: The Mic-2
– Pipelined Design: The Mic-3
– Seven-Stage Pipeline Design: The Mic-4
Improving Performance
– Cache Memory
– Branch Prediction
– Out-of-Order Execution and Register Renaming
– Speculative Execution
Reading Assignment: Examples of the Microarchitecture Level

CE 454 Ahmed Ezzat 3 Design of the Microarchitecture Level: Speed vs Cost
Simple machines are slow, and fast machines are complex
Speed improvement due to organization, as opposed to faster technology, cannot be ignored
Ways to make machines faster:
– Reduce the number of clock cycles needed to execute an instruction (known as the execution path length)
– Simplify the organization so that the clock cycle can be shorter
– Overlap the execution of instructions, e.g., instruction pipelining
Ways to measure cost:
– Count the number of components, transistors, etc.
– The area (real estate) required on the IC is more important

CE 454 Ahmed Ezzat 4 Design of the Microarchitecture Level: Reducing Execution Path Length
Mic-1 is a simple CPU with minimal hardware (less than 5000 transistors) plus a control store (ROM) and main memory (RAM)
IJVM was implemented in microcode with little hardware; now let us look for a faster alternative
The POP instruction shown costs 4 clock cycles (3 microinstructions + 1 for the main loop)

CE 454 Ahmed Ezzat 5 Design of the Microarchitecture Level: Reducing Execution Path Length
Merging the Interpreter Loop with the Microcode
– The main loop's work can be overlapped with the end of the previous instruction
– When the ALU is not used (as in pop2), use that dead cycle for the main loop; as a result, POP's cost is reduced to 3 clock cycles
– A dead cycle where the ALU is idle is not common, but merging Main1 into the end of each microinstruction sequence is worth doing anyway

CE 454 Ahmed Ezzat 6 Design of the Microarchitecture Level: Reducing Execution Path Length
Three-Bus Architecture
– Using the Mic-1 architecture, let us revisit the ILOAD instruction (push a local variable onto the stack)
– With two input buses, A and B, any two registers can be added in one cycle

CE 454 Ahmed Ezzat 7 Design of the Microarchitecture Level: Instruction Prefetching Design – Mic-2
Instruction Fetch Unit (IFU)
Execution loop:
– PC passed through the ALU and incremented
– PC used to fetch the next byte of the instruction
– Operands read from memory
– Operands written to memory
– ALU computes and stores the result
The ALU intervenes in instruction fetching (fetch one byte at a time, then assemble), which ties up the ALU at 1 cycle/byte
Instead, have a separate Instruction Fetch Unit to:
– Increment the PC
– Fetch bytes ahead of use
– Assemble 8- and 16-bit operands

CE 454 Ahmed Ezzat 8 Design of the Microarchitecture Level: Instruction Prefetching Design – Mic-2
Instruction Fetch Unit (IFU) – two ways to build it:
– The IFU interprets each opcode, fetches any additional fields, and assembles them in a register for use by the main execution unit
– The IFU always keeps the next 8- and 16-bit pieces available, whether or not they are used (the design shown on the next slide)
Use two MBRs (MBR1 holds the oldest byte of the shift register; MBR2 holds the oldest 2 bytes):
– The IFU automatically senses when MBR1 is read
– It reads the next byte into MBR1
– When MBR1 is read, the shift register shifts 1 byte
– When MBR2 is read, it is reloaded with the oldest 2 bytes
– The IFU has its own memory address register (IMAR), used to address memory when a new word is needed
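To make the MBR1/MBR2 behavior concrete, here is a minimal C sketch of the shift register. It is illustrative only: it assumes a 6-byte window, fetches byte-at-a-time (the real IFU fetches whole words via IMAR), and assumes big-endian 16-bit operands as in IJVM; the names are mine, not Tanenbaum's.

#include <stdint.h>
#include <string.h>

#define WINDOW 6

typedef struct {
    uint8_t  window[WINDOW];  /* oldest prefetched byte sits at index 0 */
    int      count;           /* number of valid bytes in the window */
    uint32_t imar;            /* IFU's own memory address register */
    const uint8_t *mem;       /* instruction memory */
} IFU;

static void ifu_fill(IFU *f) {
    while (f->count < WINDOW)            /* prefetch ahead of the CPU */
        f->window[f->count++] = f->mem[f->imar++];
}

/* Reading MBR1 consumes the oldest byte; the shift register shifts. */
static uint8_t read_mbr1(IFU *f) {
    uint8_t b = f->window[0];
    memmove(f->window, f->window + 1, (size_t)--f->count);
    ifu_fill(f);                         /* sense the read, refill MBR1 */
    return b;
}

/* Reading MBR2 consumes the oldest two bytes as one 16-bit operand. */
static uint16_t read_mbr2(IFU *f) {
    uint16_t hi = read_mbr1(f);          /* IJVM operands are big-endian */
    return (uint16_t)((hi << 8) | read_mbr1(f));
}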

CE 454 Ahmed Ezzat 9 Design of the Microarchitecture Level: Instruction Prefetching Design – Mic-2: Instruction Fetch Unit (IFU)

CE 454 Ahmed Ezzat 10 Design of the Microarchitecture Level: Instruction Prefetching Design – Mic-2: The Whole Design

CE 454 Ahmed Ezzat 11 Design of the Microarchitecture Level: Instruction Prefetching Design – Mic-2
Summary
– Mic-2 is an enhanced version of Mic-1
– Eliminates the main loop entirely
– Avoids tying up the ALU incrementing the PC
– Reduces the path length whenever a 16-bit index or offset is calculated; there is no need to assemble it in H
– Mic-2 improves some instructions more than others. For example, it reduces:
LDC_W from 9 to 3 microinstructions
ILOAD from 6 to 3 microinstructions
SWAP from 8 to 6 microinstructions
IADD from 4 to 3 microinstructions

CE 454 Ahmed Ezzat 12 Design of the Microarchitecture Level: Pipelined Design – Mic-3
Mic-2 is faster than Mic-1 with little increase in real estate (the IFU)
Reducing the cycle time is tied to the technology used; how about exploiting parallelism, since Mic-2 is highly sequential except for the IFU?
Major components of the data path cycle:
– Driving the selected registers onto A and B
– ALU and shifter work
– Results driven back to the registers and stored
Can introduce latches to partition the buses, so the parts operate independently. Why?
– Can speed up the clock because the maximum delay is less
– Can use all parts during every subcycle

CE 454 Ahmed Ezzat 13 Design of the Microarchitecture Level: Pipelined Design – Mic-3
3-bus architecture with 3 latches
– A latch is inserted in the middle of each bus
– In effect, the latches partition the data path into 3 distinct parts that can operate independently (Mic-3)
– Each subcycle is about 1/3 the original length, hence the clock speed can triple
– Previously the ALU was idle during subcycles 1 and 3; now the ALU can be used on every subcycle, giving better throughput

CE 454 Ahmed Ezzat 14 Design of the Microarchitecture Level: Pipelined Design – Mic-3
In Mic-3, it takes 3 microsteps to use the data path:
– Load A and B
– Perform the operation and load C
– Write the result back
SWAP in Mic-2

CE 454 Ahmed Ezzat 15 Design of the Microarchitecture Level: Pipelined Design – Mic-3
SWAP in Mic-3
– Mic-3 instructions take more cycles than in Mic-2
– However, a Mic-3 cycle is 1/3 of a Mic-2 cycle
– For SWAP, Mic-3 costs 11 microsteps, while Mic-2's 6 cycles correspond to 6 x 3 = 18 microsteps

CE 454 Ahmed Ezzat 16 Design of the Microarchitecture Level: Pipelined Design – Mic-3
Dependencies
– We would like to start swap3 in cycle 3, but MDR is not available until cycle 5. This is called a true dependence, or RAW (Read After Write) dependence; swap3 has to wait (stall) until swap1 completes
Pipelining is a key technique in all modern CPUs. An analogy is a car assembly line: it produces 1 car per hour independent of how long it actually takes to assemble one car.
Reading assignment: A Seven-Stage Pipeline (Mic-4)

CE 454 Ahmed Ezzat 17 Improving Performance
Ways to improve performance (primarily of the CPU and memory):
– Implementation improvements without architectural changes
Old programs run without changes, which is a major selling point; through the Pentiums, improvements were of this kind
– Architectural changes
New or additional instructions and/or registers
New architectures such as RISC, IA-64, etc.
Major techniques:
– Cache memory
– Branch prediction
– Out-of-order execution with register renaming
– Speculative execution

CE 454 Ahmed Ezzat 18 Improving Performance: Cache Memory
Memory latency and bandwidth are at odds (pipelining the memory improves bandwidth but not latency); hence caches
Split cache: separate caches for instructions and data
– Two separate memory ports
– Doubles the speed, since the two are accessed independently
Level 2+ caches: between the I/D caches and main memory

CE 454 Ahmed Ezzat 19 Improving Performance: Cache Memory
Caches are generally inclusive: the L3 cache includes the L2 cache's content, and the L2 cache includes the L1 cache's content
Caching depends on locality of reference:
– Spatial locality: memory locations with addresses numerically close to recently accessed locations are likely to be accessed in the near future
– Temporal locality: recently accessed memory locations are likely to be accessed again
Cache model:
– Main memory is divided into fixed-size blocks called cache lines, of 4 to 64 consecutive bytes
– On a memory reference, the cache controller checks whether the referenced word is in the cache
– If not, a cache line is evicted and the new line is fetched from main memory
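The payoff of spatial locality is easy to see in code. In this hedged C sketch, both functions do identical work, but the row-major version walks consecutive addresses within each fetched cache line, while the column-major version touches a new line on almost every access:

#include <stddef.h>

#define N 1024

long sum_row_major(const int a[N][N]) {
    long s = 0;
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            s += a[i][j];   /* consecutive addresses: cache-friendly */
    return s;
}

long sum_col_major(const int a[N][N]) {
    long s = 0;
    for (size_t j = 0; j < N; j++)
        for (size_t i = 0; i < N; i++)
            s += a[i][j];   /* 4096-byte stride: a new line each access */
    return s;
}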

CE 454 Ahmed Ezzat 20 Improving Performance: Direct-Mapped Caches
A given memory word can be stored in exactly one place; if it is not there, it is not in the cache
Entry format:
– VALID bit: on if the cache line holds valid data
– TAG (16 bits): unique value identifying the corresponding line of memory
– DATA (32 bytes): a copy of the data from memory

CE 454 Ahmed Ezzat 21 Improving Performance: Direct-Mapped Caches
Address translation fields:
– TAG: corresponds to the TAG field stored in the cache entry
– LINE: which cache entry holds the data, if it is present
– WORD: which word within the line
– BYTE: which byte within the word (normally not used)
When the CPU issues an address, the hardware extracts the LINE bits and indexes into the cache to find one of the 2048 entries
– If the entry is valid, the TAG fields are compared; if they match, it is a cache hit
– Otherwise it is a cache miss: the whole cache line is fetched from memory and stored in the cache, and the existing line is written back to memory if necessary
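For the cache described above (2048 entries of 32-byte lines, 16-bit TAG), a 32-bit address splits as TAG(16) | LINE(11) | WORD(3) | BYTE(2). A minimal C sketch of the field extraction and hit test; this shows the lookup logic only, not any particular CPU's implementation:

#include <stdint.h>

typedef struct { uint32_t tag, line, word, byte; } Fields;

static Fields split_address(uint32_t addr) {
    Fields f;
    f.byte = addr        & 0x3;    /* 2 bits: byte within the word   */
    f.word = (addr >> 2) & 0x7;    /* 3 bits: word within the line   */
    f.line = (addr >> 5) & 0x7FF;  /* 11 bits: one of 2048 entries   */
    f.tag  = addr >> 16;           /* 16 bits: identifies the line   */
    return f;
}

typedef struct { uint8_t valid; uint16_t tag; uint8_t data[32]; } Entry;

/* A hit requires the indexed entry to be valid with a matching TAG. */
static int is_hit(const Entry cache[2048], uint32_t addr) {
    Fields f = split_address(addr);
    return cache[f.line].valid && cache[f.line].tag == f.tag;
}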

CE 454 Ahmed Ezzat 22 Improving Performance: Direct-Mapped Caches
Consecutive memory lines go in consecutive cache entries
If the access pattern alternates between address X and address X + cache size, each access overwrites the other's line; if this pattern is frequent, it results in poor performance (frequent misses)
Direct-mapped caches are very common and typically effective, because collisions such as the one described above are rare
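The collision is easy to verify with the field extraction shown earlier: adding the cache size (2048 x 32 = 64 KB) to an address changes only TAG bits, never the LINE index. A tiny self-contained check:

#include <stdint.h>

#define CACHE_SIZE (2048u * 32u)   /* 64 KB direct-mapped cache */

/* Addresses one cache-size apart share a LINE index but differ in TAG. */
static int same_line(uint32_t a, uint32_t b) {
    return ((a >> 5) & 0x7FF) == ((b >> 5) & 0x7FF);
}
/* same_line(x, x + CACHE_SIZE) is 1 for every x, so a loop alternating
 * between the two addresses misses on every single access. */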

CE 454 Ahmed Ezzat 23 Improving Performance: N-way Set-Associative Caches
Allow n entries for each line index (line address modulo the number of sets); the entries in a set are ordered by LRU for replacement
Each entry in the set must be checked to see whether the needed line is present
2-way and 4-way caches have performed well
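A hedged C sketch of a 4-way set-associative lookup with true LRU, keeping the 64 KB total from the direct-mapped example (so 512 sets of 4 ways, 32-byte lines); the names and sizes are illustrative, not a real design:

#include <stdint.h>

#define WAYS 4
#define SETS 512

typedef struct {
    uint8_t  valid;
    uint32_t tag;
    uint64_t stamp;   /* last-use time: smallest = least recently used */
} Way;

static Way cache[SETS][WAYS];
static uint64_t now;  /* global access counter standing in for time */

static int access_line(uint32_t addr) {
    Way *set = cache[(addr >> 5) & (SETS - 1)];  /* 9 index bits */
    uint32_t tag = addr >> 14;                   /* bits above the index */

    for (int w = 0; w < WAYS; w++)               /* check every way */
        if (set[w].valid && set[w].tag == tag) {
            set[w].stamp = ++now;                /* hit: mark as MRU */
            return 1;
        }

    int victim = 0;                              /* miss: pick empty/LRU way */
    for (int w = 1; w < WAYS; w++)
        if (!set[w].valid ||
            (set[victim].valid && set[w].stamp < set[victim].stamp))
            victim = w;
    set[victim] = (Way){ 1, tag, ++now };
    return 0;
}

Real hardware usually approximates LRU with a few status bits per set rather than full timestamps; the effect is the same for this discussion.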

CE 454 Ahmed Ezzat 24 Improving Performance: Issues in Cache Design
Cache replacement policy: LRU
Writing the cache back:
– Write through
– Write deferred (write back)
Writing to an address that is not in the cache:
– Write allocation: bring the line into the cache first; typically used with write-back caches
– Write to memory directly: typically used with write-through caches
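The two write policies differ only in when memory is updated, which a toy one-line cache makes explicit. A minimal sketch, assuming write-allocate in both stores for brevity (as noted above, write-through designs usually skip allocation on a miss):

#include <stdint.h>
#include <string.h>

enum { LINE_BYTES = 32 };

static uint8_t memory[1 << 16];            /* backing store */
static struct {
    int      valid, dirty;
    uint32_t base;                         /* line-aligned address */
    uint8_t  data[LINE_BYTES];
} line;

static void load_line(uint32_t addr) {
    uint32_t base = addr & ~(uint32_t)(LINE_BYTES - 1);
    if (line.valid && line.base == base) return;            /* hit */
    if (line.valid && line.dirty)                           /* write the */
        memcpy(memory + line.base, line.data, LINE_BYTES);  /* dirty line back */
    memcpy(line.data, memory + base, LINE_BYTES);
    line.valid = 1; line.dirty = 0; line.base = base;
}

/* Write-through: update the cache and memory on every store. */
static void store_wt(uint32_t addr, uint8_t v) {
    load_line(addr);
    line.data[addr - line.base] = v;
    memory[addr] = v;
}

/* Write-back: only mark the line dirty; memory is updated when the
 * line is eventually evicted in load_line(). */
static void store_wb(uint32_t addr, uint8_t v) {
    load_line(addr);
    line.data[addr - line.base] = v;
    line.dirty = 1;
}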

CE 454 Ahmed Ezzat 25 Improving Performance: Branch Prediction
Pipelining works best with linear code, but about 20% of instructions are branches (unconditional or conditional), hence branch prediction is important
Most pipelined machines execute the instruction following a branch even though logically they should not (the next instruction fetch has already started before the opcode is known)
– Try to find a useful instruction to execute after the branch!
– Compilers can stuff in no-op instructions, but that slows execution and makes the code longer
Example prediction heuristics:
– Backward branches will be taken, e.g., at the end of a loop
– Some forward branches occur due to error conditions, which are rare, so predicting forward branches as not taken is reasonable
Two ways of handling a wrongly predicted branch:
– Allow execution to continue until it would change the state (i.e., write to a register), then write to a scratch register temporarily until we know whether the prediction was correct
– Record the overwritten value so it can be rolled back if needed

CE 454 Ahmed Ezzat 26 Improving Performance: Dynamic Branch Prediction
The CPU maintains a history table of previous branches in hardware, and looks up the table for predictions
(a) The table is organized just like a cache
(b) A 1-bit history guesses wrong at the end of a loop; with a 2-bit branch history, the prediction is changed only after two consecutive wrong guesses (see the sketch below)
(c) Can use a 2- or 4-way associative organization, as with caches
Can take a finite-state-machine approach
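The 2-bit scheme in (b) is the classic saturating counter; a minimal C sketch (the state names are descriptive, not from the slide):

#include <stdbool.h>

typedef enum {
    STRONG_NOT_TAKEN, WEAK_NOT_TAKEN, WEAK_TAKEN, STRONG_TAKEN
} Counter2;

static bool predict(Counter2 c) {
    return c >= WEAK_TAKEN;             /* upper half predicts "taken" */
}

static Counter2 update(Counter2 c, bool taken) {
    if (taken  && c < STRONG_TAKEN)     c++;   /* saturate at both ends */
    if (!taken && c > STRONG_NOT_TAKEN) c--;
    return c;
}

Each history-table entry, indexed by low-order bits of the branch address, holds one such counter; a single loop-exit mispredict nudges STRONG_TAKEN to WEAK_TAKEN without flipping the prediction, which is exactly the finite-state-machine behavior mentioned above.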

CE 454 Ahmed Ezzat 27 Improving Performance: Static Branch Prediction
Dynamic branch prediction is carried out at run time and requires special, expensive hardware
The compiler can pass hints instead (a new branch instruction format):
– A bit is set to indicate which way the branch will mostly go
– Requires special hardware support (enhanced instructions)
Profiling:
– The program is run through a profiler (simulator) to capture branch behavior; this information is fed back to the compiler, which in turn passes it on in special branch instructions
– IA-64 supports profiling
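One concrete, real-world form of such hints is GCC/Clang's __builtin_expect, which lets the programmer (or profile feedback) tell the compiler the likely direction so it can lay out the code favorably, or set hint bits on architectures that have them:

#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

int process(int *buf, int n) {
    for (int i = 0; i < n; i++) {
        if (unlikely(buf[i] < 0))   /* error path: predicted not taken */
            return -1;
        buf[i] *= 2;
    }
    return 0;
}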

CE 454 Ahmed Ezzat 28 Improving Performance: Out-of-Order Execution and Register Renaming
Pipelined superscalar machines fetch and issue instructions before they are needed
In-order instruction issue and retirement is simpler, but inefficient
Some instructions depend on others, so instructions cannot be reordered freely. Example machine:
– 8 registers; each instruction has 2 operand registers and one result register
– An instruction decoded in cycle n starts execution in cycle n+1
– Additions and subtractions write back their result in cycle n+2
– Multiplications write back their result in cycle n+3
A scoreboard is a table that tracks the use of registers for reading and writing at run time

CE 454 Ahmed Ezzat 29 Improving Performance: Example: In-Order Execution

CE 454 Ahmed Ezzat 30 Improving Performance: Example: In-Order Execution
In-order issue and in-order retirement. Instruction dependency rules:
– Read After Write (RAW): if any operand is being written, do not issue
– Write After Read (WAR): if the result register is being read, do not issue
– Write After Write (WAW): if the result register is being written, do not issue
Instruction I4 has a RAW dependency, so it stalls (see the scoreboard sketch below):
– The decode unit stalls until R4 is available
– It stops pulling from the fetch unit
– When its buffer is full, the fetch unit stalls fetching from memory
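The three rules map directly onto scoreboard counters: per register, count the issued-but-unretired instructions reading it and writing it. A minimal C sketch of the issue test (helper names are mine, following the example machine above):

#include <stdbool.h>

enum { NREGS = 8 };

static int readers[NREGS];   /* in-flight instructions reading Rn */
static int writers[NREGS];   /* in-flight instructions writing Rn */

/* In-order issue: stall unless all three hazard checks pass. */
static bool can_issue(int src1, int src2, int dst) {
    if (writers[src1] || writers[src2]) return false;  /* RAW */
    if (readers[dst])                   return false;  /* WAR */
    if (writers[dst])                   return false;  /* WAW */
    return true;
}

static void issue(int src1, int src2, int dst) {
    readers[src1]++; readers[src2]++; writers[dst]++;
}

static void retire(int src1, int src2, int dst) {
    readers[src1]--; readers[src2]--; writers[dst]--;
}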

CE 454 Ahmed Ezzat 31 Improving Performance: Out-of-Order Execution and Register Renaming
Instructions are issued out of order and may retire out of order
Instruction I5 is issued while instruction I4 is stalled
Problem: I5 could overwrite an operand that I4 still needs
New rule: do not issue an instruction that stores into an operand used by a previous, unfinished instruction
Example: I7 uses R1, written by I6
– That value is never used again, because I8 writes R1
– Hence I6 can use a different register to hold its value
Register renaming: the decode unit changes R1 in I6 and I7 to a secret register S1, so I5 and I6 can be issued concurrently
Register renaming often eliminates WAW and WAR dependencies
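A minimal sketch of decode-time renaming in C, assuming a pool of physical registers larger than the 8 architectural ones; real renamers also recycle freed registers, which this toy bump allocator omits:

/* Map architectural registers onto a larger physical pool so that
 * reusing a name (WAR/WAW) never reuses a physical register. */
enum { ARCH_REGS = 8, PHYS_REGS = 32 };

static int map[ARCH_REGS];        /* current physical home of each Rn */
static int next_free = ARCH_REGS; /* toy allocator; no recycling      */

static void rename_init(void) {
    for (int r = 0; r < ARCH_REGS; r++)
        map[r] = r;               /* initially Rn lives in Pn */
}

/* Sources read the current mapping; each destination gets a fresh
 * physical register, e.g., R1 in I6 and I7 becomes the secret S1. */
static void rename_instr(int src1, int src2, int dst,
                         int *p1, int *p2, int *pd) {
    *p1 = map[src1];
    *p2 = map[src2];
    map[dst] = next_free++;
    *pd = map[dst];
}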

CE 454 Ahmed Ezzat 32 Improving Performance: Out-of-Order Execution and Register Renaming
Same eight instructions:
– Executed in 18 cycles using in-order issue and retirement
– Executed in 9 cycles using out-of-order issue and retirement

CE 454 Ahmed Ezzat 33 Improving Performance: Speculative Execution
Code consists of basic blocks: linear sequences of code with no control structures such as if/else or while statements, and no branches
Within each block, reordering works well; the program can be represented as a directed graph of basic blocks
Problem: basic blocks are short, so they contain insufficient parallelism
Slow instructions can be moved up across block boundaries (hoisting), so that if they turn out to be executed, their result is already there when needed
Speculative execution: executing code before it is known whether it will be needed at all

CE 454 Ahmed Ezzat 34 Improving Performance: Speculative Execution – Example

CE 454 Ahmed Ezzat 35 Improving Performance: Speculative Execution – Problems
In the example (see the reconstruction below):
– Say that, except for even-sum and odd-sum, all variables are kept in registers
– The LOADs of the even-sum and odd-sum variables can be moved to the top of the loop
– Only one of {even-sum, odd-sum} is needed in any iteration, so the other LOAD is wasted
Reordered instructions must have no irrevocable results; all destination registers in speculative code can be renamed
Problem: speculative code can cause exceptions (e.g., a cache miss or page fault)
Solution: use a SPECULATIVE-LOAD instead of LOAD, so that a cache miss does not force a load from memory
Poison bit: if a speculative instruction such as a LOAD would cause a trap, the special version of the instruction instead sets a poison bit on the result register. If that register is later touched by a regular instruction, the trap occurs then; if the result is never used, the poison bit is eventually cleared and no harm is done.
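The general shape of the loop this slide refers to, as a C reconstruction (not the exact code from the example slide), showing both the hoisting opportunity and the wasted load:

int even_sum = 0, odd_sum = 0;     /* the two variables kept in memory */

void tally(const int *a, int n) {
    for (int i = 0; i < n; i++) {
        /* A speculative compiler hoists BOTH loads above the branch,
         * before it is known which one will be needed: */
        int e = even_sum;          /* speculative LOAD */
        int o = odd_sum;           /* speculative LOAD: one is wasted */
        if (a[i] % 2 == 0)
            even_sum = e + a[i];
        else
            odd_sum = o + a[i];
    }
}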
