* From AMD 1996 Publication #18522 Revision E

07/16/96
ENEL515 AMD-K5 Processor
1/18/2019 ENEL AMD on K5 Copyright M. Smith

Compare 1978 and 1996 CISC processors
- Compare a Motorola CISC processor (1978/81 era) with an AMD K5 CISC processor (1996 era)
- Look at the features common to the AMD K5 CISC and the 21K DSP
- Comment on the paper “Microprocessors outperform DSP 2:1”

ENEL515 -- AMD on K5 Copyright M. Smith smith@enel.ucalgary.ca
68332 Block Diagram -- CISC

68332 block detail

68K registers and ALU

Problems with CISC compatibility
- X86 CISC architecture -- dominant standard over many generations
- Backward compatibility involves inherent limitations of X86 CISC
  - variable length instructions
  - few general registers
  - complex addressing modes
- Could make the same comments about 68K CISC

K5 overcomes backwards X86 compatibility problems
- Superscalar RISC core
- instruction predecoding
- improved cache
- branch prediction
- speculative execution
- out of order execution
- register renaming


64 bit data bus interface
- 64 bit data bus
- cache/burst oriented line refill for both instruction cache and data cache
- Cache refills take five clock cycles per cache line -- (cache line -- 4 instructions?)

Cache Architecture
- Separate instruction and data caches
- Permits snooping and aliasing?
- Cache can be retained after context switches
- 8-Kbyte data cache -- two cache lines of data accessed in 1 cycle to overcome X86 bottleneck
- Uses MESI (Modified, Exclusive, Shared, Invalid) to maintain data coherency with other system caches and to ensure valid reads
- Write-back cache updates (external) memory only when necessary, to keep the bus free
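The MESI bullet above can be made concrete with a small state machine. This is a minimal sketch of the textbook MESI protocol for a single cache line, not the K5's actual cache controller; the event names (`local_read`, `local_write`, `snoop_read`, `snoop_write`) and the `others_have_copy` flag are assumptions introduced for illustration.

```python
from enum import Enum


class State(Enum):
    MODIFIED = "M"
    EXCLUSIVE = "E"
    SHARED = "S"
    INVALID = "I"


def next_state(state, event, others_have_copy=False):
    """Return the next MESI state of one cache line (textbook transitions)."""
    if event == "local_read":
        if state is State.INVALID:
            # Read miss: the line is filled as Shared if another cache
            # holds a copy, otherwise as Exclusive.
            return State.SHARED if others_have_copy else State.EXCLUSIVE
        return state  # read hit leaves the state unchanged
    if event == "local_write":
        return State.MODIFIED  # the local copy becomes the only valid one
    if event == "snoop_read":
        # Another cache reads the line: any valid copy drops to Shared.
        return State.INVALID if state is State.INVALID else State.SHARED
    if event == "snoop_write":
        return State.INVALID  # another cache takes ownership
    raise ValueError(f"unknown event: {event}")


def needs_writeback(state, event):
    """A Modified line must be written to memory before another cache uses it."""
    return state is State.MODIFIED and event in ("snoop_read", "snoop_write")


if __name__ == "__main__":
    line = State.INVALID
    for event in ("local_read", "local_write", "snoop_read"):
        print(event, "writeback" if needs_writeback(line, event) else "-")
        line = next_state(line, event)
    print(line)
```

The `needs_writeback` check is where the write-back policy shows up: memory is only updated when a Modified line is about to be observed by someone else, which is what keeps the bus free the rest of the time.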

Innovative x86 instruction predecoding -- what it means
- 4th generation CPU -- 1 X86 instruction per cycle
- 5th generation CPU -- 2 X86 instructions per cycle
- K X86 instructions per cycle

Innovative x86 instruction predecoding -- how it is done
- Each byte is tagged with predecode information
  - x86 instruction boundaries identified
  - multiple x86 instructions aligned (length 8 to 120 bits)
- Aligned instructions are assigned issue positions for most efficient processing
- Predecode information also indicates the number of ROPs needed
- After predecoding, instructions are stored in the instruction cache
- Speculative instructions (from a predicted branch stream) are pushed to a byte queue for further decoding
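The byte tagging described above can be sketched as follows. This is an illustration only: real x86 length decoding is far more involved, so the instruction lengths are passed in as a hypothetical `lengths` argument, and the `start`/`end` tag names are assumptions rather than the K5's actual predecode bit layout. The 1 to 15 byte limit corresponds to the 8 to 120 bit range given on the slide.

```python
def predecode(code_bytes, lengths):
    """Tag every byte with start/end markers for its instruction.

    `lengths` is a hypothetical input: the byte length of each
    instruction, in order. A real predecoder derives this itself.
    """
    if sum(lengths) != len(code_bytes):
        raise ValueError("lengths do not cover the byte stream exactly")
    tags = []
    for length in lengths:
        if not 1 <= length <= 15:  # 8 to 120 bits
            raise ValueError(f"illegal x86 instruction length: {length}")
        for i in range(length):
            tags.append({"start": i == 0, "end": i == length - 1})
    return tags


def instruction_starts(tags):
    """Offsets where an instruction begins, found without re-decoding."""
    return [offset for offset, tag in enumerate(tags) if tag["start"]]


if __name__ == "__main__":
    # nop ; mov eax, ebx ; mov eax, 1
    stream = bytes.fromhex("90" "89d8" "b801000000")
    tags = predecode(stream, [1, 2, 5])
    print(instruction_starts(tags))  # [0, 1, 3]
```

Once the tags sit next to the bytes in the instruction cache, later stages can find several instruction boundaries at once instead of walking the variable-length stream one instruction at a time.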

Turning into ROPs

Unique x86 instruction conversion and decoding
- 32 bytes of predecoded X86 instructions are forwarded to the decoder
- Decoder converts complex x86 instructions to ROPs -- fixed length, easy to process
- Operands for each ROP are fetched simultaneously from the register file or the re-order buffer
- X86 instructions are scanned and allocated to a decode position
- Number of ROPs per X86 instruction is known during predecoding -- saves time -- why?

Stage II
- If an X86 instruction requires fewer than 4 ROPs after conversion, it goes “fast path” to any of the 4 decode positions
- Very complex X86 instructions are transferred to microcode ROM for conversion
- After decoding, ROPs are sent to reservation stations at the 6 execution units
- At the execution units, ROPs may be executed out of order -- faster than compiler optimizations
- ROPs wait in reservation stations for operands from the register file, the data cache, or the results of other ROPs
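The fast-path versus microcode split can be sketched as a simple packing routine. It assumes the ROP count of each x86 instruction is already known (as the predecode slide states) and that up to four ROPs share one dispatch cycle; the greedy, in-order packing policy and the names `dispatch_groups` and `DECODE_POSITIONS` are assumptions for illustration, not the documented K5 algorithm.

```python
DECODE_POSITIONS = 4  # four decode positions, per the slide


def dispatch_groups(rop_counts):
    """Group x86 instructions into dispatch cycles.

    rop_counts[i] is the number of ROPs instruction i needs. Instructions
    needing fewer than four ROPs take the fast path and share a cycle
    while positions remain; the rest go to microcode in their own group.
    """
    groups, current, used = [], [], 0
    for index, count in enumerate(rop_counts):
        if count < 1:
            raise ValueError("every instruction needs at least one ROP")
        if count >= DECODE_POSITIONS:
            if current:  # close the partially filled fast-path cycle
                groups.append(current)
                current, used = [], 0
            groups.append([(index, "microcode")])
            continue
        if used + count > DECODE_POSITIONS:  # no room left this cycle
            groups.append(current)
            current, used = [], 0
        current.append((index, "fast"))
        used += count
    if current:
        groups.append(current)
    return groups


if __name__ == "__main__":
    for cycle in dispatch_groups([1, 2, 1, 3, 5, 1]):
        print(cycle)
```

Knowing the ROP counts before decode is what lets the packing decision be made in one pass, which is one plausible answer to the "saves time -- why?" question on the previous slide.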

K5 Superscalar RISC core
- Six execution units
  - two ALUs (integer)
  - two load/store units
  - branch unit
  - floating point unit (FPU)
- Conversion of variable length X86 instructions to simple fixed length RISC operations (ROPs)
- Dispatch four ROPs at a time to the superscalar core
- Execution rate -- peak at 6 ROPs per cycle
- Register forwarding and data bypassing allow results to be used immediately in the next ROP -- no delay while results go to the destination register and then out again

6 Parallel Execution Units

Out of order execution
- Eliminates delays due to pipeline dependencies
- Each execution unit has 2 reservation stations for ROPs
- Instructions can be issued in any order from the reservation stations
- Execution units act independently -- some work if others stall
- 16-entry reorder buffer keeps track of the original instruction sequence

Write-read dependency problem for out-of-order execution

Write-write dependency problem for out-of-order execution
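The two dependency problems named in the slide titles above can be detected mechanically. The sketch below uses the common read-after-write / write-after-read / write-after-write terminology, which may not match the original slide diagrams exactly; the `(dest, sources)` tuple format is an assumption made for illustration.

```python
def hazards(first, second):
    """Classify register dependencies between two instructions.

    Each instruction is a (dest, sources) tuple; `first` precedes
    `second` in program order. Returns the set of hazards that would
    constrain executing them out of order.
    """
    dest1, sources1 = first
    dest2, sources2 = second
    found = set()
    if dest1 is not None and dest1 in sources2:
        found.add("read-after-write")   # second needs first's result
    if dest2 is not None and dest2 in sources1:
        found.add("write-after-read")   # second overwrites first's input
    if dest1 is not None and dest1 == dest2:
        found.add("write-after-write")  # both write the same register
    return found


if __name__ == "__main__":
    add = ("eax", ["eax", "ebx"])   # eax = eax + ebx
    mov = ("ecx", ["eax"])          # ecx = eax
    load = ("eax", ["edx"])         # eax = [edx]
    print(hazards(add, mov))    # {'read-after-write'}
    print(hazards(mov, load))   # {'write-after-read'}
```

Only the read-after-write case is a true data dependency; the other two arise purely from reusing a register name, which is why register renaming can remove them.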

Register renaming
- Original X86 architecture has only 8 general purpose registers
- This increases register reuse (loads and stores to memory) and register dependencies
- Register reuse is overcome with multiple load/store execution units and a dual-ported data cache
- Register renaming overcomes register dependencies -- multiple physical registers for each logical register allow execution units to use the same logical register name simultaneously
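A minimal renaming sketch shows how a name dependency disappears. Note the assumptions: this version uses a generic map table and free list, whereas the slides say the K5 renames through its reorder buffer; physical registers are never released here, and the class and method names are hypothetical.

```python
class Renamer:
    """Map logical register names onto a larger pool of physical registers."""

    def __init__(self, logical_names, num_physical):
        if num_physical < len(logical_names):
            raise ValueError("need at least one physical register per name")
        # Each logical register starts out in its own physical register.
        self.mapping = {name: i for i, name in enumerate(logical_names)}
        self.free = list(range(len(logical_names), num_physical))

    def rename(self, dest, sources):
        """Return (physical_dest, physical_sources) for one instruction."""
        # Sources are looked up first so they see the previous writer.
        physical_sources = [self.mapping[s] for s in sources]
        if not self.free:
            raise RuntimeError("out of physical registers (sketch never frees)")
        physical_dest = self.free.pop(0)
        self.mapping[dest] = physical_dest  # later readers use the new copy
        return physical_dest, physical_sources


if __name__ == "__main__":
    renamer = Renamer(["eax", "ebx"], num_physical=6)
    print(renamer.rename("eax", ["ebx"]))  # (2, [1])
    print(renamer.rename("eax", ["ebx"]))  # (3, [1])
    print(renamer.rename("ebx", ["eax"]))  # (4, [3])
```

The two writes to `eax` land in different physical registers, so they no longer have to complete in order, while the final read of `eax` still picks up the most recent writer.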

Register renaming -- code

Register renaming -- diagram

Branch Prediction
- Branches occur in X86 programs every 7 instructions on average
- Processor predicts which branch to take
- Prediction done dynamically -- 75% accuracy, 1024 branch targets are cached
- If the prediction is invalid, there is a minimal 3 cycle mispredict penalty
- Dynamic branch prediction enables instructions to be fed to the execution core and eliminates pipeline bubbles (stalls?)
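The numbers on this slide can be combined into a rough cost estimate, and the idea of caching branch targets can be sketched in a few lines. The predictor below is a generic last-outcome table indexed by address, chosen only for illustration; it is an assumption, not a description of the K5's actual prediction hardware, and it ignores aliasing between branches that share an entry.

```python
TABLE_SIZE = 1024  # number of cached branch targets, per the slide


class BranchPredictor:
    """Last-outcome predictor with a cached target per table entry."""

    def __init__(self, size=TABLE_SIZE):
        self.size = size
        self.entries = [None] * size  # each entry: (taken, target)

    def predict(self, pc):
        """Return the predicted target, or None for 'fall through'."""
        entry = self.entries[pc % self.size]
        if entry is None:
            return None
        taken, target = entry
        return target if taken else None

    def update(self, pc, taken, target):
        """Record what the branch at `pc` actually did."""
        self.entries[pc % self.size] = (taken, target)


def mispredict_cycles_per_instruction(branch_every=7, accuracy=0.75, penalty=3):
    """Average cycles lost per instruction to mispredicted branches."""
    return (1 / branch_every) * (1 - accuracy) * penalty


if __name__ == "__main__":
    predictor = BranchPredictor()
    predictor.update(0x400, taken=True, target=0x100)
    print(hex(predictor.predict(0x400)))  # 0x100
    print(round(mispredict_cycles_per_instruction(), 3))  # 0.107
```

With one branch every 7 instructions, 75% accuracy, and a 3 cycle penalty, the estimate is 3/28, or about 0.11 lost cycles per instruction, assuming every mispredict costs exactly the minimum penalty.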

Without prediction -- time to find instruction in cache

With prediction faster throughput

Re-order buffer and register file

Reorder buffer
- Reorder buffer -- key to speculative out of order execution (issue and completion)
- Reorder buffer is used to rename registers, forward requested intermediate results, and recover from mispredictions
- Reorder buffer keeps track of the original instruction sequence and ensures that results are retired in the correct order, with results going to the register file
- If a branch is mispredicted, the results of the affected instructions are invalidated in the reorder buffer before they have any effect on the x86 registers or memory
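The retirement and recovery behaviour described above can be sketched with a small queue. The 16-entry capacity comes from the out of order execution slide; everything else (the class, the tag numbering, the `flush_after` method) is a hypothetical simplification that models results going to registers only, not memory.

```python
from collections import deque


class ReorderBuffer:
    """In-order allocate, out-of-order complete, in-order retire."""

    def __init__(self, capacity=16):
        self.capacity = capacity
        self.entries = deque()
        self.next_tag = 0

    def allocate(self, dest):
        """Reserve an entry in program order; returns its tag."""
        if len(self.entries) >= self.capacity:
            raise RuntimeError("reorder buffer full: dispatch must stall")
        tag = self.next_tag
        self.next_tag += 1
        self.entries.append({"tag": tag, "dest": dest, "value": None, "done": False})
        return tag

    def complete(self, tag, value):
        """An execution unit delivers a result, in any order."""
        for entry in self.entries:
            if entry["tag"] == tag:
                entry["value"], entry["done"] = value, True
                return
        raise KeyError(tag)

    def retire(self, register_file):
        """Commit finished entries from the head only, oldest first."""
        retired = []
        while self.entries and self.entries[0]["done"]:
            entry = self.entries.popleft()
            register_file[entry["dest"]] = entry["value"]
            retired.append(entry["tag"])
        return retired

    def flush_after(self, tag):
        """Discard everything younger than `tag` (mispredicted branch)."""
        while self.entries and self.entries[-1]["tag"] > tag:
            self.entries.pop()


if __name__ == "__main__":
    rob, regs = ReorderBuffer(), {}
    t0, t1, t2 = rob.allocate("eax"), rob.allocate("ebx"), rob.allocate("eax")
    rob.complete(t2, 30)          # youngest finishes first
    print(rob.retire(regs))       # [] because the head is not done yet
    rob.complete(t0, 10)
    print(rob.retire(regs), regs)
```

Because a result only reaches the register file at retirement, flushing the younger entries after a mispredicted branch leaves no trace in architectural state.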

Register file
- X86 architecture has a limited number of general purpose registers
- Fewer registers means frequent reuse of registers and a reduction in performance -- register renaming is used to avoid this problem
- Movement between registers and memory is unavoidable with the x86 instruction set
- K5 CPU has a single cycle load from the data cache
- Also a multi-ported register file and renaming in the reorder buffer -- near optimal speculative performance

Quote from AMD
- The right combination
- Compatibility with performance
