Alpha Microarchitecture

1 Alpha 21264 Microarchitecture
Kenneth Conley 6.893 9/14/00 Presentation slides 823 lecture note slides

2 21264 Overview
64-bit RISC processor
500-1000 MHz
7-stage pipeline
15 million transistors
2.2 V, 60 W
310 mm² (0.35 micron)
Target apps: Internet servers, data warehousing, digital video, speech recognition
G4: 4-stage pipeline, 1 FPU, 2 vector units, 2 integer units, 1 AGU, 8-entry LSU, 5 W, 10 million transistors, 32/32 I/D cache MHz, 12.6 Fp
Alpha: SpecInt95 40 at 667 MHz, 83.6 FP95…, 512MP
UltraSparcII: 5.4M transistors, 126 mm², 4-way superscalar, 4 IEU, 3 FPU, 2 GEU, 16/16 I/D cache, 3.3 and 2.6 V, 1.3 GB/s memory bandwidth, 600 MB/s sustained, in-cache 2-bit branch prediction Fp95
K7: 22 million transistors, 10-stage integer pipe, 14-stage FP, 3 pipe stages due to variable-length CISC, ’44-entry LSU’. SpecInt95: 32.9 at 750 MHz, 42.9 at 1 GHz; SpecFP95: 29.4 at 1 GHz
Pentium II: 14-stage pipeline

3 21264 Fetch Unit
4 instructions/cycle, speculative
Prediction:
Line/way predictor for each icache line (2-way, 64K)
3 branch prediction mechanisms
Local: 2-level, 10-bit history pattern predictor (e.g )
Global: history of last 12 branches, 4096 entries, 2-bit saturating
Chooser: chooses between local/global
Prediction tables: 3.6KB
Targets: 6KB
90-100% accurate on most benchmarks
Line and way predictor: % accurate
Global predictor: prediction is the MSB of the indexed prediction counter (2-bit saturating counters)
Local predictor: 2-level, 1024 entries in the first level; each 10-bit history pattern indexes 1024 second-level entries
Chooser: 4096 entries, 2-bit
Cache line = 64 bytes = 2^6; 64-bit targets = 8 bytes; 2^16/2^6 = 1024 lines = 8KB
Global: 4096 * 2 bits = 8192 bits = 1KB
Local, first level: 1024 * 10 bits ~ 1280 bytes = 1.3K
Local, second level: 1024 * 2 bits? = 2048 bits = 256 bytes = .3K
Chooser: 4096 * 2 bits = 1KB
Target hints for jumps, prefetching hint instructions, granularity hints for virtual address mapping that allow more effective use of the TLB for large contiguous structures
G4: 2 ops/cycle issue
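
The local/global/chooser arrangement described on this slide can be sketched as a small simulator. This is only an illustrative model, not the 21264's actual logic: the table sizes come from the notes above, while the PC indexing, the initial counter values, the chooser update rule, and the class and method names are all assumptions. The 2-bit width of the second-level local counters follows the notes, which mark it with a question mark.

```python
class TournamentPredictor:
    """Toy local/global/chooser branch predictor.

    Table sizes follow the slide notes. Indexing, initial values, and the
    chooser update rule are assumptions made for illustration.
    """

    def __init__(self):
        self.local_hist = [0] * 1024   # first level: 10-bit history per entry
        self.local_ctr = [1] * 1024    # second level: 2-bit counters (width per the notes' "2-bit?")
        self.global_ctr = [1] * 4096   # 2-bit counters indexed by global history
        self.chooser = [1] * 4096      # 2-bit counters; >= 2 selects global (assumption)
        self.ghist = 0                 # outcomes of the last 12 branches

    @staticmethod
    def _bump(ctr, up):
        # 2-bit saturating counter update
        return min(3, ctr + 1) if up else max(0, ctr - 1)

    @staticmethod
    def _local_index(pc):
        # Assumption: index the first-level table with word-aligned PC bits.
        return (pc >> 2) % 1024

    def predict(self, pc):
        """Return (final, local, global) predictions; True means taken."""
        hist = self.local_hist[self._local_index(pc)]
        local = self.local_ctr[hist] >= 2          # MSB of the 2-bit counter
        glob = self.global_ctr[self.ghist] >= 2    # MSB of the 2-bit counter
        use_global = self.chooser[self.ghist] >= 2
        return ((glob if use_global else local), local, glob)

    def update(self, pc, taken):
        _, local, glob = self.predict(pc)
        li = self._local_index(pc)
        hist = self.local_hist[li]
        # Train the chooser only when the two predictors disagree.
        if local != glob:
            self.chooser[self.ghist] = self._bump(
                self.chooser[self.ghist], glob == taken)
        self.local_ctr[hist] = self._bump(self.local_ctr[hist], taken)
        self.global_ctr[self.ghist] = self._bump(
            self.global_ctr[self.ghist], taken)
        # Shift the outcome into both histories.
        self.local_hist[li] = ((hist << 1) | int(taken)) & 0x3FF
        self.ghist = ((self.ghist << 1) | int(taken)) & 0xFFF

    def storage_bytes(self):
        """Storage for the prediction tables, as counted in the notes."""
        bits = 1024 * 10 + 1024 * 2 + 4096 * 2 + 4096 * 2
        return bits // 8
```

With these sizes `storage_bytes()` gives 3584 bytes, which is the 1KB + 1.3K + .3K + 1KB sum in the notes and rounds to the 3.6KB quoted for the prediction tables.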

4 21264 Dispatch and Execution
4 integer execution units (2 clusters)
Each cluster maintains a copy of the 80-entry register file
Single-cycle latency for basic integer ops
Integer population count/leading zero count
Fully pipelined multiplier
Motion Video Instructions (MVI)
2 FP execution units (1 cluster):
Upper: Multiply
Lower: Add, IEEE Divide, SQRT
72-entry RF
Duplicate 80-entry register files avoid a single huge bus as well as the 14 ports otherwise necessary: 12 for the integer pipes, 2 for outstanding loads
Single-cycle latency for copying between RFs
Small (few percent) performance loss from clustering
Problems if loads are sent to the same cluster (same for shifts)
72-entry RF for FP unit
20-entry queue for integer, 4-issue
15-entry queue for FP unit, 2-issue
No integer divide operation
128-bit multiply, scaled add/subtract
IEEE FP support: precise exceptions, NaN, infinity processing, flushing denormal results to zero
FP types: IEEE SP, IEEE DP, IEEE Extended (128-bit), VAX F_floating (32), VAX G_floating (64)
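
The cost of clustering noted on this slide, one extra cycle when a result crosses between the duplicated register files, can be made concrete with a back-of-the-envelope model. The function names and the chain-of-dependent-ops scenario are hypothetical; only the single-cycle copy latency and the single-cycle basic integer op latency come from the slide.

```python
CROSS_CLUSTER_DELAY = 1  # cycles; "single cycle latency for copying between RFs"


def operand_ready_cycle(produced_cycle, producer_cluster, consumer_cluster):
    """Cycle at which a result becomes readable in the consumer's cluster."""
    if producer_cluster == consumer_cluster:
        return produced_cycle
    return produced_cycle + CROSS_CLUSTER_DELAY


def chain_finish_cycle(clusters, op_latency=1):
    """Finish cycle of a chain of dependent integer ops.

    `clusters` lists the cluster (0 or 1) each op in the chain runs on.
    Assumes every op has the same latency and issues as soon as its single
    operand is ready (a simplification for illustration).
    """
    cycle = 0
    prev = None
    for cluster in clusters:
        if prev is None:
            start = cycle
        else:
            start = operand_ready_cycle(cycle, prev, cluster)
        cycle = start + op_latency
        prev = cluster
    return cycle
```

In this model a four-op dependent chain kept in one cluster finishes at cycle 4, while the same chain bouncing between clusters on every op finishes at cycle 7. Real code rarely ping-pongs like that, which is consistent with the slide's "few percent" overall loss.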

5 21264 Memory System
Two 64-bit data buses for icache/dcache
32 in-flight loads, 32 in-flight stores
Dcache increased to 64K (2-way), double-pumped
L2 cache: moved off-chip (increased latency by 6)
4 GB/s sustained bandwidth
Consumers of loads issue speculatively, giving a 3-cycle integer load hit latency
1.3 GB/s sustained bandwidth on McCalpin Stream
6.4 GB/s with a 400 MHz transfer rate
Support for cache prefetch
Store-conditional for MP
2-cycle cost for load misses: mini-restart instead of full speculative quash
Load hit/miss predictor: 4-bit global counter of recent loads (-2 on miss, +1 on hit)
5-cycle (instead of 3) latency for a load that hits but is predicted to miss
FP latency is longer, so no need for speculative issue
L2: 12-cycle load-to-use
L2: direct-mapped, 1-16MB, 16 bytes every 1.5 CPU cycles
Most of the McCalpin performance comes from the motherboard/crossbar setup
DS10: 1.3 GB/s peak; DS20: 5.3 GB/s (DDR)
Asus_Athlon_750_K7V Compaq_AlphaServer_DS10L_Linux Compaq_AlphaServer_DS
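
The load hit/miss predictor on this slide is small enough to model directly. The sketch below takes the counter width, the -2/+1 update, and the 3- and 5-cycle latencies from the slide. It assumes that the prediction is the counter's most significant bit and that the counter starts saturated at "hit"; the slide states neither, and all names are hypothetical.

```python
class LoadHitPredictor:
    """4-bit global saturating counter tracking recent load hits and misses."""

    def __init__(self, initial=15):
        # Assumption: start fully saturated toward "hit".
        self.counter = initial  # range 0..15

    def predict_hit(self):
        # Assumption: the prediction is the counter's most significant bit.
        return self.counter >= 8

    def update(self, hit):
        # Per the slide: +1 on a hit, -2 on a miss, saturating at both ends.
        if hit:
            self.counter = min(15, self.counter + 1)
        else:
            self.counter = max(0, self.counter - 2)


def integer_load_hit_latency(predicted_hit):
    """Latency in cycles of an integer load that actually hits in the Dcache."""
    # Consumers issue speculatively only when a hit is predicted, which gives
    # the 3-cycle case; otherwise they wait and the load costs 5 cycles.
    return 3 if predicted_hit else 5
```

Because a miss subtracts twice what a hit adds, a run of misses turns speculation off quickly (four misses from saturation under these assumptions), and it takes hits to earn it back.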

6 Out-of-order execution
User-visible registers: 32 int/32 float
Renaming registers: 41 int/41 float
Renaming map data saved for precise exception handling
80-instruction in-flight window, in-order retirement
Loads can speculatively bypass stores
Store wait bits for mis-speculation
Store wait bits periodically cleared
RAW hazards detected after instructions issue
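
The store wait bits mentioned on this slide can be pictured as a small table of one-bit flags: a load that was caught bypassing an older store to the same address is marked, so that on later executions it waits for earlier stores, and the whole table is wiped from time to time so stale marks do not linger. The following is a rough sketch of that idea only. The table size, the clearing interval, the PC indexing, and every name here are placeholder assumptions, not figures from the slides.

```python
class StoreWaitTable:
    """One-bit-per-entry table marking loads that should not bypass stores."""

    def __init__(self, entries=1024, clear_interval=16384):
        # Both parameters are hypothetical placeholders.
        self.bits = [False] * entries
        self.clear_interval = clear_interval
        self.cycle = 0

    def _index(self, load_pc):
        # Assumption: index with word-aligned PC bits of the load.
        return (load_pc >> 2) % len(self.bits)

    def must_wait(self, load_pc):
        """True if this load should wait for earlier stores before issuing."""
        return self.bits[self._index(load_pc)]

    def record_violation(self, load_pc):
        """Call when a RAW hazard is detected after the load has issued."""
        self.bits[self._index(load_pc)] = True

    def tick(self, cycles=1):
        """Advance time; clear every bit each time an interval boundary passes."""
        before = self.cycle // self.clear_interval
        self.cycle += cycles
        if self.cycle // self.clear_interval != before:
            self.bits = [False] * len(self.bits)
```

A load starts out free to bypass stores, gets marked after its first detected violation, and becomes free again after the next periodic clear.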

8 21264 Prediction Mechanisms

9 21264 Execution Units

