Better answers The Alpha and Microprocessors: Continuing the Performance Lead Beyond Y2K Shubu Mukherjee, Ph.D. Principal Hardware Engineer VSSAD Labs, Alpha Development Group Compaq Computer Corporation Shrewsbury, Massachusetts Slides: 1998 Microprocessor Forum (Peter Bannon) and 1999 Microprocessor Forum (Joel Emer)
Better answers Higher Performance Lower Cost EV EV m EV m 0.18 m 21364EV EV m Alpha Microprocessor Roadmap 21364EV m First System Ship
Better answers Alpha Microprocessor Architectural Features First “Out-of-Order” Alpha First “Out-of-Order” Alpha Four-wide superscalar Four-wide superscalar … Performance World’s Fastest Microprocessor ( 11/17/99) World’s Fastest Microprocessor ( 11/17/99) 39 SPECINT95, Mhz 39 SPECINT95, Mhz – Intel Pentium 733 Mhz delivers 36 SPECINT95, 30 SPECFP95
Better answers Higher Performance Lower Cost EV EV m EV m 0.18 m 21364EV EV m Alpha Microprocessor Roadmap 21364EV m First System Ship
Better answers Alpha Goals Leadership single stream performance Higher operating frequency Higher operating frequency Integrated memory interface Integrated memory interface Leadership multiprocessor performance Integrated system / multiprocessor interface Integrated system / multiprocessor interface
Better answers Alpha Features System-on-a-Chip Alpha core with enhancements Alpha core with enhancements Integrated L2 Cache Integrated L2 Cache Integrated memory controller Integrated memory controller Integrated network interface Integrated network interface Fault-Tolerance Support for lock-step operation to enable high- availability systems. Support for lock-step operation to enable high- availability systems.
Better answers Memory Controller RAMBUSRAMBUS Chip Block Diagram Core 16 L1 Miss Buffers L2 Cache Address Out Address In Network Interface N S E W I/O 16 L1 Victim Buf 16 L2 Victim Buf 64K Icache 64K Dcache
Better answers Int Reg Map Branch Predictors Core FETCH MAP QUEUE REG EXEC DCACHE Stage: L2 cache1.5MB 6-Set Int Issue Queue (20) Exec 4 Instructions / cycle Reg File (80 ) Victim Buffer L1 Data Cache 64KB 2-Set FP Reg Map FP ADD Div/Sqrt FP MUL Addr 80 in-flight instructions plus 32 loads and 32 stores Addr Miss Address Next-Line Address L1 Ins. Cache 64KB 2-Set Exec Reg File (80 ) FP Issue Queue (15) Reg File (72 )
Better answers Integrated L2 Cache 1.5 MB 6-way set associative 16 GB/s total read/write bandwidth 16 Victim buffers for L1 -> L2 16 Victim buffers for L2 -> Memory ECC SECDED code 12ns load to use latency
Better answers Integrated Memory Controller Direct RAMbus High data capacity per pin High data capacity per pin 800 MHz operation 800 MHz operation 30ns CAS latency pin to pin 30ns CAS latency pin to pin 6 GB/sec read or write bandwidth 100s of open pages Directory based cache coherence ECC SECDED
Better answers Integrated Network Interface Direct processor-to-processor interconnect 10 GB/second per processor 15ns processor-to-processor latency Out-of-order network with adaptive routing Asynchronous clocking between processors 3 GB/second I/O interface per processor
Better answers System Block Diagram 364 M IO 364 M IO 364 M IO 364 M IO 364 M IO 364 M IO 364 M IO 364 M IO 364 M IO 364 M IO 364 M IO 364 M IO
Better answers Alpha Technology 0.18 m CMOS MHz volts 3.5 cm 2 6 Layer Metal 100 million transistors 8 million logic 8 million logic 92 million RAM 92 million RAM
Better answers Alpha Status 70 SPECint95 (estimated) 120 SPECfp95 (estimated) RTL model running Tapeout: Summer 2000
Better answers Summary: System on a Chip Integrated L2 cache and memory controller outstanding single processor performance outstanding single processor performance Integrated network interface high performance multi-processor systems high performance multi-processor systems scales to large number of processors scales to large number of processors
Better answers Higher Performance Lower Cost EV EV m EV m 0.18 m 21364EV EV m Alpha Microprocessor Overview 21364EV m First System Ship
Better answers Alpha Goals Leadership single stream performance Higher operating frequency / better technology Higher operating frequency / better technology New microarchitecture New microarchitecture Integrated memory interface (like 21364) Integrated memory interface (like 21364) Leadership multiprocessor performance Simultaneous Multithreading (with minimal change/cost) Simultaneous Multithreading (with minimal change/cost) Integrated system / multiprocessor interface (like 21364) Integrated system / multiprocessor interface (like 21364)
Better answers Alpha Technology Overview Leading edge process technology – GHz 0.125µm CMOS 0.125µm CMOS SOI-compatible SOI-compatible Cu interconnect Cu interconnect low-k dielectrics low-k dielectrics Chip characteristics ~1.2V Vdd ~1.2V Vdd ~250 Million transistors ~250 Million transistors
Better answers Alpha Architecture Overview Enhanced out-of-order execution 8-wide superscalar Large on-chip L2 cache Direct RAMBUS interface On-chip router for system interconnect Glueless, directory-based, ccNUMA for up to 512-way multiprocessing for up to 512-way multiprocessing 4-way simultaneous multithreading (SMT)
Better answers Instruction Issue Reduced function unit utilization due to dependencies Time
Better answers Superscalar Issue Superscalar leads to more performance, but lower utilization Time
Better answers Predicated Issue Adds to function unit utilization, but results are thrown away Time
Better answers Chip Multiprocessor Limited utilization when only running one thread Time
Better answers Fine Grained Multithreading Intra-thread dependencies still limit performance Time
Better answers Simultaneous Multithreading Maximum utilization of function units by independent operations Time
Better answers Basic Out-of-order Pipeline Fetch Decode/ Map Queue Reg Read ExecuteDcache/ Store Buffer Reg Write Retire PC Icache Register Map Dcache Regs Thread-blind
Better answers SMT Pipeline Fetch Decode/ Map Queue Reg Read ExecuteDcache/ Store Buffer Reg Write Retire Icache Dcache PC Register Map Regs
Better answers Changes for SMT Basic pipeline – unchanged Replicated resources Program counters Program counters Register maps Register maps Shared resources Register file (size increased) Register file (size increased) Instruction queue Instruction queue First and second level caches First and second level caches Translation buffers Translation buffers Branch predictor Branch predictor
Better answers Multiprogrammed workload
Better answers Decomposed SPEC95 Applications
Better answers Multithreaded Applications
Better answers Architectural Abstraction 1 Processor with 4 Thread Processing Units (TPUs) Shared hardware resources TPU 0TPU1TPU2TPU3 IcacheTLBDcache Scache
Better answers System Block Diagram EV8 MIO EV8 MIO EV8 MIO EV8 MIO EV8 MIO EV8 MIO EV8 MIO EV8 MIO EV8 MIO 0123
Better answers Alpha Summary Leadership single stream performance Higher operating frequency / better technology Higher operating frequency / better technology New microarchitecture New microarchitecture Integrated memory interface (like 21364) Integrated memory interface (like 21364) Leadership multiprocessor performance Simultaneous Multithreading (with minimal changes/cost) Simultaneous Multithreading (with minimal changes/cost) Integrated system / multiprocessor interface (like 21364) Integrated system / multiprocessor interface (like 21364)
Better answers Alpha Reuses microprocessor core Reuses microprocessor core System on a chip System on a chip Alpha New microarchitecture New microarchitecture System on a chip System on a chip Simultaneous Multithreading Simultaneous Multithreading Maintain Performance Lead Beyond Y2K
Better answers My Current Research: Beyond 21464? The Truth Project (w/ Joel Emer) Examines different microarchitectural issues Examines different microarchitectural issues The Multinet Project (w/ Rick Kessler) Tightly-coupled multiprocessor networks Tightly-coupled multiprocessor networks The Reliant Project (w/ Steve Reinhardt) Self-Checking Microprocessors using SMT, ISCA submission Self-Checking Microprocessors using SMT, ISCA submission Asim (w/ VSSAD Labs) Performance Model for Alphas beyond Performance Model for Alphas beyond 21464