1
CS 7810 Lecture 3
Clock Rate vs. IPC: The End of the Road for Conventional Microarchitectures
V. Agarwal, M.S. Hrishikesh, S.W. Keckler, D. Burger
UT-Austin, ISCA ’00
2
Previous Papers
Limits of ILP – it is probably worth doing out-of-order (o-o-o) superscalar
Complexity-Effective – wire delays make the implementations harder and increase latencies
Today’s paper – these latencies severely impact IPCs and slow the growth in processor performance
3
1995-2000
4
Clock speed has improved by 50% every year
Reduction in logic delays
Deeper pipelines
This will soon end
IPC has gone up dramatically (the increased complexity was worth it)
Will this end too?
5
Wire Scaling
Multiple wire layers – the SIA roadmap predicts dimensions (somewhat aggressive)
As transistor widths shrink, wires become thinner, and their resistance per unit length goes up (quadratically – Table 1)
Parallel-plate capacitance reduces, but coupling capacitance increases (slight overall increase)
The equations are different, but the end result is similar to Palacharla’s (without repeaters)
6
Wire Scaling
7
With repeaters, the delay of a fixed-length wire does not go up quadratically as we shrink gate width
In going from 250nm to 35nm:
                          250nm      35nm
5mm wire delay            170ps      390ps
delay to cross X gates    170ps      55ps
SIA clock speed           0.75GHz    13.5GHz
delay to cross X gates    0.13 cyc   0.75 cyc
We could increase wire width, but that compromises bandwidth
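The cycle counts on this slide follow from dividing each delay by the clock period. A minimal sketch of that arithmetic, using only the numbers quoted on the slide (the function and variable names are illustrative, not from the paper):

```python
def delay_in_cycles(delay_ps: float, clock_ghz: float) -> float:
    """Convert an absolute delay in picoseconds into clock cycles."""
    period_ps = 1000.0 / clock_ghz  # a 1 GHz clock has a 1000 ps period
    return delay_ps / period_ps


# Delay to cross X gates, at the SIA clock for each technology node.
gates_250nm = delay_in_cycles(170, 0.75)
gates_35nm = delay_in_cycles(55, 13.5)

# The same conversion applied to the 5mm wire delays.
wire_250nm = delay_in_cycles(170, 0.75)
wire_35nm = delay_in_cycles(390, 13.5)

print(f"X gates: {gates_250nm:.2f} -> {gates_35nm:.2f} cycles")
print(f"5mm wire: {wire_250nm:.2f} -> {wire_35nm:.2f} cycles")
```

Applied to the wire row, the same conversion gives a 5mm wire growing from a small fraction of a cycle to roughly five cycles, which is the sense in which wires, rather than gates, come to dominate.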
8
Clock Scaling
Logic delay (the FO4 delay) scales linearly with gate length
Likewise, work per pipeline stage has also been shrinking
The SIA predicts that today’s 16 FO4/stage delay will shrink to 5.6 FO4/stage
A 64-bit add takes 5.5 FO4 – hence, they examine SIA (super-aggressive), 8-FO4 (aggressive), and 16-FO4 (conservative) scaling strategies
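To see how an FO4 budget per stage turns into a clock rate, here is a rough sketch. It assumes the common rule of thumb that one FO4 delay is about 360 ps per micron of drawn gate length; that constant is an assumption brought in for illustration and is not stated on the slide, and the function names are hypothetical.

```python
# Assumed rule of thumb (not from the slide): FO4 delay ~ 360 ps per micron
# of drawn gate length, so it scales linearly with gate length.
FO4_PS_PER_MICRON = 360.0


def fo4_delay_ps(gate_length_nm: float) -> float:
    """Estimated delay of one fanout-of-4 inverter, in picoseconds."""
    return FO4_PS_PER_MICRON * gate_length_nm / 1000.0


def clock_ghz(gate_length_nm: float, fo4_per_stage: float) -> float:
    """Clock rate when each pipeline stage is allotted fo4_per_stage FO4 delays."""
    period_ps = fo4_per_stage * fo4_delay_ps(gate_length_nm)
    return 1000.0 / period_ps


# The three scaling strategies named on the slide, evaluated at 35nm.
for label, fo4 in [("SIA", 5.6), ("8-FO4", 8.0), ("16-FO4", 16.0)]:
    print(f"{label:7s} {clock_ghz(35, fo4):5.1f} GHz")
```

Under this assumed constant the SIA strategy lands near 14 GHz at 35nm, in the same range as the 13.5GHz SIA figure quoted earlier in the deck, and halving the FO4 budget per stage doubles the clock.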
9
Clock Scaling
10
While the 15-20% improvement in technology scaling will continue, the 15-20% improvement in pipeline depth will cease
11
On-Chip Wire Delays
The number of bits reachable in a cycle is shrinking (by more than a factor of two across three generations)
Structures that fit in a cycle today will have to be shrunk (smaller regfiles, issue queues)
Chip area is steadily increasing
Less than 1% of the chip reachable in a cycle, 30 cycles to go across the chip!
Processors are becoming communication-bound
12
Processor Structure Delays
To model the microarchitecture, they estimate the delays of all wire-limited structures
Structure                   f_SIA   f_8   f_16
64K 2-port L1                 7      5     3
64-entry 10-port regfile      3      2     1
20-entry 8-port issueq        3      2     1
64-entry 8-port ROB           3      2     1
Weakness: bypass delays are not considered
13
Microarchitecture Scaling
Capacity Scaling: constant access latencies in cycles (simpler designs), scale capacities down to make it fit
Pipeline Scaling: constant capacities, latencies go up, hence, deeper pipelines
Any other approaches?
14
Microarchitecture Scaling
Capacity Scaling: constant access latencies in cycles (simpler designs), scale capacities down to make it fit
Pipeline Scaling: constant capacities, latencies go up, hence, deeper pipelines
Replicated Capacity Scaling: fast core with few resources, but lots of them – high IPC if you can localize communication
15
IPC Comparisons
[Figure: block diagrams of the Pipeline Scaling, Capacity Scaling, and Replicated Capacity Scaling configurations. Labels in the diagram: 20-IQ, 40 Regs, four F units; 2-cycle wakeup, 2-cycle regread, 2-cycle bypass; 15-IQ, 30 Regs, three F units.]
16
Methodology
17
Results
18
Every instruction experiences longer latencies
IPCs are much lower for aggressive clocks
Overall performance is still comparable for all approaches
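Performance is the product of IPC and clock rate, so a faster clock only pays off if IPC does not fall by more than the clock gains. A small sketch of that break-even point, taking clock rate as inversely proportional to the FO4 budget per stage (the function name is hypothetical, and no measured IPC values are used):

```python
def break_even_ipc_fraction(fo4_slow: float, fo4_fast: float) -> float:
    """Fraction of the slower-clocked design's IPC that the faster-clocked
    design must retain to deliver the same performance.

    Assumes performance = IPC * clock rate and clock rate ~ 1 / (FO4 per stage).
    """
    clock_speedup = fo4_slow / fo4_fast
    return 1.0 / clock_speedup


# Moving from the 16-FO4 clock to the 8-FO4 and SIA (5.6 FO4) clocks.
to_8fo4 = break_even_ipc_fraction(16.0, 8.0)
to_sia = break_even_ipc_fraction(16.0, 5.6)

print(f"8-FO4 must keep at least {to_8fo4:.0%} of the 16-FO4 IPC")
print(f"SIA must keep at least {to_sia:.0%} of the 16-FO4 IPC")
```

If the aggressive clocks end up with comparable overall performance, their IPC must have dropped to roughly these fractions, which matches the slide's point that IPCs are much lower for aggressive clocks.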
19
Results
Over 17 years, we see only a 7-fold speedup (at the historical growth rate, it would have been 1720-fold) – an annual increase of 12.5%
Growth is slow because the increases in pipeline depth and IPC will stagnate
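The 17-year figures are compound-growth arithmetic, and it is easy to check how the annual rates and total speedups relate. A minimal sketch (function names are illustrative):

```python
def total_speedup(annual_rate: float, years: int) -> float:
    """Cumulative speedup after compounding annual_rate for the given years."""
    return (1.0 + annual_rate) ** years


def implied_annual_rate(speedup: float, years: int) -> float:
    """Annual growth rate that compounds to the given total speedup."""
    return speedup ** (1.0 / years) - 1.0


YEARS = 17

# 12.5% per year, compounded over 17 years.
projected = total_speedup(0.125, YEARS)

# Annual rate implied by the 1720-fold historical figure.
historical_rate = implied_annual_rate(1720, YEARS)

print(f"12.5%/year over {YEARS} years: {projected:.1f}x")
print(f"1720x over {YEARS} years: {historical_rate:.1%} per year")
```

Compounding 12.5% per year gives a little over 7x, consistent with the slide after rounding, while the 1720-fold figure corresponds to an annual rate of about 55%.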
20
Questionable Assumptions
Additional transistors are not being used to improve IPC
All instructions pay wire-delay penalties
21
Conclusions
Large monolithic cores will perform poorly – microarchitectures will have to be partitioned
On-chip caches will be the biggest bottlenecks – 3-cycle 0.5KB L1s, 30-50-cycle 2MB L2s
Future proposals should be wire-delay-sensitive
22
Next Class’s Paper
“Dynamic Code Partitioning for Clustered Architectures”, UPC-Barcelona, 2001
Instruction steering heuristics to balance load and minimize communication