1
CS 7960-4 Lecture 25
Wire Delay is not a Problem for SMT
Z. Chishti, T.N. Vijaykumar
Proceedings of ISCA-31, June 2004
2
Prior Results
Hrishikesh et al. [ISCA’02]: Optimal pipeline depth is 6-8 FO4 at 100nm technology
Agarwal et al. [ISCA’00]: IPCs will decrease dramatically due to wire delays
Goals:
  How does pipeline depth vary with technology?
  How does SMT influence throughput and pipeline depth?
  Identify and alleviate bottlenecks (bandwidth)
3
Critical Loops
4
Back-to-Back Instructions
The loop lengths determine the delay between back-to-back dependent instructions
Some loops can be optimized with aggressive designs (rename, ALUs)
Difficult loops: cache access, branch prediction
5
Superscalar vs. SMT
For superscalars: deep pipelining → more overhead in each loop → more delay between back-to-back (b2b) instrs → performance loss
For SMT, slowing a dependence chain is not a problem: other threads supply useful work
Deep pipelines can benefit SMT, since SMT affords more parallelism. How do you build deep pipelines?
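A minimal sketch of the argument above, not the paper's simulator. It assumes each thread is one serial dependence chain, that a dependent instruction can issue only every `loop_latency` cycles, and that threads share a hypothetical `issue_width`. All names here are illustrative.

```python
def chain_ipc(loop_latency: int, num_threads: int, issue_width: int = 1) -> float:
    """Steady-state IPC for a toy machine where every thread is one dependence chain.

    Each thread can issue one instruction every `loop_latency` cycles, so
    `num_threads` threads together offer num_threads / loop_latency
    instructions per cycle, capped by the issue width.
    """
    if loop_latency < 1 or num_threads < 1 or issue_width < 1:
        raise ValueError("all parameters must be positive integers")
    return min(float(issue_width), num_threads / loop_latency)


# One thread, loop stretched from 1 to 4 cycles by deeper pipelining:
single_shallow = chain_ipc(loop_latency=1, num_threads=1)  # 1.0
single_deep = chain_ipc(loop_latency=4, num_threads=1)     # 0.25

# Four threads hide the same 4-cycle loop completely:
smt_deep = chain_ipc(loop_latency=4, num_threads=4)        # 1.0
```

In this toy model a longer loop costs a single thread throughput in direct proportion, while enough threads restore full issue bandwidth. Real workloads have some ILP within each thread, so the single-thread loss is less severe than shown.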
6
Wire Delays and Bandwidth
Wire delays can limit bandwidth in RAMs/CAMs: they control the delay between successive accesses
Bitline signals are weak, so a latch can be introduced only after the sense-amp
7
Bitline-Scaling
[Figure: array access path; labeled stages include decode and mux + output driver]
Latency-optimized: low bandwidth, low latency
Bitline-scaled: high bandwidth, high latency
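A rough sketch of the latency/bandwidth trade-off above. The model assumes the stretch up to the sense-amp cannot be latched, so it alone sets the minimum spacing between accesses. The class, its fields, and every FO4 number below are made up for illustration and are not taken from the paper.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ArrayDesign:
    """Toy RAM/CAM timing model; all delays are in FO4 units (hypothetical values)."""
    name: str
    latency_fo4: float        # total access time
    bitline_stage_fo4: float  # unlatchable stretch up to the sense-amp

    def latency_cycles(self, cycle_fo4: float) -> int:
        # Whole clock cycles needed for one access.
        return math.ceil(self.latency_fo4 / cycle_fo4)

    def accesses_per_cycle(self, cycle_fo4: float) -> float:
        # A new access can start only after the bitline stage is free again.
        busy_cycles = max(1, math.ceil(self.bitline_stage_fo4 / cycle_fo4))
        return 1.0 / busy_cycles


# Illustrative numbers only: shorter bitlines cut the unlatchable stage
# but add delay elsewhere in the access path.
latency_optimized = ArrayDesign("latency-optimized", latency_fo4=24, bitline_stage_fo4=16)
bitline_scaled = ArrayDesign("bitline-scaled", latency_fo4=30, bitline_stage_fo4=6)

for design in (latency_optimized, bitline_scaled):
    print(design.name,
          design.latency_cycles(cycle_fo4=8),
          design.accesses_per_cycle(cycle_fo4=8))
```

With an 8 FO4 clock, these made-up numbers give the latency-optimized array 3 cycles of latency at 0.5 accesses per cycle, and the bitline-scaled array 4 cycles at 1 access per cycle.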
8
Delay Results
9
Examining Deep Pipelines
Bitline-scaling allows high bandwidth → enables deep pipelining (high parallelism, longer chains)
Range of implementations:
  b2b: aggressive design that allows instrs to issue back-to-back in spite of long loops
  nb2b: low-complexity design that can severely limit single-thread ILP
10
Effect of Wire Delays on IPC
Assumes that all structures are perfectly pipelined
11
Effect of Technology on Pipeline Depth
For a single thread, as we move from 100nm to 50nm, optimal depth goes from 8 to 10 FO4 (nb2b) and from 6 to 8 FO4 (b2b)
Multiprogrammed workload remains at 8 FO4 (nb2b) and 6 FO4 (b2b)
Multiprogramming lets you keep up with Moore’s Law
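A back-of-the-envelope sketch of what these depths mean for clock rate. It assumes the commonly quoted rule of thumb that one FO4 delay is about 360 ps per micron of drawn gate length, and it ignores latch and skew overhead by default. Both are assumptions of this sketch, not figures from the slides, and the function names are illustrative.

```python
import math

FO4_PS_PER_MICRON = 360.0  # rule-of-thumb assumption, not taken from the slides


def fo4_delay_ps(gate_length_nm: float) -> float:
    """Approximate delay of one FO4 inverter at the given technology node."""
    return FO4_PS_PER_MICRON * (gate_length_nm / 1000.0)


def clock_period_ps(depth_fo4: float, gate_length_nm: float,
                    overhead_fo4: float = 0.0) -> float:
    """Clock period for a stage of `depth_fo4` logic plus optional latch overhead."""
    return (depth_fo4 + overhead_fo4) * fo4_delay_ps(gate_length_nm)


def clock_speedup(depth_old: float, node_old_nm: float,
                  depth_new: float, node_new_nm: float) -> float:
    """Ratio of new clock frequency to old clock frequency."""
    return clock_period_ps(depth_old, node_old_nm) / clock_period_ps(depth_new, node_new_nm)


# nb2b, 100nm to 50nm, using the optimal depths quoted on the slide.
single_thread = clock_speedup(8, 100, 10, 50)    # depth grows from 8 to 10 FO4
multiprogrammed = clock_speedup(8, 100, 8, 50)   # depth stays at 8 FO4

print(f"single thread: {single_thread:.2f}x, multiprogrammed: {multiprogrammed:.2f}x")
```

Under these assumptions the single-thread clock improves by about 1.6x while the multiprogrammed clock improves by 2x. This compares clock rate only, not delivered performance.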
12
Effect of Bandwidth Constraints
Perfect has the latency of latency-optimized (l-o) and the bandwidth of bitline-scaled
l-o does well for a single thread, but very poorly for five threads
13
Conclusions
For superscalars, the optimal logic depth will grow because of longer wire delays and lack of parallelism
SMT is unaffected: it has parallelism to offset back-to-back inefficiencies
SMT meets Moore’s Law expectations by increasing the number of threads
SMT has high bandwidth needs; solution: bitline-scaling