Published by Melvin Lewis. Modified over 8 years ago.
1
Cache Pipelining with Partial Operand Knowledge
Erika Gunadi and Mikko H. Lipasti
Department of Electrical and Computer Engineering
University of Wisconsin-Madison
http://www.ece.wisc.edu/~pharm
2
Cache Power Consumption
Increasing on-chip cache size
  Increasing cache power consumption
Increasing clock frequency
  Increasing dynamic power
Lots of prior work to reduce cache power consumption
3
Prior Work
Cache subbanking, bitline segmentation [Su et al. 1995, Ghose et al. 2001]
Cache decomposition [Huang et al. 2001]
Block buffering [Su et al. 1995]
Reducing leakage power
  Drowsy caches [Flautner et al. 2002, Kim et al. 2002]
  Cache decay [Kaxiras et al. 2001]
  Gated Vdd [Powell et al. 2000]
4
Cache Subbanking
Proposed by Su et al. 1995
  Fetches only the requested subline
  Partitions the data array vertically into several subbanks
Further study by Ghose et al. 2001
  Partitions the data array vertically and horizontally
  Activates only the requested subbanks
5
Bit-sliced ALU
Originally proposed by Hsu et al. 1985
Slices addition operations
  e.g., a 32-bit addition becomes four 8-bit additions
Avoids waiting for the full-width addition
Bypasses partial operand results
Successfully implemented in the Pentium 4 staggered adder
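The slicing idea can be sketched in software. This is only an illustration of how a 32-bit addition decomposes into four 8-bit additions with a carry passed between slices; it is not the hardware design from the talk. The function name `sliced_add` and its parameters are hypothetical.

```python
def sliced_add(a, b, width=32, slice_bits=8):
    """Add two unsigned integers one slice at a time, low-order slice first.

    Returns the wrapped full-width sum and the list of per-slice results
    in the order they become available (lowest slice first).
    """
    mask = (1 << slice_bits) - 1
    carry = 0
    partials = []
    for shift in range(0, width, slice_bits):
        # Each slice needs only its own operand bits and the incoming carry.
        s = ((a >> shift) & mask) + ((b >> shift) & mask) + carry
        partials.append(s & mask)
        carry = s >> slice_bits
    # Reassemble the slices into the full-width result.
    result = 0
    for k, p in enumerate(partials):
        result |= p << (k * slice_bits)
    return result, partials


total, slices = sliced_add(0x0000FFFF, 1)
print(hex(total), [hex(s) for s in slices])
```

The low-order slice is complete after the first step, which is the property a dependent operation can exploit before the full-width sum is known.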
6
Outline
Motivation
Prior Work
Bit-sliced Cache
Experiment Results
Conclusion
7
Power Consumption in Cache
Row decoding consumes up to 40% of active power
8
Bit-sliced Cache
Extends the cache subbanking technique
Saves decoding power
  Enables only the row decoders that are accessed
Serializes subarray decoding with row decoding
  Uses low-order index bits to select the row decoder
Minimal changes to the subbanking technique
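One way to picture the serialized decode is as a split of the set index into a subarray select and a row select. The slides do not give the exact bit assignment, so the layout below is an assumption: the low-order index bits pick the subarray (and so the one row decoder to enable), and the remaining index bits pick the row inside it. The function name, the default geometry, and the count of four subarrays are all hypothetical, and all sizes are assumed to be powers of two.

```python
def split_address(addr, block_bytes=64, num_sets=32, num_subarrays=4):
    """Split an address into (tag, subarray, row) under the assumed layout."""
    offset_bits = block_bytes.bit_length() - 1
    index_bits = num_sets.bit_length() - 1
    sub_bits = num_subarrays.bit_length() - 1

    index = (addr >> offset_bits) & (num_sets - 1)
    subarray = index & (num_subarrays - 1)  # low-order index bits: which row decoder to enable
    row = index >> sub_bits                 # remaining index bits: row within that subarray
    tag = addr >> (offset_bits + index_bits)
    return tag, subarray, row


print(split_address(0x12F4))
```

Under this layout the subarray is known as soon as the low-order address bits are available, so only one row decoder has to be powered for the access.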
9
Pipelining the Cache Access
Cache access time increases due to:
  Serializing the subarray decoder with the row decoder
Pipeline the access to hide the delay
  Need to balance the latency of each stage
  Choose the operations for each stage carefully
Provides more throughput
  Same throughput as a conventional cache with n ports
10
Pipelined-Cache’s Access Steps
Cycle 1
  Start subarray decoding for data and tag
Cycle 2
  Activate the necessary row decoders
  Read the tag array while waiting
Cycle 3
  Read the data array
  Concurrently, do a partial tag comparison
Cycle 4
  Compare the rest of the tag bits
  Use the tag comparison result to select data
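The two-step tag check in cycles 3 and 4 can be sketched as follows. The slides do not say which tag bits are compared first; this sketch assumes the low-order bits are checked in the partial comparison and the remaining bits afterwards. The function name and the example tags are hypothetical.

```python
def two_step_tag_match(stored_tags, lookup_tag, partial_bits):
    """Return (ways surviving the partial compare, hitting way or None)."""
    mask = (1 << partial_bits) - 1

    # Cycle 3 (assumed): compare only the low-order tag bits of every way.
    candidates = [way for way, tag in enumerate(stored_tags)
                  if (tag & mask) == (lookup_tag & mask)]

    # Cycle 4: compare the remaining tag bits, but only for surviving ways.
    rest = lookup_tag >> partial_bits
    hits = [way for way in candidates
            if (stored_tags[way] >> partial_bits) == rest]

    return candidates, (hits[0] if hits else None)


ways = [0x1A3, 0x2A3, 0x1B4, 0x0FF]
print(two_step_tag_match(ways, 0x2A3, partial_bits=8))
```

A way that fails the partial comparison can never hit, so the second step only has to resolve the ways that matched on the first slice of the tag.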
11
Bit-sliced Cache
12
Bit-sliced Cache + Bit-sliced ALU
Optimal performance benefit
Cache access starts sooner
  As soon as the first slice is available
Limited number of subarrays
  According to the number of bits per slice
When the bit slice is too small
  Unable to achieve optimal power saving
13
Pipelining with Bit-sliced Cache
Instruction sequence shown in each timing diagram:
  add R3, R2, R1
  addi R3, R3, 4
  lw R1, 0(R3)
  lw R4, 4(R3)
Diagrams:
  Pipelined Execution Stage with Pipelined Cache
  Bit-sliced Execution Stage with Pipelined Cache
  Bit-sliced Execution Stage with Bit-sliced Cache
14
Cache Model Simulation
Estimates energy consumption and cache latency
Uses a modified version of CACTI 3.0
  Parameters: Ntbl, Ndbl, Ntwl, Ndwl
  Enumerates all possible configurations
  Chooses the one with the best weighted value (cycle time and energy consumption)
Simulates:
  Various cache sizes (8 KB to 512 KB), 64 B blocks
  DM, 2-way, 4-way, and 8-way
Uses 0.18 µm technology
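The enumerate-and-choose step can be sketched as an exhaustive search over the four organization parameters. This is not CACTI's code: the candidate values, the weighted-sum formula, and `toy_model` are all hypothetical stand-ins, since the slides give neither the cost model nor the weights.

```python
from itertools import product


def best_config(evaluate, choices=(1, 2, 4, 8), w_time=0.5, w_energy=0.5):
    """Try every (Ndwl, Ndbl, Ntwl, Ntbl) combination and keep the lowest score.

    `evaluate` must return (cycle_time, energy) for one configuration.
    The weighted sum used as the score is an assumption for illustration.
    """
    best = None
    for ndwl, ndbl, ntwl, ntbl in product(choices, repeat=4):
        cycle_time, energy = evaluate(ndwl, ndbl, ntwl, ntbl)
        score = w_time * cycle_time + w_energy * energy
        if best is None or score < best[0]:
            best = (score, (ndwl, ndbl, ntwl, ntbl))
    return best


def toy_model(ndwl, ndbl, ntwl, ntbl):
    """Hypothetical cost model: more partitions shorten the cycle but cost energy."""
    parts = ndwl * ndbl + ntwl * ntbl
    cycle_time = 1.0 + 8.0 / parts
    energy = 1.0 + 0.1 * parts
    return cycle_time, energy


score, config = best_config(toy_model)
print(config, round(score, 4))
```

Changing `w_time` and `w_energy` shifts the chosen organization toward faster or lower-energy configurations, which is the trade-off the weighted value is meant to capture.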
15
Processor Simulation
Estimates performance benefit
Uses a heavily modified SimpleScalar 3.0
  Supports bit-sliced execution stage
  Supports speculative slice execution
Benchmarks
  Eight SPEC2000 integer benchmarks
  Full reference input set
  Fast forward 500M, simulate 100M
16
Machine Configuration
4-wide fetch, issue, commit
128-entry ROB
32-entry scheduler
20-stage pipeline
64K-entry gshare
L1 I-Cache: 32KB, 2-way, 64B block
L1 D-Cache: 8KB, 4-way, 64B block
L2 Cache: 512KB, 8-way, 128B block
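As a quick check of what these cache parameters imply, the number of sets and the widths of the index and offset fields follow from size, associativity, and block size. The helper name `cache_geometry` is hypothetical, and sizes are assumed to be powers of two.

```python
def cache_geometry(size_bytes, ways, block_bytes):
    """Return (sets, index_bits, offset_bits) for a set-associative cache."""
    sets = size_bytes // (ways * block_bytes)
    offset_bits = block_bytes.bit_length() - 1
    index_bits = sets.bit_length() - 1
    return sets, index_bits, offset_bits


KB = 1024
for name, cfg in [("L1 I-Cache", (32 * KB, 2, 64)),
                  ("L1 D-Cache", (8 * KB, 4, 64)),
                  ("L2 Cache", (512 * KB, 8, 128))]:
    print(name, cache_geometry(*cfg))
```

For the L1 D-cache this gives 32 sets, so only 5 index bits are available to divide between subarray selection and row selection.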
17
Energy Consumption / Access
18
Cycle Time Comparison
19
Speed Up Comparison
20
Speed Up Comparison
21
Conclusion
Bit-sliced cache
  Achieves significant power reduction
  Without adding much complexity
  Adds some delay to access latency
Pipelined bit-sliced cache
  Reduces cycle time
  Provides more bandwidth
  Measurable speed up (w/ bit-sliced ALU)
22
Questions?
Thank you