Cache Pipelining with Partial Operand Knowledge Erika Gunadi and Mikko H. Lipasti Department of Electrical and Computer Engineering University of Wisconsin—Madison

2 Cache Power Consumption Increasing on-chip cache size → increasing cache power consumption. Increasing clock frequency → increasing dynamic power. There is a large body of prior work on reducing cache power consumption.

3 Prior Work Reducing dynamic power: cache subbanking and bitline segmentation [Su et al. 1995, Ghose et al. 2001]; cache decomposition [Huang et al. 2001]; block buffering [Su et al. 1995]. Reducing leakage power: drowsy caches [Flautner et al. 2002, Kim et al. 2002]; cache decay [Kaxiras et al. 2001]; gated Vdd [Powell et al. 2000].

4 Cache Subbanking Proposed by Su et al. [1995]: fetch only the requested subline; the data array is partitioned vertically into several subbanks. Studied further by Ghose et al. [2001]: the data array is partitioned both vertically and horizontally, and only the requested subbanks are activated.
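A minimal sketch of the subbanking idea in C, under assumed geometry (64 B blocks divided into eight 8 B sublines, one subline per subbank); the names and field widths are illustrative, not taken from the cited papers:

```c
#include <stdint.h>
#include <stdio.h>

/* Assumed geometry: 64 B blocks split into eight 8 B sublines,
 * one subline per subbank. Only the subbank holding the requested
 * word is driven; the rest stay idle. */
enum { SUBLINE_BYTES = 8, SUBLINES_PER_BLOCK = 8 };

static unsigned subbank_for(uint32_t addr)
{
    /* Bits [5:3] of the address select one of the eight sublines. */
    return (addr / SUBLINE_BYTES) % SUBLINES_PER_BLOCK;
}

int main(void)
{
    printf("address 0x38 -> activate subbank %u only\n", subbank_for(0x38u));
    return 0;
}
```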

5 Bit-sliced ALU Originally proposed by Hsu et al.: slices addition operations, e.g., one 32-bit addition becomes four 8-bit additions. Avoids waiting for the full-width addition and bypasses partial operand results to dependents. Successfully implemented in the Pentium 4's staggered adder.
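A minimal sketch of bit-sliced addition, assuming 8-bit slices of a 32-bit add; function and variable names are illustrative:

```c
#include <stdint.h>
#include <stdio.h>
#include <assert.h>

/* A 32-bit add computed as four 8-bit slice additions, low slice
 * first. Each slice's 8-bit result is final as soon as it is
 * produced, so hardware can bypass it to dependents before the
 * full 32-bit sum exists. */
static uint32_t bitsliced_add(uint32_t a, uint32_t b)
{
    uint32_t sum = 0;
    unsigned carry = 0;
    for (int slice = 0; slice < 4; slice++) {
        unsigned a8 = (a >> (8 * slice)) & 0xFFu;
        unsigned b8 = (b >> (8 * slice)) & 0xFFu;
        unsigned s  = a8 + b8 + carry;   /* one 8-bit slice add       */
        carry = s >> 8;                  /* carry into the next slice */
        sum |= (uint32_t)(s & 0xFFu) << (8 * slice);
        /* In hardware, (s & 0xFF) would be forwarded to dependents
         * here, before the full-width result completes. */
    }
    return sum;
}

int main(void)
{
    assert(bitsliced_add(0x12345678u, 0x0FFFFF88u)
           == 0x12345678u + 0x0FFFFF88u);
    printf("bit-sliced add matches full-width add\n");
    return 0;
}
```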

6 Outline Motivation Prior Work Bit-sliced Cache Experimental Results Conclusion

7 Power Consumption in Cache Row decoding consumes up to 40% of a cache's active power.

8 Bit-sliced Cache Extends the cache subbanking technique to save decoding power: only the row decoders that are actually accessed are enabled. Serializes subarray decoding with row decoding, using the low-order index bits to select the row decoder. Requires only minimal changes to the subbanking technique.
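A minimal sketch of the serialized decode, assuming an 8 KB, 4-way, 64 B-block cache (32 sets) split across four subarrays, so the 5 index bits divide into 2 subarray-select bits and 3 row bits; all names and widths are illustrative:

```c
#include <stdint.h>
#include <stdio.h>

/* Assumed index layout: 6 block-offset bits, then 2 subarray-select
 * bits (the low-order index bits), then 3 row bits. */
enum { BLOCK_OFFSET_BITS = 6, SUBARRAY_BITS = 2, ROW_BITS = 3 };

int main(void)
{
    uint32_t addr = 0x1234u;

    /* Step 1: subarray decode uses only the low-order index bits. */
    unsigned subarray = (addr >> BLOCK_OFFSET_BITS)
                        & ((1u << SUBARRAY_BITS) - 1);

    /* Step 2: only that subarray's row decoder is enabled; the other
     * three row decoders stay idle, which is where the decode power
     * saving comes from. */
    unsigned row = (addr >> (BLOCK_OFFSET_BITS + SUBARRAY_BITS))
                   & ((1u << ROW_BITS) - 1);

    printf("enable only subarray %u's row decoder, row %u\n", subarray, row);
    return 0;
}
```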

9 Pipelining the Cache Access Cache access time increases because subarray decoding is serialized with row decoding. Pipelining the access hides this delay; the latency of each stage must be balanced, so the operations assigned to each stage are chosen carefully. Pipelining also provides more throughput: the same throughput as a conventional cache with n ports.

10 Pipelined Cache's Access Steps Cycle 1: start subarray decoding for data and tag. Cycle 2: activate the necessary row decoders; read the tag array while waiting. Cycle 3: read the data array; concurrently, perform a partial tag comparison. Cycle 4: compare the rest of the tag bits; use the tag comparison result to select the data.
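A minimal sketch of the four steps as straight-line C with one comment per cycle; the tag width and the 8-bit partial-compare split are assumptions for illustration:

```c
#include <stdint.h>
#include <stdio.h>

typedef struct { uint32_t stored_tag; uint32_t data; } line_t;

static int pipelined_access(const line_t *line, uint32_t addr_tag,
                            uint32_t *out)
{
    /* Cycle 1: start subarray decoding for the data and tag arrays. */
    /* Cycle 2: activate the selected row decoder; the tag array is
     *          read while the data-array decode completes. */
    uint32_t tag = line->stored_tag;

    /* Cycle 3: read the data array; in parallel, compare the
     *          low-order 8 tag bits (the partial tag comparison). */
    int partial_hit = ((tag ^ addr_tag) & 0xFFu) == 0;

    /* Cycle 4: compare the remaining tag bits; the combined result
     *          selects the data that was just read. */
    int hit = partial_hit && ((tag ^ addr_tag) >> 8) == 0;
    if (hit)
        *out = line->data;
    return hit;
}

int main(void)
{
    line_t line = { 0x00ABCDEFu, 42u };
    uint32_t value = 0;
    printf("hit=%d value=%u\n",
           pipelined_access(&line, 0x00ABCDEFu, &value), value);
    return 0;
}
```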

11 Bit-sliced Cache

12 Bit-sliced Cache + Bit-sliced ALU Combining the two yields the best performance benefit: the cache access starts sooner, as soon as the first address slice is available. The number of subarrays is limited by the number of bits per slice; when the bit slice is too small, the optimal power saving cannot be achieved.
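A minimal sketch of why the combination helps: with 8-bit ALU slices and 64 B blocks, the first slice of the effective-address addition already contains the block offset (bits 5:0) and the low-order index bits, so subarray decode can begin three slices before the full address exists. The field widths here are assumptions:

```c
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint32_t base = 0x000011C0u, disp = 4u;

    /* Slice 0 of the bit-sliced add produces address bits 7:0 exactly
     * (its carry-in is zero), before slices 1..3 run. */
    unsigned lo     = (base & 0xFFu) + (disp & 0xFFu);
    unsigned slice0 = lo & 0xFFu;
    unsigned carry0 = lo >> 8;        /* feeds slice 1 later */

    /* With 6 offset bits, bits 7:6 are the low-order index bits, so
     * subarray decode (up to 4 subarrays) can start right now. More
     * subarrays would need index bits from later slices, which is why
     * the subarray count is limited by the bits per slice. */
    unsigned subarray = (slice0 >> 6) & 0x3u;
    printf("subarray decode can begin early: subarray %u (carry0=%u)\n",
           subarray, carry0);
    return 0;
}
```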

13 Pipelining with Bit-sliced Cache [Pipeline diagrams for the instruction sequence add R3, R2, R1; addi R3, R3, 4; lw R1, 0(R3); lw R4, 4(R3), comparing a pipelined execution stage with a pipelined cache, a bit-sliced execution stage with a bit-sliced cache, and a bit-sliced execution stage with a pipelined cache.]

14 Cache Model Simulation Estimates energy consumption and cache latency using a modified version of CACTI 3.0. Parameters: Ntbl, Ndbl, Ntwl, Ndwl. Enumerates all possible configurations and chooses the one with the best weighted value of cycle time and energy consumption. Simulates various cache sizes (8K–512K) with 64 B blocks; direct-mapped, 2-way, 4-way, and 8-way; 0.18 µm technology.
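A minimal sketch of the configuration search: enumerate the array-partitioning parameters and keep the configuration with the best weighted combination of cycle time and energy. The cost functions and weights below are placeholders, not CACTI 3.0's actual models, and only two of the four parameters are enumerated for brevity:

```c
#include <stdio.h>

/* Placeholder cost models; real CACTI derives these from the
 * cache geometry and technology parameters. */
static double model_cycle_time(int ndwl, int ndbl) { return 1.0 / ndwl + 0.2 * ndbl; }
static double model_energy(int ndwl, int ndbl)     { return 0.5 * ndwl + 1.0 / ndbl; }

int main(void)
{
    double best_cost = 1e30;
    int best_ndwl = 0, best_ndbl = 0;

    /* Enumerate Ndwl x Ndbl (Ntwl and Ntbl omitted for brevity) and
     * keep the configuration with the best weighted value. */
    for (int ndwl = 1; ndwl <= 8; ndwl *= 2) {
        for (int ndbl = 1; ndbl <= 8; ndbl *= 2) {
            double cost = 0.5 * model_cycle_time(ndwl, ndbl)
                        + 0.5 * model_energy(ndwl, ndbl);
            if (cost < best_cost) {
                best_cost = cost;
                best_ndwl = ndwl;
                best_ndbl = ndbl;
            }
        }
    }
    printf("best config: Ndwl=%d Ndbl=%d (cost %.3f)\n",
           best_ndwl, best_ndbl, best_cost);
    return 0;
}
```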

15 Processor Simulation Estimates the performance benefit using a heavily modified SimpleScalar 3.0 that supports a bit-sliced execution stage and speculative slice execution. Benchmarks: eight SPEC2000 integer benchmarks with the full reference input set; fast-forward 500M instructions, simulate 100M.

16 Machine Configuration 4-wide fetch, issue, and commit; 128-entry ROB; 32-entry scheduler; 20-stage pipeline; 64K-entry gshare branch predictor. L1 I-Cache: 32KB, 2-way, 64B blocks. L1 D-Cache: 8KB, 4-way, 64B blocks. L2 Cache: 512KB, 8-way, 128B blocks.

17 Energy Consumption / Access

18 Cycle Time Comparison

19 Speedup Comparison

20 Speedup Comparison

21 Conclusion Bit-sliced cache: achieves significant power reduction without adding much complexity, though it adds some delay to the access latency. Pipelined bit-sliced cache: reduces cycle time, provides more bandwidth, and yields a measurable speedup (with a bit-sliced ALU).

22 Questions? Thank you.