ROBTIC: On-chip I-cache design for low-power embedded systems

Presentation transcript:

ROBTIC: On-chip I-cache design for low-power embedded systems. Varun Mathur, Mingwei Liu

I-cache and the address tag
The instruction cache occupies a large chip area and has a high access frequency, hence high switching power.
Example: direct-mapped I-cache
- 1024 entries (=> 1024 one-way sets)
- 32-bit data per entry (a 4-byte block)
- 32-bit PC address
So: 10-bit index, 2-bit block offset, and therefore a 32 - (10 + 2) = 20-bit tag.
Cache area taken by the tag: 20 / (20 + 1 + 32) = 37.7% (tag + valid bit + data per entry).
The larger the cache tag, the more area and power it consumes.
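The slide's tag-overhead arithmetic can be checked with a short sketch (constants taken from the example above):

```python
# Tag-overhead arithmetic for the example direct-mapped I-cache.
ADDR_BITS = 32      # PC width
ENTRIES = 1024      # direct-mapped sets
LINE_BYTES = 4      # 32-bit data per entry

index_bits = ENTRIES.bit_length() - 1        # log2(1024) = 10
offset_bits = LINE_BYTES.bit_length() - 1    # log2(4) = 2
tag_bits = ADDR_BITS - index_bits - offset_bits   # 32 - 12 = 20

# Per-entry storage: tag + valid bit + data
data_bits = 8 * LINE_BYTES
tag_fraction = tag_bits / (tag_bits + 1 + data_bits)
print(tag_bits, round(100 * tag_fraction, 1))  # 20-bit tag, ~37.7% of area
```

This is where ROBTIC's motivation comes from: the tag array alone is over a third of the per-entry storage.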

Previous work
- Filter cache / L0-cache
- Tag comparison before data fetch
- Buffering the address of the previously hit cache line (for non-branch or not-taken instructions)
- Using a few of the least significant bits of the tag for comparison
This paper proposes the Reduced One-Bit Tag Instruction Cache (ROBTIC).

Their work is directed at the following observations:
- Application programs contain many loops.
- Loops have high temporal and spatial locality.
- This locality can be exploited to reduce the tag size.

Standard cache size for ROBTIC
- A wide range of benchmarks was surveyed.
- Applications spend most of their time within certain loop blocks.
- 90% of loops have fewer than 30 instructions.
- ROBTIC's standard capacity is therefore 32 instructions, enough to capture most of the temporal and spatial locality.

ROBTIC Architecture

Cache tag size vs. cache memory coverage
- Tag: identifies which memory location is stored at a particular cache location.
- Cache memory coverage: the region of main memory the cache can map at one time.

Features of ROBTIC:
- Direct-mapped.
- Each cache entry contains a cache line (data), a 1-bit tag, and a valid bit.
- Cache size is extensively reduced.
- Due to the 1-bit tag, a cache operation control unit is required.
- Performance is retained.
Reduced size + maintained performance => reduced power consumption.

Cache tag size vs. cache memory coverage: full tag = full memory coverage.

Cache tag size vs. cache memory coverage: reduced tag = partial memory coverage.
- Only segments of memory can be mapped at a time.
- Mappable regions are distinguished by the CC-index.
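A toy address split illustrates why a reduced tag shrinks coverage. The field names and widths here (a tiny 8-entry cache with a 1-bit tag; `cc_index` for the remaining upper bits) are illustrative assumptions, not the paper's actual RTL:

```python
# Toy sketch: address decomposition for a 1-bit-tag cache.
OFFSET_BITS = 2     # 4-byte lines
INDEX_BITS = 3      # 8-entry toy cache
TAG_BITS = 1        # ROBTIC-style 1-bit tag

def split(addr):
    """Split a byte address into (cc_index, tag, index, offset)."""
    offset = addr & ((1 << OFFSET_BITS) - 1)
    index = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    tag = (addr >> (OFFSET_BITS + INDEX_BITS)) & ((1 << TAG_BITS) - 1)
    # The upper bits identify which memory segment is currently covered.
    cc_index = addr >> (OFFSET_BITS + INDEX_BITS + TAG_BITS)
    return cc_index, tag, index, offset

# With a 1-bit tag, the cache can only distinguish addresses inside a
# 2 * cache-size window; addresses from segments with a different
# cc_index are indistinguishable by tag alone.
coverage_bytes = 1 << (TAG_BITS + INDEX_BITS + OFFSET_BITS)  # 64 bytes
```

The control unit described next exists precisely to track which `cc_index` segment the cache currently covers.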

Ideal scenario

Reality: the ping-pong effect. Consider the case of block B2.

Overlapped Dynamic Cache Coverage Control
- Dynamically shift the cache coverage in an overlapping manner.
- Some of the previously mapped addresses survive the shift.
Again consider the case of block B2.

Cache operation control flow has three states:
- Cache coverage shift: when the PC moves out of the current coverage, shift the coverage (always by the cache size) and invalidate the lines that leave it.
- Cache flush: when a jump or branch offset is larger than the cache size.
- Normal cache access: otherwise.
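The three-way decision above can be sketched as a small dispatcher. This is an assumed simplification of the control logic, not the authors' design; the function name and the exact region test are illustrative:

```python
# Toy sketch of the three-state control decision (assumed logic).
CACHE_SIZE = 32  # ROBTIC's standard capacity, in instructions

def decide(pc, coverage_base, branch_offset=0):
    """Pick the control action for one instruction fetch."""
    if coverage_base <= pc < coverage_base + CACHE_SIZE:
        return "normal"   # PC inside current coverage: ordinary access
    if abs(branch_offset) > CACHE_SIZE:
        return "flush"    # far jump/branch: flush the whole cache
    return "shift"        # PC just crossed the edge: shift by CACHE_SIZE
```

For example, a PC just past the coverage edge triggers a shift, while a long jump triggers a flush.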

Cache coverage shift and shift detection
Method 1: range gauge
Disadvantages:
- Two registers (full tag size)
- Two comparators (full tag size)
This neutralizes the savings from the 1-bit tag.
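A hypothetical sketch of what Method 1 costs: the coverage bounds sit in two full-tag-width registers, and every access performs two full-width comparisons. The names here are illustrative, not from the paper:

```python
# Hypothetical range-gauge check (Method 1): two full-width boundary
# registers and two full-width comparisons per access, which is the
# overhead that cancels the 1-bit-tag savings.
TAG_BITS = 20  # full tag width from the earlier example

def in_coverage(pc_tag, lower_reg, upper_reg):
    # Each comparison below models a 20-bit hardware comparator.
    return lower_reg <= pc_tag <= upper_reg
```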

Method 2: with Gray encoding
- Only three regions need to be differentiated, so the two least significant bits of the tag are sufficient. These are called the local cache coverage position (CC-lpos).
- Simple range checking will not work on such a narrow field.
- The 2-bit CC-lpos (p1 p0) is converted to a Gray-code value (g1 g0).
- Adjacent coverage regions are differentiated by one bit of (g1 g0), called the common value bit (CVB); its value is called the common value (CV).
- The CVB and its CV uniquely identify 4 consecutive regions.

For the current cache coverage:
- CVB position: CVP = ~(g1 xor g0) = ~p0
- Common value: CV = g0 = p1 xor p0
Hardware cost does not increase with tag size: one 2-bit register, one 2-bit binary-to-Gray converter, one 2-to-1 mux, and a 1-bit comparator.
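The two identities above follow directly from the standard binary-to-Gray conversion (g1 = p1, g0 = p1 xor p0), and can be verified exhaustively over the four CC-lpos values:

```python
# Checking the CC-lpos identities against binary-to-Gray conversion.
def gray(p1, p0):
    """Standard 2-bit binary-to-Gray: g1 = p1, g0 = p1 ^ p0."""
    return p1, p1 ^ p0

for p in range(4):
    p1, p0 = (p >> 1) & 1, p & 1
    g1, g0 = gray(p1, p0)
    # g1 ^ g0 == p0, so the CVP selector ~(g1 ^ g0) reduces to ~p0.
    assert (g1 ^ g0) == p0
    # The common value CV = g0 is just p1 ^ p0.
    assert g0 == p1 ^ p0
```

Because both expressions reduce to functions of p1 and p0 alone, the detection hardware stays constant regardless of the full tag width.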

Experimental Results: Results of Cache Coverage Shift implementation

Experimental Results: Comparison of ROBTIC with PTC-5

Experimental Results: Comparison of standard & custom ROBTIC designs

Limitations and future work:
- On a jump instruction, the cache is flushed irrespective of the offset.
- For future work, the authors want to investigate a multi-way set-associative structure for ROBTIC.