Flexicache: Software-based Instruction Caching for Embedded Processors
Jason E. Miller and Anant Agarwal
Raw Group, MIT CSAIL

Outline
– Introduction
– Baseline Implementation
– Optimizations
– Energy
– Conclusions

Hardware Instruction Caches
Used in virtually all high-performance general-purpose processors
Good performance
– Decreases average memory access time
Easy to use
– Transparent operation
(Diagram: processor chip with an on-chip I-cache backed by external DRAM)

I-Cache-less Processors
Embedded processors and DSPs
– TMS470, ADSP-21xx, etc.
Embedded multicore processors
– IBM Cell SPE
No special-purpose hardware
– Less design/verification time
– Less area
– Shorter cycle time
– Less energy per access
– Predictable behavior
Much harder to program!
– Manually partition code and transfer pieces from DRAM
(Diagram: processor with on-chip SRAM backed by external DRAM)

Software-based I-Caching
Use a software system to virtualize instruction memory by recreating hardware cache functionality
Automatic management of a simple SRAM memory
– Good performance with no extra programming effort
Integrated into each individual application
– Customized to the program's needs
– Can be optimized for different goals
– Real-time predictability
Maintains the low-cost, high-speed hardware

Outline
– Introduction
– Baseline Implementation
– Optimizations
– Energy
– Conclusions

Flexicache System Overview
(Diagram: the programmer's original binary passes through the Binary Rewriter; the rewritten binary is linked with the Flexicache runtime library to produce the Flexicache binary, which executes from the processor's I-mem with code paged in from DRAM)

Binary Rewriter
– Break up the user program into cache blocks
– Modify control flow that leaves the blocks

Rewriter: Details
One basic block in each cache block, but…
– Fixed size of 16 instructions
– Simplifies bookkeeping
– Requires padding of small blocks and splitting of large ones
Control-flow instructions that leave a block are modified to jump to the runtime system
– E.g., BEQ $2,$3,foo → JEQL $2,$3,runtime
– Original destination addresses are stored in a table
– Fall-through jumps are added at the ends of blocks
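A minimal sketch of this pass, using an invented toy encoding (the Insn struct, opcode names, and dest_table are illustrative, not Raw's actual MIPS-style ISA; splitting of oversized basic blocks is omitted):

```c
/* Toy sketch of the rewriting pass. The Insn encoding, opcode names,
 * and dest_table are invented for illustration; the real rewriter
 * operates on Raw's MIPS-style binary. */
#include <stdint.h>
#include <stddef.h>

#define BLOCK_WORDS 16                 /* fixed cache-block size */

typedef enum { OP_ALU, OP_BRANCH, OP_JUMP, OP_TRAP, OP_NOP } Op;

typedef struct {
    Op       op;
    uint32_t target;                   /* branch/jump destination */
} Insn;

static uint32_t dest_table[4096];      /* original destinations,
                                          indexed by call-site id  */
static size_t   next_site;

/* Rewrite one cache block in place: redirect control flow that
 * leaves the block into the runtime, append an explicit fall-through
 * jump, and pad the remainder of the fixed-size block with NOPs. */
void rewrite_block(Insn *blk, size_t n_used, uint32_t fallthrough)
{
    for (size_t i = 0; i < n_used; i++) {
        if (blk[i].op == OP_BRANCH || blk[i].op == OP_JUMP) {
            dest_table[next_site] = blk[i].target;  /* save original */
            blk[i].op     = OP_TRAP;                /* BEQ -> JEQL    */
            blk[i].target = (uint32_t)next_site++;  /* site id, not
                                                       an address    */
        }
    }
    if (n_used < BLOCK_WORDS) {        /* explicit fall-through jump */
        dest_table[next_site] = fallthrough;
        blk[n_used++] = (Insn){ OP_TRAP, (uint32_t)next_site };
        next_site++;
    }
    while (n_used < BLOCK_WORDS)       /* pad small blocks with NOPs */
        blk[n_used++] = (Insn){ OP_NOP, 0 };
}
```

Storing a site id rather than the raw address keeps the trapping instruction's immediate field small; the runtime indexes dest_table to recover the original virtual destination.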

Runtime: Overview
– Stays resident in I-mem
– Receives requests from cache blocks
– Sees if the requested block is resident
– Loads the new block from DRAM if necessary
– Evicts blocks to make room
– Transfers control to the new block
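A rough C sketch of this hit/miss path (the table layout and the dram_fetch/imem_slot helpers are assumptions; the real runtime is hand-tuned code resident in I-mem, and a real lookup would hash rather than scan):

```c
/* Sketch of the runtime dispatch path. Table layout and the
 * dram_fetch/imem_slot helpers are assumptions for illustration. */
#include <stdint.h>
#include <stddef.h>

#define N_SLOTS 2048                 /* 32 kB I-mem / 16-word blocks */

typedef struct {
    uint32_t  vaddr;                 /* virtual block address        */
    uint32_t *loc;                   /* block's location in I-mem    */
} Entry;

static Entry  cache[N_SLOTS];        /* fully-associative placement  */
static size_t fifo_head;             /* oldest entry, evicted first  */

/* Hypothetical helpers: copy a block in from DRAM; address a slot. */
extern uint32_t *dram_fetch(uint32_t vaddr, uint32_t *dst);
extern uint32_t *imem_slot(size_t idx);

/* Called from a rewritten control-flow instruction: return the
 * I-mem address to jump to for virtual address `target`. */
uint32_t *dispatch(uint32_t target)
{
    /* Hit check (a real runtime would hash, not scan). */
    for (size_t i = 0; i < N_SLOTS; i++)
        if (cache[i].loc && cache[i].vaddr == target)
            return cache[i].loc;

    /* Miss: evict the oldest block (FIFO) and reuse its slot. */
    size_t idx = fifo_head;
    fifo_head  = (fifo_head + 1) % N_SLOTS;

    cache[idx].vaddr = target;
    cache[idx].loc   = dram_fetch(target, imem_slot(idx));
    return cache[idx].loc;           /* runtime jumps here */
}
```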

Runtime Operation
(Diagram: loaded cache blocks in I-mem reach the runtime system through separate entry points for branches, fall-throughs, and indirect jumps (JR); on a miss, the miss handler sends a request for the block, e.g. Block 2, to DRAM and receives the reply)

System Policies and Mechanisms
Fully-associative cache block placement
Replacement policy: FIFO
– Evicts the oldest block in the cache
– Matches sequential execution
Pinned functions
– Key feature for timing predictability
– No cache overhead within a pinned function
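Pinning composes naturally with FIFO eviction; a small sketch (the pinned flag and table are assumptions, and at least one unpinned slot is assumed to exist):

```c
/* Sketch: FIFO victim selection that skips pinned blocks. The table
 * layout is an assumption; assumes at least one unpinned slot. */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define N_SLOTS 2048

typedef struct {
    uint32_t vaddr;
    bool     pinned;   /* pinned functions are never evicted, so code
                          inside them runs with zero cache overhead  */
} Slot;

static Slot   slots[N_SLOTS];
static size_t fifo_head;

/* FIFO matches sequential execution: the oldest resident block is the
 * one least likely to be re-entered soon by straight-line code. */
size_t pick_victim(void)
{
    while (slots[fifo_head].pinned)              /* skip pinned code */
        fifo_head = (fifo_head + 1) % N_SLOTS;
    size_t victim = fifo_head;
    fifo_head = (fifo_head + 1) % N_SLOTS;
    return victim;
}
```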

Experimental Setup
Implemented for a tile of the Raw multicore processor
– Similar to many embedded processors
– 32-bit single-issue in-order MIPS pipeline
– 32 kB SRAM I-mem
Raw simulator
– Cycle-accurate
– Idealized I/O model
– Models either the SRAM I-mem or a traditional hardware I-cache
– Uses Wattch to estimate energy consumption
Mediabench benchmark suite
– Multimedia applications for embedded processors

Baseline Performance
(Chart: Flexicache overhead per benchmark. Overhead is the number of additional cycles relative to a 32 kB, 2-way hardware cache.)

Outline
– Introduction
– Baseline Implementation
– Optimizations
– Energy
– Conclusions

Basic Chaining
Problem: the hit case in the runtime system takes about 40 cycles
Solution: modify the jump into the runtime system so that it jumps directly to the loaded code the next time
(Diagram: without chaining, blocks A-D each jump through the runtime system on every transfer; with chaining, they jump directly to one another)
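Conceptually, chaining is a one-instruction patch at the call site; a sketch using the toy encoding from earlier (opcode names are illustrative only):

```c
/* Sketch of chaining: after resolving a trap, overwrite the trapping
 * instruction in I-mem with a direct jump/branch so the next execution
 * bypasses the runtime entirely. Opcodes are illustrative only. */
#include <stdint.h>

typedef enum { OP_TRAP, OP_JUMP } Op;
typedef struct { Op op; uint32_t target; } Insn;

/* `site` is the rewritten control-flow instruction that just trapped;
 * `dest` is the resolved I-mem address of the destination block. */
void chain(Insn *site, uint32_t dest)
{
    site->op     = OP_JUMP;   /* e.g. JEQL ...,runtime -> BEQ ...,dest
                                 (a conditional keeps its condition;
                                 only the target changes)             */
    site->target = dest;
}
```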

Basic Chaining Performance
(Chart: Flexicache overhead per benchmark with basic chaining enabled.)

Function Call Chaining
Problem: function calls were not being chained
– Compound instructions (like jump-and-link) handle two virtual addresses
– They load the return address into the link register
– And jump to the destination address
Solution:
– Decompose them in the rewriter
– The jump can then be chained normally at runtime
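A sketch of the decomposition with the same toy encoding (OP_SETLINK is an invented stand-in for whatever instruction loads the link register):

```c
/* Sketch of decomposing jump-and-link so the jump half can be chained.
 * OP_SETLINK is an invented stand-in for loading the link register. */
#include <stdint.h>

typedef enum { OP_JAL, OP_SETLINK, OP_JUMP } Op;
typedef struct { Op op; uint32_t target; } Insn;

/* Rewriter-time transform:  JAL dest  ->  SETLINK ret ; JUMP dest.
 * The link register gets the (virtual) return address explicitly,
 * leaving a plain jump that chains like any other branch. */
void decompose_jal(const Insn *jal, uint32_t ret_vaddr, Insn out[2])
{
    out[0] = (Insn){ OP_SETLINK, ret_vaddr };
    out[1] = (Insn){ OP_JUMP,    jal->target };
}
```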

Function Call Chaining Performance
(Chart: Flexicache overhead per benchmark with function call chaining enabled.)

Replacement Policy
Problem: too much bookkeeping
– Chains must be backed out if the destination block is evicted
– Idea 1: with a FIFO replacement policy, there is no need to record chains from older blocks to younger ones
– Idea 2: limit the number of chains to each block
Solution: Flush replacement policy
– Evict everything and start fresh
– No need to undo or track chains
– Increased miss rate vs. FIFO
(Diagram: blocks A-D ordered from older to newer, with chains such as A→D and D→C, and an unchaining table with an entry per block A, B, C, D)
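A sketch of why flushing is attractive (table names are assumptions):

```c
/* Sketch of the flush policy: when I-mem fills, discard everything
 * rather than tracking and undoing chains block by block. */
#include <string.h>
#include <stdint.h>
#include <stddef.h>

#define N_SLOTS 2048
typedef struct { uint32_t vaddr; uint32_t *loc; } Entry;

static Entry  cache[N_SLOTS];
static size_t next_free;

void flush(void)
{
    /* Per-block eviction would need an unchaining table: every chain
     * into the victim must be found and rewritten back into a trap.
     * Flushing sidesteps that bookkeeping entirely.                 */
    memset(cache, 0, sizeof cache);   /* forget all resident blocks   */
    next_free = 0;                    /* reallocate I-mem from the top */
    /* No chains need undoing: chained jumps live inside the evicted
     * code itself, and every block is reloaded fresh (unchained) from
     * DRAM, trapping back into the runtime on first use.             */
}
```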

Flush Policy Performance
(Chart: Flexicache overhead per benchmark with the flush replacement policy.)

Indirect Jump Chaining
Problem: a different destination on each execution
Solution: pre-screen addresses and chain each one individually, e.g. JR $31 becomes:
  if $31==A: JMP A
  if $31==B: JMP B
  if $31==C: JMP C
But…
– Screening takes time
– Which addresses should we chain?
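A C rendering of that screening sequence (the MAX_CHAINS limit and per-site structure are assumptions about one plausible implementation; dispatch is the full lookup from the earlier sketch):

```c
/* Sketch of indirect-jump chaining: a short screen of compare-and-jump
 * tests runs before falling back to the full runtime lookup. The
 * MAX_CHAINS limit and site structure are assumptions. */
#include <stdint.h>
#include <stddef.h>

#define MAX_CHAINS 3     /* cap the screening cost per indirect jump */

typedef struct {
    uint32_t  vaddr[MAX_CHAINS];   /* previously seen targets        */
    uint32_t *imem [MAX_CHAINS];   /* where each target is loaded    */
    size_t    n;
} IndirectSite;

extern uint32_t *dispatch(uint32_t target);  /* full runtime lookup */

/* Conceptual equivalent of the compare chain emitted for `jr $31`. */
uint32_t *indirect_jump(IndirectSite *s, uint32_t reg)
{
    for (size_t i = 0; i < s->n; i++)        /* if $31==A: JMP A ... */
        if (s->vaddr[i] == reg)
            return s->imem[i];

    uint32_t *dest = dispatch(reg);          /* screen miss: full path */
    if (s->n < MAX_CHAINS) {                 /* remember for next time */
        s->vaddr[s->n] = reg;
        s->imem [s->n] = dest;
        s->n++;
    }
    return dest;
}
```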

Indirect Jump Chaining Performance
(Chart: Flexicache overhead per benchmark with indirect jump chaining enabled.)

Fixed-size Block Padding
Padding for small blocks wastes more space than expected
– The average basic block contains 5.5 instructions
– The most common size is 3
– 60-65% of storage space is wasted on NOPs
Example block (mostly padding):
  8400: mfsr $r9, …
  8404: rlm  $r9,$r9,0x4,0x0
  8408: jnel+ $r9,$0,_dispatch.entry1
  840c: jal  _dispatch.entry2
  8410: nop
  8414: nop
  8418: nop
  841c: nop
  …

8-word Cache Blocks
Reduce the cache block size to better fit basic blocks
– Less padding → less wasted space → lower miss rate
– Bookkeeping structures get bigger → higher miss rate
– More block splits → higher miss rate and overhead
Allow up to 4 consecutive blocks to be loaded together (sketched below)
– Effectively creates 8-, 16-, 24- and 32-word blocks
– Avoids splitting up large basic blocks
Performance benefits:
– Amortizes the cost of a call into the runtime
– Overlaps DRAM fetches
– Eliminates the jumps used to split large blocks
– Also used to add extra space for runtime JR chaining
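A sketch of the grouped load (the helpers and the grouping test are assumptions; in the real system the rewriter decides which consecutive blocks belong together, e.g. the pieces of one large basic block):

```c
/* Sketch of loading up to four consecutive 8-word blocks on one miss,
 * forming effective 8/16/24/32-word blocks. Helpers are hypothetical;
 * the real grouping decision is made by the rewriter. */
#include <stdint.h>

#define SMALL_BLOCK_BYTES (8 * 4)    /* 8 words of 4 bytes */
#define MAX_GROUP         4

extern int       same_group(uint32_t vaddr);  /* hypothetical: block is
                                                 a continuation of the
                                                 current run           */
extern uint32_t *load_one(uint32_t vaddr);    /* hypothetical loader   */

/* One runtime call amortizes its cost over up to MAX_GROUP blocks,
 * and the back-to-back DRAM fetches can overlap. */
uint32_t *load_group(uint32_t vaddr)
{
    uint32_t *first = load_one(vaddr);
    for (int i = 1; i < MAX_GROUP; i++) {
        uint32_t next = vaddr + (uint32_t)i * SMALL_BLOCK_BYTES;
        if (!same_group(next))
            break;
        load_one(next);              /* placed contiguously in I-mem */
    }
    return first;
}
```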

8-word Blocks Performance
(Chart: Flexicache overhead per benchmark with 8-word cache blocks.)

Performance Summary
Good performance on 6 of 9 benchmarks: 5-11% overhead
Outliers:
– G721 (24.2% overhead): indirect jumps
– Mesa (24.4% overhead): indirect jumps, high miss rate
– Rasta (93.6% overhead): high miss rate, indirect jumps
The majority of the remaining overhead is due to modifications to user code, not runtime calls:
– Fall-through jumps added by the rewriter
– Indirect-jump chain comparisons

Outline
– Introduction
– Baseline Implementation
– Optimizations
– Energy
– Conclusions

Energy Analysis
SRAM uses less energy than a cache for each access
– No tags and no unused cache ways
– Saves about 9% of total processor power
The additional instructions for software management use extra energy
– Total energy is roughly proportional to the number of cycles
The software I-cache will therefore use less total energy if the instruction overhead is below about 9%
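Making the break-even point explicit (a back-of-envelope derivation, treating total energy as proportional to cycle count per the slide's approximation; o is the fractional cycle overhead):

```latex
% P_sw = 0.91 P_hw (SRAM saves ~9% of processor power),
% t_sw = (1 + o) t_hw, and E = P t, so:
E_{sw} < E_{hw}
  \iff 0.91\,(1 + o)\,P_{hw}\,t_{hw} < P_{hw}\,t_{hw}
  \iff o < \frac{1}{0.91} - 1 \approx 9.9\%
% i.e., to first order, overhead below roughly 9% yields net savings.
```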

Energy Results
Wattch used with CACTI models for the SRAM and the I-cache
– 32 kB, 2-way set-associative HW cache, consuming 25% of total power
Total energy to complete each benchmark was calculated

Conclusions
Software-based instruction caching can be a practical solution for embedded processors
– Provides the programming convenience of a HW cache
– Performance and energy similar to a HW cache
– Overhead < 10% on several benchmarks
– Energy savings of up to 3.8%
– Maintains the advantages of an I-cache-less architecture
– Low-cost hardware
– Real-time guarantees

Questions?