Flexicache: Software-based Instruction Caching for Embedded Processors
Jason E. Miller and Anant Agarwal
Raw Group - MIT CSAIL
Outline
• Introduction
• Baseline Implementation
• Optimizations
• Energy
• Conclusions
Hardware Instruction Caches
• Used in virtually all high-performance general-purpose processors
• Good performance
  – Decreases average memory access time
• Easy to use
  – Transparent operation
[Figure: chip containing processor and I-cache, connected to off-chip DRAM]
I-Cache-less Processors
• Embedded processors and DSPs
  – TMS470, ADSP-21xx, etc.
• Embedded multicore processors
  – IBM Cell SPE
• No special-purpose hardware
  – Less design/verification time
  – Less area
  – Shorter cycle time
  – Less energy per access
  – Predictable behavior
• Much harder to program!
  – Must manually partition code and transfer pieces from DRAM
[Figure: chip containing processor and SRAM, connected to off-chip DRAM]
Software-based I-Caching
• Use a software system to virtualize instruction memory by recreating hardware cache functionality
• Automatic management of a simple SRAM memory
  – Good performance with no extra programming effort
• Integrated into each individual application
  – Customized to the program's needs
  – Can be optimized for different goals
  – Real-time predictability
• Maintains the low-cost, high-speed hardware
Outline
• Introduction
• Baseline Implementation
• Optimizations
• Energy
• Conclusions
Flexicache System Overview
[Figure: the programmer's Original Binary passes through the Rewriter to become a Rewritten Binary, which the Linker combines with the Runtime library to produce the Flexicache Binary; the binary executes from the processor's I-mem, backed by DRAM]
Binary Rewriter
• Break up the user program into cache blocks
• Modify control flow that leaves the blocks
[Figure: Binary Rewriter output linked against the Flexicache runtime]
Rewriter: Details
• One basic block in each cache block, but…
  – Fixed size of 16 instructions
  – Simplifies bookkeeping
  – Requires padding of small blocks and splitting of large ones
• Control-flow instructions that leave a block are modified to jump to the runtime system (see the sketch below)
  – E.g. BEQ $2,$3,foo becomes JEQL $2,$3,runtime
  – Original destination addresses are stored in a table
  – Fall-through jumps added at the ends of blocks
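A minimal sketch of the rewriting pass, in C, assuming invented helper names (is_branch, branch_target, retarget, and so on); this is an illustration of the slide's transformation, not the actual Flexicache code:

    #include <stdint.h>

    #define BLOCK_SIZE 16   /* fixed cache block size, in instructions */

    typedef struct {
        uint32_t insns[BLOCK_SIZE];     /* padded with NOPs if short */
        uint32_t orig_dest[BLOCK_SIZE]; /* table of original destination addresses */
    } cache_block_t;

    /* Assumed helpers for decoding/encoding MIPS-style instructions. */
    extern int      is_branch(uint32_t insn);
    extern int      target_in_block(uint32_t insn, const cache_block_t *blk);
    extern uint32_t branch_target(uint32_t insn);
    extern uint32_t retarget(uint32_t insn, uint32_t new_dest);

    void rewrite_block(cache_block_t *blk, uint32_t runtime_entry) {
        for (int i = 0; i < BLOCK_SIZE; i++) {
            uint32_t insn = blk->insns[i];
            if (is_branch(insn) && !target_in_block(insn, blk)) {
                blk->orig_dest[i] = branch_target(insn);        /* remember foo */
                blk->insns[i] = retarget(insn, runtime_entry);  /* BEQ -> JEQL runtime */
            }
        }
        /* Per the slide, a fall-through jump to the runtime is also
         * appended at the end of each block. */
    }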
Runtime: Overview
• Stays resident in I-mem
• Receives requests from cache blocks
• Checks whether the requested block is resident
• Loads the new block from DRAM if necessary
  – Evicts blocks to make room
• Transfers control to the new block
Runtime Operation
[Figure: loaded cache blocks invoke the Runtime System through separate entry points for branches, fall-throughs, and indirect jumps (JR); the Miss Handler sends a request to DRAM (Blocks 0-3, …) and receives Block 2 in reply]
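A minimal sketch of the hit/miss path just described, using invented names (frame table, dram_fetch, jump_to) rather than the actual Raw runtime:

    #include <stdint.h>

    #define NUM_FRAMES  64   /* I-mem frames available for cache blocks */
    #define BLOCK_WORDS 16

    extern uint32_t imem_frames[NUM_FRAMES][BLOCK_WORDS];
    extern void dram_fetch(uint32_t vaddr, uint32_t *dst, int words);
    extern void jump_to(uint32_t *imem_addr);   /* transfers control; does not return */

    static uint32_t resident_vaddr[NUM_FRAMES]; /* which block occupies each frame */
    static int      fifo_head;                  /* oldest frame (next FIFO victim) */

    void runtime_request(uint32_t target_vaddr) {
        /* Hit check: placement is fully associative, so search every frame. */
        for (int f = 0; f < NUM_FRAMES; f++)
            if (resident_vaddr[f] == target_vaddr)
                jump_to(imem_frames[f]);        /* resident: transfer control */

        /* Miss: evict the oldest block (FIFO) and load the new one from DRAM. */
        int victim = fifo_head;
        fifo_head = (fifo_head + 1) % NUM_FRAMES;
        dram_fetch(target_vaddr, imem_frames[victim], BLOCK_WORDS);
        resident_vaddr[victim] = target_vaddr;
        jump_to(imem_frames[victim]);
    }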
System Policies and Mechanisms
• Fully-associative cache block placement
• Replacement policy: FIFO
  – Evict the oldest block in the cache
  – Matches sequential execution
• Pinned functions
  – Key feature for timing predictability
  – No cache overhead within the function
Experimental Setup
• Implemented for a tile of the Raw multicore processor
  – Similar to many embedded processors
  – 32-bit single-issue in-order MIPS pipeline
  – 32 kB SRAM I-mem
• Raw simulator
  – Cycle-accurate
  – Idealized I/O model
  – Models either an SRAM I-mem or a traditional hardware I-cache
  – Uses Wattch to estimate energy consumption
• Mediabench benchmark suite
  – Multimedia applications for embedded processors
Baseline Performance
[Chart: Flexicache overhead per benchmark]
Overhead: number of additional cycles relative to a 32 kB, 2-way HW cache
Outline
• Introduction
• Baseline Implementation
• Optimizations
• Energy
• Conclusions
Basic Chaining
• Problem: the hit case in the runtime system takes about 40 cycles
• Solution: modify the jump to the runtime system so that it jumps directly to the loaded code the next time (see the sketch below)
[Figure: without chaining, Blocks A-D all return to the Runtime System between transfers; with chaining, blocks jump directly to one another]
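A sketch of the chaining patch, with invented names: once the runtime has resolved the destination, it rewrites the jump instruction that invoked it so the next execution bypasses the runtime entirely:

    #include <stdint.h>

    extern uint32_t encode_direct_jump(const uint32_t *dest);  /* assumed encoder */

    /* call_site: the I-mem word holding the jump-to-runtime instruction.
     * dest_imem: the I-mem address of the now-loaded destination block. */
    void chain(uint32_t *call_site, const uint32_t *dest_imem) {
        *call_site = encode_direct_jump(dest_imem);
        /* Caveat from a later slide: if the destination block is evicted,
         * this chain must be backed out, which is part of what motivates
         * the flush replacement policy. */
    }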
Basic Chaining Performance
[Chart: Flexicache overhead per benchmark, with basic chaining]
Function Call Chaining
• Problem: function calls were not being chained
• Compound instructions (like jump-and-link) handle two virtual addresses
  – Load the return address into the link register
  – Jump to the destination address
• Solution (see the sketch below):
  – Decompose them in the rewriter
  – The jump can then be chained normally at runtime
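A sketch of the decomposition, with invented decoder/encoder names: the rewriter replaces the compound jump-and-link with an explicit link-register load followed by a plain jump, and only the jump needs to be chained:

    #include <stdint.h>

    extern uint32_t jal_target(uint32_t insn, uint32_t pc);  /* assumed decoder  */
    extern uint32_t encode_load_linkreg(uint32_t ret_addr);  /* assumed encoders */
    extern uint32_t encode_jump_to_runtime(uint32_t dest);

    /* Replace one jal with two instructions in the rewritten block. */
    void decompose_jal(uint32_t insn, uint32_t pc, uint32_t out[2]) {
        uint32_t dest = jal_target(insn, pc);  /* callee's virtual address */
        uint32_t ret  = pc + 4;                /* return address (delay slots
                                                  ignored in this sketch)  */
        out[0] = encode_load_linkreg(ret);     /* $31 <- return address    */
        out[1] = encode_jump_to_runtime(dest); /* plain, chainable jump    */
    }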
Function Call Chaining Performance
[Chart: Flexicache overhead per benchmark, with function call chaining]
Replacement Policy
• Problem: too much bookkeeping
  – Chains must be backed out if the destination block is evicted
  – Idea 1: with a FIFO replacement policy, no need to record chains from old blocks to young ones
  – Idea 2: limit the number of chains to each block
• Solution: flush replacement policy (see the sketch below)
  – Evict everything and start fresh
  – No need to undo or track chains
  – Increased miss rate vs. FIFO
[Figure: FIFO of Blocks A-D (older to newer) alongside an unchaining table listing, per block, the chains pointing into it]
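A sketch of the flush policy, reusing the invented frame-table names from the miss-handler sketch above: when the cache fills, everything is discarded at once, so no unchaining table is needed:

    #include <stdint.h>
    #include <string.h>

    #define NUM_FRAMES 64

    static uint32_t resident_vaddr[NUM_FRAMES]; /* as in the earlier sketch */
    static int      next_free_frame;

    /* Return a frame for a new block, flushing the whole cache when full. */
    int alloc_frame(void) {
        if (next_free_frame == NUM_FRAMES) {
            /* Flush: forget all residency (and, implicitly, all chains,
             * since every chained-to block is gone). Higher miss rate
             * than FIFO, but zero chain bookkeeping. */
            memset(resident_vaddr, 0, sizeof(resident_vaddr));
            next_free_frame = 0;
        }
        return next_free_frame++;
    }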
Flush Policy Performance
[Chart: Flexicache overhead per benchmark, with the flush policy]
Indirect Jump Chaining
• Problem: an indirect jump can have a different destination on each execution
• Solution: pre-screen addresses and chain each one individually (see the sketch below)
• But…
  – Screening takes time
  – Which addresses should we chain?

    Before:            After:
    JR $31             if $31==A: JMP A
                       if $31==B: JMP B
                       if $31==C: JMP C
    (A, B, C are the screened destination blocks)
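A sketch of the screened dispatch, with invented names, expressing the slide's compare chain as a loop: previously seen targets are checked first, and anything unscreened falls back to the full runtime lookup:

    #include <stdint.h>

    #define MAX_CHAINS 3    /* how many targets to screen per JR site */

    extern void jump_to(uint32_t *imem_addr);    /* does not return   */
    extern void runtime_request(uint32_t vaddr); /* full lookup path  */

    typedef struct {
        uint32_t  vaddr[MAX_CHAINS]; /* screened virtual targets    */
        uint32_t *imem[MAX_CHAINS];  /* their loaded I-mem copies   */
        int       n;                 /* how many are chained so far */
    } jr_site_t;

    void jr_dispatch(jr_site_t *site, uint32_t target) {
        for (int i = 0; i < site->n; i++)   /* the 'if $31==A: JMP A' chain */
            if (site->vaddr[i] == target)
                jump_to(site->imem[i]);
        runtime_request(target);            /* unscreened destination */
        /* The slide's open questions: the screening itself costs cycles,
         * and which targets to chain is a policy decision. */
    }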
Indirect Jump Chaining Performance
[Chart: Flexicache overhead per benchmark, with indirect jump chaining]
Fixed-size Block Padding
• Padding for small blocks wastes more space than expected
  – The average basic block contains 5.5 instructions
  – The most common size is 3
  – 60-65% of storage space is wasted on NOPs

    8400: mfsr  $r9, …
    8404: rlm   $r9,$r9,0x4,0x0
    8408: jnel+ $r9,$0,_dispatch.entry1
    840c: jal   _dispatch.entry2
    8410: nop
    8414: nop
    8418: nop
    841c: nop
    …
8-word Cache Blocks
• Reduce cache block size to better fit basic blocks
  – Less padding → less wasted space → lower miss rate
  – Bookkeeping structures get bigger → higher miss rate
  – More block splits → higher miss rate and overhead
• Allow up to 4 consecutive blocks to be loaded together (see the sketch below)
  – Effectively creates 8-, 16-, 24- and 32-word blocks
  – Avoids splitting up large basic blocks
• Performance benefits
  – Amortizes the cost of a call into the runtime
  – Overlaps DRAM fetches
  – Eliminates jumps used to split large blocks
  – Also used to add extra space for runtime JR chaining
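A sketch of the grouped load, with invented helpers: when a basic block spans several consecutive 8-word cache blocks, the runtime allocates consecutive frames and fetches them with a single DRAM request:

    #include <stdint.h>

    #define SMALL_BLOCK_WORDS 8

    extern uint32_t *alloc_consecutive_frames(int nframes);    /* assumed helper */
    extern void dram_fetch(uint32_t vaddr, uint32_t *dst, int words);
    extern void mark_resident(uint32_t vaddr, uint32_t *imem); /* assumed helper */

    /* Load 1-4 consecutive 8-word blocks in one request, amortizing the
     * runtime call and overlapping the DRAM fetches. */
    void load_run(uint32_t vaddr, int nblocks) {
        uint32_t *dst = alloc_consecutive_frames(nblocks);
        dram_fetch(vaddr, dst, nblocks * SMALL_BLOCK_WORDS);
        for (int i = 0; i < nblocks; i++)   /* each piece stays individually hit-checkable */
            mark_resident(vaddr + i * SMALL_BLOCK_WORDS * 4,
                          dst   + i * SMALL_BLOCK_WORDS);
    }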
8-word Blocks Performance
[Chart: Flexicache overhead per benchmark, with 8-word blocks]
Performance Summary
• Good performance on 6 of 9 benchmarks: 5-11% overhead
• G721 (24.2% overhead)
  – Indirect jumps
• Mesa (24.4% overhead)
  – Indirect jumps, high miss rate
• Rasta (93.6% overhead)
  – High miss rate, indirect jumps
• The majority of the remaining overhead is due to modifications to user code, not runtime calls
  – Fall-through jumps added by the rewriter
  – Indirect jump chain comparisons
Outline
• Introduction
• Baseline Implementation
• Optimizations
• Energy
• Conclusions
Energy Analysis
• SRAM uses less energy than a cache for each access
  – No tags and no unused cache ways
  – Saves about 9% of total processor power
• Additional instructions for software management use extra energy
  – Total energy is roughly proportional to the number of cycles
• The software I-cache will use less total energy if the instruction overhead is below 9%
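A rough break-even check under the slide's own assumptions: total energy ≈ cycles × power, and the SRAM design draws about 9% less power, so the software cache wins when (1 + overhead) × 0.91 < 1, i.e. when the cycle overhead is below roughly 10%, consistent with the slide's ~9% rule of thumb.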
Energy Results
• Wattch used with CACTI models for the SRAM and I-cache
  – 32 kB, 2-way set-associative HW cache; 25% of total power
• Total energy to complete each benchmark was calculated
Conclusions
• Software-based instruction caching can be a practical solution for embedded processors
• Provides the programming convenience of a HW cache
• Performance and energy similar to a HW cache
  – Overhead < 10% on several benchmarks
  – Energy savings of up to 3.8%
• Maintains the advantages of an I-cache-less architecture
  – Low-cost hardware
  – Real-time guarantees
Questions?