Flexicache: Software-based Instruction Caching for Embedded Processors
Jason E. Miller and Anant Agarwal
Raw Group - MIT CSAIL
Outline
Introduction
Baseline Implementation
Optimizations
Energy
Conclusions
Hardware Instruction Caches
Used in virtually all high-performance general-purpose processors
Good performance
–Decreases average memory access time
Easy to use
–Transparent operation
[Diagram: processor chip with an on-chip I-cache backed by off-chip DRAM]
I-Cache-less Processors
Embedded processors and DSPs
–TMS470, ADSP-21xx, etc.
Embedded multicore processors
–IBM Cell SPE
No special-purpose hardware
–Less design/verification time
–Less area
–Shorter cycle time
–Less energy per access
–Predictable behavior
Much harder to program!
–Manually partition code and transfer pieces from DRAM
[Diagram: processor with a plain on-chip SRAM backed by off-chip DRAM]
Software-based I-Caching
Use a software system to virtualize instruction memory by recreating hardware cache functionality
Automatic management of a simple SRAM memory
–Good performance with no extra programming effort
Integrated into each individual application
–Customized to the program's needs
–Optimized for different goals
–Real-time predictability
Maintains low-cost, high-speed hardware
Outline
Introduction
Baseline Implementation
Optimizations
Energy
Conclusions
Flexicache System Overview
[Diagram: the programmer's original binary is processed by the rewriter; the linker combines the rewritten binary with the Flexicache runtime library to produce the Flexicache binary, which executes from the processor's I-mem with code fetched from DRAM]
Binary Rewriter
Break up the user program into cache blocks
Modify control flow that leaves the blocks
[Diagram: the rewriter splits the program into cache blocks and attaches the Flexicache runtime]
Rewriter: Details
One basic block in each cache block, but…
–Fixed size of 16 instructions
–Simplifies bookkeeping
–Requires padding of small blocks and splitting of large ones
Control-flow instructions that leave a block are modified to jump to the runtime system
–E.g. BEQ $2,$3,foo becomes JEQL $2,$3,runtime
–Original destination addresses are stored in a table
–Fall-through jumps are added at the end of blocks
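The block-formation step can be sketched as follows. This is an illustrative model, not the actual Flexicache rewriter: the function name and instruction representation are assumptions, and the redirection of exiting control flow to the runtime is omitted. It shows only how a basic block is split at the fixed 16-instruction boundary and padded with NOPs.

```python
# Sketch of the rewriter's block-formation pass (illustrative, not the
# real Flexicache code). Each basic block becomes one or more fixed-size
# 16-instruction cache blocks: large blocks are split, and the last
# chunk is padded out with NOPs.
BLOCK_SIZE = 16

def form_cache_blocks(basic_block):
    """Split a basic block (list of instructions) into fixed-size
    cache blocks, padding the final one with NOPs."""
    blocks = []
    for i in range(0, len(basic_block), BLOCK_SIZE):
        chunk = basic_block[i:i + BLOCK_SIZE]
        chunk += ["nop"] * (BLOCK_SIZE - len(chunk))  # pad small blocks
        blocks.append(chunk)
    return blocks

# A 3-instruction basic block fills one cache block with 13 NOPs.
blocks = form_cache_blocks(["addu $2,$3,$4", "lw $5,0($2)", "beq $5,$0,foo"])
```

The fixed size is what "simplifies bookkeeping": every block's I-mem slot has the same size, so placement needs no free-list management.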
Runtime: Overview
Stays resident in I-mem
Receives requests from cache blocks
Checks whether the requested block is resident
Loads the new block from DRAM if necessary
–Evicts blocks to make room
Transfers control to the new block
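The hit/miss path the runtime follows can be sketched as below. All names and data structures here are illustrative assumptions (the real runtime manipulates I-mem and jumps in machine code, not a Python dict), but the control flow matches the steps listed: look up the block, fetch on a miss with FIFO eviction, then hand back the code to run.

```python
# Minimal sketch of the runtime's dispatch path (illustrative).
from collections import OrderedDict

CAPACITY = 4               # cache blocks that fit in I-mem (tiny, for demo)
resident = OrderedDict()   # virtual block address -> loaded code

def dram_fetch(vaddr):
    return f"<code for block {vaddr:#x}>"  # stand-in for a DRAM transfer

def dispatch(vaddr):
    if vaddr not in resident:              # miss
        if len(resident) >= CAPACITY:
            resident.popitem(last=False)   # evict the oldest block (FIFO)
        resident[vaddr] = dram_fetch(vaddr)
    return resident[vaddr]                 # transfer control to this code
```

With a capacity of 4, touching five distinct blocks evicts the first one loaded, exactly the FIFO behavior described on the next slide.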
Runtime Operation
[Diagram: loaded cache blocks enter the runtime system through separate entry points for branches, fall-throughs, and indirect jumps (JR); the miss handler sends requests to DRAM and receives the missing blocks in reply]
System Policies and Mechanisms
Fully-associative cache block placement
Replacement policy: FIFO
–Evict the oldest block in the cache
–Matches sequential execution
Pinned functions
–Key feature for timing predictability
–No cache overhead within a pinned function
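Victim selection under these policies can be sketched as below. The function name and representation are assumptions; the point is that pinned blocks are simply skipped during FIFO eviction, which is why code inside a pinned function never incurs cache overhead and stays predictable for real-time use.

```python
# Sketch of FIFO victim selection with pinned functions (illustrative).
def choose_victim(fifo_order, pinned):
    """Return the oldest resident block that is not pinned.

    fifo_order: resident block IDs, oldest first.
    pinned:     set of block IDs that must never be evicted.
    """
    for block in fifo_order:
        if block not in pinned:
            return block                   # oldest evictable block
    raise RuntimeError("all resident blocks are pinned; nothing to evict")
```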
Experimental Setup
Implemented for a tile in the Raw multicore processor
–Similar to many embedded processors
–32-bit single-issue in-order MIPS pipeline
–32 kB SRAM I-mem
Raw simulator
–Cycle-accurate
–Idealized I/O model
–SRAM I-mem or traditional hardware I-cache models
–Uses Wattch to estimate energy consumption
Mediabench benchmark suite
–Multimedia applications for embedded processors
Baseline Performance
[Chart: Flexicache overhead per benchmark]
Overhead: number of additional cycles relative to a 32 kB, 2-way HW cache
Outline
Introduction
Baseline Implementation
Optimizations
Energy
Conclusions
Basic Chaining
Problem: the hit case in the runtime system takes about 40 cycles
Solution: modify the jump to the runtime system so that it jumps directly to the loaded code the next time
[Diagram: without chaining, every transfer between blocks goes through the runtime system; with chaining, blocks jump directly to one another]
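The chaining idea can be sketched as below. This is an illustrative model under stated assumptions: the real system patches the jump instruction at the call site in I-mem, whereas here a dict stands in for the patched jumps. The first transfer from a site pays the slow runtime lookup; every later transfer from that site goes direct.

```python
# Sketch of basic chaining (illustrative, not the real Flexicache code).
chain = {}  # call site -> resident destination (the "patched" jump)

def runtime_lookup(dest):
    return f"block@{dest:#x}"  # stand-in for the ~40-cycle hit path

def transfer(site, dest):
    if site in chain:
        return chain[site], "direct"   # chained: runtime bypassed
    target = runtime_lookup(dest)      # slow path through the runtime
    chain[site] = target               # patch the jump for next time
    return target, "via-runtime"
```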
Basic Chaining Performance
[Chart: Flexicache overhead with basic chaining]
Function Call Chaining
Problem: function calls were not being chained
Compound instructions (like jump-and-link) handle two virtual addresses
–Load the return address into the link register
–Jump to the destination address
Solution:
–Decompose them in the rewriter
–The jump can then be chained normally at runtime
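The decomposition can be sketched as a rewriter transform. The mnemonics and the word-indexed return address are illustrative assumptions (real MIPS JAL writes PC+8 into $ra because of the delay slot, and the actual Flexicache rewriter works on binary encodings); the point is that after splitting, the remaining plain jump carries only one virtual address and can be chained like any other.

```python
# Sketch of decomposing jump-and-link in the rewriter (illustrative).
def decompose(instrs):
    """Replace each 'jal target' with an explicit link-register write
    followed by an ordinary, chainable jump."""
    out = []
    for pc, ins in enumerate(instrs):
        if ins.startswith("jal "):
            target = ins.split()[1]
            out.append(f"li $ra,{pc + 1}")  # return address, word-indexed
            out.append(f"j {target}")       # plain jump: chainable
        else:
            out.append(ins)
    return out
```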
Function Call Chaining Performance
[Chart: Flexicache overhead with function call chaining]
Replacement Policy
Problem: too much bookkeeping
–Chains must be backed out if the destination block is evicted
–Idea 1: with a FIFO replacement policy, there is no need to record chains from old blocks to young ones
–Idea 2: limit the number of chains to each block
Solution: flush replacement policy
–Evict everything and start fresh
–No need to undo or track chains
–Increased miss rate vs. FIFO
[Diagram: blocks ordered oldest to newest, with an unchaining table recording which blocks chain into each one]
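The bookkeeping contrast between the two policies can be sketched as below (all names and structures are illustrative assumptions). Under FIFO, evicting one block means consulting an unchaining table and backing out every chain into the victim; under flush, everything is simply discarded, so no per-chain state needs to exist at all.

```python
# Sketch of per-eviction bookkeeping: FIFO vs. flush (illustrative).
def evict_fifo(resident, chains_into, victim):
    """Evict one block under FIFO: chains into it must be backed out.
    Returns the number of call sites that had to be re-patched."""
    unchained = chains_into.pop(victim, [])   # unchaining table lookup
    resident.discard(victim)
    return len(unchained)

def flush(resident, chains_into):
    """Flush policy: evict everything and start fresh."""
    resident.clear()
    chains_into.clear()   # no chains left to track or undo
```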
Flush Policy Performance
[Chart: Flexicache overhead with the flush replacement policy]
Indirect Jump Chaining
Problem: a different destination on each execution
Solution: pre-screen addresses and chain each one individually
But…
–Screening takes time
–Which addresses should we chain?
[Diagram: JR $31 is replaced by a comparison chain: if $31==A, jump to A; if $31==B, jump to B; if $31==C, jump to C]
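The pre-screening comparison chain can be sketched as below (the addresses and function are illustrative assumptions; in the real system this is a sequence of compare-and-branch instructions emitted at the jump site). The register value is compared against each previously chained destination in turn; anything new still falls back to the runtime, which is why screening cost grows with the number of chained targets.

```python
# Sketch of a chained indirect jump's screening (illustrative).
chained_targets = [0x8400, 0x9000]   # destinations chained so far (assumed)

def indirect_jump(reg_value):
    for dest in chained_targets:     # the "if $31==A: JMP A" chain
        if reg_value == dest:
            return ("direct", dest)  # screened: jump straight there
    return ("runtime", reg_value)    # unscreened address: slow path
```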
Indirect Jump Chaining Performance
[Chart: Flexicache overhead with indirect jump chaining]
Fixed-size Block Padding
Padding for small blocks wastes more space than expected
–The average basic block contains 5.5 instructions
–The most common size is 3
–60-65% of storage space is wasted on NOPs
Example padded cache block:
00008400:
8400: mfsr  $r9,28
8404: rlm   $r9,$r9,0x4,0x0
8408: jnel+ $r9,$0,_dispatch.entry1
840c: jal   _dispatch.entry2
8410: nop
8414: nop
8418: nop
841c: nop
…
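A back-of-the-envelope check of the waste figure: with fixed 16-word blocks and an average basic block of 5.5 instructions, the rest of each block is padding. This simplification ignores the jumps the rewriter adds, so it lands slightly above the measured 60-65% range, but it shows why the waste is so large.

```python
# Rough consistency check of the padding-waste figure (simplified model:
# one basic block per 16-word cache block, rewriter-added jumps ignored).
BLOCK_SIZE = 16
avg_basic_block = 5.5
waste = (BLOCK_SIZE - avg_basic_block) / BLOCK_SIZE  # fraction wasted
# waste is about 0.66, close to the measured 60-65%
```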
8-word Cache Blocks
Reduce the cache block size to better fit basic blocks
–Less padding → less wasted space → lower miss rate
–Bookkeeping structures get bigger → higher miss rate
–More block splits → higher miss rate and overhead
Allow up to 4 consecutive blocks to be loaded together
–Effectively creates 8-, 16-, 24- and 32-word blocks
–Avoids splitting up large basic blocks
Performance benefits
–Amortizes the cost of a call into the runtime
–Overlaps DRAM fetches
–Eliminates jumps used to split large blocks
–Also used to add extra space for runtime JR chaining
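The grouping scheme can be sketched as below (function names are assumptions; valid for basic blocks that fit in one group, i.e. up to 32 instructions). Compare the padding against the fixed 16-word scheme: a 3-instruction block now wastes 5 words instead of 13, and a 20-instruction block fits in three consecutive sub-blocks with 4 words of padding instead of being split.

```python
# Sketch of composite 8-word block sizing (illustrative model).
import math

SUB_BLOCK = 8    # words per cache block
MAX_GROUP = 4    # up to 4 consecutive blocks loaded together

def blocks_needed(n_instructions):
    """Consecutive 8-word blocks for one basic block (capped at 4;
    anything larger must still be split)."""
    return min(math.ceil(n_instructions / SUB_BLOCK), MAX_GROUP)

def padding(n_instructions):
    """NOP words wasted, for blocks that fit in one group (<= 32)."""
    return blocks_needed(n_instructions) * SUB_BLOCK - n_instructions
```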
8-word Blocks Performance
[Chart: Flexicache overhead with 8-word blocks]
Performance Summary
Good performance on 6 of 9 benchmarks: 5-11% overhead
G721 (24.2% overhead)
–Indirect jumps
Mesa (24.4% overhead)
–Indirect jumps, high miss rate
Rasta (93.6% overhead)
–High miss rate, indirect jumps
The majority of the remaining overhead is due to modifications to user code, not runtime calls
–Fall-through jumps added by the rewriter
–Indirect jump chain comparisons
Outline
Introduction
Baseline Implementation
Optimizations
Energy
Conclusions
Energy Analysis
SRAM uses less energy than a cache for each access
–No tags and no unused cache ways
–Saves about 9% of total processor power
Additional instructions for software management use extra energy
–Total energy is roughly proportional to the number of cycles
The software I-cache will use less total energy if instruction overhead is below about 9%
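The break-even argument works out as follows under a simplification of the slide's reasoning (my model, not the paper's exact analysis): per-cycle power drops about 9% with the SRAM, while total energy scales with cycle count, so energy relative to the hardware cache is (1 + overhead) × (1 − 0.09). Solving for the break-even overhead gives 0.09/0.91 ≈ 9.9%, which the slide rounds to "about 9%".

```python
# Rough break-even model for the software I-cache's energy (simplified:
# energy = per-cycle power * cycles, with per-cycle power 9% lower).
power_saving = 0.09   # fraction of processor power saved by the SRAM

def relative_energy(overhead):
    """Energy vs. a hardware I-cache, for fractional cycle overhead."""
    return (1 + overhead) * (1 - power_saving)

# Break-even cycle overhead in this model: ~9.9%, i.e. "about 9%".
break_even = power_saving / (1 - power_saving)
```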
Energy Results
Wattch used with CACTI models for the SRAM and I-cache
–32 kB, 2-way set-associative HW cache, consuming 25% of total power
Total energy to complete each benchmark was calculated
Conclusions
Software-based instruction caching can be a practical solution for embedded processors
Provides the programming convenience of a HW cache
Performance and energy similar to a HW cache
–Overhead < 10% on several benchmarks
–Energy savings of up to 3.8%
Maintains the advantages of an I-cache-less architecture
–Low-cost hardware
–Real-time guarantees
http://cag.csail.mit.edu/raw
Questions?
http://cag.csail.mit.edu/raw