
1 Flexicache: Software-based Instruction Caching for Embedded Processors
Jason E. Miller and Anant Agarwal
Raw Group - MIT CSAIL

2 Outline
Introduction
Baseline Implementation
Optimizations
Energy
Conclusions

3 Hardware Instruction Caches
Used in virtually all high-performance general-purpose processors
Good performance
–Decreases average memory access time
Easy to use
–Transparent operation
(Diagram: processor and I-cache on chip, backed by off-chip DRAM)

4 I-Cache-less Processors
Embedded processors and DSPs
–TMS470, ADSP-21xx, etc.
Embedded multicore processors
–IBM Cell SPE
No special-purpose hardware
–Less design/verification time
–Less area
–Shorter cycle time
–Less energy per access
–Predictable behavior
Much harder to program!
–Manually partition code and transfer pieces from DRAM
(Diagram: processor with on-chip SRAM instruction memory, backed by off-chip DRAM)

5 Software-based I-Caching
Use a software system to virtualize instruction memory by recreating hardware cache functionality
Automatic management of simple SRAM memory
–Good performance with no extra programming effort
Integrated into each individual application
–Customized to program’s needs
–Optimize for different goals
–Real-time predictability
Maintain low-cost, high-speed hardware

6 Outline
Introduction
Baseline Implementation
Optimizations
Energy
Conclusions

7 Flexicache System Overview
(Diagram of the toolflow: the programmer's original binary passes through the rewriter; the rewritten binary is linked with the Flexicache runtime library to produce the Flexicache binary, which runs from the processor's I-mem with DRAM behind it)

8 Binary Rewriter
Break up user program into cache blocks
Modify control-flow that leaves the blocks
(Diagram: the binary rewriter transforms the binary to call into the Flexicache runtime)

9 Rewriter: Details
One basic block in each cache block, but…
–Fixed size of 16 instructions
  Simplifies bookkeeping
  Requires padding of small blocks and splitting of large ones (see the sketch below)
Control-flow instructions that leave a block are modified to jump to the runtime system
–E.g. BEQ $2,$3,foo → JEQL $2,$3,runtime
–Original destination addresses stored in a table
–Fall-through jumps added at the ends of blocks
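
To make the padding/splitting step concrete, here is a minimal C sketch of how a rewriter might carve one basic block into fixed 16-instruction cache blocks. All names, the NOP encoding, and the interface are illustrative assumptions, not the actual Flexicache rewriter (which also retargets branches and inserts fall-through jumps).

    #include <string.h>

    #define CACHE_BLOCK_WORDS 16
    #define NOP 0x00000000u   /* assumed NOP encoding, for illustration only */

    /* Copy one basic block into as many fixed-size cache blocks as needed,
     * padding the tail of the last one with NOPs. Returns blocks emitted. */
    static int emit_cache_blocks(const unsigned *bb, int bb_len,
                                 unsigned *out, int max_blocks)
    {
        int emitted = 0;
        while (bb_len > 0 && emitted < max_blocks) {
            int n = bb_len < CACHE_BLOCK_WORDS ? bb_len : CACHE_BLOCK_WORDS;
            memcpy(out, bb, (size_t)n * sizeof *bb);   /* real instructions */
            for (int i = n; i < CACHE_BLOCK_WORDS; i++)
                out[i] = NOP;                          /* padding */
            bb += n;
            bb_len -= n;
            out += CACHE_BLOCK_WORDS;
            emitted++;
        }
        return emitted;
    }

The padding loop is exactly where the wasted space discussed on slide 25 comes from: a 3-instruction basic block still occupies a full 16-word frame.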

10 Runtime: Overview
Stays resident in I-mem
Receives requests from cache blocks
Checks whether the requested block is resident
Loads new block from DRAM if necessary
–Evicts blocks to make room
Transfers control to the new block (hit/miss path sketched below)
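
A minimal C sketch of this hit/miss path. The frame table, function names, and the loader/eviction hooks are hypothetical; the real runtime is hand-tuned code resident in I-mem, not portable C.

    #include <stdint.h>
    #include <stddef.h>

    #define NUM_FRAMES 64               /* assumed I-mem capacity in frames */

    typedef void (*code_ptr)(void);

    struct frame {
        uint32_t vaddr;                 /* virtual address of cached block; 0 = empty */
        code_ptr code;                  /* where the block lives in I-mem */
    };

    static struct frame frames[NUM_FRAMES];

    extern code_ptr load_block_from_dram(uint32_t vaddr);  /* assumed loader */
    extern struct frame *evict_for(uint32_t vaddr);        /* replacement policy */

    /* Called whenever user code jumps back into the runtime. */
    void runtime_request(uint32_t dest_vaddr)
    {
        /* Fully-associative lookup: is the destination block resident? */
        for (size_t i = 0; i < NUM_FRAMES; i++) {
            if (frames[i].vaddr == dest_vaddr) {
                frames[i].code();       /* hit: transfer control */
                return;
            }
        }
        /* Miss: pick a victim frame, fetch the block, then run it. */
        struct frame *f = evict_for(dest_vaddr);
        f->vaddr = dest_vaddr;
        f->code  = load_block_from_dram(dest_vaddr);
        f->code();
    }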

11 Runtime Operation
(Diagram: loaded cache blocks (Block 0 through Block 3, …) enter the runtime system through several entry points (branch, fall-through, and indirect JR); on a miss, the miss handler sends a request to DRAM and receives the block in reply)

12 System Policies and Mechanisms
Fully-associative cache block placement
Replacement policy: FIFO (sketched below)
–Evict oldest block in cache
–Matches sequential execution
Pinned functions
–Key feature for timing predictability
–No cache overhead within function
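
One way to picture the FIFO policy: a single rotating head pointer that always names the oldest-loaded frame. A sketch in the same hypothetical C style as above:

    #include <stdint.h>

    #define NUM_FRAMES 64

    struct frame { uint32_t vaddr; void (*code)(void); };
    static struct frame frames[NUM_FRAMES];
    static unsigned fifo_head;                     /* index of the oldest frame */

    /* Evict the oldest frame and hand it back for reuse. */
    struct frame *evict_oldest(void)
    {
        struct frame *victim = &frames[fifo_head];
        victim->vaddr = 0;                         /* mark empty */
        fifo_head = (fifo_head + 1) % NUM_FRAMES;  /* next-oldest becomes head */
        return victim;
    }

No per-block age metadata is needed, which is part of why FIFO suits a software implementation: eviction is a pointer increment rather than a search.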

13 Experimental Setup
Implemented for a tile in the Raw multicore processor
–Similar to many embedded processors
–32-bit single-issue in-order MIPS pipeline
–32 kB SRAM I-mem
Raw simulator
–Cycle-accurate
–Idealized I/O model
–SRAM I-mem or traditional hardware I-cache models
–Uses Wattch to estimate energy consumption
Mediabench benchmark suite
–Multimedia applications for embedded processors

14 Baseline Performance
(Chart: Flexicache overhead per benchmark. Overhead: number of additional cycles relative to a 32 kB, 2-way HW cache)

15 Outline
Introduction
Baseline Implementation
Optimizations
Energy
Conclusions

16 Basic Chaining
Problem: Hit case in runtime system takes about 40 cycles
Solution: Modify the jump to the runtime system so that it jumps directly to the loaded code the next time (sketched below)
(Diagram: without chaining, every transfer among blocks A through D goes through the runtime system; with chaining, blocks jump directly to one another)
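
A sketch of what chaining amounts to: after a miss resolves, the runtime overwrites the jump-to-runtime at the requesting site with a direct jump to the loaded block. The encoder helper is an assumed name, not a real Flexicache routine; this works because Raw's I-mem is writable by software, which is how blocks get loaded in the first place.

    #include <stdint.h>

    /* Assumed helper that encodes a direct jump from 'site' to 'target'
     * in this ISA's instruction format. */
    extern uint32_t encode_direct_jump(const uint32_t *site, const uint32_t *target);

    /* Patch the requesting jump so the next execution skips the runtime. */
    void chain(uint32_t *call_site, uint32_t *loaded_block)
    {
        *call_site = encode_direct_jump(call_site, loaded_block);
    }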

17 Basic Chaining Performance
(Chart: Flexicache overhead per benchmark)

19 Function Call Chaining
Problem: Function calls were not being chained
Compound instructions (like jump-and-link) handle two virtual addresses
–Load return address into link register
–Jump to destination address
Solution:
–Decompose them in the rewriter (see the sketch below)
–Jump can be chained normally at runtime
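
A sketch of the decomposition. The emit() helper and the two instruction constructors are hypothetical rewriter internals; the point is that the jump half becomes an ordinary one-address jump, which the runtime can then chain like any other.

    #include <stdint.h>

    struct insn { uint32_t word; };

    /* Hypothetical rewriter helpers. */
    extern void emit(struct insn i);
    extern struct insn load_link_reg(uint32_t ret_vaddr);    /* materialize $ra */
    extern struct insn jump_to_runtime(uint32_t dest_vaddr); /* chainable jump  */

    /* Replace "JAL dest" at virtual address pc with two simple instructions. */
    void rewrite_jal(uint32_t pc, uint32_t dest)
    {
        emit(load_link_reg(pc + 4));   /* the *virtual* return address */
        emit(jump_to_runtime(dest));   /* now an ordinary jump: chainable */
    }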

20 Function Call Chaining Performance
(Chart: Flexicache overhead per benchmark)

21 Replacement Policy
Problem: Too much bookkeeping
–Chains must be backed out if the destination block is evicted
–Idea 1: With a FIFO replacement policy, no need to record chains from old to young
–Idea 2: Limit # of chains to each block
Solution: Flush replacement policy (sketched below)
–Evict everything and start fresh
–No need to undo or track chains
–Increased miss rate vs FIFO
(Diagram: blocks A through D ordered oldest to newest, with an unchaining table listing, for each block, the chains that would have to be backed out)
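
A sketch of the flush policy in the same hypothetical C style: when the cache fills, everything is discarded at once, so every chain dies along with the code it pointed into and no unchaining table is needed.

    #include <stdint.h>
    #include <string.h>

    #define NUM_FRAMES 64

    struct frame { uint32_t vaddr; void (*code)(void); };
    static struct frame frames[NUM_FRAMES];
    static unsigned next_free;                     /* frames fill in order */

    /* Hand out the next frame; flush the whole cache when full. */
    struct frame *alloc_frame(void)
    {
        if (next_free == NUM_FRAMES) {             /* cache full */
            memset(frames, 0, sizeof frames);      /* evict everything at once */
            next_free = 0;                         /* all chains die with the code */
        }
        return &frames[next_free++];
    }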

22 Flush Policy Performance
(Chart: Flexicache overhead per benchmark)

23 Indirect Jump Chaining
Problem: Different destination on each execution
Solution: Pre-screen addresses and chain each individually (sketched below)
But…
–Screening takes time
–Which addresses should we chain?
(Diagram: a JR $31 among blocks A, B, C is replaced by a screening sequence: if $31==A: JMP A; if $31==B: JMP B; if $31==C: JMP C)
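
The screening sequence, expressed as a C sketch rather than the emitted instructions it would really be. The chained targets and their virtual addresses are assumed names; each taken comparison costs a cycle or two, which is the screening overhead the slide mentions.

    #include <stdint.h>

    extern void runtime_indirect(uint32_t vaddr);    /* slow path: back to runtime */
    extern void (*block_A)(void), (*block_B)(void);  /* chained targets (assumed) */
    extern uint32_t VADDR_A, VADDR_B;                /* their virtual addresses   */

    /* Stands in for the instruction sequence that replaces "JR $31". */
    void screened_jr(uint32_t reg)
    {
        if (reg == VADDR_A) { block_A(); return; }   /* chained fast paths */
        if (reg == VADDR_B) { block_B(); return; }
        runtime_indirect(reg);                       /* unseen target: miss path */
    }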

24 Indirect Jump Chaining Performance
(Chart: Flexicache overhead per benchmark)

25 Fixed-size Block Padding
Padding for small blocks wastes more space than expected
–Average basic block contains 5.5 instructions
–Most common size is 3
–60-65% of storage space is wasted on NOPs

    00008400 :
        8400: mfsr  $r9,28
        8404: rlm   $r9,$r9,0x4,0x0
        8408: jnel+ $r9,$0,_dispatch.entry1
        840c: jal   _dispatch.entry2
        8410: nop
        8414: nop
        8418: nop
        841c: nop
        …

26 8-word Cache Blocks
Reduce cache block size to better fit basic blocks
–Less padding → less wasted space → lower miss rate
–Bookkeeping structures get bigger → higher miss rate
–More block splits → higher miss rate, overhead
Allow up to 4 consecutive blocks to be loaded together (see the sketch below)
–Effectively creates 8-, 16-, 24-, and 32-word blocks
–Avoids splitting up large basic blocks
Performance benefits
–Amortize cost of a call into the runtime
–Overlap DRAM fetches
–Eliminate jumps used to split large blocks
–Also used to add extra space for runtime JR chaining
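
A small sketch of the grouping decision, under the assumption (mine, not spelled out on the slide) that the rewriter simply rounds a basic block's size up to whole 8-word frames and caps the group at four:

    #define WORDS_PER_FRAME 8
    #define MAX_GROUP 4

    /* Number of consecutive frames to fetch for a basic block of bb_words
     * instructions; blocks larger than 32 words would still be split. */
    static int frames_to_fetch(int bb_words)
    {
        int n = (bb_words + WORDS_PER_FRAME - 1) / WORDS_PER_FRAME;  /* round up */
        return n > MAX_GROUP ? MAX_GROUP : n;
    }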

27 8-word Blocks Performance
(Chart: Flexicache overhead per benchmark)

28 Performance Summary
Good performance on 6 of 9 benchmarks: 5-11% overhead
G721 (24.2% overhead)
–Indirect jumps
Mesa (24.4% overhead)
–Indirect jumps, high miss rate
Rasta (93.6% overhead)
–High miss rate, indirect jumps
Majority of remaining overhead is due to modifications to user code, not runtime calls
–Fall-through jumps added by the rewriter
–Indirect jump chain comparisons

29 Outline
Introduction
Baseline Implementation
Optimizations
Energy
Conclusions

30 Energy Analysis
SRAM uses less energy than a cache for each access
–No tag checks or unused cache ways to read
–Saves about 9% of total processor power
Additional instructions for software management use extra energy
–Total energy roughly proportional to number of cycles
Software I-cache will use less total energy if instruction overhead is below 9% (a rough derivation follows)
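
A rough way to see where that threshold comes from (my arithmetic, not the paper's): with total energy roughly proportional to cycles, the hardware-cache configuration costs about E_hw = P × C, while the software version costs E_sw ≈ 0.91 × P × (1 + x) × C, where x is the cycle overhead and the 0.91 factor reflects the roughly 9% of processor power the SRAM saves. The software cache wins while

    0.91 × (1 + x) < 1,  i.e.  x < 0.09 / 0.91 ≈ 9.9%

which is consistent with the slide's "below 9%" rule of thumb.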

31 Energy Results
Wattch used with CACTI models for SRAM and I-cache
–32 kB, 2-way set-associative HW cache (25% of total power)
–Total energy to complete each benchmark was calculated

32 Conclusions
Software-based instruction caching can be a practical solution for embedded processors
Provides the programming convenience of a HW cache
Performance and energy similar to a HW cache
–Overhead < 10% on several benchmarks
–Energy savings of up to 3.8%
Maintains advantages of I-cache-less architectures
–Low-cost hardware
–Real-time guarantees
http://cag.csail.mit.edu/raw

33 Questions? http://cag.csail.mit.edu/raw

