Block Cache for Embedded Systems
Dominic Hillenbrand and Jörg Henkel
Chair for Embedded Systems (CES), University of Karlsruhe, Karlsruhe, Germany
(2) Outline
- Motivation
- Related Work
- State of the art: "Instruction Cache"
- Our approach: "Block Cache"
- Workflow (Instruction Selection / Simulation)
- Assumptions & Constraints
- Algorithm
- Results
- Summary
(3) Motivation
[Diagram labels: CPU, I-Cache, Block Cache (on-chip); off-chip memory]
- On-chip area is expected to increase enormously.
- Latency lags bandwidth (David A. Patterson, "Latency lags bandwidth", Commun. ACM, 2004).
- Concerns: efficiency, power consumption, area.
- Block Cache: 1..N memory blocks of instructions (SRAM cells).
- Caches generally consume more power than on-chip memory [2,3,4].
(4) Related Work
- S. Steinke, L. Wehmeyer, B. Lee, P. Marwedel, "Assigning Program and Data Objects to Scratchpad for Energy Reduction", DATE '02
  Statically partitions on- and off-chip memory.
- S. Steinke, N. Grunwald, L. Wehmeyer, R. Banakar, M. Balakrishnan, P. Marwedel, "Reducing energy consumption by dynamic copying of instructions to on-chip memory", ISSS '02
  Statically determines code copying points.
- P. Francesco, P. Marchal, D. Atienza, L. Benini, F. Catthoor, J. Mendias, "An integrated hw/sw-approach for run-time scratchpad management", DAC '04
  DMA-accelerated on-chip memory for data.
- B. Egger, J. Lee, H. Shin, "Scratchpad memory management for portable systems with a memory management unit", EMSOFT '06
  Uses the MMU to map between on- and off-chip memory (we share the µTLB).
(5) "State of the Art": Instruction Cache
[Diagram labels: CPU, I-Cache, Block Cache (on-chip); off-chip memory]
(6) Architecture: Instruction Cache
[Diagram labels: address fields (tag, offset), tag comparators (=), tag MUX, set MUX, data]
(7) "State of the Art": Instruction Cache; Our approach: Block Cache
[Diagram labels: CPU, I-Cache, Block Cache (on-chip); off-chip memory]
(8) Our approach: Block Cache
Memory blocks B1, B2, ..., BN (SRAM cells) + logic
(9) Architectural Overview: Block Cache
[Diagram labels: CPU, instruction address, µTLB (address comparison, =), memory blocks B1, B2, ..., BN (on-chip), control unit, DMA block load, off-chip memory, instructions]
- Exploits burst transfers (DRAM memory)
- Area efficient (SRAM cells)
- Scalable (up to application size)
(11) Architectural Overview: Block Cache
Each on-chip memory block (B1, B2, ..., BN) holds 1..N functions (F1, F2, ..., FN).
A function is a piece of the binary, shown in assembler as: PUSH R1, PUSH R2, …., POP R2, POP R1, RET.
(12) Function to Block Mapping
[Diagram: functions F1 to F20 are mapped onto blocks B1, B2, B3; F6 is shown split into F6a, F6b, F6c]
Eviction: LRU, Round Robin, ARC, Belady
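A minimal sketch of how a function-to-block mapping and LRU eviction could be simulated. The function name `simulate_lru`, the mapping, the trace, and the block size are hypothetical and chosen only for illustration; the sketch also assumes that every miss copies one whole block from off-chip memory.

```python
from collections import OrderedDict


def simulate_lru(trace, func_to_block, num_blocks, block_size):
    """Replay a function-level trace against an LRU-managed block cache.

    Returns (misses, transferred_bytes). Assumption: each miss loads
    exactly one block of block_size bytes.
    """
    resident = OrderedDict()  # block id -> None, least recently used first
    misses = 0
    for func in trace:
        block = func_to_block[func]
        if block in resident:
            resident.move_to_end(block)   # hit: mark as most recently used
            continue
        misses += 1                       # miss: block must be loaded
        if len(resident) >= num_blocks:
            resident.popitem(last=False)  # evict the least recently used block
        resident[block] = None
    return misses, misses * block_size


# Hypothetical mapping: F1 and F2 share block B1.
mapping = {"F1": "B1", "F2": "B1", "F3": "B2", "F4": "B3"}
trace = ["F1", "F2", "F3", "F1", "F4", "F3", "F1"]
misses, transferred = simulate_lru(trace, mapping, num_blocks=2, block_size=1024)
print(misses, transferred)
```

Placing F1 and F2 in the same block turns the second call into a hit; with only two resident blocks, the later alternation between B1, B2, and B3 causes evictions.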
(13) Design Flow: Analysis
- Software component + input data / parameters -> instrumented execution / simulation -> executed instruction trace -> dynamic call graph
- Disassemble -> static call graph
- Trace: function enter/exit, function address
- Functions not called during profiling need to be included as well
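The analysis step can be sketched as follows: a call stack is replayed over the enter/exit trace to count caller-to-callee edges, and edges known only from the static call graph are added with weight 0 so that functions not called during profiling are still included. The event format and the names `build_dynamic_call_graph` and `with_static_edges` are assumptions made for this sketch, not the tool's actual interface.

```python
from collections import Counter


def build_dynamic_call_graph(events):
    """events: iterable of ("enter" | "exit", function_name) pairs.

    Returns a Counter mapping (caller, callee) to the number of calls.
    """
    stack = []
    edges = Counter()
    for kind, func in events:
        if kind == "enter":
            if stack:
                edges[(stack[-1], func)] += 1  # caller is the top of the stack
            stack.append(func)
        elif kind == "exit":
            if not stack or stack[-1] != func:
                raise ValueError(f"unbalanced exit for {func}")
            stack.pop()
        else:
            raise ValueError(f"unknown event kind: {kind}")
    return edges


def with_static_edges(dynamic_edges, static_edges):
    """Add statically known edges that never occurred in the trace (weight 0)."""
    merged = Counter(dynamic_edges)
    for edge in static_edges:
        merged.setdefault(edge, 0)
    return merged


events = [
    ("enter", "main"),
    ("enter", "f"), ("exit", "f"),
    ("enter", "f"), ("exit", "f"),
    ("enter", "g"), ("exit", "g"),
    ("exit", "main"),
]
edges = build_dynamic_call_graph(events)
merged = with_static_edges(edges, [("main", "f"), ("main", "h")])
print(dict(merged))
```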
(14) Assumptions & Constraints
- Software behavior analysis
  - Component level
  - Trace composition reflects deployment usage (parameters / input set)
- Hardware
  - External memory: high bandwidth / high latency
  - Block size (fixed) / number of code blocks (fixed)
- Compiler / linker
  - Function splitting (function size < block size)
(15) Design Flow: Analysis
- Application (component) + input data / parameters -> instrumented execution / simulation -> executed instruction trace -> dynamic call graph
- Disassemble -> static call graph
- Trace: function enter/exit, function address
(16) Design Flow: Block Composition
Dynamic call graph + static call graph -> block composition algorithm -> linker file
(17) Design Flow: Re-linking
Original binary (Function 1, Function 2, Function 3, Function 4, Function 5, Function 6, ….) + linker file -> re-linked binary (Code block 1, Code block 2, Code block 3, Code block 4)
(18) Design Flow: Re-linking
[Diagram labels: code block 1; original code section size; code section size after re-linking; data section size; function reference; function pointer; data reference]
Compiler supplies: relocation table, symbol table, ELF headers
(19) Overview: Algorithm
- Input: dynamic function call graph (node = function)
- Output: block graph (node = 1..n functions)
- 3 steps (they differ in merging distance):
  (1) combine_neighbor
  (2) merge_direct_children
  (3) bubble_merge
- Challenge: "Merge appropriate functions into a block"
(20) Algorithm Step 1/3: combine_neighbor
[Figure: dynamic call graph of functions F1 to F9 with weighted edges (e.g. 1e6, 1e8); nodes are annotated with the function size (architecture)]
(21) Algorithm Step 1/3: combine_neighbor
[Figure: dynamic call graph with the combined node F4,7; label: centrality measure 0.00]
(22) Algorithm Step 2/3: merge_direct_children
[Figure: dynamic call graph fragment with F3 and the nodes F5, F6, F7, F8; edge weights 30, 1e6, 1e6, 1e4]
(23) Algorithm Step 2/3: merge_direct_children
[Figure: F6 and F7 are merged into F6,7, then F8 is added to form F6,7,8; F5 stays separate]
(24) Algorithm Step 2/3: merge_direct_children
[Figure: result with F3 and the nodes F5 and F6,7,8; the merged edge weight is 1e6+1e6+1e4]
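The merge shown on the merge_direct_children slides can be imitated with a small greedy sketch. This is not the authors' algorithm: it assumes that children of a common parent are visited in order of descending edge weight and merged while the combined size still fits into one block. The function sizes and the block size below are hypothetical; only the edge weights (30, 1e6, 1e6, 1e4) come from the slides.

```python
def merge_direct_children(children, sizes, block_size):
    """Greedily merge children of one parent into a single node.

    children: {child: edge weight from the common parent}
    sizes:    {child: size in bytes}
    Returns a list of (functions, combined_edge_weight) nodes.
    Assumption: block capacity is the only criterion that stops a merge.
    """
    merged, merged_size, merged_weight = [], 0, 0
    remaining = []
    by_weight = sorted(children.items(), key=lambda kv: kv[1], reverse=True)
    for child, weight in by_weight:
        if merged_size + sizes[child] <= block_size:
            merged.append(child)           # child joins the merged node
            merged_size += sizes[child]
            merged_weight += weight        # edge weights add up
        else:
            remaining.append(([child], weight))
    result = [(merged, merged_weight)] if merged else []
    return result + remaining


children = {"F5": 30, "F6": 1_000_000, "F7": 1_000_000, "F8": 10_000}
sizes = {"F5": 400, "F6": 300, "F7": 300, "F8": 300}  # hypothetical sizes in bytes
nodes = merge_direct_children(children, sizes, block_size=1024)
print(nodes)
```

With these assumed sizes the sketch reproduces the outcome on the slide: F6, F7, and F8 end up in one node whose edge weight is 1e6+1e6+1e4, while F5 stays separate.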
(25) Algorithm Step 3/3: bubble_merge
[Figure: dynamic call graph of functions F1 to F9 with weighted edges (e.g. 1e6, 1e8); nodes F5, F6, F7, F1, F3 are shown separately]
(26) Algorithm Step 3/3: bubble_merge
[Figure: dynamic call graph of functions F1 to F9 with weighted edges (e.g. 1e6, 1e8); nodes F5, F6, F7, F1, F3, F4, F8, F2, F9 are shown separately]
(27) Algorithm Step 3/3: bubble_merge
[Figure: dynamic call graph with weighted edges (e.g. 1e6, 1e8); F3 and F8 are merged into node F3,F8]
(28) Results
What is interesting?
- Memory efficiency: block fragmentation
- Technology scaling: misses
- Energy: amount of transferred data
- Performance: number of cycles
Benchmark: MediaBench (CJPEG)
(29) Results: Block Fragmentation
CJPEG, JPEG encoding (MediaBench)
[Plot: function size distribution; block size [Byte]; x-axis: binary size [Byte]]
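One way to quantify block fragmentation, sketched under the assumption that it is defined as the unused share of the total block capacity (the slides do not give the exact definition). The block contents, function sizes, and block size are hypothetical.

```python
def block_fragmentation(blocks, sizes, block_size):
    """Return the unused fraction of the total block capacity.

    blocks: list of lists of function names (one list per block)
    sizes:  {function: size in bytes}
    Assumed definition: 1 - used_bytes / (number_of_blocks * block_size).
    """
    if not blocks:
        raise ValueError("at least one block is required")
    used = [sum(sizes[f] for f in funcs) for funcs in blocks]
    if any(u > block_size for u in used):
        raise ValueError("a block exceeds the fixed block size")
    return 1 - sum(used) / (block_size * len(blocks))


sizes = {"F1": 256, "F2": 512, "F3": 512}  # hypothetical sizes in bytes
blocks = [["F1", "F2"], ["F3"]]            # hypothetical block composition
frag = block_fragmentation(blocks, sizes, block_size=1024)
print(frag)
```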
(30) Results: Misses, LRU [6-12 blocks]
CJPEG, JPEG encoding (MediaBench)
[Plot: x-axis: total cache size [Byte]]
(31) Results: Transferred Code, LRU [6-12 blocks]
CJPEG, JPEG encoding (MediaBench)
[Plot: x-axis: total cache size [Byte]]
(32) Results: Transferred Code, LRU/ARC/RR [8 blocks]
CJPEG, JPEG encoding (MediaBench)
[Plot: x-axis: total cache size [Byte]]
(33) Results: Copy Cycles, LRU [6-12 blocks]
CJPEG, JPEG encoding (MediaBench)
[Plot: x-axis: total cache size [Byte]]
(34) Summary
- Introduced: Block Cache for Embedded Systems
  - Area increase / external memory latency
  - Utilization / suitability of traditional designs
  - Scalability: on-chip memories (megabytes)
- Block Cache:
  - Hardware: simple hardware structure, logic + memory (SRAM, not cache memory)
  - Design flow: execute software component, block composition (algorithm, 3 steps), re-link the binary
  - Results: exploits high-bandwidth memory; good performance
(35) References
[1] D. A. Patterson, "Latency lags bandwidth", Commun. ACM, 2004
[2] R. Banakar, S. Steinke, B. Lee, M. Balakrishnan, P. Marwedel, "Scratchpad memory: Design alternative for cache on-chip memory in embedded systems", CODES, 2002
[3] F. Angiolini, F. Menichelli, A. Ferrero, L. Benini, M. Olivieri, "A post compiler approach to scratchpad mapping of code", CASES, 2004
[4] S. Steinke, L. Wehmeyer, B. Lee, P. Marwedel, "Assigning program and data objects to scratchpad for energy reduction", DATE, 2002
(37) Motivation
[Diagram labels: off-chip memory; several CPUs, each with an I-Cache]
- Bandwidth improves, but latency does not [1]
- Caches generally consume more power than on-chip memory [2,3,4]
- A significant amount of power will be spent in the memory hierarchy
- On-chip area will increase enormously
(38) Motivation
[Diagram labels: off-chip memory; several CPUs, each with an I-Cache]
(39) Motivation
[Diagram labels: off-chip memory; several CPUs, each with an I-Cache; B-Cache]
(40) Architectural Overview: Block Cache
[Diagram labels: CPU, instruction address, µTLB (entries such as Addr. B1, address comparison, =), block status, code blocks B1, B2, B3, … (on-chip), control unit, DMA block load, off-chip memory, instructions]
- Exploits burst transfers (DRAM memory)
- Area efficient (SRAM cells)
- Scalable (up to application size)
(41) Function to Block Mapping
[Diagram: functions F1 to F20 are mapped onto blocks B1 and B2]
- Exploits burst transfers (DRAM memory)
- Area efficient (SRAM cells)
- Scalable (up to application size)