Published by Charlene Riley. Modified over 8 years ago.
1
Block Cache for Embedded Systems
Dominic Hillenbrand and Jörg Henkel
Chair for Embedded Systems (CES), University of Karlsruhe, Karlsruhe, Germany
2
(2) Outline
- Motivation
- Related Work
- State of the art: “Instruction Cache”
- Our approach: “Block Cache”
- Workflow (Instruction Selection / Simulation)
- Assumptions & Constraints
- Algorithm
- Results
- Summary
3
(3) Motivation
[Diagram: CPU with I-Cache on-chip, memory off-chip. Block Cache: 1..N memory blocks of instructions (SRAM cells).]
- Area is expected to increase enormously(!)
- David A. Patterson, “Latency lags bandwidth”, Commun. ACM, 2004
- Efficiency, power consumption, area
- Generally, caches consume more power than on-chip memory [1,2,3]
4
(4) Related Work
- S. Steinke, L. Wehmeyer, B. Lee, P. Marwedel, “Assigning Program and Data Objects to Scratchpad for Energy Reduction”, DATE ’02
  Statically partition on- and off-chip memory
- S. Steinke, N. Grunwald, L. Wehmeyer, R. Banakar, M. Balakrishnan, P. Marwedel, “Reducing energy consumption by dynamic copying of instructions to on-chip memory”, ISSS ’02
  Statically determine code copying points
- P. Francesco, P. Marchal, D. Atienza, L. Benini, F. Catthoor, J. Mendias, “An integrated hw/sw-approach for run-time scratchpad management”, DAC ’04
  DMA for acceleration in on-chip memory for data
- B. Egger, J. Kee, H. Shin, “Scratchpad memory management for portable systems with a memory management unit”, EMSOFT ’06
  MMU to map between on- and off-chip memory (we share the µTLB)
5
(5) “State of the Art”: Instruction Cache
[Diagram labels: CPU, I-Cache, Block Cache, on-chip, off-chip, off-chip memory]
6
(6) Architecture: Instruction Cache
[Diagram: address fields (tag, offset), stored tags (T) with comparators (=), tag MUX, set MUX, data with offsets (O)]
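The tag/offset split and the tag comparators named on this slide can be illustrated with a small software model. This is only a sketch under assumptions: the line size, number of sets, associativity, and the LRU replacement inside a set are hypothetical values chosen for illustration and do not come from the slides.

```python
class SetAssociativeCache:
    """Toy model of a set-associative instruction cache (all sizes are assumptions)."""

    def __init__(self, line_size=16, num_sets=4, ways=2):
        self.line_size = line_size
        self.num_sets = num_sets
        self.ways = ways
        # Each set keeps its resident tags, least recently used first.
        self.sets = [[] for _ in range(num_sets)]

    def split(self, address):
        """Split an address into (tag, set index, offset)."""
        offset = address % self.line_size
        index = (address // self.line_size) % self.num_sets
        tag = address // (self.line_size * self.num_sets)
        return tag, index, offset

    def access(self, address):
        """Return True on a hit, False on a miss (the line is then filled)."""
        tag, index, _ = self.split(address)
        tags = self.sets[index]
        if tag in tags:              # models the parallel tag comparators
            tags.remove(tag)
            tags.append(tag)         # mark as most recently used
            return True
        if len(tags) == self.ways:   # set is full: evict the LRU tag
            tags.pop(0)
        tags.append(tag)
        return False
```

Every access pays for the tag comparison across all ways of a set, which is the overhead the block cache on the following slides tries to avoid.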
7
(7) “State of the Art”: Instruction Cache / Our approach: Block Cache
[Diagram labels: CPU, I-Cache, Block Cache, on-chip, off-chip, off-chip memory]
8
(8) Our approach: Block Cache
[Diagram: logic + memory blocks B1, B2, .., BN (SRAM cells)]
9
(9) Architectural Overview: Block Cache
[Diagram: off-chip memory, DMA (block load), control unit, on-chip memory blocks B1, B2, .., BN, µTLB (= address), CPU (instruction fetch)]
- Exploit burst transfers (DRAM memory)
- Area efficient (SRAM cells)
- Scalable (up to application size)
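A behavioral sketch of the fetch path on this slide may help: an address is checked against the resident blocks (the role of the µTLB), and on a miss a whole block is loaded (the role of the control unit and DMA). The class and method names are hypothetical, blocks are assumed to be aligned address ranges of a fixed size, and LRU eviction is assumed; none of this is taken from the authors' implementation.

```python
class BlockCache:
    """Toy model of a block cache with whole-block loads (assumed LRU eviction)."""

    def __init__(self, num_blocks, block_size):
        self.num_blocks = num_blocks
        self.block_size = block_size
        self.resident = []      # resident block ids, least recently used first
        self.block_loads = 0    # number of block loads from off-chip memory

    def fetch(self, address):
        """Return True if the instruction is on-chip, False if a block load was needed."""
        block_id = address // self.block_size
        if block_id in self.resident:           # address matches a resident block
            self.resident.remove(block_id)
            self.resident.append(block_id)
            return True
        if len(self.resident) == self.num_blocks:
            self.resident.pop(0)                # evict the least recently used block
        self.resident.append(block_id)          # whole block is copied in one transfer
        self.block_loads += 1
        return False

    @property
    def bytes_transferred(self):
        """Total bytes copied from off-chip memory so far."""
        return self.block_loads * self.block_size
```

Because each miss moves a full block, the model makes the trade-off visible: few, large transfers that suit burst-capable DRAM, at the cost of copying code that may not be executed.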
11
(11) Architectural Overview: Block Cache
[Diagram: on-chip memory blocks B1, B2, .., BN; each block = 1..N function(s) F1, F2, …, FN]
[Function shown as binary (010101 010010 110111 100010) and as assembler (PUSH R1, PUSH R2, …, POP R2, POP R1, RET)]
12
(12) Function to Block Mapping
[Diagram: functions F1, F2, …, F20, FN (F6 also shown as parts F6a, F6b, F6c) mapped onto blocks B1, B2, B3]
- Eviction: LRU, Round Robin, ARC, Belady
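Three of the eviction policies listed on this slide can be compared on a trace of block accesses with a few lines of code. The sketch below is an illustration only: the function name and the trace are hypothetical, round robin is modeled as first-in-first-out replacement, and ARC is left out because it needs considerably more state.

```python
def count_misses(trace, capacity, policy="lru"):
    """Count block misses for a trace of block ids under 'lru', 'rr', or 'belady'."""
    resident = []   # resident block ids; order encodes age (oldest first)
    misses = 0
    for position, block in enumerate(trace):
        if block in resident:
            if policy == "lru":
                # A hit refreshes the block's age only under LRU.
                resident.remove(block)
                resident.append(block)
            continue
        misses += 1
        if len(resident) == capacity:
            if policy == "belady":
                future = trace[position + 1:]

                def next_use(candidate):
                    # Blocks never used again get the largest distance.
                    return future.index(candidate) if candidate in future else len(future)

                resident.remove(max(resident, key=next_use))
            else:
                resident.pop(0)   # LRU: least recently used; RR: oldest loaded
        resident.append(block)
    return misses
```

Belady's policy needs the future of the trace, so it serves as a lower bound for comparison rather than as something a control unit could implement.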
13
(13) Design Flow: Analysis
[Flow diagram]
- Inputs: software component, input data / parameters
- Instrumented execution / simulation
- Executed instruction trace (trace: function enter/exit, function address)
- Dynamic call graph
- Disassemble, static call graph
- + Functions not called during profiling (need to be included)
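Turning a function enter/exit trace into a dynamic call graph can be done with a call stack. The following is a minimal sketch, assuming the trace is a list of `("enter" | "exit", function_name)` pairs; that event format and the function name are assumptions made for illustration, not the authors' trace format.

```python
from collections import Counter


def dynamic_call_graph(trace):
    """Return a Counter mapping (caller, callee) to the number of observed calls."""
    edges = Counter()
    stack = []   # currently active functions, innermost last
    for event, function in trace:
        if event == "enter":
            if stack:
                edges[(stack[-1], function)] += 1
            stack.append(function)
        elif event == "exit":
            if not stack or stack[-1] != function:
                raise ValueError(f"unbalanced exit from {function}")
            stack.pop()
        else:
            raise ValueError(f"unknown event {event!r}")
    return edges
```

Functions that never appear in the trace produce no edges, which is why the static call graph is still needed to cover code that profiling did not reach.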
14
(14) Besides: Assumptions & Constraints
- Software behavior analysis
  - Component level
  - Trace composition reflects deployment usage (parameters / input set)
- Hardware
  - External memory: high bandwidth / high latency
  - Block size (fixed) / number of code blocks (fixed)
- Compiler / linker
  - Function splitting (function size < block size)
15
(15) Design Flow: Analysis
[Flow diagram]
- Inputs: application (component), input data / parameters
- Instrumented execution / simulation
- Executed instruction trace (trace: function enter/exit, function address)
- Dynamic call graph
- Disassemble, static call graph
16
(16) Design Flow: Block composition
[Flow diagram: dynamic call graph + static call graph, block composition algorithm, linker file]
17
(17) Design Flow: Re-linking
[Flow diagram: original binary (Function 1, Function 2, …, Function 6, ….) + linker file, re-linked binary (code block 1, code block 2, code block 3, code block 4)]
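The slides do not show what the generated linker file looks like. As one possible illustration, the sketch below emits a GNU ld style fragment for the body of a `.text` output section. It assumes that each function was compiled into its own input section named `.text.<function>` (for example with GCC's `-ffunction-sections`) and that blocks are aligned to the block size; the function name and these choices are assumptions, not the authors' tool flow.

```python
def linker_fragment(blocks, block_size):
    """Return linker-script lines that place each block's functions contiguously.

    blocks: list of lists of function names, one inner list per code block.
    block_size: alignment of each block in bytes (assumed equal to the block size).
    """
    lines = []
    for number, functions in enumerate(blocks, start=1):
        lines.append(f"/* code block {number} */")
        lines.append(f". = ALIGN({block_size});")
        for name in functions:
            # One input section per function (assumes -ffunction-sections).
            lines.append(f"*(.text.{name})")
    return "\n".join(lines)


print(linker_fragment([["F1", "F2"], ["F3"]], 1024))
```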
18
(18) Design Flow: Re-linking
[Diagram: code block 1; original code section size; code section size after re-linking; data section size; function reference, function pointer, data reference]
- Compiler supplies: relocation table, symbol table, ELF headers
19
(19) Overview: Algorithm
- Input: dynamic function call graph (node = function)
- Output: block graph (node = 1..n functions)
- 3 steps (differ in merging distance):
  (1) combine_neighbor
  (2) merge_direct_children
  (3) bubble_merge
- Challenge: “Merge appropriate functions into a block”
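The slides name the three merging steps but do not spell out their rules. To make the input/output contract concrete (function graph in, block graph out), here is a generic greedy stand-in, not the authors' three-step algorithm: it visits call-graph edges from most to least frequent and merges the two blocks at the ends of an edge whenever the combined code still fits into one block. All names, sizes, and call counts are hypothetical.

```python
def greedy_block_composition(sizes, call_counts, block_size):
    """Group functions into blocks by descending call frequency (illustrative only).

    sizes: dict function -> code size in bytes
    call_counts: dict (caller, callee) -> number of calls
    block_size: maximum number of bytes per block
    """
    block_of = {f: f for f in sizes}          # function -> block representative
    members = {f: [f] for f in sizes}         # representative -> functions in block
    block_bytes = dict(sizes)                 # representative -> bytes used
    hottest_first = sorted(call_counts.items(), key=lambda item: item[1], reverse=True)
    for (caller, callee), _count in hottest_first:
        a, b = block_of[caller], block_of[callee]
        if a == b:
            continue                          # already in the same block
        if block_bytes[a] + block_bytes[b] > block_size:
            continue                          # merged block would not fit
        for function in members[b]:
            block_of[function] = a
        members[a].extend(members.pop(b))
        block_bytes[a] += block_bytes.pop(b)
    return sorted(sorted(group) for group in members.values())
```

With sizes `{"F1": 300, "F2": 200, "F3": 400, "F4": 500}`, a 1024-byte block, and F2 calling F3 far more often than any other pair, the sketch groups F1, F2, and F3 and leaves F4 alone because it no longer fits.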
20
(20) Algorithm Step 1/3: combine_neighbor
[Diagram: dynamic call graph with nodes F1 to F9; node size: function size (architecture); numeric labels: 100, 1, 1e4, 1e8, 4, 1e10, 30, 1e6, 1e6]
21
(21) Algorithm Step 1/3: combine_neighbor
[Diagram: dynamic call graph with nodes F1 to F9 and merged node F4,7; numeric labels: 1, 1e4, 1e8, 4, 1e10, 30, 1e6, 1e6, 100]
[Centrality measure: 1.00, 0.99, 0.01, and 0.00 for the remaining nodes]
22
(22) Algorithm Step 2/3: merge_direct_children
[Diagram: dynamic call graph excerpt with F3 and nodes F5, F6, F7, F8; numeric labels: 30, 1e6, 1e6, 1e4]
23
(23) Algorithm Step 2/3: merge_direct_children
[Diagram: dynamic call graph excerpt with F3 and nodes F5, F6, F7, F8; numeric labels: 30, 1e6, 1e6, 1e4; merge progression: F6,7 then F6,7,8]
24
(24) Algorithm Step 2/3: merge_direct_children
[Diagram: dynamic call graph excerpt with F3, F5, and merged node F6,7,8 labeled 1e6+1e6+1e4; remaining labels: 30, 1e6, 1e6, 1e4]
25
(25) Algorithm Step 3/3: bubble_merge
[Diagram: dynamic call graph with nodes F1 to F9; numeric labels: 100, 1, 1e4, 1e8, 4, 1e10, 30, 1e6, 1e6; node sequence shown: F5, F6, F7, F1, F3]
26
(26) Algorithm Step 3/3: bubble_merge
[Diagram: dynamic call graph with nodes F1 to F9; numeric labels: 100, 1, 1e4, 1e8, 4, 1e10, 30, 1e6, 1e6; node sequence shown: F5, F6, F7, F1, F3, F4, F8, F2, F9]
27
(27) Algorithm Step 3/3: bubble_merge
[Diagram: dynamic call graph; numeric labels: 100, 1, 1e4, 1e8, 4, 1e10, 30, 1e6, 1e6; node sequence shown: F3, F5, F6, F7, F4, F8, F2, F9, F1; merged node F3,F8]
28
(28) Results
What is interesting?
- Memory efficiency: block fragmentation
- Technology scaling: misses
- Energy: amount of transferred data
- Performance: number of cycles
Benchmark: MediaBench (CJPEG)
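Two of these metrics are simple to compute once a block composition and a miss count are known. The slides do not give formulas, so the definitions below are assumptions: fragmentation is taken as the unused fraction of the allocated blocks, and transferred data as one full block per miss.

```python
def block_fragmentation(blocks, block_size):
    """Fraction of block capacity left unused.

    blocks: list of lists with the sizes (in bytes) of the functions in each block.
    """
    used = sum(sum(function_sizes) for function_sizes in blocks)
    capacity = len(blocks) * block_size
    return (capacity - used) / capacity


def transferred_bytes(misses, block_size):
    """Bytes copied from off-chip memory, assuming every miss loads one full block."""
    return misses * block_size


# Hypothetical example: two 1024-byte blocks holding 900 and 500 bytes of code.
print(block_fragmentation([[300, 200, 400], [500]], 1024))
print(transferred_bytes(6, 1024))
```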
29
(29) Results: Block Fragmentation
[Plot: CJPEG, JPEG encoding (MediaBench); function size distribution; block size [Byte]; x-axis: binary size [Byte]]
30
(30) Results: Misses, LRU [6-12 blocks]
[Plot: CJPEG, JPEG encoding (MediaBench); x-axis: total cache size [Byte]]
31
(31) Results: Transferred Code, LRU [6-12 blocks]
[Plot: CJPEG, JPEG encoding (MediaBench); x-axis: total cache size [Byte]]
32
(32) Results: Transferred Code, LRU/ARC/RR [8 blocks]
[Plot: CJPEG, JPEG encoding (MediaBench); x-axis: total cache size [Byte]]
33
(33) Results: Copy Cycles, LRU [6-12 blocks]
[Plot: CJPEG, JPEG encoding (MediaBench); x-axis: total cache size [Byte]]
34
(34) Summary
- Introduced: Block Cache for Embedded Systems
  - Area increase / external memory latency
  - Utilization / suitability of traditional designs
  - Scalability: on-chip memories (megabytes)
- Block Cache:
  - Hardware
    - Simple hardware structure: logic + memory (SRAM, not cache memory)
  - Design flow
    - Execute software component, block composition (algorithm, 3 steps), re-link the binary
  - Results
    - Exploits high-bandwidth memory
    - Good performance
35
(35) References
[1] D. A. Patterson, “Latency lags bandwidth”, Commun. ACM, 2004
[2] R. Banakar, S. Steinke, B. Lee, M. Balakrishnan, P. Marwedel, “Scratchpad memory: Design alternative for cache on-chip memory in embedded systems”, CODES, 2002
[3] F. Angiolini, F. Menichelli, A. Ferrero, L. Benini, M. Oliveri, “A post compiler approach to scratchpad mapping of code”, CASES, 2004
[4] S. Steinke, L. Wehmeyer, B. Lee, P. Marwedel, “Assigning program and data objects to scratchpad for energy reduction”, DATE, 2002
36
(36)
37
(37) Motivation
[Diagram: CPU with I-Cache, off-chip memory]
- Bandwidth improves, but latency does not [1]
- Generally, caches consume more power than on-chip memory [2,3,4]
- A significant amount of power will be spent in the memory hierarchy
- On-chip area will increase enormously
38
(38) Motivation
[Diagram: CPU with I-Cache, off-chip memory]
39
(39) Motivation
[Diagram: CPU with I-Cache, off-chip memory, B-Cache]
40
(40) Architectural Overview: Block Cache
[Diagram: off-chip memory, DMA (block load), control unit, block status, on-chip code blocks B1, B2, B3, …, µTLB with block addresses (Addr. B1, …; = address), CPU (instruction fetch)]
- Exploit burst transfers (DRAM memory)
- Area efficient (SRAM cells)
- Scalable (up to application size)
41
(41) Function to Block Mapping
[Diagram: functions F1, F2, …, F20, FN mapped onto blocks B1, B2]
- Exploit burst transfers (DRAM memory)
- Area efficient (SRAM cells)
- Scalable (up to application size)