Block Cache for Embedded Systems Dominic Hillenbrand and Jörg Henkel Chair for Embedded Systems CES University of Karlsruhe Karlsruhe, Germany.

Presentation transcript:

Block Cache for Embedded Systems Dominic Hillenbrand and Jörg Henkel Chair for Embedded Systems CES University of Karlsruhe Karlsruhe, Germany

(2) Outline
- Motivation
- Related Work
- State of the art: “Instruction Cache”
- Our approach: “Block Cache”
- Workflow (Instruction Selection / Simulation)
- Assumptions & Constraints
- Algorithm
- Results
- Summary

(3) Motivation
[Figure labels: CPU, I-Cache, On-Chip, Off-Chip, Off-chip memory, Block Cache]
- Area is expected to increase enormously(!)
- David A. Patterson, “Latency lags bandwidth”, Commun. ACM, 2004
- Efficiency, power consumption, area
- Block Cache: 1..N memory blocks of instructions (SRAM cells)
- Generally, caches consume more power than on-chip memory [1,2,3]

(4) Related Work
- S. Steinke, L. Wehmeyer, B. Lee, P. Marwedel, “Assigning Program and Data Objects to Scratchpad for Energy Reduction”, DATE ’02
  - Statically partition on- and off-chip memory
- S. Steinke, N. Grunwald, L. Wehmeyer, R. Banakar, M. Balakrishnan, P. Marwedel, “Reducing energy consumption by dynamic copying of instructions to on-chip memory”, ISSS ’02
  - Statically determine code copying points
- P. Francesco, P. Marchal, D. Atienza, L. Benini, F. Catthoor, J. Mendias, “An integrated hw/sw-approach for run-time scratchpad management”, DAC ’04
  - DMA for acceleration in on-chip memory for data
- B. Egger, J. Kee, H. Shin, “Scratchpad memory management for portable systems with a memory management unit”, EMSOFT ’06
  - MMU to map between on- and off-chip memory (we share the µTLB)

(5) “State of the Art”: Instruction Cache
[Figure labels: CPU, I-Cache, Block Cache, On-Chip, Off-Chip, Off-chip memory]

(6) Architecture: Instruction Cache
[Figure labels: Tag, Offset, Tag MUX, Set MUX, Data, tag comparators (=)]

(7) “State of the Art”: Instruction Cache
[Figure labels: CPU, I-Cache, Block Cache, On-Chip, Off-Chip, Off-chip memory]
- Our approach: Block Cache

(8) Our approach: Block Cache
[Figure: memory blocks B1, B2, .. BN (SRAM cells) + logic]

(9) Architectural Overview: Block Cache
[Figure labels: CPU, address, µTLB, on-chip memory blocks B1, B2, .. BN, instruction(s), control unit, DMA, block load, off-chip memory]
- Exploit burst transfers (DRAM memory)
- Area efficient (SRAM cells)
- Scalable (up to application size)

(10) Architectural Overview: Block Cache
[Figure labels: CPU, address, µTLB, on-chip memory blocks B1, B2, .. BN, instruction(s), control unit, DMA, block load, off-chip memory]
- Exploit burst transfers (DRAM memory)
- Area efficient (SRAM cells)
- Scalable (up to application size)
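
The lookup path named on this slide (address, µTLB, block load via DMA) can be illustrated with a small software model. This is only a sketch under assumptions: the block size, the dictionary standing in for the µTLB, and the rotating slot choice are hypothetical and are not taken from the slides.

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 1024  # bytes per block; hypothetical value, not from the slides


@dataclass
class BlockCache:
    num_slots: int
    # Maps an off-chip block number to the on-chip slot holding it.
    # This dictionary stands in for the translation the µTLB performs.
    resident: dict = field(default_factory=dict)
    next_slot: int = 0
    misses: int = 0
    bytes_transferred: int = 0

    def fetch(self, address: int) -> tuple:
        """Return (slot, offset) for an instruction address, loading on a miss."""
        block, offset = divmod(address, BLOCK_SIZE)
        if block not in self.resident:
            self._load(block)
        return self.resident[block], offset

    def _load(self, block: int) -> None:
        # Assumed placeholder policy: overwrite slots in fixed rotation.
        slot = self.next_slot
        self.next_slot = (self.next_slot + 1) % self.num_slots
        # Drop whichever block previously occupied this slot.
        for old, used_slot in list(self.resident.items()):
            if used_slot == slot:
                del self.resident[old]
        self.resident[block] = slot
        self.misses += 1
        self.bytes_transferred += BLOCK_SIZE  # one whole-block (burst) transfer


cache = BlockCache(num_slots=2)
for addr in (0, 4, 1024, 8, 2048, 0):
    cache.fetch(addr)
print(cache.misses, cache.bytes_transferred)
```

Every miss moves a whole block in this model, which is where a high-bandwidth, high-latency external memory would pay off.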

(11) Architectural Overview: Block Cache
[Figure: on-chip memory blocks B1, B2, .. BN; functions F1, F2, .. FN of the binary; one block = 1..N function(s)]
- Function (assembler): PUSH R1, PUSH R2, …., POP R2, POP R1, RET

(12) Function to Block Mapping
[Figure: functions F1 .. F20 (F6 split into F6a, F6b, F6c) mapped to blocks B1, B2, B3]
- Eviction: LRU, Round Robin, ARC, Belady
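
Two of the eviction policies listed above, LRU and round robin, can be compared on a block reference trace with a few lines of code. This is a minimal sketch: the function names and the example trace are made up for illustration, and ARC and Belady are not covered.

```python
from collections import OrderedDict


def misses_lru(trace, num_blocks):
    """Count block loads when the least recently used block is evicted."""
    resident = OrderedDict()
    misses = 0
    for block in trace:
        if block in resident:
            resident.move_to_end(block)  # mark as most recently used
            continue
        misses += 1
        if len(resident) == num_blocks:
            resident.popitem(last=False)  # evict the least recently used block
        resident[block] = True
    return misses


def misses_round_robin(trace, num_blocks):
    """Count block loads when slots are overwritten in fixed rotation."""
    slots = [None] * num_blocks
    next_slot = 0
    misses = 0
    for block in trace:
        if block in slots:
            continue
        misses += 1
        slots[next_slot] = block
        next_slot = (next_slot + 1) % num_blocks
    return misses


# Hypothetical sequence of referenced blocks.
trace = ["B1", "B2", "B1", "B3", "B1", "B2"]
print(misses_lru(trace, 2), misses_round_robin(trace, 2))
```

On this trace LRU keeps the frequently used B1 resident, while round robin overwrites it once.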

(13) Design Flow: Analysis
[Figure: flow diagram]
- Inputs: software component; input data / parameters
- Instrumented execution / simulation: executed instruction trace, dynamic call graph
- Disassemble: static call graph
- Trace: function enter/exit, function address
- + Functions not called during profiling (need to be included)
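
How a function enter/exit trace turns into a dynamic call graph can be sketched as follows. The event format (pairs of an event kind and a function name) is an assumption made for readability; the slide states that the real trace records function addresses.

```python
from collections import Counter


def build_dynamic_call_graph(trace):
    """Return a Counter mapping (caller, callee) to the number of calls."""
    edges = Counter()
    stack = []  # currently active functions, innermost last
    for event, func in trace:
        if event == "enter":
            if stack:
                edges[(stack[-1], func)] += 1
            stack.append(func)
        elif event == "exit":
            # Assumes a well-nested trace: the exiting function is on top.
            if not stack or stack[-1] != func:
                raise ValueError(f"unbalanced trace at exit of {func}")
            stack.pop()
        else:
            raise ValueError(f"unknown event {event!r}")
    return edges


# Hypothetical trace: F1 calls F2 twice, then F3, which calls F2 once.
trace = [
    ("enter", "F1"),
    ("enter", "F2"), ("exit", "F2"),
    ("enter", "F2"), ("exit", "F2"),
    ("enter", "F3"),
    ("enter", "F2"), ("exit", "F2"),
    ("exit", "F3"),
    ("exit", "F1"),
]
graph = build_dynamic_call_graph(trace)
print(dict(graph))
```

Functions that never appear in the trace get no edges here, which is why the static call graph is still needed to include them.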

(14) Besides: Assumptions & Constraints
- Software Behavior Analysis
  - Component level
  - Trace composition reflects deployment usage (parameters / input set)
- Hardware
  - External memory: high bandwidth / high latency
  - Block size (fixed) / number of code blocks (fixed)
- Compiler / Linker
  - Function splitting (function size < block size)
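
The function splitting constraint can be illustrated by packing consecutive basic blocks of an oversized function into pieces that each fit into one block. This is a sketch only: the list of basic-block sizes, the `jump_overhead` parameter, and the letter suffixes (modelled on F6a, F6b, F6c from slide 12) are assumptions, and the slides do not say how the compiler or linker performs the split.

```python
import string


def split_function(name, basic_block_sizes, block_size, jump_overhead=4):
    """Group consecutive basic blocks into pieces that each fit one block.

    jump_overhead is a hypothetical number of bytes reserved in a piece
    for a jump to the piece that follows it.
    """
    pieces = []
    current = 0
    for size in basic_block_sizes:
        if size + jump_overhead > block_size:
            raise ValueError("a single basic block does not fit into a block")
        if current and current + size + jump_overhead > block_size:
            pieces.append(current + jump_overhead)  # close piece, add the jump
            current = 0
        current += size
    if current:
        pieces.append(current)  # last piece needs no trailing jump
    if len(pieces) == 1:
        return {name: pieces[0]}
    # Assumes at most 26 pieces per function.
    return {name + string.ascii_lowercase[i]: s for i, s in enumerate(pieces)}


parts = split_function("F6", [300, 200, 400, 100, 350], block_size=512)
print(parts)
```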

(15) Design Flow: Analysis
[Figure: flow diagram]
- Inputs: application (component); input data / parameters
- Instrumented execution / simulation: executed instruction trace, dynamic call graph
- Disassemble: static call graph
- Trace: function enter/exit, function address

(16) Design Flow: Block composition
[Figure: dynamic call graph and static call graph feed the block composition algorithm, which produces a linker file]

(17) Design Flow: Re-linking
[Figure: original binary (Function 1, Function 2, .. Function 6, ….) and re-linked binary (Code block 1 .. Code block 4), connected through the linker file]

(18) Design Flow: Re-linking
[Figure labels: code block 1, original code section size, code section size after re-linking, data section size, function reference, function pointer, data reference]
- Compiler supplies: relocation table, symbol table, ELF headers
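
The difference between the original code section size and the size after re-linking can be made concrete with a small layout calculation. This is a sketch that assumes every code block starts at a multiple of the block size; the function name, sizes, and base address are hypothetical, and no real linker file format is modelled.

```python
def layout_blocks(blocks, block_size, base=0):
    """Assign an address to every function and return the re-linked size.

    blocks is a list of code blocks, each a list of (function, size) pairs.
    """
    addresses = {}
    for index, functions in enumerate(blocks):
        offset = 0
        for name, size in functions:
            if offset + size > block_size:
                raise ValueError(f"{name} does not fit into code block {index + 1}")
            addresses[name] = base + index * block_size + offset
            offset += size
    return addresses, len(blocks) * block_size


# Hypothetical block composition for six functions.
blocks = [
    [("F1", 200), ("F2", 250)],
    [("F3", 400)],
    [("F4", 100), ("F5", 100), ("F6", 300)],
]
addresses, relinked_size = layout_blocks(blocks, block_size=512)
original_size = sum(size for block in blocks for _, size in block)
print(addresses, original_size, relinked_size)
```

Function references, function pointers, and data references would then have to be patched to the new addresses, which is what the relocation and symbol tables are needed for.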

(19) Overview: Algorithm
- Input: dynamic function call graph (node = function)
- Output: block graph (node = 1..n functions)
- 3 steps (differ in merging distance):
  (1) combine_neighbor
  (2) merge_direct_children
  (3) bubble_merge
- Challenge: “Merge appropriate functions into a block”

(20) Algorithm Step 1/3: combine_neighbor
[Figure: dynamic call graph over functions F1 .. F9; nodes annotated with function size (architecture), edges with weights such as 4, 1e4, 1e6, 1e8]

(21) Algorithm Step 1/3: combine_neighbor
[Figure: dynamic call graph over functions F1 .. F9 with F4 and F7 combined into one node F4,7; edge weights such as 4, 1e4, 1e6, 1e8; label: centrality measure]

(22) Algorithm Step 2/3: merge_direct_children
[Figure: dynamic call graph with F3 and its direct children F5, F6, F7, F8; edge weights 30, 1e6, 1e6, 1e4]

(23) Algorithm Step 2/3: merge_direct_children
[Figure: dynamic call graph with F3 and its direct children F5, F6, F7, F8; edge weights 30, 1e6, 1e6, 1e4; children merged stepwise into F6,7 and then F6,7,8]

(24) Algorithm Step 2/3: merge_direct_children
[Figure: dynamic call graph with F3 and its direct children; result: F5 and the merged node F6,7,8 with edge weight 1e6+1e6+1e4]
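
The slides show the result of merge_direct_children but not its rules, so the following is only one plausible reading and not the authors' algorithm: children of a common parent are considered in order of decreasing edge weight and merged while the combined size still fits into one block. The function sizes and the block size are hypothetical; the edge weights are those shown in the figure.

```python
def merge_direct_children(child_weights, sizes, block_size):
    """Greedily merge the most frequently called children into one node.

    child_weights maps each child to the edge weight from the common parent.
    Returns (merged children, summed edge weight, children left unmerged).
    """
    merged, left_out = [], []
    used = 0
    weight = 0
    # Heaviest edges first; names break ties so the order is deterministic.
    for child in sorted(child_weights, key=lambda c: (-child_weights[c], c)):
        if used + sizes[child] <= block_size:
            merged.append(child)
            used += sizes[child]
            weight += child_weights[child]
        else:
            left_out.append(child)
    return merged, weight, left_out


# Edge weights from the figure; sizes and block size are assumed.
child_weights = {"F5": 30, "F6": 1_000_000, "F7": 1_000_000, "F8": 10_000}
sizes = {"F5": 300, "F6": 150, "F7": 200, "F8": 100}
merged, weight, left_out = merge_direct_children(child_weights, sizes, block_size=512)
print(merged, weight, left_out)
```

With these assumed sizes the sketch reproduces the outcome on the slide: F6, F7, and F8 form one node whose edge weight is 1e6+1e6+1e4, and F5 stays separate.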

(25) Algorithm Step 3/3: bubble_merge
[Figure: dynamic call graph over functions F1 .. F9 with edge weights such as 4, 1e4, 1e6, 1e8; highlighted: F5, F6, F7, F1, F3]

(26) Algorithm Step 3/3: bubble_merge
[Figure: dynamic call graph over functions F1 .. F9 with edge weights such as 4, 1e4, 1e6, 1e8; highlighted: F5, F6, F7, F1, F3, F4, F8, F2, F9]

(27) Algorithm Step 3/3: bubble_merge
[Figure: dynamic call graph with edge weights such as 4, 1e4, 1e6, 1e8; highlighted: F3, F5, F6, F7, F4, F8, F2, F9, F1; merged node F3,F8]

(28) Results
- What is interesting?
  - Memory efficiency: block fragmentation
  - Technology scaling: misses
  - Energy: amount of transferred data
  - Performance: number of cycles
- Benchmark: MediaBench (CJPEG)
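
Two of the metrics listed above can be written down directly. This is a sketch under assumptions: fragmentation is taken to be the share of block capacity not occupied by code, and transferred data is taken to be one whole block per miss. The slides do not give the exact definitions, and the numbers below are made up.

```python
def block_fragmentation(block_fill, block_size):
    """Fraction of the total block capacity that holds no code."""
    capacity = len(block_fill) * block_size
    return (capacity - sum(block_fill)) / capacity


def transferred_bytes(misses, block_size):
    """Bytes copied from off-chip memory if every miss loads a whole block."""
    return misses * block_size


# Hypothetical fill levels (bytes of code) of three 512-byte blocks.
fragmentation = block_fragmentation([450, 400, 500], block_size=512)
traffic = transferred_bytes(misses=4, block_size=512)
print(round(fragmentation, 4), traffic)
```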

(29) Results: Block Fragmentation
CJPEG – JPEG encoding (MediaBench)
[Figure: function size distribution; axis labels: block size [Byte], binary size [Byte] (x-axis)]

(30) Results: Misses : LRU: [6-12 blocks] CJPEG – JPEG encoding (MediaBench) X-axis: total cache size [Byte]

(31) Results: Transferred Code : LRU [6-12 blocks] CJPEG – JPEG encoding (MediaBench) X-axis: total cache size [Byte]

(32) Results: LRU/ARC/RR Transferred Code [8 blocks] CJPEG – JPEG encoding (MediaBench) X-axis: total cache size [Byte]

(33) Results: Copy cycles : LRU : [6-12 blocks] CJPEG – JPEG encoding (MediaBench) X-axis: total cache size [Byte]

(34) Summary
- Introduced: Block Cache for Embedded Systems
  - Area increase / external memory latency
  - Utilization / suitability of traditional designs
  - Scalability: on-chip memories (megabytes)
- Block Cache:
  - Hardware
    - Simple hardware structure: logic + memory (SRAM, not cache memory)
  - Design Flow
    - Execute software component, block composition (algorithm, 3 steps), re-link the binary
  - Results
    - Exploits high-bandwidth memory
    - Good performance

(35) References
[1] David A. Patterson, “Latency lags bandwidth”, Commun. ACM, 2004
[2] R. Banakar, S. Steinke, B. Lee, M. Balakrishnan, P. Marwedel, “Scratchpad memory: Design alternative for cache on-chip memory in embedded systems”, CODES, 2002
[3] F. Angiolini, F. Menichelli, A. Ferrero, L. Benini, M. Oliveri, “A post compiler approach to scratchpad mapping of code”, CASES, 2004
[4] S. Steinke, L. Wehmeyer, B. Lee, P. Marwedel, “Assigning program and data objects to scratchpad for energy reduction”, DATE, 2002


(37) Motivation
[Figure labels: three CPUs, each with an I-Cache; off-chip memory]
- Bandwidth improves but latency does not [1]
- Generally, caches consume more power than on-chip memory [2,3,4]
- A significant amount of power will be spent in the memory hierarchy
- On-chip area will increase enormously

(38) Motivation
[Figure labels: three CPUs, each with an I-Cache; off-chip memory]

(39) Motivation
[Figure labels: three CPUs, each with an I-Cache; B-Cache; off-chip memory]

(40) Architectural Overview: Block Cache
[Figure labels: CPU, address, µTLB (entries such as Addr. B1), on-chip code blocks B1, B2, B3, …, block status, instruction(s), control unit, DMA, block load, off-chip memory]
- Exploit burst transfers (DRAM memory)
- Area efficient (SRAM cells)
- Scalable (up to application size)

(41) Function to Block Mapping
[Figure: functions F1 .. F20 mapped to blocks B1, B2]
- Exploit burst transfers (DRAM memory)
- Area efficient (SRAM cells)
- Scalable (up to application size)