Spring 2008 CSE 591 Compilers for Embedded Systems Aviral Shrivastava Department of Computer Science and Engineering Arizona State University
Lecture 5: Scratch Pad Memories
Motivation
Processor-Memory Performance Gap
[Figure: "Moore's Law" gap plot; µProc improves ~55%/year (2X/1.5yr), DRAM ~7%/year (2X/10yrs)]
□ Huge processor-memory performance gap
□ Cold start can take billions of cycles
More serious dimensions of the memory problem
□ Energy
□ Access times
[Figure: memory energy and access times; sub-banking]
□ Applications are getting larger and larger …
Memory Performance: Impact on Performance
□ Suppose a processor executes at
  □ ideal CPI = 1.1
  □ 50% arith/logic, 30% ld/st, 20% control
  and that 10% of data memory operations miss with a 50-cycle miss penalty
□ CPI = ideal CPI + average stalls per instruction
      = 1.1 + (0.30 data-mem ops/instr × 0.10 misses/data-mem op × 50 cycles/miss)
      = 1.1 + 1.5 = 2.6 cycles
  so 58% of the time the processor is stalled waiting for memory!
□ A 1% instruction miss rate would add an additional 0.5 to the CPI!
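The slide's arithmetic as a runnable check, a minimal C sketch using exactly the example parameters above:

    /* CPI with memory stalls, per the example above. */
    #include <stdio.h>

    int main(void) {
        double ideal_cpi    = 1.1;   /* ideal CPI                        */
        double ldst_frac    = 0.30;  /* fraction of instructions = ld/st */
        double miss_rate    = 0.10;  /* data-memory miss rate            */
        double miss_penalty = 50.0;  /* cycles per miss                  */

        double stalls = ldst_frac * miss_rate * miss_penalty;  /* = 1.5 */
        double cpi    = ideal_cpi + stalls;                    /* = 2.6 */

        printf("CPI = %.1f; stalled %.0f%% of the time\n",
               cpi, 100.0 * stalls / cpi);   /* prints ~58% */
        return 0;
    }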
The Memory Hierarchy Goal: Create an illusion □Fact: Large memories are slow and fast memories are small □How do we create a memory that gives the illusion of being large, cheap and fast (most of the time)? □With hierarchy □With parallelism
A Typical Memory Hierarchy
[Figure: on-chip components (control, datapath, RegFile, instr/data caches, ITLB/DTLB), second-level cache (SRAM), main memory (DRAM/eDRAM), secondary memory (disk)]
Speed (# cycles): ½'s | 1's | 10's | 100's | 1,000's
Size (bytes): 100's | K's | 10K's | M's | G's to T's
Cost: highest → lowest
□ By taking advantage of the principle of locality, we can present the user with as much memory as is available in the cheapest technology, at the speed offered by the fastest technology
Energy consumption
□ The memory system frequently consumes >50% of the energy used for processing
[Figure: energy breakdown for a multi-processor with caches and a uni-processor without caches]
[M. Verma, P. Marwedel: Advanced Memory Optimization Techniques for Low-Power Embedded Processors, Springer, May 2007]
[Segars 01, according to Osman S. Unsal, Israel Koren, C. Mani Krishna, Csaba Andras Moritz, U. of Massachusetts, Amherst, 2001]
Cache
[Figure: cache organization, including decoder logic]
Energy Efficiency
[Figure: operations per watt (GOPS/W) vs. technology (1.0µ … 0.07µ) for ASICs, reconfigurable computing, DSP-ASIPs, and µPs; poor design techniques fall short of the "Ambient Intelligence" target]
[H. de Man, Keynote, DATE'02; T. Claasen, ISSCC99]
□ Necessary to optimize; otherwise the price for flexibility cannot be paid!
Timing Predictability
□ G.721: with a unified cache, the worst-case execution time (WCET) bound is larger than without a cache
Objectives for Memory System Design □(Average) Performance □Throughput □Latency □Energy consumption □Predictability, good worst case execution time bound (WCET) □Size □Cost □….
Scratch pad memories (SPM): fast, energy-efficient, timing-predictable
□ SPMs are small, physically separate memories mapped into the address space; selection is by an appropriate address decoder (simple!)
□ Small; no tag memory
□ Example: ARM7TDMI cores, well-known for low power consumption
[Figure: address space 0 … FFF.. with the SPM region; hierarchy of CPU registers, SPM, L1 cache, L2 cache, RAM]
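Because an SPM is just a region of the address space, no special instructions are needed to use it; placement alone selects it. A minimal sketch, assuming a GCC toolchain whose linker script defines a (hypothetical) ".spm" output section covering the SPM address range:

    #include <stdint.h>

    /* Ask the linker to place this buffer in the SPM region
       (e.g., the 0 .. FFF.. window of the ARM7TDMI example).
       The section name ".spm" is an assumption made here. */
    static uint32_t spm_buf[256] __attribute__((section(".spm")));

    uint32_t sum(void) {
        uint32_t s = 0;
        for (int i = 0; i < 256; i++)
            s += spm_buf[i];  /* ordinary loads; the address decoder
                                 routes them to the SPM, not to RAM */
        return s;
    }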
Comparison of currents
□ E.g.: ATMEL board with ARM7TDMI and ext. SRAM
[Figure: measured currents]
Scratchpad vs. main memory
□ Example: Atmel ARM-Evaluation board
□ > 86% savings: energy reduction to 1/…
□ Predictable
Why not just use a cache?
□ Energy consumption in tags, comparators and muxes is significant
[R. Banakar, S. Steinke, B.-S. Lee, 2001]
Influence of the associativity
Systems with SPM
□ Most ARM architectures have an on-chip SPM termed tightly-coupled memory (TCM)
□ GPUs such as Nvidia's 8800 have a 16KB SPM
□ It is typical for a DSP to have scratch pad RAM
□ Embedded processors like Motorola M-CORE, TI TMS370C
□ Commercial network processors – Intel IXP
□ And many more …
And for the Cell processor
□ Same motivation:
  □ Large memory latency
  □ Huge overhead for automatically managed caches
□ Local SPE processors fetch instructions and data from local storage (LS, 256 kB)
□ LS not designed as a cache; separate DMA transfers to main memory are required to fill and spill
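A hedged sketch of the fill/spill pattern on an SPE, using the Cell SDK's spu_mfcio.h DMA interface; the buffer name, size, and tag choice are illustrative:

    #include <spu_mfcio.h>

    #define TAG 3
    static float buf[1024] __attribute__((aligned(128)));  /* lives in the 256 kB LS */

    void process(unsigned long long ea_in, unsigned long long ea_out) {
        mfc_get(buf, ea_in, sizeof(buf), TAG, 0, 0);   /* fill: DMA main memory -> LS */
        mfc_write_tag_mask(1 << TAG);
        mfc_read_tag_status_all();                     /* wait for the DMA to finish  */

        for (int i = 0; i < 1024; i++)
            buf[i] *= 2.0f;                            /* compute on LS-resident data */

        mfc_put(buf, ea_out, sizeof(buf), TAG, 0, 0);  /* spill: DMA LS -> main memory */
        mfc_write_tag_mask(1 << TAG);
        mfc_read_tag_status_all();
    }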
Advantages of Scratch Pads
□ Area advantage – for the same area, an SPM provides more storage than a cache (around 34% more)
□ An SPM consists of just a memory array & address decoding circuitry
□ Less energy consumption per access
□ Absence of tag memory and comparators
□ Performance comparable with cache
□ Predictable WCET – required for real-time embedded systems (RTES)
Challenges in using SPMs
□ With SPMs, the application developer or the compiler must explicitly move data between memories (see the sketch below)
□ Data mapping is transparent in cache-based architectures
□ Binary compatible? Do the advantages translate to a different machine?
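What "explicitly move data" looks like in practice, a minimal sketch; the SPM base address and the use of plain memcpy are assumptions (a real port would take both from the platform's memory map and DMA facilities):

    #include <string.h>

    #define SPM_BASE ((float *)0x00000400)  /* hypothetical SPM address */

    void scale(float *a /* in off-chip DRAM */, int n) {
        float *spm_a = SPM_BASE;
        memcpy(spm_a, a, n * sizeof(float));  /* copy in: the step a cache
                                                 would have done implicitly */
        for (int i = 0; i < n; i++)
            spm_a[i] *= 0.5f;                 /* every access now hits the SPM */
        memcpy(a, spm_a, n * sizeof(float));  /* copy back out */
    }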
Data Allocation on SPM
□ Techniques focus on mapping
□ Global data
□ Stack data
□ Heap data
□ Broadly, we can classify techniques as
□ Static – mapping of data decided at compile time and remains constant throughout execution
□ Compile-time dynamic – mapping of data decided at compile time, but the data in the SPM changes during execution
□ Goals are
□ To minimize off-chip memory accesses
□ To reduce energy consumption
□ To achieve better performance
Global Data
□ Panda et al., "Efficient Utilization of Scratch-Pad Memory in Embedded Processor Applications"
□ Map all scalars to SPM
□ Very small in size
□ Estimate conflicts between arrays
□ IAC(u): Interference Access Count – number of accesses to other arrays during the lifetime of u
□ VAC(u): Variable Access Count – number of accesses to elements of u
□ IF(u) = IAC(u) × VAC(u)
□ Loop Conflict Graph
□ Nodes are arrays
□ Edge weight of (u, v) is the number of accesses to u and v in the same loop
□ The more conflict an array has, the stronger the case for placing it in SPM
□ Either the whole array goes to SPM or none of it does (see the sketch below)
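A minimal sketch of whole-array placement driven by the interference factor above; the struct layout and the greedy pass are illustrative, not the paper's exact algorithm:

    #include <stdio.h>

    typedef struct {
        const char *name;
        unsigned size;  /* bytes                                      */
        unsigned vac;   /* VAC(u): accesses to this array             */
        unsigned iac;   /* IAC(u): accesses to others during lifetime */
    } Array;

    static unsigned long ifactor(const Array *a) {
        return (unsigned long)a->iac * a->vac;   /* IF(u) = IAC(u) * VAC(u) */
    }

    void allocate(Array *a, int n, unsigned spm_size) {
        /* sort by IF, descending (selection sort for brevity) */
        for (int i = 0; i < n; i++)
            for (int j = i + 1; j < n; j++)
                if (ifactor(&a[j]) > ifactor(&a[i])) {
                    Array t = a[i]; a[i] = a[j]; a[j] = t;
                }
        /* either the whole array fits in the SPM or it stays off-chip */
        for (int i = 0; i < n; i++) {
            if (a[i].size <= spm_size) {
                printf("%s -> SPM\n", a[i].name);
                spm_size -= a[i].size;
            } else {
                printf("%s -> DRAM\n", a[i].name);
            }
        }
    }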
ILP Formulation
□ ILP variables for each memory object:
□ functions
□ basic blocks
□ global variables
ILP Formulation
□ Objective: maximize energy savings
□ Size constraint: the selected objects must fit in the SPM
□ Consecutive BBs placed in the same memory need not jump to and back from the other memory
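A hedged sketch of the knapsack-style core these bullets describe, ignoring the jump-cost terms; the notation (x_i, E_i, S_i) is assumed here, not the lecture's:

    \max \sum_i E_i \, x_i
    \quad \text{s.t.} \quad \sum_i S_i \, x_i \le \mathrm{SPMsize},
    \qquad x_i \in \{0, 1\}

where x_i = 1 iff memory object i (a function, basic block, or global variable) is placed in the SPM, E_i is its energy saving, and S_i its size.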