SCRATCHPAD MEMORIES: A DESIGN ALTERNATIVE FOR CACHE ON-CHIP MEMORY IN EMBEDDED SYSTEMS - Nalini Kumar, Gaurav Chitroda, Komal Kasat


OUTLINE
Introduction
Scratch pad memory
Cache memory
Proposed methodology
Results
Conclusions
04/09/2010, Spring 2010, EEL 6935, Embedded Systems


INTRODUCTION
Scratch pad memory: a high-speed internal memory used for temporary storage of calculations, data, and other work in progress. It is the closest memory to the ALU after the internal registers.
Scratch pad based systems have NUMA (Non-Uniform Memory Access) latencies and use explicit instructions to move data; DMA-based data transfer is often used.
On-chip caches using SRAM consume between 25% and 45% of total chip power.
Current embedded processors for multimedia applications have on-chip scratch pad memories.

INTRODUCTION
Scratchpad vs. cache: a scratchpad does not contain a copy of data stored in main memory, and it is directly manipulated by applications.
In cache memory systems, the mapping of program elements is done at runtime; in scratch pad memory systems, it is done either by the user or by the compiler using a suitable algorithm.
Prior studies on scratch pad memories do not address the impact on area.

CONTRIBUTIONS
The paper proposes scratchpad memory as an alternative to cache as on-chip memory for computationally intensive applications.
The CACTI tool is used to compute area and energy for the AT91M40400 target architecture.
The results establish scratchpad memory as a low-power alternative in most situations, with an average energy reduction of 40%.


SCRATCH PAD MEMORY
A memory array with the decoding and column circuitry logic.
Memory objects are mapped to the scratch pad in the last stage of the compiler.
It occupies one distinct part of the memory address space, so there is no need to check for data/instruction availability in the scratch pad.
This removes the comparator and the miss/hit acknowledging circuitry.
Figure: scratch pad memory array (6-transistor static RAM memory cell).

SCRATCH PAD MEMORY
Area of the scratchpad, A_s:
A_s = A_sde + A_sda + A_sco + A_spr + A_sse + A_sou
Components: data decoder, data array, column multiplexers, precharge circuit, data sense amplifiers, and output driver circuitry.
Energy consumption is estimated from the energy consumption of the components:
E_scratchpad = E_decoder + E_memcol
The memory array is the major consumer of energy.
The CACTI tool first computes the capacitances for each unit, then estimates the energy.

ESTIMATING THE ENERGY CONSUMPTION
For the memory array:
E_memcol = C_memcol * Vdd^2 * P(0->1)
C_memcol is the capacitance of the memory array unit, calculated as
C_memcol = ncols * (C_pre + C_readwrite)
P(0->1) is the probability of a bit toggle, taken as 0.5.
Only two word lines are switched regardless of the change in the address bits.
The total energy spent in the scratch pad memory is
E_sptotal = SP_access * E_scratchpad
Every scratch pad access is either a read or a write, so only these cases contribute.
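The energy model above can be sketched directly in code. VDD and the capacitance and access-count inputs below are illustrative assumptions, not values from the paper; only the formulas follow the slide.

```python
# Sketch of the scratch pad energy model from the slide.
VDD = 3.3        # supply voltage in volts (assumed value)
P_TOGGLE = 0.5   # probability of a 0->1 bit toggle (from the slide)

def memcol_capacitance(ncols, c_pre, c_readwrite):
    """C_memcol = ncols * (C_pre + C_readwrite)."""
    return ncols * (c_pre + c_readwrite)

def memcol_energy(c_memcol, vdd=VDD, p=P_TOGGLE):
    """E_memcol = C_memcol * Vdd^2 * P(0->1)."""
    return c_memcol * vdd ** 2 * p

def scratchpad_total_energy(accesses, e_scratchpad):
    """E_sptotal = SP_access * E_scratchpad."""
    return accesses * e_scratchpad
```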


CACHE MEMORY
The area model is based on the transistor count of the circuitry.
Area of the cache: A_c = A_tag + A_data, where
A_tag = A_dt + A_ta + A_co + A_pr + A_se + A_com + A_mu
A_data = A_de + A_da + A_col + A_pre + A_sen + A_out
Figure: cache memory organization (tag array and data array).
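The additive area model above can be sketched as a sum over per-component areas. The component names are paraphrased from the slide's subscripts; the per-component values passed in are placeholders, not CACTI output.

```python
# Sketch of the cache area model A_c = A_tag + A_data, each term the
# sum of its component areas.
TAG_COMPONENTS = ("tag_decoder", "tag_array", "column_mux",
                  "precharge", "sense_amps", "comparators", "mux_drivers")
DATA_COMPONENTS = ("data_decoder", "data_array", "column_mux",
                   "precharge", "sense_amps", "output_drivers")

def cache_area(tag_areas, data_areas):
    """A_c = A_tag + A_data, with each dict mapping component -> area."""
    a_tag = sum(tag_areas[c] for c in TAG_COMPONENTS)
    a_data = sum(data_areas[c] for c in DATA_COMPONENTS)
    return a_tag + a_data
```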

PROPOSED METHODOLOGY

EXPERIMENTAL SETUP
Compare a cache and a scratchpad memory of the same size (the delay of the cache is higher than that of the scratchpad for the same technology).
Identification and assignment of critical data structures to the scratch pad is based on a packing algorithm.
The total number of clock cycles determines the performance: the larger the number of clock cycles, the lower the performance, because the on-chip configuration does not change the clock period.

SCRATCH PAD MEMORY ACCESS
Performance is estimated from the trace file. An appropriate latency is added to the overall program delay for each access: one cycle for a scratch pad read/write, one cycle plus one wait cycle for a 16-bit main memory access, and one cycle plus three wait cycles for a 32-bit main memory access.

Access: number of cycles
Cache: per the cache access calculations
Scratch pad: 1 cycle
Main memory, 16 bit: 1 cycle + 1 wait cycle
Main memory, 32 bit: 1 cycle + 3 wait cycles
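The per-access latencies above can be tallied over a trace as a simple sum. The access counts passed in are made up for illustration; the cycle costs follow the slide.

```python
# Cycle cost per access type, as in the table above.
CYCLES = {
    "scratchpad": 1,       # 1 cycle
    "main_16bit": 1 + 1,   # 1 cycle + 1 wait cycle
    "main_32bit": 1 + 3,   # 1 cycle + 3 wait cycles
}

def total_cycles(access_counts):
    """Add up cycle costs per memory type over the whole trace."""
    return sum(CYCLES[kind] * n for kind, n in access_counts.items())
```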

CACHE MEMORY ACCESS
The authors assume a write-through cache.
Read hit: the tag array is accessed; no write to the cache and no access to main memory.
Read miss: one cache read, then L (line size) words are written to the cache from one main memory read of size L; no main memory write.
Write hit: a cache write followed by a main memory write.
Write miss: one cache tag read and one main memory write; no cache update.

Access type: cache reads, cache writes, main memory reads, main memory writes
Read hit: 1, 0, 0, 0
Read miss: 1, L, L, 0
Write hit: 0, 1, 0, 1
Write miss: 1, 0, 0, 1
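The event counts in the table above can be captured as a small lookup, parameterized by the line size L:

```python
# Per-access event counts for a write-through cache with line size L,
# following the table above. Returns a tuple of
# (cache reads, cache writes, main memory reads, main memory writes).
def access_events(kind, line_size):
    L = line_size
    table = {
        "read_hit":   (1, 0, 0, 0),
        "read_miss":  (1, L, L, 0),   # fill L words from main memory
        "write_hit":  (0, 1, 0, 1),   # cache write plus write-through
        "write_miss": (1, 0, 0, 1),   # tag read, write to memory only
    }
    return table[kind]
```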

FLOW DIAGRAM
Figure: flow of the methodology. A C benchmark is compiled by the energy-aware compiler (compiler support, mapping algorithm) and simulated with ARMulator trace analysis, yielding the number of cycles for the cache and for the scratchpad; CACTI (the analytical model) takes the cache/scratch pad size and produces the energy and area estimates.

EXPERIMENTAL SETUP
Target architecture: AT91M40400, based on the ARM7TDMI embedded processor, a high-performance RISC processor with very low power consumption. It has a 4 KB on-chip scratch pad memory, a 32-bit data path, and two instruction sets.
encc, an energy-aware compiler, uses a special packing algorithm (the knapsack algorithm) for assigning code and data blocks to the scratch pad memory.
The binary output of the compiler is simulated on the ARMulator to produce a trace file. The ARMulator accepts the cache size as a parameter for the on-chip cache configuration and reports performance as a number of cycles.
The area and performance estimates are made for 0.5 um technology.
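The packing step can be sketched as a classic 0/1 knapsack: choose memory objects, each with a size and an estimated energy saving, to fit in the scratch pad so that the total saving is maximized. This is only an illustration of the knapsack formulation, not encc's actual implementation, and the inputs are hypothetical.

```python
# 0/1 knapsack sketch of assigning code/data blocks to the scratch pad.
def pack_scratchpad(objects, capacity):
    """objects: list of (size_bytes, saving). Returns the max total saving
    achievable within the scratch pad capacity (dynamic programming)."""
    best = [0] * (capacity + 1)  # best[c] = max saving using c bytes
    for size, saving in objects:
        for c in range(capacity, size - 1, -1):
            best[c] = max(best[c], best[c - size] + saving)
    return best[capacity]
```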


RESULTS
Table: energy per access of various devices
Cache, per access (2 KB): 4.57 nJ
Scratch pad, per access (2 KB): 1.53 nJ
Main memory read access, 2 bytes: 24.00 nJ
Main memory read access, 4 bytes: 49.30 nJ
Main memory write access, 4 bytes: 41.10 nJ
Table: area/performance ratios for bubble sort (columns: size in bytes, cache area, scratchpad area, CPU cycles with cache, CPU cycles with scratchpad, area reduction, time reduction, area-time product; the per-size rows are not preserved in this transcript).
The average area, time, and AT product reductions are 34%, 18%, and 46%, respectively.
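The reduction figures above are simple relative comparisons of scratch pad against cache; a sketch, with illustrative inputs rather than the paper's measured values:

```python
# Percentage reduction of the scratch pad relative to the cache,
# applied to area, cycle count, or their product.
def reduction(cache_val, sp_val):
    """100 * (cache - scratchpad) / cache."""
    return 100.0 * (cache_val - sp_val) / cache_val

def at_product_reduction(cache_area, cache_cycles, sp_area, sp_cycles):
    """Reduction of the area-time (AT) product."""
    return reduction(cache_area * cache_cycles, sp_area * sp_cycles)
```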

RESULTS
Figure: energy consumed by the memory system.
Figure: comparison of cache and scratch pad memory area.


CONCLUSION
Presents an approach for selecting on-chip memory configurations.
Results show that scratch pad based, compile-time-managed memory outperforms cache-based, run-time-managed memory on almost all counts: a 40% average energy reduction for the applications considered.
The authors propose studying DRAM-based memory comparisons, since memory bandwidth and on-chip memory capacity are limiting factors for many applications. Also, the energy models for both cache and scratchpad need to be validated by real measurements.

QUESTIONS