1 Memory Design for Multi-Core System on Chip
2 Introduction
The DSP processor is optimized for extremely high performance on specific kinds of arithmetic-intensive algorithms.
o Data Path Optimization: operations like Multiply-Accumulate should take only one clock cycle
o Memory Architecture Optimization: large amounts of data must be moved to and from memory
3 A FIR filter is a typical DSP application
4 Example: FIR Filter
If a Multiply-Accumulate can be done in a single clock cycle, a new sample of a k-tap FIR filter could be computed in k cycles, provided there were no delay due to memory access. However, several memory accesses are necessary (illustrated by the C sketch after this list):
1. Fetch the multiply-accumulate instruction
2. Read the delayed data value (xi)
3. Read the coefficient value (ci)
4. Write the data value into the next delay location in memory (xi -> xi-1)
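As a concrete illustration, here is a minimal, generic C sketch of a k-tap FIR filter (a textbook formulation, not taken from any vendor's DSP library). Each tap needs a coefficient read, a delayed-sample read and a multiply-accumulate, and the delay line must be shifted, which is exactly where the memory accesses listed above occur.

```c
#include <stddef.h>

/* Minimal k-tap FIR filter sketch (generic illustration, not vendor code).
 * y[n] = sum_{i=0..k-1} c[i] * x[n-i]
 * The delay-line shift at the end mirrors step 4 above
 * ("write the data value into the next delay location"). */
float fir_step(float new_sample, float *delay, const float *coeff, size_t k)
{
    float acc = 0.0f;

    delay[0] = new_sample;              /* newest sample x[n]                  */
    for (size_t i = 0; i < k; i++)
        acc += coeff[i] * delay[i];     /* read c[i] and x[n-i], then MAC      */

    for (size_t i = k - 1; i > 0; i--)  /* shift delay line one slot deeper    */
        delay[i] = delay[i - 1];

    return acc;
}
```

Even in this simple loop, every MAC iteration implies at least two data reads besides the instruction fetch, which is why the memory architecture, not the multiplier, tends to limit throughput.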
5 Memory Structure
The common memory structure used by general-purpose processors is the Von Neumann architecture: the processor can make one memory access per instruction cycle.
6 Original Harvard Architecture
The processor is connected to two memories (one for instructions, one for data) via independent buses.
7 Modified Harvard Architecture
The processor is connected to two memories (each holding both instructions and data) via independent buses.
8 Comparison
An implementation of the FIR filter needs, per sample:
o 4 instruction cycles (Von Neumann)
o 3 instruction cycles (Original Harvard)
o 2 instruction cycles (Modified Harvard)
It is of course possible to go further and provide more than two independent memory banks, which some DSPs also do.
9 Multiple Memory Buses
Multiple memory buses outside the chip are costly; DSP processors generally provide only two off-chip buses (an address bus and a data bus). Processors with multiple memory banks therefore usually provide a small amount of memory on-chip.
10 Multiple Memory Access - Fast Memories
Multiple memory accesses can be achieved by using faster memories that support several memory accesses per instruction cycle. Fast memories can be combined with a Harvard architecture to achieve higher memory bandwidth.
11 Multiple Memory Access - Multi-Port Memories
Multi-port memories have multiple independent sets of address and data connections.
12 DSP Processor Caches
DSP processors often include a program cache to eliminate main-memory accesses for certain instructions. In general, DSP processor caches are much smaller and simpler than the caches in general-purpose processors.
13 DSP Processor Caches
Single-Instruction Repeat Buffer: an instruction is loaded into the repeat buffer (initiated by the programmer). If the repeat instruction is used, the processor can make an extra memory access within a single cycle.
Extended Repeat Buffer: a whole block of instructions is loaded into the repeat buffer.
14 DSP Processor Caches
Single-Sector Instruction Cache
o The cache stores a number of the most recent instructions, which lie in a single contiguous region of program memory.
o The cache is loaded automatically with these instructions during program execution.
Multiple-Sector Instruction Cache
o Two or more independent sectors of memory can be stored in the cache.
o If an instruction belongs to a sector other than those currently cached (a sector miss), one of the cached sectors is replaced by the sector containing that instruction.
15 DSP Processor Caches
Some DSP processors provide special instructions that allow the programmer to lock the contents of the cache or to disable the cache.
o This may lead to better performance if the programmer knows the behavior of the program.
16 DSP Processor Caches
DSP processor caches are in general used only for program instructions, not for data.
o Caches that accommodate data as well as instructions must include a mechanism to write data back to external memory.
o Without such a mechanism, discarding the cache contents means that updates of data held in the cache are lost.
To use caches efficiently, algorithms should exploit data locality, as the example below illustrates.
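A small, generic C example of what "exploiting data locality" means in practice (not tied to any particular DSP): traversing a 2-D array in the order it is laid out in memory keeps consecutive accesses within the same cache line, whereas column-first traversal does not.

```c
#define ROWS 64
#define COLS 64

/* Row-major traversal: consecutive iterations touch adjacent addresses,
 * so each fetched cache line is fully reused before it is evicted. */
float sum_row_major(const float a[ROWS][COLS])
{
    float s = 0.0f;
    for (int r = 0; r < ROWS; r++)
        for (int c = 0; c < COLS; c++)
            s += a[r][c];
    return s;
}

/* Column-major traversal of the same data: each access jumps COLS
 * elements ahead, so a cache line is touched once and then discarded,
 * wasting the bandwidth the cache was meant to save. */
float sum_col_major(const float a[ROWS][COLS])
{
    float s = 0.0f;
    for (int c = 0; c < COLS; c++)
        for (int r = 0; r < ROWS; r++)
            s += a[r][c];
    return s;
}
```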
17 External Memory Interfaces Most DSPs provide a single external memory port consisting of an address bus, a data bus and a set of control signals
18 External Memory Interfaces
Having only a single external memory port means that multiple external memory accesses cannot be performed within a single clock cycle. Some DSP processors do provide multiple off-chip memory ports.
19 Multiprocessor Support in DSP External Interfaces
DSPs intended for multiprocessor systems often provide special features to simplify the design of multi-processor DSP systems. Examples:
o Two external memory ports
o A sophisticated shared-bus interface that allows several DSPs to be connected together with no special hardware or software
P. Lapsley, J. Bier, A. Shoham, E. A. Lee, DSP Processor Fundamentals, IEEE Press, 1997
20 FAST COMPUTATIONS ON A LOW-COST DSP-BASED SHARED-MEMORY MULTIPROCESSOR SYSTEM
ICECS 2002
Charalambos S. Christou
21 Introduction
Processor performance has increased dramatically over the past few years, while memory latency and bandwidth have progressed at a much slower pace. Large latencies have considerably reduced the number of processors that can be effectively supported in shared-memory parallel computers.
=> A new cost-effective parallel system:
1. Reduces memory latency
2. Effectively supports a greater number of processing elements for faster DSP computations
22 TWIN-PREFETCHING DSP-BASED SHARED-MEMORY SYSTEM (1)
The proposed multiprocessor is a high-speed, low-cost, DSP-based twin-prefetching shared-memory MIMD parallel system.
Figure 1: Twin-prefetching multiprocessor system diagram
23 Advantages of Shared Memory
o Ease of programming when communication patterns are complex or vary dynamically during execution
o Lower communication overhead, good utilization of communication bandwidth for small data items, no expensive I/O operations
o Hardware-controlled caching to reduce remote communication when remote data is cached
The alternative flavor: message passing (e.g., MPI)
24 SMP
The interconnect ties the processors to memory and to I/O. Bus-based: all memory locations have equal access time, hence SMP = "Symmetric MP". The limited bus bandwidth is shared among the processors and I/O.
25 TWIN-PREFETCHING DSP-BASED SHARED-MEMORY SYSTEM (2)
Data Memory
- Comprises two controllers (twin TTCs) and two fast memories (twin-prefetching caches).
- The two TTC/cache pairs are Twin1 and Twin2: one Twin is accessible to the processor, providing data operands, while the other Twin transfers data from/to the shared memory; as soon as a block of data has been moved into a cache, the two Twins swap roles.
- Loading (input image segments) and unloading (results) from/to the Twins therefore occur simultaneously with data processing.
- The back-and-forth switching of Twin1 and Twin2 allows maximum utilization of resources and thus optimum system performance (the sketch after this slide illustrates the idea).
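The Twin1/Twin2 switching is essentially double buffering (ping-pong buffering). The hedged C sketch below shows only the idea; the function names (dma_start_load, dma_wait, process_segment) are hypothetical placeholders, not the actual ADSP-21060 or TTC interface.

```c
#include <stddef.h>

#define SEG_WORDS 1024

/* Hypothetical ping-pong (twin-buffer) processing loop.  While the
 * processor works on one buffer, the other buffer is filled from shared
 * memory in the background - the same overlap the twin TTC/cache pairs
 * provide in hardware.  dma_start_load(), dma_wait() and
 * process_segment() are placeholders, not a real API. */
extern void dma_start_load(int *dst, size_t words, size_t segment_index);
extern void dma_wait(void);
extern void process_segment(int *data, size_t words);

void process_image(size_t num_segments)
{
    static int twin[2][SEG_WORDS];
    int active = 0;

    dma_start_load(twin[active], SEG_WORDS, 0);    /* prefetch first segment      */
    for (size_t seg = 0; seg < num_segments; seg++) {
        dma_wait();                                /* active twin is now ready     */
        if (seg + 1 < num_segments)                /* start filling the other twin */
            dma_start_load(twin[1 - active], SEG_WORDS, seg + 1);
        process_segment(twin[active], SEG_WORDS);  /* compute overlaps with load   */
        active = 1 - active;                       /* swap the twins               */
    }
}
```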
26 TWIN-PREFETCHING DSP-BASED SHARED-MEMORY SYSTEM (3)
Host Processor
- The host can asynchronously read or write the internal memory of any ADSP-21060 directly via the Host bus.
- The host processor is responsible for booting all nodes and downloading all necessary code and some data to the internal memory of every processor.
- The data downloaded to the internal memories include the addresses of the image segments in global memory that each node is assigned to process.
27 Results (1)
28 Results (2)
29 An Efficient Dynamic Memory Manager for Embedded Systems
Most embedded systems rely on statically allocated memory to avoid the problems of garbage collection.
- The alternative to garbage collection: a manual memory management system.
- The systems that would benefit from using dynamic memory management are applications where the clients involved do not all need instant access to the maximum memory that they could claim.
30 An Efficient Dynamic Memory Manager for Embedded Systems
The DMMS works as an address translator for the clients. The DMMS contains an Arbiter for granting access to the different clients.
- Three types of requests: Allocation, Deallocation, R/W.
31 An Efficient Dynamic Memory Manager for Embedded Systems
The interface to the clients consists of four parts: Allocation, Deallocation, R/W, and maintenance (a hedged sketch of such an interface follows below). To achieve the highest possible degree of memory utilisation, it is important to perform a thorough analysis of the optimal block size.
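A hedged C sketch of what such a client interface and arbiter could look like; the names (dmms_op_t, dmms_arbitrate, etc.) and the round-robin policy are hypothetical illustrations, not the interface from the paper.

```c
#include <stdint.h>
#include <stdbool.h>

#define DMMS_MAX_CLIENTS 4

/* Hypothetical DMMS request interface - illustrative only.  The request
 * classes mirror the parts named above: allocation, deallocation,
 * read/write and maintenance. */
typedef enum {
    DMMS_ALLOC, DMMS_DEALLOC, DMMS_READ, DMMS_WRITE, DMMS_MAINTENANCE
} dmms_op_t;

typedef struct {
    bool      pending;   /* does this client have a request waiting?  */
    dmms_op_t op;
    uint32_t  address;   /* client-local (logical) address            */
    uint32_t  data;      /* payload for writes / result for reads     */
} dmms_request_t;

/* Toy round-robin arbiter: grant the next client with a pending request.
 * Returns the granted client id, or -1 if nothing is pending. */
int dmms_arbitrate(dmms_request_t reqs[DMMS_MAX_CLIENTS], int last_granted)
{
    for (int i = 1; i <= DMMS_MAX_CLIENTS; i++) {
        int c = (last_granted + i) % DMMS_MAX_CLIENTS;
        if (reqs[c].pending)
            return c;
    }
    return -1;
}
```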
32 An Efficient Dynamic Memory Manager for Embedded Systems
The key issue in the DMMS is to make the clients see a contiguous memory. A naive implementation of the DMM would be to have a separate Address Translation Table (ATT) for each client, as sketched below.
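A naive per-client address translation can be written in a few lines of C; the block size and table layout here are hypothetical, chosen only to show how a client's contiguous logical address space is mapped onto possibly scattered physical blocks.

```c
#include <stdint.h>

#define BLOCK_SIZE        256u  /* bytes per block - hypothetical value */
#define BLOCKS_PER_CLIENT  64u  /* table entries per client             */

/* One naive Address Translation Table (ATT) per client: entry i holds the
 * physical block number backing the client's i-th logical block, so the
 * client sees contiguous memory even if its blocks are scattered. */
typedef struct {
    uint32_t phys_block[BLOCKS_PER_CLIENT];
} dmms_att_t;

uint32_t dmms_translate(const dmms_att_t *att, uint32_t logical_addr)
{
    uint32_t block  = logical_addr / BLOCK_SIZE;   /* which logical block */
    uint32_t offset = logical_addr % BLOCK_SIZE;   /* offset inside it    */
    return att->phys_block[block] * BLOCK_SIZE + offset;
}
```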
33 An Efficient Dynamic Memory Manager for Embedded Systems
The table below shows the worst and average cases for different numbers of clients. The main advantage of the DMMS is that it has predictable, and well-behaved, worst-case and average-case behavior. The system is intended to ease the job of the hardware designer/programmer as well as to produce better and smaller hardware.
34 Hardware Support for Real-Time Embedded Multiprocessor SoC Memory Management
The aggressive evolution of the semiconductor industry has provided design engineers with the means to create complex, high-performance SoC designs. A typical SoC consists of multiple processing elements, configurable logic, large memories, analog components and digital interfaces.
35 Hardware Support for Real-Time Embedded Multiprocessor SoC Memory Management
The SoC Dynamic Memory Management Unit (SoCDMMU) is a hardware unit that handles allocation/de-allocation of the global on-chip memory between the PEs. There are three types of commands that the SoCDMMU can execute: the G_Allocate commands, the G_Deallocate command and the Move command (a hypothetical encoding is sketched below).
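The three command types can be pictured as a small command word; the C encoding below is a hypothetical illustration, not the actual SoCDMMU command format.

```c
#include <stdint.h>

/* Hypothetical encoding of the three SoCDMMU command types named above -
 * the real command format is not reproduced here. */
typedef enum {
    SOCDMMU_G_ALLOCATE,   /* allocate global on-chip memory blocks to a PE */
    SOCDMMU_G_DEALLOCATE, /* release blocks previously allocated to a PE   */
    SOCDMMU_MOVE          /* move/re-assign already allocated blocks       */
} socdmmu_cmd_type_t;

typedef struct {
    socdmmu_cmd_type_t type;
    uint8_t  pe_id;        /* issuing processing element                   */
    uint16_t num_blocks;   /* block count for allocate/deallocate          */
    uint16_t block_id;     /* block (or region) referenced by Move         */
} socdmmu_cmd_t;
```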
36 Hardware Support for Real-Time Embedded Multiprocessor SoC Memory Management
An RTOS usually divides the memory into fixed-size allocation units, and any task can allocate only one unit at a time. As an RTOS, Atalanta manages memory in a deterministic way; tasks can dynamically allocate fixed-size blocks by using memory partitions (see the sketch below).
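Fixed-size-block allocation from a partition is deterministic because both allocation and release are constant-time free-list operations. The generic C sketch below illustrates this; it is not Atalanta's actual API.

```c
#include <stddef.h>

/* Generic fixed-size-block partition allocator (free-list based).
 * Not Atalanta's API - just an illustration of why fixed-size blocks
 * give O(1), and therefore deterministic, alloc/free times. */
typedef struct {
    void *free_list;   /* head of the singly linked list of free blocks */
} partition_t;

/* Carve `memory` into `count` blocks of `block_size` bytes and chain them.
 * block_size must be at least sizeof(void *) and suitably aligned. */
void partition_init(partition_t *p, void *memory, size_t block_size, size_t count)
{
    char *base = memory;
    p->free_list = NULL;
    for (size_t i = 0; i < count; i++) {
        void **block = (void **)(base + i * block_size);
        *block = p->free_list;          /* link block into the free list */
        p->free_list = block;
    }
}

void *partition_alloc(partition_t *p)   /* O(1): pop the free-list head  */
{
    void **block = p->free_list;
    if (block)
        p->free_list = *block;
    return block;
}

void partition_free(partition_t *p, void *block)  /* O(1): push back     */
{
    *(void **)block = p->free_list;
    p->free_list = block;
}
```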
37 Hardware Support for Real-Time Embedded Multiprocessor SoC Memory Management
Atalanta RTOS memory management was adapted to support the SoCDMMU and to allow the Atalanta RTOS to work in a multiprocessor SoC environment.
38 Hardware Support for Real-Time Embedded Multiprocessor SoC Memory Management
Four-PE SoC with SoCDMMU hardware: the SoCDMMU provides a dynamic, fast way to allocate/deallocate the global on-chip memory.
39 Unifying Memory and Processor Wrapper Architecture in Multiprocessor SoC Design
ISSS 2002
Férid Gharsalli, Damien Lyonnard, Samy Meftali, Frédéric Rousseau, Ahmed A. Jerraya
TIMA Laboratory, Grenoble, France
40 Introduction
Multiprocessor SoCs (MP SoC):
o Increasing performance requirements of application domains
o Complex communication protocols
o IP or application-specific memory components
o Require heterogeneous processors
=> This architecture generation demands significant design effort
41 Introduction
To reduce the productivity gap, designers reuse components (IP cores). An IP core needs to adapt the specific physical accesses and protocols of those components to a communication network that may use other physical connections and other protocols. To facilitate design space exploration and allow the designer to try different components or communication protocols:
=> These wrappers need to be generated automatically, based on parameters given by the architecture (processor types, protocols, etc.)
42 MP SoC Architecture Figure 1: A typical Multiprocessor SoC
43 MP SoC Architecture Figure 2: Architectural models
44 Unified Wrapper Model (1) - Generic Wrapper Architecture
Figure 3: Wrapper architecture
Key idea: to allow automatic wrapper generation based on a common library.
Module Adapter (MA)
- Implements the services requested by the module.
Channel Adapter (CA)
- Implements the communication protocol (FIFO, DMA controller, etc.)
- Controls the communication between the module and the network.
45 Unified Wrapper Model (2) - Processor Wrapper Architecture
Figure 3: Wrapper architecture
Processor Adapter (PA)
- Performs channel access selection by address decoding and interrupt management.
- The PA is a master, whereas the CAs are slaves.
- Enable signals set/reset by the PA select one CA and enable it to read/write data on the data signals.
46 Unified Wrapper Model (3) - Memory Wrapper Architecture
Figure 3: Wrapper architecture
Memory Port Adapter (MPA)
- Includes a memory controller and several memory-specific functions.
- Performs data type conversion and data transfer between the internal communication bus and the memory bus.
47 Wrapper Generation
Figure 4: Wrapper generation flow
To facilitate wrapper generation, a library of basic components should be built; this library includes several macro-models of channel adapters and module adapters. The wrapper generation flow is built on a processor and memory library and on an MA and CA library.
48 Memory Wrapper Generation in an Image Processing Application
Validation
- To check the correctness of the memory wrapper, we performed low-level image processing for a digital camera application.
- The algorithm uses two processors (ARM7) and a global shared memory.
Experiment
- two CAs, each composed of two FIFOs (32 words x 32 bits) with one controller and one buffer (1 word of 32 bits),
- two specific SRAM port adapters, each composed of one address decoder and one SRAM controller that provides the following services: SRAM control, burst access, and a test operation used during co-simulation,
- two parallel internal buses of 32 bits.
50 Results
The automatic generation of these wrappers allows fast design space exploration of various types of memories. The generated wrappers have been validated with a cycle-accurate co-simulation approach based on SystemC; two ISSs of the ARM7 core (40 MHz) are used. There is only a small difference in the code size of the memory wrapper between the two RTL architecture models: the CAs are unchanged, and only the MPA changes (10% of the wrapper code).
Write latency: 3 CPU cycles (without memory latency)
Read latency: 7 CPU cycles (send/receive)
Simulated processing of one 387x322-pixel image takes 2.05x10^6 CPU cycles.