Spring 2008 CSE 591: Compilers for Embedded Systems
Aviral Shrivastava
Department of Computer Science and Engineering, Arizona State University

Lecture 6: Scratch Pad Memory Management and Data Mapping Techniques

Energy Efficiency
[Figure: energy efficiency in operations per watt (GOPS/W) versus process technology (1.0µ down to 0.07µ) for ASICs, reconfigurable computing, DSP-ASIPs, and general-purpose processors (µPs); ASICs sit at the top of the range, processors built with poor design techniques at the bottom, with ambient-intelligence workloads demanding the upper end.]
Necessary to optimize; otherwise the price for flexibility cannot be paid! [H. de Man, Keynote, DATE'02; T. Claasen, ISSCC'99]

SPMs vs. Cache
In a cache, the energy consumed in the tags, comparators, and muxes is significant! [R. Banakar, S. Steinke, B.-S. Lee, 2001]

Advantages of Scratch Pads
- Area advantage: for the same area, we can fit more memory as SPM than as cache (around 34% more), since an SPM consists of just a memory array and address-decoding circuitry
- Less energy consumption per access, due to the absence of tag memory and comparators
- Performance comparable with cache
- Predictable WCET, which is required for real-time embedded systems (RTES)

Systems with SPM
- Most ARM architectures have an on-chip SPM, termed tightly-coupled memory (TCM)
- GPUs such as Nvidia's 8800 have a 16 KB SPM
- It is typical for a DSP to have scratch-pad RAM
- Embedded processors such as the Motorola M-CORE and TI TMS370C
- Commercial network processors, e.g., the Intel IXP family
- The Cell Broadband Engine

Challenges in Using SPMs
- With an SPM, the application developer or the compiler has to explicitly move data between memories, whereas data mapping is transparent in cache-based architectures
- Binary compatibility: do the advantages translate to a different machine?

Data Allocation on SPM
Data classification:
- Global data
- Stack data
- Heap data
- Application code
Mapping classification:
- Static: the mapping of data is decided at compile time and persists throughout execution
- Dynamic: the mapping of data is decided at compile time but changes during execution
Analysis classification:
- Profile-based analysis
- Compile-time analysis
Goal classification:
- Minimize off-chip memory accesses
- Reduce energy consumption
- Achieve better performance

Global Data
Panda et al., "Efficient Utilization of Scratch-Pad Memory in Embedded Processor Applications"
- Map all scalars to the SPM: they are very small in size
- Estimate conflicts among arrays:
  - VAC(u): Variable Access Count, the number of accesses to elements of u
  - IAC(u): Interference Access Count, the number of accesses to other arrays during the lifetime of u
  - IF(u) = VAC(u) + IAC(u)
- Loop Conflict Graph: nodes are arrays; the weight of edge (u, v) is the number of accesses to u and v in the same loop
- The more conflict an array is involved in, the stronger the case for placing it in the SPM
- Either the whole array goes to the SPM or it does not
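
A minimal sketch of this kind of selection heuristic (not Panda et al.'s exact algorithm): compute IF(u) for each array from profiled access counts and greedily place the highest-conflict arrays that still fit in the SPM. The profile numbers, array sizes, and SPM capacity below are made-up inputs.

#include <stdio.h>
#include <stdlib.h>

typedef struct {
    const char *name;
    long vac;      /* VAC(u): accesses to this array (profiled)            */
    long iac;      /* IAC(u): accesses to other arrays during its lifetime */
    long size;     /* size in bytes                                        */
    int  in_spm;   /* set by the allocator                                 */
} ArrayInfo;

static int by_if_desc(const void *a, const void *b)
{
    const ArrayInfo *x = a, *y = b;
    long ifx = x->vac + x->iac, ify = y->vac + y->iac;
    return (ify > ifx) - (ify < ifx);
}

/* Greedy whole-array placement: highest interference factor first. */
static void allocate_spm(ArrayInfo *arr, int n, long spm_bytes)
{
    qsort(arr, n, sizeof *arr, by_if_desc);
    for (int i = 0; i < n; i++) {
        if (arr[i].size <= spm_bytes) {
            arr[i].in_spm = 1;
            spm_bytes -= arr[i].size;
        }
    }
}

int main(void)
{
    ArrayInfo arr[] = {
        { "A", 5000, 9000, 4096, 0 },
        { "B", 3000, 2000, 2048, 0 },
        { "C",  800,  400, 8192, 0 },
    };
    allocate_spm(arr, 3, 8192);          /* assume an 8 KB SPM */
    for (int i = 0; i < 3; i++)
        printf("%s -> %s\n", arr[i].name, arr[i].in_spm ? "SPM" : "DRAM");
    return 0;
}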

ILP for Global Variables
- Memory units: m_i
- Power per access of each memory: p_{m_i}
- Number of times each variable is accessed: n_j
- Compute where each variable should be placed so as to minimize the power consumption; the compiler decides where to place the global variables
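
One way such an ILP can be written down (a sketch using the slide's notation; the 0/1 decision variables x_{j,i} and the capacities S_i are assumptions, not taken from a specific paper):

\begin{align*}
\text{minimize}\quad & \sum_{j}\sum_{i} x_{j,i}\, n_j\, p_{m_i} \\
\text{subject to}\quad & \sum_{i} x_{j,i} = 1 \;\; \text{for every variable } v_j && \text{(each variable lives in exactly one memory)}\\
& \sum_{j} x_{j,i}\, \mathrm{size}(v_j) \le S_i \;\; \text{for every memory } m_i && \text{(capacity constraints)}\\
& x_{j,i} \in \{0,1\}
\end{align*}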

ILP for Code + Global Variables
- Inputs per function: the number of times it is executed, its size (in terms of dynamic instruction count), and the energy savings if the function is mapped to the SPM
- Find where to map the functions so as to minimize energy, given the SPM size constraint
- Can also be done at basic-block granularity
- The approach is completely static
- Open questions: dynamic copying of code and data to the scratch pad? The analysis is profile-based only; does it scale? How to do static analysis?
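
Under these inputs, the problem resembles a 0/1 knapsack; a hedged sketch of the formulation (the per-function savings ΔE_f, sizes size_f, and decision variables y_f are notational assumptions):

\begin{align*}
\text{maximize}\quad & \sum_{f} y_f\, \Delta E_f \\
\text{subject to}\quad & \sum_{f} y_f\, \mathrm{size}_f \le \mathrm{SPM\_size}, \qquad y_f \in \{0,1\}
\end{align*}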

Stack Data Management
- Unlike global variables, stack variables are dynamic: stack variable addresses are assigned at run time, and the total amount of stack is not known statically
- Work at stack-frame granularity: keep only the most frequently used stack frames (or the most frequently accessed variables), identified by profiling, in the scratch pad
- Need to manage two stack pointers
[Figure: CPU with Foo()'s stack frame in the SPM and Bar()'s stack frame in DRAM]
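
A minimal sketch of frame-granularity placement with two stack pointers, one for the SPM-resident part of the stack and one for the DRAM-resident part (the frame_alloc/frame_free helpers, the 'hot' flag from profiling, and the sizes are all hypothetical):

#include <stddef.h>
#include <stdint.h>

#define SPM_STACK_SIZE  2048
#define DRAM_STACK_SIZE 65536

static uint8_t spm_stack[SPM_STACK_SIZE];     /* stands in for the SPM stack region  */
static uint8_t dram_stack[DRAM_STACK_SIZE];   /* stands in for the DRAM stack region */
static uint8_t *spm_sp  = spm_stack  + SPM_STACK_SIZE;   /* both stacks grow downward */
static uint8_t *dram_sp = dram_stack + DRAM_STACK_SIZE;

/* Inserted at function entry: hot frames (per the profile) go to the SPM if they fit. */
void *frame_alloc(size_t frame_size, int hot)
{
    if (hot && (size_t)(spm_sp - spm_stack) >= frame_size) {
        spm_sp -= frame_size;
        return spm_sp;
    }
    dram_sp -= frame_size;                    /* cold frame, or the SPM stack is full */
    return dram_sp;
}

/* Inserted at function exit, with the same size that was passed on entry. */
void frame_free(void *frame, size_t frame_size)
{
    if ((uint8_t *)frame >= spm_stack && (uint8_t *)frame < spm_stack + SPM_STACK_SIZE)
        spm_sp += frame_size;
    else
        dram_sp += frame_size;
}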

MMU-based Stack Management
- Deny read permission to all of the stack space
- When the processor accesses a function's stack data, a page fault is generated; the handler resets the permission, brings the data into the SPM, and modifies the address mapping
- Granularity: page-based
- Binary compatible!
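
A hedged sketch of what such a fault handler could look like (spm_alloc_page, dma_copy, and map_page are stand-ins for OS/hardware services, not a real API; here they are stubbed out so the sketch compiles):

#include <string.h>
#include <stdint.h>

#define PAGE_SIZE 4096

/* Trivial stand-ins so the sketch compiles; a real system supplies these. */
static uint8_t spm[8 * PAGE_SIZE];
static int     next_page;
static void   *spm_alloc_page(void) { return &spm[(next_page++ % 8) * PAGE_SIZE]; }
static void    dma_copy(void *d, const void *s, uint32_t n) { memcpy(d, s, n); }
static void    map_page(uintptr_t va, void *pa, int rw) { (void)va; (void)pa; (void)rw; }

/* Invoked by the page-fault (abort) exception for a protected stack address. */
void stack_fault_handler(uintptr_t fault_addr)
{
    uintptr_t page_va = fault_addr & ~(uintptr_t)(PAGE_SIZE - 1);

    void *spm_page = spm_alloc_page();               /* may evict another stack page */
    dma_copy(spm_page, (void *)page_va, PAGE_SIZE);  /* pull the page in from DRAM   */
    map_page(page_va, spm_page, 1);                  /* re-enable access, now in SPM */
    /* On return from the exception the faulting access is retried, so the
       application binary never changes; this is what makes the scheme
       binary compatible. */
}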

Heap Data Management
- Like stack variables, heap data is dynamically allocated
- Partition the program into regions, e.g., functions and loops
- Find the time order between regions
- Add instructions to transfer portions of heap data to the scratch pad
- Are purely static techniques possible?
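
A rough sketch of the region-based transfer idea (the region_enter/region_exit helpers, the choice of which heap object is hot in a region, and the use of memcpy in place of a DMA engine are all assumptions for illustration):

#include <string.h>
#include <stddef.h>

#define SPM_HEAP_BYTES 4096
static char spm_heap[SPM_HEAP_BYTES];      /* stands in for the heap area of the SPM */

/* Inserted by the compiler at the entry of a region that heavily uses one heap
 * object (the object must fit in SPM_HEAP_BYTES for this simple sketch).       */
static void *region_enter(void *dram_obj, size_t n)
{
    memcpy(spm_heap, dram_obj, n);         /* pull the hot portion of the heap in */
    return spm_heap;                       /* the region body uses the SPM copy   */
}

/* Inserted at the region exit: write the data back to its DRAM home. */
static void region_exit(void *dram_obj, size_t n)
{
    memcpy(dram_obj, spm_heap, n);
}

/* Example region: a loop that repeatedly walks one heap-allocated buffer. */
long sum_region(long *heap_buf, size_t count)
{
    long *buf = region_enter(heap_buf, count * sizeof *buf);
    long s = 0;
    for (size_t i = 0; i < count; i++)
        s += buf[i];
    region_exit(heap_buf, count * sizeof *buf);
    return s;
}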

Leakage Energy Minimization in Multi-bank Scratch Pads
- The power of a bank is proportional to its size
- A bank can be put into a low-power mode, which reduces its leakage
- Partition data across banks according to access frequency, so that the sleep time of lightly used banks increases
- Explore all possible page-to-bank mappings
- > 60% energy reduction
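
A small illustrative sketch of the frequency-based partitioning (greedy, not the exhaustive exploration of page mappings mentioned above; the per-page access counts and two-bank layout are invented):

#include <stdio.h>
#include <stdlib.h>

typedef struct { int page; long accesses; } PageFreq;

static int hotter_first(const void *a, const void *b)
{
    long d = ((const PageFreq *)b)->accesses - ((const PageFreq *)a)->accesses;
    return (d > 0) - (d < 0);
}

int main(void)
{
    /* Hypothetical per-page profile counts. */
    PageFreq pages[] = { {0, 900}, {1, 20}, {2, 850}, {3, 5}, {4, 40}, {5, 700} };
    int n = 6, pages_per_bank = 3;

    /* Concentrate the hot pages in bank 0 so that bank 1 is idle, and asleep, longer. */
    qsort(pages, n, sizeof *pages, hotter_first);
    for (int i = 0; i < n; i++)
        printf("page %d (%ld accesses) -> bank %d\n",
               pages[i].page, pages[i].accesses, i / pages_per_bank);
    return 0;
}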

SPM Management for MMU-based Systems
- The SPM is split into pages
- MMU page-fault (abort) exceptions are used to intercept accesses to data, which is copied to the SPM on demand
- The physical address generated by the MMU is compared with the SPM base, and the access goes to the SPM if it falls within range
- The system also has a direct-mapped cache
- SPM-manager calls are inserted for frequently accessed data pages
- How should data be mapped so as to minimize the page faults?

Static Analysis
- Static analysis typically requires well-structured code, but that is not always the case: while-loops, pointers, and variable modifications limit the ability of static analysis
- Instead, simulate the program to find the address functions
- Match the address functions against several affine patterns: a, ai, ai+b, ai+bj+c
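
As a toy illustration of the matching step, the sketch below checks whether a recorded address trace for one reference fits the pattern a*i + b (the trace values and the exact-fit test are assumptions; a real tool would also try the other patterns):

#include <stdio.h>

/* Returns 1 and fills *a and *b if addr[i] == a*i + b for all recorded i. */
static int fits_affine(const long *addr, int n, long *a, long *b)
{
    if (n < 2)
        return 0;
    *b = addr[0];
    *a = addr[1] - addr[0];
    for (int i = 2; i < n; i++)
        if (addr[i] != *a * i + *b)
            return 0;
    return 1;
}

int main(void)
{
    /* Hypothetical simulated addresses of one array reference across iterations. */
    long trace[] = { 0x8000, 0x8004, 0x8008, 0x800c, 0x8010 };
    long a, b;
    if (fits_affine(trace, 5, &a, &b))
        printf("matches a*i + b with a = %ld, b = 0x%lx\n", a, (unsigned long)b);
    else
        printf("no affine fit\n");
    return 0;
}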

Data Reuse Concept
for i = 0 to 9
  for j = 0 to 1
    for k = 0 to 2
      for l = 0 to 2
        for m = 0 to 4
          ... = A[i*15 + k*5 + m]
[Figure: array index vs. time for the accesses of A, and the corresponding reuse tree: main memory feeds a 15-element buffer (the data for one i, reused across the j loop), which in turn feeds 5-element buffers (the data for one k, reused across the l loop).]
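
A hedged sketch of how this reuse could be exploited with explicit SPM buffers (buf15, buf5, and the use() stand-in for the loop body are hypothetical; memcpy stands in for copies into the scratch pad):

#include <stdio.h>
#include <string.h>

static int A[150];                     /* stands in for the array in main memory */
static long sum;
static void use(int v) { sum += v; }   /* stands in for the original loop body   */

int main(void)
{
    for (int i = 0; i < 150; i++)
        A[i] = i;

    static int buf15[15];              /* SPM buffer, reused across the j loop */
    static int buf5[5];                /* SPM buffer, reused across the l loop */

    for (int i = 0; i < 10; i++) {
        memcpy(buf15, &A[i * 15], sizeof buf15);           /* one copy per i */
        for (int j = 0; j < 2; j++)
            for (int k = 0; k < 3; k++) {
                memcpy(buf5, &buf15[k * 5], sizeof buf5);  /* one copy per k */
                for (int l = 0; l < 3; l++)
                    for (int m = 0; m < 5; m++)
                        use(buf5[m]);                      /* was A[i*15 + k*5 + m] */
            }
    }
    printf("sum = %ld\n", sum);
    return 0;
}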

Basic Idea for Data Reuse Analysis
for y = 0 to 4
  for x = 0 to 4
    for dy = 0 to 4
      for dx = 0 to 9
        ... = A[10y+dy, 10x+dx]
    for dy = 0 to 14
      for dx = 0 to 4
        ... = A[10y+dy+10, 10x+dx]
[Figure: the 2-D footprint of A touched at iterations 10, 15, and 20 of the (y, x) loop, with the overlapping region highlighted.]
The 25 overlapping elements are 'alive' only during 10 iterations, so the buffer size needed to keep all reused data is 25 × 10 = 250 elements.

Basic Idea for Data Reuse Analysis (continued)
How can the buffer size be found in a systematic way?
- Find the elements accessed during the first iteration
- Assign distance numbers to the accessed regions (0, 5, and 10 in this example)
- Overlapped area = number of reused elements (5 × 5 = 25 in our case)
- Buffer size = (max difference in distance numbers) × (overlap area) = 10 × 25 = 250 elements; partial reuse is possible with a smaller buffer
[Figure: the same loop nest as above, with the regions of A accessed by the two references annotated with their distance numbers and the overlapped region marked.]

Reuse Graph
- Result of the data reuse analysis: a buffer hierarchy
- Each buffer can be mapped to physical memory; there are many different possibilities for the mapping
- Example (for the loop nest above):
[Figure: the reuse graph maps array A[] in off-chip main memory to level-2 buffers (Buf 1, Buf 2) in Mb-sized on-chip RAM, and those to level-1 buffers (Buf 1, Buf 2, Buf 3) in Kb-sized on-chip RAM.]

Multiprocessor Data Reuse Analysis
Multiprocessor program over A[100]:
Proc 1:
  for i1 = 0 to 9
    for j1 = 0 to 9
      for k1 = 0 to 9
        n1 += f(A[10*i1 + j1 + k1])
Proc 2:
  for i2 = 0 to 9
    for j2 = 0 to 9
      for k2 = 0 to 9
        n2 += g(A[10*i2 + j2 + k2])
Architecture with shared memory: array A[] (100 elements) lives in main memory, and the SPM holds a shared buffer of 19 elements; for a given outer iteration i, both processors touch only A[10i .. 10i+18]. Additional synchronization is required.

Synchronization Models
- Buffer update with barrier synchronization: at each iteration of the outer i loop, both processors reach a barrier, the shared buffer is updated by DMA, and a second barrier releases them to run their j/k loops on the new data.
- Buffer update using a larger buffer: the DMA that refills the buffer for the next outer iteration is started early and only waited for at the barrier, so the transfer overlaps with the computation (start DMA; compute; wait for DMA; barrier synchronization).
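
A hedged sketch of the second model in C, shown from one processor's perspective (dma_start, dma_wait, barrier, and process_chunk are hypothetical stand-ins for the platform's DMA and synchronization primitives; the exact shared-buffer protocol between the two processors is not modelled):

#define ITERS     10
#define BUF_ELEMS 19

/* Hypothetical platform primitives. */
extern void dma_start(int tag, int *spm_dst, const int *mem_src, int elems);
extern void dma_wait(int tag);
extern void barrier(void);
extern long process_chunk(const int *chunk);   /* this processor's j/k loops */

extern int A[100];                   /* the array in shared main memory */
static int buf[2][BUF_ELEMS];        /* double buffer in the SPM        */

long worker(void)
{
    long acc = 0;
    dma_start(0, buf[0], &A[0], BUF_ELEMS);              /* prefetch chunk 0    */
    for (int i = 0; i < ITERS; i++) {
        dma_wait(i & 1);                                 /* current chunk ready */
        if (i + 1 < ITERS)                               /* start fetching chunk i+1
                                                            while computing on i */
            dma_start((i + 1) & 1, buf[(i + 1) & 1], &A[10 * (i + 1)], BUF_ELEMS);
        acc += process_chunk(buf[i & 1]);
        barrier();                                       /* keep both processors in step */
    }
    return acc;
}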

Multiprocessor Reuse Analysis: Example
Original multiprocessor program:
Proc 1:
  for i1 = 0 to 9
    for j1 = 0 to 4
      for k1 = 0 to 8
        n1 += f(A[10*i1 + 2*j1 + k1])
Proc 2:
  for i2 = 0 to 9
    for j2 = 0 to 4
      for k2 = 1 to 9
        n2 += g(A[10*i2 + 2*j2 + k2])
[Figure: the MP data reuse graph for this program, with candidate buffers and their sizes (per-processor i1/i2: 17 and j1/j2: 9 elements; shared i: 18, j: 10, and k: 2 elements), and the four reuse trees (buffer hierarchies) obtained by choosing no synchronization or synchronization points 1, 2, or 3.]