P. Marwedel, Univ. Dortmund/Informatik 12 + ICD/ES, 2005, Universität Dortmund: Memory-aware compilation enables fast, energy-efficient, timing predictable memory accesses

Presentation transcript:

Memory-aware compilation enables fast, energy-efficient, timing predictable memory accesses
Peter Marwedel 1,2, Heiko Falk 1, Christian Ferdinand 3, Paul Lokuciejewski 1, Manish Verma 1, Lars Wehmeyer 1,2
1 Universität Dortmund, Informatik 12
2 Informatik Centrum Dortmund (ICD)
3 AbsInt GmbH, Saarbrücken

Key properties of embedded systems
[Figure labels: embedded, real-time.]
Strong correlation between embedded and real-time systems.
"A reactive system is one which is in continual interaction with its environment and executes at a pace determined by that environment" [Bergé, 1995]
Strong correlation between embedded and reactive systems.

Serious mismatch
"Despite considerable progress in software and hardware techniques, when embedded computing systems absolutely must meet tight timing constraints, many of the advances in computing become part of the problem rather than part of the solution. What would it take to achieve concurrent and networked embedded software that was absolutely positively on time …? … What is needed is nearly a reinvention of computer science."
Edward A. Lee: Absolutely Positively On Time: What Would It Take?, Editorial, draft version: May 18, 2005; published in: Embedded Systems Column, IEEE Computer, July 2005

Technology "advances" will make the situation worse
[Figure: speed over years for CPU (1.5-2 p.a.) and DRAM (1.07 p.a.); the gap grows 2x every 2 years.]
Increasing gap between processor and memory speeds.
Future semiconductor technology will be inherently unreliable, e.g. due to quantum effects, and will require fault tolerance mechanisms to be used. Timing "redundancy" used?
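
The growth rates quoted on this slide can be checked with a little arithmetic. The sketch below assumes that "1.5-2 p.a." and "1.07 p.a." are multiplicative speed-up factors per year (an interpretation, not something the slide states explicitly); the function name `gap_growth` is made up for illustration.

```python
def gap_growth(cpu_rate: float, dram_rate: float, years: int) -> float:
    """Factor by which the CPU/DRAM speed ratio grows over `years` years."""
    return (cpu_rate / dram_rate) ** years


# Rates taken from the slide; reading them as per-year factors is an assumption.
low = gap_growth(1.5, 1.07, 2)
high = gap_growth(2.0, 1.07, 2)
print(f"gap growth over 2 years: {low:.2f}x to {high:.2f}x")
```

Under this reading, the lower CPU rate of 1.5 p.a. gives a gap that roughly doubles every two years, which matches the figure annotation; the upper rate of 2 p.a. gives a faster-growing gap.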

Scratchpad seen to help with timing problems
"Fortunately, there is quite a bit to draw on. To name a few examples, architecture techniques such as software-managed caches (scratchpad memories) promise to deliver much of the benefit of memory hierarchy without the timing unpredictability…" [E. Lee, 2005]

Scratch pad memories (SPM): fast, energy-efficient, timing-predictable
- Small; no tag memory
- Called "tightly coupled memory" by ARM
- Example: ARM7TDMI cores, well-known for low power consumption
[Figure: address space from 0 to FFF.., containing the scratch pad memory and main memory.]

Worst case timing analysis using aiT
[Figure: tool flow with the elements SP size, C program, encc, executable, ARMulator (actual performance) and aiT (WCET).]

Results for G.721
[Figure: two panels, "Using Scratchpad" and "Using Unified Cache".]
- L. Wehmeyer, P. Marwedel: Influence of Onchip Scratchpad Memories on WCET, 4th Intl. Workshop on Worst-Case Execution Time Analysis (WCET), 2004
- L. Wehmeyer, P. Marwedel: Influence of Memory Hierarchies on Predictability for Time Constrained Embedded Software, Design Automation and Test in Europe (DATE), 2005

Impact on access time and energy consumption
Small memories also provide faster access time and reduced energy consumption.
[Figure: energy and access times, CACTI model for SRAM.]

Energy savings for memory system energy

Static allocation of memory objects
Which objects (arrays, functions, etc.) are to be stored in the SPM?
Gain $g_k$ and size $s_k$ for each object $k$. Maximise the gain $G = \sum_k g_k$ over the objects placed in the SPM, respecting the size of the SPM: $\sum_k s_k \le SSP$.
Static memory allocation. Solution: knapsack algorithm.
[Figure: example with a processor, a scratch pad memory of capacity SSP and main memory on a board; candidate objects are loops (for i, for j, while, repeat), calls, arrays and ints.]
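
The knapsack formulation on this slide can be sketched as a small dynamic program. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `allocate_spm`, the object names, and all gain and size values are hypothetical, and sizes are assumed to be positive integers in the same unit as the capacity SSP.

```python
from typing import List, Tuple


def allocate_spm(objects: List[Tuple[str, int, int]],
                 capacity: int) -> Tuple[int, Tuple[str, ...]]:
    """0/1 knapsack over (name, gain g_k, size s_k) with capacity SSP.

    Returns the maximum total gain G and the names of the chosen objects.
    """
    # best[c] = (best gain, chosen names) using at most c units of SPM
    best: List[Tuple[int, Tuple[str, ...]]] = [(0, ())] * (capacity + 1)
    for name, gain, size in objects:
        # Iterate capacities downwards so each object is placed at most once.
        for c in range(capacity, size - 1, -1):
            candidate = best[c - size][0] + gain
            if candidate > best[c][0]:
                best[c] = (candidate, best[c - size][1] + (name,))
    return best[capacity]


# Hypothetical memory objects: (name, gain, size in bytes)
objects = [("array_a", 60, 512), ("func_f", 100, 1024), ("array_b", 120, 1536)]
gain, chosen = allocate_spm(objects, capacity=2048)
print(gain, chosen)
```

With these made-up numbers the two arrays fill the 2048-byte scratch pad exactly, which beats the pairing of `array_a` with `func_f` even though `func_f` has a higher gain than `array_a` alone.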

Dynamic replacement within scratch pad
- Effectively results in a kind of compiler-controlled swapping for SPM
- Address assignment within SPM required (paging or segmentation-like)
[Figure: CPU, memory, SPM.]
M. Verma, P. Marwedel (U. Dortmund): Dynamic Overlay of Scratchpad Memory for Energy Minimization, ISSS, 2004

Dynamic replacement of data within scratch pad: based on liveness analysis
SP size = |A| = |T3|
Solution: A → SP & T3 → SP
[Figure: control flow graph with basic blocks B1 to B10, annotated with T3, DEF A, USE A, MOD A and USE T3; the inserted spill code is SPILL_STORE(A); SPILL_LOAD(T3); and SPILL_LOAD(A);.]
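
The idea of letting two objects share one scratch pad slot can be sketched on a straight-line access trace. This is a deliberately simplified illustration, not the authors' algorithm: the slide works on a control flow graph, whereas this sketch assumes a linear trace, a scratch pad that holds exactly one object, and objects that already have a valid copy in main memory. The names `is_live` and `insert_spills` are hypothetical; only the operation labels (DEF, MOD, USE, SPILL_STORE, SPILL_LOAD) come from the slide.

```python
from typing import List, Tuple

Access = Tuple[str, str]  # (operation, object), operation in {"DEF", "MOD", "USE"}


def is_live(trace: List[Access], start: int, obj: str) -> bool:
    """True if obj's current value is read again at or after position start."""
    for op, name in trace[start:]:
        if name == obj:
            return op != "DEF"  # a fresh DEF overwrites the old value
    return False


def insert_spills(trace: List[Access]) -> List[str]:
    """Insert spill code so that every access hits the single SPM slot."""
    out: List[str] = []
    resident = None   # object currently held in the scratch pad
    dirty = False     # True if the resident copy is newer than main memory
    for i, (op, obj) in enumerate(trace):
        if obj != resident:
            # Write back the evicted object only if it is modified and still live.
            if resident is not None and dirty and is_live(trace, i, resident):
                out.append(f"SPILL_STORE({resident})")
            # A pure definition overwrites the object, so no load is needed.
            if op != "DEF":
                out.append(f"SPILL_LOAD({obj})")
            resident, dirty = obj, False
        if op in ("DEF", "MOD"):
            dirty = True
        out.append(f"{op} {obj}")
    return out


trace = [("DEF", "A"), ("USE", "T3"), ("MOD", "A"), ("USE", "T3"), ("USE", "A")]
for line in insert_spills(trace):
    print(line)
```

The liveness check is what keeps dead values from being written back: if A were never read again after its definition, the store before loading T3 would be omitted.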

Dynamic replacement within scratch pad: results for edge detection relative to static allocation

Impact of partitioning scratch pads
[Figure: address space starting at address 0, containing scratch pad 0 (256 entries), scratch pad 1 (2 k entries), scratch pad 2 (16 k entries) and "main" memory.]
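
How an address selects one of the partitions can be sketched as a simple address decoder. The partition sizes are the ones shown on the slide; the contiguous layout starting at address 0 with the smallest scratch pad first, the entry-granular addressing, and the names `PARTITIONS` and `decode` are assumptions made for illustration.

```python
from typing import Tuple

# (name, number of entries), assumed to be laid out contiguously from address 0
PARTITIONS = [
    ("scratch pad 0", 256),
    ("scratch pad 1", 2 * 1024),
    ("scratch pad 2", 16 * 1024),
]


def decode(address: int) -> Tuple[str, int]:
    """Map an address to (memory name, offset within that memory)."""
    if address < 0:
        raise ValueError("address must be non-negative")
    base = 0
    for name, entries in PARTITIONS:
        if address < base + entries:
            return name, address - base
        base += entries
    # Everything above the scratch pads falls into "main" memory.
    return "main memory", address - base


print(decode(255))    # last entry of scratch pad 0
print(decode(256))    # first entry of scratch pad 1
print(decode(18688))  # first address beyond all scratch pads
```

In such a layout the compiler can place the most frequently accessed objects in the smallest partition, which is the cheapest one to access.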

Results for parts of GSM coder/decoder
A key advantage of partitioned scratchpads for multiple applications is their ability to adapt to the size of the current working set.
[Figure annotation: "Working set".]

Multiple Processes: Non-Saving Context Switch
Non-Saving Context Switch (Non-Saving)
- Partitions SPM into disjoint regions
- Each process is assigned an SPM region
- Copies contents during initialization
- Good for large scratchpads
[Figure: scratchpad with one region each for processes P1, P2 and P3.]

Saving/Restoring Context Switch
Saving Context Switch (Saving)
- Utilizes SPM as a common region shared by all processes
- Contents of processes are copied on/off the SPM at context switch
- Good for small scratchpads
[Figure: processes P1, P2 and P3 sharing the scratchpad, with saving/restoring at context switch.]

Hybrid Context Switch
Hybrid Context Switch (Hybrid)
- Disjoint + shared SPM regions
- Good for all scratchpads
- Analysis is similar to Non-Saving approach
- Runtime: $O(nM^3)$
[Figure: scratchpad with a region shared by processes P1, P2 and P3 and separate regions for P1, P2 and P3.]
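
The three context-switch strategies differ only in how much of the SPM is shared, which the following sketch makes explicit. It is an illustration under simplifying assumptions, not the authors' allocation analysis: private regions are split equally (the real approach would size them per process), the shared region sits at the start of the SPM, and the copy cost is a worst-case bound that saves and restores the whole shared region. The function name `spm_layout` and all sizes are hypothetical.

```python
from typing import Dict, List, Tuple

Region = Tuple[int, int]  # half-open byte range [start, end)


def spm_layout(strategy: str, spm_size: int, processes: List[str],
               shared_size: int = 0) -> Tuple[Dict[str, Dict[str, Region]], int]:
    """Return per-process regions and worst-case bytes copied per context switch."""
    if not processes:
        raise ValueError("at least one process is required")
    if strategy == "non-saving":
        shared_size = 0              # only disjoint regions
    elif strategy == "saving":
        shared_size = spm_size       # the whole SPM is one shared region
    elif strategy != "hybrid":
        raise ValueError(f"unknown strategy: {strategy}")
    if not 0 <= shared_size <= spm_size:
        raise ValueError("shared_size must lie between 0 and spm_size")

    per_process = (spm_size - shared_size) // len(processes)
    layout: Dict[str, Dict[str, Region]] = {}
    for i, name in enumerate(processes):
        start = shared_size + i * per_process
        layout[name] = {
            "shared": (0, shared_size),
            "private": (start, start + per_process),
        }
    # Assumed bound: save the outgoing and restore the incoming shared contents.
    copied_per_switch = 2 * shared_size
    return layout, copied_per_switch


layout, copied = spm_layout("hybrid", 4096, ["P1", "P2", "P3"], shared_size=1024)
print(layout["P2"], copied)
```

Setting the shared size to zero reproduces the Non-Saving variant with no copying at a switch, and setting it to the full SPM size reproduces the Saving variant, so Hybrid can be viewed as the general case that contains both.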

Multi-process Scratchpad Allocation: Results
- Hybrid is the best for all SPM sizes.
- Energy for a 4 kB SPM: 27% for the Hybrid approach.
- Avoids the poor timing predictability of cache-based systems after a context switch.
[Figure: results for edge detection, adpcm, g721 and mpeg, with the 27% value marked; SPA: Single Process Approach.]

Multi-processor ARM (MPARM) Framework
- Homogeneous SMP ~ CELL processor
- Processing unit: ARM7T processor
- Shared coherent main memory
- Private memory: scratchpad memory
[Figure: ARM cores with SPM, an interrupt device, a semaphore device and shared main memory, connected by an interconnect (AMBA or STBus).]

Using optimization in a gcc-based tool flow
The source is split into 2 different files by a specially developed memory optimizer tool*.
[Figure: tool flow from application source (.c) and profile info (.txt) through the memory optimizer (ICD-C compiler) to main memory source and SPM source (.c) plus a linker script (.ld), then through the ARM-GCC compiler to the executable (.exe).]
*Built with the new tool design suite ICD-C, available from ICD (see

Results (MOMPARM)
DES encryption on 4 processors: 2 controllers + 2 compute engines.
Energy values from ST Microelectronics.
Result of ongoing cooperation between U. Bologna and U. Dortmund, supported by the ARTIST2 network of excellence.

State of the art of SPM algorithms
Feature | Static allocation | Dynamic allocation
Partitioned SPMs | Wehmeyer et al. [WMPI 2004] | -
WCET analysis | Wehmeyer et al. [WS WCET 04, DATE 05] | Wehmeyer et al. [Thesis]
Multiple processes | Verma et al. [ISSS 2004] | Future work
Multiprocessor systems | Verma et al. [Estimedia 2005] | Verma et al. [ongoing work]
Sections from arrays | Not always applicable | IMEC (MHLA), Kandemir

Extension: WCET-aware compiler
[Figure: compiler flow from the ANSI-C program through the ANSI-C frontend, parse tree, IR code generator, medium-level IR, LLIR code generator, low-level IR and code generator to WCET-optimized assembly code, with analyses and optimization techniques attached. LLIR2crl and crl2llir connect the low-level IR to aiT: CRL2 is the standard input to aiT, and loop bounds analysis, value analysis, cache analysis, pipeline analysis and path analysis yield CRL2 with WCET info. Labelled ARTIST2.]

Opportunities
- Precise WCET information for run-time optimizations
  - Single implementation of hardware timing models
  - Accurate information on pipeline influence
  - Accurate information on timing of memory
  - Trade-off cache vs. scratchpad optimization
- Pass additional information (flow facts) to aiT. Potential for tighter bounds? (e.g. due to pointer disambiguation)
- Aggressive optimizations for code on WCET path
- Respecting WCET constraints during compilation
- Reduction of jitter in multimedia applications
- Alternative input to aiT (compare compiler output)

Conclusion
Timeliness and timing predictability are seriously missing in key concepts of current information technology.
- Scratchpads are seen as a potential contribution towards new architectural concepts.
  - A comprehensive set of allocation methods has been developed: static allocation and dynamic allocation.
- Full integration of WCET tools into the compiler tool chain enables further explicit consideration of time.