Precision Timed Embedded Systems Using TickPAD Memory Matthew M Y Kuo* Partha S Roop* Sidharta Andalam † Nitish Patel* *University of Auckland, New Zealand † TUM CREATE, Singapore

Introduction
Hard real-time systems
◦ Need to meet real-time deadlines
◦ Missing a deadline may cause catastrophic events
Synchronous execution approach
◦ Well suited to hard real-time systems
  ▪ Deterministic
  ▪ Reactive
◦ Aids static timing analysis
  ▪ Programs are well bounded
  ▪ No unbounded loops or recursion

Synchronous Languages
Execute in logical time
◦ Ticks
  ▪ Sample inputs → compute → emit outputs
Synchronous hypothesis
◦ Ticks are instantaneous
  ▪ Assumes the system executes infinitely fast
  ▪ The system is faster than the environment's response
◦ Worst-case reaction time (WCRT)
  ▪ The time between two logical ticks
Languages
◦ Esterel
◦ SCADE
◦ PRET-C
  ▪ An extension to C

PRET-C
Light-weight multithreading in C
Provides thread-safe memory access
C extension implemented as C macros

Statement                 Meaning
ReactiveInput I           Declares I as a reactive environment input
ReactiveOutput O          Declares O as a reactive environment output
PAR(T1, ..., Tn)          Synchronously executes n threads in parallel, where thread ti has a higher priority than ti+1
EOT                       Marks the end of a tick
[weak] abort P when C     Preempts P when C is true
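As a rough illustration, a minimal sketch of how these statements combine, assuming the PRET-C macro header is available; the thread bodies, the names sensor, alarm, and LIMIT, and the helper functions are invented for illustration, and the abort syntax follows the table's [weak] abort P when C form:

    ReactiveInput(sensor);        /* sampled from the environment each tick */
    ReactiveOutput(alarm);        /* emitted to the environment each tick */

    void thread sampler() {
        sample();                 /* some computation (hypothetical helper) */
        EOT;                      /* end of this thread's local tick */
        filter();
        EOT;
    }

    void thread monitor() {
        abort {                   /* preempted when the condition holds */
            check();
            EOT;
        } when (sensor > LIMIT);
    }

    int main() {
        init();
        PAR(sampler, monitor);    /* sampler has priority over monitor */
        return 0;
    }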

Introduction
Practical systems require larger memories
◦ Not all applications fit in on-chip memory
A memory hierarchy is required
◦ The processor-memory gap [1]

[1] Hennessy, John L., and David A. Patterson. Computer Architecture: A Quantitative Approach. San Francisco, CA: Morgan Kaufmann, 2011.

Introduction
Traditional approaches
◦ Caches
◦ Scratchpads
However, there has been scant research on memory architectures tailored for synchronous execution and concurrency.

Caches
[Figure: CPU connected directly to main memory]

Caches
Traditionally, a cache is
◦ A small, fast piece of memory exploiting
  ▪ Temporal locality
  ▪ Spatial locality
◦ Hardware controlled
  ▪ Replacement policy
[Figure: cache between CPU and main memory]

Caches
Hard real-time systems
◦ Need to model the architecture
  ▪ To compute the WCRT
◦ Cache models
  ▪ Trade off computation time against tightness
  ▪ Very tight worst-case estimates do not scale

Scratchpad
Scratchpad Memory (SPM)
◦ Software controlled
◦ Statically allocated
  ▪ Statically or dynamically loaded
◦ Requires an allocation algorithm
  ▪ e.g. ILP, greedy
[Figure: SPM between CPU and main memory]

Scratchpad
Hard real-time systems
◦ Easy to compute a tight WCRT
◦ Reduces the worst-case execution time
◦ Must balance the number of reload points against their overheads
  ▪ May perform worse than a cache in the worst case

TickPAD
Cache
▪ Good overall performance
▪ Hardware controlled
SPM
▪ Good worst-case performance
▪ Easy for fast and tight static analysis
[Figure: the TickPAD memory (TPM) sits between the CPU and main memory, combining the strengths of both]

TickPAD
TickPAD Memory
◦ TickPAD: Tick Precise Allocation Device
◦ A memory controller
  ▪ Hybrid between caches and scratchpads
  ▪ Hardware-controlled features
  ▪ Static software allocation
◦ Tailored for synchronous languages
◦ An instruction memory

TickPAD design flow
[Figure: design flow]

PRET-C

    int main() {
        init();               /* computation */
        PAR(t1, t2, t3);      /* spawn child threads; main resumes when all children terminate */
        ...
    }

    void thread t1() {
        compute;
        EOT;                  /* end of tick: a synchronization boundary */
        compute;
        EOT;
    }

[Figure: thread hierarchy, main forks t1, t2, and t3]

PRET-C Execution
[Figure: execution trace over time. In each tick, inputs are sampled; the threads run in priority order (main, then t1, then t2); outputs are then emitted. The time between two logical ticks is 1 tick (the reaction time); each thread's segment within it is a local tick.]

Assumptions
◦ One cache line holds 4 instructions (e.g. addresses 0x00, 0x04, 0x08, 0x0C)
◦ One cache line takes one burst transfer from main memory
◦ A cache miss takes 38 clock cycles [2]
◦ Each instruction takes 2 cycles to execute
◦ Buffers are one cache line in size

[2] J. Whitham and N. Audsley. The Scratchpad Memory Management Unit for Microblaze: Implementation, Testing, and Case Study. Technical Report YCS, University of York, 2009.

TickPAD - Overview
◦ Spatial memory pipeline
  ▪ Accelerates linear code
◦ Associative loop memory
  ▪ For predictable temporal locality
  ▪ Statically allocated and dynamically loaded
◦ Tick address queue
  ▪ Stores the resumption addresses of active threads
◦ Tick instruction buffer
  ▪ Stores the instructions at the resumption point of the next active thread
  ▪ Reduces context-switching overhead at state/tick boundaries
◦ Command table
  ▪ Stores the set of commands to be executed by the TickPAD controller
◦ Command buffer
  ▪ Stores operands fetched from main memory, for commands requiring two or more operands

Spatial Memory Pipeline
Cache, on a miss
◦ Fetches the line from main memory into the cache
  ▪ The first instruction misses; subsequent instructions on that line hit
◦ Requires the cache history to be modelled for timing analysis
Scratchpad, for unallocated code
◦ Executes from main memory
  ▪ Every instruction pays the miss cost
◦ Simple timing analysis

Spatial Memory Pipeline
Memory controller
◦ A single line buffer
Simple analysis
◦ Analyse the previous instruction
  ▪ The first instruction on a line misses; subsequent instructions on that line hit
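A minimal C model of the single line buffer's hit/miss behaviour, a sketch assuming the 4-instruction (16-byte) lines from the assumptions slide; all names are illustrative:

    #include <stdbool.h>
    #include <stdint.h>

    #define LINE_SIZE 16u                        /* 4 instructions x 4 bytes */

    static uint32_t buffered_line = UINT32_MAX;  /* index of the line held in the buffer */

    /* The first access to a line misses and fetches it; subsequent
       accesses to the same line hit in the buffer. */
    bool line_buffer_access(uint32_t pc) {
        uint32_t line = pc / LINE_SIZE;
        if (line == buffered_line)
            return true;                         /* hit */
        buffered_line = line;                    /* burst-fetch the line from main memory */
        return false;                            /* miss */
    }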

Spatial Memory Pipeline
Computations typically span many lines of instructions
Exploit spatial locality
◦ Predictably prefetch the next line of instructions
◦ Add a second buffer

Spatial Memory Pipeline
To preserve determinism
◦ The prefetch is only active when the current line contains no branch

Spatial Memory Pipeline
Timing analysis
◦ Simple to analyse
◦ Analyse the next instruction line
  ▪ If the current line has a branch, the next target line will miss
    e.g. 38 clock cycles
  ▪ Else it will have been prefetched
    e.g. 38 - 8 = 30 clock cycles
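This rule reduces to a two-case cost function; a sketch under the assumed costs (38-cycle miss; 4 instructions of 2 cycles each overlap the prefetch):

    #include <stdbool.h>

    /* Static analysis cost of fetching the next instruction line. */
    int next_line_fetch_cost(bool current_line_has_branch) {
        const int miss_cost = 38;     /* one burst transfer from main memory */
        const int overlap   = 4 * 2;  /* 4 instructions x 2 cycles hide part of the prefetch */
        return current_line_has_branch ? miss_cost : miss_cost - overlap;  /* 38 or 30 */
    }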

Tick Address Queue and Tick Instruction Buffer
Reduce the cost of context switching
Maintain a priority queue
◦ Of threads in execution order
Prefetch instructions from the next thread
Make context-switching points appear as linear code
◦ Paired with the spatial memory pipeline
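A sketch of the queue's bookkeeping, assuming one slot per thread indexed by its PAR priority; the structure and names are illustrative, not the paper's hardware interface:

    #include <stdint.h>

    #define MAX_THREADS 8

    /* Resumption addresses of active threads, kept in priority order
       (slot 0 = highest priority = the next thread to resume). */
    static uint32_t tick_queue[MAX_THREADS];

    /* At an EOT, record where this thread resumes in the next tick. */
    void tick_queue_set(int priority, uint32_t resume_addr) {
        tick_queue[priority] = resume_addr;
    }

    /* The controller prefetches the line at the head of the queue into the
       tick instruction buffer, so the next context switch hits like linear code. */
    uint32_t tick_queue_head(void) {
        return tick_queue[0];
    }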

Tick Address Queue and Tick Instruction Buffer
◦ Context switching: the memory cost is the same as for linear code
Timing analysis
◦ Allocated context-switching points are analysed like prefetched lines

Associative Loop Memory
Statically allocated
◦ Greedy
  ▪ Allocates the innermost loops first
Fetches a loop before executing it
◦ Predictable: easy and tight to model
◦ Exploits temporal locality
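A minimal sketch of such a greedy pass, assuming loops are ranked by nesting depth (innermost, i.e. deepest, first) and allocated while capacity remains; the data layout is invented for illustration:

    #include <stdbool.h>
    #include <stdlib.h>

    typedef struct {
        int  depth;       /* nesting depth; larger = more deeply nested */
        int  size_bytes;  /* code size of the loop body */
        bool allocated;   /* set when placed in the loop memory */
    } loop_t;

    static int by_depth_desc(const void *a, const void *b) {
        return ((const loop_t *)b)->depth - ((const loop_t *)a)->depth;
    }

    /* Greedy allocation: innermost loops first, while space remains. */
    void allocate_loops(loop_t *loops, int n, int capacity) {
        qsort(loops, n, sizeof *loops, by_depth_desc);
        for (int i = 0; i < n; i++) {
            if (loops[i].size_bytes <= capacity) {
                loops[i].allocated = true;
                capacity -= loops[i].size_bytes;
            }
        }
    }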

Command Table
Statically allocated
A look-up table used to dynamically load the
◦ Tick instruction buffer
◦ Tick address queue
◦ Associative loop memory
Commands are executed when the PC matches the address stored in the command
◦ Allows the TickPAD to function without modifying the source code
  ▪ Libraries
  ▪ Proprietary programs

Command Table
Three fields
◦ Address
  ▪ The PC address at which to execute the command
◦ Command
  ▪ Discard Loop Associative Memory
  ▪ Store Loop Associative Memory
  ▪ Fill Tick Instruction Buffer
  ▪ Load Tick Address Queue
◦ Operand
  ▪ Data used by the command
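The three fields map naturally onto a record such as the following sketch; the type names, field widths, and the dispatch stub are assumptions for illustration:

    #include <stdint.h>

    typedef enum {
        CMD_DISCARD_LOOP_MEMORY,
        CMD_STORE_LOOP_MEMORY,
        CMD_FILL_TICK_INSTRUCTION_BUFFER,
        CMD_LOAD_TICK_ADDRESS_QUEUE
    } tpad_cmd_t;

    typedef struct {
        uint32_t   address;   /* PC value that triggers the command */
        tpad_cmd_t command;   /* one of the four commands above */
        uint32_t   operand;   /* data used by the command */
    } tpad_entry_t;

    static void execute_command(const tpad_entry_t *e) {
        (void)e;              /* dispatch on e->command; stubbed in this sketch */
    }

    /* Each cycle, fire every entry whose address matches the current PC. */
    void tpad_match(const tpad_entry_t *table, int n, uint32_t pc) {
        for (int i = 0; i < n; i++)
            if (table[i].address == pc)
                execute_command(&table[i]);
    }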

Command Table Allocation

Node    Command                                                          Address
FORK    Load Tick Address Queue x N; Fill Tick Instruction Buffer        Address of the FORK
EOT     Load Tick Address Queue; Fill Tick Instruction Buffer            Address of the EOT
KILL    Fill Tick Instruction Buffer                                     Address of the KILL
Loops   Discard Loop Associative Memory; Store Loop Associative Memory   Address at the start of the loop

Results
WCRT reduction of
◦ 8.5% over locked SPMs
◦ 12.3% over a thread-multiplexed SPM
◦ 13.4% over direct-mapped caches
[Figure: WCRT comparison charts]

Results - Synthesis
[Figure: synthesis results]

Conclusion
Presented a new memory architecture
◦ Tailored for synchronous programs
◦ Better worst-case performance
◦ Scalable analysis time
  ▪ Between scratchpad and abstract cache analysis
The presented architecture is also suitable for other synchronous languages
Future work
◦ A data TickPAD
◦ TickPAD on multicores

Thank You