Predictable Programming on a Precision Timed Architecture

Slide 1: Predictable Programming on a Precision Timed Architecture
Hiren D. Patel, UC Berkeley
Joint work with: Ben Lickly, Isaac Liu, Edward A. Lee (UC Berkeley); Sungjun Kim, Stephen A. Edwards (Columbia University)

Slide 2: Edwards and Lee - Case for PRET
- 2007: Edwards and Lee made a case for precision timed computers (PRET machines)
  - Predictability
  - Repeatability
S. A. Edwards and E. A. Lee. The case for the precision timed (PRET) machine. In Proceedings of the 44th Annual Conference on Design Automation (DAC '07), San Diego, California, June 2007. ACM, New York, NY.

Slide 3: Edwards and Lee - Case for PRET
- Unpredictability: difficulty in determining timing behavior through analysis
- Non-repeatability: no guarantee that every execution yields the same timing behavior
- Brittleness: small changes have big effects on timing behavior

Slide 4: Brittleness
- Expensive affair
- Tight coupling of software and hardware
- Reliance on testing for validation
- Upgrading is difficult
- Solution: stockpile
Source:

Slide 5: But wait …
- Real-time scheduling
  - Worst-case execution time
    - Detailed model of hardware
    - Large engineering effort
    - Valid only for particular hardware models
  - Interrupts, inter-process communication, locks …
- Bench testing
  - Brittle
Sebastian Altmeyer, Christian Hümbert, Björn Lisper, and Reinhard Wilhelm. Parametric Timing Analysis for Complex Architectures. In Proceedings of the 14th IEEE International Conference on Embedded and Real-Time Computing Systems and Applications (RTCSA'08), Kaohsiung, Taiwan, August 2008. IEEE Computer Society.

Slide 6: Precise Timing and High Performance
Traditional                 | Alternative
Caches                      | Scratchpads
Deep out-of-order pipelines | Thread-interleaved pipelines
Function-only ISAs          | ISAs with timing instructions
Function-only languages     | Languages and programming models with timing
Best-effort communication   | Fixed-latency communication
Time-sharing                | Multiple independent processors

Slide 7: Outline
- Introduction
- Related Work
- PRET Machine
- Programming Example
- Future Work
- Conclusion

Slide 8: Related Work
- Java Optimized Processor: Schoeberl et al. [2003]
- Timing instructions: Ip and Edwards [2006]
- Reactive processors: von Hanxleden et al. [2005]; Salcic et al. [2005]
- Virtual Simple Architecture: Mueller et al. [2003]

Slide 9: Semantics of Timing Instructions
- Deadline instructions
  - Denote the required execution time of a block
- When a deadline instruction is decoded:
  - Stall the instruction if the timer value is not 0
  - Otherwise, set the timer value to the new value
Straight-line block 0:
    deadi $t0, 10
    …
Straight-line block 1:
    deadi $t0, 8
    …
    deadi $t0, 0
    …
Loop block:
L0: …
    deadi $t0, 10
    b L0
    …
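A minimal Python sketch of the stall-or-set rule for a single thread. It is a simplified model, not the PRET simulator: it assumes every instruction (including `deadi` itself) takes one cycle, the deadline timer decrements by one per cycle and stops at 0, and a stalled `deadi` is retried every cycle. The function name `deadi_issue_cycles` and the block body lengths are hypothetical.

```python
def deadi_issue_cycles(blocks):
    """Return the cycle at which each block's deadline instruction issues.

    blocks: list of (deadline, body_cycles). Each block is modeled as
    'deadi $t0, deadline' followed by body_cycles one-cycle instructions.
    Assumed model: one instruction per cycle; the timer decrements by 1
    per cycle and saturates at 0.
    """
    timer = 0
    cycle = 0
    issued = []
    for deadline, body_cycles in blocks:
        # Stall: the deadi is retried until the previous deadline expires.
        while timer != 0:
            cycle += 1
            timer -= 1
        issued.append(cycle)
        # The deadi issues and loads the new deadline into the timer.
        timer = deadline
        # One cycle for the deadi itself, then one per body instruction.
        for _ in range(1 + body_cycles):
            cycle += 1
            timer = max(timer - 1, 0)
    return issued


# Straight-line blocks 0 and 1 with two different (hypothetical) body
# lengths: the deadi instructions issue at the same cycles either way.
print(deadi_issue_cycles([(10, 4), (8, 2), (0, 0)]))  # [0, 10, 18]
print(deadi_issue_cycles([(10, 7), (8, 6), (0, 0)]))  # [0, 10, 18]

# Loop block: each iteration is another (10, body) block, so iterations
# start every 10 cycles even when the body length varies.
print(deadi_issue_cycles([(10, 3), (10, 8), (10, 5)]))  # [0, 10, 20]
```

Because the deadlines, not the bodies, set the timing, changing a body's length does not move the `deadi` issue cycles as long as the body fits. In this model a body longer than `deadline - 1` cycles overruns instead: `deadi_issue_cycles([(10, 12), (10, 0)])` returns `[0, 13]`.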

Slide 10: Tracing a Program Fragment
    A: deadi $t0, 6
    B: sethi %hi(0x3f800000), %g1
    C: or %g1, 0x200, %g1
    D: st %g1, [ %fp ]
    E: deadi $t0, 8
    F: …
(Trace table with columns "cycle" and "$t0"; the values were not preserved.)
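The following Python sketch regenerates a cycle/`$t0` table for this fragment under stated assumptions that are not taken from the slides: the thread issues one instruction per cycle, the store `D` completes in one cycle (for example, to a scratchpad), `$t0` is shown after the instruction in that cycle executes, and it then decrements by one at the end of the cycle. The `trace` function and the tuple encoding of the program are hypothetical.

```python
def trace(program):
    """Cycle-by-cycle trace of one thread: rows of (cycle, label, $t0).

    program: list of (label, op, arg), where op is "deadi" or "op".
    Assumed model: one instruction per cycle; $t0 is recorded after the
    instruction in that cycle executes, then decrements by 1 at the end
    of the cycle (saturating at 0).
    """
    rows = []
    timer = 0
    cycle = 0
    pc = 0
    while pc < len(program):
        label, op, arg = program[pc]
        if op == "deadi" and timer != 0:
            rows.append((cycle, label + " (stall)", timer))  # same PC again
        else:
            if op == "deadi":
                timer = arg  # load the new deadline
            rows.append((cycle, label, timer))
            pc += 1
        cycle += 1
        timer = max(timer - 1, 0)
    return rows


fragment = [
    ("A", "deadi", 6),  # deadi $t0, 6
    ("B", "op", None),  # sethi %hi(0x3f800000), %g1
    ("C", "op", None),  # or %g1, 0x200, %g1
    ("D", "op", None),  # st %g1, [ %fp ]  (assumed 1-cycle store)
    ("E", "deadi", 8),  # deadi $t0, 8
]
for cycle, label, t0 in trace(fragment):
    print(cycle, label, t0)
```

Under these assumptions, E replays for two cycles and issues at cycle 6, exactly six cycles after A, which is the block length requested by `deadi $t0, 6`.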

Slide 11: Precision Timed Architecture
- Thread-interleaved pipeline
- Scratchpad memories
- Time-triggered main memory access
- Round-robin thread scheduling

Slide 12: Memory Hierarchy
- Clocks
  - Main clock
  - Derived clocks
- Instruction and data scratchpad memories
  - 1-cycle access latency
- Main memory
  - 16 MB
  - Latency of 50 ns
  - Frequency: 250 MHz (a 4 ns cycle), so 50 ns is 12.5 cycles: ~13 cycles latency
(Figure blocks: Core, SPM, DMA, Main Mem.)

Slide 13: Thread-interleaved Pipeline
- Thread stalls
  - Main memory access
  - Multi-cycle operations
  - Deadline instructions
- Replay mechanism
  - Execute the same PC on the thread's next iteration
  - Multi-cycle ALU operations replay instructions
(Figure: pipeline stages Fetch, Decode, Reg. Access, Execute, Memory, WriteBack, with pipeline registers F/D, D/R, R/E, E/M, M/W; annotations: Decrement Deadline Timers; Stall if Deadline Instruction; Increment PC; Check main memory access.)
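The replay mechanism can be illustrated with a short Python sketch of round-robin issue. It models only which thread owns each issue slot, not the individual pipeline stages: each cycle the next thread in round-robin order gets the slot, and an instruction that cannot complete keeps its PC, so the same instruction is retried in that thread's next slot. The `must_replay` predicate is a hypothetical stand-in for main-memory accesses, multi-cycle operations, and deadline stalls; the six-thread scenario is an assumption chosen to match the six pipeline stages in the figure.

```python
def interleave(instr_counts, cycles, must_replay):
    """Round-robin, thread-interleaved issue with a replay mechanism.

    instr_counts: number of instructions in each thread's program.
    must_replay(thread, pc, cycle): hypothetical predicate that is True when
    the instruction cannot complete in this slot, so the thread keeps its PC
    and retries the same instruction in its next slot.
    Returns a log of (cycle, thread, pc, action) tuples.
    """
    n = len(instr_counts)
    pcs = [0] * n
    log = []
    for cycle in range(cycles):
        t = cycle % n  # round-robin: exactly one thread owns each cycle
        pc = pcs[t]
        if pc >= instr_counts[t]:
            log.append((cycle, t, pc, "idle"))
        elif must_replay(t, pc, cycle):
            log.append((cycle, t, pc, "replay"))  # PC is not incremented
        else:
            log.append((cycle, t, pc, "issue"))
            pcs[t] += 1
    return log


def thread1_waits_until_13(thread, pc, cycle):
    # Hypothetical stall: thread 1's first instruction (say, a main-memory
    # load) cannot complete before cycle 13.
    return thread == 1 and pc == 0 and cycle < 13


for entry in interleave([3] * 6, 18, thread1_waits_until_13):
    print(entry)
```

Thread 1 replays its first instruction in cycles 1 and 7 and issues it in cycle 13, while thread 0 still issues in cycles 0, 6, and 12: a stall in one thread does not shift the other threads' slots.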

Slide 14: Time-Triggered Access through the Memory Wheel
- Decouples each thread's access pattern from those of the other threads
- Time-triggered access
- Best-case access time: if the access is made in the 1st cycle of the thread's window
- Worst-case access time: if the access is made in the 2nd cycle of the window
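To see where the best and worst cases come from, here is a Python sketch of a memory wheel under assumed parameters: each thread owns one fixed window per rotation, and an access may start only if it completes inside its owner's window. With six threads, 13-cycle windows, and a 13-cycle access (the thread count and window length are illustrative assumptions; ~13 cycles is the main-memory latency given for this architecture), only a request in the first cycle of the window fits, and a request one cycle later must wait for the next rotation. `wheel_latency` is a hypothetical name.

```python
def wheel_latency(thread, request_cycle, num_threads, window, access):
    """Cycles from a main-memory request until the access completes.

    Assumed memory-wheel model: thread i owns the window
    [i*window, (i+1)*window) of every rotation of num_threads*window
    cycles, and an access of `access` cycles may start only if it
    finishes inside the owning thread's window.
    """
    if access > window:
        raise ValueError("an access must fit inside one window")
    period = num_threads * window
    offset = request_cycle % period
    start = thread * window
    end = start + window
    if start <= offset and offset + access <= end:
        wait = 0  # the access fits in the current window
    else:
        wait = (start - offset) % period  # until the window comes round
    return wait + access


# Illustrative parameters (assumed): 6 threads, 13-cycle windows and accesses.
latencies = [wheel_latency(0, c, 6, 13, 13) for c in range(78)]
print(min(latencies), max(latencies))   # 13 90
print(latencies.index(max(latencies)))  # 1
```

In this model a request's latency depends only on when it is made relative to the thread's own window, never on what other threads do, which is what keeps the bound analyzable.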

Slide 15: Tool Flow
- Tools: GCC 3.4.4, SystemC 2.2, Python 2.4
- Boot code
- C programs with timing instructions
- GCC to compile boot code and program code
- Motorola SREC files
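As an illustration of what the Motorola SREC files contain, here is a generic Python reader for S-records; it is a sketch, not the PRET tool flow's actual loader, and `parse_srec_line` and `load_srec` are hypothetical names. Each line is `S`, a record-type digit, a byte count, an address, data, and a checksum equal to the ones' complement of the low byte of the sum of the count, address, and data bytes. Data records S1/S2/S3 carry 16-, 24-, and 32-bit addresses; S9/S8/S7 give the start address.

```python
# Address-field width in bytes for each S-record type.
ADDR_BYTES = {"0": 2, "1": 2, "2": 3, "3": 4, "5": 2, "6": 3,
              "7": 4, "8": 3, "9": 2}


def parse_srec_line(line):
    """Return (record_type, address, data_bytes) for one S-record line."""
    line = line.strip()
    if len(line) < 4 or line[0] != "S" or line[1] not in ADDR_BYTES:
        raise ValueError("not an S-record: %r" % line)
    raw = bytes.fromhex(line[2:])  # count, address, data, checksum
    if raw[0] != len(raw) - 1:
        raise ValueError("byte count mismatch")
    # count + address + data + checksum must sum to 0xFF (mod 256).
    if (sum(raw) & 0xFF) != 0xFF:
        raise ValueError("checksum mismatch")
    n = ADDR_BYTES[line[1]]
    address = int.from_bytes(raw[1:1 + n], "big")
    return line[1], address, raw[1 + n:-1]


def load_srec(lines):
    """Build a sparse memory image {address: byte} and the start address."""
    memory, entry = {}, None
    for line in lines:
        if not line.strip():
            continue
        rtype, address, data = parse_srec_line(line)
        if rtype in ("1", "2", "3"):  # data records
            for i, byte in enumerate(data):
                memory[address + i] = byte
        elif rtype in ("7", "8", "9"):  # start-address records
            entry = address
    return memory, entry


# An S1 record with 16 data bytes at 0x7AF0, then an S9 start address of 0.
record = "S1137AF0" + "0A0A0D" + "00" * 13 + "61"
memory, entry = load_srec([record, "S9030000FC"])
print(hex(min(memory)), len(memory), entry)  # 0x7af0 16 0
```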

Slide 16: Simple Mutual Exclusion Example
- Producer followed by Consumer and Observer
  - Consumer and Observer execute together
- Loop rate of two rotations of the memory wheel
  - 1st rotation: Producer writes
  - 2nd rotation: Consumer and Observer read
(Figure labels: Write to shared data; Read from shared data; Write to output.)
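A small Python sketch of why this schedule needs no lock: if each loop iteration spans two rotations, with the write in the first and both reads in the second, every read happens after the write of the same iteration and before the next write. The rotation length (78 cycles here), the payload, and `time_triggered_sharing` are hypothetical.

```python
def time_triggered_sharing(iterations, rotation):
    """Return (iteration, consumer_value, observer_value) for each iteration.

    Assumed schedule: iteration k spans cycles [2k*rotation, 2(k+1)*rotation).
    The Producer writes at the start of the first rotation; the Consumer and
    Observer read at the start of the second. No lock is used anywhere.
    """
    events = []
    for k in range(iterations):
        start = 2 * k * rotation
        events.append((start, "write", k))            # 1st rotation: Producer
        events.append((start + rotation, "read", k))  # 2nd rotation: readers
    shared = None
    reads = []
    for _, kind, k in sorted(events):  # process accesses in time order
        if kind == "write":
            shared = ("payload", k)    # hypothetical data for iteration k
        else:
            consumer_value = shared    # Consumer reads the shared data
            observer_value = shared    # Observer reads it in the same rotation
            reads.append((k, consumer_value, observer_value))
    return reads


for k, consumer_value, observer_value in time_triggered_sharing(3, rotation=78):
    print(k, consumer_value, observer_value)
```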

Slide 17: Video Game Example
(Figure: block diagram. The Main-Control Thread, Graphic Thread, and VGA-Driver Thread communicate through an Even Queue and an Odd Queue (commands) and an Even Buffer and an Odd Buffer (pixel data). Labeled interactions: Update Screen (sync request); Swap (when sync requested and when odd queue empty); Sync (after queue swapped); Refresh (sync request); Swap (when sync requested and when vertical blank); Sync (after buffer swapped).)
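Both swap conditions in the diagram follow one pattern: a swap happens only when a sync has been requested and the reading side is at a safe point (the odd queue is empty, or the display is in vertical blank). Here is a minimal Python sketch of the buffer side, assuming a hypothetical `DoubleBuffer` class shared by the graphic and VGA-driver threads; the tiny 4 x 2 frame stands in for 640 x 480.

```python
class DoubleBuffer:
    """Even/odd frame buffers: one is scanned out while the other is drawn.

    Hypothetical protocol following the diagram's labels: the graphic
    thread draws into the back buffer and issues a 'Refresh' sync request;
    the VGA driver swaps only when a sync was requested AND it is in the
    vertical blank, then acknowledges with 'Sync'.
    """

    def __init__(self, width, height):
        self.buffers = [[[0] * width for _ in range(height)] for _ in range(2)]
        self.front = 0               # index of the buffer being displayed
        self.sync_requested = False

    def back(self):
        return self.buffers[1 - self.front]

    def request_refresh(self):       # called by the graphic thread
        self.sync_requested = True

    def on_scanline(self, in_vertical_blank):  # called by the VGA driver
        if self.sync_requested and in_vertical_blank:
            self.front = 1 - self.front
            self.sync_requested = False
            return "sync"            # buffer swapped; drawing may resume
        return None


fb = DoubleBuffer(4, 2)
fb.back()[0][0] = 7       # graphic thread draws a pixel into the back buffer
fb.request_refresh()      # ...and asks for the frame to be shown
print(fb.on_scanline(in_vertical_blank=False), fb.front)  # None 0
print(fb.on_scanline(in_vertical_blank=True), fb.front)   # sync 1
print(fb.buffers[fb.front][0][0])                         # 7
```

Gating the swap on the vertical blank means the driver never switches buffers in the middle of a frame, so a partly drawn frame is never displayed.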

Slide 18: Timing Requirements
Signal          | Timing requirement | Pixel cycles
V. Sync         | 64 µs              | 1611
V. Back-porch   | 1.02 ms            | 25679
Draw 480 lines  | 15.25 ms           |
V. Front-porch  | 350 µs             | 8811
H. Sync         | 3.77 µs            | 96
H. Back-porch   | 1.89 µs            | 48
Draw 640 pixels | 25.42 µs           |
H. Front-porch  | 0.64 µs            | 16
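The pixel-cycle column is the timing requirement multiplied by the 25.175 MHz pixel clock. A short Python check, using exact fractions so that half-cycle values such as 25678.5 round predictably (halves round up); `pixel_cycles` is a hypothetical helper.

```python
from fractions import Fraction

PIXEL_CLOCK_HZ = 25_175_000  # 25.175 MHz pixel clock


def pixel_cycles(duration):
    """Round a duration in seconds (a decimal string) to the nearest
    whole number of pixel-clock cycles, with halves rounding up."""
    exact = Fraction(duration) * PIXEL_CLOCK_HZ
    return int(exact + Fraction(1, 2))


for name, duration in [("V. Sync", "64e-6"), ("V. Back-porch", "1.02e-3"),
                       ("V. Front-porch", "350e-6"), ("H. Sync", "3.77e-6"),
                       ("H. Back-porch", "1.89e-6"), ("H. Front-porch", "0.64e-6")]:
    print(name, pixel_cycles(duration))
```

These reproduce the table's pixel-cycle column except for H. Sync: 3.77 µs rounds to 95 cycles, while the listed 96 cycles correspond to about 3.81 µs (96 / 25.175 MHz).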

Slide 19: Timing Implementation
- Pixel clock using a derived clock
  - 25.175 MHz
  - ~39.72 ns cycle period
- Drawing 16 pixels
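The arithmetic behind drawing 16 pixels at a time, as a Python sketch. It assumes (hypothetically) that the VGA driver emits the 640 visible pixels of a line as groups of 16, each group paced by the derived pixel clock.

```python
from fractions import Fraction

PIXEL_CLOCK_HZ = 25_175_000   # derived pixel clock, 25.175 MHz
GROUP = 16                    # pixels per group (assumed unit of work)
VISIBLE_PIXELS = 640          # visible pixels per line

period_ns = Fraction(10**9, PIXEL_CLOCK_HZ)    # one pixel-clock cycle
group_ns = GROUP * period_ns                   # time to draw one group
groups_per_line = VISIBLE_PIXELS // GROUP
line_us = groups_per_line * group_ns / 1000    # visible part of a line

print(round(float(period_ns), 2))                 # 39.72
print(round(float(group_ns), 2))                  # 635.55
print(groups_per_line, round(float(line_us), 2))  # 40 25.42
```

Forty 16-pixel groups take 25.42 µs, matching the "Draw 640 pixels" requirement in the timing table.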

Slide 20: Future Work
- Architecture
  - DMA
  - DDR2 main memory model
  - Thread synchronization primitives
  - Shared data between threads
- Real-time benchmarks
  - With timing requirements
- Programming models
  - Memory allocation schemes
  - Synchronization

Slide 21: Conclusion
- What we want …
  - Time as a first-class citizen of embedded computing
  - Predictability
  - Repeatability
- Where we are …
  - PRET cycle-accurate simulator
  - Release …


Slide 23: Extras

Slide 24: More on Brittleness
Small changes may have big effects on timing behavior.
Theorem (Richard’s anomalies): If a task set with fixed priorities, execution times, and precedence constraints is optimally scheduled on a fixed number of processors, then increasing the number of processors, reducing execution times, or weakening precedence constraints can increase the schedule length.
Ronald L. Graham, “Bounds on the performance of scheduling algorithms”, in E. G. Coffman, Jr. (ed.), Computer and Job-Shop Scheduling Theory, John Wiley, New York, 1975.

Slide 25: Richard’s Anomalies
Example: 9 tasks, 3 processors, a priority list, a precedence order, and execution times.
Tasks (name/execution time): T1/3, T2/2, T3/2, T4/2, T5/4, T6/4, T7/4, T8/4, T9/9.

Slide 26: Richard’s Anomalies: Reducing Execution Times
Every execution time is reduced by one (eTime’ = eTime - 1): T1/2, T2/1, T3/1, T4/1, T5/3, T6/3, T7/3, T8/3, T9/8.

Slide 27: Richard’s Anomalies: More Processors
The same task set, precedence order, and priority list, scheduled on more processors.

Slide 28: Richard’s Anomalies: Changing Priority List
The same task set, now with the priority list L = (T1, T2, T4, T5, T6, T3, T9, T7, T8).
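These anomalies can be reproduced with a few lines of list scheduling. This Python sketch implements Graham's list-scheduling rule: whenever a processor is idle, start the highest-priority task whose predecessors have all finished. The precedence graph is not recoverable from the slides, so the sketch assumes the constraints of the common textbook version of this example (T1 precedes T9; T4 precedes T5, T6, T7, and T8). `list_schedule` and the data names are hypothetical.

```python
def list_schedule(times, preds, priority, processors):
    """Makespan of non-preemptive Graham list scheduling (integer times)."""
    finish = {}               # task -> finish time, once started
    pending = list(priority)  # tasks not yet started, highest priority first
    t = 0
    while True:
        busy = sum(1 for f in finish.values() if f > t)
        for task in list(pending):
            if busy == processors:
                break
            if all(p in finish and finish[p] <= t for p in preds.get(task, ())):
                finish[task] = t + times[task]
                pending.remove(task)
                busy += 1
        if not pending:
            break
        future = [f for f in finish.values() if f > t]
        if not future:
            raise ValueError("no task can ever start (cyclic precedence?)")
        t = min(future)       # advance to the next task completion
    return max(finish.values())


TIMES = {"T1": 3, "T2": 2, "T3": 2, "T4": 2, "T5": 4,
         "T6": 4, "T7": 4, "T8": 4, "T9": 9}
PREDS = {"T9": ["T1"], "T5": ["T4"], "T6": ["T4"], "T7": ["T4"], "T8": ["T4"]}
ORDER = ["T1", "T2", "T3", "T4", "T5", "T6", "T7", "T8", "T9"]

print(list_schedule(TIMES, PREDS, ORDER, 3))    # 12: baseline
print(list_schedule(TIMES, PREDS, ORDER, 4))    # 15: one more processor
shorter = {k: v - 1 for k, v in TIMES.items()}
print(list_schedule(shorter, PREDS, ORDER, 3))  # 13: every task one shorter
weaker = {k: v for k, v in PREDS.items() if k not in ("T5", "T6")}
print(list_schedule(TIMES, weaker, ORDER, 3))   # 16: fewer constraints
new_order = ["T1", "T2", "T4", "T5", "T6", "T3", "T9", "T7", "T8"]
print(list_schedule(TIMES, PREDS, new_order, 3))  # 14: new priority list
```

Under these assumptions, each change that looks like it should help (a fourth processor, shorter tasks, removing T4's edges to T5 and T6) makes the schedule longer than the baseline of 12, and so does the reordered priority list.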

Slide 29: Brittleness Again…
In general, all task scheduling strategies are brittle.