Exploiting Scratchpad-aware Scheduling on VLIW Architectures for High-Performance Real-Time Systems Yu Liu and Wei Zhang Department of Electrical and Computer Engineering Virginia Commonwealth University HPEC’12, Waltham, MA

Overview A time-predictable two-level SPM-based architecture is proposed for single-core VLIW (Very Long Instruction Word) microprocessors. An ILP-based static memory-object assignment algorithm is extended to support multi-level SPMs without compromising the time predictability of the SPMs. We also developed an SPM-aware scheduling technique to improve the performance of the proposed VLIW architecture.

2-Level Cache-based Architecture Two separate L1 caches store instructions and data, isolating the interference between them. One unified L2 cache, slower than the L1 caches but larger, trades off between speed and size. [Figure: Microprocessor -> L1 I-Cache / L1 D-Cache -> L2 Unified Cache -> Main Memory]

2-Level SPM-based Architecture Two separate L1 SPMs store instructions and data, and one unified L2 SPM offers larger size at slower speed. There is no replacement in any higher-level memory of this architecture, which preserves time predictability. [Figure: Microprocessor -> L1 I-SPM / L1 D-SPM -> L2 Unified SPM -> Main Memory]
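To make the predictability contrast concrete, the sketch below (ours, not from the slides; all names and latency values are illustrative assumptions) models the two hierarchies: an SPM access has a fixed latency determined solely by the level the object was statically assigned to, while a cache access depends on run-time hit/miss state.

    # Illustrative latency model in Python; the cycle counts are
    # assumptions, not the numbers used in the paper's evaluation.
    SPM_LATENCY = {"L1": 1, "L2": 4, "MEM": 20}

    def spm_access_latency(assigned_level):
        """SPM: latency depends only on the static assignment of the
        memory object, so it is known exactly at compile time."""
        return SPM_LATENCY[assigned_level]

    def cache_access_latency(addr, l1, l2):
        """Cache: latency depends on dynamic hit/miss state, so the
        compiler can only bound it, never know it precisely."""
        if addr in l1:
            return 1
        if addr in l2:
            l1.add(addr)        # fill L1 (replacement not modeled)
            return 4
        l2.add(addr)            # miss: fill both levels
        l1.add(addr)
        return 20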

ILP-based Static Allocation The ILP-based static allocation method is used to assign memory objects to the multi-level SPMs, since it fully preserves the time predictability of the SPMs. The objective function maximizes the execution time saved, subject to the size constraint of each SPM. The ILP is solved three times, once per SPM, and all instruction and data objects not selected for the L1 SPMs are considered as candidates for the L2 SPM.
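A minimal sketch of that three-pass flow, under the assumption that each pass reduces to a 0/1 knapsack (maximize saved cycles subject to SPM capacity); the data structures and the dynamic-programming solver below are illustrative stand-ins for the authors' ILP formulation:

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class MemObject:
        name: str
        size: int     # bytes occupied in the SPM
        saving: int   # total cycles saved if allocated to this SPM

    def knapsack(objects, capacity):
        """0/1 knapsack DP: choose the subset of objects maximizing
        total cycle savings without exceeding the SPM capacity."""
        best = [(0, frozenset())] * (capacity + 1)
        for i, obj in enumerate(objects):
            for c in range(capacity, obj.size - 1, -1):
                value = best[c - obj.size][0] + obj.saving
                if value > best[c][0]:
                    best[c] = (value, best[c - obj.size][1] | {i})
        return [objects[i] for i in best[capacity][1]]

    def allocate(insn_objs, data_objs, l1i_cap, l1d_cap, l2_cap):
        """Three passes: L1 I-SPM, L1 D-SPM, then L2 from leftovers."""
        l1i = knapsack(insn_objs, l1i_cap)
        l1d = knapsack(data_objs, l1d_cap)
        leftovers = [o for o in insn_objs + data_objs
                     if o not in l1i and o not in l1d]
        # in practice the savings are recomputed for the L2 latency
        return l1i, l1d, knapsack(leftovers, l2_cap)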

Background on Load-Sensitive Scheduling In a cache-based architecture, it is generally hard to know statically the latency of each load operation. An optimistic scheduler assumes a load always hits in the cache; this is too aggressive, because the processor must stall whenever a miss occurs. A pessimistic scheduler assumes a load always misses in the cache, which leads to poor performance.

Use-Stall Cycles Use-stall cycles are the cycles the processor stalls when an operation needs the result of a load that has not yet completed. For example, if a load has a 4-cycle latency and its use is issued one cycle later, the use stalls for three cycles.
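The same quantity as a one-line model (a sketch; the function name and signature are ours, not the paper's):

    def use_stall_cycles(load_issue, load_latency, use_issue):
        """Cycles the use op stalls waiting for its load's result."""
        return max(0, (load_issue + load_latency) - use_issue)

    # e.g. a 4-cycle load issued at cycle 0 whose use issues at cycle 1
    assert use_stall_cycles(0, 4, 1) == 3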

Scratchpad-Aware Scheduling Whenever possible, schedule a load op with a large memory latency earlier and schedule its use op later. This shortens use-stall cycles while preserving time predictability.

Memory Objects The instruction objects consist of basic blocks, functions, and combinations of consecutive basic blocks. The data objects consist of global scalars and non-scalar variables.

ILP for L1 Instruction SPM
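The slide's formulation appears only as an image that did not survive the transcript; the following is our assumed reconstruction of a typical form, consistent with the objective and constraint described above. Let $x_i \in \{0,1\}$ indicate whether instruction object $i$ is placed in the L1 I-SPM, $s_i$ its size, $n_i$ its access count, and $g_i$ the latency saved per access by an L1 placement:

\[
\max \sum_i n_i \, g_i \, x_i
\quad \text{subject to} \quad
\sum_i s_i \, x_i \le S_{\mathrm{L1I}},
\qquad x_i \in \{0, 1\}.
\]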

Scratchpad-Aware Scheduling The load/store latencies are known in the SPM-based architecture, so instruction scheduling can be enhanced by exploiting these predictable latencies. This is known as Load-Sensitive Scheduling for VLIW architectures [Hardnett et al., GIT, 2001].
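A greedy list-scheduling sketch of the idea (ours, with assumed data structures; the paper's scheduler runs inside a VLIW compiler): the critical-path priority already folds in each op's known SPM latency, so long-latency loads rise to the front and their uses fall later.

    def list_schedule(ops, latency, succs, width):
        """Latency-aware list scheduling for a VLIW machine.
        ops: op ids of a dependence DAG; latency[o]: known SPM/ALU
        latency (assumed >= 1); succs[o]: data-dependent successors;
        width: issue slots per long instruction."""
        prio = {}
        def priority(o):          # longest path from o to a DAG leaf
            if o not in prio:
                prio[o] = latency[o] + max(
                    (priority(s) for s in succs[o]), default=0)
            return prio[o]

        preds = {o: [] for o in ops}
        for o in ops:
            for s in succs[o]:
                preds[s].append(o)

        finish = {}               # op -> cycle its result is ready
        used = {}                 # cycle -> issue slots consumed
        for o in sorted(ops, key=priority, reverse=True):
            cycle = max((finish[p] for p in preds[o]), default=0)
            while used.get(cycle, 0) >= width:
                cycle += 1        # this long instruction is full
            used[cycle] = used.get(cycle, 0) + 1
            finish[o] = cycle + latency[o]
        return finish

Because a predecessor's priority always exceeds its successors' (latencies are at least one cycle), scheduling in priority order respects dependences; a slow L2-SPM load outranks a 1-cycle L1 access on the same path, so it issues earlier and its use lands later.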

Evaluation Methodology We evaluate the performance and energy consumption of our SPM-based architecture against the cache-based architecture. Eight real-time benchmarks are selected for this evaluation. We simulate the proposed two-level SPM-based architecture on a VLIW processor based on the HPL-PD architecture.

Cache and SPM Configurations

Evaluation Framework The framework of our two-level SPM-based architecture for single-core CPU evaluation.

Results (SPMs vs. Caches)
The WCET comparison (L1 size: 128 bytes, L2 size: 256 bytes), normalized to the SPM.
The energy consumption comparison (L1 size: 128 bytes, L2 size: 256 bytes), normalized to the SPM.

Sensitivity Study
Level            Setting 1 (S1)    Setting 2 (S2)    Setting 3 (S3)
L1 Instruction
L1 Data
L2 Shared

Sensitivity WCET Results The WCET comparison among the SPMs with different size settings. The WCET comparison among the caches with different size settings.

Sensitivity Energy Results The energy consumption comparison among the SPMs with different size settings. The energy consumption comparison among the caches with different size settings.

Why Two Levels? Why do we need two-level SPMs instead of one level? The L2 SPM is important to mitigate access latency: objects that do not fit in the L1 SPMs would otherwise have to be fetched from main memory. [Figure: one-level SPM architecture, Microprocessor -> L1 I-SPM / L1 D-SPM -> Main Memory]

Results (One-Level vs. Two-Level) The timing performance comparison, normalized to the two-level SPM-based architecture. The energy consumption comparison, normalized to the two-level SPM-based architecture.

Scratchpad-Aware Scheduling The maximum improvement in computation cycles is about 3.9%, and the maximum improvement in use-stall cycles is about 10%.

Thank You and Questions!

Backup Slides – SPM Access Latencies

Backup Slide – Priority Function in SSS In our Scratchpad-Sensitive Scheduling, we consider two factors related to the load-to-use distance: the memory latency of a load op (curLat) and, for a use op, the memory latency of its associated load op (preLat). Priority function of the default Critical Path Scheduling:
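The formula itself is an image lost from the transcript; as an assumed reconstruction, the standard critical-path priority is the longest-path length from an operation to the end of the dependence DAG:

\[
\text{priority}(op) =
\begin{cases}
\text{lat}(op), & \text{succ}(op) = \emptyset \\
\text{lat}(op) + \max\limits_{s \in \text{succ}(op)} \text{priority}(s), & \text{otherwise.}
\end{cases}
\]

SSS would then bias this priority using the two factors above, raising it for loads with a large curLat and lowering it for uses with a large preLat, so that the load-to-use distance grows.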