Liquid computing – the rVEX approach

Presentation transcript:

Liquid computing – the rVEX approach
ILP-driven dynamic core adaptations
Joost J. Hoozemans – Computer Engineering, TU Delft
Monday, 19 November 2018

Observation: embedded workloads (past, current, future) are becoming increasingly dynamic in their intensity (number of tasks), their characteristics (amount and type of parallelism), and their requirements (criticality).

Realization: dynamic workloads call for dynamic computing platforms.

Vision – Liquid Architectures: a system that constantly optimizes its hardware for all of its running tasks.

Current state of the art: heterogeneous multicore processors. [Diagram: two different cores, Core A and Core B.]

Heterogeneous Multicore (big.LITTLE) – Problem. [Diagram: Core A, Core B, Core A.]

Heterogeneous Multicore (big.LITTLE) – Problem. (Source: ARM – Programmer's Guide for ARMv8)

Heterogeneous Multicore (big.LITTLE) – Problem. [Diagram: Core A, Core B.] (Source: AnandTech)

Heterogeneous Multicore (big.LITTLE) – Problem: some programs cannot make use of additional processor resources (parallel datapaths). A better metric is needed for choosing between big and little: Instruction-Level Parallelism (ILP).
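To make the metric concrete, ILP can be viewed as the number of instructions in a block divided by the length of that block's data-dependency critical path. The following is an illustrative sketch of that definition (not rVEX tooling; the register names and example block are invented):

```python
# Sketch: estimate the ILP of a basic block as instruction count divided by
# the length of its dependency critical path (unit-latency instructions).

def ilp(block):
    """block: list of (dest, [sources]) tuples in program order."""
    ready = {}  # register -> earliest cycle its value is available
    depth = 0
    for dest, srcs in block:
        # An instruction can issue one cycle after all its sources are ready.
        cycle = 1 + max((ready.get(s, 0) for s in srcs), default=0)
        ready[dest] = cycle
        depth = max(depth, cycle)
    return len(block) / depth

# Four instructions, but r3 depends on r1 and r2: critical path = 2 cycles.
example = [("r1", ["a"]), ("r2", ["b"]), ("r4", ["c"]), ("r3", ["r1", "r2"])]
print(ilp(example))  # 4 instructions / 2 cycles = 2.0
```

A block with ILP near 1 gains nothing from extra datapaths, while a block with high ILP can fill a wide core.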

Heterogeneous Multicore (big.LITTLE) – Problem: in superscalar processors, ILP is implicit. Measuring a program's ILP requires running it on the largest core/configuration, and hardware ILP extraction is power hungry. (Source: Nvidia Tegra 4 Family CPU architecture whitepaper)

Heterogeneous Multicore (big.LITTLE) – Problem: in superscalar processors, ILP is implicit and can only be measured by running on the largest core/configuration. Solution: a VLIW-based dynamic processor.

Superscalar vs. VLIW. [Diagram: in a superscalar design, the compiler produces a sequential binary and a hardware scheduler distributes instructions over the datapaths; in a VLIW design, the compiler produces an explicitly parallel binary that drives the datapaths directly.]

VLIW: explicit parallelism (ILP). In VLIW processors, ILP is explicit: it is encoded in the binary by the compiler, with bundle boundaries marked by stop bits.

VLIW: explicit parallelism (ILP). The compiler encodes the ILP in the binary as instruction bundles, e.g.:
  and  add  nop
  shl  add  nop
  sub  add  nop
  stw  add  nop  goto
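The bundling step above can be sketched as a toy scheduler (illustrative only, not the actual VEX encoding or compiler): pack operations into bundles of up to the issue width, and end a bundle (set the stop bit) when an operation depends on a result produced inside the current bundle.

```python
# Toy VLIW bundler: greedily pack independent operations into bundles of up
# to `width` slots; a dependency on the current bundle forces a stop bit.

def bundle(ops, width):
    """ops: list of (name, dest, [sources]) in program order."""
    bundles, current, produced = [], [], set()
    for name, dest, srcs in ops:
        if len(current) == width or any(s in produced for s in srcs):
            bundles.append(current)      # stop bit: bundle ends here
            current, produced = [], set()
        current.append(name)
        produced.add(dest)
    if current:
        bundles.append(current)
    return bundles

ops = [("and", "r1", ["r5"]), ("add", "r2", ["r6"]),
       ("shl", "r3", ["r1"]),                      # depends on r1
       ("sub", "r4", ["r2"]), ("stw", None, ["r3"])]
print(bundle(ops, width=4))
# [['and', 'add'], ['shl', 'sub'], ['stw']]
```

Real VLIW compilers of course also model resource constraints, latencies, and scheduling across branches; this only shows why the parallelism ends up explicit in the binary.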

Heterogeneous Multicore (big.LITTLE) – Problem 2: migration penalty. [Timeline: Task 1 and Task 2 are repeatedly saved on one core and restored on the other; Core A is underutilized while Core B leaves ILP unused, and every switch incurs a migration penalty.]

Solution: Liquid Computing – a dynamic processor that assigns datapaths to threads. [Timeline: datapath pairs 1 & 2, 3 & 4, 5 & 6, and 7 & 8 are reassigned over time.]

Solution: Liquid Computing – a dynamic processor that assigns datapaths to threads. [Timeline: the datapath pairs are redistributed among Tasks 1–4 as their needs change; reconfiguration takes 5 clock cycles.]
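One possible assignment policy (a hypothetical sketch, not the actual rVEX runtime) is to hand out the eight datapaths in 2-issue pairs, giving every runnable task at least one pair and spending the spare pairs on the tasks with the most unmet ILP:

```python
# Hypothetical policy sketch: divide 4 datapath pairs (8 datapaths) among
# runnable tasks, proportionally to each task's measured ILP.
# Assumes at most `total_pairs` tasks are runnable.

def assign_pairs(ilp_per_task, total_pairs=4):
    tasks = list(ilp_per_task)
    pairs = {t: 1 for t in tasks}            # minimum: one 2-issue pair each
    for _ in range(total_pairs - len(tasks)):
        # Give the next spare pair to the task with the most unmet ILP
        # (measured ILP minus the issue width already granted).
        t = max(tasks, key=lambda t: ilp_per_task[t] - 2 * pairs[t])
        pairs[t] += 1
    return pairs

# Task A has high ILP (6); tasks B and C are mostly sequential.
print(assign_pairs({"A": 6.0, "B": 1.2, "C": 1.0}))  # {'A': 2, 'B': 1, 'C': 1}
```

Because the datapaths are reassigned rather than the tasks migrated, no architectural state has to be saved and restored.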

Heterogeneous Multicore (big.LITTLE) – Problem 3: migration is reactive, adding response time on top of the migration penalty. (Source: ARM – Programmer's Guide for ARMv8)

Phases: ILP changes too rapidly for heterogeneous core migrations, but not for our dynamic processor!

Phases – Solution: the compiler analyses loops and writes ILP info into a control register.

Coverage: up to 72% on average.

Overhead: up to 2.35% on average.

Throughput: the dynamic processor is 20% faster than the heterogeneous multicore.

Demo. [Speaker note: ideally a picture with four contexts writing ILP info into their control registers, and the runtime computing the best configuration based on that.]

Liquid Computing – Advantages: high single-thread performance; high multi-thread throughput; low configuration overhead (no migration penalties); low interrupt latency.

Applications – image processing pipeline (Rolf, Joost), Doom (Koray, Jeroen), demos (Muneeb, Joost, Jeroen), benchmarks: SPEC, MiBench, Mälardalen, Powerstone (Anthony, Joost)
Operating system support – Linux (mainly Joost; some low-level code written/fixed/updated by Anthony & Jeroen), FreeRTOS (Jeroen, Muneeb)
Runtime libraries – Newlib (Joost, Anthony), uClibc (Tom, Joost), floating point & division, math (Joost)
Compilers – HP VEX, GCC (IBM, Anthony, Joost), CoSy (Hugo), LLVM (Maurice, Hugo), Open64 (Joost)
Binutils – assembler, linker, etc. (Anthony), VEXparse (Anthony, Jeroen)
Architectural simulator (Joost)
Debug hardware, tools and interface (Jeroen)
Hardware design – VHDL (Jeroen)
ASIC manufacturing effort – core (Lennart), interface (Shizao), supported by Jeroen

http://rvex.ewi.tudelft.nl

Liquid Computing – Fault-tolerance. [Timeline: the datapaths run Task 2 redundantly in a protected configuration.]
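A common way to turn such redundant execution into fault tolerance is majority voting over the replicated results, which masks a fault in any single copy. The following is an illustrative sketch of that idea only, not the rVEX protection mechanism:

```python
# Illustrative sketch: majority vote over results produced by redundant
# copies of the same task; a single faulty copy is outvoted.

from collections import Counter

def vote(results):
    value, count = Counter(results).most_common(1)[0]
    if count <= len(results) // 2:
        raise RuntimeError("no majority: fault cannot be masked")
    return value

print(vote([42, 42, 7]))  # the single faulty result (7) is outvoted -> 42
```

With a dynamic processor, the same datapaths that normally serve several tasks can be reassigned to replicate one critical task when protection is required.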

Image processing – FPGA overlay fabric: streaming architecture, 16×4 cores, 194 MHz.