Liquid computing – the rVEX approach: ILP-driven dynamic core adaptations. Joost J. Hoozemans – Computer Engineering, TU Delft. Monday, 19 November 2018
Observation: Embedded workloads are becoming increasingly dynamic in intensity (number of tasks), characteristics (amount and type of parallelism) and requirements (criticality). [Figure: past, current, future]
Realization: Dynamic workloads call for dynamic computing platforms
Vision – Liquid Architectures: Implementing a system that continuously optimizes its hardware for all of its running tasks
Current state of the art: heterogeneous multicore processors [Figure: Core A, Core B]
Heterogeneous Multicore (big.LITTLE) – Problem [Figure: Core A, Core B]
[Figure] Source: ARM – Programmer's Guide for ARMv8
[Figure: Core A, Core B] Source: AnandTech
Heterogeneous Multicore (big.LITTLE) – Problem: Some programs cannot make use of additional processor resources (parallel datapaths). A better metric is needed for choosing between the big and the little core: Instruction-Level Parallelism (ILP).
Heterogeneous Multicore (big.LITTLE) – Problem: In superscalar processors, ILP is implicit. Measuring a program's ILP requires running it on the largest core/configuration, and ILP extraction in hardware is power-hungry. Source: NVIDIA Tegra 4 Family CPU Architecture whitepaper
Solution: a VLIW-based dynamic processor
Superscalar vs. VLIW [Figure: Superscalar: program → compiler → sequential binary → hardware scheduler → datapaths. VLIW: program → compiler → explicitly parallel binary → datapaths (no hardware scheduler).]
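In the VLIW flow, the grouping work done by the superscalar hardware scheduler moves into the compiler. A minimal Python sketch of that step is greedy list scheduling: operations whose producers have already completed are packed into bundles of at most `issue_width` operations. It assumes a simplified model (unique operation names, one-cycle latency, every slot can execute every operation); the function and parameter names are illustrative and not taken from the rVEX toolchain.

```python
def schedule_bundles(ops, deps, issue_width):
    """Greedy list scheduling of operations into VLIW bundles.

    ops: operation names in program order (assumed unique).
    deps: dict mapping an operation to the set of operations it depends on.
    issue_width: maximum operations per bundle (number of datapaths).
    Assumes every operation completes in one cycle.
    """
    done = set()
    remaining = list(ops)
    bundles = []
    while remaining:
        bundle = []
        for op in remaining:
            if len(bundle) == issue_width:
                break
            # Ready only if all producers finished in an *earlier* bundle,
            # so operations in the same bundle never depend on each other.
            if deps.get(op, set()) <= done:
                bundle.append(op)
        if not bundle:
            raise ValueError("cyclic or unsatisfiable dependencies")
        for op in bundle:
            remaining.remove(op)
        done.update(bundle)
        bundles.append(bundle)
    return bundles
```

For example, with `deps = {"c": {"a"}, "d": {"b", "c"}}` and `issue_width=4`, the independent operations `a` and `b` share the first bundle, while `c` and `d` wait for their producers: `[["a", "b"], ["c"], ["d"]]`. The resulting bundle boundaries are exactly what a VLIW binary encodes explicitly.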
VLIW: explicit parallelism (ILP) – In VLIW processors, ILP is explicit: the compiler encodes it in the binary, marking bundle boundaries with stop bits.
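Because bundle boundaries are explicit, hardware or a tool can recover the parallel structure of a binary without any dependency analysis. The Python sketch below splits a stream of 32-bit syllables into bundles by scanning for stop bits. The bit position `STOP_BIT` is a hypothetical choice for illustration only; the real position is defined by the rVEX/VEX instruction encoding.

```python
STOP_BIT = 1  # hypothetical stop-bit position within a 32-bit syllable

def split_bundles(syllables, stop_bit=STOP_BIT):
    """Group 32-bit syllables into bundles; a syllable with the stop bit set ends a bundle."""
    bundles, current = [], []
    for word in syllables:
        current.append(word)
        if (word >> stop_bit) & 1:
            bundles.append(current)
            current = []
    if current:
        # A well-formed stream always terminates its last bundle.
        raise ValueError("syllable stream does not end with a stop bit")
    return bundles
```

With the assumed layout, `split_bundles([0x0, 0x2, 0x0, 0x0, 0x2])` yields two bundles of two and three syllables, since `0x2` is the only value with bit 1 set.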
Example bundles emitted by the compiler, in program order: [and | add | nop], [shl | add | nop], [sub | add | nop], [stw | add | nop | goto]
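Since the parallelism is explicit, a simple static ILP figure can be read straight off the bundles by counting useful (non-`nop`) operations per bundle. A minimal Python sketch, applied to the four example bundles above (and/add/nop, shl/add/nop, sub/add/nop, stw/add/nop/goto) and treating each `nop` as an empty slot:

```python
def bundle_ilp(bundles):
    """Average number of useful (non-nop) operations per bundle."""
    useful = sum(1 for bundle in bundles for op in bundle if op != "nop")
    return useful / len(bundles)

# The four example bundles from the slides; "nop" marks an unused slot.
example = [
    ["and", "add", "nop"],
    ["shl", "add", "nop"],
    ["sub", "add", "nop"],
    ["stw", "add", "nop", "goto"],
]
print(bundle_ilp(example))  # 9 useful operations / 4 bundles -> 2.25
```

No bundle in this fragment contains more than three useful operations, so for this code a wider configuration would only add empty slots.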
Heterogeneous Multicore (big.LITTLE) – Problem 2: Migration penalty [Figure: timeline (t) of Core A and Core B running Task 1 and Task 2; moving the tasks between cores requires saving and restoring each task's state (migration penalty); labels mark underutilization on one core and unused ILP on the other]
Solution: Liquid Computing – a dynamic processor that assigns datapaths to threads [Figure: timeline (t) of datapath pairs 1 & 2, 3 & 4, 5 & 6, 7 & 8]
Solution: Liquid Computing – assigning datapaths to threads [Figure: timeline (t); first all four datapath pairs run Task 2, then the pairs are reassigned to Task 4, Task 3, Task 1 and Task 3; annotation: 5 clock cycles]
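One way to picture this assignment: treat the eight datapaths as four pairs (1 & 2, 3 & 4, 5 & 6, 7 & 8) and record, for each pair, which task (hardware context) uses it, so that a reconfiguration amounts to writing a new mapping. The Python sketch below packs such a mapping into a 16-bit word with one 4-bit field per pair. The field width, the field order and the `IDLE` marker are hypothetical choices for illustration, not the actual rVEX configuration-register format.

```python
NUM_PAIRS = 4  # datapath pairs: (1 & 2), (3 & 4), (5 & 6), (7 & 8)
IDLE = 0xF     # hypothetical context ID marking a switched-off pair

def encode_config(assignment):
    """Pack one 4-bit context ID per datapath pair into a 16-bit word.

    assignment[0] describes datapaths 1 & 2 and lands in the lowest nibble.
    This layout is an illustrative assumption, not the rVEX register format.
    """
    if len(assignment) != NUM_PAIRS:
        raise ValueError("need exactly one context ID per datapath pair")
    word = 0
    for pair, ctx in enumerate(assignment):
        if not 0 <= ctx <= 0xF:
            raise ValueError("context ID must fit in 4 bits")
        word |= ctx << (4 * pair)
    return word

def decode_config(word):
    """Unpack a 16-bit configuration word into per-pair context IDs."""
    return [(word >> (4 * pair)) & 0xF for pair in range(NUM_PAIRS)]

def datapaths_of(assignment, ctx):
    """Number of datapaths (two per pair) currently assigned to a context."""
    return 2 * assignment.count(ctx)
```

Reading the slide's two snapshots as pairs 1 & 2 through 7 & 8, Task 2 on all eight datapaths encodes as `0x2222`, and Tasks 4, 3, 1, 3 encode as `0x3134`, leaving Task 3 with four datapaths.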
Heterogeneous Multicore (big.LITTLE) – Problem 3: Switching is reactive, so every change costs response time plus the migration penalty. Source: ARM – Programmer's Guide for ARMv8
Phases: ILP changes too rapidly for heterogeneous core migrations, but not for our dynamic processor!
Phases – Solution: The compiler analyses loops and writes ILP info into a control register.
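A minimal Python sketch of this compiler step, assuming the compiler already has the bundled schedule of each loop body: compute the loop's average ILP, round it up to a supported issue width, and emit a control-register write just before the loop. The supported widths (2, 4, 8), the `ILP_HINT` register name and the emitted `write` pseudo-instruction are all hypothetical; using the average (rather than, say, the peak) bundle occupancy is likewise just one possible policy.

```python
SUPPORTED_WIDTHS = (2, 4, 8)  # assumed issue widths (1, 2 or 4 datapath pairs)

def requested_width(loop_bundles):
    """Round the loop body's average ILP up to the smallest supported width."""
    useful = sum(1 for bundle in loop_bundles for op in bundle if op != "nop")
    ilp = useful / len(loop_bundles)
    for width in SUPPORTED_WIDTHS:
        if ilp <= width:
            return width
    return SUPPORTED_WIDTHS[-1]  # cannot exceed the widest configuration

def annotate_loop(label, loop_bundles):
    """Return pseudo-assembly: a hint write placed before the loop label.

    'write ILP_HINT, <width>' is a hypothetical instruction and register name.
    """
    width = requested_width(loop_bundles)
    return [f"    write ILP_HINT, {width}", f"{label}:"]
```

For a loop body scheduled as `[["add", "mul", "ldw"], ["add", "sub", "nop"]]` the average ILP is 2.5, so `annotate_loop("loop0", ...)` requests a width of 4. Because the hint is written once per loop, its cost is paid on loop entry rather than on every iteration.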
Coverage: up to 72% (avg.)
Overhead: up to 2.35% (avg.)
Throughput: the dynamic processor is 20% faster than the heterogeneous multicore
Demo: Ideally I would like a picture showing 4 contexts writing ILP info into their control registers, and the runtime using that information to calculate the best configuration
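The runtime side of this demo could be sketched in Python as follows: each context has published a requested issue width (2, 4 or 8 datapaths), and the runtime divides four datapath pairs among the contexts. The allocation policy here (one pair per context first, then top up the smallest requests) is an assumption chosen for simplicity, not the rVEX runtime's actual algorithm, and `allocate_pairs` is a hypothetical name.

```python
NUM_PAIRS = 4  # eight datapaths grouped into four pairs (assumed granularity)

def allocate_pairs(requests):
    """Decide how many datapath pairs each context gets.

    requests: dict mapping context ID -> requested issue width (2, 4 or 8),
    e.g. the values the contexts wrote into their control registers.
    Policy (an assumption): every context first gets one pair, smallest
    requests first; leftover pairs then top up the smallest unmet requests.
    Contexts that get no pair are omitted (they wait); unrequested pairs stay free.
    """
    order = sorted(requests, key=lambda ctx: (requests[ctx], ctx))
    alloc = {}
    free = NUM_PAIRS
    for ctx in order:  # round 1: one pair each while pairs remain
        if free == 0:
            break
        alloc[ctx] = 1
        free -= 1
    for ctx in order:  # round 2: grow allocations toward each request
        if ctx not in alloc:
            continue
        want = max(0, requests[ctx] // 2 - alloc[ctx])
        extra = min(want, free)
        alloc[ctx] += extra
        free -= extra
    return alloc
```

With four contexts requesting 8, 2, 2 and 2 datapaths, each context receives one pair; a single context requesting 8 receives all four pairs. Serving small requests first maximizes the number of fully satisfied contexts, at the cost of the widest requester.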
Liquid Computing – Advantages: high single-thread performance; high multi-thread throughput; low configuration overhead (no migration penalties); low interrupt latency
Applications: image processing pipeline (Rolf, Joost), Doom (Koray, Jeroen), demos (Muneeb, Joost, Jeroen)
Benchmarks: SPEC, MiBench, Mälardalen, Powerstone (Anthony, Joost)
Operating system support: Linux (mainly Joost; some low-level code written/fixed/updated by Anthony & Jeroen), FreeRTOS (Jeroen, Muneeb)
Runtime libraries: Newlib (Joost, Anthony), uClibc (Tom, Joost), floating point & division, math (Joost)
Compilers: HP VEX, GCC (IBM, Anthony, Joost), CoSy (Hugo), LLVM (Maurice, Hugo), Open64 (Joost)
Binutils: assembler, linker, etc. (Anthony), VEXparse (Anthony, Jeroen)
Architectural simulator (Joost)
Debug hardware, tools and interface (Jeroen)
Hardware design: VHDL (Jeroen)
ASIC manufacturing effort: core (Lennart), interface (Shizao), supported by Jeroen
http://rvex.ewi.tudelft.nl
Liquid Computing – Fault tolerance [Figure: timeline (t); Task 2 runs on multiple datapath pairs at once, labelled "Protected"]
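Running the same task on several datapath pairs only protects it if the replicas' results are checked. A common scheme, assumed here rather than taken from the rVEX design, is to compare the replica outputs: two copies can detect a mismatch, and three or more can mask a single fault by majority vote. A minimal Python sketch of such a voter (`vote` is a hypothetical name):

```python
from collections import Counter

def vote(results):
    """Majority vote over the outputs of replicated task instances.

    Returns (value, faulty_indices). Raises RuntimeError when no strict
    majority exists, e.g. two replicas that disagree: the fault is
    detected but cannot be corrected.
    """
    value, count = Counter(results).most_common(1)[0]
    if count * 2 <= len(results):
        raise RuntimeError("no majority: fault detected but not correctable")
    faulty = [i for i, r in enumerate(results) if r != value]
    return value, faulty
```

With three replicas, `vote([7, 9, 7])` returns `(7, [1])`: the correct value plus the index of the replica that deviated, which a runtime could use to retire or re-check that datapath pair.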
Image processing: FPGA overlay fabric with a streaming architecture, 16x4 (64) cores at 194 MHz