Designing Customized ISA Processors using High Level Synthesis

Designing Customized ISA Processors using High Level Synthesis
Sam Skalicky, Tejaswini Ananthanarayana, Sonia Lopez, Marcin Lukowiak

Outline
- Motivation
- Background
- Our Approach
- Implementation Flow
- Experiments & Results
- Conclusion

Motivation
- Wide availability of soft processors
  - MB (MicroBlaze), NIOS, MIPS, custom, etc.
- Reconfigurable logic allows for extreme configurability
- Existing processor configurability is not enough
  - Pipeline stages, cache, mult/div, float, peripherals
- A typical application uses only a subset of the ISA's instructions
  - MB ISA (144 instructions), MIPS (153 instructions)
  - Kernels (linear algebra, encryption) use fewer than 20

Background
- Classic processor design
  - Low level: HDL
  - High level: LISA, Lava, Bluespec, Chisel
- C/C++ processor simulators are available
  - MIPS => SPIM, MB => ISS, ...
- High level synthesis tools are now much more capable
  - VivadoHLS, LegUp, ImpulseC, Synphony, ...

Our Approach
- Take a C/C++ processor simulator
  - Implements the ISA
  - Keep only the necessary instructions
- Produce an HDL implementation using HLS
- Customize the implementation using directives
  - Pipelining, register partitioning, etc.

Implementation Flow

Sample Architecture C/C++

void datapath(unsigned IM[], unsigned DM[]) {
    int reg[32];
    unsigned PC = 0;
    main_loop: while (true) {
        unsigned instr = IM[PC];   // fetch
        switch (OPCODE(instr)) {   // decode
        case 0x00:  // Begin 0x00 R-Type
            switch (FUNCT(instr)) {
#ifdef ADD_INST
            case 0x20:  // add
                reg[RD(instr)] = reg[RS(instr)] + reg[RT(instr)];
                break;
#endif
            }
            break;  // End 0x00 R-Type
#ifdef LW_INST
        case 0x23:  // lw
            reg[RT(instr)] = DM[reg[RS(instr)] + IMM(instr)];
            break;
#endif
#ifdef SW_INST
        case 0x2b:  // sw
            DM[reg[RS(instr)] + IMM(instr)] = reg[RT(instr)];
            break;
#endif
        }  // End instruction decoding
        PC += 1;
    }
}
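The sample code relies on field-extraction macros (OPCODE, FUNCT, RS, RT, RD, IMM) that the slide does not define. A minimal sketch of how they could look, assuming the standard MIPS32 R-type and I-type bit layout; `encode_rtype` is a hypothetical helper added only to exercise the macros and is not part of the original design.

```c
#include <assert.h>
#include <stdint.h>

/* Field extraction, assuming the standard MIPS32 instruction layout:
 * opcode[31:26] rs[25:21] rt[20:16] rd[15:11] shamt[10:6] funct[5:0],
 * with the I-type immediate in bits [15:0]. */
#define OPCODE(instr) (((instr) >> 26) & 0x3Fu)
#define RS(instr)     (((instr) >> 21) & 0x1Fu)
#define RT(instr)     (((instr) >> 16) & 0x1Fu)
#define RD(instr)     (((instr) >> 11) & 0x1Fu)
#define FUNCT(instr)  ((instr) & 0x3Fu)
/* Sign-extend the 16-bit immediate without relying on narrowing casts. */
#define IMM(instr)    ((int)(((instr) & 0xFFFFu) ^ 0x8000u) - 0x8000)

/* Hypothetical helper: build an R-type word (opcode 0x00, shamt 0). */
static inline uint32_t encode_rtype(unsigned rs, unsigned rt, unsigned rd,
                                    unsigned funct) {
    assert(rs < 32 && rt < 32 && rd < 32 && funct < 64);
    return ((uint32_t)rs << 21) | ((uint32_t)rt << 16) |
           ((uint32_t)rd << 11) | funct;
}
```

For example, `encode_rtype(1, 2, 3, 0x20)` builds the word for `add $3, $1, $2`, which the macros decode back into opcode 0x00, funct 0x20, and the three register indices.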

Sample Architecture HDL

Experiments
- Create customized processors from C MIPS simulator code (based on SPIM)
  - Kernels: Dot product & AES
- Apply HLS directives to improve design
- Compare to common soft processors
- Using Xilinx VivadoHLS & Vivado tools
- Goal: evaluate this approach in terms of ease of use, resource utilization, and performance

Experiments

Experiments
1. Analyze kernel code to determine which instructions to implement
2. Customize architecture code
3. Base HLS design
4. Apply HLS directives
5. Improved HLS design
6. Compare results
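The first two steps of this flow (finding which instructions a kernel uses, then enabling only those in the architecture code) can be sketched as below. This is an illustration under stated assumptions, not the authors' tooling: the kernel is assumed to be available as an array of MIPS instruction words, only the three instructions from the sample architecture are recognized, and the names `scan_kernel`, `emit_defines`, and the `NEED_*` flags are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical flags, one per instruction in the sample architecture. */
enum { NEED_ADD = 1, NEED_LW = 2, NEED_SW = 4, NEED_OTHER = 8 };

/* Scan an instruction memory image and record which instructions occur. */
unsigned scan_kernel(const uint32_t im[], size_t n) {
    unsigned needed = 0;
    assert(im != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        unsigned opcode = (im[i] >> 26) & 0x3Fu;
        unsigned funct = im[i] & 0x3Fu;
        if (opcode == 0x00 && funct == 0x20)
            needed |= NEED_ADD;      /* R-type add */
        else if (opcode == 0x23)
            needed |= NEED_LW;       /* lw */
        else if (opcode == 0x2B)
            needed |= NEED_SW;       /* sw */
        else
            needed |= NEED_OTHER;    /* anything this sketch does not know */
    }
    return needed;
}

/* Write one #define per required instruction; returns how many were written.
 * The macro names mirror the #ifdef guards in the sample architecture. */
int emit_defines(FILE *out, unsigned needed) {
    int count = 0;
    if (needed & NEED_ADD) { fputs("#define ADD_INST\n", out); count++; }
    if (needed & NEED_LW)  { fputs("#define LW_INST\n", out);  count++; }
    if (needed & NEED_SW)  { fputs("#define SW_INST\n", out);  count++; }
    return count;
}
```

Writing the output of `emit_defines` to a header that the architecture code includes would leave only the guarded cases a kernel needs in the design that HLS synthesizes.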

Experiments - directives
Two directives annotate the sample architecture code:
- Partition (register partitioning of reg[32])
- Pipeline
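A hedged sketch of where such directives could sit in the source. The slides do not give the exact directive options, so the `complete` partitioning and the default `PIPELINE` settings below are assumptions, as are placing the pipeline directive on the main loop, the `max_steps` bound (added so the sketch terminates when run in software), and the `demo_add` driver. Standard C compilers typically ignore the unknown `#pragma HLS` lines, so the same file can be run as a simulator.

```c
#include <assert.h>
#include <stddef.h>

#define OPCODE(i) (((i) >> 26) & 0x3Fu)
#define RS(i)     (((i) >> 21) & 0x1Fu)
#define RT(i)     (((i) >> 16) & 0x1Fu)
#define RD(i)     (((i) >> 11) & 0x1Fu)
#define FUNCT(i)  ((i) & 0x3Fu)
#define IMM(i)    ((int)(((i) & 0xFFFFu) ^ 0x8000u) - 0x8000)

/* Word-addressed memories, as in the sample architecture (PC += 1). */
void datapath(const unsigned IM[], size_t im_len, unsigned DM[],
              unsigned max_steps) {
    int reg[32] = {0};
    /* Assumed option: split the register array into individual registers. */
#pragma HLS ARRAY_PARTITION variable=reg complete dim=1
    unsigned PC = 0;
    for (unsigned step = 0; step < max_steps; step++) {
        /* Assumed placement: pipeline the fetch/decode/execute loop. */
#pragma HLS PIPELINE
        assert(PC < im_len);
        unsigned instr = IM[PC];
        switch (OPCODE(instr)) {
        case 0x00:                       /* R-type */
            if (FUNCT(instr) == 0x20)    /* add */
                reg[RD(instr)] = reg[RS(instr)] + reg[RT(instr)];
            break;
        case 0x23:                       /* lw */
            reg[RT(instr)] = (int)DM[reg[RS(instr)] + IMM(instr)];
            break;
        case 0x2B:                       /* sw */
            DM[reg[RS(instr)] + IMM(instr)] = (unsigned)reg[RT(instr)];
            break;
        }
        PC += 1;
    }
}

/* Hypothetical driver: loads two words, adds them, stores the sum. */
unsigned demo_add(unsigned a, unsigned b) {
    const unsigned IM[4] = {
        0x8C010000u,  /* lw  $1, 0($0)  */
        0x8C020001u,  /* lw  $2, 1($0)  */
        0x00221820u,  /* add $3, $1, $2 */
        0xAC030002u   /* sw  $3, 2($0)  */
    };
    unsigned DM[3] = {a, b, 0};
    datapath(IM, 4, DM, 4);
    return DM[2];
}
```

Partitioning the register array lets the tool give each register its own storage instead of one shared memory, which is what allows several register accesses in the same cycle.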

Results – General Observations
- Base processors were multi-cycle
- Pipelined processors were not fully pipelined
  - Initiation interval > 1 due to hazards
  - VivadoHLS only stalls on top level interfaces (FIFO)
- Registers implemented as BRAM
- Separate function units (no ALU)

Results – Dot product
Base design – no directives
- 3-9 cycles per instruction
- All functional units pipelined

Results – Dot product
Improved design – pipelining (Pipeline directive)
- 8 cycles per instruction
- 4 cycle initiation interval
- 2 simultaneous instructions

Results – AES
Base design – no directives
- 3-4 cycles per instruction
- All functional units combinational
- Bit manipulations

Results – AES
Improved design – pipelining (Pipeline directive)
- 4 cycles per instruction
- 3 cycle initiation interval
- 2 simultaneous instructions

Results – AES
Improved design – pipelining & register partitioning (Partition and Pipeline directives)
- 3 cycles per instruction
- 2 cycle initiation interval
- 2 simultaneous instructions
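The pipelined figures can be related with a little arithmetic. The sketch below reads "cycles per instruction" as the per-instruction latency and assumes an ideal pipeline with no stalls beyond the stated initiation interval (II); both are assumptions, and the function names are hypothetical. Under that reading, the number of overlapping instructions is ceil(latency / II), which gives 2 for all three pipelined designs, consistent with the slides.

```c
#include <assert.h>

/* Instructions in flight at once: ceil(latency / ii). */
unsigned in_flight(unsigned latency, unsigned ii) {
    assert(ii > 0);
    return (latency + ii - 1) / ii;
}

/* Ideal cycle count for n instructions: the first takes `latency` cycles,
 * and each later one completes `ii` cycles after its predecessor. */
unsigned long total_cycles(unsigned long n, unsigned latency, unsigned ii) {
    assert(ii > 0);
    return n == 0 ? 0 : latency + (n - 1) * ii;
}
```

With these assumptions, 100 instructions on the pipelined dot-product design (latency 8, II 4) would take 8 + 99 * 4 = 404 cycles, so throughput is governed by the initiation interval rather than by the latency.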

Results
- Custom ISA HLS designs
  - Separate processor for each kernel & combined
- Existing soft processors: MB, MIPSfpga
  - MB: minimal & standard
- Implemented using Vivado 2015.2
  - Digilent Nexys4 board, Artix-7 100T

Results

Results (chart legend)
- v1 – base
- v2 – pipelining
- v3 – pipelining & register partitioning
- std – standard
- min – minimal

Results - Summary
- Execution time was never more than 2.2x that of MB
- Base designs used 3x fewer resources than minimal MB and 6x fewer than standard MB
  - Dot product used more FFs, similar LUTs, similar slices
- Pipelining improved performance to 1.7x MB for DP & 1.5x for AES
- Combined design was limited by DP instructions (MULT)

Conclusion
- Presented an approach for designing custom ISA processors, using HLS for ease of use
- HLS-produced HDL used 1/6th the resources of a standard MicroBlaze processor
- Reasonable trade-off in terms of performance
- Very minimal user effort required
- In resource-limited designs, customized soft processors can be produced quickly and easily