Designing Customized ISA Processors using High Level Synthesis Sam Skalicky, Tejaswini Ananthanarayana, Sonia Lopez, Marcin Lukowiak
Outline Motivation Background Our Approach Implementation Flow Experiments & Results Conclusion
Motivation Wide availability of soft processors MB, NIOS, MIPS, custom, etc… Reconfigurable logic allows for extreme configurability Existing processor configurability is not enough Pipeline stages, cache, mult/div, float, peripherals Typical app utilizes a subset of all ISA instructions MB ISA (144), MIPS (153) Kernels (linear algebra, encryption) use fewer than 20
Background Classic processor design Low level: HDL High level: LISA, Lava, Bluespec, Chisel C/C++ processor simulators available MIPS => SPIM, MB => ISS, … High level synthesis tools much more capable VivadoHLS, LegUp, ImpulseC, Synphony, …
Our Approach Take C/C++ processor simulator Implements ISA Only necessary instructions Produce HDL implementation using HLS Customize the implementation using directives Pipelining, register partitioning, etc…
Implementation Flow
Sample Architecture C/C++
void datapath(unsigned IM[], unsigned DM[]) {
    int reg[32];
    unsigned PC = 0;
    main_loop: while (true) {
        unsigned instr = IM[PC];
        switch (OPCODE(instr)) {
        case 0x00: // Begin 0x00 R-Type
            switch (FUNCT(instr)) {
#ifdef ADD_INST
            case 0x20: reg[RD(instr)] = reg[RS(instr)] + reg[RT(instr)]; break;
#endif
            }
            break; // End 0x00 R-Type
#ifdef LW_INST
        case 0x23: reg[RT(instr)] = DM[reg[RS(instr)] + IMM(instr)]; break;
#endif
#ifdef SW_INST
        case 0x2b: DM[reg[RS(instr)] + IMM(instr)] = reg[RT(instr)]; break;
#endif
        } // End instruction decoding
        PC += 1;
    }
}
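The sample code relies on field-extraction macros (OPCODE, FUNCT, RS, RT, RD, IMM) that the slide does not define. A minimal sketch of what they could look like, assuming the standard 32-bit MIPS instruction layout; sign-extending IMM and the helper names encode_r and encode_i are illustrative assumptions, not taken from the original simulator:

```c
#include <stdint.h>

/* Field extraction for the standard 32-bit MIPS encoding. */
#define OPCODE(i) (((i) >> 26) & 0x3Fu)  /* bits 31..26 */
#define RS(i)     (((i) >> 21) & 0x1Fu)  /* bits 25..21 */
#define RT(i)     (((i) >> 16) & 0x1Fu)  /* bits 20..16 */
#define RD(i)     (((i) >> 11) & 0x1Fu)  /* bits 15..11 */
#define FUNCT(i)  ((i) & 0x3Fu)          /* bits 5..0   */
/* Assumption: the 16-bit immediate is sign-extended to 32 bits. */
#define IMM(i)    ((int32_t)(((i) & 0xFFFFu) ^ 0x8000u) - 0x8000)

/* Hypothetical helper: build an R-type word (opcode 0x00, shamt 0). */
static uint32_t encode_r(uint32_t rs, uint32_t rt, uint32_t rd, uint32_t funct) {
    return (rs << 21) | (rt << 16) | (rd << 11) | funct;
}

/* Hypothetical helper: build an I-type word such as LW or SW. */
static uint32_t encode_i(uint32_t op, uint32_t rs, uint32_t rt, int16_t imm) {
    return (op << 26) | (rs << 21) | (rt << 16) | (uint16_t)imm;
}
```

For example, encode_r(1, 2, 3, 0x20) yields the word for add $3, $1, $2, which the macros decode back into OPCODE 0x00, FUNCT 0x20, RS 1, RT 2 and RD 3.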
Sample Architecture HDL
Experiments Create customized processors from C MIPS simulator code (based on SPIM) Kernels: Dot product & AES Apply HLS directives to improve design Compare to common soft processors Using Xilinx VivadoHLS & Vivado tools Goal: evaluate this approach in terms of ease of use, resource utilization, performance
Experiments
Experiments Analyze kernel code to determine which instructions to implement Customize architecture code Base HLS design Apply HLS directives Improved HLS design Compare results
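The first step of this flow, determining which instructions a kernel actually uses, could be automated by scanning the assembled kernel. A minimal sketch, assuming standard MIPS encoding; the isa_usage type and count_used_instructions are hypothetical names, not part of the presented flow:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define OPCODE(i) (((i) >> 26) & 0x3Fu)
#define FUNCT(i)  ((i) & 0x3Fu)

/* used_op[k] is set when primary opcode k appears in the kernel;
   used_funct[k] is set when R-type (opcode 0x00) function code k appears. */
typedef struct {
    unsigned char used_op[64];
    unsigned char used_funct[64];
} isa_usage;

/* Scan n instruction words and return how many distinct instructions occur.
   Simplification: opcodes that are further split by other fields
   (for example REGIMM, opcode 0x01) are counted as one instruction. */
static size_t count_used_instructions(const uint32_t *im, size_t n, isa_usage *u) {
    size_t distinct = 0;
    memset(u, 0, sizeof *u);
    for (size_t k = 0; k < n; k++) {
        unsigned op = OPCODE(im[k]);
        if (op == 0x00) {
            unsigned fn = FUNCT(im[k]);
            if (!u->used_funct[fn]) { u->used_funct[fn] = 1; distinct++; }
        } else if (!u->used_op[op]) {
            u->used_op[op] = 1;
            distinct++;
        }
    }
    return distinct;
}
```

Each flag set in the resulting table would correspond to one instruction macro (such as ADD_INST, LW_INST or SW_INST) to enable when customizing the architecture code.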
Experiments - directives Applied to the sample datapath code Partition: register array (reg[32]) Pipeline: main loop
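In Vivado HLS, directives like these can be written as pragmas in the C source. The sketch below shows one plausible placement; ARRAY_PARTITION and PIPELINE are real Vivado HLS pragmas, but their exact options and placement in the original design are not shown on the slide and are assumed here. The max_steps bound is also an assumption, added so the loop terminates when run in software:

```c
#include <stdint.h>

#define OPCODE(i) (((i) >> 26) & 0x3Fu)
#define RS(i)     (((i) >> 21) & 0x1Fu)
#define RT(i)     (((i) >> 16) & 0x1Fu)
#define RD(i)     (((i) >> 11) & 0x1Fu)
#define FUNCT(i)  ((i) & 0x3Fu)
#define IMM(i)    ((int32_t)(((i) & 0xFFFFu) ^ 0x8000u) - 0x8000)

/* Reduced datapath with ADD, LW and SW only. A regular C compiler ignores
   the HLS pragmas, so the function can also be tested in software. */
void datapath(const uint32_t IM[], uint32_t DM[], unsigned max_steps) {
    int32_t reg[32] = {0};
    /* Partition: split the register file into individual registers. */
#pragma HLS ARRAY_PARTITION variable=reg complete dim=1
    unsigned PC = 0;

    /* Assumption: a step bound stands in for the slide's while(true). */
    for (unsigned step = 0; step < max_steps; step++) {
        /* Pipeline: overlap successive iterations of the main loop. */
#pragma HLS PIPELINE
        uint32_t instr = IM[PC];
        switch (OPCODE(instr)) {
        case 0x00: /* R-type */
            if (FUNCT(instr) == 0x20) /* ADD */
                reg[RD(instr)] = reg[RS(instr)] + reg[RT(instr)];
            break;
        case 0x23: /* LW */
            reg[RT(instr)] = (int32_t)DM[reg[RS(instr)] + IMM(instr)];
            break;
        case 0x2b: /* SW */
            DM[reg[RS(instr)] + IMM(instr)] = (uint32_t)reg[RT(instr)];
            break;
        }
        PC += 1;
    }
}
```

As on the slide, memory is word-addressed: PC advances by one per instruction and data addresses index DM directly.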
Results – General Observations Base processors were multi-cycle Pipelined processors were not fully pipelined Initiation interval > 1 due to hazards VivadoHLS only stalls on top level interfaces (FIFO) Registers implemented as BRAM Separate function units (no ALU)
Results – Dot product Base design – no directives 3-9 cycles per instruction All functional units pipelined
Results – Dot product Improved design – pipelining 8 cycles per instruction 4 cycle initiation interval 2 simultaneous instructions
Results – AES Base design – no directives 3-4 cycles per instruction All functional units combinational Bit manipulations
Results – AES Improved design – pipelining 4 cycles per instruction 3 cycle initiation interval 2 simultaneous instructions
Results – AES Improved design – pipelining & register partitioning 3 cycles per instruction 2 cycle initiation interval 2 simultaneous instructions
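The "simultaneous instructions" figures on the pipelined slides are consistent with the reported latency and initiation interval (II). A minimal sketch of that arithmetic, assuming an idealized pipeline with no stalls beyond the stated II; the function names are illustrative:

```c
/* Instructions in flight at once: ceil(latency / ii). */
static unsigned in_flight(unsigned latency, unsigned ii) {
    return (latency + ii - 1) / ii;
}

/* Cycles to execute n instructions: the first takes `latency` cycles,
   and each later one starts `ii` cycles after its predecessor. */
static unsigned long total_cycles(unsigned long n, unsigned latency, unsigned ii) {
    return n == 0 ? 0 : latency + (n - 1) * ii;
}
```

With the slide figures, in_flight(8, 4), in_flight(4, 3) and in_flight(3, 2) all evaluate to 2, matching the 2 simultaneous instructions reported for each pipelined design. Under the same idealized model, 1000 instructions on the pipelined dot product design would take 8 + 999 × 4 = 4004 cycles.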
Results Custom ISA HLS designs Separate processor for each kernel & combined Existing soft processors: MB, MIPSfpga MB: minimal & standard Implemented using Vivado 2015.2 Digilent Nexys4 board, Artix-7 100T
Results
Results v1 – base v2 – pipelining v3 – pipelining & register partitioning std – standard min – minimal
Results - Summary Execution time was never more than 2.2x that of MB Base designs used 3x fewer resources than minimal MB, 6x fewer than standard MB Dot product used more FFs, similar LUTs, similar slices Pipelining improved execution time to 1.7x MB for DP & 1.5x for AES Combined design was limited by DP instructions (MULT)
Conclusion Presented an approach for designing custom ISA processors, using HLS for ease of use HLS-produced HDL used 1/6th the resources of the standard MicroBlaze processor Reasonable trade-off in terms of performance Minimal user effort required In resource-limited designs, customized soft processors can be produced quickly and easily