Designing Customized ISA Processors using High Level Synthesis

Sam Skalicky, Tejaswini Ananthanarayana, Sonia Lopez, Marcin Lukowiak

Outline
Motivation
Background
Our Approach
Implementation Flow
Experiments & Results
Conclusion

Motivation
Soft processors are widely available: MicroBlaze (MB), NIOS, MIPS, custom cores, etc.
Reconfigurable logic allows for extreme configurability, but processor-level configurability is not enough: the usual options are pipeline stages, cache, mult/div, floating point, and peripherals.
A typical application uses only a subset of the ISA's instructions: the MB ISA has 144 instructions and MIPS has 153, yet kernels such as linear algebra and encryption use fewer than 20.

Background
Classic processor design:
Low level: HDL
High level: LISA, Lava, Bluespec, Chisel
C/C++ processor simulators are available: MIPS => SPIM, MB => ISS, …
High-level synthesis tools are much more capable: Vivado HLS, LegUp, Impulse C, Synphony, …

Our Approach
Take a C/C++ processor simulator that implements the ISA.
Keep only the necessary instructions.
Produce an HDL implementation using HLS.
Customize the implementation using directives: pipelining, register partitioning, etc.

Implementation Flow

Sample Architecture C/C++
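void datapath(unsigned IM[], unsigned DM[]) {
  // OPCODE, FUNCT, RS, RT, RD and IMM are the simulator's instruction-field
  // macros (not shown on the slide). Each instruction is compiled in only
  // if its X_INST macro is defined.
  int reg[32];       // register file
  unsigned PC = 0;   // word-addressed program counter
main_loop:
  while (true) {
    unsigned instr = IM[PC];     // fetch
    switch (OPCODE(instr)) {     // decode and execute
    case 0x00:                   // Begin 0x00 R-Type
      switch (FUNCT(instr)) {
#ifdef ADD_INST
      case 0x20:                 // add rd, rs, rt
        reg[RD(instr)] = reg[RS(instr)] + reg[RT(instr)];
        break;
#endif
      }
      break;                     // End 0x00 R-Type
#ifdef LW_INST
    case 0x23:                   // lw rt, imm(rs)
      reg[RT(instr)] = DM[reg[RS(instr)] + IMM(instr)];
      break;                     // without this, LW would fall through into SW
#endif
#ifdef SW_INST
    case 0x2b:                   // sw rt, imm(rs)
      DM[reg[RS(instr)] + IMM(instr)] = reg[RT(instr)];
      break;
#endif
    }                            // End instruction decoding
    PC += 1;
  }
}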
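To see what the decoder in this sample is switching on, the sketch below assumes the standard MIPS32 field layout (opcode in bits 31-26, rs in 25-21, rt in 20-16, rd in 15-11, funct in 5-0, 16-bit immediate in 15-0) for the OPCODE, RS, RT, RD, FUNCT and IMM macros, whose definitions the slide does not show. The encode_r and encode_i helpers are hypothetical, added only to build instruction words for the datapath's IM array.

```c
#include <stdint.h>

/* Assumed field-extraction macros (standard MIPS32 layout); the simulator's
   own definitions are not shown on the slide. */
#define OPCODE(i) (((i) >> 26) & 0x3Fu)
#define RS(i)     (((i) >> 21) & 0x1Fu)
#define RT(i)     (((i) >> 16) & 0x1Fu)
#define RD(i)     (((i) >> 11) & 0x1Fu)
#define FUNCT(i)  ((i) & 0x3Fu)
#define IMM(i)    ((int32_t)(int16_t)((i) & 0xFFFFu)) /* sign-extended */

/* Hypothetical helper: R-type word, opcode 0x00 | rs | rt | rd | shamt=0 | funct. */
uint32_t encode_r(unsigned rs, unsigned rt, unsigned rd, unsigned funct) {
    return ((uint32_t)rs << 21) | ((uint32_t)rt << 16) |
           ((uint32_t)rd << 11) | (funct & 0x3Fu);
}

/* Hypothetical helper: I-type word, opcode | rs | rt | 16-bit immediate. */
uint32_t encode_i(unsigned op, unsigned rs, unsigned rt, int imm) {
    return ((uint32_t)op << 26) | ((uint32_t)rs << 21) |
           ((uint32_t)rt << 16) | ((uint32_t)imm & 0xFFFFu);
}
```

For example, encode_r(1, 2, 3, 0x20) builds add $3, $1, $2, which the datapath decodes through case 0x00 and then case 0x20, while encode_i(0x23, 4, 5, -1) builds lw $5, -1($4), whose immediate IMM recovers as -1 after sign extension.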

Sample Architecture HDL

Experiments
Create customized processors from C MIPS simulator code (based on SPIM).
Kernels: dot product & AES.
Apply HLS directives to improve the design.
Compare to common soft processors.
Tools: Xilinx Vivado HLS & Vivado.
Goal: evaluate this approach in terms of ease of use, resource utilization, and performance.

Experiments

Experiments
1. Analyze the kernel code to determine which instructions to implement.
2. Customize the architecture code.
3. Generate the base HLS design.
4. Apply HLS directives.
5. Generate the improved HLS design.
6. Compare results.
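The slides do not say how the instruction analysis was carried out. One way to sketch it, assuming the kernel is available as assembled MIPS instruction words, is to scan the words, mark which supported instructions occur, and print the matching X_INST defines that enable code in the datapath's #ifdef blocks. The InstDesc table, scan_kernel and emit_config are hypothetical and list only the three instructions from the sample slide; a real table would cover every instruction the simulator implements.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical description of one selectable instruction. */
typedef struct {
    const char *macro;   /* define that enables it in datapath() */
    uint32_t opcode;     /* bits 31-26 */
    int funct;           /* bits 5-0 for R-type, -1 for I-type */
} InstDesc;

static const InstDesc kSupported[] = {
    {"ADD_INST", 0x00, 0x20},
    {"LW_INST",  0x23, -1},
    {"SW_INST",  0x2b, -1},
};
#define N_SUPPORTED (sizeof kSupported / sizeof kSupported[0])

/* Sets used[k] = 1 for each table entry the kernel uses and returns the
   number of words that match no entry in the table. */
size_t scan_kernel(const uint32_t *code, size_t n, int used[]) {
    size_t unknown = 0;
    for (size_t k = 0; k < N_SUPPORTED; k++) used[k] = 0;
    for (size_t w = 0; w < n; w++) {
        uint32_t op = (code[w] >> 26) & 0x3Fu;
        uint32_t funct = code[w] & 0x3Fu;
        int hit = 0;
        for (size_t k = 0; k < N_SUPPORTED; k++) {
            if (kSupported[k].opcode == op &&
                (kSupported[k].funct < 0 ||
                 (uint32_t)kSupported[k].funct == funct)) {
                used[k] = 1;
                hit = 1;
                break;
            }
        }
        if (!hit) unknown++;
    }
    return unknown;
}

/* Prints one "#define X_INST" line per instruction the kernel uses. */
void emit_config(FILE *out, const int used[]) {
    for (size_t k = 0; k < N_SUPPORTED; k++)
        if (used[k]) fprintf(out, "#define %s\n", kSupported[k].macro);
}
```

For a kernel made of lw, add and a j (jump), scan_kernel marks ADD_INST and LW_INST and reports one unmatched word (the jump, which this reduced table does not cover), and emit_config then prints #define ADD_INST followed by #define LW_INST.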

Experiments - directives
The same datapath code is improved with two HLS directives: Partition, applied to the register file array int reg[32] (register partitioning), and Pipeline, applied to the main loop (main_loop).
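The slides label the two directives but do not show whether they were given as in-source pragmas or as separate directive settings. As one possible form, the sketch below writes them as Vivado HLS pragmas: ARRAY_PARTITION on reg, placed in the scope where reg is declared, and PIPELINE at the top of the loop body. The bounded function datapath_n and its n_steps parameter are hypothetical, added so the same code can also run as ordinary C; the slide's version loops forever. The field macros again assume the standard MIPS32 layout.

```c
#include <stdint.h>

/* Assumed field macros (standard MIPS32 layout, not shown in the slides). */
#define OPCODE(i) (((i) >> 26) & 0x3Fu)
#define RS(i)     (((i) >> 21) & 0x1Fu)
#define RT(i)     (((i) >> 16) & 0x1Fu)
#define RD(i)     (((i) >> 11) & 0x1Fu)
#define FUNCT(i)  ((i) & 0x3Fu)
#define IMM(i)    ((int32_t)(int16_t)((i) & 0xFFFFu))

/* Instruction selection, as produced by the kernel analysis step. */
#define ADD_INST
#define LW_INST
#define SW_INST

/* Hypothetical bounded variant of datapath(). A regular C compiler ignores
   the HLS pragmas (at most warning about them); Vivado HLS applies them. */
void datapath_n(const uint32_t IM[], uint32_t DM[], unsigned n_steps) {
    int32_t reg[32] = {0};
#pragma HLS ARRAY_PARTITION variable=reg complete dim=1
    uint32_t PC = 0;
main_loop:
    for (unsigned step = 0; step < n_steps; step++) {
#pragma HLS PIPELINE
        uint32_t instr = IM[PC];
        switch (OPCODE(instr)) {
        case 0x00: /* R-type */
            switch (FUNCT(instr)) {
#ifdef ADD_INST
            case 0x20:
                reg[RD(instr)] = reg[RS(instr)] + reg[RT(instr)];
                break;
#endif
            }
            break;
#ifdef LW_INST
        case 0x23:
            reg[RT(instr)] = (int32_t)DM[reg[RS(instr)] + IMM(instr)];
            break;
#endif
#ifdef SW_INST
        case 0x2b:
            DM[reg[RS(instr)] + IMM(instr)] = (uint32_t)reg[RT(instr)];
            break;
#endif
        }
        PC += 1;
    }
}
```

Partitioning reg completely turns the 32-entry array into individual registers instead of a BRAM with limited ports, which is what lets the pipelined loop read and write several registers per cycle.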

Results – General Observations
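Base processors were multi-cycle.
Pipelined processors were not fully pipelined: the initiation interval was greater than 1 because of hazards. Vivado HLS only stalls on top-level interfaces (FIFOs), so hazards inside the datapath have to be resolved by scheduling a longer initiation interval.
Registers were implemented as BRAM.
Separate functional units were generated for each operation (no shared ALU).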
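The hazards in question arise because consecutive loop iterations communicate through reg[]: if the next instruction reads a register that the previous one writes, the two cannot overlap freely. Since the register indices are only known at run time, the tool has to assume such a dependence may occur on every iteration. The sketch below checks this read-after-write condition for the three sample instructions; dest_reg, raw_hazard and is_add are hypothetical helpers and the field macros assume the standard MIPS32 layout.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed standard MIPS32 field macros (not shown in the slides). */
#define OPCODE(i) (((i) >> 26) & 0x3Fu)
#define RS(i)     (((i) >> 21) & 0x1Fu)
#define RT(i)     (((i) >> 16) & 0x1Fu)
#define RD(i)     (((i) >> 11) & 0x1Fu)
#define FUNCT(i)  ((i) & 0x3Fu)

static bool is_add(uint32_t i) { return OPCODE(i) == 0x00 && FUNCT(i) == 0x20; }

/* Register written by the instruction (rd for ADD, rt for LW), or -1 (SW). */
int dest_reg(uint32_t i) {
    if (is_add(i)) return (int)RD(i);
    if (OPCODE(i) == 0x23) return (int)RT(i);
    return -1;
}

/* True if `next` reads the register that `prev` writes: a read-after-write
   dependence through reg[] between consecutive loop iterations. */
bool raw_hazard(uint32_t prev, uint32_t next) {
    int d = dest_reg(prev);
    if (d < 0) return false;
    /* ADD reads rs as an operand; LW and SW read rs as the base address. */
    bool reads_rs = is_add(next) || OPCODE(next) == 0x23 || OPCODE(next) == 0x2b;
    /* ADD reads rt as an operand; SW reads rt as the value to store. */
    bool reads_rt = is_add(next) || OPCODE(next) == 0x2b;
    return (reads_rs && (int)RS(next) == d) || (reads_rt && (int)RT(next) == d);
}
```

For instance, lw $1, 0($0) followed by add $3, $1, $2 is a hazard (the add needs $1 before the load has written it), while sw $3, 2($0) followed by the same add is not, because SW writes no register.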

Results – Dot product
Base design – no directives: 3-9 cycles per instruction; all functional units pipelined.

Results – Dot product
Improved design – pipelining: 8 cycles per instruction, 4-cycle initiation interval, 2 simultaneous instructions.

Results – AES
Base design – no directives: 3-4 cycles per instruction; all functional units combinational (bit manipulations).

Results – AES
Improved design – pipelining: 4 cycles per instruction, 3-cycle initiation interval, 2 simultaneous instructions.

Results – AES
Improved design – pipelining & register partitioning: 3 cycles per instruction, 2-cycle initiation interval, 2 simultaneous instructions.
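The latency, initiation interval and number of simultaneous instructions reported for the pipelined designs are related by the usual pipelined-loop model. The sketch below assumes every instruction has the same latency and that there are no stalls beyond the initiation interval, which the real designs may not satisfy exactly; pipelined_cycles and in_flight are hypothetical names.

```c
/* Simple model of a pipelined instruction loop: every instruction has the
   same latency and a new one starts every `ii` cycles, with no extra stalls. */

/* Cycles from the first issue to the last completion for n instructions. */
unsigned long pipelined_cycles(unsigned long n, unsigned latency, unsigned ii) {
    if (n == 0) return 0;
    return latency + (n - 1) * ii;
}

/* Instructions overlapping in the pipeline: ceil(latency / ii). */
unsigned in_flight(unsigned latency, unsigned ii) {
    return (latency + ii - 1) / ii;
}
```

in_flight reproduces the 2 simultaneous instructions for all three pipelined designs (8/4, 4/3 and 3/2, rounded up). For an illustrative run of 1000 instructions, pipelined_cycles gives 4004 cycles for the pipelined dot-product design, 3001 for pipelined AES and 2001 for AES with register partitioning, so for long runs throughput is governed by the initiation interval rather than the latency.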

Results
Custom ISA HLS designs: a separate processor for each kernel, plus a combined processor.
Existing soft processors: MB (minimal and standard configurations) and MIPSfpga.
All designs implemented using Vivado on a Digilent Nexys4 board (Artix-7 100T).

Results

Results
v1 – base, v2 – pipelining, v3 – pipelining & register partitioning, std – standard MB, min – minimal MB

Results - Summary
Execution time was never more than 2.2x that of MB.
Base designs used 3x fewer resources than minimal MB and 6x fewer than standard MB.
The dot-product design used more FFs, but a similar number of LUTs and slices.
Pipelining brought execution time down to 1.7x that of MB for dot product and 1.5x for AES.
The combined design was limited by the dot-product instructions (MULT).

Conclusion
Presented an approach for designing custom ISA processors that uses HLS for ease of use.
The HLS-produced HDL used 1/6th the resources of the standard MicroBlaze processor, a reasonable trade-off in terms of performance.
Very minimal user effort is required.
In resource-limited designs, customized soft processors can be produced quickly and easily.
