6/25/2015, Platform Design, H. Corporaal and B. Mesman
Platform Design, TU/e 5kk70
Henk Corporaal, Bart Mesman
ASIP: Application Specific Instruction-set Processor
Application domain specific processors (ADSP or ASIP)
[Figure: flexibility versus efficiency axis, placing programmable CPU, programmable DSP, application domain specific processor, and application specific processor.]
Application domain specific processors (ADSP or ASIP)
- takes a well-defined application domain as a starting point
- exploits characteristics of the domain (computation kernels)
- still programmable within the domain
- e.g. MPEG2 coding uses the 8*8 DCT transform; DECT, GSM, etc.

[Figure: application domain mapped to a GP implementation versus an ADSP implementation]
- GP: performance from clock speed + ILP; flexible development (new apps.)
- ADSP: performance from ILP + tuning to domain; cost effective (high volume)

Problems:
- specification
- design time and effort
- manual design, large effort => synthesized cores
Outline
- design process
- retargetable code generation (problem statement)
- ADSP/VLIW architectures (Mistral 2 / A|RT designer)
- low power aspects (Mistral 2 / A|RT designer)
- discussion
- conclusion
Design process
3 phases:
1. exploration
2. hw design (layout) + processing
3. design appl. sw
Requirement: fast, accurate and early feedback.
[Figure: exploration loop. The application(s) and a processor model (parameters give an instance, e.g. a VLIW with shared RFs) feed SW (code generation) and HW design. Code generation gives estimations of cycles/alg and occupation; HW design gives estimations of nsec/cycle, area, and power/instr. Decision points: "OK?" and "more appl.?"; when done, go to phase 2.]
Problem statement
- A compiler is retargetable if it can generate code for a ‘new’ processor architecture specified in a machine description file.
- A guarded register transfer pattern (GRTP) is a register transfer pattern (RTP) together with the control bits of the instruction word that control the RTP, e.g.
    a := b + c | instr = xxxx0101
- GRTPs contain all inter-RT-conflict information.
- Instruction set extraction (ISE) is the process of generating all possible GRTPs for a specific processor.
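The claim that GRTPs carry the conflict information can be illustrated with a minimal sketch. It assumes a guard is written as a string of '0', '1' and 'x' (don't care), as in the `xxxx0101` example, and that two RTs conflict exactly when some instruction bit is required to be 0 by one and 1 by the other. The class `GRTP`, the helper names, and the second and third patterns are hypothetical, not part of any tool mentioned in the slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GRTP:
    """A register transfer pattern plus the instruction bits that guard it."""
    transfer: str  # e.g. "a := b + c"
    guard: str     # one character per instruction bit: '0', '1' or 'x'


def conflicts(p: GRTP, q: GRTP) -> bool:
    """True if some bit must be 0 for one pattern and 1 for the other."""
    if len(p.guard) != len(q.guard):
        raise ValueError("guards must cover the same instruction word")
    return any(a != b and "x" not in (a, b) for a, b in zip(p.guard, q.guard))


def merge_guards(p: GRTP, q: GRTP) -> str:
    """Instruction bits that enable both patterns in the same cycle."""
    if conflicts(p, q):
        raise ValueError("patterns cannot share an instruction word")
    return "".join(b if a == "x" else a for a, b in zip(p.guard, q.guard))


add = GRTP("a := b + c", "xxxx0101")    # pattern from the slide
sub = GRTP("a := b - c", "xxxx0110")    # hypothetical: same ALU, other opcode
load = GRTP("r := mem[p]", "10xxxxxx")  # hypothetical: uses different bits

print(conflicts(add, sub))      # the two ALU operations need different opcode bits
print(conflicts(add, load))     # disjoint control bits
print(merge_guards(add, load))  # combined instruction word
```

Under this assumption a scheduler never needs a separate resource model: checking the guards of two candidate RTs is enough to decide whether they fit in one instruction.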
Problem statement
[Figure: Algorithm spec -> FE -> CDFG -> Code Generation -> Machine code; Processor spec (instance) -> ISE -> GRTP -> Code Generation. Note on the slide: in ch 4 this is part of the code generator.]
Example: Simple processor [Leupers]
[Figure: datapath with PC, +1 incrementer, instruction memory IM, RAM, REG, input Inp and output outp, controlled by the instruction fields I.(20:0), I.(20:13), I.(12:5), I.(4), I.(3:2) and I.(1:0).]
ASIP/VLIW architectures
A|RT designer template as an example (= set of rules, a model)
Differences with VLIW processors of ch. [number missing]:
1. FUs
   - ASUs = complex application-specific FUs (beyond subword parallelism), e.g. biquad, median, DCT, etc.
   - larger grain size, more heterogeneous, more pipelines
2. Register files
   - many register files (>5 vs. 1 or 2)
   - limited number of ports (3 vs. 15)
   - limited size (<16 vs. 128)
3. Issue slots
   - all in parallel vs. 5
[Figure: architecture template with function units FU1 to FU4, register files RF1 to RF8 (two per FU), instruction registers IR1 to IR4, instruction memory, control, and flags.]
ASIP/VLIW architectures
Additional characteristics of the A|RT designer template:
- interconnect network: busses + input multiplexers
  - mux control is part of the instruction
  - control can change every clock cycle
  - network can be incomplete
  - busses can be merged
- memories are modeled as FUs
  - separate data in and data out
  - 2 inputs (data in and address) and 1 output
- each FU can generate one or more flags
- instruction format (per issue slot):
  read address RF 1 | write address RF 1 | read address RF 2 | write address RF 2 | mux 1 | mux 2 | control FU | output drivers
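The per-issue-slot instruction format can be sketched as a packed bit field. The field names follow the slide; the bit widths, the field order within the word, and the function names `pack` and `unpack` are assumptions made only for illustration.

```python
# Field names follow the slide; every width below is an assumed value.
FIELDS = [
    ("read_addr_rf1", 3),
    ("write_addr_rf1", 3),
    ("read_addr_rf2", 3),
    ("write_addr_rf2", 3),
    ("mux1", 1),
    ("mux2", 1),
    ("fu_control", 4),
    ("output_drivers", 2),
]
SLOT_WIDTH = sum(width for _, width in FIELDS)


def pack(values: dict) -> int:
    """Concatenate the fields, first field in the most significant bits."""
    word = 0
    for name, width in FIELDS:
        value = values.get(name, 0)
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name}={value} does not fit in {width} bits")
        word = (word << width) | value
    return word


def unpack(word: int) -> dict:
    """Inverse of pack: split an issue-slot word back into its fields."""
    fields = {}
    for name, width in reversed(FIELDS):
        fields[name] = word & ((1 << width) - 1)
        word >>= width
    return fields


slot = {"read_addr_rf1": 5, "write_addr_rf2": 2, "mux1": 1, "fu_control": 9}
word = pack(slot)
print(SLOT_WIDTH, format(word, f"0{SLOT_WIDTH}b"))
print(unpack(word))
```

Because the mux selects sit in the instruction word next to the register addresses, the routing through the interconnect can differ in every cycle, which is the point the slide makes about mux control.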
ASIP/VLIW architectures: example
[Figure: an ALU and a MAC with register files RF1 to RF4, connected through bus1 and bus2. Instruction fields shown: mux 2, read RF1, write RF1, read RF2, write RF2, ALU instr.; mux 3, read RF4, write RF4, read RF3, write RF3, MAC instr.]
ASIP/VLIW architectures: design flow
[Figure: Algorithm spec -> RTs -> datapath synthesis and controller synthesis -> estimations (area, power, timing) -> OK? If no, change pragmas and repeat.]
Example RT:
    RF1 : x = RF2 : y, RF3 : z | ALU = ADD  Inmux = bus2
Example pragmas:
    assign ( a+b, ALU, fu_alu1)
    assign ( a+_, ALU, fu_alu2)
    assign ( _+_, ALU, fu_alu3)
VLIW makes relatively simple code selection possible.
ASIP/VLIW architectures: list scheduling
[Figure: list scheduling example. A candidate list feeds a conflict and priority computation, which selects the scheduled operation; a schedule table assigns numbered operations to the resources IPB, OPB, ALU and MULT.]
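The candidate list / conflict and priority / scheduled operation loop can be sketched as a textbook list scheduler. This is not the A|RT designer algorithm; it assumes unit-latency operations, resource conflicts modeled only as a count of units per resource class, and priority equal to the longest path to a sink. The operation names and the function `list_schedule` are hypothetical.

```python
def list_schedule(ops, deps, units):
    """ops: name -> resource class; deps: name -> set of predecessors;
    units: resource class -> number of units. Returns one list per cycle."""
    succs = {o: [] for o in ops}
    for o, preds in deps.items():
        for p in preds:
            succs[p].append(o)

    prio = {}

    def height(o):
        # Priority = length of the longest dependence path from o to a sink.
        if o not in prio:
            prio[o] = 1 + max((height(s) for s in succs[o]), default=0)
        return prio[o]

    for o in ops:
        height(o)

    issued_at, schedule, cycle = {}, [], 0
    while len(issued_at) < len(ops):
        # Candidate list: unscheduled ops whose predecessors ran in earlier cycles.
        candidates = [
            o for o in ops
            if o not in issued_at
            and all(issued_at.get(p, cycle) < cycle for p in deps.get(o, ()))
        ]
        candidates.sort(key=lambda o: (-prio[o], o))
        free, issued = dict(units), []
        for o in candidates:  # conflict check: is a unit of this class free?
            if free.get(ops[o], 0) > 0:
                free[ops[o]] -= 1
                issued.append(o)
        if not issued:
            raise ValueError("no progress: missing resource or cyclic dependence")
        for o in issued:
            issued_at[o] = cycle
        schedule.append(issued)
        cycle += 1
    return schedule


ops = {"m1": "MULT", "m2": "MULT", "a0": "ALU", "a1": "ALU", "m3": "MULT", "a2": "ALU"}
deps = {"a1": {"m1", "m2"}, "m3": {"a1"}, "a2": {"m3"}}
one_mult = list_schedule(ops, deps, {"ALU": 1, "MULT": 1})
two_mult = list_schedule(ops, deps, {"ALU": 1, "MULT": 2})
print(one_mult)
print(two_mult)
```

With one multiplier the two independent multiplications serialize; adding a second multiplier removes one cycle, which is the kind of trade-off the exploration phase is meant to expose.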
ASIP/VLIW architectures: feedback
Low power aspects
[Figure: Mistral2 flow. Implementation independent design database + architecture -> estimation -> estimation database (area, speed, power).]
GSM Viterbi decoder: default solution

EXU        ACTIV  AREA / POWER
alu_1      96%
romctrl_1  48%    39259 (digits fused in source)
acu_1      26%
ipb_1      5%
opb_1      23%
ctrl
total

- controller responsible for 70% of power consumption
  - maximum resource-sharing
  - heavy decision-making: "main" loop with 16 metrics-computations per iteration
- EXU numbers include registers for local storage
GSM Viterbi decoder: no loop-folding

EXU        ACTIV  AREA / POWER
alu_1      92%
romctrl_1  45%    39255 (digits fused in source)
acu_1      25%
ipb_1      5%     10786 (digits fused in source)
opb_1      22%
ctrl
total

- area down by 33%
- power down by 35%
- next step: reduce number of program steps with a second ALU
GSM Viterbi decoder: 2 ALUs

EXU        ACTIV  AREA / POWER
alu_1      69%
alu_2      65%
romctrl_1  67%    39255 (digits fused in source)
acu_1      37%
ipb_1      8%
opb_1      33%
ctrl
total

Unlabeled value on the slide: 9739

- cycle count down 30%
- area up 42%
- power down by 5%
- next step: introduce ASU to reduce ALU load
GSM Viterbi decoder: 1 x ACS-ASU

EXU        ACTIV  AREA / POWER
alu_1      20%
acs_asu_1  83%
or_asu_1   10%
romctrl_1  16%    6521 (digits fused in source)
acu_1      36%
ipb_1      20%    10743 (digits fused in source)
opb_1      11%    16335 (digits fused in source)
ctrl
total

    func ACS ( M1, M2, d ) MS, MS8 =
    begin
      MS  = if ( M1+d > M2-d ) -> ( M1+d ) || ( M2-d ) fi;
      MS8 = if ( M1-d > M2+d ) -> ( M1-d ) || ( M2+d ) fi;
    end;

Unlabeled value on the slide: = 1930

- cycle count down 5X
- power down 20X!
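For readers unfamiliar with the notation of the `ACS` function on this slide, the add-compare-select step can be transcribed into Python as a minimal sketch. It assumes the `if ... -> x || y fi` construct selects `x` when the condition holds and `y` otherwise; the Python function name and the sample values are illustrative only.

```python
def acs(m1, m2, d):
    """Add-compare-select: returns (MS, MS8) as on the slide."""
    # MS: the larger of the two candidate path metrics M1+d and M2-d.
    ms = m1 + d if (m1 + d) > (m2 - d) else m2 - d
    # MS8: the same selection with the sign of the branch metric d swapped.
    ms8 = m1 - d if (m1 - d) > (m2 + d) else m2 + d
    return ms, ms8


print(acs(10, 7, 2))
print(acs(3, 10, 1))
```

On a plain ALU each call costs four additions/subtractions, two comparisons and two selections as separate operations; an ASU that performs all of them as a single operation is what takes this load off the ALU.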
GSM Viterbi decoder: 4 x ACS-ASU

EXU          ACTIV  AREA / POWER
alu_1        94%    24397 (digits fused in source)
acs_asu_1    95%
acs_asu_2    95%
acs_asu_3    95%
acs_asu_4    95%
split_asu_1  47%    9018 (digits fused in source)
or_asu_1     47%
romctrl_1    28%    486 (digits fused in source)
acu_1        98%    21285 (digits fused in source)
ipb_1        23%    606 (digits fused in source)
opb_1        50%    36980 (digits fused in source)
ctrl
total

Unlabeled value on the slide: 425

- cycle count down another 5X
- area up 23%
- power down another 3X!
GSM Viterbi example: summary
[Figure: Mistral2 exploration starting from the implementation independent design database; the overall result is annotated "72x !".]
Discussion: phase 3
[Figure: two loops. Exploration phase: application(s) and processor model -> SW (code generation) and HW design -> "OK?" and "more appl.?" decisions, then freeze the processor model. Application software development (constraint driven compilation): application(s) -> SW (code generation) -> "OK?", repeated until yes.]
Discussion: problems with VLIWs
Code size and instruction bandwidth:
- code compaction = reduce code size after scheduling
  - possible compaction ratio? E.g. for p0 = 0.9 and p1 = 0.1, the information content (entropy) is $H = -\sum_i p_i \log_2 p_i = 0.47$, so the maximum compression factor is about 2
- control parallelism during scheduling = switch between different processor models (10% of code = 90% runtime)
- architecture: reduce number of control bits for operand addresses
  - e.g. 128 reg (TM) -> 28 bits/issue slot for addresses only
  - => use stacks and fifos
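The two numbers on this slide can be checked with a few lines of arithmetic. The sketch assumes the entropy is measured per instruction bit (so an uncompressed bit costs 1 bit), and that the 28 address bits come from four operand addresses per issue slot; the slide gives only the totals, so the factor of four is an assumption.

```python
import math


def entropy_bits(probs):
    """Shannon entropy in bits: H = -sum(p_i * log2(p_i))."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


h = entropy_bits([0.9, 0.1])
max_factor = 1 / h  # each stored bit carries only h bits of information
print(round(h, 2), round(max_factor, 2))

# Operand address cost: 128 registers need 7 bits per address.
bits_per_address = math.ceil(math.log2(128))
addresses_per_slot = 4  # assumed; the slide only states the 28-bit total
address_bits = addresses_per_slot * bits_per_address
print(address_bits)
```

The entropy bound shows that even an ideal coder gains only about a factor of two here, which is why the slide also looks at the architecture itself (fewer address bits, stacks and fifos) rather than compression alone.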
[Figure: architecture with function units FU1 to FU4, register files RF1 to RF4, instruction registers IR1 to IR4, instruction memory, control, and flags.]
Conclusions
- ASIPs provide efficient solutions for well-defined application domains (2 orders of magnitude higher efficiency).
- The methodology is interesting for IP creation.
- The key problem is retargetable compilation.
- A (distributed) VLIW model is a good compromise between HW and SW.
- Although an automatic process can generate a default solution, the process is usually interactive and iterative for efficiency reasons. The key is fast and accurate feedback.