Processor Architectures and Program Mapping Application domain specific processors (ADSP or ASIP) 5kk10 TU/e Henk Corporaal Jef van Meerbergen Bart Mesman.

7/12/2015 Processor Architectures and Program Mapping H. Corporaal, J. van Meerbergen, and B. Mesman

Application domain specific processors (ADSP or ASIP)
[Figure: flexibility versus efficiency, ranging over programmable CPU, programmable DSP, application domain specific processor and application specific processor.]

An application domain specific processor:
- takes a well-defined application domain as a starting point
- exploits characteristics of the domain (computation kernels)
- is still programmable within the domain
e.g. MPEG2 coding uses the 8*8 DCT transform, DECT, GSM, etc.
[Figure: an application domain mapped onto a GP implementation and onto an ADSP implementation.]
- performance: clock speed + ILP; ILP + tuning to domain
- flexible dev. (new apps.); cost effective (high volume)
Problems:
- specification
- manual design: large design time and effort => synthesized cores

www.adelantetech.com

Outline
- design process
- retargetable code generation (problem statement)
- ADSP/VLIW architectures (Mistral 2 / A|RT designer)
- instructive demo (Adelante)
- application examples
- low power aspects (Mistral 2 / A|RT designer)
- discussion
- conclusion

Design process
3 phases:
1. exploration
2. hw design (layout) + processing
3. design appl. sw
[Flowchart: application(s) and a processor model (parameters; instance, e.g. VLIW with shared RFs) feed SW (code generation) and HW design. SW (code generation) gives estimations of cycles/alg and occupation; HW design gives estimations of nsec/cycle, area and power/instr. Decisions: OK? (yes/no), more appl.? (yes/no), then go to phase 2.]
Fast, accurate and early feedback

Problem statement
- A compiler is retargetable if it can generate code for a 'new' processor architecture specified in a machine description file.
- A guarded register transfer pattern (GRTP) is a register transfer pattern (RTP) together with the control bits of the instruction word that control the RTP.
  a := b + c | instr = xxxx0101
- GRTPs contain all inter-RT-conflict information.
- Instruction set extraction (ISE) is the process of generating all possible GRTPs for a specific processor.
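To make the GRTP definition concrete, here is a minimal sketch in Python. It assumes that the guard is a string over '0', '1' and 'x' (don't care), as in the slide's example `xxxx0101`, and that two RTs conflict when their guards demand different values for the same instruction bit. The class name `GRTP`, its methods, and the second and third patterns are hypothetical and only serve as illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GRTP:
    """A register transfer pattern plus the instruction bits that select it."""
    transfer: str  # e.g. "a := b + c"
    guard: str     # one character per instruction bit: '0', '1' or 'x'

    def matches(self, instr: str) -> bool:
        """True if the instruction word activates this register transfer."""
        return len(instr) == len(self.guard) and all(
            g == "x" or g == bit for g, bit in zip(self.guard, instr)
        )

    def conflicts_with(self, other: "GRTP") -> bool:
        """True if both guards fix the same bit to different values."""
        return any(
            a != "x" and b != "x" and a != b
            for a, b in zip(self.guard, other.guard)
        )


add = GRTP("a := b + c", "xxxx0101")   # pattern from the slide
load = GRTP("d := RAM[e]", "01xxxxxx")  # hypothetical pattern
sub = GRTP("a := b - c", "xxxx0110")   # hypothetical pattern

print(add.matches("11110101"))     # the word selects the addition
print(add.conflicts_with(load))    # disjoint control bits: can share a word
print(add.conflicts_with(sub))     # same control field, different values
```

Under these assumptions a code generator can decide whether two RTs fit into one instruction word by looking at the guards alone, which is one way to read the statement that GRTPs contain all inter-RT-conflict information.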

Problem statement
[Flow: Algorithm spec -> FE -> CDFG -> Code Generation -> Machine code; Processor spec (instance) -> ISE -> GRTP -> Code Generation.]
In ch. 4 this is part of the code generator.

Example: Simple processor [Leupers]
[Figure: processor with PC, +1 incrementer, instruction memory IM, RAM, REG, input Inp and output outp. The instruction word I.(20:0) is split into the fields I.(20:13), I.(12:5), I.(4), I.(3:2) and I.(1:0).]

ASIP/VLIW architectures
A|RT designer template as an example (= set of rules, a model)
Differences with VLIW processors of ch. 4:
1. Parallel FUs
   - ASUs = complex application specific FUs (beyond subword parallelism), e.g. biquad, median, DCT, etc.
   - larger grain size, more heterogeneous, more pipelines
2. Rfiles
   - many Rfiles (>5 vs. 1 or 2)
   - limited number of ports (3 vs. 15)
   - limited size (<16 vs. 128)
3. Issue slots
   - all in parallel vs. 5

[Figure: datapath with four FUs (FU1 to FU4), eight register files (RF1 to RF8), instruction registers IR1 to IR4 fed from the instruction memory, and control flags.]

ASIP/VLIW architectures
Additional characteristics of the A|RT designer template:
- interconnect network: busses + input multiplexers
  - mux control is part of the instruction
  - control can change every clock cycle
  - network can be incomplete
  - busses can be merged
- memories are modeled as FUs
  - separate data in and data out
  - 2 inputs (data in and address) and 1 output
- each FU can generate one or more flags
- instruction format (per issue slot): read address RF 1, write address RF 1, read address RF 2, write address RF 2, mux 1, mux 2, control FU, output drivers

ASIP/VLIW architectures: example
[Figure: an ALU and a MAC connected through bus1 and bus2, with register files RF1 to RF4. The instruction word has two parts, marked 0 to 9 and 10 to 19: the ALU part (mux 2, read RF1, write RF1, read RF2, write RF2, ALU instr.) and the MAC part (mux 3, read RF4, write RF4, read RF3, write RF3, MAC instr.).]

ASIP/VLIW architectures: design flow
[Flowchart: Algorithm spec -> RTs -> datapath synthesis -> controller synthesis -> estimations (area, power, timing) -> OK? If no, change pragmas and repeat.]
Example RT:
  RF1 : x = RF2 : y, RF3 : z | ALU = ADD Inmux = bus2
Example pragmas:
  assign ( a+b, ALU, fu_alu1)
  assign ( a+_, ALU, fu_alu2)
  assign ( _+_, ALU, fu_alu3)
VLIW makes relatively simple code selection possible

ASIP/VLIW architectures: list scheduling
[Figure: list scheduling of a data flow graph with ten operations (* 1, + 2, * 3, * 4, * 5, + 6, + 7, * 8, * 9, + 10) onto the resources IPB, OPB, ALU and MULT. For each of the cycles 0 to 5 the figure shows the candidate list, the conflict and priority computation, and the scheduled operations.]
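The list scheduling loop can be sketched in a few lines of Python. This is a generic textbook version, not the A|RT designer implementation: it assumes unit latency, one priority function (longest path to a sink), and a small hypothetical graph, because the edges of the slide's data flow graph cannot be recovered from the transcript. The names `list_schedule`, `ops`, `deps` and `units` are illustrative.

```python
from functools import lru_cache


def list_schedule(ops, deps, units):
    """Schedule unit-latency operations on a limited set of functional units.

    ops:   operation name -> resource kind, e.g. "ALU" or "MULT"
    deps:  operation name -> set of predecessor names
    units: resource kind -> number of units available per cycle
    """
    succs = {n: [m for m in ops if n in deps.get(m, set())] for n in ops}

    @lru_cache(maxsize=None)
    def priority(n):
        # Longest path from n to a sink, counted in operations.
        return 1 + max((priority(s) for s in succs[n]), default=0)

    done, schedule, cycle = set(), {}, 0
    while len(done) < len(ops):
        # Candidate list: unscheduled operations whose predecessors are done.
        candidates = sorted(
            (n for n in ops if n not in done and deps.get(n, set()) <= done),
            key=lambda n: (-priority(n), n),
        )
        free = dict(units)
        issued = []
        for n in candidates:  # highest priority first
            if free.get(ops[n], 0) > 0:  # resource conflict check
                free[ops[n]] -= 1
                issued.append(n)
        if not issued:
            raise ValueError("no progress: cyclic graph or missing resource")
        for n in issued:
            schedule[n] = cycle
        done.update(issued)
        cycle += 1
    return schedule


# Hypothetical graph: two multiplications feed an addition, and so on.
ops = {"m1": "MULT", "m2": "MULT", "a3": "ALU",
       "m4": "MULT", "a5": "ALU", "a6": "ALU"}
deps = {"a3": {"m1", "m2"}, "m4": {"a3"}, "a5": {"m4"}}
print(list_schedule(ops, deps, {"ALU": 1, "MULT": 1}))
```

With one ALU and one MULT, the two multiplications compete for the same unit in cycle 0, so one of them is deferred while the independent addition `a6` fills the idle ALU.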

ASIP/VLIW architectures: feedback

Application examples: adaptive filter
[Figure: filter with coefficients c0, c1, ..., c63 and a control unit; signals x, y, e and r.]
Minimizes the difference between x and e (reference signal)
Many applications are possible:
- echo cancelling for TV: e = flyback signal (known without echoes)
- automatic equalization of cables in data transmission
- acoustic echo cancelling

Application examples: adaptive filter
[Figure: the adaptive filter (coefficients c0 to c63, control unit, signals x, y, e, r) between a speaker and a microphone; labels: noise, speech + noise, speech.]

Application examples: adaptive filter
[Figure: the adaptive filter (coefficients c0 to c63, control unit, signals x, y, e, r) in a hearing aid; labels: noise (e.g. radio), speech + noise, speech.]

Application examples: adaptive filter
[Figure: signal flow graph of the 64-tap filter: a delay line (Z^-1) with x[n], x[n-1], ..., x[n-i], ..., x[n-63]; coefficients c0, c1, ..., ci, ..., c63; blocks A0, A1, ..., Ai, ..., An; partial sums S0[n], S1[n], ..., Si[n], ..., S63[n]; signals r[n], e[n], ê[n], mu and t[n].]

Application examples: adaptive filter
[Figure: detail of block Ai: a multiplier, an adder and a delay (Z^-1) relating Ci[n], Ci[n-1], x[n-i] and t[n].]

Application examples: adaptive filter
[Figure: data flow graph of one filter tap: reads (r) and a write (w), two multiplications and two additions on x@i, t, c[i]@1, sum[i] and sum[i+1].]
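For readers who want to experiment with the kernel, the following is a sketch of a textbook LMS adaptive filter in Python. It is an assumption that the slides' filter follows this scheme (the figures only show the signal names); the variable `t` mirrors the slide's t[n] and is taken here to be the step size mu times the error. The functions `lms_step` and `identify` and all parameter values are hypothetical.

```python
import random


def lms_step(coeffs, history, reference, mu):
    """One LMS iteration over the most recent input samples.

    history[i] holds x[n-i]; reference is the desired output for this step.
    Returns the updated coefficients, the filter output and the error.
    """
    y = sum(c * x for c, x in zip(coeffs, history))
    err = reference - y
    t = mu * err  # common factor shared by all coefficient updates
    new_coeffs = [c + t * x for c, x in zip(coeffs, history)]
    return new_coeffs, y, err


def identify(unknown, steps=2000, mu=0.05, seed=0):
    """Adapt a filter until it mimics the 'unknown' FIR system."""
    rng = random.Random(seed)
    coeffs = [0.0] * len(unknown)
    history = [0.0] * len(unknown)
    for _ in range(steps):
        history = [rng.uniform(-1.0, 1.0)] + history[:-1]
        reference = sum(h * x for h, x in zip(unknown, history))
        coeffs, _, _ = lms_step(coeffs, history, reference, mu)
    return coeffs


print(identify([0.5, -0.25]))
```

Per tap, each step costs one multiply-accumulate for the output and one multiply-add for the update, which is the kind of kernel that is mapped onto ALU, MULT, RAM and ACU resources in the implementations.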

Application examples: adaptive filter implementation 1
[Figure: RAM, ALU, ROM, MULT and ACU connected through bus1 and bus2.]
266 clock cycles, 1.1 mm^2

Application examples: adaptive filter implementation 2
[Figure: RAM, ALU, ROM and ACU connected through bus1 and bus2.]
2250 clock cycles, 0.7 mm^2

Application examples: adaptive filter implementation 3
[Figure: RAM1, ACU1, ALU, MULT, RAM2, ROM and ACU2.]
202 clock cycles, 1.4 mm^2

[Figure: plot of clock cycles (axis marks 1000, 2000) against area in mm^2 (axis marks 1, 2).]
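The three adaptive filter implementations form a small area/time trade-off that can be explored programmatically. The sketch below uses the cycle counts and areas quoted on the implementation slides; the helper names `pareto_front` and `smallest_within` and the cycle budgets are hypothetical.

```python
from typing import NamedTuple, Optional


class Design(NamedTuple):
    name: str
    cycles: int
    area_mm2: float


# Figures quoted on the three implementation slides.
DESIGNS = [
    Design("implementation 1", 266, 1.1),
    Design("implementation 2", 2250, 0.7),
    Design("implementation 3", 202, 1.4),
]


def pareto_front(designs):
    """Keep designs that no other design beats on both cycles and area."""
    def dominated(d):
        return any(
            o.cycles <= d.cycles and o.area_mm2 <= d.area_mm2
            and (o.cycles < d.cycles or o.area_mm2 < d.area_mm2)
            for o in designs
        )
    return [d for d in designs if not dominated(d)]


def smallest_within(designs, cycle_budget) -> Optional[Design]:
    """Smallest-area design that meets the cycle budget, or None."""
    feasible = [d for d in designs if d.cycles <= cycle_budget]
    return min(feasible, key=lambda d: d.area_mm2) if feasible else None


print([d.name for d in pareto_front(DESIGNS)])
print(smallest_within(DESIGNS, 300))
```

None of the three designs dominates another, so the choice depends on the cycle budget of the application: a loose budget favours the multiplier-less implementation 2, a tight one the two-memory implementation 3.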

Low power aspects
[Figure: Mistral2 with an implementation independent design database, an architecture and an estimation database; estimation of area, speed and power.]

GSM viterbi decoder: default solution
Cycle count: 13750

EXU        ACTIV    AREA    POWER
alu_1      96%      3469    46196
romctrl_1  48%        39      259
acu_1      26%       327     1209
ipb_1       5%       131      105
opb_1      23%      1804     5801
ctrl                9821   135035
total              15591   188605

Controller responsible for 70% of power consumption:
- maximum resource-sharing
- heavy decision-making: "main" loop with 16 metrics-computations per iteration
EXU-numbers include registers for local storage

GSM viterbi decoder: no loop-folding
Cycle count: 14247

EXU        ACTIV    AREA    POWER
alu_1      92%      3411    45073
romctrl_1  45%        39      255
acu_1      25%       294     1087
ipb_1       5%       107       86
opb_1      22%      1661     5340
ctrl                4919    70087
total              10431   121928

- area down by 33%
- power down by 35%
- next step: reduce number of program-steps with second ALU

GSM viterbi decoder: 2 ALUs
Cycle count: 9739

EXU        ACTIV    AREA    POWER
alu_1      69%      1797    12248
alu_2      65%      1393     8916
romctrl_1  67%        39      255
acu_1      37%       294     1087
ipb_1       8%       149      119
opb_1      33%      2136     6871
ctrl                8957    87235
total              14766   116731

- cycle count down 30%
- area up 42%
- power down by 5%
- next step: introduce ASU to reduce ALU-load

GSM viterbi decoder: 1 x ACS-ASU
Cycle count: 1930

EXU        ACTIV    AREA    POWER
alu_1      20%       261      105
acs_asu_1  83%      2382     3816
or_asu_1   10%       611      122
romctrl_1  16%        65       21
acu_1      36%       294      205
ipb_1      20%       107       43
opb_1      11%       163       35
ctrl                1864     3597
total               5747     7944

func ACS ( M1, M2, d ) MS, MS8 =
begin
  MS  = if ( M1+d > M2-d ) -> ( M1+d ) || ( M2-d ) fi;
  MS8 = if ( M1-d > M2+d ) -> ( M1-d ) || ( M2+d ) fi;
end;

- cycle count down 5X
- power down 20X !
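The ACS (add-compare-select) function that is moved into the ASU can be mirrored in Python as follows. The sketch assumes that the guarded expression `if (c) -> a || b fi` selects `a` when the condition holds and `b` otherwise; the Python function name `acs` and the sample values are hypothetical.

```python
def acs(m1, m2, d):
    """Add-compare-select on two path metrics m1, m2 and a branch metric d.

    Returns (ms, ms8), the two surviving metrics of the slide's ACS function.
    """
    # MS: compare M1+d against M2-d and keep the larger one.
    ms = m1 + d if m1 + d > m2 - d else m2 - d
    # MS8: the same with the sign of d swapped.
    ms8 = m1 - d if m1 - d > m2 + d else m2 + d
    return ms, ms8


print(acs(3, 5, 2))
```

On an ALU-only datapath this takes four additions or subtractions, two comparisons and two selections per call, which makes the function a natural candidate for a single-cycle application specific unit.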

GSM viterbi decoder: 4 x ACS-ASU
Cycle count: 425

EXU          ACTIV    AREA    POWER
alu_1        94%       243       97
acs_asu_1    95%      1041      420
acs_asu_2    95%      1041      420
acs_asu_3    95%      1041      420
acs_asu_4    95%      1041      420
split_asu_1  47%        90       18
or_asu_1     47%       592      118
romctrl_1    28%        48        6
acu_1        98%       212       85
ipb_1        23%        60        6
opb_1        50%       369       80
ctrl                  1306      555
total                 7084     2645

- cycle count down another 5X
- area up 23%
- power down another 3X !

GSM viterbi example: summary
[Figure: Mistral2 with the implementation independent design database.]
72x !
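The progression through the five design points can be tabulated with a short Python sketch. The numbers are the cycle counts and the `total` rows of the GSM Viterbi slides, read under the assumption that the fused digits split into an AREA and a POWER column (the per-unit rows add up to these totals). The tuple layout and the helper `improvement` are hypothetical.

```python
# (design point, cycle count, total area, total power) from the slides.
DESIGN_POINTS = [
    ("default solution", 13750, 15591, 188605),
    ("no loop-folding", 14247, 10431, 121928),
    ("2 ALUs", 9739, 14766, 116731),
    ("1 x ACS-ASU", 1930, 5747, 7944),
    ("4 x ACS-ASU", 425, 7084, 2645),
]


def improvement(first, last):
    """Ratios first/last for cycles, area and power (above 1 means better)."""
    return {
        "cycles": first[1] / last[1],
        "area": first[2] / last[2],
        "power": first[3] / last[3],
    }


overall = improvement(DESIGN_POINTS[0], DESIGN_POINTS[-1])
for key, ratio in overall.items():
    print(f"{key}: {ratio:.1f}x")
```

From the first to the last design point the total power figure drops by a factor of roughly 71, in line with the order of magnitude of the 72x quoted on the summary slide, while the cycle count drops by a factor of about 32.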

Discussion: phase 3
Application software development: constraint driven compilation
[Flowchart: in the exploration phase, application(s) and a processor model go through SW (code generation) and HW design, with the decisions OK? and more appl.?; then the processor model is frozen. Afterwards, new application(s) only go through SW (code generation) until OK.]

Discussion: problems with VLIWs
Code size and instruction bandwidth:
- code compaction = reduce code size after scheduling
  - possible compaction ratio? e.g. p0 = 0.9 and p1 = 0.1
  - information content (entropy) = $-\sum_i p_i \log_2 p_i = 0.47$
  - maximum compression factor ≈ 2
- control parallelism during scheduling = switch between different processor models (10% of code = 90% runtime)
- architecture: reduce number of control bits for operand addresses
  - e.g. 128 reg (TM) -> 28 bits/issue slot for addresses only
  - => use stacks and fifos
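The two numbers on this slide can be checked with a few lines of Python. The entropy calculation follows the slide directly; the split of 28 address bits into four 7-bit register addresses per issue slot is an assumption made here to match the quoted figure, and the function names are hypothetical.

```python
import math


def entropy_bits(probs):
    """Shannon entropy in bits per symbol of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def address_bits(num_registers):
    """Bits needed to address one register in a file of the given size."""
    return math.ceil(math.log2(num_registers))


h = entropy_bits([0.9, 0.1])  # instruction bits with p0 = 0.9, p1 = 0.1
max_compression = 1 / h       # one stored bit carries h bits of information

# Assumption: four register address fields per issue slot.
bits_per_slot = 4 * address_bits(128)

print(f"entropy = {h:.2f} bits, max compression = {max_compression:.2f}x")
print(f"address bits per issue slot = {bits_per_slot}")
```

An entropy of 0.47 bits per stored bit means that lossless compaction cannot shrink such code by much more than a factor of 2, which is why the slide also looks at architectural measures such as stacks and fifos that remove address bits altogether.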

[Figure: VLIW datapath with FU1 to FU4, register files RF1 to RF4, instruction registers IR1 to IR4 fed from the instruction memory, and control flags.]

Discussion: clustered VLIW architectures
[Figure: FU1 to FU4 with register files RF1 to RF4.]

Conclusions
- ASIPs provide efficient solutions for well-defined application domains (2 orders of magnitude higher efficiency).
- The methodology is interesting for IP creation.
- The key problem is retargetable compilation.
- A (distributed) VLIW model is a good compromise between HW and SW.
- Although an automatic process can generate a default solution, the process usually is interactive and iterative for efficiency reasons. The key is fast and accurate feedback.

Imagine assignment
For the coming 3 weeks:
- Install the tools (VisualC package will be sent by mail)
- Read the beginners' guide
- Experiment with the compiler on a few examples
http://www.ics.ele.tue.nl/~hfatemi/5kk10/
Further information on Imagine:
- www.cva.stanford.edu/projects/imagine/
