Embedded Computer Architecture

Presentation on theme: "Embedded Computer Architecture"— Presentation transcript:

1 Embedded Computer Architecture
ASIP: Application Specific Instruction-set Processor
5KK73
Bart Mesman and Henk Corporaal

2 Application domain specific processors (ADSP or ASIP)
Figure: spectrum of processor types, ordered from most flexible to most efficient: programmable CPU, programmable DSP, application domain specific processor, application specific processor.
4/27/2017 Embedded Computer Architecture H.Corporaal and B. Mesman

3 Application domain specific processors (ADSP or ASIP)
takes a well defined application domain as a starting point
exploits characteristics of the domain (computation kernels)
still programmable within the domain
e.g. MPEG2 coding uses 8*8 DCT transform, DECT, GSM etc ...
Figure: a general-purpose (GP) processor and an ADSP each map an application domain onto an implementation.
                GP                          ADSP
performance:    clock speed + ILP           ILP, DLP, tuning to domain
                flexible dev. (new apps.)   cost effective (high volume)
problems: specification; manual design, design time and effort
large effort => synthesized cores

5 Design process
3 phases: 1. exploration; 2. hw design (layout) + processing; 3. design appl. sw
Figure (exploration flow): a processor model (e.g. VLIW with shared RFs) and the application(s) determine the instance parameters, which drive SW (code generation) and HW design.
HW design estimations: nsec/cycle, area, power/instr
SW estimations: cycles/alg, occupation
Fast, accurate and early feedback
OK? no: iterate; yes: more appl.? yes: iterate; no: go to phase 2

6 ASIP/VLIW architectures: list scheduling
Figure: list-scheduling example. Operations numbered 1 to 10 (multiplications * and additions +) are scheduled step by step onto the units IPB, MULT, ALU and OPB. For each step the figure shows the candidate list, the priority computation, the conflict check and the operations that are scheduled.
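The list-scheduling loop named on this slide can be sketched as follows. This is a minimal illustration, not the scheduler used in the course tools: it assumes single-cycle operations, uses the longest path to a sink as priority, and the operation names, dependence graph and unit counts in the usage note are hypothetical.

```python
from collections import defaultdict

def list_schedule(ops, deps, units):
    """Greedy list scheduling (sketch).

    ops:   {op_name: unit_type}            e.g. {"m1": "MULT"}
    deps:  {op_name: [predecessor names]}  data dependences (must be acyclic)
    units: {unit_type: number of units}    every used type needs at least 1
    Assumes every operation takes exactly one cycle.
    Returns {op_name: cycle}.
    """
    # Priority = length of the longest path from the operation to a sink.
    succs = defaultdict(list)
    for op, preds in deps.items():
        for p in preds:
            succs[p].append(op)
    prio = {}

    def height(op):
        if op not in prio:
            prio[op] = 1 + max((height(s) for s in succs[op]), default=0)
        return prio[op]

    for op in ops:
        height(op)

    schedule = {}
    cycle = 0
    while len(schedule) < len(ops):
        # Candidate list: unscheduled ops whose predecessors finished earlier.
        candidates = [
            op for op in ops
            if op not in schedule
            and all(p in schedule and schedule[p] < cycle
                    for p in deps.get(op, []))
        ]
        # Highest priority first; names break ties deterministically.
        candidates.sort(key=lambda op: (-prio[op], op))
        free = dict(units)
        for op in candidates:
            # Conflict check: is a unit of the required type still free?
            if free.get(ops[op], 0) > 0:
                free[ops[op]] -= 1
                schedule[op] = cycle
        cycle += 1
    return schedule
```

For example, with `ops = {"m1": "MULT", "m2": "MULT", "a3": "ALU", "m4": "MULT", "a5": "ALU"}`, `deps = {"a3": ["m1", "m2"], "a5": ["a3", "m4"]}` and one MULT plus one ALU, the three multiplications compete for the single multiplier and the two on the critical path are placed first.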

7 Application examples (1)
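Figure: FIR filter data-flow graph. Delay elements (Z-1) hold the samples x1 to x4 behind the input x0; each sample xk is multiplied (*) by its coefficient ck (c0 to c4) and the products are added (+) to give the output y.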

8 Application examples (1)
19 instructions per tap!!
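For reference, the filter in this application example can be written as a short software model. This is a sketch assuming the structure is a 5-tap FIR filter, y = c0*x0 + ... + c4*x4; the function names are illustrative. Per tap the work is one multiply-accumulate plus one delay-line move, which is the part a dedicated datapath handles directly.

```python
def fir_step(delay_line, coeffs, x0):
    """Shift the new sample x0 into the delay line and compute one output.

    delay_line holds the previous samples [x0, x1, ..., x(n-1)];
    the oldest sample is dropped. Returns (y, new_delay_line).
    """
    samples = [x0] + delay_line[:-1]
    y = sum(c * x for c, x in zip(coeffs, samples))
    return y, samples

def fir(coeffs, inputs):
    """Run the filter over a sequence, starting from an all-zero delay line."""
    state = [0] * len(coeffs)
    outputs = []
    for x in inputs:
        y, state = fir_step(state, coeffs, x)
        outputs.append(y)
    return outputs
```

Feeding a unit impulse through `fir` returns the coefficients themselves, which is a quick sanity check for any implementation of the same structure.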

9 Application examples (2)
Bit level operations: finite field arithmetic
10 instructions!!
Very simple in hardware

10 Application examples (2)
Bit level operations: DES example
srl  $13, $2, 20
andi $25, $13, 1     # source bit 20 -> bit 0
srl  $14, $2, 21
andi $24, $14, 6     # source bits 22, 23 -> bits 1, 2
or   $15, $25, $24
srl  $13, $2, 22
andi $14, $13, 56    # source bits 25, 26, 27 -> bits 3, 4, 5
or   $25, $15, $14
sll  $24, $25, 2     # move the six gathered bits to bits 2..7
Figure: bits 20, 22, 23, 25, 26, 27 of the source register ($2) are gathered into bits 2, 3, 4, 5, 6, 7 of the destination register ($24).
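The shift-and-mask sequence above can be checked against a direct bit-by-bit description of the same permutation. The sketch below mirrors the nine instructions in Python; the function names are illustrative and not part of the slides.

```python
def des_gather_shift_mask(src):
    """Same steps as the MIPS sequence: three shift/mask pairs, then a shift."""
    t = (src >> 20) & 1        # source bit 20          -> bit 0
    t |= (src >> 21) & 6       # source bits 22, 23     -> bits 1, 2
    t |= (src >> 22) & 56      # source bits 25, 26, 27 -> bits 3, 4, 5
    return t << 2              # final position: bits 2..7

def des_gather_reference(src):
    """Bit-by-bit version of the mapping drawn in the figure."""
    src_bits = [20, 22, 23, 25, 26, 27]
    dst_bits = [2, 3, 4, 5, 6, 7]
    out = 0
    for s, d in zip(src_bits, dst_bits):
        out |= ((src >> s) & 1) << d
    return out
```

In hardware the reference version is only wiring, which is why such bit-level kernels are attractive candidates for an application specific unit.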

11 Application examples (2)
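Bit level operations: A5 example (GSM encryption)
srl  $24, $5, 18
srl  $25, $5, 17
xor  $8, $24, $25
srl  $9, $5, 16
xor  $10, $8, $9
srl  $11, $5, 13
xor  $12, $10, $11
andi $13, $12, 1     # keep only the XOR of bits 18, 17, 16, 13
Figure: bits 18, 17, 16 and 13 of register $5 are XORed into bit 0 of register $13.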
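The feedback computation above reduces to one XOR over four tap positions. The sketch below shows it in Python. The `clock_lfsr` helper is an assumption added for illustration: it treats the register as a 19-bit shift register that shifts left and inserts the feedback bit at position 0, which the slide itself does not spell out.

```python
def a5_feedback_bit(r):
    """XOR of bits 18, 17, 16 and 13 of r, as in the instruction sequence."""
    return ((r >> 18) ^ (r >> 17) ^ (r >> 16) ^ (r >> 13)) & 1

def clock_lfsr(r, width=19):
    """One shift step (assumed: shift left, feedback enters at bit 0)."""
    fb = a5_feedback_bit(r)
    return ((r << 1) | fb) & ((1 << width) - 1)
```

Eight instructions on a general-purpose core collapse into a 4-input XOR gate in hardware.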

12 ASIP/VLIW architectures: feedback

13 Low power aspects
Figure: Mistral2 flow with its Implementation Independent Design Database, Architecture and Estimation Database; the estimation step reports area + speed and power.

14 GSM viterbi decoder : default solution
EXU         ACTIV   AREA   POWER
alu_1       96%
romctrl_1   48%
acu_1       26%
ipb_1       5%
opb_1       23%
ctrl
total: 13750
controller responsible for 70% of power consumption
maximum resource-sharing
heavy decision-making: "main" loop with 16 metrics-computations per iteration
EXU-numbers include registers for local storage

15 GSM viterbi decoder : no loop-folding
EXU         ACTIV   AREA   POWER
alu_1       92%
romctrl_1   45%
acu_1       25%
ipb_1       5%
opb_1       22%
ctrl
total: 14247
area down by 33%
power down by 35%
next step: reduce # of program-steps with second ALU

16 GSM viterbi decoder : 2 ALUs
EXU         ACTIV   AREA   POWER
alu_1       69%
alu_2       65%
romctrl_1   67%
acu_1       37%
ipb_1       8%
opb_1       33%
ctrl
total: 9739
cycle count down 30%
area up 42%
power down by 5%
next step: introduce ASU to reduce ALU-load

17 GSM viterbi decoder : 1 x ACS-ASU
func ACS ( M1, M2, d ) MS, MS8 =
begin
  MS  = if ( M1+d > M2-d ) -> ( M1+d ) || ( M2-d ) fi;
  MS8 = if ( M1-d > M2+d ) -> ( M1-d ) || ( M2+d ) fi;
end;
EXU         ACTIV   AREA   POWER
alu_1       20%
acs_asu_1   83%
or_asu_1    10%
romctrl_1   16%     65     21
acu_1       36%
ipb_1       20%
opb_1       11%
ctrl
total: 1930
cycle count down 5X
power down 20X !
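The ACS (add-compare-select) function above selects the larger of two candidate metrics for each of its two outputs. A minimal Python equivalent, with illustrative lower-case names, is:

```python
def acs(m1, m2, d):
    """Add-compare-select: returns (MS, MS8) as defined by the ACS function.

    MS  is the larger of m1 + d and m2 - d.
    MS8 is the larger of m1 - d and m2 + d.
    """
    ms = max(m1 + d, m2 - d)
    ms8 = max(m1 - d, m2 + d)
    return ms, ms8
```

On a plain ALU this takes four additions or subtractions, two comparisons and two selections executed one after another; a dedicated ACS unit performs them as one operation, which is consistent with the drop in ALU activity shown in the table.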

18 GSM viterbi decoder : 4 x ACS-ASU
EXU           ACTIV   AREA   POWER
alu_1         94%
acs_asu_1     95%
acs_asu_2     95%
acs_asu_3     95%
acs_asu_4     95%
split_asu_1   47%     90     18
or_asu_1      47%
romctrl_1     28%     48     6
acu_1         98%
ipb_1         23%     60     6
opb_1         50%
ctrl
total: 425
cycle count down another 5X
area up 23%
power down another 3X !

19 GSM viterbi example : summary
Figure: Mistral2 flow around the Implementation Independent Design Database.
72x !
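The step-by-step gains can be recomputed from the "total" figures on the five Viterbi slides. The sketch below assumes those totals are cycle counts, which matches the quoted "cycle count down 30%" and "5X" remarks; the dictionary keys are shortened slide titles.

```python
# "total" values from the five GSM Viterbi decoder slides (assumed: cycle counts).
CYCLES = {
    "default": 13750,
    "no loop-folding": 14247,
    "2 ALUs": 9739,
    "1 x ACS-ASU": 1930,
    "4 x ACS-ASU": 425,
}

def stepwise_speedups(counts):
    """Ratio of each configuration's cycle count to that of its predecessor."""
    names = list(counts)
    return [
        (names[i], counts[names[i - 1]] / counts[names[i]])
        for i in range(1, len(names))
    ]

def overall_speedup(counts):
    """Ratio of the first configuration's cycle count to the last one's."""
    values = list(counts.values())
    return values[0] / values[-1]

for name, ratio in stepwise_speedups(CYCLES):
    print(f"{name}: {ratio:.2f}x fewer cycles than the previous step")
print(f"overall: {overall_speedup(CYCLES):.1f}x")
```

Under that assumption the two ASU steps give roughly 5.0x and 4.5x, and the cycle count falls by about 32x from the default solution to the 4 x ACS-ASU design.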

20 Discussion: phase 3
Application software development: constraint driven compilation
Figure: two design loops. Exploration phase: the processor model and the application(s) feed SW (code generation) and HW design; the loop repeats until the result is OK and no more applications remain, then the processor model is frozen. Application software development: the application(s) pass through SW (code generation) on the frozen model until the result is OK.

21 Embedded Computer Architecture H.Corporaal and B. Mesman
Figure: distributed VLIW template. Register files RF1 to RF4 feed functional units FU1 to FU4; instruction registers IR1 to IR4 are loaded from the instruction memory; a control unit receives the flags.

22 Discussion: problems with VLIWs
code size and instruction bandwidth
code compaction = reduce code size after scheduling
possible compaction ratio? e.g. p0 = 0.9 and p1 = 0.1
information content (entropy) = - Σ p_i log2 p_i = 0.47, so the maximum compression factor ≈ 2
control parallelism during scheduling = switch between different processor models (10% of code = 90% runtime)
architecture: reduce number of control bits for operand addresses
e.g. 128 reg (TM) -> 28 bits/issue slot for addresses only => use stacks and fifos
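Both numbers on this slide can be reproduced with a few lines of arithmetic. The sketch below computes the entropy for p0 = 0.9 and p1 = 0.1 and the resulting compression bound. For the 28 bits per issue slot it assumes four register address fields of 7 bits each; the count of four fields is an assumption that fits the figure, not something the slide states.

```python
import math

def entropy_bits(probs):
    """Shannon entropy in bits per symbol: -sum(p * log2(p))."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def max_compression_factor(probs):
    """Upper bound on compression of a 1-bit-per-symbol binary source."""
    return 1.0 / entropy_bits(probs)

def operand_address_bits(num_registers, fields_per_slot):
    """Bits spent on register addresses in one issue slot."""
    return fields_per_slot * math.ceil(math.log2(num_registers))

h = entropy_bits([0.9, 0.1])
print(f"entropy = {h:.2f} bits, max compression factor = {1 / h:.2f}")
# fields_per_slot=4 is an assumed value (it yields the slide's 28 bits).
print(operand_address_bits(128, fields_per_slot=4), "address bits per issue slot")
```

The bound of roughly 2 means that even an ideal entropy coder cannot shrink such code by more than about half, which motivates the architectural measures listed on the slide.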

23 GPU basics
Synthetic objects are represented with a bunch of triangles (3d) in a language/library like OpenGL or DirectX plus texture
Triangles are represented with 3 vertices
A vertex is represented with 4 coordinates with floating-point precision
Objects are transformed between coordinate representations
Transformations are matrix-vector multiplications
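A transformation of one vertex, as described above, is a 4x4 matrix applied to a 4-component vector. The sketch below uses plain Python lists; the translation matrix and the triangle are illustrative examples in homogeneous coordinates (x, y, z, w).

```python
def mat_vec(m, v):
    """Multiply a 4x4 matrix (list of 4 rows) by a 4-component vertex."""
    return [sum(m[r][c] * v[c] for c in range(4)) for r in range(4)]

def translation(tx, ty, tz):
    """Homogeneous translation matrix."""
    return [
        [1, 0, 0, tx],
        [0, 1, 0, ty],
        [0, 0, 1, tz],
        [0, 0, 0, 1],
    ]

def transform_triangle(m, triangle):
    """Apply the same matrix to each of the three vertices of a triangle."""
    return [mat_vec(m, vertex) for vertex in triangle]

triangle = [[0, 0, 0, 1], [1, 0, 0, 1], [0, 1, 0, 1]]
print(transform_triangle(translation(1, 2, 3), triangle))
```

Every vertex is processed independently with the same matrix, which is the data-level parallelism that GPU hardware exploits.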

24 GPU DirectX 10 pipeline

25 NVIDIA GeForce 6800 3D Pipeline

26 GeForce 8800 GPU
330 Gflops, 128 processors with 4-way SIMD

27 GPU: Why more general-purpose programmable?
All transformations are shading
Shading is all matrix-vector multiplications
Computational load varies heavily between different sorts of shading
Programmable shaders allow dynamic resource allocation between shaders
Result: Modern GPUs are serious competitors for general-purpose processors!

28 Classical encoding versus Velocity encoding
Figure: the operations A to H shown in three schedules: fully serial, mixed serial/parallel and fully parallel.
Classical encoding: fetching many nops (every unused slot of an instruction word holds a nop, shown as n).
Velocity encoding: only the operations A B C D E F G H are stored, together with marker bits (shown as 1 in the figure).
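One way to avoid fetching nops is to store only the real operations and attach a parallel bit to each of them. The sketch below assumes that the "Velocity encoding" on this slide works like the parallel-bit scheme of TI's VelociTI architecture, where a set bit means that the next operation issues in the same cycle. The explicit slot number per operation and all function names are simplifications introduced here.

```python
NOP = "n"

def compress(schedule):
    """Encode a schedule without nops.

    schedule: list of cycles; each cycle is a non-empty list of (slot, op).
    Returns a flat list of (slot, op, p_bit); p_bit == 1 means the next
    entry belongs to the same cycle, p_bit == 0 closes the cycle.
    """
    stream = []
    for cycle in schedule:
        for i, (slot, op) in enumerate(cycle):
            p_bit = 1 if i < len(cycle) - 1 else 0
            stream.append((slot, op, p_bit))
    return stream

def expand(stream, width):
    """Rebuild the classical instruction words, nops included."""
    words = []
    word = [NOP] * width
    for slot, op, p_bit in stream:
        word[slot] = op
        if p_bit == 0:          # cycle boundary: emit the word, start a new one
            words.append(word)
            word = [NOP] * width
    return words

def classical_size(schedule, width):
    """Number of slots fetched by the classical encoding."""
    return len(schedule) * width
```

For a fully serial schedule of eight operations on an 8-slot machine, the classical encoding fetches 64 slots while the compressed stream holds 8 entries; for a fully parallel schedule both hold 8.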

29 Conclusions
ASIPs provide efficient solutions for well-defined application domains (2 orders of magnitude higher efficiency).
The methodology is interesting for IP creation.
The key problem is retargetable compilation.
A (distributed) VLIW model is a good compromise between HW and SW.
Although an automatic process can generate a default solution, the process usually is interactive and iterative for efficiency reasons. The key is fast and accurate feedback.
GPUs are ASIPs

