Transport Triggered Architectures used for Embedded Systems
Henk Corporaal, EE Department, Delft University of Technology


Slide 1: Transport Triggered Architectures used for Embedded Systems
Henk Corporaal, EE Department, Delft University of Technology
h.corporaal@et.tudelft.nl, http://cs.et.tudelft.nl
International Symposium on New Trends in Computer Architecture, Gent, Belgium, December 16, 1999

Slide 2: Topics
- MOVE project goals
- Architecture spectrum of solutions
- From VLIW to TTA
- Code generation for TTAs
- Mapping applications to processors
- Achievements
- TTA related research

Slide 3: MOVE project goals
- Remove bottlenecks of current ILP processors
- Tools for quick processor and system design; offer expertise in a package
- Application-driven design process
- Exploit ILP to its limits (but not further!)
- Replace hardware complexity with software complexity as far as possible
- Extreme functional flexibility
- Scalable solutions
- Orthogonal concept (combine with SIMD, MIMD, FPGA function units, ...)

Slide 4: Architecture design spectrum
Four-dimensional architecture design space: (I, O, D, S)
- 'I': instructions/cycle
- 'O': operations/instruction
- 'D': data/operation
- 'S': superpipelining degree, S = Σ_op freq(op) · lt(op)
Figure: RISC sits at (1,1,1,1); VLIW, superpipelined, SIMD, superscalar, dataflow and CISC occupy other points of the space; the MOVE design space is marked as a region.
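The formula for S can be made concrete with a short sketch: S is the frequency-weighted sum of operation latencies. The operation mix below (names, frequencies, latencies) is invented purely for illustration.

```c
#include <stddef.h>

typedef struct {
    const char *op;  /* operation name */
    double freq;     /* relative frequency of the operation */
    int lt;          /* latency in cycles */
} OpStat;

/* S = sum over operations of freq(op) * lt(op) */
double superpipelining_degree(const OpStat *stats, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += stats[i].freq * stats[i].lt;
    return s;
}
```

For a mix of 50% adds (1 cycle), 30% multiplies (3 cycles) and 20% loads (2 cycles), S = 0.5·1 + 0.3·3 + 0.2·2 = 1.8.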

Slide 5: Architecture design spectrum
Mpar is the amount of parallelism to be exploited by the compiler / application!

Slide 6: Architecture design spectrum
Which choice: I, O, D, or S? A few remarks:
- I: instructions/cycle
  - Superscalar / dataflow: limited scaling due to complexity
  - MIMD: do it yourself
- O: operations/instruction
  - VLIW: a good choice if binary compatibility is not an issue
  - Speedup for all types of applications

Slide 7: Architecture design spectrum
- D: data/operation
  - SIMD / vector: the application has to offer this type of parallelism
  - May be a good choice for multimedia
- S: pipelining degree
  - Superpipelining is a cheap solution
  - However, operation latencies may become dominant
  - Unused delay slots increase
- The MOVE project initially concentrates on O and S

Slide 8: From VLIW to TTA
- VLIW scaling problems
  - number of ports on the register file
  - bypass complexity
- VLIW flexibility problems
  - can we plug in arbitrary functionality?
- TTA: reverse the programming paradigm
  - template
  - characteristics

Slide 9: From VLIW to TTA
Figure: general organization of a VLIW CPU — instruction memory, instruction fetch unit, instruction decode unit, function units FU-1 through FU-5 and the register file connected by the bypassing network, plus data memory.

Slide 10: From VLIW to TTA
Strong points of VLIW:
- Scalable (add more FUs)
- Flexible (an FU can be almost anything)
Weak points, with N FUs:
- Bypassing complexity: O(N²)
- Register file complexity: O(N)
- Register file size: O(N²)
- Register file design restricts FU flexibility
Solution: mirror the programming paradigm

Slide 11: Transport Triggered Architecture
Figure: general organization of a TTA CPU — instruction memory, instruction fetch unit, instruction decode unit, function units FU-1 through FU-5 and the register file all attached to the transport (bypassing) network, plus data memory.

Slide 12: TTA structure; datapath details
Figure: sockets connect the function units and register file to the transport buses.

Slide 13: TTA characteristics — hardware
- Modular: Lego play tool generator
- Very flexible and scalable
  - easy inclusion of special function units (SFUs)
- Low complexity
  - 50% reduction in the number of register ports
  - reduced bypass complexity (no associative matching)
  - up to 80% reduction in bypass connectivity
  - trivial decoding
  - reduced register pressure

Slide 14: Register pressure (chart)

Slide 15: TTA characteristics — software
A traditional operation-triggered instruction:
  mul r1, r2, r3
The equivalent transport-triggered instruction (three moves):
  r3 -> mul.o, r2 -> mul.t; mul.r -> r1
- Extra scheduling optimizations become possible
- However: more difficult to schedule!
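The paradigm reversal can be sketched in a few lines of C (an illustrative model only, not the MOVE tools): a function unit exposes an operand register (mul.o) and a trigger register (mul.t); a move into the trigger register is what starts the operation, and the result is later transported out of the result register (mul.r).

```c
/* Toy model of one transport-triggered multiply function unit. */
typedef struct {
    int o;  /* operand register (mul.o) */
    int r;  /* result register  (mul.r) */
} MulFU;

/* plain data transport: writing mul.o has no side effect */
static void write_operand(MulFU *fu, int value) { fu->o = value; }

/* writing mul.t triggers the multiply */
static void write_trigger(MulFU *fu, int value) { fu->r = fu->o * value; }

int tta_mul(int a, int b) {
    MulFU mul = {0, 0};
    write_operand(&mul, a);   /* a -> mul.o */
    write_trigger(&mul, b);   /* b -> mul.t, starts the operation */
    return mul.r;             /* mul.r -> result */
}
```

The program specifies only the transports; the operation itself is a side effect of the trigger move, which is what gives the scheduler its extra freedom.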

Slide 16: Code generation trajectory
Figure: application (C) -> compiler frontend -> sequential code -> compiler backend -> parallel code; the backend uses an architecture description and profiling data, and both sequential and parallel code can be simulated against the same input/output.
Frontend: GCC or SUIF (adapted).

Slide 17: TTA compiler characteristics
- Handles all ANSI C programs
- Region scheduling scope with speculative execution
- Uses profiling
- Software pipelining
- Predicated execution (e.g. for stores)
- Multiple register files
- Integrated register allocation and scheduling
- Fully parametric

Slide 18: Code generation for TTAs
- TTA-specific optimizations
  - common operand elimination
  - software bypassing
  - dead result move elimination
  - scheduling freedom of T, O and R moves
- Our scheduler (the compiler backend) exploits these advantages

Slide 19: TTA-specific optimizations
Bypassing can eliminate the need for RF accesses. Example:

  r1 -> add.o, r2 -> add.t;
  add.r -> r3;
  r3 -> sub.o, r4 -> sub.t;
  sub.r -> r5;

translates into:

  r1 -> add.o, r2 -> add.t;
  add.r -> sub.o, r4 -> sub.t;
  sub.r -> r5;
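The rewrite above can be sketched as a toy pass (illustrative only; the real MOVE scheduler works on its internal IR and on liveness information). A move is a (src, dst) pair of names; when an FU result is copied to a register that later moves read, the pass forwards the result directly and, assuming the register is not live after the window, drops the now-dead register write.

```c
#include <string.h>

typedef struct { const char *src, *dst; } Move;

/* true for FU result ports such as "add.r" or "sub.r" */
static int is_fu_result(const char *s) {
    size_t n = strlen(s);
    return n >= 2 && strcmp(s + n - 2, ".r") == 0;
}

/* Software bypassing + dead result move elimination over a window of
 * moves; rewrites in place and returns the new number of moves.
 * Assumption (hedged): registers written inside the window are not
 * live after it, so a fully-forwarded write may be removed. */
int bypass(Move *m, int n) {
    int out = 0;
    for (int i = 0; i < n; i++) {
        if (is_fu_result(m[i].src)) {           /* e.g. add.r -> r3 */
            int forwarded = 0;
            for (int j = i + 1; j < n; j++)
                if (strcmp(m[j].src, m[i].dst) == 0) {
                    m[j].src = m[i].src;        /* r3 -> sub.o becomes add.r -> sub.o */
                    forwarded = 1;
                }
            if (forwarded) continue;            /* dead result move: drop it */
        }
        m[out++] = m[i];
    }
    return out;
}
```

Running it on the six moves of the slide's example leaves the five-move bypassed sequence.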

Slide 20: Mapping applications to processors
We have described:
- a templated architecture
- a parametric compiler exploiting the specifics of the template
Problem: how to tune a processor architecture for a certain application domain?

Slide 21: Mapping applications to processors
Figure: the MOVE framework. An optimizer drives the parametric compiler and the hardware generator from a set of architecture parameters, steered by feedback and user interaction; the outputs are parallel object code and a chip. The explored solution space is shown as a Pareto curve of cost versus execution time.

Slide 22: Achievements within the MOVE project
- Transport Triggered Architecture (TTA) template
  - Lego playbox toolkit
- Design framework almost operational
  - you may add your own 'strange' function units (no restrictions)
- Several chips have been designed by TUD and industry; their applications include
  - an intelligent datalogger
  - video image enhancement (video stretcher)
  - an MPEG2 decoder
  - wireless communication

Slide 23: Video stretcher board containing a TTA (photo)

Slide 24: Intelligent datalogger
- mixed signal
- special FUs
- on-chip RAM and ROM
- operates stand-alone
- core generated automatically
- C compiler

Slide 25: TTA related research
- RoD: registers-on-demand scheduling
- SFUs: pattern detection
- CTT: code transformation tool
- Multiprocessor single-chip embedded systems
- Global program optimizations
- Automatic fixed-point code generation
- ReMove

Slide 26: RoD: registers-on-demand scheduling

Slide 27: Phase ordering problem: scheduling vs. allocation
- Early register assignment
  - introduces false dependencies
  - bypassing information not available
- Late register assignment
  - span of live ranges likely to increase, which leads to more spill code
  - spill/reload code is inserted after scheduling, which requires an extra scheduling step
- Integrated with the instruction scheduler: RoD
  - more complex

Slide 28: RoD
Code to schedule:
  4 -> add.o, x -> add.t; add.r -> y;
  r0 -> sub.o, y -> sub.t; sub.r -> z;
Figure: five scheduling steps; x is assigned register r1 on demand, y is software-bypassed (add.r -> sub.t) so it never occupies a register, z is assigned r7, and the register resource tables (RRTs) are updated after each step.

Slide 29: Spilling
- Occurs when the number of simultaneously live variables exceeds the number of registers
- Contents of spilled variables are stored in memory
- The performance impact of the inserted extra code must be kept as small as possible
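The trigger condition can be illustrated with a small helper (hypothetical, not part of the MOVE tools): given the live ranges of the variables, compute the maximum number that are live at the same time; spilling is needed when that exceeds the number of machine registers.

```c
typedef struct { int start, end; } LiveRange;   /* inclusive instruction range */

/* Maximum number of simultaneously live variables. The maximum is
 * always reached at some range's start point, so checking only the
 * start points is sufficient. */
int max_pressure(const LiveRange *lr, int n) {
    int max = 0;
    for (int i = 0; i < n; i++) {
        int live = 0;
        for (int j = 0; j < n; j++)
            if (lr[j].start <= lr[i].start && lr[i].start <= lr[j].end)
                live++;
        if (live > max) max = live;
    }
    return max;
}

int needs_spilling(const LiveRange *lr, int n, int num_regs) {
    return max_pressure(lr, n) > num_regs;
}
```
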

Slide 30: Spilling
Figure: a live range (def x ... use x, def y ... use y) under register pressure; a store after 'def r1' spills the value to memory, and a load restores it into r1 before the later uses.

Slide 31: Spilling
Operation to schedule:
  x -> sub.o, r1 -> sub.t; sub.r -> r3;

Code after spill code insertion (x is reloaded from the stack at fp + 4):
  4 -> add.o, fp -> add.t;
  add.r -> z;
  z -> ld.t;
  ld.r -> x;
  x -> sub.o, r1 -> sub.t;
  sub.r -> r3;

Bypassed code:
  4 -> add.o, fp -> add.t;
  add.r -> ld.t;
  ld.r -> sub.o, r1 -> sub.t;
  sub.r -> r3;

Slide 32: RoD compared with early assignment
Chart: speedup of RoD [%] versus number of registers.

Slide 33: RoD compared with early assignment
Chart: cycle-count increase [%] versus number of registers (12 to 32), for RoD and for early assignment — the impact of decreasing the number of registers.

Slide 34: Special functionality: SFUs

Slide 35: Mapping applications to processors
SFUs may help!
- Which ones do I need?
- Tradeoff between cost and performance
SFU granularity?
- Coarse grain: do it yourself (profiling helps); the MOVE framework supports this
- Fine grain: tooling needed

Slide 36: SFUs: fine-grain patterns
Why use fine-grain SFUs:
- code size reduction
- reduction of the number of register file ports
- could be cheaper and/or faster
- transport reduction
- power reduction (avoid charging non-local wires)
Which patterns need support?
- Detection of recurring operation patterns is needed

Slide 37: SFUs: pattern identification
Method:
- Trace analysis
- Build the DDG (data dependence graph)
- Create a pattern library on demand
- Fuse partial matches into complete matches
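The simplest form of this analysis can be sketched as follows (illustrative only, not the MOVE tooling): count how often each producer-to-consumer operation pair occurs along the edges of the DDG, the most frequent pairs being the first candidates for a 2-operation SFU. The real identification step additionally fuses partial matches into larger, multi-output, non-tree patterns.

```c
#include <stdio.h>
#include <string.h>

typedef struct { const char *from, *to; } Edge;  /* DDG edge: producer -> consumer */

#define MAX_PATTERNS 64
char patterns[MAX_PATTERNS][32];   /* pattern names, e.g. "mul-add" */
int  counts[MAX_PATTERNS];
int  npatterns;

/* Count occurrences of each 2-op pattern in the edge list;
 * returns the index of the most frequent pattern. */
int count_patterns(const Edge *e, int nedges) {
    npatterns = 0;
    for (int i = 0; i < nedges; i++) {
        char name[32];
        snprintf(name, sizeof name, "%s-%s", e[i].from, e[i].to);
        int j;
        for (j = 0; j < npatterns; j++)
            if (strcmp(patterns[j], name) == 0) break;
        if (j == npatterns) {                  /* new pattern: add to library */
            strcpy(patterns[npatterns], name);
            counts[npatterns++] = 0;
        }
        counts[j]++;
    }
    int best = 0;
    for (int j = 1; j < npatterns; j++)
        if (counts[j] > counts[best]) best = j;
    return best;
}
```
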

Slide 38: SFUs: fine-grain patterns
General pattern and subject graphs are:
- multi-output
- non-tree
- composed of operand and operation nodes

Slide 39: SFUs: covering results (chart)

Slide 40: SFUs: top-10 patterns of 2 operations (chart)

Slide 41: SFUs: conclusions
- Most patterns are multi-output and not tree-like
- Patterns 1, 4, 6 and 8 have implementation advantages
- 20 additional 2-node patterns give a 40% reduction in operation count
- Grouping operations into classes gives even better results
Open question: how to schedule for these patterns?

Slide 42: Source-to-source transformations

Slide 43: Design transformations
Source-to-source transformations:
- CTT: code transformation tool

Slide 44: Transformation example: loop embedding

Before:
  ....
  for (i = 0; i < 100; i++) {
    do_something();
  }
  ....
  void do_something() {
    /* procedure body */
  }

After:
  ....
  do_something2();
  ....
  void do_something2() {
    int i;
    for (i = 0; i < 100; i++) {
      /* procedure body */
    }
  }

Slide 45: Structure of a transformation
  PATTERN    { description of the code selection stage }
  CONDITIONS { additional constraints }
  RESULT     { description of the new code }

Slide 46: Implementation (figure)

Slide 47: Experimental results
- Could transform 39 out of 45 SIMD loops (in a set of 9 DSP benchmarks and MPEG)
- Can handle transformations like those listed on the slide

Slide 48: Partitioning your program for multiprocessor single-chip solutions

Slide 49: Multiprocessor embedded system
Figure: an ASIP-based heterogeneous multiprocessor — ASIP cores (Asip1, Asip2, Asip3, each with its own SFUs), RAM, I/O and a TPU on a single chip.
- How to partition and map your application?
- Splitting threads

Slide 50: Design transformations
Why split threads?
- Combine fine-grain (ILP) and coarse-grain parallelism
- Avoid the ILP bottleneck
- A multiprocessor solution may be cheaper
  - more efficient resource use
- The wire delay problem makes clustering necessary!

Slide 51: Experimental results of the partitioner (chart)

Slide 52: Instant frequency tracking example (figure)

Slide 53: Global program optimizations

Slide 54: Traditional compilation path
- Compiler output is textual (assembly)
  - loss of source-level information
- The object code defines the program's memory layout
  - efficient binary representation, but
  - not suitable for code transformations

Slide 55: New compilation path
- Structured machine-level representation of the program:
  - the representation is accessible to "binary tools"
  - high-level information is maintained and passed to the linker
  - whole-program code transformations are easier
- The link function and the section-offset information must be rethought

Slide 56: Inter-module register allocation
- After linkage, globally exported variables can be allocated to registers
  - performing re-allocation of exported variables before scheduling is expensive
- Solution: re-allocation after linking all modules
- Variable aliasing analysis (is the address taken?) is computed and maintained
- A larger pool of live-range candidates is available for register allocation

Slide 57: Fixed-point conversion: motivation
- Cost of floating-point hardware
- Most "embedded" programs are written in ANSI C
- C does not support fixed-point arithmetic
- Manually writing fixed-point programs is tedious and error-prone (insertion of scaling operations)
- Fixed-point extensions to C are only a partial solution

Slide 58: Fixed-point conversion
Example: acc += (*coef_ptr) * (*data_ptr)
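A hedged sketch of what converted code for this MAC might look like, assuming 16-bit Q15 operands (1 sign bit, 15 fraction bits) and a 32-bit accumulator; the Q-format choice and helper names are assumptions for illustration, not taken from the slides.

```c
#include <stdint.h>

#define Q 15  /* assumed Q15 fixed-point format */

/* convert a small real value to Q15 and back (helpers for illustration) */
static int16_t to_q15(double x)    { return (int16_t)(x * (1 << Q)); }
static double  from_q15(int32_t x) { return (double)x / (1 << Q); }

/* acc += (*coef_ptr) * (*data_ptr), with the scaling operation (>> Q)
 * that manual fixed-point coding would have to insert by hand;
 * an arithmetic right shift is assumed for negative products */
static void mac_q15(int32_t *acc, const int16_t *coef_ptr, const int16_t *data_ptr) {
    *acc += ((int32_t)(*coef_ptr) * (int32_t)(*data_ptr)) >> Q;
}
```

A FIR filter would call mac_q15 once per tap; the converter's job is to place such scaling shifts automatically and report the achievable accuracy.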

Slide 59: Methodology
- The user starts with a floating-point version of the application
- The user annotates a selected set of floating-point variables
- The converter automatically converts the remaining variables/temporaries and delivers feedback
- Result: a source file in which floating-point variables are replaced by integer variables with appropriate scaling operations

Slide 60: Link-time code conversion
- Problem: linking fixed-point code with library code
  - transformations on binary code are impractical
  - source-level linkage is awkward
- Solution: floating-point to fixed-point conversion of library code "on the fly" during linkage
- Advantages:
  - no need to compile a specific version of the library for a particular fixed-point format in advance
  - information about the fixed-point format can flow between user and library code in both directions

Slide 61: Experimental results
- S = floating-point signal, S' = fixed-point signal
- Accuracy metric: signal-to-noise ratio (dB)
- Test programs: 35th-order FIR and 6th-order IIR filters

Slide 62: Experimental results — performance and code size (table)

Slide 63: What next?
How to map your application A(L, A, D) to hardware (L, N, C)?
- L: design level (e.g. architecture, implementation or realization level)
- A: application components
- D: dependences between application components
- N: hardware components
- C: connections between hardware components

Slide 64: Integrated design environment
In the MOVE project we have mostly 'closed' the right part of the design cycle.

Slide 65: Conclusions / discussion
Billions of embedded systems with embedded processors are sold annually; how do we design these systems quickly, cheaply, correctly and with low power?
- We have experience with tuning architectures for applications
  - an extremely flexible templated TTA, used by several companies
  - parametric code generation
  - automatic TTA design space exploration
- The challenge: automated tuning of applications for architectures, closing the Y-chart
  - a design transformation framework is needed

