
1  Fakultät für Informatik, Informatik 12, Technische Universität Dortmund
   HIR Optimizations and Transformations - Session 12 -
   Heiko Falk, TU Dortmund, Informatik 12, Germany
   Slides use Microsoft cliparts. All Microsoft restrictions apply.

2  Schedule of the course

   09:30-11:00   Mon: 1: Orientation, introduction / 2: Models of computation + specs
                 Tue: 5: Models of computation + specs
                 Wed: 9: Mapping of applications to platforms
                 Thu: 13: Memory aware compilation
                 Fri: 17: Memory aware compilation
   11:00         Brief break
   11:15-12:30   Tue: 6: Lab*: Ptolemy
                 Wed: 10: Lab*: Scheduling
                 Thu: 14: Lab*: Mem. opt.
                 Fri: 18: Lab*: Mem. opt.
   12:30         Lunch
   14:00-15:20   Mon: 3: Models of computation + specs
                 Tue: 7: Mapping of applications to platforms
                 Wed: 11: High-level optimizations*
                 Thu: 15: Memory aware compilation
                 Fri: 19: WCET & compilers*
   15:20         Break
   15:40-17:00   Mon: 4: Lab*: Kahn process networks
                 Tue: 8: Mapping of applications to platforms
                 Wed: 12: High-level optimizations*
                 Thu: 16: Memory aware compilation
                 Fri: 20: Wrap-up

   * Dr. Heiko Falk

3  Outline

    Intermediate Representations
    Motivation of High-Level Optimizations
    Parallelization for Multi-DSPs
      - Introduction and Target Architecture
      - Workflow: Program Recovery, Data Partitioning and Mapping,
        Locality Improvement and DMA
      - Results
    References & Summary

4  Structure of an Optimizing Compiler

   Source Code → Lexical Analysis → Tokens → Syntactical Analysis → Syntax Tree
   → Semantical Analysis → High-Level IR → Code Optimization → High-Level IR
   → Code Selection → Low-Level IR → Code Optimization → Low-Level IR
   → Register Allocation → Low-Level IR → Instruction Scheduling → ASM Code

5  Requirements for Code Optimizations

   Required infrastructure for optimizations:
    Effective & efficient internal data structures that
     - model the code currently under optimization,
     - enable and facilitate code manipulation,
     - provide necessary analyses for optimizations.
    Intermediate Representations (IRs)

6  Abstraction Levels of IRs
   [S. S. Muchnick, Advanced Compiler Design & Implementation, Morgan Kaufmann, 1997]

   High-Level IRs (HIR):
    Very close to source code.
    Often abstract syntax trees.
    Variables & types used to represent values and their storage.
    Complex control and data flow operations are preserved
     (e.g. loops, if-then / if-else statements, array accesses []).
    Back-transformation of an HIR into source code is easy.

7  Abstraction Levels of IRs

   Medium-Level IRs (MIR):
    Three-address code: a1 ← a2 op a3;
    IR independent of both source language and target processor.
    Temporary variables used to store values.
    Complex control and data flow operations are simplified
     (e.g. labels & jumps, pointer arithmetic).
    Control flow in form of basic blocks (sequences of straight-line code).

8  Abstraction Levels of IRs

   Low-Level IRs (LIR):
    Represent assembly code.
    Operations correspond to machine instructions.
    Registers used to store values.
    Transformation of an LIR into assembly code is easy.

9  HIR-Example: ICD-C

   IR hierarchy: Compilation Unit → Function → Statement → Expression
    Compilation Unit: a single C file; the IR can hold several units during
     simultaneous compilation of several source files.
    Statements: loop statements (for, do-while, while-do), selection
     statements (if, if-else, switch), jump statements (return, break,
     continue, …), …
    Expressions: binary & unary expressions (+, -, *, /, …), assignment
     operators (=, +=, -=, …), index & component access (a[x], a.x, …), …

10  HIR-Example: ICD-C
    [Informatik Centrum Dortmund e.V., http://www.icd.de/es, Dortmund, 2008]

    IR hierarchy: Compilation Unit → Function → Basic Block → Statement →
    Expression, with symbol tables (ST) attached per scope:
    Global ST, File ST, Funct. ST, Local ST.

11  ICD-C: Features

    ANSI-C compiler frontend:
     C89 + C99 standards
     GNU inline assembly
    Included analyses:
     Data flow analyses
     Control flow analyses
     Loop analyses
    Interfaces:
     ANSI-C dump of the IR as interface to external tools
     Built-in interface to code selectors in compiler backends

12  LIR-Example: LLIR

    Hierarchy: LLIR → Function → Basic Block → Instruction → Operation
     Machine instruction: consists of 1-N machine operations;
      the operations are executed in parallel ( VLIW).
     Machine operation: includes an assembly opcode (e.g. ADD, MUL, …)
      and 0-M parameters.

13  LIR-Example: LLIR

    Hierarchy: LLIR → Function → Basic Block → Instruction → Operation →
    Parameter
     Parameters: registers, integer constants & labels, addressing modes, …
    LLIR structure fully processor-independent:
     An LLIR consists of some generic functions
     An LLIR function consists of …
     An LLIR operation consists of some generic parameters

14  LIR-Example: LLIR
    [Informatik Centrum Dortmund e.V., http://www.icd.de/es, Dortmund, 2008]

    The LLIR becomes processor-specific by adding a processor description,
    e.g. for the TriCore 1.3:
     + Registers = { D0, …, D15, A0, …, A15 }
     + Mnemonics = { ABS, ABS.B, …, XOR.T }
     + Status Flags = { C, V, …, SAV }
     + ...

15  LLIR: Features

    Retargetability:
     Integrated mechanisms to adapt the LLIR to various processors
      (e.g. DSPs, VLIWs, network processors, …)
     Modeling of different kinds of instruction sets
     Modeling of different kinds of register sets
    Included analyses:
     Data flow analyses
     Control flow analyses
    Interfaces:
     Parsing and output of assembly files
     Built-in interface to code selectors

16  Motivation of High-Level Optimizations

    High-level IRs:
     Structure very close to the source language.
     High-level constructs (esp. loops, function calls + parameter passing,
      array accesses) are preserved.
    High-level optimizations:
     Exploit the features of HIRs intensively.
     Concentrate on large-scale restructuring of loops and function calls.
     Are difficult to realize at lower abstraction levels, since the
      high-level information is lost there and would have to be recreated.

17  Parallelization for Multi-DSPs

    Material by courtesy of:
    Björn Franke and Michael O’Boyle,
    School of Informatics, University of Edinburgh, UK

18  Parallelization for Multi-DSPs

    Motivation:
     Performance requirements of an entire system often surpass the
      abilities of a single processor (e.g. radar, sonar, medical image
      processing, HDTV, …).
     A cluster of parallel DSPs provides enough performance, but…
      - little hardware support for parallel execution,
      - even less support for parallel programming in development tool
        chains (e.g. specification languages, compilers, …),
      - existing source code is often written in a low-level fashion,
        complicating effective parallelization.

19  Auto-Parallelizing Compilers

    The discipline of High Performance Computing:
     Research on vectorizing compilers for more than 25 years.
     Traditionally: Fortran compilers.
     Such vectorizing compilers are usually inappropriate for Multi-DSPs,
      since their assumptions about the memory model are unrealistic:
      - communication between processors via shared memory,
      - memory has only one single common address space,
      - distributed caches may be used, but cache coherence is ensured by
        hardware protocols.
     De facto, no auto-parallelizing compiler for Multi-DSPs exists!

20  Multi-DSPs

    Architecture: DSP cores 0…X, each with local memory banks 1 and 2,
    connected to an external memory via a shared bus.
     Multiple address spaces: internal banks 1 & 2, external memory, and
      the memories of remote DSP cores.
     Using internal memories: higher bandwidth, reduced latencies.
     Using remote memories: the ID of the remote DSP must be known.

21  Workflow of Auto-Parallelization for Multi-DSPs

     Program Recovery
      - Removal of undesired low-level constructs in the IR
      - Replacement by equivalent high-level constructs
     Parallelism Detection
      - Identification of parallelizable loops
     Partitioning and Mapping of Data
      - Minimization of communication overhead between DSPs
     Memory Access Localization
      - Minimization of accesses to remote memories
     Data Transfer Optimization
      - Exploitation of DMA for burst transfers

22  (Running) Code Example for 2 parallel DSPs

    /* Array Declarations */
    int A[16], B[16], C[16], D[16];

    /* Declaration & Initialization of Pointers */
    int *p_a = A, *p_b = &B[15], *p_c = C, *p_d = D;

    /* Loop over all Array Elements */
    for (i = 0; i < 16; i++)
      *p_d++ = *p_c++ + *p_a++ * *p_b--;

    Low-level array accesses via pointers; explicit pointer arithmetic
    (auto-increment addressing). Disadvantageous for parallelization:
    ad hoc, no structure in the array accesses is visible or analyzable.

23  Program Recovery

    /* Array Declarations */
    int A[16], B[16], C[16], D[16];

    /* Loop over all Array Elements */
    for (i = 0; i < 16; i++)
      D[i] = C[i] + A[i] * B[15-i];

    Replacement of the pointer accesses by explicit array operations [].
    The structure of the array accesses is now better visible and
    accessible for the following analyses.

24  Program Recovery

    /* Array Declarations */
    int A[16], B[16], C[16], D[16];

    /* Loop over all Array Elements */
    for (i = 0; i < 16; i++)
      D[i] = C[i] + A[i] * B[15-i];

    One-dimensional “flat” arrays are still too unstructured for Multi-DSP
    parallelization: the partitioning of the arrays onto the available
    parallel DSPs is unclear.

25  Data Partitioning

    /* Partitioned Array Declarations */
    int A[2][8], B[2][8], C[2][8], D[2][8];

    /* Loop over all Array Elements */
    for (i = 0; i < 16; i++)
      D[i/8][i%8] = C[i/8][i%8] + A[i/8][i%8] * B[(15-i)/8][(15-i)%8];

    New two-dimensional array declarations. The first dimension corresponds
    to the number of parallel DSPs. The originally flat arrays are now
    partitioned into disjoint areas that can be processed independently of
    each other.

26  Data Partitioning

    /* Partitioned Array Declarations */
    int A[2][8], B[2][8], C[2][8], D[2][8];

    /* Loop over all Array Elements */
    for (i = 0; i < 16; i++)
      D[i/8][i%8] = C[i/8][i%8] + A[i/8][i%8] * B[(15-i)/8][(15-i)%8];

    Very costly and complex addressing is involved now. Reason: the arrays
    are multi-dimensional, but the loop counter i used to index them is
    still incremented sequentially, so so-called circular buffer addressing
    is required.

27  Strip Mining of the i-Loop

    /* Partitioned Array Declarations */
    int A[2][8], B[2][8], C[2][8], D[2][8];

    /* Nested Loop over all Array Elements */
    for (j = 0; j < 2; j++)
      for (i = 0; i < 8; i++)
        D[j][i] = C[j][i] + A[j][i] * B[1-j][7-i];

    Splitting of the sequential iteration space of i into two independent
    iteration spaces. The iteration spaces of the new loop nest now reflect
    the data layout. Only affine expressions are used for array indexing.

28  Strip Mining of the i-Loop

    /* Partitioned Array Declarations */
    int A[2][8], B[2][8], C[2][8], D[2][8];

    /* Nested Loop over all Array Elements */
    for (j = 0; j < 2; j++)
      for (i = 0; i < 8; i++)
        D[j][i] = C[j][i] + A[j][i] * B[1-j][7-i];

    How can this code be parallelized for two DSPs?

29  Parallelization (for Processor 0)

    /* Definition of Processor ID */
    #define MYID 0

    /* Partitioned Array Declarations */
    int A[2][8], B[2][8], C[2][8], D[2][8];

    /* Simple Loop over all Array Elements for DSP No. MYID */
    for (i = 0; i < 8; i++)
      D[MYID][i] = C[MYID][i] + A[MYID][i] * B[1-MYID][7-i];

    Insertion of an explicit processor ID; array addressing uses the
    processor ID. For N parallel DSPs, N different HIR codes with
    individual processor IDs are generated.

30  Parallelization (for Processor 0)

    /* Definition of Processor ID */
    #define MYID 0

    /* Partitioned Array Declarations */
    int A[2][8], B[2][8], C[2][8], D[2][8];

    /* Simple Loop over all Array Elements for DSP No. MYID */
    for (i = 0; i < 8; i++)
      D[MYID][i] = C[MYID][i] + A[MYID][i] * B[1-MYID][7-i];

    This structure makes explicit which code is executed on which DSP.
    Still unclear: How are arrays mapped to local memory banks or remote
    memories? How are remote memory banks accessed?

31  Array Descriptors

    The two-dimensional array A[2][8] is partitioned into two sub-arrays
    A0[0...7] and A1[0...7] along A’s first dimension. Each sub-array An is
    stored in the local memory of DSP n. The original two-dimensional
    accesses of A (e.g. A[0][5]) have to be re-routed to A0 and A1 using
    array descriptors.

32  Memory Layout (for Processor 0)

    /* Definition of Processor ID */
    #define MYID 0

    /* Partitioned Array Declarations & Array Descriptors */
    int A0[8]; extern int A1[8]; int *A[2] = { A0, A1 };
    int B0[8]; extern int B1[8]; int *B[2] = { B0, B1 };
    ...

    /* Simple Loop over all Array Elements for DSP No. MYID */
    for (i = 0; i < 8; i++)
      D[MYID][i] = C[MYID][i] + A[MYID][i] * B[1-MYID][7-i];

    Arrays are stored in local and remote memories. Array accesses go
    through the descriptors in unchanged syntax.

33  Memory Layout (for Processor 0)

    /* Definition of Processor ID */
    #define MYID 0

    /* Partitioned Array Declarations & Array Descriptors */
    int A0[8]; extern int A1[8]; int *A[2] = { A0, A1 };
    int B0[8]; extern int B1[8]; int *B[2] = { B0, B1 };
    ...

    /* Simple Loop over all Array Elements for DSP No. MYID */
    for (i = 0; i < 8; i++)
      D[MYID][i] = C[MYID][i] + A[MYID][i] * B[1-MYID][7-i];

    Descriptor accesses to local arrays are inefficient due to the
    additional indirection. Scheduling issues: the latency of A[i][j] can
    vary significantly, depending on whether i references local or remote
    memory.

34  Increasing Locality of Array Accesses

    /* Definition of Processor ID */
    #define MYID 0

    /* Partitioned Array Declarations & Array Descriptors */
    int A0[8]; extern int A1[8]; int *A[2] = { A0, A1 };
    int B0[8]; extern int B1[8]; int *B[2] = { B0, B1 };
    ...

    /* Simple Loop over all Array Elements for DSP No. MYID */
    for (i = 0; i < 8; i++)
      D0[i] = C0[i] + A0[i] * B[1-MYID][7-i];

    Direct accesses to local arrays; array accesses via descriptors are
    avoided whenever possible.  Maximal usage of the high bandwidth of
    the local memories.

35  Increasing Locality of Array Accesses

    /* Definition of Processor ID */
    #define MYID 0

    /* Partitioned Array Declarations & Array Descriptors */
    int A0[8]; extern int A1[8]; int *A[2] = { A0, A1 };
    int B0[8]; extern int B1[8]; int *B[2] = { B0, B1 };
    ...

    /* Simple Loop over all Array Elements for DSP No. MYID */
    for (i = 0; i < 8; i++)
      D0[i] = C0[i] + A0[i] * B[1-MYID][7-i];

    Remaining problem: 8 sequential accesses to consecutive array elements
    in remote memory.  Inefficient, since 8 full bus cycles are required.

36  Insertion of DMA Block Transfers

    /* Definition of Processor ID */
    #define MYID 0

    /* Partitioned Array Declarations & Array Descriptors */
    int A0[8]; extern int A1[8]; int *A[2] = { A0, A1 };
    int B0[8]; extern int B1[8]; int *B[2] = { B0, B1 };
    ...

    /* Temporary DMA Buffer */
    int temp[8];
    DMA_get( temp, &(B[1-MYID]), 8 * sizeof( int ) );

    /* Simple Loop over all Array Elements for DSP No. MYID */
    for (i = 0; i < 8; i++)
      D0[i] = C0[i] + A0[i] * temp[7-i];

    Burst load of a local buffer from remote memory via DMA. Array accesses
    inside the loop only use local memory.

37  Benchmarking of Multi-DSP Parallelization

    Multi-DSP hardware:
     4 parallel AD TigerSHARC TS-101 DSPs @ 250 MHz
     768 kB local SRAM per DSP, 128 MB external DRAM
    Benchmarks for auto-parallelization:
     DSPstone: small DSP kernel codes, low code complexity
     UTDSP: entire complex applications, compute-intensive code
    Results: execution times
     of fully sequential code running on 1 DSP
     of code after Program Recovery
     of code after Data Partitioning and Mapping
     of code after Locality Improvement and DMA Transfers

38  Results – DSPstone
    [results chart]

39  Results – UTDSP
    [results chart]

40  Discussion of Results

    Average total speedups:
     DSPstone: factor 4.28
     UTDSP: factor 3.65
     All benchmarks: factor 3.78
    Surprising at first glance: How can a speedup of more than a factor
    of 4 be achieved for DSPstone if the parallelization targets only
    4 DSPs?

41  Reasons for Super-Linear Speedups > 4

    Parallelized code may offer more optimization potential for subsequent
    compiler optimizations than the original sequential code.
    Example: The sequential i-loop (slide 25) has 16 iterations; the i-loop
    parallelized for 2 DSPs (slide 26) has only 8 iterations.
     Such parallelized loops become candidates for loop unrolling:
      for (i = 0; i < 8; i++) <body>;   becomes   <body>; <body>; … (8 times)
     A fully unrolled loop contains no branches at all!
     No delay slots; branch prediction cannot mispredict.

42  References

    Compiler structure and intermediate representations:
     Steven S. Muchnick, Advanced Compiler Design & Implementation,
      Morgan Kaufmann, 1997. ISBN 1-55860-320-4.
    Parallelization for homogeneous Multi-DSPs:
     B. Franke, M. O’Boyle, A Complete Compiler Approach to
      Auto-Parallelizing C Programs for Multi-DSP Systems, IEEE
      Transactions on Parallel and Distributed Systems, Vol. 16, No. 3,
      March 2005.

43  Summary

    Intermediate Representations:
     Representation of code at different abstraction levels.
     HIR and LIR examples.
    HIR optimizations:
     Restructuring of loops.
     Restructuring of functions and their calling relations.
    Parallelization for homogeneous Multi-DSPs:
     Focuses on the exploitation of local memories & address spaces.
     Speedups basically linear in the number of available parallel DSPs.

44  Questions (if on schedule)?

