HIR Optimizations and Transformations - Session 12
Heiko Falk, TU Dortmund, Informatik 12, Germany (2008)
Fakultät für Informatik, Technische Universität Dortmund
(Slides use Microsoft cliparts; all Microsoft restrictions apply.)
Schedule of the course (sessions marked * are held by Dr. Heiko Falk):

09:30-11:00: Mon 1: Orientation, introduction + 2: Models of computation + specs; Tue 5: Models of computation + specs; Wed 9: Mapping of applications to platforms; Thu 13: Memory aware compilation; Fri 17: Memory aware compilation
11:00: Brief break
11:15-12:30: Tue 6: Lab*: Ptolemy; Wed 10: Lab*: Scheduling; Thu 14: Lab*: Mem. opt.; Fri 18: Lab*: Mem. opt.
12:30: Lunch
14:00-15:20: Mon 3: Models of computation + specs; Tue 7: Mapping of applications to platforms; Wed 11: High-level optimizations*; Thu 15: Memory aware compilation; Fri 19: WCET & compilers*
15:20: Break
15:40-17:00: Mon 4: Lab*: Kahn process networks; Tue 8: Mapping of applications to platforms; Wed 12: High-level optimizations*; Thu 16: Memory aware compilation; Fri 20: Wrap-up
Outline
- Intermediate Representations
- Motivation of High-Level Optimizations
- Parallelization for Multi-DSPs
  - Introduction and Target Architecture
  - Workflow
  - Program Recovery
  - Data Partitioning and Mapping
  - Locality Improvement and DMA
  - Results
- References & Summary
Structure of an Optimizing Compiler
Source Code → Lexical Analysis → Tokens → Syntactical Analysis → Syntax Tree → Semantical Analysis → High-Level IR → Code Optimization → High-Level IR → Code Selection → Low-Level IR → Code Optimization → Low-Level IR → Register Allocation → Low-Level IR → Instruction Scheduling → ASM Code
Requirements for Code Optimizations
Required infrastructure for optimizations: effective & efficient internal data structures that
- model the code currently under optimization,
- enable and facilitate code manipulation,
- provide the analyses necessary for optimizations.
⇒ Intermediate Representations (IRs)
Abstraction Levels of IRs
[S. S. Muchnick, Advanced Compiler Design & Implementation, Morgan Kaufmann, 1997]
High-Level IRs (HIR):
- Very close to the source code; often abstract syntax trees.
- Variables & types are used to represent values and their storage.
- Complex control and data flow constructs are preserved (e.g. loops, if-then / if-else statements, array accesses a[]).
- Back-transformation of a HIR into source code is easy.
Abstraction Levels of IRs
Medium-Level IRs (MIR):
- Three-address code: a1 ← a2 op a3;
- IR independent of both source language and target processor.
- Temporary variables are used to store values.
- Complex control and data flow constructs are simplified (e.g. labels & jumps, pointer arithmetic).
- Control flow in the form of basic blocks (sequences of straight-line code).
Abstraction Levels of IRs
Low-Level IRs (LIR):
- Represent assembly code.
- Operations correspond to machine instructions.
- Registers are used to store values.
- Transformation of an LIR into assembly code is easy.
HIR Example: ICD-C
Hierarchy: IR → Compilation Unit → Function → Statement → Expression
- IR: the complete program during simultaneous compilation of several source files.
- Compilation Unit: a single C file.
- Statements: loop statements (for, while, do-while), selection statements (if, if-else, switch), jump statements (return, break, continue, …), …
- Expressions: binary & unary expressions (+, -, *, /, …), assignment operators (=, +=, -=, …), index & component accesses (a[x], a.x, …), …
HIR Example: ICD-C
Full hierarchy: IR → Compilation Unit → Function → Basic Block → Statement → Expression, with symbol tables (ST) attached per scope: Global ST, File ST, Function ST, Local ST.
[Informatik Centrum Dortmund e.V., http://www.icd.de/es, Dortmund, 2008]
ICD-C: Features
- ANSI-C compiler frontend: C89 + C99 standards; GNU inline assembly included.
- Included analyses: data flow analyses, control flow analyses, loop analyses.
- Interfaces: ANSI-C dump of the IR as interface to external tools; built-in interface to code selectors in compiler backends.
LIR Example: LLIR
Hierarchy: LLIR → Function → Basic Block → Instruction → Operation
- Machine instruction: consists of 1-N machine operations; the operations are executed in parallel (→ VLIW).
- Machine operation: includes an assembly opcode (e.g. ADD, MUL, …) and 0-M parameters.
LIR Example: LLIR
Hierarchy extended by parameters: LLIR → Function → Basic Block → Instruction → Operation → Parameter
- Parameters: registers, integer constants & labels, addressing modes, …
The LLIR structure is fully processor-independent: an LLIR consists of some generic functions, an LLIR function consists of …, an LLIR operation consists of some generic parameters.
LIR Example: LLIR
The LLIR becomes processor-specific by adding a processor description, e.g. for the TriCore 1.3:
- Registers = { D0, …, D15, A0, …, A15 }
- Mnemonics = { ABS, ABS.B, …, XOR.T }
- Status Flags = { C, V, …, SAV }
- …
[Informatik Centrum Dortmund e.V., http://www.icd.de/es, Dortmund, 2008]
LLIR: Features
- Retargetability: integrated mechanisms to adapt the LLIR to various processors (e.g. DSPs, VLIWs, network processors, …); modeling of different kinds of instruction sets and register sets.
- Included analyses: data flow analyses, control flow analyses.
- Interfaces: parsing and output of assembly files; built-in interface to code selectors.
Motivation of High-Level Optimizations
High-level IRs:
- Structure very close to the source language.
- High-level constructs (esp. loops, function calls + parameter passing, array accesses) are preserved.
High-level optimizations:
- Exploit these HIR features intensively.
- Concentrate on large-scale restructuring of loops and function calls.
- Are difficult to realize at lower abstraction levels, since the high-level information is lost there and would have to be recreated.
Parallelization for Multi-DSPs
Material by courtesy of Björn Franke and Michael O'Boyle, School of Informatics, University of Edinburgh, UK.
Parallelization for Multi-DSPs
Motivation:
- The performance requirements of an entire system often surpass the abilities of a single processor (e.g. radar, sonar, medical image processing, HDTV, …).
- A cluster of parallel DSPs provides enough performance, but:
  - little hardware support for parallel execution,
  - even less support for parallel programming in development tool chains (e.g. specification languages, compilers, …),
  - existing source code is often written in a low-level fashion, complicating effective parallelization.
Auto-Parallelizing Compilers
The discipline "High Performance Computing" has researched vectorizing compilers for more than 25 years; traditionally, these are Fortran compilers. Such vectorizing compilers are usually inappropriate for multi-DSPs, since their assumptions on the memory model are unrealistic:
- Communication between processors via shared memory.
- Memory has only one single common address space.
- Distributed caches may be used, but cache coherence is ensured by hardware protocols.
⇒ De facto, no auto-parallelizing compiler for multi-DSPs exists!
Multi-DSPs
(Architecture diagram: DSP cores 0 … X, each with two local memory banks Mem Bank 1 and Mem Bank 2, connected together with an external memory via a shared bus.)
- Multiple address spaces: internal 1 & 2, external, remote DSP core.
- Using internal memories: higher bandwidth, reduced latencies.
- Using remote memories: the ID of the remote DSP must be known.
Workflow of Auto-Parallelization for Multi-DSPs
1. Program recovery: removal of undesired low-level constructs in the IR; replacement by equivalent high-level constructs.
2. Parallelism detection: identification of parallelizable loops.
3. Partitioning and mapping of data: minimization of the communication overhead between DSPs.
4. Memory access localization: minimization of accesses to remote memories.
5. Data transfer optimization: exploitation of DMA for burst transfers.
(Running) Code Example for 2 Parallel DSPs

```c
/* Array Declarations */
int A[16], B[16], C[16], D[16];

/* Declaration & Initialization of Pointers */
int *p_a = A, *p_b = &B[15], *p_c = C, *p_d = D;

/* Loop over all Array Elements */
for (i = 0; i < 16; i++)
  *p_d++ = *p_c++ + *p_a++ * *p_b--;
```

Low-level array accesses via pointers, with explicit pointer arithmetic (auto-increment addressing). Disadvantageous for parallelization: ad hoc, no structure in the array accesses is visible and analyzable.
Program Recovery

```c
/* Array Declarations */
int A[16], B[16], C[16], D[16];

/* Loop over all Array Elements */
for (i = 0; i < 16; i++)
  D[i] = C[i] + A[i] * B[15-i];
```

Replacement of pointer accesses by explicit array subscripts []. The structure of the array accesses is now clearly visible and accessible to the following analyses.
Program Recovery

```c
/* Array Declarations */
int A[16], B[16], C[16], D[16];

/* Loop over all Array Elements */
for (i = 0; i < 16; i++)
  D[i] = C[i] + A[i] * B[15-i];
```

However, one-dimensional "flat" arrays are still too unstructured for multi-DSP parallelization: how to partition the arrays onto the available parallel DSPs is unclear.
Data Partitioning

```c
/* Partitioned Array Declarations */
int A[2][8], B[2][8], C[2][8], D[2][8];

/* Loop over all Array Elements */
for (i = 0; i < 16; i++)
  D[i/8][i%8] = C[i/8][i%8] + A[i/8][i%8] * B[(15-i)/8][(15-i)%8];
```

New two-dimensional array declarations: the first dimension corresponds to the number of parallel DSPs. The originally flat arrays are now partitioned into disjoint areas that can be processed independently of each other.
Data Partitioning

```c
/* Partitioned Array Declarations */
int A[2][8], B[2][8], C[2][8], D[2][8];

/* Loop over all Array Elements */
for (i = 0; i < 16; i++)
  D[i/8][i%8] = C[i/8][i%8] + A[i/8][i%8] * B[(15-i)/8][(15-i)%8];
```

The addressing is now very costly and complex: the arrays are multi-dimensional, but the loop counter i used to index them is still incremented sequentially, so every access requires a division and a modulo operation (so-called circular buffer addressing).
Strip Mining of the i-Loop

```c
/* Partitioned Array Declarations */
int A[2][8], B[2][8], C[2][8], D[2][8];

/* Nested Loop over all Array Elements */
for (j = 0; j < 2; j++)
  for (i = 0; i < 8; i++)
    D[j][i] = C[j][i] + A[j][i] * B[1-j][7-i];
```

The sequential iteration space of i is split into two independent iteration spaces. The iteration spaces of the new loop nest now reflect the data layout, and only affine expressions remain in the array indices.
Strip Mining of the i-Loop

```c
/* Partitioned Array Declarations */
int A[2][8], B[2][8], C[2][8], D[2][8];

/* Nested Loop over all Array Elements */
for (j = 0; j < 2; j++)
  for (i = 0; i < 8; i++)
    D[j][i] = C[j][i] + A[j][i] * B[1-j][7-i];
```

How can this code be parallelized for two DSPs?
Parallelization (for Processor 0)

```c
/* Definition of Processor ID */
#define MYID 0

/* Partitioned Array Declarations */
int A[2][8], B[2][8], C[2][8], D[2][8];

/* Simple Loop over all Array Elements for DSP No. MYID */
for (i = 0; i < 8; i++)
  D[MYID][i] = C[MYID][i] + A[MYID][i] * B[1-MYID][7-i];
```

An explicit processor ID is inserted, and the arrays are addressed using it. For N parallel DSPs, N different HIR codes with individual processor IDs are generated.
Parallelization (for Processor 0)

```c
/* Definition of Processor ID */
#define MYID 0

/* Partitioned Array Declarations */
int A[2][8], B[2][8], C[2][8], D[2][8];

/* Simple Loop over all Array Elements for DSP No. MYID */
for (i = 0; i < 8; i++)
  D[MYID][i] = C[MYID][i] + A[MYID][i] * B[1-MYID][7-i];
```

This structure makes explicit which code is executed on which DSP. Still unclear: how are the arrays mapped to local memory banks or remote memories, and how are remote memory banks accessed?
Array Descriptors
(Diagram: sub-array A0[0…7] in the local memory of DSP 0, sub-array A1[0…7] in the local memory of DSP 1; each DSP holds an array descriptor A0 | A1 through which an access like A[0][5] is routed.)
- The 2-dimensional array A[2][8] is partitioned into two sub-arrays A0 and A1 along A's first dimension.
- Each sub-array An is stored in the local memory of DSP n.
- The original 2-dimensional accesses to A have to be re-routed to A0 and A1 using array descriptors.
Memory Layout (for Processor 0)

```c
/* Definition of Processor ID */
#define MYID 0

/* Partitioned Array Declarations & Array Descriptors */
int A0[8]; extern int A1[8]; int *A[2] = { A0, A1 };
int B0[8]; extern int B1[8]; int *B[2] = { B0, B1 };
...

/* Simple Loop over all Array Elements for DSP No. MYID */
for (i = 0; i < 8; i++)
  D[MYID][i] = C[MYID][i] + A[MYID][i] * B[1-MYID][7-i];
```

The arrays are stored in local and remote memories; the array accesses go via the descriptors, in unchanged syntax.
Memory Layout (for Processor 0)

```c
/* Definition of Processor ID */
#define MYID 0

/* Partitioned Array Declarations & Array Descriptors */
int A0[8]; extern int A1[8]; int *A[2] = { A0, A1 };
int B0[8]; extern int B1[8]; int *B[2] = { B0, B1 };
...

/* Simple Loop over all Array Elements for DSP No. MYID */
for (i = 0; i < 8; i++)
  D[MYID][i] = C[MYID][i] + A[MYID][i] * B[1-MYID][7-i];
```

- Descriptor accesses to local arrays are inefficient due to the additional indirection.
- Scheduling issues: the latency of A[i][j] can vary significantly, depending on whether i references local or remote memory.
Increasing Locality of Array Accesses

```c
/* Definition of Processor ID */
#define MYID 0

/* Partitioned Array Declarations & Array Descriptors */
int A0[8]; extern int A1[8]; int *A[2] = { A0, A1 };
int B0[8]; extern int B1[8]; int *B[2] = { B0, B1 };
...

/* Simple Loop over all Array Elements for DSP No. MYID */
for (i = 0; i < 8; i++)
  D0[i] = C0[i] + A0[i] * B[1-MYID][7-i];
```

Local arrays are now accessed directly; array accesses via descriptors are avoided wherever possible, for maximal use of the high bandwidth of the local memories.
Increasing Locality of Array Accesses

```c
/* Definition of Processor ID */
#define MYID 0

/* Partitioned Array Declarations & Array Descriptors */
int A0[8]; extern int A1[8]; int *A[2] = { A0, A1 };
int B0[8]; extern int B1[8]; int *B[2] = { B0, B1 };
...

/* Simple Loop over all Array Elements for DSP No. MYID */
for (i = 0; i < 8; i++)
  D0[i] = C0[i] + A0[i] * B[1-MYID][7-i];
```

Remaining problem: 8 sequential accesses to consecutive array elements in remote memory. This is inefficient, since 8 full bus cycles are required.
Insertion of DMA Block Transfers

```c
/* Definition of Processor ID */
#define MYID 0

/* Partitioned Array Declarations & Array Descriptors */
int A0[8]; extern int A1[8]; int *A[2] = { A0, A1 };
int B0[8]; extern int B1[8]; int *B[2] = { B0, B1 };
...

/* Temporary DMA Buffer */
int temp[8];
DMA_get( temp, B[1-MYID], 8 * sizeof( int ) );

/* Simple Loop over all Array Elements for DSP No. MYID */
for (i = 0; i < 8; i++)
  D0[i] = C0[i] + A0[i] * temp[7-i];
```

The local buffer is burst-loaded from remote memory via DMA; the array accesses in the loop only use local memory.
Benchmarking of the Multi-DSP Parallelization
Multi-DSP hardware:
- 4 parallel AD TigerSHARC TS-101 DSPs @ 250 MHz
- 768 kB local SRAM per DSP, 128 MB external DRAM
Benchmarks for auto-parallelization:
- DSPstone: small DSP kernel codes, low code complexity
- UTDSP: entire complex applications, compute-intensive code
Measured results: execution times
- of the fully sequential code running on 1 DSP,
- of the code after program recovery,
- of the code after data partitioning and mapping,
- of the code after locality improvement and DMA transfers.
Results - DSPstone
(Per-benchmark speedup chart; not reproduced in this transcript.)
Results - UTDSP
(Per-benchmark speedup chart; not reproduced in this transcript.)
Discussion of Results
Average total speedups:
- DSPstone: factor 4.28
- UTDSP: factor 3.65
- All benchmarks: factor 3.78
Surprising: how can a speedup of more than a factor of 4 be achieved for DSPstone if the parallelization targets only 4 DSPs?
Reasons for Super-Linear Speedups > 4
Parallelized code can offer more optimization potential for subsequent compiler optimizations than the original sequential code. Example:
- Sequential i-loop (slide 25): 16 iterations.
- i-loop parallelized for 2 DSPs (slide 26): 8 iterations.
Parallelized loops are thus better candidates for loop unrolling:

  for (i = 0; i < 8; i++)
    <loop body>;

becomes the loop body repeated 8 times: a fully unrolled loop without any branches. No delay slots, and branch prediction cannot mispredict.
References
Compiler structure and intermediate representations:
- Steven S. Muchnick, Advanced Compiler Design & Implementation, Morgan Kaufmann, 1997. ISBN 1-55860-320-4.
Parallelization for homogeneous multi-DSPs:
- B. Franke, M. O'Boyle, A Complete Compiler Approach to Auto-Parallelizing C Programs for Multi-DSP Systems, IEEE Transactions on Parallel and Distributed Systems, Vol. 16, No. 3, March 2005.
Summary
- Intermediate representations: representation of code at different abstraction levels; HIR and LIR examples.
- HIR optimizations: restructuring of loops; restructuring of functions and their calling relations.
- Parallelization for homogeneous multi-DSPs: focuses on the exploitation of local memories & address spaces; speedups basically linear in the number of available parallel DSPs.
Questions?