Fakultät für Informatik, Informatik 12, Technische Universität Dortmund
HIR Optimizations and Transformations - Session 12 -
Heiko Falk, TU Dortmund, Informatik 12, Germany
Slides use Microsoft cliparts. All Microsoft restrictions apply.

Schedule of the course

09:30-11:00   Mon: 1: Orientation, introduction; 2: Models of computation + specs | Tue: 5: Models of computation + specs | Wed: 9: Mapping of applications to platforms | Thu: 13: Memory aware compilation | Fri: 17: Memory aware compilation
11:00         Brief break
11:15-12:30   Tue: 6: Lab*: Ptolemy | Wed: 10: Lab*: Scheduling | Thu: 14: Lab*: Mem. opt. | Fri: 18: Lab*: Mem. opt.
12:30         Lunch
14:00-15:20   Mon: 3: Models of computation + specs | Tue: 7: Mapping of applications to platforms | Wed: 11: High-level optimizations* | Thu: 15: Memory aware compilation | Fri: 19: WCET & compilers*
15:20         Break
15:40-17:00   Mon: 4: Lab*: Kahn process networks | Tue: 8: Mapping of applications to platforms | Wed: 12: High-level optimizations* | Thu: 16: Memory aware compilation | Fri: 20: Wrap-up

* Dr. Heiko Falk

Outline

- Intermediate Representations
- Motivation of High-Level Optimizations
- Parallelization for Multi-DSPs
  - Introduction and Target Architecture
  - Workflow: Program Recovery, Data Partitioning and Mapping, Locality Improvement and DMA
  - Results
- References & Summary

Structure of an Optimizing Compiler

Source Code → Lexical Analysis → Tokens → Syntactical Analysis → Syntax Tree → Semantical Analysis
  → High-Level IR → Code Optimization → High-Level IR → Code Selection
  → Low-Level IR → Code Optimization → Low-Level IR → Register Allocation → Low-Level IR → Instruction Scheduling → ASM Code

Requirements for Code Optimizations

Required infrastructure for optimizations:
- Effective & efficient internal data structures that
  - model the code currently under optimization,
  - enable and facilitate code manipulation,
  - provide necessary analyses for optimizations.
⇒ Intermediate Representations (IRs)

Abstraction Levels of IRs
[S. S. Muchnick, Advanced Compiler Design & Implementation, Morgan Kaufmann, 1997]

High-Level IRs (HIR):
- Very close to source code.
- Often: Abstract Syntax Trees.
- Variables & types used to represent values and their storage.
- Complex control and data flow operations are conserved (e.g. loops, if-then / if-else statements, array accesses []).
- Back-transformation of HIR into source code easy.

Abstraction Levels of IRs

Medium-Level IRs (MIR):
- Three-address code: a1 ← a2 op a3;
- IR independent of both source language and target processor.
- Temporary variables used to store values.
- Complex control and data flow operations simplified (e.g. labels & jumps, pointer arithmetic).
- Control flow in form of basic blocks (sequences of straight-line code).
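
To make the notation concrete, here is a small illustrative example (not taken from the slides) of how a single source-level statement could be lowered into three-address code; t1 and t2 are hypothetical compiler-generated temporaries:

/* Illustrative sketch only: lowering one statement into MIR-style
   three-address code of the form  a1 = a2 op a3. */
int lowered(int a, int b, int c)
{
  /* HIR view:  return c + a * b;  */
  int t1, t2;
  t1 = a * b;     /* a1 = a2 op a3 */
  t2 = c + t1;
  return t2;
}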

Abstraction Levels of IRs

Low-Level IRs (LIR):
- Represent assembly code.
- Operations correspond to machine instructions.
- Registers used to store values.
- Transformation of an LIR into assembly code easy.

HIR-Example: ICD-C

Hierarchy (top-down): IR → Compilation Unit → Function → Statement → Expression
- Compilation Unit: 1 single C file during simultaneous compilation of several source files
- Statements: loop statements (for, do-while, while-do), selection statements (if, if-else, switch), jump statements (return, break, continue, …), …
- Expressions: binary & unary expressions (+, -, *, /, …), assignment operators (=, +=, -=, …), index & component access (a[x], a.x, …), …

HIR-Example: ICD-C

The same hierarchy (IR → Compilation Unit → Function → Statement → Expression) is additionally annotated with symbol tables at the different levels (Global ST, File ST, Funct. ST, Local ST) and with basic blocks.
[Informatik Centrum Dortmund e.V., Dortmund, 2008]

ICD-C: Features

ANSI-C Compiler Frontend:
- C89 + C99 standards
- GNU inline assembly

Included Analyses:
- Data flow analyses
- Control flow analyses
- Loop analyses

Interfaces:
- ANSI-C dump of the IR as interface to external tools
- Built-in interface to code selectors in compiler backends

LIR-Example: LLIR

Hierarchy (top-down): LLIR → Function → Basic Block → Instruction → Operation
- Machine Instruction: consists of 1-N machine operations; the operations are executed in parallel (→ VLIW)
- Machine Operation: includes the assembly opcode (e.g. ADD, MUL, …) and 0-M parameters

LIR-Example: LLIR

- Parameter: registers, integer constants & labels, addressing modes, …

LLIR structure fully processor independent:
- An LLIR consists of some generic functions
- An LLIR function consists of …
- An LLIR operation consists of some generic parameters
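
As a purely illustrative sketch (assumed type and field names, not the actual LLIR interface), the generic hierarchy described above might be modeled in C roughly as follows:

/* Hypothetical model of the generic, processor-independent LLIR hierarchy. */
typedef struct LLIR_Parameter {         /* register, integer constant, label, ... */
    enum { LLIR_REG, LLIR_CONST, LLIR_LABEL } kind;
    const char *text;                   /* textual form of the parameter */
} LLIR_Parameter;

typedef struct LLIR_Operation {         /* one machine operation */
    const char     *opcode;             /* e.g. "ADD", "MUL", ... */
    LLIR_Parameter *params;             /* 0..M parameters */
    unsigned        num_params;
} LLIR_Operation;

typedef struct LLIR_Instruction {       /* 1..N operations, executed in parallel (VLIW) */
    LLIR_Operation *ops;
    unsigned        num_ops;
} LLIR_Instruction;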

LIR-Example: LLIR

LLIR becomes processor-specific by adding a processor description, e.g. TriCore 1.3:
+ Registers = { D0, …, D15, A0, …, A15 }
+ Mnemonics = { ABS, ABS.B, …, XOR.T }
+ Status Flags = { C, V, …, SAV }
+ ...
[Informatik Centrum Dortmund e.V., Dortmund, 2008]

LLIR: Features

Retargetability:
- Integrated mechanisms to adapt LLIR to various processors (e.g. DSPs, VLIWs, network processors, …)
- Modeling of different kinds of instruction sets
- Modeling of different kinds of register sets

Included Analyses:
- Data flow analyses
- Control flow analyses

Interfaces:
- Parsing and output of assembly files
- Built-in interface to code selectors

Motivation of High-Level Optimizations

High-Level IRs:
- Structure very close to the source language.
- High-level constructs (esp. loops, function calls + parameter passing, array accesses) are conserved.

High-Level Optimizations:
- Exploit the features of HIRs intensively.
- Concentrate on large-scale restructuring of loops and function calls.
- Are difficult to realize on lower abstraction levels, since there the high-level information is lost and would have to be recreated.

Parallelization for Multi-DSPs

Material by courtesy of:
Björn Franke and Michael O’Boyle
School of Informatics, University of Edinburgh, UK

Parallelization for Multi-DSPs

Motivation:
- Performance requirements of an entire system often surpass the abilities of a single processor (e.g. radar, sonar, medical image processing, HDTV, …).
- A cluster of parallel DSPs provides enough performance, but…
  - little hardware support for parallel execution,
  - even less support for parallel programming in the development tool chains (e.g. specification languages, compilers, …),
  - existing source code is often written in a low-level fashion, complicating effective parallelization.

Auto-Parallelizing Compilers

Discipline "High Performance Computing":
- Research on vectorizing compilers for more than 25 years.
- Traditionally: Fortran compilers.
- Such vectorizing compilers are usually inappropriate for Multi-DSPs, since their assumptions about the memory model are unrealistic:
  - communication between processors via shared memory,
  - memory has only one single common address space,
  - distributed caches may be used, but cache coherence is ensured by hardware protocols.
⇒ De facto, there is no auto-parallelizing compiler for Multi-DSPs!

Multi-DSPs

(Figure: DSP Core 0 … DSP Core X, each with two internal memory banks (Mem Bank 1, Mem Bank 2), all attached via a bus to a shared External Memory.)

- Multiple address spaces: internal 1 & 2, external, remote DSP core
- Using internal memories: higher bandwidth, reduced latencies
- Using remote memories: ID of the remote DSP must be known

Workflow of Auto-Parallelization for Multi-DSPs

- Program Recovery: removal of undesired low-level constructs in the IR; replacement by equivalent high-level constructs
- Parallelism Detection: identification of parallelizable loops
- Partitioning and Mapping of Data: minimization of the communication overhead between DSPs
- Memory Access Localization: minimization of accesses to remote memories
- Data Transfer Optimization: exploitation of DMA for burst transfers

(Running) Code Example for 2 parallel DSPs

/* Array Declarations */
int A[16], B[16], C[16], D[16];

/* Declaration & Initialization of Pointers */
int *p_a = A, *p_b = &B[15], *p_c = C, *p_d = D;

/* Loop over all Array Elements */
for (i = 0; i < 16; i++)
  *p_d++ = *p_c++ + *p_a++ * *p_b--;

Low-level array accesses via pointers; explicit pointer arithmetic (auto-increment addressing).
Disadvantageous for parallelization: ad hoc, no structure in array accesses visible and analyzable.

Program Recovery

/* Array Declarations */
int A[16], B[16], C[16], D[16];

/* Loop over all Array Elements */
for (i = 0; i < 16; i++)
  D[i] = C[i] + A[i] * B[15-i];

Replacement of pointer accesses by explicit array operations [].
Structure of array accesses now better visible and accessible for following analyses.

Program Recovery

/* Array Declarations */
int A[16], B[16], C[16], D[16];

/* Loop over all Array Elements */
for (i = 0; i < 16; i++)
  D[i] = C[i] + A[i] * B[15-i];

One-dimensional "flat" arrays too unstructured for Multi-DSP parallelization.
Partitioning of the arrays onto available parallel DSPs unclear.

Data Partitioning

/* Partitioned Array Declarations */
int A[2][8], B[2][8], C[2][8], D[2][8];

/* Loop over all Array Elements */
for (i = 0; i < 16; i++)
  D[i/8][i%8] = C[i/8][i%8] + A[i/8][i%8] * B[(15-i)/8][(15-i)%8];

Novel two-dimensional array declarations; the first dimension corresponds to the number of parallel DSPs.
The originally flat arrays are now partitioned into disjoint areas that can be processed independently of each other.

Data Partitioning

/* Partitioned Array Declarations */
int A[2][8], B[2][8], C[2][8], D[2][8];

/* Loop over all Array Elements */
for (i = 0; i < 16; i++)
  D[i/8][i%8] = C[i/8][i%8] + A[i/8][i%8] * B[(15-i)/8][(15-i)%8];

Very costly and complex addressing is involved now. Reason: the arrays are now multi-dimensional, but the loop counter i used to index them is still incremented sequentially. So-called circular buffer addressing is involved.
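
To see the extra index arithmetic, the small illustrative program below (not part of the original slides) prints the two-dimensional indices that every access must now compute while i still runs sequentially from 0 to 15:

#include <stdio.h>

/* Illustrative only: the (i/8, i%8) index pairs computed per iteration
   for D, C and A, and the mirrored (15-i) pair used for B. */
int main(void)
{
    for (int i = 0; i < 16; i++)
        printf("i = %2d  ->  D/C/A[%d][%d],  B[%d][%d]\n",
               i, i / 8, i % 8, (15 - i) / 8, (15 - i) % 8);
    return 0;
}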

Strip Mining of the i-Loop

/* Partitioned Array Declarations */
int A[2][8], B[2][8], C[2][8], D[2][8];

/* Nested Loop over all Array Elements */
for (j = 0; j < 2; j++)
  for (i = 0; i < 8; i++)
    D[j][i] = C[j][i] + A[j][i] * B[1-j][7-i];

Splitting of the sequential iteration space of i into two independent iteration spaces.
The iteration spaces of the new loop nest now reflect the data layout.
Only affine expressions are used for array indexing.

Strip Mining of the i-Loop

/* Partitioned Array Declarations */
int A[2][8], B[2][8], C[2][8], D[2][8];

/* Nested Loop over all Array Elements */
for (j = 0; j < 2; j++)
  for (i = 0; i < 8; i++)
    D[j][i] = C[j][i] + A[j][i] * B[1-j][7-i];

How can this code be parallelized for two DSPs?

Parallelization (for Processor 0)

/* Definition of Processor ID */
#define MYID 0

/* Partitioned Array Declarations */
int A[2][8], B[2][8], C[2][8], D[2][8];

/* Simple Loop over all Array Elements for DSP No. MYID */
for (i = 0; i < 8; i++)
  D[MYID][i] = C[MYID][i] + A[MYID][i] * B[1-MYID][7-i];

Insertion of an explicit processor ID; array addressing by using the processor ID.
For N parallel DSPs, N different HIR codes with individual processor IDs are generated.
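
For the second DSP, the generated code differs only in its processor ID; the slide implies the following variant (shown here for completeness, it is not printed on the slide itself):

/* Definition of Processor ID (code generated for DSP 1) */
#define MYID 1

/* Partitioned Array Declarations */
int A[2][8], B[2][8], C[2][8], D[2][8];

/* Simple Loop over all Array Elements for DSP No. MYID */
for (i = 0; i < 8; i++)
  D[MYID][i] = C[MYID][i] + A[MYID][i] * B[1-MYID][7-i];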

Parallelization (for Processor 0)

/* Definition of Processor ID */
#define MYID 0

/* Partitioned Array Declarations */
int A[2][8], B[2][8], C[2][8], D[2][8];

/* Simple Loop over all Array Elements for DSP No. MYID */
for (i = 0; i < 8; i++)
  D[MYID][i] = C[MYID][i] + A[MYID][i] * B[1-MYID][7-i];

This structure makes explicit which code is executed on which DSP.
Still unclear: How are arrays mapped to local memory banks or remote memories? How are remote memory banks accessed?

Array Descriptors

(Figure: DSP 0 holds Sub-Array A0[0...7], DSP 1 holds Sub-Array A1[0...7]; each DSP additionally holds an array descriptor A0 | A1 through which an access such as A[0][5] is routed.)

The 2-dimensional array A[2][8] is partitioned into two Sub-Arrays A0 and A1 along A's first dimension.
Each Sub-Array An is stored in the local memory of DSP n.
Original 2-dimensional accesses of A have to be re-routed to A0 and A1 using array descriptors.

Memory Layout (for Processor 0)

/* Definition of Processor ID */
#define MYID 0

/* Partitioned Array Declarations & Array Descriptors */
int A0[8]; extern int A1[8]; int *A[2] = { A0, A1 };
int B0[8]; extern int B1[8]; int *B[2] = { B0, B1 };
...

/* Simple Loop over all Array Elements for DSP No. MYID */
for (i = 0; i < 8; i++)
  D[MYID][i] = C[MYID][i] + A[MYID][i] * B[1-MYID][7-i];

Arrays stored in local and remote memories.
Array accesses via descriptors in unchanged syntax.

Memory Layout (for Processor 0)

/* Definition of Processor ID */
#define MYID 0

/* Partitioned Array Declarations & Array Descriptors */
int A0[8]; extern int A1[8]; int *A[2] = { A0, A1 };
int B0[8]; extern int B1[8]; int *B[2] = { B0, B1 };
...

/* Simple Loop over all Array Elements for DSP No. MYID */
for (i = 0; i < 8; i++)
  D[MYID][i] = C[MYID][i] + A[MYID][i] * B[1-MYID][7-i];

Descriptor accesses to local arrays inefficient due to additional indirection.
Scheduling issues: latency of A[i][j] can vary significantly, depending on whether i references local or remote memory.

Increasing Locality of Array Accesses

/* Definition of Processor ID */
#define MYID 0

/* Partitioned Array Declarations & Array Descriptors */
int A0[8]; extern int A1[8]; int *A[2] = { A0, A1 };
int B0[8]; extern int B1[8]; int *B[2] = { B0, B1 };
...

/* Simple Loop over all Array Elements for DSP No. MYID */
for (i = 0; i < 8; i++)
  D0[i] = C0[i] + A0[i] * B[1-MYID][7-i];

Direct accesses to local arrays; avoid array accesses via descriptors whenever possible.
⇒ Maximal usage of the high bandwidth of local memories.

Increasing Locality of Array Accesses

/* Definition of Processor ID */
#define MYID 0

/* Partitioned Array Declarations & Array Descriptors */
int A0[8]; extern int A1[8]; int *A[2] = { A0, A1 };
int B0[8]; extern int B1[8]; int *B[2] = { B0, B1 };
...

/* Simple Loop over all Array Elements for DSP No. MYID */
for (i = 0; i < 8; i++)
  D0[i] = C0[i] + A0[i] * B[1-MYID][7-i];

8 sequential accesses to consecutive array elements in remote memory.
⇒ Inefficient, since 8 full bus cycles are required.

Insertion of DMA Block Transfers

/* Definition of Processor ID */
#define MYID 0

/* Partitioned Array Declarations & Array Descriptors */
int A0[8]; extern int A1[8]; int *A[2] = { A0, A1 };
int B0[8]; extern int B1[8]; int *B[2] = { B0, B1 };
...

/* Temporary DMA Buffer */
int temp[8];
DMA_get( temp, &(B[1-MYID]), 8 * sizeof( int ) );

/* Simple Loop over all Array Elements for DSP No. MYID */
for (i = 0; i < 8; i++)
  D0[i] = C0[i] + A0[i] * temp[7-i];

Burst load of the local buffer from remote memory via DMA.
Array accesses in the loop only use local memory.
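
Functionally, the DMA_get call fills the local buffer temp with the 8 remote elements reached through the descriptor B[1-MYID]. Ignoring the burst behaviour on the bus, its effect corresponds roughly to a plain copy; the line below is an illustrative equivalence only, not the target's actual DMA routine:

#include <string.h>

/* Illustrative equivalent of the DMA_get above: copy the remote sub-array
   (reached via the pointer stored in the descriptor slot B[1-MYID]) into the
   local buffer. A real DMA transfer performs this as one background burst
   instead of a CPU copy. */
memcpy(temp, B[1-MYID], 8 * sizeof(int));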

Benchmarking of Multi-DSP Parallelization

Multi-DSP Hardware:
- 4 parallel AD TigerSHARC TS-101 DSPs
- 768 kB local SRAM per DSP, 128 MB external DRAM

Benchmarks for Auto-Parallelization:
- DSPstone: small DSP kernel codes, low code complexity
- UTDSP: entire complex applications, compute-intensive code

Results: execution times
- of fully sequential code running on 1 DSP
- of code after Program Recovery
- of code after Data Partitioning and Mapping
- of code after Locality Improvement and DMA Transfers

Results – DSPstone
(Chart of execution times / speedups for the DSPstone benchmarks; average values are given in the discussion below.)

Results – UTDSP
(Chart of execution times / speedups for the UTDSP benchmarks; average values are given in the discussion below.)

Discussion of Results

Average total speedups:
- DSPstone: factor 4.28
- UTDSP: factor 3.65
- All benchmarks: factor 3.78

Surprising at first sight: How can a speedup of more than a factor of 4 be achieved for DSPstone if the parallelization targets only 4 DSPs?

Reasons for Super-Linear Speedups > 4

Parallelized code possibly offers more optimization potential for subsequent compiler optimizations than the original sequential code.

Example:
- Sequential i-loop of the running example: 16 iterations.
- i-loop parallelized for 2 DSPs: 8 iterations.
⇒ The parallelized loops are possible candidates for loop unrolling:

for (i = 0; i < 8; i++)
  ...        /* loop body replicated 8 times when fully unrolled */

- Fully unrolled loop without any branches!
- No delay slots; branch prediction cannot mispredict.
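
For illustration, fully unrolling the 8-iteration DSP-0 loop of the running example (the DMA-optimized version shown earlier) yields pure straight-line code:

/* Fully unrolled loop body: no loop counter, no branch at all. */
D0[0] = C0[0] + A0[0] * temp[7];
D0[1] = C0[1] + A0[1] * temp[6];
D0[2] = C0[2] + A0[2] * temp[5];
D0[3] = C0[3] + A0[3] * temp[4];
D0[4] = C0[4] + A0[4] * temp[3];
D0[5] = C0[5] + A0[5] * temp[2];
D0[6] = C0[6] + A0[6] * temp[1];
D0[7] = C0[7] + A0[7] * temp[0];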

References

Compiler structure and Intermediate Representations:
- Steven S. Muchnick: Advanced Compiler Design & Implementation. Morgan Kaufmann, 1997.

Parallelization for homogeneous Multi-DSPs:
- B. Franke, M. O’Boyle: A Complete Compiler Approach to Auto-Parallelizing C Programs for Multi-DSP Systems. IEEE Transactions on Parallel and Distributed Systems, Vol. 16, No. 3, March 2005.

Summary

Intermediate Representations:
- Representation of code at different abstraction levels.
- HIR and LIR examples.

HIR Optimizations:
- Restructuring of loops.
- Restructuring of functions and their calling relations.

Parallelization for homogeneous Multi-DSPs:
- Focuses on the exploitation of local memories & address spaces.
- Speedups essentially linear in the number of available parallel DSPs.

Questions (if on schedule)?