Optimizations for the Multi-Level Computing Architecture
Presented by: Utku Aydonat, Kirk Stewart, Ahmed Abdelkhalek, Ivan Matosevic
Supervisor: Prof. Tarek S. Abdelrahman
Connections 2005, 24 June 2005
Slide 1: The MLCA Architecture
- Processing Units (PUs): coarse-grain computation units (CPU, DSP, ASIC, etc.) that read operands, process, and write results.
- Universal Register File (URF): provides communication and synchronization.
- Control Processor (CP): fetches and decodes "task instructions", performs data-dependence analysis and register renaming.
- Task Scheduler (TS): issues tasks out-of-order.
[Figure: MLCA block diagram with the Control Processor, Task Scheduler, URF, PUs, and Shared Memory]
Slide 2: The MLCA Programming Model
[Figure: an MLCA application consists of a control program and task functions]
Slide 3: The MLCA Programming Model
- Task functions: C functions (but can be hardware blocks, FPGAs, etc.); they execute on the PUs.
- Control program: a sequential program, written in Sarek, for the CP that specifies (sequential) task execution and URF register usage.

  while (...) {
    TaskA (in x, out y);
    TaskB (in y, out z);
    TaskC (in m, out y);
    TaskD (in y, out m);
    TaskE (in z, out x);
  }

- Tasks issue, execute, and complete out-of-order, speculatively.
[Figure: tasks A through E scheduled over time on PUs 1 to 3]
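For concreteness, a minimal sketch of what one task function might look like on the C side. The body is arbitrary, and the idea that scalar arguments map directly to URF registers is an assumption for illustration; the actual binding is defined by the MLCA tool-chain, which these slides do not show.

  /* Hypothetical task function corresponding to TaskA (in x, out y). */
  int TaskA(int x)           /* in x: value read from a URF register */
  {
      int y = 2 * x + 1;     /* placeholder computation standing in for real work */
      return y;              /* out y: value written back to a URF register */
  }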
Slide 4: Presentation Overview
- Compile-time optimizations to extract parallelism (Utku Aydonat)
- Automatic task formation (Kirk Stewart)
- Memory management (Ahmed Abdelkhalek)
- Power optimization using dynamic voltage scaling (Ivan Matosevic)
Slide 5: Scalar Data in URF
- x is scalar data in the URF.
- The CP synchronizes dependent tasks and eliminates false dependences with renaming.

  TaskA(out x);
  TaskB(in x);
  TaskC(out x);
  TaskD(in x);

[Figure: dependence graph among tasks A through D]
Slide 6: Renaming Problem
- The control processor can only rename the pointers stored in the URF, not the data in memory.
- buf is the pointer to a buffer in memory.

  Alloc(out buf);   // Allocates *buf
  TaskA(in buf);    // Writes to *buf
  TaskB(in buf);    // Reads from *buf
  TaskC(in buf);    // Writes to *buf
  TaskD(in buf);    // Reads from *buf

[Figure: dependence graph among tasks A through D]
Slide 7: Buffer Privatization
- False memory dependences are eliminated by separating the accessed buffers.

  TaskA(in buf);    // Writes to *buf
  TaskB(in buf);    // Reads from *buf
  Init(out pri);    // Allocates *pri
  TaskC(in pri);    // Writes to *pri
  TaskD(in pri);    // Reads from *pri
  Finish(in pri);   // Destroys *pri

[Figure: dependence graph among tasks A through D]
Slide 8: Buffer Privatization
- The CP renames the buf argument in each iteration of the loop with a corresponding new buffer in memory.

  while (...) {
    Init (out buf);    // Allocates *buf
    TaskA (in buf);    // Writes data to *buf
    TaskB (in buf);    // Reads data from *buf
    Finish (in buf);   // Destroys *buf
  }

[Figure: successive iterations of tasks A and B overlap once the buffer is renamed]
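As an illustration of the C side of buffer privatization, a sketch of what the Init and Finish task functions might do, assuming the buffer pointer is passed as a plain C pointer and that BUF_SIZE is a placeholder size:

  #include <stdlib.h>

  #define BUF_SIZE 4096                /* assumed buffer size, for illustration only */

  /* Init: allocates a fresh buffer for the current iteration. Because the
   * pointer lives in a URF register, the CP can rename it each iteration,
   * so successive iterations no longer share storage. */
  void Init(char **buf)
  {
      *buf = malloc(BUF_SIZE);
  }

  /* Finish: destroys the per-iteration buffer once its consumers are done. */
  void Finish(char *buf)
  {
      free(buf);
  }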
Slide 9: Parameter De-Aggregation
- Hardware renaming is enabled by exposing the structure fields to the URF.

  TaskA (in struc_x, out struc_x);   // Reads/writes struc->x
  TaskB (in struc_y, out struc_y);   // Reads/writes struc->y
  TaskC (out struc_x);               // Writes struc->x

[Figure: dependence graph among tasks A through C]
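A sketch of the same idea in plain C terms; the struct layout, field names, and function names are assumptions, not the deck's code:

  struct state {                       /* hypothetical structure with fields x and y */
      int x;
      int y;
  };

  /* Before de-aggregation: a single pointer argument hides which fields are
   * actually read or written, so the hardware cannot rename or synchronize them. */
  void TaskA_aggregated(struct state *struc)
  {
      struc->x = struc->x + 1;         /* touches only struc->x */
  }

  /* After de-aggregation: the field itself is the task's input and output,
   * so it can travel through its own URF register and be renamed by the CP. */
  int TaskA_deaggregated(int struc_x)  /* in struc_x, out struc_x */
  {
      return struc_x + 1;
  }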
Slide 10: Summary
- The use of pointers in MLCA programs disables the hardware features of the MLCA, namely synchronization and renaming.
- We designed code transformations that improve the performance of MLCA programs.
- The code transformations produced scalable speedups on real multimedia applications, 2.4x (MAD), 5.3x (FMR), and 3.2x (GSM), while introducing little overhead.
- The analyses required by the code transformations are well-known compiler analyses; their results can be generated by a compiler or supplied directly by the programmer.
Slide 11: Task Formation
- Writing new applications is relatively simple: write C and Sarek simultaneously, adding URF accesses as necessary.
- Porting existing applications is more difficult: the programmer must find all data accesses in a task body, across complex call graphs, pointers, etc.
Slide 12: Task Formation Goals
- Efficiently partition a sequential program into tasks.
- Find a set of tasks that exhibits as much parallelism as possible.
- Correctly express the data flow between tasks.
- Focus on multimedia/streaming applications: regular structure, lots of data parallelism.
Slide 13: Proposed Solution
- Iterative, heuristic algorithm.
- Create a programming tool: a source-to-source transformation that looks to the programmer when analysis fails and presents multiple task partitions.
- Evaluate each task partition via simulation/profiling and static analysis.
Slide 14: Algorithm Highlights
- Split at loop boundaries.
- Split at function calls.
- Impose a minimum task size.
- Exposes parallel/pipelinable loops.
- Evaluation: evaluate with and without "conservative" task inputs/outputs; flag critical dependences for manual examination.
A simplified sketch of the splitting heuristic follows.
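The sketch below is a deliberately simplified view of the splitting heuristic over a flat statement list; the stmt representation, the cost field, and MIN_TASK_SIZE are assumptions, and the real algorithm works on the ORC intermediate representation rather than a toy array.

  #include <stdio.h>

  enum kind { SIMPLE, LOOP_BOUNDARY, CALL };

  struct stmt {
      enum kind kind;
      int cost;                       /* estimated cycles, e.g. from profiling */
  };

  #define MIN_TASK_SIZE 100           /* assumed minimum task size, in cycles */

  /* Walk a flat statement list and emit a task boundary at loop boundaries
   * and call sites, but only once the task built so far is large enough. */
  void form_tasks(const struct stmt *stmts, int n)
  {
      int acc = 0;                    /* cost accumulated in the current task */
      for (int i = 0; i < n; i++) {
          acc += stmts[i].cost;
          int at_split_point = (stmts[i].kind == LOOP_BOUNDARY ||
                                stmts[i].kind == CALL);
          if (at_split_point && acc >= MIN_TASK_SIZE) {
              printf("task boundary after statement %d\n", i);
              acc = 0;                /* start a new task */
          }
      }
  }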
Slide 15: Current State
- The mechanics of task formation are implemented in ORC (the Open Research Compiler).
- This is an early stage in the overall MLCA tool-chain.
- Algorithm refinement.
[Figure: the MLCA compiler, where sequential C code passes through the Task Generator to produce tasks and a control program, which the Optimizer turns into optimized tasks and optimized control]
Slide 16: MLCA Memory Management
- We want an efficient memory system: PUs find data nearby, while memory system area and energy are minimized.
- Many design possibilities: private/shared memory, centralized/distributed memory, DRAM/SRAM technology, caches, ...
- Approach: choose an appropriate design, investigate compiler/run-time solutions for memory management, and evaluate.
Slide 17: Attractive Memory Design
- We want locality in memory accesses.
- The MLCA naturally breaks data down into two types:
  - Intra-task data: created and destroyed by a task each time it executes; not needed by other tasks. Stored in per-PU private memories.
  - Inter-task data: needed by more than one task; identified through the URF. Stored in distributed, shared memory banks.
[Figure: Control Processor, Task Scheduler, and URF above PUs 0 to 3, each PU paired with a private memory and a shared memory bank]
Slide 18: Proposed Solution
- Focus on memory management of inter-task data; intra-task data is handled by the PU cache.
- The task program identifies the global data needed by each task call.
- Associate each task call with a memory bank (PU): indicated by the compiler, influences the run-time task scheduler.
- Place inter-task data near the PU(s) that need it: the compiler inserts move commands.
- Turn off banks that are not needed, to save power: directives inserted by the compiler.
- The compiler uses heuristics or an ILP solution.
Slide 19: Example

  taskA (out r1) on bank1;
  taskB (out r2) on bank2;      // independent task creating data
  taskC (out r3) on bank3;
  move r2, r3, bank1;           // collect all data in bank1
  turn-off bank2, bank3;        // shut down the other banks
  taskD (in r1, in r2, in r3, out r4, out r5) on bank1;
  taskE (in r4, out r4) on bank1;
  turn-on bank2;                // reactivate bank2 for the independent taskF
  move r5, bank2;               // move the required data to bank2
  taskF (in r5, out r5) on bank2;
  ...
Slide 20: Run-Time Optimizations
- Bank renaming: used by the run-time task scheduler to break false dependences on banks.
- Move scheduling: execute moves ahead of time to avoid extra waits, delay move execution to reduce contention on the target bank, and ignore moves that are too expensive.
Slide 21: Dynamic Voltage Scaling
- DVS enables changing the supply voltage and frequency at run-time: a trade-off between performance and power consumption.
- Several discrete voltage/frequency levels; transitions are controlled by software.
- Supported by an increasing number of processors: Intel XScale, IBM PowerPC 405LP, Transmeta Crusoe.
- Problem: apply DVS to the MLCA.
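Note: the leverage behind this trade-off is that dynamic power in CMOS logic scales roughly as P = a * C * V^2 * f, so lowering the supply voltage together with the frequency reduces power much faster than it reduces speed; this is what makes spending slack on lower voltage levels attractive.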
Slide 22: Our Goals
- Exploit slack in the application by applying DVS.
- Voltage selection: what runs at what level?
- Task scheduling: the available slack generally depends on the schedule.
- Previous work is inadequate.
[Figure: example in which T2 is slowed down with no performance penalty when the tasks are executed on two processors]
Slide 23: Solution
- A profile-driven, heuristic compile-time approach focusing on control-program loops without control flow in the loop body. In practice, MLCA multimedia applications follow this pattern.
- Profiling information is used to find task execution times, and the target loop is analyzed for dependences.
- This information is used to compute the voltage selection: identify the critical path and distribute the slack across non-critical tasks (a simplified sketch follows).
- The task scheduling scheme is defined so as to complement the voltage selection algorithm.
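A minimal sketch of the slack-distribution step, assuming a table of discrete frequency levels and a greedy per-task choice; these are illustrative assumptions, and the actual voltage-selection algorithm also accounts for its interaction with task scheduling, as noted above.

  /* Discrete frequency levels relative to the maximum (1.0 = full speed),
   * highest first. Both the table and the selection rule are assumptions. */
  static const double levels[] = { 1.0, 0.8, 0.6, 0.4 };
  #define NUM_LEVELS 4

  /* Pick the lowest frequency level at which a non-critical task still fits
   * within its budget: its own execution time plus the slack assigned to it. */
  int select_level(double task_time, double slack)
  {
      double budget = task_time + slack;
      int chosen = 0;                            /* default: full speed */
      for (int l = 1; l < NUM_LEVELS; l++) {
          double slowed = task_time / levels[l]; /* execution time at the lower frequency */
          if (slowed <= budget)
              chosen = l;                        /* a lower frequency still meets the budget */
      }
      return chosen;
  }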
Slide 24: Evaluation
- We evaluated the approach on three realistic multimedia applications, using the MLCA simulator.
- Processor power savings vs. execution slowdown:
  - JPEG: 9.5% savings, 0.9% slowdown
  - GSM: 5.5% savings, no slowdown
  - MPEG: 8.4% savings, 1.5% slowdown
- Power savings are achieved with very small performance penalties.
- We believe the technique could be generalized to task-level parallelism in multimedia applications.