Evaluation of Offset Assignment Heuristics Johnny Huynh, Jose Nelson Amaral, Paul Berube University of Alberta, Canada Sid-Ahmed-Ali Touati Universite.


Evaluation of Offset Assignment Heuristics Johnny Huynh, Jose Nelson Amaral, Paul Berube University of Alberta, Canada Sid-Ahmed-Ali Touati Universite de Versailles, France

Outline
Background
Traditional Approach to Offset Assignment: Simple Offset Assignment; Address-Register Assignment
Improving the Problem Model: Optimal Address-Code Generation; Memory Layout Permutations
Evaluating Current Heuristics: Methodology; Results
Conclusions and Future Work

Background Digital Signal Processors (DSPs) have few general purpose registers Program variables kept in memory Address Registers (AR) used to access variables After a variable is accessed, the AR can be auto-incremented (or decremented) by one word in the same cycle.

Processor Model Texas Instruments TMS320C54X DSP family: Accumulator-based DSP 8 Address Registers Initializing an address register requires 2 cycles of overhead Explicit address computations require 1 cycle of overhead Using auto-increment (or auto-decrement) has no overhead.

Processor Model Example: add ‘A’ and ‘B’, store in accumulator
Explicit address computation (memory layout A, C, B at 0x1000, 0x1001, 0x1002):
$AR0 = &A
$ACC = *$AR0
$AR0 = $AR0 + 2
$ACC += *$AR0
Auto-increment (memory layout A, B, C at 0x1000, 0x1001, 0x1002):
$AR0 = &A
$ACC = *$AR0++
$ACC += *$AR0
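The cycle counts from the Processor Model slide can be turned into a small cost function. The sketch below is only an illustration under stated assumptions: a single address register, 2 cycles to initialize it, 1 cycle per explicit address computation, and a free auto-increment or auto-decrement whenever the next variable is at the same or an adjacent word. The name `single_ar_overhead` is hypothetical.

```python
INIT_COST = 2      # cycles to initialize an address register (Processor Model slide)
EXPLICIT_COST = 1  # cycles for one explicit address computation


def single_ar_overhead(accesses, layout):
    """Overhead cycles when one address register walks `accesses` over `layout`."""
    position = {var: i for i, var in enumerate(layout)}
    overhead = INIT_COST  # the register must first be pointed at the first variable
    for prev, cur in zip(accesses, accesses[1:]):
        # A distance of 0 or 1 word is covered by auto-increment/decrement for free.
        if abs(position[cur] - position[prev]) > 1:
            overhead += EXPLICIT_COST
    return overhead


print(single_ar_overhead("AB", "ACB"))  # layout A, C, B: B is two words away from A
print(single_ar_overhead("AB", "ABC"))  # layout A, B, C: B is adjacent to A
```

Under these assumptions the first layout pays for one explicit computation on top of the initialization, while the second pays for the initialization only.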

The Offset-Assignment Problem Given k address registers and a basic block accessing n variables, find a memory layout that minimizes address-computation overhead. How should the variables be placed in memory? Which register should access each variable?

Traditional Approach to Offset Assignment
[Flow diagram: Basic Block -> Generate Access Sequence -> Access Sequence -> Address Register Assignment -> Sub-Sequences -> Simple Offset Assignment (one per sub-sequence) -> Sub-Layouts -> Address-Code Generation -> Address-Computation Overhead]

Traditional Approach: Simple Offset Assignment (SOA) In 1992, Bartley introduced the simplest form of the offset assignment problem: given a single address register and a basic block with n variables, find a memory layout that minimizes overhead. This is equivalent to finding a maximum-weight path cover (NP-complete). Many researchers have proposed heuristics for this problem: Liao et al. (1996), Leupers and Marwedel (1996), Sugino et al. (1996).

Simple Offset Assignment (SOA) Fix the access sequence. Assume only one address register (k = 1). Find an ordering of variables in memory (memory layout) that has minimum overhead. Ex. Access Sequence: ‘a d b e c f b e c f a d’

Simple Offset Assignment (SOA) Create an Access Graph G = (V, E), where V is the set of variables and the weight of an edge is the frequency of consecutive accesses to its two variables. A path defines a memory layout, so the task is to find the Maximum Weight Path Cover, which is NP-complete!
Ex. Access Sequence: ‘a d b e c f b e c f a d’
Memory Layout: d a f c e b
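The access-graph construction can be sketched in a few lines. The code below builds the edge weights for the example sequence and then picks edges greedily (heaviest first, keeping every vertex at degree two or less and refusing cycles). This greedy rule is in the spirit of the SOA heuristics cited in these slides but is not claimed to reproduce any one of them; the function names and the tie-breaking order are assumptions made for illustration.

```python
from collections import Counter


def access_graph(sequence):
    """Edge weights: how often two distinct variables are accessed consecutively."""
    weights = Counter()
    for u, v in zip(sequence, sequence[1:]):
        if u != v:
            weights[frozenset((u, v))] += 1
    return weights


def greedy_path_cover(weights):
    """Pick heavy edges first, keeping vertex degree <= 2 and avoiding cycles."""
    degree = Counter()
    parent = {}  # union-find parent pointers

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    chosen = []
    # Ties are broken by sorted vertex names; this ordering is an arbitrary assumption.
    for edge in sorted(weights, key=lambda e: (-weights[e], sorted(e))):
        u, v = sorted(edge)
        if degree[u] < 2 and degree[v] < 2 and find(u) != find(v):
            parent[find(u)] = find(v)
            degree[u] += 1
            degree[v] += 1
            chosen.append((u, v))
    return chosen


seq = "adbecfbecfad"
graph = access_graph(seq)
cover = greedy_path_cover(graph)
covered = sum(graph[frozenset(e)] for e in cover)
uncovered = sum(graph.values()) - covered  # transitions left without a free step
print(cover, covered, uncovered)
```

For the example sequence the chosen edges form the single path d, a, f, c, e, b, which is the layout shown on the slide. When the cover is one path, the uncovered weight equals the number of transitions that need an explicit address computation.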

Traditional Approach: General Offset Assignment (GOA) Problem presented by Liao et al.: given k address registers and a basic block with n variables, find an assignment of variables to address registers that minimizes the total overhead of all registers. This problem formulation is more accurately described as Address-Register Assignment (ARA). It consists of SOA problems, and is at least NP-hard. Many researchers have proposed heuristics for address-register assignment: Leupers and Marwedel (1996), Sugino et al. (1996), Zhuang et al. (2003).

General Offset Assignment (GOA) Fix the access sequence. Allow multiple address registers (k > 1). Find an ordering of variables in memory (memory layout) that has minimum overhead. Assign each variable to an address register to form access sub-sequences.
Ex. Access Sequence: ‘a d b e c f b e c f a d’
Sub-sequence 1: ‘a b c b c a’
Sub-sequence 2: ‘d e f e f d’

General Offset Assignment (GOA)
Ex. Access Sequence: ‘a d b e c f b e c f a d’
Sub-sequence 1: ‘a b c b c a’
Sub-sequence 2: ‘d e f e f d’
Memory Layouts: [a, b, c] and [d, e, f]
Each sub-sequence can be viewed, and solved, as an independent SOA problem. It is more appropriate to call this problem the Address Register Assignment (ARA) problem. It requires solving SOA instances, so it is at least NP-hard.
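How an address-register assignment turns one access sequence into independent sub-sequences can be shown directly. This is a minimal sketch: the function name `split_subsequences` and the dictionary representation of the assignment are assumptions for illustration, and the assignment itself is the one used in the example.

```python
def split_subsequences(sequence, register_of):
    """Project the access sequence onto each register's own variables."""
    subsequences = {}
    for var in sequence:
        subsequences.setdefault(register_of[var], []).append(var)
    return {reg: "".join(vars_) for reg, vars_ in subsequences.items()}


# Assignment from the example: a, b, c use AR0 and d, e, f use AR1.
register_of = {"a": "AR0", "b": "AR0", "c": "AR0",
               "d": "AR1", "e": "AR1", "f": "AR1"}
subs = split_subsequences("adbecfbecfad", register_of)
print(subs)
```

Each resulting string can then be handed to an SOA heuristic on its own, which is exactly the independence assumption that the later slides question.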

Address-Code Generation Recall that variables are assigned to address registers. There is nothing left to decide: each address register has a defined sequence of accesses. This imposes the restriction that all accesses to a variable are done by a single address register.
Ex. Access Sequence: ‘a d b e c f b e c f a d’
Memory Layouts: [a, b, c] accessed by AR0; [d, e, f] accessed by AR1
* Requires Explicit Address Computations

Traditional Approach to Offset Assignment
Access sequence: ‘a d b e c f b e c f a d’
Address Register Assignment: ‘a b c b c a’ and ‘d e f e f d’
Simple Offset Assignment (applied to each sub-sequence): [a, b, c] and [d, e, f]
Sub-sequence and memory layout accessed by AR0: ‘a b c b c a’, [a, b, c]
Sub-sequence and memory layout accessed by AR1: ‘d e f e f d’, [d, e, f]

Optimal Address-Code Generation Given a fixed access sequence and memory layout, it is possible to generate optimal addressing-code in polynomial time: Minimum-Cost Circulation (Gebotys, 1997) Minimum-Weight Perfect Matching (Udayanarayanan, 2000)

Optimal Address-Code Generation Build a network-flow graph. Vertices represent variable accesses. For each access a_i that occurs before another access a_j, there is an edge (a_i, a_j) (not all are shown in the graph). Edges represent an opportunity for a register to access variables. Each unit of flow represents the accesses performed by one address register. Optimal address code is found by finding a minimum-cost circulation.
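For very small instances, the optimum that the circulation formulation computes can be cross-checked by exhaustive search. The sketch below is not the minimum-cost circulation algorithm; it is a memoized search over register positions, swapped in because it is short. It assumes the cycle costs from the Processor Model slide (2 cycles per register initialization, 1 per explicit address computation, free steps of at most one word). The name `min_overhead` is hypothetical and k is assumed to be at least 1.

```python
from functools import lru_cache

INIT_COST = 2      # cycles to initialize an address register (assumed from the slides)
EXPLICIT_COST = 1  # cycles for one explicit address computation


def min_overhead(sequence, layout, k):
    """Minimum overhead for `sequence` on `layout` using at most k address registers."""
    position = {var: i for i, var in enumerate(layout)}
    targets = tuple(position[var] for var in sequence)

    @lru_cache(maxsize=None)
    def best(i, regs):
        # regs: sorted tuple with the current position of each initialized register
        if i == len(targets):
            return 0
        t = targets[i]
        options = []
        for j, p in enumerate(regs):
            # Reuse register j: free if the target is at most one word away.
            step = 0 if abs(p - t) <= 1 else EXPLICIT_COST
            moved = tuple(sorted(regs[:j] + (t,) + regs[j + 1:]))
            options.append(step + best(i + 1, moved))
        if len(regs) < k:
            # Bring a fresh register into use, pointed directly at the target.
            options.append(INIT_COST + best(i + 1, tuple(sorted(regs + (t,)))))
        return min(options)

    return best(0, ())


print(min_overhead("adbecfbecfad", "bcadef", 2))
print(min_overhead("adbecfbecfad", "abcdef", 2))
```

Note that nothing here ties a variable to one register: any initialized register may serve any access, which is the freedom the traditional approach gives up. The state space grows quickly with k and the number of variables, so this is only a checking aid.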

Traditional Approach to Offset Assignment
[Flow diagram: Access Sequence -> Address Register Assignment -> Sub-Sequences -> Simple Offset Assignment (one per sub-sequence) -> Sub-Layouts -> Address-Code Generation -> Address-Computation Overhead]
Address Register Assignment: NP-hard
Simple Offset Assignment: NP-complete
Address-Code Generation: solved, but not used!

Memory Layout Permutations (MLP) Since optimal address-code generation algorithms exist, they can be applied after a memory layout is formed (by traditional approaches). However, the traditional approach generates multiple sub-layouts that were originally assumed to be independent. How is a single memory layout formed from a set of sub-layouts?

Memory Layout Permutations Let M_i be a memory sub-layout. Let M_i^r be the reciprocal of M_i, that is, M_i in reverse order. Given an access sequence and m memory sub-layouts, arrange {(M_1 | M_1^r), …, (M_m | M_m^r)} such that overhead is minimum when the sub-layouts are placed contiguously in memory.

Memory Layout Permutations Example:
Access sequence: ‘a d b e c f b e c f a d’
Address Register Assignment: ‘a b c b c a’ and ‘d e f e f d’ (this is an optimal address register assignment)
Simple Offset Assignment: {a, b, c} and {d, e, f} (these are optimal simple offset assignments)
All possible Memory Layout Permutations (all have cost > 4):
[a, b, c, d, e, f], [f, e, d, c, b, a]
[c, b, a, d, e, f], [f, e, d, a, b, c]
[a, b, c, f, e, d], [d, e, f, c, b, a]
[c, b, a, f, e, d], [d, e, f, a, b, c]
Optimal Layout: {b, c, a, d, e, f} with cost = 4 is not found
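The eight candidate layouts in the example come from choosing an order for the sub-layouts and, independently, a direction for each one. A minimal enumeration sketch follows; the function name `layout_permutations` is hypothetical, and duplicates (which arise when a sub-layout reads the same in both directions) are dropped.

```python
from itertools import permutations, product


def layout_permutations(sub_layouts):
    """All contiguous layouts built from the sub-layouts, each forward or reversed."""
    layouts = []
    for order in permutations(sub_layouts):
        for flips in product((False, True), repeat=len(order)):
            layout = []
            for sub, flip in zip(order, flips):
                layout.extend(reversed(sub) if flip else sub)
            layouts.append("".join(layout))
    return list(dict.fromkeys(layouts))  # drop duplicates, keep first-seen order


candidates = layout_permutations(["abc", "def"])
print(len(candidates), candidates)
print("bcadef" in candidates)
```

With m sub-layouts there are at most m! * 2^m candidates. The layout b c a d e f never appears among them, because no reversal or reordering of [a, b, c] places a between c and d.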

Experimental Methodology: Evaluating the Solution Space
Testcases are DSP code kernels from the UTDSP benchmark suite. gcc is used to obtain access sequences. The quality of a memory layout is evaluated using the minimum-cost circulation technique. The entire solution space is found for each access sequence, to be used as a point of reference.
[Flow diagram: Basic Block -> Compile with gcc -> Access Sequence -> Compute Overhead of All Layouts using Minimum-Cost Flow]
Kernel | Accesses | Variables | Possible # of layouts
iir_arr | 21 | 8 | 20,160
iir_arr_swp | … | … | …,500,800
latnrm_arr_swp | 30 | 10 | 1,814,400
latnrm_ptr | 30 | 10 | 1,814,400
latnrm_ptr_swp | 30 | 10 | 1,814,400
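The layout count for the 8-variable kernel (20,160) equals 8!/2. One reading, which is an inference from the table rather than something the slides state, is that a layout and its mirror image are counted once because auto-increment and auto-decrement are symmetric. A quick check under that assumption, with a hypothetical function name:

```python
from math import factorial


def distinct_layouts(n):
    """Orderings of n variables, counting a layout and its mirror image once."""
    return factorial(n) // 2 if n > 1 else 1


print(distinct_layouts(8))
```

The factorial growth is why exhaustive evaluation is limited to small kernels.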

Experimental Methodology: Evaluating Current Heuristics
Identified and implemented three Address-Register Assignment heuristic algorithms: Leupers, Sugino, Zhuang.
[Flow diagram: Access Sequence -> ARA (Leupers, Sugino, Zhuang) -> Sub-Sequences -> SOA (Liao, Leupers, ALOMA, OFU, B&B) -> Sub-Layouts -> Memory Layout Permutations -> Memory Layouts -> Compute Overhead for each layout via Minimum-Cost Circulation -> Distribution of Overhead values]

Experimental Methodology: Evaluating Current Heuristics
Identified and implemented five Simple Offset Assignment heuristic algorithms: Liao, Leupers, ALOMA, Order-First Use (OFU), Branch and Bound (B&B).

Experimental Methodology: Evaluating Current Heuristics
Each combination of ARA and SOA algorithm generates a set of sub-layouts. All possible memory layout permutations are generated, forming a set of memory layouts. Each memory layout is evaluated using the Minimum-Cost Circulation technique.

Results The 15 combinations of algorithms (3 ARA x 5 SOA) produce 15 distributions of overhead values. The distributions are aggregated into one distribution. The aggregate distribution represents the solution space of all current algorithms.

Results Memory layouts have a significant impact on overhead. Some layouts have 100% higher overhead than the minimum. Over 99% of all layouts have an overhead that is 50% higher than the minimum.

Results Memory layouts produced by traditional approaches have a large range of possible overhead values -- sometimes the same as the entire solution space itself. In some cases, no combination of ARA and SOA heuristics can produce an optimal layout.

Distribution of Overhead Values Testcase: iir_arr_swp -- infinite impulse response filter
[Table: overhead (cycles) with Exhaustive and Algorithmic columns, plus average overhead; values missing from this transcript]

Exhaustive Solution Space Testcase: iir_arr_swp -- infinite impulse response filter

Algorithmic Solution Space Testcase: iir_arr_swp -- infinite impulse response filter

Efficiency of SOA Algorithms For each SOA algorithm, combine with each of the 3 ARA algorithms to generate 3 distributions of overhead values. The distributions can be aggregated to form a single distribution.

Efficiency of SOA Algorithms Testcase: iir_arr_swp -- infinite impulse response filter
[Table: overhead (cycles) by SOA algorithm (Liao, Leupers, Sugino, B&B, OFU); values missing from this transcript]

[Plot: frequency vs. overhead (cycles) for Liao, Leupers, Sugino, B&B, OFU]

Evaluating SOA Algorithms Testcase: latnrm_ptr -- normalized lattice filter
[Plot: frequency vs. overhead (cycles) for Liao, Leupers, Sugino, B&B, OFU]

Efficiency of ARA Algorithms For each ARA algorithm, combine with each of the 5 SOA algorithms to generate 5 distributions of overhead values. The distributions can be aggregated to form a single distribution.

Efficiency of ARA Algorithms Testcase: iir_arr_swp -- infinite impulse response filter
[Table: overhead (cycles) by ARA algorithm (Leupers, Sugino, Zhuang); values missing from this transcript]

Efficiency of ARA Algorithms Testcase: iir_arr_swp -- infinite impulse response filter
[Plot: frequency vs. overhead (cycles) for Leupers, Sugino, Zhuang]

Evaluating ARA Algorithms Testcase: latnrm_ptr -- normalized lattice filter
[Plot: frequency vs. overhead (cycles) for Leupers, Sugino, Zhuang]

Evaluating Offset Assignment Algorithms There is low variability between SOA algorithms, which may be attributed to small problem sizes. The choice of ARA algorithm has more impact on overhead. Much of the variability is attributed to the different number of address registers used. For all combinations of SOA and ARA algorithms, the permutation of sub-layouts affects the overhead.

Conclusions The objective is to minimize address-computation overhead. Given a fixed access sequence and memory layout, the minimum-cost circulation (MCC) technique can minimize overhead. Offset assignment algorithms should be evaluated with MCC. Offset assignment still has a significant impact on overhead. To be effective, current offset assignment algorithms (ARA,SOA) must address the Memory Layout Permutation problem.

Future Work A new algorithm is needed to generate memory layouts that will minimize overhead as computed by the Minimum-Cost Flow technique. Address-computation overhead must be minimized for loop bodies and for variables that are live between basic blocks and procedures.

References
Gebotys, C.: DSP address optimization using a minimum cost circulation technique. Proceedings of the 1997 IEEE/ACM International Conference on Computer-Aided Design
Leupers, R., Marwedel, P.: Algorithms for address assignment in DSP code generation. Proceedings of the 1996 IEEE/ACM International Conference on Computer-Aided Design
Liao, S., Devadas, S., Keutzer, K., Tjiang, S., Wang, A.: Storage assignment to decrease code size. ACM Transactions on Programming Languages and Systems 18(3) (1996)
Sugino, N., Iimuro, S., Nishihara, A., Fujii, N.: DSP code optimization utilizing memory addressing operation. IEICE Transactions on Fundamentals 8 (1996)
Zhuang, X., Lau, C., Pande, S.: Storage assignment optimizations through variable coalescence for embedded processors. Proceedings of the 2003 ACM SIGPLAN Conference on Languages, Compilers, and Tools for Embedded Systems
Bartley, D.H.: Optimizing stack frame accesses for processors with restricted addressing modes. Software: Practice & Experience 22(2) (1992)

Questions?