Evaluation of Offset Assignment Heuristics
Johnny Huynh, Jose Nelson Amaral, Paul Berube (University of Alberta, Canada)
Sid-Ahmed-Ali Touati (Universite de Versailles, France)

Outline
- Background
- Traditional Approach to Offset Assignment
  - Simple Offset Assignment
  - Address-Register Assignment
- Improving the Problem Model
  - Optimal Address-Code Generation
  - Memory Layout Permutations
- Evaluating Current Heuristics
  - Methodology
  - Results
- Conclusions and Future Work

Background
- Digital Signal Processors (DSPs) have few general-purpose registers.
- Program variables are kept in memory.
- Address Registers (ARs) are used to access variables.
- After a variable is accessed, the AR can be auto-incremented (or auto-decremented) by one word in the same cycle.

Processor Model
Texas Instruments TMS320C54X DSP family:
- Accumulator-based DSP
- 8 address registers
- Initializing an address register requires 2 cycles of overhead.
- Explicit address computations require 1 cycle of overhead.
- Using auto-increment (or auto-decrement) has no overhead.

Processor Model
Example: add ‘A’ and ‘B’, store the result in the accumulator.

Explicit address computation (memory layout: A at 0x1000, C at 0x1001, B at 0x1002):
    $AR0 = &A
    $ACC = *$AR0
    $AR0 = $AR0 + 2
    $ACC += *$AR0

Auto-increment (memory layout: A at 0x1000, B at 0x1001, C at 0x1002):
    $AR0 = &A
    $ACC = *$AR0++
    $ACC += *$AR0

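The cycle counts of the processor model can be tallied for the two code sequences in this example. This is a minimal sketch, assuming that overhead is counted only for address-register initialization (2 cycles) and explicit address arithmetic (1 cycle); the operation labels are hypothetical names chosen for illustration, not TMS320C54X mnemonics.

```python
# Overhead cycles per addressing operation, taken from the processor model.
# The keys are hypothetical labels, not real instruction mnemonics.
OVERHEAD = {"init": 2, "explicit": 1, "auto": 0}

def addressing_overhead(ops):
    """Sum the overhead cycles of a list of addressing operations."""
    return sum(OVERHEAD[op] for op in ops)

# Layout A, C, B: initialize AR0, then skip over C with AR0 = AR0 + 2.
explicit_version = ["init", "explicit"]
# Layout A, B, C: initialize AR0, then post-increment onto B for free.
auto_version = ["init", "auto"]

print(addressing_overhead(explicit_version))  # 3
print(addressing_overhead(auto_version))      # 2
```

Under these assumptions the only difference between the two versions is the single cycle spent on the explicit address computation, which is exactly what a good memory layout tries to avoid.
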
The Offset-Assignment Problem
Given k address registers and a basic block accessing n variables, find a memory layout that minimizes address-computation overhead.
- How should the variables be placed in memory?
- Which register should access each variable?

Traditional Approach to Offset Assignment
Basic Block -> Generate Access Sequence -> Access Sequence -> Address Register Assignment -> Sub-Sequences -> Simple Offset Assignment (one instance per sub-sequence) -> Sub-Layouts -> Address-Code Generation -> Address-Computation Overhead

Traditional Approach: Simple Offset Assignment (SOA)
- In 1992, Bartley introduced the simplest form of the offset assignment problem: given a single address register and a basic block with n variables, find a memory layout that minimizes overhead.
- Equivalent to finding a maximum-weight path cover (NP-complete).
- Many researchers have proposed heuristics for this problem:
  - Liao et al. (1996)
  - Leupers and Marwedel (1996)
  - Sugino et al. (1996)

Simple Offset Assignment (SOA)
- Fix the access sequence.
- Assume only one address register (k = 1).
- Find an ordering of variables in memory (memory layout) that has minimum overhead.
Ex. Access Sequence: ‘a d b e c f b e c f a d’
Memory Layout: to be determined
[Figure: the variables A, B, C, D, E, F]

Simple Offset Assignment (SOA)
- Create the access graph G = (V, E):
  - V = variables
  - The weight of an edge is the frequency of consecutive accesses to its two variables.
- A path defines a memory layout, so find the maximum-weight path cover. NP-complete!
Ex. Access Sequence: ‘a d b e c f b e c f a d’
Memory Layout: d a f c e b
[Figure: access graph over the variables A, B, C, D, E, F]

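The access graph and the cost of a layout can be sketched in a few lines. This is an illustrative sketch, not any of the published heuristics: it assumes one address register, counts one unit of cost for every consecutive pair of accesses whose variables are not adjacent in memory, ignores register initialization, and finds the optimum by exhaustive search, which is only feasible for a handful of variables. The function names are hypothetical.

```python
from collections import Counter
from itertools import permutations

def access_graph(sequence):
    """Edge weights: how often two distinct variables are accessed consecutively."""
    edges = Counter()
    for u, v in zip(sequence, sequence[1:]):
        if u != v:
            edges[frozenset((u, v))] += 1
    return edges

def soa_cost(sequence, layout):
    """Explicit address computations needed with a single address register."""
    pos = {var: i for i, var in enumerate(layout)}
    return sum(1 for u, v in zip(sequence, sequence[1:])
               if abs(pos[u] - pos[v]) > 1)

def brute_force_soa(sequence):
    """Exhaustive search over all layouts; exponential, for tiny inputs only."""
    variables = sorted(set(sequence))
    return min(permutations(variables), key=lambda p: soa_cost(sequence, p))

seq = "adbecfbecfad"
graph = access_graph(seq)
print(graph[frozenset("ad")])               # 2: a and d are adjacent twice
print(soa_cost(seq, "dafceb"))              # 2: cost of the layout d a f c e b
print(soa_cost(seq, brute_force_soa(seq)))  # 2: no layout does better
```

The total edge weight is 11 and the path d-a-f-c-e-b covers weight 9, so two transitions remain uncovered; the exhaustive search confirms that no layout has a lower cost under this model.
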
Traditional Approach: General Offset Assignment (GOA)
- Problem presented by Liao et al.
- Given k address registers and a basic block with n variables, find an assignment of variables to address registers that minimizes the total overhead of all registers.
- This problem formulation is more accurately described as Address-Register Assignment (ARA).
- Consists of SOA problems, and is at least NP-hard.
- Many researchers have proposed heuristics for address-register assignment:
  - Leupers and Marwedel (1996)
  - Sugino et al. (1996)
  - Zhuang et al. (2003)

General Offset Assignment (GOA)
- Fix the access sequence.
- Allow multiple address registers (k > 1).
- Find an ordering of variables in memory (memory layout) that has minimum overhead.
- Assign each variable to an address register to form access sub-sequences.
Ex. Access Sequence: ‘a d b e c f b e c f a d’
Sub-sequence 1: ‘a b c b c a’
Sub-sequence 2: ‘d e f e f d’
[Figure: access graph over the variables A, B, C, D, E, F]

General Offset Assignment (GOA)
Ex. Access Sequence: ‘a d b e c f b e c f a d’
Sub-sequence 1: ‘a b c b c a’
Sub-sequence 2: ‘d e f e f d’
Memory Layouts: [a, b, c] and [d, e, f]
- Each sub-sequence can be viewed, and solved, as an independent SOA problem.
- It is more appropriate to call this problem the Address Register Assignment (ARA) problem.
- It requires solving SOA instances, so it is at least NP-hard.
[Figure: access graph over the variables A, B, C, D, E, F, split between the two sub-sequences]

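Forming the sub-sequences from an address-register assignment is a simple projection of the access sequence. The sketch below assumes the assignment from this example (a, b, c on one register and d, e, f on another); the register names and the function name are hypothetical.

```python
def split_access_sequence(sequence, assignment):
    """Project the access sequence onto each address register."""
    sub_sequences = {}
    for var in sequence:
        sub_sequences.setdefault(assignment[var], []).append(var)
    return {reg: "".join(subseq) for reg, subseq in sub_sequences.items()}

# Assignment of variables to address registers, as in the example.
assignment = {"a": "AR0", "b": "AR0", "c": "AR0",
              "d": "AR1", "e": "AR1", "f": "AR1"}

subs = split_access_sequence("adbecfbecfad", assignment)
print(subs)  # {'AR0': 'abcbca', 'AR1': 'defefd'}
```

Each resulting sub-sequence can then be handed to an SOA solver on its own, which is what the traditional approach does.
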
Address-Code Generation
- Recall that variables are assigned to address registers.
- There is nothing left to decide: each address register has a defined sequence of accesses.
- This imposes the restriction that all accesses to a variable are made by a single address register.
Ex. Access Sequence: ‘a d b e c f b e c f a d’
Memory Layouts: [a, b, c] accessed by AR0, [d, e, f] accessed by AR1
Note: this example still requires explicit address computations.
[Figure: animation stepping AR0 and AR1 through the access sequence]

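Because every variable is bound to one register, the address code follows mechanically from the layout and the assignment. The sketch below emits pseudo-instructions for this example; the textual form of the instructions (for instance `access *AR0++`) is a hypothetical notation, and the cost counted is only the number of explicit address computations.

```python
def address_code(sequence, layout, assignment):
    """Emit pseudo address code when each variable is bound to one register."""
    pos = {var: i for i, var in enumerate(layout)}
    # For each access, find the next access made by the same register.
    next_same_reg = [None] * len(sequence)
    last_seen = {}
    for i, var in enumerate(sequence):
        reg = assignment[var]
        if reg in last_seen:
            next_same_reg[last_seen[reg]] = i
        last_seen[reg] = i

    code, explicit, initialized = [], 0, set()
    for i, var in enumerate(sequence):
        reg = assignment[var]
        if reg not in initialized:
            code.append(f"{reg} = &{var}")
            initialized.add(reg)
        nxt = next_same_reg[i]
        delta = 0 if nxt is None else pos[sequence[nxt]] - pos[var]
        if delta == 1:
            code.append(f"access *{reg}++")   # free post-increment
        elif delta == -1:
            code.append(f"access *{reg}--")   # free post-decrement
        else:
            code.append(f"access *{reg}")
            if delta != 0:
                code.append(f"{reg} = {reg} + ({delta})")  # explicit computation
                explicit += 1
    return code, explicit

assignment = {"a": "AR0", "b": "AR0", "c": "AR0",
              "d": "AR1", "e": "AR1", "f": "AR1"}
code, explicit = address_code("adbecfbecfad", "abcdef", assignment)
print("\n".join(code))
print(explicit)  # 2: one jump from c back to a, one from f back to d
```

The two explicit computations arise because each register must step two words back at the end of its sub-sequence, which neither auto-increment nor auto-decrement can cover.
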
Traditional Approach to Offset Assignment
Access sequence: ‘a d b e c f b e c f a d’
Address Register Assignment splits the access sequence into two sub-sequences, and Simple Offset Assignment then produces one sub-layout for each:
- AR0: sub-sequence ‘a b c b c a’, memory layout [a, b, c]
- AR1: sub-sequence ‘d e f e f d’, memory layout [d, e, f]

Optimal Address-Code Generation
Given a fixed access sequence and memory layout, it is possible to generate optimal addressing code in polynomial time:
- Minimum-Cost Circulation (Gebotys, 1997)
- Minimum-Weight Perfect Matching (Udayanarayanan, 2000)

Optimal Address-Code Generation
- Build a network-flow graph.
- Vertices represent variable accesses.
- For each access a_i that occurs before another access a_j, there is an edge (a_i, a_j) (not all are shown in the graph).
- Edges represent an opportunity for a register to access variables.
- Each unit of flow represents the accesses performed by one address register.
- Optimal address code is found by finding a minimum-cost circulation.

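For small examples the same optimum can be checked without building the flow network. The sketch below is not the minimum-cost circulation formulation; it is an exhaustive dynamic program over register positions, swapped in because it is short and self-contained. It assumes the cost model of the processor model (2 cycles to initialize a register, 1 cycle per explicit address computation, free moves of at most one word) and at least one register; the function name and parameters are hypothetical.

```python
from functools import lru_cache

def optimal_overhead(sequence, layout, k, init_cost=2, explicit_cost=1):
    """Minimum addressing overhead for a fixed sequence and layout, k >= 1."""
    pos = {var: i for i, var in enumerate(layout)}
    targets = [pos[var] for var in sequence]

    @lru_cache(maxsize=None)
    def best(i, regs):
        # regs: sorted tuple of addresses held by the initialized registers
        if i == len(targets):
            return 0
        t = targets[i]
        options = []
        # Reuse an initialized register: free if it is at most one word away.
        for j, addr in enumerate(regs):
            step = 0 if abs(addr - t) <= 1 else explicit_cost
            moved = tuple(sorted(regs[:j] + (t,) + regs[j + 1:]))
            options.append(step + best(i + 1, moved))
        # Or initialize a fresh register, if one is still available.
        if len(regs) < k:
            options.append(init_cost + best(i + 1, tuple(sorted(regs + (t,)))))
        return min(options)

    return best(0, ())

seq = "adbecfbecfad"
print(optimal_overhead("acac", "abc", 1))  # 5: one register, three jumps
print(optimal_overhead("acac", "abc", 2))  # 4: two registers, no jumps
print(optimal_overhead(seq, "bcadef", 2))  # 4
print(optimal_overhead(seq, "abcdef", 2))
```

Unlike the traditional address-code generation, this search lets any register perform any access, which is the freedom the flow formulation exploits.
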
Traditional Approach to Offset Assignment
Access Sequence -> Address Register Assignment -> Sub-Sequences -> Simple Offset Assignment (one instance per sub-sequence) -> Sub-Layouts -> Address-Code Generation -> Address-Computation Overhead
- Address Register Assignment: NP-hard
- Simple Offset Assignment: NP-complete
- Address-Code Generation: solved, but not used!

Memory Layout Permutations (MLP)
- Since optimal address-code generation algorithms exist, they can be applied after a memory layout is formed (by traditional approaches).
- However, the traditional approach generates multiple sub-layouts that were originally assumed to be independent.
- How is a single memory layout formed from a set of sub-layouts?

Memory Layout Permutations
- Let M_i be a memory sub-layout.
- Let M_i^r be the reciprocal of M_i, that is, M_i in reverse order.
- Given an access sequence and m memory sub-layouts, arrange {(M_1 | M_1^r), …, (M_m | M_m^r)} so that overhead is minimized when the sub-layouts are placed contiguously in memory.

Memory Layout Permutations
Example: access sequence ‘a d b e c f b e c f a d’
- Address Register Assignment: sub-sequences ‘a b c b c a’ and ‘d e f e f d’ (this is an optimal address register assignment).
- Simple Offset Assignment: sub-layouts {a, b, c} and {d, e, f} (these are optimal simple offset assignments).
- Memory Layout Permutations: all possible permutations (all have cost > 4):
    [a, b, c, d, e, f], [f, e, d, c, b, a]
    [c, b, a, d, e, f], [f, e, d, a, b, c]
    [a, b, c, f, e, d], [d, e, f, c, b, a]
    [c, b, a, f, e, d], [d, e, f, a, b, c]
- The optimal layout {b, c, a, d, e, f}, with cost = 4, is not found.

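The set of candidate layouts in this example can be enumerated directly: every ordering of the sub-layouts, with each sub-layout placed either forwards or reversed. A minimal sketch, with a hypothetical function name, assuming sub-layouts are given as strings:

```python
from itertools import permutations, product

def memory_layout_permutations(sub_layouts):
    """All contiguous layouts built from the sub-layouts or their reversals."""
    layouts = set()
    for order in permutations(sub_layouts):
        for flips in product((False, True), repeat=len(order)):
            parts = [sub[::-1] if flip else sub
                     for sub, flip in zip(order, flips)]
            layouts.add("".join(parts))
    return layouts

layouts = memory_layout_permutations(["abc", "def"])
print(sorted(layouts))
print(len(layouts))         # 8
print("bcadef" in layouts)  # False: the optimal layout is never generated
```

For m sub-layouts this produces at most m! * 2^m layouts, and each layout appears together with its mirror image, which is why the eight layouts of the example form four pairs.
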
Experimental Methodology: Evaluating the Solution Space
- Test cases are DSP code kernels from the UTDSP benchmark suite.
- gcc is used to obtain access sequences.
- The quality of a memory layout is evaluated using the minimum-cost circulation technique.
- The entire solution space is found for each access sequence, to be used as a point of reference.
Flow: Basic Block -> Compile with gcc -> Access Sequence -> Compute Overhead of All Layouts using Minimum-Cost Flow

Kernel            Accesses   Variables   Possible # of layouts
iir_arr           21         8           20,160
iir_arr_swp       …          …           …,500,800
latnrm_arr_swp    30         10          1,824,400
latnrm_ptr        30         10          1,824,400
latnrm_ptr_swp    30         10          1,824,400

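Enumerating the whole solution space can be sketched as a walk over all permutations of the variables. The sketch assumes that a layout and its mirror image have the same overhead (auto-increment and auto-decrement are both free), so only one layout of each mirrored pair is kept; under that assumption 8 variables give 8!/2 = 20,160 layouts, which agrees with the iir_arr row. The function name is hypothetical.

```python
from itertools import permutations

def distinct_layouts(variables):
    """Yield every memory layout once, treating mirror images as equal."""
    for layout in permutations(variables):
        # Keep only the lexicographically smaller layout of each mirrored pair.
        if layout <= layout[::-1]:
            yield layout

print(sum(1 for _ in distinct_layouts("abcd")))      # 12
print(sum(1 for _ in distinct_layouts("abcdefgh")))  # 20160
```

Each layout produced this way would then be scored with the overhead evaluator to build the reference distribution.
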
Experimental Methodology: Evaluating Current Heuristics
- Identified and implemented three Address-Register Assignment (ARA) heuristic algorithms: Leupers, Sugino, Zhuang.
- Identified and implemented five Simple Offset Assignment (SOA) heuristic algorithms: Liao, Leupers, ALOMA, Order-First Use (OFU), Branch and Bound (B&B).
- Each combination of ARA and SOA algorithm generates a set of sub-layouts.
- All possible memory layout permutations are generated, forming a set of memory layouts.
- Each memory layout is evaluated using the Minimum-Cost Circulation technique.
Flow: Access Sequence -> ARA (Leupers, Sugino, Zhuang) -> Sub-Sequences -> SOA (Liao, Leupers, ALOMA, OFU, B&B) -> Sub-Layouts -> Memory Layout Permutations -> Memory Layouts -> Compute Overhead for each layout via Minimum-Cost Circulation -> Distribution of Overhead values

Results
- The 15 combinations of algorithms produce 15 distributions of overhead values.
- The distributions are aggregated into one distribution.
- The aggregate distribution represents the solution space of all current algorithms.

Results
- Memory layouts have a significant impact on overhead.
- Some layouts have 100% higher overhead than the minimum.
- Over 99% of all layouts have an overhead that is 50% higher than the minimum.

Results
- Memory layouts produced by traditional approaches have a large range of possible overhead values, sometimes the same as that of the entire solution space.
- In some cases, no combination of ARA and SOA heuristics can produce an optimal layout.

Distribution of Overhead Values
Testcase: iir_arr_swp -- infinite impulse response filter
[Table: overhead (cycles) for the exhaustive and algorithmic solution spaces, with average overhead]

Exhaustive Solution Space Testcase: iir_arr_swp -- infinite impulse response filter
Algorithmic Solution Space Testcase: iir_arr_swp -- infinite impulse response filter
Efficiency of SOA Algorithms
- For each SOA algorithm, combine it with each of the 3 ARA algorithms to generate 3 distributions of overhead values.
- The distributions can be aggregated to form a single distribution.
[Figure: the evaluation pipeline from access sequence to distribution of overhead values, with the ARA stage (Leupers, Sugino, Zhuang) and the SOA stage (Liao, Leupers, ALOMA, OFU, B&B)]

Efficiency of SOA Algorithms
Testcase: iir_arr_swp -- infinite impulse response filter
[Table: overhead (cycles) for Liao, Leupers, Sugino, B&B, OFU]
[Figure: frequency versus overhead (cycles) for Liao, Leupers, Sugino, B&B, OFU]

Evaluating SOA Algorithms
Testcase: latnrm_ptr -- normalized lattice filter
[Figure: frequency versus overhead (cycles) for Liao, Leupers, Sugino, B&B, OFU]

Efficiency of ARA Algorithms
- For each ARA algorithm, combine it with each of the 5 SOA algorithms to generate 5 distributions of overhead values.
- The distributions can be aggregated to form a single distribution.
[Figure: the evaluation pipeline from access sequence to distribution of overhead values, with the ARA stage (Leupers, Sugino, Zhuang) and the SOA stage (Liao, Leupers, ALOMA, OFU, B&B)]

Efficiency of ARA Algorithms
Testcase: iir_arr_swp -- infinite impulse response filter
[Table: overhead (cycles) for Leupers, Sugino, Zhuang]
[Figure: frequency versus overhead (cycles) for Leupers, Sugino, Zhuang]

Evaluating ARA Algorithms
Testcase: latnrm_ptr -- normalized lattice filter
[Figure: frequency versus overhead (cycles) for Leupers, Sugino, Zhuang]

Evaluating Offset Assignment Algorithms
- There is low variability between SOA algorithms, which may be attributed to small problem sizes.
- The choice of ARA algorithm has more impact on overhead. Much of the variability is attributed to the different number of address registers used.
- For all combinations of SOA and ARA algorithms, the permutation of sub-layouts affects the overhead.

Conclusions
- The objective is to minimize address-computation overhead.
- Given a fixed access sequence and memory layout, the minimum-cost circulation (MCC) technique can minimize overhead.
- Offset assignment algorithms should be evaluated with MCC.
- Offset assignment still has a significant impact on overhead.
- To be effective, current offset assignment algorithms (ARA, SOA) must address the Memory Layout Permutation problem.

Future Work
- A new algorithm is needed to generate memory layouts that will minimize overhead as computed by the Minimum-Cost Flow technique.
- Address-computation overhead must be minimized for loop bodies and for variables that are live between basic blocks and procedures.

References
- Gebotys, C.: DSP address optimization using a minimum cost circulation technique. Proceedings of the 1997 IEEE/ACM International Conference on Computer-Aided Design
- Leupers, R., Marwedel, P.: Algorithms for address assignment in DSP code generation. Proceedings of the 1996 IEEE/ACM International Conference on Computer-Aided Design
- Liao, S., Devadas, S., Keutzer, K., Tjiang, S., Wang, A.: Storage assignment to decrease code size. ACM Transactions on Programming Languages and Systems 18(3) (1996)
- Sugino, N., Iimuro, S., Nishihara, A., Fujii, N.: DSP code optimization utilizing memory addressing operation. IEICE Transactions on Fundamentals 8 (1996)
- Zhuang, X., Lau, C., Pande, S.: Storage assignment optimizations through variable coalescence for embedded processors. Proceedings of the 2003 ACM SIGPLAN Conference on Languages, Compilers, and Tools for Embedded Systems
- Bartley, D.H.: Optimizing stack frame accesses for processors with restricted addressing modes. Software – Practice & Experience 22(2) (1992)

Questions?