Compiler Supports and Optimizations for PAC VLIW DSP Processors

Y.-C. Lin, C.-L. Tang, C.-J. Wu, M.-Y. Hung, Y.-P. You, Y.-C. Moo, S.-Y. Chen, and J.-K. Lee
National Tsing-Hua University, Taiwan

Outline
  PAC VLIW DSP Architectures
  Optimization Issues
  Preliminary Compiler Supports
  Experimental Results
  Conclusion
11/23/2018 LCPC2005

Introduction
Parallel Architecture Core (PAC) is designed by the SoC Technology Center, ITRI, Taiwan.
  32-bit, fixed-point, 5-way issue VLIW DSP
  Scalable architecture
  Optimized instruction set for audio/video/image
  Innovative register file structure
  Two generations developed
  TSMC’s 0.13 μm technology (taped out in Aug. 2005)
High-performance, low-power

Key Issues
  Deploy a general-purpose, high-performance open source compiler for DSP processors
    ORC (Open Research Compiler) → PAC DSP
  Address issues for fragmentary register banks of DSP processors
  Methods for irregular register constraints and instruction scheduling

PAC DSP Overview
Five-Way Issue:
  1 Scalar/Control Unit (B)
  2 Arithmetic Units (I)
  2 Load/Store Units (M)
Cluster Design:
  Scalability (extend with more clusters)
  Explicit inter-cluster data transfer instructions
Distributed Register Files:
  5 Local Register Files (A, AC, R)
  2 Global Register Files (D)
Other Features:
  8-bit/16-bit SIMD operations
  Variable instruction word/bundle length
  Dynamic power management
  Standard AMBA interface
[Figure: block diagram. Each cluster contains an M-Unit with A registers and an I-Unit with AC registers, together with D registers; the B-Unit has R registers.]

Ping-pong Register File Structure
  Used by the Global Register File (D)
  Concept: overlap the processing of different data streams in a cluster
  Benefit: fewer register-file ports, for lower power and smaller size
  M-Unit and I-Unit operate on different data streams at the same time (load, compute, store). Hence the name "ping-pong"!
[Figure: M-Unit and I-Unit alternating between load, compute, and store.]

Ping-pong Register Access
Each ‘D’ register file contains 2 banks.
Rules:
  Access by one unit to the 2 banks is mutually exclusive in a cycle.
  M-Unit and I-Unit can only access different banks in a cycle.
Only 1 state for each cycle!
[Figure: an instruction-controlled switcher connects the M-Unit and I-Unit to Bank 1 and Bank 2; the possible switch states are shown.]

Issues for Ping-pong Registers (1)
Example for ping-pong usage:
  Able to form a bundle:
    Lw  D8, A0
    Add D1, D0, AC0
  Unable to form a bundle:
    Lw  D2, A0
    Add D1, D0, AC0
    We need to schedule these into 2 bundles since they use the same bank!
For compiler optimizations:
  Better register (file/bank) allocation → better schedule in fewer bundles
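The bundling example can be checked mechanically. The following is a minimal sketch of the two ping-pong access rules, not the PAC compiler's own logic. It assumes, based only on the example (D0/D1/D2 conflict with each other, D8 does not), that D0-D7 form one bank and D8-D15 the other; the names bank_of, banks_used, and can_bundle are hypothetical.

```python
import re

def bank_of(reg):
    """Return the ping-pong bank (0 or 1) of a D register, or None for other files."""
    m = re.fullmatch(r"D(\d+)", reg)
    if m is None:
        return None  # A, AC, R registers are not ping-pong registers
    # ASSUMPTION: D0-D7 are in one bank, D8-D15 in the other.
    return 0 if int(m.group(1)) < 8 else 1

def banks_used(operands):
    """Set of ping-pong banks touched by one instruction's register operands."""
    return {bank_of(r) for r in operands} - {None}

def can_bundle(m_operands, i_operands):
    """Check the two access rules for an M-Unit and an I-Unit instruction."""
    m_banks = banks_used(m_operands)
    i_banks = banks_used(i_operands)
    # Rule 1: one unit may touch only one bank in a cycle.
    if len(m_banks) > 1 or len(i_banks) > 1:
        return False
    # Rule 2: the two units must touch different banks in a cycle.
    return not (m_banks & i_banks)

print(can_bundle(["D8", "A0"], ["D1", "D0", "AC0"]))  # Lw D8,A0 with Add D1,D0,AC0
print(can_bundle(["D2", "A0"], ["D1", "D0", "AC0"]))  # Lw D2,A0 with Add D1,D0,AC0
```

Under the stated bank assumption, the first pair passes both rules and the second fails the second rule, matching the slide.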

Issues for Ping-pong Registers (2)
Data transfer between ping-pong banks:
  Original code:
    Add D1, D0, AC0
    Lw  D8, A0
    Sub D9, D8, D1      (invalid operation! needs cross ping-pong communication)
    Sw  D1, A0
  With cross ping-pong communication:
    Mov AC1, D1         (additional copy operation needed)
    Sub D9, D8, AC1
    Sw  D1, A0
For compiler optimizations:
  Handle data communication between ping-pong banks correctly within any code manipulation
  Generate as few additional copy operations as possible
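The copy insertion in this example can be sketched as a small rewriting step. This is an illustration under assumptions, not the method used in the PAC compiler: it again assumes D0-D7 and D8-D15 are the two banks, it assumes the given AC temporaries are free, and the name legalize is hypothetical.

```python
import re

def bank_of(reg):
    """Bank (0 or 1) of a D register, None otherwise. ASSUMPTION: D0-D7 / D8-D15."""
    m = re.fullmatch(r"D(\d+)", reg)
    return None if m is None else (0 if int(m.group(1)) < 8 else 1)

def legalize(op, dst, srcs, temps=("AC1", "AC2")):
    """Rewrite one instruction so that all its D operands come from a single bank.

    Sources in the "other" bank are first copied into a temporary AC register.
    ASSUMPTION: the registers in `temps` are free; running out raises StopIteration.
    """
    home = bank_of(dst)
    if home is None:  # destination is not a D register: use the first D source's bank
        home = next((bank_of(s) for s in srcs if bank_of(s) is not None), None)
    free = iter(temps)
    copies, new_srcs, renamed = [], [], {}
    for s in srcs:
        b = bank_of(s)
        if b is not None and b != home:
            if s not in renamed:  # copy each offending source only once
                renamed[s] = next(free)
                copies.append(("Mov", renamed[s], s))
            new_srcs.append(renamed[s])
        else:
            new_srcs.append(s)
    return copies + [(op, dst, *new_srcs)]

for insn in legalize("Sub", "D9", ["D8", "D1"]):
    print(insn)
```

For Sub D9,D8,D1 the sketch emits a Mov of D1 into AC1 followed by Sub D9,D8,AC1. An instruction whose D operands already share a bank is returned unchanged, which is the case a good register allocation tries to make common.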

Issues for Inter-cluster Communication
To exploit cluster parallelism:
  PAC needs an explicit instruction to be issued for inter-cluster communication!
  Additional cross-cluster copies are required.
Optimize code partitioning:
  Less communication
  Better scheduling
[Figure: operations A to G partitioned across Cluster1 and Cluster2 (A B C D on one side, E F G on the other), with the B-Unit between the clusters.]
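One way to compare code partitions is to count the cross-cluster copies each one would require. The sketch below does this for operations A to G. The dependence edges are invented for illustration (the slide's actual graph is not recoverable from the transcript), and the function name cross_cluster_copies is hypothetical.

```python
def cross_cluster_copies(edges, cluster_of):
    """Count values that must be copied to another cluster.

    edges: (producer, consumer) pairs of a dependence graph.
    cluster_of: operation name -> cluster id.
    A value sent to the same remote cluster twice is counted once.
    """
    needed = {
        (src, cluster_of[dst])
        for src, dst in edges
        if cluster_of[src] != cluster_of[dst]
    }
    return len(needed)

# HYPOTHETICAL dependence graph over the operations named on the slide.
edges = [("A", "C"), ("B", "C"), ("C", "E"), ("D", "E"), ("E", "G"), ("F", "G")]

grouped = {"A": 1, "B": 1, "C": 1, "D": 1, "E": 2, "F": 2, "G": 2}
interleaved = {"A": 1, "B": 2, "C": 1, "D": 2, "E": 1, "F": 2, "G": 1}

print(cross_cluster_copies(edges, grouped))      # 2
print(cross_cluster_copies(edges, interleaved))  # 3
```

A partitioner could use such a count as one term of its cost, alongside the schedule length, since fewer copies do not always mean a shorter schedule.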

More Considerations
Two optimized codes of the same performance:
  Upper → smaller code size
  Lower → lower power consumption

Compiler Supports for PAC DSP
Essential supports (IA-64 ORC → PAC):
  New Target_Info
    PAC architecture and ISA descriptions
    Complicated hazard descriptions
  PAC application binary interface (ABI)
    Data type mapping
    Memory usage layout
    Register usage conventions
    Calling conventions
  PAC code generation
    32-bit WHIRL code generation
    PAC WHIRL-to-CGIR procedures
    PAC assembly code emission

Simulated-Annealing (SA) Based Register Allocation Approach
Motivation:
  Complex interference among:
    Register allocation
    Instruction scheduling
    Code insertion for distributed register communication
  We want a machine-learning method that gives near-optimal results.
  It serves as a base reference for developing heuristic methods!

To Determine: Virtual Register → Register File (Bank)
Input: unscheduled instructions
Output:
  a schedule of the instructions
  a register file assignment (RFA) map
    RFA map = {(v1, f1), (v2, f2), ...}
    where vi: a virtual register, fi: a register file (bank)
PAC_Scheduler:
  Graph-coloring based register allocation according to the RFA map
  Instruction scheduling and code insertion for register file communication
Setup SA:
  An initial random RFA map
  schedule_len = PAC_Scheduler(initial RFA map)
  SA control variables:
    threshold
    p_test: a probability test value (0 < p_test < 1)
    energy: initial value > threshold

To Optimize: Scheduling Result
[Flowchart, as far as it can be recovered:]
  SA stop test: energy > threshold?
    no: output the final RFA map & schedule
    yes: continue
  Randomly change a mapping (vi, fi) to obtain a new RFA map
  Re-run: new_schedule_len = PAC_Scheduler(new RFA map)
  Better result test: new_schedule_len < schedule_len?
    yes: energy--; schedule_len = new_schedule_len; keep the new RFA map
    no: Random test: a random number > p_test? (energy++)
      yes: keep the new RFA map
      no: go back to the old RFA map
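The loop can be sketched in a few lines. This is an illustration under several assumptions, not the authors' implementation: toy_scheduler is a hypothetical stand-in for PAC_Scheduler with an arbitrary cost, energy++ is assumed to happen when a worse map is accepted (the flowchart wiring is ambiguous in the transcript), schedule_len is also updated in that case, and a max_iters guard is added so the sketch always terminates.

```python
import random

def toy_scheduler(rfa_map, uses):
    """HYPOTHETICAL stand-in for PAC_Scheduler: returns a schedule length.

    Toy cost: one cycle per instruction, plus one extra cycle whenever its two
    virtual registers are assigned to different register files.
    """
    return len(uses) + sum(1 for a, b in uses if rfa_map[a] != rfa_map[b])

def anneal(vregs, files, uses, energy=20, threshold=0, p_test=0.9,
           max_iters=10_000, seed=0):
    rng = random.Random(seed)
    rfa = {v: rng.choice(files) for v in vregs}   # initial random RFA map
    schedule_len = toy_scheduler(rfa, uses)
    for _ in range(max_iters):                    # guard added by this sketch
        if energy <= threshold:                   # SA stop test
            break
        new_rfa = dict(rfa)
        new_rfa[rng.choice(vregs)] = rng.choice(files)  # change one mapping
        new_len = toy_scheduler(new_rfa, uses)
        if new_len < schedule_len:                # better result test
            energy -= 1
            rfa, schedule_len = new_rfa, new_len
        elif rng.random() > p_test:               # random test
            energy += 1                           # ASSUMED placement of energy++
            rfa, schedule_len = new_rfa, new_len  # accept a worse map
        # otherwise keep the old RFA map
    return rfa, schedule_len

vregs = ["v1", "v2", "v3", "v4"]
files = ["D_bank1", "D_bank2", "AC"]
uses = [("v1", "v2"), ("v2", "v3"), ("v3", "v4")]
final_map, final_len = anneal(vregs, files, uses)
print(final_map, final_len)
```

Because worse maps are sometimes accepted, the final map is not guaranteed to be the best one visited; a variant that remembers the best map seen so far is a common refinement.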

Preliminary Experimental Results (DSPStone benchmarks)

Related Works
Register Allocation
  R. Leupers: Instruction scheduling for clustered VLIW DSPs. In Proc. Int’l Conference on Parallel Architecture and Compilation Techniques, pages 291–300, Oct. 2000
Register File Organizations
  S. Rixner, W. J. Dally, B. Khailany, P. Mattson, U. J. Kapasi, and J. D. Owens: Register organization for media processing. International Symposium on High Performance Computer Architecture (HPCA), pp , 2000
  Tay-Jyi Lin, Chin-Chi Chang, Chen-Chia Lee, and Chein-Wei Jen: An Efficient VLIW DSP Architecture for Baseband Processing. Proceedings of the 21st International Conference on Computer Design, 2003

Conclusion
  We developed a compiler prototype for a new VLIW DSP architecture called PAC.
    Based on ORC
  New optimization issues arise from the irregular hardware design:
    Highly distributed register files
    Port-access-restricted ping-pong structures
  An SA approach was employed to obtain preliminary results on register allocation for PAC.
  We will extend our work to the upcoming next version of the PAC DSP.
