Compiler Supports and Optimizations for PAC VLIW DSP Processors


Y.-C. Lin, C.-L. Tang, C.-J. Wu, M.-Y. Hung, Y.-P. You, Y.-C. Moo, S.-Y. Chen, and J.-K. Lee
National Tsing-Hua University, Taiwan

Outline
  PAC VLIW DSP Architectures
  Optimization Issues
  Preliminary Compiler Supports
  Experimental Results
  Conclusion
11/23/2018 LCPC2005

Introduction
  Parallel Architecture Core (PAC) is designed by the SoC Technology Center, ITRI, Taiwan.
  32-bit, fixed-point, 5-way issue VLIW DSP
    scalable architecture
    optimized instruction set for audio/video/image
    innovative register file structure
  Two generations developed
  TSMC’s 0.13 μm technology (taped out in Aug. 2005)
    High-performance
    Low-power

Key Issues
  Deploy a general-purpose, high-performance open source compiler for DSP processors: ORC → PAC DSP
  Address issues for fragmentary register banks of DSP processors
  Methods for irregular register constraints and instruction scheduling

PAC DSP Overview
  Five-Way Issue:
    1 Scalar/Control Unit (B)
    2 Arithmetic Units (I)
    2 Load/Store Units (M)
  Cluster Design:
    Scalability
    Explicit Inter-Cluster Data Transfer Instructions
  Distributed Register Files:
    5 Local Register Files (A, AC, R)
    2 Global Register Files (D)
  Other Features:
    8-bit/16-bit SIMD operations
    Variable instruction word/bundle length
    Dynamic Power Management
    Standard AMBA interface
  [Figure: PAC DSP block diagram. B-Unit with R registers; each cluster has an M-Unit with A registers, an I-Unit with AC registers, and D registers; more clusters can be added.]

Ping-pong Register File Structure
  Used by the Global Register File (D)
  Concept: overlap the processing of different data streams in a cluster
  Benefit: fewer register-file ports, for lower power and smaller size
  The M-Unit and I-Unit operate on different data streams at the same time (load, compute, store), hence the name ping-pong.

Ping-pong Register Access
  Each ‘D’ register file contains 2 banks.
  Rules:
    In any one cycle, a unit can access only one of the 2 banks.
    In any one cycle, the M-Unit and I-Unit must access different banks.
  [Figure: instructional switcher connecting the M-Unit and I-Unit to Bank 1 and Bank 2]
  Only 1 switcher state in each cycle.

Issues for Ping-pong Registers (1)
  Example of ping-pong usage:
    Able to form a bundle:
      Lw D8, A0
      Add D1,D0,AC0
    Unable to form a bundle:
      Lw D2, A0
      Add D1,D0,AC0
    The second pair must be scheduled into 2 bundles since both operations use the same bank.
  For compiler optimizations:
    Better register (file/bank) allocation → better schedule in fewer bundles
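
The bundling rule in this example can be sketched as a small legality check. This is only an illustration: the split of D0-D7 into one bank and D8-D15 into the other is an assumption inferred from the example, not something the slides state, and `d_banks` and `can_bundle` are hypothetical helper names.

```python
import re

# Assumption for illustration only: D0-D7 form one bank and D8-D15 the other.
# The slides do not state the actual register-to-bank split.
def d_banks(operands):
    """Return the set of D-register banks touched by one operation."""
    banks = set()
    for reg in operands:
        m = re.fullmatch(r"D(\d+)", reg)
        if m:
            banks.add(int(m.group(1)) // 8)
    return banks

def can_bundle(m_unit_operands, i_unit_operands):
    """Check the two ping-pong rules for one M-Unit and one I-Unit operation."""
    m_banks = d_banks(m_unit_operands)
    i_banks = d_banks(i_unit_operands)
    # Rule 1: a single unit may not touch both banks in the same cycle.
    if len(m_banks) > 1 or len(i_banks) > 1:
        return False
    # Rule 2: the M-Unit and I-Unit must use different banks.
    return m_banks.isdisjoint(i_banks)

print(can_bundle(["D8", "A0"], ["D1", "D0", "AC0"]))  # True: Lw D8 with Add D1,D0,AC0
print(can_bundle(["D2", "A0"], ["D1", "D0", "AC0"]))  # False: both use the same bank
```

Operands that are not D registers (A0, AC0) are ignored by the check, since the ping-pong rules apply only to the D register file.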

Issues for Ping-pong Registers (2)
  Data transfer between ping-pong banks:
    Add D1,D0,AC0
    Lw D8, A0
    Sub D9,D8,D1    (invalid operation: needs cross ping-pong communication)
    Sw D1, A0
  With an additional copy operation:
    Mov AC1, D1
    Sub D9,D8,AC1
    Sw D1, A0
  For compiler optimizations:
    Handle data communication between ping-pong banks correctly during any code manipulation
    Generate as few additional copy operations as possible
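
A minimal sketch of the copy insertion shown on this slide follows. It assumes the same hypothetical bank split (D0-D7 and D8-D15), assumes the named AC registers are free to use as temporaries, and uses an invented helper name `legalize`; it is not the authors' implementation.

```python
import re

def d_bank(reg):
    """Assumed split (not stated on the slides): D0-D7 -> bank 0, D8-D15 -> bank 1."""
    m = re.fullmatch(r"D(\d+)", reg)
    return int(m.group(1)) // 8 if m else None

def legalize(opcode, dst, srcs, free_temps=("AC1", "AC2")):
    """Rewrite one operation so that it touches only one D bank.

    The bank of the first D operand (destination first) is treated as the
    home bank; every source in the other bank is copied to a temporary
    register by an extra Mov placed before the operation.
    """
    temps = iter(free_temps)  # hypothetical pool of free AC registers
    d_operand_banks = [d_bank(r) for r in [dst] + list(srcs) if d_bank(r) is not None]
    home = d_operand_banks[0] if d_operand_banks else None
    out, new_srcs = [], []
    for s in srcs:
        bank = d_bank(s)
        if bank is not None and bank != home:
            temp = next(temps)
            out.append(("Mov", [temp, s]))  # cross ping-pong copy
            new_srcs.append(temp)
        else:
            new_srcs.append(s)
    out.append((opcode, [dst] + new_srcs))
    return out

for op, operands in legalize("Sub", "D9", ["D8", "D1"]):
    print(op, ",".join(operands))
```

Each inserted Mov costs an extra operation, which is why the slide asks the compiler to generate as few copies as possible.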

Issues for Inter-cluster Communication
  To exploit cluster parallelism, PAC needs explicit instructions to be issued for inter-cluster communication.
  [Figure: two partitionings of operations A to G across Cluster1 and Cluster2, with an additional cross-cluster copy marked]
  Optimize code partitioning:
    Less communication
    Better scheduling
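
To make the partitioning trade-off concrete, the sketch below counts the cross-cluster copies that a given partition would need. The dependence edges between operations A to G are hypothetical (the figure's edges are not preserved in the transcript), and `cross_cluster_copies` is an invented helper name.

```python
def cross_cluster_copies(edges, cluster_of):
    """Count the copies a partition needs.

    One copy is assumed per produced value per destination cluster, so a
    value consumed twice in the same remote cluster is copied only once.
    """
    needed = {
        (src, cluster_of[dst])
        for src, dst in edges
        if cluster_of[src] != cluster_of[dst]
    }
    return len(needed)

# Hypothetical data-dependence edges (producer, consumer).
edges = [("A", "C"), ("B", "C"), ("C", "E"), ("D", "E"), ("E", "G"), ("F", "G")]

block_partition = {"A": 1, "B": 1, "C": 1, "D": 1, "E": 2, "F": 2, "G": 2}
alternating_partition = {"A": 1, "B": 2, "C": 1, "D": 2, "E": 1, "F": 2, "G": 1}

print(cross_cluster_copies(edges, block_partition))        # 2
print(cross_cluster_copies(edges, alternating_partition))  # 3
```

A partitioner would weigh this copy count against the parallelism gained by spreading operations over both clusters.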

More Considerations
  Two optimized codes of the same performance:
    Upper → smaller code size
    Lower → lower power consumption

Compiler Supports for PAC DSP
  Essential supports (IA-64 ORC → PAC)
    New Target_Info
      PAC architecture and ISA descriptions
      Complicated hazard descriptions
    PAC application binary interface (ABI)
      data type mapping
      memory usage layout
      register usage conventions
      calling conventions
    PAC code generation
      32-bit WHIRL code generation
      PAC WHIRL-to-CGIR procedures
      PAC assembly code emission

Simulated-Annealing (SA) Based Register Allocation Approach
  Motivation:
    Complex interference among:
      Register allocation
      Instruction scheduling
      Code insertion for distributed register communication
    We use a machine-learning method to obtain near-optimal results.
    These results serve as a base reference for developing heuristic methods.

To Determine: Virtual Register → Register File (Bank)
  Input: unscheduled instructions
  Output:
    a schedule of the instructions
    a register file assignment (RFA) map
      RFA map = {(v1, f1), (v2, f2), ...}
      where vi is a virtual register and fi is a register file (bank)
  PAC_Scheduler:
    Graph-coloring based register allocation according to the RFA map
    Instruction scheduling and code insertion for register file communication
  Setup SA:
    An initial random RFA map
    schedule_len = PAC_Scheduler(initial RFA map)
    SA control variables:
      threshold
      p_test: a probability test value (0 < p_test < 1)
      energy: initial value > threshold
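
The idea of coloring under an RFA map can be sketched with a simple greedy allocator. This is a stand-in for illustration, not necessarily the allocator inside PAC_Scheduler; the file names, register names, and the function `color_with_rfa` are all hypothetical.

```python
def color_with_rfa(interference, rfa, regs_per_file):
    """Greedy graph coloring restricted by a register file assignment map.

    interference: {vreg: set of vregs that are live at the same time}
    rfa:          {vreg: register file (bank) name}
    regs_per_file:{file name: list of physical registers in that file}
    Returns {vreg: physical register}, or None if some vreg cannot be colored.
    """
    assignment = {}
    # Visit the most constrained virtual registers (highest degree) first.
    for v in sorted(interference, key=lambda n: -len(interference[n])):
        taken = {assignment[n] for n in interference[v] if n in assignment}
        free = [r for r in regs_per_file[rfa[v]] if r not in taken]
        if not free:
            return None  # a real allocator would spill or change the RFA map
        assignment[v] = free[0]
    return assignment

interference = {"v1": {"v2"}, "v2": {"v1", "v3"}, "v3": {"v2"}}
rfa = {"v1": "D_bank1", "v2": "D_bank1", "v3": "AC"}
regs_per_file = {"D_bank1": ["D0", "D1"], "AC": ["AC0"]}

print(color_with_rfa(interference, rfa, regs_per_file))
```

Because each virtual register may only take a register from the file its RFA entry names, a poor RFA map can make coloring fail even when other files still have free registers.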

To Optimize: Scheduling Result
  1. Randomly change one mapping (vi, fi) to obtain a new RFA map.
  2. Re-run: new_schedule_len = PAC_Scheduler(new RFA map)
  3. Better result test: new_schedule_len < schedule_len?
       yes: energy--; schedule_len = new_schedule_len; keep the new RFA map
       no: go to the random test
  4. Random test: a random number > p_test?
       yes: energy++; keep the new RFA map
       no: restore the old RFA map
  5. SA stop test: energy > threshold?
       yes: repeat from step 1
       no: output the final RFA map and schedule
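
The loop can be sketched as follows, under one possible reading of the flowchart: energy is decremented on an improvement and incremented when a worse map is kept by the random test. Several parts are assumptions added for illustration: `toy_scheduler` is a hypothetical stand-in for PAC_Scheduler, the bank names are invented, and the iteration cap and best-map tracking do not appear on the slides.

```python
import random

BANKS = ["D_bank1", "D_bank2", "AC", "A"]  # hypothetical register files/banks

def toy_scheduler(rfa, conflicts):
    """Hypothetical stand-in for PAC_Scheduler: returns a fake schedule length.

    Every pair of virtual registers used together costs one extra bundle
    when both are mapped to the same bank.
    """
    return len(rfa) + sum(1 for a, b in conflicts if rfa[a] == rfa[b])

def anneal(vregs, conflicts, p_test=0.9, energy=10, threshold=0,
           max_iters=10_000, seed=0):
    rng = random.Random(seed)
    rfa = {v: rng.choice(BANKS) for v in vregs}  # initial random RFA map
    schedule_len = toy_scheduler(rfa, conflicts)
    best_rfa, best_len = dict(rfa), schedule_len  # added: remember the best map
    iters = 0
    # max_iters is an added safety cap, not part of the slides.
    while energy > threshold and iters < max_iters:
        iters += 1
        v = rng.choice(vregs)
        old_file = rfa[v]
        rfa[v] = rng.choice(BANKS)               # randomly change one mapping
        new_len = toy_scheduler(rfa, conflicts)  # re-run the scheduler
        if new_len < schedule_len:               # better result test
            energy -= 1
            schedule_len = new_len
        elif rng.random() > p_test:              # random test: keep a worse map
            energy += 1
            schedule_len = new_len
        else:
            rfa[v] = old_file                    # restore the old RFA map
        if schedule_len < best_len:
            best_rfa, best_len = dict(rfa), schedule_len
    return best_rfa, best_len

vregs = ["v1", "v2", "v3", "v4"]
conflicts = [("v1", "v2"), ("v2", "v3"), ("v3", "v4")]
final_rfa, final_len = anneal(vregs, conflicts)
print(final_rfa, final_len)
```

With p_test close to 1, worse maps are rarely kept, so the search behaves mostly like hill climbing; lowering p_test lets it escape local minima at the cost of more scheduler runs.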

Preliminary Experimental Results (DSPStone benchmarks)

Related Works
  Register Allocation
    R. Leupers: Instruction scheduling for clustered VLIW DSPs. In Proc. Int’l Conference on Parallel Architecture and Compilation Techniques, pages 291–300, Oct. 2000.
  Register File Organizations
    S. Rixner, W. J. Dally, B. Khailany, P. Mattson, U. J. Kapasi, and J. D. Owens: Register organization for media processing. International Symposium on High Performance Computer Architecture (HPCA), pages 375–386, 2000.
    Tay-Jyi Lin, Chin-Chi Chang, Chen-Chia Lee, and Chein-Wei Jen: An Efficient VLIW DSP Architecture for Baseband Processing. Proceedings of the 21st International Conference on Computer Design, 2003.

Conclusion
  We developed a compiler prototype for a new VLIW DSP architecture called PAC.
    Based on ORC
  New optimization issues arise from the irregular hardware design:
    Highly distributed register files
    Port-access restricted ping-pong structures
  An SA approach was employed to obtain preliminary results on register allocation for PAC.
  We will extend this work to the upcoming next version of PAC DSP.