An Operation Rearrangement Technique for Low-Power VLIW Instruction Fetch Dongkun Shin* and Jihong Kim Computer Architecture Lab School of Computer Science.

Slides:

Advertisements

Similar presentations

Instruction Level Parallelism and Superscalar Processors

Advertisements

Lab 2 – DSP software architecture and the real life DSP characteristics of signals that make it necessary.

Compiler Support for Superscalar Processors. Loop Unrolling Assumption: Standard five stage pipeline Empty cycles between instructions before the result.

DSPs Vs General Purpose Microprocessors

DAP teaching computer architecture at Berkeley since 1977

Computer Science and Engineering Laboratory, Transport-triggered processors Jani Boutellier Computer Science and Engineering Laboratory This.

Compiler-Based Register Name Adjustment for Low-Power Embedded Processors Discussion by Garo Bournoutian.

U NIVERSITY OF D ELAWARE C OMPUTER & I NFORMATION S CIENCES D EPARTMENT Optimizing Compilers CISC 673 Spring 2009 Instruction Scheduling John Cavazos University.

Computer Architecture Lecture 7 Compiler Considerations and Optimizations.

CPE 731 Advanced Computer Architecture ILP: Part V – Multiple Issue Dr. Gheith Abandah Adapted from the slides of Prof. David Patterson, University of.

CSE 490/590, Spring 2011 CSE 490/590 Computer Architecture VLIW Steve Ko Computer Sciences and Engineering University at Buffalo.

Instructor Notes We describe motivation for talking about underlying device architecture because device architecture is often avoided in conventional.

POLITECNICO DI MILANO Parallelism in wonderland: are you ready to see how deep the rabbit hole goes? ILP: VLIW Architectures Marco D. Santambrogio:

CS252 Graduate Computer Architecture Spring 2014 Lecture 9: VLIW Architectures Krste Asanovic

University of Michigan Electrical Engineering and Computer Science 1 A Distributed Control Path Architecture for VLIW Processors Hongtao Zhong, Kevin Fan,

CSE 8383 Superscalar Processor 1 Abdullah A Alasmari & Eid S. Alharbi.

9. Code Scheduling for ILP-Processors TECH Computer Science {Software! compilers optimizing code for ILP-processors, including VLIW} 9.1 Introduction 9.2.

Application of Binary Translation to Java Reconfigurable Architectures Antonio Carlos S. Beck Filho Luigi Carro Instituto.

Courseware Path-Based Scheduling Sune Fallgaard Nielsen Informatics and Mathematical Modelling Technical University of Denmark Richard Petersens Plads,

SYNAR Systems Networking and Architecture Group CMPT 886: Architecture of Niagara I Processor Dr. Alexandra Fedorova School of Computing Science SFU.

Addressing Optimization for Loop Execution Targeting DSP with Auto-Increment/Decrement Architecture Wei-Kai Cheng Youn-Long Lin* Computer & Communications.

Multiscalar processors

Instruction Set Architecture (ISA) for Low Power Hillary Grimes III Department of Electrical and Computer Engineering Auburn University.

HW/SW Co-Synthesis of Dynamically Reconfigurable Embedded Systems HW/SW Partitioning and Scheduling Algorithms.

Memory Access Scheduling and Binding Considering Energy Minimization in Multi- Bank Memory Systems Chun-Gi Lyuh, Taewhan Kim DAC 2004, June 7-11, 2004.

1 Presenter: Ming-Shiun Yang Sah, A., Balakrishnan, M., Panda, P.R. Design, Automation & Test in Europe Conference & Exhibition, DATE ‘09. A Generic.

Generic Software Pipelining at the Assembly Level Markus Pister

1 Interconnects Shared address space and message passing computers can be constructed by connecting processors and memory unit using a variety of interconnection.

Instruction-Level Parallelism for Low-Power Embedded Processors January 23, 2001 Presented By Anup Gangwar.

UNIVERSITAT POLITÈCNICA DE CATALUNYA Departament d’Arquitectura de Computadors Exploiting Pseudo-schedules to Guide Data Dependence Graph Partitioning.

Hybrid-Scheduling: A Compile-Time Approach for Energy–Efficient Superscalar Processors Madhavi Valluri and Lizy John Laboratory for Computer Architecture.

A Decompression Architecture for Low Power Embedded Systems Lekatsas, H.; Henkel, J.; Wolf, W.; Computer Design, Proceedings International.

L11: Lower Power High Level Synthesis(2) 성균관대학교 조 준 동 교수

Chapter 8 CPU and Memory: Design, Implementation, and Enhancement The Architecture of Computer Hardware and Systems Software: An Information Technology.

Super computers Parallel Processing By Lecturer: Aisha Dawood.

Advanced Processor Technology Architectural families of modern computers are CISC RISC Superscalar VLIW Super pipelined Vector processors Symbolic processors.

Design of a High-Throughput Low-Power IS95 Viterbi Decoder Xun Liu Marios C. Papaefthymiou Advanced Computer Architecture Laboratory Electrical Engineering.

Computer Organization and Architecture Tutorial 1 Kenneth Lee.

CS5222 Advanced Computer Architecture Part 3: VLIW Architecture

INFAnt: NFA Pattern Matching on GPGPU Devices Author: Niccolo’ Cascarano, Pierluigi Rolando, Fulvio Risso, Riccardo Sisto Publisher: ACM SIGCOMM 2010 Presenter:

Next Generation ISA Itanium / IA-64. Operating Environments IA-32 Protected Mode/Real Mode/Virtual Mode - if supported by the OS IA-64 Instruction Set.

Simulation of Decode Filter Cache using SimpleScalar simulator Presented by Fei Hong.

OPTIMIZING DSP SCHEDULING VIA ADDRESS ASSIGNMENT WITH ARRAY AND LOOP TRANSFORMATION Chun Xue, Zili Shao, Ying Chen, Edwin H.-M. Sha Department of Computer.

Implementing Precise Interrupts in Pipelined Processors James E. Smith Andrew R.Pleszkun Presented By: Shrikant G.

1 CMP-MSI.07 CARES/SNU A Reusability-Aware Cache Memory Sharing Technique for High Performance CMPs with Private Caches Sungjune Youn, Hyunhee Kim and.

On the Importance of Optimizing the Configuration of Stream Prefetches Ilya Ganusov Martin Burtscher Computer Systems Laboratory Cornell University.

Architectural Effects on DSP Algorithms and Optimizations Sajal Dogra Ritesh Rathore.

Chapter 11 System Performance Enhancement. Basic Operation of a Computer l Program is loaded into memory l Instruction is fetched from memory l Operands.

Re-configurable Bus Encoding Scheme for Reducing Power Consumption of the Cross Coupling Capacitance for Deep Sub-micron Instructions Bus Siu-Kei Wong.

1 Aphirak Jansang Thiranun Dumrongson

PipeliningPipelining Computer Architecture (Fall 2006)

Advanced Architectures

15-740/ Computer Architecture Lecture 3: Performance

Variable Word Width Computation for Low Power

Microarchitecture.

Low-power Digital Signal Processing for Mobile Phone chipsets

L. Benini, G. DeMicheli Stanford University, USA A. Macii, E. Macii, M

CSC 4250 Computer Architectures

Instruction Scheduling for Instruction-Level Parallelism

Instruction Scheduling Hal Perkins Summer 2004

Stephen Hines, David Whalley and Gary Tyson Computer Science Dept.

EE 445S Real-Time Digital Signal Processing Lab Spring 2014

Instruction Scheduling Hal Perkins Winter 2008

Ann Gordon-Ross and Frank Vahid*

Static Code Scheduling

Instruction Scheduling Hal Perkins Autumn 2005

Chapter 12 Pipelining and RISC

Instruction Level Parallelism

Overview Problem Solution CPU vs Memory performance imbalance

Chapter 4 The Von Neumann Model

Presentation transcript:

An Operation Rearrangement Technique for Low-Power VLIW Instruction Fetch Dongkun Shin* and Jihong Kim Computer Architecture Lab School of Computer Science and Engineering Seoul National University, Korea

School of CSE Seoul National University Workshop on Complexity-Effective Design 2 Outline Motivations VLIW Instruction Encodings LOR Problem and Solution GOR Problem and Solution Experiment Conclusions

School of CSE Seoul National University Workshop on Complexity-Effective Design 3 Motivations Many mobile devices are designed using VLIW processors for high performance, which usually consume more power than single-issue processors. Many mobile devices are designed using VLIW processors for high performance, which usually consume more power than single-issue processors. In digital CMOS circuits, switching activity accounts for over 90% of total power consumption. In digital CMOS circuits, switching activity accounts for over 90% of total power consumption. We propose a post-pass optimization technique that can reduce switching activity during the instruction fetch phase in VLIW processors We propose a post-pass optimization technique that can reduce switching activity during the instruction fetch phase in VLIW processors

School of CSE Seoul National University Workshop on Complexity-Effective Design 4 VLIW Instruction Encoding-Uncompressed IADD /*IntU*/ || FADD /*FpU*/ || LOAD /*MemU*/ || STORE /*MEMU*/ ISUB /*IntU*/ || IMUL /*IntU*/ IADD /*IntU*/ || BEG /*BrU*/ IntU FpU MemU CmpUBrU IADDNOPFADDNOPLOADSTORENOP ISUBIMULNOP IADDNOP BEG IADDNOPFADDNOPLOADSTORENOP IMULISUBNOP IADDNOP BEG Alternative encoding Functional Unit Program

School of CSE Seoul National University Workshop on Complexity-Effective Design 5 VLIW Instruction Encoding - Compressed IADD /*IntU*/ || FADD /*FpU*/ || LOAD /*MemU*/ || STORE /*MEMU*/ ISUB /*IntU*/ || IMUL /*IntU*/ IADD /*IntU*/ || BEG /*BrU*/ IADD IntU 1 FADD FpU 1 LOAD MemU 1 STORE MemU 0 ISUB IntU 1 IMUL IntU 0 IADD IntU 1 BEG BrU 0 IADD IntU 1 FADD FpU 1 LOAD MemU 1 STORE MemU 0 ISUB IntU 1 IMUL IntU 0 IADD IntU 1 BEG BrU 0 Instruction 1Instruction 2Instruction 3 Instruction 1Instruction 2Instruction 3 Alternative encoding Possible choices = 4! 2! 2! Which encoding is the best for low-power consumption? Program Parallel bit

School of CSE Seoul National University Workshop on Complexity-Effective Design 6 Machine Model External Memory Internal Cache VLIW Processor Core Ins Memory block is fetched from the main memory through the b mem -bit width instruction bus on cache-miss. Because of the compressed encoding format, several VLIW instructions are fetched together in a single fetch from the instruction cache. A fetch packet consists of N operations, and b mem = b cache /N b mem -bit width bus b cache -bit width bus Ins FP OP Ins FP

School of CSE Seoul National University Workshop on Complexity-Effective Design 7 Basic Idea Instruction Cache (a) Before operation rearrangement(b) After operation rearrangement 14 bit transitions 12 bit transitions 13 bit transitions 8 bit transitions 10 bit transitions 11 bit transitions Total 39 bit transitionsTotal 29 bit transitions The total # of bit changes are reduced by 25%

School of CSE Seoul National University Workshop on Complexity-Effective Design 8 Problem Formulation how to reorder given VLIW instructions to reduce the number of bit transitions between successive instruction fetches. Problem Local Operation Rearrangement (LOR) : each basic block is independently considered. Global Operation Rearrangement (GOR) : all the basic blocks are simultaneously considered. Solutions

School of CSE Seoul National University Workshop on Complexity-Effective Design 9 LOR Problem SW = SW cache + SW mem B BB SW cache is the number of bit changes at the internal instruction bus. SW cache is the number of bit changes at the internal instruction bus. B SW mem is the number of bit changes at the external instruction bus. SW mem is the number of bit changes at the external instruction bus. B  is the load capacitance ratio of the external instruction bus to the internal instruction bus.  is the load capacitance ratio of the external instruction bus to the internal instruction bus.

School of CSE Seoul National University Workshop on Complexity-Effective Design 10 LOR Problem External MemoryInternal Cache VLIW Processor Core OP 1 OP 2 OP N OP 1 OP 2 FP 2 FP 3 FP 1 SW mem SW cache SW B =  SW intra +  SW inter FP SW intra FP SW inter FP...

School of CSE Seoul National University Workshop on Complexity-Effective Design 11 Solution for LOR START B i FP 1, B i 2, B i 3, B i 4, B i 1,1 + B i 2,1 + B i 3,1 + B i 4,1 + END SW intra FP SW inter FP EQ(FP i ) B EQ(FP i ) : The set of equivalent fetch packets of FP i. BB

School of CSE Seoul National University Workshop on Complexity-Effective Design 12 Solution for LOR We find the shortest path from START to END, which is the solution of operation rearrangement to minimize the SW B A node v i+1 in graph finds the node v i through which the shortest path from START to the node v i+1 should pass. START B i FP 1, B i 2, B i 3, B i 4, B i 1,1 + B i 2,1 + B i 3,1 + B i 4,1 + END

School of CSE Seoul National University Workshop on Complexity-Effective Design 13 GOR Problem All the basic blocks in a program are simultaneously considered –how many times each basic block is executed. –how often each basic block experiences cache misses. –how basic blocks are related each other. SW S =   SW inter (bb i,bb j ) +  SW intra (bb i ) BB SW inter and SW intra is represented by SW inter, SW intra, weight of each basic block, and cache miss rate. BB FP

School of CSE Seoul National University Workshop on Complexity-Effective Design 14 Solution for GOR This method may require an excessive amount of memory and cycles. We need a heuristic solution. GOR Problem Shortest Path Problem Shortest Path Problem Graph Transformation (branch merging, loop rolling) Graph Transformation (branch merging, loop rolling) Graph Construction Graph Construction LOR Algorithm Solution

School of CSE Seoul National University Workshop on Complexity-Effective Design 15 Heuristic for GOR All the basic blocks are not equally treated. –Basic blocks with larger effects on the total switching activity are more thoroughly reordered than ones with smaller effects. Not all the equivalent basic blocks in EQ(bb i ) are tried to find an optimal solution. –Only N cand equivalent basic blocks are created and included in graph.

School of CSE Seoul National University Workshop on Complexity-Effective Design 16 Experiment Fixed-point DSP VLIW processor that can specify eight 32-bit operations in a single 256-bit instruction. Use a compressed encoding Fixed-point DSP VLIW processor that can specify eight 32-bit operations in a single 256-bit instruction. Use a compressed encoding TMS320C6201 Instruction Cache External Memory Internal Bus 256-bit width 32-bit width External Bus VLIW Processor Core FU1FU5 FU2FU6 FU3FU7 FU4FU8

School of CSE Seoul National University Workshop on Complexity-Effective Design 17 Experiment Results For our benchmark programs, the bit transitions was reduced by 34% on an average.

School of CSE Seoul National University Workshop on Complexity-Effective Design 18 Conclusions Described a post-pass optimal operation rearrangement method for low-power VLIW instruction fetch. –The switching activity was reduced by 34% on an average. Future works –The phase-ordering problem between the operation rearrangement and other compiler optimization steps. –Operation rearrangement problem in super-scalar processors.