Instruction Selection for Compilers that Target Architectures with Echo Instructions Philip BriskAni NahapetianMajid Sarrafzadeh Embedded and Reconfigurable Systems Lab Computer Science Department University of California, Los Angeles

Outline
- Code Compression
- LZ77 Compression
- Echo Instructions
- Compiler Framework
- Experimental Results
- Summary

Code Compression
Why compress a program?
- Silicon requirements for on-chip ROM
- Power consumption
- Cost to fabricate
- Cost paid by the consumer
Are there costs to decompression?
- Performance overhead
- Dedicated hardware (or software)
- Longer CPU pipelines
- Increased branch misprediction penalty

LZ77 Compression
To compress a string, identify repeated substrings and replace each with a pointer (offset, length of sequence):

ABCDBCABCDBACABCDBADAABCDBDC (28 characters)
ABCDBC(6, 5)AC(9, 5)ADA(12, 5)DC (16 symbols, counting each pointer as one)

Each (offset, length) pair stands in for one repeated length-5 substring.

Echo Instructions (Fraser, Microsoft ’02) (Lau, CASES ’03)
Decompresses LZ77-compressed programs with minimal hardware requirements:
- 2 dedicated registers: R1, R2
- 1 decrementer with = 0 test (NOR)
Echo(Offset, N):
1. Save PC and N in R1 and R2
2. Branch to PC – Offset
3. Execute the next N instructions
4. Return to the call point
5. Restore PC from R1

Substring Matching (Fraser, SCC ’84)
The same five-instruction sequence occurs at multiple program points:
$1 ← $2 + $3
$11 ← $7 * $8
$8 ← $7 * $1
$1 ← $11 / $8
$1 ← $8 + 1
One occurrence is kept; each repeat is replaced with a single echo that branches back to it:
Echo(240, 5)
Echo(304, 5)

Reschedule/Rename (Cooper, PLDI ’99) (Debray, TOPLAS ’00) (Lau, CASES ’03)
Two sequences may match only after renaming registers and rescheduling instructions:
$1 ← $2 + $3        $10 ← $5 + $4
$11 ← $7 * $8       $11 ← $9 * $6
$8 ← $7 * $1        $6 ← $9 * $10
$1 ← $11 / $8       $10 ← $11 / $6
$1 ← $8 + 1         $10 ← $6 + 1
Rename: $4 : $3, $5 : $2, $6 : $8, $9 : $7, $10 : $1, $11 : $11

DFG Isomorphism: Our Approach
- Represent sequences as data flow graphs
- Identify repeated isomorphic subgraphs
- Replace subgraphs with echo instructions
[Figure: the differently named and scheduled sequences above all map to the same five-node DFG (+, *, *, /, +).]

Isomorphic Subgraph Identification
Edge Contraction (Kastner, ICCAD ’01): consider a new subgraph for each DFG edge.

Isomorphic Subgraph Identification
Compute an independent set for each edge type:
- NP-complete problem
- Iterative improvement algorithm (Kirovski, DAC ’98)

Isomorphic Subgraph Identification
Replace the most frequently occurring pattern with a template.
[Figure legend: original DFG edge; data dependencies that cross template boundaries; data dependencies incident on templates.]

Isomorphic Subgraph Identification
Edge contraction in the presence of templates:
- Generate new templates along bold edges
- The test for template equivalence is DAG isomorphism
- Used the publicly available VF2 algorithm

Isomorphic Subgraph Identification

Isomorphic Subgraph Identification
Replace the most frequently occurring pattern with a template.
[Figure legend: original DFG edge; data dependencies incident on templates; data dependencies that cross template boundaries.]

Register Allocation
Isomorphic templates must have identical usage of registers; mismatches must be repaired with shuffle or spill code.

Register Allocation
Code reuse constraints may work against code size: in the example, each template eliminates 3 instructions, but 5 shuffle/spill operations are required.
The general problem is very complicated, and existing allocation techniques are not applicable.
Present status: the allocator is a work-in-progress.

Isomorphic Subgraph Identification
After register allocation, replace subgraphs with echo instructions.

Experimental Framework
- Built subgraph identification into the Machine-SUIF framework
- The pass is placed between instruction selection and register allocation
- The current implementation supports Alpha as the target, which allows for future integration with the SimpleScalar simulator
- Our goal is to evaluate the effectiveness of subgraph identification

Experimental Methodology
Without allocation in place, we cannot:
- Estimate where shuffle/spill code will be inserted at template boundaries
- Determine which copy instructions will be coalesced
But we can:
- Make assumptions regarding the starting point for register allocation

Two Approaches to Coalescing
Pessimistic coalescing (most allocators):
- Begin with all copy instructions in place
- Coalesce copies when safe
Optimistic coalescing (Park & Moon, PACT ’98):
- Initially coalesce ALL copy instructions
- Re-introduce coalesced copies to avoid spilling live ranges whenever possible

Assumptions and Measurement
- Pessimistic assumption: no copy instructions are coalesced
- Optimistic assumption: ALL copy instructions are coalesced
- Compute the number of DFG operations before and after the compression step
PU: Pessimistic, Uncompressed    OU: Optimistic, Uncompressed
PC: Pessimistic, Compressed      OC: Optimistic, Compressed

Benchmarks
Taken from the MediaBench and MiBench application suites.

Benchmark    Description
ADPCM        Adaptive Differential Pulse Code Modulation
Blowfish     Symmetric Block Cipher with Variable Key Length
Epic         Image Data Compression Utility
G721         Voice Compression
JPEG         Image Compression and Decompression
MPEG2 Dec    MPEG2 Decoder
MPEG2 Enc    MPEG2 Encoder
Pegwit       Public Key Encryption and Authentication

Experimental Results
PU: Pessimistic, Uncompressed    OU: Optimistic, Uncompressed
PC: Pessimistic, Compressed      OC: Optimistic, Compressed

Runtime Considerations
The algorithm ran efficiently (a few seconds) for most benchmarks, with several notable exceptions sharing four common features:
- Large DFGs
- User-defined macros
- Unrolled loops
- Cyclic shifting of parameters
sha1.c (Pegwit): for one DFG, compilation time was in excess of 3 hours.

sha1.c Runtime Considerations

#define R0(v, w, x, y, z, i) { z += …; w = … }

void SHA1Transform( unsigned long state[5], … )
{
    unsigned long a = state[0], b = state[1], c = state[2],
                  d = state[3], e = state[4];
    R0(a, b, c, d, e, 0);  R0(e, a, b, c, d, 1);
    R0(d, e, a, b, c, 2);  R0(c, d, e, a, b, 3);
    R0(b, c, d, e, a, 4);  R0(a, b, c, d, e, 5);
    …
    R0(a, b, c, d, e, 15);
}

sha1.c Runtime Considerations

#define R0(v, w, x, y, z, i) { z += …; w = … }

void SHA1Transform( unsigned long state[5], … )
{
    unsigned long a = state[0], b = state[1], c = state[2],
                  d = state[3], e = state[4], tmp;
    for( unsigned long i = 0; i < 16; i++ ) {
        R0(a, b, c, d, e, i);
        tmp = e; e = d; d = c; c = b; b = a; a = tmp;
    }
}

Compilation time was reduced to seconds.

Conclusion
- Echo instructions: compression at a minimal hardware cost; the performance overhead is two branches per echo
- Compiler optimization: identify redundancy via subgraph isomorphism; new challenges for register allocation
- Experiments: significant redundancy observed in the compiler’s IR