Facilitating Compiler Optimizations Through the Dynamic Mapping of Alternate Register Structures
Chris Zimmer, Steve Hines, Prasad Kulkarni, Gary Tyson, David Whalley
Florida State University
CASES 2007

Motivation
Embedded processors have fewer registers, while compiler optimizations increase register pressure. This makes it difficult to apply aggressive compiler optimizations on embedded systems.

Vector Multiply Example
Even before aggressive optimizations, 60% of the available registers are already used, and further optimizations like loop unrolling and software pipelining are inhibited.

int A[1000], B[1000];
void vmul() {
  int I;
  for (I = 2; I < 1000; I++)
    B[I] = A[I] * B[I-2];
}

.L3:
  ldr r1, [r2, r3, lsl #2]
  ldr r12, [r4], #4
  mul r0, r12, r1
  str r0, [r5, r3, lsl #2]
  add r3, r3, #1
  cmp r3, #1000
  blt .L3

Application Configurable Processors
Exploit common reference patterns found in code: small register files mimic these reference behaviors, and a map table provides register redirection. The architecture is changed to add more registers, but with minimal impact on the ISA; in particular, operand sizes are not increased.

Architectural Modifications
(Diagram: the architected register file is augmented with customized register structures, queues Q1-Q3, a stack Q4, and a circular buffer Q5, plus a map table indexed by register specifiers R0-R15; in the example shown, R1 is mapped to Q1.)

Software Pipelining
Software pipelining is not often found in embedded compilers. It reduces the overall cycle time of a loop: it extracts iterations and consumes stalls, but it also consumes registers.

Software Pipelining Example
Stalls are present when the loop runs:

.L3:
  ldr r1, [r2, r3, lsl #2]
  ldr r12, [r4], #4
  stall
  mul r0, r12, r1
  stall
  str r0, [r5, r3, lsl #2]
  add r3, r3, #1
  cmp r3, #1000
  blt .L3

int A[1000], B[1000], C[1000];
void vmul() {
  int I;
  for (I = 2; I < 1000; I++)
    B[I] = A[I] * C[I];
}

.L3:
  ldr r1, [r2, r3, lsl #2]
  ldr r12, [r4], #4
  mul r0, r12, r1
  str r0, [r5, r3, lsl #2]
  add r3, r3, #1
  cmp r3, #1000
  blt .L3
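To make the register cost concrete, the following is a hedged source-level sketch (not taken from the slides) of a two-stage software pipeline of this loop: the loads for iteration I+1 are issued while iteration I multiplies and stores, which hides the load-use stalls but requires extra temporaries (a0, c0, a1, c1 are illustrative names) to keep the in-flight values alive.

    int A[1000], B[1000], C[1000];

    /* Two-stage software-pipelined form of the vmul loop: while iteration I
       multiplies and stores, the loads for iteration I+1 are already issued.
       The extra temporaries are the added register pressure described above. */
    void vmul_pipelined(void)
    {
        int I;
        int a0 = A[2], c0 = C[2];              /* prolog: first iteration's loads */
        for (I = 2; I < 999; I++) {
            int a1 = A[I + 1], c1 = C[I + 1];  /* next iteration's loads */
            B[I] = a0 * c0;                    /* current iteration's multiply and store */
            a0 = a1; c0 = c1;
        }
        B[999] = a0 * c0;                      /* epilog: finish the last iteration */
    }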

Instruction
Goal: minimal modification to the existing instruction set, with single-cycle instruction latency. Method: add a single instruction (qmap) to the ISA that is used to map and unmap a common register specifier into a customized register structure, e.g. qmap r3, #4, q3.
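A minimal C sketch of what such an instruction might do to the map table; the operand interpretation of qmap r3, #4, q3 used here (register specifier, structure size, structure identifier) and the unmapping form are assumptions for illustration, not the paper's definition.

    #define NUM_ARCH_REGS 16

    /* 0 means the specifier uses the architected register file; a nonzero
       value is the id of the customized structure (queue, stack, or buffer)
       that the specifier is redirected to. */
    static int map_table[NUM_ARCH_REGS];
    static int struct_size[NUM_ARCH_REGS];

    /* Model of qmap r3, #4, q3: redirect specifier 3 to structure q3, size 4. */
    void qmap(int specifier, int size, int structure_id)
    {
        map_table[specifier] = structure_id;
        struct_size[specifier] = size;
    }

    /* Assumed unmapping form: clear the redirection for a specifier. */
    void qunmap(int specifier)
    {
        map_table[specifier] = 0;
    }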

Architectural Modifications
(Diagram: the register file alongside queues Q1-Q3, a destructive queue Q4, a circular buffer Q5, and the map table indexed by register specifiers R0-R15.) An access to R0, which has no mapping in the table, would get its data from the register file. R1 is mapped into Q1 and would retrieve its data from there.
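A minimal C sketch of the lookup just described, assuming a 16-entry map table and modeling each customized structure by the single value it currently exposes (e.g., the head of a queue); names and sizes are illustrative, not from the paper.

    #include <stdint.h>

    #define NUM_ARCH_REGS 16   /* ARM architected registers r0-r15 */
    #define NUM_STRUCTS    6   /* slot 0 unused; 1-5 stand for Q1-Q5 */

    /* map_table[r] == 0 means specifier r uses the architected register file;
       a nonzero value is the customized structure it is redirected to. */
    static int map_table[NUM_ARCH_REGS];
    static uint32_t arch_regs[NUM_ARCH_REGS];
    static uint32_t struct_front[NUM_STRUCTS];  /* value each structure currently
                                                   exposes, e.g. the head of a queue */

    /* Every register read consults the map table: R0 (unmapped) reads the
       register file, while a mapped specifier such as R1 -> Q1 reads its
       customized structure instead. */
    uint32_t read_register(int specifier)
    {
        int s = map_table[specifier];
        if (s == 0)
            return arch_regs[specifier];
        return struct_front[s];
    }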

Software Pipelining Example
(Diagram: the software-pipelined loop with its in-flight values held in register queues Q1, Q2, and Q3.)

int A[1000], B[1000], C[1000];
void vmul() {
  int I;
  for (I = 2; I < 1000; I++)
    B[I] = A[I] * C[I];
}

Register Usage

Results: multiply latency varied, load latency set at four.

Results: load latency varied, multiply latency set at four.

Conclusions
Customized register structures reduce register pressure. Software pipelining is viable in resource-constrained environments. Performance can be improved with minor impact on the ISA.

Extras

Reference Behaviors
Stack reference behavior:

ldr r1, [r6, r4, lsl #4]
ldr r12, [r6, r4, lsl #8]
ldr r8, [r6, r4, lsl #12]
str r8, [r3, r4, lsl #16]
str r12, [r3, r4, lsl #20]
str r1, [r3, r4, lsl #24]

Application Configurable Architecture
Application configurable processors are designed using a mapping table similar to the register rename table found in many out-of-order implementations. The map table is read during every access to the architected register file; this determines whether a register specifier refers to the original architected register file or to a customized register structure.

Application Configurable Architecture
The customized register files are small, but they efficiently manage values that would otherwise require many architected registers. They can mimic queues, stacks, and circular buffers, and these structures are accessed using the same register specifiers that are used to access the architected register file.
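As an illustration of one such structure, here is a hedged C sketch of a stack-style register structure, assuming that writes to the mapped specifier push and reads pop; the depth and helper names are illustrative. This matches the LIFO pattern of the stack reference behavior shown on the earlier slide, where values are stored in the reverse of the order in which they were loaded.

    #include <stdint.h>

    #define STACK_DEPTH 4    /* illustrative depth for a small register stack */

    /* A stack-style customized register structure: writes push, reads pop,
       so one register specifier can stand in for r1, r12, and r8 in the
       stack reference behavior example. */
    struct reg_stack {
        uint32_t data[STACK_DEPTH];
        int top;             /* number of valid entries */
    };

    void stack_write(struct reg_stack *s, uint32_t value)
    {
        s->data[s->top++] = value;   /* overflow handling omitted in this sketch */
    }

    uint32_t stack_read(struct reg_stack *s)
    {
        return s->data[--s->top];
    }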

Remove Reference Behaviors
Stack reference behavior (before):

ldr r1, [r6, r4, lsl #4]
ldr r12, [r6, r4, lsl #8]
ldr r8, [r6, r4, lsl #12]
str r8, [r3, r4, lsl #16]
str r12, [r3, r4, lsl #20]
str r1, [r3, r4, lsl #24]

After mapping r1 to a stack structure that holds the values formerly in r1, r12, and r8, every reference uses the single specifier r1, freeing up r8 and r12 for use:

ldr r1, [r6, r4, lsl #4]
ldr r1, [r6, r4, lsl #8]
ldr r1, [r6, r4, lsl #12]
str r1, [r3, r4, lsl #16]
str r1, [r3, r4, lsl #20]
str r1, [r3, r4, lsl #24]

Remove Qmap Instruction
(Diagram: the values formerly in r8, r12, and r1 now reside in structure q0, freeing up r8 and r12 for use.)

Modulo Scheduling
For our work we used modulo scheduling. This uses the dependences and latencies of the loop instructions to generate a modulo-scheduled loop; the prolog and epilog are then built from this schedule. The prolog and epilog require register renaming of loop-carried dependences to keep the loop correct, and renaming in embedded processors is often not possible.

Register Renaming Due to Software Pipelining
Renaming doesn't work: there are not enough registers. Rotating registers would require a significant rewrite of the embedded ISA. Instead, the loop-carried values can simply be mapped into a register queue that holds them across several iterations.
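A hedged source-level sketch of that idea for the running vmul loop (helper names, queue depth, and the push/pop modeling are illustrative assumptions): the value produced in one iteration of the kernel is pushed into a small register queue and popped when it is consumed in a later iteration, so a single specifier covers all the in-flight copies that renaming would otherwise need.

    #include <stdint.h>

    #define QDEPTH 4                       /* illustrative register-queue depth */

    static int32_t q_data[QDEPTH];
    static int q_head, q_tail;

    static void q_push(int32_t v) { q_data[q_tail] = v; q_tail = (q_tail + 1) % QDEPTH; }
    static int32_t q_pop(void)    { int32_t v = q_data[q_head]; q_head = (q_head + 1) % QDEPTH; return v; }

    int A[1000], B[1000], C[1000];

    /* Software-pipelined vmul with the value that crosses iterations of the
       kernel held in a register queue instead of renamed temporaries. */
    void vmul_with_queue(void)
    {
        int I;
        q_push(A[2] * C[2]);               /* prolog: first result enters the queue */
        for (I = 2; I < 999; I++) {
            q_push(A[I + 1] * C[I + 1]);   /* produce the next iteration's value */
            B[I] = q_pop();                /* consume the oldest in-flight value */
        }
        B[999] = q_pop();                  /* epilog: drain the queue */
    }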

Results: Register Savings
As instruction latencies grow, more iterations of the loop are extracted to hide the latency. The extra registers that would be required to perform renaming measured from 25% to 200% of the available registers on the ARM.