Project Presentation by Joshua George Advisor – Dr. Jack Davidson.

Presentation transcript:

Project Presentation by Joshua George Advisor – Dr. Jack Davidson

Exploiting hardware for loop optimizations
- ZOLB: Zero-Overhead Loop Buffers
- Decrement, compare and jump instructions
- Compare-with-zero instructions

Background
- VPO (Very Portable Optimizer) operates on a low-level, machine-independent form: RTLs.
- ZOLB: many DSPs have a compiler-managed cache for loops.
  - Reduces loop overhead: no branch.
  - The loop body is buffered in an internal buffer, saving on instruction fetch.
  - Can reduce code size.
  - Power overhead remains low.
- Decrement, compare and jump: e.g. the loop instruction on x86, banz on the tms320c54x.
- Compare with zero: e.g. SPARC.

Status
- Added support for repeat instructions (ZOLB) on the tms320c54x.
- Added support for converting loops to count down, so as to make use of decrement, compare and jump instructions; retargeted to three machines: x86, SPARC and the tms320c54x.

Implementation - guidelines
- Add as little code as possible to the machine-dependent part (md), doing most of the implementation in the machine-independent part (lib).
- Design the interface between lib and md to allow for possible issues with other targets.

Issues (ZOLB)
How to describe the repeat block effectively? An example:

BRC=10;                      (Block Repeat Count)
RSA=L1; REA=EN[L1];          (Repeat Start Address and Repeat End Address)
L1: w[0]=w[0]+1; W[w[0]]=0;
PC=BRC>0,L1; BRC=BRC-1;
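The repeat-count semantics above can be modelled in a few lines of Python. This is an illustrative simulation of the RTL behaviour, not VPO code; note that with BRC = N the body executes N + 1 times, since the test-and-decrement happens after each pass.

```python
def block_repeat(brc, body):
    """Model the block-repeat RTLs: run the body once, then keep
    branching back to L1 while BRC > 0, decrementing BRC each time."""
    body()                # first pass through the repeat block at L1
    while brc > 0:        # PC = BRC > 0, L1
        brc -= 1          # BRC = BRC - 1
        body()

executions = []
block_repeat(10, lambda: executions.append(1))
print(len(executions))  # 11: BRC = N yields N + 1 body executions
```

This off-by-one matters later: the example slide uses rpt #9 for a 10-iteration loop.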

Issues (ZOLB)
How to bind the rpt instruction to the start of the repeat block? (On the tms320c54x, the start of the repeat block is implicitly the instruction after the rpt instruction.) Changing VPO to support binding one instruction to the next would be overkill. Solution: let fixentry() take care of this, after VPO has finished its optimization loop.

Issues (ZOLB)
How to describe unrepeatable instructions? The machine description sets the UNREPEATABLE flag for each unrepeatable instruction. It also provides a list of instructions that disappear after conversion; VPO ignores instructions on this list when checking for unrepeatability.

Issues (ZOLB)
How to specify the end label? If we simply label the next block, VPO won't print the label, since it cannot see a jump to that label. Solution: use a mangled version of the start label (e.g. L1_end) as the end label for the rpt instruction, and output the same mangled label when the last instruction in the repeat block is encountered in fixentry. (Note that this last instruction carries the start label.)
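The mangling scheme can be sketched as follows; the helper names here are hypothetical, chosen only to illustrate the idea, not VPO's real functions.

```python
def end_label(start_label):
    """Mangle the repeat block's start label into its end label,
    e.g. 'L1' -> 'L1_end'. (Hypothetical helper.)"""
    return start_label + "_end"

def emit_rpt_labels(start_label):
    """The rpt instruction names the mangled end label up front, so no
    jump to it is ever needed; the same mangled label is emitted again
    when the last instruction of the repeat block (which carries the
    start label) is printed."""
    return [f"rptb {end_label(start_label)}",   # repeat until end label
            f"{start_label}:",                  # start of the repeat block
            f"{end_label(start_label)}:"]       # printed by fixentry

print(emit_rpt_labels("L1"))
```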

Implementation
Information supplied by md to lib:
- Which instructions are unrepeatable.
- The number of instructions that would remain after the conversion.
- The list of RTLs involved in the compare and jump.
- The elements involved in the compare (the register, the expression it is compared with, and the relational operator); this helps determine the iteration count.
- How to identify a comparison RTL.
- How to initialize a register to an expression.
Note: many md parts were already in place, e.g. the loop strength reduction support code.

Implementation (cont.)
The md does the actual insertion of the rpt RTLs:

b[0]=10;                      (Block Repeat Count)            → sr_init()
b[1]=L1; b[2]=EN[L1];         (Repeat Start and End Address)  → md_convert_rpt_block
L1: w[0]=w[0]+1; W[w[0]]=0;
PC=b[0]>0,L1; b[0]=b[0]-1;                                    → md_convert_rpt_block

(The last RTL is simply converted to a label when outputting the assembly.)

Implementation
What is done in lib?
- Ensuring that the instructions in the loop are repeatable.
- Counting the number of instructions that will remain in the loop after conversion; this lets md decide whether to convert the loop to a single-repeat instruction.
- Analyzing the uses and lifetime of the loop control variable to determine whether the control variable increments can disappear.
- Finding the iteration count of the loop.
- Identifying the loop control variable and its increment points.
- Finding the loop exit block.
Note: much of this functionality was already present in the VPO lib.
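The repeatability check and remaining-instruction count can be sketched like this; the flag name and the dictionary representation of RTLs are simplified stand-ins for VPO's internals.

```python
UNREPEATABLE = 0x1  # flag set per-instruction by the machine description

def count_remaining(loop_rtls, disappearing):
    """Count the instructions that survive conversion to a repeat block,
    or return None if any surviving instruction is unrepeatable.
    'disappearing' is the md-supplied list of RTLs (e.g. the compare
    and jump) that vanish after the conversion and are ignored here."""
    remaining = 0
    for rtl in loop_rtls:
        if rtl in disappearing:
            continue                     # will not exist after conversion
        if rtl["flags"] & UNREPEATABLE:
            return None                  # loop cannot become a repeat block
        remaining += 1
    return remaining

loop = [{"op": "st #0, *ar0+", "flags": 0},
        {"op": "bc L1, Alt",   "flags": 0}]
print(count_remaining(loop, disappearing=[loop[1]]))  # 1
```

A result of 1 is what lets md choose the single-instruction rpt form in the example below.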

Example
Before conversion: 5 instructions, including a branch.

w[0]=_A;                      → stm #0, ar0
L1: w[0]=w[0]+1; W[w[0]]=0;   → st #0, *ar0+
r[0]=(w[0]{24)}24;            → ld *(ar0), A
r[0]=r[0]-(_A+10);            → sub _A+#10, A
PC=r[0]<0,L1;                 → bc L1, Alt

After conversion to a single-instruction repeat: only 3 instructions. Dynamic instruction throughput becomes much higher once the instruction is in the pipeline.

w[0]=_A;                          → stm #0, ar0
n[1]=L6; n[2]=EL[L6]; n[0]=9;     → rpt #9
L6: w[0]=w[0]+1; W[w[0]]=0;       → st #0, *ar0+
PC=n[0]>0,L6; n[0]=n[0]-1;

Future work
- How to prevent VPO from changing the block size (e.g. when spills are added)?
- In the single-repeat instruction, how to add support for the auto-increment direct addressing mode, e.g.:
    rpt #123
    mvdk *ar1, #800h

Count-down loops
Objective: convert loops to count down to zero, instead of counting up to a constant or counting down to a constant.
Reasoning:
- Most architectures have a single compare-to-zero instruction; comparing with other values needs at least one more instruction.
- Some architectures can decrement, compare and jump in a single instruction!
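At the source level the transformation looks like the sketch below; this is a hypothetical Python model of the rewrite, not the RTL-level implementation.

```python
def iteration_count(init, bound, step):
    """Iterations of 'for (i = init; i < bound; i += step)'; derived by
    lib from the comparison elements that md supplies (register,
    compared expression, relational operator)."""
    if bound <= init:
        return 0
    return -(-(bound - init) // step)      # ceiling division

def count_up(init, bound, step, body):
    i = init
    while i < bound:                       # compare against a constant
        body()
        i += step

def count_down(init, bound, step, body):
    n = iteration_count(init, bound, step)
    while n:                               # compare with zero is cheaper, or
        body()                             # folds into a single decrement,
        n -= 1                             # compare and jump on some targets

ups, downs = [], []
count_up(0, 10, 3, lambda: ups.append(1))
count_down(0, 10, 3, lambda: downs.append(1))
print(len(ups), len(downs))  # 4 4
```

The conversion is only valid when the iteration count can be computed before the loop and the original control variable's other uses are accounted for, which is exactly the analysis lib performs below.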

Implementation
Information supplied by md to lib:
- The list of registers that are candidates to form the count-down-to-zero induction variable. (E.g. on x86 it is advantageous to do this conversion only if the count-down uses the ecx register.)
- Whether this conversion is worthwhile on this loop.

Implementation (cont.)
Information supplied by md to lib:
- How to initialize a register to an expression.
- How to decrement a register.
- The elements of a comparison.
- How to identify a comparison RTL.
- The relational operator used for comparing to zero.

Implementation
What is done in lib?
- Finding the expression that represents the iteration count.
- Identifying the loop control variable and its increment points.
- Analyzing the uses and lifetime of the loop control variable to determine whether the conversion is worthwhile (the decision is made by md).
- Identifying the exit block.
- Spilling/reloading the new loop control variable if needed.

Implementation
What is done in lib?
- Analyzing the list of candidate registers to select the best one for this loop:
  - First preference: the current control variable, provided it is free.
  - If worthwhile, any other free register.
  - Last option: a register that is live across the loop but not used within it. This register will have to be spilled in the loop pre-header and reloaded at loop exit.
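That preference order can be sketched as a small selection routine; the function name and its data model are hypothetical, invented for illustration.

```python
def pick_countdown_register(candidates, control_var, free, live_across,
                            used_in_loop):
    """Choose a register for the count-down induction variable using the
    preference order above. Returns (register, needs_spill), or
    (None, False) if no candidate is usable."""
    if control_var in candidates and control_var in free:
        return control_var, False      # 1st: the current control variable
    for r in candidates:
        if r in free:
            return r, False            # 2nd: any other free register
    for r in candidates:
        if r in live_across and r not in used_in_loop:
            return r, True             # 3rd: spill in pre-header,
    return None, False                 #      reload at loop exit

reg, spill = pick_countdown_register(
    candidates=["ecx"], control_var="eax",
    free=[], live_across=["ecx"], used_in_loop=[])
print(reg, spill)  # ecx True
```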

Performance - SPEC on x86

Analysis
Average performance improved after applying the count-down optimization.

Conclusion
- More fine-tuning is needed to realize substantial performance gains.
- The primary objective, adding easily retargetable support for these loop optimizations, was accomplished: retargeted to three targets!

Acknowledgements
- Dr. Jack Davidson (advisor)
- Jason Hiser
- Clark Coleman