Instruction Selection CS 671 February 19, 2008

The Back End

Essential tasks:
- Instruction selection: map the low-level IR to actual machine instructions
  - Not necessarily a 1-1 mapping (CISC architectures, addressing modes)
- Register allocation: the low-level IR assumes unlimited registers
  - Map to the actual register resources of the machine
  - Goal: maximize use of registers

Instruction Selection

The low-level IR differs from the machine ISA (Instruction Set Architecture). Why?
- Allows different back ends
- Abstraction, to make optimization easier

Differences between IR and ISA:
- IR: simple, uniform set of operations
- ISA: many specialized instructions; often a single instruction does the work of several IR operations

Instruction Selection

Easy solution: map each IR operation to an instruction (or a short sequence of instructions); memory operations may need to be included. Problem: inefficient use of the ISA.

    x = y + z;          mov y, r1
                        mov z, r2
                        add r2, r1
                        mov r1, x
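As a rough illustration of this one-IR-operation-at-a-time approach, here is a minimal sketch in Python; the three-address tuple format, the register namer, and the mnemonics are assumptions made for this example, not the course compiler's actual representation:

    # A sketch of the "easy" approach: expand each IR operation independently,
    # ignoring what the neighbouring operations do.
    def select_naive(ir):
        asm, counter = [], 0
        def fresh():
            nonlocal counter
            counter += 1
            return f"r{counter}"
        for op, dst, a, b in ir:
            if op == "add":
                r1, r2 = fresh(), fresh()
                asm += [f"mov {a}, {r1}",     # load first operand
                        f"mov {b}, {r2}",     # load second operand
                        f"add {r2}, {r1}",    # r1 <- r1 + r2
                        f"mov {r1}, {dst}"]   # store the result
            # ... one case per IR operation ...
        return asm

    print("\n".join(select_naive([("add", "x", "y", "z")])))
    # mov y, r1 / mov z, r2 / add r2, r1 / mov r1, x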

Instruction Selection

Instruction sets: an ISA often has many ways to do the same thing. An idiom is a single instruction that represents a common pattern or sequence of operations.

Consider a machine with the following instructions:

    add   r2, r1            r1  ← r1 + r2
    muli  c,  r1            r1  ← r1 * c
    load  r2, r1            r1  ← *r2
    store r2, r1            *r1 ← r2
    movem r2, r1            *r1 ← *r2
    movex r3, r2, r1        *r1 ← *(r2 + r3)

(The memory dereference *r2 is sometimes written [r2].)

Example

Generate code for: a[i+1] = b[j]

Simplifying assumptions:
- All variables are globals (no stack offset computation)
- All variables are in registers (ignore loads/stores of the variables themselves)

IR:
    t1 = j*4
    t2 = b+t1
    t3 = *t2
    t4 = i+1
    t5 = t4*4
    t6 = a+t5
    *t6 = t3

Possible Translation

    Address of b[j]:        muli 4, rj
                            add rj, rb
    Load value b[j]:        load rb, r1
    Address of a[i+1]:      addi 1, ri
                            muli 4, ri
                            add ri, ra
    Store into a[i+1]:      store r1, ra

Another Translation

    Address of b[j]:        muli 4, rj
                            add rj, rb
    (no load)
    Address of a[i+1]:      addi 1, ri
                            muli 4, ri
                            add ri, ra
    Store into a[i+1]:      movem rb, ra        direct memory-to-memory operation

Yet Another Translation

    Index of b[j]:          muli 4, rj
    (no load)
    Address of a[i+1]:      addi 1, ri
                            muli 4, ri
                            add ri, ra
    Store into a[i+1]:      movex rj, rb, ra

The address of b[j] is computed inside the memory-move operation:

    movex rj, rb, ra        *ra ← *(rj + rb)

Different Translations

Why is the last translation preferable?
- Fewer instructions
- Instructions have different costs
  - Space cost: size of each instruction
  - Time cost: number of cycles to complete

Example costs:

    add   r2, r1            1 cycle
    muli  c,  r1            10 cycles
    load  r2, r1            3 cycles
    store r2, r1            3 cycles
    movem r2, r1            4 cycles
    movex r3, r2, r1        5 cycles

Idioms are cheaper than their constituent parts.
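To make the comparison concrete, the three translations above can be costed against this table with a few lines of Python; this assumes addi has the same cost as add, which the slides do not state explicitly:

    # Cycle costs from the table above (addi assumed equal to add).
    cost = {"muli": 10, "add": 1, "addi": 1, "load": 3, "store": 3,
            "movem": 4, "movex": 5}

    translations = {
        "load/store": ["muli", "add", "load", "addi", "muli", "add", "store"],
        "movem":      ["muli", "add", "addi", "muli", "add", "movem"],
        "movex":      ["muli", "addi", "muli", "add", "movex"],
    }

    for name, ops in translations.items():
        print(name, sum(cost[op] for op in ops), "cycles")
    # load/store: 29, movem: 27, movex: 27 -- movex ties on time but wins on
    # space, since it uses the fewest instructions.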

Wacky x86 Idioms

What does this do?

    xor %eax, %eax

Why not use this?

    mov $0, %eax

Answer: immediate operands are encoded in the instruction, making it bigger and therefore more costly to fetch and execute.

More Wacky x86 Idioms

What does this do?

    xor %ebx, %eax        eax = b ⊕ a
    xor %eax, %ebx        ebx = (b ⊕ a) ⊕ b = a
    xor %ebx, %eax        eax = a ⊕ (b ⊕ a) = b

It swaps the values of %eax and %ebx. Why do it this way? No need for an extra register!
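The same trick written out with plain Python integers (a and b standing in for %eax and %ebx) shows why the three xors end in a swap:

    # XOR swap: exchange a and b without a scratch register.
    a, b = 0b1010, 0b0110      # a plays %eax, b plays %ebx
    a ^= b                     # a = a xor b
    b ^= a                     # b = b xor (a xor b) = original a
    a ^= b                     # a = (a xor b) xor original a = original b
    assert (a, b) == (0b0110, 0b1010)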

Architecture Differences

RISC (PowerPC, MIPS):
- Arithmetic operations require registers
- Explicit loads and stores

    ld  8(r0), r1
    add $12, r1

CISC (x86):
- Complex instructions (e.g., MMX and SSE)
- Arithmetic operations may refer to memory
- BUT, only one memory operation per instruction

    add 8(%esp), %eax

Addressing Modes

Problem: some architectures (x86) allow different addressing modes in instructions other than load and store:

    add $1, 0xaf0080                 Addr = 0xaf0080
    add $1, (%eax)                   Addr = contents of %eax
    add $1, -8(%eax)                 Addr = %eax - 8
    add $1, -8(%eax, %ebx, 2)        Addr = (%eax + %ebx*2) - 8

Minimizing Cost

Goal: find instructions with low overall cost.

Difficulty: how do we find these patterns? A machine idiom may subsume IR operations that are not adjacent. For example, movem rb, ra covers both the load (t3 = *t2) and the store (*t6 = t3), which are far apart in the IR:

    t1 = j*4
    t2 = b+t1
    t3 = *t2
    t4 = i+1
    t5 = t4*4
    t6 = a+t5
    *t6 = t3

Idea: go back to a tree representation. Convert the computation into a tree and match parts of the tree.

Tree Representation

Build a tree for a[i+1] = b[j]:

            store
           /     \
          +       load
         / \        |
        a   *       +
           / \     / \
          +   4   b   *
         / \         / \
        i   1       j   4

Goal: find parts of the tree that correspond to machine instructions.

    IR
    t1 = j*4
    t2 = b+t1
    t3 = *t2
    t4 = i+1
    t5 = t4*4
    t6 = a+t5
    *t6 = t3
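One simple way to hold this tree inside the compiler, sketched in Python; the Node class, the leaf helper, and the operator spellings are assumptions made for these examples, not the course compiler's actual data structures:

    # Minimal expression-tree representation for a[i+1] = b[j].
    class Node:
        def __init__(self, op, *kids):
            self.op = op        # "store", "load", "+", "*", or a leaf name
            self.kids = kids    # child nodes, left to right

    def leaf(name):
        return Node(name)

    tree = Node("store",
                Node("+", leaf("a"),
                     Node("*", Node("+", leaf("i"), leaf("1")), leaf("4"))),
                Node("load",
                     Node("+", leaf("b"),
                          Node("*", leaf("j"), leaf("4")))))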

Tiles

Idea: a tile is a contiguous piece of the tree that corresponds to a machine instruction. For example, the tile covering the store node and the load node corresponds to

    movem rb, ra

with the rest of the tree computing the two addresses into rb and ra.

Tiling

Tiling: cover the tree with tiles. One tiling of the tree yields:

    muli 4, rj
    add rj, rb
    addi 1, ri
    muli 4, ri
    add ri, ra
    movem rb, ra

Tiling

Two other tilings of the same tree:
- Cover the load and the store with separate tiles (the same five arithmetic instructions compute the addresses):
      load rb, r1
      store r1, ra
- Cover the store, the load, and the + beneath the load with a single tile:
      movex rj, rb, ra

Generating Code

Given a tiling of a tree (a tiling implements a tree if it covers all nodes in the tree):
- Walk the tree in post-order
- Emit the machine instructions for each tile
- Tie the tile boundaries together with registers
- Note: the order of children matters, e.g. in +(b, *(j, 4))
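A minimal sketch of this emission pass in Python, reusing the hypothetical Node/tree objects from above; only two tile shapes are handled (the store/load pair and a generic one-node fallback), just to show how registers tie tile boundaries together:

    # Post-order emission: children are emitted first, then the tile rooted at
    # this node; the registers returned by the children tie the tiles together.
    def emit(node, out, new_reg):
        if not node.kids:
            # Leaf: a variable is assumed to live in register r<name>,
            # a numeric leaf becomes an immediate operand.
            return ("$" + node.op) if node.op.isdigit() else ("r" + node.op)
        if node.op == "store" and node.kids[1].op == "load":
            ra = emit(node.kids[0], out, new_reg)          # destination address
            rb = emit(node.kids[1].kids[0], out, new_reg)  # source address
            out.append(f"movem {rb}, {ra}")                # *ra <- *rb
            return None
        # Fallback: a one-node tile per operator, generic three-address form.
        kid_regs = [emit(k, out, new_reg) for k in node.kids]
        dst = new_reg()
        out.append(f"{node.op} {', '.join(kid_regs)}, {dst}")
        return dst

    code = []
    counter = iter(range(1, 100))
    emit(tree, code, lambda: f"t{next(counter)}")
    print("\n".join(code))
    # Emits generic three-address code for the two addresses, then
    # "movem t5, t3" for the store/load pair at the root.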

Tiling

What's hard about this?
- Defining the system of tiles in the compiler
- Finding a tiling that implements the tree (covers all nodes in the tree)
- Finding a "good" tiling

Different approaches:
- Ad-hoc pattern matching
- Automated tools

To guarantee that every tree can be tiled, provide a tile for each individual kind of node, e.g.

    +(t1, t2):    mov t1, t3
                  add t2, t3

Algorithms

Goal: find a tiling with the fewest tiles.

Ad-hoc top-down algorithm:
- Start at the top of the tree
- Find the largest tile that matches the top node
- Tile the remaining subtrees recursively

    Tile(n) {
      if ((op(n) == PLUS) && (left(n).isConst())) {
        Code c = Tile(right(n));
        c.append(ADDI left(n) right(n));
        return c;
      }
      // ... similar cases for the other tiles, largest first ...
    }
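The same greedy "maximal munch" idea in runnable form, again a sketch over the hypothetical Node/tree objects above with a deliberately tiny tile set and the example machine's mnemonics:

    # Greedy maximal munch: at each node take the largest matching tile,
    # then recurse into the subtrees the tile leaves uncovered.
    def munch(node, out):
        k = node.kids
        if node.op == "store" and k[1].op == "load":       # biggest tile: movem
            ra = munch(k[0], out)
            rb = munch(k[1].kids[0], out)
            out.append(f"movem {rb}, {ra}")                # *ra <- *rb
            return None
        if node.op == "+" and not k[1].kids and k[1].op.isdigit():
            r = munch(k[0], out)
            out.append(f"addi {k[1].op}, {r}")             # r <- r + const
            return r
        if node.op == "*" and not k[1].kids and k[1].op.isdigit():
            r = munch(k[0], out)
            out.append(f"muli {k[1].op}, {r}")             # r <- r * const
            return r
        if node.op == "+":
            r1, r2 = munch(k[0], out), munch(k[1], out)
            out.append(f"add {r2}, {r1}")                  # r1 <- r1 + r2
            return r1
        if not k:
            return "r" + node.op                           # variable already in a register
        raise NotImplementedError(node.op)

    code = []
    munch(tree, code)
    print("\n".join(code))
    # addi 1, ri / muli 4, ri / add ri, ra / muli 4, rj / add rj, rb / movem rb, ra

On the example tree this produces the movem translation from earlier, in a slightly different but equivalent order.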

Ad-Hoc Algorithm

Problem: what does tile size mean? The largest tile does not necessarily give the fastest code (example: a multiply costs far more than an add).

How do we include cost?
- Idea: the total cost of a tiling is the sum of the costs of its tiles
- Goal: find a minimum-cost tiling

Including Cost

Algorithm: for each node, find the minimum total cost tiling for that node and the subtrees below it.

Key: once we have a minimum cost for each subtree, we can find the minimum-cost tiling for a node by trying all possible tiles that match the node.

Use dynamic programming.
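Here is a sketch of the dynamic-programming version in Python, built on the hypothetical Node/tree objects from the earlier sketch. The tile set, the (name, cost, match) encoding, and keying the memo table by node identity are all assumptions of this example; the costs are the cycle counts from the earlier cost table, and movex is left out for brevity:

    # Bottom-up minimum-cost tiling.  Each tile is (name, cost, match), where
    # match(n) returns the subtrees the tile leaves uncovered, or None if it
    # does not apply at node n.
    def is_const(n):
        return not n.kids and n.op.isdigit()

    TILES = [
        ("movem", 4, lambda n: [n.kids[0], n.kids[1].kids[0]]
            if n.op == "store" and n.kids[1].op == "load" else None),
        ("store", 3, lambda n: list(n.kids) if n.op == "store" else None),
        ("load",  3, lambda n: list(n.kids) if n.op == "load" else None),
        ("addi",  1, lambda n: [n.kids[0]]
            if n.op == "+" and is_const(n.kids[1]) else None),
        ("muli", 10, lambda n: [n.kids[0]]
            if n.op == "*" and is_const(n.kids[1]) else None),
        ("add",   1, lambda n: list(n.kids) if n.op == "+" else None),
        ("mul",  10, lambda n: list(n.kids) if n.op == "*" else None),
        ("leaf",  0, lambda n: [] if not n.kids else None),
    ]

    def best(n, memo=None):
        """Minimum cost and chosen tile for the subtree rooted at n."""
        memo = {} if memo is None else memo
        if id(n) not in memo:
            options = []
            for name, cost, match in TILES:
                uncovered = match(n)
                if uncovered is not None:
                    total = cost + sum(best(k, memo)[0] for k in uncovered)
                    options.append((total, name))
            memo[id(n)] = min(options)
        return memo[id(n)]

    print(best(tree))   # (27, 'movem') -- matches the movem translation above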

Dynamic Programming

Idea: for problems with optimal substructure,
- Compute optimal solutions to sub-problems
- Combine them into an optimal overall solution

How does this help? Sub-problems recur many times, so use memoization: save previously computed solutions to sub-problems. The algorithm can work top-down or bottom-up.

Recursive Algorithm

Memoization: for each subtree, record the best tiling in a table. (Note: we need a quick way to find out whether we've seen a subtree before; some systems use DAGs instead of trees.)

At each node:
- First check the table for the optimal tiling of this node
- If there is none, try all possible tiles and remember the lowest-cost one
- Record the lowest-cost tile in the table

A greedy, top-down algorithm. We can then emit code from the table.

Pseudocode

    Tile(n) {
      if (best(n)) return best(n)

      // -- Check all tiles
      if ((op(n) == STORE) &&
          (op(right(n)) == LOAD) &&
          (op(child(right(n))) == PLUS)) {
        Code c = Tile(left(n))
        c.add(Tile(left(child(right(n)))))
        c.add(Tile(right(child(right(n)))))
        c.append(MOVEX ...)
        if (cost(c) < cost(best(n)))
          best(n) = c
      }
      // ... and all other tiles ...
      return best(n)
    }

Ad-Hoc Algorithm

Problem: this hard-codes the tiles in the code generator.

Alternative:
- Define the tiles in a separate specification
- Use a generic tree-pattern-matching algorithm to compute the tiling
- Tools: code generator generators

Code Generator Generators

Tree description language: represent the IR tree as text.

Specification:
- IR tree patterns
- Code generation actions

Generator:
- Takes the specification
- Produces a code generator
- Provides linear-time optimal code generation

Examples: BURS (bottom-up rewrite systems), burg, Twig, BEG.

Tree Notation

Use prefix notation to avoid confusion:

    *(j, 4)
    +(b, *(j, 4))
    load(+(b, *(j, 4)))
    +(i, 1)
    *(+(i, 1), 4)
    +(a, *(+(i, 1), 4))
    store(+(a, *(+(i, 1), 4)), load(+(b, *(j, 4))))
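Printing a tree in this prefix form is a three-line recursion over the hypothetical Node sketch used earlier:

    # Render a tree in the prefix notation above.
    def prefix(n):
        if not n.kids:
            return n.op
        return n.op + "(" + ", ".join(prefix(k) for k in n.kids) + ")"

    print(prefix(tree))
    # store(+(a, *(+(i, 1), 4)), load(+(b, *(j, 4))))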

Rewrite Rules

Each rule has:
- A pattern to match and a replacement
- A cost
- A code generation template (which may include actions, e.g. generating a register name)

    Pattern → replacement                   Cost    Template
    +(reg1, reg2) → reg2                    1       add r1, r2
    store(reg1, load(reg2)) → done          5       movem r2, r1

Rewrite Rules

Example rules:

    #   Pattern → replacement               Cost    Template
    1   +(reg1, reg2) → reg2                1       add r1, r2
    2   *(reg1, reg2) → reg2                10      mul r1, r2
    3   +(num, reg1) → reg2                 1       addi num, r1
    4   *(num, reg1) → reg2                 10      muli num, r1
    5   store(reg1, load(reg2)) → done      5       movem r2, r1
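Written as data rather than as hard-coded matching, the rule table might look like this in Python; the tuple layout here is an assumption of this example, not the input format of burg or any other real tool:

    # (rule number, pattern, replacement, cost, template)
    RULES = [
        (1, "+(reg1, reg2)",            "reg2", 1,  "add r1, r2"),
        (2, "*(reg1, reg2)",            "reg2", 10, "mul r1, r2"),
        (3, "+(num, reg1)",             "reg2", 1,  "addi num, r1"),
        (4, "*(num, reg1)",             "reg2", 10, "muli num, r1"),
        (5, "store(reg1, load(reg2))",  "done", 5,  "movem r2, r1"),
    ]

A generic matcher would then try these patterns at each tree node, instead of the hand-written if-chains shown earlier.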

Example

Tiling a[i+1] = b[j] with these rules yields:

    muli 4, rj
    add rj, rb
    addi 1, ri
    muli 4, ri
    add ri, ra
    movem rb, ra

Rewriting Process

    store(+(ra, *(+(ri, 1), 4)), load(+(rb, *(rj, 4))))
        rule 4: muli 4, rj
    store(+(ra, *(+(ri, 1), 4)), load(+(rb, rj)))
        rule 1: add rj, rb
    store(+(ra, *(+(ri, 1), 4)), load(rb))
        rule 3: addi 1, ri
    store(+(ra, *(ri, 4)), load(rb))
        rule 4: muli 4, ri
    store(+(ra, ri), load(rb))
        rule 1: add ri, ra
    store(ra, load(rb))
        rule 5: movem rb, ra
    done

Mem and Move

Alternative to explicit load and store nodes: use mem (memory access) and move nodes in the tree, so a[i+1] = b[j] becomes

    move(mem(+(a, *(+(i, 1), 4))), mem(+(b, *(j, 4))))

When a and b live on the stack rather than in registers, they themselves appear as stack-slot subtrees of the form mem(+(sp, 8)) and mem(+(sp, 4)), so their address computations show up explicitly in the tree.

Modern Processors

Execution time is not simply the sum of the tile times:
- Instruction order matters
- Pipelining: parts of different instructions overlap; a bad ordering stalls the pipeline (e.g., too many operations of one type)
- Superscalar: some operations are executed in parallel

So cost is an approximation, and instruction scheduling helps.

Next Time…

Test: the front end. And then: introduction to optimization.