Pipeline Enhancements for the Y86 Architecture


Pipeline Enhancements for the Y86 Architecture Kelly Carothers

Enhancements
Hardware: BTFNT branch prediction; load forwarding for variables.
Software: use of IADDL; rearrangement of code; loop unrolling.
The enhancements are presented in the order they were done, and the Avg CPE values are cumulative.

Load-forwarding
Passing a value backward from a later pipeline stage before it has been written to a register or memory. Avg CPE: 17.15. Prevents stalls by moving values that have yet to be written back to the previous stage. CPE decrease of 1.00. The pipe now stalls only for POPL and MRMOVL instructions in the Execute stage.

Load-forwarding from Memory stage to Execute Stage
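The Memory-to-Execute forwarding path can be sketched as a simple predicate. This is an illustrative model only, not the actual hardware description: the stage fields (`dst`, `valid`, `value`), function name, and register values are all assumptions made for demonstration.

```python
# Sketch of the forwarding check: if the instruction currently in the Memory
# stage has a pending write to the same register that the Execute stage is
# about to read, pass that value back instead of stalling until write-back.

def execute_operand(src_reg, regfile, mem_stage):
    """Return the operand value for src_reg, forwarding from the
    Memory stage when its pending write targets the same register."""
    if mem_stage["valid"] and mem_stage["dst"] == src_reg:
        return mem_stage["value"]   # forward: skip the write-back wait
    return regfile[src_reg]         # normal register-file read

regfile = {"%eax": 0, "%ebx": 7}
mem_stage = {"dst": "%eax", "value": 42, "valid": True}

print(execute_operand("%eax", regfile, mem_stage))  # forwarded value: 42
print(execute_operand("%ebx", regfile, mem_stage))  # register-file value: 7
```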

IADDL
A single instruction that replaces the IRMOVL and ADDL pair for an immediate add. Avg CPE: 14.22. Saves one instruction each time it replaces an IRMOVL/ADDL pair, and frees a register for other purposes. CPE decrease of 2.93. The decrease covers both adding the new instruction to the hardware and replacing the two-instruction sequence in ncopy where applicable. The most useful enhancement, with the biggest CPE decrease (likely because it is the most frequently used replacement).

IADDL implementation
Very simple: it is a blend of the IRMOVL and ADDL instructions, without the intermediate store to and load from a scratch register.
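A rough register-level model of that saving (hypothetical helper names, not Y86 hardware): the two-instruction sequence needs a scratch register, while the fused instruction does not.

```python
# Contrast the irmovl+addl sequence with the fused iaddl. Registers are
# modeled as a dict; the helper names are illustrative, not real Y86 code.

def irmovl_addl(regs, imm, dst, scratch="%edi"):
    regs[scratch] = imm              # irmovl $imm, %edi  (clobbers scratch)
    regs[dst] = regs[dst] + regs[scratch]  # addl %edi, dst

def iaddl(regs, imm, dst):
    regs[dst] = regs[dst] + imm      # one instruction, no scratch register

a = {"%esi": 5, "%edi": 0}
b = {"%esi": 5, "%edi": 0}
irmovl_addl(a, 1, "%esi")            # two instructions; %edi clobbered
iaddl(b, 1, "%esi")                  # one instruction; %edi untouched
print(a["%esi"], a["%edi"])          # 6 1
print(b["%esi"], b["%edi"])          # 6 0
```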

IADDL Code Comparison: Modified vs. Original

Modified (with iaddl):
# Loop header
xorl %esi,%esi        # count = 0
andl %edx,%edx        # len <= 0?
jle Done              # if so, goto Done
# Loop body
Loop:
mrmovl (%ebx), %eax
rmmovl %eax, (%ecx)
andl %eax, %eax       # val <= 0?
jle Npos              # if so, goto Npos
iaddl $1, %esi        # count++
Npos:
iaddl $-1, %edx       # len--
iaddl $4, %ebx        # src++
iaddl $4, %ecx        # dst++
andl %edx,%edx        # len > 0?
jg Loop               # if so, goto Loop

Original (irmovl + addl):
# Loop header
xorl %esi,%esi        # count = 0
andl %edx,%edx        # len <= 0?
jle Done              # if so, goto Done
# Loop body
Loop:
mrmovl (%ebx), %eax   # read val from src...
rmmovl %eax, (%ecx)   # ...and store it to dst
andl %eax, %eax       # val <= 0?
jle Npos              # if so, goto Npos
irmovl $1, %edi
addl %edi, %esi       # count++
Npos:
irmovl $1, %edi
subl %edi, %edx       # len--
irmovl $4, %edi
addl %edi, %ebx       # src++
addl %edi, %ecx       # dst++
andl %edx,%edx        # len > 0?
jg Loop               # if so, goto Loop

BTFNT Branch Prediction
BTFNT – Backward Taken, Forward Not Taken: predict taken whenever the branch target is the smaller address. Avg CPE: 12.37. 65% prediction success rate for BTFNT vs. 60% for the default always-taken policy. CPE decrease of 1.85. The second most useful enhancement.
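The BTFNT heuristic itself is a one-line predicate. A minimal sketch, with made-up addresses and a toy branch trace to show where it beats always-taken (backward loop branches are usually taken; forward exit branches usually are not):

```python
# BTFNT: Backward Taken, Forward Not Taken. Predict "taken" only when the
# target address is below the branch's own address. Addresses are invented.

def btfnt_predict(branch_addr, target_addr):
    """Predict taken iff the branch jumps backward (e.g., a loop)."""
    return target_addr < branch_addr

# (branch_addr, target_addr, actually_taken)
trace = [
    (0x40, 0x20, True),   # backward loop branch, taken  -> correct
    (0x48, 0x60, False),  # forward exit branch, not taken -> correct
    (0x50, 0x70, True),   # forward branch that is taken -> mispredicted
]

correct = sum(btfnt_predict(b, t) == taken for b, t, taken in trace)
print(f"{correct}/{len(trace)} predicted correctly")  # 2/3
```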

Code Rearrangement
The code was arranged specifically to favor BTFNT prediction, and many unnecessary checks were removed. Avg CPE: 11.71 (no loop unrolling). CPE decrease of 0.66.

Code Rearrangement: IADDL Mod vs. End Result

IADDL version:
# Loop header
xorl %esi,%esi        # count = 0
andl %edx,%edx        # len <= 0?
jle Done              # if so, goto Done
# Loop body
Loop:
mrmovl (%ebx), %eax
rmmovl %eax, (%ecx)
andl %eax, %eax       # val <= 0?
jle Npos              # if so, goto Npos
iaddl $1, %esi        # count++
Npos:
iaddl $-1, %edx       # len--
iaddl $4, %ebx        # src++
iaddl $4, %ecx        # dst++
andl %edx,%edx        # len > 0?
jg Loop               # if so, goto Loop

End result (rearranged):
rrmovl %edx, %esi
iaddl $1, %edx
Loop:
iaddl $-1, %edx
jle Done
Loop1:
mrmovl (%ebx), %eax
rmmovl %eax, (%ecx)
Npos:
iaddl $4, %ebx        # src++
iaddl $4, %ecx        # dst++
andl %eax, %eax
jle decEsi
jmp Loop
decEsi:
iaddl $-1, %esi
jg Loop

Loop Unrolling
Increases code size; decreases CPE. More unrolling means faster code because of less looping, but much larger size due to repeated code. A cheap way to decrease the CPE.

Loop Unrolling: No unrolling vs. 1 unroll

No unrolling:
Loop1:
mrmovl (%ebx), %eax
rmmovl %eax, (%ecx)
Npos:
iaddl $4, %ebx        # src++
iaddl $4, %ecx        # dst++
andl %eax, %eax
jle decEsi
jmp Loop

1 unroll:
Loop1:
mrmovl (%ebx), %eax
rmmovl %eax, (%ecx)
Npos:
iaddl $4, %ebx        # src++
iaddl $4, %ecx        # dst++
andl %eax, %eax
jle decEsi
iaddl $-1, %edx
jle Done
mrmovl (%ebx), %eax
iaddl $4, %ebx
iaddl $4, %ecx
jmp Loop
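The same idea can be transcribed into Python for clarity (the function names and the copy-and-count task are illustrative, loosely mirroring the ncopy loop): unrolling by 2 does two elements of work per loop-control check, halving the counter/branch overhead that the CPE savings come from.

```python
# Rolled vs. unrolled-by-2 versions of a copy-and-count-positives loop.
# Both must produce identical results; the unrolled one handles the
# leftover element when the length is odd.

def copy_and_count(src):
    """Rolled version: one element per iteration."""
    dst, count = [], 0
    for v in src:
        dst.append(v)
        if v > 0:
            count += 1
    return dst, count

def copy_and_count_unrolled(src):
    """Unrolled by 2: two elements per loop-control check."""
    dst, count = [], 0
    i, n = 0, len(src)
    while i + 1 < n:
        a, b = src[i], src[i + 1]
        dst.extend((a, b))
        count += (a > 0) + (b > 0)
        i += 2
    if i < n:                       # leftover element when len is odd
        dst.append(src[i])
        count += src[i] > 0
    return dst, count

print(copy_and_count_unrolled([3, -1, 2]))  # ([3, -1, 2], 2)
```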

Loop Unrolling Results
No unrolling (base): Avg CPE 11.64. 1 unroll: Avg CPE 11.16. 2 unrolls: Avg CPE 11.00. More unrolling means faster code because of less looping, but much larger size due to repeated code. No unrolling gives the same CPE as after code rearrangement; 1 unroll is 0.48 less than no unrolling; 2 unrolls is 0.64 less than no unrolling.

Total Results
Initial Avg CPE: 18.15. Final Avg CPE: 11.00. Total decrease of 7.15 CPE. The final Avg CPE is based on 2 loop unrolls, which seemed the best choice performance-wise, because the gain beyond that point was small and shrinking rapidly.

Final Results

Enhancement         Avg CPE   CPE Decrease
None                18.15     -------
Load-Forwarding     17.15     1.00
IADDL               14.22     2.93
BTFNT               12.37     1.85
Code Rearranging    11.64     0.73
1 Loop Unrolled     11.16     0.48
2 Loops Unrolled    11.00     0.16
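A quick arithmetic check of the table: subtracting each reported per-enhancement decrease from the baseline should reproduce the final average CPE.

```python
# Verify that the cumulative CPE figures in the results table are consistent:
# baseline minus the sum of all reported decreases should equal the final CPE.

baseline = 18.15
decreases = {
    "Load-Forwarding": 1.00,
    "IADDL": 2.93,
    "BTFNT": 1.85,
    "Code Rearranging": 0.73,
    "1 Loop Unrolled": 0.48,
    "2 Loops Unrolled": 0.16,
}

final = baseline - sum(decreases.values())
print(f"{final:.2f}")  # 11.00
```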