Assembly Code Optimization Techniques for the AMD64 Athlon and Opteron Architectures
David Phillips, Robert Duckles
CSE 520, Spring 2007, Term Project Presentation

Background
To generate efficient code, modern compilers must take into account the architecture they generate code for, so the engineer who implements the compiler must know that architecture well. There is rarely only one way to write code for a given problem; however, some implementations take better advantage of the architecture's features than others and achieve higher performance. Naturally, we always want the most efficient (fastest) solution.

Methodology
For our project, we researched the architecture of the AMD64 Athlon and Opteron processors. While gathering information on the AMD64 architecture, we selected a subset of relevant optimization techniques that should, in theory, yield better performance than similar alternative approaches. Using the Microsoft Macro Assembler (MASM), we implemented a series of 15 small sample programs, where each program isolates a single optimization technique (or the lack thereof). A minimal sketch of how such a test program might be structured is shown below.
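For reference, here is a minimal sketch of one such MASM test program; the data declaration, iteration count, and label names are illustrative assumptions of ours, not taken from the actual project sources:

    .686
    .MODEL flat, stdcall
    .STACK 4096
    ExitProcess PROTO, dwExitCode:DWORD

    .DATA
    intarray DWORD 1000 DUP(0)      ; sample data the kernel operates on

    .CODE
    main PROC
        mov ecx, 100000             ; repeat the kernel enough times to profile
    testLoop:
        ; <body: the single optimization technique under test goes here>
        dec ecx
        jnz testLoop
        INVOKE ExitProcess, 0
    main ENDP
    END main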

Methodology II
After assembling the test programs, we instrumented and profiled each of them on a machine with a single-core AMD64 Athlon processor. We used AMD's downloadable CodeAnalyst suite to profile each program's behavior and collect results such as clock events, cache misses, dispatch stalls, and time to completion. The goal was to determine which optimization techniques yielded the best performance gains and to validate our assumptions about the system architecture.

Optimization Classes Examined
- Decode
- Memory Access
- Arithmetic
- Integer Scheduling
- Loop Instruction Overhead
- Loop Unrolling
- Function Inline

Decode Optimization
Decoding IA-32 instructions is complicated. Using a single complex instruction rather than multiple simpler instructions reduces the number of instructions that must be decoded into micro-operations.
Example:
    add edx, DWORD PTR [eax]
instead of
    mov ebx, DWORD PTR [eax]
    add edx, ebx
This optimization reduces:
- decoder usage.
- register pressure (allocated registers in the register file).
- data-dependent instructions in the pipeline (stalls).
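To show the load-execute form in context, here is a short illustrative loop of our own (not one of the 15 test programs) that sums intarray; the single add with a memory operand both loads and adds, so no scratch register is tied up holding the loaded value:

        mov edx, 0                  ; running sum
        mov eax, OFFSET intarray    ; pointer to the current element
        mov ecx, LENGTHOF intarray  ; element count
    sumLoop:
        add edx, DWORD PTR [eax]    ; load-execute: add the memory operand directly
        add eax, TYPE intarray      ; advance to the next element
        dec ecx
        jnz sumLoop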

Memory Access Optimization
The L1 data cache is implemented as 8 separate banks; each bank holds an 8-byte slice of each cache line. If two consecutive load instructions read from different cache lines that map to the same bank, the second load has to wait for the first to finish. If no bank conflict occurs, both loads can complete during the same cycle.
Example:
    mov ebx, DWORD PTR [edi]
    mov edx, DWORD PTR [edi + TYPE intarray]
    imul ebx, ebx
    imul edx, edx
instead of
    mov ebx, DWORD PTR [edi]
    imul ebx, ebx
    mov edx, DWORD PTR [edi + TYPE intarray]
    imul edx, edx
Assuming that each array value is 4 bytes in size, a bank conflict cannot occur, since both loads read from the same cache line.
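To make the conflict case concrete, here is a sketch of our own; it assumes the bank is selected by address bits 3-5, so two loads 64 bytes apart hit the same bank in different lines:

    ; Potential bank conflict: [edi] and [edi + 64] map to the same
    ; bank but lie in different cache lines, so the second load may wait.
    mov ebx, DWORD PTR [edi]
    mov edx, DWORD PTR [edi + 64]

    ; No conflict: [edi] and [edi + 8] map to different banks,
    ; so both loads can complete in the same cycle.
    mov ebx, DWORD PTR [edi]
    mov edx, DWORD PTR [edi + 8]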

Arithmetic Optimization
Dividing by 16-bit integers is faster than dividing by 32-bit integers.
Example:
    mov dx, 0
    mov ax, 65535
    mov bx, 5
    idiv bx
instead of
    mov edx, 0
    mov eax, 65535
    mov ebx, 5
    idiv ebx
This optimization reduces:
- time to perform the division.
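One detail worth keeping in mind when comparing the two forms (standard x86 semantics, not specific to these slides): the results land in different registers.

    idiv bx     ; 16-bit form: DX:AX / BX    -> quotient in AX,  remainder in DX
    idiv ebx    ; 32-bit form: EDX:EAX / EBX -> quotient in EAX, remainder in EDX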

Integer Scheduling Optimization
Pushing data onto the stack directly from memory is faster than loading the value into a register first and then pushing the register onto the stack.
Example:
    push [edi]
instead of
    mov ebx, DWORD PTR [edi]
    push ebx
This optimization reduces:
- register pressure (allocated registers in the register file).
- data-dependent instructions in the pipeline (stalls).

Integer Scheduling Optimization
Two writes to different portions of the same register are slower than writes to two different registers.
Example:
    mov bl, 0012h   ; Load the constant into the low byte of EBX
    mov ah, 0000h   ; Load the constant into bits 8-15 of EAX.
                    ; No false dependency on the completion of the
                    ; previous instruction, since BL and AH belong
                    ; to different registers.
instead of
    mov bl, 0012h   ; Load the constant into the low byte of EBX
    mov bh, 0000h   ; Load the constant into bits 8-15 of EBX.
                    ; This instruction has a false dependency on the
                    ; completion of the previous instruction, since
                    ; BL and BH share EBX.
This optimization reduces:
- dependent instructions in the pipeline.
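A related technique, our addition rather than one of the tested programs, is to avoid the partial write altogether with movzx, which writes the full 32-bit register and therefore carries no dependence on its previous contents:

    movzx ebx, al   ; zero-extend AL into all of EBX; no partial-register
                    ; merge and no dependence on the old value of EBX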

Loop Instruction Overhead Optimization
The LOOP instruction has an 8-cycle latency. It is faster to use other instructions, such as a decrement followed by a conditional jump.
Example:
    mov ecx, LENGTHOF intarray
L1:
    ; loop body
    dec ecx
    jnz L1
instead of
    mov ecx, LENGTHOF intarray
L1:
    ; loop body
    loop L1
This optimization reduces:
- loop overhead latency.

Loop Unrolling Optimization
Unrolling the body of a loop reduces the total number of iterations that must be performed, which eliminates a great deal of loop overhead and yields faster overall execution.
Example:
    lea ebx, [edi]      ; Get the address of the next element into a register
    push ebx            ; Push it onto the stack
    pop ebx             ; Pop it back off the stack
    lea ebx, 2[edi]     ; ... repeated for the element at offset 2
    push ebx
    pop ebx
    lea ebx, 4[edi]     ; ... offset 4
    push ebx
    pop ebx
    lea ebx, 6[edi]     ; ... offset 6
    push ebx
    pop ebx
    lea ebx, 8[edi]     ; ... offset 8
    push ebx
    pop ebx
instead of
    lea ebx, [edi]      ; Get the address of the next element into a register
    push ebx            ; Push it onto the stack
    pop ebx             ; Pop it back off the stack
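For context, here is an assumed shape of the rolled loop with its control instructions included (the slide shows only the body; the counter register and 2-byte stride are our guesses). Unrolling by five replaces five of these decrement/branch pairs and pointer updates with one:

L1:
    lea ebx, [edi]      ; compute the address of the current element
    push ebx            ; push it onto the stack
    pop ebx             ; pop it back off
    add edi, 2          ; step to the next 2-byte element
    dec ecx             ; one decrement/branch per element...
    jnz L1              ; ...which unrolling by 5 cuts to one per 5 elements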

Function Inline Optimization
The body of a small function can replace the function call in order to reduce function-call overhead.
Example:
    mov edx, 0      ; Zero the high half of the dividend
    mov eax, 65535  ; Load the dividend
    mov ebx, 5      ; Load the divisor
    idiv ebx        ; Perform the division
instead of
    call DoDivision ; Perform the division
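For comparison, the called version presumably wraps the same four instructions in a procedure; a sketch of what DoDivision might look like (the body is our assumption, mirroring the inlined sequence):

DoDivision PROC
    mov edx, 0      ; Zero the high half of the dividend
    mov eax, 65535  ; Load the dividend
    mov ebx, 5      ; Load the divisor
    idiv ebx        ; EDX:EAX / EBX -> quotient in EAX
    ret             ; return; the call/ret pair is the overhead inlining removes
DoDivision ENDP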

Decode Results Discussion
LoadExecuteNoOp vs LoadExecuteWithOp
Confirmed expectations. Data cache misses were about the same for both programs.
- The non-optimized program required 1.21x more cycles than the optimized program.
- The non-optimized program required 1.25x more instructions than the optimized program.
- The optimized program introduced 1.49x more stalls than the non-optimized program.
- The non-optimized program took 1.22x longer to finish than the optimized program.
Even though the optimized program had a higher stall rate, the overall reduction in instructions and cycles produced a net performance gain.

Memory Access Results Discussion
MemAccessNoOp vs MemAccessWithOp
Did not confirm expectations. There was no real observable difference between the two programs: both executed roughly the same number of cycles in the same period of time, with similar stall and cache-miss occurrences. We suspect that the same micro-operations are generated even for the optimized program.
LEA16 vs LEA32
Did not confirm expectations. There was no real observable difference between the two programs: both executed roughly the same number of cycles in the same period of time, with similar stall and cache-miss occurrences.

Arithmetic Results Discussion
DIVIDE32 vs DIVIDE16
Confirmed expectations. Data cache misses were about the same for both programs, and both executed roughly the same number of instructions. However:
- DIVIDE32 took 1.57x as many cycles as DIVIDE16 to finish.
- DIVIDE16 finished 1.57x faster than DIVIDE32.
- Instructions per cycle decreased by 1.56x for DIVIDE32.
- The stalls-per-instruction ratio increased slightly for DIVIDE32.
As expected, 32-bit division ran significantly slower than 16-bit division.

Integer Scheduling Results Discussion
IssueNoOp vs IssueWithOp
Did not confirm expectations.
- The optimized program required 1.06x more cycles than the non-optimized program.
- The non-optimized program required 1.25x more instructions.
- The optimized program had considerably more cache misses than the non-optimized program.
- The optimized program achieved 1.32x fewer instructions per cycle than the non-optimized program.
- The optimized program took 1.05x longer to finish than the non-optimized program.
While the optimization did reduce the instruction count, the cache-miss rate increased greatly, which diminished any performance returns to the point that performance actually became worse.

Integer Scheduling Results Discussion
PartialRWNoOp vs PartialRWWithOp
Did not confirm expectations. Both programs had very nearly the same performance profile. We expected the number of stalls to drop in the optimized program, since the false dependencies were eliminated; however, both had about the same number of measured stalls, and both finished execution in about the same amount of time.

Loop Instruction Overhead Results Discussion
IssueNoOp vs IssueWithNoLOOP
Confirmed expectations. Data cache misses were about the same for both programs. Replacing the LOOP instruction with dec/jnz had the following effects:
- Number of cycles reduced by 1.91x.
- Instruction count increased by 1.20x.
- Stalls increased by 2.64x.
- Total runtime decreased by 2.0x.
While the overall instruction count and stall count increased, the total number of cycles needed was cut almost in half, which produced a large performance gain.

Loop Unrolling Results Discussion
IssueNoOp vs IssueWithLoopUnrolled
Confirmed expectations. Unrolling the loop body 5 times had the following effects:
- Number of cycles reduced by 1.87x.
- Number of instructions reduced by 1.47x.
- Cache misses decreased by 1.35x (941 vs. 696).
- Instructions per cycle increased by 1.28x.
- Stalls increased by 11x.
- Total runtime decreased by 1.93x.
Even though the optimization introduced more stalls, the total number of required cycles decreased significantly; much of the loop overhead was removed, which is what allowed an overall net performance increase.

Function Inline Results Discussion
DIVIDE32 vs DIVIDE32FuncCall
Confirmed expectations. Data cache misses were about the same for both programs.
- DIVIDE32FuncCall required 1.39x more instructions than the inlined DIVIDE32.
- Instructions per cycle increased for DIVIDE32FuncCall, but this is most likely a false positive, as the function-call overhead introduced extra instructions into the pipeline.
- The inlined DIVIDE32 program finished 1.043x faster than the function-call implementation.
- The number of stalls increased by 1.12x in the function-call implementation.
- The function-call implementation required 1.04x as many clock cycles as the inlined version.
As expected, the added overhead of the function call made a noticeable impact on performance. Note also that we did not pass any parameters to the function; had parameters been passed, the overhead could be expected to increase.

Most Significant Performance Gains
1. Loop Instruction Overhead (2.0x speedup)
2. Loop Unrolling (1.93x speedup)
3. Arithmetic (1.57x speedup)
4. Decode (1.22x speedup)
5. Function Inline (1.043x speedup)
6. Memory Access (no speedup)
7. Integer Scheduling (1.05x slowdown)

Thank you. Questions?