Presentation transcript:

1 Agenda
AVX overview
Proposed AVX ABI changes
  − For IA-32
  − For x86-64
AVX and vectorizer infrastructure
Ongoing projects by Intel gcc team:
  − Stack alignment
    > IA32 relative performance numbers for SPEC CPU 2K and
    > DWARF2 change for stack alignment
  − Status of gcc AVX branch
    > Limit AVX vectorizer to 128bit

2 Intel® Advanced Vector Extensions (Intel® AVX): 2X Vector Width
A 256-bit vector extension to SSE
Intel® AVX extends all 16 XMM registers to 256 bits
Intel® AVX works on either
  − The whole 256 bits
  − The lower 128 bits (like existing SSE instructions)
    > A drop-in replacement for all existing scalar/128-bit SSE instructions
    > The upper part of the register is zeroed out
    > No alignment fault on ld-op arithmetic operations
Register diagram: YMM0, 256 bits (2010), with XMM0, 128 bits (1999), as its lower half

3 Intel® Advanced Vector Extensions (Intel® AVX) – New Encoding System
Nearly all SSE FP instructions “promoted” to 256 bits
  − VADDPS YMM1, YMM2, [m256]
Nearly all (*) SSE instructions encode-able in new format
  − VADDPS XMM1, XMM2, [m128]
  − VMULSS XMM1, XMM2, [m32]
  − VPUNPCKHQDQ XMM1, XMM2, [m128]
128-bit and scalar promoted instructions have full inter-operability with 256-bit operations
(*) Instructions referencing MMX registers are NOT promoted to Intel AVX

4 Key Intel® Advanced Vector Extensions (Intel® AVX) Features
Intel® AVX is a general purpose architecture, expected to supplant SSE in all applications used today
KEY FEATURES and BENEFITS
Wider Vectors
  − Increased from 128 bit to 256 bit
  > Benefit: Up to 2x peak FLOPs (floating point operations per second) output with good power efficiency
Enhanced Data Rearrangement
  − Use the new 256 bit primitives to broadcast, mask loads and permute data
  > Benefit: Organize, access and pull only necessary data more quickly and efficiently
Three and four Operands, Non Destructive Syntax
  − Designed for efficiency and future extensibility
  > Benefit: Fewer register copies, better register use for both vector and scalar code
Flexible unaligned memory access support
  > Benefit: More opportunities to fuse load and compute operations
Extensible new opcode (VEX)
  > Benefit: Code size reduction

5 AVX related changes
Assembler support in binutils
  − Under -msse2avx, SSE instructions will map to AVX instructions
New data type __m256
  − Natural alignment is 32 bytes
  − Requires stack alignment greater than that guaranteed by the IA-32 and x86-64 ABIs
Intrinsics for AVX instructions
  − Under -mavx, SSE intrinsics will be mapped to AVX instructions
Automatic code generation
  − Vectorizer work ongoing

6 Proposed AVX ABI changes
Given __m256 *p; a reference to *p will generate a 32-byte aligned 32-byte load
  − In particular, __m256 variables on the stack also need to be 32-byte aligned
  − Has implications for aligning the stack (discussed yesterday) as well as for parameter passing (today)
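The alignment requirement above can be illustrated without any AVX header. The following is a minimal sketch, assuming a C11 compiler; `m256_like` is a hypothetical stand-in for `__m256` (not the real type), and the helper names are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for __m256: eight floats forced to 32-byte
   alignment with C11 _Alignas. The real __m256 comes from the compiler's
   AVX headers; this type only mimics its size and alignment. */
typedef struct {
    _Alignas(32) float lane[8];
} m256_like;

/* Size and natural alignment are both 32 bytes, as the slide states. */
_Static_assert(sizeof(m256_like) == 32, "stand-in must be 32 bytes");
_Static_assert(_Alignof(m256_like) == 32, "stand-in must be 32-byte aligned");

/* Returns 1 if p meets the alignment that an aligned 32-byte load assumes. */
int is_aligned_32(const void *p)
{
    return ((uintptr_t)p & 31u) == 0;
}

/* Returns 1 if a local (stack) variable of the type is 32-byte aligned.
   The compiler has to arrange this itself, for example by realigning
   the stack, when the ABI only guarantees a smaller alignment. */
int stack_vector_is_aligned(void)
{
    m256_like local = { { 0.0f } };
    assert(local.lane[0] == 0.0f);
    return is_aligned_32(&local);
}
```

Calling `stack_vector_is_aligned()` is expected to return 1 on a conforming compiler, while `is_aligned_32` on an address such as `(char *)&x + 16` returns 0, which is the case an aligned 32-byte load must never see.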

7 Parameter passing ABIs (Linux-32)
Almost everything on stack
  − __m128 variables in xmm0-2, all caller save
  − __m128 after first 3 parameters passed on stack
    > gcc assumes stack aligned at 16 bytes
    > Inserts padding as desired
Proposed change
  − __m128/__m256 variables in xmm0-2/ymm0-2, all caller save
  − Note overlap between xmm and lower halves of ymm
  − After first 3 parameters, __m128/__m256 passed on stack
    > Requires stack aligned to 32 bytes (Joey Ye’s talk yesterday)
    > Insert padding as desired
Varargs are dealt with later

8 Parameter passing ABIs (Linux-64)
Current
  − __m128 variables in xmm0-7, all caller save
  − __m128 after first 8 parameters passed on stack
    > ABI guarantees stack aligned to 16 bytes
    > Inserts padding as desired
Proposed change
  − __m128/__m256 variables in xmm0-7/ymm0-7, all caller save
  − Note overlap between xmm and lower halves of ymm
  − After first 8 parameters, __m128/__m256 passed on stack
    > Requires stack aligned to 32 bytes (Joey Ye’s talk yesterday)
    > Insert padding as desired
Varargs are dealt with later
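The register-then-stack rule in the two parameter passing slides can be modelled in a few lines. This is only a sketch of the proposal as the slides state it, not compiler code: it considers vector arguments alone, and `place_vector_args`, `struct arg_slot` and the constants are hypothetical names introduced here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model: where one vector argument ends up. */
enum { IN_REGISTER = 0, ON_STACK = 1 };

struct arg_slot {
    int    where;  /* IN_REGISTER or ON_STACK */
    size_t index;  /* register number, or byte offset in the stack area */
};

/* Round offset up to the next multiple of align (a power of two). */
static size_t round_up(size_t offset, size_t align)
{
    return (offset + align - 1) & ~(align - 1);
}

/* sizes[i] is 16 for an __m128 argument or 32 for an __m256 argument.
   num_regs is 3 for the Linux-32 proposal and 8 for Linux-64.
   Fills out[0..n-1] and returns the size of the stack argument area,
   padded to the largest alignment that was needed. */
size_t place_vector_args(const size_t *sizes, size_t n, size_t num_regs,
                         struct arg_slot *out)
{
    size_t next_reg = 0, offset = 0, max_align = 1;

    for (size_t i = 0; i < n; i++) {
        assert(sizes[i] == 16 || sizes[i] == 32);
        if (next_reg < num_regs) {
            /* xmmN and ymmN overlap, so both sizes share one counter. */
            out[i].where = IN_REGISTER;
            out[i].index = next_reg++;
        } else {
            /* Insert padding so the slot is naturally aligned. */
            offset = round_up(offset, sizes[i]);
            out[i].where = ON_STACK;
            out[i].index = offset;
            offset += sizes[i];
            if (sizes[i] > max_align)
                max_align = sizes[i];
        }
    }
    return round_up(offset, max_align);
}
```

With sizes {16, 32, 16, 16, 32} and `num_regs` set to 3, the first three arguments take registers 0 to 2, the fourth goes to stack offset 0, and the fifth is padded up to offset 32, giving a 64-byte stack area. The padding before the fifth argument is the reason the proposal needs a 32-byte aligned stack.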

9 Varargs (Linux-32)
Current
  − Everything (named, unnamed, including __m128s) passed on stack
  − gcc assumes stack aligned at 16 bytes
  − Inserts padding as desired
Proposed change
  − Everything (named, unnamed, including __m128/__m256) passed on stack
  − Stack aligned to 32 bytes if __m256 is passed on stack
  − Insert padding as desired

10 Varargs (Linux-64)
Current
  − No change from non-varargs case except
    > Register al contains number of xmm registers used as parameters
    > ABI page 50 (rax) and footnote 14 (al) may or may not be in contradiction
  − Callee has a register save area whose layout is defined by the ABI
    > rdi, rsi, rdx, rcx, r8, r9 (integer registers for parameters) followed by xmm0-15
    > Why xmm0-15 instead of xmm0-7, as only xmm0-7 can be used for parameters?
Proposed change options
  − Align stack to 32 bytes if __m256 parameters are passed on stack
  − Register al contains number of xmm/ymm registers used as parameters
  − For __m256 parameters
    > Option 1: All __m256 parameters (named, unnamed) on stack
    > Option 2: Only unnamed __m256 parameters on stack
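The register save area described in the current Linux-64 scheme can be written down as a C structure to make its size and offsets concrete. This is a sketch that mirrors only what the slide lists (six integer registers followed by xmm0-15); `struct reg_save_area` and `xmm_slot_offset` are hypothetical names, not part of any ABI header.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical C view of the register save area as the slide lists it. */
struct reg_save_area {
    uint64_t gp[6];        /* rdi, rsi, rdx, rcx, r8, r9 */
    uint8_t  xmm[16][16];  /* xmm0 .. xmm15, 16 bytes each */
};

/* Six 8-byte integer slots come first, so the xmm slots start at 48. */
_Static_assert(offsetof(struct reg_save_area, xmm) == 48,
               "xmm slots follow the six integer registers");
_Static_assert(sizeof(struct reg_save_area) == 48 + 16 * 16,
               "integer slots plus sixteen 16-byte xmm slots");

/* Byte offset of the save slot for xmm register n (0..15). */
size_t xmm_slot_offset(unsigned n)
{
    assert(n < 16);
    return offsetof(struct reg_save_area, xmm) + (size_t)n * 16;
}
```

In this layout each vector slot is 16 bytes wide, so a 256-bit ymm value does not fit in one. That is why both proposed options keep `__m256` varargs parameters on the stack rather than in this area.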

11 Unprototyped functions (Linux-64)
Current
  − Same as prototyped + al defined
  − Works whether the function is varargs or non-varargs
Proposed change (assume __m256 as parameter)
  − Same as prototyped + al defined(?)
  − Option 1 (all __m256 on stack): Does not work if function is a varargs function
  − Option 2 (unnamed __m256 on stack): Does not work if __m256 parameter is among unnamed
For unprototyped functions, the caller must treat them as prototyped with al defined, for performance
  − If we want unprototyped functions to work when they are really vararg functions, the register save area must be extended
    > Unknown performance penalty for all vararg functions
On Linux-32 (similar ABI for __m128 to option 1), unprototyped functions do not work when they are really vararg functions

12 AVX and Vectorizer
AVX
  − 128bit INT and 256bit FP vector operations
    > Can use 256bit FP vector AND, ANDN, OR, XOR to emulate 256bit INT.
  − 256bit vector 256bit vector
  − 128bit vector 256bit vector
Vectorizer
  − Doesn’t support vector conversion of different vector sizes.
  − Doesn’t support different vector sizes based on operations.
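The emulation of 256-bit integer logic with FP logical instructions can be sketched with AVX intrinsics. The example below shows AND only and is an illustration, not code from the gcc AVX branch. The function name `and_8x32` is hypothetical, and a scalar fallback is included so the sketch also builds when `-mavx` is not in effect.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__AVX__)
#include <immintrin.h>
#endif

/* Bitwise AND of two blocks of eight 32-bit integers (256 bits each). */
void and_8x32(const int32_t *a, const int32_t *b, int32_t *out)
{
    assert(a != NULL && b != NULL && out != NULL);
#if defined(__AVX__)
    /* Load the integers with unaligned 256-bit loads. */
    __m256i va = _mm256_loadu_si256((const __m256i *)a);
    __m256i vb = _mm256_loadu_si256((const __m256i *)b);
    /* Reinterpret as float lanes (no conversion) and use the FP AND,
       which is purely bitwise, in place of a 256-bit integer AND. */
    __m256 r = _mm256_and_ps(_mm256_castsi256_ps(va),
                             _mm256_castsi256_ps(vb));
    /* Reinterpret the result as integers again and store it. */
    _mm256_storeu_si256((__m256i *)out, _mm256_castps_si256(r));
#else
    /* Fallback when AVX is not enabled at compile time. */
    for (size_t i = 0; i < 8; i++)
        out[i] = a[i] & b[i];
#endif
}
```

The casts only change the type the compiler sees, so both paths give the same bits. The same pattern would apply to ANDN, OR and XOR with the corresponding FP logical intrinsics.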

13 AVX Branch Status
Implemented:
  − AVX code generation. -mavx generates pure AVX instructions without legacy SSE instructions.
  − AVX intrinsics.
  − AVX vectorizer is limited to 128bit.
To do:
  − Variable arguments.
  − Verify unwind and debug.
  − AVX specific tests.
    > Runtime intrinsic tests.
    > Variable arguments.
    > Unwind with 256bit vector.
  − 256bit vectorizer support.

14 Stack Alignment Branch
Collect stack alignment info in middle-end.
Use DW_OP_operation to describe call frame with stack alignment.
  − Need to handle DRAP properly without changing CFA.
Implemented x86 target hooks for stack alignment.
Added ~70 C/C++ runtime testcases for stack alignment.
On 45nm Core 2 Duo in 32bit, compared against gcc 4.4 revision at -O2, stack alignment introduced 0% regression on SPEC CPU 2006 INT/FP, 0.3% regression on SPEC CPU 2K INT and 0.6% regression on SPEC CPU 2K FP.
Updated gdb prologue analyzer to recognize the x86 prologues with stack alignment.
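The slide does not show what a realigning prologue does. As a rough sketch, assuming the realignment is done by masking the stack pointer (the effect of an `and $-32, %esp` style instruction on a downward-growing stack), the arithmetic looks like this. The function names are hypothetical and this is not the branch's code.

```c
#include <assert.h>
#include <stdint.h>

/* Round a stack pointer value down to a multiple of align, which must be
   a power of two. Rounding down is the safe direction for a stack that
   grows toward lower addresses. */
uintptr_t realign_stack_down(uintptr_t sp, uintptr_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return sp & ~(align - 1);
}

/* Bytes of padding that the realignment introduces (0 .. align-1). */
uintptr_t realign_padding(uintptr_t sp, uintptr_t align)
{
    return sp - realign_stack_down(sp, align);
}
```

For a stack pointer of 0x7ffc1234 and a 32-byte requirement, the realigned value is 0x7ffc1220 and 20 bytes are skipped. Because that amount is only known at run time, the incoming arguments can no longer be found at a fixed offset from the realigned stack pointer, which is the kind of situation the DRAP and call frame description items on the slide deal with.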

15 Float128 on x86
gcc uses SSE/SSE2 to implement float128.
gcc only supports float128 on ix
  − Update ia32 psABI.
    > Alignment: 16 bytes.
    > Parameter (varargs) passing: on stack, aligned at 16 bytes.
  − Check TARGET_SSE/TARGET_SSE2/TARGET_SSE_MATH instead of TARGET_64BIT.
There is no run-time support for float128.
  − I/O.
  − String to __float128 function.
  − Math library. IEEE 754R
  − Existing float128 API implementations.
  − Implement float128 with DFP in a separate run-time library.

16 Backup