COMPUTER ARCHITECTURE (P175B125)


COMPUTER ARCHITECTURE (P175B125) Assoc.Prof. Stasys Maciulevičius Computer Dept. stasys.maciulevicius@ktu.lt

Extension of instruction sets 2009-2013 ©S.Maciulevičius

Extension of instruction set: reasons and preconditions
- Earlier processors were focused on processing integer and floating-point numbers.
- Digital processing of graphics and audio information became widespread.
- Technology development and the shrinking of the manufacturing process from 0.35 μm to 0.13 μm led to a significant increase in the number of transistors on a chip.
- A RISC core is compact: it uses a relatively small number of transistors.
- The word length increased from 32 bits to 64 bits.
- In many cases 16 or even 8 bits are sufficient to encode digital graphics and audio information.
- This makes it possible to apply SIMD and vector processing principles.

Extension of instruction set
- In 1996, Intel introduced MMX technology: the processor's instruction set was extended with 57 new instructions for optimizing multimedia applications.
- These instructions process data as in a SIMD (Single Instruction, Multiple Data) system.
- Other companies introduced similar instruction set extensions in their processors a little later.

Extensions of instruction set

| Abbrev. | Name | Company | Processors |
|---|---|---|---|
| MMX | MultiMedia eXtension | Intel | Pentium w. MMX, Pentium II |
| | | Cyrix | MediaGX |
| KNI, SSE, SSE2, … | Katmai New Instructions, Streaming SIMD Extensions | Intel | Pentium III, Pentium 4 |
| 3DNow! | | AMD | K6, K7 (Athlon, Duron) |
| AltiVec | | Motorola, IBM | G4, G5; Power 4+, PPC 970 |
| VIS | Visual Instruction Set | Sun Microsystems | UltraSPARC |
| MAX-2 | Multimedia Acceleration eXtensions | HP | PA-7100LC, PA-8000 |

Requirements for extension of instruction set
In order to maintain compatibility with existing software and operating systems, the designers had to ensure the following:
- Programs using MMX instructions must be able to run under existing operating systems. This means that MMX technology must not add any new architecturally visible state or events (exceptions).
- Programs that do not use MMX instructions must run without any changes. This means that MMX technology must not change any existing IA-32 instruction.

Requirements for extension of instruction set (continued)
- Existing applications must be able to use MMX technology without being rewritten as a whole: MMX can be used in a separate procedure while the rest of the program stays unchanged. This requires that MMX instructions work within the existing procedure call conventions.
- Programs using MMX instructions must also be able to run on older processors that do not support MMX. This means that DLLs should be written in two versions, for processors with and without MMX technology.

MMX registers
[Figure: the eight MMX registers MM0..MM7 share storage with the 80-bit FPU registers FP0..FP7 (MM0 and FP0, MM1 and FP1, ..., MM7 and FP7). The MMX data occupies bits 63..0; bits 79..64 hold the sign and exponent. Each register has a 2-bit tag, shown as 00.]
- When FPU registers are used as MMX registers, the sign bit and all exponent bits are set to 1 (in the IEEE 754 encoding this corresponds to NaN).
- While MMX instructions execute, the tags are set to 00 (valid). On the transition back from MMX to FPU mode (EMMS instruction) the tags are set to 11, which means that the registers are "empty".

Pixel encoding
[Figure: pixel formats. 8-bit color pixels (the pixel is an index), 8-bit gray pixels (the pixel is an intensity), 12-bit color pixels (R, G, B fields), 32-bit color pixels.]

Addition: simple and with saturation
Consider two 8-bit signed integers, +85 and +58 (the dot separates the sign bit from the 7 magnitude bits). Add them:

  0.1010101   (+85)
+ 0.0111010   (+58)
  ---------
  1.0001111   (143 does not fit into 7 magnitude bits)

The result can be handled in different ways:
- an overflow is signalled;
- the result is set to 0.0001111 = 15 (the carry into the sign position is ignored; this is addition mod 128);
- the result is set to 0.1111111 = 127 (saturation: this is the maximal value of a positive 8-bit integer).
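The three interpretations can be checked with a minimal Python sketch. The function names are illustrative and do not come from the slides. `add_mod128` follows the slide's convention of dropping the carry out of the 7 magnitude bits; `add_wrap_int8` is an added assumption that shows what ordinary two's-complement wraparound would give for the same operands.

```python
def add_mod128(a, b):
    """Keep only the 7 magnitude bits; the carry into the sign position is dropped."""
    return (a + b) % 128


def add_saturate_int8(a, b):
    """Clamp the exact sum to the signed 8-bit range [-128, 127]."""
    return max(-128, min(127, a + b))


def add_wrap_int8(a, b):
    """Two's-complement wraparound of an 8-bit signed addition (assumed hardware behavior)."""
    s = (a + b) & 0xFF
    return s - 256 if s >= 128 else s


print(add_mod128(85, 58))         # 15
print(add_saturate_int8(85, 58))  # 127
print(add_wrap_int8(85, 58))      # -113
```

For pixel data the saturated result is usually the desired one: a sum that is too bright stays at maximum brightness instead of wrapping around to a dark value.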

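Data range and saturation

| Type | Size | Lower boundary (hex) | Lower boundary (decimal) | Upper boundary (hex) | Upper boundary (decimal) |
|---|---|---|---|---|---|
| Signed | 1 byte | 80H | -128 | 7FH | 127 |
| Signed | 2 bytes | 8000H | -32 768 | 7FFFH | 32 767 |
| Unsigned | 1 byte | 00H | 0 | FFH | 255 |
| Unsigned | 2 bytes | 0000H | 0 | FFFFH | 65 535 |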
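The boundaries in this table follow directly from the operand width. A short Python sketch, with hypothetical helper names `saturation_bounds` and `saturate`, reproduces them and clamps a value to the chosen range:

```python
def saturation_bounds(n_bytes, signed):
    """Return (lower, upper) boundary for an integer of n_bytes bytes."""
    bits = 8 * n_bytes
    if signed:
        return -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    return 0, (1 << bits) - 1


def saturate(value, n_bytes, signed):
    """Clamp value to the representable range instead of wrapping around."""
    lo, hi = saturation_bounds(n_bytes, signed)
    return max(lo, min(hi, value))


for n_bytes in (1, 2):
    for signed in (True, False):
        print(n_bytes, "byte(s)", "signed" if signed else "unsigned",
              saturation_bounds(n_bytes, signed))
```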

Some graphic instructions
Suffix t (operand type): n - nibble, b - byte, h - halfword, w - word.
Suffix x (signedness): u - unsigned, s - signed, us - mixed.

| Mnemonic | Instruction | Operands | Operation |
|---|---|---|---|
| padd.t | Packed Add | rd, rs1, rs2 | rd:rd+1 ← rs1:rs1+1 + rs2:rs2+1; sum mod 2^t |
| padds.x.t | Packed Add and Saturate | rd, rs1, rs2 | rd:rd+1 ← rs1:rs1+1 + rs2:rs2+1; sum with saturation |
| psub.t | Packed Subtract | rd, rs1, rs2 | rd:rd+1 ← rs1:rs1+1 - rs2:rs2+1; subtraction mod 2^t |
| psubs.x.t | Packed Subtract and Saturate | rd, rs1, rs2 | rd:rd+1 ← rs1:rs1+1 - rs2:rs2+1; subtraction with saturation |

Pixel addition: examples
[Figure: results of padd.b, padds.u.b, padds.s.b and padds.us.b for sample byte operands. Values visible in the figure: 00 55 80 AA 7F FF 54 2A 29.]
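The difference between the packed add variants can be illustrated with a small Python model that treats a register as a list of byte lanes. This is only a sketch: the function names mirror the mnemonics, the operand values are chosen for illustration and are not the ones from the slide, and the mixed variant (padds.us.b) is left out.

```python
def to_signed(byte):
    """Interpret an 8-bit pattern as a two's-complement value."""
    return byte - 256 if byte >= 0x80 else byte


def padd_b(a, b):
    """Packed add on bytes: every lane is summed mod 2^8."""
    return [(x + y) & 0xFF for x, y in zip(a, b)]


def padds_u_b(a, b):
    """Packed add with unsigned saturation: lanes are clamped to FFH."""
    return [min(0xFF, x + y) for x, y in zip(a, b)]


def padds_s_b(a, b):
    """Packed add with signed saturation: lanes are clamped to 80H..7FH."""
    out = []
    for x, y in zip(a, b):
        s = max(-128, min(127, to_signed(x) + to_signed(y)))
        out.append(s & 0xFF)  # back to the 8-bit pattern
    return out


a = [0x7F, 0xFF, 0x80, 0x55]
b = [0x01, 0x01, 0x80, 0x55]
for op in (padd_b, padds_u_b, padds_s_b):
    print(op.__name__, [f"{v:02X}" for v in op(a, b)])
```

Every lane is processed independently, so a carry or an overflow in one byte never affects its neighbor.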

Some graphic instructions (continued)

| Mnemonic | Instruction | Operands | Operation |
|---|---|---|---|
| pmulh | Packed Multiply High (on words) | rd, rs1, rs2 | rd:rd+1 ← rs1 × rs2:rs2+1; pixel multiply |
| pmadd | Packed Multiply on words and Add resulting pairs | rd, rs1, rs2 | rd:rd+1 ← rs1:rs1+1 ×+ rs2:rs2+1; multiply words and add resulting pairs |

Vector product
x = Σ a(i) × c(i), i = 0..7
[Figure: pmadd multiplies the elements a0..a7 by c0..c7 and adds adjacent products, giving a0c0+a1c1, a2c2+a3c3, a4c4+a5c5, a6c6+a7c7. A packed addition then produces the partial sums s0145 and s2367. A final shift and addition are needed.]
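The data flow of this vector product can be traced in a short Python sketch. It assumes that one pmadd handles four 16-bit words at a time and that the partial sums s0145 and s2367 come from a packed add of the two pmadd results; the helper names are illustrative, and overflow of the 32-bit intermediate sums is not modeled.

```python
def pmadd(a, c):
    """Multiply four words pairwise and add adjacent products, giving two sums."""
    assert len(a) == len(c) == 4
    p = [x * y for x, y in zip(a, c)]
    return [p[0] + p[1], p[2] + p[3]]


def padd(u, v):
    """Lane-wise packed addition."""
    return [x + y for x, y in zip(u, v)]


def dot8(a, c):
    """Eight-element vector product built from two pmadd steps."""
    lo = pmadd(a[0:4], c[0:4])   # [a0c0+a1c1, a2c2+a3c3]
    hi = pmadd(a[4:8], c[4:8])   # [a4c4+a5c5, a6c6+a7c7]
    s0145, s2367 = padd(lo, hi)  # partial sums held in one register
    return s0145 + s2367         # the final shift and addition


print(pmadd([1, 2, 3, 4], [5, 6, 7, 8]))        # [17, 53]
print(dot8([1, 2, 3, 4, 5, 6, 7, 8], [1] * 8))  # 36
```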

Vector product: you win

| Operation | Number of instructions without MMX | Number of MMX instructions |
|---|---|---|
| Load | 16 | 4 |
| Multiply | 8 | 2 |
| Shift | | |
| Add | 7 | 1 |
| Store | | |
| Other | - | 3 |
| Total | 40 | 13 |

Peculiarities of using MMX
- Because MMX and FPU instructions use the same registers for different purposes, developers should take care when writing code that uses MMX and FPU instructions alternately.
- MMX modules should be separated from floating-point code modules. Code of one type (MMX or floating-point) should be grouped together as much as possible.
- To achieve maximum performance, loops inside a module should not contain conditional jumps into a module of the other type.

SSE
In the Pentium III (1999), 70 new instructions were added: SSE (Streaming SIMD Extensions), as a reply to AMD's 3DNow!. The main differences from MMX are the following:
- some useful new operations, such as min/max, are added;
- some cache and memory management operations are added, which optimize the exchange between L2/L3 cache and main memory;
- SSE originally added eight new 128-bit registers, known as XMM0 through XMM7, and floating-point instructions (on 32-bit numbers).

SSE (continued)
The XMM block carries out:
- vector operations on four pairs of operands;
- scalar operations on one pair of operands (the lowest 32-bit word).
While instructions are executed in the XMM block, the FPU/MMX unit is free, so SSE instructions can be executed in parallel with floating-point instructions. Thus the MMX unit executes integer instructions, and the XMM block executes 32-bit floating-point instructions.
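The distinction between vector and scalar operations can be sketched in Python with a register modeled as a list of four values, lane 0 being the lowest 32-bit word. The function names follow the SSE mnemonics ADDPS and ADDSS; the model is an illustration only and ignores single-precision rounding.

```python
def addps(a, b):
    """Vector (packed) add: all four lanes are added pairwise."""
    return [x + y for x, y in zip(a, b)]


def addss(a, b):
    """Scalar add: only the lowest lane is added, the upper three lanes of a are kept."""
    return [a[0] + b[0]] + a[1:]


a = [1.0, 2.0, 3.0, 4.0]
b = [10.0, 20.0, 30.0, 40.0]
print(addps(a, b))  # [11.0, 22.0, 33.0, 44.0]
print(addss(a, b))  # [11.0, 2.0, 3.0, 4.0]
```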

SSE2
- SSE2, introduced with the Pentium 4, is a major enhancement to SSE.
- SSE2 adds new math instructions for double-precision (64-bit) floating point and also extends MMX instructions to operate on 128-bit XMM registers.
- SSE2 enables the programmer to perform SIMD math on any data type (from 8-bit integer to 64-bit float) entirely within the XMM vector register file, without the need to use the legacy MMX or FPU registers.

Data formats in SSE2: integers
- One 128-bit integer
- Two 64-bit integers
- Four 32-bit integers
- Eight 16-bit integers
- Sixteen 8-bit integers

Data formats in SSE2: floating point
- Two 64-bit floating-point numbers
- Four 32-bit floating-point numbers
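That the same 128 bits can be read in any of these formats can be shown with Python's standard `struct` module. The sketch below is an illustration under the assumption of little-endian byte order, as on x86; the helper name `lanes` and the sample bytes are made up for the example.

```python
import struct

raw = bytes(range(16))  # 16 bytes = 128 bits, values 00H..0FH


def lanes(data, fmt):
    """Reinterpret the same 16 bytes as little-endian lanes of the given struct format."""
    assert struct.calcsize("<" + fmt) == 16
    return struct.unpack("<" + fmt, data)


print(lanes(raw, "2Q"))        # two 64-bit integers
print(lanes(raw, "4I"))        # four 32-bit integers
print(lanes(raw, "8H"))        # eight 16-bit integers
print(lanes(raw, "16B"))       # sixteen 8-bit integers
print(len(lanes(raw, "2d")))   # two 64-bit floating-point numbers
print(len(lanes(raw, "4f")))   # four 32-bit floating-point numbers
```

The register itself carries no type information: the instruction that operates on it decides how the bits are split into lanes.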

SSE3
SSE3, also called Prescott New Instructions (PNI), is an incremental upgrade to SSE2, adding a handful of DSP-oriented mathematics instructions and some process (thread) management instructions.

Some examples of SSE3
Operands:
  OpA (128 bits, 4 words): a3 | a2 | a1 | a0
  OpB (128 bits, 4 words): b3 | b2 | b1 | b0
- MOVSLDUP (Move Packed Single-FP Low and Duplicate): Result: b2 | b2 | b0 | b0
- HADDPS ("horizontal" addition): Result: b3 + b2 | b1 + b0 | a3 + a2 | a1 + a0
- ADDSUBPS (addition and subtraction): Result: a3 + b3 | a2 - b2 | a1 + b1 | a0 - b0
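The three results can be reproduced with a small Python model. Here a register is a list indexed from the lowest word, so `a[0]` is a0, whereas the slide writes the words from highest to lowest. The function names mirror the mnemonics; this is a sketch of the lane arithmetic only, not of the real instructions' floating-point behavior.

```python
def movsldup(b):
    """Duplicate the even-numbered words: b2 | b2 | b0 | b0."""
    return [b[0], b[0], b[2], b[2]]


def haddps(a, b):
    """Horizontal add: b3+b2 | b1+b0 | a3+a2 | a1+a0."""
    return [a[0] + a[1], a[2] + a[3], b[0] + b[1], b[2] + b[3]]


def addsubps(a, b):
    """Subtract in even lanes, add in odd lanes: a3+b3 | a2-b2 | a1+b1 | a0-b0."""
    return [a[0] - b[0], a[1] + b[1], a[2] - b[2], a[3] + b[3]]


a = [1, 2, 3, 4]      # a0..a3
b = [10, 20, 30, 40]  # b0..b3
print(movsldup(b))     # [10, 10, 30, 30]
print(haddps(a, b))    # [3, 7, 30, 70]
print(addsubps(a, b))  # [-9, 22, -27, 44]
```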

SSSE3
- SSSE3 is an incremental upgrade to SSE3, adding 16 new opcodes, which include permuting the bytes in a word, multiplying 16-bit fixed-point numbers with correct rounding, and within-word accumulate instructions.
- It was introduced by Intel in the Core microarchitecture, used in Xeon 5100 and Core 2 processors.

SSE4
- SSE4, announced by Intel in 2006, adds 54 new instructions: 47 in the SSE4.1 set and 7 in SSE4.2. AMD K10 processors implement a separate extension, SSE4a.
- Unlike all previous iterations of SSE, SSE4 contains instructions that execute operations which are not specific to multimedia applications.
- SSE4 operations use 128-bit registers.
- One example: MPSADBW computes eight sums of absolute differences in one instruction: |x0-y0|+|x1-y1|+|x2-y2|+|x3-y3|, |x0-y1|+|x1-y2|+|x2-y3|+|x3-y4|, ...; this operation is useful in HDTV coding devices.
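The sliding pattern of the MPSADBW sums is easier to see in code. The following Python sketch models only the arithmetic described above: a 4-byte block x is compared against eight overlapping 4-byte windows of y. The function name and argument layout are simplifications for illustration and do not reproduce the operand and immediate encoding of the real instruction.

```python
def mpsadbw(x, y):
    """Eight sums of absolute differences between x[0..3] and the windows y[k..k+3]."""
    assert len(x) == 4 and len(y) >= 11
    return [sum(abs(x[i] - y[i + k]) for i in range(4)) for k in range(8)]


x = [1, 2, 3, 4]
y = list(range(11))   # 0, 1, ..., 10
print(mpsadbw(x, y))  # [4, 0, 4, 8, 12, 16, 20, 24]
```

In this example the smallest sum, 0, appears at offset 1, which is where the window of y matches x exactly. Finding such best-matching offsets is the core of motion estimation in video coding.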

3DNow!
- 3DNow! is an extension to the x86 instruction set developed by AMD.
- The original idea behind its creation was to extend MMX from integer-only operations to also accelerating floating-point calculations.
- It adds SIMD instructions to the base x86 instruction set, enabling it to perform simple vector processing, which improves the performance of many graphics-intensive applications.
- The first microprocessor to implement 3DNow! was the AMD K6-2, which was introduced in 1998.

SSE5
- SSE5 (short for Streaming SIMD Extensions version 5), announced by AMD on August 30, 2007, is an extension to the 128-bit SSE core instructions in the AMD64 instruction set for the Bulldozer processor core.
- The details of how the instructions are encoded were revised in May 2009 for better compatibility with Intel's proposed AVX (Advanced Vector Extensions) instruction set.

SSE5 (continued)
At the same time, the name SSE5 was dropped and the set was split into three extensions:
- XOP: new operations over integer vectors;
- FMA4: fused multiply-and-add instructions for floating-point scalar and SIMD operations;
- CVT16: half-precision floating-point conversion.
The SSE5 instruction set consisted of 170 instructions (including 46 base instructions).

Advanced Vector Extensions
- Advanced Vector Extensions (AVX) is a new 256-bit SIMD floating-point vector extension of Intel Architecture.
- Its introduction was targeted for the Sandy Bridge processor family in the 2010 timeframe.
- Intel AVX accelerates floating-point intensive computation in general-purpose applications such as image, video, and audio processing, engineering applications such as 3D modeling and analysis, scientific simulation, and financial analytics.

Advanced Vector Extensions (continued)
- The size of the SIMD vector registers is increased from 128-bit XMM registers to 256-bit registers called YMM0 - YMM15.
- Existing 128-bit instructions use the lower half of the YMM registers.
- Further extensions to 512 or 1024 bits are expected in the future.
- Instructions are non-destructive: the AVX instruction set allows all two-operand XMM instructions to be modified into non-destructive three-operand forms, where the destination register is different from both source registers. For example, a = a + b is replaced by c = a + b, so that register a is unchanged after the instruction.
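The difference between the destructive two-operand form and the non-destructive three-operand form can be sketched with a toy register file in Python. The dictionary `regs` and both function names are hypothetical and exist only for this illustration.

```python
def add_two_operand(regs, dst, src):
    """Two-operand form, a = a + b: the first source is overwritten by the result."""
    regs[dst] = [x + y for x, y in zip(regs[dst], regs[src])]


def add_three_operand(regs, dst, src1, src2):
    """Three-operand form, c = a + b: both sources keep their values."""
    regs[dst] = [x + y for x, y in zip(regs[src1], regs[src2])]


regs = {"a": [1, 2], "b": [10, 20], "c": [0, 0]}
add_three_operand(regs, "c", "a", "b")
print(regs["a"], regs["c"])  # [1, 2] [11, 22]
add_two_operand(regs, "a", "b")
print(regs["a"])             # [11, 22]
```

With the two-operand form, a program that still needs the old value of a has to copy it to another register first; the three-operand form saves that extra move.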

Advanced Vector Extensions 2
Advanced Vector Extensions 2 (AVX2), also known as Haswell New Instructions, is an expansion of the AVX instruction set to be first introduced in Intel's Haswell microarchitecture. AVX2 makes the following additions:
- expansion of most integer AVX instructions to 256 bits;
- 3-operand general-purpose bit manipulation and multiply;
- gather support, enabling vector elements to be loaded from non-contiguous memory locations;
- vector shifts and 3-operand fused multiply-accumulate support.
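Gather support can be pictured as an indexed load. The Python sketch below compares it with an ordinary contiguous load; memory is modeled as a plain list, and the function names and the base-plus-index addressing are simplifying assumptions for illustration.

```python
def load_contiguous(memory, base, count):
    """Ordinary vector load: count neighboring elements starting at base."""
    return memory[base:base + count]


def gather(memory, base, indices):
    """Gather: each lane is loaded from its own address base + index."""
    return [memory[base + i] for i in indices]


memory = list(range(100, 116))              # 100, 101, ..., 115
print(load_contiguous(memory, 2, 4))        # [102, 103, 104, 105]
print(gather(memory, 2, [0, 5, 1, 9]))      # [102, 107, 103, 111]
```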

Advantages of MMX