Presentation on theme: "Vector computers overlap arithmetic operations on the elements of a vector (instruction-level parallelism)" — Presentation transcript:

1

2 claude.tadonki@mines-paristech.fr claude.tadonki@u-psud.fr INTRODUCTION

A vector computer contains a set of special arithmetic units called pipelines. These pipelines overlap the execution of the different parts of an arithmetic operation on the elements of a vector, producing instruction-level (data) parallelism. Do you see any similarity with hyperthreading?

In the literature, this kind of computation is often referred to as: vector computing; SIMD (Single Instruction Multiple Data); instruction-level parallelism (ILP); dual issue; double FPU.

Most modern processors have the ability to perform vector computations. Vector computing does not mean or require several processors (or cores). Some special architectures are equipped with several floating-point units (FPUs). Example: the IBM® Blue Gene®/L supercomputer, whose processors are enhanced with a specially designed dual floating-point unit.

Vector computation requires specific registers called vector registers. Vector operations are performed on vector registers, so their length is an important hardware characteristic, as is their number. Memory bandwidth is also important when evaluating vector computing potential.

3 VECTOR PROCESSING

A scalar implementation of adding two arrays of length n will require 6n steps, whereas a vector (pipelined) implementation will require 6 + (n-1) steps. Depending on the architecture, vector processing applies to various operations (arithmetic, logical).

Consider the 6 steps (stages) involved in a floating-point addition on a sequential machine with IEEE arithmetic hardware:
A. exponents are compared for the smallest magnitude.
B. exponents are equalized by shifting the significand of the smaller operand.
C. the significands are added.
D. the result of the addition is normalized.
E. checks are made for floating-point exceptions such as overflow.
F. rounding is performed.

The pipeline process occurs within a vector register (thus n = 2, 4, 8, …). Some vector architectures provide wider vectorization by chaining the pipelines. Roughly speaking, a p-length vector computation on a given n-array needs n/p steps.

Could you identify and explain other types of pipelines?

4 SOME PROCESSORS WITH VECTOR PROCESSING UNITS

Intel Sandy Bridge processor family (2011), with 256-bit (32-byte) vector registers.
PowerPC (Motorola/Apple/IBM), with 32 128-bit vector registers.
SPE of the IBM CELL BE, with 128 128-bit vector registers.
GPU.
FPGA (Field Programmable Gate Array), an integrated flexible circuit which can be configured as desired to implement a specific processing task.

5 APPLICATIONS OF VECTOR COMPUTING

A SIMD-enabled processor can execute a single operation on multiple data. SIMD works well on image, audio, video, and digital signal processing, and suits stream processing (typically uniform streaming). The expected characteristics are:
Compute intensive (arithmetic operations are dominant compared to I/O).
Data parallel (the same function is applied to all records independently).
Data locality (the data to be accessed are contiguous in memory).
No branching (no control flow; straight-line code).

SIMD is widely used in video games programming, genomics, linear algebra, …

Could you explain the meaning of "independently" in data parallelism? Could you explain why branching is a hindrance for SIMD?

6 VECTOR COMPUTING IMPLEMENTATIONS

AltiVec or VMX (PowerPC); IBM CELL-BE SPE intrinsics. Intrinsics are there to facilitate vector programming. We can expect or force the compiler to vectorize our code (but do not rely on it!).

7 DATA ALIGNMENT

Data alignment is crucial in vector computing and important for performance. Lack of knowledge about data alignment could raise the following issues:
Your software will run slower.
Your application will lock up.
Your operating system will crash.
Your software will silently fail, yielding incorrect results.

Memory is accessed in chunks of constant size (cache lines). A memory address is said to be p-aligned iff it is a multiple of p (typically 128). It is important to also align the size of data types (padding if necessary). Sometimes the compiler will automatically pad your data structures (check with sizeof()). There are specialized libraries for memory-aligned allocations.

Write a C routine that implements aligned memory allocation. Illustrate a misaligned memory address and explain some consequences.
typedef struct{ char a; long b; char c; } mystruct; Is mystruct aligned? How to fix it?

8 SIMD CODES (SSE)

9 SIMD CODES (AVX)

/* Fragment of a larger interpolation kernel: some variables
   (xxx256_data3a, xxx256_y_coeff*, xxx256_sum, nx, nx2, …) are defined
   in code not shown on this slide. */
__m256 xxx256_x_coeff1 = _mm256_load_ps( &interp_coef_x[0] );
__m256 xxx256_data0a   = _mm256_load_ps( &pf[index_signal_start] );
__m256 xxx256_data1a   = _mm256_load_ps( &pf[index_signal_start+nx] );
__m256 xxx256_data2a   = _mm256_load_ps( &pf[index_signal_start+nx2] );
xxx256_data0a = _mm256_mul_ps( xxx256_data0a, xxx256_x_coeff1 );
xxx256_data1a = _mm256_mul_ps( xxx256_data1a, xxx256_x_coeff1 );
xxx256_data2a = _mm256_mul_ps( xxx256_data2a, xxx256_x_coeff1 );
__m256 xxx256_sum1 = _mm256_add_ps( _mm256_mul_ps(xxx256_data2a, xxx256_y_coeff2),
                                    _mm256_mul_ps(xxx256_data3a, xxx256_y_coeff3) );
/* horizontal reduction of the 8 partial sums down to lane 0 */
xxx256_sum  = _mm256_add_ps(xxx256_sum, _mm256_movehdup_ps(xxx256_sum));
xxx256_sum1 = _mm256_unpackhi_ps(xxx256_sum, xxx256_sum);
xxx256_sum1 = _mm256_add_ps(xxx256_sum, xxx256_sum1);
xxx256_sum  = _mm256_permute2f128_ps(xxx256_sum1, xxx256_sum1, 0x01);
xxx256_sum  = _mm256_add_ps(xxx256_sum, xxx256_sum1);
_mm256_store_ps( f, xxx256_sum );
signal_value = f[0];

10 SIMD CODES (VMX)

int vmult(float *array1, float *array2, float *out, int arraySize)
{
    /* This code assumes that the arrays are quadword-aligned. */
    /* This code assumes that arraySize is divisible by 4. */
    int i, arraySizebyfour;
    arraySizebyfour = arraySize >> 2;  /* arraySize/4 vectors */
    vector float *varray1 = (vector float *) (array1);
    vector float *varray2 = (vector float *) (array2);
    vector float *vout    = (vector float *) (out);
    for (i = 0; i < arraySizebyfour; i++) {
        /* vec_mul is an intrinsic that multiplies vectors */
        vout[i] = vec_mul(varray1[i], varray2[i]);
    }
    return 0;
}

