Data Parallel Computations and Pattern ITCS 4/5145 Parallel computing, UNC-Charlotte, B. Wilkinson, slides6c.ppt Nov 11, 2014 6c.1
Data Parallel Computations Same operation performed on different data elements simultaneously; i.e., in parallel, fully synchronous. Particularly convenient because: • Can scale easily to larger problem sizes. • Many numeric and some non-numeric problems can be cast in a data parallel form. Ease of programming (only one program!). 6c.2
Single Instruction Multiple Data (SIMD) model Data parallel model used in vector super-computers designs in1970s: Synchronism at the instruction level. Each instruction specifies a “vector” operation and elements of array to perform operation on. Multiple execution units, each executes operation on a different element or pairs of elements in synchronism Only one instruction fetch/decode unit Subsequently seen in Intel processors -- Vector SSE (Streaming SIMD Extensions) instructions. 6c.3
(SIMD) Data Parallel Pattern Could be described a computational “pattern”: (SIMD) Data Parallel Pattern Program Same program instruction sent to all execution units at the same time Execution units Data Each execution unit performs same operation but on different data in parallel. Usually data are elements of an array
SIMD Example To add same constant, k, to each element of an array: for (i = 0; i < N; i++) a[i] = a[i] + k; Statement a[i] = a[i] + k; could be executed simultaneously by multiple processors, each using a different index i (0<i<=n). Vector instruction Meaning add k to all elements of A[i] , 0 <i<N 6c.5
Using forall construct for data parallel pattern Could use forall to specify data parallel operations forall (i = 0; i < n; i++) a[i] = a[i] + k However, forall is more general – it states that the n instances of the body can be executed simultaneously or in any order (not necessarily executed at the same time). We shall see this in GPU implementation of data parallel pattern. Note forall does imply synchronism at its end – all instances must complete before continuing, which will be true in GPUs 6.6
Data Parallel Example Prefix Sum Problem Given a list of numbers, x0, …, xn-1, compute all the partial summations, i.e.: x0 + x1; x0 + x1 + x2; x0 + x1 + x2 + x3; x0 + x1 + x2 + x3 + x4; … Can also be defined with associative operations other than addition. Widely studied. Practical applications in areas such as processor allocation, data compaction, sorting, and polynomial evaluation. 6.7
Data parallel method for prefix sum operation 6.8
Sequential pseudo code Parallel code using forall notation for (j = 0; j < log(n); j++) // at each step for (i = 2 j; i < n; i++) // accumulate sum x[i] = x[i] + x[i + 2 j]; ** Parallel code using forall notation forall (i = 0; i < n; i++) // accumulate sum if (i >= 2 j) x[i] = x[i] + x[i + 2 j]; ** ** Note this will not work because of data dependences - the x array will get overwritten, so in real code use temp array or sequentially count downwards. 6c.9
Low level image processing Involves manipulating image pixels (picture elements) and often the same operation on each pixel using neighboring pixel values SIMD (single instruction multiple data) model very applicable. Historically, GPUs designed for creating image data for displays using this model.
Single Instruction Multiple Thread Programming Model (SIMT) A version of SIMD used in recent GPUs. Multiple threads, each execute the same instruction sequence. Groups of threads scheduled to execute at the same time on execution cores. Very low thread overhead.
SIMT Example -- Matrix Multiplication Matrix multiplication easy to make a data parallel version. Change two for’s to forall’s: forall (i = 0; i < n; i++) // for each row of A forall (j = 0; j < n; j++) { // for each column of B c[i][j] = 0; for (k = 0; k < n; k++) c[i][j] += a[i][k] * b[k][j]; } Each instance of body is a separate thread, doing same calculation but on different elements of array 6c.1212
forall (i = 0; i < n; i++) // for each row of A forall (j = 0; j < n; j++) { // for each column of B } Threads c[0][0] = 0; for (k = 0; k < n; k++) c[0][0]+=a[0][k]*b[k][0]; c[n-1][n-1] = 0; for (k = 0; k < n; k++) c[n-1][n-1]+=a[n-1][k]*b[k][n-1]; One thread for each c element, doing the same calculation but using different a and b elements 6c.13
We will explore programming GPUs for high performance computing next. Questions so far 6.14