Map-Scan Node Accelerator for Big-Data

Mihaela Malița (Saint Anselm College, US)
Gheorghe M. Ștefan (Politehnica University of Bucharest, RO)

IEEE BigData 2017, Boston, December 2017
Content

- Hybrid Computation
- Map-Scan Organization
- Map-Scan Architecture
- Evaluation
- Concluding Remarks
Hybrid Computation

Current Big-Data platforms (Symmetric Multiprocessors, Clusters, Grids) use accelerators:
- GPU (Nvidia): burdened by architectural legacy
- Many Integrated Core (Intel): ad hoc organization
- FPGA (Intel, Xilinx): requires competent hardware designers

Thus, each node becomes a Hybrid Computer.
Actual performance is too low relative to peak performance

Application: object detection
- Intel i7 CPU: achieves 32% of peak performance (OK)
- Intel Xeon Phi: achieves 1.36% of peak performance (?!)
- Titan X GPU: achieves 1.05% of peak performance (?!)

Application: scan operation
- GeForce 8800 GTX GPU (575 cores) accelerates an Intel single-core by only 6x (?!)
Parametrizable & configurable programmable generic accelerator for FPGA

- FPGA accelerators require skilled hardware designers, who are in short supply.
- The automated path from high-level code to efficient hardware is questionable, complex, and expensive.
- Our proposal: a generic programmable accelerator; the user writes the code to be accelerated, and the compiled code is used to:
  - set the parameters
  - configure the accelerator
- Main advantage: the user sees a programmable engine instead of a circuit.
- Outcome: a fast and cheap solution with almost the same performance.
For both the ASIC and the FPGA versions, our proposal is the MAP-SCAN ACCELERATOR.

- HOST + MAP-SCAN ACCELERATOR = APU: the node in an SMP, Cluster, or Grid
- MAP-SCAN ACCELERATOR: ARM & MEMORY, INTERFACE, CONTROLLER, MAP-SCAN ARRAY
- MAP: p cells, each with memory (mem) and engine (eng)
- SCAN: log-depth network
SCAN is a PREFIX network

PREFIX(x_0, ..., x_{n-1}) = <y_0, ..., y_{n-1}>, for the associative operation ∘, is:
  y_0 = x_0
  y_1 = x_0 ∘ x_1
  y_2 = x_0 ∘ x_1 ∘ x_2
  ...
  y_{n-1} = x_0 ∘ x_1 ∘ ... ∘ x_{n-1}

REDUCE(x_0, ..., x_{n-1}) = y_{n-1}

Size: S_scan(n) ∈ O(n); Depth: D_scan(n) ∈ O(log n)
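A minimal sketch of the prefix operation (my own illustration in NumPy-free Python, not the accelerator's code): a sequential reference and a Hillis-Steele style version whose number of passes corresponds to the O(log n) depth of the SCAN network.

    # Inclusive prefix scan for an associative operation `op`.
    import operator

    def prefix_sequential(xs, op=operator.add):
        """Reference: y_i = x_0 op x_1 op ... op x_i, in O(n) steps."""
        ys, acc = [], None
        for x in xs:
            acc = x if acc is None else op(acc, x)
            ys.append(acc)
        return ys

    def prefix_log_depth(xs, op=operator.add):
        """Hillis-Steele style scan: O(log n) parallel steps; each step
        corresponds to one level of a log-depth prefix network."""
        ys = list(xs)
        d = 1
        while d < len(ys):
            ys = [ys[i] if i < d else op(ys[i - d], ys[i]) for i in range(len(ys))]
            d *= 2
        return ys

    if __name__ == "__main__":
        xs = [3, 1, 4, 1, 5, 9, 2, 6]
        assert prefix_sequential(xs) == prefix_log_depth(xs) == [3, 4, 8, 9, 14, 23, 25, 31]
        print(prefix_log_depth(xs)[-1])  # REDUCE = last prefix element: 31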
User's view of the MAP memory resources

With s_ij the scalar stored at location i of the memory in cell j:

Vertical vectors (one in each mem):
  W_j = s_0j, s_1j, ..., s_(m-1)j,   for j = 0, 1, ..., p-1

Horizontal vectors (one scalar per cell):
  V_0 = s_00, s_01, ..., s_0(p-1)
  V_1 = s_10, s_11, ..., s_1(p-1)
  ...
  V_(m-1) = s_(m-1)0, s_(m-1)1, ..., s_(m-1)(p-1)
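A small NumPy sketch of this layout (an assumed model, not the accelerator's API): the MAP memory as an m x p array in which column j is the local memory of cell j, a horizontal vector is a row, and a vertical vector is a column.

    import numpy as np

    p, m = 8, 4                     # toy sizes; the real design uses p = 2048 cells
    s = np.arange(m * p).reshape(m, p)

    V_0 = s[0, :]                   # horizontal vector: one scalar per cell
    W_3 = s[:, 3]                   # vertical vector: the whole memory of cell 3

    # A "map" step applies the same operation to every cell's scalar in parallel:
    V_1 = s[1, :]
    result = V_0 + V_1              # element-wise, one operation per cell
    print(result)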
Instruction set of our Map-Scan Accelerator

ISA_MapScan = ISA_CONTROLLER × ISA_ARRAY
  ISA_CONTROLLER = (SS_ARITH&LOGIC, SS_CONTROL, SS_COMMUNICATION)
  ISA_ARRAY      = (SS_ARITH&LOGIC, SS_SPATIAL_CONTROL, SS_TRANSFER)

Same: the arithmetic & logic sub-set.
Difference: temporal control in the CONTROLLER vs. spatial control in the ARRAY:

  spatial control      temporal control
  where (cond)         if (cond)
  elsewhere            else
  endwhere             endif
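A minimal NumPy sketch of the spatial-control idea (assumed semantics, not the actual ISA): where/elsewhere/endwhere acts as per-cell predication, with a mask selecting which cells keep which result.

    import numpy as np

    acc = np.array([5, -3, 7, -1, 0, 9, -8, 2])   # one "accumulator" per cell

    # where (acc < 0)
    mask = acc < 0
    then_result = -acc        # body seen by the selected cells, e.g. negate
    # elsewhere
    else_result = acc * 2     # body seen by the other cells, e.g. double
    # endwhere
    acc = np.where(mask, then_result, else_result)
    print(acc)                # [10  3 14  1  0 18  8  4]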
Map-Scan code for a Map-Reduce version (code listing omitted in this transcript)

Intermediate vector contents in the example:
  v[15] = 15, 16, ..., 30, 31
  v[16] = 3, 3, ..., 3, 3
  v[17] = r, r, ..., r, r
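Since the listing itself is not reproduced here, the following is only my generic illustration of a Map-Reduce step on the map-scan model: MAP applies the same operation in every cell, REDUCE takes the last element of the prefix scan, and the result can be broadcast back into every cell.

    import numpy as np

    p = 16
    ix = np.arange(p)                    # per-cell index vector: 0, 1, ..., p-1

    mapped = ix * ix                     # MAP: each cell squares its own scalar
    prefix = np.cumsum(mapped)           # SCAN over the log-depth network
    r = prefix[-1]                       # REDUCE = the last prefix element
    broadcast = np.full(p, r)            # r replicated in every cell
    print(r, broadcast[:4])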
Hardware evaluation for a 28 nm simulation

- Number of 32-bit cells: 2048
- Local memory in each cell: 4 KB
- Clock frequency: 1 GHz
- Area: 9.2 × 9.2 mm^2
- Power consumption: see diagram (not reproduced here)
- Peak performance: 2 TOP/s, or 1.4 TFLOP/s for a 20% flop weight
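As a rough cross-check (my arithmetic; it assumes one 32-bit operation per cell per clock cycle, which the slide does not state), the integer peak follows directly from the cell count and the clock:

    P_{peak} = p \cdot f_{clk} = 2048 \times 10^{9}\ \mathrm{op/s} \approx 2\ \mathrm{TOP/s}

The 1.4 TFLOP/s figure additionally depends on the cycle cost of a floating-point operation at the quoted 20% flop weight, which is not given on this slide.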
How to use the Map-Scan Accelerator

- MAP-SCAN ARRAY: runs a library kernel (e.g., for the Eigen library) under the control of the CONTROLLER.
- MAP-SCAN ACCELERATOR: uses the library kernel to develop the library in a high-level language (C++, Python, ...) under the control of the ARM (or equivalent).
- ACCELERATED PROCESSING UNIT: the HOST runs the application, which uses the accelerated library.
Application: k-means clustering

Consider n d-dimensional entities to be clustered into k sets. Algorithm:
1. Randomly assign "positions" to the k centers and randomly associate the entities.
2. For each d-dimensional entity, compute the Euclidean distance to each center and assign the entity to the nearest one.
3. Compare the current assignment with the previous one; if there are no differences, stop; else continue.
4. Move the k centers to the means of the created groups and go to step 2.

The degree of parallelism is maximal (p) for steps 2 and 3, while for step 4 it is p/k.
Acceleration is 546x for p = 1024 and k > 10 (a sketch of steps 2-4 follows).
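A NumPy stand-in for the data-parallel core of the iteration (my sketch, not the accelerator's code); conceptually, each of the p cells would hold one entity and compute its distances locally.

    import numpy as np

    rng = np.random.default_rng(0)
    n, d, k = 1024, 8, 16                 # n entities of dimension d, k clusters
    points = rng.random((n, d))
    centers = points[rng.choice(n, k, replace=False)]   # step 1: random centers
    assign = np.full(n, -1)

    while True:
        # step 2: per-entity distances to every center, nearest-center assignment
        dist = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        new_assign = dist.argmin(axis=1)
        # step 3: compare with the previous assignment; stop if unchanged
        if np.array_equal(new_assign, assign):
            break
        assign = new_assign
        # step 4: move each center to the mean of its group (parallelism ~ p/k)
        for c in range(k):
            if np.any(assign == c):
                centers[c] = points[assign == c].mean(axis=0)

    print(np.bincount(assign, minlength=k))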
I/O-bounded application: scan computing

Number of scalars: 2^20, stored in memory, with the result written back to memory.
- GeForce 8800 GTX at 1350 MHz: execution time 1.11 ms
- Map-Scan Accelerator at 1000 MHz: execution time 0.133 ms
- The architectural acceleration: 11.2x
- If only the computation is considered: the architectural acceleration is 38.8x
- Architectural efficiency in using the cores: (38.8 × 575/1024)x = 21.8x
- Architectural acceleration compared with one core: 130x > p/log2(p) ≈ 100
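The 11.2x figure is consistent with comparing cycle counts rather than raw execution times (my reading of "architectural acceleration"; the slide does not define the term):

    N_{GPU} \approx 1.11\ \mathrm{ms} \times 1350\ \mathrm{MHz} \approx 1.50 \times 10^{6}\ \mathrm{cycles}
    N_{Acc} = 0.133\ \mathrm{ms} \times 1000\ \mathrm{MHz} = 1.33 \times 10^{5}\ \mathrm{cycles}
    N_{GPU} / N_{Acc} \approx 11.3

which matches the quoted 11.2x within rounding.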
Application: Deep Neural Networks

On both convolutional layers and fully connected layers, matrix-vector multiplication is performed. DNN computation is not I/O bounded.

For M × N matrices with N ≤ p and M ≤ m (see the sketch below):
  T(M, N) = M + 2 + log2(p)
where p is the number of cells in MAP and m is the memory size in each cell.
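A NumPy sketch of one way such a matrix-vector product maps onto the array (an assumed mapping, not the accelerator's kernel): cell j holds column j of A and the scalar x_j; each of the M rows costs one parallel multiply plus one log-depth reduction, which matches the M + log2(p) shape of T(M, N).

    import numpy as np

    p = 1024
    M, N = 16, 256                      # N <= p, M <= m
    rng = np.random.default_rng(1)
    A = rng.random((M, N))
    x = rng.random(N)

    y = np.empty(M)
    for i in range(M):                  # M sequential row steps
        partial = A[i, :] * x           # MAP: one multiply in each active cell
        y[i] = partial.sum()            # REDUCE over the log-depth scan network

    assert np.allclose(y, A @ x)
    print(y[:4])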
Concluding Remarks

- Preliminary evaluations are encouraging.
- For all investigated applications, the architectural acceleration is higher than p/log2(p).
- Supralinear acceleration is obtained because control, map, and scan execute in parallel.
- Unfortunately, the "von Neumann bottleneck" persists.
Thank you

Questions & possible answers
Pointers:
https://www.altera.com/products/boards_and_kits/dev-kits/altera/acceleration-card-arria-10-gx.html