Map-Scan Node Accelerator for Big-Data
IEEE BigData 2017, Boston, Dec. 2017
Mihaela Malița (Saint Anselm College, US)
Gheorghe M. Ștefan (Politehnica University of Bucharest, RO)
Content
Hybrid Computation
Map-Scan Organization
Map-Scan Architecture
Evaluation
Concluding Remarks
Hybrid Computation
The current platforms for Big Data (symmetric multiprocessors, clusters, grids) use accelerators:
GPU (Nvidia): architectural legacy
Many Integrated Core (Intel): ad hoc organization
FPGA (Intel, Xilinx): requires competent hardware designers
Thus, each node becomes a hybrid computer.
Too low actual-performance/peak-performance ratio
Application: object detection
Intel i7 CPU: achieves 32% of peak performance (OK)
Intel Xeon Phi: achieves 1.36% of peak performance (?!)
Titan X GPU: achieves 1.05% of peak performance (?!)
Application: scan operation
GeForce 8800 GTX GPU (575 cores) accelerates an Intel mono-core by only 6x (?!)
Parametrizable & configurable programmable generic accelerator for FPGA
FPGA accelerators require quality hardware designers, who are in short supply.
The automated path from high-level code to efficient hardware is questionable, complex, and expensive.
Our proposal: a generic programmable accelerator. The code to be accelerated is written in a high-level language, and the compiled code is used to:
set the parameters
configure the accelerator
Main advantage: the user sees a programmable engine instead of a circuit.
Outcome: a fast and cheap solution with almost the same performance.
For both the ASIC and FPGA versions, our proposal is:
MAP-SCAN ACCELERATOR
HOST → APU: the node in an SMP, cluster, or grid
MAP-SCAN ACCELERATOR: ARM & MEMORY, MAP-SCAN ARRAY, INTERFACE, CONTROLLER
MAP: p cells of memory & engine
SCAN: log-depth network
SCAN is a PREFIX network
PREFIX(x0, …, x(n-1)) = <y0, …, y(n-1)>, for the associative operation ∘, is:
y0 = x0
y1 = x0 ∘ x1
y2 = x0 ∘ x1 ∘ x2
...
y(n-1) = x0 ∘ x1 ∘ … ∘ x(n-1)
REDUCE(x0, …, x(n-1)) = y(n-1)
Size: Sscan(n) ∈ O(n); depth: Dscan(n) ∈ O(log n)
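The PREFIX and REDUCE definitions above can be sketched in a few lines. This is a plain sequential illustration of the semantics, not the log-depth hardware network the slide describes:

```python
# Inclusive scan (PREFIX) and REDUCE for an associative operation `op`.
# Sequential sketch: the hardware computes the same result in O(log n) depth.
def prefix(xs, op):
    """y[i] = x[0] op x[1] op ... op x[i]."""
    ys = []
    acc = None
    for x in xs:
        acc = x if acc is None else op(acc, x)
        ys.append(acc)
    return ys

def reduce_op(xs, op):
    """REDUCE is the last element of the scan."""
    return prefix(xs, op)[-1]

add = lambda a, b: a + b
print(prefix([1, 2, 3, 4], add))     # [1, 3, 6, 10]
print(reduce_op([1, 2, 3, 4], add))  # 10
```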
User's view of the MAP memory resources
Let s(j,i) denote word i in the memory of cell j.
Vertical vectors (one in each memory): Wj = s(j,0), s(j,1), …, s(j,m-1), for j = 0, 1, …, p-1
Horizontal vectors:
V0 = s(0,0), s(1,0), …, s(p-1,0)
V1 = s(0,1), s(1,1), …, s(p-1,1)
…
V(m-1) = s(0,m-1), s(1,m-1), …, s(p-1,m-1)
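The two views can be illustrated with a small 2D array standing in for the p local memories (sizes here are hypothetical toy values; the evaluated design has p = 2048 cells of 4 KB each):

```python
# s[j][i] = word i in the memory of cell j (toy sizes for illustration).
p, m = 4, 3
s = [[f"s{j}{i}" for i in range(m)] for j in range(p)]

def W(j):
    """Vertical vector: the m words of cell j."""
    return s[j]

def V(i):
    """Horizontal vector: word i taken across all p cells."""
    return [s[j][i] for j in range(p)]

print(W(1))  # ['s10', 's11', 's12']
print(V(0))  # ['s00', 's10', 's20', 's30']
```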
Instruction set of our Map-Scan Accelerator
ISA(MapScan) = ISA(CONTROLLER) × ISA(ARRAY)
ISA(CONTROLLER) = (SS(ARITH&LOGIC), SS(CONTROL), SS(COMMUNICATION))
ISA(ARRAY) = (SS(ARITH&LOGIC), SS(SPATIAL_CONTROL), SS(TRANSFER))
The arithmetic & logic subsets are the same; the difference is in control: the controller uses temporal control (if (cond) … else … endif), while the array uses spatial control (where (cond) … elsewhere … endwhere).
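The spatial where/elsewhere construct can be emulated with masks: every cell tests its own datum and the two branch bodies are applied to complementary subsets of cells, instead of the controller picking one branch for all. A minimal sketch of this semantics:

```python
# Spatial control emulated with masks: each "cell" holds one element of v.
v = [3, -1, 4, -5, 9]

mask = [x < 0 for x in v]   # where (v < 0): each cell tests its own datum
# where-body: active cells set their value to 0
# elsewhere-body: the complementary cells double their value
v = [0 if m else x * 2 for x, m in zip(v, mask)]
# endwhere

print(v)  # [6, 0, 8, 0, 18]
```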
Map-Scan code for a Map-Reduce version
[Slide listing, not fully recoverable: example code operating on vectors v[15], v[16], v[17] and a reduction result r]
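The slide's listing is not fully legible here, so the following is only a generic sketch of the pattern it illustrates: a MAP phase applied in all p cells in parallel, followed by a REDUCE through the scan network.

```python
from functools import reduce

def map_reduce(xs, f, op):
    mapped = [f(x) for x in xs]   # MAP: one f(x) per cell, in parallel
    return reduce(op, mapped)     # REDUCE: combined by the log-depth network

# Example: sum of squares of a vector distributed across the cells
r = map_reduce([1, 2, 3, 4], lambda x: x * x, lambda a, b: a + b)
print(r)  # 30
```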
Hardware evaluation for 28 nm simulation
Number of 32-bit cells: 2048
Local memory in each cell: 4 KB
Clock frequency: 1 GHz
Area: (9.2 × 9.2) mm²
Power consumption: see diagram
Peak performance: 2 TOP/s, or 1.4 TFLOP/s (for 20% flop weight)
How to use the Map-Scan Accelerator
MAP-SCAN ARRAY: runs a library kernel (e.g., for the Eigen library) under the control of the CONTROLLER
MAP-SCAN ACCELERATOR: uses the library kernel to develop the library in a high-level language (C++, Python, …) under the control of the ARM (or equivalent)
ACCELERATED PROCESSING UNIT: runs on the HOST the application which uses the accelerated library
Application: k-means clustering
Consider n d-dimensional entities to be clustered into k sets. Algorithm:
1. Assign random "positions" to the k centers and associate the entities randomly.
2. For each d-dimensional entity, compute the Euclidean distance to each center and assign the entity to the nearest one.
3. Compare the current assignment with the previous one; if there are no differences, stop the computation, else continue.
4. Move the k centers to the means of the created groups, and go to step 2.
The degree of parallelism is maximal, p, for steps 2 and 3, while the degree of parallelism for step 4 is p/k. Acceleration is 546x for p = 1024 and k > 10.
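The four steps above can be sketched sequentially; on the accelerator, steps 2–3 run with degree of parallelism p (one entity per cell) and step 4 with p/k, as the slide states. This is a minimal illustrative sketch, not the accelerator code:

```python
import random, math

def kmeans(points, k, seed=0):
    rng = random.Random(seed)
    centers = list(rng.sample(points, k))              # step 1: random centers
    assign = [rng.randrange(k) for _ in points]        # step 1: random assignment
    while True:
        # step 2: nearest center per entity (parallelism p on the array)
        new_assign = [min(range(k), key=lambda c: math.dist(x, centers[c]))
                      for x in points]
        # step 3: compare with previous assignment (parallelism p)
        if new_assign == assign:
            return centers, assign
        assign = new_assign
        # step 4: move centers to group means (parallelism p/k)
        for c in range(k):
            members = [x for x, a in zip(points, assign) if a == c]
            if members:
                centers[c] = tuple(sum(col) / len(members)
                                   for col in zip(*members))

pts = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.2, 4.9)]
centers, assign = kmeans(pts, 2)
print(assign)
```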
I/O Bounded Application: scan computing
Number of scalars: 2^20, stored in memory, with the result written back to memory
GeForce 8800 GTX at 1350 MHz: execution time 1.11 ms
Map-Scan Accelerator at 1000 MHz: execution time [value missing] ms
The architectural acceleration: 11.2x
If only the computation is considered, the architectural acceleration is 38.8x
Architectural efficiency in using the cores: (38.8 × 575/1024)x = 21.8x
Architectural acceleration compared with one core: 130x > p/log2(p) ≈ 100
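The slide's derived figures can be checked with plain arithmetic; note that p/log2(p) for p = 1024 is exactly 102.4, which the slide rounds down to 100:

```python
import math

# Normalize the 38.8x computation-only acceleration by the core-count ratio
# (575 GPU cores vs. p = 1024 accelerator cells).
p, gpu_cores = 1024, 575
comp_acc = 38.8
efficiency = comp_acc * gpu_cores / p
print(round(efficiency, 1))   # 21.8

# The bound the slide compares the 130x single-core acceleration against:
print(p / math.log2(p))       # 102.4
```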
Application: Deep Neural Networks
Both the convolutional layers and the fully connected layers perform matrix-vector multiplication. DNN computation is not I/O bounded.
For M×N matrices with N ≤ p and M ≤ m:
T(M, N) = M log2(p)
where p is the number of cells in MAP and m is the memory size in each cell.
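The T(M, N) = M log2(p) model follows from scheduling each row's dot product as one parallel MAP (an element-wise multiply across the cells) followed by a log2(p)-depth REDUCE. A sketch of that schedule, counting only the dominant reduce depth:

```python
import math

def matvec(A, x, p):
    """Matrix-vector product, counting the modeled step count M * log2(p)."""
    steps = 0
    y = []
    for row in A:                                   # M sequential row passes
        prods = [a * b for a, b in zip(row, x)]     # MAP: 1 parallel step
        y.append(sum(prods))                        # REDUCE: log2(p) steps in hw
        steps += math.ceil(math.log2(p))            # count the reduce depth
    return y, steps

A = [[1, 2], [3, 4]]
y, t = matvec(A, [10, 1], p=1024)
print(y, t)  # [12, 34] 20
```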
Concluding Remarks: preliminary evaluations are encouraging
For all investigated applications, the architectural acceleration is higher than p/log2(p).
Supralinear acceleration is provided by the parallel execution of control, map, and scan.
Unfortunately, the "von Neumann bottleneck" persists.
Thank you. Questions & possible answers.
Pointers: kits/altera/acceleration-card-arria-10-gx.html