Map-Scan Node Accelerator for Big-Data

Presentation transcript:

Map-Scan Node Accelerator for Big-Data - IEEE BigData 2017, Boston - Mihaela Malița (Saint Anselm College, US), Gheorghe M. Ștefan (Politehnica University of Bucharest, RO)

Content
- Hybrid Computation
- Map-Scan Organization
- Map-Scan Architecture
- Evaluation
- Concluding Remarks
IEEE BigData 2017 - Boston, Dec. 2017

Hybrid Computation
In the current platforms for Big Data (Symmetric Multiprocessors, Clusters, Grids), accelerators are used:
- GPU (Nvidia): architectural legacy
- Many Integrated Core (Intel): ad hoc organization
- FPGA (Intel, Xilinx): requires competent hardware designers
Thus, each node becomes a Hybrid Computer.

Actual performance is too low relative to peak performance
Application: object detection
- Intel i7 CPU: achieves 32% of peak performance (OK)
- Intel Xeon Phi: achieves 1.36% of peak performance (?!)
- Titan X GPU: achieves 1.05% of peak performance (?!)
Application: scan operation
- GeForce 8800 GTX GPU (575 cores) accelerates an Intel single core only 6x (?!)

Parametrizable & configurable programmable generic accelerator for FPGA
- FPGA accelerators require skilled hardware designers, who are in short supply
- The automated path from high-level code to efficient hardware is questionable, complex, and expensive
Our proposal: a generic programmable accelerator; the code to be accelerated is written at a high level, and the compiled code is used to:
- set the parameters
- configure the accelerator
Main advantage: the user sees a programmable engine instead of a circuit.
Outcome: a fast and cheap solution with almost the same performance.

For both ASIC and FPGA versions, our proposal is the MAP-SCAN ACCELERATOR
- HOST → APU: the node in an SMP, Cluster, or Grid
- MAP-SCAN ACCELERATOR: ARM & MEMORY + MAP-SCAN ARRAY (INTERFACE, CONTROLLER)
- MAP: p cells of mem & eng
- SCAN: log-depth network

SCAN is a PREFIX network
PREFIX(x0, …, x(n-1)) = <y0, …, y(n-1)>, for the associative operation ◦, where:
y0 = x0
y1 = x0 ◦ x1
y2 = x0 ◦ x1 ◦ x2
...
y(n-1) = x0 ◦ x1 ◦ … ◦ x(n-1)
REDUCE(x0, …, x(n-1)) = y(n-1)
Size: Sscan(n) ∈ O(n); depth: Dscan(n) ∈ O(log n)
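The PREFIX/REDUCE semantics above can be sketched sequentially; this minimal Python illustration (ours, not the accelerator's code) computes the same result that the SCAN network produces in O(log n) depth:

```python
import operator

def prefix(xs, op):
    """PREFIX: return <y0, ..., y_{n-1}> with y_i = x0 op x1 op ... op xi."""
    ys, acc = [], None
    for x in xs:
        acc = x if acc is None else op(acc, x)
        ys.append(acc)
    return ys

def reduce_(xs, op):
    """REDUCE is simply the last element of the PREFIX result."""
    return prefix(xs, op)[-1]

print(prefix([1, 2, 3, 4], operator.add))   # [1, 3, 6, 10]
print(reduce_([1, 2, 3, 4], operator.add))  # 10
```

Any associative operation ◦ (addition, multiplication, min, max, …) can be plugged in as `op`.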

User's view of the MAP memory resources
Vertical vectors (one in each cell's memory): Wj = s0j, s1j, …, s(m-1)j, for j = 0, 1, …, p-1
Horizontal vectors:
V0 = s00, s01, …, s0(p-1)
V1 = s10, s11, …, s1(p-1)
…
V(m-1) = s(m-1)0, s(m-1)1, …, s(m-1)(p-1)
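The two views can be pictured as an m × p array in which s_ij is word i of cell j; a small illustrative Python sketch (sizes and names are ours, chosen tiny for readability):

```python
# p cells, each with m local words; mem[j][i] models word i of cell j (s_ij).
p, m = 4, 3
mem = [[f"s{i}{j}" for i in range(m)] for j in range(p)]

def W(j):
    """Vertical vector: the local memory of cell j."""
    return mem[j]

def V(i):
    """Horizontal vector: word i taken across all p cells."""
    return [mem[j][i] for j in range(p)]

print(W(0))  # ['s00', 's10', 's20']
print(V(1))  # ['s10', 's11', 's12', 's13']
```

Horizontal vectors are the natural operands for the MAP engines (one element per cell), while vertical vectors live entirely inside one cell's memory.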

Instruction set of our Map-Scan Accelerator
ISAMapScan = ISACONTROLLER × ISAARRAY
ISACONTROLLER = SSARITH&LOGIC ∪ SSCONTROL ∪ SSCOMMUNICATION
ISAARRAY = SSARITH&LOGIC ∪ SSSPATIAL_CONTROL ∪ SSTRANSFER
The arithmetic & logic subsets are the same; the difference is temporal control (CONTROLLER) versus spatial control (ARRAY), where:
where (cond) ↔ if (cond)
elsewhere ↔ else
endwhere ↔ endif
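As a hypothetical illustration (ours, not the accelerator's actual ISA), the spatial `where/elsewhere/endwhere` construct can be modeled as a masked operation applied across the p cells: every cell evaluates the condition on its own data, and only the selected cells execute each branch.

```python
# One accumulator value per cell; `where` applies then_op in cells whose
# predicate holds and else_op (the `elsewhere` branch) in the others.
acc = [3, -1, 4, -5]

def where(cond, then_op, else_op=None):
    mask = [cond(a) for a in acc]           # each cell tests its own data
    for j, active in enumerate(mask):
        if active:
            acc[j] = then_op(acc[j])
        elif else_op is not None:
            acc[j] = else_op(acc[j])

# where (acc < 0) acc = -acc  elsewhere acc = 2*acc  endwhere
where(lambda a: a < 0, lambda a: -a, lambda a: 2 * a)
print(acc)  # [6, 1, 8, 5]
```

This is the spatial counterpart of temporal `if/else`: all cells traverse the same instruction stream, but the mask decides which cells act.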

Map-Scan code for a Map-Reduce version
(Code listing shown on slide; the resulting vector contents include:)
v[15] = 15 16 … 30 31
v[16] = 3 3 … 3 3
v[17] = r r … r r

Hardware evaluation for a 28 nm simulation:
- Number of 32-bit cells: 2048
- Local memory in each cell: 4 KB
- Clock frequency: 1 GHz
- Area: (9.2 × 9.2) mm²
- Power consumption: see diagram
- Peak performance: 2 TOP/s, or 1.4 TFLOP/s (for 20% flop weight)
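The peak-performance figure follows directly from the cell count and clock frequency, assuming one 32-bit operation per cell per cycle (an assumption on our part; the 1.4 TFLOP/s mixed figure additionally depends on per-flop cycle costs not detailed on this slide):

```python
# Arithmetic behind the ~2 TOP/s peak figure.
cells = 2048
freq_hz = 1e9                    # 1 GHz, one 32-bit op per cell per cycle
peak_ops = cells * freq_hz
print(f"{peak_ops / 1e12:.3f} TOP/s")  # 2.048 TOP/s, i.e. ~2 TOP/s
```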

How to use the Map-Scan Accelerator
- MAP-SCAN ARRAY: runs a library kernel (e.g., for the Eigen library) under the control of the CONTROLLER
- MAP-SCAN ACCELERATOR: uses the library kernel to develop the library, using a high-level language (C++, Python, …), under the control of the ARM (or equivalent)
- ACCELERATED PROCESSING UNIT: runs on the HOST the application which uses the accelerated library

Application: k-means clustering
Consider n d-dimensional entities to be clustered into k sets. Algorithm:
1. Randomly assign "positions" to the k centers and randomly associate the entities to them
2. Compute, for each d-dimensional entity, the Euclidean distance to each center, and assign each entity to the nearest center
3. Compare the current assignment with the previous one; if there are no differences, stop the computation, else continue
4. Move the k centers to the means of the created groups, and go to step 2
The degree of parallelism is maximal, p, for steps 2 and 3, while for step 4 it is p/k. Acceleration is 546x for p = 1024 and k > 10.
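The four steps above can be sketched sequentially in Python (this is an illustrative helper of ours, not the accelerated code; on the array, steps 2-3 run with degree of parallelism p and step 4 with p/k):

```python
import math
import random

def kmeans(points, k, dims):
    # Step 1: random initial centers; assignment starts empty
    centers = [random.choice(points) for _ in range(k)]
    assign = None
    while True:
        # Step 2: Euclidean distance to each center; assign to the nearest
        new_assign = [
            min(range(k), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        # Step 3: stop when the assignment no longer changes
        if new_assign == assign:
            return centers, assign
        assign = new_assign
        # Step 4: move each center to the mean of its group
        for c in range(k):
            group = [p for p, a in zip(points, assign) if a == c]
            if group:
                centers[c] = tuple(
                    sum(x[d] for x in group) / len(group) for d in range(dims)
                )

random.seed(0)
pts = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
centers, assign = kmeans(pts, 2, 2)
print(assign)  # the two nearby pairs end up in the same cluster
```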

I/O-bounded application: scan computing
Number of scalars: 2^20, stored in memory, with the result written back to memory
- GeForce 8800 GTX at 1350 MHz: execution time 1.11 ms
- Map-Scan Accelerator at 1000 MHz: execution time 0.133 ms
- The architectural acceleration: 11.2x
- If only the computation is considered, the architectural acceleration is 38.8x
- Architectural efficiency in using the cores: (38.8 × 575/1024)x = 21.8x
- Architectural acceleration compared with one core: 130x > p/log p = 100

Application: Deep Neural Networks
On both convolutional layers and fully connected layers, matrix-vector multiplication is performed. DNN computation is not I/O bounded.
For M × N matrices with N ≤ p and M ≤ m:
T(M, N) = M + 2 + log2(p)
where p is the number of cells in MAP and m is the memory size in each cell.
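A quick check of the timing formula, assuming p = 2048 cells and (our reading of the 4 KB local memory) m = 1024 32-bit words per cell:

```python
import math

def t_matvec(M, N, p=2048, m=1024):
    """Cycles for an MxN matrix-vector multiply: T(M, N) = M + 2 + log2(p),
    valid when the matrix fits the array (N <= p columns, M <= m rows)."""
    assert N <= p and M <= m, "matrix must fit the array"
    return M + 2 + int(math.log2(p))

# A 512x512 matrix on the 2048-cell array: 512 + 2 + 11 = 525 cycles
print(t_matvec(512, 512))  # 525
```

The log2(p) term is the depth of the SCAN network used for the final reductions; the M term reflects streaming one matrix row per cycle.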

Concluding Remarks:
- Preliminary evaluations are encouraging
- For all investigated applications, the architectural acceleration is higher than p/log2(p)
- Supralinear acceleration is provided by the parallel execution of control, map, and scan
- Unfortunately, the "von Neumann bottleneck" persists.

Thank you. Questions & possible answers


Pointers: https://www.altera.com/products/boards_and_kits/dev-kits/altera/acceleration-card-arria-10-gx.html