Computing on Low Power SoC Architecture

Slides:

Advertisements

Similar presentations

Vectors, SIMD Extensions and GPUs COMP 4611 Tutorial 11 Nov. 26,

Advertisements

Lecture 38: Chapter 7: Multiprocessors Today’s topic –Vector processors –GPUs –An example 1.

Monte-Carlo method and Parallel computing  An introduction to GPU programming Mr. Fang-An Kuo, Dr. Matthew R. Smith NCHC Applied Scientific Computing.

GPU Virtualization Support in Cloud System Ching-Chi Lin Institute of Information Science, Academia Sinica Department of Computer Science and Information.

Intro to GPU’s for Parallel Computing. Goals for Rest of Course Learn how to program massively parallel processors and achieve – high performance – functionality.

This project and the research leading to these results has received funding from the European Community's Seventh Framework Programme.

GPU System Architecture Alan Gray EPCC The University of Edinburgh.

Utilization of GPU’s for General Computing Presenter: Charlene DiMeglio Paper: Aspects of GPU for General Purpose High Performance Computing Suda, Reiji,

Early Linpack Performance Benchmarking on IPE Mole-8.5 Fermi GPU Cluster Xianyi Zhang 1),2) and Yunquan Zhang 1),3) 1) Laboratory of Parallel Software.

HPCC Mid-Morning Break High Performance Computing on a GPU cluster Dirk Colbry, Ph.D. Research Specialist Institute for Cyber Enabled Discovery.

FSOSS Dr. Chris Szalwinski Professor School of Information and Communication Technology Seneca College, Toronto, Canada GPU Research Capabilities.

1 ITCS 6/8010 CUDA Programming, UNC-Charlotte, B. Wilkinson, Jan 19, 2011 Emergence of GPU systems and clusters for general purpose High Performance Computing.

Energy-Efficient Query Processing on Embedded CPU-GPU Architectures Xuntao Cheng, Bingsheng He, Chiew Tong Lau Nanyang Technological University, Singapore.

Real Parallel Computers. Background Information Recent trends in the marketplace of high performance computing Strohmaier, Dongarra, Meuer, Simon Parallel.

1 AppliedMicro X-Gene ® ARM Processors Optimized Scale-Out Solutions for Supercomputing.

Contemporary Languages in Parallel Computing Raymond Hummel.

1 petaFLOPS+ in 10 racks TB2–TL system announcement Rev 1A.

Debunking the 100X GPU vs. CPU Myth: An Evaluation of Throughput Computing on CPU and GPU Presented by: Ahmad Lashgar ECE Department, University of Tehran.

GPGPU overview. Graphics Processing Unit (GPU) GPU is the chip in computer video cards, PS3, Xbox, etc – Designed to realize the 3D graphics pipeline.

Real Parallel Computers. Modular data centers Background Information Recent trends in the marketplace of high performance computing Strohmaier, Dongarra,

Synergy.cs.vt.edu Power and Performance Characterization of Computational Kernels on the GPU Yang Jiao, Heshan Lin, Pavan Balaji (ANL), Wu-chun Feng.

HPCC Mid-Morning Break Dirk Colbry, Ph.D. Research Specialist Institute for Cyber Enabled Discovery Introduction to the new GPU (GFX) cluster.

ARM Cortex-A9 performance in HPC applications Kurt Keville, Clark Della Silva, Merritt Boyd ARM gaining market share in embedded systems and SoCs Current.

Microprocessors SUBTITLE Team 3: David Meadows David Foster Sichao Ni Khareem Gordon.

Lecture 2 : Introduction to Multicore Computing Bong-Soo Sohn Associate Professor School of Computer Science and Engineering Chung-Ang University.

Motivation “Every three minutes a woman is diagnosed with Breast cancer” (American Cancer Society, “Detailed Guide: Breast Cancer,” 2006) Explore the use.

GPU Programming with CUDA – Accelerated Architectures Mike Griffiths

Prof. JunDong Cho VADA Lab. Project.

Training Program on GPU Programming with CUDA 31 st July, 7 th Aug, 14 th Aug 2011 CUDA Teaching UoM.

Chapter 2 Computer Clusters Lecture 2.3 GPU Clusters for Massive Paralelism.

Shared memory systems. What is a shared memory system Single memory space accessible to the programmer Processor communicate through the network to the.

Advisor: Dr. Aamir Shafi Co-Advisor: Mr. Ali Sajjad Member: Dr. Hafiz Farooq Member: Mr. Tahir Azim Optimizing N-body Simulations for Multi-core Compute.

High Performance Computing Processors Felix Noble Mirayma V. Rodriguez Agnes Velez Electric and Computer Engineer Department August 25, 2004.

Taking the Complexity out of Cluster Computing Vendor Update HPC User Forum Arend Dittmer Director Product Management HPC April,

Use of GPUs in ALICE (and elsewhere) Thorsten Kollegger TDOC-PG | CERN |

Multiprocessing. Going Multi-core Helps Energy Efficiency William Holt, HOT Chips 2005 Adapted from UC Berkeley "The Beauty and Joy of Computing"

Carlo del Mundo Department of Electrical and Computer Engineering Ubiquitous Parallelism Are You Equipped To Code For Multi- and Many- Core Platforms?

1)Leverage raw computational power of GPU  Magnitude performance gains possible.

Heterogeneous and Reconfigurable Computing Group Objective: develop technologies to improve computer performance 1 Processor Generation Max. Clock Speed.

Copyright © 2006 by The McGraw-Hill Companies, Inc. All rights reserved. McGraw-Hill Technology Education Copyright © 2006 by The McGraw-Hill Companies,

From Turing Machine to Global Illumination Chun-Fa Chang National Taiwan Normal University.

TI Information – Selective Disclosure Implementation of Linear Algebra Libraries for Embedded Architectures Using BLIS September 28, 2015 Devangi Parikh.

Sparse Matrix-Vector Multiply on the Keystone II Digital Signal Processor Yang Gao, Fan Zhang and Dr. Jason D. Bakos 2014 IEEE High Performance Extreme.

GPGPU introduction. Why is GPU in the picture Seeking exa-scale computing platform Minimize power per operation. – Power is directly correlated to the.

Parallel Computers Today Oak Ridge / Cray Jaguar > 1.75 PFLOPS Two Nvidia 8800 GPUs > 1 TFLOPS Intel 80- core chip > 1 TFLOPS  TFLOPS = floating.

Introduction to Microprocessors

Introducing the Raspberry Pi Nauru ICT Department April 2016.

+ The INFN COSA project Daniele Cesini – INFN-CNAF (On behalf of the COSA collaboration)

+ The INFN COSA project Tommaso Boccali (alcune slides riadattate da Daniele Cesini)

S. Pardi Frascati, 2012 March GPGPU Evaluation – First experiences in Napoli Silvio Pardi.

Heterogeneous Processing KYLE ADAMSKI. Overview What is heterogeneous processing? Why it is necessary Issues with heterogeneity CPU’s vs. GPU’s Heterogeneous.

APE group Many-core platforms and HEP experiments computing XVII SuperB Workshop and Kick-off Meeting Elba, May 29-June 1,

Sobolev(+Node 6, 7) Showcase +K20m GPU Accelerator.

 A few notes on testing  Preliminary tests  Computed Tomography  Astrophysics  Molecular Dynamics  HS COSA f2f - 18/05/ Lucia Morganti – INFN-CNAF.

+ The INFN COSA project Michele Michelotto

Effects of Limiting Numerical Precision on Neural Networks

GPGPU use cases from the MoBrain community

HPC Roadshow Overview of HPC systems and software available within the LinkSCEEM project.

Yang Gao and Dr. Jason D. Bakos

GPU Computing Jan Just Keijser Nikhef Jamboree, Utrecht

Scientific requirements and dimensioning for the MICADO-SCAO RTC

Introduction to Computers

Low Power processors in HEP

Infrastructure for testing accelerators and new

General Programming on Graphical Processing Units

General Programming on Graphical Processing Units

Multicore and GPU Programming

Multicore and GPU Programming

Presentation transcript:

Computing on Low Power SoC Architecture Daniele Cesini – INFN-CNAF Andrea Ferraro – INFN-CNAF Lucia Morganti – INFN-CNAF GDB - 11/02/2015

Outline Modern Low Power Systems on Chip Computing on System on Chip ARM CPU SoC GPU Low Power from Intel Conclusion Daniele Cesini – INFN-CNAF GDB - 11/02/2015

Low-Power System on Chip (SoCs) Daniele Cesini – INFN-CNAF GDB - 11/02/2015

Where do I find a SoC? Mobile Embedded Daniele Cesini – INFN-CNAF GDB - 11/02/2015

Vector vs Micro computing power NEC SX-5 CRAY-1 HITACHI S820/60 NEC SX-ACE INTEL8086 MOS 6510 Pentium Pro INTEL i7 Why did microprocessors take over? They have never been more powerful… ….but they were cheaper, highly available and less power demanding Daniele Cesini – INFN-CNAF GDB - 11/02/2015

Vector vs Micro vs ARM based Is history repeating? Daniele Cesini – INFN-CNAF GDB - 11/02/2015

ARM based processor shipment ARM based processors are shipped in billions of units ARM licences the Intellectual Properties to manufactures …many manufactures …. Samsung (Korea), MediaTek (China), Allwinner (China), Qualcomm (USA), NVIDIA (USA), RockChip (China), Freescale (USA), Texas Instruments (USA), HiSilicon(China), Xilinx (USA), Broadcom(USA), Apple(USA), Altera(USA), ST(EU) , WanderMedia(Taiwan), Marvel(USA), AMD(USA)etc.. Daniele Cesini – INFN-CNAF GDB - 11/02/2015

Ok, but then....an iPhone cluster? NO, we are not thinking to build an iPhone cluster We want to use these processors in a standard computing center configuration Rack mounted Linux powered Running scientific application mostly in a batch environment ..... Use development board... Daniele Cesini – INFN-CNAF GDB - 11/02/2015

ODROID-XU3 Powered by ARM® big.LITTLE™ technology, with a Heterogeneous Multi-Processing (HMP) solution 4 core ARM A15 + 4 cores ARM A7 Exynos 5422 by Samsung ~ 20 GFLOPS peak (32bit) single precision Mali- T628 MP6 GPU ~ 110 GFLOPS peak single precision 2 GB RAM 2xUSB3.0, 2xUSB2.0, 1x107100 eth Ubuntu 14.4 HDMI 1.4 port 64 GB flash storage Power consumption max ~ 15 W Costs 150 euro! Daniele Cesini – INFN-CNAF GDB - 11/02/2015

Other nice boards ...and counting... PandaBoard WandBoard DragonBoard Rock2Board SabreBoard CubieBoard Arndale OCTA Board Texas Instruments EVMK2H ...and counting... http://elinux.org/Development_Platforms Daniele Cesini – INFN-CNAF GDB - 11/02/2015

Some specs TDP between 5W and 15W (EVMK2H > 15W) BOARD GFLOPS SOC GFLOPS (CPU+GPU) Eth Model ARM IP GPU IP DSP IP FREESCALE (Embedded SoC) SABRE Board Freescale i.MX6Q ARM A9(4) Vivante GC2100 (19.2GFlops) 25 1Gb ARNDALE (Mobile SoC) Octa Board Samsung Exynos 5420 A15(4) A7(4) Mali-T628 MP6 (110Gflops) 115 10/100 HARDKERNEL Odroid-XU-E Exynos 5410 Imagination Technologies PowerVR SGX544MP3 (51.1 Gflops) 65 Odroid-XU3 Exynos 5422 A7(4) (HMP) ARM Mali-T628 MP6 (110 Gflops) 130 INTRINSIC DragonBoard Qualcomm Snapdragon 800 Qualcomm Krait(4) Adreno 330 (130Gflops) 145 TI (Embedded SoC) EVMK2H TI Keystone 66AK2H14 A15(2) MS320C66x (189Gflops) 210 1Gb (10Gb) TDP between 5W and 15W (EVMK2H > 15W) Daniele Cesini – INFN-CNAF GDB - 11/02/2015

NVIDIA JETSON K1 First ARM+CUDA programmable 4 cores ARM A15 CPU GPU-accelerated Linux development board! 4 cores ARM A15 CPU 192 cores NVIDIA GPU  300 GFLOPS (peak sp) ...for less than 200 Euros Daniele Cesini – INFN-CNAF GDB - 11/02/2015

CPU GFLOPS/Watt Daniele Cesini – INFN-CNAF 10/12/2014

GPU acceleration in K1 2xE5-2640+1xK40 ~ 3 GFLOPS/W dp ~ 9 GFLOPS/W sp 4 core ARM A15 ~ 18 GFLOPS Kepler SMX1 192 core ~ 300 GFLOPS ~ 15 Watt ~ 21 GFLOPS /W N.B. Single precision – 32 bit architecture ~ 1.5 GFLOPS /€ (0.67 €/GFLOPS) 2xE5-2640+1xK40 ~ 3 GFLOPS/W dp ~ 9 GFLOPS/W sp Daniele Cesini – INFN-CNAF GDB - 11/02/2015

How do you program them? (in a Linux environment) GCC+OpenMP+MPI available for ARM architectures OpenCL for the GPU If you are lucky enough to find working drivers CUDA available only on the Jetson K1 Computing capability 3.2 (vs 3.5) Cross compilation GCC5+OpenMP4 tests ongoing… Daniele Cesini – INFN-CNAF GDB - 11/02/2015

GPU acceleration in scientific computation 2 x (E5-2673v2 (IvyBridge) 8 cores) ~ 2 x100 = 200 GFLOPS (double precision) 2 x 110 Watt = 220 W ~ 1 GFLOPS/W 1xNVIDIA TESLA K40 2880 cores 12 GB RAM ~ 1400 GFLOPS (double precision) ~ 4300 GFLOPS (single precision) 235 Watt ~ 6 GFLOPS / W dp ~ 18 GFLOPS/W sp CPU+GPU ~ 3 GFLOPS/W dp ~ 9 GFLOPS/W sp Daniele Cesini – INFN-CNAF GDB - 11/02/2015

Limitations Commodity SoCs and development boards have a number of limitations: 32 bit Small caches Small RAM size in the boards (O(2GB)) However modern SoCs can address 40bit No ECC memory Frequent failures and system crashes Slow connections (10/100Mb eth) in many cases Ethernet via USB in same boards HW bugs Daniele Cesini – INFN-CNAF GDB - 11/02/2015

Gbit Ethernet While latency was comparable to a server class 1Gb ethernet card (50/75 us) Daniele Cesini – INFN-CNAF GDB - 11/02/2015

OpenMP π computation CPU ONLY x20 Daniele Cesini – INFN-CNAF GDB - 11/02/2015

Prime numbers computation CPU ONLY x 4.6 Daniele Cesini – INFN-CNAF GDB - 11/02/2015

CMS 2014 results CPU ONLY High Energy Physics MonteCarlo simulations David Abdurachmanov et al 2014 J. Phys.: Conf. Ser. 513 052008 doi:10.1088/1742-6596/513/5/052008 ? ARM slower by a factor 3 or 4 but… …ARM better by a factor 3 or 5 on the power ratio Daniele Cesini – INFN-CNAF GDB - 11/02/2015

FFT on CPU and GPU fftw3 on CPUs cuFFT on GPUs Daniele Cesini – INFN-CNAF GDB - 11/02/2015

Molecular Dynamics on Jetson-K1 Parallel application for CPU and GPU: real life use case with GROMACS CPU and GPU Lower is better 1 jetson MPI 1 jetson CUDA 2 jetson MPI Jetson-K1 about 10X slower using the same number of CPU cores Jetson-K1 about 10X slower using the GPU (vs. an NVIDIA Tesla K20) Jetson-K1 13.5Watt Xeon+K20 ~320Watt GDB - 11/02/2015 Daniele Cesini – INFN-CNAF

Filtered Backprojection CPU and GPU On (2xE5-2620+K20): 1956 images analyzed in 1 hour: 350Wh (GPU not fully loaded) On 5xJetson-K1: 2095 images analyzed in 1 hour: 41 Wh Daniele Cesini – INFN-CNAF GDB - 11/02/2015

Lattice Boltzmann on the Tegra K1 GPU only (*) On Tegra-K1 (preliminary) 15 GFLOPS GB/s Pe~ 10 Watt 40x slower than a K20m Porting easier than expected Performance under investigation (*) Schifano et al. ; A portable OpenCL Lattice Boltzmann code for multi- And many-core processor architectures; Procedia Computer Science Volume 29, 2014, Pages 40-49, doi: 10.1016/j.procs.2014.05.004 Daniele Cesini – INFN-CNAF GDB - 11/02/2015

Only ARM based SoCs? And Intel? INTEL produce SoCs Probably you have one in your laptop Some of them are low power Already 64bit Integrated GPU CILK++ programmable OpenCL programmable Daniele Cesini – INFN-CNAF GDB - 11/02/2015

Some low power from Intel Daniele Cesini – INFN-CNAF 10/12/2014

2 cores + GPU 4.5 Watt (TDP) Intel HD Graphics OpenCL 2.0 Support (*) http://www.notebookcheck.net/Intel-Core-M-5Y70-Broadwell-Review.130930.0.html (**) http://www.intel.com/content/www/us/en/processors/core/core-m-processor-family-spec-update.html Daniele Cesini – INFN-CNAF GDB - 11/02/2015

AVOTON on HP Moonshot - HS06 (data from A.Chierici@HEPIX https://indico.cern.ch/event/305362/session/2/contribution/22/material/slides/0.pdf ) Daniele Cesini – INFN-CNAF GDB - 11/02/2015

HS06 HS06 on Exynos5, TegraK1 and Atom C2750 – Per core loaded (data from M.Michelotto@HEPIX 2014 https://indico.cern.ch/event/320819/session/3/contribution/30/material/slides/0.pptx ) Daniele Cesini – INFN-CNAF GDB - 11/02/2015

Test on Intel AVOTON Eth interconnect in both machines N.B. – Comparison with an old Penryn Xeon CPU Daniele Cesini – INFN-CNAF 10/12/2014

Conclusion Mobile and embedded low power System-on-Chip are becoming attractive for scientific computing In particular if you manage to extract power from the GPU For selected applications Image processing No high RAM/RAM bandwidth requirements They still have many limitations for a production environment 32bit, no ECC, bugs, system stability, etc.. (BUT we used development boards - not server grade machines) NVIDIA K1 in our tests was the most powerful ARM based SoC Easy to install and use Easy to port CUDA based applications Intel has interesting low power SoCs Avoton has a high HS06/W ratio Looking forward to development boards based on 64bit SoCs with an ARM CPU on board Daniele Cesini – INFN-CNAF GDB - 11/02/2015

Links and contacts http://www.cosa-project.it http://montblanc-project.eu https://indico.cern.ch/event/320819/session/3/contribution/30/material /slides/0.pptx https://indico.cern.ch/event/305362/session/2/contribution/22/material /slides/0.pdf Daniele Cesini – INFN-CNAF GDB - 11/02/2015