+ The INFN COSA project Daniele Cesini – INFN-CNAF (On behalf of the COSA collaboration)

Slides:

Advertisements

Similar presentations

4. Workload directed adaptive SMP multicores

Advertisements

Lecture 38: Chapter 7: Multiprocessors Today’s topic –Vector processors –GPUs –An example 1.

Monte-Carlo method and Parallel computing  An introduction to GPU programming Mr. Fang-An Kuo, Dr. Matthew R. Smith NCHC Applied Scientific Computing.

GPU Virtualization Support in Cloud System Ching-Chi Lin Institute of Information Science, Academia Sinica Department of Computer Science and Information.

This project and the research leading to these results has received funding from the European Community's Seventh Framework Programme.

GPU System Architecture Alan Gray EPCC The University of Edinburgh.

Early Linpack Performance Benchmarking on IPE Mole-8.5 Fermi GPU Cluster Xianyi Zhang 1),2) and Yunquan Zhang 1),3) 1) Laboratory of Parallel Software.

HPCC Mid-Morning Break High Performance Computing on a GPU cluster Dirk Colbry, Ph.D. Research Specialist Institute for Cyber Enabled Discovery.

FSOSS Dr. Chris Szalwinski Professor School of Information and Communication Technology Seneca College, Toronto, Canada GPU Research Capabilities.

Thin Servers with Smart Pipes: Designing SoC Accelerators for Memcached Bohua Kou Jing gao.

Some Thoughts on Technology and Strategies for Petaflops.

Server Platforms Week 11- Lecture 1. Server Market $ 46,100,000,000 ($ 46.1 Billion) Gartner.

1 ITCS 6/8010 CUDA Programming, UNC-Charlotte, B. Wilkinson, Jan 19, 2011 Emergence of GPU systems and clusters for general purpose High Performance Computing.

Computing on Low Power SoC Architecture

Contemporary Languages in Parallel Computing Raymond Hummel.

1 petaFLOPS+ in 10 racks TB2–TL system announcement Rev 1A.

CPP Staff - 30 CPP Staff - 30 FCIPT Staff - 35 IPR Staff IPR Staff ITER-India Staff ITER-India Staff Research Areas: 1.Studies.

GPGPU platforms GP - General Purpose computation using GPU

Real Parallel Computers. Modular data centers Background Information Recent trends in the marketplace of high performance computing Strohmaier, Dongarra,

2nd Workshop on Energy for Sustainable Science at Research Infrastructures Report on parallel session A3 Wayne Salter on behalf of Dr. Mike Ashworth (STFC)

HPCC Mid-Morning Break Dirk Colbry, Ph.D. Research Specialist Institute for Cyber Enabled Discovery Introduction to the new GPU (GFX) cluster.

ARM Cortex-A9 performance in HPC applications Kurt Keville, Clark Della Silva, Merritt Boyd ARM gaining market share in embedded systems and SoCs Current.

Microprocessors SUBTITLE Team 3: David Meadows David Foster Sichao Ni Khareem Gordon.

GPU Programming with CUDA – Accelerated Architectures Mike Griffiths

Performance and Energy Efficiency of GPUs and FPGAs

Prof. JunDong Cho VADA Lab. Project.

Chapter 2 Computer Clusters Lecture 2.3 GPU Clusters for Massive Paralelism.

Computer Graphics Graphics Hardware

Engineering & Instrumentation Department, ESDG, Rob Halsall, 24th February 2005CFI/Confidential CFI - Opto DAQ - Status 24th February 2005.

Advisor: Dr. Aamir Shafi Co-Advisor: Mr. Ali Sajjad Member: Dr. Hafiz Farooq Member: Mr. Tahir Azim Optimizing N-body Simulations for Multi-core Compute.

3. April 2006Bernd Panzer-Steindel, CERN/IT1 HEPIX 2006 CPU technology session some ‘random walk’

Use of GPUs in ALICE (and elsewhere) Thorsten Kollegger TDOC-PG | CERN |

Accelerating image recognition on mobile devices using GPGPU

Tracking with CACTuS on Jetson Running a Bayesian multi object tracker on a low power, embedded system School of Information Technology & Mathematical.

Compiler and Runtime Support for Enabling Generalized Reduction Computations on Heterogeneous Parallel Configurations Vignesh Ravi, Wenjing Ma, David Chiu.

1)Leverage raw computational power of GPU  Magnitude performance gains possible.

Copyright © 2006 by The McGraw-Hill Companies, Inc. All rights reserved. McGraw-Hill Technology Education Copyright © 2006 by The McGraw-Hill Companies,

Why it might be interesting to look at ARM Ben Couturier, Vijay Kartik Niko Neufeld, PH-LBC SFT Technical Group Meeting 08/10/2012.

TI Information – Selective Disclosure Implementation of Linear Algebra Libraries for Embedded Architectures Using BLIS September 28, 2015 Devangi Parikh.

Using Uncacheable Memory to Improve Unity Linux Performance

Introducing the Raspberry Pi Nauru ICT Department April 2016.

+ The INFN COSA project Tommaso Boccali (alcune slides riadattate da Daniele Cesini)

S. Pardi Frascati, 2012 March GPGPU Evaluation – First experiences in Napoli Silvio Pardi.

Hardware Architecture

Multicore Applications in Physics and Biochemical Research Hristo Iliev Faculty of Physics Sofia University “St. Kliment Ohridski” 3 rd Balkan Conference.

Heterogeneous Processing KYLE ADAMSKI. Overview What is heterogeneous processing? Why it is necessary Issues with heterogeneity CPU’s vs. GPU’s Heterogeneous.

APE group Many-core platforms and HEP experiments computing XVII SuperB Workshop and Kick-off Meeting Elba, May 29-June 1,

 A few notes on testing  Preliminary tests  Computed Tomography  Astrophysics  Molecular Dynamics  HS COSA f2f - 18/05/ Lucia Morganti – INFN-CNAF.

+ The INFN COSA project Michele Michelotto

Computer Graphics Graphics Hardware

M. Bellato INFN Padova and U. Marconi INFN Bologna

NFV Compute Acceleration APIs and Evaluation

HPC Roadshow Overview of HPC systems and software available within the LinkSCEEM project.

Yang Gao and Dr. Jason D. Bakos

GPU Computing Jan Just Keijser Nikhef Jamboree, Utrecht

LCG 3D Distributed Deployment of Databases

Scientific requirements and dimensioning for the MICADO-SCAO RTC

Heterogeneous Computation Team HybriLIT

Introduction to Computers

Low Power processors in HEP

Stato progetto COSA

Infrastructure for testing accelerators and new

General Programming on Graphical Processing Units

General Programming on Graphical Processing Units

Computer Graphics Graphics Hardware

Computer Evolution and Performance

Introduction to Single Board Computer

Multicore and GPU Programming

Multicore and GPU Programming

Presentation transcript:

+ The INFN COSA project Daniele Cesini – INFN-CNAF (On behalf of the COSA collaboration)

+ INFN-CSN5 COSA project COSA: Computing On SOC Architecture Duration: 2 years from January 2015 Departments: 7 INFN CNAF, PI, PD, ROMA1, FE, PR, LNL BUDGET :51.5 kEuro Year I CCR - 27/05/2015Daniele Cesini – INFN-CNAF 2

+ CCR - 27/05/2015Daniele Cesini – INFN-CNAF 3 Objectives Acquire know-how Porting e benchmarking of low power System on Chip Operations of Linux system on SoCs Benchmarking hybrid architectures Unification of INFN HW testing activities Continuation of the COKA project Porting on traditional accelerator (GPU/MIC) Continuation of the HEPMARK projects X86 benchmarking Study of dedicated low latency interconnection with ARM+FPGA devices Prepare H2020 proposal on LowPower computing calls

+ Low-Power System on Chip (SoCs) Daniele Cesini – INFN-CNAF 4 CCR - 27/05/2015

+ Where do I find a SoC? Mobile Embedded Daniele Cesini – INFN-CNAF 5 CCR - 27/05/2015

+ Vector vs Micro computing power Daniele Cesini – INFN-CNAF 6 NEC SX-5 CRAY-1 HITACHI S820/60 NEC SX-ACE INTEL8086MOS 6510 Pentium Pro INTEL i7 Why did microprocessors take over? They have never been more powerful… ….but they were cheaper, highly available and less power demanding CCR - 27/05/2015

+ Vector vs Micro vs ARM based Daniele Cesini – INFN-CNAF 7 Is history repeating? CCR - 27/05/2015

+ Ok, but then....an iPhone cluster? NO, we are not thinking to build an iPhone cluster We want to use these processors in a standard computing center configuration Rack mounted Linux powered Running scientific application mostly in a batch environment..... Use development board... Daniele Cesini – INFN-CNAF 8 CCR - 27/05/2015

+ Powered by ARM® big.LITTLE™ technology, with a Heterogeneous Multi-Processing (HMP) solution 4 core ARM A cores ARM A7 Exynos 5422 by Samsung ~ 20 GFLOPS peak (32bit) single precision Mali- T628 MP6 GPU ~ 110 GFLOPS peak single precision 2 GB RAM 2xUSB3.0, 2xUSB2.0, 1x eth Ubuntu 14.4 HDMI 1.4 port 64 GB flash storage ODROID-XU3 Daniele Cesini – INFN-CNAF 9 Power consumption max ~ 15 W Costs 150 euro! CCR - 27/05/2015

+ Texas Instruments EVMK2H DragonBoard SabreBoard PandaBoard Other nice boards...and counting... Daniele Cesini – INFN-CNAF 10 WandBoard Rock2Board CubieBoard Arndale OCTA Board CCR - 27/05/2015

+ Some specs Daniele Cesini – INFN-CNAF 11 BOARD SOC GFLOPS (CPU+GPU) Eth ModelARM IPGPU IPDSP IP FREESCALE (Embedded SoC) SABRE Board Freescale i.MX6Q ARM A9(4) Vivante GC2100 (19.2GFlops) 251Gb ARNDALE (Mobile SoC) Octa Board Samsung Exynos 5420 ARM A15(4) A7(4) ARM Mali-T628 MP6 (110Gflops) 11510/100 HARDKERNEL (Mobile SoC) Odroid-XU-E Samsung Exynos 5410 ARM A15(4) A7(4) Imagination Technologies PowerVR SGX544MP3 (51.1 Gflops) 6510/100 HARDKERNEL (Mobile SoC) Odroid-XU3 Samsung Exynos 5422 ARM A15(4) A7(4) (HMP) ARM Mali-T628 MP6 (110 Gflops) 13010/100 INTRINSIC (Mobile SoC) DragonBoard Qualcomm Snapdragon 800 Qualcomm Krait(4) Qualcomm Adreno 330 (130Gflops) 1451Gb TI (Embedded SoC) EVMK2H TI Keystone 66AK2H14 ARM A15(2) TI MS320C66x (189Gflops) 210 1Gb (10Gb) TDP between 5W and 15W (EVMK2H > 15W) CCR - 27/05/2015

+ NVIDIA JETSON TK1 First ARM+CUDA programmable SoC based Linux development board 4 cores ARM A15 CPU 192 cores NVIDIA GPU  300 GFLOPS (peak sp) ~ 21 GFLOPS/W (sp... for less than 200 Euros 32bit 64bit version announced 12 CCR - 27/05/2015Daniele Cesini – INFN-CNAF

+ How do you program them? (in a Linux environment) GCC+OpenMP+MPI available for ARM architectures OpenCL for the GPU If you are lucky enough to find working drivers CUDA available only on the Jetson K1 Computing capability 3.2 (vs 3.5) Cross compilation GCC5+OpenMP4 tests ongoing… Daniele Cesini – INFN-CNAF 13 CCR - 27/05/2015

+ Limitations of the tested boards Commodity SoCs and development boards have a number of limitations: 32 bit Small caches Small RAM size in the boards (O(2GB)) However modern SoCs can address 40bit No ECC memory Frequent failures and system crashes Slow connections (10/100Mb eth) in many cases Ethernet via USB in same boards HW bugs Daniele Cesini – INFN-CNAF 14 CCR - 27/05/2015

+ Daniele Cesini – INFN-CNAF 15 COSA Clusters CNAF ROMA1 PD 4 board ARM+FPGA based + 1 server 16 board ARM+FPGA based + 4 server ex cluster COKA (2 server) + nuove acquisizioni (1 server) (server = cpu + acceleratori) Anno I + ~10 nuove board ~25 board SoC Based (GFLOPS nominali di 2 server tradizionali con GPU) Anno I Anno II Anno I Anno II Ex cluster HEPMARK + nuove acquisizioni (~ 3 server) + nuove acquisizione (~ 3 server) Anno I Anno II

+ Activities & WPs WP1: Management WP2: Technology Tracking and Benchmarking WP3: Implementation of the CNAF prototype WP4: Development of dedicated network connections WP5: Application Porting WP6: Technology Transfer and Dissemination CCR - 27/05/2015Daniele Cesini – INFN-CNAF 16

+ Project Timeline CCR - 27/05/2015 Daniele Cesini – INFN-CNAF 17 Benchmarking Tests of new boards Benchmarks with real life applications and synthetic tests Tests of first FPGAs for dedicated low latency network Selection of the final architecture for the clusters Cluster implementation Cluster setup at CNAF (25 boards) Cluster Setup in ROME (4 FPGA bords) Application Porting Tests under real life condition Clusters upgrade +12 ROME) Jan 2015 Sept 2015 Jan 2016 Dec 2016 Dissemination and H2020 Preparation

+ CNAF COSA cluster network MASTER (lowpower x86-64) ARM /24 eth1 eth0 CNAF ssh /23 CCR - 27/05/2015Daniele Cesini – INFN-CNAF 18 DHCP, TFT, NFS, LDAP, NAT, etc ……

+ COSA power network Card 12V 5V 230V AC Power probe LabVIEW DC probe Card CCR - 27/05/2015Daniele Cesini – INFN-CNAF 19 ……

+ OPENMP π computation on Jetson CCR - 27/05/2015Daniele Cesini – INFN-CNAF 20 #TIME (s) POWER (W) ENERGY (J) 126,14,6120,1 213,16,281,2 38,7 75,7 46,5 42,2 CURRENT (A) TIME

+ CCR - 27/05/2015Daniele Cesini – INFN-CNAF 21

+ PSU&Cables PSU HX1000i 12 linee 12V (Jetson) 6 linee 5V Cavi GRIDSEED Da 1 MOLEX a 3 BARREL CCR - 27/05/2015Daniele Cesini – INFN-CNAF 22

+ CCR - 27/05/2015Daniele Cesini – INFN-CNAF 23 Applications Theoretical Physics (PR, FE) Parallel applications usually run in HPC environments Lattice Boltzmann fluid dynamics Monte Carlo simulations of Spin-Glass systems Lattice Quantum ChromoDynamics simulations Experimental Physics (PI, PD, CNAF) HEP experiments High Level Trigger applications Montecarlo and analysis of LHC experiments Applications needing portable systems Computer tomography Neural Networks (RM1) DPSNN-STDP code

+ Filtered Backprojection (1/2) CCR - 27/05/ x 1024 pixel 24 Daniele Cesini – INFN-CNAF CPU &GPU Lower is better (*) in collaboration with the physics department of the Bologna University and CHNET Master thesis of Elena Corni: Implementazione dell’algoritmo Filtered Back- Projection (FBP) per architetture Low-Power di tipo Systems-On-Chip

+ CCR - 27/05/ x 1024 pixel 25 Daniele Cesini – INFN-CNAF Lower is better On (2xE K20): 3241 slices in 1 h: 350 Wh On 5 x Jetson-K1: 3840 slices in 1 h: 41 Wh Filtered Backprojection (2/2)

+ Molecular Dynamics on Jetson-K1 Daniele Cesini – INFN-CNAF 26 Jetson-K1 about 10X slower using the same number of CPU cores Jetson-K1 about 10X slower using the GPU (vs. an NVIDIA Tesla K20) Jetson-K1 13.5W Xeon+K20 ~320W Parallel application for CPU and GPU: real life use case with GROMACS Lower is better 2 jetson MPI 1 jetson MPI 1 jetson CUDA CCR - 27/05/2015 CPU&GPU

+ Formation of a galaxy cluster with Gadget2 CCR - 27/05/ Daniele Cesini – INFN-CNAF MULTI BOARD CPU ONLY particles, formation of a cluster of galaxies in an expanding Universe

+ Formation of a galaxy cluster CCR - 27/05/ Daniele Cesini – INFN-CNAF

+ Neural Networks: DPSNN-STDP CCR - 27/05/2015Daniele Cesini – INFN-CNAF 29

+ CCR - 27/05/2015Daniele Cesini – INFN-CNAF 30 See Enrico Calore talk

+ HS06 Daniele Cesini – INFN-CNAF 31 HS06 on Exynos5, TegraK1 and Atom C2750 – Per core loaded (data from ) CCR - 27/05/2015

+ Tests from CMS CCR - 27/05/2015Daniele Cesini – INFN-CNAF 32 More on Pantaleo talk…

+ Low latency connection through FPGAs FPGA SoCs are hybrid components that integrate a programmable hw component and a multicore low power ARM CPU Two main reasons to test them in COSA Acceleration of computational task executed by microP embedded in the FPGA taking advantage of the RDMA capabilities of the APEnet architecture APEnet v5, NaNet, PICOLO systems Computational tasks executed on the ARM CPU taking advantage of the integrated low latency connection capabilities in the Study of new architectures, i.e ExaNeSt CCR - 27/05/2015Daniele Cesini – INFN-CNAF 33 More on Lonardo talk… More on Lonardo talk…

+ CCR - 27/05/2015Daniele Cesini – INFN-CNAF 34

+ Only ARM based SoCs? And Intel? INTEL produce SoCs Some of them are low power Already 64bit Integrated GPU CILK++ programmable OpenCL programmable Daniele Cesini – INFN-CNAF 35 CCR - 27/05/2015

+ Some low power from Intel CCR - 27/05/2015Daniele Cesini – INFN-CNAF 36 XEON D CORE M AVOTON ATOM - BayTrail i7 Mobile

+ 2 cores + GPU Intel HD Graphics OpenCL 2.0 Support 4.5 Watt (TDP) Daniele Cesini – INFN-CNAF 37 (*) Broadwell-Review html (**) ore-m-processor-family-spec-update.html CCR - 27/05/2015

+ AVOTON on HP Moonshot - HS06 Daniele Cesini – INFN-CNAF 38 (data from ) CCR - 27/05/2015

+ Conclusion COSA is showing that mobile and embedded low power System-on- Chip are becoming attractive for scientific computing In particular if you manage to extract power from the GPU But hey still have many limitations for a production environment We are continuing to acquire know-how on hybrid architectures ( both low power and traditional systems) Operations Performance Measurement Application porting Programming models It was decided to wait for the next generation 64bit SoCs before expanding the low power clusters COSA participated in the creation of the H2020 proposal for PICOLO LEIT-ICT call CCR - 27/05/2015Daniele Cesini – INFN-CNAF 39

+ Links and contacts Talk ( ribution/57/material/slides/0.pdf ) ribution/57/material/slides/0.pdf talk ( material/slides/0.pdf) material/slides/0.pdf COSA Indico ( 779 ) 779 COSA mailing list: cosa-project__at__lists.infn.it Daniele Cesini – INFN-CNAF 40 CCR - 27/05/2015

+ Backup CCR - 27/05/2015Daniele Cesini – INFN-CNAF 41

+ ARM based processor shipment ARM based processors are shipped in billions of units ARM licences the Intellectual Properties to manufactures …many manufactures …. Samsung (Korea), MediaTek (China), Allwinner (China), Qualcomm (USA), NVIDIA (USA), RockChip (China), Freescale (USA), Texas Instruments (USA), HiSilicon(China), Xilinx (USA), Broadcom(USA), Apple(USA), Altera(USA), ST(EU), WanderMedia(Taiwan), Marvel(USA), AMD(USA)etc.. Daniele Cesini – INFN-CNAF 42 CCR - 27/05/2015

+ CPU GFLOPS/Watt CCR - 27/05/2015Daniele Cesini – INFN-CNAF 43

+ Gbit Ethernet Daniele Cesini – INFN-CNAF 44 While latency was comparable to a server class 1Gb ethernet card (50/75 us) CCR - 27/05/2015

+ Real use-case: evolving the Universe Jetson-K1 about 2.6X slower than Xeon using 4 CPU cores, and about 4.4X slower using 8 CPU cores Jetson-K W Xeon ~220 W CCR - 27/05/ particles in a cube of 250 Mpc side, 13.6 Gyr 45 Daniele Cesini – INFN-CNAF MULTI BOARD CPU ONLY

+ Real use-case: evolving the Universe CCR - 27/05/ Daniele Cesini – INFN-CNAF

+ Test on Intel AVOTON CCR - 27/05/2015Daniele Cesini – INFN-CNAF 47 N.B. – Comparison with an old Penryn Xeon CPU Eth interconnect in both machines

+ GPU acceleration in K1 4 core ARM A15 ~ 18 GFLOPS Kepler SMX1 192 core ~ 300 GFLOPS (sp) ~ 15 W ~ 21 GFLOPS/W (sp) 2 x (E5-2673v2 (IvyBridge) 8 core) ~ 200 GFLOPS (dp) 220 W NVIDIA TESLA K core ~ 1400 GFLOPS (dp) ~ 4300 GFLOPS (sp) 235 W CPU+GPU ~ 3 GFLOPS/W (dp) ~ 9 GFLOPS/W (sp) 48 CCR - 27/05/2015Daniele Cesini – INFN-CNAF