CINECA: the Italian HPC infrastructure and its evolution in the European scenario
Giovanni Erbacci, Supercomputing, Applications and Innovation Department, CINECA, Italy

Agenda
- CINECA: the Italian HPC Infrastructure
- CINECA and the European HPC Infrastructure
- Evolution: Parallel Programming Trends in Extremely Scalable Architectures


CINECA
CINECA is a non-profit consortium made up of 50 Italian universities, the National Institute of Oceanography and Experimental Geophysics (OGS), the CNR (National Research Council), and the Ministry of Education, University and Research (MIUR). CINECA is the largest Italian computing centre and one of the most important worldwide. The HPC department:
- manages the HPC infrastructure,
- provides support and HPC resources to Italian and European researchers,
- promotes technology transfer initiatives for industry.

The Story
- 1969: CDC 6600, 1st system for scientific computing
- 1975: CDC 7600, 1st supercomputer
- 1985: Cray X-MP/48, 1st vector supercomputer
- 1989: Cray Y-MP
- 1993: Cray C-90
- 1994: Cray T3D 64, 1st parallel supercomputer
- 1995: Cray T3D
- 1998: Cray T3E, 1st MPP supercomputer
- 2002: IBM SP4
- 2005: IBM SP5
- 2006: IBM BCX, 10 Teraflops
- 2009: IBM SP6, 100 Teraflops
- 2012: IBM BG/Q, > 1 Petaflops

CINECA and Top 500

Development trend. (Chart: CINECA peak performance growth from 1 GFlops to more than 2 PFlops.)

HPC Infrastructure for Scientific Computing
SP6 (Sep 2009): IBM P575, SMP architecture; IBM Power6 4.7 GHz processors; 5376 cores; 168 nodes; 12 racks; 20 TB total RAM; Qlogic Infiniband DDR 4x interconnect; AIX; ~800 kW total power; peak performance > 101 TFlops.
BGP (Jan 2010): IBM BG/P, MPP architecture; IBM PowerPC 0.85 GHz processors; 4096 cores; 1024 nodes; 1 rack; 2 TB total RAM; IBM 3D Torus interconnect; SUSE Linux; ~80 kW total power; peak performance ~14 TFlops.
PLX (2011): IBM iDataPlex, Linux cluster; Intel Westmere 6-core 2.4 GHz processors plus NVIDIA Fermi M2070 GPGPUs; 3288 cores; 274 nodes; 10 racks; ~13 TB total RAM; Qlogic QDR 4x interconnect; RedHat Linux; ~200 kW total power; peak performance ~300 TFlops.

Visualisation system
Visualisation and computer graphics: Virtual Theater with 6 BARCO SIM5 video-projectors, surround audio system, cylindrical screen 9.4 x 2.7 m (120° angle), workstations with NVIDIA cards, RVN nodes on the PLX system.

Storage Infrastructure
System | Available bandwidth (GB/s) | Space (TB) | Connection technology | Disk technology
2 x S2A9500 | 3.2 | 140 | FCP 4 Gb/s | FC
4 x S2A9500 | 3.2 | 140 | FCP 4 Gb/s | FC
6 x DCS9900 | 5.0 | 540 | FCP 8 Gb/s | SATA
4 x DCS9900 | 5.0 | 720 | FCP 4 Gb/s | SATA
3 x DCS9900 | 5.0 | 1500 | FCP 4 Gb/s | SATA
Hitachi DS | 3.2 | 360 | FCP 4 Gb/s | SATA
3 x SFA10000 | 10.0 | 2200 | QDR | SATA
1 x IBM 5100 | 3.2 | 66 | FCP 8 Gb/s | FC
Total space: > 5.6 PB

SP6 @ CINECA
- 168 compute nodes IBM p575, Power6 (4.7 GHz)
- 5376 compute cores (32 cores/node)
- 128 GB RAM per node (21 TB RAM in total)
- Infiniband x4 DDR (double data rate) interconnect
- Peak performance 101 TFlops; N. 116 in the Top500 (June 2011)
- 2 login nodes, plus I/O and service nodes
- 1.2 PB raw storage: 500 TB high-performance working area and 700 TB data repository

CINECA BlueGene/P
Model: IBM BlueGene/P
Architecture: MPP
Processor type: IBM PowerPC 0.85 GHz
Compute nodes: 1024 (quad-core, 4096 cores in total)
RAM: 4 GB per compute node (4096 GB in total)
Internal network: IBM 3D Torus
OS: Linux (login nodes), CNK (compute nodes)
Peak performance: 14.0 TFlop/s

CINECA PLX — IBM iDataPlex DX360M3 compute node:
- 2 x Intel Westmere 6-core processors, 2.40 GHz, 12 MB cache, DDR3 1333 MHz, 80 W
- 48 GB RAM on 12 x 4 GB DDR3 1333 MHz DIMMs
- 1 x 250 GB SATA HDD
- 1 x QDR Infiniband card, 40 Gb/s
- 2 x NVIDIA M2070 (M2070Q on 10 nodes)
Peak performance: 32 TFlops from the CPUs (3288 cores at 2.40 GHz); 565 TFlops single precision or 283 TFlops double precision from the 548 NVIDIA M2070 GPUs.
N. 54 in the Top500 (June 2011)

CINECA scientific areas: Chemistry, Physics, Life Science, Engineering, Astronomy, Geophysics, Climate, Cultural Heritage.
National institutions: INFM-CNR, SISSA, INAF, INSTM, OGS, INGV, ICTP, plus academic institutions.
Main activities: molecular dynamics, material science simulations, cosmology simulations, genomic analysis, geophysics simulations, fluid dynamics simulations, engineering applications, application code development/parallelization/optimization, help desk and advanced user support, consultancy for scientific software, consultancy and research activity support, scientific visualization support.

The HPC Model at CINECA
From agreements with national institutions to a national HPC agency in a European context:
- Big Science, complex problems
- Support for advanced computational science projects
- HPC support for computational sciences at the national and European level
- CINECA calls for advanced national computational projects
ISCRA - Italian SuperComputing Resource Allocation
Objective: support large-scale, computationally intensive projects that would not be possible or productive without terascale (and, in the future, petascale) computing.
- Class A: Large Projects: two calls per year
- Class B: Standard Projects: two calls per year
- Class C: Test and Development Projects (small CPU-hour budgets): continuous submission scheme; proposals reviewed four times per year

ISCRA: Italian SuperComputing Resource Allocation (iscra.cineca.it)
- SP6: 80 TFlops (5376 cores), N. 116 in the Top500, June 2011
- BGP: 17 TFlops (4096 cores)
- PLX: 142 TFlops (3288 cores plus NVIDIA M2070 GPUs), N. 54 in the Top500, June 2011
- National scientific committee
- Blind national peer-review system
- Allocation procedure

CINECA and Industry
CINECA provides HPC services to industry:
- ENI (geophysics)
- BMW-Oracle (America's Cup, CFD structure)
- ARPA (weather forecast, meteo-climatology)
- Dompé (pharmaceutical)
CINECA hosts the ENI HPC system: HP ProLiant SL390s G7, 6-core Xeon X5650, Infiniband, HP Linux cluster; N. 60 in the Top 500 (June 2011).

CINECA Summer schools

Agenda
- CINECA: the Italian HPC Infrastructure
- CINECA and the European HPC Infrastructure
- Evolution: Parallel Programming Trends in Extremely Scalable Architectures

PRACE
PRACE Research Infrastructure (www.prace-ri.eu): the top level of the European HPC ecosystem.
CINECA:
- represents Italy in PRACE
- is a hosting member in PRACE
- Tier-1 system: > 5% of PLX + SP6
- Tier-0 system in 2012: BG/Q, 2 Pflop/s
- involved in PRACE 1IP and 2IP
- PRACE 2IP prototype EoI
The European HPC ecosystem is organised as a pyramid: Tier-0 (European, PRACE), Tier-1 (national), Tier-2 (local). Its creation involves all stakeholders: HPC service providers on all tiers, scientific and industrial user communities, and the European HPC hardware and software industry.

HPC-Europa 2: providing access to HPC resources
- Consortium of seven European HPC infrastructures
- Integrated provision of advanced computational services to the European research community
- Provision of transnational access to some of the most powerful HPC facilities in Europe
- Opportunities to collaborate with scientists working in related fields at a relevant local research institute
HPC-Europa 2: 2009-2012 (FP7-INFRASTRUCTURES)

European projects and initiatives: DEISA, PRACE, EUDAT, EMI, HPC-Europa, VERCE, PLUG-IT, Europlanet, VPH-OP, EESI, HPCworld, V-MUST, DEEP, Mont-Blanc.

Agenda
- CINECA: the Italian HPC Infrastructure
- CINECA and the European HPC Infrastructure
- Evolution: Parallel Programming Trends in Extremely Scalable Architectures

BG/Q in CINECA
The Power A2 core has a 64-bit instruction set, like other commercial Power-based processors sold by IBM since 1995, but unlike the 32-bit PowerPC chips used in the earlier BlueGene/L and BlueGene/P supercomputers. The A2 core has four threads and uses in-order dispatch, execution, and completion, instead of the out-of-order execution common in many RISC processor designs. The A2 core has 16 KB of L1 data cache and another 16 KB of L1 instruction cache. Each core also includes a quad-pumped double-precision floating-point unit: each FPU has four pipelines, which can be used to execute scalar floating-point instructions, four-wide SIMD instructions, or two-wide complex-arithmetic SIMD instructions.
Each chip has 16 cores at 1.6 GHz; a crossbar switch links the cores and the L2 cache memory together; nodes are connected by a 5D torus interconnect.

HPC Evolution
Moore's law is holding, in the number of transistors:
- transistors on an ASIC are still doubling every 18 months at constant cost;
- 15 years of exponential clock rate growth has ended.
Moore's law reinterpreted:
- performance improvements now come from the increase in the number of cores on a processor (ASIC);
- the number of cores per chip doubles every 18 months, instead of the clock rate;
- threads per node will become visible soon.
(From Herb Sutter)

Top500 trends show a paradigm change in HPC. What about applications? The next HPC systems will have on the order of hundreds of thousands of cores.

The real HPC crisis is with software
Supercomputer applications and software are usually much longer-lived than the hardware:
- hardware life is typically four to five years at most;
- Fortran and C are still the main programming models.
Programming is stuck: arguably it hasn't changed much since the 1970s.
Software is a major cost component of modern technologies, yet the tradition in HPC system procurement is to assume that the software is free.
It's time for a change:
- complexity is rising dramatically;
- applications face new challenges on Petaflop systems;
- improving existing codes will become complex and partly impossible;
- the use of O(100K) cores implies a dramatic optimization effort;
- new paradigms, such as supporting a hundred threads in one node, imply new parallelization strategies;
- implementing new parallel programming methods in existing large applications does not always have a promising perspective.
There is a need for new community codes.

Roadmap to Exascale (architectural trends)

What about parallel applications?
In a massively parallel context, an upper limit for the scalability of parallel applications is determined by the fraction of the overall execution time spent in non-scalable operations (Amdahl's law): as the number of cores grows, the maximum speedup tends to 1 / (1 - P), where P is the parallel fraction of the code (1 - P is the serial fraction).
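To make the limit concrete, here is a minimal C sketch (the parallel fraction P = 0.99 and the core counts are illustrative values, not figures from the slides): it evaluates S(n) = 1 / ((1 - P) + P/n) and shows the speedup saturating at 1/(1 - P) = 100 no matter how many cores are added.

#include <stdio.h>

/* Amdahl's law: speedup on n cores with parallel fraction P */
static double amdahl_speedup(double P, int n)
{
    return 1.0 / ((1.0 - P) + P / n);
}

int main(void)
{
    const double P = 0.99;   /* illustrative: 99% of the runtime is parallel */
    for (int n = 1; n <= 1048576; n *= 32)
        printf("cores = %8d   speedup = %7.1f\n", n, amdahl_speedup(P, n));
    printf("asymptotic limit 1/(1-P) = %.1f\n", 1.0 / (1.0 - P));
    return 0;
}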

Programming models
- Message passing (MPI)
- Shared memory (OpenMP)
- Partitioned Global Address Space (PGAS) languages: UPC, Coarray Fortran, Titanium
- Next-generation programming languages and models: Chapel, X10, Fortress
- Languages and paradigms for hardware accelerators: CUDA, OpenCL
- Hybrid: MPI + OpenMP + CUDA/OpenCL

Trends
Scalar application → Vector → Distributed memory → Shared memory → Hybrid codes
- MPP systems, message passing: MPI
- Multi-core nodes: OpenMP
- Accelerators (GPGPU, FPGA): CUDA, OpenCL

Message passing: domain decomposition. (Diagram: several nodes, each with its own CPU and memory, connected through an internal high-performance network.)

Ghost cells - data exchange. (Diagram: a stencil update at point (i,j) needs the neighbours (i-1,j), (i+1,j), (i,j-1), (i,j+1); at sub-domain boundaries these neighbours live on another processor, so each processor keeps a layer of ghost cells that is exchanged between Processor 1 and Processor 2 at every update.)
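As a concrete illustration of the ghost-cell exchange, here is a minimal C + MPI sketch for a 1D domain decomposition; the array size NLOC, the dummy initialisation and the use of MPI_PROC_NULL at the domain edges are assumptions made for the example, not details from the slides.

#include <mpi.h>
#include <stdio.h>

#define NLOC 1024                     /* interior points owned by each rank (illustrative) */

int main(int argc, char **argv)
{
    int rank, size;
    double u[NLOC + 2];               /* u[0] and u[NLOC+1] are the ghost cells */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    for (int i = 1; i <= NLOC; ++i) u[i] = rank;   /* dummy initialisation */

    int left  = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
    int right = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

    /* send my first interior cell to the left neighbour and receive the right
       neighbour's first cell into my right ghost cell (and the mirror exchange) */
    MPI_Sendrecv(&u[1],        1, MPI_DOUBLE, left,  0,
                 &u[NLOC + 1], 1, MPI_DOUBLE, right, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Sendrecv(&u[NLOC],     1, MPI_DOUBLE, right, 1,
                 &u[0],        1, MPI_DOUBLE, left,  1,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    /* u[0] and u[NLOC+1] now hold the neighbours' boundary values,
       so the stencil update can proceed on the interior cells */
    MPI_Finalize();
    return 0;
}

Each rank exchanges one boundary value with each neighbour per update; in 2D or 3D decompositions the same pattern is repeated along every direction.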

Message passing: MPI
Main characteristics: library; coarse grain; inter-node parallelization (few real alternatives); domain partitioning; distributed memory; used by almost all parallel HPC applications.
Open issues: latency; OS jitter; scalability.

Shared memory. (Diagram: one node in which several CPUs share the same memory; threads 0-3 access shared variables x and y.)

Shared memory: OpenMP
Main characteristics: compiler directives; medium grain; intra-node parallelization (pthreads); loop or iteration partitioning; shared memory; many HPC applications.
Open issues: thread creation overhead; memory/core affinity; interface with MPI.
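Before the FFT-style example on the next slide, a minimal C sketch of the directive-based, loop-partitioning idea (the array size and the axpy-style loop body are illustrative choices, not taken from the slides):

#include <omp.h>
#include <stdio.h>

#define N 1000000

int main(void)
{
    static double x[N], y[N];

    /* the iterations of the loop are partitioned among the threads
       spawned inside the node; x and y live in the shared memory */
    #pragma omp parallel for
    for (int i = 0; i < N; ++i)
        y[i] = 2.0 * x[i] + y[i];

    printf("threads available: %d\n", omp_get_max_threads());
    return 0;
}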

OpenMP example (pseudocode):

!$omp parallel do
do i = 1, nsl
   call 1DFFT along z ( f [ offset( threadid ) ] )
end do
!$omp end parallel do

call fw_scatter ( ... )

!$omp parallel
do i = 1, nzl
   !$omp parallel do
   do j = 1, Nx
      call 1DFFT along y ( f [ offset( threadid ) ] )
   end do
   !$omp parallel do
   do j = 1, Ny
      call 1DFFT along x ( f [ offset( threadid ) ] )
   end do
end do
!$omp end parallel

Accelerator/GPGPU. (Diagram: element-wise sum of two 1D arrays, computed on the GPU.)

CUDA sample

void CPUCode( int* input1, int* input2, int* output, int length )
{
    for ( int i = 0; i < length; ++i ) {
        output[ i ] = input1[ i ] + input2[ i ];
    }
}

__global__ void GPUCode( int* input1, int* input2, int* output, int length )
{
    int idx = blockDim.x * blockIdx.x + threadIdx.x;
    if ( idx < length ) {
        output[ idx ] = input1[ idx ] + input2[ idx ];
    }
}

Each thread executes one loop iteration.
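For completeness, a hedged sketch of the host-side code that could drive the GPUCode kernel above (assumed to be compiled in the same .cu file); the helper name LaunchGPUCode and the block size of 256 threads are illustrative, not part of the original sample:

#include <cuda_runtime.h>

// Host-side driver: copies the inputs to the device, launches the
// GPUCode kernel defined above, and copies the result back.
void LaunchGPUCode( int* h_in1, int* h_in2, int* h_out, int length )
{
    int *d_in1, *d_in2, *d_out;
    size_t bytes = length * sizeof(int);

    cudaMalloc(&d_in1, bytes);
    cudaMalloc(&d_in2, bytes);
    cudaMalloc(&d_out, bytes);

    cudaMemcpy(d_in1, h_in1, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_in2, h_in2, bytes, cudaMemcpyHostToDevice);

    // one thread per element, rounded up to a whole number of blocks
    int threadsPerBlock = 256;
    int blocks = (length + threadsPerBlock - 1) / threadsPerBlock;
    GPUCode<<<blocks, threadsPerBlock>>>(d_in1, d_in2, d_out, length);

    cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_in1);
    cudaFree(d_in2);
    cudaFree(d_out);
}

The grid size is rounded up so that every element gets a thread, which is why the kernel checks idx < length before writing.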

CUDA, OpenCL
Main characteristics: ad-hoc compiler; fine-grain offload parallelization (GPU); single-iteration parallelization; ad-hoc memory; few HPC applications so far.
Open issues: memory copies; standard tools; integration with other languages.

Hybrid (MPI + OpenMP + CUDA + ... + Python)
- Takes the best of all models
- Exploits the memory hierarchy
- Many HPC applications are adopting this model, mainly due to developer inertia: it is hard to rewrite millions of lines of source code.

Hybrid parallel programming (example: Quantum ESPRESSO)
- MPI: domain partitioning
- OpenMP: outer-loop partitioning
- CUDA: inner-loop iterations assigned to GPU threads
- Python: ensemble simulations
A sketch of how the first two levels combine is shown below.
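A minimal C sketch of the hybrid structure (MPI across nodes, OpenMP threads inside each node); the local array size, the trivial loop body and the final reduction are illustrative only, and the CUDA and Python levels mentioned above are left out:

#include <mpi.h>
#include <omp.h>
#include <stdio.h>

#define NLOC 1000000                  /* local portion of the domain (illustrative) */

int main(int argc, char **argv)
{
    int rank, provided;
    static double u[NLOC];
    double local_sum = 0.0, global_sum = 0.0;

    /* MPI level: one rank per node (or socket), each owning a sub-domain */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* OpenMP level: threads inside the node share u and split the loop */
    #pragma omp parallel for reduction(+:local_sum)
    for (int i = 0; i < NLOC; ++i) {
        u[i] = rank + 1.0;
        local_sum += u[i];
    }

    /* back to MPI for the inter-node combination */
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0) printf("global sum = %f\n", global_sum);

    MPI_Finalize();
    return 0;
}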

Storage and I/O
- The I/O subsystem is not keeping pace with the CPUs
- Checkpointing will not be possible
- Reduce I/O: on-the-fly analysis and statistics; disks only for archiving
- Scratch space on non-volatile memory ("close to RAM")

Conclusions: parallel programming trends in extremely scalable architectures
- Exploit millions of ALUs
- Hybrid hardware
- Hybrid codes
- Memory hierarchy
- Flops/Watt (more than Flops/sec)
- I/O subsystem
- Non-volatile memory
- Fault tolerance!