
Impact of hybrid architectures on programming models and on infrastructure management (Towards exascale). Carlo Cavazzoni, HPC Department, CINECA

CINECA. CINECA is a non-profit consortium made up of 50 Italian universities*, the National Institute of Oceanography and Experimental Geophysics (OGS), the CNR (National Research Council), and the Ministry of Education, University and Research (MIUR). CINECA is the largest Italian computing centre and one of the most important worldwide. The HPC department manages the HPC infrastructure, provides support to Italian and European researchers, and promotes technology transfer initiatives for industry.

NVIDIA GPU: the Fermi implementation packs 512 processor cores.

CINECA PRACE Tier-0 System
Architecture: 10 BGQ frames
Model: IBM BG/Q
Processor type: IBM PowerA2, 1.6 GHz
Computing cores:
Computing nodes:
RAM: 1 GByte/core
Internal network: 5D Torus
Disk space: 2 PByte of scratch space
Peak performance: 2 PFlop/s
ISCRA & PRACE calls for projects now open!

Roadmap to Exascale (architectural trends)

Dennard scaling law (MOSFET):
L' = L/2, V' = V/2, F' = 2F, D' = 1/L'^2 = 4D, P' = P.
It does not hold anymore: the power crisis! In practice:
L' = L/2, V' ≈ V, F' ≈ 2F, D' = 1/L'^2 = 4D, P' = 4P.
Core frequency and performance no longer grow following Moore's law; CPU + accelerator is needed to keep the evolution of architectures on the Moore's law track. The programming crisis!

Where are Watts burnt? Today (at 40 nm), moving the three 64-bit operands A, B, C needed to compute a 64-bit floating-point FMA (D = A + B * C) takes 4.7x the energy of the FMA operation itself. Extrapolating down to 10 nm integration, the energy required to move the data becomes 100x!

Accelerator: a set (one or more) of very simple execution units that can perform few operations (with respect to a standard CPU) with very high efficiency. When combined with a full-featured CPU (CISC or RISC) it can accelerate the "nominal" speed of a system. (Carlo Cavazzoni) (Slide figure: CPU and accelerator combined through physical integration or architectural integration; the CPU targets single-thread performance, the accelerator targets throughput.)

ATI FireStream, AMD GPU. In 2012: the new Graphics Core Next ("GCN") architecture, with a new instruction set and a new SIMD design.

Intel MIC (Knights Ferry)

What about parallel applications? In a massively parallel context, an upper limit for the scalability of parallel applications is determined by the fraction of the overall execution time spent in non-scalable operations (Amdahl's law): the maximum speedup tends to 1 / (1 − P), where P is the parallel fraction and 1 − P the serial fraction.
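
As a rough numeric illustration (not on the original slide; the parallel fraction 0.99 is just an example), Amdahl's bound can be evaluated directly: with P = 0.99 the speedup on 1000 cores is about 91, and it can never exceed 100 however many cores are used.

    #include <stdio.h>

    /* Amdahl's law: speedup on n cores for a parallel fraction p. */
    double amdahl_speedup(double p, double n)
    {
        return 1.0 / ((1.0 - p) + p / n);
    }

    int main(void)
    {
        double p = 0.99;  /* illustrative parallel fraction */
        printf("1000 cores:  %.1f\n", amdahl_speedup(p, 1000.0));  /* ~91 */
        printf("upper bound: %.1f\n", 1.0 / (1.0 - p));            /* 100 */
        return 0;
    }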

Programming Models
Message Passing (MPI)
Shared Memory (OpenMP)
Partitioned Global Address Space (PGAS) languages: UPC, Coarray Fortran, Titanium
Next-generation programming languages and models: Chapel, X10, Fortress
Languages and paradigms for hardware accelerators: CUDA, OpenCL
Hybrid: MPI + OpenMP + CUDA/OpenCL

Message Passing: MPI
Main characteristics: library; coarse grain; inter-node parallelization (few real alternatives); domain partitioning; distributed memory; used by almost all HPC parallel applications.
Open issues: latency; OS jitter; scalability.
(Slide figure: several nodes, each with its own memory and CPU, connected by an internal high-performance network.)
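
As a minimal sketch of the model (not part of the slides; the ring exchange is only an illustration), each rank owns a piece of the domain and data moves only through explicit messages:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char** argv)
    {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* Each rank owns one boundary value of its partition and exchanges
           it with its neighbours in a ring (a toy halo exchange). */
        double mine = (double) rank, from_left = -1.0;
        int right = (rank + 1) % size;
        int left  = (rank - 1 + size) % size;

        MPI_Sendrecv(&mine, 1, MPI_DOUBLE, right, 0,
                     &from_left, 1, MPI_DOUBLE, left, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        printf("rank %d received %.1f from rank %d\n", rank, from_left, left);

        MPI_Finalize();
        return 0;
    }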

Shared Memory: OpenMP
Main characteristics: compiler directives; medium grain; intra-node parallelization (pthreads); loop or iteration partitioning; shared memory; used by many HPC applications.
Open issues: thread creation overhead; memory/core affinity; interface with MPI.
(Slide figure: one node whose CPUs run threads 0-3 over a single shared memory.)
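
A minimal OpenMP sketch (not part of the slides), showing compiler directives partitioning a loop among the threads of a single node over shared memory:

    #include <omp.h>
    #include <stdio.h>

    #define N 1000000

    static double a[N], b[N], c[N];

    int main(void)
    {
        /* Loop iterations are split among the threads of the node;
           a, b and c live in the memory shared by all threads. */
        #pragma omp parallel for
        for (int i = 0; i < N; ++i)
            c[i] = a[i] + b[i];

        printf("max threads available: %d\n", omp_get_max_threads());
        return 0;
    }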

Accelerator/GPGPU: element-wise sum of two 1D arrays.

CUDA sample

    void CPUCode(int* input1, int* input2, int* output, int length)
    {
        for (int i = 0; i < length; ++i) {
            output[i] = input1[i] + input2[i];
        }
    }

    __global__ void GPUCode(int* input1, int* input2, int* output, int length)
    {
        int idx = blockDim.x * blockIdx.x + threadIdx.x;
        if (idx < length) {
            output[idx] = input1[idx] + input2[idx];
        }
    }

Each thread executes one loop iteration.
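
The slide shows only the kernels; a minimal host-side launch might look as follows (not from the slide: the device pointers d_input1, d_input2 and d_output are assumed to have been allocated with cudaMalloc and filled with cudaMemcpy, and error checking is omitted):

    // Illustrative launch of the GPUCode kernel defined above.
    int threadsPerBlock = 256;
    int blocks = (length + threadsPerBlock - 1) / threadsPerBlock;  // enough blocks to cover all elements
    GPUCode<<<blocks, threadsPerBlock>>>(d_input1, d_input2, d_output, length);
    cudaDeviceSynchronize();  // wait for the kernel before copying results back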

Codes become hybrid too (MPI + OpenMP + CUDA + ... + Python): take the positives of all models (message passing, shared memory, GPGPU) and exploit the memory hierarchy. Many HPC applications are adopting this model, mainly due to developer inertia: it is hard to rewrite millions of source lines.
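
As a rough illustration of the hybrid style (not from the slides; the per-rank chunk size and the work inside the loop are made up), MPI distributes the data across nodes, OpenMP threads work on each rank's chunk, and MPI combines the partial results:

    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    int main(int argc, char** argv)
    {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const int chunk = 1000000;   /* elements owned by each MPI rank (illustrative) */
        double local_sum = 0.0, global_sum = 0.0;

        /* Intra-node parallelism: OpenMP threads share the rank's chunk. */
        #pragma omp parallel for reduction(+:local_sum)
        for (int i = 0; i < chunk; ++i)
            local_sum += 1.0;        /* stand-in for real per-element work */

        /* Inter-node parallelism: MPI combines the per-rank partial sums. */
        MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("total = %f\n", global_sum);

        MPI_Finalize();
        return 0;
    }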

System Software Components
Monitoring: HNagios (Nagios customized for HPC)
I/O: GPFS (parallel filesystem, comparison with Lustre)
Resource manager: PBSPro (focus on GPUs)

System PLX, PRACE Tier-1
Model: IBM PLX (iDataPlex DX360M3)
Architecture: Linux InfiniBand cluster
Nodes: 274 IBM iDataPlex M3
Processors: 2 six-core Intel Westmere 2.40 GHz per node
GPU: 2 NVIDIA Tesla M2070 per node
RAM: 48 GB/node
Internal network: InfiniBand with 4x QDR switches
Operating system: Red Hat RHEL 5.6
Peak performance: 300 TFlop/s (152 TFlop/s sustained, Linpack benchmark)

Monitoring
Problem: managing the alarms of thousands of heterogeneous nodes.
Solution: design, development and adaptation of Nagios for HPC clusters.
Hierarchical setup: a top-level server sends messages and summaries; secondary servers act as hosts for the primary server and filter "local" messages; the leaves are the monitored resources (e.g. node01 GPU0, node01 GPU1, node04 GPFS client).

Parallel File Systems (focus on GPFS and Lustre)
GPU codes tend to use fewer nodes (one or a few) and therefore require a larger per-node I/O bandwidth.
Objective: planning, installation and configuration of scalable parallel file systems in order to maximize performance and limit the impact of problems on the HPC infrastructure.
Solution adopted: GPFS; testing of Lustre as a possible alternative (on different hardware/storage).
Main topics and results:
- back-end storage scalability, management, high-availability and reliability issues
- general configuration and tuning issues
- solution design: separation of data and metadata; characterization of the file systems (scratch, repo, home)
- scalability issues and limitations

Parallel File Systems (focus on GPFS and Lustre): comparison of reliability/availability/maintainability
GPFS is more reliable than Lustre, based on more sophisticated availability mechanisms, and it is simpler to maintain:
- all server functions are redundant, replicated as many times as one likes
- metadata management is not centralized, and metadata can be replicated across many disks
- data blocks can be replicated across disks
- disks can be emptied and replaced on the fly, simply by invoking proper and simple management commands
- file system check operations are a "one command shot" and can be performed on live file systems (with limitations)

Parallel File Systems (focus on GPFS and Lustre): comparison of performance
The tested infrastructures are different, so a direct comparison is impossible.
- GPFS is very sensitive to the file system block size, especially on SATA storage, because the striping scheme is perceived as random access by the back-end storage
- both technologies seem able to scale up in terms of number of disk arrays and server nodes
- scalability in terms of clients seems an issue in GPFS clusters, due to the token-passing serialization protocol (clients hold locks for file system integrity)
- scalability in terms of servers seems an issue in Lustre clusters, due to the centralized metadata management

PBSPro (resource manager, focus on GPUs)
Two different configurations are possible:
- Basic GPU scheduling (GPUs configured as a single custom resource): GPUs are treated with equal priority (similar to the built-in ncpus PBS resource). Request 4 nodes with one GPU each: qsub -lselect=4:ncpus=1:ngpus=1 myjob
- Advanced GPU scheduling (GPUs configured as vnodes): each individual GPU on a node can be separately allocated by a job (PBS knows that GPUs have an ID). Request 4 nodes with GPU id 0: qsub -lselect=4:ncpus=1:gpu_id=gpu0 myjob
(Slide figure: a node exposing CPUs and GPUs as plain resources vs. a node split into vnode0 and vnode1, each with its own CPU and GPU.)
In both cases the placement/binding of tasks to the GPUs is a responsibility of the application.

PBSPro (resource manager, focus on GPUs): main results
Basic GPU scheduling:
- easier to configure, administer and use (from the end-user point of view)
- more flexible with heterogeneous jobs (CPU-only and GPU)
Basic and advanced:
- no binding between tasks and GPUs, and no enforcement (not even with advanced scheduling)
- conflicts always arise when nodes are shared between jobs using GPUs

Conclusions
- Hybrid clusters can bridge the gap towards exascale machines.
- New hybrid programming models.
- Hierarchical monitoring for hybrid and heterogeneous systems.
- GPFS as a "reliable" parallel filesystem.
- PBSPro: easy to manage GPUs, but node sharing should be avoided.
Thank you.
Contributions: Daniela Galetti, Marco Sbrighi, Federico Paladin, Fabrizio Vitale