Early Linpack Performance Benchmarking on IPE Mole-8.5 Fermi GPU Cluster

Xianyi Zhang 1),2) and Yunquan Zhang 1),3)

1) Laboratory of Parallel Software and Computational Science, Institute of Software, Chinese Academy of Sciences, Beijing 100190, China
2) Graduate University of Chinese Academy of Sciences, Beijing 100190, China
3) State Key Laboratory of Computer Science, Institute of Software, Chinese Academy of Sciences, Beijing 100190, China

1. Introduction

Linpack is the de facto standard benchmark for supercomputers; based on it, the TOP500 website [1] ranks the 500 fastest high-performance computers in the world every half year. The NVIDIA Fermi GPU [2] brings a significant improvement for general-purpose computing: its faster double-precision arithmetic, ECC memory, and caches are especially important for scientific computing. Because of these advantages, IPE [3] used Fermi GPUs to build its heterogeneous supercomputer Mole-8.5. Although the machine is designed mainly for multi-scale discrete simulations, which suit fine-grained massively parallel processing, its Linpack performance is still of interest. We carried out an early benchmarking of the system and obtained 207.3 TFlops, ranking No. 19 on the June 2010 TOP500 list.

In the following, we first introduce the architecture of a computing node of the IPE Mole-8.5 cluster. We then briefly review the HPL package modified by NVIDIA for Fermi and our tuning tips, analyze the data transfer between CPU and GPU, and finally present the results and conclusions.

2. The Architecture of IPE Mole-8.5 Cluster

Fig. 1. The architecture of a computing node of the IPE Mole-8.5 cluster.

Figure 1 shows the architecture of a computing node of the IPE Mole-8.5 cluster. Each node has two Intel quad-core CPUs and two IOH chips, connected by QPI. Each IOH chip is attached to two PCI switches, each of which provides two PCIe x16 slots, so each IOH chip provides four PCIe x16 slots. One IOH chip hosts three Fermi GPUs; the other hosts three Fermi GPUs and one InfiniBand HCA. Table 1 lists the configuration details.

Table 1. The configuration of a computing node (the nodes initially had 24 GB of memory; for the 288-node and 320-node Linpack runs the memory was enlarged to 48 GB).

CPU:            2-way Intel Xeon E5520 (quad-core), 2.26 GHz
Memory:         24 GB / 48 GB DDR3
GPU:            6 NVIDIA Tesla C2050 cards
OS:             CentOS 5.4, Linux kernel
Compiler:       GCC 4.1
CUDA SDK:       3.0
NVIDIA driver:
MPI:            OpenMPI
BLAS:           GotoBLAS2-1.13

3. HPL on Fermi package and tuning tips

NVIDIA provided an HPL package modified for Fermi. The approach is similar to that of reference [4], and the main modifications are as follows:

- A CPU/GPU hybrid DGEMM, shown in Figure 2, which automatically splits the workload between GPU and CPU and uses a DGEMM optimized for Fermi on the GPU side.
- A CPU/GPU hybrid DTRSM, split in the same way as DGEMM.
- CUDA streams to overlap data transfers with computation.

Fig. 2. CPU/GPU hybrid DGEMM (the shaded part of the matrix is computed on the GPU).

When tuning Linpack performance, the DGEMM/DTRSM split ratio, the problem size (N), the block size (NB), and the process grid (P x Q) are the key factors and should be optimized first. Although a single node contains many GPUs and CPUs, we found that binding each process to a CPU and a GPU that are close to each other in the node topology gives better performance. A minimal sketch of the hybrid DGEMM splitting idea is given below.
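The NVIDIA-modified HPL code itself is not reproduced here, so the following is only a minimal sketch of the column-splitting idea behind the hybrid DGEMM, assuming cuBLAS on the GPU side and a CPU BLAS such as GotoBLAS2 on the host. The function name hybrid_dgemm, the single fixed split-ratio argument, and the use of one stream are illustrative assumptions; a production code would keep pinned, reusable device buffers and pipeline the copies rather than allocate per call.

    /* Illustrative sketch (not the NVIDIA HPL code): the first n_gpu columns of
     * B and C are updated on the GPU with cuBLAS, the remaining columns on the
     * CPU BLAS.  The GPU part is issued on a stream so that, with pinned host
     * buffers, the PCIe copies and the GPU DGEMM can overlap with the CPU DGEMM.
     * HPL's trailing-matrix update has the form C = C - A * B, i.e. alpha = -1,
     * beta = 1, column-major storage. */
    #include <cuda_runtime.h>
    #include <cublas_v2.h>
    #include <cblas.h>                       /* CPU BLAS, e.g. GotoBLAS2 */

    void hybrid_dgemm(int m, int n, int k,
                      const double *A, int lda,
                      const double *B, int ldb,
                      double *C, int ldc,
                      double gpu_ratio,      /* tuned fraction of columns for the GPU */
                      cublasHandle_t handle, cudaStream_t stream)
    {
        int n_gpu = (int)(n * gpu_ratio);    /* columns of B and C done on the GPU */
        int n_cpu = n - n_gpu;
        const double minus_one = -1.0, one = 1.0;

        double *dA, *dB, *dC;
        cudaMalloc((void **)&dA, (size_t)lda * k     * sizeof(double));
        cudaMalloc((void **)&dB, (size_t)ldb * n_gpu * sizeof(double));
        cudaMalloc((void **)&dC, (size_t)ldc * n_gpu * sizeof(double));

        /* Stage the GPU share asynchronously and launch its DGEMM on the stream. */
        cublasSetStream(handle, stream);
        cudaMemcpyAsync(dA, A, (size_t)lda * k     * sizeof(double), cudaMemcpyHostToDevice, stream);
        cudaMemcpyAsync(dB, B, (size_t)ldb * n_gpu * sizeof(double), cudaMemcpyHostToDevice, stream);
        cudaMemcpyAsync(dC, C, (size_t)ldc * n_gpu * sizeof(double), cudaMemcpyHostToDevice, stream);
        cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n_gpu, k,
                    &minus_one, dA, lda, dB, ldb, &one, dC, ldc);
        cudaMemcpyAsync(C, dC, (size_t)ldc * n_gpu * sizeof(double), cudaMemcpyDeviceToHost, stream);

        /* Meanwhile the CPU BLAS updates the remaining n_cpu columns. */
        if (n_cpu > 0)
            cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans, m, n_cpu, k,
                        -1.0, A, lda, B + (size_t)ldb * n_gpu, ldb,
                         1.0, C + (size_t)ldc * n_gpu, ldc);

        cudaStreamSynchronize(stream);       /* wait for the GPU share to land back in C */
        cudaFree(dA); cudaFree(dB); cudaFree(dC);
    }

The split ratio here plays the same role as the DGEMM split ratio mentioned above: it is tuned so that the GPU and CPU portions finish at roughly the same time.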
4. Analyzing data transfer between CPU and GPU

Fig. 3. Linpack performance (GFlops) on a single node with 1 to 6 GPUs.

As Figure 3 shows, the speedup drops significantly with 5 or 6 GPUs; the obvious suspect is the data transfer between CPU and GPU. To investigate the PCIe bandwidth requirement, we logged the amount of data transferred. In the single-GPU run with NB = 1024, 371 GB are moved from CPU to GPU and 337 GB in the opposite direction, and the Linpack run takes about 379 seconds, so each GPU needs more than 2 GB/s of PCIe bandwidth. For the 6-GPU case we developed a rough trace tool to replay the data-transfer behavior. In the replay, the data transfers alone take about 54.9 s, whereas the full Linpack run takes 61.9 s. The bottleneck in this system is therefore indeed PCIe bandwidth: the PCI switches cannot supply enough bandwidth for the Linpack benchmark.

5. Linpack result

Fig. 4. Linpack performance (TFlops) versus the number of nodes, and the final output.

Figure 4 shows the Linpack performance from 80 nodes up to 320 nodes. The results show good scalability across multiple IPE Mole-8.5 nodes. In addition, the ECC memory of Fermi let our large-scale Linpack runs complete smoothly.

6. Conclusion

In short, we obtained 207.3 TFlops on 320 nodes (1920 NVIDIA Fermi GPUs). The bottleneck of the Linpack benchmark on this system is the data transfer between CPU and GPU over PCIe. For better Linpack efficiency we suggest equipping a single node with fewer than four GPUs. Nevertheless, the emphasis of Mole-8.5 is on multi-scale discrete simulations, which place a lighter load on PCIe, so the current architecture may still be justified.

Acknowledgement

This work is partly supported by the Ministry of Finance under Grant No. ZDYZ2008-2 and by the National 863 Plan of China (No. 2006AA01A125, No. 2009AA01A129, No. 2009AA01A134).

References

[1] TOP500 Supercomputer Sites, http://www.top500.org/
[2] NVIDIA Fermi GPU architecture.
[3] Institute of Process Engineering (IPE), Chinese Academy of Sciences.
[4] M. Fatica, "Accelerating Linpack with CUDA on heterogeneous clusters," in Proc. of the 2nd Workshop on General Purpose Processing on Graphics Processing Units (GPGPU-2), 2009.
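Section 4 mentions logging the amount of data moved over PCIe and a rough trace tool for replaying it; neither is described in detail in the text. As an illustration only, the sketch below shows one simple way to accumulate per-direction PCIe traffic by wrapping cudaMemcpy (asynchronous copies would be wrapped the same way). The names logged_memcpy and report_pcie_traffic are assumptions, not the authors' tool.

    /* Illustrative sketch only (the paper's own trace tool is not shown):
     * wrap cudaMemcpy to accumulate the PCIe traffic in each direction, which
     * is enough to obtain numbers like the per-GPU 371 GB / 337 GB in Section 4. */
    #include <cuda_runtime.h>
    #include <stdio.h>

    static unsigned long long bytes_h2d = 0;   /* host -> device bytes */
    static unsigned long long bytes_d2h = 0;   /* device -> host bytes */

    /* Drop-in replacement for cudaMemcpy that also counts the traffic. */
    cudaError_t logged_memcpy(void *dst, const void *src, size_t count,
                              enum cudaMemcpyKind kind)
    {
        if (kind == cudaMemcpyHostToDevice) bytes_h2d += count;
        if (kind == cudaMemcpyDeviceToHost) bytes_d2h += count;
        return cudaMemcpy(dst, src, count, kind);
    }

    /* Called once at the end of the run, next to the HPL timing output. */
    void report_pcie_traffic(double elapsed_seconds)
    {
        const double GB = 1024.0 * 1024.0 * 1024.0;
        printf("H2D: %.1f GB  D2H: %.1f GB  average demand: %.2f GB/s\n",
               bytes_h2d / GB, bytes_d2h / GB,
               (bytes_h2d + bytes_d2h) / GB / elapsed_seconds);
    }

With the single-GPU figures from Section 4 (371 GB host-to-device plus 337 GB device-to-host over roughly 379 seconds), such a report gives an average demand of about 1.9 GB/s per GPU; the instantaneous demand during the trailing-matrix updates is higher, consistent with the more-than-2-GB/s per-GPU requirement stated there.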