Brian Van Straalen, Portable Performance Discussion, August 7, 2015. FASTMath SciDAC Institute.

First, the portable and performant basics

- C/C++11 will run correctly on all systems; C++ std::thread might not be performant for several more years (a minimal baseline sketch follows this list).
- Fortran 2003 should be working everywhere; Fortran 2008 (coarrays, DO CONCURRENT) might be available.
- MPI will be available on CPUs.
- CPU-based machines will have some form of threading; OpenMP is the currently expected flavor.
- GPU accelerator machines will support a kernel-offload style of computing through CUDA or OpenCL.
- C/C++/CUDA/OpenCL/Fortran will link together correctly.
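As a concreteness check on the first bullet, below is a minimal C++11 baseline: an axpy loop that any conforming compiler will build, parallelized with std::thread. This is an illustrative sketch, not Chombo code; saxpy_range and the block partitioning are invented for the example.

    // Minimal C++11 std::thread sketch: portable everywhere,
    // but not necessarily performant everywhere (the slide's point).
    #include <algorithm>
    #include <cstddef>
    #include <functional>
    #include <thread>
    #include <vector>

    // Each thread updates a contiguous slice of y: y = a*x + y.
    void saxpy_range(float a, const std::vector<float>& x,
                     std::vector<float>& y, std::size_t lo, std::size_t hi) {
      for (std::size_t i = lo; i < hi; ++i)
        y[i] += a * x[i];
    }

    int main() {
      const std::size_t n = 1 << 20;
      std::vector<float> x(n, 1.0f), y(n, 2.0f);
      // hardware_concurrency() may report 0; fall back to one thread.
      const unsigned nt = std::max(1u, std::thread::hardware_concurrency());

      std::vector<std::thread> pool;
      for (unsigned t = 0; t < nt; ++t) {
        std::size_t lo = n * t / nt, hi = n * (t + 1) / nt;
        pool.emplace_back(saxpy_range, 0.5f, std::cref(x), std::ref(y), lo, hi);
      }
      for (auto& th : pool) th.join();
      return 0;
    }

On CPU nodes, the same loop under #pragma omp parallel for is the flavor the slide expects to be performant today.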

Current hardware landscape

- Multicore CPU compute nodes (Cray XC30, XC40).
- Multicore CPU host with an NVIDIA accelerator (XE6, XK6, PowerEdge); host and accelerator connected by a PCIe bus.
- Early manycore hosted platforms (BG/Q): a higher count of simpler cores.

Developers vs. Users

- Developers: with the minimum amount of code divergence, capture as much of the available performance as possible across the range of target architectures.
- Users: application parallelism and library parallelism are not mandated.
  - MPI Endpoints are a possible way for MPI+X libraries to work with flat MPI (a hedged initialization sketch follows this list).
  - Users can decide to adopt a library's programming model and build environment.
- It is hard not to subject Users to a library's data structure choices.
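A hedged sketch of the coexistence point above: MPI Endpoints was still a proposal at the time, so the portable mechanism a threaded library can use today is MPI_Init_thread, negotiating a thread-support level rather than mandating the application's parallelism. Only standard MPI calls are used; the branch bodies are placeholders.

    // Negotiating thread support so a threaded library can coexist
    // with a flat-MPI application. Standard MPI-2/3 calls only.
    #include <mpi.h>
    #include <cstdio>

    int main(int argc, char** argv) {
      int provided;
      // Ask for full thread support; fall back gracefully if absent.
      MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

      if (provided >= MPI_THREAD_FUNNELED) {
        // Library may spawn OpenMP or std::thread workers; only the main
        // thread (FUNNELED) or any thread (MULTIPLE) makes MPI calls.
      } else {
        // Run the library's flat-MPI path: one rank per core, no threads.
      }

      int rank;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      if (rank == 0) std::printf("thread support level: %d\n", provided);
      MPI_Finalize();
      return 0;
    }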

Chombo is going to drink the Kool-Aid

- MPI3+OpenMP4 as a portable programming model.
  - Recommended by Department of Energy vendors.
  - BoxLib already uses MPI2+OpenMP3.
- MPI3: more asynchronous styles available; one-sided communication; fewer dynamic load-balancing options than threads; address spaces are private by default.
- OpenMP4: threading, SIMD vector, and kernel-offload directives (a sketch of all three follows this list); a move towards a code generator for OpenMP code.
- More I/O abstractions.
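The sketch below shows what the OpenMP4 bullet refers to: the same loop driven by host threading plus SIMD, then by the target offload directives. It is illustrative only (array names and sizes are invented) and needs an OpenMP-4-capable compiler; with no device present, the target region falls back to the host.

    // The three OpenMP 4 directive classes: threading, SIMD, offload.
    #include <cstdio>
    #include <vector>

    int main() {
      const int n = 1 << 20;
      std::vector<float> x(n, 1.0f), y(n, 2.0f);
      float* xp = x.data();
      float* yp = y.data();

      // 1) Host threading + SIMD: one directive drives threads and lanes.
      #pragma omp parallel for simd
      for (int i = 0; i < n; ++i)
        yp[i] += 0.5f * xp[i];

      // 2) Offload: map arrays to the device, launch a kernel-style loop.
      #pragma omp target teams distribute parallel for simd \
                  map(to: xp[0:n]) map(tofrom: yp[0:n])
      for (int i = 0; i < n; ++i)
        yp[i] += 0.5f * xp[i];

      std::printf("y[0] = %f\n", yp[0]);
      return 0;
    }

The MPI3 half of the model adds one-sided communication (MPI_Win_allocate, MPI_Put); a sketch of that appears after the convergence slide below.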

What convergence of hardware can we expect?

- On-core and off-core networking will start to merge: System-on-Chip designs are more energy efficient, and NIC-on-chip is already on the roadmap.
- Intel (Shekhar Borkar): sending messages using source and destination addresses would be the most efficient and sensible approach, because all of the hardware necessary to accomplish it is already present.
- NVIDIA (Steve Oberlin): we can build hardware to accelerate matching for MPI, but it is redundant hardware given that address-matching logic is already in there.
- MPI tag matching is ulcer-inducing for next-generation hardware designers (a sketch contrasting matched and address-based transfers follows this list).
- Heterogeneous compute resources but Unified Memory: unified, but certainly not uniform.
- Chombo will be minimizing its reliance on MPI semantics.
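To make the tag-matching complaint concrete, the sketch below contrasts a two-sided transfer, where the runtime or NIC must search posted receives for a (source, tag, communicator) match, with an MPI-3 one-sided put, where data lands at an explicit (rank, displacement) address, the style Borkar and Oberlin describe. Standard MPI-3 calls only; ranks, tags, and values are illustrative.

    #include <mpi.h>

    int main(int argc, char** argv) {
      MPI_Init(&argc, &argv);
      int rank, size;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);

      // Two-sided: the receiver's (source, tag) wildcards force the
      // runtime/NIC to search a match queue -- the "ulcer-inducing" part.
      double val = rank;
      if (size >= 2) {
        if (rank == 0)
          MPI_Send(&val, 1, MPI_DOUBLE, 1, /*tag=*/42, MPI_COMM_WORLD);
        else if (rank == 1)
          MPI_Recv(&val, 1, MPI_DOUBLE, MPI_ANY_SOURCE, MPI_ANY_TAG,
                   MPI_COMM_WORLD, MPI_STATUS_IGNORE);
      }

      // One-sided: data lands at (target rank, displacement) -- pure
      // addressing, no matching required at the target.
      double* base;
      MPI_Win win;
      MPI_Win_allocate(sizeof(double), sizeof(double), MPI_INFO_NULL,
                       MPI_COMM_WORLD, &base, &win);
      *base = 0.0;
      MPI_Win_fence(0, win);
      if (size >= 2 && rank == 0)
        MPI_Put(&val, 1, MPI_DOUBLE, /*target=*/1, /*disp=*/0,
                1, MPI_DOUBLE, win);
      MPI_Win_fence(0, win);

      MPI_Win_free(&win);
      MPI_Finalize();
      return 0;
    }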