John C. Linford, ParaTools, Inc. IGC7, Harvard SEAS, Cambridge, MA, 4 May 2015

[Figure: eight threads, Intel Core i7-4820K CPU @ 3.70 GHz]

[Figure: eight threads, Intel Core i7-4820K CPU @ 3.70 GHz]

Kppa is a domain-specific language and code generator. A lexical parser feeds an analysis stage, which drives code generation: from one mechanism description written in the DSL, Kppa emits optimized code in C, C++, CUDA, Fortran, or Python for serial, multi-core, GPGPU, and Intel MIC Architecture targets.
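To make the pipeline concrete, here is a minimal Python sketch of a parse/analyze/generate flow in the spirit of Kppa. Every name in it (Mechanism, analyze, emitter, generate) is an illustrative assumption, not Kppa's actual API:

    from dataclasses import dataclass

    @dataclass
    class Mechanism:
        """Parsed mechanism description (the DSL front end would build this)."""
        species: list    # chemical species names
        reactions: list  # (reactants, products, rate expression) tuples

    def analyze(mech):
        """Analysis stage: derive solver parameters from the mechanism."""
        return {"nvar": len(mech.species), "nreact": len(mech.reactions)}

    EMITTERS = {}  # target name -> code-emitting function

    def emitter(target):
        """Register a code generator for one target architecture."""
        def register(fn):
            EMITTERS[target] = fn
            return fn
        return register

    @emitter("c")
    def emit_c(info):
        return f"/* serial C solver: {info['nvar']} species, {info['nreact']} reactions */"

    @emitter("cuda")
    def emit_cuda(info):
        return f"/* CUDA solver, one grid cell per thread: {info['nvar']} species */"

    def generate(mech, target):
        """Code generation stage: pick an emitter and run it."""
        return EMITTERS[target](analyze(mech))

    print(generate(Mechanism(["O3", "NO", "NO2"], []), "cuda"))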

Kppa template (embedded Python drives the code generation):

    <%
    d_Decomp.begin()
    piv = lang.Variable('piv', REAL, 'Row element divided by diagonal')
    piv.declare()
    %>
    ${size_t} idx = blockDim.x*blockIdx.x + threadIdx.x;
    if (idx < ${ncells32}) {
        ${A} += idx;
        <%
        lang.upindent()
        for i in xrange(1, nvar):
            for j in xrange(crow[i], diag[i]):
                c = icol[j]
                d = diag[c]
                piv.assign(A[j*ncells32] / A[d*ncells32])
                A[j*ncells32].assign(piv)
        ...
        %>
    }

Generated code (excerpt):

    A[334] = t1;
    A[337] = -A[272]*t1 + A[337];
    A[338] = -A[273]*t1 + A[338];
    A[339] = -A[274]*t1 + A[339];
    A[340] = -A[275]*t1 + A[340];
    t2 = A[335]/A[329];
    A[335] = t2;
    A[337] = -A[330]*t2 + A[337];
    A[338] = -A[331]*t2 + A[338];
    A[339] = -A[332]*t2 + A[339];
    A[340] = -A[333]*t2 + A[340];
    t3 = A[341]/A[59];

Generates C, C++, CUDA, Fortran, or Python.
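The point of the template is that its nested loop runs at generation time, not at run time: the sparse LU decomposition is fully unrolled into the straight-line assignments shown above, so the generated kernel carries no loop or sparse-index overhead. A hedged sketch of that unrolling step, with toy CSR-style arrays (crow, diag, icol) standing in for real mechanism data:

    def unroll_decomp(crow, diag, icol, nvar, ncells32):
        """Emit flat assignment statements for a sparse LU decomposition."""
        lines, t = [], 0
        for i in range(1, nvar):                # rows below the first
            for j in range(crow[i], diag[i]):   # entries left of the diagonal
                d = diag[icol[j]]               # diagonal entry of that column
                t += 1
                lines.append(f"t{t} = A[{j*ncells32}]/A[{d*ncells32}];")
                lines.append(f"A[{j*ncells32}] = t{t};")
                # (the elimination updates on the rest of the row,
                #  the -A[...]*t lines above, are omitted in this sketch)
        return "\n".join(lines)

    # Toy 3x3 sparsity pattern; a real mechanism has hundreds of entries.
    print(unroll_decomp(crow=[0, 1, 3], diag=[0, 2, 5],
                        icol=[0, 0, 1, 0, 1, 2], nvar=3, ncells32=32))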

Expression simplification. The C expression

    X[27] = ((A[43]*A[43]*A[43]) + (A[43]*A[43]) - (A[43]) - (5-4))
          / (A[43]*A[43] + (8-6)*A[43] + 1)

simplifies to the Fortran statement

    X(28) = A(44) - 1

Kppa folds the constants, cancels the common polynomial factor, and converts 0-based C indices to 1-based Fortran indices.
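The cancellation is ordinary polynomial algebra: with a = A[43], the numerator a^3 + a^2 - a - 1 factors as (a - 1)(a + 1)^2 and the denominator a^2 + 2a + 1 is (a + 1)^2, leaving a - 1. The same reduction can be reproduced with SymPy; this is an illustrative check only, not Kppa's actual simplification engine:

    import sympy as sp

    a = sp.Symbol('a')  # stands in for the C array element A[43]
    expr = (a**3 + a**2 - a - (5 - 4)) / (a**2 + (8 - 6)*a + 1)
    print(sp.cancel(expr))  # prints a - 1, i.e. X(28) = A(44) - 1 in Fortran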

Baseline: hand-tuned KPP-generated code
1. Use KPP to generate a serial code.
2. A skilled programmer parallelizes the code.
Comparison: unmodified Kppa-generated code
– Same input file format as KPP
– Little or no modification to the source
Results (exact same hardware):
– Xeon Phi: 140% faster
– NVIDIA K40: 160% faster
– Xeon with OpenMP: 60% faster


TAU configured with PAPI and PDT.

[Figure credit: Will Sawyer, Charles Joseph (CSCS)]

Partnership with David Topping (Manchester):
– EMiT 2015 keynote
– Horizon 2020 proposal
– Aerosols, MCM, large mechanisms
Partnership with Will Sawyer (CSCS):
– PRACE integrator
– About 4x faster in COSMO-ART

Downloads, tutorials, resources