
Development of High Performance Computing Applications Across Heterogeneous Systems
Lecture 2: Frameworks to Aid Code Development and Performance Portability
André Pereira, LIP-Minho/University of Minho
Inverted CERN School of Computing, February 2015

Agenda

- Motivation
- Frameworks for Heterogeneous Programming
- A Small Example with DICE
- Performance Analysis of Case Studies

HetPlats Challenges

- "I spent months optimising my code for HetPlats, I bet it will be super fast on this new system I just bought"
- No! You need to re-tune the code for each system…
- How is it possible to
  - achieve code scalability on each device?
  - simultaneously use both computing devices?
  - write the code once and guarantee its performance across different HetPlats?

Levels of Parallelism

(Diagram) Parallelism split by platform and technique: shared memory platforms exploit task parallelism, data parallelism, and vectorisation; distributed memory platforms rely on domain partitioning.
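As a minimal sketch of the three shared-memory techniques (using OpenMP as an assumed illustration; the slide names the techniques, not any particular tool):

#include <vector>

void levels_demo(std::vector<float>& a, const std::vector<float>& b) {
    // Data parallelism: loop iterations are split across threads
    #pragma omp parallel for
    for (long i = 0; i < (long)a.size(); ++i)
        a[i] += b[i];

    // Vectorisation: the compiler is asked to emit SIMD instructions
    #pragma omp simd
    for (long i = 0; i < (long)a.size(); ++i)
        a[i] *= 2.0f;

    // Task parallelism: independent units of work run concurrently
    #pragma omp parallel
    #pragma omp single
    {
        #pragma omp task
        { /* e.g. process the first half of a */ }
        #pragma omp task
        { /* e.g. process the second half of a */ }
    }
}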

Frameworks

- There are frameworks to help the development of code for heterogeneous platforms
- They provide several key features to the programmer
  - Abstraction of the distributed memory environment
  - Automatic workload balancing among processing units
  - Coding the algorithm once to run on different processing units
  - Management of different task dependencies
  - Adaptation to the computing platform
- They are open source!
  - And provide several tutorials

Frameworks

- The downside is…
  - Steep learning curve for non-computer scientists
  - Production code has to be rewritten to fit their programming model
  - Some frameworks require user configuration for each task/algorithm, which may have a huge impact on performance
- Different frameworks use different strategies to
  - Implement the algorithms
  - Minimise the cost of transferring data among processing units
  - Handle RAW, WAW, and WAR task dependencies (see the sketch below)
  - Schedule the workload among processing units
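To make the dependency classes concrete, a minimal illustration (hypothetical tasks T1..T3, not framework code):

float x = 1.0f, y = 2.0f, a, b;
a = x * 2.0f;   // T1 writes a
b = a + 1.0f;   // T2 reads a:  RAW, so T2 must run after T1
a = y * 3.0f;   // T3 writes a: WAR with T2, WAW with T1
(void)b;        // silence unused-variable warnings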

Revisiting the Challenges

- Different architectures ✔
  - Distinct designs of parallelism
  - Distinct memory hierarchies
- Different programming paradigms ✔
  - Distinct code for efficient algorithms among devices
- Workload management ✔
  - High-latency communication between CPU and device
  - Different throughputs among devices

StarPU

- "Task programming library for hybrid architectures"
- Implementation through the library API or compiler pragmas
- Uses a task-based parallelisation approach
  - Programmer codes codelets to run on the processing units (see the sketch below)
- StarPU hides memory transfer costs by interleaving different tasks
- Fixed workload grain size (defined by the user)
- Also works on cluster environments with MPI
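As a minimal sketch of the codelet style, modelled on StarPU's well-known vector-scaling example (treat it as an outline to check against the StarPU documentation, not as production code):

#include <starpu.h>
#include <cstdint>

// CPU implementation of the codelet: scale a vector in place
static void scal_cpu(void *buffers[], void *cl_arg) {
    struct starpu_vector_interface *vec =
        (struct starpu_vector_interface *)buffers[0];
    float *x = (float *)STARPU_VECTOR_GET_PTR(vec);
    unsigned n = STARPU_VECTOR_GET_NX(vec);
    float factor;
    starpu_codelet_unpack_args(cl_arg, &factor);
    for (unsigned i = 0; i < n; i++)
        x[i] *= factor;
}

int main(void) {
    float x[1024];
    for (int i = 0; i < 1024; i++) x[i] = 1.0f;
    float factor = 3.14f;

    starpu_init(NULL);

    // Describe the task: one CPU variant, one read-write buffer
    // (a cuda_funcs entry would add a GPU variant of the same codelet)
    struct starpu_codelet cl = {};
    cl.cpu_funcs[0] = scal_cpu;
    cl.nbuffers = 1;
    cl.modes[0] = STARPU_RW;

    // Register the vector so StarPU can manage transfers between units
    starpu_data_handle_t handle;
    starpu_vector_data_register(&handle, STARPU_MAIN_RAM,
                                (uintptr_t)x, 1024, sizeof(float));

    // Submit one task; the scheduler picks the processing unit
    starpu_task_insert(&cl, STARPU_RW, handle,
                       STARPU_VALUE, &factor, sizeof(factor), 0);

    starpu_task_wait_for_all();
    starpu_data_unregister(handle);
    starpu_shutdown();
    return 0;
}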

Legion

- "Data-centric programming model for writing high performance applications"
- A parallelisation approach focused on the data set
  - User specifies properties of the data structures, such as organisation, partitioning, and coherence
  - Legion handles the parallelism and data transfers according to the specified properties
- User maps the tasks to the processing units
- Legion schedules the workload at runtime to handle irregular tasks

DICE

- Programming model and runtime system for irregular applications
  - Dynamic Irregular Computing Environment
- Data-parallel approach with a unified memory space
- Provides various data containers, with different properties
- Allows the programmer to provide optimised code for each processing unit
- The user has to code a dicing function, used to minimise the data transfers
- Workload grain size adapts dynamically at runtime
- Requires some expertise to produce high-performing code

DICE – Runtime System

(Diagram of the DICE runtime system architecture)

A Small Example with DICE

- SAXPY: single precision alpha * x[i] + y[i]
- Linear complexity, O(n)
- No data dependencies

void saxpyCPU(float a, float *x, float *y, float *r, int N) {
    for (int i = 0; i < N; i++)
        r[i] = a * x[i] + y[i];
}

A Small Example with DICE

- Data structures
  - Defined inside the work class
  - Global memory construct to be shared among processing units
  - Scalar variables do not need a special identifier
  - This belongs to the high-level API

smartPtr<float> R;
smartPtr<float> X;
smartPtr<float> Y;
float alpha;

A Small Example with DICE

- Data properties
  - Assigned to the data structures when they are initialised
  - smartPtr are classes, implementing getters and setters
  - Properties: DEVICE, SHARED, READ_ONLY

smartPtr<float> R = smartPtr<float>(N, Property);

A Small Example with DICE

- Define the task properties
  - Give a unique identifier to each task

enum WORK_TYPE {
    /*!< Empty job description. DO NOT CHANGE */
    WORK_NONE = W_NONE,
    /*!< SAXPY job definition */
    WORK_SAXPY,
    /* TO DO: Add your job descriptions here */
    /*!< Total number of job definitions. DO NOT CHANGE */
    WORK_TOTAL,
    /*!< Reserved bit mask job. DO NOT CHANGE */
    WORK_RESERVED = W_RESERVED
};

A Small Example with DICE

- Fit the code to the Worker class
  - Declare an empty constructor with the job description
  - W_REGULAR indicates that the workload is regular (as opposed to W_IRREGULAR)
  - WORK_SAXPY maps the defined identifier to the method

saxpy() : work(WORK_SAXPY | W_REGULAR) { }

A Small Example with DICE

- Fit the code to the Worker class
  - __HYBRID__ indicates that the code is to be simultaneously scheduled among all processing units
  - __DEVICE__, accompanied by a template parameter, specifies the code to be compiled for a specific device

__HYBRID__ saxpy() : work(WORK_SAXPY | W_REGULAR) { }

A Small Example with DICE

- Fit the code to the Worker class
  - Create another constructor that receives the inputs as smartPtr data structures
  - LENGTH is the vector length; lower and upper bound the sub-range this task instance processes

__HYBRID__ saxpy(smartPtr<float> _R, smartPtr<float> _X, smartPtr<float> _Y,
                 float _alpha, unsigned _LENGTH, unsigned lo, unsigned hi)
    : work(WORK_SAXPY | W_REGULAR),
      R(_R), X(_X), Y(_Y), alpha(_alpha),
      LENGTH(_LENGTH), lower(lo), upper(hi) { }

A Small Example with DICE

- The dicing function (the hard bit…) splits the task's [lower, upper) range into number roughly equal sub-tasks; e.g. with lower = 0, upper = 1000 and number = 4, each sub-task covers 250 elements

template <typename Device>   // template argument and List element type assumed; lost in transcription
__DEVICE__ List<work*>* dice(unsigned &number) {
    unsigned range = upper - lower;
    unsigned number_of_units = range / number;
    if (number_of_units == 0) {
        number_of_units = 1;
        number = range;
    }
    unsigned start = lower;
    List<work*>* L = new List<work*>(number);
    for (unsigned k = 0; k < number; k++) {
        saxpy* s = new saxpy(R, X, Y, alpha, LENGTH,
                             start, start + number_of_units);
        (*L)[k] = s;
        start += number_of_units;
    }
    return L;
}

A Small Example with DICE

- Finally, the SAXPY code!
  - TID defines the position to process (similar to a CUDA thread index)
  - The code takes the lower and upper bounds of the vector into account

template <typename Device>   // template argument assumed; lost in transcription
__DEVICE__ void execute() {
    if (TID > (upper - lower)) return;
    unsigned long tid = TID + lower;
    for (; tid < upper; tid += TID_SIZE)
        R.set(tid, X.get(tid) * alpha + Y.get(tid));
}
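For reference, here is a sketch of how the same kernel is typically written in plain CUDA; the global thread index plays the role that DICE's TID abstracts (kernel name and launch parameters are illustrative):

__global__ void saxpyKernel(float a, const float *x, const float *y,
                            float *r, int n) {
    // Global thread index, the hand-rolled equivalent of TID
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        r[i] = a * x[i] + y[i];
}

// Launch example: one thread per element
// saxpyKernel<<<(N + 255) / 256, 256>>>(alpha, X, Y, R, N);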

A Small Example with DICE

- Initialise the runtime system, prepare the input data, and execute the code

// Initialise the runtime system
RuntimeScheduler* rs = new RuntimeScheduler();

// Create global memory space for the shared vectors
smartPtr<float> R = smartPtr<float>(sizeof(float)*N, SHARED);
smartPtr<float> X = smartPtr<float>(sizeof(float)*N, SHARED);
smartPtr<float> Y = smartPtr<float>(sizeof(float)*N, SHARED);

// Initialise the data…
…

// Create the work description
saxpy* s = new saxpy(R, X, Y, alpha, N, 0, N);

// Submit the work for execution and synchronise afterwards
rs->submit(s);
rs->synchronize();

Testbed Environment

- Morpheus
  - 2x Intel Xeon 6-core 2.6 GHz
  - 2x NVidia Tesla C GB DRAM
- MacBook Pro
  - Intel i7 Ivy Bridge 4-core GHz
  - NVidia 650M GPU
- Software
  - GNU compiler version
  - CUDA Toolkit 6.5

SAXPY with DICE

(Chart) Scalability of SAXPY for various system configurations, for a vector of 300M elements. Legend: C – CPU, G – GPU.

SAXPY with DICE

- This problem is not scalable…
  - The overhead of communication and scheduling restricts performance
  - The problem is extremely regular and too simple (computationally)
- Analyse your code first!
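A quick back-of-the-envelope check makes the point: for N = 300M, SAXPY performs 2N ≈ 0.6 GFLOP (one multiply and one add per element) while touching 3 × 4N bytes ≈ 3.6 GB (read x and y, write r), roughly 0.17 flop per byte, so performance is bounded by memory (and, for the GPU, PCIe transfer) bandwidth rather than by compute.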

Barnes-Hut with DICE

- The Barnes-Hut algorithm simulates n-body system interactions
- It divides the space and creates a hierarchy to speed up the particle interaction calculations, with a complexity of O(n log n)
  - Particle clusters should be on the same processor
- Workload is dynamic
- Fastest GPU implementation by Burtscher and Pingali (B&P)

Barnes-Hut with DICE

(Chart) Execution time of Barnes-Hut for 1M particles. Legend: C – CPU, G – GPU.

Barnes-Hut with DICE

- Not a big improvement over the best GPU implementation
  - The problem size is not big enough
  - The communication and workload management overhead restricts the performance

Barnes-Hut with DICE

(Chart) Scalability of Barnes-Hut for various problem sizes. Legend: C – CPU, G – GPU.

Path Tracing with DICE

- Monte Carlo simulation to render physically accurate scenes
- Recursive algorithm
- Dynamic workload

Path Tracing with DICE

(Chart) Ray count for the progressive path tracer with a variance threshold of 1% (left) and 25% (right).

Path Tracing with DICE

(Chart) Comparison of the StarPU and DICE implementations of the path tracer running on Morpheus.

Path Tracing with DICE

- DICE provides a 20% performance improvement
- DICE is 2x faster than StarPU in the best case
- DICE handles CPU+GPU parallelisation better than StarPU

Path Tracing with DICE

(Chart) Workload distribution between the CPU and GPU, per frame, for the adaptive path tracer (irregular workload).

Path Tracing with DICE

(Chart) Frame-by-frame execution time for the dynamic scheduler vs a static (40% CPU, 60% GPU) scheduler. Total time: dynamic 122 s, static 157 s.

Path Tracing with DICE

- DICE dynamically adapts to changes in task execution time
  - It is always tuning the amount of work that the CPU and GPU process
- Dynamic scheduling is 30% faster than a tuned static scheduling technique

Conclusions

- Coding for HetPlats is complex and time-consuming
  - One must simultaneously deal with different levels of parallelism
- There are frameworks to help code development
  - Some effort is required to get familiar with them
  - They automatically balance the workload among CPUs and GPUs
  - They adapt to the computing platform and to irregular tasks at runtime

Acknowledgment

A special thanks to the DICE developers for letting me use the framework, which is still in beta:

- João Barbosa – UMinho & University of Texas
- Roberto Ribeiro – UMinho
- Donald Fussell – University of Texas
- Calvin Lin – University of Texas
- Luís Paulo Santos – UMinho
- Alberto Proença – UMinho
