Konstantis Daloukas Nikolaos Bellas Christos D. Antonopoulos

GLOpenCL: OpenCL Support on Hardware- and Software-Managed Cache Multicores
Konstantis Daloukas, Nikolaos Bellas, Christos D. Antonopoulos
Department of Computer and Communications Engineering
University of Thessaly, Volos, Greece

Introduction
- OpenCL: a unified programming standard and framework
- Targets:
  - Homogeneous and heterogeneous multicores
  - Accelerator-based systems
- Aims at being platform-agnostic
- Enhances portability
20/11/2018 HiPEAC 2011

Motivation
- Vendor-specific OpenCL implementations target only hardware-controlled or only software-controlled cache multicores:
  - AMD Stream SDK: x86
  - IBM OpenCL SDK: Cell B.E.
- Lack of a unified framework for architectures with diverse characteristics

Contribution
- GLOpenCL: a unified OpenCL framework
  - Compilation infrastructure
  - Run-time support
- Enables native execution of OpenCL applications on:
  - Hardware-controlled cache multicores
  - Software-controlled cache multicores
- Achieves performance comparable to architecture-specific implementations

Outline
- Introduction
- The OpenCL Programming Model
- Compilation Infrastructure
- Run-Time Support
- Experimental Evaluation
- Conclusions

OpenCL Platform Model
From: http://www.viznet.ac.uk

OpenCL Execution Model
- An OpenCL application consists of two parts:
  - Main program that executes on the host
  - A number of kernels that execute on the compute devices
- Main constructs of the OpenCL execution model:
  - Kernels
  - Memory Buffers
  - Command Queues

OpenCL Kernel Execution “Geometry”
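The index arithmetic behind this geometry can be sketched in plain C. This is only an illustration, not GLOpenCL code: the struct and function names are hypothetical, a single dimension is shown, and a zero global offset is assumed. It relies on the OpenCL rule that, per dimension, the global ID equals the group ID times the local size plus the local ID.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical description of one dimension of an NDRange. */
typedef struct {
    size_t global_size; /* total number of work-items in this dimension */
    size_t local_size;  /* work-items per work-group in this dimension  */
} ndrange_1d;

/* Number of work-groups. OpenCL 1.x requires the global size to be an
 * exact multiple of the local size. */
static size_t num_groups(ndrange_1d r) {
    assert(r.local_size > 0 && r.global_size % r.local_size == 0);
    return r.global_size / r.local_size;
}

/* Global ID of a work-item, assuming a zero global offset. */
static size_t global_id(ndrange_1d r, size_t group_id, size_t local_id) {
    assert(group_id < num_groups(r) && local_id < r.local_size);
    return group_id * r.local_size + local_id;
}
```

For example, with 1024 work-items in work-groups of 64, work-item 5 of work-group 3 has global ID 3 * 64 + 5.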

Granularity Management
[Figure label: work-group]

Serialization of Work Items

OpenCL code:

    __kernel void Vadd(…) {
        int index = get_global_id(0);
        c[index] = a[index] + b[index];
    }

C code:

    #define get_item_gid(N) \
        (__global_id[N] + (N == 0 ? __k : (N == 1 ? __j : __i)))

    __kernel void Vadd(…) {
        int index;
        for (__i = 0; __i < get_local_size(2); __i++)
            for (__j = 0; __j < get_local_size(1); __j++)
                for (__k = 0; __k < get_local_size(0); __k++) {
                    index = get_item_gid(0);
                    c[index] = a[index] + b[index];
                }
    }
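A self-contained C sketch of the same idea follows. It is not GLOpenCL's generated code: the `wg_ctx` struct, the function names, and the way the group's base global ID is passed in are all assumptions made for illustration. The kernel body runs once per work-item inside a triple nested loop, so a whole work-group executes as a single function call.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-work-group context; the real data layout used by
 * GLOpenCL is not shown on the slide. */
typedef struct {
    size_t local_size[3]; /* work-items per work-group, per dimension */
    size_t group_base[3]; /* global ID of the group's first work-item */
} wg_ctx;

/* Vadd for one whole work-group: the original kernel body is executed
 * once per work-item by a triple nested loop (dimension 0 innermost). */
static void vadd_work_group(const wg_ctx *ctx, const float *a,
                            const float *b, float *c) {
    for (size_t i = 0; i < ctx->local_size[2]; i++)
        for (size_t j = 0; j < ctx->local_size[1]; j++)
            for (size_t k = 0; k < ctx->local_size[0]; k++) {
                /* Plays the role of get_global_id(0) in the kernel. */
                size_t index = ctx->group_base[0] + k;
                c[index] = a[index] + b[index];
            }
}

/* Run a 1-D NDRange by calling the serialized kernel once per work-group. */
static void vadd_ndrange(size_t global_size, size_t local_size,
                         const float *a, const float *b, float *c) {
    assert(local_size > 0 && global_size % local_size == 0);
    for (size_t base = 0; base < global_size; base += local_size) {
        wg_ctx ctx = { { local_size, 1, 1 }, { base, 0, 0 } };
        vadd_work_group(&ctx, a, b, c);
    }
}
```

Calling `vadd_ndrange(8, 4, a, b, c)` processes eight elements as two work-groups of four work-items each, with two function calls instead of eight.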

Elimination of Synchronization Operations

OpenCL code:

    Statements_block1
    barrier();
    Statements_block2

C code:

    triple_nested_loop {
        Statements_block1
        barrier();
        Statements_block2
    }

Elimination of Synchronization Operations

C code (barrier inside the loop):

    triple_nested_loop {
        Statements_block1
        barrier();
        Statements_block2
    }

C code (barrier eliminated):

    triple_nested_loop {
        Statements_block1
    }
    //barrier();
    Statements_block2
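The effect of splitting the work-item loop at a barrier can be sketched as follows. This is an illustrative example, not the authors' code. It assumes that the second statement block is also wrapped in its own work-item loop, which the slide text does not show explicitly, and it uses a single loop in place of `triple_nested_loop` for brevity. The function names and the rotate kernel are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Work-group-local rotate: out[l] = in[(l + 1) % n]. 'scratch' stands in
 * for local memory shared by the n work-items of one work-group. */

/* Naive serialization: both statement blocks stay in one work-item loop
 * and the barrier is ignored. Work-item l reads scratch[l + 1] before
 * work-item l + 1 has written it, so the result is wrong. */
static void rotate_single_loop(size_t n, const int *in, int *scratch,
                               int *out) {
    for (size_t l = 0; l < n; l++) {
        scratch[l] = in[l];             /* Statements_block1 */
        /* barrier(); -- ignored */
        out[l] = scratch[(l + 1) % n];  /* Statements_block2 */
    }
}

/* Loop fission at the barrier: every work-item finishes block 1 before
 * any work-item starts block 2, so no run-time barrier is needed. */
static void rotate_fissioned(size_t n, const int *in, int *scratch,
                             int *out) {
    for (size_t l = 0; l < n; l++)
        scratch[l] = in[l];             /* Statements_block1 */
    /* barrier(); -- redundant: the first loop has completed */
    for (size_t l = 0; l < n; l++)
        out[l] = scratch[(l + 1) % n];  /* Statements_block2 */
}
```

With a zero-initialized scratch buffer and input {10, 20, 30, 40}, the fissioned version produces the rotated array, while the single-loop version reads stale scratch values for every work-item except the last.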

Variable Privatization

C code (before privatization):

    triple_nested_loop {
        Statements_block1
        x = …;
    }
    Statements_block2
    … = … x …;

C code (after privatization):

    triple_nested_loop {
        Statements_block1
        x[i][j][k] = …;
    }
    Statements_block2
    … = … x[i][j][k] …;
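A minimal C sketch of why privatization is needed once the work-item loop has been split. The names, the `MAX_WG` bound, and the toy computation are assumptions for illustration. A one-dimensional `x[k]` stands in for the slide's `x[i][j][k]`, and the second statement block is assumed to run in its own work-item loop.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_WG 64 /* hypothetical upper bound on the work-group size */

/* Scalar x shared by all serialized work-items: after the first loop it
 * holds only the last work-item's value, so block 2 sees wrong data. */
static void scale_shared_scalar(size_t n, const int *in, int *out) {
    int x = 0;
    for (size_t k = 0; k < n; k++)
        x = 2 * in[k];      /* Statements_block1: x = ...;         */
    for (size_t k = 0; k < n; k++)
        out[k] = x + 1;     /* Statements_block2: ... = ... x ...; */
}

/* Privatized x: one slot per work-item, so each work-item's value
 * survives from the first loop to the second. */
static void scale_privatized(size_t n, const int *in, int *out) {
    int x[MAX_WG];
    assert(n <= MAX_WG);
    for (size_t k = 0; k < n; k++)
        x[k] = 2 * in[k];   /* x[k] = ...;         */
    for (size_t k = 0; k < n; k++)
        out[k] = x[k] + 1;  /* ... = ... x[k] ...; */
}
```

For input {1, 2, 3}, the privatized version yields {3, 5, 7}, whereas the shared scalar yields {7, 7, 7} because every work-item sees the value written by the last one.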

Run-Time System Library
- Provides run-time support for:
  - Work management
  - Manipulation of memory buffers
- The run-time system provides a unified design for both architectures

Architecture for Hardware-Controlled Cache Multicores
[Animated figure, shown over several slides. Threads: Main Thread, Helper Thread, Worker Threads 1…N. Queues: Command Queue, Ready Command Queue, Async. Copy Queue, Work Queues 1…N. Items shown in the queues: commands C1, C2, C3 and tasks T1…TN.]
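The slides do not say how a command is broken into the tasks T1…TN or how those tasks are assigned to the work queues. As one possible reading, the sketch below assumes that a kernel command is split into one task per worker thread, each covering a contiguous range of work-groups. The `task` struct, the `split_kernel` function, and the even-split policy are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical task: a contiguous range of work-groups of one kernel. */
typedef struct {
    size_t first_group; /* index of the first work-group in the task */
    size_t num_groups;  /* number of work-groups in the task         */
} task;

/* Fill tasks[0..n_workers-1] so that worker w receives tasks[w] in its
 * work queue. Task sizes differ by at most one work-group. */
static void split_kernel(size_t total_groups, size_t n_workers,
                         task *tasks) {
    assert(n_workers > 0);
    size_t base = total_groups / n_workers;
    size_t extra = total_groups % n_workers; /* first 'extra' tasks get one more */
    size_t next = 0;
    for (size_t w = 0; w < n_workers; w++) {
        tasks[w].first_group = next;
        tasks[w].num_groups = base + (w < extra ? 1 : 0);
        next += tasks[w].num_groups;
    }
}
```

With 10 work-groups and 4 workers, this policy yields tasks of 3, 3, 2, and 2 work-groups. A real run-time system might prefer finer-grained tasks to balance load, at the cost of more queue traffic.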

Architecture for Software-Controlled Cache Multicores
[Figure, divided into a Host Side and an Accelerator Side. Threads: Main Thread, Helper Thread, Worker Threads 1…N. Queues: Command Queue, Ready Command Queue, Work Queues 1…N. Links labeled "Work related req./replies" and "SW cache req./replies".]

Experimental Evaluation
- Evaluate the framework on two representative architectures:
  - Homogeneous, hardware-controlled cache memory processor: Intel’s E5520 i7 (Nehalem) processor
  - Heterogeneous, software-controlled cache memory processor: Cell B.E. processor
- Compare with vendor-provided, platform-customized implementations:
  - AMD Stream SDK for x86 architectures
  - IBM OpenCL SDK for the Cell B.E.

Vector Add – x86

Vector Add – Cell
- Average Performance Overhead: 27.5%

AES Encryption – x86

AES Encryption – Cell

BlackScholes – x86

BlackScholes – Cell

Conclusions
- GLOpenCL achieves performance comparable to, or better than, architecture-specific implementations
- The OpenCL standard leaves enough room for platform-specific optimizations
  - May yield significant performance improvements
- Future Work:
  - Reduce the effect of shared data structures
  - Memory access pattern analysis

Acknowledgements
This project is partially supported by the EC Marie Curie International Reintegration Grant (IRG) 223819