An OpenCL Framework for Heterogeneous Multicores with Local Memory (PACT 2010)

Presentation transcript:

An OpenCL Framework for Heterogeneous Multicores with Local Memory PACT 2010 Jaejin Lee, Jungwon Kim, Sangmin Seo, Seungkyun Kim, Jungho Park, Honggyu Kim, Thanh Tuan Dao, Yongjin Cho, Sung Jong Seo, Seung Hak Lee, Seung Mo Cho, Hyo Jung Song, Sang-Bum Suh, and Jong-Deok Choi School of Computer Science and Engineering, Seoul National University, Seoul, Korea Samsung Electronics Co., Nongseo-dong, Giheung-gu, Yongin-si, Gyeonggi-do, Korea Presenter: Jen-Jung Cheng

Outline Introduction Background – OpenCL platform Design and Implementation – OpenCL runtime – Work-item coalescing – Web-based variable expansion – Preload-poststore buffering Evaluation Conclusion

Introduction(1/2) [Figure: the target architecture. A general-purpose core (GPC) with L1 and L2 caches and multiple accelerator cores (APCs), each with its own local store, connected to main memory through an interconnect bus.]

Introduction(2/2) Two major challenges in the design and implementation of the OpenCL framework: – Implementing hundreds of virtual PEs with a single accelerator core and making them efficient – Overcoming the limited size and incoherency of the local store

OpenCL platform(1/2) The OpenCL platform model

OpenCL platform(2/2) OpenCL platform: a host processor, compute devices, compute units, and processing elements. Abstract index space: global ID, work-group ID, and local ID. Memory regions: private, local, constant, and global. Synchronization: work-group barrier and command-queue barrier.
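The three IDs of the abstract index space are related by a simple identity. A minimal C sketch, assuming a 1-D range (the helper name `index_space_ids` is ours, not an OpenCL API call):

```c
/* For a 1-D index space:
   global_id = group_id * local_size + local_id */
static void index_space_ids(int global_id, int local_size,
                            int *group_id, int *local_id) {
    *group_id = global_id / local_size;  /* which work-group       */
    *local_id = global_id % local_size;  /* position inside group  */
}
```

With a work-group size of 64, global ID 130 falls in work-group 2 at local position 2.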

OpenCL runtime(1/3) Mapping platform components to the target architecture

OpenCL runtime(2/3) [Figure: the command scheduler and the command executor, both running on the GPC. The OpenCL host thread enqueues commands into command queues; the command scheduler (an OpenCL runtime thread) orders them through an event queue and a DAG that captures execution ordering, then issues ready commands into the ready queue; the command executor assigns their work-groups to compute units (CUs) on the device, tracking occupancy in a CU status array.]

OpenCL runtime(3/3) The runtime implements a software-managed cache in each APC's local store, which caches the contents of the global and constant memory. To guarantee OpenCL memory consistency for memory objects shared between commands, the command executor flushes the software-managed caches whenever it dequeues a command from the ready queue or removes an event object from the DAG after the associated command has completed.
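The consistency argument above can be made concrete with a toy software-managed cache. A minimal sketch, with all names ours and `memcpy` standing in for DMA from global memory into the local store: a hit returns possibly stale data, and a flush invalidates every line so the next access refetches.

```c
#include <string.h>

#define LINES 4
#define LINE_WORDS 8

static float global_mem[64];                    /* stands in for global memory */
static float cache_data[LINES][LINE_WORDS];     /* local-store cache lines     */
static int   cache_tag[LINES] = { -1, -1, -1, -1 }; /* -1 = invalid line       */

/* Invalidate all lines; done by the command executor between commands. */
static void cache_flush(void) {
    for (int i = 0; i < LINES; i++) cache_tag[i] = -1;
}

static float cache_read(int idx) {
    int line = idx / LINE_WORDS;
    int slot = line % LINES;                    /* direct-mapped placement */
    if (cache_tag[slot] != line) {              /* miss: "DMA" the line in */
        memcpy(cache_data[slot], &global_mem[line * LINE_WORDS],
               LINE_WORDS * sizeof(float));
        cache_tag[slot] = line;
    }
    return cache_data[slot][idx % LINE_WORDS];
}
```

A write to `global_mem` made by another command is invisible to a cached reader until `cache_flush()` runs, which is exactly why the executor flushes at command boundaries.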

Work-item coalescing(1/3) Executing work-items on a CU by switching from one work-item to another incurs significant overhead. When a kernel and its callee functions contain no barrier, any execution ordering between two statements from different work-items in the same work-group satisfies the OpenCL semantics. A work-item coalescing loop (WCL) therefore iterates over the index space of a single work-group, executing the kernel body once per work-item.
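The idea above can be sketched in plain C (names are ours, not the translator's): instead of running the work-items of a group with a context switch between each, one triple loop over the group's local index space runs the kernel body inline for each (i, j, k).

```c
#define LX 4
#define LY 2
#define LZ 2

static int local_size[3] = { LX, LY, LZ };

/* Work-item coalescing loop for a single work-group: the kernel
   body runs once per local index instead of once per thread. */
static void vec_add_coalesced(float *a, float *b, float *c) {
    for (int k = 0; k < local_size[2]; k++)
        for (int j = 0; j < local_size[1]; j++)
            for (int i = 0; i < local_size[0]; i++) {
                /* flattened local ID plays the role of
                   get_global_id(0) for a one-group range */
                int id = (k * local_size[1] + j) * local_size[0] + i;
                c[id] = a[id] + b[id];
            }
}
```

All LX*LY*LZ work-items execute sequentially on one core, which is legal here because the kernel body contains no barrier.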

Work-item coalescing(2/3) Int __i, __ j, __k; __kernel void vec_add (__global float *a, __global float *b, __global float *c) { int id; for( __k = 0; __k < __local_size[2]; __k++ ) { for( __ j = 0; __ j < __local_size[1]; __ j++ ) { for( __ i = 0; __ i < __local_size[0]; __ i++ ) { id = get_global_id(0); c[id] = a[id] + b[id]; } Int __i, __ j, __k; __kernel void vec_add (__global float *a, __global float *b, __global float *c) { int id; for( __k = 0; __k < __local_size[2]; __k++ ) { for( __ j = 0; __ j < __local_size[1]; __ j++ ) { for( __ i = 0; __ i < __local_size[0]; __ i++ ) { id = get_global_id(0); c[id] = a[id] + b[id]; } __kernel void vec_add( __global float *a, __global float *b, __global float *c) { int id; id = get_global_id(0); c[id] = a[id] + b[id]; } __kernel void vec_add( __global float *a, __global float *b, __global float *c) { int id; id = get_global_id(0); c[id] = a[id] + b[id]; } OpenCL C source-to-source translator

Work-item coalescing(3/3) S1 barrier(); S2 S1 barrier(); S2 [S1’ barrier(); [S2’ [S1’ barrier(); [S2’ if (c) { S1 barrier(); S2 } if (c) { S1 barrier(); S2 } [t = C’; if (t) { [S1’ barrier(); [S2’ } [t = C’; if (t) { [S1’ barrier(); [S2’ } while (c) { S1 barrier(); S2 } while (c) { S1 barrier(); S2 } while (1) { [t = C’; if (!t) break; [S1’ barrier(); [S2’ } while (1) { [t = C’; if (!t) break; [S1’ barrier(); [S2’ }

Web-based variable expansion(1/5) A kernel code region that needs to be enclosed with a WCL is called a work-item coalescing region (WCR). A work-item private variable that is defined in one WCR and used in another needs a separate location for different work-items. A du-chain for a variable connects a definition of the variable to all uses reached by the definition. A web for a variable is all du-chains of the variable that contain a common use of the variable.

Web-based variable expansion(2/5) … = x… = x … = x… = x x = … … = x… = x … = x… = x t1 = C1 x = … Entry if (t1) while (1) t2 = C2 if (t2) x = … barrier () … = x … = x… = x … = x… = x x = … … = x… = x … = x… = x Exit WCR

Web-based variable expansion(3/5) … = x… = x … = x… = x x = … … = x… = x … = x… = x t1 = C1 x = … Entry if (t1) while (1) t2 = C2 if (t2) x = … barrier () … = x … = x… = x … = x… = x x = … … = x… = x … = x… = x Exit WCR Identifying du-chains

Web-based variable expansion(4/5) … = x… = x … = x… = x x = … … = x… = x … = x… = x t1 = C1 x = … Entry if (t1) while (1) t2 = C2 if (t2) x = … barrier () … = x … = x… = x … = x… = x x = … … = x… = x … = x… = x Exit WCR Identifying webs

Web-based variable expansion(5/5) =x1[][][] x = … … = x… = x … = x… = x t1 = C1 x1[][][]= x1=malloc() if (t1) while (1) t2 = C2 if (t2) x1[][][]= barrier () =x1[][][] x = … … = x… = x … = x… = x Free(x1) WCR Exit Entry After variable expansion

Preload-poststore buffering(1/4) Preload-poststore buffering gathers the DMA transfers for array accesses into bulk transfers and minimizes the time spent waiting for them to complete by overlapping DMA with computation.

Preload-poststore buffering(2/4)

```c
/* Original coalesced loop nest */
for (k = 0; k < ls[2]; k++) {
  for (j = 0; j < ls[1]; j++) {
    for (i = 0; i < ls[0]; i++) {
      if (i < 100)
        a[j][i] = c[j][b[i]];
      c[j][b[i]] = a[j][3*i+1] + a[j][i+1024];
    }
  }
}

/* After preload-poststore buffering */
for (k = 0; k < ls[2]; k++) {
  for (j = 0; j < ls[1]; j++) {
    PRELOAD(buf_b, &b[0], ls[0]);
    PRELOAD(buf_a1, &a[j][0], ls[0]+1024);
    for (i = 0; i < ls[0]; i++)
      PRELOAD(buf_a2[i], &a[j][3*i+1]);
    WAITFOR(buf_b);
    for (i = 0; i < ls[0]; i++)
      PRELOAD(buf_c[i], &c[j][buf_b[i]]);
    for (i = 0; i < ls[0]; i++) {
      if (i < 100)
        buf_a1[i] = buf_c[i];
      buf_c[i] = buf_a2[i] + buf_a1[i+1024];
    }
    POSTSTORE(buf_a1, &a[j][0], ls[0]+1024);
    for (i = 0; i < ls[0]; i++)
      POSTSTORE(buf_c[i], &c[j][buf_b[i]]);
  }
}
```

Preload-poststore buffering(3/4) Buffering candidates are array references whose index expression in a loop L has the form: – c*i + d, where c and d are loop-invariant with respect to L – c*x + d, where x is itself an array reference and c and d are loop-invariant with respect to L The addresses a candidate touches are summarized as a [lower bound : upper bound : stride] triple; for example, the reference a[j][3*i+1] (buffered into buf_a2) with i ranging over 0 .. ls[0]-1 covers [1 : 3*ls[0]-2 : 3].
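The [lower : upper : stride] summary for the affine case c*i + d is a one-line computation. A sketch under the assumption c > 0 (type and function names are ours):

```c
typedef struct { int lower, upper, stride; } Range;

/* Addresses touched by index expression c*i + d over i = 0..n-1,
   assuming c > 0 so lower <= upper. */
static Range affine_range(int c, int d, int n) {
    Range r = { d, c * (n - 1) + d, c };
    return r;
}
```

For the slide's buf_a2 example, 3*i + 1 over i = 0 .. ls[0]-1 yields lower 1, upper 3*ls[0]-2, stride 3.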

Preload-poststore buffering(4/4) Condition for two buffered references to share a single buffer: between them there is – a loop-independent flow dependence (read-after-write), or – a loop-independent output dependence (write-after-write)
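The overall preload/compute/poststore pattern can be sketched in plain C, with `memcpy` standing in for asynchronous DMA and all names ours: the whole array slice the loop touches is fetched into a local buffer once, the loop computes out of the buffer, and the modified buffer is stored back in one transfer instead of one per element.

```c
#include <string.h>

#define LS 8  /* local (work-group) size */

static void scale_buffered(float *a /* "global" array */, float s) {
    float buf_a[LS];                         /* local-store buffer */
    memcpy(buf_a, a, LS * sizeof(float));    /* PRELOAD  (one bulk "DMA") */
    for (int i = 0; i < LS; i++)             /* compute from the buffer   */
        buf_a[i] *= s;
    memcpy(a, buf_a, LS * sizeof(float));    /* POSTSTORE (one bulk "DMA") */
}
```

Because the read and the write of `a[i]` have a loop-independent flow dependence, the single buffer `buf_a` serves both, matching the single-buffer condition above; a real implementation would additionally overlap the transfers with computation via double buffering.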

Evaluation(1/5) Experimental setup: – an IBM QS22 Cell blade server with two 3.2GHz PowerXCell 8i processors – each Cell BE processor consists of a single Power Processor Element (PPE) and eight Synergistic Processor Elements (SPEs) – each SPE has 256KB of local store – Fedora Linux 9

Evaluation(2/5) Applications used

Evaluation(3/5) [Figure: speedup of the benchmark applications.]

Evaluation(4/5) Comparison with the IBM OpenCL framework for Cell BE.

Evaluation(5/5) The speedup of the OpenCL applications on two Intel Xeon X5660 hexa-core processors (CPU) and an NVIDIA Tesla C1060 GPU (GPU).

Conclusion This paper presents the design and implementation of an OpenCL runtime and OpenCL C source-to-source translator that target heterogeneous accelerator multicore architectures with local memory.