On Dynamic Load Balancing on Graphics Processors Daniel Cederman and Philippas Tsigas Chalmers University of Technology.

Overview Motivation Methods Experimental evaluation Conclusion

The problem setting: work is divided into tasks; scheduling can be done offline or online.

Static Load Balancing Processor

Static Load Balancing Processor Task

Static Load Balancing Processor Task Subtask

Dynamic Load Balancing Processor Task Subtask

Task sharing (flow chart): Check condition: work done? If not, try to get a task from the task set. Got a task? Perform it, adding any newly created tasks to the task set; when the task is done, check the condition again. No task acquired? Retry.
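The flow chart above amounts to a simple worker loop. A minimal CPU sketch in C++ (the `TaskSet` type, its locking, and the subtask-splitting rule are illustrative assumptions, not the authors' implementation):

```cpp
#include <deque>
#include <mutex>
#include <optional>

// Hypothetical shared task set guarded by a lock (blocking variant).
struct TaskSet {
    std::mutex m;
    std::deque<int> tasks;
    void add(int t) { std::lock_guard<std::mutex> g(m); tasks.push_back(t); }
    std::optional<int> tryGet() {
        std::lock_guard<std::mutex> g(m);
        if (tasks.empty()) return std::nullopt;
        int t = tasks.front(); tasks.pop_front(); return t;
    }
};

// One worker following the flow chart: while tasks remain, acquire one,
// perform it, and add any subtasks it spawns back to the task set.
int worker(TaskSet& ts) {
    int performed = 0;
    while (true) {
        auto t = ts.tryGet();
        if (!t) break;          // "work done?" -- here: task set empty
        ++performed;            // "perform task"
        if (*t > 1) {           // task spawns two subtasks ("add task")
            ts.add(*t / 2);
            ts.add(*t - *t / 2);
        }
    }
    return performed;
}
```

Seeding the set with a single task of size 4 makes the worker process it and all recursively spawned subtasks.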

System Model CUDA: global memory with gather and scatter, Compare-And-Swap, Fetch-And-Inc. A set of multiprocessors, each executing a thread block, with a maximum number of concurrent thread blocks, all sharing the global memory.
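The two atomic primitives the slide lists, Compare-And-Swap and Fetch-And-Inc, can be mimicked on the CPU with `std::atomic`; this sketch only illustrates their semantics, with the corresponding CUDA intrinsics noted in comments:

```cpp
#include <atomic>

// CPU analogues of the two GPU primitives: Fetch-And-Inc (CUDA atomicAdd)
// and Compare-And-Swap (CUDA atomicCAS). Both return the value the
// location held before the operation.
int fetch_and_inc(std::atomic<int>& x) { return x.fetch_add(1); }

int compare_and_swap(std::atomic<int>& x, int expected, int desired) {
    x.compare_exchange_strong(expected, desired); // on failure, `expected`
    return expected;                              // is set to the old value
}
```

Fetch-And-Inc hands each caller a unique previous value (useful for claiming array slots), and a CAS took effect exactly when its return value equals the expected value.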

Synchronization Blocking: uses mutual exclusion to allow only one process at a time to access the object. Lock-free: multiple processes can access the object concurrently; at least one operation in a set of concurrent operations finishes in a finite number of its own steps. Wait-free: multiple processes can access the object concurrently; every operation finishes in a finite number of its own steps.
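The lock-free progress guarantee can be illustrated with a classic Treiber-style stack (a textbook example, not taken from the paper): a thread's CAS fails only because another thread's CAS succeeded, so some operation always completes. The sketch ignores memory reclamation and the ABA problem:

```cpp
#include <atomic>

struct Node { int value; Node* next; };

// Lock-free stack: operations retry a CAS on the head pointer. A failed
// CAS means another push/pop succeeded in the meantime, which is exactly
// the lock-free system-wide progress guarantee defined above.
struct LockFreeStack {
    std::atomic<Node*> head{nullptr};
    void push(int v) {
        Node* n = new Node{v, head.load()};
        while (!head.compare_exchange_weak(n->next, n)) {
            // CAS failed: n->next now holds the current head; retry.
        }
    }
    bool pop(int* out) {
        Node* h = head.load();
        while (h && !head.compare_exchange_weak(h, h->next)) {
            // CAS failed: h now holds the current head; retry.
        }
        if (!h) return false;
        *out = h->value;
        delete h;
        return true;
    }
};
```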

Load Balancing Methods Blocking Task Queue Non-blocking Task Queue Task Stealing Static Task List

Blocking queue TB 1 TB 2 TB n Free Head Tail

Blocking queue T1 TB 1 TB 2 TB n Free Head Tail

Non-blocking Queue T1 T2 T3 T4 TB 1 TB 2 TB 1 TB 2 TB n Head Tail Reference: P. Tsigas and Y. Zhang, A simple, fast and scalable non-blocking concurrent FIFO queue for shared memory multiprocessor systems [SPAA '01]
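A heavily simplified sketch of an array-based FIFO with atomic head/tail indices; it conveys the flavor of the referenced queue (a cyclic array indexed by atomic counters) but is NOT the Tsigas-Zhang algorithm, which resolves full multi-producer/multi-consumer races with CAS:

```cpp
#include <array>
#include <atomic>

// Bounded cyclic-array FIFO. Tasks live in a fixed array; `head` and
// `tail` only ever grow and are reduced modulo N on access. This sketch
// is only safe for one producer and one consumer.
template <int N>
struct ArrayQueue {
    std::array<int, N> buf{};
    std::atomic<int> head{0}, tail{0};
    bool enqueue(int v) {
        int t = tail.load();
        if (t - head.load() == N) return false;  // full
        buf[t % N] = v;
        tail.store(t + 1);
        return true;
    }
    bool dequeue(int* out) {
        int h = head.load();
        if (h == tail.load()) return false;      // empty
        *out = buf[h % N];
        head.store(h + 1);
        return true;
    }
};
```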

Non-blocking Queue T1 T2 T3 T4 TB 1 TB 2 TB 1 TB 2 TB n Head Tail

Non-blocking Queue T1 T2 T3 T4 T5 TB 1 TB 2 TB 1 TB 2 TB n Head Tail

Task stealing T1 T3 T2 TB 1 TB 2 TB n Reference: Arora N. S., Blumofe R. D., Plaxton C. G., Thread Scheduling for Multiprogrammed Multiprocessors [SPAA '98]

Task stealing T1 T4 T3 T2 TB 1 TB 2 TB n

Task stealing T1 T4 T5 T3 T2 TB 1 TB 2 TB n

Task stealing T1 T4 T3 T2 TB 1 TB 2 TB n

Task stealing T1 T3 T2 TB 1 TB 2 TB n

Task stealing T3 T2 TB 1 TB 2 TB n

Task stealing T2 TB 1 TB 2 TB n
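The stealing scheme above can be sketched as a simplified deque in the spirit of Arora-Blumofe-Plaxton: the owning thread block pushes and pops at one end, while other blocks steal from the opposite end with CAS. This ignores wraparound, shrinking, and the memory-ordering subtleties of the published algorithm:

```cpp
#include <array>
#include <atomic>

// Simplified work-stealing deque: the owner works LIFO at `bottom`,
// thieves take FIFO from `top`, where a CAS arbitrates racing thieves.
template <int N>
struct StealDeque {
    std::array<int, N> tasks{};
    std::atomic<int> top{0};     // thieves steal here
    std::atomic<int> bottom{0};  // owner works here
    void push(int t) { tasks[bottom.load() % N] = t; bottom.fetch_add(1); }
    bool pop(int* out) {         // owner: newest task first
        int b = bottom.load();
        if (b == top.load()) return false;
        bottom.store(b - 1);
        *out = tasks[(b - 1) % N];
        return true;
    }
    bool steal(int* out) {       // thief: oldest task first
        int t = top.load();
        if (t >= bottom.load()) return false;
        int task = tasks[t % N];
        if (!top.compare_exchange_strong(t, t + 1)) return false;
        *out = task;
        return true;
    }
};
```

Stealing the oldest task tends to move large, coarse-grained subtrees of work, which keeps steals rare.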

Static Task List T1 T2 T3 T4 In

Static Task List T1 T2 T3 T4 In TB 1 TB 2 TB 3 TB 4

Static Task List T1 T2 T3 T4 In Out TB 1 TB 2 TB 3 TB 4

Static Task List T1 T2 T3 T4 T5 In Out TB 1 TB 2 TB 3 TB 4

Static Task List T1 T2 T3 T4 T5 T6 In Out TB 1 TB 2 TB 3 TB 4

Static Task List T1 T2 T3 T4 T5 T6 T7 In Out TB 1 TB 2 TB 3 TB 4
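In the static task list, blocks read their fixed portion of the in-array and append newly created subtasks to the out-array; the arrays are swapped between iterations, so the only synchronization needed is a fetch-and-inc on the out-counter. A CPU sketch (function and variable names are illustrative, and the subtask-splitting rule is an assumption):

```cpp
#include <atomic>
#include <vector>

// One iteration of a static-task-list step: tasks are read from `in`
// (statically partitioned among workers in the real scheme), and any
// subtasks are appended to `out` via a single atomic counter.
void process(const std::vector<int>& in, std::vector<int>& out,
             std::atomic<int>& outCount) {
    for (int t : in) {
        if (t > 1) {  // task spawns two subtasks
            out[outCount.fetch_add(1)] = t / 2;
            out[outCount.fetch_add(1)] = t - t / 2;
        }
    }
}
```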

Octree Partitioning Bandwidth bound

Four-in-a-row Computation intensive

Graphics Processors 8800GT: 14 multiprocessors, 57 GB/s bandwidth. 9600GT: 8 multiprocessors, 57 GB/s bandwidth.

Blocking Queue – Octree/9600GT

Blocking Queue – Octree/8800GT

Blocking Queue – Four-in-a-row

Non-blocking Queue – Octree/9600GT

Non-blocking Queue – Octree/8800GT

Non-blocking Queue - Four-in-a-row

Task stealing – Octree/9600GT

Task stealing – Octree/8800GT

Task stealing – Four-in-a-row

Static List

Octree Comparison

Previous work Korch M., Rauber T., A comparison of task pools for dynamic load balancing of irregular algorithms, Concurrency and Computation: Practice & Experience, 16, 2003. Heirich A., Arvo J., A competitive analysis of load balancing strategies for parallel ray tracing, Journal of Supercomputing, 12, 1998. Foley T., Sugerman J., KD-tree acceleration structures for a GPU raytracer, Graphics Hardware 2005.

Conclusion Synchronization plays a significant role in dynamic load balancing. Lock-free data structures and synchronization scale well and look promising for general-purpose programming on GPUs. Locks perform poorly. The introduction of operations such as CAS and FAA in the new GPUs is a welcome addition. Work stealing can outperform static load balancing.

Thank you!