List Ranking on GPUs Sathish Vadhiyar

List Ranking on GPUs
– Linked list prefix computations: computation of prefix sums on the elements contained in a linked list
– Irregular memory accesses: the successor of each node of the linked list can be located anywhere in memory
– List ranking: special case of list prefix computations in which all the values are the identity, i.e., 1

List ranking
– L is a singly linked list
– Each node contains two fields: a data field, and a pointer to its successor
– Prefix sums: each data field is updated with the sum of the values of its predecessors and itself
– L is represented by an array X with fields X[i].prefix and X[i].succ

Sequential Algorithm
– Simple and effective; two passes
– Pass 1: identify the head node
– Pass 2: traverse from the head, following the successor nodes and accumulating the prefix sums in traversal order
– Works well in practice
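The two passes can be sketched in plain C. The node layout and helper names below are illustrative, not taken from the slides:

```c
#define MAXN 64              /* capacity of the toy list */

typedef struct {
    int prefix;              /* holds the value initially, the prefix sum after */
    int succ;                /* index of the successor, -1 at the tail */
} Node;

/* Pass 1: the head is the one index in 0..n-1 that never
   appears as a successor. */
int find_head(const Node *x, int n) {
    int seen[MAXN] = {0};
    for (int i = 0; i < n; i++)
        if (x[i].succ != -1) seen[x[i].succ] = 1;
    for (int i = 0; i < n; i++)
        if (!seen[i]) return i;
    return -1;
}

/* Pass 2: walk from the head, accumulating prefix sums
   in traversal order. */
void list_ranking(Node *x, int n) {
    int sum = 0;
    for (int i = find_head(x, n); i != -1; i = x[i].succ) {
        sum += x[i].prefix;
        x[i].prefix = sum;
    }
}
```

For pure list ranking every prefix field starts at 1, so after the walk each node holds its 1-based position in the list.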

Parallel Algorithm: Prefix computations on arrays
– Array X is partitioned into subarrays
– Local prefix sums of each subarray are calculated in parallel
– The prefix sums of the last elements of the subarrays are written to a separate array Y
– Prefix sums of the elements in Y are calculated
– The prefix sum of Y for the preceding blocks is added to each element of the corresponding block of X
– A divide-and-conquer strategy

Example
– X: 1, 2, 3 | 4, 5, 6 | 7, 8, 9
– Local prefix sums: 1, 3, 6 | 4, 9, 15 | 7, 15, 24
– Y (last elements): 6, 15, 24
– Prefix sums of Y: 6, 21, 45
– Result: 1, 3, 6, 10, 15, 21, 28, 36, 45
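The blocked scheme behind the example can be sketched sequentially in C; each loop over blocks stands in for a parallel step, and the function name is illustrative:

```c
#define MAXBLOCKS 16

/* Blocked prefix sums: local scans per block, a scan over the block
   totals, then a per-block offset add. On a parallel machine the
   per-block loops run concurrently. Assumes n <= MAXBLOCKS * bsize. */
void blocked_prefix(int *x, int n, int bsize) {
    int y[MAXBLOCKS];                    /* last prefix of each block */
    int nblocks = (n + bsize - 1) / bsize;

    /* Step 1: local prefix sums in each block (parallel). */
    for (int b = 0; b < nblocks; b++) {
        int start = b * bsize;
        int end = (start + bsize < n) ? start + bsize : n;
        for (int i = start + 1; i < end; i++)
            x[i] += x[i - 1];
        y[b] = x[end - 1];
    }

    /* Step 2: prefix sums over the block totals
       (done recursively in the general divide-and-conquer). */
    for (int b = 1; b < nblocks; b++)
        y[b] += y[b - 1];

    /* Step 3: add the preceding blocks' total into each block (parallel). */
    for (int b = 1; b < nblocks; b++) {
        int start = b * bsize;
        int end = (start + bsize < n) ? start + bsize : n;
        for (int i = start; i < end; i++)
            x[i] += y[b - 1];
    }
}
```

Running this on X = 1..9 with block size 3 reproduces the example above.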

Prefix computation on a list
– The previous strategy cannot be applied directly
– Dividing the array X that represents the list leads to subarrays, each of which can contain many sublist fragments
– Head nodes would have to be calculated for each of them

Parallel List Ranking (Wyllie's algorithm)
– Involves repeated pointer jumping
– A process or thread is assigned to each element of the list
– The successor pointer of each element is repeatedly updated so that it jumps over its successor, until it reaches the end of the list
– As each thread traverses and updates the successor, the ranks are updated
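A sequential C simulation of the pointer-jumping rounds, with double buffering standing in for the synchronous parallel update (on the GPU, each i would be one thread). This common variant computes each node's rank measured from the tail; the position from the head is then n - rank + 1:

```c
#define MAXN 64

/* Wyllie's algorithm: every node repeatedly adds its successor's rank
   and jumps over it, until all pointers reach the end of the list.
   rank[] starts at each node's value (1 for pure list ranking);
   succ[] holds successor indices, -1 at the tail.
   Takes O(log n) rounds, O(n log n) total work. */
void wyllie(int *rank, int *succ, int n) {
    int nrank[MAXN], nsucc[MAXN];
    for (int active = 1; active; ) {
        active = 0;
        for (int i = 0; i < n; i++) {    /* one synchronous round */
            if (succ[i] != -1) {
                nrank[i] = rank[i] + rank[succ[i]];
                nsucc[i] = succ[succ[i]];
                active = 1;
            } else {
                nrank[i] = rank[i];
                nsucc[i] = -1;
            }
        }
        for (int i = 0; i < n; i++) {    /* "barrier": publish the round */
            rank[i] = nrank[i];
            succ[i] = nsucc[i];
        }
    }
}
```

The explicit publish step at the end of each round is exactly what forces the synchronization (and, on CUDA, the repeated kernel launches) discussed on the next slide.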

Parallel List Ranking (Wyllie's algorithm)
– Leads to heavy synchronization among CUDA threads and many kernel invocations

Parallel List Ranking (Helman and JaJa)
– Randomly select s nodes as splitters; the head node is also a splitter
– Form s sublists: in each sublist, start from a splitter as the head node and traverse until another splitter is reached
– Form prefix sums in each sublist
– Form another list consisting of only the splitters, in the order they are traversed; the value of each entry is the prefix sum calculated in the respective sublist
– Calculate prefix sums for this splitter list
– Add these sums to the values of the sublists
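The whole scheme can be sketched sequentially in C. The per-sublist loops are the part that runs in parallel on the GPU; splitters are chosen deterministically here (every n/s-th index, plus the head) instead of randomly so the sketch is reproducible, and all names are illustrative:

```c
#define MAXN 64

/* Helman-JaJa splitter algorithm (sequential sketch).
   val[]: node values; succ[]: successor indices, -1 at the tail;
   head: index of the head node; s: requested number of splitters
   (assumes 1 <= s <= n); out[]: receives the prefix sums. */
void helman_jaja(const int *val, const int *succ, int n, int head,
                 int s, int *out) {
    int is_split[MAXN] = {0}, id_of[MAXN], sub_of[MAXN];
    int sub_head[MAXN], sub_total[MAXN], sub_next[MAXN], sub_off[MAXN];
    int nsub = 0;

    /* Choose roughly s splitters; the head is always one. */
    is_split[head] = 1;
    for (int i = 0; i < n; i += n / s) is_split[i] = 1;
    for (int i = 0; i < n; i++)
        if (is_split[i]) { id_of[i] = nsub; sub_head[nsub++] = i; }

    /* Local prefix sums per sublist (one GPU thread per sublist). */
    for (int k = 0; k < nsub; k++) {
        int sum = 0, i = sub_head[k];
        do {
            sum += val[i];
            out[i] = sum;
            sub_of[i] = k;
            i = succ[i];
        } while (i != -1 && !is_split[i]);
        sub_total[k] = sum;
        sub_next[k] = i;          /* next splitter in list order, or -1 */
    }

    /* Prefix sums over the splitter list, in traversal order (CPU). */
    int k = id_of[head], offset = 0;
    while (k != -1) {
        sub_off[k] = offset;
        offset += sub_total[k];
        k = (sub_next[k] == -1) ? -1 : id_of[sub_next[k]];
    }

    /* Add each sublist's offset back to its elements (parallel). */
    for (int i = 0; i < n; i++)
        out[i] += sub_off[sub_of[i]];
}
```

Unlike Wyllie's algorithm, each node is touched a constant number of times, so the total work is O(n).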

Parallel List Ranking on GPUs: Steps
– Step 1: Compute the location of the head of the list
– Each index between 0 and n-1, except the head node, occurs exactly once among the successor values
– Hence head node = n(n-1)/2 – SUM_SUCC, where SUM_SUCC is the sum of the successor values
– Can be computed on the GPU using a parallel reduction
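Step 1 in C, with a sequential loop standing in for the GPU reduction. One detail the formula glosses over: the tail's -1 successor marker must be excluded from the sum (the function name is illustrative):

```c
/* Every index except the head appears exactly once as a successor, so
   head = 0 + 1 + ... + (n-1) - SUM_SUCC = n(n-1)/2 - SUM_SUCC,
   where SUM_SUCC skips the tail's -1 marker. */
int head_by_reduction(const int *succ, int n) {
    long sum = 0;                 /* a parallel reduction on the GPU */
    for (int i = 0; i < n; i++)
        if (succ[i] != -1)
            sum += succ[i];
    return (int)((long)n * (n - 1) / 2 - sum);
}
```

This finds the head in one pass with no extra memory, unlike the marking scan of the sequential algorithm.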

Parallel List Ranking on GPUs: Steps
– Step 2: Select s random nodes to split the list into s random sublists
– For every subarray of X of size n/s, select a random location as a splitter
– Highly data parallel; the selections can be made independently of each other

Parallel List Ranking on GPUs: Steps
– Step 3: Using the standard sequential algorithm, compute the prefix sums of each sublist separately
– The most computationally demanding step
– The s sublists are allocated equally among the CUDA blocks, and then equally among the threads in a block
– Each thread computes the prefix sums of each of its sublists, and copies the prefix value of the last element of sublist i to Sublist[i]

Parallel List Ranking on GPUs: Steps
– Step 4: Compute the prefix sums of the splitters, where the successor of a splitter is the next splitter encountered when traversing the list
– This list is small, hence the computation can be done on the CPU

Parallel List Ranking on GPUs: Steps
– Step 5: Update the prefix sums computed in step 3 using the splitter prefix sums of step 4
– This can be done using coalesced memory accesses

Choosing s
– Large values of s increase the chance that threads deal with equal numbers of nodes
– However, too large a value incurs overhead in sublist creation and aggregation

References
– Fast and Scalable List Ranking on the GPU. ICS.
– Optimization of Linked List Prefix Computations on Multithreaded GPUs Using CUDA. IPDPS 2010.