Sorting Algorithms: Topic Overview


Sorting Algorithms: Topic Overview
- Issues in Sorting on Parallel Computers
- Sorting Networks
- Bitonic Sort
- Bubble Sort and its Variants
- Quicksort
- Bucket and Sample Sort

Sorting: Overview
- Sorting is one of the most commonly used and well-studied kernels; many algorithms require sorted data for easier manipulation.
- Sorting algorithms can be classified as internal or external, and as comparison-based (built on compare-exchange) or noncomparison-based (built on elements' properties such as their binary representation or their distribution).
- Lower-bound complexities for sorting n numbers: comparison-based Θ(n log n); noncomparison-based Θ(n).
- We focus here on comparison-based sorting algorithms.

Issues in Sorting on Parallel Computers
- Where are the input and output lists stored? We assume that both are distributed across the processors.
- Input specification: each processor holds n/p elements, and there is an ordering of the processors.
- Output specification: each processor gets n/p consecutive elements of the final sorted array; which "chunk" it gets is determined by the processor ordering.
- Variation: allowing an unequal number of elements on output. In general this is not a good idea, since a shift may then be required to restore the equal-size distribution.

Parallel Compare-Exchange Operation
- Comparison becomes more complicated when the elements reside on different processes.
- One element per process: the two processes send their elements to each other, and each retains the appropriate one (the lower-ranked process keeps the smaller element, the higher-ranked process the larger, for an ascending sort).

Parallel Compare-Split Operation
- The compare-exchange communication cost is ts + tw (assuming bidirectional channels). With a single element per message this delivers poor performance. Why? Each message carries only one element, so the startup cost ts dominates and very little work is done per communication.
- With more than one element per process we use a compare-split operation. Assume each of two processes Pi and Pj holds n/p elements. After the compare-split operation, the smaller n/p elements are at process Pi and the larger n/p elements at Pj, where i < j.
- The compare-split communication cost is ts + tw·n/p, assuming that the two partial lists were initially sorted.
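
To make the compare-split step concrete, here is a minimal Python sketch (the function name and the use of sorted() instead of a true linear merge are my own simplifications, not from the slides): each of two processes holds a sorted block; after merging, one keeps the lower half and the other the upper half.

    def compare_split(low_block, high_block):
        """Compare-split of two sorted blocks.

        Returns (smaller_half, larger_half): the lower-ranked process keeps
        the n/p smallest elements, the higher-ranked process keeps the n/p
        largest. Both inputs are assumed to be sorted.
        """
        merged = sorted(low_block + high_block)   # merge of the two sorted lists
        half = len(low_block)
        return merged[:half], merged[half:]

    # Example: Pi holds [1, 6, 8, 11], Pj holds [2, 9, 10, 12]
    print(compare_split([1, 6, 8, 11], [2, 9, 10, 12]))
    # ([1, 2, 6, 8], [9, 10, 11, 12])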

Parallel Compare-Split Operation (figure)

Sorting Networks
- Key idea: perform many comparisons in parallel.
- Key elements:
  - Comparators: two-input, two-output devices that take two elements on the input wires and output them in sorted order on the output wires.
  - Network architecture: the arrangement of the comparators into interconnected comparator columns, similar to multi-stage networks.
- Many sorting networks have been developed; we study the bitonic sorting network, which has Θ(log² n) columns of comparators.

Sorting Networks: Comparators
- A comparator is a device with two inputs x and y and two outputs x' and y'.
- For an increasing comparator, x' = min{x, y} and y' = max{x, y}; for a decreasing comparator, x' = max{x, y} and y' = min{x, y}.
- We denote an increasing comparator by ⊕ and a decreasing comparator by ⊖.
- A sorting network consists of a series of columns, each with comparators connected in parallel. The depth of a network is the number of columns it contains.

Sorting Networks: Comparators (figure)

Sorting Networks: Architecture (figure)

Bitonic Sort
- Bitonic sorting depends on rearranging a bitonic sequence into a sorted sequence.
- It uses a bitonic sorting network to sort n elements in Θ(log² n) time.
- A sequence <a0, a1, ..., an-1> is bitonic if either (1) there exists an index i, 0 ≤ i ≤ n - 1, such that <a0, ..., ai> is monotonically increasing and <ai+1, ..., an-1> is monotonically decreasing, or (2) there exists a cyclic shift of indices so that condition (1) is satisfied.
- Example bitonic sequences: <1, 2, 4, 7, 6, 0>; and <8, 9, 2, 1, 0, 4>, because it is a cyclic shift of <0, 4, 8, 9, 2, 1>.

Sorting a Bitonic Sequence
- Let s = <a0, a1, ..., an-1> be a bitonic sequence such that a0 ≤ a1 ≤ ··· ≤ an/2-1 and an/2 ≥ an/2+1 ≥ ··· ≥ an-1.
- Consider the following subsequences of s:
  s1 = <min{a0, an/2}, min{a1, an/2+1}, ..., min{an/2-1, an-1}>
  s2 = <max{a0, an/2}, max{a1, an/2+1}, ..., max{an/2-1, an-1}>
- Dividing a bitonic sequence into two subsequences as above is called a bitonic split.
- Note that s1 and s2 are both bitonic and each element of s1 is less than every element of s2.
- Applying the bitonic split recursively on s1 and s2 yields the sorted sequence; sorting a bitonic sequence this way is called a bitonic merge.
- How many bitonic splits are required to sort a bitonic sequence of length n? log n, as the sketch below illustrates.
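
The following is a minimal Python sketch of the bitonic split and merge described above (written for illustration; the names and structure are my own, not from the slides). It assumes n is a power of two and that the input is bitonic.

    def bitonic_merge(seq, ascending=True):
        """Sort a bitonic sequence by recursive bitonic splits.

        Assumes len(seq) is a power of two and seq is bitonic.
        """
        n = len(seq)
        if n == 1:
            return list(seq)
        half = n // 2
        s1 = list(seq[:half])
        s2 = list(seq[half:])
        # Bitonic split: pair element i with element i + n/2.
        for i in range(half):
            if (seq[i] > seq[i + half]) == ascending:
                s1[i], s2[i] = seq[i + half], seq[i]
        # s1 and s2 are both bitonic, and (for an ascending merge) every element
        # of s1 is <= every element of s2, so recursing on each half sorts seq.
        return bitonic_merge(s1, ascending) + bitonic_merge(s2, ascending)

    print(bitonic_merge([3, 5, 8, 9, 7, 4, 2, 1]))  # [1, 2, 3, 4, 5, 7, 8, 9]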

Example: Bitonic Sort (figure)

Bitonic Merging Network (BMN)
- A BMN is a network of comparators used to implement the bitonic merge algorithm.
- A BMN contains log n columns, each containing n/2 comparators; each column performs one step of the bitonic merge.
- The BMN takes a bitonic sequence as input and outputs the sequence in sorted order.
- We denote a bitonic merging network with n inputs by ⊕BM[n]. Replacing the ⊕ comparators by ⊖ comparators results in a decreasing output sequence; such a network is denoted by ⊖BM[n].

Example: Using a Bitonic Merging Network (figure)

Sorting Unordered Elements
- An arbitrary sequence can be sorted by repeatedly merging bitonic sequences of increasing length, ultimately producing a single bitonic sequence from the given sequence.
- A sequence of length 2 is always a bitonic sequence. Why? Its two elements are trivially monotonically increasing followed by decreasing (or vice versa).
- A bitonic sequence of length 4 can be built by sorting the first two elements using ⊕BM[2] and the next two using ⊖BM[2].
- This process can be repeated to generate larger and larger bitonic sequences, as the sketch below shows.
- The algorithm embodied in this process is called bitonic sort, and the network is called a bitonic sorting network (BSN).
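
A compact Python sketch of the full bitonic sort built from the merge above (again illustrative only; it reuses the bitonic_merge function defined earlier and assumes the input length is a power of two):

    def bitonic_sort(seq, ascending=True):
        """Recursively build bitonic sequences of increasing length, then merge.

        Assumes len(seq) is a power of two.
        """
        if len(seq) <= 1:
            return list(seq)
        half = len(seq) // 2
        # Sort the first half ascending and the second half descending;
        # their concatenation is a bitonic sequence.
        first = bitonic_sort(seq[:half], True)
        second = bitonic_sort(seq[half:], False)
        return bitonic_merge(first + second, ascending)

    print(bitonic_sort([10, 20, 5, 9, 3, 8, 12, 14, 90, 0, 60, 40, 23, 35, 95, 18]))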

Sorting Unordered Elements Using a BSN (figure)

Details of Sorting Using a BSN (figure)

Complexity of Bitonic Sort
- How many stages are there in a BSN for sorting n elements? log n stages.
- What is the depth (number of steps/columns) of the network? Stage i consists of i columns of n/2 comparators, so the depth is 1 + 2 + ··· + log n = (log n)(log n + 1)/2 = Θ(log² n).
- How many comparators are there? (n/2)·(log n)(log n + 1)/2 = Θ(n log² n).
- Thus, a serial implementation of the network would have complexity Θ(n log² n).

Bitonic Sort on Parallel Computers
- A key aspect of bitonic sort: it is communication intensive.
- A proper mapping must take into account the topology of the underlying interconnection network; we discuss mappings to hypercube and mesh topologies.
- Mapping requirements for good performance:
  - Map wires that perform compare-exchange onto neighboring processes.
  - Map wires that perform compare-exchange more frequently onto neighboring processes.
- How are wires paired for compare-exchange in each stage? Which wires communicate most frequently? Wires whose labels differ in the ith least-significant bit perform compare-exchange (log n - i + 1) times.

Mapping Bitonic Sort to Hypercubes
- Case 1: one item per processor. What is the comparator? How do the wires get mapped?
- Recall that processes whose labels differ in only one bit are neighbors in a hypercube. This provides a direct mapping of wires to processors: all communication is nearest-neighbor!
- Pairing processes in a d-dimensional hypercube (p = 2^d), considering the steps during the last stage of the algorithm:
  - Step 1: processes differing in the dth bit exchange elements along the dth dimension.
  - Step 2: compare-exchange takes place along the (d-1)th dimension.
  - Step i: compare-exchange takes place along the (d-(i-1))th dimension.
- The sketch below shows how a process finds its partner in each of these steps.
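
A tiny illustrative sketch (my own naming, not from the slides) of how each hypercube process can compute its compare-exchange partner during the last stage: the partner in step i is obtained by flipping bit (d - i) of the process label.

    def last_stage_partners(rank, d):
        """Partners of `rank` in steps 1..d of the last bitonic-sort stage
        on a d-dimensional hypercube (p = 2**d processes)."""
        # Step i communicates along dimension d - (i - 1), i.e. bit d - i.
        return [rank ^ (1 << (d - i)) for i in range(1, d + 1)]

    print(last_stage_partners(5, 3))  # process 101: partners 001, 111, 100 -> [1, 7, 4]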

Mapping Bitonic Sort to Hypercubes (figure)

Communication Pattern of Bitonic Sort on Hypercubes (figure)

Bitonic Sort Algorithm on Hypercubes (algorithm shown as a figure)
- Parallel runtime with one element per process: Tp = Θ(log² n).

Mapping Bitonic Sort to Meshes
- The connectivity of a mesh is lower than that of a hypercube, so we must expect some overhead in this mapping.
- We therefore map wires so that the most frequent compare-exchange operations occur between neighboring processes.
- Some mapping options: row-major mapping, row-major snakelike mapping, and row-major shuffled mapping. Each process is labeled by the wire that is mapped onto it.
- Advantages of the row-major shuffled mapping:
  - Processes that perform compare-exchange operations reside on square subsections of the mesh.
  - Wires that differ in the ith least-significant bit are mapped onto mesh processes that are 2^⌊(i-1)/2⌋ communication links apart.

Bitonic Sort on Meshes: Example Mappings (figure)

Mapping Bitonic Sort to Meshes (figure)

Parallel Time of Bitonic Sort on Meshes
- Row-major shuffled mapping: wires that differ at the ith least-significant bit are mapped onto mesh processes that are 2^⌊(i-1)/2⌋ communication links apart.
- The total amount of communication performed by each process is Σ (over i = 1 to log n) Σ (over j = 1 to i) 2^⌊(j-1)/2⌋ ≈ 7√n = Θ(√n).
- The total computation performed by each process is Θ(log² n).
- The parallel runtime is therefore Tp = Θ(log² n) + Θ(√n) = Θ(√n), dominated by the communication term.

Block of Elements Per Processor
- Each process is assigned a block of n/p elements.
- The first step is a local sort of the local block.
- Each subsequent compare-exchange operation is replaced by a compare-split operation.
- We can effectively view the bitonic network as having (1 + log p)(log p)/2 steps.

Block of Elements Per Processor: Hypercube
- Initially the processes sort their n/p elements (using merge sort) in time Θ((n/p) log(n/p)) and then perform Θ(log² p) compare-split steps.
- The parallel run time of this formulation is
  Tp = Θ((n/p) log(n/p)) [local sort] + Θ((n/p) log² p) [comparisons] + Θ((n/p) log² p) [communication].

Block of Elements Per Processor: Mesh
- The parallel runtime in this case is given by
  Tp = Θ((n/p) log(n/p)) [local sort] + Θ((n/p) log² p) [comparisons] + Θ(n/√p) [communication].

Bubble Sort and its Variants
- We now turn to traditional sorting algorithms and investigate whether n processes can be employed to sort a sequence in Θ(log n) time.
- Recall that the sequential bubble sort algorithm compares and exchanges adjacent elements; its complexity is Θ(n²).
- Bubble sort is difficult to parallelize since the algorithm has no concurrency.

Sequential Bubble Sort Algorithm (figure)
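
The slide above showed the algorithm as an image; a minimal Python version of sequential bubble sort (for reference, not taken from the slide) is:

    def bubble_sort(a):
        """In-place bubble sort: repeatedly compare-exchange adjacent elements."""
        n = len(a)
        for i in range(n - 1, 0, -1):        # after pass i, a[i] holds its final value
            for j in range(i):
                if a[j] > a[j + 1]:
                    a[j], a[j + 1] = a[j + 1], a[j]
        return a

    print(bubble_sort([5, 3, 8, 1, 2]))  # [1, 2, 3, 5, 8]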

Odd-Even Transposition Sort: A Bubble Sort Variant
- Odd-even sort alternates between two phases, called the odd and even phases.
- Odd phase: elements with odd indices are compare-exchanged with their right neighbors; the even phase does the same for elements with even indices.
- The sequence is sorted after n phases (n even), each of which requires n/2 compare-exchange operations.
- Thus, its sequential complexity is Θ(n²).

Sequential Odd-Even Sort Algorithm (figure)
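
Again, the algorithm on the slide was an image; here is an illustrative Python sketch of sequential odd-even transposition sort (the indexing convention is mine):

    def odd_even_sort(a):
        """Odd-even transposition sort: n phases alternating odd and even pairs."""
        n = len(a)
        for phase in range(n):
            # Odd phase compares pairs (1,2), (3,4), ...; even phase compares (0,1), (2,3), ...
            start = 1 if phase % 2 == 0 else 0
            for i in range(start, n - 1, 2):
                if a[i] > a[i + 1]:
                    a[i], a[i + 1] = a[i + 1], a[i]
        return a

    print(odd_even_sort([3, 2, 3, 8, 5, 6, 4, 1]))  # [1, 2, 3, 3, 4, 5, 6, 8]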

Example: Odd-Even Sort Algorithm (figure)

Parallel Odd-Even Transposition
- Consider the one-item-per-process case: there are n iterations, and in each iteration every process does one compare-exchange.
- The parallel run time of this formulation is Θ(n).
- This is cost-optimal with respect to the Θ(n²) base serial algorithm, but not with respect to the optimal Θ(n log n) serial sorting algorithm.

Parallel Odd-Even Transposition (figure)

Parallel Odd-Even Transposition
- Consider a block of n/p elements per process. The first step is a local sort; the processes then execute p phases in which each compare-exchange operation is replaced by a compare-split operation.
- The parallel run time of this formulation is
  Tp = Θ((n/p) log(n/p)) [local sort] + Θ(n) [comparisons] + Θ(n) [communication].
- A small simulation of this scheme appears below.
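
A hedged Python simulation of the block version (purely illustrative; it models the p processes as lists and reuses the compare_split function from the earlier sketch rather than real message passing):

    def parallel_odd_even(blocks):
        """Simulate block odd-even transposition sort over p 'processes'.

        `blocks` is a list of p equal-sized lists. Each process first sorts its
        block locally, then p phases of compare-split are performed between
        alternating pairs of neighbors.
        """
        p = len(blocks)
        blocks = [sorted(b) for b in blocks]          # local sort
        for phase in range(p):
            start = 0 if phase % 2 == 0 else 1        # even phase: pairs (0,1), (2,3), ...
            for i in range(start, p - 1, 2):
                blocks[i], blocks[i + 1] = compare_split(blocks[i], blocks[i + 1])
        return blocks

    print(parallel_odd_even([[13, 7, 12], [10, 2, 8], [5, 1, 11], [9, 3, 6]]))
    # [[1, 2, 3], [5, 6, 7], [8, 9, 10], [11, 12, 13]]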

Shellsort
- Shellsort can provide a substantial improvement over odd-even sort by moving elements long distances toward their final positions.
- Consider sorting n elements using p = 2^d processes.
- Assumptions: processes are ordered in a logical one-dimensional array, and this ordering defines the global ordering of the sorted sequence; each process is assigned n/p elements.
- The algorithm has two phases:
  - During the first phase, processes that are far away from each other in the array compare-split their elements.
  - During the second phase, the algorithm switches to an odd-even transposition sort, performing odd-even phases only as long as the blocks on the processes are changing.
- Because elements are moved closer to their final positions in the first phase, the number of odd-even phases performed may be much smaller than p.

Parallel Shellsort
- Initially, each process sorts its block of n/p elements internally.
- Each process is then paired with its corresponding process in the reverse order of the array: process Pi, where i < p/2, is paired with process Pp-i-1, and a compare-split operation is performed.
- The processes are split into two groups of size p/2 each, and the procedure is repeated within each group.
- This continues for d = log p steps; the pairing rule is sketched below.
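
An illustrative sketch (my own naming) of the first-phase pairing: in each of the d steps the process array is halved into groups, and within each group a process is paired with its mirror-image rank.

    def shellsort_partners(rank, p):
        """First-phase compare-split partners of `rank` for p = 2**d processes.

        In step k the processes form groups of size p / 2**k, and each process
        is paired with its mirror image within its group.
        """
        partners = []
        group = p
        while group > 1:
            base = (rank // group) * group                       # first rank of this group
            partners.append(base + (group - 1) - (rank - base))  # mirror within the group
            group //= 2
        return partners

    print(shellsort_partners(2, 8))  # partners of P2 among 8 processes: [5, 1, 3]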

A Phase in Parallel Shellsort (figure)

Complexity of Parallel Shellsort
- In the first phase, each process performs d = log p compare-split operations.
- In the second phase, l odd and even phases are performed, each requiring time Θ(n/p).
- The parallel run time of the algorithm is
  Tp = Θ((n/p) log(n/p)) [local sort] + Θ((n/p) log p) [first phase] + Θ(l·n/p) [second phase].

Quicksort
- Quicksort is one of the most common sequential sorting algorithms due to its simplicity, low overhead, and optimal average complexity.
- It is a divide-and-conquer algorithm:
  - Pivot selection.
  - Divide step: partition the sequence into two subsequences based on the pivot.
  - Conquer step: sort the two subsequences recursively using quicksort.

Sequential Quicksort (figure)
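
The sequential algorithm on the slide was an image; a short, illustrative Python version (not the slide's exact pseudocode) is:

    def quicksort(a, lo=0, hi=None):
        """In-place quicksort: pick a pivot, partition, recurse on both halves."""
        if hi is None:
            hi = len(a) - 1
        if lo >= hi:
            return a
        pivot = a[hi]                       # simple last-element pivot choice
        i = lo
        for j in range(lo, hi):             # partition around the pivot
            if a[j] <= pivot:
                a[i], a[j] = a[j], a[i]
                i += 1
        a[i], a[hi] = a[hi], a[i]
        quicksort(a, lo, i - 1)
        quicksort(a, i + 1, hi)
        return a

    print(quicksort([3, 2, 1, 5, 8, 4, 3, 7]))  # [1, 2, 3, 3, 4, 5, 7, 8]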

Tracing Quicksort (figure)

Complexity of Quicksort
- The performance of quicksort depends critically on the quality of the pivot, and there are many methods for selecting it.
- A poorly selected pivot leads to the worst-case Θ(n²) complexity.
- A well-selected pivot leads to the optimal Θ(n log n) complexity.

Parallelizing Quicksort
- Recursive decomposition provides a natural way of parallelizing quicksort: execute on a single process initially, then assign one of the subproblems to another process at each recursive call.
- Limitations of this parallel formulation: it uses n processes to sort n items, yet performs the partitioning step serially.
- Can we use n processes to partition a list of length n around a pivot in O(1) time?
- We present three parallel formulations that perform the partitioning step in parallel: a PRAM formulation, a shared-address-space (SAS) formulation, and a message-passing (M-P) formulation.

PRAM Formulation
- We assume a CRCW (concurrent read, concurrent write) PRAM in which concurrent writes result in an arbitrary write succeeding.
- The formulation works by creating pools of processes. Initially every process is assigned to the same pool and holds one element.
- Each process attempts to write its element to a common location for the pool; the value that lands there becomes the pool's pivot.
- Each process then reads back that location. If the value read back is greater than the process's own value, the process assigns itself to the "left" pool; otherwise it assigns itself to the "right" pool.
- Each pool performs this operation recursively, as the simulation sketch below illustrates.
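
A sequential Python simulation of this CRCW pool splitting (illustrative only; a real PRAM performs each split in Θ(1) parallel time, and the "arbitrary" winning write is modeled here with random.choice):

    import random

    def split_pool(pool):
        """Simulate one CRCW step: an arbitrary element wins the concurrent
        write and becomes the pivot; every other element joins the left or
        right pool by comparing itself against the value it reads back."""
        pivot = random.choice(pool)          # an arbitrary write succeeds
        rest = list(pool)
        rest.remove(pivot)                   # the winning process exits with the pivot
        left = [x for x in rest if x < pivot]
        right = [x for x in rest if x >= pivot]
        return left, pivot, right

    def pram_quicksort(pool):
        """Recurse on the pools; an inorder walk of the pivots gives the sorted list."""
        if not pool:
            return []
        left, pivot, right = split_pool(pool)
        return pram_quicksort(left) + [pivot] + pram_quicksort(right)

    print(pram_quicksort([33, 21, 13, 54, 82, 33, 40, 72]))
    # [13, 21, 33, 33, 40, 54, 72, 82]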

PRAM Formulation: Illustration (figure)

PRAM Formulation: An Example (figure)

PRAM Formulation: Algorithm's Complexity
- This formulation interprets quicksort as having two steps:
  - Constructing a binary tree of pivot elements: a process exits when its element becomes a pivot, and the algorithm continues until n pivots have been selected.
  - Obtaining the sorted sequence by performing an inorder traversal of the tree.
- During each iteration, a level of the tree is constructed in time Θ(1). Thus the average complexity of the tree-building algorithm is Θ(log n), since this is the average height of the tree.

SAS Formulation: Partitioning & Merging
- Consider a list of size n equally divided across p processors.
- A pivot is selected by one of the processors and made known to all processors.
- Each processor partitions its local list into two, say Li and Ui, based on the selected pivot.
- All of the Li lists are merged into L and all of the Ui lists are merged into U, separately.
- The set of processors is partitioned into two groups (in proportion to the sizes of L and U), and the process is applied recursively to each of the lists.
- When does the recursion end? When each group contains a single processor, which then sorts its list locally.

SAS Formulation: An Example (figure)

SAS Formulation: Merging Local Lists
- The only thing we have not described is the global reorganization (merging) of the local lists to form L and U.
- The problem is one of determining the right location for each element in the merged list.
- Each processor computes the number of its local elements less than and greater than the pivot, then two sum-scans (prefix sums) determine the starting locations of its elements in the merged L and U lists.
- Once a processor knows the starting locations, it can write its elements safely; the sketch below shows the offset computation.
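
An illustrative Python sketch of the offset computation (the exclusive prefix sums stand in for the parallel sum-scans; variable names are mine):

    def merge_offsets(local_counts):
        """Given (|Li|, |Ui|) for each processor, return the starting offsets of
        each processor's Li block in L and Ui block in U via exclusive prefix sums."""
        l_offsets, u_offsets = [], []
        l_total = u_total = 0
        for l_count, u_count in local_counts:
            l_offsets.append(l_total)
            u_offsets.append(u_total)
            l_total += l_count
            u_total += u_count
        return l_offsets, u_offsets, l_total   # l_total = |L|, where U starts globally

    # Three processors with (|Li|, |Ui|) = (2, 3), (4, 1), (1, 4)
    print(merge_offsets([(2, 3), (4, 1), (1, 4)]))
    # ([0, 2, 6], [0, 3, 4], 7)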

SAS Formulation: Example Merging Local Lists (figure)

SAS Formulation: Algorithm's Complexity
- The parallel time depends on the split and merge time and on the quality of the pivot. The latter is an issue independent of parallelism, so we focus on the former, assuming ideal pivot selection.
- The algorithm executes in four steps:
  (i) determine and broadcast the pivot: Θ(log p) time;
  (ii) locally rearrange the array assigned to each process: Θ(n/p) time;
  (iii) determine the locations in the globally rearranged array that the local elements will go to: Θ(log p) time;
  (iv) perform the global rearrangement: Θ(n/p) time.
- The overall complexity of splitting an n-element array is therefore Θ(n/p) + Θ(log p).

SAS Formulation: Algorithm's Complexity (continued)
- The process recurses until there are p lists, at which point the lists are sorted locally.
- Therefore, the total parallel time is
  Tp = Θ((n/p) log(n/p)) [local sort] + Θ((n/p) log p) [array splits] + Θ(log² p) [pivot broadcasts and scans].

Message-Passing Formulation: Partitioning & Pairing
- A simple message-passing formulation is based on recursive halving of the machine: each process in the lower half of a p-process ensemble is paired with a corresponding process in the upper half.
- A designated processor selects and broadcasts the pivot.
- Each processor splits its local list into two lists, one with elements less than the pivot (Li) and the other with elements greater than the pivot (Ui).
- A processor in the lower half of the machine sends its list Ui to its paired processor in the upper half, which in turn sends back its list Li.
- After this step, all elements less than the pivot are in the lower half of the machine and all elements greater than the pivot are in the upper half.

Message-Passing Formulation: Algorithm's Complexity
- The above process recurses until each process has its own local list, which it then sorts locally.
- The time for a single reorganization is Θ(log p) for broadcasting the pivot element, Θ(n/p) for splitting the locally assigned portion of the array, and Θ(n/p) for the exchange and local reorganization.
- Note that this time is identical to that of the corresponding shared-address-space formulation.

Bucket and Sample Sort
- In bucket sort, the range [a, b] of the input numbers is divided into m equal-sized intervals, called buckets, and each element is placed in its appropriate bucket.
- If the numbers are uniformly distributed over the range, the buckets can be expected to hold roughly the same number of elements.
- The elements in each bucket are then sorted locally.
- The run time of this algorithm is Θ(n log(n/m)); a small sketch follows.
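
A minimal sequential Python sketch of bucket sort over a known range [a, b] (illustrative, with the number of buckets m chosen by the caller):

    def bucket_sort(values, a, b, m):
        """Bucket sort of `values` drawn from the range [a, b] using m buckets."""
        width = (b - a) / m
        buckets = [[] for _ in range(m)]
        for x in values:
            idx = min(int((x - a) / width), m - 1)   # clamp x == b into the last bucket
            buckets[idx].append(x)
        result = []
        for bucket in buckets:
            result.extend(sorted(bucket))            # local sort of each bucket
        return result

    print(bucket_sort([0.78, 0.17, 0.39, 0.26, 0.72, 0.94, 0.21, 0.12], 0.0, 1.0, 4))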

Parallel Bucket Sort
- Parallelizing bucket sort is relatively simple: select m = p, so that each process is responsible for one range of values.
- Each process runs through its local list and assigns each of its elements to the appropriate process.
- The elements are sent to the destination processes using a single all-to-all personalized communication.
- Each process then sorts all the elements it receives.

Parallel Bucket and Sample Sort
- The critical aspect of the above algorithm is assigning ranges to processes; this is done by suitable splitter selection.
- The splitter selection method divides the n elements into m blocks of size n/m each and sorts each block using quicksort.
- From each sorted block it chooses m - 1 evenly spaced elements; the m(m - 1) elements selected from all the blocks form the sample used to determine the buckets, as sketched below.
- This scheme guarantees that the number of elements ending up in each bucket is less than 2n/m.
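
A hedged Python sketch of this splitter (sample) selection, returning m - 1 global splitters from the m(m - 1)-element sample (illustrative; details such as tie handling and exact spacing are simplified):

    def select_splitters(values, m):
        """Sample-sort splitter selection: split into m blocks, sort each,
        take m-1 evenly spaced elements per block, sort the combined sample,
        and pick m-1 evenly spaced global splitters from it."""
        n = len(values)
        block_size = n // m
        sample = []
        for b in range(m):
            block = sorted(values[b * block_size:(b + 1) * block_size])
            # m-1 roughly evenly spaced elements from this sorted block
            sample.extend(block[(j + 1) * block_size // m - 1] for j in range(m - 1))
        sample.sort()
        # m-1 evenly spaced splitters from the m(m-1)-element sample
        return [sample[(j + 1) * (m - 1) - 1] for j in range(m - 1)]

    print(select_splitters([22, 7, 13, 18, 2, 17, 1, 14, 20, 6, 10, 24,
                            15, 9, 21, 3, 16, 19, 23, 4, 11, 12, 5, 8], m=3))
    # [5, 12]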

Parallel Bucket and Sample Sort (figure)

Parallel Bucket and Sample Sort
- The splitter selection scheme can itself be parallelized.
- Each process generates its p - 1 local splitters in parallel.
- All processes share their splitters using a single all-to-all broadcast operation.
- Each process sorts the p(p - 1) elements it receives and selects p - 1 uniformly spaced splitters from them.

Parallel Bucket and Sample Sort: Analysis
- The internal sort of n/p elements requires time Θ((n/p) log(n/p)), and the selection of p - 1 sample elements requires time Θ(p).
- The time for the all-to-all broadcast is Θ(p²), the time to internally sort the p(p - 1) sample elements is Θ(p² log p), and selecting p - 1 evenly spaced splitters takes time Θ(p).
- Each process can insert these p - 1 splitters into its local sorted block of size n/p by performing p - 1 binary searches, in time Θ(p log(n/p)).
- The time for the reorganization of the elements is Θ(n/p).

Parallel Bucket and Sample Sort: Analysis (continued)
- Summing these terms, the total time is
  Tp = Θ((n/p) log(n/p)) [local sort] + Θ(p² log p) [sorting the sample] + Θ(p log(n/p)) [block partitioning] + Θ(n/p) [data movement].