
1 Sorting Algorithms: Topic Overview
Issues in Sorting on Parallel Computers
Sorting Networks
Bitonic Sort
Bubble Sort and its Variants
Quicksort
Bucket and Sample Sort

2 Sorting: Overview
Sorting is one of the most commonly used and well-studied kernels; many algorithms require sorted data for easier manipulation.
Sorting algorithms can be internal or external, and either:
- comparison-based: built on the compare-exchange operation, or
- noncomparison-based: based on properties of the elements, such as their binary representation or their distribution.
Lower-bound complexity classes to sort n numbers:
- Comparison-based: Θ(n log n).
- Noncomparison-based: Θ(n).
We focus here on comparison-based sorting algorithms.

3 Issues in Sorting on Parallel Computers
Where are the input and output lists stored? We assume that the input and output lists are distributed.
Input specification:
- Each processor has n/p elements.
- An ordering of the processors is given.
Output specification:
- Each processor will get n/p consecutive elements of the final sorted array.
- The "chunk" is determined by the processor ordering.
Variations:
- Unequal number of elements on output. In general this is not a good idea, since it may require a shift to obtain the equal-size distribution.

4 Parallel Compare-Exchange Operation
Comparison becomes more complicated when the elements reside on different processes.
One element per process: each process sends its element to the other; Pi keeps the smaller element and Pj keeps the larger one.

5 Parallel Compare-Split Operation
The compare-exchange communication cost is ts + tw (assuming bidirectional channels), but with one element per process this delivers poor performance. Why?
With more than one element per process, we use the compare-split operation:
- Assume each of two processes holds n/p elements.
- After the compare-split operation, the smaller n/p elements are at process Pi and the larger n/p elements are at Pj, where i < j.
- The compare-split communication cost is ts + tw·n/p, assuming the two partial lists were initially sorted.
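As an illustration, here is a minimal sequential sketch of the compare-split step, assuming both blocks are already sorted. In a real message-passing program each process would first exchange its block with its partner and then keep only one half:

```python
def compare_split(my_block, partner_block, keep_small):
    """Merge two sorted blocks and keep either the smaller or the larger half.

    my_block, partner_block: sorted lists of equal length n/p.
    keep_small: True on process Pi (i < j), False on process Pj.
    """
    merged = sorted(my_block + partner_block)   # both inputs are sorted, so a
                                                # linear merge would also work
    half = len(my_block)
    return merged[:half] if keep_small else merged[half:]

# Example: Pi keeps the smaller half, Pj the larger half.
pi_block = [1, 6, 8, 11]
pj_block = [2, 5, 9, 13]
print(compare_split(pi_block, pj_block, keep_small=True))    # [1, 2, 5, 6]
print(compare_split(pi_block, pj_block, keep_small=False))   # [8, 9, 11, 13]
```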

6 Parallel Compare-Split Operation

7 Sorting Networks
Key idea: perform many comparisons in parallel.
Key elements:
- Comparators: two-input, two-output devices that take two elements on the input wires and output them in sorted order on the output wires.
- Network architecture: the arrangement of the comparators into interconnected comparator columns, similar to multi-stage networks.
Many sorting networks have been developed. We study the bitonic sorting network, which has Θ(log² n) columns of comparators.

8 Sorting Networks: Comparators
A comparator is a device with two inputs x and y and two outputs x' and y'.
- For an increasing comparator, x' = min{x, y} and y' = max{x, y}.
- For a decreasing comparator, x' = max{x, y} and y' = min{x, y}.
We denote an increasing comparator by ⊕ and a decreasing comparator by ⊖.
A sorting network consists of a series of columns, each with comparators connected in parallel. The depth of a network is the number of columns it contains.
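A minimal sketch of the two comparator types (my own illustrative helper functions, not from the slides):

```python
def increasing_comparator(x, y):
    # x' = min{x, y}, y' = max{x, y}
    return (x, y) if x <= y else (y, x)

def decreasing_comparator(x, y):
    # x' = max{x, y}, y' = min{x, y}
    return (y, x) if x <= y else (x, y)

print(increasing_comparator(7, 3))   # (3, 7)
print(decreasing_comparator(7, 3))   # (7, 3)
```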

9 Sorting Networks: Comparators

10 Sorting Networks: Architecture

11 Bitonic Sort
Bitonic sorting depends on rearranging a bitonic sequence into a sorted sequence, and uses a bitonic sorting network to sort n elements in Θ(log² n) time.
A sequence <a0, a1, ..., an-1> is bitonic if either
(1) there exists an index i, 0 ≤ i ≤ n - 1, such that <a0, ..., ai> is monotonically increasing and <ai+1, ..., an-1> is monotonically decreasing, or
(2) there exists a cyclic shift of indices so that (1) is satisfied.
Example bitonic sequences:
- 1, 2, 4, 7, 6, 0
- 8, 9, 2, 1, 0, 4, because it is a cyclic shift of 0, 4, 8, 9, 2, 1.

12 Sorting a Bitonic Sequence
Let s=a0,a1,…,an-1 be a bitonic sequence such that a0 ≤ a1 ≤ ··· ≤ an/2-1 and an/2 ≥ an/2+1 ≥ ··· ≥ an-1. Consider the following subsequences of s: s1 = min{a0,an/2},min{a1,an/2+1},…,min{an/2-1,an-1} s2 = max{a0,an/2},max{a1,an/2+1},…,max{an/2-1,an-1} Dividing a bitonic sequence into two subsequences as above is called a bitonic split Note that s1 and s2 are both bitonic and each element of s1 is less than every element in s2. Apply bitonic split recursively on s1 and s2 to get the sorted sequence. Sorting a bitonic sequence this way is called bitonic merge How many bitonic splits are required to sort a bitonic sequence? Sahalu Junaidu ICS 573: High-Performance Computing

13 Example: Bitonic Sort

14 Bitonic Merging Network (BMN)
A BMN is a network of comparators used to implement the bitonic merge algorithm.
- A BMN contains log n columns, each containing n/2 comparators; each column performs one step of the bitonic merge.
- A BMN takes a bitonic sequence as input and outputs the sequence in sorted order.
We denote a bitonic merging network with n inputs by ⊕BM[n]. Replacing the ⊕ comparators with ⊖ comparators results in a decreasing output sequence; such a network is denoted by ⊖BM[n].

15 Example: Using a Bitonic Merging Network

16 Sorting Unordered Elements
An unordered sequence can be sorted by repeatedly merging bitonic sequences of increasing length, eventually producing a single bitonic sequence from the given sequence.
- Any sequence of length 2 is a bitonic sequence. Why?
- A bitonic sequence of length 4 can be built by sorting the first two elements using ⊕BM[2] and the next two using ⊖BM[2].
- This process is repeated to generate larger and larger bitonic sequences.
The algorithm embodied in this process is called Bitonic Sort, and the network is called a Bitonic Sorting Network (BSN).
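A minimal recursive sketch of the full bitonic sort, built on the bitonic_merge function shown earlier (lengths restricted to powers of two):

```python
def bitonic_sort(s, increasing=True):
    """Sort an arbitrary sequence (length a power of two) with bitonic merges."""
    n = len(s)
    if n <= 1:
        return list(s)
    half = n // 2
    # Build a bitonic sequence: ascending first half, descending second half.
    first = bitonic_sort(s[:half], increasing=True)
    second = bitonic_sort(s[half:], increasing=False)
    # Then merge the bitonic sequence into sorted order.
    return bitonic_merge(first + second, increasing)

print(bitonic_sort([10, 20, 5, 9, 3, 8, 12, 14, 90, 0, 60, 40, 23, 35, 95, 18]))
```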

17 Sorting Unordered Elements Using a BSN

18 Details of Sorting Using a BSN

19 Complexity of Bitonic Sort
How many stages are there in a BSN for sorting n elements? There are log n stages.
What is the depth (number of steps/columns) of the network? Stage i consists of i columns of n/2 comparators, so the depth is 1 + 2 + ··· + log n = (log n)(log n + 1)/2.
How many comparators are there? Each column has n/2 comparators, so the network contains n(log n)(log n + 1)/4 = Θ(n log² n) comparators in total.
Thus, a serial implementation of the network would have complexity Θ(n log² n).
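The depth and comparator counts, written out as implied by the stage structure above:

```latex
d(n) \;=\; \sum_{i=1}^{\log n} i \;=\; \frac{(\log n)(\log n + 1)}{2} \;=\; \Theta(\log^2 n),
\qquad
\text{comparators} \;=\; \frac{n}{2}\, d(n) \;=\; \Theta(n \log^2 n).
```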

20 Bitonic Sort on Parallel Computers
A key aspect of bitonic sort is that it is communication intensive; a proper mapping must take into account the topology of the underlying interconnection network. We discuss mappings to hypercube and mesh topologies.
Mapping requirements for good performance:
- Map wires that perform compare-exchange onto neighboring processes.
- In particular, map wires that perform compare-exchange more frequently onto neighboring processes.
How are wires paired to do compare-exchange in each stage, and which wires communicate most frequently? Wires whose labels differ in the ith least-significant bit perform compare-exchange log n – i + 1 times.

21 Mapping Bitonic Sort to Hypercubes
Case 1: one item per processor. What is the comparator, and how do the wires get mapped?
Recall that processes whose labels differ in only one bit are neighbors in a hypercube. This provides a direct mapping of wires to processors, and all communication is nearest-neighbor!
Pairing processes in a d-dimensional hypercube (p = 2^d), considering the steps during the last stage of the algorithm (partner computation sketched below):
- Step 1: processes differing in the dth bit exchange elements along the dth dimension.
- Step 2: compare-exchange takes place along the (d-1)th dimension.
- Step i: compare-exchange takes place along the (d-(i-1))th dimension.
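A minimal sketch of how a process could compute its partner in each step (hypothetical helper; in a message-passing program the partner would then be used for a send/receive of the element):

```python
def hypercube_partners(rank, d):
    """Yield (step, partner) pairs for the last stage of bitonic sort
    on a d-dimensional hypercube: dimension d first, then d-1, ..., 1."""
    for step in range(1, d + 1):
        bit = d - step            # dth dimension in step 1, (d-1)th in step 2, ...
        partner = rank ^ (1 << bit)
        yield step, partner

# Example: process 5 (binary 101) in a 3-dimensional hypercube (p = 8).
for step, partner in hypercube_partners(5, 3):
    print(f"step {step}: compare-exchange with process {partner}")
# step 1: partner 1 (bit 2), step 2: partner 7 (bit 1), step 3: partner 4 (bit 0)
```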

22 Mapping Bitonic Sort to Hypercubes

23 Communication Pattern of Bitonic Sort on Hypercubes

24 Bitonic Sort Algorithm on Hypercubes
Parallel runtime: TP = Θ(log² n)

25 Mapping Bitonic Sort to Meshes
The connectivity of a mesh is lower than that of a hypercube, so we must expect some overhead in this mapping. We therefore map wires such that the most frequent compare-exchange operations occur between neighboring processes.
Some mapping options: row-major mapping, row-major snakelike mapping, and row-major shuffled mapping. Each process is labeled by the wire that is mapped onto it.
Advantages of row-major shuffled mapping:
- Processes that perform compare-exchange operations reside on square subsections of the mesh.
- Wires that differ in the ith least-significant bit are mapped onto mesh processes that are 2^⌊(i-1)/2⌋ communication links away.

26 Bitonic Sort on Meshes: Example Mappings

27 Mapping Bitonic Sort to Meshes

28 Parallel Time of Bitonic Sort on Meshes
With the row-major shuffled mapping, wires that differ at the ith least-significant bit are mapped onto mesh processes that are 2^⌊(i-1)/2⌋ communication links away.
The total amount of communication performed by each process is the sum of these distances over all stages and steps, which is approximately 7√n = Θ(√n).
The total computation performed by each process is Θ(log² n).
The parallel runtime is therefore TP = Θ(√n), dominated by communication.

29 Block of Elements Per Processor
Each process is assigned a block of n/p elements.
- The first step is a local sort of the local block.
- Each subsequent compare-exchange operation is replaced by a compare-split operation.
We can effectively view the bitonic network as having (1 + log p)(log p)/2 steps.

30 Block of Elements Per Processor: Hypercube
Initially the processes sort their n/p elements (using merge sort) in time Θ((n/p) log(n/p)) and then perform Θ(log² p) compare-split steps.
The parallel run time of this formulation is
TP = Θ((n/p) log(n/p))  [local sort]  +  Θ((n/p) log² p)  [comparisons]  +  Θ((n/p) log² p)  [communication].

31 Block of Elements Per Processor: Mesh
The parallel runtime in this case is given by
TP = Θ((n/p) log(n/p))  [local sort]  +  Θ((n/p) log² p)  [comparisons]  +  Θ(n/√p)  [communication].

32 Bubble Sort and its Variants
We now focus on traditional sorting algorithms and investigate whether n processes can be employed to sort a sequence in Θ(log n) time.
Recall that the sequential bubble sort algorithm compares and exchanges adjacent elements; its complexity is Θ(n²).
Bubble sort is difficult to parallelize since the algorithm has no concurrency.

33 Sequential Bubble Sort Algorithm
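The transcript omits the slide's pseudocode figure; a minimal Python sketch of standard bubble sort (assuming the usual formulation) is:

```python
def bubble_sort(a):
    """In-place bubble sort: repeatedly compare-exchange adjacent elements."""
    n = len(a)
    for i in range(n - 1, 0, -1):        # after pass i the largest of a[0..i] is at a[i]
        for j in range(i):
            if a[j] > a[j + 1]:
                a[j], a[j + 1] = a[j + 1], a[j]
    return a

print(bubble_sort([3, 2, 3, 8, 5, 6, 4, 1]))   # [1, 2, 3, 3, 4, 5, 6, 8]
```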

34 Odd-Even Transposition Sort: A Bubble Sort Variant
Odd-even transposition sort alternates between two phases, called the odd and even phases.
- Odd phase: elements with odd indices are compare-exchanged with their right neighbors.
- Even phase: similarly, for elements with even indices.
The sequence is sorted after n phases (n even), each of which requires n/2 compare-exchange operations. Thus, its sequential complexity is Θ(n²).

35 Sequential Odd-Even Sort Algorithm
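Again, the pseudocode figure is not in the transcript; a minimal sequential sketch of odd-even transposition sort (assuming 0-based Python indexing, so the "odd" phase starts at index 1) is:

```python
def odd_even_sort(a):
    """Sequential odd-even transposition sort: n phases of adjacent compare-exchanges."""
    n = len(a)
    for phase in range(n):
        start = 1 if phase % 2 == 0 else 0    # alternate odd and even phases
        for i in range(start, n - 1, 2):
            if a[i] > a[i + 1]:
                a[i], a[i + 1] = a[i + 1], a[i]
    return a

print(odd_even_sort([3, 2, 3, 8, 5, 6, 4, 1]))   # [1, 2, 3, 3, 4, 5, 6, 8]
```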

36 Example: Odd-Even Sort Algorithm

37 Parallel Odd-Even Transposition
Consider the one-item-per-process case. There are n iterations; in each iteration, each process performs one compare-exchange.
The parallel run time of this formulation is Θ(n).
This is cost-optimal with respect to the base serial algorithm (Θ(n²)) but not with respect to the optimal serial algorithm (Θ(n log n)).

38 Parallel Odd-Even Transposition

39 Parallel Odd-Even Transposition
Consider a block of n/p elements per process.
- The first step is a local sort.
- The processes then execute p phases, with each compare-exchange operation replaced by a compare-split operation.
The parallel run time of this formulation is
TP = Θ((n/p) log(n/p))  [local sort]  +  Θ(n)  [comparisons]  +  Θ(n)  [communication].

40 Shellsort
Shellsort can provide a substantial improvement over odd-even sort by moving elements long distances toward their final positions.
Consider sorting n elements using p = 2^d processes. Assumptions:
- Processes are ordered in a logical one-dimensional array, and this ordering defines the global ordering of the sorted sequence.
- Each process is assigned n/p elements.
The algorithm has two phases:
- During the first phase, processes that are far away from each other in the array compare-split their elements.
- During the second phase, the algorithm switches to an odd-even transposition sort, performing odd-even phases only as long as the blocks on the processes are changing.
Because elements are moved closer to their final positions in the first phase, the number of odd-even phases performed may be much smaller than p.

41 Parallel Shellsort
Initially, each process sorts its block of n/p elements internally.
Each process is then paired with its corresponding process in the reverse order of the array: process Pi, where i < p/2, is paired with process Pp-i-1, and a compare-split operation is performed.
The processes are then split into two groups of size p/2 each, and the procedure is repeated within each group.
The process continues for d = log p steps; the partner computation is sketched below.

42 A Phase in Parallel Shellsort

43 Complexity of Parallel Shellsort
Each process performs d = log p compare-split operations in the first phase.
In the second phase, l odd and even phases are performed, each requiring time Θ(n/p).
The parallel run time of the algorithm is
TP = Θ((n/p) log(n/p))  [local sort]  +  Θ((n/p) log p)  [first phase]  +  Θ(l·n/p)  [second phase].

44 Quicksort
Quicksort is one of the most common sequential sorting algorithms due to its simplicity, low overhead, and optimal average complexity.
It is a divide-and-conquer algorithm:
- Pivot selection.
- Divide step: partition the sequence into two subsequences based on the pivot.
- Conquer step: sort the two subsequences recursively using quicksort.

45 Sequential Quicksort

46 Tracing Quicksort

47 Complexity of Quicksort
The performance of quicksort depends critically on the quality of the pivot, and there are many methods for selecting it.
- A poorly selected pivot leads to worst-case performance: Θ(n²) complexity.
- A well-selected pivot leads to optimal performance: Θ(n log n) complexity.

48 Parallelizing Quicksort
Recursive decomposition provides a natural way of parallelizing quicksort: execute the partition on a single process initially, then assign one of the subproblems to other processes.
Limitations of this parallel formulation:
- It uses n processes to sort n items.
- It performs the partitioning step serially.
Can we use n processes to partition a list of length n around a pivot in O(1) time?
We present three parallel formulations that perform the partitioning step in parallel:
- PRAM formulation
- Shared-address-space (SAS) formulation
- Message-passing (M-P) formulation

49 PRAM Formulation
We assume a CRCW (concurrent read, concurrent write) PRAM, with concurrent writes resolved by an arbitrary write succeeding.
The formulation works by creating pools of processes:
- Initially, every process is assigned to the same pool and holds one element.
- Each processor attempts to write its element to a common location for the pool.
- Each process then reads back the location; if the value read back is greater than the processor's value, it assigns itself to the 'left' pool, else it assigns itself to the 'right' pool.
- Each pool performs this operation recursively.

50 PRAM Formulation: Illustration

51 PRAM Formulation: An Example

52 PRAM Formulation: Algorithm’s Complexity
This formulation interprets quicksort as having two steps:
1. Constructing a binary tree of pivot elements. A process exits when its element becomes a pivot, and the algorithm continues until n pivots are selected.
2. Obtaining the sorted sequence by performing an inorder traversal of the tree.
During each iteration, a level of the tree is constructed in time Θ(1). Thus, the average complexity of the tree-building algorithm is Θ(log n), since this is the average height of the tree.

53 SAS Formulation: Partitioning & Merging
Consider a list of size n equally divided across p processors.
- A pivot is selected by one of the processors and made known to all processors.
- Each processor partitions its list into two, say Li and Ui, based on the selected pivot.
- All of the Li lists are merged into L, and all of the Ui lists are merged separately into U.
- The set of processors is partitioned into two groups (in proportion to the sizes of L and U), and the process is applied recursively to each of the lists.
When does the recursion end?

54 SAS Formulation: An Example

55 SAS Formulation: Merging Local Lists
The only thing we have not described is the global reorganization (merging) of the local lists to form L and U.
The problem is one of determining the right location for each element in the merged list:
- Each processor computes the number of local elements less than and greater than the pivot.
- It then computes two sum-scans (prefix sums) to determine the starting locations for its elements in the merged L and U lists.
- Once it knows the starting locations, it can write its elements safely.
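A minimal sequential simulation of this merge step (hypothetical names; a real shared-address-space version would compute the two scans with a parallel prefix operation):

```python
from itertools import accumulate

def merge_with_scans(local_lists, pivot):
    """Simulate the global reorganization: each 'process' i contributes
    L_i (elements <= pivot) and U_i (elements > pivot); exclusive prefix
    sums of their sizes give each process's write offsets into L and U."""
    L_parts = [[x for x in lst if x <= pivot] for lst in local_lists]
    U_parts = [[x for x in lst if x > pivot] for lst in local_lists]

    # Exclusive sum-scans of the local counts -> starting offsets.
    L_offsets = [0] + list(accumulate(len(p) for p in L_parts))[:-1]
    U_offsets = [0] + list(accumulate(len(p) for p in U_parts))[:-1]

    L = [None] * sum(len(p) for p in L_parts)
    U = [None] * sum(len(p) for p in U_parts)
    for i, part in enumerate(L_parts):
        L[L_offsets[i]:L_offsets[i] + len(part)] = part
    for i, part in enumerate(U_parts):
        U[U_offsets[i]:U_offsets[i] + len(part)] = part
    return L, U

print(merge_with_scans([[7, 2, 9], [1, 8, 3], [6, 4, 5]], pivot=5))
# ([2, 1, 3, 4, 5], [7, 9, 8, 6])
```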

56 SAS Formulation: Example Merging Local Lists

57 SAS Formulation: Algorithm’s Complexity
The parallel time depends on the split-and-merge time and on the quality of the pivot. The latter is an issue independent of parallelism, so we focus on the first aspect, assuming ideal pivot selection.
The algorithm executes in four steps:
(i) determine and broadcast the pivot - Θ(log p) time;
(ii) locally rearrange the array assigned to each process - Θ(n/p) time;
(iii) determine the locations in the globally rearranged array that the local elements will go to - Θ(log p) time;
(iv) perform the global rearrangement - Θ(n/p) time.
The overall complexity of splitting an n-element array is therefore Θ(n/p) + Θ(log p).

58 SAS Formulation: Algorithm’s Complexity
The process recurses until there are p lists, at which point the lists are sorted locally.
Therefore, the total parallel time is
TP = Θ((n/p) log(n/p))  [local sort]  +  Θ((n/p) log p)  [array splits]  +  Θ(log² p)  [pivot broadcasts and location computation].

59 Message Passing Formulation: Partitioning & Pairing
A simple message-passing formulation is based on recursive halving of the machine. We assume that each process in the lower half of a p-process ensemble is paired with a corresponding process in the upper half.
- A designated processor selects and broadcasts the pivot.
- Each processor splits its local list into two lists: one with elements less than the pivot (Li), the other with elements greater than the pivot (Ui).
- A processor in the low half of the machine sends its list Ui to its paired processor in the other half; the paired processor sends back its list Li.
- After this step, all elements less than the pivot are in the low half of the machine and all elements greater than the pivot are in the high half.

60 Message-Passing Formulation: Algorithm’s Complexity
The above process recurses until each process has its own local list, which it then sorts locally.
The time for a single reorganization is:
- Θ(log p) for broadcasting the pivot element,
- Θ(n/p) for splitting the locally assigned portion of the array,
- Θ(n/p) for the exchange and local reorganization.
We note that this time is identical to that of the corresponding shared-address-space formulation.

61 Bucket and Sample Sort
In bucket sort, the range [a, b] of input numbers is divided into m equal-sized intervals, called buckets, and each element is placed in its appropriate bucket.
If the numbers are uniformly distributed over the range, the buckets can be expected to hold roughly the same number of elements.
The elements in each bucket are then sorted locally.
The run time of this algorithm is Θ(n log(n/m)).
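A minimal sequential sketch of bucket sort over a known range [a, b] (illustrative only; the parallel version on the next slide assigns one bucket range per process):

```python
def bucket_sort(values, a, b, m):
    """Bucket sort for values in [a, b] using m equal-sized buckets."""
    width = (b - a) / m
    buckets = [[] for _ in range(m)]
    for v in values:
        idx = min(int((v - a) / width), m - 1)   # clamp b itself into the last bucket
        buckets[idx].append(v)
    result = []
    for bucket in buckets:
        result.extend(sorted(bucket))            # local sort of each bucket
    return result

print(bucket_sort([0.78, 0.17, 0.39, 0.26, 0.72, 0.94, 0.21, 0.12], 0.0, 1.0, 4))
# [0.12, 0.17, 0.21, 0.26, 0.39, 0.72, 0.78, 0.94]
```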

62 Parallel Bucket Sort
Parallelizing bucket sort is relatively simple: we can select m = p, so that each process has a range of values it is responsible for.
- Each process runs through its local list and assigns each of its elements to the appropriate process.
- The elements are sent to their destination processes using a single all-to-all personalized communication.
- Each process sorts all the elements it receives.

63 Parallel Bucket and Sample Sort
The critical aspect of the above algorithm is assigning ranges to processes; this is done by suitable splitter selection.
- The splitter selection method divides the n elements into m blocks of size n/m each and sorts each block using quicksort.
- From each sorted block it chooses m – 1 evenly spaced elements.
- The m(m – 1) elements selected from all the blocks represent the sample used to determine the buckets.
This scheme guarantees that the number of elements ending up in each bucket is less than 2n/m.
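A minimal sketch of this splitter (sample) selection step (illustrative helper names and index choices; the final m – 1 bucket boundaries are drawn from the combined, sorted sample):

```python
def select_splitters(elements, m):
    """Sample-sort splitter selection: split into m blocks, sort each,
    take m-1 roughly evenly spaced elements per block, then pick m-1
    global splitters from the sorted combined sample."""
    n = len(elements)
    block = n // m
    sample = []
    for i in range(m):
        blk = sorted(elements[i * block:(i + 1) * block])
        # m-1 roughly evenly spaced elements from this sorted block
        sample.extend(blk[(j + 1) * block // m] for j in range(m - 1))
    sample.sort()
    # m-1 evenly spaced splitters from the m(m-1)-element sample
    return [sample[(j + 1) * len(sample) // m] for j in range(m - 1)]

data = [22, 7, 13, 18, 2, 17, 1, 14, 20, 6, 10, 24, 15, 9, 21, 3]
print(select_splitters(data, m=4))   # [13, 17, 21]
```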

64 Parallel Bucket and Sample Sort

65 Parallel Bucket and Sample Sort
The splitter selection scheme can itself be parallelized:
- Each process generates its p – 1 local splitters in parallel.
- All processes share their splitters using a single all-to-all broadcast operation.
- Each process sorts the p(p – 1) elements it receives and selects p – 1 uniformly spaced splitters from them.

66 Parallel Bucket and Sample Sort: Analysis
- The internal sort of n/p elements requires time Θ((n/p) log(n/p)), and the selection of p – 1 sample elements requires time Θ(p).
- The time for an all-to-all broadcast is Θ(p²), the time to internally sort the p(p – 1) sample elements is Θ(p² log p), and selecting p – 1 evenly spaced splitters takes time Θ(p).
- Each process can insert these p – 1 splitters into its local sorted block of size n/p by performing p – 1 binary searches, in time Θ(p log(n/p)).
- The time for the reorganization of the elements is Θ(n/p).

67 Parallel Bucket and Sample Sort: Analysis
Summing the components from the previous slide, the total time is given by
TP = Θ((n/p) log(n/p))  [local sort]  +  Θ(p² log p)  [sort the sample]  +  Θ(p log(n/p))  [block partition]  +  Θ(n/p)  [reorganization].

