Adaptive Parallel Sorting Algorithms in STAPL
Olga Tkachyshyn, Gabriel Tanase, Nancy M. Amato
olgat@cs.tamu.edu, gabrielt@cs.tamu.edu, amato@cs.tamu.edu
Parasol Lab, Department of Computer Science, Texas A&M University, http://parasol.tamu.edu/

STAPL: Standard Template Adaptive Parallel Library

The Standard Template Adaptive Parallel Library (STAPL) is a parallel library designed as a superset of the ISO C++ Standard Template Library (STL). It executes on uni- or multiprocessor systems with shared or distributed memory. The goal of STAPL is to let users work at a high level of abstraction by insulating them from the complexities of parallel programming, such as problem decomposition, mapping, scheduling, and execution, while still providing scalable performance.

STAPL Design Goals
- Ease of use: STAPL emulates shared-memory programming. Users can program assuming a single address space on both shared- and distributed-memory systems.
- Efficiency: STAPL provides building blocks equivalent to STL containers, iterators, and algorithms, automatically tuned for parallel and distributed systems.
- Portability: STAPL has its own runtime system that hides machine-specific details and provides a uniform, efficient communication interface.

STAPL Main Components
- pContainer: distributed data structures, the parallel counterparts of STL containers.
- pRange: presents an abstract view of a scoped data space, allowing random access to a partition or subrange of the data in a pContainer; it also stores data-dependence information.
- pAlgorithms: parallel algorithms providing basic functionality, bound to a pContainer by a pRange.
- Adaptive runtime system: the Adaptive Remote Method Invocation (aRMI) communication library hides machine specifics and provides a uniform communication interface.
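The component split described above mirrors the STL's container/iterator/algorithm triad. A minimal sketch of that sequential pattern, with the parallel analogues noted in comments (the stapl:: names in the comments are illustrative assumptions, not the library's confirmed API):

```cpp
#include <algorithm>
#include <vector>

// Sequential STL pattern: container + iterator range + algorithm.
// STAPL mirrors each piece in parallel: pContainer + pRange + pAlgorithm.
// (The stapl:: names below are illustrative sketches only.)
std::vector<int> stl_sort_copy(std::vector<int> v) {
    std::sort(v.begin(), v.end());  // STAPL analogue: stapl::p_sort(prange)
    return v;
    // With a pContainer, the same call shape would operate on a
    // distributed structure, e.g. a hypothetical stapl::p_vector<int>.
}
```

The point of the design is that a user who knows the STL pattern can reuse it unchanged; the library swaps in distributed containers and parallel algorithms underneath.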
- Performance-optimization toolbox: scheduler, load balancer, and system-profiling tools.

Parallel Sorting Algorithms

Goal: adaptively select the best sorting algorithm based on the data provided and the system information available.

The performance of parallel sorts depends on:
- machine architecture
- number of processors
- the type of elements to sort
- how presorted the elements are

Sample Sort
- Works for all element types that can be compared.
- Sequential complexity: O(n log n). Parallel complexity: O((n/p) log(n/p)).
- Sequential algorithm:
  1. Select p-1 splitters.
  2. Sort the splitters; the sorted splitters are the upper and lower bounds that define p buckets.
  3. Compare each element to the splitters and place it in the appropriate bucket.
  4. Sort the contents of each bucket.
  5. Copy the values from the buckets back into the original container.
- Parallelization:
  - If each processor is responsible for one bucket, the steps can be done in parallel.
  - The running time depends on the maximum number of elements in any bucket, i.e., on how evenly the elements are distributed among buckets, so we want all buckets to contain an equal number of elements.
  - Technique used to achieve this: oversampling.

Radix Sort
- Works only for integers.
- Sequential complexity: O(n). Parallel complexity: O(n/p).
- Sequential algorithm:
  - Radix sort is not a comparison sort, so it is not subject to the O(n log n) lower bound on comparison sorting.
  - Each element is represented by b bits (e.g., b = 32 for 32-bit integers).
  - The algorithm performs a number of passes; each pass considers only r bits of each element at a time, with the ith pass sorting according to the ith group of least significant bits.
  - The algorithm used to sort each r-bit group must be stable: if two elements have the same value, they appear in the output sequence in the same order as in the input sequence. Counting sort is usually used here, as demonstrated in the parallel example to the right.

Bitonic Sort
- Works for all element types that can be compared.
- Sequential complexity: O(n log n). Parallel complexity: O((n/p) log(n/p) + (n/p) log p) (local sort + merge).
- Parallel algorithm:
  1. Locally sort the elements on each thread.
  2. Form a bitonic sequence (a sequence that first increases and then decreases, or that can be circularly shifted to become so).
  3. Sort the bitonic sequence into increasing order.
- Note: each step of the bitonic sort consists of two threads exchanging data, merging the two sequences, and each keeping its corresponding half.
- Heuristic applied: the threads exchange their minimum and maximum elements first, then trade only the elements necessary for the merge.

Performance Comparison
- Random data:
  - Radix Sort is faster up to 8 processors.
  - Sample Sort outperforms Radix Sort as the number of processors increases.
- Nearly sorted data:
  - Radix Sort is faster.
  - The performance difference shrinks as the number of processors increases.
- Sample Sort scales better than Radix Sort and performs well on various data types.
- Radix Sort is the fastest sort for integers, but its scalability is poor on random data.

References
[1] P. An, A. Jula, S. Rus, S. Saunders, T. Smith, G. Tanase, N. Thomas, N. Amato, and L. Rauchwerger, "STAPL: An Adaptive, Generic Parallel C++ Library," 2001.
[2] N. Amato, R. Iyer, S. Sundaresan, and Y. Wu, "A Comparison of Parallel Sorting Algorithms on Different Architectures," 1996.
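The five sample-sort steps above can be sketched sequentially as follows. This is a minimal illustration that picks evenly spaced splitters rather than oversampling, and the function name is ours; in the parallel version each bucket would be sorted by its own processor.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Sequential sketch of sample sort with p buckets.
std::vector<int> sample_sort(const std::vector<int>& in, std::size_t p) {
    // Steps 1-2: select p-1 splitters and sort them
    // (evenly spaced samples stand in for oversampling here).
    std::vector<int> splitters;
    for (std::size_t i = 1; i < p; ++i)
        splitters.push_back(in[i * in.size() / p]);
    std::sort(splitters.begin(), splitters.end());

    // Step 3: place each element in the bucket bounded by its splitters.
    std::vector<std::vector<int>> buckets(p);
    for (int x : in) {
        std::size_t b = std::upper_bound(splitters.begin(), splitters.end(), x)
                        - splitters.begin();
        buckets[b].push_back(x);
    }

    // Steps 4-5: sort each bucket (one per processor, in the parallel
    // version) and concatenate back into one container.
    std::vector<int> out;
    for (auto& bkt : buckets) {
        std::sort(bkt.begin(), bkt.end());
        out.insert(out.end(), bkt.begin(), bkt.end());
    }
    return out;
}
```

Because the buckets are ordered by the splitters, concatenating the sorted buckets yields a fully sorted sequence; the load balance (and hence parallel running time) depends entirely on how evenly the splitters divide the data.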
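A sequential sketch of the radix-sort passes described above, using a stable counting sort on r bits per pass with b = 32 (the function name and the choice r = 8 are ours):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// LSD radix sort for 32-bit unsigned integers: each pass performs a
// stable counting sort on the next r-bit group of least significant bits.
std::vector<std::uint32_t> radix_sort(std::vector<std::uint32_t> a,
                                      unsigned r = 8) {
    const std::uint32_t mask = (1u << r) - 1u;
    std::vector<std::uint32_t> tmp(a.size());
    for (unsigned shift = 0; shift < 32; shift += r) {
        // Count occurrences of each r-bit digit value.
        std::vector<std::size_t> count(1u << r, 0);
        for (std::uint32_t x : a) ++count[(x >> shift) & mask];
        // Prefix sums turn counts into starting output positions.
        std::size_t pos = 0;
        for (std::size_t& c : count) { std::size_t n = c; c = pos; pos += n; }
        // Stable scatter: equal digits keep their input order.
        for (std::uint32_t x : a) tmp[count[(x >> shift) & mask]++] = x;
        a.swap(tmp);
    }
    return a;
}
```

Stability of each pass is what makes the whole sort correct: a later pass on more significant bits never reorders elements that an earlier pass already placed correctly.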
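One exchange-and-merge step of the bitonic sort, as described in the note above, can be sketched as follows. This is a simplified version that merges the full blocks; the poster's heuristic first exchanges only minima and maxima and then trades just the elements needed for the merge. The function name is ours.

```cpp
#include <algorithm>
#include <vector>

// One bitonic exchange step between a pair of threads: both blocks are
// sorted ascending; after merging, the "low" thread keeps the smaller
// half and the "high" thread keeps the larger half.
void bitonic_exchange(std::vector<int>& low, std::vector<int>& high) {
    // The two threads exchange data and merge the two sorted sequences.
    std::vector<int> merged(low.size() + high.size());
    std::merge(low.begin(), low.end(), high.begin(), high.end(),
               merged.begin());
    // Each thread keeps its corresponding half of the merged sequence.
    std::copy(merged.begin(), merged.begin() + low.size(), low.begin());
    std::copy(merged.begin() + low.size(), merged.end(), high.begin());
}
```

Repeating this step over the pairs prescribed by the bitonic network turns the per-thread sorted blocks into one globally sorted sequence, which is the O((n/p) log p) merge term in the parallel complexity above.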