1 Characterizing the Sort Operation on Multithreaded Architectures Layali Rashid, Wessam M. Hassanein, and Moustafa A. Hammad* The Advanced Computer Architecture.

Presentation transcript:

1 Characterizing the Sort Operation on Multithreaded Architectures Layali Rashid, Wessam M. Hassanein, and Moustafa A. Hammad* The Advanced Computer Architecture U of C (ACAG) Department of Electrical and Computer Engineering, *Department of Computer Science University of Calgary

2 Outline
• Multithreaded Architecture
• Motivation
• Parallel Radix Sort and Quick Sort
• Our Approach: Multithreaded Radix and Quick Sort
• Experimental Methodology
• Timing and Memory Analysis
• Conclusions
SDMAS'08, University of Calgary

3 Multithreaded Architectures
• Simultaneous Multithreaded architectures (SMT)
  • Caches, execution units, and buses are shared among multiple threads.
• Chip Multiprocessors (CMP)
  • Each processor has its own execution units and L1 cache; however, the L2 cache and the bus interface are shared among processors.
• Combinations of SMT, CMP and Symmetric Multiprocessors (SMP), e.g. multiples of the following structure:

4 Motivation
• New forms of multithreading have opened opportunities to improve data management operations so that they better utilize the underlying hardware resources.
• The sort operation is a core part of many critical applications.
• Sorting suffers from strong data dependencies that severely limit its performance.

5 Radix Sort
• Radix sort is a distribution sort that processes one digit of the keys in each sorting iteration.
• Least Significant Digit (LSD) radix sort, where digit(i) refers to a group of bits from a key. The digit position i is fixed throughout each iteration:
    for (i = 0; i < number_of_digits; i++)
        sort source-array based on digit(i);
• Cache-conscious radix sort [1] uses the Most Significant Digit (MSD) to construct subarrays from the source array that fit in the largest data cache.
• Parallel partitioned radix sort [2] distributes data among processors once. However, it does not guarantee a perfectly balanced distribution of keys across processors.
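The two-line LSD loop above can be made concrete as follows. This is a minimal sketch, not the authors' code: it assumes 32-bit unsigned keys and an 8-bit digit (DIGIT_BITS is an illustrative choice), and it implements "sort source-array based on digit(i)" as a stable counting pass. The names lsd_radix_sort and digit are introduced here for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define DIGIT_BITS 8                       /* assumed digit width */
#define NUM_BUCKETS (1u << DIGIT_BITS)

/* Extract digit i of a key, counting from the least significant end. */
static uint32_t digit(uint32_t key, unsigned i)
{
    return (key >> (i * DIGIT_BITS)) & (NUM_BUCKETS - 1u);
}

/* Sort n 32-bit keys in place with LSD radix sort.
   Returns 0 on success, -1 if the scratch buffer cannot be allocated. */
int lsd_radix_sort(uint32_t *keys, size_t n)
{
    const unsigned number_of_digits = 32 / DIGIT_BITS;
    uint32_t *tmp, *src, *dst;

    assert(keys != NULL || n == 0);
    if (n < 2)
        return 0;
    tmp = malloc(n * sizeof *tmp);
    if (tmp == NULL)
        return -1;

    src = keys;
    dst = tmp;
    for (unsigned i = 0; i < number_of_digits; i++) {
        size_t count[NUM_BUCKETS] = {0};
        size_t offset = 0;

        /* Histogram of digit i over all keys. */
        for (size_t j = 0; j < n; j++)
            count[digit(src[j], i)]++;

        /* Exclusive prefix sum: count[b] becomes the first slot of bucket b. */
        for (unsigned b = 0; b < NUM_BUCKETS; b++) {
            size_t c = count[b];
            count[b] = offset;
            offset += c;
        }

        /* Stable scatter into the destination array. */
        for (size_t j = 0; j < n; j++)
            dst[count[digit(src[j], i)]++] = src[j];

        /* The output of this pass is the input of the next one. */
        uint32_t *swap = src;
        src = dst;
        dst = swap;
    }

    /* Copy back only if the sorted data ended up in the scratch buffer. */
    if (src != keys)
        memcpy(keys, src, n * sizeof *keys);
    free(tmp);
    return 0;
}
```

Each pass reads and writes the whole array, which is why the cache-conscious variant first splits the data into cache-sized subarrays.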

6 Our Parallel Radix Sort Algorithm
• Hybrid of parallel partitioned radix sort and cache-conscious radix sort:
    start:
      for each thread
        compute local histogram for a bucket of keys using MSD
      generate global histogram
      for each thread
        permute keys based on local and global histograms
      barrier
      for each thread
        i = next available bucket
        if bucket(i) is over-sized then
          store it in queue and pick another bucket
        else
          locally sort bucket(i) using LSD digits never visited before
      visit queue, goto start for each over-sized bucket
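The first half of the algorithm (local histograms, global histogram, permutation) can be sketched as below. This is an illustration under stated assumptions rather than the authors' implementation: each "thread" owns one contiguous chunk of the source array, the MSD is taken to be the top 4 bits of a 32-bit key (MSD_BITS is hypothetical), and the function name msd_partition is invented here. The OpenMP pragmas are guarded so the code also compiles and runs serially.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define MSD_BITS 4                         /* assumed width of the most significant digit */
#define MSD_BUCKETS (1u << MSD_BITS)

/* Most significant digit of a 32-bit key. */
static unsigned msd(uint32_t key)
{
    return key >> (32 - MSD_BITS);
}

/* Partition src[0..n) into dst by MSD using one local histogram per thread.
   On return, bucket b occupies dst[bucket_start[b] .. bucket_start[b + 1]).
   bucket_start must have room for MSD_BUCKETS + 1 entries.
   Returns 0 on success, -1 on allocation failure. */
int msd_partition(const uint32_t *src, uint32_t *dst, size_t n,
                  int threads, size_t *bucket_start)
{
    size_t (*local)[MSD_BUCKETS];
    size_t pos = 0;
    int t;

    assert(threads >= 1);
    local = calloc((size_t)threads, sizeof *local);
    if (local == NULL)
        return -1;

    /* Step 1: every thread counts the digits in its own chunk. */
#ifdef _OPENMP
#pragma omp parallel for
#endif
    for (t = 0; t < threads; t++) {
        size_t lo = n * (size_t)t / (size_t)threads;
        size_t hi = n * (size_t)(t + 1) / (size_t)threads;
        for (size_t j = lo; j < hi; j++)
            local[t][msd(src[j])]++;
    }

    /* Step 2: global histogram as an exclusive prefix sum, ordered by bucket
       and then by thread; local[t][b] becomes thread t's write offset. */
    for (unsigned b = 0; b < MSD_BUCKETS; b++) {
        bucket_start[b] = pos;
        for (t = 0; t < threads; t++) {
            size_t c = local[t][b];
            local[t][b] = pos;
            pos += c;
        }
    }
    bucket_start[MSD_BUCKETS] = pos;

    /* Step 3: every thread scatters its chunk; write ranges never overlap. */
#ifdef _OPENMP
#pragma omp parallel for
#endif
    for (t = 0; t < threads; t++) {
        size_t lo = n * (size_t)t / (size_t)threads;
        size_t hi = n * (size_t)(t + 1) / (size_t)threads;
        for (size_t j = lo; j < hi; j++)
            dst[local[t][msd(src[j])]++] = src[j];
    }

    free(local);
    return 0;
}
```

After this step the buckets are independent: a thread can take any range given by bucket_start and sort it locally on the remaining digits, or push it back through the same partition step if it is too large for the cache.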

7 Experimental Methodology
• We implemented all algorithms in C, using the Intel® C++ Compiler for Linux version 9.1.
• We use the OpenMP C/C++ library version 2.5 to create multiple threads in our multithreaded code.
• Hardware events are collected using Intel® VTune™ Performance Analyzer for Linux 9.0.
• Our runs sort datasets ranging from 1×10^7 to 6×10^7 keys, which fit comfortably in main memory.
• We run three typical datasets:
  • Random: keys are generated by calling the random() C function, which returns numbers in the range 0 to 2^31 - 1.
  • Gaussian: each key is the average of four consecutive calls to the random() C function.
  • Zero: all keys are set to a constant, which is picked randomly using the random() C function.
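The three datasets described above could be generated along these lines. This is a sketch, not the authors' generator: the number source is passed in as a function pointer so that POSIX random() can be plugged in, and next_random is a small deterministic stand-in (an assumption made here so the example is portable and repeatable). All function names are illustrative.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Any generator returning values in [0, 2^31), e.g. POSIX random(). */
typedef long (*rng_fn)(void);

/* Stand-in for random(): a simple linear congruential generator.
   Illustration only; it is not the generator used in the experiments. */
static uint32_t lcg_state = 12345u;
static long next_random(void)
{
    lcg_state = lcg_state * 1103515245u + 12345u;
    return (long)(lcg_state >> 1);         /* keep 31 bits */
}

/* Random: one generator call per key. */
void fill_random(uint32_t *keys, size_t n, rng_fn next)
{
    assert(next != NULL);
    for (size_t i = 0; i < n; i++)
        keys[i] = (uint32_t)next();
}

/* Gaussian: each key is the average of four consecutive generator calls. */
void fill_gaussian(uint32_t *keys, size_t n, rng_fn next)
{
    assert(next != NULL);
    for (size_t i = 0; i < n; i++) {
        uint64_t sum = 0;
        for (int c = 0; c < 4; c++)
            sum += (uint64_t)next();
        keys[i] = (uint32_t)(sum / 4);
    }
}

/* Zero: every key equals one randomly picked constant. */
void fill_zero(uint32_t *keys, size_t n, rng_fn next)
{
    assert(next != NULL);
    uint32_t constant = (uint32_t)next();
    for (size_t i = 0; i < n; i++)
        keys[i] = constant;
}
```

On a POSIX system, random from <stdlib.h> has the matching signature long random(void) and can be passed directly in place of next_random.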

8 Experimental Methodology (cont’d)
• We run our algorithms on a machine combining SMT, CMP and SMP, with the following specifications:

9 Miss Rates for LSD Radix Sort with Different Datasets

10 Radix Sort Timing for the Random Datasets
• Speedups range from 54% for two threads to as much as 300% for 16 threads compared to LSD radix sort.

11 Radix Sort Timing for the Gaussian Datasets
• Speedups range from 7% for two threads to as much as 237% for 16 threads compared to LSD radix sort.

12 Quick Sort
• Quicksort is a comparison-based, divide-and-conquer sort algorithm.
• Memory-tuned quick sort [3] uses insertion sort to sort the source subarrays to increase data locality.
• In simple fast parallel quick sort [4], a pivot is picked by the processor with the smallest ID. Each processor then processes an L1-data-cache-sized block of keys from the left side of the pivot and another block from the right side. When the remaining subarrays are small enough to be sorted by one processor, memory-tuned quick sort is used.
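The sequential building block mentioned above, quick sort that hands subarrays to insertion sort while they are still in the cache, might look like the following. This is a sketch in the spirit of memory-tuned quick sort [3], not a reproduction of it: the cutoff of 16 keys, the median-of-three pivot, and the Hoare-style partition are all assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SMALL_CUTOFF 16                    /* assumed size below which insertion sort is used */

static void insertion_sort(uint32_t *a, size_t n)
{
    for (size_t i = 1; i < n; i++) {
        uint32_t key = a[i];
        size_t j = i;
        while (j > 0 && a[j - 1] > key) {
            a[j] = a[j - 1];
            j--;
        }
        a[j] = key;
    }
}

/* Median of three values, used as the pivot. */
static uint32_t median3(uint32_t x, uint32_t y, uint32_t z)
{
    if (x > y) { uint32_t t = x; x = y; y = t; }   /* now x <= y */
    if (y > z) y = z;                              /* y = min(y, z) */
    return (x > y) ? x : y;                        /* max(x, min(y, z)) */
}

/* Sort a[0..n) in place. Small subarrays are sorted with insertion sort
   as soon as they are produced, while their keys are likely still cached. */
void quick_sort(uint32_t *a, size_t n)
{
    assert(a != NULL || n == 0);
    while (n > SMALL_CUTOFF) {
        uint32_t pivot = median3(a[0], a[n / 2], a[n - 1]);
        ptrdiff_t i = -1, j = (ptrdiff_t)n;

        /* Hoare partition: afterwards a[0..j] <= pivot <= a[j+1..n-1]. */
        for (;;) {
            do { i++; } while (a[i] < pivot);
            do { j--; } while (a[j] > pivot);
            if (i >= j)
                break;
            uint32_t t = a[i]; a[i] = a[j]; a[j] = t;
        }

        /* Recurse into the smaller side, loop on the larger one,
           which keeps the recursion depth logarithmic. */
        size_t left = (size_t)j + 1;
        if (left < n - left) {
            quick_sort(a, left);
            a += left;
            n -= left;
        } else {
            quick_sort(a + left, n - left);
            n = left;
        }
    }
    insertion_sort(a, n);
}
```

In the parallel algorithm this routine would be what a single processor runs once a subarray has become small enough to be handled alone.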

13 Our Parallel Quick Sort
• We chose to implement the best parallel quick sort we found, "simple fast parallel quick sort", with some optimizations, including the following:
  • Dynamically adjusting the block sizes: since each memory location in the blocks is referenced once on average, our block size is dynamically adjusted for each subarray so that it provides good data balancing across threads, and is not necessarily equal to the L1 cache size.
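The slide does not state the rule used to adjust the block size, so the following is purely a hypothetical illustration of the idea: derive the block size from the subarray length and the thread count so that every thread receives several blocks, instead of fixing it at the L1 cache size. The function name choose_block_size and the constants BLOCKS_PER_THREAD and MIN_BLOCK_KEYS are assumptions introduced here.

```c
#include <assert.h>
#include <stddef.h>

#define BLOCKS_PER_THREAD 8                /* hypothetical: blocks each thread should get */
#define MIN_BLOCK_KEYS 64                  /* hypothetical: lower bound on block size, in keys */

/* Pick a block size (in keys) for a subarray of the given length so that
   the work is split into roughly threads * BLOCKS_PER_THREAD blocks.
   The result is clamped to [MIN_BLOCK_KEYS, subarray_len]. */
size_t choose_block_size(size_t subarray_len, int threads)
{
    size_t target_blocks, block;

    assert(threads >= 1);
    target_blocks = (size_t)threads * BLOCKS_PER_THREAD;
    block = subarray_len / target_blocks;
    if (block < MIN_BLOCK_KEYS)
        block = MIN_BLOCK_KEYS;            /* avoid blocks too small to amortize overhead */
    if (block > subarray_len)
        block = subarray_len;              /* never exceed the subarray itself */
    return block;
}
```

With these illustrative constants, a subarray of 1×10^7 keys on 16 threads yields blocks of 78125 keys, while a small subarray falls back to the minimum block size or to its own length.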

14 Quick Sort Timing for the Random Datasets
• Our speedups range from 34% to 417% for the 1×10^7-key dataset, and from 34% to 260% for the 6×10^7-key dataset.

15 Quick Sort Timing for the Gaussian Datasets
• Speedups range from 18% to 259%.

16 Conclusions
• We achieve speedups of up to 4.69x for radix sort and up to 4.17x for quick sort on a machine with 4 multithreaded processors, each compared to its single-threaded version.
• Since radix sort is CPU-intensive, it performs better on chip multiprocessors, where multiple CPUs are available.
• Quick sort, in contrast, achieves speedups on all types of multithreaded processors because it can overlap memory miss latencies with other useful processing.

17 References
[1] Jiménez-González, D., Navarro, J.J. and Larriba-Pey, J. CC-Radix: a Cache Conscious Sorting Based on Radix Sort. In Proceedings of the 11th Euromicro Conference on Parallel Distributed and Network-Based Processing (PDP). Pages: ,
[2] Lee, S., Jeon, M., Kim, D. and Sohn, A. Partition Parallel Radix Sort. Journal of Parallel and Distributed Computing. Pages: ,
[3] LaMarca, A. and Ladner, R. The Influence of Caches on the Performance of Sorting. In Proceedings of the ACM/SIAM Symposium on Discrete Algorithms. Pages: 370–379,
[4] Tsigas, P. and Zhang, Yi. A Simple, Fast Parallel Implementation of Quicksort and its Performance Evaluation on Sun Enterprise In Proceedings of the 11th EUROMICRO Conference on Parallel Distributed and Network-Based Processing (PDP). Pages: 372–381,

18 The End