Using SIMD Registers and Instructions to Enable Instruction-Level Parallelism in Sorting Algorithms. Yuanyuan Sun, Feiteng Yang.


Source: ACM Symposium on Parallel Algorithms and Architectures, Proceedings of the nineteenth annual ACM symposium on Parallel algorithms and architectures. Authors: Timothy Furtak, José Nelson Amaral, Robert Niewiadomski.

Outline: Introduction; Sorting network; Sorting algorithms; Experimental evaluation; Contributions.

Introduction: Use SIMD resources to improve the performance of sorting algorithms for short sequences. Initial inspiration: the need for fast sorting of short sequences in the implementation of graphics rendering for interactive video games. SIMD machinery.

Introduction: SIMD machinery. x86-64’s SSE2 (Streaming SIMD Extensions 2) and the G5’s AltiVec are SIMD instruction sets; both feature 128-bit vector registers.

Sorting network: a comparator network that produces a sorted output for any possible input sequence. COMP(a, b): the inputs a and b are two storage units (memory locations, registers, or vector-register elements), each containing a numerical input.

Sorting network Size: the total number of comparators in the network. Depth: the length of the critical path in its dependence graph.
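The two measures can be computed mechanically. The sketch below is only an illustration, not code from the paper. It assumes a network is given as a list of wire-index pairs in execution order; the helper name `size_and_depth` and the constant `FOUR_INPUT` are hypothetical. The example network is the five-comparator, four-input one that appears later in these slides as P1, P2, P3.

```python
def size_and_depth(network, num_wires):
    """Return (size, depth) of a comparator network.

    `network` is a list of (i, j) wire-index pairs in execution order.
    """
    # ready[w] = number of comparator levels already applied to wire w
    ready = [0] * num_wires
    for i, j in network:
        # A comparator can only fire once both of its inputs are available,
        # so it sits one level below the deeper of its two wires.
        level = max(ready[i], ready[j]) + 1
        ready[i] = ready[j] = level
    # Size is the comparator count; depth is the longest dependence chain.
    return len(network), max(ready, default=0)


# Wires a, b, c, d are indices 0..3 (an assumed encoding for this sketch).
FOUR_INPUT = [(0, 2), (1, 3), (0, 1), (2, 3), (1, 2)]
size, depth = size_and_depth(FOUR_INPUT, 4)
print(size, depth)  # 5 3
```

Comparators that share no wire land on the same level, which is why five comparators need only three dependent steps.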

Sorting network

A comparator moves the larger value to the left and the smaller value to the right. For instance, in Figure 1 (size = 5, depth = 3), the inputs a = 7, b = 2, c = 5, d = 9 produce the output a = 9, b = 7, c = 5, d = 2.
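The example can be replayed in a few lines. The figure is not reproduced in this transcript, so the sketch assumes a standard five-comparator network for four inputs. The names `comp`, `run_network` and `NETWORK` are illustrative, not taken from the paper.

```python
def comp(values, i, j):
    """Comparator: afterwards values[i] >= values[j] (larger value on the left)."""
    if values[i] < values[j]:
        values[i], values[j] = values[j], values[i]


def run_network(values, network):
    """Apply every comparator of `network` in order to a copy of `values`."""
    out = list(values)
    for i, j in network:
        comp(out, i, j)
    return out


# Assumed network; positions 0..3 stand for a, b, c, d.
NETWORK = [(0, 2), (1, 3), (0, 1), (2, 3), (1, 2)]
print(run_network([7, 2, 5, 9], NETWORK))  # [9, 7, 5, 2]
```

The same fixed comparator sequence runs for every input, with no decisions that depend on the data beyond each individual exchange.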

Supporting hardware for sorting networks: min and max instructions. min(a, b) = a if a ≤ b, b otherwise; max(a, b) = a if a ≥ b, b otherwise. The comparator required by a sorting network is easily constructed from these two operations, a copy instruction, and a temporary variable.
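A minimal sketch of that construction follows, with Python's built-in `min` and `max` standing in for the hardware instructions. The function name `comparator` is hypothetical.

```python
def comparator(a, b):
    """Branch-free comparator built from copy, max and min: returns (larger, smaller)."""
    tmp = a          # copy: keeps the original a alive after a is overwritten
    a = max(a, b)    # a now holds the larger value
    b = min(tmp, b)  # b now holds the smaller value
    return a, b


print(comparator(2, 7))  # (7, 2)
```

The temporary is needed because the max overwrites `a` before the min has read it. There is no conditional branch anywhere in the sequence.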

Supporting hardware for sorting networks: the x86-64 architecture supports the SSE2 min and max operations, which return the element-wise minimum (maximum) of packed single-precision floating-point values.

Supporting hardware for sorting networks: Width is the number of vectors being sorted. x86-64 has 16 XMM vector registers, and each register can hold 4 single-precision floating-point values. Sorting the values in n XMM registers using a sorting network produces 4 sorted streams of data, each of length n. Here 1 ≤ n < 16, because one register must be reserved as temporary storage for swapping values.
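To see why n registers yield four independent sorted streams, the sketch below models each XMM register as a Python list of four values and each packed min/max as an element-wise list operation. It is a plain-Python model, not SSE code. The names `vmin`, `vmax`, `simd_sort` and `NETWORK4` are hypothetical, and sending the smaller value to the lower-numbered register is an assumed convention.

```python
def vmin(x, y):
    """Model of a packed min: element-wise minimum of two 4-lane registers."""
    return [min(p, q) for p, q in zip(x, y)]


def vmax(x, y):
    """Model of a packed max: element-wise maximum of two 4-lane registers."""
    return [max(p, q) for p, q in zip(x, y)]


def simd_sort(registers, network):
    """Run a sorting network whose 'wires' are whole registers."""
    regs = [list(r) for r in registers]
    for i, j in network:
        tmp = list(regs[i])               # the register reserved as temporary
        regs[i] = vmin(regs[i], regs[j])  # smaller values go to register i
        regs[j] = vmax(tmp, regs[j])      # larger values go to register j
    return regs


NETWORK4 = [(0, 2), (1, 3), (0, 1), (2, 3), (1, 2)]  # width n = 4
regs = simd_sort(
    [[7, 2, 5, 9], [3, 8, 1, 4], [6, 0, 9, 2], [1, 5, 3, 7]], NETWORK4
)
for stream in zip(*regs):  # read lane k of every register, top to bottom
    print(stream)
```

Each comparator processes four lanes at once, so the five-comparator network sorts four length-4 streams for the cost of one scalar network.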

Three sorting methods: (1) two-pass sorting with insertion sort; (2) two-pass sorting with merge sort; (3) one-pass sorting (register sorting).

Two-pass sorting: In the first phase, the SIMD registers and instructions are used to generate a partially sorted output. In the second phase, a standard sorting algorithm (insertion sort and mergesort are investigated in this paper) finishes the sorting.

First phase: SIMD sort. Vector registers: (A1 B1 C1 D1), (A2 B2 C2 D2), ..., (An Bn Cn Dn). After SIMD sort: (figure not included in the transcript).

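Second phase.
Insertion sort: after the first phase A1<A5<A9, A2<A6<A10, A3<A7<A11, A4<A8<A12; register layout (A1 A2 A3 A4), (A5 A6 A7 A8), (A9 A10 A11 A12).
Merge sort: after the first phase A1<A2<A3, A4<A5<A6, A7<A8<A9, A10<A11<A12; register layout (A1 A4 A7 A10), (A2 A5 A8 A11), (A3 A6 A9 A12).
Final output: A1 A2 A3 A4 A5 A6 A7 A8 A9 A10 A11 A12.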
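The two finishing strategies can be sketched in plain Python. This is a model under stated assumptions, not the paper's implementation. It takes 12 values in three 4-lane registers, uses an assumed 3-input network `NET3`, and lets `heapq.merge` from the standard library stand in for the merge pass. All function names are hypothetical.

```python
import heapq

LANES = 4
NET3 = [(0, 1), (1, 2), (0, 1)]  # an assumed 3-input sorting network


def phase1(values):
    """First phase: element-wise min/max over three registers of four values."""
    regs = [values[k:k + LANES] for k in range(0, len(values), LANES)]
    for i, j in NET3:
        lo = [min(p, q) for p, q in zip(regs[i], regs[j])]
        hi = [max(p, q) for p, q in zip(regs[i], regs[j])]
        regs[i], regs[j] = lo, hi
    return regs  # every lane is now ascending from register 0 to register 2


def finish_with_insertion_sort(regs):
    """Write registers back row by row, then insertion-sort the result."""
    data = [x for reg in regs for x in reg]
    for k in range(1, len(data)):
        key, m = data[k], k - 1
        while m >= 0 and data[m] > key:
            data[m + 1] = data[m]
            m -= 1
        data[m + 1] = key
    return data


def finish_with_merge(regs):
    """Transpose so each lane becomes one sorted run, then merge the runs."""
    runs = [list(lane) for lane in zip(*regs)]
    return list(heapq.merge(*runs))


values = [9, 4, 11, 2, 7, 1, 8, 12, 3, 10, 6, 5]
regs = phase1(values)
print(finish_with_insertion_sort(regs))
print(finish_with_merge(regs))
```

The insertion-sort finisher benefits because the row-major data is already partially ordered. The merge finisher benefits because the transposed lanes are complete sorted runs.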

One-pass sorting (register sorting): algorithm input; initial state; align a set of comparators; write values back to memory.

Four-element example: P1 = {comp(a, c), comp(b, d)}; P2 = {comp(a, b), comp(c, d)}; P3 = {comp(b, c)}.
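One way to read the steps P1 to P3 is that all comparators in a step are independent, so their operands can be aligned into two vectors and resolved with one element-wise max and one element-wise min. The sketch below models that idea with Python lists. The gather and scatter lines stand in for the data-movement instructions a real implementation would need. The names `STEPS` and `register_sort` are hypothetical.

```python
STEPS = [
    [("a", "c"), ("b", "d")],  # P1
    [("a", "b"), ("c", "d")],  # P2
    [("b", "c")],              # P3
]


def register_sort(values):
    """values: dict with keys a, b, c, d. Returns a dict with a >= b >= c >= d."""
    v = dict(values)
    for step in STEPS:
        # Align: left operands in one vector, right operands in another.
        left = [v[x] for x, _ in step]
        right = [v[y] for _, y in step]
        # One element-wise max and one element-wise min resolve the whole step.
        large = [max(p, q) for p, q in zip(left, right)]
        small = [min(p, q) for p, q in zip(left, right)]
        # Scatter the results back to their named positions.
        for (x, y), hi, lo in zip(step, large, small):
            v[x], v[y] = hi, lo
    return v


print(register_sort({"a": 7, "b": 2, "c": 5, "d": 9}))
# {'a': 9, 'b': 7, 'c': 5, 'd': 2}
```

In this model the three steps cost three max/min pairs rather than five scalar comparators. The alignment between steps is where data-movement cost would appear on real hardware.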

One concrete example

SSE2 instructions used

The method is also applied to sorting key-pointer pairs and to d-heaps.

Evaluation

Contributions: Effective use of SIMD resources to improve the performance of sorting short sequences, through a reduction in memory references and an increase in ILP.

Contributions:
1. Three algorithms that use the SIMD machinery for efficient in-register sorting of short sequences.
2. A method that uses iterative-deepening search to find fast instruction sequences for moving data within the SIMD registers.
3. An extensive experimental study indicating that the elimination of loads, stores, and branches correlates well with improved performance.
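As a rough illustration of the iterative-deepening idea in the second contribution, the sketch below searches for the shortest sequence of lane permutations that turns one register layout into another. It is a toy model, not the authors' search. The three moves in `MOVES` are hypothetical stand-ins for real shuffle instructions, and all names are invented for this example.

```python
# Hypothetical lane permutations; a real instruction set offers a different set.
MOVES = {
    "swap_pairs": (1, 0, 3, 2),
    "swap_halves": (2, 3, 0, 1),
    "rotate": (1, 2, 3, 0),
}


def permute(state, perm):
    """Apply one move: lane k of the result takes lane perm[k] of the input."""
    return tuple(state[p] for p in perm)


def depth_limited(state, goal, limit, path):
    """Depth-first search that gives up after `limit` further moves."""
    if state == goal:
        return path
    if limit == 0:
        return None
    for name, perm in MOVES.items():
        found = depth_limited(permute(state, perm), goal, limit - 1, path + [name])
        if found is not None:
            return found
    return None


def shortest_sequence(start, goal, max_depth=6):
    """Iterative deepening: try limits 0, 1, 2, ... so the first hit is shortest."""
    for limit in range(max_depth + 1):
        found = depth_limited(start, goal, limit, [])
        if found is not None:
            return found
    return None


print(shortest_sequence((0, 1, 2, 3), (3, 2, 1, 0)))
# ['swap_pairs', 'swap_halves']
```

Raising the depth limit one step at a time keeps memory use at depth-first levels while still returning a minimum-length sequence, which suits a search whose goal is the fewest instructions.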