The Group Runtime Optimization for High-Performance Computing An Install-Time System for Automatic Generation of Optimized Parallel Sorting Algorithms.


The Group Runtime Optimization for High-Performance Computing
An Install-Time System for Automatic Generation of Optimized Parallel Sorting Algorithms
Marek Olszewski and Michael Voss
ECE Department, University of Toronto

PDPTA 2004

Motivation
- Sorting is a fundamental algorithm
- Many algorithmic choices for sorting
- Performance is heavily influenced by:
  - the data being sorted (type, entropy)
  - the target machine being used
- How can we build the best sort for a given machine?
  - An empirical install-time system

Outline of Talk
- Motivation
- An Overview of Sorting Algorithms
- Our install-time empirical system
  - An adaptive hybrid sequential sort
  - An adaptive hybrid parallel sort
- An Evaluation
- Related Work
- Conclusions

An overview of sorting algorithms
- The Art of Computer Programming, Vol. 3 (Knuth)
  - 25 algorithms comprehensively studied
- Comparison sorts
  - Lower bound shown to be Ω(n log n)
  - Examples include: insertion sort, quick sort and merge sort
- Non-comparison sorts
  - Can be linear time, i.e. O(n)
  - But require knowing the range of the data
  - Examples include: radix sort and bucket sort
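
To make the point about non-comparison sorts concrete, here is a minimal counting sort sketch. It is not taken from the talk; the function name `counting_sort` and its `lo`/`hi` parameters are assumptions chosen for illustration. It shows why the range of the data must be known: the running time is O(n + k), where k is the size of that range.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Counting sort for integers known to lie in [lo, hi] (illustrative sketch).
// Runs in O(n + k) time, where k = hi - lo + 1 is the size of the value range.
std::vector<int> counting_sort(const std::vector<int>& input, int lo, int hi) {
    assert(lo <= hi);
    std::vector<std::size_t> counts(static_cast<std::size_t>(hi - lo) + 1, 0);
    for (int v : input) {
        assert(v >= lo && v <= hi);  // the range must be known in advance
        ++counts[static_cast<std::size_t>(v - lo)];
    }
    std::vector<int> output;
    output.reserve(input.size());
    for (std::size_t i = 0; i < counts.size(); ++i) {
        // Emit each value as many times as it was seen.
        output.insert(output.end(), counts[i], lo + static_cast<int>(i));
    }
    return output;
}
```

No element is ever compared with another, which is how the Ω(n log n) bound for comparison sorts is sidestepped.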

An overview of sorting algorithms
- Hybrid sorts
  - Divide and conquer sorts are recursive
  - May be beneficial to switch algorithms
  - Most C++ STL sorts are hybrid sorts
- GNU std::sort is a hybrid sort with pre-defined points to switch between heap sort, quick sort, merge sort and insertion sort
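
As a rough illustration of a hybrid sort, the sketch below uses quick sort for large ranges and switches to insertion sort once a subrange is small. It is not the GNU implementation and not the authors' code; `hybrid_sort`, `insertion_sort` and the cutoff value `kInsertionCutoff` are hypothetical names and numbers.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical switch point; a tuned system would measure this per machine.
constexpr std::size_t kInsertionCutoff = 16;

// Sorts the half-open range [lo, hi) of v with insertion sort.
void insertion_sort(std::vector<int>& v, std::size_t lo, std::size_t hi) {
    for (std::size_t i = lo + 1; i < hi; ++i) {
        int key = v[i];
        std::size_t j = i;
        while (j > lo && v[j - 1] > key) {
            v[j] = v[j - 1];
            --j;
        }
        v[j] = key;
    }
}

// Quick sort that hands small subranges to insertion sort.
void hybrid_sort(std::vector<int>& v, std::size_t lo, std::size_t hi) {
    assert(lo <= hi && hi <= v.size());
    if (hi - lo <= kInsertionCutoff) {
        insertion_sort(v, lo, hi);
        return;
    }
    auto first = v.begin() + static_cast<std::ptrdiff_t>(lo);
    auto last = v.begin() + static_cast<std::ptrdiff_t>(hi);
    int pivot = v[lo + (hi - lo) / 2];
    // Three-way split: [< pivot][== pivot][> pivot]. The middle part is never
    // empty, so both recursive calls work on strictly smaller ranges.
    auto mid1 = std::partition(first, last, [pivot](int x) { return x < pivot; });
    auto mid2 = std::partition(mid1, last, [pivot](int x) { return x == pivot; });
    hybrid_sort(v, lo, static_cast<std::size_t>(mid1 - v.begin()));
    hybrid_sort(v, static_cast<std::size_t>(mid2 - v.begin()), hi);
}
```

Calling `hybrid_sort(v, 0, v.size())` sorts the whole vector. The switch point is fixed here, whereas the system in this talk chooses the algorithm empirically.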

An overview of parallel sorts
- Ideally, O((n log n) / p)
  - If p = n, then O(log n)
  - Several parallel sorts demonstrate this bound, e.g. Column sort
  - Parallelized sequential sorts are often better for low numbers of processors (our focus)
- Parallelized divide and conquer algorithms
  - Effective for small numbers of processors
  - Use a work-queue model
    - Tasks are placed in a shared work-queue
    - Idle processors remove tasks from the queue
  - Good load balance
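
The work-queue model can be sketched as follows. This is an illustrative quick sort built on a mutex-protected queue, not the authors' implementation; `WorkQueue`, `parallel_quicksort` and the value of `kSequentialCutoff` are assumptions. Each worker removes a range from the shared queue and either sorts it directly or partitions it and places the two halves back in the queue.

```cpp
#include <algorithm>
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

using Range = std::pair<std::size_t, std::size_t>;  // half-open [first, second)

constexpr std::size_t kSequentialCutoff = 1024;  // hypothetical value

// Shared queue of ranges that still need sorting.
class WorkQueue {
public:
    void push(std::size_t lo, std::size_t hi) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            tasks_.emplace_back(lo, hi);
            ++pending_;
        }
        ready_.notify_one();
    }
    // Blocks until a task is available; returns false once all work is done.
    bool pop(Range& task) {
        std::unique_lock<std::mutex> lock(mutex_);
        ready_.wait(lock, [this] { return !tasks_.empty() || pending_ == 0; });
        if (tasks_.empty()) return false;
        task = tasks_.front();
        tasks_.pop_front();
        return true;
    }
    // Marks one popped task as finished; wakes all workers when none remain.
    void done() {
        std::lock_guard<std::mutex> lock(mutex_);
        if (--pending_ == 0) ready_.notify_all();
    }
private:
    std::mutex mutex_;
    std::condition_variable ready_;
    std::deque<Range> tasks_;
    std::size_t pending_ = 0;  // tasks queued or currently being processed
};

void parallel_quicksort(std::vector<int>& v, unsigned num_threads) {
    assert(num_threads > 0);
    WorkQueue queue;
    if (!v.empty()) queue.push(0, v.size());
    auto worker = [&v, &queue] {
        Range task;
        while (queue.pop(task)) {
            auto first = v.begin() + static_cast<std::ptrdiff_t>(task.first);
            auto last = v.begin() + static_cast<std::ptrdiff_t>(task.second);
            if (task.second - task.first <= kSequentialCutoff) {
                std::sort(first, last);  // small task: sort it directly
            } else {
                int pivot = *(first + (last - first) / 2);
                auto mid1 = std::partition(first, last, [pivot](int x) { return x < pivot; });
                auto mid2 = std::partition(mid1, last, [pivot](int x) { return x == pivot; });
                // Sub-tasks are queued before done() so pending_ never hits 0 early.
                queue.push(task.first, static_cast<std::size_t>(mid1 - v.begin()));
                queue.push(static_cast<std::size_t>(mid2 - v.begin()), task.second);
            }
            queue.done();
        }
    };
    std::vector<std::thread> threads;
    for (unsigned i = 0; i < num_threads; ++i) threads.emplace_back(worker);
    for (auto& t : threads) t.join();
}
```

Because idle workers take whatever range is next in the queue, uneven partitions do not leave a processor without work, which is the load-balancing property the slide refers to.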

Our install-time system
1. Start: sample input data is provided to the installer
2. Time sorts: random algorithms at each recursive step
3. Calculate the best sorting algorithm for each data set size
4. C4.5 creates a decision tree
5. Convert tree to C++
6. Parallel?
   - If yes, time sorts with different input sizes and work-share points; a work-share cutoff point tree and C++ functions are generated
7. Specialized decision function is placed in the library
8. End
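
The timing step of such an installer might look roughly like the sketch below. It is a simplified stand-in and not the system described in the talk: it times whole sorts rather than individual recursive steps, and the names `SortFn` and `fastest_sort` are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <functional>
#include <string>
#include <utility>
#include <vector>

using SortFn = std::function<void(std::vector<int>&)>;

// Times each candidate on its own copy of the sample data and returns the
// name of the fastest one on this machine.
std::string fastest_sort(const std::vector<std::pair<std::string, SortFn>>& candidates,
                         const std::vector<int>& sample) {
    assert(!candidates.empty());
    std::string best_name;
    auto best_time = std::chrono::steady_clock::duration::max();
    for (const auto& candidate : candidates) {
        std::vector<int> copy = sample;  // every candidate sees identical input
        auto start = std::chrono::steady_clock::now();
        candidate.second(copy);
        auto elapsed = std::chrono::steady_clock::now() - start;
        assert(std::is_sorted(copy.begin(), copy.end()));
        if (elapsed < best_time) {
            best_time = elapsed;
            best_name = candidate.first;
        }
    }
    return best_name;
}
```

Running this for a range of input sizes would give one winner per size, which is the kind of data a decision-tree learner can then generalize.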

Algorithms available to our hybrid sort:

| Algorithm | Description |
|---|---|
| Insertion Sort | O(n^2), but with small lower-order terms. Efficient for small lists. |
| Merge Sort | O(n log n). Subtasks are evenly divided, but it has higher lower-order terms than quick sort. |
| Quick Sort | O(n log n) on average, but O(n^2) in the worst case. Has smaller lower-order terms than merge sort. |
| In-place Merge Sort | O(n log n). Higher constant coefficient than merge sort, but uses less memory. |
| Heap Sort | O(n log n). Non-recursive algorithm. Can do well on medium-sized lists. Higher lower-order terms than quick sort. |

Hybrid Adaptive Sequential Sort
- Use random data to train system
  - Up to 10 million elements
  - Insertion sort not used for large inputs
  - Not all inputs sorted to completion
- Dynamic programming used to find best choice
  - Assume best sort at each subsequent step
  - Per step timings were measured
- C4.5 decision tree used to analyze this data
- C4.5 tree converted to C++ template code
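
To give a feel for what a decision tree converted to C++ could look like, here is a hand-written sketch. The thresholds are invented for illustration and are not results from the talk; `Algorithm`, `choose_algorithm` and `tuned_sort` are hypothetical names. Standard-library routines stand in for the individual sorts, and the decision is made once at the top level rather than at every recursive step.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <vector>

enum class Algorithm { InsertionSort, HeapSort, QuickSort, MergeSort };

// Nested conditions mirror the branches of a decision tree on input size.
// All split points below are hypothetical placeholders.
inline Algorithm choose_algorithm(std::size_t n) {
    if (n <= 4096) {
        if (n <= 32) return Algorithm::InsertionSort;
        return Algorithm::HeapSort;
    }
    if (n <= 1000000) return Algorithm::QuickSort;
    return Algorithm::MergeSort;
}

// Dispatches to the chosen algorithm for the range [first, last).
template <typename It>
void tuned_sort(It first, It last) {
    const auto n = static_cast<std::size_t>(std::distance(first, last));
    switch (choose_algorithm(n)) {
    case Algorithm::InsertionSort:
        // Insertion sort: rotate each element into place in the sorted prefix.
        for (It i = first; i != last; ++i)
            std::rotate(std::upper_bound(first, i, *i), i, std::next(i));
        break;
    case Algorithm::HeapSort:
        std::make_heap(first, last);
        std::sort_heap(first, last);
        break;
    case Algorithm::QuickSort:
        std::sort(first, last);  // stand-in for a quick sort
        break;
    case Algorithm::MergeSort:
        std::stable_sort(first, last);  // stand-in for a merge sort
        break;
    }
}
```

A generated version would be emitted by the installer from the learned tree, so the conditions would differ from machine to machine.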

Hybrid Adaptive Parallel Sort
- Start with sequential hybrid sort
- Determine work-sharing cutoff point
  - When should a thread execute its own tasks
  - When should a thread place tasks in work queue
- The cutoff is the task size below which the work is too small to amortize synchronization costs
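
The effect of a work-sharing cutoff can be sketched as below. This example swaps in `std::async` for the shared work queue to stay short, and uses `std::sort` as a stand-in for the sequential hybrid sort; `parallel_merge_sort` and `share_cutoff` are hypothetical names. Ranges larger than the cutoff are split and shared with another thread, while smaller ranges are sorted by the calling thread.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <future>
#include <vector>

// Sorts the half-open range [lo, hi) of v.
void parallel_merge_sort(std::vector<int>& v, std::size_t lo, std::size_t hi,
                         std::size_t share_cutoff) {
    assert(lo <= hi && hi <= v.size());
    assert(share_cutoff >= 1);  // guarantees that splitting makes progress
    auto first = v.begin() + static_cast<std::ptrdiff_t>(lo);
    auto last = v.begin() + static_cast<std::ptrdiff_t>(hi);
    if (hi - lo <= share_cutoff) {
        // Too small to be worth the synchronization: sort in this thread.
        std::sort(first, last);
        return;
    }
    const std::size_t mid = lo + (hi - lo) / 2;
    // Share the left half with another thread and keep the right half.
    auto left = std::async(std::launch::async, [&v, lo, mid, share_cutoff] {
        parallel_merge_sort(v, lo, mid, share_cutoff);
    });
    parallel_merge_sort(v, mid, hi, share_cutoff);
    left.get();  // wait for the shared half before merging
    std::inplace_merge(first, v.begin() + static_cast<std::ptrdiff_t>(mid), last);
}
```

A cutoff that is too low creates many tiny tasks whose synchronization cost dominates, and one that is too high leaves processors idle, which is why the value is worth measuring on the target machine.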

Methodology: Platforms
- Sequential platforms
  - Linux Intel Pentium GHz Xeon
  - Linux AMD Athlon XP
  - SunOS 5.8 on a 600 MHz SPARC Workstation
- Parallel platform
  - 4 processor 1.6 GHz Intel Xeon SMP
  - Modified SMP kernel (allowed binding)

Methodology: Comparisons
- Adaptive Hybrid Sequential Sort
- Adaptive Hybrid Parallel Sort
- GNU G++ std::sort and std::stable_sort
  - Also hybrid sorts
  - Complex, not easily parallelized
- 8 equally sized merge sorts that called std::sort and std::stable_sort in parallel
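
One plausible reading of the parallelized STL baseline is sketched below: split the input into 8 nearly equal chunks, sort each with `std::sort` in its own thread, then merge the sorted runs. The slides do not give the details, so the function name `chunked_parallel_sort` and the sequential pairwise merge are assumptions.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

void chunked_parallel_sort(std::vector<int>& v, std::size_t num_chunks = 8) {
    assert(num_chunks > 0);
    const std::size_t n = v.size();
    // Chunk i covers [bounds[i], bounds[i + 1]); sizes differ by at most one.
    std::vector<std::size_t> bounds(num_chunks + 1);
    for (std::size_t i = 0; i <= num_chunks; ++i) bounds[i] = i * n / num_chunks;
    auto at = [&v, &bounds](std::size_t i) {
        return v.begin() + static_cast<std::ptrdiff_t>(bounds[i]);
    };

    // Sort every chunk in its own thread.
    std::vector<std::thread> threads;
    for (std::size_t i = 0; i < num_chunks; ++i) {
        threads.emplace_back([&at, i] { std::sort(at(i), at(i + 1)); });
    }
    for (auto& t : threads) t.join();

    // Merge neighbouring sorted runs pairwise until a single run remains.
    for (std::size_t width = 1; width < num_chunks; width *= 2) {
        for (std::size_t i = 0; i + width < num_chunks; i += 2 * width) {
            const std::size_t end = std::min(i + 2 * width, num_chunks);
            std::inplace_merge(at(i), at(i + width), at(end));
        }
    }
}
```

Replacing `std::sort` with `std::stable_sort` in the worker gives the second variant named on the slide.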

Serial Non-Optimized (without -O) Results

Serial Optimized (with -O) Results

Parallel Work-share Cutoff Point

Parallel Non-Optimized (without -O) Results

Parallel Optimized (with -O) Results

Parallel Sort Speedups

Related Work
- Install-time empirical optimization systems
  - ATLAS: Level 3 BLAS
  - FFTW: FFT
- STAPL: Adaptive Parallel C++ Library
  - Uses decision trees like our approach
  - Uses only single-level sorts, not hybrids
  - Not available for comparison
- A Dynamically Tuned Sorting Library (CGO'04)
  - Install-time tuning of sequential sorts
  - Only single-level sorts, not hybrid

Conclusion
- Presented an install-time system for empirically constructing a "best" sorting algorithm for a target machine
- Competitive with STL sort on 1 processor
- Better than a parallelized STL sort on multiple processors