The Influence of Caches on the Performance of Sorting. By: A. LaMarca & R. Ladner. Presenter: Shai Brandes.



Introduction Sorting is one of the most important operations performed by computers. In the days of magnetic tape storage, before modern databases, it was almost certainly the most common operation performed by computers, as most "database" updating was done by sorting transactions and merging them with a master file.

Introduction cont. Since the introduction of caches, main memory continued to grow slower relative to processor cycle times. The time to service a cache miss grew to 100 cycles and more. Cache miss penalties have grown to the point where good overall performance cannot be achieved without good cache performance.

Introduction cont. In the article, the authors investigate, both experimentally and analytically, the potential performance gains that cache-conscious design offers in improving the performance of several sorting algorithms.

Introduction cont. For each algorithm, an implementation variant with potential for good overall performance was chosen. Then the algorithm was optimized, using traditional techniques, to minimize the number of instructions executed. This algorithm forms the baseline for comparison. Memory optimizations were then applied to the baseline comparison sort in order to improve cache performance.

Performance measures The authors concentrate on three performance measures: instruction count, cache misses, and overall performance. The analyses presented here are only approximations, since cache misses cannot be analyzed precisely due to factors such as variation in process scheduling and the operating system's virtual-to-physical page mapping policy.

Main lesson The main lesson from the article is that, because cache miss penalties are growing larger with each new generation of processors, improving an algorithm's overall performance may require increasing the number of instructions executed while, at the same time, reducing the number of cache misses.

Design parameters of caches Capacity – total number of blocks the cache can hold. Block size – the number of bytes that are loaded from and written to memory at a time. Associativity – in an N-way set associative cache, a particular block can be loaded in N different cache locations. Replacement policy – which block is removed from the cache when a new block is loaded.

Which cache are we investigating? In modern machines, more than one cache is placed between the main memory and the processor. [Figure: caches between the processor and the memory, labeled direct-mapped, N-way associative, and fully associative.]

Which cache are we investigating? The largest miss penalty is typically incurred at the cache closest to the main memory, which is usually direct-mapped. Thus, we will focus on improving the performance of direct-mapped caches.

Improve the cache hit ratio Temporal locality – there is a good chance that a data item that has been accessed will be accessed again in the near future. Spatial locality – there is a good chance that subsequently accessed data items are located near each other in memory.

Cache misses Compulsory miss – occurs when a block is first accessed and loaded into the cache. Capacity miss – caused by the fact that the cache is not large enough to hold all the accessed blocks at one time. Conflict miss – occurs when two or more blocks that are mapped to the same location in the cache are accessed.

Measurements n – the number of keys to be sorted. C – the number of blocks in the cache. B – the number of keys that fit in a cache block. [Figure: a cache block holding B keys.]
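The miss categories and the parameters B and C can be made concrete with a toy direct-mapped cache simulator. This is only an illustrative sketch, not anything from the article: the function name `count_misses` and its parameters are assumptions, and addresses are measured in keys rather than bytes.

```python
def count_misses(addresses, num_blocks, block_size):
    """Count misses in a direct-mapped cache for a sequence of key addresses.

    num_blocks plays the role of C and block_size the role of B (in keys).
    Both names are illustrative assumptions.
    """
    cache = [None] * num_blocks          # cache[slot] = memory block currently held
    misses = 0
    for addr in addresses:
        block = addr // block_size       # memory block that contains this key
        slot = block % num_blocks        # direct-mapped: exactly one possible slot
        if cache[slot] != block:
            misses += 1                  # compulsory, capacity, or conflict miss
            cache[slot] = block
    return misses


# Sequential scan of 64 keys with B = 4: one compulsory miss per block.
sequential = count_misses(range(64), num_blocks=8, block_size=4)

# Alternating between two blocks that map to the same slot: every access misses.
conflicting = count_misses([0, 32, 0, 32, 0, 32], num_blocks=8, block_size=4)

print(sequential, conflicting)
```

The sequential scan shows spatial locality at work (one miss per B keys), while the second pattern shows conflict misses even though the cache is almost empty.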

Mergesort Two sorted lists can be merged into a single list by repeatedly adding the smaller key to a single sorted list.
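The merge step described on this slide can be sketched as follows. The function name `merge` and the use of Python lists are assumptions made for illustration.

```python
def merge(left, right):
    """Merge two already sorted lists into one sorted list."""
    merged = []
    i = j = 0
    # Repeatedly move the smaller of the two head keys to the output.
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    # One input is exhausted; the rest of the other is already in order.
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged


print(merge([1, 4, 9], [2, 3, 10]))
```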

Mergesort By treating a set of unordered keys as a set of sorted lists of length 1, the keys can be repeatedly merged together until a single sorted set of keys remains. The iterative mergesort was chosen as the base algorithm.

Mergesort base algorithm Mergesort makes $\lceil \log_2 n \rceil$ passes over the array, where the $i$-th pass merges sorted subarrays of length $2^{i-1}$ into sorted subarrays of length $2^i$. [Figure: the passes $i=1$ and $i=2$.]

Improvements to the base algorithm 1. Alternating the merging process from one array to another to avoid unnecessary copying. 2. Loop unrolling. 3. Sorting subarrays of size 4 with a fast in-line sorting method. Thus, the number of passes is $\lceil \log_2(n/4) \rceil$. If this number is odd, an additional copy pass is needed to return the sorted array to the input array.
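A minimal sketch of an iterative mergesort that alternates between two arrays, as in improvement 1, might look like this. It is an assumption-laden illustration: it starts from runs of length 1 rather than in-line sorted groups of 4, it does no loop unrolling, and the name `iterative_mergesort` is hypothetical.

```python
def iterative_mergesort(keys):
    """Bottom-up mergesort that ping-pongs between two arrays."""
    n = len(keys)
    src = list(keys)        # array read in the current pass
    dst = [None] * n        # array written in the current pass
    width = 1               # length of the sorted runs in src
    while width < n:
        for lo in range(0, n, 2 * width):
            mid = min(lo + width, n)
            hi = min(lo + 2 * width, n)
            i, j, k = lo, mid, lo
            # Merge src[lo:mid] and src[mid:hi] into dst[lo:hi].
            while i < mid and j < hi:
                if src[i] <= src[j]:
                    dst[k] = src[i]
                    i += 1
                else:
                    dst[k] = src[j]
                    j += 1
                k += 1
            while i < mid:
                dst[k] = src[i]
                i += 1
                k += 1
            while j < hi:
                dst[k] = src[j]
                j += 1
                k += 1
        # Swap roles instead of copying the merged data back.
        src, dst = dst, src
        width *= 2
    return src


print(iterative_mergesort([5, 2, 9, 1, 7, 3, 8]))
```

Because the two arrays swap roles after every pass, the sorted result ends up in whichever array was written last, which is why an odd number of passes forces the extra copy mentioned on the slide.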

The problem with the algorithm The base mergesort has the potential for terrible cache performance: if a pass is large enough to wrap around in the cache, keys will be ejected before they are used again. n ≤ BC/2: the entire sort is performed in the cache, with only compulsory misses. BC/2 < n ≤ BC: temporal reuse drops off sharply. BC < n: no temporal reuse. In each pass: 1. The block is accessed in the input array (read or write). 2. The block is accessed in the auxiliary array (write or read). This gives 2 cache misses per block, that is, 2/B cache misses per key.

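[Figure: mergesort for n ≤ BC/2, showing the input array of n keys, the auxiliary array, and the cache. In pass i=1 each read of an input block and each write of an auxiliary block costs 1 miss (4 cache misses in the example). After pass i=1 all blocks are in the cache, so pass i=2 incurs no cache misses.]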

Mergesort analysis For n ≤ BC/2: 2/B misses per key. The entire sort is performed in the cache, so only compulsory misses occur.

Base Mergesort analysis cont. For $BC/2 < n$, the number of misses per key is $\frac{2}{B}\lceil \log_2(n/4) \rceil + \frac{1}{B} + \frac{2}{B}\left(\lceil \log_2(n/4) \rceil \bmod 2\right)$. First term: $\lceil \log_2(n/4) \rceil$ is the number of merge passes. In each pass, each key is moved from a source array to a destination array; every B-th key visited in the source array results in one cache miss, and every B-th key written to the destination array results in one cache miss. Second term: the initial pass that sorts groups of 4 keys incurs 1 compulsory miss per block, thus 1/B misses per key. Third term: if the number of passes is odd, the sorted array must be copied back to the input array.
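As a quick numerical check, the base mergesort miss count can be evaluated directly. This sketch combines the formula above with the 2/B figure for sets that fit in half the cache; the function name and the sample values B = 4 and C = 1024 are assumptions chosen only for illustration.

```python
import math


def base_mergesort_misses_per_key(n, B, C):
    """Approximate cache misses per key for the base mergesort."""
    if n <= B * C / 2:
        return 2 / B                              # whole sort runs in the cache
    passes = math.ceil(math.log2(n / 4))          # merge passes after groups of 4
    merge_misses = (2 / B) * passes               # one read miss and one write miss per block
    initial_pass = 1 / B                          # in-line sort of groups of 4
    final_copy = (2 / B) * (passes % 2)           # extra copy when passes is odd
    return merge_misses + initial_pass + final_copy


# Hypothetical machine: B = 4 keys per block, C = 1024 blocks.
print(base_mergesort_misses_per_key(2**20, 4, 1024))
print(base_mergesort_misses_per_key(2**21, 4, 1024))
```

Doubling n from 2^20 to 2^21 adds one merge pass and, because the pass count becomes odd, also the final copy.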

1st memory optimization: tiled mergesort Improve temporal locality. Phase 1: subarrays of length BC/2 are sorted using mergesort. Phase 2: the arrays are returned to the base mergesort to complete the sorting of the entire array. To avoid the final copy when $\lceil \log_2(n/4) \rceil$ is odd, subarrays of size 2 are sorted in-line if $\log_2 n$ is odd.

Tiled mergesort example Phase 1: mergesort every BC/2 keys. Phase 2: regular mergesort.

Tiled Mergesort analysis For n ≤ BC/2: 2/B misses per key. The entire sort is performed in the cache, so only compulsory misses occur.

Tiled Mergesort analysis cont. For $BC/2 < n$, the number of misses per key is $\frac{2}{B}\lceil \log_2(2n/(BC)) \rceil + \frac{2}{B} + 0$. First term: $\lceil \log_2(2n/(BC)) \rceil$ is the number of merge passes in phase 2; each pass is large enough to wrap around in the cache, so keys are ejected before they are used again, giving 2 misses per block and thus 2/B misses per key. Second term: the initial pass mergesorts groups of BC/2 keys; each merge is done in the cache with 2 compulsory misses per block. Third term: the number of iterations is forced to be even, so there is no need to copy the sorted array back to the input array.

Tiled mergesort cont. The problem: In phase 2 – no reuse is achieved across passes since the set size is larger than the cache. The solution: multi-mergesort

2nd memory optimization: multi-mergesort We replace the final $\lceil \log_2(n/(BC/2)) \rceil$ merge passes of tiled mergesort with a single pass that merges all the subarrays at once. The last pass uses a memory-optimized heap which holds the heads of the subarrays. The number of misses per key due to the use of the heap is negligible for practical values of n, B and C.
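The two phases can be sketched with a heap of subarray heads. This is a simplified illustration under stated assumptions: Python's built-in `sorted` stands in for the in-cache mergesort of each tile, the standard `heapq` binary heap stands in for the memory-optimized heap, and `tile` is a hypothetical parameter playing the role of BC/2.

```python
import heapq


def multi_mergesort(keys, tile):
    """Sort tiles independently, then merge all of them in a single pass."""
    # Phase 1: sort each tile (assumed small enough to fit in the cache).
    runs = [sorted(keys[lo:lo + tile]) for lo in range(0, len(keys), tile)]

    # Phase 2: a heap holds the current head of every sorted run.
    heap = [(run[0], r, 0) for r, run in enumerate(runs)]
    heapq.heapify(heap)
    out = []
    while heap:
        key, r, pos = heapq.heappop(heap)    # smallest head over all runs
        out.append(key)
        if pos + 1 < len(runs[r]):           # advance within the same run
            heapq.heappush(heap, (runs[r][pos + 1], r, pos + 1))
    return out


print(multi_mergesort([9, 3, 7, 1, 8, 2, 6, 4, 5, 0], tile=3))
```

Each key is touched once in phase 1 and once in phase 2, which is the source of the constant number of passes; the price is the extra heap work per key.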

Multi-mergesort example Phase 1: mergesort every BC/2 keys. Phase 2: multi-merge all $\lceil n/(BC/2) \rceil$ subarrays at once.

Multi Mergesort analysis For n ≤ BC/2: 2/B misses per key. The entire sort is performed in the cache, so only compulsory misses occur.

Multi Mergesort analysis cont. For $BC/2 < n$, the number of misses per key is $\frac{2}{B} + \frac{2}{B}$. First term: the initial pass mergesorts groups of BC/2 keys; each merge is done in the cache with 2 compulsory misses per block. The number of iterations is forced to be odd, so that in the next phase the keys are multi-merged from the auxiliary array to the input array. Second term: a single pass merges all the $\lceil n/(BC/2) \rceil$ subarrays at once, with 2 compulsory misses per block, thus 2/B misses per key.
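For comparison, the tiled and multi-mergesort miss counts from the analysis slides can be evaluated side by side. The function names and the sample values B = 4 and C = 1024 below are assumptions for illustration only.

```python
import math


def tiled_mergesort_misses_per_key(n, B, C):
    """Approximate misses per key for the tiled mergesort."""
    if n <= B * C / 2:
        return 2 / B
    merge_passes = math.ceil(math.log2(2 * n / (B * C)))   # phase 2 passes
    return (2 / B) * merge_passes + 2 / B                  # phase 2 + phase 1


def multi_mergesort_misses_per_key(n, B, C):
    """Approximate misses per key for the multi-mergesort."""
    if n <= B * C / 2:
        return 2 / B
    return 2 / B + 2 / B        # phase 1 tiles + one multi-merge pass


# Hypothetical machine: B = 4 keys per block, C = 1024 blocks.
for n in (2**14, 2**17, 2**20):
    print(n,
          tiled_mergesort_misses_per_key(n, 4, 1024),
          multi_mergesort_misses_per_key(n, 4, 1024))
```

The tiled count still grows logarithmically with n, while the multi-mergesort count stays at 4/B regardless of the set size.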

Performance [Figure: instructions per key versus set size in keys for the base, tiled, and multi-mergesort, with the cache size marked. Annotation: the multi-mergesort merges all subarrays in a single pass.]

Performance cont. [Figure: cache misses per key versus set size in keys for the base, tiled, and multi-mergesort, with the cache size marked. Annotations: cache misses increase once the set size is larger than the cache; constant number of cache misses per key; 66% fewer misses than the base.]

Performance cont. [Figure: time (cycles per key, axis from 0 to 200) versus set size in keys for the base, tiled, and multi-mergesort, with the cache size marked. Annotations: worst performance due to the large number of cache misses; executes up to 55% faster than the base; due to increase in instruction count.]

Quicksort A divide-and-conquer algorithm. Choose a pivot. Partition the set around the pivot. Quicksort the left region. Quicksort the right region. At the end of the pass the pivot is in its final position.

Quicksort base algorithm Implementation of optimized Quicksort which was developed by Sedgewick: Rather than sorting small subsets in the natural course of quicksort recursion, they are left unsorted until the very end, at which time they are sorted using insertion sort in a single final pass over the entire array

Insertion sort Sort by repeatedly taking the next item and inserting it into the final data structure in its proper order with respect to items already inserted. [Figure: insertion sort example.]

Quicksort base algorithm cont. Quicksort makes sequential passes, so all keys in a block are always used: excellent spatial locality. Divide and conquer: if a subarray is small enough to fit in the cache, quicksort will incur at most 1 cache miss per block before the subset is fully sorted: excellent temporal locality.

1 st Memory optimization memory tuned quicksort Remove Sedgewick’s insertion sort in the final pass. Instead, sort small subarrays when they are first encountered using insertion sort. Motivation: When a small subarray is encountered it has just been part of a recent partition → all of its keys should be in the cache
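A minimal sketch of this idea follows: small subarrays are insertion sorted as soon as they are encountered instead of in one final pass. The names, the `cutoff` value, and the simple middle-element pivot with a Lomuto-style partition are assumptions for illustration, not the tuned implementation discussed in the article.

```python
def insertion_sort(a, lo, hi):
    """Sort a[lo..hi] in place by inserting each key among those before it."""
    for i in range(lo + 1, hi + 1):
        key = a[i]
        j = i - 1
        while j >= lo and a[j] > key:
            a[j + 1] = a[j]
            j -= 1
        a[j + 1] = key


def partition(a, lo, hi):
    """Partition a[lo..hi] around a middle pivot; return its final index."""
    mid = (lo + hi) // 2
    a[mid], a[hi] = a[hi], a[mid]        # park the pivot at the right end
    pivot = a[hi]
    store = lo
    for i in range(lo, hi):
        if a[i] < pivot:
            a[i], a[store] = a[store], a[i]
            store += 1
    a[store], a[hi] = a[hi], a[store]    # pivot lands in its final position
    return store


def memory_tuned_quicksort(a, lo=0, hi=None, cutoff=8):
    """Quicksort that sorts small subarrays immediately, while still cached."""
    if hi is None:
        hi = len(a) - 1
    if hi - lo + 1 <= cutoff:
        insertion_sort(a, lo, hi)        # keys were just touched by a partition
        return
    p = partition(a, lo, hi)
    memory_tuned_quicksort(a, lo, p - 1, cutoff)
    memory_tuned_quicksort(a, p + 1, hi, cutoff)


data = [23, 5, 17, 42, 8, 15, 4, 16, 1, 30, 11, 2, 19, 7]
memory_tuned_quicksort(data, cutoff=4)
print(data)
```

Compared with deferring all insertion sorting to a final pass, this executes slightly more instructions but avoids re-reading the whole array after it has left the cache.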

2nd memory optimization: multi-quicksort For n ≤ BC there is 1 cache miss per block. Problem: larger sets of keys incur a substantial number of misses. Solution: a single multi-partition pass is performed, which divides the full set into a number of subsets that are likely to be cache sized or smaller.

Feller If $k$ points are placed randomly in a range of length 1, then $P(\text{subrange}_i \ge X) = (1 - X)^k$.

Multi-quicksort cont. Multi-partition the array into $3n/(BC)$ pieces, which requires $(3n/(BC)) - 1$ pivots. By Feller's result, $P(\text{subset}_i \ge BC) = (1 - BC/n)^{(3n/(BC)) - 1}$, and $\lim_{n \to \infty} (1 - BC/n)^{(3n/(BC)) - 1} = e^{-3}$. Thus the percentage of subsets that are larger than the cache is less than 5%.
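The probability bound can be checked numerically. In this sketch the function name and the sample values B = 4 and C = 1024 are assumptions; the formula itself is the one on the slide.

```python
import math


def prob_subset_exceeds_cache(n, B, C):
    """P(a given subset holds at least BC keys) after the multi-partition."""
    pivots = 3 * n / (B * C) - 1
    return (1 - B * C / n) ** pivots


# Hypothetical machine: B = 4 keys per block, C = 1024 blocks.
for n in (2**16, 2**20, 2**24):
    print(n, prob_subset_exceeds_cache(n, 4, 1024))

print("limit e^-3 =", math.exp(-3))
```

For growing n the values approach e^-3, which is just under 0.05, matching the "less than 5%" claim.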

Memory tuned quicksort analysis We analyze the algorithm in two parts. 1. Assumption: partitioning an array of size m costs m/B misses if m > BC and 0 misses if m ≤ BC. 2. Correct the assumption: estimate the undercounted and over-counted cache misses.

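Memory tuned quicksort analysis cont. Let $M(n)$ be the expected number of misses: $M(n) = 0$ for $n \le BC$, and $M(n) = \frac{n}{B} + \frac{1}{n} \sum_{0 \le i \le n-1} \left[ M(i) + M(n-i-1) \right]$ otherwise. The term $n/B$ comes from the assumption that partitioning an array of size $n > BC$ costs $n/B$ misses. The factor $1/n$ arises because there are $n$ places to locate the pivot, so $P(\text{pivot is in the } i\text{-th place}) = 1/n$. $M(i)$ is the number of misses in the left region and $M(n-i-1)$ the number in the right region.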

Memory tuned quicksort analysis cont. The recurrence solves to: $M(n) = 0$ for $n \le BC$, and $M(n) = \frac{2(n+1)}{B} \ln\frac{n+1}{BC+2} + O(1/n)$ otherwise.

Memory tuned quicksort analysis cont. First correction: undercounting the misses when a subproblem first reaches size ≤ BC. We counted it as 0, but this subproblem may have no part in the cache! We add n/B more misses, since there are approximately n keys in all the subproblems that first reach size ≤ BC.

Memory tuned quicksort analysis cont. Second correction: in the very first partitioning there are n/B misses, but not in the subsequent partitionings. At the end of a partitioning, some of the array in the left subproblem is still in the cache, so there are hits that we counted as misses. Note: by the time the algorithm reaches the right subproblem, its data has been removed from the cache.

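Memory tuned quicksort analysis Second correction cont. Let $N(n)$ be the expected number of subproblems of size > BC: $N(n) = 0$ for $n \le BC$, and $N(n) = 1 + \frac{1}{n} \sum_{0 \le i \le n-1} \left[ N(i) + N(n-i-1) \right]$ otherwise. The 1 is there because $n > BC$, so this array itself is a subproblem larger than the cache. The factor $1/n$ arises because there are $n$ places to locate the pivot, so $P(\text{pivot is in the } i\text{-th place}) = 1/n$. $N(i)$ and $N(n-i-1)$ are the numbers of subproblems of size > BC in the left and right regions.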

Memory tuned quicksort analysis Second correction cont. The recurrence solves to: $N(n) = 0$ for $n \le BC$, and $N(n) = \frac{n+1}{BC+2} - 1$ otherwise.

Memory tuned quicksort analysis Second correction cont. In each subproblem of size n > BC: on average, BC/2 keys of the left subproblem are in the cache (1/2 of the cache). R progresses left, so it can access these cached blocks (blue in the figure). L progresses right, so it will eventually access blocks that map to the blue blocks in the cache and replace them. [Figure: the array with the pivot and the L and R pointers, the left subproblem, and the cache.]

Memory tuned quicksort analysis Second correction cont. Assumption: R points to a key in the block located at the right end of the cache. Reminder: this is a direct-mapped cache, so the i-th block will be in cache block i (mod C).

Memory tuned quicksort analysis Second correction cont. 2 possible scenarios. The first: [Figure: the cache, showing the cache blocks to which the blocks under L and R are mapped, with the label "i blocks".] R points to a key in a block which is mapped to one cache block, and R progresses to the blue blocks on the left. L points to a key in a block which is mapped to another cache block; L progresses and replaces the blocks on the right. On average, there will be $(C/2 + i)/2$ hits.

Memory tuned quicksort analysis Second correction cont. The second scenario: [Figure: the cache, showing the cache blocks to which the blocks under L and R are mapped, with the label "i blocks".] R points to a key in a block which is mapped to one cache block, and R progresses to the blue blocks on the left. L points to a key in a block which is mapped to another cache block; L progresses and replaces the blocks on the right. On average, there will be $i + (C/2 - i)/2 = (C/2 + i)/2$ hits.

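Memory tuned quicksort analysis Second correction cont. Number of hits: $\frac{1}{C/2} \sum_{0 \le i < C/2} \frac{C/2 + i}{2} \approx \frac{3C}{8}$. The factor $\frac{1}{C/2}$ arises because L can start on any block with equal probability; the summand is the average number of hits for a given i.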
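The averaging step can be verified by direct summation. The function name `average_hits` and the sample value C = 1024 are assumptions for illustration.

```python
def average_hits(C):
    """Average of (C/2 + i) / 2 over the C/2 equally likely start blocks i."""
    half = C // 2
    return sum((half + i) / 2 for i in range(half)) / half


C = 1024                       # hypothetical number of cache blocks
print(average_hits(C))         # exact average from the sum
print(3 * C / 8)               # the 3C/8 approximation
```

The two values differ by a quarter of a block, which is negligible for realistic cache sizes.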

Memory tuned quicksort analysis Second correction cont. Number of hits not accounted for in the computation of $M(n)$: $\frac{3C}{8} N(n)$, where $N(n)$ is the expected number of subproblems of size > BC, $3C/8$ is the number of hits after a partition, and $M(n)$ is the expected number of misses.

Memory tuned quicksort analysis Second correction cont. The expected number of misses per key for $n > BC$: $\frac{M(n) + \frac{n}{B} - \frac{3C}{8} N(n)}{n} = \frac{2}{B} \ln\frac{n}{BC} + \frac{5}{8B} + \frac{3C}{8n}$.

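Base quicksort analysis Number of cache misses per key: $\frac{2}{B} \ln\frac{n}{BC} + \frac{5}{8B} + \frac{3C}{8n} + \frac{1}{B}$. The first three terms are the same as for the memory tuned quicksort; the extra $1/B$ arises because the base quicksort makes an extra pass at the end to perform the insertion sort.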
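The two quicksort miss counts can be compared numerically. This sketch only evaluates the formulas from the analysis for n > BC; the function names and the sample values B = 4 and C = 1024 are assumptions.

```python
import math


def memory_tuned_quicksort_misses_per_key(n, B, C):
    """Approximate misses per key for the memory tuned quicksort, n > BC."""
    return (2 / B) * math.log(n / (B * C)) + 5 / (8 * B) + 3 * C / (8 * n)


def base_quicksort_misses_per_key(n, B, C):
    """Base quicksort: same terms plus 1/B for the final insertion sort pass."""
    return memory_tuned_quicksort_misses_per_key(n, B, C) + 1 / B


# Hypothetical machine: B = 4 keys per block, C = 1024 blocks.
n = 2**20
tuned = memory_tuned_quicksort_misses_per_key(n, 4, 1024)
base = base_quicksort_misses_per_key(n, 4, 1024)
print(tuned, base)
```

The gap between the two variants is a constant 1/B misses per key, independent of n, while both grow logarithmically with the set size.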

Multi quicksort analysis cont. Number of cache misses per key for n ≤ BC: 1/B (compulsory misses).

Multi quicksort analysis cont. Number of cache misses per key for n > BC: 2/B + 2/B. We partition the input into k = 3n/(BC) pieces. Assumption: each partition is smaller than the cache. We hold k linked lists, one for each partition; 100 keys fit in one linked list node (to minimize storage waste). Each partitioned key is moved to a linked list: 1 miss per block in the input array and 1 miss per block in the linked list. Each partition is then returned to the input array and sorted in place.
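The multi-partition pass can be sketched as follows. This is a loose illustration under several assumptions: pivots are drawn at random from the input, plain Python lists stand in for the linked lists of 100-key nodes, the built-in `sorted` stands in for the in-place quicksort of each partition, and `cache_keys` is a hypothetical parameter playing the role of BC.

```python
import bisect
import random


def multi_quicksort(keys, cache_keys, seed=0):
    """One multi-partition pass into about 3n/BC pieces, then sort each piece."""
    n = len(keys)
    if n <= cache_keys:
        return sorted(keys)                      # already fits in the cache

    pieces = max(2, min(n, 3 * n // cache_keys))
    rng = random.Random(seed)
    pivots = sorted(rng.sample(keys, pieces - 1))

    # One bucket per partition (stand-in for the k linked lists).
    buckets = [[] for _ in range(pieces)]
    for key in keys:
        # Index of the first pivot greater than key selects the bucket.
        buckets[bisect.bisect_right(pivots, key)].append(key)

    # Return each partition to the output in order and sort it.
    out = []
    for bucket in buckets:
        out.extend(sorted(bucket))
    return out


data = list(range(200))
random.Random(1).shuffle(data)
print(multi_quicksort(data, cache_keys=32) == sorted(data))
```

Because the buckets are ordered by pivot, concatenating the individually sorted buckets yields the fully sorted array without any further merging.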

Performance [Figure: instructions per key versus set size in keys for the base, memory tuned, and multi-quicksort, with the cache size marked. Annotations: constant number of additional instructions; multi-partition.]

Performance cont. [Figure: cache misses per key (axis from 0 to 1) versus set size in keys for the base, memory tuned, and multi-quicksort, with the cache size marked. Annotations: the multi-partition usually produces subsets smaller than the cache; 1 miss per key.]

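Performance cont. [Figure: time (cycles per key, axis from 0 to 200) versus set size in keys for the base, memory tuned, and multi-quicksort, with the cache size marked. Annotations on the multi-quicksort: slower due to the increase in instruction cost; if larger sets were sorted, it would have outperformed the other 2 variants.]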