Low Depth Cache-Oblivious Algorithms


Low Depth Cache-Oblivious Algorithms Guy E. Blelloch, Phillip B. Gibbons, Harsha Vardhan Simhadri Slides by Endrias Kahssay

Why Parallel Cache-Oblivious Algorithms? Modern machines have multiple layers of cache – L1, L2, and L3 – with access latencies of roughly 4, 10, and 40 cycles respectively. L1 and L2 are private to each core; L3 is shared across cores.

Depiction of Architecture

Why Parallel Cache-Oblivious Algorithms? Parallel machines are hard to tune for manually because of the scheduler and shared caches. Parallel programs are often memory bound. Cache obliviousness is a natural extension of processor-oblivious code.

Paper Contributions The first parallel sorting algorithm and sparse matrix–vector multiply with optimal cache complexity and low depth. Applications of parallel sorting to other problems such as list ranking. Generalization of the results to multi-level private and shared caches.

Parallel Cache Complexity Qp(n; M, B) < Q(n; M, B) + O(pMD/B) with high probability for private caches under a work-stealing scheduler, where D is the depth of the computation. Qp(n; M + pBD, B) ≤ Q(n; M, B) for shared caches. These results carry over to fork-join parallelism (e.g., Cilk).

Recipe for Parallel Cache-Oblivious Algorithms Design for low depth and low serial cache complexity; this optimizes for both serial and parallel performance.

Parallel Sorting The optimal cache bound for sorting is O((n/B) log_M n). Prior serial cache-optimal cache-oblivious algorithms would have Ω(√n) depth under standard parallelization techniques. The paper proposes a new sorting algorithm that achieves the optimal cache bound with low depth.

Primitives Prefix sum with logarithmic depth and O(n/B) cache complexity. Merge sort with a parallel merge based on picking O(n^{1/3}) pivots via dual binary search, achieving O((n/B) log(n/M)) cache complexity. Matrix transpose with logarithmic depth and O(nm/B) cache complexity.
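The prefix-sum primitive can be sketched with the classic pairwise contract-and-expand recursion. This is a minimal illustrative sketch, not the paper's code: it runs serially here, but the two phases at each level consist of independent element-wise work, so with fork-join parallelism the recursion has O(log n) depth, and each phase scans contiguous blocks for O(n/B) misses.

```python
def prefix_sum(a):
    """Exclusive prefix sum by contraction (illustrative sketch).

    Each level pairs up adjacent elements, recurses on the half-size
    array, then expands the result back; both loops are element-wise
    independent, so the parallel depth is O(log n).
    """
    n = len(a)
    if n == 1:
        return [0]
    # Contraction phase: combine adjacent pairs, halving the input.
    b = [a[2 * i] + a[2 * i + 1] for i in range(n // 2)]
    if n % 2:
        b.append(a[-1])
    s = prefix_sum(b)
    # Expansion phase: even positions come straight from the half-size
    # result; odd positions add back the preceding input element.
    out = [0] * n
    for i in range(n):
        out[i] = s[i // 2] if i % 2 == 0 else s[i // 2] + a[i - 1]
    return out
```

For example, `prefix_sum([1, 2, 3, 4])` returns `[0, 1, 3, 6]`.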

Deterministic Sorting Based on sample sort: select a set of pivots that partition the keys, route all the keys to the appropriate buckets (the main insight of the paper is how to do this in O(log² n) depth), and sort each bucket.

Deterministic Sorting Split the input into √n subarrays of size √n and recursively sort them. Choose every (log n)-th element of each sorted subarray as a pivot candidate. Sort the O(n/log n) pivot candidates using merge sort. Choose √n evenly spaced pivots.
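The pivot-selection step above can be sketched as follows. This is an illustrative reconstruction, not the paper's code (the helper name and the use of Python's built-in sort in place of the low-depth merge sort are our simplifications):

```python
import math

def choose_pivots(subarrays):
    """Pivot selection for the deterministic sample sort (sketch).

    `subarrays` are the ~sqrt(n) recursively sorted subarrays of size
    ~sqrt(n).  Take every ceil(log2 n)-th element of each subarray as a
    pivot candidate, sort the ~n/log n candidates (the paper uses its
    low-depth merge sort here), then keep sqrt(n) evenly spaced pivots.
    """
    n = sum(len(s) for s in subarrays)
    step = max(1, math.ceil(math.log2(n)))
    candidates = []
    for s in subarrays:
        candidates.extend(s[::step])    # every (log n)-th element
    candidates.sort()
    k = math.isqrt(n)                   # number of pivots: ~sqrt(n)
    stride = max(1, len(candidates) // k)
    return candidates[stride - 1::stride][:k]
```

With 4 sorted subarrays covering 0..15, `choose_pivots` returns 4 sorted, evenly spread pivots.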

Deterministic Sorting Use prefix sums and matrix transposition to determine where each bucket's keys should go. Each bucket has at most O(√n log n) elements. Use B-TRANSPOSE, a cache-oblivious recursive divide-and-conquer algorithm, to move the keys to the right buckets.

Transpose Complexity B-TRANSPOSE transfers the keys from a √n × √n matrix into the √n buckets in O(log n) depth and O(n/B) sequential cache complexity.
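The recursive divide-and-conquer pattern behind B-TRANSPOSE is the same as the classic cache-oblivious matrix transpose, sketched below. This is not the paper's B-TRANSPOSE (which additionally routes variable-weight blocks of keys to buckets), just a minimal illustration of the recursion:

```python
def transpose(A, B, r0=0, c0=0, rows=None, cols=None):
    """Cache-oblivious transpose of A into B (illustrative sketch).

    Recursively split the longer dimension in half.  Once a submatrix
    is small enough to fit in cache, it and its image are touched in
    O(size/B) misses; the two recursive calls are independent, so with
    fork-join parallelism the depth is logarithmic.
    """
    if rows is None:
        rows, cols = len(A), len(A[0])
    if rows * cols <= 4:                # base case: transpose directly
        for i in range(rows):
            for j in range(cols):
                B[c0 + j][r0 + i] = A[r0 + i][c0 + j]
    elif rows >= cols:                  # split the longer dimension
        h = rows // 2
        transpose(A, B, r0, c0, h, cols)
        transpose(A, B, r0 + h, c0, rows - h, cols)
    else:
        h = cols // 2
        transpose(A, B, r0, c0, rows, h)
        transpose(A, B, r0, c0 + h, rows, cols - h)
```

The base-case threshold (4 here) stands in for "fits in cache"; a real implementation never needs to know the cache size, which is the point of cache obliviousness.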

Analysis For each node v in the recursion tree, define s(v) to be the size (number of entries) of the submatrix T handled at v, and w(v) to be the total weight of the keys that the submatrix T is responsible for transferring.

Analysis Light-1 node: s(v) < M/100, w(v) < M/10, and s(parent(v)) ≥ M/100. Light-2 node: s(v) < M/100, w(v) < M/10, and w(parent(v)) ≥ M/10. Heavy leaf: w(v) ≥ M/10.

Analysis Every node in the recursion tree is accounted for in the subtree of one of these three node types. There are at most 4n/(M/100) light-1 nodes, each incurring at most M/B cache misses. Let W be the total weight of the heavy leaves; together the heavy leaves incur at most 11W/B misses. There are at most 4(n−W)/(M/10) light-2 leaves, which in total incur O(40(n−W)/B) cache misses.

Analysis Summing the three cases: heavy + light-2 + light-1 ≤ 11W/B + 40(n−W)/B + O(n/B) = O(n/B). The total number of cache misses is thus O(n/B).

Deterministic Sorting Analysis The recurrence solves to Q(n; M, B) = O((n/B) log_M n) sequential cache complexity, O(n log n) work, and O(log² n) depth.
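The recurrence itself is not shown on the slide; reconstructed from the algorithm's structure (notation ours, not quoted from the paper), it has the usual sample-sort shape:

```latex
% sqrt(n) recursive sorts of size sqrt(n), recursive bucket sorts of
% total size n, plus O(n/B) for pivot selection, prefix sums, and the
% transpose-based key routing:
Q(n) \;\le\; \sqrt{n}\,Q(\sqrt{n}) \;+\; \sum_i Q(n_i) \;+\; O(n/B),
\qquad \sum_i n_i = n,\quad n_i = O(\sqrt{n}\log n),
% with base case Q(m) = O(m/B) once m \le M, which solves to
% Q(n; M, B) = O\!\left(\tfrac{n}{B}\,\log_M n\right).
```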

Randomized Sorting Pick O(√n) pivots randomly, and repeat until the largest bucket has fewer than 3√n log n keys. The paper shows that with high probability the algorithm has the same sequential cache complexity, O(n log n) work, and O(log^1.5 n) depth.
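The retry loop can be sketched as below. This is an illustrative reconstruction (the helper name, fixed RNG seed, and the shortcut of measuring bucket sizes against a sorted copy are ours, not the paper's):

```python
import bisect
import math
import random

def pick_random_pivots(keys, rng=random.Random(0)):
    """Randomized pivot selection with rejection (sketch).

    Sample ~sqrt(n) pivots, measure the largest bucket they induce,
    and resample if it has 3*sqrt(n)*log2(n) or more keys.  The paper
    shows this succeeds within a constant number of tries with high
    probability, giving O(log^1.5 n) depth overall.
    """
    n = len(keys)
    ordered = sorted(keys)          # only used to measure bucket sizes
    limit = 3 * math.sqrt(n) * math.log2(n)
    while True:
        pivots = sorted(rng.sample(keys, math.isqrt(n)))
        cuts = [0] + [bisect.bisect_right(ordered, p) for p in pivots] + [n]
        if max(b - a for a, b in zip(cuts, cuts[1:])) < limit:
            return pivots
```

Note that for small n the threshold 3√n log n exceeds n, so the first sample is always accepted; the rejection only bites asymptotically.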

Generalizing to Multiple Caches We look at the PMDH model (Parallel Multi-level Distributed Hierarchy), in which each processor has a multi-level private cache hierarchy. All computation occurs in the level-one caches.

PMDH Caches are inclusive. Cache coherence is maintained at each level by the BACKER protocol, which provides dag consistency. Concurrent writes to the same word are resolved arbitrarily.

Layout of PMDH

PMDH Bound Let Bi be the block size and Mi the size of the level-i cache. Then: Qp(n; Mi, Bi) < Q(n; Mi, Bi) + O(pMiD/Bi).

Strengths The paper does a good job developing a simple model that accounts for cache locality, backed by solid theoretical analysis. Sorting is a core problem in many serial and parallel computations, so it is a very useful primitive. The extension to multi-level caches is valuable.

Weaknesses Co-Sort is only an O(log n / log_M n) factor more cache-efficient than merge sort, yet is significantly more complicated and uses more external memory. No real-world results are presented; these would have been very useful for seeing how well the cache model matches reality.

Future Work Generalizing the results to more realistic cache models. Analyzing parallel cache-oblivious algorithms with high depth.

Questions? Thanks!