Published by Lisandro Cosby. Modified over 10 years ago.
1
R. Dementiev and P. Sanders: Asynchronous Parallel Disk Sorting
Asynchronous Parallel Disk Sorting
Roman Dementiev and Peter Sanders
Max-Planck-Institut für Informatik, Saarbrücken, Germany
Thanks to: A. Crauser, D. Hutchinson, L. Kettner, S. Mitra, A. Morton, N. Rajput, and J. Vitter
2
Motivation & Related Work
Sorting is important for external-memory (EM) problems.
Disks are cheaper than computers.
Prefetch buffers for disk load balancing and overlapping have been studied, but there are no results that guarantee both
– overlapping during the merging of runs
– asymptotically optimal I/O volume
Here: an algorithm and implementation with
– I/O cost approaching the lower bound
– guaranteed almost perfect overlapping between I/O and computation
3
Motivation & Related Work
Theory: [BarvGrovVit97] [SandEgnKorst00] [HutchVitt01] [HutchSandVitt02]
Implementations: [BarveVitter99]; [ChaudhryCormen0102] impl. of distributed memory, parallel disk
Here: [DementievSanders03]
Differences:
– Optimal prefetching algorithm
– Overlapping of I/O and computation
– Completely asynchronous implementation
– Part of library, compatible with STL. Use as simple as:
    stxxl::vector v;
    long long int i = vec_size;       // long long int is a 64-bit data type
    while (i--) v.push_back(f(i));    // fill vector with some values
    stxxl::sort(v.begin(), v.end());  // sort
    // check order using STL predicate
    assert(std::is_sorted(v.begin(), v.end()));
4
External Multi-way Merge Sort
[Figure: the input is cut into chunks of size O(M) (chunk 0, chunk 1, ...), and each chunk is sorted into a run (run 0, run 1, ...). A k-merger with k = O(M/B), using one buffer of size B per run, merges runs 0 to 3 into run 0123 and runs 4 to 7 into run 4567; these are merged again.]
These algorithms do not guarantee overlapping.
[Figure: merging in the D-head model. A D-head disk feeds D prefetch buffers; the merge buffers 1 ... k are merged and the result leaves through D write buffers, D blocks at a time.]
N – input size
M – main memory size
B – block size
D – number of disks
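The two phases shown on this slide, run formation and k-way merging, can be sketched in memory as follows. This is only an illustration under stated assumptions: `form_runs` and `merge_runs` are hypothetical names that do not come from the slides or from any library, plain vectors stand in for runs on disk, and no block I/O, buffering, or overlapping is modelled.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Phase 1 (run formation): cut the input into chunks of at most M elements
// and sort each chunk on its own. M plays the role of the main memory size.
// (Hypothetical helper, not part of any library.)
std::vector<std::vector<int>> form_runs(const std::vector<int>& input, std::size_t M) {
    assert(M > 0);
    using Diff = std::vector<int>::difference_type;
    std::vector<std::vector<int>> runs;
    for (std::size_t pos = 0; pos < input.size(); pos += M) {
        std::size_t end = std::min(pos + M, input.size());
        std::vector<int> run(input.begin() + static_cast<Diff>(pos),
                             input.begin() + static_cast<Diff>(end));
        std::sort(run.begin(), run.end());
        runs.push_back(std::move(run));
    }
    return runs;
}

// Phase 2 (k-way merge): repeatedly extract the smallest head element among
// all runs, using a min-heap of (value, run index) pairs.
std::vector<int> merge_runs(const std::vector<std::vector<int>>& runs) {
    using Item = std::pair<int, std::size_t>;
    std::priority_queue<Item, std::vector<Item>, std::greater<Item>> heap;
    std::vector<std::size_t> next(runs.size(), 0);  // read position per run
    for (std::size_t r = 0; r < runs.size(); ++r)
        if (!runs[r].empty()) heap.push({runs[r][0], r});
    std::vector<int> out;
    while (!heap.empty()) {
        Item top = heap.top();
        heap.pop();
        out.push_back(top.first);
        std::size_t r = top.second;
        if (++next[r] < runs[r].size()) heap.push({runs[r][next[r]], r});
    }
    return out;
}
```

In an external-memory setting each run would be read block by block, which is where the prefetch and write buffers of the figure come in.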
5
Multi-way Merge Sort with Overlapped I/Os
Theorem 1: If I/O and computation can be overlapped, N elements can be sorted in time … where … is the time needed for run formation and … is the time needed for merging.
k = (M/B): # sequences
k = O(N/M): total # runs
6
Overlapping Run Formation
Easy.
Corollary 2: Input of size N => sorted runs of size ~ M/2 in time … (2M in the paper)
[Figure: timelines of the read, sort, and write phases for run numbers 1, 2, ..., k, shown once for the I/O bound case and once for the compute bound case.]
7
Simple External Merging
The smallest element of a block is its trigger.
Sorting the (trigger, block) pairs gives the prediction sequence, the order in which the k-merger needs the blocks.
[Figure: k sorted runs feeding a k-merger; the pairs are sorted to form the prediction sequence.]
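The trigger idea can be sketched in a few lines: take the first element of every block of every sorted run and sort the blocks by that element. The names `BlockRef` and `prediction_sequence` are hypothetical, the runs are held in memory, and the tie-breaking rule by run and block index is an assumption added only to make the result deterministic.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <tuple>
#include <vector>

// Identifies one block: its smallest element (the trigger), the run it
// belongs to, and its position inside that run. (Hypothetical type.)
struct BlockRef {
    int trigger;
    std::size_t run;
    std::size_t index;
};

// Split each sorted run into blocks of B elements and return the blocks
// ordered by trigger. Ties are broken by run and block index (an assumption
// of this sketch).
std::vector<BlockRef> prediction_sequence(const std::vector<std::vector<int>>& runs,
                                          std::size_t B) {
    assert(B > 0);
    std::vector<BlockRef> seq;
    for (std::size_t r = 0; r < runs.size(); ++r)
        for (std::size_t i = 0; i * B < runs[r].size(); ++i)
            // The first element of a block of a sorted run is its minimum.
            seq.push_back(BlockRef{runs[r][i * B], r, i});
    std::sort(seq.begin(), seq.end(), [](const BlockRef& a, const BlockRef& b) {
        return std::tie(a.trigger, a.run, a.index) < std::tie(b.trigger, b.run, b.index);
    });
    return seq;
}
```

For the runs {1, 2, 7, 8} and {3, 4, 5, 6} with B = 2, the blocks are needed in the order run 0 block 0, run 1 block 0, run 1 block 1, run 0 block 1.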
8
Overlapping I/O and Merging
Prediction of the delay between issuing two reads is not easy:
[Figure: k runs, each with the element pattern "1 | B-1 | 2 3 | B-1 | 4 5 | B-1 | 6 ...", feeding the merger.]
9
Overlapping I/O and Merging
Prediction of the delay between issuing two reads is not easy.
Solution: 2D write buffers and k+3D overlap buffers.
[Figure: the merging pipeline before the change (D write buffers, merge buffers 1 ... k, D prefetch buffers, D-head disk) and after it (2D write buffers, merge buffers 1 ... k, k+3D overlap buffers, D prefetch buffers, D-head disk).]
10
Overlapping I/O and Merging [contd.]
Theorem 4: Merging k sequences with a total of N elements takes time …
– ℓ: time to produce one element of output
– L: time to input or output D arbitrary blocks
Our I/O thread strategy:
– DB elements in write buffer => output step
– input step
To prove Theorem 4 we distinguish two cases:
– I/O bound case: 2L ≥ DBℓ
– Compute bound case: 2L < DBℓ
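The case distinction used in the proof can be written as a one-line test. The sketch below assumes that the condition compares 2L, the time for one input step plus one output step, against DB times the per-element merge time (written `ell` here), the time the merger needs to produce DB elements. The names `MergeCase` and `classify_merge` are hypothetical and the parameter values in the usage note are made up.

```cpp
#include <cassert>

enum class MergeCase { IoBound, ComputeBound };

// L:   time to input or output D arbitrary blocks
// D:   number of disks
// B:   block size in elements
// ell: time to produce one element of output
// Assumption of this sketch: the I/O bound case is 2L >= D*B*ell.
MergeCase classify_merge(double L, double D, double B, double ell) {
    assert(L > 0 && D > 0 && B > 0 && ell > 0);
    // One output step plus one input step cost 2L;
    // merging D*B elements costs D*B*ell.
    return (2.0 * L >= D * B * ell) ? MergeCase::IoBound : MergeCase::ComputeBound;
}
```

With D = 4, B = 1000, and ell = 0.001 the merger needs about 4 time units per DB elements, so L = 1 falls into the compute bound case and L = 3 into the I/O bound case.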
11
Compute Bound Case
Lemma 6: In the compute bound case, after k/D+1 steps the merging thread never blocks until all elements are merged.
Notation:
– …: # of elements in the write buffer
– r: # of elements in the overlap and merge buffers
Our I/O thread strategy:
– DB elements in write buffer => output step
– input step
[Figure: state diagram over r (ticks from kB to kB+3DB) and the write buffer fill (ticks from 0 to 2DB), with regions labelled fetch, output, I/O blocking, and merge thread blocked.]
12
I/O Bound Case
Lemma 7: In the I/O bound case the I/O thread never blocks until all input blocks are fetched.
Our I/O thread strategy (the same):
– DB elements in write buffer => output step
– input step
Proof: similar to the compute bound case.
[Figure: state diagram over r (ticks from kB to kB+3DB) and the write buffer fill (ticks from 0 to 2DB), with regions labelled fetch, output, and blocking.]
13
Multi-head Model, Multi-disk Model
Randomized Cyclic Allocation [VitHut01] balances accesses across the disks.
Optimal prefetching from [HutchSanVitt01, SanEgnKor00]
Corollary 8: For any ε > 0, with a prefetch buffer of size m = (D/ε), merging k sequences with a total of N elements can be implemented to run in time …
[Figure: merging pipeline in the multi-disk model: 2D write buffers, merge buffers 1 ... k, k+O(D) overlap buffers, and O(D) prefetch buffers (the read buffers), with stages labelled merging, overlapping, and disk scheduling.]
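One way to picture randomized cyclic allocation is sketched below: each run draws its own random permutation of the D disks and then places its blocks on the disks in that cyclic order, so block i goes to disk perm[i mod D]. This is a reading of the idea cited as [VitHut01], not code from the slides; the name `allocate_run` is hypothetical, and passing the random generator in from outside is a choice of this sketch.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <random>
#include <vector>

// Returns, for each of the num_blocks blocks of one run, the disk it is
// placed on. The run draws one random permutation of the D disks and then
// cycles through it. (Hypothetical helper, not part of any library.)
std::vector<std::size_t> allocate_run(std::size_t num_blocks, std::size_t D,
                                      std::mt19937& rng) {
    assert(D > 0);
    std::vector<std::size_t> perm(D);
    std::iota(perm.begin(), perm.end(), std::size_t{0});  // 0, 1, ..., D-1
    std::shuffle(perm.begin(), perm.end(), rng);          // random disk order
    std::vector<std::size_t> disk_of_block(num_blocks);
    for (std::size_t i = 0; i < num_blocks; ++i)
        disk_of_block[i] = perm[i % D];
    return disk_of_block;
}
```

Any D consecutive blocks of a run then land on D distinct disks, while different runs start their cycles at unrelated positions.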
14
Implementation Issues
Implementation (part of library):
– Forming runs: key sorting, efficient two-pass MSD radix sort
– Multi-way merging: [Sanders00]
– prefetch buffer + overlap buffer = read buffer
– Asynchronous I/O (POSIX threads)
– No superfluous copying:
  User blocks are transferred by DMA (O_DIRECT flag)
  Buffers are passed by pointer between pools
[Figure: layer diagram of the library with the labels: Applications; STL + more structure; Sorting, Priority queues, Search trees; Caching, Mem. management, Disk scheduling; OS; POSIX threads, File system.]
15
Hardware
3000 Euro, 375 MByte/s, July 2002
16
Single Disk Performance
LEDA-SM, TPIE
2 GB volume, 32-bit keys, runs of size 256 MB, g++ 3.2 -O6
I/O bandwidth 45.4 MByte/s = 95% of the peak bandwidth of the disk
17
Multiple Disk Performance
2 GB volume, 32-bit keys, runs of size 256 MB, g++ 3.2 -O6
TPIE, LEDA-SM – Linux Soft-RAID, 8 x 128 KByte stripes
Bandwidth of 315 MByte/s
18
Element Size
16 GB volume, 32-bit keys, runs of size 256 MB, g++ 3.2 -O6
64: merging is I/O bound
128: run formation is I/O bound
Small elements require special treatment
19
Optimal Prefetching
[Figure: the merging pipeline shown twice, without and with O(D) prefetch buffers: 2D write buffers, merge buffers 1 ... k, k+O(D) overlap buffers, read buffers.]
20
Block Size
Block sizes of several MB => good performance
Leave room for read and write buffers
Naive striping implies a factor 8 block size reduction
Dramatic run time increase
21
Large Inputs
One-pass, curves go up because of
– Slower zones
– Seek times
22
Summary
Parallel disk sorting with high performance on state-of-the-art hardware and with theoretical performance guarantees
Bandwidth is no longer a limiting factor for external memory algorithms
The price-performance ratio can improve by adding disks
23
?