R. Dementiev and P. Sanders: Asynchronous Parallel Disk Sorting 1 Asynchronous Parallel Disk Sorting Roman Dementiev and Peter Sanders Max-Planck-Institut.

Slides:

Advertisements

Similar presentations

Variations of the Turing Machine

Advertisements

Introduction to Algorithms

Chapter 6 I/O Systems.

Chapter 13: I/O Systems I/O Hardware Application I/O Interface

Analysis of Computer Algorithms

1 Vorlesung Informatik 2 Algorithmen und Datenstrukturen (Parallel Algorithms) Robin Pomplun.

1 Copyright © 2013 Elsevier Inc. All rights reserved. Chapter 4 Computing Platforms.

1 Copyright © 2010, Elsevier Inc. All rights Reserved Fig 2.1 Chapter 2.

By D. Fisher Geometric Transformations. Reflection, Rotation, or Translation 1.

Business Transaction Management Software for Application Coordination 1 Business Processes and Coordination.

Introduction to Algorithms 6.046J/18.401J

Jeopardy Q 1 Q 6 Q 11 Q 16 Q 21 Q 2 Q 7 Q 12 Q 17 Q 22 Q 3 Q 8 Q 13

Jeopardy Q 1 Q 6 Q 11 Q 16 Q 21 Q 2 Q 7 Q 12 Q 17 Q 22 Q 3 Q 8 Q 13

DIVIDING INTEGERS 1. IF THE SIGNS ARE THE SAME THE ANSWER IS POSITIVE 2. IF THE SIGNS ARE DIFFERENT THE ANSWER IS NEGATIVE.

Year 6 mental test 5 second questions

Chapter 5 Input/Output 5.1 Principles of I/O hardware

Chapter 6 File Systems 6.1 Files 6.2 Directories

Robust Window-based Multi-node Technology- Independent Logic Minimization Jeff L.Cobb Kanupriya Gulati Sunil P. Khatri Texas Instruments, Inc. Dept. of.

1 Maintaining Packet Order in Two-Stage Switches Isaac Keslassy, Nick McKeown Stanford University.

SE-292 High Performance Computing

Chapter 5 : Memory Management

The Bare Basics Storing Data on Disks and Files

Storing Data: Disks and Files

Copyright © 2009 EMC Corporation. Do not Copy - All Rights Reserved.

1 RAID Overview n Computing speeds double every 3 years n Disk speeds cant keep up n Data needs higher MTBF than any component in system n IO.

Databasteknik Databaser och bioinformatik Data structures and Indexing (II) Fang Wei-Kleiner.

Mehdi Naghavi Spring 1386 Operating Systems Mehdi Naghavi Spring 1386.

OPERATING SYSTEMS 11 - DISK PIETER HARTEL 1. Hardware Intended for humans, local devices and communication (challenge?) Hardware support for interrupts.

Indexing Large Data COMP # 22

Chapter 4 Memory Management Basic memory management Swapping

Data Structures Using C++

FIFO Queues CSE 2320 – Algorithms and Data Structures Vassilis Athitsos University of Texas at Arlington 1.

1 Joe Meehean. Ordered collection of items Not necessarily sorted 0-index (first item is item 0) Abstraction 2 Item 0 Item 1 Item 2 … Item N.

ABC Technology Project

Project 5: Virtual Memory

Chapter 9 -- Simplification of Sequential Circuits.

1 Overview Assignment 4: hints Memory management Assignment 3: solution.

CS 241 Spring 2007 System Programming 1 Memory Replacement Policies Lecture 32 Klara Nahrstedt.

Chapter 10: Virtual Memory

Virtual Memory II Chapter 8.

1 Chapter 4 Greedy Algorithms Slides by Kevin Wayne. Copyright © 2005 Pearson-Addison Wesley. All rights reserved.

25 July, 2014 Martijn v/d Horst, TU/e Computer Science, System Architecture and Networking 1 Martijn v/d Horst

25 July, 2014 Martijn v/d Horst, TU/e Computer Science, System Architecture and Networking 1 Martijn v/d Horst

5 August, 2014 Martijn v/d Horst, TU/e Computer Science, System Architecture and Networking 1 Martijn v/d Horst

Making Time-stepped Applications Tick in the Cloud Tao Zou, Guozhang Wang, Marcos Vaz Salles*, David Bindel, Alan Demers, Johannes Gehrke, Walker White.

External Sorting The slides for this text are organized into chapters. This lecture covers Chapter 11. Chapter 1: Introduction to Database Systems Chapter.

Chapter 6 File Systems 6.1 Files 6.2 Directories

CS 241 Spring 2007 System Programming 1 Memory Implementation Issues Lecture 33 Klara Nahrstedt.

3.1 Silberschatz, Galvin and Gagne ©2009 Operating System Concepts – 8 th Edition Process An operating system executes a variety of programs: Batch system.

Processes Management.

KAIST Computer Architecture Lab. The Effect of Multi-core on HPC Applications in Virtualized Systems Jaeung Han¹, Jeongseob Ahn¹, Changdae Kim¹, Youngjin.

© 2004, D. J. Foreman 1 Scheduling & Dispatching.

Addition 1’s to 20.

25 seconds left…...

SE-292 High Performance Computing

We will resume in: 25 Minutes.

SE-292 High Performance Computing Memory Hierarchy R. Govindarajan

INTEL CONFIDENTIAL Threading for Performance with Intel® Threading Building Blocks Session:

1 COMP 206: Computer Architecture and Implementation Montek Singh Wed., Oct. 23, 2002 Topic: Memory Hierarchy Design (HP3 Ch. 5) (Caches, Main Memory and.

CS 4432lecture #31 CS4432: Database Systems II Lecture #3 Professor Elke A. Rundensteiner.

External Sorting Chapter 13.. Why Sort? A classic problem in computer science! Data requested in sorted order  e.g., find students in increasing gpa.

Duality between Reading and Writing with Applications to Sorting Jeff Vitter Department of Computer Science Center for Geometric & Biological Computing.

CPSC 404, Laks V.S. Lakshmanan1 External Sorting Chapter 13 (Sec ): Ramakrishnan & Gehrke and Chapter 11 (Sec ): G-M et al. (R2) OR.

CPSC 404, Laks V.S. Lakshmanan1 External Sorting Chapter 13: Ramakrishnan & Gherke and Chapter 2.3: Garcia-Molina et al.

Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke1 External Sorting Chapters 13: 13.1—13.5.

External Sorting Sort n records/elements that reside on a disk.

Presentation transcript:

R. Dementiev and P. Sanders: Asynchronous Parallel Disk Sorting 1 Asynchronous Parallel Disk Sorting Roman Dementiev and Peter Sanders Max-Planck-Institut für Informatik, Saarbrücken, Germany Thanks to: A. Crauser, D. Hutchinson, L. Kettner, S. Mitra, A. Morton, N. Rajput, and J. Vitter

R. Dementiev and P. Sanders: Asynchronous Parallel Disk Sorting 2 Motivation & Related Work Sorting is important for EM problems Disks are cheaper than computers Prefetch buffers for load disk balancing and overlapping have been studied but no results which guarantee – overlapping during merging of runs – asymptotically optimal I/O volume Here: algorithm & implementation – I/O cost lower bound – guarantees almost perfect overlapping between I/O and computation

R. Dementiev and P. Sanders: Asynchronous Parallel Disk Sorting 3 Motivation & Related Work [ChaudhryCormen0102] impl. of distributed memory, parallel disk Theory: Implementations: Differences: – Optimal prefetching algorithm – Overlapping of I/O and computation – Completely asynchronous implementation – Part of library, compatible with STL. Use as simple as: stxxl::vector v; long long int i=vec_size; // long long int is a 64-bit data type while(i--) v.push_back(f(i)); // fill vector with some values stxxl::sort(v.begin(), v.end()); // sort // check order using STL predicate assert(std::is_sorted(v.begin(), v.end())); [BarvGrovVit97][SandEgnKorst00][HutchVitt01][HutchSandVitt02] Here [DementievSanders03] [BarveVitter99]

R. Dementiev and P. Sanders: Asynchronous Parallel Disk Sorting 4 External Multi-way Merge Sort run 0 input : O(M) chunk 0chunk 1chunk 2chunk 3chunk 4 run 1run 2run 3 chunk 5chunk 6 … B buffers BBB k=O(M/B) k-merger run 4run 5run 6run 7 BBBB k-merger chunk 7chunk 8 k run 0123 B run 4567 B buffers … These algorithms do not guarantee overlapping D write buffers k ….. merge elements D blocks merge buffers merging D prefetch buffers D-head disk D-head model N – input size M – main memory size B – block size D – number of disks

R. Dementiev and P. Sanders: Asynchronous Parallel Disk Sorting 5 Multi-way Merge Sort with Overlapped I/Os Theorem 1: Theorem 1: If I/O and computation can be overlapped, N elements can be sorted in time where is the time needed for run formation, is the time needed for merging k= (M/B) : # sequences, k = O(N/M): total # runs.

R. Dementiev and P. Sanders: Asynchronous Parallel Disk Sorting 6 Overlapping Run Formation Easy: Corollary 2: Corollary 2: Input of size N => sorted runs of size ~ M/2 in time 2M in the paper time read sort write read sort write k-1k-1 k-1 k k k k-1 k k k I/O bound case compute bound case run numbers

R. Dementiev and P. Sanders: Asynchronous Parallel Disk Sorting 7 Simple External Merging Smallest element of a block is a trigger k sorted runs …. merger sort pairs prediction sequence : …. k-merger ….

R. Dementiev and P. Sanders: Asynchronous Parallel Disk Sorting 8 Overlapping I/O and Merging Prediction of delay between issuing two reads is not easy: 1 B-1 23 B-1 45 B-1 6… 1 B-1 23 B-1 45 B-1 6… 1 B-1 23 B-1 45 B-1 6… k … merger

R. Dementiev and P. Sanders: Asynchronous Parallel Disk Sorting 9 Overlapping I/O and Merging Prediction of delay between issuing two reads is not easy: Solution: 23 B-1 45 B-1 6… 23 B-1 45 B-1 6… 23 B-1 45 B-1 6… k … merger D write buffers k ….. merge elements D blocks merge buffers merging D prefetch buffers D-head disk 2D write buffers k ….. merge elements D blocks merge buffers merging D prefetch buffers D-head disk k+3D overlap buffers

R. Dementiev and P. Sanders: Asynchronous Parallel Disk Sorting 10 Overlapping I/O and Merging [contd.] Theorem 4: Theorem 4: Merging k sequences with a total of N elements takes time – : time to produce one element of output – L: time to input or output D arbitrary blocks Our I/O thread strategy: – DB elements in write buffer => output step – input step To prove Theorem 4 we distinguish two cases – I/O bound case: 2L DB – Compute bound case: 2L< DB

R. Dementiev and P. Sanders: Asynchronous Parallel Disk Sorting 11 Compute Bound Case Lemma 6: Lemma 6: In compute bound case after k/D+1 steps, the merging thread never blocks until all elements are merged. Notation: – # elem. in write buffer – r # of elem. in the overlap and merge buffers Our I/O thread strategy: – DB elements in write buffer => output step – input step r kB kB+DB kB+2DB kB+3DB kB+y L/ kB+2 L/ 0DB 2DB- L/ 2DB merge thread blocked I/O blocking fetch output

R. Dementiev and P. Sanders: Asynchronous Parallel Disk Sorting 12 I/O Bound Case Lemma 7: Lemma 7: In I/O bound case the I/O thread never blocks until all input blocks are fetched. Our I/O thread strategy (the same): – DB elements in write buffer => output step – input step Proof: Proof: similar to compute bound case r blocking fetch output DB- L/ DB2DB- L/ 2DB kB+3DB kB+2DB+ L/ kB+2DB kB+DB+ L/ kB+DB 0 kB+ L/

R. Dementiev and P. Sanders: Asynchronous Parallel Disk Sorting 13 Multi-head Model Multi-disk Model Randomized Cyclic Allocation [VitHut01] makes accesses in balanced Optimal prefetching from [HutchSanVitt01,SanEgnKor00] Corollary 8: Corollary 8: For any > 0, prefetch buffer of size m= (D/ ) merging k sequences with a total N elements can be implemented to run in time … … … … -ping 2D write buffers … O(D) prefetch buffers k+O(D) overlap buffers k ….. merge elements read buffers D blocks merge buffers merging overlap-disk scheduling

R. Dementiev and P. Sanders: Asynchronous Parallel Disk Sorting 14 Implementation Issues Implementation (part of library) – Forming runs: Key sorting, efficient two passes MSD radix sort – Multi-way merging: [Sanders00] – prefetch buffer + overlap buffer = read buffer – Asynchronous I/O (POSIX threads) – No superfluous copying User blocks are transferred by DMA ( O_DIRECT flag) Buffers are passed by pointer between pools Applications Sorting, Priority queues, Search trees OS Caching, Mem. management Disk scheduling POSIX threads, File system STL + more structure

R. Dementiev and P. Sanders: Asynchronous Parallel Disk Sorting 15 Hardware 3000 Euro, 375 MByte/s, July 2002

R. Dementiev and P. Sanders: Asynchronous Parallel Disk Sorting 16 Single Disk Performance LEDA-SM, TPIE 2 GB volume, 32-bit keys, runs of size 256 MB, g –O6 I/O bandwidth 45.4 Mbyte/s = 95% of peak bandwidth of the disk

R. Dementiev and P. Sanders: Asynchronous Parallel Disk Sorting 17 Multiple Disk Performance 2 GB volume, 32-bit keys, runs of size 256 MB, g –O6 TPIE, LEDA-SM – Linux Soft-RAID 8X128KByte stripes Bandwidth of 315 MByte/s

R. Dementiev and P. Sanders: Asynchronous Parallel Disk Sorting 18 Element Size 16GB volume, 32-bit keys, runs of size 256 MB, g –O6 64 merging is I/O bound 128 run formation is I/O bound Small elements require special treatment

R. Dementiev and P. Sanders: Asynchronous Parallel Disk Sorting 19 Optimal Prefetching 2D write buffers … k+O(D) overlap buffers k ….. merge read buffers D blocks merge buffers merging 2D write buffers … O(D) prefetch buffers k+O(D) overlap buffers k ….. merge read buffers D blocks merge buffers merging

R. Dementiev and P. Sanders: Asynchronous Parallel Disk Sorting 20 Block Size Block sizes of several MB => good performance Leave room for read and write buffers Naive striping implies factor 8 block size reduction Dramatic run time increase

R. Dementiev and P. Sanders: Asynchronous Parallel Disk Sorting 21 Large Inputs One-pass, curves go up because of – Slower zones – Seek times

R. Dementiev and P. Sanders: Asynchronous Parallel Disk Sorting 22 Summary Parallel disk sorting with high performance on state of art hardware with theoretical performance guarantees Bandwidth is no longer a limiting factor for external memory algorithms Price performance ratio can improve by adding disks

R. Dementiev and P. Sanders: Asynchronous Parallel Disk Sorting 23 ?