External Sorting Sort n records/elements that reside on a disk. Space needed by the n records is very large.  n is very large, and each record may be.

Slides:



Advertisements
Similar presentations
CS 400/600 – Data Structures External Sorting.
Advertisements

Sorting Really Big Files Sorting Part 3. Using K Temporary Files Given  N records in file F  M records will fit into internal memory  Use K temp files,
Equality Join R X R.A=S.B S : : Relation R M PagesN Pages Relation S Pr records per page Ps records per page.
CS252: Systems Programming Ninghui Li Program Interview Questions.
Divide and Conquer. Recall Complexity Analysis – Comparison of algorithm – Big O Simplification From source code – Recursive.
Database Management Systems, R. Ramakrishnan and J. Gehrke1 External Sorting Chapter 11.
© Copyright 2012 by Pearson Education, Inc. All Rights Reserved. 1 Chapter 17 Sorting.
Liang, Introduction to Java Programming, Eighth Edition, (c) 2011 Pearson Education, Inc. All rights reserved Chapter 24 Sorting.
Nattee Niparnan. Recall  Complexity Analysis  Comparison of Two Algos  Big O  Simplification  From source code  Recursive.
1 External Sorting Chapter Why Sort?  A classic problem in computer science!  Data requested in sorted order  e.g., find students in increasing.
External Sorting “There it was, hidden in alphabetical order.” Rita Holt R&G Chapter 13.
Chapter 7: Sorting Algorithms
Disk Access Model. Using Secondary Storage Effectively In most studies of algorithms, one assumes the “RAM model”: –Data is in main memory, –Access to.
FALL 2004CENG 351 Data Management and File Structures1 External Sorting Reference: Chapter 8.
External Sorting R & G Chapter 13 One of the advantages of being
FALL 2006CENG 351 Data Management and File Structures1 External Sorting.
External Sorting R & G Chapter 11 One of the advantages of being disorderly is that one is constantly making exciting discoveries. A. A. Milne.
Cache Memories May 5, 2008 Topics Generic cache memory organization Direct mapped caches Set associative caches Impact of caches on performance EECS213.
Using Secondary Storage Effectively In most studies of algorithms, one assumes the "RAM model“: –The data is in main memory, –Access to any item of data.
Preliminaries Multiway trees have nodes with greater than two children. Multiway trees of order k have nodes with most k children Trees –For all.
Liang, Introduction to Java Programming, Sixth Edition, (c) 2007 Pearson Education, Inc. All rights reserved L18 (Chapter 23) Algorithm.
Chapter 7 (Part 2) Sorting Algorithms Merge Sort.
Improve Run Generation Overlap input,output, and internal CPU work. Reduce the number of runs (equivalently, increase average run length). DISK MEMORY.
Fast matrix multiplication; Cache usage
External Sorting Chapter 13.. Why Sort? A classic problem in computer science! Data requested in sorted order  e.g., find students in increasing gpa.
Chapter 8 File Processing and External Sorting. Primary vs. Secondary Storage Primary storage: Main memory (RAM) Secondary Storage: Peripheral devices.
External Sorting Problem: Sorting data sets too large to fit into main memory. –Assume data are stored on disk drive. To sort, portions of the data must.
Lecture 11: DMBS Internals
Merge Sort. What Is Sorting? To arrange a collection of items in some specified order. Numerical order Lexicographical order Input: sequence of numbers.
ICS 321 Fall 2011 Overview of Storage & Indexing (i) Asst. Prof. Lipyeow Lim Information & Computer Science Department University of Hawaii at Manoa 11/9/20111Lipyeow.
Sorting.
Indexing.
External Storage Primary Storage : Main Memory (RAM). Secondary Storage: Peripheral Devices –Disk Drives –Tape Drives Secondary storage is CHEAP. Secondary.
CPSC 404, Laks V.S. Lakshmanan1 External Sorting Chapter 13: Ramakrishnan & Gherke and Chapter 2.3: Garcia-Molina et al.
1 External Sorting. 2 Why Sort?  A classic problem in computer science!  Data requested in sorted order  e.g., find students in increasing gpa order.
DMBS Internals I. What Should a DBMS Do? Store large amounts of data Process queries efficiently Allow multiple users to access the database concurrently.
Java Methods Big-O Analysis of Algorithms Object-Oriented Programming
Computer Organization. This module surveys the physical resources of a computer system.  Basic components  CPU  Memory  Bus  I/O devices  CPU structure.
Review 1 Selection Sort Selection Sort Algorithm Time Complexity Best case Average case Worst case Examples.
SNU IDB Lab. Ch4. Performance Measurement © copyright 2006 SNU IDB Lab.
Liang, Introduction to Java Programming, Sixth Edition, (c) 2007 Pearson Education, Inc. All rights reserved Chapter 23 Algorithm Efficiency.
External Sorting Adapt fastest internal-sort methods.
Liang, Introduction to Java Programming, Ninth Edition, (c) 2013 Pearson Education, Inc. All rights reserved. 1 Chapter 25 Sorting.
Introduction to Database Systems1 External Sorting Query Processing: Topic 0.
Chapter 4, Part II Sorting Algorithms. 2 Heap Details A heap is a tree structure where for each subtree the value stored at the root is larger than all.
DMBS Internals I February 24 th, What Should a DBMS Do? Store large amounts of data Process queries efficiently Allow multiple users to access the.
DMBS Internals I. What Should a DBMS Do? Store large amounts of data Process queries efficiently Allow multiple users to access the database concurrently.
Chapter 9: Sorting1 Sorting & Searching Ch. # 9. Chapter 9: Sorting2 Chapter Outline  What is sorting and complexity of sorting  Different types of.
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke1 External Sorting Chapter 13.
1 Writing Cache Friendly Code Make the common case go fast  Focus on the inner loops of the core functions Minimize the misses in the inner loops  Repeated.
Vassar College 1 Jason Waterman, CMPU 224: Computer Organization, Fall 2015 Cache Memories CMPU 224: Computer Organization Nov 19 th Fall 2015.
CMPT 238 Data Structures More on Sorting: Merge Sort and Quicksort.
1 Lecture 16: Data Storage Wednesday, November 6, 2006.
Liang, Introduction to Java Programming, Tenth Edition, (c) 2013 Pearson Education, Inc. All rights reserved. 1 Chapter 23 Sorting.
Cache Memories CSE 238/2038/2138: Systems Programming
Lecture 16: Data Storage Wednesday, November 6, 2006.
External Sorting Sort n records/elements that reside on a disk.
Lecture 11: DMBS Internals
External Sorting Adapt fastest internal-sort methods.
Performance Measurement
Improve Run Merging Reduce number of merge passes.
Improve Run Generation
Introduction to Computer Systems
These notes were largely prepared by the text’s author
External Sorting Adapt fastest internal-sort methods.
Advanced Sorting Methods: Shellsort
External Sorting Dina Said
Presentation transcript:

External Sorting Sort n records/elements that reside on a disk. Space needed by the n records is very large.  n is very large, and each record may be large or small.  n is small, but each record is very large. So, not feasible to input the n records, sort, and output in sorted order.

Small n But Large File Input the record keys. Sort the n keys to determine the sorted order for the n records. Permute the records into the desired order (possibly several fields at a time). We focus on the case: large n, large file.

New Data Structures/Concepts Tournament trees. Huffman trees. Double-ended priority queues. Buffering. Ideas also may be used to speed algorithms for small instances by using cache more efficiently.

External Sort Computer Model MAIN ALU DISK

Disk Characteristics Seek time  Approx. 100,000 arithmetics Latency time  Approx. 25,000 arithmetics Transfer time Data access by block tracks read/write head

Traditional Internal Memory Model MAIN ALU

Matrix Multiplication for (int i = 0; i < n; i++) for (int j = 0; j < n; j++) for (int k = 0; k < n; k++) c[i][j] += a[i][k] * b[k][j]; ijk, ikj, jik, jki, kij, kji orders of loops yield same result. All perform same number of operations. But run time differs! ijk takes > 7x ikj on modern PC when n = 4K.

More Accurate Memory Model R L1 L2 MAIN ALU KB256KB1GB 1C2C10C100C

2D Array Representation In Java, C, and C++ int x[3][4]; abcd efgh ijkl x[]

ijk Order... for (int i = 0; i < n; i++) for (int j = 0; j < n; j++) for (int k = 0; k < n; k++) c[i][j] += a[i][k] * b[k][j];

ijk Analysis... Block size = width of cache line = w. Assume one-level cache. C => n 2 /w cache misses. A => n 3 /w cache misses, when n is large. B => n 3 cache misses, when n is large. Total cache misses = n 3 /w(1/n w).

ikj Order... for (int i = 0; i < n; i++) for (int k = 0; k < n; k++) for (int j = 0; j < n; j++) c[i][j] += a[i][k] * b[k][j];

ikj Analysis... C => n 3 /w cache misses, when n is large. A => n 2 /w cache misses. B => n 3 /w cache misses, when n is large. Total cache misses = n 3 /w(2 + 1/n).

ijk Vs. ikj Comparison ijk cache misses = n 3 /w(1/n w). ikj cache misses = n 3 /w(2 + 1/n). ijk/ikj ~ (1 + w)/2, when n is large. w = 4 (32-byte cache line, double precision data)  ratio ~ 2.5. w = 8 (64-byte cache line, double precision data)  ratio ~ 4.5. w = 16 (64-byte cache line, integer data)  ratio ~ 8.5.

Prefetch Prefetch can hide memory latency Successful prefetch requires ability to predict a memory access much in advance Prefetch cannot reduce energy as prefetch does not reduce number of memory accesses

Faster Internal Sorting May apply external sorting ideas to internal sorting. Internal tiled merge sort gives 2x (or more) speedup over traditional merge sort.

External Sort Methods Base the external sort method on a fast internal sort method. Average run time  Quick sort Worst-case run time  Merge sort

Internal Quick Sort To sort a large instance, select a pivot element from out of the n elements. Partition the n elements into 3 groups left, middle and right. The middle group contains only the pivot element. All elements in the left group are <= pivot. All elements in the right group are >= pivot. Sort left and right groups recursively. Answer is sorted left group, followed by middle group followed by sorted right group.

Internal Quick Sort Use 6 as the pivot Sort left and right groups recursively.

Quick Sort – External Adaptation 3 input/output buffers  input, small, large rest is used for middle group DISK inputsmalllarge Middle group

Quick Sort – External Adaptation fill middle group from disk if next record <= middle min send to small if next record >= middle max send to large else remove middle min or middle max from middle and add new record to middle group DISKinputsmalllarge Middle group

Quick Sort – External Adaptation Fill input buffer when it gets empty. Write small/large buffer when full. Write middle group in sorted order when done. Double-ended priority queue. Use additional buffers to reduce I/O wait time. DISKinputsmalllarge Middle group