Published by Anna Hodge. Modified over 9 years ago.
1
CS186 Week 0: Out-of-Core Algorithms
2
Today
– External Merge Sort
– External Hashing
3
Sorting
Goal: minimize the number of I/Os (especially “random” I/Os).
Classic interview question: how do you sort data that don’t fit in memory?
4
But first, what is a sorted run?
A sorted subset of a table. For example, two runs, each sorted by sid:
Run 1: (name = Bob, sid = 1) (name = Jill, sid = 2) (name = Sam, sid = 3) (name = Sue, sid = 6) (name = Kev, sid = 8) (name = Jack, sid = 9) (name = Joe, sid = 10) (name = Sid, sid = 12) (name = Sal, sid = 15)
Run 2: (name = Bit, sid = 1) (name = Bat, sid = 2) (name = Tam, sid = 3) (name = Foo, sid = 6) (name = Bar, sid = 8) (name = Bam, sid = 9) (name = Ke, sid = 10) (name = Kay, sid = 12) (name = Al, sid = 15)
Another common interview question: how do you combine a bunch of sorted sublists into one sorted list?
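One common answer to the interview question is a k-way merge driven by a min-heap. The sketch below is illustrative only: the function name `merge_runs`, the tuple layout `(name, sid)`, and the shortened three-record runs are assumptions, not something taken from the slides.

```python
import heapq

def merge_runs(runs, key):
    """Merge already-sorted runs into one sorted list using a min-heap."""
    heap = []
    for run_id, run in enumerate(runs):
        if run:
            # Entry = (sort key, run id, position). The run id breaks ties,
            # so the records themselves are never compared.
            heap.append((key(run[0]), run_id, 0))
    heapq.heapify(heap)

    merged = []
    while heap:
        _, run_id, pos = heapq.heappop(heap)
        merged.append(runs[run_id][pos])
        if pos + 1 < len(runs[run_id]):
            nxt = runs[run_id][pos + 1]
            heapq.heappush(heap, (key(nxt), run_id, pos + 1))
    return merged

# Shortened versions of the two runs on the slide, as (name, sid) tuples.
run_a = [("Bob", 1), ("Jill", 2), ("Sam", 3)]
run_b = [("Bit", 1), ("Bat", 2), ("Tam", 3)]
merged = merge_runs([run_a, run_b], key=lambda rec: rec[1])
```

Only the current head of each run sits in the heap at any time, which is why the same idea carries over to runs that live on disk.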
5
Sorting: 2-Way
Pass 0 (conquer):
– read a page, sort it, write it
– only one buffer page is used
– a repeated “batch job”
[Figure: a page moves from INPUT into a single I/O buffer in RAM, is sorted, and is written to OUTPUT.]
6
Sorting: 2-Way
Pass 0 (conquer):
– read a page, sort it, write it
– only one buffer page is used
– a repeated “batch job”
Pass 1, 2, 3, …, etc. (merge):
– requires 3 buffer pages (note: this has nothing to do with double buffering!)
– merge pairs of runs into runs twice as long
– a streaming algorithm, as in the previous slide!
[Figure: two input buffers (INPUT 1, INPUT 2) and one OUTPUT buffer in RAM.]
7
Two-Way External Merge Sort
Conquer and merge: sort subfiles, then merge them.
Each pass we read + write each page in the file.
With N pages in the file, the number of passes is 1 + ⌈log2 N⌉.
So the total cost is 2N * (1 + ⌈log2 N⌉) I/Os.
Why 2N * (number of passes)?
Example (N = 7 pages, 4 passes):
Input file: 3,4 | 6,2 | 9,4 | 8,7 | 5,6 | 3,1 | 2
PASS 0 (1-page runs): 3,4 | 2,6 | 4,9 | 7,8 | 5,6 | 1,3 | 2
PASS 1 (2-page runs): 2,3,4,6 | 4,7,8,9 | 1,3,5,6 | 2
PASS 2 (4-page runs): 2,3,4,4,6,7,8,9 | 1,2,3,5,6
PASS 3 (8-page runs): 1,2,2,3,3,4,4,5,6,6,7,8,9
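To make the pass structure concrete, here is a small in-memory simulation of two-way external merge sort. It is a sketch under assumptions: pages are plain Python lists, "disk" is just a list of runs, and `two_way_external_sort` is a made-up name. Since run length doubles each merge pass, the number of passes should come out to 1 + ⌈log2 N⌉, and every pass costs 2N I/Os (N reads plus N writes).

```python
import heapq
import math

def two_way_external_sort(pages):
    """Simulate 2-way external merge sort.

    Returns (history, io_count), where history[i] is the list of runs
    that exists after pass i.
    """
    n = len(pages)
    # Pass 0: sort each page on its own, giving N one-page runs.
    runs = [sorted(page) for page in pages]
    history = [runs]
    io_count = 2 * n  # every page is read once and written once
    # Merge passes: combine runs pairwise until a single run remains.
    while len(runs) > 1:
        runs = [list(heapq.merge(*runs[i:i + 2]))
                for i in range(0, len(runs), 2)]
        history.append(runs)
        io_count += 2 * n
    return history, io_count

# A 7-page input file, two values per page (last page half full).
pages = [[3, 4], [6, 2], [9, 4], [8, 7], [5, 6], [3, 1], [2]]
history, io_count = two_way_external_sort(pages)
expected_passes = 1 + math.ceil(math.log2(len(pages)))
```

For this 7-page file, `len(history)` and `expected_passes` both evaluate to 4, and `io_count` is 2 * 7 * 4 = 56.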
8
General External Merge Sort
More than 3 buffer pages. How can we utilize them?
To sort a file with N pages using B buffer pages:
– Pass 0: use all B buffer pages. Produce ⌈N/B⌉ sorted runs of B pages each.
– Pass 1, 2, …, etc.: merge B-1 runs at a time.
[Figure: Merging Runs. B-1 input buffers (INPUT 1, INPUT 2, ..., INPUT B-1) and one OUTPUT buffer in RAM, reading from and writing to Disk.]
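The general algorithm can be sketched the same way. This is an in-memory simulation, not a real buffer manager: pages are Python lists, the name `external_merge_sort` is hypothetical, and `heapq.merge` stands in for streaming B-1 input buffers into one output buffer.

```python
import heapq

def external_merge_sort(pages, B):
    """Simulate external merge sort with B buffer pages.

    Returns (sorted_values, number_of_passes).
    """
    if B < 3:
        raise ValueError("need at least 3 buffer pages to merge")
    # Pass 0: fill all B buffers, sort in memory, emit one run per batch.
    runs = []
    for i in range(0, len(pages), B):
        batch = [value for page in pages[i:i + B] for value in page]
        runs.append(sorted(batch))
    passes = 1
    # Later passes: B-1 buffers hold run heads, 1 buffer collects output.
    while len(runs) > 1:
        runs = [list(heapq.merge(*runs[i:i + B - 1]))
                for i in range(0, len(runs), B - 1)]
        passes += 1
    return (runs[0] if runs else []), passes

# 12 pages of 2 values each, in reverse order.
data = list(range(24, 0, -1))
pages = [data[i:i + 2] for i in range(0, len(data), 2)]
sorted_b3, passes_b3 = external_merge_sort(pages, B=3)
sorted_b4, passes_b4 = external_merge_sort(pages, B=4)
```

With B = 3 the 12 pages need three passes (4 runs, then 2, then 1). With B = 4 they need two, because 12 = B(B-1) is exactly the two-pass limit.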
9
Cost of External Merge Sort
Number of passes: 1 + ⌈log_(B-1) ⌈N/B⌉⌉
Cost = 2N * (# of passes). Why?
How big of a table can we sort in two passes?
– Each sorted run after Pass 0 is of size B
– Can merge up to B-1 sorted runs in Pass 1
– Answer: B(B-1) pages
Sort N pages of data in about sqrt(N) space
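The pass count and the cost can be worked out with a few lines of arithmetic. The helper names below (`num_passes`, `sort_cost`, `max_two_pass_pages`) are illustrative. The loop counts passes directly, which matches 1 + ⌈log_(B-1) ⌈N/B⌉⌉ without relying on floating-point logarithms.

```python
def num_passes(N, B):
    """Passes needed to sort N pages with B buffer pages (B >= 3)."""
    runs = -(-N // B)  # ceil(N / B) runs after pass 0
    passes = 1
    while runs > 1:
        runs = -(-runs // (B - 1))  # each merge pass combines B-1 runs
        passes += 1
    return passes

def sort_cost(N, B):
    """Total I/Os: every pass reads and writes all N pages."""
    return 2 * N * num_passes(N, B)

def max_two_pass_pages(B):
    """Largest file sortable in two passes: B-1 runs of B pages each."""
    return B * (B - 1)

passes_108 = num_passes(108, 5)    # 22 runs -> 6 -> 2 -> 1
cost_108 = sort_cost(108, 5)
limit_5 = max_two_pass_pages(5)
```

For N = 108 and B = 5 this gives 4 passes and 2 * 108 * 4 = 864 I/Os. A 20-page file is the largest that B = 5 can sort in two passes, and a 21-page file already needs a third.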
10
HASHING
11
Cats by fur color (Hashing) Black cats… Grey cats… Orange cats… White cats… Zorro cats…
12
Hashing: How To
Goal: Group kitties by fur color so we can d’aww them.
Setup: 12 kitties, 2 can fit per page. We have 8 kitties worth of memory.
N = 6, B = 4 (6 pages of data, 4 buffer pages)
13
Hashing: How To N = 6, B = 4 Step 1: Partition – Create partitions on disk such that all kitties of a particular fur color are guaranteed to be within the same partition. (Though there may not be a whole partition for each color.)
14
Hashing: How To
N = 6, B = 4
Step 1: Partition
– Create partitions on disk such that all kitties of a particular fur color are guaranteed to be within the same partition. (Though there may not be a whole partition for each color.)
How to assign cats to partitions? Hashing!
What does that mean? Map each fur color to a bucket: {B, G, O, W, Z} -> {1, 2, 3}
What hash function? Let’s say we’ll map each color to whichever THIRD of the alphabet its first letter lies in: {B, G} -> 1; {O} -> 2; {W, Z} -> 3.
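The partitioning step with this hash function might look like the sketch below. The function names and the sample list of cats are assumptions (the slides do not list the individual kitties), and partitions are kept as in-memory lists rather than pages written to disk.

```python
import string
from collections import defaultdict

def alphabet_third(color):
    """Map a color to 1, 2, or 3 by the third of the alphabet
    (A-I, J-R, S-Z) that its first letter falls in."""
    index = string.ascii_uppercase.index(color[0].upper())
    return index * 3 // 26 + 1

def partition(cats):
    """Group cats so that equal colors always share a partition."""
    partitions = defaultdict(list)
    for color in cats:
        partitions[alphabet_third(color)].append(color)
    return dict(partitions)

# Hypothetical sample; each cat is represented by its fur color only.
cats = ["Black", "Grey", "Orange", "White", "Zorro", "Black"]
parts = partition(cats)
```

Black and Grey cats share partition 1 here. That is fine: partitioning only promises that one color never spans two partitions, not that every color gets a partition of its own.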
16
Hashing: How To N = 6, B = 4 Step 1: Partition – Create partitions on disk such that all kitties of a particular fur color are guaranteed to be within the same partition. (Though there may not be a whole partition for each color.) Hash function: {B, G} -> 1; {O} -> 2, {W, Z} -> 3.
23
Hashing: How To N = 6, B = 4 Step 1: Partition Step 2: Re-Hash – Create in-memory table for each partition
25
Hashing: How To
N = 6, B = 4
Step 1: Partition
Step 2: Re-Hash
– Create in-memory hash table for each partition
[Figure: in-memory hash table for the first partition, with one entry per fur color (Grey, Black).]
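The re-hash step can be sketched as building one in-memory table per partition. Here a Python dict plays the role of the second hash function, and the `(name, color)` records and cat names are made up for illustration.

```python
from collections import defaultdict

def rehash(partition, key):
    """Build an in-memory hash table for one partition, grouping
    records that share the same key."""
    table = defaultdict(list)
    for record in partition:
        # The dict's own hashing acts as the second, finer hash function.
        table[key(record)].append(record)
    return dict(table)

# Hypothetical contents of partition 1 (colors starting with A-I).
partition_1 = [("Tom", "Black"), ("Ash", "Grey"), ("Ink", "Black")]
groups = rehash(partition_1, key=lambda cat: cat[1])
```

Because each partition is assumed to fit in memory, the whole table can be built and emitted before the next partition is read.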
26
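Two Phases
Partition (Divide): the original relation is read from disk through one INPUT buffer, hash function h_p is applied, and B-1 partitions are written to disk through B-1 OUTPUT buffers (B main memory buffers in total).
Rehash (Conquer): each partition R_i (k <= B pages) is read from disk, an in-memory hash table is built for it with hash function h_r using the B main memory buffers, and the result is written out.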
27
Memory Requirement
How big of a table can we hash in two passes?
– B-1 “partitions” result from Pass 1
– Each should be no more than B pages in size
– Answer: B(B-1) pages
We can hash a table of size N pages in about sqrt(N) space
– Note: assumes the hash function distributes records evenly!
Have a bigger table? Recursive partitioning!
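Recursive partitioning can be sketched as follows. This is a simplified in-memory model under several assumptions: records are counted rather than paged, `partition_recursively` is a hypothetical name, and Python's built-in `hash` salted with the recursion depth stands in for a fresh hash function at each level.

```python
def partition_recursively(records, B, page_size, key, depth=0):
    """Split records into partitions that each fit in B pages.

    A partition that is still too large is partitioned again with a
    different hash function (here: the hash is salted with the depth).
    """
    capacity = B * page_size  # records that fit in B buffer pages
    distinct_keys = {key(r) for r in records}
    # Stop when the partition fits, or when it holds a single key and
    # no hash function could split it any further.
    if len(records) <= capacity or len(distinct_keys) <= 1:
        return [records]
    buckets = [[] for _ in range(B - 1)]
    for record in records:
        buckets[hash((depth, key(record))) % (B - 1)].append(record)
    result = []
    for bucket in buckets:
        if bucket:
            result.extend(
                partition_recursively(bucket, B, page_size, key, depth + 1))
    return result

# 100 records, 2 per page, B = 4: far beyond the B(B-1) = 12 page limit.
records = list(range(100))
parts = partition_recursively(records, B=4, page_size=2, key=lambda r: r)
```

In a real system each extra level of partitioning would add another read and write of the data, so the cost grows beyond the two-pass figure.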
28
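Cost of External Hashing
cost = 4*N I/Os (two passes, Divide then Conquer, each reading and writing N pages)
Cost of External Sorting
Two passes (Conquer then Merge) likewise cost 2N * 2 = 4*N I/Os.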
29
Summary
Sort/Hash Duality
– Hashing is Divide & Conquer
– Sorting is Conquer & Merge
Sorting is overkill for rendezvous
– But sometimes a win anyhow
Sorting is sensitive to the internal sort algorithm
– Quicksort vs. HeapSort
– In practice, Quicksort tends to win
Don’t forget double buffering