Sorting by the Numbers
Sorting, Part Four
Question
Suppose you are given the task of writing an application to sort a big data file. What do you need to know to pick a good solution?
File Size = 1 GB
Record Size = 250 bytes
Available Memory = ¼ GB
How many Runs? How big is each Run?
Total Records to Process: 1 billion bytes in the file ÷ 250 bytes for each record = 4 million records in the file
Run Size: 1 GB file ÷ ¼ GB memory = 4 Runs of 1 million records each
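The run arithmetic above can be checked with a few lines of Python. This is a minimal sketch: the function name `plan_runs` and the constant names are made up for illustration, and 1 GB is taken as 1 billion bytes, as on the slide.

```python
# Sizes from the slide; 1 GB is treated as 1 billion bytes.
FILE_SIZE_BYTES = 1_000_000_000
RECORD_SIZE_BYTES = 250
MEMORY_BYTES = FILE_SIZE_BYTES // 4  # 1/4 GB of available memory


def plan_runs(file_size, record_size, memory):
    """Return (total records, records per run, number of runs)."""
    total_records = file_size // record_size
    records_per_run = memory // record_size
    # Ceiling division, so a partly filled last run still counts as a run.
    num_runs = -(-total_records // records_per_run)
    return total_records, records_per_run, num_runs


total_records, records_per_run, num_runs = plan_runs(
    FILE_SIZE_BYTES, RECORD_SIZE_BYTES, MEMORY_BYTES
)
print(total_records, records_per_run, num_runs)
```

The sketch assumes all available memory holds records; a real program would also reserve space for buffers and for the sort's own bookkeeping.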
Time to Create the Runs
Sorting One Run: using either Quicksort or an Ordered Binary Tree takes about N log2 N comparisons
1 million * 20 = approximately 20 million comparisons of internal memory locations
Sorting Four Runs: 4 * 20 million = 80 million internal memory comparisons
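The N log2 N estimate can be reproduced as follows. This is only a sketch of the slide's back-of-the-envelope model, not a measurement of a real sort: the helper name `estimated_sort_compares` is hypothetical, and log2 N is rounded to a whole number the way the slide does.

```python
import math


def estimated_sort_compares(n):
    """Estimate comparisons to sort n records as n * log2(n), log rounded."""
    return n * round(math.log2(n))


RECORDS_PER_RUN = 1_000_000
NUM_RUNS = 4

per_run = estimated_sort_compares(RECORDS_PER_RUN)  # log2(1 million) is about 19.93
all_runs = NUM_RUNS * per_run
print(per_run, all_runs)
```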
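Refresher on Merging Files
To merge 2 files of N random records each requires about 2N compares.
To merge 2 files whose runs were built from an already sorted file requires only N compares, because every record in File One comes before every record in File Two: after N compares File One is used up, and the rest of File Two is copied without further compares.
[Diagrams: File One and File Two, shown for each of the two cases]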
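The two merge cases can be seen by counting compares in a small two-way merge. This is a sketch on in-memory lists rather than files, and the function name `merge_counting` is made up for illustration.

```python
def merge_counting(a, b):
    """Merge two sorted lists; return (merged list, number of compares)."""
    merged = []
    compares = 0
    i = j = 0
    while i < len(a) and j < len(b):
        compares += 1  # one compare per record written while both inputs remain
        if a[i] <= b[j]:
            merged.append(a[i])
            i += 1
        else:
            merged.append(b[j])
            j += 1
    # One input is exhausted; the rest is copied with no compares.
    merged.extend(a[i:])
    merged.extend(b[j:])
    return merged, compares


# Interleaved keys, N = 5 in each list: close to 2N compares.
_, interleaved = merge_counting([0, 2, 4, 6, 8], [1, 3, 5, 7, 9])
# Runs cut from an already sorted file: exactly N compares.
_, presorted = merge_counting([0, 1, 2, 3, 4], [5, 6, 7, 8, 9])
print(interleaved, presorted)  # 9 5
```

For interleaved input the count is 2N - 1, which is why the slides round it to 2N.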
Merging the Four Files
[Diagram: two ways to merge runs R1, R2, R3 and R4 through temp files T1 and T2. Merge steps are labeled 2 million, 4 million, 3 million, 2 million and 4 million compares.]
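One plausible reading of the diagram is a comparison between merging the runs in pairs and merging them one after another. The sketch below totals the compares for both orders, assuming that merging two files costs about as many compares as the records they hold together. The function names are hypothetical, and the pairing shown is an assumption rather than something the slide states.

```python
def pairwise_merge_compares(run_sizes):
    """Merge runs in pairs, then merge the results, until one file remains."""
    sizes = list(run_sizes)
    total = 0
    while len(sizes) > 1:
        next_sizes = []
        for k in range(0, len(sizes) - 1, 2):
            merged = sizes[k] + sizes[k + 1]
            total += merged  # about one compare per record written
            next_sizes.append(merged)
        if len(sizes) % 2:  # an odd run out is carried to the next round
            next_sizes.append(sizes[-1])
        sizes = next_sizes
    return total


def sequential_merge_compares(run_sizes):
    """Merge the first two runs, then merge each later run into the result."""
    total = 0
    accumulated = run_sizes[0]
    for size in run_sizes[1:]:
        accumulated += size
        total += accumulated
    return total


runs = [1_000_000] * 4
print(pairwise_merge_compares(runs))    # 2M + 2M + 4M = 8,000,000
print(sequential_merge_compares(runs))  # 2M + 3M + 4M = 9,000,000
```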
Total Processing Time
Time to Create the 4 Runs: 80 million comparisons
Time to Merge the 4 Runs: 8 million comparisons
Assuming a File Read takes just 100 times longer than a Memory Read:
Total Time = 80 million + (8 million * 100) = 880 million time units
Note: we have omitted the time to read the runs into memory and to write the runs to temp files.
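The total can be written as a small cost function, which makes it easy to try other ratios of file time to memory time. A minimal sketch: the name `total_time_units` and its parameter names are illustrative, and like the slide it ignores the time to read the runs into memory and write them to temp files.

```python
def total_time_units(sort_compares, merge_compares, file_cost_factor=100):
    """Memory compares cost 1 unit each; file compares cost file_cost_factor."""
    return sort_compares + merge_compares * file_cost_factor


first_example = total_time_units(80_000_000, 8_000_000)
print(first_example)  # 880000000
```

Because each merge compare is weighted 100 times more heavily than a sort compare, the merge phase accounts for 800 million of the 880 million units.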
Second Example
2 Runs of 2 Million Records each
Internal Sorting: N log2 N = 2 million * 21 = 42 million compares; 84 million to create both runs
File Merging: 4 million compares
Total Time = 84 million + (4 million * 100) = 484 million time units
Next in this course
So how much time does it take to access the disk?