Sorting
Putting things in order
This is the big-picture version; it includes sort_i and sort_e and would be suitable for data structures.
Copyright © 2003-2011 Curt Hill

Sorting is Important
One of the largest consumers of resources
  An estimated 1/3 of all CPU power in the 1960s was used on sorts
  One sort was recorded to take months
Sorts are expensive
  They still consume cycles or accesses
Too many things do not work without them
Entire graduate-level classes cover just sorting

Definitions
Internal sorting
  All the data is in memory, most often in an array
  No I/Os are required
External sorting
  The data will not fit in memory
  Usually the I/O costs exceed the CPU costs
Merge
  Start with two (or more) sorted files
  Produce one sorted result
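
The merge defined above can be sketched in a few lines. This is a minimal illustration, assuming the two sorted inputs are in-memory lists rather than files; the function name merge_sorted is hypothetical.

```python
def merge_sorted(left, right):
    """Merge two already-sorted lists into one sorted list."""
    result = []
    i = j = 0
    # Repeatedly take the smaller front element of the two inputs.
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:  # <= keeps equal items in input order
            result.append(left[i])
            i += 1
        else:
            result.append(right[j])
            j += 1
    # One input is exhausted; the rest of the other is already sorted.
    result.extend(left[i:])
    result.extend(right[j:])
    return result


merged = merge_sorted([1, 4, 9], [2, 3, 10, 11])
print(merged)  # [1, 2, 3, 4, 9, 10, 11]
```

Each element is examined once, so the work grows linearly with the total input size. With files, the same loop would read one record at a time from each input.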

Keys
Often more than one
Primary key
  Most important
  The DB notion is different than the sort one
Secondary key
  Discriminate between two records where primary key is the same
  There may be multiple secondaries
    Ordered by importance
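
A short sketch of primary and secondary keys, assuming in-memory records stored as tuples. The record layout (last name, first name, age) is made up for illustration.

```python
# Hypothetical records: (last_name, first_name, age)
records = [
    ("Smith", "Zoe", 30),
    ("Jones", "Amy", 25),
    ("Smith", "Al", 41),
    ("Jones", "Amy", 19),
]

# Primary key: last name. Secondary keys, in order of importance:
# first name, then age. Tuples compare element by element, so a later
# key only matters when all earlier keys are equal.
by_keys = sorted(records, key=lambda r: (r[0], r[1], r[2]))

for record in by_keys:
    print(record)
```

The two Jones/Amy records tie on the primary key and the first secondary key, so the second secondary key (age) decides their order.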

The Sort Merge
Generalized program that will sort any file
Standard program on mainframes, but less common on other platforms
User specifies the sort and file characteristics
The program then creates the sorted file

Sort Characteristics
Primary and secondary keys
  Position
  Length
  Type
  Ascending or descending
Exits
  Points in the algorithm where a user-written program could be invoked
  Usually to filter/reformat the data
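
The characteristics above might be modeled as follows. This is only a sketch, not the interface of any real sort/merge utility: the names KeySpec, extract, and sort_records, the "char"/"int" type names, and the fixed-width record layout are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class KeySpec:
    position: int           # 0-based offset of the field in the record
    length: int             # field width in characters
    kind: str               # "char" or "int" (hypothetical type names)
    ascending: bool = True


def extract(record, spec):
    """Pull one key field out of a fixed-width record."""
    field = record[spec.position:spec.position + spec.length]
    return int(field) if spec.kind == "int" else field


def sort_records(records, keys, input_exit=None):
    """Sort fixed-width records by keys, listed most important first."""
    # Exit: a user-written function that may reformat a record,
    # or return None to filter it out.
    if input_exit is not None:
        records = [r for r in map(input_exit, records) if r is not None]
    else:
        records = list(records)
    # Sort by the least important key first. Python's sort is stable,
    # so the most important key, applied last, dominates.
    for spec in reversed(keys):
        records.sort(key=lambda r, s=spec: extract(r, s),
                     reverse=not spec.ascending)
    return records


# Assumed layout: name (5 chars), department (2), salary (5).
data = ["SMITH1003000", "JONES2004500", "ADAMS1004500", "BAKER1000000"]
keys = [
    KeySpec(position=5, length=2, kind="int"),                   # primary
    KeySpec(position=7, length=5, kind="int", ascending=False),  # secondary
]
# Exit that filters out records whose salary field is zero.
drop_zero_salary = lambda r: None if int(r[7:12]) == 0 else r

result = sort_records(data, keys, input_exit=drop_zero_salary)
print(result)
```

Sorting once per key, least important first, is one simple way to mix ascending and descending keys; it relies on the sort being stable.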

The Sort Merge
Did as much as it could using internal sorting
Usually had to resort to file merging for files that were too large to fit in memory
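
The combination of internal sorting and file merging can be sketched as below. This is a minimal illustration rather than how any particular sort/merge program worked: it assumes text records with no embedded newlines, the name external_sort and the parameter max_in_memory are hypothetical, and the result is returned as a list only to keep the demonstration short.

```python
import heapq
import os
import tempfile


def external_sort(lines, max_in_memory):
    """Sort text lines while buffering at most max_in_memory of them."""
    run_paths = []
    buffer = []

    def flush():
        # Internal sort of one memory-sized chunk, written out as a run.
        buffer.sort()
        fd, path = tempfile.mkstemp(text=True)
        with os.fdopen(fd, "w") as f:
            f.writelines(line + "\n" for line in buffer)
        run_paths.append(path)
        buffer.clear()

    for line in lines:
        buffer.append(line)
        if len(buffer) >= max_in_memory:
            flush()
    if buffer:
        flush()

    # Merge phase: read the sorted runs back and merge them.
    files = [open(path) for path in run_paths]
    try:
        runs = [(line.rstrip("\n") for line in f) for f in files]
        merged = list(heapq.merge(*runs))
    finally:
        for f in files:
            f.close()
        for path in run_paths:
            os.remove(path)
    return merged


print(external_sort(["pear", "apple", "fig", "kiwi", "date"], max_in_memory=2))
```

With max_in_memory=2 the five inputs become three sorted runs, which the merge phase combines. A real external sort would stream the merged output to a file instead of collecting it in memory.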

Internal Sorting
Data fits in memory
Typically an array or table
Examined first
Sort_I.ppt

External Sorting
Usual situation with a DBMS
  Occurs elsewhere as well
Too many pages for a file to fit in memory
The internal sorts mostly require random access to individual records, while we have random access only to pages
Want to minimize the number of page accesses
sort_e.ppt

Parallelism
With multiple CPUs some gain in sorting may occur
Each CPU may do part of a sort in parallel
Only the final merge needs to be handled by a single CPU
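
The structure described above (sort the parts in parallel, then do a single final merge) can be sketched as follows. The function name parallel_sort and the workers parameter are hypothetical. A thread pool keeps the sketch simple; in CPython, threads do not run CPU-bound work on several cores at once, so a process pool would be needed for a real speedup.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


def parallel_sort(data, workers=4):
    """Sort chunks concurrently, then merge them in a single final step."""
    data = list(data)
    if not data:
        return []
    chunk = -(-len(data) // workers)  # ceiling division
    parts = [data[i:i + chunk] for i in range(0, len(data), chunk)]
    # Each worker sorts one part independently of the others.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        sorted_parts = list(pool.map(sorted, parts))
    # The final merge runs on a single thread of control.
    return list(heapq.merge(*sorted_parts))


print(parallel_sort([5, 3, 8, 1, 9, 2, 7], workers=3))  # [1, 2, 3, 5, 7, 8, 9]
```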

Using a BTree
Suppose that the file in question has a BTree index on the field in question
Should that be used instead of sorting?
The answer depends on whether this is a clustered or unclustered index

Well?
Clustered
  Traverse root to initial value
  Ride the leaves as long as needed
  This eliminates the need for a sort; the leaves are already sorted
Unclustered
  We will have to use the index for each record
  Generally a sequential scan followed by a sort will be faster for normal index sizes
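
A toy model can illustrate why the unclustered case is expensive. It rests on several loud assumptions that are not from the slides: a buffer that holds only the most recently read page, 10 records per page, 1000 records, and an unclustered layout modeled as a random scattering of records across pages. The function name page_fetches is hypothetical.

```python
import random


def page_fetches(pages_in_key_order):
    """Count page reads when records are visited in key order,
    assuming only the most recently read page stays buffered."""
    fetches = 0
    current = None
    for page in pages_in_key_order:
        if page != current:
            fetches += 1
            current = page
    return fetches


records_per_page = 10
num_records = 1000

# Clustered: key order matches physical order, so consecutive
# records sit on the same page.
clustered = [i // records_per_page for i in range(num_records)]

# Unclustered: records in key order are scattered across the pages.
rng = random.Random(0)
unclustered = clustered[:]
rng.shuffle(unclustered)

clustered_fetches = page_fetches(clustered)
unclustered_fetches = page_fetches(unclustered)
print(clustered_fetches, unclustered_fetches)
```

In this model the clustered traversal reads each of the 100 pages once, while the unclustered traversal approaches one page read per record.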