Hash vs Join
A case study evaluating the use of the data step hash object to replace a SQL join
Geoff Ness, Sep 2014
The Hash Object
Effectively a lookup table which resides in memory – key/value pairs
Similar to associative arrays or dictionaries in other programming languages
Fast lookup (O(1)), no sorting required
Can offer a faster alternative to traditional data step merge or SQL join, at a price:
–The syntax is unfamiliar to a lot of SAS programmers
–There’s more code to write
–Requires more memory than a join (sometimes much more)
Using Hash to replace a SQL Join
[Diagram: a fact table linked to four dimension tables, Dimension 1 to Dimension 4]
SQL Join
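The code from this slide is not in the transcript. As a minimal sketch of the kind of join being discussed, the example below joins a small fact table to two dimensions with PROC SQL. All table and column names (fact, dim_customer, dim_product, cust_id, prod_id and so on) are hypothetical and chosen only for illustration.

```sas
/* Hypothetical sample data: one fact table and two small dimensions */
data fact;
    input cust_id prod_id amount;
    datalines;
1 10 25.00
2 20 40.00
1 20 15.50
;
run;

data dim_customer;
    input cust_id cust_name $;
    datalines;
1 Alice
2 Bob
;
run;

data dim_product;
    input prod_id prod_name $;
    datalines;
10 Widget
20 Gadget
;
run;

/* Inner join of the fact to each dimension on its key column */
proc sql;
    create table joined_sql as
    select f.cust_id, f.prod_id, f.amount,
           c.cust_name, p.prod_name
    from fact as f
         inner join dim_customer as c on f.cust_id = c.cust_id
         inner join dim_product  as p on f.prod_id = p.prod_id;
quit;
```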
Alternative using the Hash Object
Replacing the join typically requires 3 steps to be coded:
1 - Create variables by ‘faking’ a set statement, so that the dimension columns exist in the data step at compile time.
2 - Declare a hash object for each dimension.
3 - Join rows from the fact to rows in the dimensions by calling the hash .find() method.
The .find() method returns 0 when a row matching the key columns named in .definekey() is found, and the columns named in .definedata() are then populated with that row's values.
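The three steps can be sketched in a single data step as follows. This is an illustration rather than the code from the original slides: the tables and columns (fact, dim_customer, dim_product, cust_id, prod_id) are hypothetical, and only two dimensions are shown to keep it short.

```sas
/* Hypothetical sample data (names are illustrative only) */
data fact;
    input cust_id prod_id amount;
    datalines;
1 10 25.00
2 20 40.00
3 10 12.00
;
run;

data dim_customer;
    input cust_id cust_name $;
    datalines;
1 Alice
2 Bob
;
run;

data dim_product;
    input prod_id prod_name $;
    datalines;
10 Widget
20 Gadget
;
run;

data joined_hash;
    /* Step 1: 'fake' set statement. It never executes, but at compile
       time it adds the dimension variables to the data step. */
    if 0 then set dim_customer dim_product;

    /* Step 2: load each dimension into a hash object, first iteration only */
    if _n_ = 1 then do;
        declare hash h_cust(dataset: 'dim_customer');
        h_cust.defineKey('cust_id');
        h_cust.defineData('cust_name');
        h_cust.defineDone();

        declare hash h_prod(dataset: 'dim_product');
        h_prod.defineKey('prod_id');
        h_prod.defineData('prod_name');
        h_prod.defineDone();
    end;

    /* Step 3: read the fact and look up each dimension.
       Keep the row only when both lookups succeed (inner join). */
    set fact;
    if h_cust.find() = 0 and h_prod.find() = 0 then output;
run;
```

In this sketch the fact row with cust_id 3 has no matching customer, so it is not written to joined_hash, which mirrors the behaviour of an inner join.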
Performance Comparison
Joining 2 dimensions, small fact (100K rows):
Joining 2 dimensions, large fact (~10M rows):
Joining 9 dimensions, small fact (100K rows):
Joining 9 dimensions, large fact (~10M rows):
Stuff we haven’t considered
Outer joins (yes these are possible)
When proc sql will use the hash object ‘under the covers’
Performance against RDBMS tables (as opposed to SAS datasets)
Hash iterators
Other things that can be done with the hash object (sorting, summarisation, de-duplication)
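As a brief illustration of the first point, a left outer join can be sketched by keeping every fact row and blanking the lookup columns when .find() fails. The table and column names below are hypothetical, and this is one possible approach rather than the method evaluated in the case study.

```sas
/* Hypothetical sample data: customer 3 has no dimension row */
data fact;
    input cust_id amount;
    datalines;
1 25.00
2 40.00
3 12.00
;
run;

data dim_customer;
    input cust_id cust_name $;
    datalines;
1 Alice
2 Bob
;
run;

data left_joined;
    /* Define the dimension variables at compile time */
    if 0 then set dim_customer;

    if _n_ = 1 then do;
        declare hash h_cust(dataset: 'dim_customer');
        h_cust.defineKey('cust_id');
        h_cust.defineData('cust_name');
        h_cust.defineDone();
    end;

    set fact;

    /* No match: clear the lookup column instead of dropping the row.
       Without this, the value from the previous match would carry over. */
    if h_cust.find() ne 0 then call missing(cust_name);
run;
```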
Summary
Implementing a join using the hash object can provide a considerable saving in terms of time, usually at the expense of memory
The code is a little more involved but breaks down to a reasonably simple process to implement
Things to consider:
–The number and size of tables involved
–The memory required to load all the hash objects into memory
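One rough way to gauge the memory point is to load a dimension into a hash object and multiply its NUM_ITEMS and ITEM_SIZE attributes. The sketch below uses a hypothetical dim_customer table. The result should be treated as an approximation only, since ITEM_SIZE reports the size of an item and not any additional overhead of the hash object itself.

```sas
/* Hypothetical dimension table used for the estimate */
data dim_customer;
    input cust_id cust_name $;
    datalines;
1 Alice
2 Bob
;
run;

data _null_;
    /* Define the dimension variables at compile time */
    if 0 then set dim_customer;

    declare hash h_cust(dataset: 'dim_customer');
    h_cust.defineKey('cust_id');
    h_cust.defineData('cust_name');
    h_cust.defineDone();

    /* Number of items loaded and bytes per item, from hash attributes */
    n_items      = h_cust.num_items;
    item_bytes   = h_cust.item_size;
    approx_bytes = n_items * item_bytes;

    /* Write the estimate to the log */
    put n_items=;
    put item_bytes=;
    put approx_bytes=;

    /* No set statement ever executes, so stop explicitly */
    stop;
run;
```

Repeating this for each dimension and summing the results gives a first indication of whether all the hash objects are likely to fit in the memory available to the session.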
References
The SAS® Hash Object in Action
Introduction to SAS® Hash Objects (Chris Schacherer)
A Hash Alternative to the PROC SQL Left Join
Using the Hash Object, SAS® Language Reference: Concepts
Questions?