Hash vs Join A case study evaluating the use of the data step hash object to replace a SQL join Geoff Ness Sep 2014.

1 Hash vs Join A case study evaluating the use of the data step hash object to replace a SQL join Geoff Ness Sep 2014

2 The Hash Object
Effectively a lookup table which resides in memory – key/value pairs
Similar to associative arrays or dictionaries in other programming languages
Fast lookup (O(1) on average), no sorting required
Can offer a faster alternative to a traditional data step merge or SQL join, at a price:
–The syntax is unfamiliar to a lot of SAS programmers
–There’s more code to write
–Requires more memory than a join (sometimes much more)
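Since the slide likens the hash object to dictionaries in other languages, here is a minimal Python sketch of that idea. It is not SAS code, and the rows and the `describe` helper are hypothetical, chosen only to show a keyed lookup on unsorted input.

```python
# Hypothetical dimension rows: (key, value) pairs in no particular order.
dimension_rows = [(30, "Gadgets"), (10, "Widgets"), (20, "Sprockets")]

# Load the rows once into a dictionary (a hash table held in memory).
# No sorting is needed before or after loading.
lookup = dict(dimension_rows)


def describe(product_id):
    """Return the value stored for a key, or None when the key is absent."""
    return lookup.get(product_id)
```

Each call to `describe` is a single hashed lookup, so it does not depend on the order in which the rows were loaded.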

3 Using Hash to replace a SQL Join
[Diagram: a fact table linked to Dimension 1, Dimension 2, Dimension 3 and Dimension 4]

4 SQL Join

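5 Alternative using the Hash Object
Replacing the join typically requires 3 steps to be coded:
1 - Create variables by ‘faking’ a set statement (one that is compiled but never executed, so the dimension variables exist in the data step without any rows being read):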

6 2 - Then declare hash objects for each dimension:

7 3 - Finally, join rows from the fact to rows in the dimensions by calling the hash.find() method:
The .find() method returns 0 when the current value of the column named in .definekey() matches a row in the hash object; the variables named in .definedata() are then populated from that row
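The three steps above can be sketched outside SAS as follows. This is an illustration in Python rather than the code from the slides: the column names, the sample rows and the `hash_join` function are all hypothetical, and the sketch assumes inner-join behaviour (a fact row is dropped when any lookup fails), which the slides do not state.

```python
def hash_join(fact_rows, dimensions):
    """Inner-join fact rows to dimensions held as in-memory dictionaries.

    `dimensions` maps a key column name to a lookup of
    {key value: {data column: value}}.
    """
    joined = []
    for row in fact_rows:
        out = dict(row)
        matched = True
        for key_col, table in dimensions.items():
            data = table.get(row[key_col])  # the equivalent of a find
            if data is None:                # no match in this dimension
                matched = False
                break
            out.update(data)                # copy the data columns across
        if matched:
            joined.append(out)
    return joined


# Hypothetical dimensions, each loaded into memory once.
dimensions = {
    "product_id": {10: {"product_name": "Widgets"}},
    "store_id": {1: {"store_name": "North"}},
}

# Hypothetical fact rows; the second has no matching product.
fact_rows = [
    {"product_id": 10, "store_id": 1, "qty": 3},
    {"product_id": 99, "store_id": 1, "qty": 1},
]

result = hash_join(fact_rows, dimensions)
```

The fact table is read once, in whatever order it is stored, and every dimension is consulted per row without a sort or a second pass.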

8 Performance Comparison
Joining 2 dimensions, small fact (100K rows):

9 Joining 2 dimensions, large fact (~10M rows):

10 Joining 9 dimensions, small fact (100K rows):

11 Joining 9 dimensions, large fact (~10M rows):

12 Stuff we haven’t considered
Outer joins (yes, these are possible)
When proc sql will use the hash object ‘under the covers’
Performance against RDBMS tables (as opposed to SAS datasets)
Hash iterators
Other things that can be done with the hash object (sorting, summarisation, de-duplication)
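As a rough illustration of the outer-join point, a left join falls out of the same lookup pattern: keep every fact row and leave the dimension columns empty when the lookup fails. The Python sketch below uses hypothetical names and data and is not taken from the presentation.

```python
def left_hash_join(fact_rows, key_col, lookup, data_cols):
    """Left-join fact rows to one in-memory lookup; unmatched rows are kept."""
    joined = []
    for row in fact_rows:
        data = lookup.get(row[key_col], {})
        out = dict(row)
        for col in data_cols:
            out[col] = data.get(col)  # None when there is no match
        joined.append(out)
    return joined


# Hypothetical data: customer 9 is missing from the dimension.
fact = [{"cust_id": 1, "amount": 5.0}, {"cust_id": 9, "amount": 7.5}]
customers = {1: {"cust_name": "Acme"}}

result = left_hash_join(fact, "cust_id", customers, ["cust_name"])
```

The only change from an inner join is what happens on a failed lookup: the row is written with missing values instead of being skipped.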

13 Summary
Implementing a join using the hash object can provide a considerable saving in terms of time, usually at the expense of memory
The code is a little more involved but breaks down to a reasonably simple process to implement
Things to consider:
–The number and size of tables involved
–The memory required to load all the hash objects into memory
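For the memory consideration, a back-of-envelope check can be done before committing to the approach. The Python sketch below assumes a simple lower-bound model (rows multiplied by the bytes of key and data per row) and ignores any per-item overhead, so a real hash object will need more. The table sizes and the 256 MB budget are hypothetical.

```python
def estimate_hash_bytes(num_rows, key_bytes, data_bytes):
    """Lower-bound size of one lookup table: rows x bytes stored per row."""
    return num_rows * (key_bytes + data_bytes)


def fits_in_memory(dimensions, budget_bytes):
    """Sum the estimates for all dimensions and compare with a budget."""
    total = sum(estimate_hash_bytes(*dim) for dim in dimensions)
    return total, total <= budget_bytes


# Hypothetical dimensions: (rows, key bytes per row, data bytes per row).
dims = [(1_000_000, 8, 40), (50_000, 8, 100)]

# Hypothetical budget of 256 MB for all lookup tables together.
total, ok = fits_in_memory(dims, 256 * 1024**2)
```

If the total comes close to the budget, the extra overhead that this model leaves out is likely to push the real requirement over it.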

14 References
The SAS® Hash Object in Action http://support.sas.com/resources/papers/proceedings09/153-2009.pdf
Introduction to SAS® Hash Objects http://www.scsug.org/wp-content/uploads/2013/11/Introduction-to-SAS%C2%AE-Hash-Objects-Chris-Schacherer.pdf
A Hash Alternative to the PROC SQL Left Join http://www.nesug.org/proceedings/nesug06/dm/da07.pdf
Using the Hash Object – SAS® Language Reference: Concepts http://support.sas.com/documentation/cdl/en/lrcon/62955/HTML/default/viewer.htm#a002585310.htm

15 Questions?