Hash vs Join
A case study evaluating the use of the data step hash object to replace a SQL join
Geoff Ness, Sep 2014

Presentation transcript:

The Hash Object
Effectively a lookup table of key/value pairs which resides in memory
Similar to associative arrays or dictionaries in other programming languages
Fast lookup (O(1)), no sorting required
Can offer a faster alternative to a traditional data step merge or SQL join, at a price:
–The syntax is unfamiliar to a lot of SAS programmers
–There’s more code to write
–Requires more memory than a join (sometimes much more)

Using Hash to replace a SQL Join
[Diagram: Fact table, Dimension 1, Dimension 2, Dimension 3, Dimension 4]

SQL Join
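The code for this slide did not survive in the transcript. As a minimal sketch of the kind of join being replaced, the following builds three small tables and joins the fact to two dimensions with PROC SQL. All table and column names (work.fact_sales, work.dim_product, work.dim_region and their columns) are hypothetical and chosen for illustration only; the original case study used four or more dimensions.

```sas
/* Hypothetical sample data: two small dimensions and a fact table. */
data work.dim_product;
    input product_id product_name :$20.;
    datalines;
1 Widget
2 Gadget
;
run;

data work.dim_region;
    input region_id region_name :$20.;
    datalines;
10 North
20 South
;
run;

data work.fact_sales;
    input product_id region_id amount;
    datalines;
1 10 100
2 20 250
3 10 75
;
run;

/* Inner join of the fact table to both dimensions.
   product_id 3 has no matching dimension row, so that fact row is dropped. */
proc sql;
    create table work.joined_sql as
    select f.product_id,
           p.product_name,
           f.region_id,
           r.region_name,
           f.amount
    from work.fact_sales as f
         inner join work.dim_product as p
             on f.product_id = p.product_id
         inner join work.dim_region as r
             on f.region_id = r.region_id;
quit;
```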

Alternative using the Hash Object
Replacing the join typically requires 3 steps to be coded:

1 - Create variables by ‘faking’ a set statement:
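The slide's code is missing from the transcript; what follows is one common way to write this step, not necessarily the author's. The condition `if 0` is never true, so the SET statement never reads a row, but the compiler still creates every variable of the named dataset with its type and length. The hash object needs those variables to exist before `defineKey()` and `defineData()` can refer to them. The sketch uses sashelp.class purely as a stand-in table.

```sas
data work.pdv_demo;
    /* Never executed at run time, but at compile time SAS reads the
       descriptor of sashelp.class and creates its variables. */
    if 0 then set sashelp.class;

    /* The variables now exist, holding missing values. */
    put _all_;
    output;

    /* No SET statement ever executes, so end the step explicitly. */
    stop;
run;
```

The resulting dataset has one row with the columns of sashelp.class, all missing. In a real join the SET statement would list the dimension tables instead.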

2 - Then declare hash objects for each dimension:
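Again the original code is not in the transcript, so this is a hedged sketch of a single declaration, using sashelp.class as a stand-in for a dimension table and `h_class` as a hypothetical object name. With several dimensions, the same four statements would be repeated once per dimension.

```sas
data _null_;
    /* Step 1: make the key and data variables known to the compiler. */
    if 0 then set sashelp.class(keep=name age height);

    /* Step 2: declare the hash and say where its contents come from. */
    declare hash h_class(dataset: 'sashelp.class');
    h_class.defineKey('name');            /* column used for lookup      */
    h_class.defineData('age', 'height');  /* columns returned on a match */
    h_class.defineDone();                 /* loads the table into memory */

    /* Report how many entries were loaded. */
    n_items = h_class.num_items;
    put 'Rows loaded into the hash: ' n_items;
    stop;
run;
```

In a step that also reads a fact table, the declaration is usually wrapped in `if _n_ = 1 then do; ... end;` so the hash is built only once.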

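3 - Finally, join rows from the fact to rows in the dimensions by calling the hash.find() method:
The .find() method returns 0 when the hash contains an entry matching the current values of the key variables named in .definekey(); the variables named in .definedata() are then populated from that entry. A non-zero return code means no match was found.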
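Putting the three steps together, a minimal end-to-end sketch of an inner join against two dimensions might look like the following. The tables, columns and hash object names are all hypothetical and are recreated here so the example runs on its own; the original slide code is not available.

```sas
/* Hypothetical sample data. */
data work.dim_product;
    input product_id product_name :$20.;
    datalines;
1 Widget
2 Gadget
;
run;

data work.dim_region;
    input region_id region_name :$20.;
    datalines;
10 North
20 South
;
run;

data work.fact_sales;
    input product_id region_id amount;
    datalines;
1 10 100
2 20 250
3 10 75
;
run;

data work.joined_hash;
    /* Step 1: create the dimension variables without reading any rows. */
    if 0 then set work.dim_product work.dim_region;

    /* Step 2: build one hash per dimension, once, on the first iteration. */
    if _n_ = 1 then do;
        declare hash h_product(dataset: 'work.dim_product');
        h_product.defineKey('product_id');
        h_product.defineData('product_name');
        h_product.defineDone();

        declare hash h_region(dataset: 'work.dim_region');
        h_region.defineKey('region_id');
        h_region.defineData('region_name');
        h_region.defineDone();
    end;

    /* Read the fact table one row at a time. */
    set work.fact_sales;

    /* Step 3: look up each dimension; 0 means a match was found. */
    rc_product = h_product.find();
    rc_region  = h_region.find();

    /* Inner join semantics: keep the row only if every lookup matched. */
    if rc_product = 0 and rc_region = 0 then output;

    drop rc_:;
run;
```

Writing the row only when every return code is 0 mirrors an inner join, so the fact row with product_id 3 is dropped. Neither the fact table nor the dimensions need to be sorted or indexed.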

Performance Comparison
Joining 2 dimensions, small fact (100K rows):

Joining 2 dimensions, large fact (~10M rows):

Joining 9 dimensions, small fact (100K rows):

Joining 9 dimensions, large fact (~10M rows):

Stuff we haven’t considered
Outer joins (yes, these are possible)
When proc sql will use the hash object ‘under the covers’
Performance against RDBMS tables (as opposed to SAS datasets)
Hash iterators
Other things that can be done with the hash object (sorting, summarisation, de-duplication)

Summary
Implementing a join using the hash object can provide a considerable saving in terms of time, usually at the expense of memory
The code is a little more involved but breaks down to a reasonably simple process to implement
Things to consider:
–The number and size of tables involved
–The memory required to load all the hash objects into memory

References
The SAS® Hash Object in Action (pdf)
Introduction to SAS® Hash Objects (SAS%C2%AE-Hash-Objects-Chris-Schacherer.pdf)
A Hash Alternative to the PROC SQL Left Join
Using the Hash Object, SAS® Language Reference: Concepts (ult/viewer.htm#a htm)

Questions?