Think FAST! Use Memory Tables (Hashing) for Faster Merging
Gregg P. Snell, Data Savant Consulting
Copyright © 2006, SAS Institute Inc. All rights reserved.

Introduction
Merging is something we all do.
The need for speed increases as:
    files get larger
    the process is repeated (daily, weekly, etc.)
I/O is the speed killer.
A basic match-merge requires sorting.
Indexes are usually faster.
But hashing is fastest!
    No need to sort any file
    A single pass of each file, then memory I/O

Introduction
If hashing is fastest, why are you not using it?
    The dot-notation syntax is weird.
    "Object" is ambiguous and hard to understand.
Think: hash object = memory table
    Familiar terms
    Side-by-side code comparisons

Topics to be Covered
Hashing – Defined
Quick Review
    Sequential/Direct Access
    Implicit/Explicit Looping
    Indexes
Hash Object = Memory Table
Compare Merge Methods
Limitations and Overcoming Them

Hashing – Defined
Hashing is the process of converting a long-range key (numeric or character) to a smaller-range integer with a mathematical algorithm or function.
Key-indexing is the concept of using the value of a table's key variable as the index into that table.
    Think of arrays: client(66216)='POTENTIAL CLIENT';
Introduced to the SAS world by Paul Dorfman at SUGI 25: "Private Detectives In A Data Warehouse: Key-Indexing, Bitmapping, And Hashing"
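To make the "long-range key to smaller-range integer" idea concrete, here is a minimal sketch using a simple modulo function. The table size (1009) and the sample keys are assumptions chosen for illustration; the DATA step hash object uses its own internal hash function, not this one.

```sas
/* Illustrative only: map long-range keys to a small range of bucket numbers. */
/* table_size and the sample keys are hypothetical values.                    */
data _null_;
    table_size = 1009;                      /* assumed number of buckets */
    do key = 66216, 123456789, 987654321;
        bucket = mod(key, table_size) + 1;  /* always between 1 and table_size */
        put key= bucket=;
    end;
run;
```

Different keys can land in the same bucket, which is why a real hash table also needs a way to resolve collisions.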

Hashing – Defined
Incorporated into the DATA step with v9.
Has two predefined component objects:
    hash object
    hash iterator
These objects provide a quick and efficient method to store, search, and retrieve data based on lookup keys.
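A small sketch of the two component objects working together, assuming the sashelp.class sample table is available. The object names h and it are hypothetical; the hash object handles the keyed lookup and the hash iterator walks the table in key order.

```sas
data _null_;
    /* Define name and age in the program data vector without reading a row */
    if 0 then set sashelp.class (keep=name age);

    /* Hash object: load sashelp.class into memory, keyed by name */
    declare hash h (dataset: 'sashelp.class', ordered: 'yes');
    rc = h.defineKey('name');
    rc = h.defineData('name', 'age');
    rc = h.defineDone();

    /* Hash iterator: attached to the hash object by name */
    declare hiter it ('h');

    /* Keyed lookup: find() returns 0 when the key exists */
    name = 'Alice';
    rc = h.find();
    if rc = 0 then put 'Found: ' name= age=;

    /* Ordered traversal of every item in the table */
    rc = it.first();
    do while (rc = 0);
        put name= age=;
        rc = it.next();
    end;
    stop;
run;
```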

Quick Review
    Sequential/Direct Access
    Implicit/Explicit Looping
    Indexes

Quick Review – Sequential Access
Sequential access: read one record after another, top to bottom.
SAS is smart enough to know when the end-of-file (EOF) has been encountered and stops reading.
Let's look at an example:

    /* sequential access */
    data work.sequential;
        set sashelp.class;
        put _all_;
        output;
    run;

Log output (first six rows):
    Name=Alfred Sex=M Age=14 Height=69 Weight=112.5 _ERROR_=0 _N_=1
    Name=Alice Sex=F Age=13 Height=56.5 Weight=84 _ERROR_=0 _N_=2
    Name=Barbara Sex=F Age=13 Height=65.3 Weight=98 _ERROR_=0 _N_=3
    Name=Carol Sex=F Age=14 Height=62.8 Weight=102.5 _ERROR_=0 _N_=4
    Name=Henry Sex=M Age=14 Height=63.5 Weight=102.5 _ERROR_=0 _N_=5
    Name=James Sex=M Age=12 Height=57.3 Weight=83 _ERROR_=0 _N_=6

Quick Review – Direct Access
Direct access: read specific records.
You must specify which row(s) to read.
SAS has no way of knowing when you want to stop (so tell it).
Let's look at an example:

    /* direct access */
    data work.direct;
        do i=2, 3, 9;
            set sashelp.class point=i;
            put _all_;
            output;
        end;
        stop;
    run;

Log output:
    i=2 Name=Alice Sex=F Age=13 Height=56.5 Weight=84 _ERROR_=0 _N_=1
    i=3 Name=Barbara Sex=F Age=13 Height=65.3 Weight=98 _ERROR_=0 _N_=1
    i=9 Name=Jeffrey Sex=M Age=13 Height=62.5 Weight=84 _ERROR_=0 _N_=1

Quick Review – Implicit/Explicit Looping
Implicit looping
    By default, DATA steps execute an implicit loop.
Explicit looping
    You specify what, when, and how long to loop.

Quick Review – Implicit/Explicit Looping
Let's look at an example, side by side with sequential access:

    /* implicit looping */
    data work.sequential;
        set sashelp.class;
        put _all_;
        output;
    run;

    /* explicit looping */
    data work.sequential;
        do until (eof);
            set sashelp.class end=eof;
            put _all_;
            output;
        end;
    run;

Quick Review – Implicit/Explicit Looping
Explicit looping is usually used with direct access.
It was used in the previous direct access example:

    /* direct access */
    data work.direct;
        do i=2, 3, 9;
            set sashelp.class point=i;
            put _all_;
            output;
        end;
        stop;
    run;

Log output:
    i=2 Name=Alice Sex=F Age=13 Height=56.5 Weight=84 _ERROR_=0 _N_=1
    i=3 Name=Barbara Sex=F Age=13 Height=65.3 Weight=98 _ERROR_=0 _N_=1
    i=9 Name=Jeffrey Sex=M Age=13 Height=62.5 Weight=84 _ERROR_=0 _N_=1

Quick Review – Indexes
An index is:
    an optional file created for a SAS dataset
    a way to get direct access to specific records based on key values
An index contains:
    key values stored in ascending order
    pointers to the corresponding records

Quick Review – Indexes
Let's simulate an index:

    /* simulate an index on variable: age */
    data work.class_index;
        set sashelp.class;
        row_id=_n_;
        keep age row_id;
    run;

    proc sort data=work.class_index;
        by age row_id;
    run;

    data work.class_index;
        keep age rid;
        retain age rid;
        length rid $20;
        set work.class_index;
        by age;
        if first.age then rid = trim(put(row_id,best.-L));
        else rid = trim(rid) || ',' || trim(put(row_id,best.-L));
        if last.age then output;
    run;

Quick Review – Indexes
Let's use direct access with a real index:

    /* create work.class with index */
    data work.class (index=(age));
        set sashelp.class;
    run;

    /* direct access with rows */
    data work.direct;
        do i=2, 3, 9;
            set sashelp.class point=i;
            put _all_;
            output;
        end;
        stop;
    run;

    /* direct access with values */
    data work.direct;
        set work.class;
        where age=13;
        put _all_;
        output;
    run;

Log output from the row-number version (the WHERE version reads the same three rows):
    i=2 Name=Alice Sex=F Age=13 Height=56.5 Weight=84 _ERROR_=0 _N_=1
    i=3 Name=Barbara Sex=F Age=13 Height=65.3 Weight=98 _ERROR_=0 _N_=1
    i=9 Name=Jeffrey Sex=M Age=13 Height=62.5 Weight=84 _ERROR_=0 _N_=1

Quick Review – Indexes
And now with explicit looping:

    /* direct access */
    data work.direct;
        do age=13,14;
            do until (eof);
                set class key=age end=eof;
                if _IORC_=0 then do; /* 0 indicates a match was found */
                    put _all_;
                    output;
                end;
                /* if no match, reset the error flag and continue */
                else _ERROR_=0;
            end;
        end;
        stop;
    run;

Log output:
    age=13 eof=0 Name=Alice Sex=F Height=56.5 Weight=84 _ERROR_=0 _IORC_=0 _N_=1
    age=13 eof=0 Name=Barbara Sex=F Height=65.3 Weight=98 _ERROR_=0 _IORC_=0 _N_=1
    ...
    age=14 eof=0 Name=Henry Sex=M Height=63.5 Weight=102.5 _ERROR_=0 _IORC_=0 _N_=1
    age=14 eof=0 Name=Judy Sex=F Height=64.3 Weight=90 _ERROR_=0 _IORC_=0 _N_=1

Quick Review – Indexes
And now data driven:

    /* direct access */
    data work.driver;
        age=13; output;
        age=14; output;
    run;

    data work.direct;
        set work.driver;                        /* <- sequential access & implicit loop */
        do until (eof);                         /* <- explicit loop */
            set work.class key=age end=eof;     /* <- direct access */
            if _IORC_=0 then do;
                put _all_;
                output;
            end;
            else _ERROR_=0;
        end;
    run;

Hash Object = Memory Table
Think of a traditional row/column table:
    Create it
    Define it
    Fill it
    Access it

Hash Object = Memory Table
Create it
    It is a dynamic run-time memory table.
    It does not exist until you create it.
    It can also be dynamically deleted!

Hash Object = Memory Table
Create it

    data ;
        ...
        /* Create it */
        declare hash h_small ();
        ...
    run;

This creates a dynamic run-time memory table:
    named h_small (NOT work.h_small, just h_small)
    with no structure (variables/index)
    with no content (rows)

Hash Object = Memory Table
Define it

    data ;
        ...
        /* Define it */
        length keyvar smallvar1-smallvar2 8 newvar $12;
        rc = h_small.DefineKey ( "keyvar" );
        rc = h_small.DefineData ( "smallvar1","smallvar2","newvar" );
        rc = h_small.DefineDone ();
        ...
    run;

    Create an index variable called keyvar.
    Create three other variables.
    Notice that length is declared before the variables are used.
    Stop defining.

Hash Object = Memory Table
Fill it

    data ;
        ...
        /* Fill it */
        do until ( eof_small );
            set work.small (keep=keyvar smallvar1-smallvar2) end = eof_small;
            newvar = "any text";
            rc = h_small.add ();
        end;
        ...
    run;

    Use explicit looping.
    Retrieve values from another table.
    Assign variables values any way you want.
    Fill it.

Hash Object = Memory Table
Access it

    data ;
        ...
        /* Access it */
        do until ( eof_big );
            set work.big end = eof_big;
            smallvar1=.;
            smallvar2=.;
            newvar=" ";
            rc = h_small.find ();
            output;
        end;
        ...
    run;

    Use explicit looping.
    Load keyvar with a value.
    Access the memory table by keyvar.
    Variables are assigned only if a match is found.

Hash Object = Memory Table
All four steps together:

    ...
    /* Create it */
    declare hash h_small ();

    /* Define it */
    length keyvar smallvar1-smallvar2 8 newvar $12;
    rc = h_small.DefineKey ( "keyvar" );
    rc = h_small.DefineData ( "smallvar1","smallvar2","newvar" );
    rc = h_small.DefineDone ();

    /* Fill it */
    do until ( eof_small );
        set work.small (keep=keyvar smallvar1-smallvar2) end = eof_small;
        newvar = "any text";
        rc = h_small.add ();
    end;

    /* Access it */
    do until ( eof_big );
        set work.big end = eof_big;
        smallvar1=.;
        smallvar2=.;
        newvar=" ";
        rc = h_small.find ();
        output;
    end;
    ...
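The create/define/fill/access steps can also be written more compactly. The sketch below is one possible variation, not the method used in this presentation: it assumes the dataset: argument of the hash constructor to fill the table in one statement, uses the implicit DATA step loop for work.big, and writes to a hypothetical output table work.hash_lookup. Because the table is loaded straight from work.small, the computed column newvar is left out.

```sas
data work.hash_lookup (drop=rc);
    if _n_ = 1 then do;
        /* Define the host variables without reading a row */
        if 0 then set work.small (keep=keyvar smallvar1-smallvar2);

        /* Create it and fill it: dataset: loads work.small at DefineDone */
        declare hash h_small (dataset: 'work.small');

        /* Define it */
        rc = h_small.defineKey('keyvar');
        rc = h_small.defineData('smallvar1', 'smallvar2');
        rc = h_small.defineDone();
    end;

    /* Access it: one lookup per row of work.big */
    set work.big;
    call missing(smallvar1, smallvar2);  /* clear values left over from the previous row */
    rc = h_small.find();                 /* rc = 0 means a match was found */
run;
```

The call missing reset plays the same role as the explicit assignments to missing in the slide code: without it, a row with no match would keep the values found for the previous row.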

Comparing Merge Methods
There are 12 ways to do anything with SAS.
This comparison is limited to two garden-variety merge techniques:
    Match merging
    Merging with indexes
Test environment:
    Pentium 4, 2.4GHz, 1.25GB RAM, 50GB disk
    XP Pro SP2 with SAS 9.1 (TS1M3)
    Each run was executed from a new session.

Comparing Merge Methods
Create sample tables (lifted from SAS-L):

    %let large_obs = ;

    data work.small ( keep = keyvar small: )
         work.large ( keep = keyvar large: );
        array keys(1:500000) $1 _temporary_;
        length keyvar 8;
        array smallvar [20];  retain smallvar 12;
        array largevar [682]; retain largevar 55;
        do _i_ = 1 to &large_obs ;
            keyvar = ceil (ranuni(1) * &large_obs);
            /* write each key value only once */
            if keys(keyvar) = ' ' then do;
                output large;
                if ranuni(1) < 1/5 then output small;
                keys(keyvar) = 'X';
            end;
        end;
    run;

Log output:
    NOTE: The data set WORK.SMALL has observations and 21 variables.
    NOTE: The data set WORK.LARGE has observations and 683 variables.

Match Merging
Requires sorting.
Used in 80% of all code (not statistically verified).

Match Merging

    /* basic match-merge with sort */
    proc sort data=work.small;
        by keyvar;
    run;

    NOTE: There were observations read from the data set WORK.SMALL.
    NOTE: The data set WORK.SMALL has observations and 21 variables.
    NOTE: PROCEDURE SORT used (Total process time):
          real time 2.00 seconds
          cpu time 0.23 seconds

    proc sort data=work.large;
        by keyvar;
    run;

    NOTE: There were observations read from the data set WORK.LARGE.
    NOTE: The data set WORK.LARGE has observations and 683 variables.
    NOTE: PROCEDURE SORT used (Total process time):
          real time 11:59.46
          cpu time seconds

12 minutes for sorting

Match Merging

    /* basic match-merge with sort */
    data work.match_merge;
        merge work.large (in=a) work.small (in=b);
        by keyvar;
        if a;
    run;

    NOTE: There were observations read from the data set WORK.LARGE.
    NOTE: There were observations read from the data set WORK.SMALL.
    NOTE: The data set WORK.MATCH_MERGE has obs and 703 variables.
    NOTE: DATA statement used (Total process time):
          real time 8:39.31
          cpu time seconds

8.5 minutes to merge

Match Merging
12 minute sort + 8:39 merge = 20.5 minutes
I/O was the real speed killer (sorting both files).

Merge with an Index
An index can eliminate the need for sorting.
It usually speeds things up.

Merge with an Index

    options msglevel=i;

    /* creating indexes */
    proc datasets lib=work nolist;
        modify small;
        index create keyvar;
        modify large;
        index create keyvar;
    quit;

    INFO: Multiple concurrent threads will be used to create the index.
    NOTE: Simple index keyvar has been defined.
    NOTE: MODIFY was successful for WORK.SMALL.DATA.
    INFO: Multiple concurrent threads will be used to create the index.
    NOTE: Simple index keyvar has been defined.
    NOTE: MODIFY was successful for WORK.LARGE.DATA.
    NOTE: PROCEDURE DATASETS used (Total process time):
          real time seconds
          cpu time 6.40 seconds

1 minute for indexing

Merge with an Index

    /* merge with indexes (no sorting) */
    data work.match_merge_index;
        merge work.large (in=a) work.small (in=b);
        by keyvar;
        if a;
    run;

    INFO: Index keyvar selected for BY clause processing.
    NOTE: There were observations read from the data set WORK.LARGE.
    NOTE: There were observations read from the data set WORK.SMALL.
    NOTE: The data set WORK.MATCH_MERGE_INDEX has observations and 703 variables.
    NOTE: DATA statement used (Total process time):
          real time 1:21:18.98
          cpu time 1:

About 1 hour 21 minutes for merging

Merge with an Index
1 minute index + 81 minute merge = 82 minutes
Indexes usually speed things up, but generally not when accessing every record.
For every record being read, from each table, SAS must:
    read from the index to get the RIDs for each value
    then read each record by RID
This essentially doubled the I/O.
What if work.large were already sorted?
    Only work.small would need the index.

Merge with an Index (large pre-sorted)

    /* creating indexes */
    proc datasets lib=work nolist;
        modify small;
        index create keyvar;
    quit;

    INFO: Multiple concurrent threads will be used to create the index.
    NOTE: Simple index keyvar has been defined.
    NOTE: MODIFY was successful for WORK.SMALL.DATA.
    NOTE: PROCEDURE DATASETS used (Total process time):
          real time 2.29 seconds
          cpu time 0.22 seconds

2 seconds for indexing

Merge with an Index (large pre-sorted)

    /* merge with index on small (large is already sorted) */
    data work.match_merge_index;
        merge work.large (in=a) work.small (in=b);
        by keyvar;
        if a;
    run;

    INFO: Index keyvar selected for BY clause processing.
    NOTE: There were observations read from the data set WORK.LARGE.
    NOTE: There were observations read from the data set WORK.SMALL.
    NOTE: The data set WORK.MATCH_MERGE_INDEX has obs and 703 vars.
    NOTE: DATA statement used (Total process time):
          real time 7:46.57
          cpu time seconds

7.75 minutes for merging

Merge with an Index (large pre-sorted)
2 second index + 7:46 merge = 7.8 minutes
Eliminated I/O thrashing on work.large.

Memory Table Merge

    /* merge with memory table (no sorting or indexing required!) */
    data work.hash_merge (drop=rc i);
        /* Create it */
        declare hash h_small ();

        /* Define it */
        length keyvar smallvar1-smallvar20 8;
        array smallvar(20);
        rc = h_small.DefineKey ( "keyvar" );
        rc = h_small.DefineData ( "smallvar1","smallvar2","smallvar3",
                                  "smallvar4","smallvar5","smallvar6",
                                  "smallvar7","smallvar8","smallvar9",
                                  "smallvar10","smallvar11","smallvar12",
                                  "smallvar13","smallvar14","smallvar15",
                                  "smallvar16","smallvar17","smallvar18",
                                  "smallvar19","smallvar20" );
        rc = h_small.DefineDone ();
        ...

Memory Table Merge

        /* Fill it */
        do until ( eof_small );
            set work.small end = eof_small;
            rc = h_small.add ();
        end;

        /* Merge it */
        do until ( eof_large );
            set work.large end = eof_large;
            /* this loop initializes variables before merging from h_small */
            do i=lbound(smallvar) to hbound(smallvar);
                smallvar(i) =.;
            end;
            rc = h_small.find ();
            output;
        end;
    run;

Memory Table Merge

    NOTE: There were observations read from the data set WORK.SMALL.
    NOTE: There were observations read from the data set WORK.LARGE.
    NOTE: The data set WORK.HASH_MERGE has obs and 703 variables.
    NOTE: DATA statement used (Total process time):
          real time 7:17.23
          cpu time seconds

7.3 minutes for merging

Comparing Merge Methods
Merge results:
    Match merge w/sorting = 20.5 minutes
    Index merge w/o sorting = 82 minutes
    Index merge w/pre-sorting = 7.8 minutes
    Memory table merge = 7.3 minutes

Stacking the Odds?
Did I select tables that favor hashing? NO! And I will prove it!
    Rerun the two fastest merges in reverse order.
    Merge small onto large.

Merge with an Index (large pre-sorted)

    /* merge with index on small (large is already sorted) */
    data work.match_merge_index;
        merge work.small (in=a)
              work.large (in=b keep=keyvar largevar1-largevar20);
        by keyvar;
        if a;
    run;

    INFO: Index keyvar selected for BY clause processing.
    NOTE: There were observations read from the data set WORK.SMALL.
    NOTE: There were observations read from the data set WORK.LARGE.
    NOTE: The data set WORK.MATCH_MERGE_INDEX has obs and 41 vars.
    NOTE: DATA statement used (Total process time):
          real time 2:08.84
          cpu time 7.11 seconds

2 second index + 2:08 merge = 2.2 minutes

Memory Table Merge

    /* merge with memory table (no sorting or indexing required!) */
    data work.hash_merge (drop=rc i);
        /* Create it */
        declare hash h_large ();

        /* Define it */
        length keyvar largevar1-largevar20 8;
        array largevar(20);
        rc = h_large.DefineKey ( "keyvar" );
        rc = h_large.DefineData ( "largevar1","largevar2","largevar3",
                                  "largevar4","largevar5","largevar6",
                                  "largevar7","largevar8","largevar9",
                                  "largevar10","largevar11","largevar12",
                                  "largevar13","largevar14","largevar15",
                                  "largevar16","largevar17","largevar18",
                                  "largevar19","largevar20" );
        rc = h_large.DefineDone ();
        ...

Memory Table Merge

        /* Fill it */
        do until ( eof_large );
            set work.large(keep=keyvar largevar1-largevar20) end = eof_large;
            rc = h_large.add ();
        end;

        /* Merge it */
        do until ( eof_small );
            set work.small end = eof_small;
            do i=lbound(largevar) to hbound(largevar);
                largevar(i) =.;
            end;
            rc = h_large.find ();
            output;
        end;
    run;

Memory Table Merge

    NOTE: There were observations read from the data set WORK.LARGE.
    NOTE: There were observations read from the data set WORK.SMALL.
    NOTE: The data set WORK.HASH_MERGE has observations and 41 variables.
    NOTE: DATA statement used (Total process time):
          real time 1:19.46
          cpu time 6.43 seconds

1:19 merge = 1.3 minutes

Comparing Merge Methods
Merge results:
    Match merge w/sorting = 20.5 minutes
    Index merge w/o sorting = 82 minutes
    Index merge w/pre-sorting = 7.8 minutes
    Memory table merge = 7.3 minutes
And when reversing the order of the tables:
    Index merge w/pre-sorting = 2.2 minutes
    Memory table merge = 1.3 minutes

Limitations and Overcoming Them
Limitation #1
Hash tables are not persisted across DATA steps.
    They are automatically deleted at the end of the step.
    Once deleted, memory is immediately freed up.
Overcome it
    Do multiple merges within a single DATA step, or
    merge multiple tables at once on different keys.
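One way to picture "merge multiple tables at once on different keys" is a single DATA step that holds two memory tables. This is a sketch under stated assumptions: work.facts (the table being read, carrying both keyvar and region_id), work.region, region_id, region_name, and the output table work.multi_merge are all hypothetical names, and the dataset: constructor argument is used to fill each table.

```sas
data work.multi_merge (drop=rc);
    if _n_ = 1 then do;
        /* Define the host variables for both lookups without reading a row */
        if 0 then set work.small  (keep=keyvar smallvar1)
                      work.region (keep=region_id region_name);

        /* First memory table, keyed by keyvar */
        declare hash h_small (dataset: 'work.small');
        rc = h_small.defineKey('keyvar');
        rc = h_small.defineData('smallvar1');
        rc = h_small.defineDone();

        /* Second memory table, keyed by region_id (hypothetical) */
        declare hash h_region (dataset: 'work.region');
        rc = h_region.defineKey('region_id');
        rc = h_region.defineData('region_name');
        rc = h_region.defineDone();
    end;

    /* A single pass of the driving table does both lookups */
    set work.facts;
    call missing(smallvar1, region_name);
    rc = h_small.find();
    rc = h_region.find();
run;
```

Both tables live only for the duration of this step, so everything that needs them has to happen here.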

Limitations and Overcoming Them
Limitation #2
Hash tables are limited by available memory.
    Estimate as variables * length * records.
    To include all 682 variables from work.large: (682+1) * 8 * (number of records), or about 1.7 GB.
Overcome it
    Increase available memory with -memsize.
    Reduce variable lengths and/or concatenate.
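The variables * length * records estimate can be computed from the table itself. A minimal sketch, assuming work.large exists with 682 numeric data variables plus the key, each 8 bytes long; the variable names n_rows, est_bytes, and est_gb are hypothetical, and the result ignores whatever per-item overhead the hash object adds.

```sas
data _null_;
    /* nobs= is populated at compile time, so no rows are actually read */
    if 0 then set work.large nobs=n_rows;

    n_vars    = 682 + 1;                 /* data variables plus the key */
    est_bytes = n_vars * 8 * n_rows;     /* variables * length * records */
    est_gb    = est_bytes / 1024**3;

    put n_rows= comma12. est_bytes= comma18. est_gb= 8.2;
    stop;
run;
```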

Limitations and Overcoming Them
Limitation #3
Key values must be distinct: no duplicates!
    The memory table should not be the "many" in a many-one merge.
Overcome it
    Add variables until the key is unique.
    Create a sequence variable to make it unique.
    For example…

Limitations and Overcoming Them
Create a sequence variable to make it unique:

    rc = h_large.DefineKey ( "keyvar","keyseq" );
    rc = h_large.DefineData ( "largevar1","largevar2", … );
    rc = h_large.DefineDone ();

    /* Fill it */
    maxkeyseq=0;
    do until ( eof_large );
        set work.large(keep=keyvar) end = eof_large;
        by keyvar;
        if first.keyvar then keyseq=0;
        keyseq+1;                      /* 1, 2, 3, ... within each keyvar */
        rc = h_large.add ();
        if last.keyvar then maxkeyseq=max(maxkeyseq,keyseq);
    end;

Limitations and Overcoming Them
Create a sequence variable to make it unique:

    /* Merge it */
    do until ( eof_small );
        set work.small end = eof_small;
        /* try every possible sequence number for this keyvar */
        do keyseq=1 to maxkeyseq;
            do i=lbound(largevar) to hbound(largevar);
                largevar(i) =.;
            end;
            rc = h_large.find ();
            output;
        end;
        drop maxkeyseq;
    end;
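Later SAS releases offer another route around the duplicate-key limitation. The sketch below is not part of the original presentation and will not run on the SAS 9.1 setup used for the timings: it assumes SAS 9.2 or later, where the hash constructor accepts multidata: 'y' and the find_next method retrieves further items with the same key. The output table name work.hash_merge_dups is hypothetical, and only two data variables are loaded to keep the example short.

```sas
data work.hash_merge_dups (drop=rc);
    if _n_ = 1 then do;
        /* Define the host variables without reading a row */
        if 0 then set work.large (keep=keyvar largevar1-largevar2);

        /* multidata: 'y' lets the table hold several items per key (SAS 9.2+) */
        declare hash h_large (dataset: 'work.large', multidata: 'y');
        rc = h_large.defineKey('keyvar');
        rc = h_large.defineData('largevar1', 'largevar2');
        rc = h_large.defineDone();
    end;

    set work.small;
    call missing(largevar1, largevar2);

    rc = h_large.find();               /* first item for this keyvar, if any */
    if rc ne 0 then output;            /* no match: keep the work.small row as is */
    do while (rc = 0);
        output;                        /* one output row per matching item */
        rc = h_large.find_next();      /* next item with the same keyvar */
    end;
run;
```

Unlike the sequence-variable approach, this needs no BY processing of work.large and writes only as many rows per key as actually exist.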

Conclusion
Merge results:
    Match merge w/sorting = 20.5 minutes
    Index merge w/o sorting = 82 minutes
    Index merge w/pre-sorting = 7.8 minutes
    Memory table merge = 7.3 minutes
And when reversing the order of the tables:
    Index merge w/pre-sorting = 2.2 minutes
    Memory table merge = 1.3 minutes

Acknowledgements
    The Hash Man: Paul Dorfman
    Richard DeVenezia (
Recommended reading
    Hash tip sheet: support.sas.com/rnd/base/topics/datastep/dot/hash-tip-sheet.pdf
