Published by Leslie Small. Modified over 9 years ago.
1
Denormalizing Data with PROC SQL
2
Demoralizing Data with PROC SQL
Is that a real word?? Spell Check…
3
Grocery Store
4
What You Have

CUSTOMER_ID NAME
----------- ------
1           Ansel
2           Fiona
3           James
4           Kathy
5           Ying
6           Otto
7           Costas
8           Abdul
9           Enrico
10          Mitzu

ITEM_ID ITEM_NAME
------- ---------
1       eggs
2       milk
3       bread
4       chicken
5       beef
6       broccoli
7       carrots
8       apples
9       peaches
10      dog food

CUSTOMER_ID ITEM_ID
----------- -------
1           1
1           2
1           4
1           6
2           2
2           3
2           4
2           7
2           8
…           …
10          4
            5
            8

Relational tables in normal form
Purchase events in many-to-many relation
Good for relational storage, bad for computing stats
Data step: join tables using complex merge
5
What you want

CUSTOMER_ID NAME   EGGS MILK BREAD … DOG_FOOD
----------- ------ ---- ---- ----- - --------
1           Ansel  1    1    0     … 0
2           Fiona  1    1    1     … 1
3           James  0    1    0     … 0
4           Kathy  1    0    1     … 0
5           Ying   1    1    1     … 1
6           Otto   0    0    1     … 1
7           Costas 0    0    1     … 1
8           Abdul  0    1    0     … 0
9           Enrico 1    0    1     … 1
10          Mitzu  0    1    0     … 1

Matrix shows who bought what
One row per customer
One column per item
Easy to compute stats
6
How do you get this?
A few SQL examples to build up to a solution…
7
SQL Examples
What items did customer #1 buy?

select item_id
from purchases
where customer_id = 1;

ITEM_ID
----------
1
2
4
6
8
What items did customer #1 buy?
Join with the groceries table to get the item name

select P.item_id, G.item_name
from purchases P, groceries G
where G.item_id = P.item_id
  and P.customer_id = 1;

ITEM_ID    ITEM_NAME
---------- ---------
1          eggs
2          milk
4          chicken
6          broccoli
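The join on this slide can be tried outside SAS. The following is a minimal sketch that assumes Python's built-in sqlite3 module as a stand-in for PROC SQL; it loads only the handful of purchase rows visible on the "What You Have" slide, so it is not the full dataset behind the slides.

```python
import sqlite3

# In-memory SQLite database standing in for the SAS tables (an assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table groceries (item_id integer primary key, item_name text);
    create table purchases (customer_id integer, item_id integer);
    insert into groceries values
        (1, 'eggs'), (2, 'milk'), (3, 'bread'),
        (4, 'chicken'), (5, 'beef'), (6, 'broccoli');
    -- Subset of the purchase rows shown on the slide.
    insert into purchases values
        (1, 1), (1, 2), (1, 4), (1, 6), (2, 2), (2, 3), (2, 4);
""")

# Same join as the slide, written with an explicit join ... on clause
# and a bound parameter for the customer id.
rows = conn.execute("""
    select P.item_id, G.item_name
    from purchases P
    join groceries G on G.item_id = P.item_id
    where P.customer_id = ?
    order by P.item_id
""", (1,)).fetchall()

for item_id, item_name in rows:
    print(item_id, item_name)
```

The comma-style join used on the slide and the explicit `join ... on` form return the same rows here; the explicit form just keeps the join condition separate from the customer filter.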
9
How many customers bought eggs?
Use the SQL aggregate function count().

select count(*)
from purchases P, groceries G
where P.item_id = G.item_id
  and G.item_name = 'eggs';

COUNT(*)
--------
5
10
Did customer #1 buy eggs?
Restrict by customer. As long as each customer/item pair appears at most once in purchases, count() returns 1 or 0, i.e., yes or no.

select count(*)
from groceries G, purchases P
where P.item_id = G.item_id
  and G.item_name = 'eggs'
  and P.customer_id = 1;

COUNT(*)
--------
1
11
Did customer #10 buy eggs?

select count(*)
from groceries G, purchases P
where P.item_id = G.item_id
  and G.item_name = 'eggs'
  and P.customer_id = 10;

COUNT(*)
--------
0
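The count()-as-yes/no trick depends on there being at most one purchase row per customer and item. The sketch below, again assuming Python's sqlite3 module in place of PROC SQL, shows both cases. The function names `count_purchases` and `bought_flag` are made up for this illustration, and the duplicated milk row for customer 2 is hypothetical, added only to show what happens when the same pair appears twice.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table groceries (item_id integer primary key, item_name text);
    create table purchases (customer_id integer, item_id integer);
    insert into groceries values (1, 'eggs'), (2, 'milk'), (3, 'bread');
    insert into purchases values (1, 1), (1, 2), (2, 2), (2, 3);
    -- Hypothetical repeat purchase: customer 2 buys milk a second time.
    insert into purchases values (2, 2);
""")

def count_purchases(customer_id, item_name):
    """Raw count, as on the slide: 0 or 1 only if pairs are unique."""
    (n,) = conn.execute("""
        select count(*)
        from groceries G, purchases P
        where P.item_id = G.item_id
          and G.item_name = ?
          and P.customer_id = ?
    """, (item_name, customer_id)).fetchone()
    return n

def bought_flag(customer_id, item_name):
    """Comparison on the count, which SQLite evaluates to 1 or 0."""
    (flag,) = conn.execute("""
        select count(*) > 0
        from groceries G, purchases P
        where P.item_id = G.item_id
          and G.item_name = ?
          and P.customer_id = ?
    """, (item_name, customer_id)).fetchone()
    return flag

print(count_purchases(1, 'eggs'))   # customer 1 has one eggs row
print(count_purchases(2, 'eggs'))   # customer 2 has none
print(count_purchases(2, 'milk'))   # duplicated row is counted twice
print(bought_flag(2, 'milk'))       # flag stays a yes/no value
```

If repeat purchases can occur in the real data, a flag of this kind (or de-duplicating the purchases first) keeps the final matrix strictly 0/1; whether that is wanted depends on whether the statistics need "bought at all" or "how many times".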
12
Subqueries
In SQL, the select clause can include a query that returns a scalar value. Here the subquery does not depend on the outer row, so every customer gets the same total.

select name,
  (select count(*) from purchases) num_purchases
from customers;

NAME        NUM_PURCHASES
----------- -------------
Ansel       58
Fiona       58
James       58
Kathy       58
Ying        58
Otto        58
Costas      58
Abdul       58
Enrico      58
Mitzu       58
13
Correlated Subqueries
Relate inner and outer queries via alias

select name,
  (select count(*)
   from purchases
   where customer_id = C.customer_id) num_purchases
from customers C;

NAME        NUM_PURCHASES
----------- -------------
Ansel       4
Fiona       9
James       6
Kathy       3
Ying        8
Otto        7
Costas      7
Abdul       2
Enrico      7
Mitzu       5
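To see the correlation at work, the query can be run on a small sample. This is a sketch assuming Python's sqlite3 module rather than PROC SQL, with three customers and only the purchase rows that are legible on the "What You Have" slide, so the counts differ from the full results shown above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table customers (customer_id integer primary key, name text);
    create table purchases (customer_id integer, item_id integer);
    insert into customers values (1, 'Ansel'), (2, 'Fiona'), (3, 'James');
    -- Partial data: only the rows visible on the slide.
    insert into purchases values
        (1, 1), (1, 2), (1, 4), (1, 6),
        (2, 2), (2, 3), (2, 4), (2, 7), (2, 8);
""")

# The inner query is re-evaluated for each outer row: C.customer_id
# refers to the customer currently being output.
per_customer = conn.execute("""
    select name,
           (select count(*)
            from purchases
            where customer_id = C.customer_id) num_purchases
    from customers C
    order by C.customer_id
""").fetchall()

for name, num_purchases in per_customer:
    print(name, num_purchases)
```

A customer with no rows in purchases still appears in the output with a count of 0, which is what makes this form suitable for building one matrix row per customer.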
14
Putting the pieces together
Joins to get data from multiple tables
Count() to get 0/1, yes/no
Correlated subqueries rotate rows to columns
Aliases to name columns
15
Final query

select customer_id, name,
  (select count(*)
   from purchases P, groceries G
   where G.item_id = P.item_id
     and G.item_name = 'eggs'
     and P.customer_id = C.customer_id) eggs,
  (select count(*)
   from purchases P, groceries G
   where G.item_id = P.item_id
     and G.item_name = 'milk'
     and P.customer_id = C.customer_id) milk,
  (select count(*)
   from purchases P, groceries G
   where G.item_id = P.item_id
     and G.item_name = 'bread'
     and P.customer_id = C.customer_id) bread
from customers C;
16
SAS Code

proc sql;
  create table Work.Purchase_Matrix as
  select customer_id, name,
    (select count(*)
     from purchases P, groceries G
     where G.item_id = P.item_id
       and G.item_name = 'eggs'
       and P.customer_id = C.customer_id) eggs,
    (select count(*)
     from purchases P, groceries G
     where G.item_id = P.item_id
       and G.item_name = 'milk'
       and P.customer_id = C.customer_id) milk,
    (select count(*)
     from purchases P, groceries G
     where G.item_id = P.item_id
       and G.item_name = 'bread'
       and P.customer_id = C.customer_id) bread
  from customers C;
quit;
17
Final Dataset

CUSTOMER_ID NAME        EGGS       MILK       BREAD
----------- ----------- ---------- ---------- ----------
1           Ansel       1          1          0
2           Fiona       1          1          1
3           James       0          1          0
4           Kathy       1          0          1
5           Ying        1          1          1
6           Otto        0          0          1
7           Costas      0          0          1
8           Abdul       0          1          0
9           Enrico      1          0          1
10          Mitzu       0          1          0
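Writing one correlated subquery per item by hand gets tedious once there are more than a few items. One possible workaround is to generate the select list from the groceries table. The sketch below assumes Python's sqlite3 module instead of PROC SQL; `build_matrix_sql` is a hypothetical helper written for this illustration, and the data is the partial set of rows legible on the "What You Have" slide, so the resulting matrix does not match the dataset above.

```python
import sqlite3

def build_matrix_sql(item_names):
    """Build the denormalizing query: one correlated subquery per item."""
    columns = []
    for item_name in item_names:
        literal = item_name.replace("'", "''")        # escape for a SQL string
        alias = item_name.upper().replace(" ", "_")   # 'dog food' -> DOG_FOOD
        columns.append(
            "(select count(*) from purchases P, groceries G"
            " where G.item_id = P.item_id"
            f" and G.item_name = '{literal}'"
            f' and P.customer_id = C.customer_id) "{alias}"'
        )
    return ("select customer_id, name,\n  " + ",\n  ".join(columns)
            + "\nfrom customers C order by customer_id")

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table customers (customer_id integer primary key, name text);
    create table groceries (item_id integer primary key, item_name text);
    create table purchases (customer_id integer, item_id integer);
    insert into customers values (1, 'Ansel'), (2, 'Fiona'), (3, 'James');
    insert into groceries values
        (1, 'eggs'), (2, 'milk'), (3, 'bread'), (4, 'chicken'), (5, 'beef'),
        (6, 'broccoli'), (7, 'carrots'), (8, 'apples'), (9, 'peaches'),
        (10, 'dog food');
    -- Partial data: only the rows visible on the slide.
    insert into purchases values
        (1, 1), (1, 2), (1, 4), (1, 6),
        (2, 2), (2, 3), (2, 4), (2, 7), (2, 8);
""")

item_names = [row[0] for row in
              conn.execute("select item_name from groceries order by item_id")]
cursor = conn.execute(build_matrix_sql(item_names))
header = [d[0] for d in cursor.description]
matrix = cursor.fetchall()

print(header)
for row in matrix:
    print(row)
```

Generating the query keeps the column list in step with the groceries table, but the query still runs one subquery per item per customer, so it may remain slow for wide outputs.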
18
Resources
SQL may not always be the most appropriate choice for a given problem.
This technique starts to get untenable as the number of columns needed in the output increases.

DATA Step vs. PROC SQL: What’s a neophyte to do?
http://www2.sas.com/proceedings/sugi29/269-29.pdf

Proc SQL versus The Data Step
http://www.nesug.org/proceedings/nesug06/hw/hw06.pdf
19
Questions?