The Power of the BY Statement SVSUG 2009.06.25 Paul Choate, California Developmental Services (& Toby Dunn, U.S. Army Medical Department Center & School)


BY Statement Syntax and Usage
The BY statement instructs the DATA step or a procedure to process dataset observations in groups rather than singly. It can be used whenever the data are ordered on the BY variables, or can be read in that order through a SAS dataset index. In the DATA step this allows observations to be summarized or reorganized according to a group structure. In PROC steps it allows SAS to process and present data in groups. The basic syntax of the BY statement is the same throughout SAS, except that the GROUPFORMAT option is available only in the DATA step.
BY <DESCENDING> var1 <... <DESCENDING> varn> <NOTSORTED> <GROUPFORMAT>;
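As a minimal sketch of the basic usage described above, the following sorts a dataset and then processes it in BY groups. It assumes the SASHELP.CLASS sample dataset that ships with SAS; the output dataset name work.class is only illustrative.

```sas
/* Assumption: SASHELP.CLASS (Name, Sex, Age, Height, Weight) is available. */
proc sort data=sashelp.class out=work.class;
  by Sex Age Name;
run;

/* The sorted data can now be processed in groups: one listing per Sex/Age group. */
proc print data=work.class noobs;
  by Sex Age;
  var Name Height Weight;
run;
```

The BY statement in the PROC PRINT step lists a leading subset of the sort keys, which is allowed because data sorted by Sex, Age, and Name are also in Sex and Age order.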

BY Statement Syntax and Usage
BY Sex Age Name;
NAME     SEX  AGE  HEIGHT  WEIGHT
Alice    F
Barbara  F
Carol    F
Judy     F
Jeffrey  M
Alfred   M
Ronald   M
Philip   M

BY Statement Syntax and Usage
Sort order is platform dependent and is based on the internal ordering of the platform character set, called the collating sequence.
ASCII (PC) character set order: ..., 1, 2, 3, ... A, B, C, ... a, b, c, ...
EBCDIC (MVS) character set order: ..., a, b, c, ... A, B, C, ... 1, 2, 3, ...

BY Statement Syntax and Usage
BY Sex DESCENDING Age Name;
NAME     SEX  AGE  HEIGHT  WEIGHT
Janet    F
Carol    F
Alice    F
Barbara  F
Philip   M
Alfred   M
Henry    M

BY Statement Syntax and Usage
BY Age NOTSORTED;
NAME     AGE  HEIGHT  WEIGHT
Carol
Judy
Janet
Ronald
Mary
Alice
Jeffrey
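The NOTSORTED option tells SAS that observations with the same BY value are stored together but not necessarily in sorted order, so each run of consecutive equal values is treated as a BY group. A small sketch, assuming the SASHELP.CLASS sample dataset (stored in Name order, not Age order); the variable RunStart is a hypothetical name used only for illustration:

```sas
/* Assumption: SASHELP.CLASS is not sorted by Age. Without NOTSORTED this step */
/* would stop with a sort-order error; with it, every run of consecutive equal */
/* Age values is treated as its own BY group.                                  */
data work.runs;
  set sashelp.class;
  by Age notsorted;
  RunStart = first.Age;   /* 1 on the first observation of each run, else 0 */
run;

proc print data=work.runs noobs;
  var Name Age RunStart;
run;
```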

BY Statement Syntax and Usage
PROC FORMAT;
  VALUE $Initials 'A'-<'B'='A'
                  'B'-<'C'='B'
                  ...;
RUN;
DATA Class;
  SET Class;
  FORMAT Name $Initials.;
  BY Name GROUPFORMAT;
RUN;

BY Statement Syntax and Usage
GROUPFORMAT (cont.)
NAME     AGE  HEIGHT  WEIGHT
Alice
Alfred
Judy
Janet
Jeffrey
William

Introduction to Data Structure
Variables may be divided into two classes:
- primary key variables, whose values may be combined uniquely to identify one observation or event, and
- non-primary keys, whose values cannot be combined to uniquely identify an observation.
The primary and non-primary keys are related to each other through functional dependencies. Primary keys must be unique or form unique combinations called composite keys.

Introduction to Data Structure
The most fundamental rule is that no two rows shall have the same values for all primary key variables.
VEHICLETYPE  MODEL  MAKE   YEAR  COLOR
Truck        1500   Chevy  2008  Blue
Truck        1500   Chevy  2008  Blue
should be reduced to:
VEHICLETYPE  MODEL  MAKE   YEAR  COLOR  COUNT
Truck        1500   Chevy  2008  Blue   2

Introduction to Data Structure
Each variable in the dataset should have atomic values.
VEHICLE  MODEL  YEAR  COLOR  NUMSOLD  PACKAGE
Truck                 Blue   2        Sports, Standard
Truck                 Gold   3        Sports, Sports, Standard
should be restructured as:
VEHICLE  MODEL  YEAR  COLOR  NUMSOLD  PACKAGE
Truck                 Blue   1        Sports
Truck                 Blue   1        Standard
Truck                 Gold   2        Sports
Truck                 Gold   1        Standard
This is called First Normal Form with Redundancies.

BY Statement in the Data Step
The BY statement provides two automatic temporary variables for each BY variable: FIRST.variable and LAST.variable. They indicate whether an observation is:
- the first in a BY group
- the last in a BY group
- neither the first nor the last in a BY group
- both first and last, as is the case when there is only one observation in a BY group.

BY Statement in the Data Step
BY Sex Age;
SEX  AGE  FIRST.SEX  LAST.SEX  FIRST.AGE  LAST.AGE
F
F
F
F
M
M
M
M

BY Statement in the Data Step
A sorted BY variable whose values are unique has FIRST.variable and LAST.variable both set to 1 on every observation. Here Age is unique within Sex:
BY Sex Age;
SEX  AGE  FIRST.SEX  LAST.SEX  FIRST.AGE  LAST.AGE
F
F
M
M
M
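Because FIRST. and LAST. variables are automatic and are not written to the output dataset, one way to inspect them is to copy them into ordinary variables. A minimal sketch, assuming the SASHELP.CLASS sample dataset; the names FirstSex, LastSex, FirstAge, and LastAge are illustrative only:

```sas
proc sort data=sashelp.class out=work.class;
  by Sex Age;
run;

data work.flags;
  set work.class;
  by Sex Age;
  /* Copy the automatic variables so they survive into the output dataset */
  FirstSex = first.Sex;
  LastSex  = last.Sex;
  FirstAge = first.Age;
  LastAge  = last.Age;
  keep Name Sex Age FirstSex LastSex FirstAge LastAge;
run;

proc print data=work.flags noobs;
run;
```

Note that FIRST.Age is set to 1 whenever Age changes or whenever Sex changes, since a change in an earlier BY variable starts a new group for every later BY variable as well.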

BY Statement in the Data Step
Examples:
- Unduplication example
- Counting records example
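The code for these two examples is not part of the transcript. The following is a sketch of how unduplication and record counting are commonly written with FIRST. and LAST. variables; it is not necessarily the presenters' code, and it assumes the SASHELP.CLASS sample dataset.

```sas
proc sort data=sashelp.class out=work.class;
  by Sex Age;
run;

/* Unduplication: keep only the first observation of each Sex/Age combination */
data work.unique;
  set work.class;
  by Sex Age;
  if first.Age;
run;

/* Counting: one output row per Sex/Age group with the number of observations */
data work.counts;
  set work.class;
  by Sex Age;
  if first.Age then Count = 0;   /* reset at the start of each group        */
  Count + 1;                     /* sum statement: Count is retained         */
  if last.Age then output;       /* write one row when the group is complete */
  keep Sex Age Count;
run;
```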

Combining Datasets
In the DATA step the BY statement is used for combining data by:
- Interleaving with the SET statement
- Match-merging with the MERGE statement
- Updating with the UPDATE statement
- Modifying (beyond scope of presentation)

Interleaving Datasets
When a BY statement is used with a SET statement that specifies two or more datasets, the DATA step reads the files simultaneously, alternating between the files based on the BY variable order. This maintains the sort order of the data from the datasets as they are processed. For example, suppose there are two datasets, one for males and one for females, and both are sorted on Age. They can be interleaved into a single dataset sorted on Age and Gender.

Interleaving Datasets
Example:
- SET statement interleaving example
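The interleaving example itself is not reproduced in the transcript. A minimal sketch of the males and females case described earlier, assuming the SASHELP.CLASS sample dataset split into two illustrative datasets, work.females and work.males:

```sas
/* Build two datasets, each sorted on Age */
proc sort data=sashelp.class(where=(Sex='F')) out=work.females;
  by Age;
run;

proc sort data=sashelp.class(where=(Sex='M')) out=work.males;
  by Age;
run;

/* Interleave: the result is in Age order without a further sort.         */
/* Within each Age, observations from work.females come first because     */
/* that dataset is listed first in the SET statement.                     */
data work.interleaved;
  set work.females work.males;
  by Age;
run;
```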

Interleaving Datasets
With interleaving, the sum of a variable for each BY group may be attached back to the original non-aggregated dataset. This requires at least two passes of the data, but the efficiency and complexity may vary considerably based on the approach.

Interleaving Datasets
Example:
- Howard Schreier look-ahead processing
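The look-ahead example is not included in the transcript. One common way to attach a BY-group total to every detail row is to interleave a dataset with itself, which is sketched below. This is an illustration of the two-pass idea and may differ from the technique credited on the slide. It assumes the SASHELP.CLASS sample dataset; FirstPass and TotalWeight are hypothetical names.

```sas
proc sort data=sashelp.class out=work.class;
  by Sex;
run;

data work.with_total;
  /* Within each BY group, all rows from the first reference are read, */
  /* then all rows from the second reference.                          */
  set work.class(in=FirstPass) work.class;
  by Sex;
  if first.Sex then TotalWeight = 0;
  if FirstPass then TotalWeight + Weight;   /* pass 1: accumulate the total  */
  else output;                              /* pass 2: write the detail rows */
run;
```

Each output row carries the original variables plus the total Weight for its Sex group.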

Wookie One-Liners
- You?! It was your idea for Jar Jar?!
- And Lando never suggested a flea bath again.
- "I just need one head to finish my C3PO"
- All right, all right. I promise: No more Colt-45 commercials!
- What do you mean, "We're OUT of shampoo??!!!!"

Match-Merging Datasets
When a BY statement is used with a MERGE statement, the SAS datasets are read simultaneously, merging observations based on matching BY variables. When merging multiple datasets, usually all but one of the datasets should be unique on the BY variables. The combined unique observations are merged with each matching observation in the non-unique dataset. The unique observations are duplicated across the non-unique observations.

Match-Merging Datasets
Example:
- MERGE statement
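The MERGE example is not reproduced in the transcript. A minimal sketch of a match-merge in which one dataset is unique on the BY variable, assuming the SASHELP.CLASS sample dataset and a hypothetical lookup table work.sexlabels created only for illustration:

```sas
/* Hypothetical lookup table: one row per Sex, so it is unique on the BY variable */
data work.sexlabels;
  length Sex $1 Label $6;
  input Sex $ Label $;
  datalines;
F Female
M Male
;

proc sort data=sashelp.class out=work.class;
  by Sex;
run;

/* Each lookup row is repeated across all matching rows of work.class */
data work.merged;
  merge work.class(in=InClass) work.sexlabels;
  by Sex;
  if InClass;   /* keep only rows that exist in the detail dataset */
run;
```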

Updating Datasets
The UPDATE statement only allows two datasets, a master dataset and a transaction dataset. The master dataset is specified first and the transaction dataset second, followed by a BY statement. As with MERGE, the two datasets are read simultaneously, updating observations from the master dataset with observations from the transaction dataset based on the lowest level groupings of the BY variables. When a transaction variable has a missing value, by default UPDATE does not overwrite the value in the master dataset, whereas the MERGE statement does.

Updating Datasets
Examples:
- Updating prices in an inventory
- Flattening a dataset
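The code for these examples is not part of the transcript. The sketch below shows one way each could be written; all dataset names, variable names, and data values are made up for illustration.

```sas
/* Updating prices: hypothetical master and transaction datasets, sorted by Item */
data work.master;
  input Item $ Price Qty;
  datalines;
A100 10.00 5
B200 20.00 8
C300 30.00 2
;

data work.trans;
  input Item $ Price Qty;
  datalines;
A100 12.50 .
C300 . 0
;

/* Missing transaction values do not overwrite the master:      */
/* A100 gets Price 12.50 and keeps Qty 5; C300 keeps Price 30   */
/* and gets Qty 0; B200 is unchanged.                           */
data work.updated;
  update work.master work.trans;
  by Item;
run;

/* Flattening: several partial rows per ID collapse into one row */
data work.visits;
  input ID Height Weight;
  datalines;
1 60 .
1 . 100
2 . 120
2 65 .
;

/* An empty copy of the data serves as the master, so every row   */
/* is applied as a transaction and one row per ID is written.     */
data work.flat;
  update work.visits(obs=0) work.visits;
  by ID;
run;
```

The flattening step relies on UPDATE accepting multiple transaction observations per BY group and writing a single observation once the group is complete.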

Do-Loop of Whitlock (DoW)
The SET statement may be wrapped inside a DO UNTIL loop with the BY statement controlling the loop.
DATA ...;
  ...;
  DO UNTIL (...);
    SET ...;
    BY ...;
    ...;
  END;
  ...;
RUN;

Do-Loop of Whitlock (DoW)
The DoW works with the natural execution of the DATA step by isolating what happens between two consecutive break events. Statements and functions are placed within the loop. Because the loop reads a whole BY group within a single DATA step iteration, the implicit action of the DATA step resets calculated values to missing after each BY group. In our example the break events are BY groups, but in other cases they could be anything that triggers the DO loop to stop.

Do-Loop of Whitlock (DoW)
Examples:
- Standard DATA step
- Whitlock/Dorfman DoW
- Sequential DoWs
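The example code is not included in the transcript. A minimal sketch of a single DoW loop that computes a per-group mean, assuming the SASHELP.CLASS sample dataset (which has no missing Height values); N, SumHeight, and MeanHeight are illustrative names:

```sas
proc sort data=sashelp.class out=work.class;
  by Sex;
run;

data work.means(keep=Sex N MeanHeight);
  /* N and SumHeight are not retained, so they are reset to missing at the */
  /* top of each DATA step iteration, which here means once per BY group.  */
  do until (last.Sex);
    set work.class;
    by Sex;
    N         = sum(N, 1);               /* SUM ignores the initial missing */
    SumHeight = sum(SumHeight, Height);
  end;
  MeanHeight = SumHeight / N;
  /* The implicit OUTPUT at the end of the step writes one row per Sex */
run;
```

Compared with a standard DATA step, no RETAIN statement, FIRST. reset, or explicit OUTPUT is needed, because the loop boundary coincides with the BY-group boundary.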

The BY Statement in SAS Procedures
Nearly all SAS PROCs that process datasets allow the BY statement. The syntax is the same as in the DATA step, except for the GROUPFORMAT option, which is only available in the DATA step.
- Procedures that produce printed output, such as PROC PRINT, format printed output into BY groups.
- Procedures that summarize datasets, like PROC FREQ or PROC SUMMARY, process the data in groups, sometimes as an alternative to other statements such as TABLES or CLASS.

The PRINT Procedure
PROC PRINT writes dataset values in columnar table form with the variable names or labels at the top of each column. The BY statement and the related PAGEBY and SUMBY statements can be used with PROC PRINT.

The PRINT Procedure
Examples:
- BY statement
- BY statement with ID statement
- PAGEBY statement
- SUMBY statement
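The example listings are not part of the transcript. The following sketch exercises the four statements named above, assuming the SASHELP.CLASS sample dataset; it is an illustration and not the presenters' code.

```sas
proc sort data=sashelp.class out=work.class;
  by Sex Age;
run;

/* BY with ID: listing the same variables in both statements */
/* gives a compact layout without separate BY lines          */
proc print data=work.class;
  by Sex Age;
  id Sex Age;
  var Name Height Weight;
run;

/* PAGEBY starts a new page when Sex changes;    */
/* SUMBY prints Weight totals when Sex changes   */
proc print data=work.class noobs;
  by Sex Age;
  pageby Sex;
  sumby Sex;
  sum Weight;
  var Name Height Weight;
run;
```

The PAGEBY and SUMBY variables must also appear in the BY statement.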

The FREQ Procedure
The FREQ procedure calculates frequencies and statistics on discrete variables. These can be printed or output. Levels of a tabulation are requested with a TABLES statement, or for sorted variables with a BY statement. PROC FREQ does not show rows or columns for missing categories of a variable in a BY group, but in the TABLES statement the row or column is zero filled. The BY and TABLES statements produce different statistics for tabulation levels with missing categories.

The FREQ Procedure
Examples:
- PROC FREQ with the TABLES statement
- PROC FREQ with the BY statement
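The example code is not reproduced in the transcript. A minimal sketch contrasting the two approaches, assuming the SASHELP.CLASS sample dataset:

```sas
proc sort data=sashelp.class out=work.class;
  by Sex;
run;

/* TABLES: one crosstabulation; an Age that occurs for only one Sex */
/* appears as a zero cell for the other                             */
proc freq data=work.class;
  tables Sex*Age / norow nocol nopercent;
run;

/* BY: a separate one-way table per Sex; an Age that does not */
/* occur within a Sex is simply not listed                    */
proc freq data=work.class;
  by Sex;
  tables Age;
run;
```

Percentages also differ between the two: with BY they are computed within each Sex, while the crosstabulation computes them over the whole table unless suppressed.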

The SUMMARY or MEANS Procedure
In PROC SUMMARY and PROC MEANS the BY statement is an alternative to the CLASS statement. All combinations of levels of CLASS variables are summarized. For three class variables A, B, and C, statistics are calculated for the overall data and all levels of A, B, C, A*B, A*C, B*C, and A*B*C. Sorted variables may alternatively be specified in a BY statement, but only combinations including that variable will be summarized. For example, if A is specified in the BY statement rather than the CLASS statement, then only statistics for A, A*B, A*C, and A*B*C are produced.

The SUMMARY or MEANS Procedure
Examples:
- PROC SUMMARY with the CLASS statement
- PROC SUMMARY with the BY statement
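The example code is not included in the transcript. A sketch of the CLASS and BY variants with two grouping variables, assuming the SASHELP.CLASS sample dataset; the output dataset names and MeanHeight are illustrative:

```sas
/* CLASS: no sort needed; output holds overall, Age, Sex, and Sex*Age rows */
/* (_TYPE_ = 0, 1, 2, 3 respectively)                                      */
proc summary data=sashelp.class;
  class Sex Age;
  var Height;
  output out=work.by_class mean=MeanHeight;
run;

/* BY: input must be sorted (or indexed) on Sex; output holds only */
/* Sex rows (_TYPE_ = 0) and Sex*Age rows (_TYPE_ = 1)             */
proc sort data=sashelp.class out=work.class;
  by Sex;
run;

proc summary data=work.class;
  by Sex;
  class Age;
  var Height;
  output out=work.by_by mean=MeanHeight;
run;
```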

The SQL Procedure
PROC SQL performs actions similar to both the DATA step and summarizing procedures such as SUMMARY, TABULATE, and UNIVARIATE. PROC SQL has its own syntax conforming to the SQL programming language. The BY statement is replaced in PROC SQL by the GROUP BY clause. If the data are not sorted, PROC SQL will sort them internally as needed by the GROUP BY clause.

The SQL Procedure
Example:
- Aggregating grouped data with the PROC SQL GROUP BY clause
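The example query is not part of the transcript. A minimal sketch of grouped aggregation, assuming the SASHELP.CLASS sample dataset; the table name work.sql_summary and the column aliases are illustrative:

```sas
/* No prior PROC SORT is required: PROC SQL orders the data internally if needed */
proc sql;
  create table work.sql_summary as
  select Sex,
         count(*)     as N,
         mean(Height) as MeanHeight
  from sashelp.class
  group by Sex;
quit;
```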

Thanks to SVSUG Chair Andrew Karp

Contact Information
Your comments and questions are valued and encouraged.
Paul Choate, California Developmental Services
Phone: (916)