Download presentation
Presentation is loading. Please wait.
Published byJuliet Stafford Modified over 9 years ago
1
Chapter 9 Cleansing and Augmenting the Data Xiaogang Su Department of Statistics University of Central Florida
2
Section 9.1 Cleansing and Augmenting the Data Objectives Modify the data using the functions —UPCASE —COMPBL —TRANWRD.
3
Scenario The secretary at Statistics Department maintains some faculty information. The data set contains information about all faculty in Fall 2001....
4
UPCASE Function The UPCASE function converts all of the letters in the data value into uppercase. var=UPCASE(var); country=upcase(country); COUNTRY france BEFORE COUNTRY FRANCE AFTER Example:...
5
COMPBL Function The COMPBL function compresses multiple consecutive blanks in a data value into one blank. Since the length of a variable is set at compilation, the resulting data value is padded with blanks. Example: var=COMPBL(var); name=compbl(name); NAME DE PABLOS BEFORE NAME DE PABLOS AFTER...
6
TRANWRD Function The TRANWRD function replaces all occurrences of a pattern of characters in a data value with another pattern of characters. Example: var=TRANWRD(var, target, replacement); name=tranwrd(name,’Miss’, ’Ms’); NAME Miss. Joy Ho BEFORE Ms. Joy Ho AFTER NAME...
7
Cleansing Existing Data This demonstration illustrates changing Country to be always uppercase for USA and only the first letter be upper case for other counties removing multiple blanks from a last name Program: pg2-ch9-ex01.sas
8
Section 9.2 Creating New Character Variables Objectives Modify character data values. Specifically, use the SCAN function TRIM function concatenation operator SUBSTR function...
9
ccutchin@mail.ucf.edu lguo@mail.ucf.edu hoffman@pegasus.cc.ucf.edu morij@pegasus.cc.ucf.edu Scenario University has two email systems “pegasus” and “mail”. The e-mail address structures are in the form of Lastname@pegasus.cc.ucf.edu Or Lastname@mail.ucf.edu...
10
Refresher Quiz! 1. What are the two phases of the DATA step? 2. What are the three required attributes of all variables? 3. In what phase are these attributes assigned?...
11
Refresher Quiz! 1. What are the two phases of the DATA step? Compilation and Execution 2. What are the three required attributes of all variables? Name, Type, and Length 3. In what phase are these attributes assigned? Compilation...
12
SUBSTR Function The SUBSTR function extracts a portion of the character data value based on how many characters are designated for retrieval. var1=SUBSTR(var, start, ); NAME Dorothy E name1 = substr(name,1,3); NAME1 Dor...
13
Retrieve First Initial The first step is to retrieve the initial letter of an employee’s first name. first_initial = substr(emp_firstname,1,1); EMP_FIRSTNAME Dorothy E FIRST_INITIAL D...
14
Retrieve Middle Initial The second step is to retrieve the middle initial. Problems: Not all middle initials are in the same location, so you cannot use the SUBSTR function. Not all employees have a middle initial. lastname firstname email MILLS DOROTHY E DEMILLS@IA.COM BOWER EILEEN A. EABOWER@IA.COM READING TONY R. TRREADING@IA.COM JUDD CAROL A. CAJUDD@IA.COM ANDERSON CHRISTOPHER CANDERSON@IA.COM...
15
SCAN Function The SCAN function extracts a portion of the character data value based on what word-number to retrieve. Example: var1=SCAN(var, word-number, ); NAME Dorothy E name1=scan(name,2,' '); NAME1 E...
16
Retrieve Middle Initial middle_initial=substr(scan(emp_firstname,2,' '),1,1); Example: emp_firstname = Dorothy E Second word = 'E'...
17
Retrieve Middle Initial middle_initial=substr(scan(emp_firstname,2,' '),1,1); Example: emp_firstname = Dorothy E 1 character = E Second word = 'E'
18
Retrieve Middle Initial middle_initial=substr(scan(emp_firstname,2,' '),1,1); Example: emp_firstname = Eileen Alma Second word = 'Alma'...
19
Retrieve Middle Initial middle_initial=substr(scan(emp_firstname,2,' '),1,1); Example: emp_firstname = Eileen Alma 1 character = A Second word = 'Alma'...
20
Retrieve Middle Initial middle_initial=substr(scan(emp_firstname,2,' '),1,1); Example: emp_firstname = Hanna Second word = missing...
21
When you put the e-mail address together, separate assignments must be made for those with middle initials and for those without. Retrieve Middle Initial middle_initial=substr(scan(emp_firstname,2,' '),1,1); Example: emp_firstname = Hanna 1 character = missing...
22
Concatenation Operator The third step is to put the first initial, middle initial, and last name together. The concatenation operator joins character data values together. var2 = var !! var1;
23
Concatenation Operator Example: newname=name !! name1; Compilation NAME1 $ 6 NEWNAME $ 15 + NAME $ 9...
24
Concatenation Operator Example: newname=name !! name1; DorothyE 2 spaces NEWNAME $ 15 Dorothy E Execution NAME $ 9 NAME1 $ 6 +...
25
TRIM Function The TRIM function removes trailing blanks from a character data value during execution. var2=TRIM(var) !! var1;
26
TRIM Function Example: newname=trim(name) !! name1; NEWNAME $ 15 NAME1 $ 6 NAME $ 9 Compilation +...
27
TRIM Function Example: newname=trim(name) !! name1; 0 spaces 7 spaces NEWNAME $ 15 DorothyE DorothyE Execution NAME $ 9 NAME1 $ 6 +...
28
TRIM Function Example: newname=trim(name) !! ‘ ‘ !! name1; 1 space 7 spaces NEWNAME $ 16 Dorothy E DorothyE Execution NAME $ 9 NAME1 $ 6 ++ space...
29
Pegasus, Mail, or Other email system This demonstration illustrates putting all the pieces together. Program: pg2-ch9-ex02.sas
30
Section 9.3 Creating New Numeric Variables Objectives Modify numeric data values. Specifically, use the SUM FunctionROUND FunctionINT FunctionMEAN Function
31
Scenario International Airlines will provide a retroactive, prorated bonus to employees for their years of service up to and including a maximum of ten years of past company service. sum_ avg_ EMP_ID bonus bonus E00002 18956.98 2708 E00003 60211.06 12042 E00004 4218.99 4219 E00005 17142.96 1905 E00007 20361.21 2909 The variable names that contain the bonuses are BONUS1, BONUS2, and so on.
32
Implied Variable Lists implied variable list Because the variable names have the same prefix word followed by a numbered suffix, you can use an implied variable list to refer to a group of variables. The syntax may vary slightly based on the statement in which the implied variable list is used, but the basic syntax is varA-varN where A is a starting number, and N is an ending number. The numbers must be consecutive.
33
The SUM Function The SUM function adds variable values together ignores missing values. var=SUM(var1, var2, …, varN); var=sum(var1, var2, var3); Example: VAR 18 VAR1 12 VAR2. VAR3 6...
34
The SUM Function When you use an implied variable list, use the keyword OF in front of the first variable name to prevent subtraction from occurring. Example: var=SUM(OF var1-varN); var=sum(of var1-var3); VAR1 12 VAR2. VAR3 6 VAR 18
35
The MEAN Function The MEAN function returns the arithmetic mean (average) and ignores missing values. Example: var=MEAN(var1, var2,…,varN); var=mean(of var1-var3); VAR 9 VAR1 12 VAR2. VAR3 6...
36
The ROUND Function The ROUND function returns a value rounded to the nearest round-off unit. If round-off unit is not provided, the variable is rounded to the nearest integer. Any number or fractional value can be used as a round-off unit. var=ROUND(var1, );
37
The ROUND Function Examples: VAR1A 12 VAR2A 43 VAR3A 6.5 VAR4A -6.5 VAR1 12.12 VAR2 42.54 VAR3 6.472 VAR4 -6.47 var1a = round(var1); var2a = round(var2); var3a = round(var3,.1); var4a = round(var4,.1);
38
The INT Function The INT function returns the integer portion of an argument. var=INT(var1);
39
The INT Function var1a = int(var1); var2a = int(var2); var3a = int(var3); var4a = int(var4); Examples: VAR1 12.12 VAR2 42.54 VAR3 6.472 VAR4 -6.47 VAR1A 12 VAR2A 42 VAR3A 6 VAR4A -6
40
Section 9.4 Creating New Numerical Variables Based on Dates Objectives Modify numeric date values. Specifically, use SAS time and datetime constants the INTNX function the TODAY function the YEAR function...
41
Scenario New employees are eligible for benefits on the first of the month following their start dates. benefit_ emp_hire_ benefit_ Obs date date LASTNAME FIRSTNAME days 1 05/01/1999 01APR1999 ANDERSON SANDRA 334 2 06/01/1999 10MAY1999 BARBOUR DAVID 303 3 02/01/2000 01JAN2000 BELL MICHAEL 58 4 08/01/1999 18JUL1999 BENTZ ALICE 242 5 02/01/2000 03JAN2000 BOLCH AMY 58
42
Refresher Quiz! 1.What date is the starting point from which all SAS dates are based? 2.What option takes effect if only a two-digit year is specified? 3.How do you specify a SAS date constant for September 12, 2008?...
43
Refresher Quiz! 1.What date is the starting point from which all SAS dates are based? 2.What option takes effect if only a two-digit year is specified? January 1, 1960 YEARCUTOFF= 3.How do you specify a SAS date constant for September 12, 2008? '12sep2008'd...
44
SAS Times Just as SAS has a starting point for dates, it also has a starting point for times. Time is measured as the number of seconds since midnight. 'hh:mm 'T
45
SAS Datetimes SAS datetimes are a combination of dates and times, and are measured as the number of seconds since January 1, 1960. 'ddmmmyyyy:hh:mm 'DT
46
The INTNX Function The INTNX function advances a date, time, or datetime value by a given interval, and returns a date, time, or datetime value. Example: var1=intnx('year',var,1); VAR 0 SAS date for 01JAN1960 VAR1 366 SAS date for 01JAN1961 var=INTNX('interval',start-from,increment);...
47
The TODAY Function The TODAY function returns the current date as a SAS date. Example: Assume today’s date is September 12, 2008. var1=today(); VAR1 17787 SAS date for 12SEP2008 var=TODAY();...
48
Other DATE Functions There are other functions that return values from SAS dates. var=YEAR(SAS-date); var=MONTH(SAS-date); var=DAY(SAS-date);
49
Other DATE Functions Example: Assume today’s date is September 12, 2008. THISYEAR=YEAR(TODAY()); THISDAY 12 THISMONTH=MONTH(TODAY()); THISDAY=DAY(TODAY()); THISMONTH 9 THISYEAR 2008...
50
Calculating Eligibility Date This demonstration illustrates putting all the pieces together. Program: pg2-ch9-ex03.sas
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.