Published by Branden Hodge. Modified over 9 years ago.
EPIB 698D Lecture 2 Raul Cruz Spring 2013
SAS functions
SAS has over 400 functions, with the following general form:
    Function-name(argument, argument, ...)
All functions must have parentheses, even if they do not require any arguments.
Example:
    X = Int(log(10));
    Mean_score = mean(score1, score2, score3);
The MEAN function returns the mean of the non-missing arguments. This differs from adding the arguments and dividing by their number, which returns a missing value if any argument is missing.

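The difference between MEAN and a hand-written average can be checked with a one-observation data step. This is a minimal sketch; the data set name mean_demo and the variables mean_fn and mean_sum are made up for illustration.

```sas
data mean_demo;
    score1 = 80;
    score2 = .;                                   /* missing score */
    score3 = 90;
    mean_fn  = mean(score1, score2, score3);      /* mean of 80 and 90 = 85 */
    mean_sum = (score1 + score2 + score3) / 3;    /* missing, because score2 is missing */
run;

proc print data=mean_demo;
run;
```

SAS also writes a note to the log saying that missing values were generated when it evaluates the arithmetic version.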
Common Functions And Operators
Functions:
    ABS: absolute value
    EXP: exponential
    LOG: natural logarithm
    MAX and MIN: maximum and minimum
    SQRT: square root
    SUM: sum of variables
Example: SUM(of x1-x10, x21)
Arithmetic operators: +, -, *, /, ** (not ^)

More SAS functions
Function   Example                                   Result
Max        Y = Max(1, 3, 5);                         Y = 5
Round      Y = Round(1.236, 0.01);                   Y = 1.24
Sum        Y = Sum(1, 3, 5);                         Y = 9
Length     a = 'my cat'; Y = Length(a);              Y = 6
Trim       a = 'my '; b = 'cat'; Y = trim(a)||b;     Y = 'mycat'

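The second argument of ROUND is the rounding unit, not the number of decimal places, so it is worth comparing a few units side by side. A small sketch; the data set and variable names are made up for illustration.

```sas
data round_demo;
    to_hundredth = round(1.236, 0.01);   /* nearest multiple of 0.01: 1.24 */
    to_two       = round(1.236, 2);      /* nearest multiple of 2: 2 */
    to_integer   = round(1.236);         /* default unit is 1: 1 */
run;

proc print data=round_demo;
run;
```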
Using IF-THEN statement
The IF-THEN statement is used for conditional processing.
Example: you want to derive mean test scores for female students but not for male students. Here we derive the mean conditional on gender = 'F'.
Syntax:
    If condition then action;
Eg:
    If gender = 'F' then mean_score = mean(scr1, scr2);

Using IF-THEN statement
List of logical comparison operators:
Logical comparison          Mnemonic   Symbol
Equal to                    EQ         =
Not equal to                NE         ^= or ~=
Less than                   LT         <
Less than or equal to       LE         <=
Greater than                GT         >
Greater than or equal to    GE         >=
Equal to one in a list      IN
Note: Missing numeric values are treated as smaller than any number you can reference on your computer.

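As a quick illustration of the mnemonic operators and IN, here is a minimal sketch. The data set in_demo and the flag variables TopGrade and NotMale are hypothetical names; the three rows are taken from the quiz data used in this lecture.

```sas
data in_demo;
    input Gender $ Quiz $;
    /* IN is true when Quiz matches any value in the list */
    if Quiz in ('A', 'A+') then TopGrade = 1;
    /* NE is the mnemonic form of ^= */
    if Gender ne 'M' then NotMale = 1;
    datalines;
M B-
F A
F A+
;
run;

proc print data=in_demo;
run;
```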
Using IF-THEN statement
Example: We have data containing the following information on subjects:
Age  Gender  Midterm  Quiz  FinalExam
21   M       80       B-    82
20   F       90       A     93
35   M       87       B+    85
48   F       80       C     76
59   F       95       A+    97
15   M       88       C     93
Task: group the students based on their age (<20, 20-39, 40-59, >=60)

data conditional;
    input Age Gender $ Midterm Quiz $2. FinalExam;
    datalines;
21 M 80 B- 82
20 F 90 A 93
35 M 87 B+ 85
48 F 80 C 76
59 F 95 A+ 97
15 M 88 C 93
;
data new1;
    set conditional;
    if Age < 20 then AgeGroup = 1;
    if 20 <= Age < 40 then AgeGroup = 2;
    if 40 <= Age < 60 then AgeGroup = 3;
    if Age >= 60 then AgeGroup = 4;
run;

Multiple conditions with AND and OR
Syntax:
    IF condition1 and condition2 then action;
Eg:
    If age < 40 and gender = 'F' then group = 1;
    If age < 40 or gender = 'F' then group = 2;

IF-THEN statement, multiple conditions
Example: We have data containing the following information on subjects:
Age  Gender  Midterm  Quiz  FinalExam
21   M       80       B-    82
20   F       90       A     93
35   M       87       B+    85
48   F       80       C     76
59   F       95       A+    97
15   M       88       C     93
Task: group the students based on their age (<40, >=40) and gender

data new1;
    set conditional;
    if age < 40 and gender = 'F' then group = 1;
    if age >= 40 and gender = 'F' then group = 2;
    if age < 40 and gender = 'M' then group = 3;
    if age >= 40 and gender = 'M' then group = 4;
run;

Note: Missing numeric values are treated as smaller than any number you can reference on your computer. A condition such as Age < 20 is therefore true when Age is missing.
Example: group age into age groups when some ages are missing
21 M 80 B- 82
20 F 90 A 93
. M 87 B+ 85
48 F 80 C 76
59 F 95 A+ 97
. M 88 C 93

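One common way to keep missing ages out of the youngest group is to test for missing values first. The sketch below assumes that approach; it uses the MISSING function, a cut-down version of the data above (Age and Gender only), and made-up data set names ages and grouped.

```sas
data ages;
    input Age Gender $;
    datalines;
21 M
. M
48 F
;
run;

data grouped;
    set ages;
    /* Handle missing ages first so they do not satisfy Age < 20 */
    if missing(Age) then AgeGroup = .;
    else if Age < 20 then AgeGroup = 1;
    else if Age < 40 then AgeGroup = 2;
    else if Age < 60 then AgeGroup = 3;
    else AgeGroup = 4;
run;

proc print data=grouped;
run;
```

Without the first test, the observation with the missing age would be assigned AgeGroup = 1.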
IF-THEN statement, with multiple actions
Example: We have data containing the following information on subjects:
Age  Gender  Midterm  Quiz  FinalExam
21   M       80       B-    82
20   F       90       A     93
35   M       87       B+    85
48   F       80       C     76
59   F       95       A+    97
15   M       88       C     93
Task: group the students based on their age, and assign a test date based on the age group

Multiple actions with DO, END
Syntax:
    IF condition then do;
        Action1;
        Action2;
    End;
Eg:
    If age <= 20 then do;
        group = 1;
        exam_date = "Monday";
    End;

IF-THEN/ELSE statement
Syntax:
    IF condition1 then action1;
    Else if condition2 then action2;
    Else if condition3 then action3;
The IF-THEN/ELSE statement has two advantages over a series of IF-THEN statements:
(1) It is more efficient and uses less computing time, because SAS skips the remaining ELSE branches once a condition is true.
(2) The ELSE logic ensures that your groups are mutually exclusive, so that you do not put one observation into more than one group.

IF-THEN/ELSE statement
data new1;
    set conditional;
    if Age < 20 then AgeGroup = 1;
    else if Age >= 20 and Age < 40 then AgeGroup = 2;
    else if Age >= 40 and Age < 60 then AgeGroup = 3;
    else if Age >= 60 then AgeGroup = 4;
run;

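Subsetting your data
You can subset your data using an IF statement in a DATA step.
Example (keep only the observations with gender = 'F'):
    Data new1;
        Set new;
        If gender = 'F';
    run;
Equivalently, delete the observations you do not want:
    Data new1;
        Set new;
        If gender ^= 'F' then delete;
    run;
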
Stacking data sets using the SET statement
With more than one data set, the SET statement stacks the data sets one on top of the other.
Syntax:
    DATA new-data-set;
        SET data-set-1 data-set-2 ... data-set-n;
The number of observations in the new data set equals the sum of the numbers of observations in the old data sets.
The order of observations is determined by the order of the list of old data sets.
If one of the data sets has a variable not contained in the other data sets, then observations from the other data sets will have missing values for that variable.

Stacking data sets using the SET statement
Example: Here are data on visitors to a park. There are two entrances: the south entrance and the north entrance. The data file for the south entrance has an S for south, followed by the customers' pass numbers, the size of their parties, and their ages. The data file for the north entrance has an N for north, the same data as the south entrance, plus one more variable for the parking lot.
/* North.dat */
N 21 5 41 1
N 87 4 33 3
N 65 2 67 1
N 66 2 7 1
/* South.dat */
S 43 3 27
S 44 3 24
S 45 3 2

DATA southentrance;
    INPUT Entrance $ PassNumber PartySize Age;
    cards;
S 43 3 27
S 44 3 24
S 45 3 2
;
run;
DATA northentrance;
    INPUT Entrance $ PassNumber PartySize Age Lot;
    cards;
N 21 5 41 1
N 87 4 33 3
N 65 2 67 1
N 66 2 7 1
;
run;
DATA both;
    SET southentrance northentrance;
RUN;

Combining data sets with one-to-many match
One-to-many match: matching one observation from one data set with more than one observation in another data set.
The statements for a one-to-many match are the same as for a one-to-one match:
    DATA new-data-set;
        MERGE data-set-1 data-set-2;
        BY variable-list;
The data sets must be sorted first by the BY variables.
If the two data sets have variables with the same names, besides the BY variables, the variables from the second data set will overwrite any variables with the same name in the first data set.

Example: Shoes data
The shoe store is putting all its shoes on sale. They have two data files: one contains information about each type of shoe, and one contains discount information. We want to find the new price of the shoes.
Shoe data:
Max Flight        running   142.99
Zip Fit Leather   walking    83.99
Zoom Airborne     running   112.99
Light Step        walking    73.99
Max Step Woven    walking    75.99
Zip Sneak         c-train    92.99
Discount data:
c-train   .25
running   .30
walking   .20

DATA regular;
    INFILE datalines dsd;
    length style $15;
    INPUT Style $ ExerciseType $ RegularPrice @@;
    datalines;
Max Flight, running, 142.99, …
;
PROC SORT DATA = regular;
    BY ExerciseType;
DATA discount;
    INPUT ExerciseType $ Adjustment @@;
    cards;
c-train .25 …
;
DATA prices;
    MERGE regular discount;
    BY ExerciseType;
    NewPrice = ROUND(RegularPrice - (RegularPrice * Adjustment), .01);
RUN;

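For readers who want to run the one-to-many merge end to end, here is a sketch that types in the shoe and discount data in full, one record per line. It is a variation on the program above, not the same program: it drops the trailing @@, writes the data without blanks after the commas, and adds a LENGTH for ExerciseType so the BY variable has the same length in both data sets (an assumption made here for safety).

```sas
data regular;
    infile datalines dsd;
    length Style $15 ExerciseType $8;
    input Style $ ExerciseType $ RegularPrice;
    datalines;
Max Flight,running,142.99
Zip Fit Leather,walking,83.99
Zoom Airborne,running,112.99
Light Step,walking,73.99
Max Step Woven,walking,75.99
Zip Sneak,c-train,92.99
;
run;

proc sort data=regular;
    by ExerciseType;
run;

data discount;            /* already in ExerciseType order */
    length ExerciseType $8;
    input ExerciseType $ Adjustment;
    datalines;
c-train .25
running .30
walking .20
;
run;

data prices;
    merge regular discount;   /* each discount row matches many shoes */
    by ExerciseType;
    NewPrice = round(RegularPrice - RegularPrice * Adjustment, .01);
run;

proc print data=prices;
run;
```

For example, Max Flight is a running shoe, so its new price is 142.99 - 142.99 * .30 = 100.093, which rounds to 100.09.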
Simplifying programs with Arrays
A SAS array is a collection of elements (usually SAS variables) that lets you write SAS statements referencing the whole group of variables.
Arrays are defined using the ARRAY statement:
    ARRAY name (n) variable-list;
name: the name you give to the array
n: the number of variables in the array
Eg:
    ARRAY store (4) macys sears target costco;
store(1) is the variable macys
store(2) is the variable sears

Simplifying programs with Arrays
A radio station is conducting a survey asking people to rate 10 songs. The rating is on a scale of 1 to 5, with 1 = do not like the song and 5 = like the song. If the listener does not want to rate a song, he enters a 9 to indicate a missing value.
Here are the data, with location, listener's age, and ratings for the 10 songs:
Albany   54 4 3 5 9 9 2 1 4 4 9
Richmond 33 5 2 4 3 9 2 9 3 3 3
Oakland  27 1 3 2 9 9 9 3 4 2 3
Richmond 41 4 3 5 5 5 2 9 4 5 5
Berkeley 18 3 4 9 1 4 9 3 9 3 2
We want to change each 9 to a missing value (.)

Simplifying programs with Arrays
DATA songs;
    INFILE 'E:\radio.txt';
    INPUT City $ 1-15 Age domk wj hwow simbh kt aomm libm tr filp ttr;
    ARRAY song (10) domk wj hwow simbh kt aomm libm tr filp ttr;
    DO i = 1 TO 10;
        IF song(i) = 9 THEN song(i) = .;
    END;
run;

Using shortcuts for lists of variable names
When writing SAS programs, we often need to write a list of variable names. When you have data with many variables, a shortcut for lists of variable names is helpful.
Numbered range list: variables that start with the same characters and end with consecutive numbers can be part of a numbered range list.
Eg: these two statements are equivalent:
    INPUT cat8 cat9 cat10 cat11;
    INPUT cat8 - cat11;

Using shortcuts for lists of variable names
Name range list: a name range list depends on the internal order, or position, of the variables in a SAS data set. This is determined by the order in which the variables appear in the DATA step.
Eg:
    Data new;
        Input x1 x2 y2 y3;
    Run;
Then the internal order is: x1 x2 y2 y3
The shortcut for this variable list uses a double hyphen: x1--y3
PROC CONTENTS with the POSITION option can be used to find out the internal order.

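The two kinds of range list can be compared in a short data step. This is a sketch: total_all and total_x are made-up variable names, and it uses the VARNUM option of PROC CONTENTS (rather than POSITION) to list the variables in creation order.

```sas
data new;
    input x1 x2 y2 y3;
    total_all = sum(of x1--y3);   /* name range: x1 x2 y2 y3, so 1+2+3+4 = 10 */
    total_x   = sum(of x1-x2);    /* numbered range: x1 x2, so 1+2 = 3 */
    datalines;
1 2 3 4
;
run;

/* VARNUM lists the variables in their internal (creation) order */
proc contents data=new varnum;
run;
```

Note that x1-y3 with a single hyphen is not a valid numbered range, because the two names do not share the same prefix.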
Using shortcuts for lists of variable names
DATA songs;
    INFILE 'E:\radio.txt';
    INPUT City $ 1-15 Age domk wj hwow simbh kt aomm libm tr filp ttr;
    ARRAY new (10) Song1 - Song10;
    ARRAY old (10) domk -- ttr;
    DO i = 1 TO 10;
        IF old(i) = 9 THEN new(i) = .;
        ELSE new(i) = old(i);
    END;
    AvgScore = MEAN(OF Song1 - Song10);
run;

Sorting, Printing and Summarizing Your Data
SAS procedures (or PROCs) perform a specific analysis or function and produce results or reports.
Eg:
    Proc Print data=new;
    run;
All procedures have required statements, and most have optional statements.
All procedures start with the keyword PROC, followed by the name of the procedure, such as PRINT or CONTENTS.
Options, if there are any, follow the procedure name.
The DATA=data_name option tells SAS which data set to use as input for the procedure.
NOTE: if you skip it, SAS will use the most recently created data set, which is not necessarily the same as the most recently used data set.

BY statement
The BY statement is required for only one procedure, PROC SORT:
    PROC Sort data=new;
        By gender;
    Run;
For all other procedures, BY is an optional statement. It tells SAS to perform the analysis separately for each level of the variable after the BY statement, instead of treating all subjects as one group:
    Proc Print data=new;
        By gender;
    Run;
All procedures except PROC SORT assume your data are already sorted by the variables in your BY statement.

PROC Sort
Syntax:
    Proc Sort data=input_data_name out=out_data_name;
        By variable-1 ... variable-n;
The variables in the BY statement are called BY variables.
With one BY variable, SAS sorts the data based on the values of that variable.
With more than one BY variable, SAS sorts observations by the first variable, then by the second variable within the categories of the first variable, and so on.
The DATA= and OUT= options specify the input and output data sets. Without the DATA= option, SAS uses the most recently created data set. Without the OUT= option, SAS replaces the original data set with the newly sorted version.

PROC Sort
By default, SAS sorts data in ascending order, from the lowest to the highest value or from A to Z.
To reverse the order, add the keyword DESCENDING before each variable that you want sorted from highest to lowest, or from Z to A.
The NODUPKEY option tells SAS to eliminate any duplicate observations that have the same values for the BY variables.

PROC Sort
Example: The file sealife.txt contains information on the average length in feet of selected whales and sharks. We want to sort the data by family and length.
Name      Family  Length
beluga    whale   15
whale     shark   40
basking   shark   30
gray      whale   50
mako      shark   12
sperm     whale   60
dwarf     shark   .5
whale     shark   40
humpback  .       50
blue      whale   100
killer    whale   30

PROC Sort
DATA marine;
    INFILE 'E:\Sealife.txt';
    INPUT Name $ Family $ Length;
run;
* Sort the data;
PROC SORT DATA = marine OUT = seasort NODUPKEY;
    BY Family DESCENDING Length;
run;

Summarizing your data with PROC MEANS
The MEANS procedure provides simple statistics on numeric variables.
Syntax:
    Proc means options;
Simple statistics that can be produced by PROC MEANS:
    MAX: the maximum value
    MIN: the minimum value
    MEAN: the mean
    N: number of non-missing values
    STDDEV: the standard deviation
    NMISS: number of missing values
    RANGE: the range of the data
    SUM: the sum
    MEDIAN: the median
If no statistics are requested, PROC MEANS reports N, MEAN, STDDEV, MIN, and MAX by default.

Proc means
Statements used with PROC MEANS:
    BY variable-list: performs the analysis for each level of the variables in the list. The data need to be sorted first.
    CLASS variable-list: performs the analysis for each level of the variables in the list. The data do not need to be sorted.
    VAR variable-list: specifies which variables to use in the analysis.

Proc means
A wholesale nursery is selling garden flowers, and they want to summarize their sales figures by month. The data are as follows:
ID      Date        Lily  SnapDragon  Marigold
756-01  05/04/2001  120   80          110
756-01  05/14/2001  130   90          120
834-01  05/12/2001  90    160         60
834-01  05/14/2001  80    60          70
901-02  05/18/2001  50    100         75
834-01  06/01/2001  80    60          100
756-01  06/11/2001  100   160         75
901-02  06/19/2001  60    60          60
756-01  06/25/2001  85    110         100

DATA sales;
    INFILE 'E:\Flowers.txt';
    INPUT CustomerID $ @9 SaleDate MMDDYY10. Lily SnapDragon Marigold;
    Month = MONTH(SaleDate);
PROC SORT DATA = sales;
    BY Month;
* Calculate means by Month for flower sales;
PROC MEANS DATA = sales;
    BY Month;
    VAR Lily SnapDragon Marigold;
    TITLE 'Summary of Flower Sales by Month';
RUN;

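The same monthly summary can also be requested with a CLASS statement, which does not need a prior PROC SORT. The sketch below reads the flower data in-line instead of from E:\Flowers.txt, and uses the colon modifier (:mmddyy10.) for the date; both choices are assumptions made so the example stands alone.

```sas
data sales;
    input CustomerID $ SaleDate :mmddyy10. Lily SnapDragon Marigold;
    Month = month(SaleDate);
    datalines;
756-01 05/04/2001 120 80 110
756-01 05/14/2001 130 90 120
834-01 05/12/2001 90 160 60
834-01 05/14/2001 80 60 70
901-02 05/18/2001 50 100 75
834-01 06/01/2001 80 60 100
756-01 06/11/2001 100 160 75
901-02 06/19/2001 60 60 60
756-01 06/25/2001 85 110 100
;
run;

/* CLASS groups by Month without sorting; N, MEAN and MAX are requested explicitly */
proc means data=sales n mean max;
    class Month;
    var Lily SnapDragon Marigold;
    title 'Summary of Flower Sales by Month';
run;
```

With CLASS the results for both months appear in one table, whereas BY produces a separate table for each month.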
Proc GCHART for bar charts
Example: A bar chart showing the distribution of blood types from the Blood data set
/* The blood.txt data contain information on 1000 subjects. The variables include: subject ID, gender, blood_type, age group, red blood cell count, white blood cell count, and cholesterol. */
DATA blood;
    INFILE 'C:\blood.txt';
    INPUT ID Sex $ BloodType $ AgeGroup $ RBC WBC Cholesterol;
run;
title "Distribution of Blood Types";
proc gchart data=blood;
    vbar BloodType;
run;

Proc GCHART for bar charts
VBAR: requests a vertical bar chart for the variable
Alternatives to VBAR are as follows:
    HBAR: horizontal bar chart
    VBAR3D: three-dimensional vertical bar chart
    HBAR3D: three-dimensional horizontal bar chart
    PIE: pie chart
    PIE3D: three-dimensional pie chart
    DONUT: donut chart

A Few Options
proc gchart data=blood;
    vbar bloodtype / space=0 type=percent;
run;
space=0: controls the spacing between bars
type=percent: changes the statistic from frequency to percent

Type option
    Type=freq: displays frequencies of a categorical variable
    Type=pct (percent): displays percents of a categorical variable
    Type=cfreq: displays cumulative frequencies of a categorical variable
    Type=cpct (cpercent): displays cumulative percents of a categorical variable

Basic Output
This value of 7,000 corresponds to a class ranging from 6500 to 7500 (with a frequency of about 350).
SAS computes the midpoint of each bar automatically. You can change this by supplying your own midpoints:
    vbar RBC / midpoints=4000 to 11000 by 1000;

Creating charts with values representing categories
SAS places the values of continuous variables into groups before generating a frequency bar chart.
If you want to treat the values as discrete categories, you can use the DISCRETE option.
Example: create a bar chart showing the frequencies of hospital visits by day of the week

libname d "C:\";
data day_of_week;
    set d.hosp;
    Day = weekday(AdmitDate);
run;
*Program Demonstrating the DISCRETE option of PROC GCHART;
title "Visits by Day of the Week";
proc gchart data=day_of_week;
    vbar Day / discrete;
run;

The Discrete Option
proc gchart data=day_of_week;
    vbar day / discrete;
run;
DISCRETE establishes each distinct value of the midpoint variable as a midpoint on the graph. If the variable is formatted, the formatted values are used to construct the midpoints.
If you use DISCRETE with a numeric variable, you should either:
1. Be sure it has only a few distinct values, or
2. Use a format to make categories for it.

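Because DISCRETE uses formatted values, a user-defined format can label the bars. The sketch below is an illustration only: the format name dayfmt and the five admission dates are hypothetical, standing in for the hosp data set. It relies on WEEKDAY returning 1 for Sunday through 7 for Saturday.

```sas
proc format;
    /* dayfmt is a hypothetical format name */
    value dayfmt 1 = 'Sun' 2 = 'Mon' 3 = 'Tue' 4 = 'Wed'
                 5 = 'Thu' 6 = 'Fri' 7 = 'Sat';
run;

data day_of_week;          /* hypothetical admission dates */
    input AdmitDate :mmddyy10.;
    Day = weekday(AdmitDate);
    datalines;
01/06/2013
01/07/2013
01/07/2013
01/09/2013
01/12/2013
;
run;

title "Visits by Day of the Week";
proc gchart data=day_of_week;
    vbar Day / discrete;
    format Day dayfmt.;    /* bars are labeled with the formatted values */
run;
quit;
```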
Summary Variables
To make a bar chart summarize the values of an analysis variable for each midpoint, use the sumvar= (and type=) option.
    sumvar=variable-name
    Type=mean: displays the mean of a continuous variable
    Type=sum: displays the total of a continuous variable (this is the default)

Creating bar charts representing sums
The GCHART procedure can be used to create bar charts where the height of the bars represents some statistic, such as the mean or sum, for each value of a classification variable.
Example: Bar chart showing the sum of TotalSales for each region of the country
title "Total Sales by Region";
proc gchart data=d.sales;
    vbar Region / sumvar=TotalSales type=sum;
    format TotalSales dollar8.;
run;

Creating bar charts representing means
proc gchart data=blood;
    vbar Sex / sumvar=cholesterol type=mean;
run;
quit;

GPLOT
The GPLOT procedure plots the values of two or more variables on a set of coordinate axes (X and Y). The procedure produces a variety of two-dimensional graphs, including:
    - simple scatter plots
    - overlay plots, in which multiple sets of data points display on one set of axes

Procedure Syntax: PROC GPLOT
PROC GPLOT;
    PLOT y*x;
run;
Example: plot of systolic blood pressure (SBP) by diastolic blood pressure (DBP)
title "Scatter Plot of SBP by DBP";
proc gplot data=d.clinic;
    plot SBP * DBP;
run;

Multiple plots can be made in 3 ways:
(1) proc gplot; plot y1*x y2*x / overlay; run;
    plots y1 versus x and y2 versus x using the same horizontal and vertical axes.
(2) proc gplot; plot y1*x; plot2 y2*x; run;
    plots y1 versus x and y2 versus x using different vertical axes. The second vertical axis appears on the right-hand side of the graph.
(3) proc gplot; plot y1*x=z; run;
    uses z as a classification variable and produces a single graph plotting y1 against x for each value of the variable z.

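The three forms can be tried on a small data set. This is a sketch with made-up values; the data set name growth and its contents are hypothetical, and only the PLOT and PLOT2 forms listed above are used.

```sas
data growth;               /* hypothetical data for illustration */
    input x y1 y2 z $;
    datalines;
1 2 10 a
2 4 20 a
3 5 35 b
4 7 50 b
;
run;

proc gplot data=growth;
    plot y1*x y2*x / overlay;   /* (1) both curves on shared axes */
run;
    plot y1*x;                  /* (2) y1 on the left axis ...     */
    plot2 y2*x;                 /*     ... and y2 on the right axis */
run;
    plot y1*x=z;                /* (3) one graph, points grouped by z */
run;
quit;
```

GPLOT supports RUN-group processing, so the three requests can share one PROC GPLOT step that ends with QUIT.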
*controlling the axis ranges;
title "Scatter Plot of SBP by DBP";
proc gplot data=d.clinic;
    plot SBP * DBP / haxis=70 to 120 by 5
                     vaxis=100 to 220 by 10;
run;