Presentation and Data: Short Courses > Intro to SAS > Download Data to Desktop



Introduction to SAS, Part 2
Mark Seiss, Dept. of Statistics
February 22, 2011

Reference Material
    The Little SAS Book - Delwiche and Slaughter
    SAS Programming I: Essentials
    SAS Programming II: Manipulating Data with the DATA Step
    Presentation and Data

Presentation Outline
Part 1
    1. Introduction to the SAS Environment
    2. Working With SAS Data Sets
Part 2
    1. Summary Procedures
    2. Basic Statistical Analysis Procedures

Presentation Outline Questions/Comments Individual Goals/Interests

Summary Procedures
    1. Print Procedure
    2. Plot Procedure
    3. Univariate Procedure
    4. Means Procedure
    5. Freq Procedure

Print Procedure
PROC PRINT prints data to the output window.
By default, it prints all observations and variables in the SAS data set.
General form:
    PROC PRINT DATA=input_data_set options;
    RUN;
Some options:
    input_data_set (OBS=n) - specifies the number of observations to be printed in the output
    NOOBS - suppresses printing of the observation number
    LABEL - prints variable labels instead of variable names

Print Procedure: Optional SAS Statements
    BY variable1 variable2 variable3;
        Starts a new section of output for every new value of the BY variables
    ID variable1 variable2 variable3;
        Prints the ID variables on the left-hand side of the page and suppresses printing of the observation numbers
    SUM variable1 variable2 variable3;
        Prints the sum of the listed variables at the bottom of the output
    VAR variable1 variable2 variable3;
        Prints only the listed variables in the output
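The options and statements above can be combined in a single step. The following is a minimal sketch, assuming work.state_data contains the variables state, region, and expend (names taken from later slides); the label text is illustrative only.

```sas
/* Sketch only: assumes work.state_data has state, region, and expend. */
proc sort data=state_data;
   by region;                /* BY processing requires data sorted by region */
run;

proc print data=state_data (obs=10) label;
   by region;                /* new section of output for each region */
   id state;                 /* state on the left; observation numbers suppressed */
   var expend;               /* print only this variable */
   sum expend;               /* totals of expend in the output */
   label expend = 'Expenditure per pupil';   /* illustrative label text */
run;
```

Because an ID statement is used, NOOBS is not needed here; without ID, adding NOOBS to the PROC PRINT statement has the same effect on the observation column.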

Print Procedure Assignment Use PROC PRINT to print out the state variable separately for each region Note: All procedures for the remainder of the course will be run on the data set work.state_data.

Print Procedure Solution
    proc sort data=state_data;
       by region;
    run;
    proc print data=state_data;
       var state;
       by region;
    run;

Plot Procedure
Used to create basic scatter plots of the data.
Use PROC GPLOT or PROC SGPLOT for more sophisticated plots.
General form:
    PROC PLOT DATA=input_data_set;
       PLOT vertical_variable * horizontal_variable / options;
    RUN;
By default, SAS uses letters to mark points on plots: A for a single observation, B for two observations at the same point, etc.
To specify a different character to represent a point:
    PLOT vertical_variable * horizontal_variable = '*';

Plot Procedure
To use the values of a third variable to mark points:
    PLOT vertical_variable * horizontal_variable = third_variable;
To plot more than one variable on the vertical axis:
    PLOT vertical_variable1 * horizontal_variable = '2'
         vertical_variable2 * horizontal_variable = '1' / OVERLAY;
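As an illustration of the OVERLAY form, the sketch below plots two vertical variables against one horizontal variable. It assumes work.state_data contains math, verbal, and expend, as the later assignments suggest; the plotting characters 'M' and 'V' are arbitrary choices.

```sas
/* Sketch only: assumes work.state_data has math, verbal, and expend. */
proc plot data=state_data;
   plot math*expend='M'      /* Math scores marked with M */
        verbal*expend='V'    /* Verbal scores marked with V */
        / overlay;           /* draw both requests on the same axes */
run;
```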

Plot Procedure Assignment Use the PLOT PROCEDURE to plot SAT Verbal scores versus SAT Math Scores Use the value of the region variable to mark points

Plot Procedure Solution proc plot data=state_data; plot math*verbal=region; run;

Univariate Procedure
PROC UNIVARIATE is used to examine the distribution of data.
Produces summary statistics for a single variable, including mean, median, mode, standard deviation, skewness, kurtosis, quantiles, etc.
General form:
    PROC UNIVARIATE DATA=input_data_set options;
       VAR variable1 variable2 variable3;
    RUN;
If the VAR statement is not used, summary statistics are produced for all numeric variables in the input data set.

Univariate Procedure Options include: PLOT – produces Stem-and-leaf plot, Box plot, and Normal probability plot; NORMAL – produces tests of Normality

Univariate Procedure Assignment Use PROC UNIVARIATE to produce a normal probability plot and test the normality of the SAT Total variable and Expenditure variable

Univariate Procedure Solution proc univariate data=state_data normal plot; var expend total; run;

Means Procedure
Similar to the Univariate procedure.
General form:
    PROC MEANS DATA=input_data_set options;
       optional SAS statements;
    RUN;
With no options or optional SAS statements, the Means procedure prints the number of non-missing values, mean, standard deviation, minimum, and maximum for all numeric variables in the input data set.

Means Procedure Options: Statistics Available
Note: By default, confidence limits are computed at the 95% level (ALPHA=0.05). Use the ALPHA= option to specify a different alpha level.

Keyword          Statistic                      Keyword     Statistic
CLM              Two-Sided Confidence Limits    RANGE       Range
CSS              Corrected Sum of Squares       SKEWNESS    Skewness
CV               Coefficient of Variation       STDDEV      Standard Deviation
KURTOSIS         Kurtosis                       STDERR      Standard Error of Mean
LCLM             Lower Confidence Limit         SUM         Sum
MAX              Maximum Value                  SUMWGT      Sum of Weight Variables
MEAN             Mean                           UCLM        Upper Confidence Limit
MIN              Minimum Value                  USS         Uncorrected Sum of Squares
N                Number Non-missing Values      VAR         Variance
NMISS            Number Missing Values          PROBT       Probability for Student's t
MEDIAN (or P50)  Median                         T           Student's t
Q1 (P25)         25% Quantile                   Q3 (P75)    75% Quantile
P1               1% Quantile                    P5          5% Quantile
P10              10% Quantile                   P90         90% Quantile
P95              95% Quantile                   P99         99% Quantile

Means Procedure: Optional SAS Statements
    VAR variable1 variable2;
        Specifies the numeric variables for which statistics will be produced
    BY variable1 variable2;
        Calculates statistics for each combination of the BY variables
    OUTPUT OUT=output_data_set;
        Creates a data set with the default statistics
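A short sketch of how these statements fit together, assuming work.state_data contains region and expend. The output data set name region_stats and the variable names mean_expend and sd_expend are hypothetical, chosen only for illustration.

```sas
/* Sketch only: assumes work.state_data has region and expend. */
proc sort data=state_data;
   by region;                        /* BY groups require sorted data */
run;

proc means data=state_data n mean clm alpha=0.10;   /* 90% confidence limits */
   var expend;
   by region;
   output out=region_stats           /* hypothetical output data set name */
          mean=mean_expend           /* hypothetical variable names */
          std=sd_expend;
run;

proc print data=region_stats;        /* inspect the saved statistics */
run;
```

Naming statistics on the OUTPUT statement (MEAN=, STD=) writes one row per BY group with just those statistics, rather than the default set.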

Means Procedure Assignment Use PROC MEANS to calculate the mean and variance of the expenditure variable for each region

Means Procedure Solution
    proc sort data=state_data;
       by region;
    run;
    proc means data=state_data mean var;
       var expend;
       by region;
    run;

FREQ Procedure
PROC FREQ is used to generate frequency tables.
Its most common use is to create a table showing the distribution of categorical variables.
General form:
    PROC FREQ DATA=input_data_set;
       TABLE variable1*variable2*variable3 / options;
    RUN;
Options:
    LIST - prints cross tabulations in list format rather than as a grid
    MISSING - specifies that missing values should be included in the tabulations
    OUT=output_data_set - creates a data set containing the frequencies, in list format
    NOPRINT - suppresses printing in the output window
Use a BY statement to get percentages within each category of a variable.
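The TABLE options can be sketched as follows, assuming work.state_data contains region and the indicator upper_ind used in the later GENMOD assignment. The output data set name region_counts is hypothetical.

```sas
/* Sketch only: assumes work.state_data has region and upper_ind. */
proc freq data=state_data;
   table region*upper_ind
         / list                      /* one row per combination instead of a grid */
           missing                   /* treat missing values as a valid level */
           out=region_counts;        /* hypothetical data set holding the counts */
run;

proc print data=region_counts;       /* COUNT and PERCENT for each combination */
run;
```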

FREQ Procedure Assignment Use PROC FREQ to find the number of states within each region

FREQ Procedure Solution proc freq data=state_data; table region; run;

Summary Procedures Questions/Comments

Statistical Analysis Procedures
    1. Correlation - PROC CORR
    2. Regression - PROC REG
    3. Analysis of Variance - PROC ANOVA
    4. Chi-square Test of Association - PROC FREQ
    5. Generalized Linear Models - PROC GENMOD

CORR Procedure
PROC CORR is used to calculate the correlations between variables.
The correlation coefficient measures the linear relationship between two variables.
Values range from -1 to 1:
    Negative correlation - as one variable increases, the other decreases
    Positive correlation - as one variable increases, the other increases
    0 - no linear relationship between the two variables
    1 - perfect positive linear relationship
    -1 - perfect negative linear relationship
General form:
    PROC CORR DATA=input_data_set options;
       VAR variable1 variable2;
       WITH variable3;
    RUN;

CORR Procedure
If the VAR and WITH statements are not used, correlations are computed for all pairs of numeric variables.
Options include:
    SPEARMAN - computes Spearman's rank correlations
    KENDALL - computes Kendall's tau coefficients
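To show how VAR and WITH restrict the pairs that are computed, here is a minimal sketch assuming work.state_data contains total, expend, and salary. It requests both Pearson and Spearman coefficients so they can be compared side by side.

```sas
/* Sketch only: assumes work.state_data has total, expend, and salary. */
proc corr data=state_data pearson spearman;
   var total;                /* VAR variables form the columns of the table */
   with expend salary;       /* WITH variables form the rows */
run;
```

Only the two pairs (expend, total) and (salary, total) are reported, rather than the full matrix of all numeric variables.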

CORR Procedure
Question: What is the correlation between the SAT Total variable and the Expenditure variable? Is it significant? Based on previous exercises, which correlation coefficient should we use?
Assignment: Use PROC CORR to find the correlation between the SAT Total variable and the Expenditure variable.

CORR Procedure Solution
If the normality assumption is valid (Pearson correlation, the default):
    proc corr data=state_data;
       var total expend;
    run;
If the normality assumption is not valid:
    proc corr data=state_data spearman;
       var total expend;
    run;

REG Procedure
PROC REG is used to fit linear regression models by least squares estimation.
It is one of many SAS procedures that can perform regression analysis.
Only continuous independent variables are supported (use GENMOD for categorical variables).
General form:
    PROC REG DATA=input_data_set options;
       MODEL dependent = independent1 independent2 / options;
       optional SAS statements;
    RUN;
PROC REG statement options include:
    PCOMIT=m - performs principal component estimation with m principal components
    CORR - displays the correlation matrix for the variables in the model

REG Procedure
MODEL statement options include:
    SELECTION= - specifies that a model selection procedure be conducted: FORWARD, BACKWARD, or STEPWISE
    ADJRSQ - computes the adjusted R-square
    MSE - computes the mean square error
    COLLIN - performs collinearity analysis
    CLB - computes confidence limits for the parameter estimates
    ALPHA= - sets the significance level for confidence and prediction intervals and tests

REG Procedure Optional statements include PLOT Dependent*Independent1 – generates plot of data
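The PROC REG options, MODEL options, and PLOT statement can be combined as in the sketch below. It assumes work.state_data contains total, expend, and salary; the choice of two predictors and ALPHA=0.10 is arbitrary and only for illustration.

```sas
/* Sketch only: assumes work.state_data has total, expend, and salary. */
proc reg data=state_data corr;       /* CORR: print the correlation matrix */
   model total = expend salary
         / clb                       /* confidence limits for the estimates */
           collin                    /* collinearity diagnostics */
           alpha=0.10;               /* 90% limits instead of the default 95% */
   plot total*expend;                /* scatter plot of the data */
run;
quit;                                /* PROC REG is interactive; QUIT ends it */
```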

REG Procedure Assignment
Use PROC REG to generate a multiple linear regression model.
Dependent variable: SAT Total (total)
Use stepwise selection.
Possible independent variables:
    - Average pupil to teacher ratio (PT_ratio)
    - Current expenditure per pupil (expend)
    - Estimated annual salary of teachers (salary)
    - Percentage of eligible students taking the SAT (students)

REG Procedure Solution proc reg data=state_data; model total=pt_ratio expend salary students/selection=stepwise; run;

ANOVA Procedure
PROC ANOVA performs analysis of variance.
Designed for balanced data (use PROC GLM for unbalanced data).
Can handle nested and crossed effects and repeated measures.
General form:
    PROC ANOVA DATA=input_data_set options;
       CLASS independent1 independent2;
       MODEL dependent = independent1 independent2;
       optional SAS statements;
    RUN;
The CLASS statement must come before the MODEL statement; it defines the classification variables.

ANOVA Procedure
Useful PROC ANOVA statement option: OUTSTAT=output_data_set
    Generates an output data set that contains sums of squares, degrees of freedom, statistics, and p-values for each effect in the model
Useful optional statement: MEANS independent1 / options
    Used to perform multiple comparisons analysis
    Options can be set to:
        TUKEY - Tukey's studentized range test
        BON - Bonferroni t test
        T - pairwise t tests
        DUNCAN - Duncan's multiple-range test
        SCHEFFE - Scheffe's multiple comparison procedure
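A minimal sketch of OUTSTAT= together with a MEANS statement, assuming work.state_data contains math and region. The output data set name anova_stats is hypothetical, and Bonferroni is used here only to show an alternative to Tukey.

```sas
/* Sketch only: assumes work.state_data has math and region. */
proc anova data=state_data outstat=anova_stats;   /* hypothetical data set name */
   class region;                     /* CLASS must precede MODEL */
   model math = region;
   means region / bon;               /* Bonferroni-adjusted pairwise comparisons */
run;
quit;                                /* PROC ANOVA is interactive; QUIT ends it */

proc print data=anova_stats;         /* sums of squares, DF, F, and p-values */
run;
```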

ANOVA Procedure
Question: Are there significant differences between the Math SAT scores of students from different regions? If there are significant differences, which regions are different?
Assignment: Use PROC ANOVA to determine if there are significant differences in the Math SAT variable between regions. Perform multiple comparisons between regions using Tukey's adjustment.

ANOVA Procedure Solution proc anova data=state_data; class region; model math=region; means region/tukey; run;

FREQ Procedure
PROC FREQ can also be used to perform analysis with categorical data.
General form:
    PROC FREQ DATA=input_data_set;
       TABLE variable1*variable2 / options;
    RUN;
TABLE statement options include:
    AGREE - tests and measures of classification agreement, including McNemar's test, Bowker's test, Cochran's Q test, and kappa statistics
    CHISQ - chi-square test of homogeneity and measures of association
    MEASURES - measures of association, including Pearson and Spearman correlation, gamma, Kendall's tau, Stuart's tau, Somers' D, lambda, odds ratios, risk ratios, and confidence intervals
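For example, a chi-square test of association between two categorical variables might be requested as in this sketch, which assumes work.state_data contains region and the indicator upper_ind from the GENMOD assignment.

```sas
/* Sketch only: assumes work.state_data has region and upper_ind. */
proc freq data=state_data;
   table region*upper_ind
         / chisq                     /* chi-square tests of association */
           measures;                 /* gamma, Kendall's tau-b, Somers' D, etc. */
run;
```

With only one observation per state, some cells are likely to have small expected counts, so the chi-square results should be read with caution.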

GENMOD Procedure
PROC GENMOD is used to fit generalized linear models, in which the response is not necessarily normal.
Logistic and Poisson regression are examples of generalized linear models.
General form:
    PROC GENMOD DATA=input_data_set;
       CLASS independent1;
       MODEL dependent = independent1 independent2 / DIST= LINK= ;
    RUN;

GENMOD Procedure
    DIST= - specifies the distribution of the response variable
    LINK= - specifies the link function from the linear predictor to the mean of the response
Example - Logistic regression:
    DIST=binomial LINK=logit
Example - Poisson regression:
    DIST=poisson LINK=log
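The Poisson case can be sketched with a small made-up data set, since the course data has no count response. Everything in the DATA step below (the data set visits and the variables clinic, staff, and n_visits) is hypothetical and exists only to make the example self-contained.

```sas
/* Hypothetical example data: visit counts by clinic and staff size. */
data visits;
   input clinic $ staff n_visits;
   datalines;
A 2 11
A 3 15
B 2 9
B 4 21
C 3 14
C 5 26
;
run;

/* Poisson regression: count response, log link. */
proc genmod data=visits;
   class clinic;                     /* categorical predictor */
   model n_visits = clinic staff
         / dist=poisson link=log;
run;
```

The log link means each coefficient is interpreted on the log scale of the expected count; exponentiating a coefficient gives a multiplicative rate ratio.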

GENMOD Procedure
Question: How do we model the probability of having a high total SAT average based on other variables in the data set? Is the dependent variable normal, or does it have a different distribution? What link function would you specify?
Assignment: Use PROC GENMOD to perform logistic regression on the work.state_data data set.
Dependent variable: upper_ind
Independent variables:
    - Average pupil to teacher ratio (PT_ratio)
    - Current expenditure per pupil (expend)
    - Estimated annual salary of teachers (salary)
    - Percentage of eligible students taking the SAT (students)
    - Region (region)

GENMOD Procedure Solution
    proc genmod data=state_data descending;
       class region;
       model upper_ind = pt_ratio expend salary students region / dist=bin link=logit;
    run;

Statistical Analysis Procedures Questions/Comments

Attendee Questions If time permits