Statistical Analysis with R
  18 August
  Questionnaires
  Variables organization
  Descriptive analysis
  Graphs
  Statistical tests
R
  Statistical package
  4th-generation programming language, extensible through functions and extensions
  Environment for statistical computing and graphics
  Statistical and graphical techniques, extensible through packages
  Competitors: SPSS, Matlab
Variables
  Scale or numeric variables: time, age, weight, distance in kilometers, length, number of children, GDP
  Nominal or categorical variables: country of residence, sex, degree course
  Ordinal variables: education level, rankings, Likert scale
    in statistical analysis they are often treated as nominal or scale variables
  Questionnaire overview
Missing values
  NA means "not available": you insert it manually whenever a datum is missing
  NaN means "not a number": it appears whenever a calculation cannot be done for this datum
  They are skipped in statistical analyses (at the command line, functions such as mean() need the option na.rm=TRUE to skip them)
  Any math operation with them gives NA or NaN again
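A minimal sketch of how missing values behave at the command line. The vector age and its values are hypothetical, made up only for this illustration.

```r
# Hypothetical ages with one missing value
age <- c(25, 31, NA, 40)

mean(age)                 # NA: the missing value propagates
mean(age, na.rm = TRUE)   # 32: the NA is skipped explicitly
is.na(age)                # marks which cases are missing
NA + 1                    # NA
0 / 0                     # NaN: the calculation cannot be done
```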
Portable R
  Portable R
    Download it from my website, already preconfigured, or from its own download page
    Uncompress it on your computer's hard disk or on a USB pendrive
  or install R on your computer
    Download it
    Install it on your computer
    Try desperately to set the language to English
Installing packages
  To install R commander
    Packages > Install Package(s)... > CRAN Mirror > Rcmdr
    wait for the installation of Rcmdr and additional packages
  To load R commander
    Packages > Load Package... > Rcmdr
    to the warning on missing packages answer Yes
    answer to download them from CRAN
  Learn to load an R package
Running R commander
  Whenever you want to run it
    Packages > Load Package... > Rcmdr
  File > Change Working directory
    R commander has problems navigating through your directory tree
    Choose an easy-to-find directory, such as your Desktop or the place where you keep your R exercises
Files to save
  R commander windows
    script, contains the written instructions
      R commander: File > Save Script as…
    output, contains the output
      R commander: File > Save Output as…
    or paste them into a text file
  Workspace, contains the data structure
    File > Save Workspace…
    R commander: File > Save R workspace As…
    File > Load Workspace…
data.frame or dataset
  database table suited for statistical analysis
  case names are optional
Building a new dataset
  R commander: Data > New data set…
    Insert all variables first
    Only afterwards insert the data and build a codebook
    use numbers for nominal and ordinal variables
  Convert nominal and ordinal variables to factor
    R commander: Data > Manage variables in active data set > Convert numeric variables to factor
  Convert ordinal variables to ordered
    Submit the 3 lines of code with ordered instead of factor
  Check the result with ls.str() and str(dataset)
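The conversion can also be typed by hand. This is only a sketch: the data frame, its variables and the codebook labels are hypothetical, and the code R commander generates may look different.

```r
# Hypothetical questionnaire coded with numbers, as suggested for the codebook
dataset <- data.frame(
  sex       = c(1, 2, 2, 1),
  education = c(3, 1, 2, 3)
)

# Nominal variable: numbers become labels of a factor
dataset$sex <- factor(dataset$sex, levels = c(1, 2),
                      labels = c("male", "female"))

# Ordinal variable: same idea, with ordered instead of factor
dataset$education <- ordered(dataset$education, levels = c(1, 2, 3),
                             labels = c("low", "medium", "high"))

str(dataset)   # shows the type of each variable
```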
Importing dataset
  R commander: Data
    import from a package: Data in packages
    import from a text file: Import Data > from text file, clipboard or URL…
    import from Excel (hoping that it works): Import Data > from Excel, Access or dBase data set…
    export to a text file: Active data set > Export active data set…
Importing dataset from SPSS
  Written here just in case you ever need it; converting to a text file is better and easier!
  R commander: Data > Import Data > from SPSS data set…
  Pay attention to value labels and factors
  Date importing is wrong! Fix it with
    library(chron)
    var <- as.chron(ISOdate(1582, 10, 14) + var)
Univariate descriptive analysis
  Statistics > Summaries
    For scale variables: Numerical summaries
    For ordinal and nominal variables: Frequency distributions
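Roughly the same summaries can be obtained at the command line. The sketch below uses the built-in CO2 dataset and base R functions, which are not necessarily the ones R commander calls behind its menus.

```r
# Scale variable: numerical summaries
summary(CO2$uptake)   # minimum, quartiles, mean, maximum
sd(CO2$uptake)        # standard deviation

# Nominal variable: frequency distribution, as counts and as percentages
freq <- table(CO2$Treatment)
pct  <- 100 * prop.table(freq)
freq
pct
```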
Graphs for one nominal variable
  Column plot
  Pie chart
  Radar graph
  Bar plot
  Line plot
  Area plot
  3D variants
Graphs for one nominal variable
  R commander: Graphs
    Color palette…
    Bar graph…
    Pie chart…
  To change colors, add the option col=c(numbers of the colors from the palette) to the text command, then select the text command and submit it
Graphs for one scale variable
  Building a histogram: grouping into bins
  Choosing the bins carefully
Graphs for one scale variable
  Boxplot
    Median is the black line
    Central 50% is in the rectangle
    Whiskers reach the most extreme values within 1.5 times the rectangle's height from the rectangle (R's default)
    More extreme values are drawn as symbols
One scale variable case by case
  Only for a scale variable with few cases
  Use any appropriate nominal variable graph
Graphs for one scale variable
  R commander: Graphs
    Histogram…
    Boxplot…
    Index plot…
Bivariate analysis: nominal vs nominal
  Statistics > Contingency table > Two-way table…
    Percentages
  Understand clearly when to use row percentages and when to use column percentages
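The difference between row and column percentages is easier to see on a small table. This sketch uses the built-in mtcars dataset and base R's prop.table; the choice of variables is only an illustration.

```r
# Two-way table: number of cylinders (rows) by transmission type (columns)
tab <- table(cylinders = mtcars$cyl, transmission = mtcars$am)
tab

# Row percentages: each row sums to 100
# (among cars with 4 cylinders, how many are manual?)
row_pct <- 100 * prop.table(tab, margin = 1)

# Column percentages: each column sums to 100
# (among manual cars, how many have 4 cylinders?)
col_pct <- 100 * prop.table(tab, margin = 2)

round(row_pct, 1)
round(col_pct, 1)
```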
Graphs for nominal vs nominal
  Side by side
  Stacked
  Appropriate 3D variants
  A rare example of a useful stacked area chart
Graphs for nominal vs nominal
  No available graph in R as far as I know
  How to export your graphics into Word: right-click > copy as bitmap
Bivariate analysis: scale vs nominal
  Statistics > Summaries
    Numerical summaries > Summarize by groups…
    Table of statistics…
Graphs for scale vs nominal
  Boxplot side by side
  Histogram one above the other
Graphs for two variables
  R commander: Graphs
    Boxplot… > Plot by groups…
Bivariate analysis: scale vs scale
  Statistics > Summaries > Correlation matrix
    Pearson linear correlation
    Spearman rank correlation
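A correlation matrix can also be computed directly with base R's cor. The sketch below uses three scale variables from the built-in mtcars dataset, chosen only as an illustration.

```r
# Three scale variables: fuel consumption, weight, horsepower
vars <- mtcars[, c("mpg", "wt", "hp")]

pearson  <- cor(vars, method = "pearson")    # linear correlation
spearman <- cor(vars, method = "spearman")   # correlation of the ranks

round(pearson, 2)
round(spearman, 2)
```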
Scale versus scale
  Scatterplot
  Mathematical graph
  Regression line
Graphs for two variables
  R commander: Graphs
    Scatterplot…
      Remove all the unnecessary options
    Line graph… (mathematical graph)
      X variable must have values in order
Multivariate analysis
  Statistics
    three nominal: Contingency table > Multi-way table
    three scale: Summaries > Correlation matrix
Graphs for three scale variables
  Surface plot
  Bubble chart
Graphs for two scale and one nominal variables
  R commander: Graphs
    Scatterplot… > Plot by groups…
Restrict data set
  R commander: Data > Active Data Set
    Subset active data set…
      Used to restrict the data set to some cases
      Use labels and not numbers for nominal variables!
    Remove cases with missing data…
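The same restrictions can be sketched at the command line with base R's subset and na.omit. The conditions and the names of the new data sets are only illustrations, applied to the built-in CO2 and airquality datasets.

```r
# Keep only some cases; the nominal variable is compared with its label
quebec_high <- subset(CO2, Type == "Quebec" & uptake > 30)
nrow(quebec_high)

# Remove every case that has at least one missing value
complete <- na.omit(airquality)
nrow(airquality)   # cases before
nrow(complete)     # cases after
```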
Recode
  Used to create or modify factor/ordered variables
  R commander: Data > Manage variables in active data set > Recode variables…
    "Bolzano" = "here"
    c("Munich","Hannover","Bonn") = "Germany"
      Do not use "Munich","Hannover","Bonn" = "Germany" as suggested by the help
    else = "Others"
    For numerical variables we may also use 8:27 = "high", together with lo and hi
  Massive recoding
Binning
  Used to group scale variables into ordered ones (but it produces a factor)
  R commander: Data > Manage variables in active data set > Bin numeric variable…
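Binning can be sketched with base R's cut, which may differ from the function R commander uses. The number of bins and the labels are assumptions made for this illustration; the last step turns the resulting factor into an ordered variable.

```r
# Group the scale variable uptake into 3 bins of equal width
uptake_bin <- cut(CO2$uptake, breaks = 3,
                  labels = c("low", "medium", "high"))
is.ordered(uptake_bin)   # FALSE: cut produces a plain factor by default

# Convert it to ordered
uptake_ord <- ordered(uptake_bin, levels = c("low", "medium", "high"))
table(uptake_ord)
```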
Compute
  Used to create a new variable through math operations
  R commander: Data > Manage variables in active data set > Compute new variable…
  newvector <- with(dataset, formula)
    CO2$myname <- with(CO2, uptake*7-sqrt(conc))
  is identical to
    CO2$myname <- CO2$uptake*7-sqrt(CO2$conc)
Computing (line command)
  The instruction produced by Compute
    CO2$myname <- with(CO2, uptake*7-sqrt(conc))
  can easily be typed directly by you!
  Or you can type
    CO2$myname <- CO2$uptake*7-sqrt(CO2$conc)
  Variables' names must be preceded by the dataset's name and $
  <- means: take things from the right and put them on the left
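A quick check that the two ways of writing the instruction give the same result, using the built-in CO2 dataset. The name other_way is introduced only for the comparison.

```r
# Form produced by Compute: variables are looked up inside CO2
CO2$myname <- with(CO2, uptake * 7 - sqrt(conc))

# Form typed by hand: every variable is preceded by CO2$
other_way <- CO2$uptake * 7 - sqrt(CO2$conc)

identical(CO2$myname, other_way)   # TRUE
head(CO2$myname)
```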
Computing (line command)
  If you do not specify dataset$, the variable will be created outside the dataset
    with only one case (unless otherwise specified)
    print(variable) to look at it
  Variable assignment
    variable <- value or formula
    value or formula -> variable
  Operators: + - * / **
Computing (line command)
  A variable with many cases outside a dataset is called a "vector"
    vector <- c(list of items) to create it manually
    vector[index] to access a specific element of the vector
    vector[from:to] to access a sequence of elements of the vector
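A short sketch of assignment, operators and vectors. All names and values here are made up for the illustration.

```r
# Assignment works in both directions
x <- 2 ** 3     # ** is the power operator, the same as ^
5 -> y
print(x)        # 8

# A vector created manually
ages <- c(25, 26, 27, 28, 29)
ages[2]         # second element: 26
ages[2:4]       # elements from 2 to 4: 26 27 28

# Math operations act on every element of the vector
ages * 2 + y
```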
Statistical tests
  Example: we want to study the age of Internet users, checking whether the average age is 35 years or not.
  The only information we have is the observations on a sample of 100 users, which are:
  25; 26; 27; 28; 29; 30; 31; 30; 33; 34; 35; 36; 37; 38; 30; 30; 41; 42; 43; 44; 45; 46; 47; 48; 49; 50; 51; 52; 20; 54; 55; 56; 57; 20; 20; 20; 30; 31; 32; 33; 34; 35; 36; 37; 38; 39; 40; 41; 42; 43; 44; 45; 46; 47; 48; 49; 50; 20; 21; 22; 23; 24; 25; 26; 27; 28; 29; 30; 31; 32; 33; 34; 35; 36; 37; 38; 39; 40; 35; 36; 37; 35; 36; 37; 35; 36; 37; 35; 36; 37; 35; 36; 37; 35; 36; 37; 35; 36; 37; 35.
Statistical tests
  Test's hypotheses
    H0: average age in the population is 35
    H1: average age in the population is not 35
  We calculate the average age in the sample, 36.2, which is an estimate of the population's average age.
  We compare this result with the 35 of the H0 hypothesis and we find a difference of 1.2.
  We ask ourselves whether this difference is:
    large, implying that the population's average age is not 35 and thus H0 must be rejected;
    small, so that it can be caused by random fluctuation in the sample choice and therefore H0 must be accepted.
Statistical tests
  In order to answer, the test provides us with a significance: the probability, if H0 is true, of getting by chance a difference at least as large as the one observed
  In this example the significance is 16%
  If the significance is large, we accept H0
    this implies that we do not know
  If the significance is small, we reject H0
    this implies that we are almost sure that H0 is false
  Significance is also called p-value
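The example on the age of Internet users can be reproduced at the command line. This sketch types in the 100 observations and uses base R's t.test; the names age and result are introduced here only for the illustration.

```r
# The 100 observed ages from the example
age <- c(25, 26, 27, 28, 29, 30, 31, 30, 33, 34,
         35, 36, 37, 38, 30, 30, 41, 42, 43, 44,
         45, 46, 47, 48, 49, 50, 51, 52, 20, 54,
         55, 56, 57, 20, 20, 20, 30, 31, 32, 33,
         34, 35, 36, 37, 38, 39, 40, 41, 42, 43,
         44, 45, 46, 47, 48, 49, 50, 20, 21, 22,
         23, 24, 25, 26, 27, 28, 29, 30, 31, 32,
         33, 34, 35, 36, 37, 38, 39, 40, 35, 36,
         37, 35, 36, 37, 35, 36, 37, 35, 36, 37,
         35, 36, 37, 35, 36, 37, 35, 36, 37, 35)

length(age)       # 100 cases
mean(age)         # 36.2, the sample average
mean(age) - 35    # the difference from the H0 value

# Student's t test for one variable, H0: average in the population = 35
result <- t.test(age, mu = 35)
result$p.value    # the significance, about 0.16
```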
Typical univariate analysis techniques
  nominal
    Numerical description: frequencies (one-dimensional contingency table)
    Graphical description: column plot, pie chart
    Parametric test: ---
    Non-parametric test: Chi-square for a one-dimensional contingency table
  scale
    Numerical description: descriptive statistics
    Graphical description: histogram, boxplot
    Parametric test: Student's t for one variable
    Non-parametric test: sign test
Tests for one scale variable
  Student's t test for one variable
    H0: average in the population = m
    Statistics > Means > Single-sample t-test
  Sign test
    H0: median in the population = m
    Not available in R commander
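Since the sign test is not in the R commander menus, one common way to run it at the command line is through base R's binom.test: count the cases above and below m and test whether they split 50/50. This is a sketch of that approach, not necessarily the one intended in the course; the data and the value of m are hypothetical.

```r
# Hypothetical observations and hypothesized median
x <- c(20, 22, 25, 30, 36, 38, 40, 41, 45, 50)
m <- 35

above <- sum(x > m)   # cases above m
below <- sum(x < m)   # cases below m (cases equal to m are dropped)

# H0: median = m, i.e. a case is above m with probability 0.5
sign_test <- binom.test(above, above + below, p = 0.5)
sign_test$p.value
```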
Tests for one nominal variable
  Chi-square test for a one-dimensional contingency table
    H0: classification follows a predetermined distribution
    Statistics > Summaries > Frequency Distributions… > Chi-square
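A sketch of the same test with base R's chisq.test. The predetermined distribution (all three categories equally likely) is an assumption chosen for this illustration, applied to the number of cylinders in the built-in mtcars dataset.

```r
# Observed frequencies of a nominal variable
observed <- table(mtcars$cyl)
observed

# H0: the three categories are equally likely
gof <- chisq.test(observed, p = c(1/3, 1/3, 1/3))
gof$expected    # frequencies expected under H0
gof$p.value
```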
Typical bivariate analysis techniques
  nominal vs nominal
    Numerical description: 2D contingency table
    Graphical description: clustered or stacked or 3D column plot
    Parametric test: ---
    Non-parametric test: Chi-square for a 2D contingency table
  binary nominal vs scale
    Numerical description: descriptive statistics by groups
    Graphical description: boxplots or histograms by groups
    Parametric test: Student's t for two populations
    Non-parametric test: Mann-Whitney
  non-binary nominal vs scale
    Parametric test: one-way analysis of variance (ANOVA)
    Non-parametric test: Kruskal-Wallis
  scale vs scale
    Numerical description: Pearson's or Spearman's correlation
    Graphical description: scatterplot
    Parametric test: Pearson's correlation, Student's t for paired data
    Non-parametric test: Spearman's correlation, Wilcoxon signed-rank test
Tests for two nominal variables
  Chi-square test for a two-dimensional contingency table
    H0: the classifications of the two variables are independent
    Statistics > Contingency table > Two-way table… > Statistics > Chi-square test of independence
    Warning: you should have no expected frequency less than 5
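The warning about expected frequencies can be checked directly, because chisq.test returns them. A sketch on the built-in mtcars dataset, where the table is small enough for the problem to show up; the variables are chosen only as an illustration.

```r
tab <- table(cylinders = mtcars$cyl, transmission = mtcars$am)

# R itself warns here about small expected counts; the warning is
# suppressed only to keep the output of this sketch clean
test <- suppressWarnings(chisq.test(tab))

test$expected            # expected frequencies under independence
any(test$expected < 5)   # TRUE: the result should not be trusted
test$p.value
```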
Test for binary nominal vs scale
  Student's t test for two populations
    H0: average of group 1 = average of group 2
    Statistics > Means > Independent samples t-test
    Warning: the scale variable should be normally distributed in both groups
Non-parametric test for binary nominal vs scale
  Mann-Whitney (Wilcoxon rank-sum)
    It tests the ranks
    H0: position of group 1 = position of group 2
    Statistics > Nonparametric tests > Two-sample Wilcoxon test
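Both tests for a binary nominal variable against a scale variable can be sketched with base R. The example uses the built-in mtcars dataset, with am (transmission, two groups) as the nominal variable and mpg as the scale variable; exact = FALSE is set because the data contain ties.

```r
# Student's t test for two populations, H0: the two averages are equal
t_res <- t.test(mpg ~ am, data = mtcars)

# Mann-Whitney / Wilcoxon rank-sum test, H0: the two positions are equal
w_res <- wilcox.test(mpg ~ am, data = mtcars, exact = FALSE)

t_res$p.value
w_res$p.value
```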
Test for non-binary nominal vs scale
  ANOVA (ANalysis Of VAriance)
    H0: average is the same for all groups
    Statistics > Means > One-way ANOVA
    The test rejects even if just one population's average is different from the others
    Warning: the scale variable should be normally distributed in each group
Non-parametric test for non-binary nominal vs scale
  Kruskal-Wallis
    It tests the ranks
    H0: position is the same for all groups
    Statistics > Nonparametric tests > Kruskal-Wallis test
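With more than two groups, the same pair of tests can be sketched on the built-in PlantGrowth dataset, where group has three levels and weight is the scale variable. The way the ANOVA p-value is extracted is just one possible choice.

```r
# One-way ANOVA, H0: the average weight is the same in all groups
fit <- aov(weight ~ group, data = PlantGrowth)
summary(fit)
anova_p <- summary(fit)[[1]][["Pr(>F)"]][1]

# Kruskal-Wallis, H0: the position is the same in all groups
kw <- kruskal.test(weight ~ group, data = PlantGrowth)

anova_p
kw$p.value
```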
Tests for two scale variables
  Pearson's and Spearman's correlation tests
    H0: correlation = 0
    Statistics > Summaries > Correlation test
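A sketch of both correlation tests with base R's cor.test, on two scale variables of the built-in mtcars dataset. For Spearman, exact = FALSE is used because the data contain ties.

```r
# H0: correlation between fuel consumption and weight = 0
pearson  <- cor.test(mtcars$mpg, mtcars$wt, method = "pearson")
spearman <- cor.test(mtcars$mpg, mtcars$wt, method = "spearman",
                     exact = FALSE)

pearson$estimate    # the correlation itself
pearson$p.value     # its significance
spearman$estimate
spearman$p.value
```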
Tests for difference of two scale variables
  When using tests on the difference of variables
  Student's t test for paired data
    H0: average (var 1 - var 2) = 0
    Statistics > Means > Paired t test
    Warning: the distribution of the difference of the scale variables must be normal
Nonparametric test for two scale paired variables
  Wilcoxon signed-rank test
    It tests the ranks
    H0: var 1 - var 2 is positioned around 0
    Statistics > Nonparametric tests > Paired-samples Wilcoxon test
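The paired tests work on the difference of the two variables, which the sketch below makes explicit. It uses the built-in sleep dataset, where the same 10 subjects were measured under two drugs; the names drug1 and drug2 are introduced only for the illustration.

```r
# Two scale variables measured on the same cases
drug1 <- sleep$extra[sleep$group == 1]
drug2 <- sleep$extra[sleep$group == 2]

# Student's t test for paired data, H0: average (drug1 - drug2) = 0
paired_t <- t.test(drug1, drug2, paired = TRUE)

# It is the same as a one-variable t test on the difference
diff_t <- t.test(drug1 - drug2, mu = 0)

# Wilcoxon signed-rank test, H0: drug1 - drug2 is positioned around 0
paired_w <- wilcox.test(drug1, drug2, paired = TRUE, exact = FALSE)

paired_t$p.value
paired_w$p.value
```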
Is a variable normally distributed?
  Histogram with normal curve
    Find out the average a and the standard deviation s
    Build a histogram with appropriate binning
    close it, add prob=TRUE and rebuild it
    do not close it!
    curve(dnorm(x, mean=a, sd=s), col="blue", lwd=2, add=TRUE, yaxt="n")
  Q-Q plot (data must be on the line)
    Graphs > Quantile-comparison Plot
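The whole procedure can be sketched at the command line with base R graphics. The variable (uptake from the built-in CO2 dataset) is only an illustration, and qqnorm with qqline is a base R stand-in for R commander's quantile-comparison plot, not the same function.

```r
uptake <- CO2$uptake
a <- mean(uptake)   # average
s <- sd(uptake)     # standard deviation

# Histogram with densities instead of counts on the y axis
h <- hist(uptake, prob = TRUE)

# Normal curve with the same average and standard deviation,
# drawn on top of the open histogram
curve(dnorm(x, mean = a, sd = s), col = "blue", lwd = 2, add = TRUE)

# Q-Q plot: the points should lie close to the line
qqnorm(uptake)
qqline(uptake)
```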
Is a variable normally distributed?
  Skewness
    negative: tail left, positive: tail right
  Excess kurtosis
    negative: flat, 0: normal, positive: too pointy
  Statistics > Summaries > Numerical summaries > Options
  Shapiro-Wilk normality test
    H0: variable comes from a normal distribution
    Statistics > Summaries > Shapiro-Wilk test of normality
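The Shapiro-Wilk test is also available in base R as shapiro.test. A sketch on the uptake variable of the built-in CO2 dataset, chosen only as an illustration.

```r
# H0: the variable comes from a normal distribution
sw <- shapiro.test(CO2$uptake)
sw$statistic   # W, close to 1 for normal-looking data
sw$p.value     # small significance: reject normality
```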