Published by Howard Little. Modified over 9 years ago.
1
18 August 2015
Statistical Analysis with R
Questionnaires
Variables organization
Descriptive analysis
Graphs
Statistical tests
2
R
Statistical package
4th-generation programming language, extensible through functions and extensions
Environment for statistical computing and graphics
Statistical and graphical techniques, extensible through packages
Competitors: SPSS, Matlab
3
Variables
Scale or numeric variables: time, age, weight, distance in kilometers, length, number of children, GDP
Nominal or categorical variables: country of residence, sex, degree course
Ordinal variables: education level, rankings, Likert scale
  In statistical analysis they are often treated as nominal or scale variables
Questionnaire overview
4
Missing values
NA means "not available": you insert it manually whenever a datum is missing
NaN means "not a number": it appears whenever a calculation cannot be done for this datum
They are skipped in statistical analysis (base R functions such as mean() skip them only when na.rm=TRUE is given)
Any math operation with NA gives NA, and any math operation with NaN gives NaN
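A minimal sketch of how the two kinds of missing value behave at the R prompt. The vector age is a hypothetical example, not data from the course.

```r
# Hypothetical ages with one missing answer
age <- c(25, 31, NA, 40)

is.na(age)                 # TRUE only for the third element
mean(age)                  # NA: the missing value propagates
mean(age, na.rm = TRUE)    # 32: the missing value is skipped on request

nan_value <- 0 / 0         # NaN: the calculation cannot be done
is.nan(nan_value)          # TRUE
is.na(nan_value)           # TRUE: NaN also counts as missing
NA + 1                     # NA
```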
5
Portable R
Download it from my website, already preconfigured, or from http://rportable.sourceforge.net
Uncompress it on your computer’s hard disk or on a USB pendrive
Or install R on your computer
Download it from www.r-project.org
Install it on your computer
Try desperately to set the language to English
6
Installing packages
To install R commander
  Packages > Install Package(s)... > CRAN Mirror > Rcmdr
  Wait for the installation of Rcmdr and additional packages
To load R commander
  Packages > Load Package... > Rcmdr
  If warned about missing packages, answer Yes
  Answer to download them from CRAN
Learn to load an R package
7
Running R commander
Whenever you want to run it: Packages > Load Package... > Rcmdr
File > Change Working directory
R commander has problems navigating through your directory tree
Choose an easy-to-find directory, such as your Desktop or the place where you keep your R exercises.
8
Files to save
R commander windows
  Script, contains the written instructions: R commander > File > Save Script as…
  Output, contains the output: R commander > File > Save Output as…, or paste it into a text file
Workspace, contains the data structure
  File > Save Workspace…
  R commander > File > Save R workspace As…
  File > Load Workspace…
9
data.frame or dataset
Database table suited for statistical analysis
Case names are optional
10
Building a new dataset
R commander > Data > New data set…
Insert all variables first
Only afterwards insert the data and build a codebook
  Use numbers for nominal and ordinal variables
Convert nominal and ordinal variables to factor
  R commander > Data > Manage variables in active data set > Convert numeric variables to factor
Convert ordinal variables to ordered
  Submit the 3 lines of code with ordered instead of factor
ls.str() and str(dataset)
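The conversion can also be typed by hand. The sketch below assumes a hypothetical three-variable dataset and a made-up codebook; it is not the code R commander generates, only an equivalent in base R.

```r
# Hypothetical codebook:
#   city:      1 = Bolzano, 2 = Munich, 3 = Bonn      (nominal)
#   education: 1 = low, 2 = medium, 3 = high          (ordinal)
dataset <- data.frame(
  city      = c(1, 2, 2, 3),
  education = c(3, 1, 2, 3),
  age       = c(25, 31, 40, 28)
)

# Nominal variable: plain factor
dataset$city <- factor(dataset$city, levels = 1:3,
                       labels = c("Bolzano", "Munich", "Bonn"))

# Ordinal variable: ordered factor, so that comparisons such as < make sense
dataset$education <- factor(dataset$education, levels = 1:3,
                            labels = c("low", "medium", "high"),
                            ordered = TRUE)

str(dataset)   # shows the type of every variable
```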
11
Importing dataset
R commander > Data
Import from a package: Data in packages
Import from a text file: Import Data > from text file, clipboard or URL…
Import from Excel (hoping that it works): Import Data > from Excel, Access or dBase data set…
Export to a text file: Active data set > Export active data set…
12
Importing dataset from SPSS
Written here just in case you ever need it; converting to a text file is better and easier!
R commander > Data > Import Data > from SPSS data set…
Pay attention to value labels and factors
Date importing is wrong! Fix it with
  library(chron)
  var <- as.chron(ISOdate(1582, 10, 14) + var)
13
Univariate descriptive analysis
Statistics > Summaries
For scale variables: Numerical summaries
For ordinal and nominal variables: Frequency distributions
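The same summaries can be obtained from the command line. This sketch uses the built-in CO2 dataset, which the slides on computing also use; treating uptake as the scale variable and Type as the nominal one is my choice for illustration.

```r
# Scale variable: numerical summaries
summary(CO2$uptake)        # min, quartiles, median, mean, max
sd(CO2$uptake)             # standard deviation

# Nominal variable: frequency distribution
counts <- table(CO2$Type)
counts                                  # absolute frequencies
round(100 * prop.table(counts), 1)      # percentages
```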
14
Graphs for one nominal variable
Column plot
15
Graphs for one nominal variable
Pie chart
Radar graph
16
Graphs for one nominal variable
Bar plot
Line plot
17
Graphs for one nominal variable
Area plot
3D variants
18
Graphs for one nominal variable
R commander > Graphs
Color palette…
Bar graph…
Pie chart…
To change colors, add the option col=c(numbers of colors from palette) to the text command, select the text command and submit it
19
Graphs for one scale variable
Building a histogram: grouping into bins
[Histogram: x axis from $1,000 to $5,000, y axis from 0 to 12]
20
Graphs for one scale variable
Choosing the bins carefully
[Histogram: x axis from $1,000 to $5,000, y axis from 0 to 30]
21
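Graphs for one scale variable
Boxplot
Median is the black line
Central 50% is in the rectangle
Whiskers reach the most extreme values no farther than 1.5 times the rectangle's height from the rectangle (R's default)
Values beyond the whiskers are drawn as symbols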
22
One scale variable case by case
Only for scale variables with few cases
Use any appropriate nominal variable graph
23
Graphs for one scale variable
R commander > Graphs
Histogram…
Boxplot…
Index plot…
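A rough command-line equivalent of the three menu items, sketched with base graphics on the built-in CO2 dataset. The number of bins passed to hist() is only a suggestion that R may adjust, which is one reason to inspect the breaks it actually chose.

```r
uptake <- CO2$uptake

# Histogram: 'breaks' suggests the number of bins, R picks "pretty" limits
h <- hist(uptake, breaks = 8, main = "CO2 uptake", xlab = "uptake")
h$breaks      # bin limits actually used
h$counts      # number of cases in each bin

# Boxplot of the same variable
boxplot(uptake, main = "CO2 uptake")

# Index plot: one vertical line per case, in dataset order
plot(uptake, type = "h", xlab = "case", ylab = "uptake")
```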
24
Bivariate analysis: nominal vs nominal
Statistics > Contingency table > Two-way table…
Percentages
Understand clearly when to use row percentages and when to use column percentages
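To see the difference between the two kinds of percentages, here is a sketch on the built-in mtcars dataset, treating cyl (number of cylinders) and am (0 = automatic, 1 = manual) as nominal variables. The choice of dataset is mine.

```r
tab <- table(mtcars$cyl, mtcars$am, dnn = c("cyl", "am"))
tab

# Row percentages: each row sums to 100
# "among cars with 4 cylinders, what share is manual?"
row_pct <- 100 * prop.table(tab, margin = 1)
round(row_pct, 1)

# Column percentages: each column sums to 100
# "among manual cars, what share has 4 cylinders?"
col_pct <- 100 * prop.table(tab, margin = 2)
round(col_pct, 1)
```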
25
Graphs for nominal vs nominal
Side by side
Stacked
26
Graphs for nominal vs nominal
Appropriate 3D variants
27
Graphs for nominal vs nominal
A rare example of a useful stacked area chart
28
Graphs for nominal vs nominal
No available graph in R as far as I know
How to export your graphics into Word: right-click > copy as bitmap
29
Bivariate analysis: scale vs nominal
Statistics > Summaries
Numerical summaries > Summarize by groups…
Table of statistics…
30
Graphs for scale vs nominal
Boxplot side by side
Histogram one above the other
31
Graphs for two variables
R commander > Graphs
Boxplot… > Plot by groups…
32
Bivariate analysis: scale vs scale
Statistics > Summaries > Correlation matrix
Pearson linear correlation
Spearman rank correlation
33
Scale versus scale
Scatterplot
34
Scale versus scale
Mathematical graph
Regression line
35
Graphs for two variables
R commander > Graphs
Scatterplot…
  Remove all the unnecessary options
Line graph… (mathematical graph)
  X variable must have values in order
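A base-graphics sketch of both graphs, using mpg and wt from the built-in mtcars dataset (my choice of example). It also shows one way to satisfy the requirement that the X variable be in order before drawing a line graph.

```r
# Scatterplot with a regression line
plot(mpg ~ wt, data = mtcars, xlab = "weight", ylab = "miles per gallon")
fit <- lm(mpg ~ wt, data = mtcars)     # linear regression of mpg on wt
abline(fit, col = "blue", lwd = 2)

# Line graph: sort the cases by the X variable first,
# otherwise the line jumps back and forth
ord <- order(mtcars$wt)
plot(mtcars$wt[ord], mtcars$mpg[ord], type = "l",
     xlab = "weight", ylab = "miles per gallon")
```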
36
Multivariate analysis
Statistics
Three nominal: Contingency table > Multi-way table
Three scale: Summaries > Correlation matrix
37
Graphs for three scale variables
Surface plot
38
Graphs for three scale variables
Bubble chart
www.gapminder.org
39
Graphs for two scale and one nominal variables
R commander > Graphs
Scatterplot… > Plot by groups…
40
Restrict data set
R commander > Data > Active Data Set
Subset active data set…
  Used to restrict data set to some cases
  Use labels and not numbers for nominal variables!
Remove cases with missing data…
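Both operations have short command-line counterparts. The sketch below uses the built-in CO2 and airquality datasets as stand-ins for your own data; note that the factor is selected by its label, as the slide recommends.

```r
# Keep only some cases: select by label, not by the underlying number
chilled <- subset(CO2, Treatment == "chilled")
nrow(chilled)

# Remove every case that has at least one missing value
complete <- na.omit(airquality)
nrow(airquality)    # cases before
nrow(complete)      # cases after
```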
41
Recode
Used to create or modify factor/ordered variables
R commander > Data > Manage variables in active data set > Recode variables…
  "Bolzano" = "here"
  c("Munich","Hannover","Bonn") = "Germany"
  Do not use "Munich","Hannover","Bonn" = "Germany" as suggested by the help
  else = "Others"
For numerical variables we may also use 8:27 = "high", together with lo and hi
Massive recoding
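The recoding rules above are written in the syntax of the R commander dialog. As a rough illustration of what they do, here is the same recoding done step by step in base R on a hypothetical city variable. This is a substitute technique, not the code the dialog produces.

```r
# Hypothetical nominal variable
city <- factor(c("Bolzano", "Munich", "Hannover", "Bonn", "Paris"))

german <- c("Munich", "Hannover", "Bonn")

area <- rep("Others", length(city))     # else = "Others"
area[city == "Bolzano"] <- "here"       # "Bolzano" = "here"
area[city %in% german]  <- "Germany"    # c("Munich","Hannover","Bonn") = "Germany"
area <- factor(area)

table(city, area)                       # check old values against new ones
```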
42
Binning
Used to group scale variables into ordered categories (but it produces a factor)
R commander > Data > Manage variables in active data set > Bin numeric variable…
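Binning can be sketched from the command line with cut(). The three equal-width bins and their labels are arbitrary choices for illustration; the second call shows how to ask for an ordered result directly.

```r
# Three equal-width bins: the result is a plain factor
uptake_bin <- cut(CO2$uptake, breaks = 3,
                  labels = c("low", "medium", "high"))
is.ordered(uptake_bin)      # FALSE

# Same bins, but as an ordered factor
uptake_ord <- cut(CO2$uptake, breaks = 3,
                  labels = c("low", "medium", "high"),
                  ordered_result = TRUE)
is.ordered(uptake_ord)      # TRUE

table(uptake_ord)           # cases per bin
```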
43
Compute
Used to create a new variable through math operations
R commander > Data > Manage variables in active data set > Compute new variable…
  newvector <- with(dataset, formula)
  CO2$myname <- with(CO2, uptake*7-sqrt(conc))
It is identical to
  CO2$myname <- CO2$uptake*7-sqrt(CO2$conc)
44
Computing (line command)
The instruction produced by Compute
  CO2$myname <- with(CO2, uptake*7-sqrt(conc))
can easily be typed directly by you! Or you can type
  CO2$myname <- CO2$uptake*7-sqrt(CO2$conc)
Variables’ names must be preceded by the dataset’s name and $
<- means take things from the right and put them on the left
45
Computing (line command)
If you do not specify dataset$, the variable will be created outside the dataset with only one case (unless otherwise specified)
print(variable) to look at it
Variable assignment: variable <- value or formula, or value or formula -> variable
Operators: + - * / **
46
Computing (line command)
A variable with many cases outside a dataset is called a “vector”
vector <- c(list of items) to create it manually
vector[index] to access a specific element of the vector
vector[from:to] to access a sequence of the vector’s elements
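The commands from the last slides can be tried together at the prompt. In this sketch the names threshold, upper, ages and co2 are made up for illustration; co2 is a copy of the built-in CO2 dataset so that the original stays untouched.

```r
# A variable outside any dataset holds a single value
threshold <- 35
print(threshold)
40 -> upper                 # same assignment, written left to right

# A vector holds many values
ages <- c(25, 26, 27, 28, 29)
ages[2]                     # second element: 26
ages[2:4]                   # elements 2 to 4: 26 27 28

2 ** 3                      # power operator: 8 (R also accepts 2^3)

# A new variable inside a dataset needs the dataset's name and $
co2 <- as.data.frame(CO2)
co2$myname <- co2$uptake * 7 - sqrt(co2$conc)
head(co2$myname)
```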
47
Statistical tests
Example: we want to study the age of Internet users, checking whether the average age is 35 years or not.
The only information we have is the observations on a sample of 100 users, which are:
25; 26; 27; 28; 29; 30; 31; 30; 33; 34; 35; 36; 37; 38; 30; 30; 41; 42; 43; 44; 45; 46; 47; 48; 49; 50; 51; 52; 20; 54; 55; 56; 57; 20; 20; 20; 30; 31; 32; 33; 34; 35; 36; 37; 38; 39; 40; 41; 42; 43; 44; 45; 46; 47; 48; 49; 50; 20; 21; 22; 23; 24; 25; 26; 27; 28; 29; 30; 31; 32; 33; 34; 35; 36; 37; 38; 39; 40; 35; 36; 37; 35; 36; 37; 35; 36; 37; 35; 36; 37; 35; 36; 37; 35; 36; 37; 35; 36; 37; 35.
48
Statistical tests
Test’s hypotheses:
  H0: average age in the population is 35
  H1: average age in the population is not 35
We calculate the average age in the sample, 36.2, which is an estimate of the population’s average age.
We compare this result with the 35 of the H0 hypothesis and we find a difference of +1.2.
We ask ourselves whether this difference is:
  large, implying that the population’s average age is not 35 and thus H0 must be rejected
  small, so that it can be caused by random fluctuation in the sample choice, and therefore H0 must be accepted.
49
Statistical tests
In order to answer, the test provides us with a significance: the probability, if H0 is true, of obtaining a sample at least as far from H0 as the one observed
In this example significance is 16%
If significance is large, we accept H0
  this implies that we do not know
If significance is small, we reject H0
  this implies that we are almost sure that H0 is false
Significance is also called p-value
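The example can be reproduced from the command line. The sketch below rebuilds the 100 observations listed in the example in compact form (runs such as 25:31 stand for 25; 26; …; 31) and runs a single-sample t-test with base R's t.test(). The p-value it reports should be close to the 16% quoted in the example.

```r
# The 100 ages from the example, written compactly
age <- c(25:31, 30, 33:38, 30, 30, 41:52, 20, 54:57, 20, 20, 20,
         30:50, 20:40, rep(35:37, 7), 35)

length(age)      # 100 cases
mean(age)        # sample average: 36.2

# H0: average age in the population is 35
result <- t.test(age, mu = 35)
result
result$p.value   # the significance
```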
50
Typical univariate analysis techniques
Variables | Numerical description | Graphical description | Parametric test | Non-parametric test
nominal | Frequencies (one-dimensional contingency table) | Column plot, Pie chart | --- | Chi-square for a one-dimensional contingency table
scale | Descriptive statistics | Histogram, Boxplot | Student’s t for one variable | Sign test
51
Tests for one scale variable
Student’s t test for one variable
  H0: average in the population = m
  Statistics > Means > Single-sample t-test
Sign test
  H0: median in the population = m
  Not available in R commander
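Although R commander has no menu entry for it, a sign test can be sketched in base R as a binomial test on the number of cases above the hypothesized median. The variable (uptake from the built-in CO2 dataset) and the value m = 25 are arbitrary choices for illustration.

```r
uptake <- CO2$uptake
m <- 25                      # hypothetical median under H0

above <- sum(uptake > m)     # cases above m
below <- sum(uptake < m)     # cases below m (cases equal to m are dropped)

# Under H0, a case is above m with probability 0.5
sign_test <- binom.test(above, above + below, p = 0.5)
sign_test$p.value
```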
52
Tests for one nominal variable
Chi-square test for a one-dimensional contingency table
  H0: classification follows a predetermined distribution
  Statistics > Summaries > Frequency distributions… > Chi-square
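From the command line this test is chisq.test() applied to a one-dimensional table. The sketch assumes, purely for illustration, that the three cylinder classes of the built-in mtcars dataset are equally likely under H0.

```r
counts <- table(mtcars$cyl)          # observed frequencies
counts

# H0: the three classes are equally frequent in the population
test <- chisq.test(counts, p = c(1/3, 1/3, 1/3))
test$expected                        # frequencies expected under H0
test$p.value
```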
53
Typical bivariate analysis techniques
Variables | Numerical description | Graphical description | Parametric test | Non-parametric test
nominal vs nominal | 2D contingency table | Clustered or stacked or 3D column plot | --- | Chi-square for a 2D contingency table
binary nominal vs scale | Descriptive statistics by groups | Boxplots or histograms by groups | Student’s t for two populations | Mann-Whitney
non-binary nominal vs scale | Descriptive statistics by groups | Boxplots or histograms by groups | One-way analysis of variance (ANOVA) | Kruskal-Wallis
scale vs scale | Pearson’s or Spearman’s correlation | Scatterplot | Pearson’s correlation, Student’s t for paired data | Spearman’s correlation, Wilcoxon signed rank test
54
Tests for two nominal variables
Chi-square test for a two-dimensional contingency table
  H0: classification of two variables is independent
  Statistics > Contingency table > Two-way table… > Statistics > Chi-square test of independence
  Warning: you should have no expected frequency less than 5
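The warning about expected frequencies can be checked directly, because chisq.test() returns them. In this sketch on the built-in mtcars dataset (my choice of example) some expected frequencies are below 5, so the result should not be trusted.

```r
tab <- table(mtcars$cyl, mtcars$am, dnn = c("cyl", "am"))

# chisq.test() itself warns when expected frequencies are small;
# the warning is silenced here because the check is done by hand below
test <- suppressWarnings(chisq.test(tab))

round(test$expected, 2)      # expected frequencies under independence
any(test$expected < 5)       # TRUE: the condition on the slide is violated
test$p.value
```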
55
Test for binary nominal vs scale
Student’s t test for two populations
  H0: average group 1 = average group 2
  Statistics > Means > Independent samples t-test
  Warning: scale variable should be normally distributed in both groups
56
Non-parametric test for binary nominal vs scale
Mann-Whitney (Wilcoxon rank-sum)
  It tests the ranks
  H0: position group 1 = position group 2
  Statistics > Nonparametric tests > Two-sample Wilcoxon test
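Both two-group tests have one-line equivalents in base R. The sketch compares uptake between the two levels of Treatment in the built-in CO2 dataset; exact = FALSE asks wilcox.test() for the normal approximation, which avoids a warning when the data contain ties.

```r
# Parametric: Student's t test for two populations
t_res <- t.test(uptake ~ Treatment, data = CO2)
t_res$p.value

# Non-parametric: Mann-Whitney / Wilcoxon rank-sum test
w_res <- wilcox.test(uptake ~ Treatment, data = CO2, exact = FALSE)
w_res$p.value
```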
57
Test for non-binary nominal vs scale
ANOVA (ANalysis Of VAriance)
  H0: average is the same for all groups
  Statistics > Means > One-way ANOVA
  Test rejects even if just one population’s average is different from the others
  Warning: scale variable should be normally distributed for each group
58
Non-parametric test for non-binary nominal vs scale
Kruskal-Wallis
  It tests the ranks
  H0: position is the same for all groups
  Statistics > Nonparametric tests > Kruskal-Wallis test
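With more than two groups the command-line counterparts are aov() and kruskal.test(). The sketch uses the built-in PlantGrowth dataset, where weight is the scale variable and group has three levels; the dataset is my choice of example.

```r
# Parametric: one-way ANOVA
fit <- aov(weight ~ group, data = PlantGrowth)
summary(fit)                 # the p-value is in the Pr(>F) column

# Non-parametric: Kruskal-Wallis
kw <- kruskal.test(weight ~ group, data = PlantGrowth)
kw$p.value
```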
59
Tests for two scale variables
Pearson’s and Spearman’s correlation tests
  H0: correlation = 0
  Statistics > Summaries > Correlation test
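Both correlation tests are available through cor.test(). The sketch uses mpg and wt from the built-in mtcars dataset; exact = FALSE in the Spearman call avoids a warning caused by tied values.

```r
# Pearson linear correlation
pearson <- cor.test(mtcars$mpg, mtcars$wt, method = "pearson")
pearson$estimate     # correlation coefficient
pearson$p.value

# Spearman rank correlation
spearman <- cor.test(mtcars$mpg, mtcars$wt, method = "spearman",
                     exact = FALSE)
spearman$estimate
spearman$p.value
```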
60
Tests for difference of two scale variables
When using tests on the difference between variables
Student’s t test for paired data
  H0: average (var 1 – var 2) = 0
  Statistics > Means > Paired t test
  Warning: distribution of difference of scale variables must be normal
61
Nonparametric test for two scale paired variables
Wilcoxon signed-rank test
  It tests the ranks
  H0: var 1 – var 2 is positioned around 0
  Statistics > Nonparametric tests > Paired-samples Wilcoxon test
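Paired tests need the two variables side by side, one pair per case. The sketch builds such a pair from the built-in sleep dataset, in which the same 10 patients were measured under two drugs; the names drug1 and drug2 are mine.

```r
# Two measurements on the same 10 patients
drug1 <- sleep$extra[sleep$group == 1]
drug2 <- sleep$extra[sleep$group == 2]

# Parametric: Student's t test for paired data
t_paired <- t.test(drug1, drug2, paired = TRUE)
t_paired$p.value

# Non-parametric: Wilcoxon signed-rank test
w_paired <- wilcox.test(drug1, drug2, paired = TRUE, exact = FALSE)
w_paired$p.value
```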
62
Is a variable normally distributed?
Histogram with normal curve
  Find out average a and standard deviation s
  Build a histogram with appropriate binning
  Close it, add prob=TRUE and rebuild it
  Do not close it!
  curve(dnorm(x, mean=a, sd=s), col="blue", lwd=2, add=TRUE, yaxt="n")
Q-Q plot (data must be on the line)
  Graphs > Quantile-comparison Plot
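The whole procedure can be run in one go from the command line. This sketch uses uptake from the built-in CO2 dataset as the variable to check, and base R's qqnorm() and qqline() in place of the R commander quantile-comparison plot.

```r
uptake <- CO2$uptake
a <- mean(uptake)            # average
s <- sd(uptake)              # standard deviation

# Histogram on the density scale, so the curve and the bars are comparable
hist(uptake, prob = TRUE, main = "CO2 uptake", xlab = "uptake")

# Normal curve with the same average and standard deviation,
# drawn on top of the histogram that is still open
curve(dnorm(x, mean = a, sd = s), col = "blue", lwd = 2, add = TRUE)

# Q-Q plot: points close to the line suggest a normal distribution
qqnorm(uptake)
qqline(uptake)
```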
63
Is a variable normally distributed?
Skewness
  negative: tail left, positive: tail right
Excess kurtosis
  negative: flat, 0: normal, positive: too pointy
  Statistics > Summaries > Numerical summaries > Options
Shapiro-Wilk normality test
  H0: variable comes from a normal distribution
  Statistics > Summaries > Shapiro-Wilk test of normality
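Base R has no built-in skewness or kurtosis function, so the sketch below computes them from the moments of the data. This is only one of several definitions in use, and R commander may report slightly different values. The Shapiro-Wilk test is available directly as shapiro.test().

```r
x <- CO2$uptake
d <- x - mean(x)             # deviations from the average
m2 <- mean(d^2)              # second central moment

skewness <- mean(d^3) / m2^1.5           # negative: tail left, positive: tail right
excess_kurtosis <- mean(d^4) / m2^2 - 3  # 0 for a normal distribution

skewness
excess_kurtosis

# H0: the variable comes from a normal distribution
sw <- shapiro.test(x)
sw$p.value
```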