Statistical Analysis with R (18 August 2015): Questionnaires, Variables organization, Descriptive analysis, Graphs, Statistical tests


1 Statistical Analysis with R: Questionnaires, Variables organization, Descriptive analysis, Graphs, Statistical tests

2 R: Statistical package. 4th generation programming language, extensible through functions and extensions. Environment for statistical computing and graphics; statistical and graphical techniques are extensible through packages. Competitors: SPSS, Matlab.

3 Variables (questionnaire overview). Scale or numeric variables: time, age, weight, distance in kilometers, length, number of children, GDP. Nominal or categorical variables: country of residence, sex, degree course. Ordinal variables: education level, rankings, Likert scale; in statistical analysis they are often treated as nominal or scale variables.

4 Missing values. NA means "not available"; you insert it manually whenever a datum is missing. NaN means "not a number"; it appears whenever a calculation cannot be done for that datum. Missing values are skipped in statistical analyses (typed commands may need an explicit option such as na.rm=TRUE). Any math operation with them gives NA or NaN again.
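A minimal sketch of how missing values behave in typed commands. The ages are hypothetical, and the variable names are chosen only for illustration.

```r
# Hypothetical ages with one missing answer entered as NA
age <- c(25, 31, NA, 40)

is.na(age)                            # TRUE only for the missing case
mean(age)                             # NA: base R does not skip NA by default
mean_age <- mean(age, na.rm = TRUE)   # skip NA explicitly

not_a_number <- 0 / 0                 # NaN: the calculation cannot be done
na_plus_one <- NA + 1                 # NA propagates through arithmetic
```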

5 Portable R. Either use Portable R: download it from my website already preconfigured, or download it from http://rportable.sourceforge.net, then uncompress it on your computer's hard disk or on a USB pendrive. Or install R on your computer: download it from www.r-project.org, install it, and try desperately to set the language to English.

6 Installing packages (learn to load an R package). To install R commander: Packages → Install Package(s)... → CRAN Mirror → Rcmdr, then wait for the installation of Rcmdr and additional packages. To load R commander: Packages → Load Package... → Rcmdr; to the warning on missing packages answer Yes, and answer to download them from CRAN.

7 Running R commander. Whenever you want to run it: Packages → Load Package... → Rcmdr, then File → Change Working directory. R commander has problems navigating through your directory tree, so choose an easy-to-find directory, such as your Desktop or the place where you keep your R exercises.

8 Files to save. R commander windows: the script contains the written instructions (R commander → File → Save Script as…); the output contains the output (R commander → File → Save Output as…, or paste it into a text file). The workspace contains the data structure: File → Save Workspace… or R commander → File → Save R workspace As…; reload it with File → Load Workspace….

9 data.frame or dataset. A database table suited for statistical analysis; case names are optional.

10 Building a new dataset. R commander → Data → New data set…. Insert all variables first; only afterwards insert the data and build a codebook, using numbers for nominal and ordinal variables. Convert nominal and ordinal variables to factor: R commander → Data → Manage variables in active data set → Convert numeric variables to factor. Convert ordinal variables to ordered: submit the 3 lines of code again with ordered instead of factor. Check the result with ls.str() and str(dataset).
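The conversion can also be typed by hand. This is a sketch under assumptions: the data set, its variable names and the codebook labels are hypothetical, and the lines are not the exact code that R commander generates.

```r
# Hypothetical questionnaire data, coded with numbers as in a codebook
dataset <- data.frame(
  age       = c(23, 35, 41, 29),
  sex       = c(1, 2, 2, 1),   # 1 = male, 2 = female
  education = c(2, 3, 1, 3)    # 1 = low, 2 = medium, 3 = high
)

# Nominal variable: numeric codes become a factor with labels
dataset$sex <- factor(dataset$sex, levels = c(1, 2),
                      labels = c("male", "female"))

# Ordinal variable: ordered() keeps the order of the levels
dataset$education <- ordered(dataset$education, levels = c(1, 2, 3),
                             labels = c("low", "medium", "high"))

str(dataset)   # shows the type of every variable
```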

11 Importing a dataset. R commander → Data: import from a package → Data in packages; import from a text file → Import Data → from text file, clipboard or URL…; import from Excel (hoping that it works) → Import Data → from Excel, Access or dBase data set…; export to a text file → Active data set → Export active data set….
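Export and import of a text file can also be done with typed commands. The sketch below uses the built-in CO2 data set and a temporary file; the tab separator is an assumption, not necessarily what the R commander dialog uses.

```r
# Export a data set to a tab-separated text file (a temporary file here)
path <- tempfile(fileext = ".txt")
write.table(CO2, file = path, sep = "\t", row.names = FALSE, quote = FALSE)

# Import it again; stringsAsFactors = TRUE turns text columns into factors
imported <- read.table(path, header = TRUE, sep = "\t",
                       stringsAsFactors = TRUE)

str(imported)
```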

12 Importing a dataset from SPSS. Written here just in case you ever need it; converting to a text file is better and easier! R commander → Data → Import Data → from SPSS data set…. Pay attention to value labels and factors. Date importing is wrong! Fix it with library(chron) and var <- as.chron(ISOdate(1582, 10, 14) + var).

13 Univariate descriptive analysis. Statistics → Summaries: for scale variables → Numerical summaries; for ordinal and nominal variables → Frequency distributions.
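The same summaries can be obtained with typed commands. This sketch uses the built-in CO2 data set and base R functions, which are not necessarily the ones R commander calls.

```r
# Scale variable: numerical summaries
summary(CO2$uptake)            # min, quartiles, median, mean, max
uptake_sd <- sd(CO2$uptake)    # standard deviation

# Nominal variable: frequency distribution, as counts and as percentages
counts <- table(CO2$Type)
percentages <- 100 * prop.table(counts)
```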

14 Graphs for one nominal variable: column plot.

15 Graphs for one nominal variable: pie chart, radar graph.

16 Graphs for one nominal variable: bar plot, line plot.

17 Graphs for one nominal variable: area plot, 3D variants.

18 Graphs for one nominal variable. R commander → Graphs → Color palette…, → Bar graph…, → Pie chart…. To change colors, add the option col=c(numbers of colors from the palette) to the text command, then select the text command and submit it.

19 Graphs for one scale variable: building a histogram by grouping into bins. [Figure: histogram, x axis from $1,000 to $5,000, counts from 0 to 12.]

20 Graphs for one scale variable: choosing the bins carefully. [Figure: histogram, x axis from $1,000 to $5,000, counts from 0 to 30.]

21 Graphs for one scale variable: boxplot. The median is the black line, the central 50% is in the rectangle, the central 90% is between the whiskers, and the extremes are shown as symbols. (R's default boxplot instead draws each whisker to the most extreme case within 1.5 times the interquartile range from the box.)

22 One scale variable case by case. Only for a scale variable with few cases; use any appropriate nominal variable graph.

23 Graphs for one scale variable. R commander → Graphs → Histogram…, → Boxplot…, → Index plot….

24 Bivariate analysis: nominal vs nominal. Statistics → Contingency table → Two-way table… → Percentages. Understand clearly when to use row percentages and when to use column percentages.
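To see the difference between row and column percentages, here is a small sketch on a hypothetical table of counts (the variables sex and internet are invented for illustration).

```r
# Hypothetical two-way table: sex by Internet use
tab <- as.table(matrix(c(30, 10, 20, 40), nrow = 2,
                       dimnames = list(sex = c("male", "female"),
                                       internet = c("yes", "no"))))

row_pct <- 100 * prop.table(tab, margin = 1)  # each row sums to 100
col_pct <- 100 * prop.table(tab, margin = 2)  # each column sums to 100
```

Row percentages answer "which share of males uses the Internet?", while column percentages answer "which share of Internet users is male?".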

25 Graphs for nominal vs nominal: side by side, stacked.

26 Graphs for nominal vs nominal: appropriate 3D variants.

27 Graphs for nominal vs nominal: a rare example of a useful stacked area chart.

28 Graphs for nominal vs nominal. No available graph in R, as far as I know. How to export your graphics into Word: right-click → copy as bitmap.

29 Bivariate analysis: scale vs nominal. Statistics → Summaries → Numerical summaries → Summarize by groups…, or → Table of statistics….
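A sketch of summaries by groups with typed commands, using the built-in CO2 data set; tapply() and aggregate() are base R alternatives, not necessarily what the menu produces.

```r
# Mean and standard deviation of a scale variable for each group
group_means <- tapply(CO2$uptake, CO2$Type, mean)
group_sds   <- tapply(CO2$uptake, CO2$Type, sd)

# The same idea as a table with one row per group
aggregate(uptake ~ Type, data = CO2, FUN = mean)
```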

30 Graphs for scale vs nominal: boxplots side by side, histograms one above the other.

31 Graphs for two variables. R commander → Graphs → Boxplot… → Plot by groups….

32 Bivariate analysis: scale vs scale. Statistics → Summaries → Correlation matrix: Pearson linear correlation or Spearman rank correlation.

33 Scale versus scale: scatterplot.

34 Scale versus scale: mathematical graph, regression line.

35 Graphs for two variables. R commander → Graphs → Scatterplot… (remove all the unnecessary options), → Line graph… (mathematical graph; the X variable must have values in order).

36 Multivariate analysis. Statistics: for three nominal variables → Contingency table → Multi-way table; for three scale variables → Summaries → Correlation matrix.

37 Graphs for three scale variables: surface plot.

38 Graphs for three scale variables: bubble chart (www.gapminder.org).

39 Graphs for two scale and one nominal variables. R commander → Graphs → Scatterplot… → Plot by groups….

40 Restrict data set. R commander → Data → Active Data Set → Subset active data set…: used to restrict the data set to some cases; use labels and not numbers for nominal variables! → Remove cases with missing data….
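Restricting a data set can be sketched with typed commands as well. The example uses the built-in CO2 and airquality data sets; the conditions are arbitrary illustrations.

```r
# Keep only some cases; nominal variables are compared by label
quebec <- subset(CO2, Type == "Quebec")
high_quebec <- subset(CO2, Type == "Quebec" & uptake > 30)

# Remove cases with missing data (airquality contains some NA values)
complete <- na.omit(airquality)
```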

41 Recode. Used to create or modify factor/ordered variables. R commander → Data → Manage variables in active data set → Recode variables…. Examples: "Bolzano"="here"; c("Munich","Hannover","Bonn")="Germany" (do not use "Munich","Hannover","Bonn"="Germany" as suggested by the help); else="Others". For numerical variables we may also use ranges such as 8:27="high", together with lo and hi. Massive recoding.

42 Binning. Used to group a scale variable into an ordered variable (but it produces a factor). R commander → Data → Manage variables in active data set → Bin numeric variable….
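A typed alternative for binning is base R's cut(). The break points and labels below are arbitrary assumptions chosen for the built-in CO2 data set.

```r
# Group a scale variable into three bins; the result is a plain factor
bins <- cut(CO2$uptake, breaks = c(0, 20, 35, 50),
            labels = c("low", "medium", "high"))

# Ask explicitly for an ordered factor
bins_ordered <- cut(CO2$uptake, breaks = c(0, 20, 35, 50),
                    labels = c("low", "medium", "high"),
                    ordered_result = TRUE)

table(bins)   # how many cases fall into each bin
```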

43 Compute. Used to create a new variable through math operations. R commander → Data → Manage variables in active data set → Compute new variable…. The general form is newvector <- with(dataset, formula), for example CO2$myname <- with(CO2, uptake*7-sqrt(conc)), which is identical to CO2$myname <- CO2$uptake*7-sqrt(CO2$conc).

44 Computing (line command). The instruction produced by Compute, CO2$myname <- with(CO2, uptake*7-sqrt(conc)), can easily be typed directly by you! Or you can type CO2$myname <- CO2$uptake*7-sqrt(CO2$conc). Variable names must be preceded by the dataset's name and $. The arrow <- means take things from the right and put them on the left.

45 Computing (line command). If you do not specify dataset$, the variable is created outside the dataset, with only one case (unless otherwise specified). Use print(variable) to look at it. Variable assignment: variable <- value or formula, or value or formula -> variable. Operators: + - * / **.

46 Computing (line command). A variable with many cases outside a dataset is called a "vector". vector <- c(list of items) creates it manually; vector[index] accesses a specific element; vector[from:to] accesses a sequence of elements.
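A short sketch of these vector commands; the values and names are hypothetical.

```r
# Create a vector manually
ages <- c(25, 26, 27, 28, 29)

second <- ages[2]        # a specific element
middle <- ages[2:4]      # a sequence of elements

# Math operations work on every element at once
doubled <- ages * 2
print(doubled)
```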

47 Statistical tests. Example: we want to study the age of Internet users, checking whether the average age is 35 years or not. The only information we have is the observations on a sample of 100 users, which are: 25; 26; 27; 28; 29; 30; 31; 30; 33; 34; 35; 36; 37; 38; 30; 30; 41; 42; 43; 44; 45; 46; 47; 48; 49; 50; 51; 52; 20; 54; 55; 56; 57; 20; 20; 20; 30; 31; 32; 33; 34; 35; 36; 37; 38; 39; 40; 41; 42; 43; 44; 45; 46; 47; 48; 49; 50; 20; 21; 22; 23; 24; 25; 26; 27; 28; 29; 30; 31; 32; 33; 34; 35; 36; 37; 38; 39; 40; 35; 36; 37; 35; 36; 37; 35; 36; 37; 35; 36; 37; 35; 36; 37; 35; 36; 37; 35; 36; 37; 35.

48 Statistical tests. Test's hypotheses: H0: the average age in the population is 35. H1: the average age in the population is not 35. We calculate the average age on the sample, 36.2, which is an estimate of the population's average age. We compare this result with the 35 of the H0 hypothesis and find a difference of +1.2. We ask ourselves whether this difference is large, implying that the population's average age is not 35 and thus H0 must be rejected, or small, so that it can be caused by random fluctuation in the sample choice and therefore H0 must be accepted.

49 Statistical tests. In order to answer, the test provides us with a significance, also called p-value: the probability, if H0 is true, of getting a sample at least as far from H0 as the one we observed. In this example the significance is 16%. If the significance is large, we accept H0; this implies that we do not know. If the significance is small, we reject H0; this implies that we are almost sure that H0 is false.
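The age example can be reproduced with a typed command. This sketch assumes the test behind the example is a two-sided single-sample Student's t test, which gives a p-value of about 16% on these data.

```r
# The 100 observed ages from the example
age <- c(25, 26, 27, 28, 29, 30, 31, 30, 33, 34,
         35, 36, 37, 38, 30, 30, 41, 42, 43, 44,
         45, 46, 47, 48, 49, 50, 51, 52, 20, 54,
         55, 56, 57, 20, 20, 20, 30, 31, 32, 33,
         34, 35, 36, 37, 38, 39, 40, 41, 42, 43,
         44, 45, 46, 47, 48, 49, 50, 20, 21, 22,
         23, 24, 25, 26, 27, 28, 29, 30, 31, 32,
         33, 34, 35, 36, 37, 38, 39, 40, 35, 36,
         37, 35, 36, 37, 35, 36, 37, 35, 36, 37,
         35, 36, 37, 35, 36, 37, 35, 36, 37, 35)

sample_mean <- mean(age)           # 36.2, the estimate from the sample

# H0: the average age in the population is 35
result <- t.test(age, mu = 35)
significance <- result$p.value     # the significance (p-value)
print(result)
```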

50 Typical univariate analysis techniques
Variables | Numerical description | Graphical description | Parametric test | Non-parametric test
nominal | Frequencies (one-dimensional contingency table) | Column plot, pie chart | --- | Chi-square for a one-dimensional contingency table
scale | Descriptive statistics | Histogram, boxplot | Student's t for one variable | Sign test

51 Tests for one scale variable. Student's t test for one variable: H0: average on the population = m; Statistics → Means → Single-sample t-test. Sign test: H0: median on the population = m; not available in R commander.
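Since the sign test is not in the R commander menus, one possible workaround is to build it by hand from a binomial test: under H0, a case is as likely to fall above m as below it. The sample and the value of m are hypothetical.

```r
# Hypothetical sample and hypothesised median
x <- c(31, 36, 38, 40, 29, 42, 37, 35, 39, 41)
m <- 35

above <- sum(x > m)   # cases above the hypothesised median
below <- sum(x < m)   # cases below it (cases equal to m are dropped)

# Under H0 each remaining case is above m with probability 0.5
sign_test <- binom.test(above, above + below, p = 0.5)
sign_test$p.value
```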

52 Tests for one nominal variable. Chi-square test for a one-dimensional contingency table: H0: the classification follows a predetermined distribution. Statistics → Summaries → Frequency distributions… → Chi-square.
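A typed sketch of the same test with chisq.test(); the observed counts and the predetermined distribution are hypothetical.

```r
# Hypothetical observed counts for a nominal variable with three categories
observed <- c(A = 30, B = 50, C = 20)

# Predetermined distribution under H0 (must sum to 1)
expected_p <- c(0.25, 0.50, 0.25)

gof <- chisq.test(observed, p = expected_p)
gof$expected   # expected frequencies under H0
gof$p.value
```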

53 Typical bivariate analysis techniques
Variables | Numerical description | Graphical description | Parametric test | Non-parametric test
nominal vs nominal | 2D contingency table | Clustered or stacked or 3D column plot | --- | Chi-square for a 2D contingency table
binary nominal vs scale | Descriptive statistics by groups | Boxplots or histograms by groups | Student's t for two populations | Mann-Whitney
non-binary nominal vs scale | Descriptive statistics by groups | Boxplots or histograms by groups | One-way analysis of variance (ANOVA) | Kruskal-Wallis
scale vs scale | Pearson's or Spearman's correlation | Scatterplot | Pearson's correlation; Student's t for paired data | Spearman's correlation; Wilcoxon signed rank test

54 Tests for two nominal variables. Chi-square test for a two-dimensional contingency table: H0: the classifications of the two variables are independent. Statistics → Contingency table → Two-way table… → Statistics → Chi-square test of independence. Warning: you should have no expected frequency less than 5.
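A sketch of the test of independence on a hypothetical two-way table, including a check of the warning about expected frequencies. Note that for 2x2 tables chisq.test() applies a continuity correction by default.

```r
# Hypothetical two-way table: sex by Internet use
tab <- as.table(matrix(c(30, 10, 20, 40), nrow = 2,
                       dimnames = list(sex = c("male", "female"),
                                       internet = c("yes", "no"))))

indep <- chisq.test(tab)

indep$expected                        # expected frequencies under H0
all_large <- all(indep$expected >= 5) # the rule of thumb from the warning
indep$p.value
```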

55 Test for binary nominal vs scale. Student's t test for two populations: H0: average of group 1 = average of group 2. Statistics → Means → Independent samples t-test. Warning: the scale variable should be normally distributed in both groups.
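A typed sketch with the built-in CO2 data set, where Type is a binary nominal variable. By default t.test() does not assume equal variances in the two groups.

```r
# H0: average uptake is the same in the two groups defined by Type
two_groups <- t.test(uptake ~ Type, data = CO2)

two_groups$estimate   # the two group averages
two_groups$p.value
```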

56 Non-parametric test for binary nominal vs scale. Mann-Whitney (Wilcoxon rank-sum) test. It tests the ranks: H0: position of group 1 = position of group 2. Statistics → Nonparametric tests → Two-sample Wilcoxon test.
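The rank-based version on the same built-in data, as a sketch. Setting exact = FALSE is a choice made here to use the normal approximation, because these data contain ties.

```r
# H0: position of uptake is the same in the two groups defined by Type
rank_sum <- wilcox.test(uptake ~ Type, data = CO2, exact = FALSE)

rank_sum$p.value
```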

57 Test for non-binary nominal vs scale. ANOVA (ANalysis Of VAriance): H0: the average is the same for all groups. Statistics → Means → One-way ANOVA. The test rejects even if just one population's average is different from the others. Warning: the scale variable should be normally distributed in each group.
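A sketch of one-way ANOVA with typed commands, using the built-in PlantGrowth data set (a scale variable weight and a nominal variable group with three levels).

```r
# H0: average weight is the same for all groups
fit <- aov(weight ~ group, data = PlantGrowth)
anova_table <- summary(fit)
print(anova_table)

# Extract the significance (p-value) of the group effect
anova_p <- anova_table[[1]][["Pr(>F)"]][1]
```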

58 Non-parametric test for non-binary nominal vs scale. Kruskal-Wallis test. It tests the ranks: H0: position is the same for all groups. Statistics → Nonparametric tests → Kruskal-Wallis test.
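The Kruskal-Wallis test as a typed command, sketched on the built-in PlantGrowth data set.

```r
# H0: position of weight is the same for all groups
kw <- kruskal.test(weight ~ group, data = PlantGrowth)

kw$p.value
```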

59 Tests for two scale variables. Pearson's and Spearman's correlation tests: H0: correlation = 0. Statistics → Summaries → Correlation test.
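Both correlation tests can be sketched with cor.test() on the built-in CO2 data set. For Spearman, exact = FALSE is a choice made here because the data contain ties.

```r
# H0: correlation between conc and uptake is 0
pearson_test  <- cor.test(CO2$conc, CO2$uptake, method = "pearson")
spearman_test <- cor.test(CO2$conc, CO2$uptake, method = "spearman",
                          exact = FALSE)

pearson_test$estimate    # the correlation itself
pearson_test$p.value
spearman_test$p.value
```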

60 Tests for the difference of two scale variables. When using tests on the difference of variables: Student's t test for paired data, H0: average (var 1 - var 2) = 0. Statistics → Means → Paired t test. Warning: the distribution of the difference of the scale variables must be normal.
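A sketch of the paired t test using the built-in sleep data set, in which the same 10 people were measured under two treatments; var1 and var2 are names chosen for illustration.

```r
# Two paired measurements on the same cases
var1 <- sleep$extra[sleep$group == 1]
var2 <- sleep$extra[sleep$group == 2]

# H0: average (var1 - var2) = 0
paired <- t.test(var1, var2, paired = TRUE)

paired$estimate   # average of the differences
paired$p.value
```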

61 Nonparametric test for two paired scale variables. Wilcoxon signed-rank test. It tests the ranks: H0: var 1 - var 2 is positioned around 0. Statistics → Nonparametric tests → Paired-samples Wilcoxon test.
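The signed-rank test on paired data, sketched with the built-in sleep data set. The names var1 and var2 are illustrative, and exact = FALSE is chosen here because the differences contain ties and a zero.

```r
# Two paired measurements on the same cases
var1 <- sleep$extra[sleep$group == 1]
var2 <- sleep$extra[sleep$group == 2]

# H0: var1 - var2 is positioned around 0
signed_rank <- wilcox.test(var1, var2, paired = TRUE, exact = FALSE)

signed_rank$p.value
```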

62 Is a variable normally distributed? Histogram with normal curve: find out the average a and the standard deviation s; build a histogram with appropriate binning; close it, add prob=TRUE to the command and rebuild it; do not close it this time; then submit curve(dnorm(x, mean=a, sd=s), col="blue", lwd=2, add=TRUE, yaxt="n"). Q-Q plot (the data must lie on the line): Graphs → Quantile-comparison Plot.
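The whole procedure can be typed in one go. This sketch uses the uptake variable of the built-in CO2 data set, and base R's qqnorm() and qqline() as a stand-in for the R commander Q-Q plot.

```r
uptake <- CO2$uptake
a <- mean(uptake)   # average
s <- sd(uptake)     # standard deviation

# Histogram on the density scale, so that the normal curve is comparable
h <- hist(uptake, prob = TRUE, main = "Uptake with normal curve")

# Add the normal curve with the same average and standard deviation
curve(dnorm(x, mean = a, sd = s), col = "blue", lwd = 2, add = TRUE)

# Q-Q plot: for a normal variable the points lie close to the line
qqnorm(uptake)
qqline(uptake)
```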

63 Is a variable normally distributed? Skewness: negative means a tail on the left, positive a tail on the right. Excess kurtosis: negative means flat, 0 normal, positive too pointy. Both are under Statistics → Summaries → Numerical summaries → Options. Shapiro-Wilk normality test: H0: the variable comes from a normal distribution; Statistics → Summaries → Shapiro-Wilk test of normality.
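The Shapiro-Wilk test is also available as a typed command; a sketch on the built-in CO2 data set:

```r
# H0: uptake comes from a normal distribution
normality <- shapiro.test(CO2$uptake)

normality$statistic   # W, close to 1 for normal-looking data
normality$p.value     # a small value means H0 is rejected
```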

