Basic Data Analysis Using R Xiao He 1. AGENDA 1.Data cleaning (e.g., missing values) 2.Descriptive statistics 3.t-tests 4.ANOVA 5.Linear regression Data.

1 Basic Data Analysis Using R
Xiao He

2 AGENDA
1. Data cleaning (e.g., missing values)
2. Descriptive statistics
3. t-tests
4. ANOVA
5. Linear regression
Data visualization

4 1. DATA CLEANING
NA and NaN:
1. NA (Not Available): missing values.
   a) Represented as NA, or as <NA> (e.g., when a factor is printed).
2. NaN (Not a Number): the result of an arithmetic operation that is undefined, e.g., in R, 0/0 gives NaN.
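The difference between the two markers can be checked directly. The following is a minimal sketch on a hypothetical vector (the values are made up for illustration and are not from the handout):

```r
# Hypothetical vector with one missing value (NA) and one undefined result (NaN).
x <- c(1, NA, 0/0, 4)

na_flags  <- is.na(x)   # TRUE for NA and also for NaN
nan_flags <- is.nan(x)  # TRUE only for NaN

print(na_flags)
print(nan_flags)

# Missing values in a factor are printed as <NA>.
f <- factor(c("a", NA, "b"))
print(f)
```

Note that is.na() flags both kinds of value, while is.nan() flags only NaN.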

5 1. DATA CLEANING
NA and NaN:
3. Dealing with NA and NaN
   a) Check which cases (rows) do NOT have NA or NaN: complete.cases()
      # Returns a logical vector (TRUE, FALSE). TRUE means the row has no
      # missing value, and FALSE means it has at least one missing value.
      # Summing the result gives the number of complete rows.
   b) Remove cases with NA or NaN: na.omit()
      # Returns a data object with NA and NaN removed. For data frames,
      # an entire row is removed if it contains NA or NaN.
   c) More sophisticated ways of dealing with NA (covered by Addie's workshop in two weeks)
Ex1.1: (refer to handout)
Ex1.2: (refer to handout)
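As a rough illustration of steps a) and b), here is a small sketch on a hypothetical data frame (the column names and values are assumptions, not the handout data):

```r
# Hypothetical data frame: row 2 contains NA, row 3 contains NaN.
df <- data.frame(
  age    = c(18, NA, 21, 20),
  height = c(170, 165, 0/0, 180)
)

ok <- complete.cases(df)   # logical vector, one entry per row
n_complete <- sum(ok)      # number of rows without NA or NaN

clean <- na.omit(df)       # drops every row that contains NA or NaN

print(ok)
print(n_complete)
print(clean)
```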

7 2. DESCRIPTIVE STATISTICS
1. Help you diagnose potential problems with data entry or collection:
   a. Were any values entered incorrectly? E.g., in a survey study using a 1–5 Likert scale, you check the range of your data and find that the maximum value in your dataset is 7.
   b. Any strange responses? E.g., did your participants give you any odd responses?
2. Help you get a sense of how your data are distributed:
   a. Extreme values (outliers)
   b. Non-normality
   c. Skewness
   d. Unequal variance

8 2. DESCRIPTIVE STATISTICS
Compute individual descriptive statistics
1. Location:
   - mean(x, trim)
   - median(x)
2. Dispersion:
   - var(x)
   - sd(x)
   - range(x); min(x); max(x)
   - IQR(x)
Ex2.1: (refer to handout)
Ex2.2: (refer to handout)
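The functions listed above can be tried on any numeric vector. This sketch uses a hypothetical vector with one extreme value, so that the effect of the trim argument is visible:

```r
# Hypothetical sample with one extreme value (40).
x <- c(2, 4, 4, 5, 7, 9, 40)

# Location
m_plain   <- mean(x)              # ordinary mean, pulled up by the outlier
m_trimmed <- mean(x, trim = 0.2)  # drops 20% of the observations from each end first
med       <- median(x)

# Dispersion
v   <- var(x)
s   <- sd(x)      # square root of the variance
rng <- range(x)   # c(min(x), max(x))
iqr <- IQR(x)     # third quartile minus first quartile

print(c(mean = m_plain, trimmed = m_trimmed, median = med))
print(c(var = v, sd = s, IQR = iqr))
print(rng)
```

The trimmed mean and the median are much less affected by the extreme value than the ordinary mean.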

9 2. DESCRIPTIVE STATISTICS
Compute a set of descriptive statistics
1. Use the function summary().
2. Use the function describe() in the package `psych`.
Ex3.1: (refer to handout)
Ex3.2: (refer to handout)
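A brief sketch of both options on a hypothetical data frame follows. The psych call is guarded because that package may not be installed, which is an assumption about the reader's setup:

```r
# Hypothetical data frame (values made up for illustration).
df <- data.frame(
  age    = c(18, 19, 21, 20, 22),
  height = c(170, 165, 175, 180, 160)
)

# summary() reports min, quartiles, median, mean, and max for each column.
print(summary(df))
s_age <- summary(df$age)

# describe() from psych adds n, sd, skew, kurtosis, and more.
if (requireNamespace("psych", quietly = TRUE)) {
  print(psych::describe(df))
}
```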

12 3. T TESTS
t.test()
t.test(x, y = NULL, alternative = c("two.sided", "less", "greater"), mu = 0, paired = FALSE, var.equal = FALSE, conf.level = 0.95)
1. One-sample t-test:
Suppose someone hypothesized that the mean undergrad age was 19.75. Let's test whether the mean age was significantly different from 19.75.
H0: mu_0 = 19.75
H1: mu_0 ≠ 19.75
Ex4.1: (refer to handout)
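A one-sample test of this kind might look as follows. The ages below are hypothetical stand-ins for the handout data:

```r
# Hypothetical ages (not the workshop dataset).
age <- c(18, 19, 20, 21, 22, 19, 20, 23, 18, 20)

# Two-sided one-sample t-test against the hypothesized mean of 19.75.
res <- t.test(age, mu = 19.75, alternative = "two.sided", conf.level = 0.95)
print(res)

# Individual pieces of the result can be extracted by name.
t_value <- unname(res$statistic)
p_value <- res$p.value
```

With these made-up values the sample mean is 20, and the test does not reject H0 at the 0.05 level.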

13 3. T TESTS
t.test()
t.test(x, y = NULL, alternative = c("two.sided", "less", "greater"), mu = 0, paired = FALSE, var.equal = FALSE, conf.level = 0.95)
2. Independent t-test:
Let's test whether the mean height of female students is significantly different from the mean height of male students.
H0: mu_Female = mu_Male
H1: mu_Female ≠ mu_Male
Ex4.2: (refer to handout)
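With the x/y interface, each group is passed as its own vector. This sketch assumes two hypothetical height vectors:

```r
# Hypothetical heights in cm (made up for illustration).
height_female <- c(160, 162, 165, 158, 170, 163)
height_male   <- c(175, 180, 172, 178, 169, 183)

# var.equal = FALSE (the default) gives Welch's t-test,
# which does not assume equal variances in the two groups.
res <- t.test(height_female, height_male, var.equal = FALSE)
print(res)
```

Setting var.equal = TRUE would instead run the classical pooled-variance test.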

14 3. T TESTS
t.test()
t.test(x, y = NULL, alternative = c("two.sided", "less", "greater"), mu = 0, paired = FALSE, var.equal = FALSE, conf.level = 0.95)
t.test(formula, data, …)
2. Independent t-test:
Let's test whether the mean height of female students is significantly different from the mean height of male students.
H0: mu_Female = mu_Male
H1: mu_Female ≠ mu_Male
formula = Y ~ X
   Y: outcome variable (e.g., height)
   X: 2-level grouping variable (e.g., Sex)
Ex4.3: (refer to handout)
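The formula interface should give the same result as passing the two groups separately. A small sketch on a hypothetical data frame (the column names Height and Sex are assumptions modelled on the slide):

```r
# Hypothetical long-format data: one row per student.
students <- data.frame(
  Height = c(160, 162, 165, 158, 170, 163, 175, 180, 172, 178, 169, 183),
  Sex    = factor(rep(c("Female", "Male"), each = 6))
)

# Formula interface: outcome ~ 2-level grouping variable.
res_formula <- t.test(Height ~ Sex, data = students)

# Same test with the x/y interface, for comparison.
res_vectors <- t.test(
  students$Height[students$Sex == "Female"],
  students$Height[students$Sex == "Male"]
)

print(res_formula)
```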

15 3. T TESTS
t.test()
t.test(x, y = NULL, alternative = c("two.sided", "less", "greater"), mu = 0, paired = FALSE, var.equal = FALSE, conf.level = 0.95)
t.test(formula, data, …)
3. Paired t-test:
Let's test whether the mean writing hand span (Wr.Hnd) and the mean non-writing hand span (NW.Hnd) differ significantly.
H0: mu_Wr.Hnd = mu_NW.Hnd
H1: mu_Wr.Hnd ≠ mu_NW.Hnd
Ex4.4: (refer to handout)
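A paired test treats the two measurements as belonging to the same person. This sketch uses hypothetical hand spans and also shows that the paired test matches a one-sample test on the differences:

```r
# Hypothetical hand spans in cm; element i of each vector is the same person.
wr_hnd <- c(18.5, 19.5, 18.0, 18.8, 20.0, 17.5)
nw_hnd <- c(18.0, 20.5, 17.7, 18.9, 20.0, 17.0)

# Paired t-test via the x/y interface.
res_paired <- t.test(wr_hnd, nw_hnd, paired = TRUE)

# Equivalent one-sample t-test on the within-person differences.
res_diff <- t.test(wr_hnd - nw_hnd, mu = 0)

print(res_paired)
```

The x/y interface is used here because recent R versions no longer accept paired = TRUE in the formula method.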

17 4. ANOVA
aov()
aov(formula, data)
1. One-way ANOVA:
Suppose we are interested in whether the mean pulse rates differ among people of different exercise statuses.
H0: mu_None = mu_Some = mu_Freq
H1: Not all group means are equal.
Ex5.1: (refer to handout)
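A one-way ANOVA along these lines could be sketched as follows. The data frame, and its column names Pulse and Exer, are hypothetical stand-ins for the handout data:

```r
# Hypothetical pulse rates for three exercise groups, four people each.
pulse_data <- data.frame(
  Pulse = c(80, 76, 84, 78, 72, 74, 70, 76, 64, 68, 66, 62),
  Exer  = factor(rep(c("None", "Some", "Freq"), each = 4),
                 levels = c("None", "Some", "Freq"))
)

# Fit the model and print the ANOVA table.
fit <- aov(Pulse ~ Exer, data = pulse_data)
tab <- summary(fit)
print(tab)

# The F test's p-value sits in the first row of the table.
p_value <- tab[[1]][["Pr(>F)"]][1]
```

The grouping variable has 3 levels, so the table shows 2 degrees of freedom for Exer and 9 for the residuals.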

18 We will import a new dataset and use it for the next exercise.
hsb2 <- read.table("http://www.ats.ucla.edu/stat/r/faq/hsb2.csv", sep=",", header=TRUE)

19 4. ANOVA
aov()
aov(formula, data)
2. Two-way ANOVA:
The formula/model for factorial ANOVA (taking a two-way interaction as an example) is specified as follows:
   Y ~ X1 * X2
which is equivalent to
   Y ~ X1 + X2 + X1:X2
Suppose we are interested in the main effects of race and schtyp (school type), as well as the interaction effect between the two variables, on read. The formula/model can be specified as:
   read ~ as.factor(race) * as.factor(schtyp)
Why do we do this (wrap the variables in as.factor())?
Ex5.2: (refer to handout)
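One likely answer to the as.factor() question is that grouping variables stored as numeric codes are otherwise treated as continuous predictors. The sketch below uses simulated data with the same column names as the slide (not the real hsb2 file) and compares the degrees of freedom of the two fits:

```r
set.seed(1)  # simulated data, for illustration only

# Hypothetical data: race coded 1-4 and schtyp coded 1-2, both stored as numbers.
scores <- data.frame(
  race   = rep(1:4, times = 10),
  schtyp = rep(1:2, each = 20),
  read   = round(rnorm(40, mean = 52, sd = 10))
)

# Codes treated as categories: 3 df for race, 1 for schtyp, 3 for the interaction.
fit_factor <- aov(read ~ as.factor(race) * as.factor(schtyp), data = scores)

# Codes treated as numbers: each term is a single slope (1 df each).
fit_numeric <- aov(read ~ race * schtyp, data = scores)

df_factor  <- summary(fit_factor)[[1]]$Df
df_numeric <- summary(fit_numeric)[[1]]$Df

print(summary(fit_factor))
print(df_factor)
print(df_numeric)
```

Without as.factor(), the model fitted is a regression on the numeric codes rather than a factorial ANOVA.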

22 Let's import a new dataset.
expenditure <- read.table("http://dornsife.usc.edu/assets/sites/210/docs/GC3/educationExpenditure.txt", sep=",", header=TRUE)

23 5. LINEAR REGRESSION
lm()
lm(formula, data)
   education: Per-capita education expenditures, dollars.
   income: Per-capita income, dollars.
   young: Proportion under 18, per 1000.
   urban: Proportion urban, per 1000.
1. Simple linear regression
formula = Y ~ X
Suppose we are interested in testing the regression of per-capita education expenditure on per-capita income. The model should be specified as:
formula = education ~ income
Ex6.1: (refer to handout)
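A simple regression with lm() might be run as in the sketch below. The data frame expenditure_demo is a hypothetical stand-in with made-up values, so that the example does not depend on the download:

```r
# Hypothetical stand-in for the expenditure data (values made up).
expenditure_demo <- data.frame(
  income    = c(2800, 3000, 3200, 3500, 3800, 4100),
  education = c(150, 165, 170, 190, 205, 225)
)

# Regress education expenditure on income.
fit_simple <- lm(education ~ income, data = expenditure_demo)

# summary() reports coefficients, standard errors, t tests, and R-squared.
print(summary(fit_simple))

# coef() extracts the intercept and the slope.
b <- coef(fit_simple)
print(b)
```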

24 5. LINEAR REGRESSION
lm()
lm(formula, data)
   education: Per-capita education expenditures, dollars.
   income: Per-capita income, dollars.
   young: Proportion under 18, per 1000.
   urban: Proportion urban, per 1000.
2. Multiple regression
formula = Y ~ X1 + X2 + … + Xp
Suppose we are interested in testing the regression of education on income, young, and urban. The model should be specified as:
formula = education ~ income + young + urban
Ex6.2: (refer to handout)
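Adding predictors only changes the right-hand side of the formula. This sketch again uses a hypothetical stand-in data frame with made-up values:

```r
# Hypothetical stand-in for the expenditure data (values made up).
expenditure_demo <- data.frame(
  education = c(150, 165, 170, 190, 205, 225, 180, 160),
  income    = c(2800, 3000, 3200, 3500, 3800, 4100, 3300, 2900),
  young     = c(350, 360, 340, 370, 355, 365, 345, 352),
  urban     = c(500, 620, 580, 700, 660, 800, 560, 540)
)

# Regress education on all three predictors at once.
fit_multi <- lm(education ~ income + young + urban, data = expenditure_demo)
print(summary(fit_multi))

# One coefficient per predictor, plus the intercept.
b_multi <- coef(fit_multi)
print(b_multi)
```

Each coefficient is interpreted as the expected change in education for a one-unit change in that predictor, holding the other predictors fixed.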

25 Thanks!

