Published by Sergio Frome. Modified over 9 years ago.
1
CA200 (based on the book by Prof. Jane M. Horgan)
3. Basics of R – cont.: Summarising Statistical Data, Graphical Displays
4. Basic distributions with R
2
Basics
6+7*3/2 #general expression
[1] 16.5
x <- 1:4 #integers are assigned to the vector x
x #print x
[1] 1 2 3 4
x2 <- x**2 #square the elements, or x2 <- x^2
x2
[1]  1  4  9 16
X <- ... #case sensitive! X and x are different objects
prod1 <- X*x
prod1
3
Getting Help
click the Help button on the toolbar
help()
help.start()
demo()
?read.table
help.search("data.entry")
apropos("boxplot") #e.g. "boxplot", "boxplot.default", "boxplot.stats"
4
Statistics: Measures of Central Tendency
Typical or central points:
Mean: Sum of all values divided by the number of cases
Median: Middle value. 50% of data below and 50% above
Mode: Most commonly occurring value, the value with the highest frequency
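The three measures can be tried on a small vector. This is only a sketch: the marks below are hypothetical values chosen for illustration, and the table() approach to the mode is one common workaround, since base R has no function for the statistical mode (mode() reports the storage type of an object instead).

```r
# Hypothetical marks, chosen only for illustration
marks <- c(55, 60, 60, 72, 85, 60, 90)

mean_marks   <- mean(marks)    # sum of the values divided by their number
median_marks <- median(marks)  # middle value of the sorted data

# Tabulate the values and pick the one with the highest frequency
freq <- table(marks)
mode_marks <- as.numeric(names(freq)[which.max(freq)])

mean_marks
median_marks
mode_marks
```

If several values tie for the highest frequency, which.max() returns only the first of them.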
5
Statistics: Measures of Dispersion
Spread or variation in the data
Standard Deviation (σ): the square root of the average squared deviation from the mean
- measures how the data values differ from the mean
- a small standard deviation implies most values are near the average
- a large standard deviation indicates that values are widely spread above and below the average
Note: sd() in R computes the sample standard deviation, which divides by n - 1 rather than n.
6
Statistics: Measures of Dispersion
Spread or variation in the data
Range: lowest and highest values
Quartiles: divide the ordered data into quarters; the 2nd quartile is the median
Interquartile Range: the distance between the 1st and 3rd quartiles, which spans the middle 50% of the data
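A minimal sketch of these measures in R, assuming a small hypothetical sample (the values of x are made up for illustration). range(), quantile() and IQR() are standard base R functions.

```r
# Hypothetical sample, chosen only for illustration
x <- c(2, 4, 4, 5, 7, 9, 10, 12, 15)

r   <- range(x)     # lowest and highest value
q   <- quantile(x)  # 0%, 25%, 50%, 75% and 100% points
iqr <- IQR(x)       # 3rd quartile minus 1st quartile

r
q
iqr
```

The 50% entry of quantile(x) is the median, so it agrees with median(x).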
7
Data Entry
Entering data from the screen to a vector
Example: 1.1
downtime <- c(0, 1, 2, 12, 12, 14, 18, 21, 21, 23, 24, 25, 28, 29, 30, 30, 30, 33, 36, 44, 45, 47, 51)
mean(downtime)
[1] 25.04348
median(downtime)
[1] 25
range(downtime)
[1] 0 51
sd(downtime)
[1] 14.27164
8
Data Entry – cont.
Entering data from a file to a data frame
Example 1.2: Examination results: results.txt
gender arch1 prog1 arch2 prog2
(one row per student: gender, m or f, followed by the four marks, with NA for a missing mark)
and so on
9
Data Entry – cont.
NA indicates a missing value. No mark for arch1 and prog1 in the second record.
results <- read.table("C:\\results.txt", header = T) # download the file to the desired location first
results$arch1[5]
[1] 89
Alternatively:
attach(results) # allows you to access the columns without the prefix results$
names(results)
arch1[5]
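To try this without the file, read.table() can read from a character string through its text argument. The sketch below assumes three hypothetical records laid out like results.txt; the marks are made up and are not the real examination data.

```r
# Hypothetical file contents; the real results.txt is not reproduced here
lines <- "gender arch1 prog1 arch2 prog2
m 99 98 83 94
m NA NA 86 77
f 89 92 75 68"

results_demo <- read.table(text = lines, header = TRUE)

results_demo$arch1[3]         # third mark in the arch1 column
is.na(results_demo$arch1)     # TRUE where a mark is missing
with(results_demo, arch1[3])  # column access without attach()
```

with() gives the same prefix-free access as attach() but only for one expression, so nothing needs to be detached afterwards.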
10
Data Entry – Missing values
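mean(arch1)
[1] NA #no result because some marks are missing
Use na.rm = T (not available, remove) or na.rm = TRUE:
mean(arch1, na.rm = T)
mean(prog1, na.rm = T)
[1] 84.25
mean(arch2, na.rm = T)
mean(prog2, na.rm = T)
To get the means of all subjects at once, apply mean to each numeric column (in current versions of R, mean() no longer accepts a whole data frame such as results):
sapply(results[2:5], mean, na.rm = T)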
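A small sketch of how missing values affect the mean, assuming a hypothetical data frame named marks with made-up values. sapply() and colMeans() are both base R ways to get one mean per column.

```r
# Hypothetical marks with one missing value in each column
marks <- data.frame(arch1 = c(70, NA, 90), prog1 = c(60, 80, NA))

plain_mean <- mean(marks$arch1)                  # NA, because one mark is missing
col_means  <- sapply(marks, mean, na.rm = TRUE)  # mean of each column, NAs removed
col_means2 <- colMeans(marks, na.rm = TRUE)      # same result for all-numeric columns
na_counts  <- colSums(is.na(marks))              # number of missing marks per column

col_means
na_counts
```

Counting the NAs first is a quick check on how many marks each mean is really based on.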
11
Data Entry – cont.
Use "read.table" if data in the text file are separated by spaces
Use "read.csv" when data are separated by commas
Use "read.csv2" when data are separated by semicolons (with a comma as the decimal mark)
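The difference between read.csv and read.csv2 can be seen on two tiny inputs. This sketch uses hypothetical data supplied through the text argument instead of real files.

```r
# The same hypothetical records in the two conventions
csv_text  <- "gender,arch1\nm,70.5\nf,82"   # comma separator, decimal point
csv2_text <- "gender;arch1\nm;70,5\nf;82"   # semicolon separator, decimal comma

a <- read.csv(text = csv_text)
b <- read.csv2(text = csv2_text)

a$arch1
b$arch1
```

Both calls give the same numeric column; only the separator and decimal conventions of the input differ.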
12
Data Entry – cont.
Entering data into a spreadsheet:
newdata <- data.frame() #creates a new, empty data frame called newdata
fix(newdata) #opens newdata in a spreadsheet-style editor so that data can be added
13
Summary Statistics
Example 1.1: Downtime:
summary(downtime)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   0.00   16.00   25.00   25.04   31.50   51.00
Example 1.2: Examination Results:
summary(results)
 gender     arch1            prog1            arch2            prog2
 f: 4    Min.   : 3.00    Min.   :65.00    Min.   :56.00    Min.   :63.00
 m:22    1st Qu.:         1st Qu.:         1st Qu.:         1st Qu.:77.50
         Median :         Median :82.50    Median :85.50    Median :84.00
         Mean   :         Mean   :84.25    Mean   :81.15    Mean   :
         3rd Qu.:         3rd Qu.:         3rd Qu.:         3rd Qu.:92.50
         Max.   :         Max.   :98.00    Max.   :96.00    Max.   :97.00
         NA's   : 2.00    NA's   : 2.00
14
Summary Statistics - cont.
Example 1.2: Examination Results:
For a separate analysis of one subject use:
mean(results$arch1, na.rm=T) # hint: after attach(results), arch1 can be used directly
summary(arch1, na.rm=T)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
15
Programming in R
Example 1.3: Write a program to calculate the mean of downtime
Formula for the mean: mean = (x1 + x2 + ... + xn) / n
x <- sum(downtime) # sum of elements in downtime
n <- length(downtime) # number of elements in the vector
mean_downtime <- x/n
or
mean_downtime <- sum(downtime) / length(downtime)
16
Programming in R – cont.
Example 1.4: Write a program to calculate the standard deviation of downtime
#hint - use sqrt function
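One possible sketch for Example 1.4, following the same pattern as the mean program. The variable names are assumptions made for illustration. The sample version divides by n - 1, which is what the built-in sd() does; the population version divides by n.

```r
downtime <- c(0, 1, 2, 12, 12, 14, 18, 21, 21, 23, 24, 25, 28, 29,
              30, 30, 30, 33, 36, 44, 45, 47, 51)

n <- length(downtime)                          # number of elements
mean_downtime <- sum(downtime) / n             # mean, as in Example 1.3
sum_sq <- sum((downtime - mean_downtime)^2)    # sum of squared deviations

sd_downtime   <- sqrt(sum_sq / (n - 1))        # sample standard deviation
sd_population <- sqrt(sum_sq / n)              # population version, divides by n

sd_downtime
sd(downtime)                                   # built-in function, for comparison
```

The two versions differ only slightly for 23 observations, but the gap grows as the sample gets smaller.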
17
Graphical displays - Boxplots
Boxplot – a graphical summary based on the median, quartiles and extreme values
boxplot(downtime)
- the box represents the interquartile range, which contains the middle 50% of cases
- the line across the box represents the median
- the whiskers are lines that extend from the box to the highest and lowest values that are not outliers
- outliers are cases more than 1.5 box lengths beyond the edge of the box
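The numbers behind a boxplot can be inspected without drawing it, using boxplot.stats() from the grDevices package. In this sketch the extra value 120 is a hypothetical reading added only to show how an outlier is reported.

```r
downtime <- c(0, 1, 2, 12, 12, 14, 18, 21, 21, 23, 24, 25, 28, 29,
              30, 30, 30, 33, 36, 44, 45, 47, 51)

b <- boxplot.stats(downtime)  # default coef = 1.5 box lengths
b$stats                       # lower whisker, lower hinge, median, upper hinge, upper whisker
b$out                         # values flagged as outliers (none here)

# Hypothetical extra reading, added only to show an outlier
b2 <- boxplot.stats(c(downtime, 120))
b2$stats
b2$out
```

With the extra value the upper whisker still ends at 51, the largest value that is not an outlier.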
18
Graphical displays – Boxplots – cont.
To improve graphical display use labels: boxplot(downtime, xlab = "downtime", ylab = "minutes")
19
Graphical displays – Multiple Boxplots
Multiple boxplots on the same axes, by adding extra arguments to the boxplot function:
boxplot(results$arch1, results$arch2, xlab = "Architecture, Semesters 1 and 2")
Conclusions:
- marks are lower in sem2
- the range of marks is narrower in sem2
Note the outliers in sem1! These are atypical values, more than 1.5 box lengths beyond the edge of the box.
20
Graphical displays – Multiple Boxplots – cont.
Displays values per gender:
boxplot(arch1~gender, xlab = "gender", ylab = "Marks(%)", main = "Architecture Semester 1")
Note the effect of using: main = "Architecture Semester 1"
21
Par
Display plots using the par function
par(mfrow = c(2,2)) #outputs are displayed in a 2x2 array
boxplot(arch1~gender, main = "Architecture Semester 1")
boxplot(arch2~gender, main = "Architecture Semester 2")
boxplot(prog1~gender, main = "Programming Semester 1")
boxplot(prog2~gender, main = "Programming Semester 2")
To undo the matrix layout type:
par(mfrow = c(1,1)) #restores graphics to the full screen
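An alternative way to undo the layout is to save the old settings, since par() returns the previous values of whatever it changes. The sketch below uses a hypothetical vector y and opens a null PDF device only so that it can run without a display; in an interactive session the pdf(NULL) and dev.off() lines can be left out.

```r
pdf(NULL)                    # null device: nothing is written to disk

y <- c(3, 5, 8, 13, 21)      # hypothetical data, for illustration only

op <- par(mfrow = c(2, 2))   # switch to a 2x2 array, keep the old settings in op
layout_during <- par("mfrow")
boxplot(y, main = "Plot 1")
boxplot(y, main = "Plot 2")
hist(y, main = "Plot 3")
plot(y, main = "Plot 4")

par(op)                      # restore the settings saved in op
layout_after <- par("mfrow")

invisible(dev.off())         # close the null device
```

Restoring from op puts back whatever layout was in use before, which is not always c(1,1).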
22
Par – cont. Conclusions:
- female students are doing less well in programming in sem1
- the median for female students in prog. sem1 is lower than for male students
23
Histograms
A histogram is a graphical display of frequencies in the categories of a variable
hist(arch1, breaks = 5, xlab = "Marks(%)", ylab = "Number of students", main = "Architecture Semester 1")
Note: A histogram with about five breaks of equal width; it counts the observations that fall within each category or "bin". R treats breaks = 5 as a suggestion and may choose a nearby number of round break points.
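The bins and counts can be inspected without drawing anything by calling hist() with plot = FALSE. This sketch uses a hypothetical vector of marks, not the real arch1 data.

```r
# Hypothetical marks, chosen only for illustration
marks <- c(35, 48, 52, 58, 61, 67, 70, 74, 77, 81, 85, 88, 93, 97)

h <- hist(marks, breaks = 5, plot = FALSE)

h$breaks   # the break points R actually chose
h$counts   # number of observations falling in each bin
```

Comparing length(h$counts) with the requested number of breaks shows how closely R followed the suggestion.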
24
Histograms
hist(arch2, xlab = "Marks(%)", ylab = "Number of students", main = "Architecture Semester 2")
Note: A histogram with default breaks
25
Using par with histograms
The par function can be used to display all the subjects in one diagram
par(mfrow = c(2,2))
hist(arch1, xlab = "Architecture", main = " Semester 1", ylim = c(0, 35))
hist(arch2, xlab = "Architecture", main = " Semester 2", ylim = c(0, 35))
hist(prog1, xlab = "Programming", main = " ", ylim = c(0, 35))
hist(prog2, xlab = "Programming", main = " ", ylim = c(0, 35))
Note: ylim = c(0, 35) ensures that the y-axis has the same scale in all four plots!
27
Stem and leaf
Stem and leaf – a more modern way of displaying data!
Like a histogram, the diagram gives the frequencies of categories, but it also shows the actual values in each category
The stem usually depicts the 10s and the leaves depict the units.
stem(downtime, scale = 2)
The decimal point is 1 digit(s) to the right of the |
0 | 012
1 | 2248
2 | 1134589
3 | 00036
4 | 457
5 | 1
28
Stem and leaf – cont.
stem(prog1, scale = 2)
The decimal point is 1 digit(s) to the right of the |
6 | 5
7 | 12
7 | 66
8 |
8 | 5788
9 | 012
9 | 7778
Note: e.g. there are many students with marks of 80%-85%
29
Scatter Plots To investigate relationship between variables:
plot(prog1, prog2, xlab = "Programming, Semester 1", ylab = "Programming, Semester 2")
Note: one variable increases with the other! Students doing well in prog1 tend to do well in prog2.
30
Pairs If more than two variables are involved:
courses <- results[2:5]
pairs(courses) #scatter plots for all possible pairs
or
pairs(results[2:5])
31
Pairs – cont.
32
Graphical display vs. Summary Statistics
Importance of graphical display to provide insight into the data!
Anscombe (1973), four data sets
Each data set consists of two variables on which there are 11 observations
33
Graphical display vs. Summary Statistics
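Data Set 1      Data Set 2      Data Set 3      Data Set 4
x1    y1        x2    y2        x3    y3        x4    y4
10    8.04      10    9.14      10    7.46       8    6.58
 8    6.95       8    8.14       8    6.77       8    5.76
13    7.58      13    8.74      13   12.74       8    7.71
 9    8.81       9    8.77       9    7.11       8    8.84
11    8.33      11    9.26      11    7.81       8    8.47
14    9.96      14    8.10      14    8.84       8    7.04
 6    7.24       6    6.13       6    6.08       8    5.25
 4    4.26       4    3.10       4    5.39      19   12.50
12   10.84      12    9.13      12    8.15       8    5.56
 7    4.82       7    7.26       7    6.42       8    7.91
 5    5.68       5    4.74       5    5.73       8    6.89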
34
First read the data into separate vectors:
x1 <- c(10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5)
y1 <- c(8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68)
x2 <- c(10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5)
y2 <- c(9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74)
x3 <- c(10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5)
y3 <- c(7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73)
x4 <- c(8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8)
y4 <- c(6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89)
35
For convenience, group the data into frames:
dataset1 <- data.frame(x1,y1)
dataset2 <- data.frame(x2,y2)
dataset3 <- data.frame(x3,y3)
dataset4 <- data.frame(x4,y4)
36
It is usual to obtain summary statistics: Calculate the mean:
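In current versions of R, mean() and sd() no longer accept a whole data frame, so apply them to each column with sapply:
sapply(dataset1, mean)
      x1       y1
9.000000 7.500909
sapply(data.frame(x1,x2,x3,x4), mean)
x1 x2 x3 x4
 9  9  9  9
sapply(data.frame(y1,y2,y3,y4), mean)
      y1       y2       y3       y4
7.500909 7.500909 7.500000 7.500909
Calculate the standard deviation:
sapply(data.frame(x1,x2,x3,x4), sd)
      x1       x2       x3       x4
3.316625 3.316625 3.316625 3.316625
sapply(data.frame(y1,y2,y3,y4), sd)
      y1       y2       y3       y4
2.031568 2.031657 2.030424 2.030579
Everything seems the same!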
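The same four data sets also ship with R as the built-in data frame anscombe (columns x1 to x4 and y1 to y4), so the comparison can be repeated without typing the vectors. A minimal sketch, rounding to two decimals to make the similarity easy to see:

```r
data(anscombe)   # built-in data frame with columns x1..x4 and y1..y4

means <- round(sapply(anscombe, mean), 2)  # mean of every column
sds   <- round(sapply(anscombe, sd), 2)    # standard deviation of every column

means
sds
```

To two decimals the four x columns share one mean and one standard deviation, and so do the four y columns.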
37
But when we plot:
par(mfrow = c(2, 2))
plot(x1,y1, xlim=c(0, 20), ylim =c(0, 13))
plot(x2,y2, xlim=c(0, 20), ylim =c(0, 13))
plot(x3,y3, xlim=c(0, 20), ylim =c(0, 13))
plot(x4,y4, xlim=c(0, 20), ylim =c(0, 13))
38
Note:
- Data set 1 is linear with some scatter
- Data set 2 is quadratic
- Data set 3 has an outlier. Without it the data would be linear
- Data set 4 contains x values which are all equal except for one outlier. If it were removed, the data would lie on a vertical line.
Everything seems different!
Graphical displays are the core of getting insight/feel for the data!