CA200 (based on the book by Prof. Jane M. Horgan)

Slides:



Advertisements
Similar presentations
Then we have.
Advertisements

Equivalent Expressions L.O. To simplify algebraic expressions (Level 5) abc a + ba + b + 1b + c +2 For each block find the sum of the two blocks underneath.
Psychoanalytic Theory
Example T1: T2: Constraint: Susan’s salary = Jane’s salary
1. Basics of R 2. Basic probability with R
SAT Question of the Day #7
By: Jane Lord By The End of This Session You will have entered Jane’s World I am a 50-year-old with Asperger’s who is self employed and reasonably.
Learning about book covers Written by Dennis Williams Polar Bear Pictures by Jane Roberts Happy Book Company is The title of a book the name of the book.
13-1 Introduction to Quadratic Equations  CA Standards 14.0 and 21.0  Quadratic Equations in Standard Form.
The Distributive Property Purpose: To use the distributive property Outcome: To simplify algebraic expressions.
11-3 Simplifying Radical Expressions Standard 2.0 One Property.
Lesson 8.1 Apply Exponent Properties Involving Products After today’s lesson, you should be able to use properties of exponents involving products to simplify.
Energy and Gross National Product (new book of late prof. em. Ivo Kolin) Presenter:Sonja Koščak Kolin, MSc University of Zagreb Faculty of Mining, Geology.
Some of the best books of  &mid= A23D2FC75CD A23D2FC75CD7.
Standard: Algebraic Functions 1.0 and 1.1 Students evaluate expressions with variables.
Simplify (-7)2. -6[(-7) + 10] – 4 Evaluate each expression for m = -3, n = 4, and p = m / n + p4. (mp) 3 5. mnp Warm Up.
Primary Longman Express
Prime Numbers and Factoring. Which expression is equal to 4x6? (a) 2 × 2 x 2 (b) 2 × 3 × 4 (c) 4 × 5 × 3 (d) 6 × 6.
This is the Andersons family:. Jane This is the Andersons family: Jane David.
By Mdm Yvonne Ong. Jane has 30 fish. Amy has 24 fish. Express the number of fish Jane has as a fraction to the number of fish Amy has. J A = = 5.
Evaluate: 1) Classwork: Book: pg. 100; 2 Homework: Book: pg. 100; 10.
Sentence Part Structure Ms. Anderson’s 3 rd grade Language Arts Class.
Change to scientific notation: A. B. C. 289,800, x x x
Objective The student will be able to: use the distributive property to simplify expressions.
QUESTIONS. Who What Where When Why How How often How much How many How old 1.Jane eats _______________ every day. 2.Paul was born on ______________. 3.They.
The Perfect Party Bus Service Riverside CA
Miss Horgan’s Kindergarten Newsletter
Notes Over 1.2.
Simplifying Expressions
Distributive Property
8 Tips to Enhance Business Trips
Warm-up October 12, ) Use the distributive property and then simplify: 9(4R + 7B) – 23B – 12R + 2R 2.) Write an expression for “3 times a number.
Review Problems Algebra 1 11-R.
Neuron PPS SUBHASHINI, ASSOCIATE PROF.
7 Important Benefits of Hiring a Party Bus Service
Objective The student will be able to:
By Norton Setup CA TollFree (US/CA/AU/UK)
بلکه ” ما“ شود تقدیم به او که نمی خواهد من و تو ” ما“ شویم
Solving Systems of Equations
Have you ever heard of the planet Saturn?
Regular Expressions Prof. Busch - LSU.
Regular expressions 2 Day /23/16
Solving Division Equations
الخطوة الثالثة الاختيار.
Chapter 5-1 Exponents.
Radical Expressions CA 2.0.
חוסן לאומי ומקומי בישראל
Donate a book £5.75 Or Donations towards a book.
Fractional Exponents CA 2.0.
مقرر التوحيد كتاب تعليمي للناشئة و المبتدئين [ المرحلة الإعدادية ]
نحو مفهوم شامل للجودة ومتطلبات التأهيل للأيزو
Regular expressions 3 Day /26/16
The word percent means:
READY?.
VEHICLE TECHNOLOGY Prof. Dr. Kadir AYDIN AIR CONDITIONING SYSTEMS.
Christmas in Toyland SOL8.9 Mrs. Hatfield.
Guiding Question Name.
Standard: EE1.,EE2b., and EE2c.
Jane Wise Franklin Gothic 32 pt
Factorization by Using Identities of Sum and Difference of Two Cubes
1. The students will be able to express likes and dislikes (in terms of movies, books, music). 2. The students will be able to express familiarity (using.
The Distributive Property Guided Notes
Unit 3 Welcome to the school library

8.1 – 8.3 Review Exponents.
How to Address an Envelope
Warmup Blue book- pg 105 # 1-5.
Because and so.
Target Language English Created by Jane Driver.
Prof. Qi Tian Fall 2013 CS 3843 Midterm Review Prof. Qi Tian Fall 2013.
Presentation transcript:

CA200 (based on the book by Prof. Jane M. Horgan) 3. Basics of R – cont. Summarising Statistical Data Graphical Displays 4. Basic distributions with R CA200 (based on the book by Prof. Jane M. Horgan)

Basics 6+7*3/2 #general expression [1] 16.5 x <- 1:4 #integers are assigned to the vector x x #print x [1] 1 2 3 4 x2 <- x**2 #square the element, or x2<-x^2 x2 [1] 1 4 9 16 X <- 10 #case sensitive! prod1 <- X*x prod1 [1] 10 20 30 40 CA200

Getting Help click the Help button on the toolbar help() help.start() demo() ?read.table help.search ("data.entry") apropos (“boxplot”) - "boxplot", "boxplot.default", "boxplot.stat” CA200

Statistics: Measures of Central Tendency Typical or central points: Mean: Sum of all values divided by the number of cases Median: Middle value. 50% of data below and 50% above Mode: Most commonly occurring value, value with the highest frequency CA200

Statistics: Measures of Dispersion Spread or variation in the data Standard Deviation (σ): The square root of the average squared deviations from the mean - measures how the data values differ from the mean - a small standard deviation implies most values are near the average - a large standard deviation indicates that values are widely spread above and below the average. CA200

Statistics: Measures of Dispersion Spread or variation in the data Range: Lowest and highest value Quartiles: Divides data into quarters. 2nd quartile is median Interquartile Range: 1st and 3rd quartiles, middle 50% of the data. CA200

Data Entry Entering data from the screen to a vector Example: 1.1 downtime <-c(0, 1, 2, 12, 12, 14, 18, 21, 21, 23, 24, 25, 28, 29, 30,30,30,33,36,44,45,47,51) mean(downtime) [1] 25.04348 median(downtime) [1] 25 range(downtime) [1] 0 51 sd(downtime) [1] 14.27164 CA200

Data Entry – cont. Entering data from a file to a data frame Example 1.2: Examination results: results.txt gender arch1 prog1 arch2 prog2 m 99 98 83 94 m NA NA 86 77 m 97 97 92 93 m 99 97 95 96 m 89 92 86 94 m 91 97 91 97 m 100 88 96 85 f 86 82 89 87 and so on CA200

Data Entry – cont. results$arch1[5] NA indicates missing value. No mark for arch1 and prog1 in second record. results <- read.table ("C:\\results.txt", header = T) # download the file to desired location results$arch1[5] [1] 89 Alternatively attach(results) names(results) allows you to access without prefix results. arch1[5] CA200

Data Entry – Missing values mean(arch1) [1] NA #no result because some marks are missing na.rm = T (not available, remove) or na.rm = TRUE mean(arch1, na.rm = T) [1] 83.33333 mean(prog1, na.rm = T) [1] 84.25 mean(arch2, na.rm = T) mean(prog2, na.rm = T) mean(results, na.rm = T) gender arch1 prog1 arch2 prog2 NA 94.42857 93.00000 89.75000 90.37500

Data Entry – cont. Use “read.table” if data in text file are separated by spaces Use “read.csv” when data are separated by commas Use “read.csv2” when data are separated by semicolon CA200

Data Entry – cont. Entering a data into a spreadsheet: newdata <- data.frame() #brings up a new spreadsheet called newdata fix(newdata) #allows to subsequently add data to this data frame CA200

Summary Statistics Example 1.1: Downtime: summary(downtime) Min. 1st Qu. Median Mean 3rd Qu. Max. 0.00 16.00 25.00 25.04 31.50 51.00 Example 1.2: Examination Results: summary(results) Gender arch1 prog1 arch2 prog2 f: 4 Min. : 3.00 Min. :65.00 Min. :56.00 Min. :63.00 m:22 1st Qu.: 79.25 1st Qu.:80.75 1st Qu.:77.75 1st Qu.:77.50 Median : 89.00 Median :82.50 Median :85.50 Median :84.00 Mean : 83.33 Mean :84.25 Mean :81.15 Mean :83.85 3rd Qu.: 96.00 3rd Qu.:90.25 3rd Qu.:91.00 3rd Qu.:92.50 Max. :100.00 Max. :98.00 Max. :96.00 Max. :97.00 NA's : 2.00 NA's : 2.00

Summary Statistics - cont. Example 1.2: Examination Results: For a separate analysis use: mean(results$arch1, na.rm=T) # hint: use attach(results) [1] 83.33333 summary(arch1, na.rm=T) Min. 1st Qu. Median Mean 3rd Qu. Max. NA's 3.00 79.25 89.00 83.33 96.00 100.00 2.00

Programming in R x <- sum(downtime) # sum of elements in downtime Example 1.3: Write a program to calculate the mean of downtime Formula for the mean: x <- sum(downtime) # sum of elements in downtime n <- length(downtime) #number of elements in the vector mean_downtime <- x/n or mean_downtime <- sum(downtime) / length(downtime)

Programming in R – cont. #hint - use sqrt function Example 1.4: Write a program to calculate the standard deviation of downtime #hint - use sqrt function CA200

Graphical displays - Boxplots Boxplot – a graphical summary based on the median, quartile and extreme values boxplot(downtime) box represents the interquartile range which contains 50% of cases whiskers are lines that extend from max and min value line across the box represents median extreme values are cases on more than 1.5box length from max/min value CA200

Graphical displays – Boxplots – cont. To improve graphical display use labels: boxplot(downtime, xlab = "downtime", ylab = "minutes")

Graphical displays – Multiple Boxplots Multiple boxplots at the same axis - by adding extra arguments to boxplot function: boxplot(results$arch1, results$arch2, xlab = " Architecture, Semesters 1 and 2" ) Conclusions: marks are lower in sem2 Range of marks in narrower in sem2 Note outliers in sem1! 1.5 box length from max/min value. Atypical values.

Graphical displays – Multiple Boxplots – cont. Displays values per gender: boxplot(arch1~gender, xlab = "gender", ylab = "Marks(%)", main = "Architecture Semester 1") Note the effect of using: main = "Architecture Semester 1”

Par Display plots using par function par (mfrow = c(2,2)) #outputs are displayed in 2x2 array boxplot (arch1~gender, main = "Architecture Semester 1") boxplot(arch2~gender, main = "Architecture Semester 2") boxplot(prog1~gender, main = "Programming Semester 1") boxplot(prog2~gender, main = "Programming Semester 2") To undo matrix type: par(mfrow = c(1,1)) #restores graphics to the full screen

Par – cont. Conclusions: - female students are doing less well in programming for sem1 - median for female students for prog. sem1 is lower than for male students

Histograms hist(arch1, breaks = 5, xlab ="Marks(%)", A histogram is a graphical display of frequencies in the categories of a variable hist(arch1, breaks = 5, xlab ="Marks(%)", ylab = "Number of students", main = "Architecture Semester 1“ ) Note: A histogram with five breaks equal width - count observations that fill within categories or “bins”

Histograms hist(arch2, xlab ="Marks(%)", ylab = "Number of students", main = “Architecture Semester 2“ ) Note: A histogram with default breaks CA200

Using par with histograms The par can be used to represent all the subjects in the diagram par (mfrow = c(2,2)) hist(arch1, xlab = "Architecture", main = " Semester 1", ylim = c(0, 35)) hist(arch2, xlab = "Architecture", main = " Semester 2", ylim = c(0, 35)) hist(prog1, xlab = "Programming", main = " ", ylim = c(0, 35)) hist(prog2, xlab = "Programming", Note: ylim = c(0, 35) ensures that the y-axis is the same scale for all four objects! CA200

CA200

Stem and leaf Stem and leaf – more modern way of displaying data! Like histograms: diagrams gives frequencies of categories but gives the actual values in each category Stem usually depicts the 10s and the leaves depict units. stem (downtime, scale = 2) The decimal point is 1 digit(s) to the right of the | 0 | 012 1 | 2248 2 | 1134589 3 | 00036 4 | 457 5 | 1 CA200

Stem and leaf – cont. stem(prog1, scale = 2) The decimal point is 1 digit(s) to the right of the | 6 | 5 7 | 12 7 | 66 8 | 01112223 8 | 5788 9 | 012 9 | 7778 Note: e.g. there are many students with mark 80%-85% CA200

Scatter Plots To investigate relationship between variables: plot(prog1, prog2, xlab = "Programming, Semester 1", ylab = "Programming, Semester 2") Note: one variable increases with other! students doing well in prog1 will do well in prog2! CA200

Pairs If more than two variables are involved: courses <- results[2:5] pairs(courses) #scatter plots for all possible pairs or pairs(results[2:5]) CA200

Pairs – cont. CA200

Graphical display vs. Summary Statistics Importance of graphical display to provide insight into the data! Anscombe(1973), four data sets Each data set consist of two variables on which there are 11 observations CA200

Graphical display vs. Summary Statistics Data Set 1 Data Set 2 Data Set 3 Data Set 4 x1 y1 x2 y2 x3 y3 x4 y4 10 8.04 10 9.14 10 7.46 8 6.58 8 6.95 8 8.14 8 6.77 8 5.76 13 7.58 13 8.74 13 12.74 8 7.71 9 8.81 9 8.77 9 7.11 8 8.84 11 8.33 11 9.26 11 7.81 8 8.47 14 9.96 14 8.10 14 8.84 8 7.04 6 7.24 6 6.13 6 6.08 8 5.25 4 4.26 4 3.10 4 5.39 19 12.50 12 10.84 12 9.13 12 8.15 8 5.56 7 4.82 7 7.26 7 6.42 8 7.91 5 5.68 5 4.74 5 5.73 8 6.89 CA200

First read the data into separate vectors: x1<-c(10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5) y1<-c(8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68) x2 <- c(10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5) y2 <-c(9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74) x3<- c(10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5) y3 <- c(7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73) x4<- c(8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8) y4 <- c(6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89) CA200

For convenience, group the data into frames: dataset1 <- data.frame(x1,y1) dataset2 <- data.frame(x2,y2) dataset3 <- data.frame(x3,y3) dataset4 <- data.frame(x4,y4) CA200

It is usual to obtain summary statistics: Calculate the mean: mean(dataset1) x1 y1 9.000000 7.500909 mean(data.frame(x1,x2,x3,x4)) x1 x2 x3 x4 9 9 9 9 mean(data.frame(y1,y2,y3,y4)) y1 y2 y3 y4 7.500909 7.500909 7.500000 7.500909 Calculate the standard deviation: sd(data.frame(x1,x2,x3,x4)) x1 x2 x3 x4 3.316625 3.316625 3.316625 3.316625 sd(data.frame(y1,y2,y3,y4)) 2.031568 2.031657 2.030424 2.030579 Everything seems the same! CA200

plot(x1,y1, xlim=c(0, 20), ylim =c(0, 13)) But when we plot: par(mfrow = c(2, 2)) plot(x1,y1, xlim=c(0, 20), ylim =c(0, 13)) plot(x2,y2, xlim=c(0, 20), ylim =c(0, 13)) plot(x3,y3, xlim=c(0, 20), ylim =c(0, 13)) plot(x4,y4, xlim=c(0, 20), ylim =c(0, 13)) CA200

Note: Data set 1 in linear with some scatter Data set 2 is quadratic Data set 3 has an outlier. Without them the data would be linear Data set 4 contains x values which are equal expect one outlier. If removed, the data would be vertical. Everything seems different! Graphical displays are the core of getting insight/feel for the data!