Boxplots in R


Boxplots in R
The function boxplot() in R plots boxplots.
By default, boxplot() in R draws the whiskers out to the maximum and the minimum non-outlying values instead of the 10th and 90th percentiles as the book describes.
[Figure: annotated boxplot labeling the outlier, the 10th, 25th, 50th, 75th, and 90th percentiles, and the maximum and minimum.]
See:
> a<-c(11,12,22,33,34,100)
> boxplot(a)
> b<-c(11,12,22,33,34,65)
> boxplot(b)
> a<-c(11,12,22,33,34,50,100)
> boxplot(a)
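The whisker behaviour described on this slide can be checked numerically. The sketch below uses boxplot.stats(), the base R helper that computes the numbers boxplot() draws, on the vectors a and b from the slide. The comparison with quantile() is an added illustration, not part of the original slide.

```r
# Vectors from the slide
a <- c(11, 12, 22, 33, 34, 100)
b <- c(11, 12, 22, 33, 34, 65)

# $stats holds: lower whisker, lower hinge, median, upper hinge, upper whisker
# $out holds the points drawn individually as outliers
stats_a <- boxplot.stats(a)
stats_b <- boxplot.stats(b)

stats_a$stats  # upper whisker stops at 34, the largest non-outlying value
stats_a$out    # 100 is flagged as an outlier
stats_b$stats  # upper whisker reaches 65, the maximum
stats_b$out    # no outliers

# For comparison: the 10th and 90th percentiles the book uses for the whiskers
quantile(a, c(0.10, 0.90))
```

Changing the last value from 100 to 65 is enough to move it inside the default 1.5 x IQR fence, so it becomes the whisker end rather than an outlier.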

Boxplots (Pages )
Boxplots help you visualize the differences in the medians relative to the variation.
Example: The median value of Attribute A was 2.0 for men and 4.1 for women. Is this a "big" difference?
Maybe yes: [figure]
Maybe no: [figure]
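A rough way to reproduce the "maybe yes" and "maybe no" pictures is to simulate two scenarios with the same centers but different spreads. Everything below is hypothetical: the data are simulated with rnorm(), and the sample size, seed, and standard deviations are arbitrary choices made only for illustration.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible
n <- 200     # hypothetical sample size

# Same centers as the example (2.0 and 4.1), small spread
men_tight   <- rnorm(n, mean = 2.0, sd = 0.3)
women_tight <- rnorm(n, mean = 4.1, sd = 0.3)

# Same centers, large spread
men_wide   <- rnorm(n, mean = 2.0, sd = 5)
women_wide <- rnorm(n, mean = 4.1, sd = 5)

# Side-by-side panels: the boxes barely overlap on the left and overlap heavily on the right
par(mfrow = c(1, 2))
boxplot(men_tight, women_tight, names = c("Men", "Women"), main = "Maybe yes")
boxplot(men_wide, women_wide, names = c("Men", "Women"), main = "Maybe no")
```

The difference in medians is about the same in both panels; only the variation around it changes how large the difference looks.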

In class exercise #14: Use boxplot() in R to make boxplots comparing the first and second exam scores in the data at
This is paired data, but I'm ignoring that for now.
Answer:
> data<-read.csv("exams_and_names.csv")
> boxplot(data[,2],data[,3],col="blue", main="Exam Scores", names=c("Exam 1","Exam 2"),ylab="Exam Score")
[Figure: the resulting boxplots of Exam 1 and Exam 2 scores.]

Using Color in Plots
In R, the graphing parameter "col" can often be used to specify different colors for points, lines, etc.
Some advantages of color:
- provides a nice way to differentiate
- makes it more interesting to look at
Some disadvantages of color:
- Some people are color blind
- Most printing is in black and white
- Color can be distracting
- A poor color scheme can make the graph difficult to read

3-Dimensional Plots
3D plots can sometimes be useful.
One example is the 3D scatter plot for plotting 3 attributes (page 119).
The function scatterplot3d() makes fairly nice 3D scatter plots in R.
- this is not in the base package so you need to do:
install.packages("scatterplot3d")
library(scatterplot3d)
However, it may be better to show the 3rd dimension by simply using a 2D plot with different plotting characters (page 119).
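The 2D alternative with different plotting characters can be sketched in base R. The example below is an illustration only: it assumes R's built-in iris data set (not mentioned in the slides) and uses the species as the third attribute, mapped to the pch plotting-character parameter.

```r
# Built-in iris data, used here only as a stand-in example (assumption)
pch_codes <- as.integer(iris$Species)  # 1, 2, 3: one plotting character per species

# Two numeric attributes on the axes, third attribute shown by plotting character
plot(iris$Sepal.Length, iris$Sepal.Width,
     pch = pch_codes,
     xlab = "Sepal length", ylab = "Sepal width",
     main = "Third attribute shown by plotting character")

# Legend mapping each plotting character back to its species
legend("topright", legend = levels(iris$Species), pch = 1:3)
```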

3-Dimensional Plots
Never use the 3rd dimension in a manner that conveys no extra information just to make the plot look more impressive.
Examples: [figures]

Do's and Don'ts (Page 130)
Read the ACCENT Principles
Read Tufte's Guidelines

Compressing Vertical Axis
[Figure: Quarterly Sales ($) for Q1 to Q4, shown once as a bad presentation and once as a good presentation (✓).]

No Zero Point on Vertical Axis
[Figure: Monthly Sales ($) for J, F, M, A, M, J (graphing the first six months of sales), shown as one bad presentation and two good presentations (✓).]

No Relative Basis
[Figure: A's received by students, by class, shown as a bad presentation and a good presentation (✓); one panel's vertical axis is labeled Freq., the other %.]
FR = Freshmen, SO = Sophomore, JR = Junior, SR = Senior

Chart Junk
[Figure: Minimum Wage ($) by year, shown as a bad presentation and a good presentation (✓); surviving labels include 1960 and $3.80.]

Mean vs. Median
While both are measures of center, the median is sometimes preferred over the mean because it is more robust to outliers (=extreme observations) and skewness.
If the data is right-skewed, the mean will be greater than the median.
If the data is left-skewed, the mean will be smaller than the median.
If the data is symmetric, the mean will be equal to the median.
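A small numeric check of these statements, reusing the vector a from the boxplot slide as a sample with one large upper value. The mirrored and symmetric vectors are hypothetical additions made for illustration.

```r
# Vector from the boxplot slide: the value 100 stretches the right tail
a <- c(11, 12, 22, 33, 34, 100)
mean(a)    # pulled upward by the value 100
median(a)  # unaffected by how extreme the largest value is

# Mirror the data to get a long left tail (hypothetical example)
left <- -a
mean(left)
median(left)

# A symmetric sample (hypothetical example)
sym <- c(1, 2, 3, 4, 5)
mean(sym)
median(sym)

# Robustness: replacing 100 by 1000 changes the mean but not the median
a_extreme <- c(11, 12, 22, 33, 34, 1000)
c(mean = mean(a_extreme), median = median(a_extreme))
```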

Measures of Spread:
● The range is the maximum minus the minimum. This is not robust and is extremely sensitive to outliers.
● The variance is $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$, where n is the sample size. This is also not very robust to outliers. var() in R
● The standard deviation is simply the square root of the variance. It is on the scale of the original data. It is roughly the average distance from the mean. sd() in R
● The interquartile range is the 3rd quartile minus the 1st quartile. This is quite robust to outliers.
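The four measures can be compared on the vectors a and b from the boxplot slide, which differ only in their largest value. This is a sketch: the helper name spread() is hypothetical, IQR() is base R's interquartile range function with its default quantile definition, and the manual variance assumes the n - 1 denominator that var() uses.

```r
# Vectors from the boxplot slide; they differ only in the last value
a <- c(11, 12, 22, 33, 34, 100)
b <- c(11, 12, 22, 33, 34, 65)

# Hypothetical helper collecting the four measures of spread
spread <- function(x) {
  c(range    = max(x) - min(x),
    variance = var(x),
    sd       = sd(x),
    iqr      = IQR(x))
}

# One row per vector: range, variance and sd change a lot, the IQR does not
rbind(a = spread(a), b = spread(b))

# Variance by hand, using the n - 1 denominator
n <- length(a)
manual_var <- sum((a - mean(a))^2) / (n - 1)
```

Because the quartiles of these two vectors do not depend on the largest value, the IQR is the same for a and b, while the range, variance, and standard deviation all shrink when 100 is replaced by 65.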

Measures of Association:
The covariance between x and y is defined as $\mathrm{cov}(x, y) = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})$, where n is the sample size. This will be positive if x and y have a positive relationship and negative if they have a negative relationship.
The correlation is the covariance divided by the product of the two standard deviations. It will be between -1 and +1 inclusive. It is often denoted r. It is sometimes called the coefficient of correlation.
These are both very sensitive to outliers.
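The two definitions can be verified against R's built-in cov() and cor(). The sketch below uses a small hypothetical data set and assumes the usual n - 1 denominator, which is what cov() uses.

```r
# Hypothetical paired data with a positive relationship
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)
n <- length(x)

# Covariance by hand: average product of deviations, with n - 1 in the denominator
cov_manual <- sum((x - mean(x)) * (y - mean(y))) / (n - 1)

# Correlation: covariance divided by the product of the standard deviations
r_manual <- cov_manual / (sd(x) * sd(y))

# Compare with the built-in functions
c(manual = cov_manual, builtin = cov(x, y))
c(manual = r_manual,   builtin = cor(x, y))
```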

Correlation (r):
[Figure: four scatter plots of Y against X, with r = -1, r = -.6, r = +.3, and r = +1.]

In class exercise #16: Match each plot with its correct coefficient of correlation.
Choices: r=-3.20, r=-0.98, r=0.86, r=0.95, r=1.20, r=-0.96, r=-0.40
[Figure: five scatter plots labeled A) through E).]

In class exercise #17: Make two vectors of length 1,000,000 in R using runif() and compute the coefficient of correlation using cor(). Does the resulting value surprise you?
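One possible way to carry out this exercise is sketched below. The seed is an arbitrary addition for reproducibility and is not part of the exercise. Since the two vectors are generated independently, the computed r should land very close to 0.

```r
set.seed(1)  # arbitrary seed (assumption), only so the result is reproducible

# Two independent vectors of 1,000,000 uniform random numbers on [0, 1]
x <- runif(1000000)
y <- runif(1000000)

# Coefficient of correlation between the two vectors
r <- cor(x, y)
r
```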

In class exercise #18: What value of r would you expect for the two exam scores in which are plotted below? Compute the value to check your intuition.
> data = read.csv("exams_and_names.csv", header=TRUE)
> cor(data[,2],data[,3],use="pairwise.complete.obs")