Boxplots in R The function boxplot() in R plots boxplots

Slides:



Advertisements
Similar presentations
Measures of Dispersion
Advertisements

Business Statistics: A Decision-Making Approach, 7e © 2008 Prentice-Hall, Inc. Chap 3-1 Business Statistics: A Decision-Making Approach 7 th Edition Chapter.
Business Statistics: A Decision-Making Approach, 7e © 2008 Prentice-Hall, Inc. Chap 3-1 Business Statistics: A Decision-Making Approach 7 th Edition Chapter.
1 BUS 297D: Data Mining Professor David Mease Lecture 3 Agenda: 1) Reminder about HW #1 (Due Thurs 9/10) 2) Finish lecture over Chapter 3.
1 Business 260: Managerial Decision Analysis Professor David Mease Lecture 1 Agenda: 1) Course web page 2) Greensheet 3) Numerical Descriptive Measures.
Basic Business Statistics 10th Edition
Chap 3-1 Statistics for Business and Economics, 6e © 2007 Pearson Education, Inc. Chapter 3 Describing Data: Numerical Statistics for Business and Economics.
Describing Data: Numerical
Describing distributions with numbers
Statistics for Managers Using Microsoft Excel, 4e © 2004 Prentice-Hall, Inc. Chap 3-1 Chapter 3 Numerical Descriptive Measures Statistics for Managers.
Numerical Descriptive Techniques
Chapter 3 – Descriptive Statistics
1 Statistics 202: Statistical Aspects of Data Mining Professor David Mease Tuesday, Thursday 9:00-10:15 AM Terman 156 Lecture 7 = Finish chapter 3 and.
© Copyright McGraw-Hill CHAPTER 3 Data Description.
Measures of Central Tendency and Dispersion Preferred measures of central location & dispersion DispersionCentral locationType of Distribution SDMeanNormal.
Descriptive statistics Describing data with numbers: measures of variability.
STAT 280: Elementary Applied Statistics Describing Data Using Numerical Measures.
Business Statistics: A Decision-Making Approach, 6e © 2005 Prentice-Hall, Inc. Chap 3-1 Business Statistics: A Decision-Making Approach 6 th Edition Chapter.
Lecture 3 Describing Data Using Numerical Measures.
Applied Quantitative Analysis and Practices LECTURE#09 By Dr. Osman Sadiq Paracha.
Measures of Dispersion How far the data is spread out.
Chap 3-1 A Course In Business Statistics, 4th © 2006 Prentice-Hall, Inc. A Course In Business Statistics 4 th Edition Chapter 3 Describing Data Using Numerical.
Business Statistics, A First Course (4e) © 2006 Prentice-Hall, Inc. Chap 3-1 Chapter 3 Numerical Descriptive Measures Business Statistics, A First Course.
LECTURE CENTRAL TENDENCIES & DISPERSION POSTGRADUATE METHODOLOGY COURSE.
Introduction to Statistics Santosh Kumar Director (iCISA)
Lecture 4 Dustin Lueker.  The population distribution for a continuous variable is usually represented by a smooth curve ◦ Like a histogram that gets.
1 1 Slide © 2003 South-Western/Thomson Learning TM Chapter 3 Descriptive Statistics: Numerical Methods n Measures of Variability n Measures of Relative.
Boxplots in R The function boxplot() in R plots boxplots By default, boxplot() in R plots the maximum and the minimum non-outlying values instead of the.
Statistical Methods © 2004 Prentice-Hall, Inc. Week 3-1 Week 3 Numerical Descriptive Measures Statistical Methods.
(Unit 6) Formulas and Definitions:. Association. A connection between data values.
Chapter 2 Describing Data: Numerical
Yandell – Econ 216 Chap 3-1 Chapter 3 Numerical Descriptive Measures.
Descriptive Statistics ( )
Describing Data: Two Variables
Exploring Data: Summary Statistics and Visualizations
Stats 202: Statistical Aspects of Data Mining Professor Rajan Patel
Measures of Dispersion
Statistics for Managers Using Microsoft® Excel 5th Edition
Business and Economics 6th Edition
MATH-138 Elementary Statistics
Descriptive Statistics
Chapter 3 Describing Data Using Numerical Measures
Chapter 5 : Describing Distributions Numerically I
Ch 4 實習.
Descriptive Statistics
Reasoning in Psychology Using Statistics
CHAPTER 3 Data Description 9/17/2018 Kasturiarachi.
Chapter 3 Describing Data Using Numerical Measures
Numerical Descriptive Measures
Chapter 5: Describing Distributions Numerically
STA 291 Spring 2008 Lecture 5 Dustin Lueker.
STA 291 Spring 2008 Lecture 5 Dustin Lueker.
Numerical Descriptive Measures
Quartile Measures DCOVA
Displaying Distributions with Graphs
Displaying and Summarizing Quantitative Data
Honors Stats Chapter 4 Part 6
LESSON 4: MEASURES OF VARIABILITY AND PROPORTION
Displaying and Summarizing Quantitative Data
Exploratory Data Analysis
Numerical Descriptive Measures
Summary (Week 1) Categorical vs. Quantitative Variables
Measures of Position Section 3.3.
Summary (Week 1) Categorical vs. Quantitative Variables
Describing Distributions Numerically
Honors Statistics Review Chapters 4 - 5
MBA 510 Lecture 2 Spring 2013 Dr. Tonya Balan 4/20/2019.
Business and Economics 7th Edition
Numerical Descriptive Measures
Presentation transcript:

Boxplots in R The function boxplot() in R plots boxplots By default, boxplot() in R plots the maximum and the minimum non-outlying values instead of the 10th and 90th percentiles as the book describes See: > a<-c(11,12, 22, 33, 34, 100) > boxplot(a) > b<-c(11,12, 22, 33, 34, 65) > boxplot(b) a<-c(11,12,22, 33, 34, 50, 100) boxplot(a) outlier 10th percentile 25th percentile 75th percentile 50th percentile 90th percentile Maximum Minimum *

Boxplots (Pages 114-115) Boxplots help you visualize the differences in the medians relative to the variation Example: The median value of Attribute A was 2.0 for men and 4.1 for women. Is this a “big” difference? *

Boxplots (Pages 114-115) Boxplots help you visualize the differences in the medians relative to the variation Example: The median value of Attribute A was 2.0 for men and 4.1 for women. Is this a “big” difference? Maybe yes: *

Boxplots (Pages 114-115) Boxplots help you visualize the differences in the medians relative to the variation Example: The median value of Attribute A was 2.0 for men and 4.1 for women. Is this a “big” difference? Maybe yes: Maybe no: *

This is paired data, but I’m ignoring that for now. In class exercise #14: Use boxplot() in R to make boxplots comparing the first and second exam scores in the data at http://sites.google.com/site/stats202/data/exams_and_names.csv This is paired data, but I’m ignoring that for now. *

data<-read.csv("exams_and_names.csv") In class exercise #14: Use boxplot() in R to make boxplots comparing the first and second exam scores in the data at http://sites.google.com/site/stats202/data/exams_and_names.csv Answer: data<-read.csv("exams_and_names.csv") boxplot(data[,2],data[,3],col="blue", main="Exam Scores", names=c("Exam 1","Exam 2"),ylab="Exam Score") https://stat.ethz.ch/R-manual/R-devel/library/graphics/html/boxplot.html *

In class exercise #14: Use boxplot() in R to make boxplots comparing the first and second exam scores in the data at http://sites.google.com/site/stats202/data/exams_and_names.csv Answer: *

Some advantages of color: - provides a nice way to differentiate Using Color in Plots In R, the graphing parameter “col” can often be used to specify different colors for points, lines etc. Some advantages of color: - provides a nice way to differentiate - makes it more interesting to look at Some disadvantages of color: - Some people are color blind - Most printing is in black and white - Color can be distracting - A poor color scheme can make the graph difficult to read *

3-Dimesional Plots 3D plots can sometimes be useful One example is the 3D scatter plot for plotting 3 attributes (page 119) The function scatterplot3d() makes fairly nice 3D scatter plots in R -this is not in the base package so you need to do: install.packages("scatterplot3d") library(scatterplot3d) However, it may be better to show the 3rd dimension by simply using a 2D plot with different plotting characters (page 119) *

3-Dimesional Plots Never use the 3rd dimension in a manner that conveys no extra information just to make the plot look more impressive *

3-Dimesional Plots Never use the 3rd dimension in a manner that conveys no extra information just to make the plot look more impressive Examples: *

Do’s and Don’ts (Page 130) Read the ACCENT Principles http://www.datavis.ca/gallery/accent.php Read Tufte’s Guidelines *

Compressing Vertical Axis ✓ Bad Presentation Good Presentation Quarterly Sales Quarterly Sales $ $ 200 50 100 25 Q1 Q2 Q3 Q4 Q1 Q2 Q3 Q4 *

No Zero Point On Vertical Axis ✓ Good Presentations Bad Presentation $ Monthly Sales 45 Monthly Sales 42 $ 45 39 36 42 39 J F M A M J or 36 $ J F M A M J 60 40 Graphing the first six months of sales 20 J F M A M J *

A’s received by students. A’s received by students. No Relative Basis ✓ Bad Presentation Good Presentation A’s received by students. A’s received by students. % Freq. 30% 300 20% 200 100 10% 0% FR SO JR SR FR SO JR SR FR = Freshmen, SO = Sophomore, JR = Junior, SR = Senior *

✓ Chart Junk Bad Presentation Good Presentation $ 4 2 1960 1970 1980 Minimum Wage Minimum Wage $ 1960: $1.00 4 1970: $1.60 2 1980: $3.10 1960 1970 1980 1990 1990: $3.80 *

Final Touches Many times plots are difficult to read or unattractive because people do not take the time to learn how to adjust default values for font size, font type, color schemes, margin size, plotting characters, etc. In R, the function par() controls a lot of these Also in R, the command expression() can produce subscripts and Greek letters in the text -example: xlab=expression(alpha[1]) *

Mean vs. Median While both are measures of center, the median is sometimes preferred over the mean because it is more robust to outliers (=extreme observations) and skewness If the data is right-skewed, the mean will be greater than the median If the data is left-skewed, the mean will be smaller than the median If the data is symmetric, the mean will be equal to the median *

*

Measures of Spread: The range is the maximum minus the minimum. This is not robust and is extremely sensitive to outliers. The variance is where n is the sample size. This is also not very robust to outliers. var() in R The standard deviation is simply the square root of the variance. It is on the scale of the original data. It is roughly the average distance from the mean. sd() in R The interquartile range is the 3rd quartile minus the 1st quartile. This is quite robust to outliers. *

Measures of Association: The covariance between x and y is defined as where n is the sample size. This will be positive if x and y have a positive relationship and negative if they have a negative relationship. The correlation is the covariance divided by the product of the two standard deviations. It will be between -1 and +1 inclusive. It is often denoted r. It is sometimes called the coefficient of correlation. These are both very sensitive to outliers. *

Correlation (r): Y Y X X r = -1 r = -.6 Y Y X X r = +1 r = +.3 *

Match each plot with its correct coefficient of correlation. In class exercise #16: Match each plot with its correct coefficient of correlation. Choices: r=-3.20, r=-0.98, r=0.86, r=0.95, r=1.20, r=-0.96, r=-0.40 A) B) C) D) E) *

Make two vectors of length 1,000,000 in R using In class exercise #17: Make two vectors of length 1,000,000 in R using runif(1000000) and compute the coefficient of correlation using cor(). Does the resulting value surprise you? *

What value of r would you expect for the two exam scores in In class exercise #18: What value of r would you expect for the two exam scores in http://sites.google.com/site/stats202/data/exams_and_names.csv which are plotted below. Compute the value to check your intuition. > data = read.csv(“exams_and_names.csv”, header=T) > cor(data[,2],data[,3],use="pairwise.complete.obs") *