Published by Cecil McGee. Modified over 9 years ago.
1
Statistics: Unlocking the Power of Data Lock 5
STAT 101, Dr. Kari Lock Morgan, 9/6/12
Describing Data: One Variable
SECTIONS 2.1, 2.2, 2.3, 2.4
One categorical variable (2.1)
One quantitative variable (2.2, 2.3, 2.4)
2
The Big Picture
[Diagram labels: Population, Sample, Sampling, Statistical Inference, Descriptive Statistics]
3
Descriptive Statistics
In order to make sense of data, we need ways to summarize and visualize it.
Summarizing and visualizing variables and relationships between two variables is often known as descriptive statistics (also known as exploratory data analysis).
The types of summary statistics and visualization methods depend on the type of variable(s) being analyzed (categorical or quantitative).
4
One Categorical Variable
A random sample of US adults in 2012 was surveyed regarding the type of cell phone owned.
Android? iPhone? Blackberry? Non-smartphone? No cell phone?
5
Frequency Table
A frequency table shows the number of cases that fall in each category:
Android          458
iPhone           437
Blackberry       141
Non Smartphone   924
No cell phone    293
Total           2253
R: table(x)
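As a rough sketch of the `table(x)` hint, the block below rebuilds a vector of responses from the counts on the slide and tabulates it. The raw survey data are not part of the slides, so the vector `x` here is reconstructed and the variable names are illustrative assumptions.

```r
# Counts copied from the frequency table on the slide.
counts <- c("Android" = 458, "iPhone" = 437, "Blackberry" = 141,
            "Non Smartphone" = 924, "No cell phone" = 293)

# Reconstructed responses (assumption: one entry per surveyed adult).
# Fixing the factor levels keeps the categories in slide order.
x <- factor(rep(names(counts), times = counts), levels = names(counts))

freq <- table(x)    # number of cases in each category
print(freq)
print(length(x))    # total number of cases
```

Setting the factor levels explicitly is a design choice: without it, `table()` would list the categories alphabetically.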
6
Proportion
7
Proportion
What proportion of adults sampled do not own a cell phone?
Android          458
iPhone           437
Blackberry       141
Non Smartphone   924
No cell phone    293
Total           2253
293/2253 = 0.13, or 13%
Proportions and percentages can be used interchangeably.
8
Relative Frequency Table
A relative frequency table shows the proportion of cases that fall in each category:
Android          0.203
iPhone           0.194
Blackberry       0.063
Non Smartphone   0.410
No cell phone    0.130
All the numbers in a relative frequency table sum to 1.
R: table(x)/length(x)
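The proportions above can be reproduced from the counts. This is a minimal sketch that again reconstructs the response vector from the slide's frequency table; the object names are assumptions made for illustration.

```r
# Counts copied from the frequency table; x is a reconstructed response vector.
counts <- c("Android" = 458, "iPhone" = 437, "Blackberry" = 141,
            "Non Smartphone" = 924, "No cell phone" = 293)
x <- factor(rep(names(counts), times = counts), levels = names(counts))

# Relative frequency table: each count divided by the number of cases.
rel_freq <- table(x) / length(x)
print(round(rel_freq, 3))

# Proportion of sampled adults with no cell phone.
p_no_cell <- sum(x == "No cell phone") / length(x)
print(round(p_no_cell, 2))

# The relative frequencies add up to 1.
print(sum(rel_freq))
```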
9
Bar Chart/Plot/Graph
In a barplot, the height of the bar corresponds to the number of cases falling in each category.
R: barchart(x)
10
Pie Chart
In a pie chart, the relative area of each slice of the pie corresponds to the proportion in each category.
R: pie(table(x))
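Both plots can be sketched directly from the cell phone counts. Note the assumption: this uses base R's `barplot()` rather than the `barchart()` named on the slide, and it plots the named count vector instead of raw data.

```r
# Counts copied from the frequency table on the earlier slide.
counts <- c("Android" = 458, "iPhone" = 437, "Blackberry" = 141,
            "Non Smartphone" = 924, "No cell phone" = 293)

# Bar chart: bar height is the number of cases in each category.
barplot(counts, ylab = "Number of cases", main = "Type of cell phone owned")

# Pie chart: slice area is proportional to the share of each category.
pie(counts, main = "Type of cell phone owned")
```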
11
StatKey
www.lock5stat.com/statkey
12
Summary: One Categorical Variable
Summary Statistics: Proportion, Frequency table, Relative frequency table
Visualization: Bar chart, Pie chart
13
One Quantitative Variable
World gross for all 2011 Hollywood movies
HollywoodMovies2011
More graphics on profits for Hollywood movies
14
HollywoodMovies2011
15
Dotplot
In a dotplot, each case is represented by a dot and dots are stacked.
Easy way to see each case.
16
Histogram
The height of each bar corresponds to the number of cases within that range of the variable.
R: hist(x)
17
Histogram vs Bar Chart
This is a
a) Histogram
b) Bar chart
c) Other
d) I have no idea
18
Histogram vs Bar Chart
This is a
a) Histogram
b) Bar chart
c) Other
d) I have no idea
19
Histogram vs Bar Chart
A bar chart is for categorical data, and the x-axis has no numeric scale.
A histogram is for quantitative data, and the x-axis is numeric.
For a categorical variable, the number of bars equals the number of categories, and the number in each category is fixed.
For a quantitative variable, the number of bars in a histogram is up to you (or your software), and the appearance can differ with different numbers of bars.
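To illustrate how the choice of bars changes a histogram, the sketch below bins the same data two ways. The data are simulated (an assumption; the movie data are not included in the slides), and `plot = FALSE` is used so that only the bin counts are returned.

```r
set.seed(1)
# Simulated right-skewed values, standing in for a quantitative variable.
x <- rexp(200, rate = 1 / 20)

# Upper limit rounded up to a multiple of 50 so both sets of breaks cover the data.
upper <- ceiling(max(x) / 50) * 50

# Same data, two bin widths.
wide   <- hist(x, breaks = seq(0, upper, by = 50), plot = FALSE)
narrow <- hist(x, breaks = seq(0, upper, by = 10), plot = FALSE)

print(wide$counts)     # few bars, coarse picture
print(narrow$counts)   # five times as many bars, finer picture
```

Every case lands in exactly one bin either way, so only the appearance changes, not the data.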
20
Shape
[Figure panels: Symmetric; Left-Skewed; Right-Skewed (long right tail)]
21
Notation
The sample size, the number of cases in the sample, is denoted by n.
We often let x or y stand for any variable, and x_1, x_2, …, x_n represent the n values of the variable x.
x_1 = 97.009, x_2 = 201.897, x_3 = 216.196, …
22
Mean
The mean, x̄, is the sum of the data values divided by the sample size n.
R: mean(x)
23
Median
The median, m, is the middle value when the data are ordered.
If there are an even number of values, the median is the average of the two middle values.
The median splits the data in half.
R: median(x)
24
Measures of Center
World Gross (in millions): m = 76.66, x̄ = 150.74
The mean is “pulled” in the direction of skewness.
25
Skewness and Center
A distribution is left-skewed. Which measure of center would you expect to be higher?
a) Mean
b) Median
The mean will be pulled down toward the skewness (toward the long tail), so the median is expected to be higher.
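A small made-up example (the values are assumptions, not course data) shows the mean being pulled toward the long tail in both directions:

```r
# Made-up right-skewed sample: one large value forms a long right tail.
right_skewed <- c(1, 2, 2, 3, 3, 4, 5, 30)

# Mirror image of the same sample, giving a long left tail.
left_skewed <- 31 - right_skewed

# Right skew: the mean is pulled above the median.
print(c(mean = mean(right_skewed), median = median(right_skewed)))

# Left skew: the mean is pulled below the median.
print(c(mean = mean(left_skewed), median = median(left_skewed)))
```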
26
Outlier
An outlier is an observed value that is notably distinct from the other values in a dataset.
27
Outliers
World Gross (in millions)
[Labeled points: Harry Potter, Transformers, Pirates of the Caribbean]
28
Resistance
A statistic is resistant if it is relatively unaffected by extreme values.
The median is resistant while the mean is not.
                        Mean           Median
With Harry Potter       $150,742,300   $76,658,500
Without Harry Potter    $141,889,900   $75,009,000
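The same comparison can be sketched on a tiny hypothetical dataset. The values below are invented for illustration and are not taken from HollywoodMovies2011.

```r
# Hypothetical grosses in millions; the last value is an extreme one.
gross <- c(20, 35, 50, 75, 90, 120, 1300)
gross_trimmed <- gross[gross < 1000]   # same data without the extreme value

with_extreme    <- c(mean = mean(gross), median = median(gross))
without_extreme <- c(mean = mean(gross_trimmed), median = median(gross_trimmed))

print(with_extreme)
print(without_extreme)

# How far each statistic moves when the extreme value is dropped.
print(abs(with_extreme - without_extreme))
```

In this sketch the mean moves far more than the median, which is what "resistant" describes.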
29
Outliers
When using statistics that are not resistant to outliers, stop and think about whether the outlier is a mistake.
If not, you have to decide whether the outlier is part of your population of interest or not.
Usually, for outliers that are not a mistake, it’s best to run the analysis twice, once with the outlier(s) and once without, to see how much the outlier(s) are affecting the results.
30
Standard Deviation
The standard deviation for a quantitative variable measures the spread of the data.
Sample standard deviation: s
Population standard deviation: σ (“sigma”)
R: sd(x)
31
Standard Deviation
The standard deviation gives a rough estimate of the typical distance of a data value from the mean.
The larger the standard deviation, the more variability there is in the data and the more spread out the data are.
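The slides do not show a formula, so as a hedged sketch: the usual sample standard deviation is the square root of the sum of squared deviations from the mean divided by n - 1, which is what R's `sd()` computes. The data values below are made up.

```r
# Made-up sample.
x <- c(4, 8, 6, 5, 3, 7)
n <- length(x)

# Distance of each value from the mean.
deviations <- x - mean(x)

# Sample standard deviation computed by hand (denominator n - 1).
s_manual <- sqrt(sum(deviations^2) / (n - 1))

print(s_manual)
print(sd(x))   # built-in function, should agree with the manual value
```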
32
Standard Deviation
Both of these distributions are bell-shaped
33
95% Rule
If a distribution of data is approximately symmetric and bell-shaped, about 95% of the data should fall within two standard deviations of the mean.
For a population, 95% of the data will be between µ – 2σ and µ + 2σ.
StatKey
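A quick simulation can check the rule. This sketch assumes normally distributed data as the bell-shaped example; the mean of 100 and standard deviation of 15 are arbitrary choices.

```r
set.seed(42)
# Simulated bell-shaped data (assumption: normal with mean 100, sd 15).
x <- rnorm(10000, mean = 100, sd = 15)

# TRUE for values within two sample standard deviations of the sample mean.
within_two_sd <- abs(x - mean(x)) <= 2 * sd(x)

# Proportion of values inside that interval; expected to be close to 0.95.
prop_within <- mean(within_two_sd)
print(prop_within)
```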
34
The 95% Rule
35
The 95% Rule
The standard deviation for hours of sleep per night is closest to
a) ½
b) 1
c) 2
d) 4
e) I have no idea
36
z-score
37
z-score
A z-score puts values on a common scale.
A z-score is the number of standard deviations a value falls from the mean.
95% of all z-scores fall between what two values? -2 and 2
z-scores beyond -2 or 2 can be considered extreme.
38
z-score
Which is better, an ACT score of 28 or a combined SAT score of 2100?
ACT: mean = 21, standard deviation = 5
SAT: mean = 1500, standard deviation = 325
Assume ACT and SAT scores have approximately bell-shaped distributions.
a) ACT score of 28
b) SAT score of 2100
c) I don’t know
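One way to work through this question is to compute both z-scores from the means and standard deviations given on the slide. The helper name `z_score` is a hypothetical choice for this sketch.

```r
# z-score: number of standard deviations a value falls from the mean.
z_score <- function(value, center, spread) {
  (value - center) / spread
}

z_act <- z_score(28, center = 21, spread = 5)        # ACT score of 28
z_sat <- z_score(2100, center = 1500, spread = 325)  # SAT score of 2100

print(round(c(ACT = z_act, SAT = z_sat), 2))

# The score with the larger z-score is further above its mean.
print(z_sat > z_act)
```

Under these numbers the SAT score sits further above its mean in standard deviation units than the ACT score does.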
39
Other Measures of Location
Maximum = largest data value
Minimum = smallest data value
Quartiles:
Q1 = median of the values below m
Q3 = median of the values above m
40
Five Number Summary
Five Number Summary: Min, Q1, m, Q3, Max
[Figure: each of the four intervals between consecutive values is labeled 25%]
R: summary(x)
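The sketch below computes a five number summary by hand, using the quartile definition from the slides (medians of the values below and above m), on a made-up sample with an even number of values. For this sample `fivenum()` gives the same result. R's `summary()` interpolates quartiles differently and also reports the mean, so its output may not match exactly.

```r
# Made-up sample with an even number of values.
x <- sort(c(12, 4, 9, 2, 15, 7, 10, 5))

m  <- median(x)           # middle of the ordered data
q1 <- median(x[x < m])    # median of the values below m
q3 <- median(x[x > m])    # median of the values above m

five <- c(min = min(x), Q1 = q1, m = m, Q3 = q3, max = max(x))
print(five)

print(fivenum(x))   # agrees with the hand calculation for this sample
print(summary(x))   # quartiles may differ slightly; also shows the mean
```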
41
Five Number Summary
> summary(study_hours)
   Min. 1st Qu.  Median 3rd Qu.    Max.
   2.00   10.00   15.00   20.00   69.00
The distribution of number of hours spent studying each week is
a) Symmetric
b) Right-skewed
c) Left-skewed
d) Impossible to tell
42
Percentile
The Pth percentile is the value which is greater than P% of the data.
We already used z-scores to determine whether an SAT score of 2100 or an ACT score of 28 is better.
We could also have used percentiles:
ACT score of 28: 91st percentile
SAT score of 2100: 97th percentile
43
Five Number Summary
Five Number Summary as percentiles:
Min = 0th percentile
Q1 = 25th percentile
m = 50th percentile
Q3 = 75th percentile
Max = 100th percentile
[Figure: each of the four intervals between consecutive values is labeled 25%]
44
Measures of Spread
Range = Max – Min
Interquartile Range (IQR) = Q3 – Q1
Is the range resistant to outliers? a) Yes b) No
No: the range depends entirely on the most extreme values.
Is the IQR resistant to outliers? a) Yes b) No
Yes: the IQR is based on the middle 50% of the data, which will not contain outliers.
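As a rough check on those two answers, the sketch below adds one extreme value to a made-up sample and compares how much the range and the IQR change. `range_width` is a hypothetical helper, and R's `IQR()` uses its default quartile interpolation, which can differ slightly from the median-of-halves definition.

```r
# Made-up sample, then the same sample with one extreme value appended.
x <- c(2, 4, 5, 7, 9, 10, 12, 15)
x_out <- c(x, 150)

# Range as a single number: Max - Min.
range_width <- function(v) max(v) - min(v)

print(c(range = range_width(x), range_with_outlier = range_width(x_out)))
print(c(IQR = IQR(x), IQR_with_outlier = IQR(x_out)))

# Change in each measure caused by the extreme value.
range_change <- abs(range_width(x_out) - range_width(x))
iqr_change   <- abs(IQR(x_out) - IQR(x))
print(c(range_change = range_change, iqr_change = iqr_change))
```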
45
Comparing Statistics
Measures of Center:
Mean (not resistant)
Median (resistant)
Measures of Spread:
Standard deviation (not resistant)
IQR (resistant)
Range (not resistant)
Most often, we use the mean and the standard deviation, because they are calculated based on all the data values, so use all the available information.
46
Outliers
Outliers can be informally identified by looking at a plot, but one rule of thumb for identifying outliers is data values more than 1.5 IQRs beyond the quartiles.
A data value is an outlier if it is
Smaller than Q1 – 1.5(IQR)
or
Larger than Q3 + 1.5(IQR)
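The rule of thumb can be applied to the study-hours summary shown on an earlier slide (Min 2, Q1 10, Q3 20, Max 69). This is a minimal sketch; `is_outlier` and the other names are hypothetical.

```r
# Values taken from the summary(study_hours) output on the earlier slide.
min_val <- 2
q1 <- 10
q3 <- 20
max_val <- 69

iqr <- q3 - q1                    # interquartile range
lower_fence <- q1 - 1.5 * iqr     # values below this are flagged
upper_fence <- q3 + 1.5 * iqr     # values above this are flagged

# TRUE for values more than 1.5 IQRs beyond the quartiles.
is_outlier <- function(value) value < lower_fence | value > upper_fence

print(c(lower = lower_fence, upper = upper_fence))
print(is_outlier(c(min = min_val, max = max_val)))
```

With these numbers the fences are -5 and 35, so the maximum of 69 is flagged while the minimum of 2 is not.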
47
Boxplot
[Figure labels: Median, Q1, Q3, Outliers]
Lines (“whiskers”) extend from each quartile to the most extreme value that is not an outlier.
R: boxplot(x)
48
Boxplot
Which boxplot goes with the histogram of waiting times for the bus? (a) (b) (c)
The data do not show any low outliers.
49
StatKey
www.lock5stat.com/statkey
50
Summary: One Quantitative Variable
Summary Statistics:
Center: mean, median
Spread: standard deviation, range, IQR
Percentiles
5 number summary
Visualization: Dotplot, Histogram, Boxplot
Other concepts:
Shape: symmetric, skewed, bell-shaped
Outliers, resistance
z-scores
51
To Do
Read Sections 2.1, 2.2, 2.3, 2.4
Do Homework 1 (due Tuesday, 9/11)
If you haven’t already…
Get the textbook (at bookstore now)
Get a clicker and register it (due Tuesday, 9/11)