Chapter 1 Stats Starts Here Copyright © 2009 Pearson Education, Inc.


2 Chapter 1 Stats Starts Here Copyright © 2009 Pearson Education, Inc.

3 Think, Show, Tell There are three simple steps to doing Statistics right: Think first. Know where you’re headed and why. Show is about the mechanics of calculating statistics and graphical displays, which are important (but are not the most important part of Statistics). Tell what you’ve learned. You must explain your results so that someone else can understand your conclusions.

4 Chapter 2 Data Copyright © 2009 Pearson Education, Inc.

5 The “W’s” of the data. To provide context we need the W’s of the data:
Who
What (and in what units)
When
Where
Why (if possible)
How
Note: the answers to “who” and “what” are essential.

6 Who The Who of the data tells us the individual cases about which (or whom) we have collected data. Individuals who answer a survey are called respondents. People on whom we experiment are called subjects or participants. Animals, plants, and inanimate subjects are called experimental units.

7 What and Why Variables are characteristics recorded about each individual. Each variable should have a name that identifies What has been measured. To understand variables, you must Think about what you want to know.

8 What and Why (cont.) A categorical (or qualitative) variable names categories and answers questions about how cases fall into those categories. Categorical examples: sex, race, ethnicity A quantitative variable is a measured variable (with units) that answers questions about the quantity of what is being measured. Quantitative examples: income ($), height (inches), weight (pounds)

9 Where, When, and How We need the Who, What, and Why to analyze data. But, the more we know, the more we understand. When and Where give us some nice information about the context. Example: Values recorded at a large public university may mean something different than similar values recorded at a small private college.

10 Where, When, and How (cont.)
How the data are collected can make the difference between insight and nonsense. Example: results from voluntary Internet surveys are often useless. The first step of any data analysis should be to examine the W’s; this is a key part of the Think step of any analysis. Make sure that you know the Why, Who, and What before you proceed with your analysis.

11 What Can Go Wrong? Don’t label a variable as categorical or quantitative without thinking about the question you want it to answer. Just because your variable’s values are numbers, don’t assume that it’s quantitative. Always be skeptical—don’t take data for granted.
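
A small Python sketch of the second warning, using hypothetical area codes (not data from the slides). The values are numbers, but they only label categories, so counting them makes sense and averaging them does not.

```python
from collections import Counter
from statistics import mean

# Hypothetical area codes: numeric labels, not measurements (they have no units).
area_codes = [212, 415, 212, 617, 415, 212]

# Treated as categorical: count the cases in each category.
counts = Counter(area_codes)
print(counts.most_common())

# Treated as quantitative: the arithmetic runs, but the result answers no
# sensible question, because there is no such thing as an average area code.
meaningless_average = mean(area_codes)
print(meaningless_average)
```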

12 What have we learned? (cont.)
We treat variables as categorical or quantitative. Categorical variables identify a category for each case. Quantitative variables record measurements or amounts of something and must have units. Some variables can be treated as categorical or quantitative depending on what we want to learn from them.

13 Chapter 3 Displaying and Describing Categorical Data Copyright © 2009 Pearson Education, Inc.

14 The Three Rules of Data Analysis
The three rules of data analysis won’t be difficult to remember: Make a picture—things may be revealed that are not obvious in the raw data. These will be things to think about. Make a picture—important features of and patterns in the data will show up. You may also see things that you did not expect. Make a picture—the best way to tell others about your data is with a well-chosen picture.

15 Frequency Tables: Making Piles
We can “pile” the data by counting the number of data values in each category of interest. We can organize these counts into a frequency table, which records the totals and the category names.

16 Frequency Tables: Making Piles (cont.)
A relative frequency table is similar, but gives the percentages (instead of counts) for each category.
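
The two kinds of table can be sketched in a few lines of Python. The list of ticket classes is a small hypothetical sample, and the function names are illustrative only.

```python
from collections import Counter

# Hypothetical data: one ticket class recorded per person.
classes = ["First", "Crew", "Third", "Crew", "Second", "Third", "Crew", "Third"]

def frequency_table(values):
    """Count how many data values fall in each category."""
    return dict(Counter(values))

def relative_frequency_table(values):
    """Express each category's count as a percentage of all cases."""
    counts = frequency_table(values)
    total = sum(counts.values())
    return {category: 100 * count / total for category, count in counts.items()}

freq = frequency_table(classes)
rel_freq = relative_frequency_table(classes)

# Print category, count, and percentage side by side.
for category in freq:
    print(f"{category:<8}{freq[category]:>4}{rel_freq[category]:>8.1f}%")
```

The percentages in a relative frequency table always add up to 100, which is a quick check on the arithmetic.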

17 Bar Charts A bar chart displays the distribution of a categorical variable, showing the counts for each category next to each other for easy comparison. A bar chart stays true to the area principle. Thus, a better display for the ship data is:

18 Bar Charts (cont.) A relative frequency bar chart displays the relative proportion of counts for each category. A relative frequency bar chart also stays true to the area principle. Replacing counts with percentages in the ship data:

19 Pie Charts When you are interested in parts of the whole, a pie chart might be your display of choice. Pie charts show the whole group of cases as a circle. They slice the circle into pieces whose size is proportional to the fraction of the whole in each category.

20 Contingency Tables (cont.)
The margins of the table, both on the right and on the bottom, give totals and the frequency distributions for each of the variables. Each frequency distribution is called a marginal distribution of its respective variable. The marginal distribution of Survival is:

21 Contingency Tables (cont.)
Each cell of the table gives the count for a combination of values of the two variables. For example, the second cell in the crew column tells us that 673 crew members died when the Titanic sank.
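
The table itself is not reproduced in this transcript. The sketch below shows how cells and margins relate, assuming the commonly cited Titanic counts. Only the 673 crew deaths is stated in the text; the other counts are an assumption.

```python
# Counts by Survival (rows) and ticket Class (columns).
# Assumption: commonly cited Titanic counts; only the 673 crew deaths
# appears in the transcript itself.
table = {
    "Alive": {"First": 203, "Second": 118, "Third": 178, "Crew": 212},
    "Dead":  {"First": 122, "Second": 167, "Third": 528, "Crew": 673},
}

# Right margin: row totals give the marginal distribution of Survival.
survival_margin = {status: sum(row.values()) for status, row in table.items()}

# Bottom margin: column totals give the marginal distribution of Class.
class_margin = {}
for row in table.values():
    for ticket_class, count in row.items():
        class_margin[ticket_class] = class_margin.get(ticket_class, 0) + count

grand_total = sum(survival_margin.values())

# A single cell: the count for one combination of the two variables.
crew_deaths = table["Dead"]["Crew"]

print(survival_margin, class_margin, grand_total, crew_deaths)
```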

22 Conditional Distributions
A conditional distribution shows the distribution of one variable for just the individuals who satisfy some condition on another variable. The following is the conditional distribution of ticket Class, conditional on having survived:
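
As a rough sketch of the calculation, a conditional distribution restricts attention to one row of the contingency table and converts its counts to percentages of that row's total. The survivor counts below are the commonly cited Titanic figures and are an assumption, not values taken from this transcript.

```python
# Assumption: commonly cited counts of Titanic survivors by ticket class.
survivors_by_class = {"First": 203, "Second": 118, "Third": 178, "Crew": 212}

def conditional_distribution(row):
    """Percentages within one row: the distribution of the column variable
    for just the individuals who satisfy the row's condition."""
    total = sum(row.values())
    return {category: 100 * count / total for category, count in row.items()}

class_given_survived = conditional_distribution(survivors_by_class)

for ticket_class, percent in class_given_survived.items():
    print(f"{ticket_class:<8}{percent:6.1f}%")
```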

23 Conditional Distributions (cont.)
The conditional distributions tell us that there is a difference in class for those who survived and those who perished. This is better shown with pie charts of the two distributions:

24 Conditional Distributions (cont.)
We see that the distribution of Class for the survivors is different from that of the nonsurvivors. This leads us to believe that Class and Survival are associated, that they are not independent. The variables would be considered independent when the distribution of one variable in a contingency table is the same for all categories of the other variable.
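
The comparison can be sketched in Python by computing the conditional distribution for each row and checking whether the rows agree. The Titanic counts are the commonly cited figures (an assumption), the second table is hypothetical, and the one-percentage-point tolerance is an arbitrary illustrative cutoff, not a formal test of independence.

```python
def conditional_percentages(row):
    """Convert one row of counts to percentages of the row total."""
    total = sum(row.values())
    return {category: 100 * count / total for category, count in row.items()}

def looks_independent(table, tolerance=1.0):
    """True when every row's conditional distribution matches the first row's
    to within `tolerance` percentage points (a hypothetical cutoff)."""
    distributions = [conditional_percentages(row) for row in table.values()]
    reference = distributions[0]
    return all(
        abs(dist[category] - reference[category]) <= tolerance
        for dist in distributions[1:]
        for category in reference
    )

# Assumption: commonly cited Titanic counts.
titanic = {
    "Alive": {"First": 203, "Second": 118, "Third": 178, "Crew": 212},
    "Dead":  {"First": 122, "Second": 167, "Third": 528, "Crew": 673},
}

# Hypothetical table in which both groups split 25% / 75%.
balanced = {
    "Group A": {"Yes": 10, "No": 30},
    "Group B": {"Yes": 20, "No": 60},
}

titanic_independent = looks_independent(titanic)
balanced_independent = looks_independent(balanced)
print(titanic_independent, balanced_independent)
```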

25 Segmented Bar Charts A segmented bar chart displays the same information as a pie chart, but in the form of bars instead of circles. Here is the segmented bar chart for ticket Class by Survival status:

26 What Can Go Wrong? (cont.)
Be sure to use enough individuals! Do not make a report like “We found that 66.67% of the rats improved their performance with training. The other rat died.”

27 What Can Go Wrong? (cont.)
Don’t overstate your case—don’t claim something you can’t. Don’t use unfair or silly averages—this could lead to Simpson’s Paradox, so be careful when you average one variable across different levels of a second variable.
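
Simpson’s Paradox is easier to see with numbers. The following sketch uses hypothetical on-time counts for two pilots: one pilot has the better rate under each condition, yet the other has the better overall rate, because the two fly very different mixes of day and night flights.

```python
# Hypothetical counts: (on-time flights, total flights) per condition.
records = {
    "Moe":  {"day": (90, 100), "night": (10, 20)},
    "Jill": {"day": (19, 20),  "night": (75, 100)},
}

def rate(on_time, flights):
    """On-time percentage."""
    return 100 * on_time / flights

def overall_rate(by_condition):
    """Pool the counts across conditions, then compute one percentage."""
    on_time = sum(o for o, _ in by_condition.values())
    flights = sum(f for _, f in by_condition.values())
    return rate(on_time, flights)

for pilot, by_condition in records.items():
    day = rate(*by_condition["day"])
    night = rate(*by_condition["night"])
    print(f"{pilot:<5} day {day:5.1f}%  night {night:5.1f}%  "
          f"overall {overall_rate(by_condition):5.1f}%")
```

Jill is better both by day and by night, but Moe looks better overall because most of his flights are daytime flights, where on-time rates are higher for both pilots.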

28 What have we learned? We can summarize categorical data by counting the number of cases in each category (expressing these as counts or percents). We can display the distribution in a bar chart or pie chart. And, we can examine two-way tables called contingency tables, examining marginal and/or conditional distributions of the variables. If conditional distributions of one variable are the same for every category of the other, the variables are independent.

29 Chapter 4 Displaying and Summarizing Quantitative Data Copyright © 2009 Pearson Education, Inc.

30 Histograms: Earthquake Magnitudes (cont.)
A histogram plots the bin counts as the heights of bars (like a bar chart). Here is a histogram of earthquake magnitudes:

31 Histograms: Earthquake magnitudes (cont.)
A relative frequency histogram displays the percentage of cases in each bin instead of the count. In this way, relative frequency histograms are faithful to the area principle. Here is a relative frequency histogram of earthquake magnitudes:
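
The figures are not reproduced in this transcript. As a rough sketch of what lies behind them, the code below computes bin counts and the matching percentages for a hypothetical list of magnitudes, with an assumed bin width of 1.0.

```python
from collections import Counter

# Hypothetical magnitudes (not the data behind the slides).
magnitudes = [6.1, 6.4, 6.7, 7.0, 7.0, 7.2, 7.3, 7.6, 8.1, 9.0]

def bin_counts(values, start, width):
    """Count values in equal-width bins [start + k*width, start + (k+1)*width),
    keyed by each bin's left edge."""
    counts = Counter()
    for value in values:
        k = int((value - start) // width)
        counts[start + k * width] += 1
    return dict(sorted(counts.items()))

counts = bin_counts(magnitudes, start=6.0, width=1.0)

# Relative frequency: the percentage of cases in each bin instead of the count.
total = sum(counts.values())
percents = {left_edge: 100 * count / total for left_edge, count in counts.items()}

# A crude text histogram: bar length equals the bin count.
for left_edge, count in counts.items():
    print(f"{left_edge:4.1f} | {'#' * count} ({percents[left_edge]:.0f}%)")
```

The two displays have the same shape; only the scale on the vertical axis changes.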

32 Stem-and-Leaf Example
Compare the histogram and stem-and-leaf display for the pulse rates of 24 women at a health clinic. Which graphical display do you prefer?

33 Dotplots A dotplot is a simple display. It just places a dot along an axis for each case in the data. The dotplot to the right shows Kentucky Derby winning times, plotting each race as its own dot. You might see a dotplot displayed horizontally or vertically.

34 Shape, Center, and Spread
When describing a distribution, make sure to always tell about three things: shape, center, and spread…

35 What is the Shape of the Distribution?
Does the histogram have a single, central hump or several separated humps? Is the histogram symmetric? Do any unusual features stick out?

36 Symmetry Is the histogram symmetric?
If you can fold the histogram along a vertical line through the middle and have the edges match pretty closely, the histogram is symmetric.

37 Symmetry (cont.) The (usually) thinner ends of a distribution are called the tails. If one tail stretches out farther than the other, the histogram is said to be skewed to the side of the longer tail. In the figure below, the histogram on the left is said to be skewed left, while the histogram on the right is said to be skewed right.

38 Anything Unusual? Do any unusual features stick out?
Sometimes it’s the unusual features that tell us something interesting or exciting about the data. You should always mention any stragglers, or outliers, that stand off away from the body of the distribution. Are there any gaps in the distribution? If so, we might have data from more than one group.

39 Center of a Distribution – Median
The median is the value with exactly half the data values below it and half above it. It is the middle data value (once the data values have been ordered) that divides the histogram into two equal areas. It has the same units as the data.

40 Spread: Home on the Range
Always report a measure of spread along with a measure of center when describing a distribution numerically. The range of the data is the difference between the maximum and minimum values: Range = max – min. A disadvantage of the range is that a single extreme value can make it very large and, thus, not representative of the data overall.

41 Spread: The Interquartile Range (cont.)
Quartiles divide the data into four equal sections. One quarter of the data lies below the lower quartile, Q1. One quarter of the data lies above the upper quartile, Q3. The difference between the quartiles is the interquartile range (IQR), so IQR = upper quartile – lower quartile.

42 5-Number Summary The 5-number summary of a distribution reports its median, quartiles, and extremes (maximum and minimum). The 5-number summary for the recent tsunami earthquake magnitudes looks like this:
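
The summary table is not reproduced in this transcript. The sketch below computes a 5-number summary, the range, and the IQR for a hypothetical data set. Quartile conventions differ between textbooks and software. This version takes the median of each half of the ordered data and, when the number of values is odd, leaves the middle value out of both halves. That choice is an assumption.

```python
from statistics import median

def five_number_summary(values):
    """Min, Q1, median, Q3, max, with quartiles as medians of the two halves.
    Assumption: for an odd count, the middle value is excluded from both halves."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]            # values below the median position
    upper = data[half + n % 2:]    # values above the median position
    return {
        "min": data[0],
        "Q1": median(lower),
        "median": median(data),
        "Q3": median(upper),
        "max": data[-1],
    }

# Hypothetical values with one straggler at the high end.
values = [56, 60, 64, 68, 72, 76, 80, 120]

summary = five_number_summary(values)
data_range = summary["max"] - summary["min"]
iqr = summary["Q3"] - summary["Q1"]

print(summary, data_range, iqr)
```

The single value of 120 stretches the range to 64, while the IQR stays at 16, which illustrates why the IQR is usually the more representative measure of spread.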

43 Summarizing Symmetric Distributions – The Mean
When we have symmetric data, there is an alternative to the median. If we want to calculate a number, we can average the data. We use the Greek letter sigma to mean “sum” and write: $\bar{y} = \frac{\sum y}{n}$. The formula says that to find the mean, we add up the numbers and divide by n.

44 Summarizing Symmetric Distributions – The Mean (cont)
Because the median considers only the order of values, it is resistant to values that are extraordinarily large or small; it simply notes that they are one of the “big ones” or “small ones” and ignores their distance from center. To choose between the mean and median, start by looking at the data. If the histogram is symmetric and there are no outliers, use the mean. However, if the histogram is skewed or has outliers, you are better off with the median.
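
A short Python sketch of that resistance, using hypothetical values: replacing the largest value with an extreme one moves the mean a long way but leaves the median where it was.

```python
from statistics import mean, median

# Hypothetical data, with and without one extreme value.
typical = [10, 12, 13, 15, 20]
with_outlier = [10, 12, 13, 15, 200]

typical_mean, typical_median = mean(typical), median(typical)
outlier_mean, outlier_median = mean(with_outlier), median(with_outlier)

print(f"typical:      mean {typical_mean}, median {typical_median}")
print(f"with outlier: mean {outlier_mean}, median {outlier_median}")
```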

45 What About Spread? The Standard Deviation
A more powerful measure of spread than the IQR is the standard deviation, which takes into account how far each data value is from the mean. A deviation is the distance that a data value is from the mean. Since adding all deviations together would total zero, we square each deviation and find an average of sorts for the deviations.

46 What About Spread? The Standard Deviation (cont.)
The variance, notated by $s^2$, is found by summing the squared deviations and (almost) averaging them: $s^2 = \frac{\sum (y - \bar{y})^2}{n - 1}$. The variance will play a role later in our study, but it is problematic as a measure of spread: it is measured in squared units!

47 What About Spread? The Standard Deviation (cont.)
The standard deviation, s, is just the square root of the variance and is measured in the same units as the original data.
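
The steps from deviations to standard deviation can be traced in a few lines of Python. The data values are hypothetical, and the division by n − 1 is the “(almost) averaging” described above. The result is compared against the standard library’s `statistics.variance` and `statistics.stdev` as a check.

```python
from math import isclose, sqrt
from statistics import mean, stdev, variance

# Hypothetical data values.
values = [2, 4, 4, 4, 5, 5, 7, 9]

y_bar = mean(values)

# Deviations from the mean always sum to zero, which is why we square them.
deviations = [y - y_bar for y in values]
deviation_total = sum(deviations)

# Variance: sum of squared deviations divided by n - 1 (in squared units).
s_squared = sum(d ** 2 for d in deviations) / (len(values) - 1)

# Standard deviation: square root of the variance (back in the original units).
s = sqrt(s_squared)

# Cross-check against the standard library's sample variance and deviation.
matches_library = isclose(s_squared, variance(values)) and isclose(s, stdev(values))

print(y_bar, deviation_total, s_squared, s, matches_library)
```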

48 Tell - Shape, Center, and Spread
Next, always report the shape of the distribution, along with a center and a spread. If the shape is skewed, report the median and IQR. If the shape is symmetric, report the mean and standard deviation and possibly the median and IQR as well.

49 Tell - What About Unusual Features?
If there are multiple modes, try to understand why. If you identify a reason for the separate modes, it may be good to split the data into two groups. If there are any clear outliers and you are reporting the mean and standard deviation, report them with the outliers present and with the outliers removed. The differences may be quite revealing.

50 Chapter 5 Understanding and Comparing Distributions Copyright © 2009 Pearson Education, Inc.

51 The Big Picture We can answer much more interesting questions about variables when we compare distributions for different groups. Below is a histogram of the Average Wind Speed for every day in 1989.

