Statistics
Quiz Tomorrow Questions on anything? Comments? Fears? Ask away!!!
So….. We have now learned how to display CATEGORICAL data. Now… we will talk about quantitative data.
To learn how to display and describe quantitative data we will be using some baseball statistics. The following table shows the number of home runs in a single season for three well-known baseball players: Hank Aaron, Barry Bonds, and Babe Ruth.
Dot Plot
1. Label the horizontal axis with the name of the variable and title the graph.
2. Scale the axis based on the values of the variable.
3. Mark a dot (we’ll use x’s) above the number on the axis corresponding to each data value.
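The three steps can be sketched in a few lines of Python. This is only a rough text version, not how the graphs on the slides were made: the function name `dot_plot` is hypothetical, and the data values are made up for illustration (they are not the home run table).

```python
from collections import Counter

def dot_plot(values, label, title):
    """Return a text dot plot: one 'x' stacked above the axis per data value."""
    counts = Counter(values)
    low, high = min(values), max(values)
    axis_values = list(range(low, high + 1))   # scale the axis from min to max
    width = len(str(high)) + 1                 # column width so numbers do not touch
    lines = [title]
    # Build the rows from the tallest stack down to the axis.
    for level in range(max(counts.values()), 0, -1):
        row = "".join(
            ("x" if counts[v] >= level else " ").rjust(width) for v in axis_values
        )
        lines.append(row.rstrip())
    lines.append("".join(str(v).rjust(width) for v in axis_values))
    lines.append(label)                        # name of the variable under the axis
    return "\n".join(lines)

# Made-up values for illustration only (not the table from the slides).
sample = [3, 5, 5, 6, 6, 6, 8]
print(dot_plot(sample, "Hypothetical count", "Dot plot of made-up data"))
```

Each x sits above its value on the axis, so repeated values stack up and the shape of the distribution becomes visible.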
Describing a Distribution We describe a distribution using the acronym SOCS
Shape: We describe the shape of a distribution in one of two ways. 1. Symmetric/Approximately Symmetric
Shape: 2. Skewed: Right Skewed vs. Left Skewed. Notice that the direction of the “skew” is the same direction as the “tail”.
Outliers: These are observations that we would consider “unusual”: pieces of data that don’t “fit” the overall pattern of the data. Babe Ruth had two seasons that appear to be somewhat different from the rest of his career. These may be “outliers”. Unusual observation???
Outliers: The season in which Barry Bonds hit 73 home runs does not appear to fit the overall pattern. This piece of data may be an outlier. Unusual observation???
Center: A single value that describes the entire distribution. A “typical” value that gives a concise summary of the whole batch of numbers.
Center: A typical season for Babe Ruth appears to be approximately 46 home runs.
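A “typical” value can be computed as well as eyeballed. The slides do not say which measure of center is meant, so this sketch shows two common choices, the median and the mean. The season totals are made up for illustration (chosen to span 23 to 60); they are not the actual table.

```python
from statistics import mean, median

# Made-up season totals for illustration only (not the table from the slides).
sample_home_runs = [23, 25, 35, 41, 46, 46, 47, 54, 60]

typical_median = median(sample_home_runs)  # middle value of the sorted data
typical_mean = mean(sample_home_runs)      # arithmetic average

print(f"median = {typical_median}, mean = {typical_mean:.1f}")
```

For this made-up list the median is 46, while the two low seasons pull the mean down to about 41.9.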
Spread: Since not every value is typical, we also need to describe the variation of a distribution. Are the values tightly clustered around the center, making them easy to predict, or do they vary a great deal from the center, making prediction more difficult?
Spread: Babe Ruth’s number of home runs in a single season varies from a low of 23 to a high of 60.
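Describing spread as “from a low of ... to a high of ...” amounts to reporting the minimum and maximum. A minimal sketch, again with made-up season totals rather than the actual table:

```python
# Made-up season totals for illustration only (not the table from the slides).
sample_home_runs = [23, 25, 35, 41, 46, 46, 47, 54, 60]

low = min(sample_home_runs)
high = max(sample_home_runs)
data_range = high - low   # one simple number that summarizes the spread

print(f"varies from a low of {low} to a high of {high} (range = {data_range})")
```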
Distribution Description using SOCS
The distribution of Babe Ruth’s number of home runs in a single season is approximately symmetric (1), with two possible unusual observations at 23 and 25 home runs (2). He typically hits about 46 home runs in a season (3). Over his career, the number of home runs has varied from a low of 23 to a high of 60 (4).
(1) Shape (2) Outliers (3) Center (4) Spread
Creating a stem and leaf plot
1. Order the data points from least to greatest.
2. Separate each observation into a stem (all but the rightmost digit) and a leaf (the final digit). Ex. 123 -> 12 (stem) : 3 (leaf)
3. In a T-chart, write the stems vertically in increasing order on the left side of the chart.
4. On the right side of the chart, write each leaf to the right of its stem, spacing the leaves equally.
5. Include a key and title for the graph.
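The same steps can be sketched in Python. The function name `stem_and_leaf` is hypothetical, the data are made up for illustration, and printing empty stems (so gaps stay visible) is an assumption of this sketch rather than a rule stated on the slides.

```python
from collections import defaultdict

def stem_and_leaf(values, title):
    """Return a text stem and leaf plot for non-negative integers."""
    ordered = sorted(values)                  # order from least to greatest
    leaves = defaultdict(list)
    for value in ordered:
        stem, leaf = divmod(value, 10)        # stem = all but last digit, leaf = last digit
        leaves[stem].append(str(leaf))
    lines = [title]
    # Stems in increasing order; empty stems are printed too (assumption)
    # so that gaps in the data remain visible.
    for stem in range(min(leaves), max(leaves) + 1):
        lines.append(f"{stem:>2} | {' '.join(leaves.get(stem, []))}".rstrip())
    first = ordered[0]
    lines.append(f"Key: {first // 10} | {first % 10} = {first}")
    return "\n".join(lines)

# Made-up season totals for illustration only (not the table from the slides).
sample_home_runs = [46, 23, 60, 41, 25, 54, 46, 35, 47]
print(stem_and_leaf(sample_home_runs, "Made-up home runs in a single season"))
```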
Stem and Leaf Example: Number of Home Runs in a Single Season. Key: 4 | 6 = 46
Split Stem and Leaf Plot
If the data in a distribution are concentrated in just a few stems, the picture may be more descriptive if we “split” the stems. When we “split” stems, we want each new stem to allow the same number of possible leaf digits. This means that each original stem can be split into 2 or 5 new stems. A good rule of thumb is to have a minimum of 5 stems overall. Let’s look at how splitting stems changes the look of the distribution of Hank Aaron’s home run data.
Split Stem and Leaf Plot: Number of Home Runs in a Single Season
Split each stem into 2 new stems. This means that the first stem includes the leaves 0-4 and the second stem has the leaves 5-9. Splitting the stems helps us to “see” the shape of the distribution in this case.
Key: 4 | 6 = 46
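Splitting each stem in two can be sketched as follows. The function name `split_stem_and_leaf` is hypothetical and the data are made up for illustration; they are not Hank Aaron’s actual season totals.

```python
from collections import defaultdict

def split_stem_and_leaf(values, title):
    """Text stem and leaf plot with each stem split in two (leaves 0-4, then 5-9)."""
    ordered = sorted(values)
    rows = defaultdict(list)
    for value in ordered:
        stem, leaf = divmod(value, 10)
        half = 0 if leaf <= 4 else 1          # 0 holds leaves 0-4, 1 holds leaves 5-9
        rows[(stem, half)].append(str(leaf))
    last = max(rows)
    stem, half = min(rows)
    lines = [title]
    # Walk every half-stem from the lowest to the highest, printing empty ones too.
    while (stem, half) <= last:
        lines.append(f"{stem:>2} | {' '.join(rows.get((stem, half), []))}".rstrip())
        stem, half = (stem, 1) if half == 0 else (stem + 1, 0)
    first = ordered[0]
    lines.append(f"Key: {first // 10} | {first % 10} = {first}")
    return "\n".join(lines)

# Made-up season totals for illustration only (not the table from the slides).
sample = [21, 24, 26, 27, 30, 32, 34, 34, 38, 39, 40, 44, 44, 45, 47]
print(split_stem_and_leaf(sample, "Made-up home runs in a single season"))
```

With only three original stems (2, 3, 4), the unsplit plot would have three rows; splitting gives six, which meets the rule of thumb of at least 5 stems.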
Back-to-Back Stem and Leaf: Number of Home Runs in a Single Season
Back-to-back stem and leaf plots allow us to quickly compare two distributions. Use SOCS to make comparisons between distributions.
Key: 4 | 6 = 46
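A back-to-back plot shares one column of stems, with one data set’s leaves running left and the other’s running right. A rough sketch, assuming both lists are non-empty; the function name `back_to_back`, the player labels, and the data are all made up for illustration.

```python
from collections import defaultdict

def back_to_back(left_values, right_values, left_name, right_name):
    """Return a text back-to-back stem and leaf plot for two data sets."""
    left, right = defaultdict(list), defaultdict(list)
    for v in sorted(left_values):
        left[v // 10].append(str(v % 10))
    for v in sorted(right_values):
        right[v // 10].append(str(v % 10))
    # One shared column of stems covering both data sets.
    stems = range(min(min(left), min(right)), max(max(left), max(right)) + 1)
    # Left-hand leaves are reversed so they increase outward from the stem.
    left_rows = {s: " ".join(reversed(left.get(s, []))) for s in stems}
    width = max(len(left_name), max(len(r) for r in left_rows.values()))
    lines = [f"{left_name:>{width}} |    | {right_name}"]
    for s in stems:
        right_leaves = " ".join(right.get(s, []))
        lines.append(f"{left_rows[s]:>{width}} | {s:>2} | {right_leaves}".rstrip())
    return "\n".join(lines)

# Made-up season totals for two hypothetical players (not the table from the slides).
player_a = [23, 25, 35, 41, 46, 46, 47, 54, 60]
player_b = [21, 24, 26, 27, 30, 32, 34, 34, 38, 39, 40, 44, 44, 45, 47]
print(back_to_back(player_a, player_b, "Player A", "Player B"))
```

Reading across a row compares how many seasons each player had in that range, which makes differences in shape, center, and spread easy to spot.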
Advantages and Disadvantages of Dot Plots/Stem and Leaf Plots
Advantages
Preserve each piece of data
Show features of the distribution with regard to shape, such as clusters, gaps, and outliers
Disadvantages
If creating by hand, large data sets can be cumbersome
Data that are widely varied may be difficult to graph