Describing Distributions with Numbers Chapter 2

What we will do: We are continuing our exploration of data. In the last chapter we depicted data graphically. Now we are going to look at how we can describe data using “summary” statistics. We will look at statistics that provide measures of central tendency. We will also look at statistics that provide measures of dispersion.

Sometimes Statistics are So Simple… Sometimes statistics are so simple we have to do something to make them look fancier than they are. Enter “The Mean”. The mean simply means taking the average of something. You all know how to do this. You add up the group, then you divide it by the number of items in the group.

But just to make sure you know I know what I am doing, I have a formula: $\bar{x} = \frac{\sum_{i=1}^{N} x_i}{N}$
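As a minimal sketch of the recipe above (add up the group, then divide by the number of items), here is the mean in Python. The `scores` list and the helper name `mean_by_hand` are invented for illustration; `statistics.fmean` is the standard-library equivalent.

```python
from statistics import fmean


def mean_by_hand(values):
    """Add up the group, then divide by the number of items in the group."""
    if not values:
        raise ValueError("the mean of an empty list is undefined")
    return sum(values) / len(values)


# Hypothetical data, made up for this illustration only.
scores = [4, 8, 15, 16, 23, 42]

print(mean_by_hand(scores))  # hand-rolled version
print(fmean(scores))         # what the software does for you
```

In practice you would call the library function; the hand-rolled version is only there to show there is nothing mysterious behind it.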

We may talk about these formulas but… Don’t worry, we may talk about the formulas that mathematically describe statistics so you can get a better understanding of how they work. I might also hand calculate a few to demonstrate this. But no one today hand calculates real data. Neither should you; that is why we have software.

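The Median: The Median is the midpoint of a distribution. Half the observations have values less than the median, half have values more. The formula looks like this: $(N + 1)/2$. Note that the formula gives the location of the median (the observation which has a value equal to the median), not its value. When N is even, the location falls halfway between two observations, and the median is the average of those two values.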
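Here is a small sketch of the difference between the location of the median and its value. It assumes the usual location rule, (N + 1) / 2 counted from the bottom of the sorted list, which is most likely what the slide's formula showed. The function names are hypothetical.

```python
def median_location(n):
    """1-based position of the median in a sorted list of n observations."""
    return (n + 1) / 2


def median_value(values):
    """Value found at the median location (average of the two middle values if n is even)."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("the median of an empty list is undefined")
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


# With N = 20 the location is 10.5: halfway between the 10th and 11th observations.
print(median_location(20))
print(median_value([5, 1, 3]))
print(median_value([1, 2, 3, 4]))
```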

Here is where Stem & Leaf Graphs can come in handy (N=20)

Mean and Median, which one? In general the Mean is more susceptible to distortion by: –abnormally large cases (in the language of the book, a distribution skewed to the right) –or abnormally small cases (in the language of the book, a distribution skewed to the left). For example, one Bill Gates among a thousand people will seriously distort the “Mean” income of this sample. However, it will have little or no impact on the “Median” income.
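The one-rich-person-among-a-thousand effect is easy to try for yourself. The income figures below are invented round numbers, not real data; they only show the direction and rough size of the distortion.

```python
from statistics import mean, median

# Hypothetical incomes: 999 people at 50,000 each, plus one extreme earner.
typical = [50_000] * 999
with_outlier = typical + [1_000_000_000]

print("without outlier:", mean(typical), median(typical))
print("with outlier:   ", mean(with_outlier), median(with_outlier))
```

The mean jumps to over a million while the median stays where the typical person is.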

Level of Measure Matters Also: You cannot take the mean of a categorical variable (one measured at the nominal or ordinal level). You can, however, calculate the median of a variable measured at the ordinal level. This is a good point to stop and remind you about the stupidity of machines. Unless the variables are tagged in the data set as to level of measure, your computer really won’t care and will happily chug along calculating even meaningless statistics such as the mean of your categorical variables.

One more: The Mode is the measure of central tendency for nominal data. It is simply the category with the largest number of cases.
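A short sketch of finding the mode by counting cases per category. The colour data and the numeric codes are hypothetical. The second half illustrates the earlier warning about machines: the computer will happily average category codes even though the result means nothing.

```python
from collections import Counter
from statistics import mean

# Hypothetical nominal variable: favourite colour of six respondents.
colours = ["red", "blue", "red", "green", "blue", "red"]

# The mode is the category with the largest number of cases.
counts = Counter(colours)
mode_category, mode_count = counts.most_common(1)[0]
print(mode_category, mode_count)

# Numeric codes attached to categories are labels, not quantities.
codes = {"red": 1, "blue": 2, "green": 3}
coded = [codes[c] for c in colours]
meaningless_mean = mean(coded)  # runs without complaint, but means nothing
print(meaningless_mean)
```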

If all we knew was how well the data clumped together… Even though the Median is less susceptible to distortion by an abnormally large or small case, it can still provide a very weak description of your data if the observations are widely dispersed. This is why we are often interested in the Quartiles

Just like the Median only smaller: Quartiles are just like the Median, only on a smaller scale. Instead of defining the midpoint of the distribution, they define the break-points between: –the first quarter and the second quarter (Q1) –the second quarter and the third quarter (which is the Median, by the way) –the third quarter and the fourth quarter (Q3)

The Five-Number Summary: Moore is very big on the use of the five-number summary to describe data concisely. It consists of: Minimum value, Q1, M, Q3, Maximum value.
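Below is a minimal sketch of a five-number summary. It assumes one common textbook rule for the quartiles: Q1 is the median of the observations below the median's location and Q3 is the median of those above it, leaving out the middle observation when N is odd. Software packages use several slightly different quartile rules, so their Q1 and Q3 may not match this exactly. The data and the function name are hypothetical.

```python
from statistics import median


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) using the median-of-halves rule."""
    ordered = sorted(values)
    n = len(ordered)
    if n < 2:
        raise ValueError("need at least two observations")
    half = n // 2
    lower = ordered[:half]
    # For odd n, skip the middle observation itself.
    upper = ordered[half + 1:] if n % 2 else ordered[half:]
    return (ordered[0], median(lower), median(ordered), median(upper), ordered[-1])


# Hypothetical data, made up for this illustration only.
data = [13, 7, 1, 9, 3, 5, 11]
print(five_number_summary(data))
```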

You can graphically depict this with a box plot. Fortunately, all the computer programs we are employing can easily generate both the numerical summary and the accompanying box plots. SPSS can generate all this and more using its “Frequencies” and “Explore” commands. Excel does the job just as nicely.

Here is an example of an SPSS Box plot for before tax income for men and women in Ontario from the Survey of Household Spending

Notice on the previous slide how the distance from the first quartile to the median and then to the third quartile is not necessarily symmetrical, and that the whiskers on the box plot are also not symmetrical. This is an indication of skew. Unlike the example in the book, my whiskers indicate not the max and min values but percentiles.

Here is the five number summary for Men and Women

Spotting outliers: Obviously our box plots provide an excellent way to spot outliers. A statistic that can also help is the “interquartile range” (IQR). This is just the distance between quartile one and quartile three (Q3 − Q1). When an observation lies more than 1 1/2 times the interquartile range above quartile three or below quartile one, it is often considered to be an outlier.
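The 1 1/2 times IQR rule can be sketched as follows. The data are hypothetical, and the quartiles are passed in by hand (here Q1 = 4 and Q3 = 12, which is what the median-of-halves rule gives for this list) because different programs compute quartiles slightly differently. The function names are made up for illustration.

```python
def outlier_fences(q1, q3, k=1.5):
    """Return the (lower, upper) cut-offs: k * IQR below Q1 and above Q3."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def find_outliers(values, q1, q3):
    """List the observations that fall outside the fences."""
    low, high = outlier_fences(q1, q3)
    return [x for x in values if x < low or x > high]


# Hypothetical data with one suspiciously large case.
data = [1, 3, 5, 7, 9, 11, 13, 40]
q1, q3 = 4, 12  # quartiles of this list by the median-of-halves rule

fences = outlier_fences(q1, q3)
outliers = find_outliers(data, q1, q3)
print(fences, outliers)
```

Remember the rule only flags suspects; whether a flagged case is an error or a genuine extreme value is still your call.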

While I used ratio level data… While I used ratio level data for my example of the five-number summary, it should be noted that there is nothing here (quartiles, Median, maximum, minimum value) that would not work with data measured at the interval or ordinal level

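Range: Along with quartiles (which work when data are measured at least at the ordinal level), we must also remember to look at the “Range”, the distance between the minimum and maximum values. Like the quartiles, it needs values that can be put in order, so it does not apply to data measured at the nominal level.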

Standard Deviation: The best way to describe Standard Deviation (notation S) is that it is the square root of Variance (notation S²). So why do you need variance? A bit of math, if you look at the formula in your book.

The Formula for S²: Variance is the sum of the squared distances of each observation from the mean, divided by N-1 (N-1 being the degrees of freedom): $S^2 = \frac{\sum_{i=1}^{N} (x_i - \bar{x})^2}{N - 1}$
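For those who like to see the steps, here is a hand-rolled sketch of the variance as just described (sum of squared distances from the mean, over N-1), with its square root alongside. The data and function names are hypothetical; `statistics.variance` is the library routine that computes the same N-1 version.

```python
import math
from statistics import variance


def sample_variance(values):
    """Sum of squared distances from the mean, divided by N - 1."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)


def sample_std(values):
    """Standard deviation: the square root of the variance."""
    return math.sqrt(sample_variance(values))


# Hypothetical data, made up for this illustration only.
data = [2, 4, 4, 4, 5, 5, 7, 9]
print(sample_variance(data), sample_std(data))
print(variance(data))  # library version, also uses N - 1
```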

The Formula for S² involves a squaring: We have to square these distances because otherwise the positive and negative distances from the mean would cancel out (they always sum to zero, in any distribution, not just a symmetrical one) and there would appear to be no variance. The problem with variance is that all that squaring produces numbers that are very large and not too intuitive to read on their own (though you will see later that variance is an important tool and even a building block for other things).
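A quick demonstration of why the squaring is needed. The list below is hypothetical and deliberately lopsided, to show that the raw distances from the mean cancel out for any list, not only a symmetrical one.

```python
# Hypothetical, deliberately non-symmetrical data.
data = [1, 2, 3, 10]

mean = sum(data) / len(data)                 # 4.0
deviations = [x - mean for x in data]        # distances from the mean, with sign
squared = [d ** 2 for d in deviations]       # squaring removes the sign

print(deviations, sum(deviations))  # the signed distances cancel out
print(squared, sum(squared))        # the squared distances do not
```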

Taking the square root produces a much more usable number (S). Quite simply, when you know $\bar{x}$ and S, you can go up and down a list of numbers and figure out which list is more concentrated about its mean, which is more diffuse, and which are similar.

If you want a quick example

List 1
Frequency   1   1   1   1   1   1   1   1   1   1   1
Value       0   1   2   3   4   5   6   7   8   9   10
N = 11, ∑ = 55, Mean = 5, S² = 11, S = 3.3

List 2
Frequency   1   1   1   1   1   1   1   1   1   1   1
Value       0   2   4   6   8   10  12  14  16  18  20
N = 11, ∑ = 110, Mean = 10, S² = 44, S = 6.6
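You can check the quick example with software rather than by hand. This sketch assumes the two lists are the values 0 through 10 and the even values 0 through 20, each occurring once, which is the reading consistent with the totals given (N = 11, sums of 55 and 110). The variable names are hypothetical.

```python
from statistics import mean, stdev, variance

# Assumed reading of the example: each value has a frequency of 1.
list_1 = list(range(0, 11))      # 0, 1, 2, ..., 10
list_2 = list(range(0, 21, 2))   # 0, 2, 4, ..., 20

for name, values in (("List 1", list_1), ("List 2", list_2)):
    print(name, len(values), sum(values), mean(values),
          variance(values), round(stdev(values), 1))
```

Doubling every value doubles the mean and S, and quadruples S², which is why the second list reads as twice as spread out.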

But once again, keep in mind… If the mean is susceptible to distortion from extreme values, S is doubly so due to all that squaring.
