Psy B07 Chapter 2Slide 1 DESCRIBING AND EXPLORING DATA.

1 Psy B07 Chapter 2Slide 1 DESCRIBING AND EXPLORING DATA

2 Psy B07 Chapter 2Slide 2 Outline  Plotting data  Grouping data  Terminology  Notation  Measures of Central Tendency  Measures of Variability  Properties of a Statistic

3 Psy B07 Chapter 2Slide 3 Plotting Data  Once a bunch of data has been collected, the raw numbers must be manipulated in some fashion to make them more informative.  Several options are available including plotting the data or calculating descriptive statistics

4 Psy B07 Chapter 2Slide 4 Plotting Data  Raw data of typical age and weight in a second year course (made-up data)
Age: 18 26 21 21 25 18 20 21 18 21 21 21 20 21 20 23 22 20 21 22 24 26 19 19
Weight: 107 115 108 111 163 119 119 200 178 135 143 113 103 166 112 151 192 135 117 138 137 161 117 142
Age (continued): 20 21 20 19 19 21 22 19 20 20 19 19 19 20 20 19 20 20 20 22 22 19 23 20
Weight (continued): 108 110 109 127 143 121 112 136 161 131 144 123 101 193 127 158 149 138 129 138 137 156 122 132

5 Psy B07 Chapter 2Slide 5 Plotting Data  Often, the first thing one does with a set of raw data is to plot frequency distributions.  Usually this is done by first creating a table of the frequencies broken down by values of the relevant variable, then the frequencies in the table are plotted in a histogram

6 Psy B07 Chapter 2Slide 6 Plotting Data  Example: Typical age in a second year course  Note: The frequencies in the adjacent table were calculated by simply counting the number of subjects having the specified value for the age variable
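The counting step described on this slide can be sketched in a few lines of Python. This is only an illustration: the ages are the made-up values from the raw-data slide, and the text histogram is a stand-in for the plotted one, which is not preserved in this transcript.

```python
from collections import Counter

# Ages from the raw-data slide (made-up data, 48 subjects).
ages = [18, 26, 21, 21, 25, 18, 20, 21, 18, 21, 21, 21,
        20, 21, 20, 23, 22, 20, 21, 22, 24, 26, 19, 19,
        20, 21, 20, 19, 19, 21, 22, 19, 20, 20, 19, 19,
        19, 20, 20, 19, 20, 20, 20, 22, 22, 19, 23, 20]

# Frequency table: count how many subjects have each age.
freq = Counter(ages)

# Print the table with one '*' per subject as a crude histogram.
for age in range(min(ages), max(ages) + 1):
    print(f"{age:>3} {freq[age]:>3} {'*' * freq[age]}")
```

Each row is one value of the age variable, which works here because age takes only nine distinct values.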

7 Psy B07 Chapter 2Slide 7 Plotting Data

8 Psy B07 Chapter 2Slide 8 Grouping Data  Plotting is easy when the variable of interest has a relatively small number of values (like our age variable did).  However, the values of a variable are sometimes more continuous, resulting in uninformative frequency plots if done in the above manner.

9 Psy B07 Chapter 2Slide 9 Grouping Data  For example, our weight variable ranges from 100 lb. to 200 lb. If we used the previously described technique, we would end up with about 100 bars, most of which would have a frequency of less than 2 or 3 (and many of which would have a frequency of zero).  We can get around this problem by grouping our values into bins. Try for around 10 bins with natural splits.
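As a rough sketch of the binning idea, the weights from the raw-data slide can be grouped into bins 10 lb. wide. The bin width and the bin boundaries (100-109, 110-119, and so on) are assumptions chosen for illustration; the grouping used on the original slide is not preserved here.

```python
from collections import Counter

# Weights from the raw-data slide (made-up data, 48 subjects).
weights = [107, 115, 108, 111, 163, 119, 119, 200, 178, 135, 143, 113,
           103, 166, 112, 151, 192, 135, 117, 138, 137, 161, 117, 142,
           108, 110, 109, 127, 143, 121, 112, 136, 161, 131, 144, 123,
           101, 193, 127, 158, 149, 138, 129, 138, 137, 156, 122, 132]

BIN_WIDTH = 10  # assumed width; try other values to see the shape change

# Map each weight to the lower edge of its bin, then count per bin.
bins = Counter((w // BIN_WIDTH) * BIN_WIDTH for w in weights)

for lower in range(min(bins), max(bins) + BIN_WIDTH, BIN_WIDTH):
    print(f"{lower}-{lower + BIN_WIDTH - 1}: {'*' * bins[lower]}")
```

Changing `BIN_WIDTH` to 5 or 25 gives a quick feel for how much the apparent shape of the distribution depends on that choice.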

10 Psy B07 Chapter 2Slide 10 Grouping Data

11 Psy B07 Chapter 2Slide 11 Grouping Data  Check out this demo, which shows how the bin width you select can affect the “look” of the data.  Here is another similar demonstration of the effects of bin width.  See section in text on cumulative frequency distributions

12 Psy B07 Chapter 2Slide 12 Terminology  Often, frequency histograms tend to have a roughly symmetrical bell-shape and such distributions are called normal or Gaussian

13 Psy B07 Chapter 2Slide 13 Terminology  Sometimes, the bell shape is not symmetrical  The term positive skew refers to the situation where the “tail” of the distribution is to the right, negative skew is when the “tail” is to the left

14 Psy B07 Chapter 2Slide 14 Terminology

15 Psy B07 Chapter 2Slide 15 Notation  Variables  When we describe a set of data corresponding to the values of some variable, we will refer to that set using a letter such as X or Y.  When we want to talk about specific data points within that set, we specify those points by adding a subscript to the letter, like X₁.

16 Psy B07 Chapter 2Slide 16 Notation  X = 5, 8, 12, 3, 6, 8, 7  X₁, X₂, X₃, X₄, X₅, X₆, X₇

17 Psy B07 Chapter 2Slide 17 Notation  The Greek letter sigma, which looks like Σ, means “add up” or “sum” whatever follows it.  Thus, ΣXᵢ means “add up all the Xᵢs.”  If we use the Xᵢs from the previous example, ΣXᵢ = 49 (or just ΣX).

18 Psy B07 Chapter 2Slide 18 Nasty Example

19 Psy B07 Chapter 2Slide 19 Nasty Example  ΣX = 360  ΣY = 336  Σ(X−Y) = 24  ΣX² = 26262  (ΣX)² = 129600

20 Psy B07 Chapter 2Slide 20 Your turn  Σ(XY) = 24283  (Σ(X−Y))² = 576  Σ(X²−Y²) = 2956
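The data table for the “nasty example” is not preserved in this transcript, so the sketch below uses the seven X values from the earlier notation slide together with a hypothetical Y (invented purely for illustration). What it is meant to show is that the position of Σ relative to squaring and subtraction changes the answer.

```python
# X comes from the notation slide; Y is a hypothetical variable made up here.
X = [5, 8, 12, 3, 6, 8, 7]
Y = [4, 6, 10, 3, 5, 9, 5]

sum_x = sum(X)                                   # ΣX
sum_x_squared = sum(x ** 2 for x in X)           # ΣX²: square each value, then add
squared_sum_x = sum(X) ** 2                      # (ΣX)²: add first, then square
sum_diff = sum(x - y for x, y in zip(X, Y))      # Σ(X−Y)
sum_xy = sum(x * y for x, y in zip(X, Y))        # Σ(XY): multiply pairs, then add
squared_sum_diff = sum_diff ** 2                 # (Σ(X−Y))²
sum_sq_diff = sum(x ** 2 - y ** 2 for x, y in zip(X, Y))  # Σ(X²−Y²)

print(sum_x, sum_x_squared, squared_sum_x)
print(sum_diff, sum_xy, squared_sum_diff, sum_sq_diff)
```

Note that ΣX² (391 here) and (ΣX)² (2401 here) are very different quantities, which is the usual trap in this notation.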

21 Psy B07 Chapter 2Slide 21 Notation  Sometimes things are made more complicated because letters (e.g., X) are sometimes used to refer to entire data sets (as opposed to single variables) and multiple subscripts are used to specify specific data points.

22 Psy B07 Chapter 2Slide 22 Notation  X₂₄ = 3  ΣX or ΣXᵢⱼ = 61

23 Psy B07 Chapter 2Slide 23 Measures of Central Tendency  While distributions provide an overall picture of some data set, it is sometimes desirable to represent the entire data set using descriptive statistics.  The first descriptive statistics we will discuss are those used to indicate where the centre of the distribution lies.

24 Psy B07 Chapter 2Slide 24 Measures of Central Tendency

25 Psy B07 Chapter 2Slide 25 Measures of Central Tendency  There are, in fact, three different measures of central tendency.  The first of these is called the mode.  The mode is simply the value of the relevant variable that occurs most often (i.e., has the highest frequency) in the sample.

26 Psy B07 Chapter 2Slide 26 Measures of Central Tendency  Note that if you have done a frequency histogram, you can often identify the mode simply by finding the value with the highest bar.  However, that will not work when grouping was performed prior to plotting the histogram (although you can still use the histogram to identify the modal group, just not the modal value)

27 Psy B07 Chapter 2Slide 27 Measures of Central Tendency  Create a non-grouped frequency table as described previously, then identify the value with the greatest frequency.  Example: Class height.
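A minimal sketch of that procedure: build the ungrouped frequency table, then pick the value (or values) with the greatest frequency. The class-height data is not preserved in this transcript, so the example reuses small data sets that appear elsewhere in these slides; the helper name `modes` is made up for this illustration.

```python
from collections import Counter

def modes(values):
    """Return every value that shares the highest frequency, in sorted order."""
    freq = Counter(values)          # ungrouped frequency table
    top = max(freq.values())        # greatest frequency in the table
    return sorted(v for v, count in freq.items() if count == top)

print(modes([5, 8, 12, 3, 6, 8, 7]))         # one mode
print(modes([1, 3, 3, 4, 4, 5, 6, 7, 12]))   # two values tie for the mode
```

Returning a list makes ties visible, since a sample can have more than one mode.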

28 Psy B07 Chapter 2Slide 28 Measures of Central Tendency  A second measure of central tendency is called the median.  The median is the point corresponding to the score that lies in the middle of the distribution (i.e., there are as many data points above the median as there are below the median).

29 Psy B07 Chapter 2Slide 29 Measures of Central Tendency  To find the median, the data points must first be sorted into either ascending or descending numerical order.  The position of the median value can then be calculated using the following formula: Median Location = (N + 1) / 2

30 Psy B07 Chapter 2Slide 30 Measures of Central Tendency 1) If there is an odd number of data points: (1, 3, 3, 4, 4, 5, 6, 7, 12)  Median Location = (9 + 1) / 2 = 5  The median is the item in the fifth position of the ordered data set, therefore the median is 4

31 Psy B07 Chapter 2Slide 31 Measures of Central Tendency 2) If there is an even number of data points: (1, 3, 3, 3, 5, 5, 6, 7)  Median Location = (8 + 1) / 2 = 4.5  We take the average of the two adjacent values (the fourth and fifth, 3 and 5), in this case giving us 4
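The two cases can be sketched as a single function, assuming the usual median-location formula (N + 1) / 2 with positions counted from 1. The function name `median_by_location` is made up for this illustration; Python's own `statistics.median` gives the same answers.

```python
import statistics

def median_by_location(values):
    """Median via the median-location rule (N + 1) / 2, positions counted from 1."""
    ordered = sorted(values)
    location = (len(ordered) + 1) / 2
    if location.is_integer():
        # Odd N: the location is a whole number, so take that item directly.
        return ordered[int(location) - 1]
    # Even N: the location falls halfway between two adjacent positions.
    lower = ordered[int(location) - 1]
    upper = ordered[int(location)]
    return (lower + upper) / 2

odd_example = [1, 3, 3, 4, 4, 5, 6, 7, 12]
even_example = [1, 3, 3, 3, 5, 5, 6, 7]
print(median_by_location(odd_example), statistics.median(odd_example))
print(median_by_location(even_example), statistics.median(even_example))
```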

32 Psy B07 Chapter 2Slide 32 Measures of Central Tendency  Finally, the most commonly used measure of central tendency is called the mean (denoted X̄ for a sample, and μ for a population).  The mean is the same as what most of us call the average, and it is calculated in the following manner: X̄ = ΣX / N

33 Psy B07 Chapter 2Slide 33 Measures of Central Tendency  For example, given the data set that we used to calculate the median (odd number example), the corresponding mean would be: X̄ = (1 + 3 + 3 + 4 + 4 + 5 + 6 + 7 + 12) / 9 = 45 / 9 = 5

34 Psy B07 Chapter 2Slide 34 Measures of Central Tendency  When a distribution is fairly symmetrical, the mean, median, and mode will be quite similar  However, when the underlying distribution is not symmetrical, the three measures of central tendency can be quite different
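A small comparison may make this concrete. The symmetric data set below is hypothetical (made up for this illustration); the skewed one is the odd-number example from the median slides, whose single large value of 12 produces a tail to the right.

```python
import statistics

symmetric = [1, 2, 3, 3, 3, 4, 5]            # hypothetical, symmetric about 3
skewed = [1, 3, 3, 4, 4, 5, 6, 7, 12]        # positively skewed (tail to the right)

for name, data in [("symmetric", symmetric), ("skewed", skewed)]:
    print(name,
          "mean =", statistics.mean(data),
          "median =", statistics.median(data),
          "mode(s) =", statistics.multimode(data))
```

In the symmetric set all three measures coincide; in the skewed set the mean is pulled toward the tail and lands above the median.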

35 Psy B07 Chapter 2Slide 35 Measures of Central Tendency  This raises the issue of which measure is best.  Note that if you were calculating these values, you would show all your steps (it’s good to be a prof!).

36 Psy B07 Chapter 2Slide 36 Measures of Central Tendency  Here is a demonstration that allows you to change a frequency histogram while simultaneously noting the effects of those changes on the mean versus the median.  As you use the demo, you should easily be able to think about how these changes are also affecting the mode, right?

37 Psy B07 Chapter 2Slide 37 Measures of Variability  In addition to knowing where the centre of the distribution is, it is often helpful to know the degree to which individual values cluster around the centre.  This is known as variability

38 Psy B07 Chapter 2Slide 38 Measures of Variability  There are various measures of variability, the most straightforward being the range of the sample: Highest value minus lowest value  While range provides a good first pass at variability, it is not the best measure because of its sensitivity to extreme scores (see text).

39 Psy B07 Chapter 2Slide 39 Measures of Variability  One approach to estimating variability is to directly measure the degree to which individual data points differ from the mean and then average those deviations.  This is known as the average deviation

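40 Psy B07 Chapter 2Slide 40 Measures of Variability  However, if we try to do this with real data, the result will always be zero, because the positive and negative deviations from the mean cancel exactly: Example: (2,3,3,4,4,6,6,12)  The mean is 5, so the deviations are −3, −2, −2, −1, −1, 1, 1, 7, which sum to 0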

41 Psy B07 Chapter 2Slide 41 Measures of Variability  One way to get around the problem with the average deviation is to use the absolute value of the differences, instead of the differences themselves.  The absolute value of some number is just the number without any sign: For Example: |-3| = 3 And: |+3| = 3

42 Psy B07 Chapter 2Slide 42 Measures of Variability  Thus, we could re-write and solve our average deviation question as follows: MAD = Σ|X − X̄| / N = (3 + 2 + 2 + 1 + 1 + 1 + 1 + 7) / 8 = 18 / 8 = 2.25  Therefore, this data set has a mean of 5, and a mean absolute deviation (MAD) of 2.25
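The arithmetic for this example can be checked with a short sketch. It uses the data set from the average-deviation slide and shows both the “zero problem” and the absolute-value fix; the variable names are made up for this illustration.

```python
X = [2, 3, 3, 4, 4, 6, 6, 12]

mean = sum(X) / len(X)
deviations = [x - mean for x in X]

# The plain average deviation is always zero: the signed deviations cancel.
average_deviation = sum(deviations) / len(X)

# Taking absolute values first removes the signs, so nothing cancels.
mad = sum(abs(d) for d in deviations) / len(X)

print("mean =", mean)
print("average deviation =", average_deviation)
print("MAD =", mad)
```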

43 Psy B07 Chapter 2Slide 43 Measures of Variability  Although the MAD is an acceptable measure of variability, the most commonly used measure is variance (denoted s² for a sample and σ² for a population) and its square root, termed the standard deviation (denoted s for a sample and σ for a population).

44 Psy B07 Chapter 2Slide 44 Measures of Variability  The computation of variance is also based on the basic notion of the average deviation. However, instead of getting around the “zero problem” by using absolute deviations (as in MAD), the “zero problem” is eliminated by squaring the differences from the mean

45 Psy B07 Chapter 2Slide 45 Measures of Variability  Example: (2,3,4,4,4,5,6,12)

46 Psy B07 Chapter 2Slide 46 Measures of Variability  To convert the variance into SD, we simply take a square root of it:
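The worked calculation on these slides is not preserved in the transcript, so the following is only a sketch using the example data (2,3,4,4,4,5,6,12). It computes the variance both ways, dividing by N and by N−1, because the transcript does not show which denominator the original slide used; the N−1 version matches Python's `statistics.variance`.

```python
import math
import statistics

X = [2, 3, 4, 4, 4, 5, 6, 12]
n = len(X)
mean = sum(X) / n

# Squaring each deviation removes the sign, so the sum cannot cancel to zero.
sum_of_squares = sum((x - mean) ** 2 for x in X)

population_variance = sum_of_squares / n        # divide by N
sample_variance = sum_of_squares / (n - 1)      # divide by N - 1
sample_sd = math.sqrt(sample_variance)          # SD is the square root of the variance

print("sum of squared deviations =", sum_of_squares)
print("variance (N) =", population_variance)
print("variance (N-1) =", sample_variance)
print("SD (N-1) =", sample_sd)
```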

47 Psy B07 Chapter 2Slide 47 Measures of Variability  This demonstration allows you to play with the mean and standard deviation of a distribution. Note that changing the mean of the distribution simply moves the entire distribution to the left or right without changing its shape. In contrast, changing the standard deviation alters the spread of the data but does not affect where the distribution is “centered”

48 Psy B07 Chapter 2Slide 48 Measures of Variability  Population vs. Sample  As mentioned, we usually deal with statistics, not parameters. σ² and σ are parameters. Their counterparts, when dealing with samples, are s² and s. The formulae are slightly different

49 Psy B07 Chapter 2Slide 49 Properties of a Statistic  So, the mean (X̄) and variance (s²) are the descriptive statistics that are most commonly used to represent the data points of some sample.  The real reason that they are the preferred measures of central tendency and variability is that they have certain properties as estimators of their corresponding population parameters, μ and σ².

50 Psy B07 Chapter 2Slide 50 Properties of a Statistic  Four properties are considered desirable in a population estimator: sufficiency, unbiasedness, efficiency, and resistance.  Both the mean and the variance are the best estimators in their class in terms of the first three of these four properties.  To understand these properties, you first need to understand a concept in statistics called the sampling distribution

51 Psy B07 Chapter 2Slide 51 Properties of a Statistic  We will discuss sampling distributions off and on throughout the course, and I only want to touch on the notion now.  Basically, the idea is this: in order to examine the properties of a statistic, we often want to take repeated samples from some population of data and calculate the relevant statistic on each sample. We can then look at the distribution of the statistic across these samples and ask a variety of questions about it.  Check out this demonstration, which I hope makes the concept of sampling distributions clearer.
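In place of the interactive demonstration, here is a rough simulation of the same idea. Everything about the setup is an assumption made for illustration: a hypothetical population of 10,000 weights between 100 and 200, samples of size 10 drawn with replacement, and 1,000 repetitions.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical population (not from the slides).
population = [rng.randint(100, 200) for _ in range(10_000)]
pop_mean = statistics.mean(population)

SAMPLE_SIZE = 10
N_SAMPLES = 1000

# Draw many samples and compute the statistic of interest (the mean) on each.
sample_means = [
    statistics.mean(rng.choices(population, k=SAMPLE_SIZE))
    for _ in range(N_SAMPLES)
]

# The collection of sample means is an empirical sampling distribution.
mean_of_means = statistics.mean(sample_means)
spread_of_means = statistics.pstdev(sample_means)

print("population mean:", pop_mean)
print("mean of sample means:", mean_of_means)
print("SD of sample means:", spread_of_means)
print("SD of population:", statistics.pstdev(population))
```

The sample means cluster around the population mean and vary much less than individual scores do, which is the pattern the later slides on unbiasedness and efficiency build on.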

52 Psy B07 Chapter 2Slide 52 Properties of a Statistic 1) Sufficiency  A sufficient statistic is one that makes use of all of the information in the sample to estimate its corresponding parameter.

53 Psy B07 Chapter 2Slide 53 Properties of a Statistic 2) Unbiasedness  A statistic is said to be an unbiased estimator if its expected value (i.e., the long-run average of the statistic across repeated samples, such as the mean of a number of sample means) is equal to the population parameter it is estimating.  Explanation of N-1 in s² formula.

54 Psy B07 Chapter 2Slide 54 Properties of a Statistic  Using this procedure, the mean can be shown to be an unbiased estimator (see p 47).  However, if the σ² formula is used to calculate s², it turns out to underestimate σ²

55 Psy B07 Chapter 2Slide 55 Properties of a Statistic  The reason for this bias is that, when we calculate s², we use X̄, an estimator of the population mean  The chances of X̄ being EXACTLY the same as μ are virtually nil, and squared deviations taken around X̄ are never larger in total than those taken around μ, which results in the bias  To compensate, we use N-1  Note that this is only true when calculating s². If you have a measurable population and you want to calculate σ², you use N in the denominator, not N-1
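The bias can be checked exactly on a toy case. The sketch below assumes a hypothetical four-value population and enumerates every possible sample of size 2 drawn with replacement, so the averages are exact expected values rather than simulation estimates.

```python
from fractions import Fraction
from itertools import product

population = [1, 2, 3, 4]  # hypothetical tiny population
N = len(population)
mu = Fraction(sum(population), N)
sigma_sq = sum((x - mu) ** 2 for x in population) / N  # true variance, N in denominator

def variance(sample, denominator):
    """Sum of squared deviations from the sample mean, divided by `denominator`."""
    sample_mean = Fraction(sum(sample), len(sample))
    return sum((x - sample_mean) ** 2 for x in sample) / denominator

n = 2
# Every equally likely sample of size n, drawn with replacement.
samples = list(product(population, repeat=n))

avg_with_n = sum(variance(s, n) for s in samples) / len(samples)
avg_with_n_minus_1 = sum(variance(s, n - 1) for s in samples) / len(samples)

print("true variance:", sigma_sq)
print("average estimate dividing by N:", avg_with_n)
print("average estimate dividing by N-1:", avg_with_n_minus_1)
```

Dividing by N gives estimates that average to half the true variance in this case (a factor of (n−1)/n), while dividing by N−1 averages to the true variance exactly.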

56 Psy B07 Chapter 2Slide 56 Properties of a Statistic  Degrees of Freedom  The mean of 6, 8, & 10 is 8.  If I allow you to change as many of these numbers as you want BUT the mean must stay 8, how many of the numbers are you free to vary?

57 Psy B07 Chapter 2Slide 57 Properties of a Statistic  The point of this exercise is that when the mean is fixed, it removes a degree of freedom from your sample -- this is like actually subtracting 1 from the number of observations in your sample.  It is for exactly this reason that we use N-1 in the denominator when we calculate s² (i.e., the calculation requires that the mean be fixed first which effectively removes -- fixes -- one of the data points).
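The exercise can be sketched directly: once the mean and all but one of the values are chosen, the last value is forced. The helper name `forced_last_value` is made up for this illustration.

```python
def forced_last_value(free_values, target_mean, n):
    """Given n - 1 freely chosen values, return the one value that keeps the mean fixed."""
    # The total must equal target_mean * n, so the last value is whatever is left over.
    return target_mean * n - sum(free_values)

# Original data: 6, 8, 10 with a mean of 8.
print(forced_last_value([6, 8], target_mean=8, n=3))     # the original third value
print(forced_last_value([1, 100], target_mean=8, n=3))   # any two values still force the third
```

With three numbers and a fixed mean, only two are free to vary, which is the N−1 in the denominator of s².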

58 Psy B07 Chapter 2Slide 58 Properties of a Statistic 3) Efficiency  The efficiency of a statistic is reflected in the variance that is observed when one examines the means of a bunch of independently chosen samples. The smaller the variance, the more efficient the statistic is said to be
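One way to see efficiency is to compare two estimators of the centre on the same samples. The simulation below is a sketch under assumptions not taken from the slides: a normal population with mean 100 and standard deviation 15, samples of size 25, and 2,000 repetitions.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the run is repeatable

def draw_sample(n):
    """Draw n independent scores from an assumed normal population (mean 100, SD 15)."""
    return [rng.gauss(100, 15) for _ in range(n)]

samples = [draw_sample(25) for _ in range(2000)]

# Variance of each statistic across the independently chosen samples.
var_of_means = statistics.pvariance([statistics.mean(s) for s in samples])
var_of_medians = statistics.pvariance([statistics.median(s) for s in samples])

print("variance of sample means:", var_of_means)
print("variance of sample medians:", var_of_medians)
```

For normally distributed data the sample means vary less from sample to sample than the sample medians do, so the mean is the more efficient of the two.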

59 Psy B07 Chapter 2Slide 59 Properties of a Statistic 4) Resistance  The resistance of an estimator refers to the degree to which that estimate is affected by extreme values.  As mentioned previously, both X̄ and s² are highly sensitive to extreme values

60 Psy B07 Chapter 2Slide 60 Properties of a Statistic 4) Resistance  Despite this, they are still the most commonly used estimates of the corresponding population parameters, mostly because of their superiority over other measures in terms of sufficiency, unbiasedness, and efficiency
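A quick illustration of (the lack of) resistance: take the variance example data and replace its largest value with a far more extreme one. The contaminated data set is hypothetical, made up for this sketch.

```python
import statistics

original = [2, 3, 4, 4, 4, 5, 6, 12]
contaminated = [2, 3, 4, 4, 4, 5, 6, 120]   # hypothetical: 12 replaced by an extreme 120

for name, data in [("original", original), ("contaminated", contaminated)]:
    print(name,
          "mean =", statistics.mean(data),
          "median =", statistics.median(data),
          "s^2 =", statistics.variance(data))
```

One extreme score moves the mean from 5 to 18.5 and inflates the variance dramatically, while the median stays at 4.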

