Description and measurement



Presentation on theme: "Description and measurement"— Presentation transcript:

1 Description and measurement
Dr Kwang Lee 03/05/2013

2 Outline
1. Concepts of scale of measurement (types of data, e.g. categorical, continuous)
2. Sampling methods, frequency and probability distributions
3. Summary statistics and graphs: outliers, stem-and-leaf plots, box plots, scattergrams

3 Scales of Measurement: categorical data
Nominal scale: labels represent the levels of a categorical variable, with no inherent order. Examples: gender, ethnicity, marital status. Statistical test: chi-square.
Ordinal scale: labels represent an order that indicates either preference or ranking. Example: quality of food rated 0, 1, or 2. Statistical tests: Spearman's rank-order correlation (rho), Mann-Whitney U.
In short: nominal is unordered (male, female); ordinal is ordered (a food quality score).

4 Scales of Measurement: continuous data
Interval scale: numerical labels indicate order and distance between elements. There is no absolute zero, so multiples of measures are not meaningful. Examples: most personality measures and scale scores. Statistical tests: t-test, ANOVA, regression, factor analysis, etc.
Ratio scale: numerical labels indicate order and distance between elements. There is an absolute zero, so multiples of measures are meaningful. Examples: length or distance in centimeters or inches.

5 Ordinal vs. interval scale
Most personality measures & scale scores

6 Classify the data according to the level of measurement.
1. Temperature, 2. Salary, 3. Time, 4. Postcode, 5. Grade
A) interval, nominal, interval, ratio, interval
B) nominal, ratio, interval, ordinal, ratio
C) ratio, ordinal, ordinal, interval, ratio
D) interval, ratio, ratio, nominal, ordinal

7 A study was conducted to investigate the effect of a coal-fired generating plant on the water quality of a river. As part of an environmental impact study, fish were captured, tagged, and released. The following information was recorded for each fish: sex (0 = female, 1 = male), length (cm), maturation (0 = young, 1 = adult), weight (g). The scale of these variables is:
(a) nominal, ratio, nominal, ratio
(b) nominal, interval, ordinal, ratio
(c) nominal, ratio, ordinal, ratio
(d) ordinal, ratio, nominal, ratio
(e) ordinal, interval, ordinal, ratio

8 Descriptive statistics
Methods of organising, summarising, and presenting data in a convenient and informative way. These methods include numerical techniques and graphical techniques. The actual method used depends on what information you would like to extract. Are you interested in measures of central location, measures of variability (dispersion), or both?

9 Measures of central location

10 MEAN
The mean is probably the most common indicator of central tendency. It is defined as the arithmetic average of all values: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$, where $n$ is the sample size.
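As a minimal sketch of the definition above, the arithmetic mean can be computed directly as the sum of the values divided by the sample size. The function name `arithmetic_mean` is an illustrative choice, and the sample is the small data set the Mode slide uses.

```python
def arithmetic_mean(values):
    """Arithmetic average: sum of the values divided by the sample size n."""
    n = len(values)
    if n == 0:
        raise ValueError("the mean of an empty sample is undefined")
    return sum(values) / n


# Data set borrowed from the Mode slide; any numeric sample would do.
sample = [4, 7, 7, 7, 8, 8, 9]
print(arithmetic_mean(sample))
```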

11 Median – a different kind of average
The median is the “middle value”. To find it, order the data. When n is odd, take the middle value. When n is even, average the two middle values; for example, if the two middle values are 27 and 28, the median is 27.5.
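The odd/even rule can be sketched in a few lines. The slide's own data values did not survive in this transcript, so the four-value sample below is hypothetical, chosen only so that its two middle values are 27 and 28.

```python
def median(values):
    """Middle value of the ordered data (average of the two middle values if n is even)."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("the median of an empty sample is undefined")
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                        # odd n: the single middle value
    return (ordered[mid - 1] + ordered[mid]) / 2   # even n: average the two middle values


# Hypothetical sample whose two middle values are 27 and 28.
print(median([30, 25, 28, 27]))
```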

12 Median is “robust”
Robust means resistant to skew and outliers. In the slide's example, one data set has a mean (x̄) of 1600, and a second data set, which contains an outlier, has a mean of 2743. The median is 1614 in both instances: it was not influenced by the outlier.
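A small illustration of this robustness, using Python's standard `statistics` module. The numbers are hypothetical (the slide's original data are not reproduced in this transcript): replacing one value with an outlier moves the mean from 14 to 50 while the median stays at 13.

```python
from statistics import mean, median

# Hypothetical values, not the slide's data.
clean = [10, 12, 13, 15, 20]
with_outlier = [10, 12, 13, 15, 200]  # same data, last value replaced by an outlier

print("mean:  ", mean(clean), "->", mean(with_outlier))
print("median:", median(clean), "->", median(with_outlier))
```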

13 Mode
The mode is the value with the greatest frequency, e.g. {4, 7, 7, 7, 8, 8, 9} has mode = 7. It is used only in very large data sets, and less frequently than the mean or the median.
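One way to find the mode is to count how often each value occurs. The sketch below returns every value tied for the greatest frequency, which is a design choice of this example rather than something the slide specifies.

```python
from collections import Counter


def modes(values):
    """Return every value that occurs with the greatest frequency, in sorted order."""
    counts = Counter(values)
    if not counts:
        return []
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)


print(modes([4, 7, 7, 7, 8, 8, 9]))  # [7]
```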

14 Mean, Median, Mode
Symmetrical data: mean = median
Positive skew: mean > median (the mean gets “pulled” by the tail)
Negative skew: mean < median

15 Measures of variability

16 Range
The simplest way to describe the spread of a data set is to quote the minimum (lowest) and maximum (highest) values, e.g. minimum 116, maximum 170, range 54. The range is affected by extreme values.
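A short sketch of the range and of its sensitivity to extreme values. The individual measurements are hypothetical; only the minimum (116) and maximum (170) come from the slide.

```python
def value_range(values):
    """Range = maximum value minus minimum value."""
    return max(values) - min(values)


# Hypothetical measurements chosen to match the slide's minimum and maximum.
measurements = [116, 131, 142, 158, 170]
print(value_range(measurements))          # 54
print(value_range(measurements + [250]))  # 134: one extreme value inflates the range
```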

17 Quartiles Quartiles divide the values of a data set into four subsets of equal size, each comprising 25% of the observations.

18 Inter-Quartile Range
IQR = Q3 - Q1. The middle 50% of the observations lie between Q1 and Q3, with 25% below Q1 and 25% above Q3.
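The quartiles and IQR can be computed with `statistics.quantiles` from the Python standard library. Several quartile conventions exist and give slightly different cut points; the `"inclusive"` method used here is one assumption among them, not necessarily the convention the slides intend. The data are the eleven values from the stem-and-leaf example.

```python
from statistics import quantiles

# The eleven values from the stem-and-leaf slides.
data = [12, 32, 22, 16, 24, 34, 12, 10, 25, 30, 28]

# n=4 asks for the three cut points that split the data into four equal parts.
q1, q2, q3 = quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

print("Q1 =", q1, "median =", q2, "Q3 =", q3)
print("IQR =", iqr)
```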

19 Variance and Standard deviation.
The variance of a set of data is a measure of spread about the mean of a distribution. The variance uses all the data. The standard deviation is the square root of the variance.

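20 The Variance
Variance is one of the most frequently used measures of spread.
For a population of size $N$ with mean $\mu$: $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2$
For a sample of size $n$ with mean $\bar{x}$: $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$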

21 The Standard Deviation
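Since variance is given in squared units, we often find uses for the standard deviation, which is the square root of the variance.
For a population: $\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2}$
For a sample: $s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}$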
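The difference between the population and sample versions is only the divisor (N versus n - 1). A minimal sketch, using a hypothetical data set whose mean is exactly 5 so the arithmetic is easy to follow:

```python
import math


def population_variance(values):
    """Average squared deviation from the mean, dividing by N."""
    n = len(values)
    mu = sum(values) / n
    return sum((x - mu) ** 2 for x in values) / n


def sample_variance(values):
    """Sum of squared deviations from the sample mean, dividing by n - 1."""
    n = len(values)
    if n < 2:
        raise ValueError("sample variance needs at least two values")
    xbar = sum(values) / n
    return sum((x - xbar) ** 2 for x in values) / (n - 1)


# Hypothetical data: mean 5, squared deviations sum to 32.
data = [2, 4, 4, 4, 5, 5, 7, 9]
print(population_variance(data), math.sqrt(population_variance(data)))
print(sample_variance(data), math.sqrt(sample_variance(data)))
```

The standard library offers the same quantities as `statistics.pvariance`, `statistics.variance`, `statistics.pstdev` and `statistics.stdev`.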

22 Shape of the Distribution: Skewness
Values need not be symmetrically distributed around the central point; distributions can be skewed. The mean and standard deviation are then insufficient to describe the distribution.
[Figure: frequency curve of x, skewed to the right (positively skewed), with the mode, median and mean marked.]

23 Consequences of a Skewed Distribution
Socio-economic data in particular (wages, income, wealth and related variables) are frequently skewed. Skewed variables can lead to undesirable effects: test statistics and confidence intervals are biased. If the variable is not significantly skewed, continue. If the variable is skewed, transform it. For this reason you often find the logarithm of income, the square root of the mortality rate, etc.
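To illustrate why a log transform helps, the sketch below measures skewness before and after taking logarithms. Two things here are assumptions rather than content from the slides: the skewness measure (the moment coefficient m3 / m2^1.5, one common definition) and the income figures, which are hypothetical and deliberately constructed so that each value is double the previous one.

```python
import math


def skewness(values):
    """Moment coefficient of skewness, m3 / m2**1.5 (0 for symmetric data)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    return m3 / m2 ** 1.5


# Hypothetical incomes (in thousands), each double the previous one.
incomes = [10, 20, 40, 80, 160, 320, 640]
log_incomes = [math.log(x) for x in incomes]

print("raw skewness:", skewness(incomes))      # clearly positive (long right tail)
print("log skewness:", skewness(log_incomes))  # essentially zero
```

Real income data will rarely become perfectly symmetric after a log transform; the doubling pattern is used only to make the effect easy to see.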

24 Kurtosis: a measure of the "peakedness"
Two variables with equal mean and standard deviation, both symmetrically distributed, but with different kurtosis. Here, variable y has larger kurtosis than variable x.
[Figure: curves f(x) and f(y) centred on the common mean μ, with equal standard deviations s_x and s_y.]
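The situation described above can be reproduced numerically. The sketch assumes one common definition of kurtosis, the fourth standardized moment m4 / m2^2 (which is about 3 for a normal distribution); the two tiny data sets are hypothetical and were chosen so that both have mean 0 and population standard deviation 1.

```python
import math


def central_moment(values, k):
    """k-th central moment: average of (x - mean)**k."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** k for x in values) / n


def kurtosis(values):
    """Fourth standardized moment, m4 / m2**2."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2


# Hypothetical symmetric samples with equal mean (0) and equal population SD (1).
x = [-1.0, -1.0, 1.0, 1.0]
y = [-math.sqrt(2), 0.0, 0.0, math.sqrt(2)]

print("SD:      ", math.sqrt(central_moment(x, 2)), math.sqrt(central_moment(y, 2)))
print("kurtosis:", kurtosis(x), kurtosis(y))  # y has the larger kurtosis
```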

25 Describe Samples: graphs
Box plot and stem-and-leaf diagram

26 Box Plot
A box plot marks the maximum value, third quartile, mean, median, first quartile and minimum value. It is a visual display of central tendency, variability, departure from symmetry, and outliers. Box plots give a good graphical image of the concentration of the data. They also show how far the extreme values are from most of the data.
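The numbers a box plot displays can be gathered in one small function. This is a sketch under two assumptions not stated on the slide: quartiles use the `"inclusive"` method of `statistics.quantiles`, and outliers are flagged with the common 1.5 × IQR convention. The data are the stem-and-leaf values plus one hypothetical extreme value (90).

```python
from statistics import mean, quantiles


def box_plot_summary(values, whisker=1.5):
    """Return the values a box plot displays, plus points beyond the fences."""
    ordered = sorted(values)
    q1, q2, q3 = quantiles(ordered, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - whisker * iqr    # assumed convention: 1.5 * IQR below Q1
    high_fence = q3 + whisker * iqr   # assumed convention: 1.5 * IQR above Q3
    return {
        "min": ordered[0],
        "q1": q1,
        "median": q2,
        "mean": mean(ordered),
        "q3": q3,
        "max": ordered[-1],
        "outliers": [v for v in ordered if v < low_fence or v > high_fence],
    }


# Stem-and-leaf data plus one hypothetical extreme value.
summary = box_plot_summary([12, 32, 22, 16, 24, 34, 12, 10, 25, 30, 28, 90])
for name, value in summary.items():
    print(name, value)
```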

27 STEM AND LEAF DIAGRAMS
A stem and leaf diagram is a way of sorting data. It has a STEM column and a LEAVES column: the data is split into tens (the stem) and units (the leaves).

28 STEM AND LEAF DIAGRAMS
We are going to put this data into a stem and leaf diagram: 12, 32, 22, 16, 24, 34, 12, 10, 25, 30, 28.
We have numbers in the tens, twenties and thirties, so 1, 2 and 3 become our stems. Now we need to enter the leaves. The first number, twelve, has a 2 in the units column, so 2 becomes its leaf.

STEM | LEAVES
1    | 2
2    |
3    |

29 STEM AND LEAF DIAGRAMS
Data: 12, 32, 22, 16, 24, 34, 12, 10, 25, 30, 28
The next number is 32. This has a 2 in the units column, so it goes on the 3 stem. The rest go as shown.

STEM | LEAVES
1    | 2 6 2 0
2    | 2 4 5 8
3    | 2 4 0

Key: 1 | 2 = 12

30 STEM AND LEAF DIAGRAMS
If an ORDERED stem and leaf diagram is required, put the leaves in numerical order.

STEM | LEAVES
1    | 0 2 2 6
2    | 2 4 5 8
3    | 0 2 4

Key: 1 | 2 = 12

We can now use this to find the median. There are 11 pieces of data, so the median is the 6th number: median = 24. A stem and leaf diagram is a good choice when the data sets are small.
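The construction walked through on these slides can be sketched as a short function: split each value into its tens digit (stem) and units digit (leaf), group the leaves by stem, and optionally sort them. The function name and the `ordered` flag are illustrative choices, and the sketch assumes non-negative whole numbers.

```python
from collections import defaultdict


def stem_and_leaf(values, ordered=True):
    """Map each stem (tens) to its list of leaves (units)."""
    rows = defaultdict(list)
    for value in values:
        stem, leaf = divmod(value, 10)  # e.g. 12 -> stem 1, leaf 2
        rows[stem].append(leaf)
    if ordered:
        for leaves in rows.values():
            leaves.sort()
    return dict(sorted(rows.items()))


data = [12, 32, 22, 16, 24, 34, 12, 10, 25, 30, 28]
diagram = stem_and_leaf(data)
for stem, leaves in diagram.items():
    print(stem, "|", " ".join(str(leaf) for leaf in leaves))

# Reading the ordered leaves back gives the sorted data, so the median
# of the 11 values is the 6th one.
sorted_values = [10 * stem + leaf for stem, leaves in diagram.items() for leaf in leaves]
print("median =", sorted_values[len(sorted_values) // 2])
```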

31 If most of the measurements in a large data set are of approximately the same magnitude except for a few measurements that are quite a bit larger, how would the mean and median of the data set compare and what shape would a histogram of the data set have?
(a) The mean would be smaller than the median and the histogram would be skewed with a long left tail.
(b) The mean would be larger than the median and the histogram would be skewed with a long right tail.
(c) The mean would be larger than the median and the histogram would be skewed with a long left tail.
(d) The mean would be smaller than the median and the histogram would be skewed with a long right tail.
(e) The mean would be equal to the median and the histogram would be symmetrical.

32 When extreme values are present in a set of data, which of the following descriptive summary measures are most appropriate?
(a) Coefficient of variation and range.
(b) Mean and standard deviation.
(c) Median and inter-quartile range.
(d) Mode and variance.

33 The weights of the male and female students in a class are summarized in the following boxplots:
Which of the following is NOT correct?
(a) About 50% of the male students have weights between 150 and 185 lbs.
(b) About 25% of female students have weights more than 130 lbs.
(c) The median weight of male students is about 162 lbs.
(d) The mean weight of female students is about 120 because of symmetry.
(e) The male students have less variability than the female students.

34 The following is a stem-plot of the birth weights of male babies born to a group of mothers who smoked during pregnancy. The stems are in units of kg.
The median birth weight is:
(a)
(b) 3.2
(c) 3.5
(d) 3.7
(e) Average of 13 and 14.
The first quartile (25th percentile) of the weights is:
(a) 2.3
(b) 2.7
(c) .25
(d) 6.5
(e) 2.8

35 Thank you

