Agricultural and Biological Statistics
Summarizing Data Chapter 2
Data are observations on one or more variables selected from a population (or a sample).
Qualitative Variables - variables that express attributes of a sample or population.
Quantitative Variables - variables that are the result of measurement or counting.
Qualitative Variables give rise to nominal or ordinal data.
Quantitative Variables give rise to ratio or interval data.
It is also important to categorize observations as to their source.
Primary Data - collected by means of experiment or survey.
Secondary Data - acquired from a source that did not collect the data, even though the source may have published them.
Data collection allows us to make informed decisions about the problem at hand. Manipulation and statistical analysis of the data allow for improved decision making. Generally speaking, statistical data are collected in random order. Since there is no real order to the data, it is difficult to obtain any valuable information upon inspection.
Data: 3, 1, 7, 22, 9, 10, 4, 17, 19
Definition of Array
Array - an array reorders the data from the smallest to the largest value.
Data: 3, 1, 7, 22, 9, 10, 4, 17, 19
Arrayed data: 1, 3, 4, 7, 9, 10, 17, 19, 22
Measures of Dispersion
Definition: Range - the range is computed by subtracting the smallest observation from the largest.
Range: 22 - 1 = 21
An array also indicates something about the distribution of the units between the two extremes and their tendency to cluster toward some central value.
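As a minimal sketch in Python (the variable names are illustrative and not part of the slides), the array and the range for the nine observations above could be computed like this:

```python
# Data as listed on the slide.
data = [3, 1, 7, 22, 9, 10, 4, 17, 19]

# An array orders the observations from smallest to largest.
arrayed = sorted(data)

# Range = largest observation minus smallest observation.
data_range = arrayed[-1] - arrayed[0]

print(arrayed)     # [1, 3, 4, 7, 9, 10, 17, 19, 22]
print(data_range)  # 21
```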
Data can be further summarized in the form of a frequency distribution. A number of classes is chosen (normally 5 to 15). The distribution has the classes on the vertical axis and frequencies on the horizontal axis.
Cotton Yield: 215 to 235, 235 to 255, 255 to 275, ...
Number of farms: 4, 6, 13, 21, 15, 7, 5
Classes do not have to be of equal width (e.g., income).
Histogram - a frequency distribution presented as a bar chart. Advantage: its shape can be seen.
Frequency polygon - a line graph used to display the data, with frequencies on the y-axis and class midpoints on the x-axis.
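The sketch below shows one way a frequency distribution and its class midpoints might be built. The yield values are hypothetical, invented only for illustration, and the rule that a class includes its lower boundary but not its upper one is an assumption, since the slides do not say how boundary values are assigned.

```python
# Hypothetical yields for illustration only; these are NOT data from the slides.
yields = [218, 229, 236, 241, 247, 252, 258, 263, 270, 274]

# The first three classes shown on the slide.
# Assumption: lower boundary included, upper boundary excluded.
classes = [(215, 235), (235, 255), (255, 275)]

# Count the observations falling in each class.
frequencies = [sum(1 for y in yields if lo <= y < hi) for lo, hi in classes]

# Class midpoints are what a frequency polygon plots on the x-axis.
midpoints = [(lo + hi) / 2 for lo, hi in classes]

for (lo, hi), mid, freq in zip(classes, midpoints, frequencies):
    # A row of '#' characters stands in for a histogram bar.
    print(f"{lo} to {hi} (midpoint {mid}): {'#' * freq} {freq}")
```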
Averages
Average - a number used to represent the central value of a data set or distribution.
1. Arithmetic Mean - most widely used.
$\mu = \frac{1}{N} \sum_{i=1}^{N} X_i$
Example: the population is 7, 3, 2, 8.
Arithmetic Mean = 20/4 = 5
Example cont: Now take some samples.
$\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i$
S1 = 7, 3: 10/2 = 5
S2 = 7, 8: 15/2 = 7.5
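A short Python sketch of the same arithmetic, using the population and the two samples from the example (the variable names are illustrative), shows how sample means can differ from the population mean:

```python
population = [7, 3, 2, 8]
mu = sum(population) / len(population)    # population mean: 20 / 4

sample_1 = [7, 3]
sample_2 = [7, 8]
xbar_1 = sum(sample_1) / len(sample_1)    # 10 / 2
xbar_2 = sum(sample_2) / len(sample_2)    # 15 / 2

print(mu, xbar_1, xbar_2)  # 5.0 5.0 7.5
```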
1a. Weighted Mean.
$\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$
Weighted Mean Example
Crop        Hourly Wage, x   Number of Workers, w   wx
Cucumbers   4.50             950                    4,275
Melons      4.75             600                    2,850
Onions      5.25             1,020                  5,355
Total                        2,570                  12,480
$\bar{x} = 12,480 / 2,570 = 4.86$. The other way (the unweighted mean of the three wages) it is 4.83.
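The weighted mean example can be reproduced with a few lines of Python. This is only a sketch: the dictionary layout and names are illustrative choices, while the wages and worker counts are those in the table.

```python
# Hourly wage (x) and number of workers (w) for each crop, from the table.
wages = {"Cucumbers": 4.50, "Melons": 4.75, "Onions": 5.25}
workers = {"Cucumbers": 950, "Melons": 600, "Onions": 1020}

total_wx = sum(wages[crop] * workers[crop] for crop in wages)  # sum of w * x
total_w = sum(workers.values())                                # sum of w

weighted_mean = total_wx / total_w
unweighted_mean = sum(wages.values()) / len(wages)

print(round(weighted_mean, 2), round(unweighted_mean, 2))  # 4.86 4.83
```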
Two Properties of Arithmetic Mean
a. The sum of the deviations from the mean is zero.
Data: 3, 7, 4, 6, 0, with $\bar{x} = 4$
3 - 4 = -1
7 - 4 = 3
4 - 4 = 0
6 - 4 = 2
0 - 4 = -4
b. The sum of squares of the deviations from the mean is a minimum.
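Both properties can be checked numerically on the data set 3, 7, 4, 6, 0. The sketch below is an illustration, not a proof: it compares the sum of squares around the mean with the sum of squares around a few arbitrarily chosen alternative centers.

```python
data = [3, 7, 4, 6, 0]
xbar = sum(data) / len(data)  # 4.0

def sum_sq_dev(center):
    """Sum of squared deviations of the data from a given center."""
    return sum((x - center) ** 2 for x in data)

# Property a: deviations from the mean sum to zero.
sum_dev = sum(x - xbar for x in data)

# Property b: the sum of squares is smallest when taken around the mean.
ss_at_mean = sum_sq_dev(xbar)
other_centers = [xbar - 1, xbar - 0.5, xbar + 0.5, xbar + 1]  # arbitrary alternatives
ss_elsewhere = [sum_sq_dev(c) for c in other_centers]

print(sum_dev)       # 0.0
print(ss_at_mean)    # 30.0
print(ss_elsewhere)  # [35.0, 31.25, 31.25, 35.0]
```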
2. Midrange (or center) - the arithmetic mean of the smallest and the largest items in the data set. Unreliable as an estimate of the population mean, because it is based on two values that change significantly from sample to sample.
$MR = \frac{X_1 + X_n}{2}$, where $X_1$ is the smallest and $X_n$ is the largest.
Example: $MR = \frac{0 + 7}{2} = 3.5$
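A minimal Python sketch of the midrange follows. It assumes the example refers to the data set 3, 7, 4, 6, 0 used earlier, whose smallest and largest values are 0 and 7; the function name is illustrative.

```python
def midrange(values):
    """Arithmetic mean of the smallest and the largest value."""
    return (min(values) + max(values)) / 2

print(midrange([3, 7, 4, 6, 0]))  # 3.5
```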
Median
3. Median - a place average. For ungrouped data, it is the value of the middle observation after the data are arrayed. When there is an even number of observations, the middle two observations are averaged. It is a better measure when extreme values are encountered. It should not be used for small sample sizes. Half of the observations are below it and half are above.
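As a brief sketch, Python's standard `statistics.median` follows the same rule: it arrays the data and, for an even count, averages the two middle observations. The two data sets are reused from earlier slides.

```python
from statistics import median

odd_count = [3, 1, 7, 22, 9, 10, 4, 17, 19]  # nine observations
even_count = [7, 3, 2, 8]                     # four observations

# Odd count: the middle value of the arrayed data 1, 3, 4, 7, 9, 10, 17, 19, 22.
print(median(odd_count))   # 9

# Even count: the two middle values of 2, 3, 7, 8 are averaged.
print(median(even_count))  # 5.0
```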
Mode
4. Mode - the most common observation in the data set. For ungrouped data we determine the mode by inspection. Ungrouped data may not have a mode, as when all values appear once. Several modes could occur as well. Use the mode when we want to know what is in vogue.
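The three cases (one mode, several modes, no mode) can be illustrated with a small helper. The function `modes` and the example lists are hypothetical, written only for this sketch; returning an empty list when every value appears once is a design choice that mirrors the "no mode" case described above.

```python
from collections import Counter

def modes(values):
    """Return the most common value(s), or [] if every value appears once."""
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        return []  # all values appear once: no mode
    return sorted(v for v, c in counts.items() if c == highest)

print(modes([2, 3, 3, 5, 7]))   # [3]
print(modes([1, 1, 4, 4, 6]))   # [1, 4]
print(modes([3, 1, 7, 22, 9]))  # []
```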
An arithmetic mean might be meaningless.
ABC show = 1, CBS show = 2, NBC show = 3
$\bar{X} = 2.3$ is meaningless.
Characteristics of Mean, Median, Mode
Use the three averages together to determine the relative symmetry of a distribution.
Perfect symmetry: all three values (averages) are identical.
If the distribution has a tail on the right, it is skewed positively: the arithmetic mean is the largest, the mode is the smallest, and the median lies about 2/3 of the way from the mode toward the mean.
The mean is the largest because it is affected by large values. The median is sensitive to the position of the values. The arithmetic mean is the only one that can be used in algebraic calculations, which makes it the most useful. Its downside is that it is impossible to calculate with open-ended classes; this does not affect the other two averages.
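The ordering of the three averages under positive skew can be seen on a small data set. The values below are hypothetical, chosen only so that a few large observations form a right tail.

```python
from statistics import mean, median, mode

# Hypothetical right-skewed data (not from the slides): a few large values form the tail.
skewed = [1, 2, 2, 2, 3, 3, 4, 5, 9, 14]

mean_value = mean(skewed)      # pulled upward by 9 and 14
median_value = median(skewed)  # middle of the arrayed data
mode_value = mode(skewed)      # most common observation

print(mode_value, median_value, mean_value)  # 2 3.0 4.5
```

With a tail on the right, the mode comes out smallest and the mean largest, as the slide describes.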
Measures of Dispersion
Range: $R = X_n - X_1$
Can be used with the mean, median, and midrange. The range indicates both how high and how low the numbers go and the spread of the data itself. It is based on the two extreme values of the data set, so it is not the first choice for a measure of dispersion.
Quartile Deviation
$QD = \frac{Q_3 - Q_1}{2}$
Used only with the median. One half the distance between the first and the third quartiles.
Data: 8, 12, 6, 14, 10
Arrayed data: 6, 8, 10, 12, 14
1st Quartile = 7 (between 6 and 8), Middle Quartile = 10, 3rd Quartile = 13 (between 12 and 14)
$QD = \frac{13 - 7}{2} = 3$
QD is similar to the range but uses values in the middle half of the distribution rather than the endpoints. It is a poor measure when there is wide dispersion in the tails of the distribution!
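The quartile deviation for the data 8, 12, 6, 14, 10 can be sketched with Python's `statistics.quantiles`. Its default "exclusive" method happens to reproduce the quartiles 7 and 13 implied by the example; other quartile conventions (for instance `method="inclusive"`) give different values, so which rule the slides intend is an assumption here.

```python
from statistics import quantiles

data = [8, 12, 6, 14, 10]

# quantiles() sorts the data itself; n=4 asks for the three quartile cut points.
q1, q2, q3 = quantiles(data, n=4)  # default method="exclusive"

# Quartile deviation: half the distance between the first and third quartiles.
qd = (q3 - q1) / 2

print(q1, q2, q3)  # 7.0 10.0 13.0
print(qd)          # 3.0
```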
Standard Deviation A measure of dispersion used with the arithmetic mean. Its value is based on all the observations of the data set.
SD
For ungrouped data, the SD is the most widely used measure of dispersion. The arithmetic mean is the most widely used average.
$s^2$ - the sample variance is an estimate of the population variance $\sigma^2$, computed from sample data.
In repeated sampling, the sample variance computed with divisor $n$ is biased and underestimates the population variance by the fixed factor $(n-1)/n$. Thus a revision in the sample SD formula is needed: divide by $n-1$ for a sample.
$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}}$
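The bias can be illustrated by enumerating every possible sample from a tiny population. The sketch below reuses the population 7, 3, 2, 8 from the arithmetic mean example and assumes samples of size 2 drawn with replacement; both choices are made for illustration and are not specified in the slides.

```python
from itertools import product

population = [7, 3, 2, 8]
N = len(population)
mu = sum(population) / N
sigma2 = sum((x - mu) ** 2 for x in population) / N  # population variance

n = 2  # sample size (an assumption for this illustration)
samples = list(product(population, repeat=n))  # all 16 samples, with replacement

def sum_sq(sample):
    """Sum of squared deviations from the sample's own mean."""
    xbar = sum(sample) / len(sample)
    return sum((x - xbar) ** 2 for x in sample)

# Average the two candidate variance formulas over all possible samples.
avg_divide_by_n = sum(sum_sq(s) / n for s in samples) / len(samples)
avg_divide_by_n_minus_1 = sum(sum_sq(s) / (n - 1) for s in samples) / len(samples)

print(sigma2)                   # 6.5
print(avg_divide_by_n)          # 3.25
print(avg_divide_by_n_minus_1)  # 6.5
```

Dividing by n gives, on average, only (n-1)/n of the population variance (here one half), while dividing by n-1 recovers it.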
Example for Calculating SD
Days Absent, $x$   $x - \bar{x}$   $(x - \bar{x})^2$
7                  -3              9
14                 4               16
8                  -2              4
5                  -5              25
15                 5               25
11                 1               1
Total 60                           80
$\bar{x} = 10$
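The days-absent example can be carried through to the standard deviation in a few lines. This sketch takes the six observations as 7, 14, 8, 5, 15, 11 (sum 60, mean 10) and divides by n-1, as for a sample; the variable names are illustrative.

```python
import math

days_absent = [7, 14, 8, 5, 15, 11]
n = len(days_absent)

xbar = sum(days_absent) / n                   # 60 / 6 = 10.0
deviations = [x - xbar for x in days_absent]  # -3, 4, -2, -5, 5, 1
sum_sq = sum(d ** 2 for d in deviations)      # 80.0

variance = sum_sq / (n - 1)                   # sample variance: 80 / 5
sd = math.sqrt(variance)

print(xbar, sum_sq, variance, sd)  # 10.0 80.0 16.0 4.0
```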
Standard Deviation (Another Formula) This does not contain deviations from the mean.
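The formula itself is not reproduced in this text. One commonly used form that avoids deviations from the mean is $s^2 = \frac{\sum x^2 - (\sum x)^2 / n}{n - 1}$; whether this is the exact form shown on the slide is an assumption. A sketch using the days-absent values 7, 14, 8, 5, 15, 11:

```python
import math

x = [7, 14, 8, 5, 15, 11]
n = len(x)

sum_x = sum(x)                  # 60
sum_x2 = sum(v * v for v in x)  # 680

# Computational form: no deviations from the mean are needed.
variance = (sum_x2 - sum_x ** 2 / n) / (n - 1)  # (680 - 600) / 5
sd = math.sqrt(variance)

print(variance, sd)  # 16.0 4.0
```

The result agrees with the deviation-based calculation for the same data.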
Two properties of $s$ and $\bar{x}$
1. If we add a constant to every element in the data set, the mean changes by that same value and the SD remains unchanged.
2. Multiplying each value of x by a constant multiplies the mean by the constant, the SD by the absolute value of the constant, and the variance by the square of the constant.
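Both properties can be checked numerically. The sketch below uses the days-absent values 7, 14, 8, 5, 15, 11 with two arbitrarily chosen constants (add 3, multiply by -2); a negative multiplier is used to show why the SD scales by the absolute value.

```python
from statistics import mean, stdev, variance

x = [7, 14, 8, 5, 15, 11]  # mean 10, SD 4, variance 16

# Property 1: add a constant (3, chosen arbitrarily) to every element.
shifted = [v + 3 for v in x]

# Property 2: multiply every element by a constant (-2, chosen arbitrarily).
scaled = [v * -2 for v in x]

print(mean(x), stdev(x), variance(x))
print(mean(shifted), stdev(shifted))                 # mean up by 3, SD unchanged
print(mean(scaled), stdev(scaled), variance(scaled)) # mean * -2, SD * 2, variance * 4
```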
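Standardizing The Data
Subtract the mean from every element, then multiply every element by 1/s: $z = \frac{x - \bar{x}}{s}$. The standardized data have mean 0 and SD 1. For a population, the standardized value becomes Z.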
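A minimal sketch of standardization, assuming the sample SD (divisor n-1) is used and taking the days-absent values 7, 14, 8, 5, 15, 11 as input; the function name is illustrative.

```python
import math

def standardize(values):
    """Subtract the mean from every element, then multiply by 1/s (sample SD)."""
    n = len(values)
    xbar = sum(values) / n
    s = math.sqrt(sum((v - xbar) ** 2 for v in values) / (n - 1))
    return [(v - xbar) / s for v in values]

z = standardize([7, 14, 8, 5, 15, 11])
print(z)  # [-0.75, 1.0, -0.5, -1.25, 1.25, 0.25]

# The standardized values have mean 0 and SD 1.
z_mean = sum(z) / len(z)
z_sd = math.sqrt(sum((v - z_mean) ** 2 for v in z) / (len(z) - 1))
print(z_mean, z_sd)  # 0.0 1.0
```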
Coefficient of Variation
Uses the SD and mean to measure the variability of the data set. Gives a relative measure of variability: it states how large the standard deviation is in comparison to the mean, in percentage terms.
$CV = 100 \cdot \frac{s}{\bar{x}}$
CV = 100 would mean that $s$ and $\bar{x}$ are equal. When the CV is over 50, use caution in stating that the mean represents the population.
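As a closing sketch, the coefficient of variation for the days-absent values 7, 14, 8, 5, 15, 11 (mean 10, sample SD 4) can be computed as follows. The function name is illustrative, and the sample SD with divisor n-1 is assumed.

```python
import math

def coefficient_of_variation(values):
    """SD expressed as a percentage of the mean (sample SD, divisor n - 1)."""
    n = len(values)
    xbar = sum(values) / n
    s = math.sqrt(sum((v - xbar) ** 2 for v in values) / (n - 1))
    return 100 * s / xbar

cv = coefficient_of_variation([7, 14, 8, 5, 15, 11])
print(cv)  # 40.0
```

A CV of 40 is below the cautionary level of 50 mentioned above.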