Copyright ©2003 Brooks/Cole A division of Thomson Learning, Inc. Definitions: A variable is a characteristic that changes or varies over time and/or for different individuals or objects under consideration. Examples: hair color, white blood cell count, time to failure of a computer component.
Definitions: An experimental unit is the individual or object on which a variable is measured. A set of measurements, called data, can be either a sample or a population.
Definitions: A population is the collection of all items we are interested in. A sample is the subset of the population that we observe.
Types of Variables: Qualitative; Quantitative (Discrete or Continuous)
Types of Variables: Qualitative variables measure a quality or characteristic on each experimental unit. Examples: hair color (black, brown, blonde, …); make of car (Dodge, Honda, Ford, …); gender (male, female); state of birth (California, Arizona, …).
Types of Variables: Quantitative variables measure a numerical quantity on each experimental unit. A quantitative variable is discrete if it can assume only a finite or countable number of values, and continuous if it can assume the infinitely many values corresponding to the points on a line interval.
Examples:
– For each orange tree in a grove, the number of oranges is measured: quantitative, discrete.
– For a particular day, the number of cars entering a college campus is measured: quantitative, discrete.
– The time until a light bulb burns out is measured: quantitative, continuous.
2.1 Describing Qualitative Data
Use a data distribution to describe:
– What values of the variable have been measured
– How often each value has occurred
"How often" can be measured 3 ways:
– Frequency
– Relative frequency = Frequency/n
– Percent = 100 × Relative frequency
Example: A bag of M&M®s contains 25 candies.
Statistical Table:
Color     Frequency   Relative Frequency   Percent
Red       5           5/25 = .20           20%
Blue      3           3/25 = .12           12%
Green     2           2/25 = .08           8%
Orange    3           3/25 = .12           12%
Brown     8           8/25 = .32           32%
Yellow    4           4/25 = .16           16%
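The three ways of measuring "how often" can be sketched in a few lines of Python. This is only an illustration: the list `candies` is rebuilt from the frequencies in the table (the original order of the candies is not known), and the function name `frequency_table` is a hypothetical helper, not something from the slides.

```python
from collections import Counter

# Raw data rebuilt from the table's frequencies; the order is arbitrary.
candies = (["Red"] * 5 + ["Blue"] * 3 + ["Green"] * 2
           + ["Orange"] * 3 + ["Brown"] * 8 + ["Yellow"] * 4)

def frequency_table(values):
    """Map each value to (frequency, relative frequency, percent)."""
    n = len(values)
    counts = Counter(values)
    return {v: (f, f / n, 100 * f / n) for v, f in counts.items()}

table = frequency_table(candies)
for color, (freq, rel, pct) in table.items():
    print(f"{color:<8}{freq:>3}{rel:>7.2f}{pct:>6.0f}%")
```

The relative frequencies always sum to 1 and the percents to 100, which is a quick check on any hand-built table.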
Graphs: Bar Chart; Pie Chart
2.2 Describing Quantitative Data: Dot plot; Stem and leaf plot; Histogram.
Dotplots: The simplest graph for quantitative data. Plots the measurements as points on a horizontal axis, stacking the points that duplicate existing points. Example: The set 4, 5, 5, 7,
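A dotplot is simple enough to draw as text. The sketch below uses only the four values still visible in the example (4, 5, 5, 7); the original set may have contained more. The function name `dotplot_rows` is hypothetical, and the alignment assumes small non-negative integers with one digit each.

```python
from collections import Counter

def dotplot_rows(values):
    """Return the lines of a text dotplot for single-digit integer data."""
    counts = Counter(values)
    xs = range(min(values), max(values) + 1)
    rows = []
    # Build the plot from the top stacking level down to the axis.
    for level in range(max(counts.values()), 0, -1):
        marks = ["*" if counts[x] >= level else " " for x in xs]
        rows.append(" ".join(marks).rstrip())
    rows.append(" ".join(str(x) for x in xs))  # horizontal axis
    return rows

rows = dotplot_rows([4, 5, 5, 7])
print("\n".join(rows))
```

The repeated 5 is stacked above the first one, which is exactly the "stacking the points that duplicate existing points" rule.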
Stem and Leaf plot: The ages of the CEOs of 30 top ranked small companies in America. [Stem-and-leaf display; the data are not recoverable from the source.]
Relative Frequency Histograms: A relative frequency histogram for a quantitative data set is a bar graph in which the height of the bar shows "how often" (measured as a proportion or relative frequency) measurements fall in a particular class or subinterval. Create intervals, then stack and draw bars.
Relative Frequency Histograms
– Divide the range of the data into 5-12 subintervals of equal length.
– Calculate the approximate width of the subinterval as Range/number of subintervals.
– Round the approximate width up to a convenient value.
– Use the method of left inclusion, including the left endpoint, but not the right, in your tally.
– Create a statistical table including the subintervals, their frequencies and relative frequencies.
Relative Frequency Histograms
Draw the relative frequency histogram, plotting the subintervals on the horizontal axis and the relative frequencies on the vertical axis. The height of the bar represents:
– The proportion of measurements falling in that class or subinterval.
– The probability that a single measurement, drawn at random from the set, will belong to that class or subinterval.
Example: The ages of 50 tenured faculty at a state university. We choose to use 6 intervals. Minimum class width = (70 - 26)/6 = 7.33. Convenient class width = 8. Use 6 classes of length 8, starting at 25.
Age          Frequency   Relative Frequency   Percent
25 to < 33   5           5/50 = .10           10%
33 to < 41   14          14/50 = .28          28%
41 to < 49   13          13/50 = .26          26%
49 to < 57   9           9/50 = .18           18%
57 to < 65   7           7/50 = .14           14%
65 to < 73   2           2/50 = .04           4%
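The class-width arithmetic and the left-inclusion tally from this example can be sketched as follows. The function names and the short `ages` list are hypothetical (the 50 faculty ages themselves are not in the slides), and rounding up to the next whole number is just one reading of "a convenient value".

```python
import math

def class_width(minimum, maximum, k):
    """Approximate width Range/k, rounded up to the next whole number."""
    return math.ceil((maximum - minimum) / k)

def tally(values, start, width, k):
    """Count values per class [start + i*width, start + (i+1)*width)."""
    counts = [0] * k
    for x in values:
        i = int((x - start) // width)  # left endpoint included, right excluded
        if 0 <= i < k:
            counts[i] += 1
    return counts

width = class_width(26, 70, 6)  # (70 - 26)/6 = 7.33, rounded up
# Hypothetical ages chosen to sit on and near class boundaries.
ages = [26, 33, 40, 41, 48, 49, 57, 64, 65, 70]
counts = tally(ages, start=25, width=width, k=6)
rel_freq = [c / len(ages) for c in counts]
```

Note how 33 lands in the class "33 to < 41" rather than "25 to < 33": that is the left-inclusion rule at work.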
2.4 Numerical Measures of Center
– Skewed left: Mean < Median
– Skewed right: Mean > Median
– Symmetric: Mean = Median
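The mean/median comparison can be tried on small data sets. The sketch below treats it as a rule of thumb rather than a formal test; the function name `skew_direction`, the tolerance parameter and the sample lists are all hypothetical.

```python
from statistics import mean, median

def skew_direction(values, tol=1e-9):
    """Compare mean and median: 'left', 'right', or 'symmetric'."""
    diff = mean(values) - median(values)
    if diff > tol:
        return "right"   # mean pulled above the median by large values
    if diff < -tol:
        return "left"    # mean pulled below the median by small values
    return "symmetric"

right_skewed = [1, 2, 2, 3, 12]     # mean 4, median 2
left_skewed = [1, 10, 11, 11, 12]   # mean 9, median 11
balanced = [1, 2, 3]                # mean 2, median 2
```

A single extreme value is enough to move the mean while leaving the median where it was.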
Interpreting the Standard Deviation: Chebyshev's Rule and the Empirical Rule. Both tell us something about where the data will be relative to the mean.
Chebyshev's Theorem: Given a number $k$ greater than or equal to 1 and a set of $n$ measurements, at least $1 - 1/k^2$ of the measurements will lie within $k$ standard deviations of the mean. Can be used for either samples ($\bar{x}$ and $s$) or for a population ($\mu$ and $\sigma$). Valid for any dataset.
Important results:
– If $k = 2$, at least $1 - 1/2^2 = 3/4 = 75\%$ of the measurements are within 2 standard deviations of the mean.
– If $k = 3$, at least $1 - 1/3^2 = 8/9 \approx 89\%$ of the measurements are within 3 standard deviations of the mean.
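Chebyshev's bound is easy to check numerically. In the sketch below, `chebyshev_lower_bound` and `fraction_within` are hypothetical helper names and `data` is a made-up, deliberately lopsided sample; the point is only that the observed fraction never falls below the bound.

```python
from statistics import mean, stdev

def chebyshev_lower_bound(k):
    """Minimum fraction of measurements within k standard deviations."""
    if k < 1:
        raise ValueError("k must be at least 1")
    return 1 - 1 / k**2

def fraction_within(values, k):
    """Observed fraction of values within k sample standard deviations."""
    center, spread = mean(values), stdev(values)
    inside = [x for x in values if abs(x - center) <= k * spread]
    return len(inside) / len(values)

data = [1, 2, 3, 4, 100]  # hypothetical sample with one extreme value
bound_2 = chebyshev_lower_bound(2)
observed_2 = fraction_within(data, 2)
```

The bound is usually conservative: real data sets tend to have far more than 75% of their values within 2 standard deviations.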
Using Measures of Center and Spread: The Empirical Rule
Given a distribution of measurements that is approximately mound-shaped:
– The interval $\bar{x} \pm s$ (or $\mu \pm \sigma$ for a population) contains approximately 68% of the measurements.
– The interval $\bar{x} \pm 2s$ (or $\mu \pm 2\sigma$) contains approximately 95% of the measurements.
– The interval $\bar{x} \pm 3s$ (or $\mu \pm 3\sigma$) contains approximately 99.7% of the measurements.
Empirical Rule Example: Hummingbirds beat their wings in flight an average of 55 times per second. Assume the standard deviation is 10, and that the distribution is symmetrical and mounded.
– Approximately what percentage of hummingbirds beat their wings between 45 and 65 times per second?
– Between 55 and 65?
– Less than 45?
Empirical Rule Example, between 45 and 65: Since 45 and 65 are exactly one standard deviation below and above the mean, the empirical rule says that about 68% of the hummingbirds will be in this range.
Empirical Rule Example, between 55 and 65: This range of numbers is from the mean to one standard deviation above it, or one-half of the range in the previous question. So, about one-half of 68%, or 34%, of the hummingbirds will be in this range.
Empirical Rule Example, less than 45: Half of the entire data set lies above the mean, and ~34% lie between 45 and 55 (between one standard deviation below the mean and the mean), so ~84% (~34% + 50%) are above 45, which means ~16% are below 45.
Empirical Rule: Since ~95% of all the measurements will be within 2 standard deviations of the mean, only ~5% will be more than 2 standard deviations from the mean. About half of this 5% will be far below the mean, leaving only about 2.5% of the measurements more than 2 standard deviations above the mean.
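The hummingbird answers all come from splitting the empirical-rule percentages symmetrically around the mean. A minimal sketch of that arithmetic, assuming a symmetric mound-shaped distribution; the names `WITHIN`, `mean_to_k_sd` and `beyond_k_sd_one_side` are hypothetical.

```python
# Approximate fraction of measurements within k standard deviations of the mean.
WITHIN = {1: 0.68, 2: 0.95, 3: 0.997}

def mean_to_k_sd(k):
    """Fraction between the mean and k standard deviations on one side."""
    return WITHIN[k] / 2

def beyond_k_sd_one_side(k):
    """Fraction more than k standard deviations away on one side only."""
    return (1 - WITHIN[k]) / 2

# Hummingbird example: mean 55 beats per second, standard deviation 10.
p_45_to_65 = WITHIN[1]                 # within one standard deviation
p_55_to_65 = mean_to_k_sd(1)           # about 0.34
p_below_45 = beyond_k_sd_one_side(1)   # about 0.16
p_above_75 = beyond_k_sd_one_side(2)   # about 0.025
```

The same two helpers cover every interval whose endpoints are the mean or a whole number of standard deviations away from it.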
Numerical Measures of Relative Standing
Percentiles: for any (large) set of $n$ measurements (arranged in ascending or descending order), the $p$th percentile is a number such that $p\%$ of the measurements fall below that number and $(100 - p)\%$ fall above it.
$k$th quartile: $k$ quarters of the measurements lie below it.
Percentiles: Finding percentiles is similar to finding the median; the median is the 50th percentile.
– If you are in the 50th percentile for the GRE, half of the test-takers scored better and half scored worse than you.
– If you are in the 75th percentile, you scored better than three-quarters of the test-takers.
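The "percent below" reading of a percentile can be checked directly by counting. This sketch uses a made-up set of 100 distinct scores; `percent_below` is a hypothetical helper, and statistical packages use several slightly different interpolation conventions that this does not attempt to reproduce.

```python
def percent_below(values, x):
    """Percentage of measurements strictly below x."""
    return 100 * sum(v < x for v in values) / len(values)

scores = list(range(1, 101))  # hypothetical scores 1, 2, ..., 100

standing_51 = percent_below(scores, 51)  # half of the scores are lower
standing_76 = percent_below(scores, 76)  # three-quarters are lower
```

With small or heavily tied data sets the counts will not split into exactly p% and (100 - p)%, which is why the definition is stated for large sets.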
Z-scores: The z-score tells us how many standard deviations above or below the mean a particular measurement is.
Sample z-score: $z = (x - \bar{x})/s$
Population z-score: $z = (x - \mu)/\sigma$
Z-Scores: Z-scores are related to the empirical rule. For a perfectly symmetrical and mound-shaped distribution,
– ~68% will have z-scores between -1 and 1
– ~95% will have z-scores between -2 and 2
– ~99.7% will have z-scores between -3 and 3
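A z-score is a single subtraction and division, so it is straightforward to compute. The sketch below reuses the hummingbird figures (mean 55, standard deviation 10) for a population-style z-score and a small hypothetical sample for the sample version; the function names are illustrative.

```python
from statistics import mean, stdev

def z_score(x, center, spread):
    """Number of standard deviations that x lies above (+) or below (-) center."""
    return (x - center) / spread

def sample_z_scores(values):
    """z-scores of every value, using the sample mean and standard deviation."""
    x_bar, s = mean(values), stdev(values)
    return [z_score(x, x_bar, s) for x in values]

z_65 = z_score(65, 55, 10)   # one standard deviation above the mean
z_45 = z_score(45, 55, 10)   # one standard deviation below the mean
zs = sample_z_scores([45, 50, 55, 60, 65])  # hypothetical sample
```

Within any one data set the z-scores sum to zero, since deviations from the mean cancel.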
Methods for Determining Outliers: An outlier is a measurement that is unusually large or small relative to the other values. Three possible causes:
– Observation, recording or data entry error
– Item is from a different population
– A rare, chance event
Box plot: The box plot is a graph representing information about certain percentiles for a data set and can be used to identify outliers.
Box plot labels: Minimum Value; Lower Quartile ($Q_L$); Median; Upper Quartile ($Q_U$); Maximum Value
Interquartile Range: $\mathrm{IQR} = Q_U - Q_L$
Inner Fence at $Q_U + 1.5(\mathrm{IQR})$; Outer Fence at $Q_U + 3(\mathrm{IQR})$
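The IQR and the two upper fences can be computed once the quartiles are known. In this sketch `upper_fences` is a hypothetical helper and `ages` is made-up data; the quartiles come from Python's `statistics.quantiles`, whose default convention may differ slightly from the hand method used in a given textbook.

```python
from statistics import quantiles

def upper_fences(q_l, q_u):
    """IQR plus the inner and outer fences above the upper quartile."""
    iqr = q_u - q_l
    return {"iqr": iqr, "inner": q_u + 1.5 * iqr, "outer": q_u + 3 * iqr}

# Worked by hand: Q_L = 40, Q_U = 52 gives IQR 12, fences at 70 and 88.
by_hand = upper_fences(40, 52)

# Hypothetical data; quantiles(..., n=4) returns the three quartile cut points.
ages = [31, 35, 38, 40, 42, 45, 47, 50, 52, 55, 58, 90]
q_l, q_median, q_u = quantiles(ages, n=4)
from_data = upper_fences(q_l, q_u)
beyond_inner = [x for x in ages if x > from_data["inner"]]
```

Values listed in `beyond_inner` are the ones a box plot would flag for a closer look.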
Outliers and z-scores
– For a mound-shaped distribution, the chance that a z-score is between -3 and +3 is over 99%.
– Any measurement with |z| > 3 is considered an outlier.
Outliers and z-scores: Here are the descriptive statistics for the games won at the All-Star break, except that one team had its total wins for 2006 recorded. That team, with 104 wins recorded, had a z-score of $(104 - 45.68)/12.11 \approx 4.82$. That's a very unlikely result, which isn't surprising given what we know about the observation.
#Wins: n = 30; Mean = 45.68; Sample Variance = [value missing]; Sample Standard Deviation = 12.11; Minimum = 25; Maximum = 104
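The |z| > 3 rule can be applied to the mis-recorded team using the figures that appear in this example (104 wins, mean 45.68, divisor 12.11). The function name `is_outlier` and its `threshold` parameter are hypothetical.

```python
def is_outlier(z, threshold=3):
    """Flag a measurement whose z-score is more than `threshold` from zero."""
    return abs(z) > threshold

wins, mean_wins, sd_wins = 104, 45.68, 12.11
z = (wins - mean_wins) / sd_wins   # roughly 4.8 standard deviations above the mean
flagged = is_outlier(z)
```

A z-score that far out is a prompt to look for a recording error or a different population before treating it as a rare chance event.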
Graphing Bivariate Relationships: A scattergram (or scatterplot) shows the relationship between two quantitative variables.
If there is no linear relationship between the variables, the scatterplot may look like a cloud, a horizontal line or a more complex curve. Source: Quantitative Environmental Learning Project