Lecture 4
Dustin Lueker
STA 291 Summer 2010

• The population distribution for a continuous variable is usually represented by a smooth curve
  ◦ Like a histogram that gets finer and finer
    • Similar to the idea of using smaller and smaller rectangles to calculate the area under a curve when learning how to integrate
• Symmetric distributions
  ◦ Bell-shaped
  ◦ U-shaped
  ◦ Uniform
• Not symmetric distributions
  ◦ Left-skewed
  ◦ Right-skewed
  ◦ Skewed
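The "smaller and smaller rectangles" idea can be sketched in a few lines. This is only an illustration: the standard normal density is an assumed example of a smooth curve, and `normal_pdf` and `rectangle_area` are hypothetical helper names, not part of the lecture.

```python
import math

def normal_pdf(x):
    """Standard normal density, used here as an assumed example of a smooth curve."""
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def rectangle_area(f, a, b, bins):
    """Approximate the area under f on [a, b] with `bins` rectangles of equal width.

    Each rectangle's height is the curve's value at the middle of its base.
    """
    width = (b - a) / bins
    return sum(f(a + (k + 0.5) * width) * width for k in range(bins))

# Finer and finer rectangles: the approximations settle toward the true area
for bins in (4, 16, 64, 256):
    print(bins, rectangle_area(normal_pdf, -1.0, 1.0, bins))
```

As the number of rectangles grows, the printed areas change less and less, which is the sense in which a finer histogram approaches the smooth curve.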

• Center of the data
  ◦ Mean
  ◦ Median
  ◦ Mode
• Dispersion of the data (sometimes referred to as spread)
  ◦ Variance, standard deviation
  ◦ Interquartile range
  ◦ Range

• Mean
  ◦ Arithmetic average
• Median
  ◦ Midpoint of the observations when they are arranged in order, smallest to largest
• Mode
  ◦ Most frequently occurring value
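As a quick illustration, the three measures of center can be computed with Python's standard `statistics` module. The sample below is hypothetical and chosen only so that the three values differ.

```python
import statistics

# Hypothetical sample, chosen only for illustration
sample = [2, 3, 3, 5, 7, 10]

mean_value = statistics.mean(sample)      # arithmetic average
median_value = statistics.median(sample)  # midpoint of the ordered observations
mode_value = statistics.mode(sample)      # most frequently occurring value

print(mean_value, median_value, mode_value)
```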

• Sample size $n$
• Observations $x_1, x_2, \ldots, x_n$
• Sample mean “x-bar”: $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$

• Population size $N$
• Observations $x_1, x_2, \ldots, x_N$
• Population mean “mu”: $\mu = \frac{1}{N} \sum_{i=1}^{N} x_i$
• Note: This is for a finite population of size $N$

• Requires numerical values
  ◦ Only appropriate for quantitative data
  ◦ Does not make sense to compute the mean for nominal variables
  ◦ Can be calculated for ordinal variables, but this does not always make sense
    • Be careful when using the mean on ordinal variables
    • Example: “Weather” on an ordinal scale, with Sun=1, Partly Cloudy=2, Cloudy=3, Rain=4, Thunderstorm=5, gives a mean (average) weather of 2.8
    • Another example: “GPA = 3.8” is also a mean of observations measured on an ordinal scale

• Center of gravity for the data set
• The sum of the distances to the mean from the values above it equals the sum of the distances to the mean from the values below it
  ◦ 3 + 2 + 2 = 3 + 4
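The balance property can be checked numerically. The data set below is hypothetical; it was picked so that the distances reproduce the 3 + 2 + 2 = 3 + 4 pattern, since the slide's own data did not survive.

```python
# Hypothetical data chosen so the distances match the 3 + 2 + 2 = 3 + 4 pattern
data = [1, 2, 7, 7, 8]
mean = sum(data) / len(data)

# Total distance to the mean from the values above it, and from the values below it
above = sum(x - mean for x in data if x > mean)
below = sum(mean - x for x in data if x < mean)

print(mean, above, below)
```

Both totals come out equal, which is why the mean acts like the balance point of the data.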

• Mean
  ◦ Sum of observations divided by the number of observations
• Example
  ◦ {7, 12, 11, 18}
  ◦ Mean = (7 + 12 + 11 + 18) / 4 = 48 / 4 = 12

• Highly influenced by outliers
  ◦ Data points that are far from the rest of the data
  ◦ Example
    • Monthly income for five people: 1,000  2,000  3,000  4,000  100,000
    • Average monthly income = 110,000 / 5 = 22,000
    • What is the problem with using the average to describe this data set?
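A short sketch of how much one outlier moves the mean, using the five monthly incomes from the example. Dropping the largest value is done here only to show its influence, not as a recommended way to handle outliers.

```python
incomes = [1_000, 2_000, 3_000, 4_000, 100_000]

# Mean of all five incomes
mean_all = sum(incomes) / len(incomes)

# Mean of the four smaller incomes, leaving out the outlier for comparison only
without_outlier = incomes[:-1]
mean_without = sum(without_outlier) / len(without_outlier)

print(mean_all, mean_without)
```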

• Measurement that falls in the middle of the ordered sample
• When the sample size n is odd, there is a middle value
  ◦ It has the ordered index (n+1)/2
    • The ordered index is where that value falls when the sample is listed from smallest to largest
    • An index of 2 means the second smallest value
  ◦ Example
    • 1.7, 4.6, 5.7, 6.1, 8.3
    • n = 5, (n+1)/2 = 6/2 = 3, index = 3
    • Median = 3rd smallest observation = 5.7

• When the sample size n is even, average the two middle values
  ◦ Example
    • 3, 5, 6, 9
    • n = 4, (n+1)/2 = 5/2 = 2.5, index = 2.5
    • Median = midpoint between the 2nd and 3rd smallest observations = (5+6)/2 = 5.5
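The ordered-index rule for odd and even sample sizes can be sketched as one function. The name `median_by_index` is hypothetical; Python's built-in `statistics.median` gives the same result.

```python
import math

def median_by_index(values):
    """Median via the ordered index (n + 1) / 2, for odd or even n."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty sample is undefined")
    index = (n + 1) / 2                      # 1-based ordered index, ends in .5 when n is even
    lower = ordered[math.floor(index) - 1]   # observation just at or below the index
    upper = ordered[math.ceil(index) - 1]    # observation just at or above the index
    return (lower + upper) / 2               # same value twice when n is odd

print(median_by_index([1.7, 4.6, 5.7, 6.1, 8.3]))
print(median_by_index([3, 5, 6, 9]))
```

When n is odd, the floor and ceiling of the index coincide, so the function simply returns the middle observation.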

• For skewed distributions, the median is often a more appropriate measure of central tendency than the mean
• The median usually better describes a “typical value” when the sample distribution is highly skewed
• Example
  ◦ Monthly income for five people: 1,000  2,000  3,000  4,000  100,000
  ◦ Median monthly income: 3,000
• Why is the median better to use with this data than the mean?
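One way to see why the median is more typical for the income data is to count how many people fall below each measure. This is a small sketch using the standard `statistics` module.

```python
import statistics

incomes = [1_000, 2_000, 3_000, 4_000, 100_000]

mean_income = statistics.mean(incomes)
median_income = statistics.median(incomes)

# How many of the five people earn less than each measure of center
below_mean = sum(1 for x in incomes if x < mean_income)
below_median = sum(1 for x in incomes if x < median_income)

print(mean_income, below_mean)
print(median_income, below_median)
```

Four of the five incomes lie below the mean, while the median splits the sample evenly.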

• Mode: most frequent value
• Notation: subscripted variables
  ◦ $n$ = number of units in the sample
  ◦ $N$ = number of units in the population
  ◦ $x$ = variable to be measured
  ◦ $x_i$ = measurement of the i-th unit
• Mean: arithmetic average
• Median: midpoint of the observations when they are arranged in increasing order

• Example: Highest Degree Completed

Highest Degree                                          Frequency   Percentage
Not a high school graduate                                 38,012         21.4
High school only                                           65,291         36.8
Some college, no degree                                    33,191         18.7
Associate, Bachelor, Master, Doctorate, Professional       41,124         23.2
Total                                                     177,618          100

• n = 177,618
• (n+1)/2 = 88,809.5
• Median = midpoint between the 88,809th and 88,810th smallest observations
  ◦ Both are in the category “High school only”
• Mean would not make sense here since the variable is ordinal
• Median
  ◦ Can be used for interval data and for ordinal data
  ◦ Cannot be used for nominal data because the observations cannot be ordered on a scale
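The median category can be located by accumulating the frequencies in category order until the middle ranks are reached. The sketch below uses the frequencies from the "Highest Degree Completed" example; the helper name `category_of` is hypothetical.

```python
# Frequencies from the "Highest Degree Completed" example, in category order
categories = [
    ("Not a high school graduate", 38_012),
    ("High school only", 65_291),
    ("Some college, no degree", 33_191),
    ("Associate, Bachelor, Master, Doctorate, Professional", 41_124),
]

def category_of(rank, table):
    """Return the category holding the observation with the given 1-based rank."""
    cumulative = 0
    for name, count in table:
        cumulative += count
        if rank <= cumulative:
            return name
    raise ValueError("rank exceeds the number of observations")

n = sum(count for _, count in categories)
lower_rank, upper_rank = n // 2, n // 2 + 1  # the two middle ranks, since n is even here

print(n, lower_rank, upper_rank)
print(category_of(lower_rank, categories))
print(category_of(upper_rank, categories))
```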

• Mean
  ◦ Interval data with an approximately symmetric distribution
• Median
  ◦ Interval data
  ◦ Ordinal data
• The mean is sensitive to outliers; the median is not

• Symmetric distribution
  ◦ Mean = Median
• Skewed distribution
  ◦ The mean lies farther than the median in the direction in which the distribution is skewed (toward the long tail)

• While the median is better than the mean for skewed distributions, there is one large disadvantage to using the median
  ◦ Insensitive to changes within the lower or upper half of the data
  ◦ Example: both samples below have median 3
    • 1, 2, 3, 4, 5
    • 1, 2, 3, 100, 100
  ◦ Sometimes the mean is more informative even when the distribution is skewed
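The two example samples can be compared directly. This sketch uses the standard `statistics` module; the variable names are only for illustration.

```python
import statistics

first = [1, 2, 3, 4, 5]
second = [1, 2, 3, 100, 100]

# The median ignores the change in the upper half of the data
median_first = statistics.median(first)
median_second = statistics.median(second)

# The mean reflects it
mean_first = statistics.mean(first)
mean_second = statistics.mean(second)

print(median_first, median_second)
print(mean_first, mean_second)
```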

• Keeneland Sales

• The deviation of the i-th observation $x_i$ from the sample mean $\bar{x}$ is the difference between them, $x_i - \bar{x}$
  ◦ The sum of all deviations is zero
  ◦ Therefore, we use either the sum of the absolute deviations or the sum of the squared deviations as a measure of variation
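A brief check that the deviations cancel out, while their absolute values and squares do not. The data set {7, 12, 11, 18} is reused from the earlier mean example.

```python
data = [7, 12, 11, 18]
mean = sum(data) / len(data)

# Deviation of each observation from the sample mean
deviations = [x - mean for x in data]

sum_deviations = sum(deviations)               # positive and negative deviations cancel
sum_absolute = sum(abs(d) for d in deviations)
sum_squared = sum(d ** 2 for d in deviations)

print(deviations)
print(sum_deviations, sum_absolute, sum_squared)
```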

• The variance of n observations is the sum of the squared deviations, divided by n-1: $s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}$

Observation   Mean   Deviation   Squared Deviation
1             5      -4          16
3             5      -2          4
4             5      -1          1
7             5      2           4
10            5      5           25

Sum of the Squared Deviations            50
n-1                                      4
Sum of the Squared Deviations / (n-1)    12.5
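The variance table for the observations 1, 3, 4, 7, 10 can be worked through step by step. This is a minimal sketch that follows the column order of the table.

```python
observations = [1, 3, 4, 7, 10]
n = len(observations)
mean = sum(observations) / n

# One row per observation: (observation, mean, deviation, squared deviation)
rows = [(x, mean, x - mean, (x - mean) ** 2) for x in observations]
for row in rows:
    print(row)

sum_squared = sum(row[3] for row in rows)  # sum of the squared deviations
variance = sum_squared / (n - 1)           # divide by n - 1 for a sample
print(sum_squared, n - 1, variance)
```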

• Variance is about the average of the squared deviations
  ◦ “Average squared distance from the mean”
• Unit
  ◦ Square of the unit for the original data
• Difficult to interpret
  ◦ Solution
    • Take the square root of the variance, and the unit is the same as for the original data
    • Standard deviation

• s ≥ 0
  ◦ s = 0 only when all observations are the same
• If data is collected for the whole population instead of a sample, then n-1 is replaced by N
• s is sensitive to outliers

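• Sample
  ◦ Variance: $s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}$
  ◦ Standard deviation: $s = \sqrt{s^2}$
• Population
  ◦ Variance: $\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$
  ◦ Standard deviation: $\sigma = \sqrt{\sigma^2}$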
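Python's standard `statistics` module has separate functions for the sample and population versions, which makes the n-1 versus N difference easy to see. The data set is the one used in the variance table; treating it as a whole population is only an assumption for comparison.

```python
import math
import statistics

data = [1, 3, 4, 7, 10]

sample_var = statistics.variance(data)       # sum of squared deviations / (n - 1)
population_var = statistics.pvariance(data)  # sum of squared deviations / N
sample_sd = statistics.stdev(data)           # square root of the sample variance
population_sd = statistics.pstdev(data)      # square root of the population variance

# The standard deviation is zero when all observations are the same
constant_sd = statistics.stdev([4, 4, 4])

print(sample_var, population_var)
print(sample_sd, population_sd, constant_sd)
```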

• Population mean and population standard deviation are denoted by the Greek letters μ (mu) and σ (sigma)
  ◦ They are unknown constants that we would like to estimate
• Sample mean and sample standard deviation are denoted by $\bar{x}$ and $s$
  ◦ They are random variables, because their values vary according to the random sample that has been selected
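A small simulation can make the constant-versus-random distinction concrete. Everything here is an assumption for illustration: the population is the integers 1 to 100 (so μ and σ can be computed, unlike in practice), the sample size is 10, and the seed is arbitrary.

```python
import random
import statistics

rng = random.Random(291)  # fixed, arbitrary seed so the run is repeatable

# Hypothetical finite population; mu and sigma are fixed constants
population = list(range(1, 101))
mu = statistics.mean(population)
sigma = statistics.pstdev(population)

# Each random sample gives its own sample mean and sample standard deviation
sample_means = []
sample_sds = []
for _ in range(5):
    sample = rng.sample(population, 10)
    sample_means.append(statistics.mean(sample))
    sample_sds.append(statistics.stdev(sample))

print(mu, sigma)
print(sample_means)
print(sample_sds)
```

The population values stay the same on every pass through the loop, while the sample values change with each new sample.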

• If the data is approximately symmetric and bell-shaped, then
  ◦ About 68% of the observations are within one standard deviation of the mean
  ◦ About 95% of the observations are within two standard deviations of the mean
  ◦ About 99.7% of the observations are within three standard deviations of the mean
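The three percentages can be compared against an exactly normal distribution using `statistics.NormalDist` from the standard library. Assuming exact normality is stronger than the slide's "approximately bell-shaped", so this is a reference check rather than a statement about real data; the helper name `within` is hypothetical.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def within(k):
    """Share of a normal distribution lying within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(k, round(within(k), 4))
```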

• Scores on a standardized test are scaled so they have a bell-shaped distribution with a mean of 1000 and standard deviation of 150
  ◦ About 68% of the scores are between 850 and 1150
  ◦ About 95% of the scores are between 700 and 1300
  ◦ If you have a score above 1300, you are in the top 2.5%
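The test-score example can be worked out with a few lines of arithmetic. The last line also computes the tail share under an exactly normal distribution, which is an added assumption; it comes out slightly below the rule-of-thumb figure because the 95% in the empirical rule is rounded.

```python
from statistics import NormalDist

mean, sd = 1000, 150

one_sd = (mean - sd, mean + sd)          # range holding about 68% of the scores
two_sd = (mean - 2 * sd, mean + 2 * sd)  # range holding about 95% of the scores

# About 5% of scores lie outside two standard deviations, half of them on the high side
top_share_rule = (1 - 0.95) / 2

# Same tail computed under an exactly normal distribution (an assumption)
top_share_normal = 1 - NormalDist(mean, sd).cdf(1300)

print(one_sd, two_sd)
print(top_share_rule, top_share_normal)
```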
