SYMMETRIC SKEWED LEFT SKEWED RIGHT


1 SYMMETRIC SKEWED LEFT SKEWED RIGHT
Describe the shape, center, and spread of a distribution. For shape, see below. [Figure: three distribution shapes. SYMMETRIC: Mode = Mean = Median. SKEWED LEFT (negatively): Mean, Median, Mode from left to right. SKEWED RIGHT (positively): Mode, Median, Mean from left to right.]

2 The Mean (average), "Xbar"
Mathematical notation: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ [Table: woman (i) and height (x) for i = 1 to 25; the height values are not recoverable.] The function in R is called mean(); or use summary().
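The definition of the mean can be checked against R's built-in function with a minimal sketch. The `heights` vector below is hypothetical (the class data is not reproduced here), and the variable names are illustrative only.

```r
# Hypothetical heights in inches; NOT the class data from the slide
heights <- c(61.5, 63.0, 64.2, 64.8, 65.5, 66.1, 67.3)

# Mean from the definition: sum of the observations divided by n
xbar_manual <- sum(heights) / length(heights)

# Built-in equivalents
xbar <- mean(heights)
summary(heights)   # prints min, quartiles, median, mean, max
```

Both `xbar_manual` and `xbar` should agree up to floating-point rounding.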

3 Your numerical summary must be meaningful!
Height of 25 women in a class: the distribution of women's heights appears coherent and symmetrical. The mean is a good numerical summary. Here the shape of the distribution is wildly irregular. Why? Could we have more than one plant species or phenotype?

4 A single numerical summary here would not make sense.

5 The Median (M)
The Median (M) is often called the "middle" value. It is the value at the midpoint of the observations when they are ranked from smallest to largest.
Arrange the data from smallest to largest.
If n is odd, the median is the single observation in the center (at the (n+1)/2 position in the ordering).
If n is even, the median is the average of the two middle observations (the (n+1)/2 position falls in between them).
Use R to calculate the median number of hours completed in the S215 dataset. Be careful about the missing values ("NA"; see below). How would you interpret its value? What does it mean in the context of this data?
    median(x, na.rm=TRUE) #median of vector x
    quantile(x, probs=c(0,.25,.5,.75,1), na.rm=TRUE, type=6)
    #look at help(quantile) for the different values of type
    #note this quantile call gives the "five number summary" of vector x
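A short sketch of how missing values affect the median. The `hours` vector is hypothetical and merely stands in for the hours-completed column of the S215 dataset, which is not shown here.

```r
# Hypothetical hours completed, with two missing values (NA)
hours <- c(12, 15, NA, 9, 18, 15, NA, 6)

# Without na.rm the result is NA, because missing values propagate
med_with_na <- median(hours)

# Dropping the NAs leaves n = 6 values: 6, 9, 12, 15, 15, 18
# n is even, so the median is the average of the two middle values
med <- median(hours, na.rm = TRUE)

# Five-number summary via quantile(); type = 6 is one of several definitions
five <- quantile(hours, probs = c(0, .25, .5, .75, 1), na.rm = TRUE, type = 6)
```

Here the two middle values are 12 and 15, so `med` is their average.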

6 Mean and median of a distribution with outliers
[Figure: two histograms, "Without the outliers" and "With the outliers"; axis: percent of people dying.] The mean is pulled to the right a lot by the outliers (from 3.4 to 4.2). The median, on the other hand, is only slightly pulled to the right by the outliers (from 3.4 to 3.6).
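The same effect can be reproduced with a small sketch. The data below are hypothetical, not the values behind the slide's plots, so the numbers differ from 4.2 and 3.6; only the pattern is the point.

```r
# Hypothetical values, roughly symmetric around 3.4
x <- c(2.1, 2.8, 3.1, 3.4, 3.7, 4.0, 4.7)

# The same data with two large outliers appended
x_out <- c(x, 9.5, 10.2)

# How far each summary moves when the outliers are added
mean_shift   <- mean(x_out) - mean(x)
median_shift <- median(x_out) - median(x)
```

With these values `mean_shift` is several times larger than `median_shift`, which is why the median is described as resistant to outliers.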

7 Mean and median of a symmetric … and a right-skewed distribution
Impact of skewed data. Disease X (symmetric distribution): mean and median are the same. Multiple myeloma (right-skewed distribution): the mean is pulled toward the skew.

8 Spread: percentiles, quartiles (Q1 and Q3), IQR, 5-number summary (and boxplots), range, standard deviation
The pth percentile of a variable is a data value such that p% of the values of the variable fall at or below it. Percentiles are also called quantiles.
The lower (Q1) and upper (Q3) quartiles are special percentiles dividing the data into quarters (fourths). Get them by finding the medians of the lower and upper halves of the data.
IQR = interquartile range = Q3 - Q1 = spread of the middle 50% of the data. IQR is used with the so-called 1.5*IQR criterion for outliers. Know this!
In R, use the quantile function for all the above; for IQR use
    diff(quantile(x, c(.25,.75), na.rm=TRUE, names=FALSE))
    IQR(x, na.rm=TRUE) #gives the same thing
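As a minimal sketch of the two IQR computations and the 1.5*IQR criterion, using a hypothetical vector `x` (the fence variable names are illustrative, not from the notes):

```r
# Hypothetical data with one missing value
x <- c(4, 7, NA, 1, 9, 3, 8, 5)

# Quartiles via quantile(); names = FALSE drops the "25%"/"75%" labels
q <- quantile(x, c(.25, .75), na.rm = TRUE, names = FALSE)

# IQR two ways: by hand and with the built-in function
iqr_manual  <- diff(q)
iqr_builtin <- IQR(x, na.rm = TRUE)

# 1.5*IQR criterion: values outside these fences are flagged as outliers
lower_fence <- q[1] - 1.5 * iqr_manual
upper_fence <- q[2] + 1.5 * iqr_manual
flagged <- x[!is.na(x) & (x < lower_fence | x > upper_fence)]
```

Note that `IQR()` needs `na.rm = TRUE` when the vector contains missing values, just as `quantile()` does.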

9 Consider the CO2 data
First do a plot with hist and stem; check out the use of the rug function along with the hist function. Change the number of bins with the breaks argument in hist (see help(hist) for how to use the breaks argument).
Now get summary statistics with summary, fivenum, and IQR, or individually with mean, median, quantile, var, and sd. Notice the slight differences in the quartiles when summary and fivenum are used (use sort to check this out).
Graphically, boxplot shows a plot of the five-number summary. See Notes1.2, page 21 for a picture of a generic boxplot.
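The steps above might look like the sketch below. It assumes R's built-in `co2` time series (Mauna Loa CO2 concentrations) as a stand-in; the course's own CO2 data may be a different dataset, and `breaks = 20` is an arbitrary choice.

```r
# ASSUMPTION: the built-in co2 series stands in for the course's CO2 data
x <- as.numeric(co2)

# Plots: histogram with a rug of the raw values, then a stem-and-leaf display
hist(x, breaks = 20, main = "CO2", xlab = "concentration (ppm)")
rug(x)
stem(x)

# Summary statistics, together and individually
summary(x)
fivenum(x)
IQR(x)
c(mean = mean(x), median = median(x), var = var(x), sd = sd(x))

# Quartiles from quantile() (used by summary) versus hinges from fivenum()
q_summary <- quantile(x, c(.25, .75), names = FALSE)
q_fivenum <- fivenum(x)[c(2, 4)]

# Boxplot of the five-number summary
boxplot(x, horizontal = TRUE)
```

Comparing `q_summary` with `q_fivenum`, and looking at `sort(x)`, shows how the two quartile definitions can differ slightly.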

10 Measure of spread: the quartiles
The first quartile, Q1, is the value in the sample that has 25% of the data at or below it (it is the median of the lower half of the sorted data, excluding M). The third quartile, Q3, is the value in the sample that has 75% of the data at or below it (it is the median of the upper half of the sorted data, excluding M). In the example: Q1 = first quartile = 2.2; M = median = 3.4; Q3 = third quartile = 4.35.

11 Five-number summary and boxplot
Five-number summary: min, Q1, M, Q3, max. Smallest = min = 0.6; Q1 = first quartile = 2.2; M = median = 3.4; Q3 = third quartile = 4.35; Largest = max = 6.1. [Figure: BOXPLOT of the five-number summary.]

12 Boxplots for skewed data
Comparing boxplots for a normal and a right-skewed distribution: boxplots remain true to the data and clearly depict symmetry or skew.

13 Interquartile range and the 1.5 * IQR rule
Q1 = 2.2 and Q3 = 4.35, so the interquartile range is Q3 – Q1 = 4.35 − 2.2 = 2.15. Distance to Q3: 7.9 − 4.35 = 3.55. Individual #25 has a value of 7.9 years, which is 3.55 years above the third quartile. This is more than 3.225 years, 1.5 * IQR. Thus, individual #25 is an outlier by our 1.5 * IQR rule.
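The arithmetic of the 1.5 * IQR check can be sketched in R using the quartiles and the value quoted on the slide (the variable names are illustrative):

```r
# Values quoted on the slide
q1    <- 2.2
q3    <- 4.35
value <- 7.9            # individual #25

iqr      <- q3 - q1     # interquartile range: 2.15
cutoff   <- 1.5 * iqr   # 1.5 * IQR: 3.225
distance <- value - q3  # distance above Q3: 3.55

# Flagged as an outlier if it lies more than 1.5 * IQR above Q3
is_outlier <- distance > cutoff
```

Since 3.55 exceeds 3.225, `is_outlier` is TRUE.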

14

15 Look at this graph to see deviations from the mean
Metabolic rates for 7 men in a dieting study: 1792, 1666, 1362, 1614, 1460, 1867, 1439 (the seventh value follows from n = 7 and the mean of 1600). Mean = 1600 calories, s ≈ 189.24 calories. Note that in R we can get the standard deviation from the variance function var, or use the built-in sd:
    std = function(x) sqrt(var(x, na.rm=TRUE))
    sd(x) #built-in standard deviation function
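A sketch of the deviations-from-the-mean calculation for the metabolic rate data. Only six values appear on the slide; the seventh value used here, 1439, is inferred from the stated mean of 1600 and n = 7, so treat it as an assumption.

```r
# Six values from the slide plus 1439 (ASSUMED: 7 * 1600 minus the other six)
rates <- c(1792, 1666, 1362, 1614, 1460, 1867, 1439)

xbar       <- mean(rates)    # 1600
deviations <- rates - xbar   # distances from the mean; they sum to zero

# Variance: sum of squared deviations divided by n - 1
s2 <- sum(deviations^2) / (length(rates) - 1)

# Standard deviation: square root of the variance, same units as the data
s <- sqrt(s2)

# The helper from the slide and the built-in function give the same value
std <- function(x) sqrt(var(x, na.rm = TRUE))
c(by_hand = s, std = std(rates), sd = sd(rates))
```

All three entries of the final vector should match, at roughly 189.24 calories.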

16 why do we square the deviations
Why do we square the deviations? There are two technical reasons that we'll see when we discuss the normal distribution in the next section.
Why do we use the standard deviation (s) instead of the variance (s^2)? s^2 has units which are the squares of the original units of the data.
Why do we divide by n-1 instead of n? n-1 is called the number of degrees of freedom; since the sum of the deviations is zero, the last deviation can always be found if we know n-1 of them. Be careful when using the TI-83, since it calculates both versions (division by n and by n-1).
Which measure of spread is best? The 5-number summary is better than the mean and s.d. for skewed data; use the mean and s.d. for symmetric data.
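The degrees-of-freedom argument can be illustrated with a small sketch on hypothetical data: the last deviation is recovered from the other n-1, and R's var() is seen to divide by n-1 rather than n.

```r
# Hypothetical data
x   <- c(3, 5, 8, 10, 14)
n   <- length(x)
dev <- x - mean(x)           # deviations from the mean

# The deviations sum to zero, so the last one is determined by the first n - 1
last_dev <- -sum(dev[1:(n - 1)])

# Dividing by n - 1 (what var() does) versus dividing by n
var_n1 <- sum(dev^2) / (n - 1)
var_n  <- sum(dev^2) / n
```

Here `last_dev` equals `dev[n]`, and `var_n1` matches `var(x)` while `var_n` is smaller.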

17 HW for next Wednesday: Make sure you have R installed and are comfortable with its usage in doing simple computations - we'll be building from here… Review the two sets of notes (Notes1.1 and Notes1.2) on the website. Come to class with any questions… Finish the reading and the problems at the end of Reading & Problems 1.1. Come to class with any questions… We will wrap up the Reading and Problems 1.2 next week and move on to our study of the Normal Distribution.

