Published by Nancy Lang. Modified over 9 years ago.
1
STAT02 - Descriptive statistics (cont.)
Descriptive statistics (cont.)
Lecturer: Smilen Dimitrov
Applied statistics for testing and evaluation – MED4
2
Introduction
We previously discussed the arithmetic mean as a measure of central tendency (location) of a data sample (collection) in descriptive statistics.
Here we continue with two other important measures of central tendency: the mode and the median.
We will also get acquainted with frequency tables and their graphical form, histograms, and with the range as a measure of statistical variability (dispersion or spread).
We will look at how to perform these operations in R, and learn a bit more about plotting.
3
Arithmetic mean as central tendency, range and outliers
The range is the length of the smallest interval that contains all the data.
– It is calculated by subtracting the smallest observation from the greatest.
In R, we can use the commands min and max to find the range of a data collection.
We can use abline to plot straight lines.
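A minimal sketch of the range calculation in R. The vector `raisins` below holds hypothetical illustration values; the actual raisin counts from the lecture are not reproduced in this transcript.

```r
# Hypothetical sample: raisins counted in each of 10 boxes
# (illustrative values only, not the lecture's data set)
raisins <- c(28, 26, 27, 29, 28, 30, 27, 28, 26, 29)

lo <- min(raisins)       # smallest observation
hi <- max(raisins)       # greatest observation
data_range <- hi - lo    # length of the smallest interval containing all data

# range() returns c(min, max); diff() of that pair gives the same number
same_range <- diff(range(raisins))
```

Note that R's own `range` returns the two end points rather than their difference, which is why the subtraction is done explicitly here.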
4
Arithmetic mean as central tendency, range and outliers
Our sample data set (raisins), with quantities plotted as a bar graph (using barplot), and with the range and arithmetic mean shown.
5
Arithmetic mean as central tendency, range and outliers
Our sample data set (raisins), with quantities plotted as a point/line plot (using plot), and with the range and arithmetic mean shown.
With plot, the y axis is automatically scaled to the range of the data.
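The bar graph and the point/line plot described on these two slides could be produced roughly as follows. This is a sketch only: the `raisins` values, axis labels and colours are assumptions chosen for illustration, not the lecture's originals.

```r
# Hypothetical sample (illustrative values, not the lecture's data set)
raisins <- c(28, 26, 27, 29, 28, 30, 27, 28, 26, 29)
m <- mean(raisins)

# Bar graph of the quantities, with mean and range marked by horizontal lines
barplot(raisins, names.arg = seq_along(raisins),
        xlab = "Box number", ylab = "Raisins per box")
abline(h = m, col = "red")                         # arithmetic mean
abline(h = range(raisins), col = "blue", lty = 2)  # minimum and maximum

# Point/line plot: type = "b" draws both points and connecting lines
plot(raisins, type = "b", xlab = "Box number", ylab = "Raisins per box")
abline(h = m, col = "red")
abline(h = range(raisins), col = "blue", lty = 2)
```

Passing a vector to the `h` argument of `abline` draws one horizontal line per value, so the minimum and maximum can be marked in a single call.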
6
Arithmetic mean as central tendency, range and outliers
Our sample data set (raisins), with quantities plotted as a point/line plot (using plot), and with the range and arithmetic mean shown, after changing a single value so that it lies outside the original range.
7
Arithmetic mean as central tendency, range and outliers
Both the range and the arithmetic mean change significantly if only one value is quite different from the others.
– An outlier is a single observation 'far away' from the rest of the data.
However, one outlier does not change the fact that the other values still lie close to the original arithmetic mean and within the original range.
Therefore we need tools and methods for describing central tendency and variability that are less sensitive to outliers.
For central tendency, we can use the mode and the median.
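The sensitivity to a single outlier can be checked numerically. The sketch below uses hypothetical `raisins` values and an arbitrarily chosen outlier of 60; R's built-in `median` function is used for comparison.

```r
# Hypothetical sample (illustrative values, not the lecture's data set)
raisins <- c(28, 26, 27, 29, 28, 30, 27, 28, 26, 29)

# Same data with one value replaced by an (assumed) outlier
with_outlier <- raisins
with_outlier[5] <- 60

mean_before   <- mean(raisins)
mean_after    <- mean(with_outlier)
range_before  <- diff(range(raisins))
range_after   <- diff(range(with_outlier))
median_before <- median(raisins)
median_after  <- median(with_outlier)
```

For these particular values the mean and range both shift noticeably, while the median is the same before and after the change.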
8
Mode and frequency distribution
By definition, outliers occur rarely: they are single occurrences.
It is therefore useful to see which values occur most often (most frequently). The most frequent value is the mode.
– The mode is the most frequent value assumed by a random variable, or occurring in a sample of a random variable.
– It applies both to probability distributions and to collections of experimental data.
– It can be unusable for real-valued data (where each value is typically unique and occurs only once), unless we apply histogram techniques.
– It can be applied to nominal data (the most frequent name, for instance).
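Base R has no function that returns the statistical mode (its `mode` function reports an object's storage type instead), so a small helper is needed. The function name `sample_mode` and both data vectors below are hypothetical, introduced only for illustration.

```r
# Hypothetical helper: returns every value that attains the highest count,
# so ties yield more than one mode. Values come back as character strings
# because they are taken from the names of the frequency table.
sample_mode <- function(x) {
  counts <- table(x)                     # occurrences of each distinct value
  names(counts)[counts == max(counts)]   # value(s) with the maximum count
}

# Integer data (illustrative values, not the lecture's data set)
raisins <- c(28, 26, 27, 29, 28, 30, 27, 28, 26, 29)
raisin_mode <- as.numeric(sample_mode(raisins))

# Nominal data: the most frequent name
first_names <- c("Anna", "Ben", "Anna", "Cara")
name_mode <- sample_mode(first_names)
```

Returning all tied values is a design choice; a data set in which every value occurs once would return the whole data set, which is the "unusable for real numbers" case mentioned above.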
9
Mode and frequency distribution
To see which value occurs most frequently, we must first count how many times each value occurs in the data collection. This frequency count is the distribution.
– Collecting and aggregating data results in a distribution. Distributions are most often presented as a histogram or a table (frequency table), with the aim of approximating them by a mathematical function and drawing conclusions.
– The frequency of an event $i$ is the number $n_i$ of times the event occurred in the experiment or the study. These frequencies are often represented graphically in histograms.
– Absolute frequencies: the counts $n_i$ themselves are given.
– Relative frequencies: the counts are normalized by the total number of events $N = \sum_j n_j$, so that $f_i = n_i / N$.
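As a brief sketch, absolute and relative frequencies can be computed in R as follows; the `raisins` values are hypothetical.

```r
# Hypothetical sample (illustrative values, not the lecture's data set)
raisins <- c(28, 26, 27, 29, 28, 30, 27, 28, 26, 29)

abs_freq <- table(raisins)                # absolute frequencies (counts)
rel_freq <- abs_freq / length(raisins)    # counts divided by total number

# prop.table() performs the same normalization
rel_freq_alt <- prop.table(abs_freq)
```

The relative frequencies always sum to 1, which makes distributions from samples of different sizes directly comparable.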
10
Building a histogram and frequency table (example using applets)
1. Standard collection of our data
2. Building a point plot histogram "manually", from the individual counts observed
3. Building a frequency table from a point plot histogram
4. Transition to a bar graph histogram from a point plot histogram
11
Mode and frequency distribution - histogram
A histogram is a graphical display of tabulated frequencies, that is, of a frequency table (distribution).
– It shows what proportion of cases fall into each of several or many specified categories.
– The categories are usually specified as non-overlapping intervals of some variable, called bins.
– In a more general mathematical sense, a histogram is simply a mapping that counts the number of observations falling into various disjoint categories (the bins); the graph of a histogram is merely one way to represent it.
12
Mode and frequency distribution - histogram
In R, a frequency table is obtained with the table command.
For integer data, a histogram is most easily drawn by plotting the output of table using plot or barplot.
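A short sketch of this approach, using hypothetical `raisins` values and assumed axis labels:

```r
# Hypothetical sample (illustrative values, not the lecture's data set)
raisins <- c(28, 26, 27, 29, 28, 30, 27, 28, 26, 29)

freq_table <- table(raisins)   # one count per distinct value
print(freq_table)

# Bar-style histogram of the counts
barplot(freq_table, xlab = "Raisins per box", ylab = "Frequency")

# plot() on a one-dimensional table draws one vertical line per value
plot(freq_table, xlab = "Raisins per box", ylab = "Frequency")
```

One caveat of this approach: `table` only lists values that actually occur, so an integer missing from the data leaves no empty slot in the `barplot` output.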
13
Mode and frequency distribution - histogram
In R there is a dedicated command, hist, for plotting a histogram.
– However, because it accepts real as well as integer numeric data, it needs some fine-tuning to graph integer data correctly.
Plotting relative frequencies is straightforward: divide the counts by the number of elements (length) in the data collection.
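One possible form of that fine-tuning, offered as an assumption rather than the lecturer's exact method, is to place the bin boundaries at half-integers so that each integer value gets its own bin of width 1. The `raisins` values are hypothetical.

```r
# Hypothetical sample (illustrative values, not the lecture's data set)
raisins <- c(28, 26, 27, 29, 28, 30, 27, 28, 26, 29)

# Bin edges at ..., 25.5, 26.5, ...: one unit-width bin centred on each integer
breaks <- seq(min(raisins) - 0.5, max(raisins) + 0.5, by = 1)

# freq = FALSE plots densities; with unit-width bins a density equals
# count / length(raisins), i.e. the relative frequency
h <- hist(raisins, breaks = breaks, freq = FALSE,
          main = "Relative frequency histogram",
          xlab = "Raisins per box", ylab = "Relative frequency")

rel_freq <- h$counts / length(raisins)   # the same values, computed directly
```

The equality between density and relative frequency holds only because the bins have width 1; for other bin widths the counts should be divided by `length(raisins)` explicitly and plotted with `barplot`.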
14
Median
A median is a number dividing the higher half of a sample, a population, or a probability distribution from the lower half.
– At most half the population have values less than the median, and at most half have values greater than the median.
– If both groups contain less than half the population, then some of the population is exactly equal to the median.
In R, computing the median by hand means one should:
– Sort the data collection in ascending order.
– Find out whether the data collection has an odd or even number of elements.
  If the number is odd, return the middle element of the collection.
  If the number is even, return the mean of the two middle elements of the data sample.
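The steps above can be sketched as a small function. The name `manual_median` is hypothetical; in practice R's built-in `median` gives the same result and is the usual choice.

```r
# Hypothetical helper implementing the sort / odd-even procedure
manual_median <- function(x) {
  s <- sort(x)        # ascending order
  n <- length(s)
  if (n %% 2 == 1) {
    s[(n + 1) / 2]                  # odd count: the middle element
  } else {
    mean(s[c(n / 2, n / 2 + 1)])    # even count: mean of the two middle ones
  }
}

odd_case  <- manual_median(c(5, 1, 3))       # sorted: 1 3 5
even_case <- manual_median(c(4, 1, 3, 2))    # sorted: 1 2 3 4
```

This sketch assumes a non-empty numeric vector without missing values; the built-in `median` additionally offers an `na.rm` argument for data containing `NA`.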
15
Review
Descriptive statistics
– Measures of central tendency (location): arithmetic mean, median, mode
– Measure of statistical variability (dispersion, spread): range
16
Exercise for mini-module 2 – STAT02
Exercise
Use the Sample Data Set of Southern Oscillations, given at http://www.itl.nist.gov/div898/handbook/pmc/section4/pmc4412.htm
Collect the southern oscillation data per month for three consecutive years in an Excel sheet.
– Choose the years based on your group number g, in consecutive blocks of three starting from 1955 (so group 1 would choose 1955, 1956, 1957; group 2 would choose 1958, 1959, 1960; etc.).
– Multiply all oscillation data by 10 so as to work with integers.
– Hint: you could use month numbers as row names and years as column names, both in Excel and in R.
Import the data into R and, for each year, find the arithmetic mean, the median and the mode of the oscillation.
Using R, plot the oscillation for each month as the quantity, for each of the assigned years. Mark the range and the median graphically on each graph.
Using R, plot the relative frequency histogram for each of the assigned years. Mark the arithmetic mean graphically on each graph.
Delivery: Deliver the collected data (in tabular format), the statistics found and the requested graphs for the assigned years in an electronic document. You are welcome to include R code as well.