Univariate EDA (Exploratory Data Analysis)
EDA
John Tukey (1970s)
data –two components:
smooth + rough
(patterned behaviour + random variation)
resistant measures/displays
–little influenced by changes in a small proportion of the total number of cases
–resistant to the effects of outliers
–emphasize smooth over rough components
concepts apply to statistics and to graphical methods
Tree Ring dates (AD)
dendrochronology dates: what do they mean?
usually helps to sort the data…
Stem-and-Leaf Diagram
11|62
12|39,39,40,41,41,43,55,71
original values preserved
no rounding, no loss of information…
can simplify in various ways…
11|6
12|
–‘leaves’ rounded to nearest decade
–‘stem’ based on centuries
116|2
117|
118|
119|
120|
121|
122|
123|99
124|0113
125|5
126|
127|1
‘stem’ based on decades…
the decade-based stems highlight the existence of gaps in the distribution of dates, and groups of dates…
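A minimal R sketch of how a decade-based display can be assembled by hand, assuming the nine tree-ring dates are the ones shown in the unsimplified diagram (1162 to 1271). The variable names are illustrative only; in practice stem() does this work.

```r
# Tree-ring dates (AD), copied from the unsimplified stem-and-leaf diagram
dates <- c(1162, 1239, 1239, 1240, 1241, 1241, 1243, 1255, 1271)

sorted <- sort(dates)                  # sorting is the first step

# Stem = decade (first three digits); empty decades are kept as factor levels
stems  <- factor(sorted %/% 10, levels = 116:127)
leaves <- sorted %% 10                 # leaf = final digit

# Paste together the leaves that share a stem; empty decades give ""
rows <- sapply(split(leaves, stems), paste, collapse = "")
cat(paste0(names(rows), "|", rows), sep = "\n")
```

Keeping the empty decades as factor levels is what makes the gap between the 1160s and the 1230s visible in the printed display.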
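R: stem()
vu <- round(runif(25, 0, 50), 0); stem(vu)     # 25 uniform random values between 0 and 50
vn <- round(rnorm(25, 25, 10), 0); stem(vn)    # 25 normal random values, mean 25, sd 10
stem(vn, scale=2)                              # scale controls the length of the plot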
Back-to-back stem-and-leaf plot
rim diameter data (cm), unit 1 vs. unit 2
percentiles
useful for constructing various kinds of EDA graphics
don’t confuse percentile with percent or proportion
Note:
frequency = count
relative frequency = percent or proportion
percentiles
“the pth percentile of a distribution: number such that approximately p percent of the values in the distribution are equal or less than that number…”
can be calculated for numbers that actually exist in the distribution, and interpolated for numbers that don’t…
percentiles
sort the data so that $x_1$ is the smallest value and $x_n$ is the largest (where n = total number of cases)
$x_i$ is the $p_i$th percentile of a dataset of n members, where:
$p_i = 100(i - 0.5)/n$
for n = 7:
$p_1 = 100(1 - 0.5)/7 = 7.1$
$p_2 = 100(2 - 0.5)/7 = 21.4$
$p_3 = 100(3 - 0.5)/7 = 35.7$
$p_4 = 100(4 - 0.5)/7 = 50$
etc… [1]
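A short R sketch of this calculation, assuming the rule $p_i = 100(i - 0.5)/n$ (which reproduces the values 7.1, 21.4, 35.7 and 50 quoted for n = 7) and assuming the seven-value batch 1, 3, 5, 7, 9, 9, 14 used in the R example elsewhere in these slides.

```r
# Assumed example batch (n = 7), sorted from smallest to largest
x <- sort(c(1, 3, 5, 7, 9, 9, 14))
n <- length(x)
i <- seq_len(n)               # rank of each case, 1 = smallest

p <- 100 * (i - 0.5) / n      # percentile assigned to the i-th sorted value
data.frame(i, x, p = round(p, 1))
```

The same positions, expressed as proportions, are what ppoints(n, a = 0.5) returns.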
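25th percentile? 85th percentile?
data: 1, 3, 5, 7, 9, 9, 14 (n = 7)
50th percentile: $i = (7 \times 50)/100 + 0.5 = 4$, so $x_i = 7$
25th percentile: $i = (7 \times 25)/100 + 0.5 = 2.25$, so $3 < x_i < 5$
if i is not an integer, then…
k = integer part of i; f = fractional part of i
$x_{int}$ = interpolated value of x
$x_{int} = (1 - f)\,x_k + f\,x_{k+1}$
25th percentile: i = 2.25, so k = 2 and f = 0.25
$x_{int} = (1 - 0.25)(3) + (0.25)(5) = 3.5$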
use R!!
test <- c(1,3,5,7,9,9,14)
quantile(test, .25, type=5)
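The hand interpolation can be checked against quantile(). The sketch below assumes the rank rule i = np/100 + 0.5, which is the rule quantile() applies with type = 5; pctl() is a hypothetical helper written for this illustration, not a base R function.

```r
test <- c(1, 3, 5, 7, 9, 9, 14)

# Hypothetical helper: p-th percentile by linear interpolation between ranks
pctl <- function(x, p) {
  x <- sort(x)
  n <- length(x)
  i <- n * p / 100 + 0.5         # fractional rank, inverse of p = 100(i - 0.5)/n
  i <- min(max(i, 1), n)         # clamp to the observed range
  k <- floor(i)                  # integer part of i
  f <- i - k                     # fractional part of i
  if (f == 0) return(x[k])       # whole-number rank: the value exists in the data
  (1 - f) * x[k] + f * x[k + 1]  # otherwise interpolate between neighbours
}

pctl(test, 50)                            # 7
pctl(test, 25)                            # 3.5
unname(quantile(test, 0.25, type = 5))    # 3.5
```

quantile() offers nine interpolation rules through its type argument, so other types can return slightly different values for the same batch.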
“boxplot”
percentiles: 25th (lower hinge), 50th, 75th (upper hinge)
interquartile range (midspread): distance between the hinges
inner fence (1.5 x midspread)
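A minimal R sketch of the boxplot ingredients, assuming the same seven-value example batch used for the percentile calculation. It relies on fivenum() for the hinges; the variable names are illustrative.

```r
test <- c(1, 3, 5, 7, 9, 9, 14)

five  <- fivenum(test)           # min, lower hinge, median, upper hinge, max
lower <- five[2]
upper <- five[4]
midspread <- upper - lower       # spread between the two hinges

# Inner fences sit 1.5 midspreads beyond each hinge
fences  <- c(lower - 1.5 * midspread, upper + 1.5 * midspread)
outside <- test[test < fences[1] | test > fences[2]]   # cases beyond the fences

# Whiskers stop at the most extreme cases that lie inside the fences
boxplot(test, horizontal = TRUE)
```

Hinges from fivenum() are close to the 25th and 75th percentiles but need not equal them: here the lower hinge is 4, while quantile(test, .25, type=5) gives 3.5.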
Figure 6.25: Internal diversity of neighbourhoods used to define N-clusters, measured by the 'evenness' statistic H/Hmax on the basis of counts of various A-clusters, and broken down by N-cluster and phase. [Boxes encompass the midspread; lines inside boxes indicate the median, while whiskers show the range of cases that fall within 1.5-times the midspread, above or below the limits of the box.]
Cleveland, W. S. (1985) The Elements of Graphing Data.
Histograms
divide a continuous variable into intervals called ‘bins’
count the number of cases within each bin
use bars to reflect counts
intervals on the horizontal axis
counts on the vertical axis
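These steps can be sketched in R with hist(). The data are made up for illustration, and the bin width of 5 is an arbitrary assumption.

```r
set.seed(42)                                  # reproducible made-up data
x <- round(rnorm(60, mean = 25, sd = 5), 1)

# Bin edges: whole multiples of 5 that cover the data
edges <- seq(5 * floor(min(x) / 5), 5 * ceiling(max(x) / 5), by = 5)

h <- hist(x, breaks = edges, plot = FALSE)    # tabulate without drawing
counts  <- h$counts                           # number of cases in each bin
percent <- 100 * counts / sum(counts)         # same shape, different vertical scale

hist(x, breaks = edges, main = "counts")
barplot(percent, names.arg = h$mids, ylab = "percent", main = "percent")
```

Counts and percents give bars of the same relative heights; only the labelling of the vertical axis changes.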
Histogram [figure: “bins” on the horizontal axis; vertical axis scaled as counts or percent]
Histograms
useful for illustrating the shape of the distribution of a batch of numbers
may be helpful for identifying modes and modal behaviour
mode… mode? mode!
the distribution is clearly bimodal
may be multimodal…
important variables in histogram construction:
bin width
bin starting point
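A small R sketch of how both choices change the picture, using a made-up batch with two clusters. The widths and offsets below are arbitrary assumptions chosen only to show the effect.

```r
set.seed(7)
x <- c(rnorm(40, 10, 1), rnorm(40, 15, 1))    # made-up batch with two clusters

lo <- floor(min(x))
hi <- ceiling(max(x))

# Same data, three binning choices
narrow  <- hist(x, breaks = seq(lo, hi, by = 1), plot = FALSE)              # width 1
wide    <- hist(x, breaks = seq(lo, hi + 3, by = 4), plot = FALSE)          # width 4
shifted <- hist(x, breaks = seq(lo - 0.5, hi + 0.5, by = 1), plot = FALSE)  # width 1, start moved by 0.5

op <- par(mfrow = c(3, 1))
plot(narrow,  main = "bin width 1")
plot(wide,    main = "bin width 4")
plot(shifted, main = "bin width 1, starting point shifted by 0.5")
par(op)
```

Wide bins can merge separate modes into one, and shifting the starting point moves cases across bin boundaries, so it is worth trying several settings before interpreting the shape.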
boundaries of ‘bins’…
bins: 1-2; 2-3; …
George Cowgill: construct ‘bins’ of whole multiples of “minimum meaningful measurement units” (“mmmus”)
where to count a value like ‘2.0’?
Shennan: really means … ; … ; is this OK??
2.0 = a value between 1.95 and 2.05
mmmu = … ; … ; … ; …
observed value 2.0…
smoothing histograms
may want to accentuate the ‘smooth’ in a data distribution…
calculate “running averages” on bin counts
level of smoothing is arbitrary…
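A minimal R sketch of running averages on bin counts. The counts are hypothetical, running_mean() is an illustrative helper, and the window widths of 3 and 5 bins are arbitrary choices.

```r
counts <- c(2, 5, 9, 4, 1, 6, 8, 3)    # hypothetical bin counts

# Centred running mean over `width` adjacent bins;
# end bins without a full window come back as NA
running_mean <- function(v, width = 3) {
  as.numeric(stats::filter(v, rep(1 / width, width), sides = 2))
}

smooth3 <- running_mean(counts, 3)     # light smoothing
smooth5 <- running_mean(counts, 5)     # heavier smoothing

barplot(counts, main = "raw counts with 3-bin running average")
lines(seq_along(counts) * 1.2 - 0.5, smooth3, type = "b")  # barplot default bar centres
```

The wider the window, the more the rough is suppressed, which is why the level of smoothing has to be chosen and reported rather than assumed.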
histogram / barchart variations
3d
stacked
dual
frequency polygon
kernel density methods
dual barchart
Site 1 Site 2
‘mirror’ barchart
stacked barchart
3d barchart
frequency polygon
kernel density model
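controlling kernel density plots… (XX = a numeric vector of data)
hd <- density(XX)                 # kernel density estimate
hh <- hist(XX, plot=FALSE)        # histogram computed without drawing
maxD <- max(hd$y)                 # highest point of the density curve
maxH <- max(hh$density)           # tallest bar on the density scale
Y <- c(0, max(c(maxD, maxH)))     # y-axis range that fits both
hist(XX, freq=FALSE, ylim=Y)      # histogram drawn on the density scale
lines(hd)                         # overlay the density curve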
Dot Plot [R: dotchart()]
Dot Histogram [R: stripchart(), method = “stack”]
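A short R sketch of a dot histogram with stripchart(). The rim diameters are hypothetical values invented for the example.

```r
rim <- c(12, 14, 14, 15, 15, 15, 16, 16, 18, 22)   # hypothetical rim diameters (cm)

heights <- table(rim)    # how many dots will be stacked at each value

# method = "stack" piles repeated values on top of each other
stripchart(rim, method = "stack", pch = 16, offset = 0.5,
           xlab = "rim diameter (cm)")
```

Because every case is drawn as its own dot, nothing is lost to binning; the display works best for small batches with repeated values.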
line plot [figure: categories cooking/service, service, ritual]
pie chart [figure: slices of 20%, 19%, 18%, 21%, 22%]
Cumulative Percent Graph [figure: percent and cumulative percent]
Cumulative Percent Graph
some useful statistical measures (ordinal or ratio scale)
can be misleading when used with nominal data
good for comparing data sets
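A minimal R sketch of a cumulative percent graph comparing two data sets. The size classes, the counts and the site labels are all hypothetical; cum_pct() is an illustrative helper.

```r
# Hypothetical counts in ordered size classes for two data sets
classes <- c("small", "medium", "large", "very large")
site1 <- c(12, 20, 6, 2)
site2 <- c(4, 10, 16, 10)

# Running total of counts, expressed as a percent of all cases
cum_pct <- function(counts) 100 * cumsum(counts) / sum(counts)
cp1 <- cum_pct(site1)
cp2 <- cum_pct(site2)

plot(cp1, type = "b", ylim = c(0, 100), xaxt = "n",
     xlab = "class", ylab = "cumulative percent")
lines(cp2, type = "b", lty = 2)
axis(1, at = seq_along(classes), labels = classes)
legend("bottomright", legend = c("Site 1", "Site 2"), lty = 1:2)
```

The curves depend on the order of the classes, which is why the graph suits ordinal or ratio scales: with nominal categories the order is arbitrary and the shape of the curve carries no meaning.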