Visual Displays of Data and Basic Descriptive Statistics
Where to get information on R:
– R: Just need the base distribution
– RStudio: A great IDE for R. Works on all platforms. Sometimes slows down performance…
– CRAN: Library repository for R. Click on Search on the left of the website to search for packages and info on packages
Finding our way around R/RStudio: the Script Window and the Command Line
Handy Commands: Basic Input and Output
– Variables store information
– x <- 4 (numeric input)
– x <- "text goes in quotes" (text (character) input)
– <- is the assignment operator
Handy Commands: Get help on an R command
– If you know the name: ?command_name. For example, ?plot brings up the HTML help page for the plot command
– If you don't know the name: use Google (my favorite) or ??keyword
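A minimal sketch of the assignment commands, meant to be run line by line from the Script Window or the Command Line. The calls to print() and class() are additions for illustration only.

```r
# Numeric input: store the number 4 in the variable x
x <- 4
print(x)

# Text (character) input: overwrite x with a string (text goes in quotes)
x <- "text goes in quotes"
print(x)

# class() reports what kind of information the variable currently holds
print(class(x))
```

Assigning to x a second time replaces the old value, so after these lines x holds text rather than the number 4.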
First Thing: Look at your Data!
Histograms: "bin" a variable and plot frequencies, either as counts or as relative frequencies. (Figure: histograms of nD, with panels Counts and Relative Frequencies.)
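As a rough sketch of binning in R, the example below uses made-up, randomly generated numbers (not the data from the slides) and draws both a count histogram and a relative-frequency version. The variable names are illustrative assumptions.

```r
# Made-up data for illustration only
set.seed(1)
x <- rnorm(200, mean = 10, sd = 2)

# Bin the data without drawing, so the bins can be inspected
h <- hist(x, plot = FALSE)
rel_freq <- h$counts / sum(h$counts)  # counts converted to relative frequencies

# Draw the two versions side by side
par(mfrow = c(1, 2))
hist(x, main = "Counts", xlab = "x")
barplot(rel_freq, names.arg = h$mids, space = 0,
        main = "Relative Frequencies", xlab = "x (bin midpoints)")
```

The two plots have the same shape; only the vertical scale differs, and the relative frequencies sum to 1.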
First Thing: Look at your Data!
Box-and-Whiskers Plots (figure labels):
– 25th percentile = 1st quartile
– median = 50th percentile
– 75th percentile = 3rd quartile
– range
– possible outliers (marked at both ends)
Box-and-Whiskers: Note the relationship (figure)
Box-and-Whiskers: With Outliers and Without Outliers (figure with two panels)
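A small sketch of a box-and-whiskers plot in R, using a made-up sample with one deliberately extreme value. Here "without outliers" is taken to mean hiding the flagged points with outline = FALSE, which is only one possible reading; the data themselves are not changed.

```r
# Made-up sample; 45 is deliberately extreme
x <- c(12, 14, 15, 15, 16, 17, 18, 19, 20, 21, 45)

# Five-number summary: minimum, lower hinge, median, upper hinge, maximum
five <- fivenum(x)
print(five)

# Points that boxplot() would draw individually as possible outliers
flagged <- boxplot.stats(x)$out
print(flagged)

par(mfrow = c(1, 2))
boxplot(x, main = "With Outliers")
boxplot(x, outline = FALSE, main = "Without Outliers")  # flagged points hidden
```

The hinges used by boxplot() are close to, but not always identical to, the 25th and 75th percentiles returned by quantile().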
Stem-and-Leaf Displays
Consider a numerical data set $x_1, x_2, x_3, \dots, x_n$
– each $x_i$ consists of at least two digits.
– an informative visual representation is a stem-and-leaf display.
Stem-and-Leaf Displays (figure labels): Stems; Leaves for each stem
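Base R can print a stem-and-leaf display directly with stem(). The sketch below uses made-up two-digit values; the exact split into stems and leaves is chosen by stem() itself and can be adjusted with its scale argument.

```r
# Made-up two-digit data for illustration
x <- c(23, 25, 31, 34, 36, 38, 41, 42, 42, 47, 53, 58, 64)

# Prints the display to the console: stems to the left of "|", leaves to the right
stem(x)
```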
Dotplots
Each observation is represented by a dot above the corresponding location on a horizontal measurement scale.
– When a value occurs more than once, there is a dot for each occurrence.
– Dots are stacked vertically.
A dotplot is useful when:
– there is not a large set of data
– there are relatively few distinct values.
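One way to sketch a stacked dotplot in base R is stripchart() with method = "stack". The data below are made up, and the pch and offset settings are cosmetic choices, not requirements.

```r
# Made-up data with few distinct values
x <- c(1, 2, 2, 3, 3, 3, 4, 4, 5, 7)

# How many dots will be stacked at each value
counts <- table(x)
print(counts)

# One dot per observation; repeated values are stacked vertically
stripchart(x, method = "stack", pch = 16, offset = 0.5, xlab = "x")
```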
Measures of Location
Given a sample from some population: What is a good "summary" value which well describes the sample? We will look at:
– Average (arithmetic mean)
– Median
– Mode
For reference see (available on-line): "The Dynamic Character of Disguised Behaviour for Text-based, Mixed and Stylized Signatures", L.A. Mohammed, B. Found, M. Caligiuri and D. Rogers, J Forensic Sci 56(1), S136-S141 (2011)
Histogram: Points of Interest
Velocity for the first segment of genuine signatures in the (soon to be classic) Mohammed et al. study.
– What is a good summary number?
– How spread out is the data? (We will talk about this later.)
Arithmetic sample mean (average): The sum of the data divided by the number of observations.
– Intuitive formula: $\bar{x} = \dfrac{x_1 + x_2 + \cdots + x_n}{n}$
– Fancy formula: $\bar{x} = \dfrac{1}{n}\sum_{i=1}^{n} x_i$
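As a quick check of the definition, the sketch below computes the average by hand and with the built-in mean(). The five numbers are hypothetical and are not measurements from the LAM study.

```r
# Hypothetical measurements (not from the LAM study)
x <- c(2.1, 2.4, 1.9, 2.6, 2.0)

n <- length(x)                 # number of observations
mean_by_hand <- sum(x) / n     # sum of the data divided by n
mean_builtin <- mean(x)        # same quantity from base R

print(mean_by_hand)
print(mean_builtin)
```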
Example from the LAM study: Compute the average absolute size of segment 1 for the genuine signature of subject 2. (Table: Subj. 2; Gen; Seg. 1, with column Absolute Size (cm).)
Example (more useful): Consider again the Absolute Average Velocity for Genuine Signatures across all writers in the LAM study: 92 subjects × 10 measurements/subject = 920 velocity measurements. The Average Absolute Average Velocity is computed over all 920 measurements.
Follow-up question: Is there a difference in the Abs. Avg. Veloc. for Genuine signatures vs. Disguised signatures (DWM and DNM)? (Figure panels: Genuine, DWM, DNM.) We will learn how to answer this, but not yet.
Sample median: Ordering the $n$ pieces of data from smallest value to largest value, the median is the "middle value":
– If $n$ is odd, the median is the $\frac{n+1}{2}$-th largest data point.
– If $n$ is even, the median is the average of the $\frac{n}{2}$-th and $\left(\frac{n}{2}+1\right)$-th largest data points.
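The odd and even cases of the median can be sketched as a small function and compared against the built-in median(). The function name sample_median and both samples are made up for illustration.

```r
# Hypothetical helper that follows the "middle value" definition
sample_median <- function(x) {
  s <- sort(x)          # order from smallest to largest
  n <- length(s)
  if (n %% 2 == 1) {
    s[(n + 1) / 2]                    # odd n: the single middle value
  } else {
    (s[n / 2] + s[n / 2 + 1]) / 2     # even n: average of the two middle values
  }
}

odd_sample  <- c(7, 1, 5, 3, 9)       # made-up data, n = 5
even_sample <- c(7, 1, 5, 3, 9, 11)   # made-up data, n = 6

print(sample_median(odd_sample))      # 5
print(sample_median(even_sample))     # 6
```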
Example: Median of Average Absolute Velocity for Genuine Signatures, LAM study. (Figure label: Avg.)
Sample mode: Needs careful definition, but basically: the data value that occurs the most. (Figure labels: Avg; Mode = Med.)
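Base R has no built-in function for the statistical mode (the function named mode() reports an object's storage type instead), so a small helper is needed. The sketch below is one possible version for discrete numeric data; the name sample_modes and the data are made up, and it returns every value tied for the highest count, which is one reason the mode needs a careful definition.

```r
# Hypothetical helper: all values that occur most often
sample_modes <- function(x) {
  counts <- table(x)                                  # tally each distinct value
  as.numeric(names(counts)[counts == max(counts)])    # keep the most frequent ones
}

x <- c(3, 5, 5, 6, 7, 7, 7, 9)    # made-up data, 7 occurs three times
print(sample_modes(x))            # 7

two_peaks <- c(1, 1, 2, 3, 3)     # made-up data with a tie
print(sample_modes(two_peaks))    # 1 and 3
```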
Some trivia: For a nice and symmetric (single-peaked) distribution: Mean = Median = Mode. (Figure labels: Mean; Modes.)
Toss out the largest 5% and smallest 5% of the data before averaging (this is what a trimmed mean does).
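Assuming the intent here is a trimmed mean, base R supports it through the trim argument of mean(). The sketch below uses made-up data with one wildly large value and repeats the trimming by hand for comparison.

```r
# Made-up data: 20 values, one of them wildly large
x <- c(1:19, 1000)
trim_fraction <- 0.05   # fraction removed from EACH end

# Built-in trimmed mean
trimmed_builtin <- mean(x, trim = trim_fraction)

# The same thing by hand: sort, drop k values from each end, then average
k <- floor(length(x) * trim_fraction)
s <- sort(x)
trimmed_by_hand <- mean(s[(k + 1):(length(s) - k)])

print(mean(x))           # ordinary mean, pulled up by the extreme value
print(trimmed_builtin)
print(trimmed_by_hand)
```

Trimming makes the summary value much less sensitive to a few extreme observations.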
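Measures of Data Spread
Sample variance: (Almost) the average of squared deviations from the sample mean:
$s^2 = \dfrac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2$
where $x_i$ is data point $i$, $\bar{x}$ is the sample mean, and there are $n$ data points. The "(almost)" refers to dividing by $n-1$ rather than $n$.
Standard deviation is $s = \sqrt{s^2}$
– The sample average and standard dev. are the most common measures of central tendency and spread.
– Sample average and standard dev. have the same units.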
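A short sketch of the variance and standard deviation, computed by hand with the n - 1 divisor and checked against the built-in var() and sd(). The data are made up.

```r
# Made-up data for illustration
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

n <- length(x)
x_bar <- mean(x)                              # sample mean

var_by_hand <- sum((x - x_bar)^2) / (n - 1)   # squared deviations, divided by n - 1
sd_by_hand  <- sqrt(var_by_hand)              # same units as the data

print(var_by_hand)
print(sd_by_hand)
print(c(var(x), sd(x)))                       # built-in versions for comparison
```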
If you have "enough" data, you can fit a smooth probability density function to the histogram.
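One common way to get a smooth curve over a histogram in R is a kernel density estimate from density(). This is a stand-in technique chosen for illustration, not necessarily the fitting method used for the slides, and the data are randomly generated.

```r
# Made-up data for illustration only
set.seed(42)
x <- rnorm(500, mean = 50, sd = 5)

# Kernel density estimate: a smooth curve estimated from the data
d <- density(x)

# Histogram on the density scale (freq = FALSE) so the curve is comparable
hist(x, freq = FALSE, main = "Histogram with smooth density", xlab = "x")
lines(d, lwd = 2)

# A density should enclose an area of about 1 (simple Riemann sum)
area <- sum(d$y) * (d$x[2] - d$x[1])
print(area)
```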
Trivia: The famous (standardized) "Bell Curve"
– Also called "normal" and "Gaussian"
– Mean = 0, Std Dev = 1; units are in Std Devs
– ~68% of the data lie within ± 1s
– ~95% within ± 2s
– ~99.7% within ± 3s
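The percentages for the standardized bell curve can be checked with pnorm(), the cumulative distribution function of the normal distribution in base R. The variable names are illustrative.

```r
# Number of standard deviations on either side of the mean
k <- 1:3

# Fraction of a standard normal within +/- k standard deviations
within_k_sd <- pnorm(k) - pnorm(-k)

# As percentages: roughly 68.3, 95.4 and 99.7
print(round(100 * within_k_sd, 1))
```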
Sample range: The difference between the largest and smallest value in the sample.
– Very sensitive to outliers (extreme observations)
Percentiles: The $p$-th percentile data value, $x$, means that $p$ percent of the data are less than or equal to $x$.
– Median = 50th percentile
(Figure labels: 1st percentile; 99th percentile.)
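A brief sketch of the range and percentiles in R with made-up data. Note that range() returns the smallest and largest values rather than their difference, and that quantile() interpolates between data points by default, so its answers can differ slightly from the "less than or equal to" definition on small samples.

```r
# Made-up data for illustration
x <- c(4, 8, 15, 16, 23, 42)

# Sample range: largest value minus smallest value
sample_range <- max(x) - min(x)
print(sample_range)        # 38
print(diff(range(x)))      # same thing via range()

# 25th, 50th and 75th percentiles
print(quantile(x, probs = c(0.25, 0.50, 0.75)))

# The 50th percentile agrees with the median
p50 <- quantile(x, probs = 0.50, names = FALSE)
print(c(p50, median(x)))
```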
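Sample relative standard deviation: Ratio of the standard dev. to the average, $s / \bar{x}$
– Also called the coefficient of variation
Data quality (outliers): Rule of thumb, $x_i$ is flagged if
$x_i > Q_{75} + k\,(Q_{75} - Q_{25})$ or $x_i < Q_{25} - k\,(Q_{75} - Q_{25})$
where $Q_{25}$ and $Q_{75}$ are the 25th and 75th percentiles and $k$ is a multiplier.
– $x_i$ is an outlier for the smaller multiplier and an extreme outlier for the larger one; the customary (Tukey) choices are $k = 1.5$ and $k = 3$.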
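The sketch below computes the coefficient of variation and applies the percentile-based rule of thumb to made-up data. The multipliers 1.5 and 3 are the customary Tukey values and are an assumption here, and the helper name flag is hypothetical.

```r
# Made-up data with two suspicious values
x <- c(10, 12, 13, 13, 14, 15, 16, 17, 24, 40)

# Relative standard deviation (coefficient of variation)
cv <- sd(x) / mean(x)
print(cv)

# 25th and 75th percentiles and the distance between them
q <- quantile(x, probs = c(0.25, 0.75), names = FALSE)
iqr <- q[2] - q[1]

# Hypothetical helper: values beyond k times that distance from the quartiles
flag <- function(k) x[x > q[2] + k * iqr | x < q[1] - k * iqr]

outliers <- flag(1.5)        # assumed multiplier for "outlier"
extreme_outliers <- flag(3)  # assumed multiplier for "extreme outlier"

print(outliers)              # 24 and 40
print(extreme_outliers)      # 40
```

Every extreme outlier is also an outlier, because the wider fences sit outside the narrower ones.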