Published by Myron Hoover. Modified over 9 years ago.
1
Visual Displays of Data and Basic Descriptive Statistics http://www.halcyonmaps.com
2
Where to get information on R:
R: http://www.r-project.org/ (just need the base)
RStudio: http://rstudio.org/ (a great IDE for R; works on all platforms; sometimes slows down performance…)
CRAN: http://cran.r-project.org/ (library repository for R; click on Search on the left of the website to search for packages and info on packages)
3
Finding our way around R/RStudio Script Window Command Line
4
Basic Input and Output
Handy Commands:
x <- 4 (numeric input)
x <- "text goes in quotes" (text (character) input)
Variables store information; <- is the assignment operator.
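As a minimal sketch of assignment in use (the variable name `label` is illustrative and not from the slides; note that R needs straight quotes, not typographic ones):

```r
# Numeric input: store the number 4 in the variable x.
x <- 4

# Text (character) input: text goes in straight quotes.
label <- "text goes in quotes"

# Printing a variable shows what it stores.
print(x)
print(label)

# class() reports what kind of information a variable holds.
print(class(x))
print(class(label))
```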
5
Handy Commands: Get help on an R command
If you know the name: type ? followed by the command name, e.g. ?plot brings up the HTML help page for the plot command.
If you don't know the name: use Google (my favorite) or type ?? followed by a key word.
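The help commands can be sketched as follows. The function forms `help()` and `help.search()` are the long-hand equivalents of `?` and `??`; the `interactive()` guard and the use of `apropos()` are additions for illustration, not something the slides mention.

```r
# Help pages are meant for an interactive session, so guard those calls.
if (interactive()) {
  help("plot")              # same as typing ?plot
  help.search("histogram")  # same as typing ??histogram
}

# apropos() lists objects on the search path whose names contain the text.
matches <- apropos("hist")
print(matches)
```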
6
First Thing: Look at your Data!
Histograms: "bin" a variable and plot frequencies, either as counts or as relative frequencies.
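A minimal sketch of both kinds of histogram, using made-up data (the sample `x` and the choice of about 10 bins are assumptions for illustration only):

```r
set.seed(1)                          # make the made-up data reproducible
x <- rnorm(200, mean = 10, sd = 2)   # hypothetical measurements

# Histogram of counts: number of observations falling in each bin.
h <- hist(x, breaks = 10, main = "Counts", xlab = "x")

# Relative frequencies: each bin count divided by the total count.
rel_freq <- h$counts / sum(h$counts)
barplot(rel_freq, names.arg = round(h$mids, 1),
        main = "Relative frequencies", xlab = "bin midpoint")
```

Note that `hist(x, freq = FALSE)` plots densities (bar areas sum to 1), which equal relative frequencies only when the bin width is 1; that is why the relative frequencies are computed explicitly here.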
7
Histograms
8
First Thing: Look at your Data!
Box and Whiskers Plots. Figure labels: 25th percentile (1st quartile); median (50th percentile); 75th percentile (3rd quartile); range; possible outliers.
9
Note the relationship: Box-and-Whiskers
10
With Outliers: Without Outliers: Box-and-Whiskers
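A box-and-whiskers plot with and without the outlying points can be sketched like this; the data vector is hypothetical, chosen so that two values sit far from the rest:

```r
# Hypothetical sample with two extreme values at the end.
x <- c(4.1, 4.8, 5.0, 5.2, 5.3, 5.5, 5.9, 6.1, 6.4, 9.8, 12.0)

boxplot(x, main = "With outliers")                      # outliers drawn as points
boxplot(x, outline = FALSE, main = "Without outliers")  # outliers not drawn

# The numbers behind the plot.
stats <- boxplot.stats(x)
print(stats$stats)  # lower whisker, lower hinge, median, upper hinge, upper whisker
print(stats$out)    # points flagged as possible outliers
```

The hinges reported by `boxplot.stats()` are close to, but not always identical to, the 25th and 75th percentiles returned by `quantile()`.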
12
Stem-and-Leaf Displays
Consider a numerical data set $x_1, x_2, x_3, \ldots, x_n$ in which each $x_i$ consists of at least two digits. An informative visual representation of such data is a stem-and-leaf display.
13
Stem-and-Leaf Displays
Figure labels: stems; leaves for each stem.
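Base R can print a stem-and-leaf display directly with `stem()`. A small sketch, with hypothetical two-digit scores standing in for real data:

```r
# Hypothetical two-digit data.
scores <- c(52, 57, 61, 63, 64, 68, 70, 71, 71, 75, 78, 82, 85, 89, 93)

# Each printed row is a stem followed by the leaves that belong to it.
stem(scores)

# The scale argument stretches or compresses the display (more or fewer stems).
stem(scores, scale = 2)
```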
14
Dotplots
Each observation is represented by a dot above the corresponding location on a horizontal measurement scale.
– When a value occurs more than once, there is a dot for each occurrence.
– Dots are stacked vertically.
A dotplot is useful when:
– there is not a large set of data
– there are relatively few distinct values.
15
Dotplots
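One way to draw such a dotplot in base R is `stripchart()` with stacked points. This is a sketch with made-up data; the slides do not say which function produced their figure.

```r
# Hypothetical small data set with few distinct values.
x <- c(2, 3, 3, 4, 4, 4, 5, 5, 6, 8)

# method = "stack" draws one dot per occurrence, stacked vertically.
stripchart(x, method = "stack", pch = 16, offset = 0.5,
           xlab = "measurement")

# table() gives the height of each stack of dots.
counts <- table(x)
print(counts)
```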
16
Measures of Location
Given a sample from some population: what is a good "summary" value that describes the sample well? We will look at:
– Average (arithmetic mean)
– Median
– Mode
For reference see (available on-line): "The Dynamic Character of Disguised Behaviour for Text-based, Mixed and Stylized Signatures", LA Mohammed, B Found, M Caligiuri and D Rogers, J Forensic Sci 56(1), S136-S141 (2011)
17
Histogram Points of Interest Velocity for the first segment of genuine signatures in (soon to be classic) Mohammed et al. study. What is a good summary number? How spread out is the data? (We will talk about this later)
18
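Measures of Location
Arithmetic sample mean (average): the sum of the data divided by the number of observations.
Intuitive formula: $\bar{x} = \dfrac{x_1 + x_2 + \cdots + x_n}{n}$
Fancy formula: $\bar{x} = \dfrac{1}{n}\sum_{i=1}^{n} x_i$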
19
Measures of Location
Example from LAM study: Compute the average absolute size of segment 1 for the genuine signature of subject 2:
Subj. 2; Gen; Seg. 1    Absolute Size (cm)
1                       0.0548
2                       0.2951
3                       0.1026
4                       0.1005
5                       0.2491
6                       0.1287
7                       0.0496
8                       0.2299
9                       0.256
10                      0.0538
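The average for this example can be computed in R as sketched here, using the ten absolute-size values from this slide (the variable names are illustrative):

```r
# Absolute size (cm) of segment 1, subject 2, genuine signatures (from the slide).
abs_size <- c(0.0548, 0.2951, 0.1026, 0.1005, 0.2491,
              0.1287, 0.0496, 0.2299, 0.2560, 0.0538)

# Sum of the data divided by the number of observations.
mean_by_hand <- sum(abs_size) / length(abs_size)

# The built-in function gives the same value.
mean_builtin <- mean(abs_size)

print(mean_by_hand)   # 0.15201 cm
print(mean_builtin)
```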
20
Example: More useful: Consider again Absolute Average Velocity for Genuine Signatures across all writers in the LAM study: 92 subjects × 10 measurements/subject = 920 velocity measurements Average Absolute Average Velocity: Measures of Location
21
Measures of Location
Follow up question: Is there a difference in the Abs. Avg. Veloc. for Genuine signatures vs. Disguised signatures (DWM and DNM)?
Figure panels: Genuine; DWM; DNM.
We will learn how to answer this, but not yet.
22
Measures of Location
Sample median: Ordering the n pieces of data from smallest value to largest value, the median is the "middle value":
If n is odd, the median is the $\frac{n+1}{2}$-th largest data point.
If n is even, the median is the average of the $\frac{n}{2}$-th and $\left(\frac{n}{2}+1\right)$-th largest data points.
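The odd and even cases can be sketched as a small function and checked against R's built-in `median()`. The function name `median_by_hand` and the two tiny samples are hypothetical, for illustration only.

```r
# Median from the definition: sort, then take the middle value(s).
median_by_hand <- function(x) {
  s <- sort(x)
  n <- length(s)
  if (n %% 2 == 1) {
    s[(n + 1) / 2]                  # n odd: the single middle value
  } else {
    (s[n / 2] + s[n / 2 + 1]) / 2   # n even: average of the two middle values
  }
}

odd_sample  <- c(7, 1, 5)       # hypothetical, n = 3
even_sample <- c(7, 1, 5, 3)    # hypothetical, n = 4

print(median_by_hand(odd_sample))    # 5
print(median_by_hand(even_sample))   # 4
```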
23
Example: Median of Average Absolute Velocity for Genuine Signatures, LAM: Avg Measures of Location
24
Measures of Location
Sample mode: Needs careful definition, but basically: the data value that occurs the most.
Figure labels: Avg; Med; mode = 9.2541.
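Base R has no built-in function for the statistical mode (`mode()` reports an object's storage type instead), so a sketch usually goes through `table()`. The data here are hypothetical. For continuous measurements almost every value occurs only once, so in practice the data are typically binned or rounded first, which is part of the "careful definition".

```r
x <- c(3, 5, 5, 7, 7, 7, 9)   # hypothetical data with repeated values

# Count how often each distinct value occurs.
counts <- table(x)
print(counts)

# Keep every value whose count equals the maximum (there may be ties).
modes <- as.numeric(names(counts)[counts == max(counts)])
print(modes)
```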
25
Measures of Location
Some trivia: for a nice and symmetric distribution, Mean = Median = Mode.
Figure labels: Mean; Modes.
27
Toss out the largest 5% and smallest 5% of the data
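Assuming the remaining data are then averaged (a trimmed mean; the slide's title is not preserved here, so that reading is an assumption), R's `mean()` does this through its `trim` argument. The sample is hypothetical:

```r
# Hypothetical sample of 20 values with one wild observation.
x <- c(1:19, 1000)

plain_mean <- mean(x)   # pulled upward by the value 1000

# trim = 0.05 drops 5% of the observations from each end before averaging.
# With n = 20 that is one value from the bottom and one from the top.
trimmed_mean <- mean(x, trim = 0.05)

print(plain_mean)
print(trimmed_mean)
```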
28
Measures of Data Spread
Sample variance: (almost) the average of squared deviations from the sample mean:
$s^2 = \dfrac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$, where $x_i$ is data point i, $\bar{x}$ is the sample mean, and there are n data points.
Standard deviation is $s = \sqrt{s^2}$.
The sample average and standard dev. are the most common measures of central tendency and spread.
Sample average and standard dev. have the same units.
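A short sketch of the variance and standard deviation computed from the definition and checked against `var()` and `sd()`; the data are made up:

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)   # hypothetical data
n <- length(x)
x_bar <- mean(x)

# "Almost" the average of squared deviations: divide by n - 1, not n.
var_by_hand <- sum((x - x_bar)^2) / (n - 1)
sd_by_hand  <- sqrt(var_by_hand)

print(c(var_by_hand, var(x)))   # the two variances agree
print(c(sd_by_hand, sd(x)))     # the two standard deviations agree
```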
29
Measures of Data Spread If you have “enough” data, you can fit a smooth probability density function to the histogram
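One way to get such a smooth curve is a kernel density estimate via `density()`. This is a sketch under that assumption; the slide does not say which fitting method it used, and the data are simulated.

```r
set.seed(2)
x <- rnorm(500, mean = 10, sd = 2)   # hypothetical "enough" data

# Plot the histogram on the density scale so the curve is comparable.
hist(x, freq = FALSE, breaks = 20, main = "Histogram with smooth density")

# Kernel density estimate, drawn on top of the histogram.
d <- density(x)
lines(d, lwd = 2)
```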
30
Measures of Data Spread
Trivia: The famous (standardized) "Bell Curve", also called "normal" and "Gaussian".
Mean = 0, Std Dev = 1; units are in Std Devs.
~68% within ± 1s, ~95% within ± 2s, ~99% within ± 3s.
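These percentages can be checked with the standard normal cumulative distribution function `pnorm()`. The helper name `area_within` is made up for this sketch.

```r
# Area under the standard normal curve between -k and +k standard deviations.
area_within <- function(k) pnorm(k) - pnorm(-k)

coverage <- sapply(1:3, area_within)
names(coverage) <- c("1s", "2s", "3s")

# Roughly 0.68, 0.95 and 0.997.
print(round(coverage, 4))
```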
31
Measures of Data Spread
32
Measures of Data Spread
Sample range: the difference between the largest and smallest value in the sample. Very sensitive to outliers (extreme observations).
Percentiles: the p-th percentile data value, x, means that p percent of the data are less than or equal to x.
Median = 50th percentile
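A sketch of both quantities in R, with hypothetical data that include one extreme value. Note that `range()` returns the smallest and largest values rather than their difference, and that `quantile()` interpolates between data points (several conventions exist, selected with its `type` argument).

```r
x <- c(12, 15, 11, 19, 14, 13, 18, 40)   # hypothetical; 40 is extreme

# Sample range: largest minus smallest value.
sample_range <- diff(range(x))
print(sample_range)   # 29, driven almost entirely by the value 40

# 25th, 50th and 75th percentiles.
pcts <- quantile(x, probs = c(0.25, 0.50, 0.75))
print(pcts)

# The 50th percentile is the median.
print(median(x))
```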
33
Measures of Data Spread
Figure labels: 1st percentile = 1.52003; 99th percentile = 1.52008.
35
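Measures of Data Spread
Sample relative standard deviation: the ratio of the standard dev. to the average, $s / \bar{x}$. Also called the coefficient of variation.
Data quality (outliers): rule of thumb, $x_i$ is flagged if
$x_i > \text{75th percentile} + k \times (\text{75th percentile} - \text{25th percentile})$ or
$x_i < \text{25th percentile} - k \times (\text{75th percentile} - \text{25th percentile})$.
By the usual convention, $x_i$ is an outlier for $k = 1.5$ and an extreme outlier for $k = 3$.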
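Both ideas can be sketched in a few lines of R. The data are hypothetical, and the multipliers 1.5 and 3 are the conventional Tukey choices, assumed here because the slide's own values are not legible in this transcript.

```r
x <- c(4.1, 4.8, 5.0, 5.2, 5.3, 5.5, 5.9, 6.1, 6.4, 8.5, 12.0)  # hypothetical

# Relative standard deviation (coefficient of variation).
cv <- sd(x) / mean(x)
print(cv)

# Quartiles and the distance between them.
q   <- quantile(x, probs = c(0.25, 0.75))
iqr <- q[[2]] - q[[1]]

# Flag points beyond k times that distance outside the quartiles.
flag <- function(k) x > q[[2]] + k * iqr | x < q[[1]] - k * iqr

outliers <- x[flag(1.5)]   # assumed multiplier for "outlier"
extreme  <- x[flag(3)]     # assumed multiplier for "extreme outlier"
print(outliers)
print(extreme)
```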