Visual Displays of Data and Basic Descriptive Statistics

Visual Displays of Data and Basic Descriptive Statistics http://www.halcyonmaps.com

Where to get information on R:
R: http://www.r-project.org/ (just need the base).
RStudio: http://rstudio.org/ (a great IDE for R; works on all platforms; sometimes slows down performance…).
CRAN: http://cran.r-project.org/ (library repository for R; click on Search on the left of the website to search for packages and info on packages).

Finding our way around R/RStudio: the script window and the command line.

Handy Commands: Basic Input and Output. Variables store information. Numeric input: x <- 4. Text (character) input: x <- "text goes in quotes". The arrow <- is the assignment operator.
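A minimal sketch of the two kinds of input described above, typed at the command line. The variable names x and y are only illustrative.

```r
# Numeric input: store the number 4 in the variable x
x <- 4
class(x)   # "numeric"

# Text (character) input: use plain straight quotes, not curly "smart" quotes
y <- "text goes in quotes"
class(y)   # "character"

# Typing a variable name at the command line prints its value
x
y
```

Note that quotes pasted from a word processor or a slide are often curly, and R will reject them.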

Handy Commands: Get help on an R command. If you know the name: ?command_name (for example, ?plot brings up the HTML help page for the plot command). If you don’t know the name: use Google (my favorite) or ??keyword.

Histograms: Histograms “bin” a variable and plot the frequencies, either as counts or as relative frequencies. First thing: look at your data!
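As a rough sketch of what binning means, the following uses made-up data (the values and the choice of about 10 bins are assumptions, not from the slides) and asks hist() for the bins without drawing anything.

```r
set.seed(1)                         # make the made-up data reproducible
x <- rnorm(200, mean = 10, sd = 2)  # hypothetical measurements

# Bin the variable; plot = FALSE returns the bins without drawing
h <- hist(x, breaks = 10, plot = FALSE)

h$breaks                              # bin edges
h$counts                              # counts per bin
rel_freq <- h$counts / sum(h$counts)  # relative frequencies per bin
rel_freq
```

Calling hist(x) on its own draws the counts. hist(x, freq = FALSE) draws densities (bars with total area 1), which equal the relative frequencies only when each bin has width 1.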

Histograms

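Box and Whiskers Plots: The box runs from the 1st quartile (25th percentile) to the 3rd quartile (75th percentile), with the median (50th percentile) marked inside. The whiskers show the range. Points beyond the whiskers are possible outliers. First thing: look at your data!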
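The numbers behind a box-and-whiskers plot can be inspected directly. This is a sketch on made-up data (the vector x is an assumption) using boxplot.stats(), which computes what boxplot() would draw.

```r
x <- c(2, 3, 4, 5, 6, 7, 8, 9, 10, 50)  # made-up data with one extreme value

quantile(x, c(0.25, 0.50, 0.75))  # 25th, 50th (median) and 75th percentiles

bs <- boxplot.stats(x)  # the numbers behind boxplot(x); draws nothing
bs$stats                # lower whisker, lower hinge, median, upper hinge, upper whisker
bs$out                  # points beyond the whiskers: possible outliers
```

boxplot(x) draws the plot itself. The hinges used for the box are Tukey's version of the quartiles, so they can differ slightly from what quantile() reports.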

Note the relationship: Box-and-Whiskers

Box-and-Whiskers: with outliers versus without outliers.

Stem-and-Leaf Displays: Consider a numerical data set $x_1, x_2, x_3, \ldots, x_n$ in which each $x_i$ consists of at least two digits. An informative visual representation of such data is a stem-and-leaf display.

Stem-and-Leaf Displays: the stems, and the leaves for each stem.
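Base R can print a stem-and-leaf display with stem(). The two-digit values below are made up for illustration.

```r
x <- c(64, 67, 70, 72, 72, 75, 78, 81, 83, 85, 88, 90, 94)  # made-up two-digit data

# Prints the stems on the left of the bar and the leaves for each stem on the right
stem(x)
```

The scale argument of stem() stretches or compresses the display if the default grouping of stems is too coarse or too fine.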

Dotplots: Each observation is represented by a dot above the corresponding location on a horizontal measurement scale. When a value occurs more than once, there is a dot for each occurrence, and the dots are stacked vertically. A dotplot is useful when the data set is not large and there are relatively few distinct values.
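A small sketch of the idea on made-up data: the height of each dot stack is just the number of times a value occurs, so table() gives everything a dotplot shows.

```r
x <- c(1, 2, 2, 3, 3, 3, 4, 4, 5, 7)  # small made-up sample, few distinct values

counts <- table(x)  # occurrences of each distinct value = height of each dot stack
counts

# Text version of the dotplot, turned on its side: one "*" per occurrence
dots <- strrep("*", as.vector(counts))
cat(paste(names(counts), dots), sep = "\n")
```

For a graphical version, stripchart(x, method = "stack", pch = 16) draws stacked dots along a horizontal scale.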

Dotplots

Measures of Location: Given a sample from some population, what is a good “summary” value which well describes the sample? We will look at the average (arithmetic mean), the median, and the mode. For reference see (available on-line): L. A. Mohammed, B. Found, M. Caligiuri and D. Rogers, “The Dynamic Character of Disguised Behaviour for Text-based, Mixed and Stylized Signatures,” J Forensic Sci 56(1), S136-S141 (2011).

Histogram Points of Interest: Velocity for the first segment of genuine signatures in the (soon to be classic) Mohammed et al. study. What is a good summary number? How spread out is the data? (We will talk about this later.)

Measures of Location: Arithmetic sample mean (average): the sum of the data divided by the number of observations. Intuitive formula: $\bar{x} = (x_1 + x_2 + \cdots + x_n)/n$. Fancy formula: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$.

Measures of Location: Example from the LAM study: compute the average absolute size of segment 1 for the genuine signature of subject 2.

Subj. 2; Gen; Seg. 1    Absolute Size (cm)
1                       0.0548
2                       0.2951
3                       0.1026
4                       0.1005
5                       0.2491
6                       0.1287
7                       0.0496
8                       0.2299
9                       0.256
10                      0.0538
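The average for this example can be worked out in R as follows. The variable name size is only illustrative; the ten values are the ones listed on the slide.

```r
# Absolute size (cm): subject 2, genuine signature, segment 1 (values from the slide)
size <- c(0.0548, 0.2951, 0.1026, 0.1005, 0.2491,
          0.1287, 0.0496, 0.2299, 0.2560, 0.0538)

n <- length(size)  # 10 observations
sum(size) / n      # add everything up and divide by n
mean(size)         # built-in; same value, 0.15201 cm
```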

Measures of Location: Example (more useful): Consider again the absolute average velocity for genuine signatures across all writers in the LAM study: 92 subjects × 10 measurements/subject = 920 velocity measurements. Average absolute average velocity: [value shown in slide figure]

Measures of Location: Follow-up question: Is there a difference in the absolute average velocity for genuine signatures vs. disguised signatures (DWM and DNM)? We will learn how to answer this, but not yet. [Figure panels: Genuine, DWM, DNM]

Measures of Location: Sample median: ordering the $n$ pieces of data from smallest value to largest value, the median is the “middle value”. If $n$ is odd, the median is the $\frac{n+1}{2}$-th data point in the ordered list. If $n$ is even, the median is the average of the $\frac{n}{2}$-th and $\left(\frac{n}{2}+1\right)$-th data points in the ordered list.
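The odd/even rule can be sketched directly in R. The function name sample_median is hypothetical and exists only to spell out the definition; in practice the built-in median() does the same job.

```r
# Hypothetical helper that follows the definition step by step
sample_median <- function(x) {
  s <- sort(x)       # order the data from smallest to largest
  n <- length(s)
  if (n %% 2 == 1) {
    s[(n + 1) / 2]                 # n odd: the single middle value
  } else {
    (s[n / 2] + s[n / 2 + 1]) / 2  # n even: average of the two middle values
  }
}

sample_median(c(5, 1, 3))      # 3
sample_median(c(5, 1, 3, 10))  # 4
median(c(5, 1, 3, 10))         # built-in gives the same answer
```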

Measures of Location: Example: median of the average absolute velocity for genuine signatures, LAM study. [Figure: histogram with Avg marked]

Measures of Location: Sample mode: needs careful definition, but basically it is the data value that occurs the most. In this example, mode = 9.2541. [Figure labels: Avg, Med]
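R has no built-in function for the statistical mode (the function called mode() reports how an object is stored). One simple sketch, on made-up data, is to tabulate the values and keep the most frequent one(s).

```r
x <- c(3, 5, 5, 7, 5, 9, 3)  # made-up data

tab <- table(x)                                   # how often each value occurs
modes <- as.numeric(names(tab)[tab == max(tab)])  # value(s) with the highest count
modes                                             # 5
```

This returns every value tied for the highest count. For continuous measurements exact repeats are rare, which is one reason the mode needs a careful definition; binning the data first is a common workaround.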

Measures of Location: Some trivia: when the data are nice and symmetric, mean = median = mode. [Figure labels: Mean, Modes]

Trimmed mean: toss out the largest 5% and smallest 5% of the data, then average what remains.
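Tossing out the extremes before averaging is usually called a trimmed mean, and R's mean() supports it through its trim argument. The data below are made up to show the effect of one wild value.

```r
x <- c(1:19, 1000)  # made-up data: 20 values, one of them wildly large

mean(x)               # ordinary average, pulled up by the 1000: 59.5
mean(x, trim = 0.05)  # drop the smallest 5% and largest 5% first: 10.5
```

With 20 values, 5% is one observation from each end, so the trimmed mean here is the average of 2 through 19.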

Measures of Data Spread: Sample variance: (almost) the average of squared deviations from the sample mean: $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$, where $x_i$ is data point $i$, $\bar{x}$ is the sample mean, and there are $n$ data points. The standard deviation is $s = \sqrt{s^2}$. The sample average and standard deviation are the most common measures of central tendency and spread. The sample average and standard deviation have the same units.
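A short sketch on made-up data, computing the variance and standard deviation by hand and comparing with R's built-in var() and sd(). The "almost" in the definition is the division by n - 1 rather than n.

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)  # made-up data

n    <- length(x)
xbar <- mean(x)                        # sample mean: 5
s2   <- sum((x - xbar)^2) / (n - 1)    # squared deviations, divided by n - 1
s    <- sqrt(s2)                       # standard deviation, same units as x

all.equal(s2, var(x))  # TRUE: matches the built-in variance
all.equal(s, sd(x))    # TRUE: matches the built-in standard deviation
```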

Measures of Data Spread: If you have “enough” data, you can fit a smooth probability density function to the histogram.

Measures of Data Spread: Trivia: the famous (standardized) “Bell Curve”, also called “normal” and “Gaussian”: mean = 0, std dev = 1, and the units are in std devs. About 68% of the data fall within ±1s of the mean, about 95% within ±2s, and about 99% (more precisely 99.7%) within ±3s.
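These percentages can be checked against the standard normal distribution with pnorm(), which gives the area under the bell curve to the left of a value. The helper name within is hypothetical.

```r
# Area under the standard normal curve between -k and +k standard deviations
within <- function(k) pnorm(k) - pnorm(-k)

round(within(1:3), 4)  # 0.6827 0.9545 0.9973
```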

Measures of Data Spread

Measures of Data Spread: Sample range: the difference between the largest and smallest value in the sample. Very sensitive to outliers (extreme observations). Percentiles: the p-th percentile data value, x, means that p percent of the data are less than or equal to x. Median = 50th percentile.
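A brief sketch on made-up data showing how a single extreme observation inflates the range, and how percentiles are obtained with quantile().

```r
x <- c(12, 15, 11, 19, 14, 13, 18, 95)  # made-up data; 95 is an extreme observation

range(x)                 # smallest and largest value: 11 95
diff(range(x))           # sample range: 84
diff(range(x[x != 95]))  # without the extreme value the range is only 8

quantile(x, c(0.01, 0.50, 0.99))  # 1st, 50th and 99th percentiles
median(x)                         # same as the 50th percentile
```

By default quantile() interpolates between neighbouring data values, so a reported percentile need not be one of the observations.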

Measures of Data Spread: [Figure: 1st percentile = 1.52003, 99th percentile = 1.52008]

Measures of Data Spread: Sample relative standard deviation: the ratio of the standard deviation to the average, $s/\bar{x}$. Also called the coefficient of variation. Data quality (outliers): rule of thumb, $x_i$ is flagged if $x_i > Q_{75} + k \times (Q_{75} - Q_{25})$ or $x_i < Q_{25} - k \times (Q_{75} - Q_{25})$, where $Q_{25}$ and $Q_{75}$ are the 25th and 75th percentiles. Conventionally, $x_i$ is an outlier for $k = 1.5$ and an extreme outlier for $k = 3$.
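The rule of thumb can be sketched in R as below. The data are made up, the helper name flag_outliers is hypothetical, and the multipliers 1.5 and 3 are the conventional choices, assumed here because the values on the slide are not legible.

```r
x <- c(2, 3, 4, 5, 6, 7, 8, 9, 18, 50)  # made-up data

sd(x) / mean(x)  # relative standard deviation (coefficient of variation)

# Hypothetical helper: values beyond the percentile fences for a given multiplier k
flag_outliers <- function(x, k) {
  q   <- quantile(x, c(0.25, 0.75), names = FALSE)  # 25th and 75th percentiles
  iqr <- q[2] - q[1]                                # interquartile range
  x[x > q[2] + k * iqr | x < q[1] - k * iqr]
}

flag_outliers(x, k = 1.5)  # outliers (assumed k = 1.5): 18 50
flag_outliers(x, k = 3)    # extreme outliers (assumed k = 3): 50
```

Here the 25th and 75th percentiles are 4.25 and 8.75, so the upper fences sit at 15.5 and 22.25.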
