Hands-on Introduction to R
Why Learn Programming?
We live in oceans of data. Computers are essential for recording it and helping to analyse it.
Competent scientists speak C/C++, Java, MATLAB, Python, Perl, R and/or Mathematica.
Data collection and analysis have been very important in forensic science since the 2009 NAS report.
Using the above languages, code can easily be made available for review/discovery.
Getting a computer to do anything useful
All machines understand is on/off: high/low voltage, high/low current, high/low charge, 1/0 binary digits (bits).
To make a computer do anything, you have to speak machine language to it.
[Figure: machine-language encoding of “Add 1 and 2. Store the result.” Source: Wikipedia]
Machine language is not intuitive and can vary a great deal between processor designs.
The basic operations, however, are the same, e.g.: move data here, combine these values, store this data, etc.
The “human readable” language for basic machine operations is assembly language.
Assembly is still cumbersome for (most) humans.
Assembly for a machine encoding: MOV AL, 61h
Meaning: move the number 97 (61 in hexadecimal) over to “storage area” AL.
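The operand 61h is hexadecimal notation for the decimal number 97. A minimal sketch in R (the language this tutorial introduces) that checks the conversion and shows the same number as bits; the variable names are illustrative only:

```r
# Convert the hexadecimal operand "61" to a decimal integer
operand <- strtoi("61", base = 16L)
print(operand)   # 97

# Show the same number as 8 binary digits, most significant bit first
# (intToBits returns the least significant bit first, so reverse it)
bits <- rev(as.integer(intToBits(operand))[1:8])
print(paste(bits, collapse = ""))   # "01100001"
```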
Better yet is a more “Englishy”, “high-level” language. Enter: C, C++, Fortran, Java, …
Higher-level languages like these are translated (“compiled”) to machine language.
This is not exactly true for Java (it is compiled to bytecode that a virtual machine runs), but it’s something analogous…
Even more “Englishy” and “high-level” are interpreted languages. Enter: R, MATLAB, Perl, Python, Mathematica, Maple, …
The “code” of these languages is “interpreted” as commands by a program that is already running.
They make many assumptions behind the scenes, which makes them much easier to program with but often much slower than compiled languages.
Why R?
R is not a black box! Code is available for review; totally transparent!
R is maintained by a professional group of statisticians and computational scientists.
Procedures from very simple to state-of-the-art are available.
Very good graphics for exhibits and papers.
R is extensible (it is a full scripting language).
Coding/syntax is similar to Python and MATLAB.
Easy to link to C/C++ routines.
Where to get information on R:
R: just need the base installation.
RStudio: a great IDE for R. Works on all platforms. Sometimes slows down performance…
CRAN: library repository for R. Click on Search on the left of the website to search for packages and information on packages.
Finding our way around R/RStudio
[Figure: RStudio screenshot showing the Script Window and the Command Line]
Handy Commands: Basic Input and Output
Variables store information. The assignment operator is <-
x <- 4 (numeric input)
x <- "text goes in quotes" (text, i.e. character, input)
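A short sketch of assignment in practice. The variable names x and y are arbitrary choices for illustration:

```r
# Assign a number and a character string to variables
x <- 4
y <- "text goes in quotes"

# class() reports what kind of information a variable stores
print(class(x))   # "numeric"
print(class(y))   # "character"

# A variable can be reused and overwritten
x <- x + 1
print(x)          # 5
```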
Get help on an R command:
If you know the name: ?command_name (e.g. ?plot brings up the HTML help page for the plot command)
If you don’t know the name: use Google (my favorite) or ??keyword
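Besides ? and ??, a couple of base R helpers can be handy when hunting for a command from a script. A minimal sketch; the search pattern "^read\\." is just an example:

```r
# List functions on the search path whose names start with "read."
hits <- apropos("^read\\.")
print(hits)

# Show the arguments a function accepts without opening its help page
print(args(read.csv))

# In an interactive session the help pages themselves open with:
# ?read.csv                   # help for a known function name
# help.search("histogram")    # the function form of ??histogram
```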
R is driven by functions:
func(argument1, argument2)
x <- func(arg1, arg2)
func is the function name; the input to the function goes in parentheses; the function returns something, which gets dumped into x.
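The pattern above, sketched with a few built-in functions (sqrt, round and sum are standard base R; the variable names are arbitrary):

```r
# Call a function with one argument and store what it returns
x <- sqrt(16)
print(x)   # 4

# Arguments can be matched by position or by name
y <- round(3.14159, digits = 2)
print(y)   # 3.14

# The output of one function can be the input of another
z <- sum(c(1, 2, 3))
print(z)   # 6
```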
Input from Excel:
Save the spreadsheet as a CSV file.
Use the read.csv function. It needs the path to the file.
Mac e.g.: "/Users/npetraco/latex/papers/data.csv"
Windows e.g.: "C:/Users/npetraco/latex/papers/data.csv" (or "C:\\Users\\npetraco\\latex\\papers\\data.csv"; a single backslash must be escaped inside an R string)
*Exercise: basicIO.R
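A self-contained sketch of the CSV round trip. It writes a tiny made-up file to a temporary location (a stand-in for a spreadsheet export; the column names and values are hypothetical) and reads it back with read.csv:

```r
# Write a small example CSV to a temporary file (hypothetical values)
csv.path <- tempfile(fileext = ".csv")
writeLines(c("fragment,RI",
             "1,1.5180",
             "2,1.5175",
             "3,1.5183"), csv.path)

# Read it back; read.csv returns a data frame
dat <- read.csv(csv.path, header = TRUE)
print(dat)
print(dim(dat))   # number of rows, number of columns
```

For a real file, replace csv.path with the path to the exported CSV.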
Matrices: for a matrix X,
X[,1] returns column 1 of matrix X
X[3,] returns row 3 of matrix X
Handy functions for data frames and matrices: dim, nrow, ncol, rbind, cbind
User-defined function syntax:
func.name <- function(arguments) {
  do something
  return(output)
}
To use it: func.name(values)
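A runnable sketch of the indexing rules and the function syntax above. The matrix contents and the function name col.range are made up for illustration:

```r
# Build a 3 x 2 matrix, filled column by column
X <- matrix(1:6, nrow = 3, ncol = 2)
print(X[, 1])   # column 1: 1 2 3
print(X[3, ])   # row 3: 3 6
print(dim(X))   # 3 2

# Add a row with rbind and a column with cbind
X.more.rows <- rbind(X, c(7, 8))
X.more.cols <- cbind(X, c(9, 10, 11))

# User-defined function: range (max minus min) of each column
col.range <- function(M) {
  output <- apply(M, 2, max) - apply(M, 2, min)
  return(output)
}
print(col.range(X))   # 2 2
```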
First Thing: Look at your Data
Explore the Glass dataset of the mlbench package.
Source (load) all_data_source.R
*visualize_with_plots.r
Scatter plots: plot any two variables against each other.
Pairs plots: do many scatter plots at once.
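A sketch of scatter and pairs plots with base R graphics. To stay self-contained it uses a simulated stand-in data frame (glass.sim, with arbitrary means and spreads), not the real Glass data; with mlbench installed the real data can be loaded with library(mlbench); data(Glass).

```r
# Simulated stand-in for a few Glass columns (NOT the real data)
set.seed(1)
glass.sim <- data.frame(
  RI = rnorm(60, mean = 1.518, sd = 0.003),
  Na = rnorm(60, mean = 13.4, sd = 0.8),
  Ca = rnorm(60, mean = 9.0, sd = 1.4)
)

# Scatter plot: one variable against another
plot(glass.sim$Na, glass.sim$RI, xlab = "Na", ylab = "RI")

# Pairs plot: every pairwise scatter plot at once
pairs(glass.sim)
```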
Histograms: “bin” a variable and plot frequencies.
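A minimal histogram sketch. The values in ri.sim are simulated for illustration (not real refractive indices), and breaks = 10 is only a suggestion to hist about the number of bins:

```r
# Simulated stand-in for a refractive index variable (NOT real data)
set.seed(1)
ri.sim <- rnorm(100, mean = 1.518, sd = 0.003)

# hist() bins the variable, draws the plot, and returns the bin details
h <- hist(ri.sim, breaks = 10, xlab = "RI",
          main = "Histogram of simulated RI")
print(h$breaks)   # bin edges
print(h$counts)   # frequency in each bin
```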
Histograms conditioned on other variables: use the lattice package, e.g. RIs conditioned on glass group membership.
Probability density plots: these also use the lattice package.
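A sketch of conditioned histograms and density plots with lattice. The data frame glass.sim and its two groups are simulated for illustration and are not the real Glass data; the formula ~ RI | Type reads "RI conditioned on Type".

```r
library(lattice)

# Simulated RI values for two made-up glass groups (NOT real data)
set.seed(1)
glass.sim <- data.frame(
  RI   = c(rnorm(40, mean = 1.518, sd = 0.002),
           rnorm(40, mean = 1.521, sd = 0.003)),
  Type = factor(rep(c("group 1", "group 2"), each = 40))
)

# One histogram panel per group
p.hist <- histogram(~ RI | Type, data = glass.sim)

# One density curve panel per group
p.dens <- densityplot(~ RI | Type, data = glass.sim, plot.points = FALSE)

# lattice plots are objects; print() draws them (needed inside scripts)
print(p.hist)
print(p.dens)
```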
Empirical probability distribution plots: also called empirical cumulative distribution function (ECDF) plots.
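A minimal sketch using base R's ecdf function on simulated values (ri.sim is made up for illustration). The returned object Fn is itself a function: Fn(v) is the fraction of observations less than or equal to v.

```r
# Simulated stand-in for a refractive index variable (NOT real data)
set.seed(1)
ri.sim <- rnorm(100, mean = 1.518, sd = 0.003)

# Build the empirical cumulative distribution function and plot it
Fn <- ecdf(ri.sim)
plot(Fn, xlab = "RI", ylab = "Fraction of observations <= RI",
     main = "ECDF of simulated RI")

# Evaluate it at the smallest and largest observations
print(Fn(min(ri.sim)))   # 1/100, since only the minimum itself qualifies
print(Fn(max(ri.sim)))   # 1, since every observation qualifies
```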
Box and Whiskers plots:
[Figure: box-and-whiskers plot of RI labeled with the 1st quartile (25th percentile), median (50th percentile), 3rd quartile (75th percentile), the range, and possible outliers]
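A sketch connecting the parts of a box-and-whiskers plot to numbers, using simulated values (ri.sim is made up for illustration). Note that boxplot draws the box at "hinges", which can differ slightly from the quartiles that quantile reports by default.

```r
# Simulated stand-in for a refractive index variable (NOT real data)
set.seed(1)
ri.sim <- rnorm(100, mean = 1.518, sd = 0.003)

# 25th, 50th and 75th percentiles (1st quartile, median, 3rd quartile)
q <- quantile(ri.sim, probs = c(0.25, 0.50, 0.75))
print(q)

# Draw the plot; the returned list holds what was drawn
b <- boxplot(ri.sim, ylab = "RI")
print(b$stats)   # lower whisker, lower hinge, median, upper hinge, upper whisker
print(b$out)     # points flagged as possible outliers (may be empty)
```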
Note the relationship: Visualizing Data
Box and Whiskers plots:
[Figure: box-and-whiskers plots for actual variable values and for scaled variable values]
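A sketch of why scaling helps when variables live on very different ranges. The data frame glass.sim is simulated (not the real Glass data), and scale is assumed here to be the scaling step: it centers each column at 0 and divides by its standard deviation.

```r
# Simulated stand-in for a few Glass columns (NOT the real data)
set.seed(1)
glass.sim <- data.frame(
  RI = rnorm(60, mean = 1.518, sd = 0.003),
  Na = rnorm(60, mean = 13.4, sd = 0.8),
  Ca = rnorm(60, mean = 9.0, sd = 1.4)
)

# Actual values: the RI box is squashed because its range is tiny
boxplot(glass.sim, main = "Actual variable values")

# Scaled values: every column now has mean 0 and standard deviation 1
glass.scaled <- as.data.frame(scale(glass.sim))
boxplot(glass.scaled, main = "Scaled variable values")
```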
Confidence Intervals
A confidence interval (CI) gives a range in which a true population parameter may be found.
Specifically, $(1-\alpha)\times 100\%$ CIs for a parameter, constructed from random samples (of a given sample size), will contain the true value of the parameter approximately $(1-\alpha)\times 100\%$ of the time.
CIs are different from tolerance and prediction intervals.
Caution: IT IS NOT CORRECT to say that there is a $(1-\alpha)\times 100\%$ probability that the true value of a parameter is between the bounds of any given CI.
Graphical representation of 90% CIs for a parameter: take a sample, compute a CI, and repeat.
[Figure: repeated CIs plotted against the true value of the parameter; here 90% of the CIs contain the true value of the parameter]
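The "take a sample, compute a CI" idea can be tried out by simulation. A minimal sketch, assuming normally distributed data with a known true mean; the true mean, standard deviation, sample size and number of repetitions are arbitrary choices for illustration:

```r
set.seed(42)
true.mean <- 10     # the "true value of the parameter" (known only in a simulation)
n         <- 15     # size of each sample
n.reps    <- 1000   # number of samples, hence number of CIs
alpha     <- 0.10   # 90% confidence intervals

# For each repetition: take a sample, compute a CI, check if it covers the truth
covers <- replicate(n.reps, {
  x  <- rnorm(n, mean = true.mean, sd = 2)
  ci <- t.test(x, conf.level = 1 - alpha)$conf.int
  ci[1] <= true.mean && true.mean <= ci[2]
})

# Fraction of CIs containing the true mean; should land near 0.90
coverage <- mean(covers)
print(coverage)
```

Each individual interval either contains the true mean or it does not; the 90% describes the long-run fraction of intervals that do.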
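Construction of a CI for a mean depends on:
Sample size $n$
Standard error for means, $s/\sqrt{n}$
Level of confidence $1-\alpha$, where $\alpha$ is the significance level; use $\alpha$ to compute the $t_c$-value
The $(1-\alpha)\times 100\%$ CI for the population mean, using a sample average $\bar{x}$ and standard error $s/\sqrt{n}$, is:
$$\left[\bar{x} - t_c\,\frac{s}{\sqrt{n}},\ \ \bar{x} + t_c\,\frac{s}{\sqrt{n}}\right]$$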
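The critical value can be looked up with base R's qt function (the quantile function of Student's t distribution) using n - 1 degrees of freedom. A minimal sketch; the sample size of 11 is a hypothetical choice, and the names t.c and t.c.levels are illustrative:

```r
# Critical t-value for a 99% CI with a hypothetical sample of n = 11
alpha <- 0.01
n     <- 11
t.c   <- qt(1 - alpha / 2, df = n - 1)
print(round(t.c, 2))   # 3.17

# Higher confidence needs a larger critical value, hence a wider interval
t.c.levels <- qt(1 - c(0.10, 0.05, 0.01) / 2, df = n - 1)
print(round(t.c.levels, 2))
```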
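Compute a 99% confidence interval for the mean using this sample set:
[Table: Fragment # and fragment $n_D$ (refractive index) values]
For 99% confidence, $\alpha/2 = 0.005$ and $t_c = 3.17$.
Putting this together: $\left[\bar{x} - (3.17)(s/\sqrt{n}),\ \bar{x} + (3.17)(s/\sqrt{n})\right]$
Substituting the sample average and standard error gives the 99% CI for the sample.
*Try out confidence_intervals.R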
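A sketch of the whole calculation in R. The nD values below are hypothetical placeholders, NOT the sample set from the slide. Eleven values are used because $t_c = 3.17$ at $\alpha/2 = 0.005$ matches 10 degrees of freedom, which suggests (but does not confirm) 11 fragments.

```r
# Hypothetical refractive index (nD) values for 11 fragments (NOT the original data)
nD <- c(1.5180, 1.5176, 1.5183, 1.5179, 1.5181, 1.5177,
        1.5182, 1.5178, 1.5184, 1.5180, 1.5179)

n     <- length(nD)
x.bar <- mean(nD)             # sample average
se    <- sd(nD) / sqrt(n)     # standard error of the mean
t.c   <- qt(1 - 0.01 / 2, df = n - 1)   # critical value for 99% confidence

# 99% CI: sample average plus/minus t.c standard errors
ci <- c(x.bar - t.c * se, x.bar + t.c * se)
print(ci)

# Cross-check against R's built-in t.test
ci.check <- as.numeric(t.test(nD, conf.level = 0.99)$conf.int)
print(ci.check)
```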