DATA MANAGEMENT MODULE: USING SQL in R
HMI 7530 – Programming in R
Jennifer Lewis Priestley, Ph.D.
Kennesaw State University
DATA MANAGEMENT MODULE
Importing and Exporting
Inputting data directly into R
Creating, Adding and Dropping Variables
Assigning objects
Subsetting and Formatting
Merging, Stacking and Recoding
Using SQL in R
Data Management Module: SQL
Definition of SQL: The original Structured Query Language was designed by an IBM research center in 1974-75 and introduced commercially by Oracle in 1979. There are different dialects of SQL, but it remains as close to a standard query language as you will get.
Some standard SQL commands are as follows:
SELECT
DELETE
INSERT
CREATE
UPDATE
DROP
SQL is used for the following tasks:
Generate reports
Generate summary statistics
Retrieve data from tables or views
Combine data from tables or views
Create tables, views, and indexes
Update the data values in SQL tables
Update and retrieve data from database management system (DBMS) tables
To get started using SQL in R, you first need to install and load the "sqldf" package. The sqldf() function takes a SQL statement as a character string and runs it against your R data frames, much like calling PROC SQL in SAS.
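A minimal sketch of the setup step, assuming an internet connection for the one-time install. The data frame `scores` and its columns are made up for illustration and are not part of the course data.

```r
# Install sqldf once if it is not already available (needs internet access).
if (!requireNamespace("sqldf", quietly = TRUE)) {
  install.packages("sqldf", repos = "https://cloud.r-project.org")
}

# Load the package in every session that uses it.
library(sqldf)

# Hypothetical data frame, used only to confirm that sqldf is working.
scores <- data.frame(id = 1:3, score = c(90, 75, 82))

# The SQL statement is passed as a character string; the result is a data frame.
result <- sqldf("select id, score from scores where score > 80")
print(result)
```

The data frame is referenced in the SQL statement by its R object name, so no separate database needs to be set up.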
SELECT: this SQL term lets you select all or a subset of the columns in a data frame.

    sqldf('select * from PS2')

The "*" symbol stands for all columns; with no WHERE clause, every row is returned as well.

    sqldf('select Tattoo, Looks from PS2')

Selecting individual columns (vectors) is simple: separate the column names with commas (no final comma).

    sqldf('select Sex, ((HtChoice-Height)/Height)*100 as PCTDIFF from PS2')

You can create new variables at the same time that you select them.
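The three SELECT statements can be tried on a small stand-in for PS2. This is only a sketch: the column names follow the slides, but every value below is invented, and the real PS2 data set may look different.

```r
library(sqldf)

# Hypothetical stand-in for PS2 (invented values; column names from the slides).
# Heights are stored as doubles; with integer columns SQLite would use
# integer division in the PCTDIFF calculation below.
PS2 <- data.frame(
  Sex      = c("Male", "Female", "Male", "Female"),
  Tattoo   = c("Yes", "No", "No", "Yes"),
  Looks    = c(7, 8, 6, 9),
  Height   = c(72, 64, 68, 66),
  HtChoice = c(66, 72, 68, 70),
  stringsAsFactors = FALSE
)

# All columns (and, with no WHERE clause, all rows).
all_cols <- sqldf("select * from PS2")

# Two named columns only.
two_cols <- sqldf("select Tattoo, Looks from PS2")

# A computed column created and named inside the query.
pct <- sqldf("select Sex, ((HtChoice - Height) / Height) * 100 as PCTDIFF from PS2")

print(all_cols)
print(two_cols)
print(pct)
```

Each call returns an ordinary data frame, so the result can be assigned to an object and used like any other R data frame.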
WHERE: this SQL term lets you select only a specified set of observations. This is particularly useful for excluding values that are either clearly illogical (like a negative age) or statistically unlikely (like an adult height of less than 50 inches).

    sqldf('select * from PS2 where HtChoice >= 60')
    sqldf("select Sex, Tattoo from PS2 where Sex = 'Male' and Height > 70")
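A short sketch of row filtering, again on an invented stand-in for PS2 (the values are hypothetical; only the column names come from the slides). The text value is written in single quotes inside a double-quoted R string, which is the standard SQL way to write a string literal.

```r
library(sqldf)

# Hypothetical stand-in for PS2 (invented values).
PS2 <- data.frame(
  Sex      = c("Male", "Female", "Male", "Female", "Male"),
  Tattoo   = c("Yes", "No", "No", "Yes", "No"),
  Height   = c(72, 64, 68, 66, 74),
  HtChoice = c(66, 72, 58, 70, 65),
  stringsAsFactors = FALSE
)

# Keep only rows where HtChoice is at least 60.
plausible <- sqldf("select * from PS2 where HtChoice >= 60")

# Combine two conditions with AND: males taller than 70.
tall_men <- sqldf("select Sex, Tattoo from PS2 where Sex = 'Male' and Height > 70")

print(plausible)
print(tall_men)
```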
Summarization: you can do some basic summarization within a SQL query, and it is very efficient. With sqldf's default SQLite backend, examples include AVG, COUNT, SUM, MIN, MAX and STDEV. Names familiar from SAS PROC SQL, such as MEAN, NMISS, RANGE and STD, are not SQLite functions.

    sqldf('select Sex, count(Sex) N, avg(NumPrces) AVG_NumPrces, stdev(NumPrces) StdDev from PS2 group by Sex')

Note that in this code the requested statistics are generated separately for each value of the variable that you specify after the GROUP BY term.
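A sketch of grouped summarization on invented data (the values are hypothetical; only the column names come from the slides). It assumes that `stdev` is available through the SQLite extension functions that sqldf loads by default; if it is not, drop that column from the query.

```r
library(sqldf)

# Hypothetical stand-in for PS2 (invented values).
PS2 <- data.frame(
  Sex      = c("Male", "Female", "Male", "Female", "Male"),
  NumPrces = c(0, 2, 1, 4, 2),
  stringsAsFactors = FALSE
)

# One output row per value of Sex, with a count, a mean and a standard deviation.
by_sex <- sqldf("
  select Sex,
         count(Sex)      as N,
         avg(NumPrces)   as AVG_NumPrces,
         stdev(NumPrces) as StdDev
  from PS2
  group by Sex
")
print(by_sex)

# Cross-check the group means with base R.
print(aggregate(NumPrces ~ Sex, data = PS2, FUN = mean))
```

Without the GROUP BY term, the same query would return a single row summarizing the whole data frame.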