Introduction to Exploratory Descriptive Data Analysis in S-Plus
Jagdish S. Gangolly
School of Business, State University of New York at Albany
2/22/2019, Acc 522 Statistical Methods for Business Decisions (J Gangolly)

S-Plus in MS-Windows
  To quit the S-Plus shell from the command-line window: q() or Ctrl-D
  The S-Plus prompt is >

Simple Structures I: Arithmetic Operators
  The operators are *, /, +, and -.
  Multiplication and division are evaluated before addition and subtraction.
  Raising to a power (^ or **) takes precedence over the other arithmetic operators.
  Avoid ambiguity by using parentheses, e.g., (7+2)*3, since 7+2*3 = 13 and not 27.
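
The precedence rules above can be checked directly at the prompt. This is a minimal sketch; the variable names are made up for illustration, and the code is assumed to behave the same in S-Plus and R.

```r
# Precedence: ^ first, then * and /, then + and -
no.parens   <- 7 + 2 * 3     # multiplication first: 13
with.parens <- (7 + 2) * 3   # parentheses force the addition first: 27
power.first <- 2 * 3^2       # 3^2 is evaluated before the product: 18
neg.power   <- -2^2          # ^ binds tighter than unary minus: -4
```

When in doubt, adding parentheses costs nothing and makes the intent explicit.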

Simple Structures II: Assignments
  x <- 3 or 3 -> x or x_3 or x=3
  Avoid the underscore and the equals sign for assignment; prefer <-.
  S-Plus is case sensitive, so x and X are different objects.
  To see the value of a variable x: x or print(x)
  To remove a variable x: rm(x)

Simple Structures III: Concatenation
  c() is used to create vectors of any length.
    > x <- c(1.5, 2, 2.5)
    > x
    1.5 2.0 2.5
    > x^2
    2.25 4.00 6.25
  c can be used with any type of data.

Simple Structures IV: Sequence
  Sequence command: seq(lower, upper, increment)
  Some examples:
    seq(1,35,5):    1 6 11 16 21 26 31
    seq(5,15,1.5):  5.0 6.5 8.0 9.5 11.0 12.5 14.0
    seq(50,25,-5):  50 45 40 35 30 25

Simple Structures V: Replicate
  Replicate command: generates data that follow a regular pattern.
  Some examples:
    rep(8,5):                8 8 8 8 8
    rep("8",5):              "8" "8" "8" "8" "8"
    rep(c(0,"ab"),2):        "0" "ab" "0" "ab"   (the number 0 is coerced to a character string)
    rep(1:4, 1:4):           1 2 2 3 3 3 4 4 4 4
    rep(1:3, rep(2,3)):      1 1 2 2 3 3
    rep(c(1,8,7),length=5):  1 8 7 1 8

Simple Structures VI: Expressions
  Arithmetic on vectors is carried out element by element.
    > x <- seq(2,10,2)
    > y <- 1:5
    > z <- ((3*x^2+2*y)/((x+y)*(x-y)))^(0.5)
    > x
    2 4 6 8 10
    > y
    1 2 3 4 5
    > z
    2.160247 2.081666 2.054805 2.041241 2.033060

Simple Structures VI: Logical Operators
  <    Less than
  >    Greater than
  <=   Less than or equal to
  >=   Greater than or equal to
  ==   Equal to
  !=   Not equal to
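
Like the arithmetic operators, the comparison operators listed above work element by element on vectors and return logical values. A small sketch with a made-up vector, assumed to behave the same in S-Plus and R:

```r
x <- c(3, 8, 5, 8)        # illustrative data, not from the slides

big    <- x > 4           # FALSE TRUE TRUE TRUE
eights <- x == 8          # FALSE TRUE FALSE TRUE
n.big  <- sum(x >= 5)     # TRUE counts as 1 in arithmetic, so this is 3
```

Note that == tests equality, while a single = would be read as an assignment.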

Simple Structures VII: Index Brackets
  Square brackets are used to index vectors and matrices.
    > x <- seq(0,20,10)
    > x[2]
    10
    > x[5]          (there is no fifth element)
    NA
    > x[c(1,3)]
    0 20
    > x[-1]         (a negative index drops that element)
    10 20

Data Manipulation I: Frames & matrices I
  Matrices: two-dimensional vectors (they have row and column indices)
  Arrays: the general data structure in S-Plus
    Zero-dimensional: scalar
    One-dimensional: vector
    Two-dimensional: matrix
    Three- to eight-dimensional: arrays
  The data in a matrix must all be of the same data type (usually numeric).

Data Manipulation I: Frames & matrices II
  Data frames: the columns can be of different data types.
  Lists: the most general data type in S-Plus.
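
The difference between data frames and lists can be sketched as follows. The gdp and pop figures are the ones used in the matrix example of this deck; the large column, the years component, and all object names are made up for illustration. The code is assumed to behave the same in S-Plus and R.

```r
# A data frame: one row per country, columns of different types
country.frame <- data.frame(
    gdp   = c(227, 1534, 2365),
    pop   = c(8, 58, 82),
    large = c(FALSE, TRUE, TRUE),          # hypothetical logical column
    row.names = c("austria", "france", "germany")
)
france.pop <- country.frame["france", "pop"]   # 58
gdp.column <- country.frame$gdp                # a numeric vector

# A list: components may differ in type and in length
info <- list(label = "illustrative", years = 1995:1997, data = country.frame)
n.components <- length(info)                   # 3
```

A data frame is itself a list of equal-length columns, which is why the $ operator works on both.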

Data Manipulation I: Matrices I
  Reading data: S-Plus is very finicky about the format of input data.
  To read a table: read.table("filename")
    The first column must be row names.
    The first row must be column names.
    The top left cell must be empty.
    Space and tab are the default column delimiters.
  See the example in /db4/teach/acc522/fasb103.txt and play around with it.

Data Manipulation I: Matrices II
  read.table and as.matrix():
    x <- read.table("filename")
    as.matrix(x)
  Enter data directly: matrix(data, nrow, ncol, byrow=F)
  Example:
    x <- matrix(1:6, nrow=2, byrow=T)
    dim(x): 2 3 (a 2 x 3 matrix)
    dimnames(x): NULL

Data Manipulation I: Matrices III
  Elements of matrices are accessed by specifying the row and column indices.
  Example:
    data <- c(227,8,1.3,1534,58,1.2,2365,82,1.8)
    countries <- c("austria", "france", "germany")
    variables <- c("gdp", "pop", "inflation")
    country.data <- matrix(data,nrow=3,byrow=T)
    dimnames(country.data) <- list(countries,variables)
    country.data[1:2,2:3]: pop and inflation of austria and france

S-Plus Graphics I
  To plot two variables x and y: plot(x,y)
  Example (sine curve): plot(1:100, sin(1:100/10))

Data Manipulation
  Matrices: bind rows (rbind), bind columns (cbind)
  Arrays: rowMeans, colMeans, rowSums, colSums, rowVars, colVars, ...
  apply(data, dim, function, ...)
  attach(framename): lets you refer to the variables in a frame directly by name. You can detach the frame when done.
  function(x) { function definition }: to define your own functions
  rm(comma-separated S-Plus objects): to remove objects
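
A short sketch of how the commands listed above fit together. The matrix values and the helper name spread are made up for illustration, and the code is assumed to behave the same in S-Plus and R.

```r
m <- matrix(1:6, nrow = 2, byrow = TRUE)   # rows: 1 2 3 and 4 5 6

m.rows <- rbind(m, c(7, 8, 9))             # add a row: now 3 x 3
m.cols <- cbind(m, c(10, 20))              # add a column: now 2 x 4

row.totals <- apply(m, 1, sum)             # dim 1 = rows: 6 15
col.means  <- apply(m, 2, mean)            # dim 2 = columns: 2.5 3.5 4.5

# A user-defined function (hypothetical name): width of the range of x
spread <- function(x) { max(x) - min(x) }
col.spread <- apply(m.rows, 2, spread)     # 6 6 6
```

apply() accepts built-in and user-defined functions alike, so one call covers cases for which no row or column shortcut exists.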

S-Plus Graphics
  motif(): opens a graphics window. Each time you invoke it, a new graphics window is opened.
  dev.off(): closes the most recently opened graphics device.
  graphics.off(): closes all graphics devices.
  plot(comma-separated variables, plot character)

Trellis Graphics I
  A matrix of graphs
  Example:
    > par(mfrow=c(2,2))        # 2 x 2 matrix of figures
    > x <- (1:100)/(100:1)     # same as 1:100/100:1, since : binds tighter than /
    > plot(x)                  # plot cell (1,1)
    > plot(x, type="l")        # plot cell (1,2) line
    > hist(x)                  # plot cell (2,1) histogram
    > boxplot(x)               # plot cell (2,2) boxplot

Trellis Graphics I
  Syntax: dependent variable ~ explanatory variable | conditioning variable
  Data set: supplied through the data argument
  Output:
    > trellis.device(motif)    (Unix version)
    > dev.off() or > graphics.off()

Trellis Graphics II
  Example: histogram(~height | voice.part, data=singer)
    There is no dependent variable for a histogram.
    height is the explanatory variable.
    voice.part is the conditioning variable: one panel per voice part.
    The data set is singer.

Trellis Graphics III
  Layout: the layout, skip, and aspect parameters (p.147).
  Ordering of graphs: left to right, bottom to top. If as.table=T, left to right, top to bottom (p.149).

Descriptive Data Exploration
  summary: mean, median, quantiles (p.193-200)
  stem: stem-and-leaf display (p.193-200)
  stdev: standard deviation (p.197)
  tapply: splits data into groups and applies a function to each group (p.198)
  by (p.199)
  mean works on vectors; other structures need to be converted to vectors before computing means.
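
The commands above can be tried on a small made-up data set. The heights and voice parts below are illustrative values, not the singer data. The sketch uses sqrt(var(x)) for the standard deviation because the function name differs between systems (stdev in S-Plus, sd in R); everything else is assumed to behave the same in both.

```r
height <- c(64, 62, 66, 70, 72, 68)                       # made-up heights
part   <- c("soprano", "soprano", "alto", "tenor", "bass", "bass")

summary(height)                     # minimum, quartiles, median, mean, maximum
stem(height)                        # stem-and-leaf display
height.var <- var(height)           # 14
height.sd  <- sqrt(height.var)      # standard deviation

# Split height by voice part and take the mean of each group
part.means <- tapply(height, part, mean)   # alto 66, bass 70, soprano 63, tenor 70

# Convert a matrix to a vector before taking the mean
m <- matrix(1:6, nrow = 2)
overall.mean <- mean(as.vector(m))         # 3.5
```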

Data Preprocessing for Datamining I
  Why preprocess? Data are often:
    Incomplete: attribute values not available, equipment malfunctions, attributes not considered important
    Noisy (errors): instrument problems, human/computer errors, transmission errors
    Inconsistent: inconsistencies due to data definitions

Data Preprocessing for Datamining II
  Data Cleaning
    Missing values: ignore the tuple, fill in values manually, use a global constant (unknown), or replace the missing value with the attribute mean, the attribute group mean, or the most probable value
    Noisy data:
      Binning: partitioning into equi-sized bins, smoothing by bin means or bin boundaries
      Clustering
      Inspection: computer and human
      Regression
    Inconsistencies
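
Two of the cleaning strategies above, filling missing values with a mean and smoothing by bin means, can be sketched in a few lines. All data and object names are made up for illustration; the bins are assumed to hold equal numbers of sorted values. The code is assumed to behave the same in S-Plus and R.

```r
# Missing values: made-up attribute with two NAs and a grouping variable
income <- c(30, NA, 50, 40, NA, 80)
group  <- c("a", "a", "a", "b", "b", "b")
miss   <- is.na(income)

# Option 1: replace by the overall attribute mean (mean of 30, 50, 40, 80 = 50)
filled <- income
filled[miss] <- mean(income, na.rm = TRUE)

# Option 2: replace by the mean of the same group (a: 40, b: 60)
group.means <- tapply(income, group, mean, na.rm = TRUE)
filled.by.group <- income
filled.by.group[miss] <- group.means[group[miss]]

# Noisy data: smoothing by bin means, three equi-sized bins of sorted values
price <- c(4, 8, 15, 21, 21, 24, 25, 28, 34)
bin   <- rep(1:3, rep(3, 3))               # 1 1 1 2 2 2 3 3 3
bin.means <- tapply(price, bin, mean)      # 9 22 29
smoothed  <- as.vector(bin.means[bin])     # each value replaced by its bin mean
```

Filling with the group mean uses more information than the overall mean, at the cost of needing a sensible grouping variable.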

Data Preprocessing for Datamining III
  Data Integration: combining data from different sources into a coherent whole
    Schema integration: combining data models (entity identification problems)
    Redundancy (derived values, calculated fields, use of different key attributes): use correlations to detect redundancies
    Resolution of data value conflicts (values coded in different measures)
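
Detecting redundancy with correlations can be illustrated as follows. The attributes are hypothetical: sales.k is simply sales.usd expressed in thousands, the kind of derived field that appears when two sources are merged. The code is assumed to behave the same in S-Plus and R.

```r
sales.usd <- c(100, 250, 175, 300, 220)   # made-up values
sales.k   <- sales.usd / 1000             # same quantity, different unit
staff     <- c(12, 9, 15, 7, 11)          # an unrelated made-up attribute

r.redundant <- cor(sales.usd, sales.k)    # 1 up to rounding: one attribute is redundant
r.other     <- cor(sales.usd, staff)      # well below 1 in absolute value

# Correlation matrix of all attributes at once
cor(cbind(sales.usd, sales.k, staff))
```

A correlation near 1 or -1 only flags a candidate; whether to drop an attribute remains a judgment about the data definitions.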

Data Preprocessing for Datamining III
  Transformation
    Smoothing
    Aggregation
    Generalisation
    Normalisation
    Attribute (or feature) construction
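
Normalisation is the easiest of these transformations to show in code. The sketch below uses two common variants, min-max scaling and z-scores; the slide does not say which one is meant, so both are offered as assumptions, on made-up data. The code is assumed to behave the same in S-Plus and R.

```r
x <- c(200, 300, 400, 600, 1000)              # made-up attribute values

# Min-max normalisation: rescale to the interval [0, 1]
minmax <- (x - min(x)) / (max(x) - min(x))    # 0 0.125 0.25 0.5 1

# Z-score normalisation: mean 0, standard deviation 1
zscore <- (x - mean(x)) / sqrt(var(x))
```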

Data Preprocessing for Datamining IV
  Data Reduction & Compression
    Data cube aggregation (p.117)
    Dimension reduction: minimise the loss of information
      Attribute selection
      Decision tree induction
      Principal components analysis
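
Principal components analysis as a dimension-reduction step might look like the sketch below. It uses prcomp() as it exists in R and assumes the S-Plus version returns the same sdev and x components. The data are made up so that the second attribute is almost a multiple of the first.

```r
x1 <- 1:10
x2 <- 2 * x1 + rep(c(0.1, -0.1), 5)     # nearly 2 * x1, with a small wiggle

pc <- prcomp(cbind(x1, x2))

# Share of the total variance carried by each component
share <- pc$sdev^2 / sum(pc$sdev^2)

# Keep only the first component: two attributes reduced to one
reduced <- pc$x[, 1]
```

Here the first component carries almost all of the variance, so dropping the second loses very little information.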

Data Preprocessing for Datamining IV
  Numerosity reduction
    Regression / log-linear regression
    Histograms
    Clustering
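
Numerosity reduction replaces the raw values with a smaller representation. A sketch of two of the options above on made-up data: a regression that stores two coefficients instead of twenty points, and a histogram that stores three bin counts instead of fifteen values. The bin boundaries are arbitrary, and the code follows R (the return type of cut() may differ in S-Plus).

```r
# Regression: an exactly linear made-up relationship
x <- 1:20
y <- 3 + 2 * x
fit <- lm(y ~ x)
params <- as.vector(coef(fit))            # intercept 3 and slope 2

# Histogram: keep the counts per bin rather than the values
v <- c(1, 1, 5, 5, 5, 8, 10, 12, 14, 15, 18, 20, 21, 25, 30)
counts <- as.vector(table(cut(v, breaks = c(0, 10, 20, 30))))   # 7 5 3
```

With real data the fit is not exact, so the regression stores an approximation rather than the data themselves.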