Introduction to Exploratory Descriptive Data Analysis in S-Plus
Jagdish S. Gangolly
School of Business
State University of New York at Albany

S-Plus in MS-Windows
  To quit the S-Plus shell from the command-line window: q() or Ctrl-D
  The S-Plus prompt is >

Simple Structures I: Arithmetic Operators
  The operators are *, /, + and -.
  Multiplication and division are evaluated before addition and subtraction.
  Raising to a power (^ or **) takes precedence over the other arithmetic operators.
  Avoid ambiguity by using parentheses, e.g., (7+2)*3, since 7+2*3 = 13 and not 27.
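
The precedence rules above can be checked with a few one-line expressions. This is a minimal sketch in S syntax that should also run in R; the variable names are illustrative only.

```r
# Multiplication is evaluated before addition
a <- 7 + 2 * 3        # 13
b <- (7 + 2) * 3      # 27

# Raising to a power is evaluated before multiplication
d <- 2 * 3^2          # 18
e <- (2 * 3)^2        # 36
```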

Simple Structures II: Assignments
  x <- 3 or 3 -> x or x_3 or x=3
  Using the underscore or the equals sign for assignment is not a good idea.
  To see the value of a variable x: x or print(x)
  To remove a variable x: rm(x)

Simple Structures III: Concatenation
  c() is used to create vectors of any length.
  > x <- c(1.5, 2, 2.5)
  > x
  1.5 2.0 2.5
  > x^2
  2.25 4.00 6.25
  c() can be used with any type of data.

Simple Structures IV: Sequence
  Sequence command: seq(lower, upper, increment)
  Some examples:
  seq(1,35,5): 1 6 11 16 21 26 31
  seq(5,15,1.5): 5.0 6.5 8.0 9.5 11.0 12.5 14.0
  seq(50,25,-5): 50 45 40 35 30 25

Simple Structures V: Replicate
  Replicate command: generates data that follow a regular pattern.
  Some examples:
  rep(8,5): 8 8 8 8 8
  rep("8",5): "8" "8" "8" "8" "8"
  rep(c(0,"ab"),2): "0" "ab" "0" "ab"
  rep(1:4,1:4): 1 2 2 3 3 3 4 4 4 4
  rep(1:3,rep(2,3)): 1 1 2 2 3 3
  rep(c(1,8,7),length=5): 1 8 7 1 8

Simple Structures VI: Expressions
  > x <- seq(2,10,2)
  > y <- 1:5
  > z <- ((3*x^2+2*y)/((x+y)*(x-y)))^(0.5)
  > x
  2 4 6 8 10
  > y
  1 2 3 4 5
  > z
  2.160247 2.081666 2.054805 2.041241 2.033060

Simple Structures VI: Logical Operators
  <    Less than
  >    Greater than
  <=   Less than or equal to
  >=   Greater than or equal to
  ==   Equal to
  !=   Not equal to
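
These operators work element by element on vectors and return logical vectors. The following is a small sketch with made-up values, in S syntax that should also run in R.

```r
x <- seq(2, 10, 2)          # 2 4 6 8 10
y <- c(2, 5, 6, 7, 10)      # illustrative values

gt <- x > 5                 # F F T T T
eq <- x == y                # T F T F T
ne <- x != y                # F T F T F

# Logical values count as 0/1 in arithmetic, so sum() counts the TRUEs
n.big <- sum(x >= 6)        # 3
```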

Simple Structures VII: Index Brackets
  Square brackets are used to index vectors and matrices.
  > x <- seq(0,20,10)
  > x[2]
  10
  > x[5]
  NA
  > x[c(1,3)]
  0 20
  > x[-1]
  10 20
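
Square brackets also accept a logical vector and can appear on the left of an assignment. A brief sketch, assuming the same x as in the slide; the other names are illustrative.

```r
x <- seq(0, 20, 10)     # 0 10 20

big <- x[x > 5]         # keep elements where the condition holds: 10 20
rest <- x[-c(1, 2)]     # drop the first two elements: 20

x2 <- x                 # work on a copy
x2[2] <- 99             # replace one element: 0 99 20
```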

Data Manipulation I: Frames & matrices I
  Matrices: two-dimensional vectors (have row and column indices)
  Arrays: general data structure in S-Plus
    Zero-dimensional: scalar
    One-dimensional: vector
    Two-dimensional: matrix
    Three- to eight-dimensional: arrays
  The data in a matrix must all be of the same data type (usually numeric).

Data Manipulation I: Frames & matrices II
  Data frames: the columns can be of different data types
  Lists: the most general data type in S-Plus
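
To make the distinction concrete, here is a minimal sketch that builds a data frame with mixed column types and a list holding unrelated objects. All names and values are hypothetical.

```r
# A data frame: columns of equal length, possibly of different types
df <- data.frame(name = c("a", "b", "c"),
                 score = c(1.5, 2.0, 2.5),
                 passed = c(T, F, T))

# A list: components of any type and length, including a data frame
lst <- list(id = 1:3, label = "demo", table = df)

s <- df$score             # one column, a numeric vector
p <- lst$table$passed     # components can be nested
```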

Data Manipulation I: Matrices I
  Reading data
  S-Plus is very finicky about the format of input data.
  To read a table: read.table("filename")
    The first column must be row names
    The first row must be column names
    The top left cell must be empty
    Space/tab are the default column delimiters
  See the example in /db4/teach/acc522/fasb103.txt and play around with it.
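
The file layout described above can be tried out on a small scratch file. This is a sketch only: the data are made up, tempfile() and unlink() are assumed to be available as they are in R, and the header-detection behaviour shown is that of R's read.table, which S-Plus is assumed to share.

```r
# Write a small whitespace-delimited file. The header row has one fewer
# entry than the data rows, which corresponds to the empty top-left cell.
f <- tempfile()
cat("gdp pop",
    "austria 227 8",
    "france 1534 58",
    "",
    file = f, sep = "\n")

tab <- read.table(f)      # first column becomes the row names
unlink(f)                 # remove the scratch file

tab["france", "pop"]      # look up a cell by row and column name
```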

Data Manipulation I: matrices II
  read.table and as.matrix():
    x <- read.table("filename")
    as.matrix(x)
  Enter data directly: matrix(data, nrow, ncol, byrow=F)
  Example:
    x <- matrix(1:6, nrow=2, byrow=T)
    dim(x): (2 X 3)
    dimnames(x): (NULL)
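
The effect of byrow is easiest to see side by side. A small sketch; the row and column labels are invented for illustration.

```r
x <- matrix(1:6, nrow = 2, byrow = T)   # filled row by row:     1 2 3 / 4 5 6
y <- matrix(1:6, nrow = 2)              # filled column by column: 1 3 5 / 2 4 6

d <- dim(x)                             # 2 3

# dimnames is a list: row names first, then column names
dimnames(x) <- list(c("r1", "r2"), c("a", "b", "c"))
```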

Data Manipulation I: matrices III
  Elements of matrices are accessed by specifying the row and column indices.
  Example (country.data is a placeholder object name):
    data <- c(227,8,1.3,1534,58,1.2,2365,82,1.8)
    countries <- c("austria", "france", "germany")
    variables <- c("gdp", "pop", "inflation")
    country.data <- matrix(data,nrow=3,byrow=T)
    dimnames(country.data) <- list(countries,variables)
    country.data[1:2,2:3]: pop and inflation of austria & france
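
Once dimnames are set, rows and columns can also be selected by name. The sketch below reuses the slide's figures; the object name m is hypothetical.

```r
vals <- c(227, 8, 1.3, 1534, 58, 1.2, 2365, 82, 1.8)
m <- matrix(vals, nrow = 3, byrow = T)
dimnames(m) <- list(c("austria", "france", "germany"),
                    c("gdp", "pop", "inflation"))

g <- m["france", "gdp"]                                   # 1534
pop <- m[, "pop"]                                         # whole column, named by country
sub <- m[c("austria", "france"), c("pop", "inflation")]   # same cells as m[1:2, 2:3]
```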

S-Plus Graphics I
  To plot two variables x and y: plot(x,y)
  Example (sine curve): plot(1:100, sin(1:100/10))

Data Manipulation
  Matrices: bind rows (rbind), bind columns (cbind)
  Arrays: rowMeans, colMeans, rowSums, colSums, rowVars, colVars,…
  apply(data, dim, function,…)
  attach(framename): permits you to refer to variables without cumbersome notation. You can detach the frame when done.
  function (x) { function definition }: to define your own functions
  rm(comma-separated S-Plus objects): to remove objects
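
The binding functions, apply and a user-defined function can be combined in a few lines. This is a minimal sketch with invented data; spread is a hypothetical helper, not an S-Plus built-in.

```r
a <- 1:3
b <- 4:6

r <- rbind(a, b)                  # 2 x 3 matrix, one row per vector
k <- cbind(a, b)                  # 3 x 2 matrix, one column per vector

row.tot <- apply(r, 1, sum)       # over rows (dim 1): 6 15
col.avg <- apply(r, 2, mean)      # over columns (dim 2): 2.5 3.5 4.5

# A user-defined function, applied to each row
spread <- function(v) { max(v) - min(v) }
row.spread <- apply(r, 1, spread) # 2 2

rm(a, b)                          # remove objects that are no longer needed
```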

S-Plus Graphics
  motif(): opens a graphics window. Each time you invoke this, a new graphics window is opened.
  dev.off(): closes the most recent graphics device opened.
  graphics.off(): closes all graphics devices.
  plot(comma-separated variables, plot character)

Trellis Graphics I
  A matrix of graphs
  Example:
  > par(mfrow=c(2,2))    # 2 X 2 matrix of figures
  > x <- 1:100/100:1
  > plot(x)              # plot cell (1,1)
  > plot(x, type="l")    # plot cell (1,2) line
  > hist(x)              # plot cell (2,1) histogram
  > boxplot(x)           # plot cell (2,2) boxplot

Trellis Graphics II
  Syntax: dependent variable ~ explanatory variable | conditioning variable
  Data set
  Output: > trellis.device(motif) > or >

Trellis Graphics II
  Example: histogram(~height | voice.part, data=singer)
    There is no dependent variable for a histogram
    height is the explanatory variable
    voice.part is the conditioning variable
    The data set is singer

Trellis Graphics III
  Layout: layout, skip and aspect parameters (p.147).
  Ordering graphs: left to right, bottom to top. If as.table=T, left to right, top to bottom (p.149).

Descriptive Data Exploration
  summary: mean, median, quantiles (p.193-200)
  stem: stem-and-leaf display (p.193-200)
  stdev (p.197)
  tapply: splits data (p.198)
  by (p.199)
  mean works on vectors; other structures need to be converted to vectors before computing means.
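
A short sketch of these functions on invented data follows. The variable names echo the singer example but the numbers are hypothetical, and sqrt(var()) is used for the standard deviation so that the code should also run in R.

```r
# Hypothetical heights in two groups
height <- c(60, 62, 64, 68, 70, 72)
part <- c("alto", "alto", "alto", "bass", "bass", "bass")

summary(height)                       # min, quartiles, median, mean, max
stem(height)                          # stem-and-leaf display
s <- sqrt(var(height))                # standard deviation

m.all <- mean(height)                 # mean of a plain vector: 66
m.grp <- tapply(height, part, mean)   # split by group, then average: alto 62, bass 70

# For a matrix, compute the mean column by column
mat <- cbind(height, height / 2)
col.m <- apply(mat, 2, mean)          # 66 33
```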

Data Preprocessing for Datamining I
  Why?
    Incomplete: attribute values not available, equipment malfunctions, not considered important
    Noisy (errors): instrument problems, human/computer errors, transmission errors
    Inconsistent: inconsistencies due to data definitions

Data Preprocessing for Datamining II
  Data Cleaning
    Missing values: ignore the tuple, fill in values manually, use a global constant (unknown), use the attribute mean, use the attribute group mean, use the most probable value
    Noisy data:
      Binning: partitioning into equi-sized bins, smoothing by bin means or bin boundaries
      Clustering
      Inspection: computer & human
      Regression
    Inconsistencies
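
Two of the cleaning steps above, filling missing values with the attribute mean and smoothing by bin means, might be sketched as follows. The data are invented and the approach is one simple possibility, not a prescribed method.

```r
# Missing values: replace NA by the mean of the observed values
v <- c(4, NA, 8, 15, NA, 21)
m <- mean(v[!is.na(v)])               # 12
v.filled <- v
v.filled[is.na(v)] <- m               # 4 12 8 15 12 21

# Smoothing by bin means: sort, split into equal-sized bins,
# then replace each value by the mean of its bin
p <- sort(c(4, 8, 15, 21, 21, 24, 25, 28, 34))
bin <- rep(1:3, rep(3, 3))            # 1 1 1 2 2 2 3 3 3
bin.mean <- tapply(p, bin, mean)      # 9 22 29
p.smooth <- as.vector(bin.mean[bin])  # 9 9 9 22 22 22 29 29 29
```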

Data Preprocessing for Datamining III
  Data Integration: combining data from different sources into a coherent whole
    Schema integration: combining data models (entity identification problems)
    Redundancy (derived values, calculated fields, use of different key attributes): use of correlations to detect redundancies
    Resolution of data value conflicts (coding values in different measures)
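
Using correlations to detect redundancy can be illustrated with a derived attribute. In this sketch the attributes are hypothetical: f.temp is computed from c.temp, so the two are perfectly correlated, while sales is unrelated made-up data.

```r
c.temp <- c(10, 15, 20, 25, 30)       # temperature in Celsius
f.temp <- c.temp * 9 / 5 + 32         # the same information in Fahrenheit
sales <- c(3, 9, 4, 8, 5)

r1 <- cor(c.temp, f.temp)             # essentially 1: one attribute is redundant
r2 <- cor(c.temp, sales)              # a much weaker linear relationship
```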

Data Preprocessing for Datamining III
  Transformation
    Smoothing
    Aggregation
    Generalisation
    Normalisation
    Attribute (or feature) construction
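
Normalisation and aggregation are straightforward to express with vector arithmetic. The sketch below shows min-max and z-score normalisation, two common choices that the slide does not name explicitly, plus a simple roll-up; all values are invented.

```r
v <- c(200, 300, 400, 600, 1000)

# Min-max normalisation to the interval [0, 1]
v.mm <- (v - min(v)) / (max(v) - min(v))    # 0 0.125 0.25 0.5 1

# Z-score normalisation: zero mean, unit standard deviation
v.z <- (v - mean(v)) / sqrt(var(v))

# Aggregation: monthly figures rolled up to quarterly totals
monthly <- 1:12
quarter <- rep(1:4, rep(3, 4))
q.tot <- tapply(monthly, quarter, sum)      # 6 15 24 33
```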

Data Preprocessing for Datamining IV
  Data Reduction & compression
    Data cube aggregation (p.117)
    Dimension reduction: minimise loss of information
    Attribute selection
    Decision tree induction
    Principal components analysis
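
As a rough illustration of principal components analysis for dimension reduction, the sketch below uses princomp on invented data in which the third column is nearly the sum of the first two. It assumes princomp returns an sdev component, as it does in R.

```r
x1 <- c(1, 2, 3, 4, 5, 6)
x2 <- c(2, 1, 4, 3, 6, 5)
x3 <- x1 + x2 + c(0.1, -0.1, 0.1, -0.1, 0.1, -0.1)   # nearly redundant
d <- cbind(x1, x2, x3)

pc <- princomp(d)
v <- pc$sdev^2                  # variance captured by each component
share <- cumsum(v) / sum(v)     # cumulative share of the total variance

# If the first one or two components carry almost all the variance,
# the remaining components can be dropped with little loss of information.
```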

Data Preprocessing for Datamining IV
  Numerosity reduction
    Regression/log-linear regression
    Histograms
    Clustering
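
The idea behind numerosity reduction is to store a compact model or summary instead of the raw values. A minimal sketch on invented data: a fitted line replaces the individual points, and bucket counts replace the individual values. The bucket boundaries are arbitrary.

```r
# Regression: two coefficients stand in for 20 (x, y) pairs
x <- 1:20
y <- 3 + 2 * x                       # hypothetical data lying exactly on a line
fit <- lm(y ~ x)
b <- coef(fit)                       # intercept and slope
y.hat <- b[1] + b[2] * x             # values reconstructed from the parameters

# Histogram: keep only the counts per equal-width bucket
counts <- table(cut(y, breaks = c(0, 15, 30, 45)))
```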