1
Introduction to Exploratory Descriptive Data Analysis in S-Plus
Jagdish S. Gangolly
School of Business
State University of New York at Albany
2/22/2019 Acc 522 Statistical Methods for Business Decisions (J Gangolly)
2
S-Plus in MS-Windows
To quit the S-Plus shell while in the command line window: q() or Ctrl-D
The S-Plus prompt is >
3
Simple Structures I: Arithmetic Operators
The arithmetic operators are *, /, +, and -.
Avoid ambiguity by using parentheses, e.g., (7+2)*3, since 7+2*3 = 13 and not 27.
Multiplication and division are evaluated before addition and subtraction.
Raising to a power (^ or **) takes precedence over the other arithmetic operators.
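The precedence rules can be checked directly at the prompt. The sketch below is written to run in R, the open-source S dialect, and S-Plus is assumed to evaluate these expressions the same way; the variable names are illustrative only.

```r
# Multiplication is evaluated before addition.
a <- 7 + 2 * 3      # 13
b <- (7 + 2) * 3    # 27

# Exponentiation is evaluated before multiplication and before unary minus.
p1 <- 2 * 3^2       # 18, read as 2 * (3^2)
p2 <- -2^2          # -4, read as -(2^2)
p3 <- (-2)^2        # 4

print(c(a, b, p1, p2, p3))
```

When in doubt, adding parentheses costs nothing and makes the intended order explicit.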
4
Simple Structures II: Assignments
To assign the value 3 to x: x <- 3 or 3 -> x or x_3 or x=3
It is not a good idea to use the underscore or the equals sign for assignment.
To see the value of a variable x: x or print(x)
To remove a variable x: rm(x)
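A minimal sketch of assigning, inspecting, and removing a variable, written to run in R. The underscore form is left out because R does not accept it as an assignment operator; the helper names had.x and has.x are illustrative only.

```r
x <- 3                  # preferred assignment form
3 -> y                  # right-pointing arrow, same effect
print(x)                # show the value of x

had.x <- exists("x")    # TRUE while x is defined
rm(x)                   # remove x
has.x <- exists("x")    # FALSE once x has been removed
print(c(had.x, has.x))
```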
5
Simple Structures III: Concatenation
c() is used to create vectors of any length:
> x <- c(1.5, 2, 2.5)
> x
> x^2
c() can be used with any type of data.
6
Simple Structures IV: Sequence
Sequence command: seq(lower, upper, increment)
Some examples:
seq(1,35,5): 1 6 11 16 21 26 31
seq(5,15,1.5): 5.0 6.5 8.0 9.5 11.0 12.5 14.0
seq(50,25,-5): 50 45 40 35 30 25
7
Simple Structures V: Replicate
Replicate command: generates data that follow a regular pattern.
Some examples:
rep(8,5): 8 8 8 8 8
rep("8",5): "8" "8" "8" "8" "8"
rep(c(0,"ab"),2): "0" "ab" "0" "ab"
rep(1:4, 1:4): 1 2 2 3 3 3 4 4 4 4
rep(1:3, rep(2,3)): 1 1 2 2 3 3
rep(c(1,8,7), length=5): 1 8 7 1 8
8
Simple Structures VI: Expressions
> x <- seq(2,10,2)
> y <- 1:5
> z <- ((3*x^2+2*y)/((x+y)*(x-y)))^(0.5)
> x
> y
> z
9
Simple Structures VI: Logical Operators
<   Less than
>   Greater than
<=  Less than or equal to
>=  Greater than or equal to
==  Equal to
!=  Not equal to
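These operators work element by element on vectors and return logical vectors, which can in turn be used inside square brackets to select elements. A short sketch in R with made-up values:

```r
x <- seq(2, 10, 2)          # 2 4 6 8 10
y <- c(2, 5, 6, 7, 10)      # hypothetical comparison values

eq <- x == y                # TRUE FALSE TRUE FALSE TRUE
gt <- x > y                 # FALSE FALSE FALSE TRUE FALSE
big <- x[x >= 6]            # keeps 6 8 10
n.diff <- sum(x != y)       # counts positions that differ: 2

print(eq)
print(gt)
print(big)
print(n.diff)
```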
10
Simple Structures VII: Index Brackets
Square brackets are used to index vectors and matrices.
> x <- seq(0,20,10)
> x[2]
10
> x[5]
NA
> x[c(1,3)]
0 20
> x[-1]
10 20
11
Data Manipulation I: Frames & matrices I
Matrices: two-dimensional vectors (have row and column indices)
Arrays: the general data structure in S-Plus
  Zero-dimensional: scalar
  One-dimensional: vector
  Two-dimensional: matrix
  Three- to eight-dimensional: arrays
The data in a matrix must all be of the same data type (usually numeric).
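To make the dimension hierarchy concrete, the sketch below builds a matrix and a three-dimensional array from the same vector and shows that mixing types forces everything to a single type. It is written for R; the object names are illustrative.

```r
v <- 1:24
m <- matrix(v, nrow = 4)           # two-dimensional: 4 rows, 6 columns
a <- array(v, dim = c(2, 3, 4))    # three-dimensional: 2 x 3 x 4

print(dim(m))
print(dim(a))

# Elements are stored with the first index varying fastest.
first.of.slice.2 <- a[1, 1, 2]     # 7
last.element <- a[2, 3, 4]         # 24

# Mixing numbers and strings coerces every element to character.
mixed <- c(1, "a")
print(mode(mixed))
```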
12
Data Manipulation I: Frames & matrices II
Data frames: the columns in data frames can be of different data types
Lists: the most general data type in S-Plus
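A small sketch of both structures in R. The firm names and figures are invented for illustration and do not come from the course data.

```r
# A data frame: one column of labels, one numeric, one logical.
firms <- data.frame(name = c("A", "B", "C"),
                    revenue = c(10.5, 7.2, 3.9),
                    audited = c(T, F, T))
print(firms)

# A list can hold components of any type and length, including a data frame.
info <- list(id = 1, tags = c("x", "y"), table = firms)
print(names(info))
second.tag <- info$tags[2]         # "y"
mean.rev <- mean(firms$revenue)    # columns are reached with $
```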
13
Data Manipulation I: Matrices I
Reading data
S-Plus is very finicky about the format of input data.
To read a table: read.table("filename")
  The first column must be row names
  The first row must be column names
  The top left cell must be empty
  Space and tab are the default column delimiters
See the example in /db4/teach/acc522/fasb103.txt and play around with it.
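The layout rules can be tried on a throwaway file. The sketch below, written for R, writes a small table to a temporary file (a hypothetical stand-in for a course data file) using the country figures that appear later in these slides, then reads it back. Because the header row has one field fewer than the data rows, read.table treats the first row as column names and the first column as row names.

```r
f <- tempfile()    # hypothetical scratch file, not a course file

# Header row has no entry above the row names (the "empty top left cell").
cat("gdp pop inflation\n",
    "austria 227 8 1.3\n",
    "france 1534 58 1.2\n",
    "germany 2365 82 1.8\n",
    file = f, sep = "")

d <- read.table(f)         # whitespace-delimited by default
print(d)
print(row.names(d))
m <- as.matrix(d)          # numeric matrix with the same dimnames
unlink(f)                  # remove the scratch file
```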
14
Data Manipulation I: matrices II
read.table and as.matrix():
  x <- read.table("filename")
  as.matrix(x)
Enter data directly: matrix(data, nrow, ncol, byrow=F)
Example:
  x <- matrix(1:6, nrow=2, byrow=T)
  dim(x): 2 3 (a 2 x 3 matrix)
  dimnames(x): NULL
15
Data Manipulation I: matrices III
Elements of matrices are accessed by specifying the row and column indices.
Example:
  data <- c(227,8,1.3,1534,58,1.2,2365,82,1.8)
  countries <- c("austria", "france", "germany")
  variables <- c("gdp", "pop", "inflation")
  country.data <- matrix(data,nrow=3,byrow=T)
  dimnames(country.data) <- list(countries,variables)
  country.data[1:2,2:3]: pop and inflation of austria and france
16
S-Plus Graphics I
To plot two variables x and y: plot(x,y)
Example (sine curve): plot(1:100, sin(1:100/10))
17
Data Manipulation
Matrices: bind rows (rbind), bind columns (cbind)
Arrays: rowMeans, colMeans, rowSums, colSums, rowVars, colVars, …
apply(data, dim, function, …)
attach(framename): permits you to refer to variables without cumbersome notation. You can detach the frame when done.
function(x) { function definition }: to define your own functions
rm(comma-separated S-Plus objects): to remove objects
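A short sketch of binding, apply, and a user-defined function working together, written for R. The vectors and the function name spread are illustrative only.

```r
a <- c(1, 2, 3)
b <- c(4, 5, 6)

r <- rbind(a, b)    # stack as rows: 2 x 3
k <- cbind(a, b)    # stack as columns: 3 x 2

# apply(data, dim, function): dim 1 works over rows, dim 2 over columns.
row.tot <- apply(r, 1, sum)     # 6 15
col.avg <- apply(r, 2, mean)    # 2.5 3.5 4.5

# A user-defined function can be passed to apply like any built-in one.
spread <- function(x) { max(x) - min(x) }
row.spread <- apply(r, 1, spread)   # 2 2

print(row.tot)
print(col.avg)
print(row.spread)
```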
18
S-Plus Graphics
motif(): opens a graphics window. Each time you invoke it, a new graphics window is opened.
dev.off(): closes the most recently opened graphics device.
graphics.off(): closes all graphics devices.
plot(comma-separated variables, plot character)
19
Trellis Graphics I
A matrix of graphs
Example:
> par(mfrow=c(2,2))    # 2 x 2 matrix of figures
> x <- 1:100/100:1
> plot(x)              # plot cell (1,1)
> plot(x, type="l")    # plot cell (1,2) line
> hist(x)              # plot cell (2,1) histogram
> boxplot(x)           # plot cell (2,2) boxplot
21
Trellis Graphics I
Syntax: dependent variable ~ explanatory variable | conditioning variable
Data set
Output:
> trellis.device(motif)    (unix version)
> dev.off() or > graphics.off()
22
Trellis Graphics II
Example: histogram(~height | voice.part, data=singer)
  No dependent variable for a histogram
  height is the explanatory variable
  voice.part is the conditioning variable
  The data set is singer
23
Trellis Graphics III
Layout: layout, skip, and aspect parameters (p. 147).
Ordering of graphs: left to right, bottom to top. If as.table=T, left to right, top to bottom (p. 149).
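As a rough illustration of the layout and ordering options, the sketch below uses R's lattice package as a stand-in for S-Plus Trellis (an assumption, since the slides target S-Plus). The singer data set ships with lattice, and the 2 x 4 layout is an arbitrary choice for its eight voice parts.

```r
library(lattice)    # assumption: lattice stands in for S-Plus Trellis

p <- histogram(~ height | voice.part, data = singer,
               layout = c(2, 4),     # 2 columns by 4 rows of panels
               as.table = TRUE)      # fill left to right, top to bottom

# p is a trellis object; print(p) would draw it on the current graphics device.
print(class(p))
```

Dropping as.table, or setting it to FALSE, returns to the default ordering that starts from the bottom row.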
24
Descriptive Data Exploration
summary: mean, median, quantiles p
stem: stem and leaf display p
stdev p.197
tapply: splits data p.198
by p.199
mean works on vectors; other structures need to be converted to vectors before computing means.
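A brief sketch of these commands on invented data, written for R. Since stdev is an S-Plus function name, the standard deviation is computed here as sqrt(var(x)), which should work in either dialect. The heights and group labels are hypothetical.

```r
height <- c(64, 62, 66, 65, 60, 61, 70, 72, 68, 71)   # hypothetical values
part <- rep(c("soprano", "bass"), c(5, 5))            # hypothetical groups

print(summary(height))    # minimum, quartiles, median, mean, maximum
stem(height)              # stem and leaf display

s <- sqrt(var(height))    # standard deviation

# tapply splits height by part and applies mean to each piece.
grp.mean <- tapply(height, part, mean)
print(grp.mean)

# Converting a matrix to a vector before averaging all of its entries.
m <- matrix(1:6, nrow = 2)
overall <- mean(as.vector(m))    # 3.5
```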
25
Data Preprocessing for Datamining I
Why preprocess? Real-world data are often:
  Incomplete: attribute values not available, equipment malfunctions, attributes not considered important
  Noisy (errors): instrument problems, human/computer errors, transmission errors
  Inconsistent: inconsistencies due to data definitions
26
Data Preprocessing for Datamining II
Data cleaning
Missing values:
  ignore the tuple
  fill in values manually
  use a global constant (unknown)
  missing value = attribute mean
  missing value = attribute group mean
  missing value = most probable value
Noisy data:
  Binning: partitioning into equi-sized bins, smoothing by bin means or bin boundaries
  Clustering
  Inspection: computer and human
  Regression
Inconsistencies
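Two of the cleaning steps above, filling missing values with a mean and smoothing by bin means, can be sketched in a few lines of R. All data values are invented, and "equi-sized" is taken here to mean bins holding the same number of observations, which is an assumption about the slide's intent.

```r
# Missing values
income <- c(30, NA, 45, 52, NA, 61)            # hypothetical attribute
region <- c("n", "n", "n", "s", "s", "s")      # hypothetical grouping
gap <- is.na(income)

fill.mean <- income
fill.mean[gap] <- mean(income, na.rm = T)      # attribute mean: 47

grp <- tapply(income, region, mean, na.rm = T) # group means: n 37.5, s 56.5
fill.grp <- income
fill.grp[gap] <- grp[region[gap]]              # each gap gets its group's mean

# Noisy data: smoothing by bin means
price <- sort(c(4, 8, 15, 21, 21, 24, 25, 28, 34))
bin <- rep(1:3, rep(3, 3))                     # three bins of three values each
bin.mean <- tapply(price, bin, mean)           # 9 22 29
smoothed <- as.vector(bin.mean[bin])           # each value replaced by its bin mean

print(fill.mean)
print(fill.grp)
print(smoothed)
```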
27
Data Preprocessing for Datamining III
Data integration: combining data from different sources into a coherent whole
  Schema integration: combining data models (entity identification problems)
  Redundancy (derived values, calculated fields, use of different key attributes): use of correlations to detect redundancies
  Resolution of data value conflicts (coding values in different measures)
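One way to picture correlation-based redundancy detection: an attribute that is merely a rescaled copy of another correlates perfectly with it, while an unrelated attribute does not. The R sketch below uses invented figures, and any cutoff for "too correlated" would be a judgment call rather than something the slides specify.

```r
sales <- c(10, 12, 15, 19, 24)    # hypothetical, in thousands
sales.raw <- sales * 1000         # same attribute coded in different units
staff <- c(3, 7, 4, 9, 5)         # hypothetical unrelated attribute

r.dup <- cor(sales, sales.raw)    # 1: a redundant, derived attribute
r.other <- cor(sales, staff)      # well below 1: carries separate information

print(c(r.dup, r.other))
```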
28
Data Preprocessing for Datamining III
Transformation
  Smoothing
  Aggregation
  Generalisation
  Normalisation
  Attribute (or feature) construction
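The slide does not say which normalisation is meant; two common choices, min-max scaling and z-scores, are sketched below in R as an assumption, using invented values.

```r
x <- c(200, 300, 400, 600, 1000)    # hypothetical attribute values

# Min-max normalisation: rescale to the interval [0, 1].
minmax <- (x - min(x)) / (max(x) - min(x))    # 0 0.125 0.25 0.5 1

# Z-score normalisation: centre on the mean, scale by the standard deviation.
z <- (x - mean(x)) / sqrt(var(x))

print(minmax)
print(z)
```

Min-max scaling preserves the shape of the distribution within fixed bounds, while z-scores are less affected by where the extremes happen to fall.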
29
Data Preprocessing for Datamining IV
Data reduction and compression
  Data cube aggregation (p. 117)
  Dimension reduction: minimise loss of information
    Attribute selection
    Decision tree induction
    Principal components analysis
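As a sketch of dimension reduction by principal components, the R code below applies princomp to two strongly correlated, made-up attributes and keeps only the first component. The data values are illustrative, and how much variance to retain is a choice the slides leave open.

```r
# Two hypothetical attributes that move together.
x1 <- c(2.5, 0.5, 2.2, 1.9, 3.1, 2.3, 2.0, 1.0, 1.5, 1.1)
x2 <- c(2.4, 0.7, 2.9, 2.2, 3.0, 2.7, 1.6, 1.1, 1.6, 0.9)

pc <- princomp(cbind(x1, x2))

# Share of total variance carried by each component.
share <- as.vector(pc$sdev^2 / sum(pc$sdev^2))
print(share)

# Keep one column of scores in place of the two original attributes.
reduced <- pc$scores[, 1]
print(length(reduced))
```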
30
Data Preprocessing for Datamining IV
Numerosity reduction
  Regression/log-linear regression
  Histograms
  Clustering
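The idea behind numerosity reduction is to store a compact summary instead of every observation. The R sketch below shows two such summaries on invented data: the two coefficients of a fitted line, and a handful of histogram bucket counts. The bucket boundaries are arbitrary choices for illustration.

```r
x <- 1:50
y <- 3 + 2 * x + rep(c(-1, 1), 25)    # hypothetical data with a linear trend

# Regression: two parameters stand in for fifty observations.
fit <- lm(y ~ x)
params <- as.vector(coef(fit))        # intercept near 3, slope near 2
print(params)

# Histogram: keep only the count in each bucket.
counts <- table(cut(y, breaks = c(0, 25, 50, 75, 110)))
print(counts)
```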