Introduction to Exploratory Descriptive Data Analysis in S-Plus


Introduction to Exploratory Descriptive Data Analysis in S-Plus Jagdish S. Gangolly School of Business State University of New York at Albany 2/22/2019 Acc 522 Statistical Methods for Business Decisions (J Gangolly)

S-Plus in MS-Windows
  To quit the S-Plus shell from the command-line window: q() or Ctrl-D
  The S-Plus prompt is >

Simple Structures I: Arithmetic Operators
  The operators are *, /, +, and -.
  Multiplication and division are evaluated before addition and subtraction.
  Raising to a power (^ or **) takes precedence over everything else.
  Avoid ambiguity by using parentheses, e.g., (7+2)*3, since 7+2*3 = 13 and not 27.

Simple Structures II: Assignments
  x <- 3 or 3 -> x or x_3 or x=3
  It is not a good idea to use the underscore or the equals sign for assignment.
  To see the value of a variable x: x or print(x)
  To remove a variable x: rm(x)

Simple Structures III: Concatenation
  c() is used to create vectors of any length:
  > x <- c(1.5, 2, 2.5)
  > x
  1.5 2.0 2.5
  > x^2
  2.25 4.00 6.25
  c() can be used with any type of data.

Simple Structures IV: Sequence
  Sequence command: seq(lower, upper, increment)
  Some examples:
  seq(1,35,5): 1 6 11 16 21 26 31
  seq(5,15,1.5): 5.0 6.5 8.0 9.5 11.0 12.5 14.0
  seq(50,25,-5): 50 45 40 35 30 25
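The endpoint rule behind these examples (start at lower, move by increment, never pass upper) can be checked outside S-Plus. The sketch below is Python, not S-Plus; the function name seq_inclusive and the small round-off tolerance are assumptions made for illustration.

```python
import math

def seq_inclusive(lower, upper, increment=1):
    """Values from lower toward upper in steps of increment, never passing upper."""
    if increment == 0:
        raise ValueError("increment must be non-zero")
    # Number of whole steps that fit between lower and upper.
    # The small tolerance guards against floating-point round-off.
    steps = math.floor((upper - lower) / increment + 1e-10)
    if steps < 0:
        raise ValueError("increment has the wrong sign for these bounds")
    # Build each value from the start point to avoid accumulating error.
    return [lower + i * increment for i in range(steps + 1)]

print(seq_inclusive(1, 35, 5))
print(seq_inclusive(5, 15, 1.5))
print(seq_inclusive(50, 25, -5))
```

Note that the last value is 31 in the first example and 14 in the second, because one more step would overshoot the upper bound.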

Simple Structures V: Replicate
  Replicate command: generates data that follow a regular pattern.
  Some examples:
  rep(8,5): 8 8 8 8 8
  rep("8",5): "8" "8" "8" "8" "8"
  rep(c(0,"ab"),2): "0" "ab" "0" "ab"
  rep(1:4, 1:4): 1 2 2 3 3 3 4 4 4 4
  rep(1:3, rep(2,3)): 1 1 2 2 3 3
  rep(c(1,8,7), length=5): 1 8 7 1 8
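The three behaviours shown in these examples (repeat the whole vector, repeat each element by its own count, recycle up to a length) can be sketched as follows. This is a Python illustration of the semantics, not S-Plus code, and the Python function signature is an assumption. It does not reproduce the coercion of 0 to "0" in the third example, which happens in c() rather than in rep().

```python
def rep(values, times=1, length=None):
    """Sketch of rep(): repeat a vector as a whole, per element, or up to a length."""
    if not isinstance(values, (list, tuple, range)):
        values = [values]              # treat a scalar as a one-element vector
    values = list(values)
    if length is not None:
        # Recycle the vector until the requested length is reached.
        return [values[i % len(values)] for i in range(length)]
    if isinstance(times, (list, tuple, range)):
        times = list(times)
        if len(times) != len(values):
            raise ValueError("times must supply one count per element")
        # Element i is repeated times[i] times.
        return [v for v, n in zip(values, times) for _ in range(n)]
    # A single count repeats the whole vector.
    return values * times

print(rep(8, 5))
print(rep(range(1, 5), range(1, 5)))
print(rep([1, 2, 3], rep(2, 3)))
print(rep([1, 8, 7], length=5))
```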

Simple Structures VI: Expressions
  Expressions are evaluated element by element on vectors:
  > x <- seq(2,10,2)
  > y <- 1:5
  > z <- ((3*x^2+2*y)/((x+y)*(x-y)))^(0.5)
  > x
  2 4 6 8 10
  > y
  1 2 3 4 5
  > z
  2.160247 2.081666 2.054805 2.041241 2.033060
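To make the element-by-element evaluation explicit, the same formula can be written as a loop over pairs. The sketch below is Python rather than S-Plus and is only meant to confirm the values of z on the slide.

```python
import math

x = [2, 4, 6, 8, 10]      # same values as seq(2,10,2)
y = [1, 2, 3, 4, 5]       # same values as 1:5

# Apply the formula to each pair (x[i], y[i]); S-Plus does this implicitly.
z = [math.sqrt((3 * a**2 + 2 * b) / ((a + b) * (a - b))) for a, b in zip(x, y)]

print([round(v, 6) for v in z])
```

For the first pair, (3*2^2 + 2*1) / ((2+1)*(2-1)) = 14/3, whose square root is about 2.160247, matching the first element of z.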

Simple Structures VI: Logical Operators
  <   Less than
  >   Greater than
  <=  Less than or equal to
  >=  Greater than or equal to
  ==  Equal to
  !=  Not equal to

Simple Structures VII: Index Brackets
  Square brackets are used to index vectors and matrices. An index beyond the end of the vector returns NA, and a negative index drops that element.
  > x <- seq(0,20,10)
  > x[2]
  10
  > x[5]
  NA
  > x[c(1,3)]
  0 20
  > x[-1]
  10 20

Data Manipulation I: Frames & matrices I
  Matrices: two-dimensional vectors (they have row and column indices)
  Arrays: the general data structure in S-Plus
    Zero-dimensional: scalar
    One-dimensional: vector
    Two-dimensional: matrix
    Three- to eight-dimensional: arrays
  The data in a matrix must all be of the same data type (usually a numeric type).

Data Manipulation I: Frames & matrices II
  Data frames: the columns can be of different data types
  Lists: the most general data type in S-Plus

Data Manipulation I: Matrices I
  Reading data: S-Plus is very finicky about the format of input data.
  To read a table: read.table("filename")
    The first column must be row names
    The first row must be column names
    The top left cell must be empty
    Space and tab are the default column delimiters
  See the example in /db4/teach/acc522/fasb103.txt and play around with it.

Data Manipulation I: Matrices II
  read.table and as.matrix():
    x <- read.table("filename")
    as.matrix(x)
  Enter data directly: matrix(data, nrow, ncol, byrow=F)
  Example:
    x <- matrix(1:6, nrow=2, byrow=T)
    dim(x): 2 3 (a 2 x 3 matrix)
    dimnames(x): NULL

Data Manipulation I: Matrices III
  Elements of matrices are accessed by specifying the row and column indices.
  Example:
    data <- c(227,8,1.3,1534,58,1.2,2365,82,1.8)
    countries <- c("austria", "france", "germany")
    variables <- c("gdp", "pop", "inflation")
    country.data <- matrix(data, nrow=3, byrow=T)
    dimnames(country.data) <- list(countries, variables)
    country.data[1:2,2:3]: pop and inflation of austria and france

S-Plus Graphics I
  To plot two variables x and y: plot(x,y)
  Example (sine curve): plot(1:100, sin(1:100/10))

Data Manipulation
  Matrices: bind rows (rbind), bind columns (cbind)
  Arrays: rowMeans, colMeans, rowSums, colSums, rowVars, colVars, ...
  apply(data, dim, function, ...): applies a function over the given dimension of the data
  attach(framename): permits you to refer to variables without cumbersome notation. You can detach the frame when done.
  function(x) { function definition }: to define your own functions
  rm(comma-separated S-Plus objects): to remove objects
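The idea behind apply(data, dim, function) is to run one function over every row or every column. The following is a minimal Python sketch of that idea, not S-Plus code; the name apply_margin and the list-of-rows representation are assumptions, and the numbers reuse the country.data example.

```python
def apply_margin(data, margin, func):
    """Apply func over rows (margin=1) or columns (margin=2) of a list-of-rows matrix."""
    if margin == 1:
        return [func(row) for row in data]
    if margin == 2:
        # zip(*data) regroups the values column by column.
        return [func(list(col)) for col in zip(*data)]
    raise ValueError("margin must be 1 (rows) or 2 (columns)")

def mean(values):
    return sum(values) / len(values)

# Columns are gdp, pop, inflation; one row per country.
country_data = [
    [227, 8, 1.3],     # austria
    [1534, 58, 1.2],   # france
    [2365, 82, 1.8],   # germany
]

col_means = apply_margin(country_data, 2, mean)   # one mean per variable
row_sums = apply_margin(country_data, 1, sum)     # one sum per country
print(col_means)
print(row_sums)
```

Summing across a row mixes units here and is shown only to illustrate the row direction.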

S-Plus Graphics
  motif(): opens a graphics window. Each time you invoke it, a new graphics window is opened.
  dev.off(): closes the most recently opened graphics device.
  graphics.off(): closes all graphics devices.
  plot(comma-separated variables, plot character)

Trellis Graphics I
  A matrix of graphs. Example:
  > par(mfrow=c(2,2))    # 2 x 2 matrix of figures
  > x <- 1:100/100:1
  > plot(x)              # plot cell (1,1)
  > plot(x, type="l")    # plot cell (1,2), line
  > hist(x)              # plot cell (2,1), histogram
  > boxplot(x)           # plot cell (2,2), boxplot

Trellis Graphics I
  Syntax: dependent variable ~ explanatory variable | conditioning variable
  Data set
  Output:
  > trellis.device(motif)    (unix version)
  > dev.off() or > graphics.off()

Trellis Graphics II
  Example: histogram(~height | voice.part, data=singer)
    There is no dependent variable for a histogram
    height is the explanatory variable
    voice.part is the conditioning variable
    The data set is singer

Trellis Graphics III
  Layout: the layout, skip, and aspect parameters (p.147).
  Ordering of graphs: left to right, bottom to top. If as.table=T, left to right, top to bottom (p.149).

Descriptive Data Exploration
  summary: mean, median, quantiles (p.193-200)
  stem: stem-and-leaf display (p.193-200)
  stdev (p.197)
  tapply: splits data (p.198)
  by (p.199)
  mean works on vectors; other structures need to be converted to vectors before computing means.
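For readers who want to see what these summaries compute, here is a small Python sketch using only the standard library. It is not S-Plus code: group_apply is a hypothetical helper that mirrors the split-then-summarise idea of tapply, the data are invented, and quantile conventions differ between packages, so quartiles may not match S-Plus output exactly.

```python
import statistics
from collections import defaultdict

def group_apply(values, groups, func):
    """Split values by group label and apply func to each piece."""
    pieces = defaultdict(list)
    for value, label in zip(values, groups):
        pieces[label].append(value)
    return {label: func(piece) for label, piece in pieces.items()}

# Hypothetical data: heights with a grouping label for each observation.
heights = [64, 62, 66, 65, 60, 61, 70, 72, 68]
parts = ["alto", "alto", "alto", "soprano", "soprano", "soprano",
         "bass", "bass", "bass"]

print("mean:", statistics.mean(heights))
print("median:", statistics.median(heights))
print("quartiles:", statistics.quantiles(heights, n=4))
print("stdev:", statistics.stdev(heights))       # sample standard deviation
print("group means:", group_apply(heights, parts, statistics.mean))
```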

Data Preprocessing for Datamining I
  Why preprocess? Data are often:
  Incomplete: attribute values not available, equipment malfunctions, attributes not considered important
  Noisy (errors): instrument problems, human/computer errors, transmission errors
  Inconsistent: inconsistencies due to data definitions

Data Preprocessing for Datamining II
  Data Cleaning
  Missing values: ignore the tuple; fill in values manually; use a global constant ("unknown"); use the attribute mean; use the attribute group mean; use the most probable value
  Noisy data:
    Binning: partition into equi-sized bins, then smooth by bin means or bin boundaries
    Clustering
    Inspection: computer and human
    Regression
  Inconsistencies
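Two of the cleaning techniques listed here, filling missing values with the attribute mean and smoothing by binning, are easy to show in a few lines. The following is a hedged Python sketch, not part of the original slides: the function names and the price data are illustrative, missing values are represented by None, and ties in boundary smoothing go to the lower boundary by assumption.

```python
def fill_with_mean(values):
    """Replace missing entries (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("no observed values to average")
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

def equal_size_bins(values, bin_size):
    """Sort the values and cut them into consecutive bins of bin_size items."""
    ordered = sorted(values)
    return [ordered[i:i + bin_size] for i in range(0, len(ordered), bin_size)]

def smooth_by_means(bins):
    """Replace every value in a bin with that bin's mean."""
    return [[sum(b) / len(b)] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value with the nearer of its bin's minimum or maximum."""
    smoothed = []
    for b in bins:
        low, high = b[0], b[-1]          # each bin is already sorted
        smoothed.append([low if v - low <= high - v else high for v in b])
    return smoothed

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]    # illustrative values
bins = equal_size_bins(prices, 3)
print(fill_with_mean([1, None, 3]))
print(bins)
print(smooth_by_means(bins))
print(smooth_by_boundaries(bins))
```

With three values per bin, the first bin [4, 8, 15] becomes [9, 9, 9] under mean smoothing and [4, 4, 15] under boundary smoothing.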

Data Preprocessing for Datamining III
  Data Integration: combining data from different sources into a coherent whole
  Schema integration: combining data models (entity identification problems)
  Redundancy (derived values, calculated fields, use of different key attributes): use correlations to detect redundancies
  Resolution of data value conflicts (values coded in different measures)
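Using correlations to detect redundancy can be sketched as follows: compute the Pearson correlation for every pair of numeric attributes and flag pairs that are almost perfectly correlated. This is a Python illustration under stated assumptions; the 0.95 threshold, the function names, and the sample table are all invented for the example.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for a constant attribute")
    return cov / (sx * sy)

def redundant_pairs(table, threshold=0.95):
    """Return attribute pairs whose absolute correlation meets the threshold."""
    names = list(table)
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = pearson(table[a], table[b])
            if abs(r) >= threshold:
                pairs.append((a, b, round(r, 3)))
    return pairs

# Hypothetical table: price_cents is derived from price_usd.
table = {
    "price_usd": [10.0, 20.0, 30.0, 40.0],
    "price_cents": [1000.0, 2000.0, 3000.0, 4000.0],
    "units_sold": [7.0, 3.0, 9.0, 4.0],
}
print(redundant_pairs(table))
```

A high correlation only suggests redundancy; whether one attribute can actually be dropped still needs a look at how it was derived.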

Data Preprocessing for Datamining III
  Transformation
    Smoothing
    Aggregation
    Generalisation
    Normalisation
    Attribute (or feature) construction
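Normalisation is the most mechanical item in this list, so a short sketch may help. The Python code below shows two common variants, min-max scaling and z-score scaling; the slide does not say which variant is meant, so both are offered as illustrations, and the function names and income data are assumptions.

```python
import statistics

def min_max(values, new_min=0.0, new_max=1.0):
    """Rescale values linearly so the smallest maps to new_min and the largest to new_max."""
    lo, hi = min(values), max(values)
    if lo == hi:
        raise ValueError("cannot rescale a constant attribute")
    scale = (new_max - new_min) / (hi - lo)
    return [new_min + (v - lo) * scale for v in values]

def z_score(values):
    """Centre values on the mean and divide by the population standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        raise ValueError("cannot standardise a constant attribute")
    return [(v - mean) / sd for v in values]

incomes = [12000, 35000, 54000, 73600, 98000]    # hypothetical values
print(min_max(incomes))
print(z_score(incomes))
```

Min-max scaling keeps the original spacing of the values but is sensitive to outliers, while z-scores express each value as a number of standard deviations from the mean.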

Data Preprocessing for Datamining IV
  Data Reduction & Compression
  Data cube aggregation (p.117)
  Dimension reduction: minimise loss of information
    Attribute selection
    Decision tree induction
    Principal components analysis

Data Preprocessing for Datamining IV
  Numerosity reduction
    Regression/log-linear regression
    Histograms
    Clustering
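As an illustration of numerosity reduction with histograms, the raw values can be replaced by a handful of bucket ranges and counts. The sketch below is Python and uses equal-width buckets, which is one possible choice and an assumption here; the function name and the data are invented for the example.

```python
def equal_width_histogram(values, n_buckets):
    """Summarise values as (bucket range, count) pairs using equal-width buckets."""
    if n_buckets < 1:
        raise ValueError("need at least one bucket")
    lo, hi = min(values), max(values)
    if lo == hi:
        return [((lo, hi), len(values))]
    width = (hi - lo) / n_buckets
    counts = [0] * n_buckets
    for v in values:
        # The maximum value would fall just outside the last bucket, so clamp it.
        index = min(int((v - lo) / width), n_buckets - 1)
        counts[index] += 1
    return [((lo + i * width, lo + (i + 1) * width), c)
            for i, c in enumerate(counts)]

values = [1, 5, 5, 8, 10, 10, 12, 14, 15, 18, 20, 21, 25, 28, 30]  # hypothetical
histogram = equal_width_histogram(values, 3)
for (start, end), count in histogram:
    print(f"{start:.2f} to {end:.2f}: {count}")
```

Fifteen values are reduced to three counts; the detail inside each bucket is lost, which is the trade-off numerosity reduction accepts.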