Multidimensional data processing
Multivariate data consist of several variables for each observation. In practice, serious data are almost always multivariate. Some variables are usually left out to simplify collection and processing. Removing variables before the analysis leads to information loss, and information that was never collected cannot be recovered later. One of the most common tasks is clustering or classification.
Classification
- target classes are known
- properties of the target classes are usually unknown
- goal: find rules which separate the observed data into the target classes
Clustering
- target classes are unknown
- goal: find observations with common properties which may (or may not) represent classes in the real world
- difficult situation
We are trying to extract information from data
- data: measurements, observations, surveys
- data preparation
  - data adjustment: removal of invalid or incomplete observations/measurements
  - normalization? best handled when collecting
- extracting information
  - we know what we are looking for: testing of a hypothesis
  - trying to discover something new: data exploration
Data exploration
- preliminary analysis of the data
- a better understanding of its characteristics allows us to select the right tools for preprocessing or analysis
- wrong tools may yield invalid information or hide important patterns
- also known as Exploratory Data Analysis (EDA)
- a different approach: a mind shift is required
- concentrates on the larger view
- aka visual data mining
Richard Wesley Hamming, Numerical Methods for Scientists and Engineers, 1962
Steps
- maximize insight into a data set
- uncover underlying structure
- extract important variables
- detect outliers and anomalies
- test underlying assumptions
- develop minimalistic models
- determine optimal property settings
- heavily relies on graphics: numbers are very abstract
Characteristics:
- N = 11
- Mean of X = 9.0
- Mean of Y = 7.5
- Intercept = 3
- Slope = 0.5
- Residual standard deviation =
- Correlation =
Have we realized something important?
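The characteristics listed above match the well-known Anscombe's quartet, although the slides do not name the data set, so treat that identification as an assumption. The sketch below computes the same summary statistics for two of the four Anscombe sets; the name `summarize` is made up for this illustration. The point it tries to make is that the numbers alone do not distinguish data sets that look completely different when plotted.

```python
from math import sqrt

# Two of the four data sets from Anscombe's quartet (ASSUMED to be the data
# behind the slide; the source does not name it). Both share the same X.
X = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
Y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
Y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]


def summarize(xs, ys):
    """Return the summary statistics listed on the slide for one data set."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx                    # least-squares line y = intercept + slope * x
    intercept = mean_y - slope * mean_x
    correlation = sxy / sqrt(sxx * syy)  # Pearson correlation coefficient
    return {
        "n": n,
        "mean_x": mean_x,
        "mean_y": mean_y,
        "slope": slope,
        "intercept": intercept,
        "correlation": correlation,
    }


for name, ys in (("set 1", Y1), ("set 2", Y2)):
    stats = summarize(X, ys)
    print(name, {key: round(value, 2) for key, value in stats.items()})
```

Both sets produce practically the same line and the same means, yet a scatter plot of each would show one roughly linear cloud and one clear curve. That is the argument for looking at graphics before trusting summary numbers.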
Run-sequence plot
- similar to a line chart in Excel
- shows shifts in variation, shifts in location, outliers
Histogram
- shows center, spread, skew, multimodality, outliers
- very useful: know how to create it!
- nice presentations (e.g. word cloud, tag cloud)
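Since the slide asks the reader to know how to create a histogram, here is a minimal sketch of the counting step with equal-width bins. The function name `histogram` and the sample values are made up for illustration; plotting libraries do the same job with more options.

```python
def histogram(values, bins):
    """Count how many values fall into each of `bins` equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 when all values are equal
    counts = [0] * bins
    for v in values:
        # The maximum value would land one past the end, so clamp it
        # into the last bin.
        index = min(int((v - lo) / width), bins - 1)
        counts[index] += 1
    return counts


sample = [1, 2, 2, 3, 3, 3, 4, 4, 5]  # hypothetical measurements
for i, count in enumerate(histogram(sample, 4)):
    print(f"bin {i}: {'#' * count}")
```

The number of bins is a choice: too few hides multimodality, too many turns the picture into noise, so it is worth trying several values.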
Lag plot
- checks whether the data set is random or not
- random data should have no observable structure
- lag = fixed time displacement
  - can be arbitrary
  - most common is 1
- what to observe: weak autocorrelation, strong autocorrelation, sinusoidal model, outliers
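The idea of a lag can be sketched in a few lines: pair every value with the value `lag` steps later, then look at (or measure) the relationship between the two. The helper names `lag_pairs` and `lag_correlation` are made up for this illustration, and the measure used here is the plain Pearson correlation of the pairs, which is a simplification of the usual autocorrelation estimate.

```python
from math import sqrt


def lag_pairs(series, lag=1):
    """Return the points of a lag plot: (value at t, value at t + lag)."""
    if lag < 1 or lag >= len(series):
        raise ValueError("lag must be between 1 and len(series) - 1")
    return list(zip(series[:-lag], series[lag:]))


def lag_correlation(series, lag=1):
    """Pearson correlation of the lag pairs; 0.0 if either side is constant."""
    pairs = lag_pairs(series, lag)
    xs = [p[0] for p in pairs]
    ys = [p[1] for p in pairs]
    n = len(pairs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


trend = [1, 2, 3, 4, 5]               # hypothetical series with a steady trend
alternating = [1, -1, 1, -1, 1, -1]   # hypothetical series that flips every step
print(lag_pairs(trend))
print(lag_correlation(trend), lag_correlation(alternating))
```

A value near zero is what random data should give; values near +1 or -1 indicate the structure that a lag plot shows as a tight diagonal band.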
- 1 dimension: piece of cake (pie)
- 2 dimensions: still easy, Cartesian coordinate system
- 3 dimensions: still doable in a Cartesian system
- 4 and more dimensions: only Chuck Norris can do that in a Cartesian system
  - other types of visualization are required
  - some may be useful only for some types of data
- understanding the data is very important
- good visualization can help us understand the contained information
- results need to be presented to other people
- sanity check, intuition: people capture patterns which are missed by automated methods
- some options:
  - bubble chart (3-dimensional scatter plot)
  - scatter plot array
  - star plot, Radviz, Polyviz
  - parallel coordinates
Bubble chart
- also called: 3-dimensional scatter plot
- 2 data dimensions: graph X and Y
- 3rd dimension: point size
- optional 4th dimension: point color
- advantages
  - allows us to uncover clusters and variable dependencies
  - easy to understand
- disadvantages
  - different combinations need to be tried
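The mapping of dimensions to visual attributes described above can be sketched as follows. The three rows are illustrative values laid out like the Iris data set (sepal length, sepal width, petal length, class), and the helper `scale` as well as the chosen size range and color names are assumptions made for this example.

```python
def scale(values, low, high):
    """Linearly rescale values into the range [low, high]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [(low + high) / 2] * len(values)
    return [low + (v - lo) / (hi - lo) * (high - low) for v in values]


# Illustrative rows: sepal length, sepal width, petal length, class.
rows = [
    (5.1, 3.5, 1.4, "Iris-setosa"),
    (7.0, 3.2, 4.7, "Iris-versicolor"),
    (6.3, 3.3, 6.0, "Iris-virginica"),
]
class_colors = {  # hypothetical color choice
    "Iris-setosa": "tab:blue",
    "Iris-versicolor": "tab:orange",
    "Iris-virginica": "tab:green",
}

xs = [r[0] for r in rows]                          # 1st dimension: X position
ys = [r[1] for r in rows]                          # 2nd dimension: Y position
sizes = scale([r[2] for r in rows], 20.0, 200.0)   # 3rd dimension: point size
colors = [class_colors[r[3]] for r in rows]        # 4th dimension: point color

for point in zip(xs, ys, sizes, colors):
    print(point)
```

With matplotlib installed, these lists could be drawn with `plt.scatter(xs, ys, s=sizes, c=colors)`. Swapping which variable goes to which attribute is the "different combinations need to be tried" part.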
Scatter plot array
- extension of the common scatter plot
- 2-dimensional array of scatter plots
- each combination of variables is drawn (twice)
- diagonal: descriptions
- easy to create
- messy
- dependencies between more than two variables are still hidden
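To see why each combination appears twice, the layout of the array can be enumerated. This is only a sketch of the panel arrangement, not of the drawing itself; `panel_grid` is a made-up helper and the variable names are the four Iris measurements.

```python
from itertools import combinations

VARIABLES = ["sepal length", "sepal width", "petal length", "petal width"]


def panel_grid(variables):
    """Describe every cell of the scatter plot array.

    Diagonal cells carry the variable description; every other cell is a
    scatter plot with the column variable on X and the row variable on Y.
    """
    grid = []
    for row_var in variables:
        row = []
        for col_var in variables:
            if row_var == col_var:
                row.append(("label", row_var))
            else:
                row.append(("scatter", col_var, row_var))
        grid.append(row)
    return grid


grid = panel_grid(VARIABLES)
scatter_cells = [cell for row in grid for cell in row if cell[0] == "scatter"]
unique_pairs = list(combinations(VARIABLES, 2))
print(len(scatter_cells), "scatter panels for", len(unique_pairs), "variable pairs")
```

Each unordered pair shows up once above and once below the diagonal with the axes swapped, which is also why the array grows quickly and starts to look messy as variables are added.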
[Figure: scatter plot array of the variables sepal length, sepal width, petal length, petal width]
- axes radiate from a central point
- Star plot
  - values of a data point are connected to form a polygon
  - can display only a small number of points
  - order of variables may be important
- Radviz
  - values of a data point act as spring stiffness
  - values normalized into interval
  - object is placed in the equilibrium of all forces
  - order of variables becomes very important
[Figure legend: Iris-virginica, Iris-versicolor, Iris-setosa]
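The Radviz placement described above can be sketched as a weighted average: if every variable has an anchor on the unit circle and pulls the point with a spring whose stiffness is the normalized value, the equilibrium is the stiffness-weighted mean of the anchors. The interval [0, 1] for normalization and the evenly spaced anchors are assumptions of this sketch, and the function names are made up.

```python
from math import cos, sin, pi


def normalize_columns(rows):
    """Rescale every column to [0, 1] (ASSUMED interval); constant columns become 0."""
    scaled_columns = []
    for column in zip(*rows):
        lo, hi = min(column), max(column)
        span = hi - lo
        scaled_columns.append([(v - lo) / span if span else 0.0 for v in column])
    return [list(row) for row in zip(*scaled_columns)]


def radviz_point(weights):
    """Equilibrium position of one data point.

    Variable j is anchored at angle 2*pi*j/m on the unit circle and pulls
    with stiffness weights[j]; the forces cancel at the weighted mean of
    the anchors. An all-zero point is placed at the center.
    """
    m = len(weights)
    anchors = [(cos(2 * pi * j / m), sin(2 * pi * j / m)) for j in range(m)]
    total = sum(weights)
    if total == 0:
        return (0.0, 0.0)
    x = sum(w * ax for w, (ax, _) in zip(weights, anchors)) / total
    y = sum(w * ay for w, (_, ay) in zip(weights, anchors)) / total
    return (x, y)


# The same values in a different variable order land in different places.
print(radviz_point([1.0, 1.0, 0.0, 0.0]))
print(radviz_point([1.0, 0.0, 1.0, 0.0]))
```

In the first call the two active variables sit on neighboring anchors and the point is pulled between them; in the second they sit on opposite anchors and cancel out, so the point ends up near the center. This is why the order of variables becomes very important in Radviz.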
Polyviz
- similar principle to Radviz
- data points are not attracted to a single point
- data points are attracted to an axis
- circle becomes a polygon → Polyviz
- order of variables is less important
- polygon edges become very important
- candidates for classification rules
- different combinations of variables
- exact position of a point is displayed: no information loss
Parallel coordinates
- advantages
  - determine correlation between variables, both positive and negative
  - determine partial correlations: only some values of one variable are correlated with some values of another variable
  - very important
- disadvantages
  - dependent on variable ordering
  - not that useful without interactive software
  - may be hard to understand for newbies
- Exploratory data analysis:
- Have a look at the graphical techniques: a33.htm
- Orange Canvas: open-source data mining interface, similar to IBM Clementine (SPSS Modeler)
  - widget documentation:
- Sample data
- ibm.com/software/data/cognos/manyeyes/