Multidimensional data processing

Multivariate data consist of several variables for each observation. In practice, any serious data set is multivariate. Some variables are usually left out to simplify collection and processing. Removing variables before the analysis loses information, and information that was never recorded cannot be recovered. One of the most common tasks is clustering or classification.

Classification
- target classes are known
- properties of the target classes are usually unknown
- goal: find rules which separate the observed data into the target classes

Clustering
- target classes are unknown
- goal: find observations with common properties which may (or may not) represent classes in the real world
- a difficult situation

We are trying to extract information from data.
- data: measurements, observations, surveys
- data preparation
  - data adjustment: removal of invalid or incomplete observations/measurements
  - normalization? best handled when collecting
- extracting information
  - we know what we are looking for: testing a hypothesis
  - we are trying to discover something new: data exploration
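The two preparation steps named above can be sketched in a few lines. This is only an illustration under stated assumptions: the sample rows are made up, missing values are assumed to be encoded as `None`, and min-max scaling is just one possible normalization (the slides do not say which one is meant). The function names `drop_incomplete` and `min_max_normalize` are hypothetical.

```python
def drop_incomplete(rows):
    """Keep only observations in which every variable has a value."""
    return [row for row in rows if all(v is not None for v in row)]


def min_max_normalize(rows):
    """Rescale every variable (column) to the range 0..1."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [
        # A constant column has no spread, so it is mapped to 0.0.
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]


# Hypothetical measurements; None marks a missing value (an assumption).
raw = [[5.1, 3.5], [4.9, None], [6.3, 2.5], [7.0, 3.0]]
clean = drop_incomplete(raw)
normalized = min_max_normalize(clean)
```

Dropping a row removes the whole observation, including the values that were present, which is the information loss the slides warn about.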

Data exploration
- preliminary analysis of the data
- gives a better understanding of its characteristics
- allows us to select the right tools for preprocessing or analysis
- the wrong tools may yield invalid information or hide important patterns
- also known as Exploratory Data Analysis (EDA)
- a different approach: a shift in mindset is required
- concentrates on the larger view
- aka visual data mining

Richard Wesley Hamming, Numerical Methods for Scientists and Engineers, 1962

Steps:
- maximize insight into a data set
- uncover underlying structure
- extract important variables
- detect outliers and anomalies
- test underlying assumptions
- develop minimalistic models
- determine optimal property settings

EDA relies heavily on graphics, because numbers are very abstract.

Characteristics:
- N = 11
- Mean of X = 9.0
- Mean of Y = 7.5
- Intercept = 3
- Slope = 0.5
- Residual standard deviation =
- Correlation =

Have we realized something important?
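The slide does not name its data set. The figures listed match the first data set of Anscombe's quartet, so the sketch below assumes that is what is meant and recomputes the same summary characteristics from it with a least-squares line fit. The function name `summary` is hypothetical.

```python
import math

# Assumption: first data set of Anscombe's quartet (not named on the slide).
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]


def summary(xs, ys):
    """Summary characteristics of a simple least-squares line fit."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((a - mean_x) ** 2 for a in xs)
    syy = sum((b - mean_y) ** 2 for b in ys)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual sum of squares, with n - 2 degrees of freedom for a line fit.
    sse = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(xs, ys))
    return {
        "n": n,
        "mean_x": mean_x,
        "mean_y": mean_y,
        "intercept": intercept,
        "slope": slope,
        "residual_sd": math.sqrt(sse / (n - 2)),
        "correlation": sxy / math.sqrt(sxx * syy),
    }


stats = summary(x, y)
for name, value in stats.items():
    print(f"{name}: {value:.3f}")
```

The other three data sets of the quartet give nearly the same numbers while looking completely different when plotted, which is why the summary alone tells us very little.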

Run-sequence plot
- similar to a line chart in Excel
- shows shifts in variation, shifts in location, and outliers

Histogram
- shows center, spread, skew, multimodality, and outliers
- very useful: know how to create it!

Nice presentations (e.g. word cloud, tag cloud)
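Since the slide insists on knowing how to create a histogram, here is a minimal sketch of the counting step with equal-width bins. The sample values and the function name `histogram_counts` are made up for illustration; plotting libraries offer more bin-selection rules than this.

```python
def histogram_counts(values, bins):
    """Count how many values fall into each of `bins` equal-width intervals."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: everything lands in the first bin.
        return [len(values)] + [0] * (bins - 1)
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        # The maximum value belongs to the last bin, not to a new one.
        index = min(int((v - lo) / width), bins - 1)
        counts[index] += 1
    return counts


# Hypothetical sample.
sample = [1, 2, 2, 3, 3, 3, 4, 4, 5]
counts = histogram_counts(sample, bins=4)

# A text rendering is enough to see center, spread, and skew.
for i, c in enumerate(counts):
    print(f"bin {i}: {'#' * c}")
```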

Lag plot
- checks whether the data set is random or not
- random data should have no observable structure
- lag = fixed time displacement
  - can be arbitrary
  - the most common is 1
- things to observe: weak autocorrelation, strong autocorrelation, sinusoidal model, outliers
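A lag plot draws each value against the value a fixed number of steps earlier, and the autocorrelation coefficient condenses the same picture into one number. The sketch below is an illustration only: the three series are synthetic, and `lag_pairs` and `autocorrelation` are hypothetical helper names.

```python
import math
import random


def lag_pairs(series, lag=1):
    """Points of a lag plot: (value at time i, value at time i + lag)."""
    return list(zip(series[:-lag], series[lag:]))


def autocorrelation(series, lag=1):
    """Sample autocorrelation coefficient at the given lag."""
    n = len(series)
    mean = sum(series) / n
    variance = sum((v - mean) ** 2 for v in series)
    covariance = sum(
        (series[i] - mean) * (series[i + lag] - mean) for i in range(n - lag)
    )
    return covariance / variance


rng = random.Random(0)
random_series = [rng.random() for _ in range(1000)]             # no structure
sinusoid = [math.sin(2 * math.pi * i / 20) for i in range(200)]  # periodic
alternating = [1.0, -1.0] * 50                                   # flips each step

r_random = autocorrelation(random_series)
r_sinusoid = autocorrelation(sinusoid)
r_alternating = autocorrelation(alternating)
```

For the random series the coefficient stays near zero and the lag plot is a shapeless cloud. The sinusoid gives a value close to 1 and its lag plot traces an ellipse, while the alternating series gives a value close to -1.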

- 1 dimension: piece of cake (pie)
- 2 dimensions: still easy, Cartesian coordinate system
- 3 dimensions: still doable in a Cartesian system
- 4 and more dimensions: only Chuck Norris can do that in a Cartesian system
  - other types of visualization are required
  - some may be useful only for some types of data

- understanding the data is very important
- good visualization can help us understand the contained information
- results need to be presented to other people
- sanity check, intuition: people capture patterns which are missed by automated methods
- some options:
  - bubble chart (3-dimensional scatter plot)
  - scatter plot array
  - star plot, Radviz, Polyviz
  - parallel coordinates

Bubble chart
- also called a 3-dimensional scatter plot
- 2 data dimensions: graph X and Y
- 3rd dimension: point size
- optional 4th dimension: point color
- advantages
  - helps to uncover clusters and variable dependencies
  - easy to understand
- disadvantages
  - different combinations of variables need to be tried

Scatter plot array
- extension of the common scatter plot
- 2-dimensional array of scatter plots
- each combination of variables is drawn (twice)
- descriptions are on the diagonal
- easy to create
- messy
- dependencies between more than two variables are still hidden

[Figure: Iris data set; variables: sepal length, sepal width, petal length, petal width]

Axes radiate from a central point.

Star plot
- values of a data point are connected to form a polygon
- can display only a small number of points
- order of variables may be important

Radviz
- values of a data point act as spring stiffness
- values are normalized into an interval
- the object is placed in the equilibrium of all forces
- order of variables becomes very important
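The Radviz placement can be written out directly: if each variable pulls the point toward its anchor with a spring whose stiffness is the variable's value, the equilibrium is the stiffness-weighted average of the anchor positions. The sketch below assumes the values are already normalized to the range 0..1 and that the anchors are evenly spaced on the unit circle; the function name `radviz_point` is hypothetical.

```python
import math


def radviz_point(values):
    """Equilibrium position of one observation inside the unit circle.

    `values` are assumed to be normalized to 0..1, one per variable,
    listed in the order in which the anchors go around the circle.
    """
    m = len(values)
    # One anchor per variable, evenly spaced on the unit circle.
    anchors = [
        (math.cos(2 * math.pi * j / m), math.sin(2 * math.pi * j / m))
        for j in range(m)
    ]
    total = sum(values)
    if total == 0:
        # No spring pulls at all: leave the point in the center.
        return (0.0, 0.0)
    # Spring forces cancel where sum(v_j * (anchor_j - p)) = 0,
    # i.e. at the weighted average of the anchors.
    x = sum(v * ax for v, (ax, _) in zip(values, anchors)) / total
    y = sum(v * ay for v, (_, ay) in zip(values, anchors)) / total
    return (x, y)


# The same two large values, placed on adjacent versus opposite anchors.
adjacent = radviz_point([1.0, 1.0, 0.0, 0.0])
opposite = radviz_point([1.0, 0.0, 1.0, 0.0])
```

With adjacent anchors the point is pulled toward one side of the circle, while with opposite anchors the same values cancel out and the point lands in the center. This is why the order of variables matters so much.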

[Figure legend: Iris-setosa, Iris-versicolor, Iris-virginica]

Polyviz
- similar principle to Radviz
- data points are not attracted to a single point
- data points are attracted to an axis
- the circle becomes a polygon → Polyviz
- order of variables is less important
- polygon edges become very important
  - candidates for classification rules
  - different combinations of variables
  - exact position of the point is displayed, so there is no information loss

Parallel coordinates
- advantages
  - determine correlation between variables, both positive and negative
  - determine partial correlations: only some values of one variable are correlated with some values of another variable
  - very important
- disadvantages
  - dependent on variable ordering
  - not that useful without interactive software
  - may be hard to understand for newcomers

Exploratory data analysis: have a look at the graphical techniques: a33.htm

Orange Canvas: open-source data mining interface, similar to IBM Clementine (SPSS Modeler)
- widget documentation:

Sample data
ibm.com/software/data/cognos/manyeyes/