Visualizing and Exploring Data Summary statistics for data (mean, median, mode, quartile, variance, skewness) Distribution of values for single variables.

Visualizing and Exploring Data
Summary statistics for data (mean, median, mode, quartile, variance, skewness)
Distribution of values for single variables (histogram, smoothing, box and whisker plot)
Relationships between pairs of variables (scatterplot, contour plot)
Relationships between multiple variables (scatterplot matrix, trellis plotting, star icons, parallel coordinates)
Projection pursuit methods (principal component analysis)
Parallel coordinates plots

Summarizing data
Mean: $\mu = \frac{1}{n} \sum_i x_i$
Median: the value that has an equal number of data points above and below it.
Quartile: the first quartile is the value that is greater than a quarter of the data points.
Variance: $\sigma^2 = \frac{1}{n} \sum_i (x_i - \mu)^2$
Skewness: measures whether or not a distribution has a single long tail, $\frac{\sum_i (x_i - \mu)^3}{\left( \sum_i (x_i - \mu)^2 \right)^{3/2}}$
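
The summary statistics on this slide can be computed in a few lines. The following is a minimal sketch using only the Python standard library; the function name `summarize` and the sample data are illustrative assumptions, and the quartiles use the default interpolation of `statistics.quantiles`, which is only one of several common conventions.

```python
import statistics


def summarize(xs):
    """Return the slide's summary statistics for a list of numbers."""
    n = len(xs)
    mean = sum(xs) / n
    median = statistics.median(xs)
    # quantiles(n=4) returns the three quartile cut points (Q1, Q2, Q3).
    q1, _, q3 = statistics.quantiles(xs, n=4)
    # Sums of squared and cubed deviations from the mean.
    m2 = sum((x - mean) ** 2 for x in xs)
    m3 = sum((x - mean) ** 3 for x in xs)
    variance = m2 / n          # 1/n form, as written on the slide
    skewness = m3 / m2 ** 1.5  # slide's form; positive for a long right tail
    return {
        "mean": mean,
        "median": median,
        "q1": q1,
        "q3": q3,
        "variance": variance,
        "skewness": skewness,
    }


# Hypothetical sample with one large value, giving a long right tail.
sample = [1, 2, 2, 3, 4, 5, 18]
stats = summarize(sample)
for name, value in stats.items():
    print(f"{name}: {value:.3f}")
```

For this sample the single large value pulls the mean well above the median, which is the kind of asymmetry the skewness measure is meant to flag.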

Histogram (Microsoft Excel)

Smoothing estimates
The contribution of a data point $x_i$ to the estimate at some point $x^*$ depends on $K((x^* - x_i)/h)$.
$K(\cdot)$: kernel function, e.g. the normal (Gaussian) density, scaled so that the resulting estimate integrates to 1
$h$: bandwidth
The estimated density at the point $x^*$ is $\hat{f}(x^*) = \frac{1}{n} \sum_i K((x^* - x_i)/h)$
Example (Xgobi koule.txt -var2)
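
A kernel estimate of this kind can be sketched as follows. This is an illustration under stated assumptions, not the method used by Xgobi: it uses a standard Gaussian kernel and writes the scaling factor 1/h explicitly, so that the estimate integrates to 1. The function names and the data are hypothetical.

```python
import math


def gaussian_kernel(t):
    """Standard normal density, used here as the kernel K."""
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)


def kernel_density(x_star, data, h):
    """Kernel density estimate at x_star with bandwidth h.

    Assumption: the 1/h scaling is written out explicitly rather than
    folded into the kernel.
    """
    n = len(data)
    return sum(gaussian_kernel((x_star - x) / h) for x in data) / (n * h)


# Hypothetical data: a cluster near 2 and one isolated point at 5.
data = [1.0, 1.2, 1.9, 2.1, 2.2, 5.0]

# A small bandwidth follows individual points; a large one smooths them out.
for h in (0.2, 1.0):
    values = [round(kernel_density(x, data, h), 3) for x in (1.0, 2.0, 3.5)]
    print(f"h={h}: {values}")
```

The bandwidth h plays the role of the bin width in a histogram: too small and the estimate is spiky, too large and real structure is smoothed away.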

Box and whisker plots
The lower and upper boundaries of each box represent the first and third quartiles.
The horizontal line within each box represents the median.
The whiskers extend 1.5 times the interquartile range from the ends of each box.
All data points outside the whiskers are plotted individually.
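
The quantities a box and whisker plot draws can be computed directly. The sketch below is illustrative only: `box_plot_stats` is a hypothetical helper, the quartiles depend on the interpolation convention of `statistics.quantiles`, and it assumes the common variant in which each whisker stops at the most extreme data point within 1.5 times the interquartile range.

```python
import statistics


def box_plot_stats(xs):
    """Compute the box, whiskers, and individually plotted points."""
    q1, median, q3 = statistics.quantiles(xs, n=4)
    iqr = q3 - q1
    # Fences: the furthest a whisker may reach from either end of the box.
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    inside = [x for x in xs if low_fence <= x <= high_fence]
    outliers = [x for x in xs if x < low_fence or x > high_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": min(inside),    # assumed: whisker ends on a data point
        "whisker_high": max(inside),
        "outliers": outliers,          # plotted individually
    }


# Hypothetical sample with one point far above the rest.
sample = [1, 2, 2, 3, 4, 5, 18]
box = box_plot_stats(sample)
print(box)
```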

Scatterplot
Two variables at a time
One point for each data record
Example (Xgobi koule.txt)
Scatterplots can reveal anomalies and shortcomings in data. Example: changes in the measured weight of children in summer and winter periods.
Problems:
1. With many points, the plot may turn into a solid black rectangle.
2. Overprinting can conceal the strength of correlation. A solution is the contour plot, with contour lines like those on a topographic map.
3. Only two dimensions are shown.

More than two variables
Scatterplot matrix: multivariate data are projected into two-dimensional plots, one per pair of variables (all other variables are ignored). Example: Crystal Vision pollen.data
Trellis plot: a series of scatterplots conditioned on levels of one or more other variables
Brushing: highlights corresponding points across plots
Star icons: different directions from the origin correspond to different variables; the lengths correspond to the magnitudes.

Interactive graphics
Rotating the direction of projection in search of structure
Random rotations
Manual rotations
Example (Xgobi koule)
Projection pursuit methods
Let the computer search for interesting directions using a criterion
Example (Xgobi krychle)
A special case: principal component analysis

Principal component analysis
Assumption: the data lie in a two-dimensional linear subspace spanned by linear combinations of the measured variables.
Criterion for an interesting direction: the plane for which the sum of squared distances between the data points and their projections onto the plane is minimized.
Solution in polynomial time: the plane is spanned by the linear combination of variables that has maximum sample variance and the linear combination that has maximum variance subject to being uncorrelated with the first.

Principal component analysis
$X$: $n \times p$ data matrix; rows are data cases
$a$: $p \times 1$ column vector of projection weights
$a^T x$: projection of a vector $x$
$Xa$: projected values of all data vectors
$\sigma_a^2 = (Xa)^T (Xa) = a^T V a$: variance along $a$, where $V = X^T X$ for mean-centered $X$
Maximize the variance under the normalization constraint $a^T a = 1$, i.e. $\max_a \; a^T V a - \lambda (a^T a - 1)$
This reduces to the eigenvalue form $(V - \lambda I) a = 0$.
The first principal component $a$ is the eigenvector associated with the largest eigenvalue. The second principal component is the eigenvector associated with the second largest eigenvalue, etc.
Scree plot: the amount of variance explained by each consecutive eigenvalue
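
The eigenvalue formulation can be tried out on a small data set. The sketch below is one possible illustration, not a production implementation: it centers the data, forms $V = X^T X$, and finds the leading eigenvector by power iteration, then removes that component (deflation) to obtain the second one. Power iteration is a choice made here to keep the example dependency-free; the helper names and the data are hypothetical.

```python
import math


def center(rows):
    """Subtract the column means so that each variable has zero mean."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    return [[r[j] - means[j] for j in range(p)] for r in rows]


def scatter_matrix(X):
    """V = X^T X for a centered data matrix X (list of rows)."""
    p = len(X[0])
    return [[sum(r[i] * r[j] for r in X) for j in range(p)] for i in range(p)]


def mat_vec(V, a):
    return [sum(V[i][j] * a[j] for j in range(len(a))) for i in range(len(V))]


def leading_eigenpair(V, iters=500):
    """Power iteration: returns (largest eigenvalue, unit eigenvector)."""
    p = len(V)
    a = [1.0 / math.sqrt(p)] * p
    for _ in range(iters):
        b = mat_vec(V, a)
        norm = math.sqrt(sum(v * v for v in b))
        a = [v / norm for v in b]       # enforce the constraint a^T a = 1
    lam = sum(x * y for x, y in zip(a, mat_vec(V, a)))  # a^T V a
    return lam, a


def deflate(V, lam, a):
    """Remove the component (lam, a) so the next eigenpair becomes leading."""
    p = len(V)
    return [[V[i][j] - lam * a[i] * a[j] for j in range(p)] for i in range(p)]


# Hypothetical two-variable data with a strong positive correlation.
rows = [[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0],
        [2.3, 2.7], [2.0, 1.6], [1.0, 1.1], [1.5, 1.6], [1.1, 0.9]]

X = center(rows)
V = scatter_matrix(X)
lam1, a1 = leading_eigenpair(V)
lam2, a2 = leading_eigenpair(deflate(V, lam1, a1))

print("first component weights:", [round(v, 3) for v in a1])
print("second component weights:", [round(v, 3) for v in a2])
# Values for a scree plot: share of variance explained by each eigenvalue.
total = lam1 + lam2
print("explained variance:", round(lam1 / total, 3), round(lam2 / total, 3))
```

The two weight vectors come out orthogonal, and the eigenvalues sum to the trace of $V$, which is what lets a scree plot report each eigenvalue as a share of the total variance.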

Example (Huba et al. 1981)
Data on 1684 students in LA showing consumption of 13 legal and illegal psychoactive substances.
The weights of the first principal component were: cigarettes 0.278, beer 0.286, wine 0.265, spirits 0.318, cocaine 0.208, tranquilizers 0.293, medications 0.176, heroin 0.202, marijuana 0.339, hashish 0.329, inhalants 0.276, hallucinogens 0.248, amphetamines … This component measures how often students use psychoactive substances, regardless of which substance they use.
The weights of the second principal component were: 0.280, 0.396, 0.392, 0.325, …, 0.163, … It gives positive weights to legal substances and negative weights to illegal ones. Once overall substance use is controlled for, the major difference lies in legal versus illegal use.

General Multidimensional Scaling
A crumpled piece of paper is two-dimensional, but principal component analysis would fail on it.
The goal of scaling methods: preserving distances in a lower-dimensional space
Methods differ in:
the distances that are to be preserved, $\delta_{jk}$
the distances they map to, $d_{jk}$
how the calculations are performed

General Multidimensional Scaling
The most common distance measure is the Euclidean metric.
A common score function is the stress: $\left( \frac{\sum_j \sum_k (\delta_{jk}^2 - d_{jk}^2)^2}{\sum_j \sum_k d_{jk}^2} \right)^{1/2}$
The methods may start from distances between data vectors (metric scaling), or from a rank order or a monotonic relationship (non-metric scaling).
The methods can be iterative: 1) regression of distances and 2) minimization of the stress
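
To make the score function concrete, the sketch below evaluates the stress of a candidate low-dimensional configuration, following the formula as it is written on the slide (with squared distances inside the difference). The function names and the data are hypothetical. Summing over pairs with j < k rather than over all j and k leaves the ratio unchanged, because the numerator and the denominator are both halved.

```python
import math
from itertools import combinations


def euclidean(u, v):
    """Euclidean distance between two points given as sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def stress(original, mapped):
    """Stress of `mapped` as a low-dimensional image of `original`.

    delta: distance to be preserved (original space)
    d:     distance it maps to (low-dimensional space)
    """
    num = 0.0
    den = 0.0
    for j, k in combinations(range(len(original)), 2):
        delta = euclidean(original[j], original[k])
        d = euclidean(mapped[j], mapped[k])
        num += (delta ** 2 - d ** 2) ** 2
        den += d ** 2
    return math.sqrt(num / den)


# Hypothetical 3-D points and two candidate 2-D configurations.
points = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (1, 1, 1)]
drop_z = [(x, y) for x, y, _ in points]   # loses the third coordinate
drop_x = [(y, z) for _, y, z in points]   # collapses more of the structure

print("stress, dropping z:", round(stress(points, drop_z), 3))
print("stress, dropping x:", round(stress(points, drop_x), 3))
```

A configuration that reproduces every distance exactly has zero stress; an iterative method would adjust the mapped points to drive this value down.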

Parallel coordinates plots
Variables are drawn as parallel axes.
Each data case is a piecewise linear plot connecting the values of the case.
Wegman, E. J. (1990), Hyperdimensional data analysis using parallel coordinates, J. American Statistical Association, 85.
Example (Crystal Vision krychle.data)