Using Entropy-Related Measures in Categorical Data Visualization  Jamal Alsakran The University of Jordan  Xiaoke Huang, Ye Zhao Kent State University.

Using Entropy-Related Measures in Categorical Data Visualization  Jamal Alsakran The University of Jordan  Xiaoke Huang, Ye Zhao Kent State University  Jing Yang UNC Charlotte  Karl Fast Kent State University

Categorical Datasets: Generated in a large variety of applications – e.g., health/social studies, bank transactions, online shopping records, and taxonomy classifications. They contain a series of categorical dimensions (variables).

Categorical Discreteness: Values of a dimension comprise a set of discrete categories. Mushroom dataset – 8,124 records and 23 categorical dimensions. Sample rows (first ten dimensions):

classes  cap-shape  cap-surface  cap-color  bruises  odor  gill-attachment  gill-spacing  gill-size  gill-color
c        c          c            c          c        c     c                c             c          c
p        x          s            n          t        p     f                c             n          k
e        x          s            y          t        a     f                c             b          k
e        b          s            w          t        l     f                c             b          n
p        x          y            w          t        p     f                c             n          n

Challenges: Multidimensional visualization methods are often undermined when applied directly to categorical datasets – the limited number of categories creates overlapping elements and visual clutter – the lack of an inherent order (in contrast to numeric variables) confounds the visualization design

Categorical data visualization: Sieve diagram and Mosaic display; Contingency Wheel; Parallel Sets; Mapping to numbers

Our Work: Investigate the use of entropy-related measures in visualizing multidimensional categorical data – Show how entropy-related measures can help users understand and navigate categorical data – Employ these measures in managing and ordering dimensions within the parallel sets visualization – Conduct user studies on real-world data

Entropy and Related Measures: Consider a categorical variable as a discrete random variable X with a probability distribution. Entropy – measures the diversity of one dimension. Joint entropy – measures the diversity of two variables taken together. Mutual information – measures the mutual dependence of the variables.
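For reference, the standard textbook definitions of these three measures are given below for discrete variables X and Y with marginal distributions p(x), p(y) and joint distribution p(x, y). The base-2 logarithm (results in bits) is an assumption, since the slide text does not state a base.

```latex
\begin{align}
H(X)   &= -\sum_{x} p(x)\,\log_2 p(x) \\
H(X,Y) &= -\sum_{x}\sum_{y} p(x,y)\,\log_2 p(x,y) \\
I(X;Y) &= \sum_{x}\sum_{y} p(x,y)\,\log_2 \frac{p(x,y)}{p(x)\,p(y)}
        = H(X) + H(Y) - H(X,Y)
\end{align}
```

Terms with zero probability are taken to contribute zero to each sum.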

Use of Entropy: Chen and Jänicke proposed an information-theoretic framework for visualization. Pargnostics: pixel-based entropy used for order optimization of coordinates. We use entropy and mutual information in categorical data visualization.

Visualize Data Facts: Mushroom dataset – Size encodes the number of categories – Color encodes entropy
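The two data facts on this slide can be computed per dimension with a few lines of Python. This is a minimal sketch, not the authors' implementation: the function names `entropy` and `dimension_facts` are hypothetical, base-2 logarithms are assumed, and the four records are only the class, cap-shape and bruises values of the sample rows shown earlier, not the full dataset.

```python
from collections import Counter
from math import log2

def entropy(values):
    """Shannon entropy in bits of a sequence of category labels."""
    n = len(values)
    # p * log2(1/p) summed over categories; equals -sum(p * log2(p))
    return sum((c / n) * log2(n / c) for c in Counter(values).values())

def dimension_facts(records, names):
    """Map each dimension name to (number of categories, entropy)."""
    columns = list(zip(*records))
    return {name: (len(set(col)), entropy(col)) for name, col in zip(names, columns)}

# Illustrative records only: (class, cap-shape, bruises) of four sample rows.
records = [
    ("p", "x", "t"),
    ("e", "x", "t"),
    ("e", "b", "t"),
    ("p", "x", "t"),
]
names = ["class", "cap-shape", "bruises"]

facts = dimension_facts(records, names)
for name, (n_categories, h) in facts.items():
    print(f"{name}: {n_categories} categories, entropy = {h:.3f} bits")
```

A dimension with a single category, such as bruises in this tiny sample, has zero entropy, while an evenly split two-category dimension reaches one bit.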

Navigation Guide (Scatter Plot Matrix): Joint entropy matrix – High joint entropy indicates diversely distributed data records in a scatter plot – Low joint entropy reveals many overlaps

Navigation Guide (Scatter Plot Matrix): Mutual information matrix – Large mutual information indicates high dependency between two dimensions – Small mutual information reveals less dependency
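Both navigation matrices can be built by evaluating a pairwise measure on every pair of dimensions. The sketch below assumes base-2 logarithms and uses three made-up columns chosen so that one pair is fully dependent and another is independent; the helper names are hypothetical.

```python
from collections import Counter
from math import log2

def entropy(values):
    """Shannon entropy in bits of a sequence of hashable labels."""
    n = len(values)
    return sum((c / n) * log2(n / c) for c in Counter(values).values())

def joint_entropy(xs, ys):
    """H(X,Y): entropy of the paired values."""
    return entropy(list(zip(xs, ys)))

def mutual_information(xs, ys):
    """I(X;Y) = H(X) + H(Y) - H(X,Y)."""
    return entropy(xs) + entropy(ys) - joint_entropy(xs, ys)

def pairwise_matrix(columns, measure):
    """Square matrix with measure(columns[i], columns[j]) at row i, column j."""
    k = len(columns)
    return [[measure(columns[i], columns[j]) for j in range(k)] for i in range(k)]

# Made-up columns for illustration only.
columns = [
    list("aabb"),  # dimension A
    list("xxyy"),  # dimension B: fully determined by A
    list("uvuv"),  # dimension C: independent of A
]
je = pairwise_matrix(columns, joint_entropy)
mi = pairwise_matrix(columns, mutual_information)
```

Here the A/B cell has low joint entropy and high mutual information (heavy overlap, strong dependency), while the A/C cell has high joint entropy and zero mutual information.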

Dimension Management on Parallel Sets: Use entropy-related measures to help users manage dimension spacing, ordering, and filtering. Ribbon colors are defined by mushroom classes – Green: edible; Blue: poisonous

Filtering and Spacing: Remove low-diversity dimensions by setting an entropy threshold. Arrange the space between neighboring coordinates according to their joint entropy.
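A possible reading of these two operations is sketched below. The slide does not say how joint entropy maps to spacing, so making each gap proportional to the joint entropy of its two neighbors is an assumption, as are the function names, the threshold value and the toy data.

```python
from collections import Counter
from math import log2

def entropy(values):
    """Shannon entropy in bits of a sequence of hashable labels."""
    n = len(values)
    return sum((c / n) * log2(n / c) for c in Counter(values).values())

def filter_by_entropy(dimensions, threshold):
    """Keep only the dimensions whose entropy reaches the threshold."""
    return {name: col for name, col in dimensions.items() if entropy(col) >= threshold}

def gaps_from_joint_entropy(dimensions, total_width):
    """Split total_width among neighboring axes in proportion to joint entropy.

    Assumption: proportional allocation; the slides do not give the mapping.
    """
    names = list(dimensions)
    weights = [
        entropy(list(zip(dimensions[a], dimensions[b])))
        for a, b in zip(names, names[1:])
    ]
    total = sum(weights)
    if total == 0:
        # Fewer than two axes, or no diversity at all: fall back to even gaps.
        return [total_width / len(weights)] * len(weights) if weights else []
    return [total_width * w / total for w in weights]

# Made-up dimensions for illustration only.
dimensions = {
    "A": list("aabb"),
    "B": list("xxyy"),
    "C": list("uvuv"),
    "D": list("kkkk"),  # a single category, so entropy is 0
}
kept = filter_by_entropy(dimensions, threshold=0.5)
gaps = gaps_from_joint_entropy(kept, total_width=300.0)
```

With this data, D is filtered out and the B/C gap is twice as wide as the A/B gap, because the B/C pair spreads its records over twice as many equally likely combinations.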

Sorting Categories over Coordinates: Unlike numerical dimensions, categorical variables have no inherent order. Typical choices – reading order – alphabetical order. We use the pairwise joint probability distribution to find an optimal sequence – Reduce ribbon intersections
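The slides do not spell out how the optimal sequence is computed from the joint distribution. As an illustration of the idea only, the sketch below swaps in a simple barycenter heuristic: with the left axis fixed, each category on the right axis is placed at the count-weighted average position of the left categories it co-occurs with. Weighting a crossing by the product of the two ribbons' record counts is also an assumption, and all names and data are hypothetical.

```python
from collections import Counter

def weighted_crossings(joint, left_order, right_order):
    """Sum over crossing ribbon pairs of the product of their record counts."""
    lpos = {c: i for i, c in enumerate(left_order)}
    rpos = {c: i for i, c in enumerate(right_order)}
    ribbons = list(joint.items())  # ((left_cat, right_cat), count)
    total = 0
    for i, ((a1, b1), n1) in enumerate(ribbons):
        for (a2, b2), n2 in ribbons[i + 1:]:
            # Two ribbons cross when their endpoints are ordered oppositely.
            if (lpos[a1] - lpos[a2]) * (rpos[b1] - rpos[b2]) < 0:
                total += n1 * n2
    return total

def barycenter_order(joint, left_order, right_categories):
    """Order right-axis categories by the weighted mean position of their left partners."""
    lpos = {c: i for i, c in enumerate(left_order)}

    def barycenter(b):
        pairs = [(a, n) for (a, bb), n in joint.items() if bb == b]
        return sum(lpos[a] * n for a, n in pairs) / sum(n for _, n in pairs)

    return sorted(right_categories, key=barycenter)

# Made-up records for two neighboring axes.
left = list("aaaabbbb")
right = list("yyyxxxxy")
joint = Counter(zip(left, right))  # joint counts, proportional to p(x, y)

left_order = sorted(set(left))
alphabetical = sorted(set(right))
optimized = barycenter_order(joint, left_order, alphabetical)

before = weighted_crossings(joint, left_order, alphabetical)
after = weighted_crossings(joint, left_order, optimized)
```

In this example the alphabetical order makes the two thick ribbons cross, while the reordered axis leaves only the two thin ribbons crossing.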

Sorting Categories over Coordinates: Using the reading order of the coordinates and of the categories over them; after sorting the categories of neighboring coordinates

Optimal Ordering of Multiple Coordinates: For parallel coordinates, many existing approaches reduce line crossings between neighboring coordinates. Using line crossings as the cost function between every pair of dimensions, global cost minimization is achieved by a graph-theory-based method [32]. However, reducing crossings does not necessarily lead to more effective insight discovery – ribbon crossings depend on the sequence of categories over the axes (reading order? alphabetical order?)

Our Method: We use mutual information as the cost function – Benefit: the cost does not depend on the sequence of categories over the axes. Globally maximize the sum of mutual information over a series of dimensions. The optimal ordering is obtained by solving a Hamiltonian path problem, as in the Traveling Salesman Problem.
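The objective on this slide can be illustrated with an exhaustive search over axis orders. This is a sketch under stated assumptions, not the solver used in the work: brute force over all permutations stands in for a proper Hamiltonian path or TSP method and is only practical for a handful of dimensions. The function names and the four toy dimensions are hypothetical.

```python
from collections import Counter
from itertools import permutations
from math import log2

def entropy(values):
    """Shannon entropy in bits of a sequence of hashable labels."""
    n = len(values)
    return sum((c / n) * log2(n / c) for c in Counter(values).values())

def mutual_information(xs, ys):
    """I(X;Y) = H(X) + H(Y) - H(X,Y)."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))

def path_score(order, dimensions):
    """Sum of mutual information between neighboring axes in the given order."""
    return sum(
        mutual_information(dimensions[a], dimensions[b])
        for a, b in zip(order, order[1:])
    )

def best_order(dimensions):
    """Exhaustive search for the axis order with the highest path score."""
    return max(permutations(dimensions), key=lambda order: path_score(order, dimensions))

# Made-up data: A and B are mutually dependent, as are C and D.
dimensions = {
    "A": list("aaaabbbb"),
    "C": list("uvuvuvuv"),
    "B": list("xxxxyyyy"),
    "D": list("mnmnmnmn"),
}
reading = tuple(dimensions)         # order in which the dimensions are listed
optimized = best_order(dimensions)  # places dependent dimensions side by side
```

Because mutual information is computed from the joint distribution alone, relabeling or reordering the categories on any axis leaves the score, and therefore the chosen order, unchanged.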

C2: Optimized by ribbon crossings with alphabetical category sequence. C3: Optimized by mutual information with alphabetical category sequence. C4: Optimized by mutual information with optimized category sequence.

User Studies: Assess user performance on insight discovery with different ordering approaches. Design specific tasks for users to complete in a limited time period. Apply statistical analysis to the results.

Mushroom Data: 11 participants received training and 10 minutes of practice before the test. Each participant was given 90 seconds to find as many mushroom characteristics as possible that are (T1) all-edible; (T2) all-poisonous; (T3) mostly-edible; (T4) mostly-poisonous. Each participant was given a score by comparison with the ground truth.

Results: Average percentage of user findings over ground truth on each task

Results: Total performance of user findings using different visualizations

Results: Total error rate of user findings using different visualizations

Statistical Test: We applied the Friedman test, a non-parametric analysis of variance by ranks. Statistically significant differences were found – between C1 and C4 (p-value = 0.011) – between C2 and C4 (p-value = 0.035) – between C3 and C4 (p-value = 0.007)

Congressional Voting Records: Green: Democrat; Red: Republican. C1: using the reading order. C2: using the optimized order.

Congressional Voting Records: The leftmost dimension is the vote on education-spending – Green: nay; Red: yea. C3: using the reading order. C4: using the optimized order.

User Study of Voting Dataset: 35 participants were given 2 minutes to complete the tasks. Using C1 and C2, for each bill – (T1) which party voted more for yea? – (T2) which party voted more for nay? Using C3 and C4, for each bill – (T3) which group of congressmen voted more for yea? – (T4) which group of congressmen voted more for nay?

Results: We graded each participant – 1 point if the answer was correct – -1 point if the answer was incorrect – 0 points if they said it was hard to identify. The average score using C1 was 11.5; using C2, 20.1; using C3, 13.2; using C4, 18.0.
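The grading rule can be written down directly. In the sketch below the bill names, the answers and the use of `None` for "hard to identify" are all hypothetical; only the +1 / -1 / 0 rule comes from the slide.

```python
def grade(answers, ground_truth):
    """Score one participant: +1 per correct answer, -1 per incorrect, 0 if skipped."""
    score = 0
    for question, answer in answers.items():
        if answer is None:  # participant said it was hard to identify
            continue
        score += 1 if answer == ground_truth[question] else -1
    return score

# Hypothetical ground truth and one participant's answers, for illustration only.
ground_truth = {
    "bill-1": "democrat",
    "bill-2": "republican",
    "bill-3": "democrat",
    "bill-4": "republican",
}
answers = {
    "bill-1": "democrat",    # correct
    "bill-2": "republican",  # correct
    "bill-3": "republican",  # incorrect
    "bill-4": None,          # hard to identify
}
total = grade(answers, ground_truth)
```

Under this rule a wrong answer costs as much as a right answer earns, so guessing is penalized relative to reporting that a pattern is hard to identify.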

Statistical Test: One-way analysis of variance (ANOVA) was used to compare the effect of using different visualizations. One test was performed for C1 and C2 – p-value = ; another test was performed for C3 and C4 – p-value = 0.02

Conclusion: Utilize measures from information theory to enhance the visualization of high-dimensional categorical data. Help users browse data facts among dimensions, determine starting points for data analysis, and test-and-tune parameters for visual reasoning.

Thanks! This work is partially supported by US NSF IIS , IIS , and Google Faculty Research Award