1
Mapping Nominal Values to Numbers for Effective Visualization
Geraldine Rosario, Elke Rundensteiner, David Brown, Matthew Ward
Computer Science Department, Worcester Polytechnic Institute
Presented by Matthew O. Ward at InfoVis 2003, October 20, 2003.
Supported by NSF grant IIS-0119276.
2
Visualizing Nominal Variables
- Most data visualization tools are designed for numeric variables.
- What if a variable is nominal?
- Most tools that are designed for nominal variables cannot handle a large number of values.
3
Targeted Result
4
Goals
Main goal: to display data sets containing nominal variables in visual exploration tools.
Sub-goals, for each nominal variable:
- To provide order and spacing to the values
- To group similar values together
Desired features of the solution:
- Data-driven
- Multivariate
- Scalable
- Distance-preserving
- Association-preserving
5
Proposed Approach
Pre-process nominal variables using a Distance-Quantification-Classing (DQC) approach:
- Distance: transform the data so that the distance between two nominal values can be calculated (based on the variable's relationship with other variables).
- Quantification: assign order and spacing to the nominal values.
- Classing: determine which values are similar to each other and can be grouped together.
Each step can be accomplished using more than one technique:
- Multiple Correspondence Analysis
- Focused Correspondence Analysis
- Modified Optimal Scaling
- Hierarchical Cluster Analysis
6
Distance-Quantification-Classing Approach
Input: target variable and data set with nominal variables
- DISTANCE STEP: produces transformed data for distance calculation
- QUANTIFICATION STEP: produces the nominal-to-numeric mapping
- CLASSING STEP: produces the classing tree
7
Example Input to Output
Data:
- Quality (3): good, ok, bad
- Color (6): blue, green, orange, purple, red, white
- Size (10): a to j
Task: pre-process color based on its patterns across quality and size.

Observed Counts, COLOR by QUALITY
          Good    Ok   Bad  Total
Blue       187   727   546   1460
Green      267   538   356   1161
Orange     276   411   191    878
Purple     155   436   361    952
Red        283   307   357    947
White      459   366   327   1152
Total     1627  2785  2138   6550

DQC output: nominal-to-numeric mapping
Nominal  Numeric
Blue       -0.02
Green      -0.54
Orange      0.55
Purple      0
Red        -0.50
White       0.57

Classing tree leaves: blue, purple, green, red, orange, white
8
Distance Step: Correspondence Analysis
How strong is the association between COLOR and QUALITY?
Can we find similar COLORs based on their association with QUALITY?

Observed Counts, COLOR by QUALITY
          Good    Ok   Bad  Total
Blue       187   727   546   1460
Green      267   538   356   1161
Orange     276   411   191    878
Purple     155   436   361    952
Red        283   307   357    947
White      459   366   327   1152
Total     1627  2785  2138   6550

Row Percentages
          Good    Ok   Bad  Total
Blue        13    50    37    100
Green       23    46    31    100
Orange      31    47    22    100
Purple      16    46    38    100
Red         30    32    38    100
White       40    32    28    100

Similar profiles: (blue, purple)
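The row percentages above follow directly from the observed counts. A minimal sketch, assuming NumPy is available; the chi-square statistic at the end is one common way to answer "how strong is the association?" and is an addition here, not something the slides compute:

```python
import numpy as np

colors = ["Blue", "Green", "Orange", "Purple", "Red", "White"]
qualities = ["Good", "Ok", "Bad"]

# Observed counts, COLOR (rows) by QUALITY (columns), copied from the slide.
counts = np.array([
    [187, 727, 546],
    [267, 538, 356],
    [276, 411, 191],
    [155, 436, 361],
    [283, 307, 357],
    [459, 366, 327],
])

# Row profiles: each row divided by its own total, expressed in percent.
row_totals = counts.sum(axis=1, keepdims=True)
row_pct = 100.0 * counts / row_totals
rounded = np.rint(row_pct).astype(int)

# Chi-square statistic of independence (assumption: used here only as a
# generic association measure; the slides do not name a specific statistic).
expected = row_totals * counts.sum(axis=0, keepdims=True) / counts.sum()
chi2 = ((counts - expected) ** 2 / expected).sum()

for name, row in zip(colors, rounded):
    print(f"{name:<7}" + "".join(f"{int(v):>5d}" for v in row))
```

Rows with similar percentages, such as Blue and Purple, are the "similar profiles" the slide refers to.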
9
Similar row profiles: (blue, purple), …
Similar column profiles: (ok, bad), …
Similar column profiles are combined to produce fewer independent dimensions [Singular Value Decomposition, etc.].

Coordinates for Independent Dimensions
          Dim1   Dim2
Blue     -0.02  -0.28
Green    -0.54   0.14
Orange    0.55   0.10
Purple    0     -0.25
Red      -0.50   0.20
White     0.57   0.19

[Diagram: Focused Correspondence Analysis (FCA) and Multiple Correspondence Analysis (MCA), each drawn over the variables color, quality, and size.]
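The reduction to a few independent dimensions can be sketched with a plain correspondence analysis of the COLOR by QUALITY table, using the standard SVD formulation. This is an illustration under assumptions, not the FCA or MCA procedure from the talk: it uses only the quality counts (the size counts are not in the slides), so its coordinates will not match the table above, and the sign of each dimension is arbitrary. The function name is made up for this example.

```python
import numpy as np

# Observed counts, COLOR (rows) by QUALITY (columns), copied from the slides.
counts = np.array([
    [187, 727, 546],
    [267, 538, 356],
    [276, 411, 191],
    [155, 436, 361],
    [283, 307, 357],
    [459, 366, 327],
], dtype=float)


def correspondence_analysis(table):
    """Simple correspondence analysis of a two-way table via SVD."""
    n = table.sum()
    P = table / n                      # correspondence matrix
    r = P.sum(axis=1)                  # row masses
    c = P.sum(axis=0)                  # column masses
    # Standardized residuals from the independence model r * c^T.
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, sing, Vt = np.linalg.svd(S, full_matrices=False)
    # Principal coordinates of rows and columns.
    row_coords = (U * sing) / np.sqrt(r)[:, None]
    col_coords = (Vt.T * sing) / np.sqrt(c)[:, None]
    return row_coords, col_coords, sing


row_coords, col_coords, sing = correspondence_analysis(counts)
total_inertia = (sing ** 2).sum()      # equals chi-square / n
print(np.round(row_coords[:, :2], 2))
```

With 6 rows and 3 columns, at most two dimensions carry information, so the third singular value is numerically zero.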
10
Quantification Step: Modified Optimal Scaling

Coordinates for Independent Dimensions
          Dim1   Dim2
Blue     -0.02  -0.28
Green    -0.54   0.14
Orange    0.55   0.10
Purple    0     -0.25
Red      -0.50   0.20
White     0.57   0.19

Nominal-to-numeric mapping
Nominal  Numeric
Blue       -0.02
Green      -0.54
Orange      0.55
Purple      0
Red        -0.50
White       0.57
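In this example the mapping equals the Dim1 coordinate of each color. The sketch below only reproduces that observation: it takes the first dimension as the numeric value, then reads off the resulting order and spacing. The details of Modified Optimal Scaling are not given in the slides, so this should not be taken as that algorithm.

```python
# Coordinates copied from the slide: (Dim1, Dim2) per color.
coords = {
    "Blue": (-0.02, -0.28),
    "Green": (-0.54, 0.14),
    "Orange": (0.55, 0.10),
    "Purple": (0.0, -0.25),
    "Red": (-0.50, 0.20),
    "White": (0.57, 0.19),
}

# Assumption for illustration: use the first dimension as the numeric value.
mapping = {name: dims[0] for name, dims in coords.items()}

# Order of the nominal values along the numeric axis.
ordered = sorted(mapping, key=mapping.get)

# Spacing between consecutive values along that axis.
gaps = [round(mapping[b] - mapping[a], 2) for a, b in zip(ordered, ordered[1:])]

# Applying the mapping to a nominal data column (hypothetical sample values).
sample_column = ["Red", "Blue", "White"]
quantified = [mapping[value] for value in sample_column]

print(ordered)
print(gaps)
print(quantified)
```

The ordering places Green next to Red and Orange next to White, with Blue and Purple in between.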
11
Classing Step: Hierarchical Cluster Analysis
Cluster analysis weighted by counts.

Coordinates for Independent Dimensions [from FCA]
          Dim1   Dim2  Counts
Blue     -0.02  -0.28    1460
Green    -0.54   0.14    1161
Orange    0.55   0.10     878
Purple    0     -0.25     952
Red      -0.50   0.20     947
White     0.57   0.19    1152

[Figure: classing tree with leaves blue, purple, green, red, orange, white; axis "Info loss" with ticks 0, 50, 100.]
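A count-weighted hierarchical clustering of these coordinates can be sketched as follows. The merge criterion is an assumption: a weighted Ward rule that merges the pair with the smallest increase in within-cluster inertia, which the sketch treats as the "info loss" of a merge. The slides do not state the exact linkage used, and the function name is made up for this example.

```python
import numpy as np

names = ["Blue", "Green", "Orange", "Purple", "Red", "White"]
coords = np.array([
    [-0.02, -0.28],
    [-0.54, 0.14],
    [0.55, 0.10],
    [0.00, -0.25],
    [-0.50, 0.20],
    [0.57, 0.19],
])
counts = np.array([1460, 1161, 878, 952, 947, 1152], dtype=float)


def weighted_ward(names, coords, weights):
    """Agglomerative clustering; each merge minimizes the weighted inertia lost."""
    clusters = [([n], np.asarray(x, dtype=float), float(w))
                for n, x, w in zip(names, coords, weights)]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                _, xi, wi = clusters[i]
                _, xj, wj = clusters[j]
                # Inertia lost if clusters i and j are merged (Ward criterion).
                loss = wi * wj / (wi + wj) * float(np.sum((xi - xj) ** 2))
                if best is None or loss < best[0]:
                    best = (loss, i, j)
        loss, i, j = best
        mi, xi, wi = clusters[i]
        mj, xj, wj = clusters[j]
        # The merged cluster sits at the weighted centroid of its members.
        merged = (mi + mj, (wi * xi + wj * xj) / (wi + wj), wi + wj)
        merges.append((mi, mj, loss))
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges


merges = weighted_ward(names, coords, counts)
losses = np.array([loss for _, _, loss in merges])
info_loss_pct = 100.0 * np.cumsum(losses) / losses.sum()

for (a, b, _), pct in zip(merges, info_loss_pct):
    print(f"merge {a} + {b}: cumulative info loss {pct:.1f}%")
```

Under this criterion the first three merges pair Blue with Purple, Green with Red, and Orange with White, matching the leaf order of the classing tree on the slide. The cumulative loss reaches 100% when everything is merged into a single class.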
12
Experimental Evaluation
Wrong quantification and classing can introduce artificial patterns and cause errors in interpretation.
Evaluation measures:
- Perception: believability, quality of visual display
- Statistical: quality of classing, quality of quantification
- Computational: space (FCA uses less space), run time (MCA is faster)
13
Test Data Sets
* UCI Repository of Machine Learning Databases
14
Believability and Quality of Visual Display
Given two displays resulting from different nominal-to-numeric mappings:
- Which mapping gives a more believable ordering and spacing?
- Based on your domain knowledge, are the values that are positioned close together similar to each other?
- Are the values that are positioned far from the rest of the values really outliers?
- Which display has less clutter?
15
Believability and Quality of Visual Display
Automobile Data: alphabetical order, equal spacing
Are these patterns believable?
16
Believability and Quality of Visual Display
Automobile Data: FCA
Are these patterns believable?
17
Quality of Classing
Classing A is better than classing B if, given a classing tree, the rate of information loss with each merging is slower.
Depends on the data set.
[Figure: information loss due to classing for one variable. The lower the line, the slower the info loss, the better the classing.]
Calculate the difference between the lines, then summarize.
18
Quality of Quantification
A quantification is good if:
1. Data points that are close together in nominal space are also close together in numeric space.
2. Two variables that are highly associated with each other have quantified versions that are also highly correlated.
MCA gives better quantification for most data sets, based on the average squared correlation measure.
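The slides name an "average squared correlation" measure without defining it. One plausible reading, shown here purely as an assumption, is the mean of the squared Pearson correlations over all pairs of quantified variables. The function name and the small input arrays are made up for illustration and are not data from the study.

```python
import numpy as np


def average_squared_correlation(quantified):
    """Mean squared Pearson correlation over all pairs of columns.

    `quantified` is an (observations x variables) array of already
    quantified (numeric) variables. Assumed definition, see lead-in.
    """
    X = np.asarray(quantified, dtype=float)
    corr = np.corrcoef(X, rowvar=False)
    upper = np.triu_indices(corr.shape[0], k=1)   # each pair counted once
    return float(np.mean(corr[upper] ** 2))


# Illustrative inputs only (not from the slides).
perfectly_related = np.array([[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]])
unrelated = np.array([[-1.0, 1.0], [0.0, -2.0], [1.0, 1.0]])

print(average_squared_correlation(perfectly_related))
print(average_squared_correlation(unrelated))
```

Under this reading, a quantification that preserves strong associations between variables scores close to 1, and one that destroys them scores close to 0.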
19
Summary
- DQC is a general-purpose approach for pre-processing nominal variables for data analysis techniques requiring numeric variables (linear regression) or low-cardinality nominal variables (association rules).
- DQC is multivariate, data-driven, scalable, distance-preserving, and association-preserving.
- FCA is a viable alternative to MCA when memory space is limited.
- Quality of classing and quantification depends on the strength of associations within the data set.
is in the eye of the user
20
Next Steps
- Stress test the technique with more experiments
- Perform a user study that measures the quality of the visual display resulting from MCA vs. FCA
- Further investigate tuning parameters and sensitivity to characteristics of the data set
- Mixed or numeric variables as analysis variables
- Cascaded Focused Correspondence Analysis
21
Related Work
- Visualizing nominal data: CA plots [Fri99], sieve diagrams, mosaic displays, fourfold displays, Dimensional Stacking, TreeMaps
- Quantification: optimal scaling, homogeneity analysis [Gre93]
- Classing nominal variables: loss of inertia [Gre93], decision trees, concept hierarchy
- Clustering nominal variables: k-prototypes [Hua97b]
22
For further information
XmdvTool homepage: http://davis.wpi.edu/~xmdv
Email: xmdv@cs.wpi.edu
Code is free for research and education.
Contact author: ger@wpi.edu