Download presentation
Presentation is loading. Please wait.
1
Scalability for High Cardinality in Steerable MDS CPSC 533C Project Presentation Allan Rempel December 19, 2005
2
Scalability Ability to run on very large data sets Both theoretical and practical efficiency Both time and space efficiency Effective use of resources – memory, disk, db Avoid waste and thrashing Cardinality and dimensionality
3
Review - multidimensional scaling (MDS) Display multivariate abstract point data in 2D Data from bioinformatics, financial sector, etc. No inherent mapping in 2D space p-dim embedding of q-dim space (p < q) where inter-object relationships are approximated in low-dimensional space Proximity in high-D -> proximity in 2D High-dim distance between points (similarity) determines relative (x,y) position Absolute (x,y) positions are not meaningful Want to see clusters, curves, etc. Features that stand out from the noise
4
Review - multidimensional scaling (MDS) Refinement of algorithms O(N 3 ), O(N 2 ), O(N log N) 1996, 2002, 2003, 2004 Chalmers, Morrison, Ross, Jourdan, Melançon Progressive steerable MDS Munzner et al, 2004, 2005
5
MDSteer++ MDS visualization system developed by T. Munzner, M. Tory, D. Westrom, M. Williams C++ with Qt GUI, OpenGL Runs on Windows and Linux Test platform – P-III, 256 MB, Linux
6
MDSteer++
7
MDSteer++ limitations and proposal Theoretically, handle 300+ dimensions and 1,000,000+ points Practically, limited by system memory ~100,000 points calculating distance on the fly ~10,000 points with precomputed distance matrix ~5000 points exhausted memory on test machine Big Question: Can we raise these limits by using a database? At what cost?
8
Scaling issues Time complexity matters – Chalmers Time complexity doesn’t matter - Jourdan Lots of things matter Actual vs. perceived speed – progressiveness What are the limits of the hardware?
9
Experimental results – Chalmers et al 3-D data sets: 5000 – 50,000 points 13-D data sets: 2000 – 24,000 points Took less than 1/3 the time of the O(N 2 ) Achieved lower stress when done Also compared against original O(N 3 ) model 9 seconds vs. 577; and 24 vs. 3642 Achieved much lower stress (0.06 vs. 0.2)
10
Experimental results – Chalmers et al
11
Comparison – Jourdan & Melançon Chalmers et al is better for N < 5500 Main diff is in parent-finding, represented by Fig. 3
12
Comparison – Jourdan & Melançon Experimental study confirms theoretical results This technique becomes better for N > 70,000
13
Multiscale MDS – Jourdan & Melançon Recursively defining the initial kernel set of points can yield much better real-time performance Same time complexity, but much faster execution
14
Experimental results Memory usage with/without 5000 node MDS run
15
Offloading distance matrix Step 1: Write to file instead of database Database uses files anyway – save db overhead Use /tmp for high speed/capacity – 3GB available Similar speed and memory usage as before… Probably because file is being cached Probably similar scenario to memory being swapped … until 2000 of 5000 points placed – thrashing 2000 points = 16 MB; 5000 points = 100 MB
16
Offloading distance matrix Step 2: Write to database Offload some work to database server Tried to connect to db over network Tried to run MDSteer++ on db server Tried to modify MDSteer++ for MDS only, no GUI Suspect that database overhead, network comm., NFS vs. local disks would cause significant slowdown
17
MDSteer++ code modifications Build and run on Linux Fixed some bugs Class to handle file-based matrix Class to handle database-based matrix Restructured code to accomodate modes Partial separation of GUI from MDS back end
18
Future work Use.ui files for GUI specification Separate MDS from GUI functionality What does MDSteer++ do if user not interactive? Continue examination of database use Library – provide hooks for use in other s/w Not require recompile for diff modes Clean compile w/o warnings Cross-platform with single code base
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.