Techniques for Visualizing Massive Data Sets
Leilani Battle, Mike Stonebraker
Context
[Diagram: Visualization System and Database; arrow labels: query, result]
- Have a database with lots of data
- Want a visual overview (e.g. statistical plots such as scatterplots and heatmaps)
- Want visualizations to be interactive (e.g. pan and zoom)
Problem
Performance
- Vis systems don’t scale well for big data
- Or are turning into databases
Over-plotting
- Makes visualizations unreadable
- Waste of time/resources
Solution: Resolution Reduction
[Diagram: Visualization System, Resolution Reduction Layer, Database; arrow labels: query, modified query, queryplan query, queryplan result, result, reduced result]
- When a query will return too much data, reduce it
  - Aggregate, sample, filter, etc.
- “Too big” means:
  - Will slow down the vis system
  - Will cause over-plotting
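The reduction step above can be sketched in a few lines. This is a minimal illustration, not ScalaR's implementation: it works on an in-memory list of rows rather than rewriting a database query, and the names `reduce_result`, `max_points`, and `cell` are hypothetical.

```python
import random
from collections import defaultdict


def reduce_result(points, max_points, method="aggregate", cell=1.0, seed=0):
    """Shrink a result set of (x, y, value) rows when it is "too big".

    Assumption: "too big" is modeled as a plain row-count threshold.
    """
    if len(points) <= max_points:
        return list(points)  # small enough, pass through unchanged

    if method == "sample":
        # Uniform random sample of exactly max_points rows.
        return random.Random(seed).sample(points, max_points)

    if method == "aggregate":
        # Average the values that fall into the same cell-by-cell grid square.
        # Note: the number of squares depends on `cell`, not on max_points.
        sums = defaultdict(lambda: [0.0, 0])
        for x, y, value in points:
            key = (x // cell, y // cell)
            sums[key][0] += value
            sums[key][1] += 1
        return [(kx * cell, ky * cell, total / count)
                for (kx, ky), (total, count) in sorted(sums.items())]

    raise ValueError(f"unknown method: {method}")
```

For example, a 10 x 10 grid of points aggregated with `cell=5.0` collapses to four averaged points, one per 5 x 5 square.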
ScalaR
- Scalable vis system for data exploration
- Web front-end
- Uses SciDB (www.scidb.org)
- Visualizes query results
- Performs Resolution Reduction
Speaker notes: Advertise SciDB (open-source, array-oriented database system; great for scientific applications and scalable machine learning). Give the URL: scidb.org.
Demo of ScalaR
Array Browser
- Collaboration with:
  - Brown: Justin DeBrabant, Stan Zdonik, Ugur Cetintemel
  - Stanford: Zhicheng Liu, Jeff Heer
- Google Maps-style exploration experience
- Fetches subsets of the data (aka data tiles)
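One way to picture data tiles is as fixed-size squares addressed by zoom level and position, as in map viewers. The sketch below assumes a hypothetical tiling scheme (integer array coordinates, tiles that double in coverage with each step away from the deepest zoom level); the slides do not specify how the Array Browser lays out its tiles.

```python
def tiles_for_viewport(x_min, y_min, x_max, y_max, zoom, max_zoom, tile_size=256):
    """Return (zoom, tx, ty) keys for tiles overlapping a viewport.

    The viewport is the half-open box [x_min, x_max) x [y_min, y_max) in
    integer array coordinates. At max_zoom one tile covers tile_size cells
    per side; each coarser level doubles that (an assumed scheme).
    """
    span = tile_size * 2 ** (max_zoom - zoom)  # cells covered by one tile side
    first_tx, last_tx = x_min // span, (x_max - 1) // span
    first_ty, last_ty = y_min // span, (y_max - 1) // span
    return [(zoom, tx, ty)
            for ty in range(first_ty, last_ty + 1)
            for tx in range(first_tx, last_tx + 1)]
```

Panning changes which `(tx, ty)` keys are needed; zooming changes `zoom` and therefore how much of the array each tile covers.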
Array Browser Example
Array Browser Architecture
Demo of Array Browser
Future Work: Prefetching
- Goal: Reduce user-wait time by prefetching tiles
- Cache tiles in the tile buffer
- Need algorithms to decide what to prefetch
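A minimal sketch of the prefetching idea, under stated assumptions: `fetch_tile` and `predict_next` are hypothetical callables supplied by the caller, prefetching runs synchronously (a real system would likely do it in the background), and the buffer is unbounded.

```python
class PrefetchingTileClient:
    """Serve tiles from a buffer and prefetch the tiles a predictor suggests."""

    def __init__(self, fetch_tile, predict_next):
        self.fetch_tile = fetch_tile      # tile_id -> tile data (slow path)
        self.predict_next = predict_next  # tile_id -> iterable of likely next tile_ids
        self.buffer = {}                  # tile buffer: tile_id -> tile data
        self.hits = 0
        self.misses = 0

    def request(self, tile_id):
        if tile_id in self.buffer:
            self.hits += 1                # user does not wait on the database
            tile = self.buffer[tile_id]
        else:
            self.misses += 1
            tile = self.fetch_tile(tile_id)
            self.buffer[tile_id] = tile
        # Prefetch predicted tiles that are not buffered yet.
        for next_id in self.predict_next(tile_id):
            if next_id not in self.buffer:
                self.buffer[next_id] = self.fetch_tile(next_id)
        return tile
```

Usage: with a predictor that always guesses "one tile to the right", a user panning right hits the buffer on every request after the first.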
User Behavior Predictor (Seer)
- Learn common query sequences from user traces
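The slides do not say how Seer learns sequences. As one possible illustration, the sketch below swaps in a simple first-order Markov model over user moves: it counts which move tends to follow which in recorded traces. The class name and move labels are hypothetical.

```python
from collections import Counter, defaultdict


class SequencePredictor:
    """Predict the next move from the previous one using transition counts."""

    def __init__(self):
        self.counts = defaultdict(Counter)  # previous move -> Counter of next moves

    def train(self, traces):
        # Each trace is a list of moves in the order the user made them.
        for trace in traces:
            for prev_move, next_move in zip(trace, trace[1:]):
                self.counts[prev_move][next_move] += 1

    def predict(self, last_move, k=1):
        # Return up to k most frequent follow-up moves; empty if never seen.
        followers = self.counts.get(last_move, Counter())
        return [move for move, _ in followers.most_common(k)]
```

Usage: after training on traces where "zoom_in" is usually followed by "pan_right", `predict("zoom_in")` suggests prefetching the tile to the right.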
Statistical Analysis Predictor
- Look for statistical similarities in tiles
- Try to guess what’s important based on patterns
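"Statistical similarity" is not defined on the slide. One hedged reading: summarize each tile by a normalized histogram of its values and rank candidate tiles by how close their histograms are to a tile the user just viewed. The functions below are an assumed sketch of that idea, with hypothetical names and a fixed value range.

```python
import math


def histogram(values, bins=4, lo=0.0, hi=1.0):
    """Normalized histogram of a non-empty list of values in [lo, hi]."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for v in values:
        idx = int((v - lo) / width)
        counts[max(0, min(idx, bins - 1))] += 1  # clamp edge values into range
    return [c / len(values) for c in counts]


def rank_by_similarity(reference, candidates, bins=4, lo=0.0, hi=1.0):
    """Order candidate tile ids from most to least similar to `reference`.

    `candidates` maps tile id -> list of values; similarity is Euclidean
    distance between histograms (an assumption, other measures would work).
    """
    ref_hist = histogram(reference, bins, lo, hi)

    def distance(tile_id):
        hist = histogram(candidates[tile_id], bins, lo, hi)
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(ref_hist, hist)))

    return sorted(candidates, key=distance)
```

The top-ranked tiles would be the first candidates to prefetch.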
Using Multiple Predictors
- Run multiple predictors (or experts) in parallel
- Compare predictions to user’s actual behavior
- Use predictions from best performing expert
  - May change over time based on user’s goals
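The selection loop above can be sketched as follows. This is an assumed design, not the authors' algorithm: each expert is a hypothetical callable from the request history to predicted tile ids, and scores are exponentially decayed hit rates so that the best expert can change as the user's goals change.

```python
class ExpertSelector:
    """Run several predictors and follow whichever has been most accurate."""

    def __init__(self, experts, decay=0.9):
        self.experts = dict(experts)  # name -> callable(history) -> tile ids
        self.decay = decay            # closer to 1.0 means slower to switch
        self.scores = {name: 0.0 for name in self.experts}
        self.last_predictions = {}

    def predict(self, history):
        # Ask every expert, but return only the current best expert's guess.
        self.last_predictions = {name: set(fn(history))
                                 for name, fn in self.experts.items()}
        best = max(self.scores, key=self.scores.get)
        return best, self.last_predictions[best]

    def observe(self, actual_tile):
        # Compare each expert's last prediction to what the user really did.
        for name, predicted in self.last_predictions.items():
            hit = 1.0 if actual_tile in predicted else 0.0
            self.scores[name] = (self.decay * self.scores[name]
                                 + (1.0 - self.decay) * hit)
```

Usage: call `predict` before each user request to choose tiles to prefetch, then `observe` with the tile the user actually requested.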
Other Challenges
- Lots of interesting problems left to address
- Best eviction policy for the tile buffer?
- How to share data between multiple users?
- More predictors?
Speaker notes: Explain these bullets. LRU and weighted LRU: lots of work has been done on these, but it is not clear how LRU works with multiple predictors.
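For reference, plain LRU eviction for a tile buffer can be sketched as below. This is a baseline only: the class name is hypothetical, and it does not address the open question of how eviction should interact with multiple predictors.

```python
from collections import OrderedDict


class LRUTileBuffer:
    """Fixed-capacity tile buffer that evicts the least recently used tile."""

    def __init__(self, capacity):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self._tiles = OrderedDict()  # oldest entry first, newest last

    def __contains__(self, tile_id):
        return tile_id in self._tiles

    def get(self, tile_id):
        if tile_id not in self._tiles:
            return None
        self._tiles.move_to_end(tile_id)  # mark as most recently used
        return self._tiles[tile_id]

    def put(self, tile_id, tile):
        self._tiles[tile_id] = tile
        self._tiles.move_to_end(tile_id)
        while len(self._tiles) > self.capacity:
            self._tiles.popitem(last=False)  # drop the least recently used tile
```

One wrinkle this makes visible: a prefetched tile that is never requested looks "old" to LRU and may be evicted before the user reaches it.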
Questions?
[Figure: labels Gemini, Sagittarius, Dogs, Cats]
Prefetching Experts
- User behavior predictor (Seer)
  - Learn common query sequences from user traces
- Stats analysis predictor
  - Look for statistical similarities in tiles
  - Try to guess what’s important based on patterns
Speaker notes (2 min): Describe movement patterns through the data set, and explain how Seer is great for this.