The Big Picture
- Scientific disciplines have developed a computational branch
  - Models without closed-form solutions are solved numerically
- This has led to an explosion of data
  - Simulation and analysis workloads are data-intensive
  - Producing/scanning large amounts of data
- Management of these data represents a significant challenge
  - Storage/archiving
  - Query processing
  - Visualization
Remote Immersive Analysis
- Formerly, analysis was performed during the computation
  - No data stored for subsequent examination
- Data-intensive computing breakthroughs allow new ways of interacting with scientific numerical simulations
- Turbulence Database Cluster
  - Stores the entire space-time evolution of the simulation
  - Provides public access to world-class simulations
  - Implements the "immersive turbulence"* approach
  - Introduces new challenges

* E. Perlman, R. Burns, Y. Li, and C. Meneveau. Data exploration of turbulence simulations using a database cluster. In Supercomputing, 2007.
Goals
- Develop data-driven query processing techniques
  - Reduce I/O and computation costs
  - Reduce or eliminate storage overhead
  - Exploit domain knowledge and structure
- Provide user interfaces that are efficient and flexible
- Streamline the process of data ingest
Turbulence Database Cluster
Processing a Batch Query
[Figure: three queries (q1, q2, q3) evaluated independently against the data]
- Redundant I/O
- Multiple disk seeks
I/O Streaming Evaluation Method
- Linear data requirements of the computation allow for:
  - Incremental evaluation
  - Streaming over the data
  - Concurrent evaluation of batch queries
Processing a Batch Query
[Figure: queries q1, q2, q3 evaluated concurrently as the data streams past]
- I/O Streaming:
  - Sequential I/O
  - Single pass
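The single-pass idea above can be sketched as follows. This is a minimal illustration, not the cluster's actual implementation: the atom identifiers and the `read_atom` callback are assumptions standing in for the real storage layer. Queries are grouped by the data atoms they touch, atoms are read in sequential order, and every pending query that needs an atom is served while it is in memory, so each atom is read exactly once per batch.

```python
from collections import defaultdict

def stream_batch(queries, read_atom):
    """queries: list of (query_id, [atom_id, ...]) pairs.
    read_atom: function atom_id -> data block.
    Returns ({query_id: [blocks]}, number of reads), one sequential pass."""
    pending = defaultdict(list)        # atom_id -> query ids that need it
    for qid, atoms in queries:
        for a in atoms:
            pending[a].append(qid)
    results = defaultdict(list)
    reads = 0
    for atom_id in sorted(pending):    # visit atoms in storage order
        block = read_atom(atom_id)     # each atom read exactly once
        reads += 1
        for qid in pending[atom_id]:   # serve all queries touching this atom
            results[qid].append(block)
    return results, reads
```

Even when queries overlap (here q1 and q3 both need atom 3), the number of reads equals the number of distinct atoms, not the sum of per-query requests.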
Lagrange Polynomial Interpolation
[Figure: Lagrange coefficients applied to grid data]
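As a reminder of the kernel being evaluated, here is a minimal one-dimensional Lagrange interpolation sketch; in the database the same construction is applied dimension-by-dimension to grid data. The node locations and sample values are illustrative, not tied to the cluster's grid layout.

```python
def lagrange_coefficients(nodes, x):
    """Lagrange basis weights l_i(x) for the given interpolation nodes."""
    coeffs = []
    for i, xi in enumerate(nodes):
        c = 1.0
        for j, xj in enumerate(nodes):
            if j != i:
                c *= (x - xj) / (xi - xj)   # product over all other nodes
        coeffs.append(c)
    return coeffs

def lagrange_interpolate(nodes, values, x):
    """Interpolated value: data values weighted by the Lagrange coefficients."""
    return sum(c * v for c, v in zip(lagrange_coefficients(nodes, x), values))
```

Because the coefficients depend only on the node positions and the target point, they can be computed once per query point and reused as the data streams past, which is what makes the computation decomposable over atoms.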
Spatial Differentiation
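The slide's formulas are not preserved in this text, so as an assumed but standard form of the spatial differentiation kernel, here is a fourth-order centered finite-difference stencil on a uniform grid; like interpolation, it needs only a small linear window of samples, which is what permits streaming evaluation.

```python
def centered_diff_4(f, i, h):
    """4th-order centered first derivative of samples f at index i, spacing h."""
    return (-f[i + 2] + 8 * f[i + 1] - 8 * f[i - 1] + f[i - 2]) / (12 * h)
```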
Derivative Interpolation
128 Workload
- Over an order of magnitude improvement
- Sorting leads to more sequential access
- Join/Order By executes the entire batch as a join
- I/O Streaming:
  - Each atom is read only once
  - Effective cache usage
- I/O Streaming alleviates the I/O bottleneck
- Computation emerges as the more costly operation
Particle Tracking
[Figure: the Web Server/Mediator distributes points among DB Nodes 1..N; on each node the Computational Module retrieves data from the Storage Layer; predictor step x_p(t_m) -> x*_p(t_m)]
Particle Tracking
[Figure: same architecture; corrector step x*_p(t_m) -> x_p(t_m+1) computed on the DB nodes]
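The two figure steps suggest a predictor-corrector time integrator: an Euler predictor giving x*_p(t_m), then a trapezoidal corrector giving x_p(t_m+1). The sketch below assumes that scheme; the `velocity` callback stands in for the interpolated velocity field served by the DB nodes.

```python
def advance_particle(x, t, dt, velocity):
    """One predictor-corrector step for a particle at position x, time t."""
    u0 = velocity(x, t)              # velocity at the current position
    x_star = x + dt * u0             # predictor: x*_p(t_m)
    u1 = velocity(x_star, t + dt)    # velocity at the predicted position
    return x + 0.5 * dt * (u0 + u1)  # corrector: x_p(t_m+1)
```

Each step needs two velocity evaluations at different positions, which is why the mediator must exchange predicted positions with the DB nodes between the predictor and corrector phases.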
Summary and Future Work
- Extend the I/O streaming technique to different decomposable kernel computations:
  - Differentiation
  - Spatial interpolation
  - Temporal interpolation
  - Filtering and coarse-graining
- Provide a flexible user interface
  - Allow for different filter functions
  - Allow for new kernel computations
- Improve the particle tracking routine
  - Reduce communication between mediator and DB nodes
  - Asynchronous processing
  - Caching and pre-fetching
Questions

Images courtesy of Kai Buerger