Distributed and Streaming Evaluation of Batch Queries for Data-Intensive Computational Turbulence
Kalin Kanov
Department of Computer Science
Johns Hopkins University
Streaming Evaluation Method
Linear data requirements of the computation allow for:
– Incremental evaluation
– Streaming over the data
– Concurrent evaluation of batch queries
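The idea on this slide can be sketched in a few lines: because each result is a linear combination of stored values, partial sums for many queries can be updated as each stored block is read once. This is a minimal illustration, not the system's implementation; the function `stream_batch`, the `(atom_id, cell)` addressing, and the toy data are all hypothetical names chosen for the example.

```python
from collections import defaultdict

def stream_batch(atoms, queries):
    """Evaluate a batch of linear queries in one pass over the data.

    atoms:   iterable of (atom_id, {cell: value}), read once in storage order.
    queries: {query_id: {(atom_id, cell): weight}} (hypothetical layout).
    Returns  {query_id: weighted sum of the referenced values}.
    """
    # Index the contributions each atom owes to each query.
    needed = defaultdict(list)
    for qid, terms in queries.items():
        for (atom_id, cell), weight in terms.items():
            needed[atom_id].append((qid, cell, weight))

    partial = {qid: 0.0 for qid in queries}
    for atom_id, values in atoms:  # single sequential pass over the data
        for qid, cell, weight in needed.get(atom_id, ()):
            partial[qid] += weight * values[cell]  # incremental update
    return partial

# Toy data: two atoms with two cells each, and two queries sharing atom 1.
atoms = [(0, {0: 1.0, 1: 2.0}), (1, {0: 3.0, 1: 4.0})]
queries = {
    "a": {(0, 1): 0.5, (1, 0): 0.5},
    "b": {(1, 0): 1.0, (1, 1): -1.0},
}
result = stream_batch(atoms, queries)
```

Each atom is read exactly once, and every query that touches it is advanced at that moment, which is what allows the queries in a batch to be evaluated concurrently.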
Motivation
Heavy DB usage slows down the service by a factor of 10 to 20
Query evaluation techniques adapted from simulation code do not access data coherently
Substantial storage overhead incurred to localize each computation
95% of queries perform Lagrange Polynomial interpolation
Turbulence Database Cluster
MHD Database
Stores velocity, magnetic field, magnetic vector potential and pressure fields
– 10 attributes, 4 bytes each
– 1024 time-steps over a grid
– 40TB total size
To reduce the total amount of I/O:
– Smaller atoms (4^3 voxels)
– No replication
Lagrange Polynomial Interpolation
[Equation figure; labeled terms: Lagrange coefficients, Data]
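As a hedged illustration of the interpolation named on this slide, the sketch below computes one-dimensional Lagrange coefficients and combines them as a tensor product over a small 3D stencil, so the result is a sum of coefficients times data values. The function names, the 4-point stencil, and the test field are assumptions for the example, and the coefficients use the textbook product formula rather than any optimized procedure.

```python
from itertools import product

def lagrange_weights(nodes, x):
    """Return the Lagrange basis values l_i(x) for the given 1D nodes."""
    weights = []
    for i, xi in enumerate(nodes):
        w = 1.0
        for j, xj in enumerate(nodes):
            if j != i:
                w *= (x - xj) / (xi - xj)
        weights.append(w)
    return weights

def interpolate_3d(field, nodes, point):
    """Tensor-product Lagrange interpolation.

    field[i][j][k] holds the data value at (nodes[i], nodes[j], nodes[k]).
    The result is linear in the data: a weighted sum over the stencil.
    """
    wx, wy, wz = (lagrange_weights(nodes, c) for c in point)
    n = len(nodes)
    return sum(wx[i] * wy[j] * wz[k] * field[i][j][k]
               for i, j, k in product(range(n), repeat=3))

# Hypothetical 4-point stencil and a smooth test field sampled on it.
nodes = [0.0, 1.0, 2.0, 3.0]

def f(x, y, z):
    return x + 2 * y + 3 * z + x * y

field = [[[f(x, y, z) for z in nodes] for y in nodes] for x in nodes]
value = interpolate_3d(field, nodes, (1.5, 0.5, 2.25))
```

Because the test field is a low-degree polynomial, the interpolated `value` matches `f(1.5, 0.5, 2.25)` up to rounding. The same weighted-sum structure is what makes the computation linear in the data.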
Processing a Batch Query
Additional Optimizations
Process the computation of values that are stored together concurrently
Iterate in the appropriate order
Compute the Lagrange coefficients with the procedures described by Purser and Leslie*

*R. J. Purser and L. M. Leslie. An Efficient Interpolation Procedure for High-Order Three-Dimensional Semi-Lagrangian Models. Monthly Weather Review, 119:2492–+, 1991.
Experimental Evaluation
Random workloads:
– across the entire cube space
– a subset of the entire space
Workload derived from the usage log of the Turbulence Database cluster
Compare with:
– Direct methods of evaluation
Setup
Experimental version of the MHD database
– ~300 timesteps of the velocity fields of the MHD DNS
– Two 2.33 GHz dual quad-core Windows 2003 servers with SQL Server 2008 and 8GB of memory
– Data tables striped across 7 disks
Questions/Comments