UNIVERSITY OF SOUTH FLORIDA

Database-centric Data Analysis of Molecular Simulations
Yicheng Tu *, Sagar Pandit §, Ivan Dyedov *, and Vladimir Grupcev *
* Department of Computer Science and Engineering, § Department of Physics

Molecular Simulations (MS)
Large-scale biological structures are represented using all the individual atoms, thus providing a nanoscopic description of biological processes.
Data is stored in single or multiple trajectory files containing time frames.
Each frame is a sequential list of atoms with their positions, velocities, perhaps forces, masses, and types.
The dataset is very large: millions of atoms, tens of thousands of frames.

Abstract
Molecular simulations (MS) have become an integral part of molecular and structural biology. By providing model descriptions of biochemical and biophysical processes at the nanoscopic scale, MS can provide a fundamental understanding of diseases and aid drug discovery. MS, by their nature, generate large amounts of data. Although many MS software packages are carefully designed to achieve maximum computational performance in simulation, they fall seriously short in storing and handling the large-scale output data. The objective of this project is to use database technologies to improve the efficiency, ease of maintenance, and security of MS data analysis. We accomplish this by developing novel data structures and query processing algorithms in the kernel of the database management system (DBMS), in addition to leveraging the advantages of such systems in their current forms. We focus on creative indexing and data organization techniques and on query processing and optimization strategies. We believe that such innovations will bring significant intellectual merit from which both the biomedical and database management communities will benefit.

State-of-the-art in MS Data Analysis
Store trajectory in computer files
Organize data into files
Where to find data?
Use the file names to encode file “content”
Smarter systems: SimDB [1] and BioSimGrid [2] use relational databases to manage these trajectory files

Figure 1. A simulated hydrated dipalmitoylphosphatidylcholine bilayer system.

Research Challenges
Application programs are difficult to maintain: tedious coding is required for each new query
Data security is poorly supported: access can be controlled only at the whole-file level
Most importantly, data retrieval is very inefficient: sequential file search is often needed

Our Approach
A database-centric MS data analysis (DCMS) framework that
  o stores and queries raw data in a database management system (DBMS)
  o allows efficient application development via a declarative query language (e.g., SQL)
  o provides fine-granularity access control and view-based data access

Figure 2. DCMS architecture.

Processing Histogram Queries
Histogram queries are very popular in DCMS
  o given a set of (or all) atoms in a time frame, compute the distribution of a physical measurement as a histogram with bucket width h
The histogram of pairwise distances (PDH) is more challenging
The naive algorithm needs to compute all N(N-1)/2 distances, where N is the number of atoms
Our solution uses a quadtree-based data structure called a density map
  o if the distances between all atoms in two cells of the map fall into a single histogram bucket, there is no need to compute the individual distances
Time complexity is O(N^1.5) for 2D data and O(N^(5/3)) for 3D data

Figure 5. Solving a histogram query (bucket width h = 3) using two density maps generated from raw data (left) with low (middle) and high (right) resolution.
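The cell-pair shortcut described above can be illustrated with a minimal Python sketch for 2D data. It is not the poster's implementation: it assumes a single uniform grid in place of the quadtree-based density maps, and it falls back to brute force for cell pairs that cannot be resolved, whereas the approach shown in Figure 5 would move to a higher-resolution map. The function names (naive_pdh, density_map_pdh) and the cell parameter are hypothetical.

```python
import math
import random
from collections import defaultdict
from itertools import combinations


def naive_pdh(points, h, num_buckets):
    """Brute force: bin all N(N-1)/2 pairwise distances with bucket width h."""
    hist = [0] * num_buckets
    for (x1, y1), (x2, y2) in combinations(points, 2):
        hist[int(math.hypot(x1 - x2, y1 - y2) // h)] += 1
    return hist


def density_map_pdh(points, h, num_buckets, cell):
    """Single-resolution grid variant (assumption: no finer-resolution maps)."""
    # The "density map": points grouped by the grid cell that contains them.
    cells = defaultdict(list)
    for x, y in points:
        cells[(int(x // cell), int(y // cell))].append((x, y))

    hist = [0] * num_buckets
    keys = sorted(cells)
    diagonal_bucket = int(math.hypot(cell, cell) // h)
    for a in range(len(keys)):
        pa = cells[keys[a]]
        # Pairs inside one cell: distances are below the cell diagonal.
        if diagonal_bucket == 0:
            hist[0] += len(pa) * (len(pa) - 1) // 2
        else:
            for (x1, y1), (x2, y2) in combinations(pa, 2):
                hist[int(math.hypot(x1 - x2, y1 - y2) // h)] += 1
        for b in range(a + 1, len(keys)):
            pb = cells[keys[b]]
            dx = abs(keys[a][0] - keys[b][0])
            dy = abs(keys[a][1] - keys[b][1])
            # Smallest and largest possible distance between the two cells.
            dmin = math.hypot(max(dx - 1, 0) * cell, max(dy - 1, 0) * cell)
            dmax = math.hypot((dx + 1) * cell, (dy + 1) * cell)
            lo, hi = int(dmin // h), int(dmax // h)
            if lo == hi:
                # Every pair falls into one bucket: add the count, skip distances.
                hist[lo] += len(pa) * len(pb)
            else:
                # Unresolved pair of cells: compute the distances directly.
                for x1, y1 in pa:
                    for x2, y2 in pb:
                        hist[int(math.hypot(x1 - x2, y1 - y2) // h)] += 1
    return hist


if __name__ == "__main__":
    random.seed(0)
    atoms = [(random.uniform(0, 16), random.uniform(0, 16)) for _ in range(300)]
    buckets = int(math.hypot(16, 16) // 3.0) + 1
    print(density_map_pdh(atoms, 3.0, buckets, cell=1.0))
    print(naive_pdh(atoms, 3.0, buckets))
```

Both functions should return the same histogram; the grid version simply avoids distance computations for every pair of cells whose minimum and maximum separation land in the same bucket. The choice of cell size relative to h controls how many cell pairs can be resolved this way.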
Summary
Existing file-based MS data processing has serious drawbacks in application development, security, and efficiency of data access
Storing and querying MS data in DCMS (with a legacy DBMS) addresses these problems
DCMS improves query efficiency by 1-5 orders of magnitude
Further efficiency gains can be achieved by augmenting DCMS with novel indexes and query processing algorithms

Our Approach (continued)
Further improve the efficiency of data retrieval and analysis via
  o novel indexing structures
  o sophisticated query processing algorithms

Figure 3. Structure of Time-Parameterized B+-Tree (TPB) index.

References
[1] Feig et al., Future Generation Computer Systems, 16(1): , (1999)
[2] Ng et al., Future Generation Computer Systems, 22(6): , (2006)

Contacts:

Experimental results
Four popular query types
Comparison with Gromacs
Dataset size: 286,000 atoms, 100,000 frames

Figure 4. Query processing time in file-based and database-based systems.

Indexing MS Data
Multiple indexes are needed, each targeting a set of queries
  o TPB-tree: random point and trajectory queries
  o TPS-tree: spatial range queries
  o kd-tree: range queries on other non-spatial measurements
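To make the query types listed above concrete, here is a small sketch using Python's built-in sqlite3 module. It is only an illustration under stated assumptions: the table name atom_position, its columns, and the helper functions are hypothetical, and the ordinary B-tree indexes that SQLite provides stand in for the TPB-tree and TPS-tree, which are not implemented here.

```python
import sqlite3


def build_demo_db(num_frames=5, num_atoms=4):
    """Create an in-memory database with one row per (atom, frame)."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical schema; the composite primary key gives an index that
    # serves point and trajectory lookups by atom.
    conn.execute(
        "CREATE TABLE atom_position ("
        " frame INTEGER, atom_id INTEGER, x REAL, y REAL, z REAL,"
        " PRIMARY KEY (atom_id, frame))"
    )
    # Synthetic coordinates, chosen only so the queries have something to find.
    rows = [
        (f, a, a + 0.25 * f, 2.0 * a, 0.5 * f)
        for f in range(num_frames)
        for a in range(num_atoms)
    ]
    conn.executemany("INSERT INTO atom_position VALUES (?, ?, ?, ?, ?)", rows)
    # Plain B-tree index used here for per-frame range filtering on x.
    conn.execute("CREATE INDEX idx_frame_x ON atom_position (frame, x)")
    return conn


def point_query(conn, atom_id, frame):
    """Position of one atom in one frame."""
    return conn.execute(
        "SELECT x, y, z FROM atom_position WHERE atom_id = ? AND frame = ?",
        (atom_id, frame),
    ).fetchone()


def trajectory_query(conn, atom_id):
    """Positions of one atom across all frames, in frame order."""
    return conn.execute(
        "SELECT frame, x, y, z FROM atom_position"
        " WHERE atom_id = ? ORDER BY frame",
        (atom_id,),
    ).fetchall()


def range_query(conn, frame, x_lo, x_hi):
    """Atoms whose x coordinate lies in [x_lo, x_hi] in a given frame."""
    cursor = conn.execute(
        "SELECT atom_id FROM atom_position"
        " WHERE frame = ? AND x BETWEEN ? AND ? ORDER BY atom_id",
        (frame, x_lo, x_hi),
    )
    return [row[0] for row in cursor]


if __name__ == "__main__":
    db = build_demo_db()
    print(point_query(db, atom_id=2, frame=3))
    print(trajectory_query(db, atom_id=1))
    print(range_query(db, frame=2, x_lo=1.0, x_hi=2.6))
```

Each query is a single declarative statement, so a new analysis needs a new SQL string rather than a new file-parsing program. A range query over all three spatial dimensions would need a genuinely spatial index, which this sketch does not attempt.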