UNIVERSITY OF SOUTH FLORIDA

Database-centric Data Analysis of Molecular Simulations

Yicheng Tu *, Sagar Pandit §, Ivan Dyedov *, and Vladimir Grupcev *
* Department of Computer Science and Engineering, § Department of Physics

Molecular Simulations (MS)
Large-scale biological structures are represented using all of their individual atoms, which provides a nanoscopic description of biological processes.
Data is stored in one or more trajectory files containing time frames.
Each frame is a sequential list of atoms with their positions, velocities, and possibly forces, masses, and types.
The dataset is very large: millions of atoms and tens of thousands of frames.

Abstract
Molecular simulations (MS) have become an integral part of molecular and structural biology. By providing model descriptions of biochemical and biophysical processes at the nanoscopic scale, MS can provide a fundamental understanding of diseases and aid drug discovery. MS, by their nature, generate large amounts of data. Although many MS software packages are carefully designed to achieve maximum computational performance in simulation, they fall seriously short in storing and handling their large-scale data output. The objective of this project is to use database technologies to improve the efficiency, ease of maintenance, and security of MS data analysis. We accomplish this by developing novel data structures and query processing algorithms in the kernel of the database management system (DBMS), in addition to leveraging the advantages of such systems in their current forms. We focus on creative indexing and data organization techniques and on query processing and optimization strategies. We believe that such innovations will bring significant intellectual merit from which both the biomedical and database management communities will benefit.

State-of-the-art in MS Data Analysis
Store trajectory in computer files
Organize data into files
Where to find data?
Use the file names to encode file “content”
Smarter systems: SimDB [1] and BioSimGrid [2] use relational databases to manage these trajectory files

Figure 1. A simulated hydrated dipalmitoylphosphatidylcholine bilayer system.

Research Challenges
Application programs are difficult to maintain: tedious coding is required for each new query
Data security is poorly supported: access can be controlled only at the whole-file level
Most importantly, data retrieval is very inefficient: a sequential file search is often needed

Our Approach
A database-centric MS data analysis (DCMS) framework that
  o stores and queries raw data in a database management system (DBMS)
  o allows efficient application development via a declarative query language (e.g., SQL)
  o provides fine-granularity access control and view-based data access

Figure 2. DCMS architecture.

Processing Histogram Queries
Histogram queries are very popular in DCMS
  o given a set of (or all) atoms in a time frame, compute the distribution of a physical measurement as a histogram with bucket width h
The histogram of pairwise distances (PDH) is more challenging
The naive algorithm must compute all N(N-1)/2 distances, where N is the number of atoms
Our solution uses a quadtree-based data structure called a density map
  o if the distances between all atom pairs drawn from two cells of the map fall into the same histogram bucket, those distances need not be computed individually
Time complexity is O(N^1.5) for 2D data and O(N^(5/3)) for 3D data

Figure 5. Solving a histogram query (bucket width h = 3) using two density maps generated from raw data (left) with low (middle) and high (right) resolution.
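The cell-pair shortcut described under Processing Histogram Queries can be illustrated with a small Python sketch. It is a simplification under stated assumptions, not the poster's algorithm: it uses a single uniform grid of cell width `side` instead of a quadtree of density maps, and when a pair of cells cannot be resolved it falls back to computing the distances directly rather than moving to a finer-resolution map. The function names (`naive_pdh`, `cell_distance_bounds`, `density_map_pdh`) are hypothetical.

```python
import math
import random
from collections import Counter
from itertools import combinations, product


def naive_pdh(points, h):
    """Brute force: bucket all N(N-1)/2 pairwise distances with bucket width h."""
    hist = Counter()
    for p, q in combinations(points, 2):
        hist[int(math.dist(p, q) // h)] += 1
    return hist


def cell_distance_bounds(a, b, side):
    """Lower and upper bounds on the distance between a point in grid cell a
    and a point in grid cell b (cells are given by integer grid indices)."""
    lo_sq = hi_sq = 0.0
    for ia, ib in zip(a, b):
        offset = abs(ia - ib)
        gap = max(offset - 1, 0) * side   # closest approach along this axis
        span = (offset + 1) * side        # farthest separation along this axis
        lo_sq += gap * gap
        hi_sq += span * span
    return math.sqrt(lo_sq), math.sqrt(hi_sq)


def density_map_pdh(points, h, side):
    """Single-resolution sketch (assumption: one uniform grid, no quadtree)."""
    cells = {}
    for p in points:
        cells.setdefault(tuple(int(c // side) for c in p), []).append(p)

    hist = Counter()
    keys = list(cells)
    for i, a in enumerate(keys):
        for b in keys[i:]:
            pa, pb = cells[a], cells[b]
            same = a == b
            n_pairs = len(pa) * (len(pa) - 1) // 2 if same else len(pa) * len(pb)
            if n_pairs == 0:
                continue
            lo, hi = cell_distance_bounds(a, b, side)
            if int(lo // h) == int(hi // h):
                # Resolved: every distance between the two cells lands in one bucket.
                hist[int(lo // h)] += n_pairs
            else:
                # Unresolved: this sketch computes the distances directly.
                pairs = combinations(pa, 2) if same else product(pa, pb)
                for p, q in pairs:
                    hist[int(math.dist(p, q) // h)] += 1
    return hist


if __name__ == "__main__":
    random.seed(0)
    atoms = [(random.uniform(0, 10), random.uniform(0, 10)) for _ in range(200)]
    print(sorted(density_map_pdh(atoms, h=3.0, side=1.0).items()))
```

With small cells relative to h, most cell pairs resolve from their counts alone, which is where the savings over the naive algorithm come from. The hierarchical version on the poster refines only the unresolved pairs at a higher resolution.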
Summary
Existing file-based MS data processing has serious drawbacks in application development, security, and efficiency of data access
Storing and querying MS data in DCMS (with a legacy DBMS) provides a better solution that solves the above problems
DCMS improves query efficiency by 1-5 orders of magnitude
Further improvement in efficiency can be achieved by augmenting the DCMS with novel indexes and query processing algorithms
Further improve the efficiency of data retrieval and analysis via
  o novel indexing structures
  o sophisticated query processing algorithms

References
1. Feig et al., Future Generation Computer Systems, 16(1), (1999)
2. Ng et al., Future Generation Computer Systems, 22(6), (2006)

Experimental results
Four popular query types
Comparison with Gromacs
Dataset size: 286,000 atoms, 100,000 frames

Figure 4. Query processing time in file-based and database-based systems.

Indexing MS Data
Multiple indexes are needed, each targeting a set of queries
  o TPB-tree: random point and trajectory queries
  o TPS-tree: spatial range queries
  o kd-tree: range queries on other non-spatial measurements

Figure 3. Structure of the Time-Parameterized B+-Tree (TPB) index.
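The query types named under Indexing MS Data (point, trajectory, and range queries) and the histogram queries discussed earlier can be tried on an ordinary relational engine. The sketch below uses Python's built-in `sqlite3` module. The table name `atom_position`, its columns, and the helper functions are hypothetical, and SQLite's standard B-tree indexes stand in for the TPB-tree and TPS-tree, which are not implemented here.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a toy trajectory (5 atoms, 4 frames)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE atom_position (
            atom_id INTEGER NOT NULL,
            frame   INTEGER NOT NULL,
            x REAL, y REAL, z REAL,
            PRIMARY KEY (atom_id, frame)   -- serves point and trajectory queries
        )
        """
    )
    # Secondary index for per-frame range and histogram queries on x.
    conn.execute("CREATE INDEX idx_frame_x ON atom_position (frame, x)")
    rows = [
        (a, f, a + 0.25 * f, 2.0 * a, 0.5 * f)
        for a in range(5)
        for f in range(4)
    ]
    conn.executemany("INSERT INTO atom_position VALUES (?, ?, ?, ?, ?)", rows)
    return conn


def point_query(conn, atom_id, frame):
    """Position of one atom in one frame."""
    return conn.execute(
        "SELECT x, y, z FROM atom_position WHERE atom_id = ? AND frame = ?",
        (atom_id, frame),
    ).fetchone()


def trajectory_query(conn, atom_id):
    """Positions of one atom across all frames, in time order."""
    return conn.execute(
        "SELECT frame, x, y, z FROM atom_position WHERE atom_id = ? ORDER BY frame",
        (atom_id,),
    ).fetchall()


def range_query(conn, frame, x_min, x_max):
    """Atoms whose x coordinate lies in [x_min, x_max] in a given frame."""
    return [
        row[0]
        for row in conn.execute(
            "SELECT atom_id FROM atom_position "
            "WHERE frame = ? AND x BETWEEN ? AND ? ORDER BY atom_id",
            (frame, x_min, x_max),
        )
    ]


def histogram_query(conn, frame, h):
    """Histogram of x in one frame with bucket width h (non-negative x assumed)."""
    return conn.execute(
        "SELECT CAST(x / ? AS INTEGER) AS bucket, COUNT(*) "
        "FROM atom_position WHERE frame = ? GROUP BY bucket ORDER BY bucket",
        (float(h), frame),
    ).fetchall()


if __name__ == "__main__":
    db = build_demo_db()
    print(point_query(db, 3, 2))
    print(range_query(db, 2, 1.0, 3.0))
    print(histogram_query(db, 2, 2.0))
```

Each query is a single declarative statement, and access can be restricted per table or per view rather than per file, which is the kind of development and security benefit the Summary attributes to a DBMS-based design.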