U N I V E R S I T Y O F S O U T H F L O R I D A Computing Distance Histograms Efficiently in Scientific Databases Yicheng Tu, * Shaoping Chen, *§ and Sagar.

Slides:



Advertisements
Similar presentations
Topological Reasoning between Complex Regions in Databases with Frequent Updates Arif Khan & Markus Schneider Department of Computer and Information Science.
Advertisements

Wavelet and Matrix Mechanism CompSci Instructor: Ashwin Machanavajjhala 1Lecture 11 : Fall 12.
Ch2 Data Preprocessing part3 Dr. Bernard Chen Ph.D. University of Central Arkansas Fall 2009.
Spatial Indexing SAMs. Spatial Indexing Point Access Methods can index only points. What about regions? Z-ordering and quadtrees Use the transformation.
Correlation Search in Graph Databases Yiping Ke James Cheng Wilfred Ng Presented By Phani Yarlagadda.
CS252: Systems Programming Ninghui Li Program Interview Questions.
Explicit Preemption Placement for Real- Time Conditional Code via Graph Grammars and Dynamic Programming Bo Peng, Nathan Fisher, and Marko Bertogna Department.
Fast Algorithms For Hierarchical Range Histogram Constructions
Formulation of an algorithm to implement Lowe-Andersen thermostat in parallel molecular simulation package, LAMMPS Prathyusha K. R. and P. B. Sunil Kumar.
School of Computer Science and Engineering Finding Top k Most Influential Spatial Facilities over Uncertain Objects Liming Zhan Ying Zhang Wenjie Zhang.
Yicheng Tu, § Shaoping Chen, §¥ and Sagar Pandit § § University of South Florida, Tampa, Florida, USA ¥ Wuhan University of Technology, Wuhan, Hubei, China.
Searching on Multi-Dimensional Data
BOAT - Optimistic Decision Tree Construction Gehrke, J. Ganti V., Ramakrishnan R., Loh, W.
Albert Mas Ignacio Martín Gustavo Patow Fast Inverse Reflector Design FIRD Graphics Group of Girona Institut d’Informàtica i Aplicacions Universitat de.
Query Processing in Databases Dr. M. Gavrilova.  Introduction  I/O algorithms for large databases  Complex geometric operations in graphical querying.
Artificial Learning Approaches for Multi-target Tracking Jesse McCrosky Nikki Hu.
With thanks to Zhijun Wu An introduction to the algorithmic problems of Distance Geometry.
Daniel Blackburn Load Balancing in Distributed N-Body Simulations.
Nature’s Algorithms David C. Uhrig Tiffany Sharrard CS 477R – Fall 2007 Dr. George Bebis.
Close Lower and Upper Bounds for the Minimum Reticulate Network of Multiple Phylogenetic Trees Yufeng Wu Dept. of Computer Science & Engineering University.
Distributed Quad-Tree for Spatial Querying in Wireless Sensor Networks (WSNs) Murat Demirbas, Xuming Lu Dept of Computer Science and Engineering, University.
Exploiting Correlated Attributes in Acquisitional Query Processing Amol Deshpande University of Maryland Joint work with Carlos Sam
Data Structures for Orthogonal Range Queries
Chapter 3: Data Storage and Access Methods
Sequence comparison: Significance of similarity scores Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas.
Improving Min/Max Aggregation over Spatial Objects Donghui Zhang, Vassilis J. Tsotras University of California, Riverside ACM GIS’01.
Spatial Data Management Chapter 28. Types of Spatial Data Point Data –Points in a multidimensional space E.g., Raster data such as satellite imagery,
Designing and Evaluating Parallel Programs Anda Iamnitchi Federated Distributed Systems Fall 2006 Textbook (on line): Designing and Building Parallel Programs.
Shape Matching for Model Alignment 3D Scan Matching and Registration, Part I ICCV 2005 Short Course Michael Kazhdan Johns Hopkins University.
Data Structures for Orthogonal Range Queries A New Data Structure and Comparison to Previous Work. Application to Contact Detection in Solid Mechanics.
CSCE350 Algorithms and Data Structure Lecture 17 Jianjun Hu Department of Computer Science and Engineering University of South Carolina
The X-Tree An Index Structure for High Dimensional Data Stefan Berchtold, Daniel A Keim, Hans Peter Kriegel Institute of Computer Science Munich, Germany.
Department of Computer Science & Engineering Abstract:. In our time, the advantage of technology is the biggest thing for current scientific works. One.
A Statistical Approach to Speed Up Ranking/Re-Ranking Hong-Ming Chen Advisor: Professor Shih-Fu Chang.
Lesley Charles November 23, 2009.
U N I V E R S I T Y O F S O U T H F L O R I D A Database-centric Data Analysis of Molecular Simulations Yicheng Tu *, Sagar Pandit §, Ivan Dyedov *, and.
Nearest Neighbor Paul Hsiung March 16, Quick Review of NN Set of points P Query point q Distance metric d Find p in P such that d(p,q) < d(p’,q)
OLAP : Blitzkreig Introduction 3 characteristics of OLAP cubes: Large data sets ~ Gb, Tb Expected Query : Aggregation Infrequent updates Star Schema :
Introduction to The NSP-Tree: A Space-Partitioning Based Indexing Method Gang Qian University of Central Oklahoma November 2006.
Scaling up Decision Trees. Decision tree learning.
Bin Yao (Slides made available by Feifei Li) R-tree: Indexing Structure for Data in Multi- dimensional Space.
Exact indexing of Dynamic Time Warping
By: Gang Zhou Computer Science Department University of Virginia 1 Medians and Beyond: New Aggregation Techniques for Sensor Networks CS851 Seminar Presentation.
Barnes Hut N-body Simulation Martin Burtscher Fall 2009.
CSE554Contouring IISlide 1 CSE 554 Lecture 5: Contouring (faster) Fall 2015.
CSE554Contouring IISlide 1 CSE 554 Lecture 3: Contouring II Fall 2011.
CSE554Contouring IISlide 1 CSE 554 Lecture 5: Contouring (faster) Fall 2013.
Zhijun Wu Department of Mathematics Program on Bio-Informatics and Computational Biology Iowa State University Joint Work with Tauqir Bibi, Feng Cui, Qunfeng.
Metaheuristics for the New Millennium Bruce L. Golden RH Smith School of Business University of Maryland by Presented at the University of Iowa, March.
Barnes Hut N-body Simulation Martin Burtscher Fall 2009.
ITree: Exploring Time-Varying Data using Indexable Tree Yi Gu and Chaoli Wang Michigan Technological University Presented at IEEE Pacific Visualization.
UNC Chapel Hill David A. O’Brien Automatic Simplification of Particle System Dynamics David O’Brien Susan Fisher Ming C. Lin Department of Computer Science.
Similarity Search without Tears: the OMNI- Family of All-Purpose Access Methods Michael Kelleher Kiyotaka Iwataki The Department of Computer and Information.
Click to edit Present’s Name AP-Tree: Efficiently Support Continuous Spatial-Keyword Queries Over Stream Xiang Wang 1*, Ying Zhang 2, Wenjie Zhang 1, Xuemin.
All-to-All Pattern A pattern where all (slave) processes can communicate with each other Somewhat the worst case scenario! 1 ITCS 4/5145 Parallel Computing,
Spatial Data Management
Department of Computer Science and Engineering, USF
Query Processing in Databases Dr. M. Gavrilova
CSCE350 Algorithms and Data Structure
Spatio-temporal Pattern Queries
Data Structures: Segment Trees, Fenwick Trees
Spatial Online Sampling and Aggregation
Great Ideas: Algorithm Implementation
Searching Similar Segments over Textual Event Sequences
Dynamic Data Structures for Simplicial Thickness Queries
All-to-All Pattern A pattern where all (slave) processes can communicate with each other Somewhat the worst case scenario! ITCS 4/5145 Parallel Computing,
DATABASE HISTOGRAMS E0 261 Jayant Haritsa
Continuous Density Queries for Moving Objects
Topological Signatures For Fast Mobility Analysis
Chapter 4 . Trajectory planning and Inverse kinematics
Presentation transcript:

U N I V E R S I T Y O F S O U T H F L O R I D A Computing Distance Histograms Efficiently in Scientific Databases Yicheng Tu, * Shaoping Chen, *§ and Sagar Pandit * * University of South Florida, Tampa, FL, U.S.A. § Wuhan University of Technology, Wuhan, Hubei, China Molecular Simulations (MS) Large scale biological structures are represented using all the individual atoms. Data is stored in single or multiple trajectory databases containing time frames. Each frame is a sequential list of atoms with their positions, velocities, perhaps forces, masses, and types. Dataset is very large: millions of atoms, tens of thousands of frames. Similar methodology in other sciences: astronomy, material science, civil engineering Querying a MS Database Mainstream queries: analytical queries (beyond linear aggregates) The m-body correlation functions are very popular o Requires O(N m ) computational time (N is the number of atoms) Of special importance is the Radial Distribution Function (RDF) o Often computed as a spatial distance histogram (SDH) o 2-body function  O(N 2 ) time needed for a brute force algorithm Figure 1. A simulated hydrated dipalmitoylphosphatidylcholine bilayer system. Our approach Main idea: avoid the calculation of pairwise distances Observation: two groups of points can be processed in one shot (i.e., resolved) if the range of all inter-group distances falls into a histogram bucket Processing Histogram Queries O(N ) not good enough for large N? Our solution: approximate algorithms based on our analytical model Time: Stop before we reach the leaf nodes Approximation: for irresolvable nodes, distribute the distance counts into the overlapping buckets heuristically Correctness: consult the table we generate from the model Figure 3. Solving a histogram query (bucket width h = 3) using two density maps (a density map is the counts of all nodes of a whole tree level) generated from raw data (left) with low (middle) and high (right) resolution. Summary Distance histogram is an important query in simulation databases We propose an algorithm based on a quad-tree-based data structure Our algorithm outperforms the brute-force approach We develop an approximate algorithm with guaranteed error bound and very low time complexity References 1 Feig et al, Future Generation Computer Systems, 16(1): , (1999) 2 Ng et al, Future Generation Computer Systems, 22(6): , (2006) Contacts: Problem Statement Given coordinates of N points, draw a histogram of all pairwise distances - total distance counts will be N(N- 1)/2 We focus on the standard SDH, in which o domain of distance [0, L max ] o Buckets are of the same width: [0,p), [p, 2p), … o Query has one single parameter: bucket width p of the histogram, or total number of buckets l = L max /p State-of-the-art and Objective Popular simulation packages such as GROMACS 1 all adopt the brute force way of computing SDH Can we beat O(N 2 )? Figure 2. A snapshot of the Millennium simulation of 10 billion stars. Photo source: bucket i Histogram[i] += 20*15; The DM-SDH algorithm Organize all data into a Quad-tree (2D data) or Oct-tree (3D data). Cache the atoms counts of each tree node Try resolving all pairs of nodes on a tree level M 0 o If not resolvable, recursively resolve all pairs of children nodes Complexity analysis of DM-SDH algorithm Based on a geometric modeling approach The main result: The above result gives the following analysis 1.α(m) is the percentage of pairs of nodes that are NOT resolvable on level m of the quad(oct)tree. 2.We managed to derive a closed- form for α(m) Theorem 1: When the particles are reasonably distributed, * the time complexity of DM-SDH is O(N (2d-1)/2d ). O(N 1.5 ) for 2D data and O(N ) for 3D data Only in rare cases is the data not reasonably distributed Figure 4. Running time of the DM-SDH algorithm for different 2D (upper figures) and 3D (lower figures) datasets Figure 5. Running time (a) and correctness (b-d) of the approximate algorithm