Slide 1: Analyzing Big Data: A Case Study of Processing Complex Functions against Large-Scale Molecular Simulation Outputs
Yicheng Tu
Department of Computer Science and Engineering, USF
February 23, USF-ISSG
Slide 2: Data Management for Computational Sciences
- Scientific tasks generate huge volumes of data to analyze
  - Genome sequencing: 600 GB – 6 TB in 2 – 3 days
  - CERN Large Hadron Collider: 320 MB/sec
- Queries are often analytical and complex to compute
- Molecular (or particle) simulations
  - Frame (snapshot): information about all particles at a time instant t
  - A snapshot of the simulation is stored at each time instant
  - Each frame is analyzed with complex queries
Slide 3: Scientific Simulations
- Astronomy
- Biology/chemistry – molecular simulation (MS)
Slide 4: Database-Centric Molecular Simulation
- "Congrats! Can you share the data?" "Sure, here it is. But how? UPS?"
- A portal for community-wide data sharing and analysis
- Take advantage of existing database technologies
- But modern DBMSs need more functionality: complex analytics vs. simple aggregates
Slide 5: MS Analytics
- 1-frame analytics
- 1-body functions
- 2-frame analytics
Slide 6: Spatial Distance Histogram (SDH) Problem
- Given N data points and a user-specified distance w, build the histogram of all pairwise distances
- Domain of distances: [0, Lmax]
- Buckets have the same width: [0, w), [w, 2w), …
- The query has a single parameter: the bucket width w of the histogram, or equivalently the total number of buckets l = Lmax/w
- The brute-force approach takes O(N^2) time, and N can be very large (see the sketch below)
- What if we need to compute the SDH for every simulated frame? There can be more than 10,000 frames
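To make the problem statement concrete, here is a minimal brute-force sketch in Python; the function name and arguments are mine, not from the talk. It makes the quadratic pair loop explicit:

    import math

    def sdh_brute_force(points, w, l_max):
        """Naive SDH: examine all N*(N-1)/2 pairwise distances -- O(N^2)."""
        num_buckets = math.ceil(l_max / w)
        hist = [0] * num_buckets
        n = len(points)
        for i in range(n):
            for j in range(i + 1, n):
                d = math.dist(points[i], points[j])
                hist[min(int(d / w), num_buckets - 1)] += 1  # clamp boundary case
        return hist

For example, sdh_brute_force([(0, 0), (1, 0), (4, 3)], w=3.0, l_max=6.0) returns [2, 1]: two distances below 3 and one in [3, 6).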
Slide 7: Efficient SDH Algorithms
- Main idea: avoid computing individual pairwise distances
- Observation: two groups of points can be processed in one shot (constant time) if the range of all inter-group distances falls into a single histogram bucket (see the check sketched below)
- We say these two groups are resolved into that bucket
[Figure: a group of 10 points and a group of 15 points whose distance range falls entirely in bucket i, so Histogram[i] += 10*15]
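The resolvability test only needs the minimum and maximum possible distance between two cells. A small Python sketch, with illustrative names (cells given as per-dimension low/high coordinate tuples):

    import math

    def distance_bounds(a_lo, a_hi, b_lo, b_hi):
        """Min and max Euclidean distance between two axis-aligned cells."""
        min_sq = max_sq = 0.0
        for d in range(len(a_lo)):
            gap = max(b_lo[d] - a_hi[d], a_lo[d] - b_hi[d], 0.0)  # 0 when cells overlap in dim d
            span = max(a_hi[d] - b_lo[d], b_hi[d] - a_lo[d])      # farthest apart in dim d
            min_sq += gap * gap
            max_sq += span * span
        return math.sqrt(min_sq), math.sqrt(max_sq)

    def resolvable(a_lo, a_hi, b_lo, b_hi, w):
        """True if all inter-cell distances fall into one bucket of width w."""
        d_min, d_max = distance_bounds(a_lo, a_hi, b_lo, b_hi)
        return int(d_min / w) == int(d_max / w)

When resolvable(...) is true, the pair contributes count(A) * count(B) to that single bucket in constant time.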
Slide 8: The DM-SDH Algorithm
- Organize all data into a quad-tree (2D data) or oct-tree (3D data); cache the atom counts of each tree node
- Density map: all the counts in one tree level

    start from one proper density map M0
    FOR every pair of nodes A and B in M0
        resolve A and B
        IF A and B are not resolvable THEN
            FOR each child node A' of A
                FOR each child node B' of B
                    resolve A' and B'

Based on the above observation, we came up with an algorithm we call the density-map-based SDH algorithm (DM-SDH). The idea: we organize all the particle data into a quad-tree or, for 3D data, an oct-tree. For each tree node, we cache the count of particles in that node; all the nodes in one level of the tree are called a density map. The algorithm works as follows: we choose an appropriate density map M0 and try to resolve all pairs of cells in it. For any pair of cells that is not resolvable, we recursively resolve all pairs of their child nodes. The recursion stops at the lowest level of the tree, the density map with the smallest cells. When cells on the lowest level are still not resolvable, we have to calculate the distances between the particles in those cells.
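The following compact Python sketch implements this recursion end to end. It is an illustration, not the authors' implementation: the Node class, the leaf_size threshold, and the function names are mine, and it assumes num_buckets * w exceeds the largest possible distance and that the points are not all coincident.

    import math
    from itertools import combinations

    class Node:
        """One density-map cell: an axis-aligned box caching its particle count."""
        def __init__(self, lo, hi, points, leaf_size=8):
            self.lo, self.hi, self.points = lo, hi, points
            self.count = len(points)
            self.children = []
            if len(points) > leaf_size:
                dim = len(lo)
                mid = tuple((l + h) / 2 for l, h in zip(lo, hi))
                groups = {}
                for p in points:  # send each point to one of the 2^d child cells
                    key = tuple(p[d] >= mid[d] for d in range(dim))
                    groups.setdefault(key, []).append(p)
                for key, pts in groups.items():
                    c_lo = tuple(mid[d] if key[d] else lo[d] for d in range(dim))
                    c_hi = tuple(hi[d] if key[d] else mid[d] for d in range(dim))
                    self.children.append(Node(c_lo, c_hi, pts, leaf_size))

    def distance_range(a, b):
        """Min/max distance between the bounding boxes of two nodes."""
        min_sq = max_sq = 0.0
        for d in range(len(a.lo)):
            gap = max(b.lo[d] - a.hi[d], a.lo[d] - b.hi[d], 0.0)
            span = max(a.hi[d] - b.lo[d], b.hi[d] - a.lo[d])
            min_sq += gap * gap
            max_sq += span * span
        return math.sqrt(min_sq), math.sqrt(max_sq)

    def resolve(a, b, w, hist):
        """Resolve pair (a, b) in one shot if possible, else recurse on children;
        at the leaf level, fall back to particle-to-particle distances."""
        d_min, d_max = distance_range(a, b)
        lo_b, hi_b = int(d_min / w), int(d_max / w)
        if lo_b == hi_b:                        # resolved into bucket lo_b
            hist[lo_b] += a.count * b.count
        elif a.children or b.children:          # descend one density-map level
            for ca in (a.children or [a]):
                for cb in (b.children or [b]):
                    resolve(ca, cb, w, hist)
        else:                                   # both leaves, not resolvable
            for p in a.points:
                for q in b.points:
                    hist[int(math.dist(p, q) / w)] += 1

    def dm_sdh(points, lo, hi, w, num_buckets):
        hist = [0] * num_buckets
        def walk(n):
            if not n.children:                  # intra-leaf distances, computed directly
                for p, q in combinations(n.points, 2):
                    hist[int(math.dist(p, q) / w)] += 1
                return
            for c in n.children:
                walk(c)
            for a, b in combinations(n.children, 2):
                resolve(a, b, w, hist)
        walk(Node(lo, hi, points))
        return hist

    # Example: hist = dm_sdh(pts, (0.0, 0.0), (100.0, 100.0), w=5.0, num_buckets=29)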
Slide 9: A Case Study
Now let us go through a case study to show how the algorithm works. Here we have a small dataset containing over 100 particles. By organizing the simulation space into a quad-tree, we can generate two density maps: one of lower resolution and one of higher resolution. Assume the bucket width of the histogram is 3 and the cell side length in the low-resolution map is 2; the side length then becomes 1 in the high-resolution map.

We start the algorithm by studying the cells in map 1. First we do the intra-cell processing: all distances between two particles within a cell are smaller than 2√2, the length of the cell diagonal, so they fall in the range [0, 2√2]. Then we try to resolve all pairs of cells, for example, the cell in the upper-left corner against the cell in the bottom corner of map 1. The minimum distance between these two cells is 2 and the maximum distance is √52; this range overlaps with more than one bucket, so the two cells are not resolvable. In this case we try to resolve all pairs of their subcells on the next level of the tree, density map 2. There are 4×4 = 16 such pairs to resolve, and fortunately some of them are resolvable. For example, one pair of subcells gives a distance range of [√10, √34], which falls entirely into the second bucket; we therefore increment the count of the second bucket by 6×7 = 42.

One thing to notice: when we reach two cells on the leaf level that are still not resolvable, we have to retrieve all the particles in those cells and compute their pairwise distances to determine which bucket each distance belongs to. Distance calculations are needed for those non-resolvable nodes on the leaf level!
Slide 10: How Good is DM-SDH?
- Quantitative analysis of DM-SDH based on a geometric modeling approach
- The main result: α(m), the percentage of pairs of nodes that are NOT resolvable on level m of the quad- (oct-) tree, decreases by about half with each additional level visited, i.e., α(m) = O(2^−m)
- We managed to derive a closed form for α(m)
- α(m) decreases exponentially as more density maps are visited

An important problem is to analyze the performance of DM-SDH. We carried out a rigorous analysis based on a geometric modeling approach; skipping the tedious math, the main analytical result is this. Let m be the number of tree levels we visit for the purpose of resolving cells (more pairs of cells resolve as we visit more levels), and let α(m) be the percentage of cell pairs that are still not resolvable. Our analysis shows that this percentage decreases by half with each additional level of density map visited, a result derived from the closed form we obtained for α(m).
Slide 11: Time Complexity
Easily, we have:
- Theorem 1: let d be the dimension of the data; the time spent on resolving nodes in DM-SDH is O(N^((2d−1)/d))
- Theorem 2: the time spent on calculating distances of atoms in leaf nodes that are not resolvable is also O(N^((2d−1)/d))
- That is O(N^1.5) for 2D data and O(N^1.667) for 3D data
- The theorems hold for all reasonably distributed data

The above result leads directly to the time complexity of the DM-SDH algorithm. Since the number of non-resolvable cell pairs decreases exponentially, the time spent on resolving cells is on the order of N to the (2d−1)/d-th power. Recall that we still have to calculate distances in the leaf nodes; the time spent on that is of the same complexity.
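To see where the exponent (2d−1)/d comes from, here is a short, informal derivation, consistent with the halving of α(m) stated on the previous slide (constants omitted; this is a sketch, not the paper's proof):

    % Each unresolved pair spawns 2^d x 2^d child pairs on the next level,
    % and roughly half of all pairs resolve there (alpha halves per level):
    W(m+1) \approx W(m) \cdot 2^{2d} \cdot \tfrac{1}{2} = W(m) \cdot 2^{2d-1}.
    % The tree has H levels with 2^{dH} = \Theta(N) leaf cells, so the
    % geometric sum over levels is dominated by its last term:
    \sum_{m=0}^{H} 2^{(2d-1)m} = O\big(2^{(2d-1)H}\big) = O\big(N^{(2d-1)/d}\big).

Plugging in d = 2 gives O(N^(3/2)), and d = 3 gives O(N^(5/3)), matching the figures on the slide.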
Slide 12: Experimental Verification (2D Data)
Here are some experimental results. In this experiment, we compare the running time of our algorithm with that of the quadratic brute-force approach on 2D data. Graphs (a) and (b) show results for synthetic datasets; (c) shows results for a real simulation dataset. In all graphs, the running time and the dataset size N are plotted on logarithmic scales, so the slope of a line represents the time complexity. The solid red line is the brute-force algorithm; it has a slope of 2. The other lines represent our algorithm under different bucket widths. As we can see, the slope of the DM-SDH lines is always smaller than 2 and close to 1.5; we also draw a dotted black line with a slope of exactly 1.5 for reference. As the bucket width decreases, the time needed to compute the SDH increases, but the slope stays the same. When the bucket width is very small, the advantage of DM-SDH shows only for large datasets: for example, when N is smaller than a million, DM-SDH takes the same time as, or even longer than, the brute-force approach.
Slide 13: Experimental Verification (3D Data)
The same trends can be observed for 3D data, except that the line slope is closer to 1.667.
Slide 14: Faster Algorithms – ADM-SDH
- Is O(N^1.667) not good enough for large N?
- Our solution: approximate algorithms based on our analytical model
- Time: stop before we reach the leaf nodes
- Approximation: for non-resolvable nodes, distribute the distance counts into the overlapping buckets heuristically
- Correctness: consult the table we generate from the model

Some might argue that the DM-SDH algorithm, although faster than the brute-force approach, still needs a very long time to compute the histogram for a large simulation system. This motivates us to develop even faster algorithms for the same problem, trading some accuracy of the query results for processing speed. The idea comes from the analytical model developed for DM-SDH. Instead of trying to resolve all cell pairs until we reach the lowest level of the quad-tree, we stop the recursion after a certain number of levels, once we are sure the number of distances resolved so far is enough to meet the correctness bound. When we stop on a certain density map, there will be cells that are not resolvable; for those, we use greedy methods to distribute the distance counts into the relevant buckets. But how do we know when to stop the recursion? Fortunately, our previous analysis provides an answer.
Slide 15: Approximation
[Figure: the simulation space organized into density maps DM0, DM1, DM2 with cached cell counts; a non-resolved pair of cells with counts 10 and 15 contributes 10*15 distances, which are spread over buckets i, i+1, and i+2 as x%, y%, and z% according to the pair's min/max distance range]
Slide 16: Non-resolvable Nodes
- Not resolvable = the range of distances overlaps with multiple buckets
- Take a guess in distributing the distance counts into these buckets
- Several heuristics (see the sketch below):
  - leftmost (or rightmost) bucket only
  - even distribution
  - proportional distribution

When we stop the recursion in the approximate algorithm, some cells are left unresolved; for those cells, we guess how their distance counts distribute over the relevant buckets. For example, the figure on the slide shows the distance range of two cells overlapping with three buckets (i, i+1, and i+2). A number of heuristics can be used to make this guess: one, always put all the counts into the leftmost bucket; two, distribute the counts evenly; or three, distribute the counts proportionally to the overlap.
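The three heuristics fit in one small Python function; a minimal sketch with illustrative names, assuming the pair's distance range [d_min, d_max] has d_max > d_min (otherwise the pair is resolvable):

    def distribute(count, d_min, d_max, w, hist, method="proportional"):
        """Spread `count` distances of a non-resolvable cell pair over the
        buckets overlapped by its distance range [d_min, d_max]."""
        first, last = int(d_min / w), int(d_max / w)
        if method == "leftmost":
            hist[first] += count                  # all counts to the leftmost bucket
        elif method == "even":
            share = count / (last - first + 1)    # same share for every bucket
            for b in range(first, last + 1):
                hist[b] += share
        else:                                     # proportional to range overlap
            for b in range(first, last + 1):
                overlap = min(d_max, (b + 1) * w) - max(d_min, b * w)
                hist[b] += count * overlap / (d_max - d_min)

The proportional variant treats the distances as if spread uniformly over [d_min, d_max], which is exactly the x%/y%/z% split pictured on the previous slide.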
Slide 17: What Our Geometric Model Says …
Remember that we derived a closed-form formula for α, the percentage of unresolved distances. From that formula we can calculate the unresolved distances after visiting m levels, under a given query parameter: the number of buckets l (equivalently, the bucket width). Some of the numbers are shown in the table on the slide; they are the percentages of resolvable distances, in other words, 1 − α. Assume we are computing an SDH with 128 buckets and need the algorithm to return an SDH with 97% correctness, meaning 97% of the distances are guaranteed to have been resolved (we are unsure about only 3% of the total distances). From the table we find the first row with a value greater than 97%, which here is the row m = 5. So, to get 97% correctness, we can stop the recursion of DM-SDH after visiting 5 levels of density maps.

Example: with l = 128, a correctness guarantee of 97% gives m = 5, i.e., we only need to visit 5 levels of the tree (starting from M0).
Slide 18: How Much Time is Needed?
- Given an error bound ε, the number of levels to visit is m = log2(1/ε)
- No time is spent on distance calculations
- The time to resolve nodes in the tree is O(I · (1/ε)^(2d−1)), where I is the number of node pairs in M0
- The time is independent of the system size N!

Since the percentage of unresolved distances decreases by half with each increase of m, we easily get a formula relating a given error bound ε to m: m = log2(1/ε). With this, further analysis of the approximate algorithm shows that the time complexity is proportional to the (2d−1)-th power of 1/ε. The good news: the time complexity depends only on the error bound ε, not on the size N of the simulation system.
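A two-line Python illustration of the stopping rule (the exact table on the previous slide comes from the closed form for α and can differ slightly from this asymptotic rule):

    import math

    def levels_to_visit(eps):
        """m = log2(1/eps): density-map levels ADM-SDH visits so that at most
        a fraction eps of the distances stays unresolved."""
        return math.ceil(math.log2(1 / eps))

    print(levels_to_visit(1 / 32))  # -> 5, roughly matching the l = 128, 97% example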
Slide 19: ADM-SDH Performance – Accuracy
[Figure: error rates of ADM-SDH, where h_i is the count of bucket i in the correct histogram and h'_i is the count of bucket i in the obtained histogram]
Slide 20: ADM-SDH Performance – Efficiency
Slide 21: End of Story?
- I is not related to N, but it is related to the bucket width w
- Worst cases: slower than brute force
- So far we have not considered multiple frames
- We need a more practical solution: get away from resolving cells, use heuristics only, and good heuristics at that …
Slide 22: Our Method
- Can special features of the data in the frames help?
- Spatial uniformity: when the distribution of points in cells is known, the distribution of distances into buckets can be computed more accurately
- Temporal uniformity: when points do not change in cells of consecutive frames, unnecessary computations can be skipped
Slide 23: Spatial Uniformity
- Identify uniform regions, in which particles are uniformly distributed
- Derive the probability distribution function (PDF) of distances between uniform cells
- Assign the actual distance counts to buckets according to the PDF (a sampling-based sketch follows below)
[Figure: actual vs. PDF-based vs. proportional assignment of counts x%, y%, z% to buckets i, i+1, i+2]
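The talk derives this PDF analytically; as a hedged stand-in, the sketch below approximates it by Monte-Carlo sampling from two uniform cells. Function name, cell representation, and sample count are mine:

    import math, random

    def pdf_assign(count, cell_a, cell_b, w, hist, samples=10_000):
        """Spread `count` distances over buckets according to an empirical
        distance PDF for two cells whose points are uniformly distributed.
        Cells are lists of per-dimension (low, high) bounds; sampling here
        is only a stand-in for the analytical PDF."""
        tally = {}
        for _ in range(samples):
            p = [random.uniform(lo, hi) for lo, hi in cell_a]
            q = [random.uniform(lo, hi) for lo, hi in cell_b]
            b = int(math.dist(p, q) / w)
            tally[b] = tally.get(b, 0) + 1
        for b, t in tally.items():
            hist[b] += count * t / samples

Unlike the proportional heuristic, this follows the actual shape of the distance distribution between the two cells, which is what makes the assignment more accurate.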
Slide 25: Temporal Uniformity
- Particles often interact with each other in groups
- They move randomly within a small sub-region
- The same sub-regions appear in consecutive frames, and the total number of particles in a sub-region may not change
- Such pairs of cells can be skipped in the density map (DM)
[Figure: density maps DMi of frame i and frame i+1, and the cell-by-cell ratio density map (RDM) derived from them]
Slide 26: Temporal Uniformity is Common
Distribution of density ratios in two consecutive frames randomly chosen from an 890K-atom MS dataset.
Slide 27: Temporal Uniformity
Start from one proper ratio density map M0 (a Python sketch follows below):

    FOR every pair of cells A and B in M0
        IF ratio product RA × RB == 1 THEN
            no change to the histogram
        ELSE
            either go for a spatial uniformity check
            or proportional distribution of distances
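A minimal Python sketch of this RDM pass, under my own assumptions: the same cell grid is used in both frames, cells are keyed by id, and recompute_pair is a placeholder for the fallback (the spatial-uniformity check or proportional distribution from the earlier slides):

    def rdm_pass(prev_counts, cur_counts, recompute_pair):
        """Skip cell pairs whose ratio product is 1; recompute the rest.
        prev_counts/cur_counts map cell ids to particle counts in two
        consecutive frames."""
        cells = sorted(cur_counts)
        for i, a in enumerate(cells):
            for b in cells[i + 1:]:
                # R_A * R_B == 1  <=>  n'_A * n'_B == n_A * n_B (integer-exact)
                if cur_counts[a] * cur_counts[b] == prev_counts[a] * prev_counts[b]:
                    continue  # pair contributes the same distance counts as before
                recompute_pair(a, b)

Testing the ratio product via the equivalent integer product avoids floating-point equality issues.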
Slide 28: Experimental Results
- A1: ADM-SDH algorithm
- A2: temporal uniformity
- A3: spatial uniformity
- A4: complete algorithm
(a) – (b): dataset of 890,000 atoms; (c) – (d): dataset of 8,000,000 atoms
Slide 29: GPU Implementations
- A GPU hosts thousands of cores
- Hierarchy of memories with different access latencies
- Data is processed in SIMD fashion
- Multiple threads access memory in parallel
[Figure: GPU memory hierarchy – per-core register files and instruction cache, shared memory / L1 cache per multiprocessor, L2 cache and global memory on the GPU device, connected to the host CPU and main memory]
Slide 30: DM-SDH on GPUs
- Load all density maps into global memory
- Shared memory is loaded by the threads in a CUDA block: each thread loads information about one cell
- Each thread processes a pair of cells
- Only one cell is replaced per loop iteration until all cells are processed; then a new cell is loaded and paired with the remaining distinct cells (see the sketch below)
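The tiling pattern is easier to see stripped of CUDA details. Below is a CPU-side Python illustration of the same access pattern, not the authors' kernel: a "block" keeps a tile of cells in fast (shared) memory and pairs them with each other and with cells streamed in from slow (global) memory; process_pair stands for resolving one cell pair, as in the DM-SDH sketch earlier.

    def pair_cells_tiled(cells, tile, process_pair):
        """Cover all unordered cell pairs, touching each cell few times:
        intra-tile pairs first, then stream the remaining cells past the tile."""
        for start in range(0, len(cells), tile):
            group = cells[start:start + tile]       # "shared memory" load
            for i in range(len(group)):             # intra-group cell pairs
                for j in range(i + 1, len(group)):
                    process_pair(group[i], group[j])
            for other in cells[start + tile:]:      # inter-group cell pairs:
                for c in group:                     # stream one cell at a time
                    process_pair(c, other)

On the GPU, each thread of the block plays the role of one slot of `group`, and the streamed cell is the one being replaced per loop iteration.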
Slide 31: DM-SDH Algorithm on GPUs
[Figure: cell groups Gi, Gj, Gk of the density-map (DM) tree are loaded from global memory into shared memory; intra-group and inter-group cell pairs are processed, and the results accumulate in histogram buckets]
Slide 32: Experimental Results – Performance
MM: CPU version; GM: GPU version
Slide 33: Experimental Results – Performance
MM: CPU version; GM: GPU global memory; SM: GPU shared memory
Slide 34: Summary
- Big data: the 3 Vs + big complexity
- Function computation as an integrated functionality of scientific DBMSs
- Case study: the spatial distance histogram (SDH)
- Algorithmic design
  - DM-SDH: accurate, with lower complexity than brute force
  - ADM-SDH: a randomized algorithm with very low complexity and controllable error bounds
  - (Almost) real-time SDH processing by exploiting spatiotemporal locality
- Parallel computing: real-time SDH computation on GPUs
Slide 35: Relevant Publications
[TKDE14] A. Kumar, V. Grupcev, Y. Yuan, Y. Tu, J. Huang, and G. Shen. Computing Spatial Distance Histograms for Large Scientific Datasets On-the-fly. To appear in IEEE Transactions on Knowledge and Data Engineering (TKDE).
[TKDE13] V. Grupcev, Y. Yuan, Y. Tu, J. Huang, S. Chen, S. Pandit, and M. Weng. Approximate Algorithms for Computing Distance Histograms with Accuracy Guarantees. IEEE Transactions on Knowledge and Data Engineering (TKDE), 25(9), September 2013.
[VLDBJ11] S. Chen, Y. Tu, and Y. Xia. Performance Analysis of a Dual-Tree Algorithm for Computing Spatial Distance Histograms. The VLDB Journal, 20(4), August 2011.
[EDBT12] A. Kumar, V. Grupcev, Y. Tu, Y. Yuan, and G. Shen. Distance Histogram Computation Based on Spatiotemporal Uniformity in Scientific Data. In Proceedings of the 15th International Conference on Extending Database Technology (EDBT), Berlin, Germany, March 26-30, 2012.
[ICDE09] Y. Tu, S. Chen, and S. Pandit. Computing Distance Histograms Efficiently in Scientific Databases. In Proceedings of the 25th International Conference on Data Engineering (ICDE), Shanghai, China, March 2009.
Slide 36: Acknowledgements
Sponsors: NIH/NIGMS, NSF, NVidia
Collaborators: Sagar Pandit, Yongke Yuan, Shaoping Chen
Graduate students: Anand Kumar, Vladimir Grupcev, Jin Huang, Purushottam Panta, Chengcheng Mou