Published by Magdalen Hoover. Modified over 9 years ago.
1
ICPP 2012
Indexing and Parallel Query Processing Support for Visualizing Climate Datasets
Yu Su*, Gagan Agrawal*, Jonathan Woodring†
*The Ohio State University
†Los Alamos National Laboratory
2
Outline
Motivation and Introduction
Background
System Overview and Optimization
Experiment
Conclusion
3
Motivation
Science is becoming increasingly data driven
Strong desire for efficient data visualization
Challenges:
  – Fast data generation speed
  – Slow disk I/O and network speed
  – Worse performance during visualization
  – Different kinds of subsetting requests
Visualizing all the data is difficult and unnecessary
4
Data Subsetting in ParaView
A widely used data analysis and visualization application
Problems: Load + Filter mode
  – Load the entire data set
  – Data filtering at the visualization level
      Threshold Filter: based on values
      Extract Subset Filter: based on dimension info
  – Grid transformation needed during filtering
      Regular Structured Grid -> Unstructured Grid
5
A Faster Solution
Subset at the I/O level
  – User specifies the subset in one query for both dimension and value ranges
  – Reduced I/O time and memory footprint
SQL queries in ParaView
  – Query over dimensions: API support
  – Query over values: indexing
Bitmap Indices and Parallel Bitmap Indices
  – Efficient subsetting over values
6
Background: Bitmap Indexing
FastBit: widely used in scientific data management
Suitable for floating-point values, by binning small value ranges
Run-length compression (WAH, BBC)
  – Compresses a bitvector based on consecutive runs of 0s or 1s
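The binning and run-length ideas on this slide can be sketched in a few lines of Python. This is a minimal illustration, not FastBit's implementation: the function names, bin edges, and sample values are all assumptions, and the encoder is a plain run-length scheme rather than the word-aligned (WAH) or byte-aligned (BBC) formats named above.

```python
from bisect import bisect_right

def build_bitmap_index(values, bin_edges):
    """Return one bitvector (list of 0/1) per bin.

    Bin i covers the half-open range [bin_edges[i], bin_edges[i + 1]).
    Values outside all bins set no bit.
    """
    n_bins = len(bin_edges) - 1
    bitvectors = [[0] * len(values) for _ in range(n_bins)]
    for pos, v in enumerate(values):
        b = bisect_right(bin_edges, v) - 1
        if 0 <= b < n_bins:
            bitvectors[b][pos] = 1
    return bitvectors

def run_length_encode(bits):
    """Collapse consecutive equal bits into (bit, run_length) pairs."""
    runs = []
    for bit in bits:
        if runs and runs[-1][0] == bit:
            runs[-1] = (bit, runs[-1][1] + 1)
        else:
            runs.append((bit, 1))
    return runs

# Hypothetical float data and bin edges, chosen only for illustration.
temps = [0.5, 0.7, 1.2, 1.9, 2.1, 2.4, 0.1, 0.3]
edges = [0.0, 1.0, 2.0, 3.0]

index = build_bitmap_index(temps, edges)
encoded = [run_length_encode(bv) for bv in index]
```

Because neighbouring grid points in scientific data often fall into the same bin, each bitvector tends to contain long runs, which is what makes run-length schemes effective here.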
7
Bitmap Index and Dim Subset
Run-length compression (WAH, BBC)
  – Good: compression rate, fast bitwise operations
  – Bad: the ability to locate a dim subset is lost
Two traditional methods:
  – With bitmap indices: post-filter on dim info
  – Without bitmap indices: post-filter on values
Two-phase optimization:
  – Index generation: distributed indices over sub-blocks
  – Index retrieval: transform dim subsetting info into bitvectors and support fast bitwise operations
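The index-retrieval idea, turning the dimension subset into a bitvector so that it can be combined with the value bitvector by a bitwise AND instead of a post-filter, might look like the sketch below. It assumes a 2-D row-major grid and uses Python integers as uncompressed bitvectors. All names and data are hypothetical, and `value_mask` stands in for the result an actual bitmap index would produce by OR-ing the bitvectors of the matching bins.

```python
def dim_subset_mask(shape, ranges):
    """Bitvector (Python int) with bit p set when row-major position p
    lies inside the half-open [lo, hi) range of every dimension."""
    _, nx = shape
    (y_lo, y_hi), (x_lo, x_hi) = ranges
    mask = 0
    for y in range(y_lo, y_hi):
        for x in range(x_lo, x_hi):
            mask |= 1 << (y * nx + x)
    return mask

def value_mask(values, lo, hi):
    """Stand-in for the index lookup: bit p is set when lo <= values[p] < hi."""
    mask = 0
    for pos, v in enumerate(values):
        if lo <= v < hi:
            mask |= 1 << pos
    return mask

def positions(mask):
    """List the positions of the set bits, lowest first."""
    out = []
    pos = 0
    while mask:
        if mask & 1:
            out.append(pos)
        mask >>= 1
        pos += 1
    return out

# Hypothetical 3 x 4 grid stored in row-major order.
grid = [0.2, 1.5, 1.7, 0.4,
        1.1, 0.9, 1.3, 1.8,
        0.3, 1.6, 0.8, 1.2]

value_bits = value_mask(grid, 1.0, 2.0)               # value condition
dim_bits = dim_subset_mask((3, 4), ((1, 3), (1, 3)))  # rows 1-2, columns 1-2
hits = positions(value_bits & dim_bits)               # one bitwise AND, no post-filter
```

With both conditions expressed as bitvectors, the combined query is a single AND, so no per-element check against the dimension ranges is needed after the index lookup.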
8
System Overview
Parse the SQL expression
Parse the metadata file
Generate the query request
Generate the index if it does not exist yet; retrieve from the index after that
9
Optimization 1: Distributed Index Generation
Study the relationship between queries and partitions
Partition the data based on query preference
10
Index Partition Strategy
α rate: participation rate of data elements
  – Number of elements in indexing / total data size
  – Worst: all elements have to be involved
  – Ideal: the elements involved are exactly the dim subset
Partition strategies:
  – Strategy 1: α is proportional to the dim subsetting percentage and inversely proportional to the number of partitions.
  – Strategy 2: In the general case, where subsetting over each dimension has a similar probability, the partition should have equal preference over each dim.
  – Strategy 3: If queries only include a subset of the dims, the partition should also be based on these dims.
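A small calculation can make the α rate concrete. The sketch below assumes that a partition participates whenever it overlaps the query's dimension ranges and that each dimension is split into equal-sized blocks. The function name, grid shape, and query are hypothetical and chosen only to illustrate Strategy 3.

```python
from math import prod

def alpha_rate(shape, parts, query):
    """Participation rate: elements in the touched partitions / total elements.

    shape: number of elements per dimension
    parts: number of equal partitions per dimension (must divide shape evenly)
    query: half-open [lo, hi) range per dimension
    """
    touched = []
    for n, p, (lo, hi) in zip(shape, parts, query):
        block = n // p                 # elements per partition along this dim
        first = lo // block            # first partition the range overlaps
        last = (hi - 1) // block       # last partition the range overlaps
        touched.append((last - first + 1) * block)
    return prod(touched) / prod(shape)

shape = (8, 8, 8)
query = ((0, 8), (0, 8), (0, 2))       # subset only on the third dim (25%)

unpartitioned = alpha_rate(shape, (1, 1, 1), query)  # every element participates
even_split = alpha_rate(shape, (2, 2, 2), query)     # 8 partitions, equal preference
third_dim = alpha_rate(shape, (1, 1, 8), query)      # 8 partitions, all on the queried dim
```

Under these assumptions both partitioned layouts use eight partitions, but splitting only along the queried dimension brings α down to the subset fraction itself (0.25), while the even split stays at 0.5 and the unpartitioned index at 1.0.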
11
Optimization 2: Index Retrieval
Post-filter?
12
Parallel Index Architecture
L1: data file
L2: variable
L3: data block
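One way to picture the three levels is as a nested lookup from data file to variable to data block. The sketch below is an assumption about how such a hierarchy could be held in memory, not the layout used by the system. The file name, variable names, block ids, and bitvector values are hypothetical.

```python
from collections import defaultdict

def make_index():
    """Three-level index: data file -> variable -> block id -> bitvectors."""
    return defaultdict(lambda: defaultdict(dict))

def add_block(index, data_file, variable, block_id, bitvectors):
    """Register the bitvectors built for one data block of one variable."""
    index[data_file][variable][block_id] = bitvectors

def blocks_for(index, data_file, variable, block_ids):
    """Return only the per-block bitvectors that a query needs."""
    per_variable = index[data_file][variable]
    return {b: per_variable[b] for b in block_ids if b in per_variable}

index = make_index()
add_block(index, "ocean.nc", "TEMP", 0, [0b1100, 0b0011])
add_block(index, "ocean.nc", "TEMP", 1, [0b1010, 0b0101])
add_block(index, "ocean.nc", "SALT", 0, [0b1111, 0b0000])

# A query on TEMP whose dim subset only overlaps block 1.
selected = blocks_for(index, "ocean.nc", "TEMP", [1])
```

Keeping the bitvectors per block means a query that touches one block never has to read the index data of the others, which is the point of distributing the indices over sub-blocks.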
13
Experiment Setup
Goals:
  – SQL subsetting vs. Load + Filter in ParaView
  – Scalability of the parallel indexing method
  – Indexing and partition strategy vs. FastQuery
Dataset:
  – Parallel Ocean Program
  – Data size: 33.6 GB
  – Data format: NetCDF (array based)
Environment:
  – IBM Xeon cluster, 8 cores, 2.53 GHz
  – 12 GB memory
14
Efficiency Comparison with Filtering in ParaView
Data size: 5.6 GB
Input: 400 queries
Depends on subset percentage
The general index method is better than filtering when the data subset is < 60%
Two-phase optimization achieved a 0.71x to 11.17x speedup compared with the filtering method
Index m1: bitmap indexing, no optimization
Index m2: uses bitwise operations instead of post-filtering
Index m3: uses both bitwise operations and index partitioning
Filter: load all data + filter
15
Memory Comparison with Filtering in ParaView
Data size: 5.6 GB
Input: 400 queries
Depends on subset percentage
The general index method has a much smaller memory cost than the filtering method
Two-phase optimization adds only a small extra memory cost
Index m1: bitmap indexing, no optimization
Index m2: uses bitwise operations instead of post-filtering
Index m3: uses both bitwise operations and index partitioning
Filter: load all data + filter
16
Scalability with Different Proc#
Data size: 8.4 GB
Proc#: 6, 24, 48, 96
Input: 100 queries
X axis: subset percentage
Y axis: time
Each process takes care of one sub-block
Good scalability as the number of processes increases
17
Alpha Rate with Different Proc#
Data size: 8.4 GB
Proc#: 6, 24, 48, 96
Input: 100 queries
X axis: subset percentage
Y axis: alpha rate
More processes mean more index partitions
Good participation rate when selecting a smaller percentage of the data
18
Alpha Rate and I/O Access Times Comparison with FastQuery
FastQuery: builds a relational table view over a scientific dataset
Difference: FastQuery does not consider multi-dimensional data features
Data size: 8.4 GB, 48 processes
Query type: value + 1st dim, value + 2nd dim, value + 3rd dim, overall
Input: 100 queries for each query type
19
Efficiency Comparison with FastQuery
Data size: 8.4 GB
Proc#: 48
Input: 100 queries for each query type
Achieved a 1.41x to 2.12x speedup compared with FastQuery
21
Conclusion
Big data is an issue in data analysis and visualization
Find the exact data subset at the I/O level with an SQL interface and bitmap indexing
A good speedup compared with the filtering method
Data partition strategy and parallel indexing
A good speedup compared with FastQuery
22
Thanks