
Fast Data Analysis with Integrated Statistical Metadata in Scientific Datasets
By Yong Chen (with Jialin Liu)
Data-Intensive Scalable Computing Laboratory (DISCL)
Computer Science Department, Texas Tech University
Big Data and Science Workshop, October 6, 2013

Richard Hamming’s Quote
The purpose of computing is insight, not numbers. - Richard Hamming, 1962

Motivation
- Data analysis is critical to understanding the phenomena and insight behind the data and computing.
- Scientific applications tend to be data intensive.
  - In a global climate model with 100 × 120 km grid cells, PBs of data are managed and analyzed.
  - Various parameters, e.g., temperature and wind speed, are recorded.
  - Scientists desire higher resolution and finer granularity, which can lead to even larger datasets.
[Figure: global climate model. Source: UCAR]

Motivation (cont.)
- Scientists are interested in understanding the phenomena behind the data. A typical case is to select data points of interest by performing range queries:
    Select data points From datasets Where pressure>80 and 12<temperature<25;
- Traditionally, without any prior knowledge of the datasets, this is a costly process.

Our Idea
- Use statistical metadata to facilitate such data analysis.
- Preprocessing and the added metadata (min/max in this case) improve query response time by more than threefold.
- The approach is similar to widely used indexing schemes, but it is lightweight and incurs significantly less storage overhead.
[Figure: Performance Comparison with and without Statistics]
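To make the idea concrete, here is a minimal Python sketch. It is not the FASM implementation: the list-of-lists layout and the names `build_minmax` and `range_query` are hypothetical, chosen only for illustration. It records the min and max of each subset once, then answers a range query by scanning only the subsets whose [min, max] interval can overlap the query range.

```python
from typing import List, Tuple

def build_minmax(subsets: List[List[float]]) -> List[Tuple[float, float]]:
    """Preprocessing step: record (min, max) for every subset once."""
    return [(min(s), max(s)) for s in subsets]

def range_query(subsets: List[List[float]],
                stats: List[Tuple[float, float]],
                low: float, high: float) -> Tuple[List[float], int]:
    """Return the values v with low < v < high and the number of subsets scanned."""
    hits: List[float] = []
    scanned = 0
    for subset, (s_min, s_max) in zip(subsets, stats):
        # Skip the subset when no value can fall in the open interval (low, high).
        if s_max <= low or s_min >= high:
            continue
        scanned += 1
        hits.extend(v for v in subset if low < v < high)
    return hits, scanned

# Hypothetical temperature subsets; the query mirrors 12 < temperature < 25.
temperature_subsets = [[1.0, 5.0, 9.0], [13.0, 20.0, 24.0],
                       [30.0, 41.0, 55.0], [10.0, 14.0, 26.0]]
stats = build_minmax(temperature_subsets)
hits, scanned = range_query(temperature_subsets, stats, 12.0, 25.0)
print(hits, scanned)
```

In this toy input only two of the four subsets are read. The saving on real data depends on how well the subset boundaries separate value ranges.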

Fast Analysis with Statistical Metadata (FASM) and System Design
- The FASM system (Fast data Analysis with integrated Statistical Metadata) has four major components: Subsetting, Statistics Generating, Metadata Rich Datasets, and Runtime.
[Figure: System Architecture]

FASM Challenges
- What type of subsetting scheme is better?
- What type of statistical metadata is desired?
- How should the statistical metadata be utilized at runtime?

Dimension-Driven Subsetting
- Subsetting refers to how the datasets are partitioned in order to integrate the statistical metadata. For 3D datasets, we can have 1D, 2D, 3D, and combined subsetting.
[Figure: Different Subsetting Schemes]
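One way to picture dimension-driven subsetting is as an enumeration of (start, count) pairs over the array shape. The sketch below rests on an assumption the slide does not spell out: a kD subset covers the full extent of k chosen dimensions and a single index in each remaining dimension. The function name `enumerate_subsets` and the example shape are hypothetical.

```python
from itertools import product
from typing import Iterator, List, Sequence, Tuple

def enumerate_subsets(shape: Sequence[int],
                      span_dims: Sequence[int]) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (start, count) for each subset.

    Assumption: a subset covers the full extent of every dimension in
    span_dims and a single index in each remaining dimension.
    """
    fixed = [d for d in range(len(shape)) if d not in span_dims]
    for idx in product(*(range(shape[d]) for d in fixed)):
        start = [0] * len(shape)
        count = [shape[d] if d in span_dims else 1 for d in range(len(shape))]
        for d, i in zip(fixed, idx):
            start[d] = i
        yield start, count

shape = (2, 3, 4)  # hypothetical (level, lat, lon) extents
one_d = list(enumerate_subsets(shape, span_dims=[2]))          # each subset is a lon line
two_d = list(enumerate_subsets(shape, span_dims=[1, 2]))       # each subset is a (lat, lon) plane
three_d = list(enumerate_subsets(shape, span_dims=[0, 1, 2]))  # the whole block
print(len(one_d), len(two_d), len(three_d))
```

Finer subsets give the statistics more pruning power but add more metadata entries, which is the trade-off the subsetting choice has to balance.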

Locality-Driven Subsetting
- The inconsistency between logical access and physical storage in scientific datasets causes locality issues.

Locality in Subsetting Schemes
Type   Dimension       Distance
sub1   (lat, lon)      0
sub2   (lon, level)    (lat-1)×lon
sub3   (lat, level)    lat×(lon-1)
sub4   (lon, time)     (level×lat-1)×lon
sub5   (lat, time)     (level×lon-1)×lat
sub6   (level, time)   level×(lon×lat-1)
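The distance formulas in the locality table can be evaluated directly to compare schemes. The sketch below transcribes them as written. The grid extents are hypothetical, and the slide does not state the unit of distance, so the values are treated as plain element counts, with a smaller distance read as better locality.

```python
from typing import Dict

def locality_distances(lat: int, lon: int, level: int) -> Dict[str, int]:
    """Distance per subsetting scheme, transcribed from the locality table."""
    return {
        "sub1": 0,                        # (lat, lon)
        "sub2": (lat - 1) * lon,          # (lon, level)
        "sub3": lat * (lon - 1),          # (lat, level)
        "sub4": (level * lat - 1) * lon,  # (lon, time)
        "sub5": (level * lon - 1) * lat,  # (lat, time)
        "sub6": level * (lon * lat - 1),  # (level, time)
    }

# Hypothetical grid extents, chosen only for illustration.
distances = locality_distances(lat=4, lon=8, level=3)
ranked = sorted(distances, key=distances.get)  # smallest distance first
print(distances)
print(ranked)
```

For these extents, sub1 through sub3 come out well ahead of sub4 through sub6. The ordering can change with other grid shapes.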

Concurrency-Driven Subsetting
- Concurrency plays a critical role in exploiting parallelism in the access and analysis of scientific datasets.
[Figure: Distribution of Datasets on Parallel File Systems]

Scheme             Concurrency
sub1               min((x×m)/(stripe_size×level), n)
sub2               min((x×m)/(stripe_size×lat), n)
sub3               min((x×m)/(stripe_size×lon), n)
sub4, sub5, sub6   min((x×m)/stripe_size, n)
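The concurrency formulas can be tried out numerically. The slide does not define x, m, or n, so the sketch below assumes that x×m is the size of the accessed data in bytes (x elements of m bytes each) and that n is the number of storage servers. Both are assumptions, and all input values are hypothetical.

```python
def concurrency(scheme: str, x: int, m: int, stripe_size: int, n: int,
                lat: int, lon: int, level: int) -> float:
    """Evaluate the concurrency formula for one scheme.

    Assumption: x * m is the accessed data size in bytes and n is the
    number of storage servers; the slide does not define these symbols.
    """
    # sub1, sub2, sub3 divide by an extra dimension; sub4-sub6 do not.
    divisor = {"sub1": level, "sub2": lat, "sub3": lon}.get(scheme, 1)
    return min((x * m) / (stripe_size * divisor), n)

# Hypothetical setup: 64 MiB of 4-byte values, 1 MiB stripes, 16 servers.
params = dict(x=16 * 2**20, m=4, stripe_size=2**20, n=16, lat=4, lon=2, level=8)
for scheme in ("sub1", "sub2", "sub3", "sub4"):
    print(scheme, concurrency(scheme, **params))
```

Under these assumptions the result is capped at n, so a larger stripe size lowers concurrency only while the first term stays below the server count.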

Statistics Generating and Enhanced Datasets
- There are different statistics we can utilize, e.g., MIN, MAX, MEAN, MEDIAN, and 5-number statistics (min, lower quartile, median, upper quartile, and max).
- A statistical metadata portion is added to the dataset.
[Figure: A Sample of Metadata Rich Datasets]
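As an illustration of the statistics-generating step, the sketch below computes the listed statistics for one subset with Python's standard `statistics` module. The function name `summarize` and the dictionary layout are hypothetical. The quartiles use the "inclusive" method, which is only one of several common quartile conventions and may differ from what FASM uses.

```python
import statistics
from typing import Dict, Sequence

def summarize(subset: Sequence[float]) -> Dict[str, float]:
    """Compute min, max, mean, and the quartiles for one subset."""
    # quantiles(n=4) returns the three cut points: lower quartile, median, upper quartile.
    q1, median, q3 = statistics.quantiles(subset, n=4, method="inclusive")
    return {
        "min": min(subset),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": max(subset),
        "mean": statistics.fmean(subset),
    }

# Hypothetical subset values.
summary = summarize([5.0, 1.0, 3.0, 2.0, 4.0])
print(summary)
```

Min and max alone are enough to prune range queries. The richer statistics describe the distribution inside each subset at the cost of a few extra values per subset.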

FASM Runtime
- The Runtime component leverages integrated statistical metadata to facilitate data analysis and queries.
- It significantly reduces the data space that the analysis needs to read.

An Example of Runtime Read Operation
Input: query request and statistics_metadata;
Read operation:
  Step 1: In each access, get statistics_metadata from previous analysis;
  Step 2: Filter useless subsets;
  Step 3: Modify accessing pattern: new_start[] = FASM_start; new_count[] = FASM_count;
  Step 4: Read: ncmpi_get_vara_float(ncid, varid, new_start[], new_count[], *fp);
Return: Query Result
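The read operation can be sketched in Python under simplifying assumptions: a one-dimensional variable split into equal-size subsets, min/max as the only statistics, and a list slice standing in for the `ncmpi_get_vara_float` call. The names `plan_reads` and `run_query` are hypothetical. Neighboring surviving subsets are merged so that each (new_start, new_count) pair describes one contiguous read.

```python
from typing import List, Tuple

def plan_reads(stats: List[Tuple[float, float]], subset_len: int,
               low: float, high: float) -> List[Tuple[int, int]]:
    """Steps 2-3: drop subsets that cannot match, then merge neighbors
    into (new_start, new_count) pairs."""
    plan: List[Tuple[int, int]] = []
    for i, (s_min, s_max) in enumerate(stats):
        if s_max <= low or s_min >= high:
            continue  # Step 2: this subset cannot contain a match
        start = i * subset_len
        if plan and plan[-1][0] + plan[-1][1] == start:
            # Step 3: extend the previous read when subsets are adjacent.
            plan[-1] = (plan[-1][0], plan[-1][1] + subset_len)
        else:
            plan.append((start, subset_len))
    return plan

def run_query(data: List[float], stats: List[Tuple[float, float]],
              subset_len: int, low: float, high: float) -> List[float]:
    result: List[float] = []
    for new_start, new_count in plan_reads(stats, subset_len, low, high):
        # Step 4 stand-in: the slice plays the role of the PnetCDF read.
        chunk = data[new_start:new_start + new_count]
        result.extend(v for v in chunk if low < v < high)
    return result

# Hypothetical pressure values in subsets of 3; the query is 10 < pressure < 30.
pressure = [1.0, 2.0, 3.0, 11.0, 15.0, 29.0, 12.0, 20.0, 28.0, 40.0, 50.0, 60.0]
subset_len = 3
# Step 1 stand-in: statistics that a previous analysis would have stored.
stats = [(min(pressure[i:i + subset_len]), max(pressure[i:i + subset_len]))
         for i in range(0, len(pressure), subset_len)]
plan = plan_reads(stats, subset_len, 10.0, 30.0)
result = run_query(pressure, stats, subset_len, 10.0, 30.0)
print(plan, result)
```

Merging adjacent subsets keeps the number of read calls low, which tends to matter more on a parallel file system than the filtering arithmetic itself.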

Current Evaluations
- Testbed
  - Hrothgar, a 640-node cluster at Texas Tech University
  - Each node contains two Intel Xeon (Westmere) 2.8 GHz 6-core processors with 24 GB of memory
  - Nodes are connected with DDR InfiniBand
  - PnetCDF v1.3
- Datasets and Query
  - Randomly generated synthetic datasets, 300 KB to 100 GB
  - Real application BCCR-BCM, 12 GB
  - Randomly generated range queries and analyses, e.g., 10<pressure<30

Statistics and Performance Improvements
[Figure: Performance of Different Statistics]
- The proposed approach demonstrates clear performance advantages as the dataset size increases.

Locality and Concurrency
[Figures: Performance Regarding Locality of Various Subsetting; Concurrency of Various Stripe Sizes]
- Sub1, sub2, and sub3 are better than the sub4 and sub5 schemes in terms of locality.
- Sub3 achieves the best performance when the stripe size is 5 MB, which is 1.67 times faster than the worst case.

Storage Overhead
[Figure: Storage Overhead of Added Metadata]
- The storage overhead of the integrated metadata is less than one percent for three subsetting schemes.
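A back-of-the-envelope calculation suggests why the overhead stays small. The sketch below assumes each statistic is stored at the same width as a data value and uses hypothetical subset sizes. It is not derived from the measurements on this slide.

```python
def metadata_overhead(elements_per_subset: int, stats_per_subset: int,
                      bytes_per_value: int = 4) -> float:
    """Fraction of extra storage when each subset carries its own statistics.

    Assumption: every statistic occupies the same number of bytes as one data value.
    """
    data_bytes = elements_per_subset * bytes_per_value
    meta_bytes = stats_per_subset * bytes_per_value
    return meta_bytes / data_bytes

# Hypothetical subset of 1000 values with min/max, then with 5-number statistics.
minmax_overhead = metadata_overhead(1000, 2)
five_number_overhead = metadata_overhead(1000, 5)
print(f"{minmax_overhead:.2%} {five_number_overhead:.2%}")
```

The ratio depends only on the number of statistics per subset and the subset size, so coarser subsets shrink the overhead further while giving each statistic less pruning power.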

Amortized Cost

Conclusion
- We argue that raw datasets and current formats are not sufficient for achieving optimal performance for Big Data Science.
- We propose integrating statistical metadata into datasets.
  - It illustrates the data distribution features and boosts data analyses.
  - It is lightweight and complementary to indexing.
- Experiments confirmed that query and analysis performance improved.
- We have also analyzed the impact of various subsetting schemes, statistics generating, scalability, and overhead.

J.L. Liu and Y. Chen. Fast Data Analysis with Integrated Statistical Metadata in Scientific Datasets. In Proceedings of the IEEE International Conference on Cluster Computing (Cluster'13), 2013.

Ongoing and Future Work
- Integrate the statistical metadata at the node level, maintain the logical information, and provide the file systems with a better access scheme.
- In the future, we will investigate support for data modifications, resubsetting, and regeneration of statistics at runtime.

Q&A
Thank you. Questions?
You are welcome to visit our website: http://discl.cs.ttu.edu
