Oral Exam 2013 A Virtualization-based Data Management Framework for Big Data Applications Yu Su Advisor: Dr. Gagan Agrawal, The Ohio State University.

Oral Exam 2013 Motivation: Scientific Data Analysis
Science is becoming increasingly data-driven, with strong requirements for efficient data analysis:
–Road-runner EC 3 simulation: 7 attributes per record (X, Y, VX, … MASS), 36 bytes per record; Simulation Speed: 2.3 TB
–Parallel Ocean Program: 3-D grid (42 * 2400 * 3600), > 30 attributes (TEMP, SALT …), 1.4 GB per attribute; Simulation Speed: > 50 GB

Oral Exam 2013 Motivation: Big Data
“Big Data” Challenges:
–Fast data generation speed
–Slow disk I/O and network speed
–A gap that will keep growing in the future
–Different data formats
Observations:
–Scientific analysis is often over data subsets (Community Climate System Model, data pipelines from tomography, X-ray Photon Correlation Spectroscopy): attribute subsets, spatial subsets, value subsets
–Multi-resolution data analysis
–Wide-area data transfer protocols

Oral Exam 2013 An Example of Ocean Simulation
A remote data server holds POP.nc with attributes TEMP, SALT, UVEL, VVEL; users access it over the network:
–“I want to analyze TEMP within the North Atlantic Ocean!” – transfer only the data subset, not the entire data file: more efficient
–“I want to see the average TEMP of the ocean!” – transfer only the aggregation result
–“I want to quickly view the general global ocean TEMP” – transfer data samples
Combining flexible data management with a wide-area data transfer protocol serves all three requests.

Oral Exam 2013 Introduction
A server-side data virtualization method:
–Standard SQL queries over scientific datasets: translate SQL into low-level data access code; data formats: NetCDF, HDF5
–Data subsetting and aggregation: multiple subsetting and aggregation types greatly decrease the data transfer volume
–Data sampling: efficient data analysis with a small accuracy loss
–Combination with wide-area transfer protocols: flexible data management + efficient data transfer (SDQuery_DSI in Globus GridFTP)
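The SQL-to-data-access translation can be illustrated with a minimal, hypothetical sketch (function and parameter names are mine, not the framework's): integer dimension predicates in a WHERE clause are mapped to the array slices a NetCDF/HDF5 hyperslab read would use.

```python
# Hypothetical sketch of the virtualization idea: map dimension predicates in a
# SQL-like query onto per-dimension slices (the hyperslab a NetCDF/HDF5 read
# would use). A toy for integer index bounds only; not the framework's code.
import re

def predicates_to_slices(where_clause, dim_order):
    """Parse 'dim >= lo AND dim < hi' style predicates into per-dimension slices."""
    bounds = {d: [0, None] for d in dim_order}
    for name, op, num in re.findall(r"(\w+)\s*(>=|<=|<|>)\s*(\d+)", where_clause):
        v = int(num)
        if name in bounds:
            if op in (">=", ">"):
                bounds[name][0] = v if op == ">=" else v + 1
            else:
                bounds[name][1] = v + 1 if op == "<=" else v
    return tuple(slice(lo, hi) for lo, hi in (bounds[d] for d in dim_order))

where = "depth_t < 50 AND t_lat >= 100 AND t_lat <= 200"
print(predicates_to_slices(where, dim_order=("depth_t", "t_lat", "t_lon")))
# -> (slice(0, 50), slice(100, 201), slice(0, None))
```

The resulting slice tuple is exactly what a low-level array read would take, so no full-file load is needed on the client side.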

Oral Exam 2013 Thesis Work
Existing Work:
–Supporting User-Defined Subsetting and Aggregation over Parallel NetCDF Datasets (CCGrid 2012)
–Indexing and Parallel Query Processing Support for Visualizing Climate Datasets (ICPP 2012)
–Taming Massive Distributed Datasets: Data Sampling Using Bitmap Indices (HPDC 2013)
–SDQuery DSI: Integrating Data Management Support with a Wide Area Data Transfer Protocol (SC 2013)
Future Work:
–Correlation Data Analysis among Multiple Variables: bitmap indexing for better efficiency and more flexibility
–Correlation Data Mining over Scientific Data

Oral Exam 2013 Outline
Current Work:
–Parallel Server-side Data Subsetting and Aggregation
–Flexible Data Sampling and Efficient Error Calculation
–Combine Data Management with Data Transfer Protocol
Proposed Work:
–Flexible Correlation Analysis over Multi-Variables
–Correlation Mining over Scientific Dataset
Conclusion

Oral Exam 2013 Contribution
Server-side subsetting and aggregation:
–Subsetting: dimensions, coordinates, values
–Bitmap indexing: two-phase optimizations
–Aggregation: SUM, AVG, COUNT, MAX, MIN
Keep data in its native format (e.g., NetCDF, HDF5):
–SciDB, OPeNDAP: huge data loading or transformation cost
Parallel data processing:
–Data partition strategy
–Multiple parallelism levels: files, attributes, blocks
Data visualization:
–SDQueryReader in ParaView
–Visualize only subsets of data

Oral Exam 2013 Background: Bitmap Indexing
–Widely used in scientific data management
–Suitable for floating-point values by binning small value ranges
–Run-length compression (WAH, BBC): compresses bitvectors based on runs of continuous 0s or 1s
–Can be treated as a small profile of the data
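A minimal sketch of the two ideas on this slide, binning and run-length compression; the run-length encoder below is a toy stand-in for WAH/BBC, and the function names are hypothetical.

```python
# Hypothetical sketch: build a binned bitmap index over a float attribute.
# Each bin covers a small value range; its bitvector marks which records fall in it.

def build_bitmap_index(values, bin_edges):
    """Return one bitvector (list of 0/1 per record) for each bin [edge_i, edge_{i+1})."""
    bins = [[0] * len(values) for _ in range(len(bin_edges) - 1)]
    for rid, v in enumerate(values):
        for b in range(len(bin_edges) - 1):
            if bin_edges[b] <= v < bin_edges[b + 1]:
                bins[b][rid] = 1
                break
    return bins

def run_length_encode(bitvector):
    """Toy run-length compression (stand-in for WAH/BBC): (bit, run_length) pairs."""
    runs, prev, count = [], bitvector[0], 0
    for bit in bitvector:
        if bit == prev:
            count += 1
        else:
            runs.append((prev, count))
            prev, count = bit, 1
    runs.append((prev, count))
    return runs

temps = [0.5, 1.7, 0.2, 2.9, 1.1, 1.3]
index = build_bitmap_index(temps, bin_edges=[0.0, 1.0, 2.0, 3.0])
print(index[1])                      # bin [1.0, 2.0): -> [0, 1, 0, 0, 1, 1]
print(run_length_encode(index[1]))   # -> [(0, 1), (1, 1), (0, 2), (1, 2)]
```

Because simulation data is spatially smooth, long runs of identical bits are common, which is why run-length compression works well here.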

Oral Exam 2013 Overview of Server-side Data Subsetting and Aggregation
Pipeline: parse the SQL expression → parse the metadata file → generate the query request → index generation → index retrieval → generate the data subset based on point IDs → perform data aggregation → generate an unstructured grid

Oral Exam 2013 Bitmap Index Optimizations
Run-length compression (WAH, BBC):
–Pros: compression rate, fast bitwise operations
–Cons: the ability to locate a dimension subset is lost
Value predicates vs. dimension predicates; two traditional methods:
–Without bitmap indices: post-filter on values
–With bitmap indices (FastBit): post-filter on dimension info
Two-phase optimizations:
–Index generation: distributed indices over sub-blocks
–Index retrieval: transform dimension subsetting conditions into bitvectors; support bitwise operations among dimension and value bitvectors

Oral Exam 2013 Optimization 1: Distributed Index Generation
–Index generation: generate multiple small indices over sub-blocks of data
–Partition strategy: study the relationship between queries and partitions; partition the data based on query preferences (α rate: redundancy rate of data elements)
–Index retrieval: filter the indices based on dimension-based query conditions

Oral Exam 2013 Partition Strategy
Queries involve both value and dimension conditions:
–Bitmap indexing + dimension filter
–Worst case: all elements have to be involved
–Ideal case: exactly the elements of the dimension subset
α rate: redundancy rate of data elements = number of elements in the index / total data size
Partition strategies:
–Users’ queries have preferences (timestamp, longitude, latitude)
–Study the relationship between queries and partitions
–Partition the data based on query preferences; the α rate can be greatly decreased

Oral Exam 2013 Optimization 2: Index Retrieval
No post-filtering:
–Value-based predicates: find the satisfying bitvectors from index files on disk
–Dimension-based predicates: dynamically generate dimension bitvectors that satisfy the current predicates
–Fast bitwise operations: logical AND operations between dimension and value bitvectors generate the point ID set
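The retrieval optimization can be sketched as follows; this is an illustrative toy on a flat row-major 2-D grid with uncompressed bitvectors, not the framework's actual code.

```python
# Hypothetical sketch of Optimization 2: instead of post-filtering on dimension
# info, turn the dimension predicate into a bitvector and AND it with the
# value bitvector retrieved from the index.

def dim_bitvector(shape, dim, lo, hi):
    """1-bit for every point whose index along `dim` lies in [lo, hi); row-major 2-D grid."""
    rows, cols = shape
    bv = []
    for r in range(rows):
        for c in range(cols):
            idx = (r, c)[dim]
            bv.append(1 if lo <= idx < hi else 0)
    return bv

def bitwise_and(a, b):
    return [x & y for x, y in zip(a, b)]

shape = (3, 4)                            # 3 x 4 grid, 12 points
value_bv = [1,0,1,0, 0,1,1,0, 1,1,0,0]    # points satisfying a value predicate
dim_bv = dim_bitvector(shape, dim=0, lo=1, hi=3)   # rows 1..2 only
ids = [i for i, bit in enumerate(bitwise_and(value_bv, dim_bv)) if bit]
print(ids)   # -> [5, 6, 8, 9]
```

The point ID set comes out of one AND pass; no per-element coordinate check is needed afterwards.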

Oral Exam 2013 Parallel Processing Framework
Three parallelism levels: L1: data file; L2: attribute; L3: data block

Oral Exam 2013 Experiment Setup
Goals:
–Index-based subsetting vs. load + filter in ParaView
–Scalability of the parallel indexing method
–Parallel indexing vs. FastQuery
–Server-side aggregation vs. client-side aggregation
Datasets:
–POP (Parallel Ocean Program)
–GCRM (Global Cloud Resolving Model)
Environment:
–IBM Xeon cluster: 8 cores, 2.53 GHz, 12 GB memory

Oral Exam 2013 Efficiency Comparison with Filtering in ParaView
Data size: 5.6 GB; input: 400 queries; results depend on the subset percentage
–The general index method is better than filtering when the data subset is < 60%
–The two-phase optimization achieved a 0.71 – speedup compared with the traditional bitmap indexing method
Methods: Index m1: traditional bitmap indexing, no optimization; Index m2: bitwise operations instead of post-filtering; Index m3: both bitwise operations and index partitioning; Filter: load all data + filter

Oral Exam 2013 Memory Comparison with Filtering in ParaView
Data size: 5.6 GB; input: 400 queries; results depend on the subset percentage
–The general index method has a much smaller memory cost than the filtering method
–The two-phase optimization adds only a small extra memory cost
Methods: Index m1: bitmap indexing, no optimization; Index m2: bitwise operations instead of post-filtering; Index m3: both bitwise operations and index partitioning; Filter: load all data + filter

Oral Exam 2013 Scalability with Different Proc#
Data size: 8.4 GB; processes: 6, 24, 48, 96; input: 100 queries; X axis: subset percentage; Y axis: time
–Each process takes care of one sub-block
–Good scalability as the number of processes increases

Oral Exam 2013 Comparison with FastQuery
FastQuery: a parallel indexing method based on FastBit
–Builds a relational table view over the dataset and generates parallel indices based on a partition of the table
–Pros: a standard, table-based way to process data
–Cons: the multi-dimensional structure is lost; only row-based partitioning is supported; basic reading unit: continuous rows (1-dim segments)
Our method:
–Flexible partition strategy: partition the multi-dimensional data based on users’ query preferences
–Fewer reads: the basic reading unit is a multi-dimensional block

Oral Exam 2013 Execution Time Comparison with FastQuery
Data size: 8.4 GB; 48 processes; query types: value + 1st dim, value + 2nd dim, value + 3rd dim, overall; input: 100 queries per query type
–Achieved a 1.41 to 2.12 speedup compared with FastQuery

Oral Exam 2013 Parallel Data Aggregation Efficiency
Data size: 16 GB; process numbers: 4, 8, 16; input: 60 aggregation queries
Query types: aggregation only; aggregation + GROUP BY; aggregation + GROUP BY + HAVING
–Much smaller data transfer volume
–Relative speedup: 4 procs: 2.61 –; 8 procs: 4.31 –; 16 procs: 6.65 – 9.54

Oral Exam 2013 Outline
Current Work:
–Parallel Server-side Data Subsetting and Aggregation
–Flexible Data Sampling and Efficient Error Calculation
–Combine Data Management with Data Transfer Protocol
Proposed Work:
–Flexible Correlation Analysis over Multi-Variables
–Correlation Mining over Scientific Dataset
Conclusion

Oral Exam 2013 Contributions
Statistical sampling techniques:
–A subset of individuals represents the whole population
–Information loss and error metrics: mean, variance, histogram, Q-Q plot
Challenges:
–Sampling accuracy must consider data features
–Error calculation has high overhead
Supporting data sampling over bitmap indices:
–Data samples have better accuracy
–Error prediction is supported before sampling the data
–Data sampling is supported over flexible data subsets
–No data reorganization is needed

Oral Exam 2013 Data Sampling over Bitmap Indices
Features of bitmap indexing:
–Each bin (bitvector) corresponds to one value range
–Different bins reflect the entire value distribution
–Each bin keeps the data’s spatial locality: it contains all space IDs (0-bits and 1-bits), in row-major, column-major, Hilbert-curve, or Z-order-curve order
Method:
–Perform stratified random sampling over each bin
–Multi-level indices generate multi-level samples

Oral Exam 2013 Stratified Random Sampling over Bins
S1: index generation; S2: divide each bitvector into equal strides; S3: randomly select a certain percentage of the 1s out of each stride
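Steps S2 and S3 can be sketched as follows (a toy illustration with hypothetical names; S1's index generation is assumed done, so the input is a single bin's bitvector).

```python
# Hypothetical sketch of index-driven sampling: split one bin's bitvector into
# equal strides and randomly keep a fixed fraction of the 1-bits per stride,
# preserving both the value distribution (bins) and spatial locality (strides).
import random

def sample_bin(bitvector, stride_len, fraction, rng):
    """Return sorted record IDs of sampled 1-bits, at least one per non-empty stride."""
    sampled = []
    for start in range(0, len(bitvector), stride_len):
        ones = [start + i for i, b in enumerate(bitvector[start:start + stride_len]) if b]
        k = max(1, round(fraction * len(ones))) if ones else 0
        sampled.extend(rng.sample(ones, k))
    return sorted(sampled)

rng = random.Random(42)
bv = [1,0,1,1, 0,1,1,0, 1,1,1,1]   # 1-bits of one bin over 12 points
ids = sample_bin(bv, stride_len=4, fraction=0.5, rng=rng)
print(ids)   # roughly half of the 1-bits, spread across all three strides
```

Running this over every bin yields a stratified sample of the whole dataset without touching the raw data until the sampled IDs are read.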

Oral Exam 2013 Error Prediction vs. Error Calculation
–Error calculation: sampling request → data sampling → error calculation over the sample; if the sample is not good, repeat (multi-time sampling and error calculation)
–Error prediction: predict request → error prediction → error metrics feedback → decide sampling → sampling request → one-time sampling

Oral Exam 2013 Error Prediction
Pre-estimate the error metrics before sampling; calculate error metrics based on bins:
–Bitmap indices classify the data into bins; each bin corresponds to one value or value range; find representative values for each bin: V_i
–Enforce an equal sampling percentage for each bin; extra metadata: number of 1-bits of each bin: C_i; compute the number of samples of each bin: S_i
–Pre-calculate error metrics based on V_i and S_i
Representative values:
–Small bin: mean value
–Big bin: lower bound, upper bound, mean value
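As one concrete instance of the idea, the predicted mean can be computed from per-bin metadata alone; a minimal sketch assuming one representative value per bin (all names are hypothetical).

```python
# Hypothetical sketch of error prediction: estimate the sample mean from per-bin
# metadata alone (1-bit count C_i and representative value V_i), before any
# data is read or sampled.

def predicted_mean(bin_counts, bin_reps, sampling_fraction):
    """Equal sampling fraction per bin => predicted sample size S_i = p * C_i."""
    samples = [sampling_fraction * c for c in bin_counts]
    total = sum(samples)
    return sum(s * v for s, v in zip(samples, bin_reps)) / total

counts = [100, 300, 600]    # C_i: 1-bits per bin (from index metadata)
reps = [0.5, 1.5, 2.5]      # V_i: e.g., the mid-point of each bin's value range
print(predicted_mean(counts, reps, 0.1))   # -> 2.0
```

Because the sampling fraction is equal across bins, it cancels out of the mean; other metrics (histogram, Q-Q plot) can likewise be pre-computed from the (V_i, S_i) pairs.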

Oral Exam 2013 Data Subsetting + Data Sampling
S1: find the value subset (e.g., Value = [2, 3)); S2: find the spatial ID subset (e.g., RID = (9, 25)); S3: perform stratified sampling on the subset

Oral Exam 2013 Experiment Results
Goals:
–Accuracy among different sampling methods
–Compare the predicted error with the actual error
–Efficiency among different sampling methods
–Speedup for combining data sampling with subsetting
Datasets:
–Ocean data: multi-dimensional arrays
–Cosmos data: separate points with 7 attributes
Environment:
–Darwin cluster: 120 nodes, 48 cores, 64 GB memory

Oral Exam 2013 Sample Accuracy Comparison
Sampling methods:
–Simple random method
–Stratified random method
–KDTree stratified random method
–Big-bin index random method
–Small-bin index random method
Error metrics:
–Means over 200 separate sectors
–Histogram using 200 value intervals
–Q-Q plot with 200 quantiles
Sampling percentage: 0.1%

Oral Exam 2013 Sample Accuracy Comparison (Mean, Histogram, Q-Q Plot)
–Traditional sampling methods cannot achieve good accuracy
–The small-bin method achieves the best accuracy in most cases
–The big-bin method achieves accuracy comparable to the KDTree sampling method

Oral Exam 2013 Predicted Error vs. Actual Error
Figures: means, histogram, and Q-Q plot for the small-bin method; means, histogram, and Q-Q plot for the big-bin method

Oral Exam 2013 Efficiency Comparison (Sample Generation Time, Error Calculation Time)
–Index-based sample generation time is proportional to the number of bins (1.10 to 3.98 times slower)
–Error calculation based on bins is much faster than error calculation based on data (> 28 times faster)

Oral Exam 2013 Total Time Based on Resampling Times
X axis: resampling times; Y axis: total sampling time
–Index-based sampling: one-time sampling + multi-time error calculations
–Other sampling methods: multi-time samplings + multi-time error calculations
–Speedup of the small-bin method: 0.91 – 20.12

Oral Exam 2013 Speedup of Sampling over Subsets (over Spatial IDs, over Values)
X axis: data subsetting percentage (100%, 50%, 30%, 10%, 1%); Y axis: index loading time + sample generation time; 25% sampling percentage
–Speedup: 1.47 – 4.98 for spatial subsetting and for value subsetting

Oral Exam 2013 Outline
Current Work:
–Parallel Server-side Data Subsetting and Aggregation
–Flexible Data Sampling and Efficient Error Calculation
–Combine Data Management with Data Transfer Protocol
Proposed Work:
–Flexible Correlation Analysis over Multi-Variables
–Correlation Mining over Scientific Dataset
Conclusion

Oral Exam 2013 Background: Wide-Area Data Transfer Protocols
Efficient data transfers over wide-area networks; Globus GridFTP:
–Striped, streaming, parallel data transfer
–Reliable and restartable data transfer
Limitation: transfer volume
–The basic data transfer unit is the file (GB or TB level)
–Strong requirements exist for transferring data subsets
Goal: integrate core data management functionality with wide-area data transfer protocols

Oral Exam 2013 Contribution
Challenges:
–How should the method be designed to allow easy use and integration with an existing GridFTP installation?
–How can users view a remote file and specify the subsets of its data?
–How can efficient data retrieval be supported under different subsetting scenarios?
–How can data retrieval be parallelized to benefit from multi-streaming?
GridFTP SDQuery DSI:
–Efficient data transfer over flexible file subsets
–Dynamic loading/unloading with small overhead
–Performance-model-based hybrid data reading
–Parallel streaming data reading and transferring

Oral Exam 2013 Outline
Current Work:
–Parallel Server-side Data Subsetting and Aggregation
–Flexible Data Sampling and Efficient Error Calculation
–Combine Data Management with Data Transfer Protocol
Proposed Work:
–Flexible Correlation Analysis over Multi-Variables
–Correlation Mining over Scientific Dataset
Conclusion

Oral Exam 2013 Motivation: Correlation Analysis
Correlation analysis among attributes (variables):
–Study relationships among variables to make scientific discoveries
–Two scenarios: basic scientific rule verification and discovery; feature mining (halo finding, eddy finding)
Challenges:
–Correlation analysis is useful but extremely time-consuming and resource-costly
–No existing method supports flexible correlation analysis on data subsets

Oral Exam 2013 Correlation Metrics
Multi-dimensional histogram:
–Value distributions of the variables
Entropy:
–A metric for the variability of the dataset; low => constant, predictable data; high => random data
Mutual information:
–A metric for the dependence between two variables; low => the two variables are independent; high => one variable provides information about the other
Pearson correlation coefficient:
–A metric for the linear correspondence between two variables; value range: [-1, 1]; ≠ 0 => proportional; = 0 => independent
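The three metrics can be illustrated on toy binned data; a minimal sketch of the standard definitions, not the framework's implementation (variable and bin choices are mine).

```python
# A minimal sketch of the three correlation metrics on binned data (one bin
# label per grid point).
import math

def entropy(labels):
    """Shannon entropy H over the label distribution, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n)
                for c in {x: labels.count(x) for x in set(labels)}.values())

def mutual_information(a, b):
    """I(A;B) = H(A) + H(B) - H(A,B); zero iff the binned variables are independent."""
    return entropy(a) + entropy(b) - entropy(list(zip(a, b)))

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

temp_bins = [0, 0, 1, 1, 2, 2]
salt_bins = [0, 0, 1, 1, 2, 2]     # fully dependent on temp_bins
print(entropy(temp_bins))          # log2(3) ~ 1.585 for three equal bins
print(mutual_information(temp_bins, salt_bins))   # equals H(temp_bins) here
print(pearson([1, 2, 3, 4], [2, 4, 6, 8]))        # exact linear relation -> 1.0
```

Since all three metrics depend only on (joint) bin counts, they can be evaluated from bitmap-index bins without scanning the raw data, which is what the next slides exploit.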

Oral Exam 2013 Our Solution and Contribution
A framework supporting both individual and correlation data analysis based on bitmap indexing:
–Individual analysis: flexible data subsetting
–Correlation analysis: interactive queries among multiple variables; correlation metric calculation based on indices; correlation analysis over data subsets
Support for correlation analysis over bitmap indices:
–Better efficiency, smaller memory cost
–Supports both static indexing and dynamic indexing
–Supports correlation analysis over data samples

Oral Exam 2013 Use Cases of Correlation Analysis
Please enter the variable names on which you want to perform correlation queries: TEMP SALT UVEL
Please enter your SQL query: SELECT TEMP FROM POP WHERE TEMP>0 AND TEMP<1 AND depth_t<50;
–Entropy: TEMP (2.19), SALT (1.90), UVEL (1.48)
–Mutual Information: TEMP->SALT: 0.18, TEMP->UVEL: 0.017
–Pearson Correlation: …
–Histogram: (SALT), (UVEL)
Please enter your SQL query: SELECT SALT FROM POP WHERE SALT<0.0346;
–Entropy: TEMP (2.29), SALT (2.99), UVEL (2.68)
–Mutual Information: TEMP->UVEL: 0.02, SALT->UVEL: 0.19
–Pearson Correlation: …
–Histogram: (UVEL)
Please enter your SQL query: UNDO
–Entropy: TEMP (2.19), SALT (1.90), UVEL (1.48)
–Mutual Information: TEMP->SALT: 0.18, TEMP->UVEL: 0.017
–Pearson Correlation: …
–Histogram: (SALT), (UVEL)
Please enter your query:

Oral Exam 2013 Dynamic Indexing
No indexing support:
–Load all data for A and B; filter A and B to generate the subset
–Combined bins: generate (A_1, B_1)->count_1, …, (A_m, B_m)->count_m based on each data element within the data subset
–Calculate the correlation information based on the combined bins
Dynamic indexing (one index per variable):
–Query bitvectors for A and B (no data loading cost, zero or very small filtering cost)
–Combined bins: generate (A_1, B_1)->count_1, …, (A_m, B_m)->count_m based on bitwise operations between A and B (much faster because the number of bitvectors is much smaller than the number of elements)
–Calculate the correlation information based on the combined bins
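The combined-bin generation under dynamic indexing can be sketched as bitwise ANDs between the two variables' bin bitvectors; a toy with uncompressed bitvectors and hypothetical names (real indices would be WAH/BBC-compressed).

```python
# Hypothetical sketch of dynamic indexing: combined (A_i, B_j) bin counts come
# from bitwise ANDs between the two variables' bin bitvectors; no data load.

def combined_bins(index_a, index_b):
    """index_*: {bin_label: bitvector}. Returns {(a_bin, b_bin): count} for
    non-empty intersections; the cost is bitvector ops, not per-element scans."""
    joint = {}
    for la, bva in index_a.items():
        for lb, bvb in index_b.items():
            count = sum(x & y for x, y in zip(bva, bvb))
            if count:
                joint[(la, lb)] = count
    return joint

temp_idx = {"[0,1)": [1,1,0,0,1,0], "[1,2)": [0,0,1,1,0,1]}
salt_idx = {"low":   [1,0,1,0,1,0], "high":  [0,1,0,1,0,1]}
print(combined_bins(temp_idx, salt_idx))
```

The resulting joint counts feed directly into the metrics of the earlier slide (multi-dimensional histogram, mutual information); static indexing avoids even these ANDs by precomputing one index over the variable pair.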

Oral Exam 2013 Static Indexing
–Dynamic indexing: one index per variable; bitwise operations are still needed to generate the combined bins
–Static indexing: generate one big index file over multiple variables; only bitvector filtering or combining is needed (extremely small cost)

Oral Exam 2013 Outline
Current Work:
–Parallel Server-side Data Subsetting and Aggregation
–Flexible Data Sampling and Efficient Error Calculation
–Combine Data Management with Data Transfer Protocol
Proposed Work:
–Flexible Correlation Analysis over Multi-Variables
–Correlation Mining over Scientific Dataset
Conclusion

Oral Exam 2013 Correlation Mining
Challenges of correlation queries:
–Users do not know which subsets contain important correlations
–They keep submitting queries to explore correlations
Correlation mining:
–Automatically find important correlations and suggest them to users
A bottom-up method:
–Generate correlations over basic spatial and value units
–Use bitmap indexing to speed up this process
–Use association rule mining to find and combine similar correlations

Oral Exam 2013 Generate Scientific Association Rule
Association rule example: t_lon(10.1−15.1), t_lat(25.2−30.2), depth_t(1−10), TEMP(0−1), SALT(0.01−0.02) → Mutual Information(0.23, High)

Oral Exam 2013 Feature Mining
Feature mining based on correlation analysis:
–Sub-halo: correlation between space and velocity
–Eddy: correlation between speeds in different directions; the OW criterion is used to find eddies: OW > 0, not an eddy; OW <= 0, might be an eddy
–One detection method: build v based on row-major order (x, y), build u based on column-major order (y, x); an eddy cannot exist within a long sequence of 1-bits

Oral Exam 2013 Conclusion
–The “Big Data” challenge motivates a server-side data virtualization method
–Server-side data subsetting and aggregation
–Data sampling based on bitmap indexing
–Integration of flexible data management with an efficient data transfer protocol
Future work:
–Correlation queries
–Correlation mining

Oral Exam 2013 Thanks for your attention! Q & A