
Data Management and Data Processing Support on Array-Based Scientific Data Yi Wang Advisor: Gagan Agrawal Candidacy Examination

Big Data Is Often Big Arrays Array data is everywhere, and it is especially prevalent in the scientific domain: Molecular Simulation: Molecular Data; Life Science: DNA Sequencing Data (Microarray); Earth Science: Ocean and Climate Data; Space Science: Astronomy Data.

Inherent Limitations of Current Tools and Paradigms Most scientific data management and data processing tools are too heavy-weight: they struggle to cope with different data formats and physical structures (variety), and data transformation and data transfer are often prohibitively expensive (volume). Prominent Examples RDBMSs: not suited for array data; Array DBMSs: require costly data ingestion; MapReduce: requires a specialized file system.

Mismatch Between Scientific Data and DBMS Scientific (Array) Datasets: very large but processed infrequently; read/append only; no resources for reloading data; popular formats: NetCDF and HDF5. Database Technologies: designed for (read-write) data with ACID guarantees; assume that data reloading/reformatting is feasible.

Example Array Data Format - HDF5 HDF5 (Hierarchical Data Format) To specify a data subset: 1) dimensional range and 2) value range. Each dimension is usually associated with a series of coordinate values, which are stored in a separate 1D dataset – the dimension scale.
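A minimal sketch (using the h5py Python bindings; the file, dataset, and coordinate values are hypothetical) of how a dimension scale is attached to an HDF5 dataset and then used to turn a coordinate range into an index range:

    import numpy as np
    import h5py

    with h5py.File("example.h5", "w") as f:
        data = f.create_dataset("temperature", data=np.random.rand(4, 8))
        # The coordinate values of the first dimension live in a separate 1D dataset.
        time = f.create_dataset("time", data=np.array([0.0, 6.0, 12.0, 18.0]))
        time.make_scale("time")             # mark the 1D dataset as a dimension scale
        data.dims[0].attach_scale(time)     # associate it with dimension 0

    with h5py.File("example.h5", "r") as f:
        coords = f["temperature"].dims[0][0][:]                  # read the attached scale
        idx = np.where((coords >= 6.0) & (coords <= 12.0))[0]    # value range -> index range
        subset = f["temperature"][idx.min():idx.max() + 1, :]    # dimensional subsetting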

The Upfront Cost of Using SciDB High-Level Data Flow: requires data ingestion. Data Ingestion Steps: raw files (e.g., HDF5) -> CSV; load CSV files into SciDB. The data ingestion experience is very painful: the data ingestion cost is 100x that of a simple query. “EarthDB: scalable analysis of MODIS data using SciDB” - G. Planthaber et al.

Thesis Statement Native Data Can Be Queried and/or Processed Efficiently Using Popular Abstractions Process data stored in the native format, e.g., NetCDF and HDF5 Support SQL-like operators, e.g., selection and aggregation Support array operations, e.g., structural aggregations Support MapReduce-like processing API

Outline Data Management Support Supporting a Light-Weight Data Management Layer Over HDF5 SAGA: Array Storage as a DB with Support for Structural Aggregations Approximate Aggregations Using Novel Bitmap Indices Data Processing Support SciMATE: A Novel MapReduce-Like Framework for Multiple Scientific Data Formats Future Work

Overall Idea An SQL Implementation Over HDF5 Ease-of-use: a declarative language instead of a low-level programming language + HDF5 API; Abstraction: provides a virtual relational view. High Efficiency: load data on demand (lazy loading); parallel query processing; server-side aggregation.

Functionality Query Based on Dimension Index Values (Type 1, index-based condition): also supported by the HDF5 API. Query Based on Dimension Scales (Type 2, coordinate-based condition): uses the coordinate system instead of the physical layout (array subscripts). Query Based on Data Values (Type 3, content-based condition): simple datatype + compound datatype. Aggregate Query: SUM, COUNT, AVG, MIN, and MAX, with server-side aggregation to minimize the data transfer.

Execution Overview The query condition is organized into condition lists: a 1D AND-logic condition list, a 2D OR-logic condition list, and a 1D OR-logic condition list sharing the same content-based condition. More optimizations are possible with the metadata information.

Experimental Setup Experimental Datasets 4 GB (sequential experiments) and 16 GB (parallel experiments) 4D: time, cols, rows, and layers Compared with Baseline Performance and OPeNDAP Baseline performance: no query parsing OPeNDAP: translates HDF5 into a specialized data format

Sequential Comparison with OPeNDAP (Type2 and Type3 Queries) By using OPeNDAP, the user has to download the entire dataset from the server first and then write their own filter. The performance scales poorly due to the additional data translation overhead. We implemented a client-side filter for OPeNDAP. Comparing the baseline performance with our sequential performance, we can see that the total sequential processing time for a Type 1 query is indistinguishable from the baseline, i.e., the intrinsic HDF5 query function.

Parallel Query Processing for Type2 and Type3 Queries Scaled the system up to 16 nodes, and the selectivity was varied from <20% to >80%. Good scalability.

Outline Data Management Support Supporting a Light-Weight Data Management Layer Over HDF5 SAGA: Array Storage as a DB with Support for Structural Aggregations Approximate Aggregations Using Novel Bitmap Indices Data Processing Support SciMATE: A Novel MapReduce-Like Framework for Multiple Scientific Data Formats Future Work

Array Storage as a DB A Paradigm Similar to NoDB Still maintains DB functionality But no data ingestion DB and Array Storage as a DB: Friends or Foes? When to use DB? Load once, and query frequently When to directly use array storage? Query infrequently, so avoid loading Our System Focuses on a set of special array operations - Structural Aggregations Absolute power corrupts absolutely.

Structural Aggregation Types Non-Overlapping Aggregation vs. Overlapping Aggregation. Grid aggregation: multi-dimensional histogram. Sliding aggregation: apply a kernel function to a sliding window – moving average, denoising, time series, etc. Hierarchical aggregation: observe the gradual influence of radiation from a source (pollution source/explosion location). Circular aggregation: concentric but disjoint circles instead of regularly shaped grids.

Grid Aggregation Parallelization: easy after partitioning. Considerations: data contiguity, which affects the I/O performance; communication cost; load balancing for skewed data. Partitioning Strategies: coarse-grained, fine-grained, hybrid, and auto-grained.

Partitioning Strategy Decider Cost Model: analyze loading cost and computation cost separately. Load cost: loading factor × data amount; computation cost: estimated per grid. Exception - Auto-Grained: take loading cost and computation cost as a whole. Communication cost is trivial, with the exception of fine-grained partitioning with small grid sizes.
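A rough sketch of how such a decider could compare strategies, assuming hypothetical loading factors and per-strategy cost estimates (the slides only give the load-cost formula):

    # Hypothetical cost model: pick the partitioning strategy with the lowest
    # estimated total cost = loading cost + computation cost + communication cost.
    def estimate_cost(loading_factor, data_amount, computation_cost, communication_cost=0.0):
        loading_cost = loading_factor * data_amount   # "loading factor x data amount"
        return loading_cost + computation_cost + communication_cost

    def choose_strategy(candidates):
        # candidates: {strategy_name: (loading_factor, data_amount, comp_cost, comm_cost)}
        return min(candidates, key=lambda name: estimate_cost(*candidates[name]))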

Overlapping Aggregation I/O Cost Reuse the data already in the memory Reduce the disk I/O to enhance the I/O performance Memory Accesses Reuse the data already in the cache Reduce cache misses to accelerate the computation Aggregation Approaches Naïve approach Data-reuse approach All-reuse approach

Example: Hierarchical Aggregation Aggregate 3 grids in a 6 × 6 array The innermost 2 × 2 grid The middle 4 × 4 grid The outermost 6 × 6 grid (Parallel) sliding aggregation is much more complicated

Naïve Approach For N grids: N loads + N aggregations. Load the innermost grid; aggregate the innermost grid. Load the middle grid; aggregate the middle grid. Load the outermost grid; aggregate the outermost grid.

Data-Reuse Approach For N grids: 1 load + N aggregations. Load the outermost grid; aggregate the outermost grid; aggregate the middle grid; aggregate the innermost grid. Aggregation execution is not so straightforward.

All-Reuse Approach For N grids: 1 load + 1 aggregation. Load the outermost grid; once an element is accessed, accumulatively update the aggregation results it contributes to. Pure sequential I/O. Elements in the outermost band only update the outermost aggregation result; elements in the middle band update both the outermost and the middle aggregation results; elements in the innermost grid update all 3 aggregation results.
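A minimal sketch of the all-reuse idea for the hierarchical example above (a single pass over the 6 × 6 array updating every nested aggregate an element belongs to); the function and variable names are illustrative, not the system's code:

    import numpy as np

    # One load + one aggregation pass: each element accumulatively updates every
    # centered square grid (2x2, 4x4, 6x6) that contains it.
    def hierarchical_sums(array, grid_sizes):
        n = array.shape[0]
        sums = {g: 0.0 for g in grid_sizes}
        for i in range(n):
            for j in range(n):
                for g in grid_sizes:
                    lo, hi = (n - g) // 2, (n + g) // 2
                    if lo <= i < hi and lo <= j < hi:
                        sums[g] += array[i, j]
        return sums

    a = np.arange(36, dtype=float).reshape(6, 6)
    print(hierarchical_sums(a, [2, 4, 6]))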

Sequential Performance Comparison Array slab/data size (8 GB) ratio: from 12.5% to 100%. Coarse-grained partitioning for the grid aggregation; all-reuse approach for the sliding aggregation. SciDB stores 'chunked' arrays: it can even support overlapping chunking to accelerate the sliding aggregation.

Parallel Sliding Aggregation Performance # of nodes: from 1 to 16 8 GB data Sliding grid size: from 3 × 3 to 6 × 6

Outline Data Management Support Supporting a Light-Weight Data Management Layer Over HDF5 SAGA: Array Storage as a DB with Support for Structural Aggregations Approximate Aggregations Using Novel Bitmap Indices Data Processing Support SciMATE: A Novel MapReduce-Like Framework for Multiple Scientific Data Formats Future Work

Approximate Aggregations Over Array Data Challenges: flexible aggregation over any subset (dimension-based/value-based/combined predicates); aggregation accuracy (capturing both the spatial distribution and the value distribution); aggregation without data reorganization (reorganization is prohibitively expensive). Existing Techniques - All Problematic for Array Data: Sampling: unable to capture both distributions – KD-tree-based stratified sampling requires reorganization. Histograms: 1D histograms capture no spatial distribution; for multi-dimensional histograms, either the space cost or the partitioning granularity increases exponentially, leading to either substantial estimation overheads or high inaccuracy. Wavelets: capture no value distribution; if the value-based attribute is added as an extra dimension to the data cube, sorting or reorganization is required. New Data Synopses – Bitmap Indices.

Bitmap Indexing and Pre-Aggregation Bitmap Indices Pre-Aggregation Statistics Any multi-dimensional array can be mapped to a 1D array

Approximate Aggregation Workflow

Running Example Bitmap Indices + Pre-Aggregation Statistics. SELECT SUM(Array) WHERE Value > 3 AND ID < 4; Predicate Bitvector: 11110000; i1': 01000000; i2': 10010000; Count1: 1; Count2: 2. Estimated Sum: 7 × 1/2 + 16 × 2/3 = 14.167; Precise Sum: 14.
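A small sketch of the estimation behind this example: the pre-aggregated sums (7 and 16), element counts (2 and 3), and bitvectors are taken from the example itself, while the function and data structure are illustrative only:

    # For each value bin overlapping the predicate, scale its pre-aggregated sum
    # by the fraction of its elements that fall inside the queried subset.
    def approx_sum(predicate_bits, bins):
        total = 0.0
        for bits, pre_sum, total_count in bins:   # bins: (bitvector, pre-sum, element count)
            hit = bin(bits & predicate_bits).count("1")
            if hit:
                total += pre_sum * hit / total_count
        return total

    predicate = 0b11110000                        # dimension-based predicate: ID < 4
    bins = [(0b01000000, 7.0, 2),                 # bin i1: pre-aggregated sum 7 over 2 elements
            (0b10010000, 16.0, 3)]                # bin i2: pre-aggregated sum 16 over 3 elements
    print(approx_sum(predicate, bins))            # 7 * 1/2 + 16 * 2/3 = 14.166...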

A Novel Binning Strategy Conventional Binning Strategies: equi-width/equi-depth – not designed for aggregation. V-Optimized Binning Strategy: inspired by the V-Optimal Histogram; goal: approximately minimize the Sum Squared Error (SSE). Unbiased V-Optimized Binning: assumes the data is queried uniformly at random. Weighted V-Optimized Binning: assumes the frequently queried subarea is known as prior knowledge.
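A plausible formalization of the two objectives (the exact notation is not given in the slides; the weights w_v and representative values r_j are assumed symbols): with bins B_1, ..., B_k,

    \mathrm{SSE}  = \sum_{j=1}^{k} \sum_{v \in B_j} (v - \bar{v}_j)^2
    \mathrm{WSSE} = \sum_{j=1}^{k} \sum_{v \in B_j} w_v \, (v - r_j)^2

where \bar{v}_j is the mean value of bin B_j, w_v is the query-frequency weight of the element with value v, and r_j is the bin's (weighted) representative value.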

Unbiased V-Optimized Binning 3 Steps: Initial Binning: use equi-depth binning. Iterative Refinement: adjust bin boundaries. Bitvector Generation: mark spatial positions. Adding a bin (bin boundary) improves the approximation quality, i.e., decreases the SSE; removing a bin (bin boundary) undermines the approximation quality, i.e., increases the SSE.
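An illustrative sketch of the three steps for a 1D NumPy value array and a fixed number of bins; the concrete refinement rule here is an assumption, not the presented algorithm:

    import numpy as np

    def v_optimized_bins(values, num_bins, iterations=20):
        # Step 1: initial binning with equi-depth boundaries.
        boundaries = np.quantile(values, np.linspace(0, 1, num_bins + 1))

        def sse(bnds):
            ids = np.clip(np.searchsorted(bnds, values, side="right") - 1, 0, num_bins - 1)
            return sum(((values[ids == b] - values[ids == b].mean()) ** 2).sum()
                       for b in range(num_bins) if np.any(ids == b))

        # Step 2: iterative refinement -- nudge interior boundaries, keep moves that lower SSE.
        for _ in range(iterations):
            for b in range(1, num_bins):
                for delta in (-0.1, 0.1):
                    cand = boundaries.copy()
                    cand[b] += delta * (boundaries[b] - boundaries[b - 1])
                    if cand[b - 1] < cand[b] < cand[b + 1] and sse(cand) < sse(boundaries):
                        boundaries = cand

        # Step 3: bitvector generation -- one bitvector per bin marking spatial positions.
        ids = np.clip(np.searchsorted(boundaries, values, side="right") - 1, 0, num_bins - 1)
        return boundaries, [(ids == b) for b in range(num_bins)]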

Weighted V-Optimized Binning Difference: minimize WSSE instead of SSE, with a similar binning algorithm. Major Modification: the representative value for each bin is no longer the mean value.

Experimental Setup Data Skew: Dense Range: less than 5% of the space but over 90% of the data; Sparse Range: over 95% of the space but less than 10% of the data. 5 Types of Queries: DB: with dimension-based predicates; VBD: with value-based predicates over a dense range; VBS: with value-based predicates over a sparse range; CD: with combined predicates over a dense range; CS: with combined predicates over a sparse range. Ratio of querying probabilities – 10 : 1; 50% of the synthetic data is frequently queried; 25% of the real-world data is frequently queried.

SUM Aggregation Accuracy of Different Binning Strategies on the Synthetic Dataset Equi-Width: most inaccurate in all the cases. Equi-Depth: most accurate when only value-based predicates exist. Unbiased V-Optimized: most accurate when only dimension-based predicates exist or over a sparse range. Weighted V-Optimized: most accurate in the setting where 50% of the data is frequently queried.

SUM Aggregation Accuracy of Different Methods on the Real-World Dataset Bitmap vs. Sampling with two sampling rates, 2% and 20%: 1) a significantly higher sampling rate does not necessarily lead to significantly higher accuracy; 2) most accurate when only value-based predicates exist – the small error there is caused by the edge bin(s) that overlap with the queried value range, and conservative aggregation is slightly better than aggressive aggregation in this case. Bitmap vs. (Equi-Depth) MD-Histogram: 400 bins/buckets to partition the value domain, with an equi-depth partitioning property (similar to equi-depth binning); it is even less accurate than the equi-depth bitmap because it is inaccurate in processing dimension-based predicates due to the uniform-distribution assumption for every dimension.

Outline Data Management Support Supporting a Light-Weight Data Management Layer Over HDF5 SAGA: Array Storage as a DB with Support for Structural Aggregations Approximate Aggregations Using Novel Bitmap Indices Data Processing Support SciMATE: A Novel MapReduce-Like Framework for Multiple Scientific Data Formats Future Work

Scientific Data Analysis Today “Store-First-Analyze-After” Reload data into another file system E.g., load data from PVFS to HDFS Reload data into another data format E.g., load NetCDF/HDF5 data to a specialized format Problems Long data migration/transformation time Stresses network and disks

System Overview Key Feature: the scientific data processing module

Scientific Data Processing Module The data adaptation layer is customizable: a third-party adapter can be inserted – open for extension but closed for modification (see the sketch below).
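A minimal sketch of what such a pluggable adaptation layer could look like; the class, method, and registry names are illustrative assumptions, not the system's actual API:

    from abc import ABC, abstractmethod

    class FormatAdapter(ABC):
        # Third-party adapters subclass this without touching the processing core.
        @abstractmethod
        def read_block(self, path, offset, count):
            """Return a block of elements in the framework's default array layout."""

    class NetCDFAdapter(FormatAdapter):
        def read_block(self, path, offset, count):
            from netCDF4 import Dataset          # assumes the netCDF4 package is available
            with Dataset(path) as ds:
                var = ds.variables[next(iter(ds.variables))]   # first variable, for illustration
                return var[offset:offset + count]

    ADAPTERS = {"netcdf": NetCDFAdapter()}       # open for extension: register new formats here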

Parallel Data Processing Times on 16 GB Datasets KNN K-Means Thread scalability: all the data in different formats are loaded into our system in the default array layout Node scalability: Performance difference comes from

Future Work Outline Data Management Support SciSD: Novel Subgroup Discovery over Scientific Datasets Using Bitmap Indices SciCSM: Novel Contrast Set Mining over Scientific Datasets Using Bitmap Indices Data Processing Support StreamingMATE: A Novel MapReduce-Like Framework Over Scientific Data Streams We begin to analyze multi-variate datasets and the underlying relationships among multiple variables. Intuition: frequent membership operations and aggregations are involved, and a bitmap, as a vertical layout, is efficient for both kinds of operations (e.g., frequent itemset mining). Stream processing is a hot topic!

SciSD Subgroup Discovery Goal: identify all the subsets that are significantly different from the entire dataset/general population, w.r.t. a target variable. Can be widely used in scientific knowledge discovery. Novelty: subsets can involve dimensional and/or value ranges; all numeric attributes; high efficiency by frequent bitmap-based approximate aggregations.

Running Example

SciCSM “Sometimes it’s good to contrast what you like with something else. It makes you appreciate it even more.” - Darby Conley, Get Fuzzy, 2001 Contrast Set Mining Goal: identify all the filters that can generate significantly different subsets Common filters: time periods, spatial areas, etc. Usage: classifier design, change detection, disaster prediction, etc.

Running Example

StreamingMATE Extend the precursor system SciMATE to process scientific data streams. Generalized Reduction: reduce the data stream to a reduction object; no shuffling or sorting. Focus on the load balancing issues: the input data volume can be highly variable; topology updates: add/remove/update streaming operators.
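A sketch of the generalized-reduction idea (the API and class names are hypothetical): every incoming element is folded into a reduction object, so the stream needs no shuffling or sorting, and partial objects from different operators can simply be merged:

    class ReductionObject:
        def __init__(self):
            self.count, self.total = 0, 0.0
        def accumulate(self, value):          # local reduction applied to each stream element
            self.count += 1
            self.total += value
        def merge(self, other):               # combine partial results from other operators
            self.count += other.count
            self.total += other.total

    def process_stream(stream):
        robj = ReductionObject()
        for value in stream:
            robj.accumulate(value)
        return robj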

StreamingMATE Overview

Hyperslab Selector True: nullify the condition list; False: nullify the elementary condition. 4-dim Salinity Dataset – dim1: time [0, 1023]; dim2: cols [0, 166]; dim3: rows [0, 62]; dim4: layers [0, 33]. Fill up all the index boundary values.

Type2 and Type3 Query Examples
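Hypothetical examples in the spirit of this slide (the SQL dialect, table name, and column names are illustrative assumptions, based on the 4-dimensional salinity dataset described above):

    # Type 2: coordinate-based (dimension-scale) condition.
    type2_query = """
    SELECT salinity FROM salinity_dataset
    WHERE time BETWEEN 0 AND 100 AND layers BETWEEN 0 AND 10;
    """

    # Type 3: content-based (value-based) condition combined with a coordinate range.
    type3_query = """
    SELECT salinity FROM salinity_dataset
    WHERE salinity > 35.0 AND rows BETWEEN 0 AND 30;
    """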

Aggregation Query Examples AG1: Simple global aggregation AG2: GROUP BY clause + HAVING clause AG3: GROUP BY clause
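Purely illustrative examples matching the three categories named on the slide (query text and thresholds are assumptions, reusing the hypothetical salinity table from above):

    # AG1: simple global aggregation.
    ag1 = "SELECT AVG(salinity) FROM salinity_dataset;"

    # AG2: GROUP BY clause + HAVING clause.
    ag2 = "SELECT layers, MAX(salinity) FROM salinity_dataset GROUP BY layers HAVING MAX(salinity) > 36.0;"

    # AG3: GROUP BY clause.
    ag3 = "SELECT time, COUNT(*) FROM salinity_dataset GROUP BY time;"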

Sequential and Parallel Performance of Aggregation Queries

Array Databases Examples: SciDB, RasDaMan and MonetDB. Take Arrays as First-Class Citizens: everything is defined in the array dialect. Lightweight or No ACID Maintenance: no write conflicts, so ACID is inherently guaranteed. Other Desired Functionality: structural aggregations, array join, provenance… Array dialect: both the input and output are defined in an array schema, and every operation is array-oriented.

Structural Aggregations Aggregate the elements based on positional relationships. E.g., moving average: calculate the average of each 2 × 2 square while sliding from left to right. Input array (2 × 4): 1 2 3 4 / 5 6 7 8; aggregation result: 3.5 4.5 5.5 (aggregate the elements in the same square at a time).
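A tiny sketch that reproduces the slide's example (assuming a NumPy 2 × 4 input; the function name is illustrative):

    import numpy as np

    # Average each 2x2 square while sliding from left to right.
    def sliding_square_average(array, size=2):
        rows, cols = array.shape
        return [array[0:size, j:j + size].mean() for j in range(cols - size + 1)]

    a = np.array([[1, 2, 3, 4],
                  [5, 6, 7, 8]], dtype=float)
    print(sliding_square_average(a))   # [3.5, 4.5, 5.5]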

Coarse-Grained Partitioning Pros: low I/O cost; low communication cost. Cons: workload imbalance for skewed data.

Fine-Grained Partitioning Pros: excellent workload balance for skewed data. Cons: relatively high I/O cost; high communication cost.

Hybrid Partitioning Pros: low communication cost; good workload balance for skewed data. Cons: high I/O cost.

Auto-Grained Partitioning 2 Steps: 1) Estimate the grid density (after filtering) by sampling, and thus estimate the computation cost (based on the time complexity); for each grid, total processing cost = constant loading cost + varying computation cost. 2) Partition the cost array – balanced contiguous multi-way partitioning: dynamic programming (small # of grids) or greedy (large # of grids), as sketched below.
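A simple greedy sketch of balanced contiguous multi-way partitioning of a per-grid cost array into k contiguous pieces (the dynamic-programming variant for few grids is omitted; the names and the exact cutting rule are illustrative):

    # Cut a new partition once the running cost reaches the average target.
    def greedy_partition(costs, k):
        target = sum(costs) / k
        parts, current, acc = [], [], 0.0
        for c in costs:
            current.append(c)
            acc += c
            if acc >= target and len(parts) < k - 1:
                parts.append(current)
                current, acc = [], 0.0
        parts.append(current)            # the remaining grids form the last partition
        return parts

    print(greedy_partition([4, 1, 3, 2, 2, 5, 1], 3))   # [[4, 1, 3], [2, 2, 5], [1]]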

Auto-Grained Partitioning (Cont’d) Pros: low I/O cost; low communication cost; great workload balance for skewed data. Cons: overhead of sampling and runtime partitioning.

Partitioning Strategy Summary A table on this slide compares the four strategies along I/O performance, workload balance, scalability, and additional cost (e.g., coarse-grained: excellent I/O performance but poor workload balance, no additional cost; auto-grained: great workload balance but nontrivial additional cost). Our partitioning strategy decider can help choose the best strategy.

All-Reuse Approach (Cont’d) Key Insight: # of aggregates ≤ # of queried elements, so it is more computationally efficient to iterate over the elements and update the associated aggregates. More Benefits: load balance (for hierarchical/circular aggregations); more speedup for compound array elements – the data type of an aggregate is usually primitive, but this is not always true for an array element. Similar to a simple join over two tables, it is more computationally efficient to cache the smaller table and scan the larger table as few times as possible.

Parallel Grid Aggregation Performance Used 4 processors on a real-life dataset of 8 GB. User-Defined Aggregation: K-Means. Vary the number of iterations to vary the computation amount.

Data Access Strategies and Patterns Full Read: probably too expensive for reading a small data subset Partial Read Strided pattern Column pattern Discrete point pattern

Indexing Cost of Different Binning Strategies with Varying # of Bins on the Synthetic Dataset

SUM Aggregation of Equi-Width Binning with Varying # of Bins on the Synthetic Dataset

SUM Aggregation of Equi-Depth Binning with Varying # of Bins on the Synthetic Dataset

SUM Aggregation of V-Optimized Binning with Varying # of Bins on the Synthetic Dataset

Average Relative Error(%) of MAX Aggregation of Different Methods on the Real-World Dataset

SUM Aggregation Times of Different Methods on the Real-World Dataset (DB)

SUM Aggregation Times of Different Methods on the Real-World Dataset (VBD)

SUM Aggregation Times of Different Methods on the Real-World Dataset (VBS)

SUM Aggregation Times of Different Methods on the Real-World Dataset (CD)

SUM Aggregation Times of Different Methods on the Real-World Dataset (CS)

SD vs. Classification Classification techniques, such as decision trees or decision rules, appear unlikely to find all meaningful contrasts. A classifier finds a single model that maximizes the separation of multiple groups, not all the interesting models that contrast discovery seeks. The output of classification is likely to be an entire decision tree/classification system, which can have the same subsetting predicates at different levels. A classifier separates the groups from each other, not the subsets from the general population.