The Gamma Operator for Big Data Summarization on an Array DBMS

Carlos Ordonez, Yiqun Zhang, Wellington Cabrera
University of Houston (USA)

Summary

SciDB: a parallel array DBMS with a query language and basic ACID support.
Contribution: a one-pass, parallel summarization matrix operator for generalized sufficient statistics.

Features:
- Only one pass over the data, fully parallel.
- The summarization substitutes for the data set.
- The operator has linear time complexity and speedup.
- Solves a big family of statistical and machine learning models.

Experimental benchmark:
- An order of magnitude faster than SciDB built-in operators (SciDB calls LAPACK).
- Two orders of magnitude faster than SQL queries on a fast column DBMS.
- Much faster than the R package, even when the data set fits in RAM.
- Can solve big problems in the cloud.

Models: PCA, LR, VS (the slide's formulas are not preserved in this transcript).

Experiments

Data set: high d = 100; n on a log10 scale.

Algorithm, by system:
- R: read and load the data set into RAM, then do matrix transposition and matrix multiplication.
- SQL: self join plus aggregation on the data set.
- AQL/AFL in SciDB: self cross join on the data set.
- SciDB operators: work directly on the array.

All experiments are performed under the default cluster configuration: 1 node running 2 instances. Each node has 4 cores, 4 GB RAM and 3 TB disk space.

Definitions

Sufficient statistics, old and new (the slide's formulas are not preserved in this transcript).
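The transcript does not preserve the slide's definition of the summarization matrix, so the following is only a minimal sketch of the general idea under a stated assumption: each row (x_1, ..., x_d, y) is augmented to z = (1, x_1, ..., x_d, y), and the summary is the (d+2) x (d+2) matrix formed by summing z z^T over all rows. The function names are hypothetical, and this is plain Python, not the SciDB operator. It illustrates two of the claimed properties: the matrix is built in a single pass, and partial matrices from separate partitions can simply be added together.

```python
from typing import Iterable, List, Sequence, Tuple

Matrix = List[List[float]]


def gamma_partial(rows: Iterable[Sequence[float]], d: int) -> Matrix:
    """One pass over rows of the form (x_1, ..., x_d, y).

    Returns the (d+2) x (d+2) sum of z z^T, where z = (1, x_1, ..., x_d, y).
    ASSUMPTION: this definition is not stated in the transcript.
    """
    k = d + 2
    g = [[0.0] * k for _ in range(k)]
    for row in rows:
        z = [1.0, *row]  # augment the row with a leading 1
        for a in range(k):
            za = z[a]
            for b in range(k):
                g[a][b] += za * z[b]
    return g


def gamma_merge(g1: Matrix, g2: Matrix) -> Matrix:
    """Combine the summaries of two disjoint partitions by entrywise addition."""
    return [[u + v for u, v in zip(r1, r2)] for r1, r2 in zip(g1, g2)]


def mean_and_variance(g: Matrix, j: int) -> Tuple[float, float]:
    """Mean and population variance of entry j of z, using only the summary.

    g[0][0] holds n, g[0][j] the sum and g[j][j] the sum of squares.
    """
    n = g[0][0]
    mean = g[0][j] / n
    return mean, g[j][j] / n - mean * mean
```

For example, summarizing two halves of a small data set separately and merging them gives the same matrix as one pass over the whole set, and column statistics can then be read off the summary without touching the rows again.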

