The Gamma Operator for Big Data Summarization on an Array DBMS

Slides:



Advertisements
Similar presentations
Big Data, Bigger Data & Big R Data Birmingham R Users Meeting 23 rd April 2013 Andy Pryke
Advertisements

LIBRA: Lightweight Data Skew Mitigation in MapReduce
Running Large Graph Algorithms – Evaluation of Current State-of-the-Art Andy Yoo Lawrence Livermore National Laboratory – Google Tech Talk Feb Summarized.
External Sorting CS634 Lecture 10, Mar 5, 2014 Slides based on “Database Management Systems” 3 rd ed, Ramakrishnan and Gehrke.
1 Lecture 5: Part 1 Performance Laws: Speedup and Scalability.
Complexity 19-1 Parallel Computation Complexity Andrei Bulatov.
The Gamma Operator for Big Data Summarization
FLANN Fast Library for Approximate Nearest Neighbors
Copyright © 2013, Oracle and/or its affiliates. All rights reserved. 1 Preview of Oracle Database 12 c In-Memory Option Thomas Kyte
SQL on Hadoop. Todays agenda Introduction Hive – the first SQL approach Data ingestion and data formats Impala – MPP SQL.
Hadoop Team: Role of Hadoop in the IDEAL Project ●Jose Cadena ●Chengyuan Wen ●Mengsu Chen CS5604 Spring 2015 Instructor: Dr. Edward Fox.
Path-State Modeling for Time Series Anomaly Detection Matt Mahoney.
1 A K-Means Based Bayesian Classifier Inside a DBMS Using SQL & UDFs Ph.D Showcase, Dept. of Computer Science Sasi Kumar Pitchaimalai Ph.D Candidate Database.
Oracle Challenges Parallelism Limitations Parallelism is the ability for a single query to be run across multiple processors or servers. Large queries.
An Introduction to HDInsight June 27 th,
(C) 2008 Clusterpoint(C) 2008 ClusterPoint Ltd. Empowering You to Manage and Drive Down Database Costs April 17, 2009 Gints Ernestsons, CEO © 2009 Clusterpoint.
Parallel Database Systems Instructor: Dr. Yingshu Li Student: Chunyu Ai.
Database Systems Carlos Ordonez. What is “Database systems” research? Input? large data sets, large files, relational tables How? Fast external algorithms;
Computer Science and Engineering Parallelizing Defect Detection and Categorization Using FREERIDE Leonid Glimcher P. 1 ipdps’05 Scaling and Parallelizing.
Big Data Analytics Carlos Ordonez. Big Data Analytics research Input? BIG DATA (large data sets, large files, many documents, many tables, fast growing)
Big data Usman Roshan CS 675. Big data Typically refers to datasets with very large number of instances (rows) as opposed to attributes (columns). Data.
PROOF tests at BNL Sergey Panitkin, Robert Petkus, Ofer Rind BNL May 28, 2008 Ann Arbor, MI.
Introduction to Database Systems1 External Sorting Query Processing: Topic 0.
Scalable data access with Impala Zbigniew Baranowski Maciej Grzybek Daniel Lanza Garcia Kacper Surdy.
External Sorting. Why Sort? A classic problem in computer science! Data requested in sorted order –e.g., find students in increasing gpa order Sorting.
Intel “Big Data” Science and Technology Center Michael Stonebraker.
Knowledge Discovery in a DBMS Data Mining Computing models and finding patterns in large databases current major challenge in database systems & large.
INTRODUCTION TO HIGH PERFORMANCE COMPUTING AND TERMINOLOGY.
Cassandra as Memcache Edward Capriolo Media6Degrees.com.
Our New Submit Server. chtc.cs.wisc.edu One thing out of the way... Your science is King! We need rules to facilitate resource sharing. Given good reasons.
Big Data is a Big Deal!.
Integrating the R Language Runtime System with a Data Stream Warehouse
Introduction to Parallel Computing: MPI, OpenMP and Hybrid Programming
Design Patterns for SSIS Performance
Execution Planning for Success
Solid State Disks Testing with PROOF
Scott Michael Indiana University July 6, 2017
Spark Presentation.
MyRocks at Facebook and Roadmaps
Introduction to NewSQL
Hadoop Clusters Tess Fulkerson.
The R language and its Dynamic Runtime
Introduction to Spark.
A Cloud System for Machine Learning Exploiting a Parallel Array DBMS
April 30th – Scheduling / parallel
Author: Ahmed Eldawy, Mohamed F. Mokbel, Christopher Jonathan
1 Demand of your DB is changing Presented By: Ashwani Kumar
Predictive Performance
Communication and Memory Efficient Parallel Decision Tree Construction
Lecture#12: External Sorting (R&G, Ch13)
External Sorting The slides for this text are organized into chapters. This lecture covers Chapter 11. Chapter 1: Introduction to Database Systems Chapter.
Ch 4. The Evolution of Analytic Scalability
Carlos Ordonez, Predrag T. Tosic
Parallel Analytic Systems
Overview of big data tools
Spark and Scala.
Big Data, Bigger Data & Big R Data
Big Data Analytics: Exploring Graphs with Optimized SQL Queries
CherryPick: Adaptively Unearthing the Best
External Sorting.
Optimized Algorithms for Data Analysis in Parallel Database Systems
Wellington Cabrera Carlos Ordonez
The Gamma Operator for Big Data Summarization
Wellington Cabrera, Carlos Ordonez (presenter)
Wellington Cabrera Advisor: Carlos Ordonez
Wellington Cabrera Advisor: Carlos Ordonez
Assoc. Prof. Marc FRÎNCU, PhD. Habil.
Carlos Ordonez, Javier Garcia-Garcia,
Time Complexity and Parallel Speedup to Compute the Gamma Summarization Matrix Carlos Ordonez, Yiqun Zhang University of Houston, USA 1.
L. Glimcher, R. Jin, G. Agrawal Presented by: Leo Glimcher
Presentation transcript:

The Gamma Operator for Big Data Summarization on an Array DBMS Carlos Ordonez, Yiqun Zhang, Wellington Cabrera University of Houston (USA) Summary SciDB: parallel array DBMS; query language and basic ACID Contribution: One pass parallel summarization matrix operator for generalized sufficient statistics. Features: only one pass and fully parallel summarization substitutes the data set operator linear time complexity and speedup. solves big family of statistical and machine learning models. Experimental benchmark: an order of magnitude faster than SciDB built-in operators (SciDB calls LAPACK). two orders of magnitude faster than SQL queries on a fast column DBMS. Much faster than the R package even when the data set fits in RAM. Can solve BIG problems in the cloud PCA: LR: VS: Experiments Algorithm Data set high d=100; n in log10-scale In R: read and load data set into RAM. Then do matrix transposition and matrix multiplication. SQL: self join+aggregation on the data set. AQL/AFL in SciDB: do self cross join on the data set. SciDB operators: work directly on the array. Definitions Sufficient Statistics Old: All experiments are performed under default cluster configuration: 1 node running 2 instances. Each node has 4 cores, 4GB RAM and 3TB disk space. New: