Big Data Analytics and Data Warehousing with Data Cubes
Carlos Ordonez, University of Houston / AT&T Research Labs, NY



Goals of talk
1. Big Data
2. Cubes
3. Highlight some of my “cubic” research

Big Data: what is different from large databases? VVV+V
Variety:
- Loosely specified or no schema
- Storage: records -> files; data types -> any digital content
Volume:
- Higher volume, including streaming
- Multiple levels of granularity
Velocity: speed of arrival/processing
Veracity: Internet, multiple versions of data

Big Data: specific technical details
- Integration needed!
- Finer granularity than transactions; give up ACID?
- Cannot be directly analyzed: pre-processing required
- Diverse data sources, beyond relational/alphanumeric data
- Volume requires parallelism
- Skip ETL: load files directly
- Web logs, user interactions, social networks, streams
- Still only HDD provides capacity at a good price ($); SSD costs more ($$); non-volatile RAM in the future?

Current technologies for Big Data
DBMS:
- Row stores
- Column stores
- Other: array, XML; Datalog is back
Hadoop stack:
- Apache: more important than GNU in the corporate world
- MapReduce: forgotten?; HDFS: dominant
- Hive, SPARQL, Cassandra, Impala, Cask
- Many more: open source

Data Warehouse versus Data Lake (Data Swamp?)

Feature          Data Warehouse                 Data Lake
Database model   ER model                       None
ETL              Involved, data transformation  Copy file
Querying         SQL                            SPARQL, Java program, SQL?
Grow # nodes     Difficult, $$$                 Easy, $$

Big Data Analytics Processing
- Data Profiling
- Data Exploration; univariate stats
- Data Preparation
- Multivariate Statistics
- Machine Learning Algorithms
- Analytic Modeling
- Scoring
- Lifecycle Maintenance
- Model Deployment
A highly iterative process.

Processing: DBMS versus Hadoop

Task                                             SQL      Hadoop/NoSQL
Sequential open source available                 y        y
Parallel open source                             n        y
Fault tolerant on long jobs                      n        y
Libraries                                        limited  many
Arrays and matrices                              limited  good
Massive parallelism (# servers, 1000s of CPUs)   n        y

Some cons about big data not using DBMS technology
- No SQL
- No model, no DDL
- No consistency, although transactions may be too stringent
- Web-scale data is tough, but not universal
- Database integration and cleaning become much harder
- Parallel processing with too much hardware
- Fact: SQL remains the main query mechanism

Why analysis inside a DBMS?
[Diagram: Teradata warehouse connected via ODBC to your PC with Warehouse Miner]
- Huge data volumes: potentially better results with larger amounts of data; less processing time
- Minimizes data redundancy; eliminates proprietary data structures; simplifies data management; security
- Caveats: SQL, limited statistical functionality, complex DBMS architecture

DBMS Sequential vs Parallel Physical Operators
Serial DBMS (one CPU, RAID):
- table scan
- join: hash join, sort-merge join, nested loop
- external merge sort
Parallel DBMS (shared-nothing):
- even row distribution, hashing
- parallel table scan
- parallel joins: large/large (sort-merge, hash); large/short (replicate the short table)
- distributed sort
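As an illustration of the build/probe pattern behind the hash join listed above, here is a minimal Python sketch (the toy relations and column names are made up for the example):

```python
from collections import defaultdict

def hash_join(r, s, r_key, s_key):
    """Classic hash join: build a hash table on one input (ideally the
    smaller one), then probe it with each row of the other input."""
    build = defaultdict(list)
    for row in r:                          # build phase
        build[row[r_key]].append(row)
    out = []
    for row in s:                          # probe phase
        for match in build.get(row[s_key], []):
            out.append({**match, **row})   # merge matching rows
    return out

# toy relations (illustrative only)
dept = [{"did": 1, "dname": "HR"}, {"did": 2, "dname": "Eng"}]
emp  = [{"eid": 10, "did": 2}, {"eid": 11, "did": 1}, {"eid": 12, "did": 2}]
joined = hash_join(dept, emp, "did", "did")
```

A parallel shared-nothing version hashes both inputs on the join key across nodes, then runs this same local join per node.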

Big Data Analytics Overview
Simple:
- Ad-hoc queries
- Cubes: OLAP, MOLAP; includes descriptive statistics: histograms, means, plots, statistical tests
Complex:
- Statistical and machine learning models
- Patterns: graphs, subsuming other problems

Cube Processing Input
- Data set F: n records, d dimensions, e measures
- Dimensions: discrete; measures: numeric
- Focus of the talk: the d dimensions
- I/O bottleneck
- Cube: lattice of d dimensions
- High d is harder than high n

Cube computations
- Explore the lattice of dimensions
- Large n: F cannot fit in RAM; minimize I/O
- Multidimensional: d in the tens, maybe hundreds of dimensions
- Internally computed with data structures

Cube algorithms: elevator story
- Behavior with respect to data set X: level-wise, k passes
- Time complexity bottleneck is d: O(n * 2^d)
- Cubes research today:
  - Parallel processing
  - Data structures incompatible with relational DBs
  - Different time complexity in SQL/Hadoop/MapReduce
  - Incremental and online computation
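The O(n * 2^d) cost comes from aggregating F once per cuboid of the lattice. A naive hash-table sketch in Python makes this concrete (illustrative only; it is none of the optimized algorithms the talk discusses, and the toy data is made up):

```python
from itertools import combinations

def cube(rows, dims, measure):
    """Compute all 2^d cuboids of the CUBE lattice with hash tables.
    Naive: one full pass over the data per cuboid, hence O(n * 2^d)."""
    d = len(dims)
    result = {}
    for k in range(d + 1):
        for cuboid in combinations(dims, k):   # lattice level k
            agg = {}
            for row in rows:                   # one pass per cuboid
                key = tuple(row[a] for a in cuboid)
                agg[key] = agg.get(key, 0) + row[measure]
            result[cuboid] = agg
    return result

rows = [{"city": "NY", "yr": 2020, "sales": 5},
        {"city": "NY", "yr": 2021, "sales": 7},
        {"city": "LA", "yr": 2020, "sales": 3}]
c = cube(rows, ("city", "yr"), "sales")
# c[()] holds the grand total; c[("city",)] holds per-city totals
```

Level-wise algorithms improve on this by deriving each level-k cuboid from a level-(k+1) parent instead of rescanning the base data.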

Cubes inside a DBMS: more involved
Assumptions:
- data records are in the DBMS; exporting is slow
- row-based or column-based storage
Programming alternatives:
- SQL and UDFs: SQL code generation (JDBC), precompiled UDFs. Extras: stored procedures, embedded SQL, cursors
- internal C code (direct access to the file system and memory)
DBMS advantages:
- columns 10X faster: compression + efficient projection
- important: storage, queries, security
- maybe: recovery, concurrency control, integrity, transactions (i.e., some ACID is OK)

Cubes outside the DBMS: alternatives
- Hadoop: dump data to the Data Lake; SQL-like querying later
- MOLAP tools:
  - push hard aggregations down with SQL
  - memory-based lattice traversal
  - interaction with spreadsheets
- Imperative programming languages instead of SQL: C++, Java
  - arrays, functions, modules, classes
  - flexibility of control statements

Cube Processing Optimizations: Algorithmic & Systems
Algorithmic (90% of research, but not in a DBMS):
- accelerate/reduce cube computations
- database systems focus: reduce I/O passes
- approximate solutions: good for count(*), sum(), but looked at with suspicion
- parallel
Systems (SQL, Hadoop, MapReduce, libraries):
- platform: parallel DBMS server vs cluster of computers vs multicore CPUs
- programming: SQL/C++ versus Java

Research Highlights: research with my students
Comprehensive:
- Modeling
- Query processing
- Visualization
Biased:
- Motivated by DOLAP!
- Influenced by Stonebraker
- Mostly with my students
- Hadoop ignored

A glimpse
- Preparing and cleaning data takes time: ETL
- Lots of SQL and scripts written to prepare data sets for statistical analysis
- Data quality was hot; worth revisiting with big data
- Graph analytics
- Cube computation is the most researched topic; cube result analysis/interpretation is a 2nd priority
- Is “Big Data” different?

SQL and ER: can they get closer?
- Goal: creating a data set X with d dimensions D(K,A), K commonly a single id
- Lots of SQL queries, many temporary tables
- Users do not like to look at someone else’s SQL code
- Decoupled from the ER model, not reused
- Many transformations: cubes, variable creation, even math transformations for statistical analysis

Representing data transformations done with SQL queries

SQL transformations in the ER model

Extended ER: zoom in

Referential Integrity QMs

SQL Optimizations: Queries vs UDFs
SQL query optimization:
- mathematical equations as queries
- Turing-complete: SQL code generation from a programming language
UDFs as an optimization:
- substitute difficult/slow math computations
- push processing into RAM
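SQL code generation can be sketched as string templating from a host language. This hypothetical helper (names and layout are assumptions, not code from the talk) emits a single query computing the count n, the linear sums L, and the quadratic sums Q over numeric columns — the kind of summarization many statistical models need:

```python
def gen_sums(table, cols):
    """Generate one SQL aggregation query for n, L (linear sums),
    and Q (quadratic sums) over the given numeric columns (sketch)."""
    terms = ["count(*) AS n"]
    terms += [f"sum({c}) AS L_{c}" for c in cols]
    terms += [f"sum({a}*{b}) AS Q_{a}_{b}"          # upper triangle only
              for i, a in enumerate(cols) for b in cols[i:]]
    return f"SELECT {', '.join(terms)} FROM {table}"

q = gen_sums("X", ["x1", "x2"])
```

The generated string can then be sent through JDBC/ODBC; the point is that one generated query replaces many hand-written ones.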

SQL Query Processing
Columns will take over rows [Stonebraker]:
- Vertica and MonetDB: “pure” column stores
- Hybrid: Oracle Exadata, Teradata, SQL Server indexes
But a lot of work remains:
- OLTP: rows, not columns (slow conversion)
- still many data warehouses working in row form
- many external tools store by row

Horizontal aggregations
- Create cross-tabular tables from a cube
- PIVOT requires knowing the values beforehand
- Aggregations in a horizontal layout
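One common way to express a horizontal aggregation once the pivot values are known (e.g., via a prior SELECT DISTINCT) is to generate one CASE expression per value. A hypothetical generator, with illustrative table/column names:

```python
def horizontal_agg(table, group_col, pivot_col, pivot_values, measure):
    """Emit a SQL horizontal aggregation: one output column per value
    of pivot_col, built from CASE expressions (sketch)."""
    cases = [
        f"sum(CASE WHEN {pivot_col} = '{v}' THEN {measure} END) "
        f"AS {measure}_{v}"
        for v in pivot_values
    ]
    return (f"SELECT {group_col}, {', '.join(cases)} "
            f"FROM {table} GROUP BY {group_col}")

q = horizontal_agg("sales", "city", "yr", ["2020", "2021"], "amt")
```

Unlike PIVOT, the value list is computed at generation time, so the same code works for any data set.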

Prepare Data Set: horizontal aggregations

Horizontal Meta-optimizer

Graph Analytics
- Recursive queries in SQL
- Patterns: paths, cycles, cliques
- Examples:
  - Twitter: who follows whom? how many #?
  - Facebook: family, close friends, social circles, friends in common
  - Airline: list all flights from A to B; balance cost/distance
- Surprisingly, SQL is good! But with a column DBMS

A benchmark: compute the number of paths of length k in a graph
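Counting paths of length k amounts to k-1 self-joins of the edge table (equivalently, taking the k-th power of the adjacency matrix and summing its entries). A small Python sketch of that recursion, with a made-up triangle graph as input:

```python
def count_paths(edges, k):
    """Count paths of length k via k-1 self-joins of the edge table
    (repeated sparse adjacency-matrix multiplication)."""
    adj = {}
    for i, j in edges:                     # adjacency as {src: {dst: count}}
        row = adj.setdefault(i, {})
        row[j] = row.get(j, 0) + 1
    paths = {i: dict(d) for i, d in adj.items()}   # paths of length 1
    for _ in range(k - 1):                 # one join per extra hop
        nxt = {}
        for i, mids in paths.items():
            for j, c in mids.items():
                for dst, c2 in adj.get(j, {}).items():
                    row = nxt.setdefault(i, {})
                    row[dst] = row.get(dst, 0) + c * c2
        paths = nxt
    return sum(c for d in paths.values() for c in d.values())

# directed triangle 0 -> 1 -> 2 -> 0
triangle = [(0, 1), (1, 2), (2, 0)]
```

In SQL the same recursion is a WITH RECURSIVE query joining the edge table with itself k-1 times; column stores tend to do well here because each join touches only the two key columns.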

Cube computation with a UDF (table function)
- Data structure in RAM; maybe one pass
- Requires the maximal cuboid or choosing k dimensions

Cube in a UDF
- Lattice manipulated with a hash table

Cube visualization: harder than 2D or 3D data!
- Lattice exploration
- Projection into 2D
- Comparing cuboids

Cube interpretation & visualization: statistical tests on cubes

Can we do “search engines”? Keyword search, ranking

Acknowledgments
- Il-Yeol Song, since we met in 2010, but I started sending papers to DOLAP in 2003
- Mike Stonebraker: one size does not fit all
- My students