Big Data Analytics and Data Warehousing with Data Cubes
Carlos Ordonez, University of Houston / AT&T Research Labs, NY
Goals of talk
1. Big Data
2. Cubes
3. Highlight some of my "cubic" research
Big Data: what is different from large databases? (V+V+V+V)
– Variety: loosely specified or no schema; storage: records -> files; data types -> any digital content
– Volume: higher volume, including streaming; multiple levels of granularity
– Velocity: speed of arrival/processing
– Veracity: Internet sources, multiple versions of the data
Big Data: specific technical details
– Integration needed!
– Finer granularity than transactions; give up ACID?
– Cannot be directly analyzed: pre-processing needed
– Diverse data sources, beyond relational/alphanumeric data
– Volume requires parallelism
– Skip ETL: load files directly
– Web logs, user interactions, social networks, streams
– Still, only HDD provides capacity at a good price; SSD is expensive; non-volatile RAM in the future?
Current technologies for Big Data
DBMS:
– Row stores
– Column stores
– Other: arrays, XML; Datalog is back
Hadoop stack:
– Apache: more important than GNU in the corporate world
– MapReduce: forgotten? HDFS: dominant
– Hive; SPARQL; Cassandra, Impala, Cask
– Many more, open source
Data Warehouse versus Data Lake (Data Swamp?)

Feature        | Data Warehouse                | Data Lake
Database model | ER model                      | None
ETL            | Involved, data transformation | Copy files
Querying       | SQL                           | SPARQL, Java programs, SQL?
Grow # nodes   | Difficult, $$$                | Easy, $$
Big Data Analytics Processing
– Data profiling
– Data exploration; univariate statistics
– Data preparation
– Multivariate statistics
– Machine learning algorithms
– Analytic modeling
– Scoring
– Lifecycle maintenance
– Model deployment
A highly iterative process.
Processing: DBMS versus Hadoop

Task                                | SQL     | Hadoop/NoSQL
Sequential open source available    | y       | y
Parallel open source                | n       | y
Fault tolerant on long jobs         | n       | y
Libraries                           | limited | many
Arrays and matrices                 | limited | good
Massive parallelism (1000s of CPUs) | n       | y
Some cons of big data not using DBMS technology
– No SQL
– No model, no DDL, no consistency (although transactions are too stringent)
– Web-scale data is tough, but not universal
– Database integration and cleaning much harder
– Parallel processing with too much hardware
– Fact: SQL remains the main query mechanism
Why analysis inside a DBMS?
(e.g. a Teradata warehouse, accessed from your PC with Warehouse Miner over ODBC)
– Huge data volumes: potentially better results with larger amounts of data; less processing time
– Minimizes data redundancy; eliminates proprietary data structures; simplifies data management; security
– Caveats: SQL, limited statistical functionality, complex DBMS architecture
DBMS Sequential vs Parallel Physical Operators
Serial DBMS (one CPU, RAID):
– table scan
– joins: hash join, sort-merge join, nested-loop join
– external merge sort
Parallel DBMS (shared-nothing):
– even row distribution, hashing
– parallel table scan
– parallel joins: large/large (sort-merge, hash); large/short (replicate the short table)
– distributed sort
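The hash join listed above can be sketched in a few lines: build a hash table on the smaller input, then probe it with the larger one. This is a toy in-memory illustration (the function name and sample tables are invented), not any DBMS's actual operator:

```python
def hash_join(build_rows, probe_rows, build_key, probe_key):
    """Equi-join two lists of dicts; the build side should be the smaller input."""
    table = {}
    for row in build_rows:                      # build phase: hash the small table
        table.setdefault(row[build_key], []).append(row)
    result = []
    for row in probe_rows:                      # probe phase: scan the large table
        for match in table.get(row[probe_key], []):
            result.append({**match, **row})
    return result

# Hypothetical dimension and fact tables
dims = [{"id": 1, "city": "NY"}, {"id": 2, "city": "LA"}]
facts = [{"id": 1, "amt": 10}, {"id": 1, "amt": 5}, {"id": 2, "amt": 7}]
joined = hash_join(dims, facts, "id", "id")   # 3 joined rows
```

The "replicate short" strategy for large/short parallel joins amounts to running this build phase with the same small table on every node.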
Big Data Analytics Overview
Simple:
– Ad-hoc queries
– Cubes: OLAP, MOLAP; includes descriptive statistics: histograms, means, plots, statistical tests
Complex:
– Statistical and machine learning models
– Patterns: graphs subsuming other problems
Cube Processing Input
– Data set F: n records, d dimensions, e measures
– Dimensions: discrete; measures: numeric
– Focus of the talk: the d dimensions
– I/O is the bottleneck
– Cube: lattice of the d dimensions
– High d is harder than high n
Cube computations
– Explore the lattice of dimensions
– Large n: F cannot fit in RAM; minimize I/O
– Multidimensional: d in the tens, maybe hundreds of dimensions
– Internally computed with data structures
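The lattice exploration described above can be sketched level-wise in plain Python: every subset of the d dimensions is one cuboid, giving 2^d group-bys. A toy illustration with made-up sample data, not the talk's actual algorithm:

```python
from itertools import combinations

def cube(rows, dims, measure):
    """Compute all 2^d cuboids of `rows` (list of dicts) by summing `measure`."""
    result = {}
    for k in range(len(dims) + 1):              # level-wise: cuboids of size k
        for cuboid in combinations(dims, k):
            agg = {}
            for r in rows:                      # one pass over the data per cuboid
                key = tuple(r[d] for d in cuboid)
                agg[key] = agg.get(key, 0) + r[measure]
            result[cuboid] = agg
    return result

rows = [{"store": "A", "month": 1, "sales": 10},
        {"store": "A", "month": 2, "sales": 5},
        {"store": "B", "month": 1, "sales": 7}]
c = cube(rows, ["store", "month"], "sales")
# c[()] holds the grand total; c[("store",)] totals per store, etc.
```

With d = 2 this already produces 4 cuboids; the exponential growth in d is exactly why high d is harder than high n.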
Cube algorithms: elevator story
– Behavior with respect to data set X: level-wise, k passes
– Time complexity bottleneck is d: O(n 2^d)
– Cubes research today:
  – Parallel processing
  – Data structures incompatible with relational DBs
  – Different time complexity in SQL/Hadoop/MapReduce
  – Incremental and online computation
Cubes inside the DBMS: more involved
Assumptions:
– data records are already in the DBMS; exporting is slow
– row-based or column-based storage
Programming alternatives:
– SQL and UDFs: SQL code generation (JDBC), precompiled UDFs. Extras: stored procedures, embedded SQL, cursors
– Internal C code (direct access to the file system and memory)
DBMS advantages:
– Columns 10X faster: compression + efficient projection
– Important: storage, queries, security
– Maybe: recovery, concurrency control, integrity, transactions (i.e. some ACID is OK)
Cubes outside the DBMS: alternatives
– Hadoop: dump data to the data lake; SQL-like querying later
– MOLAP tools:
  – Push hard aggregations into SQL
  – Memory-based lattice traversal
  – Interaction with spreadsheets
– Imperative programming languages instead of SQL (C++, Java):
  – Arrays, functions, modules, classes
  – Flexibility of control statements
Cube Processing Optimizations: Algorithmic & Systems
Algorithmic (90% of research, but not in a DBMS):
– accelerate/reduce cube computations
– database systems focus: reduce I/O passes
– approximate solutions: good for count(*), sum(); looked at with suspicion
– parallel
Systems (SQL, Hadoop, MapReduce, libraries):
– Platform: parallel DBMS server vs cluster of computers vs multicore CPUs
– Programming: SQL/C++ versus Java
Research Highlights (research with my students)
Comprehensive:
– Modeling
– Query processing
– Visualization
Biased:
– Motivated by DOLAP!
– Influenced by Stonebraker
– Mostly with my students
– Hadoop ignored
A glimpse
– Preparing and cleaning data takes time: ETL
– Lots of SQL and scripts written to prepare data sets for statistical analysis
– Data quality was hot; worth revisiting with big data
– Graph analytics
– Cube computation is the most researched topic; cube result analysis/interpretation is second priority
– Is "big data" different?
SQL and ER: can they get closer?
– Goal: creating a data set X with d dimensions D(K, A), where K is commonly a single id
– Lots of SQL queries, many temporary tables
– Users do not like to look at someone else's SQL code
– Decoupled from the ER model, not reused
– Many transformations: cubes, variable creation, even mathematical transformations for statistical analysis
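The kind of transformation described above, building an analysis data set X(K, A) from a normalized table, can be sketched with Python's sqlite3; the schema and column names here are invented for illustration:

```python
import sqlite3

# Hypothetical normalized source: an orders table keyed by customer.
# The goal is a per-customer data set X(id, n_orders, total_amt).
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE orders(id INT, cust INT, amt REAL);
INSERT INTO orders VALUES (1,1,10),(2,1,5),(3,2,7);
CREATE TABLE X AS
  SELECT cust AS id, COUNT(*) AS n_orders, SUM(amt) AS total_amt
  FROM orders GROUP BY cust;
""")
rows = con.execute("SELECT * FROM X ORDER BY id").fetchall()
# rows == [(1, 2, 15.0), (2, 1, 7.0)]
```

Real pipelines chain many such queries through temporary tables, which is exactly the reuse and documentation problem the slide points at.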
Representing data transformations done with SQL queries
SQL transformations in ER
Extended ER: zoom in
Referential Integrity QMs
SQL Optimizations: Queries vs UDFs
SQL query optimization:
– mathematical equations as queries
– Turing complete: SQL code generation from a programming language
UDFs as an optimization:
– substitute difficult/slow math computations
– push processing into RAM
SQL Query Processing
Columns will take over rows [Stonebraker]:
– Vertica and MonetDB: "pure" column stores
– Hybrid: Oracle Exadata, Teradata, SQL Server column indexes
But there is a lot of work to do:
– OLTP: rows, not columns (slow conversion)
– still many data warehouses working in row form
– many external tools store by row
Horizontal aggregations
– Create cross-tabular tables from a cube
– PIVOT requires knowing the values in advance
– Aggregations returned in a horizontal layout
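A horizontal aggregation can be emulated in standard SQL with CASE expressions inside aggregates; as noted above, the pivoted values (here months 1 and 2) must be known up front to generate the columns. A toy sqlite3 sketch with a hypothetical sales table:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE sales(store TEXT, month INT, amt REAL);
INSERT INTO sales VALUES ('A',1,10),('A',2,5),('B',1,7);
""")
# One output row per store, one column per month value:
sql = """
SELECT store,
       SUM(CASE WHEN month = 1 THEN amt ELSE 0.0 END) AS m1,
       SUM(CASE WHEN month = 2 THEN amt ELSE 0.0 END) AS m2
FROM sales GROUP BY store ORDER BY store
"""
rows = con.execute(sql).fetchall()
# rows == [('A', 10.0, 5.0), ('B', 7.0, 0.0)]
```

Generating one CASE expression per distinct value is what a horizontal-aggregation operator (or a meta-optimizer) automates.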
Prepare data set: horizontal aggregations
Horizontal Meta-optimizer
Graph Analytics
– Recursive queries in SQL
– Patterns: paths, cycles, cliques
– Examples:
  – Twitter: who follows whom? how many #?
  – Facebook: family, close friends, social circles, friends in common
  – Airline: list all flights from A to B; balance cost/distance
– Surprisingly, SQL is good! But with a column DBMS
A benchmark to compute the number of paths of length k in a graph
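Counting paths of length up to k can be expressed as a recursive SQL query; below is a minimal sketch using SQLite's WITH RECURSIVE on a tiny made-up edge table (not the benchmark's actual schema):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE E(src INT, dst INT);
INSERT INTO E VALUES (1,2),(2,3),(1,3),(3,4);
""")
# Extend every known path by one edge until k hops are reached.
k = 3
sql = """
WITH RECURSIVE paths(src, dst, len) AS (
  SELECT src, dst, 1 FROM E
  UNION ALL
  SELECT p.src, e.dst, p.len + 1
  FROM paths p JOIN E e ON p.dst = e.src
  WHERE p.len < ?
)
SELECT len, COUNT(*) FROM paths GROUP BY len ORDER BY len
"""
rows = con.execute(sql, (k,)).fetchall()
# rows == [(1, 4), (2, 3), (3, 1)]
```

Each recursive step is a self-join of the path table with the edge table, which is why a column store with a fast join is the right engine for this workload.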
Cube computation with a UDF (table function)
– Data structure in RAM; maybe one pass
– Requires the maximal cuboid or choosing k dimensions
Cube in a UDF: lattice manipulated with a hash table
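The hash-table idea can be sketched as a single scan that updates one hash entry per (cuboid, group) pair, which is how a table UDF can aggregate the whole lattice in RAM. A toy Python illustration with invented data, not the actual UDF code:

```python
from itertools import combinations

def cube_one_pass(rows, dims, measure):
    """One scan over the data; the hash table keys on (cuboid, group values)."""
    h = {}
    for r in rows:                               # single pass over F
        for k in range(len(dims) + 1):           # update every cuboid per record
            for cuboid in combinations(dims, k):
                key = (cuboid, tuple(r[d] for d in cuboid))
                h[key] = h.get(key, 0) + r[measure]
    return h

rows = [{"store": "A", "month": 1, "sales": 10},
        {"store": "B", "month": 1, "sales": 7}]
h = cube_one_pass(rows, ["store", "month"], "sales")
# h[((), ())] is the grand total: 17
```

The trade-off versus a level-wise algorithm is clear here: one I/O pass, but 2^d hash updates per record, so RAM and d must stay modest.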
Cube visualization: harder than 2D or 3D data!
– Lattice exploration
– Projection into 2D
– Comparing cuboids
Cube interpretation & visualization: statistical tests on cubes
Can we do "search engines"? Keyword search, ranking
Acknowledgments
– Il-Yeol Song, since we met in 2010, but I started sending papers to DOLAP in 2003
– Mike Stonebraker: one size does not fit all
– My students