Big Data Infrastructure Week 6: Analyzing Relational Data (1/3) This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0.

Slides:



Advertisements
Similar presentations
Database and MapReduce
Advertisements

Supervisor : Prof . Abbdolahzadeh
OLAP Tuning. Outline OLAP 101 – Data warehouse architecture – ROLAP, MOLAP and HOLAP Data Cube – Star Schema and operations – The CUBE operator – Tuning.
Big Data: Analytics Platforms Donald Kossmann Systems Group, ETH Zurich 1.
INTEGRATING BIG DATA TECHNOLOGY INTO LEGACY SYSTEMS Robert Cooley, Ph.D.CodeFreeze 1/16/2014.
A Fast Growing Market. Interesting New Players Lyzasoft.
Data Warehouse Architecture Sakthi Angappamudali Data Architect, The Oregon State University, Corvallis 16 th May, 2005.
ICS 421 Spring 2010 Data Warehousing (1) Asst. Prof. Lipyeow Lim Information & Computer Science Department University of Hawaii at Manoa 3/18/20101Lipyeow.
Introduction to Data Warehousing. From DBMS to Decision Support DBMSs widely used to maintain transactional data Attempts to use of these data for analysis,
Data Mining.
Components and Architecture CS 543 – Data Warehousing.
Data Warehousing - 3 ISYS 650. Snowflake Schema one or more dimension tables do not join directly to the fact table but must join through other dimension.
13 Chapter 13 The Data Warehouse Hachim Haddouti.
Chapter 13 The Data Warehouse
Business Intelligence System September 2013 BI.
Data Warehousing: Defined and Its Applications Pete Johnson April 2002.
Hive: A data warehouse on Hadoop Based on Facebook Team’s paperon Facebook Team’s paper 8/18/20151.
Hadoop Team: Role of Hadoop in the IDEAL Project ●Jose Cadena ●Chengyuan Wen ●Mengsu Chen CS5604 Spring 2015 Instructor: Dr. Edward Fox.
Week 6 Lecture The Data Warehouse Samuel Conn, Asst. Professor
USING HADOOP & HBASE TO BUILD CONTENT RELEVANCE & PERSONALIZATION Tools to build your big data application Ameya Kanitkar.
Tyson Condie.
Physical Database Design & Performance. Optimizing for Query Performance For DBs with high retrieval traffic as compared to maintenance traffic, optimizing.
Distributed Indexing of Web Scale Datasets for the Cloud {ikons, eangelou, Computing Systems Laboratory School of Electrical.
Introduction to Hadoop and HDFS
Cube Intro. Decision Making Effective decision making Goal: Choice that moves an organization closer to an agreed-on set of goals in a timely manner Goal:
Data warehousing and online analytical processing- Ref Chap 4) By Asst Prof. Muhammad Amir Alam.
1 Data Warehouses BUAD/American University Data Warehouses.
OLAP & DSS SUPPORT IN DATA WAREHOUSE By - Pooja Sinha Kaushalya Bakde.
A NoSQL Database - Hive Dania Abed Rabbou.
CISB594 – Business Intelligence
Ch3 Data Warehouse Dr. Bernard Chen Ph.D. University of Central Arkansas Fall 2009.
By N.Gopinath AP/CSE. There are 5 categories of Decision support tools, They are; 1. Reporting 2. Managed Query 3. Executive Information Systems 4. OLAP.
Ayyat IT Group Murad Faridi Roll NO#2492 Muhammad Waqas Roll NO#2803 Salman Raza Roll NO#2473 Junaid Pervaiz Roll NO#2468 Instructor :- “ Madam Sana Saeed”
Chapter 5 DATA WAREHOUSING Study Sections 5.2, 5.3, 5.5, Pages: & Snowflake schema.
CMPE 226 Database Systems October 21 Class Meeting Department of Computer Engineering San Jose State University Fall 2015 Instructor: Ron Mak
Business Intelligence Transparencies 1. ©Pearson Education 2009 Objectives What business intelligence (BI) represents. The technologies associated with.
MIS2502: Data Analytics Advanced Analytics - Introduction.
CISB594 – Business Intelligence Data Warehousing Part I.
Information Systems in Organizations
Advanced Database Concepts
Copyright© 2014, Sira Yongchareon Department of Computing, Faculty of Creative Industries and Business Lecturer : Dr. Sira Yongchareon ISCG 6425 Data Warehousing.
1 Copyright © Oracle Corporation, All rights reserved. Business Intelligence and Data Warehousing.
Information Systems in Organizations Managing the business: decision-making Growing the business: knowledge management, R&D, and social business.
Copyright © 2016 Pearson Education, Inc. Modern Database Management 12 th Edition Jeff Hoffer, Ramesh Venkataraman, Heikki Topi CHAPTER 11: BIG DATA AND.
An Overview of Data Warehousing and OLAP Technology
This is a free Course Available on Hadoop-Skills.com.
Data Mining and Data Warehousing: Concepts and Techniques What is a Data Warehouse? Data Warehouse vs. other systems, OLTP vs. OLAP Conceptual Modeling.
Supervisor : Prof . Abbdolahzadeh
Big Data Infrastructure
Big Data Infrastructure
Big Data Infrastructure
Information Systems in Organizations
THE COMPELLING NEED FOR DATA WAREHOUSING
Data warehouse and OLAP
Chapter 13 The Data Warehouse
Data Platform and Analytics Foundational Training
Data Warehouse.
Presented by: Warren Sifre
Data Warehouse and OLAP
XtremeData on the Microsoft Azure Cloud Platform:
The Idea of Pig Or Pig Concepts
Data Warehouse.
Data Warehousing Concepts
Data-Intensive Distributed Computing
Big DATA.
Analytics, BI & Data Integration
Data Warehouse and OLAP
David Gilmore & Richard Blevins Senior Consultants April 17th, 2012
Data Warehouse and OLAP Technology
Data Warehousing.
Presentation transcript:

Big Data Infrastructure Week 6: Analyzing Relational Data (1/3) This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States See for details CS 489/698 Big Data Infrastructure (Winter 2016) Jimmy Lin David R. Cheriton School of Computer Science University of Waterloo February 11, 2016 These slides are available at

Structure of the Course “Core” framework features and algorithm design Analyzing Text Analyzing Graphs Analyzing Relational Data Data Mining

An organization should retain data that result from carrying out its mission and exploit those data to generate insights that benefit the organization, for example, market analysis, strategic planning, decision making, etc. Business Intelligence Duh!?

Source: Wikipedia

Database Workloads OLTP (online transaction processing) Typical applications: e-commerce, banking, airline reservations User facing: real-time, low latency, highly-concurrent Tasks: relatively small set of “standard” transactional queries Data access pattern: random reads, updates, writes (involving relatively small amounts of data) OLAP (online analytical processing) Typical applications: business intelligence, data mining Back-end processing: batch workloads, less concurrency Tasks: complex analytical queries, often ad hoc Data access pattern: table scans, large amounts of data per query An organization should retain data that result from carrying out its mission and exploit those data to generate insights that benefit the organization, for example, market analysis, strategic planning, decision making, etc.

One Database or Two? Downsides of co-existing OLTP and OLAP workloads Poor memory management Conflicting data access patterns Variable latency Solution: separate databases User-facing OLTP database for high-volume transactions Data warehouse for OLAP workloads How do we connect the two?

Data Warehousing Source: Wikipedia (Warehouse)

OLTP/OLAP Integration OLTP database for user-facing transactions Extract-Transform-Load (ETL) OLAP database for data warehousing

OLTP/OLAP Architecture OLTPOLAP ETL (Extract, Transform, and Load) A simple example to illustrate…

A Simple OLTP Schema CustomerBilling OrderInventory OrderLine

A Simple OLAP Schema Dim_Custome r Dim_Date Dim_Product Fact_Sales Dim_Store Stars and snowflakes, oh my!

ETL Extract Transform Data cleaning and integrity checking Schema conversion Field transformations Load When does ETL happen?

What do you actually do? Report generation Dashboards Ad hoc analyses

OLAP Cubes store product time slice and dice Common operations roll up/drill down pivot

OLAP Cubes: Challenges Fundamentally, lots of joins, group-bys and aggregations How to take advantage of schema structure to avoid repeated work? Cube materialization Realistic to materialize the entire cube? If not, how/when/what to materialize?

Fast forward…

“On the first day of logging the Facebook clickstream, more than 400 gigabytes of data was collected. The load, index, and aggregation processes for this data set really taxed the Oracle data warehouse. Even after significant tuning, we were unable to aggregate a day of clickstream data in less than 24 hours.” Jeff Hammerbacher, Information Platforms and the Rise of the Data Scientist. In, Beautiful Data, O’Reilly, 2009.

OLTP/OLAP Architecture OLTPOLAP ETL (Extract, Transform, and Load)

Facebook Context “OLTP”“OLAP” ETL (Extract, Transform, and Load) Adding friends Updating profiles Likes, comments … Feed ranking Friend recommendation Demographic analysis …

Facebook Technology “OLTP” PHP/MySQL

Facebook’s Datawarehouse “OLTP”Hadoop ETL (Extract, Transform, and Load) PHP/MySQL ETL or ELT?

What’s changed? Dropping cost of disks Cheaper to store everything than to figure out what to throw away

What’s changed? Dropping cost of disks Cheaper to store everything than to figure out what to throw away 5 MB hard drive in 1956

What’s changed? Dropping cost of disks Cheaper to store everything than to figure out what to throw away Types of data collected From data that’s obviously valuable to data whose value is less apparent Rise of social media and user-generated content Large increase in data volume Growing maturity of data mining techniques Demonstrates value of data analytics

a useful service analyze user behavior to extract insights transform insights into action $ (hopefully) Virtuous Product Cycle Google.Facebook. Twitter. Amazon. Uber.

What do you actually do? Report generation Dashboards Ad hoc analyses “Descriptive” “Predictive” Data products

a useful service analyze user behavior to extract insights transform insights into action $ (hopefully) Virtuous Product Cycle data sciencedata products

“On the first day of logging the Facebook clickstream, more than 400 gigabytes of data was collected. The load, index, and aggregation processes for this data set really taxed the Oracle data warehouse. Even after significant tuning, we were unable to aggregate a day of clickstream data in less than 24 hours.” Jeff Hammerbacher, Information Platforms and the Rise of the Data Scientist. In, Beautiful Data, O’Reilly, 2009.

The Irony… “OLTP”Hadoop ELT PHP/MySQL Wait, so why not use a database to begin with? reduce map … … reduce map … … reduce map … … reduce map … … SQL

Why not just use a database? Scalabilit y. Cost. SQL is awesome

Databases are great… If your data has structure (and you know what the structure is) If you know what queries you’re going to run ahead of time If your data is reasonably clean Databases are not so great… If your data has little structure (or you don’t know the structure) If you don’t know what you’re looking for If your data is messy and noisy

“there are known knowns; there are things we know we know. We also know there are known unknowns; that is to say we know there are some things we do not know. But there are unknown unknowns – the ones we don't know we don't know…” – Donald Rumsfeld Source: Wikipedia

Databases are great… If your data has structure (and you know what the structure is) If you know what queries you’re going to run ahead of time If your data is reasonably clean Databases are not so great… If your data has little structure (or you don’t know the structure) If you don’t know what you’re looking for If your data is messy and noisy Known unknowns! Unknown unknowns!

Advantages of Hadoop dataflow languages Don’t need to know the schema ahead of time Many analyses are better formulated imperatively Raw scans are the most common operations Also compare: data ingestion rate

What do you actually do? Dashboards Report generation Ad hoc analyses “Descriptive” “Predictive” Data products Which are known unknowns and unknown unknowns?

OLTP/OLAP Architecture OLTPOLAP ETL (Extract, Transform, and Load)

Modern Datawarehouse Ecosystem OLTP OLAP Databases HDFS SQL tools other tools

Facebook’s Datawarehouse “OLTP”Hadoop ELT PHP/MySQL How does this actually happen?

Twitter’s data warehousing architecture

circa ~2010 ~150 people total ~60 Hadoop nodes ~6 people use analytics stack daily circa ~2012 ~1400 people total 10s of Ks of Hadoop nodes, multiple DCs 10s of PBs total Hadoop DW capacity ~100 TB ingest daily dozens of teams use Hadoop daily 10s of Ks of Hadoop jobs daily

Twitter’s data warehousing architecture

Importing Log Data Scribe Daemons (Production Hosts) Main Hadoop DW Main Datacenter Staging Hadoop Cluster HDFS Scribe Aggregators Scribe Daemons (Production Hosts) Datacenter Staging Hadoop Cluster HDFS Scribe Aggregators Scribe Daemons (Production Hosts) Datacenter Staging Hadoop Cluster HDFS Scribe Aggregators

Importing Structured Data* DB partitions HDFS Tweets, graph, users profiles LZO-compressed protobufs select * from … mappers Important: Must carefully throttle resource usage… Different periodicity (e.g., hourly, daily snapshots, etc.) * Out of date – for illustration only

Vertica Pipeline “Basically, we use Vertica as a cache for HDFS HDFS VerticaMySQL import Birdbrain aggregation Why? Vertica provides orders of magnitude faster aggregations! Interactive browsing tools * Out of date – for illustration only

Vertica Pipeline HDFS VerticaMySQL Birdbrain The catch… Performance must be balanced against integration costs Vertica integration is non-trivial Interactive browsing tools importaggregation * Out of date – for illustration only

Vertica Pipeline HDFS Vertica import MySQL Birdbrain aggregation Interactive browsing tools Let’s just run this in reverse! DB partitions HDFS LZO-compressed protobufs select * from … mappers * Out of date – for illustration only

Vertica Pig Storage Vertica partitions HDFS reducers Vertica guarantees that each of these batch inserts are atomic So what’s the challenge? Did you remember to turn off speculative execution? What happens when a task dies? * Out of date – for illustration only

What’s Next?

RDBMS

OLTPOLAP ETL (Extract, Transform, and Load)

OLTPHadoop ELT

OLTP OLAP Databases HDFS SQL tools other tools

OLTPOLAP Hybrid Transactional/Analytical Processing (HTAP) Coming back full circle?

Source: Wikipedia (Japanese rock garden) Questions?