Data-Intensive Distributed Computing

Presentation transcript:

Data-Intensive Distributed Computing
CS 451/651 (Fall 2018)
Part 5: Analyzing Relational Data (1/3)
October 16, 2018
Jimmy Lin, David R. Cheriton School of Computer Science, University of Waterloo
These slides are available at http://lintool.github.io/bigdata-2018f/
This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States license. See http://creativecommons.org/licenses/by-nc-sa/3.0/us/ for details.

Structure of the Course: Analyzing Text, Analyzing Graphs, Analyzing Relational Data, Data Mining. “Core” framework features and algorithm design.

Evolution of Enterprise Architectures Next two sessions: techniques, algorithms, and optimizations for relational processing

users Monolithic Application

users Frontend Backend

Source: Wikipedia

users Frontend Backend database Why is this a good idea?

Business Intelligence: An organization should retain data that result from carrying out its mission and exploit those data to generate insights that benefit the organization, for example, market analysis, strategic planning, decision making, etc. Duh!?

users Frontend Backend database BI tools analysts

users Frontend Backend database BI tools analysts Why is my application so slow? Why does my analysis take so long?

Database Workloads
OLTP (online transaction processing)
Typical applications: e-commerce, banking, airline reservations
User facing: real-time, low latency, highly concurrent
Tasks: relatively small set of “standard” transactional queries
Data access pattern: random reads, updates, writes (small amounts of data)
OLAP (online analytical processing)
Typical applications: business intelligence, data mining
Back-end processing: batch workloads, less concurrency
Tasks: complex analytical queries, often ad hoc
Data access pattern: table scans, large amounts of data per query
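The two access patterns can be contrasted in a few lines. This is a minimal sketch using Python's built-in sqlite3 module; the `orders` table, its columns, and the sample rows are assumptions made up for illustration and do not come from the slides.

```python
import sqlite3

# Hypothetical single-table schema, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders (id, customer, amount) VALUES (?, ?, ?)",
    [(1, "alice", 10.0), (2, "bob", 25.0), (3, "alice", 5.0), (4, "carol", 40.0)],
)

def oltp_update_order(order_id, new_amount):
    """OLTP-style: touch one row by key (random read plus a small write)."""
    with conn:  # commits on success, rolls back on error
        conn.execute("UPDATE orders SET amount = ? WHERE id = ?", (new_amount, order_id))
    row = conn.execute("SELECT amount FROM orders WHERE id = ?", (order_id,)).fetchone()
    return row[0]

def olap_revenue_by_customer():
    """OLAP-style: scan the whole table and aggregate."""
    rows = conn.execute(
        "SELECT customer, SUM(amount) FROM orders GROUP BY customer ORDER BY customer"
    )
    return dict(rows.fetchall())

updated_amount = oltp_update_order(2, 30.0)
revenue = olap_revenue_by_customer()
```

The first function reads and writes a single row by primary key, while the second reads every row. Running both kinds of query against one database is the conflict the next slides describe.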

OLTP and OLAP Together?
Downsides of co-existing OLTP and OLAP workloads: poor memory management, conflicting data access patterns, variable latency
users and analysts
Solution?

Build a data warehouse! Source: Wikipedia (Warehouse)

users Frontend Backend OLTP database (OLTP database for user-facing transactions) ETL (Extract, Transform, and Load) Data Warehouse (OLAP database for data warehousing) BI tools analysts What’s special about OLTP vs. OLAP?

A Simple OLTP Schema: Customer, Billing, Inventory, Order, OrderLine

A Simple OLAP Schema: Dim_Customer, Dim_Date, Dim_Product, Dim_Store, Fact_Sales. Stars and snowflakes, oh my!
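A star schema built from the table names on this slide might look like the sketch below. The table names come from the slide. Every column name, key, and sample row is an assumption added for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Dim_Customer (customer_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE Dim_Date     (date_id INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
CREATE TABLE Dim_Product  (product_id INTEGER PRIMARY KEY, category TEXT);
CREATE TABLE Dim_Store    (store_id INTEGER PRIMARY KEY, city TEXT);
-- The fact table holds the measures plus one foreign key per dimension.
CREATE TABLE Fact_Sales (
    customer_id INTEGER REFERENCES Dim_Customer(customer_id),
    date_id     INTEGER REFERENCES Dim_Date(date_id),
    product_id  INTEGER REFERENCES Dim_Product(product_id),
    store_id    INTEGER REFERENCES Dim_Store(store_id),
    quantity    INTEGER,
    revenue     REAL
);
INSERT INTO Dim_Customer VALUES (1, 'alice'), (2, 'bob');
INSERT INTO Dim_Date VALUES (1, 2018, 9), (2, 2018, 10);
INSERT INTO Dim_Product VALUES (1, 'books'), (2, 'games');
INSERT INTO Dim_Store VALUES (1, 'Waterloo'), (2, 'Toronto');
INSERT INTO Fact_Sales VALUES
    (1, 1, 1, 1, 2, 20.0),
    (2, 1, 2, 2, 1, 60.0),
    (1, 2, 1, 2, 3, 30.0);
""")

# A typical analytical query: join the fact table to two dimensions, then aggregate.
revenue_by_category_month = conn.execute("""
    SELECT p.category, d.month, SUM(f.revenue)
    FROM Fact_Sales f
    JOIN Dim_Product p ON f.product_id = p.product_id
    JOIN Dim_Date d ON f.date_id = d.date_id
    GROUP BY p.category, d.month
    ORDER BY p.category, d.month
""").fetchall()
```

In a snowflake variant, a dimension such as Dim_Product would itself be normalized into further tables, so the same query would need additional joins.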

ETL
Extract
Transform: data cleaning and integrity checking, schema conversion, field transformations
Load
When does ETL happen?
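The three ETL stages can be sketched as plain functions. This is only an illustration under stated assumptions: the source rows, field names, and the specific cleaning rules are all hypothetical, and a real pipeline would read from and write to actual databases.

```python
from datetime import datetime

# Hypothetical rows as they might come out of an OLTP "Order" table.
oltp_orders = [
    {"order_id": 1, "customer": " Alice ", "placed_at": "2018-10-16T09:30:00", "total_cents": 1999},
    {"order_id": 2, "customer": "bob", "placed_at": "2018-10-16T11:05:00", "total_cents": -5},
    {"order_id": 3, "customer": "Carol", "placed_at": "2018-10-17T14:00:00", "total_cents": 4500},
]

def extract(source):
    """Extract: read rows from the source system."""
    return list(source)

def transform(rows):
    """Transform: integrity checking, data cleaning, schema conversion, field transformations."""
    facts = []
    for row in rows:
        if row["total_cents"] < 0:  # integrity check: drop impossible totals
            continue
        placed = datetime.fromisoformat(row["placed_at"])
        facts.append({
            "order_id": row["order_id"],
            "customer": row["customer"].strip().lower(),  # data cleaning
            "date_key": placed.strftime("%Y%m%d"),        # schema conversion: timestamp to date key
            "revenue": row["total_cents"] / 100,          # field transformation: cents to dollars
        })
    return facts

def load(warehouse, facts):
    """Load: append the transformed rows to the warehouse table."""
    warehouse.extend(facts)
    return len(facts)

warehouse_fact_sales = []
loaded = load(warehouse_fact_sales, transform(extract(oltp_orders)))
```

If a job like this runs once per night, the warehouse always lags the OLTP database by up to a day, which is the trade-off the following slide points at.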

users Frontend Backend OLTP database ETL (Extract, Transform, and Load) Data Warehouse My data is a day old… Meh. BI tools analysts

external APIs users users Frontend Frontend Frontend Backend Backend Backend OLTP database OLTP database OLTP database ETL (Extract, Transform, and Load) Data Warehouse BI tools analysts

What do you actually do? Report generation, dashboards, ad hoc analyses

OLAP Cubes (dimensions: time, product, store). Common operations: slice and dice, roll up/drill down, pivot
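The cube operations named on this slide can be illustrated over a tiny in-memory fact list. The sketch below is an assumption-laden toy: the sample facts and the helper names `slice_cube`, `roll_up`, and `pivot` are invented for illustration and do not correspond to any particular OLAP engine's API.

```python
from collections import defaultdict

# Hypothetical sales facts keyed by the three cube dimensions from the slide.
facts = [
    {"time": "2018-Q3", "product": "books", "store": "Waterloo", "sales": 10},
    {"time": "2018-Q3", "product": "games", "store": "Waterloo", "sales": 20},
    {"time": "2018-Q4", "product": "books", "store": "Toronto", "sales": 30},
    {"time": "2018-Q4", "product": "games", "store": "Waterloo", "sales": 40},
]

def slice_cube(rows, dimension, value):
    """Slice: fix one dimension to a single value."""
    return [r for r in rows if r[dimension] == value]

def roll_up(rows, keep):
    """Roll up: aggregate away every dimension not listed in `keep`."""
    totals = defaultdict(int)
    for r in rows:
        totals[tuple(r[d] for d in keep)] += r["sales"]
    return dict(totals)

def pivot(rows, row_dim, col_dim):
    """Pivot: lay out totals as a nested row -> column table."""
    table = defaultdict(dict)
    for (row_key, col_key), total in roll_up(rows, [row_dim, col_dim]).items():
        table[row_key][col_key] = total
    return dict(table)

waterloo = slice_cube(facts, "store", "Waterloo")
by_time = roll_up(facts, ["time"])
product_by_time = pivot(facts, "product", "time")
```

In this toy, dicing amounts to applying `slice_cube` on more than one dimension, and drilling down is the reverse of rolling up: calling `roll_up` with more dimensions in `keep`.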

OLAP Cubes: Challenges
Fundamentally, lots of joins, group-bys and aggregations
How to take advantage of schema structure to avoid repeated work?
Cube materialization
Realistic to materialize the entire cube?
If not, how/when/what to materialize?
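To see why materializing the entire cube gets expensive, consider a naive sketch that precomputes one group-by for every subset of dimensions. With n dimensions that is 2^n group-bys. The data and the function name `materialize_cube` are hypothetical, and this deliberately simple version rescans the facts for every group-by, which is exactly the repeated work the slide asks about.

```python
from collections import defaultdict
from itertools import combinations

DIMENSIONS = ("time", "product", "store")

# Hypothetical facts, for illustration only.
facts = [
    {"time": "Q3", "product": "books", "store": "Waterloo", "sales": 10},
    {"time": "Q3", "product": "games", "store": "Waterloo", "sales": 20},
    {"time": "Q4", "product": "books", "store": "Toronto", "sales": 30},
]

def materialize_cube(rows, dimensions):
    """Precompute SUM(sales) for every subset of dimensions (2^n group-bys)."""
    cube = {}
    for size in range(len(dimensions) + 1):
        for dims in combinations(dimensions, size):
            totals = defaultdict(int)
            for r in rows:  # naive: one full scan per group-by
                totals[tuple(r[d] for d in dims)] += r["sales"]
            cube[dims] = dict(totals)
    return cube

cube = materialize_cube(facts, DIMENSIONS)
cell_count = sum(len(cells) for cells in cube.values())
```

One way to reduce the repeated scans, offered here only as a possible direction, is to derive each coarser group-by from an already computed finer one instead of from the raw facts. Which group-bys are worth storing at all is the open question on the slide.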

external APIs users users Frontend Frontend Frontend Backend Backend Backend OLTP database OLTP database OLTP database ETL (Extract, Transform, and Load) Data Warehouse BI tools analysts

Fast forward…

Jeff Hammerbacher, Information Platforms and the Rise of the Data Scientist. In Beautiful Data, O’Reilly, 2009. “On the first day of logging the Facebook clickstream, more than 400 gigabytes of data was collected. The load, index, and aggregation processes for this data set really taxed the Oracle data warehouse. Even after significant tuning, we were unable to aggregate a day of clickstream data in less than 24 hours.”

users Frontend Backend OLTP database ETL (Extract, Transform, and Load) Data Warehouse BI tools analysts Facebook context?

users Frontend Backend “OLTP”: adding friends, updating profiles, likes, comments … ETL (Extract, Transform, and Load) Data Warehouse: feed ranking, friend recommendation, demographic analysis … BI tools analysts

users Frontend Backend “OLTP” PHP/MySQL ETL (Extract, Transform, and Load) or ELT? Hadoop ✗ analysts data scientists

What’s changed? Dropping cost of disks: cheaper to store everything than to figure out what to throw away. (Image: 5 MB hard drive in 1956)

What’s changed?
Dropping cost of disks: cheaper to store everything than to figure out what to throw away
Types of data collected: from data that’s obviously valuable to data whose value is less apparent
Rise of social media and user-generated content: large increase in data volume
Growing maturity of data mining techniques: demonstrates value of data analytics

Virtuous Product Cycle: a useful service, analyze user behavior to extract insights, transform insights into action, $ (hopefully). Google. Facebook. Twitter. Amazon. Uber.

What do you actually do? Report generation Dashboards Ad hoc analyses “Descriptive” “Predictive” Data products

Virtuous Product Cycle: a useful service, analyze user behavior to extract insights, transform insights into action, $ (hopefully). Google. Facebook. Twitter. Amazon. Uber. data products, data science

Jeff Hammerbacher, Information Platforms and the Rise of the Data Scientist. In Beautiful Data, O’Reilly, 2009. “On the first day of logging the Facebook clickstream, more than 400 gigabytes of data was collected. The load, index, and aggregation processes for this data set really taxed the Oracle data warehouse. Even after significant tuning, we were unable to aggregate a day of clickstream data in less than 24 hours.”

users Frontend Backend “OLTP” ETL (Extract, Transform, and Load) Hadoop data scientists

The Irony… users Frontend Backend “OLTP” ETL (Extract, Transform, and Load) Hadoop (map, reduce, …) data scientists Wait, so why not use a database to begin with?

Why not just use a database? SQL is awesome. Scalability. Cost.

Databases are great…
If your data has structure (and you know what the structure is)
If your data is reasonably clean
If you know what queries you’re going to run ahead of time
Databases are not so great…
If your data has little structure (or you don’t know the structure)
If your data is messy and noisy
If you don’t know what you’re looking for

“there are known knowns; there are things we know we know. We also know there are known unknowns; that is to say we know there are some things we do not know. But there are unknown unknowns – the ones we don't know we don't know…” – Donald Rumsfeld Source: Wikipedia

Databases are great…
If your data has structure (and you know what the structure is)
If your data is reasonably clean
If you know what queries you’re going to run ahead of time
Known unknowns!
Databases are not so great…
If your data has little structure (or you don’t know the structure)
If your data is messy and noisy
If you don’t know what you’re looking for
Unknown unknowns!

Advantages of Hadoop dataflow languages
Don’t need to know the schema ahead of time
Raw scans are the most common operations
Many analyses are better formulated imperatively
Much faster data ingest rate
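The first three points can be illustrated without Hadoop at all. The sketch below is plain Python, not a Hadoop dataflow language, and the log lines and field names are made up. It shows the general idea of imposing structure at read time while scanning raw records, rather than declaring a schema before loading.

```python
import json
from collections import Counter

# Hypothetical raw log lines; fields vary from record to record and one line is malformed.
raw_log = [
    '{"user": "u1", "action": "like", "ts": 1}',
    '{"user": "u2", "action": "comment", "text": "hi"}',
    'not json at all',
    '{"user": "u1", "action": "like", "client": "mobile"}',
]

def scan(lines):
    """Raw scan: parse each record at read time, skipping unparseable lines."""
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue
        if isinstance(record, dict):
            yield record

def count_actions(lines):
    """Imperative analysis: decide which fields matter while scanning."""
    counts = Counter()
    for record in scan(lines):
        action = record.get("action")
        if action is not None:
            counts[action] += 1
    return counts

action_counts = count_actions(raw_log)
```

Because nothing is validated or indexed on the way in, ingest is just an append of raw lines. The cost of messy data is paid later, inside each scan.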

What do you actually do? Report generation Dashboards Ad hoc analyses “Descriptive” “Predictive” Data products Which are known unknowns and unknown unknowns?

external APIs users users Frontend Frontend Frontend Backend Backend Backend OLTP database OLTP database OLTP database ETL (Extract, Transform, and Load) Data Warehouse BI tools analysts

external APIs users users Frontend Frontend Frontend Backend Backend Backend OLTP database OLTP database OLTP database ETL (Extract, Transform, and Load) “Data Lake” Data Warehouse Other tools SQL on Hadoop “Traditional” BI tools data scientists

Twitter’s data warehousing architecture (circa 2012)

circa ~2010: ~150 people total, ~60 Hadoop nodes, ~6 people use analytics stack daily
circa ~2012: ~1400 people total, 10s of Ks of Hadoop nodes, multiple DCs, 10s of PBs total Hadoop DW capacity, ~100 TB ingest daily, dozens of teams use Hadoop daily, 10s of Ks of Hadoop jobs daily

Twitter’s data warehousing architecture (circa 2012) How does ETL actually happen?

Importing Log Data (diagram): in each datacenter, Scribe Daemons (Production Hosts), Scribe Aggregators, and a Staging Hadoop Cluster (HDFS); in the main datacenter, the Main Hadoop DW (HDFS)

What’s Next? Two developing trends…

users Frontend Backend database BI tools analysts

external APIs users users Frontend Frontend Frontend Backend Backend Backend OLTP database OLTP database OLTP database ETL (Extract, Transform, and Load) Data Warehouse BI tools analysts

external APIs users users Frontend Frontend Frontend Backend Backend Backend OLTP database OLTP database OLTP database ETL (Extract, Transform, and Load) “Data Lake” Data Warehouse Other tools SQL on Hadoop “Traditional” BI tools data scientists My data is a day old… I refuse to accept that!

OLTP ETL OLAP What if you didn’t have to do this?

Hybrid Transactional/Analytical Processing (HTAP) Coming back full circle?

external APIs users users Frontend Frontend Frontend Backend Backend Backend OLTP database OLTP database OLTP database ETL (Extract, Transform, and Load) “Data Lake” Data Warehouse Other tools SQL on Hadoop “Traditional” BI tools data scientists

external APIs users users Frontend Frontend Frontend Backend Backend Backend HTAP database HTAP database HTAP database Analytics tools Analytics tools data scientists data scientists ETL (Extract, Transform, and Load) “Data Lake” Data Warehouse Other tools SQL on Hadoop “Traditional” BI tools data scientists

external APIs users users Frontend Frontend Frontend IaaS / Load balance aaS Backend Backend Backend DBaaS (e.g., RDS) OLTP database OLTP database OLTP database Everything in the cloud! ELT aaS ETL (Extract, Transform, and Load) DBaaS (e.g., Redshift) “Data Lake” Data Warehouse S3 Other tools SQL on Hadoop “Traditional” BI tools “Cloudified” tools data scientists

Source: Wikipedia (Japanese rock garden)