Introduction to AWS Redshift


Introduction to AWS Redshift. Maryna Popova, BI Engineer, GoEuro GmbH

Maryna Popova, BI Engineer at GoEuro (www.goeuro.com). LinkedIn: www.linkedin.com/in/marinapopova05. Email: marinapopova05@gmail.com

SQLSat Kyiv Team: Denis Reznik, Eugene Polonichko, Oksana Tkach, Yevhen Nedashkivskyi, Mykola Pobyivovk, Oksana Borysenko

Sponsor Sessions start at 13:00. Don't miss them, they might provide some interesting and valuable information! Rooms: Congress Hall (DevArt), Conference Hall (Simplement), Room AC (DB Best), Predslava1 (Intapp). NULL means no session in that room at that time ☺

Sponsors

The session will begin very soon :) Please complete the evaluation form from your pocket after the session. Your feedback will help us improve future conferences, and the speakers will appreciate it! Enjoy the conference!

Agenda: What is AWS Redshift; Columnar vs row-based storage; MPP; Data compression; Distkey and Sortkey; Vacuum in Redshift; Scaling; Features and bugs; Q&A

What is Amazon Redshift: "Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud" (c) documentation. In my words: Amazon Redshift is a scalable DWH in the cloud. It is columnar data storage. It is MPP. In short: a managed, columnar, massively parallel architecture.

What is MPP In computing, massively parallel refers to the use of a large number of processors (or separate computers) to perform a set of coordinated computations in parallel (simultaneously).

Sequential processing Input Output Processing time

Parallel Processing with equal workload distribution Input Output Processing time

Parallel Processing with unequal workload distribution Output Input Processing time

Redshift as MPP. Massively Parallel Processing (MPP): Amazon Redshift automatically distributes data and query load across all nodes. Amazon Redshift makes it easy to add nodes to your data warehouse and enables you to maintain fast query performance as your data warehouse grows.

Columnar vs Row data storage

Row - oriented

Row - oriented In row-wise database storage, data blocks store values sequentially for each consecutive column making up the entire row. If block size is smaller than the size of a record, storage for an entire record may take more than one block. If block size is larger than the size of a record, storage for an entire record may take less than one block, resulting in an inefficient use of disk space.


Row-oriented: designed to return a record in as few operations as possible. Disadvantage: inefficient use of disk space. In online transaction processing (OLTP) applications, most transactions involve frequently reading and writing all of the values for entire records, typically one record or a small number of records at a time. As a result, row-wise storage is optimal for OLTP databases.

Columnar Using columnar storage, each data block stores values of a single column for multiple rows. As records enter the system, Amazon Redshift transparently converts the data to columnar storage for each of the columns.

Columnar: each data block stores values of a single column for multiple rows. Reading the same number of column field values for the same number of records requires far fewer I/O operations than row-wise storage. Each block holds the same type of data ⇒ a compression scheme can be used.

Data compression in Redshift: column encoding specifies the type of compression that is applied to a column of data values as rows are added to a table. It is applied at the table design stage.

Example:
create table dwh.fact_bookings_and_cancellations (
    reporting_operations_id bigint,
    booking_id varchar(255) DISTKEY,
    lowest_unit_value_in_euros bigint,
    operation_currency varchar(255) encode bytedict,
    operation_date_time timestamp SORTKEY,
    ...

Default encodings Columns that are defined as sort keys are assigned RAW compression. Columns that are defined as BOOLEAN, REAL, or DOUBLE PRECISION data types are assigned RAW compression. All other columns are assigned LZO compression.
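To verify which encoding each column actually received, you can query Redshift's pg_table_def catalog view. A minimal sketch, reusing the table name from the earlier example (the schema must be on the search_path):

```sql
-- Shows the encoding plus distkey/sortkey flags per column.
SET search_path TO dwh;
SELECT "column", type, encoding, distkey, sortkey
FROM pg_table_def
WHERE tablename = 'fact_bookings_and_cancellations';
```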

Encodings. Raw Encoding: data is stored in raw, uncompressed form. Byte-Dictionary Encoding: a separate dictionary of unique values is created for each block of column values on disk; effective when a column contains a limited number (<256) of unique values. Delta Encoding: compresses data by recording the difference between values that follow each other in the column.

LZO Encoding: provides a very high compression ratio with good performance; works especially well for CHAR and VARCHAR columns that store very long character strings. Mostly Encoding: useful when the data type for a column is larger than most of the stored values require. Runlength Encoding: replaces a value that is repeated consecutively with a token that consists of the value and a count of the number of consecutive occurrences. Don't apply runlength encoding to the SORTKEY.

Text255 and Text32k Encodings: useful for compressing VARCHAR columns in which the same words recur often; a separate dictionary of unique words is created for each block of column values on disk. Zstandard Encoding: provides a high compression ratio with very good performance across diverse data sets.
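Instead of choosing encodings by hand, Redshift can also recommend them from a sample of existing data. A minimal sketch, assuming the table from the earlier example already holds rows:

```sql
-- Samples the table and reports a suggested encoding and
-- an estimated space reduction for each column.
ANALYZE COMPRESSION dwh.fact_bookings_and_cancellations;
```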

Here comes the question: given columnar storage and MPP, WHAT is the way to influence performance?

The answer is: Sortkey and Distkey

Sortkey and Distkey: applied at the table design stage (initial DDL); can be thought of as indexes; improve performance dramatically.

Sortkey and Distkey: both are specified at the table design stage.

Sortkey Amazon Redshift stores your data on disk in sorted order according to the sort key. The Amazon Redshift query optimizer uses sort order when it determines optimal query plans.

Best Sortkey: if recent data is queried most frequently, specify the timestamp column as the leading column for the sort key. Queries will be more efficient because they can skip entire blocks that fall outside the time range.

If you do frequent range filtering or equality filtering on one column, specify that column as the sort key. Amazon Redshift can skip reading entire blocks of data for that column because it keeps track of the minimum and maximum column values stored on each block and can skip blocks that don't apply to the predicate range.
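As a sketch of such a range filter, assuming operation_date_time is the sort key as in the earlier DDL:

```sql
-- Blocks whose min/max timestamps fall outside January 2018
-- are skipped entirely, so only a fraction of the table is read.
SELECT count(*)
FROM dwh.fact_bookings_and_cancellations
WHERE operation_date_time >= '2018-01-01'
  AND operation_date_time <  '2018-02-01';
```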

If you frequently join a table, specify the join column as both the sort key and the distribution key. This enables the query optimizer to choose a sort merge join instead of a slower hash join. Because the data is already sorted on the join key, the query optimizer can bypass the sort phase of the sort merge join.
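A minimal sketch of that pattern, with hypothetical table and column names (customer_id serves as both the distribution key and the sort key):

```sql
CREATE TABLE dwh.fact_orders (
    customer_id bigint,
    order_ts    timestamp,
    amount      bigint
)
DISTKEY (customer_id)
SORTKEY (customer_id);
-- Joins on customer_id are collocated on one node and the rows
-- arrive presorted, enabling a merge join without a sort phase.
```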

Main Rule for Sortkey, for developers: identify the column that is (or will be) used for filtering and make it the SORTKEY.

Main Rule for Sortkey, for data users: find out which column is the SORTKEY and use it in your queries to filter the data.

The MOST important Rule for Sortkey For Developers: Let your Data USERS know the SORTKEY for the tables

Sortkey benefits: Queries will be more efficient because they can skip entire blocks that fall outside the time range. Because the data is already sorted on the join key, the query optimizer can bypass the sort phase of the sort merge join. Amazon Redshift can skip reading entire blocks of data for that column because it keeps track of the minimum and maximum column values stored on each block and can skip blocks that don't apply to the predicate range.

Demo - no filter: XN Seq Scan on events (cost=0.00..33547553.28 width=512)

Demo - with timestamp filter: (cost=0.00..41934441.60 width=512)

Demo - with sortkey filter: (cost=0.00..4193.44 width=512)

Demo Summary No Filter: XN Seq Scan on events (cost=0.00..33547553.28 rows=3354755328 width=512) Any Filter : XN Seq Scan on events (cost=0.00..41934441.60 rows=335476 width=512) Sortkey Filter: XN Seq Scan on events (cost=0.00..4193.44 rows=335476 width=512)

Killing the sortkey: avoid using functions on the sortkey; if you need to use a function, also specify a wider raw range to help the optimizer.
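For example (predicates assume the sortkey from the earlier DDL): wrapping the sort key in a function hides it from block skipping, while an added raw range predicate restores it:

```sql
-- Bad: the function on the sort key defeats block skipping.
SELECT count(*)
FROM dwh.fact_bookings_and_cancellations
WHERE date_trunc('month', operation_date_time) = '2018-01-01';

-- Better: keep the function, but add a raw range on the sort key
-- so the optimizer can still skip blocks.
SELECT count(*)
FROM dwh.fact_bookings_and_cancellations
WHERE date_trunc('month', operation_date_time) = '2018-01-01'
  AND operation_date_time >= '2018-01-01'
  AND operation_date_time <  '2018-02-01';
```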

Demo - killing the sortkey: XN Seq Scan on events (cost=0.00..50321329.92 rows=1118251776 width=512)

Sortkey types Compound Interleaved

Compound Sortkey: more efficient when query predicates use a prefix, which is a subset of the sort key columns in order. It is the default. Compound sort keys might speed up joins, GROUP BY and ORDER BY operations, and window functions that use PARTITION BY and ORDER BY. For example, a merge join, which is often faster than a hash join, is feasible when the data is distributed and presorted on the joining columns. Compound sort keys also help improve compression.

Example:
    ...
    ,local_session_ts timestamp encode lzo
    ,vendor_id varchar(80) encode text255
    ,is_onsite boolean encode runlength
)
SORTKEY (session_type, session_first_ts);
alter table dwh.fact_traffic_united owner to etl;

Interleaved sort key: gives equal weight to each column in the sort key, so query predicates can use any subset of the columns that make up the sort key, in any order. An INTERLEAVED sort key can use a maximum of eight columns.
    ...
    INTERLEAVED SORTKEY (session_type, session_first_ts);

Vacuum and analyze. VACUUM: reclaims space and resorts rows in either a specified table or all tables in the current database. ANALYZE: updates table statistics for use by the query planner.
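In practice these are usually run per table after large loads, e.g. (table name taken from the earlier example):

```sql
-- Re-sort rows and reclaim space from deleted rows, then
-- refresh the statistics the planner relies on.
VACUUM FULL dwh.fact_bookings_and_cancellations;
ANALYZE dwh.fact_bookings_and_cancellations;
```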

Columnar and Sortkey When columns are sorted appropriately, the query processor is able to rapidly filter out a large subset of data blocks. https://docs.aws.amazon.com/redshift/latest/dg/c_challenges_achieving_high_performance_queries.html

MPP and DISTKEY Amazon Redshift distributes the rows of a table to the compute nodes so that the data can be processed in parallel. By selecting an appropriate distribution key for each table, you can optimize the distribution of data to balance the workload and minimize movement of data from node to node.

MPP and DISTKEY Optimizer decides how the data needs to be located Some rows or entire table is moved Substantial data movements slow overall system performance Using DISTKEY minimizes data redistribution

Data distribution goals To distribute the workload uniformly among the nodes in the cluster. To minimize data movement during query execution.

Distribution Styles KEY EVEN ALL
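For instance, a small dimension table is often given DISTSTYLE ALL, so a full copy lives on every node and joins against it never move data (table and columns hypothetical):

```sql
-- ALL replicates the whole table to each node; appropriate only
-- for small, slowly changing tables.
CREATE TABLE dwh.dim_currency (
    currency_code varchar(3),
    currency_name varchar(80)
)
DISTSTYLE ALL;
```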

Examples:
    ...
    provider_api varchar(500) encode lzo,
    table_loaded_at timestamp default getdate()
)
DISTSTYLE EVEN;

Getting data into AWS Redshift: there is only one way, via AWS S3 and the COPY command.
COPY table-name [ column-list ]
FROM data_source
authorization
[ [ FORMAT ] [ AS ] data_format ]
[ parameter [ argument ] [, ... ] ]
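A concrete COPY invocation might look like this (bucket name and IAM role ARN are hypothetical):

```sql
-- Loads gzip-compressed CSV files from S3 in parallel across slices.
COPY dwh.fact_bookings_and_cancellations
FROM 's3://my-bucket/bookings/'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
FORMAT AS CSV
GZIP;
```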

Bugs/features to keep in mind: the ALTER TABLE statement cannot change SORTKEY/DISTKEY ⇒ such changes require recreating the table ⇒ keep this in mind for CI/CD.
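The usual workaround is a deep copy: create a table with the new keys, reload it, and swap the names. A sketch using only the column subset shown earlier (new table name hypothetical):

```sql
BEGIN;
-- New table with the desired keys (remaining columns elided).
CREATE TABLE dwh.fact_bookings_v2 (
    booking_id varchar(255) DISTKEY,
    operation_date_time timestamp SORTKEY
);
INSERT INTO dwh.fact_bookings_v2
SELECT booking_id, operation_date_time
FROM dwh.fact_bookings_and_cancellations;
DROP TABLE dwh.fact_bookings_and_cancellations;
ALTER TABLE dwh.fact_bookings_v2
    RENAME TO fact_bookings_and_cancellations;
COMMIT;
```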

Bugs/features to keep in mind Redshift types Redshift scalability Automatic backups Quick table restoring

Questions? marinapopova05@gmail.com www.linkedin.com/in/marinapopova05