Introduction to AWS Redshift


1 Introduction to AWS Redshift
Maryna Popova, BI Engineer, GoEuro GmbH

2 Maryna Popova, BI Engineer at GoEuro, www.goeuro.com
LinkedIn:

3 SQLSat Kyiv Team
Denis Reznik, Eugene Polonichko, Oksana Tkach, Yevhen Nedashkivskyi, Mykola Pobyivovk, Oksana Borysenko

4 Sponsor Sessions
Start at 13:00. Don't miss them, they might be providing some interesting and valuable information!
Congress Hall: DevArt
Conference Hall: Simplement
Room AC: DB Best
Predslava1: Intapp
NULL means no session in that room at that time ☺

5 Sponsors

6 Session will begin very soon :)
Please complete the evaluation form from your pocket after the session. Your feedback will help us improve future conferences, and the speakers will appreciate it! Enjoy the conference!

7 Agenda
What is AWS Redshift
Columnar vs Row-based storage
MPP
Data compression
Distkey and Sortkey
Vacuum in Redshift
Scaling
Features and Bugs
Q&A

8 What is Amazon Redshift
"Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud" (c) documentation
My take: Amazon Redshift is a scalable DWH in the cloud. It is columnar data storage. It is MPP.
Managed. Columnar. Massively parallel architecture.

9 What is MPP In computing, massively parallel refers to the use of a large number of processors (or separate computers) to perform a set of coordinated computations in parallel (simultaneously).

10 Sequential processing
[diagram: input, processing time, output]

11 Parallel Processing with equal workload distribution
[diagram: input, processing time, output]

12 Parallel Processing with unequal workload distribution
[diagram: input, processing time, output]

13 Redshift as MPP
Massively Parallel Processing (MPP): Amazon Redshift automatically distributes data and query load across all nodes. Amazon Redshift makes it easy to add nodes to your data warehouse and enables you to maintain fast query performance as your data warehouse grows.

14

15 Columnar vs Row data storage

16 Row-oriented

17 Row-oriented
In row-wise database storage, data blocks store values sequentially for each consecutive column making up the entire row. If the block size is smaller than the size of a record, storage for an entire record may take more than one block. If the block size is larger than the size of a record, storage for an entire record may take less than one block, resulting in inefficient use of disk space.

18 Row-oriented
Data blocks store values sequentially.
If the block size is smaller than the size of a record, storage for an entire record may take more than one block.
If the block size is larger than the size of a record, storage for an entire record may take less than one block, resulting in inefficient use of disk space.

19 Row-oriented
Designed to return a record in as few operations as possible ⇒ optimal for OLTP databases.
Disadvantage: inefficient use of disk space.
In online transaction processing (OLTP) applications, most transactions involve frequently reading and writing all of the values for entire records, typically one record or a small number of records at a time. As a result, row-wise storage is optimal for OLTP databases.

20 Columnar
Using columnar storage, each data block stores values of a single column for multiple rows. As records enter the system, Amazon Redshift transparently converts the data to columnar storage for each of the columns.

21 Columnar
A data block stores values of a single column for multiple rows.
Reading the same number of column field values for the same number of records requires far fewer I/O operations compared to row-wise storage.
A block holds the same type of data ⇒ a compression scheme can be used.

22 Data compression in Redshift
Compression specifies the type of encoding that is applied to a column of data values as rows are added to a table.
It is applied during the table design stage.

23 Example
create table dwh.fact_bookings_and_cancellations (
    reporting_operations_id bigint,
    booking_id varchar(255) DISTKEY,
    lowest_unit_value_in_euros bigint,
    operation_currency varchar(255) encode bytedict,
    operation_date_time timestamp SORTKEY,
    ...
);

24 Default encodings
Columns that are defined as sort keys are assigned RAW compression.
Columns that are defined as BOOLEAN, REAL, or DOUBLE PRECISION data types are assigned RAW compression.
All other columns are assigned LZO compression.
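To see which encoding each column actually received, you can query the PG_TABLE_DEF system view; a minimal sketch against the example table from the previous slide (note that PG_TABLE_DEF only shows tables whose schema is on your search_path):

-- Show the type and encoding assigned to each column
select "column", type, encoding
from pg_table_def
where tablename = 'fact_bookings_and_cancellations';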

25 Encodings
Raw Encoding: data is stored in raw, uncompressed form.
Byte-Dictionary Encoding: a separate dictionary of unique values is created for each block of column values on disk; effective when a column contains a limited number (<256) of unique values.
Delta Encoding: compresses data by recording the difference between values that follow each other in the column.

26 LZO, Mostly, and Runlength Encodings
LZO Encoding: provides a very high compression ratio with good performance; works especially well for CHAR and VARCHAR columns that store very long character strings.
Mostly Encoding: useful when the data type for a column is larger than most of the stored values require.
Runlength Encoding: replaces a value that is repeated consecutively with a token that consists of the value and a count of the number of consecutive occurrences. DON'T apply it to the SORTKEY.
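As a rough illustration of how these encodings look in DDL, a hedged sketch (the table and column names are made up for the example):

-- Hypothetical table illustrating per-column encodings
create table demo_encodings (
    page_views bigint encode mostly8,       -- values usually small enough for 1 byte
    country_code char(2) encode runlength,  -- long runs of repeated values
    event_ts timestamp encode delta         -- consecutive values differ only slightly
);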

27 Text255, Text32k, and Zstandard Encodings
Text255 and Text32k Encodings: useful for compressing VARCHAR columns in which the same words recur often; a separate dictionary of unique words is created for each block of column values on disk.
Zstandard Encoding: provides a high compression ratio with very good performance across diverse data sets.
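If you are unsure which encoding to pick, Redshift can recommend one per column based on a sample of existing data; a minimal sketch using the ANALYZE COMPRESSION command against the example table (be aware that it takes a table lock while sampling):

-- Ask Redshift to suggest an encoding for each column
analyze compression dwh.fact_bookings_and_cancellations;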

28

29 Here comes the question
Given columnar storage and MPP, WHAT is the way to influence performance?

30 The answer is: Sortkey and Distkey

31 Sortkey and Distkey
Applied during the table design stage (the initial DDL).
Can be thought of as indexes.
Improve performance dramatically.

32

33 Sortkey and Distkey
Both are specified at the table design stage.

34 Sortkey
Amazon Redshift stores your data on disk in sorted order according to the sort key. The Amazon Redshift query optimizer uses sort order when it determines optimal query plans.

35 Best Sortkey
If recent data is queried most frequently, specify the timestamp column as the leading column for the sort key. Queries will be more efficient because they can skip entire blocks that fall outside the time range.

36 If you do frequent range filtering or equality filtering on one column, specify that column as the sort key. Amazon Redshift can skip reading entire blocks of data for that column because it keeps track of the minimum and maximum column values stored on each block and can skip blocks that don't apply to the predicate range.

37 If you frequently join a table, specify the join column as both the sort key and the distribution key. This enables the query optimizer to choose a sort merge join instead of a slower hash join. Because the data is already sorted on the join key, the query optimizer can bypass the sort phase of the sort merge join.
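A minimal sketch of that last recommendation, with hypothetical fact tables where the join column (booking_id) is both the SORTKEY and the DISTKEY on each side, so matching rows are co-located and presorted for a merge join:

create table demo_bookings (
    booking_id varchar(255) DISTKEY SORTKEY,
    booked_at timestamp
);

create table demo_cancellations (
    booking_id varchar(255) DISTKEY SORTKEY,
    cancelled_at timestamp
);

-- The optimizer can choose a merge join and bypass the sort phase
select b.booking_id, c.cancelled_at
from demo_bookings b
join demo_cancellations c on c.booking_id = b.booking_id;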

38 Main Rule for Sortkey
For Developers: identify the column which is (or will be) used to filter, and make it the SORTKEY.

39 Main Rule for Sortkey
For Data Users: find out which column is the SORTKEY and use it in your queries to filter the data.

40 The MOST important Rule for Sortkey
For Developers: Let your Data USERS know the SORTKEY for the tables

41 Sortkey benefits
Queries will be more efficient because they can skip entire blocks that fall outside the time range.
Because the data is already sorted on the join key, the query optimizer can bypass the sort phase of the sort merge join.
Amazon Redshift can skip reading entire blocks of data for the sort key column, because it keeps track of the minimum and maximum column values stored on each block and can skip blocks that don't apply to the predicate range.

42

43 Demo - no filter
XN Seq Scan on events (cost= rows= width=512)

44 Demo - with timestamp filter
XN Seq Scan on events (cost= rows= width=512)

45 Demo - with sortkey filter
XN Seq Scan on events (cost= rows= width=512)

46 Demo Summary
No filter: XN Seq Scan on events (cost= rows= width=512)
Any filter: XN Seq Scan on events (cost= rows= width=512)
Sortkey filter: XN Seq Scan on events (cost= rows= width=512)

47 Killing the sortkey
Avoid using functions on the sortkey.
If you need to use a function, also specify a wider range to help the optimizer.
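A hedged before/after sketch, assuming an events table whose SORTKEY is a timestamp column named event_ts:

-- BAD: the function call on the sortkey prevents block skipping
select count(*) from events
where date_trunc('day', event_ts) = '2017-05-01';

-- BETTER: a plain range predicate on the sortkey lets Redshift skip blocks
select count(*) from events
where event_ts >= '2017-05-01' and event_ts < '2017-05-02';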

48 Demo - Killing the sortkey
XN Seq Scan on events (cost= rows= width=512)

49 Sortkey types
Compound
Interleaved

50 Compound Sortkey
A compound sort key is more efficient when query predicates use a prefix, which is a subset of the sort key columns in order.
It is the default.
Compound sort keys might speed up joins, GROUP BY and ORDER BY operations, and window functions that use PARTITION BY and ORDER BY. For example, a merge join, which is often faster than a hash join, is feasible when the data is distributed and presorted on the joining columns.
Compound sort keys also help improve compression.

51 Example
...
    ,local_session_ts timestamp encode lzo
    ,vendor_id varchar(80) encode text255
    ,is_onsite boolean encode runlength
) SORTKEY (session_type, session_first_ts);

alter table dwh.fact_traffic_united owner to etl;

52 Interleaved Sortkey
Gives equal weight to each column in the sort key, so query predicates can use any subset of the columns that make up the sort key, in any order.
An INTERLEAVED sort key can use a maximum of eight columns.
INTERLEAVED SORTKEY (session_type, session_first_ts);
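A minimal DDL sketch of that clause in context (the table definition is abbreviated; the column names are borrowed from the previous example):

create table demo_traffic (
    session_type varchar(80),
    session_first_ts timestamp
)
INTERLEAVED SORTKEY (session_type, session_first_ts);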

53 Vacuum and Analyze
VACUUM: reclaims space and resorts rows in either a specified table or all tables in the current database.
ANALYZE: updates table statistics for use by the query planner.
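The corresponding commands, run against the earlier example table (vacuuming a single table is much cheaper than vacuuming the whole database):

-- Reclaim space and resort rows for one table
vacuum dwh.fact_bookings_and_cancellations;

-- Refresh planner statistics for the same table
analyze dwh.fact_bookings_and_cancellations;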

54 Columnar and Sortkey
When columns are sorted appropriately, the query processor is able to rapidly filter out a large subset of data blocks.

55 MPP and DISTKEY
Amazon Redshift distributes the rows of a table to the compute nodes so that the data can be processed in parallel. By selecting an appropriate distribution key for each table, you can optimize the distribution of data to balance the workload and minimize movement of data from node to node.

56 MPP and DISTKEY
The optimizer decides how the data needs to be located.
Some rows, or the entire table, may be moved.
Substantial data movement slows overall system performance.
Using a DISTKEY minimizes data redistribution.

57 Data distribution goals
To distribute the workload uniformly among the nodes in the cluster.
To minimize data movement during query execution.

58 Distribution Styles
KEY
EVEN
ALL

59 Examples
...
    provider_api varchar(500) encode lzo,
    table_loaded_at timestamp default getdate()
) DISTSTYLE EVEN;
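For comparison, hedged sketches of the other two styles (the table and column names are made up for the example):

-- KEY: rows with the same booking_id land on the same node
create table demo_fact (
    booking_id varchar(255) DISTKEY,
    amount bigint
);

-- ALL: a full copy of a small dimension table is kept on every node
create table demo_dim_currency (
    currency_code char(3),
    currency_name varchar(100)
)
DISTSTYLE ALL;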

60 Getting data into AWS Redshift
There is only one way: AWS S3 and the COPY command.
COPY table-name [ column-list ]
FROM data_source
authorization
[ [ FORMAT ] [ AS ] data_format ]
[ parameter [ argument ] [, ... ] ]
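A hedged example of a typical load from S3 (the bucket, prefix, and IAM role ARN are placeholders):

-- Load gzipped CSV files from S3 into the fact table
copy dwh.fact_bookings_and_cancellations
from 's3://my-bucket/bookings/'
iam_role 'arn:aws:iam::123456789012:role/MyRedshiftCopyRole'
format as csv
gzip;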

61 Bugs/features to keep in mind
ALTER TABLE statement: for SORTKEY/DISTKEY changes ⇒ recreating the table is the only option ⇒ keep this in mind for CI/CD.
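The usual workaround is a deep copy into a replacement table; a sketch, assuming we want new keys on the earlier example table (the column list is abbreviated):

-- 1. Create a replacement table with the new SORTKEY/DISTKEY
create table dwh.fact_bookings_v2 (
    booking_id varchar(255) DISTKEY,
    operation_date_time timestamp SORTKEY
    -- ... remaining columns ...
);

-- 2. Copy the data across
insert into dwh.fact_bookings_v2
select booking_id, operation_date_time
from dwh.fact_bookings_and_cancellations;

-- 3. Swap the tables
drop table dwh.fact_bookings_and_cancellations;
alter table dwh.fact_bookings_v2 rename to fact_bookings_and_cancellations;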

62 Bugs/features to keep in mind
Redshift types
Redshift scalability
Automatic backups
Quick table restoring

63 Questions?

