Introduction to AWS Redshift
Maryna Popova, BI Engineer, GoEuro GmbH
Maryna Popova, BI Engineer at GoEuro (www.goeuro.com)
LinkedIn: www.linkedin.com/in/marinapopova05
Email: marinapopova05@gmail.com
SQLSat Kyiv Team
Denis Reznik, Eugene Polonichko, Oksana Tkach, Yevhen Nedashkivskyi, Mykola Pobyivovk, Oksana Borysenko
Sponsor Sessions
Start at 13:00. Don't miss them; they may provide interesting and valuable information!
Congress Hall: DevArt
Conference Hall: Simplement
Room AC: DB Best
Predslava1: Intapp
NULL means no session in that room at that time ☺
Sponsors
The session will begin very soon :)
Please complete the evaluation form from your pocket after the session. Your feedback will help us improve future conferences, and the speakers will appreciate it!
Enjoy the conference!
Agenda
What is AWS Redshift
Columnar vs row-based storage
MPP
Data compression
Distkey and Sortkey
Vacuum in Redshift
Scaling
Features and Bugs
Q&A
What is Amazon Redshift
"Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud" (c) documentation
My take: Amazon Redshift is a scalable DWH in the cloud. It is columnar data storage. It is MPP.
In other words: managed, columnar, massively parallel architecture.
What is MPP In computing, massively parallel refers to the use of a large number of processors (or separate computers) to perform a set of coordinated computations in parallel (simultaneously).
Sequential processing (diagram: input, processing time, output)
Parallel processing with equal workload distribution (diagram: input, processing time, output)
Parallel processing with unequal workload distribution (diagram: input, processing time, output)
Redshift as MPP
Massively Parallel Processing (MPP): Amazon Redshift automatically distributes data and query load across all nodes. Amazon Redshift makes it easy to add nodes to your data warehouse and enables you to maintain fast query performance as your data warehouse grows.
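To make the node/slice layout concrete, here is a minimal sketch (assuming access to Redshift's system tables) that lists the slices hosted on each node; every slice holds a share of each distributed table and works on queries in parallel:

select node, slice
from stv_slices
order by node, slice;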
Columnar vs Row data storage
Row-oriented
Row-oriented
In row-wise database storage, data blocks store values sequentially for each consecutive column making up the entire row. If the block size is smaller than the size of a record, storage for an entire record may take more than one block. If the block size is larger than the size of a record, storage for an entire record may take less than one block, resulting in an inefficient use of disk space.
Row-oriented
Designed to return a record in as few operations as possible: optimal for OLTP databases.
Disadvantage: inefficient use of disk space.
In online transaction processing (OLTP) applications, most transactions involve frequently reading and writing all of the values for entire records, typically one record or a small number of records at a time. As a result, row-wise storage is optimal for OLTP databases.
Columnar Using columnar storage, each data block stores values of a single column for multiple rows. As records enter the system, Amazon Redshift transparently converts the data to columnar storage for each of the columns.
Columnar
A data block stores values of a single column for multiple rows.
Reading the same number of column field values for the same number of records requires far fewer I/O operations compared to row-wise storage.
A block holds the same type of data ⇒ it can use a compression scheme suited to that type.
Data compression in Redshift
A compression encoding specifies the type of compression that is applied to a column of data values as rows are added to a table.
Applied at the table design stage.
Example
create table dwh.fact_bookings_and_cancellations(
    reporting_operations_id bigint,
    booking_id varchar(255) DISTKEY,
    lowest_unit_value_in_euros bigint,
    operation_currency varchar(255) encode bytedict,
    operation_date_time timestamp SORTKEY,
    ...
Default encodings Columns that are defined as sort keys are assigned RAW compression. Columns that are defined as BOOLEAN, REAL, or DOUBLE PRECISION data types are assigned RAW compression. All other columns are assigned LZO compression.
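If you are unsure which encodings to pick, Redshift can suggest them. A minimal sketch, reusing the fact table from the example above; ANALYZE COMPRESSION samples the table and reports a recommended encoding per column (note: it takes an exclusive table lock while running):

analyze compression dwh.fact_bookings_and_cancellations;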
Encodings
Raw Encoding: data is stored in raw, uncompressed form.
Byte-Dictionary Encoding: a separate dictionary of unique values is created for each block of column values on disk; effective when a column contains a limited number (<256) of unique values.
Delta Encoding: compresses data by recording the difference between values that follow each other in the column.
LZO Encoding: provides a very high compression ratio with good performance; works especially well for CHAR and VARCHAR columns that store very long character strings.
Mostly Encoding: useful when the data type for a column is larger than most of the stored values require.
Runlength Encoding: replaces a value that is repeated consecutively with a token that consists of the value and a count of the number of consecutive occurrences. DON'T apply it on a SORTKEY.
Text255 and Text32k Encodings: useful for compressing VARCHAR columns in which the same words recur often; a separate dictionary of unique words is created for each block of column values on disk.
Zstandard Encoding: provides a high compression ratio with very good performance across diverse data sets.
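To check which encodings were actually applied, you can query the PG_TABLE_DEF catalog view. A minimal sketch against the fact table from the earlier example (the schema must be on your search_path):

set search_path to dwh;
select "column", type, encoding
from pg_table_def
where tablename = 'fact_bookings_and_cancellations';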
Here comes the question
Given columnar storage and MPP, WHAT is the way to influence performance?
The answer is: Sortkey and Distkey
Sortkey and Distkey
Both specified at the table design stage: the initial DDL.
Can be thought of as indexes.
Improve performance dramatically.
Sortkey Amazon Redshift stores your data on disk in sorted order according to the sort key. The Amazon Redshift query optimizer uses sort order when it determines optimal query plans.
Best Sortkey
If recent data is queried most frequently, specify the timestamp column as the leading column for the sort key. Queries will be more efficient because they can skip entire blocks that fall outside the time range.
If you do frequent range filtering or equality filtering on one column, specify that column as the sort key. Amazon Redshift can skip reading entire blocks of data for that column because it keeps track of the minimum and maximum column values stored on each block and can skip blocks that don't apply to the predicate range.
If you frequently join a table, specify the join column as both the sort key and the distribution key. This enables the query optimizer to choose a sort merge join instead of a slower hash join. Because the data is already sorted on the join key, the query optimizer can bypass the sort phase of the sort merge join.
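As a sketch of that last guideline: a hypothetical dimension table that joins to the earlier fact table on booking_id, with the join column as both DISTKEY and SORTKEY so the optimizer can pick a merge join (dim_booking and its columns are made up for illustration; for the merge join to apply, the fact table would also need booking_id as its sort key):

create table dwh.dim_booking (
    booking_id varchar(255) distkey sortkey,
    vendor_id  varchar(80)
);

select f.lowest_unit_value_in_euros, d.vendor_id
from dwh.fact_bookings_and_cancellations f
join dwh.dim_booking d on d.booking_id = f.booking_id;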
Main Rule for Sortkey
For Developers: identify the column that is (or will be) used to filter, and make it a SORTKEY.
Main Rule for Sortkey
For Data Users: find out which column is the SORTKEY and use it in your queries to filter the data.
The MOST important Rule for Sortkey For Developers: Let your Data USERS know the SORTKEY for the tables
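One way data users can discover the keys themselves is the SVV_TABLE_INFO system view. A minimal sketch, using the events table that appears in the demos below:

select "table", diststyle, sortkey1
from svv_table_info
where "table" = 'events';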
Sortkey benefits
Queries will be more efficient because they can skip entire blocks that fall outside the time range.
Amazon Redshift can skip reading entire blocks of data for that column because it keeps track of the minimum and maximum column values stored on each block and can skip blocks that don't apply to the predicate range.
Because the data is already sorted on the join key, the query optimizer can bypass the sort phase of the sort merge join.
Demo - no vertical filter
XN Seq Scan on events (cost=0.00..33547553.28 width=512)
Demo - with timestamp filter
XN Seq Scan on events (cost=0.00..41934441.60 width=512)
Demo - with sortkey filter
XN Seq Scan on events (cost=0.00..4193.44 width=512)
Demo Summary
No Filter: XN Seq Scan on events (cost=0.00..33547553.28 rows=3354755328 width=512)
Any Filter: XN Seq Scan on events (cost=0.00..41934441.60 rows=335476 width=512)
Sortkey Filter: XN Seq Scan on events (cost=0.00..4193.44 rows=335476 width=512)
Killing the sortkey
Avoid using functions on the sortkey.
If you need to use a function, specify an additional wider range predicate to help the optimizer.
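A minimal sketch of both cases, using the events table from the demos (event_ts is an assumed name for its timestamp sortkey column):

-- kills the sortkey: the function hides the column from block pruning
select count(*) from events
where trunc(event_ts) = '2018-05-19';

-- sortkey-friendly: a plain range predicate lets Redshift skip blocks
select count(*) from events
where event_ts >= '2018-05-19' and event_ts < '2018-05-20';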
Demo - Killing the sortkey
XN Seq Scan on events (cost=0.00..50321329.92 rows=1118251776 width=512)
Sortkey types Compound Interleaved
Compound Sortkey
Is the default.
More efficient when query predicates use a prefix, which is a subset of the sort key columns in order.
Might speed up joins, GROUP BY and ORDER BY operations, and window functions that use PARTITION BY and ORDER BY. For example, a merge join, which is often faster than a hash join, is feasible when the data is distributed and presorted on the joining columns.
Also helps improve compression.
Example ... ,local_session_ts timestamp encode lzo ,vendor_id varchar(80) encode text255 ,is_onsite boolean encode runlength ) SORTKEY (session_type, session_first_ts); alter table dwh.fact_traffic_united owner to etl;
Interleaved sort key
Gives equal weight to each column in the sort key, so query predicates can use any subset of the columns that make up the sort key, in any order.
An INTERLEAVED sort key can use a maximum of eight columns.
...
INTERLEAVED SORTKEY (session_type, session_first_ts);
Vacuum and analyze
VACUUM: reclaims space and resorts rows in either a specified table or all tables in the current database.
ANALYZE: updates table statistics for use by the query planner.
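A minimal sketch of both commands against the fact table from the earlier examples; VACUUM FULL both reclaims space and resorts rows:

vacuum full dwh.fact_bookings_and_cancellations;
analyze dwh.fact_bookings_and_cancellations;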
Columnar and Sortkey When columns are sorted appropriately, the query processor is able to rapidly filter out a large subset of data blocks. https://docs.aws.amazon.com/redshift/latest/dg/c_challenges_achieving_high_performance_queries.html
MPP and DISTKEY Amazon Redshift distributes the rows of a table to the compute nodes so that the data can be processed in parallel. By selecting an appropriate distribution key for each table, you can optimize the distribution of data to balance the workload and minimize movement of data from node to node.
MPP and DISTKEY
The optimizer decides how the data needs to be located.
Some rows, or the entire table, get moved between nodes.
Substantial data movement slows overall system performance.
Using a DISTKEY minimizes data redistribution.
Data distribution goals To distribute the workload uniformly among the nodes in the cluster. To minimize data movement during query execution.
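To check whether a chosen distribution actually balanced the data, SVV_TABLE_INFO exposes a skew ratio (rows on the fullest slice divided by rows on the emptiest; values near 1 mean even distribution). A minimal sketch:

select "table", diststyle, skew_rows
from svv_table_info
order by skew_rows desc;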
Distribution Styles KEY EVEN ALL
Examples
...
    provider_api varchar(500) encode lzo,
    table_loaded_at timestamp default getdate()
)
DISTSTYLE EVEN;
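For comparison, minimal sketches of the other two styles (both tables and their columns are made up for illustration):

-- KEY: rows with the same booking_id land on the same slice
create table dwh.fact_payments (
    booking_id varchar(255) distkey,
    amount_in_euros bigint
);

-- ALL: a full copy of a small dimension table on every node
create table dwh.dim_currency (
    currency_code varchar(3),
    currency_name varchar(80)
) diststyle all;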
Getting data into AWS Redshift
There is only one way: AWS S3 and the COPY command.
COPY table-name [ column-list ]
FROM data_source
authorization
[ [ FORMAT ] [ AS ] data_format ]
[ parameter [ argument ] [, ... ] ]
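A minimal concrete sketch of a COPY into the earlier fact table (the bucket path and IAM role ARN are placeholders):

copy dwh.fact_bookings_and_cancellations
from 's3://my-bucket/bookings/2018-05-19/'
iam_role 'arn:aws:iam::123456789012:role/redshift-copy'
format as csv
gzip;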
Bugs/features to keep in mind
ALTER TABLE statement: for SORTKEY/DISTKEY changes ⇒ recreating the table is the only option ⇒ keep this in mind for CI/CD.
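The usual workaround is a deep copy: recreate the table with the new keys, reload it, and swap names. A minimal sketch with made-up columns:

begin;

create table dwh.events_new (
    event_id bigint,
    event_ts timestamp
)
distkey (event_id)
sortkey (event_ts);

insert into dwh.events_new select event_id, event_ts from dwh.events;

drop table dwh.events;
alter table dwh.events_new rename to events;

commit;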
Bugs/features to keep in mind
Redshift types
Redshift scalability
Automatic backups
Quick table restoring
Questions? marinapopova05@gmail.com www.linkedin.com/in/marinapopova05