Scaling SQL with different approaches


1 Scaling SQL with different approaches
Credits: Luca Canali

2 2nd attempt with Oracle: it scales up!
The initial query is CPU bound – a big table join: 39M x 10M rows ===> 374M rows
By default the join is performed on a single core
It can be sped up significantly by using more cores – parallel query (a short sketch follows)
IMPORTANT: by default, parallel queries are disabled on Oracle production clusters
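A minimal sketch of the two ways a parallel degree can be requested on such a cluster; the degree (16) is illustrative:

-- per statement, via an optimizer hint
select /*+ parallel(16) */ count(*) from csdb.csdb;

-- or for the whole session
alter session force parallel query parallel 16;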

3 Parallel query on 60-core machine
Creating a view with the timestamps of interest, then joining it with the fact table:

with fundamentals as (
  select /*+ materialize parallel(120) */ distinct service_id, valid_from dates
  from csdb.csdb
)
select /*+ parallel(120) */ a.service_id, max(a.devs_per_serv_and_fdate)
from (
  select f.service_id, count(1) devs_per_serv_and_fdate
  from csdb.csdb c, fundamentals f
  where f.service_id = c.service_id
    and f.dates between c.valid_from and c.valid_to
  group by f.service_id, f.dates
) a
group by a.service_id
order by a.service_id;

Elapsed: 00:15:21.13
Efficiency: 374M rows on 120 cores = 3.3k rows/s/core

4 Execution plan

| Id | Operation                               | Name                           |   TQ   |IN-OUT| PQ Distrib |
|  0 | SELECT STATEMENT                        |                                |        |      |            |
|  1 |  TEMP TABLE TRANSFORMATION              |                                |        |      |            |
|  2 |   PX COORDINATOR                        |                                |        |      |            |
|  3 |    PX SEND QC (RANDOM)                  | :TQ                            |  Q1,01 | P->S | QC (RAND)  |
|  4 |     LOAD AS SELECT (TEMP SEGMENT MERGE) | SYS_TEMP_0FD9D6F7C_D780440F    |  Q1,01 | PCWP |            |
|  5 |      HASH UNIQUE                        |                                |  Q1,01 | PCWP |            |
|  6 |       PX RECEIVE                        |                                |  Q1,01 | PCWP |            |
|  7 |        PX SEND HASH                     | :TQ                            |  Q1,00 | P->P | HASH       |
|  8 |         HASH UNIQUE                     |                                |  Q1,00 | PCWP |            |
|  9 |          PX BLOCK ITERATOR              |                                |  Q1,00 | PCWC |            |
| 10 |           TABLE ACCESS FULL             | F2P_IP_INFO_EXT                |  Q1,00 | PCWP |            |
| 11 |  PX COORDINATOR                         |                                |        |      |            |
| 12 |   PX SEND QC (ORDER)                    | :TQ                            |  Q2,02 | P->S | QC (ORDER) |
| 13 |    SORT GROUP BY                        |                                |  Q2,02 | PCWP |            |
| 14 |     PX RECEIVE                          |                                |  Q2,02 | PCWP |            |
| 15 |      PX SEND RANGE                      | :TQ                            |  Q2,01 | P->P | RANGE      |
| 16 |       HASH GROUP BY                     |                                |  Q2,01 | PCWP |            |
| 17 |        VIEW                             |                                |  Q2,01 | PCWP |            |
| 18 |         HASH GROUP BY                   |                                |  Q2,01 | PCWP |            |
| 19 |          PX RECEIVE                     |                                |  Q2,01 | PCWP |            |
| 20 |           PX SEND HASH                  | :TQ                            |  Q2,00 | P->P | HASH       |
| 21 |            HASH GROUP BY                |                                |  Q2,00 | PCWP |            |
| 22 |             NESTED LOOPS                |                                |  Q2,00 | PCWP |            |
| 23 |              NESTED LOOPS               |                                |  Q2,00 | PCWP |            |
| 24 |               VIEW                      |                                |  Q2,00 | PCWP |            |
| 25 |                PX BLOCK ITERATOR        |                                |  Q2,00 | PCWC |            |
| 26 |                 TABLE ACCESS FULL       | SYS_TEMP_0FD9D6F7C_D780440F    |  Q2,00 | PCWP |            |
|* 27|              INDEX RANGE SCAN           | F2P_IP_INFO_EXT_SERVICE_ID_IDX |  Q2,00 | PCWP |            |
|* 28|             TABLE ACCESS BY INDEX ROWID | F2P_IP_INFO_EXT                |  Q2,00 | PCWP |            |

Steps 1-10: creating the temp table with the timestamps
Steps 11-28: for each sub-query, take part of the temp table and join it with the fact table (using an index)

5 Need more resources? Use Hadoop!
And not only because of its horizontal scalability…

6 Typical options available for data ingestion to Hadoop
Apache Sqoop – uses JDBC, many databases supported; can write Avro or Parquet formats directly (a sketch follows)
Kite – from JSON or CSV
Custom scripts
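A hedged sketch of a Sqoop import from Oracle straight into Parquet on HDFS; the connection string, credentials, table name and target directory are all placeholders:

sqoop import \
  --connect jdbc:oracle:thin:@//dbhost:1521/service \
  --username scott -P \
  --table CSDB \
  --as-parquetfile \
  --target-dir /user/hive/warehouse/csdb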

7 Options for querying the data on Hadoop?
Querying the data with SQL (declarative):
Apache Hive
Apache Impala
Apache SparkSQL

8 Approach with Impala
Create a Hive table:

create external table csdb like parquet 'hdfspath'
stored as parquet location 'hdfsdir';

Query it (the SQL statement is unchanged!):

with fundamentals as (
  select distinct valid_from dates, service_id from csdb
)
select a.service_id, max(a.c)
from (
  select c.service_id, f.dates, count(1) c
  from csdb c, fundamentals f
  where f.service_id = c.service_id
    and f.dates between c.valid_from and c.valid_to
  group by c.service_id, f.dates
) a
group by a.service_id
order by 1;
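One practical detail worth noting: if the table is created or loaded through Hive, Impala has to refresh its metadata cache before the table is visible to it:

invalidate metadata csdb;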

9 Impala’s runtime execution plan
Operator              #Hosts  Avg Time  Max Time   #Rows    Detail
:MERGING-EXCHANGE                                           UNPARTITIONED
06:SORT
13:AGGREGATE                                                FINALIZE
12:EXCHANGE                                                 HASH(a.service_id)
05:AGGREGATE                                                STREAMING
11:AGGREGATE                                                FINALIZE
10:EXCHANGE                                                 HASH(c.service_id,f.dates)
04:AGGREGATE                                                STREAMING
03:HASH JOIN             4   9m30s     9m32s     35.22M     INNER JOIN, BROADCAST (est. peak mem 2.00 GB)
|--09:EXCHANGE                                   10.21M     BROADCAST
|  08:AGGREGATE                                  10.21M     FINALIZE
|  07:EXCHANGE                                   10.21M     HASH(valid_from,service_id)
|  02:AGGREGATE          4   1s203ms   1s792ms   10.21M     STREAMING
|  01:SCAN HDFS                                  39.22M     default.csdb
00:SCAN HDFS                                      5.75M     default.csdb c

Took: 103m41s
Only 4 machines were used in the processing – the input file is small, so it has only 4 blocks

10 2nd approach – making Impala run on all nodes
The number of workers involved is data driven
The table has to be repartitioned:
number of blocks >= the number of machines
data should be evenly distributed between blocks/partitions (hash partitioning) – a sketch follows
Scaled out on 12 machines it took 39m
Only one core per machine was used by Impala
Efficiency: 374M rows produced on 12 cores = 13k rows/s/core
Compute table stats to scale it up (use more resources per host):
compute stats csdb;
Estimated Per-Host Requirements: Memory=2.03GB VCores=3
With table statistics it took 17m
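A hedged sketch of one way to repartition, assuming the table is rewritten through Hive: distribute by hash-partitions the rows across reducers, and the reducer count sets the number of output files/blocks. The table names and the count of 12 are illustrative:

-- rewrite the table into 12 hash-distributed Parquet files (Hive)
set mapreduce.job.reduces=12;
create table csdb2 stored as parquet as
select * from csdb distribute by service_id;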

11 How about SparkSQL?
The same query can run on SparkSQL (1.6)
Like Impala, it requires the number of blocks >= the number of machines

pyspark --driver-memory 2g --num-executors 12

import time
starttime = time.time()
sqlContext.sql("""
with fundamentals as (
  select distinct valid_from dates, service_id from luca_test.csdb2
)
select a.service_id, max(a.c)
from (
  select c.service_id, f.dates, count(1) c
  from luca_test.csdb2 c, fundamentals f
  where f.service_id = c.service_id
    and f.dates between c.valid_from and c.valid_to
  group by c.service_id, f.dates
) a
group by a.service_id
order by 1
""").show()
print("Delta time = %f" % (time.time() - starttime))

Took 22 minutes on 12 machines, 1 core used per machine
Efficiency: 23k rows/s/core
Scaling up with --executor-memory 2g --executor-cores 3 it took 13 minutes

12 SparkSQL on Spark 2.0
Spark 2.0 was introduced at the end of July 2016
The same query took just 2 minutes!
Efficiency: 245k rows/s/core
Why is Spark 2.0 so fast? It introduces a lot of optimizations, including:
code generation (already available in Impala)
"vectorization" of operations on rows
(for details see databricks.com)

13 About CodeGeneration
Query execution is compiled to bytecode at runtime, instead of being interpreted row by row (a sketch of how to inspect it follows)
Source: databricks.com
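A hedged way to look at the generated code, assuming Spark 2.0's explain codegen support and the table from the earlier slides; it prints the Java source produced for each whole-stage-codegen subtree:

explain codegen
select service_id, count(1)
from luca_test.csdb2
group by service_id;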

14 About "Vectorization" in Spark 2.0
Batching multiple rows together and applying operators vertically (on columns)
Example: the Parquet reader (the flag controlling it is shown below)
Source: databricks.com
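In Spark 2.0 the vectorized Parquet reader sits behind a configuration flag; a hedged one-liner to enable it explicitly from SQL:

set spark.sql.parquet.enableVectorizedReader=true;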

15 Spark benchmarks by Databricks
Source: databricks.com

16 Execution plan: Spark 1.6 vs Spark 2.0

17 Profiling a Spark 1.6 worker

18 Profiling a Spark 2.0 worker

19 Conclusions
Even small data problems (200MB) can make big data platforms struggle ;)
The right data distribution across blocks and machines is key to obtaining scalability
Oracle can run quite fast, but it is less efficient than Impala or Spark:
due to its row-by-row processing model
it lacks CodeGen
CodeGen speeds up CPU-bound workloads significantly, especially when operating on simple scalar types
Spark 2.0 seems to outperform all the other players
Looking forward to having it on the clusters
More detailed info:

