Scaling SQL with different approaches
Credits: Luca Canali
2nd attempt with Oracle: it scales up!
- The initial query is CPU-bound: a big table join, 39M x 10M ===> 374M rows
- By default it is performed on a single core
- It can be sped up significantly by using more cores: parallel query
- IMPORTANT: by default, parallel queries are disabled on Oracle production clusters
Parallel query on a 60-core machine

Creating a view with the timestamps of interest, then joining it with the fact table:

with fundamentals as (
    select /*+ materialize parallel(120) */
           distinct service_id, valid_from dates
    from csdb.csdb)
select /*+ parallel(120) */ a.service_id, max(a.devs_per_serv_and_fdate)
from (
    select f.service_id, count(1) devs_per_serv_and_fdate
    from csdb.csdb c, fundamentals f
    where f.service_id = c.service_id
      and f.dates between c.valid_from and c.valid_to
    group by f.service_id, f.dates
) a
group by a.service_id
order by a.service_id;

Elapsed: 00:15:21.13
Efficiency: 374M rows on 120 cores = 3.3k rows/s/core
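The per-core efficiency quoted on the slide can be sanity-checked from the elapsed time and the degree of parallelism. A back-of-envelope calculation (assuming all 120 parallel execution slaves were busy for the whole run):

```python
# Back-of-envelope check of the per-core efficiency quoted on the slide.
rows = 374_000_000          # rows produced by the 39M x 10M join
elapsed_s = 15 * 60 + 21    # Elapsed: 00:15:21
cores = 120                 # parallel(120) hint

rows_per_s_per_core = rows / (elapsed_s * cores)
print(f"{rows_per_s_per_core:,.0f} rows/s/core")  # ~3.4k, matching the ~3.3k on the slide
```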
Execution plan

PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------------------------------------------------------------
|  Id | Operation                                 | Name                           | Rows | Bytes | Cost (%CPU)| TQ    |IN-OUT| PQ Distrib |
|   0 | SELECT STATEMENT                          |                                |    1 |    26 |     8 (13) |       |      |            |
|   1 |  TEMP TABLE TRANSFORMATION                |                                |      |       |            |       |      |            |
|   2 |   PX COORDINATOR                          |                                |      |       |            |       |      |            |
|   3 |    PX SEND QC (RANDOM)                    | :TQ10001                       |    1 |    22 |     5 (20) | Q1,01 | P->S | QC (RAND)  |
|   4 |     LOAD AS SELECT (TEMP SEGMENT MERGE)   | SYS_TEMP_0FD9D6F7C_D780440F    |      |       |            | Q1,01 | PCWP |            |
|   5 |      HASH UNIQUE                          |                                |    1 |    22 |     5 (20) | Q1,01 | PCWP |            |
|   6 |       PX RECEIVE                          |                                |    1 |    22 |     5 (20) | Q1,01 | PCWP |            |
|   7 |        PX SEND HASH                       | :TQ10000                       |    1 |    22 |     5 (20) | Q1,00 | P->P | HASH       |
|   8 |         HASH UNIQUE                       |                                |    1 |    22 |     5 (20) | Q1,00 | PCWP |            |
|   9 |          PX BLOCK ITERATOR                |                                |    1 |    22 |     4  (0) | Q1,00 | PCWC |            |
|  10 |           TABLE ACCESS FULL               | F2P_IP_INFO_EXT                |    1 |    22 |     4  (0) | Q1,00 | PCWP |            |
|  11 |   PX COORDINATOR                          |                                |      |       |            |       |      |            |
|  12 |    PX SEND QC (ORDER)                     | :TQ20002                       |    1 |    26 |     4  (0) | Q2,02 | P->S | QC (ORDER) |
|  13 |     SORT GROUP BY                         |                                |    1 |    26 |     4  (0) | Q2,02 | PCWP |            |
|  14 |      PX RECEIVE                           |                                |    1 |    26 |     4  (0) | Q2,02 | PCWP |            |
|  15 |       PX SEND RANGE                       | :TQ20001                       |    1 |    26 |     4  (0) | Q2,01 | P->P | RANGE      |
|  16 |        HASH GROUP BY                      |                                |    1 |    26 |     4  (0) | Q2,01 | PCWP |            |
|  17 |         VIEW                              |                                |    1 |    26 |     4  (0) | Q2,01 | PCWP |            |
|  18 |          HASH GROUP BY                    |                                |    1 |    53 |     4  (0) | Q2,01 | PCWP |            |
|  19 |           PX RECEIVE                      |                                |    1 |    53 |     4  (0) | Q2,01 | PCWP |            |
|  20 |            PX SEND HASH                   | :TQ20000                       |    1 |    53 |     4  (0) | Q2,00 | P->P | HASH       |
|  21 |             HASH GROUP BY                 |                                |    1 |    53 |     4  (0) | Q2,00 | PCWP |            |
|  22 |              NESTED LOOPS                 |                                |    1 |    53 |     4  (0) | Q2,00 | PCWP |            |
|  23 |               NESTED LOOPS                |                                |    1 |    53 |     4  (0) | Q2,00 | PCWP |            |
|  24 |                VIEW                       |                                |    1 |    22 |     4  (0) | Q2,00 | PCWP |            |
|  25 |                 PX BLOCK ITERATOR         |                                |    1 |    22 |     4  (0) | Q2,00 | PCWC |            |
|  26 |                  TABLE ACCESS FULL        | SYS_TEMP_0FD9D6F7C_D780440F    |    1 |    22 |     4  (0) | Q2,00 | PCWP |            |
|* 27 |                INDEX RANGE SCAN           | F2P_IP_INFO_EXT_SERVICE_ID_IDX |    1 |       |     0  (0) | Q2,00 | PCWP |            |
|* 28 |               TABLE ACCESS BY INDEX ROWID | F2P_IP_INFO_EXT                |    1 |    31 |     0  (0) | Q2,00 | PCWP |            |

Top part of the plan: creating the temp table with the timestamps.
Bottom part: for each sub-query, take part of the temp table and join it with the fact table (using the index).
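The bottom of the plan (NESTED LOOPS probing F2P_IP_INFO_EXT through its service_id index) is an index nested-loop join. A toy Python sketch of that access pattern, using made-up rows and a dict as a stand-in for the index:

```python
# Index nested-loop join sketch: for each (service_id, date) row of the
# "fundamentals" temp table, probe an index on the fact table's service_id,
# then filter by the validity interval -- mirroring steps 22-28 of the plan.
from collections import defaultdict

# Hypothetical fact rows: (service_id, valid_from, valid_to)
fact = [(1, 10, 20), (1, 15, 30), (2, 5, 9)]
# Temp table rows: distinct (service_id, valid_from) pairs
fundamentals = [(1, 15), (2, 7)]

# Build the equivalent of F2P_IP_INFO_EXT_SERVICE_ID_IDX
index = defaultdict(list)
for row in fact:
    index[row[0]].append(row)

matches = []
for service_id, date in fundamentals:
    for _, valid_from, valid_to in index[service_id]:  # INDEX RANGE SCAN
        if valid_from <= date <= valid_to:             # TABLE ACCESS filter
            matches.append((service_id, date))

print(matches)  # [(1, 15), (1, 15), (2, 7)]
```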
Need more resources? Use Hadoop!
And not only because of horizontal scalability…
Typical options for data ingestion into Hadoop
- Apache Sqoop
  - uses JDBC; many databases supported
  - can write in Avro or Parquet formats directly
- Kite
  - from JSON or CSV
- Custom scripts
Options for querying the data on Hadoop
Querying the data with SQL (declarative):
- Apache Hive
- Apache Impala
- Apache SparkSQL
Approach with Impala

Create a Hive table:

create external table csdb
like parquet 'hdfspath'
stored as parquet
location 'hdfsdir'

Query it (the SQL statement is unchanged!):

with fundamentals as (
    select distinct valid_from dates, service_id
    from csdb)
select a.service_id, max(c)
from (
    select c.service_id, f.dates, count(1) c
    from csdb c, fundamentals f
    where f.service_id = c.service_id
      and f.dates between c.valid_from and c.valid_to
    group by c.service_id, f.dates
) a
group by a.service_id
order by 1
Impala’s runtime execution plan

Operator              #Hosts  Avg Time   Max Time   #Rows   Est. #Rows  Peak Mem   Est. Peak Mem  Detail
--------------------------------------------------------------------------------------------------------------------
14:MERGING-EXCHANGE        0  0.000ns    0.000ns        0           -1  0          -1.00 B        UNPARTITIONED
06:SORT                    4  6.346ms    8.682ms        0           -1  24.00 MB   0
13:AGGREGATE               4  151.256ms  158.305ms      0           -1  2.25 MB    128.00 MB      FINALIZE
12:EXCHANGE                4  0.000ns    0.000ns        0           -1  0          0              HASH(a.service_id)
05:AGGREGATE               4  160.179ms  166.378ms      0           -1  1.25 MB    128.00 MB      STREAMING
11:AGGREGATE               4  2.100ms    2.293ms        0           -1  2.25 MB    128.00 MB      FINALIZE
10:EXCHANGE                4  0.000ns    0.000ns        0           -1  0          0              HASH(c.service_id,f.dates)
04:AGGREGATE               4  0.000ns    0.000ns        0           -1  155.05 MB  128.00 MB      STREAMING
03:HASH JOIN               4  9m30s      9m32s     35.22M           -1  932.16 MB  2.00 GB        INNER JOIN, BROADCAST
|--09:EXCHANGE             4  198.163ms  203.434ms 10.21M           -1  0          0              BROADCAST
|  08:AGGREGATE            4  584.498ms  636.290ms 10.21M           -1  204.04 MB  128.00 MB      FINALIZE
|  07:EXCHANGE             4  59.790ms   72.221ms  10.21M           -1  0          0              HASH(valid_from,service_id)
|  02:AGGREGATE            4  1s203ms    1s792ms   10.21M           -1  395.15 MB  128.00 MB      STREAMING
|  01:SCAN HDFS            4  251.675ms  344.008ms 39.22M           -1  58.03 MB   96.00 MB       default.csdb
00:SCAN HDFS               4  93.435ms   110.983ms  5.75M           -1  86.07 MB   144.00 MB      default.csdb c

Took: 103m41s
Only 4 machines were used in processing, because the input file is small -> only 4 blocks
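The "only 4 machines" effect follows from how scan ranges are assigned: each HDFS block becomes at most one scan range, so block count caps scan parallelism. A deliberately simplified model of that constraint (not Impala's actual scheduler):

```python
# Scan parallelism is bounded by the number of HDFS blocks: with a small
# file split into 4 blocks, at most 4 hosts get a scan range, no matter
# how many machines the cluster has. (Illustrative model only.)
def scan_hosts(num_blocks, num_hosts):
    return min(num_blocks, num_hosts)

print(scan_hosts(num_blocks=4, num_hosts=12))   # 4  -> the run above
print(scan_hosts(num_blocks=24, num_hosts=12))  # 12 -> the whole cluster is busy
```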
2nd approach: making Impala run on all nodes
- The number of workers involved is data-driven
- The table has to be repartitioned
  - number of blocks >= number of machines
  - data should be evenly distributed between blocks/partitions (hash partitioning)
- Scaled out on 12 machines, it took 39m
  - only one core per machine was used by Impala
  - Efficiency: 374M rows produced on 12 cores = 13k rows/s/core
- Compute table stats to scale it up (use more resources per host):
  compute stats csdb
  Estimated Per-Host Requirements: Memory=2.03GB VCores=3
- With table statistics it took 17m
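Why hash partitioning gives the even distribution mentioned above: a hash of the key spreads rows near-uniformly across partitions, so every machine gets a similar share of the work. A toy sketch with hypothetical service ids (this is an illustration of the principle, not Impala's partitioner):

```python
# Hash partitioning sketch: assign each row to a partition by hashing its
# key; the resulting partition sizes are close to uniform.
import zlib
from collections import Counter

num_partitions = 12
keys = [f"service-{i}" for i in range(10_000)]  # hypothetical service_ids
parts = Counter(zlib.crc32(k.encode()) % num_partitions for k in keys)

sizes = sorted(parts.values())
print(sizes[0], sizes[-1])  # min and max partition sizes, both near 10000/12 ~ 833
```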
How about SparkSQL?
The same query can run on SparkSQL (1.6)
- like Impala, it requires the number of blocks >= the number of machines

pyspark --driver-memory 2g --num-executors 12

import time
starttime = time.time()
sqlContext.sql("with fundamentals as (select distinct valid_from dates, service_id from luca_test.csdb2) "
               "select a.service_id, max(a.c) from "
               "(select c.service_id, f.dates, count(1) c from luca_test.csdb2 c, fundamentals f "
               "where f.service_id=c.service_id and f.dates between c.valid_from and c.valid_to "
               "group by c.service_id, f.dates) a "
               "group by a.service_id order by 1").show()
print("Delta time = %f" % (time.time() - starttime))

Delta time = 1342.786748

- Took 22 minutes on 12 machines, 1 core used per machine
  - Efficiency: 23k rows/s/core
- Scaling up: --executor-memory 2g --executor-cores 3
  - Took 13 minutes
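The two Spark 1.6 runs above make an interesting comparison: scaling up cuts wall-clock time but lowers per-core efficiency. A quick check from the quoted numbers (the 36-core figure for the second run is an assumption, inferred from --executor-cores 3 with 12 executors; the slide only quotes the 13-minute wall time):

```python
# Wall time vs per-core efficiency for the two Spark 1.6 runs quoted above.
rows = 374_000_000

t1, cores1 = 1342.8, 12       # run 1: 12 executors, 1 core each
t2, cores2 = 13 * 60, 12 * 3  # run 2: 12 executors x 3 cores (assumed)

print(round(rows / (t1 * cores1)))  # ~23k rows/s/core, as on the slide
print(round(rows / (t2 * cores2)))  # ~13k: faster wall clock, lower per-core efficiency
```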
SparkSQL on Spark 2.0
- Spark 2.0 was introduced at the end of July 2016
- The same query took just 2 minutes!
  - Efficiency: 245k rows/s/core
- Why is Spark 2.0 so fast?
  - Spark 2.0 introduces a lot of optimizations, including:
    - code generation (already available in Impala)
    - "vectorization" of operations on rows
  - for details see https://databricks.com/blog/2016/05/23/apache-spark-as-a-compiler-joining-a-billion-rows-per-second-on-a-laptop.html
About code generation
Query execution is compiled to bytecode at runtime
Source: databricks.com
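The idea behind code generation can be shown in miniature: instead of evaluating a chain of generic operators with per-row virtual calls, generate one fused loop for the specific query and compile it at runtime. A toy Python sketch (Spark actually generates and JIT-compiles Java bytecode, not Python):

```python
# Toy "whole-stage code generation": build the source of a fused loop for a
# specific filter + projection, compile it at runtime, and compare it with
# a generic interpreted operator chain.
def interpreted(rows, predicate, projection):
    out = []
    for r in rows:                 # per-row indirect calls, as in a row engine
        if predicate(r):
            out.append(projection(r))
    return out

def codegen():
    src = (
        "def fused(rows):\n"
        "    out = []\n"
        "    for r in rows:\n"
        "        if r % 2 == 0:\n"          # predicate inlined into the loop
        "            out.append(r * 10)\n"  # projection inlined into the loop
        "    return out\n"
    )
    ns = {}
    exec(compile(src, "<generated>", "exec"), ns)
    return ns["fused"]

rows = list(range(10))
fused = codegen()
assert fused(rows) == interpreted(rows, lambda r: r % 2 == 0, lambda r: r * 10)
print(fused(rows))  # [0, 20, 40, 60, 80]
```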
About "vectorization" in Spark 2.0
Multiple rows are batched together and operators are applied vertically (on columns)
Example: the Parquet reader
Source: databricks.com
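The same computation can be organized row-at-a-time or column-at-a-time. A toy sketch of the contrast, with plain Python lists standing in for column vectors (in Spark 2.0 the columnar batches are JVM structures read by the vectorized Parquet reader):

```python
# Toy columnar batching: apply an operator to whole column batches in one
# tight loop, instead of calling an operator function per row.
def apply_row(row):
    # per-row function call, as in a tuple-at-a-time engine
    a, b = row
    return a + b

def row_at_a_time(rows):
    return [apply_row(r) for r in rows]

def vectorized(col_a, col_b):
    # one tight loop over column batches; in a JIT-compiled engine this is
    # friendlier to the CPU (fewer calls, better cache and SIMD behaviour)
    return [a + b for a, b in zip(col_a, col_b)]

rows = [(1, 10), (2, 20), (3, 30)]
col_a, col_b = [1, 2, 3], [10, 20, 30]  # the same data stored column-wise
assert row_at_a_time(rows) == vectorized(col_a, col_b)
print(vectorized(col_a, col_b))  # [11, 22, 33]
```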
Spark benchmarks by Databricks
Source: databricks.com
Execution plan: Spark 1.6 vs Spark 2.0
Profiling a Spark 1.6 worker
Profiling a Spark 2.0 worker
Conclusions
- Even small-data problems (200MB) can challenge big data platforms ;)
  - the right data distribution across blocks and machines is key to obtaining scalability
- Oracle can run quite fast, but it is less efficient than Impala or Spark
  - due to its row-based processing model: it lacks code generation
- Code generation speeds up CPU-bound workloads significantly
  - especially when operating on simple scalar types
- Spark 2.0 seems to outperform all the other players
  - looking forward to having it on the clusters
More detailed info: http://db-blog.web.cern.ch/blog/luca-canali/2016-09-spark-20-performance-improvements-investigated-flame-graphs