1
From classic OLAP to real-time data warehouse
Apache Kylin 2.0: From classic OLAP to real-time data warehouse. Thank you all for coming, and thanks for your time. My name is Yang, or 李扬 in Chinese. I am co-founder and CTO of Kyligence, and also a PMC member of Apache Kylin. Today's topic is Apache Kylin 2.0. It is in beta at the moment and is planned for release in April. I will first give a brief overview of what Apache Kylin is, and then talk about the important new features it brings, including snowflake schema support, Spark cubing, and streaming. 李扬 | Li, Yang Co-founder & CTO
2
What is Apache Kylin Apache Kylin HDFS Hive HBase Reporting
Interactive Reporting, Dashboard — 3,000 billion rows, < 1 sec query (top-1 news feed app in China); 60+ dimensions (top-3 insurance group in China); JDBC / ODBC / REST API BI integration, BI Visualization. So what is Apache Kylin? It is an OLAP engine on Hadoop, perhaps the most popular one at the moment: if you google "OLAP on Hadoop", Kylin is the first result, as I tried this morning. It sits on top of the Hadoop infrastructure and exposes your relational data to upper applications via a standard SQL interface. Kylin can handle very big data sets and is very fast in terms of query latency. For example, the biggest Kylin instance we know of in production is at toutiao.com, the top-1 news feed app in China. It has a table of 3,000 billion rows, and with Kylin the average query response time is below 1 second. Kylin can handle very complex data models, and we will talk more about snowflake support soon. The widest cube we know of is at CPIC (中国太平洋保险), the top-3 insurance group in China; it contains more than 60 dimensions. Kylin provides standard JDBC / ODBC / REST API interfaces and integrates very well with existing BI tools, like Tableau and Power BI, you name it. (Diagram: Apache Kylin OLAP Engine on top of Hadoop — HDFS, Hive, HBase.) All rights reserved ©Kyligence Inc.
3
Why Kylin is Fast
A sample query: report revenue by "returnflag" and "orderstatus" over a time period. Why is Kylin so fast? We can show it with an example. Imagine a retail scenario: I want to report revenue by "returnflag" and "orderstatus" within a date range, to see the total amount of successful transactions, cancelled transactions, returned transactions, and so on. A typical SQL query looks like this. To execute it, we compile it into a relational expression like the diagram on the left — the so-called execution plan. From the execution plan we can see that executing the query involves scanning all the rows in the tables, joining them together, running them through the date-range filter, summing revenue by "returnflag" and "orderstatus", and finally producing the sorted result. It is easy to see that the time complexity of such an execution is at least O(N), where N is the total number of rows in the tables, because each table row is visited at least once. This already assumes the joined tables are perfectly partitioned and indexed, so that the expensive join operator can also finish in linear time — which is rarely the case in practice. Either way, O(N) is the best you can do with ad-hoc SQL processing.

select l_returnflag, o_orderstatus, sum(l_quantity) as sum_qty, sum(l_extendedprice) as sum_base_price … from v_lineitem inner join v_orders on l_orderkey = o_orderkey where l_shipdate <= ' ' group by o_orderstatus order by o_orderstatus;

(Execution plan: Sort → Aggr. → Filter → Join → Tables, O(N).)
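To make the cost concrete, here is a minimal sketch using Python's built-in sqlite3, running a query of the same shape. The tiny tables, their rows, and the date literal are invented for illustration (the slide's own literals are elided); the point is that the engine must still visit every joined row, which is the O(N) lower bound described above.

```python
import sqlite3

# Hypothetical miniature versions of the slide's v_lineitem / v_orders tables;
# column names follow the slide's SQL, the data is made up.
con = sqlite3.connect(":memory:")
con.executescript("""
create table v_orders  (o_orderkey int, o_orderstatus text);
create table v_lineitem(l_orderkey int, l_returnflag text,
                        l_quantity int, l_extendedprice real, l_shipdate text);
insert into v_orders   values (1, 'F'), (2, 'O');
insert into v_lineitem values
  (1, 'R', 2, 10.0, '1995-01-01'),
  (1, 'N', 1,  5.0, '1995-02-01'),
  (2, 'N', 4, 20.0, '1995-03-01');
""")

# Online calculation: scan, join, filter, aggregate, sort -- every row visited.
rows = con.execute("""
    select l_returnflag, o_orderstatus,
           sum(l_quantity) as sum_qty, sum(l_extendedprice) as sum_base_price
    from v_lineitem inner join v_orders on l_orderkey = o_orderkey
    where l_shipdate <= '1995-12-31'
    group by l_returnflag, o_orderstatus
    order by l_returnflag, o_orderstatus
""").fetchall()
print(rows)  # [('N', 'F', 1, 5.0), ('N', 'O', 4, 20.0), ('R', 'F', 2, 10.0)]
```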
4
Why Kylin is Fast (with precalculation: Sort → Aggr. → Filter → Cuboid)
How can Kylin go beyond O(N)? By precalculation. If I know the query pattern in advance, I can precalculate the Aggregate, Join, and Table Scan operators to create a cuboid. If "cuboid" sounds unfamiliar, you can think of it as a materialized summary table — here, transaction amount grouped by "returnflag", "orderstatus", and "date". Because there are a fixed number of return flags and order statuses, and let's say the date range is limited too — 3 years is about 1000 days — the number of rows in the summary table is at most flag × status × days, which is a constant in big-O notation. That means if we execute the same SQL on the precalculated cuboid, the maximum number of rows to process is a constant, and that is why Kylin can be faster. (Precalculate the Kylin Cube: the Aggr. → Join → Tables subtree becomes a Cuboid; O(N) becomes O(flag × status × days) = O(1).)
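The idea above can be sketched in a few lines of plain Python (invented sample rows; the `cuboid` dict plays the role of the materialized summary table): one pass over the raw data builds the summary keyed by (flag, status, day), and the query then touches only summary cells, never the raw rows.

```python
from collections import defaultdict

# Raw fact rows (invented sample data): (returnflag, orderstatus, day, revenue).
rows = [
    ("R", "F", "2017-03-01", 10.0),
    ("N", "O", "2017-03-01", 20.0),
    ("R", "F", "2017-03-02", 7.5),
]

# Precalculation: one pass over the raw data builds the cuboid,
# keyed by (returnflag, orderstatus, day).
cuboid = defaultdict(float)
for flag, status, day, revenue in rows:
    cuboid[(flag, status, day)] += revenue

def revenue_by_flag_status(start, end):
    # Query time: aggregate cuboid cells inside the date range.
    # At most flag x status x days cells exist -- a constant, not O(N).
    out = defaultdict(float)
    for (flag, status, day), rev in cuboid.items():
        if start <= day <= end:
            out[(flag, status)] += rev
    return dict(out)

print(revenue_by_flag_status("2017-03-01", "2017-03-02"))
# {('R', 'F'): 17.5, ('N', 'O'): 20.0}
```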
5
Kylin is about Precalculation
Based on the Cube Theory. Model and Cube define the space of precalculation; the Build Engine carries out the precalculation; the Query Engine runs SQL on the precalculated result. So Kylin is all about precalculation. The core idea is based on classic cube theory and is developed from there into the SQL-on-big-data domain. Kylin provides Model and Cube to help you define the space of precalculation. It has a Build Engine that executes the precalculation on a distributed system using MR or Spark, and a Query Engine that allows you to run SQL on top of the precalculated result. The key here is modeling: if you have a good understanding of your data and your business analysis requirements, you can capture the necessary precalculation with the right cube. (Cuboid lattice over time, item, location, supplier: the 0-D apex cuboid; 1-D cuboids time, item, location, supplier; 2-D cuboids such as (time, item) and (location, supplier); 3-D cuboids such as (time, item, location); and the 4-D base cuboid (time, item, location, supplier).) All rights reserved ©Kyligence Inc.
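The cuboid lattice described above is simply the power set of the cube's dimensions; a short sketch enumerates it layer by layer:

```python
from itertools import combinations

dims = ("time", "item", "location", "supplier")

# Every subset of the dimensions is one cuboid: 2^4 = 16 cuboids in total,
# arranged in layers by dimensionality (0-D apex up to the 4-D base cuboid).
lattice = {d: list(combinations(dims, d)) for d in range(len(dims) + 1)}

for d in lattice:
    print(f"{d}-D cuboids: {len(lattice[d])}")  # layer sizes 1, 4, 6, 4, 1
```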
6
O(1) regardless of Data Size
Once you have the right precalculation, most of your queries (if not all) can be transformed into a cube query like the one we have just seen. The execution time complexity is then reduced to O(1), achieving very fast query speed regardless of the original data size. That concludes the brief introduction of Apache Kylin. (Chart: response time vs. data size — online calculation grows as O(N); Apache Kylin stays O(1) regardless of data size.)
7
Snowflake Schema Support
Runs TPC-H benchmark. Next we will talk about the Snowflake Schema support, which is the most important new feature in Kylin 2.0.
8
Kylin 1.0 Star Schema Limitation
Precalculates the join of only 1 level of lookups; supports star schema only; doesn't allow columns with the same name from different tables; doesn't allow a table to join itself; difficult to support real-world business cases. Kylin 1.0 has the limitation that it only supports star schema data models. That means it only allows precalculating the join of one level of lookup tables, as the diagram shows (LINEORDER joined directly to CUSTOMER, SUPPLIER, PART, and DATES). This makes it difficult to support real-world business cases, which are often more complicated than a star schema.
9
Kylin 2.0 Snowflake Schema
Precalculates unlimited levels of lookups; snowflake schema support (KYLIN-1875); allows a table to be joined multiple times; a big metadata change at the Model level; many bug fixes regarding joins and sub-queries. Supports complex models of any kind, and flexible queries on those models. Kylin 2.0 introduced big changes to the model metadata and supports the snowflake data model out of the box. It allows precalculating unlimited levels of lookup tables, as the diagram shows (a chain of joins: LINEITEM → ORDERS → CUSTOMER; LINEITEM → PARTSUPP → PART and SUPPLIER; SUPPLIER → NATION → REGION). There were also many bug fixes and improvements regarding joins and sub-queries. As a result, Kylin 2.0 is able to support very complex data models and queries.
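To illustrate what a snowflake query looks like, here is a toy example using Python's sqlite3. The tables and rows are invented, loosely following TPC-H's LINEITEM → ORDERS → CUSTOMER → NATION → REGION chain; the multi-level join below is exactly the kind of join tree Kylin 2.0 can now precalculate, where a star schema would stop at one level of lookups.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
create table region  (r_regionkey int, r_name text);
create table nation  (n_nationkey int, n_name text, n_regionkey int);
create table customer(c_custkey int, c_nationkey int);
create table orders  (o_orderkey int, o_custkey int);
create table lineitem(l_orderkey int, l_price real);
insert into region   values (0, 'ASIA'), (1, 'AMERICA');
insert into nation   values (10, 'CHINA', 0), (11, 'PERU', 1);
insert into customer values (100, 10), (101, 11);
insert into orders   values (1000, 100), (1001, 101);
insert into lineitem values (1000, 50.0), (1000, 25.0), (1001, 40.0);
""")

# Three levels of lookups below ORDERS -- beyond what a star schema allows.
rows = con.execute("""
    select r_name, sum(l_price)
    from lineitem
    join orders   on l_orderkey  = o_orderkey
    join customer on o_custkey   = c_custkey
    join nation   on c_nationkey = n_nationkey
    join region   on n_regionkey = r_regionkey
    group by r_name order by r_name
""").fetchall()
print(rows)  # [('AMERICA', 40.0), ('ASIA', 75.0)]
```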
10
TPC-H on Kylin 2.0 TPC-H is a benchmark for decision support system.
Popular among commercial RDBMS & DW solutions; queries and data have broad industry-wide relevance; examines large volumes of data; executes queries with a high degree of complexity; gives answers to critical business questions. Kylin 2.0 runs all 22 TPC-H queries (KYLIN-2467). Precalculation can answer very complex queries; the goal is functionality at this stage. Try it: To demonstrate the greatly improved modeling and SQL capability, we made an effort to run the TPC-H queries on Kylin 2.0. For those who are not familiar with TPC-H, the following is quoted from the TPC-H website: it is a popular decision-support benchmark for commercial RDBMS & DW solutions. It includes both queries and data that have broad industry-wide relevance. The queries are designed to examine large volumes of data, to have a high degree of complexity, and to give answers to critical business questions. Kylin 2.0 can run all 22 TPC-H queries. We have put up a page with all the steps and resources needed to reproduce the work, so anyone who wants to verify can give it a try in their own environment. This shows that with a precalculated cube, Kylin is also very flexible and can answer very complex queries. One more note: the goal here is not to compare performance with other TPC-H results. On one hand, according to the TPC-H spec, precalculation across tables is not permitted, so in that sense Kylin does not qualify as a valid system to compare with other TPC-H results. On the other hand, we haven't done performance tuning for the TPC-H queries yet — we just had enough time to make all the queries pass — and the room for performance improvement is still very big. (Schema: ORDERS, CUSTOMER, SUPPLIER, PART, LINEITEM, PARTSUPP, NATION, REGION.)
11
Complex Query 1 TPC-H query 07 0.17 sec (Hive+Tez 35.23 sec)
2 sub-queries. To quickly show you some interesting TPC-H queries: on the left side is the SQL. Its font size is very small and is not intended to be read; I highlighted the sub-queries in different colors to give a feeling of the complexity. On the right side is the relational expression of the query in tree form. In the tree, precalculated nodes are marked with a white background and red text, so they look lightweight; the other, solid nodes are submitted to online calculation and are what makes a query slow. This is TPC-H query 07. As you can see, almost all of the relational operators are precalculated. The remaining Sort / Proj / Filter work on very few records, so the query is very fast: it takes Kylin only 0.17 seconds, while on the same hardware and the same data set, Hive+Tez takes 35.23 seconds for this query. That shows the difference between precalculation and online calculation.

select supp_nation, cust_nation, l_year, sum(volume) as revenue from ( n1.n_name as supp_nation, n2.n_name as cust_nation, l_shipyear as l_year, l_saleprice as volume v_lineitem inner join supplier on s_suppkey = l_suppkey inner join v_orders on l_orderkey = o_orderkey inner join customer on o_custkey = c_custkey inner join nation n1 on s_nationkey = n1.n_nationkey inner join nation n2 on c_nationkey = n2.n_nationkey where (n1.n_name = 'KENYA' and n2.n_name = 'PERU') or (n1.n_name = 'PERU' and n2.n_name = 'KENYA') ) and l_shipdate between ' ' and ' ' ) as shipping group by l_year order by …

(Plan: Sort → Aggr. → Proj. → Filter over precalculated Join(NATION) → Join(NATION) → Join(CUSTOMER) → Join(ORDER) → Join(SUPPLIER) → LINEITEM.)
12
Complex Query 2 Join TPC-H query 11 3.42 sec (Hive+Tez 15.87 sec)
4 sub-queries, 1 online join. This is TPC-H query 11. It has 4 sub-queries in terms of SQL and is more complex than the previous query. Its relational expression tree is more complex too, including two branches, each loading from a precalculated cuboid. Finally, the results of the branches are joined again, which is a very heavy online computation. With the percentage of online calculation increased, the query takes longer for Kylin: 3.42 seconds. A full online calculation is still much slower: Hive+Tez takes 15.87 seconds.

with q11_part_tmp_cached as ( select ps_partkey, sum(ps_partvalue) as part_value from v_partsupp inner join supplier on ps_suppkey = s_suppkey inner join nation on s_nationkey = n_nationkey where n_name = 'GERMANY' group by ps_partkey ), q11_sum_tmp_cached as ( sum(part_value) as total_value q11_part_tmp_cached ) part_value from ( part_value, total_value q11_part_tmp_cached, q11_sum_tmp_cached ) a part_value > total_value * order by part_value desc;

(Plan: Sort → Proj. → Filter → online Join of two branches, each an Aggr./Proj. over a precalculated Join(NATION, SUPPLIER, PARTSUPP) cuboid.)
13
Complex Query 3 Join Join TPC-H query 12 7.66 sec (Hive+Tez 12.64 sec)
5 sub-queries, 2 online joins. This is an even more complex query, TPC-H query 12. It contains 5 sub-queries; the SQL font size is even smaller than on the previous pages to fit the SQL on screen, and the relational form is uglier too: it reads 3 cuboids in 3 branches and joins them online, with 2 heavy online join operators. As expected, the more online calculation there is, the slower the query gets. It takes 7.66 seconds to run in Kylin and 12.64 seconds in Hive+Tez. As more heavy operators stay online, the advantage of precalculation decreases.

with in_scope_data as( select l_shipmode, o_orderpriority from v_lineitem inner join v_orders on l_orderkey = o_orderkey where l_shipmode in ('REG AIR', 'MAIL') and l_receiptdelayed = 1 and l_shipdelayed = 0 and l_receiptdate >= ' ' and l_receiptdate < ' ' ), all_l_shipmode as( distinct l_shipmode in_scope_data ), high_line as( count(*) as high_line_count o_orderpriority = '1-URGENT' or o_orderpriority = '2-HIGH' group by l_shipmode ), low_line as( count(*) as low_line_count o_orderpriority <> '1-URGENT' and o_orderpriority <> '2-HIGH' ) al.l_shipmode, hl.high_line_count, ll.low_line_count all_l_shipmode al left join high_line hl on al.l_shipmode = hl.l_shipmode left join low_line ll on al.l_shipmode = ll.l_shipmode order by al.l_shipmode

(Plan: Sort → 2 online Joins over 3 branches, each an Aggr. → Filter → Join → Proj. over ORDERS and LINEITEM.)
14
More than MOLAP Kylin 2.0 Kylin 1.0 DW on Hadoop
Supports complex data models and sub-queries; runs TPC-H; Percentile / Window / Time functions. To sum up, Kylin 2.0 is much more than multidimensional analysis. With snowflake schema support, it can handle very complex relational models and answer very complex queries. It runs the TPC-H benchmark and has added or enhanced many other SQL features, such as percentile functions, window functions, and time functions. Compared to Kylin 1.0, Kylin 2.0 is a big step forward in terms of SQL maturity: given the right model design, it can answer your queries of any complexity. Compared to other data warehouses on Hadoop, Kylin 2.0 may still be a little behind in terms of SQL — for example, it does not have UDFs yet — but the gap is very small now. And in terms of query latency, because of precalculation, Kylin has a big advantage. (Chart: speed vs. SQL maturity — MOLAP, Analytics DW, Kylin 1.0, Kylin 2.0, DW on Hadoop.)
15
Spark Cubing Halves the build time.
Next we will talk about Spark Cubing, which cuts the build time by half.
16
A Bit of History. Kylin 1.5 attempted Spark cubing, but the feature was never released. It was a port of the MR in-mem cubing algorithm: exploit memory and build the whole cube in one round. No improvement was observed — Spark did nothing differently than MR. Since 1.5, we had been trying to do cubing on Spark, but at that time the attempt was not successful. The first attempt was a port of the MR in-mem cubing algorithm onto Spark, which was the easiest thing to do then due to ease of implementation. The algorithm uses as much memory as it can and builds the whole cube in one round, so Spark's memory cache is not an advantage — the MR job does the same thing. The result was no obvious improvement compared to MapReduce.
17
Spark Cubing in 2.0. Kylin 2.0 did a complete rework based on the layered cubing algorithm. Each layer of cuboids is an RDD; the parent RDD is cached for the next round; the RDD exports to a sequence file, the same output format as MR. Translate "map" to "flatMap" and "reduce" to "reduceByKey", and most code gets reused. In layered cubing, the cube is calculated layer by layer: each layer's output is the next layer's input. We use an RDD to abstract the layer data, so the parent RDD can be cached to speed up the processing of the next layer. The RDD can be exported to a sequence file using the same format as MapReduce, which keeps the cube data format compatible. Implementation-wise, the original mapper translates into "flatMap" and the original reducer into "reduceByKey"; most of the code gets reused. (RDD-1 → RDD-2 → RDD-3 → RDD-4 → RDD-5)
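The mapper-to-flatMap, reducer-to-reduceByKey translation can be simulated without Spark. In this sketch the helpers `flat_map` and `reduce_by_key` stand in for the RDD operations, and a hand-written `CHILDREN` table plays the role Kylin's cuboid scheduler plays — each child cuboid is derived from exactly one parent, so nothing is double-counted; the data and measures are invented.

```python
from collections import defaultdict

def flat_map(f, records):
    # Stand-in for RDD.flatMap: one input record yields many output records.
    return [out for rec in records for out in f(rec)]

def reduce_by_key(records):
    # Stand-in for RDD.reduceByKey with addition as the reducer.
    acc = defaultdict(float)
    for key, value in records:
        acc[key] += value
    return sorted(acc.items())

# Parent -> children spanning tree (the cuboid scheduler's job):
# each child cuboid has exactly one parent in the layer above.
CHILDREN = {
    ("flag", "status"): [("flag",), ("status",)],
    ("flag",): [()],
    ("status",): [],
    (): [],
}

def spawn_children(record):
    # A parent-layer record emits one record per child cuboid,
    # keeping only the dimension values the child retains.
    (cuboid, values), measure = record
    for child in CHILDREN[cuboid]:
        kept = tuple(v for d, v in zip(cuboid, values) if d in child)
        yield ((child, kept), measure)

# Base (2-D) layer records, keyed by (cuboid, dimension values).
layer = [((("flag", "status"), ("R", "F")), 17.5),
         ((("flag", "status"), ("N", "O")), 20.0)]
layers = []
while layer:
    layer = reduce_by_key(flat_map(spawn_children, layer))
    if layer:
        layers.append(layer)   # first the 1-D layer, then the 0-D apex

print(layers[-1])  # apex cuboid: the grand total, computed layer by layer
```

Each pass is one "layer job": the previous layer is the (cached) input, flatMap fans records out to child cuboids, and reduceByKey re-aggregates the measures.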
18
DAG Calculating the 3rd Layer
This is a sample Spark job, calculating the 3rd layer of a cube. The first 2 stages are skipped, because they have already been calculated and cached by the previous layer's job. The later two stages do the additional processing to calculate the 3rd layer of the cube.
19
Spark Cubing vs. MR Layered Cubing
Halves the build time; the advantage decreases as data size increases. Setup: 4-node cluster, Spark on YARN, 24 vcores, 30 GB memory; 3 data sets of increasing size: 0.15 GB / 2.5 GB / 8 GB. Comparing Spark cubing performance with MR layered cubing: the red bar is Spark and the blue bar is MR. The test was done on a 4-node cluster, Spark on YARN, with 24 vcores and 30 GB memory allocated to Spark. We tested 3 data sets of increasing size: 0.15 GB, 2.5 GB, and 8 GB. The first two data sets are very small compared to the 30 GB of memory, so Spark can cache everything in memory, and the result is that Spark's build time is about 50% of MR layered cubing. The 8 GB data set is much bigger, and since cubing causes data expansion, it cannot all fit in memory, so the advantage of Spark decreases a little, as the diagram shows.
20
Spark Cubing vs. MR In-mem Cubing
Almost equally fast, and more adaptable to general data sets. In-mem cubing expects sharded data and works poorly on random data sets; Spark cubing is more adaptable to different kinds of data distribution. Comparing Spark cubing to MR in-mem cubing: again the red bar is Spark, the blue bar is MR in-mem cubing. They are almost equally fast. However, remember that in-mem cubing is very picky about data distribution: it is only effective when the data is sharded or nearly sharded, and if you force in-mem cubing to run on random data, it performs even slower than MR layered cubing. In the diagram, the data sets are all sharded in favor of in-mem cubing. Spark cubing, on the other hand, shows good performance consistently, regardless of data distribution. So in general, Spark cubing is still a better choice than in-mem cubing.
21
Near Real-time Streaming
Build latency down to a few minutes. The last topic is near real-time streaming.
22
New in Kylin 1.6: ANSI SQL, in-mem cubing, new BI tools, web app…
This is actually a 1.6 feature; however, I feel it was not marketed well enough, so here it is again. Starting from 1.6, Kylin can connect to Kafka as a source, just like Hive. Using the in-mem cubing algorithm, we can trigger micro incremental builds very frequently, e.g. every 2 minutes. The result is many small cube segments, which can be queried to give near real-time results.
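The micro-batch idea can be sketched as pure bookkeeping over Kafka offsets (an illustration of the concept, not Kylin's actual API; the offsets and 2-minute windows are invented): on every trigger, the newly arrived offset range becomes one small, independently queryable cube segment.

```python
# Sketch: each micro incremental build consumes one Kafka offset range
# and produces one small cube segment covering exactly that range.
def next_segment(last_end_offset, latest_offset):
    """Claim the offset range [last_end_offset, latest_offset) as a segment."""
    if latest_offset <= last_end_offset:
        return None                      # nothing new arrived in this window
    return (last_end_offset, latest_offset)

segments, end = [], 0
for latest in (120, 120, 300, 450):      # latest offset seen every 2 minutes
    seg = next_segment(end, latest)
    if seg:
        segments.append(seg)
        end = seg[1]
print(segments)  # [(0, 120), (120, 300), (300, 450)]
```

Queries then union over all segments, so freshly built ones are visible within minutes of the data arriving.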
23
Demo of Twitter Analysis
Incremental build triggers every 2 minutes; each build finishes in 3 minutes. Setup: 8-node cluster on AWS, 3 Kafka brokers; Twitter sample feed, 10+ K messages per second; cube has 9 dimensions and 3 measures; 2 jobs running at the same time. To show this truly works, we put up a demo site that analyzes Twitter messages in real time. It runs on an 8-node AWS cluster with 3 Kafka brokers. The input is the Twitter sample feed, which delivers 10+ K messages per second. The cube is of average complexity: 9 dimensions and 3 measures. With this setup, an incremental build is triggered every 2 minutes and finishes in 3 minutes. That is why you will see 2 jobs running at the same time, as the screenshot shows; this is perfectly OK as long as your cluster continues to complete jobs at the fixed rate. As a result, the system has about 5 minutes of delay in terms of real-time freshness.
24
The demo shows Twitter message trends by language and by device
The left diagram is the message trend by language over a whole day. We can see English message volume go up during US daytime, while Asian-language volume goes down at the same time, because it is night in Asia.
25
There's also the tag cloud, showing the most recent hot topics
Below the tag cloud is the trend of the hottest tags. All the charts are real-time to within the latest 5 minutes.
26
Summary Apache Kylin 2.0 What is next
Kylin 2.0 Beta download available: snowflake schema support; runs the TPC-H benchmark; Time / Window / Percentile functions; Spark cubing; near real-time streaming. What is next: Hadoop 3.0 support (erasure coding); Spark cubing enhancements; connecting more sources (JDBC, SparkSQL); alternative storage (Kudu?); real-time support, lambda architecture. To summarize: Apache Kylin is about to reach its 2.0 release. It has rich features such as snowflake schema support, running the TPC-H benchmark, and many enhanced SQL functions. It has Spark cubing that halves the build time, and it has streaming capability. 2.0 is still in beta at the moment; there is a beta package which you can download and try. We are very eager to hear feedback from the community, good or bad. After fixing any critical issues, the plan is to release in April. As for Kylin's roadmap, there is a lot we want to do. Hadoop 3.0 with erasure coding could greatly reduce cube storage, something we will definitely catch up with. Spark cubing has much room for improvement too; for example, at the moment it cannot source from Kafka yet. Connecting more sources is another frequently requested feature: we could source from JDBC, or maybe SparkSQL. Alternative storage, like Kudu? Last but not least, a lambda architecture to support true real-time. That is all. Thanks for your time again.
27
Thanks See you on our next talk!