Published by Neal Robinson. Modified over 6 years ago.
1
How Alluxio (formerly Tachyon) brings a 300x performance improvement to Qunar’s streaming processing
Xueyan Li (Qunar) & Chunming Li (Garena)
2
Contents
Part 01: Introduction to Qunar Hotel Data Services and data processing platform
Part 02: Qunar Hotel Data Acceleration with Alluxio
Part 03: Qunar Hotel Data: Use Alluxio to enable data sharing between Batch / Streaming
3
Part 01 Introduction to Qunar Hotel Data Services and data processing platform
4
Hotel price data
Price data figures: 4000 QPS, 500 GB, 4 TB
Figure labels: after compression, raw message, daily data volume
Sensitive data
5
Price Center data landing
Use Storm to extract data and convert it to protobuf
Use Spark Streaming to run the batch
ORC compression
6
Application of data
Analyst/PM/Operations
Downstream application
Direct queries
Price center monitor
Real-time / offline model training
7
System architecture
Everything is deployed uniformly in Marathon + Docker mode
8
After the upgrade to Spark 2.0.x
9
Part 02 Qunar Hotel Data Acceleration with Alluxio
10
Receiver balance problem
Conclusion: Each Executor runs only one Receiver for the highest performance.
11
Basic Spark tuning
Increase streaming duration: The longer the batch duration, the more data each batch receives and the greater the storage requirements.
Kafka Partition = Spark Receiver: With the Spark high-level API, the number of Kafka partitions must equal the number of Receivers to make full use of resources.
Increase block size: A larger block interval generates larger blocks and fewer ORC files. Memory requirements are higher, but processing performance improves.
Modify Mesos resource scheduling: Spark has a node-locality problem, so a reasonable scheduling scheme is needed to make sure resources are not wasted.
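The relationship between batch duration, block interval, and Receiver count in the tuning notes above can be sketched with a little arithmetic. This is a minimal illustration, assuming the usual Spark Streaming rule of thumb that each Receiver produces roughly (batch interval / block interval) blocks per batch, with one task per block. The function names and all numbers below are hypothetical and are not taken from the talk.

```python
def blocks_per_batch(batch_interval_ms: int, block_interval_ms: int, receivers: int) -> int:
    """Approximate number of blocks (one task each) generated per batch.

    Rule of thumb: blocks per receiver ~= batch interval / block interval.
    """
    if batch_interval_ms <= 0 or block_interval_ms <= 0 or receivers <= 0:
        raise ValueError("intervals and receiver count must be positive")
    per_receiver = max(1, batch_interval_ms // block_interval_ms)
    return per_receiver * receivers


def receivers_match_partitions(kafka_partitions: int, receivers: int) -> bool:
    """Check the slide's rule: Kafka partitions should equal Spark Receivers."""
    return kafka_partitions == receivers


# Hypothetical settings: 10 s batches, 8 receivers, 8 Kafka partitions.
small_blocks = blocks_per_batch(10_000, 200, 8)    # 200 ms block interval
large_blocks = blocks_per_batch(10_000, 1_000, 8)  # 1 s block interval
balanced = receivers_match_partitions(8, 8)
```

With these assumed numbers, raising the block interval from 200 ms to 1 s cuts the block count per batch from 400 to 80, which is the "larger blocks, fewer files" effect described above, at the cost of buffering more data per block in memory.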
12
Problems
Large amount of data: A day's data is too large, so Hive SQL and Spark batch jobs cannot run well.
Real-time: Real-time data analysis is not possible. Even for the hot data set, results can only use the current day's or the previous day's data.
Checkpoint: Without checkpointing, data is lost when a task fails or restarts.
13
Why use Alluxio?
01 Save the data cache: When a Spark executor fails or exits, the computed data is not lost because of the "drifting" of the executor.
02 Garbage Collection: Keeping Spark RDD data in Alluxio can reduce GC overhead and save time.
03 Data sharing: Zeppelin, Flink, Spark, and MapReduce can share data at memory speed.
04 Tiered storage: The local storage media, including memory, SSD, and disk, are managed together as a hierarchical storage layer.
14
Tiered storage separates cold and hot data
Most hot data is used only for the current day's results. We deployed an Alluxio Worker on each compute node to manage the local storage media (memory, SSDs, and disks) as a hierarchical storage tier. Data related to each node's upstream computation is stored locally as much as possible to avoid consuming network resources. Alluxio also provides LRU, LFU, and other efficient replacement strategies that keep hot data in the faster memory layer, improving the data access rate. Even cold data is stored on local disk, which avoids having to access the remote HDFS storage cluster.
Storage tiers: MEM, SSD, HDD
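The idea of LRU replacement across MEM, SSD, and HDD tiers can be illustrated with a small toy model. This is only a sketch under stated assumptions, not Alluxio's implementation: the class `TieredStore`, its methods, the per-tier capacities counted in blocks, and the promote-on-read behavior are all hypothetical choices made for the example.

```python
from collections import OrderedDict


class TieredStore:
    """Toy tiered block store with LRU eviction from faster to slower tiers."""

    def __init__(self, capacities):
        # capacities: list of (tier_name, max_blocks), fastest tier first.
        self.tiers = [(name, cap, OrderedDict()) for name, cap in capacities]

    def _insert(self, level, block_id, data):
        if level >= len(self.tiers):
            return  # dropped locally; a real system would still have the under storage
        _, cap, blocks = self.tiers[level]
        blocks[block_id] = data
        blocks.move_to_end(block_id)  # mark as most recently used
        while len(blocks) > cap:
            victim_id, victim = blocks.popitem(last=False)  # least recently used
            self._insert(level + 1, victim_id, victim)      # demote to next tier

    def _remove(self, block_id):
        for _, _, blocks in self.tiers:
            if block_id in blocks:
                return True, blocks.pop(block_id)
        return False, None

    def put(self, block_id, data):
        self._remove(block_id)
        self._insert(0, block_id, data)

    def get(self, block_id):
        found, data = self._remove(block_id)
        if found:
            self._insert(0, block_id, data)  # assumption: reads promote to the top tier
        return data

    def tier_of(self, block_id):
        for name, _, blocks in self.tiers:
            if block_id in blocks:
                return name
        return None


store = TieredStore([("MEM", 2), ("SSD", 2), ("HDD", 2)])
for block in ("a", "b", "c"):
    store.put(block, block.upper())
tier_before = store.tier_of("a")  # "a" was pushed out of MEM by "c"
value = store.get("a")            # reading "a" promotes it and demotes "b"
```

In this toy model hot blocks stay in the fastest tier while colder blocks sink to SSD and then HDD, so a read rarely has to leave the local node.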
15
System data flow
16
Average processing time
17
Average processing message
18
Other benefits of Alluxio
Web UI and CLI: Alluxio's command-line tools and web UI facilitate validation and debugging during development, shortening the overall system development cycle.
Simple and easy-to-use API: Alluxio provides an easy-to-use API. Its native API is a set of file input and output interfaces similar to java.io, so developing with it does not involve a steep learning curve.
For example, early each morning we use Chronos to run the Alluxio loadufs command, which loads the data computed by MapReduce the day before into Alluxio so that subsequent operations can read these files directly.
19
Part 03 Qunar Hotel Data Use Alluxio to enable data sharing between Batch / Streaming
20
Spark/Zeppelin on Alluxio
Tool chain: We use Zeppelin as a tool for development, debugging, and analysis.
1 Computational framework interconnection: In addition to Spark, Flink and other computational frameworks can also use the computed data.
2 Reduce development costs: Code can be written and run directly against the results, and the results can be used directly in Spark code.
3 Cross-machine-room data synchronization: When writing is the bottleneck, asynchronous data synchronization can be used for acceleration.
4 Memory speed increase: Downstream applications take the same computed data directly from memory for machine learning.
Diagram labels: HMM, LR, SVM, CRT, EM
21
Unified Namespace
The unified namespace is transparent to upper-layer applications and computing frameworks.
HDFS and Alluxio's own storage space are managed in a unified way, which avoids complex input and output logic.
The Alluxio mount function manages the remote HDFS storage cluster.
At Qunar, we use the account name as the data directory on HDFS, we use swift to store the jar packages of Spark, Storm, and Flink programs, and for checkpoints we use checkpoint/appcode as the path.
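How a unified namespace can hide several storage systems behind one path tree can be sketched as a longest-prefix lookup in a mount table. This is a simplified illustration, not Alluxio's actual resolution code: the function `resolve`, the mount points, and every URI below (host names, bucket names) are hypothetical. Only the checkpoint/appcode path convention comes from the slide.

```python
def resolve(mounts: dict, path: str) -> str:
    """Map a namespace path to an under-storage URI via the longest matching mount point."""
    # Normalize so that "/" becomes "" and URIs carry no trailing slash.
    normalized = {mp.rstrip("/"): ufs.rstrip("/") for mp, ufs in mounts.items()}
    candidates = [
        mp for mp in normalized
        if path == mp or path.startswith(mp + "/")
    ]
    if not candidates:
        raise KeyError(f"no mount point covers {path}")
    best = max(candidates, key=len)  # the most specific mount wins
    return normalized[best] + path[len(best):]


# Hypothetical mount table: all URIs are placeholders for illustration.
MOUNTS = {
    "/": "hdfs://namenode.example:8020/alluxio",
    "/checkpoint": "hdfs://namenode.example:8020/checkpoint",
    "/jars": "swift://jars.example/apps",
}

checkpoint_uri = resolve(MOUNTS, "/checkpoint/appcode")
jar_uri = resolve(MOUNTS, "/jars/streaming-job.jar")
default_uri = resolve(MOUNTS, "/hotel/price/part-0000.orc")
```

Applications only ever see the left-hand paths, so the input and output logic stays the same regardless of which storage system actually holds the data.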
22
Unified Namespace
Diagram layers: Calculation framework, Unified Namespace, Storage framework
23
The benefits of data sharing
Spark MLlib: Some of the intermediate results can be shared between different Spark MLlib pipelines, greatly improving computational efficiency.
Spark SQL: Spark SQL can provide partial query results directly to downstream applications, improving efficiency.
24
Summary
Pricing system
Alluxio: data sharing, data synchronization
Spark: checkpoint, block
25
Q&A