Chapter 10: Stream-based Data Management
Title: Design, Implementation, and Evaluation of the Linear Road Benchmark on the Stream Processing Core
Authors: Navendu Jain, Lisa Amini, et al.

Design, Implementation, and Evaluation of the Linear Road Benchmark on the Stream Processing Core
Problem
–Problem Statement
–Why is this problem important?
–Why is this problem hard?
Approaches
–Approach description, key concepts
–Contributions (novelty, improved)
–Assumptions

Problem Statement
Given
–Stream data and continuous queries in large-scale distributed environments
–A streaming data application (Linear Road)
–Stream processing middleware (Stream Processing Core, SPC)
Find
–Performance bottlenecks of streaming data applications
Objectives
–Understand the performance characteristics of the streaming data application
Constraints
–SPC is constantly overloaded with respect to the available resources.
–Processing elements are a mix of I/O-bound and CPU-bound.
–Applications are memory-bound: it is unrealistic to store the full history of a stream in memory.

Why is this problem important?
High-volume, continuous data are ubiquitous.
–Text and transactional data
–Digital audio, video, and images
–Instant messages, network packet traces
–Sensor data
Stream processing applications have become important in the networking and database communities.

Why is this problem hard?
Stream data are
–Large in volume
–Produced at high data rates
–Generated by multiple distributed data sources
–Rapidly updated
Processing stream data requires
–Filtering
–Aggregation
–Correlation
A system supporting stream data processing applications must consider
–Scalability
–Latency
–Resource utilization

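The three processing operations listed above can be illustrated with a minimal Python sketch. The record fields (`sensor`, `temp`, `site`) and the function names are assumptions made for illustration; they are not taken from the paper or from SPC. The sketch also works on small finite lists, whereas a real stream system would bound its state with windows.

```python
from collections import defaultdict


def filter_stream(records, predicate):
    """Filtering: pass through only the records that satisfy the predicate."""
    for rec in records:
        if predicate(rec):
            yield rec


def aggregate_by_key(records, key, value):
    """Aggregation: mean of `value` per `key`, from a running count and sum."""
    totals = defaultdict(lambda: [0, 0.0])  # key -> [count, sum]
    for rec in records:
        entry = totals[rec[key]]
        entry[0] += 1
        entry[1] += rec[value]
    return {k: total / count for k, (count, total) in totals.items()}


def correlate(left, right, key):
    """Correlation: join two streams on a shared key (simple hash join)."""
    index = defaultdict(list)
    for rec in right:
        index[rec[key]].append(rec)
    for rec in left:
        for match in index.get(rec[key], ()):
            yield {**match, **rec}


# Hypothetical sample data: sensor readings and a sensor-to-site mapping.
readings = [
    {"sensor": "a", "temp": 20.0},
    {"sensor": "a", "temp": 30.0},
    {"sensor": "b", "temp": -999.0},  # invalid reading, to be filtered out
    {"sensor": "b", "temp": 10.0},
]
locations = [
    {"sensor": "a", "site": "north"},
    {"sensor": "b", "site": "south"},
]

valid = list(filter_stream(readings, lambda r: r["temp"] > -100))
means = aggregate_by_key(valid, "sensor", "temp")
joined = list(correlate(valid, locations, "sensor"))
```

Each step keeps some state (a running sum, a join index), which is where the scalability, latency, and resource concerns come from once the inputs are unbounded.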
Novelty of Contribution
Related Work
–DataCutter, StreamIt: Connections between applications are statically determined.
–TelegraphCQ, Aurora, Borealis, STREAM: Support stream data manipulation from a database-centric perspective, but process streams of tuples individually (i.e., small-scale).
–Benchmarks: Previous work on Linear Road did not report any performance numbers.
Contributions
–SPC supports dynamic application composition.
–Evaluate SPC using the Linear Road application under multiple distributed configurations.
–Highly scalable implementation of the Linear Road application
–Study the behavior of the streaming infrastructure's support for large-scale continuous and historical queries.
–Address performance bottlenecks through tuning.

SPC Architecture
Publish-subscribe model
–Each processing element (PE) that consumes and produces stream data specifies the characteristics of its streams.
–SPC dynamically determines the stream connections by matching stream descriptors as new applications and data sources join and leave the system.
Reusing streams
–Results in significant resource savings.
–Helps discover useful information over an ever-changing set of data sources.

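The descriptor-matching idea can be sketched in a few lines of Python. This is not SPC's API: the `StreamRegistry` class, its method names, and the attribute-equality matching rule are all hypothetical, chosen only to show how connections can be recomputed as streams and PEs come and go.

```python
class StreamRegistry:
    """Hypothetical registry that connects PEs to streams by descriptor."""

    def __init__(self):
        self.streams = {}        # stream name -> descriptor (attribute dict)
        self.subscriptions = {}  # PE name -> attributes the PE requires

    def publish(self, name, descriptor):
        """A data source or PE announces a stream and its characteristics."""
        self.streams[name] = descriptor

    def withdraw(self, name):
        """A stream leaves the system."""
        self.streams.pop(name, None)

    def subscribe(self, pe, wanted):
        """A PE states which stream characteristics it wants to consume."""
        self.subscriptions[pe] = wanted

    @staticmethod
    def _matches(descriptor, wanted):
        # Assumed rule: every requested attribute must be present and equal.
        return all(descriptor.get(k) == v for k, v in wanted.items())

    def connections(self):
        """Recompute all (stream, PE) connections from current descriptors."""
        return {
            (stream, pe)
            for stream, descriptor in self.streams.items()
            for pe, wanted in self.subscriptions.items()
            if self._matches(descriptor, wanted)
        }


registry = StreamRegistry()
registry.publish("cam1", {"type": "position", "region": "north"})
registry.subscribe("toll_pe", {"type": "position"})
first = registry.connections()

# A new source joins and a second PE reuses the existing cam1 stream.
registry.publish("cam2", {"type": "position", "region": "south"})
registry.subscribe("north_stats", {"type": "position", "region": "north"})
second = registry.connections()

# A source leaves; its connections disappear without touching the PEs.
registry.withdraw("cam1")
third = registry.connections()
```

Because PEs name characteristics rather than specific sources, `north_stats` picks up `cam1` without `cam1` being republished, which is the stream reuse the slide refers to.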
Performance Challenges and Optimizations in SPC
Challenges
–PEs perform either a small amount of processing on large volumes of data or a large amount of processing on lower volumes of data, so the workload is a mix of I/O-bound and CPU-bound.
–Stream history cannot be stored in memory, so PEs are also memory-bound.
Optimizations
–SDO filtering: SPC can filter out unwanted objects, saving resources.
–Events: PEs can subscribe to system events and adapt their algorithms in response.
–Dynamic copies of PEs

Linear Road Benchmark
–Simulates the traffic characteristics of a simple urban expressway system.
–Input to the benchmark arrives as stream data.
–Requires a stream-based data management system (SDMS) to process a set of continuous and historical queries.

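The difference between a continuous and a historical query can be sketched as follows. This is an illustration only, not the benchmark's specification: the `SegmentSpeedMonitor` class, its fields, the 60-second window, and the charge amounts are all assumptions, and reports are assumed to arrive in time order.

```python
from collections import deque


class SegmentSpeedMonitor:
    """Hypothetical sketch of one continuous and one historical query."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self.reports = deque()  # (time, segment, speed), oldest first
        self.charges = {}       # vehicle -> total amount charged so far

    def on_report(self, time, segment, speed):
        """Continuous query: re-evaluated on every incoming report."""
        self.reports.append((time, segment, speed))
        # Evict reports that have fallen out of the sliding window,
        # so memory use is bounded by the window, not the stream length.
        while self.reports and self.reports[0][0] <= time - self.window_seconds:
            self.reports.popleft()
        return self.average_speed(segment)

    def average_speed(self, segment):
        speeds = [s for _, seg, s in self.reports if seg == segment]
        return sum(speeds) / len(speeds) if speeds else None

    def record_charge(self, vehicle, amount):
        """Accumulate a compact summary instead of the full history."""
        self.charges[vehicle] = self.charges.get(vehicle, 0.0) + amount

    def balance(self, vehicle):
        """Historical query: answered on demand from accumulated state."""
        return self.charges.get(vehicle, 0.0)


monitor = SegmentSpeedMonitor(window_seconds=60)
a = monitor.on_report(time=0, segment=3, speed=50.0)
b = monitor.on_report(time=30, segment=3, speed=30.0)
c = monitor.on_report(time=70, segment=3, speed=20.0)  # report at time 0 expires

monitor.record_charge("v1", 2.0)
monitor.record_charge("v1", 1.5)
```

The continuous query produces a new answer with each input, while the historical query reads a summary kept precisely because the raw stream cannot be retained.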
Prototype Implementation
Design principles
–Modularity
–Data Aggregation
–Network and Data Locality
–Flexible Programming Environment
Linear Road in SPC
–The figure shows the query network infrastructure comprising 15 PEs.

Experiments
–Input data increases over time as a stress test
–Scalability

Experiments
–Analyzing Bottleneck PEs
–PE Placement Policy

Summary
Paper’s focus
–Understanding the performance characteristics of stream processing applications in a distributed setup
Ideas
–Design and implement the Linear Road benchmark on the SPC middleware.
–Identify the main performance bottlenecks in order to achieve scalability and low query response latency.
Contributions
–Demonstrate a scalable distributed implementation of Linear Road.
–Highlight the importance of addressing performance bottlenecks.
Analytical Validation
–Experiments
–Prototyping

Assumptions, Rewrite today
Assumptions
–The evaluation is restricted to SPC's support for the Linear Road application, assuming that the design decisions and performance results are applicable to other streaming applications.
–The system is constantly overloaded with respect to the available resources.
–PEs are I/O-, CPU-, and memory-bound.
Rewrite today
–Apply the ideas to other types of streaming applications.
–Run more extensive experiments on performance tuning.