TelegraphCQ: Continuous Dataflow Processing for an Uncertain World
  Sirish Chandrasekaran, Owen Cooper, Amol Deshpande, Michael J. Franklin, Joseph M. Hellerstein, Wei Hong*, Sailesh Krishnamurthy, Sam Madden, Vijayshankar Raman**, Fred Reiss, and Mehul Shah
  University of California, Berkeley
  *Intel Berkeley Laboratory
  **IBM Almaden Research Center
  http://telegraph.cs.berkeley.edu/
Contents
  Background and Motivation
  Telegraph – Architecture
  Window Semantics in TelegraphCQ
  TelegraphCQ – Design Overview
  TelegraphCQ – Architecture
  Conclusion
  All diagrams and contents are directly adapted/taken from the paper itself!
  5/7/2019
TelegraphCQ – Background and Motivation
  Adaptive dataflow architecture: systems that adjust their processing on the fly in response to
    Changes in user needs [HACO+99]
    Intermittent delays in accessing data across WANs [UFA98]
  Shared processing
    CACQ [MSHR02]
    PSoup [CF02]
  Limitations
    Processing restricted to in-memory data
    No scheduling or resource management for queries with little or no overlap
    No Quality of Service (QoS) support for adapting to resource limitations
    No means of trading flexibility against overhead
Telegraph – Architecture
  Extensible set of composable dataflow modules (operators)
  Producer-consumer design using the Fjords API
    Push as well as pull queues
  Module types
    Ingress and Caching
    Query Processing
    Adaptive Routing
Adaptive Processing – Eddies & SteMs
  Eddy: continuously routes tuples among operators according to a routing policy
    Routing is decided per tuple, which requires routing state to be associated with each tuple
  SteM: temporary repository of tuples
    Stores homogeneous tuples
    Supports build (insert), probe (search), and eviction (deletion) operations
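The two building blocks above can be sketched roughly as follows. This is an illustrative sketch, not TelegraphCQ code: the class names `SteM` and `Eddy`, the dictionary-based tuples, and the routing policy (try the operator with the lowest observed pass rate first) are all assumptions made for the example.

```python
from collections import defaultdict


class SteM:
    """Temporary store of homogeneous tuples, indexed on one key (hypothetical)."""

    def __init__(self, key):
        self.key = key
        self.index = defaultdict(list)

    def build(self, tup):
        """Insert a tuple."""
        self.index[tup[self.key]].append(tup)

    def probe(self, value):
        """Search: return the stored tuples whose key equals value."""
        return list(self.index.get(value, []))

    def evict(self, predicate):
        """Delete every stored tuple for which predicate(tuple) is true."""
        for k in list(self.index):
            kept = [t for t in self.index[k] if not predicate(t)]
            if kept:
                self.index[k] = kept
            else:
                del self.index[k]


class Eddy:
    """Routes each tuple through all operators, choosing the order per tuple."""

    def __init__(self, operators):
        self.operators = operators                  # name -> predicate function
        self.seen = {name: 0 for name in operators}
        self.passed = {name: 0 for name in operators}

    def _pass_rate(self, name):
        # Assumed policy: operators that drop more tuples are tried first.
        return self.passed[name] / self.seen[name] if self.seen[name] else 0.0

    def route(self, tup):
        done = set()                                # per-tuple routing state
        while len(done) < len(self.operators):
            pending = [n for n in self.operators if n not in done]
            name = min(pending, key=self._pass_rate)
            self.seen[name] += 1
            if not self.operators[name](tup):
                return False                        # tuple dropped
            self.passed[name] += 1
            done.add(name)
        return True                                 # tuple passed every operator


stem = SteM("sym")
stem.build({"sym": "MSFT", "price": 52})
stem.build({"sym": "IBM", "price": 80})
eddy = Eddy({
    "is_msft": lambda t: t["sym"] == "MSFT",
    "above_50": lambda t: t["price"] > 50,
})
matches = [t for t in stem.probe("MSFT") if eddy.route(t)]
```

The `done` set stands in for the per-tuple state mentioned above: because each tuple may visit the operators in a different order, the eddy has to remember which operators a given tuple has already passed.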
Fjords – Inter-Module Communication
  Allow a mixture of push and pull connections between modules
    A pull-queue is implemented using a blocking dequeue on the consumer side and a blocking enqueue on the producer side
    A push-queue is implemented using non-blocking enqueue and dequeue; control is returned to the consumer when the queue is empty
  Queries can execute over any combination of streaming and static data sources
Flux – Scaling Up Dataflow Processing
  Fault-tolerant, Load-balancing eXchange
  Interposed between a producer-consumer operator pair in a pipelined, partitioned dataflow
  Load balancing via online repartitioning of the input stream and the corresponding operator state
  Fault tolerance by leveraging these state-movement mechanisms to replicate an operator's internal state and in-flight data
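The difference between the Fjords pull-queue and push-queue described above can be sketched with Python's standard `queue` module. The class names, the `capacity` parameter, and the choice to signal "nothing available" with `None` or `False` are assumptions for this sketch, not the Fjords API.

```python
import queue


class PullQueue:
    """Pull connection: blocking enqueue (producer) and blocking dequeue (consumer)."""

    def __init__(self, capacity=16):
        self._q = queue.Queue(maxsize=capacity)

    def enqueue(self, item):
        self._q.put(item)            # blocks while the queue is full

    def dequeue(self):
        return self._q.get()         # blocks while the queue is empty


class PushQueue:
    """Push connection: neither enqueue nor dequeue blocks."""

    def __init__(self, capacity=16):
        self._q = queue.Queue(maxsize=capacity)

    def enqueue(self, item):
        try:
            self._q.put_nowait(item)
            return True
        except queue.Full:
            return False             # assumed: the producer decides what to do next

    def dequeue(self):
        try:
            return self._q.get_nowait()
        except queue.Empty:
            return None              # control returns to the consumer immediately


push = PushQueue(capacity=1)
first_try = push.dequeue()           # empty queue: returns at once instead of waiting
push.enqueue("tuple-1")
```

A consumer reading from a `PushQueue` can go on to do other work when nothing has arrived, which is what lets one operator serve both streaming inputs (push) and static, scan-style inputs (pull). Using `None` as the empty marker assumes `None` is never a legitimate item.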
Initial CQ Approaches
  CACQ
    First CQ engine to exploit the adaptive query processing framework
    Modifies Eddies to execute multiple queries as a single “super”-query, the disjunction of all the queries
    Tuple lineage: state carried with each tuple to determine which clients it must be delivered to
    Grouped filters: an index over single-variable Boolean factors on the same attribute, used to optimize selections in the shared execution
  PSoup
    Extends CACQ
    Allows queries to access historical data by treating data and queries symmetrically
    Adds support for disconnected operation: users can register queries and return intermittently to retrieve the latest answers
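To make grouped filters and tuple lineage more concrete, here is a minimal sketch. It assumes every registered predicate has the form `attr > constant` and represents lineage as a plain set of query ids; the names `GroupedFilter` and `lineage_for` are hypothetical, and CACQ's actual data structures differ.

```python
import bisect


class GroupedFilter:
    """Index over predicates `attr > constant`, one per registered query."""

    def __init__(self, attr):
        self.attr = attr
        self._constants = []     # thresholds, kept sorted
        self._query_ids = []     # query id aligned with each threshold

    def add(self, query_id, constant):
        pos = bisect.bisect_left(self._constants, constant)
        self._constants.insert(pos, constant)
        self._query_ids.insert(pos, query_id)

    def query_ids(self):
        return set(self._query_ids)

    def matching_queries(self, tup):
        """Return the ids of all queries whose predicate this tuple satisfies."""
        # Every threshold strictly below the value satisfies `value > constant`.
        pos = bisect.bisect_left(self._constants, tup[self.attr])
        return set(self._query_ids[:pos])


def lineage_for(tup, grouped_filters, all_queries):
    """Return the set of queries the tuple should still be delivered to."""
    lineage = set(all_queries)
    for gf in grouped_filters:
        # Remove queries that constrain this attribute but reject the tuple.
        lineage -= gf.query_ids() - gf.matching_queries(tup)
    return lineage


price_filter = GroupedFilter("price")
price_filter.add("q1", 50)
price_filter.add("q2", 60)
price_filter.add("q3", 40)
targets = lineage_for({"sym": "MSFT", "price": 55}, [price_filter],
                      {"q1", "q2", "q3", "q4"})
```

One binary search evaluates the predicates of all three queries at once, which is the point of grouping filters over the same attribute. Query `q4` has no predicate on `price`, so it stays in the lineage.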
Window Semantics in TelegraphCQ
  Rich windowing schemes over both already-arrived and incoming data
  Window semantics:
    Snapshot query: executes exactly once over one window
      e.g. “Select the closing prices for MSFT on the first five days of trading”
    Landmark query: fixed beginning point and a forward-moving endpoint
      e.g. “Select all the days after the hundredth trading day, on which the closing price of MSFT has been greater than $50. Keep this query standing in the system for a thousand trading days”
    Sliding query: forward-moving beginning and end
      e.g. “On every fifth trading day starting today, calculate the average closing price of MSFT for the five most recent trading days. Keep the query standing for fifty trading days”
    Temporal band-join: joins tuples in one stream with those in another based on timestamp
      e.g. “For the five most recent trading days starting today, select all stocks that closed higher than MSFT on a given day. Keep the query standing for twenty trading days”
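The snapshot, landmark, and sliding windows above differ only in how the window's left and right edges move over time. The sketch below generates the (left, right) bounds for each case. It is not TelegraphCQ query syntax: the function names, the use of integer trading-day numbers as timestamps, and the inclusive bounds are assumptions for illustration.

```python
def snapshot(left, right):
    """One window, evaluated exactly once."""
    yield (left, right)


def landmark(start, stop):
    """Fixed left edge; the right edge advances one day at a time."""
    for t in range(start, stop + 1):
        yield (start, t)


def sliding(start, lifetime, width, hop):
    """Both edges advance: a window of `width` days, re-evaluated every `hop` days."""
    for t in range(start, start + lifetime, hop):
        yield (t - width + 1, t)


def window_average(prices, window):
    """Average of the prices whose day falls inside the inclusive window."""
    left, right = window
    values = [p for day, p in prices.items() if left <= day <= right]
    return sum(values) / len(values) if values else None


# Made-up closing prices keyed by trading-day number, for illustration only.
prices = {day: float(day) for day in range(1, 21)}

first_five_days = list(snapshot(1, 5))
# "Every fifth day, average the five most recent days, for fifty days",
# assuming today is trading day 10.
sliding_windows = list(sliding(start=10, lifetime=50, width=5, hop=5))
first_average = window_average(prices, sliding_windows[0])
```

A temporal band-join would apply the same kind of bounds to two streams at once and join the tuples whose timestamps fall in the paired windows.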
TelegraphCQ – Design Overview
  Adapted the architecture of PostgreSQL
  Implemented the new system in C/C++ to leverage the open-source PostgreSQL code base
  Reused PostgreSQL components with varying degrees of modification
TelegraphCQ – Architecture
  The TelegraphCQ server comprises three processes:
    FrontEnd
    Wrapper
      Provides an abstraction of external sources
      Runs as a separate process (non-blocking)
    Executor
      Execution Object: provides the execution context for multiple queries
      Dispatch Unit: performs the actual work
TelegraphCQ: Rebuttal
  Query grouping and sharing
    Degree of overlap
    No prioritizing of queries
  Adaptivity schemes
    “Per tuple”, “per operator”, or batch
    No experimental evaluation of the efficiency of the schemes
TelegraphCQ: Rebuttal
  Ingress module: allows input from various sources. Does that reduce efficiency?
  No Egress module
  No support for value-based windows
TelegraphCQ: Rebuttal
  No special arrangements for supporting ad-hoc queries, as in Aurora
  No support for distributed operation (proposed later)
  No support for crash recovery, or for imprecise or missing data
Conclusion
  TelegraphCQ provides an adaptive dataflow and shared processing architecture
  Eddies and SteMs form the building blocks for adaptive processing
  Features include
    Fjords inter-module communication (push and pull connections)
    Flux: Fault-tolerant, Load-balancing eXchange
    CACQ (tuple lineage and grouped filters)
    PSoup (symmetric treatment of data and queries)
  Built on the PostgreSQL code base
  The rebuttal compared TelegraphCQ with other stream engines and with the concept of relational databases
  Thank you