THEMIS: Fairness in Data Stream Processing under Overload
Model-driven Algorithms and Architectures for Self-Aware Computing Systems, Dagstuhl 2015

Evangelia Kalyvianaki, City University London, UK
Marco Fiscato, Imperial College London, UK
Theodoros Salonidis, IBM Research, USA
Peter Pietzuch, Imperial College London, UK
The Puzzle of Big Data Real-Time Processing Engines in Data Centres
Queries overload data centre resources.
How to efficiently allocate resources across clusters/engines?
Data Shedding
A well-known technique to handle transient overload conditions is to discard data.
How to measure shedding across queries?
How much data should we shed from queries?
How to implement shedding in this distributed setup?
How to measure shedding across queries?
Shedding data leads to reduced correctness and degraded performance.
Different dropped data leads to different degrees of degradation.
The Source Information Content (SIC) metric measures the contribution of data from sources to results.
Example: 11/6 (degraded processing) < 3 (perfect processing).
SIC is a data-stream-processing-aware metric. But can we have a metric that is operator- or query-aware?
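To make the SIC comparison concrete, here is a minimal Python sketch. It rests on assumptions that are not on the slide: each source carries a total SIC of 1 per window, split equally across its tuples, and the SIC of a result is the sum over the source tuples that survived shedding. The tuple counts and the names `source_tuple_sic` and `result_sic` are hypothetical, chosen only because they happen to reproduce the 11/6 and 3 from the example.

```python
from fractions import Fraction


def source_tuple_sic(tuples_per_source):
    # Assumption: every source is worth SIC 1 per window,
    # divided equally among the tuples it emitted.
    return {src: Fraction(1, n) for src, n in tuples_per_source.items()}


def result_sic(tuples_per_source, dropped_per_source):
    # Sum the SIC of all source tuples that were not shed.
    per_tuple = source_tuple_sic(tuples_per_source)
    total = Fraction(0)
    for src, emitted_count in tuples_per_source.items():
        kept = emitted_count - dropped_per_source.get(src, 0)
        total += kept * per_tuple[src]
    return total


# Hypothetical window: three sources emitting 1, 2 and 3 tuples.
emitted = {"s1": 1, "s2": 2, "s3": 3}
perfect = result_sic(emitted, {})                     # nothing shed
degraded = result_sic(emitted, {"s2": 1, "s3": 2})    # three tuples shed
print(perfect, degraded)  # 3 11/6
```

Under these assumptions, dropping one tuple from a sparse source costs more SIC than dropping one from a dense source, which is why different dropped data gives different degrees of degradation.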
Fair Shedding for Equalising SIC values
Each local shedder equalises the SIC values of its own queries.
Global coordination is achieved with local informed shedding.
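One way a local shedder could equalise SIC values is sketched below in Python. This is a greedy illustration under stated assumptions, not the THEMIS implementation: the function name `fair_shed`, the per-tuple SIC inputs, and the fixed per-interval tuple capacity are all hypothetical. The shedder repeatedly admits one tuple for whichever query currently has the lowest admitted SIC and sheds whatever does not fit.

```python
import heapq
from fractions import Fraction


def fair_shed(pending, capacity):
    """Greedy sketch of one local shedder (illustrative only).

    pending:  query name -> list of per-tuple SIC values queued at this node
    capacity: number of tuples the node can process in this interval
    Returns query name -> total SIC admitted; all other tuples are shed.
    """
    queues = {q: sorted(vals) for q, vals in pending.items()}  # largest SIC last
    admitted = {q: Fraction(0) for q in pending}
    # Min-heap keyed by admitted SIC: the worst-off query is served first.
    heap = [(Fraction(0), q) for q in pending if queues[q]]
    heapq.heapify(heap)
    while capacity > 0 and heap:
        sic, q = heapq.heappop(heap)
        admitted[q] = sic + queues[q].pop()  # admit that query's most valuable tuple
        capacity -= 1
        if queues[q]:
            heapq.heappush(heap, (admitted[q], q))
    return admitted


# Hypothetical load: q1 has four tuples of SIC 1/4, q2 has two of SIC 1/2.
shares = fair_shed(
    {"q1": [Fraction(1, 4)] * 4, "q2": [Fraction(1, 2)] * 2},
    capacity=3,
)
print(shares)
```

With room for only three of the six tuples, the sketch admits two tuples for q1 and one for q2, so both queries end at an SIC of 1/2 even though they lose different numbers of tuples.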
SIC Fair Shedder
To address nodes’ heterogeneity and workload variations, an online cost model estimates the time to process an average tuple.
Could we build the system to be goal-aware?
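The slide does not say how the online cost model is built. As a minimal sketch, assuming an exponentially weighted moving average over observed batch timings, it could look like the Python below. The class name `TupleCostModel`, the smoothing factor `alpha`, and the millisecond units are illustrative assumptions.

```python
class TupleCostModel:
    """Online estimate of the time to process an average tuple (sketch)."""

    def __init__(self, alpha=0.5):
        self.alpha = alpha    # weight of the newest sample (assumed value)
        self.avg_ms = None    # current estimate, milliseconds per tuple

    def observe(self, batch_ms, batch_tuples):
        # Fold one measured batch into the running estimate.
        if batch_tuples <= 0:
            return self.avg_ms
        sample = batch_ms / batch_tuples
        if self.avg_ms is None:
            self.avg_ms = sample
        else:
            self.avg_ms = self.alpha * sample + (1 - self.alpha) * self.avg_ms
        return self.avg_ms

    def capacity(self, interval_ms):
        # Number of tuples expected to fit in the next shedding interval.
        if not self.avg_ms:
            return 0
        return int(interval_ms / self.avg_ms)


model = TupleCostModel(alpha=0.5)
model.observe(batch_ms=256, batch_tuples=1024)   # 0.25 ms per tuple
model.observe(batch_ms=512, batch_tuples=1024)   # node slows to 0.5 ms per tuple
print(model.avg_ms, model.capacity(interval_ms=96))  # 0.375 256
```

Because each node measures its own batches, a slower node or a heavier workload lowers that node's capacity estimate, which in turn tells its shedder how many tuples to keep.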
A self-aware autonomic system for data processing in real-time
Systems already have (some) adaptation and (some) self-awareness, but could we extend this to (full) self-awareness?
For example, can we build a self-aware system to perform fair data shedding for data stream processing, databases, and filesystems under overload?
Thank you! Questions?