IncApprox: The marriage of incremental and approximate computing
Pramod Bhatotia, Dhanya Krishnan, Do Le Quoc, Christof Fetzer, Rodrigo Rodrigues* (TU Dresden & *IST Lisbon)
Data analytics systems turn raw data into information.
Big data systems must handle massive scale while providing low latency and high throughput.
To strike a balance
There is a tension between low latency and high throughput; "novel" computing paradigms aim to strike a balance between the two.
How do these computing paradigms make this trade-off?
Observation: compute over a subset of data items instead of the entire dataset, taking less time and fewer resources.
Two such computing paradigms: incremental computing (Inc) and approximate computing (Approx).
Incremental computation
Common workflow: rerun the same application over evolving input, so a small changed input yields an incrementally updated output.
Incremental updates: reuse memoized parts of the computation that are unaffected by the changed input, as sketched below.
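A minimal sketch of this reuse pattern in Python (an illustration only, not the authors' implementation; the per-chunk word-count job and the memo table are hypothetical stand-ins):

```python
# Minimal sketch of incremental computation via memoization (illustrative only).
# A word-count job is split into per-chunk sub-computations; results for chunks
# that did not change are reused from a memo table instead of being recomputed.
from collections import Counter

memo = {}  # (chunk-id, chunk content) -> memoized per-chunk counts

def count_chunk(chunk_id, words):
    """Recompute a chunk only if its content changed since the last run."""
    key = (chunk_id, tuple(words))
    if key not in memo:
        memo[key] = Counter(words)          # compute and memoize
    return memo[key]

def word_count(chunks):
    """Combine per-chunk results; unchanged chunks hit the memo table."""
    total = Counter()
    for chunk_id, words in chunks.items():
        total += count_chunk(chunk_id, words)
    return total

# First run over the full input, second run after a small change to one chunk:
run1 = word_count({0: ["a", "b"], 1: ["b", "c"], 2: ["c", "c"]})
run2 = word_count({0: ["a", "b"], 1: ["b", "c"], 2: ["c", "d"]})  # only chunk 2 changed
print(run1, run2)
```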
Approximate computation
Common use case: an approximate output is good enough.
Approximate output: compute only over the parts of the input selected by representative sampling.
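As a rough illustration of the idea (a hypothetical sum query, not the paper's mechanism), an aggregate can be estimated from a uniform sample and scaled up by the inverse sampling fraction:

```python
# Illustrative sketch: approximate a sum by computing over a random sample
# of the input and scaling the result by the inverse sampling fraction.
import random

def approximate_sum(items, fraction, seed=42):
    random.seed(seed)
    sample = [x for x in items if random.random() < fraction]
    if not sample:
        return 0.0
    return sum(sample) * (len(items) / len(sample))  # scale up to the full input

data = list(range(1, 100_001))
print("exact:", sum(data), "approx:", approximate_sum(data, fraction=0.1))
```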
Basic idea
Both paradigms compute over a subset of the data items: in incremental computation, the items affected by the changed input; in approximate computation, the items selected by input sampling.
IncApprox combines the two with biased sampling: select input items for which we already have memoized results from previous runs (a sketch follows).
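A minimal sketch of biased sampling under this reading (function and variable names are hypothetical; the paper's algorithm additionally respects stratification):

```python
# Illustrative sketch of biased sampling: when drawing a sample of size k,
# prefer items whose results were memoized in previous runs, then fill the
# remainder with new items.
import random

def biased_sample(items, memoized_keys, k, seed=7):
    random.seed(seed)
    old = [x for x in items if x in memoized_keys]       # results can be reused
    new = [x for x in items if x not in memoized_keys]   # must be computed fresh
    take_old = min(k, len(old))
    sample = random.sample(old, take_old)
    sample += random.sample(new, min(k - take_old, len(new)))
    return sample

items = list(range(20))
memoized = set(range(10))          # items 0..9 were processed in the last window
print(biased_sample(items, memoized, k=8))
```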
Outline: Motivation, Design, Evaluation
Overview of IncApprox
Given an input data stream, a streaming query, and a query budget (latency or resource constraints), IncApprox combines incremental and approximate computing to produce an approximate output.
The query budget provides an adaptive execution interface to systematically tune between latency and throughput.
Computation model: "batched stream processing"
For each sliding computation window over the input data stream, run a data-parallel (map/reduce) job that produces the output.
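A toy, non-distributed sketch of this model (the windowing parameters and the word-count job are stand-ins; the actual system batches windows in Spark Streaming):

```python
# Toy sketch of batched stream processing: slide a window over the stream and
# run a small map/reduce-style job (here, word count) on each window.
from collections import Counter

def sliding_windows(stream, window_size, slide):
    for start in range(0, max(len(stream) - window_size, 0) + 1, slide):
        yield stream[start:start + window_size]

stream = ["a", "b", "a", "c", "b", "a", "d", "a"]
for window in sliding_windows(stream, window_size=4, slide=2):
    mapped = [(w, 1) for w in window]          # "map" phase
    reduced = Counter()                        # "reduce" phase
    for key, value in mapped:
        reduced[key] += value
    print(window, dict(reduced))
```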
High-level approach
For each computation input window: step #1, stratified sampling; step #2, biased sampling; step #3, run the job incrementally to produce the approximate output.
Step #1: Stratified sampling
#1: Why stratified sampling?
The input stream reaches the stream processing system through a stream aggregator (Kafka) as sub-streams S1, S2, …, Sn.
Sub-streams carry disparate events with different distributions and different arrival rates, so we need a proportional allocation of data items across all sub-streams.
#1: Stratified sampling in IncApprox
Given the sub-streams S1, S2, …, Sn from the stream aggregator (Kafka) and a sample size derived from the query budget, IncApprox performs stratified reservoir sampling over the computation window of the input stream (see the paper for details; a sketch follows).
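A minimal sketch of stratified reservoir sampling, assuming one reservoir per sub-stream and a sample budget allocated in proportion to each sub-stream's observed arrival rate (the paper's algorithm additionally adapts this to the query budget):

```python
# Minimal sketch of stratified reservoir sampling: keep one reservoir per
# sub-stream so every stratum is represented, allocating the total sample
# size proportionally to how many items each sub-stream has contributed.
import random
from collections import defaultdict

def stratified_reservoir(stream, total_sample_size, seed=1):
    """stream is an iterable of (substream_id, item) pairs."""
    random.seed(seed)
    reservoirs = defaultdict(list)
    seen = defaultdict(int)
    for sid, item in stream:
        seen[sid] += 1
        total_seen = sum(seen.values())
        # Proportional allocation: each stratum's capacity tracks its share.
        cap = max(1, total_sample_size * seen[sid] // total_seen)
        if len(reservoirs[sid]) < cap:
            reservoirs[sid].append(item)
        else:
            # Classic reservoir replacement within the stratum (Algorithm R).
            j = random.randrange(seen[sid])
            if j < cap:
                reservoirs[sid][j] = item
    return reservoirs

window = [("S1", i) for i in range(1000)] + [("S2", i) for i in range(100)]
random.shuffle(window)
print({sid: len(r) for sid, r in stratified_reservoir(window, 60).items()})
```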
Step #2: Biased sampling
#2: Why biased sampling?
Successive computation windows over the input data stream (e.g., the windows at T1 and T2) overlap, which provides an opportunity to reuse results.
#2: Biased sampling in IncApprox
Given overlapping windows (T1, T2) with fluctuating arrival rates, IncApprox adapts the budget / sample size and performs biased sampling (see the paper for details).
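One way to read the "adaptive budget" box is as a feedback loop that grows or shrinks the sample size so each window finishes within the latency budget. The controller below is a hypothetical sketch, not the paper's algorithm:

```python
# Hypothetical sketch of adapting the sample size to a latency budget:
# after each window, grow the sample if we finished early and shrink it
# if we overshot, so fluctuating arrival rates stay within the budget.

def adapt_sample_size(current_size, observed_latency, budget_latency,
                      min_size=100, max_size=1_000_000):
    ratio = budget_latency / max(observed_latency, 1e-9)
    new_size = int(current_size * ratio)            # proportional adjustment
    return max(min_size, min(max_size, new_size))

size = 10_000
for observed in [1.2, 0.6, 0.9, 1.5]:               # seconds taken per window
    size = adapt_sample_size(size, observed, budget_latency=1.0)
    print("next window sample size:", size)
```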
Step #3: Run the job incrementally
#3: Why an incremental run?
The computation window contains both old and new data items. Results could be reused by designing and implementing "dynamic algorithms" by hand, but we need an automatic and efficient mechanism to incrementally update the output.
#3: Incremental run in IncApprox
IncApprox uses self-adjusting computation (see the paper for details): the map and reduce tasks of a window form a dependence graph, and when a data item changes, change propagation re-executes only the affected map and reduce tasks.
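A toy sketch of change propagation over such a dependence graph (hypothetical structure; the actual system tracks task-level dependencies inside Spark Streaming): only the map task whose input partition changed, and the reducers for the keys it emits, are re-executed.

```python
# Toy sketch of change propagation over a map/reduce dependence graph:
# memoize per-partition map outputs and per-key reduce outputs, and on a
# change re-execute only the affected map task and the reducers for the
# keys that task touches.
from collections import Counter

map_memo = {}     # partition-id -> Counter of (word -> count)
reduce_memo = {}  # word -> total count

def run(partitions, changed):
    """partitions: {pid: [words]}; changed: set of pids whose input changed."""
    dirty_keys = set()
    for pid in changed:
        old = map_memo.get(pid, Counter())
        new = Counter(partitions[pid])              # re-run only this map task
        dirty_keys |= set(old) | set(new)           # keys whose reducers must rerun
        map_memo[pid] = new
    for key in dirty_keys:                          # change propagation to reducers
        reduce_memo[key] = sum(c[key] for c in map_memo.values())
    return dict(reduce_memo)

data = {0: ["a", "b"], 1: ["b", "c"]}
print(run(data, changed={0, 1}))                    # initial run: everything changed
data[1] = ["b", "d"]
print(run(data, changed={1}))                       # only partition 1 re-executed
```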
Outline: Motivation, Design, Evaluation
Evaluation
Case studies: 1. Twitter stream analytics, 2. network monitoring.
Implementation: Apache Spark Streaming. Platform: a 24-node distributed computing cluster.
See the paper for more results!
Performance gains (higher is better)
2X over native Spark Streaming, and 1.4X over the individual incremental and approximate computing modules.
Summary: IncApprox
A data analytics system for incremental approximate computing.
Transparent: targets existing applications without any code changes.
Practical: supports adaptive execution based on the query budget.
Efficient: employs a mix of incremental and approximate computing paradigms.
IncApprox: Transparent + Practical + Efficient
IncApprox also provides error estimation: approximate output = output ± error-estimate (see the paper for details; a sketch of one such estimate is below).
Thank you!
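For a mean estimated from a random sample, a standard-error-based confidence interval is one textbook way to produce an "output ± error-estimate" pair; the paper derives its bounds for stratified samples, so this is only an assumed illustration:

```python
# Textbook sketch of error estimation for a sample mean: report the estimate
# together with an approximate 95% confidence interval based on the sample
# standard error.
import math, random, statistics

def estimate_with_error(items, sample_size, seed=3):
    random.seed(seed)
    sample = random.sample(items, sample_size)
    mean = statistics.fmean(sample)
    stderr = statistics.stdev(sample) / math.sqrt(sample_size)
    return mean, 1.96 * stderr        # approx. 95% confidence half-width

data = [random.gauss(50, 10) for _ in range(100_000)]
estimate, error = estimate_with_error(data, sample_size=1_000)
print(f"approximate output = {estimate:.2f} ± {error:.2f}")
```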