Published by Britney Tate. Modified over 9 years ago.
1
Massive Scale-out of Expensive Continuous Queries
Erik Zeitler and Tore Risch
Uppsala Database Laboratory, Uppsala University
2
31 Aug 2011, Erik Zeitler and Tore Risch
Outline
1. Introduction
2. Stream splitting strategies for scale-out
3. Evaluating stream splitting strategies
4. Cost model and heuristic
5. Energy efficiency
6. Related work
7. Conclusions and future work
3
CQ: Continuous Queries (filters and transformations)
DSMS: Data Stream Management System
SCSQ: Super Computer Stream Query processor, a Data Stream Management System
[Diagram: a bit stream (11001011 01001011) flowing into the DSMS]
4
Research Questions
How to ensure scalable CQ execution
  with growing input stream rate?
  with high CQ execution cost?
By scale-out.
CQs are scaled out by splitting the input stream.
  Applications require customizable input stream splitting, called splitstream.
  Both tuple routing and broadcast are allowed.
[Diagram: splitstream feeding parallel CQs, followed by merge]
5
Research Questions
How to ensure scalable CQ execution
  with growing input stream rate?
  with high CQ execution cost?
By scale-out.
CQs are scaled out by splitting the input stream.
  Applications require customizable input stream splitting, called splitstream.
  Both tuple routing and broadcast are allowed.
How to split massive streams over massively parallel CQs?
By parallelization of splitstream.
[Diagram: split, parallel splitstream, parallel CQs, merge]
6
Outline
1. Introduction
2. Stream splitting strategies for scale-out
3. Scale-up of stream splitting strategies
4. Cost model and heuristic
5. Energy efficiency
6. Related work
7. Conclusions and future work
7
Defining stream splitting
splitstream(stream s, integer q, function rfn, function bfn) -> vector of stream sv
The user defines rfn and bfn:
  rfn(object tpl, integer q) -> integer
    rfnLRB(event e, integer q) -> integer as select expressway(e) where eventtype(e) = 0;
  bfn(object tpl) -> boolean
    bfnLRB(event e) -> boolean as select eventtype(e) = 2;
rfn and bfn for streams are analogous to fragmentation and replication conditions in a distributed DBMS (DDBMS).
Unlike in a DDBMS, execution of rfn and bfn is parallelized.
[Diagram: splitstream takes stream s and produces q output streams sv]
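The semantics of splitstream described on this slide can be illustrated with a minimal single-process Python sketch. It is not SCSQ code: streams are modelled as Python lists, events as dicts with hypothetical "type" and "xway" keys, and rfn is assumed to return an index in the range 0..q-1 or None when the tuple should be omitted. Testing bfn before rfn is also an assumption; the slide does not state an evaluation order.

```python
from typing import Any, Callable, Iterable, List, Optional


def splitstream(
    s: Iterable[Any],
    q: int,
    rfn: Callable[[Any, int], Optional[int]],
    bfn: Callable[[Any], bool],
) -> List[List[Any]]:
    """Split stream s into q output streams using routing and broadcast functions."""
    sv: List[List[Any]] = [[] for _ in range(q)]
    for tpl in s:
        if bfn(tpl):
            # Broadcast: every output stream receives the tuple.
            for out in sv:
                out.append(tpl)
            continue
        i = rfn(tpl, q)
        if i is not None:
            # Route: exactly one output stream receives the tuple.
            sv[i].append(tpl)
        # Otherwise the tuple is omitted (neither routed nor broadcast).
    return sv


# Analogues of rfnLRB and bfnLRB from the slide (event layout is hypothetical).
def rfn_lrb(e: dict, q: int) -> Optional[int]:
    return e["xway"] if e["type"] == 0 else None


def bfn_lrb(e: dict) -> bool:
    return e["type"] == 2
```

With q = 2, an event of type 0 on expressway 1 lands only in output stream 1, an event of type 2 lands in both output streams, and any other event type is dropped.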
8
Naïve (flat) splitstream implementation: fsplit
fsplit(stream s, integer q, function rfn, function bfn) -> vector of stream sv
Expensive stream splitting computations in a single process: bottleneck!
9
Tree shaped splitstream implementation: maxtree
maxtree(stream s, integer q, function rfn, function bfn) -> vector of stream sv
The bottleneck is alleviated [Zeitler and Risch, DASFAA 2010] but still problematic.
10
Scaled-out splitstream: parasplit
parasplit(stream s, integer q, function rfn, function bfn) -> vector of stream sv
Window router: distributes entire windows
Window splitter
Stream merge
11
Parasplit: route – //fsplit – //(merge – CQ)
parasplit(stream s, integer q, function rfn, function bfn) -> vector of stream sv
Window router: distributes entire windows
Window splitter
Stream merge
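The route, parallel fsplit, merge pipeline can be sketched in Python as follows. This is a sequential illustration under stated assumptions, not the SCSQ implementation: windows hold a fixed number of tuples w (the slides measure window size in bytes), the window router hands windows to the p splitters round-robin, and each merge restores order by window number. The parameters p and w and the helper windows are hypothetical names introduced for this sketch.

```python
from typing import Any, Callable, Iterable, Iterator, List, Optional, Tuple

Rfn = Callable[[Any, int], Optional[int]]
Bfn = Callable[[Any], bool]


def fsplit(s: Iterable[Any], q: int, rfn: Rfn, bfn: Bfn) -> List[List[Any]]:
    """Flat split: one process evaluates rfn and bfn for every tuple."""
    sv: List[List[Any]] = [[] for _ in range(q)]
    for tpl in s:
        if bfn(tpl):
            for out in sv:
                out.append(tpl)
        else:
            i = rfn(tpl, q)
            if i is not None:
                sv[i].append(tpl)
    return sv


def windows(s: Iterable[Any], w: int) -> Iterator[List[Any]]:
    """Cut the stream into physical windows of w tuples (the last may be shorter)."""
    buf: List[Any] = []
    for tpl in s:
        buf.append(tpl)
        if len(buf) == w:
            yield buf
            buf = []
    if buf:
        yield buf


def parasplit(s: Iterable[Any], q: int, rfn: Rfn, bfn: Bfn,
              p: int, w: int) -> List[List[Any]]:
    # Window router: forwards whole windows without inspecting tuples
    # (round-robin assignment is an assumption of this sketch).
    per_splitter: List[List[Tuple[int, List[Any]]]] = [[] for _ in range(p)]
    for n, win in enumerate(windows(s, w)):
        per_splitter[n % p].append((n, win))

    # Window splitters: each of the p splitters runs fsplit on its own windows.
    partial: List[Tuple[int, List[List[Any]]]] = []
    for assigned in per_splitter:
        for n, win in assigned:
            partial.append((n, fsplit(win, q, rfn, bfn)))

    # Stream merge: the merge in front of CQ j collects output j of all
    # splitters; ordering by window number is an assumption of this sketch.
    partial.sort(key=lambda item: item[0])
    sv: List[List[Any]] = [[] for _ in range(q)]
    for _, win_sv in partial:
        for j in range(q):
            sv[j].extend(win_sv[j])
    return sv
```

In this sketch the router only copies windows, so the per-tuple cost of rfn and bfn is spread over the p splitters, while the output is the same as that of a single fsplit over the whole stream.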
12
Tree shaped window routing: parasplit*
14
Experimental set-up
Hardware
  Linux cluster with up to 70 nodes
  Each node has 2x quad-core Intel® Xeon® E5430 @ 2.66 GHz, 6 MB L2 cache
  TCP/IP over Gigabit Ethernet (GbE)
Performance number L: the number of expressways (xways) the DSMS can handle in the Linear Road Benchmark (LRB)
www.cs.brandeis.edu/~linearroad
15
LRB result
name              org                   year  L    cores  comment
Aurora            Brandeis, Brown, MIT  2004  2.5  1
Commercial sys A                        2004  0.5  1
SPC               IBM                   2006  2.5  170    3 GHz Xeon
XQuery            ETHZ                  2007  1.5  1
DataCell          CWI                   2009  1    4      1.4 s avg RT
stream schema     ETHZ                  2010  5    4
SCSQ maxtree      UU                    2010  64   48     D disabled (later verified in MySQL)
SCSQ parasplit    UU                    2011  512  560    D disabled
Performance number L: number of xways the DSMS can handle
16
[Plot: splitstream stream rate vs. CQ parallelism q, with the 1 Gbps wire speed as reference]
17
Window router stream rate
W: physical window size
p: number of parallel fsplit
[Plot: window router stream rate as a function of p and W]
18
Impact of window size W in window router
The window router is network bound for large enough windows.
19
Impact of window size W in window router when scaling p
20
Parasplit*: tree shaped window router, W = 16 kB
[Plot: stream rate vs. fsplit parallelism p]
22
Eliminate p
parasplit(stream s, integer q, function rfn, function bfn) -> vector of stream sv
Given
  the input stream rate Φ_D
  the parallelism q of the continuous query
automatically determine the fsplit parallelism p.
23
Cost model for fsplit
cr: read cost per tuple (read + de-marshal)
cs: split cost per tuple (execute rfn and bfn)
ce: emit cost per tuple (marshal + print)
o: omit %
r: routing %
b: broadcast %
q: number of output streams
o, r, and b are determined by rfn and bfn.
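The slide's formula is not part of this transcript, so the following Python sketch shows only one plausible reading of the listed parameters, not the authors' exact model. It assumes every tuple is read and split once, a routed tuple is emitted once, a broadcast tuple is emitted q times, and an omitted tuple is never emitted. The function names and the capacity estimate splitters_needed are hypothetical.

```python
import math


def fsplit_cost_per_tuple(cr: float, cs: float, ce: float,
                          r: float, b: float, q: int) -> float:
    """Assumed average CPU cost per input tuple for one fsplit process.

    r and b are fractions in [0, 1]; the omitted fraction o = 1 - r - b
    contributes no emit cost in this reading.
    """
    return cr + cs + ce * (r + b * q)


def splitters_needed(input_rate: float, cost_per_tuple: float) -> int:
    """Smallest number of fsplit processes whose combined capacity covers
    input_rate (tuples/s), assuming perfect load balance and no router cost."""
    return math.ceil(input_rate * cost_per_tuple)


# Example with made-up costs (seconds per tuple), 99% routed, 1% broadcast.
cost = fsplit_cost_per_tuple(cr=1e-6, cs=2e-6, ce=1e-6, r=0.99, b=0.01, q=100)
p = splitters_needed(input_rate=1e6, cost_per_tuple=cost)
```

Under this reading the emit term grows linearly with q whenever b > 0, which is consistent with the earlier observation that a single fsplit becomes a bottleneck as CQ parallelism increases.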
24
Cost model for merge in CQ
cr: read cost per tuple (read + de-marshal)
cp: poll cost per tuple
cm: merge cost per tuple
O: cost of executing the CQ and emitting its result
25
Cost model for parasplit
p can be eliminated using the cost model, but this requires extensive profiling everywhere.
[Diagram: PR, followed by p parallel fsplit, followed by q parallel CQs]
26
Heuristic for estimating p
Assume
  1% broadcast tuples (configurable)
  0% omitted tuples (configurable)
Measure Φ_fsplit(1), the stream rate of fsplit on rfn and bfn with q = 1:
  cs + ce = 1/Φ_fsplit(1)
Estimate p by [equation not preserved in this transcript]
27
p according to heuristics vs. p using exact cost model
[Plot: estimated p vs. CQ parallelism q]
29
Estimating energy efficiency, η
How much extra energy does parasplit consume in comparison to fsplit?
Conservatively assume energy consumption proportional to CPU usage:
  Useful work: p ∙ C_fsplit
  Overhead: C_PR and q ∙ C_CQ (with O = 0)
[Diagram: PR, fsplit, CQ]
30
Measuring energy efficiency
[Plot: energy efficiency vs. CQ parallelism q]
31
Related work
Nobody else has investigated strategies for scalable customizable stream splitting.
IBM SPADE/System S [Andrade et al 2009]
  Splitstream operator with broadcast capabilities
  Streaming throughput degrades when scaling q
Event based systems [Brenna et al 2009]
  Custom stream splitting shown to be a bottleneck
Gigascope [Johnson et al 2008]
  Assumes specialized stream splitting hardware
  No customizable stream splitting
GSDM [Ivanova, Risch 2005]
  Parallel execution of expensive UDFs
  More limited parallelization
Streaming MapReduce [Condie et al 2010]
  Does not handle scalable stream splitting
[Balkesen, Tatbul 2011]
  Distributing entire windows over CQs
  q ≤ 4
32
Conclusions and future work
Naïve stream splitting is prohibitive for scale-out of CQs.
Parasplit eliminates the bottleneck of stream splitting, providing network bound stream rates.
Parasplit* provides network bound stream rates for highly scaled-out stream splitting.

Push selection predicates from CQ to rfn of splitstream
Improve energy efficiency
High availability

SCSQ home page: http://www.it.uu.se/research/group/udbl/SCSQ.html