Download presentation
Presentation is loading. Please wait.
1
Interactive Query Processing Vijayshankar Raman Computer Science Division University of California at Berkeley
2
July 13, 2015Interactive Query Processing2 Motivation: Nature of querying querying = extracting information from data sets different techniques in different settings intrinsically slow significant user-system interaction info. seekers work iteratively, gradually refining requests based on feedback
3
July 13, 2015Interactive Query Processing3 Problems with traditional solutions mismatch between system functionality and mode of HCI black boxes do batch processing frustrating delays in iterative process process query exact answer
4
July 13, 2015Interactive Query Processing4 Interactive processing HCI requirements users must get continual feedback on results of processing allow users to control processing based on prior feedback performance goals not to minimize time to give complete results give continually improving partial results adapt to dynamically specified performance goals process query exact answer
5
July 13, 2015Interactive Query Processing5 Context and Talk Outline query processing over structured data processing = dataflow through pipelining operators outline of rest of talk support for dynamic user control in traditional query proc. architectures (static plans) architecture for giving more aggressive partial results policy for generating partial results client user interface impact of user actions on query execution other related issues interactive data cleaning (A-B-C) adaptive query processing RS P Q
6
July 13, 2015Interactive Query Processing6 Design goals in supporting user control make minimal change to system architecture must be independent of particular query processing algorithms no delay in processing
7
July 13, 2015Interactive Query Processing7 Online Reordering ( Raman et al. ’99,’00 ) users perceive data being processed over time prioritize processing for “interesting” tuples interest based on user-specified preferences reorder dataflow so that interesting tuples go first encapsulate reordering as pipelined dataflow operator scan reorder index scan
8
July 13, 2015Interactive Query Processing8 Framework for Online Reordering want no delay in processing in general, reordering can only be best-effort typically process/consume slower than produce exploit throughput difference to reorder two aspects mechanism for best-effort reordering reordering policy acddbadb... abcdabc.. reorder produce process f(t) user interest
9
July 13, 2015Interactive Query Processing9 Juggle mechanism for reordering continually prefetch from input, spooling onto auxiliary side disk if needed juggle data between buffer and side disk, to keep buffer full of “interesting” items getNext chooses best item currently on buffer reordering policy determines what getNext returns, and enrich/spool decisions buffer spoolfetch enrich getNext juggle side disk produce process/consume
10
July 13, 2015Interactive Query Processing10 Reordering policies quality of feedback for a prefix t 1 t 2 …t k QOF( UP(t 1 ), UP(t 2 ), … UP(t k ) ), UP = user preference determined by application goodness of reordering: dQOF/dt implication for juggle mechanism process gets item from buffer that increases QOF the most juggle tries to maintain buffer with such items time QOF GOAL: “good” permutation of items t 1 …t n to t 1 …t n
11
July 13, 2015Interactive Query Processing11 One application of reordering online aggregation [Hellerstein et. al ‘97, Hellerstein and Haas’99 ] for SQL aggregate queries, give gradually improving estimates with confidence intervals allow users to speed up estimate refinement for groups of interest prioritize for processing at a per-group granularity SELECT AVG(gpa) FROM students GROUP BY college
12
July 13, 2015Interactive Query Processing12 Online Aggregation Screenshot SELECT AVG(gpa) FROM students GROUP BY college
13
July 13, 2015Interactive Query Processing13 QOF in Online Aggregation avg weighted confidence interval preference acts as weight on confidence interval QOF = UP i / n i, n i = number of tuples processed from group i process pulls items from group with max UP i / n i n i desired ratio of group i tuples on buffer = UP i 2/3 / UP j 2/3 juggle tries to maintain this by enrich/spool
14
July 13, 2015Interactive Query Processing14 Other Quality of Feedback functions rate of processing (for a group) preference QOF = (n i - nUP i ) 2 (variance from ideal proportions) process pulls items from group with max (nUP i - n i ) desired ratio of group i tuples in buffer = UP i will see more complex QOF later
15
July 13, 2015Interactive Query Processing15 Results: Reordering in Online Aggregation implemented in Informix UDO experiments with modified TPC-D queries questions: how much throughput difference is needed for reordering can we reorder handle skewed data one stress test: skew, minimal proc. cost index-only join 5 orderpriorities, zipf distribution consume process scan juggle index SELECT AVG(o_totalprice), o_orderpriority FROM order WHERE exists ( SELECT * FROM lineitem WHERE l_orderkey = o_orderkey) GROUP BY o_orderpriority
16
July 13, 2015Interactive Query Processing16 Performance results time # tuples processed 3 times faster for interesting groups 2% completion time overhead E C A confidence interval time
17
July 13, 2015Interactive Query Processing17 Findings higher processing costs (index/hash join, subquery, …) make reordering easy at very low processing costs, juggle constrained by density of interesting tuples in source outlier groups hard to speed up better to use index on reordering column [HHW ‘97] reordering becomes easier over time questions to answer: where to place juggle are groups fixed statically
18
July 13, 2015Interactive Query Processing18 Outline motivation and context support for dynamic user control in traditional query proc. architectures architecture for giving more aggressive partial results policy for generating partial results user interface for displaying partial results impact on routing related work, work-in-progress, future work
19
July 13, 2015Interactive Query Processing19 Generating partial results traditional arch. also generate continual result tuples arises from continual dataflow thru pipelining operators much work on pipelining joins [WA91, HH99, IFF + 99, UF00] this is too rigid especially in distributed envirorments Result Space
20
July 13, 2015Interactive Query Processing20 Context: Query processing in Telegraph Telegraph: adaptive dataflow system to query diverse, distributed sources much data available as services over Internet currently only accessible by browse/search/forms want to semi-automatically combine this data examples: campaign finance information (fff.cs.berkeley.edu) (fff.cs.berkeley.edu) multiple donor lists, home prices, census info., crime ratings, maps, celebrity lists, legislative records … restaurant information lists, reviews, addresses, maps, health inspection reports, nutrition information …
21
July 13, 2015Interactive Query Processing21 More aggressive partial results complete results too rigid source latencies high, diverse dynamic source variations, delays query does not capture user desires want to process queries flexibly, giving results asap adapt dynamically to user preferences and source variations allow user feedback to refine partial results
22
July 13, 2015Interactive Query Processing22 Correctness of partial results must contain some essential columns group-by columns / sort-by columns full disjunction/outer-join semantics appropriate for many Internet data sources join semantics cannot give partial results without ensuring that match exists aggregates: over-estimate, even with outer-join semantics update early, and compensate later
23
July 13, 2015Interactive Query Processing23 Dynamic query plans Eddy [Avnur and Hellerstein 2000] router for directing data thru modules dynamically chooses join and selection order all partial tuples generatable original routing policy: optimize completion time our focus: continual partial results, adaptation to user preferences Eddy RS T
24
July 13, 2015Interactive Query Processing24 Eddy Routing Eddy must decide a)what tuple to route next b)where to route it based on user preferences, and source properties need routing policy and a reordering mechanism eddy memory and module queues bounded... modules RPS... inputs Eddy
25
July 13, 2015Interactive Query Processing25 Prioritizing tuples perform online reordering within Eddy all incoming tuples placed in reorderer when space available on queues to modules, Eddy takes tuple from reorderer and routes it to modules best-effort automatically reorder before slow modules... modules RPS... inputs copy
26
July 13, 2015Interactive Query Processing26 Routing and reordering policy GOAL: at any time, route to max. dQOF/dt = benefit of sending tuple to module / cost cost: estimate data rates to/from module benefit: dependent on application and UI how partial results impact the UI user preferences QOF GOAL: time
27
July 13, 2015Interactive Query Processing27 Outline motivation and context online reordering for user prioritization of partial results architecture for generating more aggressive partial results policy for generating partial results Telegraph UI: displaying results and inferring preferences impact on routing experimental results related work/work-in-progress/future work
28
July 13, 2015Interactive Query Processing28 Telegraph UI screenshot
29
July 13, 2015Interactive Query Processing29 Telegraph UI partial results displayed on screen as they arrive values clustered into groups can roll-up or drill-down to see different granularities client has hash table mapping groups to values navigation vertical/horizontal scrolling column (un)hiding rollup/drilldown different columns visible at different drilldown levels
30
July 13, 2015Interactive Query Processing30 Getting user preferences (1) infer from navigation row/column scrolling, group drill down and rollup prioritize visible rows/columns “query evolution” subset “one-size-fits-all” queries future work: query expansion explicit at the cell level -- need for some expensive sources can also have up/down buttons on columns
31
July 13, 2015Interactive Query Processing31 Getting user preferences (2) preferences: col priorities { … } e.g. {,,,, }
32
July 13, 2015Interactive Query Processing32 Benefit of a partial result depends on user preferences benefit of updating a cell in output row priority x column weight x cell worthiness column weight = max(col priority, row-col priority) incremental cell worthiness: how much does one extra update add to the cell’s value quality of feedback enumerations -- 1 aggregations -- change in confidence interval informing user about execution progress map cell worthiness to foreground color value
33
July 13, 2015Interactive Query Processing33 Benefit of Routing a Tuple route tuple according to expected benefit and cost benefit of sending a tuple t to a module M and forming set T = cells c T benefit of updating c estimate fanout by sampling previous values -- selectivity estimation t M T
34
July 13, 2015Interactive Query Processing34 Granularity of Reordering/Routing reorder and route tuples at granularity of a group group = groups created and deleted dynamically, as user navigates in UI group predicate may be a subset of application- specified row predicate final row depends on values this tuple joins with
35
July 13, 2015Interactive Query Processing35 Partial results, no reordering Bush Contributors Income Crime Ratings 8002004006000 10000 20000 40000 30000 50000 time (s) total worthiness with partial results without partial results
36
July 13, 2015Interactive Query Processing36 Partial results, no reordering 0 100200300 10000 20000 Bush Contributors Income Crime Ratings time(s) total worthiness 100200300 0 1000 delay with partial results without partial results
37
July 13, 2015Interactive Query Processing37 Results with row-wise reordering scrolling: {AZ, AR, CA CO, CT}, LA,KY,MA,MD,MI} 0 100 4080 Aggr. Updates time 0 0.24 groups rel. error
38
July 13, 2015Interactive Query Processing38 Prioritizing particular columns previous graph: Income updates much faster than Crime Ratings can we prioritize Crime Ratings? increase number of threads to probe it if eddy sends n a tuples to A, n b to B, per second n a <= threads a /latency(A), n b <= threads b /latency(B) threads a + threads b <= Number of Threads Per Query max. n a worthiness(A) + n b worthiness(B) more generally, optimize for multiple resources good citizenship costs
39
July 13, 2015Interactive Query Processing39 Outline motivation and context online reordering for user prioritization of partial results architecture for generating more aggressive partial results policy for generating partial results conclusions and future work
40
July 13, 2015Interactive Query Processing40 Related Work IR work ranked retrieval, relevance feedback search strategies, Berry Picking making operators pipelining (Ripple Join, XJoin, Tuqwila) precomputed summaries (OLAP, materialized views, AQUA) top N / fast-first query processing parachute queries, union queries adaptivity dynamic query plans mid-query reoptimization (KD’98, IFF + 99) competition (DEC Rdb)
41
July 13, 2015Interactive Query Processing41 Summary query processor should care about how user/application uses results online reordering effective way of supporting dynamic user control give partial results as user wishes by embedding reordering within dynamically controlled dataflow system flexibility helpful hard to map these user-interaction needs into concrete algorithm performance goals wanted: benchmarks based on user/application traces For more information: http://telegraph.cs.berkeley.edu/http://telegraph.cs.berkeley.edu/, http://control.cs.berkeley.edu/ttp://control.cs.berkeley.edu/
42
July 13, 2015Interactive Query Processing42 Future Work session-granularity query processing query evolution lazy evaluation: user/client navigates through partial results closer dbms-application interaction reduced operator set querying execute query by appropriate routing through data sources and state modules
43
July 13, 2015Interactive Query Processing43 Lessons Learned query processor should care about how user/application uses results system flexibility helpful hard to map these user-interaction needs into concrete algorithm performance goals wanted: benchmarks based on user/application traces For more information: http://telegraph.cs.berkeley.edu/ http://control.cs.berkeley.edu/
44
July 13, 2015Interactive Query Processing44 relational operators: logical astractions encapsulate multiple physical effects inflexible in handling unexpected changes information hiding operator speeds resource consumption work sharing competitive access paths inflexible tuple routing suppose user asks for SQ tuples suppose there is a delay in P want to encapsulate at level of physical operators: data sources, data structures Granularity of Query Operators RS P Q R Eddy S Q RS RS-PQPQPQij
45
July 13, 2015Interactive Query Processing45 State Modules store state in separate State Modules (SteMs) SteM like an index: supports builds and probes unifies caches, join state, rendezvous buffers Eddy routes tuples through SteMs and data sources gets results from sources, or probe/build into SteMs joins and data access performed in the process of routing RSPQ Q R Eddy SteM P SteM Q SteM S SteM R S P Q R Eddy S Q RS RS-PQPQPQij
46
July 13, 2015Interactive Query Processing46 HCI Motivation: Why Interactivity? Berry picking (Bates ‘90, ‘93) user studies decision support (O’day and Jeffries ‘93) relevance feedback (Koenemann and Belkin ‘96)
47
July 13, 2015Interactive Query Processing47 BACKUP: Disk data layout during juggling performance goal: favor early results: optimize Phase 1, at expense of Phase 2 spool : sequential I/O, enrich: random I/O +reordering in Phase 2 much easier than in Phase 1 enrich needs approx. index on side-disk have hash-index on tuples, according to user interest done at “group” granularity
48
July 13, 2015Interactive Query Processing48 BACKUP: Other applications of reordering can add reorder to any dataflow will later discuss application for query systems that give aggressive partial results also useful in batch query processing sorting often used in query plans for performance gains can replace by best-effort reordering little performance hit, but plan is now pipelining will not discuss further in this talk spreadsheets
49
July 13, 2015Interactive Query Processing49 BACKUP: Estimators for rowPriority, incremental worthiness easy, unless output row depends on new values currently use average of all possible rows could estimate a distribution instead
50
July 13, 2015Interactive Query Processing50 BACKUP: Interactive query processing with non-pipelining operators basic techniques independent of pipelined operators. reordering --- can be used in general query plans, although effectiveness may be hindered by blocking operators State Modules applicable in general -- helps with adaptivity much work on making plan operators pipelining -- ripple join, xjoin, tuqwila reordering itself can be used to avoid blocking sorts in some cases
51
July 13, 2015Interactive Query Processing51 Query processing using SteMs (1) SteM operations -- works like index on a table build(tuple): add tuple to SteM probe(tuple): find all matches among build values bounce back tuple if all matches not found and probe tuple not cached elsewhere performing joins by routing through SteMs index joins regular synchronous index joins easy -- have no state over distributed sources, lookups cached by building into SteM separating cache allows Eddy to distinguish access cost asynchronous index joins [GW’00] helps for high-throughput Internet sources SteM acts as rendezvous buffer for pending probe results R Eddy SteM S SteM R S R bld. S bld. R probe S probe R
52
July 13, 2015Interactive Query Processing52 Query processing using SteMs (2) performing joins by routing through SteMs (contd) doubly pipelined hash join use two SteMs, one on each source can also do extensions (e.g. Tuqwila [IFF + 99]), and even non-pipelined joins, using appropriate SteMs thus can simulate any static query plan by appropriate routing R Eddy SteM S SteM R R bld. S bld. R probe S probe S
53
July 13, 2015Interactive Query Processing53 Query processing using SteMs (3) how to do this in general? don’t want Eddy to be a super-join want it to route as per user desires and source properties, independent of any join algorithm routing constraint: each tuple must have a non-zero probability of being sent to every appropriate module can avoid repeated looping by marking tuples Theorem: any routing that satisfies above constraint will produce all query results and terminate could get duplicates prune at application, or enforce atomicity in tuple routing
54
July 13, 2015Interactive Query Processing54 Benefits of SteMs adaptivity join algorithm and access path selection are determined by eddy routing hence Eddy can dynamically choose these based on source properties in fact, Eddy can do hybrid join algorithms adapt to dynamic delays, etc. work sharing -- for query evolution, competition better information to the Eddy speeds, memory consumption
55
July 13, 2015Interactive Query Processing55 BACKUP: Lazy Evaluation User navigates thru partial results can see only a few at a time exploit to delay processing do on demand on the region user currently navigating over things to delay expensive source lookups (e.g. map) joins (e.g. …) any other formatting etc.
56
July 13, 2015Interactive Query Processing56 Backup: giving probabilistic results view partial result as somethng that throws light on outer-join space positive result: probabilistic negative result: deterministic benefit of positive result: weighted sum of cells it is likely to show; penalty for false result repudiation of earlier false results benefit of negative result: direct benefit unlikely repudiation of earlier false results complications many tuples may contribute to single output tuple -- aggregation
57
July 13, 2015Interactive Query Processing57 Probabilistic partial results partial result => may not apply all filters still want to show result probabilistically filters ill-specified data on Internet often inconsistent show all partial results: “full disjunction” (Galindo-Legaria ‘94) positive and negative results Output Space Outer Join Space
58
July 13, 2015Interactive Query Processing58 What does user see? user navigates thru partial results with complete row partial results: can either explore these results in detail (sort/scroll along columns as in spreadsheet) compute approx. aggregates on them in online fashion errors shrinking as more results come in with probabilistic, incomplete results? easy if primary key in partial result what to show otherwise?
59
July 13, 2015Interactive Query Processing59 Quantifying benefit of partial results partial result sheds light on outer-join space positive result: probabilistic negative result: deterministic benefit of positive result: weighted sum of cells it is likely to show penalty for false result ? repudiation of earlier false results ? benefit of negative result: direct benefit unlikely repudiation of earlier false results ?
60
July 13, 2015Interactive Query Processing60 Why probabilistic results are good data on Internet is often inconsistent hence even exact processing cannot give perfect answers people are used to sloppy answers can allow negations in expert interfaces only
61
July 13, 2015Interactive Query Processing61 Data growth vs. Computer Speedup Moore’s Law -- # of transistors/chip doubles every 18 months (1965) data growth Source: J. Porter, Disk/Trend, Inc. (http://www.disktrend.com/pdf/portrpkg.pdf)
62
July 13, 2015Interactive Query Processing62 Disk Appetite, contd. Greg Papadopoulos, CTO Sun: Disk sales doubling every 9 months similar results from Winter VLDB survey time to process all your data doubles every 18 months!
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.