Interactive Query Processing Vijayshankar Raman Computer Science Division University of California at Berkeley.

July 13, 2015, Interactive Query Processing
Slide 2: Motivation: Nature of querying
- querying = extracting information from data sets
- different techniques in different settings
- intrinsically slow
- significant user-system interaction
  - information seekers work iteratively, gradually refining requests based on feedback

Slide 3: Problems with traditional solutions
- mismatch between system functionality and mode of HCI
- black boxes do batch processing
  - frustrating delays in iterative process
[Figure labels: process, query, exact answer]

Slide 4: Interactive processing
- HCI requirements
  - users must get continual feedback on results of processing
  - allow users to control processing based on prior feedback
- performance goals
  - not to minimize time to give complete results
  - give continually improving partial results
  - adapt to dynamically specified performance goals
[Figure labels: process, query, exact answer]

Slide 5: Context and Talk Outline
- query processing over structured data
- processing = dataflow through pipelining operators
- outline of rest of talk
  - support for dynamic user control in traditional query processing architectures (static plans)
  - architecture for giving more aggressive partial results
  - policy for generating partial results
    - client user interface
    - impact of user actions on query execution
  - other related issues
    - interactive data cleaning (A-B-C)
    - adaptive query processing
[Figure labels: R, S, P, Q]

Slide 6: Design goals in supporting user control
- make minimal change to system architecture
- must be independent of particular query processing algorithms
- no delay in processing

Slide 7: Online Reordering (Raman et al. ’99, ’00)
- users perceive data being processed over time
  - prioritize processing for “interesting” tuples
  - interest based on user-specified preferences
- reorder dataflow so that interesting tuples go first
- encapsulate reordering as pipelined dataflow operator
[Figure labels: scan, reorder, index scan]

Slide 8: Framework for Online Reordering
- want no delay in processing
- in general, reordering can only be best-effort
- typically process/consume is slower than produce
  - exploit throughput difference to reorder
- two aspects
  - mechanism for best-effort reordering
  - reordering policy
[Figure: input stream "acddbadb..." passes through reorder and comes out as "abcdabc.."; labels: produce, reorder, process, f(t), user interest]

Slide 9: Juggle mechanism for reordering
- continually prefetch from input, spooling onto auxiliary side disk if needed
- juggle data between buffer and side disk, to keep buffer full of “interesting” items
  - getNext chooses best item currently on buffer
- reordering policy determines what getNext returns, and enrich/spool decisions
[Figure labels: produce, fetch, buffer, spool, enrich, side disk, juggle, getNext, process/consume]
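The fetch/spool/enrich/getNext cycle can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not the implementation from the talk: the side disk is an in-memory list, the class name `Juggle`, the `prefetch` parameter, and the "spool the least interesting item" rule are all hypothetical, and the throughput gap between producer and consumer is modelled by fetching `prefetch` items per `get_next` call.

```python
_DONE = object()  # sentinel marking an exhausted input


class Juggle:
    """Best-effort reorderer: bounded buffer plus an unbounded side store."""

    def __init__(self, source, interest, buffer_size, prefetch=2):
        self.source = iter(source)      # producer (e.g. a scan)
        self.interest = interest        # f(t): larger means more interesting
        self.buffer_size = buffer_size
        self.prefetch = prefetch        # items fetched per consumer request
        self.buffer = []
        self.side_disk = []             # stand-in for the auxiliary side disk
        self.exhausted = False

    def _fetch(self):
        # Prefetch from the input; spool the least interesting item on overflow.
        for _ in range(self.prefetch):
            item = next(self.source, _DONE)
            if item is _DONE:
                self.exhausted = True
                return
            self.buffer.append(item)
            if len(self.buffer) > self.buffer_size:
                worst = min(self.buffer, key=self.interest)
                self.buffer.remove(worst)
                self.side_disk.append(worst)

    def _enrich(self):
        # Refill the buffer with the most interesting spooled items.
        while self.side_disk and len(self.buffer) < self.buffer_size:
            best = max(self.side_disk, key=self.interest)
            self.side_disk.remove(best)
            self.buffer.append(best)

    def get_next(self):
        """Return the best buffered item, or None when everything is consumed."""
        if not self.exhausted:
            self._fetch()
        if self.exhausted:
            self._enrich()
        if not self.buffer:
            return None
        best = max(self.buffer, key=self.interest)
        self.buffer.remove(best)
        return best
```

Calling `get_next()` until it returns `None` yields a permutation of the input in which interesting items tend to come early. In this simplified version enrichment happens only after the input runs dry, whereas the slide describes juggling between buffer and side disk throughout.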

Slide 10: Reordering policies
- GOAL: “good” permutation of items $t_1 \ldots t_n$ to $t_{\pi_1} \ldots t_{\pi_n}$
- quality of feedback for a prefix $t_{\pi_1} t_{\pi_2} \ldots t_{\pi_k}$:
  $QOF(UP(t_{\pi_1}), UP(t_{\pi_2}), \ldots, UP(t_{\pi_k}))$, where UP = user preference
  - determined by application
- goodness of reordering: $dQOF/dt$
- implication for juggle mechanism
  - process gets item from buffer that increases QOF the most
  - juggle tries to maintain buffer with such items
[Plot: QOF vs. time]

Slide 11: One application of reordering
- online aggregation [Hellerstein et al. ’97, Hellerstein and Haas ’99]
  - for SQL aggregate queries, give gradually improving estimates with confidence intervals
  - allow users to speed up estimate refinement for groups of interest
  - prioritize for processing at a per-group granularity

    SELECT AVG(gpa) FROM students GROUP BY college
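To make "gradually improving estimates with confidence intervals" concrete, here is a minimal Python sketch of a per-group running average. It assumes tuples arrive in random order and uses a plain normal-approximation interval; the cited online aggregation work uses its own estimators, so the names `RunningAvg` and `online_avg` and the `z = 1.96` default are illustrative assumptions only.

```python
import math


class RunningAvg:
    """Running mean and variance (Welford's update) for one group."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the current mean

    def add(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def half_width(self, z=1.96):
        """Half-width of an approximate confidence interval for the mean."""
        if self.n < 2:
            return math.inf
        variance = self.m2 / (self.n - 1)
        return z * math.sqrt(variance / self.n)


def online_avg(rows):
    """Yield (group, estimate, half_width) after every (group, value) row."""
    groups = {}
    for group, value in rows:
        est = groups.setdefault(group, RunningAvg())
        est.add(value)
        yield group, est.mean, est.half_width()
```

Feeding it `(college, gpa)` pairs mimics the `AVG(gpa) ... GROUP BY college` query: each snapshot can be shown immediately, and the interval for a group generally narrows as more of its tuples are processed.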

Slide 12: Online Aggregation Screenshot
[Screenshot of the online aggregation interface for the query below]

    SELECT AVG(gpa) FROM students GROUP BY college

Slide 13: QOF in Online Aggregation
- average weighted confidence interval
- preference acts as weight on confidence interval
  - $QOF = -\sum_i UP_i / \sqrt{n_i}$, where $n_i$ = number of tuples processed from group $i$
  - process pulls items from group with max $UP_i / (n_i \sqrt{n_i})$
  - desired ratio of group $i$ tuples on buffer = $UP_i^{2/3} / \sum_j UP_j^{2/3}$
  - juggle tries to maintain this by enrich/spool
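A short Python sketch of this policy, reading the slide's metric as QOF = -sum_i UP_i / sqrt(n_i): one more tuple from group i improves QOF by roughly UP_i / (n_i * sqrt(n_i)), up to a constant factor, so the consumer pulls from the group where that gain is largest. The function names and the rule "an unseen group always goes first" are assumptions made for the example.

```python
import math


def next_group_confidence(prefs, counts):
    """Pick the group whose next tuple improves the weighted confidence most.

    prefs:  group -> user preference UP_i
    counts: group -> tuples processed so far n_i
    """
    def gain(group):
        n = counts.get(group, 0)
        if n == 0:
            return math.inf  # assumption: groups with no tuples yet come first
        return prefs[group] / (n * math.sqrt(n))

    return max(prefs, key=gain)


def buffer_shares(prefs):
    """Target fraction of buffer space per group: UP_i^(2/3) / sum_j UP_j^(2/3)."""
    weights = {g: up ** (2.0 / 3.0) for g, up in prefs.items()}
    total = sum(weights.values())
    return {g: w / total for g, w in weights.items()}
```

The 2/3 exponent follows from the same gain expression: gains are equal across groups when n_i is proportional to UP_i^(2/3), so that is the mix the buffer should hold.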

Slide 14: Other Quality of Feedback functions
- rate of processing (for a group) $\propto$ preference
  - $QOF = -\sum_i (n_i - n \, UP_i)^2$ (variance from ideal proportions), where $n$ = total number of tuples processed
  - process pulls items from group with max $(n \, UP_i - n_i)$
  - desired ratio of group $i$ tuples in buffer = $UP_i$
- will see more complex QOF later
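The rate-based policy is easy to simulate. The Python sketch below assumes preferences are normalized to sum to 1 and that n is the total number of tuples processed so far; `next_group_rate` is a hypothetical name for the selection rule "pull from the group furthest behind its ideal share".

```python
def next_group_rate(prefs, counts):
    """Pick the group with the largest deficit n * UP_i - n_i.

    prefs:  group -> normalized preference UP_i (assumed to sum to 1)
    counts: group -> tuples processed so far n_i
    """
    n = sum(counts.values())
    return max(prefs, key=lambda g: n * prefs[g] - counts.get(g, 0))


def simulate(prefs, pulls):
    """Repeatedly apply the policy, assuming every group always has tuples."""
    counts = {g: 0 for g in prefs}
    for _ in range(pulls):
        counts[next_group_rate(prefs, counts)] += 1
    return counts
```

With unlimited supply from every group, the processed counts track the preferences closely; in the real system the buffer contents limit which groups can actually be pulled, which is why the juggle tries to keep groups buffered in the ratio UP_i.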

Slide 15: Results: Reordering in Online Aggregation
- implemented in Informix UDO
- experiments with modified TPC-D queries
- questions:
  - how much throughput difference is needed for reordering?
  - can reordering handle skewed data?
- one stress test: skew, minimal processing cost
  - index-only join
  - 5 orderpriorities, Zipf distribution
[Figure labels: scan, juggle, index, process, consume]

    SELECT AVG(o_totalprice), o_orderpriority
    FROM order
    WHERE exists (SELECT * FROM lineitem WHERE l_orderkey = o_orderkey)
    GROUP BY o_orderpriority

Slide 16: Performance results
- 3 times faster for interesting groups
- 2% completion time overhead
[Plots: # tuples processed vs. time, and confidence interval vs. time, for groups E, C, A]

Slide 17: Findings
- higher processing costs (index/hash join, subquery, …) make reordering easy
- at very low processing costs, juggle constrained by density of interesting tuples in source
  - outlier groups hard to speed up
  - better to use index on reordering column [HHW ‘97]
- reordering becomes easier over time
- questions to answer:
  - where to place juggle?
  - are groups fixed statically?

Slide 18: Outline
- motivation and context
- support for dynamic user control in traditional query processing architectures
- architecture for giving more aggressive partial results
- policy for generating partial results
  - user interface for displaying partial results
  - impact on routing
- related work, work-in-progress, future work

Slide 19: Generating partial results
- traditional architectures also generate continual result tuples
  - arises from continual dataflow through pipelining operators
  - much work on pipelining joins [WA91, HH99, IFF+99, UF00]
- this is too rigid, especially in distributed environments
[Figure label: Result Space]

Slide 20: Context: Query processing in Telegraph
- Telegraph: adaptive dataflow system to query diverse, distributed sources
- much data available as services over Internet
  - currently only accessible by browse/search/forms
- want to semi-automatically combine this data
- examples:
  - campaign finance information (fff.cs.berkeley.edu): multiple donor lists, home prices, census info., crime ratings, maps, celebrity lists, legislative records …
  - restaurant information: lists, reviews, addresses, maps, health inspection reports, nutrition information …

Slide 21: More aggressive partial results
- complete results too rigid
  - source latencies high, diverse
  - dynamic source variations, delays
  - query does not capture user desires
- want to process queries flexibly, giving results asap
  - adapt dynamically to user preferences and source variations
  - allow user feedback to refine partial results

Slide 22: Correctness of partial results
- must contain some essential columns
  - group-by columns / sort-by columns
- full disjunction/outer-join semantics
  - appropriate for many Internet data sources
- join semantics
  - cannot give partial results without ensuring that match exists
- aggregates: over-estimate, even with outer-join semantics
  - update early, and compensate later

Slide 23: Dynamic query plans
- Eddy [Avnur and Hellerstein 2000]
  - router for directing data through modules
  - dynamically chooses join and selection order
  - all partial tuples generatable
- original routing policy: optimize completion time
- our focus: continual partial results, adaptation to user preferences
[Figure labels: Eddy, R, S, T]

Slide 24: Eddy Routing
- Eddy must decide
  a) what tuple to route next
  b) where to route it
- based on user preferences, and source properties
- need routing policy and a reordering mechanism
  - eddy memory and module queues bounded
[Figure labels: inputs R, P, S, ...; Eddy; modules]

Slide 25: Prioritizing tuples
- perform online reordering within Eddy
  - all incoming tuples placed in reorderer
  - when space available on queues to modules, Eddy takes tuple from reorderer and routes it to modules
- best-effort
- automatically reorder before slow modules
[Figure labels: inputs R, P, S, ...; copy; modules]

Slide 26: Routing and reordering policy
- GOAL: at any time, route to maximize dQOF/dt = (benefit of sending tuple to module) / cost
- cost: estimate data rates to/from module
- benefit: dependent on application and UI
  - how partial results impact the UI
  - user preferences
[Plot: QOF vs. time, with GOAL marked]

Slide 27: Outline
- motivation and context
- online reordering for user prioritization of partial results
- architecture for generating more aggressive partial results
- policy for generating partial results
  - Telegraph UI: displaying results and inferring preferences
  - impact on routing
  - experimental results
- related work/work-in-progress/future work

Slide 28: Telegraph UI
[Screenshot]

Slide 29: Telegraph UI
- partial results displayed on screen as they arrive
- values clustered into groups
  - can roll-up or drill-down to see different granularities
- client has hash table mapping groups to values
- navigation
  - vertical/horizontal scrolling
  - column (un)hiding
  - rollup/drilldown
- different columns visible at different drilldown levels

Slide 30: Getting user preferences (1)
- infer from navigation
  - row/column scrolling, group drill-down and rollup
  - prioritize visible rows/columns
  - “query evolution”
    - subset “one-size-fits-all” queries
    - future work: query expansion
- explicit
  - at the cell level -- needed for some expensive sources
  - can also have up/down buttons on columns

Slide 31: Getting user preferences (2)
- preferences: col priorities { … }
- e.g. {,,,, }

Slide 32: Benefit of a partial result
- depends on user preferences
- benefit of updating a cell in output = row priority × column weight × incremental cell worthiness
  - column weight = max(col priority, row-col priority)
  - incremental cell worthiness: how much does one extra update add to the cell’s value?
- quality of feedback
  - enumerations -- 1
  - aggregations -- change in confidence interval
- informing user about execution progress
  - map cell worthiness to foreground color value
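The cell benefit is a simple product, sketched below in Python. The function and parameter names are illustrative; the incremental worthiness is passed in by the caller (1 for an enumerated value, the change in confidence interval for an aggregate), since the slide does not specify how it is computed.

```python
def column_weight(col_priority, row_col_priority):
    """Column weight = max(col priority, row-col priority)."""
    return max(col_priority, row_col_priority)


def cell_benefit(row_priority, col_priority, row_col_priority,
                 incremental_worthiness=1.0):
    """Benefit of one more update to a cell in the output.

    incremental_worthiness defaults to 1.0, the value the slide gives for
    enumerations; for aggregates pass the change in confidence interval.
    """
    weight = column_weight(col_priority, row_col_priority)
    return row_priority * weight * incremental_worthiness
```

For example, a visible row with priority 2.0, a column priority of 0.5, and an explicit row-column priority of 0.8 gives an enumeration update a benefit of 2.0 × 0.8 × 1.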

Slide 33: Benefit of Routing a Tuple
- route tuple according to expected benefit and cost
- benefit of sending a tuple $t$ to a module $M$ and forming set $T$ = $\sum_{\text{cells } c \in T}$ (benefit of updating $c$)
- estimate fanout by sampling previous values -- selectivity estimation
[Figure labels: t, M, T]
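A minimal Python sketch of a benefit/cost routing choice follows. It assumes the expected benefit is the per-result cell benefit scaled by an estimated fanout, and that fanout is estimated as the average number of matches seen in earlier probes; all names, and the default fanout of 1.0 when no samples exist, are assumptions for illustration.

```python
def estimate_fanout(match_counts):
    """Average number of result tuples per probe, from sampled history."""
    if not match_counts:
        return 1.0  # assumption: with no history, expect one match
    return sum(match_counts) / len(match_counts)


def routing_benefit(cell_benefits, fanout):
    """Expected benefit of routing: fanout × sum of benefits of updated cells."""
    return fanout * sum(cell_benefits)


def choose_module(options):
    """Pick the module with the highest expected benefit per unit cost.

    options: module name -> (expected benefit, cost), with cost > 0
    """
    return max(options, key=lambda m: options[m][0] / options[m][1])
```

A cheap module with moderate benefit can therefore win over an expensive module with higher benefit, which matches the goal of routing to maximize dQOF/dt rather than total benefit.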

Slide 34: Granularity of Reordering/Routing
- reorder and route tuples at granularity of a group
- group =
  - groups created and deleted dynamically, as user navigates in UI
- group predicate may be a subset of application-specified row predicate
  - final row depends on values this tuple joins with

Slide 35: Partial results, no reordering
[Plot: total worthiness vs. time (s), with partial results and without partial results; sources: Bush Contributors, Income, Crime Ratings]

Slide 36: Partial results, no reordering
[Plots: total worthiness vs. time (s), and delay vs. time (s), with partial results and without partial results; sources: Bush Contributors, Income, Crime Ratings]

Slide 37: Results with row-wise reordering
- scrolling: {AZ, AR, CA, CO, CT}, {LA, KY, MA, MD, MI}
[Plots: aggregate updates vs. time, and relative error by group]

Slide 38: Prioritizing particular columns
- previous graph: Income updates much faster than Crime Ratings
  - can we prioritize Crime Ratings?
- increase number of threads to probe it
- if eddy sends $n_a$ tuples to A and $n_b$ tuples to B per second:
  - $n_a \le threads_a / latency(A)$, $n_b \le threads_b / latency(B)$
  - $threads_a + threads_b \le$ Number of Threads Per Query
  - maximize $n_a \times worthiness(A) + n_b \times worthiness(B)$
- more generally, optimize for multiple resources
  - good citizenship
  - costs
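For two modules the thread split can be found by brute force. The Python sketch below adds one assumption not on the slide: a `demand` cap on how many tuples per second are available to send to each module, without which the optimum would trivially put every thread on the module with the best worthiness per unit latency. The function name and dictionary layout are hypothetical.

```python
def allocate_threads(total_threads, latency, worthiness, demand):
    """Split threads between modules "A" and "B" to maximize total worthiness.

    latency:    module -> seconds per probe
    worthiness: module -> benefit per tuple sent
    demand:     module -> max tuples/second available to send (assumption)
    Returns (best value, threads for A, threads for B).
    """
    best = None
    for threads_a in range(total_threads + 1):
        threads_b = total_threads - threads_a
        # n_a <= threads_a / latency(A), and likewise for B.
        n_a = min(threads_a / latency["A"], demand["A"])
        n_b = min(threads_b / latency["B"], demand["B"])
        value = n_a * worthiness["A"] + n_b * worthiness["B"]
        if best is None or value > best[0]:
            best = (value, threads_a, threads_b)
    return best
```

With more than two modules or several resource types this becomes a small linear or integer program rather than a one-dimensional search.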

Slide 39: Outline
- motivation and context
- online reordering for user prioritization of partial results
- architecture for generating more aggressive partial results
- policy for generating partial results
- conclusions and future work

Slide 40: Related Work
- IR work
  - ranked retrieval, relevance feedback
  - search strategies, Berry Picking
- making operators pipelining (Ripple Join, XJoin, Tukwila)
- precomputed summaries (OLAP, materialized views, AQUA)
- top N / fast-first query processing
- parachute queries, union queries
- adaptivity
  - dynamic query plans
  - mid-query reoptimization (KD’98, IFF+99)
  - competition (DEC Rdb)

Slide 41: Summary
- query processor should care about how user/application uses results
- online reordering is an effective way of supporting dynamic user control
- give partial results as user wishes by embedding reordering within dynamically controlled dataflow
- system flexibility helpful
- hard to map these user-interaction needs into concrete algorithm performance goals
  - wanted: benchmarks based on user/application traces
- For more information: http://telegraph.cs.berkeley.edu/, http://control.cs.berkeley.edu/

Slide 42: Future Work
- session-granularity query processing
  - query evolution
  - lazy evaluation: user/client navigates through partial results
  - closer dbms-application interaction
- reduced operator set querying
  - execute query by appropriate routing through data sources and state modules

Slide 43: Lessons Learned
- query processor should care about how user/application uses results
- system flexibility helpful
- hard to map these user-interaction needs into concrete algorithm performance goals
  - wanted: benchmarks based on user/application traces
- For more information: http://telegraph.cs.berkeley.edu/, http://control.cs.berkeley.edu/

Slide 44: Granularity of Query Operators
- relational operators: logical abstractions that encapsulate multiple physical effects
- inflexible in handling unexpected changes
  - information hiding: operator speeds, resource consumption
  - work sharing: competitive access paths
  - inflexible tuple routing: suppose user asks for SQ tuples; suppose there is a delay in P
- want to encapsulate at level of physical operators: data sources, data structures
[Figure labels: static join plan over R, S, P, Q; Eddy with inputs R, S, P, Q]

Slide 45: State Modules
- store state in separate State Modules (SteMs)
  - SteM is like an index: supports builds and probes
  - unifies caches, join state, rendezvous buffers
- Eddy routes tuples through SteMs and data sources
  - gets results from sources, or probe/build into SteMs
- joins and data access performed in the process of routing
[Figure labels: Eddy with inputs R, S, P, Q and SteM R, SteM S, SteM P, SteM Q]

Slide 46: HCI Motivation: Why Interactivity?
- Berry picking (Bates ‘90, ‘93)
- user studies
  - decision support (O’day and Jeffries ‘93)
  - relevance feedback (Koenemann and Belkin ‘96)

Slide 47: BACKUP: Disk data layout during juggling
- performance goal: favor early results: optimize Phase 1, at expense of Phase 2
  - spool: sequential I/O; enrich: random I/O
  - reordering in Phase 2 much easier than in Phase 1
- enrich needs approximate index on side disk
  - have hash index on tuples, according to user interest
  - done at “group” granularity

Slide 48: BACKUP: Other applications of reordering
- can add reorder to any dataflow
- will later discuss application for query systems that give aggressive partial results
- also useful in batch query processing
  - sorting often used in query plans for performance gains
  - can replace by best-effort reordering
  - little performance hit, but plan is now pipelining
  - will not discuss further in this talk
- spreadsheets

Slide 49: BACKUP: Estimators for rowPriority, incremental worthiness
- easy, unless output row depends on new values
  - currently use average of all possible rows
  - could estimate a distribution instead

Slide 50: BACKUP: Interactive query processing with non-pipelining operators
- basic techniques independent of pipelined operators
  - reordering: can be used in general query plans, although effectiveness may be hindered by blocking operators
  - State Modules applicable in general: help with adaptivity
- much work on making plan operators pipelining: ripple join, XJoin, Tukwila
  - reordering itself can be used to avoid blocking sorts in some cases

Slide 51: Query processing using SteMs (1)
- SteM operations -- works like index on a table
  - build(tuple): add tuple to SteM
  - probe(tuple): find all matches among build values
    - bounce back tuple if all matches not found and probe tuple not cached elsewhere
- performing joins by routing through SteMs
  - index joins
    - regular synchronous index joins easy -- have no state
    - over distributed sources, lookups cached by building into SteM
    - separating cache allows Eddy to distinguish access cost
  - asynchronous index joins [GW’00]
    - helps for high-throughput Internet sources
    - SteM acts as rendezvous buffer for pending probe results
[Figure labels: Eddy, R, S, SteM R, SteM S; routing steps: R build, S build, R probe, S probe]

Slide 52: Query processing using SteMs (2)
- performing joins by routing through SteMs (contd.)
  - doubly pipelined hash join: use two SteMs, one on each source
- can also do extensions (e.g. Tukwila [IFF+99]), and even non-pipelined joins, using appropriate SteMs
- thus can simulate any static query plan by appropriate routing
[Figure labels: Eddy, R, S, SteM R, SteM S; routing steps: R build, S build, R probe, S probe]
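The doubly pipelined hash join over two SteMs can be sketched in Python as follows. This is a simplified illustration, not Telegraph's code: tuples are plain dicts, the `SteM` class and `symmetric_join` function are hypothetical names, the routing order is fixed (build into the tuple's own SteM, then probe the other), and bounce-backs and the Eddy's dynamic routing decisions are not modelled.

```python
from collections import defaultdict


class SteM:
    """State module: an index over one source, supporting build and probe."""

    def __init__(self, key):
        self.key = key                  # join column for this source
        self.table = defaultdict(list)

    def build(self, tup):
        self.table[tup[self.key]].append(tup)

    def probe(self, value):
        """Return all previously built tuples whose key equals value."""
        return list(self.table.get(value, []))


def symmetric_join(stream, key_r, key_s):
    """Join R and S tuples as they arrive, in any interleaving.

    stream yields ("R", tuple) or ("S", tuple); each result is emitted as
    soon as both matching tuples have been seen.
    """
    stem_r, stem_s = SteM(key_r), SteM(key_s)
    for source, tup in stream:
        if source == "R":
            stem_r.build(tup)
            for match in stem_s.probe(tup[key_r]):
                yield tup, match
        else:
            stem_s.build(tup)
            for match in stem_r.probe(tup[key_s]):
                yield match, tup
```

Because each tuple is built before it probes, every matching pair is produced exactly once regardless of arrival order, and neither input blocks the other.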

Slide 53: Query processing using SteMs (3)
- how to do this in general?
  - don’t want Eddy to be a super-join
  - want it to route as per user desires and source properties, independent of any join algorithm
- routing constraint: each tuple must have a non-zero probability of being sent to every appropriate module
  - can avoid repeated looping by marking tuples
- Theorem: any routing that satisfies the above constraint will produce all query results and terminate
- could get duplicates
  - prune at application, or enforce atomicity in tuple routing

Slide 54: Benefits of SteMs
- adaptivity
  - join algorithm and access path selection are determined by eddy routing
  - hence Eddy can dynamically choose these based on source properties
  - in fact, Eddy can do hybrid join algorithms
  - adapt to dynamic delays, etc.
- work sharing -- for query evolution, competition
- better information to the Eddy
  - speeds, memory consumption

Slide 55: BACKUP: Lazy Evaluation
- user navigates through partial results
  - can see only a few at a time
- exploit to delay processing
  - do on demand on the region user is currently navigating over
- things to delay
  - expensive source lookups (e.g. map)
  - joins (e.g. …)
  - any other formatting etc.

Slide 56: BACKUP: Giving probabilistic results
- view partial result as something that throws light on outer-join space
  - positive result: probabilistic
  - negative result: deterministic
- benefit of positive result:
  - weighted sum of cells it is likely to show; penalty for false result
  - repudiation of earlier false results
- benefit of negative result:
  - direct benefit unlikely
  - repudiation of earlier false results
- complications
  - many tuples may contribute to single output tuple -- aggregation

Slide 57: Probabilistic partial results
- partial result => may not apply all filters
- still want to show result probabilistically
  - filters ill-specified
  - data on Internet often inconsistent
  - show all partial results: “full disjunction” (Galindo-Legaria ‘94)
- positive and negative results
[Figure labels: Output Space, Outer Join Space]

Slide 58: What does user see?
- user navigates through partial results
- with complete row partial results, can either:
  - explore these results in detail (sort/scroll along columns as in spreadsheet)
  - compute approximate aggregates on them in online fashion
    - errors shrinking as more results come in
- with probabilistic, incomplete results?
  - easy if primary key in partial result
  - what to show otherwise?

Slide 59: Quantifying benefit of partial results
- partial result sheds light on outer-join space
  - positive result: probabilistic
  - negative result: deterministic
- benefit of positive result:
  - weighted sum of cells it is likely to show
  - penalty for false result?
  - repudiation of earlier false results?
- benefit of negative result:
  - direct benefit unlikely
  - repudiation of earlier false results?

Slide 60: Why probabilistic results are good
- data on Internet is often inconsistent
  - hence even exact processing cannot give perfect answers
- people are used to sloppy answers
- can allow negations in expert interfaces only

Slide 61: Data growth vs. Computer Speedup
- Moore’s Law -- # of transistors/chip doubles every 18 months (1965)
- data growth
[Chart source: J. Porter, Disk/Trend, Inc. (http://www.disktrend.com/pdf/portrpkg.pdf)]

Slide 62: Disk Appetite, contd.
- Greg Papadopoulos, CTO Sun: disk sales doubling every 9 months
- similar results from Winter VLDB survey
- time to process all your data doubles every 18 months!
