Published by Hubert Horn. Modified over 9 years ago.
1
Interactive Data Exploration using Constraints
Alexander Kalinin, Ugur Cetintemel, Stan Zdonik
2
CP + DBMS for Data-Intensive Exploration
3
Interactive Data Exploration (IDE)
Searching for the “interesting” within big data
Exploratory analysis: ad-hoc & repetitive
  Questions are not well defined
  “Interesting” can be complex
Human-in-the-loop operation
  Fast, online results
  Query refinement
Where’s Waldo? Where’s Horrible Gelatinous Blob?
4
Exploratory Queries: Some examples
First-order: “Celestial 3-5° by 5-7° regions with brightness > 0.8”
Higher-order: “Pairs of 2° by 2° celestial regions with similarity > 0.5”
Optimized: “Celestial 3° by 7° region with maximum brightness”
Sloan Digital Sky Survey (SDSS)
5
“Celestial 3-5° by 5-7° regions with average brightness > 0.8” in SQL
1. Divide the data into cells
2. Enumerate all regions
3. Final filtering (> 0.8)
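The three steps above can be sketched in a few lines of Python. This is only an illustration of the brute-force plan, not the SQL the authors ran: the function name `enumerate_regions`, the list-of-lists grid of per-cell brightness values, and the use of prefix sums are all assumptions made for the example.

```python
from itertools import product


def enumerate_regions(grid, heights, widths, threshold):
    """Yield (row, col, height, width, avg) for every window whose
    average cell value exceeds `threshold` (hypothetical helper)."""
    n_rows, n_cols = len(grid), len(grid[0])

    # Step 1: the data is assumed to be divided into cells already (`grid`).
    # Build 2-D prefix sums so any window sum costs O(1).
    prefix = [[0.0] * (n_cols + 1) for _ in range(n_rows + 1)]
    for r in range(n_rows):
        for c in range(n_cols):
            prefix[r + 1][c + 1] = (grid[r][c] + prefix[r][c + 1]
                                    + prefix[r + 1][c] - prefix[r][c])

    # Step 2: enumerate every allowed shape at every position.
    for h, w in product(heights, widths):
        for r in range(n_rows - h + 1):
            for c in range(n_cols - w + 1):
                total = (prefix[r + h][c + w] - prefix[r][c + w]
                         - prefix[r + h][c] + prefix[r][c])
                avg = total / (h * w)
                # Step 3: final filtering.
                if avg > threshold:
                    yield (r, c, h, w, avg)
```

Every window is materialized before it is filtered, so the work grows with the number of positions times the number of shapes, whether or not any window qualifies.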
6
DBMSs for IDE?
No native support for exploratory constructs
  No power set
  No user-defined objective functions
No support for interactivity
  No online results
  No notion of a “query session”
7
Data Exploration as a CP problem
“Celestial 3-5° by 5-7° regions with average brightness > 0.8”
[Figure labels: Left-most corner, Lengths]
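One way to read this slide is that a window is described by decision variables (its left-most corner and its lengths) and the query becomes a set of constraints over them. The sketch below shows that reading with a tiny hand-written backtracking solver. It is not SearchLight and not a real CP library: `solve`, `window_model`, the variable names `x`, `y`, `len_x`, `len_y`, and the in-memory grid are all hypothetical.

```python
def solve(domains, constraints, assignment=None):
    """Backtracking search: bind variables in dict order and prune as soon
    as a fully bound constraint is violated."""
    if assignment is None:
        assignment = {}
    if len(assignment) == len(domains):
        yield dict(assignment)
        return
    var = next(v for v in domains if v not in assignment)
    for value in domains[var]:
        assignment[var] = value
        # Check only constraints that involve `var` and are now fully bound.
        # all() short-circuits, so shape checks run before the content check.
        ok = all(
            check(assignment)
            for scope, check in constraints
            if var in scope and all(s in assignment for s in scope)
        )
        if ok:
            yield from solve(domains, constraints, assignment)
        del assignment[var]


def window_model(grid, x_lengths, y_lengths, threshold):
    """Build domains and constraints for 'windows with average > threshold'."""
    n_x, n_y = len(grid), len(grid[0])

    def avg_brightness(a):
        cells = [grid[i][j]
                 for i in range(a["x"], a["x"] + a["len_x"])
                 for j in range(a["y"], a["y"] + a["len_y"])]
        return sum(cells) / len(cells)

    domains = {
        "x": range(n_x), "len_x": x_lengths,   # corner and length, dimension 1
        "y": range(n_y), "len_y": y_lengths,   # corner and length, dimension 2
    }
    constraints = [
        # Shape constraints: the window must fit inside the grid.
        (("x", "len_x"), lambda a: a["x"] + a["len_x"] <= n_x),
        (("y", "len_y"), lambda a: a["y"] + a["len_y"] <= n_y),
        # Content constraint: average brightness above the threshold.
        (("x", "len_x", "y", "len_y"), lambda a: avg_brightness(a) > threshold),
    ]
    return domains, constraints
```

Calling `list(solve(*window_model(grid, range(3, 6), range(5, 8), 0.8)))` would list every qualifying window on a grid with one cell per degree. A production solver would replace the naive search with propagation and smarter heuristics.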
8
CP Solvers
Large variety of methods for exploring a search space
  Branch-and-Cut
  Large Neighborhood Search (LNS)
  Randomized search with Restarts
Highly extensible – important for ad-hoc exploration!
  New constraints/functions
  New search heuristics
But… compared with DBMSs
  In-memory data (CP) vs. efficient disk data handling (DBMS)
  No I/O cost-awareness (CP) vs. cost-based query planning (DBMS)
9
SearchLight
A fusion of CP solvers and DBMSs
  The DBMS stores and maintains data
  The CP solver explores the constrained search space
SearchLight is a mediator
  Extends CP solvers
  Provides buffering, prefetching
  Distributes the search
  Makes CP solvers cost-aware
[Architecture diagram. Components: Exploration Query; CP Solver (OR-tools, Gecode) with Constraints/Functions and Search Heuristics; SearchLight with Metadata and Buffering; DBMS (PostgreSQL, SciDB). Edge labels: “Data, estimates, decisions”; “Requests, Solutions”; “Data, schema info”; “Data requests, constraints”.]
10
Research Issues
A cost model for data-intensive CP
  Each search decision has an I/O cost
Mediation of data access
  Meta-data for guiding and optimizing search (annotated trees, samples, etc.)
  Prefetching
Distributed search
  Multi-node parallel branch processing
CP/DBMS integrated query planning
  Propagating CP/Schema constraints
11
Semantic Windows (SW)
First step towards constraint-based exploration
Supports first-order queries
Exploration via multi-dimensional “windows of interest”
  Shape-based constraints (“a 3-5° by 5-7° region”)
  Content-based constraints (“avg_br() > 0.8”)
Custom distributed cost-aware solver
12
SQL/CP Extensions for Data Exploration
SELECT lb(ra), rb(ra), lb(dec), rb(dec), avg(brightness)
FROM sdss
GRID BY ra BETWEEN 100 AND 300 STEP 1
        dec BETWEEN 5 AND 40 STEP 1
HAVING avg(brightness) > 0.8 AND
       size(ra) = 5 AND
       size(dec) >= 5 AND size(dec) <= 7
13
Cost-aware Solver
Best-first search based on the utility
  Utility = f(benefit, cost)
Benefit – how close a window is to satisfying the constraints
  A distance between the constraint’s value and the estimated value
Cost – how expensive it is to read a window from disk
  Measured in cells we have to read
  Adjustments are made for skewed data
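The slide leaves f(benefit, cost) unspecified, so the following is only a sketch of how a utility-ordered best-first search could look. The linear form `benefit - cost_weight * cost`, the assumption that values lie in [0, 1], the flat candidate list, and the names `benefit`, `utility`, `best_first`, and `read_window` are all invented for illustration.

```python
import heapq


def benefit(estimate, threshold):
    """1.0 if the estimate already satisfies 'value > threshold'; otherwise
    shrink with the distance to the threshold (values assumed in [0, 1])."""
    if estimate > threshold:
        return 1.0
    return max(0.0, 1.0 - (threshold - estimate))


def utility(estimate, threshold, cells_to_read, max_cells, cost_weight=0.5):
    """Assumed form of f(benefit, cost): benefit minus weighted relative cost."""
    cost = cells_to_read / max_cells  # cost measured in cells, normalized
    return benefit(estimate, threshold) - cost_weight * cost


def best_first(candidates, threshold, read_window):
    """candidates: (window_id, estimated_value, cells_to_read) tuples.
    read_window(window_id) returns the exact value (the expensive disk read).
    Yields (window_id, exact_value) for qualifying windows, best utility first."""
    if not candidates:
        return
    max_cells = max(cells for _, _, cells in candidates)
    # heapq is a min-heap, so store negated utilities.
    heap = [(-utility(est, threshold, cells, max_cells), wid)
            for wid, est, cells in candidates]
    heapq.heapify(heap)
    while heap:
        _, wid = heapq.heappop(heap)
        exact = read_window(wid)
        if exact > threshold:
            yield wid, exact
```

Promising, cheap windows are read first, which is what lets results appear early. The slide's adjustment for skewed data is not modeled here.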
14
Optimizations
Cost and benefit are estimated by sampling
Objective function values are cached in a cell cache
  Dynamic utility updates
  Avoiding re-reads of the same cells
Constraint-based pruning during the search
Distributed search
  Multiple nodes work in parallel
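The cell cache idea can be illustrated with a small class. This is a minimal sketch under assumptions: the class name `CellCache`, the `read_cell` callback standing in for a disk read, and the unbounded dictionary are hypothetical, and eviction and dynamic utility updates are left out.

```python
class CellCache:
    """Cache per-cell objective values so overlapping windows do not
    trigger repeated reads of the same cell (illustrative only)."""

    def __init__(self, read_cell):
        self._read_cell = read_cell   # callback that performs the real read
        self._values = {}             # (row, col) -> cached value
        self.disk_reads = 0           # number of cells actually read

    def get(self, cell):
        if cell not in self._values:
            self._values[cell] = self._read_cell(cell)
            self.disk_reads += 1
        return self._values[cell]

    def window_average(self, row, col, height, width):
        cells = [(r, c)
                 for r in range(row, row + height)
                 for c in range(col, col + width)]
        return sum(self.get(cell) for cell in cells) / len(cells)
```

Two overlapping 2-by-2 windows that share two cells cost six reads instead of eight. The saving grows as the search revisits neighboring windows.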
15
Adaptive Prefetching
Dispersed reads hurt total performance
Prefetching: read the neighborhood with every window
Progress-driven prefetching: how much?
  Finding new results? Prefetch a small amount
  No new results? Increase the prefetch exponentially
[Figure: read order of four windows, “No prefetching” vs. “With prefetching”]
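The progress-driven rule on this slide can be written as a one-step update. The sketch below assumes a doubling factor, a base size of one cell, and an upper cap; none of these numbers come from the slides, and the function name `next_prefetch_size` is hypothetical.

```python
def next_prefetch_size(current, found_new_results, base=1, factor=2, maximum=64):
    """Return the prefetch size (in cells around a window) for the next read.

    found_new_results=True  -> fall back to a small prefetch (`base`)
    found_new_results=False -> grow exponentially, capped at `maximum`
    """
    if found_new_results:
        return base
    return min(current * factor, maximum)


def simulate(progress, start=1):
    """Replay a sequence of 'found new results?' flags and record the sizes."""
    sizes, size = [], start
    for found in progress:
        size = next_prefetch_size(size, found)
        sizes.append(size)
    return sizes
```

For example, `simulate([False, False, False, True, False])` gives `[2, 4, 8, 1, 2]`: the prefetch grows while the search is unproductive and resets once results appear.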
16
Online vs. Total Performance Results
35 GB data set (part of the SDSS)
4 GB total memory (1 GB shared buffer)
First results in 10-20 seconds
17
Conclusions
Integrate CP and DBMS technologies
SearchLight: Data-Intensive CP Engine
Initial implementation: Semantic Windows
  Cost-aware solver
  Mediating disk access (sampling, prefetching)
  Distributed search
Current work:
  OR-Tools as the CP solver
  SciDB as the DBMS
18
Questions?
Supported by: [sponsor logos not preserved in the transcript]