
1 Searchlight: Enabling Integrated Search and Exploration over Large Multidimensional Data Alexander Kalinin, Ugur Cetintemel, Stan Zdonik

2 Interactive Data Exploration (IDE)
Searching for the “interesting” within big data
Exploratory analysis: ad-hoc & repetitive
– Questions are not well defined
– “Interesting” can be complex: hard to find, hard to compute
– Fast, online results (human-in-the-loop)
Where’s Waldo? Where’s Horrible Gelatinous Blob?

3 Exploratory Queries: Some Examples
First-order
– “Celestial 3-5° by 5-7° regions with brightness > 0.8”
Higher-order
– “Pairs of 2° by 2° celestial regions with similarity > 0.5”
Optimized
– “Celestial 3° by 7° region with maximum brightness”
Sloan Digital Sky Survey (SDSS)

4 Sub-sequence Matching

5 Two Sides of Data Exploration
Search complexity
– Search space is large: enumeration isn’t feasible
– Constraints are elaborate: more than just ranges
Data complexity
– Large data sets (“big data”): hard to fit in memory
– Expensive computations: functions over a lot of objects

6 DBMSs for IDE?
No native support for exploratory constructs
– No power set
– Limited support for user-defined logic
Poor support for interactivity

7 “Celestial 3-5° by 5-7° regions with average brightness > 0.8” in SQL
1. Divide the data into cells
2. Enumerate all regions
3. Final filtering (> 0.8)
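
To make the cost of this plan concrete, here is a rough stand-in (a hypothetical NumPy sketch, not the talk's SQL) that grids a toy brightness array into 1-degree cells, enumerates every admissible region, and only filters on the average at the very end:

```python
# Hypothetical brute-force version of the three-step plan above.
# Sizes and data are illustrative; the real query runs over SDSS in SciDB.
import numpy as np

rng = np.random.default_rng(0)
data = rng.random((30, 30))              # toy brightness grid, 1 cell = 1 degree

results = []
for w in range(3, 6):                    # widths 3-5 degrees
    for h in range(5, 8):                # heights 5-7 degrees
        for x in range(30 - w + 1):
            for y in range(30 - h + 1):
                region = data[x:x + w, y:y + h]
                if region.mean() > 0.8:  # filtering happens last
                    results.append((x, y, w, h))
print(len(results), "matching regions")
```

Every region is materialized and aggregated before the filter runs, which is exactly the enumeration cost the CP formulation on the next slide avoids.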

8 Data Exploration as a CP Problem
“Celestial 3-5° by 5-7° regions with average brightness > 0.8”
Decision variables: leftmost corner and side lengths
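
A minimal sketch of this formulation, assuming OR-Tools' CP-SAT Python API (Searchlight itself embeds the C++ Or-Tools CP solver inside SciDB). Only the geometric part is modeled here; the avg() brightness condition is a UDF that Searchlight evaluates through its synopsis API:

```python
from ortools.sat.python import cp_model

GRID_W, GRID_H = 20, 20                    # hypothetical grid of 1-degree cells

model = cp_model.CpModel()
x = model.NewIntVar(0, GRID_W - 1, "x")    # leftmost corner
y = model.NewIntVar(0, GRID_H - 1, "y")
lx = model.NewIntVar(3, 5, "lx")           # region is 3-5 degrees wide
ly = model.NewIntVar(5, 7, "ly")           # and 5-7 degrees high
model.Add(x + lx <= GRID_W)                # region must stay inside the array
model.Add(y + ly <= GRID_H)
# In Searchlight, avg(x, y, lx, ly) > 0.8 would be attached here as a UDF constraint.

solver = cp_model.CpSolver()
status = solver.Solve(model)
if status in (cp_model.OPTIMAL, cp_model.FEASIBLE):
    print([solver.Value(v) for v in (x, y, lx, ly)])
```

The search space is the domain of four variables rather than an explicit enumeration of all regions.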

9 CP Solvers
Large variety of methods for exploring a search space
– Branch-and-Cut
– Large Neighborhood Search (LNS)
– Randomized search with restarts
Highly extensible – important for ad-hoc exploration!
– New constraints/functions
– New search heuristics
But… compared with DBMSs
– In-memory data (CP) vs. efficient disk data handling (DBMS)
– No I/O cost-awareness (CP) vs. cost-based query planning (DBMS)

10 CP + DBMS for Data-Intensive Exploration

11 Exploring Alternatives
Setup: 8-node cluster, 120 GB data, 2 GB memory
Large search space, time-limited execution (1 hour):
Approach      First results, s   Subsequent delays, s
Searchlight   5                  6
CP            NA                 NA
SciDB         NA                 NA
Small search space:
Approach      First result, s    Total time, s
Searchlight   4.8                5.13
CP            91                 304
SciDB         301.3              945.3

12 Dynamic Solve-Validate Approach
[Figure: a SciDB instance hosting a CP Solver, a Validator, a Synopsis Array, and the Data Array, with candidate solutions flowing from solver to validator]
CP Solver: runs the CP search process
– Synopsis-only access
– Produces candidate solutions
Validator: validates candidates
– Data access
– Produces real solutions
CP-based Router: provides uniform access to data
– CP Solver → Synopsis, Validator → Data
– Supports transparency
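
A minimal single-process sketch of the Solve-Validate split (the data layout here is assumed; the real system runs inside SciDB): the solver prunes and emits candidates using only coarse per-cell bounds, and the validator re-checks the survivors against the real array.

```python
import numpy as np

THRESHOLD = 0.8
CELL = 4
rng = np.random.default_rng(1)
data = rng.random((16, 16))                       # the "data array"
blocks = data.reshape(4, CELL, 4, CELL)
syn_min = blocks.min(axis=(1, 3))                 # the "synopsis array":
syn_max = blocks.max(axis=(1, 3))                 # per-cell lower/upper bounds

def solve():
    """CP Solver side: enumerate cell-aligned regions using synopsis bounds only."""
    for ci in range(4):
        for cj in range(4):
            if syn_max[ci, cj] <= THRESHOLD:
                continue                          # avg cannot exceed the max: prune
            yield (ci * CELL, cj * CELL, CELL)    # might satisfy avg > 0.8: candidate

def validate(candidates):
    """Validator side: check each candidate against the real data."""
    for i, j, size in candidates:
        if data[i:i + size, j:j + size].mean() > THRESHOLD:
            yield (i, j, size)                    # real solution

print(list(validate(solve())))
```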

13 Synopsis Pyramid
[Figure: example array summarized at 4x4, 2x2, and 1x1 cell resolutions]
Allows approximate answers for data API calls
– Lower and upper bounds, 100% confidence
Synopsis resolution trade-off:
– Coarse synopses are less accurate, but fast
– Fine synopses are very accurate, but slow
Dynamic synopsis choice based on the query region
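
A sketch of how such a pyramid could be built, assuming a plain NumPy array rather than SciDB's chunked storage; each level keeps per-cell min/max/sum/count so that API calls can be answered with bounds instead of raw reads:

```python
import numpy as np

def build_level(data, cell):
    """Summarize `data` into square cells of side `cell` (dims divisible by cell)."""
    h, w = data.shape
    blocks = data.reshape(h // cell, cell, w // cell, cell)
    return {
        "min": blocks.min(axis=(1, 3)),
        "max": blocks.max(axis=(1, 3)),
        "sum": blocks.sum(axis=(1, 3)),
        "count": np.full((h // cell, w // cell), cell * cell),
    }

data = np.random.default_rng(2).random((16, 16))   # toy "brightness" array
# Cell sides 4, 8, 16 give 4x4, 2x2 and 1x1 cell grids, as on the slide.
pyramid = {cell: build_level(data, cell) for cell in (4, 8, 16)}
print(pyramid[16]["max"])                          # coarsest level: one cell
```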

14 Distributed Searchlight in a Nutshell
Layer of CP Solvers
– Search balancing
– Multiple solvers per instance, depending on free CPU cores
Layer of Validators
– Data partitioning
– Multiple validators per instance, depending on free CPU cores
Disjoint layers
– Different number of processes
– No mandatory collocation
– Dynamic allocation

15 Static Search Partitioning
Two-dimensional search space
– Variables: x, y
– Interval domains
Search partitions
– Divide intervals
– Each solver gets a slice
Features
– Works with any heuristic
– Covers hot-spots
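
A small sketch (an assumed helper, not the actual implementation) of slicing one variable's interval domain into per-solver search partitions:

```python
def partition_interval(lo, hi, n_solvers):
    """Split the integer interval [lo, hi] into n roughly equal slices."""
    width = (hi - lo + 1) // n_solvers
    slices = []
    for i in range(n_solvers):
        start = lo + i * width
        end = hi if i == n_solvers - 1 else start + width - 1
        slices.append((start, end))
    return slices

# Each solver then explores its own slice of, say, x with any heuristic it likes.
print(partition_interval(0, 100, 4))   # [(0, 24), (25, 49), (50, 74), (75, 100)]
```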

16 Dynamic Search Balancing
Busy solver (x = [0, 100]):
1. Go to [0, 50]
2. Help available! Push [0, 50] to the helper
3. Go to [50, 100]
Idle solver:
1. Idle!
2. Got [0, 50]
3. Explore it as its own search partition

17 Data Partitioning
[Figure: the data array split across Validator 1, Validator 2, and Validator 3]
No data prefetching
– Fetch only when needed (i.e., on validations)
Data transfer is transparent to validators

18 Other Optimizations
Using synopses for validations
– Query region must be aligned with the grid
Dividing data partitions into zones
– Avoids thrashing
– Validate candidates from recent zones first
Solver/Validator balancing
– Dynamically redistribute CPU between Solvers and Validators
– Many candidates → more Validators; and vice versa
– Utilize idle times for validations

19 SDSS Results
Google’s Or-Tools + SciDB, 80 GB SDSS data
Varying selectivity: grid size, region size, magnitudes
Query   First, s   Min/avg/max delays, s   Total, s
Q1      10         0.001/2/5               4300
Q2      17         –                       132
Q3      24         0.004/6/45              331
Q4      29         0.21/13/29              134

20 Related Work
PackageBuilder (UMass Amherst & NYU Abu Dhabi)
– Sets of tuples with global constraints
– Pruning, local search, MIP
Constraint Programming
– Solvers, parallel search, heuristics, …
DBMSs & Spatial DBMSs
– “Simple” retrieval queries
Content-Based Retrieval (CBR)

21 Ongoing Work
Query planning for search queries
– Higher-order queries
– DBMS integration (e.g., push-down predicates)
Exploring new datasets/constraints
– MIMIC dataset
– Sub-sequence matching

22 Thank you! Questions?

23 Search process for a backtracking CP solver
[Figure: search tree starting from ra = [100, 200], dec = [5, 40]; ra is split into [100, 132], [133, 165], [166, 200] and dec into [5, 16], [17, 28], [29, 40]; one branch narrows to ra = [133, 165], dec = [29, 40], then ra = 133, dec = 29, which fails, so the solver backtracks to ra = [134, 165], dec = [29, 40], and so on]
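
A toy backtracking search mirroring this process (the interval-splitting strategy and the stand-in check below are illustrative assumptions, not Or-Tools' actual heuristics):

```python
def search(domains, check):
    """Backtracking over interval domains: split, descend, backtrack on failure."""
    if all(lo == hi for lo, hi in domains.values()):
        assignment = {var: lo for var, (lo, _) in domains.items()}
        return assignment if check(assignment) else None   # leaf: test, maybe fail
    var = next(v for v, (lo, hi) in domains.items() if lo < hi)
    lo, hi = domains[var]
    mid = (lo + hi) // 2
    for half in ((lo, mid), (mid + 1, hi)):                # left first, backtrack to right
        result = search({**domains, var: half}, check)
        if result is not None:
            return result
    return None                                            # both halves failed

# Stand-in for the data-dependent UDF constraint.
print(search({"ra": (100, 200), "dec": (5, 40)},
             check=lambda a: a["ra"] % 7 == 0 and a["dec"] == 29))
```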

24 “Celestial 3-5° by 5-7° regions with average brightness > 2”
CP “UDFs”: z = avg(…), z = (2, +inf)
– Accesses the data
– Provides min/max values
UDF → Searchlight API calls: aggregate(X1, X2), elem(X)

25 Synopsis answers API calls
[Figure: example array with its upper-bound and lower-bound reconstructions]
elem(0, 0) → [1, 4]
avg(white region) → [m, M], where m is the lower bound and M the upper bound
Synopsis is lossy compression:
– The top-right cell’s values (5, 3, 3, 2) are stored as Min=2, Max=5, Sum=13, Count=4
– The cell’s distribution is unknown. Is it (5, 2, 4, 2)? (2, 5, 2, 4)? (5, 3, 3, 2)?

26 Upper Bound Example
Full cells: a = 10/4 = 2.5
Partial cells, as (max value, covered count) pairs from the synopsis:
– (5, 1), (4, 1), (2, 2)
– (4, 1), (3, 1), (2, 1)
– (3, 1), (1, 1)
Add in descending order of value:
1. 10/4 + (5, 1) = 15/5 = 3
2. 15/5 + (4, 1) = 19/6 = 3.17
3. 19/6 + (4, 1) = 23/7 = 3.29
4. Stop! (the next value, 3, could not raise the average any further)
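
The same greedy procedure as a small function (the numbers are taken from the slide; each `partial` pair is a cell's max value and how many of its elements fall inside the region):

```python
def avg_upper_bound(full_sum, full_count, partial):
    """Upper bound on a region's average when some synopsis cells are only
    partially covered. Assumes every covered element of a partial cell could be
    as large as the cell's max; stops once adding more cannot raise the average."""
    s, c = full_sum, full_count
    for value, count in sorted(partial, reverse=True):   # descending by max value
        if value <= s / c:
            break                                        # would only lower the average
        s += value * count
        c += count
    return s / c

partial_cells = [(5, 1), (4, 1), (2, 2), (4, 1), (3, 1), (2, 1), (3, 1), (1, 1)]
print(avg_upper_bound(10, 4, partial_cells))             # 23/7 = 3.2857..., as above
```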

27 Intuition: Cell Coverage
Cell coverage = area of intersection / area of cell
“Good” coverage: cell is covered 50%
– More or less enough information
“Bad” coverage: cell is covered 25% (< 50%)
– Too little information

28 Dynamic Synopsis Choice
50,000x50,000 array, different synopsis resolutions, query completion times
Search space   1000x1000   100x100   10x10    1000-100-10
Large          NA          4m41s     2h28m    3m
Small          21m30s      15m       6m9s     1m10s
Solver times: 6s → 6m

29 Dynamic Search Balancing
Idle solvers have to report to the coordinator
Coordinator dispatches helpers
– Queue of busy solvers
– Got help? Go to the end of the queue
– Solvers may reject help (e.g., they’re finishing)
Dynamic approach
– Busy solvers might have several helpers
– Helpers might have helpers
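
A simplified, single-process simulation of this protocol (the real system is message-passing across SciDB instances; the class and method names here are made up):

```python
from collections import deque

class Solver:
    def __init__(self, name, interval):
        self.name, self.interval = name, interval
    def split_off(self):
        """Give away the lower half of the remaining work, or reject if almost done."""
        lo, hi = self.interval
        if hi - lo < 10:
            return None                        # finishing soon: reject help
        mid = (lo + hi) // 2
        self.interval = (mid + 1, hi)
        return (lo, mid)
    def accept(self, part):
        self.interval = part
        print(f"{self.name} now explores {part} as its own partition")

class Coordinator:
    def __init__(self):
        self.busy = deque()                    # queue of busy solvers
    def report_idle(self, idle_solver):
        while self.busy:
            busy_solver = self.busy.popleft()
            part = busy_solver.split_off()
            if part is not None:
                self.busy.append(busy_solver)  # got help: go to the end of the queue
                idle_solver.accept(part)
                return

coordinator = Coordinator()
coordinator.busy.append(Solver("solver-1", (0, 100)))
coordinator.report_idle(Solver("solver-2", (0, 0)))   # -> solver-2 now explores (0, 50)
```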

30 Individual solver times
[Figure: per-slice solver times; x-axis: slices, y-axis: time, s]

31 Candidate Zones
[Figure: a data partition divided into Zone 1 through Zone 5, fed by the CP Solver]
For each candidate:
1. Determine its chunks
2. Put it into the zone with most of its chunks
Validator:
– Validates from the same zone
– Recent zones first
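
A sketch of the zone assignment (the chunking scheme and zone contents are assumed for illustration):

```python
def chunks_of(candidate, chunk_size):
    """Chunk ids a candidate region would touch during validation."""
    x, y, lx, ly = candidate
    return {(i // chunk_size, j // chunk_size)
            for i in range(x, x + lx)
            for j in range(y, y + ly)}

def assign_zone(candidate, zones, chunk_size=8):
    """Pick the zone holding the largest share of the candidate's chunks."""
    touched = chunks_of(candidate, chunk_size)
    return max(range(len(zones)), key=lambda z: len(zones[z] & touched))

zones = [{(0, 0), (0, 1)}, {(1, 0), (1, 1)}]       # two zones of one data partition
print(assign_zone((2, 2, 4, 4), zones))            # touches chunk (0, 0) -> zone 0
print(assign_zone((10, 2, 4, 4), zones))           # touches chunk (1, 0) -> zone 1
```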

32 Dynamic Candidate Forwarding
[Figure: a flooded validator on Node 1 with 10,000 candidates forwards 5,000 of them to the validator on Node 2]
Candidate forwarding:
– Might cause data replication
– Needed when validators are flooded
– Only to idle validators
– Forward to recent

33 [Figure: results chart; forwarding overhead < 1 sec]

34 MIMIC
Contains waveforms from the ICU
Two-dimensional array: (patient, time)
Multiple signals: ABP, ECG, etc.
Queries
– Aggregate search (e.g., anomalies)
– Sub-sequence matching (e.g., find a pattern similar to a query sequence)

35 Sub-sequence Matching
Sub-sequence matching
– Distance-based
– Usually: sequence of DFTs → traces → index
– Then, nearest-neighbor retrieval
Applying Searchlight
– Index is a synopsis
– API call: distance between the current area and the query sequence
– Expecting small overhead
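
For reference, the standard DFT-based lower-bounding idea the slide alludes to can be sketched as follows (an illustration in NumPy, not Searchlight's synopsis code): by Parseval's theorem, a distance computed over only the first few DFT coefficients never exceeds the true Euclidean distance, so it can safely prune candidates before touching the raw waveform.

```python
import numpy as np

def lb_distance(x, y, k=4):
    """Lower bound on the Euclidean distance between two equal-length sequences,
    using only their first k DFT coefficients (Parseval's theorem)."""
    X, Y = np.fft.fft(x), np.fft.fft(y)
    return np.sqrt(np.sum(np.abs(X[:k] - Y[:k]) ** 2) / len(x))

rng = np.random.default_rng(3)
query = np.sin(np.linspace(0, 4 * np.pi, 128))
candidate = query + 0.1 * rng.standard_normal(128)
assert lb_distance(query, candidate) <= np.linalg.norm(query - candidate)
print(lb_distance(query, candidate), np.linalg.norm(query - candidate))
```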

36 Distributed Challenges
1. Search space partitioning
2. Data partitioning
3. Where to send candidates?
– Solvers/validators might be disjoint
– We don’t know the data the validation needs

37 Simulating Validations
[Figure: CP Solver, Validator, Router, Data Array, and Access Collector, with candidates flowing between them]
Simulation
1. Candidate is submitted to the validator
2. Validator checks on real data (via the router)
3. Validator “checks” on dumb data: (–inf, +inf)
4. Access collector writes down all accesses
5. Now we know the chunks!
Forwarding
1. Knowing the chunks, choose a validator
2. Forward the candidate to that validator (or keep it)
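
A toy version of the simulation step (all names below are hypothetical): the candidate's reads are replayed against a "dumb" data source that answers every access with (-inf, +inf) and simply records which chunks would have been touched.

```python
import math

class AccessCollector:
    """Stands in for the data array: never prunes, only records accesses."""
    def __init__(self, chunk_size):
        self.chunk_size = chunk_size
        self.chunks = set()
    def elem(self, i, j):
        self.chunks.add((i // self.chunk_size, j // self.chunk_size))
        return (-math.inf, math.inf)            # "dumb" bounds

def simulate_validation(candidate, collector):
    """Replay a candidate's reads to learn the chunks a real validation needs."""
    x, y, lx, ly = candidate
    for i in range(x, x + lx):
        for j in range(y, y + ly):
            collector.elem(i, j)
    return collector.chunks

chunks = simulate_validation((3, 10, 4, 6), AccessCollector(chunk_size=8))
print(chunks)    # knowing these, the candidate can be forwarded to the right validator
```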

38 Other Optimizations
Solver/Validator balancing
– Dynamically redistribute CPU between Solvers and Validators
– Many candidates → more Validators; and vice versa
– Utilize idle times for validations
Candidate relocation
– Will cause data movement, so used rarely
– Relocate only to idle validators
– Try reusing validators

