Interactive Data Exploration using Constraints. Alexander Kalinin, Ugur Cetintemel, Stan Zdonik.

Presentation transcript:

Interactive Data Exploration using Constraints. Alexander Kalinin, Ugur Cetintemel, Stan Zdonik

CP + DBMS for Data Intensive Exploration 2

Interactive Data Exploration (IDE): searching for the “interesting” within big data. Exploratory analysis: ad hoc and repetitive; questions are not well defined; “interesting” can be complex. Human-in-the-loop operation: fast, online results; query refinement. Where’s Waldo? Where’s Horrible Gelatinous Blob? 3

Exploratory Queries: Some examples. First-order: “Celestial 3-5° by 5-7° regions with brightness > 0.8”. Higher-order: “Pairs of 2° by 2° celestial regions with similarity > 0.5”. Optimized: “Celestial 3° by 7° region with maximum brightness”. Sloan Digital Sky Survey (SDSS). 4

“Celestial 3-5° by 5-7° regions with average brightness > 0.8” in SQL: 1. Divide the data into cells. 2. Enumerate all regions. 3. Final filtering (> 0.8). 5
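The three steps on this slide can be sketched as a brute-force enumeration. This is only an illustration of the approach, not the authors' SQL: it assumes the data has already been divided into a grid in which each cell holds one brightness value of equal weight, and the names `prefix_sums` and `enumerate_regions` are hypothetical.

```python
from itertools import product

def prefix_sums(grid):
    """Build 2-D prefix sums so that any rectangle total costs O(1)."""
    rows, cols = len(grid), len(grid[0])
    ps = [[0.0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(rows):
        for j in range(cols):
            ps[i + 1][j + 1] = (grid[i][j] + ps[i][j + 1]
                                + ps[i + 1][j] - ps[i][j])
    return ps

def enumerate_regions(grid, heights, widths, threshold):
    """Step 2: enumerate every region of an allowed shape.
    Step 3: keep those whose average cell value exceeds threshold.
    Assumption: grid[i][j] is the (equally weighted) value of one cell."""
    rows, cols = len(grid), len(grid[0])
    ps = prefix_sums(grid)
    hits = []
    for h, w in product(heights, widths):
        for r in range(rows - h + 1):
            for c in range(cols - w + 1):
                total = (ps[r + h][c + w] - ps[r][c + w]
                         - ps[r + h][c] + ps[r][c])
                avg = total / (h * w)
                if avg > threshold:
                    hits.append((r, c, h, w, avg))
    return hits
```

Even with constant-time averages, the loop visits every position for every allowed shape, which is the cost the exhaustive formulation pays before any filtering happens.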

DBMSs for IDE? No native support for exploratory constructs: no power set, no user-defined objective functions. No support for interactivity: no online results, no notion of a “query session”. 6

Data Exploration as a CP problem: “Celestial 3-5° by 5-7° regions with average brightness > 0.8”. Variables: left-most corner and lengths. 7
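One way to write the example query as a constraint problem is sketched below. The symbols are assumptions introduced for illustration: $(x, y)$ is the left-most corner, $l_x$ and $l_y$ are the lengths, $[x_{\min}, x_{\max}] \times [y_{\min}, y_{\max}]$ is the search area, and $\operatorname{avg\_br}$ denotes the average brightness of the region.

```latex
\begin{align*}
&\text{find } (x, y, l_x, l_y) \in \mathbb{Z}^4 \text{ such that} \\
&3 \le l_x \le 5, \qquad 5 \le l_y \le 7, \\
&x_{\min} \le x, \quad x + l_x \le x_{\max}, \qquad
 y_{\min} \le y, \quad y + l_y \le y_{\max}, \\
&\operatorname{avg\_br}(x, y, l_x, l_y) > 0.8 .
\end{align*}
```

The first two lines of constraints restrict the shape and position of the region; the last one depends on the data, which is why evaluating it requires access to the stored data.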

CP Solvers: a large variety of methods for exploring a search space (Branch-and-Cut, Large Neighborhood Search (LNS), randomized search with restarts). Highly extensible, which is important for ad-hoc exploration: new constraints/functions, new search heuristics. But, compared with DBMSs: in-memory data (CP) vs. efficient disk data handling (DBMS); no I/O cost-awareness (CP) vs. cost-based query planning (DBMS). 8

SearchLight: a fusion of CP solvers and DBMSs. The DBMS stores and maintains data; the CP solver explores the constrained search space. SearchLight is a mediator: it extends CP solvers, provides buffering and prefetching, distributes the search, and makes CP solvers cost-aware. [Architecture diagram: an exploration query enters SearchLight (metadata, buffering), which sits between the CP solver (OR-tools, Gecode; constraints/functions, search heuristics) and the DBMS (PostgreSQL, SciDB). Edge labels: “Data, estimates, decisions”; “Requests, Solutions”; “Data, schema info”; “Data requests, constraints”.] 9

Research Issues: A cost model for data-intensive CP (each search decision has an I/O cost). Mediation of data access: meta-data for guiding and optimizing search (annotated trees, samples, etc.); prefetching. Distributed search: multi-node parallel branch processing. CP/DBMS integrated query planning: propagating CP/schema constraints. 10

Semantic Windows (SW): a first step towards constraint-based exploration. Supports first-order queries. Exploration via multi-dimensional “windows of interest”: shape-based constraints (“a 3-5° by 5-7° region”) and content-based constraints (“avg_br() > 0.8”). Custom distributed cost-aware solver. 11

SQL/CP Extensions for Data Exploration:
  SELECT lb(ra), rb(ra), lb(dec), rb(dec), avg(brightness)
  FROM sdss
  GRID BY ra BETWEEN 100 AND 300 STEP 1
          dec BETWEEN 5 AND 40 STEP 1
  HAVING avg(brightness) > 0.8 AND size(ra) = 5 AND size(dec) >= 5 AND size(dec) <= 7
12

Cost-aware Solver: best-first search based on the utility, Utility = f(benefit, cost). Benefit: how close a window is to satisfying the constraints (a distance between the constraint’s value and the estimated value). Cost: how expensive it is to read a window from disk, measured in cells we have to read. Adjustments are made for skewed data. 13
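A simplified sketch of utility-ordered search follows. The slide only states that Utility = f(benefit, cost), so the concrete forms of `utility` and of the benefit are assumptions chosen for illustration, as are all function and parameter names. The sketch ranks a fixed set of candidate windows and omits window expansion and the skew adjustments.

```python
import heapq

def utility(benefit, cost):
    # Hypothetical choice of f: higher benefit and lower cost rank first.
    return benefit / (1.0 + cost)

def best_first(windows, estimate, read_cost, exact_avg, threshold):
    """Yield qualifying windows, visiting candidates in decreasing utility.

    estimate(w):  cheap estimate of the window's average (e.g. from a sample)
    read_cost(w): number of cells that must be read from disk for w
    exact_avg(w): exact average, computed only when w is actually visited
    """
    heap = []
    for idx, w in enumerate(windows):
        # Benefit shrinks with the distance between the constraint's value
        # and the estimated value (assumed form).
        distance = max(0.0, threshold - estimate(w))
        benefit = 1.0 / (1.0 + distance)
        # heapq is a min-heap, so negate the utility; idx breaks ties.
        heapq.heappush(heap, (-utility(benefit, read_cost(w)), idx, w))
    while heap:
        _, _, w = heapq.heappop(heap)
        if exact_avg(w) > threshold:
            yield w
```

Because results are yielded as soon as a visited window qualifies, promising and cheap windows reach the user first, which matches the online-results goal stated earlier in the talk.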

Optimizations: Cost and benefit are estimated by sampling. Objective function values are cached in a cell cache. Dynamic utility updates. Avoiding re-reads of the same cells. Constraint-based pruning during the search. Distributed search: multiple nodes work in parallel. 14

Adaptive Prefetching: dispersed reads hurt total performance. Prefetching: read the neighborhood with every window. Progress-driven prefetching: how much? Finding new results? Prefetch a small amount. No new results? Increase the prefetch exponentially. [Figure panels: “No prefetching” vs. “With prefetching”.]
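The progress-driven rule can be sketched as a small controller. The slide gives only the policy (small prefetch while results keep arriving, exponential growth otherwise); the class name, the reset-to-base behaviour, the growth factor and the upper bound are assumptions made for this illustration.

```python
class AdaptivePrefetcher:
    """Tracks how many neighboring cells to prefetch with each window read."""

    def __init__(self, base=1, factor=2, max_size=64):
        self.base = base          # small prefetch used while results arrive
        self.factor = factor      # exponential growth factor (assumed)
        self.max_size = max_size  # cap so the prefetch cannot grow unbounded
        self.size = base

    def update(self, found_new_results):
        """Adjust and return the prefetch size after processing one window."""
        if found_new_results:
            # Progress is being made: fall back to a small prefetch.
            self.size = self.base
        else:
            # No progress: grow the prefetch exponentially, up to the cap.
            self.size = min(self.size * self.factor, self.max_size)
        return self.size
```

For example, a run of windows without results grows the prefetch 2, 4, 8, and so on until the cap, and the first window that produces a result drops it back to the base size.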

Online vs. Total Performance Results: 35GB data set (part of the SDSS); 4GB total memory (1GB shared buffer); first results in seconds. 16

Conclusions: Integrate CP and DBMS technologies. SearchLight: data-intensive CP engine. Initial implementation: Semantic Windows (cost-aware solver; mediating disk access via sampling and prefetching; distributed search). Current work: OR-Tools as the CP solver, SciDB as the DBMS. 17

Questions? Supported by: 18