Joining Massive High-Dimensional Datasets


Presentation transcript:

Joining Massive High-Dimensional Datasets Tamer Kahveci Christian A. Lang Ambuj K. Singh Department of Computer Science University of California at Santa Barbara 11/19/2018 Kahveci, Lang, Singh

Motivation: Sample Queries. Join is a fundamental database primitive. Spatial join: find all hotels in California that are within three miles of a recreation area. Sequence join: find all pairs of companies from the New York Exchange and the Tokyo Exchange that have similar closing prices for one month.

Motivation. We assume limited buffer space. Joining two datasets, Hotels (m) and Recreation areas (n), is expensive: I/O cost; CPU cost O(mn). [Figure: Hotels and Recreation areas read through the buffer.]

The Naive Solution: NLJ. Dataset 1 (m pages), Dataset 2 (n pages). Buffer = B = 4. min{m,n} + mn/(B-1) page reads. mn page comparisons. We do not need to compare all page pairs!
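
The NLJ cost estimate on this slide can be checked with a few lines of Python. This is a minimal sketch: the function names are hypothetical, and rounding the second term up to a whole page is an assumption for cases where mn is not divisible by B-1.

```python
import math


def nlj_page_reads(m: int, n: int, B: int) -> int:
    """Page reads for nested-loop join, following min{m,n} + mn/(B-1).

    The ceiling is an assumption; the slide states the formula without rounding.
    """
    if B < 2:
        raise ValueError("the formula needs at least 2 buffer pages")
    return min(m, n) + math.ceil(m * n / (B - 1))


def nlj_page_comparisons(m: int, n: int) -> int:
    """Every page of Dataset 1 is compared with every page of Dataset 2."""
    return m * n


# Hypothetical sizes: m = 6 pages, n = 9 pages, B = 4 as on the slide.
reads = nlj_page_reads(6, 9, 4)
comparisons = nlj_page_comparisons(6, 9)
```

With these hypothetical sizes the estimate is 6 + 54/3 = 24 page reads and 54 page comparisons.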

Outline. Reducing search space: Prediction Matrix. Minimizing I/O cost by clustering: Square Cluster, Cost Cluster. Maximizing buffer reuse. Experimental results.

PM-NLJ. Predict the candidate page pairs using the plane sweep method on an index structure. [Figure: page pairs of Dataset 1 and Dataset 2.]

Prediction of Join

PM-NLJ. Predict the candidate page pairs using the plane sweep method on an index structure. The final estimate is called the Prediction Matrix (PM). Restrict NLJ to the marked entries of the PM. We call this method PM-NLJ.
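
One way to picture the prediction step is the sketch below. It is not the authors' implementation: it assumes each page is summarized by a bounding box (lower corner, upper corner), that the join predicate is "within distance eps", and it sweeps along the first dimension only. The names `Box`, `boxes_within`, and `prediction_matrix` are hypothetical.

```python
from typing import List, Sequence, Set, Tuple

# Assumed page summary: (lower corner, upper corner) of the page's bounding box.
Box = Tuple[Sequence[float], Sequence[float]]


def boxes_within(a: Box, b: Box, eps: float) -> bool:
    """Conservative test: the boxes, grown by eps, overlap in every dimension."""
    (alo, ahi), (blo, bhi) = a, b
    return all(
        al - eps <= bh and bl <= ah + eps
        for al, ah, bl, bh in zip(alo, ahi, blo, bhi)
    )


def prediction_matrix(pages1: List[Box], pages2: List[Box], eps: float) -> Set[Tuple[int, int]]:
    """Return the set of marked entries (i, j) that may contain join results."""
    # Sort Dataset 2 pages by lower bound in dimension 0 so the sweep can stop early.
    order2 = sorted(range(len(pages2)), key=lambda j: pages2[j][0][0])
    marked = set()
    for i, a in enumerate(pages1):
        for j in order2:
            b = pages2[j]
            if b[0][0] > a[1][0] + eps:
                break  # every later page starts even further to the right
            if boxes_within(a, b, eps):
                marked.add((i, j))
    return marked


# Hypothetical 2-D pages.
pages1 = [((0, 0), (1, 1)), ((5, 5), (6, 6))]
pages2 = [((1.5, 0), (2, 1)), ((10, 10), (11, 11)), ((5.5, 5.5), (7, 7))]
pm = prediction_matrix(pages1, pages2, eps=1.0)
```

Because the box test is conservative, the matrix may mark a pair that produces no results, but under these assumptions it does not miss a pair that does.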

PM-NLJ. The number of marked entries = e. Performance improvement rate = mn/e. Is there a better read schedule?

Outline. Reducing search space: Prediction Matrix. Minimizing I/O cost by clustering: Square Cluster, Cost Cluster. Maximizing buffer reuse. Experimental results.

Minimizing Number of I/O: Square Clustering. Let B = 6. For a cluster with m' rows, n' columns and e' marked entries, PM-NLJ reads min{m',n'} + e' = 9 pages, while m' + n' = 6 page reads suffice. Savings = e' - max{m',n'}. Goal: maximize e' and minimize max{m',n'}. With m' + n' = B, this gives m' = n' = B/2.
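
The arithmetic on this slide can be reproduced as follows. The values m' = n' = 3 and e' = 6 are inferred from the slide's numbers (B = 6, 9 reads for PM-NLJ, 6 reads for the cluster); the function names are hypothetical.

```python
def pm_nlj_reads(m_c: int, n_c: int, e_c: int) -> int:
    """Reads when a cluster is processed entry by entry: min{m',n'} + e'."""
    return min(m_c, n_c) + e_c


def cluster_reads(m_c: int, n_c: int) -> int:
    """Reads when all m' + n' pages of the cluster fit in the buffer at once."""
    return m_c + n_c


def savings(m_c: int, n_c: int, e_c: int) -> int:
    """Difference between the two; algebraically equal to e' - max{m',n'}."""
    return pm_nlj_reads(m_c, n_c, e_c) - cluster_reads(m_c, n_c)


# Values inferred from the slide: B = 6, m' = n' = B/2 = 3, e' = 6.
entrywise = pm_nlj_reads(3, 3, 6)
clustered = cluster_reads(3, 3)
saved = savings(3, 3, 6)
```

Here the entry-by-entry schedule costs 9 reads, the clustered schedule costs 6, and the saving of 3 equals e' - max{m',n'} = 6 - 3.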

Minimizing Number of I/O: Square Clustering. O(e) space and time complexity. Can we reduce total I/O cost by reducing the number of random seeks?
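
The slide does not spell out how square clusters are formed. As a stand-in, the sketch below uses a simple fixed tiling: marked entries are grouped into B/2 by B/2 tiles of the prediction matrix, so each cluster touches at most B pages and the grouping takes O(e) time and space. This tiling is an assumption for illustration, not the SC algorithm itself, and `tile_clusters` and `cluster_page_reads` are hypothetical names.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Entry = Tuple[int, int]  # (page of Dataset 1, page of Dataset 2)


def tile_clusters(marked: Iterable[Entry], B: int) -> Dict[Tuple[int, int], List[Entry]]:
    """Group marked PM entries into fixed square tiles of side B // 2."""
    side = B // 2
    if side < 1:
        raise ValueError("need a buffer of at least 2 pages")
    clusters = defaultdict(list)
    for i, j in marked:
        clusters[(i // side, j // side)].append((i, j))
    return dict(clusters)


def cluster_page_reads(entries: Iterable[Entry]) -> int:
    """Distinct pages a cluster touches: its rows plus its columns."""
    entries = list(entries)
    rows = {i for i, _ in entries}
    cols = {j for _, j in entries}
    return len(rows) + len(cols)


# Hypothetical marked entries with B = 6, so tiles are 3 x 3.
marked = [(0, 0), (0, 1), (1, 1), (2, 2), (3, 5), (5, 3)]
clusters = tile_clusters(marked, B=6)
reads = {tile: cluster_page_reads(entries) for tile, entries in clusters.items()}
```

Each tile needs at most B/2 + B/2 = B page reads, so every cluster fits in the buffer; a data-driven clustering could pack marked entries more densely than this fixed grid.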

Minimizing Random Seek Cost: Cost Clustering. The location of the pages is important as well as their number!

Minimizing Random Seek Cost: Cost Clustering. O(e) space complexity. O(e^(3/2)) time complexity.

Outline. Reducing search space: Prediction Matrix. Minimizing I/O cost by clustering: Square Cluster, Cost Cluster. Maximizing buffer reuse. Experimental results.

Maximizing Cache Reuse. B = 5 pages. [Figure: clusters C1 to C5 over Dataset 1 and Dataset 2, with a table of page counts per cluster; sum = 21.] Scenario 1: cluster order = (C4,C1,C3,C5,C2), 5+4+3+2+5 = 19 page reads.

Maximizing Cache Reuse. [Figure: clusters C1 to C5 over Dataset 1 and Dataset 2, with a table of page counts per cluster; sum = 21.] Scenario 1 = 19 page reads. Scenario 2: cluster order = (C4,C2,C1,C3,C5), 5+2+3+3+2 = 15 page reads. What is the best schedule?
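
The effect of cluster order on page reads can be sketched as below. The model is an assumption: each cluster is a set of page identifiers, and pages shared with the cluster processed immediately before are already in the buffer and are not read again. The clusters here are hypothetical and are not the C1 to C5 of the slide, whose page sets are not recoverable from the transcript.

```python
from typing import Dict, FrozenSet, Sequence


def schedule_page_reads(clusters: Dict[str, FrozenSet[str]], order: Sequence[str]) -> int:
    """Total page reads for a cluster order, reusing pages of the previous cluster."""
    total = 0
    previous: FrozenSet[str] = frozenset()
    for name in order:
        pages = clusters[name]
        total += len(pages - previous)  # only pages not already buffered are read
        previous = pages
    return total


# Hypothetical clusters; "r*" are Dataset 1 pages, "s*" are Dataset 2 pages.
clusters = {
    "A": frozenset({"r1", "r2", "s1"}),
    "B": frozenset({"r2", "s1", "s2"}),
    "C": frozenset({"r5", "s7"}),
}
poor = schedule_page_reads(clusters, ["A", "C", "B"])
good = schedule_page_reads(clusters, ["A", "B", "C"])
```

Placing A and B next to each other reuses their two shared pages, so the second order needs 6 reads instead of 8.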

Sharing Graph (SG). [Figure: clusters C1 to C4 over Dataset 1 and Dataset 2 and their sharing graph, with edge weights 1, 2 and 3.]

Finding Best Schedule. Each schedule is a path on the SG. Cache reuse = sum of the weights of the edges of the corresponding path on the SG. Equivalent to TSP. NP-Complete. Use a greedy heuristic to find a good path. [Figure: sharing graph over C1 to C5 with edge weights 2, 1, 1, 1, 3.]
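
The slide does not say which greedy heuristic is used. One plausible variant, shown here as a sketch, starts at a chosen cluster and always moves to the unvisited cluster that shares the most pages with the current one. The graph and its weights are hypothetical (they are not the slide's graph), and `greedy_schedule` and `path_reuse` are hypothetical names.

```python
from typing import Dict, FrozenSet, Iterable, List


def edge(a: str, b: str) -> FrozenSet[str]:
    """Undirected edge key."""
    return frozenset((a, b))


def greedy_schedule(names: Iterable[str], weights: Dict[FrozenSet[str], int], start: str) -> List[str]:
    """Nearest-neighbour style path: always go to the cluster with most shared pages."""
    order = [start]
    remaining = set(names) - {start}
    while remaining:
        current = order[-1]
        # sorted() makes tie-breaking deterministic (alphabetical).
        nxt = max(sorted(remaining), key=lambda c: weights.get(edge(current, c), 0))
        order.append(nxt)
        remaining.remove(nxt)
    return order


def path_reuse(order: List[str], weights: Dict[FrozenSet[str], int]) -> int:
    """Cache reuse of a schedule = sum of edge weights along its path."""
    return sum(weights.get(edge(a, b), 0) for a, b in zip(order, order[1:]))


# Hypothetical sharing graph: weight = number of pages two clusters share.
weights = {edge("A", "B"): 3, edge("B", "C"): 2, edge("A", "C"): 1, edge("C", "D"): 1}
order = greedy_schedule(["A", "B", "C", "D"], weights, start="A")
reuse = path_reuse(order, weights)
```

On this graph the greedy path A, B, C, D reuses 3 + 2 + 1 = 6 pages, while the order A, C, B, D reuses only 3. A greedy choice like this carries no optimality guarantee in general.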

Outline. Reducing search space: Prediction Matrix. Minimizing I/O cost by clustering: Square Cluster, Cost Cluster. Maximizing buffer reuse. Experimental results.

Experimental Setup: Datasets. Low-dimensional data: 2-D road intersections of Long Beach (LBeach) and Montgomery County (MGcounty); 53K and 39K vectors. High-dimensional data: 60-D feature vectors for a satellite image database (landsat); 275K vectors. Sequence data: human chromosome 18 (HChr18) and mouse chromosome 18 (MChr18); 4.2M and 2.3M nucleotides.

Experimental Setup: Compared Techniques. NLJ; Epsilon Grid Order (EGO) [BBKK'01]; BFRJ [HJR'97]; PM-NLJ; Random-SC; SC; CC.

Experimental Setup. Three optimizations tested. OPT 1: reducing the search space by using the PM. OPT 2: clustering. OPT 3: cluster scheduling.

Itemized Cost Analysis. [Figure: itemized cost for opt1, opt2 and opt3; join on MGcounty & LBeach.]

Total Cost Analysis of Various Optimizations. [Figure: self-join on HChr18; x-axis: buffer size (num pages).]

Comparison of SC & CC

Total Cost Analysis. [Figure: join on landsat data; x-axis: buffer size (num pages).]

Scalability Analysis. [Figure: join on landsat data; x-axis: database size (num vectors per database).]

Discussion. We proposed three optimizations for the join operator: prediction matrix, clustering, and buffer recycling. SC is 2 to 86 times faster than competing techniques for spatial databases, and 13 to 133 times faster than competing techniques for sequence databases. SC is very close to the optimal technique (CC).

Future Directions. The solution can be generalized to multi-way joins. Similar optimizations can be applied to NN queries. The approach can be applied to biological data.

Related Work. Join without index: Arge et al 1998; Blasgen et al 1977; Bohm et al 2001; Chan et al 1997; Graefe 1994; Koudas et al 1997; Koudas et al 2000; Orenstein 1986; Patel et al 1996; Shim et al 2002; Xiao et al 2001. Join with index: Bercken et al 2000; Bohm et al 2001; Brinkhoff et al 1993; Gurret et al 2000; Hjaltson et al 1998; Huang et al 1997; Lo et al 1994; Lo et al 1996. THANK YOU

Using Sharing Graph to Determine Cache Reuse. Scenario 1: reuse = 1+1 = 2. Scenario 2: reuse = 3+2+1 = 6. [Figure: the sharing graph over C1 to C5 (edge weights 2, 1, 1, 1, 3) with the path of each scenario.]

Spatial Join Example. [Figure: recreation areas and hotels.]

Reading Pages in a Better Order. 1 seek + 4 page transfers versus 3 seeks + 3 page transfers.
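
The comparison on this slide can be made concrete with a small cost model. The sketch assumes that pages are identified by their position on disk, that each maximal run of consecutive pages costs one seek, and that every page costs one transfer. The page numbers and the unit costs (seek = 10, transfer = 1) are hypothetical.

```python
from typing import Iterable, Tuple


def io_cost(page_ids: Iterable[int], seek_cost: float, transfer_cost: float) -> Tuple[int, int, float]:
    """Return (seeks, transfers, total cost) for reading the given disk pages."""
    pages = sorted(set(page_ids))
    # A new seek is needed whenever a page does not directly follow the previous one.
    seeks = sum(1 for k, p in enumerate(pages) if k == 0 or p != pages[k - 1] + 1)
    transfers = len(pages)
    return seeks, transfers, seeks * seek_cost + transfers * transfer_cost


# Hypothetical page positions matching the two cases on the slide.
contiguous = io_cost([10, 11, 12, 13], seek_cost=10, transfer_cost=1)
scattered = io_cost([10, 12, 14], seek_cost=10, transfer_cost=1)
```

Under these assumed unit costs, four contiguous pages (1 seek + 4 transfers) cost 14, while three scattered pages (3 seeks + 3 transfers) cost 33, which is why the location of pages matters as well as their number.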