Top-k Similarity Join over Multi-valued Objects. Wenjie Zhang, Jing Xu, Xin Liang, Ying Zhang, Xuemin Lin. The University of New South Wales, Australia.

Presentation transcript:

Top-k Similarity Join over Multi-valued Objects. Wenjie Zhang, Jing Xu, Xin Liang, Ying Zhang, Xuemin Lin. The University of New South Wales, Australia

Top-k Similarity Join
From two sets of objects R and S, retrieve the k object pairs from R × S that are the most similar.
Example (figure with objects r1 to r4 and s1 to s3): the top-2 similarity join is (r1, s3), (r2, s1).
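
As a rough illustration of the definition above, the following sketch performs a brute-force top-k similarity join over single-valued objects (plain points). The function name, the dictionary layout and the sample coordinates are assumptions made for the example, not taken from the slides, and Euclidean distance is assumed as the similarity measure.

```python
import heapq
import math

def top_k_similarity_join(R, S, k):
    """Return the k closest (distance, r_id, s_id) triples from R x S.

    R and S map an object id to a point (tuple of floats).
    Brute force: every pair in R x S is examined once.
    """
    pairs = (
        (math.dist(r, s), r_id, s_id)
        for r_id, r in R.items()
        for s_id, s in S.items()
    )
    return heapq.nsmallest(k, pairs)

# Hypothetical sample data (not the points from the slide's figure).
R = {"r1": (0.0, 0.0), "r2": (5.0, 5.0), "r3": (9.0, 0.0)}
S = {"s1": (5.0, 6.0), "s2": (0.5, 0.0)}
result = top_k_similarity_join(R, S, 2)
```

The index-based algorithms listed later in the talk avoid enumerating all of R × S; this version only pins down what the query returns.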

Top-k Similarity Join
What if objects have multiple values? (Figure: instances of Player A and Player B.)

Existing Models for Multi-instances
Probabilistic Top-k Join
– Probability is computed by summing up the probabilities of all valid possible worlds
– Join over Uncertain Data: [Cheng et al. CIKM 06] [Kriegel et al. DASFAA 06]
Join over Uncertain Data Streams: [Lian, Chen. TKDE 11]

Probabilistic Top-k Similarity Join
Top-1 similarity pair from {q} × {two candidate objects shown in the figure}: the two candidate pairs with q have the same probability of being the top-1 similarity join result.

Probabilistic Top-k Similarity Join (continued)
The two candidate pairs with q have the same probability of being the top-1 similarity join result.

Probabilistic Top-k Similarity Join
Ties
Not very sensitive to the relative distribution of object instances
Different semantics from multi-valued objects!

Uncertain Objects vs. Multi-valued Objects
Uncertain Objects: the instances are exclusive; at most one instance can be true
Multi-valued Objects: the instances exist simultaneously; all are true at the same time

Uncertain Objects vs. Multi-valued Objects
Multi-valued Objects
– Ranking a school in a university: with many staff members
– Comparing the weather of two cities: statistics from 365 days
– … …

Aggregate Distances
Existing: mean, minimal, maximal, etc.
Generalization: quantile distance

Contributions
Formally define the top-k similarity join over multi-valued objects
Seeding-Pruning-Refinement framework
– Novel and efficient pruning techniques

Preliminaries
Φ-quantile:
– The first element in a sorted list whose cumulative weight is not smaller than Φ, where Φ is a number in (0, 1].
Example (figure): sorted elements with their weights and cumulative weights (e.g. 0.45), with quantiles such as the 0.8-quantile marked.
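
The Φ-quantile definition can be sketched in a few lines. The element values and weights below are made up for illustration (the numbers in the slide's figure did not survive in the transcript), and the small tolerance is an assumption added to absorb floating-point rounding.

```python
def phi_quantile(elements, weights, phi):
    """Return the first element, in sorted order, whose cumulative weight is >= phi."""
    if not 0 < phi <= 1:
        raise ValueError("phi must be in (0, 1]")
    cumulative = 0.0
    for value, weight in sorted(zip(elements, weights)):
        cumulative += weight
        # Tolerance guards against rounding when the weights sum to exactly phi.
        if cumulative >= phi - 1e-12:
            return value
    raise ValueError("weights sum to less than phi")

# Hypothetical data: cumulative weights are 0.2, 0.45, 0.6, 0.9, 1.0.
elements = [1, 2, 3, 4, 5]
weights = [0.2, 0.25, 0.15, 0.3, 0.1]
q04 = phi_quantile(elements, weights, 0.4)
q08 = phi_quantile(elements, weights, 0.8)
```

With these weights the 0.4-quantile is the second element and the 0.8-quantile the fourth, since those are the first positions where the cumulative weight reaches Φ.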

Preliminaries
Multi-valued Object:
– A set of weighted instances in d-dimensional space
– The sum of the weights equals 1

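Preliminaries: Quantile Distances
Φ-quantile distance (example from the figure)
– Instance weights: q1 = ½, q2 = ¼, q3 = ¼; u1 = ½, u2 = ½
– Instance pairs with their weights: (q2, u1) 1/8; (q3, u1) 1/8; (q3, u2) 1/8; (q1, u1) 1/4; (q2, u2) 1/8; (q1, u2) 1/4
– The figure marks quantiles over these pairs, including the 0.5-quantile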
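
A minimal sketch of the Φ-quantile distance between two multi-valued objects follows. It assumes Euclidean distance between instances and weights each instance pair by the product of its two instance weights, which matches the pair weights on the slide. The instance weights are those of the slide's example, but the coordinates are hypothetical, and the code sorts all pairs rather than using the O(m) algorithm of [Zhang et al, ICDE 2010].

```python
import math
from itertools import product

def quantile_distance(U, V, phi):
    """Phi-quantile of the weighted pairwise instance distances of U and V.

    U and V are lists of (point, weight); the weights of each object sum to 1.
    The instance pair (u, v) carries weight w(u) * w(v).
    """
    pairs = sorted(
        (math.dist(u, v), wu * wv) for (u, wu), (v, wv) in product(U, V)
    )
    cumulative = 0.0
    for dist, weight in pairs:
        cumulative += weight
        if cumulative >= phi - 1e-12:  # tolerance for rounding
            return dist
    return pairs[-1][0]

# Weights as on the slide; coordinates are made up for the example.
Q = [((0.0, 0.0), 0.5), ((1.0, 0.0), 0.25), ((2.0, 0.0), 0.25)]
U = [((0.0, 3.0), 0.5), ((4.0, 0.0), 0.5)]
d_half = quantile_distance(Q, U, 0.5)
d_max = quantile_distance(Q, U, 1.0)
```

Setting Φ = 1 returns the maximal instance-pair distance, and a very small Φ returns the minimal one, which is the sense in which the quantile distance generalizes those aggregates.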

Preliminaries: Quantile Distances
Multi-valued objects U and V
– Overall m = |U| * |V| instance pairs
– O(m) algorithm with pruning [Zhang et al, ICDE 2010]

Preliminaries: Top-k Similarity Join
Non-index [Cheema et al, ICDE 2011]
R-tree based
– Exhaustive algorithm [Corral et al, SIGMOD 2000]
– Recursive algorithm [Corral et al, SIGMOD 2000]
– Heap algorithm [Corral et al, SIGMOD 2000]
– Priority query algorithm [Hjaltason et al, SIGMOD 1998]

Problem Definition

Framework
Seeding
– Select k object pairs and compute their quantile distances (the k seeds are selected based on the mean)
– λ_k: the largest distance of the k seeded object pairs
Refinement:
– Determine the final solution
– Challenge: efficient pruning techniques
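
The seeding step can be sketched as below. The slide only says the seeds are "selected based on mean"; this sketch assumes that means picking the k object pairs whose weighted mean points are closest, then computing their exact quantile distances to obtain λ_k. All function names and the sample data are hypothetical.

```python
import heapq
import math
from itertools import product

def quantile_distance(U, V, phi):
    """Sort-based phi-quantile distance; pair weight = product of instance weights."""
    pairs = sorted(
        (math.dist(u, v), wu * wv) for (u, wu), (v, wv) in product(U, V)
    )
    cumulative = 0.0
    for dist, weight in pairs:
        cumulative += weight
        if cumulative >= phi - 1e-12:
            return dist
    return pairs[-1][0]

def weighted_mean(obj):
    """Weighted mean point of a multi-valued object given as [(point, weight), ...]."""
    dims = len(obj[0][0])
    return tuple(sum(p[i] * w for p, w in obj) for i in range(dims))

def seed_lambda_k(R, S, k, phi):
    """Pick k seed pairs by mean distance (an assumption) and return (lambda_k, seeds)."""
    mean_r = {rid: weighted_mean(o) for rid, o in R.items()}
    mean_s = {sid: weighted_mean(o) for sid, o in S.items()}
    candidates = (
        (math.dist(mr, ms), rid, sid)
        for rid, mr in mean_r.items()
        for sid, ms in mean_s.items()
    )
    seeds = heapq.nsmallest(k, candidates)
    exact = [quantile_distance(R[rid], S[sid], phi) for _, rid, sid in seeds]
    return max(exact), [(rid, sid) for _, rid, sid in seeds]

# Hypothetical objects: each is a list of (point, weight) instances.
R = {"a": [((0.0, 0.0), 0.5), ((2.0, 0.0), 0.5)], "b": [((10.0, 10.0), 1.0)]}
S = {"x": [((1.0, 1.0), 1.0)], "y": [((10.0, 12.0), 1.0)]}
lambda_k, seeds = seed_lambda_k(R, S, 2, 0.5)
```

Because λ_k is the largest exact distance among k real pairs, it is an upper bound on the k-th best distance in the final answer, which is what makes it usable as a pruning threshold.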

Φ-quantile Top-k Join

Φ-quantile Top-k Join
Index structure (figure): a global sR-tree over local aR-trees, one for each object

Φ-quantile Top-k Join
Based on the Heap Algorithm
– Start from the root pairs of the two global sR-trees
Apply the pruning techniques to each entry pair

Φ-quantile Top-k Join
Distance-based pruning rule on the global sR-tree
For entries E_U and E_V from the global sR-trees (figure): if the distance between them is ≥ λ_k, the pair (E_U, E_V) can be pruned
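
The distance shown in the figure is not recoverable from the transcript. A common choice for such a rule is the minimum distance between the two bounding rectangles, and the sketch below assumes exactly that: if even the closest possible points of the two rectangles are at least λ_k apart, every instance pair underneath them is too. The (low corner, high corner) box representation and the function names are illustrative.

```python
import math

def min_dist(mbr_a, mbr_b):
    """Smallest Euclidean distance between two axis-aligned boxes given as (lo, hi)."""
    (lo_a, hi_a), (lo_b, hi_b) = mbr_a, mbr_b
    # Per-dimension gap between the boxes; 0 when they overlap in that dimension.
    gaps = (
        max(la - hb, lb - ha, 0.0)
        for la, ha, lb, hb in zip(lo_a, hi_a, lo_b, hi_b)
    )
    return math.sqrt(sum(g * g for g in gaps))

def can_prune_by_distance(mbr_u, mbr_v, lambda_k):
    """Prune the entry pair when no instance pair can be closer than lambda_k."""
    return min_dist(mbr_u, mbr_v) >= lambda_k

# Hypothetical entries: gaps are 3 in x and 4 in y, so min_dist is 5.
E_U = ((0.0, 0.0), (1.0, 1.0))
E_V = ((4.0, 5.0), (6.0, 7.0))
pruned = can_prune_by_distance(E_U, E_V, 5.0)
kept = can_prune_by_distance(E_U, E_V, 6.0)
```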

Φ-quantile Top-k Join
Statistic-based pruning rule on the global sR-tree

Φ-quantile Top-k Join
Weight-based pruning rule on the local aR-tree

Φ-quantile Top-k Join
Pruning on the local R-trees of U and V
– Trim the local R-trees of both U and V by λ_k, level by level. Discard entries in V whose minimal distance to U is larger than λ_k, and similarly for U.
– Denote the total weights of the remaining entries as P_U and P_V, respectively. If P_U * P_V ≤ Φ, then (U, V) can be pruned.
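
The weight-based rule can be illustrated at the instance level, without the local R-trees, as a simplification. Instances whose nearest counterpart in the other object is farther than λ_k cannot take part in any pair within λ_k, so the surviving weights bound the cumulative weight of such pairs by P_U * P_V. The slide states the test as P_U * P_V ≤ Φ; this sketch uses the strict form, which is the conservative choice because at exact equality the surviving pairs could still reach cumulative weight Φ on their own. Names and data are hypothetical.

```python
import math

def can_prune_by_weight(U, V, lambda_k, phi):
    """Instance-level version of the weight-based pruning test.

    U and V are lists of (point, weight). An instance survives only if some
    instance of the other object lies within lambda_k of it.
    """
    def surviving_weight(A, B):
        return sum(
            w for p, w in A
            if min(math.dist(p, q) for q, _ in B) <= lambda_k
        )

    p_u = surviving_weight(U, V)
    p_v = surviving_weight(V, U)
    # Strict comparison: a conservative variant of the slide's condition.
    return p_u * p_v < phi

# Hypothetical objects: only one instance of each is near the other object.
U = [((0.0, 0.0), 0.5), ((100.0, 0.0), 0.5)]
V = [((1.0, 0.0), 0.5), ((200.0, 0.0), 0.5)]
pruned = can_prune_by_weight(U, V, 2.0, 0.5)   # P_U * P_V = 0.25 < 0.5
kept = can_prune_by_weight(U, V, 2.0, 0.25)    # 0.25 is not < 0.25
```

In the second call the pair must be kept: the single close instance pair has weight 0.25 and distance 1, so the 0.25-quantile distance is within λ_k.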

Φ-quantile Top-k Join
Overall processing: for each pair of entries
– 1. Pruning on the global sR-tree based on distance
– 2. Pruning on the global sR-tree based on statistics
– 3. Pruning on the local R-tree based on weight
– If not pruned, insert into the heap; repeat until the heap is empty
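
The processing order above can be sketched end to end under heavy simplification. This version scans all object pairs in a nested loop instead of traversing R-trees with a heap, starts with λ_k at infinity instead of seeding, and skips the statistics-based rule (its details are not in the transcript). It applies the assumed bounding-box distance test, then an instance-level weight test, then the exact quantile distance. All names and data are hypothetical.

```python
import heapq
import math
from itertools import product

def quantile_distance(U, V, phi):
    pairs = sorted((math.dist(u, v), wu * wv) for (u, wu), (v, wv) in product(U, V))
    cumulative = 0.0
    for dist, weight in pairs:
        cumulative += weight
        if cumulative >= phi - 1e-12:
            return dist
    return pairs[-1][0]

def mbr(obj):
    """Bounding box (lo, hi) of an object's instances."""
    pts = [p for p, _ in obj]
    return tuple(map(min, zip(*pts))), tuple(map(max, zip(*pts)))

def min_dist(a, b):
    (lo_a, hi_a), (lo_b, hi_b) = a, b
    return math.sqrt(sum(max(la - hb, lb - ha, 0.0) ** 2
                         for la, ha, lb, hb in zip(lo_a, hi_a, lo_b, hi_b)))

def weight_prunable(U, V, lambda_k, phi):
    def surviving(A, B):
        return sum(w for p, w in A
                   if min(math.dist(p, q) for q, _ in B) <= lambda_k)
    return surviving(U, V) * surviving(V, U) < phi

def top_k_join(R, S, k, phi):
    best = []  # max-heap of the current top-k via negated distances
    for (rid, U), (sid, V) in product(R.items(), S.items()):
        lambda_k = -best[0][0] if len(best) == k else math.inf
        if min_dist(mbr(U), mbr(V)) >= lambda_k:   # step 1: distance
            continue
        if weight_prunable(U, V, lambda_k, phi):   # step 3: weight
            continue
        d = quantile_distance(U, V, phi)           # refinement
        if d < lambda_k:
            if len(best) == k:
                heapq.heapreplace(best, (-d, rid, sid))
            else:
                heapq.heappush(best, (-d, rid, sid))
    return sorted((-nd, rid, sid) for nd, rid, sid in best)

R = {"a": [((0.0, 0.0), 0.5), ((2.0, 0.0), 0.5)], "b": [((10.0, 10.0), 1.0)]}
S = {"x": [((1.0, 1.0), 1.0)], "y": [((10.0, 12.0), 1.0)]}
result = top_k_join(R, S, 2, 0.5)
```

The cheap tests run first so that the exact quantile distance, which touches all |U| * |V| instance pairs, is computed only for pairs that survive them.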

Experiment
Real Dataset
– NBA player statistics
– 339,721 records of 1,313 players
– 3 dimensions
Synthetic Dataset
– Size: 5k to 15k (5k)
– Instances: 100 to 800 (200)
– Dimension: 2 to 5 (3)
– K: 5 to 25 (10)

Overall Performance
KNN refers to naively applying the KNN processing techniques over multi-valued objects from [Zhang et al, ICDE 2010]

Different Settings

I/O Costs

Conclusions
Top-k Similarity Join over multi-valued objects
Novel and efficient pruning techniques

Future Work
Fundamental spatial queries over multi-valued objects
– KNN [Zhang et al, ICDE 2010]
– Top-k Similarity Join [This paper]
– Skyline
– Range query, range aggregate query

Thanks