The Sweet Spot between Inverted Indices and Metric-Space Indexing for Top-K–List Similarity Search. Evica Milchevski, Avishek Anand★ and Sebastian Michel.

L3S Research Center University of Hanover Germany
Presentation transcript:

The Sweet Spot between Inverted Indices and Metric-Space Indexing for Top-K–List Similarity Search. Evica Milchevski, Avishek Anand★ and Sebastian Michel. University of Kaiserslautern; ★L3S Research Center. contact:

Dating Portal [1] [2]

Motivation: Users

Problem Overview
– Task: efficiently retrieve all rankings that are close to the query ranking q.

Formal Problem Statement
– Set of rankings T = {r_1, r_2, …, r_n}; D_{r_i} is the domain of r_i
– Distance function d(r_j, r_i)
– Query ranking q
– Threshold θ
– Result: rankings r_1, r_2, …, r_m with distance d(q, r_i) ≤ θ
– Top-k list = ranking

Footrule Distance for Top-k Lists [Fagin et al. ’03, SIAM J. Discrete Math]
– For top-k rankings: r_1(i) is the rank of item i in ranking r_1; it is set to k+1 when i is not in D_1, and vice versa for r_2
– Example: r_1 = (i_1, i_2, i_3, i_4, i_7), r_2 = (i_3, i_2, i_4, i_9, i_1); i_7 and i_9 are not in the overlap of D_1 and D_2
– Adding up the rank differences item by item gives the running totals 4, 6, 7, 8, and finally F(r_1, r_2) = 10
– The Footrule distance is a metric
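The example on this slide can be checked with a few lines of Python. This is a minimal sketch, assuming the usual top-k Footrule definition F(r_1, r_2) = Σ |r_1(i) − r_2(i)| over all items in D_1 ∪ D_2, with rank k+1 for a missing item. The function names and the normalization by k(k+1) (the distance of two disjoint lists) are assumptions made for illustration; the transcript only states that distances are normalized to [0, 1].

```python
def footrule(r1, r2):
    """Footrule distance between two top-k lists of equal length k.

    An item missing from a list is treated as if it had rank k + 1.
    """
    k = len(r1)
    assert len(r2) == k, "both rankings must have the same length"
    rank1 = {item: pos for pos, item in enumerate(r1, start=1)}
    rank2 = {item: pos for pos, item in enumerate(r2, start=1)}
    return sum(abs(rank1.get(i, k + 1) - rank2.get(i, k + 1))
               for i in rank1.keys() | rank2.keys())


def normalized_footrule(r1, r2):
    """Assumed normalization: divide by k(k+1), the distance of disjoint lists."""
    k = len(r1)
    return footrule(r1, r2) / (k * (k + 1))


r1 = ["i1", "i2", "i3", "i4", "i7"]
r2 = ["i3", "i2", "i4", "i9", "i1"]
print(footrule(r1, r2))  # 10, as on the slide
```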

Outline
– Motivation and Introduction
– Problem Statement and Basic Indexing
– Hybrid Index: Inv. Index vs. Metric Index
– Cost Model for Automated Performance Tuning
– Experimental Evaluation
– Conclusion and Outlook

How can we approach the problem?

Rankings as Sets
– Treat rankings as plain sets, e.g. r_1 = {i_1, i_2, i_3, i_4}
– Inverted index: each item (i_1, i_2, i_5, i_8, i_9, …) points to the rankings that contain it
– Efficiently find all the candidate rankings
– Compute the distance function
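The two steps on this slide (collect candidates from the inverted index, then compute the distance for each candidate) can be sketched as follows. This is an illustrative sketch only: the function names, the dictionary-based index, and the use of an unnormalized threshold are assumptions, not the authors' Java implementation.

```python
from collections import defaultdict


def footrule(r1, r2):
    """Top-k Footrule distance; missing items get rank k + 1."""
    k = len(r1)
    rank1 = {item: pos for pos, item in enumerate(r1, start=1)}
    rank2 = {item: pos for pos, item in enumerate(r2, start=1)}
    return sum(abs(rank1.get(i, k + 1) - rank2.get(i, k + 1))
               for i in rank1.keys() | rank2.keys())


def build_inverted_index(rankings):
    """Map every item to the ids of the rankings that contain it."""
    index = defaultdict(set)
    for rid, ranking in rankings.items():
        for item in ranking:
            index[item].add(rid)
    return index


def filter_and_validate(q, theta, rankings, index):
    """Return ids of rankings with footrule(q, r) <= theta (unnormalized).

    Filtering: only rankings sharing at least one item with q are candidates.
    Validation: the exact distance is computed for every candidate.
    """
    candidates = set()
    for item in q:
        candidates |= index.get(item, set())
    return sorted(rid for rid in candidates
                  if footrule(q, rankings[rid]) <= theta)


rankings = {
    "r1": ["i1", "i2", "i3", "i4", "i7"],
    "r2": ["i3", "i2", "i4", "i9", "i1"],
    "r3": ["i1", "i2", "i3", "i7", "i4"],
    "r4": ["i5", "i6", "i8", "i9", "i7"],
}
index = build_inverted_index(rankings)
print(filter_and_validate(rankings["r1"], 2, rankings, index))  # ['r1', 'r3']
```

Rankings that share no item with q are never touched, which is safe as long as the threshold is below the distance of two disjoint lists.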

Metric Index Structures
– Footrule distance is a metric
– Metric index structure over the rankings
– Reduce the search space

Drawbacks
– Inverted index: we need to validate each candidate ranking
– Metric index structures (BK-tree): even worse

Advantages
– Inverted index: efficiently filter out rankings having no overlap with q
– Metric index structures: precompute distances at construction time; efficiently prune the search space using the triangle inequality
– Combine the advantages of both approaches
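The triangle-inequality pruning mentioned for metric index structures can be illustrated with a small BK-tree, the structure named on the preceding slide. This is a generic textbook-style sketch, not the authors' implementation; the class and method names are assumptions. Distances are stored on the edges at construction time, and a range query only descends into children whose edge label e satisfies |d(q, node) − e| ≤ θ.

```python
def footrule(r1, r2):
    """Top-k Footrule distance; missing items get rank k + 1."""
    k = len(r1)
    rank1 = {item: pos for pos, item in enumerate(r1, start=1)}
    rank2 = {item: pos for pos, item in enumerate(r2, start=1)}
    return sum(abs(rank1.get(i, k + 1) - rank2.get(i, k + 1))
               for i in rank1.keys() | rank2.keys())


class BKTree:
    """BK-tree over an integer-valued metric; a node is (item, {distance: child})."""

    def __init__(self, dist):
        self.dist = dist
        self.root = None

    def add(self, item):
        if self.root is None:
            self.root = (item, {})
            return
        node = self.root
        while True:
            d = self.dist(item, node[0])  # precomputed once, stored as edge label
            if d not in node[1]:
                node[1][d] = (item, {})
                return
            node = node[1][d]

    def range_query(self, q, theta):
        """Return all stored items within distance theta of q."""
        results = []
        stack = [self.root] if self.root is not None else []
        while stack:
            item, children = stack.pop()
            d = self.dist(q, item)
            if d <= theta:
                results.append(item)
            for edge, child in children.items():
                # Triangle inequality: a subtree under edge label `edge` can
                # only contain results if |d - edge| <= theta.
                if d - theta <= edge <= d + theta:
                    stack.append(child)
        return results


tree = BKTree(footrule)
for r in [("i1", "i2", "i3", "i4", "i7"), ("i3", "i2", "i4", "i9", "i1"),
          ("i1", "i2", "i3", "i7", "i4"), ("i5", "i6", "i8", "i9", "i7")]:
    tree.add(r)
print(sorted(tree.range_query(("i1", "i2", "i3", "i4", "i7"), 2)))
```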

Approach: Coarse Index
– Rankings are grouped into partitions around medoids; θ_c is the partitioning threshold
– Each partition is kept in a metric index structure
– The medoids (r_i, r_j, r_k, r_m, …) are indexed in an inverted index over the items i_1, i_2, …, i_l

Approach: Coarse Index
– Benefits: use the filtering power of the inverted index; validate rankings using the metric index

Coarse Index: Querying
– Input: q, θ
– Result: r_1, …, r_m, where F(q, r_i) ≤ θ
– The inverted index over the medoids is queried with the relaxed threshold θ + θ_c; the partitions of the retrieved medoids are then searched with θ
– The relaxed threshold avoids missing results with d ≤ θ that are represented by a medoid with d > θ
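A compact sketch of the coarse index and its query procedure is given below. Several parts are assumptions made for illustration: the greedy partitioning (a ranking joins the first medoid within θ_c, otherwise it becomes a new medoid), the class and attribute names, unnormalized thresholds, and a plain scan of each partition in place of a metric index. What it does preserve is the querying logic from the slide: medoids are retrieved with θ + θ_c, partition members are validated with θ.

```python
from collections import defaultdict


def footrule(r1, r2):
    """Top-k Footrule distance; missing items get rank k + 1."""
    k = len(r1)
    rank1 = {item: pos for pos, item in enumerate(r1, start=1)}
    rank2 = {item: pos for pos, item in enumerate(r2, start=1)}
    return sum(abs(rank1.get(i, k + 1) - rank2.get(i, k + 1))
               for i in rank1.keys() | rank2.keys())


class CoarseIndex:
    def __init__(self, rankings, theta_c):
        self.rankings = rankings
        self.theta_c = theta_c
        self.partitions = {}               # medoid id -> member ids (incl. medoid)
        self.inverted = defaultdict(set)   # item -> ids of medoids containing it
        for rid, ranking in rankings.items():
            for mid in self.partitions:
                if footrule(ranking, rankings[mid]) <= theta_c:
                    self.partitions[mid].append(rid)
                    break
            else:                          # no medoid close enough: new partition
                self.partitions[rid] = [rid]
                for item in ranking:
                    self.inverted[item].add(rid)

    def query(self, q, theta):
        k = len(q)
        # Medoids disjoint from q are skipped; that is only safe below k(k+1).
        assert theta + self.theta_c < k * (k + 1)
        medoids = set()
        for item in q:
            medoids |= self.inverted.get(item, set())
        results = []
        for mid in medoids:
            # F(q, r) <= theta and F(r, medoid) <= theta_c imply
            # F(q, medoid) <= theta + theta_c by the triangle inequality.
            if footrule(q, self.rankings[mid]) <= theta + self.theta_c:
                results.extend(rid for rid in self.partitions[mid]
                               if footrule(q, self.rankings[rid]) <= theta)
        return sorted(results)


rankings = {
    "r1": ["i1", "i2", "i3", "i4", "i7"],
    "r2": ["i3", "i2", "i4", "i9", "i1"],
    "r3": ["i1", "i2", "i3", "i7", "i4"],
    "r4": ["i5", "i6", "i8", "i9", "i7"],
    "r5": ["i6", "i5", "i8", "i9", "i7"],
}
coarse = CoarseIndex(rankings, theta_c=4)
print(coarse.query(rankings["r1"], theta=2))  # ['r1', 'r3']
```

With θ_c = 4 the five rankings fall into three partitions (medoids r1, r2, r4), so only three rankings are indexed in the inverted index.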

Performance Sweet Spot
– Trade-off between inverted index and metric index
– Large θ_c: few, large partitions, so few medoids in the inverted index
– Small θ_c: many, small partitions, so many medoids in the inverted index
– Which θ_c results in the best performance?

Cost Model
We estimate:
1. Cost for querying the inverted index (filtering cost)
– Size of the posting list
2. Cost for validating the partitions (validation cost)
– Size of partitions
– Number of partitions to be queried
Assumption: the distribution of pairwise distances is known
Which θ_c results in the best performance?

Cost for Validating the Partitions
– Number of partitions to be queried [Flajolet et al. ’92]
– Number of medoids that capture the rankings
– Coupon collector problem [3]
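The slide names the coupon collector problem, but the transcript does not preserve the formulas the authors derive from it. Purely as a reminder of the two textbook quantities behind that problem (and not as the authors' cost formula), the sketch below computes the expected number of uniform draws needed to see all m coupon types, m · H_m, and the expected number of distinct types seen after n draws, m · (1 − (1 − 1/m)^n). The function names are hypothetical.

```python
def expected_draws_to_see_all(m):
    """Expected number of uniform draws until all m coupon types appear: m * H_m."""
    return m * sum(1.0 / j for j in range(1, m + 1))


def expected_distinct_after(m, n):
    """Expected number of distinct coupon types seen after n uniform draws."""
    return m * (1.0 - (1.0 - 1.0 / m) ** n)


print(expected_draws_to_see_all(2))   # 3.0
print(expected_distinct_after(4, 1))  # 1.0
```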

Cost Model: We can find the sweet spot

Inverted Index Access & Optimizations
– For a query threshold θ, rankings need to have at least an overlap of w with q; otherwise L(r, q) > θ
– L(r, q): lowest distance; k: size of rankings
– Example: q = (i_1, i_5, i_9, i_2, i_8), r = (i_2, i_6, i_7, i_3, i_4)

Pruning by Query-Ranking Overlap
– Example: θ = 0.2, k = 5, q_1 = (i_1, i_5, i_9, i_2, i_8)
– The resulting rankings must have an overlap of at least 4 items
– Resembles the idea of prefix filtering methods [Wang et al. ’12]
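The overlap bound can be worked out under two assumptions that the transcript does not state explicitly: the Footrule distance uses rank k+1 for missing items, and θ is normalized by k(k+1). With overlap w, the lowest possible distance is reached when the w shared items agree on the top w ranks and the k − w unshared items of each list sit at the bottom, which gives L = (k − w)(k − w + 1). The sketch below (hypothetical function names) searches for the smallest w whose lowest distance still fits the threshold. Under these assumptions a non-strict comparison yields 3 for θ = 0.2, k = 5, while a strict comparison yields the 4 quoted on the slide; which convention the authors used is not recoverable from the transcript.

```python
def lowest_distance(k, w):
    """Smallest Footrule distance of two top-k lists sharing exactly w items."""
    x = k - w
    return x * (x + 1)


def min_overlap(k, theta, strict=False):
    """Smallest overlap w such that a result within theta is still possible.

    theta is assumed to be normalized by k(k+1). Returns None when no overlap
    satisfies the threshold (only possible with strict=True and theta == 0).
    """
    budget = theta * k * (k + 1)
    for w in range(k + 1):
        d = lowest_distance(k, w)
        if d < budget or (not strict and d == budget):
            return w
    return None


print(min_overlap(5, 0.2))               # 3
print(min_overlap(5, 0.2, strict=True))  # 4
```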

Experiments
Datasets:
– New York Times (NYT): 1 million rankings, generated by executing keyword queries on the NYT corpus
– Yago rankings: the facts in Yago are used to create rankings [Ilieva et al. ’13, CIKM], e.g. buildings located in New York, ranked by height

Experiments
Algorithms:
– Baseline approaches: Filter and Validate (F&V); Merge of Id-Sorted Lists (ListMerge)
– Competitors: AdaptSearch [Wang et al. ’12, SIGMOD]; Minimal Filter and Validate (Minimal F&V)
– Coarse Index (Coarse)
– Coarse Index + Dropping index lists (Coarse+Drop)
– Filter and Validate + Dropping index lists (F&V+Drop)
– Blocked access + Pruning (Blocked+Prune)
– Blocked access + Pruning + Dropping index lists (Blocked+Prune+Drop)

Experiments
– Algorithms implemented in Java
– Main memory
– Wall-clock time needed for processing 1000 queries
– We normalize d and θ to [0, 1]

Validity of the Theoretical Cost Model
– Query threshold θ = 0.2; k = 10
– Average difference is 14.82 ms

Validity of the Theoretical Cost Model
– Query threshold θ = 0.2; k = 10
– Average difference is 2.02 ms

Performance of Algorithms: NYT
– The Coarse+Drop index outperforms the competitor by at least a factor of 34

Performance of Algorithms: Yago
– The Coarse+Drop index again outperforms the competitor

Conclusion
– New hybrid index structure for similarity search of top-k lists: tunable towards inverted index or metric index; cost model for finding the sweet spot
– Optimizations over the inverted index
– The presented hybrid index beats the competitor AdaptSearch

References
[Ilieva et al. ’13] E. Ilieva, S. Michel, and A. Stupar. The essence of knowledge (bases) through entity rankings. CIKM, 2013.
[Fagin et al. ’03] R. Fagin, R. Kumar, and D. Sivakumar. Comparing Top-k Lists. SIAM J. Discrete Math., 17(1), 2003.
[Flajolet et al. ’92] P. Flajolet, D. Gardy, and L. Thimonier. Birthday Paradox, Coupon Collectors, Caching Algorithms and Self-Organizing Search. Discrete Applied Mathematics, 39(3), 1992.
[Wang et al. ’12] J. Wang, G. Li, and J. Feng. Can we beat the prefix filtering?: an adaptive framework for similarity join and search. In SIGMOD Conference, 2012.

Image References [1] business-people-group-in [2] god-to-take-control.html [3]