Marina Drosou, Evaggelia Pitoura Computer Science Department

Presentation transcript:

DisC Diversity: Result Diversification based on Dissimilarity and Coverage Marina Drosou, Evaggelia Pitoura Computer Science Department University of Ioannina, Greece http://dmod.cs.uoi.gr

Why diversify? An example of diversity in the case of web search: Car, Animal, Sports Team, “Mr. Jaguar”. DMOD lab, University of Ioannina

What it means: Given a set P of query results, we want to select a representative diverse subset S of P. What does diverse mean [1]? Coverage: different aspects, perspectives, concepts, as in the example of web search. Dissimilarity: non-similar items, e.g., based on a number of characteristics in recommendations. Novelty: items not seen in the past. [1] Marina Drosou, Evaggelia Pitoura: Search result diversification. SIGMOD Record 39(1): 41-47 (2010)

Shortcomings of previous approaches: Most previous work views diversification as a top-k problem: given a set P of items and a number k, select a subset S* of P with the k most diverse items of P. Find: S* = argmax f(S, d) over all S ⊆ P with |S| = k, where P = {p1, …, pn}, k ≤ n, d is a distance metric and f is a diversity function.

Our approach - DisC Diversity: What is the right size for the diverse subset S? What is a good k? What if, instead of k, we use a radius r? Given a result set P and a radius r, we select a representative subset S ⊆ P such that: (coverage) for each item in P, there is at least one similar item in S; (dissimilarity) no two items in S are similar to each other.

r-DisC set: r-Dissimilar and Covering set. Small r: more, and less dissimilar, points (zoom-in). Large r: fewer, and more dissimilar, points (zoom-out). Local zooming at specific points by adjusting the radius around them. [Figure panels: Zoom-in, Zoom-out, Local zoom] Oct 4, 2012 @ Bari

Talk Overview: Formal definition and algorithms; Comparison; Adaptive Diversification; Implementation using M-trees; Evaluation.

Our approach - DisC Diversity: Since a DisC set for a set P is not unique, we seek a concise representation: the minimum DisC set. Formal definition: Let P be a set of objects and r, r ≥ 0, a real number. A subset S ⊆ P is an r-Dissimilar-and-Covering diverse subset, or r-DisC diverse subset, of P, if the following two conditions hold: (coverage condition) ∀ pi ∈ P, ∃ pj ∈ N+r(pi) such that pj ∈ S, and (dissimilarity condition) ∀ pi, pj ∈ S with pi ≠ pj, it holds that d(pi, pj) > r. Here N+r(pi) denotes pi together with its neighbors, i.e., the objects within distance r of pi.
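The two conditions of the definition can be checked mechanically. The following is a minimal sketch, assuming points are coordinate tuples and d is the Euclidean distance; the function name is_r_disc and the sample data are illustrative and do not come from the slides.

```python
from math import dist  # Euclidean distance (Python 3.8+)


def is_r_disc(points, subset, r, d=dist):
    """Return True if `subset` is an r-DisC diverse subset of `points`."""
    # S must be drawn from P.
    if not all(s in points for s in subset):
        return False
    # Coverage: every point lies within r of some selected point
    # (a selected point covers itself at distance 0).
    covered = all(any(d(p, s) <= r for s in subset) for p in points)
    # Dissimilarity: selected points are pairwise more than r apart.
    dissimilar = all(
        d(a, b) > r
        for i, a in enumerate(subset)
        for b in subset[i + 1:]
    )
    return covered and dissimilar


points = [(0, 0), (1, 0), (2, 0), (5, 0)]
print(is_r_disc(points, [(1, 0), (5, 0)], r=1.5))          # True
print(is_r_disc(points, [(0, 0), (1, 0), (5, 0)], r=1.5))  # False: (0,0) and (1,0) are similar
print(is_r_disc(points, [(1, 0)], r=1.5))                  # False: (5,0) is not covered
```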

Graph model: We use a graph to model the problem. Each item is a vertex. There is an edge between two vertices if their distance is at most r.

Graph model: Solving the minimum r-DisC Diverse Subset Problem for a set P is equivalent to finding a minimum Independent Dominating Set of the graph. Independent: no edge between any two vertices in the set. Dominating: every vertex outside the set is connected to at least one vertex inside it. The problem is NP-hard. [Figure panels: Dominating, not independent; Dominating and independent]
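To make the equivalence concrete, here is a small sketch that builds the graph and finds a minimum independent dominating set by exhaustive search. It assumes an edge whenever the distance is at most r (so that it matches the dissimilarity condition d > r) and Euclidean distance. All function names are illustrative. The search is exponential, which is in line with the NP-hardness noted above, so it is only usable on tiny inputs.

```python
from itertools import combinations
from math import dist


def build_graph(points, r, d=dist):
    """Adjacency sets over indices: i ~ j when d(points[i], points[j]) <= r."""
    adj = {i: set() for i in range(len(points))}
    for i, j in combinations(range(len(points)), 2):
        if d(points[i], points[j]) <= r:
            adj[i].add(j)
            adj[j].add(i)
    return adj


def is_independent(adj, chosen):
    """No edge between any two chosen vertices."""
    return all(v not in adj[u] for u, v in combinations(chosen, 2))


def is_dominating(adj, chosen):
    """Every vertex is chosen or adjacent to a chosen vertex."""
    chosen = set(chosen)
    return all(v in chosen or bool(adj[v] & chosen) for v in adj)


def minimum_independent_dominating_set(adj):
    """Try candidate sets by increasing size; the first hit is minimum."""
    vertices = list(adj)
    for k in range(len(vertices) + 1):
        for cand in combinations(vertices, k):
            if is_independent(adj, cand) and is_dominating(adj, cand):
                return list(cand)


points = [(0, 0), (1, 0), (2, 0), (3, 0), (4, 0)]
adj = build_graph(points, r=1)          # a path 0-1-2-3-4
print(minimum_independent_dominating_set(adj))  # [0, 3]
```

Two vertices suffice on this path; other minimum solutions such as indices 1 and 3 exist, which also illustrates that DisC sets are not unique.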

Computing DisC subsets

How much smaller is the minimum set? The size of any r-DisC diverse subset S of P is at most B times the size of any minimum r-DisC diverse subset S∗, where B is the maximum number of independent neighbors of any item in P, i.e., each item has at most B neighbors that are independent from each other. B depends on the distance metric and the dimensionality of the data. We have proved that: for the Euclidean distance in the 2D plane, B = 5; for the Manhattan distance in the 2D plane, B = 7; for the Euclidean distance in 3D space, B = 24 (proofs in the paper).
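The notion of independent neighbors can be illustrated with a brute-force count. This is a sketch under the assumption of Euclidean distance in the plane; max_independent_neighbors is a hypothetical helper and the point layouts are made up for the example. It only demonstrates configurations and is not a proof of the bound.

```python
from itertools import combinations
from math import cos, sin, pi, dist


def max_independent_neighbors(center, points, r, d=dist):
    """Size of the largest set of neighbors of `center` (within r)
    that are pairwise more than r apart. Brute force, tiny inputs only."""
    neighbors = [p for p in points if p != center and d(center, p) <= r]
    for k in range(len(neighbors), 0, -1):
        for cand in combinations(neighbors, k):
            if all(d(a, b) > r for a, b in combinations(cand, 2)):
                return k
    return 0


def ring(n, radius):
    """n points evenly spaced on a circle around the origin."""
    return [(radius * cos(2 * pi * i / n), radius * sin(2 * pi * i / n))
            for i in range(n)]


# Five neighbors at 72-degree spacing are pairwise more than r apart.
print(max_independent_neighbors((0.0, 0.0), ring(5, 0.9), r=1.0))  # 5
# With six neighbors at 60-degree spacing, adjacent ones become similar.
print(max_independent_neighbors((0.0, 0.0), ring(6, 0.9), r=1.0))  # 3
```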

Bounding the size of DisC subsets: Dropping the dissimilarity condition: Let Δ be the maximum number of neighbors of any item in P. The size of any covering (but not dissimilar) diverse subset S of P is at most ln Δ times the size of any minimum covering subset S∗ (proof in the paper).
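A covering-only subset can be computed with the classical greedy dominating-set heuristic, which is the usual setting in which logarithmic factors in Δ arise. The sketch below assumes Euclidean distance; whether it matches the exact procedure analysed in the paper is an assumption, and greedy_cover is an illustrative name.

```python
from math import dist


def greedy_cover(points, r, d=dist):
    """Repeatedly pick the point that covers the most uncovered points.
    The result covers all points but is not required to be dissimilar."""
    n = len(points)
    # closed[i]: indices within distance r of point i, including i itself
    closed = [{j for j in range(n) if d(points[i], points[j]) <= r}
              for i in range(n)]
    uncovered = set(range(n))
    chosen = []
    while uncovered:
        # Any point may be picked, including one that is already covered.
        best = max(range(n), key=lambda i: len(closed[i] & uncovered))
        chosen.append(best)
        uncovered -= closed[best]
    return [points[i] for i in chosen]


points = [(0, 0), (1, 0), (2, 0), (3, 0), (4, 0)]
print(greedy_cover(points, r=1))  # [(1, 0), (3, 0)]
```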

Comparison with other models: Two widespread options for the diversity function f are MaxMin and MaxSum.

Comparison with other models: Let S be an r-DisC set and S* be an optimal MaxMin set. Let λ and λ* be the MaxMin distances of the two sets. Then λ* ≤ 3λ (proof in the paper).
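The relation between the two MaxMin distances can be tried out on a toy instance. The sketch assumes that the MaxMin objective is the smallest pairwise distance within a set and that S* is chosen with the same size as S; both are assumptions not spelled out on the slide. The function names are illustrative and the optimum is found by exhaustive search.

```python
from itertools import combinations
from math import dist


def max_min_value(subset, d=dist):
    """MaxMin objective: the smallest pairwise distance in the subset."""
    return min(d(a, b) for a, b in combinations(subset, 2))


def best_max_min_subset(points, k, d=dist):
    """Size-k subset maximizing the MaxMin objective (brute force)."""
    return max(combinations(points, k), key=lambda s: max_min_value(s, d))


points = [(0, 0), (1, 0), (2, 0), (3, 0), (4, 0)]
disc_set = [(1, 0), (3, 0)]   # a valid r-DisC set of `points` for r = 1
lam = max_min_value(disc_set)
lam_star = max_min_value(best_max_min_subset(points, len(disc_set)))
print(lam, lam_star, lam_star <= 3 * lam)  # 2.0 4.0 True
```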

Zooming: We want to change the radius r to r’ interactively and compute a new diverse set: r’ < r means zoom-in, r’ > r means zoom-out. Two requirements: (1) Support an incremental mode of operation: the new set Sr’ should be as close as possible to the already seen result Sr. Ideally, Sr’ ⊇ Sr for r’ < r and Sr’ ⊆ Sr for r’ > r. (2) The size of Sr’ should be as close as possible to the size of the minimum r’-DisC diverse subset. There is no monotonic property among the r-DisC diverse and the r’-DisC diverse subsets of a set of objects P (the two sets may be completely different).

Size when moving from r to r’: The change in size of the diverse set when moving from r to r’ depends on |N^I_{r1,r2}(pi)|, the number of independent neighbors (for r’) in the “ring” around an object pi between the two radii.

Zooming: Again, |N^I_{r1,r2}(pi)| depends on the distance metric and the dimensionality of the data. Bounds are given for the 2D Euclidean and 2D Manhattan cases (proofs in the paper).

Zooming-In: For zooming-in, we keep the items of Sr and fill in the solution with items from uncovered areas. It holds that: Sr ⊆ Sr′ and |Sr′| ≤ N|Sr|, where N is the maximum |N^I_{r1,r2}(pi)| in Sr (proofs and various algorithms for keeping the size small in the paper).
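The zoom-in step can be sketched as follows, assuming Euclidean distance and an arbitrary (input-order) choice among uncovered items. The paper describes several variants for keeping the result small; this sketch does not attempt to reproduce them, and zoom_in is an illustrative name.

```python
from math import dist


def zoom_in(points, s_r, r_new, d=dist):
    """Extend an r-DisC set to a smaller radius r_new, keeping all of s_r.
    Items of s_r stay dissimilar because they were more than r > r_new apart."""
    result = list(s_r)
    for p in points:
        # p is uncovered if no selected item lies within r_new of it
        if all(d(p, s) > r_new for s in result):
            result.append(p)
    return result


points = [(0, 0), (1, 0), (2, 0), (3, 0), (4, 0)]
s_r = [(1, 0), (3, 0)]                    # an r-DisC set for r = 1
print(zoom_in(points, s_r, r_new=0.5))
# [(1, 0), (3, 0), (0, 0), (2, 0), (4, 0)]
```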

Zooming-Out: For zooming-out, we keep the independent items of Sr and fill in the solution with items from uncovered areas. It holds that: there are at most N items in Sr\Sr’, and for each item in Sr\Sr’, at most (B-1) items are added to Sr’ (proof and various algorithms for keeping the size small in the paper).
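A matching sketch for zoom-out is given below. It assumes Euclidean distance and processes the items of Sr in their given order when deciding which of them remain independent at the larger radius; that order is an arbitrary choice made here for illustration. zoom_out is a hypothetical name.

```python
from math import dist


def zoom_out(points, s_r, r_new, d=dist):
    """Derive an r_new-DisC set (r_new > r) that reuses items of s_r.
    First the items of s_r are scanned, keeping those still independent
    at r_new; then the remaining points fill in uncovered areas."""
    result = []
    for p in list(s_r) + list(points):
        # A point already selected is at distance 0 from itself,
        # so it is never added twice.
        if all(d(p, s) > r_new for s in result):
            result.append(p)
    return result


points = [(0, 0), (1, 0), (2, 0), (3, 0), (4, 0)]
s_r = [(0, 0), (2, 0), (4, 0)]            # an r-DisC set for r = 1
print(zoom_out(points, s_r, r_new=2))     # [(0, 0), (4, 0)]
```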

Implementation: We base our implementation on a spatial data structure (central operation: compute neighbors). We use an M-tree. We link together all leaf nodes (we visit items in a single left-to-right traversal of the leaf level to exploit locality). We build trees using splitting policies that minimize overlap.

Implementation: Lazy variations for updating neighborhoods. Pruning Rule: A leaf node that contains no white objects is colored grey. When all its children become grey, an internal node is colored grey and becomes inactive. We prune subtrees with only “grey nodes”. Our code is available on-line: www.dbxr.org (VLDB 2013 Reproducible label).
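The pruning rule can be illustrated on a generic tree. The Node class below is a hypothetical stand-in and not the M-tree of the actual implementation: it stores only the colors of the objects in each leaf, which is enough to show how grey flags propagate upward and how grey subtrees are skipped.

```python
class Node:
    """Hypothetical tree node: leaves hold object colors,
    internal nodes hold children."""

    def __init__(self, children=None, colors=None):
        self.children = children or []
        self.colors = colors or []   # colors of the objects in a leaf
        self.grey = False

    def is_leaf(self):
        return not self.children


def refresh(node):
    """Recompute grey flags bottom-up; return True if `node` is grey."""
    if node.is_leaf():
        node.grey = "white" not in node.colors
    else:
        # Visit every child (no short-circuit) so that all flags are updated.
        flags = [refresh(child) for child in node.children]
        node.grey = all(flags)
    return node.grey


def active_leaves(node):
    """Yield leaves that may still hold white objects, pruning grey subtrees."""
    if node.grey:
        return
    if node.is_leaf():
        yield node
    else:
        for child in node.children:
            yield from active_leaves(child)


leaf_a = Node(colors=["black", "grey"])
leaf_b = Node(colors=["grey", "white"])
leaf_c = Node(colors=["grey"])
inner = Node(children=[leaf_a, leaf_c])
root = Node(children=[inner, leaf_b])
refresh(root)
print(inner.grey, root.grey)                   # True False
print(list(active_leaves(root)) == [leaf_b])   # True
```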

Performance: Many real and synthetic datasets. General trade-off: larger r → smaller diverse set → higher cost. Lazy variations of our algorithms further reduce computational cost. The cost also depends on the characteristics of the M-tree (fat-factor). Smaller sizes for clustered data. [Figure panels: Solution size, Cost]

Zooming performance: Both requirements: incremental (much smaller cost) and small size (relative to computing it from scratch). Larger overlap among Sr and Sr’. [Figure panels: Solution size, Cost, Jaccard distance among solutions]
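The overlap between two solutions can be measured with the standard Jaccard distance, 1 - |A ∩ B| / |A ∪ B|. The sketch below uses that textbook definition; the sample sets are made up for illustration and are not taken from the experiments.

```python
def jaccard_distance(a, b):
    """1 - |A ∩ B| / |A ∪ B|; 0.0 means identical sets, 1.0 means disjoint."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0   # convention chosen here for two empty sets
    return 1.0 - len(a & b) / len(a | b)


s_r = [(1, 0), (3, 0)]
s_r_new = [(1, 0), (3, 0), (0, 0), (4, 0)]
print(jaccard_distance(s_r, s_r_new))  # 0.5
```

A smaller value means that more of the previously shown result is preserved after zooming.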

On-going and future work: Incorporate relevance: instead of locating the smallest set, locate the “most relevant” set. Use multiple radii: emphasize specific areas of the dataset, or emphasize specific items, e.g., the most relevant ones. Streaming (publish/subscribe) systems: also “novelty”. Many others: other forms of indexing, integrating the notion of diversity with database query processing, etc.

Thank you! See DisC and other models in action in our demo! Poikilo @ Group D

Computing DisC subsets: Let us call black the objects of P that are in S, grey the objects covered by S, and white the objects that are neither black nor grey. Initially, S is empty and all objects are white. Until there are no more white objects: select an arbitrary white object pi, color pi black, and color all objects in the neighborhood of pi grey. Greedy variation: at each step, we select the white object with the largest number of white neighbors.
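The coloring procedure can be sketched directly. The version below assumes Euclidean distance, computes neighborhoods by a quadratic scan instead of an index, and resolves "arbitrary" as "lowest index"; these are choices made for the illustration, and disc_subset is a hypothetical name.

```python
from math import dist


def disc_subset(points, r, greedy=False, d=dist):
    """Compute an r-DisC diverse subset by coloring objects black/grey/white."""
    n = len(points)
    # neighbors[i]: indices of the other points within distance r of point i
    neighbors = [{j for j in range(n) if j != i and d(points[i], points[j]) <= r}
                 for i in range(n)]
    white = set(range(n))   # initially all objects are white
    black = []              # the selected objects, i.e. S
    while white:
        if greedy:
            # white object with the largest number of white neighbors
            pick = max(sorted(white), key=lambda i: len(neighbors[i] & white))
        else:
            pick = min(white)   # "arbitrary": here, the lowest index
        black.append(pick)
        # pick turns black and its neighbors turn grey
        white -= neighbors[pick] | {pick}
    return [points[i] for i in black]


points = [(0, 0), (1, 0), (2, 0), (3, 0), (4, 0)]
print(disc_subset(points, r=1))               # [(0, 0), (2, 0), (4, 0)]
print(disc_subset(points, r=1, greedy=True))  # [(1, 0), (3, 0)]
```

On this input the greedy variation returns a smaller set than the arbitrary-order run, although both outputs satisfy the coverage and dissimilarity conditions.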