Dynamic Diversification of Continuous Data Marina Drosou and Evaggelia Pitoura Computer Science Department University of Ioannina, Greece

Slides:



Advertisements
Similar presentations
Chapter 5: Tree Constructions
Advertisements

Using Trees to Depict a Forest
Lindsey Bleimes Charlie Garrod Adam Meyerson
Traveling Salesperson Problem
Probabilistic Skyline Operator over Sliding Windows Wenjie Zhang University of New South Wales & NICTA, Australia Joint work: Xuemin Lin, Ying Zhang, Wei.
Branch-and-Bound In this handout,  Summary of branch-and-bound for integer programs Updating the lower and upper bounds for OPT(IP) Summary of fathoming.
AVL Trees COL 106 Amit Kumar Shweta Agrawal Slide Courtesy : Douglas Wilhelm Harder, MMath, UWaterloo
Approximation, Chance and Networks Lecture Notes BISS 2005, Bertinoro March Alessandro Panconesi University La Sapienza of Rome.
B+-Trees (PART 1) What is a B+ tree? Why B+ trees? Searching a B+ tree
1 Stochastic Event Capture Using Mobile Sensors Subject to a Quality Metric Nabhendra Bisnik, Alhussein A. Abouzeid, and Volkan Isler Rensselaer Polytechnic.
Stabbing the Sky: Efficient Skyline Computation over Sliding Windows COMP9314 Lecture Notes.
Small-world Overlay P2P Network
Windows Scheduling Problems for Broadcast System 1 Amotz Bar-Noy, and Richard E. Ladner Presented by Qiaosheng Shi.
2-dimensional indexing structure
Quantile-Based KNN over Multi- Valued Objects Wenjie Zhang Xuemin Lin, Muhammad Aamir Cheema, Ying Zhang, Wei Wang The University of New South Wales, Australia.
Deterministic Wavelet Thresholding for Maximum-Error Metrics Minos Garofalakis Bell Laboratories Lucent Technologies 600 Mountain Avenue Murray Hill, NJ.
B + -Trees (Part 1). Motivation AVL tree with N nodes is an excellent data structure for searching, indexing, etc. –The Big-Oh analysis shows most operations.
Online Data Gathering for Maximizing Network Lifetime in Sensor Networks IEEE transactions on Mobile Computing Weifa Liang, YuZhen Liu.
Probabilistic Skyline Operator over sliding Windows Wan Qian HKUST DB Group.
Cache Oblivious Search Trees via Binary Trees of Small Height
ReDrive: Result-Driven Database Exploration through Recommendations Marina Drosou, Evaggelia Pitoura Computer Science Department University of Ioannina.
R-Trees 2-dimensional indexing structure. R-trees 2-dimensional version of the B-tree: B-tree of maximum degree 8; degree between 3 and 8 Internal nodes.
Approximation Algorithms Motivation and Definitions TSP Vertex Cover Scheduling.
Marina Drosou Department of Computer Science University of Ioannina, Greece Thesis Advisor: Evaggelia Pitoura
CPSC 335 BTrees Dr. Marina Gavrilova Computer Science University of Calgary Canada.
DEXA 2005 Quality-Aware Replication of Multimedia Data Yicheng Tu, Jingfeng Yan and Sunil Prabhakar Department of Computer Sciences, Purdue University.
ICS 220 – Data Structures and Algorithms Week 7 Dr. Ken Cosh.
Review Binary Tree Binary Tree Representation Array Representation Link List Representation Operations on Binary Trees Traversing Binary Trees Pre-Order.
1 Introduction to Approximation Algorithms. 2 NP-completeness Do your best then.
1 B Trees - Motivation Recall our discussion on AVL-trees –The maximum height of an AVL-tree with n-nodes is log 2 (n) since the branching factor (degree,
1 Verifying and Mining Frequent Patterns from Large Windows ICDE2008 Barzan Mozafari, Hetal Thakkar, Carlo Zaniolo Date: 2008/9/25 Speaker: Li, HueiJyun.
A Polynomial Time Approximation Scheme For Timing Constrained Minimum Cost Layer Assignment Shiyan Hu*, Zhuo Li**, Charles J. Alpert** *Dept of Electrical.
Preference and Diversity-based Ranking in Network-Centric Information Management Systems PhD defense Marina Drosou Computer Science & Engineering Dept.
MINING FREQUENT ITEMSETS IN A STREAM TOON CALDERS, NELE DEXTERS, BART GOETHALS ICDM2007 Date: 5 June 2008 Speaker: Li, Huei-Jyun Advisor: Dr. Koh, Jia-Ling.
Content Addressable Network CAN. The CAN is essentially a distributed Internet-scale hash table that maps file names to their location in the network.
Dynamic Covering for Recommendation Systems Ioannis Antonellis Anish Das Sarma Shaddin Dughmi.
Parallel dynamic batch loading in the M-tree Jakub Lokoč Department of Software Engineering Charles University in Prague, FMP.
The Application of The Improved Hybrid Ant Colony Algorithm in Vehicle Routing Optimization Problem International Conference on Future Computer and Communication,
Computer Science and Engineering Efficiently Monitoring Top-k Pairs over Sliding Windows Presented By: Zhitao Shen 1 Joint work with Muhammad Aamir Cheema.
Heuristic Optimization Methods Greedy algorithms, Approximation algorithms, and GRASP.
Greedy is not Enough: An Efficient Batch Mode Active Learning Algorithm Chen, Yi-wen( 陳憶文 ) Graduate Institute of Computer Science & Information Engineering.
1 Combinatorial Algorithms Local Search. A local search algorithm starts with an arbitrary feasible solution to the problem, and then check if some small,
Applications of Dynamic Programming and Heuristics to the Traveling Salesman Problem ERIC SALMON & JOSEPH SEWELL.
Efficient Processing of Top-k Spatial Preference Queries
Zhuo Peng, Chaokun Wang, Lu Han, Jingchao Hao and Yiyuan Ba Proceedings of the Third International Conference on Emerging Databases, Incheon, Korea (August.
Spatio-temporal Pattern Queries M. Hadjieleftheriou G. Kollios P. Bakalov V. J. Tsotras.
Marina Drosou, Evaggelia Pitoura Computer Science Department
Bin Yao (Slides made available by Feifei Li) R-tree: Indexing Structure for Data in Multi- dimensional Space.
Lecture 11COMPSCI.220.FS.T Balancing an AVLTree Two mirror-symmetric pairs of cases to rebalance the tree if after the insertion of a new key to.
1 Online Computation and Continuous Maintaining of Quantile Summaries Tian Xia Database CCIS Northeastern University April 16, 2004.
De novo discovery of mutated driver pathways in cancer Discussion leader: Matthew Bernstein Scribe: Kun-Chieh Wang Computational Network Biology BMI 826/Computer.
Spatial Indexing Techniques Introduction to Spatial Computing CSE 5ISC Some slides adapted from Spatial Databases: A Tour by Shashi Shekhar Prentice Hall.
R-Trees: A Dynamic Index Structure For Spatial Searching Antonin Guttman.
1 Approximate XML Query Answers Presenter: Hongyu Guo Authors: N. polyzotis, M. Garofalakis, Y. Ioannidis.
 B-tree is a specialized multiway tree designed especially for use on disk  B-Tree consists of a root node, branch nodes and leaf nodes containing the.
Database Management Systems, R. Ramakrishnan 1 Algorithms for clustering large datasets in arbitrary metric spaces.
Graph Data Management Lab, School of Computer Science Branch Code: A Labeling Scheme for Efficient Query Answering on Tree
Lecture 9COMPSCI.220.FS.T Lower Bound for Sorting Complexity Each algorithm that sorts by comparing only pairs of elements must use at least 
1 CSIS 7101: CSIS 7101: Spatial Data (Part 1) The R*-tree : An Efficient and Robust Access Method for Points and Rectangles Rollo Chan Chu Chung Man Mak.
1 Low Latency Multimedia Broadcast in Multi-Rate Wireless Meshes Chun Tung Chou, Archan Misra Proc. 1st IEEE Workshop on Wireless Mesh Networks (WIMESH),
Efficient Semantic Web Service Discovery in Centralized and P2P Environments Dimitrios Skoutas 1,2 Dimitris Sacharidis.
Intro. ANN & Fuzzy Systems Lecture 37 Genetic and Random Search Algorithms (2)
Server Allocation for Multiplayer Cloud Gaming
Analysis and design of algorithm
Structure and Content Scoring for XML
Early Profile Pruning on XML-aware Publish-Subscribe Systems
Structure and Content Scoring for XML
Efficient Processing of Top-k Spatial Preference Queries
Donghui Zhang, Tian Xia Northeastern University
Efficient Aggregation over Objects with Extent
Presentation transcript:

Dynamic Diversification of Continuous Data Marina Drosou and Evaggelia Pitoura Computer Science Department University of Ioannina, Greece

Dynamic Diversification of Continuous Data Motivation The abundance of available information online creates the need for selecting and presenting to users representative results. Result diversification can be applied in multiple domains: ▫Exploratory or ambiguous queries ▫Recommenders 2 DMOD lab, University of Ioannina

Dynamic Diversification of Continuous Data Result Diversification Let P = {p 1, …, p n }, k ≤ n, d be a distance measure and f be a function measuring the diversity of items as indicated by d. Then: DMOD lab, University of Ioannina 3 Given a set P of items and a number k, we seek to select a subset S * of P with the k most dissimilar items of P. Problem definition:

Dynamic Diversification of Continuous Data Diversity Functions Two widespread options for f: DMOD lab, University of Ioannina 4

Dynamic Diversification of Continuous Data Contribution We study the dynamic/streaming diversification problem: ▫New items (books, movies etc.) are added to a recommender system. ▫News articles are published while old ones are not important anymore. ▫Microblogging applications (e.g., twitter) We introduce continuity requirements that need to be satisfied in streaming environments: 1.The order in which diverse items are delivered to the users must follow the order of their generation. 2.Items must not appear, disappear and then re-appear again to the users. DMOD lab, University of Ioannina 5

Dynamic Diversification of Continuous Data Contribution We propose an efficient indexing approach based on Cover Trees. We provide approximation bounds for the accuracy of the achieved solutions. We conduct an experimental study of the efficiency and effectiveness of our approach using both real and synthetic datasets. DMOD lab, University of Ioannina 6

Dynamic Diversification of Continuous Data Outline Diversification Model ▫Diversification framework ▫Continuous k-diversity problem Cover Tree ▫Structure ▫Algorithms Evaluation DMOD lab, University of Ioannina 7

Dynamic Diversification of Continuous Data Outline Diversification Model ▫Diversification framework ▫Continuous k-diversity problem Cover Tree ▫Structure ▫Algorithms Evaluation DMOD lab, University of Ioannina 8

Dynamic Diversification of Continuous Data Diversification Framework We consider a streaming scenario, where new items arrive and older items expire. We want to provide users with a continuously updated subset of the top-k most diverse recent items in the stream. We focus on the M AX M IN diversification problem. DMOD lab, University of Ioannina 9

Dynamic Diversification of Continuous Data Jumping Windows We consider a sliding-window model where the k most diverse items are computed over windows of length w in the streaming input data. We allow windows not only to slide but also to jump. ▫This corresponds to the view of the users between consequent logins to their notification service. DMOD lab, University of Ioannina 10 w jump step Window P i-1 Window P i

Dynamic Diversification of Continuous Data Constrained Continuous k-Diversity Problem Let P i-1, P i be two consequent jumping windows. For each P i, we seek to select a diverse subset S i, where the additional two constraints hold: ▫Durability:  p j  (S i-1  P i )  p j  S i  Once selected as diverse, an item remains as such until it expires. ▫Freshness: Let p l be the newest item in S i-1. Then, ∄p j  P i \S i-1 with j < l, such that, p j  S i  Items are selected in the same order they are produced. DMOD lab, University of Ioannina 11

Dynamic Diversification of Continuous Data Widespread approaches The diversification problem is NP-hard. Most current solutions to the (static) diversification problem are either: ▫Greedy Heuristics  Select one item at a time. ▫Interchange Heuristics  Start with a random solution and try to improve it. Other approaches include the definition of optimization problems, probabilistic selection etc. DMOD lab, University of Ioannina 12

Dynamic Diversification of Continuous Data Greedy Heuristic One of the most widely spread and best performing heuristics 1,2. Algorithm (applied after each window jump): ▫Add first into S the two most dissimilar items of P. ▫Continue by adding one item at a time, until k items are selected. Add the item p with the largest distance from S: The greedy heuristic provides a ½-approximation of the optimal solution. DMOD lab, University of Ioannina 13 1M. Vieira, H. Razente, M. Barioni, M. Hadjieleftheriou, D. Srivastava, C. Traina Jr., V. Tsotras: On query result diversification. ICDE, M.Drosou, E.Pitoura: Search result diversification. SIGMOD Record 39(1), 2010

Dynamic Diversification of Continuous Data Outline Diversification Model ▫Diversification framework ▫Continuous k-diversity problem Cover Tree ▫Structure ▫Algorithms Evaluation DMOD lab, University of Ioannina 14

Dynamic Diversification of Continuous Data The Cover Tree A leveled tree where each level is a “cover” for all levels beneath it. Items at higher levels are farther apart from each other than items at lower levels. DMOD lab, University of Ioannina 15 level C l level C l-1 level C l-2

Dynamic Diversification of Continuous Data Cover Tree Invariants - Nesting Nesting: C l  C l-1, i.e., once an item p appears at some level, then every lower level has a node associated with p. DMOD lab, University of Ioannina 16 level C l level C l-1 level C l-2 p1p1 p1p1 p2p2 p2p2 p2p2 p3p3 p3p3

Dynamic Diversification of Continuous Data Cover Tree Invariants - Covering Covering: For every p i  C l-1, there exists a p j  C l, such that d(p i,p j ) ≤ b l and the node associated with p j is the parent of the node associated with p i. DMOD lab, University of Ioannina 17 level C l level C l-1 level C l-2 p1p1 p1p1 p2p2 p2p2 p2p2 p3p3 p3p3  b l-1 b: the “base” of the treel: the level of p i

Dynamic Diversification of Continuous Data Cover Tree Invariants - Separation Separation: For all distinct p i, p j  C l, it holds that d(p i, p j ) > b l. DMOD lab, University of Ioannina 18 level C l level C l-1 level C l-2 p1p1 p1p1 p2p2 p2p2 p2p2 p3p3 p3p3 > b l-2  b l-1

Dynamic Diversification of Continuous Data Cover Tree Representations After an item p appears in some level l of the tree, then p is a child of itself at all levels below l. DMOD lab, University of Ioannina 19 Implicit RepresentationExplicit Representation O(n) space space depending on P p1p1 p1p1 p2p2 p2p2 p2p2 p3p3 p3p3 p1p1 p2p2 p3p3

Dynamic Diversification of Continuous Data Computing Diverse Subsets l  +  while |T.C l-1 | ≤ k do l  l-1 end while S  T.C l candidates  T.C l-1 while |S| < k do p *  argmax p  candidates d(p,S) S  S  {p * } candidates  candidates\{p * } end while return S DMOD lab, University of Ioannina 20 Let T be a cover tree for P. Select k distinct items that appear at the highest levels of the tree. Basic idea:

Dynamic Diversification of Continuous Data Example of Diverse Subsets DMOD lab, University of Ioannina 21 Diverse subsets (items in red) for different values of k from the same Cover Tree.

Dynamic Diversification of Continuous Data Continuity Requirements Items in the tree are marked as valid or invalid: ▫Freshness: non-diverse items that are older than the newest diverse item from the previous window are marked as invalid in the cover tree and are not further considered. ▫Durability: Let r be the number of diverse items from previous windows that have not yet expired. We select k-r new valid diverse items from the new window. DMOD lab, University of Ioannina 22

Dynamic Diversification of Continuous Data Dynamic Construction Items can be inserted and deleted from a Cover Tree in a dynamic fashion. Insertion: ▫Starting from the root, descend towards the candidate nodes that can cover the new item p. ▫Continue until we reach a level C l where p is separated from all other items. ▫Select as parent the candidate node of C l+1 that is closest to p. Deletion: ▫Descend the tree looking for p, keeping note of candidate nodes that can cover the children of p. ▫Remove p and reassign its children to the candidate nodes. DMOD lab, University of Ioannina 23

Dynamic Diversification of Continuous Data p1p1 p1p1 p2p2 p2p2 p2p2 p3p3 p3p3 Insertion DMOD lab, University of Ioannina 24 p1p1 p1p1 p2p2 p2p2 p2p2 p3p3 p3p3 The new item is assigned to the closest candidate parent (according to the covering invariant) p1p1 p1p1 p2p2 p2p2 p2p2 p3p3 p3p3 We traverse the tree until we find a level in which the new item is separated from the rest

Dynamic Diversification of Continuous Data p1p1 p1p1 p2p2 p2p2 p2p2 Deletion DMOD lab, University of Ioannina 25 p1p1 p1p1 p2p2 p2p2 p2p2 p3p3 p3p3 p1p1 p1p1 p2p2 p2p2 p2p2 p4p4 p4p4 This child is assigned to p 2 since it can be covered by it A new item is promoted to the upper level and covers the other children p 3 is deleted

Dynamic Diversification of Continuous Data Approximation Bound DMOD lab, University of Ioannina 26 Let P be a set of items, k  2, d OPT (P,k) the optimal minimum distance for the M AX M IN problem and d CT (P,k) be the minimum distance of the diverse set computed by the Cover Tree algorithm. It holds that: d CT (P,k)    d OPT (P,k), where  = (b-1)/(2b 2 ) Let P be a set of items, k  2, d OPT (P,k) the optimal minimum distance for the M AX M IN problem and d CT (P,k) be the minimum distance of the diverse set computed by the Cover Tree algorithm. It holds that: d CT (P,k)    d OPT (P,k), where  = (b-1)/(2b 2 ) Proved by exploiting the covering invariant of the tree to bound the level where the least common ancestor of any two items of the optimal solution appears in the tree.

Dynamic Diversification of Continuous Data Batch Construction If all items of P are available, we can perform a batch construction of the Cover Tree ▫We call the trees thus constructed Batch Cover Trees (BCTs). Algorithm: ▫The leaf level C l contains all items in P ▫We greedily select items from C l with distance larger than b l+1 and promote them to C l+1 ▫The rest of the items in C l are distributed as children among the new nodes of C l+1 ▫Continue until we reach the root level We have proved that computing diverse subsets from a BCT results in solutions identical to those produced by the Greedy Heuristic, i.e., ½-approximations of the optimal solution. DMOD lab, University of Ioannina 27

Dynamic Diversification of Continuous Data Changing k The Cover Tree can support a zooming functionality from k to k’ items. Let l be the lowest level from which the k diverse items were selected. We can exploit the nesting invariant: ▫For k’ > k (zoom in), we select items from level l or lower. ▫For k’ < k (zoom out), we select items that appear in levels higher than l. DMOD lab, University of Ioannina 28

Dynamic Diversification of Continuous Data Outline Diversification Model ▫Diversification framework ▫Continuous k-diversity problem Cover Tree ▫Structure ▫Algorithms Evaluation DMOD lab, University of Ioannina 29

Dynamic Diversification of Continuous Data Setup DMOD lab, University of Ioannina 30 DatasetCardinalityDimensions Distance metric Synthetic Uniform100002Euclidean Clustered100002Euclidean Real Cities59222Euclidean Forest500010Cosine Faces300256Cosine Flickr18245-Jaccard GR: Greedy Heuristic BCT: Batch Cover Tree ICT: Incremental Cover Tree (dynamic insertions and deletions) RA: Random selection

Dynamic Diversification of Continuous Data Building Batch Cover Trees We measure the extra cost of building a BCT as compared to executing GR for k = n. ▫This extra cost corresponds to assigning nodes to suitable parents to form the tree levels. DMOD lab, University of Ioannina 31 ClusteredFaces bnon-npnpnon-npnp %0.58%1.49%1.94% %0.56%1.47%1.92% %0.55%1.47%1.91% np – nearest parent heuristic (choose closest candidate parent). The quality of the solution is the same for BCT and GR. Extra Cost

Dynamic Diversification of Continuous Data Building Incremental Cover Trees Building ICTs requires a small fraction of the cost required for the corresponding BCTs. However, the quality of the solutions provided by ICTs is comparable to that of BCTs (and, thus, GR). DMOD lab, University of Ioannina 32 bClusteredFaces %0.79% %0.41% %0.28% For trees with 10,ooo items:  Insertion cost: ~2.6 msec  Deletion cost: ~10 msec Inserting/Removing items after a window jump depends on the size of the window and the jump step but is much faster than re-building a BCT for the new set of items. Extra Cost

Dynamic Diversification of Continuous Data Streaming Data We compare ICTs against SGR, a streaming version of GR: ▫After each window jump, we initialize the solution for the new window with the remaining diverse items from the previous window (durability) and let GR select items from the new window satisfying freshness. To simulate the order in which data arrives, we used the photo upload time for the Flickr dataset. For the rest of the datasets, we permuted the items so that they enter the stream in a random order. DMOD lab, University of Ioannina 33

Dynamic Diversification of Continuous Data Streaming Data Comparable achieved diversity, while ICTs are much faster. Retrieving the top-100 items from an ICT with 1,000-10,000 items requires ~1.5 msec. Executing SGR requires 3.2 sec for 5,000 items and more than 15 sec for 10,000 items. DMOD lab, University of Ioannina 34

Dynamic Diversification of Continuous Data Summary We studied the diversification problem in a dynamic setting, where items change over time. We defined continuity requirements that the diversified items must satisfy. We proposed an indexed-based approach based on Cover Trees for providing efficient solutions for continuous diversification. We provided theoretical and experimental results for the quality of our approach. DMOD lab, University of Ioannina 35

Dynamic Diversification of Continuous Data DMOD lab, University of Ioannina 36 Thank you!