ProbeSim: Scalable Single-Source and Top-k SimRank Computations on Dynamic Graphs Yu Liu , Bolong Zheng, Xiaodong He, Zhewei Wei, Xiaokui Xiao, Kai.

Slides:



Advertisements
Similar presentations
Weiren Yu 1, Jiajin Le 2, Xuemin Lin 1, Wenjie Zhang 1 On the Efficiency of Estimating Penetrating Rank on Large Graphs 1 University of New South Wales.
Advertisements

Reachability Querying: An Independent Permutation Labeling Approach (published in VLDB 2014) Presenter: WEI, Hao.
Principal Component Analysis Based on L1-Norm Maximization Nojun Kwak IEEE Transactions on Pattern Analysis and Machine Intelligence, 2008.
Effective Keyword Based Selection of Relational Databases Bei Yu, Guoliang Li, Karen Sollins, Anthony K.H Tung.
Fast Algorithms For Hierarchical Range Histogram Constructions
TI: An Efficient Indexing Mechanism for Real-Time Search on Tweets Chun Chen 1, Feng Li 2, Beng Chin Ooi 2, and Sai Wu 2 1 Zhejiang University, 2 National.
GRAIL: Scalable Reachability Index for Large Graphs VLDB2010 Vineet Chaoji Mohammed J. Zaki.
Using Structure Indices for Efficient Approximation of Network Properties Matthew J. Rattigan, Marc Maier, and David Jensen University of Massachusetts.
Michael Bender - SUNY Stony Brook Dana Ron - Tel Aviv University Testing Acyclicity of Directed Graphs in Sublinear Time.
Scaling Personalized Web Search Glen Jeh, Jennfier Widom Stanford University Presented by Li-Tal Mashiach Search Engine Technology course (236620) Technion.
Presented by Ozgur D. Sahin. Outline Introduction Neighborhood Functions ANF Algorithm Modifications Experimental Results Data Mining using ANF Conclusions.
B + -Trees (Part 1). Motivation AVL tree with N nodes is an excellent data structure for searching, indexing, etc. –The Big-Oh analysis shows most operations.
Simpath: An Efficient Algorithm for Influence Maximization under Linear Threshold Model Amit Goyal Wei Lu Laks V. S. Lakshmanan University of British Columbia.
COVERTNESS CENTRALITY IN NETWORKS Michael Ovelgönne UMIACS University of Maryland 1 Chanhyun Kang, Anshul Sawant Computer Science Dept.
Scalable Approximate Query Processing through Scalable Error Estimation Kai Zeng UCLA Advisor: Carlo Zaniolo 1.
Ch 8.1 Numerical Methods: The Euler or Tangent Line Method
A Comparative Study of Search Result Diversification Methods Wei Zheng and Hui Fang University of Delaware, Newark DE 19716, USA
Mehdi Kargar Aijun An York University, Toronto, Canada Keyword Search in Graphs: Finding r-cliques.
1 Efficient Search Ranking in Social Network ACM CIKM2007 Monique V. Vieira, Bruno M. Fonseca, Rodrigo Damazio, Paulo B. Golgher, Davi de Castro Reis,
Diversified Top-k Graph Pattern Matching 1 Yinghui Wu UC Santa Barbara Wenfei Fan University of Edinburgh Southwest Jiaotong University Xin Wang.
1 ECE-517 Reinforcement Learning in Artificial Intelligence Lecture 7: Finite Horizon MDPs, Dynamic Programming Dr. Itamar Arel College of Engineering.
Mehdi Kargar Aijun An York University, Toronto, Canada Keyword Search in Graphs: Finding r-cliques.
1 LinkClus: Efficient Clustering via Heterogeneous Semantic Links Xiaoxin Yin, Jiawei Han Univ. of Illinois at Urbana-Champaign Philip S. Yu IBM T.J. Watson.
1 Panther: Fast Top-K Similarity Search on Large Networks Jing Zhang 1, Jie Tang 1, Cong Ma 1, Hanghang Tong 2, Yu Jing 1, and Juanzi Li 1 1 Department.
Panther: Fast Top-k Similarity Search in Large Networks JING ZHANG, JIE TANG, CONG MA, HANGHANG TONG, YU JING, AND JUANZI LI Presented by Moumita Chanda.
Kijung Shin Jinhong Jung Lee Sael U Kang
Complexity and Efficient Algorithms Group / Department of Computer Science Testing the Cluster Structure of Graphs Christian Sohler joint work with Artur.
Glen Jeh & Jennifer Widom KDD  Many applications require a measure of “similarity” between objects.  Web search  Shopping Recommendations  Search.
SimRank: A Measure of Structural-Context Similarity Glen Jeh and Jennifer Widom Stanford University ACM SIGKDD 2002 January 19, 2011 Taikyoung Kim SNU.
Yu Wang1, Gao Cong2, Guojie Song1, Kunqing Xie1
Cohesive Subgraph Computation over Large Graphs
Algorithms for Big Data: Streaming and Sublinear Time Algorithms
Nanyang Technological University
Clustering Data Streams
Designing Cross-Language Information Retrieval System using various Techniques of Query Expansion and Indexing for Improved Performance  Hello everyone,
Greedy & Heuristic algorithms in Influence Maximization
Sofus A. Macskassy Fetch Technologies
CS 326A: Motion Planning Probabilistic Roadmaps for Path Planning in High-Dimensional Configuration Spaces (1996) L. Kavraki, P. Švestka, J.-C. Latombe,
FORA: Simple and Effective Approximate Single­-Source Personalized PageRank Sibo Wang, Renchi Yang, Xiaokui Xiao, Zhewei Wei, Yin Yang School of Information.
A paper on Join Synopses for Approximate Query Answering
Parallel Density-based Hybrid Clustering
Nithin Michael, Yao Wang, G. Edward Suh and Ao Tang Cornell University
Static Optimality and Dynamic Search Optimality in Lists and Trees
Chapter 3 Component Reliability Analysis of Structures.
Department of Computer Science
Sublinear Algorithms for Personalized PageRank, with Applications
Integrating Meta-Path Selection With User-Guided Object Clustering in Heterogeneous Information Networks Yizhou Sun†, Brandon Norick†, Jiawei Han†, Xifeng.
Location Recommendation — for Out-of-Town Users in Location-Based Social Network Yina Meng.
Localizing the Delaunay Triangulation and its Parallel Implementation
RankClus: Integrating Clustering with Ranking for Heterogeneous Information Network Analysis Yizhou Sun, Jiawei Han, Peixiang Zhao, Zhijun Yin, Hong Cheng,
Data Integration with Dependent Sources
Effective Social Network Quarantine with Minimal Isolation Costs
Haim Kaplan and Uri Zwick
Randomized Algorithms CS648
Probably Approximately
Where did we stop? The Bayes decision rule guarantees an optimal classification… … But it requires the knowledge of P(ci|x) (or p(x|ci) and P(ci)) We.
Jinhong Jung, Woojung Jin, Lee Sael, U Kang, ICDM ‘16
Scaling up Link Prediction with Ensembles
Cost-effective Outbreak Detection in Networks
Example: Academic Search
Asymmetric Transitivity Preserving Graph Embedding
GANG: Detecting Fraudulent Users in OSNs
Chapter 7: Eligibility Traces
A Framework for Testing Query Transformation Rules
Minwise Hashing and Efficient Search
Efficient Processing of Top-k Spatial Preference Queries
PRSim: Sublinear Time SimRank Computation on Large Power-Law Graphs.
Towards Maximum Independent Sets on Massive Graphs
PRSim: Sublinear Time SimRank Computation on Large Power-Law Graphs
Presentation transcript:

ProbeSim: Scalable Single-Source and Top-k SimRank Computations on Dynamic Graphs Yu Liu , Bolong Zheng, Xiaodong He, Zhewei Wei, Xiaokui Xiao, Kai Zheng, Jiaheng Lu 1(+ next slide 0.5min).Hello everyone, it’s my pleasure to present our work. My name is Yu and I’m from Renmin University of China. This is a joint work with professor Zhewei, Xiaokui and Jiaheng. It deals with fast computation of SimRank queries on large and dynamic graphs.

Outline Backgrounds Our ProbeSim Algorithm Experiments 1(0min).Here’s the outline of my talk. First I’ll briefly give some backgrounds, followed by the intuitive idea of our ProbeSim algorithm; at last I’ll show some interesting experimental results. ProbeSim: Scalable Single-Source and Top-k SimRank Computations on Dynamic Graphs

SimRank [Jeh and Widom, KDD 02] Professor A Student A University (less) similar Most similar to itself! similar Professor B Student B (1 min)1.The idea of SimRank was proposed by Jeh and Widom in KDD 2002, and now has more than 1600 citations. 2.Let me use this example graph to show its basic idea. Professor A and B both come from some university, and they both have a student. 3.First every object is most similar to itself; then, the two professors are similar because they are from the same university; at last, the two students are also similar but a little less, because their advisors are similar, and the similarity decays during propagation. 4.Formally, SimRank is defined by this equation, which says the similarity of two objects is determined by the similarity of all possible pairs of their in-neighbors. Here c denotes the decay factor between 0 and 1. ProbeSim: Scalable Single-Source and Top-k SimRank Computations on Dynamic Graphs

Applications of SimRank Collaborative filtering [Jeh and Widom, KDD02] Clustering via semantic links [Yin et. al., VLDB06] Recommendation system [Nguyen et. al., WWW15] Rating=3 Rating=5 Rating=2 0.7 0.9 0.2 Last.fm user (1min)There are lots of applications of SimRank, like collaborative filtering and recommendation systems. For example, consider the system needs to guess the rating of a user to a given singer, to determine whether to recommend the singer to him. First the system needs to find k most similar singers already rated by the user, by some similarity measure such as SimRank, which performs well. Then the system computes a weighted score according to the similarity. ProbeSim: Scalable Single-Source and Top-k SimRank Computations on Dynamic Graphs

Random Walk Interpretation of SimRank The Random Surfer-Pairs Model [Jeh and Widom, KDD02] Refined by SRK-Join[Tao et.al, VLDB14] and SLING[Tian and Xiao, SIGMOD16] Sim(u,v) = Pr[Random walks from u and v meet] (0.5min)1.Computing by the definition of SimRank induces huge cost on large graphs, which is equivalent to iteration of matrix multiplication. The authors of SimRank proposed another random walk interpretation. 2.It says the similarity of u and v is the probability of two random walks meet, starting from u and v respectively, following the in-edges and with same steps. 3. Several works refined this definition by considering the decay factor, making it more efficient. *Decay factor Ignored. ProbeSim: Scalable Single-Source and Top-k SimRank Computations on Dynamic Graphs

Random Walk Interpretation of SimRank Example of random walk meet u t s (+next slide 1min)1. In this example, both u and v start a random walk. The walk of u is u to s then to v, and the walk of v is to t and then back to v. They meet at v. Meet ! v ProbeSim: Scalable Single-Source and Top-k SimRank Computations on Dynamic Graphs

The Monte Carlo (MC) Algorithm Single pair query (given u, v) Meet ! Never Meet. u u … v v 1. (0.5min)Motivated by the meeting definition of SimRank, for single pair query, the MC algorithm takes n trials and count meet number. Since each trial is an unbiased estimation, by the concentration inequality, the MC estimation can be arbitrarily precise as long as there are sufficient trials. Trial 1 Trial 2 Trial n ProbeSim: Scalable Single-Source and Top-k SimRank Computations on Dynamic Graphs

The Monte Carlo (MC) Algorithm Single source/Top-k query (given u, query all v) # walk pair generated = |V| * # trial n O(logn/ε2) (1min)However, for the single source or top-k query, the cost of MC is too high. The number of walk pairs needs to be vertex size times number of trials. To guarantee the absolute error is no larger than epsilon for any vertex, the number of trials needs to be logn over square of epsilon. On the other hand, the answer size is relatively small. For single source query, it’s no larger than vertex size, but typically much smaller because a large fraction of nodes have negligible similarity w.r.t the query vertex. For top-k query the answer size is just a constant. Thus, there is a big gap between the performance of MC algorithm and the problem complexity. Answer size O(|V|) for single source query, and O(k) for top-k query! ProbeSim: Scalable Single-Source and Top-k SimRank Computations on Dynamic Graphs

State-of-the-arts for Single Source/Top-k Method Time Space Drawbacks TopSim [Lee et.al, ICDE12] ~O(|D|2t) - 1. Heuristics -> No accuracy guarantee; 2. Slow on large graphs. TSF [Shao et.al, VLDB15] O(RgRqt|V|) O(Rg|V|) 1. Assumption -> No accuracy guarantee; 2. Large sized index! SLING [Tian & Xiao, SIGMOD16] O(|V|/ε), O(|E|log21/ε) O(|V|/ε) 1. Large index and preprocessing time; 2. Do not support dynamic graph There are already some improvements of MC algorithm, such as the TopSim, TSF and SLING, but they either rely on heuristics Which make them have no accuracy guarantee and give poor results in practice; Or need large index which limits the scalability, and do not support queries on dynamic graphs. ProbeSim: Scalable Single-Source and Top-k SimRank Computations on Dynamic Graphs

Our Approach: the ProbeSim Algorithm … x w v 1.(+next slides 2min)In order to guarantee the error and query cost, We propose the ProbeSim algorithm, combining the forward random walk and backward searching strategies. 2. The algorithm also take many trials. 3.For each trial, first we sample a random walk from the query vertex, following in-edges each step. u ProbeSim: Scalable Single-Source and Top-k SimRank Computations on Dynamic Graphs

Our Approach: the ProbeSim Algorithm … x depth = 3 depth = 2 depth = 1 w v s u b c (1.5min)For each node on the walk, we start a BFS-like traversal following out-edges. The depth of traversal is equal to the distance of this node to query vertex. As an example, s is an out neighbor of w, which has 3 in-edges and one comes from w. Then s randomly select one edge and see if it connects to w. Similarly, b samples one in-edge which links to s. Then the score of b is set to 1. Since c samples some other edge, the score of c is 0. It can be shown that with probability of u and b meet, the score of b is 1, otherwise 0. Which is an unbiased estimator, same as the MC algorithm. 1 Unbiased estimator! With probability Pr[u and b meet], Score(b) = 1; Otherwise Score(b) = 0. ProbeSim: Scalable Single-Source and Top-k SimRank Computations on Dynamic Graphs

Intuition: ProbeSim vs MC … … … … … Never meet Meet here .. … …… 1.We show the benefits of ProbeSim in improving query efficiency. Given a query vertex, only a few nodes are similar, and most nodes have similarity = 0. 2.For MC, every node need to sample many walks. For nodes are very similar, the walks meet at short distances, for less similar nodes the meeting distance are larger. 3.For non-similar nodes, the walks never meet in a fixed step. These walks are wasted. u Query vertex Very similar Similar Less similar Not similar ProbeSim: Scalable Single-Source and Top-k SimRank Computations on Dynamic Graphs

Intuition: ProbeSim vs MC … Unreachable! .. … …… 1. The random probing procedure of ProbeSim guarantees that practically, only a small fraction of similar nodes have non-zero scores. Most nodes are not reached, thus significantly improves the practical efficiency. u Query vertex Very similar Similar Less similar Not similar ProbeSim: Scalable Single-Source and Top-k SimRank Computations on Dynamic Graphs

Some Optimization Techniques… Probe deterministically when cost is low Many trials -> Batch up random walks! Pruning rules (+next slides 1min)There are also some optimization techniques to speed up query and improve accuracy. ProbeSim: Scalable Single-Source and Top-k SimRank Computations on Dynamic Graphs

Some Optimization Techniques… Probe deterministically when cost is low x w <- cost is low, compute deterministically v 1.For example, if we find computing the score has low cost, we compute it directly. In this example, the score of s is 1/3 instead of 1, which has smaller variance. Since the next step has many edged involved, we still use random probe. If b1 is selected, the score of b1 is score(s). 2.We also formally prove that with all these optimizations, our algorithm has probably approximately correct guarantee. 2.For more details, please refer to our paper. s Score(s) = 1/3 <- high cost, probe randomly u b1 bn Score(b1) = 1/3 Score(bn) = 0 ProbeSim: Scalable Single-Source and Top-k SimRank Computations on Dynamic Graphs

Experimental Results: Small Datasets Absolute error (+next slide 0.5min)In the following I’ll show some interesting experimental results. We first conduct queries on small graphs, and compare ProbeSim with TSF and TopSim, with all their optimizations. They are the state-of-the-arts on dynamic graphs. For absolute error, our algorithm varies the error guarantee. ProbeSim consistently outperforms other algorithms. ProbeSim: Scalable Single-Source and Top-k SimRank Computations on Dynamic Graphs

Experimental Results: Small Datasets Precision@k (k = 50) For top-k queries, ProbeSim achieves higher precision using same time. Furthermore, ProbeSim is more flexible in the tradeoff between efficiency and effectiveness. ProbeSim: Scalable Single-Source and Top-k SimRank Computations on Dynamic Graphs

Experiments: Large Datasets Graph U/D n m LiveJournal Directed 4,847,571 68,993,773 it-2004 41,291,594 1,150,725,436 twitter-2010 41,652,230 1,468,365,182 friendster Undirected 68,349,466 2,586,147,869 Cost too high! Result n Result 1 … Top-k answer Ground truth unavailable! 1.(0.5min)We further consider the evaluation of algorithms on 4 large graphs, where the largest contains billion edges. 2.Previous works do not consider such graphs since computing the ground truth incurs too high cost. 3.We use the idea of pooling, which comes from information retrieval. Briefly it says once we have n results, to find top-k, we give n results to some expert and let him evaluate the answer. Use the idea of pooling! ProbeSim: Scalable Single-Source and Top-k SimRank Computations on Dynamic Graphs

Pooling: An Example Query vertex: 3 k=3 6 4 1 2 3 5 7 1(+ next 3 slides, 0.5min). I’ll use this example to illustrate the idea of pooling. Assume the query vertex is 3 and we want to know the top-3 similar nodes (except itself). Query vertex: 3 k=3 ProbeSim: Scalable Single-Source and Top-k SimRank Computations on Dynamic Graphs

Pooling: An Example ProbeSim Query vertex: 3 k=3 6 4 1 2 3 5 7 1 5 4 1. ProbeSim gives 1,5,and 4. Query vertex: 3 k=3 ProbeSim: Scalable Single-Source and Top-k SimRank Computations on Dynamic Graphs

Pooling: An Example TopSim Query vertex: 3 k=3 6 4 1 2 3 5 7 1 2 4 1.TopSim computes 1,2 and 4. Query vertex: 3 k=3 ProbeSim: Scalable Single-Source and Top-k SimRank Computations on Dynamic Graphs

Pooling: An Example TSF Query vertex: 3 k=3 6 4 1 2 3 5 7 2 7 4 1.The answer of TSF is 2, 7 and 4. Query vertex: 3 TSF 2 7 4 k=3 ProbeSim: Scalable Single-Source and Top-k SimRank Computations on Dynamic Graphs

Use MC algorithm for pooling ! Pooling: An Example ProbeSim Use MC algorithm for pooling ! 6 1 5 4 4 1 TopSim 2 v s(3, v) 1 2 4 5 7 1 2 4 3 5 7 1.(0.5min)We merge the answers of 3 algorithms and use MC algorithm for pooling. 2.The number of candidates is no larger than 3 times k, which can be efficiently evaluated. Query vertex: 3 TSF 2 7 4 k=3 ProbeSim: Scalable Single-Source and Top-k SimRank Computations on Dynamic Graphs

Use MC algorithm for pooling ! Pooling: An Example ProbeSim Use MC algorithm for pooling ! 6 1 5 4 Precision = 2/3 4 1 TopSim 2 v s(3, v) 1 0.353 5 0.249 7 0.235 4 0.196 2 0.171 1 2 4 3 Precision = 1/3 5 Ground truth: 1, 5, 6 7 (0.5min)We take the top-3 among the candidates, assuming this is the ground truth, and do the evaluation. Note that we do not know the exact ground truth, in this example is 1,5 and 6. Because no algorithm finds 6. But it’s sufficient to evaluate the relative performance of algorithms. Query vertex: 3 TSF 2 7 4 k=3 Precision = 1/3 ProbeSim: Scalable Single-Source and Top-k SimRank Computations on Dynamic Graphs

Experimental Results on Large Graphs Precision@k (k = 50) 1.(+next slide 0.5min)We use the above method to compare the precision of different algorithms. 2.For precision, on LJ, ProbeSim is as good as TopSim, on other datasets ProbeSim is consistently better for a significant margin. ProbeSim: Scalable Single-Source and Top-k SimRank Computations on Dynamic Graphs

Experimental Results on Large Graphs Tau@k (k = 50) 1.We also evaluate Tau, which measures the correctness of ranking. 2.The result is similar as Precision. We’ll show later that by taking more time, ProbeSim can be better than TopSim on LJ. ProbeSim: Scalable Single-Source and Top-k SimRank Computations on Dynamic Graphs

Experimental Results on Large Graphs Query Time (Sec) Dataset ProbeSim TSF TopSim-SM Trun-TopSim Prio-TopSim LiveJournal 0.056 0.48 439.82 6.2 0.6 it-2004 0.018 1.01 35.18 0.67 0.2 twitter-2010 13.6 191.28 N/A friendster 0.93 1036 65.08 (0.5min)This slide shows the query time of different algorithm, using same parameters as previous slide. ProbeSim is much faster than other algs. For example, it’s 3 to 4 orders of magnitude faster than TopSim on LJ, and give similar results. TSF and TopSim is hard to handle billion scale graphs such as twitter and friendster, because of excessive query time or index out of memory. ProbeSim: Scalable Single-Source and Top-k SimRank Computations on Dynamic Graphs

Experimental Results on Large Graphs Query Time (Sec) Dataset ProbeSim TSF TopSim-SM Trun-TopSim Prio-TopSim LiveJournal 0.056 0.48 439.82 6.2 0.6 it-2004 0.018 1.01 35.18 0.67 0.2 twitter-2010 13.6 191.28 N/A friendster 0.93 1036 65.08 Memory Cost (GB) Dataset Size (GB) ProbeSim TSF TopSim-SM Trun-TopSim Prio-TopSim LiveJournal 0.88 0.05 10.8 17.2 10.1 0.09 it-2004 10.9 0.4 83.7 1.7 0.9 1.0 twitter-2010 14.3 0.6 79.1 N/A friendster 23.4 99.2 1.9 1.(0.5min)We show the memory cost of these algs. ProbeSim has memory cost much smaller than graph size, which is consistent to its property of locality. 2.TSF needs to construct and store index in memory, which costs most. 3.TopSim still cost more memory than ProbeSim due to the traversal strategy, and fails on largest graphs. ProbeSim: Scalable Single-Source and Top-k SimRank Computations on Dynamic Graphs

Conclusions First single source and top-k SimRank algorithm for dynamic graphs of billion scale, and with theoretical accuracy guarantee Outperforms existing methods in query efficiency, accuracy and scalability First evaluation on large graphs by pooling 1.(1min)To conclude our work, we (read…) Our code available at https://github.com/dokirabbithole/ProbeSim_vldb_pub ProbeSim: Scalable Single-Source and Top-k SimRank Computations on Dynamic Graphs

Thank you! Q & A Thank you. Any question? ProbeSim: Scalable Single-Source and Top-k SimRank Computations on Dynamic Graphs