Fast Nearest-neighbor Search in Disk-resident Graphs. Presenter: 鲁轶奇, IBM – China Research Lab.

Outline
- Introduction
- Background & related work
- Proposed work
- Experiments

Introduction - Motivation
- Graphs are becoming enormous.
- Streaming algorithms must take multiple passes over the entire dataset.
- Other approaches perform clever preprocessing, but each is tied to a specific similarity measure.
- This paper introduces analysis and algorithms that address the scalability problem in a generalizable way: not specific to one kind of graph partitioning nor to one specific proximity measure.

Introduction - Motivation (cont.)
- Real-world graphs contain high-degree nodes.
- Local algorithms compute a node's value by combining the values of its neighbors.
- Whenever a high-degree node is encountered, these algorithms have to examine a much larger neighborhood, leading to severely degraded performance.

Introduction - Motivation (cont.)
- Algorithms can no longer assume that the entire graph can be stored in memory.
- Compression techniques have at least three settings where they might not work:
  - social networks are far less compressible than Web graphs;
  - decompression might lead to an unacceptable increase in query response time;
  - even if a graph could be compressed down to a gigabyte, it might be undesirable to keep it in memory on a machine that is running other applications.

Contributions
- A simple transform of the graph (turning high-degree nodes into sinks).
- A deterministic local algorithm guaranteed to return the nearest neighbors under personalized PageRank from the disk-resident clustered graph.
- A fully external-memory clustering algorithm (RWDISK) that uses only sequential sweeps over data files.

Background - Personalized PageRank
- Consider a random walk starting at node a; at any step the walk is reset to the start node with probability α.
- PPV(a, j): the personalized PageRank value (PPV entry) from a to j.
- A large value indicates high similarity to a.
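
A minimal illustrative sketch of this definition (not the paper's disk-based algorithm), assuming a small in-memory graph stored as an adjacency-list dict in which every node appears as a key; the function name and parameters are illustrative:

    # Personalized PageRank of a start node by power iteration.
    # alpha is the reset probability; nodes with no out-edges act as sinks (the walk stops there).
    def personalized_pagerank(graph, start, alpha=0.15, iters=50):
        ppv = {node: 0.0 for node in graph}
        ppv[start] = 1.0
        for _ in range(iters):
            nxt = {node: 0.0 for node in graph}
            nxt[start] = alpha                              # reset mass returns to the start node
            for u, neighbors in graph.items():
                if not neighbors:                           # sink node: its mass is absorbed
                    continue
                share = (1.0 - alpha) * ppv[u] / len(neighbors)
                for v in neighbors:
                    nxt[v] += share                         # walk follows a uniformly random out-edge
            ppv = nxt
        return ppv                                          # ppv[j] approximates PPV(start, j)

    # Toy usage on a hypothetical graph:
    # g = {"a": ["b", "c"], "b": ["a"], "c": ["a", "b"], "d": []}
    # print(personalized_pagerank(g, "a"))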

Background - Clustering
- Random-walk-based approaches can compute a good-quality local graph partition near a given anchor node.
- Main intuition: a random walk started inside a low-conductance cluster will mostly stay inside the cluster.
- Conductance: Φ_V(A) denotes the conductance of a node set A, where μ(A) = Σ_{i∈A} degree(i); for an undirected graph, Φ_V(A) = |edges crossing between A and V\A| / min(μ(A), μ(V\A)).
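
A minimal sketch of the conductance definition above, assuming an undirected graph stored as an adjacency-list dict (the helper name is illustrative):

    # Conductance of a node set A: crossing edges divided by the smaller side's volume.
    def conductance(graph, A):
        A = set(A)
        vol_A = sum(len(graph[u]) for u in A)
        vol_rest = sum(len(graph[u]) for u in graph if u not in A)
        cut = sum(1 for u in A for v in graph[u] if v not in A)
        denom = min(vol_A, vol_rest)
        return cut / denom if denom > 0 else 1.0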

Proposed Work
- First issue: most local algorithms for computing nearest neighbors suffer from the presence of high-degree nodes.
- Second issue: computing proximity measures on large disk-resident graphs.
- Third issue: finding a good clustering.

Effect of high-degree nodes
- High-degree nodes are a performance bottleneck.
- Effect on personalized PageRank:
  - Main intuition: a very high-degree node passes on only a small fraction of its value to each out-neighbor, which might not be significant enough to be worth spending computing resources on.
  - Argument: stopping a random walk at a high-degree node does not significantly change the personalized PageRank values at other nodes, which have relatively smaller degree.
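
A minimal sketch of the sink transform, assuming the adjacency-list representation used above and an illustrative degree threshold (the experiments later use thresholds of 1000 for DBLP and 100 for LiveJournal):

    # Turn every node whose out-degree exceeds max_degree into a sink:
    # it keeps its incoming edges but loses its outgoing ones, so walks reaching it stop.
    def turn_high_degree_nodes_into_sinks(graph, max_degree=1000):
        return {u: ([] if len(neighbors) > max_degree else list(neighbors))
                for u, neighbors in graph.items()}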

Effect of high-degree nodes (cont.)
- The error incurred in personalized PageRank by turning a node into a sink is inversely proportional to the degree of the sink node.

Effect of high-degree nodes (cont.)
- f_α(i, j) denotes the probability of hitting node j for the first time from node i in this α-discounted walk; the error bound is expressed in terms of these first-hitting probabilities.
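
A minimal sketch, assuming the same adjacency-list representation, of estimating f_α(i, j) by simulating α-discounted walks (a Monte Carlo stand-in, not the paper's analysis):

    import random

    # Estimate f_alpha(i, j): the chance an alpha-discounted walk from i reaches j
    # before it stops. At each step the walk terminates with probability alpha.
    def estimate_first_hitting(graph, i, j, alpha=0.15, num_walks=10000):
        hits = 0
        for _ in range(num_walks):
            node = i
            while True:
                if node == j:
                    hits += 1
                    break
                if random.random() < alpha or not graph[node]:
                    break                                   # walk stops, or is absorbed at a sink
                node = random.choice(graph[node])
        return hits / num_walks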

Effect of high-degree nodes (cont.)
- The error analysis extends to introducing a set of sink nodes rather than a single one.

Nearest neighbors on clustered graphs
- How to use the clusters for deterministic computation of the nodes "close" to an arbitrary query node.
- Use degree-normalized personalized PageRank as the proximity measure.
- For a given node i, the PPV from any node j to it, PPV(j, i), can be written recursively in terms of the PPV values of j's out-neighbors.
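
A hedged reconstruction of that recurrence, following the standard definition of personalized PageRank used above (this exact form is an assumption, not copied from the slide):

    PPV(j, i) = \alpha \,\mathbb{1}[j = i] + (1 - \alpha) \sum_{k \in \mathrm{out}(j)} P(j, k)\, PPV(k, i)

where P(j, k) is the transition probability from j to k. The next slide's bound argument replaces the PPV values of the out-neighbors k that lie outside the cluster.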

Nearest neighbors on clustered graphs (cont.)
- Assume that j and i are in the same cluster S.
- For out-neighbors k outside S we do not have access to PPV_{t-1}(k, i), so we replace it with an upper and a lower bound:
  - lower bound: 0, i.e. we pretend that S is completely disconnected from the rest of the graph;
  - upper bound: a random walk from outside S has to cross the boundary of S to hit node i.

IBM – China Research Lab  S is small in size, the power method suffice  At each iteration, maintain the upper and lower bounds for nodes within S  To expand S: bring in the clusters for x of the external neighbors of  this global upper boundfalls below a pre-specified small threshold γ  In reality, using an additive slack ε, (ub k+1 - ε)

Ranking step
- Return all nodes whose lower bound is greater than the (k+1)-th largest upper bound.
- Why this is correct: all nodes outside the expanded cluster are guaranteed to have personalized PageRank smaller than the global upper bound, which is smaller than γ, so they cannot break into the returned top-k set.
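
A minimal sketch of this ranking step, assuming the per-node bounds come from a routine like the one above (names and the eps parameter are illustrative):

    # Return every node whose lower bound beats the (k+1)-th largest upper bound,
    # optionally relaxed by the additive slack eps mentioned on the previous slide.
    def rank_top_k(lower, upper, k, eps=0.0):
        ub_sorted = sorted(upper.values(), reverse=True)
        if len(ub_sorted) <= k:
            return list(lower.keys())
        threshold = ub_sorted[k] - eps                      # (k+1)-th largest upper bound
        return [node for node, lb in lower.items() if lb > threshold]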

Clustered representation on disk
- Intuition: use a set of anchor nodes and assign each remaining node to its "closest" anchor.
- Personalized PageRank is used as the measure of "closeness".
- Algorithm (see the sketch below):
  - start with a random set of anchors;
  - iteratively add new anchors from the set of unreachable nodes, and then recompute the cluster assignments.
- Two properties:
  - new anchors are far away from the existing anchors;
  - when the algorithm terminates, each node i is guaranteed to be assigned to its closest anchor.
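
A minimal sketch of the assignment step only, assuming the PPV values from each anchor are already available in memory (in RWDISK they live in the Ans file described next); names are illustrative:

    # Assign each node to the anchor with the largest PPV to it; nodes reached by no
    # anchor are reported as unreachable and can seed the next round of anchors.
    def assign_to_anchors(ppv_from_anchor, nodes):
        assignment, unreachable = {}, []
        for node in nodes:
            best_anchor, best_value = None, 0.0
            for anchor, ppv in ppv_from_anchor.items():
                value = ppv.get(node, 0.0)
                if value > best_value:
                    best_anchor, best_value = anchor, value
            if best_anchor is None:
                unreachable.append(node)
            else:
                assignment[node] = best_anchor
        return assignment, unreachable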

RWDISK
- Four kinds of files:
  - Edges file: each line represents an edge as a triplet {src, dst, p}, with p = P(X_t = dst | X_{t-1} = src);
  - Last file: each line is {src, anchor, value}, with value = P(X_{t-1} = src | X_0 = anchor);
  - Newt file: holds x_t; each line is {src, anchor, value}, with value = P(X_t = src | X_0 = anchor);
  - Ans file: holds the values of v_t; each line is {src, anchor, value}, where value accumulates the α(1-α)^{t-1}-weighted probabilities over the iterations so far.
- Algorithm: compute v_t by power iterations.

RWDISK (cont.)
- Newt is simply a matrix-vector product between the transition matrix stored in Edges and the vector stored in Last.
- Since the files are stored in lexicographic order, this can be computed with a file-join-like algorithm (sketched below):
  - first step: join the two files and accumulate the probability values a node receives from its in-neighbors;
  - next step: sort and compress the Newt file, in order to add up the values coming from different in-neighbors;
  - multiply the probabilities by α(1-α)^{t-1} when accumulating them into Ans.
- The number of iterations is fixed at maxiter.
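
A minimal in-memory stand-in for one such iteration (in RWDISK both inputs are sorted disk files processed by sequential sweeps; here they are plain Python lists, and the sort-and-compress step is folded into a dictionary):

    from collections import defaultdict

    # One step of the file join: edges is a list of (src, dst, p) and last is a list of
    # (src, anchor, value). Emit Newt as (src, anchor, value) with contributions summed.
    def rwdisk_iteration(edges, last):
        last_by_src = defaultdict(list)
        for src, anchor, value in last:
            last_by_src[src].append((anchor, value))
        contributions = defaultdict(float)
        for src, dst, p in edges:
            for anchor, value in last_by_src.get(src, []):
                # P(X_t = dst | X_0 = anchor) gains p * P(X_{t-1} = src | X_0 = anchor)
                contributions[(dst, anchor)] += p * value
        return sorted((node, anchor, v) for (node, anchor), v in contributions.items())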

IBM – China Research Lab  One major problem is that intermediate files can become much larger than the number of edges  in most real-world networks within 4-5 steps it is possible to reach a huge fraction of the whole graph  Intermediate file getting too large  Using rounding for reducing file sizes

Experiments
- Datasets: Citeseer, DBLP, and LiveJournal.

Experiments (cont.) - System details
- Run on an off-the-shelf PC.
- Least-recently-used (LRU) page replacement scheme.
- Page size: 4 KB.
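
A minimal sketch of how page faults could be counted under such an LRU buffer (the capacity parameter and trace format are assumptions, not from the slides):

    from collections import OrderedDict

    # Count page faults for a sequence of page accesses with an LRU buffer of 'capacity' pages.
    def count_page_faults(page_accesses, capacity):
        cache, faults = OrderedDict(), 0
        for page in page_accesses:
            if page in cache:
                cache.move_to_end(page)                     # mark as most recently used
            else:
                faults += 1
                if len(cache) >= capacity:
                    cache.popitem(last=False)               # evict the least recently used page
                cache[page] = True
        return faults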

Experiments (cont.) - Effect of high-degree nodes
- The sink transform has three-fold advantages, including:
  - speeding up external-memory clustering;
  - reducing the number of page faults in random-walk simulation.
- Effect on RWDISK.

Experiments (cont.) - Deterministic vs. simulations
- Compute the top-10 neighbors, with approximation slack, for 500 randomly picked nodes.
- Citeseer: original graph.
- DBLP: nodes with degree above 1000 turned into sinks.
- LiveJournal: nodes with degree above 100 turned into sinks.

Experiments (cont.) - RWDISK vs. METIS
- Settings: maxiter = 30 and α = 0.1; a small ε threshold is used for the PPV values.
- METIS is the baseline algorithm:
  - breaking DBLP into parts used 20 GB of RAM;
  - breaking LiveJournal into parts used 50 GB of RAM.
- In comparison, RWDISK can be executed on a standard PC with 2-4 GB of RAM.

Experiments (cont.) - RWDISK vs. METIS
- Measure of cluster quality: a good disk-based clustering must satisfy:
  - low conductance;
  - clusters fit in disk-sized pages.
