1
Fast Nearest-neighbor Search in Disk-resident Graphs
Presenter: Yiqi Lu (鲁轶奇)
2
Outline
- Introduction
- Background & related work
- Proposed work
- Experiments
3
Introduction: Motivation
- Graphs are becoming enormous.
- Streaming algorithms must take multiple passes over the entire dataset.
- Other approaches perform clever preprocessing tied to one specific similarity measure.
- This paper introduces analysis and algorithms that address the scalability problem in a generalizable way: not specific to one kind of graph partitioning nor to one proximity measure.
4
Introduction: Motivation (cont.)
- Real-world graphs contain high-degree nodes.
- Local algorithms compute a node's value by combining the values of its neighbors.
- Whenever a high-degree node is encountered, these algorithms have to examine a much larger neighborhood, leading to severely degraded performance.
5
Introduction: Motivation (cont.)
- Algorithms can no longer assume that the entire graph fits in memory.
- Compression techniques still have at least three settings where they might not work:
  - social networks are far less compressible than Web graphs;
  - decompression might lead to an unacceptable increase in query response time;
  - even if a graph could be compressed down to a gigabyte, it might be undesirable to keep it in memory on a machine that is running other applications.
6
Contributions
- A simple transform of the graph (turning high-degree nodes into sinks).
- A deterministic local algorithm guaranteed to return the nearest neighbors under personalized PageRank from the disk-resident clustered graph.
- A fully external-memory clustering algorithm (RWDISK) that uses only sequential sweeps over data files.
7
Background: Personalized PageRank
- A random walk starts at node a; at any step, the walk resets to the start node with probability α.
- PPV(a, j): the personalized PageRank value from a to j (see the sketch below).
- A large value indicates high similarity.
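As a concrete illustration, here is a minimal power-iteration sketch of personalized PageRank. The dense matrix `P`, the restart probability `alpha`, and the iteration count are assumptions made for illustration, not the paper's implementation.

```python
import numpy as np

def ppv(P: np.ndarray, start: int, alpha: float = 0.1, iters: int = 50) -> np.ndarray:
    """Approximate the personalized PageRank vector of `start`.

    P is assumed row-stochastic: P[i, j] = Pr(walk moves i -> j).
    """
    n = P.shape[0]
    r = np.zeros(n)
    r[start] = 1.0            # restart distribution: all mass on the start node
    v = r.copy()
    for _ in range(iters):
        # with probability alpha restart, otherwise take one step of the walk
        v = alpha * r + (1 - alpha) * (P.T @ v)
    return v                  # v[j] approximates PPV(start, j)
```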
8
Background: Clustering
- Random-walk-based approaches compute good-quality local graph partitions near a given anchor node.
- Main intuition: a random walk started inside a low-conductance cluster will mostly stay inside the cluster.
- Conductance: let $\mu(A) = \sum_{i \in A} \mathrm{degree}(i)$; then the conductance of a set $A \subseteq V$ is $\Phi_V(A) = \dfrac{E(A, V \setminus A)}{\min(\mu(A), \mu(V \setminus A))}$, where $E(A, V \setminus A)$ counts the edges crossing the cut.
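A small sketch of the conductance definition above, using a plain adjacency-set representation for an undirected graph (the representation itself is an assumption for illustration):

```python
def conductance(adj: dict[int, set[int]], A: set[int]) -> float:
    """Phi_V(A) = cut(A, V\\A) / min(mu(A), mu(V\\A)), with
    mu(A) = sum of degrees of nodes in A, as on the slide."""
    cut = sum(1 for u in A for v in adj[u] if v not in A)   # edges leaving A
    mu_A = sum(len(adj[u]) for u in A)
    mu_rest = sum(len(nbrs) for nbrs in adj.values()) - mu_A
    return cut / min(mu_A, mu_rest)
```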
9
IBM – China Research Lab Proposed Work First problem: most local algorithms for computing nearest neighbors suffer from the presence of high degree nodes. Second issue: computing proximity measures on large disk-resident graphs. Third issue: Finding a good clustering
10
Effect of High-Degree Nodes
- High-degree nodes are a performance bottleneck.
- Effect on personalized PageRank:
- Main intuition: a very high-degree node passes on only a small fraction of its value to each out-neighbor, which might not be significant enough to invest computing resources on.
- Argument: stopping a random walk at a high-degree node does not change the personalized PageRank values at other nodes of relatively smaller degree (see the sketch below).
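A hedged sketch of the proposed transform: high-degree nodes keep their incoming edges but lose their outgoing ones, so random walks stop there. The adjacency representation, the threshold, and the use of out-degree as the degree measure are all assumptions for illustration.

```python
def make_sinks(out_edges: dict[int, list[int]], threshold: int) -> dict[int, list[int]]:
    """Turn every node with out-degree above `threshold` into a sink by
    removing its outgoing edges; incoming edges are left untouched."""
    return {u: ([] if len(nbrs) > threshold else nbrs)
            for u, nbrs in out_edges.items()}
```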
11
Effect of High-Degree Nodes
- The error incurred in personalized PageRank by turning a node into a sink is inversely proportional to the degree of that sink node.
12
Effect of High-Degree Nodes
- $f_\alpha(i, j)$ is simply the probability of hitting node j for the first time from node i in the α-discounted walk.
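For reference, $f_\alpha$ satisfies a simple recurrence, the standard identity for discounted first-hitting probabilities; this is a hedged reconstruction, with $P$ the transition matrix:

$$
f_\alpha(j, j) = 1, \qquad
f_\alpha(i, j) = (1 - \alpha) \sum_{k} P(i, k)\, f_\alpha(k, j) \quad (i \neq j).
$$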
13
Effect of High-Degree Nodes
14
Effect of High-Degree Nodes
- The analysis also bounds the error incurred from introducing a set of sink nodes.
15
Nearest Neighbors on Clustered Graphs
- How to use the clusters for deterministic computation of nodes "close" to an arbitrary query.
- Use degree-normalized personalized PageRank.
- For a given node i, the PPV from j to it, i.e. PPV(j, i), can be written as a sum over walks (see the reconstruction below).
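The equation on the slide is not recoverable here; as a hedged reconstruction, the decomposition splits walks into those that stay inside the cluster $S$ and those that leave it:

$$
\mathrm{PPV}(j, i) \;=\; \mathrm{PPV}_S(j, i) \;+\; \sum_{k \notin S} \Pr\big[\text{walk from } j \text{ first exits } S \text{ via } k\big] \cdot \mathrm{PPV}(k, i),
$$

where $\mathrm{PPV}_S(j, i)$ counts only walks that never leave $S$. The bounds on the next slide come from replacing the unknown $\mathrm{PPV}(k, i)$ terms with $0$ (lower bound) or with a boundary-crossing argument (upper bound).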
16
Nearest Neighbors on Clustered Graphs (cont.)
- Assume that j and i are in the same cluster S.
- We do not have access to PPV(k, i) for nodes k outside S, so we replace it with upper and lower bounds.
- Lower bound: 0; we pretend that S is completely disconnected from the rest of the graph.
- Upper bound: a random walk from outside S has to cross the boundary of S to hit node i.
17
Nearest Neighbors on Clustered Graphs (cont.)
- Since S is small, the power method suffices.
- At each iteration, maintain the upper and lower bounds for the nodes within S.
- To expand S: bring in the clusters of x of the external neighbors, until the global upper bound falls below a pre-specified small threshold γ (see the sketch below).
- In practice, an additive slack ε is used, i.e. the test compares against $(ub_{k+1} - \varepsilon)$.
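A hedged sketch of this expansion loop. `load_cluster`, `iterate_bounds`, and `pick_external_neighbors` are hypothetical helpers, and the stopping rule is a simplified version of the slide's threshold-with-slack test:

```python
def expand_and_bound(query, load_cluster, iterate_bounds, pick_external_neighbors,
                     gamma: float = 1e-4, eps: float = 1e-3, x: int = 5):
    """Grow the in-memory cluster S around `query` until the global upper
    bound on any unexplored node drops below gamma (minus the slack eps)."""
    S = load_cluster(query)                           # clusters live on disk pages
    while True:
        lb, ub, global_ub = iterate_bounds(S, query)  # one power-method pass
        if global_ub - eps < gamma:                   # nothing outside S can matter
            return lb, ub
        for k in pick_external_neighbors(S, ub, x):
            S |= load_cluster(k)                      # page in x more clusters
```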
18
Ranking Step
- Return all nodes whose lower bound is greater than the (k+1)-th largest upper bound.
- Why this is safe: all nodes outside the cluster are guaranteed to have personalized PageRank smaller than the global upper bound, which is smaller than γ.
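A minimal sketch of this ranking rule; `lb` and `ub` are assumed to map node ids to the bounds maintained above:

```python
import heapq

def top_k_certain(lb: dict, ub: dict, k: int) -> list:
    """Return every node whose lower bound beats the (k+1)-th largest
    upper bound; by the argument above, this set contains the true top-k."""
    threshold = heapq.nlargest(k + 1, ub.values())[-1]
    return sorted((node for node, low in lb.items() if low > threshold),
                  key=lb.get, reverse=True)
```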
19
Clustered Representation on Disk
- Intuition: use a set of anchor nodes and assign each remaining node to its "closest" anchor, using personalized PageRank as the measure of "closeness".
- Algorithm (sketched below): start with a random set of anchors; iteratively add new anchors from the set of unreachable nodes, then recompute the cluster assignments.
- Two properties: new anchors are far away from the existing anchors, and when the algorithm terminates, each node i is guaranteed to be assigned to its closest anchor.
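A hedged sketch of this anchor-growing loop; `ppv_from` is a hypothetical helper returning, for an anchor, the PPV score of every node it reaches:

```python
import random

def anchor_clustering(nodes: list, ppv_from, n_seed: int = 10) -> dict:
    """Assign each node to the anchor with the highest PPV to it, adding a
    new anchor from the unreachable nodes until every node is covered."""
    anchors = random.sample(nodes, n_seed)
    while True:
        best = {}                                   # node -> (anchor, score)
        for a in anchors:
            for v, score in ppv_from(a).items():
                if score > best.get(v, (None, 0.0))[1]:
                    best[v] = (a, score)
        unreachable = [v for v in nodes if v not in best]
        if not unreachable:
            return {v: a for v, (a, _) in best.items()}
        anchors.append(random.choice(unreachable))  # far from existing anchors
```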
20
RWDISK
- Four kinds of files:
  - Edges file: each line represents an edge by a triplet {src, dst, p}, where $p = P(X_t = dst \mid X_{t-1} = src)$.
  - Last file: each line is {src, anchor, value}, where $value = P(X_{t-1} = src \mid X_0 = anchor)$.
  - Newt file: contains $x_t$; each line is {src, anchor, value}, where $value = P(X_t = src \mid X_0 = anchor)$.
  - Ans file: represents the running values of $v_t$; each line is {src, anchor, value}, where $value = \sum_{\tau \le t} \alpha (1-\alpha)^{\tau-1} P(X_\tau = src \mid X_0 = anchor)$.
- Algorithm: compute $v_t$ by power iterations.
21
RWDISK (cont.)
- Newt is simply a matrix-vector product between the transition matrix stored in Edges and the vector stored in Last.
- Files are stored in lexicographic order, so the product can be computed by a file-join-like algorithm (sketched below).
- First step: join the two files, accumulating the probability values at each node from its in-neighbors.
- Next step: sort and compress the Newt file, in order to add up the values coming from different in-neighbors.
- Multiply the probabilities by $\alpha(1-\alpha)^{t-1}$ when accumulating into Ans.
- Fix the number of iterations at maxiter.
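A hedged sketch of one join pass, under the assumption that both files are plain comma-separated text sorted by `src`; the paper's exact file layout may differ:

```python
import csv
from itertools import groupby

def rwdisk_join(edges_path: str, last_path: str, newt_path: str) -> None:
    """One sequential merge-join: for every edge (src, dst, p) and every
    Last row (src, anchor, value), emit (dst, anchor, p * value).
    The emitted Newt rows are unsorted; a later external sort pass
    groups and sums duplicates, as described on the slide."""
    with open(edges_path) as ef, open(last_path) as lf, \
         open(newt_path, "w", newline="") as nf:
        edges = groupby(csv.reader(ef), key=lambda row: row[0])
        out = csv.writer(nf)
        esrc, egroup = next(edges, (None, None))
        for lsrc, lgroup in groupby(csv.reader(lf), key=lambda row: row[0]):
            while esrc is not None and esrc < lsrc:   # skip sources with no mass
                esrc, egroup = next(edges, (None, None))
            if esrc != lsrc:
                continue                              # lsrc has no out-edges (a sink)
            lrows = list(lgroup)                      # all anchors reaching lsrc
            for _, dst, p in egroup:
                for _, anchor, value in lrows:
                    out.writerow([dst, anchor, float(p) * float(value)])
            esrc, egroup = next(edges, (None, None))
```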
22
RWDISK (cont.)
- One major problem: intermediate files can become much larger than the edge file, because in most real-world networks a walk can reach a huge fraction of the whole graph within 4-5 steps.
- Remedy: use rounding to reduce file sizes (sketched below).
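A minimal sketch of the rounding idea, assuming the simplest variant in which values below a threshold are dropped; the paper's exact rounding scheme may differ:

```python
def round_small(value: float, epsilon: float = 1e-3) -> float:
    """Drop probability mass below epsilon so that intermediate files keep
    only entries large enough to matter, at a bounded cost in accuracy."""
    return value if value >= epsilon else 0.0
```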
23
Experiments: Datasets
24
Experiments (cont.): System Details
- Run on an off-the-shelf PC.
- Least-recently-used (LRU) page replacement scheme.
- Page size: 4 KB.
25
Experiments (cont.): Effect of High-Degree Nodes
- Three-fold advantages:
  - speed up external-memory clustering;
  - reduce the number of page faults in random-walk simulation.
- Effect on RWDISK:
26
Experiments (cont.): Deterministic vs. Simulations
- Compute top-10 neighbors with approximation slack 0.005 for 500 randomly picked nodes.
- Citeseer: original graph.
- DBLP: nodes with degree above 1000 turned into sinks.
- LiveJournal: nodes with degree above 100 turned into sinks.
27
Experiments (cont.): RWDISK vs. METIS
- RWDISK parameters: maxiter = 30, α = 0.1, and ε = 0.001 for PPV.
- METIS as the baseline algorithm:
  - breaking DBLP into 50,000 parts used 20 GB of RAM;
  - breaking LiveJournal into 75,000 parts used 50 GB of RAM.
- In comparison, RWDISK can be executed on a standard PC with 2-4 GB of RAM.
28
Experiments (cont.): RWDISK vs. METIS
- Measure of cluster quality: a good disk-based clustering must satisfy:
  - low conductance;
  - clusters fit in disk-sized pages.
29
Experiments (cont.): RWDISK vs. METIS
30
Experiments (cont.): RWDISK vs. METIS