1. On Compressing Web Graphs
Michael Mitzenmacher, Harvard
Micah Adler, Univ. of Massachusetts

2. The Web as a Graph
[Diagram: four Web pages (Page A through Page D) and the corresponding directed graph on nodes A, B, C, D]

3. Motivation
The Web graph itself is interesting and useful.
– PageRank / Kleinberg's algorithm.
– Finding cyber-communities.
– Archival history of Web growth and development.
– Connectivity server.
Storing Web linkage information is expensive.
– Web growth rate vs. storage growth rate?
Can we compress it?

4. Varieties of Compression
1. Compress an isomorphism of the Web graph. Good for storage/transmission of graph features.
2. Compress the Web graph with nodes in a given order (e.g. sorted by URL).
3. Compress for use of the compressed graph in a product (e.g. a connectivity server).

5. Baseline: Huffman Coding
Significant work has shown that the in/outdegrees of Web graph vertices follow a power-law distribution.
Basic scheme: for each vertex, list all outedges; each vertex is assigned a Huffman codeword based on its indegree, so frequently linked-to pages get short codes.
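To make the baseline concrete, here is a minimal Python sketch of indegree-driven Huffman coding. The graph representation (a dict from page to out-edge list) and all names are illustrative assumptions, not the authors' code.

```python
import heapq
from collections import Counter
from itertools import count

def huffman_codes(freqs):
    """Build a Huffman code from a {symbol: frequency} map."""
    tie = count()  # tie-breaker so heapq never compares the code dicts
    heap = [(f, next(tie), {sym: ""}) for sym, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)
        f2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + w for s, w in c1.items()}
        merged.update({s: "1" + w for s, w in c2.items()})
        heapq.heappush(heap, (f1 + f2, next(tie), merged))
    return heap[0][2]

def compressed_size_bits(graph):
    """Total bits to list every vertex's out-edges under the indegree code."""
    indeg = Counter(v for targets in graph.values() for v in targets)
    code = huffman_codes(indeg)
    return sum(len(code[v]) for targets in graph.values() for v in targets)

graph = {"A": ["B", "C"], "B": ["C"], "C": ["A", "B"], "D": ["C"]}
print(compressed_size_bits(graph))
```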

6. Huffman Example

Indegrees:  1    3   2    3   1     1     1
Codewords:  100  01  001  11  0000  0001  101

7. Web Graph Structure
Intuition: Huffman coding uses the degree distribution, but not the structure of the Web graph.
More structure to take advantage of: Web communities. Many pages share links.
[Diagram: pages A and B sharing links to common targets C, D, E, F]

8. Reference Algorithm
Each vertex is allowed to choose a reference vertex.
Compress by representing the edges copied from the reference vertex as a bit vector.
No cycles allowed among references.
[Diagram: X uses Y as its reference, with targets a–f; X's outedges = edge a plus ref Y with bit vector [11100]]
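A minimal sketch of this encoding, mirroring the slide's picture; the list-based format and function names are illustrative:

```python
def encode_with_reference(x_edges, y_edges):
    """Represent X's out-edges as a bit vector over Y's edges plus extras."""
    copied = set(x_edges) & set(y_edges)
    bits = [1 if e in copied else 0 for e in y_edges]   # one bit per Y edge
    extras = [e for e in x_edges if e not in copied]    # edges Y lacks
    return bits, extras

def decode_with_reference(bits, extras, y_edges):
    # Recovers X's edge set; copied edges come out in Y's order.
    return [e for e, b in zip(y_edges, bits) if b] + extras

# As on the slide: X copies edges b, c, d from Y and adds edge a.
y_edges = ["b", "c", "d", "e", "f"]
x_edges = ["a", "b", "c", "d"]
bits, extras = encode_with_reference(x_edges, y_edges)
print(bits, extras)                                  # [1, 1, 1, 0, 0] ['a']
print(decode_with_reference(bits, extras, y_edges))  # ['b', 'c', 'd', 'a']
```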

9. Simple Reference Algorithm
Maximize the number of edges compressed.
Build a related affinity graph, recording the number of shared pointers between each pair of vertices.
Find a maximum spanning tree (or forest) to find the best references.
[Diagram: X and Y share 3 outedges among targets a–f, so the affinity graph has an edge (X, Y) of weight 3]

10. Improved Reference Algorithm
Let cost(A,B) be the cost of compressing A using B as a reference.
Form an improved affinity graph: a directed graph with these costs.
Also add a root node R, with cost(A,R) being the cost of compressing A with no reference.
Compute the rooted directed maximum spanning tree on the directed affinity graph.
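One way to realize this step, sketched with networkx's Chu-Liu/Edmonds implementation. The library choice is mine (the slides do not name one), and bits saved are approximated by shared-edge counts rather than the exact cost function:

```python
import networkx as nx

def savings(a_edges, b_edges):
    # Crude proxy for bits saved: each shared out-edge need not be listed.
    return len(set(a_edges) & set(b_edges))

def best_references(graph):
    g = nx.DiGraph()
    for a in graph:
        g.add_edge("R", a, weight=0)          # option: compress A alone
        for b in graph:                        # full pairwise affinity graph
            if a != b:                         # (the bottleneck, per slide 12)
                g.add_edge(b, a, weight=savings(graph[a], graph[b]))
    # R has no in-edges, so the maximum spanning arborescence is rooted at R;
    # the tree structure itself rules out reference cycles.
    tree = nx.maximum_spanning_arborescence(g, attr="weight")
    # Each vertex's parent in the tree is its reference ("R" means none).
    return {a: next(iter(tree.predecessors(a))) for a in graph}

graph = {"X": ["a", "b", "c", "d"],
         "Y": ["b", "c", "d", "e", "f"],
         "Z": ["b", "c"]}
print(best_references(graph))
```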

11. Example
[Diagram: part of the directed affinity graph for n = 1024 vertices — vertices A and B with outedges a–f, root node R, and edge costs 25, 34, 40, 50]

12. Complexity
Finding the directed maximum spanning tree is fast: for x vertices and y edges, the running time is O(x log x + y) or O(y log x).
Compressing is fast once the references are given.
The slow part is building the affinity graph.
– Equivalent to sparse matrix multiplication.
– If M is the adjacency matrix, the number of shared neighbors is found by computing MM^T.
– Sparseness helps, but this is still potentially very slow.
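A minimal scipy illustration of the MM^T observation (the use of scipy and the toy adjacency matrix are my assumptions): with M sparse and row i holding page i's out-edges, entry (i, j) of MM^T counts the out-neighbors that pages i and j share.

```python
import numpy as np
from scipy.sparse import csr_matrix

rows = np.array([0, 0, 0, 1, 1, 1, 2, 2])
cols = np.array([3, 4, 5, 3, 4, 6, 3, 4])   # pages 0, 1, 2 link into 3..6
M = csr_matrix((np.ones(len(rows)), (rows, cols)), shape=(7, 7))

shared = M @ M.T     # sparse affinity counts; diagonal = out-degrees
print(shared[0, 1])  # 2.0: pages 0 and 1 both link to pages 3 and 4
```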

13. Building the Affinity Graph
Approach 1: for each pair of vertices a, b, check the edge lists to find common neighbors.
– Slow, but good with memory.
Approach 2: for each vertex a, increase the count for each pair b, c of vertices with edges to a (a sketch follows below).
– Quicker, but a potential memory hog.
– Parallelizable.
– Complexity: one count update per pair of in-neighbors, i.e. the sum over vertices a of (indeg(a) choose 2).
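A small sketch of Approach 2, assuming the same dict-of-out-edges representation as earlier; the in-edge lists are built first, then every pair of pages pointing at the same target gets its affinity count bumped:

```python
from collections import Counter, defaultdict
from itertools import combinations

def affinity_counts(graph):
    in_lists = defaultdict(list)           # target -> pages linking to it
    for page, targets in graph.items():
        for t in targets:
            in_lists[t].append(page)
    counts = Counter()
    for pointers in in_lists.values():     # work: sum of C(indeg, 2)
        for b, c in combinations(sorted(pointers), 2):
            counts[(b, c)] += 1            # b and c share one more target
    return counts

graph = {"X": ["a", "b"], "Y": ["a", "b", "c"], "Z": ["b"]}
print(affinity_counts(graph))  # ('X','Y'): 2, ('X','Z'): 1, ('Y','Z'): 1
```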

14. Variations
Huffman code the non-referenced edges.
– Using non-Huffman weights to find references is no longer optimal.
– But the Huffman weights are not known until the references are found.
Huffman/run-length/otherwise encode the bit vectors.
Bound the depth of the tree.
Find multiple references.

15. Bounded Tree Depth
For computing on the compressed form of the graph, we do not want long paths of references.
Potential solution: bound the tree depth from the root.
Problem: finding the optimal tree of bounded depth is NP-hard.
– Depth 2 is equivalent to the facility location problem.
In practice: use heuristic/approximation algorithms, or split the full optimal tree to keep the depth bound (see the sketch below).
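A hedged sketch of the splitting idea: take the unconstrained reference tree and reattach to the root any vertex whose reference chain would exceed the bound, trading some compression for short chains. The parent-map representation ("R" meaning no reference) is assumed for illustration.

```python
def bound_depth(parent, max_depth):
    """parent maps each vertex to its reference ('R' = no reference)."""
    def old_depth(v):
        return 0 if parent[v] == "R" else old_depth(parent[v]) + 1
    new_parent, new_depth = {}, {}
    # Process parents before children so new depths are known when needed.
    for v in sorted(parent, key=old_depth):
        p = parent[v]
        if p == "R" or new_depth[p] + 1 > max_depth:
            new_parent[v], new_depth[v] = "R", 1   # cut the chain here
        else:
            new_parent[v], new_depth[v] = p, new_depth[p] + 1
    return new_parent

parent = {"A": "R", "B": "A", "C": "B", "D": "C"}
print(bound_depth(parent, 2))  # {'A': 'R', 'B': 'A', 'C': 'R', 'D': 'C'}
```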

16. Multiple References
If one reference is good, finding two could be better.
We show that finding the optimal pair of references, even just to maximize the number of compressed edges, is NP-hard.
In practice: run the single-reference algorithm multiple times.

17. Prototype
Finds references by constructing the directed affinity graph and computing its directed maximum spanning tree.
Does not output the compressed form; only the size of the compressed form.
Also computes the Huffman and Reference + Huffman sizes.
– The size of the Huffman table is not counted.
Future work: dealing with the bottleneck of computing the affinity graph.

18. Web Graph Models
Copy models (a generator sketch follows below):
– New pages are generated dynamically.
– Some links are "random": uniform over all vertices.
– Some links are copies: choose a page you like at random, and copy some of its links.
– Richer models include deletions, changing links, and inedges at creation.
– This process results in a power-law degree distribution.
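A toy Python generator in the spirit of the copy model above. The parameter names, the choice of exactly one random link per new page, and the small seed graph are assumptions for illustration, not the exact experimental setup.

```python
import random

def copy_model(n, copy_prob, seed_edges):
    """Grow a graph to n pages: copy a random page's links, add one random link."""
    graph = dict(seed_edges)                  # {page: [out-edges]}
    for page in range(len(graph), n):
        template = random.choice(list(graph))   # page whose links we copy
        edges = [t for t in graph[template] if random.random() < copy_prob]
        edges.append(random.randrange(page))    # one uniformly random link
        graph[page] = edges
    return graph

# Tiny seed clique of degree-3 vertices (the experiments used 1024 seeds).
seed = {i: [(i + 1) % 4, (i + 2) % 4, (i + 3) % 4] for i in range(4)}
g = copy_model(1000, 0.7, seed)
print(sum(len(e) for e in g.values()) / len(g))   # average out-degree
```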

19. Copy Model
[Diagram: a new page draws one uniformly random link and copies links from a randomly chosen existing page X]

20. Data for Testing
Graphs chosen using random copy graphs, plus the TREC8 WT2g data set.

Graph  Nodes    Pages Copied  Copy Prob  Random Links
G1     131,072  1             0.5        1
G2     131,072  1             1          1
G3     131,072  [1,2]         0.7        [1,2]
G4     131,072  [0,4]         0.5        [0,4]
TREC   247,428  NA            NA         NA

21. Testing Details
Single pass: at most one reference per vertex.
10 trials for each random graph type.
– Little variance found.
Random graphs seeded with 1024 vertices of degree 3.
Small graphs: an edge between two vertices in the affinity graph requires at least 2 shared edges in the original.
Large graphs (G3, G4, TREC): 3 shared edges.

22. Results

Graph  Avg. Deg.  No comp. (bits, mill.)  Huffman  Reference  Ref+Huff
G1     2.09       4.66                    87.75    88.68      81.58
G2     3.25       7.25                    83.93    67.49      63.63
G3     5.10       11.36                   85.15    69.96      65.35
G4     10.22      22.78                   79.47    61.65      54.13
TREC   4.72       21.00                   83.31    49.15      46.36

(Huffman, Reference, and Ref+Huff columns: compressed size as a percentage of the uncompressed size.)

23. Analysis of Results
Huffman coding fails to capture significant structure.
More copying leads to more compression.
Good compression is possible even with only one reference.
The algorithm performs well on "real" Web data.
– The TREC database may not be representative.
– It has significant locality.

24. Contributions
We introduce the Reference algorithm, an algorithm designed to compress Web graphs based on their structural properties.
Initial results: the Reference algorithm appears very promising, and better than Huffman coding.
Bounded-depth variations may be suitable for on-line computation (e.g. a connectivity server).
Hardness results for natural extensions of the Reference algorithm.

25. Future Work
Beating the bottleneck: determining the affinity graph.
– Can we approximate the affinity graph and still compress well?
More extensive testing.
– Variations: multiple passes, bounded depth.
– Graphs: larger artificial and real Web graphs.
Determining the value of locality, and combining locality with a reference-based scheme.

