1
CS728-2008 Lecture 9 Storing and Querying Large Web Graphs
2
Last Time Algorithms for link-based clustering finding “tightly knit communities” TKC on the web graphs Today’s lecture Dealing with large graphs Building Indexes for adjacency and connectivity testing Distance and Transitive Closure New Data Structure: 2-hop covers
3
Connectivity Server Support for fast queries on the web graph –Which URLs point to a given URL? –Which URLs does a given URL point to? Stores mappings in main memory from URL to outlinks, URL to inlinks Applications –Crawl control, Web graph analysis –Connectivity, crawl optimization –TKCs, other Link analysis
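The two queries above can be sketched with a pair of in-memory maps. This is a minimal illustration only: the class name `ConnectivityServer`, its method names, and the example URLs are hypothetical and are not taken from the system described in the lecture.

```python
from collections import defaultdict

class ConnectivityServer:
    """Toy in-memory link index (hypothetical, for illustration only)."""

    def __init__(self):
        self.outlinks = defaultdict(set)  # URL -> URLs it points to
        self.inlinks = defaultdict(set)   # URL -> URLs that point to it

    def add_link(self, src, dst):
        # Record the edge in both directions so both queries are one lookup.
        self.outlinks[src].add(dst)
        self.inlinks[dst].add(src)

    def points_to(self, url):
        """Which URLs does the given URL point to?"""
        return sorted(self.outlinks.get(url, ()))

    def pointed_to_by(self, url):
        """Which URLs point to the given URL?"""
        return sorted(self.inlinks.get(url, ()))

server = ConnectivityServer()
server.add_link("a.example/x", "b.example/y")
server.add_link("c.example/z", "b.example/y")
print(server.pointed_to_by("b.example/y"))
```

Storing every edge twice doubles the memory cost, which is one reason the compression of adjacency lists matters.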
4
Problem of Adjacency lists ALs store set of neighbors of each node Assume each URL represented by an integer –Use some natural ordering or hashing –E.g., for a 60K page web, need 16 bits –For 4 billion page web, need 32 bits per node Naively, this demands 32-64 bits to represent each hyperlink Can we compress this?
5
Adjacency list compression Properties exploited in compression: –Similarity (between lists) –Locality (many links from a page go to “nearby” pages) –Can use gap encodings in sorted lists –Look at distribution of gap values
6
Gap encoding (Elias) Given a list of integers in increasing order. –E.g., 33,47,154,159,202 … It suffices to store gaps. –33,14,107,5,43 … We hope: most gaps can be encoded with far fewer bits. Represent a gap G as the pair ⟨length, offset⟩: length is in unary and uses $\lfloor \log_2 G \rfloor + 1$ bits to specify the length of the binary encoding of offset $= G - 2^{\lfloor \log_2 G \rfloor}$. Recall that the unary encoding of x is a sequence of x 1’s followed by a 0.
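The conversion between a sorted list and its gaps can be sketched as follows. The function names `to_gaps` and `from_gaps` are illustrative, and the sketch assumes the first gap is measured from 0, as in the slide's example.

```python
def to_gaps(sorted_ids):
    """Replace each integer by its difference from the previous one."""
    gaps, prev = [], 0
    for x in sorted_ids:
        gaps.append(x - prev)
        prev = x
    return gaps

def from_gaps(gaps):
    """Invert to_gaps with a running sum."""
    ids, total = [], 0
    for g in gaps:
        total += g
        ids.append(total)
    return ids

print(to_gaps([33, 47, 154, 159, 202]))  # [33, 14, 107, 5, 43]
```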
7
Elias codes for gap encoding e.g., 9 is represented as ⟨1110, 001⟩. 2 is represented as ⟨10, 0⟩. Exercise: does zero have a code? Encoding G takes $2\lfloor \log_2 G \rfloor + 1$ bits. –codes are always of odd length. –$1 = 2^0 + 0$ = 0 –$2 = 2^1 + 0$ = 100 –$3 = 2^1 + 1$ = 101 –$4 = 2^2 + 0$ = 11000 –$5 = 2^2 + 1$ = 11001 –$6 = 2^2 + 2$ = 11010 –$7 = 2^2 + 3$ = 11011 –$8 = 2^3 + 0$ = 1110000 –$9 = 2^3 + 1$ = 1110001
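A minimal sketch of an encoder and decoder for this code, following the definition literally: the length floor(log2 G) in unary (that many 1's, then a 0), followed by the offset G - 2^floor(log2 G) written in exactly floor(log2 G) bits. Under that definition 1 encodes as 0 and 2 as 100. The function names are illustrative, and codes are handled as strings of '0'/'1' characters for readability rather than as packed bits.

```python
def gamma_encode(g):
    """Encode a positive integer as unary length followed by binary offset."""
    if g < 1:
        raise ValueError("only positive integers have a code")
    n = g.bit_length() - 1            # floor(log2 g)
    offset = g - (1 << n)             # g - 2^n, fits in n bits
    unary = "1" * n + "0"
    binary = format(offset, "b").zfill(n) if n else ""
    return unary + binary

def gamma_decode(bits):
    """Decode a concatenation of codes back into a list of integers."""
    out, i = [], 0
    while i < len(bits):
        n = 0
        while bits[i] == "1":         # read the unary length
            n += 1
            i += 1
        i += 1                        # skip the terminating 0
        offset = int(bits[i:i + n], 2) if n else 0
        i += n
        out.append((1 << n) + offset)
    return out

print(gamma_encode(9))                # 1110001
print(gamma_decode("1110001" + "100"))  # [9, 2]
```

Because each code announces its own length, codes can be concatenated with no separators and still be decoded unambiguously.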
8
Exercise Given the following sequence of coded gaps, reconstruct the gap sequence: 1110001110101011111101101111011
9
Storage Requirements A recent paper by Boldi/Vigna reports that we can get down to an average of ~3 bits/link –(URL to URL edge) –For a 118M node web graph How can this be possible? Why is this remarkable?
10
Main ideas of Boldi/Vigna First consider lexicographically ordered list of all URLs, e.g., –www.stanford.edu/alchemy –www.stanford.edu/biology –www.stanford.edu/biology/plant –www.stanford.edu/biology/plant/copyright –www.stanford.edu/biology/plant/people –www.stanford.edu/chemistry
11
Boldi/Vigna Each of these URLs has an adjacency list Main thesis: because of use of webpage templates, the adjacency list of a node is usually similar to one of the 7 preceding URLs in the lexicographic ordering Express adjacency list in terms of one of these E.g., consider these adjacency lists –1, 2, 4, 8, 16, 32, 64 –1, 4, 9, 16, 25, 36, 49, 64 –1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144 –1, 4, 8, 16, 25, 36, 49, 64
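As a rough illustration of expressing one list in terms of another, the sketch below encodes a target list as a copy mask over a reference list plus the leftover entries. This is a simplified stand-in, not the actual Boldi/Vigna format; the function names and the mask-plus-extras layout are assumptions made for the example.

```python
def encode_with_reference(target, reference):
    """Describe target as (copy mask over reference, extra entries)."""
    target_set, ref_set = set(target), set(reference)
    copy_bits = [1 if x in target_set else 0 for x in reference]
    extras = [x for x in target if x not in ref_set]
    return copy_bits, extras

def decode_with_reference(copy_bits, extras, reference):
    """Rebuild the target list from the reference, the mask and the extras."""
    copied = [x for x, bit in zip(reference, copy_bits) if bit]
    return sorted(copied + extras)

reference = [1, 4, 9, 16, 25, 36, 49, 64]   # second list on the slide
target = [1, 4, 8, 16, 25, 36, 49, 64]      # fourth list on the slide
copy_bits, extras = encode_with_reference(target, reference)
print(copy_bits, extras)  # [1, 1, 0, 1, 1, 1, 1, 1] [8]
```

Here the fourth list costs one bit per reference entry plus a single extra integer, instead of eight full integers.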
12
Connectivity Queries Beyond basic adjacency we’d like to answer other queries… –Transitive closure: is there a path from x to y? –Distance: what is the length of shortest path from x to y? Applications –Link analysis –XML path queries with wildcards
13
Naïve Solutions Given a web graph, we can compute and store All Pairs Shortest Paths (APSP) off-line –Then answer any query in constant time –What are the space requirements for an n-node graph? Alternatively, given a node, we can compute online –Answer the query with a Single Source Shortest Path algorithm –Minimal additional space required. –What is the time complexity to answer a query?
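The online alternative can be sketched with breadth-first search, which is a single-source shortest path algorithm when every link has length 1. The function name and the dictionary-of-lists graph representation are assumptions of this sketch. Each query may visit every node and edge, whereas a precomputed all-pairs table answers in one lookup but needs an entry for each of the n × n node pairs.

```python
from collections import deque

def bfs_distance(adj, source, target):
    """Length of the shortest path from source to target, or None if unreachable."""
    if source == target:
        return 0
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                if v == target:
                    return dist[v]
                queue.append(v)
    return None

adj = {1: [2, 3], 2: [4], 3: [4], 4: [5]}
print(bfs_distance(adj, 1, 5))  # 3
print(bfs_distance(adj, 5, 1))  # None
```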
14
Transitive Closure Encoding Problem We want to find a compact representation for the transitive closure –whose size is comparable to the data's size –that supports connection tests (almost) as fast as the naive transitive closure lookup –that can be built efficiently for large data sets
15
Main Idea: 2-Hop Covers and 2-Hop Labeling A 2-hop cover is a set of hops (x,y) so that every connected pair is covered by 2 hops For each node a, we maintain two sets of labels (which are simply lists of nodes): Lin(a) and Lout(a) For each connection (a,b), –choose a node c on the path from a to b (center node) –add c to Lout(a) and to Lin(b) Then $(a,b) \in T \iff L_{out}(a) \cap L_{in}(b) \neq \emptyset$, where T is the transitive closure [Figure: path a → c → b] Reachability and distance queries via 2-hop labels (Cohen et al., SODA 2002)
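A reachability test over 2-hop labels reduces to a set intersection. The sketch below assumes, as a convention of this example only, that every node is implicitly its own center, so that a connection whose center is one of its endpoints is also found; the label dictionaries and the function name are illustrative.

```python
def reachable(a, b, l_out, l_in):
    """True if Lout(a) and Lin(b) share a center node."""
    if a == b:
        return True
    # Assumption of this sketch: each node counts as a member of its own labels.
    out_a = l_out.get(a, set()) | {a}
    in_b = l_in.get(b, set()) | {b}
    return not out_a.isdisjoint(in_b)

# Path 1 -> 2 -> 3, with node 2 chosen as the center for connection (1, 3).
l_out = {1: {2}}
l_in = {3: {2}}
print(reachable(1, 3, l_out, l_in))  # True
print(reachable(3, 1, l_out, l_in))  # False
```

The query cost depends only on the sizes of the two label sets, not on the size of the graph.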
16
2-hop Covers Conjecture: For any graph with n nodes and m edges, a 2-hop cover always exists and has size bounded by O(n √m) Optimization Problem: Find a cover which minimizes the sum of the label sizes Problem is NP-hard –=> approximation required Theorem: There exists a polynomial-time algorithm that approximates the optimal-size 2-hop cover within a factor of log n. Based on a greedy (set cover) algorithm
17
Approximation Algorithm What are good center nodes? Nodes that can cover many uncovered connections. Consider the center graph of candidates. Initial step: All connections are uncovered. [Figure: example graph on nodes 1 to 6 and the center graph of a candidate node, with input side I and output side O, labeled with its initial density. We can cover 8 connections with 6 cover entries.]
18
Approximation Algorithm Consider the center graph of candidates. Initial step: All connections are uncovered. Cover connections in subgraph with greatest density with corresponding center node. [Figure: example graph on nodes 1 to 6 and the center graph of a candidate node, with input side I and output side O]
19
Approximation Algorithm Consider the center graph of candidates. Next step: Some connections already covered. Repeat this algorithm until all connections are covered. Theorem: Generated cover is optimal up to a logarithmic factor. [Figure: example graph on nodes 1 to 6 and the center graph of a candidate node, with input side I and output side O]
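The greedy loop can be sketched on a toy graph. This is a deliberately simplified version: for each candidate it scores the whole center graph (uncovered connections divided by the number of I and O nodes involved) instead of searching for its densest subgraph, so the logarithmic guarantee stated above is not claimed for it. It also materializes the full transitive closure, which is only feasible for small graphs. The function names and the example graph are hypothetical.

```python
def reachable_sets(adj, nodes):
    """Map each node to the set of other nodes it can reach."""
    reach = {}
    for s in nodes:
        seen, stack = set(), [s]
        while stack:
            u = stack.pop()
            for v in adj.get(u, ()):
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        seen.discard(s)
        reach[s] = seen
    return reach

def greedy_two_hop_cover(adj, nodes):
    """Simplified greedy: repeatedly pick the center whose center graph is densest."""
    reach = reachable_sets(adj, nodes)
    uncovered = {(a, b) for a in nodes for b in reach[a]}
    l_out = {v: set() for v in nodes}
    l_in = {v: set() for v in nodes}
    while uncovered:
        best_density, best = 0.0, None
        for c in nodes:
            # Uncovered connections (a, b) that have c on some path from a to b.
            edges = [(a, b) for (a, b) in uncovered
                     if (a == c or c in reach[a]) and (b == c or b in reach[c])]
            if not edges:
                continue
            ins = {a for a, _ in edges}
            outs = {b for _, b in edges}
            density = len(edges) / (len(ins) + len(outs))
            if density > best_density:
                best_density, best = density, (c, edges, ins, outs)
        c, edges, ins, outs = best
        for a in ins:
            l_out[a].add(c)
        for b in outs:
            l_in[b].add(c)
        uncovered.difference_update(edges)
    return l_out, l_in

# Hypothetical example: nodes 1 and 2 point to 3, which points to 4 and 5.
nodes = [1, 2, 3, 4, 5]
adj = {1: [3], 2: [3], 3: [4, 5]}
l_out, l_in = greedy_two_hop_cover(adj, nodes)
entries = sum(len(s) for s in l_out.values()) + sum(len(s) for s in l_in.values())
print(entries)  # 6 label entries cover all 8 connections
```

In this toy graph node 3 lies on every connection, so a single greedy step with center 3 covers everything.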
20
Experimental Results Small example from real world: subset of DBLP –6,210 documents (publications) –168,991 elements –25,368 links (citations) –14 MB (uncompressed XML) Element-level graph has 168,991 nodes and 188,149 edges Its transitive closure: 344,992,370 connections, 2,632.1 MB
21
Experimental Results For example above: Transitive Closure: 344,992,370 connections Two-Hop Cover: 1,289,930 entries compression factor of ~267 queries are still fast (~7.6 entries/node) But: Computation took 45 hours and 80 GB RAM! Need: Smart partitioning of problem to fit memory
22
Final Results for Index Creation Transitive Closure: 344,992,370 connections Two-Hop Cover: 9,999,052 entries compression factor of ~34.5 queries are still ok (~59.2 entries/node) build time is good (~23 minutes with 1 CPU and 1GB RAM) Cover size 8 times larger than best, but ~118 times faster with ~1% memory
23
Why Distances are much more Difficult than TC Should be simple to add distance information: for a path v → u → w with dist(v,u)=2 and dist(u,w)=4, the labels Lout(v)={u, …}, Lin(w)={u, …} become Lout(v)={(u,2), …}, Lin(w)={(u,4), …} Is this correct? dist(v,w)=dist(v,u)+dist(u,w)=2+4=6
24
Why Distances are Difficult [Figure: path v → u → w with lengths 2 and 4, plus a shorter connection from v to w] dist(v,w)=1 Center node u does not reflect the correct distance of v and w
25
Solution: Distance-aware Centergraph Add edges to the center graph only if the corresponding connection is a shortest path Correct, but with problems: –Expensive to build the center graph (2 additional lookups per connection) –Approximation bound is no longer tight [Figure: example graph on nodes 1 to 6 and a distance-aware center graph with input side I and output side O]
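A small sketch of why the shortest-path condition matters, using the distances from the v, u, w example (dist(v,u)=2, dist(u,w)=4, dist(v,w)=1). Labels are dictionaries from center to distance; `label_distance` takes the minimum over shared centers, and `on_shortest_path` is the test that a distance-aware center graph would apply. All names, and the choice of v itself as the alternative center, are assumptions of this example.

```python
import math

def label_distance(v, w, l_out, l_in):
    """Minimum of dist(v, c) + dist(c, w) over centers c shared by both labels."""
    best = math.inf
    for center, d_out in l_out.get(v, {}).items():
        d_in = l_in.get(w, {}).get(center)
        if d_in is not None:
            best = min(best, d_out + d_in)
    return best

def on_shortest_path(dist, a, c, b):
    """Center c may cover (a, b) only if it lies on a shortest path from a to b."""
    return dist[a][c] + dist[c][b] == dist[a][b]

dist = {"v": {"v": 0, "u": 2, "w": 1}, "u": {"u": 0, "w": 4}}

# Naive labels with center u overestimate the distance.
naive = label_distance("v", "w", {"v": {"u": 2}}, {"w": {"u": 4}})
print(naive)                                 # 6
print(on_shortest_path(dist, "v", "u", "w"))  # False

# A center that passes the check (here v itself) gives the correct answer.
aware = label_distance("v", "w", {"v": {"v": 0}}, {"w": {"v": 1}})
print(aware)                                 # 1
```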