Dense subgraphs of random graphs Uriel Feige Weizmann Institute
Talk Outline Discuss problems related to dense subgraphs of random graphs: Planted k-clique. Dense k-subgraph (if time permits).
Random Clique Random graph G on n vertices and edge probability ½. Maximum clique size almost surely 2 log n. Upper bound: expectation. Lower bound: + variance. Not constructive.
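The "upper bound: expectation" step can be checked numerically. The sketch below assumes logarithms are base 2 and uses hypothetical helper names; it computes the expected number of k-cliques in G(n, ½) and the largest k for which that expectation is still at least 1.

```python
from math import comb, log2

def expected_cliques(n: int, k: int) -> float:
    """Expected number of k-cliques in G(n, 1/2): C(n, k) * 2^(-C(k, 2))."""
    return comb(n, k) / 2 ** comb(k, 2)

def first_moment_threshold(n: int) -> int:
    """Largest k whose expected k-clique count is still at least 1."""
    k = 1
    while expected_cliques(n, k + 1) >= 1:
        k += 1
    return k

n = 40_000
k_star = first_moment_threshold(n)
# Above k_star the expectation drops below 1, so by Markov's inequality
# such cliques are unlikely; k_star stays below 2 * log2(n).
print(k_star, 2 * log2(n))
```

The gap between the printed threshold and 2 log n comes from the low order terms.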
How to actually find the clique? Greedy(degree) algorithm finds clique of size log n (plus low order terms). No better polytime algorithm known. Exhaustive search in time n^{O(log n)}.
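One possible reading of the greedy algorithm, as a minimal sketch: repeatedly pick the candidate with the most neighbors among the remaining candidates, then keep only its neighbors. The function names and the exact tie-breaking are assumptions, not taken from the talk.

```python
import random

def random_graph(n, seed=0):
    """Sample G(n, 1/2) as a list of neighbor sets."""
    rng = random.Random(seed)
    adj = [set() for _ in range(n)]
    for u in range(n):
        for v in range(u + 1, n):
            if rng.random() < 0.5:
                adj[u].add(v)
                adj[v].add(u)
    return adj

def greedy_clique(adj):
    """Grow a clique by always taking the highest-degree remaining candidate."""
    candidates = set(range(len(adj)))
    clique = []
    while candidates:
        # Degree is measured inside the current candidate set.
        v = max(candidates, key=lambda u: len(adj[u] & candidates))
        clique.append(v)
        # Only common neighbors of the clique so far stay eligible.
        candidates &= adj[v]
    return clique

adj = random_graph(300, seed=1)
clique = greedy_clique(adj)
print(len(clique))
```

Each step roughly halves the candidate set, which is where the log n clique size comes from.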
Cryptographic applications [Juels and Peinado] Assuming state of the art is not improved: One-way functions. Hierarchical keys. (Idea: distribution does not change if a small number of cliques of size 1.5 log n are planted in the graph.)
Planted/hidden clique Random graph G on n vertices and edge probability ½. A random set H of k vertices turned into a clique. If k > 2 log n, H will almost surely be the unique maximum clique in G. Find H. Becomes easier the larger k is.
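The input distribution can be sampled as follows. This is a minimal sketch; the function name and the neighbor-set representation are illustrative choices, not part of the talk.

```python
import random

def planted_clique_graph(n, k, seed=0):
    """Sample G(n, 1/2), then turn a random k-subset H into a clique.

    Returns (adj, hidden): adj[v] is the neighbor set of v, hidden is H.
    """
    rng = random.Random(seed)
    hidden = set(rng.sample(range(n), k))
    adj = [set() for _ in range(n)]
    for u in range(n):
        for v in range(u + 1, n):
            # Pairs inside H are always joined; all others with probability 1/2.
            if (u in hidden and v in hidden) or rng.random() < 0.5:
                adj[u].add(v)
                adj[v].add(u)
    return adj, hidden

adj, hidden = planted_clique_graph(100, 20, seed=3)
print(len(hidden), sum(len(a) for a in adj) // 2)
```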
Degree concentration Degrees of vertices in G strongly concentrated around n/2. Distribution of degrees of H-vertices statistically different from that of other vertices if k is larger than the standard deviation. Kučera: if k > c(n log n)^{1/2}, H is simply all vertices of largest degree. (Greedy(degree) algorithm outputs H)
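A small experiment illustrating the degree-based recovery, as a sketch. The parameters (n = 400, k = 240) are chosen only so that k is far above (n log n)^{1/2}; the helper names are hypothetical.

```python
import random

def planted_clique_graph(n, k, seed=0):
    """G(n, 1/2) with a planted k-clique; returns (neighbor sets, H)."""
    rng = random.Random(seed)
    hidden = set(rng.sample(range(n), k))
    adj = [set() for _ in range(n)]
    for u in range(n):
        for v in range(u + 1, n):
            if (u in hidden and v in hidden) or rng.random() < 0.5:
                adj[u].add(v)
                adj[v].add(u)
    return adj, hidden

def top_degree_vertices(adj, k):
    """Return the k vertices of largest degree."""
    order = sorted(range(len(adj)), key=lambda v: len(adj[v]), reverse=True)
    return set(order[:k])

adj, hidden = planted_clique_graph(400, 240, seed=7)
guess = top_degree_vertices(adj, 240)
print(guess == hidden)
```

Vertices of H gain about k/2 extra neighbors, so for k this large their degrees are separated from the rest by many standard deviations.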
Use of eigenvectors [Alon, Krivelevich and Sudakov] Normalize adjacency matrix of G to sum up to 0. Eigenvalues of G strongly concentrated around 0. No value larger than n^{1/2}. If k > c n^{1/2}, H contributes a larger eigenvalue. H can be recovered from the eigenvector that corresponds to the largest eigenvalue (takes some work).
Constant improvements Guess a vertex from H, and restrict problem to its neighborhood. Clique relative size increases, and graph remains random. Can find planted cliques of size n^{1/2}/2^t in time n^{O(t)}. Polynomial (but very slow) for fixed t.
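The effect of guessing clique vertices can be seen with simple arithmetic. The sketch below uses expected sizes only (an assumption that ignores fluctuations): each correct guess keeps the rest of the clique but only about half of the other vertices, so k/n^{1/2} grows by roughly a factor of 2^{1/2} per guess.

```python
from math import sqrt

def relative_clique_size(n, k, t):
    """k / sqrt(n) for the subgraph induced on the common neighborhood
    of t correctly guessed clique vertices (expected sizes)."""
    clique_left = k - t                # the rest of H survives
    others_left = (n - k) / 2 ** t     # non-H vertices halve with each guess
    return clique_left / sqrt(clique_left + others_left)

ratios = [relative_clique_size(40_000, 100, t) for t in range(5)]
print([round(r, 3) for r in ratios])
```

Trying all possible guesses of t vertices costs a factor of about n^t, which matches the n^{O(t)} running time.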
Use of SDP [Feige and Krauthgamer] Lovasz theta function provides upper bound of clique size. On random graphs, its value is known to be O(n^{1/2}). Can be used to both find and certify optimality of H when k > n^{1/2}.
Going below n^{1/2} A certain Markov chain approach fails [Jerrum]. Use of t levels of Lovasz-Schrijver SDP relaxations no better than simply guessing t vertices of clique [Feige and Krauthgamer]. For k > n^{1/3}, H can be recovered from a global maximum of a certain cubic form [Frieze and Kannan].
Why care about planted clique? Seems to require the development of new algorithmic techniques. A concrete challenge for understanding observable properties of random graphs (does planting a large clique make a noticeable difference?). Related to some other problems.
Interesting connection In a 2-person game, an approximate Nash equilibrium with nearly best payoffs (compared to true Nash) can be found in time n^{O(log n)} [Lipton, Markakis and Mehta]. A poly-time algorithm for approximate best Nash will solve the hidden clique problem in polynomial time [Hazan and Krauthgamer].
The experimental approach to the design and analysis of algorithms For hidden clique, the input distribution is well defined and can be sampled from efficiently. To evaluate a candidate algorithm, run it on a random sample and observe performance. If not good, modify the algorithm. If good, analyze the algorithm. In practice, graphs for experiments are generated using pseudorandom generators.
Experimental results (with Dorit Ron) n = 40,000. m = 400,000,000. n^{1/2} = 200. For success rate roughly ½: k = 158 (Alg1 - LDR), 137 (Alg2 - TPMR). Is this good or bad? 2 log n ≈ 30. n^{1/4} ≈ 14.
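The reference scales on this slide can be reproduced directly. The sketch assumes that m is the expected number of edges at edge probability ½ and that log is base 2.

```python
from math import log2, sqrt

n = 40_000
expected_edges = n * (n - 1) // 4   # each of the C(n, 2) pairs is an edge w.p. 1/2
sqrt_n = sqrt(n)                    # scale reached by spectral and SDP methods
two_log_n = 2 * log2(n)             # clique size of a purely random graph
fourth_root = n ** 0.25

print(expected_edges, sqrt_n, round(two_log_n, 1), round(fourth_root, 1))
```

The reported values k = 158 and k = 137 sit below n^{1/2} = 200 but far above both 2 log n and n^{1/4}.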
Understanding large sets of results Estimating the success probability to within 1% error requires roughly 10,000 experiments. To see patterns, it helps if results are displayed graphically. Do our algorithms work when k = n^{0.49}? Need experiments with large n.
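The figure of roughly 10,000 experiments is consistent with a standard binomial estimate. The sketch below assumes a 95% confidence level (z = 1.96) and a worst-case success probability of ½; neither is stated on the slide.

```python
from math import ceil, sqrt

def experiments_needed(margin, p=0.5, z=1.96):
    """Trials needed so that z standard errors of the estimate fit in `margin`."""
    return ceil(z * z * p * (1 - p) / margin ** 2)

def standard_error(p, trials):
    """Standard deviation of the empirical success rate over `trials` runs."""
    return sqrt(p * (1 - p) / trials)

needed = experiments_needed(0.01)
print(needed, standard_error(0.5, 10_000))
```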
Jumping to conclusions Care is needed. Is the PRG the issue? Is n sufficiently large to draw asymptotic conclusions? Might the choice of scaling of the x-axis be biasing our interpretation?
Jump to the analysis? The TPMR algorithm (Truncated Power Method Removal) looks promising. Difficult to analyze, but worth it, because the algorithm is so special. Or is it? (there was also Alg1 …)
Information on the algorithms General idea: Sort vertices by likelihood of being in H. Remove (one or more) least likely vertices. Repeat. Our algorithms take linear time (in m).
Low Degree Removal (LDR) Iterative removal phase: If current graph is a clique, move to expansion phase. Remove vertex of lowest degree (breaking ties arbitrarily). Iterative expansion phase: Add vertices that are connected to all the clique.
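The two phases can be sketched as below. This is a minimal illustration, not the authors' implementation: ties are broken by whatever `min` returns first, and the test instance (n = 200, k = 100) is deliberately easy.

```python
import random

def planted_clique_graph(n, k, seed=0):
    """G(n, 1/2) with a planted k-clique; returns (neighbor sets, H)."""
    rng = random.Random(seed)
    hidden = set(rng.sample(range(n), k))
    adj = [set() for _ in range(n)]
    for u in range(n):
        for v in range(u + 1, n):
            if (u in hidden and v in hidden) or rng.random() < 0.5:
                adj[u].add(v)
                adj[v].add(u)
    return adj, hidden

def ldr(adj):
    n = len(adj)
    alive = set(range(n))
    degree = [len(adj[v]) for v in range(n)]  # degree inside the current graph
    # Removal phase: delete a lowest-degree vertex until a clique remains.
    while alive:
        v = min(alive, key=lambda u: degree[u])
        if degree[v] == len(alive) - 1:
            break  # minimum degree t - 1 means the current graph is a clique
        alive.remove(v)
        for u in adj[v]:
            if u in alive:
                degree[u] -= 1
    # Expansion phase: add back vertices adjacent to the whole clique.
    clique = set(alive)
    for v in range(n):
        if v not in clique and clique <= adj[v]:
            clique.add(v)
    return clique

adj, hidden = planted_clique_graph(200, 100, seed=1)
found = ldr(adj)
print(found == hidden)
```

This version scans all surviving vertices at every step, so it runs in time about n², which is proportional to m for these dense graphs.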
Theorem For every ε < 1 there is a constant c such that if k > c n^{1/2} then LDR finds the hidden k-clique H for at least an ε fraction of the input instances.
Sketch of proof of theorem Lemma 1. In every subgraph with t > 11k/10 vertices, some vertex not in H has degree at most t/2 + c_1 n^{1/2}. Proof. Straightforward. Large deviation bounds on average degree + union bound.
Corollary As long as t > 11k/10 vertices remain, LDR removes a vertex of degree “not much larger” than t/2 (at most t/2 + c_1 n^{1/2}).
Lemma 2 For any vertex v, with high probability (say 99/100), up to the point v was removed (if at all), v’s average degree to removed vertices not in H is at most 1/2, with a total deviation no larger than c_2 n^{1/2}.
Sketch of proof of Lemma 2 Reveal the edges of v only when needed. Given a candidate vertex u for removal, if no edge (u,v) then remove u. Otherwise perhaps delay removal. Average rate of removal at most 1/2. Probability of excursion larger than c_2 n^{1/2} is small.
Most vertices of H survive LDR. Almost all vertices of H start with “very high” degree (assuming that c > 4(c_1 + c_2)). There are always vertices of not high degree available for removal. (Lemma 1.) The first k/10 high degree vertices of H to be removed must have lost degree at a high rate. This is a low probability event, by Lemma 2 and Markov’s inequality.
Finishing the proof 9k/10 vertices of H among the last 11k/10 survivors. Hence no vertex not in H can survive the removal phase. Expansion phase will pick up remaining vertices from H.
Conjectures The leading constant c is small: when the required success fraction is 1/2, c < 1 suffices. Order of quantifiers can be switched: for some c, the success fraction tends to 1 as n grows. Lower bounds: LDR fails when k = o(n^{1/2}).
Open question Does the size of the planted clique exhibit threshold behavior with respect to the success probability of the LDR algorithm?
Truncated Power Method Removal (TPMR) Initially x is the vector of degrees. Compute x’ = Ax. Normalize x’ to sum up to 0. Average x and x’ to get a new x. Repeat 6 times. Sort vertices by their x value. Remove the lower 10%. Etc.
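A rough sketch of one way to turn these steps into code. The slide leaves the details behind "Etc." open, so several choices here are assumptions: scores are recomputed from the degrees in every round, rounds stop once the surviving graph is a clique, and an LDR-style expansion phase is run at the end. All function names are hypothetical.

```python
import random

def planted_clique_graph(n, k, seed=0):
    """G(n, 1/2) with a planted k-clique; returns (neighbor sets, H)."""
    rng = random.Random(seed)
    hidden = set(rng.sample(range(n), k))
    adj = [set() for _ in range(n)]
    for u in range(n):
        for v in range(u + 1, n):
            if (u in hidden and v in hidden) or rng.random() < 0.5:
                adj[u].add(v)
                adj[v].add(u)
    return adj, hidden

def tpmr_scores(adj, alive, iterations=6):
    """Truncated power method: a few rounds of x <- (x + centered Ax) / 2."""
    x = {v: float(len(adj[v] & alive)) for v in alive}  # start from degrees
    for _ in range(iterations):
        ax = {v: sum(x[u] for u in adj[v] if u in alive) for v in alive}
        mean = sum(ax.values()) / len(ax)               # shift Ax to sum to 0
        x = {v: (x[v] + ax[v] - mean) / 2 for v in alive}
    return x

def is_clique(adj, vertices):
    return all(len(adj[v] & vertices) == len(vertices) - 1 for v in vertices)

def tpmr(adj, drop_fraction=0.10):
    alive = set(range(len(adj)))
    # Assumed stopping rule: repeat until the survivors form a clique.
    while len(alive) > 1 and not is_clique(adj, alive):
        x = tpmr_scores(adj, alive)
        ranked = sorted(alive, key=lambda v: x[v])
        drop = max(1, int(drop_fraction * len(alive)))
        alive -= set(ranked[:drop])                     # remove the lowest 10%
    # Assumed final step: add back vertices adjacent to the whole clique.
    clique = set(alive)
    for v in range(len(adj)):
        if v not in clique and clique <= adj[v]:
            clique.add(v)
    return clique

adj, hidden = planted_clique_graph(200, 100, seed=2)
found = tpmr(adj)
print(found == hidden)
```

The last removal rounds may discard a few vertices of H along with the remaining outsiders; the expansion step is what restores them in this sketch.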
Some observations on TPMR Linear time in m, though slower than LDR. Finds smaller planted cliques than LDR. Why not let x converge? Truncating is faster, and performs better in our experiments. Any hope of analysing TPMR?
Summary Experimental approach suggests interesting observations. Commit in small steps. (Related to “decimation” in message passing algs.) Truncated power method is better than power method. Challenge: support observations by analysis.
Running times Lenovo 2.53 GHz and 3 GB RAM. 20 samples with around 50% success rates.
N GEN | LDR | TPMR |
17 (3) | 48 (34) | |
80 (8) | 199 (127) | |
365 (31) | 832 (498) | |