Joel Oren, University of Toronto

1 Influence at Scale: Distributed Computation of Complex Contagion in Networks
Joel Oren, University of Toronto Joint work with Yaron Singer (Harvard) & Brendan Lucier (MSR)

2 Efficient Influence Estimation in Massive Social Networks
Given a model of influence spread, how do we compute the influence of a group of "initial adopters"? Common approaches: polynomial-time Monte Carlo (sampling) methods (Kempe et al. 2003), and elementary matrix-product methods (Even-Dar & Shapira 2011). This talk asks whether these approaches remain satisfactory when we scale to massive social networks. Processes of influence spread have become increasingly evident, and a wealth of data allows for more research than ever before: viral ad campaigns, rumor spread, political movements, etc.

3 Influence Estimation for Massive Social Networks
General question: what is the information cost of theoretical methods for estimating properties of large graphs? How much of the graph do we need to examine (query complexity)? How much information do we need to store throughout the execution? Can we get provably efficient estimation using distributed methods? How applicable is the MapReduce paradigm?

4 Influence Estimation: Context and Related Work
Analysis of massive graphs: single-machine systems such as Cassovary (Twitter), Ligra, and SNAP; distributed systems such as Hadoop/MapReduce, Spark/GraphX, Giraph, and GraphLab. Ongoing work develops efficient algorithms that adhere to a computational paradigm inspired by MapReduce (Karloff et al. 2010): connectivity, MSTs, triangle counting, edge covers, matching, and more. Influence diffusion: models have been proposed since the 1960s. Kempe et al. (2003, 2005) studied the algorithmic problem of selecting the k most influential nodes. Influence can be estimated with poly(n) samples.

5 The Independent Cascade Model
Introduced by Kempe et al. (2003). The model: Input: an edge-weighted directed graph G = (V, E, p) and a set of initially infected seeds S_0 ⊆ V. Write S_t for the set of infected nodes at step t. At every synchronous step t > 0, every newly infected node u ∈ S_{t-1} infects each neighbor v independently with probability p_uv.
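The synchronous process above can be sketched directly; this is a minimal simulation of one Independent Cascade sample (the graph encoding as an adjacency dict is an illustrative choice, not from the talk):

```python
import random

def independent_cascade(graph, seeds, rng):
    """One sample of the Independent Cascade process.

    graph: dict mapping node u -> list of (v, p_uv) out-edges.
    seeds: the initially infected set S_0.
    Returns the final infected set.
    """
    infected = set(seeds)
    frontier = set(seeds)  # the newly infected nodes S_{t-1}
    while frontier:
        next_frontier = set()
        for u in frontier:
            for v, p_uv in graph.get(u, []):
                # each newly infected u gets one independent chance to infect v
                if v not in infected and rng.random() < p_uv:
                    next_frontier.add(v)
        infected |= next_frontier
        frontier = next_frontier
    return infected
```

Note that a node that fails to infect a neighbor never gets another chance at it, which is exactly the one-shot semantics of the model.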

6 Estimating Influence under the Independent Cascade Model
For a seed set S_0 ⊆ V, f(S_0) = E[|S_{n-1}| | S_0], the expected number of infected nodes at the end of the process. Monte Carlo sampling (Kempe et al. 2003): sample poly(n) instances of the process and take the mean number of infected nodes. Q1: How much local information (e.g., neighborhoods) do we need to consider? Q2: How scalable is this? Samples may require a lot of space (high influence), and we would like to run in parallel on multiple machines.
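The Monte Carlo estimator is just a sample mean; a minimal sketch, where `sample_cascade_size` stands in for one run of the cascade process (the function names and the toy sampler are illustrative, not from the paper):

```python
import random

def monte_carlo_influence(sample_cascade_size, num_samples):
    """Estimate f(S_0) as the mean cascade size over independent samples.

    sample_cascade_size: zero-argument sampler returning |S_{n-1}| for one run.
    """
    total = sum(sample_cascade_size() for _ in range(num_samples))
    return total / num_samples

# Hypothetical toy sampler: the seed is always infected, plus one extra
# node with probability 1/2, so the true influence is 1.5.
rng = random.Random(42)
estimate = monte_carlo_influence(lambda: 1 + (rng.random() < 0.5), 20000)
```

With 20,000 samples the estimate concentrates tightly around 1.5; the space issue raised in Q2 is that each individual sample may itself touch a large fraction of the graph.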

7 Q1: What is the complexity of Contagion?
Link-server model: we have access to a server; for u ∈ V, a query Q(u) returns the out-neighborhood of u with the respective edge probabilities. The information need of estimation: how many queries do we need for a constant-factor approximation of f(S_0)? Theorem (informal): obtaining a (1 ± 1/3)-approximation (or better) of f(S_0) with probability > 1/2 requires Ω(n) queries. Implication: in the worst case, we need knowledge of a large portion of the graph.

8 The Algorithmic Framework: the MRC Model
We know we may need Ω(n) queries in total, which is a lot for massive networks; perhaps we can achieve scalability via parallelization. For this, we turn to the MapReduce model. The parallel computation paradigm: the MRC model (Karloff et al. 2010). Synchronous rounds of computation on N ⟨key; value⟩ tuples. In every round: Map: apply a local transformation to tuples in a streaming fashion. Reduce: do a polynomial-time computation on the aggregate of tuples sharing the same key. MRC model constraints (N = number of input tuples, ε > 0 a constant): N^(1-ε) machines, N^(1-ε) space per machine, and up to log^c N rounds for some c > 0.
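One synchronous map/shuffle/reduce round can be sketched as a plain function; this toy simulator (not an MRC implementation, just an illustration of the round structure) is shown counting out-degrees from edge tuples:

```python
from collections import defaultdict

def mrc_round(tuples, map_fn, reduce_fn):
    """One synchronous MRC round: map each tuple, shuffle by key, reduce groups.

    map_fn: tuple -> iterable of (key, value) pairs (local, streaming).
    reduce_fn: (key, [values]) -> iterable of output tuples (polytime per key).
    """
    shuffled = defaultdict(list)
    for t in tuples:
        for key, value in map_fn(t):
            shuffled[key].append(value)
    out = []
    for key, values in shuffled.items():
        out.extend(reduce_fn(key, values))
    return out

# Example round: count out-degree of each node from a list of edges.
edges = [('a', 'b'), ('a', 'c'), ('b', 'c')]
degrees = mrc_round(edges,
                    map_fn=lambda e: [(e[0], 1)],
                    reduce_fn=lambda k, vs: [(k, sum(vs))])
```

In the real model the shuffled groups live on separate machines, each holding at most N^(1-ε) tuples; the sketch collapses that into a single dict.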

9 Q2: Scalability of Influence Estimation for Massive Graphs
Approach: repeatedly sample the process, but now run many samples in parallel (recall: at most poly(n) samples are needed in total, and we can estimate f(S_0) by sampling poly(n) instances of the process). Challenge 1: a single sample may require Ω(n) memory, too big for MRC. Challenge 2: many samples may be required, too many for MRC. Synchronous-round approach: multiple samples per machine, over multiple rounds. How can we assign samples to machines and avoid consuming too much memory when samples are too large?

10 An MRC Algorithm for Influence Estimation
Goal: design an efficient MRC algorithm for approximating f(S_0) with high confidence. Theorem: we can compute a (1 + 8ε)-approximation in polylog(n) rounds. Approach: take a modular approach. Layer 1: sampling bounded instances. Take L samples, and cap each infection at t nodes; this yields a sample oracle that, for (L, t), returns π_t(L), the fraction of the L samples that reached ≥ t nodes. Layer 2: approximate the integral of the infection-size tail distribution 1 − F(I): for a guess τ, use the capped samples to determine whether the true influence is greater than τ. Layer 3: iterate over logarithmically many guesses in descending order, and stop when we verify that the true influence is greater than the guess.

11 Layer 1: Bounded Parallel Sampling
In parallel, for parameters (L, t): take L samples of the influence process, terminating a sample once t nodes are reached. Perform multiple bounded BFS computations, one layer at a time. Map: node-level infections. Reduce: aggregate the results. The number of rounds is linear in the diameter of the graph. Challenge: handling the case where the infection jumps from fewer than t nodes to many more than t in a single round. An alternative reachability algorithm can be plugged in.
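A single capped sample can be sketched as a layer-by-layer BFS that stops once t nodes are reached; this sequential sketch (an illustration, with hypothetical names, not the distributed map/reduce version) reports the capped size and whether the cap was hit:

```python
import random

def bounded_cascade_sample(graph, seeds, cap, rng):
    """One Independent Cascade sample, run one BFS layer per round and
    terminated early once `cap` nodes have been reached.

    Returns (capped size, True if the cap was reached).
    """
    infected = set(seeds)
    frontier = set(seeds)
    while frontier and len(infected) < cap:
        next_frontier = set()
        for u in frontier:
            for v, p_uv in graph.get(u, []):
                if v not in infected and rng.random() < p_uv:
                    next_frontier.add(v)
        infected |= next_frontier
        frontier = next_frontier
    # A single round may push the infection well past the cap; reporting
    # min(size, cap) is one simple way to keep the output bounded.
    return min(len(infected), cap), len(infected) >= cap
```

The early-exit check is what keeps per-sample memory at O(t) instead of Ω(n).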

12 Layer 2: Verify a Guess for 𝑓( 𝑆 0 )
For a guess τ, verify whether τ is close enough to f(S_0): if ∫_τ^n Pr[I ≥ t] dt ≥ τ, return τ. Approximate the integral using a Riemann sum with log n rectangles, using the previous (Layer 1) procedure to approximate the rectangle heights. Useful: we can show that higher influence values require fewer samples, giving savings in space complexity.
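The verification test can be sketched as follows, assuming the "log n rectangles" are geometrically spaced thresholds (my assumption about the spacing) and abstracting the Layer 1 sampler into a `tail_prob(t)` oracle that estimates Pr[I ≥ t]:

```python
def verify_guess(tau, n, tail_prob):
    """Test whether the guess tau is supported by the tail integral,
    i.e. whether the Riemann-sum estimate of ∫_tau^n Pr[I >= t] dt is >= tau.

    tail_prob: hypothetical oracle t -> estimate of Pr[I >= t]
    (in the algorithm this comes from the capped Layer 1 samples).
    """
    total = 0.0
    t = tau
    while t < n:  # thresholds tau, 2*tau, 4*tau, ... : O(log n) rectangles
        width = min(2 * t, n) - t
        total += width * tail_prob(t)  # rectangle height from capped samples
        t *= 2
    return total >= tau
```

Using the capped-sample fraction π_t(L) as `tail_prob(t)` is what ties this layer back to Layer 1.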

13 Top Layer: InfEst – Top-Down Iteration over Guesses
Iterate over log n guesses. At each iteration: scale down τ, then run VerifyGuess(τ); if it returns True, return τ, a value close enough to f(S_0). Total: polylog(n) rounds, with a sublinear number of tuples per machine throughout. Approximation ratio: 1 + 8ε, w.h.p.
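The top-down loop is simple once VerifyGuess is abstracted away; a minimal sketch (the down-scaling factor 1 − ε is an illustrative choice):

```python
def inf_est(n, verify_guess, eps=0.25):
    """Start from the largest possible influence n and geometrically scale
    the guess down until verify_guess accepts it: O(log n) iterations."""
    tau = float(n)
    while tau >= 1:
        if verify_guess(tau):
            return tau  # tau is certified as close to f(S_0)
        tau *= (1 - eps)
    return 1.0  # influence is at least |S_0| >= 1
```

Because the guesses descend, the first accepted τ is within one scaling factor of the true influence, which is where the (1 + O(ε)) ratio comes from.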

14 Empirical Testing: Running Time
Benchmark: the standard Monte Carlo algorithm, which samples the influence I Ω(n log n) times. Approximate Monte Carlo: take a linear number of samples. Results: InfEst scales well; for larger seed sets, Monte Carlo takes longer.

15 Empirical Testing: Approximation
We measured the approximation ratio of InfEst for different scaling factors (which determine the rate of τ decrease and the number of rectangles in the Riemann sum). Recall: ε controls the maximum error, not the average error.

16 Summary
Influence estimation: polynomial-time computation in theory; in practice, sampling is very time- and space-consuming. Recent parallel computing frameworks offer a way to alleviate these issues. We designed an algorithmic approach for estimating influence in MapReduce fashion that scales to massive networks. Next steps: Other influence models may require different approaches. Influence maximization: how to pick S_0 so as to maximize f(S_0); the hope is to use our approach as a stepping stone.

17 THANK YOU

