Influence at Scale: Distributed Computation of Complex Contagion in Networks Joel Oren, University of Toronto Joint work with Yaron Singer (Harvard) & Brendan Lucier (MSR)

Efficient Influence Estimation in Massive Social Networks
Given a model of influence spread, how do we compute the influence of a group of "initial adopters"?
Common approaches: polynomial-time methods such as Monte Carlo (sampling) estimation (Kempe et al. 2003) and elementary matrix-product methods (Even-Dar & Shapira 2011).
This talk: when scaling to massive social networks, are these approaches satisfactory?
Processes of influence spread have become increasingly visible, and the wealth of available data allows for more research than ever before: viral ad campaigns, rumor spread, political movements, etc.

Influence Estimation for Massive Social Networks
General question: what is the information cost of theoretical methods for estimating properties of large graphs?
How much of the graph do we need to examine (query complexity)? How much information do we need to store throughout the execution?
Can we get provably efficient estimation using distributed methods? How applicable is the MapReduce paradigm?

Influence Estimation: Context and Related Work
Analysis of massive graphs: single-machine systems such as Cassovary (Twitter), Ligra, and SNAP; distributed systems such as Hadoop/MapReduce, Spark/GraphX, Giraph, and GraphLab.
Ongoing work: efficient algorithms that adhere to a computational paradigm inspired by MapReduce (Karloff et al. 2010), e.g., connectivity, MSTs, triangle counting, edge covers, matching, and more.
Influence diffusion: models have been proposed since the 1960s. Kempe et al. (2003, 2005) studied the algorithmic problem of selecting the k most influential nodes. Influence can be estimated with poly(n) samples.

The Independent Cascade Model
Introduced by Kempe et al. (2003).
The model: the input is an edge-weighted directed graph G = (V, E, p) and a set of initially infected seeds S_0 ⊆ V.
Write S_t for the set of infected nodes at step t.
At every synchronous step t > 0, every newly infected node u ∈ S_{t-1} infects each out-neighbor v independently with probability p_uv.
[Figure: a small example graph with edge probabilities p_uv and seed set S_0.]
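To make the process concrete, here is a minimal Python sketch of one realization of the Independent Cascade process; the graph, probabilities, and function names are illustrative, not taken from the slides.

```python
import random
from collections import deque

def simulate_cascade(out_neighbors, seeds, rng=random):
    """Run one realization of the Independent Cascade process.

    out_neighbors: dict mapping node u -> list of (v, p_uv) pairs.
    seeds: the initially infected set S_0.
    Returns the set of nodes infected when the process stops.
    """
    infected = set(seeds)
    frontier = deque(seeds)
    while frontier:
        u = frontier.popleft()
        # Each newly infected node gets one chance to infect each out-neighbor.
        for v, p_uv in out_neighbors.get(u, []):
            if v not in infected and rng.random() < p_uv:
                infected.add(v)
                frontier.append(v)
    return infected

# Toy (hypothetical) graph: x -> a, x -> b, b -> c, b -> d, d -> a.
graph = {
    "x": [("a", 0.3), ("b", 0.6)],
    "b": [("c", 0.5), ("d", 0.7)],
    "d": [("a", 0.4)],
}
print(simulate_cascade(graph, {"x"}))
```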

Estimating Influence under the Independent Cascade Model
For a seed set S_0 ⊆ V, define f(S_0) = E[|S_{n-1}| | S_0], the expected number of infected nodes at the end of the process.
Monte Carlo sampling (Kempe et al. 2003): sample poly(n) instances of the process and take the mean number of infected nodes.
Q1: How much local information (e.g., neighborhoods) do we need to consider?
Q2: How scalable is this? Individual samples may require a lot of space (when influence is high); can we run samples in parallel on multiple machines?
[Figure: several sampled realizations of the cascade on the example graph.]
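A hedged sketch of this Monte Carlo estimator (the graph, sample count, and helper names are illustrative): average the final number of infected nodes over independent simulated cascades.

```python
import random
from collections import deque

def simulate_cascade(out_neighbors, seeds, rng):
    """One Independent Cascade realization; returns the final infected set."""
    infected, frontier = set(seeds), deque(seeds)
    while frontier:
        u = frontier.popleft()
        for v, p_uv in out_neighbors.get(u, []):
            if v not in infected and rng.random() < p_uv:
                infected.add(v)
                frontier.append(v)
    return infected

def monte_carlo_influence(out_neighbors, seeds, num_samples, seed=0):
    """Estimate f(S_0) = E[|S_{n-1}|] by averaging over independent cascades."""
    rng = random.Random(seed)
    total = 0
    for _ in range(num_samples):
        total += len(simulate_cascade(out_neighbors, seeds, rng))
    return total / num_samples

graph = {
    "x": [("a", 0.3), ("b", 0.6)],
    "b": [("c", 0.5), ("d", 0.7)],
    "d": [("a", 0.4)],
}
print(monte_carlo_influence(graph, {"x"}, num_samples=10_000))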

Q1: What is the Query Complexity of Contagion?
Link-server model: we have access to a server; for u ∈ V, the query Q(u) returns the out-neighborhood of u with the respective edge probabilities.
The information need of estimation: how many queries do we need in order to obtain a constant-factor approximation of f(S_0)?
Theorem (informal): obtaining a (1 ± 1/3)-approximation (or better) of f(S_0) with probability > 1/2 requires Ω(n) queries.
Implication: in the worst case, we need knowledge of a large portion of the graph.
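A minimal sketch of what such a link-server interface might look like (the class and method names are hypothetical, not from the slides); it answers Q(u) and counts how many distinct nodes have been queried, which is the quantity the lower bound refers to.

```python
class LinkServer:
    """Hypothetical link-server: Q(u) returns the weighted out-neighborhood of u."""

    def __init__(self, out_neighbors):
        self._out = out_neighbors          # dict: u -> list of (v, p_uv)
        self._queried = set()              # distinct nodes queried so far

    def query(self, u):
        self._queried.add(u)
        return list(self._out.get(u, []))

    @property
    def num_queries(self):
        return len(self._queried)

server = LinkServer({"x": [("a", 0.3), ("b", 0.6)], "b": [("c", 0.5)]})
print(server.query("x"), server.num_queries)   # [('a', 0.3), ('b', 0.6)] 1
```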

The Algorithmic Framework: the MRC Model
We know we may need n queries in total, which is a lot for massive networks; perhaps we can achieve scalability via parallelization. For this, we turn to the MapReduce model.
The parallel computation paradigm: the MRC model (Karloff et al. 2010), synchronous rounds of computation over N key-value tuples ⟨k; v⟩. In every round:
Map: apply a local transformation to tuples in a streaming fashion.
Reduce: perform polynomial-time computation on the aggregate of all tuples sharing the same key.
MRC model constraints (N = number of input tuples, ε > 0 a constant): N^{1-ε} machines, N^{1-ε} space per machine, and up to log^c N rounds for some c > 0.
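A toy, single-machine simulation of one MRC round (the helper names are mine, not part of the model): map each tuple locally, group by key, and reduce each group. The example counts infected nodes per sample id, mirroring the map and reduce steps used later in Layer 1.

```python
from collections import defaultdict

def mapreduce_round(tuples, map_fn, reduce_fn):
    """One synchronous round: map each (k, v) tuple in a streaming fashion,
    group the emitted tuples by key, and apply the reducer to each group."""
    grouped = defaultdict(list)
    for k, v in tuples:
        for k2, v2 in map_fn(k, v):
            grouped[k2].append(v2)
    out = []
    for k, values in grouped.items():
        out.extend(reduce_fn(k, values))
    return out

# Toy use: input tuples are (sample_id, infected_node); output is (sample_id, count).
infections = [(0, "a"), (0, "b"), (1, "a"), (1, "c"), (1, "d")]
counts = mapreduce_round(
    infections,
    map_fn=lambda sample_id, node: [(sample_id, 1)],
    reduce_fn=lambda sample_id, ones: [(sample_id, sum(ones))],
)
print(counts)  # [(0, 2), (1, 3)]
```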

Q2: Scalability of Influence Estimation for Massive Graphs
Approach: repeatedly sample the process as before, but now run many samples in parallel (recall that poly(n) samples suffice to estimate f(S_0)).
Challenge 1: a single sample may require Ω(n) memory, too much for a single MRC machine.
Challenge 2: many samples may be required, too many for MRC.
Synchronous-round approach: multiple samples per machine, over multiple rounds. How can we assign samples to machines and avoid consuming too much memory when individual samples grow too large?
[Figure: a hard example graph with edge probabilities Θ(1/(n-1)).]

An MRC Algorithm for Influence Estimation
Goal: design an efficient MRC algorithm for approximating f(S_0) with high confidence.
Theorem: we can compute a (1 + 8ε)-approximation in polylog(n) rounds.
Approach: a modular, three-layer design.
Layer 1: sample bounded instances. Take L samples and cap each infection at t nodes.
Layer 2: approximate the integral of the infection distribution's tail: for a guess τ, use the capped samples to determine whether the true influence exceeds τ.
Layer 3: iterate over logarithmically many guesses in descending order; stop when we verify that the true influence exceeds the current guess.
[Figure: the tail 1 − F(I) of the infection distribution; a sample oracle reports π_t(L), the fraction of L samples that reached ≥ t nodes.]
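Layer 2 rests on writing the expected influence as the integral of the tail of the infection distribution, i.e., f(S_0) = E[I] = Σ_{t≥1} Pr[I ≥ t] for an integer-valued I. A quick numerical sanity check of this identity on an arbitrary toy distribution (the distribution itself is not from the slides):

```python
import random

# Tail-sum identity: E[I] = sum_{t>=1} Pr[I >= t] for nonnegative integer I.
rng = random.Random(0)
samples = [rng.randint(1, 10) + rng.randint(0, 5) for _ in range(10_000)]
n_samples = len(samples)

mean = sum(samples) / n_samples
tail_sum = sum(
    sum(1 for s in samples if s >= t) / n_samples
    for t in range(1, max(samples) + 1)
)
print(mean, tail_sum)  # the two quantities coincide (up to floating-point error)
```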

Layer 1: Bounded Parallel Sampling
In parallel, for a pair (L, t): take L samples of the influence process and terminate a sample once t nodes are reached.
Perform multiple bounded BFS traversals, one layer at a time; the number of rounds is linear in the diameter of the graph.
Map: node-level infections. Reduce: aggregate the results.
Challenge: handling the case where the influence jumps from fewer than t to much more than t nodes in a single round.
An alternative reachability algorithm can be plugged in.
[Figure: per-sample infected sets S_t^(i), S_t^(j) advanced in parallel.]
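A single-machine sketch of the per-sample work in Layer 1, under the assumption that each BFS layer would correspond to one map/reduce round in MRC (the graph and names are illustrative):

```python
import random

def capped_cascade(out_neighbors, seeds, cap, rng):
    """One Independent Cascade sample, advanced layer by layer (bounded BFS) and
    terminated once at least `cap` nodes are infected.
    Returns (number of infected nodes, whether the cap was reached)."""
    infected = set(seeds)
    frontier = list(seeds)
    while frontier and len(infected) < cap:
        next_frontier = []
        for u in frontier:                       # map: node-level infections
            for v, p_uv in out_neighbors.get(u, []):
                if v not in infected and rng.random() < p_uv:
                    infected.add(v)
                    next_frontier.append(v)
        frontier = next_frontier                 # reduce: aggregate the new layer
        # Within a single layer the infection can jump well past `cap`;
        # this is the challenge mentioned on the slide.
    return len(infected), len(infected) >= cap

graph = {"x": [("a", 0.3), ("b", 0.6)], "b": [("c", 0.5), ("d", 0.7)], "d": [("a", 0.4)]}
print(capped_cascade(graph, {"x"}, cap=3, rng=random.Random(1)))
```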

Layer 2: Verify a Guess for f(S_0)
For a guess τ, verify whether τ is close enough to f(S_0): if ∫_τ^n Pr[I ≥ t] dt ≥ τ, return τ.
Approximate the integral by a Riemann sum of log n rectangles, using the previous procedure (Layer 1) to approximate the height of each rectangle.
Useful fact: higher influence values require fewer samples, which yields savings in space complexity.
[Figure: Riemann-sum rectangles under the tail curve; heights are obtained via the Layer 1 procedure.]
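A rough sketch of this verification step under stated assumptions (the helper names, threshold spacing, and endpoint choice are my guesses, not the paper's exact construction): `frac_reaching(t, L)` is assumed to return the fraction of L capped samples from Layer 1 that infected at least t nodes, i.e., an estimate of Pr[I ≥ t].

```python
import math

def verify_guess(tau, n, frac_reaching, num_samples):
    """Approximate the integral of Pr[I >= t] over [tau, n] by a Riemann sum
    with O(log n) geometrically spaced thresholds, and accept the guess tau
    if the estimate is at least tau. Requires 1 <= tau <= n."""
    num_rects = max(1, math.ceil(math.log2(n)))
    ratio = (n / tau) ** (1.0 / num_rects)        # geometric spacing of thresholds
    estimate, left = 0.0, tau
    for i in range(1, num_rects + 1):
        right = tau * ratio ** i
        # Pr[I >= t] is non-increasing, so using the right endpoint yields a
        # conservative (lower) estimate of the integral on [left, right].
        estimate += (right - left) * frac_reaching(math.ceil(right), num_samples)
        left = right
    return estimate >= tau, estimate
```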

Top Layer: InfEst – Top-Down Iteration over Guesses
Iterate over log n guesses. At each iteration: scale down τ and call VerifyGuess(τ); if it returns True, return τ, a value close enough to f(S_0).
Totals: polylog(n) rounds; a sublinear number of tuples per machine throughout; approximation ratio 1 + 8ε with high probability.
[Figure: flowchart of InfEst, repeatedly scaling down τ and calling VerifyGuess(τ) until it returns True.]
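A minimal sketch of the top-level loop, assuming a `verify_guess(tau)` callable that returns True when τ is (approximately) a lower bound on f(S_0); the starting point and scaling factor are illustrative choices.

```python
def inf_est(n, verify_guess, epsilon=0.1):
    """Start from the largest possible influence (tau = n) and geometrically
    scale the guess down until the Layer 2 check accepts it."""
    tau = float(n)
    while tau >= 1:
        if verify_guess(tau):
            return tau              # tau is within the target factor of f(S_0)
        tau /= (1 + epsilon)        # O(log_{1+eps} n) iterations in total
    return 1.0                      # f(S_0) >= |S_0| >= 1, so fall back to 1
```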

Empirical Testing: Running Time
Benchmarks: the standard Monte Carlo algorithm, which samples the influence I Ω(n log n) times, and approximate Monte Carlo, which takes a linear number of samples.
Results: InfEst scales well; with larger seed sets, Monte Carlo takes longer.
[Plot: running times of MC, Approx-MC, and InfEst.]

Empirical Testing: Approximation
We measured the approximation ratio of InfEst for different scaling factors (which determine how fast τ decreases and how many rectangles are used in the Riemann sum).
Recall: ε controls the maximum error, not the average error.

Summary
Influence estimation is polynomial-time computable in theory; in practice, sampling is very time- and space-consuming. Recent parallel computing frameworks offer a way to alleviate these issues. We designed an algorithmic approach for estimating influence in MapReduce fashion that scales to massive networks.
Next steps: other influence models, which may require different approaches; and influence maximization, i.e., how to pick S_0 so as to maximize f(S_0). The hope is to use our approach as a stepping stone.

THANK YOU