B. Aditya Prakash Computer Science Virginia Tech.

Presentation transcript:

Understanding, Predicting and Managing Behaviors using Propagation: From Flu-trends to Cyber-Security. B. Aditya Prakash, Computer Science, Virginia Tech. Fidelis Cybersecurity, Sept 26, 2016

Thanks! Abhishek Sharma Prakash 2016

Networks are everywhere! Facebook Network [2010]; Gene Regulatory Network [Decourty 2008]; Human Disease Network [Barabasi 2007]; The Internet [2005].

Dynamical processes over networks are also everywhere!

Why do we care? Social collaboration; information diffusion; viral marketing; epidemiology and public health; cyber security; human mobility; games and virtual worlds; ecology; ...

Why do we care? (1: Epidemiology) Dynamical processes over networks: diseases over contact networks (SI model). CDC data: visualization of the first 35 tuberculosis (TB) patients and their 1039 contacts [AJPH 2007].

Why do we care? (1: Epidemiology) Dynamical processes over networks: each circle is a hospital; ~3000 hospitals, more than 30,000 patients transferred [US-MEDICARE NETWORK 2005]. Problem: given k units of disinfectant, whom to immunize?

Why do we care? (1: Epidemiology) Current practice vs. our method: ~6x fewer! [US-MEDICARE NETWORK 2005]. Hospital-acquired infections took 99K+ lives and cost $5B+ (all per year).

Why do we care? (2: Online Diffusion) > 800m users, ~$1B revenue [WSJ 2010]; ~100m active users; > 50m users.

Why do we care? (2: Online Diffusion) Dynamical processes over networks: social media marketing (celebrity and followers: "Buy Versace™!").

Why do we care? (3: To change the world?) Dynamical processes over networks: social networks and collaborative action.

High Impact – Multiple Settings: epidemic outbreaks; products/viruses; transmitting s/w patches. Q. How to squash rumors faster? Q. How do opinions spread? Q. How to market better?

Research Theme. DATA: large real-world networks & processes. ANALYSIS: understanding. POLICY/ACTION: managing/utilizing.

Research Theme – Public Health. DATA / Modeling: # patient transfers. ANALYSIS: will an epidemic happen? POLICY/ACTION: how to control outbreaks?

Research Theme – Social Media. DATA / Modeling: tweets spreading. ANALYSIS: # cascades in future? POLICY/ACTION: how to market better?

In this talk. DATA: large real-world networks & processes. Q1: How to predict flu trends better? Q2: How does information evolve over time?

In this talk. DATA: large real-world networks & processes. Q3: How do malware attacks evolve over time?

Outline: Motivation; Part 1: Learning Models (Empirical Studies); Part 2: Policy and Action (Algorithms); Conclusion. (Single virus vs. multiple viruses.)

Part 1: Learning Models (Empirical Studies). Q1: How to predict flu trends better? Q2: How does information evolve over time? Q3: How do malware attacks evolve over time? (Single virus vs. multiple viruses.)

Surveillance: How to estimate and predict flu trends? [Chen et al. ICDM 2014] Population survey; hospital record; lab survey; surveillance report.

GFT & Twitter: Estimate flu trends using online electronic sources. Example tweets: "So cold today, I’m catching cold." "I have headache, sore throat, I can’t go to school today." "My nose is totally congested, I have a hard time understanding what I’m saying."

Observation 1: States. There are different states in an infection cycle. SEIR model: 1. Susceptible; 2. Exposed; 3. Infected; 4. Recovered.

Observation 2: Epidemiology & Social Media Gap. Infection cases drop exponentially in epidemiology (Hethcote 2000), whereas keyword mentions drop in a power-law pattern in social media (Matsubara 2012).
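To make the gap concrete, here is a minimal sketch comparing the two decay shapes. The rate 0.5 and the exponent 1.5 are illustrative assumptions, not values fitted in the cited papers. The point is that a power law shrinks by the same factor every time t doubles, while an exponential shrinks faster and faster, so the power-law tail stays much heavier.

```python
import math

def exponential_decay(t, rate=0.5):
    """Exponential drop, exp(-rate * t); rate is an illustrative assumption."""
    return math.exp(-rate * t)

def power_law_decay(t, exponent=1.5):
    """Power-law drop, t ** (-exponent); exponent is an illustrative assumption."""
    return t ** (-exponent)

for t in (1, 2, 4, 8, 16, 32):
    print(t, exponential_decay(t), power_law_decay(t))

# Ratio of the value at 2t to the value at t.
doubling_points = (1, 2, 4, 8)
exp_ratios = [exponential_decay(2 * t) / exponential_decay(t) for t in doubling_points]
pl_ratios = [power_law_decay(2 * t) / power_law_decay(t) for t in doubling_points]
print("exponential ratios:", exp_ratios)   # keeps shrinking
print("power-law ratios:  ", pl_ratios)    # constant
```

On a log-log plot the power law is a straight line, which is why a fixed slope is the usual way to describe it.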

HFSTM Model (Details): Hidden Flu-State from Tweet Model (HFSTM). Each word (w) in a tweet (Oi) can be generated by: a background topic; non-flu related topics; state related topics. Model components: latent state; initial prob.; transit. prob.; transit. switch; binary non-flu related switch; binary background switch; word distribution.

HFSTM Model: Generating tweets. Generate the state for a tweet, then generate the topic for each word. State: [S, E, I]. Topic: [Background, Non-flu, State]. Examples: S: "This restaurant is really good". E: "The movie was [...] but it [...] freezing". I: "I think I have flu".

Inference (Details): EM-based algorithm: HFSTM-FIT. E-step: $A_t(i) = P(O_1, O_2, \ldots, O_t, S_t = i)$; $B_t(i) = P(O_{t+1}, \ldots, O_{T_u} \mid S_t = i)$; $\gamma_t(i) = P(S_t = i \mid O_u)$. M-step: the other parameters learned, such as state transition probabilities, topic distributions, etc.
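The E-step quantities above are the standard forward-backward recursions for a hidden Markov model. The sketch below computes them for a plain HMM, which is a simplifying assumption: in HFSTM the per-tweet likelihood comes from the topic mixture, whereas here `emit_lik[t][i]` is just a hypothetical precomputed value of P(O_t | S_t = i). All numbers are made up for illustration, and no rescaling is done, so it is only suitable for short sequences.

```python
def forward_backward(init, trans, emit_lik):
    """Return (A, B, gamma) for a plain HMM.

    init[i]        : P(S_1 = i)
    trans[i][j]    : P(S_{t+1} = j | S_t = i)
    emit_lik[t][i] : P(O_t | S_t = i), assumed precomputed
    """
    T, K = len(emit_lik), len(init)
    # Forward pass: A[t][i] = P(O_1..O_t, S_t = i)
    A = [[0.0] * K for _ in range(T)]
    for i in range(K):
        A[0][i] = init[i] * emit_lik[0][i]
    for t in range(1, T):
        for j in range(K):
            A[t][j] = emit_lik[t][j] * sum(A[t - 1][i] * trans[i][j] for i in range(K))
    # Backward pass: B[t][i] = P(O_{t+1}..O_T | S_t = i)
    B = [[1.0] * K for _ in range(T)]
    for t in range(T - 2, -1, -1):
        for i in range(K):
            B[t][i] = sum(trans[i][j] * emit_lik[t + 1][j] * B[t + 1][j] for j in range(K))
    # Posterior: gamma[t][i] = P(S_t = i | O_1..O_T)
    evidence = sum(A[T - 1])
    gamma = [[A[t][i] * B[t][i] / evidence for i in range(K)] for t in range(T)]
    return A, B, gamma

# Hypothetical 3-state (S, E, I) example with 4 tweets from one user.
init = [0.8, 0.15, 0.05]
trans = [[0.7, 0.25, 0.05], [0.1, 0.6, 0.3], [0.2, 0.1, 0.7]]
emit_lik = [[0.6, 0.3, 0.1], [0.3, 0.5, 0.2], [0.1, 0.3, 0.6], [0.1, 0.2, 0.7]]
A, B, gamma = forward_backward(init, trans, emit_lik)
for t, row in enumerate(gamma):
    print(t, [round(v, 3) for v in row])
```

The M-step would then re-estimate the transition probabilities and word distributions from these posteriors; that part depends on the full HFSTM specification and is not sketched here.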

A possible issue with HFSTM: it suffers from a large, noisy vocabulary. Semi-supervision for improvement: introduce weak supervision into HFSTM.

HFSTM-A(spect): Introduce an aspect variable y, expressing our belief on whether a word is flu-related or not. The value of y biases the switch variables such that flu-related words are more likely to be explained by state topics. When the aspect value (y) is introduced, the switching probabilities are updated accordingly.

Vocabulary & Dataset. Vocabulary (230 words): flu-related keyword list by Chakraborty SDM 2014; extra state-related keyword list. Dataset (34,000 tweets): identify infected users and collect their tweets. Train on data from Jun 20, 2013 to Aug 06, 2013. Test on two time periods: Dec 01, 2012 to July 08, 2013, and Nov 10, 2013 to Jan 26, 2014.

Learned word distributions: the most probable words learned in each state. S: probably healthy. E: having symptoms. I: definitely sick.

Learned state transition: transition probabilities learned by HFSTM. Transitions in real tweets: not directly flu-related, yet correctly identified.

Flu trend fitting. Ground-truth: The Pan American Health Organization (PAHO). Algorithms: Baseline: count the number of keywords weekly as features, and regress to the ground-truth curve. Google Flu Trends: take the Google Flu Trends data as input, and regress to the PAHO curve. HFSTM: distinguish the different states of keywords, and use only the number of keywords in the I state; again regress to PAHO.

Flu trend fitting: linear regression to the case count reported by PAHO (the ground-truth).
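As a rough illustration of the regression step, the sketch below fits an ordinary least-squares line from weekly I-state keyword counts to weekly case counts. The numbers are synthetic (the "case counts" are generated as an exact line so the fit can be checked by eye), and `fit_line` is a hypothetical helper, not code from the paper.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Synthetic data: weekly counts of I-state keywords (made up) and
# case counts generated as exactly 2 * count + 1 (also made up).
i_state_counts = [12, 30, 55, 41, 20]
case_counts = [2 * c + 1 for c in i_state_counts]

slope, intercept = fit_line(i_state_counts, case_counts)
predicted = [slope * c + intercept for c in i_state_counts]
print(slope, intercept)
```

The baseline and Google Flu Trends variants would use the same fit with a different input feature.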

HFSTM-A: Results are qualitatively similar to HFSTM when the vocabulary is 10 times larger.

Part 1: Learning Models (Empirical Studies). Q1: How to predict flu trends better? Q2: How does information evolve over time? Q3: How do malware attacks evolve over time? (Single virus vs. multiple viruses.)

Google Search Volume: e.g., given (1) the first spike, (2) the release dates of two sequel movies, and (3) the access volume before each release date (two weeks before release), what comes next?

Patterns (figure: Y plotted against X, built up over several slides): more data; anomaly; imputation; extrapolation; compression.

Rise and fall patterns in social media. Meme (# of mentions in blogs): short phrases sourced from U.S. politics in 2008, e.g., “you can put lipstick on a pig”, “yes we can”.

Rise and fall patterns in social media. Can we find a unifying model which includes these patterns? Four classes on YouTube [Crane et al. ’08]; six classes on Meme [Yang et al. ’11].

Rise and fall patterns in social media. Answer: YES! We can represent all patterns by a single model. In Matsubara, Sakurai, Prakash+ SIGKDD 2012.

Main idea - SpikeM. 1. Un-informed bloggers (uninformed about rumor), time n=0. 2. External shock at time n=nb (e.g., breaking news). 3. Infection (word-of-mouth), time n=nb+1. Infectiveness of a blog-post at age n: β, the strength of infection (quality of news), times a decay function (how infective a blog posting is), which is a power law.

-1.5 slope. J. G. Oliveira et al. Human Dynamics: The Correspondence Patterns of Darwin and Einstein. Nature 437, 1251 (2005). (Also in Leskovec, McGlohon+, SDM 2007.)

SpikeM - with periodicity (Details). Full equation of SpikeM, with periodicity: bloggers change their activity over time (e.g., daily, weekly, yearly). Example: peak activity at 12pm, low activity at 3am.
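The equation itself did not survive in this transcript, so the following is only a sketch of the SpikeM recurrence as I recall it from the SIGKDD 2012 paper; treat every formula in it as an assumption to be checked against the paper. It uses seven parameters (N, beta, n_b, S_b, eps, P_a, P_s) plus a fixed period P_p. The clamp on the number of newly informed bloggers is my own safety addition, and all parameter values in the example are made up.

```python
import math

def spikem(N, beta, n_b, S_b, eps, P_a, P_s, P_p, T):
    """Simulate T time ticks; return (dB, U).

    dB[n]: bloggers newly informed at tick n
    U[n] : bloggers still un-informed at tick n
    Assumed recurrence (from memory of the paper, not from the slides):
      dB(n+1) = p(n+1) * (U(n) * sum_{t=n_b..n} (dB(t) + S(t)) * f(n+1-t) + eps)
      f(age)  = beta * age ** -1.5          (power-law decay)
      p(n)    = 1 - 0.5 * P_a * (sin(2*pi/P_p * (n + P_s)) + 1)
      S(t)    = S_b at t == n_b, else 0     (external shock)
    """
    dB = [0.0] * T
    U = [0.0] * T
    U[0] = float(N)

    def f(age):
        return beta * age ** -1.5

    def p(n):
        return 1.0 - 0.5 * P_a * (math.sin(2 * math.pi / P_p * (n + P_s)) + 1.0)

    for n in range(T - 1):
        pressure = sum((dB[t] + (S_b if t == n_b else 0.0)) * f(n + 1 - t)
                       for t in range(n_b, n + 1))
        new = p(n + 1) * (U[n] * pressure + eps)
        new = min(max(new, 0.0), U[n])   # clamp (my addition): cannot inform more than remain
        dB[n + 1] = new
        U[n + 1] = U[n] - new
    return dB, U

# Made-up parameters: 1000 bloggers, shock of size 10 at tick 5, daily period of 24 ticks.
N, n_b = 1000, 5
dB, U = spikem(N=N, beta=0.001, n_b=n_b, S_b=10.0, eps=0.0,
               P_a=0.3, P_s=0, P_p=24, T=120)
peak = max(range(len(dB)), key=lambda n: dB[n])
print("peak tick:", peak, "total informed:", round(sum(dB), 1))
```

Setting `P_a=0` switches the periodicity off, which gives the basic rise-and-fall shape without the daily modulation.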

Tail-part forecasts: SpikeM can capture the tail part.

“What-if” forecasting: e.g., given (1) the first spike, (2) the release dates of two sequel movies, and (3) the access volume before each release date (two weeks before release), what comes next?

“What-if” forecasting: SpikeM can forecast not only the tail part, but also the rise part! SpikeM can forecast upcoming spikes, given (1) the first spike, (2) the release date, and (3) the volume two weeks before release.

Bonus: Protest Predictions [Sundareisan et al. ASONAM 2014] [Jin et al. SIGKDD 2014]. Can Twitter provide a lead time? South American Twitter dataset; language: Spanish/Portuguese. Idea: look for trending keywords, then predict the event type for a protest, Violent Protest (VP) or Non-Violent Protest (P), using SpikeM parameters! (Example shown: a political tweet.)

Part 1: Learning Models (Empirical Studies). Q1: How to predict flu trends better? Q2: How does information evolve over time? Q3: How do malware attacks evolve over time? (Single virus vs. multiple viruses.)

Modeling Malware Penetration. Worldwide Intelligence Network: which machine got which malware (or legitimate files). 1 billion nodes, 37 billion edges. Q: Temporal patterns?

Pointers: Book chapter. Graph Mining for Cyber Security. Prakash. In Cyber Warfare: Building the Scientific Foundation, Springer 2015. Latest results on using big-data graph mining for cyber security. http://link.springer.com/chapter/10.1007%2F978-3-319-14039-1_14

Book plug: The Global Cyber-Vulnerability Report. Subrahmanian, Ovelgonne, Dumitras, Prakash. Springer 2016. The result of analyzing two years of data from Symantec, comprising over 20 billion malware and telemetry reports from over 4 million machines per year. In addition, the report looks at the cybersecurity policies of all 44 countries and tries to identify important next steps that must be taken to mitigate cyber-threats. http://www.springer.com/us/book/9783319257587

WINE dataset

Cybersecurity: Popularity of files follows a power law.

Q: Temporal Patterns. Looks familiar? Exponential rise and power-law fall.

SpikeM again (or SharkFin) [Papalexakis et al. ASONAM 2013]. 7 parameters only! ~400 points.

Latent Propagation Patterns

BUT: Does not take into account the differences between detections and actual infections.

Domain-based approach: Data [Chan et al. WSDM 2016]. Looked at the entire 2 years of WINE data. Augmented with vulnerability and patch data from NIST’s National Vulnerability Database (NVD). Considered all machines from 40 countries (study still ongoing). Considered the 50 most commonly occurring malware.

Study Approach: Main Steps

Study Approach: Patch & Detection Incompetence. Incompetence: 4 base variables to measure hosts' incompetence in detecting malware and incompetence in patching (absolute and relative) w.r.t. various time periods: “How much time each host took in detecting or patching for each malware”. For each time tick, we built a directed bipartite graph capturing normalized detection/patching incompetence between malware and hosts.

FBP Model. Dependent variable: for each (c,m) pair, the % of hosts in country c attacked by malware m. Independent variables for each (c,m) pair: ADI, API, RDI, RPI, AADI, ARDI, AAPI, ARPI, ADA, RDA, APA and RPA of hosts in country c; APH and RPH of malware m; six similarity measures for hosts in two different countries; per capita GDP and HDI of countries. Found the k-nearest neighbors of each (c,m) pair according to the different similarity measures and used the features of those countries as well.

DIPS and DIPS-Exp Model. Infection rate: β(t). Patching rates: susceptible hosts, θ(t); detected hosts, δ(t). Developed an algorithm to learn the best parameters for the DIPS and DIPS-Exp models by minimizing error terms.

Learning DIPS parameters. Given the # of infections and detections at each time t in the learning period, find parameters Φ to minimize the sum of squared errors. Learning algorithm (two phases): (1) train parameters with the sum of all infections and detections; (2) train the subset of parameters for each target pair. ML technique: Levenberg-Marquardt (LM) algorithm.
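The DIPS equations are not in this transcript, so the sketch below shows only the fitting technique: a minimal Levenberg-Marquardt loop that minimizes a sum of squared errors. The model y(t) = a * exp(-b * t) is a stand-in chosen for illustration, not the DIPS model, and the data are synthetic. In practice a library routine such as SciPy's `scipy.optimize.least_squares` with `method="lm"` would be the usual choice; this hand-rolled version just makes the damping idea visible.

```python
import math

def sse(a, b, ts, ys):
    """Sum of squared errors for the stand-in model y = a * exp(-b * t)."""
    return sum((y - a * math.exp(-b * t)) ** 2 for t, y in zip(ts, ys))

def fit_lm(ts, ys, a, b, lam=1e-3, iters=200):
    """Minimal Levenberg-Marquardt for two parameters (a, b)."""
    err = sse(a, b, ts, ys)
    for _ in range(iters):
        # Accumulate J^T J and J^T r, where r is the residual vector.
        jaa = jab = jbb = ga = gb = 0.0
        for t, y in zip(ts, ys):
            e = math.exp(-b * t)
            r = y - a * e
            da = -e             # d r / d a
            db = a * t * e      # d r / d b
            jaa += da * da
            jab += da * db
            jbb += db * db
            ga += da * r
            gb += db * r
        # Damped normal equations: (J^T J + lam * diag(J^T J)) step = -J^T r
        m11 = jaa * (1.0 + lam)
        m22 = jbb * (1.0 + lam)
        det = m11 * m22 - jab * jab
        if det == 0.0:
            break
        step_a = (-ga * m22 + gb * jab) / det
        step_b = (-gb * m11 + ga * jab) / det
        new_err = sse(a + step_a, b + step_b, ts, ys)
        if new_err < err:       # accept the step and trust the model more
            a, b, err = a + step_a, b + step_b, new_err
            lam /= 10.0
        else:                   # reject the step and damp harder
            lam *= 10.0
    return a, b, err

# Synthetic, noise-free data generated from a = 5, b = 0.3.
ts = list(range(20))
ys = [5.0 * math.exp(-0.3 * t) for t in ts]
a_hat, b_hat, final_err = fit_lm(ts, ys, a=3.0, b=0.5)
print(a_hat, b_hat, final_err)
```

The two-phase scheme on the slide would correspond to running such a fit once on the aggregated counts, then again on a subset of the parameters for each target pair.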

Ensemble Models. Developed two kinds of models: feature-based prediction (FBP) and propagation-based prediction (DIPS). Ensemble models: based on FBP, add the output from DIPS models as additional features.

Experiments: Overall. We predict the infection ratios of hosts in each country for each malware. Test all country-malware pairs for the top 50 malware and top 40 GDP countries w.r.t. # of infections. NRMSE is important because infection ratios differ greatly across countries. FBP shows better performance than FUNNEL* w.r.t. all performance measures. DIPS shows better performance than FBP w.r.t. all performance measures. ESM0 is the best w.r.t. NRMSE. FBP + FUNNEL does not work. FUNNEL*: disease infection prediction model. MAE*: computed as |# of ground-truth infected hosts – expected # of infected hosts|.
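For reference, here is a small sketch of the two error measures named above. The slides do not say how NRMSE is normalized; dividing RMSE by the range of the ground truth is one common convention and is an assumption here. The sample values are made up.

```python
import math

def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def nrmse(y_true, y_pred):
    """RMSE divided by the range of the ground truth (assumed normalization)."""
    return rmse(y_true, y_pred) / (max(y_true) - min(y_true))

# Made-up infection ratios for three country-malware pairs.
y_true = [0.1, 0.2, 0.4]
y_pred = [0.1, 0.3, 0.3]
print(mae(y_true, y_pred), rmse(y_true, y_pred), nrmse(y_true, y_pred))
```

Normalizing makes errors comparable across countries whose infection ratios sit on very different scales, which is the reason the slide gives for emphasizing NRMSE.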

Experiments

Summary of Forecasting Experiments. FBP, DIPS and ESM showed better performance when there were lots of infection attempts. FBP showed reliable performance across the board. DIPS was very accurate when the infectiousness level was high. ESM combines the advantages of both FBP and DIPS and shows very accurate and reliable performance.

Extensions: Human Vulnerability (to malware attacks) Study [Ovelgonne et al. TIST 2016]. Identify behaviors of users that are correlated with the number of attacks on those users. Approach: find statistical proxies for human behavior, then correlate them with malware attacks. Some slides from: V S Subrahmanian.

Host behavior and Vulnerability

Results 1

Number of binaries vs. number of infections per host

Results 2: Software developers are the most vulnerable (8.1 vs. 3.3), even after discounting for the fact that many binaries may have been produced by them. All results are statistically significant with p < 0.001 (i.e., with > 99.9% confidence).

Also: Cyber-vulnerability. Characterize the CV of customers in different countries? Help governments/companies better ensure safer user behavior.

Data Preparation

Question: Which 5 countries have the highest rate of attacks per machine?

Similarly

World Cyber-Vulnerability Map

Europe

Average Attacks per Host

But in India…

Per capita GDP and attacks

Downloaded Binaries and Risk

50 most-common malware. Order? US, UK, S. Korea, India, China. S. Korea, India, and China may be doing a better job patching against the 50 most common types of malware. Or maybe in the US, people distribute patching effort across lots of malware components. Our model took patching behavior into account.

More in… The Global Cyber-Vulnerability Report. Subrahmanian, Ovelgonne, Dumitras, Prakash. Springer 2016. http://www.springer.com/us/book/9783319257587

Outline: Motivation; Part 1: Learning Models (Empirical Studies); Part 2: Policy and Action (Algorithms); Conclusion. (Single virus vs. multiple viruses.)

Alg 1: Immunization (= Interventions). Different flavors: pre-emptive; data-aware.

Pre-emptive: Vulnerability. The first eigenvalue λ1 (of the adjacency matrix) is sufficient for most diffusion models [Prakash et al. ICDM’12, selected for best papers]. λ1 is the epidemic threshold. Increasing λ1 means increasing vulnerability: from “Safe” to “Vulnerable” to “Deadly”.
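A minimal sketch of how λ1 can be computed and used: power iteration on a small adjacency matrix, followed by a below-threshold check. The check uses the SIS form of the threshold result, in which the infection dies out when (β/δ)·λ1 < 1 for infection rate β and recovery rate δ; other models have a different model-dependent constant. The graph and the rates are made up, and the shift by the identity matrix is an implementation choice of mine so that the iteration also converges on bipartite graphs.

```python
def leading_eigenvalue(adj, iters=200):
    """Largest eigenvalue of a symmetric 0/1 adjacency matrix by power iteration."""
    n = len(adj)
    x = [1.0] * n
    for _ in range(iters):
        # Multiply by (A + I); the shift avoids oscillation on bipartite graphs.
        y = [x[i] + sum(adj[i][j] * x[j] for j in range(n)) for i in range(n)]
        norm = sum(v * v for v in y) ** 0.5
        x = [v / norm for v in y]
    # Rayleigh quotient x^T A x (x has unit length).
    ax = [sum(adj[i][j] * x[j] for j in range(n)) for i in range(n)]
    return sum(a * b for a, b in zip(ax, x))

def star(n_leaves):
    """Adjacency matrix of a star: node 0 is the hub."""
    n = n_leaves + 1
    adj = [[0] * n for _ in range(n)]
    for leaf in range(1, n):
        adj[0][leaf] = adj[leaf][0] = 1
    return adj

graph = star(4)
lam1 = leading_eigenvalue(graph)          # sqrt(4) = 2 for a 4-leaf star

beta, delta = 0.1, 0.3                    # hypothetical SIS infection and recovery rates
below_threshold = (beta / delta) * lam1 < 1.0

# Immunizing (removing) the hub leaves isolated nodes.
without_hub = [row[1:] for row in graph[1:]]
lam1_after = leading_eigenvalue(without_hub)
print(lam1, below_threshold, lam1_after)
```

The last two lines hint at why immunization is framed as an eigenvalue problem: removing the hub drops λ1 from 2 to 0 in this toy graph.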

Goal: Decrease λ1 as much as possible. Node-based [Tong, P., + ICDM 2010]; edge-based [Tong, P., Eliassi-Rad+ CIKM 2012, Best Paper Award]; edge-manipulation [P., Adamic+ SDM 2013].

Latest results: first (provable) approximation algorithms for the edge-based problem [Saha, Adiga, P., Vullikanti SDM 2015]. O(log^2 n)-factor (can be improved to O(log n)), based on the idea of removing closed walks. Semi-definite programming, rounding-based: O(1) factor.

Data-aware Immunization [Zhang and Prakash, SDM 2014; Zhang and Prakash, TKDD 2015]. Given: graph and infected nodes. Find: ‘best’ nodes for immunization. Complexity: NP-hard; hard to approximate within an absolute error. DAVA-tree: optimal solution on the tree. DAVA and DAVA-fast: merge the infected nodes, build a “dominator tree”, and run DAVA-tree. Running time: subquadratic. DAVA: O(k(|E| + |V| log |V|)). DAVA-fast: O(|E| + |V| log |V|). (Figure: graph with infected nodes; dominator tree.)
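The "merge infected nodes, then build a dominator tree" step can be sketched as follows. This is only the graph construction, not DAVA itself: it ignores edge weights and the benefit computation that picks the k nodes, and it uses the simple iterative set-based dominator algorithm rather than whatever faster method the paper uses. The helper names, the super-node label `"I*"`, and the toy graph are all hypothetical, and the graph is assumed to be connected.

```python
def merge_infected(adj, infected, super_node="I*"):
    """Collapse all infected nodes of an undirected graph into one super node."""
    merged = {super_node: set()}
    for u, nbrs in adj.items():
        mu = super_node if u in infected else u
        merged.setdefault(mu, set())
        for v in nbrs:
            mv = super_node if v in infected else v
            if mu != mv:
                merged[mu].add(mv)
    return merged

def dominators(succ, root):
    """dom[v] = set of nodes on every path from root to v (iterative fixpoint)."""
    nodes = set(succ)
    pred = {v: set() for v in nodes}
    for u, vs in succ.items():
        for v in vs:
            pred[v].add(u)
    dom = {v: set(nodes) for v in nodes}
    dom[root] = {root}
    changed = True
    while changed:
        changed = False
        for v in nodes - {root}:
            preds = [dom[p] for p in pred[v]]
            new = {v} | (set.intersection(*preds) if preds else set())
            if new != dom[v]:
                dom[v] = new
                changed = True
    return dom

def immediate_dominators(dom, root):
    """Parent of each node in the dominator tree: its closest strict dominator."""
    return {v: max(ds - {v}, key=lambda d: len(dom[d]))
            for v, ds in dom.items() if v != root}

# Toy undirected graph: i1 and i2 are infected.
edges = [("i1", "a"), ("i2", "a"), ("a", "b"), ("a", "c"),
         ("b", "d"), ("c", "d"), ("i1", "e")]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

merged = merge_infected(adj, infected={"i1", "i2"})
dom = dominators(merged, "I*")
idom = immediate_dominators(dom, "I*")
print(idom)
```

In this toy graph every path from the infected super node to b, c, or d passes through a, so a is their parent in the dominator tree; that is the kind of node a tree-based immunization step would consider first.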

Extensions: Can be extended to uncertain and noisy initial data as well! [Zhang and Prakash, CIKM 2014] Example: Twitter Firehose API, 1% sample.

Alg 2: “Zoom-out” of the network. “Zoom out” of the cascade graph to get a quick picture (= summarization). Coarsening: from a big graph to a smaller representation of the network [Purohit, Prakash, et al. SIGKDD 2014]. (Figure: example graph with nodes A to F, before and after zoom-out.)

Application: Diffusion observation. Stats: 1891 groups; mean group size: 16.6; the largest group: 22061 nodes (roughly 40% of nodes). (See more results in the paper.) Observation 1: a very large fraction of movies propagate in a small number of groups. Observation 2: a multi-modal distribution.

And many others… Finding culprits [Prakash et al. 2012]; correcting for missing data in cascades [Sundareisan et al. 2015]; …

Outline: Motivation; Part 1: Learning Models (Empirical Studies); Part 2: Policy and Action (Algorithms); Conclusion. (Single virus vs. multiple viruses.)

Future Plans. DATA: large real-world networks & processes. ANALYSIS: understanding. POLICY/ACTION: managing.

Scalability – Big Data. Need scalable algorithms for datasets of unprecedented scale: high dimensionality and sample size! Need scalable algorithms for learning models and developing policy. Leverage parallel systems: Map-Reduce clusters (like Hadoop) for data-intensive jobs (more than 6000 machines); parallelized compute-intensive simulations (like Condor).

Uncertain Data in Cascade analysis (more implementable policies): correcting for missing data; designing more robust immunization policies. (Figure: original; nodes sampled off; culprits, and missing nodes filled in.) Zhang and Prakash. CIKM 2014. Sundareisan, Vreeken, Prakash. 2014.

Coarsening: How is it related to community structure? More applications, like visualization… Parallelization. (Figure: big graph with nodes A to F, before and after zoom-out.)

Summarization and Segmentation: Automatic segmentation? Segment flu cascades? …

References
Scalable Vaccine Distribution in Large Graphs given Uncertain Data (Yao Zhang and B. Aditya Prakash). In CIKM 2014.
Fast Influence-based Coarsening for Large Networks (Manish Purohit, B. Aditya Prakash, Chanhyun Kang, Yao Zhang and V. S. Subrahmanian). In SIGKDD 2014.
DAVA: Distributing Vaccines over Large Networks under Prior Information (Yao Zhang and B. Aditya Prakash). In SDM 2014.
Fractional Immunization on Networks (B. Aditya Prakash, Lada Adamic, Jack Iwashyna, Hanghang Tong, Christos Faloutsos). In SDM 2013.
Spotting Culprits in Epidemics: Who and How many? (B. Aditya Prakash, Jilles Vreeken, Christos Faloutsos). In ICDM 2012, Brussels (Invited to KAIS Journal Best Papers of ICDM).
Gelling, and Melting, Large Graphs through Edge Manipulation (Hanghang Tong, B. Aditya Prakash, Tina Eliassi-Rad, Michalis Faloutsos, Christos Faloutsos). In ACM CIKM 2012, Hawaii (Best Paper Award).
Rise and Fall Patterns of Information Diffusion: Model and Implications (Yasuko Matsubara, Yasushi Sakurai, B. Aditya Prakash, Lei Li, Christos Faloutsos). In SIGKDD 2012, Beijing.
Interacting Viruses on a Network: Can both survive? (Alex Beutel, B. Aditya Prakash, Roni Rosenfeld, Christos Faloutsos). In SIGKDD 2012, Beijing.
Winner-takes-all: Competing Viruses or Ideas on fair-play networks (B. Aditya Prakash, Alex Beutel, Roni Rosenfeld, Christos Faloutsos). In WWW 2012, Lyon.
Threshold Conditions for Arbitrary Cascade Models on Arbitrary Networks (B. Aditya Prakash, Deepayan Chakrabarti, Michalis Faloutsos, Nicholas Valler, Christos Faloutsos). In IEEE ICDM 2011, Vancouver (Invited to KAIS Journal Best Papers of ICDM).
Time Series Clustering: Complex is Simpler! (Lei Li, B. Aditya Prakash). In ICML 2011, Bellevue.
Epidemic Spreading on Mobile Ad Hoc Networks: Determining the Tipping Point (Nicholas Valler, B. Aditya Prakash, Hanghang Tong, Michalis Faloutsos and Christos Faloutsos). In IEEE NETWORKING 2011, Valencia, Spain.
Formalizing the BGP stability problem: patterns and a chaotic model (B. Aditya Prakash, Michalis Faloutsos and Christos Faloutsos). In IEEE INFOCOM NetSciCom Workshop, 2011.
On the Vulnerability of Large Graphs (Hanghang Tong, B. Aditya Prakash, Tina Eliassi-Rad and Christos Faloutsos). In IEEE ICDM 2010, Sydney, Australia.
Virus Propagation on Time-Varying Networks: Theory and Immunization Algorithms (B. Aditya Prakash, Hanghang Tong, Nicholas Valler, Michalis Faloutsos and Christos Faloutsos). In ECML-PKDD 2010, Barcelona, Spain.
MetricForensics: A Multi-Level Approach for Mining Volatile Graphs (Keith Henderson, Tina Eliassi-Rad, Christos Faloutsos, Leman Akoglu, Lei Li, Koji Maruhashi, B. Aditya Prakash and Hanghang Tong). In SIGKDD 2010, Washington D.C.

Acknowledgements: Collaborators. Christos Faloutsos, Roni Rosenfeld, Michalis Faloutsos, Lada Adamic, Theodore Iwashyna (M.D.), Dave Andersen, Tina Eliassi-Rad, Iulian Neamtiu, Varun Gupta, Jilles Vreeken, V. S. Subrahmanian, John Brownstein (M.D.), Deepayan Chakrabarti, Hanghang Tong, Kunal Punera, Ashwin Sridharan, Sridhar Machiraju, Mukund Seshadri, Alice Zheng, Lei Li, Polo Chau, Nicholas Valler, Alex Beutel, Xuetao Wei.

Acknowledgements: Students. Liangzhe Chen, Shashidhar Sundareisan, Benjamin Wang, Yao Zhang, Sorour Amiri, Bijaya Adhikari.

Acknowledgements: Funding

Making Diffusion Work for You. B. Aditya Prakash. http://www.cs.vt.edu/~badityap Analysis; Policy/Action; Data.