

Structural Analysis in Large Networks Observations and Applications Mary McGlohon Committee Christos Faloutsos, co-chair Alan Montgomery, co-chair Geoffrey Gordon David Jensen, University of Massachusetts, Amherst

Motivation Network (a.k.a. graph, relational, social network) data has become ubiquitous. We want to know: ▫ How do networks form and structure themselves? ▫ How does information propagate through networks? ▫ How do sub-communities form? Examples: Facebook, computer networks, IMDB actor-movie.

TOPOLOGY COMMUNITY CASCADES OBSERVATIONS APPLICATIONS/TOOLS “Outline” for thesis

Motivation: Topology How do these network structures form? ▫ Example: identify topological properties common to many different types of graphs (citations, friendships, etc.) ▫ Developing models of these properties allows for forecasting.

Motivation: Cascades Once the networks form, how does information propagate through the graph? ▫ Example: Extract, analyze, and model cascades. (Figure: graph topology and a cascade.)

Motivation: Community How do we compare communities, or sub-networks? ▫ Example: For a set of online groups (Usenet), which ones continue to thrive over time?

Thesis statement We propose to ▫ investigate how interactions in graphs occur and how these interactions lead to diffusion and community behavior, and ▫ model these behaviors and apply these findings to real-world problems.


Impact Understanding the relations found in networks has many applications, such as: Fraud/anomaly detection ▫ Given typical behavior and information about nodes/edges, how “suspicious” is a node or group of nodes? Ad personalization/recommendation systems ▫ Given some information about an individual and their friends, which ads to display? Resource allocation ▫ Given typical patterns of network growth, how can we allocate resources (hardware, advertising budget, etc.)?

TOPOLOGY COMMUNITY CASCADES OBSERVATIONS APPLICATIONS/TOOLS KDD08 ICDM08 ICWSM07 ICWSM09-2* ICWSM09-3* KDD09* ICWSM09-1* SDM07 Completed Work (* = to appear)

TOPOLOGY COMMUNITY CASCADES OBSERVATIONS APPLICATIONS/TOOLS Proposed Work P1a: How do cascades compare across network structures? P2: Can we predict success/failure of groups? P1b: Can we use cascades to model product adoption?

The rest of the talk ▫ Motivation and thesis statement ▫ Completed work ▫ Proposed work ▫ Conclusions and impact ▫ Audience participation!

TOPOLOGY COMMUNITY CASCADES OBSERVATIONS APPLICATIONS/TOOLS What patterns are common to networks? Completed Work

Topological Observations Diameter over time Connected components Edge weights (Kevin Bacon)

Topological Observations: Data ▫ Analyze unipartite and bipartite networks ▫ Networks are evolving over time ▫ Networks may be weighted (repeated edges, edge weights) Unipartite: citations, blogs, router traffic. Bipartite: IMDB actor-movie, campaign contributions…

Topological Observations: Gelling Point When does a graph begin displaying expected patterns, such as the giant connected component? How can we tell when this happens?

Topological Observations: Gelling Point Observation: Most real graphs display a gelling point, where the graph begins to come together and the giant connected component forms. After that point, they exhibit typical behavior. (Figure: diameter vs. time for IMDB, with t=1914 marked.)

Topological Observations: NLCCs In graphs a giant connected component emerges. We look at the sizes of the next-largest connected components (NLCCs). After the gelling point, do they continue to grow? Do they shrink?

Topological Observations: NLCCs Observation: After the gelling point, the giant connected component takes off, but the next-largest connected components remain constant or oscillate. (Figure: sizes of the 2nd and 3rd connected components vs. time for IMDB, with t=1914 marked.)
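A minimal sketch of how the component-size curves could be tracked as edges arrive, assuming the graph is given as a list of (time, u, v) edges. The names UnionFind and component_sizes_over_time are hypothetical and not from the talk; the sizes are recomputed at every step for clarity, not for speed.

```python
class UnionFind:
    """Disjoint-set structure that tracks connected-component sizes."""

    def __init__(self):
        self.parent = {}
        self.size = {}

    def find(self, x):
        if x not in self.parent:
            self.parent[x] = x
            self.size[x] = 1
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every node on the path at the root.
        while self.parent[x] != root:
            self.parent[x], x = root, self.parent[x]
        return root

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return
        if self.size[ra] < self.size[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra
        self.size[ra] += self.size[rb]

    def component_sizes(self):
        """Sizes of all components, largest first."""
        roots = (n for n in self.parent if self.parent[n] == n)
        return sorted((self.size[r] for r in roots), reverse=True)


def component_sizes_over_time(timed_edges, top_k=3):
    """Return [(t, sizes of the top_k components after the edge at time t)]."""
    uf = UnionFind()
    history = []
    for t, u, v in sorted(timed_edges, key=lambda e: e[0]):
        uf.union(u, v)
        history.append((t, uf.component_sizes()[:top_k]))
    return history


# Toy example (made-up data): three pairs form, then two of them merge.
timed_edges = [(1, "a", "b"), (2, "c", "d"), (3, "e", "f"), (4, "b", "c")]
history = component_sizes_over_time(timed_edges)
for t, sizes in history:
    print(t, sizes)
```

The first entry of each size list is the giant component; the second and third entries are the NLCC curves one would plot against time.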

Topological Observations: Weights How are edges in a graph repeated, or otherwise weighted? As the number of edges increases, does the total edge weight grow linearly?

Topological Observations: Weights Observation: Weight additions follow a power law with respect to the number of edges: $W(t) \propto E(t)^{w}$ ▫ W(t): total weight of graph at t ▫ E(t): total edges of graph at t ▫ w is the power-law exponent (w>1) Many other weighted laws: see [KDD08, ICDM08]. (Figure: log(weights) vs. log(edges) for Orgs-Candidates, slope=1.3.)
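One simple way to estimate the exponent w is the slope of a least-squares line through the (log E, log W) points. The sketch below assumes that approach; the function name is hypothetical and the data are synthetic, generated with exponent 1.3 purely for illustration. It is not the fitting procedure used in [KDD08].

```python
import math


def fit_power_law_exponent(xs, ys):
    """Least-squares slope of log(y) against log(x)."""
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    sxx = sum((a - mean_x) ** 2 for a in log_x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(log_x, log_y))
    return sxy / sxx


# Synthetic snapshots: total edges E(t) and total weight W(t) = 2 * E(t)^1.3.
edges = [10, 100, 1000, 10000]
weights = [2.0 * e ** 1.3 for e in edges]
w = fit_power_law_exponent(edges, weights)
print(round(w, 3))
```

A slope above 1 on real snapshots would indicate that total weight grows faster than the number of edges.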


TOPOLOGY COMMUNITY CASCADES OBSERVATIONS APPLICATIONS/TOOLS Gelling point, CC’s Weighted laws Completed Work

TOPOLOGY COMMUNITY CASCADES OBSERVATIONS APPLICATIONS/TOOLS Gelling point, CC’s Weighted laws Can we develop generative models? Completed Work

Topological Models: “Butterfly” Goals are to generate: ▫ Constant/oscillating NLCC’s ▫ Densification power law [Leskovec+05] ▫ Shrinking diameter (after “gelling point”) ▫ Power-law degree distribution ▫ Emergent, local, intuitive behavior

Topological Models: “Butterfly” Main idea: Uses 3 parameters ▫ “Curiosity”: how much to explore local network (~U(0,1), creates power-law degree distribution) ▫ “Flyout”: how many local networks to explore (global, joins components) ▫ “Friendliness”: how often to connect (global, allows new components) Details: see [KDD08]

Topological Models: “Butterfly” (Figure panels: power-law degree distribution, log(count) vs. log(degree), slope=-2; shrinking diameter, diameter vs. nodes; densification, log(edges) vs. log(nodes), slope=1.17; oscillating NLCCs, NLCC size vs. nodes.)


TOPOLOGY COMMUNITY CASCADES OBSERVATIONS APPLICATIONS/TOOLS Gelling point, CC’s Weighted laws Butterfly RTM Oddball Completed Work

TOPOLOGY COMMUNITY CASCADES OBSERVATIONS APPLICATIONS/TOOLS Gelling point, CC’s Weighted laws What are patterns of cascades in networks? Butterfly RTM Oddball Completed Work

Cascade Observations: Data ▫ Gathered from August-September 2005* ▫ Used set of 44,362 blogs, traced cascades ▫ 2.4 million posts ▫ 245,404 blog-to-blog links (Figure: number of posts per day over time, with Jul 4, Aug 1, and Sep 29 marked.)

Cascade Observations: Prelims (Figure: a blogosphere of blogs B1 to B4 and the cascades extracted from it, including “star” and “chain” shapes.) ▫ How quickly does a link to a post occur? ▫ What size do cascades typically reach? ▫ What are typical shapes: how often do “stars” and “chains” occur?

Temporal Observations How quickly does a link to a post occur? Does popularity decay at a constant rate? With an exponential (“half life”)? (Figure panels: linear-linear scale, log-linear scale, log-log scale.)

Cascade Observations: Link Popularity Observation: The probability that a post written at time $t_p$ acquires a link at time $t_p + \Delta$ is $p(t_p + \Delta) \propto \Delta^{-1.5}$. Similar to [Vazquez+06]. (Figure: log(# in-links) vs. log(days after post), slope=-1.5.)

Cascade Observations: Cascade Size Q: What size distribution do cascades follow? Are large cascades frequent? Observation: The probability of observing a cascade of n blog posts follows a Zipf distribution: $p(n) \propto n^{-2}$. (Figure: log(count) vs. log(cascade size in # of nodes), slope=-2.)

Cascade Observations: Cascade Size Q: What is the distribution of particular cascade shapes? Observation: Stars and chains in blog cascades also follow a power law, with different exponents (star -3.1, chain -8.5). (Figure: log(count) vs. log(size in # nodes) for chains, a=-8.5, and for stars, a=-3.1.)
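To count stars and chains, each extracted cascade first has to be labeled by shape. The sketch below is one possible labeling, assuming a cascade is a list of (citing_post, cited_post) links that form a tree. The function name, the "other" label, and the choice to ignore cascades with fewer than three posts are assumptions, not definitions from [SDM07].

```python
from collections import Counter


def classify_cascade(links):
    """Label a cascade as 'star', 'chain', or 'other'.

    links: (citing_post, cited_post) pairs belonging to one cascade.
    """
    nodes = {post for link in links for post in link}
    n = len(nodes)
    # Assumption: shapes are only meaningful for trees with 3+ posts.
    if n < 3 or len(links) != n - 1:
        return "other"
    in_degree = Counter(cited for _, cited in links)
    out_degree = Counter(citing for citing, _ in links)
    # Star: a single post is cited directly by every other post.
    if max(in_degree.values()) == n - 1:
        return "star"
    # Chain: walking back from the root reaches every post one at a time.
    roots = [p for p in nodes if out_degree[p] == 0]
    if len(roots) == 1 and max(in_degree.values()) == 1:
        citing_of = {cited: citing for citing, cited in links}
        node, seen = roots[0], {roots[0]}
        while node in citing_of and citing_of[node] not in seen:
            node = citing_of[node]
            seen.add(node)
        if len(seen) == n:
            return "chain"
    return "other"


star = [("b", "a"), ("c", "a"), ("d", "a")]
chain = [("b", "a"), ("c", "b"), ("d", "c")]
mixed = [("b", "a"), ("c", "a"), ("d", "b")]
print(classify_cascade(star), classify_cascade(chain), classify_cascade(mixed))
```

Tallying the labels by cascade size would give the counts for the two log-log plots.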


TOPOLOGY COMMUNITY CASCADES OBSERVATIONS APPLICATIONS/TOOLS Gelling point, CC’s Weighted laws Cascades laws Cascades as features Butterfly RTM Oddball Completed Work

TOPOLOGY COMMUNITY CASCADES OBSERVATIONS APPLICATIONS/TOOLS Gelling point, CC’s Weighted laws Cascades laws Cascades as features Can we develop predictive models for cascades? Butterfly RTM Oddball Completed Work

Cascade Models: CGM Cascade Generation Model Overview: Produce realistic cascades through an emergent “viral” model Details: See [SDM07]

Cascade Models: CGM (Figure: most frequent cascades, model vs. data; log(count) vs. log(cascade size in # nodes), log(star size), and log(chain size), each comparing data and model.)


TOPOLOGY COMMUNITY CASCADES OBSERVATIONS APPLICATIONS/TOOLS Gelling point, CC’s Weighted laws Cascades laws Cascades as features Cascade generation model ZC model Butterfly RTM Oddball Completed Work

TOPOLOGY COMMUNITY CASCADES OBSERVATIONS APPLICATIONS/TOOLS Gelling point, CC’s Weighted laws Cascades laws Cascades as features How can we compare communities? Cascade generation model ZC model Butterfly RTM Oddball Completed Work

TOPOLOGY COMMUNITY CASCADES OBSERVATIONS APPLICATIONS/TOOLS Gelling point, CC’s Weighted laws Cascades laws Cascades as features Political Usenet study Cascade generation model ZC model Butterfly RTM Oddball Completed Work

TOPOLOGY COMMUNITY CASCADES OBSERVATIONS APPLICATIONS/TOOLS Gelling point, CC’s Weighted laws Cascades laws Cascades as features Political Usenet study Can we detect anomalies? Cascade generation model ZC model Butterfly RTM Oddball Completed Work

Community Tools: SNARE Problem: Given a network and some domain knowledge about suspicious nodes (flags), determine which nodes are most risky. Data: Accounting transaction data. Nodes are accounts, edges are transactions between accounts. (Figure: accounts payable, accounts receivable, and revenue accounts.)

Community Tools: SNARE Example: “Channel stuffing” ▫ Some accounts overstated ▫ But other accounts also involved. ▫ Since many accounts are slightly affected, it is easy to cover up activity. (Figure: accounts payable, accounts receivable, and revenue accounts, labeled from “very risky” to “not risky”.)

Community Tools: SNARE Social Network Analytic Risk Evaluation ▫ Use domain knowledge to flag certain nodes. ▫ Assume homophily between nodes (“guilt by association”) ▫ Then, using initial risk as initial node potentials, use belief propagation (message passing between nodes) to determine end risk scores.

Community Tools: SNARE Belief Propagation ▫ Flags are node potentials, or “initial risk scores” ▫ All nodes send messages back and forth with beliefs ▫ Upon convergence, end result will reflect “riskiest” nodes. (Figure: accounts payable, accounts receivable, and revenue accounts, before and after propagation.)
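The message-passing idea can be illustrated with a small sum-product belief propagation routine over two states (not risky, risky). This is a generic sketch and not the SNARE implementation from [KDD09]: the function name, the homophily value of 0.9, the default prior of 0.5 for unflagged nodes, and the fixed iteration count are all assumptions.

```python
def belief_propagation(edges, prior_risk, homophily=0.9, iterations=20):
    """Return node -> P(risky) after loopy sum-product message passing.

    edges: undirected (u, v) pairs.
    prior_risk: node -> prior P(risky) for flagged nodes (others get 0.5).
    homophily: assumed probability that neighbors share the same state.
    """
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)
    # Node potentials: (P(not risky), P(risky)).
    phi = {n: (1 - prior_risk.get(n, 0.5), prior_risk.get(n, 0.5))
           for n in neighbors}
    # Edge potential psi[state_u][state_v] encodes "guilt by association".
    psi = ((homophily, 1 - homophily), (1 - homophily, homophily))
    msgs = {(u, v): (0.5, 0.5) for u in neighbors for v in neighbors[u]}
    for _ in range(iterations):
        new_msgs = {}
        for (u, v) in msgs:
            out = []
            for state_v in (0, 1):
                total = 0.0
                for state_u in (0, 1):
                    prod = phi[u][state_u] * psi[state_u][state_v]
                    for w in neighbors[u]:
                        if w != v:
                            prod *= msgs[(w, u)][state_u]
                    total += prod
                out.append(total)
            z = out[0] + out[1]
            new_msgs[(u, v)] = (out[0] / z, out[1] / z)
        msgs = new_msgs
    beliefs = {}
    for n in neighbors:
        b0, b1 = phi[n]
        for w in neighbors[n]:
            b0 *= msgs[(w, n)][0]
            b1 *= msgs[(w, n)][1]
        beliefs[n] = b1 / (b0 + b1)
    return beliefs


# Toy chain a - b - c - d where only account "a" is flagged.
beliefs = belief_propagation([("a", "b"), ("b", "c"), ("c", "d")], {"a": 0.9})
for node in sorted(beliefs):
    print(node, round(beliefs[node], 3))
```

In this toy chain the risk score decays with distance from the flagged account, which is the "guilt by association" effect in miniature.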

Community Tools: SNARE Produces improvement over simply using flags ▫ Up to 6.5 lift ▫ Improvement especially for low false positive rate (Figure: ROC curve for accounts data, true positive rate vs. false positive rate; curves: ideal, SNARE, and baseline with flags only.)

Community Tools: SNARE ▫ Accurate: produces large improvement over simply using flags ▫ Flexible: can be applied to other domains ▫ Scalable: one iteration of BP runs in time linear in the number of edges ▫ Robust: works over a large range of parameters


TOPOLOGY COMMUNITY CASCADES OBSERVATIONS APPLICATIONS/TOOLS Gelling point, CC’s Weighted laws Cascades laws Cascades as features Political Usenet study SNARE Cascade generation model ZC model Butterfly RTM Oddball Completed Work

The rest of the talk ▫ Motivation and thesis statement ▫ Completed work ▫ Proposed work ▫ Conclusions and impact ▫ Audience participation!

Proposed Work 2 main problems: ▫ P1: Cascades and product adoption • How do cascades vary according to network structure? • Can we use cascades to model product adoption? ▫ P2: Predicting success/failure of online groups

TOPOLOGY COMMUNITY CASCADES OBSERVATIONS APPLICATIONS/TOOLS P1a: How do cascades compare across network structures? P2: Can we predict success/failure of groups? P1b: Can we use cascades to model product adoption?


P1a: Cascades & Network Structure ▫ In different networks, how does the starting point of an epidemic affect the epidemic size? ▫ Which modifications to the current model change the cascades (weights, self-infection)? ▫ Can we reverse-engineer network properties based on observed cascades? (Figure labels: Many hubs? Large diameter?)


P1b: Cascades & Product Adoption Examine adoption of Caller Ringback Tones (CRBT) ▫ User buys ringtone ▫ Friend calls user, hears CRBT Phone call data ▫ Nodes: User ID, DOB, salutation (Mr/Ms), date of joining, data plan ▫ Call Edges: src/dest ID, call time, duration ▫ SMS Edges: src/dest ID, time ▫ CRBT purchases: purchase date, song name, cost

P1b: Cascades & Product Adoption Can we fit the Bass Model for different CRBT’s? (Equation labels: # adopters today, # adopters yesterday, # potential adopters; a “mass marketing” term and a “word of mouth” term.)
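The equation on this slide did not survive in the transcript, so the sketch below uses the standard discrete-time form of the Bass model [Bass69] as a stand-in: new adopters per step are (p + q·N/m)·(m − N), where N is the number of adopters so far, m the number of potential adopters, p the “mass marketing” coefficient, and q the “word of mouth” coefficient. The function name and the parameter values are illustrative assumptions, not values fitted to CRBT data.

```python
def simulate_bass(m, p, q, steps):
    """Cumulative adopters over time under a discrete-time Bass model.

    m: number of potential adopters (market size).
    p: coefficient of innovation ("mass marketing").
    q: coefficient of imitation ("word of mouth").
    """
    cumulative = [0.0]
    for _ in range(steps):
        adopters_yesterday = cumulative[-1]
        remaining = m - adopters_yesterday
        new_adopters = (p + q * adopters_yesterday / m) * remaining
        cumulative.append(adopters_yesterday + new_adopters)
    return cumulative


# Illustrative parameters only.
curve = simulate_bass(m=1000, p=0.03, q=0.38, steps=50)
print([round(x, 1) for x in curve[:5]])
```

Fitting the model to one CRBT would mean choosing m, p, and q so that this curve matches the observed cumulative purchases; comparing q across songs is one way to ask whether some are more “viral” than others.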

P1b: Cascades & Product Adoption ▫ Are some CRBT’s more “viral” than others? ▫ Does the footprint follow a skewed distribution? ▫ How long after purchase is a CRBT infective? (Figure: survival function P(X>x) vs. number of downloads per song.)
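The survival function in the figure can be computed empirically as the fraction of songs with more than x downloads. A minimal sketch, with a hypothetical function name and made-up counts:

```python
def survival_function(values):
    """Empirical P(X > x) for each distinct x in values, in increasing x."""
    n = len(values)
    return [(x, sum(1 for v in values if v > x) / n)
            for x in sorted(set(values))]


# Made-up download counts per song.
downloads = [1, 1, 2, 5]
curve = survival_function(downloads)
print(curve)
```

Plotting these points on log-log axes is a common way to check whether the distribution is skewed.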

P1b: Cascades & Product Adoption How does the weight of a link, homophily, or other factors affect the likelihood of transmission? Can we explicitly test whether a purchase is a result of basic similarity of neighbors or a result of “viral” propagation? How can we build and verify a model for this propagation?


P2: Success & Failure of Online Groups Use data over 4 years from nearly 200 newsgroups (Political Usenet). Many discussion groups stop posting by the third year. Why?

P2: Success & Failure of Online Groups P2 Questions: ▫ If structural network characteristics can be traced to success or failure, which features are most predictive? ▫ Can we test causality in the predictive characteristics?

Timeline (dates shown: May ‘09, Jun ‘09, Sep ‘09, Nov ‘09, Mar ‘10, Aug ‘10, Jul ‘10) ▫ P1 preliminaries ▫ Internship at Google ▫ P1a: Cascades and network structure ▫ P1b: Cascades and product adoption ▫ P2: Success/failure of online groups ▫ Complete document ▫ Defend

Related work ▫ Topology: • Heavy-tailed degree distributions [Faloutsos+99] [Albert+02] [Kleinberg+99] • Shrinking diameter, densification [Leskovec+05] • Random graphs model [Erdos+60] • “Forest Fire” model [Leskovec+05] • “Winners do not take all” model [Pennock+02] ▫ Cascades: • Recommendations: [Leskovec+06] • Diffusion in blogs: [Adar+03] [Gruhl+04] [Kempe+03] [Kumar+03] • Marketing: Product adoption [Bass69], Word-of-mouth [Godes+04] • Virus propagation: Populations [Hethcote], Networks [Boguna, Pastor-Satorras] [Chakrabarti] ▫ Communities and other applications: • Securities fraud detection [Neville+05] [Fast+07] • Author identification [Hill+04] • Online group behavior [Backstrom+08]

Conclusions: Completed Demonstrated several properties common to networks in a wide range of domains. ▫ Oscillating sizes of next-largest connected components ▫ Power laws for weighted graphs ▫ Butterfly model: generates these properties

Conclusions: Completed ▫ Studied and modeled cascades in blogs: several power laws for cascade shapes and size; Cascade Generation Model ▫ Devised SNARE for anomaly detection for accounting data (lift factor up to 6.5)

Conclusions: Proposed ▫ P1a: Continue cascade studies across network structures ▫ P1b: Use cascades to model purchases in phone-call graph ▫ P2: Build predictive models for success and failure in online groups

References Topology ▫ [KDD08] M. McGlohon, L. Akoglu, and C. Faloutsos. Weighted Graphs and Disconnected Components: Patterns and a Generator. SIG-KDD. Las Vegas, Nev., August 2008. ▫ [ICDM08] L. Akoglu, M. McGlohon, and C. Faloutsos. RTM: Laws and a Recursive Generator for Weighted Time-Evolving Graphs. ICDM. Pisa, Italy, Dec. 2008. Cascades ▫ [SDM07] J. Leskovec, M. McGlohon, C. Faloutsos, N. Glance, and M. Hurst. Patterns of Cascading Behavior in Large Blog Graphs. SDM. Minneapolis, Minn., April 2007. ▫ [ICWSM07] M. McGlohon, J. Leskovec, C. Faloutsos, N. Glance, and M. Hurst. Finding patterns in blog shapes and blog evolution. ICWSM. Boulder, Colo., March 2007. ▫ [ICWSM09-1] M. Goetz, J. Leskovec, M. McGlohon, and C. Faloutsos. Modeling Blog Dynamics. ICWSM. San Jose, Calif., May 2009.

References Community ▫ [KDD09] M. McGlohon, S. Bay, M. Anderle, D. Steier, and C. Faloutsos. SNARE: A Link Analytic System for Evaluating Fraud Risk. ACM Special Interest Group on Knowledge Discovery and Data Mining (SIG-KDD). Paris, France, June 2009. ▫ [ICWSM09-2] M. McGlohon and M. Hurst. Community Structure and Information Flow in Usenet: Improving analysis with a thread ownership model. International Conference on Weblogs and Social Media (ICWSM). San Jose, Calif., May 2009. ▫ [ICWSM09-3] M. McGlohon and M. Hurst. Considering the Sources: Comparing linking patterns in Usenet and blogs. International Conference on Weblogs and Social Media (ICWSM). San Jose, Calif., May 2009.

Acknowledgments: ▫ Leman Akoglu ▫ Markus Anderle ▫ Stephen Bay ▫ Polo Chau ▫ Christos Faloutsos ▫ Natalie Glance ▫ Mila Goetz ▫ Geoff Gordon ▫ Matthew Hurst ▫ i-Lab ▫ David Jensen ▫ Ramayya Krishnan ▫ Jure Leskovec ▫ Austin McDonald ▫ Alan Montgomery ▫ Chris Neff ▫ Nachi Sahoo ▫ Purna Sarkar ▫ David Steier Support: ▫ PricewaterhouseCoopers ▫ Microsoft Live Labs ▫ NSF Graduate Research Fellowship ▫ Yahoo! Key Technical Challenges Grant, Pennsylvania Infrastructure Technology Alliance (PITA) ▫ Hewlett-Packard ▫ NSF Grants No. IIS , IIS , and CNS , , SENSOR , EF , IIS ▫ U.S. Department of Energy Lawrence Livermore National Laboratory contract No. W-7405-ENG-48.

Audience participation!


Talk expansion pack

P1b: Other Cascade Data Post data from corporate blogs ▫ Demographic data on bloggers (employee ID, location, job description) ▫ Read data (timestamped) ▫ Write data (timestamped) CRBT adoption in general ▫ Perhaps people do not adopt particular songs, but the CRBT mechanism More public blog data (spinn3r) ▫ Also use edge information from blogrolls/comments

P2: Potential features to examine Posting behavior ▫ Which users are posting, how often are they posting, and how skewed is the distribution? Linking behavior ▫ How long are cascades (threads), in terms of posts and time? Content ▫ Topics, keywords, sentence length, other textual features, sentiment analysis

Unipartite Networks ▫ Postnet: Posts in blogs, hyperlinks between ▫ Blognet: Aggregated Postnet, repeated edges ▫ Patent: Patent citations ▫ NIPS: Academic citations ▫ Arxiv: Academic citations ▫ NetTraffic: Packets, repeated edges ▫ Autonomous Systems (AS): Packets, repeated edges (Figure label: 4 million nodes, 8 million edges, 17 years.)

Bipartite Networks ▫ IMDB: Actor-movie network ▫ Netflix: User-movie ratings ▫ DBLP: conference- repeated edges • Author-Keyword • Keyword-Conference • Author-Conference ▫ US Election Donations: $ weights, repeated edges • Orgs-Candidates • Individuals-Orgs (Figure label: 6 million nodes, 10 million edges, 22 years.)

Topological Models: “Butterfly” ▫ A new node joins, picks a host, and iteratively random-walks around the host’s neighbors, with “curiosity” ~ U(0,1). Some nodes are “friendlier” than others. ▫ Nodes may have multiple hosts (“flyout”), which joins components. ▫ Nodes link with a probability (“friendliness”); a node may choose a host but not link, which starts a new component. (Figure: new node and host.)


Topological Models: RTM Recursive Tensor Model Goal: to introduce time and burstiness Main idea: Begin with a core tensor (multidimensional array), and use self-similarity to reproduce observed power laws.

Topological Models: RTM Self-similarity arises from the Kronecker product. 2D: see [Leskovec+06].

Topological Models: RTM 3D: Use Kronecker product on a core tensor. Reproduced power laws as found in [ICDM08]. (Figure: adjacency matrix.)

Topological Models: RTM 3D: Use Kronecker product on a core tensor. Reproduced power laws as found in [ICDM08]. (Figure: the 3rd dimension is time.)
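The self-similarity can be illustrated by taking repeated Kronecker products of a small core with itself, first for a 2D adjacency matrix and then for a 3D tensor whose third axis plays the role of time. The core values below are made up, the helper name kronecker_power is hypothetical, and the sketch is deterministic; the actual RTM core and any sampling step are described in [ICDM08], not here.

```python
import numpy as np


def kronecker_power(core, k):
    """Kronecker product of core with itself, k factors in total."""
    result = core
    for _ in range(k - 1):
        result = np.kron(result, core)
    return result


# 2D: a 2x2 core grows into an 8x8 self-similar adjacency matrix.
core_matrix = np.array([[1, 1],
                        [1, 0]])
adjacency = kronecker_power(core_matrix, 3)

# 3D: a 2x2x2 core tensor (source x destination x time) grows to 4x4x4.
core_tensor = np.array([[[1, 1], [1, 0]],
                        [[0, 1], [1, 0]]])
tensor = kronecker_power(core_tensor, 2)

print(adjacency.shape, int(adjacency.sum()))
print(tensor.shape, int(tensor.sum()))
```

Each slice tensor[:, :, t] can be read as the edges present at time step t, so the same recursive pattern appears in both the graph structure and the timing.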

Topological Applications: Oddball Main ideas: ▫ Use local neighborhood of node ▫ Find common patterns ▫ Score how much a node deviates from common patterns Results ▫ Identified anomalous nodes such as Ken Lay in Enron, as well as particularly unusual blog posts

Cascade Models: CGM i) Randomly pick a blog to infect and add its post to the cascade. ii) Infect each in-linked neighbor with probability β. iii) Add infected neighbors’ posts to the cascade. iv) Set the node infected in (i) to uninfected. (Figure: blogs B1 to B4 with posts p1,1 and p4,1 at each step.)
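The four steps can be sketched as a small simulation. Several details here are assumptions for illustration and are not taken from the slide: the blog graph is a dictionary mapping each blog to the blogs that link to it, newly infected blogs repeat steps (ii) to (iv) until none remain, and a max_posts cap guards against runaway cascades. The function and parameter names are hypothetical.

```python
import random


def generate_cascade(in_links, beta, rng, start=None, max_posts=10_000):
    """Simulate one cascade; return (start_blog, list of (post, cited_post)).

    in_links: blog -> list of blogs that link to it (and can be infected by it).
    beta: probability that an infection spreads along one in-link.
    """
    if start is None:
        start = rng.choice(sorted(in_links))   # step (i): pick a blog
    next_post_id = 0                           # post 0 starts the cascade
    cascade_links = []
    infected = [(start, next_post_id)]
    while infected and next_post_id < max_posts:
        blog, post = infected.pop()            # step (iv): blog recovers
        for neighbor in in_links.get(blog, ()):
            if rng.random() < beta:            # step (ii): infect neighbor
                next_post_id += 1
                # step (iii): the neighbor's new post links to `post`
                cascade_links.append((next_post_id, post))
                infected.append((neighbor, next_post_id))
    return start, cascade_links


# Toy blogosphere: B2 and B3 link to B1, and B4 links to B2.
in_links = {"B1": ["B2", "B3"], "B2": ["B4"], "B3": [], "B4": []}
rng = random.Random(0)
_, always = generate_cascade(in_links, beta=1.0, rng=rng, start="B1")
_, never = generate_cascade(in_links, beta=0.0, rng=rng, start="B1")
print(always, never)
```

Running the generator many times with a small beta and tallying cascade sizes and shapes gives the model curves that are compared against the data.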

Cascade Models: Zero-crossing Main ideas: ▫ Models blogs in both network growth and network diffusion ▫ Choose to post based on random walk (produces burstiness) ▫ Link based on recency and popularity (reproduces “-1.5 law” and skewed degree) ▫ Improvement over CGM because network is generated

Community Observations: Newsgroups Observation: Threads introduced to a group later in the thread tended to have more activity from that group. Observation: Discussions tended to flow from “main” groups (can.politics) into subgroups (ab.politics, bc.politics).

Community Observations: Newsgroups 189 newsgroups (‘polit’ in name), January 2004-June million posts Includes many countries, provinces, states, topical groups (alt.politics.guns) Major issue: over half are cross-posted to multiple groups. Where is conversation truly occurring? Examples: {alt.politics, us.politics}, {alt.politics, us.politics, pa.politics}

Community Observations: Newsgroups Solution: Introduce “thread ownership” by assigning threads according to where authors exclusively post.


TOPOLOGY COMMUNITY CASCADES OBSERVATIONS APPLICATIONS/TOOLS What patterns are common to networks? What are patterns of cascades in networks? How can we compare communities? Can we detect anomalies, and predict group behavior? Can we develop predictive models for cascades? Can we develop generative models and detect anomalies? Completed Work