Download presentation
Presentation is loading. Please wait.
Published byAbel Dean Modified over 9 years ago
1
Statistical Inference for Large Directed Graphs with Communities of Interest Deepak Agarwal
2
Outline Communities of Interest : overview Why a probabilistic model? Bayesian Stochastic Blockmodels Example Ongoing work
3
Communites of interest Goal: understand calling behavior of every TN on ATT LD network: massive graph Corinna, Daryl and Chris invented COI’s to scale computation using Hancock (Anne Rogers and Kathleen Fisher) Definition: COI of TN X is a subgraph centered around X –Top k called by X + other –Top k calling X + other
4
COI signature X Other outbound Other inbound
5
Entire graph union of COI’s Extend a COI by recursively growing the spider –Captures calling behavior more accurately Definition for this work: –Grow the spider till depth 3. Only retain depth 3 edges that are between depth 2 nodes.
6
Extended COI me other X x
7
Enhancing a COI !! Missed calls: –Local calls where TN’s not ATT local –Outbound OCC calls –Calls to/from the bin “other” Big outbound and inbound TNs –Dominate the COI, lot of clutter. –Need to down weight their calls. Other issues Want to quantify things like tendency to call, tendency of being called, tendency of returning calls for every TN.
8
Our approach so far COI -> social network Want a statistical model that estimates missing edges, add desired ones and remove (or down weight) undesired ones.
9
me COI from top probability edges of a statistical model. The model adds new edges. (brown arrows) Removes undesired ones.
10
Getting a sense of data Some descriptive statistics based on a random sample of 500 residential COI’s.
11
density = 100*ne/(g(g-1)) ne = number of edges g = number of nodes
13
Under random Average conditional on out -degrees
14
Under random: Conditional on outdegrees
15
Under random: Conditional on indegrees
17
Distribution of “Other"
18
Representing the Data Collection of all edges with activity Matrix with no diagonal entries Collection of several 2x2 contingency tables
19
COI: gxg matrix without diagonal entries
20
COI: collection of 2x2 tables. Data matrix a collection of g(g-1)/2 2x2 tables (called dyads). m ij a ij a ji n ij p ij p ji 1 i->j j-> i present absent present absent 1-p ij 1-p ji Row total Column total
21
More probabilities than edges. Need to express them in terms of fewer parameters which could be learned from data.
22
All Greek letters to be estimated from data Computation: 2 minutes for a typical COI on fry Likelihood, gradient and Hessian computed using C, optimizer in R. Optimizer goes crazy due to presence of so many zero degrees Do regularization, known as “shrinkage estimation” in statistics. Incur bias for small degree nodes but get reduction in variance.
23
Meaning of parameters Node i: – α i : expansiveness (tendency to call) – β i : attractiveness (tendency of being called) Global parameters: – θ: density of COI (reduces with increasing sparseness) – ρ: reciprocity of COI (tendency to return calls) – λ s : “caller” specific effect – λ r : “cal lee” specific effect – γ: “call” specific effect
24
Differential reciprocity Different reciprocity for each node: –Add another parameter η i to node i – Replace ρM by ρM + Σ i η i M i in the likelihood – Called “differential reciprocity” model – Computationally challenging, have implemented it.
25
Missing edges? Can estimate all parameters as long as we have some observed edges in data matrix –for each row (to estimate expansiveness) –for each column (to estimate attractiveness) Missing local calls -> o.k. OCC -> problem, entire row missing. –Impute data using reasonable assumptions m times (typically m=3 o.k.) and combine results. Working on it.
26
Incorporating edge weights Edge weights binned into k bins using a random sample of 500 COI’s. Weights in ith bin assigned a score i. tij unknown, w’s weights on dyad (i,j). tij imputed using Hyper geometric w ij w ji k t ij Row total Column total k - w ji k - w ij
27
Example COI with 117 nodes, 172 edges. 14 missing edges, local calls from14 non ATT local customers to seed node (local list provided by Gus). One edge attribute: number of common “buddies” between TN i and TN j Tried Bizocity, “Localness to seed” for caller and cal lee effects, eventually settled with one caller effect viz localness to seed, no cal lee effect.
28
Parameter estimates. θ = -6.28; ρ=2.76 (higher side) λ s =.29 (TN’s local to seed have a higher tendency to call) γ=.41 (common acquaintances between two TN’s increase their tendency to call each other)
30
Pruning the big (red) nodes Down weight expansiveness/attractiveness based on proportion of volume going to “other”, higher value get down weighted more by adding “offset” –Renormalize the new probability matrix to have the same mass as the original one. Offset function used:
32
Matrix obtained by taking union of top 50 data edges, top 50 edges from original model, top 50 edges from pruned model.
34
Where to from here? Estimate missing OCC calls :multiple imputation. Scale the algorithm to get parameter estimates for every TN, maybe on a weekly basis, enrich customer signature. Can compute Hellinger distance between two COIs in closed form. Could be useful in supervised learning tasks like tracking Repetitive debtors.
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.