Unifying Guilt-by-Association Approaches: Theorems and Fast Algorithms. Danai Koutra, U Kang, Hsing-Kuo Kenneth Pao, Tai-You Ke, Duen Horng (Polo) Chau, Christos Faloutsos. School of Computer Science, Carnegie Mellon University; National Taiwan University of Science & Technology. ECML PKDD, 5-9 September 2011, Athens, Greece.
Problem Definition: GBA (guilt-by-association) techniques Danai Koutra (CMU) - PKDD Given: a graph with N nodes and M edges, and a few labeled nodes. Find: the class (red/green) of the remaining nodes. Assuming: network effects (homophily/heterophily).
Homophily and Heterophily. All methods handle homophily. NOT all methods handle heterophily, BUT the proposed method does! [Figure: two-step propagation example (Step 1, Step 2)]
Why do we study these methods?
Motivation (1): Law Enforcement [Tong+ ’06][Lin+ ’04][Chen+ ’11]…
Motivation (2): Cyber Security. Given a known bot: who are the botnet members? Who are the victims? [Kephart+ ’95][Kolter+ ’06][Song+ ’08-’11][Chau+ ’11]…
Motivation (3): Fraud Detection. Given a known fraudster: who are the other fraudsters? Lax controls? [Neville+ ’05][Chau+ ’07][McGlohon+ ’09]…
Motivation (4): Ranking [Brin+ ’98][Tong+ ’06][Ji+ ’11]…
Our Contributions. Theory: correspondence BP ≈ RWR ≈ SSL; linearization for BP; convergence criteria for linearized BP. Practice: FaBP algorithm (fast, accurate, and scalable); experiments on DBLP, Web, and Kronecker graphs.
Roadmap: Background (Belief Propagation, Random Walk with Restarts, Semi-supervised Learning); Linearized BP; Correspondence of Methods; Proposed Algorithm; Experiments; Conclusions.
Background (apologies for the diversion…)
Background 1: Belief Propagation (BP). Iterative message-based method: 1st round, 2nd round, ... until the stop criterion is fulfilled. “Propagation matrix”: indexed by the class of the “sender” and the class of the “receiver”; it encodes homophily or heterophily. The diagonal entries are usually the same and equal the homophily factor h. “About-half” homophily factor: h_h = h - 0.5.
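A minimal sketch of the two-class propagation matrix described above. The function names are hypothetical, and filling the off-diagonal entries with 1 - h (so that each row sums to 1) is an assumption; the slide only states that the diagonal holds the homophily factor h.

```python
def propagation_matrix(h):
    """Return a 2x2 propagation matrix as nested lists.

    Entry [s][r] couples the sender's class s with the receiver's class r.
    Assumption: off-diagonal entries are 1 - h, so each row sums to 1.
    """
    return [[h, 1.0 - h],
            [1.0 - h, h]]


def about_half(h):
    """Center the homophily factor around 0.5: h_h = h - 0.5."""
    return h - 0.5


homophily = propagation_matrix(0.9)    # h > 0.5: neighbors tend to share a class
heterophily = propagation_matrix(0.1)  # h < 0.5: neighbors tend to differ
```

Under this convention the sign of the about-half factor separates the two regimes: positive for homophily, negative for heterophily.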
Background 1: Belief Propagation Equations [Pearl ’82][Yedidia+ ’02]…[Pandit+ ’07][Gonzalez+ ’09][Chechetka+ ’10]
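For reference, a common way to write the BP updates is the standard sum-product form (as in [Yedidia+ ’02]). The symbol names below are an assumption and may differ from the slide's: \phi_i is the prior of node i, \psi_{ij} the edge potential (the role played by the propagation matrix), m_{ij} the message from node i to node j, N(i) the neighbors of i, and \eta a normalization constant.

```latex
% Message from node i to node j (all incoming messages except the one from j)
m_{ij}(x_j) \leftarrow \sum_{x_i} \phi_i(x_i)\, \psi_{ij}(x_i, x_j)
    \prod_{n \in N(i) \setminus j} m_{ni}(x_i)

% Belief of node i (prior times all incoming messages, normalized)
b_i(x_i) = \eta\, \phi_i(x_i) \prod_{j \in N(i)} m_{ji}(x_i)
```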
Background 2: Semi-Supervised Learning (SSL). Graph-based SSL: use a few labeled data points and exploit neighborhood information. [Figure: two-step labeling example (Step 1, Step 2)] [Zhou ’06][Ji, Han ’10]…
Background 3: Personalized Random Walk with Restarts (RWR) [Brin+ ’98][Haveliwala ’03][Tong+ ’06][Minkov, Cohen ’07]…
Background
Qualitative Comparison of GBA Methods
GBA Method   Heterophily   Scalability   Convergence
RWR          ✗             ✓             ✓
SSL          ✗             ✓             ✓
BP           ✓             ✓             ?
FaBP         ✓             ✓             ✓
Roadmap: Background (previous work); Linearized BP, Correspondence of Methods, Proposed Algorithm, Experiments (new work); Conclusions.
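Linearized BP (details). Theorem [Koutra+]: BP is approximated by the linear system [I + a D - c′ A] b_h = φ_h, where b_h are the final beliefs, φ_h the prior beliefs, a and c′ scalar constants, A the adjacency matrix, and D the diagonal matrix of node degrees (d1, d2, d3, …). Sketch of proof: odds ratio; Maclaurin expansions. [Figure: belief p_i on a 0-to-1 scale, centered at 0.5]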
Linearized BP vs BP. Original [Yedidia+]: Belief Propagation, non-linear. Our proposal: Linearized BP, linear. BP is approximated by Linearized BP.
Our Contributions. Theory: correspondence BP ≈ RWR ≈ SSL; linearization for BP ✓; convergence criteria for linearized BP. Practice: FaBP algorithm (fast, accurate, and scalable); experiments on DBLP, Web, and Kronecker graphs.
Linearized BP: convergence (details). Theorem: Linearized BP converges if the homophily factor stays below a bound that depends on the node degrees (d_n = degree of node n). Sketch of proof: require 1-norm < 1 OR Frobenius norm < 1.
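The norm conditions can be checked numerically. The sketch below assumes the linear system is rewritten as the fixed-point iteration b = φ + M b with M = c′A - aD, which converges whenever any norm of M that bounds its spectral radius is below 1. The constants a = 4h_h²/(1 - 4h_h²) and c′ = 2h_h/(1 - 4h_h²) are taken from the accompanying paper rather than from these slides, so treat them as an assumption; the function names are hypothetical. The graph is assumed undirected, unweighted, and without self-loops, so only the degree list is needed.

```python
import math


def fabp_constants(h_h):
    """Assumed constants: a = 4 h_h^2 / (1 - 4 h_h^2), c' = 2 h_h / (1 - 4 h_h^2)."""
    denom = 1.0 - 4.0 * h_h ** 2
    return 4.0 * h_h ** 2 / denom, 2.0 * h_h / denom


def iteration_matrix_norms(degrees, h_h):
    """Return (1-norm, Frobenius norm) of M = c'A - aD.

    Column j of M holds -a*d_j on the diagonal and d_j off-diagonal
    entries equal to c', so both norms follow from the degrees alone.
    """
    a, c_prime = fabp_constants(h_h)
    one_norm = max((abs(a) + abs(c_prime)) * d for d in degrees)
    nonzeros_in_A = sum(degrees)  # equals 2|E|
    fro_norm = math.sqrt(sum((a * d) ** 2 for d in degrees)
                         + nonzeros_in_A * c_prime ** 2)
    return one_norm, fro_norm


def converges(degrees, h_h):
    """Sufficient (not necessary) condition: either norm is below 1."""
    one_norm, fro_norm = iteration_matrix_norms(degrees, h_h)
    return one_norm < 1.0 or fro_norm < 1.0


path_degrees = [1, 2, 1]  # path graph on three nodes
small_h_ok = converges(path_degrees, 0.002)
large_h_ok = converges(path_degrees, 0.4)
```

On this toy graph the small homophily factor passes the check and the large one fails it, which matches the idea of an upper bound on h_h.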
Our Contributions. Theory: correspondence BP ≈ RWR ≈ SSL; linearization for BP ✓; convergence criteria for linearized BP ✓. Practice: FaBP algorithm (fast, accurate, and scalable); experiments on DBLP, Web, and Kronecker graphs.
Roadmap: Background; Linearized BP; Correspondence of Methods; Proposed Algorithm; Experiments; Conclusions.
Correspondence of Methods. Each method solves a linear system of the form matrix × unknown = known:
Method   Matrix               Unknown   Known
RWR      [I - c A D^-1]       x         (1 - c) y
SSL      [I + a (D - A)]      x         y
FaBP     [I + a D - c′ A]     b_h       φ_h
x, b_h: final labels/beliefs; y, φ_h: prior labels/beliefs; A: adjacency matrix; D: diagonal matrix of node degrees (d1, d2, d3, …).
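To make the shared structure concrete, the sketch below builds the three matrices for a small, made-up 4-node graph and solves each system with a tiny dense solver. The graph, the parameter values, and all function names are illustrative assumptions; the FaBP constants follow the accompanying paper's definitions (not shown on the slides), and the priors ±0.001 and h_h = 0.002 echo the values used in the experiments.

```python
def solve(M, rhs):
    """Solve M x = rhs by Gauss-Jordan elimination with partial pivoting.

    Intended for small dense systems only.
    """
    n = len(rhs)
    aug = [row[:] + [r] for row, r in zip(M, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                f = aug[r][col] / aug[col][col]
                aug[r] = [x - f * p for x, p in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


# Hypothetical undirected graph: edges 0-1, 0-2, 1-2, 2-3.
A = [[0, 1, 1, 0],
     [1, 0, 1, 0],
     [1, 1, 0, 1],
     [0, 0, 1, 0]]
n = len(A)
deg = [sum(row) for row in A]


def system(diag, off):
    """Matrix with diag(i) on the diagonal and off(i, j) * A[i][j] elsewhere."""
    return [[diag(i) if i == j else off(i, j) * A[i][j] for j in range(n)]
            for i in range(n)]


c, a_ssl, h_h = 0.5, 0.1, 0.002
denom = 1.0 - 4.0 * h_h ** 2
a_fabp, c_prime = 4.0 * h_h ** 2 / denom, 2.0 * h_h / denom  # assumed constants

y = [1.0, 0.0, 0.0, 0.0]          # prior labels: only node 0 is labeled
phi_h = [0.001, 0.0, 0.0, -0.001]  # about-half priors: node 0 (+), node 3 (-)

# RWR: [I - c A D^-1] x = (1 - c) y
rwr = solve(system(lambda i: 1.0, lambda i, j: -c / deg[j]),
            [(1.0 - c) * v for v in y])
# SSL: [I + a (D - A)] x = y
ssl = solve(system(lambda i: 1.0 + a_ssl * deg[i], lambda i, j: -a_ssl), y)
# FaBP: [I + a D - c' A] b_h = phi_h
fabp = solve(system(lambda i: 1.0 + a_fabp * deg[i], lambda i, j: -c_prime),
             phi_h)
```

Only the matrix and the right-hand side change between the three methods; the solver is the same, which is the point of the correspondence.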
RWR ≈ SSL (details). THEOREM: RWR and SSL are identical if the individual homophily strength of node i (SSL) is matched to the fly-out probability (RWR). Simplification: use a global homophily strength for all nodes (SSL).
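The equivalence can be verified directly on a d-regular graph, where every node has the same degree. Matching the RWR system [I - c A D^-1] x = (1 - c) y against the SSL system [I + a (D - A)] x = y term by term gives a = c / ((1 - c) d); this value is derived here from the two matrix forms and is an assumption about what the theorem states, not a quotation of it. The sketch uses a made-up 6-node cycle (d = 2).

```python
n, d = 6, 2
# Cycle graph: node i is adjacent to i-1 and i+1 (mod n), so every degree is 2.
A = [[1.0 if (i - j) % n in (1, n - 1) else 0.0 for j in range(n)]
     for i in range(n)]

c = 0.6
alpha = c / ((1.0 - c) * d)  # derived matching condition (assumption)

# RWR system divided by (1 - c), so that its right-hand side is y, as in SSL.
rwr_matrix = [[((1.0 if i == j else 0.0) - c * A[i][j] / d) / (1.0 - c)
               for j in range(n)] for i in range(n)]
# SSL system: I + alpha * (D - A), with D = d * I on a regular graph.
ssl_matrix = [[(1.0 + alpha * d if i == j else 0.0) - alpha * A[i][j]
               for j in range(n)] for i in range(n)]

max_diff = max(abs(rwr_matrix[i][j] - ssl_matrix[i][j])
               for i in range(n) for j in range(n))
```

Because the two matrices coincide and both systems share the right-hand side y, their solutions coincide as well. On non-regular graphs a single global value can only approximate the per-node condition, which is consistent with "similar scores and identical rankings".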
RWR ≈ SSL: example. Similar scores and identical rankings. [Figure: RWR scores and SSL scores plotted against the line y = x, for individual and for global homophily strength]
Our Contributions. Theory: correspondence BP ≈ RWR ≈ SSL ✓; linearization for BP ✓; convergence criteria for linearized BP ✓. Practice: FaBP algorithm (fast, accurate, and scalable); experiments on DBLP, Web, and Kronecker graphs.
Roadmap: Background; Linearized BP; Correspondence of Methods; Proposed Algorithm; Experiments; Conclusions.
Proposed algorithm: FaBP. ① Pick the homophily factor h_h. ② Solve the linear system [I + a D - c′ A] b_h = φ_h. ③ (optional) If accuracy is low, run BP with the beliefs from step ② as prior beliefs. [Figure: belief p_i on a 0-to-1 scale, centered at 0.5]
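The first two steps can be sketched as follows. This is a minimal illustration, not the authors' Hadoop implementation: it solves the linear system with a simple fixed-point ("power method" style) iteration b ← φ + c′ A b - a D b over adjacency lists, uses the constants a and c′ as defined in the accompanying paper (an assumption, since the slides do not state them), and classifies nodes by the sign of the final about-half belief. The function names and the toy graph are hypothetical; the optional BP refinement step is omitted.

```python
def fabp(adj, priors, h_h=0.002, max_iters=100, tol=1e-12):
    """Approximate final about-half beliefs on an undirected graph.

    adj:    dict mapping node -> list of neighbors
    priors: dict mapping labeled node -> about-half prior (e.g. +0.001 / -0.001)
    """
    denom = 1.0 - 4.0 * h_h ** 2
    a, c_prime = 4.0 * h_h ** 2 / denom, 2.0 * h_h / denom  # assumed constants
    beliefs = {v: priors.get(v, 0.0) for v in adj}
    for _ in range(max_iters):
        updated = {
            v: priors.get(v, 0.0)
               + c_prime * sum(beliefs[u] for u in adj[v])
               - a * len(adj[v]) * beliefs[v]
            for v in adj
        }
        change = max(abs(updated[v] - beliefs[v]) for v in adj)
        beliefs = updated
        if change < tol:
            break
    return beliefs


def classify(beliefs):
    """Positive belief -> one class, negative -> the other, zero -> unknown."""
    return {v: "positive" if b > 0 else "negative" if b < 0 else "unknown"
            for v, b in beliefs.items()}


# Two triangles joined by the edge 2-3; one labeled node in each triangle.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3],
       3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
priors = {0: 0.001, 5: -0.001}
labels = classify(fabp(adj, priors))
```

With a positive h_h (homophily) each unlabeled node ends up with the class of the labeled node in its own triangle.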
Roadmap: Background; Linearized BP; Correspondence of Methods; Proposed Algorithm; Experiments; Conclusions.
Datasets. p% labeled nodes initially. YahooWeb: .edu/others | DBLP: AI/not AI. Accuracy computed on a hold-out set.
Dataset       # nodes          # edges
YahooWeb      1,413,511,390    6,636,600,779 (6 billion!)
Kronecker 1   177,147          1,977,149,596
Kronecker 2   120,552          1,145,744,786
Kronecker 3   59,049           282,416,924
Kronecker 4   19,683           40,333,924
DBLP          37,791           170,794
Specs (Hadoop version). M45 Hadoop cluster (Yahoo!): 500 machines, 4000 cores, 1.5 PB total storage, 3.5 TB of memory. 100 machines used for the experiments.
Roadmap: Background; Linearized BP; Correspondence of Methods; Proposed Algorithm; Experiments (1. Accuracy, 2. Convergence, 3. Sensitivity, 4. Scalability, 5. Parallelism); Conclusions.
Results (1): Accuracy. All points lie on the diagonal: the scores are near-identical. [Figure: scatter plot of beliefs in BP and beliefs in FaBP for AI and non-AI nodes; 0.3% labels; (h, priors) = (0.5±0.002, 0.5±0.001)]
Results (2): Convergence. FaBP achieves maximum accuracy within the convergence bounds. [Figure: % accuracy with respect to h_h (priors = ±0.001, 0.3% labels), with the convergence bounds on h_h marked: Frobenius norm, |e_val| = 1, 1-norm]
Results (3): Sensitivity to the homophily factor. FaBP is robust to the homophily factor h_h within the convergence bounds. [Figure: % accuracy with respect to h_h (priors = ±0.001, 0.3% labels), with the convergence bounds marked: Frobenius norm, |e_val| = 1, 1-norm]
Note (for all plots): averages over 10 runs; the error bars are tiny. [Figure: % accuracy vs h_h; % accuracy vs prior beliefs’ magnitude]
Results (3): Sensitivity to the prior beliefs. FaBP is robust to the prior beliefs φ_h. [Figure: % accuracy with respect to the prior beliefs’ magnitude (h_h = ±0.002), for p = 0.1%, 0.3%, 0.5%, and 5% labeled nodes]
Results (4): Scalability. FaBP's runtime is linear in the number of edges. [Figure: runtime (min) vs # of edges (Kronecker graphs)]
Results (5): Parallelism. FaBP is ~2x faster than BP and wins or ties on accuracy. [Figure: runtime (min) vs # of steps; % accuracy vs runtime (min)]
Roadmap: Background; Linearized BP; Correspondence of Methods; Proposed Algorithm; Experiments; Conclusions.
Our Contributions. Theory: correspondence BP ≈ RWR ≈ SSL ✓; linearization for BP ✓; convergence criteria for linearized BP ✓. Practice: FaBP algorithm, fast (~2x faster), accurate (same/better), and scalable (6 billion edges!) ✓; experiments on DBLP, Web, and Kronecker graphs ✓.
Thanks. Data and funding acknowledgments: NSC; ILLINOIS (Ming Ji, Jiawei Han).
Thank you! [Figure: % accuracy vs runtime (min)]
Q: Can we have multiple classes? A: Yes! [Figure: propagation matrix over the classes AI, ML, DB]
Q: Which of the methods do you recommend? A: (Fast) Belief Propagation. Reasons: solid Bayesian foundation; handles heterophily and multiple classes. [Figure: propagation matrix]
Q: Why is FaBP faster than BP? A: BP exchanges 2|E| messages per iteration, whereas FaBP updates |V| records per “power method” iteration, and |V| < 2|E|.
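The size comparison can be made concrete with a small count. The sketch below is only an illustration on a made-up graph with hypothetical names: it counts one message per direction of every edge for BP and one belief record per node for FaBP.

```python
def per_iteration_state(adj):
    """Count the values each method maintains per iteration.

    adj: dict mapping node -> list of neighbors (undirected graph).
    """
    num_nodes = len(adj)
    num_edges = sum(len(nbrs) for nbrs in adj.values()) // 2
    return {"bp_messages": 2 * num_edges,  # one message per edge direction
            "fabp_records": num_nodes}     # one belief per node


# Two triangles joined by one edge: 6 nodes, 7 edges.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3],
       3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
counts = per_iteration_state(adj)
```

On sparse real-world graphs the gap grows with the average degree, since 2|E| / |V| equals the average degree.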