Unifying Guilt-by-Association Approaches: Theorems and Fast Algorithms. School of Computer Science, Carnegie Mellon University; National Taiwan University of Science & Technology.

Presentation transcript:

Unifying Guilt-by-Association Approaches: Theorems and Fast Algorithms. Danai Koutra, U Kang, Hsing-Kuo Kenneth Pao, Tai-You Ke, Duen Horng (Polo) Chau, Christos Faloutsos. School of Computer Science, Carnegie Mellon University; National Taiwan University of Science & Technology. ECML PKDD, 5-9 September 2011, Athens, Greece.

Problem Definition: GBA techniques (Danai Koutra (CMU) - PKDD). Given: a graph with N nodes and M edges, and a few labeled nodes. Find: the class (red/green) of the remaining nodes. Assuming: network effects (homophily/heterophily).

Homophily and Heterophily. All methods handle homophily. NOT all methods handle heterophily, BUT the proposed method does! (Figure: Step 1, Step 2.)

Why do we study these methods?

Motivation (1): Law Enforcement. [Tong+ ’06][Lin+ ’04][Chen+ ’11]…

Motivation (2): Cyber Security. Given a known bot, which nodes are victims and which are botnet members? [Kephart+ ’95][Kolter+ ’06][Song+ ’08-’11][Chau+ ’11]…

Motivation (3): Fraud Detection. Lax controls? Given a known fraudster, which other nodes are fraudsters? [Neville+ ’05][Chau+ ’07][McGlohon+ ’09]…

Motivation (4): Ranking. [Brin+ ’98][Tong+ ’06][Ji+ ’11]…

Our Contributions. Theory: correspondence BP ≈ RWR ≈ SSL; linearization for BP; convergence criteria for linearized BP. Practice: FaBP algorithm (fast, accurate and scalable); experiments on DBLP, Web, and Kronecker graphs.

Roadmap: Background (Belief Propagation, Random Walk with Restarts, Semi-supervised Learning); Linearized BP; Correspondence of Methods; Proposed Algorithm; Experiments; Conclusions.

Background. Apologies for diversion…

Background 1: Belief Propagation (BP). Iterative message-based method: 1st round, 2nd round, ... until the stop criterion is fulfilled. “Propagation matrix”: indexed by the class of the “sender” and the class of the “receiver”; it encodes homophily or heterophily. Usually the diagonal entries are all the same = homophily factor h. “About-half” homophily factor: h_h = h - 0.5.
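A small sketch of the propagation matrix for the two-class case may help here. It assumes each row sums to 1, so the off-diagonal entry is 1 - h; the function names and the values 0.8 and 0.2 are illustrative only and do not come from the slides.

```python
import numpy as np

def propagation_matrix(h):
    """2x2 propagation matrix with homophily factor h on the diagonal.

    ASSUMPTION: rows sum to 1, so the off-diagonal entries are 1 - h.
    """
    return np.array([[h, 1.0 - h],
                     [1.0 - h, h]])

def about_half(matrix):
    """Center every entry around 0.5, so that h becomes h_h = h - 0.5."""
    return matrix - 0.5

homophily = propagation_matrix(0.8)    # h > 0.5: neighbors tend to share a class
heterophily = propagation_matrix(0.2)  # h < 0.5: neighbors tend to differ

print(about_half(homophily))
print(about_half(heterophily))
```

In the about-half form, the sign of h_h alone distinguishes homophily (positive) from heterophily (negative).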

Background 1: Belief Propagation Equations. [Pearl ’82][Yedidia+ ’02]…[Pandit+ ’07][Gonzalez+ ’09][Chechetka+ ’10]
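The equations themselves did not survive in this transcript. For orientation, the standard sum-product updates are sketched below in commonly used notation; the symbols (φ_i for the prior of node i, ψ_ij for the edge potential given by the propagation matrix, N(i) for the neighbors of i, η for a normalizing constant) are assumptions and may differ from the slide.

```latex
% Message from node i to node j, about j being in class x_j
m_{ij}(x_j) \;\leftarrow\; \sum_{x_i} \phi_i(x_i)\, \psi_{ij}(x_i, x_j)
    \prod_{n \in N(i) \setminus j} m_{ni}(x_i)

% Final belief of node i
b_i(x_i) \;=\; \eta \, \phi_i(x_i) \prod_{j \in N(i)} m_{ji}(x_i)
```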

Background 2: Semi-Supervised Learning. Graph-based SSL: use few labeled data and exploit neighborhood information. [Zhou ’06][Ji, Han ’10]… (Figure: STEP 1, STEP 2; unlabeled nodes are marked “?”.)

Background 3: Personalized Random Walk with Restarts (RWR). [Brin+ ’98][Haveliwala ’03][Tong+ ’06][Minkov, Cohen ’07]…

Background

Qualitative Comparison of GBA Methods

GBA Method   Heterophily   Scalability   Convergence
RWR          ✗             ✓             ✓
SSL          ✗             ✓             ✓
BP           ✓             ✓             ?
FaBP         ✓             ✓             ✓

Roadmap: Background; Linearized BP; Correspondence of Methods; Proposed Algorithm; Experiments; Conclusions. (Legend: previous work, new work.)

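Linearized BP. Theorem [Koutra+]: BP is approximated by the linear system [I + aD - c′A] b_h = φ_h, where b_h are the final beliefs, φ_h are the prior beliefs, and a, c′ are scalar constants. Sketch of proof: odds ratio; Maclaurin expansions. DETAILS! (Figure: example graph with unlabeled nodes “?” and node degrees d1, d2, d3; probability p_i on a 0 to 1 axis centered at 0.5.)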
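The proof sketch names two ingredients, the odds ratio and Maclaurin expansions. As a hedged illustration of how they combine (the symbols v, v_h and v_r are assumed here and are not taken from the slide), write a probability near one half as v = 1/2 + v_h. Its odds ratio then linearizes as follows:

```latex
v_r \;=\; \frac{v}{1 - v}
    \;=\; \frac{\tfrac{1}{2} + v_h}{\tfrac{1}{2} - v_h}
    \;=\; \frac{1 + 2 v_h}{1 - 2 v_h}
    \;=\; 1 + 4 v_h + 8 v_h^2 + O(v_h^3)
    \;\approx\; 1 + 4 v_h
    \qquad \text{for } |v_h| \ll \tfrac{1}{2}
```

Dropping the higher-order terms is what turns the multiplicative BP updates into expressions that are linear in the about-half quantities.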

Linearized BP vs BP. BP is approximated by Linearized BP. Original Belief Propagation [Yedidia+]: non-linear. Our proposal, Linearized BP: linear.

Our Contributions. Theory: correspondence BP ≈ RWR ≈ SSL; linearization for BP ✓; convergence criteria for linearized BP. Practice: FaBP algorithm (fast, accurate and scalable); experiments on DBLP, Web, and Kronecker graphs.

Linearized BP: convergence. Theorem: Linearized BP converges if a bound on the homophily factor holds; the bound depends on the degree of node n and is derived from requiring that the 1-norm < 1 OR the Frobenius norm < 1. Sketch of proof. DETAILS!
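The closed-form bounds are not legible in this transcript, but the norm conditions can be checked numerically. The sketch below assumes the linear system is rewritten as b_h = φ_h + W b_h with W = c′A - aD, so that either norm of W being below 1 is sufficient for the iteration to converge; the function name and the toy values of a and c′ are illustrative.

```python
import numpy as np

def linear_bp_norm_check(A, a, c_prime):
    """Return (one_norm, fro_norm, converges) for W = c'A - aD.

    ASSUMPTION: the system [I + aD - c'A] b_h = phi_h is read as
    b_h = phi_h + W b_h, so a norm of W below 1 is a sufficient condition.
    """
    A = np.asarray(A, dtype=float)
    D = np.diag(A.sum(axis=1))            # diagonal degree matrix
    W = c_prime * A - a * D
    one_norm = np.linalg.norm(W, 1)       # maximum absolute column sum
    fro_norm = np.linalg.norm(W, "fro")   # square root of the sum of squares
    converges = bool(one_norm < 1 or fro_norm < 1)
    return one_norm, fro_norm, converges

# Toy 4-node path graph 0 - 1 - 2 - 3 (illustrative only)
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)

one_norm, fro_norm, ok = linear_bp_norm_check(A, a=0.01, c_prime=0.1)
print(one_norm, fro_norm, ok)
```

Both norms upper-bound the spectral radius of W, which is why either condition suffices; the exact limit is the point where the largest eigenvalue magnitude reaches 1.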

Our Contributions. Theory: correspondence BP ≈ RWR ≈ SSL; linearization for BP ✓; convergence criteria for linearized BP ✓. Practice: FaBP algorithm (fast, accurate and scalable); experiments on DBLP, Web, and Kronecker graphs.

Roadmap: Background; Linearized BP; Correspondence of Methods; Proposed Algorithm; Experiments; Conclusions.

Correspondence of Methods

Method   Matrix                 Unknown   Known
RWR      [I - c A D^-1]       × x       = (1-c) y
SSL      [I + a (D - A)]      × x       = y
FaBP     [I + a D - c′ A]     × b_h     = φ_h

Legend: x, b_h are the final labels/beliefs; y, φ_h are the prior labels/beliefs; A is the adjacency matrix; D is the diagonal matrix of node degrees (d1, d2, d3, …).
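To make the shared structure concrete, here is a minimal sketch that builds the three matrices from the table for a toy graph and solves each system with a dense solver. The helper name, the toy graph and all parameter values are assumptions chosen for illustration; `a_ssl` stands for the SSL row's `a`, which is a different constant from the FaBP row's `a`.

```python
import numpy as np

def solve_gba_systems(A, y, c, a_ssl, a, c_prime):
    """Solve the RWR, SSL and FaBP linear systems for the same graph.

    All parameter values are caller-supplied; nothing here is tuned.
    Assumes every node has at least one edge (D must be invertible for RWR).
    """
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    I = np.eye(n)
    deg = A.sum(axis=1)
    D = np.diag(deg)
    D_inv = np.diag(1.0 / deg)

    M_rwr = I - c * A @ D_inv          # RWR:  [I - c A D^-1] x = (1-c) y
    M_ssl = I + a_ssl * (D - A)        # SSL:  [I + a (D - A)] x = y
    M_fabp = I + a * D - c_prime * A   # FaBP: [I + a D - c' A] b_h = phi_h

    return {
        "RWR": (M_rwr, np.linalg.solve(M_rwr, (1.0 - c) * y)),
        "SSL": (M_ssl, np.linalg.solve(M_ssl, y)),
        "FaBP": (M_fabp, np.linalg.solve(M_fabp, y)),
    }

# Toy 4-node path graph with one labeled node at each end
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
y = np.array([1.0, 0.0, 0.0, -1.0])

results = solve_gba_systems(A, y, c=0.5, a_ssl=0.5, a=0.01, c_prime=0.1)
for name, (_, solution) in results.items():
    print(name, solution)
```

All three methods reduce to one linear solve of the form (identity plus a graph-dependent matrix) times unknown equals priors, which is the sense in which the table groups them.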

RWR ≈ SSL. THEOREM: RWR and SSL are identical if the individual homophily strength of node i (SSL) is tied to the fly-out probability (RWR) by the condition shown on the slide. Simplification: use a single global homophily strength for all nodes (SSL). DETAILS!

RWR ≈ SSL: example. Similar scores and identical rankings. (Figure: scatter plot of SSL scores vs. RWR scores against the line y = x, for the individual and the global homophily strength.)
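The exact condition is not legible in this transcript. The sketch below uses one per-node choice that makes the two systems coincide algebraically, alpha_i = c / ((1 - c) d_i) applied as [I + (D - A) diag(alpha)] x = y; both that form and the global variant based on the average degree are assumptions for illustration, not a quotation of the theorem.

```python
import numpy as np

# Toy graph: a triangle (0, 1, 2) with a pendant node 3 attached to node 2
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
deg = A.sum(axis=1)
D = np.diag(deg)
I = np.eye(len(deg))
y = np.array([1.0, 0.0, 0.0, 0.0])   # a single labeled / restart node
c = 0.85                              # illustrative RWR parameter

# RWR: [I - c A D^-1] x = (1 - c) y
x_rwr = np.linalg.solve(I - c * A @ np.diag(1.0 / deg), (1.0 - c) * y)

# SSL with an individual strength per node (ASSUMED form, see lead-in)
alpha = c / ((1.0 - c) * deg)
x_ssl = np.linalg.solve(I + (D - A) @ np.diag(alpha), y)

# SSL with one global strength (ASSUMPTION: based on the average degree)
alpha_global = c / ((1.0 - c) * deg.mean())
x_ssl_global = np.linalg.solve(I + alpha_global * (D - A), y)

print(np.allclose(x_rwr, x_ssl))
print(np.argsort(-x_rwr), np.argsort(-x_ssl_global))
```

With the per-node choice, I + (D - A) diag(alpha) equals (I - c A D^-1) / (1 - c), so the two solutions agree exactly; the global variant only approximates this, which matches the slide's "similar scores" wording.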

Our Contributions. Theory: correspondence BP ≈ RWR ≈ SSL ✓; linearization for BP ✓; convergence criteria for linearized BP ✓. Practice: FaBP algorithm (fast, accurate and scalable); experiments on DBLP, Web, and Kronecker graphs.

Roadmap: Background; Linearized BP; Correspondence of Methods; Proposed Algorithm; Experiments; Conclusions.

Proposed algorithm: FaBP. ① Pick the homophily factor. ② Solve the linear system. ③ (Optional) If accuracy is low, run BP with the prior beliefs.
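The first two steps can be sketched as below. This is a minimal dense-matrix illustration, not the authors' implementation: the formulas for the constants a and c′ are an assumption (recalled from the accompanying paper, not stated on the slides), and the function name, toy graph and prior values are hypothetical.

```python
import numpy as np

def fabp(A, phi_h, h_h):
    """Sketch of FaBP: solve [I + aD - c'A] b_h = phi_h for the beliefs b_h.

    A     : symmetric adjacency matrix (dense here, for illustration only)
    phi_h : about-half prior beliefs (0 for unlabeled nodes)
    h_h   : about-half homophily factor (step 1 picks this value)
    """
    A = np.asarray(A, dtype=float)
    D = np.diag(A.sum(axis=1))
    # ASSUMPTION: these definitions of a and c' are not on the slides.
    a = 4.0 * h_h**2 / (1.0 - 4.0 * h_h**2)
    c_prime = 2.0 * h_h / (1.0 - 4.0 * h_h**2)
    M = np.eye(A.shape[0]) + a * D - c_prime * A
    return np.linalg.solve(M, phi_h)      # step 2: solve the linear system

# Two triangles (0,1,2) and (3,4,5) joined by the edge 2-3
A = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    A[i, j] = A[j, i] = 1.0

phi_h = np.array([0.001, 0.0, 0.0, 0.0, 0.0, -0.001])  # one label per class
b_h = fabp(A, phi_h, h_h=0.002)
labels = np.where(b_h > 0, "class 1", "class 2")
print(b_h)
print(labels)
```

The sign of each final belief gives the predicted class. A dense solve is used only to keep the sketch short; at the scale reported in the experiments a sparse or iterative solver would be needed.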

Roadmap: Background; Linearized BP; Correspondence of Methods; Proposed Algorithm; Experiments; Conclusions.

Datasets. p% labeled nodes initially. Classes: YahooWeb: .edu/others | DBLP: AI/not AI. Accuracy computed on hold-out set.

Dataset       # nodes         # edges
YahooWeb      1,413,511,390   6,636,600,779   (6 billion!)
Kronecker 1   177,147         1,977,149,596
Kronecker 2   120,552         1,145,744,786
Kronecker 3   59,049          282,416,924
Kronecker 4   19,683          40,333,924
DBLP          37,791          170,794

Specs. Hadoop version; M45 Hadoop cluster (Yahoo!): 500 machines, 4000 cores, 1.5 PB total storage, 3.5 TB of memory. 100 machines used for the experiments.

Roadmap: Background; Linearized BP; Correspondence of Methods; Proposed Algorithm; Experiments (1. Accuracy, 2. Convergence, 3. Sensitivity, 4. Scalability, 5. Parallelism); Conclusions.

Results (1): Accuracy. All points on the diagonal → scores near-identical. (Figure: scatter plot of beliefs in FaBP vs. beliefs in BP for (h, priors) = (0.5±0.002, 0.5±0.001), with 0.3% labels; classes AI and non-AI.)

Results (2): Convergence. FaBP achieves maximum accuracy within the convergence bounds. (Figure: % accuracy vs. h_h, with priors = ±0.001 and 0.3% labels; the convergence bounds from the Frobenius norm, |e_val| = 1, and the 1-norm are marked.)

Results (3): Sensitivity to the homophily factor. FaBP is robust to the homophily factor h_h within the convergence bounds. (Figure: % accuracy vs. h_h, with priors = ±0.001 and 0.3% labels; the convergence bounds from the Frobenius norm, |e_val| = 1, and the 1-norm are marked.)

Note (for all plots): average over 10 runs; error bars → tiny. (Plots: % accuracy vs. h_h; % accuracy vs. prior beliefs’ magnitude.)

Results (3): Sensitivity to the prior beliefs. FaBP is robust to the prior beliefs φ_h. (Figure: % accuracy vs. prior beliefs’ magnitude, with h_h = ±0.002, for p = 0.1%, 0.3%, 0.5% and 5% labeled nodes.)

Results (4): Scalability. FaBP runtime is linear in the number of edges. (Figure: runtime (min) vs. # of edges, Kronecker graphs.)

Results (5): Parallelism. FaBP is ~2x faster and wins/ties on accuracy. (Figures: runtime (min) vs. # of steps; % accuracy vs. runtime (min).)

Roadmap: Background; Linearized BP; Correspondence of Methods; Proposed Algorithm; Experiments; Conclusions.

Our Contributions. Theory: correspondence BP ≈ RWR ≈ SSL ✓; linearization for BP ✓; convergence criteria for linearized BP ✓. Practice: FaBP algorithm ✓ (fast: ~2x faster; accurate: same/better; scalable: 6 billion edges!); experiments on DBLP, Web, and Kronecker graphs ✓.

Thanks. Data: Ming Ji, Jiawei Han (ILLINOIS). Funding: NSC.

Thank you!

Q: Can we have multiple classes? A: Yes! (Figure: propagation matrix over the classes AI, ML, DB.)

Q: Which of the methods do you recommend? A: (Fast) Belief Propagation. Reasons: solid Bayesian foundation; handles heterophily and multiple classes (via the propagation matrix).

Q: Why is FaBP faster than BP? A: BP sends 2|E| messages per iteration; FaBP updates |V| records per “power method” iteration, and |V| < 2|E|.
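A hedged sketch of such an iteration follows. It assumes the linear system [I + aD - c′A] b_h = φ_h is solved by repeatedly applying b_h ← φ_h + c′ A b_h - a D b_h, which keeps one record per node; the function name, toy graph and constants are illustrative and do not come from the slides.

```python
import numpy as np

def fabp_power_method(A, phi_h, a, c_prime, num_iters=50):
    """Iterate b <- phi_h + (c'A - aD) b, storing one value per node.

    ASSUMPTION: this fixed-point form is one way to read the slide's
    "power method"; it converges when a norm of c'A - aD is below 1.
    """
    A = np.asarray(A, dtype=float)
    deg = A.sum(axis=1)                  # diagonal of D, one entry per node
    b = phi_h.copy()
    for _ in range(num_iters):
        # One pass over the edges (A @ b) updates the |V| node records.
        b = phi_h + c_prime * (A @ b) - a * deg * b
    return b

# Toy 4-node path graph with one labeled node at each end
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
phi_h = np.array([0.001, 0.0, 0.0, -0.001])
a, c_prime = 0.01, 0.1

b_iter = fabp_power_method(A, phi_h, a, c_prime)
b_direct = np.linalg.solve(
    np.eye(4) + a * np.diag(A.sum(axis=1)) - c_prime * A, phi_h
)
print(np.allclose(b_iter, b_direct))
```

The state carried between iterations is a single vector of length |V|, whereas BP carries a message for each direction of every edge, which is the comparison the answer draws.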