Neighborhood Formation and Anomaly Detection in Bipartite Graphs
Authors: Jimeng Sun, Huiming Qu, Deepayan Chakrabarti, Christos Faloutsos
Speaker: Jimeng Sun

2 Bipartite Graphs
G = {V1 + V2, E} such that every edge runs between V1 and V2.
Many applications can be modeled using bipartite graphs.
The key is to use these links across two natural groups for data mining.

3 Problem Definition
Neighborhood formation (NF): given a query node a in V1, what are the relevance scores of all the nodes in V1 to a?
Anomaly detection (AD): given a query node a in V1, what are the normality scores of the nodes in V2 that link to a?
[Figure: bipartite graph with node sets V1 and V2, a query node a, and example scores on the nodes]

4 Application I: Publication network
Authors vs. papers in research communities.
Interesting queries:
Which authors are most related to Dr. Carman?
Which is the most unusual paper written by Dr. Carman?

5 Application II: P2P network
Users vs. files in P2P systems.
Interesting queries:
Find the users with preferences similar to mine.
Locate files that are downloaded by users with very different preferences.
[Figure: bipartite graph of users and files]

6 Application III: Financial Trading
Traders vs. stocks in stock markets.
Interesting queries:
Which stocks are most similar to company A?
Find the most unusual traders (i.e., those who trade across sectors).

7 Application IV: Collaborative filtering
Collaborative filtering recommendation system.
[Figure: bipartite graph of customers and products]

8 Outline
Problem Definition
Motivation
Neighborhood formation
Anomaly detection
Experiments
Related work
Conclusion and future work

10 Neighborhood formation: intuition
Input: a graph G and a query node q.
Output: relevance scores to q.
Run a random walk with restart from q in V1.
Record the probability of visiting each node in V1.
The nodes with higher probability are the neighbors.
[Figure: bipartite graph with node sets V1 and V2, a query node q, and example visiting probabilities]

11 Exact neighborhood formation
Input: a graph G and a query node q.
Output: relevance scores to q.
Construct the transition matrix P, where every node in the graph becomes a state and every state has a restart probability c of jumping back to the query node q.
Find the steady-state probability u, which gives the relevance scores of all the nodes to q.
[Figure: states with restart edges back to q labeled c and a transition edge labeled (1-c)]
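
The steady-state computation on this slide can be sketched with a plain power iteration. This is a minimal illustration and not the authors' implementation: the function name, the toy graph, the default restart probability c = 0.15, and the convergence tolerance are all assumptions made for the example.

```python
def random_walk_with_restart(adj, q, c=0.15, tol=1e-12, max_iter=1000):
    """Return steady-state visiting probabilities for a walk that restarts at q.

    adj maps each node to a list of its neighbors (every node needs >= 1).
    c is the restart probability; 0.15 is an assumed default, not from the talk.
    """
    u = {node: 0.0 for node in adj}
    u[q] = 1.0
    for _ in range(max_iter):
        nxt = {node: 0.0 for node in adj}
        for node, mass in u.items():
            # With probability (1 - c) the walker moves to a uniform neighbor.
            share = (1.0 - c) * mass / len(adj[node])
            for nb in adj[node]:
                nxt[nb] += share
        # With probability c the walker jumps back to q; the mass sums to 1.
        nxt[q] += c
        delta = sum(abs(nxt[n] - u[n]) for n in adj)
        u = nxt
        if delta < tol:
            break
    return u


# Toy bipartite graph (hypothetical): a* nodes are in V1, t* nodes are in V2.
edges = [("a1", "t1"), ("a1", "t2"), ("a2", "t1"),
         ("a2", "t2"), ("a2", "t3"), ("a3", "t3")]
adj = {}
for a, t in edges:
    adj.setdefault(a, []).append(t)
    adj.setdefault(t, []).append(a)

scores = random_walk_with_restart(adj, "a1")
v1_scores = {n: s for n, s in scores.items() if n.startswith("a")}
ranking = sorted(v1_scores, key=v1_scores.get, reverse=True)
print(ranking)
```

In this toy graph a2 shares two V2 nodes with the query a1, while a3 is reachable only through a2, so a2 ranks above a3.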

12 Approximate neighborhood formation
Scalability problem with exact neighborhood formation: it is too expensive to run for every single node in V1.
Observation: nodes that are far away from q have relevance scores of almost 0.
Idea: partition the graph and apply neighborhood formation only to the partition containing q.
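
A rough sketch of the partition-then-walk idea follows. The slide does not say which partitioner is used, so the example assumes a partition assignment is already available (here the hypothetical `part_of` dictionary is filled in by hand) and simply drops edges that cross partitions. The function names and the restart probability are also assumptions.

```python
def rwr(adj, q, c=0.15, iters=200):
    """Power iteration for a random walk with restart on adjacency dict adj."""
    u = {n: 0.0 for n in adj}
    u[q] = 1.0
    for _ in range(iters):
        nxt = {n: 0.0 for n in adj}
        for n, mass in u.items():
            if adj[n]:
                for nb in adj[n]:
                    nxt[nb] += (1.0 - c) * mass / len(adj[n])
                nxt[q] += c * mass
            else:
                # Node lost all its edges in the cut: send its mass back to q.
                nxt[q] += mass
        u = nxt
    return u


def approximate_nf(adj, part_of, q, c=0.15):
    """Run the walk only inside the partition that contains q."""
    block = {n for n in adj if part_of[n] == part_of[q]}
    # Induced subgraph: keep only edges with both endpoints in q's partition.
    sub = {n: [nb for nb in adj[n] if nb in block] for n in block}
    local = rwr(sub, q, c)
    # Nodes outside q's partition are assigned relevance 0.
    return {n: local.get(n, 0.0) for n in adj}


# Toy bipartite graph (hypothetical): a* nodes are in V1, t* nodes are in V2.
edges = [("a1", "t1"), ("a1", "t2"), ("a2", "t1"),
         ("a2", "t2"), ("a2", "t3"), ("a3", "t3")]
adj = {}
for a, t in edges:
    adj.setdefault(a, []).append(t)
    adj.setdefault(t, []).append(a)

# Hand-made two-way partition standing in for a real partitioner's output.
part_of = {"a1": 0, "a2": 0, "t1": 0, "t2": 0, "a3": 1, "t3": 1}

approx = approximate_nf(adj, part_of, "a1")
print(sorted(approx.items()))
```

The walk now touches only four of the six nodes, which is where the cost saving comes from; the price is that a3 and t3 are scored as exactly 0 instead of a small positive value.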

14 Anomaly detection: intuition
A node t in V2 is normal if all nodes a in V1 that link to t belong to the same neighborhood.
[Figure: two examples of a node t, one with low normality and one with high normality]

15 Anomaly detection: method
Input: a query node q from V2.
Output: the normality score of q.
Find the set of nodes connected to q, say S.
Compute the relevance scores of the elements in S, denoted rs.
Apply a score function f(rs) to obtain the normality score, e.g. f(rs) = mean(rs).
[Figure: query node q and its set of connected nodes S]
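
The scoring step can be sketched as below. The example reads "relevance scores of elements in S" as the pairwise relevance between the nodes in S, which is an interpretation rather than something the slide states outright. The `relevance` callable and the toy score table are made up for illustration; in practice the scores would come from neighborhood formation.

```python
from itertools import permutations
from statistics import mean


def normality_score(neighbors, relevance, f=mean):
    """Apply f to the pairwise relevance scores among the nodes linked to q.

    neighbors: the set S of nodes connected to the query node q.
    relevance: callable (a, b) -> relevance score of b with respect to a.
    """
    rs = [relevance(a, b) for a, b in permutations(sorted(neighbors), 2)]
    if not rs:
        raise ValueError("need at least two neighbors to compare")
    return f(rs)


# Hypothetical relevance scores between V1 nodes (invented for this example).
toy = {("a1", "a2"): 0.30, ("a2", "a1"): 0.25,
       ("a1", "a9"): 0.002, ("a9", "a1"): 0.001}


def relevance(a, b):
    return toy.get((a, b), 0.0)


same_neighborhood = normality_score({"a1", "a2"}, relevance)
cross_neighborhood = normality_score({"a1", "a9"}, relevance)
print(same_neighborhood, cross_neighborhood)
```

A node whose neighbors are highly relevant to one another gets a high normality score, while a node linking two unrelated neighbors gets a score near 0. Passing a different `f` (for example `min`) changes how strictly a single unrelated neighbor is penalized.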

17 Datasets
Dataset                  |V1|   |V2|   |E|    Avgdeg(V1)  Avgdeg(V2)
Conference-Author (CA)   2687   288K   662K   510         5
Author-Paper (AP)        316K   472K   1M     3           2
IMDB                     553K   204K   2.2M   4           11

18 Goals
[Q1]: Do the neighborhoods make sense? (NF)
[Q2]: How accurate is the approximate NF?
[Q3]: Do the anomalies make sense? (AD)
[Q4]: What about the computational cost?

19 [Q1] Exact NF
The nodes (x-axis) with the highest relevance scores (y-axis) are indeed very relevant to the query node.
The relevance scores quantify how closely related a node is to the query node.
[Figures: relevance score of the most relevant neighbors for the query nodes ICDM (CA) and Robert DeNiro (IMDB)]

20 [Q2] Approximate NF
Precision = fraction of overlap between ApprNF and NF among the top k neighbors.
The precision drops slowly as the number of partitions increases.
The precision remains high for a wide range of neighborhood sizes.
[Figures: precision vs. number of partitions (neighborhood size = 20); precision vs. neighborhood size (number of partitions = 10)]
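
The precision measure defined on this slide can be computed as in the following sketch. The function names, the tie-breaking rule, and the toy score dictionaries are assumptions for illustration; the numbers are not results from the talk.

```python
def top_k(scores, k, exclude=()):
    """Return the k highest-scoring nodes, breaking ties by node name."""
    candidates = [n for n in scores if n not in exclude]
    return set(sorted(candidates, key=lambda n: (-scores[n], n))[:k])


def precision_at_k(exact, approx, k, query=None):
    """Fraction of the exact top-k neighbors also found in the approximate top-k."""
    skip = () if query is None else (query,)
    return len(top_k(exact, k, skip) & top_k(approx, k, skip)) / k


# Invented relevance scores for four candidate neighbors of some query node.
exact = {"a2": 0.5, "a3": 0.3, "a4": 0.1, "a5": 0.05}
approx = {"a2": 0.6, "a4": 0.3, "a3": 0.05, "a5": 0.0}

p2 = precision_at_k(exact, approx, 2)
p3 = precision_at_k(exact, approx, 3)
print(p2, p3)
```

With k = 2 the two methods agree only on a2, so the precision is 0.5; with k = 3 both select the same three nodes and the precision is 1.0.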

21 [Q3] Anomaly detection
Randomly inject some nodes and edges (biased towards high-degree nodes).
On average, the genuine nodes have higher normality scores than the injected ones.
[Figure: normality scores of genuine and injected nodes]

22 [Q4] Computational cost
Even with a small number of partitions, the computational cost can be reduced dramatically.
[Figure: approximate NF time (sec) vs. number of partitions]

23 Related Work
Random walk: [Brin & Page 98], [Haveliwala WWW02]
Graph partitioning: [Karypis and Kumar 98], [Kannan et al. FOCS00]
Collaborative filtering: [Shardanand & Maes 95], …
Anomaly detection: [Aggarwal & Yu SIGMOD01], [Noble & Cook KDD03], [Newman 03]

24 Conclusion
Two important queries on bipartite graphs: NF and AD.
An efficient method for NF using random walk with restart and graph partitioning techniques.
Based on the result of NF, we can also spot anomalies (AD).
Effectiveness is confirmed on real datasets.

25 Future work and Q & A
Future work: what about time-evolving graphs?
Contact: Jimeng Sun, jimeng@cs.cmu.edu, http://www.cs.cmu.edu/~jimeng
