Computational Models for Social Network Analysis: mining big social networks (Part IV: Representation) Jie Tang, Computer Science, Tsinghua University @WWW’2017
Fundamental question: How to represent users, relationships, and groups in a big network? Fast representation; embedding representation.
Example: Who are Similar to Barabási? [1] Jing Zhang, Jie Tang, Cong Ma, Hanghang Tong, Yu Jing, and Juanzi Li. Panther: Fast Top-k Similarity Search on Large Networks. KDD'15. pp. 1445-1454.
Similar Authors in AMiner.org
Let us recall Similarity Search…
Common Neighbors: Two users in a social network are similar if they share many of the same friends (Jaccard, 1901). Limitation: it looks only at directly shared friends and does not really consider the network topology.
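As a rough illustration of the common-neighbors idea, the sketch below computes the Jaccard coefficient of two users' friend sets. The graph `toy_graph` and the function name `jaccard_similarity` are hypothetical and not from the slides.

```python
def jaccard_similarity(graph, u, v):
    """Jaccard coefficient of the neighbor sets of u and v."""
    neighbors_u, neighbors_v = graph[u], graph[v]
    union = neighbors_u | neighbors_v
    if not union:
        return 0.0  # two isolated vertices share nothing
    return len(neighbors_u & neighbors_v) / len(union)


# Hypothetical toy friendship graph stored as adjacency sets.
toy_graph = {
    "a": {"c", "d", "e"},
    "b": {"c", "d", "f"},
    "c": {"a", "b"},
    "d": {"a", "b"},
    "e": {"a"},
    "f": {"b"},
}

score_ab = jaccard_similarity(toy_graph, "a", "b")  # share c and d
score_ef = jaccard_similarity(toy_graph, "e", "f")  # no shared friend
```

In this toy graph, `e` and `f` play the same structural role (each hangs off one hub) yet have no friend in common, so the measure cannot relate them. That is one way to read the topology limitation stated on the slide.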
SimRank SimRank is a general similarity measure, based on a simple and intuitive graph-theoretic model (Jeh and Widom, KDD’02).
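A minimal sketch of the naive SimRank iteration, assuming a small directed graph given as in-neighbor lists. It follows the published recurrence (two vertices are similar if their in-neighbors are similar); the decay value 0.8, the iteration count, and the toy graph are illustrative assumptions.

```python
from itertools import product


def simrank(in_neighbors, decay=0.8, iterations=10):
    """Naive SimRank: s(a,b) = decay * mean of s(x,y) over in-neighbor pairs."""
    nodes = list(in_neighbors)
    sim = {(a, b): 1.0 if a == b else 0.0 for a, b in product(nodes, nodes)}
    for _ in range(iterations):
        new_sim = {}
        for a, b in product(nodes, nodes):
            if a == b:
                new_sim[(a, b)] = 1.0
                continue
            in_a, in_b = in_neighbors[a], in_neighbors[b]
            if not in_a or not in_b:
                new_sim[(a, b)] = 0.0  # no in-neighbors, no evidence
                continue
            total = sum(sim[(x, y)] for x in in_a for y in in_b)
            new_sim[(a, b)] = decay * total / (len(in_a) * len(in_b))
        sim = new_sim
    return sim


# Hypothetical toy graph: vertex -> list of vertices pointing to it.
toy_in_neighbors = {
    "univ": [],
    "prof_a": ["univ"],
    "prof_b": ["univ"],
    "student_a": ["prof_a"],
    "student_b": ["prof_b"],
}
scores = simrank(toy_in_neighbors)
```

Each iteration visits every vertex pair and every pair of their in-neighbors, which is why this naive form is too slow for large networks.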
Path Similarity
Intuition: two vertices are similar if they frequently appear on the same paths.
A path is a T-length sequence of vertices p = (v_1, ..., v_{T+1}). Π is the set of all T-paths in G.
Each path carries a path weight, and path similarity S_ps is computed from the weights of the paths that contain both vertices.
Example (figure: five vertices v1 to v5, T=2): S_ps(v1,v2)=0.37, S_ps(v1,v3)=0.42, S_ps(v1,v4)=0.39, S_ps(v1,v5)=0.09.
Low efficiency! Cannot support real-time search!
Related Work and Challenges
Goal: find the top-K similar vertices for any vertex in a network.
Two kinds of similarity: (1) vertices that share many direct/indirect common neighbors; (2) vertices that are disconnected but share similar structure.

Method             Time Complexity       Space Complexity
SimRank [KDD’02]   O(IN^2d^2)            O(N^2)
TopSim [ICDE’12]   O(NTd^T)              O(N+M)
RWR [KDD’04]       O(IN^2d)              -
RoleSim [KDD’11]   -                     -
ReFex [KDD’11]     O(N+I(fM+Nf^2))       O(N+Mf)

d: average degree, f: feature number, T: path length
Challenges
C1: How to design a similarity method that applies to both kinds of similarity?
C2: Computational efficiency.
Panther: Fast Top-k Similarity Search on Big Networks [1]
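1. Path Similarity
Intuition: two vertices are similar if they frequently appear on the same paths.
A path is a T-length sequence of vertices p = (v_1, ..., v_{T+1}). Π is the set of all T-paths in G.
Each path carries a path weight, and path similarity S_ps is computed from the weights of the paths that contain both vertices.
Example (figure: five vertices v1 to v5, T=2): S_ps(v1,v2)=0.37, S_ps(v1,v3)=0.42, S_ps(v1,v4)=0.39, S_ps(v1,v5)=0.09.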
Panther_ps
Basic idea: random path sampling.
Simplified path similarity: O(RT), O(dT)
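The random path sampling idea can be sketched as follows. This is not the authors' implementation: it assumes an unweighted graph, a uniformly chosen start vertex, uniform neighbor steps, and it estimates the similarity of a pair as the fraction of sampled paths that contain both vertices. All names and the toy graph are hypothetical.

```python
import random
from collections import defaultdict
from itertools import combinations


def sample_paths(graph, num_paths, path_length, rng):
    """Sample num_paths random walks with path_length steps each."""
    vertices = sorted(graph)
    paths = []
    for _ in range(num_paths):
        current = rng.choice(vertices)
        path = [current]
        for _ in range(path_length):
            neighbors = graph[current]
            if not neighbors:
                break  # dead end: keep the shorter path
            current = rng.choice(neighbors)
            path.append(current)
        paths.append(path)
    return paths


def estimate_path_similarity(paths):
    """Fraction of sampled paths in which each vertex pair co-occurs."""
    counts = defaultdict(int)
    for path in paths:
        for u, v in combinations(sorted(set(path)), 2):
            counts[(u, v)] += 1
    total = len(paths)
    return {pair: count / total for pair, count in counts.items()}


# Hypothetical toy graph: a triangle plus a separate edge (adjacency lists).
toy_graph = {
    "a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b"],
    "x": ["y"], "y": ["x"],
}
toy_paths = sample_paths(toy_graph, num_paths=2000, path_length=3,
                         rng=random.Random(7))
toy_sims = estimate_path_similarity(toy_paths)
```

Sampling costs on the order of R·T steps regardless of how many T-paths the graph contains. Vertices in different components never share a sampled path, so a pair such as `("a", "x")` gets no score at all.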
Theoretical Analysis
How many random paths shall we sample?
1. Domain and range set
2. Upper bound of range set’s VC dimension
3. Distribution
Result: required sample size
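Theoretical Analysis: Details
The domain is Π. The analysis defines a range set over Π, bounds its VC dimension, and defines a distribution over Π; path similarity is expressed in terms of this distribution.
Conclusion: sampling R random paths guarantees an error of at most ε with probability at least 1−δ.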
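To make the ε and 1−δ guarantee concrete, the sketch below evaluates the standard ε-approximation sample bound from VC theory, R = (c/ε²)(d + ln(1/δ)). The VC-dimension bound `vc_dim` is a caller-supplied input here because the slide's own bound is not reproduced in this text, and the default constant `c=0.5` is an assumption taken from the experimental settings.

```python
import math


def required_sample_size(epsilon, delta, vc_dim, c=0.5):
    """Sample size from the generic bound R = c / eps^2 * (d + ln(1/delta))."""
    if not 0 < epsilon < 1:
        raise ValueError("epsilon must be in (0, 1)")
    if not 0 < delta < 1:
        raise ValueError("delta must be in (0, 1)")
    return math.ceil(c / epsilon ** 2 * (vc_dim + math.log(1 / delta)))


# Hypothetical inputs: error 0.1, failure probability 0.1, VC bound 3.
example_size = required_sample_size(epsilon=0.1, delta=0.1, vc_dim=3)
```

The bound grows with 1/ε² but only logarithmically with 1/δ, and it does not depend directly on the number of vertices.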
Proof of Details
Assume that a set Q of size l can be shattered by R_G.
Then there is a one-to-one correspondence between each subset of Q and each range P_i in R_G.
However, a path belongs only to the ranges w.r.t. a pair of vertices in the path.
Contradiction.
2. Vector Similarity and Panther_vs
Limitation of path similarity: bias toward close neighbors.
Vector similarity: if two vertices have similar topological structures, their probability distributions of linking to all other vertices are similar.
Panther_vs: use the top-D path similarities calculated by Panther_ps to construct a vector for each vertex.
Example (figure: vertices u, v, w with their top path similarities, T=2): S_vs(u,w)=0.27 > S_vs(u,v)=0.16.
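A minimal sketch of the vector construction, assuming path similarities are already available as a dictionary keyed by vertex pairs. Comparing vectors by Euclidean distance is an assumption made for illustration; the slide does not state which vector measure is used, and the scores below are made up.

```python
import math


def top_d_vector(path_sims, vertex, d):
    """Top-d path similarities involving vertex, sorted descending, zero-padded."""
    scores = sorted(
        (score for (a, b), score in path_sims.items() if vertex in (a, b)),
        reverse=True,
    )[:d]
    return scores + [0.0] * (d - len(scores))


def vector_distance(vec_a, vec_b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(vec_a, vec_b)))


# Hypothetical path similarities: u and w never share a path,
# but their neighborhoods look alike.
toy_path_sims = {
    ("u", "u1"): 0.4, ("u", "u2"): 0.1,
    ("w", "w1"): 0.38, ("w", "w2"): 0.12,
    ("v", "v1"): 0.15, ("v", "v2"): 0.05,
}
vec_u = top_d_vector(toy_path_sims, "u", 2)
vec_w = top_d_vector(toy_path_sims, "w", 2)
vec_v = top_d_vector(toy_path_sims, "v", 2)
```

Here `u` and `w` have zero path similarity because they are disconnected, yet their vectors are close, which is the kind of structural similarity the vector view is meant to capture. A brute-force comparison like this can be replaced by a spatial index such as a kd-tree when many vertices are queried.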
Time Complexity

Method       Time Complexity       Space Complexity
SimRank      O(IN^2d^2)            O(N^2)
TopSim       O(NTd^T)              O(N+M)
RWR          O(IN^2d)              -
RoleSim      -                     -
ReFex        O(N+I(fM+Nf^2))       O(N+Mf)
Panther_ps   O(RTc+NdT)            O(RT+Nd)
Panther_vs   O(RTc+NdT+Nc)         O(RT+Nd+ND)

Components: random path sampling (random path); top-k similarity search for any vertex (vertex-to-path index); build and query kd-tree (kd-tree).
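The vertex-to-path index can be sketched as an inverted index from each vertex to the IDs of the sampled paths that contain it. A top-k query then only touches the paths of the query vertex. The sampled paths below are hard-coded hypothetical data, and scoring by co-occurrence fraction is an assumption.

```python
import heapq
from collections import Counter, defaultdict


def build_vertex_to_path_index(paths):
    """Map each vertex to the set of path IDs in which it appears."""
    index = defaultdict(set)
    for path_id, path in enumerate(paths):
        for vertex in path:
            index[vertex].add(path_id)
    return index


def top_k_similar(query, paths, index, k):
    """Return the k (score, vertex) pairs that co-occur most with query."""
    counts = Counter()
    for path_id in index.get(query, ()):
        for vertex in set(paths[path_id]):
            if vertex != query:
                counts[vertex] += 1
    total = len(paths)
    scored = ((count / total, vertex) for vertex, count in counts.items())
    return heapq.nlargest(k, scored)


# Hypothetical sampled paths (T=2, so three vertices per path).
toy_paths = [
    ["a", "b", "c"],
    ["a", "b", "d"],
    ["b", "a", "e"],
    ["e", "f", "e"],
    ["c", "d", "a"],
]
toy_index = build_vertex_to_path_index(toy_paths)
best_match = top_k_similar("a", toy_paths, toy_index, k=1)
```

Because the query scans only the paths indexed under the query vertex, its cost depends on how many sampled paths pass through that vertex rather than on the size of the whole network.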
Experiments
Efficiency Performance
Tencent network. Each cell reports preprocessing time + top-k similarity search time.

|V|         |E|            RWR [KDD’04]  TopSim [ICDE’12]  RoleSim [KDD’11]  ReFex          Panther_ps       Panther_vs
6,523       10,000         +7.79hr       +38.58m           +37.26s           3.85s+0.07s    0.07s+0.26s      0.99s+0.21s
25,844      50,000         +>150hr       +11.20hr          +12.98m           26.09s+0.40s   0.28s+1.53s      2.45s+4.21s
48,837      100,000        -             +30.94hr          +1.06hr           2.02m+0.57s    0.58s+3.48s      5.30s+5.96s
169,209     500,000        -             +>120hr           +>72hr            17.18m+2.51s   8.19s+16.08s     27.94s+24.17s
230,103     1,000,000      -             -                 -                 31.50m+3.29s   15.31s+30.63s    49.83s+22.86s
443,070     5,000,000      -             -                 -                 24.15hr+8.55s  50.91s+2.82m     4.01m+1.29m
702,049     10,000,000     -             -                 -                 >48hr          2.21m+6.24m      8.60m+6.58m
2,767,344   50,000,000     -             -                 -                 -              15.787m+1.36hr   1.60hr+2.17hr
5,355,507   100,000,000    -             -                 -                 -              44.09m+4.50hr    5.61hr+6.47hr
26,033,969  500,000,000    -             -                 -                 -              4.82hr+25.01hr   32.90hr+47.34hr
51,640,620  1,000,000,000  -             -                 -                 -              13.32hr+80.38hr  98.15hr+120.01hr

Speed-ups annotated on the slide: 390X and 270X.
Can scale up to handle 1 billion edges.
Settings: T=5, c=0.5, ε=√(1/|E|), δ=0.1, R=16,609,640.
Accuracy Performance of Panther_ps
Evaluate how well Panther_ps approximates common neighbors. The score represents the improvement over a random method.
Datasets (figure panels: KDD, Twitter, Mobile): co-author networks: |V|=3K, |E|=7K; Twitter network: |V|=100K, |E|=500K; mobile network: |V|=200K, |E|=200K.
Accuracy Performance of Panther_vs
Task: identity resolution. Assume that the same authors in different networks of the same domain are similar to each other.
Settings: given any two co-author networks, e.g., KDD and ICDM, if the top-k similar vertices from ICDM contain the query author from KDD, we say that the method hits a correct instance.
Network pairs (figure panels): KDD-ICDM, SIGIR-CIKM, SIGMOD-ICDE.
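The hit criterion described in the settings can be sketched as a hit rate at k. The function name, the author IDs, and the rankings are hypothetical. Each query author from one network maps to a ranked list of candidates from the other network.

```python
def hit_rate_at_k(rankings, k):
    """Fraction of query authors found among their own top-k candidates."""
    if not rankings:
        return 0.0
    hits = sum(
        1 for author, ranked in rankings.items() if author in ranked[:k]
    )
    return hits / len(rankings)


# Hypothetical rankings: query author (first network) -> ranked authors
# returned from the second network, most similar first.
toy_rankings = {
    "author_1": ["author_1", "author_9"],
    "author_2": ["author_7", "author_2"],
    "author_3": ["author_8", "author_5"],
}
rate_at_1 = hit_rate_at_k(toy_rankings, 1)
rate_at_2 = hit_rate_at_k(toy_rankings, 2)
```

Raising k can only keep or increase the hit rate, so curves for different methods are usually compared at the same k.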
Parameter Analysis: Path Length T
Figure: effect of path length T on the accuracy performance of Panther_vs.
The performance gets better as T increases and becomes almost stable when T ≥ 5.
Parameter Analysis: Error Bound ε
Tencent networks.
When |E|/(1/ε)^2 ranges from 5 to 20, Panther_ps almost converges.
The value (1/ε)^2 is almost linearly positively correlated with the number of edges in a network.
Therefore, we empirically set ε=√(1/|E|) in our experiments.
Deploy in AMiner.org
Big Network Analysis
Diagram: BIG networks (user, tie, topology, heterogeneous data), from micro to macro, informed by social theories and graph theories.
Topics: user modeling, demographics, social role, social tie/link, homophily, social influence, triad formation, community, group behavior.
Future Work
Question 1: How to incorporate the huge-volume and streaming individual behavior data into network analysis?
Question 2: What is the interplay between (micro) individual behavior and (macro) network distribution?
Diagram: the Big Network Analysis overview (user, tie, topology, heterogeneous data; micro to macro).
Related Publications
Jie Tang, Jimeng Sun, Chi Wang, and Zi Yang. Social Influence Analysis in Large-scale Networks. In KDD’09, pages 807-816, 2009.
Chenhao Tan, Jie Tang, Jimeng Sun, Quan Lin, and Fengjiao Wang. Social action tracking via noise tolerant time-varying factor graphs. In KDD’10, pages 807-816, 2010.
Chi Wang, Jiawei Han, Yuntao Jia, Jie Tang, Duo Zhang, Yintao Yu, and Jingyi Guo. Mining Advisor-Advisee Relationships from Research Publication Networks. In KDD’10, pages 203-212.
Jie Tang, Sen Wu, and Jimeng Sun. Confluence: Conformity Influence in Large Social Networks. In KDD’13, pages 347-355, 2013.
Yuxiao Dong, Yang Yang, Jie Tang, Yang Yang, Nitesh V. Chawla. Inferring User Demographics and Social Strategies in Mobile Social Networks. In KDD’14, 2014.
Yutao Zhang, Jie Tang, Zhilin Yang, Jian Pei, and Philip Yu. COSNET: Connecting Heterogeneous Social Networks with Local and Global Consistency. In KDD’15, pages 1485-1494.
Jing Zhang, Biao Liu, Jie Tang, Ting Chen, and Juanzi Li. Social Influence Locality for Modeling Retweeting Behaviors. In IJCAI’13, pages 2761-2767, 2013.
Jing Zhang, Jie Tang, Honglei Zhuang, Cane Wing-Ki Leung, and Juanzi Li. Role-aware Conformity Influence Modeling and Analysis in Social Networks. In AAAI’14, 2014.
Jing Zhang, Jie Tang, Yuanyi Zhong, Yuchen Mo, Juanzi Li, Guojie Song, Wendy Hall, and Jimeng Sun. StructInf: Mining Structural Influence from Social Streams. In AAAI’17.
Jie Tang, Jing Zhang, Limin Yao, Juanzi Li, Li Zhang, and Zhong Su. ArnetMiner: Extraction and Mining of Academic Social Networks. In KDD’08, pages 990-998, 2008.
Tiancheng Lou and Jie Tang. Mining Structural Hole Spanners Through Information Diffusion in Social Networks. In WWW’13, pages 837-848, 2013.
Jiezhong Qiu, Yixuan Li, Jie Tang, Zheng Lu, Hao Ye, Bo Chen, Qiang Yang, and John Hopcroft. The Lifecycle and Cascade of WeChat Social Messaging Groups. In WWW’16, pages 311-320.
Jie Tang, Tiancheng Lou, Jon Kleinberg, and Sen Wu. Transfer Learning to Infer Social Ties across Heterogeneous Networks. In TOIS, Vol 34 (2), No. 7, 2016.
Jimeng Sun and Jie Tang. A Survey of Models and Algorithms for Social Influence Analysis. Social Network Data Analytics, Aggarwal, C. C. (Ed.), Kluwer Academic Publishers, pages 177-214, 2011.
References
S. Milgram. The Small World Problem. Psychology Today, 1967, Vol. 2, 60-67.
J.H. Fowler and N.A. Christakis. The Dynamic Spread of Happiness in a Large Social Network: Longitudinal Analysis Over 20 Years in the Framingham Heart Study. British Medical Journal 2008; 337: a2338.
R. Dunbar. Neocortex size as a constraint on group size in primates. Human Evolution, 1992, 20: 469-493.
R. M. Bond, C. J. Fariss, J. J. Jones, A. D. I. Kramer, C. Marlow, J. E. Settle and J. H. Fowler. A 61-million-person experiment in social influence and political mobilization. Nature, 489:295-298, 2012.
http://klout.com
Why I Deleted My Klout Profile, by Pam Moore, at Social Media Today, originally published November 19, 2011; retrieved November 26 2011.
S. Aral and D Walker. Identifying Influential and Susceptible Members of Social Networks. Science, 337:337-341, 2012.
J. Ugander, L. Backstrom, C. Marlow, and J. Kleinberg. Structural diversity in social contagion. PNAS, 109 (20):7591-7592, 2012.
S. Aral, L. Muchnik, and A. Sundararajan. Distinguishing influence-based contagion from homophily-driven diffusion in dynamic networks. PNAS, 106 (51):21544-21549, 2009.
J. Scripps, P.-N. Tan, and A.-H. Esfahanian. Measuring the effects of preprocessing decisions and network forces in dynamic network analysis. In KDD’09, pages 747-756, 2009.
Rubin, D. B. 1974. Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology 66, 5, 688-701.
http://en.wikipedia.org/wiki/Randomized_experiment
A. Anagnostopoulos, R. Kumar, M. Mahdian. Influence and correlation in social networks. In KDD’08, pages 7-15, 2008.
L. Page, S. Brin, R. Motwani, and T. Winograd. The pagerank citation ranking: Bringing order to the web. Technical Report SIDL-WP-1999-0120, Stanford University, 1999.
G. Jeh and J. Widom. Scaling personalized web search. In WWW’03, pages 271-279, 2003.
G. Jeh and J. Widom. SimRank: a measure of structural-context similarity. In KDD’02, pages 538-543, 2002.
A. Goyal, F. Bonchi, and L. V. Lakshmanan. Learning influence probabilities in social networks. In WSDM’10, pages 207-217, 2010.
P. Domingos and M. Richardson. Mining the network value of customers. In KDD’01, pages 57-66, 2001.
D. Kempe, J. Kleinberg, and E. Tardos. Maximizing the spread of influence through a social network. In KDD’03, pages 137-146, 2003.
J. Leskovec, A. Krause, C. Guestrin, C. Faloutsos, J. VanBriesen, and N. Glance. Cost-effective outbreak detection in networks. In KDD’07, pages 420-429, 2007.
W. Chen, Y. Wang, and S. Yang. Efficient influence maximization in social networks. In KDD’09, pages 199-207, 2009.
E. Bakshy, D. Eckles, R. Yan, and I. Rosenn. Social influence in social advertising: evidence from field experiments. In EC’12, pages 146-161, 2012.
A. Goyal, F. Bonchi, and L. V. Lakshmanan. Discovering leaders from community actions. In CIKM’08, pages 499-508, 2008.
N. Agarwal, H. Liu, L. Tang, and P. S. Yu. Identifying the influential bloggers in a community. In WSDM’08, pages 207-217, 2008.
E. Bakshy, B. Karrer, and L. A. Adamic. Social influence and the diffusion of user-created content. In EC’09, pages 325-334, New York, NY, USA, 2009. ACM.
P. Bonacich. Power and centrality: a family of measures. American Journal of Sociology, 92:1170-1182, 1987.
R. B. Cialdini and N. J. Goldstein. Social influence: compliance and conformity. Annu Rev Psychol, 55:591-621, 2004.
D. Crandall, D. Cosley, D. Huttenlocher, J. Kleinberg, and S. Suri. Feedback effects between similarity and social influence in online communities. In KDD’08, pages 160-168, 2008.
P. W. Eastwick and W. L. Gardner. Is it a game? evidence for social influence in the virtual world. Social Influence, 4(1):18-32, 2009.
S. M. Elias and A. R. Pratkanis. Teaching social influence: Demonstrations and exercises from the discipline of social psychology. Social Influence, 1(2):147-162, 2006.
T. L. Fond and J. Neville. Randomization tests for distinguishing social influence and homophily effects. In WWW’10, 2010.
M. Gomez-Rodriguez, J. Leskovec, and A. Krause. Inferring Networks of Diffusion and Influence. In KDD’10, pages 1019-1028, 2010.
M. E. J. Newman. A measure of betweenness centrality based on random walks. Social Networks, 2005.
D. J. Watts and S. H. Strogatz. Collective dynamics of ’small-world’ networks. Nature, pages 440-442, Jun 1998.
J. Sun, H. Qu, D. Chakrabarti, and C. Faloutsos. Neighborhood formation and anomaly detection in bipartite graphs. In ICDM’05, pages 418-425, 2005.
Thank you!
Collaborators:
John Hopcroft, Jon Kleinberg, Chenhao Tan (Cornell)
Jiawei Han (UIUC), Philip Yu (UIC)
Jian Pei (SFU), Hanghang Tong (ASU)
Tiancheng Lou (Google&Baidu), Jimeng Sun (GIT)
Wei Chen, Ming Zhou, Long Jiang, Chi Wang, Yuxiao Dong (Microsoft)
Yutao Zhang, Jing Zhang, Zhanpeng Fang, Zi Yang, Sen Wu, etc. (THU)
Jie Tang, KEG, Tsinghua U, http://keg.cs.tsinghua.edu.cn/jietang
Download all data & codes: http://arnetminer.org/data http://arnetminer.org/data-sna