1 Jie Tang, Chenhui Zhang Tsinghua University Keke Cai, Li Zhang, Zhong Su IBM, China Research Lab Sampling Representative Users from Large Social Networks.

Slides:



Advertisements
Similar presentations
1 Topic Distributions over Links on Web Jie Tang 1, Jing Zhang 1, Jeffrey Xu Yu 2, Zi Yang 1, Keke Cai 3, Rui Ma 3, Li Zhang 3, and Zhong Su 3 1 Tsinghua.
Advertisements

Weiren Yu 1, Jiajin Le 2, Xuemin Lin 1, Wenjie Zhang 1 On the Efficiency of Estimating Penetrating Rank on Large Graphs 1 University of New South Wales.
LEARNING INFLUENCE PROBABILITIES IN SOCIAL NETWORKS Amit Goyal Francesco Bonchi Laks V. S. Lakshmanan University of British Columbia Yahoo! Research University.
Active Learning for Streaming Networked Data Zhilin Yang, Jie Tang, Yutao Zhang Computer Science Department, Tsinghua University.
Minimizing Seed Set for Viral Marketing Cheng Long & Raymond Chi-Wing Wong Presented by: Cheng Long 20-August-2011.
TI: An Efficient Indexing Mechanism for Real-Time Search on Tweets Chun Chen 1, Feng Li 2, Beng Chin Ooi 2, and Sai Wu 2 1 Zhejiang University, 2 National.
Confluence: Conformity Influence in Large Social Networks
Maximizing the Spread of Influence through a Social Network
Maximizing the Spread of Influence through a Social Network By David Kempe, Jon Kleinberg, Eva Tardos Report by Joe Abrams.
1 Social Influence Analysis in Large-scale Networks Jie Tang 1, Jimeng Sun 2, Chi Wang 1, and Zi Yang 1 1 Dept. of Computer Science and Technology Tsinghua.
CIKM’2008 Presentation Oct. 27, 2008 Napa, California
Graph Data Management Lab School of Computer Science , Bristol, UK.
Yang Xiang, Ruoming Jin, David Fuhry, Feodor F. Dragan
1 1 Chenhao Tan, 1 Jie Tang, 2 Jimeng Sun, 3 Quan Lin, 4 Fengjiao Wang 1 Department of Computer Science and Technology, Tsinghua University, China 2 IBM.
Jing Gao 1, Feng Liang 1, Wei Fan 2, Chi Wang 1, Yizhou Sun 1, Jiawei Han 1 University of Illinois, IBM TJ Watson Debapriya Basu.
On Community Outliers and their Efficient Detection in Information Networks Jing Gao 1, Feng Liang 1, Wei Fan 2, Chi Wang 1, Yizhou Sun 1, Jiawei Han 1.
INFERRING NETWORKS OF DIFFUSION AND INFLUENCE Presented by Alicia Frame Paper by Manuel Gomez-Rodriguez, Jure Leskovec, and Andreas Kraus.
Query Biased Snippet Generation in XML Search Yi Chen Yu Huang, Ziyang Liu, Yi Chen Arizona State University.
The community-search problem and how to plan a successful cocktail party Mauro SozioAris Gionis Max Planck Institute, Germany Yahoo! Research, Barcelona.
33 rd International Conference on Very Large Data Bases, Sep. 2007, Vienna Towards Graph Containment Search and Indexing Chen Chen 1, Xifeng Yan 2, Philip.
Influence Maximization
Simpath: An Efficient Algorithm for Influence Maximization under Linear Threshold Model Amit Goyal Wei Lu Laks V. S. Lakshmanan University of British Columbia.
1 A Topic Modeling Approach and its Integration into the Random Walk Framework for Academic Search 1 Jie Tang, 2 Ruoming Jin, and 1 Jing Zhang 1 Knowledge.
Maximizing Product Adoption in Social Networks
Models of Influence in Online Social Networks
1 1 Chenhao Tan, 1 Jie Tang, 2 Jimeng Sun, 3 Quan Lin, 4 Fengjiao Wang 1 Department of Computer Science and Technology, Tsinghua University, China 2 IBM.
Active Learning for Networked Data Based on Non-progressive Diffusion Model Zhilin Yang, Jie Tang, Bin Xu, Chunxiao Xing Dept. of Computer Science and.
Personalized Influence Maximization on Social Networks
Mehdi Kargar Aijun An York University, Toronto, Canada Keyword Search in Graphs: Finding r-cliques.
Ranking Queries on Uncertain Data: A Probabilistic Threshold Approach Wenjie Zhang, Xuemin Lin The University of New South Wales & NICTA Ming Hua,
Influence Maximization in Dynamic Social Networks Honglei Zhuang, Yihan Sun, Jie Tang, Jialin Zhang, Xiaoming Sun.
December 7-10, 2013, Dallas, Texas
On Node Classification in Dynamic Content-based Networks.
Exploiting Context Analysis for Combining Multiple Entity Resolution Systems -Ramu Bandaru Zhaoqi Chen Dmitri V.kalashnikov Sharad Mehrotra.
Index Interactions in Physical Design Tuning Modeling, Analysis, and Applications Karl Schnaitter, UC Santa Cruz Neoklis Polyzotis, UC Santa Cruz Lise.
Maximizing the Spread of Influence through a Social Network David Kempe, Jon Kleinberg, Eva Tardos Cornell University KDD 2003.
Greedy is not Enough: An Efficient Batch Mode Active Learning Algorithm Chen, Yi-wen( 陳憶文 ) Graduate Institute of Computer Science & Information Engineering.
On Reducing Broadcast Redundancy in Wireless Ad Hoc Network Author: Wei Lou, Student Member, IEEE, and Jie Wu, Senior Member, IEEE From IEEE transactions.
Maximizing the Spread of Influence through a Social Network Authors: David Kempe, Jon Kleinberg, É va Tardos KDD 2003.
Zhuo Peng, Chaokun Wang, Lu Han, Jingchao Hao and Yiyuan Ba Proceedings of the Third International Conference on Emerging Databases, Incheon, Korea (August.
Mohamed Hefeeda 1 School of Computing Science Simon Fraser University, Canada Efficient k-Coverage Algorithms for Wireless Sensor Networks Mohamed Hefeeda.
1 Panther: Fast Top-K Similarity Search on Large Networks Jing Zhang 1, Jie Tang 1, Cong Ma 1, Hanghang Tong 2, Yu Jing 1, and Juanzi Li 1 1 Department.
Guided Learning for Role Discovery (GLRD) Presented by Rui Liu Gilpin, Sean, Tina Eliassi-Rad, and Ian Davidson. "Guided learning for role discovery (glrd):
Lecture 3-1 Independent Cascade Weili Wu Ding-Zhu Du University of Texas at Dallas.
1 Distinguishing Objects with Identical Names in Relational Databases.
Panther: Fast Top-k Similarity Search in Large Networks JING ZHANG, JIE TANG, CONG MA, HANGHANG TONG, YU JING, AND JUANZI LI Presented by Moumita Chanda.
1 CoupledLP: Link Prediction in Coupled Networks Yuxiao Dong #, Jing Zhang +, Jie Tang +, Nitesh V. Chawla #, Bai Wang* # University of Notre Dame + Tsinghua.
Intelligent Database Systems Lab 國立雲林科技大學 National Yunlin University of Science and Technology 1 Mining Advisor-Advisee Relationships from Research Publication.
Shaoxu Song 1, Aoqian Zhang 1, Lei Chen 2, Jianmin Wang 1 1 Tsinghua University, China 2Hong Kong University of Science & Technology, China 1/19 VLDB 2015.
Privacy Preserving Outlier Detection using Locality Sensitive Hashing
Biao Wang 1, Ge Chen 1, Luoyi Fu 1, Li Song 1, Xinbing Wang 1, Xue Liu 2 1 Shanghai Jiao Tong University 2 McGill University
Yu Wang1, Gao Cong2, Guojie Song1, Kunqing Xie1
Nanyang Technological University
Independent Cascade Model and Linear Threshold Model
Greedy & Heuristic algorithms in Influence Maximization
MEIKE: Influence-based Communities in Networks
Independent Cascade Model and Linear Threshold Model
The Importance of Communities for Learning to Influence
Coverage Approximation Algorithms
Structural influence:
Cost-effective Outbreak Detection in Networks
Example: Academic Search
Asymmetric Transitivity Preserving Graph Embedding
Discovering Influential Nodes From Social Trust Network
Independent Cascade Model and Linear Threshold Model
PRSim: Sublinear Time SimRank Computation on Large Power-Law Graphs.
Modeling Topic Diffusion in Scientific Collaboration Networks
Data-Dependent Approximation
Presentation transcript:

1 Jie Tang, Chenhui Zhang Tsinghua University Keke Cai, Li Zhang, Zhong Su IBM, China Research Lab Sampling Representative Users from Large Social Networks Download Code&Data here: Download Code&Data here:

2 Sampling Representative Users data mining A B C HCI 0.6 visualization data mining HCI social network data mining HPC 0.2 supercompting big data Goal: Finding k users who can best represent the other users.

3 Problem Formulation Sampling Representative Users Goal: how to find the subset T (|T|=k) in order to maximize the utility function Q? G=(V, E): the input social network X=[x ik ] nxm : attribute matrix G={(V 1, a j1 ), …, (V t, a jt )} jt (V 1, a j1 ): a subset of users V1 with the attribute value a j1 T (|T|=k): Selected subset of representative users

4 Related Work Graph sampling –Sampling from large graphs [Leskovec-Faloutsos 2006] –Graph cluster randomization [Ugander-Karrer-Backstrom-Kleinberg 2013] –Sampling community structure [Maiya-Berger 2010] –Network sampling with bias [Maiya-Berger 2011] Social influence test, quantification, and diffusion –Influence and correlation [Anagnostopoulos-et-al 2008] –istinguish influence and homophily [Aral-et-al 2009, La Fond-Nevill 2010] –Topic-based influence measure [Tang-Sun-Wang-Yang 2009, Liu-et-al 2012] –Learning influence probability [Goyal-Bonchi-Lakshmanan 2010] –Linear threshold and cascaded model [Kempe-Kleinberg-Tardos 2003] –Efficient algorithm [Chen-Wang-Yang 2009]

5 The Proposed Models: S3 and SSD

6 Proposed Models Theorem 1. For any fixed Q, the Q-evaluated Representative Users Sampling problem is NP-hard, even there is only one attribute Two instantiation models –Basic principles: synecdoche and metonymy –Statistical Stratified Sampling (S3): Treat one user as being a synecdochic representative of all users –Strategic Sampling for Diversity (SSD): Treat one measurement on that user as being a metonymic indicator of all of the relevant attributes of that user and all users Prove by connecting to the Dominating Set Problem

7 Statistical Stratified Sampling (S3) Maximize the representative degree of all attribute groups –Trade-offs: choose some less representative users on some attributes in order to increase the global representative Quality Function Importance of the attribute group Avoid bias from large groups G={(V 1, a j1 ), …, (V t, a jt )} jt (V 1, a j1 ): a subset of users V 1 with the attribute value a j1 The degree that T represent the l -th attribute group G l

8 Approximate Algorithm R(T, v i, a jl ): representative degree of T for v i on attribute a j R(T, v, a) can be simply defined as if some user from T has the same value of attribute a as v, then R(T, v, a) =1; otherwise R(T, v, a) =0.

9 The greedy algorithm for the S3 model can guarantee a (1-1/e) approximation Theoretical Analysis Submodular property Let where T* is the optimal solution; T k is the solution obtained in the k-th iteration. Thus Recall Finally

10 Strategic Sampling for Diversity (SSD) Maximize the diversification of the selected representative users Approximation Algorithm –each time we choose the poorest group (with the smallest λ l  P(T, l) ), –and then select the user who maximally increases the representative degree of this group Diversity parameter We set it as λ l =1/|V l |

11 Theoretical Analysis Theorem 2. Suppose the representative set generated by S3 is T, the optimal set is T*. Then the greedy algorithm for SSD can guarantee an approximation ratio of C/d  Q*, where d is the number of attributes, C is a constant, and Q* is the optimal solution. Proof. The proof is based on the proof for S3.

12 Experiments

13 Datasets Co-author network: ArnetMiner ( –Network: Authors and coauthor relationships from major conferences –Attributes: Keywords from the papers in each dataset as attributes –Ground truth: Take program committee (PC) members of the conferences during as representative users Microblog network: Sina Weibo ( –Network: users and following relationships –Attributes: location, gender, registration date, verified type, status, description and the number of friends or followers –No ground truth

14 Datasets: Co-author and Weibo DatasetConf.#nodes#edges#PC Database SIGMOD3,4479, VLDB3,6068, ICE4,1509, SUMALL8,02723, Data Mining SIGKDD2,4944, ICDM2,1213, CIKM2,9424, SUMALL6,39412, Dataset#nodes#edges Program19,15219,225 Food189,176204,863 Student79,05276,559 Public welfare324,594383,702 Weibo Coauthor

15 Accuracy Performance Results: S3 outperforms all the other methods In terms of and achieves the best F1 score Comparison methods: InDegree: select representative users by the number of indegree HITS_h and HITS_a: first apply HITS algorithm to obtain authority and hub scores of each node. Then the two methods respectively select representative users according to the two scores PageRank: select representative users according to the pagerank score

16 Accuracy Performance The precision-recall curve of the different methods

17 Efficiency Performance S3 and SSD respectively only need 50 and 10 seconds to perform the sampling over a network of ~200,000 nodes

18 Case Study DatabaseData Mining S3PageRankS3PageRank Jiawei Han Jeffrey F. Naughton Beng Chin Ooi Samuel Madden Johannes Gehrke Kian-Lee Tan Surajit Chaudhuri Elke A. Rundensteiner Divyakant Agrawal Wei Wang Serge Abiteboul Rakesh Agrawal Michael J. Carey Jiawei Han Michael Stonebraker Manish Bhide Ajay Gupta H. V. Jagadish Surajit Chaudhuri Warren Shen Philip S. Yu Jiawei Han Christos Faloutsos ChengXiang Zhai Bing Liu Vipin Kumar Jieping Ye Ming-Syan Chen Padhraic Smyth C. Lee Giles Philip S. Yu Jiawei Han Jian Pei Christos Faloutsos Ke Wang Wei-Ying Ma Jianyong Wang Jeffrey Xu Yu Haixun Wang Hongjun Lu Advantages of S3: (1)S3 tends to select users with more diverse attributes; (2)S3 can tell on which attribute the selected users can represent.

19 Conclusions

20 Jie Tang, Chenhui Zhang Tsinghua University Keke Cai, Li Zhang, Zhong Su IBM, China Research Lab Sampling Representative Users from Large Social Networks Download Code&Data here: Download Code&Data here:

21 Theoretical Analysis If there must be (1)

22 Theoretical Analysis (4) By inequalities (3) and (4), the summation As we get a contradiction with inequality (1), this theorem is thus proved. (2) Therefore by inequality (2) (3)