Page 1 Inferring Relevant Social Networks from Interpersonal Communication Munmun De Choudhury, Winter Mason, Jake Hofman and Duncan Watts WWW ’10 Summarized and presented by Kim Chungrim
Page 2 Contents
Introduction
Motivation
Inferring Social Networks
–Dataset
–Constructing Thresholded Networks
Network Descriptive Statistics
–Network level features
–Node level features
Network-based Prediction
–Node Status / Gender
–Future Communication / Community Detection
Discussion/Conclusion
Page 3 Introduction
The rapidly growing volume of electronic communication data has been a great benefit to social network analysis. However, it raises two problems for social network analysts:
–Inference problem: the “real” social ties are not directly observable and hence must be inferred from observed events
–Relevance problem: there is no one “true” social network, but rather many such networks, each corresponding to a different definition of a tie and each relevant to different social processes
Depending on the definition of an “edge”, a network can have different meanings:
–1) An edge exists between i and j if either has communicated with the other at least once in the past year
–2) An edge exists if each has communicated with the other at least once in the past week
–3) An edge exists if each has communicated with the other at least once per week for the past year
Which of these networks is the “relevant” one depends on the research question of interest
Page 4 Motivation
Define a minimum threshold on the strength of a tie
Infer a network for each of several threshold values
Study the impact of the different thresholded networks on:
–Descriptive statistics
–The ability of the network to predict node characteristics
Page 5 Inferring Social Networks - Datasets
University
–A compiled registry of all emails associated with individuals at a large university
–Duration: 2 years (6 trimesters)
–Number of users: 19,817
–Number of emails: 1.09M
–Emails involving non-university domains are disregarded
–A node contains information about a person: id, gender, position, etc.
Enron
–A repository of the emails exchanged internally among the employees at Enron
–Duration: 4 years
–Number of users: 4,736
–Number of emails: 1.06M
–A node contains information about a person: id, position, etc.
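Page 6 Inferring Social Networks - Constructing Thresholded Networks
Edge definition
–Edge weight $w_{ij}$: the geometric mean of the annualized rates of messages exchanged between i and j
Edge threshold
–Threshold $\tau$: the minimum number of emails exchanged between each pair of individuals over a period of time T
–A social graph $G(V, E; \tau)$ s.t. $(i, j) \in E$ if and only if $w_{ij} \ge \tau$
–A family of networks: $\{G(\tau_1), G(\tau_2), \ldots, G(\tau_K)\}$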
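The thresholding step on this slide can be sketched as follows. This is a minimal illustration, assuming the edge weight is the geometric mean of the two directed annualized message rates; the function names, the `counts` input format, and the threshold values are hypothetical and not taken from the paper.

```python
import math

def edge_weights(counts, years):
    """counts maps a directed pair (sender, receiver) to a message count.

    Returns undirected weights: the geometric mean of the two annualized
    directed rates (an assumed reading of the slide's edge definition).
    """
    weights = {}
    for (i, j), n_ij in counts.items():
        if i == j:
            continue  # ignore self-addressed messages
        n_ji = counts.get((j, i), 0)
        key = (min(i, j), max(i, j))
        weights[key] = math.sqrt((n_ij / years) * (n_ji / years))
    return weights

def threshold_graph(weights, tau):
    """Keep only edges with a positive weight of at least tau."""
    return {edge for edge, w in weights.items() if w > 0 and w >= tau}

# Toy message counts over a two-year window (made-up numbers).
counts = {("a", "b"): 8, ("b", "a"): 2, ("a", "c"): 5,
          ("b", "c"): 1, ("c", "b"): 1}
w = edge_weights(counts, years=2.0)

# One network per threshold value: a small "family of networks".
family = {tau: threshold_graph(w, tau) for tau in (0.5, 1.0, 2.0)}
```

Because the geometric mean is zero whenever one direction has no messages, one-way communication (such as a to c above) never forms an edge under this reading.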
Page 7 Network Descriptive Statistics – Network Level Features Network density: –Number of edges –Number of connected nodes –Number of components –Relative Sizes of Components
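The network-level features listed above could be computed along these lines. This is a sketch only: it assumes an undirected edge list without self-loops, counts a node as "connected" when it has at least one edge, and uses hypothetical names throughout.

```python
from collections import defaultdict

def network_features(edges):
    """edges: iterable of undirected pairs (u, v) without self-loops."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)

    # Depth-first search to collect the size of each connected component.
    seen, sizes = set(), []
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        stack, size = [start], 0
        while stack:
            node = stack.pop()
            size += 1
            for nb in adj[node]:
                if nb not in seen:
                    seen.add(nb)
                    stack.append(nb)
        sizes.append(size)

    n = len(adj)
    sizes.sort(reverse=True)
    return {
        "num_edges": sum(len(nbs) for nbs in adj.values()) // 2,
        "num_connected_nodes": n,
        "num_components": len(sizes),
        # Fraction of connected nodes in each component, largest first.
        "relative_sizes": [s / n for s in sizes] if n else [],
    }

features = network_features([(1, 2), (2, 3), (4, 5)])
```

Running this once per threshold value gives the kind of per-threshold statistics the slide refers to.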
Page 8 Network Descriptive Statistics – Network Level Features
Page 9 Network Descriptive Statistics – Node Level Features
Reach of a node:
–Node degree: the number of neighbors of the node
–Average neighbor degree: the average degree over all of a node’s neighbors
–Size of two-hop neighborhood: count of all of the node’s neighbors plus all of the node’s neighbors’ neighbors
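A minimal sketch of the three reach features, assuming an undirected edge list and assuming the focal node itself is excluded from its two-hop count (the slide does not say either way). The function names are illustrative.

```python
from collections import defaultdict

def build_adjacency(edges):
    """Turn an undirected edge list into a dict of neighbor sets."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return dict(adj)

def reach_features(adj, node):
    """Return (degree, average neighbor degree, two-hop neighborhood size)."""
    neighbors = adj.get(node, set())
    degree = len(neighbors)
    if degree == 0:
        return 0, 0.0, 0
    avg_neighbor_degree = sum(len(adj[nb]) for nb in neighbors) / degree
    # Neighbors plus neighbors' neighbors, without the focal node.
    two_hop = set(neighbors)
    for nb in neighbors:
        two_hop |= adj[nb]
    two_hop.discard(node)
    return degree, avg_neighbor_degree, len(two_hop)

# Path 1-2-3-4 with an extra leaf 5 attached to node 2.
adj = build_adjacency([(1, 2), (2, 3), (3, 4), (2, 5)])
print(reach_features(adj, 1))
```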
Page 10 Network Descriptive Statistics – Node Level Features Closure of the ego-network: –Embeddedness –Normalized clustering coefficient
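The slide does not give the formulas for these closure measures. As a point of reference only, the standard (unnormalized) local clustering coefficient can be sketched as below; the normalization used in the paper and its definition of embeddedness are not reproduced here, and the function name is hypothetical.

```python
from itertools import combinations

def local_clustering(adj, node):
    """Fraction of pairs of the node's neighbors that are themselves linked.

    adj: dict mapping each node to the set of its neighbors (undirected).
    Returns 0.0 for nodes with fewer than two neighbors.
    """
    neighbors = adj.get(node, set())
    k = len(neighbors)
    if k < 2:
        return 0.0
    linked_pairs = sum(
        1 for u, v in combinations(neighbors, 2) if v in adj.get(u, set())
    )
    return linked_pairs / (k * (k - 1) / 2)

# Triangle 1-2-3 with a pendant node 4 attached to node 1.
adj = {1: {2, 3, 4}, 2: {1, 3}, 3: {1, 2}, 4: {1}}
cc_node_1 = local_clustering(adj, 1)
```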
Page 11 Network Descriptive Statistics – Node Level Features
To what extent does a node “bridge” communities:
–Network constraint [Burt ‘04]
–Number of ego components: count of the number of connected components that remain when the focal node and its incident edges are removed
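The ego-component count can be sketched as follows. One assumption is made explicit here: the count is restricted to the focal node's own neighbors (its ego network), which is one reading of the definition above. Network constraint is not covered, and the function name is illustrative.

```python
def ego_components(adj, node):
    """Count connected components among a node's neighbors once the node
    itself (and its incident edges) is removed.

    adj: dict mapping each node to the set of its neighbors (undirected).
    """
    neighbors = set(adj.get(node, set()))
    neighbors.discard(node)
    seen, count = set(), 0
    for start in neighbors:
        if start in seen:
            continue
        count += 1
        seen.add(start)
        stack = [start]
        while stack:
            current = stack.pop()
            for nb in adj.get(current, set()):
                # Only walk edges that stay inside the ego network.
                if nb in neighbors and nb not in seen:
                    seen.add(nb)
                    stack.append(nb)
    return count

# Node 1 links a connected pair (2, 3) and a separate node 4.
adj = {1: {2, 3, 4}, 2: {1, 3}, 3: {1, 2}, 4: {1}}
bridging = ego_components(adj, 1)
```

A higher count suggests the node connects groups that would otherwise be unlinked within its neighborhood.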
Page 12 Network-based Prediction
The characteristics of a network depend on the threshold
Which network should be chosen for an experiment?
Experiments to find the right threshold for various research interests:
– Predictions of node status/gender
– Predictions of future communication activity
– Community detection
Page 13 Prediction Tasks: Node Status/Gender
Given a feature vector for each node i, a feature matrix is built from the feature vectors, and a vector of the status/gender attribute of each node i is constructed.
The feature matrix and the attribute vector are split into a training set and a test set
–Training set: 90% of the feature matrix and attribute vector
–Test set: 10% of the feature matrix and attribute vector
Using an SVM with a Gaussian RBF kernel, learn the parameters and the kernel width with 10-fold cross-validation
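The evaluation protocol above (90/10 split, then 10-fold cross-validation on the training portion) can be sketched with the standard library alone. This is not the authors' code: the SVM itself is left out, the helper names and the fixed seed are assumptions, and the kernel uses one common parameterization of the Gaussian RBF.

```python
import math
import random

def train_test_split(n, test_fraction=0.1, seed=0):
    """Shuffle node indices 0..n-1 and hold out a test fraction."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_test = max(1, round(n * test_fraction))
    return indices[n_test:], indices[:n_test]

def k_folds(indices, k=10):
    """Yield (fit, validate) index lists for k-fold cross-validation."""
    for fold in range(k):
        validate = indices[fold::k]
        held_out = set(validate)
        fit = [i for i in indices if i not in held_out]
        yield fit, validate

def rbf_kernel(x, z, width):
    """Gaussian RBF kernel exp(-||x - z||^2 / (2 * width^2))."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-squared_distance / (2 * width ** 2))

train, test = train_test_split(100)
folds = list(k_folds(train, k=10))
```

In practice each candidate kernel width would be scored on the validation folds, and the best one refit on the full training set before a single evaluation on the held-out test set.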
Page 14 Prediction Tasks: Future Communication / Community Detection
Future communication
–Given a feature vector built from the activity of node j from time t0 to tm and the activity of node i at time tl, the model of communication activity is expressed as a function of these features
–The best-fit regression coefficient is used to predict the future node activity
Community detection
–Fit a stochastic block model to the thresholded network $G(\tau)$, where $\tau$ is the edge threshold, using variational Bayes inference [Hofman et al. 2008]
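The exact regression model for future communication is not recoverable from the slide. As a simplified stand-in, the sketch below fits an ordinary least-squares line with a single predictor (past activity) and uses it to predict future activity; the function names and the toy numbers are made up for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx  # requires at least two distinct x values
    return slope, mean_y - slope * mean_x

def predict(slope, intercept, x):
    """Apply the fitted line to a new observation."""
    return slope * x + intercept

# Toy data: past activity (x) against later activity (y) for four nodes.
past = [1.0, 2.0, 3.0, 4.0]
later = [3.0, 5.0, 7.0, 9.0]
slope, intercept = fit_line(past, later)
forecast = predict(slope, intercept, 5.0)
```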
Page 15 Experimental Results – University Dataset
Page 16 Experimental Results – Enron Dataset
Page 17 Conclusion
It is hard to find the optimal threshold
–Accuracy is maximized at a non-obvious point
–Still, accuracy improves by 30% over the unthresholded network
–Deleting edges removes noise
The optimal threshold falls at a consistent value
–For different prediction tasks
–For different data sets
Page 18 Summary / Discussion / Future work
Summary
–Network inference procedures assume ad-hoc edge filtering
–Introduced a threshold on edges and a family of networks to find an optimal threshold for a given prediction task
–The prediction accuracies peak in a non-obvious yet relatively narrow threshold range
Discussion
–Tested on too few datasets; not enough to support a solid conclusion
Future work
–Apply the method to a variety of networks
–Test various thresholds for more research interests