Mining Social Networks for Personalized Email Prioritization
Shinjae Yoo, Yiming Yang, Frank Lin, Il-Chul Moon [KDD ’09]
Advisor: Dr. Koh Jia-Ling
Reporter: Che-Wei, Liang
Date: 2009/08/25
Outline
– Introduction
– Social Clustering
– Measuring Social Importance
– Semi-supervised Importance Propagation
– Experiments
– Conclusions and Future Work
Introduction
– Email
  – One of the most prevalent personal and business communication tools
  – Asynchronous
– Processing a large volume of messages of differing importance is a burden
Introduction
– Information overload problem
  – Need to develop systems that automatically learn personal priorities for each user
  – Identify personally interesting messages
  – Identify important messages that need the user’s attention
Introduction
– Many statistical learning techniques have been studied in support of email-based prediction tasks
  – Spam identification, folder recommendation, recipient reminding, action-item identification, social group analysis
– But personalized prioritization
  – Remains an under-explored problem
  – Mainly due to privacy issues in collecting personal data
Introduction
– This paper
  – Creates a new collection of anonymized personal email data with importance levels
  – Proposes a fully personalized methodology for technical development and evaluation
  – Develops a supervised classification framework for modeling personal priorities over messages and predicting importance levels for new messages
Motivation
– Sender information
  – One of the most indicative features
  – Messages sent by members of the same group tend to share similar priority levels
  – Capturing sender groups is therefore informative for predicting the importance of messages
– If a sender does not have any labeled instances
  – Based on unsupervised clustering, infer that sender’s importance from other group members
Personalized Social Network
– For each user, a personalized social network is constructed using the data of that user
  – Practicality
  – Personalization
– Contact network, represented by a graph $G = (V, E)$
  – $V$: contacts (users)
  – $E$: message sending among users, unweighted ($E_{ij} = 1$ if there is a message from user i to user j, $E_{ij} = 0$ otherwise)
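The graph definition on this slide can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function name `build_contact_graph`, the `(sender, recipients)` input shape, and the toy mailbox are all assumptions made for the example.

```python
def build_contact_graph(messages):
    """Build an unweighted, directed contact graph.

    `messages` is a list of (sender, [recipients]) pairs (hypothetical format).
    Returns (contacts, edges) with edges[i][j] == 1 if contact i sent at
    least one message to contact j, and 0 otherwise.
    """
    names = set()
    for sender, recipients in messages:
        names.add(sender)
        names.update(recipients)
    contacts = sorted(names)
    index = {name: i for i, name in enumerate(contacts)}

    n = len(contacts)
    edges = [[0] * n for _ in range(n)]
    for sender, recipients in messages:
        for recipient in recipients:
            # Unweighted: repeated messages between the same pair keep the edge at 1.
            edges[index[sender]][index[recipient]] = 1
    return contacts, edges


# Hypothetical toy mailbox.
toy_messages = [
    ("alice", ["bob", "carol"]),
    ("bob", ["alice"]),
    ("alice", ["bob"]),
]
contacts, E = build_contact_graph(toy_messages)
```

Because the network is built per user, the input would be the messages of a single mailbox only.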
Clustering
– Newman clustering
  – Has been used successfully to find social structures
  – Defines edge betweenness: a link with a high score is a crucial connection between the boundary nodes of two clusters
  – Deleting links with high edge-betweenness scores leaves disconnected components, which are taken as the clusters
[Figure: example graph]
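The delete-the-bridges idea can be sketched as follows. This is an illustrative implementation of one splitting step on an undirected graph, assuming an adjacency-set representation; the helper names (`edge_betweenness`, `components`, `split_once`) and the toy graph are made up for the example, and the slides do not say how many splits the authors perform.

```python
from collections import deque


def edge_betweenness(adj):
    """Edge betweenness for an undirected graph {node: set(neighbors)}."""
    score = {frozenset((u, v)): 0.0 for u in adj for v in adj[u]}
    for s in adj:
        # BFS from s: count shortest paths (sigma) and record predecessors.
        dist = {s: 0}
        sigma = {v: 0.0 for v in adj}
        sigma[s] = 1.0
        preds = {v: [] for v in adj}
        order = []
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Push path counts back from the farthest nodes toward s.
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                share = sigma[v] / sigma[w] * (1.0 + delta[w])
                score[frozenset((v, w))] += share
                delta[v] += share
    # Every unordered pair was visited from both endpoints.
    return {edge: value / 2.0 for edge, value in score.items()}


def components(adj):
    """Connected components as a list of node sets."""
    seen, result = set(), []
    for start in adj:
        if start in seen:
            continue
        comp, stack = set(), [start]
        while stack:
            v = stack.pop()
            if v not in comp:
                comp.add(v)
                stack.extend(adj[v] - comp)
        seen |= comp
        result.append(comp)
    return result


def split_once(adj):
    """Remove highest-betweenness edges until one more component appears."""
    adj = {v: set(ns) for v, ns in adj.items()}  # work on a copy
    start = len(components(adj))
    while len(components(adj)) == start:
        scores = edge_betweenness(adj)
        if not scores:
            break
        u, v = tuple(max(scores, key=scores.get))
        adj[u].discard(v)
        adj[v].discard(u)
    return components(adj)


# Hypothetical toy graph: two triangles joined by the bridge c-d.
toy = {
    "a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"},
    "d": {"c", "e", "f"}, "e": {"d", "f"}, "f": {"d", "e"},
}
clusters = split_once(toy)
```

In the toy graph the bridge c-d carries every shortest path between the two triangles, so it is removed first and the triangles become the two clusters.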
Measuring Social Importance
– Link relations provide useful information about the centrality of each contact
Measuring Social Importance
– In-degree centrality
– Out-degree centrality
– Total-degree centrality
[Figure: example graph with nodes A to E]
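A small sketch of the three degree measures, computed as raw counts from the 0/1 edge matrix of the contact network. Whether the authors normalize these counts is not stated on the slide, so the unnormalized form and the name `degree_centralities` are assumptions.

```python
def degree_centralities(edges):
    """Return (in_degree, out_degree, total_degree) lists for a 0/1 matrix.

    edges[i][j] == 1 means contact i sent a message to contact j.
    """
    n = len(edges)
    out_degree = [sum(row) for row in edges]
    in_degree = [sum(edges[i][j] for i in range(n)) for j in range(n)]
    total_degree = [i + o for i, o in zip(in_degree, out_degree)]
    return in_degree, out_degree, total_degree


# Hypothetical 3-contact network: 0 -> 1, 0 -> 2, 1 -> 0.
toy_edges = [
    [0, 1, 1],
    [1, 0, 0],
    [0, 0, 0],
]
in_deg, out_deg, total_deg = degree_centralities(toy_edges)
```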
Measuring Social Importance
– Clustering coefficient
  – Measures connectivity among the neighborhood of the node
– Clique count
  – Clique: a fully connected sub-graph
  – A large clique count for node v means
    – It connects to large and well-connected sub-graphs
    – It is located in the center of those sub-graphs
[Figure: example graph with nodes A to F]
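Both measures can be sketched on an undirected view of the contact graph. The exact definitions used in the paper are not given on the slide, so this example assumes the common local clustering coefficient and counts the maximal cliques of size three or more that contain the node; the function names and the toy graph are hypothetical.

```python
from itertools import combinations


def clustering_coefficient(adj, v):
    """Fraction of pairs of v's neighbors that are themselves connected."""
    neighbors = adj[v]
    k = len(neighbors)
    if k < 2:
        return 0.0
    links = sum(1 for a, b in combinations(neighbors, 2) if b in adj[a])
    return links / (k * (k - 1) / 2)


def maximal_cliques(adj):
    """Enumerate maximal cliques with the basic Bron-Kerbosch recursion."""
    cliques = []

    def expand(current, candidates, excluded):
        if not candidates and not excluded:
            cliques.append(current)
            return
        for v in list(candidates):
            expand(current | {v}, candidates & adj[v], excluded & adj[v])
            candidates = candidates - {v}
            excluded = excluded | {v}

    expand(set(), set(adj), set())
    return cliques


def clique_count(adj, v, min_size=3):
    """Number of maximal cliques (of at least min_size nodes) containing v."""
    return sum(1 for c in maximal_cliques(adj) if v in c and len(c) >= min_size)


# Hypothetical toy graph: node "a" sits in two triangles, a-b-c and a-d-e.
toy = {
    "a": {"b", "c", "d", "e"},
    "b": {"a", "c"}, "c": {"a", "b"},
    "d": {"a", "e"}, "e": {"a", "d"},
}
```

Node "a" has a low clustering coefficient (only 2 of the 6 possible neighbor pairs are linked) but the highest clique count, which matches the slide's reading of a node located at the center of several well-connected sub-graphs.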
Measuring Social Importance
– Betweenness centrality
  – Fraction of the shortest paths between pairs of other nodes that go through node i
  – $\sigma_{jk}$: number of shortest paths between j and k
  – $\sigma_{jk}(i)$: number of shortest paths between j and k that go through i
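The formula itself is not preserved on this slide. A standard definition that matches the two symbols is $\sum_{j < k,\; j, k \neq i} \sigma_{jk}(i) / \sigma_{jk}$, and the sketch below computes that sum by brute force on an undirected graph. The unnormalized form, the function names, and the toy graph are assumptions for illustration.

```python
from collections import deque


def bfs_counts(adj, source):
    """Return (dist, sigma): hop distances and shortest-path counts from source."""
    dist, sigma = {source: 0}, {source: 1}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                sigma[w] = 0
                queue.append(w)
            if dist[w] == dist[v] + 1:
                sigma[w] += sigma[v]
    return dist, sigma


def betweenness(adj, i):
    """Sum over pairs j < k (both different from i) of sigma_jk(i) / sigma_jk."""
    info = {s: bfs_counts(adj, s) for s in adj}
    dist_i, sigma_i = info[i]
    others = [v for v in adj if v != i]
    total = 0.0
    for a in range(len(others)):
        for b in range(a + 1, len(others)):
            j, k = others[a], others[b]
            dist_j, sigma_j = info[j]
            if k not in dist_j or i not in dist_j or k not in dist_i:
                continue  # j and k are not connected through i
            if dist_j[i] + dist_i[k] == dist_j[k]:
                # Paths through i = (paths j..i) * (paths i..k).
                total += sigma_j[i] * sigma_i[k] / sigma_j[k]
    return total


# Hypothetical toy graph: a 4-cycle a-b-c-d-a.
square = {"a": {"b", "d"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"a", "c"}}
```

In the 4-cycle, a and c are joined by two shortest paths and only one of them passes through b, so b scores 0.5.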
Measuring Social Importance
– HITS authority
  – Hyperlink-Induced Topic Search, also known as hubs and authorities
  – Measures the global importance of a node
  – Definition: given the N-by-N adjacency matrix $X$, the authority scores can be calculated by finding the principal eigenvector $r$ of the matrix $X^T X$, where $r$ satisfies $X^T X r = \lambda r$ and $\lambda$ is the largest eigenvalue
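The principal eigenvector can be approximated by power iteration. The sketch below uses the standard HITS update (authority from hubs, hubs from authority), which is equivalent to repeatedly multiplying by $X^T X$; the function name, iteration count, and toy matrix are assumptions, not taken from the slides.

```python
def hits_authority(X, iterations=100):
    """Approximate the principal eigenvector of X^T X by power iteration.

    X[i][j] == 1 means node i links to (sends a message to) node j.
    The result is scaled to unit Euclidean length.
    """
    n = len(X)
    r = [1.0] * n
    for _ in range(iterations):
        # Hub scores: h = X r. Authority scores: r_new = X^T h = X^T X r.
        hub = [sum(X[i][j] * r[j] for j in range(n)) for i in range(n)]
        new_r = [sum(X[i][j] * hub[i] for i in range(n)) for j in range(n)]
        norm = sum(value * value for value in new_r) ** 0.5
        if norm == 0:
            return new_r  # no links at all: every authority score is zero
        r = [value / norm for value in new_r]
    return r


# Hypothetical toy network: 0 -> 1, 0 -> 2, 1 -> 2.
toy_X = [
    [0, 1, 1],
    [0, 0, 1],
    [0, 0, 0],
]
authority = hits_authority(toy_X)
```

Node 2 receives links from both other nodes and ends up with the highest authority score; node 0 receives none and scores zero.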
Measuring Social Importance
– PCC analysis
  – Pearson Correlation Coefficient (PCC)
  – Compute the PCC of each social metric with the human-labeled importance levels of messages
  – Indicates how useful each metric is for predicting the importance of messages
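For reference, the Pearson correlation between one social metric and the importance labels can be computed as below. The numbers are invented toy values, not results from the paper, and the function name `pearson` is an assumption.

```python
def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # correlation is undefined for a constant sequence; report 0
    return cov / (var_x * var_y) ** 0.5


# Invented toy data: one metric value (for the sender) per labeled message,
# paired with that message's importance label.
sender_metric = [1, 2, 3, 4]
importance = [2, 4, 6, 8]
pcc = pearson(sender_metric, importance)
```

A value near +1 or -1 would suggest the metric carries signal about importance; a value near 0 would suggest it does not.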
Semi-supervised Importance Propagation
– Semi-supervised Importance Propagation (SIP)
  – Propagates the importance values of labeled messages (the training examples) to other messages and the corresponding contact persons
SIP Algorithm
– Use a bipartite graph to represent the interactions between contacts and messages
– Let N = number of contacts, M = number of messages
– Use two matrices to represent the two types of edges: matrix $A$ (N by M) and matrix $B$ (N by M)
  – $A_{i,j} = 1$ if person i sent message j, and $A_{i,j} = 0$ otherwise
  – $B_{i,j} = 1$ if person i received message j, and $B_{i,j} = 0$ otherwise
SIP Algorithm
– Treat each importance label (1 to 5) as a category
– Use a vector $x_k$ (M by 1) to indicate the labels of messages
  – $x_{k,i} = 1$ if message i belongs to category k, $x_{k,i} = 0$ otherwise
– Importance propagation from messages to persons (receivers) is calculated as $y_k = B x_k$
– Importance propagation from persons (senders) to messages is calculated as $x_k = A^T y_k$
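A small sketch of the sender matrix, the receiver matrix, the label vector, and one round of propagation. The two propagation formulas are not preserved in the slide text; the forms used here (receivers get B times x, messages get A-transpose times y) are assumptions chosen because they are the only products whose dimensions fit the definitions. All function names and the toy mailbox are hypothetical.

```python
def build_sip_matrices(contacts, messages):
    """Build sender matrix A and receiver matrix B (both N by M).

    `messages` is a list of (sender, [receivers]) pairs (hypothetical format).
    """
    index = {name: i for i, name in enumerate(contacts)}
    n, m = len(contacts), len(messages)
    A = [[0] * m for _ in range(n)]
    B = [[0] * m for _ in range(n)]
    for j, (sender, receivers) in enumerate(messages):
        A[index[sender]][j] = 1
        for receiver in receivers:
            B[index[receiver]][j] = 1
    return A, B


def label_vector(labels, k):
    """x_k: 1 where the message is labeled with category k, else 0."""
    return [1 if label == k else 0 for label in labels]


def messages_to_receivers(B, x):
    """Assumed form y = B x: each person sums the scores of received messages."""
    return [sum(b * xi for b, xi in zip(row, x)) for row in B]


def senders_to_messages(A, y):
    """Assumed form x = A^T y: each message takes the score of its sender."""
    n, m = len(A), len(A[0])
    return [sum(A[i][j] * y[i] for i in range(n)) for j in range(m)]


# Hypothetical toy mailbox; None marks an unlabeled message.
contacts = ["ann", "ben", "me"]
messages = [("ann", ["me"]), ("ben", ["me", "ann"]), ("ann", ["ben"])]
labels = [5, 1, None]

A, B = build_sip_matrices(contacts, messages)
x1 = label_vector(labels, 1)          # only the second message has label 1
y1 = messages_to_receivers(B, x1)     # ann and me received that message
x1_next = senders_to_messages(A, y1)  # score flows on to the messages ann sent
```

In this toy run the unlabeled third message picks up a score for category 1 because its sender, ann, received a message labeled 1.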
Propagation Example
[Figure: importance propagating from messages to persons (receivers), and from persons (senders) to messages]
SIP Algorithm
– Updating of the importance values for contact persons at each time step t combines the two propagation steps:
  $y_k^{(t+1)} = B A^T y_k^{(t)} = C\, y_k^{(t)}$, where $C = B A^T$
SIP Algorithm
– $y_k^{(t+1)}$ is a linear transformation of $y_k^{(t)}$
– If $C$ is irreducible and t is large, $y_k$ stabilizes at the principal eigenvector of $C$
  – Irreducibility is not always guaranteed
  – Even when it holds, the principal eigenvector is insensitive to the starting vector
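SIP Algorithm
– A linear interpolation
  – Define an importance vector for category k and normalize it by the sum of its entries
  – Define an importance-sensitive matrix whose columns are identical, each column equal to the normalized vector
  – Normalize matrix $C$ to $C'$
  – Interpolate between $C'$ and the importance-sensitive matrix with weight $\alpha \in [0, 1]$ to obtain $E_k$
– $E_k$ is irreducible and importance-sensitive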
SIP Algorithm
– Finally, the SIP method is defined iteratively as $y_k^{(t+1)} = E_k\, y_k^{(t)}$
  – $E_k$ is irreducible, so $y_k$ stabilizes when t is large
  – $y_k$ consists of the expected importance score of each person after iterative SIP
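The iteration can be sketched end to end as below. Several details are not recoverable from the slide text, so the following are assumptions rather than the authors' definitions: the person-to-person matrix is taken as B times A-transpose, it is normalized so each non-empty column sums to 1, the importance vector is B times x_k divided by its sum, alpha weights the normalized matrix and (1 - alpha) weights the importance-sensitive matrix, and the score vector is rescaled to sum to 1 after every step. The function name `sip_scores` and the toy matrices are hypothetical.

```python
def sip_scores(A, B, x_k, alpha=0.5, iterations=50):
    """Iterate y <- E_k y for one importance category (assumed form, see lead-in).

    A, B: N-by-M 0/1 sender and receiver matrices. x_k: length-M label vector.
    Returns a length-N list of per-person scores that sums to 1.
    """
    n, m = len(A), len(A[0])

    # Assumed: C = B A^T, so C[i][l] counts messages sent by l and received by i.
    C = [[sum(B[i][j] * A[l][j] for j in range(m)) for l in range(n)]
         for i in range(n)]

    # Assumed normalization C -> C': scale each non-empty column to sum to 1.
    col_sums = [sum(C[i][l] for i in range(n)) for l in range(n)]
    C_norm = [[C[i][l] / col_sums[l] if col_sums[l] else 0.0 for l in range(n)]
              for i in range(n)]

    # Assumed importance vector: receivers of messages labeled k, normalized.
    v = [sum(B[i][j] * x_k[j] for j in range(m)) for i in range(n)]
    total = sum(v)
    if total == 0:
        return [0.0] * n  # no labeled message of this category
    v = [value / total for value in v]

    y = v[:]
    for _ in range(iterations):
        mass = sum(y)
        # The importance-sensitive matrix has every column equal to v,
        # so multiplying it by y gives v * sum(y).
        y = [alpha * sum(C_norm[i][l] * y[l] for l in range(n))
             + (1 - alpha) * v[i] * mass
             for i in range(n)]
        scale = sum(y)
        if scale == 0:
            break
        y = [value / scale for value in y]
    return y


# Hypothetical toy data: contacts (ann, ben, me) and three messages.
toy_A = [[1, 0, 1], [0, 1, 0], [0, 0, 0]]  # who sent each message
toy_B = [[0, 1, 0], [0, 0, 1], [1, 1, 0]]  # who received each message
toy_x1 = [0, 1, 0]                         # second message labeled importance 1
scores = sip_scores(toy_A, toy_B, toy_x1)
```

The (1 - alpha) term keeps pulling the scores back toward the labeled evidence, which is what makes the result depend on the importance labels instead of collapsing to a label-independent eigenvector. Running the function once per importance level would give five scores per person.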
Experiments
– Data
  – Recruited 25 experimental subjects
  – Each subject was asked to label non-spam messages
– Preprocessing
  – Email address canonicalization
  – Word tokenization and stemming
  – Stop words were not removed from title and body text
Experiments
– Features
  – Basic features: tokens in the from, to, cc, title, and body text, represented as a v-dimensional vector
  – Social-network-based features
    – An m-dimensional sub-vector represents the Newman clustering (NC) features
    – A 7-dimensional sub-vector represents the social importance (SI) features
    – A 5-dimensional sub-vector represents the five SIP scores per user
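One way to picture the feature layout is as a simple concatenation of the four sub-vectors. The ordering, the function name, and the toy values below are assumptions for illustration; the slide gives only the dimensionalities.

```python
def build_feature_vector(token_counts, cluster_membership, social_importance, sip):
    """Concatenate basic and social-network features for one message.

    token_counts:       v-dimensional basic token features
    cluster_membership: m-dimensional cluster (NC) features
    social_importance:  7 social importance (SI) scores
    sip:                5 SIP scores, one per importance level
    """
    if len(social_importance) != 7:
        raise ValueError("expected 7 social importance scores")
    if len(sip) != 5:
        raise ValueError("expected 5 SIP scores")
    return (list(token_counts) + list(cluster_membership)
            + list(social_importance) + list(sip))


# Invented toy values: v = 4 tokens, m = 2 clusters.
features = build_feature_vector(
    token_counts=[1, 0, 2, 0],
    cluster_membership=[0, 1],
    social_importance=[3, 2, 5, 0.33, 2, 0.5, 0.85],
    sip=[0.1, 0.0, 0.2, 0.3, 0.4],
)
```

The resulting vector has v + m + 7 + 5 entries, here 18.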
Experiments
– Classifiers
  – Use five linear SVM classifiers to predict the importance level of each message
  – Use the standard SVMlight software package
– Metric
  – N = number of messages
  – $y_i$ = the true importance level of message i
  – $\hat{y}_i$ = the predicted importance level for that message
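The metric formula is not preserved on this slide. One metric that fits the three listed symbols is the mean absolute error, $\frac{1}{N}\sum_{i=1}^{N} |y_i - \hat{y}_i|$; treating it as the metric used here is an assumption, and the values below are invented.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute gap between true and predicted importance levels."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Invented toy labels on the 1 to 5 importance scale.
true_levels = [5, 3, 1, 4]
predicted_levels = [4, 3, 2, 2]
error = mean_absolute_error(true_levels, predicted_levels)
```

Lower is better under this metric, and an error of 1.0 would mean predictions are off by one importance level on average.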
Conclusions and Future Work
– Future work
  – Collect more data from a larger number of users over a longer time period
  – Comparative study of different clustering algorithms and graph-mining techniques with respect to effectiveness