Subject : Discovering Overlapping Groups in Social Media Professor : Dr. sh.Esmaili The Student’s Identifiers : Mr. Hossien Sadrizadeh(Slides 3 to 55)


1 Subject: Discovering Overlapping Groups in Social Media. Professor: Dr. Sh. Esmaili. Students: Mr. Hossien Sadrizadeh (Slides 3 to 55), Mr. Houshyar Mohammadi Talvar (Slides 57 to 78). Date: June 21st, 2012 (Thursday, Tir 1st, 1391). 1/79 Discovering Overlapping Groups in Social Media

2 Mr. Hossien Sadrizadeh: slides 3 to 55.

3 Introduction. The following social media sites are attracting more users than ever: Facebook, Twitter, Wikipedia, Blogger, Myspace. In 2009, the global time spent on social media sites increased by 82% over the year before. Facebook, one of the most popular social media sites, has more than 500 million active users, and the number is still increasing.

4 Introduction (continued). What kinds of activities do people engage in on social media? Social media websites let users participate in social activities, for example: connecting to other like-minded people; updating their status; posting blogs; uploading photos; bookmarking and tagging. People can also join groups on different websites. For instance, fans of sports teams can join dedicated groups, share their opinions on team performance, and comment on the latest news about players.

5 Group - Community. A group (community) can be considered a set of users in which each user interacts more frequently with users within the group than with users outside it. Some social media websites (Flickr, YouTube) provide explicit groups that users can join. Some dynamic sites (Twitter, Delicious) have no clear group structure, so community detection is needed to discover the groups.

6 Group - Community. In social media, a community is a group of people who are more similar to people within the group than to people outside it. Homophily is one of the important reasons people connect with others. For example: people from the same city talk to each other more frequently; people with similar political viewpoints are more likely to vote for the same candidates; people watch the same movies because they like the same movie stars.

7 Why groups? Group-level investigation can provide useful information. Studying individual behaviour is usually difficult for a large population. Studying statistics at the website level often fails to capture sufficient detail.

8 An example of making groups. We have a set of 50 people and want to make two sets with the following properties: the first set contains the people whose names start with "J"; the second set contains the people whose names start with "W". No name can start with both letters, so the two sets do not overlap. (Example made by the presenter, Hossien Sadrizadeh; see http://en.wikipedia.org/wiki/Partition_of_a_set)

9 Another example of making groups. We have a set of 50 people and want to make two sets with the following properties: the first set contains the people whose names start with "J"; the second set contains the people whose names end with "W". A name can satisfy both conditions, so the two sets may overlap. Source: Adaptation And Enhancement Of Evaluation Measure To Overlapping Graph Clustering (Tatiana Gossen, Michael Kotzyba,)
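The difference between the two grouping examples can be sketched with Python sets. The names below are hypothetical and chosen only so that one of them starts with "J" and ends with "w"; they are not taken from the presentation.

```python
# Hypothetical names, chosen only for illustration.
people = ["John", "Jane", "William", "Wendy", "Jarrow", "Andrew", "Matthew"]

# Slide 8: both groups are defined by the first letter.
# No name starts with both "J" and "W", so the groups are disjoint.
first_j = {p for p in people if p.startswith("J")}
first_w = {p for p in people if p.startswith("W")}

# Slide 9: the second group is defined by the last letter instead,
# so a name such as "Jarrow" falls into both groups.
last_w = {p for p in people if p.lower().endswith("w")}

print(first_j & first_w)  # set(): disjoint groups
print(first_j & last_w)   # {'Jarrow'}: overlapping groups
```

The first pair of groups behaves like blocks of a partition, while the second pair shares a member, which is the situation overlapping community detection has to handle.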

10 Overlapping - Introduction. The multiple interactions in social activities imply that community structures often overlap. Example: one person belongs to several communities. The idea is to take advantage of the network information between users and tags in social media and to discover these overlapping communities with co-clustering. Co-clustering is a way to obtain this kind of community structure.

11 Overlapping. When a website has explicit groups and allows users to join more than one group based on their personal preferences, overlapping takes place. When no explicit groups are available, a community detection algorithm can be used to obtain such groups.

12 Community detection. Community detection is usually based on structural features (links). Figure: a sketch of a small network displaying community structure, with three groups of nodes with dense internal connections and sparser connections between groups.

13 Co-Clustering. The graph on the right has two types of nodes: vertices u1-u5 on the left for users, and vertices t1-t4 on the right for tags. Edges represent the tag subscription relation between users and tags. If we use a method that produces two clusters, we see that u3 is associated with both clusters.

14 Co-Clustering. There are two ways of clustering: vertex clustering and edge clustering. Clustering edges is better than clustering vertices, because edge clustering usually yields overlapping communities.

15 Our contribution (the research team's). We propose to discover overlapping communities in social networks. We use user-tag subscription information instead of user-user links. We obtain clusters containing users and tags simultaneously.

16 Co-Clustering. In this graph, the edges connecting to nodes t1, t2 and to nodes t3, t4 are clustered into two separate groups, both containing user u3.

17 Community - Mathematical Definition. Suppose a community $C_i$ ($1 \le i \le k$) is a subset of users and tags, where $k$ is the number of communities. Communities usually overlap: $C_i \cap C_j \neq \emptyset$. We use an adjacency matrix (a sparse matrix) to represent the relation between users and their subscribed tags.

18 Adjacency Matrix via Incidence Matrix

19 User-Tag Network. In a user-tag network, each edge is associated with a user vertex $u_i$ and a tag vertex $t_p$. We can use an incidence matrix: each edge vector in this matrix has $N_u + N_t$ entries ($N_u$ for users and $N_t$ for tags). For example, the edge between u1 and t1 in the following graph is:

20 User-Tag Network: the incidence matrix.

21 Why is the incidence matrix useful? It is a sparse matrix. We can implement it with a linked list (or a doubly linked list).
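As a minimal sketch of the edge representation described on slides 19 to 21, the code below builds the incidence row of one edge and its sparse form. The function names and the 0-based indexing are assumptions made for illustration; the slides suggest a linked list, and a tuple of the two non-zero column indices is used here as a simpler stand-in.

```python
def edge_vector(user_idx, tag_idx, n_users, n_tags):
    """Dense incidence row for edge e(u_i, t_p): length N_u + N_t, two ones."""
    row = [0] * (n_users + n_tags)
    row[user_idx] = 1            # position of the user vertex
    row[n_users + tag_idx] = 1   # tag positions come after all users
    return row


def sparse_edge(user_idx, tag_idx, n_users):
    """Sparse form: keep only the two non-zero column indices."""
    return (user_idx, n_users + tag_idx)


# Edge between u1 and t1 in a graph with 5 users and 4 tags (0-based indices).
dense = edge_vector(0, 0, n_users=5, n_tags=4)
print(dense)                  # [1, 0, 0, 0, 0, 1, 0, 0, 0]
print(sparse_edge(0, 0, 5))   # (0, 5)
```

Each row holds exactly two non-zero entries regardless of how many users and tags exist, which is why a sparse representation pays off.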

22 Overlapping co-clustering problem. The overlapping co-clustering problem can be stated formally as follows. Input: a user-tag subscription matrix of size $N_u \times N_t$, where $N_u$ and $N_t$ are the numbers of users and tags, respectively; and $k$, the number of communities. Output: $k$ overlapping communities, each consisting of both users and tags.

23 The Co-Clustering Framework. A user usually has several friendships, but a link is usually related to only one community, so we cluster edges instead of nodes. After obtaining edge clusters, communities can be recovered by replacing each edge with its two vertices, i.e., a node is in a community if any of its connections is in the community. The communities obtained this way are often highly overlapping.
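The recovery step (replace each clustered edge with its two vertices) can be sketched as follows. The toy edges and their cluster labels are hypothetical; they only mimic the situation where user u3 has edges in both clusters.

```python
from collections import defaultdict

# Hypothetical edge-cluster assignment for a toy user-tag graph.
# The edges and labels are illustrative, not read from the slides' figure.
edge_cluster = {
    ("u1", "t1"): 0, ("u2", "t2"): 0, ("u3", "t1"): 0, ("u3", "t2"): 0,
    ("u3", "t3"): 1, ("u4", "t3"): 1, ("u4", "t4"): 1, ("u5", "t4"): 1,
}


def communities_from_edges(edge_cluster):
    """A node joins a community if any of its edges is in that edge cluster."""
    communities = defaultdict(set)
    for (user, tag), label in edge_cluster.items():
        communities[label].update((user, tag))
    return dict(communities)


communities = communities_from_edges(edge_cluster)
overlap = communities[0] & communities[1]
print(sorted(overlap))  # ['u3']
```

Every edge belongs to exactly one cluster, yet u3 ends up in both communities because its edges are split between them.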

24 Making Categories - Finding Clusters. Communities that aggregate similar users and tags together can be detected by maximizing intra-cluster similarity, as shown in the following equation (this formulation can be solved by the k-means algorithm):

25 Disadvantage of k-means clustering. K-means is not efficient for large-scale data sets. What should we do? The authors' proposal is to use a variant of k-means called EdgeCluster, a scalable algorithm for extracting communities from sparse networks. Why is EdgeCluster efficient? Because each centroid is compared only to the small set of edges that are correlated with it. It is reported to be able to cluster a sparse network with more than one million nodes into thousands of clusters in tens of minutes.

26 Density. The expected density of the user-tag network is given by the following equation:

27 Key Step in Clustering Edges: define edge similarity. Given two edges $e(u_i, t_p)$ and $e'(u_j, t_q)$ in a user-tag graph, the similarity between them can be defined by the following equation:

28 Similarity Schemes for Clustering. There are three similarity schemes: Independent Learning, Normalized Learning, and Correlational Learning. The authors' framework can cover different similarity schemes.

29 Independent Learning: the Kronecker delta function. A common approach is to use a similarity based on the Kronecker delta function. The similarity can be represented by the following function (the user similarity can be defined in the same way):
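The slide's equation did not survive in this transcript, so the sketch below rests on an assumption: that the edge similarity is a weighted sum of a user similarity and a tag similarity, with weight `alpha`, and that under Independent Learning each of the two is a Kronecker delta. The function names are illustrative.

```python
def kronecker_delta(a, b):
    """1 if the two vertices are identical, 0 otherwise."""
    return 1 if a == b else 0


def edge_similarity(e1, e2, alpha=0.5):
    """Assumed form: alpha * S_u(u_i, u_j) + (1 - alpha) * S_t(t_p, t_q)."""
    (u_i, t_p), (u_j, t_q) = e1, e2
    user_sim = kronecker_delta(u_i, u_j)
    tag_sim = kronecker_delta(t_p, t_q)
    return alpha * user_sim + (1 - alpha) * tag_sim


print(edge_similarity(("u3", "t1"), ("u3", "t3")))  # 0.5: same user, different tags
print(edge_similarity(("u1", "t1"), ("u3", "t1")))  # 0.5: different users, same tag
print(edge_similarity(("u1", "t1"), ("u4", "t3")))  # 0.0: nothing shared
```

Under this assumption two edges are similar only when they share an endpoint, which treats all users (and all tags) as independent of one another.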

30 Independent Learning - Cosine Similarity (continued). Cosine similarity is widely used to measure the similarity between two vectors. Given two edges $e(u_i, t_p)$ and $e'(u_j, t_q)$, the cosine similarity can be defined by the following equation:

31 Independent Learning - Cosine Similarity (continued). The cosine of two vectors can be easily derived from the Euclidean dot product formula. Given two vectors of attributes, A and B, the cosine similarity, $\cos\theta$, is represented using a dot product and the vector magnitudes: $\cos\theta = \frac{A \cdot B}{\|A\| \, \|B\|}$.

32 An example of cosine similarity. Suppose we have the two vectors $\alpha = (1, 2, 3)$ and $\beta = (2, 5, -3)$. Their similarity is $\cos\theta = \frac{1 \cdot 2 + 2 \cdot 5 + 3 \cdot (-3)}{\sqrt{14} \, \sqrt{38}} = \frac{3}{\sqrt{532}} \approx 0.13$. What is the range of the similarity? The resulting similarity lies in $[-1, 1]$: $-1$ means exactly opposite, $1$ means exactly the same, and $0$ usually indicates independence.
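The example with the vectors (1, 2, 3) and (2, 5, -3) can be checked with a few lines of Python. The function name is illustrative; only the standard library is used.

```python
import math


def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their Euclidean norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


alpha = (1, 2, 3)
beta = (2, 5, -3)
print(round(cosine_similarity(alpha, beta), 4))  # 0.1301
```

The dot product is 3 and the norms are the square roots of 14 and 38, so the two vectors are close to orthogonal.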

33 Text matching. The attribute vectors A and B are usually the term frequency vectors of the documents. Cosine similarity can be seen as a method of normalizing document length during comparison. In information retrieval, the cosine similarity of two documents ranges from 0 to 1, since term frequencies (tf-idf weights) cannot be negative; the angle between two term frequency vectors cannot be greater than 90°. Source: Co-clustering documents and words using Bipartite Spectral Graph Partitioning (S. Dhillon)

34 Normalized Learning. Let $d_{u_i}$ denote the degree of user $u_i$, and let $d_{t_p}$ denote the degree of tag $t_p$ in a user-tag network. After normalization, the edge $e(u_i, t_p)$ can be represented in the following form (due to the research team):

35 Normalized Learning (continued). Given two edges $e(u_i, t_p)$ and $e'(u_j, t_q)$, the cosine similarity between them after normalization can be written as the following equation:

36 Normalized Learning (continued). If we set $\alpha = 0.5$, we can derive the following equation, which gives the normalized edge similarity. The formula says that the similarity between two users depends not only on the users but also on the tags.

37 Correlational Learning. Users often use more than one tag to describe the main topic of a bookmark; tags that are used together indicate a correlation. In a user-tag network, a user can be viewed as a vector by treating tags as features, and a tag can likewise be viewed as a vector by treating users as features. We use a latent space to represent the users and to capture the correlation between their tags.

38 Correlational Learning (continued). Take the following basis vectors along the orthogonal axes of the latent space: User vectors in the original space can be mapped to new vectors in the latent space, as shown here: M is a linear mapping from the original space to the latent space.

39 Correlational Learning (continued). We map the vectors from the original space to the latent space with a mapping function, like this:

40 Correlational Learning (continued). Another method for selecting a set of orthogonal basis vectors is Singular Value Decomposition (SVD). The singular value decomposition of a user-tag network M is given by the following formula:

41 Correlational Learning (continued). User vectors in the latent space can be formulated as follows: We need only a small set of vectors to compute them:

42 Correlational Learning (continued). User similarity and tag similarity in the latent space are defined by the following formulas: Z is obtained by solving a generalized eigenvector problem.

43 Correlational Learning (continued). The adjacency matrix and the Laplacian matrix are:

44 Correlational Learning (continued). The generalized eigenvector problem can be rewritten as: After simple manipulation, we obtain:

45 Singular Value Decomposition - SVD. SVD is based on a theorem from linear algebra which says that a rectangular matrix A can be broken down into the product of three matrices: an orthogonal matrix U, a diagonal matrix S, and the transpose of an orthogonal matrix V. The Gram-Schmidt orthogonalization process is a method for converting a set of vectors into a set of orthonormal vectors; it uses normalization. References: 1. Linear Algebra, Kenneth Hoffman (chapter 8: vector spaces). 2. Numerical Analysis, Samuel D. Conte (chapter 4: matrices, eigenvalues, and eigenvectors).
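A minimal sketch of the classical Gram-Schmidt process mentioned above, using plain Python lists. The input vectors are hypothetical, and the tolerance used to drop linearly dependent vectors is an assumption.

```python
import math


def gram_schmidt(vectors):
    """Classical Gram-Schmidt: return an orthonormal list spanning the inputs."""
    basis = []
    for v in vectors:
        w = list(v)
        for q in basis:
            # Remove the component of v that lies along the basis vector q.
            proj = sum(x * y for x, y in zip(v, q))
            w = [wi - proj * qi for wi, qi in zip(w, q)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm > 1e-12:  # skip vectors that depend on earlier ones
            basis.append([x / norm for x in w])
    return basis


q1, q2 = gram_schmidt([(3.0, 1.0), (2.0, 2.0)])
print(q1, q2)
```

Each output vector has unit length and is orthogonal to the ones before it, which is the property required of the columns of U and V.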

46 Another view of Gram-Schmidt

47 Singular Value Decomposition - SVD. The theorem is usually presented with a formula like this: $A = U S V^{T}$

48 Example - SVD. Start with the matrix: To find U, we first have to compute $A A^{T}$.

49 Example - SVD (continued). Next, we have to find the eigenvalues and corresponding eigenvectors of $A A^{T}$. We then store the eigenvectors as the columns of a matrix, ordered by the size of the corresponding eigenvalues.

50 Example - SVD (continued). Finally, we have to convert this matrix into an orthogonal matrix.

51 Example - SVD (continued). We use a similar method to find V, based on $A^{T} A$: find the eigenvalues of $A^{T} A$.

52 Example - SVD (continued). For these eigenvalues we have the following eigenvectors: Ordered by the size of the eigenvalues, we have:

53 Example - SVD (continued). After the orthonormalization process, we convert it to an orthogonal matrix:

54 Example - SVD (continued). For S, we take the square roots of the non-zero eigenvalues and place them on the diagonal, putting the largest in $S_{11}$, the next largest in $S_{22}$, and so on, down to the smallest in $S_{mm}$. The non-zero eigenvalues of $A A^{T}$ and $A^{T} A$ are the same. The diagonal entries of S are the singular values of A. The columns of U are called left singular vectors, and the columns of V are called right singular vectors.
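The relation between singular values and eigenvalues can be checked on a small case. The 2x2 matrix below is hypothetical (the matrix used on the slides is not in this transcript), and the closed-form eigenvalue formula applies only to symmetric 2x2 matrices.

```python
import math


def matmul(x, y):
    """Product of two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]


def transpose(x):
    return [list(col) for col in zip(*x)]


def eigenvalues_sym_2x2(m):
    """Eigenvalues of a symmetric 2x2 matrix, largest first."""
    trace = m[0][0] + m[1][1]
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    disc = math.sqrt(trace * trace - 4 * det)
    return [(trace + disc) / 2, (trace - disc) / 2]


A = [[3, 0], [4, 5]]  # hypothetical matrix, not the one from the slides
eig_aat = eigenvalues_sym_2x2(matmul(A, transpose(A)))
eig_ata = eigenvalues_sym_2x2(matmul(transpose(A), A))
singular_values = [math.sqrt(v) for v in eig_aat]

print(eig_aat, eig_ata)  # [45.0, 5.0] [45.0, 5.0]
print(singular_values)   # square roots of 45 and 5
```

Both products give the eigenvalues 45 and 5, so the diagonal of S holds their square roots in decreasing order.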

55 Example - SVD (continued). Now we have the following matrices:

56 Mr. Houshyar Mohammadi Talvar: slides 57 to 78.

57 Synthetic Data and Findings. Clustering evaluation is difficult when there is no ground truth. We first introduce the synthetic data and how they are generated, then the clustering quality measure Normalized Mutual Information (NMI). Finally, the NMI of different clustering methods is reported. Synthetic Data Generation: we develop a synthetic data generator that takes the numbers of clusters, users, and tags as input. First, users and tags are split evenly among the clusters. Then, in each cluster, users and tags are randomly connected with a specified density (e.g., 0.8).
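A minimal sketch of a generator that follows the two steps described above. The function name, the round-robin assignment of users and tags to clusters, and the seed parameter are assumptions; the slides do not show the authors' actual generator.

```python
import random


def generate_user_tag_graph(n_clusters, n_users, n_tags, density, seed=0):
    """Return a set of (user, tag) edges with the given intra-cluster density."""
    rng = random.Random(seed)
    edges = set()
    for c in range(n_clusters):
        # Split users and tags evenly: index modulo n_clusters picks the cluster.
        users = [u for u in range(n_users) if u % n_clusters == c]
        tags = [t for t in range(n_tags) if t % n_clusters == c]
        # Connect each user-tag pair inside the cluster with probability `density`.
        for u in users:
            for t in tags:
                if rng.random() < density:
                    edges.add((u, t))
    return edges


edges = generate_user_tag_graph(n_clusters=5, n_users=50, n_tags=50, density=1.0)
print(len(edges))  # 500: 5 clusters x 10 users x 10 tags at density 1
```

With a density below 1 the edge count varies with the seed, and this sketch creates no links between clusters.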

58 Figure 2 shows a toy example of the synthetic user-tag graph, in which users are labeled u1-u7 and tags t1-t8. Three overlapping clusters are highlighted with different colors.

59 NMI Evaluation in Synthetic Data. Normalized Mutual Information (NMI) is commonly used to measure clustering quality. Given two clusterings X and Y, the NMI is defined below. The NMI is computed in two steps. First, find the pairs of clusters that are closest to each other in the two clusterings. Second, average the mutual information between those pairs of clusters. The higher the NMI value, the more similar the two clusterings are. If two clusterings X and Y are exactly the same, the NMI value is 1.
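The NMI definition used on the slide is not in this transcript, and the two-step procedure above is a variant for overlapping clusterings. The sketch below is a swapped-in, simpler technique: the standard NMI for non-overlapping label assignments, normalized by the mean of the two entropies (an assumption). It only illustrates the property that identical clusterings score 1.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy of a label assignment (natural logarithm)."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def nmi(x, y):
    """Standard NMI for two hard clusterings, normalized by the mean entropy."""
    n = len(x)
    joint = Counter(zip(x, y))
    px, py = Counter(x), Counter(y)
    mi = sum((c / n) * math.log(c * n / (px[a] * py[b]))
             for (a, b), c in joint.items())
    hx, hy = entropy(x), entropy(y)
    if hx + hy == 0:
        return 1.0  # both clusterings place every item in a single cluster
    return 2 * mi / (hx + hy)


print(round(nmi([0, 0, 1, 1], [1, 1, 0, 0]), 6))  # 1.0: same grouping, renamed labels
print(round(nmi([0, 0, 1, 1], [0, 1, 0, 1]), 6))  # 0.0: independent groupings
```

The score ignores how clusters are named, which matters because a clustering algorithm assigns arbitrary cluster ids.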

60 NMI and Number of Clusters. We generate another data set with 1,000 users and 1,000 tags and a number of clusters ranging from 5 to 50; the cluster density is set to 1, so that all users connect to all tags within each cluster. Figure 3. NMI performance w.r.t. number of clusters.

61 NMI and Link Density. We also study how intra-cluster link density affects clustering in synthetic data sets. We created synthetic data sets (50 clusters, 1,000 users, and 1,000 tags) with intra-cluster densities ranging from 0.1 to 1. Figure 4. NMI performance w.r.t. intra-cluster link density.

62 Figure 3. NMI performance w.r.t. number of clusters. Figure 4. NMI performance w.r.t. intra-cluster link density. See the Correlational Learning curves in Figures 3 and 4.

63 Social Media Data and Findings. BlogCatalog is a social blog directory where bloggers can register their blogs under predefined categories. We crawled user names, user ids, their friends, blogs, the associated tags, and blog categories. Delicious is a social bookmarking website that allows users to tag, manage, and share online resources (e.g., articles). For each resource, users are asked to provide several tags to summarize its main topic.

64 Interplay between Link Connection and Tag Sharing. There exist explicit and implicit relations between users. Examples of explicit relations are the friends or fans people choose. Examples of implicit relations are tag sharing, i.e., people who use the same tags. Is there any correlation between the two kinds of relations? What drives people to connect to others? Is it a random operation? We conducted a statistical analysis of user-user links and tag sharing. In the first study, we fix users who have or do not have connections with others, then show the tag sharing probabilities.

65 Interplay between Link Connection and Tag Sharing (continued). Figure 5 shows the tag sharing probabilities in the BlogCatalog and Delicious data sets. For the Delicious data, the friends network and the fans network are evaluated separately. Figure 5. The x-axis represents the number of tags that two users share.

66 Interplay between Link Connection and Tag Sharing (continued). Figures 6 and 7 show the probability that two users are connected if they share tags, in BlogCatalog and Delicious, respectively. In Figure 6, the probability of a link between two users increases with the number of tags they share. Figure 6. Link probability w.r.t. tag sharing in BlogCatalog.
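The kind of statistic plotted in Figures 6 and 7 can be sketched as follows: for each number of shared tags, the fraction of user pairs that are linked. The function name and the toy data are hypothetical and are not drawn from BlogCatalog or Delicious.

```python
from collections import defaultdict
from itertools import combinations


def link_probability_by_shared_tags(user_tags, links):
    """For each shared-tag count, the fraction of user pairs that are linked."""
    pairs = defaultdict(int)
    linked = defaultdict(int)
    for a, b in combinations(sorted(user_tags), 2):
        shared = len(user_tags[a] & user_tags[b])
        pairs[shared] += 1
        if (a, b) in links or (b, a) in links:
            linked[shared] += 1
    return {k: linked[k] / pairs[k] for k in pairs}


# Hypothetical toy data: four users, their tags, and two undirected links.
user_tags = {
    "a": {"health", "food"},
    "b": {"health", "food"},
    "c": {"health"},
    "d": {"games"},
}
links = {("a", "b"), ("a", "c")}

print(link_probability_by_shared_tags(user_tags, links))
# {2: 1.0, 1: 0.5, 0: 0.0}
```

This enumerates every user pair, which is quadratic in the number of users; a real crawl would need sampling or an index from tags to users.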

67 Interplay between Link Connection and Tag Sharing (continued). Figure 7. Link probability w.r.t. tag sharing in Delicious.

68 Clustering Evaluation. The clustering evaluation consists of three studies: 1. Cross-validation is performed to demonstrate the effectiveness of different clustering algorithms on the BlogCatalog data set. 2. We study the correlation between user connectivity and co-occurrence in extracted communities. 3. Concrete examples illustrate what the clusters are about.

69 1) Comparative Study. In BlogCatalog, categories for each blog are selected by the blog owner from a predefined list. With category information, procedures such as cross-validation (e.g., treating categories as class labels and cluster memberships as features) can be used to show the clustering quality. Linear SVM is adopted in our experiments since it scales well to large data sets. As recommended by Tang et al., 1,000 communities are used in our experiments. We vary the fraction of training data from 10% to 90% and use the rest as test data. This experiment is repeated 10 times, and the average Micro-F1 and Macro-F1 measures are reported.
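The two reported measures can be sketched in plain Python. This version assumes each blog has exactly one category, which is a simplification; the function name and the toy labels are hypothetical.

```python
def f1_scores(y_true, y_pred):
    """Return (micro_f1, macro_f1) for single-label predictions."""
    classes = sorted(set(y_true) | set(y_pred))
    tp = {c: 0 for c in classes}
    fp = {c: 0 for c in classes}
    fn = {c: 0 for c in classes}
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    # Macro-F1 averages per-class F1; Micro-F1 pools the counts first.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in classes) / len(classes)
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return micro, macro


y_true = ["health", "health", "health", "personal"]
y_pred = ["health", "health", "personal", "personal"]
micro, macro = f1_scores(y_true, y_pred)
print(micro, round(macro, 4))  # 0.75 0.7333
```

Macro-F1 weights every category equally, so it is more sensitive than Micro-F1 to performance on small categories.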

70 Table II shows five different clustering methods and their prediction performance. In this table, the fourth algorithm, EdgeCluster, uses the user-user network rather than the user-tag network. Dhillon's co-clustering algorithm is based on Singular Value Decomposition (SVD) of the normalized user-tag matrix. As shown in Table II, Correlational Learning consistently performs better, especially when the training set is small. According to Table II, normalization does not improve performance, which suggests that normalization should be applied cautiously. Dhillon's co-clustering method, which can only deal with non-overlapping clustering, does not perform well compared to the other methods.

71 2) Connectivity Study. We study the correlation between user co-occurrence in extracted communities and the actual social connections between them. We also study the connectivity between users who are in the top similar list. 1,000 overlapping communities are extracted by Correlational Learning.

72 We study the dis-connectivity between users who are most similar. Figure 8 shows that the probability of being disconnected is higher than 96% in BlogCatalog and 99% in Delicious, which means that the majority of homogeneous users are not connected in the actual social networks. For example, users marama6 and ameer1577 are both interested in the online game "World of Warcraft". Figure 8. Probability of being disconnected between top similar users.

73 3) Illustrative Examples. Health is the second largest category (the largest is personal) in BlogCatalog, a hot topic that attracts a lot of attention.

74 The largest cluster about Health obtained by Correlational Learning is cluster-health, with 127 users and 102 tags. The cluster that has the maximum user overlap with cluster-health is cluster-nutrition, with 83 users and 25 tags. Their tag clouds are shown in Figures 10 and 11. The two clusters have 18 users and 3 tags (health, nutrition, and weight loss) in common. Both clusters are related to health, but the first has an emphasis on physical health, highlighted by the tags arthritis, drugs, food, and dentist, while the second is more about nutrition.

75 The top 102 tags of category-health are compared to the tags of cluster-health, and the top 25 tags of category-health to those of cluster-nutrition. The numbers of shared tags are 16 for cluster-health and 9 for cluster-nutrition.

76 In addition, we aggregate the tags of the users in cluster-health and present the 102 most frequent tags in Figure 12. Comparing these tags with those of cluster-health, 40 tags are in common. Many tags, such as environment, humor, and jokes, are not present in the tag cloud of cluster-health, which suggests that these users have other interests besides health. A similar pattern is observed for cluster-nutrition.


78 Conclusions and Future Work. This study suggests more interesting problems that are worth exploring further. Formulating the co-clustering problem as an objective function and maximizing it is one direction to work on. We proposed a framework to study the overlapping clustering of users and tags in online social media, which helps to understand the major concerns within the groups. Experimental results on synthetic data reveal that Correlational Learning is very effective in recovering the overlapping cluster structures, even when the intra-cluster density is low.

79 ?

