Published by Milton Morgan. Modified over 8 years ago.
1
Structural and Temporal Analysis of the Blogosphere Through Community Factorization
Y. Chi, S. Zhu, X. Song, J. Tatemura, B.L. Tseng
Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2007
Advisor: Dr. Jia-Ling Koh
Speaker: Chien-Liang Wu
Date: 2016/2/13
2
Outline
  Introduction
  Definition
  Basic Idea
  Community Factorization
  Experimental Studies
3
Introduction
Blog
  Self-publishing media on the web
  Growing quickly and becoming important
Key difference between the blogosphere and the Web: the lifetime of their contents (i.e., pages and links)
  Blog
    A blog consists of a temporal sequence of entries
    An event occurs, entries are published, and the entries soon become obsolete
    Short lifetime
  Web
    Longer lifetime
    A new page may refer to a very old page (such as an authoritative page)
4
Introduction (cont.)
Traditional web analysis
  A dense subgraph of web pages is treated as a community
  Temporal dynamics: observe how the subgraph grows over time
Blogosphere analysis
  A dense subgraph grows only within a short time span
  Traditional analysis can capture only the dynamics of a short-term activity (such as a single thread of discussion)
5
Introduction (cont.)
Alternative method
  Accumulate such links over a very long period
  Generate a graph of blogs
  The graph is static and changes only incrementally
  Detailed temporal behavior is missed
Example: two communities, Politics (P) and Economics (E)
  One becomes inactive while the other becomes active
  Some blogs are interested in both P and E, moving from one community to the other
  Because the two communities overlap, they appear as a single community in the aggregated graph
6
Definition
Community
  A set of blogs that communicate with each other in a synchronized manner
  Communication is triggered by some events
  This results in a number of dense subgraphs, each of which is a short-term thread of discussion
Community graph
  The structure of a community, representing how much one blog communicates with another
Community intensity
  The activity level of a community at a particular time
7
Observation
Communication in a community produces dense subgraphs
The structure of these subgraphs reflects the community graph structure
A single dense subgraph does not necessarily reflect the entire structure of a community
  Since a member does not always participate in a thread of discussion, a community may appear as smaller, disconnected subgraphs at a particular time
  Members of different communities may participate in a single subgraph
8
Basic Idea
Represent a community structure as a combination of the observed subgraphs
Transform the problem
  How to find the coefficients of this combination as well as the values of the community intensity over time
Community factorization
  Find the parameters that best explain the observed data
  Use the intensities as weighting factors to approximate the observed data
9
Overview of method
Identify dense subgraphs by applying a graph partitioning algorithm
A community graph is defined as a linear combination of the dense subgraphs
10
Problem Formulation
n blogs b_i (i = 1, …, n) in the blogosphere
Linking activity is aggregated into a graph structure for each time window s (s = 1, …, t)
These graphs A_s (s = 1, …, t) are stacked into a tensor A
Extract dense subgraphs from each observed graph A_s
These dense subgraphs are stacked into another tensor B
Given A and B, find k communities such that their community graphs and intensities best explain the observed data A
  Community graphs {C_l} (l = 1, …, k)
  Community intensities {v_sl} (s = 1, …, t)
11
Data Tensor A
A = [A_1, …, A_t] represents the blogosphere over time
Snapshot graph A_s
  An adjacency matrix that presents a snapshot of activity in the blogosphere at time s
  (A_s)_ij: the count of links from b_i to b_j in the s-th time window
A link from b_i to b_j at time s (i ≠ j)
  b_i publishes an entry at time s that has a hyperlink pointing to any content of b_j
  These links are counted for each time window (say, a day)
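As a minimal sketch of how the snapshot matrices could be assembled, the Python below counts links per time window. The record layout (window index, source blog, target blog), the function name and the toy data are assumptions made for illustration; they do not come from the slides.

```python
def build_snapshots(links, n, t):
    """Return [A_1, ..., A_t]; each A_s is an n x n matrix of link counts."""
    snapshots = [[[0] * n for _ in range(n)] for _ in range(t)]
    for s, i, j in links:
        if i != j:  # a link is only defined between two different blogs
            snapshots[s][i][j] += 1
    return snapshots

# Hypothetical link records: (time window s, source blog i, target blog j).
links = [(0, 0, 1), (0, 0, 1), (0, 2, 1), (1, 1, 0), (1, 2, 2)]
A = build_snapshots(links, n=3, t=2)
```

Here A[0][0][1] counts the two links from blog 0 to blog 1 in the first window, and the self-link (1, 2, 2) is ignored.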
12
Basis Tensor B
For each snapshot graph A_s, identify dense subgraphs
  Apply a graph partitioning algorithm
    Shi's normalized cut or Newman's optimal modularity
  Remove insignificant subgraphs
    e.g., a subgraph with only a couple of nodes
  This leaves m_s graphs, called basis subgraphs
For the t time windows
  There are m = m_1 + … + m_t basis subgraphs in total
  Stack them together to get the basis tensor B
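The pipeline above can be sketched in Python as follows. Note the substitution: normalized cut and modularity optimization are too long to show here, so this sketch uses connected components (ignoring link direction) as a crude stand-in for the partitioning step. The function names and the min_nodes threshold are illustrative assumptions.

```python
def dense_subgraphs(A_s, min_nodes=3):
    """Split one snapshot into basis subgraphs (stand-in: connected components)."""
    n = len(A_s)
    seen, basis = set(), []
    for start in range(n):
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:
            i = stack.pop()
            if i in component:
                continue
            component.add(i)
            # Follow links in either direction.
            stack.extend(j for j in range(n)
                         if (A_s[i][j] or A_s[j][i]) and j not in component)
        seen |= component
        if len(component) >= min_nodes:  # drop insignificant subgraphs
            basis.append([[A_s[i][j] if i in component and j in component else 0
                           for j in range(n)] for i in range(n)])
    return basis

def basis_tensor(snapshots, min_nodes=3):
    """Stack the basis subgraphs of all time windows into one list B_1..B_m."""
    return [B for A_s in snapshots for B in dense_subgraphs(A_s, min_nodes)]

def snapshot(n, edges):
    M = [[0] * n for _ in range(n)]
    for i, j in edges:
        M[i][j] += 1
    return M

A_1 = snapshot(5, [(0, 1), (1, 2), (2, 0), (3, 4)])  # a triangle and a pair
A_2 = snapshot(5, [(2, 3), (3, 4), (4, 2)])          # a triangle
B = basis_tensor([A_1, A_2])
```

With min_nodes=3 the two-node pair in A_1 is discarded, so each window contributes one basis subgraph and m = 2.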
13
Community Graphs and Intensities
[Figure: the basis subgraphs B_1, B_2, …, B_m stacked into the basis tensor B = [B_1, …, B_m]]
A community graph is a linear combination of the basis subgraphs:
  C_l = u_1l B_1 + u_2l B_2 + u_3l B_3 + … + u_ml B_m    (1)
u_pl is a weight that indicates how important the p-th basis subgraph is to the l-th community
The coefficients {u_pl} are parameters that need to be estimated
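Equation (1) can be illustrated with a small Python sketch. The basis matrices and weights below are made-up toy values, and the function name is an assumption.

```python
def community_graph(basis, weights):
    """C_l = sum over p of u_pl * B_p, for one community l."""
    n = len(basis[0])
    return [[sum(u * B[i][j] for u, B in zip(weights, basis))
             for j in range(n)] for i in range(n)]

# Three hypothetical 2 x 2 basis subgraphs and the weights u_1l, u_2l, u_3l.
B_1 = [[0, 1], [0, 0]]
B_2 = [[0, 0], [2, 0]]
B_3 = [[0, 1], [1, 0]]
C = community_graph([B_1, B_2, B_3], [0.5, 0.0, 2.0])
```

A weight of zero (here for B_2) means that basis subgraph plays no part in this community.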
14
Community Graphs and Intensities (cont.)
Communities behave concurrently
  At time s, multiple community graphs can affect the structure of A_s
Community intensity v_sl
  How much the l-th community contributes at time s
Use the intensity-weighted sum of the community graphs to represent the observed data A_s
The problem is formulated as minimization of the error
15
Community Graphs and Intensities (cont.)
For the t time windows, minimize the total error: equation (2)
Plug equation (1) into equation (2)
Minimize the resulting objective function, equation (3), where ×_3 denotes the 3-mode multiplication of a tensor by a matrix
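The formulas for equations (2) and (3) did not survive in this transcript. One reading that is consistent with the definitions above is given below. It assumes that the error is the squared Frobenius norm, that U = [u_pl] is an m × k matrix and that V = [v_sl] is a t × k matrix; the exact norm and any constant factor are assumptions.

```latex
\begin{align}
  \min_{\{u_{pl}\},\,\{v_{sl}\}} \;
    & \sum_{s=1}^{t} \Big\| A_s - \sum_{l=1}^{k} v_{sl}\, C_l \Big\|_F^2 \tag{2} \\
  \sum_{l=1}^{k} v_{sl}\, C_l
    & = \sum_{p=1}^{m} \Big( \sum_{l=1}^{k} v_{sl}\, u_{pl} \Big) B_p
      = \sum_{p=1}^{m} \big( V U^{\top} \big)_{sp}\, B_p \notag \\
  \min_{U \ge 0,\; V \ge 0} \;
    & \big\| \mathcal{A} - \mathcal{B} \times_3 \big( V U^{\top} \big) \big\|_F^2 \tag{3}
\end{align}
```

The middle line is the substitution of equation (1): the s-th slice of the reconstruction is a combination of the basis subgraphs with coefficients taken from row s of V U^T, which is what the 3-mode product expresses.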
16
Solution by Non-negative Matrix Factorization
The objective function (3) can be written equivalently in matrix form as equation (4)
  Each column A(s) of the matrix A: stack the columns of the snapshot graph A_s into an n^2 × 1 vector
  The matrix B is obtained in a similar way
In equation (4), A and B are given
Task: search for non-negative matrices U and V that minimize J_1
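The column-stacking step can be sketched in Python as below. With A unfolded to n^2 × t and B to n^2 × m, the matrix form of the objective would plausibly read J_1 = ||A − B U V^T||_F^2 up to a constant factor; that form is an assumption, since the formula for equation (4) is not preserved here. The function names are illustrative.

```python
def vec(M):
    """Stack the columns of a square matrix into one flat list (column-major)."""
    n = len(M)
    return [M[i][j] for j in range(n) for i in range(n)]

def unfold(snapshots):
    """Put vec(A_s) in column s, giving an n^2 x t matrix as a list of rows."""
    columns = [vec(M) for M in snapshots]
    return [list(row) for row in zip(*columns)]

A_1 = [[0, 2], [1, 0]]
A_2 = [[0, 0], [3, 0]]
A_mat = unfold([A_1, A_2])
```

Applying the same unfold function to the list of basis subgraphs gives the n^2 × m matrix B.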
17
Smoothing by Regularization
Incorporate prior knowledge into the objective function through Tikhonov regularization terms: equation (5), where γ_1 and γ_2 are user-defined parameters
In this paper, γ_1 is set to 1 and R_1 to the identity matrix
For V, apply intuitive prior knowledge about temporal trends
  The difference between two consecutive (in temporal order) elements in the same column of V should be small
  Set R_2 to be a difference matrix
Tune γ_2 to demonstrate the effect of smoothness in the experiments
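To make the role of R_2 concrete, here is a minimal Python sketch. It assumes R_2 is the (t − 1) × t first-order difference matrix and that the smoothness term has the usual Tikhonov form γ_2 ||R_2 V||_F^2; both are assumptions, because the formula for equation (5) is not preserved in this transcript. All names are illustrative.

```python
def difference_matrix(t):
    """(t-1) x t matrix whose row s computes v[s+1] - v[s]."""
    R = [[0.0] * t for _ in range(t - 1)]
    for s in range(t - 1):
        R[s][s] = -1.0
        R[s][s + 1] = 1.0
    return R

def matmul(X, Y):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def smoothness_penalty(V):
    """Squared Frobenius norm of R_2 V: sum of squared consecutive differences."""
    diffs = matmul(difference_matrix(len(V)), V)
    return sum(d * d for row in diffs for d in row)

# One community (k = 1) observed over t = 4 windows.
smooth = [[1.0], [1.1], [1.2], [1.3]]
spiky = [[1.0], [0.0], [1.0], [0.0]]
penalty_smooth = smoothness_penalty(smooth)
penalty_spiky = smoothness_penalty(spiky)
```

The spiky intensity column is penalized far more heavily than the smooth one, so a larger γ_2 pushes the solution toward slowly varying intensities.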
18
Iterative Updating Rules
Solve for U and V in equation (5)
  Start by setting U and V to random non-negative matrices, then iteratively update U and V
Theorem 1: the multiplicative updating rules converge to non-negative solutions of the optimization problem whose objective function is given by equation (5)
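The update rules of Theorem 1 are not preserved in this transcript. As a rough sketch of the iteration, the Python below applies generic multiplicative updates in the style of Lee and Seung to the unregularized objective ||A − B U V^T||_F^2, that is, with γ_1 = γ_2 = 0. This is a simplification and not necessarily the rule set proved in the paper. The function names, the toy matrices and the small constant eps are assumptions.

```python
import random

def matmul(X, Y):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def transpose(X):
    return [list(col) for col in zip(*X)]

def sq_error(A, B, U, V):
    approx = matmul(matmul(B, U), transpose(V))
    return sum((a - b) ** 2 for ra, rb in zip(A, approx) for a, b in zip(ra, rb))

def scale(X, num, den, eps=1e-12):
    """Element-wise X * num / den; eps guards against division by zero."""
    return [[x * p / (q + eps) for x, p, q in zip(rx, rp, rq)]
            for rx, rp, rq in zip(X, num, den)]

def factorize(A, B, k, iterations=200, seed=0):
    rng = random.Random(seed)
    m, t = len(B[0]), len(A[0])
    # Random non-negative starting point.
    U = [[rng.random() + 0.1 for _ in range(k)] for _ in range(m)]
    V = [[rng.random() + 0.1 for _ in range(k)] for _ in range(t)]
    BtA = matmul(transpose(B), A)  # m x t
    BtB = matmul(transpose(B), B)  # m x m
    errors = [sq_error(A, B, U, V)]
    for _ in range(iterations):
        # U <- U * (B^T A V) / (B^T B U V^T V)
        U = scale(U, matmul(BtA, V),
                  matmul(matmul(BtB, U), matmul(transpose(V), V)))
        # V <- V * (A^T B U) / (V U^T B^T B U)
        V = scale(V, matmul(transpose(BtA), U),
                  matmul(V, matmul(matmul(transpose(U), BtB), U)))
        errors.append(sq_error(A, B, U, V))
    return U, V, errors

# Toy data: A is generated exactly from known non-negative factors.
B = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 0]]         # n^2 x m
U_true = [[1, 0], [1, 0], [0, 1]]                         # m x k
V_true = [[1, 0], [2, 0], [1, 1], [0, 2], [0, 1]]         # t x k
A = matmul(matmul(B, U_true), transpose(V_true))          # n^2 x t
U, V, errors = factorize(A, B, k=2)
```

Because each update multiplies by a ratio of non-negative quantities, U and V stay non-negative throughout, which is the property the slide relies on.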
19
Some Practical Issues
Size of time windows and basis subgraphs
  The algorithm is not very sensitive to these two sizes, as long as they are not overly large
  Size of time windows
    Days or weeks
    Not necessarily uniform
  Size of basis subgraphs
    Different numbers can be chosen in each time window
Number of communities
  Try different values of k and compare the reconstruction error
  Choose a k that is reasonably small and explains the data reasonably well
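One way to turn "reasonably small and explains the data reasonably well" into a rule is sketched below in Python. The rule itself (take the smallest k whose error is within a tolerance of the best error) is an assumption, not something stated on the slide, and the error values are made up purely for illustration.

```python
def choose_k(error_by_k, tolerance=0.1):
    """Smallest k whose reconstruction error is within (1 + tolerance) of the best."""
    best = min(error_by_k.values())
    return min(k for k, err in error_by_k.items() if err <= best * (1 + tolerance))

# Hypothetical reconstruction errors from runs with k = 1..5 (not real results).
error_by_k = {1: 100.0, 2: 40.0, 3: 12.0, 4: 11.5, 5: 11.2}
k = choose_k(error_by_k)
```

With these made-up numbers the error flattens after k = 3, so the rule picks 3 rather than the k with the lowest error.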
20
Experimental Studies
Synthetic data set
  150 blogs and 2 communities
NEC Laboratories America data set
  407 English blogs with 274,679 entries over 441 days (63 weeks), from 2005.07.10 to 2006.09.23
  These entries are connected by 148,681 links
  Roughly two groups of blogs: technology focus and politics focus
Benchmark data set
  From the WWW 2006 Workshop on the Weblogging Ecosystem
  8.37 million entries from 1.43 million different blog sites during a 3-week period
  Restricted to the subset of blogs that contain at least one link: 141K blogs and 1.62 million links among them
21
Synthetic Data Set
Separate two overlapping communities that have different temporal trends
22
Synthetic Data Set (cont.)
[Figure annotation: smoother]
23
Synthetic Data Set (cont.)
Generate a random number p between 1 and 3 and aggregate the next p time windows into a single one
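The aggregation step can be sketched in Python as follows. It assumes that "aggregate" means summing the link counts of the merged windows; the function name, the seed and the toy snapshots are illustrative.

```python
import random

def aggregate_windows(snapshots, max_merge=3, seed=0):
    """Repeatedly draw p in 1..max_merge and sum the next p snapshots into one."""
    rng = random.Random(seed)
    merged, sizes, start = [], [], 0
    while start < len(snapshots):
        p = rng.randint(1, max_merge)
        group = snapshots[start:start + p]  # may be shorter at the very end
        n = len(group[0])
        merged.append([[sum(M[i][j] for M in group) for j in range(n)]
                       for i in range(n)])
        sizes.append(len(group))
        start += p
    return merged, sizes

# Ten hypothetical 2 x 2 snapshots with a growing link count from blog 0 to blog 1.
snapshots = [[[0, s], [1, 0]] for s in range(10)]
merged, sizes = aggregate_windows(snapshots)
```

The merged windows have non-uniform lengths, but every link is still counted exactly once.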
24
NEC Data Set
[Figure: blog graph; the legend distinguishes technology-focus blogs from politics-focus blogs]
25
NEC Data Set (cont.)
Use the normalized cut algorithm to obtain 50 communities
Report the number of links within each community every day
26
NEC Data Set (cont.)
Size of a node: the corresponding row sum in the community graph C_l in equation (1)
Width of a link: the corresponding entry in C_l
This community is formed around an authoritative blog by David Sifry
  David Sifry posted a comprehensive study on the current status of the blogosphere
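The two drawing rules can be sketched in Python as below. The community graph values are made up, and the function name is an assumption.

```python
def drawing_attributes(C):
    """Node size = row sum of C_l; link width = the non-zero entry of C_l."""
    node_size = [sum(row) for row in C]
    link_width = {(i, j): w for i, row in enumerate(C)
                  for j, w in enumerate(row) if w > 0}
    return node_size, link_width

# Hypothetical community graph over three blogs.
C = [[0.0, 2.0, 1.0],
     [0.5, 0.0, 0.0],
     [0.0, 0.0, 0.0]]
node_size, link_width = drawing_attributes(C)
```

Blog 0 has the largest row sum, so it would be drawn as the largest node in this toy community.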
27
Benchmark Data Set
Scalability
A weak point of this algorithm: as the number of blogs increases and t decreases, it becomes difficult to extract meaningful communities
  When t/n is large, use this method
  When t/n is small, use the traditional approach
28
Running Time
Implemented in Matlab
Run on a PC with a 2 GHz Pentium IV processor and 2 GB of memory
Running time is reported in seconds per iteration
Criterion for convergence
  The algorithm can converge within 1000 iterations