1 Graph mining techniques applied to blogs. Mary McGlohon, Seminar on Social Media Analysis, Oct 2, 2007.

Presentation transcript:

3 Last week… Lots of methods for graph mining and link analysis. This week… A few examples of these methods applied to blogs.

4 Paper #1 ● Jure Leskovec, Mary McGlohon, Christos Faloutsos, Natalie Glance, and Matthew Hurst. Patterns of Cascading Behavior in Large Blog Graphs, SDM – What temporal and topological features do we observe in a large network of blogs?

7 Blogosphere network Representing blogs as graphs [Figure: the blogs slashdot, boingboing, Dlisted, and MichelleMalkin drawn first as a blog network and then as a post network]

10 Extracting subgraphs: Cascades We gather cascades using the following procedure: – Find all initiators (out-degree 0). – Follow in-links. – This produces a directed acyclic graph. [Figure: an example post network on nodes a to e and the cascades extracted from it]
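The procedure on this slide can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the input format (pairs of citing post and cited post), the function name `extract_cascades`, and the example links are all assumptions made for the sketch.

```python
from collections import defaultdict, deque


def extract_cascades(links):
    """Group post-to-post links into cascades.

    links: iterable of (citing_post, cited_post) pairs (hypothetical format).
    Returns a dict mapping each initiator to the set of links in its cascade.
    """
    in_links = defaultdict(set)   # cited post -> posts that link to it
    has_out_link = set()          # posts that link to at least one other post
    for citing, cited in links:
        in_links[cited].add(citing)
        has_out_link.add(citing)

    # Initiators: posts that are linked to but link to nothing themselves.
    initiators = [post for post in in_links if post not in has_out_link]

    cascades = {}
    for root in initiators:
        edges, seen, queue = set(), {root}, deque([root])
        while queue:
            post = queue.popleft()
            # Follow in-links: every post that cites the current one.
            for citing in in_links.get(post, ()):
                edges.add((citing, post))
                if citing not in seen:
                    seen.add(citing)
                    queue.append(citing)
        cascades[root] = edges
    return cascades


# Hypothetical links: b and c cite a, d cites b, e cites both d and c, y cites x.
example_links = [("b", "a"), ("c", "a"), ("d", "b"),
                 ("e", "d"), ("e", "c"), ("y", "x")]
cascades = extract_cascades(example_links)
```

Because a post can only link to posts that already exist, the links followed this way cannot form a cycle, which is why each cascade is a directed acyclic graph.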

11 Paper #1,2 Dataset (Nielsen Buzzmetrics) ● Gathered from August-September 2005* ● Used set of 44,362 blogs, traced cascades ● 2.4 million posts, ~5 million out-links, 245,404 blog-to-blog links [Figure: number of posts over time, in 1-day bins]

12 Temporal Observations Does blog traffic behave periodically? Posts show a “weekend effect”: less traffic on Saturday and Sunday.

15 Temporal Observations How does post popularity change over time? Post popularity dropoff follows a power law identical to that found in communication response times in [Vazquez2006]. The probability that a post written at time $t_p$ acquires a link at time $t_p + \Delta$ is: $p(t_p + \Delta) \propto \Delta^{-1.5}$ [Figure: number of in-links (log scale) versus days after post, for Monday posts, annotated with popularity on day 1 and on day 40]
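One simple way to check a dropoff exponent like this is a least-squares line through the points on a log-log scale. The sketch below is only an illustration on synthetic data. The slides do not say how the exponent was estimated, and the function name and data are made up.

```python
import math


def fit_power_law_exponent(deltas, link_counts):
    """Least-squares slope of log(count) against log(delta)."""
    points = [(math.log(d), math.log(c))
              for d, c in zip(deltas, link_counts) if d > 0 and c > 0]
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return sxy / sxx


# Synthetic data that follows count = 1000 * delta^-1.5 exactly.
days = list(range(1, 41))
counts = [1000 * d ** -1.5 for d in days]
exponent = fit_power_law_exponent(days, counts)
```

On real link counts the tail is noisy, so a fit of this kind is usually a rough check rather than a precise estimate.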

17 Topological Observations What graph properties does the blog network exhibit? ● 44,356 nodes, 122,153 edges ● Half of blogs belong to largest connected component.

18 Topological Observations What power laws does the blog network exhibit? Both in- and out-degree follow a power-law distribution: in-degree PL exponent -1.7, out-degree PL exponent near -3. This suggests strong rich-get-richer phenomena. [Figures: count versus number of blog in-links, and count versus number of blog out-links, both on log scales]

20 Topological Observations What graph properties does the post network exhibit? Very sparsely connected: 98% of posts are isolated. Inlinks/outlinks also follow power laws.

21 Topological Observations How do we measure how information flows through the network? Common cascade shapes are extracted using algorithms in [Leskovec2006].

22 Topological Observations How do we measure how information flows through the network? The number of edges increases linearly with cascade size, while the effective diameter increases logarithmically, suggesting tree-like structures. [Figures: number of edges versus cascade size (# nodes); effective diameter versus cascade size]

23 More on cascades ● Cascade sizes, including sizes of particular shapes (stars, chains), also follow power laws. ● This paper also presents a model for influence propagation that generates cascades based on the SIS model of epidemiology. The topic of influence propagation has been reserved for a later date.

24 Paper #2 Mary McGlohon, Jure Leskovec, Christos Faloutsos, Matthew Hurst, and Natalie Glance. Finding patterns in blog shapes and blog evolution, SDM ● Do different kinds of blogs exhibit different properties? ● What tools can we use to describe the behavior of a blog over time?

25 ● Suppose we wanted to characterize a blog based on the properties of its posts. – Obtain a set of features for each post based on its role in a cascade. – Use PCA for dimensionality reduction.

26 Post features ● There are several terms we use to describe cascades: ● In-link, out-link – Green node has one out-link – Yellow node has one in-link. ● Depth downwards/upwards – Pink node has an upward depth of 1, – downward depth of 2. ● Conversation mass upwards/downwards – Pink node has upward CM 1, – downward CM 3
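The depth and conversation-mass features can be read as "how far" and "how many" posts are reachable from a post in one direction of the cascade. The sketch below assumes exactly that reading (depth is the farthest breadth-first distance, conversation mass is the number of other posts reached). The function name and the small graph are hypothetical, chosen to reproduce the pink-node numbers quoted on the slide.

```python
from collections import deque


def reach_depth_and_mass(start, neighbors):
    """Breadth-first search from start along one link direction.

    neighbors: dict mapping a post to the posts adjacent in that direction.
    Returns (depth, mass): farthest distance reached and number of other
    posts reached.
    """
    distance = {start: 0}
    queue = deque([start])
    while queue:
        post = queue.popleft()
        for nxt in neighbors.get(post, ()):
            if nxt not in distance:
                distance[nxt] = distance[post] + 1
                queue.append(nxt)
    return max(distance.values()), len(distance) - 1


# Hypothetical cascade around a "pink" post.
downward = {"pink": ["x", "y"], "y": ["z"]}
upward = {"pink": ["p"]}

depth_down, mass_down = reach_depth_and_mass("pink", downward)
depth_up, mass_up = reach_depth_and_mass("pink", upward)
```

Whether "upward" corresponds to in-links or out-links depends on how the cascade is drawn, so the caller passes the adjacency for whichever direction is meant.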

29 Dimensionality reduction ● Post features may be correlated, so some information may be redundant. ● Principal Component Analysis (PCA) is a method of dimensionality reduction. [Figure: hypothetical scatter plot, for each blog, of depth upwards against conversation mass upwards]

30 Setting up the matrix One row per post (~2,400,000 posts; the example rows are boingboing and slashdot posts such as slashdot-p001) and one column per feature: log(# in-links), log(# out-links), log(CM up), log(CM down), log(depth up), log(depth down). Run PCA…
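As a rough illustration of the "Run PCA" step, the sketch below finds the leading principal component of a tiny feature matrix by power iteration on the covariance matrix. It is not the authors' pipeline: the function name and the counts are made up, and using log(count + 1) to keep zero counts finite is an assumption here.

```python
import math


def first_principal_component(rows, iterations=100):
    """Return (direction, scores) for the leading principal component.

    Uses power iteration on the sample covariance matrix, which is fine
    for a handful of features; a linear algebra library would be the
    usual choice at the scale quoted on the slide.
    """
    n, d = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in rows]
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    direction = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        product = [sum(cov[i][j] * direction[j] for j in range(d))
                   for i in range(d)]
        norm = math.sqrt(sum(x * x for x in product))
        direction = [x / norm for x in product]
    # Project each centered row onto the leading direction.
    scores = [sum(c[j] * direction[j] for j in range(d)) for c in centered]
    return direction, scores


# Hypothetical raw counts per post: (# in-links, conversation mass downward).
raw_counts = [(0, 0), (1, 1), (3, 3), (7, 7)]
# Assumption: log(count + 1) so that zero counts stay finite.
features = [[math.log(a + 1), math.log(b + 1)] for a, b in raw_counts]
direction, scores = first_principal_component(features)
```

In this toy data the two features are perfectly correlated, so a single component captures all of the variation, which is the situation the previous slides describe as redundant information.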

32 PostFeatures: Results Observation: Posts within a blog tend to retain similar network characteristics. – PC1 ~ CM upward – PC2 ~ CM downward [Figure: posts from MichelleMalkin and Dlisted plotted on the first two principal components]

33 ● Suppose we want to cluster blogs based on content. What features do we use? – Get set of features based on cascade shapes. – Run PCA to reduce dimensionality.

34 PCA on a sparse matrix This time, each blog is one row (~44,000 blogs, such as boingboing and slashdot) and each column is a cascade type (~9,000 cascade types). Use log(count+1). Project onto 2 PCs.
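Most blogs take part in only a few of the cascade types, so the matrix is mostly zeros and can be stored as one small dictionary per blog. A minimal sketch of building it with the log(count + 1) transform follows. The cascade-type labels, the input format, and the natural-log base are assumptions.

```python
import math
from collections import Counter


def build_sparse_rows(observations):
    """Build a sparse blog-by-cascade-type matrix.

    observations: iterable of (blog, cascade_type) pairs, one per occurrence.
    Returns {blog: {cascade_type: log(count + 1)}}; absent entries are zero.
    """
    counts = Counter(observations)
    rows = {}
    for (blog, cascade_type), count in counts.items():
        rows.setdefault(blog, {})[cascade_type] = math.log(count + 1)
    return rows


# Hypothetical observations with made-up cascade-type labels.
observations = [("slashdot", "star-3")] * 3 + [("boingboing", "chain-2")]
rows = build_sparse_rows(observations)
```

Since log(0 + 1) = 0, the transform keeps missing entries at zero and the matrix stays sparse.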

36 CascadeType: Results ● Observation: Content of blogs and cascade behavior are often related. Distinct clusters for “conservative” and “humorous” blogs (hand-labeling).

37 ● What about time series data? How can we deal with that? ● Problem: time series data is nonuniform and difficult to analyze. [Figure: in-links over time]

38 BlogTimeFractal: Definitions ● Fortunately, we find that behavior is often self-similar. ● The 80-20 law describes self-similarity. ● For any sequence, we divide it into two equal-length subsequences. 80% of traffic is in one, 20% in the other. – Repeat recursively.
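The recursive split can be made concrete with a small generator. This is a sketch under assumptions: the function name is made up, and choosing at random which half receives the larger share is an assumption about the model, not something stated on the slide.

```python
import random


def recursive_split(total, levels, b=0.8, seed=0):
    """Split a total volume recursively with bias b.

    At each level every interval is halved. One half gets a fraction b
    of the interval's volume and the other gets 1 - b.
    Returns a list of 2**levels interval volumes.
    """
    rng = random.Random(seed)
    series = [float(total)]
    for _ in range(levels):
        finer = []
        for volume in series:
            big, small = volume * b, volume * (1 - b)
            # Assumption: the heavier half is picked at random.
            finer.extend((big, small) if rng.random() < 0.5 else (small, big))
        series = finer
    return series


traffic = recursive_split(total=1000, levels=5)
```

With b = 0.5 every interval gets the same volume (uniform traffic). As b approaches 1, nearly all of the volume ends up in a single interval.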

41 Self-similarity ● The bias factor for the 80-20 law is b = 0.8. Q: How do we estimate b? A: Entropy plots!

44 BlogTimeFractal ● An entropy plot plots entropy vs. resolution. ● From time series data, begin with resolution R = T/2. ● Record entropy $H_R$. ● Recursively take finer resolutions.

45 BlogTimeFractal: Definitions ● Entropy measures the non-uniformity of the histogram at a given resolution. ● We define the entropy of our sequence at a given resolution R as $H_R = -\sum_{t=1}^{2^R} p(t) \log_2 p(t)$, where p(t) is the percentage of posts from a blog on interval t, R is the resolution, and $2^R$ is the number of intervals.
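A minimal sketch of computing the entropy at each resolution, assuming the time series is given as counts over a number of equal bins that is a power of two. The function names are illustrative, and the entropy used is the standard Shannon entropy in bits.

```python
import math


def entropy_at_resolution(series, resolution):
    """Shannon entropy (bits) of the series aggregated into 2**resolution intervals."""
    intervals = 2 ** resolution
    width = len(series) // intervals
    total = sum(series)
    entropy = 0.0
    for i in range(intervals):
        share = sum(series[i * width:(i + 1) * width]) / total
        if share > 0:  # empty intervals contribute nothing
            entropy -= share * math.log2(share)
    return entropy


def entropy_plot(series):
    """Return [(resolution, entropy)] from 2 intervals down to single bins."""
    finest = len(series).bit_length() - 1  # log2 of a power-of-two length
    return [(r, entropy_at_resolution(series, r)) for r in range(1, finest + 1)]


uniform_plot = entropy_plot([5] * 8)                    # same traffic in every bin
point_plot = entropy_plot([0, 0, 0, 40, 0, 0, 0, 0])    # all traffic in one bin
```

For the uniform series the entropy grows by one bit per resolution step, and for the point mass it stays at zero. Those are the two extremes against which a measured slope is compared.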

46 BlogTimeFractal ● For a b-model (and self-similar cases), the entropy plot is linear. The slope s tells us the bias factor. ● Lemma: For traffic generated by a b-model, the bias factor b obeys the equation: $s = -b \log_2 b - (1-b) \log_2 (1-b)$
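The lemma gives s as a function of b, so recovering b from a measured slope means inverting it numerically. A small sketch using bisection on the interval [0.5, 1], where the right-hand side decreases from 1 to 0. The function names are illustrative.

```python
import math


def binary_entropy(b):
    """Right-hand side of the lemma: -b log2 b - (1-b) log2 (1-b)."""
    if b in (0.0, 1.0):
        return 0.0
    return -b * math.log2(b) - (1 - b) * math.log2(1 - b)


def bias_from_slope(slope, tolerance=1e-9):
    """Solve the lemma for b in [0.5, 1] by bisection."""
    low, high = 0.5, 1.0
    while high - low > tolerance:
        middle = (low + high) / 2
        if binary_entropy(middle) > slope:
            low = middle    # entropy still too high, so b must be larger
        else:
            high = middle
    return (low + high) / 2


bias = bias_from_slope(0.85)
```

A slope of 0.85 gives a bias of roughly 0.72, and the limiting slopes 1 and 0 give 0.5 and 1 respectively.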

49 Entropy Plots ● Linear plot → self-similarity ● Uniform: slope s = 1, bias = 0.5 ● Point mass: s = 0, bias = 1 ● Michelle Malkin in-links: s = 0.85; by Lemma 1, b = 0.72 [Figure: entropy versus resolution]

50 BlogTimeFractal: Results ● Observation: Most time series of interest are self-similar. ● Observation: Bias factor is approximately 0.7, that is, more bursty than uniform (70/30 law). Entropy plots for MichelleMalkin: in-links, b = 0.72; conversation mass, b = 0.76; number of posts, b = 0.70.

51 Papers #1,2 conclusions ● There are several power laws observed in a network of blogs. ● We can extract cascades to help describe how information propagates through a network. ● We can use cascade properties to describe behavior of some blogs. ● We can also use self-similarity to describe behavior of blogs over time.

52 Paper #3 ● Eytan Adar, Li Zhang, Lada A. Adamic, and Rajan M. Lukose. Implicit Structure and the Dynamics of Blogspace. WWW – What are the large- and small-scale patterns of blog epidemics?

53 Large scale: Epidemic profiles ● Example: A popular website linking to a given blog may cause a spike in that blog's popularity.

54 Large scale: Epidemic profiles ● Quantify the popularity of a topic as a vector. ● Then, cluster different topics’ profiles.

55 Large scale: Epidemic profiles ● Used k-means clustering on topic buzz to identify different ways ideas gain and lose popularity. Found k=4 worked best. [Figure: centroids of the clusters identified]
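To make the clustering step concrete, here is a bare-bones k-means over popularity vectors. It is a sketch under assumptions: the profiles are made-up three-day popularity shares, the first k vectors are used as initial centroids for simplicity, and k = 2 is used only because the toy data has two obvious shapes.

```python
def kmeans(vectors, k, iterations=50):
    """Plain k-means with squared Euclidean distance.

    Returns (assignment, centroids); assignment[i] is the cluster of vectors[i].
    """
    centroids = [list(v) for v in vectors[:k]]  # assumption: first k as seeds
    assignment = [0] * len(vectors)
    for _ in range(iterations):
        # Assign each vector to its nearest centroid.
        assignment = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(v, centroids[c])))
            for v in vectors
        ]
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids


# Hypothetical profiles: share of a topic's mentions on each of three days.
profiles = [
    [0.9, 0.1, 0.0],  # immediate spike
    [0.0, 0.1, 0.9],  # delayed spike
    [0.8, 0.2, 0.0],  # immediate spike
    [0.0, 0.2, 0.8],  # delayed spike
]
assignment, centroids = kmeans(profiles, k=2)
```

Each resulting centroid is itself a popularity profile, which is what the slide's figure of cluster centroids shows.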

56 Large scale: Epidemic profiles ● ‘Catchall’ (48%): picked up by different communities, no major spike. ● ‘Back page’ news (20%): delayed spike, broader popularity. ● ‘Slashdot’ (14%): link picked up quickly, dies off quickly. ● ‘Front page’ news (18%): immediate spike, broader popularity.

57 Link gathering ● Links are acquired from blogrolls or automated trackbacks. ● Posts sometimes give information on the source of information (‘via’). Example post (May , 8:48a): “GIANTmicrobes ‘We make stuffed animals that look like tiny microbes– only a million times actual size! Now available: The Common Cold, The Flu, Sore Throat, and Stomach Ache.’ (via BoingBoing)”

59 Small scale: link mining ● Unfortunately, since ‘via’ information is rare (on the order of 0.1% of posts), there needs to be a better way to infer infection paths. – Solution: link prediction.

60 Link prediction ● Predict the likelihood of 2 blogs linking to each other. – Blog similarity: common links to other blogs – Link similarity: common non-blog links – Textual similarity: text vector similarity – Timing of posts on certain topics. ● The first three are cosine similarities; timing is a likelihood based on observed distributions of link timings.
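The three similarity features are all cosine similarities between sparse vectors. A minimal sketch follows. The vector contents (how often each blog links to some other site) are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b[key] for key, weight in a.items() if key in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical link vectors: how often each blog links to each target.
blog_a = {"slashdot": 1, "boingboing": 1}
blog_b = {"slashdot": 1}
blog_c = {"dlisted": 2}

sim_ab = cosine_similarity(blog_a, blog_b)
sim_ac = cosine_similarity(blog_a, blog_c)
```

The same function covers blog similarity, link similarity, and textual similarity. Only the meaning of the dictionary keys changes (other blogs, non-blog URLs, or words).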

61 Link prediction results ● Used SVMs to predict links. – Undirected link prediction accuracy: 91% – Directed link prediction accuracy: 57%

62 More goodies from Paper #3 ● And… – Built Zoomgraph, a visualization tool (stay tuned next week). – Proposed iRank, a ranking based on “infectiousness” of blogs (stay tuned Oct. 23). A more in-depth slide show may be found here:

63 Paper #4 ● Noor Ali-Hasan and Lada Adamic. Expressing Social Relationships on the Blog through Links and Comments. ICWSM 2007 – Do different blog communities exhibit certain structural properties?

64 [Ali-Hasan and Adamic 2007] ● Dataset of 3 blogging communities – Dallas/Ft. Worth – United Arab Emirates (UAE) – Kuwait ● Analyzed 3 types of links – Blogrolls (on a blog’s webpage) – Citations (link in a post) – Comments (interaction in a post’s discussion)

65 [Figure: examples of a citation link and a blogroll link]

66 [Figure: example of a comment link]

68 Link type analysis ● It is of interest to compare different types of links… – Co-occurrences of different link types. – Reciprocity among link types, between communities. [Figures: link reciprocation rates; co-occurrence of link types (Kuwait)]
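A reciprocation rate can be computed directly from a list of directed links. The sketch below uses one common definition, the fraction of links whose reverse link also exists. Whether the paper uses exactly this definition is an assumption, and the blog names are placeholders.

```python
def reciprocity(links):
    """Fraction of directed links (u, v) for which (v, u) is also present."""
    link_set = {(u, v) for u, v in links if u != v}  # ignore self-links
    if not link_set:
        return 0.0
    mutual = sum(1 for u, v in link_set if (v, u) in link_set)
    return mutual / len(link_set)


# Hypothetical blogroll links: a and b list each other, a also lists c.
blogroll_links = [("a", "b"), ("b", "a"), ("a", "c")]
rate = reciprocity(blogroll_links)
```

Running the same function separately on blogroll, citation, and comment links for each community gives the kind of per-type, per-community comparison the slide describes.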

70 Structural properties ● Centralization: to what extent links are not uniformly distributed (low in all communities, indicating “hubs”). ● Modularity: to what extent “subcommunities” have formed. [Figures: links per blog; modularity]

71 Comparing communities ● Dallas-Fort Worth – Most links are external to community (91%) – Low centralization – Low reciprocity ● UAE – Fewer links external to community – More centralization – Obvious “hub” structure ● Kuwait – Fewest links external to community (53%) – Highly centralized – Much reciprocity

72 Paper #4 Conclusions ● Based on a survey, the authors suggest that these different network characteristics reflect different mindsets within each community. – Kuwait bloggers more often reported blogging in order to make new friends. – DFW bloggers more often reported blogging to update friends/family on events.

73 Conclusions ● Link analysis has discovered patterns in several aspects of the blogosphere. – Observing general network characteristics. – Describing behavior of specific blogs, or blog topics. – Illustrating how influence propagates. – Comparing different blogging communities.