Mining Social Media Communities and Content. Akshay Java, Ph.D. Dissertation Defense, October 16th, 2008.


“It is possible to develop effective algorithms to detect Web-scale communities using their inherent properties, structure, content.” Thesis Statement

Key Observations 1. Understanding communication in social media requires identifying and modeling communities. 2. Communities are a result of collective, social interactions and usage.

1. Developed and evaluated innovative approaches for community detection – A new algorithm for finding communities in social datasets – SimCUT, a novel algorithm for combining structural and semantic information 2. First to comprehensively analyze two important, new social media forms – Feed Readership – Microblogging Usage and Communities 3. Built systems, infrastructure and datasets for the social media research community Contributions

Outline Introduction Detecting Communities in Social Media Combining Semantic Information Case Studies –Feed Usage and Distillation –Microblogging Communities Future Work Conclusions

Social Media Describes the online technologies and practices that people use to share opinions, insights, experiences, and perspectives and engage with each other. UGC + Social Network ~Wikipedia

What you… Think: blogs. Say: podcasts. See: Flickr, YouTube. Hear: Pandora, Last.fm. Do: Twitter, Jaiku, Pownce. It’s about YOU!

Who are our... Friends Facebook Colleagues LinkedIn Virtual Avatars secondlife Also about US

What we share Knowledge Wikipedia Links del.icio.us, StumbleUpon Love/Hate yelp, Upcoming Location FireEagle, BrightKite Spaces Ustream, Qik How We Share

Social interactions build communities Shared Interests Common Beliefs Events Organization/Location Communities

A community in the real world is represented in a graph as a set of nodes that have more links within the set than outside it. Graph: Citation Network, Affiliation Network, Sentiment Information, Shared Resource (tags, videos, ...). Examples: Political Blogs, Twitter Network, Facebook Network. What is a Community

Existing Approaches Clustering Approach 1.Agglomerative/Hierarchical Incrementally, group similar nodes to form clusters Communities in Football League (Hierarchical Clustering) Football Teams

Existing Approaches Clustering Approach 1. Agglomerative/Hierarchical Topological Overlap: Similarity is measured in terms of the number of nodes that both i and j link to. (Ravasz et al.)
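The topological overlap idea can be illustrated with a minimal sketch. It assumes an undirected graph stored as a dictionary of neighbor sets and normalizes the shared-neighbor count by the smaller degree, counting a direct link as one extra shared connection. The function name, the normalization, and the toy graph are assumptions for illustration, not taken from the slides.

```python
def topological_overlap(adj, i, j):
    """Shared neighbours of i and j (plus 1 for a direct link),
    divided by the smaller of the two degrees."""
    common = len(adj[i] & adj[j])
    direct = 1 if j in adj[i] else 0
    smaller_degree = min(len(adj[i]), len(adj[j]))
    return (common + direct) / smaller_degree if smaller_degree else 0.0


# Hypothetical toy graph: two triangles joined by the edge c-d.
adj = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c", "e", "f"},
    "e": {"d", "f"},
    "f": {"d", "e"},
}

overlap_ab = topological_overlap(adj, "a", "b")  # same triangle
overlap_cd = topological_overlap(adj, "c", "d")  # bridge between triangles
overlap_ae = topological_overlap(adj, "a", "e")  # different triangles
```

Pairs inside a triangle score higher than the bridge pair, which is what lets an agglomerative procedure group them first.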

Existing Approaches Clustering Approach 1.Agglomerative/Hierarchical 2.Divisive/Partition based (Girvan Newman) Normalized Cut (NCut) (Shi, Malik) Political Books

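Existing Approaches The graph is partitioned using the eigenspectrum of the Laplacian (Shi and Malik). The eigenvector corresponding to the second smallest eigenvalue of the graph Laplacian is the Fiedler vector. The graph can be recursively partitioned using the sign of the values in its Fiedler vector. Normalized Cuts: $\mathrm{NCut}(A,B) = \frac{\mathrm{cut}(A,B)}{\mathrm{assoc}(A,V)} + \frac{\mathrm{cut}(A,B)}{\mathrm{assoc}(B,V)}$, where $\mathrm{cut}(A,B)$ is the cost of edges deleted to disconnect the graph and $\mathrm{assoc}(B,V)$ is the total cost of all edges that start from B. Graph Laplacian: $L = D - W$, with $W$ the weighted adjacency matrix and $D$ the diagonal degree matrix.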
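A single partitioning step with the Fiedler vector can be sketched as follows. This is a simplified illustration using NumPy and the unnormalized Laplacian D − W; Shi and Malik's normalized cut solves a generalized eigenproblem instead, so this is an assumption-laden stand-in rather than their exact method. The function name and the toy graph are hypothetical.

```python
import numpy as np


def fiedler_bipartition(W):
    """Split a connected graph in two using the sign of the Fiedler vector."""
    W = np.asarray(W, dtype=float)
    laplacian = np.diag(W.sum(axis=1)) - W
    # eigh returns eigenvalues in ascending order for symmetric matrices,
    # so column 1 belongs to the second smallest eigenvalue.
    _, eigenvectors = np.linalg.eigh(laplacian)
    fiedler = eigenvectors[:, 1]
    return fiedler >= 0


# Hypothetical toy graph: triangles {0,1,2} and {3,4,5} joined by edge 2-3.
W = np.zeros((6, 6))
for u, v in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    W[u, v] = W[v, u] = 1.0

side = fiedler_bipartition(W)
```

Which triangle ends up on the `True` side depends on the arbitrary sign of the eigenvector, so only the grouping is meaningful. Recursive partitioning would apply the same step to each side.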

Existing Approaches Modularity Score (Newman et al.) – Measure of quality of clustering: $Q = \sum_i (e_{ii} - a_i^2)$, where $e_{ii}$ is the fraction of intra-community edges and $a_i^2$ is the expected value of $e_{ii}$ disregarding communities – Q = 0: communities are random – Q > 0: higher values are better. Optimizing modularity is NP-hard* – Spectral Methods – Heuristics. *(Brandes et al.)
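The modularity score can be computed directly from an edge list. The sketch below assumes an undirected, unweighted graph and a dictionary mapping each node to a community label; the function name and the toy data are illustrative.

```python
def modularity(edges, community):
    """Q = sum_i (e_ii - a_i^2) for an undirected, unweighted edge list."""
    m = len(edges)
    e = {}  # e[c]: fraction of edges with both ends in community c
    a = {}  # a[c]: fraction of edge ends attached to community c
    for u, v in edges:
        cu, cv = community[u], community[v]
        if cu == cv:
            e[cu] = e.get(cu, 0.0) + 1.0 / m
        a[cu] = a.get(cu, 0.0) + 0.5 / m
        a[cv] = a.get(cv, 0.0) + 0.5 / m
    return sum(e.get(c, 0.0) - a[c] ** 2 for c in a)


# Hypothetical toy graph: two triangles joined by one edge.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
two_groups = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
one_group = {node: "A" for node in range(6)}

q_split = modularity(edges, two_groups)
q_single = modularity(edges, one_group)
```

Placing every node in one community gives Q = 0, while the two-triangle split gives a positive score.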

Existing methods 1. Do not scale well for Web graphs 2. Fail to exploit the underlying graph’s distributions 3. Unable to use available metadata and semantic features. Limitations

“It is possible to develop effective algorithms to detect Web-scale communities using their inherent properties, structure, content.” Thesis Statement

The Long Tail –80/20 Rule or Pareto distribution –Few blogs get most attention/links –Most are sparsely connected Motivation –Web graphs are large, but sparse –Expensive to compute community structure over the entire graph Goal –Approximate the membership of the nodes using only a small portion of the entire graph. Special Properties of Social Datasets

Intuition –communities are defined by the core (A) and the membership of the rest of the network (B) can be approximated by how they link to the core. Direct Method –NCut (Baseline) Approximation –Singular Value Decomposition (SVD) –Sampling –Heuristic

SVD (low rank). Sampling-based approach: communities can be extracted by sampling only columns from the head (Drineas et al.). Heuristic: cluster the head to find initial communities, then assign each tail node to the cluster that it most frequently links to. [Figure: adjacency matrix with nodes ordered by degree; r head columns.] Approximating Communities ICWSM ‘08
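The second step of the heuristic, assigning tail nodes by the head cluster they link to most often, can be sketched as a majority vote. The head labels are assumed to come from any clustering of the head; the function name and the toy data are hypothetical.

```python
from collections import Counter


def assign_tail_nodes(head_labels, tail_links):
    """Give each tail node the cluster label its outgoing links hit most often.

    head_labels: dict mapping head node -> cluster label (already computed).
    tail_links:  dict mapping tail node -> list of nodes it links to.
    Tail nodes with no links into the head stay unassigned (None).
    """
    assignment = {}
    for node, targets in tail_links.items():
        votes = Counter(head_labels[t] for t in targets if t in head_labels)
        assignment[node] = votes.most_common(1)[0][0] if votes else None
    return assignment


# Hypothetical example: three head blogs in two clusters, three tail blogs.
head_labels = {"h1": 0, "h2": 0, "h3": 1}
tail_links = {"t1": ["h1", "h2", "h3"], "t2": ["h3"], "t3": []}

tail_assignment = assign_tail_nodes(head_labels, tail_links)
```

Only the links from tail nodes into the head are inspected, which is why the cost depends on the size of the head rather than on the full graph.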

Approximating Communities 1.Dataset: A blog dataset of 6000 blogs. ICWSM ‘08 Original Adjacency Heuristic Approximation Modularity = 0.51

Approximating Communities [Figure annotations: low modularity, more time; similar modularity, lower time.] Advantage: faster detection using a small portion of the graph, less memory. SVD O(n^3), NCut O(nk), Sampling O(r^3), Heuristic O(rk), where n = number of blogs, k = number of clusters, r = number of columns. ICWSM ‘08

Approximating Communities ICWSM ‘08 1.Blog Dataset 2.Social network datasets: Additional evaluations using Variation of Information score

Tags are free meta-data! Other semantic features: Sentiments Named Entities Readership information Geolocation information etc. How to combine this for detecting communities?

Social Media Graphs Links Between Nodes Links Between Nodes and Tags Simultaneous Cuts

A community in the real world is identified in a graph as a set of nodes that have more links within the set than outside it and share similar tags. Communities in Social Media

SimCUT: Simultaneously Clustering Tags and Graphs. [Figure: nodes-by-tags block matrix; Fiedler vector; polarity.] β = 0: entirely ignore link information; β = 1: equal importance to blog-blog and blog-tag links; β >> 1: NCut. WebKDD ‘08

SimCUT: Simultaneously Clustering Tags and Graphs. β = 0: entirely ignore link information; β = 1: equal importance to blog-blog and blog-tag links; β >> 1: NCut. [Figure panels: Clustering Only Links; Clustering Links + Tags.] WebKDD ‘08
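One way to picture the role of β is a joint graph over nodes and tags in which the node-node block is scaled by β. The sketch below is an assumption about how such a matrix could be assembled and cut with a Fiedler vector; the block layout, the function name, and the toy data are illustrative and are not confirmed as the exact SimCUT formulation.

```python
import numpy as np


def joint_bipartition(W, C, beta=1.0):
    """Bipartition nodes using links (W) and node-tag associations (C).

    W: (n x n) symmetric node-node link matrix.
    C: (n x t) node-tag matrix.
    beta scales the link block: 0 drops links, large values let links dominate.
    """
    W = np.asarray(W, dtype=float)
    C = np.asarray(C, dtype=float)
    n, t = C.shape
    # Assumed joint matrix over nodes followed by tags.
    M = np.block([[beta * W, C], [C.T, np.zeros((t, t))]])
    laplacian = np.diag(M.sum(axis=1)) - M
    _, eigenvectors = np.linalg.eigh(laplacian)
    # Keep only the node entries of the Fiedler vector.
    return eigenvectors[:n, 1] >= 0


# Hypothetical data: a path of four blogs; blogs 0,1 share tag 0, blogs 2,3 share tag 1.
W = [[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]]
C = [[1, 0], [1, 0], [0, 1], [0, 1]]

node_side = joint_bipartition(W, C, beta=1.0)
```

With β = 0 the joint graph in this toy example falls apart into two disconnected pieces, so the second eigenvector is no longer unique; a practical implementation would need to handle that case separately.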

Datasets Citeseer (Getoor et al.) – Agents, AI, DB, HCI, IR, ML – Words used in place of tags. Blog data – derived from the WWE/Buzzmetrics dataset – Tags associated with blogs derived from del.icio.us – For dimensionality reduction, 100 topics derived from blog homepages using LDA (Latent Dirichlet Allocation). Pairwise similarity computed – RBF kernel for Citeseer – Cosine for blogs
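For reference, an RBF kernel between two feature vectors can be written in a few lines. The bandwidth parameter `sigma` and its default value are assumptions; the slides do not state which setting was used.

```python
from math import exp


def rbf_kernel(x, y, sigma=1.0):
    """exp(-||x - y||^2 / (2 * sigma^2)) for two equal-length vectors."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return exp(-squared_distance / (2.0 * sigma ** 2))


same = rbf_kernel([1.0, 0.0], [1.0, 0.0])
apart = rbf_kernel([0.0, 0.0], [1.0, 1.0])
```

Identical vectors score 1, and the similarity decays toward 0 as the vectors move apart.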

Clustering Tags and Graphs Clustering Only Links Clustering Links + Tags

Clustering Tags and Graphs. Accuracy = 36%. Accuracy = 62%. Higher accuracy by adding ‘tag’ information

Varying Scaling Parameter β. Accuracy = 36%. Accuracy = 62%. Higher accuracy by adding ‘tag’ information. Simple K-means ~23%. Content only, binary. Content only ~52% (Getoor et al. 2004). Accuracy = 39%. [Figure panels: β >> 1, β = 1, β = 0; Only Graph, Only Tags, Graphs & Tags.]

Mutual Information Measures the dependence between two random variables. Compares results with ground truth Effect of Number of Tags, Clusters Citeseer Link only has lower MI More Semantics helps Similar results for real, blog datasets
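Mutual information between a clustering and ground-truth labels can be computed from co-occurrence counts. The sketch below uses natural logarithms and no normalization; whether the evaluation in the slides used a normalized variant is not stated, so treat this as an assumption.

```python
from collections import Counter
from math import log


def mutual_information(labels_a, labels_b):
    """I(A; B) in nats between two labelings of the same items."""
    n = len(labels_a)
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    mi = 0.0
    for (a, b), count in joint.items():
        p_ab = count / n
        mi += p_ab * log(p_ab / ((count_a[a] / n) * (count_b[b] / n)))
    return mi


truth = [0, 0, 1, 1]
mi_perfect = mutual_information(truth, [1, 1, 0, 0])    # same grouping, renamed
mi_unrelated = mutual_information(truth, [0, 1, 0, 1])  # independent grouping
```

The score depends only on how items are grouped, not on the label names, so a clustering that matches the ground truth up to renaming still gets the maximum value.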

Tags are one type of meta-data! Other semantic information: Sentiments Named Entities Readership information Geolocation information etc. How do we get additional semantics?

Additional Semantics BlogVox: –Sentiments and Opinions SemNews: –Named Entities, beliefs, facts Link Polarity: –Sentiment from anchor text Readership: –Feed subscriptions and usage (TREC 06, IJCAI/AND 07) (AAAI SS 05, HICS 06, IJSWIS) (ICWSM 07)

Key Observations 1.Understanding communication in social media requires identifying and modeling communities 2.Communities are a result of collective, social interactions and usage.

Feeds Readership http://ftm.umbc.edu Folders Use folder label as topics/tags. Group similar folders together. Rank Feeds under a “topic” ICWSM ‘07

83K publicly listed subscribers 2.8M feeds, 500K are unique 26K users (35%) use folders to organize subscriptions Data collected in May 2006 Although there may be ~ 50M+ Blogs, only a small fraction get continued user attention in the form of subscriptions Feed Subscription Statistics ICWSM ‘07

Communities from Feed Subscriptions – A common vocabulary emerges from folder names – Folder names are used as topics. Lower-ranked folders are merged into a higher-ranked folder if there is an overlap and a high cosine similarity. Feeds That Matter. [Figure: Folder Usage; rank of a folder (by number of feeds in it) vs. number of users using a folder.] http://ftm.umbc.edu ICWSM ‘07

Folder names are used as topics. Lower-ranked folders are merged into a higher-ranked folder if there is an overlap and a high cosine similarity. Tag Cloud After Merging
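The merging rule can be sketched as a greedy pass over folders ranked by the number of feeds they contain. This sketch assumes each folder is a vector of feed counts, treats "overlap" as sharing at least one feed (implied by a positive cosine), and uses a hypothetical threshold of 0.5; none of these specifics come from the slides.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


def merge_folders(folders, threshold=0.5):
    """Map each lower-ranked folder to the higher-ranked folder it merges into."""
    ranked = sorted(folders, key=lambda name: len(folders[name]), reverse=True)
    kept, merged_into = [], {}
    for name in ranked:
        target = next(
            (k for k in kept if cosine(folders[name], folders[k]) >= threshold),
            None,
        )
        if target is None:
            kept.append(name)
        else:
            merged_into[name] = target
    return merged_into


# Hypothetical folder -> {feed: number of users filing it there} counts.
folders = {
    "politics": {"Daily Kos": 5, "Talking Points Memo": 4, "Wonkette": 2},
    "political": {"Daily Kos": 3, "Talking Points Memo": 3},
    "tech": {"Slashdot": 6, "Engadget": 2},
}

merges = merge_folders(folders)
```

For brevity the sketch does not fold the merged folder's counts back into its target, which a fuller implementation would likely do before comparing further folders.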

Two feeds are similar if they are categorized under similar folders Feed Recommendations If you like X you will like….. Feed Distillation for “Politics” Merged folders: “political”, “political blogs” Talking Points Memo: by Joshua Micah Marshal Daily Kos: State of the Nation Eschaton The Washington Monthly Wonkette, Politics for People with Dirty Minds http://instapundit.com/ Informed Comment Power Line AMERICAblog: Because a great nation deserves the truth Crooks and Liars Tech Knitting http://ftm.umbc.edu ICWSM ‘07
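A feed recommender along these lines can be sketched by describing each feed by the folder labels users file it under and ranking other feeds by similarity. The use of cosine similarity, the function names, and the toy subscription data are assumptions for illustration.

```python
from collections import Counter
from math import sqrt


def folder_profiles(subscriptions):
    """Build feed -> Counter of folder labels from (folder, feed) pairs."""
    profiles = {}
    for folder, feed in subscriptions:
        profiles.setdefault(feed, Counter())[folder] += 1
    return profiles


def recommend(feed, profiles, top_k=3):
    """Return up to top_k other feeds filed under similar folders."""
    def cosine(u, v):
        dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
        norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
        return dot / norm if norm else 0.0

    scores = [
        (cosine(profiles[feed], profile), other)
        for other, profile in profiles.items()
        if other != feed
    ]
    scores.sort(reverse=True)
    return [other for score, other in scores[:top_k] if score > 0]


# Hypothetical (folder, feed) pairs collected across users.
subscriptions = [
    ("politics", "Daily Kos"), ("political", "Daily Kos"),
    ("politics", "Eschaton"), ("political", "Eschaton"),
    ("politics", "Wonkette"),
    ("tech", "Slashdot"),
]

suggestions = recommend("Daily Kos", folder_profiles(subscriptions))
```

Feeds that never share a folder label with the query feed are dropped rather than listed with a zero score.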

Wikipedia is our collective wisdom Twitter is our collective consciousness

Easily share status messages Twitter post Current Status Friends Microblogging SNAKDD ‘07

Twitterment http://twitterment.umbc.edu First Twitter search engine. Uses Lucene to index the public timeline. Provides search and analytics. Built a social network of users. 1.3M tweets, 83K users, two months of data. Cities by rank: 1 Tokyo, 2 New York, 3 San Francisco, 4 Seattle, 5 Los Angeles, 6 Chicago, 7 Toronto, 8 Austin, 9 Singapore, 10 Madrid.

http://twitterment.umbc.edu Search and Trend analytics on Microblogs lunch dinner work coffee Microblogging Trend Analytics

Clique Percolation Method (CPM) Two nodes belong to the same community if they can be connected through adjacent k-cliques. (Palla et al.) Gaming Community Microblogging Communities Finds overlapping communities A Community is a union of all k-clique subgraphs 3 Clique SNAKDD ‘07
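The clique percolation idea can be sketched in pure Python for small graphs: enumerate all k-cliques, link two cliques when they share k − 1 nodes, and take the union of each connected group of cliques. The brute-force enumeration below is only practical for toy inputs; the function name and the example graph are hypothetical.

```python
from itertools import combinations


def k_clique_communities(adj, k=3):
    """Overlapping communities as unions of adjacent k-cliques."""
    nodes = sorted(adj)
    cliques = [
        frozenset(c)
        for c in combinations(nodes, k)
        if all(v in adj[u] for u, v in combinations(c, 2))
    ]

    # Union-find over cliques; two cliques are adjacent if they share k-1 nodes.
    parent = list(range(len(cliques)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(cliques)), 2):
        if len(cliques[i] & cliques[j]) == k - 1:
            parent[find(i)] = find(j)

    communities = {}
    for i, clique in enumerate(cliques):
        communities.setdefault(find(i), set()).update(clique)
    return list(communities.values())


# Hypothetical graph: triangles a-b-c and b-c-d share an edge; d-e-f shares only node d.
adj = {
    "a": {"b", "c"},
    "b": {"a", "c", "d"},
    "c": {"a", "b", "d"},
    "d": {"b", "c", "e", "f"},
    "e": {"d", "f"},
    "f": {"d", "e"},
}

found = sorted(sorted(c) for c in k_clique_communities(adj, k=3))
```

Node `d` appears in both communities, which shows the overlapping membership that distinguishes this method from a hard partition.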

INFORMATION HUB Information Source: Communities connected via Robert Scoble, an A-list blogger

INFORMATION BRIDGE Information Source, Information Seeker: Different roles in different communities

STAR NETWORKS / SMALL CLIQUES Friendship-relation: Small groups among friends/co-workers

“It is possible to develop effective algorithms to detect Web-scale communities using their inherent properties, structure, content.” Observations 1.Understanding communication in social media requires identifying and modeling communities 2.Communities are a result of collective, social interactions and usage. Thesis Statement

Future Work. Social media content is challenging; many improvements are needed in textual analysis, sentiment detection, named entity detection and language understanding in such systems. Temporal analysis of community structures. Feed distillation and ranking in blog search. Index quality vs. index freshness. User intention and personalization.

Demonstrated a fast, community detection algorithm well suited for social datasets. Implemented SimCut, a technique that outperforms simple graph based approaches for community detection. Evaluated and tested proposed algorithms on real social media datasets and benchmark datasets. Conducted the first comprehensive study of feed readership and microblogging usage. Built systems, infrastructure and datasets for the social media research community.

Conclusions. We have presented a framework for analyzing social media content and structure, making use of certain special properties and features in such systems. We study the Social Web from a user perspective and analyze not just how people are using these systems but also why. Social Media is connecting people and building communities by bridging the gap between content production and consumption.

Thanks!

The Future…. Location –Social, mobile applications –Geographically relevant, query(less) search Social Advertising and Personalization –Role of influence and communities in advertising Real-Time, Social Information Streams –Event detection/ Breaking News –How effective is the advertising? Social Web to solve challenging AI problems –Just as tagging has helped image search –Availability of social tools and Wikipedia provide opportunities to work on difficult AI problems like disambiguation and common sense reasoning.

http://ebiquity.umbc.edu http://socialmedia.typepad.com
