Mining Social Media Communities and Content. Akshay Java. Ph.D. Dissertation Defense, October 16th, 2008.

Presentation transcript:

Thesis Statement: “It is possible to develop effective algorithms to detect Web-scale communities using their inherent properties, structure, and content.”

Key Observations
1. Understanding communication in social media requires identifying and modeling communities.
2. Communities are a result of collective, social interactions and usage.

Contributions
1. Developed and evaluated innovative approaches for community detection
–A new algorithm for finding communities in social datasets
–SimCUT, a novel algorithm for combining structural and semantic information
2. First to comprehensively analyze two important, new social media forms
–Feed Readership
–Microblogging Usage and Communities
3. Built systems, infrastructure and datasets for the social media research community

Outline
Introduction
Detecting Communities in Social Media
Combining Semantic Information
Case Studies
–Feed Usage and Distillation
–Microblogging Communities
Future Work
Conclusions

Social Media: describes the online technologies and practices that people use to share opinions, insights, experiences, and perspectives and engage with each other (Wikipedia). Social media = UGC (user-generated content) + social network.

It’s about YOU! What you…
–Think: blogs
–Say: podcasts
–See: Flickr, YouTube
–Hear: Pandora, Last.fm
–Do: Twitter, Jaiku, Pownce

Also about US. Who are our...
–Friends: Facebook
–Colleagues: LinkedIn
–Virtual avatars: Second Life

How We Share. What we share:
–Knowledge: Wikipedia
–Links: del.icio.us, StumbleUpon
–Love/Hate: Yelp, Upcoming
–Location: FireEagle, BrightKite
–Spaces: Ustream, Qik

Communities: social interactions build communities.
–Shared interests
–Common beliefs
–Events
–Organization/location

What is a Community? A community in the real world is represented in a graph as a set of nodes that have more links within the set than outside it. Types of graph: citation network, affiliation network, sentiment information, shared resources (tags, videos, ...). Examples: political blogs, Twitter network, Facebook network.

Existing Approaches: Clustering Approach
1. Agglomerative/Hierarchical: incrementally group similar nodes to form clusters.
[Figure: communities in a football league found by hierarchical clustering; nodes are football teams]

Existing Approaches: Clustering Approach
1. Agglomerative/Hierarchical
Topological overlap: the similarity of nodes i and j is measured in terms of the number of nodes that both i and j link to. (Ravasz et al.)
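
The topological overlap idea can be illustrated with a small sketch. The normalisation by the smaller degree and the extra count for a direct link are assumptions borrowed from the commonly used form of the measure, and the toy graph is hypothetical; the slide itself only states that similarity depends on the number of shared neighbours.

```python
def topological_overlap(adj, i, j):
    """Shared-neighbour similarity of nodes i and j in an undirected graph.

    adj maps each node to the set of its neighbours.
    Assumption: normalise by the smaller degree and count a direct i-j link once.
    """
    shared = len(adj[i] & adj[j])
    if j in adj[i]:
        shared += 1  # direct link counts as one extra shared connection
    smaller_degree = min(len(adj[i]), len(adj[j]))
    return shared / smaller_degree if smaller_degree else 0.0


# Hypothetical toy graph (symmetric adjacency sets).
adj = {
    "a": {"b", "c", "d"},
    "b": {"a", "c", "d"},
    "c": {"a", "b"},
    "d": {"a", "b", "e"},
    "e": {"d"},
}

overlap_ab = topological_overlap(adj, "a", "b")  # tightly connected pair
overlap_ce = topological_overlap(adj, "c", "e")  # no shared neighbours
```

An agglomerative method would repeatedly merge the pair of nodes (or clusters) with the highest overlap.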

Existing Approaches: Clustering Approach
1. Agglomerative/Hierarchical
2. Divisive/Partition based (Girvan, Newman)
Normalized Cut (NCut) (Shi, Malik)
[Figure: political books network]

Existing Approaches: Normalized Cuts (Shi and Malik)
The graph is partitioned using the eigenspectrum of the graph Laplacian, L = D − W. The eigenvector associated with the second-smallest eigenvalue of the Laplacian is the Fiedler vector. The graph can be recursively partitioned using the sign of the values in its Fiedler vector.
Normalized cut: Ncut(A, B) = cut(A, B)/assoc(A, V) + cut(A, B)/assoc(B, V), where cut(A, B) is the cost of the edges deleted to disconnect the graph and assoc(B, V) is the total cost of all edges that start from B.
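
As a rough illustration of sign-based spectral bisection, the sketch below finds a Fiedler vector by power iteration and splits the nodes by sign. It is a simplification under stated assumptions: it uses the unnormalised Laplacian L = D − W rather than the generalised eigenproblem of Shi and Malik, it is written in plain Python for a tiny graph, and the function names and example graph are hypothetical.

```python
def fiedler_partition(n, edges, iterations=500):
    """Split an undirected, unweighted graph on nodes 0..n-1 by Fiedler-vector sign.

    Power iteration on (shift*I - L) with the constant eigenvector projected
    out converges to the eigenvector of the second-smallest eigenvalue of L.
    """
    weights = [[0.0] * n for _ in range(n)]
    for u, v in edges:
        weights[u][v] = weights[v][u] = 1.0
    degree = [sum(row) for row in weights]
    shift = 2.0 * max(degree)  # upper bound on the largest Laplacian eigenvalue

    # Deterministic, zero-mean start vector (orthogonal to the constant vector).
    x = [i - (n - 1) / 2.0 for i in range(n)]
    for _ in range(iterations):
        # y = (shift*I - L) x, with (L x)_i = degree_i * x_i - sum_j w_ij * x_j
        y = [
            shift * x[i]
            - (degree[i] * x[i] - sum(weights[i][j] * x[j] for j in range(n)))
            for i in range(n)
        ]
        mean = sum(y) / n
        y = [value - mean for value in y]  # remove the trivial constant component
        norm = sum(value * value for value in y) ** 0.5
        x = [value / norm for value in y]
    side_a = {i for i in range(n) if x[i] >= 0}
    return side_a, set(range(n)) - side_a


def ncut_value(edges, part_a, part_b):
    """Ncut(A, B) = cut/assoc(A, V) + cut/assoc(B, V) for unit edge weights."""
    cut = sum(1 for u, v in edges if (u in part_a) != (v in part_a))
    assoc_a = sum((u in part_a) + (v in part_a) for u, v in edges)
    assoc_b = sum((u in part_b) + (v in part_b) for u, v in edges)
    return cut / assoc_a + cut / assoc_b


# Hypothetical example: two triangles joined by a single bridge edge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
part_a, part_b = fiedler_partition(6, edges)
cost = ncut_value(edges, part_a, part_b)
```

On this graph the sign split separates the two triangles, and only the bridge edge is cut. A production implementation would use a sparse eigensolver instead of dense power iteration.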

Existing Approaches: Modularity Score (Newman et al.)
–Measure of the quality of a clustering: Q = Σ_i (e_ii − a_i^2), where e_ii is the fraction of intra-community edges in community i and a_i^2 is the expected value of e_ii disregarding communities
–Q = 0: communities are random
–Q > 0: higher values are better
Optimizing modularity is NP-hard (Brandes et al.), so in practice it is approached with:
–Spectral methods
–Heuristics
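
A minimal sketch of the modularity score for an unweighted, undirected graph follows. It computes Q = Σ_i (e_ii − a_i^2), taking a_i as the fraction of edge ends attached to community i (the usual Newman-Girvan definition); the example graph and function name are illustrative assumptions.

```python
def modularity(edges, community_of):
    """Modularity Q of a partition of an undirected, unweighted graph.

    edges: list of (u, v) pairs; community_of: dict node -> community label.
    """
    m = len(edges)
    intra = {}  # number of edges with both ends inside community c
    ends = {}   # number of edge ends attached to community c
    for u, v in edges:
        cu, cv = community_of[u], community_of[v]
        ends[cu] = ends.get(cu, 0) + 1
        ends[cv] = ends.get(cv, 0) + 1
        if cu == cv:
            intra[cu] = intra.get(cu, 0) + 1
    # e_ii = intra / m, a_i = ends / (2m)
    return sum(intra.get(c, 0) / m - (ends[c] / (2 * m)) ** 2 for c in ends)


# Hypothetical example: two triangles joined by one bridge edge.
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
two_groups = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
one_group = {node: "A" for node in range(6)}

q_split = modularity(edges, two_groups)
q_single = modularity(edges, one_group)
```

Putting every node in one community gives Q = 0, while the natural two-triangle split scores higher, which matches the reading of Q on the slide.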

Limitations
Existing methods:
1. Do not scale well for Web graphs
2. Fail to exploit the underlying graph’s distributions
3. Are unable to use available metadata and semantic features

Thesis Statement: “It is possible to develop effective algorithms to detect Web-scale communities using their inherent properties, structure, and content.”

Special Properties of Social Datasets
The Long Tail
–80/20 rule or Pareto distribution
–Few blogs get most of the attention/links
–Most are sparsely connected
Motivation
–Web graphs are large, but sparse
–Expensive to compute community structure over the entire graph
Goal
–Approximate the membership of the nodes using only a small portion of the entire graph.

Intuition
–Communities are defined by the core (A), and the membership of the rest of the network (B) can be approximated by how its nodes link to the core.
Direct Method
–NCut (baseline)
Approximation
–Singular Value Decomposition (SVD)
–Sampling
–Heuristic

Approximating Communities (ICWSM ‘08)
SVD (low rank)
Sampling-based approach
–Communities can be extracted by sampling only columns from the head (Drineas et al.)
Heuristic
–Cluster the head to find initial communities.
–Assign each tail node to the cluster that it most frequently links to.
[Figure labels: nodes ordered by degree; r]
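
The head/tail heuristic can be sketched as follows. This is an illustration under assumptions, not the dissertation's implementation: the head is taken to be the r highest-degree nodes, the head's cluster labels are supplied by hand here (in practice they would come from clustering the head, for example with NCut), and the toy graph and function names are hypothetical.

```python
from collections import Counter


def select_head(adj, r):
    """Return the r highest-degree nodes (the 'head' of the degree distribution)."""
    return set(sorted(adj, key=lambda node: len(adj[node]), reverse=True)[:r])


def assign_tail(adj, head_labels):
    """Give each tail node the label of the head cluster it links to most often."""
    labels = dict(head_labels)
    for node, neighbours in adj.items():
        if node in head_labels:
            continue
        votes = Counter(head_labels[n] for n in neighbours if n in head_labels)
        # Tail nodes with no links into the head stay unassigned (None).
        labels[node] = votes.most_common(1)[0][0] if votes else None
    return labels


# Hypothetical toy graph: four hub nodes (h*) and six sparsely connected nodes (t*).
adj = {
    "h1": {"h2", "t1", "t2", "t3"},
    "h2": {"h1", "t1", "t2", "t3"},
    "h3": {"h4", "t1", "t4", "t5"},
    "h4": {"h3", "t4", "t5", "t6"},
    "t1": {"h1", "h2", "h3"},
    "t2": {"h1", "h2"},
    "t3": {"h1", "h2"},
    "t4": {"h3", "h4"},
    "t5": {"h3", "h4"},
    "t6": {"h4"},
}

head = select_head(adj, r=4)
# Assumed output of clustering the head subgraph.
head_labels = {"h1": 0, "h2": 0, "h3": 1, "h4": 1}
labels = assign_tail(adj, head_labels)
```

Only the r head nodes need an expensive clustering step; every tail node is labelled with a single pass over its links.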

Approximating Communities (ICWSM ‘08)
1. Dataset: a blog dataset of 6000 blogs.
[Figure panels: original adjacency; heuristic approximation. Modularity = 0.51]

Approximating Communities (ICWSM ‘08)
[Figure annotations: low modularity, more time; similar modularity, lower time]
Advantage: faster detection using a small portion of the graph, and less memory.
Complexity: SVD O(n^3), NCut O(nk), Sampling O(r^3), Heuristic O(rk), where n = number of blogs, k = number of clusters, r = number of columns.

Approximating Communities (ICWSM ‘08)
1. Blog dataset
2. Social network datasets: additional evaluations using the Variation of Information score

Tags are free metadata! Other semantic features: sentiments, named entities, readership information, geolocation information, etc. How do we combine these for detecting communities?

Social Media Graphs: links between nodes; links between nodes and tags; simultaneous cuts.

Communities in Social Media: a community in the real world is identified in a graph as a set of nodes that have more links within the set than outside it and share similar tags.

SimCUT: Simultaneously Clustering Tags and Graphs (WebKDD ‘08)
[Figure: combined matrix over nodes and tags; Fiedler vector polarity]
–β = 0: entirely ignore link information
–β = 1: equal importance to blog-blog and blog-tag links
–β >> 1: NCut

SimCUT: Simultaneously Clustering Tags and Graphs (WebKDD ‘08)
–β = 0: entirely ignore link information
–β = 1: equal importance to blog-blog and blog-tag links
–β >> 1: NCut
[Figure panels: clustering only links; clustering links + tags]
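
One way to picture the role of β is as a weight on the node-node block of a combined node-and-tag affinity matrix, which is then partitioned spectrally. The block layout below is an assumption made for illustration (the slides do not spell out the matrix), and the function name and toy data are hypothetical.

```python
def combined_affinity(links, tags, beta):
    """Build a symmetric (nodes + tags) x (nodes + tags) affinity matrix.

    Assumed layout:  [ beta * links   tags ]
                     [ tags^T          0   ]
    links: n x n node-node weights; tags: n x t node-tag weights.
    """
    n, t = len(links), len(tags[0])
    size = n + t
    w = [[0.0] * size for _ in range(size)]
    for i in range(n):
        for j in range(n):
            w[i][j] = beta * links[i][j]   # scaled node-node links
        for k in range(t):
            w[i][n + k] = tags[i][k]       # node-tag links
            w[n + k][i] = tags[i][k]       # mirrored to keep the matrix symmetric
    return w


# Hypothetical toy data: 3 nodes, 2 tags.
links = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
tags = [[1, 0], [1, 1], [0, 1]]

tags_only = combined_affinity(links, tags, beta=0)   # link block vanishes
balanced = combined_affinity(links, tags, beta=1)    # both blocks at equal scale
```

With β = 0 the node-node block is all zeros, so only tag co-occurrence drives the cut; as β grows, the link block dominates and the result approaches a cut on links alone.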

Datasets
Citeseer (Getoor et al.)
–Agents, AI, DB, HCI, IR, ML
–Words used in place of tags
Blog data
–Derived from the WWE/Buzzmetrics dataset
–Tags associated with blogs derived from del.icio.us
–For dimensionality reduction, 100 topics derived from blog homepages using LDA (Latent Dirichlet Allocation)
Pairwise similarity computed
–RBF kernel for Citeseer
–Cosine for blogs

Clustering Tags and Graphs
[Figure panels: clustering only links; clustering links + tags]

Clustering Tags and Graphs
Accuracy = 36% (only links); accuracy = 62% (links + tags). Higher accuracy by adding ‘tag’ information.

Varying Scaling Parameter β
–β >> 1 (only graph): accuracy = 36%
–β = 1 (graph and tags): accuracy = 62%
–β = 0 (only tags): accuracy = 39%
Higher accuracy by adding ‘tag’ information.
Simple k-means: ~23%. Content only, binary. Content only: ~52% (Getoor et al. 2004).

Mutual Information
–Measures the dependence between two random variables.
–Compares results with ground truth.
Effect of the number of tags and clusters (Citeseer)
–Link-only clustering has lower MI
–More semantics helps
–Similar results for real blog datasets
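
For reference, mutual information between a clustering and ground-truth labels can be computed from the joint label counts. The sketch below uses natural logarithms and toy label lists; the function name and data are illustrative assumptions, not the evaluation code used in the talk.

```python
from collections import Counter
from math import log


def mutual_information(labels_a, labels_b):
    """I(A; B) = sum_{a,b} p(a,b) * log(p(a,b) / (p(a) * p(b))), in nats."""
    n = len(labels_a)
    joint = Counter(zip(labels_a, labels_b))
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    return sum(
        (c / n) * log((c / n) / ((count_a[a] / n) * (count_b[b] / n)))
        for (a, b), c in joint.items()
    )


truth = [0, 0, 1, 1]
perfect = [1, 1, 0, 0]     # same grouping, different label names
unrelated = [0, 1, 0, 1]   # carries no information about the truth

mi_perfect = mutual_information(truth, perfect)
mi_unrelated = mutual_information(truth, unrelated)
```

Because the score depends only on how items are grouped, a clustering that matches the ground truth up to a renaming of labels still receives the maximum value.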

Tags are one type of metadata! Other semantic information: sentiments, named entities, readership information, geolocation information, etc. How do we get additional semantics?

Additional Semantics
–BlogVox: sentiments and opinions (TREC 06, IJCAI/AND 07)
–SemNews: named entities, beliefs, facts (AAAI SS 05, HICS 06, IJSWIS)
–Link Polarity: sentiment from anchor text (ICWSM 07)
–Readership: feed subscriptions and usage

Key Observations
1. Understanding communication in social media requires identifying and modeling communities.
2. Communities are a result of collective, social interactions and usage.

Feeds, Readership, Folders (ICWSM ‘07)
–Use folder labels as topics/tags.
–Group similar folders together.
–Rank feeds under a “topic”.

Feed Subscription Statistics (ICWSM ‘07)
–83K publicly listed subscribers
–2.8M feeds, of which 500K are unique
–26K users (35%) use folders to organize subscriptions
–Data collected in May 2006
Although there may be ~50M+ blogs, only a small fraction get continued user attention in the form of subscriptions.

Feeds That Matter: Communities from Feed Subscriptions (ICWSM ‘07)
–A common vocabulary emerges from folder names.
–Folder names are used as topics. A lower-ranked folder is merged into a higher-ranked folder if there is an overlap and a high cosine similarity.
[Figure: folder usage; axes: rank of a folder (by number of feeds in it), number of users using a folder]

Tag Cloud After Merging: folder names are used as topics. A lower-ranked folder is merged into a higher-ranked folder if there is an overlap and a high cosine similarity.
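
The merging rule can be sketched as below. Several details are assumptions made for illustration: each folder is represented as a bag of feeds with counts, folders are ranked by the number of distinct feeds they contain, "overlap" is read as sharing at least one feed, and the 0.5 threshold and toy data are made up.

```python
from math import sqrt


def cosine(a, b):
    """Cosine similarity of two sparse count vectors stored as dicts."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm_a = sqrt(sum(value * value for value in a.values()))
    norm_b = sqrt(sum(value * value for value in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def merge_folders(folders, threshold=0.5):
    """Map each lower-ranked folder to the higher-ranked folder it merges into."""
    ranked = sorted(folders, key=lambda name: len(folders[name]), reverse=True)
    merged_into = {}
    for index, name in enumerate(ranked):
        for higher in ranked[:index]:
            if higher in merged_into:
                continue  # only merge into folders that survive as topics
            if cosine(folders[name], folders[higher]) >= threshold:
                merged_into[name] = higher
                break
    return merged_into


# Hypothetical folder -> {feed id: number of users filing that feed there}.
folders = {
    "politics": {"feed_1": 5, "feed_2": 4, "feed_3": 2},
    "political": {"feed_1": 2, "feed_2": 2},
    "tech": {"feed_8": 3, "feed_9": 1},
}

merged = merge_folders(folders)
```

Here "political" folds into "politics" because the two folders contain largely the same feeds, while "tech" shares no feeds with either and stays a separate topic.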

Feed Recommendations (ICWSM ‘07): two feeds are similar if they are categorized under similar folders. “If you like X you will like…”
Feed Distillation for “Politics” (merged folders: “political”, “political blogs”)
–Talking Points Memo: by Joshua Micah Marshall
–Daily Kos: State of the Nation
–Eschaton
–The Washington Monthly
–Wonkette, Politics for People with Dirty Minds
–Informed Comment
–Power Line
–AMERICAblog: Because a great nation deserves the truth
–Crooks and Liars
–Tech Knitting
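
A "if you like X you will like…" recommender built on this idea could look like the sketch below. It assumes each feed is described by how often users file it under each folder name and compares feeds with cosine similarity; the feed ids, counts, and function name are hypothetical.

```python
from math import sqrt


def recommend(feed_folders, target, top_n=2):
    """Rank other feeds by cosine similarity of their folder-usage vectors."""

    def cosine(a, b):
        dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
        norm_a = sqrt(sum(value * value for value in a.values()))
        norm_b = sqrt(sum(value * value for value in b.values()))
        return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

    scores = [
        (cosine(feed_folders[target], vector), name)
        for name, vector in feed_folders.items()
        if name != target
    ]
    scores.sort(reverse=True)
    # Drop feeds that share no folders with the target at all.
    return [name for score, name in scores[:top_n] if score > 0]


# Hypothetical feed -> {folder name: number of users filing the feed there}.
feed_folders = {
    "feed_a": {"politics": 10, "news": 2},
    "feed_b": {"politics": 7},
    "feed_c": {"politics": 5, "news": 5},
    "feed_d": {"crafts": 4},
}

suggestions = recommend(feed_folders, "feed_a")
```

Feeds filed almost exclusively under the same folders as the target rank first, and feeds with no folders in common are never suggested.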

Wikipedia is our collective wisdom; Twitter is our collective consciousness.

Microblogging (SNAKDD ‘07): easily share status messages.
[Screenshot labels: Twitter post; current status; friends]

Twitterment
–First Twitter search engine
–Uses Lucene to index the public timeline
–Provides search and analytics
–Built a social network of users
Data: 1.3M tweets, 83K users, two months of data
Rank  City
1     Tokyo
2     New York
3     San Francisco
4     Seattle
5     Los Angeles
6     Chicago
7     Toronto
8     Austin
9     Singapore
10    Madrid

Microblogging Trend Analytics: search and trend analytics on microblogs.
[Figure: trends for the terms “lunch”, “dinner”, “work”, “coffee”]

Microblogging Communities (SNAKDD ‘07)
Clique Percolation Method (CPM) (Palla et al.)
–Two nodes belong to the same community if they can be connected through adjacent k-cliques.
–A community is a union of all k-clique subgraphs.
–Finds overlapping communities.
[Figure labels: 3-clique; gaming community]
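
The clique percolation idea can be demonstrated on a toy graph with a brute-force sketch. Two k-cliques are treated as adjacent when they share k − 1 nodes, and each connected group of adjacent cliques forms one community. The enumeration below is only practical for very small graphs, and the function name and example are hypothetical.

```python
from itertools import combinations


def k_clique_communities(adj, k=3):
    """Return overlapping communities as sets of nodes (brute-force CPM)."""
    nodes = sorted(adj)
    # Enumerate every k-clique: all pairs inside the candidate must be linked.
    cliques = [
        frozenset(candidate)
        for candidate in combinations(nodes, k)
        if all(v in adj[u] for u, v in combinations(candidate, 2))
    ]

    # Union-find over cliques; adjacent cliques share exactly k - 1 nodes.
    parent = list(range(len(cliques)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(cliques)), 2):
        if len(cliques[i] & cliques[j]) == k - 1:
            parent[find(i)] = find(j)

    groups = {}
    for i, clique in enumerate(cliques):
        groups.setdefault(find(i), set()).update(clique)
    return list(groups.values())


# Hypothetical graph: triangles {1,2,3} and {2,3,4} are adjacent (share 2 nodes);
# triangle {4,5,6} touches them only at node 4, so it forms its own community.
adj = {
    1: {2, 3},
    2: {1, 3, 4},
    3: {1, 2, 4},
    4: {2, 3, 5, 6},
    5: {4, 6},
    6: {4, 5},
}

communities = k_clique_communities(adj, k=3)
```

Node 4 ends up in both communities, which illustrates why the method is described as finding overlapping communities.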

INFORMATION HUB Information Source: Communities connected via Robert Scoble, an A-list blogger

INFORMATION BRIDGE Information Source, Information Seeker: Different roles in different communities

STAR NETWORKS / SMALL CLIQUES Friendship-relation: Small groups among friends/co-workers

Thesis Statement: “It is possible to develop effective algorithms to detect Web-scale communities using their inherent properties, structure, and content.”
Observations
1. Understanding communication in social media requires identifying and modeling communities.
2. Communities are a result of collective, social interactions and usage.

Future Work
–Social media content is challenging: many improvements are needed in textual analysis, sentiment detection, named entity detection and language understanding in such systems.
–Temporal analysis of community structures
–Feed distillation and ranking in blog search
–Index quality vs. index freshness
–User intention and personalization

–Demonstrated a fast community detection algorithm well suited for social datasets.
–Implemented SimCUT, a technique that outperforms simple graph-based approaches for community detection.
–Evaluated and tested the proposed algorithms on real social media datasets and benchmark datasets.
–Conducted the first comprehensive study of feed readership and microblogging usage.
–Built systems, infrastructure and datasets for the social media research community.

Conclusions
–We have presented a framework for analyzing social media content and structure that makes use of certain special properties and features of such systems.
–We study the Social Web from a user perspective and analyze not just how people are using these systems but also why.
–Social media is connecting people and building communities by bridging the gap between content production and consumption.

Thanks!

The Future…
Location
–Social, mobile applications
–Geographically relevant, query(less) search
Social Advertising and Personalization
–Role of influence and communities in advertising
Real-Time, Social Information Streams
–Event detection/breaking news
–How effective is the advertising?
Social Web to solve challenging AI problems
–Just as tagging has helped image search
–Availability of social tools and Wikipedia provides opportunities to work on difficult AI problems like disambiguation and common sense reasoning.