Trust, Influence and Bias in Social Media Anupam Joshi Joint work with Tim Finin and several students Ebiquity Group, UMBC

Presentation transcript:

Trust, Influence and Bias in Social Media Anupam Joshi Joint work with Tim Finin and several students Ebiquity Group, UMBC

Knowing & Influencing your Audience. Your goal is to campaign for a presidential candidate. How can you track the buzz about him/her? What are the relevant communities and blogs? Which communities are supporters, which are skeptical, which are put off by the hype? Is your campaign having an effect? The desired effect? Which bloggers are influential with a political audience? Of these, which are already onboard and which are lost causes? To whom should you send details or talk?

Knowing & Influencing your Market. Your goal is to market Zune. How can you track the buzz about it? What are the relevant communities and blogs? Which communities are fans, which are suspicious, which are put off by the hype? Is your advertising having an effect? The desired effect? Which bloggers are influential in this market? Of these, which are already onboard and which are lost causes? To whom should you send details or evaluation samples?

What is Influence? “The act or power of producing an effect without apparent exertion of force or direct exercise of command.” Measurable influence is the ability of a blogger to persuade another blogger to: take action by creating a new post about the topic and commenting on the original (text and graph mining); quote the blogger’s views in her post (text mining); link to the original post via trackbacks or comments (graph mining); link to the blogger through other means such as del.icio.us, digg, citeULike, Connotea, etc. (graph mining); subscribe to the blog feed (graph mining).

What is a Community? A community in the real world is represented in a graph as a set of nodes that have more links within the set than outside it. Figure labels: Graph; Citation Network; Affiliation Network; Sentiment Information; Shared Resource (tags, videos, ...); Political Blogs; Twitter Network; Facebook Network.
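The definition above can be checked directly on a small graph. The following is a minimal sketch, not the authors' method: the function name `link_counts` and the toy edge list are made up for illustration, and the graph is treated as undirected.

```python
def link_counts(edges, members):
    """Count links inside a candidate community versus links that leave it."""
    members = set(members)
    internal = external = 0
    for u, v in edges:
        u_in, v_in = u in members, v in members
        if u_in and v_in:
            internal += 1      # both endpoints inside the set
        elif u_in or v_in:
            external += 1      # exactly one endpoint inside the set
    return internal, external


# Hypothetical toy graph: two triangles joined by one bridge edge.
edges = [("a", "b"), ("b", "c"), ("a", "c"),
         ("d", "e"), ("e", "f"), ("d", "f"),
         ("c", "d")]
internal, external = link_counts(edges, {"a", "b", "c"})
is_community = internal > external  # more links within the set than outside it
```

Under this reading, {a, b, c} qualifies because it has three internal links and only one link leaving the set.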

Finding Communities (and Feeds) That Matter
Figure labels: Before Merge, After Merge.
Analysis of Bloglines feeds: 83K publicly listed subscribers; 2.8M feeds, of which 500K are unique; 26K users (35%) use folders to organize subscriptions; data collected in May 2006.
Top Advertising Feeds:
1. Adrants » Marketing and Advertising News With Attitude
2. Adverblog: advertising and new media marketing
adfreak
5. AdJab
6. MIT Advertising Lab: future of advertising and advertising technology
7. AdPulp: Daily Juice from the Ad Biz
8. Advertising/Design Goodness
Related Tags: advertising, marketing, media, news, design

Feeds That Matter
Top Feeds for “Politics” (merged folders: “political”, “political blogs”): Talking Points Memo: by Joshua Micah Marshall; Daily Kos: State of the Nation; Eschaton; The Washington Monthly; Wonkette, Politics for People with Dirty Minds; Informed Comment; Power Line; AMERICAblog: Because a great nation deserves the truth; Crooks and Liars.
Top Feeds for “Knitting” (merged folders: “knitting blogs”): Yarn Harlot knitting; Wendy Knits!; See Eunny Knit!; the blue blog; Grumperina goes to local yarn shops and Home Depot; You Knit What??; Mason-Dixon Knitting; knit and tonic; Crazy Aunt Purl.

Special Properties of Social Datasets. Long tail: 80/20 rule or Pareto distribution; few blogs get most of the attention/links; most are sparsely connected. Motivation: web graphs are large but sparse, and it is expensive to compute community structure over the entire graph. Goal: approximate the membership of the nodes using only a small portion of the entire graph.

Intuition: communities are defined by the core (A); membership of the rest (B) is approximated by how they link to the core. Direct method: NCut (baseline). Approximation: singular value decomposition (SVD); sampling; heuristic.

Approximating Communities (ICWSM ’08). SVD (low rank). Sampling-based approach: communities can be extracted by sampling only columns from the head (Drineas et al.). Heuristic: cluster the head to find initial communities, then assign each tail node to the cluster it links to most frequently. Figure labels: nodes ordered by degree; r.
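The heuristic step can be sketched as a majority vote over a tail node's links into the head. This is an illustrative sketch only: the head clustering is assumed to be given (any clustering method could produce it), and the names `assign_tail`, `head_cluster`, and `tail_links` are hypothetical.

```python
from collections import Counter


def assign_tail(head_cluster, tail_links):
    """Assign each tail node to the head cluster it links to most often.

    head_cluster: dict mapping head node -> cluster id (from clustering the head only).
    tail_links:   dict mapping tail node -> list of nodes it links to.
    Returns a dict mapping tail node -> cluster id, or None when the node
    has no links into the head.
    """
    assignment = {}
    for node, targets in tail_links.items():
        votes = Counter(head_cluster[t] for t in targets if t in head_cluster)
        assignment[node] = votes.most_common(1)[0][0] if votes else None
    return assignment


# Hypothetical example: three head nodes in two clusters, three tail nodes.
head_cluster = {"h1": 0, "h2": 0, "h3": 1}
tail_links = {"t1": ["h1", "h2", "h3"], "t2": ["h3"], "t3": []}
tail_assignment = assign_tail(head_cluster, tail_links)
```

Only the head needs to be clustered explicitly, which is where the savings in time and memory would come from.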

Approximating Communities (ICWSM ’08). Dataset: a blog dataset of 6000 blogs. Figure panels: Original Adjacency; Heuristic Approximation. Modularity = 0.51

Approximating Communities (ICWSM ’08). Figure labels: Low Modularity, More Time; Similar Modularity, Lower Time. Advantages: faster detection using a small portion of the graph, less memory. Complexity: SVD $O(n^3)$, NCut $O(nk)$, Sampling $O(r^3)$, Heuristic $O(rk)$, where n = # blogs, k = # clusters, r = # columns.

Approximating Communities ICWSM ‘08 Additional evaluations using Variation of Information score

Tags are free meta-data! Other semantic features: sentiments, named entities, readership information, geolocation information, etc. How to combine this for detecting communities?

Social Media Graphs Links Between Nodes Links Between Nodes and Tags Simultaneous Cuts

Communities in Social Media. A community in the real world is identified in a graph as a set of nodes that have more links within the set than outside it and share similar tags.

SimCUT: Clustering Tags and Graphs (WebKDD ’08). Figure labels: Nodes; Tags; Fiedler Vector; Polarity. β = 0: entirely ignore link information. β = 1: equal importance to blog-blog and blog-tag links. β >> 1: NCut.

β = 0: entirely ignore link information. β = 1: equal importance to blog-blog and blog-tag links. β >> 1: NCut. Figure panels: Clustering Only Links; Clustering Links + Tags. (WebKDD ’08)
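One way to picture the role of β is as a weight on the blog-blog block of a joint node-and-tag matrix. The slides do not show the matrix itself, so the block layout below, [[β·links, tags], [tagsᵀ, 0]], is an assumption chosen to match the three β regimes described; `joint_matrix` and the toy data are hypothetical.

```python
def joint_matrix(links, tags, beta):
    """Build a symmetric (n+m) x (n+m) matrix over n blogs and m tags.

    links: n x n blog-blog link matrix (assumed symmetric here).
    tags:  n x m blog-tag matrix.
    beta:  weight on link information; 0 drops the links entirely,
           large values let the links dominate the tag block.
    """
    n, m = len(links), len(tags[0])
    top = [[beta * links[i][j] for j in range(n)] + list(tags[i])
           for i in range(n)]
    bottom = [[tags[i][t] for i in range(n)] + [0.0] * m
              for t in range(m)]
    return top + bottom


# Hypothetical toy data: two linked blogs, three tags.
links = [[0, 1], [1, 0]]
tags = [[1, 0, 1], [0, 1, 1]]
w_equal = joint_matrix(links, tags, beta=1.0)
w_tags_only = joint_matrix(links, tags, beta=0.0)
```

A spectral cut (for example via the Fiedler vector) would then be computed on this joint matrix; that step is omitted here.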

Datasets. Citeseer (Getoor et al.): Agents, AI, DB, HCI, IR, ML; words used in place of tags. Blog data derived from the WWE/Buzzmetrics dataset; tags associated with blogs derived from del.icio.us. For dimensionality reduction, 100 topics derived from blog homepages using LDA (Latent Dirichlet Allocation). Pairwise similarity computed with an RBF kernel for Citeseer and cosine similarity for blogs.
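For reference, the two pairwise similarity measures named above can be written in a few lines. This is a generic sketch of the standard definitions; the `gamma` bandwidth parameter and its default value are assumptions, since the slides do not state the kernel settings.

```python
import math


def rbf_kernel(x, y, gamma=1.0):
    """RBF (Gaussian) kernel: exp(-gamma * squared Euclidean distance)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


def cosine_similarity(x, y):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0
    return dot / (norm_x * norm_y)
```

Both return 1 for identical non-zero vectors; the RBF kernel decays with distance, while cosine similarity ignores vector length, which suits topic-proportion vectors.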

Clustering Tags and Graphs Clustering Only Links Clustering Links + Tags

Varying Scaling Parameter β. Higher accuracy by adding ‘tag’ information. Figure labels: Accuracy = 36%; Accuracy = 62%; Accuracy = 39%; β >> 1; β = 1; β = 0; Only Graph; Only Tags; Graphs & Tags. Simple Kmeans ~23% Content only, binary Content only ~52% (Getoor et al. 2004)

Mutual Information. Measures the dependence between two random variables; compares results with ground truth. Effect of number of tags and clusters (Citeseer): link-only has lower MI; more semantics helps; similar results for real blog datasets.
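Mutual information between a predicted clustering and ground-truth labels can be computed from co-occurrence counts. The sketch below uses the standard definition in bits; the slides do not say whether a normalized variant was used, so plain (unnormalized) MI is an assumption.

```python
import math
from collections import Counter


def mutual_information(labels_a, labels_b):
    """Mutual information (in bits) between two labelings of the same items."""
    n = len(labels_a)
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    mi = 0.0
    for (a, b), n_ab in joint.items():
        # p(a,b) * log2( p(a,b) / (p(a) * p(b)) ), written with raw counts
        mi += (n_ab / n) * math.log2(n_ab * n / (count_a[a] * count_b[b]))
    return mi


# Perfect agreement on two balanced clusters gives 1 bit;
# statistically independent labelings give 0.
mi_agree = mutual_information([0, 0, 1, 1], ["x", "x", "y", "y"])
mi_independent = mutual_information([0, 0, 1, 1], [0, 1, 0, 1])
```

Cluster ids do not need to match the ground-truth label names, which is why MI is convenient for evaluating clusterings.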

Influence in Communities Communities detected using “Fast algorithm for detecting community structure in networks”, M.E. J. Newman

Authority and Popularity. Authority contributes to influence. Influence may be subjective: a source that is authoritative in one community could influence another community negatively. Within a community, an authoritative source is influential. Popularity: authority and popularity are often treated as equal. On blog search engines, authority is measured using inlinks, which is at best a measure of popularity. Popularity does not imply influence: Dilbert is extremely popular but not influential.

Link Polarity & Sentiment

Link Polarity and Bias. Linking alone is not an indicator of influence. Polarity (+/- sentiment) indicates the type of influence. Consistent negative/positive opinion indicates bias. Link polarity/citation signal helps determine trust. Figure labels: Democrat Blog; Republican Blog; Strong Negative Opinion; Mildly Negative Opinion; Strongly Positive Opinion.

Propagating Influence. Based on the work of Guha et al. [1] on modeling the propagation of trust and distrust. Framework: $M_{ij}$ represents the influence/bias from user $i$ to user $j$ ($0 \le M_{ij} \le 1$) and is initialized to the polarity from $i$ to $j$. The belief matrix $M$ (sparse) represents the initial set of known beliefs. The goal is to compute all unknown values in $M$. Belief matrix after the $i$-th atomic propagation: $M_{i+1} = M_i \cdot C_i$. Combined operator: $C_i = a_1 M + a_2 M^T M + a_3 M^T + a_4 M M^T$, where $a = \{0.4, 0.4, 0.1, 0.1\}$ are the weighting factors. [1] Guha R, Kumar R, Raghavan P, Tomkins A. Propagation of trust and distrust. In: Proceedings of the Thirteenth International World Wide Web Conference, New York, NY, USA, May 2004. ACM Press.
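The propagation step can be sketched with plain nested lists. This is an illustrative reading of the slide, not the authors' implementation: it assumes the combined operator is built once from the initial belief matrix and reused at every step, and it applies no normalization or clipping. The helper names are hypothetical.

```python
def transpose(a):
    return [list(col) for col in zip(*a)]


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def combined_operator(m, weights=(0.4, 0.4, 0.1, 0.1)):
    """C = a1*M + a2*M^T*M + a3*M^T + a4*M*M^T for a square matrix M."""
    a1, a2, a3, a4 = weights
    mt = transpose(m)
    terms = [(a1, m), (a2, matmul(mt, m)), (a3, mt), (a4, matmul(m, mt))]
    n = len(m)
    return [[sum(w * t[i][j] for w, t in terms) for j in range(n)]
            for i in range(n)]


def propagate(m, steps=1, weights=(0.4, 0.4, 0.1, 0.1)):
    """Apply M_{i+1} = M_i * C for the given number of atomic propagations."""
    c = combined_operator(m, weights)  # assumption: C is fixed across steps
    current = m
    for _ in range(steps):
        current = matmul(current, c)
    return current


# Hypothetical beliefs: user 0 is positive about 1, and 1 about 2; nothing known for 0 -> 2.
initial = [[0.0, 1.0, 0.0],
           [0.0, 0.0, 1.0],
           [0.0, 0.0, 0.0]]
after_one_step = propagate(initial, steps=1)
```

After one step the previously unknown entry from user 0 to user 2 becomes non-zero, which is the kind of inferred link the slide describes.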

Recognizing subjectivity & sentiment. We’ve developed ΔTFIDF as a simple feature-engineering technique to increase the accuracy of subjectivity detection and sentiment analysis. Our preliminary analysis shows that ΔTFIDF: works well in different subject domains; improves accuracy for documents of varying sizes (sentence fragments, sentences, paragraphs and multi-paragraph documents); helps on text classification tasks other than sentiment analysis.

Feature Engineering for Text Classification. Typical features: words and/or phrases along with term frequency or (better) TF-IDF scores. ΔTFIDF amplifies the training set signals by using the ratio of the IDF for the negative and positive collections, which results in a significant boost in accuracy. Text: The quick brown fox jumped over the lazy white dog. Features: the 2, quick 1, brown 1, fox 1, jumped 1, over 1, lazy 1, white 1, dog 1, the quick 1, quick brown 1, brown fox 1, fox jumped 1, jumped over 1, over the 1, the lazy 1, lazy white 1, white dog 1
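Extracting unigram and bigram counts like those in the example takes only a few lines. This is a minimal sketch; the tokenizer (lowercasing and keeping runs of letters and apostrophes) is an assumption, as the slides do not specify one.

```python
import re
from collections import Counter


def ngram_features(text):
    """Return counts of unigrams and adjacent-word bigrams in the text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return Counter(tokens + bigrams)


features = ngram_features("The quick brown fox jumped over the lazy white dog.")
```

Here `features["the"]` is 2 and every other unigram and bigram in the sentence has count 1.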

ΔTFIDF BoW Feature Set. Value of feature t in document d is Where $C_{t,d}$ = count of term t in document d; $N_t$ = number of negative labeled training docs with term t; $P_t$ = number of positive labeled training docs with term t. Normalize to avoid bias towards longer documents. Gives greater weight to rare (significant) words. Downplays very common words. Similar to Unigram + Bigram BoW in other aspects.
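The formula itself did not survive in this transcript. One form consistent with the symbols defined above, assuming equally sized positive and negative training sets, is $C_{t,d} \cdot \log_2(N_t / P_t)$. The sketch below uses that assumed form; the sign convention and the add-one smoothing (to avoid division by zero for terms seen in only one class) are also assumptions, not taken from the slides.

```python
import math


def delta_tfidf(count, neg_docs_with_term, pos_docs_with_term, smooth=1.0):
    """Assumed Delta-TFIDF value: term count times log2 of the ratio of
    negative to positive document frequencies (with additive smoothing)."""
    ratio = (neg_docs_with_term + smooth) / (pos_docs_with_term + smooth)
    return count * math.log2(ratio)


# A term spread evenly across both classes scores 0; a term concentrated
# in one class gets a large magnitude, with the sign indicating which class.
even = delta_tfidf(count=5, neg_docs_with_term=7, pos_docs_with_term=7)
mostly_negative = delta_tfidf(count=2, neg_docs_with_term=3, pos_docs_with_term=1)
mostly_positive = delta_tfidf(count=1, neg_docs_with_term=1, pos_docs_with_term=3)
```

This matches the qualitative behavior listed above: very common words that appear in both classes are downplayed, while words that discriminate between classes are amplified.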

Example: ΔTFIDF vs TFIDF vs TF
15 features with highest values for a review of City of Angels

Δtfidf          | tfidf        | tf
, city          | angels       | ,
cage is         | angels is    | the
mediocrity      | , city       | .
criticized      | of angels    | to
exhilarating    | maggie ,     | of
well worth      | city of      | a
out well        | maggie       | and
should know     | angel who    | is
really enjoyed  | movie goers  | that
maggie ,        | cage is      | it
it's nice       | seth ,       | who
is beautifully  | goers        | in
wonderfully     | angels ,     | more
of angels       | us with      | you
Underneath the  | city         | but

Improvement over TFIDF (Uni- + Bi-grams). Movie Reviews: 88.1% Accuracy vs % at 95% Confidence Interval. Subjectivity Detection (Opinionated or not): 91.26% vs. 89.4% at 99.9% Confidence Interval. Congressional Support for Bill (Voted for/Against): 72.47% vs % at 99.9% Confidence Interval. Enron Spam Detection (Spam or not): % vs at % Confidence Interval. All tests used 10-fold cross validation. At least as good as mincuts + subjectivity detectors on movie reviews (87.2%).

Link Polarity Experiments. Domain: political blogosphere. The dataset from Buzzmetrics [2] provides post-post link structure over 14 million posts; few off-the-topic posts help aggregation; potential business value. Reference dataset: hand-labeled dataset from Lada Adamic et al. [3] classifying political blogs into right- and left-leaning bloggers; timeframe: 2004 presidential elections, over 1500 blogs analyzed; overlap of 300 blogs between the Buzzmetrics and reference datasets. Goal: classify the blogs in the Buzzmetrics dataset as Democrat or Republican and compare with the reference dataset. [2] Buzzmetrics [3] Lada A. Adamic and Natalie Glance, "The political blogosphere and the 2004 US Election", in Proceedings of the WWW-2005 Workshop

Evaluation of Link Polarity. Confusion matrix: Accuracy = 73%; True positive (Recall) = 78%; False positive (FP) = 31%; True negative (Recall) = 69%; False negative (FN) = 21%; Precision (R) = 75%; Precision (D) = 72%. Polarity improves classification by almost 26%.
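The quantities on this slide are all derived from the four cells of a two-class confusion matrix. The sketch below shows the standard formulas; the counts passed in are hypothetical and chosen only for illustration, since the raw confusion matrix is not in the transcript.

```python
def binary_metrics(tp, fn, fp, tn):
    """Standard metrics from a two-class confusion matrix."""
    total = tp + fn + fp + tn
    return {
        "accuracy": (tp + tn) / total,
        "recall_pos": tp / (tp + fn),      # true positive rate
        "recall_neg": tn / (tn + fp),      # true negative rate
        "precision_pos": tp / (tp + fp),
        "precision_neg": tn / (tn + fn),
    }


# Hypothetical counts (not the study's data): 100 blogs per class.
metrics = binary_metrics(tp=78, fn=22, fp=31, tn=69)
```

With these made-up counts the accuracy is 0.735 and the positive-class recall is 0.78; per-class precision depends on how many blogs of each class were evaluated.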

Trust Propagation: Sample Data. Compensates for initial incorrect polarity (DK-AT). Doesn’t change correct polarity (AT-DK). Assigns correct polarity for non-existent direct links (AT-IP). Numbers in italics are problematic (MM-AT). Improve sentiment detection?

MSM Classification Results

Interesting Observations. 24 of 27 sources correctly classified (guardian, foxnews, humaneventsonline, mediamatters). Outliers: “The Nation” & “Boston Globe”. Left- and right-leaning blogs talk negatively about “ny times” & “abc news” and positively about “raw story” and “examiner”.

Identifying Bias using KL Divergence

Conclusion

Using topic, social structure and opinions, we can develop a model for influence, bias and trust in social media. We apply this framework to real-world data and describe techniques for identifying influence. Splogs are a big issue; we have developed efficient techniques to detect them in near real time. Does the game-theoretic nature of this system raise fundamental new challenges for data mining?

Assets: Good, Bad and Wanted. How were the assets (data, APIs) helpful? Where did these assets fail to be helpful, and why? Since we go “beyond search”, search data was not that useful. Which research questions would you like to address if you had unlimited access to assets? Unlimited livespaces link and content data to validate some of our approaches. Use to place ads on social media sites.