By : Namesh Kher Big Data Insights – INFM 750

Crowdturfers, campaigns and social media: Tracking and revealing crowdsourced manipulation of social media
By: Namesh Kher, Big Data Insights, INFM 750

Crowdsourcing & Crowdturfing
- Crowdsourcing: "process of obtaining needed services, ideas or content by soliciting contributions from a large group of people, especially an online community" (Wikipedia)
- Crowdturfing: the use of crowdsourcing platforms to spread malicious content on the web, in the form of malicious URLs, astroturfing campaigns, search engine manipulation, etc.
- The main objective of such systems is to degrade the quality of online information
- This paper looks for insights into crowdturfing ecosystems and attempts to answer questions such as:
  - Who are the participants in such systems?
  - What are their roles?
  - What campaigns are carried out by such systems or organizations?
  - Can we distinguish the behavior of crowdturfers from that of regular social media users?

Terminology & Methodology
- Requestors: users who post a task or job online (start a thread)
- Workers: users who carry out the online tasks posted by requestors
Methods
- Analyze the types of malicious tasks and the properties of workers and requestors
- Propose a framework for linking these tasks to their workers on social media sites, and hence track the activities of crowdturfers
- Identify the hidden structure of the workers (three classes of crowdturfers identified: professional, casual and middle men)
- Propose and develop statistical models to differentiate between regular online users and workers
Web sites used: Microworkers.com, ShortTask.com, Rapidworkers.com

Crowdturfing tasks and participants
- Collected 505 campaigns over a span of 2 months
- Tasks:
  - Social media manipulation (56%)
  - Sign up (26%)
  - Search engine spamming (7%)
  - Vote stuffing (4%)
  - Miscellany (7%)
- Got a total of 144 requestor profiles and 4012 worker profiles from Microworkers.com

- Amazon Mechanical Turk (AMT): 90% of workers are from the USA and India
- Microworkers.com (MW): workers from 75 different countries, with 38% (1539) from Bangladesh
- Almost 3 million tasks completed, with total earnings of $500K

- Requestors come from 31 different countries
- 55% of them are from the US, and the majority (70%) are from English-speaking countries
- Average money earned per task is $0.51

Linking crowdsourcing workers to social media
- Following workers on Twitter reveals two main types of tasks:
  - Tweeting about a link: aimed at increasing the PageRank of a page
  - Following a Twitter user: aimed at increasing that user's visibility
- Collected 10,000 random samples to distinguish between workers and non-workers
- Ensured that these samples are non-workers by:
  - Monitoring their accounts for a month to check whether they are active
  - Manual checking
- Found a total of 9878 non-workers

Analysis of workers by profile
Observations:
- Workers have more followers and followings on average than regular users, but fewer tweets
- Workers are well connected with other users

Analysis of workers by activity
[Figure: cumulative distribution functions for three distinct activity-based characteristics]

Analysis of workers by activity
Observations:
- Workers rarely communicate with each other via @username
- Workers retweet more often than non-workers
- Workers tend to send more URL links
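The three activity characteristics above can be expressed as simple per-account ratios. The following is a minimal sketch, not the paper's feature extraction: the function name, the regular expressions, and the assumption that a retweet starts with "RT @" are all illustrative choices.

```python
import re

# Hypothetical patterns; the paper's exact feature definitions are not given in the slides
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")


def activity_features(tweets):
    """Return the share of an account's tweets that are retweets,
    contain a URL, or address another user via @username."""
    n = len(tweets)
    if n == 0:
        return {"retweet_ratio": 0.0, "url_ratio": 0.0, "mention_ratio": 0.0}

    # Assumption: old-style retweets begin with "RT @"
    retweets = sum(1 for t in tweets if t.startswith("RT @"))
    with_url = sum(1 for t in tweets if URL_RE.search(t))
    # Count @mentions only outside retweets, since "RT @user" is not a direct message
    with_mention = sum(
        1 for t in tweets if not t.startswith("RT @") and MENTION_RE.search(t)
    )
    return {
        "retweet_ratio": retweets / n,
        "url_ratio": with_url / n,
        "mention_ratio": with_mention / n,
    }
```

Comparing the distributions of these ratios for workers and non-workers would give curves of the kind summarized in the observations.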

Analysis of workers by studying linguistic characteristics of the tweet
- Compared each tweet to the LIWC dictionary (Linguistic Inquiry and Word Count)
- The dictionary contains 68 categories
- Each tweet gets a score for each category
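Dictionary-based scoring of this kind can be sketched as follows. The LIWC dictionary itself is proprietary, so `TOY_DICTIONARY` below is a made-up two-category stand-in, and scoring each category as the percentage of words that match it is an assumption about how the scores were computed.

```python
import re

# Hypothetical stand-in for the 68 LIWC categories
TOY_DICTIONARY = {
    "first_person_singular": {"i", "me", "my", "mine", "myself"},
    "swear": {"damn", "hell"},
}


def category_scores(text, dictionary=TOY_DICTIONARY):
    """Score a text per category as the percentage of its words in that category."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return {category: 0.0 for category in dictionary}
    return {
        category: 100.0 * sum(word in vocabulary for word in words) / len(words)
        for category, vocabulary in dictionary.items()
    }
```

For example, `category_scores("I love my damn phone")` has five words, two of them first person singular and one a swear word.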

Analysis of workers by studying linguistic characteristics of the tweet
Observations:
- Workers tend to swear less than non-workers
- Workers use the first person singular less

Network structure of Twitter users
Step 1: Check worker closeness
- Workers are very close to each other, forming a close-knit network
- The average graph density of workers is 0.0039
- In a previous study, the average graph density was measured as 0.000000845 (Yang et al. 2012)
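For reference, graph density can be computed as below. This sketch assumes the follow graph is directed and that density means the number of follow edges divided by the number of possible ordered pairs, N(N-1); the slides do not state which definition was used.

```python
def graph_density(nodes, edges):
    """Density of a directed graph: edges present / edges possible.

    nodes: collection of user ids
    edges: iterable of (follower, followed) pairs
    """
    n = len(nodes)
    if n < 2:
        return 0.0
    # Ignore self-loops and duplicate edges so the ratio stays in [0, 1]
    unique_edges = {(u, v) for u, v in edges if u != v}
    return len(unique_edges) / (n * (n - 1))


# Ratio of the two densities quoted on the slide
worker_density = 0.0039
reference_density = 0.000000845
ratio = worker_density / reference_density
```

The ratio of the two quoted figures works out to roughly 4,600, which is the sense in which the workers form a close-knit network.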

Network structure of Twitter users
Step 2: Hubs and Authorities (HITS)
- Hubs: workers who follow many other workers
- Authorities: workers who are followed by many other workers
- Formulae: $a = A^{T} h$ and $h = A a$, where $A$ is the adjacency matrix of the worker follow graph, $a$ the authority scores and $h$ the hub scores
Observations:
- Many of the top 10 hubs are also among the top 10 authorities
- This means they are well connected
- The top hub and authority is NannyDotNet
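The two HITS update rules can be iterated until the scores settle. Below is a minimal sketch of that iteration on an edge list, where an edge `(u, v)` means that worker `u` follows worker `v`; the fixed iteration count and the L2 normalization are illustrative choices, not details taken from the slides.

```python
from math import sqrt


def hits(edges, iterations=50):
    """Return (hub, authority) score dictionaries for a directed follow graph."""
    nodes = sorted({node for edge in edges for node in edge})
    hub = {node: 1.0 for node in nodes}
    auth = {node: 1.0 for node in nodes}

    for _ in range(iterations):
        # a = A^T h: a user's authority is the sum of the hub scores of its followers
        auth = {node: 0.0 for node in nodes}
        for follower, followed in edges:
            auth[followed] += hub[follower]

        # h = A a: a user's hub score is the sum of the authority scores of those it follows
        hub = {node: 0.0 for node in nodes}
        for follower, followed in edges:
            hub[follower] += auth[followed]

        # Normalize both vectors so the scores do not grow without bound
        auth_norm = sqrt(sum(x * x for x in auth.values())) or 1.0
        hub_norm = sqrt(sum(x * x for x in hub.values())) or 1.0
        auth = {node: x / auth_norm for node, x in auth.items()}
        hub = {node: x / hub_norm for node, x in hub.items()}

    return hub, auth
```

Sorting either dictionary by score gives a top-10 list of the kind compared in the observations.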

Network structure of Twitter users
Step 3: Professional workers
- Break down the 2864 workers to examine them in more depth
- Two types of workers:
  - Professional workers (at least 3 campaigns): 187 in number
  - Casual workers (1 or 2 campaigns)
Observations:
- The graph density among professional workers is 0.028
Insights:
- Middle men: some professional workers commonly retweeted messages generated by two users, "Alexambroz" and "Oboy"
- Professional workers follow these middle men and retweet their messages, hence increasing their rank
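The professional/casual split is a threshold on the number of campaigns a worker took part in. A minimal sketch, assuming a hypothetical mapping from worker id to the campaigns that worker joined:

```python
def split_workers(campaigns_by_worker, min_campaigns=3):
    """Split workers into (professional, casual) sets by distinct campaign count."""
    professional, casual = set(), set()
    for worker, campaigns in campaigns_by_worker.items():
        # Count distinct campaigns so repeated tasks in one campaign are not double counted
        if len(set(campaigns)) >= min_campaigns:
            professional.add(worker)
        else:
            casual.add(worker)
    return professional, casual
```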

Detecting middle men
How to find middle men?
- Step 1: Investigate the messages of the 187 professional workers and extract their tweets containing URLs
- Step 2: Count how many professional workers retweeted each of the extracted messages
- Step 3: Sort the extracted messages in descending order of frequency, and get the origin user each message came from
- Found 575 potential middle men
- The top 10 middle men had a large number of followers, and many are interested in social media strategy, social marketing and SEO
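The three steps above can be sketched as a counting pass over retweet records. The record layout `(worker, origin_user, message)` and the function name are assumptions made for illustration; the slides do not describe the data format.

```python
import re
from collections import defaultdict

URL_RE = re.compile(r"https?://\S+")


def rank_middle_men(retweets, professional_workers):
    """Rank (origin_user, message) pairs by how many professional workers retweeted them.

    retweets: iterable of (worker, origin_user, message) records
    professional_workers: set of worker ids
    """
    retweeters = defaultdict(set)
    for worker, origin_user, message in retweets:
        # Step 1: keep only URL-bearing messages retweeted by professional workers
        if worker in professional_workers and URL_RE.search(message):
            # Step 2: collect the distinct professional workers per message
            retweeters[(origin_user, message)].add(worker)

    # Step 3: sort by descending frequency; ties are broken alphabetically for stability
    ranked = sorted(retweeters.items(), key=lambda item: (-len(item[1]), item[0]))
    return [(origin, message, len(workers)) for (origin, message), workers in ranked]
```

The origin users at the top of the returned list are the candidate middle men.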

Detecting crowd workers
- Training set of 2864 workers and 9878 non-workers
- 10-fold cross-validation
- 30 classification algorithms
- WEKA machine learning toolkit
- Feature groups (a total of 192 features in these groups):
  - UD: user demographics
  - UFN: user friendship networks
  - UA: user activity
  - UC: user content
Observations:
- Positive chi-square value for all of the features
- Min accuracy: 86%
- Max accuracy: 91%
- Random forests produced the highest accuracy: 93.26%
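To put the reported accuracies in context, the sketch below shows a plain 10-fold cross-validation loop together with a majority-class baseline. It is not the WEKA setup used in the study: the fold assignment (every k-th sample, no shuffling), the `train_fn` interface and the baseline classifier are assumptions chosen to keep the example self-contained, and it assumes at least k samples.

```python
from collections import Counter


def k_fold_indices(n, k=10):
    """Assign sample i to fold i % k and return the test indices of each fold."""
    return [list(range(fold, n, k)) for fold in range(k)]


def majority_classifier(train_features, train_labels):
    """Baseline that always predicts the most common training label."""
    majority = Counter(train_labels).most_common(1)[0][0]
    return lambda _features: majority


def cross_validate(features, labels, train_fn, k=10):
    """Mean accuracy over k folds; train_fn(features, labels) returns a predict function."""
    accuracies = []
    for test_idx in k_fold_indices(len(labels), k):
        held_out = set(test_idx)
        train_x = [x for i, x in enumerate(features) if i not in held_out]
        train_y = [y for i, y in enumerate(labels) if i not in held_out]
        predict = train_fn(train_x, train_y)
        correct = sum(predict(features[i]) == labels[i] for i in test_idx)
        accuracies.append(correct / len(test_idx))
    return sum(accuracies) / len(accuracies)


# Accuracy of always answering "non-worker" on the class sizes from the slide
baseline_accuracy = 9878 / (2864 + 9878)
```

With 2864 workers and 9878 non-workers, always predicting "non-worker" is already right about 77.5% of the time, so the reported accuracies are best read against that floor rather than against 50%.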

Related Work
- (Kittur, Chi and Suh 2008) showed how a large number of workers can be hired within a short time frame at low cost, using Amazon Mechanical Turk
- (Venetis and Garcia-Molina 2012) proposed two mechanisms for controlling the quality of outputs, which is a concern because of the openness of such web sites:
  - Repeat each task multiple times and combine the results from multiple users
  - Define a score for each worker and eliminate the work of users with low scores
- Recent research has begun augmenting traditional information retrieval systems and database systems (Alonso, Rose and Stewart 2008)

Q and A