Enquiring Minds: Early Detection of Rumors in Social Media from Enquiry Posts Zhe Zhao Paul Resnick Qiaozhu Mei Presentation Group 2.


Outline: Introduction, Background Study, Approach for Detection, Experimental Setup, Evaluation, Conclusion

What Is a Rumor?

A rumor is a controversial, fact-checkable statement.

Example statements:
Recreational marijuana should be made legal
Recreational marijuana becomes legal in Michigan
Malaysia Airlines MH370 is missing
Malaysia Airlines MH370 crashed

Introduction
Not every post on social media can be treated as a factual claim.
The broad success of online social media has created fertile soil for the emergence and fast spread of rumors.
This paper proposes an automated tool to identify potential rumors.

Spread of a Rumor
Oh my god is this real? RT @AP: Breaking: Two Explosions in the White House and Barack Obama is injured
Is this true? Or hacked account? RT @AP Breaking: Two Explosions in the White House and Barack Obama is injured
Is this real or hacked? RT @AP: Breaking: Two Explosions in the White House and Barack Obama is injured
Is this legit? RT @AP Breaking: Two Explosions in the White House and Barack Obama is injured

Detecting Rumors
Rumors are spotted largely through the key phrases that appear in reactions to them: “Is this true?”, “Really?”, “What?”
The paper proposes an algorithm for identifying newly emerging, controversial topics that scales to massive streams of tweets by looking at signal tweets.
It then identifies a set of regular expressions that define the set of signal tweets.
The key insight is that some people who are exposed to a rumor, before deciding whether to believe it or not, will take a step of information enquiry to seek more information or to express skepticism without asserting specifically that it is false.
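The idea of defining signal tweets by regular expressions can be sketched as follows. This is only an illustration: the patterns are hypothetical, loosely modelled on the phrases quoted on this slide, and are not the pattern set used in the paper. The name `is_signal_tweet` is also an assumption.

```python
import re

# Hypothetical enquiry/correction patterns (assumptions for illustration,
# not the paper's actual list).
SIGNAL_PATTERNS = [
    r"\bis (?:this|that|it) (?:true|real|legit)\b",  # "Is this true?"
    r"\breally\s*\?",                                # "Really??"
    r"\bwhat\s*[?!]",                                # "What??!!"
    r"\b(?:this|that|it) is not true\b",             # explicit correction
    r"\b(?:rumou?r|debunk(?:ed)?|hoax)\b",           # rumor vocabulary
]
SIGNAL_RE = re.compile("|".join(SIGNAL_PATTERNS), re.IGNORECASE)


def is_signal_tweet(text: str) -> bool:
    """Return True if the tweet matches any enquiry or correction pattern."""
    return SIGNAL_RE.search(text) is not None


if __name__ == "__main__":
    tweet = "Is this real or hacked? RT @AP: Breaking: Two Explosions in the White House"
    print(is_signal_tweet(tweet))
```

Because the filter is a single compiled regular expression applied per tweet, it can run over a stream before any clustering takes place.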

Related Work
Detection problems in social media
Work on detecting rumors has started in recent years. Sharing, retweeting, and trending behavior is used to determine whether something is a rumor or not.
Question asking in social media
Another detection feature used in related work is question asking. Mendoza et al. found, on a small set of cases, that false tweets were questioned much more than confirmed truths.
Detection using question marks
Previous work has shown that only one third of tweets with question marks are real questions, and not all questions are related to rumors.

Problem Statement
Rumor cluster: We define a rumor cluster R as a group of social media posts that are either declaring, questioning, or denying the same fact claim, s, which may be true or false. Let S be the set of posts declaring s, E be the set of posts questioning s, and C be the set of posts denying s; then R = S ∪ E ∪ C. We say s is a candidate rumor if S ≠ ∅ and E ∪ C ≠ ∅.
The paper’s objective is to minimize the delay from the time when the first tweet about the rumor is posted to the detection time.
A rumor is both fact-checkable and controversial.
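The set definition above translates almost directly into code. The sketch below assumes posts are represented by hashable identifiers; the function names are hypothetical.

```python
def rumor_cluster(declaring: set, enquiring: set, denying: set) -> set:
    """R = S ∪ E ∪ C: every post that declares, questions, or denies s."""
    return declaring | enquiring | denying


def is_candidate_rumor(declaring: set, enquiring: set, denying: set) -> bool:
    """s is a candidate rumor if S is non-empty and E ∪ C is non-empty."""
    return bool(declaring) and bool(enquiring | denying)


if __name__ == "__main__":
    S, E, C = {"t1", "t2"}, {"t3"}, set()
    print(rumor_cluster(S, E, C), is_candidate_rumor(S, E, C))
```

A statement that is only declared, with nobody questioning or denying it, is therefore not a candidate rumor under this definition.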

Approach for Detection

Detection of Rumors
1. Identify signal tweets
2. Identify signal clusters
3. Detect statements
4. Capture non-signal tweets
5. Rank candidate rumor clusters

Identify Signal Tweets
To detect rumors, the first thing we should know is what rumors look like.
The authors look for two kinds of signal: a verification of a piece of factual knowledge, e.g. “According to the Mayan Calendar, does the world end on Dec 16th, 2013?”, or a correction (debunk), e.g. “This news is true!”

What We Need Is More Than Theory
Porter stemming and a Chi-squared test were applied to 10,417 tweets, of which 3,423 were labeled as verification or correction, to extract the patterns that make good signals.

Identify Signal Clusters
What is a signal cluster? After a rumor tweet emerges, people may follow it, i.e. retweet it or post a new tweet containing similar information, thus forming a group or cluster.
Is it true? Two explosions in the White House and Barack Obama is injured!
What? An eight-year-old girl died at Boston marathon explosion.
The shocking news is tested to be wrong!

How Do We Do It?
Use a connected-component clustering algorithm over tweets linked by Jaccard similarity, with MinHash to estimate the similarity efficiently.
What??!! Two Explosions in the White House and Barack Obama is Injured in head.
Is it true?? Two Explosions in the White House and Barack Obama is Injured on arm.
Really?? @AP: Two Explosions in the White House and Barack Obama is not Injured.
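A minimal sketch of this clustering step is given below. It makes several assumptions that are not stated on the slide: tweets are compared as sets of word 3-grams, two tweets are linked when their Jaccard similarity is at least 0.6, and the similarity is computed exactly for every pair rather than estimated with MinHash, which only suits small inputs. All function names are hypothetical.

```python
from itertools import combinations


def shingles(text: str, n: int = 3) -> set:
    """Set of word n-grams; short texts yield a single shorter tuple."""
    tokens = text.lower().split()
    if len(tokens) < n:
        return {tuple(tokens)}
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a: set, b: set) -> float:
    """|a ∩ b| / |a ∪ b|, defined as 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_tweets(tweets: list, threshold: float = 0.6) -> list:
    """Connected components of the graph linking sufficiently similar tweets."""
    parent = list(range(len(tweets)))

    def find(i: int) -> int:
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    sets = [shingles(t) for t in tweets]
    for i, j in combinations(range(len(tweets)), 2):
        if jaccard(sets[i], sets[j]) >= threshold:
            parent[find(i)] = find(j)  # merge the two components

    clusters = {}
    for i in range(len(tweets)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


if __name__ == "__main__":
    sample = [
        "is it true two explosions in the white house and barack obama is injured",
        "what two explosions in the white house and barack obama is injured",
        "an eight year old girl died at the boston marathon",
    ]
    print(cluster_tweets(sample))
```

The all-pairs loop is quadratic in the number of tweets; MinHash exists to avoid exactly that cost on large streams by comparing short signatures instead of full n-gram sets.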

Detect Statement
At this point we have a few clusters of potential rumors, but we are not yet sure about their content. Our goal is the rumor content, not the signal pattern. Which text should we draw out?

A Way Out
Pick out the statement text that appears in more than 80% of the tweets in the cluster. Why 80%? Text shared that widely has a higher probability of being the rumor.
What??!! Two Explosions in the White House and Barack Obama is Injured
Is it true?? Two Explosions in the White House and Barack Obama is Injured
Really?? @AP: Two Explosions in the White House and Barack Obama is Injured
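One possible reading of the 80% rule is sketched below: keep the word 3-grams that occur in more than 80% of the tweets in a cluster and stitch them together in the order they appear. The choice of 3-grams, the whitespace tokenization, and the function name `extract_statement` are assumptions made for illustration.

```python
from collections import Counter


def extract_statement(tweets: list, n: int = 3, min_share: float = 0.8) -> str:
    """Join the word n-grams that occur in more than min_share of the tweets."""
    if not tweets:
        return ""
    tokenized = [t.lower().split() for t in tweets]

    def grams(tokens: list) -> list:
        return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

    # Count each n-gram once per tweet.
    counts = Counter()
    for tokens in tokenized:
        counts.update(set(grams(tokens)))
    frequent = {g for g, c in counts.items() if c > min_share * len(tweets)}

    # Use the tweet containing the most frequent n-grams as the reference order.
    reference = max(tokenized, key=lambda toks: sum(g in frequent for g in grams(toks)))

    words, covered = [], 0  # covered = index just past the last emitted token
    for i, gram in enumerate(grams(reference)):
        if gram in frequent:
            words.extend(reference[max(i, covered):i + n])
            covered = i + n
    return " ".join(words)


if __name__ == "__main__":
    cluster = [
        "what two explosions in the white house and barack obama is injured",
        "is it true two explosions in the white house and barack obama is injured",
        "really @ap two explosions in the white house and barack obama is injured",
    ]
    print(extract_statement(cluster))
```

On the three example tweets, the enquiry prefixes differ and drop out, while the shared claim survives as the extracted statement.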

Compare Non-Signal Tweets
Recall that rumor clusters were detected using signal tweets only. Tweets that are neither verifications nor corrections can still carry rumor information.
Match the extracted statements against non-signal tweets, again using Jaccard similarity. If the score is above 0.6, the tweet is considered a match.
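The matching step might look like the sketch below. The 0.6 threshold comes from the slide; comparing sets of word 3-grams and the function names are assumptions.

```python
def ngram_set(text: str, n: int = 3) -> set:
    """Set of word n-grams; short texts yield a single shorter tuple."""
    tokens = text.lower().split()
    if len(tokens) < n:
        return {tuple(tokens)}
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a: set, b: set) -> float:
    """|a ∩ b| / |a ∪ b|, defined as 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recall_non_signal(statement: str, tweets: list, threshold: float = 0.6) -> list:
    """Return the tweets whose similarity to the statement exceeds the threshold."""
    target = ngram_set(statement)
    return [t for t in tweets if jaccard(target, ngram_set(t)) > threshold]


if __name__ == "__main__":
    statement = "two explosions in the white house and barack obama is injured"
    stream = [
        "breaking two explosions in the white house and barack obama is injured",
        "funny cat picture goes viral today",
    ]
    print(recall_non_signal(statement, stream))
```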

Rank Candidate Rumor Clusters
At this point we have several rumor clusters, each standing for one rumor statement, but the output should single out the most likely rumor. Rank by popularity? No: a popular cluster may just be, e.g., a funny picture or a touching video.
Features for ranking rumor clusters:
Percentage of signal tweets
Entropy ratio
Tweet lengths
Retweets
URLs
Hashtags
@ Mentions
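A few of these features can be computed per cluster as in the sketch below. The exact feature definitions are not given on the slide, so the ones here (shares of tweets containing each element, average length in characters) are assumptions, and the entropy ratio is left out because its definition is not stated. The resulting dictionary would then be fed to a ranking model.

```python
import re

RETWEET_RE = re.compile(r"\bRT @\w+")
URL_RE = re.compile(r"https?://\S+")
HASHTAG_RE = re.compile(r"#\w+")
MENTION_RE = re.compile(r"@\w+")


def cluster_features(tweets: list, signal_flags: list) -> dict:
    """Simple per-cluster features; signal_flags[i] marks tweet i as a signal tweet."""
    if not tweets or len(tweets) != len(signal_flags):
        raise ValueError("need a non-empty cluster with one flag per tweet")
    total = len(tweets)

    def share(pattern: re.Pattern) -> float:
        # Fraction of tweets in which the pattern occurs at least once.
        return sum(1 for t in tweets if pattern.search(t)) / total

    return {
        "signal_share": sum(1 for flag in signal_flags if flag) / total,
        "avg_length": sum(len(t) for t in tweets) / total,
        "retweet_share": share(RETWEET_RE),
        "url_share": share(URL_RE),
        "hashtag_share": share(HASHTAG_RE),
        "mention_share": share(MENTION_RE),
    }


if __name__ == "__main__":
    cluster = [
        "Is this true? RT @AP: Two explosions #breaking",
        "Two explosions http://example.com",
    ]
    print(cluster_features(cluster, [True, False]))
```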

Experimental Setup

Data Sets
BOSTON MARATHON BOMBING (high-profile event): Two bombs exploded at the finish line of the annual Boston Marathon on April 15th, 2013. The data set contains 30,340,218 unique tweets.
GARDENHOSE (random sample): A tweet stream collected over a random month of 2013 (November 1 to November 30, 2013), containing 1,242,186,946 tweets.

Baselines and Variants of Methods
Methods 1 to 4 differ in how signal tweets are identified; they rank candidate rumors purely by popularity, i.e. the number of tweets in the cluster.
1. Trending Topics
2. Hashtag Tracking
3. Corrections Only
4. Enquiries and Corrections
Methods 5 and 6 differ in how the candidate rumor clusters are ranked; they use both enquiry and correction tweets as signals.
5. SVM ranking
6. Decision tree ranking

Effectiveness of Enquiry Signals: Precision of Candidate Rumor Clusters
Precision of rumor detection using different signals. Candidate rumors are ranked by popularity only. Maximum number of output rumor clusters: 10 per hour for BOSTON and 50 per day for GARDENHOSE.

Effectiveness of Enquiry Signals: Earliness of Detection
Earliness of detection compared to Enquiries + Corrections: enquiry signals help to detect rumors hours earlier.

Ranking Candidate Rumor Clusters
Precision@N is the percentage of real rumors among the top N candidate rumor clusters output by a method.
Precision@N of different ranking methods
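Precision@N as defined here can be computed as in the sketch below. The function name is hypothetical, and dividing by the number of clusters actually returned when fewer than N are available is an assumption.

```python
def precision_at_n(ranked_labels: list, n: int) -> float:
    """Fraction of real rumors among the top n ranked clusters.

    ranked_labels holds one boolean per cluster, best-ranked first,
    True meaning the cluster was judged to be a real rumor.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    top = ranked_labels[:n]
    if not top:
        return 0.0
    # Assumption: if fewer than n clusters were output, divide by that count.
    return sum(1 for label in top if label) / len(top)


if __name__ == "__main__":
    labels = [True, False, True, False, False]
    print(precision_at_n(labels, 3))
```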

Effectiveness of Enquiry Signals
To verify that the ranking algorithm is not overfitting to a single data set, we also applied the decision tree trained on 7 days of labeled results from the GARDENHOSE data set to rank rumor clusters detected hourly from the BOSTON data set.
Precision@N if rumor clusters are ranked by the decision tree: one third of the top 50 clusters are real rumors.

Efficiency of Framework
Stages: filtering of tweets, clustering, potential rumor statements.
The cost is significantly reduced compared to an approach that first generates trending topics and then identifies rumors.

Time Comparison
Trending Topics: clustering
Hashtag Tracking: filtering and clustering
This method: filtering, clustering, then retrieving back
The baselines used the same clustering and ranking implementation, except that tweets were not filtered by enquiry patterns and were not retrieved back after clustering.

Tracking Rumor Using Enquiry Method Tracking detected rumors about Boston Marathon Bombing

Conclusion
The method capitalizes on verification questions, which also appear sooner, facilitating early detection.
It clusters only those tweets that contain enquiry patterns, extracts the statements, and uses them to pull back in the rest of the non-signal tweets.
It is robust even with more than 100 million tweets.
Future work:
Have signals labelled by humans for iterative improvement
Improve the filtering of enquiry and correction signals by training a classifier

Questions?