The Velocity of Censorship: High-Fidelity Detection of Microblog Post Deletions
Tao Zhu (1); David Phipps (2); Adam Pridgen (3); Jedidiah R. Crandall (4); Dan S. Wallach (3)
(1) Independent Researcher; (2) Bowdoin College; (3) Rice University; (4) University of New Mexico
22nd USENIX Security Symposium (USENIX Security '13)
左昌國, 2013/09/10, ADLab, CSIE, NCU
Outline
- Introduction
- Methodology
- Hypotheses
- Topic Extraction
- Discussion
- Conclusion
Introduction
- Microblogs in China: Weibo
- Sina Weibo
  - 503 million registered users (Dec. 2012)
  - 100 million messages sent daily
  - Promoting visibility of social issues
- China employs both backbone-level filtering of IP packets and higher-level filtering implemented in software
- Many works focus on how and what to filter
- This paper focuses on how quickly microblog posts are removed
Introduction
Contributions:
- An implementation of a method that detects a censorship event within 1-2 minutes of its occurrence
- An understanding of how Weibo can react so quickly when deleting posts with sensitive content
  - 4 hypotheses
- A way to overcome the use of neologisms, named entities, and informal language in Chinese for topical analysis
Methodology
- Identifying the sensitive user group
- Crawling posts of the sensitive user group
- Detecting deletions
Methodology – Identifying the Sensitive User Group
- Search using outdated sensitive keywords from China Digital Times (sensitive-words-grass-mud-horse-list/)
  - Using keywords like “党产共”
- Starting with 25 sensitive users (manually selected)
- [Diagram labels: 25 sensitive users; > 5 reposts for each user; > 5 deletions; 26]
Methodology – Identifying the Sensitive User Group
- The sensitive group reaches 3567 users after 15 days
- More than 4500 post deletions daily
  - 1500 “permission denied” posts
- 12% of the total posts from the group were eventually deleted
- This methodology cannot provide a representative sample of the whole of Weibo
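The group-expansion loop on these two slides can be sketched roughly as follows. This is a minimal sketch of one reading of the diagram: the two thresholds of 5 come from the slide, while the function name and the `reposts` and `deletions` inputs are hypothetical stand-ins for crawled data.

```python
from collections import Counter

REPOST_THRESHOLD = 5    # from the slide: "> 5 reposts"
DELETION_THRESHOLD = 5  # from the slide: "> 5 deletions"


def expand_sensitive_group(seed_users, reposts, deletions):
    """Grow the sensitive group from a manually selected seed set.

    reposts:   list of (reposting_user, original_author) pairs (hypothetical input)
    deletions: dict mapping user -> number of that user's posts seen deleted
               (hypothetical input)
    """
    group = set(seed_users)
    changed = True
    while changed:
        changed = False
        # Count how often each author is reposted by current group members.
        counts = Counter(author for user, author in reposts if user in group)
        for author, n in counts.items():
            if author in group:
                continue
            # Add authors who are reposted often AND whose posts get deleted often.
            if n > REPOST_THRESHOLD and deletions.get(author, 0) > DELETION_THRESHOLD:
                group.add(author)
                changed = True
    return group
```

The loop repeats until no new user qualifies, which mirrors how the group could grow from 25 seed users to a few thousand over several days of crawling.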
Methodology – Crawling
- User timeline: the Weibo user timeline API returns the most recent 50 posts of the specified user
- Querying 3567 sensitive users once per minute
  - 100 accounts for API calls
  - 300 concurrent Tor circuits
  - Four-node cluster running Hadoop and HBase
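As a back-of-the-envelope check on the crawl load these figures imply, the request rate can be worked out directly. The user count, account count, and polling interval are from the slide; the even split of calls across accounts is an assumption made only for illustration.

```python
USERS = 3567          # sensitive users, from the slide
ACCOUNTS = 100        # accounts used for API calls, from the slide
POLL_INTERVAL_S = 60  # each user is queried once per minute

# One timeline call per user per polling interval.
calls_per_minute = USERS
calls_per_second = calls_per_minute / POLL_INTERVAL_S

# Assumes calls are spread evenly over the accounts (not stated in the source).
calls_per_account_per_minute = calls_per_minute / ACCOUNTS

print(f"{calls_per_second:.2f} timeline calls per second overall")
print(f"{calls_per_account_per_minute:.2f} calls per account per minute")
```

Under these assumptions the crawler issues on the order of 60 timeline calls per second, which suggests why multiple accounts and many Tor circuits were used.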
Methodology – Detecting Deletions
- If a post is in the database but is not returned from Weibo, issue a secondary query for that post to determine what error message is returned
- Permission-denied deletion (system deletion)
  - “Permission-Denied” error
  - Caused by a censorship event
  - The post still exists but cannot be accessed by users
- General deletion
  - “Post does not exist” error
  - May be caused by user self-deletion or by a censorship event
  - The post no longer exists
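The detection logic on this slide can be sketched as a set difference followed by a secondary lookup. This is a minimal sketch: `query_post` is a hypothetical callback standing in for the secondary query, and the error strings are placeholders rather than Weibo's actual API responses.

```python
from enum import Enum


class Deletion(Enum):
    SYSTEM = "permission denied"     # post exists but is inaccessible
    GENERAL = "post does not exist"  # removed by the user or by censors


def find_missing(stored_ids, timeline_ids):
    """Return stored post IDs that the latest timeline no longer lists."""
    return sorted(set(stored_ids) - set(timeline_ids))


def classify(error_message):
    """Map the secondary-query error to a deletion type.

    The matched strings are placeholders, not real API error messages.
    """
    text = error_message.lower()
    if "permission denied" in text:
        return Deletion.SYSTEM
    if "does not exist" in text:
        return Deletion.GENERAL
    return None  # post is still reachable, or the error is unrelated


def detect_deletions(stored_ids, timeline_ids, query_post):
    """query_post(post_id) -> error string, '' if the post is fine (hypothetical)."""
    results = {}
    for post_id in find_missing(stored_ids, timeline_ids):
        kind = classify(query_post(post_id))
        if kind is not None:
            results[post_id] = kind
    return results
```

The secondary query matters because a post can drop out of the 50-post timeline window simply by being old; only an error response marks it as deleted.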
Methodology – Detecting Deletions
- This paper focuses on system deletions
  - Apparently not by users
- From July 2012 to September 2012, 2.38 million posts were collected, with a 12.8% total deletion rate (4.5% for system deletions and 8.3% for general deletions)
- The lifetime of a post is the difference between the time the system detected the post's deletion and the post's creation time
- The measurement fidelity is on the order of minutes
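The lifetime definition above translates directly into a timestamp subtraction. The sketch below assumes Python `datetime` values for both timestamps; the function names and the sample data in the usage note are illustrative, not measurements from the paper.

```python
from datetime import datetime, timedelta


def post_lifetime(created_at, detected_deleted_at):
    """Lifetime = time the deletion was detected minus the creation time."""
    return detected_deleted_at - created_at


def fraction_deleted_within(lifetimes, limit):
    """Share of deleted posts whose lifetime is at most `limit`."""
    if not lifetimes:
        return 0.0
    return sum(1 for lt in lifetimes if lt <= limit) / len(lifetimes)


# Made-up example: four deleted posts created at the same moment.
created = datetime(2012, 7, 20, 12, 0)
lifetimes = [
    post_lifetime(created, created + timedelta(minutes=m))
    for m in (5, 30, 90, 2000)
]
within_hour = fraction_deleted_within(lifetimes, timedelta(hours=1))
within_day = fraction_deleted_within(lifetimes, timedelta(hours=24))
```

Because deletion is only observed at the next poll, a measured lifetime overstates the true one by up to roughly one polling interval, consistent with a fidelity on the order of minutes.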
Distribution of Deleted Posts
Hypotheses
- How can the Weibo system find sensitive posts and remove them so quickly?
- How do moderators locate sensitive posts in the huge database after a month?
- Weibo has different strategies for targeting sensitive content
Hypotheses
Hypothesis 1: Weibo has filtering mechanisms as a proactive, automated defense
- Explicit filtering
- Implicit filtering
- “shishikanfalunhowle”
- Camouflaged posts
Hypotheses
Hypothesis 2: Weibo targets specific users, such as those who frequently post sensitive content
Hypotheses
Hypothesis 3: When a sensitive post is found, a moderator will use automated searching tools to find all of its related reposts (parent, child, etc.) and delete them all at once
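To illustrate what finding "all related reposts" could involve, the sketch below walks a repost tree up to its root and then down to every descendant. It is purely illustrative: the `parent_of` mapping is hypothetical data, and nothing here is known about the tools Weibo moderators actually use.

```python
from collections import defaultdict, deque


def related_reposts(post_id, parent_of):
    """Return every post in the same repost tree as `post_id`.

    parent_of: dict mapping child_post_id -> parent_post_id (hypothetical data).
    """
    # Walk up to the root of the repost chain, guarding against cycles.
    root = post_id
    seen = {root}
    while root in parent_of and parent_of[root] not in seen:
        root = parent_of[root]
        seen.add(root)

    # Invert the mapping into parent -> children lists.
    children = defaultdict(list)
    for child, parent in parent_of.items():
        children[parent].append(child)

    # Breadth-first walk down from the root collects the whole tree.
    found, queue = {root}, deque([root])
    while queue:
        current = queue.popleft()
        for child in children[current]:
            if child not in found:
                found.add(child)
                queue.append(child)
    return found
```

Starting from any single flagged post, one traversal yields the full set of posts that would share a deletion time if the hypothesis holds.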
Hypotheses
Hypothesis 4: Deletion speed is related to the topic. That is, particular topics are targeted for deletion based on how sensitive they are.
Main 5 topics:
- Qidong
- Qian Yunhui
- Beijing Rainstorm
- Diaoyu Island
- Group Sex
Topic Extraction
- Automatic methods are needed to classify the posts
- TF*IDF
  - Assigns weights to the terms (n-grams) of a document
- Pointillism approach [27]
  - Reconstructs words and phrases from grams using external information
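The TF*IDF weighting of n-grams can be sketched in a few lines. This is a minimal sketch using one common textbook variant (term frequency normalized by document length, and idf = log(N / df)); the exact weighting used in the paper may differ, and the function names are illustrative.

```python
import math
from collections import Counter


def char_ngrams(text, n=2):
    """Overlapping character n-grams; no word segmentation is required."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def tf_idf(documents, n=2):
    """Return one {ngram: weight} dict per document.

    tf = count / total grams in the document, idf = log(N / df).
    This is a textbook variant, assumed here for illustration.
    """
    grams_per_doc = [Counter(char_ngrams(doc, n)) for doc in documents]

    # Document frequency: in how many documents each gram occurs.
    doc_freq = Counter()
    for grams in grams_per_doc:
        doc_freq.update(grams.keys())

    num_docs = len(documents)
    weights = []
    for grams in grams_per_doc:
        total = sum(grams.values())
        weights.append({
            g: (count / total) * math.log(num_docs / doc_freq[g])
            for g, count in grams.items()
        })
    return weights
```

Working on character n-grams sidesteps Chinese word segmentation, and a gram that occurs in every document receives weight zero, so only distinctive grams stand out.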
Topic Extraction
- 李 W 阳 (Li Wangyang, from 李旺阳)
- 六圌四 (June Fourth, from 六四)
- 胡 () 涛 (Hu Jintao, from 胡锦涛)
- 启 - 东, 启 \ 东 and 启 / 东 (Qidong, from 启东)
Topic Extraction
- Which topics among these have been discussed for the longest period of time?
- Independent Component Analysis (ICA)
- Beijing, government, China, country, policeman, and people
  - These 6 terms appear in almost every individual topic
Discussion – Filtering Mechanisms
- Proactive mechanisms
  - Hypothesis 1
- Backwards reposts search
  - Hypothesis 3: chain reposts deletion
- Backwards keyword search
  - Similar to hypothesis 3: related keywords deletion
  - 兲朝 37 人
- Monitoring specific users
  - Hypothesis 2
Discussion – Filtering Mechanisms
- Account closures
  - 300 user accounts closed
- Search filtering
- Public timeline filtering
- User credit points
  - Users can report sensitive or rumor-based posts to earn points
Discussion – Time-of-day Behavior
Discussion – Time-of-day Behavior (continued)
Conclusion
- Deletions happen most heavily in the first hour
- 90% of the deletions happen within the first 24 hours
- The 4 hypotheses